All the computing units running in lockstep is what appears to the program. They...

Jasper_ · on Oct 3, 2021

This is not how Volta's "Independent Thread Scheduling" works under the hood, though I'm not at liberty to say what's actually going on (unfortunately, one of the upsetting things about doing GPU programming is how much "behind NDA" tribal knowledge there tends to be. If you have an NVIDIA devrel contact, ask for their "GPU Programming Guide", that should hopefully clear things up). So for now, you'll just have to trust me.

Suffice to say, the Volta "Independent Thread Scheduling" is only used for compute shaders, not for vertex and pixel shaders.

You're right to note that different warps can run with different program counters, and when communicating across warps, those need synchronization. That's what those barriers are necessary for (see GroupMemoryBarrier and friends in the high-level languages). However, the definition of dFdX chosen by these languages basically requires that all threads participating in the derivative be scheduled within the same warp. dFdX will never synchronize or wait on another warp.