Lockstep execution across all the compute units is only what the program appears to see. In reality they don't have to run in lockstep on every single instruction. SIMT operates on a warp of up to 32 threads [1], and different warps working on different cell regions can run with different program counters.
Also, the newer Volta architecture (section 3.2 in [1]) allows independent scheduling of threads across warps, so that each thread can have its own program counter and stack.
There is certainly synchronization. See how the barrier instruction (9.7.12.1 in [1]) is used to synchronize threads.
This is not how Volta's "Independent Thread Scheduling" works under the hood, though I'm not at liberty to say what's actually going on. (Unfortunately, one of the upsetting things about GPU programming is how much "behind NDA" tribal knowledge there tends to be. If you have an NVIDIA devrel contact, ask for their "GPU Programming Guide"; that should hopefully clear things up.) So for now, you'll just have to trust me.
Suffice it to say, Volta's "Independent Thread Scheduling" is only used for compute shaders, not for vertex and pixel shaders.
You're right to note that different warps can run with different program counters, and that communication across warps needs synchronization. That's what those barriers are for (see GroupMemoryBarrier and friends in the high-level languages). However, the definition of dFdX chosen by these languages basically requires that all threads participating in the derivative be scheduled within the same warp. dFdX will never synchronize with or wait on another warp.
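To make the dFdX point a bit more concrete, here's a rough CPU-side C++ model of it. This is not real shader or driver code: the names (Warp, quad_ddx, quad_ddy) are made up for illustration, and the lane-to-pixel layout inside a quad is an assumption, since actual hardware may order lanes differently. The only thing it's meant to show is that each lane's derivative is a difference of values held by other lanes of the same warp, so no cross-warp barrier is involved.

```cpp
#include <array>
#include <cassert>

// Hypothetical model: one warp holds 32 lanes, and every 4 consecutive
// lanes form a 2x2 pixel quad laid out as
//   lane 0: (x, y)     lane 1: (x+1, y)
//   lane 2: (x, y+1)   lane 3: (x+1, y+1)
// This layout is an assumption made for the sketch.
constexpr int kWarpSize = 32;
using Warp = std::array<float, kWarpSize>;

// Coarse horizontal derivative: right pixel minus left pixel in the same
// quad row. Both operands live in lanes of the same warp.
Warp quad_ddx(const Warp& v) {
    Warp out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const int left = lane & ~1;   // clear bit 0: left pixel of the row
        const int right = lane | 1;   // set bit 0: right pixel of the row
        out[lane] = v[right] - v[left];
    }
    return out;
}

// Coarse vertical derivative: bottom pixel minus top pixel in the same
// quad column, again using only lanes of the same warp.
Warp quad_ddy(const Warp& v) {
    Warp out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const int top = lane & ~2;    // clear bit 1: top pixel of the column
        const int bottom = lane | 2;  // set bit 1: bottom pixel of the column
        out[lane] = v[bottom] - v[top];
    }
    return out;
}
```

Since every read stays within the one Warp array, the model has nothing to wait on from another warp, which is the property described above.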
[1] https://docs.nvidia.com/cuda/parallel-thread-execution/index...