subgroupExclusiveAdd is part of the subgroup extension, and expected to be imple...

fluffything · on April 19, 2020

Which devices do implement this in hardware ?

None of the NVIDIA compute devices do, right? Do AMD GPUs implement this ?

CPUs from Intel, ARM, RISCV, MIPS, Power support horizontal SIMD add instructions that perform a tree-reduction.

I'm skeptical of GPUs being able to actually implement this on hardware. A CPU puts the whole vector in a single 512-bit register, but in a GPU, even for a single half-warp, the operation cannot take 16x64bit registers. The operation would need to be performed on memory, and move the content to the different registers, do warp-local shuffles for the reduction, etc. So we would be talking about a quite big macro-op here. I wouldn't see the point of doing this in hardware, when the compiler or driver could just call a library function that does this for you.

But then we are back to the issue that, if you can't write those library functions in pure Vulkan, you are quite limited w.r.t what one can do.

There are many prefix-sum like algorithms, like, e.g., a weighted prefix-sum (prefix_sum(x * w)) [0]. In a GPU, you don't want to do the vector multiply first followed by the prefix-sum, but do it all at once. You can probably work around this in Vulkan by doing the vector multiply in shared memory, and then the prefix sum there (while in CUDA you wouldn't need shared memory for that at all), but that won't really work if the reduction is more complicated than just a sum.

[0] Convolved prefix sums happen a lot in signal processing with linear-recurrence relations, e.g., in IIR filters, e.g. see [1] https://dl.acm.org/doi/10.1145/3229631.3229649 and [2] https://dl.acm.org/doi/10.1145/3296957.3173168. The weighted prefix sum example is the simplest possible instance of those.

raphlinus · on April 19, 2020

Right, I haven't looked at PTX or Intel Gen ASM, but on AMD this compiles to a lg(n) series of add instructions each with a warp local shuffle (v_add_nc_u32 with a row_shr: modifier). It's not going to shared memory. I've experimented a bit with hand-rolling scans (using subgroupShuffleUp) and get similar but not identical asm (that intrinsic gives you a wave_ror: modifier). So I think you could write this using even lower-level intrinsics and also get good performance.

I strongly suspect Intel and NVidia are similar here, but haven't experimented.

[1]: http://shader-playground.timjones.io/4d7402168290f8922da5df3...

raphlinus · on April 19, 2020

There's actually a lot more to say here, but it really gets into the weeds. To me, the fundamental question is: can you write Vulkan 1.2 compute shaders that exploit most of the performance potential of GPU hardware in a portable way? I believe the answer is yes, but it requires validation. It is, however, not a good approach if the goal is to squeeze out the last 5% or 10% of performance.

fluffything · on April 20, 2020

That makes sense. Given that one can write good compute shaders with Vulkan, the next obvious question worth answering to me is whether one should do that.

If I had to write a new foundational library, e.g., something like CUTLASS or cuFFT for device code, I won't start with code that fully uses a device, but I'd like to be able to do that if I want to.

If I need to reach for CUDA to do that, then Vulkan, WebGPU, etc. will need to have great FFI with CUDA C++. Nobody wants to have to "drop up" to a higher-level CUDA C++ API like CUTLASS, but being forced to do so by "dropping down" that API to a C FFI wrapper.

It would be like having Rust, but without any unsafe code support except for C FFI, C actually not existing, and having to drop "down" to C++ for everything, and then numbing down the C++ to C FFI for interoperability. You might as well just stick with C++ and spare you the pain.

I hope Vulkan and WebGPU get enough extensions so that foundational libraries can be written in those languages.

raphlinus · on April 21, 2020

Thanks for your continued discussion!

I wrote some more of my thoughts on the xi Zulip #gpu channel (login required, open to everybody with Github): https://xi.zulipchat.com/#narrow/stream/197075-gpu/topic/Pre...

fluffything · on April 25, 2020

@Raph I don't have Zulip, but you have motivated me to play a lot the last weekends with Vulkan compute, and I am a bit disappointed.

I think I was expecting to have a similar level of control as with CUDA or OpenCL, but it turns out that wgpu/Vulkan compute is more like "OpenGL compute shaders", which is "compute for a rendering pipeline", something quite different than "CUDA/OpenCL compute", which has nothing to do with rendering.

If all you want is to stick a compute shader in your rendering pipeline, then wgpu/vulkan are great IMO.

If you want to do CFD, linear algebra, or machine learning, then it feels like fighting against a technology that just wasn't meant to be used for that. I managed to write a tiny explicit CFD code with Vulkan compute just fine, but the moment I wanted to implement a slightly more complicated algorithm, I was limited by things like how to properly do linear algebra (e.g. I needed a algebraic multi-grid preconditioner and a conjugate gradients implementation, and ended up fighting against how to control what's get put into shared memory when). When trying some NN kernels, I got stuck with the same issue. I managed to implement an FFT and convolutions with Vulkan, but... I couldn't manage to get convolutions achieve the same perf as with CUDA cause with Vulkan I had to use a "pull" algorithm (read from all points in the window, write to one point), but that just performed way worse than the "push" algorithms that are possible to implement with CUDA (maintain a shared memory cache where you write, and as you read one value from the window, write its contribution to multiple locations, then do a full cache write to global memory). In numbers, I went from ~150 Gb/s throughput with Vulkan to ~700Gb/s with CUDA.

I think maybe Vulkan/SPIRV could benefit from better compilers that take the "pull" algorithm and compile it to a "push" one using shared memory, but you are then at the mercy of the compiler for a 4x perf diff.

I think that if I have a rendering pipeline in which I want to integrate some compute, then wgpu/Vulkan are great, particularly because chances are that the compute you need is per "vertex/pixel/..." kind of naive-massively-parallel kind of compute. But if I just have a raw compute pipeline, then it feels like using the wrong tool for the job.