
> Would you be able to list them?

scan (e.g. prefix sum), merge, partition, reduce (e.g. minimum), tree contraction, sort
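As an illustration of how such a primitive sits right on top of the hardware, a warp-level inclusive scan is typically built directly on the shuffle instruction. A minimal sketch in CUDA C++, assuming a full 32-lane warp (the function name is illustrative):

```cuda
// Kogge-Stone inclusive prefix sum across the 32 lanes of a warp,
// built on the __shfl_up_sync hardware primitive.
__device__ int warp_inclusive_scan(int v) {
    for (int offset = 1; offset < 32; offset *= 2) {
        // Each lane reads the value held by the lane `offset` below it.
        int below = __shfl_up_sync(0xffffffff, v, offset);
        // Lanes that have a lane `offset` below them accumulate it;
        // after log2(32) = 5 steps, lane i holds the sum of lanes 0..i.
        if ((threadIdx.x & 31) >= offset) v += below;
    }
    return v;
}
```

A language that exposes only a packaged "scan" but not the shuffle-level building blocks makes it impossible to write variants like this yourself.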

> what primitives do you recommend a language should provide that's currently missing in the languages you evaluate?

Hardware primitives and powerful capabilities for defining safe abstractions over those.

Chances are that, e.g., I need to implement my own "primitive-like" operation (e.g. my own sort or scan). If your language provides a "partition" primitive, but not the tools to implement my own, I can't use your language.

When people create languages for GPUs, for some reason they add the C++ STL or similar as primitives, instead of providing users the tools to write such libraries themselves. The consequence is that those languages end up not being practical to use, and nobody uses them.



Thanks.

How to abstract "hardware primitives" in a way that can be instantiated on many GPU architectures without a performance penalty, while remaining useful for higher-level programming, is not so clear. How would you, to take an example from the CPU world, fruitfully abstract the CPU's memory model? As far as I'm aware, that is not a solved problem as of April 2020, and papers on the subject are still appearing in top conferences.


There exists a language that can optimally target AMD's ROCm, NVIDIA's PTX, and multicore x86-64 CPUs: CUDA C++. It has three production-quality compilers for these targets (nvcc, hip, and pgi).

Unfortunately, this language is from 2007, and the state of the art has not really been improved since then (except for the evolution of CUDA proper).

It would be cool for people to work on improving the state of the art, providing languages that are simpler, perform better, or are more high-level than CUDA, without sacrificing anything (i.e. true improvements).

There have been some notable experiments in this regard. For example, Sequoia, targeting Roadrunner's scratchpads, had very interesting abstraction capabilities over the memory hierarchy that could have led to a net improvement over CUDA __shared__ memory.

Most newer languages have the right goal of trying to simplify CUDA, but they end up picking trade-offs that only allow them to do so by sacrificing a lot of performance. That's not an interesting proposition for most CUDA developers: the reason they pick up CUDA is performance. Sacrificing ~5% might be acceptable if the productivity gains are there, but a 20-30% perf loss isn't - too much money is involved.

One can simplify CUDA while retaining performance by restricting a language to GPUs and a particular domain, e.g., computer graphics / gfx shaders or even stencil codes or map-reduce, etc.

However, a lot of the widely advertised languages try to (1) target both GPUs and CPUs, compromising on some lowest common denominator of features, (2) support general-purpose compute kernels, and (3) significantly simplify CUDA, often by just removing the explicit memory transfers, which compromises performance. These languages are quite neat, but they aren't really practical, because they aren't true improvements over CUDA.

Most companies I work with that use CUDA today are using the state of the art (C++17 or 20 extensions), experimental libraries, drivers, etc. So it isn't hard to sell them on a new technology, _if it is better than what they are already using_.


Thanks. [1] contains an interesting discussion of Sequoia, and why it was considered a failed experiment, leading to Legion [2], its successor.

[1] https://news.ycombinator.com/item?id=18009581

[2] https://legion.stanford.edu/


Thanks for the link to the HN discussion.

The Regent language (Legion is Regent's runtime) is yet another async task-graph-based PGAS language/run-time, similar to, e.g., HPX, but leaving out what in my opinion made Sequoia interesting (e.g. the memory hierarchy abstraction capabilities).


As far as I understand, the idea in Regent is that the memory hierarchy is used dynamically, since the programmer cannot know much about the memory hierarchy at program-writing time anyway. So dynamic scheduling is used to construct the memory hierarchy and schedule at run time. The typical Regent users are physicists writing e.g. weather or particle-physics simulations; they are not aware of details like L1/L2/L3 cache, memory, network, or cluster sizes.

This is probably quite different from your use case.


Those applications do linear algebra, and pretty much any kind of linear algebra on the GPU, including simple vector-vector dot-products, requires a very careful usage of the memory hierarchy.

For doing a dot-product on the GPU, you need to take your vectors in global memory, and:

- split them into chunks that will be processed by thread blocks

- allocate shared memory for storing partial reductions from warps within a thread-block

- decide how many elements a thread operates on, and allocate enough registers within a thread

- do thread-block reductions on shared memory

- communicate thread-block reduction to global memory

- do a final inter-thread block reduction
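The steps above can be sketched as a single CUDA kernel. This is a minimal illustrative version, assuming a block size of 256 and using one atomic per block for the final inter-block reduction (a second reduction kernel is the other common choice); the kernel name is made up:

```cuda
__global__ void dot_kernel(const float* x, const float* y, float* out, int n) {
    __shared__ float partial[256];   // shared memory for the block's partial sums

    // Split the vectors into chunks: a grid-stride loop, with each thread
    // accumulating its elements in a register.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        acc += x[i] * y[i];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory within the thread block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // Communicate the block's result to global memory; the atomics across
    // blocks perform the final inter-block reduction (*out must start at 0).
    if (threadIdx.x == 0) atomicAdd(out, partial[0]);
}
```

Every one of the six decisions (chunking, shared-memory size, registers per thread, block reduction, global communication, final reduction) is explicit here, which is exactly the control a higher-level language has to preserve.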

A sufficiently smart compiler can take:

   sum = 0.
   for xi, yi in zip(x, y): sum += xi * yi
and transform it into an algorithm that does all of the above (e.g. a call to CUB).
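For comparison, this is roughly what the library route looks like today when written by hand. A hedged sketch using Thrust (CUB's higher-level sibling), which dispatches to a tuned reduction under the hood; the wrapper function name is illustrative:

```cuda
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

// Dot product via a single library call; Thrust picks the block/grid
// configuration and performs the staged reduction internally.
float dot(const thrust::device_vector<float>& x,
          const thrust::device_vector<float>& y) {
    return thrust::inner_product(x.begin(), x.end(), y.begin(), 0.0f);
}
```

The point stands either way: someone has to be able to write the underlying kernel in the language, or the library can't exist in it.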

But if you are not able to implement the above in the language, it suffices for the user to run into the need to do so once (e.g. your compiler does not optimize their slightly different reduction efficiently) for the value proposition of your language to take a huge hit: if I need to learn CUDA anyway, I might as well use CUDA from the start; and if I didn't need performance, I wouldn't be using a super-expensive GPU in the first place.

This is IMO why CUDA is successful and pretty much all other languages are not, maybe with the exception of OpenACC, which has great CUDA interop (so you can start with OpenACC and inline CUDA for a single kernel if you need to).


The six requirements you list for doing a dot product on the GPU can be phrased in the abstract as a constraint-solving problem where the number of thread blocks, the cost of communication, etc. are parameters.



