Interestingly, a lot of programmers (who program for the CPU) worry about caches...

fluffything · on April 23, 2020

People that worry about caches for CPUs, can't control them directly, so they need to instead, e.g., "block"/"tile" their loops in such a way that their caches are used efficiently, which is hard.

On different CPUs (e.g. with different cache sizes), these loops need to be tiled differently. If you want a single binary to perform well across the board, it just need to support using the different tile-sizes depending on the hardware.

Usually, this is however not enough. For example, you have some data in memory (a vector of f16s on an intel CPU), and for operating on them, you need to decompress that first into a vector of f32s.

You probably want to decompress to fill the cache, operate, recompress, to save memory and memory bandwidth. For that you need a "scratchpad" (or __shared__ memory in CUDA), e.g., for the different levels of the cache hierarchy.

The compiler needs to know, for the different architectures, what their L1/L2/L3/shared/constant/texture/.... memory sizes are, and either fill these sizes for you to use the whole cache, or let the user pick them, making the program fail if run on hardware where this isn't correct. NVCC (and pretty much every production compiler) can bundle code for different architectures within a single binary, and pick the best at run-time.

So if your L3 can be 4,8,16, or 32 Mb, your compiler can bundle 4 copies of your kernel, query the CPU cache size at initialization, and be done with it.

The way in which you abstract the memory hierarchy is by just partitioning your problem's memory.

If you have a f16s vector in global memory, you might want to partition it into N different chunks, e.g., recursively, until if a chunk where to be decompressed into f32s, that decompressed chunk would fit into a faster memory (e.g. L3). At that point you might want to do the decompression, and continue partitioning the f32s up to some threshold (e.g. the L1), on which you operate.

That is, a kernel has multiple pieces:

- a state initialization (e.g. owns the global memory)

- partitioning: how to partition that into smaller subproblems

- the merge: what gets computed at each level of the partitioning (e.g. how are the results of the partitions merged together to compute the final result)

- the leaf computation: what gets computed at the "finest" level

The programmer doesn't know a priori how many partitions would be necessary, that would depend on the memory hierarchy. But it does know everything else.

For example, for a sum reduction:

- the programmer knows how to perform a leaf computation: by summing a whole leaf partition and putting the result somewhere in the next larger memory level.

  - a CUDA or CPU thread can sum  N elements in parallel using SIMD, and put the result in inter-warp shared memory or the L1

- the programmer knows how to merge computations: by summing the results of the current level, and putting them in the next level of the hierarchy (inter-warp -> inter-block -> inter-grid, or L1 -> L2 -> L3 -> RAM)

  - in a GPU / CPU, depending on the actual level, the compiler can use different operations (e.g. SIMD shuffles, inter warp shuffles with warp-level synchronization, inter block shuffles with block-level synchronization, local sum + atomic memory operation to write the result, etc.)

- the programmer knows how to partition an input vector into N pieces (N CUDA threads, N warps, N blocks, N grids, N cpu threads, N numa domains, etc.)

A good programming model needs to allow the user to express what they know, and abstract away what they cannot know (how big the caches are, how many cache levels, how many threads of execution, etc.)

9q9 · on April 23, 2020

Thanks. Amazing overview.

   abstract away what they 
   cannot know

Here I'm a bit surprised. For theoretical reasons (I've never gotten my hands dirty with GPU implementations, I'm afraid to admit), I would have expected that it is highly useful to have a language interface that allows run-time (or at least compile-time) reflection on:

- Number of levels (e.g. how many layers of caches).

- Size of levels.

- Cost of memory access at each level.

- Cost of moving data from a level to the next level up/down.

The last 3 numbers don't have to be absolute, I imagine, but can be relative, e.g.: size(Level3) = 32 * size(Level2). This data would be useful to decide how to partition compute jobs as you describe, and do so in a way that is (somewhat) portable. There are all manner of subtle issues, e.g. what counts as cost of memory access and data movements (average, worst case, single byte, DMA ...), and what the compiler and/or runtime should do (if anything) if they are are violated. In abstract terms: what is the semantics of the language representation of the memory hierarchy. Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels. An obvious choice is to treat each level as an array, and have DMA-like send/receives to move blocks of data between levels. Is that a good idea?

Equally subtle, and I already alluded to this above, is when to make this information available. Since the processor doesn't change during computing, I imagine that using a multi-stage meta-programming setup (see e.g. [1] for a rationale), might be the right framework: you have a meta-program, specialising the program doing the compute you are interested in. C++ use template for program specialisation, but C++'s interface for meta-programming is not easy to use. It's possible to do much better.

As you wrote above, programming "in those languages is hard and error prone", and the purpose of language primitives is to catch errors early. What errors would a compiler / typing system for such a language catch, ideally without impeding performance?

[1] Z. DeVito, J. Hegarty, A. Aiken, P. Hanrahan, J. Vitek, Terra: A Multi-Stage Language for High-Performance Computing. https://cs.stanford.edu/~zdevito/pldi071-devito.pdf

fluffything · on April 25, 2020

> would have expected that it is highly useful to have a language interface that allows run-time (or at least compile-time) reflection

That's an interesting thought. I'm not sure I agree, maybe?

The user job is to express how to partition the problem. To do that properly, they need to know, e.g., "how many bytes can I fit in the next partition per unit-of-compute", so the langue/run-time has to tell them that.

I don't know if knowing the costs of memory access at each level or across levels is useful. It is a reasonable assumption that, at the next level, memory accesses are at least one order of magnitude faster, and that synchronizing across the hierarchy is very expensive and should be minimized. That's a good assumption to write your programs with, and knowing actual costs do not really help you there.

> Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels. An obvious choice is to treat each level as an array

I think CUDA's __shared__ memory is quite good. E.g. per kernel, a way to obtain memory in the level of the hierarchy that the kernel is currently working on. Nobody has extended CUDA to multiple levels because there hasn't been a need, but I expect these programs to be "recursive" in the sense that they recursively partition a problem, and for __shared__ memory on each recursion level to give you memory at a different level of the hierarchy.

To move memory across levels of the hierarchy, you would just use raw pointers, with appropriate reads/writes (e.g. atomic ones). The exact instructions that get emitted would then depend on which level of the hierarchy you are. For that the compiler needs to know something about the architecture it is compiling for, but that seems unavoidable.

> [1] Z. DeVito, J. Hegarty, A. Aiken, P. Hanrahan, J. Vitek, Terra: A Multi-Stage Language for High-Performance Computing. https://cs.stanford.edu/~zdevito/pldi071-devito.pdf

Thanks, I actually hadn't read the Terra paper. I'll try to get to it this weekend. I think that using a Lua-like language as the meta-programming for your language (e.g. Regent) is an interesting approach, but it is not necessary.

For example, Scheme, Lisp, Haskell, D and Rust have shown that you can do meta-programming quite well in the same language you do normal programming in, without having to learn a completely different "meta-language".

I particularly like Rust procedural macros. It is normal run-time Rust code that just get first compiled, and then executed (with your source code as input) during compilation, with some twist to make that quite fast (e.g. the compiler parses an AST, and the proc macros do AST->AST folds).

If one wants to delay that until run-time, one should just do so (e.g. you just embed your compiler in your app, and when your proc macros run, its "run-time"). No need to layer multiple languages, but maybe the Terra paper convices me otherwise.

9q9 · on April 25, 2020

   Terra paper

I didn't mean to push Terra in particular, and I agree that doing meta-programming is easier in the same language you do normal programming in. I just mentioned the Terra paper since it provided a rationale for using meta-programming in high-performance programming (I could have pointed to others making the same argument).

Rust's procedural macros are standard compile-time meta programming.

   proc macros do AST->AST folds)

Last time I looked (I have not used Rust for a while) procedural macros operate over tokens, not ASTs. Maybe that changed?

steveklabnik · on April 25, 2020

It hasn’t changed; we didn’t want to freeze the AST forever, so we went with the token approach. There are libraries to form an AST though.