
Thanks for the video and slides, super interesting!

One question: in one of the slides, you mention the Vulkan memory model (Vulkan 1.2 / SPIR-V 1.5). What exactly is this, and why do you think it is meaningful? Would you have any links to learn more? When I read the SPIR-V specification, it mentions that this memory model exists, but it does not put any effort into explaining why it's worthwhile or how it works at a high level.



That could be the subject of a whole other talk :)

The memory model is needed to do more sophisticated communication between parallel workgroups. The classic example is a queue, even a simple one such as a single-producer single-consumer ring buffer (for communication between a pair of workgroups). The way this works is that the producer fills an element in the buffer, then bumps an index. The consumer observes the index, then reads the element from the buffer.

Without a memory model, there's no way to guarantee that this ordering is preserved, and (to map this to hardware), the write of the index bump could flow through the caches, while the write of the element could still be "dirty" in the cache on the producer, so the consumer reads stale data.

What the memory model does is provide the programmer an explicit way to express ordering constraints. So the write of the index bump is with "release" semantics and the read of that by the consumer is with "acquire" semantics, which guarantees that producer writes before the release are visible to consumer reads after the acquire.
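To make the ordering concrete, here is a minimal sketch of that single-producer single-consumer ring buffer using C++'s std::atomic, which exposes the same acquire/release semantics the Vulkan memory model provides in SPIR-V. The queue type and its layout are my own illustration, not anything from the talk:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical SPSC ring buffer. The release store on `head` "publishes"
// the element write; the acquire load on `head` guarantees the consumer
// sees that element, never stale data.
template <typename T, std::size_t N>
struct SpscQueue {
    T buffer[N];
    std::atomic<std::size_t> head{0};  // written only by the producer
    std::atomic<std::size_t> tail{0};  // written only by the consumer

    bool push(const T& value) {
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == N) return false;  // full
        buffer[h % N] = value;                          // 1. fill the element
        head.store(h + 1, std::memory_order_release);   // 2. bump the index (release)
        return true;
    }

    bool pop(T& out) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (head.load(std::memory_order_acquire) == t) return false;  // empty
        out = buffer[t % N];  // safe: the acquire synchronized with the release
        tail.store(t + 1, std::memory_order_release);   // free the slot
        return true;
    }
};
```

The key point is that only the index is atomic; the element itself is an ordinary write, and the release/acquire pair on the index is what makes it visible.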

This unlocks a whole range of sophisticated concurrent data structures, such as concurrent hash maps, fancier memory allocation (even without a memory model, a simple allocate-only bump allocator is feasible), and so on.
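As an aside on why the bump allocator works even without a full memory model, here is a sketch (again in C++, and again my own illustration): allocation is a single atomic increment, and each thread only ever touches the region it was handed, so no cross-thread ordering of data writes is required.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical allocate-only bump allocator over a fixed-size arena.
// A relaxed fetch_add suffices: threads never read each other's
// allocations, so no acquire/release ordering is needed.
struct BumpAllocator {
    std::atomic<std::size_t> offset{0};
    std::size_t capacity;

    explicit BumpAllocator(std::size_t cap) : capacity(cap) {}

    // Returns the start offset of an n-byte allocation,
    // or SIZE_MAX if the arena is exhausted.
    std::size_t alloc(std::size_t n) {
        std::size_t start = offset.fetch_add(n, std::memory_order_relaxed);
        if (start + n > capacity) return SIZE_MAX;  // exhausted (offset stays bumped)
        return start;
    }
};
```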

Thanks for the question and kind words, I hope this illuminates.


Thanks, an answer with some examples was exactly what I was hoping for! Thank you for taking the time to respond.


Perhaps in a few years, when we start to see things like CXL providing cache coherency between main memory and GPU memory, this will become even more important?



