
Ah, that makes a lot of sense. Is this fine-grained task scheduling related to CUDA Dynamic Parallelism at all? If not, would you have a pointer on where to look?

I suppose I could look through the code of this project, but I'd hate to have to untangle that from the compiler infrastructure.



Think of it more as 'tasking by hand': you have one kernel driving the 18k+ cores, and you manually (or using device libraries) synchronize them at a fine grain, handle memory traffic asynchronously, and pipeline as much as you can.
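
A minimal sketch of that pattern, assuming a cooperative launch (the `Task` struct and the per-task work are illustrative placeholders, not from this project): one persistent kernel stays resident on the whole GPU and loops over tasks, using a device-side grid barrier instead of per-task kernel launches.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical work descriptor, for illustration only.
struct Task { const float* in; float* out; int n; };

__global__ void persistentKernel(const Task* tasks, int numTasks) {
    cg::grid_group grid = cg::this_grid();
    // Every thread in the grid stays resident and loops over tasks,
    // instead of the host launching one kernel per task.
    for (int t = 0; t < numTasks; ++t) {
        const Task& task = tasks[t];
        for (int i = grid.thread_rank(); i < task.n; i += grid.size())
            task.out[i] = task.in[i] * 2.0f;  // placeholder per-task work
        grid.sync();  // device-side barrier between tasks
    }
}
```

Note that `grid.sync()` is only valid when the kernel is launched via `cudaLaunchCooperativeKernel` (and the grid fits co-resident on the device), which is what makes this hand-rolled tasking possible without returning to the host.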

You might have a look at cooperative groups, things like cuda::pipeline in libcudacxx for asynchronous and pipelined memory traffic, plus most of the block/warp CUB primitives, and then move up to cuFFTDx, cuBLASDx, and now cuSolverDx as the starting toolbox for your fused-kernel journey.
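
As a rough sketch of the cuda::pipeline idea (tile size, buffer layout, and the compute step are my assumptions, not anything specific to this project): a per-thread pipeline double-buffers global-to-shared copies so the next tile's load overlaps the current tile's compute.

```cuda
#include <cuda/pipeline>

// Double-buffered loads: overlap global->shared copies with compute.
// Launch with 256 threads per block; tile size of 256 is illustrative.
__global__ void pipelined(const float* gmem, float* out, int numTiles) {
    __shared__ float buf[2][256];
    auto pipe = cuda::make_pipeline();  // per-thread pipeline

    // Prime the pipeline with the first tile.
    pipe.producer_acquire();
    cuda::memcpy_async(&buf[0][threadIdx.x],
                       &gmem[threadIdx.x], sizeof(float), pipe);
    pipe.producer_commit();

    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1, nxt = (t + 1) & 1;
        if (t + 1 < numTiles) {  // issue the next copy before waiting
            pipe.producer_acquire();
            cuda::memcpy_async(&buf[nxt][threadIdx.x],
                               &gmem[(t + 1) * 256 + threadIdx.x],
                               sizeof(float), pipe);
            pipe.producer_commit();
        }
        pipe.consumer_wait();  // wait only for the oldest in-flight copy
        out[t * 256 + threadIdx.x] = buf[cur][threadIdx.x] * 2.0f;
        pipe.consumer_release();
    }
}
```

Each thread here only touches its own shared-memory slot, so no block barrier is needed; a block-scope pipeline plus `__syncthreads()` would be the next step if threads had to share staged tiles.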



