It's not so much kernel launch overhead as memory traffic between global memory and the L2 cache / shared memory that you (or at least I) target with a fused-kernel approach. Kernel launch overhead can indeed be drastically reduced with CUDA graphs.
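To illustrate the memory-traffic point, here's a minimal sketch (kernel names and ops are made up for illustration): two separate elementwise kernels force a round trip through global memory between them, while the fused version keeps the intermediate value in a register.

```cuda
// Unfused: the intermediate result is written to and re-read from global memory.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];            // write back to global memory
}
__global__ void add_bias(float* x, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] + b;            // read the same data again
}

// Fused: one global read, one global write; the intermediate stays in a register.
__global__ void scale_add_bias(float* x, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i] + b;
}
```

A graph replay of the two unfused kernels removes the CPU-side launch cost but not the extra global-memory round trip; only fusion removes that.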
I'm not sure how well it applies to LLMs, though (I should read the paper...).
You are right that CUDA graphs can help reduce launch overhead, but they don't support overlapping computation and communication across layers, since data dependencies are described at the kernel level.
What I was getting at was that a “megakernel” and a captured graph should have similar launch costs.
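For concreteness, here is a rough sketch of what "captured graph" means here, using the CUDA runtime's stream-capture API (`forward_layer`, `acts`, `weights`, and the launch configuration are all hypothetical; error checking omitted):

```cuda
// Capture a per-step launch sequence once, then replay it with one launch.
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
for (int layer = 0; layer < num_layers; ++layer)
    forward_layer<<<grid, block, 0, stream>>>(acts, weights[layer]); // hypothetical kernel
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// One host-side launch replays the whole captured sequence, which is
// why its launch cost is comparable to a single megakernel's.
cudaGraphLaunch(graphExec, stream);
```

The per-kernel dependencies baked into the graph are still kernel-granularity, which is the limitation mentioned above about cross-layer compute/communication overlap.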