It's not so much kernel launch overhead as memory traffic between global memory and the L2 cache / shared memory that you (or at least I) target with a fused-kernel approach. Kernel launch overhead can indeed be drastically reduced with CUDA graphs.
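To illustrate the memory-traffic point, here's a minimal sketch (kernel names and ops are made up for illustration): two separate elementwise kernels force a round trip through global memory between them, while the fused version keeps the intermediate value in a register.

```cuda
// Unfused: the intermediate result is written to and re-read from global memory.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];            // write back to global memory
}
__global__ void add_bias(float* x, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] + b;            // read the same data again
}

// Fused: one global read, one global write; the intermediate stays in a register.
__global__ void scale_add_bias(float* x, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i] + b;
}
```

A graph replay of the two unfused kernels removes the CPU-side launch cost but not the extra global-memory round trip; only fusion removes that.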
I'm not sure how well it applies to LLMs, though (I should read the paper...).
You are right that CUDA graphs can help reduce launch overhead, but they don't support overlapping computation and communication across layers, since data dependencies are described at the kernel level.
What I was getting at was that a “megakernel” and a captured graph should have similar launch costs.
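For concreteness, here is a rough sketch of what "captured graph" means here, using the CUDA runtime's stream-capture API (`forward_layer`, `acts`, `weights`, and the launch configuration are all hypothetical; error checking omitted):

```cuda
// Capture a per-step launch sequence once, then replay it with one launch.
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
for (int layer = 0; layer < num_layers; ++layer)
    forward_layer<<<grid, block, 0, stream>>>(acts, weights[layer]); // hypothetical kernel
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// One host-side launch replays the whole captured sequence, which is
// why its launch cost is comparable to a single megakernel's.
cudaGraphLaunch(graphExec, stream);
```

The per-kernel dependencies baked into the graph are still kernel-granularity, which is the limitation mentioned above about cross-layer compute/communication overlap.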