
Fair enough. I should have clarified that “approximately the cost of a single kernel launch” is pretty much what I meant by “tiny”.

What I was getting at was that a “megakernel” and a captured graph should have similar launch costs.



It's not so much kernel launch overhead as memory traffic between global memory and L2 cache / shared memory that you (or at least I) target with a fused-kernel approach. Kernel launch overhead can indeed be drastically reduced with CUDA graphs.

I'm not sure it applies so well in LLMs though (should read the paper...).
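To put rough numbers on the memory-traffic point, here is a back-of-the-envelope sketch in Python. It assumes a chain of elementwise ops on one tensor, that every unfused kernel reads its input from and writes its output to global memory, and that a fused kernel keeps intermediates in registers or shared memory. The function name and parameters are made up for illustration, and the model ignores L2 hits, so real savings will differ.

```python
def global_traffic_bytes(num_elements, num_ops, bytes_per_element=4, fused=False):
    """Estimate bytes moved to/from global memory for a chain of elementwise ops.

    Hypothetical model: no cache reuse between kernels is assumed.
    """
    tensor_bytes = num_elements * bytes_per_element
    if fused:
        # One read of the input and one write of the final output;
        # intermediates never leave registers / shared memory.
        return 2 * tensor_bytes
    # Each op is its own kernel: one read plus one write of the full tensor.
    return 2 * tensor_bytes * num_ops


n = 1 << 20  # 1M float32 elements
unfused = global_traffic_bytes(n, num_ops=3)
fused = global_traffic_bytes(n, num_ops=3, fused=True)
print(unfused // fused)  # 3
```

Under these assumptions the saving scales with the number of ops fused, which is independent of whatever a CUDA graph saves on launches.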


You are right that CUDA graphs can help reduce launch overhead, but they do not support overlapping computation and communication across layers, since data dependencies are described at the kernel level.
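A toy model of why dependency granularity matters, sketched in Python. Everything here is an assumption for illustration: each layer is one compute step followed by one communication step, each split into equal tiles; there is one compute resource and one communication resource; and with tile-level dependencies, tile t of a layer needs only tile t of the previous step. It is not how any particular runtime schedules work.

```python
def makespan(num_layers, tiles, compute_per_tile, comm_per_tile, tile_level_deps):
    """Finish time of the last communication tile under the toy model above."""
    compute_free = 0.0  # when the compute resource is next idle
    comm_free = 0.0     # when the communication resource is next idle
    comm_done = [0.0] * tiles  # per-tile finish time of the previous layer's comm

    for _ in range(num_layers):
        if tile_level_deps:
            ready = list(comm_done)
        else:
            # Kernel-level dependency: wait for the whole previous comm kernel.
            ready = [max(comm_done)] * tiles

        compute_done = []
        for t in range(tiles):
            start = max(compute_free, ready[t])
            compute_free = start + compute_per_tile
            compute_done.append(compute_free)

        if not tile_level_deps:
            # Kernel-level dependency: comm waits for the whole compute kernel.
            compute_done = [max(compute_done)] * tiles

        for t in range(tiles):
            start = max(comm_free, compute_done[t])
            comm_free = start + comm_per_tile
            comm_done[t] = comm_free

    return max(comm_done)


print(makespan(2, 4, 1, 1, tile_level_deps=False))  # 16.0
print(makespan(2, 4, 1, 1, tile_level_deps=True))   # 9.0
```

With kernel-level dependencies the two resources take turns, so the total is the sum of all compute and communication time. With tile-level dependencies, communication for early tiles runs while later tiles are still computing, and the next layer can start on tiles whose data has already arrived.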



