almostgotcaught · on Nov 2, 2024

> Considering that each kernel / kernel size is usually custom tuned on NVIDIA, I'd say no. Working in this field at several different companies, there are likely thousands of hand-tuned variations of a simple GEMM kernel. Each one required an engineer to look at specifically, even if they're all variations on a common theme.

Lol that's absolutely not true. What you're describing is literally impossible for any company that has more than one product family on the market since each product has different scratch sizes, number of vector registers, data types supported/emulated etc.

Outside of trade show demos, kernels are codegened. What is true is there are recurring "themes/patterns" that are handled by engineers for a class of products. Lately this is flash attention...

> Again, it is trivial to write a correct opencl matrix multiply, but that's never going to be the highest performance.

I guess you work at AMD. The reason AMD ships a whole bunch of binary kernels is not because someone tuned/designed each one but because AMD doesn't have a PTX/SASS equivalent. So each kernel has to be compiled at build time for each device (it's also why they can't have LTS support for architectures).

anon291 · on Nov 2, 2024

> Outside of trade show demos, kernels are codegened. What is true is there are recurring "themes/patterns" that are handled by engineers for a class of products. Lately this is flash attention

I never said they weren't using code generation. I said that each one requires a manual tune. You will set various parameters, determine if the generated code does well enough and then if there's performance to squeeze out, you modify the code generator.

> I guess you work at AMD.

Close but not quite

almostgotcaught · on Nov 2, 2024

> that each one requires a manual tune

Ya definitely not - everyone uses grid search or whatever latest BPO tuning strategy.
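
For readers unfamiliar with the term, the grid search mentioned above can be sketched roughly as follows. This is an illustration only, not any vendor's tuner: `blocked_matmul`, `grid_search`, `time_config`, and the `GRID` values are hypothetical names and numbers chosen for the example, and a pure-Python tiled matmul stands in for a generated GPU kernel.

```python
import itertools
import time


def blocked_matmul(a, b, n, tile_m, tile_n, tile_k):
    """Tiled n x n matmul on lists of lists; the tile sizes are the tunables."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile_m):
        for j0 in range(0, n, tile_n):
            for k0 in range(0, n, tile_k):
                for i in range(i0, min(i0 + tile_m, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, min(k0 + tile_k, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(j0, min(j0 + tile_n, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def grid_search(grid, cost_fn):
    """Evaluate cost_fn on every point of the grid; return (best_cfg, best_cost)."""
    names = list(grid)
    best_cfg, best_cost = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        cfg = dict(zip(names, values))
        cost = cost_fn(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


def time_config(cfg, n=32, repeats=3):
    """Cost function: best-of-`repeats` wall time for one tile configuration."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        blocked_matmul(a, b, n, **cfg)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical search space; a real tuner would derive this per device.
GRID = {"tile_m": [4, 8, 16], "tile_n": [4, 8, 16], "tile_k": [4, 8, 16]}

if __name__ == "__main__":
    cfg, seconds = grid_search(GRID, time_config)
    print("fastest tile config:", cfg, "in", seconds, "s")
```

The search loop itself is written once and reused for every shape and device. The disagreement in this thread is over how much per-kernel engineering remains once such a loop exists.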

anon291 · on Nov 2, 2024

Oh right. Those require no engineering effort because I said so.

almostgotcaught · on Nov 2, 2024

They require one person or team to engineer and then a whole bunch of people to use...? That doesn't resemble in the least what you were describing, where each kernel is hand-tuned for each shape and device. But please do continue to insist you're still somehow right.

anon291 · on Nov 2, 2024

Sigh. I used to be an engineering manager for a kernels team. I think I know what I'm talking about. Yes, each kernel is paid individual attention to, even if many are basically the same and require little rework. It's a lot of work. Now I work as an IC in the same field. I don't need to insist I'm right because it's what I do all day.

almostgotcaught · on Nov 3, 2024

> I think I know what I'm talking about

I've refuted your claims point by point and all you can say is "trust me bro"? Cool but I think you do not in fact know what you're talking about.

lumost · on Nov 4, 2024

I work in DB kernels. Everything gets hand-tuned because there is an economic reason to hand-tune it. The expectation in many of these systems is that there are no wasted cycles. You can codegen a decent kernel, but then someone will find a better approach - do you want the slower version of the product, or the faster one?

You can see this in action with matrix math libraries; folks have been hand-tuning those for decades at this point.

anon291 · on Nov 3, 2024

Your claim is that there are automated methods (which I mentioned in my original post) to manage the complexity. My claim is that it requires a large team of engineers working on it. I'm not really sure what you think you've refuted.