
The memory bandwidth on the M4 Max is 546 GB/s and on the M5 Max is 614 GB/s, so not a huge jump.

The new tensor cores, sorry, "Neural Accelerators", only really help with prompt preprocessing (aka prefill), not with token generation. Token generation is memory bound.
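Since decode is memory bound, a rough ceiling on generation speed is bandwidth divided by model size (every generated token has to stream all the weights from memory). A back-of-envelope sketch, with the model size being an illustrative assumption:

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on token generation rate for a memory-bound LLM:
    each token requires reading all weights once from memory."""
    return bandwidth_gb_s / model_gb

# Hypothetical 70B-parameter model at 4-bit quantization (~40 GB of weights):
m4_max = decode_tokens_per_sec(546, 40)
m5_max = decode_tokens_per_sec(614, 40)
print(f"M4 Max ceiling: {m4_max:.1f} tok/s")
print(f"M5 Max ceiling: {m5_max:.1f} tok/s")
print(f"Speedup: {(m5_max / m4_max - 1) * 100:.0f}%")
```

The speedup tracks the bandwidth ratio (614/546, about 12%) regardless of model size, which is why the bandwidth number matters more than the new matmul hardware for this workload.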

Hopefully the Ultra version (if it exists) has a bigger jump in memory bandwidth and maximum RAM.




Do any frameworks manage to use the Neural Engine cores for that?

I thought most stuff ends up running through Metal on the GPU.


It's referring to the neural cores (for matrix multiplication) in the GPU itself, not the NPU.

https://creativestrategies.com/research/m5-apple-silicon-its...



I noticed that even on my M3, MLX tends to do prefill a lot faster than llama.cpp with GGML models. Does anyone know how they do it?


