Agreed. The real value proposition of Apple Silicon for local inference is running models that won't fit on consumer GPUs. I run Qwen 70B 4-bit on an M2 Max 96GB through llama.cpp and it's usable — not fast, but the unified memory means it actually loads. Would be interested to see MetalRT benchmarks at that scale, since the architectural advantages (fused kernels, reduced dispatch overhead) should matter more as models get memory-bandwidth-bound.
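For reference, a minimal sketch of roughly what that setup looks like through llama-cpp-python, assuming a Metal-enabled build and a local Q4 GGUF file (the model path and filename here are placeholders, not the commenter's actual config):

    # Sketch: load a Q4-quantized 70B model fully into unified memory
    # via llama-cpp-python. Requires a Metal-enabled build of llama.cpp.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen-70b-q4_k_m.gguf",  # hypothetical GGUF filename
        n_gpu_layers=-1,  # offload every layer to the GPU via Metal
        n_ctx=4096,       # context window; weights + KV cache fit in 96GB
    )

    out = llm("Explain unified memory in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

The n_gpu_layers=-1 part is the point: on unified memory there's no VRAM ceiling forcing partial offload, so the whole 70B-class model stays GPU-resident.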

