
Any idea if there's a way to run this on 256 GB RAM + 16 GB VRAM with usable performance, even if barely?


Yes! 3-bit, and maybe even 4-bit, can fit! llama.cpp has MoE offloading, so your GPU holds the active experts and the non-MoE layers; that way you only need 16 GB to 24 GB of VRAM. I wrote about how to do this in this section: https://docs.unsloth.ai/basics/qwen3-coder#improving-generat...
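
For anyone landing here, a minimal sketch of what that setup looks like, assuming a Qwen3-Coder GGUF and a recent llama.cpp build (the filename and context size below are illustrative, not prescriptive):

    # Keep the MoE expert tensors in system RAM; everything else goes to the GPU.
    # -ngl 99 offloads all layers it can; -ot / --override-tensor then routes any
    # tensor matching the regex (the expert FFN weights) back to CPU.
    ./llama-cli -m Qwen3-Coder-480B-A35B-Instruct-UD-Q3_K_XL.gguf \
      -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 16384

With 256 GB of system RAM the 3-bit/4-bit expert weights should fit comfortably; generation speed is then bounded mostly by CPU memory bandwidth rather than the GPU.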


Awesome documentation, I'll try this. Thank you!



