
> small fp8 local model with almost 100k token context

It wouldn't fit Qwen3.5 27B, would it? That's the SOTA.



This is an fp16 model, so that's 54 GB in weights. I can only load it with fp8 quantization enabled (>= 128k context). I do run into this error during generation, though: https://github.com/vllm-project/vllm/issues/36350. It looks like an issue with the flash attention backend. But yeah, if you're OK with fp8 quantization on this model, it fits. I expect that with 64 GB of VRAM it would fit without quantization.
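For reference, roughly how I'm loading it with vLLM's offline API — a minimal sketch, assuming on-the-fly fp8 quantization; the repo id and exact context length below are placeholders, not the actual checkpoint name:

    # Sketch: load a large model in vLLM with fp8 quantization and a long context.
    # The model id is a placeholder -- substitute the checkpoint you're serving.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3.5-27B",      # placeholder repo id
        quantization="fp8",            # quantize fp16 weights to fp8 at load time
        max_model_len=131072,          # >= 128k token context window
        gpu_memory_utilization=0.95,   # leave a little headroom for the KV cache
    )

    params = SamplingParams(temperature=0.7, max_tokens=512)
    out = llm.generate(["Hello"], params)
    print(out[0].outputs[0].text)

The equivalent server invocation would be something like `vllm serve <model> --quantization fp8 --max-model-len 131072`; the crash in the issue above happens during generation, not at load time.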


