
> small fp8 local model with almost 100k token context

It wouldn't fit Qwen3.5 27B, would it? That's the SOTA.



This is an fp16 model, so that's 54 GB in weights. I can only load it with fp8 quantization enabled (>= 128k context). I do run into this error during generation, though: https://github.com/vllm-project/vllm/issues/36350. It looks like an issue with the flash attention backend. But yeah, if you're OK with fp8 quantization on this model, it fits. I expect that with 64 GB of VRAM it would fit without quantization.
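For reference, roughly how I'm loading it with vLLM's offline API — a minimal sketch, assuming on-the-fly fp8 quantization; the repo id and exact context length below are placeholders, not the actual checkpoint name:

    # Sketch: load a large model in vLLM with fp8 quantization and a long context.
    # The model id is a placeholder -- substitute the checkpoint you're serving.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3.5-27B",      # placeholder repo id
        quantization="fp8",            # quantize fp16 weights to fp8 at load time
        max_model_len=131072,          # >= 128k token context window
        gpu_memory_utilization=0.95,   # leave a little headroom for the KV cache
    )

    params = SamplingParams(temperature=0.7, max_tokens=512)
    out = llm.generate(["Hello"], params)
    print(out[0].outputs[0].text)

The equivalent server invocation would be something like `vllm serve <model> --quantization fp8 --max-model-len 131072`; the crash in the issue above happens during generation, not at load time.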


