While Llama 4 had a pretty bad launch (the LM Arena gaming in particular is terrible), it seemed pretty decent to me when I ran my own evals on it. I used the April 5 v0.8.3 vLLM release (https://blog.vllm.ai/2025/04/05/llama4.html ), so this was before the QKNorm fix (https://github.com/vllm-project/vllm/pull/16311 ).

For English, on a combination of MixEval, LiveBench, IFEval, and EvalPlus, Maverick FP8 (17B active/400B total) was about on par with DeepSeek V3 FP8 (37B/671B), and Scout (17B/109B) was in the ballpark of Gemma 3 27B, though not too far off Llama 3.3 70B and Mistral Large 2411 (123B).

Llama 4 claimed to be trained on 10X more multilingual tokens than Llama 3, and in my Japanese testing (including with some new, currently unreleased evals) the models did perform better than Llama 3, although I'd characterize their overall Japanese performance as "middle of the pack": https://shisa.ai/posts/llama4-japanese-performance/

I think a big part of the negative reaction is that, in terms of memory footprint, Llama 4 looks more built for Meta (a large-scale inference provider) than for home users. That said, with the move to APUs and more efficient CPU offloading, there's still something to be said for strong capabilities at 17B active parameters.
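
To put rough numbers on the footprint point, here's a back-of-the-envelope sketch. It assumes weight memory is just total parameters times bytes per parameter (2 for BF16, 1 for FP8, 0.5 for 4-bit) and ignores KV cache, activations, and quantization overhead, so real usage will be higher. The parameter counts are the ones quoted above; the `weight_gb` helper and the precision labels are just names I made up for illustration.

```python
# Weight-only memory estimate. Assumption: footprint = total params x bytes per param.
# Ignores KV cache, activations, and quantization metadata, so this is a lower bound.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

# name: (active params, total params), both in billions, as quoted in the comment
MODELS = {
    "Llama 4 Scout": (17, 109),
    "Llama 4 Maverick": (17, 400),
    "DeepSeek V3": (37, 671),
    "Llama 3.3 70B": (70, 70),  # dense model: every parameter is active
}


def weight_gb(total_params_b: float, precision: str) -> float:
    """Approximate weight size in GB (decimal) for a parameter count in billions."""
    return total_params_b * BYTES_PER_PARAM[precision]


for name, (active_b, total_b) in MODELS.items():
    sizes = ", ".join(f"{p}: {weight_gb(total_b, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{name} ({active_b}B active / {total_b}B total) -> {sizes}")
```

All the weights have to live somewhere (VRAM, system RAM, or disk) even though only 17B are touched per token, which is why Scout is awkward on a single consumer GPU even at 4-bit but more plausible on a big unified-memory box.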

I think people are quick to forget that Llama 3, while its launch was not as disastrous, was much improved with 3.1. Also, the competitive landscape is pretty different now. And I think the visual capabilities are being slept on a bit, but that's probably also a consequence of releasing before the inference code was fully baked...