Anything that doesn't touch the model parameters at all once it has been compiled. For example, in streaming ASR with an encoder-decoder you can get gains in accuracy just by improving the encoder-decoder orchestration and ratio, the frequency of forward passes, or by dynamically adjusting the length of rolling windows (if using full attention). Prompting would be part of this too, including few-shot examples. Decoding strategy is also part of this (top-k, nucleus, speculative decoding, greedy, or anything else). So is applying signal processing, or any other kind of processing, to the input before it goes into the model, or to the output. There are a lot of things you can do.
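To make the decoding-strategy point concrete, here is a minimal sketch of top-k and nucleus (top-p) filtering applied to raw logits, with the model itself left untouched. The function names are made up for illustration, and the choice to measure the nucleus against the original (pre-top-k) probabilities is an assumption; real libraries differ on that detail.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(logits, k=None, p=None):
    """Return {token_id: prob} after optional top-k, then nucleus filtering.

    Hypothetical helper for illustration, not any library's API.
    """
    probs = softmax(logits)
    # Token ids ordered from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if k is not None:
        ranked = ranked[:k]  # top-k: keep only the k most likely tokens
    if p is not None:
        # Nucleus: keep the smallest prefix whose cumulative mass reaches p.
        # Assumption: mass is measured with the original probabilities.
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= p:
                break
        ranked = kept
    # Renormalize so the surviving tokens form a proper distribution.
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}


def sample(logits, k=None, p=None, rng=random):
    """Draw one token id from the filtered distribution."""
    dist = filter_top_k_top_p(logits, k, p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]
```

With `k=1` this reduces to greedy decoding, so the same weights can behave quite differently depending only on the `k` and `p` you pass in.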
That's how you can tell this isn't in the US. Though there are financial reasons why divorced people live together, standard procedure is often for the divorce lawyer on the wife's side to file for a restraining order (easy in this case, since the husband punched the father-in-law) and get the husband dispossessed of the house in said order, which also has the benefit of de facto putting the kids in the custody of the mother. During the divorce this is also used as leverage.
Plumbers are working on very standard systems, they have to front costs and secure work. It only works for them because enough people use basic plumbing services to sustain them. But how many have return customers and guaranteed work for long term projects?
Barely any of them break 0% on any of the demo tasks, with Claude Opus 4.6 coming out on top with a few <3% scores, Gemini 3.1 Pro getting two nonzero scores, and the others (GPT-5.4 and Grok 4.20) getting 0% across the board.
Pre-release, I would have expected Gemini 3.1 Pro to get ahead of Opus 4.6, with GPT-5.4 and Grok 4.20 trailing. Guess I shouldn't have bet against Anthropic.
Not like it's a big lead as of yet. I expect to see more action within the next few months, as people tune the harnesses and better models roll in.
This is far more of a "VLA" task than it is an "LLM" task at its core, but I guess ARC-AGI-3 is making an argument that human intelligence is VLA-shaped.
My broad vibe is that Gemini 3.1 Pro is the best at visual/spatial tasks and oneshotting, while Opus 4.6 is the best at path planning. This task leans heavily on both, but maybe a little more towards planning, so I'm not too shocked that Opus is narrowly on top.
When running, the grids are represented in JSON, so the visual component is nullified but it still requires pretty heavy spatial understanding to parse a big old JSON array of cell values. Given Gemini's image understanding I do wonder if it would perform better with a harness that renders the grid visually.
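A rough sketch of what "rendering the grid visually" could look like in a harness, using only the standard library. It assumes the grid arrives as a JSON 2D array of integer cell values; the palette, the function name, and the PPM output format are all my own choices for illustration, not anything taken from the actual ARC-AGI-3 harness.

```python
import json

# Hypothetical palette: cell value -> (R, G, B). Not taken from ARC-AGI-3.
PALETTE = {
    0: (0, 0, 0),
    1: (0, 116, 217),
    2: (255, 65, 54),
    3: (46, 204, 64),
}
DEFAULT_COLOR = (128, 128, 128)  # fallback for values missing from PALETTE


def grid_to_ppm(grid_json, cell_px=8):
    """Turn a JSON 2D array of cell values into a plain-text PPM (P3) image.

    Each cell becomes a cell_px x cell_px square of a single color.
    """
    grid = json.loads(grid_json)
    height = len(grid)
    width = len(grid[0]) if grid else 0
    lines = ["P3", f"{width * cell_px} {height * cell_px}", "255"]
    for row in grid:
        # Build one scanline of pixels, then repeat it cell_px times.
        scanline = []
        for value in row:
            r, g, b = PALETTE.get(value, DEFAULT_COLOR)
            scanline.extend([f"{r} {g} {b}"] * cell_px)
        for _ in range(cell_px):
            lines.extend(scanline)  # one pixel per output line
    return "\n".join(lines) + "\n"
```

Writing the returned string to a `.ppm` file gives an image that most viewers and converters can open, which could then be passed to a model's image input instead of the raw array.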
Given the drastic difference in price, I think the chart definitely shows Gemini 3.1 in the best light. Google DeepMind is basically the same thing, but they're willing to spend as much on electricity as Anthropic is to achieve its benchmarks.
The individual task scores are all on public tasks. They still held out a hundred or so private tasks, which GPT-5.4 presumably did well on to get its leaderboard position.
Speaking of Zed, if the team at Zed is reading this: I genuinely believe with my full heart that you guys can actually solve this issue, given the impressive work you have done to make the editor fast.
If you work towards something like Google Docs, this product fits right within your category and could work with Zed's team features to a far greater degree.
Zed also has AI functionality natively, which can work for some people, and the best part is that the AI functionality can be toggled off too :-)
This might as well be a billion-dollar unsolved problem that the team at Zed could perhaps apply their expertise to. Although I'd suggest that instead of bolting this functionality onto Zed itself, maybe a Zed fork could be created as a Microsoft Word alternative?
Has anyone tried making a Zed extension that could somehow be a word processor or anything similar? It might already be possible within Zed's existing frameworks, but I am not sure.
I hope someone on the Zed team reads this and solves this problem. Zed is a fantastic piece of software, thanks for making it, Zed team :-)
It's a bummer but I click almost no blog or Show HNs any more due to the content engagement hacking with LLMs (and 99% are not sincere "well see English is not my native language", etc.)