Hacker News | nubg's comments

what exactly does scaffolding mean in this context? genuine question

anything that doesn't touch the model parameters at all once it has been compiled. for example, in streaming ASR of an encoder-decoder you can get gains in accuracy just by enhancing the encoder-decoder orchestration and ratio, frequency of fwd passes, dynamically adjusting the length of rolling windows (if using full attention). Prompting would be part of this too, including few-shot examples. Decoding strategy is also part of this (top-k, nucleus, speculative decoding, greedy or anything else). Applying signal processing or any kind of processing to the input before getting it into the model, or to the output. There are a lot of things you can do.
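To make the decoding-strategy part concrete, here is a minimal sketch of greedy, top-k, and nucleus (top-p) selection over a single next-token distribution. Nothing here comes from a particular library: the function names and the toy probabilities are assumptions for illustration, and a real decoder would apply this to logits at every step. The point is that all three work purely on the model's output and never touch its parameters.

```python
import math
import random

def softmax(logits):
    """Turn raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Greedy decoding: always pick the most probable token id."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def nucleus_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(dist, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical 4-token vocabulary, purely for illustration.
probs = [0.5, 0.3, 0.15, 0.05]
token_greedy = greedy(probs)
token_top_k = sample(top_k_filter(probs, k=2))
token_nucleus = sample(nucleus_filter(probs, p=0.9))
```

Swapping one of these for another changes the output without retraining anything, which is why it counts as scaffolding.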

> Now divorced, Biesma is still living with his ex-wife in their home, which is on the market.

sounds like hell on earth


Selling won't be a problem in the current housing market in Amsterdam. Getting somewhere new to live on the other hand…

Particularly for his poor (ex)partner…

[That feels a bit like victim blaming, but there is more than one victim here, and one of them is much more culpable than the rest]


That's how you can tell this isn't in the US. Though there are financial reasons why divorced people live together, standard procedure is often for the divorce lawyer on the female side to file a restraining order (in this case easy since the husband punched the father in law) and get the husband dispossessed of the house in said order, which also has the benefit of de facto putting the kids in the custody of the mother. During the divorce this is also used as leverage.

> 1 dollar per minute

so 60 usd / hour? a plumber earns more

if this allows you to produce features that bring you money, it's a no-brainer


Plumbers are working on very standard systems, they have to front costs and secure work. It only works for them because enough people use basic plumbing services to sustain them. But how many have return customers and guaranteed work for long term projects?

Any benchmarks?

The main frontier models are all up on https://arcprize.org/tasks

Barely any of them break 0% on any of the demo tasks, with Claude Opus 4.6 coming out on top with a few <3% scores, Gemini 3.1 Pro getting two nonzero scores, and the others (GPT-5.4 and Grok 4.20) getting all 0%


Pre-release, I would have expected Gemini 3.1 Pro to get ahead of Opus 4.6, with GPT-5.4 and Grok 4.20 trailing. Guess I shouldn't have bet against Anthropic.

Not like it's a big lead as of yet. I expect to see more action within the next few months, as people tune the harnesses and better models roll in.

This is far more of a "VLA" task than it is an "LLM" task at its core, but I guess ARC-AGI-3 is making an argument that human intelligence is VLA-shaped.


My broad vibe is that Gemini 3.1 Pro is the best at visual/spatial tasks and oneshotting while Opus 4.6 is the best at path planning. This task leans heavily on both but maybe a little more towards planning so I'm not too shocked that Opus is narrowly on top.

When running, the grids are represented in JSON, so the visual component is nullified but it still requires pretty heavy spatial understanding to parse a big old JSON array of cell values. Given Gemini's image understanding I do wonder if it would perform better with a harness that renders the grid visually.
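As a rough idea of what a harness tweak like that could look like, here is a sketch that turns a JSON grid into rows of characters instead of a nested array. It assumes the grid is a JSON list of rows of small non-negative integers; that format, the `render_grid` name, and the palette are my assumptions, not the actual ARC harness. It also only produces a text rendering. Producing an actual image for a vision model would need an imaging library on top.

```python
import json

def render_grid(grid_json, palette=".#ox+*%@=~"):
    """Render a JSON grid (list of rows of ints) as one text line per row.

    Each cell value is mapped to a single character so that rows and
    columns line up visually. Values beyond the palette wrap around.
    """
    grid = json.loads(grid_json)
    lines = []
    for row in grid:
        lines.append("".join(palette[value % len(palette)] for value in row))
    return "\n".join(lines)

# Hypothetical 3x3 grid, purely for illustration.
example = "[[0, 1, 0], [1, 2, 1], [0, 1, 0]]"
print(render_grid(example))
```

Even this much removes the brackets and commas the model otherwise has to parse before it can reason about adjacency.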


Given the drastic difference in price, I think the chart definitely shows Gemini 3.1 in the best light. Google DeepMind is basically the same thing but they're willing to pay as much electricity as Anthropic is to achieve its benchmarks

Curious, that doesn't match the graph up on the Leaderboard page? https://arcprize.org/leaderboard

The individual task scores are all on public tasks, they still held out a hundred or so private tasks that presumably GPT-5.4 did well on to get its leaderboard position.

bubble popping

Are they trying to slide stuff down? but it just bumps stuff up?

> If anomalies were detected

but also

> Latency, throughput, and resiliency characteristics had to remain consistent, and in some cases improve.

> Payment requests could not be dropped, delayed, or left unanswered.

what else would an "anomaly" be?


What exactly would the perfect tool look like?

Perfect isn't the goal. But something on the tier of KiCad, Blender, Zed, Sublime, etc.

Speaking of zed, if the team at zed is reading this, then I genuinely believe with my full heart that you guys can actually solve this issue given the impressive work you have done at trying to make the editor fast.

If you work towards something like google docs etc., this product feels right within your category and can work with the team features at zed to a far greater degree.

Zed also natively has AI functionality so it can work for some people and the best part about Zed is that AI functionality can be toggled off too :-)

This might as well be a billion dollar unsolved problem which the team at zed could use their expertise on perhaps. Although I suggest that maybe instead of bolting these functionalities into zed itself, maybe a zed-fork can be created for a more Microsoft word alternative?

Has anyone tried making a zed extension that works as a word processor or something similar? It might already be possible within zed's existing framework, but I am not sure.

I hope someone on the zed team reads this and solves this problem. Zed is a fantastic piece of software, thanks for making it zed team :-)


> AI insight porn

put very succinctly. if done without ai, the more impressive!


Dear author, can you post the prompt? I'm not sure which parts are what you actually meant to write and which are LLM fillings.

It's a bummer but I click on almost no blogs or Show HNs any more due to the content engagement hacking with LLMs (and 99% are not sincere "well see English is not my native language", etc.)

Man it doesn't post automatically

It's AI for sure.

Great analysis, you should open a blog with your great abilities
