I think prompts like this are where agentic workflows come into play. If you as...

wahnfrieden · 2026-04-22T15:40:00 1776872400

That's what makes it a fair evaluation of its limits

fennecfoxy · 2026-04-22T16:45:38 1776876338

I mean, asking these transformers to do maths has always been the wrong task. It's like we're now counting "it doesn't have x tools, built with traditional code, built in" as a flaw.

Though I suppose we're testing their model + agent harness here as well. It really _should_ have all of those tools/reasoning available to accomplish a task like the above without issue.

wahnfrieden · 2026-04-22T17:26:33 1776878793

It's only been the wrong task because they've been deficient at it and expensive to use, so we had workarounds. They are getting better at these tasks and cheaper (sometimes). It's fair to evaluate even if there are more economical and accurate alternatives available.

fjdjshsh · 2026-04-23T13:10:19 1776949819

You can evaluate the limits of a spoon by trying to cut meat with it.

The point is: what are the typical use cases for the tool, and what are the agreed-upon areas of application?

Making the LLM do math with large numbers, I would argue, is not one of its typical use cases, though it's at the border.

Asking an image generator model to calculate numbers before generating an image definitely does NOT sound like a reasonable use case (do people need it? Will people try using it for this purpose?)