I think prompts like this are where agentic workflows come into play. If you asked an AI tool to generate the first 64 prime numbers, it could do that. If you asked it to draw a charcoal image of Pokemon 13, it could do that. If you asked it to add a white Menlo 13 on a black background to the top left corner of that image, it could do that. If you asked it to do that 63 more times, it could do those things, and if you asked it to assemble those into a grid, it could.
It can't get that in a one-shot. Perhaps, though, it could figure out when it needs to break a problem into individual tasks to delegate to itself and assemble them at the end.
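To make the decomposition concrete, here is a rough Python sketch of what that planning step might look like. It assumes my reading of the prompt (one tile per prime-numbered Pokemon) and an 8x8 layout, neither of which is confirmed anywhere in the thread. `render_tile` is a hypothetical stand-in for whatever image-generation call the agent would delegate to; it is not a real API.

```python
from typing import Callable, List


def first_primes(n: int) -> List[int]:
    """Return the first n primes using trial division by primes found so far."""
    primes: List[int] = []
    candidate = 2
    while len(primes) < n:
        # Only divisors up to sqrt(candidate) need to be checked.
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def plan_grid(count: int = 64, columns: int = 8) -> List[List[str]]:
    """Build one sub-task prompt per prime and lay the prompts out row by row.

    The 8-column layout is an assumption, not something the original prompt states.
    """
    prompts = [
        f"charcoal image of Pokemon {p}, with a white Menlo {p} "
        f"on a black background in the top left corner"
        for p in first_primes(count)
    ]
    return [prompts[i:i + columns] for i in range(0, count, columns)]


def run_plan(grid: List[List[str]], render_tile: Callable[[str], str]) -> List[List[str]]:
    """Delegate each prompt to render_tile (hypothetical) and keep the grid shape."""
    return [[render_tile(prompt) for prompt in row] for row in grid]
```

The arithmetic happens in ordinary code, and the image model only ever sees 64 small, self-contained prompts, so the "assemble them at the end" part is just keeping the rows in order.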
I mean, asking these transformers to do maths has always been the wrong task. It's like we're now treating "it doesn't have x tools, written in traditional code, built in" as a shortcoming.
Though I suppose we're testing their model + agent harness here as well. It really _should_ have all of those tools/reasoning available to accomplish a task like the above without issue.
It's only been the wrong task because they've been deficient at it and expensive to use, so we had workarounds. They are getting better at these tasks and cheaper (sometimes). It's fair to evaluate even if there are more economical and accurate alternatives available.
You can evaluate the limits of a spoon by trying to cut meat with it.
The point is what are the typical use cases for the tool / what are the agreed upon areas of application?
Making the LLM do math with large numbers, I would argue, is not in its typical use case, though it's at the border.
Asking an image generator model to calculate numbers before generating an image definitely does NOT sound like a reasonable use case (do people need it? Will people try using it for this purpose?)