Have the LLMs generate tests that measure the “ease of use” and “effectiveness” of coding agents using the language.
Then have them use these tests to get data for their language design process.
They should also smoke test their own “meta process” here. E.g. Write a toy language that should be obviously much worse for LLMs, and then verify that the effectiveness tests produce a result agreeing with that.
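For what it's worth, the smoke test on the meta process could be as small as the sketch below. Everything in it is hypothetical: the `TaskResult` type, the task names, the pass/fail data, and the 0.2 gap threshold are made up for illustration, and pass rate stands in for whatever "effectiveness" metric the LLMs actually come up with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one coding-agent attempt at one benchmark task (hypothetical shape)."""
    task: str
    passed: bool


def pass_rate(results):
    """Fraction of tasks the agent solved; a stand-in for an 'effectiveness' score."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.passed for r in results) / len(results)


def benchmark_detects_bad_language(candidate, strawman, min_gap=0.2):
    """Sanity check: the deliberately bad language must score clearly worse.

    min_gap is an arbitrary assumed threshold, not a recommended value.
    """
    return pass_rate(candidate) - pass_rate(strawman) >= min_gap


# Made-up results for the language under design and for the deliberately bad toy language.
candidate_runs = [
    TaskResult("fizzbuzz", True),
    TaskResult("json_parse", True),
    TaskResult("lru_cache", True),
    TaskResult("toposort", False),
]
strawman_runs = [
    TaskResult("fizzbuzz", True),
    TaskResult("json_parse", False),
    TaskResult("lru_cache", False),
    TaskResult("toposort", False),
]

print(benchmark_detects_bad_language(candidate_runs, strawman_runs))
```

If the check comes back false on a language that was built to be bad, the tests are measuring something other than what you wanted, and the design data they produce shouldn't be trusted yet.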
By the same reasoning, why on earth would a person sincerely ask you that question unless the car they want to wash is either already at the car wash or being brought to them there by someone else for some reason?
If it's as unambiguous as you say, then the natural human response to that question isn't "you should drive there". It's "why are you fucking with me?" Or maybe "have you recently suffered a head injury?"
If you trust that the questioner isn't stupid and is interacting with you honestly, you'd probably just assume that they were asking about an unusual situation where the answer isn't obvious. It's implicitly baked into the premise of the question.
That still doesn't make sense. I'm going to use another car, or borrow a car, to drive to a car wash where the car I want to wash already is, and then... I guess leave it there? Or leave the car I came in?
This isn't a viable out for explaining why AI can't "reason" through this.
But why would they reason through it in that way? You haven't asked them to listen carefully and find the secret reason you're a dumb-ass in order to prove how smart they are. If they default to that mode on every query, that would just make them insufferable conversational partners, which is not the training goal.
Let me put it this way. If you were to prefix the prompts they used with "This is an IQ test: ", I wouldn't be surprised if most of the models did much better. That would give them the context that the humans reading this article already have.
You already brought the car there earlier? You bought a new car and negotiated that you get it washed, so you want to collect it? You have a butler? You plan to get someone or something from the car wash to do it at home, because the car you want to wash is dead?
Yeah similarly we can make a few distinctions here:
1) Intended signal, true
2) Unintended signal, but true
3) Unintended signal, but false
(Sure, 1' intended but false; though not really important here)
When (1) obtains, we can describe the situation as one where sender and receiver coordinate on a message.
When (2) obtains, we can say the sender acted in a way that is indicative of some fact or other and the receiver recognizes this; (2) can obtain alongside (1), as a separate signal, or when the sender hasn't intended to send any signal at all.
(3) obtains when the receiver attributes to the sender some expressive behavior or information that is inaccurate, say, because an interpretive schema has mischaracterized the sender and the coding system, producing a false interpretation.
Also remember that each recipient of the signal will have their own reaction to it. What signals professional competence to one person can signal lickspittle corporate toadying to another.
It cannot be overstated how absurd the marketing campaign for AI was. OpenAI and Anthropic have convinced half the world that AI is going to become a literal god. They deserve to eat a lot of shit for those outright lies.
I await the blog post :)