> We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there’s too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here.
I'm curious about the solutions the op has tried so far here.
"Because there’s too much you need to feed into it" - what does the author mean by this? If it is the amount of data, then I would say sampling needs to be implemented. If that's the extent of the information required from the agent builder, I agree that an LLM-as-a-judge e2e eval setup is necessary.
In general, a more generic eval setup is needed, one with minimal requirements on AI engineers, if we as a sector want to move past vibes-based reliability engineering practices.
Likewise. I have a nasty feeling that most AI agent deployments ship with nothing more than some cursory manual testing. Going with the 'vibes', to use an overused term in the industry.
I can confirm this after hundreds of conversations on the topic over the last two years. 90% of cases are simply not high-volume or high-stakes enough for the devs to care. I'm a founder of an evaluation-automation startup, and our challenge is spotting teams right as their usage starts to grow and quality issues are about to escalate. Since that’s tough, we're trying to make getting to a first set of evals so simple that teams can start building the mental models before things get out of hand.
A lot of "generative" work is like this. While you can come up with benchmarks galore, at the end of the day how a model "feels" only seems to come out from actual usage. Just read /r/localllama for opinions on which models are "benchmaxed" as they put it. It seems to be common knowledge in the local LLM community that many models perform well on benchmarks but that doesn't always reflect how good they actually are.
In my case, I was until recently working on TTS, and this was a huge barrier for us. We used all the common signal-quality and MOS-simulation models that judge so-called "naturalness", "expressiveness", etc. But we found that none of these really helped us much in deciding when one model was better than another, or when a model was "good enough" for release. Our internal evaluations correlated poorly with them, and we even disagreed quite a bit within the team on the quality of output. This made hyperparameter tuning as well as commercial planning extremely difficult, and we suffered greatly for it. (Notice my use of the past tense here...)
Having good metrics is just really key, and I'm now at the point where I'd go as far as to say that if good metrics don't exist, it's almost not even worth working on something. (Almost.)
What are the main shortcomings of the solutions you tried out?
We believe you need both to automatically derive evaluation policies from OTEL data (data-first) and to bring in rigorous LLM-judge automation from the other end (intent-first) for the truly open-ended aspects.
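To illustrate the data-first half, something like the sketch below could scan exported OTEL spans and propose candidate policies from recurring span shapes. The OTLP-JSON layout and the `gen_ai.tool.name` attribute key are assumptions about how the agent happens to be instrumented, not a description of any particular product:

```python
# Rough sketch of the "data-first" half: scan exported OTEL spans and turn
# recurring span shapes into candidate evaluation policies. The OTLP-JSON
# layout and attribute keys below are assumptions about the instrumentation.
import json
from collections import Counter

def iter_spans(path: str):
    """Walk resourceSpans -> scopeSpans -> spans in an OTLP-JSON export."""
    with open(path) as f:
        doc = json.load(f)
    for rs in doc.get("resourceSpans", []):
        for ss in rs.get("scopeSpans", []):
            yield from ss.get("spans", [])

def attr(span: dict, key: str, default=None):
    """Look up a string attribute on a span, OTLP-JSON style."""
    for kv in span.get("attributes", []):
        if kv["key"] == key:
            return kv["value"].get("stringValue", default)
    return default

def propose_policies(path: str) -> list[dict]:
    """One candidate policy per tool the agent calls frequently."""
    tool_counts = Counter(
        attr(s, "gen_ai.tool.name")          # assumed attribute key
        for s in iter_spans(path)
        if attr(s, "gen_ai.tool.name")
    )
    return [
        {
            "policy": f"tool '{tool}' returns a non-error result",
            "applies_to_spans_with": {"gen_ai.tool.name": tool},
            "observed_calls": count,
        }
        for tool, count in tool_counts.most_common()
        if count >= 10                        # ignore rarely used tools
    ]

print(json.dumps(propose_policies("spans.json"), indent=2))
```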
It's a two-day project at best to create your own bespoke LLM-as-judge e2e eval framework. That's what we did. Works fine. Not great. Still need someone to write the evals, though.
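For anyone curious what "bespoke" means in practice, a toy version looks roughly like this: hand-written cases, a fresh end-to-end agent run per case, and a judge model checking each case's criteria. `run_agent()`, the cases, and the model name are stand-ins for your own stack, not what we actually built:

```python
# Toy bespoke e2e eval loop: hand-written cases, an end-to-end agent run per
# case, and an LLM judge checking each case's criteria. All names are stand-ins.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()

@dataclass
class EvalCase:
    name: str
    user_input: str
    criteria: str          # still written by a human, per case

CASES = [
    EvalCase("refund_lookup", "Where is my refund for order 1234?",
             "Agent checks the order and does not promise a refund date it cannot know."),
    EvalCase("off_topic", "Write me a poem about the moon.",
             "Agent politely declines and redirects to supported topics."),
]

def run_agent(user_input: str) -> str:
    """Stand-in for your actual end-to-end agent call; replace with your own."""
    return "stub response to: " + user_input

def judge(case: EvalCase, output: str) -> bool:
    """Ask a judge model whether the agent output meets this case's criteria."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Criteria: {case.criteria}\nAgent output: {output}\n"
                   "Reply with exactly PASS or FAIL."}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")

for case in CASES:
    verdict = "PASS" if judge(case, run_agent(case.user_input)) else "FAIL"
    print(case.name, verdict)
```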