This model is quite impressive. It's not just useful for math/research thanks to its strong reasoning; it also maintained a very low hallucination rate of 1.1% on the Vectara Hallucination Leaderboard:
https://github.com/vectara/hallucination-leaderboard
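If anyone wants to poke at the scoring themselves, the HHEM model that powers the leaderboard is open on Hugging Face. Here's a minimal sketch of scoring a (source, summary) pair, assuming the CrossEncoder interface from the v1 model card (newer HHEM versions may need a different loading path, so check the current model card):

    # Score a (source, summary) pair for factual consistency using the open
    # HHEM model. Assumes the sentence-transformers CrossEncoder interface
    # documented for v1; newer versions may load differently.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("vectara/hallucination_evaluation_model")
    pairs = [
        ("The capital of France is Paris.",        # source passage
         "Paris is the capital city of France."),  # generated summary
    ]
    scores = model.predict(pairs)  # ~1.0 = consistent, ~0.0 = hallucinated
    print(scores)

Scores close to 1 mean the summary is factually consistent with its source; the leaderboard aggregates these scores over many summaries per model.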
These days it's common in large companies to see multiple teams building isolated RAG applications.
It's similar to the "Shadow IT" problem from the early cloud era - it caused a big headache for IT teams.
I work at Vectara, and we see this all the time.
How are others experiencing this?
Fascinating, thanks for calling that out: I found 1.0 promising in practice, but it had hallucination problems. Then I saw it got 57% of questions wrong on open-book true/false, and I wrote it off completely - there's no reason to switch to it for speed and cost if it's just a random generator. So a 1.1% hallucination rate here is a great outcome.
Speaking of which, I wonder how they'd do on SimpleQA. OpenAI is a negative outlier there compared to Anthropic. That benchmark also gets at hallucination and "inappropriate certainty".
We recently launched UDF (user-defined function) reranking as part of the RAG stack, and we think it enables a lot of interesting use cases that go beyond simple relevance. For example, it supports reranking by distance (geo-location), by recency, and more.
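To make that concrete, here's a rough Python sketch of the kind of blended scoring a UDF can express - illustrative only; the field names, weights, and decay constants are made up, and this is not our actual UDF syntax:

    import math
    import time

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance between two (lat, lon) points, in km.
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * 6371 * math.asin(math.sqrt(a))

    def rerank_score(result, query_lat, query_lon):
        # Blend base semantic relevance with recency and geo proximity.
        relevance = result["score"]  # base retrieval relevance in [0, 1]
        age_days = (time.time() - result["published_ts"]) / 86400
        recency = math.exp(-age_days / 30)  # decays with age (~monthly scale)
        dist_km = haversine_km(query_lat, query_lon, result["lat"], result["lon"])
        proximity = 1 / (1 + dist_km / 10)  # decays with distance
        return 0.6 * relevance + 0.25 * recency + 0.15 * proximity

    # Sort retrieved results by the blended score instead of raw relevance:
    # results.sort(key=lambda r: rerank_score(r, 37.77, -122.42), reverse=True)

The point is that retrieval relevance becomes just one signal among several, weighted however the application needs.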
I wanted to ask the HN community for advice: what are some real use cases you have that could benefit from UDF reranking in RAG?
About a year ago we launched the Vectara Destination in partnership with the Airbyte team, to help developers accelerate Generative AI applications. Congrats to the Airbyte team on this great launch - looking forward to 2.0!