Hacker News - ofermend's comments

This model is quite impressive. Beyond being useful for math and research thanks to its strong reasoning, it also maintains a very low hallucination rate of 1.1% on the Vectara Hallucination Leaderboard: https://github.com/vectara/hallucination-leaderboard
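For context, the leaderboard's hallucination rate is, roughly, the fraction of generated summaries that a judge model flags as unfaithful to their source documents. A minimal sketch of that computation - the judgments here would come from a detection model such as HHEM, which this sketch does not implement:

```python
def hallucination_rate(judgments):
    """Fraction of summaries judged hallucinated.

    `judgments` is a list of booleans, one per (source, summary) pair,
    True when a judge model flags the summary as unfaithful.
    """
    if not judgments:
        raise ValueError("no judgments provided")
    return sum(judgments) / len(judgments)

# Example: 11 hallucinated summaries out of 1000
rate = hallucination_rate([True] * 11 + [False] * 989)
print(f"{rate:.1%}")  # -> 1.1%
```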


These days it is common to see multiple teams in a large company developing isolated RAG applications. This is similar to the "Shadow IT" problem of the early cloud era - it causes a big headache for IT teams.

I work at Vectara, and we see this all the time. I'm wondering how others are experiencing this.


DeepSeek-R1 is an amazing reasoning LLM, but it seems to hallucinate more than we might expect.


Gemini-2.0-Flash does extremely well on the Hallucination Evaluation Leaderboard, with a 1.3% hallucination rate: https://github.com/vectara/hallucination-leaderboard


Fascinating, thanks for calling that out. I found 1.0 promising in practice, but with hallucination problems. Then I saw it had gotten 57% of questions wrong on open-book true/false and I wrote it off completely - there's no reason to switch to it for speed and cost if it's just a random generator. That's a great outcome.


Speaking of which, I wonder how they'd do on SimpleQA. OpenAI is an outlier there in the negative sense vs Anthropic. This benchmark also deals with hallucination and "inappropriate certainty".


We've done a study (see link) showing that - contrary to common belief - semantic chunking is not always the best approach.

Curious to hear from the YC community - has anyone else done systematic testing, and if so, what did you find?
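For anyone wanting to run this kind of comparison themselves, here is a minimal sketch of the two strategies under test - fixed-size chunking vs. a naive "semantic" variant that packs whole sentences. The splitting logic is illustrative only, not the implementation from the study:

```python
import re

def fixed_chunks(text, size=200):
    """Split text into fixed-size character windows, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def semantic_chunks(text, max_size=200):
    """Greedily pack whole sentences into chunks of up to max_size chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overflow.
        if current and len(current) + len(sentence) + 1 > max_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

A systematic test would index the same corpus under both strategies and compare retrieval metrics (e.g. recall@k) on a fixed query set.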


Check out Granite 3.0 on the hallucination leaderboard: https://github.com/vectara/hallucination-leaderboard


We recently launched UDF reranking as part of the RAG stack, and we think it supports a lot of interesting use cases that go beyond simple relevance. For example, it supports reranking by distance (geo-location), by recency, and more.

I wanted to ask the HN community for advice: what are some real use cases you have that could benefit from UDF reranking in RAG?
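To make the recency example concrete, here is a minimal sketch of what a user-defined reranking function could look like - blending the retriever's relevance score with an exponential recency decay. The function names, fields, and half-life weighting are illustrative assumptions, not Vectara's actual UDF syntax:

```python
import time

def rerank(results, half_life_days=30.0, now=None):
    """Re-score results by relevance * recency decay, favoring fresh docs.

    Each result is a dict with a `score` (retriever relevance) and a
    `timestamp` (seconds since the epoch). A result's score is halved
    for every `half_life_days` of age.
    """
    now = time.time() if now is None else now

    def udf(result):
        age_days = (now - result["timestamp"]) / 86400
        return result["score"] * 0.5 ** (age_days / half_life_days)

    return sorted(results, key=udf, reverse=True)
```

With a 30-day half-life, a two-month-old document keeps only a quarter of its relevance score, so a slightly less relevant but fresher document can outrank it.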


I remember the Carlsen/Niemann controversy from 2022 - that was quite a drama... https://en.wikipedia.org/wiki/Carlsen%E2%80%93Niemann_contro...


Great release. Models just added to Hallucination Leaderboard: https://github.com/vectara/hallucination-leaderboard.

TL;DR:

* 90B-Vision: 4.3% hallucination rate
* 11B-Vision: 5.5% hallucination rate


About a year ago, in partnership with the Airbyte team, we launched the Vectara Destination to help developers accelerate Generative AI applications. Congrats to the Airbyte team on this great launch - looking forward to 2.0!

