"Perform OCR on this image. Return only the text found in the image as a single continuous string without any newlines, additional text, or commentary. Separate words with single spaces. For any truncated, partially visible, or occluded text, include only the visible portions without attempting to complete or guess the full text. If no text is present, return empty double quotes."
TL;DR: If you want the truth about the original object rather than the image, this paper shows VLMs are superior, even though the prompt shows the authors are "holding it wrong".
Yet another paper where the authors don't address what tokens are. It's like publishing "Rolling pin fails at math" or "Calculator fails to turn dough ball into round pizza."
While I can understand where they're coming from in wanting to avoid hallucination during letter-for-letter transcription from an image, most times you reach for OCR you want the original copy, despite damage to its representation (paper tears, coffee stains, hands in front of it). Turns out token co-occurrence probabilities come in handy here!
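To make the token-probability point concrete, here's a toy sketch of my own (not the paper's method, and far cruder than a real VLM): even a simple bigram frequency model over a tiny hypothetical corpus can fill in an occluded word from co-occurrence statistics alone.

```python
# Toy illustration: restoring an occluded token via bigram co-occurrence
# counts. The corpus and sign text are made-up examples.
from collections import Counter, defaultdict

corpus = ("please do not park in front of the fire hydrant "
          "keep clear of the fire exit at all times").split()

# Count which word follows which in the corpus.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def restore(tokens):
    """Replace each occluded token ('???') with the most frequent
    successor of the preceding visible token, if one is known."""
    out = []
    for tok in tokens:
        if tok == "???" and out and bigrams[out[-1]]:
            tok = bigrams[out[-1]].most_common(1)[0][0]
        out.append(tok)
    return out

# A sign with one word hidden behind, say, a coffee stain.
print(restore("do not park in front of the ??? hydrant".split()))
```

An image-faithful OCR prompt would be obliged to emit the gap; a model allowed to lean on co-occurrence statistics recovers "fire", which is what nearly every downstream user actually wants.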
Whether the image of an object or the object itself is "ground truth" depends on the user's goal. Almost all use cases want what was originally written on the object, not its present occluded representation.
"Perform OCR on this image. Return only the text found in the image as a single continuous string without any newlines, additional text, or commentary. Separate words with single spaces. For any truncated, partially visible, or occluded text, include only the visible portions without attempting to complete or guess the full text. If no text is present, return empty double quotes."
Found in: https://github.com/video-db/ocr-benchmark/blob/main/prompts....