"Perform OCR on this image. Return only the text found in the image as a single continuous string without any newlines, additional text, or commentary. Separate words with single spaces. For any truncated, partially visible, or occluded text, include only the visible portions without attempting to complete or guess the full text. If no text is present, return empty double quotes."
TL;DR: If you want the truth about the original object rather than the image, this paper shows VLMs are superior, even though the prompt shows the authors are "holding it wrong".
Yet another paper where the authors don't address what tokens are. It's like publishing "Rolling pin fails at math" or "Calculator fails to turn dough ball into round pizza."
While I can understand where they're coming from in wanting to avoid hallucination during letter-for-letter transcription from an image, most times you reach for OCR you want the original copy, despite damage to its representation (paper tears, coffee stains, hands in front of it). Turns out token co-occurrence probabilities come in handy here!
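To make the token-probability point concrete, here's a toy sketch of my own (not the paper's method, and far cruder than a real VLM): even a simple bigram frequency model over a tiny hypothetical corpus can fill in an occluded word from co-occurrence statistics alone.

```python
# Toy illustration: restoring an occluded token via bigram co-occurrence
# counts. The corpus and sign text are made-up examples.
from collections import Counter, defaultdict

corpus = ("please do not park in front of the fire hydrant "
          "keep clear of the fire exit at all times").split()

# Count which word follows which in the corpus.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def restore(tokens):
    """Replace each occluded token ('???') with the most frequent
    successor of the preceding visible token, if one is known."""
    out = []
    for tok in tokens:
        if tok == "???" and out and bigrams[out[-1]]:
            tok = bigrams[out[-1]].most_common(1)[0][0]
        out.append(tok)
    return out

# A sign with one word hidden behind, say, a coffee stain.
print(restore("do not park in front of the ??? hydrant".split()))
```

An image-faithful OCR prompt would be obliged to emit the gap; a model allowed to lean on co-occurrence statistics recovers "fire", which is what nearly every downstream user actually wants.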
Whether the image of an object or the object itself is "ground truth" depends on the user's goal. Almost all use cases want what was originally written on the object, not its present occluded representation.
"Perform OCR on this image. Return only the text found in the image as a single continuous string without any newlines, additional text, or commentary. Separate words with single spaces. For any truncated, partially visible, or occluded text, include only the visible portions without attempting to complete or guess the full text. If no text is present, return empty double quotes."
Found in: https://github.com/video-db/ocr-benchmark/blob/main/prompts....