I ended up using EasyOCR. I assume it is too slow in CPU-only mode.

aidenn0 · on Aug 9, 2024

> I assume it is too slow in CPU-only mode.

So you don't have to assume: I gave up after running on 8 cores (Ryzen 7 2700) for 10 days for a single page.

fred123 · on Aug 9, 2024

Something is wrong with your setup. It should take less than 30 s per page on your hardware.

aidenn0 · on Aug 9, 2024

Huh, I tried with the version from pip (instead of the one from my package manager) and it completes in 22 s. Output on the only page I tested is considerably worse than Tesseract's, particularly with punctuation. The paragraph detection did not seem to work at all, rendering the entire thing on a single line.

Even worse for my uses, Tesseract made two mistakes on this page (part of why I picked it), and EasyOCR read neither of those spots correctly.

Partial list of mistakes:

1. Missed several full-stops at the end of sentences

2. Rendered two full-stops as colons

3. Rendered two commas as semicolons

4. Misrendered every single em-dash in various ways (e.g. "_~")

5. Missed 4 double-quotes

6. Missed 3 apostrophes, including rendering "I'll" as "Il"

7. All 5 exclamation points were rendered as a lowercase-ell ("l"). Tesseract got 4 correct and missed one.
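For anyone who wants to put numbers on a comparison like this instead of eyeballing it, here is a rough sketch that tallies punctuation marks in a hand-corrected reference text and in an OCR output, then reports the difference per mark. It is not what I actually ran; the function names and the set of marks are made up for illustration, and it uses only the Python standard library. It compares counts only, so it cannot tell a dropped full-stop from one that was rendered as a colon.

```python
from collections import Counter

# Marks to track. This set is an assumption based on the list above;
# "\u2014" is the em-dash.
PUNCTUATION = set(".,:;!?'\"\u2014")


def punctuation_counts(text: str) -> Counter:
    """Count how often each tracked punctuation mark occurs in text."""
    return Counter(ch for ch in text if ch in PUNCTUATION)


def punctuation_diff(reference: str, ocr_output: str) -> dict:
    """Return {mark: ocr_count - reference_count} for marks whose counts differ.

    Negative values mean the OCR output has fewer of that mark than the
    reference; positive values mean it has more.
    """
    ref = punctuation_counts(reference)
    ocr = punctuation_counts(ocr_output)
    return {
        mark: ocr[mark] - ref[mark]
        for mark in sorted(set(ref) | set(ocr))
        if ocr[mark] != ref[mark]
    }


if __name__ == "__main__":
    reference = "I'll go! \"Stop,\" he said."
    ocr_output = "Il gol \"Stop; he said:"
    for mark, delta in punctuation_diff(reference, ocr_output).items():
        print(repr(mark), delta)
```

Running the same diff on the Tesseract output and the EasyOCR output against one reference page would give a per-mark comparison of the two engines, though a single page is still a very small sample.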

ein0p · on Aug 9, 2024

I use a container on a machine with an old quad core i7 and no GPU compute. This should take at most tens of seconds per page.

yard2010 · on Aug 9, 2024

...how is it so slow?