Documents OCR: Improving Efficiency by Making PDFs Searchable

pkz · on Oct 13, 2018

If these are medical claims documents, won't they contain sensitive data about individuals? I would have tried a local tesseract approach thoroughly before sending them off to a cloud service.
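For reference, a minimal sketch of what a local tesseract pass might look like. It assumes the `tesseract` binary is installed and on the PATH, and uses its documented `pdf` output config to produce a PDF with a text layer. The function names and file names here are hypothetical, and this is not the pipeline from the article.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, output_base: str, lang: str = "eng") -> List[str]:
    """Build the CLI call: tesseract <image> <outputbase> -l <lang> pdf.

    Tesseract appends ".pdf" to output_base itself.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image_path: str, output_base: str, lang: str = "eng") -> str:
    """Run tesseract locally and return the path of the PDF it writes.

    Raises CalledProcessError if tesseract exits with an error,
    or FileNotFoundError if the binary is not installed.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".pdf"
```

Calling `ocr_to_searchable_pdf("scan.png", "scan_ocr")` would write `scan_ocr.pdf` next to the working directory, with nothing leaving the machine.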

josteink · on Oct 13, 2018

This is the right answer.

Personally I’ve wrapped it up in some shell and python scripts[1], and it does the job just fine.

[1] https://github.com/josteink/autoarchiver

JshWright · on Oct 13, 2018

It was the right answer for you. It may not have been the right answer given a different set of requirements (including scale, service availability, ops complexity, etc.).

jayalpha · on Oct 13, 2018

I would have tried Recoll, which can work with tesseract in the background. https://www.lesbonscomptes.com/recoll/

Abbyy is, in my opinion, by far the most powerful OCR software. Linux command line OCR is available but unfortunately too pricey for the normal private user: https://www.ocr4linux.com/en:pricing:start

Davidbrcz · on Oct 13, 2018

I've been happily using paperwork for this task: https://gitlab.gnome.org/World/OpenPaperwork/paperwork/

There's a simple button for scanning a document, and the software does all the processing for you (rotation, OCR, ...). You can apply tags to documents, search them, export them. The storage format is more than open (a PNG file + OCR results as an HTML page, ...).

tjoff · on Oct 13, 2018

That's great :)

I would have liked a comparison between Google, Amazon and tesseract. There are projects for doing this at home and I've been meaning to do this for some time but haven't got around to it yet.

Been a while but the only one I can remember off the top of my head is paperless: https://github.com/danielquinn/paperless

matwood · on Oct 13, 2018

I would like to know more about their comparison results between Google vs. Amazon vs. Tesseract. How did they determine 98% accuracy? If I assume that was Google Vision's accuracy, what was the accuracy of the others?

jccalhoun · on Oct 13, 2018

98% sounds good but that remaining 2% is crucial. I do OCR on a lot of my scanned documents to make them searchable and easy to copy text. I've tried a number of OCR programs but even after a lot of training something like Omnipage will still think a lower case l is a 1 in the middle of a word. It seems like it should be easy to put in a rule that says it is unlikely that a document will have a number surrounded by letters (without spaces) but that doesn't seem possible.
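A rule like that can at least be bolted on as a post-processing step over the OCR output. The sketch below is only an illustration: the confusion table and the function name are assumptions, not something any of the OCR programs mentioned expose, and it will wrongly "correct" legitimate tokens that really do have one of these digits between letters.

```python
import re

# Hypothetical confusion table: digits that OCR tends to emit in place of letters.
DIGIT_TO_LETTER = {"1": "l", "0": "o", "5": "s"}

# One confusable digit with a letter directly on both sides, e.g. the "1" in "he1lo".
_DIGIT_INSIDE_WORD = re.compile(r"(?<=[A-Za-z])[105](?=[A-Za-z])")


def fix_digits_in_words(text: str) -> str:
    """Replace a lone digit sandwiched between letters with its look-alike letter."""
    return _DIGIT_INSIDE_WORD.sub(lambda m: DIGIT_TO_LETTER[m.group(0)], text)


print(fix_digits_in_words("he1lo wor1d, the t0tal is 15 items"))
```

Standalone numbers and quantities such as "1.5L" are left alone because the digit is not surrounded by letters on both sides.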

leokennis · on Oct 13, 2018

Depends on your use case. If you need a definitive answer of how often in 2017 you bought “Coca Cola Zero 1.5L” based on scanned receipts then 98% accuracy will not be enough. If you need that receipt from the one time you bought a garden hose, searching for “garden” and “hose” will probably get you there, even if the OCR tool read “garden h0se” or “g4rden hose”.
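If the search side tolerates small errors, even the misread words can still be found. Here is a rough sketch using `difflib.SequenceMatcher` from the Python standard library. The function name and the 0.7 threshold are arbitrary assumptions, and a real index would use something faster than comparing every word.

```python
from difflib import SequenceMatcher


def fuzzy_contains(text: str, term: str, threshold: float = 0.7) -> bool:
    """Return True if any whitespace-separated word in text is similar enough to term."""
    term = term.lower()
    for word in text.lower().split():
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        if SequenceMatcher(None, word, term).ratio() >= threshold:
            return True
    return False


receipt = "g4rden h0se 12.99"
print(fuzzy_contains(receipt, "garden"), fuzzy_contains(receipt, "hose"))
```

A lower threshold finds more damaged words but also produces more false hits, so the right value depends on how noisy the OCR output is.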

mcguire · on Oct 13, 2018

They OCR'd PDF documents. Interesting (and I wish I had time to do it to some of the PDFs I've been reading), but not earthshaking. Particularly with 98% accuracy; if 1 out of 50 characters is wrong, searches are going to be problematic.

If you find the term, yay; if you don't you can't really conclude it's not there.
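To put a rough number on that: assuming the 98% figure is per character and errors are independent (both assumptions, since the article does not say how accuracy was measured), the chance that a search term survives OCR intact is 0.98 raised to the term's length. That works out to roughly 82% for a 10-character term.

```python
def p_term_intact(term_length: int, char_accuracy: float = 0.98) -> float:
    """Probability that every character of a term is recognized correctly.

    Assumes independent, uniformly distributed per-character errors.
    """
    return char_accuracy ** term_length


for n in (4, 6, 10, 20):
    print(f"{n:>2} chars: {p_term_intact(n):.3f}")
```

Longer terms are hit harder, which is why an exact-match search that comes back empty says little about whether the term is really absent.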

catchmeifyoucan · on Oct 13, 2018

$3200 seems pretty expensive. In the startup I used to work at, I was tasked with building something similar. Except, the documents I had to index were almost 1000 pages each; they were building plans and diagrams. We did all of our processing within an AWS Lambda pipeline and Elasticsearch. Elasticsearch was the only real cost.

matwood · on Oct 13, 2018

For 98% accuracy that's pretty cheap for OCRing a large number of pages. It wasn't that long ago that you had to use something like abbyy - now that was expensive.

catchmeifyoucan · on Oct 13, 2018

Wow, I didn't know. But I'd highly recommend using a Lambda function with a local OCR library to see if it affects cost. No need for Google to be reading your docs. I believe I used PyOCR. Not sure how they got the 98% figure.

tensor · on Oct 13, 2018

I'm curious why you think Amazon isn't reading your docs (lambda function) but Google is?

snowwindwaves · on Oct 13, 2018

I use Qiqqa http://www.qiqqa.com for searching my library of PDFs. It does OCR on scanned documents. Supports shared and cloud PDF libraries and managing bibtex citations.

carbocation · on Oct 13, 2018

As long as there is a BAA between this person’s company and Google, this should be acceptable from a HIPAA standpoint, no? Or are people’s concerns more about the fact that Google is in the mix, without the patients being aware?

hayd · on Oct 13, 2018

How's that pricing work? Google Vision is $.60 per 1000 pages (after 5m pages, $1.50 before that), how can it be $.25?

Even $.25 seems expensive...

solarkraft · on Oct 13, 2018

More pages? 0.00025/page seems okay, tbh. You'd already be at 10ct for a 400 page document, but consider what people spend on printing. I think the price can probably be justified for many.
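The arithmetic behind these figures, as a small sketch. The rates are the ones quoted in this thread ($0.25 per 1000 pages from the article, and $1.50 / $0.60 per 1000 pages below and above 5 million pages for Google Vision), not verified against a current price list, and the function names are made up for illustration.

```python
def flat_cost(pages: int, price_per_1000: float) -> float:
    """Cost in dollars at a single flat rate per 1000 pages."""
    return pages / 1000 * price_per_1000


def tiered_cost(pages: int, threshold: int = 5_000_000,
                rate_below: float = 1.50, rate_above: float = 0.60) -> float:
    """Cost in dollars when pages past the threshold are billed at a lower rate."""
    below = min(pages, threshold)
    above = max(pages - threshold, 0)
    return below / 1000 * rate_below + above / 1000 * rate_above


# A 400-page document at $0.25 per 1000 pages, i.e. $0.00025 per page.
print(f"${flat_cost(400, 0.25):.2f}")
# Six million pages under the tiered rates quoted above.
print(f"${tiered_cost(6_000_000):,.2f}")
```

Under the quoted tiers, the blended rate only approaches $0.60 per 1000 pages at very large volumes, so a $0.25 average would need some other discount or pricing model.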

SQL2219 · on Oct 13, 2018

Adobe Reader has a built-in advanced search function that will search every PDF in a folder. Not that this would replace the system that this author is writing about, but it works well for those times when you have to search through dozens of documents that tally to thousands of pages.

CodeWriter23 · on Oct 13, 2018

And it works when the PDF is made up of images of text?

ausjke · on Oct 13, 2018

The Google Vision API for OCR should be this one:

https://cloud.google.com/vision/docs/ocr

Asmod4n · on Oct 13, 2018

The headline should be: How we handed Google all your medical information and you didn't even know about it.

jaclaz · on Oct 13, 2018

Exactly.

The article is from someone at Oscar Health Insurance.

According to Wikipedia, the company:

https://en.wikipedia.org/wiki/Oscar_Health

is claiming transparency in claims pricing:

>Oscar Health Insurance is a technology-focused health insurance company founded in 2012 and headquartered in New York City. The company has plans to change the health insurance industry through telemedicine, healthcare focused technological interfaces, and transparent claims pricing systems.

Maybe they extended the transparency to people's health data.

Here is another article (still on Medium) that incidentally mentions the author (as an engineer employed at the company):

https://medium.com/oscar-tech/whats-it-like-to-be-an-enginee...

infocollector · on Oct 13, 2018

Is this legal? Did they even read the TOS for Google?

JshWright · on Oct 13, 2018

Of course it's legal. They have a BAA with Google.

Do people think companies that deal with PHI have to build every service in house...? Run their own data centers? Build their own backhaul connections to their users?

infocollector · on Oct 13, 2018

You have a copy of the BAA? (or a link?)

JshWright · on Oct 13, 2018

Here is Google's information on the subject: https://cloud.google.com/security/compliance/hipaa/ (note that the Google Vision API is listed as a covered service, so it's approved for use under a BAA)

Obviously I don't have a copy of the contract between Google and Oscar, but I think it's profoundly unlikely that they would write a blog post about something that would be the end of their company without the "simple" (and industry standard) step of signing a BAA with a vendor.

mcguire · on Oct 13, 2018

"Google Cloud Platform was built under the guidance of a more than 700 person security engineering team, which is larger than most on-premises security teams."

That's...encouraging.

RandomBookmarks · on Oct 13, 2018

With Abbyy and ocr.space Local there are good and (for companies) affordable local OCR solutions available. There is really no need to use online(!) OCR for sensitive data. Plus, local ocr is faster.

JshWright · on Oct 13, 2018

I'm sure they have a BAA with Google, which holds Google to the same HIPAA PHI handling requirements.

infocollector · on Oct 13, 2018

I am not sure they do. They are probably just using the Google Vision API as a regular customer. The pricing would change if Google had to do a legal deal with these folks.

JshWright · on Oct 13, 2018

That would immediately trigger a company-ending lawsuit. There is zero chance they are using a service for handling PHI without a BAA in place with the vendor of that service.

Depending on the service provider, there may be an upfront cost for signing a BAA, but generally the service costs remain the same (assuming you stick to "BAA approved" services). In the case of Google, there is no additional cost:

"As such, we can offer HIPAA regulated customers the same products at the same pricing that is available to all customers, including sustained use discounts. Other public clouds charge more money for their HIPAA cloud, we do not."

https://cloud.google.com/security/compliance/hipaa/

Spooky23 · on Oct 13, 2018

Evidence? Getting a BAA from Google or Amazon or Microsoft is trivial.

mav3rick · on Oct 13, 2018

Because that will fit the HN narrative well? How about commenting on the Google Vision API and its accuracy as well?

Spooky23 · on Oct 13, 2018

It’s a HIPAA-compliant cloud service. Google is orders of magnitude less scary than your average health insurer.

mcguire · on Oct 13, 2018

I've never dealt with HIPAA; what restrictions does it put on internal use? Who, within Google, could they share raw data with? What about "anonymized" data?

Spooky23 · on Oct 13, 2018

See: https://cloud.google.com/security/compliance/hipaa-complianc...

Any reputable cloud service implements technical and process controls to control or eliminate access to customer data. Doing something like incorporating customer data into advertising would be a very serious situation, and frankly I doubt anyone would be that dumb at scale.

Google will contractually agree to a bunch of things relative to this and has third party audits to provide additional assurance.

agumonkey · on Oct 13, 2018

What about embedded metadata comments in PDF source?

diminish · on Oct 13, 2018

Searchability is the killer feature for reading. That's why no matter how much I try, I can't go back to paper books and newspapers except for outdoor fashion purposes.

crunchiebones · on Oct 13, 2018

Isn't it already possible to search PDFs? I can start a search in zathura by typing '/'.

imglorp · on Oct 13, 2018

Only if the document is already a text PDF, or if it has a text annotation layer. There are tools, like the one in the OP, that can discover text and add a layer to an image-only doc.

forapurpose · on Oct 13, 2018

As far as I know, almost every PDF application, including Acrobat, has had this capability for many years and it's widely used for the same purposes. In addition to searching, it also allows you to copy text.

Perhaps using Google's service is faster or more accurate, but it would be necessary to have a comparison.