If these were medical claims documents, won't they contain sensitive data about individuals? I would have tried a local Tesseract approach thoroughly before sending them off to a cloud service.
It was the right answer for you. It may not have been the right answer given a different set of requirements (including scale, service availability, ops complexity, etc.).
Abbyy is, in my opinion, by far the most powerful OCR software. A Linux command-line version is available, but unfortunately it is too pricey for the normal private user:
https://www.ocr4linux.com/en:pricing:start
A simple button for scanning a document, the software does all the processing for you (rotation, OCR,..). You can apply tags to documents, search them, export them. The storage format is more than open (a png file + OCR results as an HTML page,...)
I would have liked a comparison between Google, Amazon and tesseract. There are projects for doing this at home and I've been meaning to do this for some time but haven't got around to it yet.
I would like to know more about their comparison results between Google vs. Amazon vs. Tesseract. How did they determine 98% accuracy? If I assume that was Google Vision's accuracy, what was the accuracy of the others?
98% sounds good but that remaining 2% is crucial. I do OCR on a lot of my scanned documents to make them searchable and easy to copy text. I've tried a number of OCR programs but even after a lot of training something like Omnipage will still think a lower case l is a 1 in the middle of a word. It seems like it should be easy to put in a rule that says it is unlikely that a document will have a number surrounded by letters (without spaces) but that doesn't seem possible.
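For what it's worth, a rule like that is easy to bolt on as a post-processing step over the OCR output, even if the OCR program itself doesn't offer it. A minimal sketch, assuming plain-text output; the confusion table and the function name are my own, not something from Omnipage or any other tool:

```python
import re

# Hypothetical confusion table: digits an OCR engine may emit in place of letters.
DIGIT_TO_LETTER = {"1": "l", "0": "o"}

# A single 1 or 0 with a letter directly on each side (no spaces).
_INSIDE_WORD = re.compile(r"(?<=[A-Za-z])[10](?=[A-Za-z])")


def fix_digits_inside_words(text: str) -> str:
    """Replace a lone 1/0 that is sandwiched between letters with l/o."""
    return _INSIDE_WORD.sub(lambda m: DIGIT_TO_LETTER[m.group(0)], text)


print(fix_digits_inside_words("he1lo wor1d, garden h0se, 1.5L in 2017"))
```

Digits at the start of a token or inside a number are left alone, so "1.5L" and "2017" survive. The obvious downside is that legitimate codes like "A1B" would get mangled too, which may be why tools don't apply such a rule blindly.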
Depends on your use case. If you need a definitive answer of how often in 2017 you bought “Coca Cola Zero 1.5L” based on scanned receipts then 98% accuracy will not be enough. If you need that receipt from the one time you bought a garden hose, searching for “garden” and “hose” will probably get you there, even if the OCR tool read “garden h0se” or “g4rden hose”.
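If the search tool only does exact matching, you can get some of that tolerance back with a fuzzy comparison on the query side. A rough sketch using Python's standard difflib; the function name and the 0.7 threshold are my own assumptions, and a real index (Elasticsearch or similar) would use its own fuzzy query support instead:

```python
from difflib import SequenceMatcher


def fuzzy_contains(term: str, text: str, threshold: float = 0.7) -> bool:
    """Return True if any whitespace-separated word in text is similar to term."""
    term = term.lower()
    for word in text.lower().split():
        # ratio() is 1.0 for identical strings and drops as they diverge.
        if SequenceMatcher(None, term, word).ratio() >= threshold:
            return True
    return False


print(fuzzy_contains("hose", "garden h0se"))    # one misread character still matches
print(fuzzy_contains("garden", "g4rden hose"))
print(fuzzy_contains("hose", "garden rake"))
```

A lower threshold catches more OCR errors but also more false positives, especially on short words, so it would need tuning against real documents.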
They OCR'd PDF documents. Interesting (and I wish I had time to do it to some of the PDFs I've been reading), but not earthshaking. Particularly with 98% accuracy; if 1 out of 50 characters is wrong, searches are going to be problematic.
If you find the term, yay; if you don't you can't really conclude it's not there.
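To put a rough number on it: assuming the 98% figure is per-character accuracy and that errors are independent (both are my assumptions, the article doesn't say how it was measured), the chance that a search term was recognized without a single error shrinks with its length:

```python
def intact_probability(length: int, char_accuracy: float = 0.98) -> float:
    """Chance that all `length` characters of a term were recognized correctly,
    assuming independent per-character errors (a simplification)."""
    return char_accuracy ** length


for n in (5, 10, 20):
    print(f"{n:>2} characters: {intact_probability(n):.1%} chance of an exact match")
```

Under those assumptions a 5-character term comes through intact about 90% of the time, a 10-character term about 82%, and a 20-character phrase only about 67%. If the 98% is actually per-word accuracy the picture is better, so this is an upper bound on the pessimism rather than a measurement.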
$3200 seems pretty expensive. In the startup I used to work at, I was tasked with building something similar. Except the documents I had to index were almost 1000 pages each - they were building plans and diagrams. We did all of our processing within an AWS Lambda pipeline and Elasticsearch. Elasticsearch was the only real cost.
For 98% accuracy that's pretty cheap for OCRing a large number of pages. It wasn't that long ago that you had to use something like Abbyy - now that was expensive.
Wow, I didn't know. But I'd highly recommend using a Lambda function with a local OCR library to see if it affects cost. No need for Google to be reading your docs. I believe I used PyOCR. Not sure how they got the 98% figure.
I use Qiqqa http://www.qiqqa.com for searching my library of PDFs. It does OCR on scanned documents. Supports shared and cloud PDF libraries and managing bibtex citations.
As long as there is a BAA between this person’s company and Google, this should be acceptable from a HIPAA standpoint, no? Or are people’s concerns more about the fact that Google is in the mix, without the patients being aware?
More pages? 0.00025/page seems okay, tbh. You'd already be at 10ct for a 400 page document, but consider what people spend on printing. I think the price can probably be justified for many.
Adobe Reader has a built-in advanced search function that will search every PDF in a folder. Not that this would replace the system that this author is writing about, but it works well for those times when you have to search through dozens of documents that tally to thousands of pages.
>Oscar Health Insurance is a technology-focused health insurance company founded in 2012 and headquartered in New York City. The company has plans to change the health insurance industry through telemedicine, healthcare focused technological interfaces, and transparent claims pricing systems.
Maybe they extended the transparency to people's health data.
Here is another article (still on Medium) that incidentally mentions the author (as an engineer employed at the company):
Of course it's legal. They have a BAA with Google.
Do people think companies that deal with PHI have to build every service in house...? Run their own data centers? Build their own backhaul connections to their users?
Obviously I don't have a copy of the contract between Google and Oscar, but I think it's profoundly unlikely that they would write a blog post about something that would be the end of their company without the "simple" (and industry standard) step of signing a BAA with a vendor.
"Google Cloud Platform was built under the guidance of a more than 700 person security engineering team, which is larger than most on-premises security teams."
With Abbyy and ocr.space Local there are good and (for companies) affordable local OCR solutions available. There is really no need to use online(!) OCR for sensitive data. Plus, local OCR is faster.
I am not sure they do. They are probably just using the Google Vision API as a regular customer. The pricing would change if Google had to do a legal deal with these folks.
That would immediately trigger a company ending lawsuit. There is zero chance they are using a service for handling PHI without a BAA in place with the vendor of that service.
Depending on the service provider, there may be an upfront cost for signing a BAA, but generally the service costs remain the same (assuming you stick to "BAA approved" services). In the case of Google, there is no additional cost:
"As such, we can offer HIPAA regulated customers the same products at the same pricing that is available to all customers, including sustained use discounts. Other public clouds charge more money for their HIPAA cloud, we do not."
I've never dealt with HIPAA; what restrictions does it put on internal use? Who, within Google, could they share raw data with? What about "anonymized" data?
Any reputable cloud service implements technical and process controls to control or eliminate access to customer data. Doing something like incorporating customer data into advertising would be a very serious situation, and frankly I doubt anyone would be that dumb at scale.
Google will contractually agree to a bunch of things relative to this and has third party audits to provide additional assurance.
Searchability is the killer feature for reading. That's why no matter how much I try I can't go back to paper books and newspapers except for outdoor fashion purposes.
Only if the document is already a text PDF, or if it has a text annotation layer. There are tools, like the OP's, that can detect text and add such a layer to an image-only doc.
As far as I know, almost every PDF application, including Acrobat, has had this capability for many years and it's widely used for the same purposes. In addition to searching, it also allows you to copy text.
Perhaps using Google's service is faster or more accurate, but it would be necessary to have a comparison.