
Acrobat has had this as a built in feature forever.


Didn't know that either. Makes me wonder if my browser (PDF.js) is doing the OCR. Does anybody know?


He means that Acrobat Pro includes an OCR system that you can use to add a searchable text layer to scanned documents. Readers like Acrobat Reader and PDF.js do not perform OCR. You won't be able to use them to search scanned documents if the document creator did not run OCR.

Google runs its own OCR pass on scanned PDF documents in order to index them better. It can be annoying when you get a 50-page scanned document as a search result and then find out that it doesn't include a text layer, so you need to run your own OCR or skim the whole thing to find the relevant parts.


What PDF.js is showing is an invisible text layer overlaid on top of the original image. It does not do OCR, which can take up to 1-2 seconds per page; that would be too slow and would require a large-ish neural net if you care about accuracy.
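For anyone curious what that invisible layer looks like inside the file, here is a rough sketch in Python. It assumes you already have a page's content stream as bytes (pulling it out of the PDF is left to whatever PDF library you use), and it relies on the PDF convention that the operator `3 Tr` selects text rendering mode 3, which neither fills nor strokes the glyphs. The function names are made up for illustration, and the regexes are a heuristic rather than a real content-stream parser, so odd files can fool them.

```python
import re
import zlib

# Text-showing operators Tj / TJ, preceded by the end of a string or array operand.
_TEXT_SHOW = re.compile(rb"[)\]>]\s*T[jJ]\b")
# "3 Tr" sets text rendering mode 3: glyphs are neither filled nor stroked (invisible).
_INVISIBLE = re.compile(rb"(?<![\d.])3\s+Tr\b")


def inflate(stream: bytes) -> bytes:
    """Undo FlateDecode (zlib) compression, which content streams commonly use."""
    return zlib.decompress(stream)


def describe_text_layer(content_stream: bytes) -> str:
    """Roughly classify a decompressed page content stream (heuristic only)."""
    if not _TEXT_SHOW.search(content_stream):
        # Image-only page: a reader has nothing to search or select.
        return "no text"
    if _INVISIBLE.search(content_stream):
        # Typical shape of OCR output laid over the scanned image.
        return "invisible text layer"
    return "visible text"


# Hand-written sample streams (hypothetical, not taken from a real document).
scanned = b"q 612 0 0 792 0 0 cm /Im0 Do Q"
ocred = scanned + b" BT 3 Tr /F1 10 Tf 72 700 Td (Hello) Tj ET"
typeset = b"BT /F1 10 Tf 72 700 Td (Hello) Tj ET"

for name, stream in [("scanned", scanned), ("ocred", ocred), ("typeset", typeset)]:
    print(name, "->", describe_text_layer(stream))
```

The first sample is what a plain scan looks like to a reader: just an image being drawn. The second is the same image with text drawn on top in mode 3, which is the part a reader can search and select.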


No, the OCR has already happened at the time of the scanning (or shortly thereafter), and the result is embedded into the PDF document.


True, but like the parent, I only realized that very recently.


Guy's login is "the-dude," he's been around for a while.



