This really does not resonate at all, and I have the scars to prove it. I used t...

ankrgyl · on March 17, 2022

I'll just throw my hat in the ring and mention that at Impira, we are one of those startups wholly dedicated to (4). We happen to use Google's OCR engine (1) under the hood (for raw OCR), and what you said resonates for sure: there's a lot of engineering work required to make it work performantly and generally (happy to chat about this with anyone who is interested).

Feel free to take Impira for a spin (https://www.impira.com) if you need to accurately extract data from PDF documents. Would love feedback from anyone who tries it out. [Disclaimer: I am the CEO/Founder of Impira].

jfk13 · on March 17, 2022

I agree many of these things are a pain. This often reflects a workflow that is approaching things from entirely the wrong direction. ("If I wanted to go there, I wouldn't start from here.")

E.g. instead of trying to OCR a PDF, go back to the source document or database or whatever from which the PDF was generated. (Yes, I know that's not always an option. But it should be the first avenue to explore. We should push back against people who send around PDFs as though they were an all-purpose interchange format for textual or structured data.)

I'm a bit puzzled by (3), though:

> Office to PDF ... it's not easy ... when people see their PDF looks very different than what they saw on Word, they get upset

To get a PDF that looks the same as the Word document, just tell them to use the Print to PDF driver from right there within Word.

ankrgyl · on March 17, 2022

I think you recognize this already, but to add a bit of color, in highly regulated industries (e.g. financial services) and B2B settings with lots of peers (e.g. supply chain), "going back to the source document or database or whatever" requires an insane amount of consensus (which is not currently incentivized).

To add to that, a lot of PDFs (e.g. financial reports) are generated procedurally with ancient code that would have to be rewritten to generate a different format. The underlying database format is often many layers of abstraction removed from the final output.

pipeline_peak · on March 17, 2022

> Office to PDF is an extremely standard need

Is it really an extremely standard need, or just something that appears in the bs corners of our jobs a few times a year?

danielrhodes · on March 17, 2022

Yes, if you're working with documents a lot, it is. Word docs are not portable, and people don't like them because they can be changed easily, not to mention that not everybody has Word. You also can't display them inline in a browser.

yyyk · on March 17, 2022

>HTML to PDF is also a pain: you have to set up an instance that is running headless Chromium, which can be quite slow...

There are at least 6 non-Chromium alternatives that I can think of at a moment's notice, and also the LGPL-licensed wkhtmltopdf.
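
For illustration, a minimal sketch of driving wkhtmltopdf from Python. It assumes the `wkhtmltopdf` binary is installed and on PATH; the function names and the timeout value are hypothetical, not part of any library.

```python
import shutil
import subprocess
from pathlib import Path


def wkhtmltopdf_command(source: str, output_pdf: str) -> list[str]:
    """Build the argv for one wkhtmltopdf run.

    `source` may be a URL or a path to a local HTML file.
    """
    return ["wkhtmltopdf", "--quiet", source, output_pdf]


def html_to_pdf(source: str, output_pdf: str, timeout_s: int = 60) -> Path:
    """Render `source` to `output_pdf`; raise if the tool is missing or fails."""
    if shutil.which("wkhtmltopdf") is None:
        raise FileNotFoundError("wkhtmltopdf is not on PATH (assumed to be installed)")
    # check=True turns a non-zero exit status into CalledProcessError.
    subprocess.run(wkhtmltopdf_command(source, output_pdf), check=True, timeout=timeout_s)
    return Path(output_pdf)
```

Something like `html_to_pdf("invoice.html", "invoice.pdf")` would then be one call per document. Whether the rendering fidelity is good enough for a given page is a separate question.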

>Office to PDF.... You have to hack together a headless OpenOffice to have it work at all, but it doesn't do a great job... Microsoft does not offer a service to do this, unfortunately.

Microsoft sorta does offer a service to do this. SharePoint has a Word-to-PDF action, and with some stitching you can make it into an API. There are also several commercial solutions (e.g. Spire.NET) for this, and there are ways to mangle the OpenXML into HTML (of course losing some fidelity in the process).
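
For reference, the headless-office route from the quoted comment can be sketched roughly as below. This uses LibreOffice's `soffice` binary rather than OpenOffice and assumes it is on PATH; the function names are hypothetical, and the output will not necessarily match what Word shows.

```python
import subprocess
from pathlib import Path


def soffice_command(doc: Path, out_dir: Path) -> list[str]:
    """Build the argv for a headless LibreOffice conversion to PDF."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(doc),
    ]


def office_to_pdf(doc: Path, out_dir: Path, timeout_s: int = 120) -> Path:
    """Convert `doc` to PDF in `out_dir` and return the path of the result."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(soffice_command(doc, out_dir), check=True, timeout=timeout_s)
    # The PDF is written as <out_dir>/<input stem>.pdf. As a precaution, check
    # that it exists instead of relying on the exit status alone.
    pdf = out_dir / (doc.stem + ".pdf")
    if not pdf.exists():
        raise RuntimeError(f"conversion produced no output for {doc}")
    return pdf
```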

amluto · on March 19, 2022

All of the above may be correct, but nothing here advocates for a web service instead of licensed software. If I want to solve a linear program, I can use an open source library or I can pay for a commercial offering, but that commercial offering will run on my hardware (or cloud instance) and will operate independently of the network. If I want to edit a Word document, I can pay Microsoft for a local copy of Word.

jcuenod · on March 17, 2022

I'm a very happy user of OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF/
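
A minimal sketch of how it can be scripted, assuming the `ocrmypdf` command is installed and on PATH. The wrapper names are hypothetical; `--language` and `--skip-text` are options of the OCRmyPDF CLI.

```python
import subprocess
from pathlib import Path


def ocrmypdf_command(src: Path, dst: Path, language: str = "eng") -> list[str]:
    """Build the argv for adding a searchable text layer to a scanned PDF."""
    return [
        "ocrmypdf",
        "--language", language,  # Tesseract language code
        "--skip-text",           # leave pages that already contain text alone
        str(src),
        str(dst),
    ]


def make_searchable(src: Path, dst: Path, language: str = "eng") -> Path:
    """Run OCRmyPDF on `src`, writing the searchable PDF to `dst`."""
    # check=True turns a non-zero exit status into CalledProcessError.
    subprocess.run(ocrmypdf_command(src, dst, language), check=True)
    return dst
```

The output keeps the page image and adds a text layer on top, so the text can be searched and highlighted.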

eastendguy · on March 17, 2022

> 1) OCR'ing of a PDF is difficult. The only good service is Google

OCRspace is OK, too, and easier to use. You can just send the PDF. It is free for PDFs with 3 or fewer pages.

> 2) People want searchable OCR'd PDFs where you can highlight the text, even when it's a bitmap underneath.

OCRspace can also create searchable PDFs: https://ocr.space/searchablepdf