Hacker News | nhirschfeld's comments

You'll need to use a different OCR engine. Look at EasyOCR.


Yes, there have already been several suggestions here for other backends, etc.

You should try using a different PSM (page segmentation mode) to see if you get better results.

If it's scientific texts specifically, look at GROBID.
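
For the PSM suggestion above: the mode is chosen with Tesseract's `--psm` flag. A minimal sketch, assuming the `tesseract` binary is on PATH; the helper names `build_tesseract_command` and `ocr_with_psm` are made up for illustration and are not part of any library:

```python
import shutil
import subprocess


def build_tesseract_command(image_path: str, psm: int, lang: str = "eng") -> list[str]:
    """Build a tesseract CLI call that prints recognized text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError("tesseract PSM values range from 0 to 13")
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]


def ocr_with_psm(image_path: str, psm: int) -> str:
    """Run tesseract with the given PSM; requires the tesseract binary."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The default is PSM 3 (fully automatic page segmentation). PSM 6 (a single uniform block of text) and PSM 11 (sparse text) are common ones to try when the default struggles.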


That's why Kreuzberg also exposes a sync API for you to consume.


I'm actually considering another library with an optional API, called `Kreuzköln` - probably without the umlaut!


Retrieval-Augmented Generation. It's a class of techniques in which relevant documents are retrieved and supplied to an LLM as context for generating content. I'd recommend Googling this.


Was going to reply indignantly that it's hard to google rag and get that answer when I read your comment. Then I did, and it was the first result.

Apologies!


I understood the comment as "Google <the long version I provided> to get more info"


Thanks for asking!

It's both. The OCR part is ofc CPU bound, but the entire text extraction involves reading files, or writing and then reading files.

Without async, these simply block.

As for efficiency - if you're working in an async application context you have to "asyncify" these operations or suffer the consequences.
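
To illustrate what "asyncify" means here, a minimal sketch using only the standard library (Python 3.9+ for `asyncio.to_thread`). This is a generic pattern, not Kreuzberg's actual implementation, and the function names are made up for the example:

```python
import asyncio
import tempfile
from pathlib import Path


def blocking_roundtrip(data: bytes) -> bytes:
    """Blocking work: write bytes to a temp file, then read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "input.bin"
        path.write_bytes(data)
        return path.read_bytes()


async def roundtrip_async(data: bytes) -> bytes:
    """Run the blocking function in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(blocking_roundtrip, data)


async def main() -> list[bytes]:
    # Both round trips run concurrently; gather returns results in call order.
    return await asyncio.gather(
        roundtrip_async(b"first"),
        roundtrip_async(b"second"),
    )
```

Calling `asyncio.run(main())` runs both round trips without blocking the event loop. Threads don't speed up pure-Python CPU work, but file I/O and waiting on an external process both release the GIL, so this pattern fits those cases.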


in that case, what’s the deal with extract_bytes being async? i’m not incredibly familiar with python, but i’d expect a “byte string” to be in memory.


You still need to write it to file to process it via pandoc/tesseract etc.

There are alternative options to tesseract ofc.
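
A rough sketch of that write-then-process step, assuming the external tool takes a file path as its last argument. `run_tool_on_bytes` is a made-up helper, and the command below is a Python one-liner standing in for pandoc or tesseract so the example runs anywhere:

```python
import os
import subprocess
import sys
import tempfile


def run_tool_on_bytes(data: bytes, command: list[str], suffix: str = "") -> str:
    """Write data to a temp file, run `command <path>`, and return its stdout."""
    # delete=False so another process can open the file (matters on Windows).
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        result = subprocess.run(
            [*command, path], capture_output=True, text=True, check=True
        )
        return result.stdout
    finally:
        # Always clean up the temp file, even if the tool fails.
        os.unlink(path)


# Stand-in for an external tool: prints the size of the file it is given.
stand_in = [sys.executable, "-c", "import os, sys; print(os.path.getsize(sys.argv[1]))"]
```

With a real tool, the command list would be swapped for that tool's own invocation, and the suffix usually matters because many tools infer the input format from the file extension.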


> You still need to write it to file to process it via pandoc/tesseract etc.

This sounds... I guess Pythonic? Sheesh.


Yup, EasyOCR is good.

My reasons for using Tesseract: EasyOCR is larger, and it has a significant cold start.

It benchmarks better for many OCR tasks though, so I'm thinking of adding it as an alternative backend.


Where did you find benchmarks for OCR tools? There have been so many OCR engines coming out lately, I would love to see benchmarks!


I googled this for a while...


Any experience with Paddle OCR? https://github.com/PaddlePaddle/PaddleOCR

Personally I've used Tesseract before but the results were underwhelming, so I'm curious how Paddle OCR performs in comparison.


I haven't; testing it out is on my to-do list for sure.


interesting!


lol ;).

But seriously, in 13 years living here, only one guy tried to pickpocket me.


I've lived in 36 for 15 years or so. Wasn't as lucky as you :)


Sorry to hear...


Thanks, I'll check these links.

In my tests I found Tesseract quite good for regular text documents. For other kinds of texts it's not great.

As for using models - there are some good small language models as well, and of course LLMs.

I sorta feel, though, that if one needs complex OCR or a vision model for layout, one should opt for either a commercial solution that abstracts away the deployment and GPU management, or bake one's own system.

For most use cases involving text documents, though, my subjective opinion is that Tesseract is sufficient.

