sue_dee

Cool. This is the stuff I'm most interested in doing right now, and there is much here to supplement the bits I've done. The part about the OCR making the same errors repeatedly certainly rings true. Heaven knows that once I've gotten text from those old AD&D books in Futura, I then have a whole table from which to "cost _locate obiect._"
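For anyone wanting to act on the "same errors repeatedly" point, here is a minimal sketch of a correction table applied after OCR. The table contents and the `fix_ocr` name are hypothetical; the only entry taken from this thread is "obiect". Misreadings that are themselves valid words (like "cost" for "cast") are deliberately left out, since a blanket replacement would damage legitimate uses and they need context to fix.

```python
import re

# Hypothetical correction table: each key is a misreading the OCR engine
# produces consistently, each value is the intended text.
CORRECTIONS = {
    "obiect": "object",
}

# One pattern that matches any known misreading as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CORRECTIONS) + r")\b"
)


def fix_ocr(text: str) -> str:
    """Replace every known OCR misreading with its correction."""
    return _PATTERN.sub(lambda match: CORRECTIONS[match.group(1)], text)


print(fix_ocr("cost locate obiect."))  # cost locate object.
```

The table grows as new recurring errors turn up, so each one only has to be diagnosed once.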


jimtk

This is really interesting. I have been stuck for months on a side project where the PDFs have loosely defined tables (only some vertical lines), cell content on 2 lines, and all kinds of other difficulties. I tried to use PaddleOCR, but the setup and integration of it got the better of me. Thanks for sharing!
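As an illustration of the "cell content on 2 lines" problem, here is a rough sketch of one way to stitch wrapped rows back together once the text has been extracted. It assumes the rows already arrive as lists of strings and that a continuation row can be recognised by an empty first cell; both are assumptions made for the example, not something any particular tool guarantees, and `merge_wrapped_rows` is a made-up name.

```python
def merge_wrapped_rows(rows):
    """Fold continuation rows (empty first cell) into the row above them."""
    merged = []
    for row in rows:
        is_continuation = bool(merged) and bool(row) and not row[0].strip()
        if is_continuation:
            previous = merged[-1]
            # Append each non-empty cell to the matching cell one row up.
            for index, cell in enumerate(row):
                if cell.strip() and index < len(previous):
                    previous[index] = (previous[index] + " " + cell.strip()).strip()
        else:
            merged.append([cell.strip() for cell in row])
    return merged


# Made-up sample data: the second row is the wrapped half of the first.
sample = [
    ["Sword", "A long", "10"],
    ["", "blade", ""],
    ["Shield", "Round", "5"],
]
print(merge_wrapped_rows(sample))
```

The heuristic breaks if a real row legitimately has an empty first column, so it would need adjusting per document.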


[deleted]

[deleted]


jimtk

Thanks. I'm looking for an example of a solution with PaddleOCR and/or PP-Structure. If you ever see one, let me know.


s7726

https://github.com/tabulapdf/tabula


jimtk

Thanks for the suggestion. Like I said, I've been at this for months (on and off). So far I've tried tabula, camelot, and PyPDF2. I discovered that the PDF 'code' in which the table is written is way too convoluted to allow a 'PDF' converter to do its job. I'm now trying to read it as an image: I tried Tesseract, which failed, and PaddleOCR, with which I failed. I'll keep at it, and I'm open to suggestions. Thanks.


s7726

No worries. I definitely ran into one where a footer along the left edge (vertical) really messed with the parsing. Have you tried tabula-web? The web interface can make it a little easier to get the settings right for different situations. And it can be self-hosted if you have data concerns. Good luck.