On 23/4/25 10:37, Max Nikulin wrote:
It would be great if a data extractor warned users when the text in a document (either real text or an embedded OCR layer for scans) does not match the text recognized from the rendered document. Besides routine sanity checks, a document author might intentionally play tricks with fonts to confuse indexers, or humans who copy text into their notes.
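
A rough sketch of such a check, using PyMuPDF plus pytesseract (the file name and the 0.9 similarity threshold are placeholders, not recommendations):

import difflib
import io
import pymupdf           # "import fitz" on older PyMuPDF releases
import pytesseract
from PIL import Image

def check_page(page, min_ratio=0.9):
    embedded = page.get_text("text")          # text as encoded in the PDF
    pix = page.get_pixmap(dpi=300)            # the page as it actually renders
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    ocr = pytesseract.image_to_string(img)    # text as a reader would see it
    ratio = difflib.SequenceMatcher(
        None, " ".join(embedded.split()), " ".join(ocr.split())).ratio()
    if ratio < min_ratio:
        print(f"page {page.number}: embedded text and OCR differ "
              f"(similarity {ratio:.2f})")

doc = pymupdf.open("input.pdf")
for page in doc:
    check_page(page)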

I accidentally noticed
find_tables(clip=None, strategy=None, vertical_strategy=None,
horizontal_strategy=None, vertical_lines=None, horizontal_lines=None,
snap_tolerance=None, snap_x_tolerance=None, snap_y_tolerance=None,
join_tolerance=None, join_x_tolerance=None, join_y_tolerance=None,
edge_min_length=3, min_words_vertical=3, min_words_horizontal=1,
intersection_tolerance=None, intersection_x_tolerance=None,
intersection_y_tolerance=None, text_tolerance=None,
text_x_tolerance=None, text_y_tolerance=None, add_lines=None)

<https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables>

I have not tried it, since I am not currently interested in table extraction and the version packaged for bookworm does not have this feature. I was surprised by the mixing of functions that manipulate simple PDF objects with one that is quite sensitive to heuristics and implementation details.
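
For anyone who does want to try it, an untested sketch against the documented API (the file name is made up; Page.find_tables needs PyMuPDF 1.23 or later):

import pymupdf

doc = pymupdf.open("statement.pdf")
page = doc[0]
tabs = page.find_tables()          # returns a TableFinder
for tab in tabs.tables:
    print(tab.row_count, "rows x", tab.col_count, "cols")
    for row in tab.extract():      # list of rows, each a list of cell strings
        print(row)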


One underlying problem with tabular data in PDF is the order in which the text is encoded.

Consider PDF as basically PostScript on steroids: underneath it is the same model of "go to a location and display some text".

Fancy generators, such as those behind some of my bank statements, do things like write the framework and titles first, then some types of data, then other types. It's not a linear process. Simple PDF-to-text programs miss all this and you get a real mess of text output that's impossible to parse.
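
One crude workaround is to ignore the content-stream order entirely and re-sort the positioned words geometrically. A sketch with PyMuPDF (rounding y to whole points to group lines is an assumption that breaks on skewed or multi-column layouts):

import pymupdf

doc = pymupdf.open("statement.pdf")
page = doc[0]
words = page.get_text("words")    # (x0, y0, x1, y1, word, block, line, word_no)
words.sort(key=lambda w: (round(w[1]), w[0]))   # top-to-bottom, left-to-right
print(" ".join(w[4] for w in words))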

My first solution was to render the page fully and then OCR the output. It mostly worked, but then you have to parse the OCR output, and that can be difficult unless you know what you are dealing with.
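
In outline, that pipeline is just the render and OCR steps from the check sketched earlier, without the comparison (same assumed tools; 300 dpi is a typical choice, not a requirement):

import io
import pymupdf
import pytesseract
from PIL import Image

doc = pymupdf.open("statement.pdf")
for page in doc:
    pix = page.get_pixmap(dpi=300)
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    # OCR sees the rendered layout, so text comes out in visual order
    print(pytesseract.image_to_string(img))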

My current solution is to convert the PDF to an image and pass that to LLMs. I use online LLMs as they are fast and more capable than the ones I can run under llama on my PC. At some stage the local versions will catch up as hardware improves. Sadly (for me at least) the best affordable platform for running LLMs is the more advanced Macs with 96 GB+ of RAM/VRAM.
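
One possible shape of that pipeline, using OpenAI's vision-capable chat API purely as an example (the model name and prompt are stand-ins for whatever online service is actually used):

import base64
import pymupdf
from openai import OpenAI

client = OpenAI()                    # reads OPENAI_API_KEY from the environment
doc = pymupdf.open("statement.pdf")
png = doc[0].get_pixmap(dpi=200).tobytes("png")
b64 = base64.b64encode(png).decode()

resp = client.chat.completions.create(
    model="gpt-4o",                  # assumed model; any vision model works
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the transaction table from this page as CSV."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)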
