On 23/4/25 10:37, Max Nikulin wrote:
It would be great if a data extractor warned users when the text from a
document (either real text or an embedded OCR layer in a scan) does not
match the text recognized from the rendered document. Besides routine
sanity checks, a document author might intentionally play tricks with
fonts, aiming to confuse indexers or humans who copy text into their
notes.
I happened to notice
find_tables(clip=None, strategy=None, vertical_strategy=None,
horizontal_strategy=None, vertical_lines=None, horizontal_lines=None,
snap_tolerance=None, snap_x_tolerance=None, snap_y_tolerance=None,
join_tolerance=None, join_x_tolerance=None, join_y_tolerance=None,
edge_min_length=3, min_words_vertical=3, min_words_horizontal=1,
intersection_tolerance=None, intersection_x_tolerance=None,
intersection_y_tolerance=None, text_tolerance=None,
text_x_tolerance=None, text_y_tolerance=None, add_lines=None)
<https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables>
I have not tried it, since I am not currently interested in table
extraction and the version packaged for bookworm does not have this
feature. I was surprised by the mix of functions that manipulate simple
PDF objects with one that is quite sensitive to heuristics and
implementation details.
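
For what it's worth, the call itself looks simple once a new enough
PyMuPDF is available; a rough sketch based on the documented API (file
name made up, untested here):

import fitz  # PyMuPDF

doc = fitz.open("statement.pdf")   # hypothetical input file
page = doc[0]

# Run the table detector with its default heuristics.
tabs = page.find_tables()
print(f"{len(tabs.tables)} table(s) found on page 1")

for tab in tabs.tables:
    # extract() returns the cell text as a list of rows.
    for row in tab.extract():
        print(row)

How well it works presumably depends entirely on those heuristics.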
One underlying problem with tabular data in PDF is the order in which
the text is encoded.
Consider PDF as basically PostScript on steroids, but underneath it is
the same model of "go to a location and display some text".
Fancy generators, such as the ones behind some of my bank statements,
do things like writing the frame and titles first, then some types of
data, then other types. It is not a linear process. Simple PDF-to-text
programs miss all this, and you get a real mess of text output that is
impossible to parse.
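
You can see the effect by dumping the words in the order they were
written into the content stream versus the order a human reads them; a
small sketch with PyMuPDF (the file name is made up):

import fitz  # PyMuPDF

page = fitz.open("statement.pdf")[0]   # hypothetical bank statement

# Words in the order the generator wrote them into the page.
words = page.get_text("words", sort=False)
print(" ".join(w[4] for w in words[:40]))

# The same words sorted top-to-bottom, left-to-right: closer to what a
# human reads, but the column structure of a table is still lost.
words_sorted = page.get_text("words", sort=True)
print(" ".join(w[4] for w in words_sorted[:40]))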
My first solution was to render the page fully and then OCR the output.
It mostly worked, but then you have to parse the OCR output, and that
can be difficult unless you know what you are dealing with.
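
Roughly, that pipeline looks like this (a sketch assuming PyMuPDF and
pytesseract; the file name and zoom factor are arbitrary):

import io
import fitz                      # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("statement.pdf")             # hypothetical input
for page in doc:
    # Render at roughly 3x resolution so small print survives OCR.
    pix = page.get_pixmap(matrix=fitz.Matrix(3, 3))
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    text = pytesseract.image_to_string(img)
    print(text)                  # still needs parsing afterwards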
My current solution is to convert the PDF to an image and pass that to
an LLM. I use online LLMs, as they are fast and more capable than the
ones I can run under llama on my PC. At some stage the local versions
will catch up as hardware improves. Sadly (for me at least), the best
affordable platform for running LLMs is the more advanced Macs with
96 GB+ of RAM/VRAM.
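
The plumbing for that is short. A rough sketch assuming the OpenAI
Python SDK and a vision-capable model (the model name, prompt and file
name are placeholders; any online provider that accepts image input
works the same way):

import base64
import fitz                      # PyMuPDF
from openai import OpenAI

page = fitz.open("statement.pdf")[0]                 # hypothetical input
png = page.get_pixmap(matrix=fitz.Matrix(2, 2)).tobytes("png")
data_url = "data:image/png;base64," + base64.b64encode(png).decode()

client = OpenAI()                # expects OPENAI_API_KEY in the environment
resp = client.chat.completions.create(
    model="gpt-4o",              # placeholder; any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the transaction table from this statement as CSV."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
)
print(resp.choices[0].message.content)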