Re: How to extract TABULAR data from a PDF document?

Max Nikulin Tue, 22 Apr 2025 19:38:29 -0700

On 22/04/2025 09:51, jeremy ardley wrote:

Some LLM can also accept pdf for input but you'd need to snip out the pages you are interested in. I consider that slightly more risky as what you see rendered or printed and what some programs see internal to the pdf varies

It would be great if a data extractor warned users when text from the document (either real text or an embedded OCR layer for scans) does not match text recognized from the rendered document. Besides routine sanity checks, a document author might intentionally play tricks with fonts, aiming to confuse indexers or humans who copy text to their notes.
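A minimal sketch of such a check, assuming the two texts have already been obtained elsewhere (one from the PDF text layer, one from OCR of the rendered page). The function names and the 0.9 threshold are my own illustrative choices, not part of any existing tool:

```python
import difflib
import re


def normalize(text):
    """Collapse whitespace and lowercase so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def text_similarity(embedded_text, recognized_text):
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    a = normalize(embedded_text)
    b = normalize(recognized_text)
    return difflib.SequenceMatcher(None, a, b).ratio()


def check_page(embedded_text, recognized_text, threshold=0.9):
    """Return a warning string if the two texts diverge, otherwise None.

    The threshold is an arbitrary assumption; OCR noise alone may
    require a lower value in practice.
    """
    ratio = text_similarity(embedded_text, recognized_text)
    if ratio < threshold:
        return (
            "warning: text layer and rendered text differ "
            f"(similarity {ratio:.2f})"
        )
    return None
```

A character-level ratio like this only catches gross mismatches, such as a remapped font or a text layer unrelated to the visible page. It says nothing about which of the two texts is the correct one.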


By chance, I noticed this method in PyMuPDF:

find_tables(clip=None, strategy=None, vertical_strategy=None,
horizontal_strategy=None, vertical_lines=None, horizontal_lines=None,
snap_tolerance=None, snap_x_tolerance=None, snap_y_tolerance=None,
join_tolerance=None, join_x_tolerance=None, join_y_tolerance=None,
edge_min_length=3, min_words_vertical=3, min_words_horizontal=1,
intersection_tolerance=None, intersection_x_tolerance=None,
intersection_y_tolerance=None, text_tolerance=None,
text_x_tolerance=None, text_y_tolerance=None, add_lines=None)


<https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables>

I have not tried it, since I am currently not interested in table extraction and the version packaged for bookworm does not have this feature. I was surprised to see functions for manipulating simple PDF objects mixed with one that is quite sensitive to heuristics and implementation details.
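For reference, a sketch of how the call could be used, going only by the linked documentation, where find_tables() returns a finder object with a "tables" list and each table has an extract() method. The helper names and the file name in the usage note are hypothetical, and a PyMuPDF release recent enough to provide find_tables() is assumed:

```python
def extract_tables(page, **find_tables_options):
    """Return every table on a page as a list of rows of cell strings.

    Keyword options (for example strategy="text") are passed through
    unchanged to Page.find_tables().
    """
    finder = page.find_tables(**find_tables_options)
    return [table.extract() for table in finder.tables]


def tables_from_pdf(path, page_number=0, **find_tables_options):
    """Open a PDF and extract the tables found on a single page."""
    try:
        import pymupdf  # module name used by recent releases
    except ImportError:
        import fitz as pymupdf  # older releases ship the module as "fitz"

    with pymupdf.open(path) as doc:
        return extract_tables(doc[page_number], **find_tables_options)
```

A call might look like tables_from_pdf("report.pdf", strategy="text"). Since the result depends on heuristics, the tolerance parameters in the signature above would likely need tuning per document.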
