On 22/04/2025 09:51, jeremy ardley wrote:
> Some LLM can also accept pdf for input but you'd need to snip out the
> pages you are interested in. I consider that slightly more risky as what
> you see rendered or printed and what some programs see internal to the
> pdf varies

It would be great if a data extractor warned users when the text stored
in a document (either real text or an embedded OCR layer for scans) does
not match the text recognized from the rendered document. Such a warning
would serve as a routine sanity check, and it would also catch cases
where a document author intentionally plays tricks with fonts to confuse
indexers or humans who copy text into their notes.
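
A rough sketch of such a check, in case it helps. It assumes PyMuPDF
imported as "fitz" with Tesseract data available for OCR; the function
names, the 0.8 threshold and the 150 dpi value are my own arbitrary
choices, not anything recommended by the library.

```python
import difflib
import re


def normalize(text):
    """Collapse whitespace and case so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def text_similarity(embedded, recognized):
    """Return a 0..1 ratio of how closely the two strings match."""
    matcher = difflib.SequenceMatcher(
        None, normalize(embedded), normalize(recognized)
    )
    return matcher.ratio()


def suspicious_pages(pdf_path, threshold=0.8, dpi=150):
    """Yield (page_number, ratio) where the text layer disagrees with OCR."""
    import fitz  # PyMuPDF; assumption: OCR support (Tesseract) is installed

    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Text as stored in the PDF (real text or embedded OCR layer).
            embedded = page.get_text()
            # Text recognized from the page rendered as an image.
            ocr_page = page.get_textpage_ocr(dpi=dpi, full=True)
            recognized = page.get_text(textpage=ocr_page)
            ratio = text_similarity(embedded, recognized)
            if ratio < threshold:
                yield page.number + 1, ratio
```

OCR errors alone lower the ratio, so the threshold would need tuning on
known-good documents before the warning is of any use.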

By chance I noticed the following in the PyMuPDF documentation:

find_tables(clip=None, strategy=None, vertical_strategy=None,
horizontal_strategy=None, vertical_lines=None, horizontal_lines=None,
snap_tolerance=None, snap_x_tolerance=None, snap_y_tolerance=None,
join_tolerance=None, join_x_tolerance=None, join_y_tolerance=None,
edge_min_length=3, min_words_vertical=3, min_words_horizontal=1,
intersection_tolerance=None, intersection_x_tolerance=None,
intersection_y_tolerance=None, text_tolerance=None,
text_x_tolerance=None, text_y_tolerance=None, add_lines=None)
<https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables>

I have not tried it, since I am currently not interested in table
extraction and the version packaged for bookworm does not have this
feature. I was surprised to see functions that manipulate simple PDF
objects mixed with one that is quite sensitive to heuristics and
implementation details.
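
An untested guess at how a call might look, going only by the linked
documentation. The helper name extract_tables is hypothetical; the
getattr check is there because older builds, such as the bookworm
package, lack the method.

```python
def extract_tables(page, **options):
    """Return each table as a list of rows, or None without find_tables."""
    finder = getattr(page, "find_tables", None)
    if finder is None:
        # Older PyMuPDF: the method does not exist on Page objects.
        return None
    # find_tables() returns a finder object whose .tables list holds
    # the detected tables; extract() gives the cell text row by row.
    return [table.extract() for table in finder(**options).tables]
```

Keyword arguments such as strategy or snap_tolerance are passed through
unchanged, so the heuristics can be adjusted per document.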