On 23/4/25 10:37, Max Nikulin wrote:
It would be great if a data extractor warned users when the text from a
document (either real text or an embedded OCR layer in a scan) does not
match the text recognized from the rendered document. Besides routine
sanity checks, a document author might intentionally play tricks with
fonts, aiming to confuse indexers or humans who copy text into their
notes.
I happened to notice
find_tables(clip=None, strategy=None, vertical_strategy=None,
horizontal_strategy=None, vertical_lines=None, horizontal_lines=None,
snap_tolerance=None, snap_x_tolerance=None, snap_y_tolerance=None,
join_tolerance=None, join_x_tolerance=None, join_y_tolerance=None,
edge_min_length=3, min_words_vertical=3, min_words_horizontal=1,
intersection_tolerance=None, intersection_x_tolerance=None,
intersection_y_tolerance=None, text_tolerance=None,
text_x_tolerance=None, text_y_tolerance=None, add_lines=None)
<https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables>
I have not tried it, since I am not currently interested in table
extraction and the version packaged for bookworm does not have this
feature. I was surprised by the mix of functions that manipulate simple
PDF objects with one that is quite sensitive to heuristics and
implementation details.
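
For what it's worth, the call itself looks simple once a new enough
PyMuPDF is available; a rough sketch based on the documented API (file
name made up, untested here):

import fitz  # PyMuPDF

doc = fitz.open("statement.pdf")   # hypothetical input file
page = doc[0]

# Run the table detector with its default heuristics.
tabs = page.find_tables()
print(f"{len(tabs.tables)} table(s) found on page 1")

for tab in tabs.tables:
    # extract() returns the cell text as a list of rows.
    for row in tab.extract():
        print(row)

How well it works presumably depends entirely on those heuristics.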
One underlying problem with tabular data in PDF is the order in which
the text is encoded.
Consider PDF as basically PostScript on steroids, but underneath it is
the same model of "go to a location and display some text".
Fancy generators, such as the ones behind some of my bank statements,
do things like writing the frame and titles first, then some types of
data, then other types. It is not a linear process. Simple PDF-to-text
programs miss all this, and you get a real mess of text output that is
impossible to parse.
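
You can see the effect by dumping the words in the order they were
written into the content stream versus the order a human reads them; a
small sketch with PyMuPDF (the file name is made up):

import fitz  # PyMuPDF

page = fitz.open("statement.pdf")[0]   # hypothetical bank statement

# Words in the order the generator wrote them into the page.
words = page.get_text("words", sort=False)
print(" ".join(w[4] for w in words[:40]))

# The same words sorted top-to-bottom, left-to-right: closer to what a
# human reads, but the column structure of a table is still lost.
words_sorted = page.get_text("words", sort=True)
print(" ".join(w[4] for w in words_sorted[:40]))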
My first solution was to render the page fully and then OCR the output.
It mostly worked, but then you have to parse the OCR output, and that
can be difficult unless you know what you are dealing with.
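
Roughly, that pipeline looks like this (a sketch assuming PyMuPDF and
pytesseract; the file name and zoom factor are arbitrary):

import io
import fitz                      # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("statement.pdf")             # hypothetical input
for page in doc:
    # Render at roughly 3x resolution so small print survives OCR.
    pix = page.get_pixmap(matrix=fitz.Matrix(3, 3))
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    text = pytesseract.image_to_string(img)
    print(text)                  # still needs parsing afterwards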
My current solution is to convert the PDF to an image and pass that to
an LLM. I use online LLMs, as they are fast and more capable than the
ones I can run under llama on my PC. At some stage the local versions
will catch up as hardware improves. Sadly (for me at least), the best
affordable platform for running LLMs is the more advanced Macs with
96 GB+ of RAM/VRAM.
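
The plumbing for that is short. A rough sketch assuming the OpenAI
Python SDK and a vision-capable model (the model name, prompt and file
name are placeholders; any online provider that accepts image input
works the same way):

import base64
import fitz                      # PyMuPDF
from openai import OpenAI

page = fitz.open("statement.pdf")[0]                 # hypothetical input
png = page.get_pixmap(matrix=fitz.Matrix(2, 2)).tobytes("png")
data_url = "data:image/png;base64," + base64.b64encode(png).decode()

client = OpenAI()                # expects OPENAI_API_KEY in the environment
resp = client.chat.completions.create(
    model="gpt-4o",              # placeholder; any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the transaction table from this statement as CSV."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
)
print(resp.choices[0].message.content)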