On 23/04/2025 09:56, jeremy ardley wrote:
On 23/4/25 10:37, Max Nikulin wrote:

Accidentally I have noticed
[...]
<https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables>

I have not tried it
[...]

One underlying problem with tabular data in pdf is the order the text is encoded.

Again, I have not tried it, I just spotted the title after reading your reply:
<https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-text-in-natural-reading-order>
"How to Extract Text in Natural Reading Order"

It matches my expectation that an application or a library may sort text fragments based on coordinates. Of course, watermarks, "stamps" (that are not annotations) and similar stuff are a problem.
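To illustrate what sorting by coordinates could look like, here is a minimal sketch in plain Python. It is not PyMuPDF's implementation. The function name `reading_order`, the `line_tol` tolerance and the sample coordinates are made up for illustration; the tuples only mimic the first five fields (x0, y0, x1, y1, text) of what `page.get_text("words")` returns in PyMuPDF.

```python
from typing import List, Tuple

Word = Tuple[float, float, float, float, str]  # x0, y0, x1, y1, text


def reading_order(words: List[Word], line_tol: float = 3.0) -> List[str]:
    """Group words into lines by their top edge, then order each line left to right."""
    lines: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        # Same line if the top edge is within line_tol of the line's first word.
        if lines and abs(w[1] - lines[-1][0][1]) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [" ".join(w[4] for w in sorted(line, key=lambda w: w[0]))
            for line in lines]


# Hypothetical fragments in "generator" order: titles first, then data,
# with slightly different baselines inside one visual row.
words = [
    (50, 100, 80, 110, "Date"),
    (200, 100, 250, 110, "Amount"),
    (200, 120.5, 240, 130.5, "12.50"),
    (50, 120, 100, 130, "2025-04-01"),
]
print("\n".join(reading_order(words)))
```

The tolerance is the fragile part: multi-column pages, rotated text or overlapping stamps would need more than this.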

Fancy generators such as some of my bank statements do things like write the framework and titles first, then some types of data and then some other types. It's not a linear process. Simple pdf to text programs miss all this and you get a real mess of text output that's impossible to parse.

I am curious whether selection in xpdf (the one packaged in Debian, which is xpopple) gives "visually" ordered text for those files. The same question applies to "pdftotext -layout".

My first solution was to render fully and then OCR the output. It worked mostly but then you have to parse the OCR output and that can be difficult unless you know what you are dealing with.

Some years ago I used tesseract to convert several images taken with a phone camera. With PDF as the intermediate format and "pdftotext -layout" the result was acceptable for me. I had only a few tables, so block operations in a text editor were enough to add cell separators.
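For more than a few tables, the manual step could perhaps be scripted. A rough sketch, assuming the layout-preserving output separates columns by at least two spaces and has no empty cells (both are assumptions, and the sample text is invented):

```python
import re
from typing import List


def split_columns(line: str) -> List[str]:
    """Split one line of layout-preserving text on runs of two or more spaces."""
    return re.split(r" {2,}", line.strip())


# Invented sample resembling column-aligned text output.
layout_text = """\
Item            Qty    Price
Blue widget       2     3.50
Nut              10     0.25
"""

rows = [split_columns(line) for line in layout_text.splitlines() if line.strip()]
for row in rows:
    print(" | ".join(row))
```

A single space inside a cell ("Blue widget") survives, but an empty cell would shift the remaining columns left, so the result still needs a look by eye.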

My current solution is to convert the PDF to an image and pass that to LLMs.

I am not trying to dispute that it is useful, especially as a cross-check. I just trust text data in the file a bit more than an OCR result.

By the way, PDF files may be tagged for screen readers. Is there a dedicated structure to explicitly mark tables? It would be the best source for data extraction.

Is the following beyond "Simple pdf to text programs"?
<https://docs.kde.org/stable5/en/okular/okular/menutools.html>
Tools → Area Selection (Ctrl+3)
    The mouse will work as a rectangular region selection tool. In that
mode clicking left mouse button and dragging will draw a selection box
and provide the option of copying the selected content to the clipboard,
speaking the selected text, or transforming the selection region into an
image and saving it to a file.

Tools → Text Selection (Ctrl+4)
    The mouse will work as a text selection tool. In that mode clicking
left mouse button and dragging will give the option of selecting the
text of the document. Then, just click with the right mouse button to
copy to the clipboard or speak the current selection.

Tools → Table Selection (Ctrl+5)
    Draw a rectangle around the text for the table, then click with the
left mouse button to divide the text block into rows and columns. A left
mouse button click on an existing line removes it and merges the
adjacent rows or columns. Finally, just click with the right mouse
button to copy the table to the clipboard.
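The idea behind such a tool can be sketched as follows. This is not Okular's code, only an illustration under my own assumptions: text fragments are reduced to (x, y, text) points, and the user-drawn dividing lines are given as sorted lists of x and y coordinates. The names `to_grid`, `col_seps` and `row_seps` are hypothetical.

```python
from bisect import bisect_right


def to_grid(words, col_seps, row_seps):
    """Place (x, y, text) fragments into cells delimited by separator coordinates."""
    n_rows, n_cols = len(row_seps) + 1, len(col_seps) + 1
    grid = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    # Visit fragments top to bottom, left to right, so cell text stays ordered.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        row = bisect_right(row_seps, y)   # number of horizontal lines above y
        col = bisect_right(col_seps, x)   # number of vertical lines left of x
        grid[row][col].append(text)
    return [[" ".join(cell) for cell in row] for row in grid]


# Invented fragments: one vertical divider at x=150, one horizontal at y=115.
words = [(210, 125, "12.50"), (60, 105, "Date"), (75, 125, "April"),
         (210, 105, "Amount"), (60, 125, "1")]
table = to_grid(words, col_seps=[150], row_seps=[115])
for row in table:
    print("\t".join(row))
```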

Perhaps this is the story related to the feature:
<https://nightcrawlerinshadow.wordpress.com/2011/08/20/advanced-text-selection-in-okular/>
