Re: How to extract TABULAR data from a PDF document?

Max Nikulin Wed, 23 Apr 2025 20:49:45 -0700

On 23/04/2025 09:56, jeremy ardley wrote:

On 23/4/25 10:37, Max Nikulin wrote:
Accidentally I have noticed
[...]
<https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables>

I have not tried it
[...]
One underlying problem with tabular data in pdf is the order the text is encoded.


Again have not tried it, just spotted the title after reading your reply:
<https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-text-in-natural-reading-order>
"How to Extract Text in Natural Reading Order"

It matches my expectation that an application or a library may sort text fragments based on coordinates. Of course, watermarks, "stamps" (that are not annotations) and similar stuff are a problem.
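A minimal sketch of sorting by coordinates, not taken from PyMuPDF or any other library. The names Fragment and reading_order, the top-left origin with y growing downward, and the line_tolerance value are all assumptions made for illustration. Fragments whose vertical positions are close together are grouped into one line, then each line is ordered left to right.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical text fragment with the position of its top-left corner."""
    x: float  # left edge
    y: float  # top edge, assumed to grow downward
    text: str


def reading_order(fragments, line_tolerance=3.0):
    """Group fragments into visual lines by y, then sort each line by x.

    line_tolerance is an assumed threshold: a fragment joins the current
    line when its y differs from the first fragment of that line by no
    more than this amount.
    """
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(frag.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Fragments deliberately listed out of visual order, as a "fancy"
# generator might emit them (headers first, then data).
fragments = [
    Fragment(200, 50.5, "Amount"),
    Fragment(10, 100, "01/04"),
    Fragment(10, 50, "Date"),
    Fragment(200, 101, "12.50"),
]
ordered = reading_order(fragments)
```

Such a sort knows nothing about watermarks or stamps, so a fragment overlapping a table row would still be merged into that row.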

Fancy generators such as some of my bank statements do things like write the framework and titles first, then some types of data and then some other types. It's not a linear process. Simple pdf to text programs miss all this and you get a real mess of text output that's impossible to parse.

I am curious if selection in xpdf (the one packaged in Debian that is xpopple) gives "visually" ordered text for those files. The same question applies to "pdftotext -layout".

My first solution was to render fully and then OCR the output. It worked mostly but then you have to parse the OCR output and that can be difficult unless you know what you are dealing with.

Some years ago I tried to convert several images taken by a phone camera using tesseract. With PDF as the intermediate format and "pdftotext -layout" the result was acceptable for me. I had only a few tables, so block operations in a text editor were enough to add cell separators.
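For more than a few tables, adding cell separators could be scripted. The following is only a sketch under an assumption: columns in the layout-preserving text are separated by runs of two or more spaces, while single spaces stay inside a cell. The function name add_cell_separators and the sample text are hypothetical, and an empty cell in the middle of a row would shift the later cells to the left.

```python
import re


def add_cell_separators(layout_text, sep="|"):
    """Turn layout-preserving text into separator-delimited rows.

    Assumption: two or more consecutive whitespace characters mark a
    column boundary. Blank lines are skipped.
    """
    rows = []
    for line in layout_text.splitlines():
        if not line.strip():
            continue
        cells = re.split(r"\s{2,}", line.strip())
        rows.append(sep.join(cells))
    return rows


# Made-up sample resembling column-aligned text output.
sample = (
    "Date        Description      Amount\n"
    "01/04       Coffee shop        3.50\n"
    "\n"
    "02/04       Book store        12.00\n"
)
rows = add_cell_separators(sample)
```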

My current solution is to convert PDF to image and pass that to LLMs.

I am not trying to dispute that it is useful, especially as a cross-check. I just trust text data in the file a bit more than an OCR result.

By the way, PDF files may be tagged for screen readers. Is there a dedicated structure to explicitly mark tables? It would be the best source for data extraction.


Is the following beyond "Simple pdf to text programs"?
<https://docs.kde.org/stable5/en/okular/okular/menutools.html>

Tools → Area Selection (Ctrl+3)
    The mouse will work as a rectangular region selection tool. In that
mode clicking left mouse button and dragging will draw a selection box
and provide the option of copying the selected content to the clipboard,
speaking the selected text, or transforming the selection region into an
image and saving it to a file.

Tools → Text Selection (Ctrl+4)
    The mouse will work as a text selection tool. In that mode clicking
left mouse button and dragging will give the option of selecting the
text of the document. Then, just click with the right mouse button to
copy to the clipboard or speak the current selection.

Tools → Table Selection (Ctrl+5)
    Draw a rectangle around the text for the table, then click with the
left mouse button to divide the text block into rows and columns. A left
mouse button click on an existing line removes it and merges the
adjacent rows or columns. Finally, just click with the right mouse
button to copy the table to the clipboard.


Perhaps this is the story related to the feature:
<https://nightcrawlerinshadow.wordpress.com/2011/08/20/advanced-text-selection-in-okular/>
