Re: How to extract TABULAR data from a PDF document?

Richard Owlett Wed, 16 Apr 2025 05:32:16 -0700

On 4/15/25 12:56 PM, David Christensen wrote:

On 4/15/25 07:19, Richard Owlett wrote:
I don't know how to approach the problem.
What I would like to end up with is a CSV formatted file containing the two left columns of Table A4.14 (pages 106 & 107) of [ https://fns-prod.azureedge.us/sites/default/files/resource-files/TFP2021.pdf ].
Suggestions?

TIA
I normally open the document in Atril Document Viewer, select the content I want, copy the selection to the clipboard, open LibreOffice Calc (opens with a new spreadsheet), and paste. The crux is whatever file structure the author's software used to generate the PDF vs. Atril's ability to parse it vs. my ability to use the "Text Import" dialog.
In this case, selecting content in Atril from the table title through the last value in the last row and in "Text Import" checking the options "Separator Options" -> "Space" and "Trim spaces", it appears the PDF content is placed into the spreadsheet. But the formatting is a mess and will require a lot of manual correction. Experimenting with different options in "Text Import" may help. Using a different PDF viewer and/or using a different spreadsheet may help. YMMV.


I'll try the pdftotext route first.
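
For reference, a minimal sketch of what the pdftotext route might look like when scripted in Python. It assumes the poppler-utils `pdftotext` command is installed, that the PDF has been saved locally as TFP2021.pdf, and that "pages 106 & 107" are the PDF's own page indices rather than printed page numbers (not verified). It also assumes the columns come out separated by runs of two or more spaces in `-layout` mode, which may not hold for this table. The function names are made up for illustration.

```python
import csv
import re
import subprocess
import sys


def pdf_pages_to_text(pdf_path, first_page, last_page):
    """Run pdftotext in layout mode and return the extracted text."""
    result = subprocess.run(
        ["pdftotext", "-f", str(first_page), "-l", str(last_page),
         "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def first_columns(layout_text, n_columns=2):
    """Split each line on runs of 2+ spaces and keep the leftmost cells.

    Lines with fewer than n_columns cells (titles, blank lines) are skipped.
    """
    rows = []
    for line in layout_text.splitlines():
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) >= n_columns and cells[0]:
            rows.append(cells[:n_columns])
    return rows


if __name__ == "__main__":
    # Assumed local file name and page range; adjust to taste.
    text = pdf_pages_to_text("TFP2021.pdf", 106, 107)
    csv.writer(sys.stdout).writerows(first_columns(text))
```

Redirecting the output to a file gives a CSV that can be opened in LibreOffice Calc. Header rows and wrapped cells will likely still need manual cleanup.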

In this case, the table is small enough that the fastest route for myself on the above platform would be to transcribe it into a new spreadsheet by hand.

As my immediate need is only for the one table, I've been considering that. But several other tables are of possible interest. Besides, what else is retirement for than learning to use new tools ;}

If you need to convert many tables or to convert repeatedly, and there is encoding consistency across your input documents, then I suggest looking for PDF parsing libraries for your favorite programming/scripting language and coding a solution.


Any favorite tutorials?



Alternatively, ask the author for the table in CSV format.


Chuckle. This is a USDA publication.
Thanks.



David
