Re: Alternative to Debian Repository - extract CSV formatted data from PDF

Hans Thu, 20 Feb 2025 09:52:17 -0800

On Thursday, 20 February 2025, 15:08:27 CET, Richard Owlett wrote:
> I wish to extract CSV formatted data from a PDF document. [1]
> Page ES-7 has a weekly grocery list for males grouped by age.
> I need only the first and last columns.
> 
> Can someone point me in a suitable direction?
> 
> TIA
> 
> [1] https://www.fns.usda.gov/cnpp/thrifty-food-plan-2006
>      Table ES-1. Thrifty Food Plan market baskets, quantities of food
>       purchased for a week, by age-gender group, 2006


Without knowing the content of your PDF file: maybe you can convert the PDF
to a text file, for example by using "pdftotext".

pdftotext [options] <PDF-file> [<text-file>]
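
For a table it may help to keep the column layout and to convert only the
page that is needed. A minimal sketch, assuming the pdftotext from
poppler-utils (options -layout, -f and -l). The function name, the file
names and the page number are placeholders; which PDF page holds ES-7 is
not known here.

```shell
# Convert a page range of a PDF to text while keeping the column layout.
# Usage: pdf_pages_to_text FIRST LAST input.pdf output.txt
# (hypothetical helper name, not part of pdftotext itself)
pdf_pages_to_text() {
    # -layout keeps the physical layout of the page,
    # -f and -l select the first and last page to convert
    pdftotext -layout -f "$1" -l "$2" "$3" "$4"
}

# Example call (placeholder file names and page number):
# pdf_pages_to_text 7 7 report.pdf my_file.txt
```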

Then you could read every line of this text file, keep only the lines that
contain a distinctive word (e.g. "sold"), and write the lines you need to a
new file. For example:

grep "sold" ~/my_file.txt > my_new_file.txt

Now that you have this file, you can use cut to keep only the words you need
(see the manual of cut).

The syntax is something like:

cut --delimiter=' ' --fields=3,5,7 my_new_file.txt > my_target_file.txt

This reads the source file line by line and prints only the 3rd, 5th and 7th
word of each line.

See the manual of cut for the options you need.
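
One thing to watch with cut: it treats every single space as a delimiter, so
runs of spaces (common in text converted from a PDF) produce empty fields and
the word positions shift. A small sketch with made-up sample lines, showing
awk as a possible alternative that splits on any run of whitespace. The
sample data and file names are invented for illustration only.

```shell
# Made-up sample lines with irregular spacing
printf '%s\n' 'a  b c   d e f g' 'h i  j k l   m n' > sample.txt

# cut counts each single space as a separator, so empty fields appear
cut -d ' ' -f 3,5,7 sample.txt

# awk splits on runs of whitespace and picks the 3rd, 5th and 7th word
awk '{ print $3, $5, $7 }' sample.txt > words.txt
cat words.txt
```

With this sample, awk prints "c e g" and "j l n", while the cut output
contains empty fields.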

Finally, you can edit my_target_file.txt with any editor and replace each
space between the words with a separator character. A space is a character
like any other and can be replaced like any other letter.

Then you would have a csv file!
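
The replacement does not have to be done by hand. A minimal sketch using tr,
assuming the words are separated by single spaces and no value contains a
space or a comma itself. The file content below is made up, and the output
file name my_target_file.csv is just a suggestion.

```shell
# Made-up content standing in for my_target_file.txt
printf '%s\n' 'apples 3 kg' 'pears 5 kg' > my_target_file.txt

# Replace every space with a comma to get comma-separated values
tr ' ' ',' < my_target_file.txt > my_target_file.csv

cat my_target_file.csv
```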

If you are familiar with these commands, you can write a shell script that
does it all at once in the future. That is also useful for very, very big
files.
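
Such a script might look roughly like the sketch below. It assumes the text
file already exists, that the wanted lines share a keyword, and that the
first and the last column are each a single word without spaces, which may
not hold for the real table. The function name extract_csv and the demo
data are hypothetical.

```shell
# Keep lines containing a keyword and write their first and last word as CSV.
# Usage: extract_csv KEYWORD input.txt output.csv
extract_csv() {
    # grep keeps the matching lines; awk prints the first and the last
    # whitespace-separated word, joined by a comma (OFS)
    grep -- "$1" "$2" | awk -v OFS=',' '{ print $1, $NF }' > "$3"
}

# Made-up demo input: two matching lines and one line to be dropped
printf '%s\n' 'apples sold 3 4' 'header line' 'pears sold 5 6' > demo.txt

extract_csv sold demo.txt demo.csv
cat demo.csv
```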

Please note: the above might not be the correct syntax! My goal was rather to
show which direction you could go in, and it may not be useful for your
particular PDF file.


If my suggestion is useful, maybe someone more experienced than me can tell
you the correct commands.

Please take a look at the manuals of pdftotext, cat and cut. Hope this helps.

Best regards

Hans
