Re: Alternative to Debian Repository - extract CSV formatted data from PDF

David Wright Tue, 25 Feb 2025 16:43:58 -0800

On Sun 23 Feb 2025 at 22:13:55 (+0700), Max Nikulin wrote:
> On 22/02/2025 05:02, David Wright wrote:
> > 
> > With mupdf, I don't even
> > know how to copy, as the mouse just drags the page around.
> 
> I have not tried it, but...
> https://manpages.debian.org/bookworm/mupdf/mupdf.1.en.html#Right~2

I'm not sure how I missed that. But pasting the region gives a single
column, which then has to be reassembled. That's not difficult, but
it does mean finding the starts of the total lines as they're unmarked.

> > On Fri 21 Feb 2025 at 09:53:46 (+0700), Max Nikulin wrote:
> > > When text file has properly aligned columns, instead of
> > > "quoting" some spaces, it may be better to add TAB characters at
> > > certain positions on each line. Perhaps LibreOffice Calc even has GUI
> > > to select column widths during importing of text files.
> > 
> > Yes, gnumeric has that too. But I would hate to have a lot of
> > mousework if I were repeating this frequently. And for a
> > postprandial one-off, I just took a no-tools approach
> > (barring an editor, of course).
> 
> Maybe I have missed something, but your trick with "=" is not
> necessary. For tab-separated values
> 
> sed -e 's/^ \{10\}/.&/' -e 's/^ \+//' -e 's/  \+/\t/g' /tmp/es-7.txt
> 
> is not perfect, but should be acceptable.

It was insurance, lest I needed to use comma delimiters. Also,
other people may have different tools, by choice or availability.
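
For anyone following along, here is a small sketch of what the quoted
sed command does. The two sample lines are made up (they are not from
es-7.txt), and it assumes GNU sed, as shipped in Debian, for \+ and \t.
A line starting with ten or more spaces gets a "." as a placeholder for
its empty first column, other leading spaces are stripped, and each run
of two or more spaces becomes a tab.

```shell
# Hypothetical sample: one row with a label, one continuation row
# indented by 12 spaces (i.e. an empty first column).
sample=$(printf '  Rent      Jan    100.00\n%12sFeb    100.00\n' '')

# The sed expressions are the ones quoted above (GNU sed assumed).
out=$(printf '%s\n' "$sample" |
    sed -e 's/^ \{10\}/.&/' -e 's/^ \+//' -e 's/  \+/\t/g')

printf '%s\n' "$out"
```

The first sample row comes out as three tab-separated fields, and the
continuation row comes out with "." in the first field.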

> I am sure there should be ready to use tools that extract tables from
> PDF and from aligned text. Out of curiosity I tried to create a small
> python script to process text you attached earlier. It does not try to
> join text for multiline cells. Input file requires a couple of
> corrections to avoid overlapped text and a stray column. Heuristics
> may be improved.

I tend to scrape with bash scripts, using temporary intermediate files
between each step. When the page format changes, as it inevitably does,
the intermediates act as a script trace, making it easier to adapt.
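
As a rough illustration of that habit (a sketch only: the file names,
the steps and the sample text are all invented for this example, not
taken from my real scripts), each step reads the previous file and
writes the next, so every stage can be inspected afterwards:

```shell
set -eu
work=$(mktemp -d)            # the intermediates are left here on purpose
tab=$(printf '\t')

# Step 0: stand-in for text extracted from the PDF (made-up data).
printf '%s\n' \
    '   Item          Qty    Price' \
    '   Widget          2     3.50' \
    '   Gadget         10    12.00' > "$work/0-raw.txt"

# Step 1: strip leading blanks.
sed -e 's/^ *//' "$work/0-raw.txt" > "$work/1-trimmed.txt"

# Step 2: turn each run of two or more spaces into a single tab.
sed -e "s/   */$tab/g" "$work/1-trimmed.txt" > "$work/2-tabbed.txt"

# Step 3: drop the header line.
sed -e '1d' "$work/2-tabbed.txt" > "$work/3-body.tsv"

cat "$work/3-body.tsv"
```

When the layout changes, comparing the numbered files shows which step
stopped producing what the next one expects.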

Cheers,
David.
