On Sun 23 Feb 2025 at 22:13:55 (+0700), Max Nikulin wrote:
> On 22/02/2025 05:02, David Wright wrote:
> >
> > With mupdf, I don't even
> > know how to copy, as the mouse just drags the page around.
>
> I have not tried it, but...
> https://manpages.debian.org/bookworm/mupdf/mupdf.1.en.html#Right~2

I'm not sure how I missed that. But pasting the region gives a
single column, which then has to be reassembled. That's not
difficult, but it does mean finding the starts of the total
lines as they're unmarked.

> > On Fri 21 Feb 2025 at 09:53:46 (+0700), Max Nikulin wrote:
> > > When text file has properly aligned columns, instead of
> > > "quoting" some spaces, it may be better to add TAB characters at
> > > certain positions on each line. Perhaps LibreOffice Calc even has GUI
> > > to select column widths during importing of text files.
> >
> > Yes, gnumeric has that too. But I would hate to have a lot of
> > mousework if I were repeating this frequently. And for a
> > postprandial one-off, I just took a no-tools approach
> > (barring an editor, of course).
>
> Maybe I have missed something, but you trick with "=" is not
> necessary. For tab-separated values
>
> sed -e 's/^ \{10\}/.&/' -e 's/^ \+//' -e 's/ \+/\t/g' /tmp/es-7.txt
>
> is not perfect, but should be acceptable.

It was insurance, lest I needed to use comma delimiters.
Also, other people may have different tools, by choice
or availability.

> I am sure there should be ready to use tools that extract tables from
> PDF and from aligned text. Out of curiosity I tried to create a small
> python script to process text you attached earlier. It does not try to
> join text for multiline cells. Input file requires a couple of
> corrections to avoid overlapped text and a stray column. Heuristics
> may be improved.

I tend to scrape with bash scripts, using temporary intermediate
files between each step. When the page format changes, as it
inevitably does, the intermediates act as a script trace, making
it easier to adapt.

Cheers,
David.
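
For anyone without sed to hand, the three substitutions in the quoted
one-liner can be approximated in Python. This is only a rough sketch that
reads the command literally as quoted: the function name aligned_to_tsv and
the sample lines are made up for illustration, and the idea that the leading
dot stands in for an empty first column is my guess at the intent, not
something stated in the thread.

```python
import re


def aligned_to_tsv(line: str) -> str:
    """Mimic the three sed expressions quoted above on a single line."""
    # s/^ \{10\}/.&/ : a line starting with 10 spaces gets a "." put in
    # front of them (assumed here to mark an empty first column).
    line = re.sub(r"^ {10}", lambda m: "." + m.group(0), line)
    # s/^ \+// : strip any remaining leading spaces.
    line = re.sub(r"^ +", "", line)
    # s/ \+/\t/g : every run of one or more spaces becomes one TAB.
    line = re.sub(r" +", "\t", line)
    return line


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the attached file.
    for sample in ["          12   34", "   name   56   78"]:
        print(repr(aligned_to_tsv(sample)))
```

Read this way, a single space inside a cell also turns into a TAB, which
may be part of why the command is described as "not perfect".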