Re: Alternative to Debian Repository - extract CSV formatted data from PDF

Max Nikulin Sun, 23 Feb 2025 07:14:42 -0800

On 22/02/2025 05:02, David Wright wrote:


With mupdf, I don't even
know how to copy, as the mouse just drags the page around.


I have not tried it, but...
https://manpages.debian.org/bookworm/mupdf/mupdf.1.en.html#Right~2

On Fri 21 Feb 2025 at 09:53:46 (+0700), Max Nikulin wrote:

When text file has properly aligned columns, instead of
"quoting" some spaces, it may be better to add TAB characters at
certain positions on each line. Perhaps LibreOffice Calc even has GUI
to select column widths during importing of text files.


Yes, gnumeric has that too. But I would hate to have a lot of
mousework if I were repeating this frequently. And for a
postprandial one-off, I just took a no-tools approach
(barring an editor, of course).

Maybe I have missed something, but your trick with "=" is not necessary. For tab-separated values


sed -e 's/^ \{10\}/.&/' -e 's/^ \+//' -e 's/  \+/\t/g' /tmp/es-7.txt

is not perfect, but should be acceptable.
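
For those who prefer to read it in Python, the same three substitutions may be sketched roughly as below. The function name sed_like is made up for illustration, and I assume GNU sed semantics for \+ and \t in the command above.

```python
import re


def sed_like(line):
    """Rough Python rendering of the three sed expressions (illustrative)."""
    # 1. Ten leading spaces mean an empty first cell: mark it with "."
    #    so the next step does not swallow the indentation.
    line = re.sub(r"^ {10}", lambda m: "." + m.group(0), line)
    # 2. Drop any remaining leading spaces.
    line = re.sub(r"^ +", "", line)
    # 3. Treat every run of two or more spaces as a column separator.
    line = re.sub(r"  +", "\t", line)
    return line


if __name__ == "__main__":
    for sample in ("          foo   bar", "   a b   c"):
        print(repr(sed_like(sample)))
```

Single spaces inside a cell survive, which is why only runs of two or more spaces are treated as separators.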

I am sure there should be ready-to-use tools that extract tables from PDF and from aligned text. Out of curiosity I tried to create a small python script to process the text you attached earlier. It does not try to join text for multiline cells. The input file requires a couple of corrections to avoid overlapped text and a stray column. Heuristics may be improved.

#!/usr/bin/env python3
import csv
import itertools
import sys

debug = False
debug_file = sys.stderr


def build_histogram(lines, debug=False):
    """Count, for each character column, the lines with a non-space there."""
    histogram = ()
    for row in lines:
        row = row.rstrip()
        pairs = itertools.zip_longest(
            histogram, map(lambda c: int(c != " "), row), fillvalue=0)
        histogram = tuple(itertools.starmap(lambda x, y: x + y, pairs))
        if debug:
            print(row, file=debug_file)
            print(
                "".join(map(lambda x: " " if x == 0 else "-", histogram)),
                file=debug_file)
    return histogram


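def print_histogram(histogram, file=debug_file):
    """Print the per-column counts vertically, one digit row per line."""
    if not histogram:  # empty input: max() below would raise ValueError
        return
    length = len(str(max(histogram)))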
    numbers = zip(*tuple(map(  # transpose
        lambda c: str(c if c > 0 else " ").rjust(length), histogram)))
    print("\n".join(map(lambda x: "".join(x), numbers)), file=file)


def cell_ranges(histogram):
    """Yield (start, end) for each run of columns used by at least one line."""
    space = True
    for i, count in enumerate(histogram):
        current = count == 0
        if space == current:
            continue
        space = current
        if not space:
            start = i
        else:
            yield (start, i)
    if not space:
        yield (start, i + 1)


def cells_from_line(line, ranges):
    length = len(line)
    for begin, end in ranges:
        if begin >= length:
            break
        yield line[begin:end].strip()


def to_csv(lines):
    h = build_histogram(lines, debug)
    if debug:
        print_histogram(h)
    cells = tuple(cell_ranges(h))
    writer = csv.writer(sys.stdout)
    writer.writerows(map(lambda r: cells_from_line(r, cells), lines))


if __name__ == '__main__':
    args = sys.argv[1:] or ("-",)
    for filename in args:
        if filename == "-":
            lines = sys.stdin.readlines()
        else:
            with open(filename) as f:
                lines = f.readlines()
        to_csv(lines)
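
To show the idea on a tiny input, here is a condensed sketch of the same heuristic: columns that are blank in every line separate the cells. The sample rows and the helper names occupancy, ranges and split are made up for this illustration and are not part of the script above.

```python
def occupancy(lines):
    """Per character column, count the lines having a non-space there."""
    stripped = [line.rstrip() for line in lines]
    counts = [0] * max((len(s) for s in stripped), default=0)
    for s in stripped:
        for i, ch in enumerate(s):
            if ch != " ":
                counts[i] += 1
    return counts


def ranges(counts):
    """Yield (start, end) for each run of columns with a non-zero count."""
    start = None
    for i, n in enumerate(counts):
        if n and start is None:
            start = i
        elif not n and start is not None:
            yield (start, i)
            start = None
    if start is not None:
        yield (start, len(counts))


def split(line, spans):
    """Cut one line at the detected column ranges."""
    return [line[a:b].strip() for a, b in spans]


# Made-up aligned table: left-aligned names, right-aligned numbers.
rows = [("Name", "Qty", "Price"), ("apple", "3", "0.50"), ("kiwi", "12", "1.25")]
sample = ["{:<10}{:>3}{:>7}".format(*r) for r in rows]

spans = list(ranges(occupancy(sample)))
table = [split(line, spans) for line in sample]

if __name__ == "__main__":
    print(spans)
    for cells in table:
        print(cells)
```

A single character straying into a gap merges two columns or creates a spurious one, which is why overlapped text in the input has to be corrected by hand first.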
