On 22/02/2025 05:02, David Wright wrote:
> With mupdf, I don't even
> know how to copy, as the mouse just drags the page around.

I have not tried it, but...
https://manpages.debian.org/bookworm/mupdf/mupdf.1.en.html#Right~2

> On Fri 21 Feb 2025 at 09:53:46 (+0700), Max Nikulin wrote:
>> When text file has properly aligned columns, instead of
>> "quoting" some spaces, it may be better to add TAB characters at
>> certain positions on each line. Perhaps LibreOffice Calc even has GUI
>> to select column widths during importing of text files.
>
> Yes, gnumeric has that too. But I would hate to have a lot of
> mousework if I were repeating this frequently. And for a
> postprandial one-off, I just took a no-tools approach
> (barring an editor, of course).

Maybe I have missed something, but your trick with "=" is not necessary.
For tab-separated values

    sed -e 's/^ \{10\}/.&/' -e 's/^ \+//' -e 's/ \+/\t/g' /tmp/es-7.txt

is not perfect, but should be acceptable.
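
For readers who prefer to see the three substitutions spelled out, here is a rough Python sketch of what that sed command does to a single line. The function name `spaces_to_tabs` and the `indent` parameter are made up for illustration; the value 10 is taken from the `\{10\}` in the command above.

```python
import re


def spaces_to_tabs(line, indent=10):
    """Mimic the three sed expressions on one line (illustrative helper)."""
    line = line.rstrip("\n")
    # 's/^ \{10\}/.&/': a line starting with `indent` spaces has an empty
    # first column, so put a "." placeholder in front of it.
    if line.startswith(" " * indent):
        line = "." + line
    # 's/^ \+//': drop any remaining leading spaces.
    line = re.sub(r"^ +", "", line)
    # 's/ \+/\t/g': every run of spaces becomes a single TAB.
    return re.sub(r" +", "\t", line)


if __name__ == "__main__":
    print(repr(spaces_to_tabs(" " * 10 + "b   c")))
    print(repr(spaces_to_tabs("  a  b")))
```

One reason it is "not perfect": the last substitution treats every run of spaces as a column break, so a cell that itself contains a space is split into two fields.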
I am sure there should be ready-to-use tools that extract tables from
PDF and from aligned text. Out of curiosity I tried to create a small
Python script to process the text you attached earlier. It does not try to
join text for multiline cells. The input file requires a couple of
corrections to avoid overlapped text and a stray column. Heuristics may
be improved.

#!/usr/bin/env python3
import csv
import itertools
import sys

debug = False
debug_file = sys.stderr


def build_histogram(lines, debug=False):
    # For each character position, count the lines that have
    # a non-space character there.
    histogram = ()
    for row in lines:
        row = row.rstrip()
        pairs = itertools.zip_longest(
            histogram, map(lambda c: int(c != " "), row), fillvalue=0)
        histogram = tuple(itertools.starmap(lambda x, y: x + y, pairs))
        if debug:
            print(row, file=debug_file)
            print(
                "".join(map(lambda x: " " if x == 0 else "-", histogram)),
                file=debug_file)
    return histogram


def print_histogram(histogram, file=debug_file):
    # Print the counts vertically, one digit position per output line.
    length = len(str(max(histogram)))
    numbers = zip(*tuple(map(  # transpose
        lambda c: str(c if c > 0 else " ").rjust(length), histogram)))
    print("\n".join(map(lambda x: "".join(x), numbers)), file=file)


def cell_ranges(histogram):
    # Yield (start, end) for every run of positions with a non-zero count.
    space = True
    for i, count in enumerate(histogram):
        current = count == 0
        if space == current:
            continue
        space = current
        if not space:
            start = i
        else:
            yield (start, i)
    if not space:
        yield (start, i + 1)


def cells_from_line(line, ranges):
    # Cut one line at the column ranges; stop where the line ends.
    length = len(line)
    for begin, end in ranges:
        if begin >= length:
            break
        yield line[begin:end].strip()


def to_csv(lines):
    h = build_histogram(lines, debug)
    if debug:
        print_histogram(h)
    cells = tuple(cell_ranges(h))
    writer = csv.writer(sys.stdout)
    writer.writerows(map(lambda r: cells_from_line(r, cells), lines))


if __name__ == '__main__':
    args = sys.argv[1:] or ("-",)
    for filename in args:
        if filename == "-":
            lines = sys.stdin.readlines()
        else:
            with open(filename) as f:
                lines = f.readlines()
        to_csv(lines)
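
To show the idea on something small, here is a condensed, self-contained sketch of the same heuristic: a character position that is blank in every line is treated as a column gap. It is a simplified restatement and not the script above; the names `column_ranges` and `split_row` and the sample table are invented for this example.

```python
def column_ranges(lines):
    """Return (start, end) ranges of positions used by at least one line."""
    width = max((len(line.rstrip()) for line in lines), default=0)
    occupied = [False] * width
    for line in lines:
        for i, ch in enumerate(line.rstrip()):
            if ch != " ":
                occupied[i] = True
    ranges, start = [], None
    for i, used in enumerate(occupied):
        if used and start is None:
            start = i                      # a column begins
        elif not used and start is not None:
            ranges.append((start, i))      # a column ends at a blank position
            start = None
    if start is not None:
        ranges.append((start, width))
    return ranges


def split_row(line, ranges):
    """Cut one line at the detected ranges and trim each cell."""
    return [line[begin:end].strip() for begin, end in ranges]


# Invented sample; format() keeps the columns aligned without counting spaces.
table = [("name", "qty", "price"), ("banana", "3", "0.50"),
         ("green tea", "12", "4.25")]
sample = ["{:<9}  {:>3}  {:>5}".format(*row) for row in table]

ranges = column_ranges(sample)
rows = [split_row(line, ranges) for line in sample]

if __name__ == "__main__":
    print(ranges)
    for row in rows:
        print(row)
```

In this sample "green tea" stays in one cell only because "banana" happens to occupy the position of the space. If no other row covered that position, the heuristic would see a gap and create an extra column, which is the kind of case where the input needs a manual correction first.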