Package: uni2ascii
Version: 4.4-1
Severity: minor

I would like to discuss the following Unicode characters:
¯ ’“”− ﬀ ﬁ ﬂ ﬃ ...
that is,
00AF 2019 201C 201D 2212 FB00 FB01 FB02 FB03 ...

You see, I noticed them when I used pdftotext on
http://www.cs.ucr.edu/~anirban/Anir-networking07.pdf
and then tried to read the results on my ASCII PDA.

I wish pdftotext had a flag to make the output ASCII.

Anyway, even uni2ascii -ydpxef wouldn't get all of them into ASCII.
The ligatures remained, but turned into 0x codes. (P.S., I wish
there were one flag for "give me the best ASCII you can", lest one
ponder the man page too long.) Also, there is apparently no way to tell
uni2ascii to leave alone what it can't deal with, instead of turning it
into 0x codes, so that it passes through for some other filter to
complete the job.

Now turning to pstotext, whose man page says "pstotext deals better
with punctuation and ligatures." Not in this case.

Now turning to Text::Unidecode: sorry, it mangled the ligatures too.

Anyways, I ended up having to write by hand:

#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
    s/¯/_/g; #just a guess
    s/’/'/g;
    s/“/"/g;
    s/”/"/g;
    s/−/-/g;
    s/ﬀ/ff/g;
    s/ﬁ/fi/g;
    s/ﬂ/fl/g;
    s/ﬃ/ffi/g;
    s/ﬄ/ffl/g;
    s/ﬅ/st/g; # U+FB05 is long s + t, so "st" rather than "ft"
    s/ﬆ/st/g;
    print;
}
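
For what it's worth, the same idea can probably be done more generally
with Unicode compatibility decomposition instead of one substitution
per ligature. The following is only a sketch, not something uni2ascii
does: it assumes UTF-8 input, uses the core Unicode::Normalize module,
and the %punct table and the to_ascii name are my own inventions.
Anything it doesn't know about passes through untouched, so another
filter can finish the job.

```perl
#!/usr/bin/perl
# Sketch: fold common typographic characters to ASCII.
# Assumes UTF-8 input; %punct and to_ascii() are made-up names.
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hand-picked replacements for characters that NFKD would not turn
# into anything useful.
my %punct = (
    "\x{00AF}" => '_',   # macron (just a guess, as above)
    "\x{2019}" => "'",   # right single quotation mark
    "\x{201C}" => '"',   # left double quotation mark
    "\x{201D}" => '"',   # right double quotation mark
    "\x{2212}" => '-',   # minus sign
);

sub to_ascii {
    my ($s) = @_;
    # Apply the explicit table first.
    $s =~ s/([\x{00AF}\x{2019}\x{201C}\x{201D}\x{2212}])/$punct{$1}/g;
    # Compatibility decomposition splits the ligatures U+FB00..U+FB06
    # into their plain letters.
    $s = NFKD($s);
    # Drop combining marks left over from decomposed accented letters.
    $s =~ s/\p{Mn}//g;
    # Everything else is left alone for a later filter.
    return $s;
}

# Run as a filter only when executed directly, not when loaded.
unless (caller) {
    binmode STDIN,  ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';
    print to_ascii($_) while <STDIN>;
}
```

Usage would be something like "pdftotext file.pdf - | perl toascii.pl",
where toascii.pl is whatever name the script is saved under.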

