Package: uni2ascii
Version: 4.4-1
Severity: minor

I would like to discuss today the Unicode characters ¯ ’“”− ﬀ ﬁ ﬂ ﬃ ...
that is 00AF 2019 201C 201D 2212 FB00 FB01 FB02 FB03 ...
You see, I noticed them when I used pdftotext on
http://www.cs.ucr.edu/~anirban/Anir-networking07.pdf
and then tried to read the results on my ASCII PDA. I wish pdftotext had a flag to make the output ASCII.

Anyway, even uni2ascii -ydpxef wouldn't get all of them into ASCII. The ligatures remained -- but turned into 0x codes. (P.S., I wish there were one flag to "give me best ASCII", lest one ponder the man page too long.)

Also, apparently there is no way to get uni2ascii to leave alone what it can't deal with, instead of turning it into 0x codes, and let it sail through for some other filter to complete the job.

Now turning to pstotext, whose man page says "pstotext deals better with punctuation and ligatures." Not in this case.

Now turning to Text::Unidecode: sorry, mangled ligatures.

Anyways, I ended up having to write by hand:

#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
    s/¯/_/g;    #just a guess
    s/’/'/g;
    s/“/"/g;
    s/”/"/g;
    s/−/-/g;
    s/ﬀ/ff/g;
    s/ﬁ/fi/g;
    s/ﬂ/fl/g;
    s/ﬃ/ffi/g;
    s/ﬄ/ffl/g;
    s/ﬅ/ft/g;
    s/ﬆ/st/g;
    print;
}
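
For what it's worth, here is a table-driven sketch of the same workaround, which also shows the pass-through behaviour I was wishing for: anything not in the table is left untouched for some later filter. It is only a sketch under assumptions. The names %ascii and to_ascii are made up for illustration, the input is assumed to be UTF-8 on standard input, and the mappings simply follow my hand-written script (including the guess for 00AF).

```perl
#!/usr/bin/perl
# Sketch only: table-driven version of the hand-written filter.
# %ascii and to_ascii are illustrative names, not part of any package.
use strict;
use warnings;

# Assumption: input and output are UTF-8.
binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

# Code point => ASCII replacement.
my %ascii = (
    "\x{00AF}" => '_',      # macron; just a guess
    "\x{2019}" => "'",      # right single quotation mark
    "\x{201C}" => '"',      # left double quotation mark
    "\x{201D}" => '"',      # right double quotation mark
    "\x{2212}" => '-',      # minus sign
    "\x{FB00}" => 'ff',
    "\x{FB01}" => 'fi',
    "\x{FB02}" => 'fl',
    "\x{FB03}" => 'ffi',
    "\x{FB04}" => 'ffl',
    "\x{FB05}" => 'ft',     # as in the script; U+FB05 is long s + t, so 'st' may be preferable
    "\x{FB06}" => 'st',
);

# Replace each non-ASCII character found in the table; leave the rest alone.
sub to_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/exists $ascii{$1} ? $ascii{$1} : $1/ge;
    return $text;
}

while (my $line = <STDIN>) {
    print to_ascii($line);
}
```

Usage would be something like "pdftotext file.pdf - | perl thisscript.pl", with thisscript.pl being whatever name the sketch is saved under.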