On 6 Sep, [EMAIL PROTECTED] wrote:
> D> (converting text to Unicode) doesn't work, and your choice of output
> D> text encoding doesn't matter at all.
>
> I see. The reason I can still read the electric bill using xpdf is
> it must be stored in some wasteful image-like format, but not
> a pdfimages kind of image. OK.

Not exactly. It's a TrueType font, so the glyphs would be vector
format, not images.

> And if a chars are to be extracted, there must be the triple yes line
> below, I suppose.
>
> $ pdffonts -upw xxxxx phone_bill.pdf
> name                                 type         emb sub uni object ID
> ------------------------------------ ------------ --- --- --- ---------
> IKKPHJ+DFKaiShu-SB-Estd-BF           CID TrueType yes yes yes      2  0
> <--pdftotext can use this
> MingLiU                              CID TrueType no  no  no      12  0
> DFKaiShu-SB-Estd-BF                  CID TrueType no  no  no      13  0

The "uni" column means that there is a ToUnicode map. Fonts with
ToUnicode maps can almost always be extracted. (The exception is
cases where the ToUnicode map is just plain incorrect, which I have
seen, though rarely.)

The "emb" column means the font is embedded. Non-embedded fonts
generally use a standard encoding, which means they'll be extractable,
but not always.

The "sub" column means the font is a subset (only applicable for
embedded fonts). Full (non-subset) fonts generally use a standard
encoding (but not always), just like non-embedded fonts.

There is unfortunately no easy way to tell if a font will be
extractable or not.

> (By the way there's also
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=440746 but that is a
> pdftohtml bug, so never mind.)

Last I checked, pdftohtml was based on an old version of Xpdf (version
2.x), and was no longer being maintained. My guess would be that
there's some bug in the decryption code in Xpdf 2.x that has since
been fixed.

Incidentally, I couldn't open that file with Acrobat 7. The case
where the user password is non-empty and the owner password is empty
is a bit unusual (and not terribly useful).

- Derek
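
As a rough illustration of the column reading described in this message,
the Python sketch below parses pdffonts output in the seven-column layout
quoted here and applies the same rules of thumb. The function names
(parse_pdffonts, extraction_outlook) and the labels "likely", "possible",
and "uncertain" are invented for this example and are not part of Xpdf.
Other pdffonts versions may print different columns, which this sketch
does not handle. Since there is no easy way to tell for sure, the result
is only a hint.

```python
# Sketch only: assumes the column layout
#   name type emb sub uni object ID
# as quoted in this message. Helper names and labels are hypothetical.

SAMPLE = """\
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
IKKPHJ+DFKaiShu-SB-Estd-BF           CID TrueType yes yes yes      2  0
MingLiU                              CID TrueType no  no  no      12  0
DFKaiShu-SB-Estd-BF                  CID TrueType no  no  no      13  0
"""


def parse_pdffonts(text):
    """Return one dict per font row found in pdffonts output text."""
    fonts = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip blank lines, the header row, and the dashed rule.
        if len(tokens) < 7 or tokens[0] == "name":
            continue
        if set(line.strip()) <= {"-", " "}:
            continue
        # Read from the right, because the type field may contain spaces
        # ("CID TrueType"): the last two tokens are the object ID, and the
        # three before them are the emb, sub, and uni flags.
        emb, sub, uni = (t == "yes" for t in tokens[-5:-2])
        fonts.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-5]),
            "emb": emb,
            "sub": sub,
            "uni": uni,
        })
    return fonts


def extraction_outlook(font):
    """Apply the rules of thumb from the message to one parsed font."""
    if font["uni"]:
        # A ToUnicode map is present: almost always extractable.
        return "likely"
    if not font["emb"] or not font["sub"]:
        # Non-embedded or full fonts generally use a standard encoding,
        # but not always.
        return "possible"
    # Embedded subset without a ToUnicode map: nothing to go on.
    return "uncertain"


if __name__ == "__main__":
    for font in parse_pdffonts(SAMPLE):
        print(font["name"], extraction_outlook(font))
```

To try it on a real file, the text printed by pdffonts can be passed to
parse_pdffonts in place of SAMPLE.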