On  6 Sep, [EMAIL PROTECTED] wrote:
> D> (converting text to Unicode) doesn't work, and your choice of output
> D> text encoding doesn't matter at all.
> 
> I see. The reason I can still read the electric bill using xpdf is
> it must be stored in some wasteful image-like format, but not
> a pdfimages kind of image. OK.

Not exactly.  It's a TrueType font, so the glyphs are stored as vector
outlines, not images.

> And if any chars are to be extracted, there must be the triple yes line
> below, I suppose.
> 
> $ pdffonts -upw xxxxx phone_bill.pdf
> name                                 type         emb sub uni object ID
> ------------------------------------ ------------ --- --- --- ---------
> IKKPHJ+DFKaiShu-SB-Estd-BF           CID TrueType yes yes yes      2  0  <--pdftotext can use this
> MingLiU                              CID TrueType no  no  no      12  0
> DFKaiShu-SB-Estd-BF                  CID TrueType no  no  no      13  0

The "uni" column means that there is a ToUnicode map.  Fonts with
ToUnicode maps can almost always be extracted.  (The exception is cases
where the ToUnicode map is just plain incorrect, which I have seen,
though rarely.)

The "emb" column means the font is embedded.  Non-embedded fonts
generally (but not always) use a standard encoding, in which case
they'll be extractable.

The "sub" column means the font is a subset (only applicable for
embedded fonts).  Full (non-subset) fonts generally use a standard
encoding (but not always), just like non-embedded fonts.

There is unfortunately no easy way to tell if a font will be extractable
or not.
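
As a rough illustration of the column rules above, here is a minimal
Python sketch that reads pdffonts output and attaches a guess to each
font.  It assumes the six-column layout shown in the quoted output and
font names without spaces.  The names parse_pdffonts and
extraction_hint, and the "likely"/"maybe"/"unknown" labels, are made up
for this example and are not part of Xpdf.  The result is only a
heuristic: a "likely" font can still have a wrong ToUnicode map, and a
"maybe" font can still use a non-standard encoding.

```python
import re

# One table row: name, type (may contain spaces), three yes/no flags,
# and the object ID (object number plus generation number).
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<type>.+?)\s+"
    r"(?P<emb>yes|no)\s+(?P<sub>yes|no)\s+(?P<uni>yes|no)\s+"
    r"(?P<obj>\d+\s+\d+)\s*$"
)

def parse_pdffonts(output):
    """Return one dict per font row; header and rule lines are skipped."""
    fonts = []
    for line in output.splitlines():
        m = ROW.match(line)
        if m:
            fonts.append(m.groupdict())
    return fonts

def extraction_hint(font):
    """Rough guess only; there is no reliable test for extractability."""
    if font["uni"] == "yes":
        return "likely"    # ToUnicode map present (it could still be wrong)
    if font["emb"] == "no" or font["sub"] == "no":
        return "maybe"     # probably a standard encoding, but not always
    return "unknown"       # embedded subset without a ToUnicode map

# Sample input taken from the pdffonts output quoted earlier.
SAMPLE = """\
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
IKKPHJ+DFKaiShu-SB-Estd-BF           CID TrueType yes yes yes      2  0
MingLiU                              CID TrueType no  no  no      12  0
DFKaiShu-SB-Estd-BF                  CID TrueType no  no  no      13  0
"""

for font in parse_pdffonts(SAMPLE):
    print(font["name"], extraction_hint(font))
```

In practice the text passed to parse_pdffonts would be the captured
output of a pdffonts run on the file in question.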

> (By the way there's also
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=440746 but that is a
> pdftohtml bug, so never mind.)

Last I checked, pdftohtml was based on an old version of Xpdf (version
2.x), and was no longer being maintained.  My guess would be that
there's some bug in the decryption code in Xpdf 2.x that has since been
fixed.

Incidentally, I couldn't open that file with Acrobat 7.  The case where
the user password is non-empty and the owner password is empty is a bit
unusual (and not terribly useful).

- Derek


