Bug#440747: pdftotext debian bug

Derek B. Noonburg Wed, 05 Sep 2007 11:19:03 -0700

On  6 Sep, [EMAIL PROTECTED] wrote:
>>>>>> "D" == Derek B Noonburg <[EMAIL PROTECTED]> writes:
> 
> D> On  5 Sep, [EMAIL PROTECTED] wrote:
>>> Perhaps you can look at
>>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=440747
>>> even though it might not be your fault :-)
> 
> D> Looks like font subsets without any useful encoding info.
> 
> I see, no matter what -enc one guesses (Big5 or UTF8* surely), there
> is no hope of extracting any chars, and one can only read the file
> with xpdf, like it was just a big image blob?


Right.  The text encoding ("-enc ...") only affects the final output.
Internally, Xpdf takes two steps: first it converts all text to Unicode,
then it converts Unicode to the selected text encoding (Big5, UTF-8,
etc.).
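
As a rough illustration of the second step only (a Python sketch, not
Xpdf's actual code; the function name encode_output is made up for this
example), the same Unicode text simply comes out as different bytes
depending on the chosen output encoding:

```python
def encode_output(unicode_text: str, enc: str) -> bytes:
    """Step two: convert already-decoded Unicode text to the output encoding."""
    return unicode_text.encode(enc)


# Assume step one (text -> Unicode) has already succeeded and produced
# these two CJK characters (U+4E2D, U+6587).
unicode_text = "\u4e2d\u6587"

big5_bytes = encode_output(unicode_text, "big5")
utf8_bytes = encode_output(unicode_text, "utf-8")

# Different bytes on output, but both decode back to the same Unicode text.
print(big5_bytes.hex(), utf8_bytes.hex())
```

If step one never produces Unicode in the first place, there is nothing
for this step to encode, whichever encoding is selected.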

The terminology is a little confusing -- "encoding" means a couple of
different things.  You're familiar with text encodings, as mentioned
above.  Fonts also have encodings, which map character codes (used in
PDF text drawing operations) to either glyph names or glyph IDs or CIDs
(depending on the font type - but basically it's some sort of ID used
internally by the font to select the glyph to draw).

If a PDF font does not have a "ToUnicode" map, and does not have usable
encoding information (standard glyph names), then the first step
(converting text to Unicode) doesn't work, and your choice of output
text encoding doesn't matter at all.
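
To make the failure case concrete, here is a simplified Python sketch of
the first step.  It is an assumption-laden model, not Xpdf's
implementation: the dictionaries, the tiny glyph-name table, and the
function names are all hypothetical stand-ins for the font data
described above.

```python
from typing import Dict, List, Optional

# Hypothetical, tiny table of standard glyph names -> Unicode.
# Real tables contain thousands of entries.
GLYPH_NAME_TO_UNICODE: Dict[str, str] = {"A": "A", "B": "B", "space": " "}


def code_to_unicode(
    code: int,
    to_unicode_map: Dict[int, str],
    code_to_glyph_name: Dict[int, str],
) -> Optional[str]:
    """Step one for a single character code; None means 'cannot convert'."""
    # Prefer the font's ToUnicode map when it covers this code.
    if code in to_unicode_map:
        return to_unicode_map[code]
    # Otherwise fall back to the font encoding's glyph name, which only
    # helps if the name is a standard one.
    name = code_to_glyph_name.get(code)
    return GLYPH_NAME_TO_UNICODE.get(name)


def extract_text(
    codes: List[int],
    to_unicode_map: Dict[int, str],
    code_to_glyph_name: Dict[int, str],
) -> List[Optional[str]]:
    return [code_to_unicode(c, to_unicode_map, code_to_glyph_name) for c in codes]


# A font with standard glyph names: conversion works without ToUnicode.
good = extract_text([1, 2], {}, {1: "A", 2: "B"})

# A font subset with no ToUnicode map and meaningless glyph names:
# every code fails, so there is no Unicode text to re-encode later.
bad = extract_text([1, 2], {}, {1: "g1", 2: "g2"})
```

In the second case every entry is None, which is why no choice of
"-enc" can recover the text.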

- Derek



