Hi all,

Josh and I continue to work on improving the PDF to HTML conversion in poppler, 
but I am running into one persistent bug that I am having trouble solving and 
would like to solicit the list's help on.

The quick background is that text in the HTML output is often slightly 
vertically offset from where it appears in the PDF. The size of the offset, and 
whether it is upwards or downwards, varies with the font applied to the text; 
some fonts are not offset at all. My analysis indicates that the problem has to 
do with the ascent/descent metrics of the font being used. In PDF, text is 
vertically positioned by specifying the position of the text's baseline. 
However, absolutely positioned spans in HTML must be given a top or bottom 
y-coordinate; the baseline is not an available positioning option. So, to 
convert the y-coordinate from the PDF into something usable in HTML, poppler 
currently subtracts from it the ascent of the current font multiplied by the 
height in pixels of the current font, which produces the behavior described 
above.
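
To make the arithmetic concrete, here is a minimal sketch of the adjustment as 
I understand it. It is not poppler's actual code: the function and parameter 
names are hypothetical, the baseline y-coordinate is assumed to be already 
converted to top-down pixel coordinates, and the ascent is assumed to be 
expressed as a fraction of the em. The second function also assumes the 
browser places the baseline at the span's top plus its own ascent times the 
font size, with no extra leading, which is a simplification.

```python
def html_top_from_baseline(baseline_y_px, ascent_em, font_size_px):
    """Top edge for a span whose baseline should land at baseline_y_px.

    Hypothetical helper: only correct if the browser uses the same
    ascent value when it lays out the span.
    """
    return baseline_y_px - ascent_em * font_size_px


def baseline_error_px(assumed_ascent_em, rendered_ascent_em, font_size_px):
    """Vertical error if the browser uses a different ascent than we assumed.

    Positive means the text is drawn lower than intended, negative means
    higher, zero means the two ascents agree.
    """
    return (rendered_ascent_em - assumed_ascent_em) * font_size_px
```

Under these assumptions, a mismatch between the ascent used for the conversion 
and the ascent the browser uses would account for an offset whose size and 
direction vary from font to font.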

My attempts to improve this positioning adjustment have been stymied by the 
inconsistent and sometimes conflicting information I have found about font 
metrics. So, if anyone with a good understanding of fonts can help answer the 
following questions, I would appreciate it:


1.       What is the relationship between the ascent/descent of a font and the 
units per em (UPM) of a font? Most sources I have read indicate that the sum of 
the ascent and descent (or the difference between them if the descent is 
expressed as a negative number) should equal the UPM and that, depending on the 
underlying font file type, the UPM should be either 1000 or a power of 2 
like 1024 or 2048. However, of the many fonts I have examined, the ascent and 
descent of a particular font have very rarely summed to the UPM.

2.       Which of the several ascent/descent values describing a font is the 
correct one to pay attention to? In examining various font file types, I have 
often found that the different tables in a font have different, conflicting 
ascent/descent values. For example, in a TrueType font, there is one set of 
ascender/descender values in the "hhea" table, while in the "OS/2" table there 
are two more sets of values, one known as sTypoAscender/sTypoDescender and the 
other as usWinAscent/usWinDescent. When a browser is positioning a font, does it 
just look at one of these or is there a more complicated relationship between 
them that I don't understand? Furthermore, it is often the case that beyond the 
ascent/descent information included in the embedded font file itself, yet 
another and often different set of values will be included in that font's font 
descriptor in the PDF. Which should I pay attention to then?
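
For anyone who wants to reproduce the comparison behind both questions, here is 
a rough sketch of how the values can be read side by side. It is only an 
illustration under some assumptions: it handles a single TrueType/OpenType 
(sfnt) font held in memory, not a collection or a Type 1 font; the function 
names are my own; and it does not look at the PDF font descriptor at all. The 
byte offsets are the field positions in the "head", "hhea" and "OS/2" tables.

```python
import struct


def read_vertical_metrics(font_bytes):
    """Return the UPM and the three ascent/descent pairs, in font units."""
    # sfnt header: 4-byte version, then numTables as a big-endian uint16.
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    offsets = {}
    for i in range(num_tables):
        # Each 16-byte table record: tag, checksum, offset, length.
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sLLL", font_bytes, 12 + 16 * i
        )
        offsets[tag.decode("latin-1")] = offset

    upm = struct.unpack_from(">H", font_bytes, offsets["head"] + 18)[0]
    hhea = struct.unpack_from(">hh", font_bytes, offsets["hhea"] + 4)
    typo = struct.unpack_from(">hh", font_bytes, offsets["OS/2"] + 68)
    win_asc, win_desc = struct.unpack_from(">HH", font_bytes, offsets["OS/2"] + 74)
    return {
        "upm": upm,
        "hhea": hhea,
        "typo": typo,
        # usWinDescent is stored as a positive distance below the baseline;
        # negate it so all three pairs use the same sign convention.
        "win": (win_asc, -win_desc),
    }


def extent_in_em(pair, upm):
    """Ascent minus (negative) descent, as a fraction of the em."""
    ascent, descent = pair
    return (ascent - descent) / upm
```

Calling read_vertical_metrics() on the bytes of a font file and then 
extent_in_em() on each pair shows directly whether a given set of values sums 
to the UPM (a result of 1.0) and how far the three sets disagree.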

Thanks in advance for your advice!

-Stephen Reichling
