I can't recall what you said about this in the past, but I was just dealing with it today, so I'll ask:
What do you do about embedded fonts? As my company (Adobe) sells/creates fonts, I want to make sure that pdftohtml won't be violating our IP/licenses.

Thanks in advance,
Leonard

On 9/22/11 5:51 PM, "Josh Richardson" <[email protected]> wrote:

>On 9/22/11 12:20 PM, "Jonathan Kew" <[email protected]> wrote:
>>More generally, it is not possible to recreate useful XHTML (or similar)
>>documents from arbitrary PDF files with anything like 100% reliability,
>>because many PDF files do not contain adequate information to accurately
>>map the rendered glyphs back to correct Unicode text, or to reliably
>>reconstruct the proper flow of text. Constructs such as ActualText may
>>help, but are often lacking from real-world PDF documents.
>
>W.r.t. rendering glyphs, we get around the problem of missing unicode
>mappings by taking any glyph without a unicode mapping and assigning it an
>offset in the private space of Unicode. This produces the correct visual
>result in the XHTML, but not a full semantic representation. If someone's
>interested, they could get the semantics right too by pattern-matching the
>glyph against an appropriate Unicode font.
>
>W.r.t. the flow of text, there have been other threads on this topic, but
>pdftohtml does make some attempt, and I believe it's possible to do this
>to a high degree of accuracy, maybe >99% -- that said, no one has done it
>yet, so either it's harder than I think, or no one has cared enough to
>really try (and I still fall into that camp.)
>
>Best, --josh
>
>_______________________________________________
>poppler mailing list
>[email protected]
>http://lists.freedesktop.org/mailman/listinfo/poppler
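
For anyone following the thread, the private-use mapping Josh describes could be sketched roughly as below. This is a hypothetical illustration in Python, not pdftohtml's actual code: the `PuaAllocator` class, its `code_point_for` method, and the choice of the BMP Private Use Area (U+E000 to U+F8FF) as the target range are all assumptions made for the example.

```python
# Hypothetical sketch of private-use mapping; not taken from pdftohtml.
PUA_START = 0xE000  # first code point of the BMP Private Use Area
PUA_END = 0xF8FF    # last code point of the BMP Private Use Area


class PuaAllocator:
    """Hands out one Private Use Area code point per unmapped glyph."""

    def __init__(self):
        self._assigned = {}      # (font_name, glyph_id) -> code point
        self._next = PUA_START   # next free private-use code point

    def code_point_for(self, font_name, glyph_id, unicode_value=None):
        # Glyphs that already carry a Unicode mapping keep it.
        if unicode_value is not None:
            return unicode_value
        key = (font_name, glyph_id)
        if key not in self._assigned:
            if self._next > PUA_END:
                raise OverflowError("BMP Private Use Area exhausted")
            self._assigned[key] = self._next
            self._next += 1
        # The same glyph always gets the same code point back.
        return self._assigned[key]


alloc = PuaAllocator()
mapped = alloc.code_point_for("F1", 12, unicode_value=0x41)  # has a mapping
first = alloc.code_point_for("F1", 37)                       # no mapping
again = alloc.code_point_for("F1", 37)                       # same glyph again
```

Text emitted with such code points displays correctly only alongside the matching embedded font, and the code points carry no meaning for search or copy and paste, which is the gap between visual and semantic fidelity noted in the quoted message.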
