On 9/22/11 12:20 PM, "Jonathan Kew" <[email protected]> wrote:

>More generally, it is not possible to recreate useful XHTML (or similar)
>documents from arbitrary PDF files with anything like 100% reliability,
>because many PDF files do not contain adequate information to accurately
>map the rendered glyphs back to correct Unicode text, or to reliably
>reconstruct the proper flow of text. Constructs such as ActualText may
>help, but are often lacking from real-world PDF documents.

W.r.t. rendering glyphs, we get around the problem of missing Unicode mappings by taking any glyph without a Unicode mapping and assigning it a code point in the Private Use Area of Unicode. This produces the correct visual result in the XHTML, but not a full semantic representation. If someone is interested, they could get the semantics right too by pattern-matching the glyph against an appropriate Unicode font.

W.r.t. the flow of text, there have been other threads on this topic, but pdftohtml does make some attempt, and I believe it's possible to do this to a high degree of accuracy, maybe >99%. That said, no one has done it yet, so either it's harder than I think, or no one has cared enough to really try (and I still fall into that camp).

Best,
--josh

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
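
For anyone who wants to see the private-use fallback spelled out, here is a minimal sketch in Python. It is only an illustration of the idea, not the actual pdftohtml code: the function name, the (glyph name, code point) input format, and the choice to start at U+E000 are all assumptions made for the example.

```python
# Hypothetical sketch: give every glyph a code point, falling back to the
# Basic Multilingual Plane Private Use Area (U+E000..U+F8FF) when the PDF
# supplies no Unicode mapping for it.
PUA_START = 0xE000
PUA_END = 0xF8FF


def assign_code_points(glyphs):
    """Map glyph names to code points.

    `glyphs` is an iterable of (glyph_name, code_point_or_None) pairs;
    this input shape is an assumption for the example.
    """
    mapping = {}
    next_pua = PUA_START
    for name, code_point in glyphs:
        if name in mapping:
            # Reuse the earlier assignment so a glyph keeps one code point.
            continue
        if code_point is None:
            if next_pua > PUA_END:
                raise ValueError("ran out of private-use code points")
            code_point = next_pua
            next_pua += 1
        mapping[name] = code_point
    return mapping


glyphs = [("A", 0x41), ("g123", None), ("g124", None), ("A", 0x41)]
mapping = assign_code_points(glyphs)
```

The output renders correctly only if the generated font maps the same private-use code points to the original glyph outlines, which is why the result is visually right but carries no text semantics.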
