Hi All! I have encountered another strange PDF document. When viewed in graphical viewers like Okular/Evince/Reader/FoxIt, it looks totally fine.
But when I extract the content using pdftotext or pdftohtml, the text is garbled. A little tinkering with the output showed that the ASCII characters appear to have been shifted down by 29. E.g. '5' (0x35) became 0x18. I applied a simple script that adds 29 to each character and can now read most of the text (except for the German umlauts; some strange characters also appear at the beginning of some lines).

I gather my question would be: what should I fix in pdftohtml to make it print the text properly?

P.S. Okular (KDE 4.7.4) also showed that the embedded subset fonts are "Type 1C" and have these funny names:

IKFZYK+MSTT31c39b00
ILOQIT+MSTT31c38e00
MBQOWW+MSTT31c38100
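
For anyone who wants to reproduce the workaround, here is a minimal Python sketch of a shift-back script of the kind described above. It is not the original script: the function name `unshift` is hypothetical, and leaving line breaks and form feeds untouched is an assumption about what pdftotext emits on its own. Anything that would not land in printable ASCII after adding 29 (the umlauts, for example) is passed through unchanged.

```python
SHIFT = 29  # offset observed in the post: 0x18 + 29 == 0x35, i.e. '5'

# Assumption: these characters come from pdftotext itself, not from the
# shifted font encoding, so they are left alone.
KEEP = {"\n", "\r", "\f"}


def unshift(text: str, shift: int = SHIFT) -> str:
    """Add `shift` to every character whose shifted code is printable ASCII."""
    out = []
    for ch in text:
        shifted = ord(ch) + shift
        if ch not in KEEP and 0x20 <= shifted <= 0x7E:
            out.append(chr(shifted))
        else:
            # Outside the range the shift explains; keep the character as is.
            out.append(ch)
    return "".join(out)


# Usage: the garbled byte 0x18 from the post maps back to '5'.
print(unshift("\x18"))
```

Note that this only patches the extracted output. It does not address why the extraction tools get the wrong character codes for these fonts in the first place.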
