Hi! A few findings, for anybody interested in recovering text from PDFs whose embedded fonts have a wrong or missing mapping to the Unicode table.
I have stumbled on another such PDF, and decided to take a closer look at the insane idea of OCRing the texts.

#1. As it turned out, the idea is not so insane. Probably just crazy. If I'm reading the forums correctly, the situation has changed in the last year: libtesseract has become pretty stable and can even recognize italic formatting. Tesseract is a pretty old, but rather good, OCR engine developed by HP. Some years ago it was open-sourced, and one of the fruits of the open-sourcing was its repackaging as a library. (Home page: http://code.google.com/p/tesseract-ocr/ . On Debian systems: `apt-get install libtesseract-dev` - but I haven't checked it in depth myself.)

With libtesseract in mind, it should be feasible to attempt the text conversion even if neither a mapping to Unicode is supplied nor the embedded font carries a mapping table: use freetype to render the text with the embedded font into an in-memory bitmap, feed the bitmap to the OCR library, and match the OCR'ed text against the input string to build a mapping table. The mapping table is needed because OCR is pretty slow, so it should only be called when the text cannot be converted using the mapping table alone. OCR should be pretty reliable here, since the input image would be properly aligned and clean of the usual post-scanner garbage. (A rough sketch of such a loop is below, before the quoted mail.)

#2. The mapping table inside the font was actually my second finding, when I checked the freetype library interface yesterday: a font can contain its own charset mapping table(s). And I haven't found any trace in poppler (but neither am I a specialist in its innards) of any attempt to access the font's own mapping tables. But I might be totally off here, since I have no experience with font rendering and only a surface understanding of the purpose of the mapping tables inside fonts. (The second sketch below shows the freetype calls I mean.)

#3. The handling of fonts inside pdftohtml (and in some parts of poppler too) isn't very clean: font comparison disregards the encoding. IOW, two fonts are considered equivalent even if they have different encodings. That means that in some scenarios, when text is extracted and merged into lines, the information about the font encoding is lost if text using a font with a custom encoding is surrounded by text using a font with a known encoding (and probably vice versa). I did a fix for it in my private repo, and it has the (desired) side effect of producing more line breaks. I have also made it so that characters with a custom encoding are extracted as hex literals. I can post the patch for pdftohtml if anybody's interested.

fyi.
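P.S. To make #1 a bit more concrete, here is a rough, untested sketch of the render-and-OCR loop as I imagine it: freetype renders one glyph of the embedded font into an in-memory grayscale bitmap, libtesseract recognizes it, and the result is cached in a glyph-index -> text table so OCR runs at most once per glyph. This is not poppler code; the helper names (ocr_glyph, map_glyph) and the single-character page-segmentation mode are just assumptions for illustration. In practice one would probably render whole strings and match them back against the recognized text, as described above, to give the OCR more context.

#include <ft2build.h>
#include FT_FREETYPE_H
#include <tesseract/baseapi.h>

#include <map>
#include <string>
#include <vector>

/* Render glyph `gid` of the embedded font at a generous size and OCR it
   as a single character. Returns an empty string on failure. */
static std::string ocr_glyph(FT_Face face, tesseract::TessBaseAPI& tess, FT_UInt gid)
{
    if (FT_Set_Pixel_Sizes(face, 0, 64) != 0)
        return std::string();
    if (FT_Load_Glyph(face, gid, FT_LOAD_RENDER) != 0)
        return std::string();

    const FT_Bitmap& bm = face->glyph->bitmap;   /* 8-bit gray coverage */
    if (bm.width == 0 || bm.rows == 0)
        return std::string();

    /* Copy into a larger white canvas with a margin and invert the
       coverage, so tesseract sees a dark glyph on a light background. */
    const int margin = 16;
    const int w = (int)bm.width + 2 * margin;
    const int h = (int)bm.rows + 2 * margin;
    std::vector<unsigned char> img((size_t)w * h, 255);
    for (int y = 0; y < (int)bm.rows; ++y)
        for (int x = 0; x < (int)bm.width; ++x)
            img[(y + margin) * w + (x + margin)] = 255 - bm.buffer[y * bm.pitch + x];

    tess.SetPageSegMode(tesseract::PSM_SINGLE_CHAR);
    tess.SetImage(&img[0], w, h, 1 /* bytes per pixel */, w /* bytes per line */);
    char* out = tess.GetUTF8Text();
    std::string text = out ? out : "";
    delete [] out;
    while (!text.empty() && (text[text.size() - 1] == '\n' || text[text.size() - 1] == ' '))
        text.erase(text.size() - 1);   /* drop the trailing newline tesseract adds */
    return text;
}

/* The per-font mapping table: OCR only glyphs we have not seen yet. */
static const std::string& map_glyph(FT_Face face, tesseract::TessBaseAPI& tess,
                                    std::map<FT_UInt, std::string>& table, FT_UInt gid)
{
    std::map<FT_UInt, std::string>::iterator it = table.find(gid);
    if (it == table.end())
        it = table.insert(std::make_pair(gid, ocr_glyph(face, tess, gid))).first;
    return it->second;
}

/* Usage (error checks omitted):
     tesseract::TessBaseAPI tess;
     tess.Init(NULL, "eng");
     FT_Library lib;  FT_Init_FreeType(&lib);
     FT_Face face;    FT_New_Memory_Face(lib, font_data, font_len, 0, &face);
     std::map<FT_UInt, std::string> table;
     std::string ch = map_glyph(face, tess, table, glyph_index);            */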
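And for #2, this is the kind of freetype call sequence I had in mind when talking about the font's own mapping tables. Again only a sketch with made-up function names (dump_charmaps, inverse_cmap), not a claim about how poppler should do it:

#include <ft2build.h>
#include FT_FREETYPE_H

#include <cstdio>
#include <map>

/* List the charmaps embedded in the font. */
static void dump_charmaps(FT_Face face)
{
    for (FT_Int i = 0; i < face->num_charmaps; ++i) {
        FT_CharMap cm = face->charmaps[i];
        std::printf("charmap %d: platform %u, encoding id %u\n",
                    i, (unsigned)cm->platform_id, (unsigned)cm->encoding_id);
    }
}

/* Build the inverse map (glyph index -> Unicode), which is the direction
   we actually need when the PDF only hands us glyph codes. Returns an
   empty map if the font has no Unicode charmap. */
static std::map<FT_UInt, FT_ULong> inverse_cmap(FT_Face face)
{
    std::map<FT_UInt, FT_ULong> rev;
    if (FT_Select_Charmap(face, FT_ENCODING_UNICODE) != 0)
        return rev;                       /* no Unicode cmap in this font */
    FT_UInt gid = 0;
    FT_ULong code = FT_Get_First_Char(face, &gid);
    while (gid != 0) {
        rev.insert(std::make_pair(gid, code));
        code = FT_Get_Next_Char(face, code, &gid);
    }
    return rev;
}

If inverse_cmap() comes back non-empty, the OCR step above would not even be needed for that font.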
On 3/24/12, Ihar `Philips` Filipau <[email protected]> wrote:
> On 3/24/12, suzuki toshiya <[email protected]> wrote:
>>
>> I think so. If we restrict our scope to Latin script, there might
>> be some heuristic technology to restore the original text (I think
>> the number of unique glyphs in the document is less than 255 x 3).
>> If there are so many Latin script (or small charset) documents
>> that the texts cannot be extracted, some experts may be interested
>> in the solution for this issue. I think it's interesting theme for
>> some engineers (including me), but unfortunately, I don't have
>> sufficient sparetime to do it now, and I'm a CJK people :-).
>>
>
> Sort of solution exists already: "print" to PNG and OCR. Because
> that's what it really is: guess which symbol of the font maps to which
> character. Provided that we have only the image of the character, that
> is the job of OCR to do it.
>
> It seems I had a luck and can guess meaning of those few symbols which
> are still garbled, but it seems that other different things are going
> on too in the document, e.g. capital letter C is "C\x8a\x8dX" and
> question mark is "C@PQX".
