Hi Constantine.

Yes, Tesseract's glyphless font is perfect for putting invisible OCR text over
an image. You basically take an OCR word (which carries its position and size
in image coordinates), convert that into PDF coordinates and dimensions, and
then show the text with the glyphless font at those coordinates with a suitably
scaled font size. Since characters usually do not all have the same width, the
result is not perfect, but it is good enough for highlighting in readers etc.
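
To make the conversion concrete, here is a minimal sketch of the arithmetic,
not the code I actually use. It assumes the word box is given in image pixels
with the origin at the top left, that the image covers the whole page at a
known DPI, and that every glyph is 0.5 em wide. The class and method names
(OcrWordPlacement, OcrWord, ...) are made up for illustration and are not part
of PDFBox or Tesseract.

```java
/** Hypothetical helper: maps an OCR word box (image pixels) to PDF text parameters. */
final class OcrWordPlacement {

    /** Advance width of every glyph in the glyphless font, in em (assumption taken from the font description). */
    static final double GLYPH_WIDTH_EM = 0.5;

    /** Word box in image pixels, origin at the top-left corner of the image. */
    static final class OcrWord {
        final String text;
        final int left, top, right, bottom;

        OcrWord(String text, int left, int top, int right, int bottom) {
            if (text.isEmpty() || right <= left || bottom <= top) {
                throw new IllegalArgumentException("empty text or degenerate box");
            }
            this.text = text;
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }
    }

    /** Converts a pixel distance to PDF points (1 pt = 1/72 inch). */
    static double toPoints(double pixels, double dpi) {
        return pixels * 72.0 / dpi;
    }

    /** X position of the word's left edge in PDF points. */
    static double pdfX(OcrWord w, double dpi) {
        return toPoints(w.left, dpi);
    }

    /** Y position in PDF points; PDF has its origin at the bottom left, so the axis is flipped.
     *  The bottom edge of the box stands in for the baseline, which is only an approximation. */
    static double pdfY(OcrWord w, int imageHeightPx, double dpi) {
        return toPoints(imageHeightPx - w.bottom, dpi);
    }

    /** Font size in points: the glyphs are 1 em high, so the box height is the font size. */
    static double fontSize(OcrWord w, double dpi) {
        return toPoints(w.bottom - w.top, dpi);
    }

    /** Horizontal scaling in percent (operand of the Tz operator) that stretches the
     *  fixed-width text to the width of the OCR box. Assumes one glyph per UTF-16 code unit. */
    static double horizontalScalePercent(OcrWord w, double dpi) {
        double naturalWidth = w.text.length() * GLYPH_WIDTH_EM * fontSize(w, dpi);
        double targetWidth = toPoints(w.right - w.left, dpi);
        return 100.0 * targetWidth / naturalWidth;
    }
}
```

The resulting values would go into the text position, the font size and the
horizontal scaling of the content stream before the word is shown.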

That font is quite a hack. It consists of two PDF fonts, with the TTF file
being more or less a placeholder:
* A Type 0 font with a ToUnicode map that maps CIDs 0 - 0xffff to Unicode
code points 0 - 0xffff. I have not yet verified how this handles non-BMP
characters, but the Tesseract PDF renderer C++ code is easy to understand and
they claim it works. (I am not sure about this and need to read the Tesseract
pdfrenderer code some more.) This map is what allows a client app to convert
character IDs in the content stream to Unicode.
* This parent font contains a descendant font, which is a CIDFontType2 font
whose font descriptor points to the TTF. It basically describes the size of the
glyphs (each is exactly 1 pt high and 0.5 pt wide, so the font size translates
directly). It also declares a CIDToGIDMap that maps all CIDs to GID 1.

[Fun fact: even this small font puts quite a strain on things. The PDFBox
PDFDebugger renders pages without a resource cache, and Tesseract sets the font
size very, very often. Each of these reloads the font (without checking whether
it is already set) when the text is rendered, which my modified debugger does.
The GC can't keep up with that rate of object creation, and the debugger either
died with an OOM after 1.5 GB had been used or took 20 - 30 seconds. With a
resource cache hacked in, it actually renders quickly. I tried some
improvements to FontBox before realizing that the cache was missing, and I got
the object creation rate down so that even without a resource cache it peaks at
150 MB and renders in 3 seconds. I might polish that when I have some free time
and hand it in.]

So basically all the info PDFBox needs is there, but unfortunately the TTF
contains an empty cmap. If the cmap were missing instead, PDFBox would use the
other tables and work fine (in lenient mode; in strict mode it would fail).
To circumvent that, one can either:
* Hack PDFBox to ignore empty cmaps (I did that for testing).
* Write the Tj command raw (alas, these methods are marked deprecated and will
probably be removed in the future, which would be kind of a bummer for advanced
users).
* Trick PDFBox into doing what we want.

How do we trick PDFBox? Yes, we make the cmap go missing!
The file I tried to attach to the other message was a 3 KB PDF consisting of
an empty page OCRed by Tesseract. I call it template.pdf. All it contains is
the glyphless font. You'd load the font from this file if your OCR results come
from anything other than a Tesseract OCR PDF and you need to merge them into a
PDF. Simply copy this template to a second file, let's call it trickster.pdf.
Open that one in a hex editor and change the "p" in the "cmap" tag of the
embedded TTF to a "b". Checksums are not verified, so I didn't bother fixing
them. This is basically an invalid TTF file now, since the cmap table is
required, but we only need it for tricking PDFBox.
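
If you would rather script the edit than use a hex editor, the following is a
rough sketch of the same byte swap. It assumes the TTF is stored uncompressed
in the PDF and that you already know the offset at which the font program
starts (sfntStart); finding that offset is not shown. CmapTagPatcher and
renameCmapTable are hypothetical names, not PDFBox API. The sketch walks the
TTF table directory instead of searching for the first "cmap" string, because
that string can also occur elsewhere in the file (e.g. "begincmap" in a
ToUnicode stream).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: renames the "cmap" table tag of an embedded TTF to "cmab". */
final class CmapTagPatcher {

    /**
     * Returns a patched copy of data; the input array is left untouched.
     *
     * @param data      file content (e.g. the bytes of template.pdf)
     * @param sfntStart offset of the first byte of the TTF font program (assumed to be known)
     */
    static byte[] renameCmapTable(byte[] data, int sfntStart) {
        byte[] out = Arrays.copyOf(data, data.length);
        if (sfntStart < 0 || sfntStart + 12 > out.length) {
            throw new IllegalArgumentException("font header out of range");
        }
        // TTF offset table: 4 bytes version, then a big-endian uint16 with the number of tables.
        int numTables = ((out[sfntStart + 4] & 0xFF) << 8) | (out[sfntStart + 5] & 0xFF);
        for (int i = 0; i < numTables; i++) {
            // Table records start at byte 12 and are 16 bytes each: tag, checksum, offset, length.
            int record = sfntStart + 12 + 16 * i;
            if (record + 16 > out.length) {
                throw new IllegalArgumentException("table directory is truncated");
            }
            String tag = new String(out, record, 4, StandardCharsets.US_ASCII);
            if (tag.equals("cmap")) {
                out[record + 3] = (byte) 'b'; // "cmap" -> "cmab"; checksums are left as they are
                return out;
            }
        }
        throw new IllegalArgumentException("no cmap table record found");
    }
}
```

Since only a single byte changes and the length stays the same, the offsets in
the PDF's cross-reference table remain valid.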

So when working on a PDF page you want to enrich with OCR, you first load the
font from trickster.pdf and add it to the page/form resources. PDFBox will
happily use the CIDToGID table and work. When you are done with the page/form,
you overwrite the font in the resources with the one from template.pdf, so that
a proper TTF is embedded. Ugly? Yes. But it's cheaper than maintaining a branch
of PDFBox. PDFBox seems to simply copy the font from the open
template/trickster PDFs into the altered PDF when saving, which is perfect.
Otherwise one would have to use the clone utility.

I am going to work on this a bit more and show some code excerpts later.

Gunnar

-----Original Message-----
From: Constantine Dokolas <[email protected]>
Sent: Friday, 26 March 2021 11:38
To: [email protected]
Subject: Re: Empty cmap in TTF Files.

Hi, Gunnar,

Do you think this SO question
<https://stackoverflow.com/questions/49363954/using-arialmt-for-arabic-text-without-embedding-font-with-pdfbox>
is related? I'm the OP and the (admittedly somewhat niche) case for no-glyph 
(i.e. non-renderable) chars on a PDF is a "capability" that's been missing for 
me.

To give some context, at work I'm responsible for a library that, among other 
things, overlays OCRed text (from diverse sources) on images placed in PDF 
pages. There have been issues I've overcome (especially concerning Unicode), 
but "glyphless font" embedding is something that would really make a noticeable 
impact on PDF size. Most OCR software that produces PDFs from images does this
in some way, Tesseract included.

I think PDFBox is a great library for reading and generating PDFs, and I'm 
seriously considering contributing as soon as possible. A big thanks to 
everyone working to make this project successful.

C.D.
--
There is a computer disease that anybody who works with computers knows about. 
It's a very serious disease and it interferes completely with the work. The 
trouble with computers is that you 'play' with them!
- Richard P. Feynman


On Thu, Mar 25, 2021 at 2:30 PM Gunnar Brand < 
[email protected]> wrote:

> Hi.
>
> The process is as follows:
> 1) For images: use the image
>     For PDFs: render each page to 300 dpi (since optimized PDFs don't 
> necessarily have a single big image), maybe even with text if text 
> extraction returned gibberish (missing unicode mapping).
> 2) Use tesseract to OCR image/page with PDF and HOCR output. (for pages:
> create an imageless PDF). The HOCR is used for additional page layout 
> information and word confidence values.
> 3) For images, use the HOCR to filter the PDF text stream and add 
> layout information
>     For PDFs, insert the tesseract PDF text stream into the original 
> PDF's page (+add that glyphless font), use the HOCR to filter and add 
> layout information.
>
> For step 3, I would like to use a normal PDPageContentStream to add 
> the content instead of working with a raw stream. But that step fails 
> since I cannot use the showText() method with a Font that has an empty cmap.
>
> I attached an empty tesseract PDF with the glyphless font. Appending 
> text using the font to the single page in there will fail immediately 
> with the exception due to the empty cmap. Adding the font to any other 
> PDF and trying to show text using it will fail as well.
>
> I can probably get away with just creating/transferring the Tj commands 
> raw, but I was wondering if the empty cmap behaviour is ok or would it 
> be better to ignore empty cmaps (i.e. look for a non empty one first 
> and return null if none can be found in TrueTypeFont.getUnicodeCmapImpl).
>
> Gunnar
>
>
>
> -----Original Message-----
> From: Tilman Hausherr <[email protected]>
> Sent: Thursday, 25 March 2021 04:37
> To: [email protected]
> Subject: Re: Empty cmap in TTF Files.
>
> On 24.03.2021 at 14:40, Gunnar Brand wrote:
> > Hi.
> >
> > I am working on merging original PDFs and the PDF/HOCR output of
> > Tesseract, as to create a searchable PDF. Transplanting the glyphless
> > font used by tesseract was no problem, it doesn’t matter if I simply
> > use the font in the original PDF or use cloneutil, when saving the
> > file the font is embedded properly.
> >
> > The problem is when I show text using a content stream, I get a “No
> > Glyph for …” exception. I traced this down to the glyphless font
> > containing empty cmap tables. There is a CIDToGIDMap. Coincidentally
> > PDFBOX-5103 just addressed this issue with a reverse mapping if the
> > cmap is null. But the cmap is just empty and will return 0 for any
> > character code, so this new feature will never work in this case.
> >
> > For testing I modified TrueTypeFont.getUnicodeCmapImpl(isStrict) so
> > that it ignores empty cmap subtables  (even the fallback at the end of
> > the method now being a loop). With this PDFBox will happily use the
> > tesseract glyphless font. Now I lack the knowledge if empty cmaps make
> > any sense at all and if they do I will simply write raw show text
> > commands, but maybe it is something to consider?
> >
> > Gunnar
>
> I tried tesseract some time ago and it generates searchable PDFs out 
> of the box, why not use that?
>
> Can you upload one of your files to a sharehoster so that I understand 
> what this is about?
>
> Tilman
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
