It looks like the font size is not calculated correctly from the bounding boxes and the text they contain in the HTML. I am not an expert, but this link might help you: http://www.emdpi.com/fontsize.html
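To make the idea concrete, here is a rough sketch of how a font size could be estimated from a bounding box. It is only an illustration under assumptions, not what the tool actually does: I assume hOCR-style boxes given as `x0 y0 x1 y1` in pixels, a known scan resolution, and a crude average glyph width. The function names and the `height_ratio` and `avg_char_width_em` parameters are made up for this example; a real implementation would take glyph widths from the metrics of the font actually embedded in the PDF.

```python
POINTS_PER_INCH = 72.0  # one PDF point is 1/72 inch


def estimate_font_size_pt(bbox, dpi=300, height_ratio=1.0):
    """Estimate a font size in points from a pixel bounding box.

    bbox is (x0, y0, x1, y1) in pixels. height_ratio is a hypothetical
    tuning factor: the fraction of the font size that the box height
    covers (1.0 assumes the box spans the full ascender-to-descender
    height, which is only an approximation).
    """
    x0, y0, x1, y1 = bbox
    height_px = y1 - y0
    if height_px <= 0:
        raise ValueError("bounding box has no height")
    return height_px * POINTS_PER_INCH / dpi / height_ratio


def horizontal_scale(bbox, text, font_size_pt, dpi=300, avg_char_width_em=0.5):
    """Factor by which the text must be stretched to fill the box width.

    avg_char_width_em is an assumed average glyph width in em units;
    it stands in for real per-glyph font metrics.
    """
    if not text:
        raise ValueError("text must not be empty")
    x0, y0, x1, y1 = bbox
    box_width_pt = (x1 - x0) * POINTS_PER_INCH / dpi
    natural_width_pt = len(text) * avg_char_width_em * font_size_pt
    return box_width_pt / natural_width_pt


# Example: a word box 300 px wide and 50 px high on a 300 dpi scan.
box = (100, 200, 400, 250)
size = estimate_font_size_pt(box, dpi=300)
scale = horizontal_scale(box, "Hello", size, dpi=300)
```

With both values, the hidden text could be placed at the estimated size and stretched horizontally so that it covers the same area as the word in the image, which is what selection and search in a viewer depend on.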
I have discussed this with somebody who is an expert in PDF. My current understanding is that, when the PDF is created, the hidden text behind the displayed image needs font size, spacing and similar information in order to be laid out correctly in the viewer.

I noticed that it is not only text selection in the viewer that fails: many words are also not found by the viewers' built-in search (tested with Evince and Adobe Acrobat Reader). Side note: if I extract the full text using a PDF library, I get correct-looking text (words are separated by spaces and there are no stray spaces).

I think that creating a correct sandwich PDF is crucial, and I wonder why more people are not interested in this. But I also think that it is not easy. It would probably be necessary to bring experts in OCR, PDF and fonts together to solve this.

The key missing piece, IMHO, is deriving font metric information (font name, size, spacing, ...) when only the bounding boxes and the text they contain are available. That is why I posted the link above, which I find important.

--
Bounding boxes not handled correctly
https://bugs.launchpad.net/bugs/632524