It looks like the font size is not calculated correctly from the bounding
boxes and the text they contain in the HTML. I am not an expert, but this
link might help you:
http://www.emdpi.com/fontsize.html

I have discussed this with somebody who is an expert in PDF. My current
understanding is that, when the PDF is created, the underlying text
behind the displayed image needs font size, spacing, etc. information
to be handled correctly by the viewer.

I noticed that it is not only the selection in the viewer that does not
work correctly. A lot of words are also not found by the internal search
functionality of viewers (tested with Evince and Adobe Acrobat Reader).

Side note: If I extract the full text using a PDF library, I get
correct-looking text (words separated by spaces, no spaces within
words).

I think that creating a correct sandwich PDF is crucial, and I wonder
why more people are not interested in this. But I also think that it is
not easy. It would probably be necessary to get experts in OCR, experts
in PDF and experts in fonts together to solve this. The key missing
piece, IMHO, is deriving font metric information (font name, size,
spacing, ...) when only the bounding boxes and the contained text are
available. That is why I posted the link above, which I find important.
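
To make the idea concrete, here is a rough Python sketch of how a font
size and a horizontal scaling could be guessed from a bounding box. It
is only an illustration under assumptions of mine: the function names,
the box_to_em parameter and the avg_char_width_em value are
hypothetical, the bounding box is assumed to be (x0, y0, x1, y1) in
pixels, and the scan resolution (dpi) is assumed to be known. A real
solution would need the actual glyph widths of the chosen font instead
of an average character width.

```python
def font_size_from_bbox(bbox, dpi, box_to_em=1.0):
    """Guess a font size in points from a word bounding box.

    bbox is (x0, y0, x1, y1) in pixels. box_to_em is the assumed
    fraction of the font's em height that the box covers (an
    assumption; it depends on the font and on the letters in the word).
    """
    _, y0, _, y1 = bbox
    height_pt = (y1 - y0) * 72.0 / dpi  # 72 points per inch
    return height_pt / box_to_em


def horizontal_scale(bbox, dpi, text, font_size_pt, avg_char_width_em=0.5):
    """Guess the horizontal scaling (in percent) needed so that `text`
    set at font_size_pt spans the width of the bounding box.

    avg_char_width_em is a crude stand-in for real font metrics.
    """
    x0, _, x1, _ = bbox
    box_width_pt = (x1 - x0) * 72.0 / dpi
    natural_width_pt = len(text) * avg_char_width_em * font_size_pt
    return 100.0 * box_width_pt / natural_width_pt


if __name__ == "__main__":
    bbox = (100, 200, 400, 250)  # hypothetical word box from a 300 dpi scan
    size = font_size_from_bbox(bbox, dpi=300)
    scale = horizontal_scale(bbox, 300, "sandwich", size)
    print(size, scale)
```

If the hidden text is written with a size and scaling derived this way,
its extent should roughly match the word in the image, which is what
selection and search in the viewer seem to depend on. How well this
works in practice depends on how close the assumed metrics are to the
real font.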

-- 
Bounding boxes not handled correctly
https://bugs.launchpad.net/bugs/632524