Rupert Swarbrick <[email protected]> writes:
> Notice the weird 80.999 / 84.149 oscillating thing. And then the sudden
> jump to the right: I wonder whether the 80/84 lines are bullet points
> and then "The Microchip" starts with the 332.999000 line? The document
> displays fine with Evince, though, and selecting the relevant text
> doesn't behave strangely.

I hunted further. I get the same behaviour with the current code from
git, so I added some g_printf () calls to that. With the document [1]
from before, the call to poppler_page_get_text_layout first outputs the
title and a couple of expected lines, then outputs a column of bullet
points. Aha! That explains the "oscillation" I saw before.

So basically what's going on is that the text output by
poppler_page_get_text_layout is not in the same order as that output by
poppler_page_get_text. The latter works through TextPage, rather than
brute-force walking the word list, and there seems to be some clever
logic there to put things in a sensible reading order.

As such, I think the fact that the two come out in a different order
from each other must be intentional: am I right? If so, is there
currently any way to use the glib interface to get a list of the
characters on the page, along with their bounding boxes? I can't work
out how to match up the indices from poppler_page_get_text_layout with
anything else.

Another point is that the relationship between the two functions should
probably be clarified in the documentation shipped with the source.
I'll happily provide a patch, but I can't really do that until I
understand what's going on...

Any help greatly appreciated!

Rupert

[1] http://ww1.microchip.com/downloads/en/DeviceDoc/22197B.pdf
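
For anyone wanting to reproduce the diagnosis, here is a rough sketch of the kind of check the g_printf () output suggested: look for a run of consecutive rectangles that share a left edge but each sit on a different line, which is what a column of bullets looks like. This is not poppler code. Rect is a hypothetical stand-in for PopplerRectangle (same x1/y1/x2/y2 fields), longest_column_run and its tolerances are names made up here, and the sample values are invented apart from the three x positions quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PopplerRectangle (same four fields), so this
 * sketch builds without poppler installed. */
typedef struct {
    double x1, y1, x2, y2;
} Rect;

/* Made-up sample in the order described in the post: three title glyphs on
 * one line, four bullets stacked in a column, then a jump to the right.
 * The x values 80.999, 84.149 and 332.999 echo the post; every other
 * number is invented for illustration. */
static const Rect sample[] = {
    {100.000,  50.0, 110.000,  62.0},
    {110.000,  50.0, 120.000,  62.0},
    {120.000,  50.0, 130.000,  62.0},
    { 80.999, 100.0,  84.149, 112.0},
    { 84.149, 120.0,  87.299, 132.0},
    { 80.999, 140.0,  84.149, 152.0},
    { 84.149, 160.0,  87.299, 172.0},
    {332.999, 100.0, 340.000, 112.0},
};

static double abs_diff(double a, double b)
{
    return a > b ? a - b : b - a;
}

/* Length of the longest run of consecutive rectangles whose left edge stays
 * within x_tol of the first rectangle in the run, while each rectangle's
 * top differs from the previous one by more than y_tol (i.e. it is on a
 * different line). Returns 0 for an empty list. */
size_t longest_column_run(const Rect *r, size_t n, double x_tol, double y_tol)
{
    size_t best = 0, run = 0;
    double run_x = 0.0;

    assert(r != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (run > 0 &&
            abs_diff(r[i].x1, run_x) <= x_tol &&
            abs_diff(r[i].y1, r[i - 1].y1) > y_tol) {
            run++;                 /* same column, new line: extend the run */
        } else {
            run = 1;               /* start a fresh run at this rectangle */
            run_x = r[i].x1;
        }
        if (run > best)
            best = run;
    }
    return best;
}
```

With the sample above, x_tol = 5.0 and y_tol = 1.0, the longest run is the four bullets. On real output one would fill the array from whatever poppler_page_get_text_layout returns; a long run in the middle of the list is a hint that the rectangles are not in reading order.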
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
