Rupert Swarbrick <[email protected]> writes:
> Notice the weird 80.999 / 84.149 oscillating thing. And then the sudden
> jump to the right: I wonder whether the 80/84 lines are bullet points
> and then "The Microchip" starts with the 332.999000 line? The document
> displays fine with Evince, though, and selecting the relevant text
> doesn't behave strangely.

I hunted further. I get the same behaviour with the current code from
git, so I added some g_printf () calls to that. With the document [1]
from before, the call to poppler_page_get_text_layout first outputs the
title and a couple of expected lines, then outputs a column of bullet
points.

Ahah! That explains the "oscillation" I saw before. So basically what's
going on is that the text output by poppler_page_get_text_layout is not
in the same order as that output by poppler_page_get_text.
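
As a minimal sketch of how such an ordering mismatch could be spotted in
a dump of the rectangles: the helper below counts how often a rectangle
starts higher up the page than the one before it. Rect is a hypothetical
stand-in that only mirrors the x1/y1/x2/y2 fields of PopplerRectangle,
count_upward_jumps is an illustrative name and not part of poppler, and
the code assumes that y grows downwards, which may not match the
coordinate system poppler actually reports.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in mirroring the x1/y1/x2/y2 fields of
 * PopplerRectangle; not the real poppler type. */
typedef struct {
    double x1, y1, x2, y2;
} Rect;

/* Count the places where a rectangle starts more than `tol` units
 * above the previous one. Assumes y grows downwards, so text in plain
 * top-to-bottom reading order gives 0, while a column of bullets
 * followed by the matching text lines gives at least 1. */
static size_t
count_upward_jumps (const Rect *rects, size_t n, double tol)
{
    size_t jumps = 0;

    assert (rects != NULL || n == 0);

    for (size_t i = 1; i < n; i++) {
        if (rects[i].y1 < rects[i - 1].y1 - tol)
            jumps++;
    }
    return jumps;
}
```

Feeding it the rectangles from the layout call, copied into Rect values,
would give a quick yes/no answer on whether the list jumps back up the
page. It would not say how to map the indices onto the text.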

The latter works using TextPage, rather than brute-force working through
the word list, and there seems to be some clever logic to put things in
a sensible order.

As such, I think the fact that these come out in a different order from
each other must be intentional: am I right? If so, is there currently
any way to use the glib interface to get a list of characters on the
page, along with their bounding boxes? I can't work out how to match up
the indices from poppler_page_get_text_layout with anything else.

Another point is that the relationship between the two should probably
be clarified in the documentation shipped with the source: I'll happily
provide a patch, but I can't really do that until I understand what's
going on...

Any help greatly appreciated!

Rupert

[1] http://ww1.microchip.com/downloads/en/DeviceDoc/22197B.pdf


_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
