Hi, (please do not mail me directly unless you really have to)

On Saturday, 20 November 2010, Brian Ewins wrote:
> On 20 Nov 2010, at 00:22, Albert Astals Cid <[email protected]> wrote:
> > On Monday, 15 November 2010, Baz wrote:
> >> On 14 November 2010 16:23, Albert Astals Cid <[email protected]> wrote:
> >>> BTW here comes the updated release schedule
> >>>
> >>> * Nov 29 (+2 weeks) Poppler 0.15.3 (0.16 RC)
> >>
> >> Can you consider the performance bugfix on bug 3188 for this release?
> >> https://bugs.freedesktop.org/show_bug.cgi?id=3188#c65
> >>
> >> Marek commented on the bug that he's working on further changes, but
> >> that's to fix a different section of slow code triggered by the same
> >> test document that Dennis Sheil mentioned on the list
> >> (http://www.ratp.info/picts/touristes/photos/plan%20paris-touriste.pdf).
> >
> > That fix changes pdftotext output; I asked the author whether that is to
> > be expected or not (I'd expect only more speed, not a different output).
>
> The heuristic has changed slightly so yes there are circumstances where you
> could get different output; I've not seen an example
I can send you the pdf file if you want.

> (though I guess it is
> likely to happen somewhere in that scattershot bus map). 2 things have
> changed, the initial sort order and the heuristic for deciding which
> blocks to visit first. I think only the first of these changes the
> results; a long explanation follows.
>
> The previous heuristic said block A must be visited before block B if it is
> entirely to the left of B and there is no block C that is above A, below
> B, and overlaps both horizontally. The new heuristic avoids an explicit
> search for block C by tracking an interval that starts off as the
> horizontal bounds of B and widens to cover any blocks that it overlaps as
> we move down the page. In the 3 block case, this is the same, but it
> differs when you have 5 or more blocks: If there is a block D that
> overlaps neither A nor B, and blocks E, F such that E overlaps A and D,
> and F overlaps D and B, then A will be marked to visit before B. However,
> this case would have happened before by induction anyway: D would have been
> visited before B, and A before D under the old rule. So I don't think this
> change did anything other than improve speed. It relies on the blocks
> having been pre-sorted vertically though, which is the other change.
>
> If the heuristic does not have any way to decide which of two blocks should
> be visited first, previously it would visit the first one in physical
> order. Now it visits the one closest to the top; and leftmost if at the
> same height (or top right for RTL, etc). This tends to be the same but is
> not always; the bus map labels were in a random order physically, for
> example. However for normal text, top left is a decent guess. Where it can
> go wrong is eg if you have a 2 column doc with leading vertical space in
> the left col, and the left column ends up overlapping the right (due to
> some non-rectangular layout).
> In this case the heuristic will not
> spot that the left column should have been first.

Well, reading this long description I don't see this as an optimization but as
a [small] behaviour change. We are past the feature freeze so I'm a bit
hesitant to let this in. Anyway, what I'll do is this:
 * Run the test suite and see for how many files pdftotext gives a different
   result with the patch and without it.
 * If the number of files is relatively small, check whether the differences
   are improvements and, if they are not, whether we can "live" with the
   changes.

> As I've said in the past, we'll get better results than this if we take the
> reading order from tagged PDF. Otherwise it is just guesswork.

Patches welcome ;-)

Albert

> > Albert
> >
> >> Thanks,
> >> Brian
> >>
> >>> * Dec 27 (+4 weeks) Poppler 0.16.0
> >>>
> >>> We are in bugfixing mode in trunk until we release Poppler 0.16.0
> >>>
> >>> Albert
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
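
For readers following the thread, here is a minimal sketch of two of the rules
from Brian's explanation quoted in this message: the previous pairwise test
("A before B") and the new fallback order used when the heuristic cannot
decide. It is written in Python for brevity and is not Poppler's code. The
Block class, the function names, and the convention that y grows down the page
are all assumptions made for illustration. The interval-widening speedup is
left out because the message does not give enough detail to reproduce it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical axis-aligned text block; y grows down the page (assumption)."""
    name: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def overlaps_horizontally(a: Block, b: Block) -> bool:
    """True if the horizontal extents of a and b intersect."""
    return a.x_min < b.x_max and b.x_min < a.x_max


def must_precede(a: Block, b: Block, blocks: list[Block]) -> bool:
    """Previous rule as described in the thread: a is visited before b if a is
    entirely left of b and no block c is above a, below b, and overlaps both
    horizontally."""
    if a.x_max > b.x_min:  # a is not entirely to the left of b
        return False
    for c in blocks:
        if c is a or c is b:
            continue
        above_a = c.y_max <= a.y_min
        below_b = c.y_min >= b.y_max
        if (above_a and below_b
                and overlaps_horizontally(c, a)
                and overlaps_horizontally(c, b)):
            return False
    return True


def tie_break_order(blocks: list[Block], rtl: bool = False) -> list[Block]:
    """New fallback order: topmost first, then leftmost at the same height
    (rightmost first when rtl is True)."""
    if rtl:
        return sorted(blocks, key=lambda blk: (blk.y_min, -blk.x_max))
    return sorted(blocks, key=lambda blk: (blk.y_min, blk.x_min))
```

With a block B at the top right, a full-width block C below it, and a block A
at the bottom left, must_precede(A, B, ...) is false because C separates them.
Without C it is true. tie_break_order ignores the physical order of the list,
which is the property that changes the output for documents like the bus map.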
