On Monday, 6 September 2010, you wrote:
> On Mon, Sep 06, 2010 at 08:30:14PM +0100, Albert Astals Cid wrote:
> > On Monday, 6 September 2010, Daniel Garcia Moreno wrote:
> > > Poppler does not select tables in "order". It detects tables as
> > > columns, because Poppler uses the distance between pieces of text to
> > > decide what is a column, so tables are selected in column order when
> > > the "logical way" is by rows.
> > >
> > > Another selection problem caused by that heuristic appears when you
> > > have a PDF with columns that are close together or text with spaces.
> > >
> > > I looked at acroread to see how it does column and table selection
> > > and I realized that it selects text in "order", I mean, in the order
> > > in which you put it in the PDF file. To see that I created a text PDF
> > > file with Inkscape.
> > >
> > > So the selection logic is simple: we select the nearest word to the
> > > first selection point and the nearest word to the last selection point,
> > > and every word between those two words (in text order, no matter where
> > > the words are on screen) is selected too.
> >
> > What is "text order"?
>
> I think Dani means raw order, or the order in which the PDF creation tool
> put the text into the PDF file. For example, when authoring tables with
> OpenOffice, it generates the text in row order. When using a vector
> drawing tool like Inkscape, the order matches the order in which the user
> created the objects. We believe selection should respect this order since
> it does the right thing in most cases and the algorithm is much simpler
> to understand and to maintain.
>
> As Leonard mentioned, we should also take PDF structure/tagging into
> account, but I don't think we should elaborate heuristics for guessing how
> the text should be selected (e.g., trying to see columns where the user
> did some ASCII art).
>
> By the way, I'm Dani's coworker and I helped him with the algorithm design.
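
To make the proposal concrete, here is a minimal sketch of the selection rule as described in the quoted message. It is not Poppler code: `Word`, `nearest_index` and `select_raw_order` are hypothetical names, and measuring distance to the centre of each word is an assumption made only for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Word:
    text: str
    x: float  # centre of the word on the page (assumed coordinates)
    y: float


def nearest_index(words, point):
    """Return the index of the word whose centre is closest to point."""
    px, py = point
    return min(range(len(words)),
               key=lambda i: hypot(words[i].x - px, words[i].y - py))


def select_raw_order(words, start_point, end_point):
    """Select both endpoint words and everything between them in raw order.

    `words` must be in the order the producer wrote them into the file;
    positions on screen are only used to find the two endpoint words.
    """
    first = nearest_index(words, start_point)
    last = nearest_index(words, end_point)
    lo, hi = sorted((first, last))
    return [w.text for w in words[lo:hi + 1]]


# A 2x2 table written row by row (hypothetical layout).
row_order = [
    Word("A1", 0, 0), Word("B1", 100, 0),
    Word("A2", 0, 20), Word("B2", 100, 20),
]
# The same table written column by column.
column_order = [
    Word("A1", 0, 0), Word("A2", 0, 20),
    Word("B1", 100, 0), Word("B2", 100, 20),
]

whole_table = select_raw_order(row_order, (2, 1), (98, 19))
first_row_rowwise = select_raw_order(row_order, (2, 1), (98, 1))
first_row_columnwise = select_raw_order(column_order, (2, 1), (98, 1))
```

Dragging across the first row gives `A1, B1` when the file was written row by row, but `A1, A2, B1` when the same table was written column by column, so the result depends entirely on the order the producer chose.
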

I don't think raw order is acceptable. It might work with your files because the PDF creator happened to write them in a sensible raw order, but raw order is just that, raw, and nothing guarantees it will be a logical order.

Albert

>
> Best regards,
>
> Lorenzo Gil
