Re: [poppler] New selection algorithm

Lorenzo Gil Tue, 07 Sep 2010 14:18:57 -0700

On Tue, Sep 07, 2010 at 11:05:05AM -0700, Leonard Rosenthol wrote:
> I can tell you with 100% certainty that Acrobat/Reader do NOT use raw order - 
> they use "reading order".   The algorithms haven't changed between 8 & 9.  
>


Ok

> I also looked at your PDFs and in both cases, OO is writing the content 
> streams in exactly the same way & order - top to bottom, left to right.  It 
> doesn't write the first column and then the second in the "real-column" 
> example.  Open up the PDF's content stream and look.  You'll see almost 
> identical streams.

I don't know how to view the PDF's content stream. I have tried the "Save as 
text" option in Acrobat/Reader and it does what you say. Still, that doesn't 
explain why Acrobat/Reader does not select the right thing in the fake-columns 
example.

> 
> And Adobe Reader 9.3.4 is the current version for Linux - I just checked on 
> Adobe.com.

I double checked and the problem was the language preference of my browser. The 
latest version of Acrobat/Reader in Spanish is 8.1.7. If you choose the English 
version then you are right and the latest one is 9.3.4.

Best regards,

Lorenzo

> 
> 
> Leonard Rosenthol
> PDF Standards Architect
> Adobe Systems
> 
> -----Original Message-----
> From: Lorenzo Gil [mailto:[email protected]] 
> Sent: Tuesday, September 07, 2010 2:00 PM
> To: Leonard Rosenthol
> Cc: 'Albert Astals Cid'; [email protected]
> Subject: Re: [poppler] New selection algorithm
> 
> On Mon, Sep 06, 2010 at 01:45:55PM -0700, Leonard Rosenthol wrote:
> > >I don't think raw order is acceptable.
> > >
> > Agreed - never use raw order since it means nothing.
> > 
> > You should use "reading order" (top->bottom, left->right (or RTL, 
> > depending)) as computed through geometric sorting - which is what the 
> > current code does, at least to some extent.
> > 
> > The difference with Acrobat/Reader is that we use additional heuristics to 
> > offer smarter selection semantics for columnar data, vertical text, and 
> > other such things.
> 
> I've created two PDF files (attached to this mail) with OpenOffice that look 
> pretty much the same in terms of layout and structure. Acrobat/Reader behaves 
> completely differently in terms of selection: in real-columns.pdf it 
> selects the text by columns but in fake-columns.pdf it selects the text by 
> lines. In both cases Adobe Reader selects the text in the order that 
> OpenOffice put it in the document stream (i.e. raw order). The 
> fake-columns.pdf document was created using tabs and spaces to simulate a 
> two-column layout instead of the columns feature of OpenOffice.
> 
> I'm using Adobe Reader 8.1.7 for Linux. Maybe the heuristics that you mention 
> were added to Adobe Reader 9 but unfortunately that's not available in Linux.
> 
> Sorry to focus on Adobe Reader when this is the Poppler list but I think we 
> should see Adobe Reader as the reference implementation for a PDF viewer.
> 
> Best regards,
> 
> Lorenzo
> 
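
For reference, the purely geometric "reading order" mentioned in the quoted 
message (top->bottom, then left->right) can be sketched roughly as below. This 
is only an illustration, not Poppler's or Acrobat's code: the Word type, the 
reading_order function, the line_tolerance value and the convention that y is 
measured from the top of the page are all assumptions made for the example. 
It shows why a plain geometric sort reads a two-column page line by line 
unless extra column heuristics are added.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word box; not a Poppler type."""
    text: str
    x: float  # left edge of the word
    y: float  # baseline, measured from the top of the page (assumption)


def reading_order(words, line_tolerance=2.0):
    """Group words into lines by baseline, then read top->bottom, left->right."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current line if its baseline is close enough
        # to the baseline of the word that started that line.
        if lines and abs(lines[-1][0].y - word.y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order the words from left to right.
    return [w.text for line in lines for w in sorted(line, key=lambda w: w.x)]


# Four words laid out as two visual columns, given in arbitrary (raw) order.
words = [
    Word("right-1", 300, 100), Word("left-2", 50, 120),
    Word("left-1", 50, 100.5), Word("right-2", 300, 120),
]
print(reading_order(words))
```

With this input the sort alternates between the left and right columns on 
each line, which is the line-by-line behaviour reported for fake-columns.pdf. 
Selecting column by column would need an additional step that detects the 
gap between the columns before ordering the words.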
