Thanks for the pointer, Glad. FYI, I am also interested in analyzing document structure. Our first step is to put the text back together, since in many PDFs the text is not stored in logical reading order. pdf2html has a "coalesce" function, which is our starting point. We have made some improvements to it that are not yet contributed back -- so let me know if you want the source and/or want to join forces.
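
For illustration only, the kind of "put the text back together" step described above might look roughly like the Python sketch below. This is not the pdf2html coalesce code or our modified version of it, and it does not call Poppler; the Fragment type, the coalesce name as used here, and both tolerance values are assumptions made up for the example. It groups text fragments into lines by vertical position, then joins each line left to right, inserting a space only where there is a visible horizontal gap.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment with its bounding box, in page units."""
    x: float       # left edge
    y: float       # top edge (larger means lower on the page)
    width: float
    height: float
    text: str


def coalesce(fragments, y_tol=2.0, gap_tol=1.0):
    """Group fragments into lines, then join each line left to right.

    y_tol and gap_tol are illustrative guesses, not tuned values.
    """
    lines = []  # each entry is a list of fragments with roughly the same y
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # Compare against the first fragment of the current line.
        if lines and abs(lines[-1][0].y - frag.y) <= y_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x)
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            # Horizontal distance between the end of prev and the start of cur.
            gap = cur.x - (prev.x + prev.width)
            text += (" " if gap > gap_tol else "") + cur.text
        result.append(text)
    return result


if __name__ == "__main__":
    frags = [
        Fragment(50, 100, 20, 10, "lo"),
        Fragment(10, 100.5, 40, 10, "Hel"),
        Fragment(80, 100, 30, 10, "world"),
        Fragment(10, 120, 30, 10, "next"),
    ]
    for line in coalesce(frags):
        print(line)
```

A real implementation would also have to deal with columns, rotated text, and font-size changes, which this sketch ignores.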
--josh

On 10/11/11 12:31 AM, "Glad Deschrijver" <[email protected]> wrote:

>On Tuesday 11 October 2011, Alec Taylor wrote:
>> Good afternoon,
>>
>> Do you have some recommends and/or sample code for comparing textual
>> and geometric layout information across pages?
>>
>> Basically I'm trying to realise patterns within documents, e.g., page
>> numbers, header and footers, title, column information &etc; using the
>> capabilities of the Poppler PDF library.
>
>Not sure that it will help you much, but you can have a look at DiffPDF
>which uses poppler to compare two PDF files page by page (both textually
>and visually):
>http://www.qtrac.eu/diffpdf.html
>
>Best regards,
>Glad
>
>--
> Everything that is really great and inspiring is created by
> the individual who can labor in freedom.
> -- Albert Einstein, Out of My Later Years (1950)
>
>_______________________________________________
>poppler mailing list
>[email protected]
>http://lists.freedesktop.org/mailman/listinfo/poppler
