Don't get me wrong, I know what they are; I'm just happy that the tool supports them "out of the box" for PDFs.
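
As a rough illustration of why the boxes matter for the header/footer question further down the thread, here is a minimal Python sketch. It is not Poppler code: it assumes the words or lines have already been extracted (for example via the getBBox functions or the Qt4 TextBox class Albert mentions) into plain records. The `Box` type, the function names, and the `tolerance` and `min_fraction` parameters are all hypothetical, made up for this example.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical record: a piece of text plus its bounding box,
    in whatever page coordinates the extractor reports."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def page_margins(boxes):
    """Union of all boxes on one page: (x_min, y_min, x_max, y_max)."""
    return (
        min(b.x_min for b in boxes),
        min(b.y_min for b in boxes),
        max(b.x_max for b in boxes),
        max(b.y_max for b in boxes),
    )


def normalise(text):
    """Replace each run of digits with '#', so 'Page 3' matches 'Page 12'."""
    return re.sub(r"\d+", "#", text.strip())


def repeated_boxes(pages, tolerance=2.0, min_fraction=0.8):
    """Find text that recurs at roughly the same position on most pages.

    pages is a list of pages, each a list of Box records. Positions are
    bucketed by rounding to multiples of `tolerance`, which is crude: two
    boxes that straddle a bucket boundary will not be matched.
    Returns a set of (pattern, x_bucket, y_bucket) keys.
    """
    seen_on = defaultdict(set)
    for page_no, boxes in enumerate(pages):
        for b in boxes:
            key = (
                normalise(b.text),
                round(b.x_min / tolerance),
                round(b.y_min / tolerance),
            )
            seen_on[key].add(page_no)
    needed = min_fraction * len(pages)
    return {key for key, page_set in seen_on.items() if len(page_set) >= needed}
```

Anything returned by `repeated_boxes` is a candidate running header, footer, or page number; `page_margins` gives the content area per page, similar in spirit to the margin finding Josh describes, though his implementation may differ.
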
On Wed, Oct 12, 2011 at 9:17 PM, Albert Astals Cid <[email protected]> wrote:
> On Wednesday, 12 October 2011, Alec Taylor wrote:
>> I can get bounding boxes?
>
> /me points to the various getBBox functions in TextOutputDev.h or to the
> TextBox class in the Qt4
>
> Albert
>
>>
>> SOLD! - I'll start using your product now :]
>>
>> On Wed, Oct 12, 2011 at 3:23 PM, Josh Richardson <[email protected]> wrote:
>> > Hmm. MuPDF, bless their hearts, is a cool bit of tech, but MUCH less
>> > sophisticated than Poppler. If I found the right project, pdfdraw is no
>> > exception -- a very small piece of code that doesn't do any structure
>> > analysis; it looks like it just spits out whatever blobs are natively in
>> > the PDF. If you find that I'm wrong about that, please let me know.
>> >
>> > If you start with Poppler, and my version of pdftohtml in particular,
>> > then you at least start out with a notion of words, lines of text, and
>> > paragraphs -- albeit that these things are not very accurate. Each of
>> > those entities is tagged with font size and style. You also get
>> > bounding boxes on all that text, as well as image objects (coalesced
>> > from multiple draw operations), which I use to find the page margins,
>> > but which can be extended to find some of the other items you're
>> > interested in finding.
>> >
>> > Best, --josh
>> >
>> > On 10/11/11 9:08 PM, "Alec Taylor" <[email protected]> wrote:
>> >>Thanks Josh, I was actually researching quite heavily, and found
>> >>myself on the #ghostscript channel @ freenode
>> >>
>> >>They pointed me to MuPDF (one of their projects), and it seems like
>> >>the "pdfdraw" example project is something to work from, either
>> >>directly or through parsing XML output from it.
>> >>
>> >>However, if this doesn't suit your needs, please tell me why, as I
>> >>might have the same problem, and then I'll join forces! :]
>> >>
>> >>On Wed, Oct 12, 2011 at 3:44 AM, Josh Richardson <[email protected]> wrote:
>> >>> Thanks for the pointer, Glad.
>> >>>
>> >>> FYI, I am also interested in being able to analyze document structure.
>> >>> Our first step is to put the text back together, since in many PDFs it
>> >>> is not logically organized in the original PDF. pdftohtml has a
>> >>> "coalesce" function which is the starting point for us. We have made
>> >>> some improvements on it which are not yet contributed back -- so let me
>> >>> know if you want the source and/or if you want to join forces.
>> >>>
>> >>> --josh
>> >>>
>> >>> On 10/11/11 12:31 AM, "Glad Deschrijver" <[email protected]> wrote:
>> >>>>On Tuesday 11 October 2011, Alec Taylor wrote:
>> >>>>> Good afternoon,
>> >>>>>
>> >>>>> Do you have some recommendations and/or sample code for comparing
>> >>>>> textual and geometric layout information across pages?
>> >>>>>
>> >>>>> Basically I'm trying to recognise patterns within documents, e.g.,
>> >>>>> page numbers, headers and footers, title, column information, etc.,
>> >>>>> using the capabilities of the Poppler PDF library.
>> >>>>
>> >>>>Not sure that it will help you much, but you can have a look at
>> >>>>DiffPDF, which uses poppler to compare two PDF files page by page
>> >>>>(both textually and visually):
>> >>>>http://www.qtrac.eu/diffpdf.html
>> >>>>
>> >>>>Best regards,
>> >>>>Glad
>> >>>>
>> >>>>--
>> >>>>
>> >>>> Everything that is really great and inspiring is created by
>> >>>> the individual who can labor in freedom.
>> >>>> -- Albert Einstein, Out of My Later Years (1950)
>> >>>>
>> >>>>_______________________________________________
>> >>>>poppler mailing list
>> >>>>[email protected]
>> >>>>http://lists.freedesktop.org/mailman/listinfo/poppler
