I can get bounding boxes? SOLD! - I'll start using your product now :]
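
For what it's worth, here is a rough sketch of how I picture using those boxes to spot page margins and repeating items such as headers, footers, and page numbers. It assumes the text boxes have already been extracted into simple records. `Box`, `content_bounds`, and `repeated_positions` are names I made up for illustration; none of this is Poppler or pdftohtml API, and the grid snapping is a simplification of my own.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical record for one text box. In practice the fields would be
# filled from whatever the extractor reports; nothing here is Poppler API.
@dataclass(frozen=True)
class Box:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def content_bounds(boxes):
    """Smallest rectangle enclosing every box on a page: (x0, y0, x1, y1)."""
    return (min(b.x0 for b in boxes), min(b.y0 for b in boxes),
            max(b.x1 for b in boxes), max(b.y1 for b in boxes))


def repeated_positions(pages, tol=2.0, min_share=0.8):
    """Return grid cells that hold a box on at least min_share of the pages.

    The top-left corner of each box is snapped to a tol-sized grid so that
    small jitter still matches. Text is ignored on purpose, because page
    numbers change from page to page while their position stays put.
    """
    counts = Counter()
    for boxes in pages:
        # Use a set so a page contributes at most once per cell.
        cells = {(round(b.x0 / tol), round(b.y0 / tol)) for b in boxes}
        counts.update(cells)
    needed = min_share * len(pages)
    return sorted(cell for cell, n in counts.items() if n >= needed)
```

Comparing `content_bounds` with the page size gives the margins, and the cells returned by `repeated_positions` are only candidates for headers, footers, and page numbers. Boxes that sit near a grid boundary can land in different cells, so a real version would probably cluster positions instead of snapping them.
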

On Wed, Oct 12, 2011 at 3:23 PM, Josh Richardson <[email protected]> wrote:
> Hmm. MuPDF, bless their hearts, is a cool bit of tech, but MUCH less
> sophisticated than Poppler. If I found the right project, pdfdraw is no
> exception -- a very small piece of code that doesn't do any structure
> analysis; it looks like it just spits out whatever blobs are natively in
> the PDF. If you find that I'm wrong about that, please let me know.
>
> If you start with Poppler, and my version of pdftohtml in particular, then
> you at least start out with a notion of words, lines of text, and
> paragraphs -- albeit that these things are not very accurate. Each of
> those entities is tagged with font size and style. You also get bounding
> boxes on all that text, as well as image objects (coalesced from multiple
> draw operations), which I use to find the page margins, but can be
> extended to find some of the other items you're interested in finding.
>
> Best,
> --josh
>
> On 10/11/11 9:08 PM, "Alec Taylor" <[email protected]> wrote:
>
>>Thanks Josh, I was actually researching quite heavily, and found
>>myself on the #ghostscript channel @ freenode
>>
>>They pointed me to MuPDF (one of their projects), and it seems like
>>the "pdfdraw" example project is something to work from, either
>>directly, or through parsing XML output from it.
>>
>>However, if this doesn't suit your needs, please tell me why, as I
>>might have the same problem, and then I'll join forces! :]
>>
>>On Wed, Oct 12, 2011 at 3:44 AM, Josh Richardson <[email protected]> wrote:
>>> Thanks for the pointer, Glad.
>>>
>>> FYI, I am also interested in being able to analyze document structure.
>>> Our first step is to put the text back together, since in many PDFs, it
>>> is not logically organized in the original PDF. pdf2html has a
>>> "coalesce" function which is the starting point for us. We have made
>>> some improvements on it which are not yet contributed back -- so let me
>>> know if you want the source and/or if you want to join forces.
>>>
>>> --josh
>>>
>>> On 10/11/11 12:31 AM, "Glad Deschrijver" <[email protected]>
>>> wrote:
>>>
>>>>On Tuesday 11 October 2011, Alec Taylor wrote:
>>>>> Good afternoon,
>>>>>
>>>>> Do you have some recommendations and/or sample code for comparing
>>>>> textual and geometric layout information across pages?
>>>>>
>>>>> Basically I'm trying to realise patterns within documents, e.g., page
>>>>> numbers, headers and footers, title, column information, etc., using
>>>>> the capabilities of the Poppler PDF library.
>>>>
>>>>Not sure that it will help you much, but you can have a look at DiffPDF
>>>>which uses poppler to compare two PDF files page by page (both
>>>>textually and visually):
>>>>http://www.qtrac.eu/diffpdf.html
>>>>
>>>>Best regards,
>>>>Glad
>>>>
>>>>--
>>>> Everything that is really great and inspiring is created by
>>>> the individual who can labor in freedom.
>>>> -- Albert Einstein, Out of My Later Years (1950)
>>>>
>>>>_______________________________________________
>>>>poppler mailing list
>>>>[email protected]
>>>>http://lists.freedesktop.org/mailman/listinfo/poppler
