On Wed, 2011-11-02 at 06:14 -0700, Leonard Rosenthol wrote:
> What about non-dictionary objects with indirect IDs? Remember that ANY
> PDF object (null, number, string, etc.) can be indirect.
>

I treat all objects the same, i.e., I take all entries in XRef and
convert them to XML. I also convert indirect reference objects, so that
I can recreate the document's tree structure. In effect, I preserve all
information from the PDF that is referenced in the XRef. This excludes
the xref table/stream itself, the trailer dictionary, file header, EOF
and any possible old-generation objects.

> And how do you handle recursion? Or do you simply treat each indirect
> object as unique and not related?
>

I keep the flat physical representation of the PDF, just like in the PDF
format.

> Leonard
>
> On 11/2/11 9:00 AM, "Nedim Srndic" <[email protected]> wrote:
>
> >I am doing some research on the structure of PDF files. I wrote a
> >utility to convert the object (i.e., dictionary) structure of PDFs into
> >XML so that I can query the structure using XPath or similar query
> >languages. I also care about the context, and the context can be rebuilt
> >from the resulting XML when necessary.
> >
> >Nedim
> >
> >On Tue, 2011-11-01 at 05:26 -0700, Leonard Rosenthol wrote:
> >> Why would you iterate over the objects w/o any understanding of their
> >> context? Wouldn't it make MUCH MORE sense to "walk the tree" - starting
> >> at the Catalog/Root and then simply recursing down the object tree based
> >> on known relationships?
> >>
> >> What use are the objects w/o context?
> >>
> >> Leonard
> >>
> >> On 11/1/11 7:55 AM, "Nedim Srndic" <[email protected]> wrote:
> >>
> >> >I'm sorry, I see now that I wasn't clear enough. I would like to
> >> >enumerate every PDF dictionary from a given PDF file, including but not
> >> >limited to the Catalog, Pages, Actions, Annotations, Name tree -
> >> >everything. Currently I can successfully do that for all dictionaries
> >> >that can be located using XRef, but it seems that indirect objects
> >> >inside object streams cannot be found this way. I could obviously test
> >> >if any of the objects pointed to by the XRef is an object stream and
> >> >get all the objects from the stream, but I'm wondering if Poppler has
> >> >a more elegant solution.
> >> >
> >> >Nedim
> >> >
> >> >On Mon, 2011-10-31 at 11:12 -0700, Josh Richardson wrote:
> >> >> What kinds of objects are you interested in? I have a version of
> >> >> pdftohtml which I believe is not yet merged into the master repo that
> >> >> extracts images and fonts.
> >> >>
> >> >> --josh
> >> >>
> >> >> On 10/31/11 9:16 AM, "Nedim Srndic" <[email protected]> wrote:
> >> >>
> >> >> >Dear list,
> >> >> >
> >> >> >I am using the Poppler library (in the src/poppler folder, no
> >> >> >bindings, version 7 from the Ubuntu 10.10 repos) and would like to
> >> >> >retrieve all objects from a PDF file. Currently, I am running a loop
> >> >> >on XRef and getting all the non-null objects from it, but it doesn't
> >> >> >seem to retrieve objects from object streams. What solution would
> >> >> >you propose for this problem?
> >> >> >
> >> >> >Thanks,
> >> >> >Nedim Srndic

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
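
The fallback described in the thread, checking whether an XRef entry is an object stream and pulling the contained objects out of it, can be sketched independently of Poppler. The Python below is only an illustration under stated assumptions: the stream data has already been decoded (filters removed), the /N and /First values have been read from the stream dictionary, and the offsets in the header are in increasing order. The function name `split_object_stream` is hypothetical and is not part of Poppler's API.

```python
from typing import Dict


def split_object_stream(decoded: bytes, n: int, first: int) -> Dict[int, bytes]:
    """Split decoded /ObjStm data into {object number: raw object bytes}.

    `n` is the stream dictionary's /N entry and `first` is its /First entry.
    Assumes the header offsets are in increasing order.
    """
    # The header is N pairs of integers: object number, then byte offset
    # relative to /First.
    header = decoded[:first].split()
    if len(header) < 2 * n:
        raise ValueError("object stream header is shorter than /N pairs")
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]

    objects: Dict[int, bytes] = {}
    for index, (obj_num, offset) in enumerate(pairs):
        start = first + offset
        # An object ends where the next one starts, or at the end of the data.
        end = first + pairs[index + 1][1] if index + 1 < n else len(decoded)
        objects[obj_num] = decoded[start:end].strip()
    return objects


# Hand-made example data: two objects (numbers 11 and 12) in one stream.
header = b"11 0 12 18 "
body = b"<< /Type /Font >> (hello)"
contents = split_object_stream(header + body, n=2, first=len(header))
for number, raw in sorted(contents.items()):
    print(number, raw)
```

Objects stored this way carry no `obj`/`endobj` keywords and always have generation 0, so the returned bytes would still need to go through a regular PDF object parser.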

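
The flat XML representation discussed above might look roughly like the following sketch. It is not the utility from the thread: the element names (`pdf`, `object`, `dict`, `entry`, `ref`), the `Ref` class, and the use of plain Python values to stand in for PDF objects are all assumptions made for illustration. Indirect references are written out as `ref` elements and never followed during conversion, so cyclic references cannot cause recursion, and the tree can still be rebuilt afterwards by resolving a `ref` against the top-level `object` elements.

```python
import xml.etree.ElementTree as ET


class Ref:
    """Hypothetical stand-in for an indirect reference such as `2 0 R`."""

    def __init__(self, num, gen=0):
        self.num = num
        self.gen = gen


def to_xml(value):
    """Convert one (simplified) PDF value to an XML element."""
    if isinstance(value, Ref):
        # References stay references: the target is not expanded here.
        return ET.Element("ref", num=str(value.num), gen=str(value.gen))
    if isinstance(value, dict):
        node = ET.Element("dict")
        for key, item in value.items():
            ET.SubElement(node, "entry", key=key).append(to_xml(item))
        return node
    if isinstance(value, list):
        node = ET.Element("array")
        for item in value:
            node.append(to_xml(item))
        return node
    if value is None:
        return ET.Element("null")
    node = ET.Element(type(value).__name__)  # int, float, str, bool
    node.text = str(value)
    return node


def build_document(objects):
    """Emit one <object> per indirect object, keyed by (number, generation)."""
    root = ET.Element("pdf")
    for (num, gen), value in sorted(objects.items()):
        ET.SubElement(root, "object", num=str(num), gen=str(gen)).append(to_xml(value))
    return root


def resolve(root, ref):
    """Follow a <ref> element to the <object> it points at."""
    return root.find(f"./object[@num='{ref.get('num')}'][@gen='{ref.get('gen')}']")


# Made-up objects, including a non-dictionary indirect object (3 0 obj).
objects = {
    (1, 0): {"Type": "/Catalog", "Pages": Ref(2)},
    (2, 0): {"Type": "/Pages", "Count": 0, "Kids": []},
    (3, 0): 42,
}
root = build_document(objects)
pages_ref = root.find(".//entry[@key='Pages']/ref")
print(ET.tostring(resolve(root, pages_ref), encoding="unicode"))
```

Because non-dictionary objects get their own `object` element as well, the point raised about indirect numbers, strings and nulls is covered by the same flat layout.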