What about non-dictionary objects with indirect IDs? Remember that ANY PDF object (null, number, string, etc.) can be indirect.
And how do you handle recursion? Or do you simply treat each indirect object as unique and not related?

Leonard

On 11/2/11 9:00 AM, "Nedim Srndic" <[email protected]> wrote:

>I am doing some research on the structure of PDF files. I wrote a
>utility to convert the object (i.e., dictionary) structure of PDFs into
>XML so that I can query the structure using XPath or similar query
>languages. I also care about the context, and the context can be rebuilt
>from the resulting XML when necessary.
>
>Nedim
>
>On Tue, 2011-11-01 at 05:26 -0700, Leonard Rosenthol wrote:
>> Why would you iterate over the objects w/o any understanding of their
>> context? Wouldn't it make MUCH MORE sense to "walk the tree" - starting
>> at the Catalog/Root and then simply recursing down the object tree based
>> on known relationships?
>>
>> What use are the objects w/o context?
>>
>> Leonard
>>
>> On 11/1/11 7:55 AM, "Nedim Srndic" <[email protected]> wrote:
>>
>> >I'm sorry, I see now that I wasn't clear enough. I would like to
>> >enumerate every PDF dictionary from a given PDF file, including but not
>> >limited to the Catalog, Pages, Actions, Annotations, Name tree -
>> >everything. Currently I can successfully do that for all dictionaries
>> >that can be located using XRef, but it seems that indirect objects
>> >inside object streams cannot be found this way. I could obviously test
>> >if any of the objects pointed to by the XRef is an object stream and get
>> >all the objects from the stream, but I'm wondering if Poppler has a more
>> >elegant solution.
>> >
>> >Nedim
>> >
>> >On Mon, 2011-10-31 at 11:12 -0700, Josh Richardson wrote:
>> >> What kinds of objects are you interested in? I have a version of
>> >> pdftohtml which I believe is not yet merged into the master repo that
>> >> extracts images and fonts.
>> >>
>> >> --josh
>> >>
>> >> On 10/31/11 9:16 AM, "Nedim Srndic" <[email protected]> wrote:
>> >>
>> >> >Dear list,
>> >> >
>> >> >I am using the Poppler library (in the src/poppler folder, no bindings,
>> >> >version 7 from the Ubuntu 10.10 repos) and would like to retrieve all
>> >> >objects from a PDF file. Currently, I am running a loop on XRef and
>> >> >getting all the non-null objects from it, but it doesn't seem to
>> >> >retrieve objects from object streams. What solution would you propose
>> >> >for this problem?
>> >> >
>> >> >Thanks,
>> >> >Nedim Srndic
>> >> >
>> >> >_______________________________________________
>> >> >poppler mailing list
>> >> >[email protected]
>> >> >http://lists.freedesktop.org/mailman/listinfo/poppler
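
To make the recursion and "ANY object can be indirect" questions from this thread concrete, here is a minimal sketch of walking the tree from the Catalog with a visited set. It is a toy model only, not Poppler's API: the Ref class, the walk function and the objects table are all hypothetical names made up for illustration, and the parsed PDF is assumed to be available as a plain Python dict keyed by (object number, generation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference (object number, generation)."""
    num: int
    gen: int = 0


def walk(objects, root):
    """Return every indirect object reachable from root, in visit order.

    objects is an assumed table mapping Ref -> parsed value, where a value
    may be a dict, a list, another Ref, or a scalar (None, number, string).
    """
    visited = set()
    order = []
    stack = [root]
    while stack:
        value = stack.pop()
        if isinstance(value, Ref):
            if value in visited:
                continue  # already seen: this is what breaks /Parent <-> /Kids cycles
            visited.add(value)
            order.append(value)
            # A reference to an undefined object is treated as null, so use .get()
            stack.append(objects.get(value))
        elif isinstance(value, dict):
            stack.extend(value.values())
        elif isinstance(value, list):
            stack.extend(value)
        # Scalars have no children, so there is nothing more to push.
    return order


# Toy document: a cycle (Page -> Parent -> Pages), an indirect number,
# and one object that nothing reachable from the Catalog points to.
objects = {
    Ref(1): {"Type": "Catalog", "Pages": Ref(2)},
    Ref(2): {"Type": "Pages", "Kids": [Ref(3)], "Count": Ref(4)},
    Ref(3): {"Type": "Page", "Parent": Ref(2)},
    Ref(4): 1,
    Ref(5): {"Type": "Orphan"},
}

reachable = walk(objects, Ref(1))
unreachable = set(objects) - set(reachable)
print([r.num for r in reachable])
print([r.num for r in unreachable])
```

In this toy case the walk terminates despite the Parent/Kids cycle, and object 5 shows the difference between the two approaches discussed above: a loop over the cross-reference table would list it, while a walk from the Catalog never reaches it.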
