Hi Deri,

Thanks for trying it out.

On 09/10/17 01:21, Deri James wrote:
> Some pdfs I have tried fail with "syntax error".

That's yacc's default behaviour, when the sequence of tokens returned
by the lexer doesn't conform to its notion of a valid grammar -- either
the order isn't as expected, or the sequence is incomplete.

> It seems to occur if MediaBox is defined in an ancestor object rather
> than in a "/Page object. There are a number of page attributes which
> are inheritable in this way, MediaBox is one of them.

I do know that, thanks; it is a configuration which I did test,
(albeit with contrived, hand crafted test files):

  $ ./psbb *.pdf
  inherited.pdf: bounding box = (0,0)..(612,792)
  minimal.pdf: bounding box = (0,0)..(612,792)
  override.pdf: bounding box = (0,0)..(606,809)

> So in case a MediaBox is superseded by an entry further down the tree
> you still have to continue looking till you get to the object for
> page 1, to make sure.

And this is exactly what my code does! (To be precise, it parses the
trailer dictionary, to locate the /Catalog object, whence it follows
the indirect object reference to the top level /Pages object, and
thence, it follows the chain of the first /Kids references, through as
many /Pages objects as it may find, until it finds the first /Page
object. In each /Pages object it traverses, it evaluates any /MediaBox
specifications it may find; at each lower level, any such specification
overrides any which was evaluated at a higher level. Thus, when the
/Page object is parsed, the last /MediaBox encountered -- which may be
within the /Page object itself, or in its nearest /Pages ancestor which
specified one -- will prevail).

Perhaps, you could:

  $ make clean
  $ make CFLAGS=-DDEBUGGING

and check your failing PDFs again, so we can see whatever unexpected
token sequence is leading to the "syntax error"; only when we know
that, will we have any chance of handling it, before the parser simply
gives up on the offending PDF.

--
Regards,
Keith.
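
For illustration, the traversal described above (trailer, then /Catalog,
then the top level /Pages, then the chain of first /Kids down to the
first /Page, with each /MediaBox overriding any seen higher up) can be
sketched as below. This is only a minimal sketch in Python, not the
yacc-based code under discussion. It assumes the PDF objects have
already been parsed into a dictionary keyed by object number; that
layout, the function name first_page_mediabox, and the two sample
object tables are all hypothetical.

```python
def first_page_mediabox(objects, trailer):
    """Return the /MediaBox that applies to the first page, or None.

    `objects` is a hypothetical, already-parsed object table mapping
    object numbers to dictionaries; `trailer` is the trailer dictionary.
    Indirect references are modelled as plain object numbers.
    """
    catalog = objects[trailer["Root"]]      # trailer -> /Catalog
    node_ref = catalog["Pages"]             # /Catalog -> top level /Pages
    mediabox = None
    visited = set()                         # guard against a cyclic /Kids chain

    while node_ref not in visited:
        visited.add(node_ref)
        node = objects[node_ref]

        # A /MediaBox at this level overrides any seen at a higher level.
        if "MediaBox" in node:
            mediabox = node["MediaBox"]

        # Stop at the first /Page; the last /MediaBox seen prevails.
        if node.get("Type") == "Page":
            return mediabox

        kids = node.get("Kids", [])
        if not kids:
            return None                     # no page reachable by this chain
        node_ref = kids[0]                  # follow only the first /Kids entry

    raise ValueError("cycle detected in /Kids chain")


# Hypothetical object tables: one inherits the box, one overrides it.
inherited = {
    1: {"Type": "Catalog", "Pages": 2},
    2: {"Type": "Pages", "MediaBox": [0, 0, 612, 792], "Kids": [3]},
    3: {"Type": "Page"},
}
override = {
    1: {"Type": "Catalog", "Pages": 2},
    2: {"Type": "Pages", "MediaBox": [0, 0, 612, 792], "Kids": [3]},
    3: {"Type": "Page", "MediaBox": [0, 0, 606, 809]},
}

print(first_page_mediabox(inherited, {"Root": 1}))  # [0, 0, 612, 792]
print(first_page_mediabox(override, {"Root": 1}))   # [0, 0, 606, 809]
```

The sketch follows only the first /Kids entry at each level, since only
the first page matters here; it says nothing about how the real parser
tokenises the file, which is where a "syntax error" would arise.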
[Attachment: samples.tar.xz (application/xz)]