On 09/10/17 23:44, Deri James wrote:
> On Mon 09 Oct 2017 09:10:18 Keith Marshall wrote:
>> Perhaps, you could:
>>
>>   $ make clean
>>   $ make CFLAGS=-DDEBUGGING
>>
>> and check your failing PDFs again, so we can see whatever
>> unexpected token sequence is leading to the "syntax error"; only
>> when we know that, will we have any chance of handling it, before
>> the parser simply gives up on the offending PDF.
>
> Thought I'd better take this off list (it's a bit too "techy"
> perhaps), hope you don't mind.

Actually, I do mind ... and I completely disagree with your reasoning.
Certainly, some list members -- perhaps even a majority -- will not be
interested in the technical details, but there will surely be some who
may be interested, and who may even contribute constructively. Your
arbitrary decision to communicate privately denies *all* list members
the freedom to choose whether they wish to participate, or not, and it
denies *me* potential benefit from Eric Raymond's "many eyes make bugs
shallow" principle.

I will not publish your sample files, without your permission, but
otherwise, this belongs on the list, so I'm taking it back there.

> I ran psbb against the errant pdf with the lex debugging turned on
> and got this:-
>
> [derij@pip groff-psbb]$ ./psbb ../pdf
> 20: 18 0 R
> 17: return token PDFROOT (259)
> 17: return token VALUE (260)
> 17: return token VALUE (260)
> 10: return token 'R' (82)
> 20: 19
> 11: return token PDFOBJREF (263)
> 12: pdfseek to offset = 305005
> 13: return token VALUE (260)
> 13: return token VALUE (260)
> 13: pdfseek to offset = 305035
> 14: lookup object #1 @ 305015 within 0..19
> 14: 0000002355 00000 n --> 2355; 0 n
> 14: pdfseek to offset = 2355
> 15: return token VALUE (260)
> 15: return token VALUE (260)
> 16: return token PDFOBJECT (262)
> 16: object: 1; generation = 0
> 17: return token VALUE (260)
> 17: return token VALUE (260)
> 10: return token 'R' (82)
> psbb:t-psbb (t-psbb.cpp):193: syntax error
>
> Now I believe it located the xref section and then found the /Catalog
> (at offset 2355) but does not like something in it.

Right. After seeking to offset 2355, in state 14 (PDFGOXREF), the lexer
switches to state 15 (PDFGETOBJECT), where it reads the signature of the
object at that offset, then in state 16 (PDFSCANOBJECT), it checks that
it has actually found the object it expected (1 0 obj, in this case),
and proceeds to scan the object content.
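
For anyone following the log without the PDF specification to hand, the
xref lookup in state 14 can be reproduced with a little arithmetic. The
following is only a sketch in Python, not code from psbb; the function
names are hypothetical. It assumes the classic (uncompressed) xref table
format, in which every entry occupies exactly 20 bytes, and the figures
in the log are consistent with that assumption.

```python
XREF_ENTRY_SIZE = 20  # a classic xref entry is exactly 20 bytes, EOL included


def xref_entry_offset(table_start, first_obj, obj_num):
    """File offset of the xref entry for obj_num, in a subsection whose
    first entry (for object first_obj) begins at table_start."""
    return table_start + XREF_ENTRY_SIZE * (obj_num - first_obj)


def parse_xref_entry(entry):
    """Split an entry such as '0000002355 00000 n' into a tuple of
    (byte offset, generation number, in-use flag)."""
    offset, generation, kind = entry.split()
    if kind not in ("n", "f"):
        raise ValueError(f"unexpected xref entry type: {kind!r}")
    return int(offset), int(generation), kind == "n"


# Figures taken from the debug log quoted above.
print(xref_entry_offset(305015, 0, 1))         # 305035
print(parse_xref_entry("0000002355 00000 n"))  # (2355, 0, True)
```

The first result matches the "pdfseek to offset = 305035" line, and the
second matches the "--> 2355; 0 n" line, after which the lexer seeks to
offset 2355 to read object 1.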

As it does so, it will find a dictionary, which in the case of this
/Catalog object, is expected to include, at least a "/Type /Catalog"
entry, and a "/Pages n n R" entry. In the case of your PDF, it looks
like:

  1 0 obj
  << /Pages 2 0 R /Type /Catalog >>
  endobj

From state 16, the lexer passes through state 10 (PDFDICT), switching
to state 17 (PDFREFER) as soon as it encounters the /Pages key, whence
it returns a pair of VALUE tokens to the yacc stack, (which, prior to
this had been empty); control then reverts to state 10, whence the 'R'
token is returned, to complete the indirect reference for the /Pages
object. At this point, yacc throws the "syntax error", because there
is no rule in its grammar, to handle the token sequence:

  VALUE VALUE 'R'

Had the "/Type /Catalog" entry preceded the "/Pages 2 0 R" entry, within
the /Catalog object dictionary, then it would have caused the lexer to
return a PDFOBJREF token, *before* the /Pages object reference, yielding
a yacc stack state of:

  PDFOBJREF VALUE VALUE 'R'

for which a grammar rule has been specified, so the lexer would have
successfully followed the object reference. However, there is nothing
in the PDF specifications to require the /Type entry to precede the
/Pages, so we need a postfix equivalent rule, to accommodate:

  VALUE VALUE 'R' PDFOBJREF

Adding such a rule is sufficient to fix the issue, for all of your
sample PDF files, with two exceptions (see below).

> Unfortunately, my lexer foo is waning, well to be honest it never
> existed!!
>
> The attached archive holds some samples of two types of pdfs, either
> produced by gropdf or produced by cairo software. Inside the two
> subdirectories there are three types of files:-
>
> *-structure.pdf (these illustrate the structure of the pdf with
> similar name)
>
> *.pdf (these are the pdfs to run against psbb)
>
> *.mm (a program called "freemind" can open these files, they also
> illustrate the structure, but you can interactively click to
> open/close object nodes).
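
Returning briefly to the token ordering problem described before the
quoted passage above, its effect can be modelled in a few lines. This is
only an illustrative sketch in Python, not the yacc grammar itself: the
names REF, PREFIX_ONLY, WITH_POSTFIX and accepts() are hypothetical, and
matching a whole token sequence at once is a deliberate simplification
of how a yacc parser consumes tokens one at a time.

```python
# Tokens returned for an indirect reference such as "2 0 R".
REF = ["VALUE", "VALUE", "R"]

# The single rule originally present: PDFOBJREF arrives first.
PREFIX_ONLY = [["PDFOBJREF"] + REF]

# The same rule, plus its postfix equivalent.
WITH_POSTFIX = PREFIX_ONLY + [REF + ["PDFOBJREF"]]


def accepts(rules, tokens):
    """True if the token sequence exactly matches one of the rules."""
    return list(tokens) in rules


# "/Type /Catalog /Pages 3 0 R", as written by ghostscript.
type_first = ["PDFOBJREF", "VALUE", "VALUE", "R"]

# "/Pages 2 0 R /Type /Catalog", as written by gropdf.
pages_first = ["VALUE", "VALUE", "R", "PDFOBJREF"]

print(accepts(PREFIX_ONLY, type_first))    # True
print(accepts(PREFIX_ONLY, pages_first))   # False
print(accepts(WITH_POSTFIX, pages_first))  # True
```

With only the prefix rule, the gropdf ordering is rejected; with the
postfix equivalent added, both orderings are accepted.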
>
> In the gropdf directory the gropdf.pdf file is the one having
> problems, and the gs.pdf is the same file after running through
> ghostscript, which psbb handles perfectly. Both files load fine in
> acroread, which can be quite picky when it comes to syntax, although
> the probability is that gropdf is not quite standard enough.

The gropdf.pdf file has the /Catalog object structure I've illustrated
above. I guess passing it through ghostscript reversed the order of
the /Type and /Pages dictionary entries; inspection reveals it to be
thus:

  1 0 obj
  <</Type /Catalog /Pages 3 0 R /Metadata 23 0 R >>
  endobj

This would have worked anyway, with the original psbb grammar; adding
the additional postfix PDFOBJREF rule makes it work just as well for
the inverted order in the gropdf.pdf /Catalog dictionary. (This
inverted order may, perhaps, seem less logical, but it doesn't violate
the PDF standard, so we do need to accommodate it).

> The cairo directory contains two examples created by inkscape, psbb
> has a big problem with these.

The yacc grammar adjustment also fixes all but two of these: SJP.pdf
and SJP-Whole.pdf seem to confuse psbb, such that having followed
object references through the /Catalog and /Pages object, it gets into
an infinite loop, rescanning the first /Page object ad infinitum; it
appears to be confused by an embedded /Group dictionary, which places
the lexer in a state in which it overruns the "endobj" sentinel, and
reads ahead until it discovers the /Kids reference in the /Pages
object, (which actually appears later in the file than the /Page object
to which it refers), and follows that reference back to the /Page
object again, (and again, and again ...). I have an idea how to fix
this too ...

> I hope these are helpful to you, sorry for being a nuisance.
> Integrating pdf bounding boxes into groff would be a big benefit.
> These are the MediaBoxes which would be expected.

Thanks.
These are helpful to me, (but obviously not to others, unless you're
willing to distribute them). Regardless, I'll leave the analysis here,
for reference.

> [derij@pip Samples]$ pdfbb Cairo/*.pdf gropdf/*.pdf
> Processing 'Cairo/gropdf-pdf-structure.pdf'
> Cairo/gropdf-pdf-structure.pdf: MediaBox: 0,0,842,595
> Processing 'Cairo/gs-pdf-structure.pdf'
> Cairo/gs-pdf-structure.pdf: MediaBox: 0,0,842,595
> Processing 'Cairo/SJP.pdf'
> Cairo/SJP.pdf: MediaBox: 0,0,114.146561,115.235786
> Processing 'Cairo/SJP-structure.pdf'
> Cairo/SJP-structure.pdf: MediaBox: 0,0,842,595
> Processing 'Cairo/SJP-Whole.pdf'
> Cairo/SJP-Whole.pdf: MediaBox: 0,0,210.231384,138.239899
> Processing 'Cairo/SJP-Whole-structure.pdf'
> Cairo/SJP-Whole-structure.pdf: MediaBox: 0,0,842,595
> Processing 'gropdf/gropdf.pdf'
> gropdf/gropdf.pdf: MediaBox: 0,0,612,792
> Processing 'gropdf/gropdf-pdf-structure.pdf'
> gropdf/gropdf-pdf-structure.pdf: MediaBox: 0,0,842,595
> Processing 'gropdf/gs.pdf'
> gropdf/gs.pdf: MediaBox: 0,0,612,792
> Processing 'gropdf/gs-pdf-structure.pdf'
> gropdf/gs-pdf-structure.pdf: MediaBox: 0,0,842,595

--
Regards,
Keith.