Re: [Groff] PDFPIC macro

Keith Marshall Wed, 11 Oct 2017 02:10:19 -0700

On 09/10/17 23:44, Deri James wrote:
> On Mon 09 Oct 2017 09:10:18 Keith Marshall wrote:
>> Perhaps, you could:
>>
>>   $ make clean
>>   $ make CFLAGS=-DDEBUGGING
>>
>> and check your failing PDFs again, so we can see whatever 
>> unexpected token sequence is leading to the "syntax error"; only 
>> when we know that, will we have any chance of handling it, before 
>> the parser simply gives up on the offending PDF.
> 
> Thought I'd better take this off list (it's a bit too "techy"
> perhaps), hope you don't mind.


Actually, I do mind ... and I completely disagree with your reasoning.  
Certainly, some list members -- perhaps even a majority -- will not be 
interested in the technical details, but there will surely be some who 
may be interested, and who may even contribute constructively.  Your 
arbitrary decision to communicate privately denies *all* list members 
the freedom to choose whether they wish to participate, or not, and it 
denies *me* potential benefit from Eric Raymond's "many eyes make bugs 
shallow" principle.

I will not publish your sample files, without your permission, but 
otherwise, this belongs on the list, so I'm taking it back there.

> I ran psbb against the errant pdf with the lex debugging turned on
> and got this:-
> 
> [derij@pip groff-psbb]$ ./psbb ../pdf
> 20: 18 0 R
> 17: return token PDFROOT (259)
> 17: return token VALUE (260)
> 17: return token VALUE (260)
> 10: return token 'R' (82)
> 20: 19
> 11: return token PDFOBJREF (263)
> 12: pdfseek to offset = 305005
> 13: return token VALUE (260)
> 13: return token VALUE (260)
> 13: pdfseek to offset = 305035
> 14: lookup object #1 @ 305015 within 0..19
> 14: 0000002355 00000 n --> 2355; 0 n
> 14: pdfseek to offset = 2355
> 15: return token VALUE (260)
> 15: return token VALUE (260)
> 16: return token PDFOBJECT (262)
> 16: object: 1; generation = 0
> 17: return token VALUE (260)
> 17: return token VALUE (260)
> 10: return token 'R' (82)
> psbb:t-psbb (t-psbb.cpp):193: syntax error
> 
> Now I believe it located the xref section and then found the /Catalog
> (at offset 2355) but does not like something in it.

Right.  After seeking to offset 2355, in state 14 (PDFGOXREF), the lexer 
switches to state 15 (PDFGETOBJECT), where it reads the signature of the 
object at that offset, then in state 16 (PDFSCANOBJECT), it checks that 
it has actually found the object it expected (1 0 obj, in this case), 
and proceeds to scan the object content.  As it does so, it will find a 
dictionary which, in the case of this /Catalog object, is expected to 
include at least a "/Type /Catalog" entry and a "/Pages n n R" entry.  
In the case of your PDF, it looks like:

  1 0 obj << /Pages 2 0 R 
  /Type /Catalog
  >>
  endobj
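
As a side note on how offset 2355 was reached: the state 14 trace line 
"0000002355 00000 n --> 2355; 0 n" is the decoding of one cross-reference 
table entry.  The following is only a minimal Python sketch of that 
decoding, not psbb's own code; the function names are hypothetical, and 
the 20-byte entry stride is the fixed entry size for classic xref tables, 
which happens to agree with the seek from table base 305015 to 305035 
for object #1 in the trace.

```python
import re

# One entry of a classic PDF cross-reference table: a 10-digit byte
# offset, a 5-digit generation number, then 'n' (in use) or 'f' (free).
XREF_ENTRY = re.compile(r"(\d{10}) (\d{5}) ([nf])\s*")

def parse_xref_entry(line):
    """Return (offset, generation, in_use), or None if malformed."""
    m = XREF_ENTRY.fullmatch(line)
    if m is None:
        return None
    offset, generation, kind = m.groups()
    return int(offset), int(generation), kind == "n"

def entry_offset(table_base, first_object, object_number, entry_size=20):
    """File offset of the xref entry for object_number, assuming a
    subsection whose first entry (for first_object) is at table_base."""
    return table_base + entry_size * (object_number - first_object)

print(entry_offset(305015, 0, 1))              # 305035
print(parse_xref_entry("0000002355 00000 n"))  # (2355, 0, True)
```
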

From state 16, the lexer passes through state 10 (PDFDICT), switching 
to state 17 (PDFREFER) as soon as it encounters the /Pages key, whence 
it returns a pair of VALUE tokens to the yacc stack, (which, prior to 
this, had been empty); control then reverts to state 10, whence the 'R' 
token is returned, to complete the indirect reference for the /Pages 
object.  At this point, yacc throws the "syntax error", because there 
is no rule in its grammar, to handle the token sequence:

  VALUE VALUE 'R'

Had the "/Type /Catalog" entry preceded the "/Pages 2 0 R" entry, within 
the /Catalog object dictionary, then it would have caused the lexer to 
return a PDFOBJREF token, *before* the /Pages object reference, yielding 
a yacc stack state of:

  PDFOBJREF VALUE VALUE 'R'

for which a grammar rule has been specified, so the lexer would have 
successfully followed the object reference.  However, there is nothing 
in the PDF specifications to require the /Type entry to precede the 
/Pages, so we need a postfix equivalent rule, to accommodate:

  VALUE VALUE 'R' PDFOBJREF

Adding such a rule is sufficient to fix the issue, for all of your 
sample PDF files, with two exceptions (see below).
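
To make the order independence concrete, here is a toy Python sketch, 
assuming the token names from the trace above.  It is not the yacc rule 
itself, and match_object_reference is a hypothetical name; it merely 
accepts the two token orders discussed, and rejects the bare sequence 
which provoked the "syntax error".

```python
def match_object_reference(tokens):
    """tokens is a list of (kind, value) pairs.  Return (object,
    generation) for either accepted ordering, or None otherwise."""
    kinds = [kind for kind, _ in tokens]
    if kinds == ["PDFOBJREF", "VALUE", "VALUE", "R"]:
        body = tokens[1:3]      # prefix form: marker comes first
    elif kinds == ["VALUE", "VALUE", "R", "PDFOBJREF"]:
        body = tokens[0:2]      # postfix form: marker comes last
    else:
        return None             # e.g. a bare VALUE VALUE 'R'
    (_, obj), (_, gen) = body
    return obj, gen

# /Pages 2 0 R followed by /Type /Catalog, as in gropdf.pdf
postfix = [("VALUE", 2), ("VALUE", 0), ("R", None), ("PDFOBJREF", None)]
print(match_object_reference(postfix))   # (2, 0)
```
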

> Unfortunately, my lexer foo is waning, well to be honest it never
> existed!!
> 
> The attached archive holds some samples of two types of pdfs, either
> produced by gropdf or produced by cairo software.  Inside the two
> subdirectories there are three types of files:-
> 
> *-structure.pdf (these illustrate the structure of the pdf with
> similar name)
> 
> *.pdf (these are the pdfs to run against psbb)
> 
> *.mm (a program called "freemind" can open these files, they also
> illustrate the structure, but you can interactively click to
> open/close object nodes).
> 
> In the gropdf directory the gropdf.pdf file is the one having
> problems, and the gs.pdf is the same file after running through
> ghostscript, which psbb handles perfectly.  Both files load fine in
> acroread, which can be quite picky when it comes to syntax, although
> the probability is that gropdf is not quite standard enough.

The gropdf.pdf file has the /Catalog object structure I've illustrated 
above.  I guess passing it through ghostscript reversed the order of the 
/Type and /Pages dictionary entries; inspection reveals it to be thus:

  1 0 obj
  <</Type /Catalog /Pages 3 0 R
  /Metadata 23 0 R
  >>
  endobj

This would have worked anyway, with the original psbb grammar; adding 
the additional postfix PDFOBJREF rule makes it work just as well for the 
inverted order in the gropdf.pdf /Catalog dictionary.  (This inverted 
order may, perhaps, seem less logical, but it doesn't violate the PDF 
standard, so we do need to accommodate it).

> The cairo directory contains two examples created by inkscape, psbb
> has a big problem with these.

The yacc grammar adjustment also fixes all but two of these: SJP.pdf 
and SJP-Whole.pdf still confuse psbb.  Having followed the object 
references through the /Catalog and /Pages objects, it gets into an 
infinite loop, rescanning the first /Page object ad infinitum.  It 
appears to be confused by an embedded /Group dictionary, which leaves 
the lexer in a state in which it overruns the "endobj" sentinel, and 
reads ahead until it discovers the /Kids reference in the /Pages object, 
(which actually appears later in the file than the /Page object to which 
it refers); it then follows that reference back to the /Page object 
again, (and again, and again ...).  I have an idea how to fix this too ...
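
Whatever the eventual fix for the lexer state turns out to be, a generic 
guard against this class of failure is to remember which objects have 
already been visited, and to stop when one recurs.  The Python sketch 
below is an assumption offered for illustration, not necessarily the fix 
alluded to above; the mapping, the object numbers, and the name 
follow_references are all hypothetical.

```python
def follow_references(next_ref, start):
    """Follow a chain of object references from start.  next_ref maps
    an object number to the next object number, or lacks the key when
    the chain ends.  Raise ValueError on a revisit, rather than loop."""
    seen = set()
    visited = []
    current = start
    while current is not None:
        if current in seen:
            raise ValueError(
                "reference cycle: object %d visited twice" % current)
        seen.add(current)
        visited.append(current)
        current = next_ref.get(current)
    return visited

# /Catalog (1) -> /Pages (2) -> /Page (3), then the chain ends
print(follow_references({1: 2, 2: 3}, 1))   # [1, 2, 3]
```

With a mapping such as {1: 2, 2: 3, 3: 3}, modelling a /Page object which 
leads back to itself, the function raises ValueError instead of looping.
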

> I hope these are helpful to you, sorry for being a nuisance.
> Integrating pdf bounding boxes into groff would be a big benefit.
> These are the MediaBoxes which would be expected.

Thanks.  These are helpful to me, (but obviously not to others, unless 
you're willing to distribute them).  Regardless, I'll leave the analysis 
here, for reference.

> [derij@pip Samples]$ pdfbb Cairo/*.pdf gropdf/*.pdf
> Processing 'Cairo/gropdf-pdf-structure.pdf'
> Cairo/gropdf-pdf-structure.pdf: MediaBox: 0,0,842,595
> Processing 'Cairo/gs-pdf-structure.pdf'
> Cairo/gs-pdf-structure.pdf: MediaBox: 0,0,842,595
> Processing 'Cairo/SJP.pdf'
> Cairo/SJP.pdf: MediaBox: 0,0,114.146561,115.235786
> Processing 'Cairo/SJP-structure.pdf'
> Cairo/SJP-structure.pdf: MediaBox: 0,0,842,595
> Processing 'Cairo/SJP-Whole.pdf'
> Cairo/SJP-Whole.pdf: MediaBox: 0,0,210.231384,138.239899
> Processing 'Cairo/SJP-Whole-structure.pdf'
> Cairo/SJP-Whole-structure.pdf: MediaBox: 0,0,842,595
> Processing 'gropdf/gropdf.pdf'
> gropdf/gropdf.pdf: MediaBox: 0,0,612,792
> Processing 'gropdf/gropdf-pdf-structure.pdf'
> gropdf/gropdf-pdf-structure.pdf: MediaBox: 0,0,842,595
> Processing 'gropdf/gs.pdf'
> gropdf/gs.pdf: MediaBox: 0,0,612,792
> Processing 'gropdf/gs-pdf-structure.pdf'
> gropdf/gs-pdf-structure.pdf: MediaBox: 0,0,842,595

-- 
Regards,
Keith.
