Dear Glen,
PDFStreamParser only parses PDF content streams (the operator sequences
inside specific parts of a PDF, such as a page), not a complete PDF file.
As a starting point, take a look at CustomGraphicsStreamEngine and/or
CustomPageDrawer in the examples package.
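On the garbage you saw after PDFOperator{stream}: the dictionary in your output has /Filter /FlateDecode, so the bytes that follow are zlib-compressed data, and tokenizing them as operators cannot work. A minimal sketch using only java.util.zip (no PDFBox) shows the effect. The class name FlateDemo and the sample content stream are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical demo: FlateDecode data is unreadable until it is inflated. */
public class FlateDemo {

    /** Compresses bytes into zlib format, which is what a FlateDecode stream holds. */
    static byte[] deflate(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reverses deflate(): turns the compressed bytes back into readable operators. */
    static byte[] inflate(byte[] packed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(packed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up content stream; a real one comes from a page of the PDF.
        byte[] plain = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(plain);

        // The stored form is binary and not tokenizable as PDF operators.
        System.out.println("stored:  " + new String(packed, StandardCharsets.ISO_8859_1));
        // The decoded form is what a content stream parser expects to see.
        System.out.println("decoded: " + new String(inflate(packed), StandardCharsets.US_ASCII));
    }
}
```

With PDFBox you would normally not inflate by hand: COSStream.createInputStream() returns the decoded data. Note also that the stream in your output is an object stream (/Type /ObjStm), not page content, so even decoded it is not something to feed to PDFStreamParser.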
Also, PDFTextStripper will give you some ideas on how to process a PDF.
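To make that concrete, here is a rough sketch against the 2.0.x API (PDDocument.load is gone in 3.0). It subclasses PDFTextStripper and overrides writeString to look at the font size of each text chunk. The class name FontAwareStripper, the guessTag helper, and the 18pt/14pt thresholds are my own assumptions for illustration, not something PDFBox provides.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

/** Hypothetical sketch: prefixes each extracted text chunk with a guessed HTML tag. */
public class FontAwareStripper extends PDFTextStripper {

    public FontAwareStripper() throws IOException {
        super();
    }

    /** Assumed thresholds; tune them for the documents you actually have. */
    static String guessTag(float fontSizeInPt) {
        if (fontSizeInPt >= 18f) {
            return "h1";
        }
        if (fontSizeInPt >= 14f) {
            return "h2";
        }
        return "p";
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions)
            throws IOException {
        // Use the first glyph of the chunk as representative for its font size.
        float size = textPositions.isEmpty() ? 0f : textPositions.get(0).getFontSizeInPt();
        // Emit a plain marker, not real HTML: no escaping or grouping is done here.
        super.writeString("[" + guessTag(size) + "] " + text);
    }

    /** Loads the whole document (not a single stream) and returns the annotated text. */
    public static String extract(File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            return new FontAwareStripper().getText(doc);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extract(new File(args[0])));
    }
}
```

writeString may be called per word rather than per line, so a real converter would have to group neighbouring chunks with the same tag and escape the text before emitting HTML.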
Best regards,
Maruan
> I'm trying to examine an existing PDF file. Initially to extract text and
> maybe images, but ultimately to apply some logic to the formatting of the
> text to turn it into valid HTML with H1, H2, ul, li, etc. I thought I
> would start like this:
>
>     PDFStreamParser sParse = new PDFStreamParser(fileItem.get());
>     Object token = sParse.parseNextToken();
>     while (token != null) {
>         logger.info("token: " + token);
>         token = sParse.parseNextToken();
>     }
>
> That yields:
>
> file size: 5289793
> token: COSInt{6066}
> token: COSInt{0}
> token: PDFOperator{obj}
> token: COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{First}:COSInt{1193};COSName{Length}:COSInt{12594};COSName{N}:COSInt{98};COSName{Type}:COSName{ObjStm};}
> token: PDFOperator{stream}
> token: PDFOperator{hÞìÛ}
> token: COSNull{}
> token: PDFOperator{ ·½'à¯R—» '"Y¬}
> token: COSInt{7}
> token: PDFOperator{àà}
> Error trying to process request
> java.io.IOException: Error: Expected operator 'ID' actual='I6' at stream offset 125
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:311)
>
> I'm using PDFBox 2.0.19.
>
> I'm probably doing this wrong at many levels. When I went to look at the
> samples on the web site, the classes in the 1.8 samples don't exist any
> more. The link to the sources for 2.0 samples actually has 3.0 samples,
> whose classes don't exist yet. So I just kind of bumbled along looking at
> the source code and guessing.
>
> If I had to guess what I'm seeing, everything looks good up
> until PDFOperator{stream}, after which, it looks like all garbage until it
> blows up. What do I do now?
>
> Is there an example somewhere of how I should be doing this that you could
> just point me to? My sample file opens well in the Ubuntu 18.04 PDF viewer.
>
--
Maruan Sahyoun