Re: pdfbox parse error "Header doesn't contain versioninfo"

Tilman Hausherr Mon, 13 May 2019 08:54:19 -0700

On 13.05.2019 at 16:15, Zeke Steer wrote:

Hi,
I'm using the latest version of the pdfbox command line tools (pdfbox-app-2.0.15.jar) to extract the text from UK company annual reports. I invoke the command line tools from a Python script, extracting each page of the company annual report .pdf document in turn.
I've noticed that some pages of the annual reports aren't extracted correctly. I originally observed this problem in an earlier version of the command line tools (pdfbox-app-2.0.6.jar). However, moving to the latest version of the tools hasn't fixed the issue.
I've attached a sample report which consistently reproduces the issue (00054_CCH_Annual Report_2016-82-85.pdf). The report opens fine in Adobe Reader but pdfbox is unable to extract it. The issue manifests differently depending on whether the sequential (default) or non-sequential parser is used.


The non-sequential parser is the only parser in 2.0.*.

Using the -nonSeq option should produce a FileNotFoundException with 2.0.15; that is what I get.

"Header doesn't contain versioninfo" occurs with empty files. I suspect one of your calls used the PDF file as the destination and destroyed it.

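Since the extraction is driven from a Python script, a pre-flight check along these lines might catch the situation described above before PDFBox is invoked. This is only a minimal sketch: the function name is hypothetical, and treating a missing `%PDF-` marker in the first 1024 bytes as "damaged" is an assumption made for illustration, not a statement of how PDFBox itself decides.

```python
import os


def check_pdf_before_extraction(pdf_path, out_path):
    """Return a list of problems found; an empty list means the call looks safe.

    Hypothetical helper: it only guards against the two causes suspected in
    this thread (an emptied input file, or the input being used as output).
    """
    problems = []

    # Writing the text output onto the input path would destroy the PDF.
    if os.path.abspath(pdf_path) == os.path.abspath(out_path):
        problems.append("output path equals input path")

    if not os.path.isfile(pdf_path):
        problems.append("input file does not exist")
        return problems

    # An empty file cannot contain a header at all.
    if os.path.getsize(pdf_path) == 0:
        problems.append("input file is empty")
        return problems

    # Assumption: a usable PDF carries the %PDF- marker near its start.
    with open(pdf_path, "rb") as f:
        head = f.read(1024)
    if b"%PDF-" not in head:
        problems.append("no %PDF- header in the first 1024 bytes")

    return problems
```

Calling this for each report before extraction and logging any non-empty result would show whether the file on disk was already damaged by an earlier call.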

Re different text extractions, please read

https://pdfbox.apache.org/2.0/faq.html#text-extraction

Your PDF file attachment didn't get through; please upload it to a sharehoster.


Tilman

_Sequential Parser_
I was initially executing the following command with the -nonSeq flag unset:
java -jar pdfbox-app-2.0.15.jar ExtractText -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"
This would generate a large number of Unicode warnings in the console, e.g.:
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+36 (36) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+88 (88) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+71 (71) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+76 (76) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+87 (87) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+3 (3) in font Effra-Medium-Identity-H
The pdfbox output was missing a large amount of text present on the first page of the report. See the pdfbox output in the attached 1.txt file and compare this to the first page of the company annual report, also attached.
_Non-Sequential Parser_
I found the issue affected several of the annual reports within my dataset. Investigating further, I read about the non-sequential parser. You advise using this if the sequential parser fails so I tried executing the following command instead, with the -nonSeq flag set:
java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"
However, this consistently fails with a 'java.io.IOException: Error: Header doesn't contain versioninfo'. See the full exception stack trace below:
Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:221)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1070)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1008)
        at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:216)
        at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
        at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
failed to extract text from 'E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf': Command 'java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"' returned non-zero exit status 1.
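For reference, the per-page invocation from Python could look roughly like the sketch below. It is not the original script: the function names are hypothetical, and it simply assumes `java` is on the PATH. Arguments are passed as a list so that paths containing spaces need no manual quoting, and -nonSeq is left out because, per the reply above, the non-sequential parser is the only parser in 2.0.*.

```python
import subprocess


def build_extract_command(jar, pdf_path, out_path, page):
    """Build the ExtractText argument list for a single page (hypothetical helper)."""
    return [
        "java", "-jar", jar,
        "ExtractText",
        "-startPage", str(page),
        "-endPage", str(page),
        pdf_path,   # input PDF: must never be the same path as out_path
        out_path,   # text output for this page
    ]


def extract_page(jar, pdf_path, out_path, page):
    """Run ExtractText for one page and return (exit code, stderr text)."""
    result = subprocess.run(
        build_extract_command(jar, pdf_path, out_path, page),
        capture_output=True,  # keep warnings and stack traces for logging
        text=True,
    )
    return result.returncode, result.stderr
```

Returning the exit code and stderr, rather than raising on a non-zero status, keeps the PDFBox stack trace available next to the file name that triggered it.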
I found a similar issue reported on your JIRA issue tracker here: https://issues.apache.org/jira/browse/PDFBOX-4203?jql=text%20~%20%22versioninfo%22. However, it was closed without being resolved as the original reporter failed to provide a PDF document which reproduced the issue. Hopefully with the information I've supplied, you'll be able to reopen the bug and take another look.
Please can you keep me updated?

Many thanks,

Zeke

