Hello everyone,
I'm new, so please be gentle with me.
We are using PDFBox to extract text from a large number of PDFs (approx.
80,000) in preparation for indexing in Solr/Lucene.
To do this, we call
org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages() to get the page
count, then iterate over the pages and strip the contents with
PDFTextStripper one page at a time.
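
For illustration, the per-page loop described above can be sketched roughly
as follows. This is a minimal sketch, not the actual extractor code: the
PageSource interface, ExtractionResult and PerPageExtractor are hypothetical
names, and PageSource merely stands in for the PDFBox calls
(PDDocument.getNumberOfPages() and PDFTextStripper) so that the example
compiles without the library. It also shows one way a failing document could
be recorded and skipped so that the rest of a large batch keeps going.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a loaded PDF; real code would wrap PDDocument. */
interface PageSource {
    int pageCount();            // plays the role of PDDocument.getNumberOfPages()
    String pageText(int page);  // plays the role of stripping one page (1-based)
}

/** Holds the text of each extracted page, or the failure that stopped extraction. */
final class ExtractionResult {
    final List<String> pages = new ArrayList<String>();
    RuntimeException failure;   // stays null when every page was extracted

    boolean ok() {
        return failure == null;
    }
}

final class PerPageExtractor {
    /** Extracts text page by page; a runtime failure is recorded, not propagated. */
    static ExtractionResult extract(PageSource doc) {
        ExtractionResult result = new ExtractionResult();
        try {
            int count = doc.pageCount();  // the call that fails for some files
            for (int page = 1; page <= count; page++) {
                result.pages.add(doc.pageText(page));
            }
        } catch (RuntimeException e) {    // NullPointerException is a RuntimeException
            result.failure = e;
        }
        return result;
    }

    public static void main(String[] args) {
        PageSource twoPages = new PageSource() {
            public int pageCount() { return 2; }
            public String pageText(int page) { return "text of page " + page; }
        };
        ExtractionResult result = extract(twoPages);
        System.out.println("ok=" + result.ok() + ", pages=" + result.pages.size());
    }
}
```

With this shape, a caller can log the file name and the recorded failure for
the problem documents and carry on with the next file.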
The vast majority are fine, but approx. 0.8% fail with a
NullPointerException at
org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102).
I'm currently working from the trunk after seeing a similar problem in
the archives
(<http://mail-archives.apache.org/mod_mbox/incubator-pdfbox-dev/200809.mbox/%3cof15421546.54f415dc-on862574ba.006a9e36-862574ba.006ad...@uscmail.uscourts.gov%3e>)
but unfortunately it hasn't solved the issue.
The stack trace is:
Caused by: java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102)
    at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:754)
    at com.semantico.depp.extractor.PDFBoxPdfExtractor.writeText(PDFBoxPdfExtractor.java:71)
    at com.semantico.depp.extractor.PDFBoxPdfExtractor.extractText(PDFBoxPdfExtractor.java:56)
    at com.semantico.depp.task.JobTask.doJob(JobTask.java:129)
Having delved into the code, I found that the "page" variable is null when
page.getDictionaryObject( COSName.COUNT )).intValue()
is called in PDPageNode.getCount() (PDPageNode.java:102).
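
To make the failure mode concrete, here is a minimal sketch that uses a plain
Map as a stand-in for the page dictionary. It is not the PDFBox source:
PageCountLookup, unguardedCount and guardedCount are made-up names, and the
-1 sentinel is an assumption chosen only for illustration. It shows why an
unguarded lookup throws and what a defensive check might look like.

```java
import java.util.HashMap;
import java.util.Map;

final class PageCountLookup {
    /** Same shape as the failing call: no null check before dereferencing. */
    static int unguardedCount(Map<String, Object> pageDict) {
        // Throws NullPointerException if pageDict is null or has no "Count" entry.
        return ((Number) pageDict.get("Count")).intValue();
    }

    /** Defensive variant: returns -1 when the count cannot be read. */
    static int guardedCount(Map<String, Object> pageDict) {
        if (pageDict == null) {
            return -1;
        }
        Object count = pageDict.get("Count");
        return (count instanceof Number) ? ((Number) count).intValue() : -1;
    }

    public static void main(String[] args) {
        Map<String, Object> dict = new HashMap<String, Object>();
        dict.put("Count", Integer.valueOf(12));
        System.out.println("guarded, valid dictionary: " + guardedCount(dict));
        System.out.println("guarded, null dictionary: " + guardedCount(null));
    }
}
```

Whether returning a sentinel or throwing a more descriptive exception is the
better behaviour would depend on how callers are expected to treat such
files.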
I understand that not all PDFs can be supported, and to be honest I
think a 99.2% success rate is amazing. I just thought I would post this
in the hope that someone has come across it before.
Thanks for any help.
Regards,
Declan