NullPointerException on org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode)

Declan Newman Tue, 13 Jan 2009 05:29:00 -0800

Hello everyone,

I'm new, so please be gentle with me.

We are using PDFBox to extract text from a large number of PDFs (approx. 80,000) in preparation for indexing in Solr/Lucene.

To do this, we use the org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages() method to iterate over the pages and strip the contents using the PDFTextStripper a page at a time.

The vast majority are fine, but approx. 0.8% fail with a NullPointerException when execution reaches org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102).

I'm currently working from the trunk after seeing a similar problem in the archives (<http://mail-archives.apache.org/mod_mbox/incubator-pdfbox-dev/200809.mbox/%3cof15421546.54f415dc-on862574ba.006a9e36-862574ba.006ad...@uscmail.uscourts.gov%3e>), but unfortunately it hasn't solved the issue.


The stack trace is:

Caused by: java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102)
        at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:754)
        at com.semantico.depp.extractor.PDFBoxPdfExtractor.writeText(PDFBoxPdfExtractor.java:71)
        at com.semantico.depp.extractor.PDFBoxPdfExtractor.extractText(PDFBoxPdfExtractor.java:56)
        at com.semantico.depp.task.JobTask.doJob(JobTask.java:129)

Having delved into the code, the "page" variable is null when:

((COSNumber) page.getDictionaryObject( COSName.COUNT )).intValue()

is called in PDPageNode.getCount(PDPageNode)
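
As a stopgap while the root cause is unknown, a caller-side guard along the following lines might let the batch skip the affected files instead of aborting. This is only a minimal sketch and not PDFBox code: the SafePageCount class, the countOrSkip method and the -1 sentinel are hypothetical names I made up for illustration, and a null map stands in for the null "page" dictionary. In the real extractor the supplier would presumably be something like document::getNumberOfPages.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntSupplier;

// Hypothetical helper, not part of PDFBox.
class SafePageCount {

    // Returns the page count, or -1 (an assumed sentinel) when the
    // lookup fails with a NullPointerException.
    static int countOrSkip(IntSupplier pageCount) {
        try {
            return pageCount.getAsInt();
        } catch (NullPointerException e) {
            // Damaged page tree: signal "skip this document" to the caller.
            return -1;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a healthy page node dictionary.
        Map<String, Integer> goodPage = new HashMap<>();
        goodPage.put("Count", 3);

        // Stand-in for the failing case: the dictionary itself is null.
        Map<String, Integer> brokenPage = null;

        System.out.println("good: " + countOrSkip(() -> goodPage.get("Count")));
        System.out.println("broken: " + countOrSkip(() -> brokenPage.get("Count")));
    }
}
```

A caller would treat a negative result as "log the file name and move on", so the remaining documents still get indexed. Catching NullPointerException like this hides the underlying defect rather than fixing it, so it is meant as a workaround only.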

I understand that not all PDFs can be supported, and to be honest I think 99.2% is amazing. I just thought I would post this in the hope that someone has come across it before.


Thanks for any help.

Regards,

Declan
