Howdy Folks, I'm stumped and hope somebody can give me some clues on how to work around this occasional error I'm getting.
I've got a .NET console program that uses SolrNet to scour certain folders at certain times, extract text from PDF files, and index them. It succeeds on the majority of the files, but it fails on several test files. Though I'm new to this environment, I gather that the SolrNet library calls on Solr (v. 3.5.0) to do this, which in turn calls on the Tika library (v. 0.10), which calls on the PDFBox library (v. 1.6.0).

To try to isolate the problem, I took SolrNet and .NET out of the equation and switched to a Linux console. I downloaded pdfbox-app-1.6.0.jar and executed:

    java -jar pdfbox-app-1.6.0.jar ExtractText -console a.pdf

Everything worked fine. I moved up to Tika, downloaded tika-app-0.10.jar, and executed:

    java -jar tika-app-0.10.jar -t a.pdf

And again everything worked fine. I then executed:

    curl 'http://localhost:8993/solr/MyCore/update/extract?map.content=text&commit=true' -F file=@a.pdf

And it failed with the following output (note: the above command works fine with other PDF files, but fails on these few PDF files):

    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
    <title>Error 500 org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8

    org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
    Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:213)
        ... 22 more
    Caused by: java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.font.PDFont.getEncodingFromFont(PDFont.java:832)
        at org.apache.pdfbox.pdmodel.font.PDFont.determineEncoding(PDFont.java:293)
        at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:178)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:79)
        at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:139)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:109)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
        at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:441)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:365)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:321)
        at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:241)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:90)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        ... 25 more
    </title>
    </head>
    <body><h2>HTTP ERROR 500</h2>
    <p>Problem accessing /solr/karaoke/update/extract. Reason:
    <pre>
    [the same exception and stack trace shown in the <title> above is repeated here verbatim]

Can anybody explain to me what's going on here and how I can get around this problem?

Jon