I have a shell script set up to clear a Solr core and re-index a folder of PDF files nightly, like so:

    cd /opt/solr/ &&
      bin/post -c comox_core -host 67.231.17.10 -d "<delete><query>attr_is_pdf:true</query></delete>" &&
      bin/post -c comox_core -host 67.231.17.10 -filetypes pdf /home/townofco/public_html/modx/assets/pdfs -params "literal.is_pdf=true&uprefix=attr_"

All was working fine as far as I could tell, but now I'm getting errors. For every single file (558 of them) I'm getting something along these lines:

    SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 500 for URL: http://67.231.17.10:8983/solr/comox_core/update/extract?literal.is_pdf=true&uprefix=attr_&resource.name=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2F2016+Meeting+dates.pdf&literal.id=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2F2016+Meeting+dates.pdf
    SimplePostTool: WARNING: Solr returned an error #500 (Server Error) for url: http://67.231.17.10:8983/solr/comox_core/update/extract?literal.is_pdf=true&uprefix=attr_&resource.name=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2FTips+on+packing+your+blue+box+for+a+windy+day.pdf&literal.id=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2FTips+on+packing+your+blue+box+for+a+windy+day.pdf
    SimplePostTool: WARNING: Response: <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
    <title>Error 500 Server Error</title>
    </head>
    <body>
    HTTP ERROR 500
    <p>Problem accessing /solr/comox_core/update/extract.
    Reason: <pre> Server Error</pre></p>
    Caused by: <pre>java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.PDPage
        at org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:217)
        at org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:185)
        at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:212)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2053)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:518)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
        at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
        at java.lang.Thread.run(Thread.java:745)
    </pre>
    </body>
    </html>

Does anyone know what is happening and how to fix it?

After I run the script I get the COMMIT confirmation saying 558 files committed, but in my Solr Admin page there are only 162 showing, and if I search for a specific text string from one of the PDF files, it does not get returned.

EDIT: I should add that if I search for the title of a PDF file, it DOES get returned...

I checked my lib dir in the solrconfig.xml file and everything looks fine.
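
To put numbers on the gap between the 558 files reported and the 162 documents showing, a rough sketch like the one below might help. It rests on assumptions: the bin/post output has been saved to a file (post.log is a hypothetical name), the helper names count_post_errors and count_indexed_pdfs are made up for illustration, port 8983 is taken from the URLs in the warnings, and the attr_is_pdf:true query is the same one the delete step uses.

```shell
#!/bin/sh
# Sketch only: helper names and the post.log file name are hypothetical.

# Count how many uploads Solr rejected, reading saved bin/post output on stdin.
# grep -c prints 0 and exits non-zero when nothing matches, so "|| true"
# keeps the function's exit status clean in that case.
count_post_errors() {
  grep -c 'SimplePostTool: WARNING: Solr returned an error' || true
}

# Ask Solr how many documents currently match the query used by the delete step.
# $1 = host, $2 = core. rows=0 returns only the header and the numFound count.
count_indexed_pdfs() {
  curl -sG "http://$1:8983/solr/$2/select" \
    --data-urlencode "q=attr_is_pdf:true" \
    --data-urlencode "rows=0" \
    --data-urlencode "wt=json"
}

# Possible usage (not executed here):
#   bin/post -c comox_core ... 2>&1 | tee post.log
#   count_post_errors < post.log
#   count_indexed_pdfs 67.231.17.10 comox_core
```

Comparing the rejected count with the numFound value in the select response should show whether the documents that do appear came from this run or are left over from somewhere else.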

Here's my ExtractRequestHandler:

    <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="fmap.meta">ignored_</str>
        <str name="fmap.content">_text_</str>
      </lst>
    </requestHandler>

EDIT 2: When I try to run a query in the Solr Admin using the /update/extract request handler, I get the following returned:

    {
      "responseHeader":{
        "status":400,
        "QTime":0},
      "error":{
        "metadata":[
          "error-class","org.apache.solr.common.SolrException",
          "root-error-class","org.apache.solr.common.SolrException"],
        "msg":"missing content stream",
        "code":400}}

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-PDF-parsing-failing-with-java-error-tp4342909.html
Sent from the Solr - User mailing list archive at Nabble.com.