Solr PDF parsing failing with java error

MatthewMeredith Mon, 26 Jun 2017 17:48:07 -0700

I have a shell script that clears a Solr core and re-indexes a folder of
PDF files nightly, like so:


cd /opt/solr/ &&
bin/post -c comox_core -host 67.231.17.10 \
    -d "<delete><query>attr_is_pdf:true</query></delete>" &&
bin/post -c comox_core -host 67.231.17.10 -filetypes pdf \
    /home/townofco/public_html/modx/assets/pdfs \
    -params "literal.is_pdf=true&uprefix=attr_"

As far as I could tell this was working fine, but now I'm getting errors. For
every single file (558 of them) I get something along these lines:

SimplePostTool: WARNING: IOException while reading response:
java.io.IOException: Server returned HTTP response code: 500 for URL:
http://67.231.17.10:8983/solr/comox_core/update/extract?literal.is_pdf=true&uprefix=attr_&resource.name=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2F2016+Meeting+dates.pdf&literal.id=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2F2016+Meeting+dates.pdf
SimplePostTool: WARNING: Solr returned an error #500 (Server Error) for url:
http://67.231.17.10:8983/solr/comox_core/update/extract?literal.is_pdf=true&uprefix=attr_&resource.name=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2FTips+on+packing+your+blue+box+for+a+windy+day.pdf&literal.id=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2FTips+on+packing+your+blue+box+for+a+windy+day.pdf
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 500 Server Error</title>
</head>
<body>
HTTP ERROR 500

<p>Problem accessing /solr/comox_core/update/extract. Reason:
<pre>    Server Error</pre></p>
Caused by:
<pre>java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.PDPage
    at org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:217)
    at org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:185)
    at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:212)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2053)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:518)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
    at java.lang.Thread.run(Thread.java:745)
</pre>

</body>
</html>
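
A note on the error itself: "NoClassDefFoundError: Could not initialize class ..." is what the JVM reports when an earlier attempt to initialize that class has already failed, so the root cause is usually recorded once, earlier in the server log, typically as an ExceptionInInitializerError or a different NoClassDefFoundError. A minimal sketch for locating that first entry follows. `first_init_failure` is a hypothetical helper (not part of Solr), and the log path in the example call is an assumption that may differ per install.

```shell
#!/bin/sh
# Hypothetical helper: show the earliest class-initialization failure in a log file.
# Prints the first matching line (prefixed with its line number) and up to
# 5 lines of trailing context.
first_init_failure() {
    log_file="$1"
    grep -n -m 1 -A 5 -E 'ExceptionInInitializerError|NoClassDefFoundError' "$log_file"
}

# Example call; the log path is an assumption:
#   first_init_failure /opt/solr/server/logs/solr.log
```

If the log has been rotated since Solr was last started, the first entry may be in an older log file.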

Does anyone know what is happening and how to fix it? After I run the script
I get the COMMIT confirmation saying 558 files committed, but my Solr Admin
page shows only 162 documents, and searching for a specific text string from
inside one of the PDF files returns nothing.
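
Because the commit confirmation appears even though individual files failed, one way to make the nightly job notice is to save the output of the script and count the error warnings. A minimal sketch, assuming the output was captured to a file: `count_post_errors`, `reindex_pdfs.sh`, and `post_output.log` are hypothetical names, and the pattern is taken from the warning text quoted above.

```shell
#!/bin/sh
# Hypothetical helper: count per-file failures in saved bin/post output.
count_post_errors() {
    post_log="$1"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'SimplePostTool: WARNING: Solr returned an error' "$post_log"
}

# Example use (script and log names are assumptions):
#   sh reindex_pdfs.sh > post_output.log 2>&1
#   [ "$(count_post_errors post_output.log)" -eq 0 ] || echo "some PDFs failed to index" >&2
```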

EDIT: I should add that if I search for the title of a PDF file, it DOES get
returned...
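
To see how many PDF documents the index actually holds, independent of the Admin page, a count query against the select handler may help. A minimal sketch, assuming port 8983 as seen in the error URLs and the attr_is_pdf field used in the delete query; `pdf_count_url` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: build a select URL that counts indexed PDFs.
# rows=0 asks for no documents, only the match count (numFound in the response).
pdf_count_url() {
    host="$1"; core="$2"
    printf 'http://%s:8983/solr/%s/select?q=attr_is_pdf:true&rows=0&wt=json\n' "$host" "$core"
}

# Example call (needs network access to the Solr server):
#   curl -s "$(pdf_count_url 67.231.17.10 comox_core)"
```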

I checked the lib dir entries in my solrconfig.xml file and everything looks
fine. Here's my ExtractingRequestHandler configuration:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>

EDIT 2: When I try to run a query in the Solr Admin using the
/update/extract request handler, I get the following returned:

{
  "responseHeader":{
    "status":400,
    "QTime":0},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"missing content stream",
    "code":400}}

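That 400 response is probably expected: /update/extract indexes an uploaded document, and a query sent from the Admin UI carries no file, hence "missing content stream". To exercise the handler with a single file outside the nightly script, something like the following could be used. This is a sketch under assumptions: `extract_url` and `post_one_pdf` are hypothetical helpers, the document id is arbitrary, and `myfile` is just a form-field name.

```shell
#!/bin/sh
# Hypothetical helper: build the extract URL for one test document.
# doc_id must not contain spaces or other characters that need URL encoding.
extract_url() {
    host="$1"; core="$2"; doc_id="$3"
    printf 'http://%s:8983/solr/%s/update/extract?literal.id=%s&literal.is_pdf=true&uprefix=attr_&commit=true\n' \
        "$host" "$core" "$doc_id"
}

# Hypothetical helper: upload one PDF as a multipart form field so the
# handler receives a content stream.
post_one_pdf() {
    host="$1"; core="$2"; doc_id="$3"; pdf_path="$4"
    curl -s "$(extract_url "$host" "$core" "$doc_id")" -F "myfile=@${pdf_path}"
}

# Example call (needs network access; the id and file are placeholders):
#   post_one_pdf 67.231.17.10 comox_core test-doc-1 /path/to/one.pdf
```

If a single upload fails with the same 500 error, the problem lies in the server-side extraction rather than in the nightly script.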


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-PDF-parsing-failing-with-java-error-tp4342909.html
Sent from the Solr - User mailing list archive at Nabble.com.
