Re: Solr PDF parsing failing with java error

Erick Erickson Mon, 26 Jun 2017 20:22:36 -0700

Well, assuming you didn't, say, install a new Solr or some such, it
looks like somebody removed some of the jar files that Tika depends
on; they live in the contrib area. Or somebody changed the
solrconfig.xml file so that it no longer contains the <lib...>
directives, something like:


  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
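
A quick way to check the first possibility is to count the jars those
two <lib> directives would pick up. This is only a minimal sketch: the
function names count_jars and check_extraction_jars are made up for
illustration, the patterns are grep -E approximations of the regex
attributes above (so \d is written as [0-9]), and /opt/solr is assumed
to be the install directory because that is where the script in the
original post runs from.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Solr.

# Print how many file names in directory $1 match extended regex $2.
# Prints 0 if the directory does not exist.
count_jars() {
  dir=$1
  pattern=$2
  if [ ! -d "$dir" ]; then
    echo 0
    return 0
  fi
  # grep -c prints the number of matching names; it exits non-zero when
  # the count is 0, so "|| true" keeps the function's status at 0.
  ls -1 "$dir" | grep -E -c -- "$pattern" || true
}

# Report the jar counts under the Solr install directory given as $1.
# Returns 0 only if both locations contain at least one matching jar.
check_extraction_jars() {
  solr_dir=$1
  lib_count=$(count_jars "$solr_dir/contrib/extraction/lib" '\.jar$')
  cell_count=$(count_jars "$solr_dir/dist" '^solr-cell-[0-9].*\.jar$')
  echo "contrib/extraction/lib jars: $lib_count"
  echo "dist solr-cell jars: $cell_count"
  [ "$lib_count" -gt 0 ] && [ "$cell_count" -gt 0 ]
}
```

Usage would be something like
check_extraction_jars /opt/solr || echo "extraction jars missing".
A non-zero count only says the files are present, not that they are the
right versions or that the core's solrconfig.xml actually points at them.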

BTW, for various reasons I prefer to do the heavy Tika lifting on a
client rather than use Solr's extracting request handler; see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

That said, it's up to you.

Best,
Erick

On Mon, Jun 26, 2017 at 4:21 PM, MatthewMeredith
<matthewmeredith...@gmail.com> wrote:
> I have a shell script set up to clear a solr core and re-index a folder of
> PDF files nightly like so:
>
> cd /opt/solr/ &&
> bin/post -c comox_core -host 67.231.17.10 \
>   -d "<delete><query>attr_is_pdf:true</query></delete>" &&
> bin/post -c comox_core -host 67.231.17.10 -filetypes pdf \
>   /home/townofco/public_html/modx/assets/pdfs \
>   -params "literal.is_pdf=true&uprefix=attr_"
>
> All was working fine as far as I could tell, but now I'm getting errors. For
> every single file (558 of them) I'm getting something along these lines:
>
> SimplePostTool: WARNING: IOException while reading response:
> java.io.IOException: Server returned HTTP response code: 500 for URL:
> http://67.231.17.10:8983/solr/comox_core/update/extract?literal.is_pdf=true&uprefix=attr_&resource.name=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2F2016+Meeting+dates.pdf&literal.id=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2F2016+Meeting+dates.pdf
> SimplePostTool: WARNING: Solr returned an error #500 (Server Error) for url:
> http://67.231.17.10:8983/solr/comox_core/update/extract?literal.is_pdf=true&uprefix=attr_&resource.name=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2FTips+on+packing+your+blue+box+for+a+windy+day.pdf&literal.id=%2Fhome%2Ftownofco%2Fpublic_html%2Fmodx%2Fassets%2Fpdfs%2FTips+on+packing+your+blue+box+for+a+windy+day.pdf
> SimplePostTool: WARNING: Response: <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
> <title>Error 500 Server Error</title>
> </head>
> <body>
> HTTP ERROR 500
>
> <p>Problem accessing /solr/comox_core/update/extract. Reason:
> <pre>    Server Error</pre></p>
> Caused by:
> <pre>java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.PDPage
>     at org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:217)
>     at org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:185)
>     at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:212)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:2053)
>     at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)
>     at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     at org.eclipse.jetty.server.Server.handle(Server.java:518)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
>     at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>     at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
>     at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
>     at java.lang.Thread.run(Thread.java:745)
> </pre>
>
> </body>
> </html>
>
> Does anyone know what is happening and how to fix it? After I run the script
> I get the COMMIT confirmation saying 558 files committed, but in my Solr
> Admin page there are only 162 showing and if I search for a specific text
> string from one of the PDF files, it does not get returned.
>
> EDIT: I should add that if I search for the title of a PDF file, it DOES get
> returned...
>
> I checked my lib dir in the solrconfig.xml file and everything looks fine.
> Here's my ExtractingRequestHandler:
>
> <requestHandler name="/update/extract"
>                   startup="lazy"
>                   class="solr.extraction.ExtractingRequestHandler" >
>     <lst name="defaults">
>       <str name="lowernames">true</str>
>       <str name="fmap.meta">ignored_</str>
>       <str name="fmap.content">_text_</str>
>     </lst>
>   </requestHandler>
>
> EDIT 2: When I try to run a query in the Solr Admin using the
> /update/extract request handler, I get the following returned:
>
> {
>   "responseHeader":{
>     "status":400,
>     "QTime":0},
>   "error":{
>     "metadata":[
>       "error-class","org.apache.solr.common.SolrException",
>       "root-error-class","org.apache.solr.common.SolrException"],
>     "msg":"missing content stream",
>     "code":400}}
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-PDF-parsing-failing-with-java-error-tp4342909.html
> Sent from the Solr - User mailing list archive at Nabble.com.
