It wasn't just a single file; dozens of files were failing toward the end, just before I killed the process.

IPADDR - - [04/04/2011:17:17:03 +0000] "POST /solr/update/extract?literal.id=32-130-AFB-84&commit=false HTTP/1.1" 500 4558
IPADDR - - [04/04/2011:17:17:05 +0000] "POST /solr/update/extract?literal.id=32-130-AFC-84&commit=false HTTP/1.1" 500 4558
IPADDR - - [04/04/2011:17:17:09 +0000] "POST /solr/update/extract?literal.id=32-130-AFD-84&commit=false HTTP/1.1" 500 4557
IPADDR - - [04/04/2011:17:17:14 +0000] "POST /solr/update/extract?literal.id=32-130-AFE-84&commit=false HTTP/1.1" 500 4558
IPADDR - - [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-AFF-84&commit=false HTTP/1.1" 500 4558
IPADDR - - [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-B00-84&commit=false HTTP/1.1" 500 4557

That is by no means all of the errors; it is just a small sample. You can see they all returned HTTP 500. What is strange is that nearly every file succeeded up to about the 2200-file mark, and nearly every file after that failed.

~Brandon Waterloo
________________________________
From: Anuj Kumar [anujs...@gmail.com]
Sent: Monday, April 04, 2011 2:48 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

In the log messages are you able to locate the file at which it fails?

Looks like TIKA is unable to parse one of your PDF files for the details. We need to hunt that one out.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:

Looks like I'm using Tika 0.4:

apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
.../tika-parsers-0.4.jar

~Brandon Waterloo
________________________________________
From: Anuj Kumar [anujs...@gmail.com]
Sent: Monday, April 04, 2011 2:12 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

This is related to Apache TIKA.
Which version are you using? Please see this thread for more details:
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Hope it helps.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:

> Hey everybody,
>
> I've been running into some issues indexing a very large set of documents.
> There's about 4000 PDF files, ranging in size from 160MB to 10KB.
> Obviously this is a big task for Solr. I have a PHP script that iterates
> over the directory and uses PHP cURL to query Solr to index the files. For
> now, commit is set to false to speed up the indexing, and I'm assuming that
> Solr should be auto-committing as necessary. I'm using the default
> solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf. Once
> all the documents have been finished the PHP script queries Solr to commit.
>
> The main problem is that after a few thousand documents (around 2000 last
> time I tried), nearly every document begins causing Java exceptions in Solr:
>
> Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>     at org.mortbay.jetty.Server.handle(Server.java:285)
>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>     at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>     ... 23 more
> Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
>     at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
>     at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
>     at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
>     at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>     ... 25 more
>
> As far as I know there's nothing special about these documents so I'm
> wondering if it's not properly autocommitting. What would be appropriate
> settings in solrconfig.xml for this particular application? I'd like it to
> autocommit as soon as it needs to but no more often than that for the sake
> of efficiency. Obviously it takes long enough to index 4000 documents and
> there's no reason to make it take longer. Thanks for your help!
>
> ~Brandon Waterloo
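
On the question of locating the file at which indexing starts to fail: one possible approach is to scan the Jetty request log for /update/extract requests and list the literal.id values that came back with a 5xx status. The following is only a rough sketch in Python, assuming the log lines look like the sample quoted earlier in this thread. The names LOG_LINE, parse_extract_requests and summarize are made up for illustration and are not part of Solr, Tika or Jetty.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed log layout, based on the sample lines in this thread:
#   HOST - - [timestamp] "METHOD /path?query HTTP/1.1" STATUS SIZE
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3})\b'
)


def parse_extract_requests(lines):
    """Yield (timestamp, literal.id, status) for each /update/extract request."""
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not a request line we recognise
        url = urlsplit(match.group("url"))
        if not url.path.endswith("/update/extract"):
            continue
        ids = parse_qs(url.query).get("literal.id")
        if ids:
            yield match.group("ts"), ids[0], int(match.group("status"))


def summarize(lines):
    """Count extract requests and collect those that returned a 5xx status."""
    records = list(parse_extract_requests(lines))
    failed = [(ts, doc_id) for ts, doc_id, status in records if status >= 500]
    return {
        "total": len(records),
        "failed": failed,
        # First failing request in log order: the first document to inspect.
        "first_failure": failed[0] if failed else None,
    }
```

Calling summarize() on the open request log (for example summarize(open("request.log")), where the file name is a placeholder) returns the number of extract requests seen, the failing IDs with their timestamps, and the first failure in log order. The first failing ID is the one to map back to a PDF and check by hand.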