Hi Brandon,

Sorry, I can't make out much here. The exception shows a Tika error that indicates a PDF parsing problem; that's all I can tell. Maybe someone else on this mailing list can help.
Sorry.
- Anuj

On Tue, Apr 5, 2011 at 6:35 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:

> It wasn't just a single file; it was dozens of files, all having problems toward the end just before I killed the process.
>
> IPADDR - - [04/04/2011:17:17:03 +0000] "POST /solr/update/extract?literal.id=32-130-AFB-84&commit=false HTTP/1.1" 500 4558
> IPADDR - - [04/04/2011:17:17:05 +0000] "POST /solr/update/extract?literal.id=32-130-AFC-84&commit=false HTTP/1.1" 500 4558
> IPADDR - - [04/04/2011:17:17:09 +0000] "POST /solr/update/extract?literal.id=32-130-AFD-84&commit=false HTTP/1.1" 500 4557
> IPADDR - - [04/04/2011:17:17:14 +0000] "POST /solr/update/extract?literal.id=32-130-AFE-84&commit=false HTTP/1.1" 500 4558
> IPADDR - - [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-AFF-84&commit=false HTTP/1.1" 500 4558
> IPADDR - - [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-B00-84&commit=false HTTP/1.1" 500 4557
>
> That is by no means all the errors; that is just a sample of a few. You can see they all returned HTTP 500. What is strange is that nearly every file succeeded before about the 2200-file mark, and nearly every file after that failed.
>
> ~Brandon Waterloo
>
> ________________________________
> From: Anuj Kumar [anujs...@gmail.com]
> Sent: Monday, April 04, 2011 2:48 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> In the log messages, are you able to locate the file at which it fails? It looks like Tika is unable to parse one of your PDF files. We need to hunt that one out.
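[To hunt the failing files out of the access log mechanically, something like the following works. This is a sketch of mine, not from the thread; the regex assumes the Jetty access-log format shown in the sample lines above.]

```python
import re

# Captures the literal.id and the HTTP status code from access-log lines
# like the samples above. The pattern is an assumption based on those
# samples, not on the actual log configuration.
LOG_RE = re.compile(r'literal\.id=([^&\s"]+)\S*\s+HTTP/1\.[01]"\s+(\d{3})')

def failing_ids(log_lines):
    """Return the literal.id of every request that did not return a 2xx."""
    failed = []
    for line in log_lines:
        m = LOG_RE.search(line)
        if m and not m.group(2).startswith("2"):
            failed.append(m.group(1))
    return failed

sample = [
    'IPADDR - - [04/04/2011:17:17:03 +0000] "POST /solr/update/extract?literal.id=32-130-AFB-84&commit=false HTTP/1.1" 500 4558',
    'IPADDR - - [04/04/2011:17:17:05 +0000] "POST /solr/update/extract?literal.id=32-130-AFC-84&commit=false HTTP/1.1" 200 4558',
]
print(failing_ids(sample))  # ['32-130-AFB-84']
```

[Since the IDs in the log map one-to-one to files, the resulting list identifies exactly which PDFs to pull aside and inspect.]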
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:
> Looks like I'm using Tika 0.4:
> apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
> .../tika-parsers-0.4.jar
>
> ~Brandon Waterloo
>
> ________________________________________
> From: Anuj Kumar [anujs...@gmail.com]
> Sent: Monday, April 04, 2011 2:12 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> This is related to Apache Tika. Which version are you using? Please see this thread for more details:
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> Hope it helps.
>
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:
> > Hey everybody,
> >
> > I've been running into some issues indexing a very large set of documents. There are about 4000 PDF files, ranging in size from 160MB down to 10KB. Obviously this is a big task for Solr. I have a PHP script that iterates over the directory and uses PHP cURL to ask Solr to index each file. For now, commit is set to false to speed up the indexing, and I'm assuming that Solr should be auto-committing as necessary. I'm using the default solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf. Once all the documents have been processed, the PHP script sends Solr a commit.
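[The PHP script itself isn't shown in the thread. For reference, the loop Brandon describes might look roughly like this in Python; the base URL, ID scheme, and function names are my assumptions, and the point of the sketch is to collect failures per file instead of letting one bad PDF abort the run.]

```python
import os
import urllib.parse
import urllib.request

SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"  # assumed base URL

def extract_url(doc_id, commit=False):
    """Build an update/extract URL like the ones in the access log."""
    params = urllib.parse.urlencode(
        {"literal.id": doc_id, "commit": "true" if commit else "false"}
    )
    return f"{SOLR_EXTRACT}?{params}"

def index_directory(pdf_dir):
    """POST every PDF in pdf_dir; return (filename, error) pairs for failures."""
    failures = []
    for name in sorted(os.listdir(pdf_dir)):
        if not name.lower().endswith(".pdf"):
            continue
        with open(os.path.join(pdf_dir, name), "rb") as fh:
            data = fh.read()
        doc_id = os.path.splitext(name)[0]  # assumed: filename doubles as ID
        req = urllib.request.Request(
            extract_url(doc_id),
            data=data,
            headers={"Content-Type": "application/pdf"},
        )
        try:
            urllib.request.urlopen(req)
        except Exception as exc:  # an HTTP 500 from Tika lands here
            failures.append((name, str(exc)))
    return failures
```

[After the loop, a single request with `commit=true` (or an explicit `/solr/update` commit) flushes everything, which matches the script's described behavior; the `failures` list then tells you which documents to retry or inspect.]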
> > The main problem is that after a few thousand documents (around 2000 last time I tried), nearly every document begins causing Java exceptions in Solr:
> >
> > Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
> >     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> >     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> >     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> >     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> >     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> >     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
> >     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> >     at org.mortbay.jetty.Server.handle(Server.java:285)
> >     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> >     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
> >     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
> >     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
> >     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> >     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
> >     at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> > Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
> >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
> >     ... 23 more
> > Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
> >     at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
> >     at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
> >     at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
> >     at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
> >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
> >     ... 25 more
> >
> > As far as I know there's nothing special about these documents, so I'm wondering if it's not properly auto-committing. What would be appropriate settings in solrconfig.xml for this particular application? I'd like it to auto-commit as soon as it needs to, but no more often than that, for the sake of efficiency.
> > Obviously it takes long enough to index 4000 documents, and there's no reason to make it take longer. Thanks for your help!
> >
> > ~Brandon Waterloo
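[On the auto-commit question: if memory serves, the `<autoCommit>` block inside `<updateHandler>` is commented out in the stock Solr 1.4 example solrconfig.xml, so with `commit=false` on every request nothing is committed until the final explicit commit. Uncommenting it along these lines makes Solr flush periodically; the threshold values here are illustrative, not recommendations:]

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit automatically after 1000 buffered docs or 5 minutes,
       whichever comes first (illustrative values) -->
  <autoCommit>
    <maxDocs>1000</maxDocs>
    <maxTime>300000</maxTime> <!-- milliseconds -->
  </autoCommit>
</updateHandler>
```

[Note that auto-commit settings would not change the TikaException itself; that failure happens at parse time, before anything is buffered for commit.]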
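[The `expected='endobj'` IOException from PDFBox in the trace above often points at malformed or truncated PDFs rather than a commit problem. One crude way to sieve the corpus before indexing is to check for the `%%EOF` marker that a structurally complete PDF ends with. This heuristic is mine, not part of Tika or Solr, and it will not catch every kind of corruption:]

```python
def looks_truncated(path_or_bytes):
    """Heuristic: a complete PDF has an %%EOF marker near the end of the
    file; truncated downloads usually lack it. Checks the last 1 KiB."""
    if isinstance(path_or_bytes, bytes):
        tail = path_or_bytes[-1024:]
    else:
        with open(path_or_bytes, "rb") as f:
            f.seek(0, 2)                     # seek to end to get the size
            size = f.tell()
            f.seek(max(0, size - 1024))
            tail = f.read()
    return b"%%EOF" not in tail

good = b"%PDF-1.4\n...objects...\ntrailer\n%%EOF\n"
bad = b"%PDF-1.4\n...objects...\nendstream"
print(looks_truncated(good), looks_truncated(bad))  # False True
```

[Files this flags are worth opening in a desktop viewer; if they fail there too, the source copies are damaged and re-fetching them is likely cheaper than fighting the parser.]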