Looks like I'm using Tika 0.4:

apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
.../tika-parsers-0.4.jar

~Brandon Waterloo

________________________________________
From: Anuj Kumar [anujs...@gmail.com]
Sent: Monday, April 04, 2011 2:12 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

This is related to Apache TIKA. Which version are you using?
Please see this thread for more details-
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Hope it helps.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:

> Hey everybody,
>
> I've been running into some issues indexing a very large set of documents.
> There's about 4000 PDF files, ranging in size from 160MB to 10KB.
> Obviously this is a big task for Solr. I have a PHP script that iterates
> over the directory and uses PHP cURL to query Solr to index the files. For
> now, commit is set to false to speed up the indexing, and I'm assuming that
> Solr should be auto-committing as necessary. I'm using the default
> solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf. Once
> all the documents have been finished the PHP script queries Solr to commit.
>
> The main problem is that after a few thousand documents (around 2000 last
> time I tried), nearly every document begins causing Java exceptions in Solr:
>
> Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>     at org.mortbay.jetty.Server.handle(Server.java:285)
>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>     at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>     ... 23 more
> Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
>     at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
>     at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
>     at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
>     at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>     ... 25 more
>
> As far as I know there's nothing special about these documents so I'm
> wondering if it's not properly autocommitting. What would be appropriate
> settings in solrconfig.xml for this particular application? I'd like it to
> autocommit as soon as it needs to but no more often than that for the sake
> of efficiency. Obviously it takes long enough to index 4000 documents and
> there's no reason to make it take longer. Thanks for your help!
>
> ~Brandon Waterloo
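
For anyone following along, the indexing loop described in the quoted message (post each PDF with commit set to false, then commit once at the end) could be sketched roughly as below. This is not the original PHP script, which was never posted; it is a Python stand-in. The base URL `http://localhost:8983/solr`, the use of the file name as `literal.id`, and the helper names `extract_url`, `post`, and `index_directory` are all assumptions made for illustration. The one addition is that each failure is recorded per file, which may help show whether the exceptions follow particular PDFs or start after a certain number of documents.

```python
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path

# Assumption: Solr's bundled example server on its default port.
SOLR_URL = "http://localhost:8983/solr"


def extract_url(base, doc_id):
    """Build the /update/extract URL for one document, with commit disabled."""
    # Assumption: the file name is used as the unique document id.
    query = urllib.parse.urlencode({"literal.id": doc_id, "commit": "false"})
    return f"{base}/update/extract?{query}"


def post(url, body, content_type):
    """POST raw bytes to Solr; urlopen raises HTTPError on an error status."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_directory(directory, base=SOLR_URL, send=post):
    """Send every *.pdf in `directory` to Solr; return (path, reason) failures."""
    failed = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        try:
            # Note: read_bytes() loads the whole file into memory,
            # which matters for files in the 160MB range.
            send(extract_url(base, pdf.name), pdf.read_bytes(), "application/pdf")
        except (urllib.error.URLError, OSError) as exc:
            # Record the failure and continue with the next file.
            failed.append((pdf, str(exc)))
    # One explicit commit after all documents have been sent.
    send(f"{base}/update", b"<commit/>", "text/xml")
    return failed
```

Calling `index_directory` on the PDF directory returns the list of files Solr rejected along with the error text, so the failing documents can be opened and checked individually. The `send` parameter exists only so the loop can be exercised without a running Solr instance.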