Hey everybody, I've been running into some issues indexing a very large set of documents. There are about 4,000 PDF files, ranging in size from 10KB to 160MB. Obviously this is a big task for Solr. I have a PHP script that iterates over the directory and uses cURL to send each file to Solr for indexing. For now, commit is set to false to speed up indexing, and I'm assuming that Solr should be auto-committing as necessary. I'm using the default solrconfig.xml included in apache-solr-1.4.1\example\solr\conf. Once all the documents have been processed, the PHP script sends Solr a final commit.
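For reference, here's a stripped-down sketch of what the script does (the real script has more error handling; the paths, the literal.id field, and the /update/extract endpoint shown here are illustrative):

    <?php
    // Stripped-down sketch of the indexing loop. Assumes Solr's stock
    // ExtractingRequestHandler at /update/extract; the literal.id
    // parameter and paths are illustrative.
    $extract = 'http://localhost:8983/solr/update/extract';

    foreach (glob('/path/to/pdfs/*.pdf') as $file) {
        $url = $extract . '?' . http_build_query(array(
            'literal.id' => basename($file),
            'commit'     => 'false',  // defer commits during the bulk load
        ));

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_POST, true);
        // '@' prefix = multipart file upload (PHP 5.2/5.3-era cURL syntax)
        curl_setopt($ch, CURLOPT_POSTFIELDS, array('file' => '@' . $file));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_exec($ch);
        if (curl_errno($ch)) {
            echo "cURL error on $file: " . curl_error($ch) . "\n";
        }
        curl_close($ch);
    }

    // Single explicit commit once every file has been sent.
    $ch = curl_init('http://localhost:8983/solr/update?commit=true');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);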
The main problem is that after a few thousand documents (around 2,000 the last time I tried), nearly every document begins triggering Java exceptions in Solr:

Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
    ... 23 more
Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
    at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
    at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
    at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
    at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
    ... 25 more

As far as I know there's nothing special about these documents, so I'm wondering whether Solr simply isn't autocommitting properly. What would be appropriate settings in solrconfig.xml for this particular application? I'd like it to autocommit as soon as it needs to, but no more often than that, for the sake of efficiency.
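Something along these lines is what I had in mind, based on the commented-out autoCommit example in the stock solrconfig.xml (a sketch only; the maxDocs/maxTime thresholds below are guesses on my part, not recommendations):

    <!-- Sketch: autoCommit inside the updateHandler, per the example
         commented out in the stock 1.4.1 solrconfig.xml. Thresholds
         here are guesses. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>500</maxDocs>      <!-- commit after this many pending docs -->
        <maxTime>300000</maxTime>   <!-- or after 5 minutes, whichever comes first -->
      </autoCommit>
    </updateHandler>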
Indexing 4,000 documents already takes long enough, and there's no reason to make it take longer. Thanks for your help! ~Brandon Waterloo