It wasn't just a single file; dozens of files were failing toward the end,
just before I killed the process.

IPADDR -  -  [04/04/2011:17:17:03 +0000] "POST 
/solr/update/extract?literal.id=32-130-AFB-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:05 +0000] "POST 
/solr/update/extract?literal.id=32-130-AFC-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:09 +0000] "POST 
/solr/update/extract?literal.id=32-130-AFD-84&commit=false HTTP/1.1" 500 4557
IPADDR -  -  [04/04/2011:17:17:14 +0000] "POST 
/solr/update/extract?literal.id=32-130-AFE-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:21 +0000] "POST 
/solr/update/extract?literal.id=32-130-AFF-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:21 +0000] "POST 
/solr/update/extract?literal.id=32-130-B00-84&commit=false HTTP/1.1" 500 4557

That is by no means all of the errors; it is just a small sample.  You can
see they all returned HTTP 500 errors.  What is strange is that nearly every
file succeeded up to about the 2200-file mark, and nearly every file after
that failed.
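Since the failing ids are right there in the access log, one way to collect them for a retry pass is to filter on the 500 status. A sketch, assuming the log sits in a local `access.log` with one entry per line in the format shown above:

```shell
failed_ids() {
  # Pull the literal.id of every request that returned HTTP 500 out of
  # the given access log, so the failing PDFs can be retried one by one.
  # Assumes the combined-log format shown above, one entry per line.
  grep '" 500 ' "$1" | sed -n 's/.*literal\.id=\([^&]*\)&.*/\1/p' | sort -u
}

failed_ids access.log   # hypothetical log location
```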


~Brandon Waterloo

________________________________
From: Anuj Kumar [anujs...@gmail.com]
Sent: Monday, April 04, 2011 2:48 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

In the log messages, are you able to locate the file at which it fails?  It
looks like Tika is unable to parse one of your PDF files.  We need to hunt
that one down.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo 
<brandon.water...@matrix.msu.edu> wrote:
Looks like I'm using Tika 0.4:
apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
.../tika-parsers-0.4.jar

~Brandon Waterloo

________________________________________
From: Anuj Kumar [anujs...@gmail.com]
Sent: Monday, April 04, 2011 2:12 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

This is related to Apache TIKA. Which version are you using?
Please see this thread for more details-
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Hope it helps.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <
brandon.water...@matrix.msu.edu> wrote:

>  Hey everybody,
>
> I've been running into some issues indexing a very large set of documents.
>  There are about 4000 PDF files, ranging in size from 10KB to 160MB.
>  Obviously this is a big task for Solr.  I have a PHP script that iterates
> over the directory and uses PHP cURL to send each file to Solr for
> indexing.  For now, commit is set to false to speed up the indexing, and
> I'm assuming that Solr will auto-commit as necessary.  I'm using the
> default solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.
>  Once all the documents have been processed, the PHP script tells Solr to
> commit.
>
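The requests visible in the access log can be reproduced one file at a time with curl, which makes it easy to log a failure and keep going rather than having the whole batch stall on a bad PDF. A sketch with a hypothetical Solr host and document directory; the id-from-filename convention is an assumption, and the actual indexing loop is left commented out because it needs a running Solr:

```shell
SOLR=${SOLR:-http://localhost:8983/solr}   # hypothetical host/core

extract_url() {
  # Build the extract-handler URL for one PDF.  Deriving the document id
  # from the file name is an assumption, matching the ids in the log.
  printf '%s' "$SOLR/update/extract?literal.id=$(basename "$1" .pdf)&commit=false"
}

# The loop itself needs a running Solr, so it is shown commented out:
# for f in /data/pdfs/*.pdf; do
#   status=$(curl -s -o /dev/null -w '%{http_code}' \
#     "$(extract_url "$f")" -F "myfile=@$f")
#   [ "$status" = 200 ] || echo "$status $f" >> failed.txt  # log and continue
# done
# curl "$SOLR/update" --data-binary '<commit/>' -H 'Content-Type: text/xml'
```

Recording the non-200 files in `failed.txt` would also answer the "which file fails?" question directly.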
> The main problem is that after a few thousand documents (around 2000 last
> time I tried), nearly every document begins causing Java exceptions in Solr:
>
> Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException:
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
> org.apache.tika.parser.pdf.PDFParser@11d329d
>        at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>        at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>        at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>        at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>        at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>        at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>        at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>        at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>        at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>        at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>        at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>        at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>        at org.mortbay.jetty.Server.handle(Server.java:285)
>        at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>        at
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>        at
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>        at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal
> IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
>        at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
>        at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>        at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>        ... 23 more
> Caused by: java.io.IOException: expected='endobj' firstReadAttempt=''
> secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
>        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
>        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
>        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
>        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
>        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
>        at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>        ... 25 more
>
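The root cause in the trace is PDFBox giving up with `expected='endobj'`, which usually points at a malformed or truncated PDF rather than anything on the Solr side. A cheap heuristic for spotting truncated files is to check whether the tail of each PDF still contains the `%%EOF` marker; a sketch, and only a heuristic, since some valid PDFs have unusual layouts:

```shell
check_pdfs() {
  # Flag PDFs whose last kilobyte lacks the %%EOF end-of-file marker --
  # often a sign of a truncated file.  Heuristic only, not proof.
  for f in "$1"/*.pdf; do
    [ -f "$f" ] || continue
    tail -c 1024 "$f" | grep -q '%%EOF' || echo "suspect: $f"
  done
}

check_pdfs /data/pdfs    # hypothetical directory
```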
> As far as I know there's nothing special about these documents, so I'm
> wondering whether Solr is failing to autocommit properly.  What would be
> appropriate settings in solrconfig.xml for this particular application?
>  I'd like it to autocommit as soon as it needs to, but no more often than
> that, for the sake of efficiency.  It already takes long enough to index
> 4000 documents, and there's no reason to make it take longer.  Thanks for
> your help!
>
> ~Brandon Waterloo
>
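On the solrconfig.xml question in the quoted message: the example config that ships with Solr 1.4.1 has the autoCommit block commented out, so with commit=false nothing is committed until the final explicit commit. Note, though, that the 500s above come from a Tika/PDFBox parse exception, so autocommit settings are unlikely to be the cause. For reference, a sketch of the block (inside the updateHandler element; the values here are illustrative, not a recommendation):

```xml
<!-- In solrconfig.xml, inside <updateHandler class="solr.DirectUpdateHandler2">.
     Commented out in the stock 1.4.1 example config; values illustrative. -->
<autoCommit>
  <maxDocs>1000</maxDocs>    <!-- commit after this many buffered documents -->
  <maxTime>60000</maxTime>   <!-- ...or after 60 s, whichever comes first -->
</autoCommit>
```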
