Hi Brandon,

Sorry, I can't make out much here. The exception shows a Tika error that indicates a PDF parsing problem; that's all I can tell. Maybe someone else on this mailing list can help.
Sorry.
- Anuj

On Tue, Apr 5, 2011 at 6:35 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:

> It wasn't just a single file; it was dozens of files, all having problems toward the end just before I killed the process.
>
> IPADDR - - [04/04/2011:17:17:03 +0000] "POST /solr/update/extract?literal.id=32-130-AFB-84&commit=false HTTP/1.1" 500 4558
> IPADDR - - [04/04/2011:17:17:05 +0000] "POST /solr/update/extract?literal.id=32-130-AFC-84&commit=false HTTP/1.1" 500 4558
> IPADDR - - [04/04/2011:17:17:09 +0000] "POST /solr/update/extract?literal.id=32-130-AFD-84&commit=false HTTP/1.1" 500 4557
> IPADDR - - [04/04/2011:17:17:14 +0000] "POST /solr/update/extract?literal.id=32-130-AFE-84&commit=false HTTP/1.1" 500 4558
> IPADDR - - [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-AFF-84&commit=false HTTP/1.1" 500 4558
> IPADDR - - [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-B00-84&commit=false HTTP/1.1" 500 4557
>
> That is by no means all the errors; that is just a sample of a few. You can see they all returned HTTP 500. What is strange is that nearly every file succeeded before about the 2200-file mark, and nearly every file after that failed.
>
> ~Brandon Waterloo
>
> ________________________________
> From: Anuj Kumar [anujs...@gmail.com]
> Sent: Monday, April 04, 2011 2:48 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> In the log messages, are you able to locate the file at which it fails? It looks like Tika is unable to parse one of your PDF files. We need to hunt that one out.
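[To hunt the failing files out of the access log mechanically, something like the following works. This is a sketch of mine, not from the thread; the regex assumes the Jetty access-log format shown in the sample lines above.]

```python
import re

# Captures the literal.id and the HTTP status code from access-log lines
# like the samples above. The pattern is an assumption based on those
# samples, not on the actual log configuration.
LOG_RE = re.compile(r'literal\.id=([^&\s"]+)\S*\s+HTTP/1\.[01]"\s+(\d{3})')

def failing_ids(log_lines):
    """Return the literal.id of every request that did not return a 2xx."""
    failed = []
    for line in log_lines:
        m = LOG_RE.search(line)
        if m and not m.group(2).startswith("2"):
            failed.append(m.group(1))
    return failed

sample = [
    'IPADDR - - [04/04/2011:17:17:03 +0000] "POST /solr/update/extract?literal.id=32-130-AFB-84&commit=false HTTP/1.1" 500 4558',
    'IPADDR - - [04/04/2011:17:17:05 +0000] "POST /solr/update/extract?literal.id=32-130-AFC-84&commit=false HTTP/1.1" 200 4558',
]
print(failing_ids(sample))  # ['32-130-AFB-84']
```

[Since the IDs in the log map one-to-one to files, the resulting list identifies exactly which PDFs to pull aside and inspect.]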
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:
> Looks like I'm using Tika 0.4:
> apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
> .../tika-parsers-0.4.jar
>
> ~Brandon Waterloo
>
> ________________________________________
> From: Anuj Kumar [anujs...@gmail.com]
> Sent: Monday, April 04, 2011 2:12 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> This is related to Apache Tika. Which version are you using? Please see this thread for more details:
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> Hope it helps.
>
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:
> > Hey everybody,
> >
> > I've been running into some issues indexing a very large set of documents. There are about 4000 PDF files, ranging in size from 160MB down to 10KB. Obviously this is a big task for Solr. I have a PHP script that iterates over the directory and uses PHP cURL to ask Solr to index each file. For now, commit is set to false to speed up the indexing, and I'm assuming that Solr should be auto-committing as necessary. I'm using the default solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf. Once all the documents have been processed, the PHP script sends Solr a commit.
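[The PHP script itself isn't shown in the thread. For reference, the loop Brandon describes might look roughly like this in Python; the base URL, ID scheme, and function names are my assumptions, and the point of the sketch is to collect failures per file instead of letting one bad PDF abort the run.]

```python
import os
import urllib.parse
import urllib.request

SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"  # assumed base URL

def extract_url(doc_id, commit=False):
    """Build an update/extract URL like the ones in the access log."""
    params = urllib.parse.urlencode(
        {"literal.id": doc_id, "commit": "true" if commit else "false"}
    )
    return f"{SOLR_EXTRACT}?{params}"

def index_directory(pdf_dir):
    """POST every PDF in pdf_dir; return (filename, error) pairs for failures."""
    failures = []
    for name in sorted(os.listdir(pdf_dir)):
        if not name.lower().endswith(".pdf"):
            continue
        with open(os.path.join(pdf_dir, name), "rb") as fh:
            data = fh.read()
        doc_id = os.path.splitext(name)[0]  # assumed: filename doubles as ID
        req = urllib.request.Request(
            extract_url(doc_id),
            data=data,
            headers={"Content-Type": "application/pdf"},
        )
        try:
            urllib.request.urlopen(req)
        except Exception as exc:  # an HTTP 500 from Tika lands here
            failures.append((name, str(exc)))
    return failures
```

[After the loop, a single request with `commit=true` (or an explicit `/solr/update` commit) flushes everything, which matches the script's described behavior; the `failures` list then tells you which documents to retry or inspect.]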
> > The main problem is that after a few thousand documents (around 2000 last time I tried), nearly every document begins causing Java exceptions in Solr:
> >
> > Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
> >     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> >     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> >     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> >     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> >     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> >     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
> >     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> >     at org.mortbay.jetty.Server.handle(Server.java:285)
> >     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> >     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
> >     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
> >     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
> >     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> >     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
> >     at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> > Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
> >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
> >     ... 23 more
> > Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
> >     at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
> >     at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
> >     at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
> >     at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
> >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
> >     ... 25 more
> >
> > As far as I know there's nothing special about these documents, so I'm wondering if it's not properly auto-committing. What would be appropriate settings in solrconfig.xml for this particular application? I'd like it to auto-commit as soon as it needs to, but no more often than that, for the sake of efficiency.
> > Obviously it takes long enough to index 4000 documents, and there's no reason to make it take longer. Thanks for your help!
> >
> > ~Brandon Waterloo
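[On the auto-commit question: if memory serves, the `<autoCommit>` block inside `<updateHandler>` is commented out in the stock Solr 1.4 example solrconfig.xml, so with `commit=false` on every request nothing is committed until the final explicit commit. Uncommenting it along these lines makes Solr flush periodically; the threshold values here are illustrative, not recommendations:]

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit automatically after 1000 buffered docs or 5 minutes,
       whichever comes first (illustrative values) -->
  <autoCommit>
    <maxDocs>1000</maxDocs>
    <maxTime>300000</maxTime> <!-- milliseconds -->
  </autoCommit>
</updateHandler>
```

[Note that auto-commit settings would not change the TikaException itself; that failure happens at parse time, before anything is buffered for commit.]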
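[The `expected='endobj'` IOException from PDFBox in the trace above often points at malformed or truncated PDFs rather than a commit problem. One crude way to sieve the corpus before indexing is to check for the `%%EOF` marker that a structurally complete PDF ends with. This heuristic is mine, not part of Tika or Solr, and it will not catch every kind of corruption:]

```python
def looks_truncated(path_or_bytes):
    """Heuristic: a complete PDF has an %%EOF marker near the end of the
    file; truncated downloads usually lack it. Checks the last 1 KiB."""
    if isinstance(path_or_bytes, bytes):
        tail = path_or_bytes[-1024:]
    else:
        with open(path_or_bytes, "rb") as f:
            f.seek(0, 2)                     # seek to end to get the size
            size = f.tell()
            f.seek(max(0, size - 1024))
            tail = f.read()
    return b"%%EOF" not in tail

good = b"%PDF-1.4\n...objects...\ntrailer\n%%EOF\n"
bad = b"%PDF-1.4\n...objects...\nendstream"
print(looks_truncated(good), looks_truncated(bad))  # False True
```

[Files this flags are worth opening in a desktop viewer; if they fail there too, the source copies are damaged and re-fetching them is likely cheaper than fighting the parser.]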