Maybe those files were created with a different Adobe/PDF format version... See this:
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
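If you want to narrow it down, you could run Tika directly against one file of each naming format, outside of Solr. Below is a rough standalone sketch (the class name and file paths are just placeholders, and it assumes the tika-core/tika-parsers jars bundled with your Solr distribution are on the classpath; newer Tika releases may want the parse() overload that also takes a ParseContext). It prints each file's %PDF-x.y header, so you can see whether the two naming formats correspond to different PDF versions, and then tries a plain Tika extraction to see whether the EOF/parse exception happens in Tika itself or only when going through Solr.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaCheck {
    public static void main(String[] args) throws Exception {
        for (String path : args) {
            File pdf = new File(path);

            // A PDF starts with a header like "%PDF-1.4"; printing it shows whether
            // the two sets of files were written as different PDF format versions.
            byte[] header = new byte[8];
            InputStream headerIn = new FileInputStream(pdf);
            try {
                headerIn.read(header);
            } finally {
                headerIn.close();
            }
            System.out.println(path + "  header: " + new String(header, "ISO-8859-1"));

            // Run Tika's auto-detect parsing standalone (roughly what Solr's
            // extracting handler does) to see whether the parse error / EOF
            // exception comes from Tika itself.
            InputStream in = new FileInputStream(pdf);
            try {
                StringWriter text = new StringWriter();
                new AutoDetectParser().parse(in, new BodyContentHandler(text), new Metadata());
                System.out.println("  parsed OK, " + text.toString().length() + " characters extracted");
            } catch (Exception e) {
                System.out.println("  parse FAILED: " + e);
            } finally {
                in.close();
            }
        }
    }
}

Running it as "java TikaCheck good.pdf bad.pdf" on one file of each format should quickly tell you whether this is the straight Tika parsing issue Hoss mentioned below, or something that only shows up after a couple thousand documents inside Solr.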
On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:

> A second test has revealed that it is something to do with the contents,
> and not the literal filenames, of the second set of files. I renamed one of
> the second-format files and tested it, and Solr still failed. However, the
> problem still only applies to those files of the second naming format.
>
> ________________________________________
> From: Brandon Waterloo [brandon.water...@matrix.msu.edu]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems. From what I can
> tell, it appears Solr is tripping up over the filename. These are strictly
> examples, but Solr handles this filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this
> filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains
> multiple periods. As there are about 1700 files whose filenames are similar
> to the second format, it is simply not possible to change their filenames.
> In addition, they are being used by other applications.
>
> Is there something I can change in the Solr configs to fix this issue, or am I
> simply SOL until the Solr dev team can work on this? (assuming I put in a
> ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
>
> ________________________________________
> From: Chris Hostetter [hossman_luc...@fucit.org]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
> ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors. What is strange is, nearly
> : every file succeeded before about the 2200-files mark, and nearly every
> : file after that failed.
>
> ...the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> There may be a resource issue (if it only happens after indexing 2200), or
> it may just be a problem with a large number of your PDFs that your
> iteration code just happens to get to at that point.
>
> If it's the former, then there may be something buggy about how Solr is
> using Tika that causes the problem -- if it's the latter, then it's a straight
> Tika parsing issue.
>
> : > now, commit is set to false to speed up the indexing, and I'm assuming that
> : > Solr should be auto-committing as necessary. I'm using the default
> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf. Once
>
> Solr does no autocommitting by default; you need to check your
> solrconfig.xml.
>
>
> -Hoss

--
______
Ezequiel.
Http://www.ironicnet.com