I think I've finally found the problem.  The files that work are PDF version 
1.6.  The files that do NOT work are PDF version 1.4.  I'll look into updating 
all the old documents to PDF 1.6.
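
For anyone hitting the same thing, a minimal sketch (nothing Solr-specific, file paths passed as arguments are placeholders) for checking which header version each PDF reports, so the 1.4 documents can be separated from the 1.6 ones:

import java.io.FileInputStream;

public class PdfVersionCheck {
    public static void main(String[] args) throws Exception {
        for (String path : args) {
            try (FileInputStream in = new FileInputStream(path)) {
                // The first 8 bytes of a PDF are its header, e.g. "%PDF-1.4"
                byte[] header = new byte[8];
                int read = in.read(header);
                System.out.println(path + " -> "
                        + new String(header, 0, Math.max(read, 0), "US-ASCII"));
            }
        }
    }
}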

Thanks everyone!

~Brandon Waterloo
________________________________
From: Ezequiel Calderara [ezech...@gmail.com]
Sent: Friday, April 08, 2011 11:35 AM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

Maybe those files were created with a different Adobe PDF format version...

See this: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo 
<brandon.water...@matrix.msu.edu> wrote:
A second test has revealed that the problem has something to do with the contents, 
and not the literal filenames, of the second set of files.  I renamed one of the 
second-format files and tested it, and Solr still failed.  However, the problem 
still only applies to files of the second naming format.
________________________________________
From: Brandon Waterloo 
[brandon.water...@matrix.msu.edu]
Sent: Friday, April 08, 2011 10:40 AM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

I had some time to do some research into the problems.  From what I can tell, 
it appears Solr is tripping up over the filename.  These are strictly examples, 
but, Solr handles this filename fine:

32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

However, it fails with either a parsing error or an EOF exception on this 
filename:

32-130-A08-84-al.sff.document.nusa197102.pdf

The only significant difference is that the second filename contains multiple 
periods.  As there are about 1700 files whose filenames follow the second format, 
it is simply not possible to change their filenames.  In addition, they are being 
used by other applications.
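
To take my iteration code out of the picture, a single suspect file can be posted 
directly to the extract handler.  A rough sketch, assuming the example Solr setup 
at localhost:8983 with /update/extract (the ExtractingRequestHandler) enabled as in 
the 1.4.1 example config; the literal.id value and file path are placeholders:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SingleFileExtractTest {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and id; adjust to the actual Solr instance
        String solrUrl = "http://localhost:8983/solr/update/extract"
                + "?literal.id=test-doc-1&commit=true";
        HttpURLConnection conn = (HttpURLConnection) new URL(solrUrl).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/pdf");

        // Stream the suspect PDF as the raw request body
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = conn.getOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }

        // An HTTP 500 here reproduces the indexing failure for this one file
        System.out.println("HTTP status: " + conn.getResponseCode());
    }
}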

Is there something I can change in the Solr configs to fix this issue, or am I 
simply SOL until the Solr dev team can work on this? (assuming I put in a 
ticket)

Thanks again everyone,

~Brandon Waterloo


________________________________________
From: Chris Hostetter 
[hossman_luc...@fucit.org]
Sent: Tuesday, April 05, 2011 3:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

: It wasn't just a single file, it was dozens of files all having problems
: toward the end just before I killed the process.
       ...
: That is by no means all the errors, that is just a sample of a few.
: You can see they all threw HTTP 500 errors.  What is strange is, nearly
: every file succeeded before about the 2200-files-mark, and nearly every
: file after that failed.

...the root question is: do those files *only* fail if you have already
indexed ~2200 files, or do they fail if you start up your server and index
them first?

there may be a resource issue (if it only happens after indexing 2200) or
it may just be a problem with a large number of your PDFs that your
iteration code just happens to get to at that point.

If it's the former, then there may be something buggy about how Solr is
using Tika that causes the problem -- if it's the latter, then it's a straight
Tika parsing issue.
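
If it's the latter, a quick way to check is to run Tika directly on the failing
files, outside Solr.  A minimal sketch using the Tika API (jar versions aside):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaParseCheck {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        for (String path : args) {
            try (InputStream in = new FileInputStream(path)) {
                // -1 disables the default limit on extracted text length
                BodyContentHandler handler = new BodyContentHandler(-1);
                Metadata metadata = new Metadata();
                parser.parse(in, handler, metadata, new ParseContext());
                System.out.println("OK   " + path);
            } catch (Exception e) {
                // A failure here points at Tika/PDFBox rather than Solr itself
                System.out.println("FAIL " + path + " : " + e);
            }
        }
    }
}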

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary.  I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once

Solr does no autocommitting by default; you need to check your
solrconfig.xml
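
For reference, the autoCommit block in solrconfig.xml typically ships commented
out in the 1.4.1 example config; the values below are illustrative only:

<!-- Illustrative values: commit automatically after 1000 added docs
     or 60 seconds, whichever comes first -->
<autoCommit>
  <maxDocs>1000</maxDocs>
  <maxTime>60000</maxTime>
</autoCommit>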


-Hoss



--
______
Ezequiel.

Http://www.ironicnet.com
