Ohh sorry... didn't realize that they already sent you that link :P

On Fri, Apr 8, 2011 at 12:35 PM, Ezequiel Calderara <ezech...@gmail.com> wrote:

> Maybe those files were created with a different Adobe format version.
>
> See this:
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <
> brandon.water...@matrix.msu.edu> wrote:
>
>> A second test has revealed that the problem lies in the contents, and
>> not the literal filenames, of the second set of files. I renamed one of
>> the second-format files, tested it, and Solr still failed. However, the
>> problem still only applies to files of the second naming format.
>> ________________________________________
>> From: Brandon Waterloo [brandon.water...@matrix.msu.edu]
>> Sent: Friday, April 08, 2011 10:40 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Problems indexing very large set of documents
>>
>> I had some time to do some research into the problems. From what I can
>> tell, it appears Solr is tripping up over the filename. These are
>> strictly examples, but Solr handles this filename fine:
>>
>> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>>
>> However, it fails with either a parsing error or an EOF exception on
>> this filename:
>>
>> 32-130-A08-84-al.sff.document.nusa197102.pdf
>>
>> The only significant difference is that the second filename contains
>> multiple periods. As there are about 1700 files whose filenames are
>> similar to the second format, it is simply not possible to change their
>> filenames. In addition, they are being used by other applications.
>>
>> Is there something I can change in the Solr configs to fix this issue,
>> or am I simply SOL until the Solr dev team can work on this?
>> (assuming I put in a ticket)
>>
>> Thanks again everyone,
>>
>> ~Brandon Waterloo
>>
>> ________________________________________
>> From: Chris Hostetter [hossman_luc...@fucit.org]
>> Sent: Tuesday, April 05, 2011 3:03 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Problems indexing very large set of documents
>>
>> : It wasn't just a single file, it was dozens of files all having
>> : problems toward the end just before I killed the process.
>>         ...
>> : That is by no means all the errors, that is just a sample of a few.
>> : You can see they all threw HTTP 500 errors. What is strange is, nearly
>> : every file succeeded before about the 2200-files mark, and nearly
>> : every file after that failed.
>>
>> ...the root question is: do those files *only* fail if you have already
>> indexed ~2200 files, or do they fail if you start up your server and
>> index them first?
>>
>> There may be a resource issue (if it only happens after indexing ~2200
>> files), or it may just be a problem with a large number of your PDFs
>> that your iteration code just happens to get to at that point.
>>
>> If it's the former, then there may be something buggy about how Solr is
>> using Tika to cause the problem -- if it's the latter, then it's a
>> straight Tika parsing issue.
>>
>> : > now, commit is set to false to speed up the indexing, and I'm
>> : > assuming that Solr should be auto-committing as necessary. I'm using
>> : > the default solrconfig.xml file included in
>> : > apache-solr-1.4.1\example\solr\conf. Once
>>
>> Solr does no autocommitting by default; you need to check your
>> solrconfig.xml.
>>
>>
>> -Hoss
>
>
> --
> ______
> Ezequiel.
>
> Http://www.ironicnet.com

--
______
Ezequiel.

Http://www.ironicnet.com
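
On Hoss's last point: in Solr 1.4 autocommit is configured via the `<autoCommit>` element inside `<updateHandler>` in solrconfig.xml, and it is commented out in the stock example config, which is why nothing commits by default. A sketch (the threshold values are illustrative, not recommendations):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Commit automatically after 10,000 added docs or 60,000 ms,
       whichever comes first. Both child elements are optional. -->
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```

With commit=false on the update requests, enabling one of these thresholds (or issuing an explicit commit at the end of the batch) is what actually makes the indexed documents visible.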
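
A side note on the filename theory above: Brandon's rename test already pointed at the contents rather than the name, and simple extension detection agrees, since the extra periods only change the base name, not the ".pdf" suffix. A quick Python check (not from the thread, just illustrative):

```python
import os

# The two example filenames from Brandon's message.
ok_name = "32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf"
bad_name = "32-130-A08-84-al.sff.document.nusa197102.pdf"

# splitext splits on the LAST period only, so both names
# resolve to the same ".pdf" extension.
print(os.path.splitext(ok_name)[1])   # -> .pdf
print(os.path.splitext(bad_name)[1])  # -> .pdf
```

So any tool that keys off the extension sees both files identically; the failures must come from what Tika finds inside the second set of PDFs.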
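
One way to answer Hoss's former-or-latter question is to replay only the failing files against a freshly restarted server and compare the failure set. A small harness sketch, where `index_one` is a placeholder for whatever call your iteration code actually makes (it is an assumption here, not something from the thread):

```python
def triage(paths, index_one):
    """Try each file individually and bucket the results, so the
    failure list can be compared between a fresh server and one
    that has already indexed ~2200 documents."""
    ok, failed = [], []
    for path in paths:
        try:
            index_one(path)
            ok.append(path)
        except Exception as exc:
            failed.append((path, exc))
    return ok, failed
```

If the same files fail right after startup, it points at a straight Tika parsing problem; if they only fail after a long run, a resource leak in the Solr/Tika integration becomes the likelier suspect.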