Maybe those files were created with a different Adobe/PDF format version... See this:
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
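If you want to narrow it down, you could run Tika directly against one file of each naming format, outside of Solr. Below is a rough standalone sketch (the class name and file paths are just placeholders, and it assumes the tika-core/tika-parsers jars bundled with your Solr distribution are on the classpath; newer Tika releases may want the parse() overload that also takes a ParseContext). It prints each file's %PDF-x.y header, so you can see whether the two naming formats correspond to different PDF versions, and then tries a plain Tika extraction to see whether the EOF/parse exception happens in Tika itself or only when going through Solr.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaCheck {
    public static void main(String[] args) throws Exception {
        for (String path : args) {
            File pdf = new File(path);

            // A PDF starts with a header like "%PDF-1.4"; printing it shows whether
            // the two sets of files were written as different PDF format versions.
            byte[] header = new byte[8];
            InputStream headerIn = new FileInputStream(pdf);
            try {
                headerIn.read(header);
            } finally {
                headerIn.close();
            }
            System.out.println(path + "  header: " + new String(header, "ISO-8859-1"));

            // Run Tika's auto-detect parsing standalone (roughly what Solr's
            // extracting handler does) to see whether the parse error / EOF
            // exception comes from Tika itself.
            InputStream in = new FileInputStream(pdf);
            try {
                StringWriter text = new StringWriter();
                new AutoDetectParser().parse(in, new BodyContentHandler(text), new Metadata());
                System.out.println("  parsed OK, " + text.toString().length() + " characters extracted");
            } catch (Exception e) {
                System.out.println("  parse FAILED: " + e);
            } finally {
                in.close();
            }
        }
    }
}

Running it as "java TikaCheck good.pdf bad.pdf" on one file of each format should quickly tell you whether this is the straight Tika parsing issue Hoss mentioned below, or something that only shows up after a couple thousand documents inside Solr.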
On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:

> A second test has revealed that it is something to do with the contents,
> and not the literal filenames, of the second set of files. I renamed one of
> the second-format files and tested it, and Solr still failed. However, the
> problem still only applies to those files of the second naming format.
>
> ________________________________________
> From: Brandon Waterloo [brandon.water...@matrix.msu.edu]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems. From what I can
> tell, it appears Solr is tripping up over the filename. These are strictly
> examples, but Solr handles this filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this
> filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains
> multiple periods. As there are about 1700 files whose filenames are similar
> to the second format, it is simply not possible to change their filenames.
> In addition, they are being used by other applications.
>
> Is there something I can change in the Solr configs to fix this issue, or am I
> simply SOL until the Solr dev team can work on this? (assuming I put in a
> ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
>
> ________________________________________
> From: Chris Hostetter [hossman_luc...@fucit.org]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
> ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors. What is strange is, nearly
> : every file succeeded before about the 2200-files mark, and nearly every
> : file after that failed.
>
> ...the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> There may be a resource issue (if it only happens after indexing 2200), or
> it may just be a problem with a large number of your PDFs that your
> iteration code just happens to get to at that point.
>
> If it's the former, then there may be something buggy about how Solr is
> using Tika that causes the problem -- if it's the latter, then it's a straight
> Tika parsing issue.
>
> : > now, commit is set to false to speed up the indexing, and I'm assuming that
> : > Solr should be auto-committing as necessary. I'm using the default
> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf. Once
>
> Solr does no autocommitting by default; you need to check your
> solrconfig.xml.
>
>
> -Hoss

--
______
Ezequiel.
Http://www.ironicnet.com