I think I've finally found the problem. The files that work are PDF version 1.6. The files that do NOT work are PDF version 1.4. I'll look into updating all the old documents to PDF 1.6.
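A quick way to triage which documents still need upgrading is to read the `%PDF-1.x` version marker from the first bytes of each file. This is a minimal sketch (plain stdlib Python, run from the directory holding the PDFs; the glob pattern is just an example):

```python
# Triage sketch: report the PDF header version of every file so the
# PDF 1.4 documents that fail indexing can be identified in bulk.
import glob

def pdf_version(path):
    """Return the version string from a PDF header, e.g. '1.4', or None."""
    with open(path, "rb") as fh:
        header = fh.read(8)          # e.g. b'%PDF-1.4'
    if not header.startswith(b"%PDF-"):
        return None                  # not a PDF at all
    return header[5:8].decode("ascii", "replace")

if __name__ == "__main__":
    for path in sorted(glob.glob("*.pdf")):
        print(path, pdf_version(path))
```

Note that files updated in place with incremental saves can report an older header version than their actual feature set, so treat this as a first pass rather than a definitive check.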
Thanks everyone!

~Brandon Waterloo

________________________________
From: Ezequiel Calderara [ezech...@gmail.com]
Sent: Friday, April 08, 2011 11:35 AM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

Maybe those files are created with a different Adobe Format version... See this:
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:

A second test has revealed that it has something to do with the contents, and not the literal filenames, of the second set of files. I renamed one of the second-format files, tested it, and Solr still failed. However, the problem still applies only to files of the second naming format.

________________________________________
From: Brandon Waterloo [brandon.water...@matrix.msu.edu]
Sent: Friday, April 08, 2011 10:40 AM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

I had some time to do some research into the problems. From what I can tell, it appears Solr is tripping up over the filename. These are strictly examples, but Solr handles this filename fine:

32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

However, it fails with either a parsing error or an EOF exception on this filename:

32-130-A08-84-al.sff.document.nusa197102.pdf

The only significant difference is that the second filename contains multiple periods. As there are about 1700 files whose filenames follow the second format, changing their filenames is simply not possible; in addition, they are being used by other applications. Is there something I can change in the Solr configs to fix this issue, or am I simply SOL until the Solr dev team can work on it?
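One way to separate "Tika cannot parse this file" from "something goes wrong during a long bulk index" is to push each file through Solr's ExtractingRequestHandler with extractOnly=true, so Tika parses it but nothing is indexed. A rough sketch, assuming the stock example setup (Solr on localhost:8983 with the handler mapped at /update/extract, as in the 1.4.x example solrconfig.xml):

```python
# Probe sketch: POST each PDF to Solr with extractOnly=true and record
# which files Tika fails to parse, independent of any bulk-index state.
import glob
import urllib.request

SOLR_EXTRACT = "http://localhost:8983/solr/update/extract?extractOnly=true"

def try_extract(path):
    """Return True if Solr/Tika extracts the file without an HTTP error."""
    with open(path, "rb") as fh:
        req = urllib.request.Request(
            SOLR_EXTRACT,
            data=fh.read(),
            headers={"Content-Type": "application/pdf"},
        )
    try:
        urllib.request.urlopen(req, timeout=60)
        return True
    except Exception:
        return False           # HTTP 500 from Tika, connection error, etc.

if __name__ == "__main__":
    failures = [p for p in sorted(glob.glob("*.pdf")) if not try_extract(p)]
    print("%d files failed extraction" % len(failures))
    for p in failures:
        print("  FAILED:", p)
```

If the same files fail here on a fresh server, the problem is in Tika's PDF parsing of those documents rather than in Solr's indexing pipeline.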
(assuming I put in a ticket)

Thanks again everyone,

~Brandon Waterloo

________________________________________
From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Tuesday, April 05, 2011 3:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

: It wasn't just a single file, it was dozens of files all having problems
: toward the end just before I killed the process.
	...
: That is by no means all the errors, that is just a sample of a few.
: You can see they all threw HTTP 500 errors. What is strange is, nearly
: every file succeeded before about the 2200-files-mark, and nearly every
: file after that failed.

...the root question is: do those files *only* fail if you have already indexed ~2200 files, or do they fail if you start up your server and index them first? There may be a resource issue (if it only happens after indexing ~2200 files), or it may just be a problem with a large number of your PDFs that your iteration code happens to reach at that point. If it's the former, there may be something buggy about how Solr is using Tika that causes the problem; if it's the latter, it's a straight Tika parsing issue.

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary. I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.

Solr does no autocommitting by default; you need to check your solrconfig.xml.

-Hoss
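For reference, autocommit is enabled by uncommenting the `<autoCommit>` block inside the update handler in solrconfig.xml. A minimal example (the thresholds are illustrative, not recommendations):

```xml
<!-- solrconfig.xml: commit automatically after 1000 buffered docs
     or 60 seconds, whichever comes first -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>1000</maxDocs>
    <maxTime>60000</maxTime>  <!-- milliseconds -->
  </autoCommit>
</updateHandler>
```

Without this (and with commit=false on the update requests), nothing becomes searchable until an explicit commit is sent, and uncommitted buffered documents consume resources during a long bulk index.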