Ohh sorry... didn't realize that they already sent you that link :P

On Fri, Apr 8, 2011 at 12:35 PM, Ezequiel Calderara <ezech...@gmail.com> wrote:

> Maybe those files were created with a different Adobe format version.
>
> See this:
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <
> brandon.water...@matrix.msu.edu> wrote:
>
>> A second test has revealed that the problem lies in the contents, and
>> not the literal filenames, of the second set of files. I renamed one of
>> the second-format files, tested it, and Solr still failed. However, the
>> problem still only applies to files of the second naming format.
>> ________________________________________
>> From: Brandon Waterloo [brandon.water...@matrix.msu.edu]
>> Sent: Friday, April 08, 2011 10:40 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Problems indexing very large set of documents
>>
>> I had some time to do some research into the problems. From what I can
>> tell, it appears Solr is tripping up over the filename. These are
>> strictly examples, but Solr handles this filename fine:
>>
>> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>>
>> However, it fails with either a parsing error or an EOF exception on
>> this filename:
>>
>> 32-130-A08-84-al.sff.document.nusa197102.pdf
>>
>> The only significant difference is that the second filename contains
>> multiple periods. As there are about 1700 files whose filenames are
>> similar to the second format, it is simply not possible to change their
>> filenames. In addition, they are being used by other applications.
>>
>> Is there something I can change in the Solr configs to fix this issue,
>> or am I simply SOL until the Solr dev team can work on this?
>> (assuming I put in a ticket)
>>
>> Thanks again everyone,
>>
>> ~Brandon Waterloo
>>
>> ________________________________________
>> From: Chris Hostetter [hossman_luc...@fucit.org]
>> Sent: Tuesday, April 05, 2011 3:03 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Problems indexing very large set of documents
>>
>> : It wasn't just a single file, it was dozens of files all having
>> : problems toward the end just before I killed the process.
>>         ...
>> : That is by no means all the errors, that is just a sample of a few.
>> : You can see they all threw HTTP 500 errors. What is strange is, nearly
>> : every file succeeded before about the 2200-files mark, and nearly
>> : every file after that failed.
>>
>> ...the root question is: do those files *only* fail if you have already
>> indexed ~2200 files, or do they fail if you start up your server and
>> index them first?
>>
>> There may be a resource issue (if it only happens after indexing ~2200
>> files), or it may just be a problem with a large number of your PDFs
>> that your iteration code just happens to get to at that point.
>>
>> If it's the former, then there may be something buggy about how Solr is
>> using Tika to cause the problem -- if it's the latter, then it's a
>> straight Tika parsing issue.
>>
>> : > now, commit is set to false to speed up the indexing, and I'm
>> : > assuming that Solr should be auto-committing as necessary. I'm using
>> : > the default solrconfig.xml file included in
>> : > apache-solr-1.4.1\example\solr\conf. Once
>>
>> Solr does no autocommitting by default; you need to check your
>> solrconfig.xml.
>>
>>
>> -Hoss
>
>
> --
> ______
> Ezequiel.
>
> Http://www.ironicnet.com

--
______
Ezequiel.

Http://www.ironicnet.com
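
On Hoss's last point: in Solr 1.4 autocommit is configured via the `<autoCommit>` element inside `<updateHandler>` in solrconfig.xml, and it is commented out in the stock example config, which is why nothing commits by default. A sketch (the threshold values are illustrative, not recommendations):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Commit automatically after 10,000 added docs or 60,000 ms,
       whichever comes first. Both child elements are optional. -->
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```

With commit=false on the update requests, enabling one of these thresholds (or issuing an explicit commit at the end of the batch) is what actually makes the indexed documents visible.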
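
A side note on the filename theory above: Brandon's rename test already pointed at the contents rather than the name, and simple extension detection agrees, since the extra periods only change the base name, not the ".pdf" suffix. A quick Python check (not from the thread, just illustrative):

```python
import os

# The two example filenames from Brandon's message.
ok_name = "32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf"
bad_name = "32-130-A08-84-al.sff.document.nusa197102.pdf"

# splitext splits on the LAST period only, so both names
# resolve to the same ".pdf" extension.
print(os.path.splitext(ok_name)[1])   # -> .pdf
print(os.path.splitext(bad_name)[1])  # -> .pdf
```

So any tool that keys off the extension sees both files identically; the failures must come from what Tika finds inside the second set of PDFs.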
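
One way to answer Hoss's former-or-latter question is to replay only the failing files against a freshly restarted server and compare the failure set. A small harness sketch, where `index_one` is a placeholder for whatever call your iteration code actually makes (it is an assumption here, not something from the thread):

```python
def triage(paths, index_one):
    """Try each file individually and bucket the results, so the
    failure list can be compared between a fresh server and one
    that has already indexed ~2200 documents."""
    ok, failed = [], []
    for path in paths:
        try:
            index_one(path)
            ok.append(path)
        except Exception as exc:
            failed.append((path, exc))
    return ok, failed
```

If the same files fail right after startup, it points at a straight Tika parsing problem; if they only fail after a long run, a resource leak in the Solr/Tika integration becomes the likelier suspect.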