I found a simpler command-line method to update the PDF files. On some documents it works perfectly: the result is a pixel-for-pixel match and none of the OCR text is lost (all of these PDFs are newspaper articles that have been run through OCR). However, on other documents the result is considerably blurrier and some of the OCR text is lost.
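
For reference, the kind of rewrite I mean is roughly the following (illustrative only; I'm showing a Ghostscript invocation here, but the point is just that the document gets re-emitted at a newer PDF compatibility level):

    # Illustrative sketch only: regenerate a PDF at compatibility level 1.6.
    # How well the result preserves layout and the OCR text layer varies
    # from document to document.
    gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dCompatibilityLevel=1.6 \
       -sOutputFile=out-1.6.pdf in-1.4.pdf
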
We've decided to skip any documents that Tika cannot index for now. As Lance stated, it isn't the PDF version specifically that causes the problem but rather quirks introduced by different PDF writers; a few tests have confirmed this, so we can't use the version to decide which documents should be skipped. I'm examining the XML responses from the queries, and I cannot figure out how to tell from the XML response whether or not a document was successfully indexed. The status value seems to be 0 regardless of whether indexing succeeded. So my question is: how can I tell from the response whether or not indexing was actually successful?

~Brandon Waterloo
________________________________________
From: Lance Norskog [goks...@gmail.com]
Sent: Sunday, April 10, 2011 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Problems indexing very large set of documents

There is a library called iText. It parses and writes PDFs very well, and a simple program will let you do a batch conversion.
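
Roughly, such a program can be as small as the sketch below (untested; it assumes iText 5.x, where the PdfStamper constructor accepts a target PDF version as its third argument; package names differ in older iText releases):

    import java.io.File;
    import java.io.FileOutputStream;

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.PdfStamper;
    import com.itextpdf.text.pdf.PdfWriter;

    // Untested sketch: rewrite every PDF in one directory into another,
    // re-emitting each document as PDF 1.6. No handling for encrypted
    // or corrupt files.
    public class PdfBatchRewrite {
        public static void main(String[] args) throws Exception {
            File inDir = new File(args[0]);
            File outDir = new File(args[1]);
            outDir.mkdirs();

            for (File f : inDir.listFiles()) {
                if (!f.getName().toLowerCase().endsWith(".pdf")) {
                    continue;
                }
                PdfReader reader = new PdfReader(f.getAbsolutePath());
                FileOutputStream out =
                        new FileOutputStream(new File(outDir, f.getName()));
                // Third argument selects the output PDF version ('6' -> 1.6).
                PdfStamper stamper =
                        new PdfStamper(reader, out, PdfWriter.VERSION_1_6);
                stamper.close();
                reader.close();
            }
        }
    }
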
PDFs are made by a wide range of programs, not just Adobe code. Many of these do weird things and make small mistakes that Tika does not know how to handle. In other words, there is "dirty PDF" just like "dirty HTML". A percentage of PDFs will fail, and that's life. One site that gets press releases from zillions of sites (and thus a wide range of PDF generators) has a 15% failure rate with Tika.

Lance

On Fri, Apr 8, 2011 at 9:44 AM, Brandon Waterloo
<brandon.water...@matrix.msu.edu> wrote:
> I think I've finally found the problem. The files that work are PDF version
> 1.6. The files that do NOT work are PDF version 1.4. I'll look into
> updating all the old documents to PDF 1.6.
>
> Thanks everyone!
>
> ~Brandon Waterloo
> ________________________________
> From: Ezequiel Calderara [ezech...@gmail.com]
> Sent: Friday, April 08, 2011 11:35 AM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> Maybe those files are created with a different Adobe Format version...
>
> See this:
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo
> <brandon.water...@matrix.msu.edu> wrote:
> A second test has revealed that it is something to do with the contents, and
> not the literal filenames, of the second set of files. I renamed one of the
> second-format files and tested it, and Solr still failed. However, the
> problem still only applies to files of the second naming format.
> ________________________________________
> From: Brandon Waterloo [brandon.water...@matrix.msu.edu]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems. From what I can tell,
> it appears Solr is tripping up over the filename. These are strictly
> examples, but Solr handles this filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this
> filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains multiple
> periods. As there are about 1700 files whose filenames are similar to the
> second format, it is simply not possible to change their filenames. In
> addition, they are being used by other applications.
>
> Is there something I can change in the Solr configs to fix this issue, or am I
> simply SOL until the Solr dev team can work on this? (assuming I put in a
> ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
> ________________________________________
> From: Chris Hostetter [hossman_luc...@fucit.org]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
>       ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors. What is strange is, nearly
> : every file succeeded before about the 2200-files mark, and nearly every
> : file after that failed.
>
> ...the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> There may be a resource issue (if it only happens after indexing 2200), or
> it may just be a problem with a large number of your PDFs that your
> iteration code happens to reach at that point.
>
> If it's the former, then there may be something buggy about how Solr is
> using Tika to cause the problem -- if it's the latter, then it's a straight
> Tika parsing issue.
>
> : > now, commit is set to false to speed up the indexing, and I'm assuming that
> : > Solr should be auto-committing as necessary. I'm using the default
> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf. Once
>
> Solr does no autocommitting by default; you need to check your
> solrconfig.xml.
>
>
> -Hoss
>
>
> --
> ______
> Ezequiel.
>
> Http://www.ironicnet.com

--
Lance Norskog
goks...@gmail.com
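
For anyone following along, the autoCommit setting Hoss refers to lives in the updateHandler section of solrconfig.xml; a hedged example with purely illustrative thresholds (commit after 1000 buffered documents or 60 seconds, whichever comes first):

    <!-- Illustrative values only; this block ships commented out in the
         stock Solr 1.4.1 example solrconfig.xml. maxTime is in milliseconds. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>1000</maxDocs>
        <maxTime>60000</maxTime>
      </autoCommit>
    </updateHandler>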