Just a word of caution: I've been bitten by this bug, which affects Tika 0.6: 
https://issues.apache.org/jira/browse/PDFBOX-541

It causes the parser to go into an infinite loop, which isn't exactly great 
for server stability. Tika 0.4 is not affected in the same way - as far as I 
remember, the parser just fails on such PDF files.
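
For what it's worth, one way to limit the damage from a hanging parser is to 
run the extraction on a separate thread and give up after a timeout. Below is 
a minimal sketch of that idea, not something taken from Tika or Solr: the 
names ParseGuard and runWithTimeout are made up for illustration, and the 
task you pass in stands for whatever call invokes the parser.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Tika or Solr): runs a task on a worker
// thread and stops waiting for it after a fixed timeout.
final class ParseGuard {

    private ParseGuard() {
    }

    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        // Daemon thread, so a worker stuck in a loop cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-guard");
            t.setDaemon(true);
            return t;
        });
        try {
            Future<T> future = executor.submit(task);
            try {
                // Wait for the result, but only up to the given timeout.
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Ask the worker to stop. This only sets the interrupt flag;
                // code that never checks it will keep running.
                future.cancel(true);
                throw e;
            }
        } finally {
            // Stop accepting work and interrupt anything still running.
            executor.shutdownNow();
        }
    }
}
```

The caller would wrap the parse call in a Callable and treat a 
TimeoutException as "skip this document". Keep in mind that the caller gets 
control back, but a parser stuck in a tight loop that ignores interrupts will 
still burn CPU on the worker thread, so this limits the damage rather than 
fixing the underlying bug.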

According to the Tika folks, PDFBox and Tika releases need to be synchronized, 
so it might be wise to hold off on upgrading until a Tika release that 
bundles the fixed PDFBox is out.

Best regards
- Christian


On Wednesday 17 February 2010 11:40:50 am Liam O'Boyle wrote:
> I just copied in the newer .jars and got rid of the old ones and
> everything seemed to work smoothly enough.
> 
> Liam
> 
> On Tue, 2010-02-16 at 13:11 -0500, Grant Ingersoll wrote:
> > I've got a task open to upgrade to 0.6.  Will try to get to it this week.
> >  Upgrading is usually pretty trivial.
> >
> > On Feb 14, 2010, at 12:37 AM, Liam O'Boyle wrote:
> > > Afternoon,
> > >
> > > I've got a large collection of documents which I'm attempting to add
> > > to a Solr index using Tika via the ExtractingRequestHandler, but there
> > > are a large number that it has problems with (PDFs, PPTX and XLS
> > > documents mainly).
> > >
> > > I've tried them with the most recent standalone version of Tika and it
> > > handles most of the failing documents correctly.  I tried using a
> > > recent nightly build of Solr, but the same problems seem to occur.
> > >
> > > Are there instructions somewhere on installing a more recent Tika build
> > > into Solr?
> > >
> > > Thanks,
> > > Liam
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem using Solr/Lucene:
> > http://www.lucidimagination.com/search
> 

-- 
Christian Vogler, Ph.D.
Institute for Language and Speech Processing, Athens, Greece
