I did try the standalone version of Tika 0.7, and it extracted the PDF content successfully. Then I replaced the Tika-related jars in contrib/extraction/lib of the Solr 1.4 distribution with their newer versions, and now it doesn't extract content from ANY PDF. Earlier (with 0.4) it threw an exception for a few PDFs, but now there is neither content nor an exception.
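A silent failure after swapping jars by hand can come from two versions of the same library ending up side by side in the lib directory. Below is a minimal sketch for checking that, assuming the jars follow the usual name-version.jar naming. The function names are hypothetical helpers, not part of Solr or Tika, and the directory to scan is whatever path you pass in.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches names like "tika-core-0.7.jar": an artifact name, a dash,
# then a version that starts with a digit. This naming is an assumption.
JAR_PATTERN = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def group_jar_versions(jar_names):
    """Map each artifact name to the sorted list of versions found."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_PATTERN.match(name)
        if match:  # files that do not look like versioned jars are skipped
            versions[match.group("artifact")].add(match.group("version"))
    return {artifact: sorted(found) for artifact, found in versions.items()}


def find_conflicts(jar_names):
    """Return only the artifacts that appear with more than one version."""
    grouped = group_jar_versions(jar_names)
    return {artifact: found for artifact, found in grouped.items() if len(found) > 1}


def list_jars(lib_dir):
    """List the jar file names in lib_dir (path supplied by the caller)."""
    return [path.name for path in Path(lib_dir).glob("*.jar")]
```

Calling find_conflicts(list_jars("contrib/extraction/lib")) from the Solr directory would list any artifact present in more than one version. An empty result only rules out this one cause. It says nothing about whether the remaining jars are the versions the new Tika expects.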

On Fri, Apr 30, 2010 at 4:14 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> Can you share the PDF it is failing on? FWIW, PDFs are notoriously hard to
> extract. They come in all shapes and flavors and I've seen many a
> commercial extractor fail on them too. Have you tried using either Tika
> standalone or PDFBox standalone? Does the file work there?
>
> On Apr 26, 2010, at 8:35 AM, Marc Ghorayeb wrote:
>
> > Okay, I've been digging a little bit through the Java code from the SVN,
> > and it seems the load function inside the ExtractingDocumentLoader class
> > does not receive the ContentStream (it is set to null...). Maybe I should
> > send this to the developer mailing list?
> > Marc
> >
> >> From: dekay...@hotmail.com
> >> To: solr-user@lucene.apache.org
> >> Subject: RE: Problem with pdf, upgrading Cell
> >> Date: Fri, 23 Apr 2010 16:03:28 +0200
> >>
> >> Seems like I'm not the only one with this "no extraction" problem:
> >> http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html
> >> Apparently he tried the same thing, building from the trunk and
> >> indexing a PDF, and no extraction occurred... Strange.
> >> Marc G.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search