Can you share the PDF it is failing on? FWIW, PDFs are notoriously hard to extract text from. They come in all shapes and flavors, and I've seen many a commercial extractor fail on them too. Have you tried running the file through either Tika standalone or PDFBox standalone? Does the file work there?
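In case it helps with the standalone test, here is a minimal sketch of one way to check a file against Tika outside of Solr. It assumes you have a tika-app jar downloaded locally and `java` on your PATH; the class name `TikaStandaloneCheck` and its helper methods are made up for illustration, and the only Tika-specific piece is the `--text` option of the tika-app command line.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical helper (not part of Solr or Tika): runs a local tika-app jar
// on one document and reports whether any text was extracted.
class TikaStandaloneCheck {

    // Build the command line; "--text" asks tika-app for plain-text output.
    static List<String> buildCommand(String tikaJar, String document) {
        return List.of("java", "-jar", tikaJar, "--text", document);
    }

    // Run tika-app and return whatever it wrote to standard output.
    static String extractText(String tikaJar, String document)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tikaJar, document))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show Tika's own errors
                .start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tika-app exited with status " + exitCode);
        }
        // Assumption: decoding with the platform charset is good enough
        // for a simple "did we get any text at all" check.
        return new String(output, Charset.defaultCharset());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: TikaStandaloneCheck <tika-app.jar> <file.pdf>");
            System.exit(2);
        }
        String text = extractText(args[0], args[1]);
        if (text.isBlank()) {
            System.out.println("No text extracted");
        } else {
            System.out.println("Extracted " + text.length() + " characters");
        }
    }
}
```

If this prints "No text extracted" for your PDF, the problem is most likely in Tika/PDFBox or in the file itself. If it extracts text fine, the problem is more likely on the Solr side.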
On Apr 26, 2010, at 8:35 AM, Marc Ghorayeb wrote:

> Okay, I've been digging a little bit through the Java code from the SVN, and
> it seems the load function inside the ExtractingDocumentLoader class does not
> receive the ContentStream (it is set to null...). Maybe I should send this to
> the developer mailing list?
> Marc
>
>> From: dekay...@hotmail.com
>> To: solr-user@lucene.apache.org
>> Subject: RE: Problem with pdf, upgrading Cell
>> Date: Fri, 23 Apr 2010 16:03:28 +0200
>>
>> Seems like I'm not the only one with this "no extraction" problem:
>> http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html
>> Apparently he tried the same thing, building from the trunk and indexing
>> a PDF, and no extraction occurred... Strange.
>> Marc G.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search