Marc & Sandhya,

Did you use Solr from trunk? I used the Solr 1.4 distribution, and even after copying all the jars, I still get the same results for the PDFs I posted here.

Thanks.
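To rule out a simple copy mistake when comparing setups, here is a minimal sketch in Python that checks a lib folder against the jar list Sandhya posted (quoted below). It is only an illustration, not part of Solr or Tika: the function name `missing_jars` and the example path in the note are assumptions, and the check is by file name only, so it says nothing about whether Tomcat actually loads the jars.

```python
from pathlib import Path

# Jar names copied from Sandhya's list in the quoted thread.
EXPECTED_JARS = [
    "asm-3.1.jar",
    "bcmail-jdk15-1.45.jar",
    "bcprov-jdk15-1.45.jar",
    "commons-compress-1.0.jar",
    "commons-logging-1.1.1.jar",
    "dom4j-1.6.1.jar",
    "fontbox-1.1.0.jar",
    "geronimo-stax-api_1.0_spec-1.0.1.jar",
    "jempbox-1.1.0.jar",
    "log4j-1.2.14.jar",
    "metadata-extractor-2.4.0-beta-1.jar",
    "pdfbox-1.1.0.jar",
    "poi-3.6.jar",
    "poi-ooxml-3.6.jar",
    "poi-ooxml-schemas-3.6.jar",
    "poi-scratchpad-3.6.jar",
    "tagsoup-1.2.jar",
    "tika-core-0.7.jar",
    "tika-parsers-0.7.jar",
    "xml-apis-1.0.b2.jar",
    "xmlbeans-2.3.0.jar",
]


def missing_jars(lib_dir, expected=EXPECTED_JARS):
    """Return the expected jar names that are not present in lib_dir.

    Hypothetical helper for illustration; it compares file names only.
    """
    present = {p.name for p in Path(lib_dir).glob("*.jar")}
    return [name for name in expected if name not in present]
```

For example, `missing_jars("webapps/solr/WEB-INF/lib")` (the path is an assumption; use wherever your extracted webapp lives) returns the names from the list that are not in that folder. An empty result would suggest the difference between our setups lies elsewhere, for instance in old Tika/PDFBox jars still sitting next to the new ones.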
On Wed, May 5, 2010 at 1:09 PM, Marc Ghorayeb <dekay...@hotmail.com> wrote:
>
> Hey,
> I have the same list, and i added to it the extraction library (apache solr
> cell jar), though you might not need it specifically inside the war file.
> Marc
>
> > From: sagar...@opentext.com
> > To: solr-user@lucene.apache.org
> > Date: Wed, 5 May 2010 10:21:36 +0530
> > Subject: RE: Problem with pdf, upgrading Cell
> >
> > Looks like the highlighting may not work here. Following is the list of
> > jars I copied :
> >
> > asm-3.1.jar
> > bcmail-jdk15-1.45.jar
> > bcprov-jdk15-1.45.jar
> > commons-compress-1.0.jar
> > commons-logging-1.1.1.jar
> > dom4j-1.6.1.jar
> > fontbox-1.1.0.jar
> > geronimo-stax-api_1.0_spec-1.0.1.jar
> > jempbox-1.1.0.jar
> > log4j-1.2.14.jar
> > metadata-extractor-2.4.0-beta-1.jar
> > pdfbox-1.1.0.jar
> > poi-3.6.jar
> > poi-ooxml-3.6.jar
> > poi-ooxml-schemas-3.6.jar
> > poi-scratchpad-3.6.jar
> > tagsoup-1.2.jar
> > tika-core-0.7.jar
> > tika-parsers-0.7.jar
> > xml-apis-1.0.b2.jar
> > xmlbeans-2.3.0.jar
> >
> > Thanks,
> > Sandhya
> >
> > -----Original Message-----
> > From: Sandhya Agarwal [mailto:sagar...@opentext.com]
> > Sent: Wednesday, May 05, 2010 10:06 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Problem with pdf, upgrading Cell
> >
> > Praveen,
> >
> > I only have the highlighted jars copied. Not sure, if we need the other
> > jars. Also, I copied the jars directly into solr\WEB-INF\lib, like you did.
> >
> > Thanks,
> > Sandhya
> >
> > -----Original Message-----
> > From: Praveen Agrawal [mailto:pkal...@gmail.com]
> > Sent: Tuesday, May 04, 2010 8:10 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Problem with pdf, upgrading Cell
> >
> > Hi Sandhya..
> >
> > I must be missing something. I copied all dependencies jars to both
> > contrib/extraction/lib and web-in/lib folders.
> > Here is the list of jars copied:
> >
> > asm-3.1.jar
> > bcmail-jdk15-1.45.jar
> > bcprov-jdk15-1.45.jar
> > commons-compress-1.0.jar
> > commons-logging-1.1.1.jar
> > dom4j-1.6.1.jar
> > fontbox-1.1.0.jar
> > geronimo-stax-api_1.0_spec-1.0.1.jar
> > hamcrest-core-1.1.jar
> > jempbox-1.1.0.jar
> > junit-3.8.1.jar
> > log4j-1.2.14.jar
> > metadata-extractor-2.4.0-beta-1.jar
> > mockito-core-1.7.jar
> > nekohtml-1.9.9.jar
> > objenesis-1.0.jar
> > ooxml-schemas-1.0.jar
> > pdfbox-1.1.0.jar
> > poi-3.6.jar
> > poi-ooxml-3.6.jar
> > poi-ooxml-schemas-3.6.jar
> > poi-scratchpad-3.6.jar
> > tagsoup-1.2.jar
> > tika-core-0.7.jar
> > tika-parsers-0.7.jar
> > xml-apis-1.0.b2.jar
> > xmlbeans-2.3.0.jar
> >
> > Still same result for me..
> >
> > Marc,
> >
> > I'm on Windows, and I copied the above jars directly into the already extracted
> > folder webapps/solr/web-in/lib, in addition to what was already there. I
> > didn't explicitly un-jar and re-jar the solr.war, but do you think that
> > could be the issue? I think Tomcat extracts the war and uses the folder in
> > webapps (I didn't put the war file in webapps; instead I put the extracted
> > solr folder there directly).
> >
> > If it has worked for you guys, especially with my two pdfs, then that's
> > really great. Please let me know your exact procedure, including what all
> > you copied and where, or if you see I missed something obvious..
> >
> > Thanks,
> > Praveen
> >
> > On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal <sagar...@opentext.com> wrote:
> >
> > > > > Both the files work for me, Praveen.
> > > > >
> > > > > Thanks,
> > > > > Sandhya
> > > > >
> > > > > From: Praveen Agrawal [mailto:pkal...@gmail.com]
> > > > > Sent: Tuesday, May 04, 2010 5:22 PM
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Problem with pdf, upgrading Cell
> > > > >
> > > > > another one here..
> > > > >
> > > > > On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
> > > > > It bounced because of attachment's size..
> > > > > attaching one by one now..
> > > > >
> > > > > On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
> > > > > I noticed following pattern/relationship b/w producer/creator and content
> > > > > extraction, not sure if helpful (as Grant told earlier pdfs are notorious):
> > > > >
> > > > > producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)
> > > > > Creator: PScript5.dll Version 5.2.2
> > > > > Extraction: no content -- "installing Solr in Tomcat.pdf" (attached - i generated)
> > > > > ---------------------
> > > > >
> > > > > Producer: Acrobat Distiller 7.0.5 (Windows)
> > > > > creator: PScript5.dll Version 5.2.2
> > > > > Extraction: One line content
> > > > > ----------------------
> > > > >
> > > > > Producer: Acrobat Distiller 8.1.0 (Windows)
> > > > > creator: Acrobat PDFMaker 8.1 for Word
> > > > > Extraction: one line of content (Free_Two_way_Radio_Guide.pdf -
> > > > > attached - was available freely on their website)
> > > > > -------------------------
> > > > >
> > > > > Producer: FOP 0.20.5
> > > > > Extraction: full content "/docs/features.pdf | linkmap.pdf" etc
> > > > > --------------
> > > > > Thanks.
> > > > > Praveen
> > > > >
> > > > > On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
> > > > > Yes Sandhya,
> > > > > I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is
> > > > > what you were asking.
> > > > > Thanks.
> > > > >
> > > > > On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal <sagar...@opentext.com> wrote:
> > > > > Praveen,
> > > > >
> > > > > Along with the tika core and parser jars, did you run "mvn
> > > > > dependency:copy-dependencies" to generate all the dependencies too?
> > > > >
> > > > > Thanks,
> > > > > Sandhya
> > > > >
> > > > > -----Original Message-----
> > > > > From: Praveen Agrawal [mailto:pkal...@gmail.com]
> > > > > Sent: Tuesday, May 04, 2010 4:52 PM
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Problem with pdf, upgrading Cell
> > > > >
> > > > > I seem to have mixed results.
> > > > >
> > > > > Here is what I did:
> > > > > copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc. into
> > > > > contrib/extraction/lib (of course removed the old ones), as well as into
> > > > > web-inf/lib of the solr web app in tomcat.
> > > > >
> > > > > Now it extracts content from some pdfs, but either no content from others,
> > > > > or only a line of content. For example, "/docs/Installing Solr in Tomcat.pdf"
> > > > > still shows no content. I have two other pdfs for which it extracts only
> > > > > one line of content.
> > > > >
> > > > > Also, I'm now getting a single value in the 'title' field for some pdfs, and two
> > > > > for others. In cases where it can extract the full content, it shows the title
> > > > > I gave as a literal while submitting the pdf. For pdfs where no content was
> > > > > extracted, it shows one empty title and mine.
> > > > > For pdfs where it extracted
> > > > > only one line of content, it shows that line as a title too, along with mine.
> > > > > The 'title' field is defined as multivalued in the schema.
> > > > >
> > > > > Any idea what's going on? Or am I missing something?
> > > > >
> > > > > On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb <dekay...@hotmail.com> wrote:
> > > > > >
> > > > > > Hey,
> > > > > > I got it to work. I just redid my steps; I had forgotten several libraries
> > > > > > that were imported through the xml. PDF extraction seems to work once again;
> > > > > > I have yet to find one that raises an exception!
> > > > > >
> > > > > > Thanks for the investigation, at least we now have a fix :)
> > > > > > Marc