Re: Problem with pdf, upgrading Cell

Praveen Agrawal Wed, 05 May 2010 01:22:49 -0700

Marc & Sandhya,
Did you use Solr from trunk?
I used the Solr 1.4 distribution, and even after copying all the jars, I still
get the same results for the PDFs I posted here.
Thanks.
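
One thing that could be checked here: further down in this thread the new jars were copied into WEB-INF/lib "in addition to what were already there", so an old and a new version of the same library may be sitting side by side. Below is a minimal Python sketch for spotting that. The function names are made up for illustration, the file-name pattern assumes the usual "name-version.jar" convention, and pdfbox-0.7.3.jar in the usage note is only a hypothetical example of an older jar.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: jars follow "<artifact>-<version>.jar", where the version
# starts at the first "-" that is followed by a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def split_jar_name(filename):
    """Split 'pdfbox-1.1.0.jar' into ('pdfbox', '1.1.0')."""
    match = JAR_NAME.match(filename)
    if match is None:
        # No recognizable version: treat the whole stem as the artifact name.
        stem = filename[:-4] if filename.endswith(".jar") else filename
        return stem, ""
    return match.group("artifact"), match.group("version")


def find_version_conflicts(filenames):
    """Return {artifact: [versions]} for artifacts present in more than one version."""
    versions = defaultdict(set)
    for name in filenames:
        artifact, version = split_jar_name(name)
        versions[artifact].add(version)
    # Versions are sorted as plain strings, which is enough for display.
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}


def conflicts_in_dir(lib_dir):
    """Run the check on every *.jar file directly inside lib_dir."""
    return find_version_conflicts(p.name for p in Path(lib_dir).glob("*.jar"))
```

For example, find_version_conflicts(["pdfbox-0.7.3.jar", "pdfbox-1.1.0.jar", "poi-3.6.jar"]) reports pdfbox with both versions, and conflicts_in_dir("webapps/solr/WEB-INF/lib") would run the same check on the extracted web app. An empty result does not prove the classpath is fine; it only rules out this one kind of mix-up.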


On Wed, May 5, 2010 at 1:09 PM, Marc Ghorayeb <dekay...@hotmail.com> wrote:

>
> Hey,
> I have the same list, and I added the extraction library (the Apache Solr
> Cell jar) to it, though you might not need it specifically inside the war file.
> Marc
> > From: sagar...@opentext.com
> > To: solr-user@lucene.apache.org
> > Date: Wed, 5 May 2010 10:21:36 +0530
> > Subject: RE: Problem with pdf, upgrading Cell
> >
> > Looks like the highlighting may not work here. The following is the list
> > of jars I copied:
> >
> > asm-3.1.jar
> > bcmail-jdk15-1.45.jar
> > bcprov-jdk15-1.45.jar
> > commons-compress-1.0.jar
> > commons-logging-1.1.1.jar
> > dom4j-1.6.1.jar
> > fontbox-1.1.0.jar
> > geronimo-stax-api_1.0_spec-1.0.1.jar
> > jempbox-1.1.0.jar
> > log4j-1.2.14.jar
> > metadata-extractor-2.4.0-beta-1.jar
> > pdfbox-1.1.0.jar
> > poi-3.6.jar
> > poi-ooxml-3.6.jar
> > poi-ooxml-schemas-3.6.jar
> > poi-scratchpad-3.6.jar
> > tagsoup-1.2.jar
> > tika-core-0.7.jar
> > tika-parsers-0.7.jar
> > xml-apis-1.0.b2.jar
> > xmlbeans-2.3.0.jar
> >
> > Thanks,
> > Sandhya
> >
> >
> >
> > -----Original Message-----
> > From: Sandhya Agarwal [mailto:sagar...@opentext.com]
> > Sent: Wednesday, May 05, 2010 10:06 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Problem with pdf, upgrading Cell
> >
> > Praveen,
> >
> >
> >
> > I only have the highlighted jars copied. Not sure if we need the other
> > jars. Also, I copied the jars directly into solr\WEB-INF\lib, like you did.
> >
> >
> >
> > Thanks,
> >
> > Sandhya
> >
> >
> >
> > -----Original Message-----
> > From: Praveen Agrawal [mailto:pkal...@gmail.com]
> > Sent: Tuesday, May 04, 2010 8:10 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Problem with pdf, upgrading Cell
> >
> >
> >
> > Hi Sandhya,
> >
> > I must be missing something. I copied all the dependency jars to both the
> > contrib/extraction/lib and WEB-INF/lib folders. Here is the list of jars
> > copied:
> >
> > asm-3.1.jar
> > bcmail-jdk15-1.45.jar
> > bcprov-jdk15-1.45.jar
> > commons-compress-1.0.jar
> > commons-logging-1.1.1.jar
> > dom4j-1.6.1.jar
> > fontbox-1.1.0.jar
> > geronimo-stax-api_1.0_spec-1.0.1.jar
> > hamcrest-core-1.1.jar
> > jempbox-1.1.0.jar
> > junit-3.8.1.jar
> > log4j-1.2.14.jar
> > metadata-extractor-2.4.0-beta-1.jar
> > mockito-core-1.7.jar
> > nekohtml-1.9.9.jar
> > objenesis-1.0.jar
> > ooxml-schemas-1.0.jar
> > pdfbox-1.1.0.jar
> > poi-3.6.jar
> > poi-ooxml-3.6.jar
> > poi-ooxml-schemas-3.6.jar
> > poi-scratchpad-3.6.jar
> > tagsoup-1.2.jar
> > tika-core-0.7.jar
> > tika-parsers-0.7.jar
> > xml-apis-1.0.b2.jar
> > xmlbeans-2.3.0.jar
> >
> > Still the same result for me.
> >
> > Marc,
> >
> > I'm on Windows, and I copied the above jars directly into the already
> > extracted folder webapps/solr/WEB-INF/lib, in addition to what was already
> > there. I didn't explicitly un-jar and re-jar the solr.war, but do you think
> > that could be the issue? I think Tomcat extracts the war and uses the folder
> > in webapps (I didn't put the war file in webapps; instead I put the
> > extracted solr folder there directly).
> >
> > If it has worked for you guys, especially with my two PDFs, then that's
> > really great. Please let me know your exact procedure, including everything
> > you copied and where, or whether you see that I missed something obvious.
> >
> >
> >
> > Thanks,
> >
> > Praveen
> >
> >
> >
> >
> >
> > On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal <sagar...@opentext.com> wrote:
> >
> >
> >
> > > Both the files work for me, Praveen.
> > >
> > > Thanks,
> > > Sandhya
> > >
> > > From: Praveen Agrawal [mailto:pkal...@gmail.com]
> > > Sent: Tuesday, May 04, 2010 5:22 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Problem with pdf, upgrading Cell
> > >
> > > Another one here.
> > >
> > > On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
> > > It bounced because of the attachment size.
> > > Attaching them one by one now.
> > >
> > > On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
> >
> > > I noticed the following pattern between the PDF producer/creator and
> > > content extraction. Not sure if it is helpful (as Grant said earlier,
> > > PDFs are notorious):
> > >
> > > Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)
> > > Creator: PScript5.dll Version 5.2.2
> > > Extraction: no content  --  "installing Solr in Tomcat.pdf" (attached, I generated it)
> > > ---------------------
> > >
> > > Producer: Acrobat Distiller 7.0.5 (Windows)
> > > Creator: PScript5.dll Version 5.2.2
> > > Extraction: one line of content
> > > ----------------------
> > >
> > > Producer: Acrobat Distiller 8.1.0 (Windows)
> > > Creator: Acrobat PDFMaker 8.1 for Word
> > > Extraction: one line of content  --  Free_Two_way_Radio_Guide.pdf
> > > (attached, it was freely available on their website)
> > > -------------------------
> > >
> > > Producer: FOP 0.20.5
> > > Extraction: full content  --  "/docs/features.pdf | linkmap.pdf" etc.
> > > --------------
> > > Thanks.
> > > Praveen
> >
> > >
> >
> > >
> >
> > > On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
> > > Yes Sandhya,
> > > I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this
> > > is what you were asking.
> > > Thanks.
> > >
> > > On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal <sagar...@opentext.com> wrote:
> > > Praveen,
> > >
> > > Along with the tika core and parser jars, did you run "mvn
> > > dependency:copy-dependencies" to generate all the dependencies too?
> > >
> > > Thanks,
> > > Sandhya
> >
> > >
> >
> > > -----Original Message-----
> > > From: Praveen Agrawal [mailto:pkal...@gmail.com]
> > > Sent: Tuesday, May 04, 2010 4:52 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Problem with pdf, upgrading Cell
> > >
> > > I seem to have mixed results.
> > >
> > > Here is what I did:
> > > I copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j etc. jars into
> > > contrib/extraction/lib (of course removing the old ones), as well as into
> > > WEB-INF/lib of the Solr web app in Tomcat.
> > >
> > > Now it extracts content from some PDFs, but from others it extracts either
> > > no content or only a line of content. For example, "/docs/Installing Solr
> > > in Tomcat.pdf" still shows no content. I have two other PDFs for which it
> > > extracts only one line of content.
> > >
> > > Also, I'm now getting a single value in the 'title' field for some PDFs,
> > > and two for others. Where it can extract the full content, it shows the
> > > title that I gave as a literal while submitting the PDF. For the PDF where
> > > no content was extracted, it shows one empty title and mine. For the PDFs
> > > where it extracted only one line of content, it shows that line as a title
> > > in addition to mine. The 'title' field is defined as multivalued in the
> > > schema.
> > >
> > > Any idea what's going on? Or am I missing something?
> >
> > >
> >
> > >
> >
> > >
> >
> > > On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb <dekay...@hotmail.com> wrote:
> >
> > >
> >
> > > >
> >
> > > > Hey,
> > > > I got it to work. I just redid my steps; I had forgotten several
> > > > libraries that were imported through the xml. PDF extraction seems to
> > > > work once again, and I have yet to find a PDF that raises an exception!
> > > >
> > > > Thanks for the investigation, at least we now have a fix :)
> > > > Marc
>
>
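
For reference, the two jar lists quoted above (Sandhya's and mine) can be compared mechanically. This is only a small Python sketch: the variable and function names are invented for illustration, and the lists are typed in from the messages above rather than read from either machine.

```python
# Jar lists as posted in this thread (typed in by hand from the quoted mails).
SANDHYA_JARS = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar
commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar
fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar jempbox-1.1.0.jar
log4j-1.2.14.jar metadata-extractor-2.4.0-beta-1.jar pdfbox-1.1.0.jar
poi-3.6.jar poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar
tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar
""".split()

PRAVEEN_JARS = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar
commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar
fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar jempbox-1.1.0.jar junit-3.8.1.jar log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar mockito-core-1.7.jar
nekohtml-1.9.9.jar objenesis-1.0.jar ooxml-schemas-1.0.jar
pdfbox-1.1.0.jar poi-3.6.jar poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar
tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar
""".split()


def compare_jar_lists(first, second):
    """Return (only_in_first, only_in_second), each sorted by file name."""
    first_set, second_set = set(first), set(second)
    return sorted(first_set - second_set), sorted(second_set - first_set)


only_sandhya, only_praveen = compare_jar_lists(SANDHYA_JARS, PRAVEEN_JARS)
print("only in Sandhya's list:", only_sandhya)
print("only in Praveen's list:", only_praveen)
```

With the lists as posted, nothing is missing on my side; the six jars that appear only in my list are hamcrest-core, junit, mockito-core, nekohtml, objenesis and ooxml-schemas. Four of those (hamcrest, junit, mockito, objenesis) are normally test-time libraries, so the difference in the lists probably does not explain the different extraction results by itself.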
