Re: PDFBox/Tika Performance Issues

Mattmann, Chris A (388J) Thu, 18 Mar 2010 21:04:57 -0700

Hi Giovanni,

Let's try to isolate the problem. Can you try parsing the PDF file with
tika-app standalone? Take your tika-app jar file and run:

java -jar tika-app-0.7-SNAPSHOT.jar -m /path/to/pdf/file


That should give you something like:

Content-Type: application/pdf
created: Thu Sep 06 00:41:55 PDT 2007
creator: TeX
producer: pdfeTeX-1.21a
resourceName: Dissertation.pdf

(e.g., this is what I got when I ran it on my Dissertation PDF file).
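
If the standalone check needs to be repeated over several files, it could be scripted roughly as follows. This is only a sketch: the class and method names are made up for illustration, the jar and PDF paths are placeholders, and it assumes the output keeps the "key: value" shape shown above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: runs tika-app with -m and parses its metadata output.
final class TikaMetadataCheck {

    // Same command as in the message above; both paths are placeholders.
    static List<String> command(String tikaAppJar, String pdfPath) {
        return Arrays.asList("java", "-jar", tikaAppJar, "-m", pdfPath);
    }

    // Parses "key: value" lines, splitting on the first ": " only so that
    // values containing colons (such as timestamps) stay intact.
    // Lines without a separator are ignored.
    static Map<String, String> parse(Iterable<String> lines) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(": ");
            if (sep > 0) {
                metadata.put(line.substring(0, sep), line.substring(sep + 2).trim());
            }
        }
        return metadata;
    }

    // Starts the process, collects stdout, and leaves stderr on the console.
    static Map<String, String> run(String tikaAppJar, String pdfPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(tikaAppJar, pdfPath))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return parse(lines);
    }

    public static void main(String[] args) throws Exception {
        // args[0] = path to the tika-app jar, args[1] = path to the PDF
        Map<String, String> metadata = run(args[0], args[1]);
        System.out.println("Content-Type = " + metadata.get("Content-Type"));
    }
}
```

An empty result map would suggest that the parse itself failed, independent of Solr.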

Let's start there - if that works, then there is something up with the 
integration into SolrCell, and we can start to figure that out...

Cheers,
Chris



On 3/17/10 8:06 AM, "Giovanni Fernandez-Kincade" 
<gfernandez-kinc...@capitaliq.com> wrote:

Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an 
error, but the data doesn't get extracted. Using the same PDF with my previous 
/Lib contents works fine.

Any other ideas?

These are the jar files I have in my /Lib

apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
>
>
>
> 1.     Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
>
> 2.     Built this using Maven (mvn install)
>

On track so far.

> 3.     I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do the renaming. I think what you need to do is
to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

into your Solr core /lib folder. Also, make sure to take the updated
PDFBox 1.0.0 jar (you can get this by running
mvn dependency:copy-dependencies in the tika-parsers project, see here:
http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html),
along with the rest of the jar deps for tika-parsers, and drop them in there
as well. Then, make sure to remove the existing tika-0.3.jar, as well as any
of the existing parser lib jar files, and replace them with the new deps.
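
To double-check the result of that manual swap, the two folders can be compared by file name. The sketch below is an illustration only: the class name is invented, and it assumes the copied dependencies sit in one folder (for the dependency plugin this is target/dependency by default) and the Solr core lib folder is passed as the second argument.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: compares the jar names in two directories.
final class LibDirDiff {

    // Collects the file names of all *.jar entries directly inside dir.
    static Set<String> jarNames(Path dir) throws IOException {
        Set<String> names = new TreeSet<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                names.add(jar.getFileName().toString());
            }
        }
        return names;
    }

    // Returns the names that appear in 'left' but not in 'right'.
    static Set<String> onlyIn(Set<String> left, Set<String> right) {
        Set<String> result = new TreeSet<>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) throws IOException {
        Set<String> newDeps = jarNames(Paths.get(args[0])); // e.g. tika-parsers/target/dependency
        Set<String> solrLib = jarNames(Paths.get(args[1])); // the Solr core lib folder
        System.out.println("Missing from Solr lib: " + onlyIn(newDeps, solrLib));
        System.out.println("Not among new deps:    " + onlyIn(solrLib, newDeps));
    }
}
```

The second list will normally contain a few legitimate entries (the Solr Cell jar and the tika-parsers jar itself, for instance), so it is a prompt for a manual look rather than a delete list. A leftover tika-0.3.jar or an old PDFBox jar would show up there.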

A bunch of manual labor yes, but you're on the bleeding edge, so c'est la
vie, right? :) The alternative is to wait for Tika 0.7 to be released and
then for Solr to upgrade to it.

>
> 4.     Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above
and let's go from there.
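
One way to see which jars actually win on a given classpath is to ask the JVM where a class was loaded from. The following is a rough sketch under some assumptions: the class name ClassOrigin is invented, the first two probed names are taken from the stack trace in this thread, and the org.apache.pdfbox name is my assumption about the package used by PDFBox 1.0.0. It only reflects the classpath it is run with, so it would need to be started with the same jars as the Solr core lib folder.

```java
import java.security.CodeSource;

// Hypothetical helper: reports the jar (or directory) a class is loaded from.
final class ClassOrigin {

    // Looks the class up by name without initializing it, so the sketch
    // compiles and runs even when the probed libraries are absent.
    static String locate(String className) {
        try {
            Class<?> type = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            return source == null ? "(bootstrap or unknown)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        String[] probes = {
            "org.apache.tika.parser.pdf.PDFParser",   // from the stack trace in this thread
            "org.pdfbox.pdmodel.font.PDFont",         // old PDFBox package, from the stack trace
            "org.apache.pdfbox.pdmodel.font.PDFont"   // assumed package name in PDFBox 1.0.0
        };
        for (String name : probes) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the old org.pdfbox class still resolves, or PDFParser resolves from an unexpected jar, that would point at leftover libraries.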

Cheers,
Chris

> -----Original Message-----
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, March 16, 2010 5:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDFBox/Tika Performance Issues
>
>
>
> Thanks Chris!
>
>
>
> I'll try the patch.
>
>
>
> -----Original Message-----
>
> From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
>
> Sent: Tuesday, March 16, 2010 5:37 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: Re: PDFBox/Tika Performance Issues
>
>
>
> Guys, I think this is an issue with PDFBox and the version that Tika 0.6
> depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
> include a fix for the problem you're seeing.
>
>
>
> See this discussion [2] on how to patch Tika to use the new PDFBox if you
> can't wait for the 0.7 release which should happen soon (hopefully next few
> weeks).
>
>
>
> Cheers,
>
> Chris
>
>
>
> [1] http://issues.apache.org/jira/browse/TIKA-380
>
> [2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html
>
>
>
>
>
> On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade"
> <gfernandez-kinc...@capitaliq.com> wrote:
>
>
>
> Originally 16 (the number of CPUs on the machine), but even with 5 threads
> it's not looking so hot.
>
>
>
> -----Original Message-----
>
> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
>
> Sent: Tuesday, March 16, 2010 5:15 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: Re: PDFBox/Tika Performance Issues
>
>
>
> Hmm, that is an ugly thing in PDFBox.  We should probably take this over to
> the PDFBox project.  How many threads are you indexing with?
>
>
>
> FWIW, for that many documents, I might consider using Tika on the client side
> to save on a lot of network traffic.
>
>
>
> -Grant
>
>
>
> On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:
>
>
>
>> I've been trying to bulk index about 11 million PDFs, and while profiling our
>> Solr instance, I noticed that all of the threads that are processing indexing
>> requests are constantly blocking each other during this call:
>
>>
>
>> http-8080-Processor39 [BLOCKED] CPU time: 9:35
>> java.util.Collections$SynchronizedMap.get(Object)
>> org.pdfbox.pdmodel.font.PDFont.getAFM()
>> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
>> org.pdfbox.util.PDFStreamEngine.showString(byte[])
>> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
>> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
>> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
>> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
>> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
>> org.pdfbox.util.PDFTextStripper.processPages(List)
>> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
>> org.pdfbox.util.PDFTextStripper.getText(PDDocument)
>> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
>> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
>> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
>> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
>> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
>> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
>> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
>> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
>> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
>> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[])
>> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[])
>> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
>> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
>> java.lang.Thread.run()
>
>>
>
>> Has anyone run into this before? Any ideas on how to reduce the contention?
>
>>
>
>> Thanks,
>
>> Gio.
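
The top frames of the trace show the threads blocked in Collections$SynchronizedMap.get(Object), which takes a single lock for every read. As a general illustration of why that serializes the indexing threads, and not as a description of how PDFBox itself was or should be changed, a read-mostly cache can be sketched with ConcurrentHashMap instead. All names below are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical read-mostly cache: lookups of existing keys do not block each other.
final class ReadMostlyCache<K, V> {

    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    ReadMostlyCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        // Common case: the value is already cached and get() needs no lock,
        // unlike a Collections.synchronizedMap, where every get() is synchronized.
        V value = entries.get(key);
        if (value == null) {
            // First access for this key: the loader runs at most once per key.
            value = entries.computeIfAbsent(key, loader);
        }
        return value;
    }

    public static void main(String[] args) throws InterruptedException {
        // String::length stands in for an expensive lookup such as loading font metrics.
        ReadMostlyCache<String, Integer> cache =
                new ReadMostlyCache<String, Integer>(String::length);
        Thread[] workers = new Thread[5];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 100_000; n++) {
                    cache.get("Helvetica");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println("all workers finished");
    }
}
```

With a synchronized map, the five workers above would take turns on one monitor for every lookup. Here only the first lookup of a key involves any locking.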
>
>
>
> --------------------------
>
> Grant Ingersoll
>
> http://www.lucidimagination.com/
>
>
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
>
>
>
>
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> Chris Mattmann, Ph.D.
>
> Senior Computer Scientist
>
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>
> Office: 171-266B, Mailstop: 171-246
>
> Email: chris.mattm...@jpl.nasa.gov
>
> WWW:   http://sunset.usc.edu/~mattmann/
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> Adjunct Assistant Professor, Computer Science Department
>
> University of Southern California, Los Angeles, CA 90089 USA
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





