RE: PDFBox/Tika Performance Issues

2010-03-23 Thread Giovanni Fernandez-Kincade
nasa.gov] Sent: Tuesday, March 23, 2010 11:03 AM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues Hi Giovanni, The error that you're showing in your logs below indicates that this message signature: org.apache.solr.handler.ContentStreamLoader.load(Lorg

Re: PDFBox/Tika Performance Issues

2010-03-23 Thread Mattmann, Chris A (388J)
a couple of days. From my solrconfig: ignored_ text -Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Saturday, March 20, 2010 8:43 AM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Per

RE: PDFBox/Tika Performance Issues

2010-03-23 Thread Giovanni Fernandez-Kincade
ser@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues What's your configuration look like for the ExtractReqHandler? On Mar 19, 2010, at 2:42 PM, Giovanni Fernandez-Kincade wrote: > Yeah I've been trying that - I keep getting this error when indexing a PDF > with a trunk-b

Re: PDFBox/Tika Performance Issues

2010-03-20 Thread Grant Ingersoll
at.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) > at > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) > at java.lang.Thread.run(Unknown Source) ) that prevented it from > fulfilling this request. Apache Tomcat/5.5.27

RE: PDFBox/Tika Performance Issues

2010-03-19 Thread Giovanni Fernandez-Kincade
al Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Friday, March 19, 2010 1:46 PM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues Can you try trunk? On Mar 19, 2010, at 1:12 PM, Giovanni Fernandez-Kincade wrote: > Solr

Re: PDFBox/Tika Performance Issues

2010-03-19 Thread Grant Ingersoll
, 2010 1:02 PM > To: solr-user@lucene.apache.org > Subject: Re: PDFBox/Tika Performance Issues > > > On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote: >> >> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib >> folder for my

RE: PDFBox/Tika Performance Issues

2010-03-19 Thread Giovanni Fernandez-Kincade
Time:Wed Mar 17 17:05:19 EDT 2010 -Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Friday, March 19, 2010 1:02 PM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez

Re: PDFBox/Tika Performance Issues

2010-03-19 Thread Grant Ingersoll
On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote: > > 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib > folder for my Solr Core, and renamed it to the name of the existing Tika Jar > (tika-0.3.jar). What version of Solr are you on? It's been a while s

Re: PDFBox/Tika Performance Issues

2010-03-19 Thread Mattmann, Chris A (388J)
that works... -Original Message- From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Friday, March 19, 2010 12:04 AM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues Hi Giovanni, Let's try and isolate the problem. Can you try parsing th

RE: PDFBox/Tika Performance Issues

2010-03-19 Thread Giovanni Fernandez-Kincade
Yeah I had tested it previously and that works... -Original Message- From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Friday, March 19, 2010 12:04 AM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues Hi Giovanni, Let's tr

Re: PDFBox/Tika Performance Issues

2010-03-18 Thread Mattmann, Chris A (388J)
v] Sent: Tuesday, March 16, 2010 11:50 PM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues Hi Giovanni, Comments below: > I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance. > This is what I've tried so far (which was really just

RE: PDFBox/Tika Performance Issues

2010-03-17 Thread Giovanni Fernandez-Kincade
as to do with the lib deps. Try what I mentioned above and let's go from there. Cheers, Chris > -Original Message- > From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] > Sent: Tuesday, March 16, 2010 5:41 PM > To: solr-user@lucene.

Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Mattmann, Chris A (388J)
he lib deps. Try what I mentioned above and let's go from there. Cheers, Chris > -Original Message- > From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] > Sent: Tuesday, March 16, 2010 5:41 PM > To: solr-user@lucene.apache.org > Subject: RE: PDFBox/Tika

RE: PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
gfernandez-kinc...@capitaliq.com] Sent: Tuesday, March 16, 2010 5:41 PM To: solr-user@lucene.apache.org Subject: RE: PDFBox/Tika Performance Issues Thanks Chris! I'll try the patch. -Original Message- From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Tuesd

RE: PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
Thanks Chris! I'll try the patch. -Original Message- From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Tuesday, March 16, 2010 5:37 PM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues Guys, I think this is an issue with PDFBO

Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Mattmann, Chris A (388J)
umber of CPUs on the machine), but even with 5 threads it's not looking so hot. -Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Tuesday, March 16, 2010 5:15 PM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Iss

RE: PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
DFBox/Tika Performance Issues Hmm, that is an ugly thing in PDFBox. We should probably take this over to the PDFBox project. How many threads are you indexing with? FWIW, for that many documents, I might consider using Tika on the client side to save on a lot of network traffic. -Grant On M

Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Grant Ingersoll
Hmm, that is an ugly thing in PDFBox. We should probably take this over to the PDFBox project. How many threads are you indexing with? FWIW, for that many documents, I might consider using Tika on the client side to save on a lot of network traffic. -Grant On Mar 16, 2010, at 4:37 PM, Giovan

PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
I've been trying to bulk index about 11 million PDFs, and while profiling our Solr instance, I noticed that all of the threads that are processing indexing requests are constantly blocking each other during this call: http-8080-Processor39 [BLOCKED] CPU time: 9:35 java.util.Collections$Synchroni