nasa.gov]
Sent: Tuesday, March 23, 2010 11:03 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues
Hi Giovanni,
The error that you're showing in your logs below indicates that this message
signature:
org.apache.solr.handler.ContentStreamLoader.load(Lorg
a couple of days.
From my solrconfig:
ignored_
text
-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Saturday, March 20, 2010 8:43 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues
What does your configuration look like for the ExtractingRequestHandler?
On Mar 19, 2010, at 2:42 PM, Giovanni Fernandez-Kincade wrote:
> Yeah I've been trying that - I keep getting this error when indexing a PDF
> with a trunk-b
at.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>at
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>at java.lang.Thread.run(Unknown Source) ) that prevented it from
> fulfilling this request. Apache Tomcat/5.5.27
al Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Friday, March 19, 2010 1:46 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues
Can you try trunk?
On Mar 19, 2010, at 1:12 PM, Giovanni Fernandez-Kincade wrote:
> Solr
, 2010 1:02 PM
> To: solr-user@lucene.apache.org
> Subject: Re: PDFBox/Tika Performance Issues
>
>
> On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote:
>>
>> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
>> folder for my
Time:Wed Mar 17 17:05:19 EDT 2010
-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Friday, March 19, 2010 1:02 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues
On Mar 16, 2010, at 6:55 PM, Giovanni Fernandez-Kincade wrote:
>
> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).
What version of Solr are you on? It's been a while s
Yeah I had tested it previously and that works...
-Original Message-
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Friday, March 19, 2010 12:04 AM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues
Hi Giovanni,
Let's try and isolate the problem. Can you try parsing th
v]
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues
Hi Giovanni,
Comments below:
> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just
as to do with the lib deps. Try what I mentioned above and
let's go from there.
Cheers,
Chris
> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, March 16, 2010 5:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDFBox/Tika
gfernandez-kinc...@capitaliq.com]
Sent: Tuesday, March 16, 2010 5:41 PM
To: solr-user@lucene.apache.org
Subject: RE: PDFBox/Tika Performance Issues
Thanks Chris!
I'll try the patch.
-Original Message-
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 5:37 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues
Guys, I think this is an issue with PDFBO
umber of CPUs on the machine), but even with 5 threads it's
not looking so hot.
-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues
Hmm, that is an ugly thing in PDFBox. We should probably take this over to the
PDFBox project. How many threads are you indexing with?
FWIW, for that many documents, I might consider using Tika on the client side
to save on a lot of network traffic.
-Grant
On Mar 16, 2010, at 4:37 PM, Giovan
I've been trying to bulk index about 11 million PDFs, and while profiling our
Solr instance, I noticed that all of the threads that are processing indexing
requests are constantly blocking each other during this call:
http-8080-Processor39 [BLOCKED] CPU time: 9:35
java.util.Collections$Synchroni
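For anyone reading the trace above, here is a minimal, self-contained sketch of how a thread ends up reported as [BLOCKED] on a java.util.Collections synchronized wrapper. It is not PDFBox or Solr code. The class name, the map and its contents are made up for illustration, and since the wrapper class in the trace is cut off, using a synchronized map here is only an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SyncContentionSketch {

    // Returns the state a worker thread reaches while another thread
    // holds the monitor of a Collections.synchronizedMap wrapper.
    public static Thread.State workerStateWhileLockHeld() throws InterruptedException {
        // Hypothetical shared cache; it stands in for whatever shared
        // structure the indexing threads are contending on.
        final Map<String, String> shared =
                Collections.synchronizedMap(new HashMap<String, String>());
        shared.put("font", "cached");

        Thread worker = new Thread(new Runnable() {
            public void run() {
                // Every call on the wrapper must first acquire its monitor.
                shared.get("font");
            }
        }, "indexing-worker");

        Thread.State observed;
        // Hold the wrapper's monitor to simulate a slow call made under the lock.
        synchronized (shared) {
            worker.start();
            long deadline = System.currentTimeMillis() + 5000;
            do {
                Thread.sleep(10); // sleeping does not release the monitor
                observed = worker.getState();
            } while (observed != Thread.State.BLOCKED
                    && System.currentTimeMillis() < deadline);
        }
        worker.join(); // monitor released, the worker can now finish
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("worker state while lock held: " + workerStateWhileLockHeld());
    }
}
```

With many request threads going through one such wrapper, only one of them makes progress at a time, which would be consistent with what the profiler shows. Whether that is what is happening inside PDFBox is a question for the PDFBox project.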