I vaguely recall some thread blocking issue with trying to parse too many PDF files at one time in the same JVM.

Occasionally Tika (actually PDFBox) has been known to hang for some PDF docs.

Do you have enough memory in the JVM? When the CPU is busy, is there much memory available in the JVM? Maybe garbage collection is taking too much of the CPU.
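One quick way to check heap headroom is to log it from the indexing code itself. This is a generic sketch, not something from the original thread; the class name and output format are illustrative:

```java
// Minimal sketch: log how much heap headroom the JVM has. If "used" sits
// near "max" while the CPU is pegged, garbage collection is a likely culprit.
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();                    // the -Xmx ceiling
        long used = rt.totalMemory() - rt.freeMemory();
        System.out.printf("heap used: %d MB of %d MB max%n",
                used / (1024 * 1024), max / (1024 * 1024));
    }
}
```

If used heap is consistently close to the maximum, raising -Xmx or indexing in smaller batches would be the first things to try.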

-- Jack Krupansky

-----Original Message----- From: chris.a.mattm...@jpl.nasa.gov
Sent: Thursday, May 24, 2012 9:55 AM
To: solr-user@lucene.apache.org
Subject: Solr Performance

Hi Chris

First of all, thanks a lot; your earlier inputs on my document indexing
failures helped me a lot!

Now I am facing a few performance issues with the indexing.
This is what I am doing:

- Read data from an Excel sheet, which essentially contains the path of the PDF
file to be indexed plus a few literals that I add to the Solr update
request and can later use as a filter query when
searching. [Category$Subcategory$pathTotheFile]

- My input sheet data may vary from a few thousand lines up to 6 million lines.

- I am making a Set from these lines, dividing it into 4 chunks, and spawning
4 threads, each of which prepares a Solr ContentStreamUpdateRequest and
posts it to Solr.

- In this process I have these issues:

1. My system's CPU usage climbs very high and the indexing is aborted.

2. If I use "setAutoCommitWithin", it doesn't work: initially I can find a
few documents committed, but after that nothing happens.

3. I have used StreamingUpdateSolrServer with a queue size of 20 and a thread count of 4.

4. My main aim is to boost the indexing rate (speed).
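For reference, the chunk-and-thread step I described above looks roughly like this. The worker body is a placeholder (the real routine builds a ContentStreamUpdateRequest per line and posts it to Solr); class and variable names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of "divide the input lines into 4 chunks and spawn 4 threads".
public class ChunkedIndexer {

    // Split a list into at most `parts` contiguous chunks of near-equal size.
    static <T> List<List<T>> chunk(List<T> items, int parts) {
        List<List<T>> chunks = new ArrayList<>();
        int size = (items.size() + parts - 1) / parts;  // ceiling division
        for (int i = 0; i < items.size(); i += size) {
            chunks.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return chunks;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            lines.add("Category$Subcategory$file" + i + ".pdf");
        }

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (List<String> part : chunk(lines, 4)) {
            pool.submit(() -> {
                for (String line : part) {
                    // Placeholder for the real work: parse the line, build a
                    // ContentStreamUpdateRequest for the PDF path, and post it.
                    System.out.println(Thread.currentThread().getName() + " -> " + line);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

An ExecutorService bounds the thread count the same way manual thread spawning does, but makes it easy to experiment with pool sizes when tuning indexing throughput.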

Can you suggest where and what I can tweak in my routine?

Thanks in advance...

Surendra.
