Re: indexing cpu utilization

Mark Miller Wed, 02 Jan 2013 16:26:28 -0800

32 cores eh? You probably have to raise some limits to take advantage of that.


https://issues.apache.org/jira/browse/SOLR-4078
support configuring IndexWriter max thread count in solrconfig

That's coming in 4.1 and is likely important - the default is only 8.
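
A minimal sketch of how that limit might be raised in solrconfig.xml once 
4.1 is available. As far as I know the element added by SOLR-4078 is 
maxIndexingThreads inside indexConfig; the value 32 is only an illustrative 
assumption matching the core count, not a tested recommendation:

```xml
<indexConfig>
  <!-- Assumed example value: allow up to 32 threads to index into the
       IndexWriter at once, instead of the default of 8. -->
  <maxIndexingThreads>32</maxIndexingThreads>
</indexConfig>
```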

You might also want to experiment with using more merge threads. I think the 
default may be 3.
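
A hedged sketch of how the merge thread count could be raised, assuming the 
mergeScheduler element in indexConfig passes maxMergeCount and 
maxThreadCount through to ConcurrentMergeScheduler in your version (check 
the example solrconfig.xml that ships with your release). The numbers are 
illustrative assumptions only:

```xml
<indexConfig>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <!-- Listed first on purpose: maxMergeCount must not be smaller
         than maxThreadCount. Both values are assumed examples. -->
    <int name="maxMergeCount">8</int>
    <int name="maxThreadCount">6</int>
  </mergeScheduler>
</indexConfig>
```

More merge threads only help if the disks can keep up, so it is worth 
watching iostat while experimenting.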

Beyond that, you may want to look at running multiple JVMs on the one host and 
doing distributed indexing. That can certainly have benefits, but you have to 
weigh them against the management costs. And make sure process->processor 
affinity is in gear.

Finally, make sure you are using many threads to add docs...

- Mark

On Jan 2, 2013, at 4:39 PM, Uwe Reh <r...@hebis.uni-frankfurt.de> wrote:

> Hi,
> 
> while trying to optimize our indexing workflow I reached the same end point 
> that Gabriel Shen described in his mail: my Solr server won't utilize more 
> than 40% of the computing power.
> I ran some tests, but I'm not able to find the bottleneck. Could anybody 
> help to solve this puzzle?
> 
> At first let me describe the environment:
> 
> Server:
> - Two socket Opteron (interlagos) => 32 cores
> - 64 GB RAM (1600 MHz)
> - SATA Disks: spindle and ssd
> - Solaris 5.11
> - JRE 1.7.0
> - Solr 4.0
> - ApplicationServer Jetty
> - 1Gb network interface
> 
> Client:
> - same hardware as server
> - either multi threaded solrj client using multiple instances of 
> HttpSolrServer
> - or multi threaded solrj client using a ConcurrentUpdateSolrServer with 100 
> threads
> 
> Problem:
> - 10,000,000 docs of bibliographic data (~4k each)
> - with a simplified schema definition it takes 10 hours to index <=> 
> ~250docs/second
> - with the real schema.xml it takes 50 hours to index  <=> ~50docs/second
> In both cases the client takes just 2% of the CPU resources and the server 
> 35%. It's obvious that there is some optimization potential in the schema 
> definition, but why does the server never use more than 40% of the CPU power?
> 
> 
> Discarded possible bottlenecks:
> - Ram for the JVM
> Solr takes only up to 12G of heap and there is only negligible GC activity. 
> So the increase from 16G to 32G of possible heap made no difference.
> - Bandwidth of the net
> The transmitted data is identical in both cases. The size of the transmitted 
> data is somewhat below 50G. Since both machines have a dedicated 1G line to 
> the switch, the raw transmission should not take much more than 10 minutes
> - Performance of the client
> As above, the client is fast enough for the simplified case (10h). A dry 
> run (just preprocessing, no indexing) can finish in 75 minutes.
> - Server's disk IO
> The size of the simpler index is ~100G, the size of the other is ~150G. That 
> is a factor of 1.5, not 5. The difference between an SSD and a spinning disk 
> is not noticeable. The output of 'iostat' and 'zpool iostat' is unremarkable.
> - Bad thread distribution
> 'mpstat' shows a well distributed load over all cpus and a sensible amount of 
> crosscalls (less than ten/cpu)
> - Solr update parameter (solrconfig.xml)
> Inspired by 
> http://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1
> I'm using:
>> <ramBufferSizeMB>256</ramBufferSizeMB>
>> <mergeFactor>40</mergeFactor>
>> <termIndexInterval>1024</termIndexInterval>
>> <lockType>native</lockType>
>> <unlockOnStartup>true</unlockOnStartup>
> Any changes to these parameters made it worse.
> 
> To get an idea of what's going on, I've gathered some statistics with 
> visualvm (see attachment).
> The distribution of real and CPU time looks significant, but I'm not smart 
> enough to interpret the results.
> The method 
> org.apache.lucene.index.ThreadAffinityDocumentsWriterThreadPool.getAndLock() 
> is active 80% of the time but takes only 1% of the CPU time. On the other 
> hand, the second method 
> org.apache.commons.codec.language.bm.PhoneticEngine$PhonemeBuilder.append() 
> is active 12% of the time and is always running on a CPU.
> 
> So again the question: "When there are free resources in all dimensions, why 
> does Solr not utilize more than 40% of the computing power?"
> Bandwidth of the RAM?? I can't believe this. How to verify?
> ???
> 
> Any hints are welcome.
> Uwe
> 
