RE: Solr cloud performance degradation with billions of documents

Wilburn, Scott Wed, 13 Aug 2014 14:43:46 -0700

Thanks for replying Jack. I have 4 SolrCloud instances( or clusters ), each 
consisting of 32 shards. The clusters do not have any interaction with each 
other.

Thanks,
Scott 

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Wednesday, August 13, 2014 2:17 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

Could you clarify what you mean with the term "cloud", as in "per cloud" and 
"individual clouds"? That's not a proper Solr or SolrCloud concept per se. 
SolrCloud works with a single "cluster" of nodes. And there is no interaction 
between separate SolrCloud clusters.

-- Jack Krupansky

-----Original Message-----
From: Wilburn, Scott
Sent: Wednesday, August 13, 2014 5:08 PM
To: solr-user@lucene.apache.org
Subject: Solr cloud performance degradation with billions of documents

Hello everyone,
I am trying to use SolrCloud to index a very large number of simple documents 
and have run into some performance and scalability limitations and was 
wondering what can be done about it.

Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the 
Solr shards and each node has 128GB of memory. The current SolrCloud setup is 
split into 4 separate and individual clouds of 32 shards each thereby giving 
four running shards per cloud or one cloud per eight nodes. Each shard is 
currently assigned a 6GB heap size. I’d prefer to avoid increasing heap memory 
for Solr shards to have enough to run other MapReduce jobs on the cluster.

The rate of documents that I am currently inserting into these clouds per day 
is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion into 
the fourth ; however to account for capacity, the aim is to scale the solution 
to support double that amount of documents. To index these documents, there are 
MapReduce jobs that run that generate the Solr XML documents and will then 
submit these documents via SolrJ's CloudSolrServer interface. In testing, I 
have found that limiting the number of active parallel inserts to 80 per cloud 
gave the best performance as anything higher gave diminishing returns, most 
likely due to the constant shuffling of documents internally to SolrCloud. From 
an index perspective, dated collections are being created to hold an entire 
day's of documents and generally the inserting happens primarily on the current 
day (the previous days are only to allow for searching) and the plan is to keep 
up to 60 days (or collections) in each cloud. A single shard index in one 
collection in the busiest cloud currently takes up 30G disk space or 960G for 
the entire collection. The documents are being auto committed with a hard 
commit time of 4 minutes (opensearcher = false) and soft commit time of 8 
minutes.

From a search perspective, the use case is fairly generic and simple searches 
of the type :, so there is no need to tune the system to use any of the more 
advanced querying features. Therefore, the most important thing for me is to 
have the indexing performance be able to keep up with the rate of input.

In the initial load testing, I was able to achieve a projected indexing rate of 
10 Billion documents per cloud per day for a grand total of 40 Billion per day. 
However, the initial load testing was done on fairly empty clouds with just a 
few small collections. Now that there have been several days of documents being 
indexed, I am starting to see a fairly steep drop-off in indexing performance 
once the clouds reached about 15 full collections (or about 80-100 Billion 
documents per cloud) in the two biggest clouds. Based on current application 
logging I’m seeing a 40% drop off in indexing performance. Because of this, I 
have concerns on how performance will hold as more collections are added.

My question to the community is if anyone else has had any experience in using 
Solr at this scale (hundreds of Billions) and if anyone has observed such a 
decline in indexing performance as the number of collections increases. My 
understanding is that each collection is a separate index and therefore the 
inserting rate should remain constant. Aside from that, what other tweaks or 
changes can be done in the SolrCloud configuration to increase the rate of 
indexing performance? Am I hitting a hard limitation of what Solr can handle?

Thanks,
Scott

RE: Solr cloud performance degradation with billions of documents

Reply via email to