Be careful when you say "instance" - that usually refers to a single Solr
node. Anyway...
32 shards - with a replication factor of 1?
So, given your worst case here, 5 billion documents in a 32-node cluster,
that's 156 million documents per node. What is the index size on a typical
node? And how much system memory is available for caching of file reads?
Generally, you want to have enough system memory to cache the full index. Or
do you have SSD?
But please clarify what you mean by "about 80-100 Billion documents per
cloud". Is it really 5 billion total, refreshed every day, or 5 billion
added per day and lots of days stored?
If you start seeing the indexing rate drop off, that could be caused by not
having enough system memory (RAM) to cache the full index. In particular,
Lucene will occasionally be performing index merges, which become quite
I/O-intensive when the index no longer fits in the OS cache.
I would start with a rule of thumb of 100 million documents per node (and
that is million, not billion.) That could be a lot higher - or a lot lower -
based on your actual schema and data value distribution.
-- Jack Krupansky
-----Original Message-----
From: Wilburn, Scott
Sent: Wednesday, August 13, 2014 5:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr cloud performance degradation with billions of documents
Thanks for replying Jack. I have 4 SolrCloud instances (or clusters), each
consisting of 32 shards. The clusters do not have any interaction with each
other.
Thanks,
Scott
-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Wednesday, August 13, 2014 2:17 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents
Could you clarify what you mean with the term "cloud", as in "per cloud" and
"individual clouds"? That's not a proper Solr or SolrCloud concept per se.
SolrCloud works with a single "cluster" of nodes. And there is no
interaction between separate SolrCloud clusters.
-- Jack Krupansky
-----Original Message-----
From: Wilburn, Scott
Sent: Wednesday, August 13, 2014 5:08 PM
To: solr-user@lucene.apache.org
Subject: Solr cloud performance degradation with billions of documents
Hello everyone,
I am trying to use SolrCloud to index a very large number of simple
documents, and I have run into some performance and scalability limitations
and am wondering what can be done about them.
Hardware-wise, I have a 32-node Hadoop cluster that I use to run all of the
Solr shards, and each node has 128GB of memory. The current SolrCloud setup
is split into 4 separate and individual clouds of 32 shards each, thereby
giving four running shards per node, or one cloud per eight nodes. Each
shard is currently assigned a 6GB heap. I’d prefer to avoid increasing the
heap size for the Solr shards so that enough memory is left to run other
MapReduce jobs on the cluster.
The rate of documents that I am currently inserting into these clouds per
day is 5 billion each into two of the clouds, 3 billion into the third, and
2 billion into the fourth; however, to account for future capacity, the aim
is to scale the solution to support double that number of documents.

To index these documents, MapReduce jobs generate the Solr XML documents and
then submit them via SolrJ's CloudSolrServer interface. In testing, I have
found that limiting the number of active parallel inserts to 80 per cloud
gave the best performance; anything higher gave diminishing returns, most
likely due to the constant shuffling of documents within SolrCloud as they
are routed to their shards.

From an index perspective, dated collections are created to hold an entire
day's worth of documents; inserting happens primarily on the current day
(the previous days are kept only to allow searching), and the plan is to
keep up to 60 days (or collections) in each cloud. A single shard index in
one collection in the busiest cloud currently takes up 30GB of disk space,
or 960GB for the entire collection. The documents are auto-committed with a
hard commit time of 4 minutes (openSearcher=false) and a soft commit time of
8 minutes.
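For reference, the submission path looks roughly like the sketch below (the
ZooKeeper hosts, collection name, and field names are placeholders for
illustration, not our actual values):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class DailyIndexer {
        public static void main(String[] args) throws Exception {
            // Connect through ZooKeeper so SolrJ can route each document to its shard leader.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            // Write into the dated collection for the current day.
            server.setDefaultCollection("docs-20140813");

            // Build a batch of simple documents (field names are made up here).
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", UUID.randomUUID().toString());
                doc.addField("timestamp", System.currentTimeMillis());
                doc.addField("value", "example-" + i);
                batch.add(doc);
            }

            // Send the batch; no explicit commit here -- hard commits (openSearcher=false)
            // and soft commits are handled server-side by the autoCommit settings.
            server.add(batch);
            server.shutdown();
        }
    }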
From a search perspective, the use case is fairly generic: simple searches
of the form field:value, so there is no need to tune the system for any of
the more advanced query features. Therefore, the most important thing for me
is for the indexing performance to be able to keep up with the rate of
input.
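To make that concrete, a typical lookup is no more involved than something
like this (the field and value are made up for illustration):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SimpleLookup {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("docs-20140813");

            // A plain field:value lookup -- no facets, grouping, or other advanced features.
            QueryResponse rsp = server.query(new SolrQuery("customer_id:12345"));
            System.out.println("Found " + rsp.getResults().getNumFound() + " documents");

            server.shutdown();
        }
    }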
In the initial load testing, I was able to achieve a projected indexing rate
of 10 billion documents per cloud per day, for a grand total of 40 billion
per day. However, that initial load testing was done on fairly empty clouds
with just a few small collections. Now that several days of documents have
been indexed, I am starting to see a fairly steep drop-off in indexing
performance once the clouds reach about 15 full collections (or about 80-100
billion documents per cloud) in the two biggest clouds. Based on current
application logging, I’m seeing a 40% drop-off in indexing throughput.
Because of this, I have concerns about how performance will hold up as more
collections are added.
My question to the community is whether anyone else has experience using
Solr at this scale (hundreds of billions of documents) and whether anyone
has observed such a decline in indexing performance as the number of
collections increases. My understanding is that each collection is a
separate index, and therefore the indexing rate should remain constant.
Aside from that, what other tweaks or changes can be made to the SolrCloud
configuration to increase indexing performance? Am I hitting a hard
limitation of what Solr can handle?
Thanks,
Scott