Be careful when you say "instance" - that usually refers to a single Solr
node. Anyway...
32 shards - with a replication factor of 1?
So, given your worst case here, 5 billion documents in a 32-node cluster,
that's 156 million documents per node. What is the index size on a typical
node? And how much system memory is available for caching of file reads?
Generally, you want to have enough system memory to cache the full index. Or
do you have SSD?
But please clarify what you mean by "about 80-100 Billion documents per
cloud". Is it really 5 billion total, refreshed every day, or 5 billion
added per day and lots of days stored?
If you start seeing the indexing rate drop off, that could be caused by not
having enough system memory (RAM) to cache the full index. In particular,
Lucene will occasionally be performing index merges, which become quite
I/O-intensive when the index no longer fits in the OS cache.
I would start with a rule of thumb of 100 million documents per node (and
that is million, not billion.) That could be a lot higher - or a lot lower -
based on your actual schema and data value distribution.
-- Jack Krupansky
-----Original Message-----
From: Wilburn, Scott
Sent: Wednesday, August 13, 2014 5:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr cloud performance degradation with billions of documents
Thanks for replying Jack. I have 4 SolrCloud instances (or clusters), each
consisting of 32 shards. The clusters do not have any interaction with each
other.
Thanks,
Scott
-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Wednesday, August 13, 2014 2:17 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents
Could you clarify what you mean with the term "cloud", as in "per cloud" and
"individual clouds"? That's not a proper Solr or SolrCloud concept per se.
SolrCloud works with a single "cluster" of nodes. And there is no
interaction between separate SolrCloud clusters.
-- Jack Krupansky
-----Original Message-----
From: Wilburn, Scott
Sent: Wednesday, August 13, 2014 5:08 PM
To: solr-user@lucene.apache.org
Subject: Solr cloud performance degradation with billions of documents
Hello everyone,
I am trying to use SolrCloud to index a very large number of simple
documents, and I have run into some performance and scalability limitations
and am wondering what can be done about them.
Hardware-wise, I have a 32-node Hadoop cluster that I use to run all of the
Solr shards, and each node has 128GB of memory. The current SolrCloud setup
is split into 4 separate and individual clouds of 32 shards each, thereby
giving four running shards per node, or one cloud per eight nodes. Each
shard is currently assigned a 6GB heap. I’d prefer to avoid increasing the
heap size for the Solr shards so that enough memory is left to run other
MapReduce jobs on the cluster.
The rate of documents that I am currently inserting into these clouds per
day is 5 billion each into two of the clouds, 3 billion into the third, and
2 billion into the fourth; however, to account for future capacity, the aim
is to scale the solution to support double that number of documents.

To index these documents, MapReduce jobs generate the Solr XML documents and
then submit them via SolrJ's CloudSolrServer interface. In testing, I have
found that limiting the number of active parallel inserts to 80 per cloud
gave the best performance; anything higher gave diminishing returns, most
likely due to the constant shuffling of documents within SolrCloud as they
are routed to their shards.

From an index perspective, dated collections are created to hold an entire
day's worth of documents; inserting happens primarily on the current day
(the previous days are kept only to allow searching), and the plan is to
keep up to 60 days (or collections) in each cloud. A single shard index in
one collection in the busiest cloud currently takes up 30GB of disk space,
or 960GB for the entire collection. The documents are auto-committed with a
hard commit time of 4 minutes (openSearcher=false) and a soft commit time of
8 minutes.
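For reference, the submission path looks roughly like the sketch below (the
ZooKeeper hosts, collection name, and field names are placeholders for
illustration, not our actual values):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class DailyIndexer {
        public static void main(String[] args) throws Exception {
            // Connect through ZooKeeper so SolrJ can route each document to its shard leader.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            // Write into the dated collection for the current day.
            server.setDefaultCollection("docs-20140813");

            // Build a batch of simple documents (field names are made up here).
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", UUID.randomUUID().toString());
                doc.addField("timestamp", System.currentTimeMillis());
                doc.addField("value", "example-" + i);
                batch.add(doc);
            }

            // Send the batch; no explicit commit here -- hard commits (openSearcher=false)
            // and soft commits are handled server-side by the autoCommit settings.
            server.add(batch);
            server.shutdown();
        }
    }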
From a search perspective, the use case is fairly generic: simple searches
of the form field:value, so there is no need to tune the system for any of
the more advanced query features. Therefore, the most important thing for me
is for the indexing performance to be able to keep up with the rate of
input.
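To make that concrete, a typical lookup is no more involved than something
like this (the field and value are made up for illustration):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SimpleLookup {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("docs-20140813");

            // A plain field:value lookup -- no facets, grouping, or other advanced features.
            QueryResponse rsp = server.query(new SolrQuery("customer_id:12345"));
            System.out.println("Found " + rsp.getResults().getNumFound() + " documents");

            server.shutdown();
        }
    }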
In the initial load testing, I was able to achieve a projected indexing rate
of 10 billion documents per cloud per day, for a grand total of 40 billion
per day. However, that initial load testing was done on fairly empty clouds
with just a few small collections. Now that several days of documents have
been indexed, I am starting to see a fairly steep drop-off in indexing
performance once the clouds reach about 15 full collections (or about 80-100
billion documents per cloud) in the two biggest clouds. Based on current
application logging, I’m seeing a 40% drop-off in indexing throughput.
Because of this, I have concerns about how performance will hold up as more
collections are added.
My question to the community is whether anyone else has experience using
Solr at this scale (hundreds of billions of documents) and whether anyone
has observed such a decline in indexing performance as the number of
collections increases. My understanding is that each collection is a
separate index, and therefore the indexing rate should remain constant.
Aside from that, what other tweaks or changes can be made to the SolrCloud
configuration to increase indexing performance? Am I hitting a hard
limitation of what Solr can handle?
Thanks,
Scott