Several points:

1> Have you considered using the MapReduceIndexerTool for your ingestion?
Assuming you don't have duplicate IDs, i.e. each doc is new, you can spread
your indexing across as many nodes as you have in your cluster. That said,
since you already have as many shards as you do nodes, it's not entirely
clear how much extra throughput you'd gain.
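
A rough sketch of an invocation (the jar path, morphline config, ZooKeeper
quorum, and HDFS paths are all placeholders; check the options against your
Solr version):

  hadoop jar /path/to/solr-map-reduce-*.jar \
    --morphline-file readAvroContainer.conf \
    --zk-host zk1:2181,zk2:2181,zk3:2181/solr \
    --output-dir hdfs://namenode:8020/tmp/indexer-output \
    --collection docs_20140813 \
    --go-live \
    hdfs://namenode:8020/data/input

The --go-live step merges the indexes built offline in MapReduce into the
running cloud, so most of the indexing cost is paid outside Solr itself.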

2> Uhhhhm, fitting this many documents into 6G of memory is ambitious. Very
ambitious. Actually it's impossible. By my calculations:
bq: 4 separate and individual clouds of 32 shards each
so 128 shards in aggregate

bq:  inserting into these clouds per day is 5 Billion each in two clouds, 3
Billion into the third, and 2 Billion into the fourth
so we're talking 15B docs/day

bq: the plan is to keep up to 60 days...
So we're talking 900B documents (15B docs/day * 60 days).

It just won't work. 900B docs / 128 shards is over 7B documents per shard
on average. Your two busier clouds will be above that average, the two
smaller ones below. But it doesn't matter, because:
1: Lucene has a hard limit of roughly 2.1B docs per core (shard): the
maximum positive signed 32-bit int.
2: It ain't gonna fit in 6G of memory even without this limit, I'm pretty
sure.
3: I've rarely heard of a single shard coping with over 300M docs without
performance issues. I usually start getting nervous around 100M and insist
on stress testing. Of course, a lot depends on your query profile.

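Running that arithmetic the other way gives the shard counts you would
actually need:

  900B docs / 2.1B docs per shard (the hard limit)  =  ~430 shards minimum
  900B docs / 300M docs per shard (comfortable)     =  3,000 shards
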
So you're going to need a LOT more shards. You might be able to squeeze
some more capacity from your hardware by hosting multiple shards per
collection on each machine, but I'm pretty sure your present setup is
inadequate for your projected load.
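
If you do host multiple shards per node, the Collections API supports that
explicitly via maxShardsPerNode; something like this (collection name
illustrative):

  http://host:8983/solr/admin/collections?action=CREATE&name=docs_20140813
      &numShards=128&replicationFactor=1&maxShardsPerNode=4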

Of course, I may be hugely misinterpreting what you're saying, but as I
understand it, this system just won't work.

Best,
Erick

On Wed, Aug 13, 2014 at 2:39 PM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hi - You are running mapred jobs on the same nodes as Solr runs, right?
> The first thing I would think of is that your OS file buffer cache is
> being abused. The mappers read all the data, presumably residing on the
> same node. The mapper output and shuffling would also take place on the
> same node; only the reducer output is sent to your Solr nodes, which I
> assume are the same machines. Those same machines also host a large Lucene
> index. All this data, written to and read from the same disks, competes
> for a spot in the OS buffer cache.
>
> Forget it if I misread anything, but when you are dealing with data at
> this scale, you cannot afford to abuse your caches. Run separate mapred
> and Solr clusters, because they both eat cache space. I assume you can see
> serious IO WAIT times.
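>
> For instance, something like the following (plain sysstat tooling, nothing
> Solr-specific) will show whether the disks are the bottleneck:
>
>   iostat -x 5
>
> Watch %iowait in the avg-cpu line and %util per device; sustained %util
> near 100 means mapred and Solr are fighting over the same disks.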
>
> Split the stuff, and maybe even use smaller hardware, but more of it.
>
> M
>
> -----Original message-----
> > From:Wilburn, Scott <scott.wilb...@verizonwireless.com.INVALID>
> > Sent: Wednesday 13th August 2014 23:09
> > To: solr-user@lucene.apache.org
> > Subject: Solr cloud performance degradation with billions of documents
> >
> > Hello everyone,
> > I am trying to use SolrCloud to index a very large number of simple
> > documents. I have run into some performance and scalability limitations
> > and was wondering what can be done about them.
> >
> > Hardware-wise, I have a 32-node Hadoop cluster that I use to run all of
> > the Solr shards, and each node has 128GB of memory. The current
> > SolrCloud setup is split into 4 separate and individual clouds of 32
> > shards each, thereby giving four running shards per node, or one cloud
> > per eight nodes. Each shard is currently assigned a 6GB heap. I'd prefer
> > to avoid increasing heap memory for the Solr shards, so that enough
> > memory is left to run other MapReduce jobs on the cluster.
> >
> > The rate of documents that I am currently inserting into these clouds
> > per day is 5 Billion each into two of the clouds, 3 Billion into the
> > third, and 2 Billion into the fourth; however, to account for capacity,
> > the aim is to scale the solution to support double that number of
> > documents. To index these documents, MapReduce jobs generate the Solr
> > XML documents and then submit them via SolrJ's CloudSolrServer
> > interface. In testing, I have found that limiting the number of active
> > parallel inserts to 80 per cloud gave the best performance, as anything
> > higher gave diminishing returns, most likely due to the constant
> > shuffling of documents internally to SolrCloud. From an index
> > perspective, dated collections are created to hold an entire day's worth
> > of documents; inserting happens primarily on the current day (the
> > previous days' collections are kept only for searching), and the plan is
> > to keep up to 60 days (or collections) in each cloud. A single shard
> > index in one collection in the busiest cloud currently takes up 30G of
> > disk space, or 960G for the entire collection. The documents are
> > auto-committed with a hard commit time of 4 minutes (openSearcher =
> > false) and a soft commit time of 8 minutes.
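> >
> > For illustration, a minimal sketch of that indexing path (Solr 4.x
> > SolrJ; the ZooKeeper quorum, collection, and field names are
> > placeholders, and exception handling is omitted):
> >
> >   import java.util.ArrayList;
> >   import java.util.List;
> >   import org.apache.solr.client.solrj.impl.CloudSolrServer;
> >   import org.apache.solr.common.SolrInputDocument;
> >
> >   CloudSolrServer server =
> >       new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
> >   server.setDefaultCollection("docs_20140813");
> >   List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
> >   for (int i = 0; i < 1000; i++) {
> >       SolrInputDocument doc = new SolrInputDocument();
> >       doc.addField("id", "doc-" + i);
> >       doc.addField("text_s", "payload " + i);
> >       batch.add(doc);
> >   }
> >   server.add(batch);  // routed to the shard leaders by CloudSolrServer
> >
> > The corresponding commit settings in solrconfig.xml (milliseconds):
> >
> >   <autoCommit>
> >     <maxTime>240000</maxTime>            <!-- hard commit: 4 minutes -->
> >     <openSearcher>false</openSearcher>
> >   </autoCommit>
> >   <autoSoftCommit>
> >     <maxTime>480000</maxTime>            <!-- soft commit: 8 minutes -->
> >   </autoSoftCommit>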
> >
> > From a search perspective, the use case is fairly generic: simple
> > searches of the type field:value, so there is no need to tune the system
> > for any of the more advanced querying features. Therefore, the most
> > important thing for me is for indexing performance to keep up with the
> > rate of input.
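> >
> > A representative query, with a made-up field name, is nothing more than:
> >
> >   http://host:8983/solr/docs_20140813/select?q=event_id_s:12345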
> >
> > In the initial load testing, I was able to achieve a projected indexing
> > rate of 10 Billion documents per cloud per day, for a grand total of 40
> > Billion per day. However, that testing was done on fairly empty clouds
> > with just a few small collections. Now that there have been several days
> > of documents being indexed, I am starting to see a fairly steep drop-off
> > in indexing performance once the two biggest clouds reached about 15
> > full collections (or about 80-100 Billion documents per cloud). Based on
> > current application logging, I'm seeing a 40% drop-off in indexing
> > performance. Because of this, I have concerns about how performance will
> > hold up as more collections are added.
> >
> > My question to the community is whether anyone else has experience
> > using Solr at this scale (hundreds of Billions of documents) and whether
> > anyone has observed such a decline in indexing performance as the number
> > of collections increases. My understanding is that each collection is a
> > separate index, and that the indexing rate should therefore remain
> > constant. Aside from that, what other tweaks or changes can be made to
> > the SolrCloud configuration to increase indexing performance? Am I
> > hitting a hard limit of what Solr can handle?
> >
> > Thanks,
> > Scott
> >
> >
>
