Re: Solr cloud performance degradation with billions of documents

Jack Krupansky Thu, 14 Aug 2014 08:32:18 -0700

You're using the term "cloud" again. Maybe that's the cause of yourmisunderstanding - SolrCloud probably should have been named SolrClustersince that's what it really is, a cluster rather than a "cloud". The term"cloud" conjures up images of vast, unlimited numbers of nodes, thousands,tens of thousands of machines, but SolrCloud is much more modest than that.

Again, start with a model of 100 million documents on a fairly commodity box(say, 32GB as opposed to expensive 16-core 256GB machines). So, 1 billiondocs means 10 servers, times replication - I assume you want to serve ahealthy query load. So, 5 billion docs needs 50 servers, times replication.100 billion docs would require 1,000 servers. 500 billion documents wouldrequire 5,000 servers, times replication. Not quite Google class, but not atypical SolrCloud "cluster" either. You will have to test for yourselfwhether that 100 million number is achievable for your particular hardwareand data. Maybe you can double it... or maybe only half of that.

And, once again, make sure your index for each node fits in the OS systemmemory available for file caching.

I haven't heard of any specific experiences of SolrCloud beyond dozens ofnodes, but 64 nodes is probably a reasonable expectation for a SolrCloudcluster. How much bigger than that a SolrCloud cluster could grow isunknown. Whatever the actual practical limit, based on your own hardware,I/O, and network, and your own data schema and data patterns, which you willhave to test for yourself, you will probably need to use an applicationlayer to "shard" your 100s of billions to specific SolrCloud clusters.


-- Jack Krupansky

-----Original Message-----From: Wilburn, Scott

Sent: Thursday, August 14, 2014 11:05 AM
To: [email protected]
Subject: RE: Solr cloud performance degradation with billions of documents

Erick,

Thanks for your suggestion to look into MapReduceIndexerTool, I'm lookinginto that now. I agree what I am trying to do is a tall order, and the moreI hear from all of your comments, the more I am convinced that lack ofmemory is my biggest problem. I'm going to work on increasing the memorynow, but was wondering if there are any configuration or other techniquesthat could also increase ingest performance? Does anyone know if a cloud ofthis size( hundreds of billions ) with an ingest rate of 5 billion new eachday, has ever been attempted before?


Thanks,
Scott


-----Original Message-----
From: Erick Erickson [mailto:[email protected]]
Sent: Wednesday, August 13, 2014 4:48 PM
To: [email protected]
Subject: Re: Solr cloud performance degradation with billions of documents

Several points:

1> Have you considered using the MapReduceIndexerTool for your ingestion?

Assuming you don't have duplicate IDs, i.e. each doc is new, you can spreadyour indexing across as many nodes as you have in your cluster. That said,it's not entirely clear that you'll gain throughput since you have as manynodes as you do.


2> Uhhhhm, fitting this many documents into 6G of memory is ambitious.
2> Very
ambitious. Actually it's impossible. By my calculations:

bq: 4 separate and individual clouds of 32 shards each so 128 shards inaggregate

bq: inserting into these clouds per day is 5 Billion each in two clouds, 3Billion into the third, and 2 Billion into the fourth so we're talking 15Bdocs/day


bq: the plan is to keep up to 60 days...
So were talking 900B documents.

It just won't work. 900B/128 docs/shard is over 7B documents/shard onaverage. Your two larger collections will have more than that, the twosmaller ones less. But it doesn't matter because:

1: Lucene has a limit of 2B docs per core(shard), positive signed int.

2: It ain't gonna fit in 6G of memory even without this limit I'm prettysure.3: I've rarely heard of a single shard coping with over 300M docs withoutperformance issues. I usually start getting nervous around 100M and insiston stress testing. Of course it depends lots on your query profile.

So you're going to need a LOT more shards. You might be able to squeeze somemore from your hardware by hosting multiple shards on for each collection oneach machine, but I'm pretty sure your present setup is inadequate for yourprojected load.

Of course I may be misinterpreting what you're saying hugely, but from whatI understand this system just won't work.


Best,
Erick




On Wed, Aug 13, 2014 at 2:39 PM, Markus Jelsma <[email protected]>
wrote:

Hi - You are running mapred jobs on the same nodes as Solr runs right?

The first thing i would think of is that your OS file buffer cache isabused.

The mappers read all data, presumably residing on the same node. The
mapper output and shuffling part would take place on the same node,
only the reducer output is sent to your nodes, which i assume are on
the same machines. Those same machines have a large Lucene index. All
this data, written to and read from the same disk, competes for a nice
spot in the OS buffer cache.

Forget it if i misread anything, but when you're using serious figures
of size, then do not abuse your caches. Have a separate mapred and
Solr cluster, because they both eat cache space. I assume you can see
serious IO WAIT times.

Split the stuff and maybe even use smaller hardware, but more.

M

-----Original message-----
> From:Wilburn, Scott <[email protected]>
> Sent: Wednesday 13th August 2014 23:09
> To: [email protected]
> Subject: Solr cloud performance degradation with billions of
> documents
>
> Hello everyone,
> I am trying to use SolrCloud to index a very large number of simple
documents and have run into some performance and scalability
limitations and was wondering what can be done about it.
>
> Hardware wise, I have a 32-node Hadoop cluster that I use to run all
> of
the Solr shards and each node has 128GB of memory. The current
SolrCloud setup is split into 4 separate and individual clouds of 32

shards each thereby giving four running shards per cloud or one cloud pereight nodes.

Each shard is currently assigned a 6GB heap size. I’d prefer to avoid
increasing heap memory for Solr shards to have enough to run other
MapReduce jobs on the cluster.
>
> The rate of documents that I am currently inserting into these
> clouds
per day is 5 Billion each in two clouds, 3 Billion into the third, and
2 Billion into the fourth ; however to account for capacity, the aim
is to scale the solution to support double that amount of documents.
To index these documents, there are MapReduce jobs that run that
generate the Solr XML documents and will then submit these documents
via SolrJ's CloudSolrServer interface. In testing, I have found that
limiting the number of active parallel inserts to 80 per cloud gave
the best performance as anything higher gave diminishing returns, most
likely due to the constant shuffling of documents internally to
SolrCloud. From an index perspective, dated collections are being
created to hold an entire day's of documents and generally the
inserting happens primarily on the current day (the previous days are
only to allow for searching) and the plan is to keep up to 60 days (or
collections) in each cloud. A single shar  d index in one collection
in the busiest cloud currently takes up 30G disk space or 960G for the
entire collection. The documents are being auto committed with a hard

commit time of 4 minutes (opensearcher = false) and soft commit time of 8minutes.

>
> From a search perspective, the use case is fairly generic and simple
searches of the type :, so there is no need to tune the system to use
any of the more advanced querying features. Therefore, the most
important thing for me is to have the indexing performance be able to
keep up with the rate of input.
>
> In the initial load testing, I was able to achieve a projected
> indexing
rate of 10 Billion documents per cloud per day for a grand total of 40
Billion per day. However, the initial load testing was done on fairly
empty clouds with just a few small collections. Now that there have
been several days of documents being indexed, I am starting to see a
fairly steep drop-off in indexing performance once the clouds reached
about 15 full collections (or about 80-100 Billion documents per
cloud) in the two biggest clouds. Based on current application logging
I’m seeing a 40% drop off in indexing performance. Because of this, I
have concerns on how performance will hold as more collections are added.
>
> My question to the community is if anyone else has had any
> experience in
using Solr at this scale (hundreds of Billions) and if anyone has
observed such a decline in indexing performance as the number of
collections increases. My understanding is that each collection is a
separate index and therefore the inserting rate should remain
constant. Aside from that, what other tweaks or changes can be done in
the SolrCloud configuration to increase the rate of indexing
performance? Am I hitting a hard limitation of what Solr can handle?
>
> Thanks,
> Scott
>
>

Re: Solr cloud performance degradation with billions of documents

Reply via email to