What about SOLR-10619 and SOLR-10983? Of the two, 10619 is probably the more important in this respect. The way the Overseer consumed requests from the queue was very inefficient, and that may be a significant contributor to this problem. There are a couple of other JIRAs that center around not creating unnecessary messages in the first place, but 10619 is a major improvement.
Erick

On Tue, Oct 10, 2017 at 9:41 AM, Shawn Heisey <elyog...@elyograg.org> wrote:
> On 10/10/2017 9:11 AM, Erick Erickson wrote:
>>
>> Hmmm, that page is quite a bit out of date. I think Shawn is talking
>> about the "old style" Solr (4.x) that put all the state information
>> for all the collections in a single znode "clusterstate.json". Newer
>> style Solr puts each collection's state in
>> /collections/my_collection/state.json which has very significantly
>> reduced this issue.
>>
>> There are still some issues in the 5x code line where you can have a
>> ton of messages be processed by the "Overseer" at massive scales...
>>
>> However, I know of installations with several 100s of K (yes hundreds
>> of thousands) of replicas out there, split up amongst a _lot_ of
>> collections. That takes quite a bit of care and feeding, mind you.
>>
>> So your setup shouldn't be a problem, although I'd bring up my Solr
>> instances one at a time.
>>
>> Whether ZK is embedded or not isn't really a problem, but I would very
>> seriously consider moving it to an external ensemble. It's not so much
>> a functional issue as administrative. You have to be careful to bring
>> your Solr nodes up and down carefully or you lose quorum.
>
>
> The testing I did on SOLR-7191, which is where that statement came from, was
> mostly on 5.x with the per-collection clusterstate that was new at the time,
> and I still found that it would not scale well.
>
> Some later poking around with 6.x (long after SOLR-7191 was resolved with no
> commits) indicates that current versions scale even worse than early 5.x
> did. I believe the biggest source of the scalability problems is the fact
> that the overseer queue gets spammed with a very large number of operations
> that cannot be handled quickly.
>
> One collection with 200 shards probably would not present much of a
> scalability problem where ZK is concerned, but because a query on that
> collection will consist of between 201 and 401 smaller queries, I would not
> expect the single-query performance to be very good.
>
> Thanks,
> Shawn
>