On 6/29/2015 2:48 PM, Reitzel, Charles wrote:
> I take your point about shards and segments being different things. I
> understand that the hash ranges per segment are not kept in ZK. I guess
> I wish they were.
>
> In this regard, I liked how MongoDB uses a 2-level sharding scheme. Each
> shard manages a list of "chunks", each of which has its own hash range
> kept in the cluster state. If data needs to be balanced across nodes, it
> works at the chunk level, so no record/doc-level I/O is necessary. Much
> more targeted, and only the data that needs to move is touched. Solr
> does most things better than Mongo, imo, but this is one area where
> Mongo got it right.
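For reference, the routing information that SolrCloud does keep in ZK is
per shard, not per segment.  A trimmed-down clusterstate.json entry looks
roughly like this (collection name and values are just illustrative):

  {"collection1": {
     "shards": {
       "shard1": {
         "range": "80000000-ffffffff",
         "state": "active",
         "replicas": { ... }},
       "shard2": {
         "range": "0-7fffffff",
         "state": "active",
         "replicas": { ... }}}}}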
Tracking segment detail would not only lead to a data explosion in the
clusterstate, it would cross abstraction boundaries, and it could require
updating the clusterstate just because a single document was inserted into
the index: that one tiny update could (and probably would) create a new
segment on one shard.  Because of the way SolrCloud replicates data during
normal operation, every replica for a given shard might have a different
set of segments, which means segments would need to be tracked at the
replica level, not the shard level.

Also, Solr cannot control which hash ranges end up in each segment.  Solr
only knows about the index as a whole; implementation details like
segments are left entirely up to Lucene, and although I admit to not
knowing Lucene internals very well, I don't think Lucene offers any way to
control that either.

You mention that MongoDB dictates which hash ranges end up in each chunk,
which implies that MongoDB can control each chunk.  If we move the analogy
to Solr, it breaks down, because Solr cannot control segments.  Solr does
have several configuration knobs that affect how segments are created, but
those settings are simply passed through to Lucene; Solr itself does not
use that information.

Thanks,
Shawn
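P.S. To illustrate the pass-through point: the segment-related knobs live
in the <indexConfig> section of solrconfig.xml and look something like the
sketch below (the class and values are just an example).  Solr hands these
settings to Lucene's IndexWriter rather than acting on them itself:

  <indexConfig>
    <!-- how much data to buffer before Lucene flushes a new segment -->
    <ramBufferSizeMB>100</ramBufferSizeMB>
    <!-- merge policy is a Lucene class; Solr only passes the config along -->
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicy>
  </indexConfig>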