On Tue, Jun 30, 2015, at 04:42 PM, Shawn Heisey wrote:
> On 6/29/2015 2:48 PM, Reitzel, Charles wrote:
> > I take your point about shards and segments being different things.
> > I understand that the hash ranges per segment are not kept in ZK. I
> > guess I wish they were.
> >
> > In this regard, I liked that MongoDB uses a 2-level sharding scheme.
> > Each shard manages a list of "chunks", each of which has its own
> > hash range kept in the cluster state. If data needs to be balanced
> > across nodes, it works at the chunk level. No record/doc-level I/O
> > is necessary. Much more targeted, and only the data that needs to
> > move is touched. Solr does most things better than Mongo, imo, but
> > this is one area where Mongo got it right.
>
> Segment detail would not only lead to a data explosion in the
> clusterstate, it would be crossing abstraction boundaries, and would
> potentially require updating the clusterstate just because a single
> document was inserted into the index. That one tiny update could (and
> probably would) create a new segment on one shard. Due to the way
> SolrCloud replicates data during normal operation, every replica for
> a given shard might have a different set of segments, which means
> segments would need to be tracked at the replica level, not the shard
> level.
>
> Also, Solr cannot control which hash ranges end up in each segment.
> Solr only knows about the index as a whole ... implementation details
> like segments are left entirely up to Lucene, and although I admit to
> not knowing Lucene internals very well, I don't think Lucene offers
> any way to control that either. You mention that MongoDB dictates
> which hash ranges end up in each chunk. That implies that MongoDB can
> control each chunk. If we move the analogy to Solr, it breaks down
> because Solr cannot control segments. Although Solr does have several
> configuration knobs that affect how segments are created, those
> configurations are simply passed through to Lucene; Solr itself does
> not use that information.
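For reference, those per-shard hash ranges are all that the cluster
state holds -- roughly something like this trimmed clusterstate.json
sketch (field names from memory; the exact layout varies between
versions):

    "shards": {
      "shard1": {
        "range": "80000000-ffffffff",
        "state": "active",
        "replicas": {
          "core_node1": {
            "core": "collection1_shard1_replica1",
            "base_url": "http://hostA:8983/solr",
            "state": "active",
            "leader": "true"
          }
        }
      },
      "shard2": {
        "range": "0-7fffffff",
        "state": "active",
        "replicas": { ... }
      }
    }

There is nothing about segments in there, and there could not usefully
be: as Shawn says, each replica has its own set of segments, and that
set changes on every flush and merge.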
To put it more specifically - when a (hard) commit happens, all of the
documents in that commit are written into a new segment, so which
segment a document lands in has nothing to do with its hash range. A
segment can never be edited. When there are too many, segments are
merged into a new one and the originals are deleted. So there is no way
for Solr/Lucene to insert a document into anything other than a brand
new segment. Hence, the idea of using a second level of sharding at the
segment level does not fit with how a Lucene index is structured.

Upayavira
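P.S. If you want to see this in action, here is a rough sketch against
the plain Lucene (5.x) API -- untested, and the class, path and field
names are only for illustration:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SegmentPerCommit {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/segment-demo"));
        IndexWriter writer =
            new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        for (int i = 0; i < 3; i++) {
          Document doc = new Document();
          doc.add(new StringField("id", "doc-" + i, Field.Store.YES));
          writer.addDocument(doc);

          // A hard commit flushes the buffered documents into a brand
          // new segment; existing segments are never modified.
          writer.commit();

          // Read the segments_N metadata and report how many segments
          // the index now has.
          SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
          System.out.println("segments after commit " + i + ": " + infos.size());
        }
        writer.close();
      }
    }

The segment count grows with every commit and only shrinks when a merge
rewrites several old segments into one new one, which is exactly why a
segment cannot be pinned to a hash range: its contents are simply
whatever was indexed since the previous commit.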