On Tue, Jun 30, 2015, at 04:42 PM, Shawn Heisey wrote:
> On 6/29/2015 2:48 PM, Reitzel, Charles wrote:
> > I take your point about shards and segments being different things.  I 
> > understand that the hash ranges per segment are not kept in ZK.   I guess I 
> > wish they were.
> >
> > In this regard, I liked how MongoDB uses a 2-level sharding scheme.  Each 
> > shard manages a list of "chunks", each of which has its own hash range 
> > kept in the cluster state.  If data needs to be balanced across nodes, it 
> > works at the chunk level.  No record/doc-level I/O is necessary.  Much 
> > more targeted: only the data that needs to move is touched.  Solr does 
> > most things better than Mongo, imo, but this is one area where Mongo 
> > got it right.
> 
> Segment detail would not only lead to a data explosion in the
> clusterstate, it would be crossing abstraction boundaries, and would
> potentially require updating the clusterstate just because a single
> document was inserted into the index.  That one tiny update could (and
> probably would) create a new segment on one shard.  Due to the way
> SolrCloud replicates data during normal operation, every replica for a
> given shard might have a different set of segments, which means segments
> would need to be tracked at the replica level, not the shard level.
> 
> Also, Solr cannot control which hash ranges end up in each segment. 
> Solr only knows about the index as a whole ... implementation details
> like segments are left entirely up to Lucene, and although I admit to
> not knowing Lucene internals very well, I don't think Lucene offers any
> way to control that either.  You mention that MongoDB dictates which
> hash ranges end up in each chunk.  That implies that MongoDB can control
> each chunk.  If we move the analogy to Solr, it breaks down because Solr
> cannot control segments.  Although Solr does have several configuration
> knobs that affect how segments are created, those settings are
> simply passed through to Lucene; Solr itself does not use that
> information.
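The shard-level routing Shawn describes can be sketched roughly as follows. This is an illustrative model, not Solr's code: Solr actually hashes the compositeId route key with MurmurHash3, so `zlib.crc32` here is only a stand-in, and the two-shard clusterstate is hypothetical.

```python
# Illustrative sketch (not Solr's actual code): the clusterstate assigns
# each shard a slice of the signed 32-bit hash range, and routing a
# document consults only this small, stable table -- no per-segment
# metadata is needed or kept.
import zlib

# Hypothetical 2-shard clusterstate: the full signed 32-bit range split in half.
CLUSTERSTATE = {
    "shard1": (-2**31, -1),
    "shard2": (0, 2**31 - 1),
}

def route(doc_id: str) -> str:
    """Map a document id to the shard whose hash range contains it."""
    # crc32 folded into signed 32-bit space as a stand-in for MurmurHash3.
    h = zlib.crc32(doc_id.encode()) - 2**31
    for shard, (lo, hi) in CLUSTERSTATE.items():
        if lo <= h <= hi:
            return shard
    raise ValueError("hash outside known ranges")
```

Note that the table stays the same size no matter how many documents or segments exist, which is why keeping it in ZooKeeper is cheap at the shard level but would explode at the segment level.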

To put it more specifically: when a (hard) commit happens, all of the
documents in that commit are written into a new segment, so segment
boundaries have no relation to hash ranges. A segment can never be
edited. When there are too many, segments are merged into a new one and
the originals are deleted. So there is no way for Solr/Lucene to insert
a document into anything other than a brand-new segment.
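That lifecycle can be modeled with a toy sketch. This is illustrative only, not Lucene's API; `ToyIndex` and its methods are invented for the example.

```python
# Toy model (not Lucene) of the segment lifecycle described above:
# every hard commit flushes the buffered docs into a brand-new immutable
# segment, and a merge replaces several segments with one new one.

class ToyIndex:
    def __init__(self):
        self.buffer = []     # docs awaiting a hard commit
        self.segments = []   # each segment is an immutable tuple of docs

    def add(self, doc):
        self.buffer.append(doc)

    def commit(self):
        # All buffered docs go into one new segment, regardless of
        # which hash ranges they fall into.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def merge_all(self):
        # Merging writes one new segment and discards the originals;
        # existing segments are never edited in place.
        merged = tuple(d for seg in self.segments for d in seg)
        self.segments = [merged]

idx = ToyIndex()
idx.add("doc-a"); idx.add("doc-b"); idx.commit()   # first segment
idx.add("doc-c"); idx.commit()                     # second segment
assert len(idx.segments) == 2
idx.merge_all()
assert idx.segments == [("doc-a", "doc-b", "doc-c")]
```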

Hence, the idea of using a second level of sharding at the segment level
does not fit with how a Lucene index is structured.
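A small deterministic sketch makes the mismatch concrete: because a commit captures whatever documents happened to arrive since the last commit, each segment spans an arbitrary slice of the hash space, so per-segment hash ranges overlap and carry no routing value. The mod-16 "hash" below is a stand-in chosen so the numbers are easy to follow, not a real hash function.

```python
# Toy demonstration (not Lucene): docs arrive in insertion order, not
# hash order, and a commit every 3 docs freezes them into one segment.

def toy_hash(doc_id: int) -> int:
    # Trivial stand-in hash over a 16-slot hash space.
    return doc_id % 16

arrivals = [7, 18, 2, 33, 9, 12]
segments = [tuple(arrivals[i:i + 3]) for i in range(0, len(arrivals), 3)]

# Per-segment (min, max) hash ranges:
ranges = [(min(map(toy_hash, seg)), max(map(toy_hash, seg))) for seg in segments]
# Segment 1 holds hashes {7, 2, 2} -> range (2, 7)
# Segment 2 holds hashes {1, 9, 12} -> range (1, 12)
# The ranges overlap, so knowing a doc's hash tells you nothing about
# which segment holds it -- unlike MongoDB chunks, whose ranges are disjoint.
```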

Upayavira
