Mostly, a lot of other systems already offer these kinds of features, so
they were hard not to think about while building :) It's just been hard to
get back to them, even though many are fairly low-hanging fruit. Hardening
takes priority :(

- Mark

On Nov 19, 2013, at 12:42 PM, Timothy Potter <thelabd...@gmail.com> wrote:

> Your thinking is always one step ahead of me! I'll file the JIRA.
> 
> Thanks.
> Tim
> 
> 
> On Tue, Nov 19, 2013 at 10:38 AM, Mark Miller <markrmil...@gmail.com> wrote:
> 
>> Yeah, this is one of many little features that we just have not gotten
>> to yet. I've always planned for a param that lets you say how many
>> replicas an update must be verified on before responding success. It
>> seems to make sense to fail that type of request early if you notice
>> there are not enough replicas up to satisfy the param to begin with.
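>> 
>> To make that concrete, here is a rough sketch of what it might look
>> like from SolrJ. None of this exists yet; the "min_replicas" param name
>> and the count of 2 are made up purely for illustration:
>> 
>>   import org.apache.solr.client.solrj.impl.CloudSolrServer;
>>   import org.apache.solr.client.solrj.request.UpdateRequest;
>>   import org.apache.solr.common.SolrInputDocument;
>> 
>>   CloudSolrServer server = new CloudSolrServer("zkhost:2181");
>>   server.setDefaultCollection("collection1");
>> 
>>   SolrInputDocument doc = new SolrInputDocument();
>>   doc.addField("id", "1");
>> 
>>   // Hypothetical: only ack once the update is verified on at least
>>   // 2 replicas; fail fast if fewer than 2 are up to begin with.
>>   UpdateRequest req = new UpdateRequest();
>>   req.add(doc);
>>   req.setParam("min_replicas", "2");
>>   req.process(server);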
>> 
>> I don’t think there is a JIRA issue yet, fire away if you want.
>> 
>> - Mark
>> 
>> On Nov 19, 2013, at 12:14 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>> 
>>> I've been thinking about how SolrCloud deals with write-availability
>>> using in-sync replica sets, in which writes will continue to be
>>> accepted so long as there is at least one healthy node per shard.
>>> 
>>> For a little background (and to verify my understanding of the process
>>> is correct), SolrCloud only considers active/healthy replicas when
>>> acknowledging a write. Specifically, when a shard leader accepts an
>>> update request, it forwards the request to all active/healthy replicas
>>> and only considers the write successful if all of them ack the write.
>>> Any down/gone replicas are not considered and will sync up with the
>>> leader when they come back online, using peer sync or snapshot
>>> replication. For instance, if a shard has 3 nodes A, B, and C, with A
>>> being the current leader, then writes to the shard will continue to
>>> succeed even if B & C are down.
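>>> 
>>> In other words, the leader-side ack logic is roughly this (a simplified
>>> sketch of my understanding, not the actual DistributedUpdateProcessor
>>> code; Replica and its methods are just illustrative types here):
>>> 
>>>   boolean ackUpdate(UpdateRequest req, List<Replica> replicas) {
>>>     for (Replica r : replicas) {
>>>       // Down/recovering replicas are skipped entirely; they will
>>>       // catch up later via peer sync or snapshot replication.
>>>       if (!r.isActive()) continue;
>>>       // Any active replica failing to ack fails the whole write.
>>>       if (!r.forward(req)) return false;
>>>     }
>>>     // All active replicas acked -- possibly only the leader itself.
>>>     return true;
>>>   }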
>>> 
>>> The issue is that if a shard leader continues to accept updates even
>>> after it loses all of its replicas, then we have acknowledged updates
>>> on only 1 node. If that node, call it A, then fails and one of the
>>> previous replicas, call it B, comes back online before A does, then any
>>> writes that A accepted while the other replicas were offline are at
>>> risk of being lost.
>>> 
>>> SolrCloud does provide a safeguard for this problem with the
>>> leaderVoteWait setting, which puts any replicas that come back online
>>> before node A into a temporary wait state. If A comes back online
>>> within the wait period, then all is well, as it will become the leader
>>> again and no writes will be lost. As a side note, sysadmins definitely
>>> need to be made more aware of this situation; when I first encountered
>>> it in my cluster, I had no idea what it meant.
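>>> 
>>> For anyone else who hits this: leaderVoteWait is configured in
>>> solr.xml. With the newer solr.xml format it looks something like the
>>> following (value in ms; I believe the default is 180000, i.e. 3
>>> minutes):
>>> 
>>>   <solrcloud>
>>>     <int name="leaderVoteWait">180000</int>
>>>   </solrcloud>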
>>> 
>>> My question is whether we want to consider an approach where SolrCloud
>>> will not accept writes unless a majority of replicas are available to
>>> accept the write. For my example, under this approach, we wouldn't
>>> accept writes if both B & C failed, but we would if only C did, leaving
>>> A & B online. Admittedly, this lowers the write-availability of the
>>> system, so it may be something that should be tunable. Just wanted to
>>> put this out there as something I've been thinking about lately ...
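>>> 
>>> The fail-fast check itself would be cheap; something along these lines
>>> on the leader before it accepts an update (getReplicaCount and
>>> getActiveReplicaCount are made-up helpers for illustration):
>>> 
>>>   int total = shard.getReplicaCount();       // e.g. 3 for A, B, C
>>>   int live = shard.getActiveReplicaCount();  // includes the leader
>>>   // Majority quorum: for 3 replicas, require at least 2 live.
>>>   if (live < (total / 2) + 1) {
>>>     throw new SolrException(
>>>         SolrException.ErrorCode.SERVICE_UNAVAILABLE,
>>>         "majority of replicas are down; rejecting update");
>>>   }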
>>> 
>>> Cheers,
>>> Tim
>> 
>> 
