Mostly, a lot of other systems already offer these types of things, so they were hard not to think about while building :) It's just hard to get back to a lot of them, even though many are fairly low-hanging fruit. Hardening takes the priority :(
- Mark

On Nov 19, 2013, at 12:42 PM, Timothy Potter <thelabd...@gmail.com> wrote:

> Your thinking is always one step ahead of me! I'll file the JIRA.
>
> Thanks.
> Tim
>
>
> On Tue, Nov 19, 2013 at 10:38 AM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> Yeah, this is kind of like one of many little features that we have just
>> not gotten to yet. I've always planned for a param that lets you say how
>> many replicas an update must be verified on before responding success.
>> It seems to make sense to fail that type of request early if you notice
>> there are not enough replicas up to satisfy the param to begin with.
>>
>> I don't think there is a JIRA issue yet; fire away if you want.
>>
>> - Mark
>>
>> On Nov 19, 2013, at 12:14 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>
>>> I've been thinking about how SolrCloud deals with write-availability
>>> using in-sync replica sets, in which writes will continue to be accepted
>>> so long as there is at least one healthy node per shard.
>>>
>>> For a little background (and to verify my understanding of the process
>>> is correct), SolrCloud only considers active/healthy replicas when
>>> acknowledging a write. Specifically, when a shard leader accepts an
>>> update request, it forwards the request to all active/healthy replicas
>>> and only considers the write successful if all active/healthy replicas
>>> ack the write. Any down/gone replicas are not considered and will sync
>>> up with the leader when they come back online using peer sync or
>>> snapshot replication. For instance, if a shard has 3 nodes, A, B, C,
>>> with A being the current leader, then writes to the shard will continue
>>> to succeed even if B & C are down.
>>>
>>> The issue is that if a shard leader continues to accept updates even if
>>> it loses all of its replicas, then we have acknowledged updates on only
>>> 1 node. If that node, call it A, then fails and one of the previous
>>> replicas, call it B, comes back online before A does, then any writes
>>> that A accepted while the other replicas were offline are at risk of
>>> being lost.
>>>
>>> SolrCloud does provide a safeguard mechanism for this problem with the
>>> leaderVoteWait setting, which puts any replicas that come back online
>>> before node A into a temporary wait state. If A comes back online within
>>> the wait period, then all is well, as it will become the leader again
>>> and no writes will be lost. As a side note, sysadmins definitely need to
>>> be made more aware of this situation, as when I first encountered it in
>>> my cluster, I had no idea what it meant.
>>>
>>> My question is whether we want to consider an approach where SolrCloud
>>> will not accept writes unless there is a majority of replicas available
>>> to accept the write. For my example, under this approach, we wouldn't
>>> accept writes if both B & C failed, but would if only C did, leaving
>>> A & B online. Admittedly, this lowers the write-availability of the
>>> system, so it may be something that should be tunable. Just wanted to
>>> put this out there as something I've been thinking about lately ...
>>>
>>> Cheers,
>>> Tim
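
[Editor's note: for illustration, below is a minimal SolrJ sketch of how the param Mark describes might look from the client side. The parameter name "min_rf", the response-header key "rf", and the collection/ZooKeeper details are assumptions made for the sake of the example (the JIRA had not yet been filed at the time of this exchange), not a description of an existing API.]

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class MinReplicaWriteSketch {
      public static void main(String[] args) throws Exception {
        // Connect to the cluster via ZooKeeper (hosts are placeholders).
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);

        // Hypothetical param: require the update to be acknowledged by at
        // least 2 replicas (leader + 1) for it to count as fully replicated.
        req.setParam("min_rf", "2");

        UpdateResponse rsp = req.process(server);

        // Hypothetical response field: the replication factor actually
        // achieved for this update, echoed back in the response header so
        // the client can decide whether to retry, alert, or accept the risk.
        Object achieved = rsp.getResponseHeader().get("rf");
        if (achieved != null && Integer.parseInt(achieved.toString()) < 2) {
          System.err.println("Update only reached " + achieved + " replica(s)");
        }

        server.shutdown();
      }
    }

Whether the cluster should reject the write outright when too few replicas are up (Tim's majority-quorum idea) or accept it and report the achieved replication factor to the client (Mark's param idea) is exactly the tunable trade-off discussed above.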