Your thinking is always one step ahead of me! I'll file the JIRA.

Thanks.
Tim


On Tue, Nov 19, 2013 at 10:38 AM, Mark Miller <markrmil...@gmail.com> wrote:

> Yeah, this is kind of like one of many little features that we have just
> not gotten to yet. I’ve always planned for a param that lets you say how
> many replicas an update must be verified on before responding success.
> Seems to make sense to fail that type of request early if you notice there
> are not enough replicas up to satisfy the param to begin with.
>
> I don’t think there is a JIRA issue yet, fire away if you want.
>
> - Mark
>
> On Nov 19, 2013, at 12:14 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
> > I've been thinking about how SolrCloud deals with write-availability using
> > in-sync replica sets, in which writes will continue to be accepted so long
> > as there is at least one healthy node per shard.
> >
> > For a little background (and to verify my understanding of the process is
> > correct), SolrCloud only considers active/healthy replicas when
> > acknowledging a write. Specifically, when a shard leader accepts an update
> > request, it forwards the request to all active/healthy replicas and only
> > considers the write successful if all active/healthy replicas ack the
> > write. Any down / gone replicas are not considered and will sync up with
> > the leader when they come back online using peer sync or snapshot
> > replication. For instance, if a shard has 3 nodes, A, B, C with A being the
> > current leader, then writes to the shard will continue to succeed even if B
> > & C are down.
> >
> > The issue is that if a shard leader continues to accept updates even if it
> > loses all of its replicas, then we have acknowledged updates on only 1
> > node. If that node, call it A, then fails and one of the previous replicas,
> > call it B, comes back online before A does, then any writes that A accepted
> > while the other replicas were offline are at risk of being lost.
> >
> > SolrCloud does provide a safe-guard mechanism for this problem with the
> > leaderVoteWait setting, which puts any replicas that come back online
> > before node A into a temporary wait state. If A comes back online within
> > the wait period, then all is well as it will become the leader again and no
> > writes will be lost. As a side note, sys admins definitely need to be made
> > more aware of this situation as when I first encountered it in my cluster,
> > I had no idea what it meant.
> >
> > My question is whether we want to consider an approach where SolrCloud will
> > not accept writes unless there is a majority of replicas available to
> > accept the write? For my example, under this approach, we wouldn't accept
> > writes if both B&C failed, but would if only C did, leaving A & B online.
> > Admittedly, this lowers the write-availability of the system, so may be
> > something that should be tunable? Just wanted to put this out there as
> > something I've been thinking about lately ...
> >
> > Cheers,
> > Tim
>
>
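
For anyone following the thread, here is a rough sketch of the check being discussed: reject a write up front when too few replicas are active, with the threshold either a majority (Tim's proposal) or a tunable count (the param Mark describes). This is only an illustration under stated assumptions, not SolrCloud code. The names Replica, majority, can_accept_write and min_required are all hypothetical, and the leader is counted as one of the shard's replicas.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    """Hypothetical stand-in for one node hosting a copy of a shard."""
    name: str
    active: bool


def majority(total: int) -> int:
    """Smallest count that is more than half of `total`."""
    return total // 2 + 1


def can_accept_write(replicas: Sequence[Replica],
                     min_required: Optional[int] = None) -> bool:
    """Return True if enough replicas are active to accept a write.

    `min_required` is a hypothetical tunable. When omitted, a majority
    of all replicas in the shard (leader included) is required.
    """
    if min_required is None:
        min_required = majority(len(replicas))
    active = sum(1 for r in replicas if r.active)
    return active >= min_required


# The three-node example from the thread, with A as the leader.
only_a = [Replica("A", True), Replica("B", False), Replica("C", False)]
a_and_b = [Replica("A", True), Replica("B", True), Replica("C", False)]

print(can_accept_write(only_a))                  # majority rule: B and C down
print(can_accept_write(a_and_b))                 # majority rule: only C down
print(can_accept_write(only_a, min_required=1))  # one healthy node is enough
```

Passing min_required=1 corresponds to the behavior described in the original message (writes accepted as long as one healthy node remains), while the default majority rule trades some write-availability for a lower risk of losing acknowledged updates.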
