On Fri, Feb 23, 2018 at 12:49 AM, Paulo Motta <pauloricard...@gmail.com> wrote:
> > Is this a realistic case when Cassandra (unless I'm missing something) is
> > limited to adding or removing a single node at a time? I'm sure this
> > can happen under some sort of generic range movement of some
> > sort (how does one initiate such a movement, and why), but will it happen
> > under "normal" conditions of node bootstrap or decommission of a single
> > node?
>
> It's possible to make simultaneous range movements when either
> {{-Dcassandra.consistent.rangemovement=false}} (CASSANDRA-7069) or
> {{-Dcassandra.consistent.simultaneousmoves.allow=true}}
> (CASSANDRA-11005) are specified.
>
> In any case, I'm not saying it's not possible, just that we cannot
> apply this optimization when there are simultaneous range movements in
> the same rack.
>
> > How/when would we have two pending nodes for a single view partition?
>
> Actually I meant if there are multiple range movements going on in the
> same rack, not exactly in the same partition.

But the code we're discussing now, in mutateMV, isn't it about sending
just one mutation, of a single partition in the view table? So don't we
only care which node (just one? can it be more than one?) this
partition will move to?

> > Yes, it seems it will not be trivial. But if this is the common case in
> > common operations such as node addition or removal, it may significantly
> > reduce (from RF*2 to RF+1) the number of view updates being sent around,
> > and avoid MV update performance degradation during the streaming process.
>
> Agreed, we should definitely look into making this optimization, but
> it just was never done before due to other priorities, please open a
> ticket for it.

Ok, I will, though I'm not sure I understood all the caveats you
mentioned, so you may need to edit the ticket later to add them.

> There's a similar optimization that can be done for
> view batchlog replays - right now the view update is sent to all
> replicas during batchlog replay, but we could simplify it and also
> send only to the paired view replicas.
>
> > Is it actually possible to repair *only* a view, not its base table? If
> > you repair a view table which has an inconsistency, namely one view row in
> > one replica and a different view row in another replica, won't the repair
> > just cause both versions to be kept, which is wrong?
>
> ...
>
> When there are permanent inconsistencies though (when the base is
> consistent and the view has extraneous rows), it doesn't really matter
> if the inconsistency is present on a subset or all view replicas,
> since the inconsistency is already visible to clients. The only way to
> fix permanent inconsistencies currently is to drop and re-create the
> view. CASSANDRA-10346 was created to address this.

This is why I asked how the view repair you suggested in the release
notes would help in this case. I'm also worried that the fact that
*each* base replica sends to the pending node (rather than only the
paired replica) makes it more likely that we create these
inconsistencies: if two base replicas have different values, *both*
values will be sent to the pending view replica, creating an
inconsistency there that cannot be fixed.

> If you have more comments about CASSANDRA-14251 would you mind adding
> them to the ticket itself so the discussion is registered on the
> relevant JIRA?

I think most of the issues I raised later are not really part of
CASSANDRA-14251 but separate issues; I'll see which of them I can
express clearly enough to become new JIRA tickets, and submit them.
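P.S. To make the "from RF*2 to RF+1" arithmetic above concrete, here is a
tiny toy sketch. It is plain Java and not actual Cassandra code - the class
and method names are made up, and it only counts messages - comparing what I
understand happens today (every base replica also writes to the pending view
node) with the proposed optimization (only the paired base replica does):

    public class ViewUpdateFanoutSketch {

        // Today (as I understand mutateMV): each of the RF base replicas sends
        // its view update to its paired view replica *and* to every pending
        // view endpoint, so with one pending node that's RF * 2 messages.
        static int currentMessages(int rf, int pendingViewNodes) {
            return rf * (1 + pendingViewNodes);
        }

        // Proposed: only the base replica whose paired view range is actually
        // moving duplicates its update to the pending node; the other RF - 1
        // replicas send a single update each, so RF + 1 messages in total.
        static int proposedMessages(int rf, int pendingViewNodes) {
            return rf + pendingViewNodes;
        }

        public static void main(String[] args) {
            int rf = 3;
            System.out.printf("RF=%d, one pending node: current=%d, proposed=%d%n",
                    rf, currentMessages(rf, 1), proposedMessages(rf, 1));
            // prints: RF=3, one pending node: current=6, proposed=4
        }
    }

If I understood your caveat correctly, simultaneous range movements in the
same rack would correspond here to pendingViewNodes > 1, where it is no
longer obvious which base replica should be considered "paired" with the
moving range - which is presumably why the optimization can't be applied in
that case.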