My view is that this is a pretty serious bug. I wonder if transactional metadata will make it possible to safely fix this for users without rebuilding (only via opt-in, of course).
> On 7 Mar 2023, at 15:54, Miklosovic, Stefan <stefan.mikloso...@netapp.com> wrote:
>
> Thanks everybody for the feedback.
>
> I think that emitting a warning upon keyspace creation (and alteration) should be enough for starters. If somebody cannot live without a 100% bullet-proof solution, over time we might choose one of the approaches offered. As the saying goes, there is no silver bullet. If we decide to implement that new strategy, we would probably emit warnings on NTS anyway, but that would already be done, so just the new strategy would be provided.
>
> ________________________________________
> From: Paulo Motta <pauloricard...@gmail.com>
> Sent: Monday, March 6, 2023 17:48
> To: dev@cassandra.apache.org
> Subject: Re: Degradation of availability when using NTS and RF > number of racks
>
> It's a bit unfortunate that NTS does not maintain the ability to lose a rack without loss of quorum for RF > #racks > 2, since this can be easily achieved by evenly placing replicas across all racks.
>
> Since RackAwareTopologyStrategy is a superset of NetworkTopologyStrategy, can't we just use the new correct placement logic for newly created keyspaces instead of having a new strategy?
>
> The placement logic would be backwards-compatible for RF <= #racks. On upgrade, we could mark existing keyspaces with RF > #racks with use_legacy_replica_placement=true to maintain backwards compatibility and log a warning that the rack loss guarantee is not maintained for keyspaces created before the fix. Old keyspaces with RF <= #racks would still work with the new replica placement. The downside is that we would need to keep the old NTS logic around, or we could eventually deprecate it and require users to migrate keyspaces using the legacy placement strategy.
>
> Alternatively we could have RackAwareTopologyStrategy and fail NTS keyspace creation for RF > #racks, and direct users to use RackAwareTopologyStrategy to maintain the quorum guarantee on rack loss or to set an override flag "support_quorum_on_rack_loss=false". This feels a bit iffy though, since it could potentially confuse users about when to use each strategy.
>
> On Mon, Mar 6, 2023 at 5:51 AM Miklosovic, Stefan <stefan.mikloso...@netapp.com> wrote:
> Hi all,
>
> some time ago we identified an issue with NetworkTopologyStrategy. The problem is that when RF > number of racks, it may happen that NTS places replicas in such a way that when a whole rack is lost, we lose QUORUM and data are no longer available if QUORUM CL is used.
>
> To illustrate this problem, let's have this setup:
>
> 9 nodes in 1 DC, 3 racks, 3 nodes per rack. RF = 5. Then, NTS could place replicas like this: 3 replicas in rack1, 1 replica in rack2, 1 replica in rack3. Hence, when rack1 is lost, we do not have QUORUM.
>
> It seems to us that there is already some logic around this scenario (1) but the implementation is not entirely correct. That logic does not compute the replica placement correctly, so the above problem is not addressed.
>
> We created a draft here (2, 3) which fixes it.
>
> There is also a test which simulates this scenario. When I assign 256 tokens to each node randomly (by the same means as the generatetokens command uses), compute natural replicas for 1 billion random tokens, and count the cases where 3 replicas out of 5 end up in the same rack (so by losing it we would lose quorum), for the above setup I get around 6%.
>
> For 12 nodes, 3 racks, 4 nodes per rack, rf = 5, this happens in 10% of cases.
>
> To interpret this number, it basically means that with such a topology, RF and CL, when a random rack fails completely and a random read is performed, there is a 6% chance that the data will not be available (or 10%, respectively).
>
> One caveat here is that NTS is not compatible with this new strategy anymore because it will place replicas differently. So I guess that fixing this in NTS will not be possible because of upgrades. I think people would need to set up a completely new keyspace and somehow migrate data if they wish, or they would just start from scratch with this strategy.
>
> Questions:
>
> 1) do you think this is meaningful to fix and it might end up in trunk?
>
> 2) should we not just ban this scenario entirely? It might be possible to check the configuration upon keyspace creation (rf > num of racks) and if we see this is problematic we would just fail that query. Guardrail maybe?
>
> 3) people in the ticket mention writing a "CEP" for this but I do not see any reason to do so. It is just a strategy like any other. What would that CEP even be about? Is this necessary?
>
> Regards
>
> (1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java#L126-L128
> (2) https://github.com/apache/cassandra/pull/2191
> (3) https://issues.apache.org/jira/browse/CASSANDRA-16203
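
To make the even-placement idea concrete, here is a minimal, self-contained sketch. The class, method and record names are invented for illustration only; this is not the patch from PR #2191 nor Cassandra's actual NetworkTopologyStrategy. It simply walks the ring in token order and caps the number of replicas per rack at ceil(RF / #racks), which is the property that keeps QUORUM alive when a single rack is lost:

import java.util.*;

// Illustrative sketch only: a simplified, hypothetical even-rack placement.
public class EvenRackPlacementSketch
{
    record Node(String name, String rack) {}

    // ringOrder: nodes in token order, starting from the token being replicated.
    static List<Node> selectReplicas(List<Node> ringOrder, int rf)
    {
        Set<String> racks = new HashSet<>();
        for (Node n : ringOrder)
            racks.add(n.rack());

        // No rack may hold more than ceil(RF / #racks) replicas.
        int perRackCap = (int) Math.ceil((double) rf / racks.size());

        List<Node> replicas = new ArrayList<>();
        Map<String, Integer> perRack = new HashMap<>();
        for (Node n : ringOrder)
        {
            if (replicas.size() == rf)
                break;
            int used = perRack.getOrDefault(n.rack(), 0);
            if (used < perRackCap && !replicas.contains(n))
            {
                replicas.add(n);
                perRack.merge(n.rack(), 1, Integer::sum);
            }
        }
        return replicas;
    }

    public static void main(String[] args)
    {
        // 9 nodes, 3 racks, 3 nodes per rack, RF = 5 -- the example from the thread.
        // A ring order that naive placement could turn into 3 replicas in rack1:
        List<Node> ring = List.of(
            new Node("n1", "rack1"), new Node("n2", "rack1"), new Node("n3", "rack1"),
            new Node("n4", "rack2"), new Node("n5", "rack3"), new Node("n6", "rack2"),
            new Node("n7", "rack3"), new Node("n8", "rack2"), new Node("n9", "rack3"));

        // With the cap of ceil(5/3) = 2, this prints [n1, n2, n4, n5, n6]:
        // 2 replicas in rack1, 2 in rack2, 1 in rack3.
        System.out.println(selectReplicas(ring, 5));
    }
}

With RF = 5 across 3 racks the cap is 2, so the worst case after losing any single rack is 3 surviving replicas, i.e. QUORUM still holds. A real strategy would of course also have to handle racks with fewer nodes than the cap, multiple datacenters and transient replicas, which this sketch deliberately ignores.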