Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

2022-09-06 Thread Sam Tunnicliffe


> On 5 Sep 2022, at 22:02, Henrik Ingo  wrote:
> 
> Mostly I just wanted to ack that at least someone read the doc (somewhat 
> superficially sure, but some parts with thought...)
> 

Thanks, it's a lot to digest, so we appreciate that people are working through 
it. 
> One pre-feature that we would include in the preceding minor release is a 
> node level switch to disable all operations that modify cluster metadata 
> state. This would include schema changes as well as topology-altering events 
> like move, decommission or (gossip-based) bootstrap and would be activated on 
> all nodes for the duration of the major upgrade. If this switch were 
> accessible via internode messaging, activating it for an upgrade could be 
> automated. When an upgraded node starts up, it could send a request to 
> disable metadata changes to any peer still running the old version. This 
> would cost a few redundant messages, but simplify things operationally.
> Although this approach would necessitate an additional minor version upgrade, 
> this is not without precedent and we believe that the benefits outweigh the 
> costs of additional operational overhead.
> 
> Sounds like a great idea, and probably necessary in practice?
>  

Although I think we _could_ manage without this, it would certainly simplify 
this and future upgrades.
> If this part of the proposal is accepted, we could also include further 
> messaging protocol changes in the minor release, as these would largely 
> constitute additional verbs which would be implemented with no-op verb 
> handlers initially. This would simplify the major version code, as it would 
> not need to gate the sending of asynchronous replication messages on the 
> receiver's release version. During the migration, it may be useful to have a 
> way to directly inject gossip messages into the cluster, in case the states 
> of the yet-to-be upgraded nodes become inconsistent. This isn't intended, so 
> such a tool may never be required, but we have seen that gossip propagation 
> can be difficult to reason about at times.
> 
> Others will know the code better, and I understand that adding new no-op 
> verbs can be considered safe... but instinctively I'm a bit hesitant about 
> this one. Surely adding a few if statements to the upgraded version isn't 
> that big of a deal?
> 
> Also, it should make sense to minimize the dependencies from the previous 
> major version (without CEP-21) to the new major version (with CEP-21). If a 
> bug is found, it's much easier to fix code in the new major version than the 
> old and supposedly stable one.
> 

Yep, agreed. Adding verb handlers in advance may not buy us very much, so may 
not be worth the risk of additionally perturbing the stable system. I would say 
that having a means to directly manipulate gossip state during the upgrade 
would be a useful safety net in case something unforeseen occurs and we need to 
dig ourselves out of a hole. The precise scope of the feature & required 
changes are not something we've given extensive thought to yet, so we'd want to 
assess that carefully before proceeding.

> henrik
> 
> -- 
> Henrik Ingo



Re: [DISCUSS] CEP-23: Enhancement for Sparse Data Serialization

2022-09-06 Thread Josh McKenzie
> if that is standard for this project I will move the information there.
It is. I'd go to a CEP if you have something you think might be controversial 
(due to design, size, whatever) and you want to get early consensus on before 
going too deep on implementation.

I'm in favor of JIRA + DISCUSS (+benchmarks) as well fwiw.

On Tue, Sep 6, 2022, at 2:28 AM, Benedict wrote:
> I agree a Jira would suffice, and if visibility there required a DISCUSS 
> thread or simply a notice sent to the list.
> 
> While we’re here though, while I don’t have a lot of time to engage in 
> discussion it’s unclear to me what advantage this encoding scheme brings. It 
> might be worth outlining what algorithmic advantage you foresee for what data 
> distributions in which collection types.
> 
> > On 6 Sep 2022, at 07:16, Claude Warren via dev  
> > wrote:
> > 
> > I am just learning the ropes here so perhaps it is not CEP worthy.  That 
> > being said, it felt like there was a lot of information to put into and 
> > track in a ticket, particularly when I expected discussion about how best 
> > to encode, changes to the algorithms, etc.  It feels like it would be 
> > difficult to track. But if that is standard for this project I will move 
> > the information there.
> > 
> > As to the benchmarking, I had thought that usage and performance measures 
> > should be included.  Thank you for calling out queries that select a 
> > subset of the data as being of particular importance.
> > 
> > Claude
> > 
> >> On 06/09/2022 03:11, Abe Ratnofsky wrote:
> >> Looking at this link: 
> >> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization
> >> 
> >> Do you have any plans to include benchmarks in your test plan? It would be 
> >> useful to include disk usage / read performance / write performance 
> >> comparisons with the new encodings, particularly for sparse collections 
> >> where a subset of data is selected out of a collection.
> >> 
> >> I do wonder whether this is CEP-worthy. The CEP says that the changes will 
> >> not impact existing users, will be backwards compatible, and overall is an 
> >> efficiency improvement. The CEP guidelines say a CEP is encouraged “for 
> >> significant user-facing or changes that cut across multiple subsystems”. 
> >> Any reason why a Jira isn’t sufficient?
> >> 
> >> Abe
> >> 
> >>> On Sep 5, 2022, at 1:57 AM, Claude Warren via dev wrote:
> >>> 
> >>> I have just posted a CEP covering an Enhancement for Sparse Data 
> >>> Serialization.  This is in response to CASSANDRA-8959
> >>> 
> >>> I look forward to responses.
> >>> 
> >>> 
> 
> 


Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

2022-09-06 Thread Jacek Lewandowski
Hi Sam, this is a great idea and a really well described CEP!

I have some questions, perhaps they reflect my weak understanding, but
maybe you can answer:
Is it going to work so that each node reads the log individually and tries to
catch up, applying a transition locally once the previous change is confirmed
on a majority of the affected nodes? If so, will it be a replica group
explicitly associated with each event (explicitly mentioned nodes which are
affected by the change, and a list of those which have already applied the
change), so that each node can individually decide whether to move forward?
And if so, can a node skip a transformation which does not affect it and move
forward, thus making another change concurrently?


What if node failure(s) prevent progress over the log? For example, we are
unable to get a majority of nodes to process an event, so we cannot move
forward. We cannot remove those nodes though, because the removal will be
later in the log and we cannot make progress. I've read about manual
intervention, but maybe it can be avoided in some cases, for example by
adding no more than one pending event to the log?

For multistep actions - are they going to be added all or none? If they are
added one by one, can they be interleaved with other multistep actions?

> Reconfiguration itself occurs using the process that is analogous to
> "regular" bootstrap and also uses Paxos as a linearizability mechanism,
> except for there is no concept of "token" ownership in CMS; all CMS nodes
> own an entire range from MIN to MAX token. This means that during
> bootstrap, we do not have to split ranges, or have some nodes "lose" a part
> of the ring...


This sounds like an implementation of everywhere replication strategy,
doesn't it?


- - -- --- -  -
Jacek Lewandowski



Re: [DISCUSS] CEP-23: Enhancement for Sparse Data Serialization

2022-09-06 Thread Benedict
So, looking more closely at your proposal I realise what you are trying to do. 
The thing that threw me was your mention of lists and other collections. This 
will likely not work, as there is no index that can be defined on a list (or 
other collection) within a single sstable - a list is defined over the whole 
on-disk contents, so the index is undefined within a given sstable.

Tuple and UDT are encoded inefficiently if there are many null fields, but this 
is a very localised change, affecting just one class. You should take a look at 
Columns.Serializer for code you can lift for encoding and decoding sparse 
subsets of fields.

It might be that this can be switched on or off per sstable with a header flag 
bit so that there is no additional cost for datasets that would not benefit. 
Likely we can also migrate to vint encoding for the component sizes (and either 
1 or 0 bytes for fixed-width values), no doubt saving a lot of space over the 
status quo, even for small UDTs with few null entries.

Essentially at this point we’re talking about pushing through storage 
optimisations applied elsewhere to tuples and UDT, which is a very 
uncontroversial change.
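
For illustration only, the general shape of such an encoding might look like 
the sketch below: a presence bitmap for the non-null fields, followed by 
vint-encoded lengths and the field bytes. This is not the Columns.Serializer 
API or a proposed on-disk format, just a standalone example of the idea (it 
assumes at most 64 fields for brevity):

    import java.io.ByteArrayOutputStream;

    final class SparseTupleEncodingSketch
    {
        static byte[] encode(byte[][] fields)
        {
            assert fields.length <= 64;
            ByteArrayOutputStream out = new ByteArrayOutputStream();

            // presence bitmap: one bit per field, set only for non-null fields
            long presence = 0;
            for (int i = 0; i < fields.length; i++)
                if (fields[i] != null)
                    presence |= 1L << i;
            writeUnsignedVInt(out, presence);

            // only non-null fields are written: a vint length, then the raw bytes
            for (byte[] field : fields)
            {
                if (field == null)
                    continue;
                writeUnsignedVInt(out, field.length);
                out.write(field, 0, field.length);
            }
            return out.toByteArray();
        }

        // minimal LEB128-style unsigned vint, standing in for Cassandra's vint utilities
        static void writeUnsignedVInt(ByteArrayOutputStream out, long v)
        {
            while ((v & ~0x7FL) != 0)
            {
                out.write((int) ((v & 0x7F) | 0x80));
                v >>>= 7;
            }
            out.write((int) v);
        }
    }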




Re: [DISCUSS] LWT UPDATE semantics with + and - when null

2022-09-06 Thread David Capwell
Thanks all, going to merge those changes today!

> On Sep 2, 2022, at 5:47 AM, Josh McKenzie  wrote:
> 
> +1 to matching SQL. If we look at our population of users that are going to 
> run into this, my intuition is that more of them will be familiar with SQL 
> semantics than counters, so there's the angle where "the more consistent 
> option" here is to follow SQL convention.
> 
> On Wed, Aug 31, 2022, at 12:19 PM, Benjamin Lerer wrote:
>> Approach 2) is the one used by CQL operators. 
>> SELECT v + 1 FROM t WHERE pk = 1; will return null if the row exists but 
>> v is null.
>> 
>> On Wed, 31 Aug 2022 at 18:05, David Capwell wrote:
>> Sounds like matching SQL is the current preference; the current patch matches 
>> this, so I will leave this thread open a while longer before trying to merge 
>> the patch.
>> 
>>> On Aug 31, 2022, at 5:07 AM, Ekaterina Dimitrova wrote:
>>> 
>>> I am also +1 to match SQL, option 2. Also, I like Andres’ suggestion
>>> 
>>> On Wed, 31 Aug 2022 at 7:15, Claude Warren via dev <dev@cassandra.apache.org> wrote:
>>> I like this approach.  However, in light of some of the discussions on 
>>> views and the like, perhaps the function is (column value as returned by 
>>> select) + 42
>>> 
>>> So a null counter column becomes 0 before the update calculation is applied.
>>> 
>>> Then any null can be considered null unless addressed by IfNull(), or 
>>> zeroIfNull()
>>> 
>>> Any operation on null returns null.
>>> 
>>> I think this follows what would be expected by most users in most cases.
>>> 
>>> 
>>> 
>>> On 31/08/2022 11:55, Andrés de la Peña wrote:
 I think I'd prefer 2), the SQL behaviour. We could also get the 
 convenience of 3) by adding CQL functions such as "ifNull(column, 
 default)" or "zeroIfNull(column)", as it's done by other dbs. So we could 
 do things like "UPDATE ... SET name = zeroIfNull(name) + 42".
 
 On Wed, 31 Aug 2022 at 04:54, Caleb Rackliffe wrote:
 Also +1 on the SQL behavior here. I was uneasy w/ coercing to "" / 0 / 1 
 (depending on the type) in our previous discussion, but for some reason 
 didn't bring up the SQL analog :-|
 
 On Tue, Aug 30, 2022 at 5:38 PM Benedict wrote:
 I’m a bit torn here, as consistency with counters is important. But they 
 are a unique eventually consistent data type, and I am inclined to default 
 standard numeric types to behave as SQL does, since they write a new value 
 rather than a “delta” 
 
 It is far from optimal to have divergent behaviours, but also suboptimal 
 to diverge from relational algebra, and probably special casing counters 
 is the least bad outcome IMO.
 
 
> On 30 Aug 2022, at 22:52, David Capwell wrote:
> 
> 4.1 added the ability for LWT to support "UPDATE ... SET name = name + 
> 42", but we never really fleshed out with the larger community what the 
> semantics should be in the case where the column or row are NULL; I 
> opened up https://issues.apache.org/jira/browse/CASSANDRA-17857 for this issue. 
> 
> As I see it there are 3 possible outcomes:
> 1) fail the query
> 2) null + 42 = null (matches SQL)
> 3) null + 42 == 0 + 42 = 42 (matches counters)
> 
> In SQL you get NULL (option 2), but CQL counters treat NULL as 0 (option 
> 3) meaning we already do not match SQL (though counters are not a 
> standard SQL type so might not be applicable).  Personally I lean towards 
> option 3 as the "zero" for addition and subtraction is 0 (1 for 
> multiplication and division).
> 
> So looking for feedback so we can update in CASSANDRA-17857 before 4.1 
> release.
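
For reference, a minimal sketch of the three candidate behaviours for 
"SET v = v + 42" when v is currently null; this is illustrative Java only, not 
code from the actual patch:

    final class NullAdditionSemantics
    {
        enum Behaviour { FAIL, SQL_NULL, COUNTER_ZERO }

        // Models "v = v + delta" under each of the three options listed above.
        static Integer apply(Integer current, int delta, Behaviour behaviour)
        {
            if (current == null)
            {
                switch (behaviour)
                {
                    case FAIL:         throw new IllegalStateException("cannot apply '+' to a null column"); // option 1
                    case SQL_NULL:     return null;              // option 2: null + 42 = null (SQL)
                    case COUNTER_ZERO: return 0 + delta;         // option 3: treat null as 0 (counters)
                }
            }
            return current + delta;
        }
    }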



Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

2022-09-06 Thread Sam Tunnicliffe
Hi Jacek, 

Thanks for the great questions, they certainly all relate to things we 
considered, so I hope I can answer them in a coherent way!

>  will it be a replica group explicitly associated with each event

Some (but not all) events are explicitly associated with a replica group (or 
more accurately with the superset of present and future replica groups). Where 
this is the case, the acknowledgement of these events by a majority of the 
involved replicas is required for the "larger" process of which the event is 
part to make progress. This is not a global watermark though, which would halt 
progress across the cluster; it only affects the specific multistep operation 
it is part of. 

In such an operation (say a bootstrap or decommission), only one actor is 
actually in control of moving forward through the process, all the other nodes 
in the cluster simply apply the relevant metadata updates locally. It is the 
progress of this primary actor which is gated on the acknowledgments. In the 
bootstrap case, the joining node itself drives the process. 

The joining node will submit the first event to the CMS, which hopefully is 
accepted (because it would violate no invariants on the cluster metadata) and 
becomes committed. That joining node will then await notification from the CMS 
that a majority of the relevant peers have acked the event. Until it receives 
that, it will not submit the event representing the next step in the operation. 
By the same mechanism, it will not perform other aspects of its bootstrap until 
the preceding metadata change is acked (i.e. it won't initiate streaming until 
the step which adds it to the write groups - making it a pending node - is 
acked).
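
To make that sequencing concrete, here is a very rough sketch of how the 
joining node might drive such an operation; the Cms and Step types and their 
methods are purely illustrative placeholders, not APIs from the CEP or the 
codebase:

    import java.util.List;

    final class BootstrapDriver
    {
        interface Cms
        {
            long commit(Step step);                  // submit a step; returns its position in the log if accepted
            void awaitMajorityAck(long logPosition); // block until a majority of the affected replicas have acked
        }

        interface Step
        {
            void executeLocalWork();                 // e.g. start streaming once we have become a pending node
        }

        // The joining node submits one step at a time and only moves on once the
        // previous step has been acked by a majority of the relevant peers.
        static void runBootstrap(Cms cms, List<Step> plan)
        {
            for (Step step : plan)
            {
                long pos = cms.commit(step);   // rejected if it would violate an invariant on the metadata
                cms.awaitMajorityAck(pos);     // progress is gated on acks, but only for this operation
                step.executeLocalWork();       // local work (like streaming) starts only after the ack
            }
        }
    }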

Other metadata changes can certainly be occurring while this is going on. 
Another joining node may start submitting similar events, and as long as the 
operation is permitted, that process will progress concurrently. In order to 
ensure that these multistep operations are safe to execute concurrently, we 
reject submissions which affect ranges already being affected by an in-flight 
operation. Essentially, you can safely run concurrent bootstraps provided the 
nodes involved do not share replicated ranges.
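
A sketch of that guard, again with purely illustrative types (real token 
ranges wrap around the ring, which this simple interval check ignores):

    import java.util.List;

    final class InFlightGuard
    {
        // A simplified, non-wrapping token interval; Cassandra's actual Range type is more involved.
        record TokenInterval(long left, long right)
        {
            boolean intersects(TokenInterval other)
            {
                return left <= other.right && other.left <= right;
            }
        }

        // Reject a submission if any range it affects overlaps a range claimed by an in-flight operation.
        static boolean canSubmit(List<TokenInterval> affected, List<TokenInterval> inFlight)
        {
            for (TokenInterval a : affected)
                for (TokenInterval b : inFlight)
                    if (a.intersects(b))
                        return false;
            return true;
        }
    }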


> For multistep actions - are they going to be added all or none? If they are 
> added one by one, can they be interleaved with other multistep actions?

As you can see from the above, the steps are committed to the log one at a time 
and multistep operations can be interleaved. 

However, at the point of executing the first step, the plan for executing the 
rest of the steps is known. We persist this in-progress operation plan in the 
cluster metadata as an effect of executing the first step - note that this is 
different from actually committing the steps to the log itself; the pending 
steps do not yet have an order assigned (and may never get one). This 
persistence of in-progress operations is to enable an operation to be resumed 
if the node driving it were to restart part way through. 

> What if node failure(s) prevent progress over the log? For example, we are 
> unable to get a majority of nodes to process an event, so we cannot move 
> forward. We cannot remove those nodes though, because the removal will be 
> later in the log and we cannot make progress.


It's the specific operation that's unable to make progress, but other metadata 
updates can proceed. To make this concrete: you're trying to join a new node to 
the cluster, but are unable to because some affected replicas are down and so 
cannot acknowledge one of the steps. If the replicas are temporarily down, 
bringing them back up would be sufficient to resume the join. If they are 
permanently unavailable, in order to preserve consistency, you need to cancel 
the ongoing join, replace them and restart the join from scratch.

Cancelling an in-progress operation like a join is a matter of reverting the 
metadata changes made by any of the steps which have already been committed, 
including the persistence of the aforementioned pending steps. In the proposal, 
we've suggested an operator should be involved in this, but that would be 
something trivial like running a nodetool command to submit the necessary event 
to the log. It may be possible to automate that, but I would prefer to omit it 
initially, not least to keep the scope manageable. Of course, nothing would 
preclude an external monitoring system from running the nodetool command if 
it's trusted to accurately detect such failures. 
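
As a rough illustration of that cancellation idea (types and names here are 
placeholders, not anything from the CEP): revert, in reverse order, the 
metadata effects of whichever steps have already been committed, then discard 
the persisted plan for the remaining steps.

    import java.util.List;

    final class CancelInProgressOperation
    {
        interface CommittedStep
        {
            void revertMetadataEffects();   // undo this step's change to the cluster metadata
        }

        static void cancel(List<CommittedStep> committedSoFar)
        {
            // walk backwards so later effects (e.g. pending write placements) are undone first
            for (int i = committedSoFar.size() - 1; i >= 0; i--)
                committedSoFar.get(i).revertMetadataEffects();
            // the persisted plan of not-yet-committed steps would also be removed at this point
        }
    }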

> This sounds like an implementation of everywhere replication strategy, 
> doesn't it?


It does sound similar, but fear not, it isn't quite the same. The "everywhere" 
here is limited to the CMS nodes, which are only a small subset of the cluster. 
Essentially, it just means that all the (current) CMS members own and replicate 
the entire event log and so when a node joins the CMS it bootstraps the 
entirety of the current log state (note: this needn't be the

Re: Cassandra Token ownership split-brain (3.0.14)

2022-09-06 Thread Jaydeep Chovatia
If anyone has seen this issue and knows a fix, it would be a great help!
Thanks in advance.

Jaydeep

On Fri, Sep 2, 2022 at 1:56 PM Jaydeep Chovatia 
wrote:

> Hi,
>
> We are running a production Cassandra version (3.0.14) with a 256-token
> v-node configuration. Occasionally, we see that different nodes show
> different ownership for the same key. Only a node restart corrects it;
> otherwise, it continues to behave in a split-brain fashion.
>
> Say, for example,
>
> *NodeA*
> nodetool getendpoints ks1 table1 10
> - n1
> - n2
> - n3
>
> *NodeB*
> nodetool getendpoints ks1 table1 10
> - n1
> - n2
> *- n5*
>
> If I restart NodeB, then it shows the correct ownership {n1,n2,n3}. The
> majority of the nodes in the ring show correct ownership {n1,n2,n3}, only a
> few show this issue, and restarting them solves the problem.
>
> To me, it seems that Cassandra's Gossip cache and StorageService cache
> (TokenMetadata) are having some sort of cache-coherence issue.
>
> Has anyone observed this behavior?
> Any help would be highly appreciated.
>
> Jaydeep
>


New episode of The Apache Cassandra Corner(R)

2022-09-06 Thread Aaron Ploetz
Link to next episode:

Ep9 - Otavio Santana (Java Champion, open source dev)
https://drive.google.com/file/d/1NYk1zCyyHErkuyrJsFGannBx0NAYreea/view?usp=sharing

(You may have to download it to listen)

It will remain in staging for 72 hours, going live (assuming no objections)
by Monday, September 12th.

If anyone should have any questions, comments, or if you want to be a
guest, please reach out to me.

As for my guest pipeline, I do have a gap coming up.  So if you or someone
you know would be a great guest, please let me know!

Thanks, everyone!

Aaron Ploetz


Re: Cassandra Token ownership split-brain (3.0.14)

2022-09-06 Thread C. Scott Andreas
Hi Jaydeep,

Thanks for reaching out and for bumping this thread.

This is probably not the answer you’re after, but mentioning as it may address 
the issue.

C* 3.0.14 was released over five years ago, with many hundreds of important 
bug fixes landing since July 2017. These include fixes for issues that have 
affected gossip in the past which may be related to this issue. Note that 
3.0.14 also is susceptible to several critical data loss bugs including 
C-14513 and C-14515.

I’d strongly recommend upgrading to Cassandra 3.0.27 as a starting point. If 
this doesn’t resolve your issue, members of the community may be in a better 
position to help triage a bug report against a current release of the 
database.

- Scott



Re: Cassandra Token ownership split-brain (3.0.14)

2022-09-06 Thread Jaydeep Chovatia
Thanks Scott. I will prioritize upgrading to 3.0.27 and will circle back if
this issue persists.

Jaydeep

