Thanks for the feedback David, this is helpful.

> If you skip UPGRADING you ignore this slow rollout of v5 protocol. Where I think this matters is when we send LivenessInfo / Cell back to a coordinator…. Both encode as a relative time, so that feels safe (assuming you are making this change before the year 2038).

I think the version-gated rollout might still be needed to safely support TTLs beyond 2038 (e.g. a TTL of 10 years) during mixed mode, for example:

Node A: scm=CASSANDRA_4 or UPGRADING
Node B: scm=NONE

Since node B has scm=NONE, it is able to store a TTL beyond 2038, but it cannot safely accept one until node A is upgraded to NONE.

> As far as I can tell, the only real difference between UPGRADING and NONE is a volatile read while constructing a Cell while in SCM=UPGRADING.

I'm not sure we even need to maintain the expiration overflow check at the cell level, since it's already done at the coordinator when the update is received. I remember adding it back in CASSANDRA-14092 to prevent a non-upgraded node from receiving a write with a TTL beyond 2038 from a non-upgraded node before CASSANDRA-14092. After CASSANDRA-14227 this might no longer be needed, since it's impossible for a node in 4.x to send data with a TTL beyond 2038, so does this check become redundant at the cell level?

On Fri, Aug 22, 2025 at 12:02 PM David Capwell <dcapw...@apple.com> wrote:

> Looking over the code the single usage that differs between UPGRADING and
> NONE is
>
> org.apache.cassandra.db.rows.Cell#getVersionedMaxDeletiontionTime
>
> public static long getVersionedMaxDeletiontionTime()
> {
>     if (DatabaseDescriptor.getStorageCompatibilityMode().disabled())
>         // The whole cluster is 2016, we're out of the 2038/2106 mixed cluster scenario. Shortcut to avoid the 'minClusterVersion' volatile read
>         return Cell.MAX_DELETION_TIME;
>     else
>         return MessagingService.instance().versions.minClusterVersion >= MessagingService.VERSION_50
>                ? Cell.MAX_DELETION_TIME
>                : Cell.MAX_DELETION_TIME_2038_LEGACY_CAP;
> }
>
> So if you are in upgrading we allow each node to use the VERSION_50
> messaging version, so while this is being rolled out some nodes will be on
> v4 and others will be on v5. Once a local node has learned that all peers are
> at least on v5 then it acts the same as if you were on NONE.
>
> If you skip UPGRADING you ignore this slow rollout of v5 protocol. Where
> I think this matters is when we send LivenessInfo / Cell back to a
> coordinator…. Both encode as a relative time, so that feels safe (assuming
> you are making this change before the year 2038).
>
> As far as I can tell, the only real difference between UPGRADING and NONE
> is a volatile read while constructing a Cell while in SCM=UPGRADING. Given
> this, it does feel like we could simplify this to a single bounce and just
> ignore UPGRADING altogether?
>
> On Aug 22, 2025, at 7:51 AM, Paulo Motta <pa...@apache.org> wrote:
>
> Hi,
>
> I wanted to discuss the online upgrade procedure from 4.X to 5.x that
> increased the number of rolling restarts required from 1 to 3, making the
> upgrade procedure more cumbersome for operators.
>
> The main reason for this change as far as I understand is to support
> larger TTLs. To give some context, CASSANDRA-14092 capped the maximum TTL
> expiration date to 2038, which is the maximum deletionTime that can be
> represented in a signed integer (version -na-). CASSANDRA-14227 expanded
> the maximum expiration date to 2106 by updating the storage format to use
> an unsigned integer instead to represent deletionTime (version -nc-).
>
> In order to support seamless upgrade from 4.X (maxExpirationDate=2038) to
> 5.X (maxExpirationDate=2106), the upgrade procedure described in [1][2]
> suggests the following steps:
> 1) Rolling restart the cluster with
> storage_compatibility_mode=CASSANDRA_4. At this point,
> maxExpirationDate=2038.
> 2) Rolling restart the cluster with storage_compatibility_mode=UPGRADING.
> At this point, maxExpirationDate is 2038 before all nodes are upgraded, and
> maxExpirationDate=2106 after all nodes are deemed upgraded.
> 3) Rolling restart the cluster with storage_compatibility_mode=NONE. At
> this point, maxExpirationDate=2106.
>
> In my understanding users are encouraged to start in
> storage_compatibility_mode=CASSANDRA_4 for 2 reasons:
> A) Allow rollback to Cassandra 4 if something goes wrong during an
> upgrade, decoupling the binary upgrade from the storage version upgrade,
> allowing users to build confidence in the binary upgrade before doing the
> storage version upgrade, where higher TTLs are supported.
> B) During mixed mode, prevent a streaming or write operation with a higher
> TTL from being sent to a node in 4.0 which does not support this yet.
>
> When the node moves to storage_compatibility_mode=UPGRADING, the node's
> storage format changes to 5.0 format and a rollback to 4 is no longer
> possible, but it still prevents sending a higher TTL to a node which is
> already in 5.0 but still in storage_compatibility_mode=CASSANDRA_4.
>
> I'm uncertain about the requirement for the third rolling restart to bring
> the storage_compatibility_mode to NONE. The main reason given in [2] is:
>
> This eliminates the cost of checking node versions and ensures
> stability. If Cassandra was started at the previous version by accident, a
> node with disabled compatibility mode would no longer toggle behaviors as
> when it was running in the UPGRADING mode.
>
> I believe the cost of checking versions[3] is negligible and does not
> justify a third restart. Regarding the storage compatibility mode
> stability, I think we can address this by persisting the storage version in
> a system table to ensure that once a node goes to storage version 5 it can
> no longer switch back to 4.
>
> I think the upgrade instructions added by CASSANDRA-14227 conflated
> downgradability of storage with increase of maximum supported TTL, which may
> put an unnecessary burden on operators by requiring 3 restarts.
>
> I'd like to propose simplifying the upgrade instructions to the following:
> 1) If you'd like to be able to downgrade to 4.0 seamlessly, start with
> storage_compatibility_mode=CASSANDRA_4. Once you are confident with
> Cassandra 5.0, do a rolling restart with storage_compatibility_mode=NONE,
> two restarts needed - no UPGRADING step needed.
> 2) If you are starting on 5.0 or are confident with 5.0 storage format,
> start with storage_compatibility_mode=NONE, single restart needed, no
> downgrade supported.
>
> In order to support this, a new field storage_version would be added to
> the system_local table. When storage_compatibility_mode=NONE and all peers
> are in 5.0, this field would be populated with 5. Support for TTLs beyond
> 2038 would be gated on this field.
>
> Please let me know what you think and if you think it is worth pursuing
> this effort to simplify the upgrade to 5.x.
>
> Thanks,
>
> Paulo
>
> [1] - https://github.com/apache/cassandra/blob/cassandra-5.0/NEWS.txt#L15-L21
> [2] - https://github.com/apache/cassandra/blob/cassandra-5.0/conf/cassandra.yaml#L2275-L2281
> [3] - https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/rows/Cell.java#L97
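
To make the Node A / Node B mixed-mode example above concrete, here is a minimal, self-contained sketch of the gate, loosely modelled on the quoted getVersionedMaxDeletiontionTime. Everything in it is a placeholder for illustration rather than Cassandra's actual definitions: the class name, the Mode enum, the version numbers (only their ordering matters) and the two caps, which are simply the signed and unsigned 32-bit maxima in epoch seconds.

```java
import java.time.Instant;

// Illustrative sketch only: names and constants are placeholders, not Cassandra's real API.
final class MaxExpirationSketch
{
    enum Mode { CASSANDRA_4, UPGRADING, NONE }

    // Placeholder messaging versions; only their relative order is meaningful here.
    static final int VERSION_40 = 40;
    static final int VERSION_50 = 50;

    // Largest expiration (epoch seconds) that fits in a signed / unsigned 32-bit integer.
    static final long CAP_2038 = Integer.MAX_VALUE;
    static final long CAP_2106 = 0xFFFFFFFFL;

    // Mirrors the shape of the quoted method: with compatibility mode disabled (NONE)
    // the peers' versions are not consulted at all; otherwise the cap follows the
    // lowest messaging version the local node has seen in the cluster.
    static long maxDeletionTime(Mode localMode, int minClusterVersion)
    {
        if (localMode == Mode.NONE)
            return CAP_2106;
        return minClusterVersion >= VERSION_50 ? CAP_2106 : CAP_2038;
    }

    // True if the local node would take a cell with this expiration time.
    static boolean accepts(Mode localMode, int minClusterVersion, long expirationEpochSeconds)
    {
        return expirationEpochSeconds <= maxDeletionTime(localMode, minClusterVersion);
    }

    public static void main(String[] args)
    {
        long expiresIn2040 = Instant.parse("2040-01-01T00:00:00Z").getEpochSecond();

        // Node A is in UPGRADING and still sees a v4 peer; node B is already in NONE.
        System.out.println("A accepts: " + accepts(Mode.UPGRADING, VERSION_40, expiresIn2040));
        System.out.println("B accepts: " + accepts(Mode.NONE, VERSION_40, expiresIn2040));

        // The calendar dates behind the two caps.
        System.out.println(Instant.ofEpochSecond(CAP_2038));
        System.out.println(Instant.ofEpochSecond(CAP_2106));
    }
}
```

In this sketch node B would take an expiration in 2040 while node A would reject it, which is the mismatch a version-gated rollout is meant to avoid while the cluster is in mixed mode.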