Re: [DISCUSS] Online upgrade procedure to 5.x and storage_compatibility_mode

Paulo Motta Fri, 22 Aug 2025 10:13:56 -0700

Thanks for the feedback David, this is helpful.

> If you skip UPGRADING you ignore this slow rollout of v5 protocol.  Where
> I think this matters is when we send LivenessInfo / Cell back to a
> coordinator…. Both encode as a relative time, so that feels safe (assuming
> you are making this change before the year 2038).


I think the version-gated rollout might still be needed to safely support
TTLs beyond 2038 (e.g. a TTL of 10 years) during mixed mode, for example:
Node A: scm=CASSANDRA_4 or UPGRADING
Node B: scm=NONE

Since node B has scm=NONE, it is able to store TTLs beyond 2038, but it
cannot accept them until node A is upgraded to NONE.
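
To make the 10-year example concrete, here is a minimal sketch of the
arithmetic. The class and method names are hypothetical, and the two caps are
simply the largest signed and unsigned 32-bit values in epoch seconds, which
is an assumption for illustration and may not match the exact constants in
Cell.java:

```java
import java.time.Instant;

final class TtlCapSketch
{
    // Largest signed 32-bit seconds value (the "2038" limit). Illustrative only.
    static final long SIGNED_CAP = Integer.MAX_VALUE;
    // Largest unsigned 32-bit seconds value (the "2106" limit). Illustrative only.
    static final long UNSIGNED_CAP = 0xFFFFFFFFL;

    // Expiration in epoch seconds for a write at writeTime with the given TTL.
    static long expiration(Instant writeTime, long ttlSeconds)
    {
        return Math.addExact(writeTime.getEpochSecond(), ttlSeconds);
    }

    // True if the expiration can be represented under the given cap.
    static boolean fits(long expirationSeconds, long cap)
    {
        return expirationSeconds <= cap;
    }

    public static void main(String[] args)
    {
        long tenYears = 10L * 365 * 24 * 60 * 60;
        long exp = expiration(Instant.parse("2030-01-01T00:00:00Z"), tenYears);
        System.out.println(Instant.ofEpochSecond(SIGNED_CAP));   // 2038-01-19T03:14:07Z
        System.out.println(Instant.ofEpochSecond(UNSIGNED_CAP)); // 2106-02-07T06:28:15Z
        System.out.println(fits(exp, SIGNED_CAP));               // false
        System.out.println(fits(exp, UNSIGNED_CAP));             // true
    }
}
```

A write made in 2030 with a 10-year TTL expires after the signed cap, so a
node still bound by the 2038 limit could not represent it, while a node on the
unsigned format could.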

> As far as I can tell, the only real difference between UPGRADING and NONE
is a volatile read while constructing a Cell while in SCM=UPGRADING.

I'm not sure we even need to maintain the expiration overflow check at the
cell level, since it's already done at the coordinator when the update is
received.

I remember adding it back in CASSANDRA-14092 to prevent a non-upgraded node
from receiving a write with a TTL beyond 2038 from a non-upgraded node before
CASSANDRA-14092.

After CASSANDRA-14227 this might no longer be needed, since it's impossible
for a node in 4.x to send data with a TTL beyond 2038, so this check becomes
redundant at the cell level?
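
As a rough illustration of a coordinator-side check combined with the version
gate, here is a sketch with hypothetical names. It is not the actual Cassandra
code; the version numbers and caps are placeholders chosen for the example:

```java
final class CoordinatorTtlCheckSketch
{
    // Illustrative caps: largest signed and unsigned 32-bit epoch seconds.
    static final long CAP_2038 = Integer.MAX_VALUE;
    static final long CAP_2106 = 0xFFFFFFFFL;
    // Hypothetical stand-in for the 5.0 messaging version constant.
    static final int VERSION_50 = 50;

    // Picks the cap: the large one only if compatibility mode is disabled
    // or every peer is known to be on at least VERSION_50.
    static long maxExpiration(boolean compatibilityDisabled, int minClusterVersion)
    {
        if (compatibilityDisabled)
            return CAP_2106;
        return minClusterVersion >= VERSION_50 ? CAP_2106 : CAP_2038;
    }

    // Rejects the update once, where it enters the cluster.
    static void validateAtCoordinator(long nowSeconds, long ttlSeconds, long cap)
    {
        long expiration = Math.addExact(nowSeconds, ttlSeconds);
        if (expiration > cap)
            throw new IllegalArgumentException("expiration " + expiration + " exceeds cap " + cap);
    }

    public static void main(String[] args)
    {
        long now = 1893456000L;  // 2030-01-01T00:00:00Z
        long ttl = 315360000L;   // 10 years of 365 days
        validateAtCoordinator(now, ttl, maxExpiration(false, VERSION_50)); // accepted
        try
        {
            validateAtCoordinator(now, ttl, maxExpiration(false, 40));     // a peer is still older
        }
        catch (IllegalArgumentException e)
        {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

If every write passes through a check like this, a replica would only see an
over-cap expiration from a coordinator that applies a different cap, which is
the mixed-mode case in question.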

On Fri, Aug 22, 2025 at 12:02 PM David Capwell <dcapw...@apple.com> wrote:

> Looking over the code the single usage that differs between UPGRADING and
> NONE is
>
> org.apache.cassandra.db.rows.Cell#getVersionedMaxDeletiontionTime
>
> public static long getVersionedMaxDeletiontionTime()
> {
>     if (DatabaseDescriptor.getStorageCompatibilityMode().disabled())
>         // The whole cluster is 2016, we're out of the 2038/2106 mixed cluster scenario. Shortcut to avoid the 'minClusterVersion' volatile read
>         return Cell.MAX_DELETION_TIME;
>     else
>         return MessagingService.instance().versions.minClusterVersion >= MessagingService.VERSION_50
>                ? Cell.MAX_DELETION_TIME
>                : Cell.MAX_DELETION_TIME_2038_LEGACY_CAP;
> }
>
>
> So if you are in UPGRADING we allow each node to use the VERSION_50
> messaging version, so while this is being rolled out some nodes will be on
> v4 and others will be on v5.  Once a local node has learned that all peers
> are at least on v5, it acts the same as if it were on NONE.
>
> If you skip UPGRADING you ignore this slow rollout of v5 protocol.  Where
> I think this matters is when we send LivenessInfo / Cell back to a
> coordinator…. Both encode as a relative time, so that feels safe (assuming
> you are making this change before the year 2038).
>
> As far as I can tell, the only real difference between UPGRADING and NONE
> is a volatile read while constructing a Cell while in SCM=UPGRADING.  Given
> this, it does feel like we could simplify this to a single bounce and just
> ignore UPGRADING altogether?
>
> On Aug 22, 2025, at 7:51 AM, Paulo Motta <pa...@apache.org> wrote:
>
> Hi,
>
> I wanted to discuss the online upgrade procedure from 4.x to 5.x, which
> increased the number of rolling restarts required from 1 to 3, making the
> upgrade procedure more cumbersome for operators.
>
> The main reason for this change as far as I understand is to support
> larger TTLs. To give some context, CASSANDRA-14092 capped the maximum TTL
> expiration date to 2038 which is the maximum deletionTime that can be
> represented in a signed integer (version -na-). CASSANDRA-14227 expanded
> the maximum expiration date to 2106 by updating the storage format to use
> an unsigned integer instead to represent deletionTime (version -nc-).
>
> In order to support seamless upgrade from 4.X (maxExpirationDate=2038) to
> 5.X (maxExpirationDate=2106), the upgrade procedure described in [1][2]
> suggests the following steps:
> 1) Rolling restart the cluster with
> storage_compatibility_mode=CASSANDRA_4. At this point,
> maxExpirationDate=2038.
> 2) Rolling restart the cluster with storage_compatibility_mode=UPGRADING.
> At this point, maxExpirationDate is 2038 before all nodes are upgraded, and
> maxExpirationDate=2106 after all nodes are deemed upgraded.
> 3) Rolling restart the cluster with storage_compatibility_mode=NONE. At
> this point, maxExpirationDate=2106.
>
> In my understanding, users are encouraged to start in
> storage_compatibility_mode=CASSANDRA_4 for 2 reasons:
> A) Allow rollback to Cassandra 4 if something goes wrong during an
> upgrade, decoupling the binary upgrade from the storage version upgrade,
> allowing users to build confidence in the binary upgrade before doing the
> storage version upgrade, where higher TTLs are supported.
> B) During mixed mode, prevent a streaming or write operation with a higher
> TTL from being sent to a node in 4.0 which does not support this yet.
>
> When the node moves to storage_compatibility_mode=UPGRADING, the node's
> storage format changes to the 5.0 format and a rollback to 4 is no longer
> possible, but it still prevents sending a higher TTL to a node which is
> already on 5.0 but still in storage_compatibility_mode=CASSANDRA_4.
>
> I'm uncertain about the requirement for the third rolling restart to bring
> the storage_compatibility to NONE. The main reason given in [2] is:
> > This eliminates the cost of checking node versions and ensures
> > stability. If Cassandra was started at the previous version by accident, a
> > node with disabled compatibility mode would no longer toggle behaviors as
> > when it was running in the UPGRADING mode.
>
> I believe the cost of checking versions[3] is negligible and does not
> justify a third restart. Regarding the storage compatibility mode
> stability, I think we can address this by persisting the storage version in
> a system table to ensure that once a node goes to storage version 5 it can
> no longer switch back to 4.
>
> I think the upgrade instructions added by CASSANDRA-14227 conflated
> downgradability of storage with the increase of the maximum supported TTL,
> which may put an unnecessary burden on operators by requiring 3 restarts.
>
> I'd like to propose simplifying the upgrade instructions to the following:
> 1) If you'd like to be able to downgrade to 4.0 seamlessly, start with
> storage_compatibility_mode=CASSANDRA_4. Once you are confident with
> Cassandra 5.0, do a rolling restart with storage_compatibility_mode=NONE:
> two restarts needed, no UPGRADING step needed.
> 2) If you are starting on 5.0 or are confident with 5.0 storage format,
> start with storage_compatibility_mode=NONE, single restart needed, no
> downgrade supported.
>
> In order to support this, a new field storage_version would be added to
> the system_local table. When storage_compatibility_mode=NONE and all peers
> are in 5.0, this field would be populated with 5. Support for TTLs beyond
> 2038 would be gated on this field.
>
> Please let me know what you think and if you think it is worth pursuing
> this effort to simplify the upgrade to 5.x.
>
> Thanks,
>
> Paulo
>
> [1] -
> https://github.com/apache/cassandra/blob/cassandra-5.0/NEWS.txt#L15-L21
> [2] -
> https://github.com/apache/cassandra/blob/cassandra-5.0/conf/cassandra.yaml#L2275-L2281
> [3] -
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/rows/Cell.java#L97
>
>
>
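
The storage_version gate proposed in the original message above could be
sketched roughly as follows. This is only an illustration under stated
assumptions: the class, enum and method names are hypothetical, the field is
held in memory here rather than persisted in a system table, and refusing to
start in CASSANDRA_4 once the version is 5 is one possible reading of "can no
longer switch back to 4":

```java
final class StorageVersionSketch
{
    enum Mode { CASSANDRA_4, UPGRADING, NONE }

    // Stand-in for the proposed storage_version field; a real node would persist it.
    private int storageVersion = 4;

    // Called at startup with the configured mode and what is known about peers.
    void onStartup(Mode configured, boolean allPeersOn5)
    {
        if (storageVersion >= 5 && configured == Mode.CASSANDRA_4)
            throw new IllegalStateException("storage_version is already 5; cannot go back to CASSANDRA_4");
        if (configured == Mode.NONE && allPeersOn5)
            storageVersion = 5; // one-way transition, never reset to 4
    }

    // TTLs beyond 2038 are only allowed once the persisted version has advanced.
    boolean allowsTtlBeyond2038()
    {
        return storageVersion >= 5;
    }

    public static void main(String[] args)
    {
        StorageVersionSketch node = new StorageVersionSketch();
        node.onStartup(Mode.NONE, false);
        System.out.println(node.allowsTtlBeyond2038()); // false
        node.onStartup(Mode.NONE, true);
        System.out.println(node.allowsTtlBeyond2038()); // true
    }
}
```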
