Hi all,

I have a patch ready fixing this bug on Cassandra 4.1 (
https://github.com/apache/cassandra/pull/4746). CI can be found here (
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch-before-5/2734/);
I've confirmed that all the unit tests pass locally, and that the dtests
that failed were flaky and unrelated to the changes in this patch.

I would appreciate a review on this before I make similar patches for 4.0,
5.0, and trunk. Thank you in advance for your time!

Best,
Andrés

On Fri, Apr 10, 2026 at 3:52 AM Štefan Miklošovič <[email protected]>
wrote:

> On Fri, Apr 10, 2026 at 9:37 AM Runtian Liu <[email protected]> wrote:
> >
> > Hi Andrés, Isaac,
> >
> > Thank you for the detailed write-up, Andrés. Your investigation into the
> > FastBuilder.reset() bug was the starting point for our own analysis, which
> > led us to identify an additional impact beyond the ClassCastException.
> >
> > Isaac — yes, we believe CASSANDRA-21260 and CASSANDRA-21216 are directly
> > related. CASSANDRA-21260 was filed by our team to track the SSTable header
> > contamination we've been seeing. Based on Andrés' findings about the stale
> > savedBuffer/savedNextKey in FastBuilder.reset(), we investigated whether
> > the same bug could explain our corrupted SSTable headers — and we believe
> > it does.
> >
> > What we observed (CASSANDRA-21260)
> >
> > We have been seeing corrupted SSTable headers where an SSTable for one
> > table contains column metadata belonging to a completely different table.
> > When we deserialize the on-disk SerializationHeader.Component and compare
> > it against the table's TableMetadata, we find column names that are not
> > part of the table's schema — they belong to another table in the same
> > keyspace. In one case, a table with ~2000 columns had 29 foreign columns
> > from a ~150-column table embedded in its SSTable header.
> >
> > These corrupted SSTables are otherwise structurally valid — they are
> > accepted into the live set and only detected by explicit header validation
> > we added. The foreign columns do not correspond to dropped columns or any
> > prior schema version of the affected table. As noted in CASSANDRA-21260,
> > once a corrupted SSTable exists, compaction merges headers blindly, so the
> > contamination propagates to new SSTables indefinitely.
>
> I do not want to cause scope creep here, but there is also
> CASSANDRA-21000, which will keep deleted columns in there forever. I do
> not see any issue with fixing it (details in the ticket), but at the
> same time I cannot say with 100% certainty that it will not have side
> effects we did not anticipate.
>
> However, if we change the logic / do some fixes around
> SerializationHeader, I think it would be great to think about the
> inclusion of this ticket as well / to have it in mind.
>
> > How the FastBuilder bug (CASSANDRA-21216) causes this
> >
> > Building on Andrés' analysis of the FastBuilder state leakage, we traced
> > a path from the stale savedBuffer/savedNextKey all the way to on-disk
> > SSTable header contamination:
> >
> > 1. A schema disagreement (e.g. during column addition) causes an
> > internode READ_REQ deserialization to fail on a replica.
> > Columns.Serializer.deserialize() uses a thread-local pooled FastBuilder,
> > and if the table has more than 31 columns, the overflow populates
> > savedBuffer and savedNextKey before the exception. Since reset() does not
> > clear these fields, the FastBuilder is returned to the pool with stale
> > ColumnMetadata from the source table.
> >
> > 2. When a deletion-only mutation (partition delete or range tombstone)
> > for a different table is later deserialized on the same thread,
> > Columns.Serializer.deserialize() acquires the poisoned FastBuilder. The
> > stale ColumnMetadata from the source table are drained into the victim
> > table's Columns via propagateOverflow(). Because the mutation contains only
> > a deletion — no rows, no static row — no per-row column-subset
> > deserialization occurs, so the contaminated Columns survives without error.
> > (Mutations with actual row data would fail due to subset encoding
> > mismatches, which is why only deletion-only mutations propagate the
> > contamination silently.)
> >
> > When the contaminated PartitionUpdate is applied to the memtable,
> > ColumnsCollector.update() records the foreign ColumnMetadata. At flush,
> > BigTableWriter.openFinal() writes the SSTable using the in-memory
> > SerializationHeader directly, bypassing toHeader() validation. The result
> > is an on-disk SSTable whose header contains columns from the wrong table.
> >
> > This also affects small messages on the Netty event loop
> >
> > Andrés, your investigation focused on wide tables where messages exceed
> > the ~64KB large-message threshold and are deserialized on SEPWorker
> > threads. We found that the same contamination also occurs with small
> > messages deserialized on the Netty event loop.
> >
> > For messages under 64KB, processSmallMessage() deserializes the payload
> > inline on the event loop thread, which has its own
> > TinyThreadLocalPool<FastBuilder>. Since Netty binds each channel to a
> > single EventLoop, messages from the same peer are handled by the same
> > thread — making thread reuse virtually guaranteed rather than probabilistic.
> >
> > This lowers the trigger threshold significantly: the source table only
> > needs more than 31 columns (for FastBuilder overflow) rather than the ~4200
> > needed to exceed the large-message threshold. In our case, a 150-column
> > table was the contamination source. The 29 foreign columns we observed are
> > consistent with the 31 + 1 items retained in savedBuffer/savedNextKey,
> > minus a few consumed as internal BTree node keys during build().
> >
> > Summary
> >
> > We strongly support the proposed fix to clear savedBuffer and
> > savedNextKey in FastBuilder.reset(). Beyond the ClassCastException that
> > Andrés identified, the same bug can cause the silent SSTable header
> > contamination tracked in CASSANDRA-21260. We have written JVM dtests
> > reproducing both the large-message and small-message contamination paths
> > and are happy to share them.
> >
> > Best regards
> > Runtian
>
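For readers skimming the thread, here is a minimal toy sketch of the failure mode described in the quoted steps: a pooled builder whose reset() forgets its overflow state, so a later, unrelated use inherits entries from an earlier failed one. This is not Cassandra's code. ToyBuilder, savedOverflow, simulate() and the column names are all hypothetical; only the 31-entry capacity is taken from the discussion above, and the real FastBuilder/BTree logic is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StaleBuilderSketch {

    /** Toy pooled builder; a hypothetical stand-in, NOT Cassandra's FastBuilder. */
    static final class ToyBuilder {
        // Assumption: mirrors the "more than 31 columns" overflow threshold from the thread.
        static final int LEAF_CAPACITY = 31;

        private final String[] leaf = new String[LEAF_CAPACITY];
        private int count;
        // Loosely analogous to savedBuffer/savedNextKey: state that outlives the leaf.
        private final List<String> savedOverflow = new ArrayList<>();
        private final boolean clearOverflowOnReset;

        ToyBuilder(boolean clearOverflowOnReset) {
            this.clearOverflowOnReset = clearOverflowOnReset;
        }

        void add(String column) {
            if (count == LEAF_CAPACITY) {
                // Leaf is full: spill it into the overflow area and start a new leaf.
                savedOverflow.addAll(Arrays.asList(leaf));
                count = 0;
            }
            leaf[count++] = column;
        }

        /** Drains overflow plus the current leaf, then resets for reuse. */
        List<String> build() {
            List<String> out = new ArrayList<>(savedOverflow);
            out.addAll(Arrays.asList(leaf).subList(0, count));
            savedOverflow.clear();
            reset();
            return out;
        }

        /** Called when the builder goes back to the pool, including after a failure. */
        void reset() {
            count = 0;
            Arrays.fill(leaf, null);
            if (clearOverflowOnReset)
                savedOverflow.clear(); // the kind of cleanup the proposed fix adds
        }
    }

    /** Simulates a failed wide-table deserialization followed by reuse for another table. */
    static List<String> simulate(boolean clearOverflowOnReset) {
        ToyBuilder pooled = new ToyBuilder(clearOverflowOnReset);
        for (int i = 0; i < 40; i++)
            pooled.add("src_" + i);   // source table with more than 31 columns
        pooled.reset();               // deserialization failed; builder returns to the pool
        pooled.add("victim_a");       // same builder reused for a different table
        pooled.add("victim_b");
        return pooled.build();
    }

    public static void main(String[] args) {
        System.out.println("reset() keeps overflow:  " + simulate(false));
        System.out.println("reset() clears overflow: " + simulate(true));
    }
}
```

With the flag set to false, the victim's result also carries the source table's spilled entries; with it set to true, only the victim's own two columns come back. The sketch only illustrates why clearing the saved state in reset() matters, not how the actual patch is written.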
