Hi all,

I have a patch ready that fixes this bug on Cassandra 4.1 (https://github.com/apache/cassandra/pull/4746). CI results can be found here (https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch-before-5/2734/). I've confirmed that all the unit tests pass locally, and that the dtests that failed were flaky and unrelated to the changes in this patch.

I would appreciate a review on this before I make a similar patch for 4.0, 5.0, and trunk.

Thank you in advance for your time!

Best,
Andrés

On Fri, Apr 10, 2026 at 3:52 AM Štefan Miklošovič wrote:
> On Fri, Apr 10, 2026 at 9:37 AM Runtian Liu wrote:
> >
> > Hi Andrés, Isaac,
> >
> > Thank you for the detailed write-up, Andrés. Your investigation into the FastBuilder.reset() bug was the starting point for our own analysis, which led us to identify an additional impact beyond the ClassCastException.
> >
> > Isaac — yes, we believe CASSANDRA-21260 and CASSANDRA-21216 are directly related. CASSANDRA-21260 was filed by our team to track the SSTable header contamination we've been seeing. Based on Andrés' findings about the stale savedBuffer/savedNextKey in FastBuilder.reset(), we investigated whether the same bug could explain our corrupted SSTable headers — and we believe it does.
> >
> > What we observed (CASSANDRA-21260)
> >
> > We have been seeing corrupted SSTable headers where an SSTable for one table contains column metadata belonging to a completely different table. When we deserialize the on-disk SerializationHeader.Component and compare it against the table's TableMetadata, we find column names that are not part of the table's schema — they belong to another table in the same keyspace. In one case, a table with ~2000 columns had 29 foreign columns from a ~150-column table embedded in its SSTable header.
> >
> > These corrupted SSTables are otherwise structurally valid — they are accepted into the live set and only detected by explicit header validation we added. The foreign columns do not correspond to dropped columns or any prior schema version of the affected table. As noted in CASSANDRA-21260, once a corrupted SSTable exists, compaction merges headers blindly, so the contamination propagates to new SSTables indefinitely.
>
> I do not want to be a scope creep here, but there is also CASSANDRA-21000 which will keep deleted columns in there forever. I do not see any issue with fixing it (details in the ticket), but at the same time I can not say 100% that it will not have any side-effects we did not count on.
>
> However, if we change the logic / do some fixes around SerializationHeader, I think it would be great to think about the inclusion of this ticket as well / to have it in mind.
>
> > How the FastBuilder bug (CASSANDRA-21216) causes this
> >
> > Building on Andrés' analysis of the FastBuilder state leakage, we traced a path from the stale savedBuffer/savedNextKey all the way to on-disk SSTable header contamination:
> >
> > 1. A schema disagreement (e.g. during column addition) causes an internode READ_REQ deserialization to fail on a replica. Columns.Serializer.deserialize() uses a thread-local pooled FastBuilder, and if the table has more than 31 columns, the overflow populates savedBuffer and savedNextKey before the exception. Since reset() does not clear these fields, the FastBuilder is returned to the pool with stale ColumnMetadata from the source table.
> >
> > 2. When a deletion-only mutation (partition delete or range tombstone) for a different table is later deserialized on the same thread, Columns.Serializer.deserialize() acquires the poisoned FastBuilder. The stale ColumnMetadata from the source table are drained into the victim table's Columns via propagateOverflow(). Because the mutation contains only a deletion — no rows, no static row — no per-row column-subset deserialization occurs, so the contaminated Columns survives without error. (Mutations with actual row data would fail due to subset encoding mismatches, which is why only deletion-only mutations propagate the contamination silently.)
> >
> > When the contaminated PartitionUpdate is applied to the memtable, ColumnsCollector.update() records the foreign ColumnMetadata. At flush, BigTableWriter.openFinal() writes the SSTable using the in-memory SerializationHeader directly, bypassing toHeader() validation. The result is an on-disk SSTable whose header contains columns from the wrong table.
> >
> > This also affects small messages on the Netty event loop
> >
> > Andrés, your investigation focused on wide tables where messages exceed the ~64KB large-message threshold and are deserialized on SEPWorker threads. We found that the same contamination also occurs with small messages deserialized on the Netty event loop.
> >
> > For messages under 64KB, processSmallMessage() deserializes the payload inline on the event loop thread, which has its own TinyThreadLocalPool<FastBuilder>. Since Netty binds each channel to a single EventLoop, messages from the same peer are handled by the same thread — making thread reuse virtually guaranteed rather than probabilistic.
> >
> > This lowers the trigger threshold significantly: the source table only needs more than 31 columns (for FastBuilder overflow) rather than the ~4200 needed to exceed the large-message threshold. In our case, a 150-column table was the contamination source. The 29 foreign columns we observed are consistent with the 31 + 1 items retained in savedBuffer/savedNextKey, minus a few consumed as internal BTree node keys during build().
> >
> > Summary
> >
> > We strongly support the proposed fix to clear savedBuffer and savedNextKey in FastBuilder.reset(). Beyond the ClassCastException that Andrés identified, the same bug can cause the silent SSTable header contamination tracked in CASSANDRA-21260. We have written JVM dtests reproducing both the large-message and small-message contamination paths and are happy to share them.
> >
> > Best regards
> > Runtian
>
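
For readers following the thread, the leak described in steps 1 and 2 of the quoted message can be illustrated with a small standalone sketch. This is not Cassandra code: SketchBuilder, its flat 31-slot leaf, and the list used as overflow state are hypothetical simplifications of the real BTree-based FastBuilder, and the only point is to show how a reset() that skips the overflow state lets one table's columns reach the next user of a pooled builder.

```java
import java.util.ArrayList;
import java.util.List;

public class PooledBuilderLeak {
    /** Hypothetical stand-in for a pooled builder; NOT the real FastBuilder. */
    static final class SketchBuilder {
        static final int LEAF_CAPACITY = 31;
        private final String[] leaf = new String[LEAF_CAPACITY];
        private int count;
        // Assumed simplification: overflow is kept in a plain list.
        private final List<String> savedBuffer = new ArrayList<>();
        private final boolean clearOverflowOnReset;

        SketchBuilder(boolean clearOverflowOnReset) {
            this.clearOverflowOnReset = clearOverflowOnReset;
        }

        void add(String column) {
            if (count == LEAF_CAPACITY) {
                // Leaf is full: spill it into the overflow state.
                for (int i = 0; i < count; i++) savedBuffer.add(leaf[i]);
                count = 0;
            }
            leaf[count++] = column;
        }

        List<String> build() {
            // Overflow state is drained into the result ahead of the current leaf.
            List<String> out = new ArrayList<>(savedBuffer);
            for (int i = 0; i < count; i++) out.add(leaf[i]);
            return out;
        }

        void reset() {
            count = 0;
            // The buggy variant skips this, so overflow survives reuse.
            if (clearOverflowOnReset) savedBuffer.clear();
        }
    }

    /** Simulates a failed deserialization followed by reuse for another table. */
    static List<String> deserializeVictimAfterFailure(boolean fixedReset) {
        SketchBuilder pooled = new SketchBuilder(fixedReset);
        try {
            // Step 1: a 40-column source table overflows the leaf, then fails.
            for (int i = 0; i < 40; i++) pooled.add("source_col_" + i);
            throw new IllegalStateException("simulated schema disagreement");
        } catch (IllegalStateException e) {
            pooled.reset(); // builder goes back to the pool
        }
        // Step 2: the same builder is reused for a different table.
        pooled.add("victim_pk");
        List<String> columns = pooled.build();
        pooled.reset();
        return columns;
    }

    public static void main(String[] args) {
        // Buggy reset: 31 stale source columns plus the victim's own column.
        System.out.println(deserializeVictimAfterFailure(false).size()); // prints 32
        // Fixed reset: only the victim's own column.
        System.out.println(deserializeVictimAfterFailure(true)); // prints [victim_pk]
    }
}
```

In this sketch the fix is the single savedBuffer.clear() call in reset(), which mirrors the shape of the fix discussed in the thread (clearing savedBuffer and savedNextKey) without claiming to match the actual patch.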

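
The header check mentioned in the quoted message (comparing the columns in an SSTable header against the table's schema) amounts to a set difference. The sketch below is a minimal illustration over plain column-name sets; the class, method, and column names are hypothetical, and it does not use Cassandra's SerializationHeader or TableMetadata APIs. Dropped columns are subtracted as well, since the message notes that the foreign columns were neither current nor dropped columns of the affected table.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class HeaderColumnCheck {
    /**
     * Returns header column names that are neither in the current schema
     * nor among the table's dropped columns (hypothetical helper).
     */
    static Set<String> foreignColumns(Set<String> headerColumns,
                                      Set<String> schemaColumns,
                                      Set<String> droppedColumns) {
        Set<String> foreign = new TreeSet<>(headerColumns); // sorted for stable output
        foreign.removeAll(schemaColumns);
        foreign.removeAll(droppedColumns);
        return foreign;
    }

    public static void main(String[] args) {
        // Illustrative names only; not taken from any real table.
        Set<String> header = new HashSet<>(
            Arrays.asList("id", "name", "old_col", "other_a", "other_b"));
        Set<String> schema = new HashSet<>(Arrays.asList("id", "name"));
        Set<String> dropped = new HashSet<>(Arrays.asList("old_col"));

        System.out.println(foreignColumns(header, schema, dropped)); // prints [other_a, other_b]
    }
}
```

A non-empty result would flag the SSTable for closer inspection; an empty result only says the header is consistent with the schema as known at check time.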