[https://issues.apache.org/jira/browse/CASSANDRA-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

Yuqi Yan updated CASSANDRA-21260:
---------------------------------
    Description: 
Slack: [https://the-asf.slack.com/archives/CJZLTM05A/p1762461191355119]

There are multiple reports of this issue. Reported by [~gilg]:
{code:java}
Hey, I have an issue with a 4.1.5 cluster. For some reason a few tables appear 
to have bad metadata - sstablemetadata reports certain columns which are not 
really there (sstabledump does not show said columns on those sstables), on a 
lot of sstables, most of the table actually. Not sure exactly how it got there 
(we suspect someone loaded sstables of other tables since wrong column names 
match ones in other tables), but right now situation is some servers work fine 
and serve clients, while other servers, who have gone through restarts, are 
virtually without data for those tables, since on startup there were exceptions 
during sstables loading - "Unknown column xxx during deserialization" {code}
Reported by [~tolbertam]:
{code:java}
it's something I saw once recently, but I haven't been able to reproduce it. I 
suspect it has something to do with the refactoring around making Schema 
pluggable (https://issues.apache.org/jira/browse/CASSANDRA-17044). When I saw 
Gil's post it reminded me a lot of what I saw
My suspicion is it's some kind of timing issue with schema changes being pushed 
out, while also being pulled from other instances.
Like, if you have a bunch of concurrent schema changes on the same keyspaces, 
all submitted to different coordinators submitted relatively at the same time, 
without schema disagreement, that has traditionally caused a lot of issues with 
Cassandra. {code}
Reported by [~curlylrt] / [~yukei]:
{code:java}
We were running 4.0.6 when the column was added to a table in another keyspace. 
And now, we are running 4.1.3 and we recently noticed this issue because we are 
seeing error logs when loading sstables during node restart. {code}
Known facts
 * On restart, SSTables whose header contains unknown columns are ignored. Once 
an SSTable header is contaminated, restarting the node effectively loses that 
data until the SSTables are scrubbed and their headers fixed.
 * Contaminated SSTables should not spread to other nodes: they are ignored 
after a restart, and a node that receives one cannot parse it and throws 
UnknownColumnException.
 * Compaction performs no schema check. Once an SSTable has unexpected columns 
in its header, the new SSTables produced from it inherit those columns, because 
the input headers are merged blindly.
 ** Confirmed in a local test by injecting an SSTable with unexpected columns 
and running compaction.
 * In our case, all victim and offender tables share the same primary key.
 * The issue appears to be related to schema changes.
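The compaction and restart facts above can be illustrated with a deliberately simplified model. Everything in it (the class HeaderContaminationModel, the record Header, the helpers mergeBlindly / unknownColumns / loadableOnRestart, and the column name "email") is hypothetical and does not correspond to Cassandra's actual SerializationHeader or compaction code; an SSTable header is reduced to the set of column names it declares. It is only a sketch of how one contaminated input spreads to the compaction output and why that output is then rejected at startup.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical model, NOT Cassandra's real classes: an SSTable header is
 * reduced to the set of column names it declares.
 */
public class HeaderContaminationModel {

    /** Hypothetical stand-in for an SSTable header. */
    public record Header(Set<String> columns) {}

    /**
     * Compaction as described in the report: the output header is the union
     * of the input headers, with no check against the table schema.
     */
    public static Header mergeBlindly(List<Header> inputs) {
        Set<String> merged = new TreeSet<>();
        for (Header h : inputs) {
            merged.addAll(h.columns());
        }
        return new Header(merged);
    }

    /** Columns declared in the header but absent from the table schema. */
    public static Set<String> unknownColumns(Header header, Set<String> schema) {
        Set<String> unknown = new TreeSet<>(header.columns());
        unknown.removeAll(schema);
        return unknown;
    }

    /** Startup behaviour as described: a header with unknown columns is not loaded. */
    public static boolean loadableOnRestart(Header header, Set<String> schema) {
        return unknownColumns(header, schema).isEmpty();
    }

    public static void main(String[] args) {
        Set<String> schema = Set.of("id", "name");
        Header clean = new Header(Set.of("id", "name"));
        // Hypothetical: "email" belongs to another table with the same primary key
        Header contaminated = new Header(Set.of("id", "name", "email"));

        Header compacted = mergeBlindly(List.of(clean, contaminated));
        System.out.println(compacted.columns());                  // [email, id, name]
        System.out.println(unknownColumns(compacted, schema));    // [email]
        System.out.println(loadableOnRestart(compacted, schema)); // false
    }
}
```

In this model, a single contaminated input is enough to contaminate every SSTable it is compacted with, which matches the observation that most of a victim table's SSTables end up with the foreign columns.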



> SSTable header contains unknown columns from other tables
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-21260
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21260
>             Project: Apache Cassandra
>          Issue Type: Bug
>            Reporter: Yuqi Yan
>            Priority: Normal
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
