I would like to propose the following solution to CASSANDRA-17047,
CASSANDRA-18589, CASSANDRA-18591 etc. (the rationale and line of thought
included below) for pre-TCM Cassandra versions, and I'd love to hear your
feedback.
Definitions:
There are three states of knowledge of a node wrt to a particular column:
C - a column exists (there’s a column in TableMetadata::columns, no column
in TableMetadata::droppedColumns)
D - a column has been dropped (there’s no column in TableMetadata::columns,
and a column in TableMetadata::droppedColumns)
A - column has been dropped and then subsequently re-added (there’s a
column *both* in TableMetadata::columns and TableMetadata::droppedColumns)
For convenience, later in the document, I can just refer to column
creation, column drop, and column re-addition as C, D, or A events,
respectively.
Column lifecycle is always:
CDADAD…
A post-drop write is a write written with the “old” schema (where the
column was not yet dropped), but with a timestamp greater than the
respective column drop time.
Proposal:
The proposal consists of three postulates:
1. If either the coordinator or replica considers the column to be dropped
- it is OK for them to ignore the column data *regardless* of timestamp
Motivation:
Coordinator D should be allowed to return data as if the column was
removed; if the client wants an up-to-date view of data written after
subsequent A it should use an up-to-date coordinator (i.e. if you write to
coordinator A, and subsequently read from a coordinator with the previous D
state you may not be able to read your new write).
Replica D couldn’t have accepted any writes that happened after other nodes
performed subsequent A because of the check in
UpdateStatement::prepareInternal (i.e. only writes to columns visible in
the node’s schema are allowed). Thus, ignoring data does not affect writes
confirmed after the A event.
2. If neither the coordinator nor the replica considers the column to be
dropped - accept write resurrection
Motivation:
Without TCM there’s no race-free way to tell whether a write is a new,
confirmed write, and should be returned or an old, post-drop write.
Ignoring the written data may lead to data loss.
3. Read-repair should be disabled when there is schema inconsistency
between nodes
Motivation:
Eagerly ignoring written data by a replica in D state can lead to
read-repair storm. This is fundamentally caused by difference of opinion
between nodes with respect to existence of a particular write. A limitation
that read repairs can only be sent to replicas which agree on the current
version of the schema is simple and temporary, so it looks acceptable.
To mitigate the window for race conditions, it could be beneficial to
return a "schema disagreement detected" flag with the read from a replica.
Replicas already have the capacity to identify this condition based on the
ColumnFilter, which is sent by the coordinator alongside the read request.
This detection does not necessitate checking the EndpointState and so forth.
Full, in-depth spectrum of possible scenarios with some my loose
deliberations:
Coordinator knowledge
Replica knowledge
Current state, in particular in presence of post-drop writes
Expected behaviour (in users’ eyes)
D
C
IllegalStateException (CASSANDRA-17047, CASSANDRA-18532)
???
Ignore dropped column data?
D
D
ColumnFilter `fetches` all columns, including the dropped one;
NPE/AssertioError for post-drop-writes (CASSANDRA-18591, CASSANDRA-18589)
Ignore dropped column data
C
D
CASSANDRA-18591, CASSANDRA-18589
???
Replica knows more, but the coordinator is OK with returning the column.
The replica cannot guarantee that it will return the data anyway (it might
have already been removed).
Currently, the read path does not support returning data for dropped
columns.
The easiest solution would be to ignore the column as in the D D case Ignoring
the column may lead to read-repair storm
A
A
post-drop-writes are resurrected
???
The user would like the writes to be dropped.
We could do that by e.g. comparing write timestamp to the timestamp of
column re-addition. The problem is then that in the presence of clock skew
between nodes we might ignore reading a write that was confirmed (and e.g.
written in QUORUM). This sounds worse than resurrection.
A
D
Currently on replica it is not possible to distinguish this case from C D.
Also, there is no way to distinguish which node is “right” - whether the
column was dropped and has been re-added or if it was re-added and then
dropped again.
???
Again, the easiest is to ignore data for the dropped column (IIUC new
writes couldn’t have happened to the replica yet, because of the check in
UpdateStatement::prepareInternal - so, we will not remove a new, confirmed
write)
Or perhaps we should fail the read until schema convergence…
D
A
See above and DC case
See above and DC case
A
C
writes may be resurrected
???
I guess we can live with this…
We really shouldn’t