Hi everybody,

CEP-29 (CQL NOT Operator) is hitting the grey area of how we want as a
community to handle NULL including for things like unfrozen (multi-cell)
collections and I would like to make a proposal for moving forward with
NULL related issues.

We have currently 2 tickets open about NULL handling (I might have missed
others):

   1. CASSANDRA-10715
   <https://issues.apache.org/jira/browse/CASSANDRA-10715>: Allowing
   Filtering on NULL
   2. CASSANDRA-17762
   <https://issues.apache.org/jira/browse/CASSANDRA-17762>: LWT IF col =
   NULL is inconsistent with SQL NULL

We also had previously some discussion on which we touched the subject:

   - [DISCUSS] LWT UPDATE semantics with + and - when null
   - CEP-15 multi key transaction syntax

In all those tickets and discussions the consensus was to have a behavior
similar to SQL.

For null comparisons, SQL uses the three-value logic (
https://modern-sql.com/concept/three-valued-logic) introducing the need for
IS NULL and IS NOT NULL operators. Those conflict with the col = NULL
predicate supported in LWT conditions (CASSANDRA-17762
<https://issues.apache.org/jira/browse/CASSANDRA-17762>).

So far, as Cassandra was only using inclusive operators, comparisons were
behaving in an expected way. According to three-valued logic NULL CONTAINS
'foo' should return UNKNOWN and the filtering behavior should exclude
everything which is not true.Therefore the row should not be returned as
expected. With exclusive operators things are more tricky. NULL NOT
CONTAINS 'foo' will also return UNKNOWN causing the row to not be returned
which might not match people's expectations.
This behavior can be even more confusing once you take into account empty
and null collections. NOT CONTAINS on an empty collection will return true
while it will return UNKNOWN on a NULL collection. Unfortunately, for
unfrozen (multicell) collections we are unable to differentiate between an
empty and null collection and therefore always treat empty collections as
NULL.
For predicates such as map[myKey] != 'foo' when myKey is not present the
result can also be surprising as it will end up comparing NULL to 'foo'
returning once more UNKNOWN and ignoring the row.
In order to respect the SQL three-valued logic and be able to allow the
user to fetch all the rows which do not contains a specific value we would
need support IS NULL, IS NOT NULL and OR to allow query like:
WHERE c IS NULL OR c NOT CONTAINS 'foo' / WHERE m IS NULL OR m[myKey] != foo

Supporting the three-valued logic makes sense to me even if some behavior
might end up being confusing. In which case we can easily fix
CASSANDRA-10715 and deprectate support for col = NULL/col != NULL in LWT.

What is people's opinion? Should we go for the three-valued logic
everywhere? Should we try something else?

Reply via email to