Re: Meaningless emptiness and filtering

2025-04-04 Thread David Capwell
Been expanding the current AST Harry tests to include collections and UDTs and 
been finding even more fun with empty vs null… what Cassandra returns is a 
product of the schema and what you do with the table, and these types can act 
differently than other types.

> On Feb 13, 2025, at 4:45 AM, Mick Semb Wever  wrote:
> 
>
> 
> On Tue, 11 Feb 2025 at 19:56, Caleb Rackliffe wrote:
>> When we add IS [NOT] NULL support, that would preferably NOT match EMPTY 
>> values for the types where empty means something, like strings. For 
>> everything else, EMPTY could be equivalent to null and match IS NULL.
> 
> 
> 
> Makes sense to me to say this is what we intend in advance of IS NULL landing.
> 
> i.e. `isEmptyValueMeaningless=true and v=EMPTY_BYTE_BUFFER` is for now 
> equivalent to what will be `IS NULL`, so isEmptyValueMeaningless effectively 
> (temporarily?) means isEmptyValueTreatedAsNull 
> 
> And in the meantime just say that SAI currently does not support such NULL 
> values (and leave 2i and SASI wrong as they are – they're legacy – 
> despite this making CQL statements inconsistent depending on the index impl).
> 
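
A minimal sketch of the semantics proposed in the quoted messages, written in Python for brevity rather than Cassandra's Java. The function name `matches_is_null` and the `empty_is_meaningless` flag are hypothetical stand-ins for the per-type isEmptyValueMeaningless property discussed above; this is an illustration of the intent, not Cassandra's implementation.

```python
from typing import Optional

EMPTY = b""  # stands in for EMPTY_BYTE_BUFFER (assumption for this sketch)


def matches_is_null(value: Optional[bytes], empty_is_meaningless: bool) -> bool:
    """Return True if a hypothetical IS NULL predicate would match the value."""
    if value is None:
        return True
    # Where empty carries no meaning for the type, treat EMPTY like null.
    return empty_is_meaningless and value == EMPTY


# String-like type: empty is a real value, so IS NULL should not match it.
print(matches_is_null(EMPTY, empty_is_meaningless=False))
# Type where empty is meaningless: EMPTY is treated as null.
print(matches_is_null(EMPTY, empty_is_meaningless=True))
```

Under this reading, IS NOT NULL would simply be the negation of the same check.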



Re: Huge NetApp donation of hardware for ci-cassandra

2025-04-04 Thread Maxim Muzafarov
This is extremely useful, I am the one who completely relies on the
ci-cassandra ;-)
Thank you!

On Thu, 20 Mar 2025 at 11:52, Jacek Lewandowski
 wrote:
>
> Awesome, thank you!!!
>
>
> On Thu, Mar 20, 2025, 08:05 guo Maxwell  wrote:
>>
>> Thanks so much!!!
>>
>> On Thu, 20 Mar 2025 at 14:53, Berenguer Blasi wrote:
>>>
>>> Thanks a million!
>>>
>>> On 19/3/25 20:03, Francisco Guerrero wrote:
>>> > wow! This is big! Thanks for the donation to the project.
>>> >
>>> > On 2025/03/19 15:50:34 Mick Semb Wever wrote:
>>> >> Under an ASF targeted sponsorship, NetApp (Instaclustr) has been very
>>> >> generous with the community and donated ten beefy (AMD EPYC 9454P Genoa
>>> >> 48-Core, 256G ram) servers to be used with our ci-cassandra.apache.org
>>> >> infrastructure.
>>> >>
>>> >> On each server we fit 6 Jenkins executors, increasing our ci-cassandra.a.o
>>> >> executor count by 42!
>>> >> (60 new, minus 18 old executors from Instaclustr now removed).
>>> >>
>>> >> This raises our executor count from 98 to 140, and means NetApp's donation
>>> >> is currently running 30% of the project's CI resources!
>>> >>
>>> >> This is a big deal for the project, adding both stability and improved
>>> >> throughput of CI for the community.
>>> >> https://github.com/apache/cassandra-builds/blob/trunk/ASF-jenkins-agents.md
>>> >>
>>> >> A very big thank you to NetApp, and to all our contributors employed there
>>> >> to help make this happen.
>>> >>


Re: Per partition local ordering

2025-04-04 Thread Patrick McFadin
I played around with this idea by simulating it in ChatGPT (yes, you can do
that). It occurred to me that this is similar to the SQL functionality of the
DISTINCT keyword. Seeing how we can align CQL with SQL is something I'm
personally investing more time in for the long-term of the project. This
could be an opportunity to get one step closer with useful syntax.

Re-arranging your idea in SQL syntax, it would look like this:

SELECT DISTINCT ON (sensor_id) device_id, sensor_id, time, value
FROM data
WHERE device_id = 'mydevice'
  AND sensor_id IN ('s1', 's2', 's3')
ORDER BY sensor_id, time DESC;

I think this is the same outcome and similar partition-level
implementation. DISTINCT on a multi-partition query would return the first
value of each partition. This would especially work in these types of
primary keys: PRIMARY KEY((device_id, sensor_id), time)
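
To make the intended result of the DISTINCT ON query concrete, here is a small Python sketch that emulates it over in-memory rows. The sample data, the integer timestamps, and the helper name `distinct_on_sensor` are assumptions for illustration only; it says nothing about how Cassandra would execute such a query.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows for one device: (device_id, sensor_id, time, value)
rows = [
    ("mydevice", "s1", 1, b"a"),
    ("mydevice", "s1", 3, b"b"),
    ("mydevice", "s2", 2, b"c"),
    ("mydevice", "s2", 5, b"d"),
    ("mydevice", "s3", 4, b"e"),
]


def distinct_on_sensor(rows):
    """Emulate DISTINCT ON (sensor_id) ... ORDER BY sensor_id, time DESC."""
    # Sort by sensor_id ascending, then by time descending.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    # Keep the first row of each sensor_id group, i.e. its latest reading.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(1))]


latest = distinct_on_sensor(rows)
for row in latest:
    print(row)
```

With the (device_id, sensor_id) partition key, each group here corresponds to one partition, so the result is one row per partition.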

In the long term, this keeps us from building up more unique syntax, which I
really prefer.

Patrick

On Tue, Apr 1, 2025 at 9:55 AM Artem Golovko 
wrote:

> Hello everyone,
>
> I did not find any discussions about this topic and would like to ask
> if there are any considerations for introducing "PER PARTITION ORDER"
> functionality. It duplicates a Scylla question, but now for
> Cassandra: https://forum.scylladb.com/t/per-partition-local-ordering/3412.
> I am also not very experienced with the Cassandra code base, but as
> far as I know it should make sense.
>
> Let me introduce the use case.
>
> Data model:
>
> CREATE TABLE data(
>     device_id TEXT,
>     sensor_id TEXT,
>     time TIMESTAMP,
>     value BLOB,
>     PRIMARY KEY((device_id, sensor_id), time)
> )
>
> Queries: Give me the first and the last value for all sensors within a
> device_id.
>
> Problem: Within a device it's possible to have 10k sensors or
> more, and if we want to get a "snapshot" (e.g. a list of sensors with
> the values having the max timestamp), it may take lots of round trips
> with small requests and responses. Therefore we can use the "IN" clause
> here, grouping keys based on the replica node (e.g. batch node aware read).
>
> 1. First point
> SELECT * FROM data WHERE device_id = 'mydevice' AND sensor_id IN ('s1',
> 's2', 's3') PER PARTITION LIMIT 1
>
> Here we can get the first point for each partition and don’t care
> about “global” ordering, so the resulting rows won’t be sorted by
> clustering key and natural order will be applied only locally within
> each partition.
>
> 2. Last point
> SELECT * FROM data WHERE device_id = 'mydevice' AND sensor_id IN ('s1',
> 's2', 's3') ORDER BY time DESC PER PARTITION LIMIT 1
>
> It’s not possible to use IN and ORDER BY together with paging enabled.
> The reason is that Cassandra applies ordering “locally” within each
> partition, but also applies it “globally” across the resulting rows,
> which makes Cassandra hold the result in memory to apply the “global”
> sort. But if I don’t care about “global” ordering and only want to
> specify ordering within each partition, that introduces unnecessary
> performance overhead.
>
> What if we introduced a "PER PARTITION ORDER" clause? In most
> use cases it should not bring much benefit, because we're limited
> by the number of keys in the IN clause (by default 100), so the result
> should not be too big to fit in memory, but maybe
> someone has another use case where PER PARTITION LIMIT is more than 1 or
> the payload is big enough.
>
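
The difference between the two orderings described in the quoted message can be sketched in Python. Everything here is hypothetical (the `partitions` dict, integer timestamps, and both function names); it only illustrates why per-partition ordering can be streamed while global ordering has to collect all rows first, and does not reflect Cassandra's read path.

```python
from itertools import islice
from typing import Dict, Iterator, List, Tuple

Row = Tuple[str, int, bytes]  # (sensor_id, time, value)

# Hypothetical partitions; rows are stored in clustering order (time ASC).
partitions: Dict[str, List[Row]] = {
    "s1": [("s1", 1, b"a"), ("s1", 3, b"b")],
    "s2": [("s2", 2, b"c"), ("s2", 5, b"d")],
}


def per_partition_order_desc(keys: List[str], limit: int) -> Iterator[Row]:
    """Stream rows partition by partition, newest first within each one."""
    for key in keys:
        # Reverse read of a single partition; nothing is held across partitions.
        yield from islice(reversed(partitions.get(key, [])), limit)


def global_order_desc(keys: List[str]) -> List[Row]:
    """Global ordering: every selected row is collected before sorting."""
    collected = [row for key in keys for row in partitions.get(key, [])]
    return sorted(collected, key=lambda r: r[1], reverse=True)


for row in per_partition_order_desc(["s1", "s2"], limit=1):
    print(row)
```

The first function yields the "last point" of each sensor without buffering the whole result, which is the behavior the proposed clause is asking for.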


Re: [VOTE][IP CLEARANCE] Spark-Cassandra-Connector

2025-04-04 Thread Benjamin Lerer
+1

On Tue, 18 Mar 2025 at 19:02, Bernardo Botella
wrote:

> +1 (nb)
>
> On Mar 18, 2025, at 10:52 AM, Yifan Cai  wrote:
>
> +1 (nb)
>
> --
> *From:* Jeremiah Jordan 
> *Sent:* Tuesday, March 18, 2025 10:32:14 AM
> *To:* dev@cassandra.apache.org 
> *Cc:* gene...@incubator.apache.org 
> *Subject:* Re: [VOTE][IP CLEARANCE] Spark-Cassandra-Connector
>
> +1
>
> On Mar 18, 2025 at 3:13:09 AM, Mick Semb Wever  wrote:
>
> (general@incubator cc'd)
>
> Please vote on the acceptance of the Spark-Cassandra-Connector and its
> IP Clearance:
>
> https://incubator.apache.org/ip-clearance/cassandra-spark-cassandra-connector.html
>
> All consent from original authors of the donation, and tracking of
> collected CLAs, is found in
> https://github.com/datastax/spark-cassandra-connector/pull/1376 and
>
> https://docs.google.com/spreadsheets/d/1rkFtfnXbIckV1tYQlgFtwoHHOKUJj0vv-VndlQWA4rY
> These do not all require acknowledgement before the vote.
>
> The code is prepared for donation at
> https://github.com/datastax/spark-cassandra-connector
>
> Once this vote passes we will request ASF Infra to move the
> datastax/spark-cassandra-connector as-is to
> apache/cassandra-spark-connector.  The master and gh-pages branches,
> all tags, and all history, will be kept.  The master branch will be
> renamed to trunk.
>
> PMC members, please check carefully the IP Clearance requirements before
> voting.
>
> The vote will be open for 72 hours (or longer). Votes by PMC members
> are considered binding. A vote passes if there are at least three
> binding +1s and no -1's.
>
> regards,
> Mick
>
>
>