date:20230504

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread guo Maxwell

This is very meaningful work, thanks. I would like to ask a question that
is not about the CEP's code design itself but about project engineering
management: what is the future development and release plan for this
project?
As far as I know, the Cassandra Sidecar project does not yet have a final
release version. I don't think anyone wants the code to be merged and then
remain unreleased for a long time because this project relies on Cassandra
Sidecar.

On Thu, May 4, 2023 at 02:35, Dinesh Joshi wrote:

> If there aren't additional questions / comments I will start the VOTE
> thread on this CEP tonight.
>
> On 2023/05/01 19:50:12 Dinesh Joshi wrote:
> > Does anybody have any questions that we could answer about this proposal?
>

-- 
you are the apple of my eye !

Re: [POLL] Vector type for ML

2023-05-04 Thread Mick Semb Wever

>
> Did we agree on a CQL syntax?
>
> I don’t believe there has been a poll on CQL syntax… my understanding
> reading all the threads is that there are ~4-5 options and none are -1ed, so
> I believe we are waiting for majority rule on this?
>


Re-reading that thread, IIUC the valid choices remaining are…

1. VECTOR FLOAT[n]
2. FLOAT VECTOR[n]
3. VECTOR
4. VECTOR[n]
5. ARRAY
6. NON-NULL FROZEN


Yes I'm putting my preference (1) first ;) because (banging on) if the
future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
keyword is: for general cql users; just meaning "non-null and frozen",
these gel best together.

Options (5) and (6) are for those that feel we can and should provide this
type without introducing the vector keyword.

Re: [VOTE] Release Apache Cassandra 3.11.15

2023-05-04 Thread Tommy Stendahl via dev

+1 (nb)

-Original Message-
From: "Miklosovic, Stefan"
Reply-To: dev@cassandra.apache.org
To: dev@cassandra.apache.org
Subject: [VOTE] Release Apache Cassandra 3.11.15
Date: Tue, 02 May 2023 06:37:46 +


Proposing the test build of Cassandra 3.11.15 for release.

sha1: 6cdcf5e56a77cf40c251125d68856a614eccbc53

Git:
https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/3.11.15-tentative

Maven Artifacts:
https://repository.apache.org/content/repositories/orgapachecassandra-1287/org/apache/cassandra/cassandra-all/3.11.15/

The Source and Build Artifacts, and the Debian and RPM packages and
repositories, are available here:
https://dist.apache.org/repos/dist/dev/cassandra/3.11.15/

The vote will be open for 72 hours (longer if needed). Everyone who has tested
the build is invited to vote. Votes by PMC members are considered binding. A
vote passes if there are at least three binding +1s and no -1's.

[1]: CHANGES.txt:
https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/3.11.15-tentative

[2]: NEWS.txt:
https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/3.11.15-tentative

Re: [POLL] Vector type for ML

2023-05-04 Thread Benedict

Hurrah for initial agreement.

For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR is 
redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR 
should be used to simply imply non-null, as this would be very unintuitive. 
More logical would be NONNULL, if this is the only condition being applied. 
Alternatively for arrays we could default to NONNULL and later introduce 
NULLABLE if we want to permit nulls.

If the word vector is to be used it makes more sense to make it look like a 
list, so VECTOR as here the word VECTOR is clearly not redundant.

So, I vote:

1) (NON NULL) FLOAT[N]
2) FLOAT[N]   (Non null by default)
3) VECTOR

> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
> 
> 
>>> Did we agree on a CQL syntax?
>> I don’t believe there has been a poll on CQL syntax… my understanding
>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>> I believe we are waiting for majority rule on this?
> 
> 
> Re-reading that thread, IIUC the valid choices remaining are…
> 
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR
> 4. VECTOR[n]
> 5. ARRAY
> 6. NON-NULL FROZEN
> 
> 
> Yes I'm putting my preference (1) first ;) because (banging on) if the future 
> of CQL will have FLOAT[n] and FROZEN, where the VECTOR keyword is: 
> for general cql users; just meaning "non-null and frozen", these gel best 
> together.
> 
> Options (5) and (6) are for those that feel we can and should provide this 
> type without introducing the vector keyword.
> 
>

Re: [POLL] Vector type for ML

2023-05-04 Thread Mike Adamson

>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
> think VECTOR should be used to simply imply non-null, as this would be very
> unintuitive. More logical would be NONNULL, if this is the only condition
> being applied. Alternatively for arrays we could default to NONNULL and
> later introduce NULLABLE if we want to permit nulls.
>

I have a small issue relating to not having a specific VECTOR tag on the
data type. The driver behind adding this datatype is the hnsw index that is
being added to consume this data. If we have a generic array datatype, what
is the expectation going to be for users who create an index on it? The
hnsw index will support only floats initially so we would have to reject
any non-float arrays if an attempt was made to create an hnsw index on it.
While there is no problem with doing this, there would be a problem if, in
the future, we allow indexing in arrays in the same way that we index
collections. In this case we would then need to have the user select what
type of index they want at creation time.

Can I add another proposal that we allow a VECTOR or DENSE (this is a well
known term in the ML space) keyword that could be used when the array is
going to be used for ML workloads. This would be optional and would
function similarly to FROZEN in that it would limit the functionality of
the array to ML usage.
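
To make the concern above concrete, here is a minimal sketch of the check being described: the index kind is requested explicitly at creation time, and an hnsw index is rejected on any non-float array. This is not Cassandra code; `ArrayType`, `validate_index` and the "generic" index kind are hypothetical names used only for illustration, and only "hnsw" comes from the thread.

```python
from dataclasses import dataclass


# Hypothetical model of a fixed-length array column type such as FLOAT[N].
@dataclass(frozen=True)
class ArrayType:
    element_type: str  # e.g. "float" or "int"
    dimension: int


# Hypothetical index kinds; only "hnsw" is named in the thread.
SUPPORTED_INDEX_KINDS = {"hnsw", "generic"}


def validate_index(column_type: ArrayType, index_kind: str) -> str:
    """Return index_kind if it may be created on column_type, else raise."""
    if index_kind not in SUPPORTED_INDEX_KINDS:
        raise ValueError(f"unknown index kind: {index_kind}")
    # Assumption from the thread: hnsw initially supports only float arrays.
    if index_kind == "hnsw" and column_type.element_type != "float":
        raise ValueError("hnsw index supports only float arrays")
    return index_kind
```

With a check like this the column type only limits what is permitted; it does not pick the index for the user.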

On Thu, 4 May 2023 at 09:45, Benedict  wrote:

> Hurrah for initial agreement.
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
> think VECTOR should be used to simply imply non-null, as this would be very
> unintuitive. More logical would be NONNULL, if this is the only condition
> being applied. Alternatively for arrays we could default to NONNULL and
> later introduce NULLABLE if we want to permit nulls.
>
> If the word vector is to be used it makes more sense to make it look like
> a list, so VECTOR as here the word VECTOR is clearly not
> redundant.
>
> So, I vote:
>
> 1) (NON NULL) FLOAT[N]
> 2) FLOAT[N]   (Non null by default)
> 3) VECTOR
>
>
>
> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>
> 
>
>> Did we agree on a CQL syntax?
>>
>> I don’t believe there has been a poll on CQL syntax… my understanding
>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>> I believe we are waiting for majority rule on this?
>>
>
>
> Re-reading that thread, IIUC the valid choices remaining are…
>
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR
> 4. VECTOR[n]
> 5. ARRAY
> 6. NON-NULL FROZEN
>
>
> Yes I'm putting my preference (1) first ;) because (banging on) if the
> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
> keyword is: for general cql users; just meaning "non-null and frozen",
> these gel best together.
>
> Options (5) and (6) are for those that feel we can and should provide this
> type without introducing the vector keyword.
>
>
>
>

-- 
Mike Adamson
Engineering

+1 650 389 6000 | datastax.com

Re: [POLL] Vector type for ML

2023-05-04 Thread Benedict

I would expect that the type of index would be specified anyway?

I don’t think it’s good API design to have the field define the index you
create - only to shape what is permitted.

A HNSW index is very specific and should be asked for specifically, not
implicitly, IMO.

On 4 May 2023, at 11:47, Mike Adamson wrote:

>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>> think VECTOR should be used to simply imply non-null, as this would be very
>> unintuitive. More logical would be NONNULL, if this is the only condition
>> being applied. Alternatively for arrays we could default to NONNULL and
>> later introduce NULLABLE if we want to permit nulls.
>
> I have a small issue relating to not having a specific VECTOR tag on the
> data type. The driver behind adding this datatype is the hnsw index that is
> being added to consume this data. If we have a generic array datatype, what
> is the expectation going to be for users who create an index on it? The
> hnsw index will support only floats initially so we would have to reject
> any non-float arrays if an attempt was made to create an hnsw index on it.
> While there is no problem with doing this, there would be a problem if, in
> the future, we allow indexing in arrays in the same way that we index
> collections. In this case we would then need to have the user select what
> type of index they want at creation time.
>
> Can I add another proposal that we allow a VECTOR or DENSE (this is a well
> known term in the ML space) keyword that could be used when the array is
> going to be used for ML workloads. This would be optional and would
> function similarly to FROZEN in that it would limit the functionality of
> the array to ML usage.

Re: [POLL] Vector type for ML

2023-05-04 Thread Mike Adamson

That's fair comment. In this case I would be happy with any of your
suggestions although I would prefer that the datatype did not support
nulls.

On Thu, 4 May 2023 at 11:55, Benedict  wrote:

> I would expect that the type of index would be specified anyway?
>
> I don’t think it’s good API design to have the field define the index you
> create - only to shape what is permitted.
>
> A HNSW index is very specific and should be asked for specifically, not
> implicitly, IMO.
>
> On 4 May 2023, at 11:47, Mike Adamson  wrote:
>
> 
>
>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>> think VECTOR should be used to simply imply non-null, as this would be very
>> unintuitive. More logical would be NONNULL, if this is the only condition
>> being applied. Alternatively for arrays we could default to NONNULL and
>> later introduce NULLABLE if we want to permit nulls.
>>
>
> I have a small issue relating to not having a specific VECTOR tag on the
> data type. The driver behind adding this datatype is the hnsw index that is
> being added to consume this data. If we have a generic array datatype, what
> is the expectation going to be for users who create an index on it? The
> hnsw index will support only floats initially so we would have to reject
> any non-float arrays if an attempt was made to create an hnsw index on it.
> While there is no problem with doing this, there would be a problem if, in
> the future, we allow indexing in arrays in the same way that we index
> collections. In this case we would then need to have the user select what
> type of index they want at creation time.
>
> Can I add another proposal that we allow a VECTOR or DENSE (this is a well
> known term in the ML space) keyword that could be used when the array is
> going to be used for ML workloads. This would be optional and would
> function similarly to FROZEN in that it would limit the functionality of
> the array to ML usage.
>
> On Thu, 4 May 2023 at 09:45, Benedict  wrote:
>
>> Hurrah for initial agreement.
>>
>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>> think VECTOR should be used to simply imply non-null, as this would be very
>> unintuitive. More logical would be NONNULL, if this is the only condition
>> being applied. Alternatively for arrays we could default to NONNULL and
>> later introduce NULLABLE if we want to permit nulls.
>>
>> If the word vector is to be used it makes more sense to make it look like
>> a list, so VECTOR as here the word VECTOR is clearly not
>> redundant.
>>
>> So, I vote:
>>
>> 1) (NON NULL) FLOAT[N]
>> 2) FLOAT[N]   (Non null by default)
>> 3) VECTOR
>>
>>
>>
>> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>>
>> 
>>
>>> Did we agree on a CQL syntax?
>>>
>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>>> I believe we are waiting for majority rule on this?
>>>
>>
>>
>> Re-reading that thread, IIUC the valid choices remaining are…
>>
>> 1. VECTOR FLOAT[n]
>> 2. FLOAT VECTOR[n]
>> 3. VECTOR
>> 4. VECTOR[n]
>> 5. ARRAY
>> 6. NON-NULL FROZEN
>>
>>
>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
>> keyword is: for general cql users; just meaning "non-null and frozen",
>> these gel best together.
>>
>> Options (5) and (6) are for those that feel we can and should provide
>> this type without introducing the vector keyword.
>>
>>
>>
>>
>
> --
> Mike Adamson
> Engineering
>
> +1 650 389 6000 | datastax.com
>
>

-- 
Mike Adamson
Engineering

+1 650 389 6000 | datastax.com

Re: [POLL] Vector type for ML

2023-05-04 Thread Brandon Williams

1. VECTOR
2. VECTOR FLOAT[n]
3. FLOAT[N]   (Non null by default)

Redundant or not, I think having the VECTOR keyword helps signify what
the app is generally about and helps get buy-in from ML stakeholders.

On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:
>
> Hurrah for initial agreement.
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR 
> is redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR 
> should be used to simply imply non-null, as this would be very unintuitive. 
> More logical would be NONNULL, if this is the only condition being applied. 
> Alternatively for arrays we could default to NONNULL and later introduce 
> NULLABLE if we want to permit nulls.
>
> If the word vector is to be used it makes more sense to make it look like a 
> list, so VECTOR as here the word VECTOR is clearly not redundant.
>
> So, I vote:
>
> 1) (NON NULL) FLOAT[N]
> 2) FLOAT[N]   (Non null by default)
> 3) VECTOR
>
>
>
> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>
> 
>>
>> Did we agree on a CQL syntax?
>>
>> I don’t believe there has been a poll on CQL syntax… my understanding
>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>> I believe we are waiting for majority rule on this?
>
>
>
> Re-reading that thread, IIUC the valid choices remaining are…
>
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR
> 4. VECTOR[n]
> 5. ARRAY
> 6. NON-NULL FROZEN
>
>
> Yes I'm putting my preference (1) first ;) because (banging on) if the future 
> of CQL will have FLOAT[n] and FROZEN, where the VECTOR keyword is: 
> for general cql users; just meaning "non-null and frozen", these gel best 
> together.
>
> Options (5) and (6) are for those that feel we can and should provide this 
> type without introducing the vector keyword.
>
>

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Dinesh Joshi

Hi Guo,

I would expect that there would be release artifacts for the sidecar as well as 
the library once this functionality is available.

Dinesh

> On May 4, 2023, at 12:03 AM, guo Maxwell  wrote:
> 
> This is very meaningful work, thanks. I would like to ask a question that is
> not about the CEP's code design itself but about project engineering
> management: what is the future development and release plan for this
> project?
> As far as I know, the Cassandra Sidecar project does not yet have a final
> release version. I don't think anyone wants the code to be merged and then
> remain unreleased for a long time because this project relies on Cassandra
> Sidecar.
> 
> On Thu, May 4, 2023 at 02:35, Dinesh Joshi wrote:
>> If there aren't additional questions / comments I will start the VOTE thread 
>> on this CEP tonight.
>> 
>> On 2023/05/01 19:50:12 Dinesh Joshi wrote:
>> > Does anybody have any questions that we could answer about this proposal?
> 
> 
> -- 
> you are the apple of my eye !

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread guo Maxwell

Thanks Dinesh,
That will be great. 👍

On Thu, May 4, 2023 at 11:06 PM, Dinesh Joshi wrote:

> Hi Guo,
>
> I would expect that there would be release artifacts for the sidecar as
> well as the library once this functionality is available.
>
> Dinesh
>
> On May 4, 2023, at 12:03 AM, guo Maxwell  wrote:
>
> This is very meaningful work, thanks. I would like to ask a question that
> is not about the CEP's code design itself but about project engineering
> management: what is the future development and release plan for this
> project?
> As far as I know, the Cassandra Sidecar project does not yet have a final
> release version. I don't think anyone wants the code to be merged and then
> remain unreleased for a long time because this project relies on Cassandra
> Sidecar.
>
> On Thu, May 4, 2023 at 02:35, Dinesh Joshi wrote:
>
>> If there aren't additional questions / comments I will start the VOTE
>> thread on this CEP tonight.
>>
>> On 2023/05/01 19:50:12 Dinesh Joshi wrote:
>> > Does anybody have any questions that we could answer about this
>> proposal?
>>
>
>
> --
> you are the apple of my eye !
>
>
> --
you are the apple of my eye !

[VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Doug Rohrer

Hello all,

I’d like to put CEP-28 to a vote.

Proposal:

https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics

Jira:
https://issues.apache.org/jira/browse/CASSANDRA-16222

Draft implementation:

- Apache Cassandra Spark Analytics source code: 
https://github.com/frankgh/cassandra-analytics
- Changes required for Sidecar: 
https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis

Discussion:
https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3

The vote will be open for 72 hours. 
A vote passes if there are at least three binding +1s and no binding vetoes. 
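
As an illustration only, the pass rule stated above can be sketched as a small function. The `(value, binding)` tuple shape and the name `vote_passes` are assumptions made for this sketch; it does not model the 72-hour window or who counts as a binding voter.

```python
def vote_passes(votes):
    """Apply the stated rule: at least three binding +1s and no binding vetoes.

    votes: iterable of (value, binding) pairs, where value is +1, 0 or -1
    and binding is True for a binding vote (hypothetical representation).
    """
    votes = list(votes)  # allow generators to be scanned twice
    binding_plus_ones = sum(1 for value, binding in votes if binding and value == 1)
    binding_veto = any(binding and value == -1 for value, binding in votes)
    return binding_plus_ones >= 3 and not binding_veto
```

Non-binding votes (the "+1 nb" replies) are recorded but do not count toward the threshold in this sketch.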


Thanks,

Doug Rohrer

Re: [POLL] Vector type for ML

2023-05-04 Thread David Capwell

My views on syntax have changed over time, and I feel type[dimension] may not be
the best, so it has gone lower in my own personal ranking… this is my current
preference

1) DENSE [dimension] | NON NULL [dimension]
2) VECTOR
3) type[dimension]

My reasoning for this order

* type[dimension] looks like syntax sugar for array, so users
may assume list/array semantics, but we limit to non-null elements in a frozen
array
* I feel VECTOR as a prefix is out of place, but VECTOR as a direct type makes
more sense… this also leads to a possible future of VECTOR which is the
non-fixed length version of this type.  What makes VECTOR different from
list/array?  Non-null elements and being frozen.  I don’t feel that VECTOR really
tells users to expect non-null or frozen semantics, as there exist different
VECTOR types for those reasons (sparse vs dense)…
* DENSE may be confusing for people coming from languages where this just means
“sequential layout”, which is what our frozen arrays/lists already are… but since
the target user is coming from an ML background, this shouldn’t cause much
confusion.  DENSE just means FROZEN in Cassandra, with NON NULL elements
(SPARSE allows for NULL and isn’t frozen)… So DENSE just acts as syntax sugar
for frozen
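
A rough sketch of the semantics described in the list above (fixed dimension, non-null elements, frozen), written in Python rather than CQL. The function name `make_dense_vector` is hypothetical, and using a tuple to model "frozen" is an assumption of this sketch, not how Cassandra stores the type.

```python
def make_dense_vector(values, dimension):
    """Build an immutable float vector, enforcing the DENSE constraints.

    Raises ValueError if the length differs from `dimension` or any
    element is None (the non-null rule discussed in the thread).
    """
    values = list(values)
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    if any(v is None for v in values):
        raise ValueError("dense vectors may not contain null elements")
    # A tuple cannot be modified in place, which stands in for FROZEN here.
    return tuple(float(v) for v in values)
```

A SPARSE counterpart, as hinted at above, would relax both the null check and the immutability.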

> On May 4, 2023, at 4:13 AM, Brandon Williams  wrote:
> 
> 1. VECTOR
> 2. VECTOR FLOAT[n]
> 3. FLOAT[N]   (Non null by default)
> 
> Redundant or not, I think having the VECTOR keyword helps signify what
> the app is generally about and helps get buy-in from ML stakeholders.
> 
> On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:
>> 
>> Hurrah for initial agreement.
>> 
>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR 
>> is redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR 
>> should be used to simply imply non-null, as this would be very unintuitive. 
>> More logical would be NONNULL, if this is the only condition being applied. 
>> Alternatively for arrays we could default to NONNULL and later introduce 
>> NULLABLE if we want to permit nulls.
>> 
>> If the word vector is to be used it makes more sense to make it look like a 
>> list, so VECTOR as here the word VECTOR is clearly not redundant.
>> 
>> So, I vote:
>> 
>> 1) (NON NULL) FLOAT[N]
>> 2) FLOAT[N]   (Non null by default)
>> 3) VECTOR
>> 
>> 
>> 
>> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>> 
>> 
>>> 
>>> Did we agree on a CQL syntax?
>>> 
>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>>> I believe we are waiting for majority rule on this?
>> 
>> 
>> 
>> Re-reading that thread, IIUC the valid choices remaining are…
>> 
>> 1. VECTOR FLOAT[n]
>> 2. FLOAT VECTOR[n]
>> 3. VECTOR
>> 4. VECTOR[n]
>> 5. ARRAY
>> 6. NON-NULL FROZEN
>> 
>> 
>> Yes I'm putting my preference (1) first ;) because (banging on) if the 
>> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR 
>> keyword is: for general cql users; just meaning "non-null and frozen", these 
>> gel best together.
>> 
>> Options (5) and (6) are for those that feel we can and should provide this 
>> type without introducing the vector keyword.
>> 
>>

Re: [POLL] Vector type for ML

2023-05-04 Thread Patrick McFadin

I agree with David's reasoning and the use of DENSE (and maybe eventually
SPARSE). This is terminology well established in the data world, and it
would lead to much easier adoption from users. VECTOR is close, but I can
see having to create a lot of content around "How to use it and not get in
trouble." (I have a lot of that content already)

 - We don't have to explain what it is. A lot of prior art out there
already [1][2][3]
 - We're matching an established term with what users would expect. No
surprises.
 - Shorter ramp-up time for users. Cassandra is being modernized.

The implementation is flexible, but the interface should empower our users
to be awesome.

Patrick

1 -
https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
2 -
https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
3 - https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/

On Thu, May 4, 2023 at 10:25 AM David Capwell  wrote:

> My views on syntax have changed over time, and I feel type[dimension] may
> not be the best, so it has gone lower in my own personal ranking… this is
> my current preference
>
> 1) DENSE [dimension] | NON NULL [dimension]
> 2) VECTOR
> 3) type[dimension]
>
> My reasoning for this order
>
> * type[dimension] looks like syntax sugar for array, so
> users may assume list/array semantics, but we limit to non-null elements in
> a frozen array
> * I feel VECTOR as a prefix is out of place, but VECTOR as a direct type
> makes more sense… this also leads to a possible future of VECTOR
> which is the non-fixed length version of this type.  What makes VECTOR
> different from list/array?  Non-null elements and being frozen.  I don’t feel
> that VECTOR really tells users to expect non-null or frozen semantics, as
> there exist different VECTOR types for those reasons (sparse vs dense)…
> * DENSE may be confusing for people coming from languages where this just
> means “sequential layout”, which is what our frozen arrays/lists already are…
> but since the target user is coming from an ML background, this shouldn’t
> cause much confusion.  DENSE just means FROZEN in Cassandra, with NON NULL
> elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just acts as
> syntax sugar for frozen
>
>
> On May 4, 2023, at 4:13 AM, Brandon Williams  wrote:
>
> 1. VECTOR
> 2. VECTOR FLOAT[n]
> 3. FLOAT[N]   (Non null by default)
>
> Redundant or not, I think having the VECTOR keyword helps signify what
> the app is generally about and helps get buy-in from ML stakeholders.
>
> On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:
>
>
> Hurrah for initial agreement.
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
> think VECTOR should be used to simply imply non-null, as this would be very
> unintuitive. More logical would be NONNULL, if this is the only condition
> being applied. Alternatively for arrays we could default to NONNULL and
> later introduce NULLABLE if we want to permit nulls.
>
> If the word vector is to be used it makes more sense to make it look like
> a list, so VECTOR as here the word VECTOR is clearly not
> redundant.
>
> So, I vote:
>
> 1) (NON NULL) FLOAT[N]
> 2) FLOAT[N]   (Non null by default)
> 3) VECTOR
>
>
>
> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>
> 
>
>
> Did we agree on a CQL syntax?
>
> I don’t believe there has been a poll on CQL syntax… my understanding
> reading all the threads is that there are ~4-5 options and none are -1ed, so
> I believe we are waiting for majority rule on this?
>
>
>
>
> Re-reading that thread, IIUC the valid choices remaining are…
>
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR
> 4. VECTOR[n]
> 5. ARRAY
> 6. NON-NULL FROZEN
>
>
> Yes I'm putting my preference (1) first ;) because (banging on) if the
> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
> keyword is: for general cql users; just meaning "non-null and frozen",
> these gel best together.
>
> Options (5) and (6) are for those that feel we can and should provide this
> type without introducing the vector keyword.
>
>
>
>

Re: [VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread C. Scott Andreas


+1nb.

As someone familiar with this work, it's pretty hard to overstate the impact it
has on completing Cassandra's HTAP story. Eliminating the overhead of bulk reads
and writes on production OLTP clusters is transformative.

– Scott

On May 4, 2023, at 9:47 AM, Doug Rohrer wrote:

> Hello all,
>
> I’d like to put CEP-28 to a vote.
>
> Proposal:
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>
> Jira:
> https://issues.apache.org/jira/browse/CASSANDRA-16222
>
> Draft implementation:
>
> - Apache Cassandra Spark Analytics source code:
> https://github.com/frankgh/cassandra-analytics
> - Changes required for Sidecar:
> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
>
> Discussion:
> https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3
>
> The vote will be open for 72 hours.
> A vote passes if there are at least three binding +1s and no binding
> vetoes.
>
> Thanks,
>
> Doug Rohrer

Re: [VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Patrick McFadin

As somebody who gave this talk: https://youtu.be/9xf_IXNylhM I love the
evolution of this topic. Excited to see this! ++1 nb

Patrick



On Thu, May 4, 2023 at 11:35 AM C. Scott Andreas 
wrote:

> +1nb.
>
> As someone familiar with this work, it's pretty hard to overstate the
> impact it has on completing Cassandra's HTAP story. Eliminating the
> overhead of bulk reads and writes on production OLTP clusters is
> transformative.
>
> – Scott
>
> On May 4, 2023, at 9:47 AM, Doug Rohrer  wrote:
>
>
> Hello all,
>
> I’d like to put CEP-28 to a vote.
>
> Proposal:
>
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>
> Jira:
> https://issues.apache.org/jira/browse/CASSANDRA-16222
>
> Draft implementation:
>
> - Apache Cassandra Spark Analytics source code:
> https://github.com/frankgh/cassandra-analytics
> - Changes required for Sidecar:
> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
>
> Discussion:
> https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3
>
> The vote will be open for 72 hours.
> A vote passes if there are at least three binding +1s and no binding
> vetoes.
>
>
> Thanks,
>
> Doug Rohrer
>
>
>
>
>

Re: [VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Jeremy Hanna

+1 nb, I had to run Cassandra + Hadoop from the early days (0.7+) and it was 
painful.  This is a major step forward.

> On May 4, 2023, at 1:44 PM, Patrick McFadin  wrote:
> 
> As somebody who gave this talk: https://youtu.be/9xf_IXNylhM I love the 
> evolution of this topic. Excited to see this! ++1 nb
> 
> Patrick
> 
> 
> 
> On Thu, May 4, 2023 at 11:35 AM C. Scott Andreas wrote:
>> +1nb.
>> 
>> As someone familiar with this work, it's pretty hard to overstate the impact 
>> it has on completing Cassandra's HTAP story. Eliminating the overhead of 
>> bulk reads and writes on production OLTP clusters is transformative.
>> 
>> – Scott
>> 
>>> On May 4, 2023, at 9:47 AM, Doug Rohrer wrote:
>>> 
>>> 
>>> Hello all,
>>> 
>>> I’d like to put CEP-28 to a vote.
>>> 
>>> Proposal:
>>> 
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>>> 
>>> Jira:
>>> https://issues.apache.org/jira/browse/CASSANDRA-16222
>>> 
>>> Draft implementation:
>>> 
>>> - Apache Cassandra Spark Analytics source code: 
>>> https://github.com/frankgh/cassandra-analytics
>>> - Changes required for Sidecar: 
>>> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
>>> 
>>> Discussion:
>>> https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3
>>> 
>>> The vote will be open for 72 hours. 
>>> A vote passes if there are at least three binding +1s and no binding 
>>> vetoes. 
>>> 
>>> 
>>> Thanks,
>>> 
>>> Doug Rohrer
>>> 
>>> 
>>> 
>> 
>>

Re: [VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Brandon Williams

+1

Kind Regards,
Brandon

On Thu, May 4, 2023 at 11:47 AM Doug Rohrer  wrote:
>
> Hello all,
>
> I’d like to put CEP-28 to a vote.
>
> Proposal:
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>
> Jira:
> https://issues.apache.org/jira/browse/CASSANDRA-16222
>
> Draft implementation:
>
> - Apache Cassandra Spark Analytics source code: 
> https://github.com/frankgh/cassandra-analytics
> - Changes required for Sidecar: 
> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
>
> Discussion:
> https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3
>
> The vote will be open for 72 hours.
> A vote passes if there are at least three binding +1s and no binding vetoes.
>
>
> Thanks,
>
> Doug Rohrer
>
>

Re: [VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Nate McCall

+1

Thanks Doug!


On Fri, May 5, 2023 at 4:47 AM Doug Rohrer  wrote:

> Hello all,
>
> I’d like to put CEP-28 to a vote.
>
> Proposal:
>
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>
> Jira:
> https://issues.apache.org/jira/browse/CASSANDRA-16222
>
> Draft implementation:
>
> - Apache Cassandra Spark Analytics source code:
> https://github.com/frankgh/cassandra-analytics
> - Changes required for Sidecar:
> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
>
> Discussion:
> https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3
>
> The vote will be open for 72 hours.
> A vote passes if there are at least three binding +1s and no binding
> vetoes.
>
>
> Thanks,
>
> Doug Rohrer
>
>
>

Re: [VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Jon Haddad

+1.

Awesome work Doug!  Great to see this moving forward.  

On 2023/05/04 18:34:46 "C. Scott Andreas" wrote:
> +1nb.
>
> As someone familiar with this work, it's pretty hard to overstate the
> impact it has on completing Cassandra's HTAP story. Eliminating the overhead
> of bulk reads and writes on production OLTP clusters is transformative.
>
> – Scott
>
> On May 4, 2023, at 9:47 AM, Doug Rohrer wrote:
>
> Hello all,
>
> I’d like to put CEP-28 to a vote.
>
> Proposal:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>
> Jira:
> https://issues.apache.org/jira/browse/CASSANDRA-16222
>
> Draft implementation:
>
> - Apache Cassandra Spark Analytics source code:
> https://github.com/frankgh/cassandra-analytics
> - Changes required for Sidecar:
> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
>
> Discussion:
> https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3
>
> The vote will be open for 72 hours.
> A vote passes if there are at least three binding +1s and no binding vetoes.
>
> Thanks,
>
> Doug Rohrer

Re: [VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Yifan Cai

+1

From: Jon Haddad 
Sent: Thursday, May 4, 2023 3:31:52 PM
To: dev@cassandra.apache.org 
Subject: Re: [VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk 
Analytics

+1.

Awesome work Doug!  Great to see this moving forward.

On 2023/05/04 18:34:46 "C. Scott Andreas" wrote:
> +1nb.
>
> As someone familiar with this work, it's pretty hard to overstate the
> impact it has on completing Cassandra's HTAP story. Eliminating the overhead
> of bulk reads and writes on production OLTP clusters is transformative.
>
> – Scott
>
> On May 4, 2023, at 9:47 AM, Doug Rohrer wrote:
>
> Hello all,
>
> I’d like to put CEP-28 to a vote.
>
> Proposal:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>
> Jira:
> https://issues.apache.org/jira/browse/CASSANDRA-16222
>
> Draft implementation:
>
> - Apache Cassandra Spark Analytics source code:
> https://github.com/frankgh/cassandra-analytics
> - Changes required for Sidecar:
> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
>
> Discussion:
> https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3
>
> The vote will be open for 72 hours.
> A vote passes if there are at least three binding +1s and no binding vetoes.
>
> Thanks,
>
> Doug Rohrer

Re: [VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Francisco Guerrero

+1 (nb)

On 2023/05/04 23:38:08 Yifan Cai wrote:
> +1
> 
> From: Jon Haddad 
> Sent: Thursday, May 4, 2023 3:31:52 PM
> To: dev@cassandra.apache.org 
> Subject: Re: [VOTE] CEP-28: Reading and Writing Cassandra Data with Spark 
> Bulk Analytics
> 
> +1.
> 
> Awesome work Doug!  Great to see this moving forward.
> 
> On 2023/05/04 18:34:46 "C. Scott Andreas" wrote:
> > +1nb.
> >
> > As someone familiar with this work, it's pretty hard to overstate the
> > impact it has on completing Cassandra's HTAP story. Eliminating the
> > overhead of bulk reads and writes on production OLTP clusters is
> > transformative.
> >
> > – Scott
> >
> > On May 4, 2023, at 9:47 AM, Doug Rohrer wrote:
> >
> > Hello all,
> >
> > I’d like to put CEP-28 to a vote.
> >
> > Proposal:
> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
> >
> > Jira:
> > https://issues.apache.org/jira/browse/CASSANDRA-16222
> >
> > Draft implementation:
> >
> > - Apache Cassandra Spark Analytics source code:
> > https://github.com/frankgh/cassandra-analytics
> > - Changes required for Sidecar:
> > https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> >
> > Discussion:
> > https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3
> >
> > The vote will be open for 72 hours.
> > A vote passes if there are at least three binding +1s and no binding vetoes.
> >
> > Thanks,
> >
> > Doug Rohrer
>

Re: [POLL] Vector type for ML

2023-05-04 Thread Caleb Rackliffe

I actually still prefer *type[dimension]*, because I think I intuitively
read this as a primitive (meaning no null elements) array. Then we can have
the indexing apparatus only accept *frozen* for the HNSW case.

If that isn't intuitive to anyone else, I don't really have a strong
opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
should indicate single vs. multi-cell, and the other the presence or
absence of nulls/zeros/whatever.
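
To make the distinction concrete, here is a minimal sketch, assuming a made-up model of a type[dimension] column. The class FixedArrayType and its fields are hypothetical names for illustration only and are not Cassandra code or proposed CQL. It treats "frozen" (single vs. multi-cell storage) and "dense" (no null elements) as two independent flags, so that only "dense" affects whether a value with nulls is accepted.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class FixedArrayType:
    """Hypothetical model of a type[dimension] column; not Cassandra code."""

    dimension: int
    frozen: bool  # True: stored as a single cell; False: multi-cell
    dense: bool   # True: null elements are rejected

    def validate(
        self, value: Sequence[Optional[float]]
    ) -> Tuple[Optional[float], ...]:
        # The fixed dimension is enforced regardless of the other flags.
        if len(value) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} elements, got {len(value)}"
            )
        # Only the "dense" flag decides whether nulls are allowed;
        # "frozen" is a storage property and plays no role here.
        if self.dense and any(v is None for v in value):
            raise ValueError("dense array cannot contain null elements")
        return tuple(value)
```

Under these assumptions, a frozen but non-dense array would still accept a null element, while a dense one would reject it whether or not it is frozen.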

On Thu, May 4, 2023 at 12:51 PM Patrick McFadin  wrote:

> I agree with David's reasoning and the use of DENSE (and maybe eventually
> SPARSE). This is terminology well established in the data world, and it
> would lead to much easier adoption from users. VECTOR is close, but I can
> see having to create a lot of content around "How to use it and not get in
> trouble." (I have a lot of that content already)
>
>  - We don't have to explain what it is. A lot of prior art out there
> already [1][2][3]
>  - We're matching an established term with what users would expect. No
> surprises.
>  - Shorter ramp-up time for users. Cassandra is being modernized.
>
> The implementation is flexible, but the interface should empower our users
> to be awesome.
>
> Patrick
>
> 1 -
> https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
> 2 -
> https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
> 3 -
> https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>
> On Thu, May 4, 2023 at 10:25 AM David Capwell  wrote:
>
>> My views have changed over time on syntax and I feel type[dimension] may
>> not be the best, so it has gone lower in my own personal ranking… this is
>> my current preference
>>
>> 1) DENSE [dimension] | NON NULL [dimension]
>> 2) VECTOR
>> 3) type[dimension]
>>
>> My reasoning for this order
>>
>> * type[dimension] looks like syntax sugar for array, so
>> users may assume list/array semantics, but we limit to non-null elements in
>> a frozen array
>> * feel VECTOR as a prefix feels out of place, but VECTOR as a direct type
>> makes more sense… this also leads to a possible future of VECTOR
>> which is the non-fixed length version of this type.  What makes VECTOR
>> different from list/array?  non-null elements and is frozen.  I don’t feel
>> that VECTOR really tells users to expect non-null or frozen semantics, as
>> there exists different VECTOR types for those reasons (sparse vs dense)…
>> * DENSE may be confusing for people coming from languages where this just
>> means “sequential layout”, which is what our frozen array/list already are…
>> but since the target user is coming from a ML background, this shouldn’t
>> offer much confusion.  DENSE just means FROZEN in Cassandra, with NON NULL
>> elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just acts as
>> syntax sugar for frozen
>>
>>
>> On May 4, 2023, at 4:13 AM, Brandon Williams  wrote:
>>
>> 1. VECTOR
>> 2. VECTOR FLOAT[n]
>> 3. FLOAT[N]   (Non null by default)
>>
>> Redundant or not, I think having the VECTOR keyword helps signify what
>> the app is generally about and helps get buy-in from ML stakeholders.
>>
>> On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:
>>
>>
>> Hurrah for initial agreement.
>>
>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>> think VECTOR should be used to simply imply non-null, as this would be very
>> unintuitive. More logical would be NONNULL, if this is the only condition
>> being applied. Alternatively for arrays we could default to NONNULL and
>> later introduce NULLABLE if we want to permit nulls.
>>
>> If the word vector is to be used it makes more sense to make it look like
>> a list, so VECTOR as here the word VECTOR is clearly not
>> redundant.
>>
>> So, I vote:
>>
>> 1) (NON NULL) FLOAT[N]
>> 2) FLOAT[N]   (Non null by default)
>> 3) VECTOR
>>
>>
>>
>> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>>
>> 
>>
>>
>> Did we agree on a CQL syntax?
>>
>> I don’t believe there has been a poll on CQL syntax… my understanding
>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>> I believe we are waiting for majority rule on this?
>>
>>
>>
>>
>> Re-reading that thread, IIUC the valid choices remaining are…
>>
>> 1. VECTOR FLOAT[n]
>> 2. FLOAT VECTOR[n]
>> 3. VECTOR
>> 4. VECTOR[n]
>> 5. ARRAY
>> 6. NON-NULL FROZEN
>>
>>
>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
>> keyword is: for general cql users; just meaning "non-null and frozen",
>> these gel best together.
>>
>> Options (5) and (6) are for those that feel we can and should provide
>> this type without introducing the vector keyword.
>>
>>
>>
>>

Re: [POLL] Vector type for ML

2023-05-04 Thread Caleb Rackliffe

Even in the ML case, sparse can just mean zeros rather than nulls, and they
should compress similarly anyway.

If we really want null values, I'd rather leave that in collections space.

On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe 
wrote:

> I actually still prefer *type[dimension]*, because I think I intuitively
> read this as a primitive (meaning no null elements) array. Then we can have
> the indexing apparatus only accept *frozen* for the HNSW case.
>
> If that isn't intuitive to anyone else, I don't really have a strong
> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
> should indicate single vs. multi-cell, and the other the presence or
> absence of nulls/zeros/whatever.
>
> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin 
> wrote:
>
>> I agree with David's reasoning and the use of DENSE (and maybe eventually
>> SPARSE). This is terminology well established in the data world, and it
>> would lead to much easier adoption from users. VECTOR is close, but I can
>> see having to create a lot of content around "How to use it and not get in
>> trouble." (I have a lot of that content already)
>>
>>  - We don't have to explain what it is. A lot of prior art out there
>> already [1][2][3]
>>  - We're matching an established term with what users would expect. No
>> surprises.
>>  - Shorter ramp-up time for users. Cassandra is being modernized.
>>
>> The implementation is flexible, but the interface should empower our
>> users to be awesome.
>>
>> Patrick
>>
>> 1 -
>> https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>> 2 -
>> https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>> 3 -
>> https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>>
>> On Thu, May 4, 2023 at 10:25 AM David Capwell  wrote:
>>
>>> My views have changed over time on syntax and I feel type[dimension] may
>>> not be the best, so it has gone lower in my own personal ranking… this is
>>> my current preference
>>>
>>> 1) DENSE [dimension] | NON NULL [dimension]
>>> 2) VECTOR
>>> 3) type[dimension]
>>>
>>> My reasoning for this order
>>>
>>> * type[dimension] looks like syntax sugar for array, so
>>> users may assume list/array semantics, but we limit to non-null elements in
>>> a frozen array
>>> * feel VECTOR as a prefix feels out of place, but VECTOR as a direct
>>> type makes more sense… this also leads to a possible future of VECTOR
>>> which is the non-fixed length version of this type.  What makes VECTOR
>>> different from list/array?  non-null elements and is frozen.  I don’t feel
>>> that VECTOR really tells users to expect non-null or frozen semantics, as
>>> there exists different VECTOR types for those reasons (sparse vs dense)…
>>> * DENSE may be confusing for people coming from languages where this
>>> just means “sequential layout”, which is what our frozen array/list already
>>> are… but since the target user is coming from a ML background, this
>>> shouldn’t offer much confusion.  DENSE just means FROZEN in Cassandra, with
>>> NON NULL elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just
>>> acts as syntax sugar for frozen
>>>
>>>
>>> On May 4, 2023, at 4:13 AM, Brandon Williams  wrote:
>>>
>>> 1. VECTOR
>>> 2. VECTOR FLOAT[n]
>>> 3. FLOAT[N]   (Non null by default)
>>>
>>> Redundant or not, I think having the VECTOR keyword helps signify what
>>> the app is generally about and helps get buy-in from ML stakeholders.
>>>
>>> On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:
>>>
>>>
>>> Hurrah for initial agreement.
>>>
>>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>>> think VECTOR should be used to simply imply non-null, as this would be very
>>> unintuitive. More logical would be NONNULL, if this is the only condition
>>> being applied. Alternatively for arrays we could default to NONNULL and
>>> later introduce NULLABLE if we want to permit nulls.
>>>
>>> If the word vector is to be used it makes more sense to make it look
>>> like a list, so VECTOR as here the word VECTOR is clearly not
>>> redundant.
>>>
>>> So, I vote:
>>>
>>> 1) (NON NULL) FLOAT[N]
>>> 2) FLOAT[N]   (Non null by default)
>>> 3) VECTOR
>>>
>>>
>>>
>>> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>>>
>>> 
>>>
>>>
>>> Did we agree on a CQL syntax?
>>>
>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>>> I believe we are waiting for majority rule on this?
>>>
>>>
>>>
>>>
>>> Re-reading that thread, IIUC the valid choices remaining are…
>>>
>>> 1. VECTOR FLOAT[n]
>>> 2. FLOAT VECTOR[n]
>>> 3. VECTOR
>>> 4. VECTOR[n]
>>> 5. ARRAY
>>> 6. NON-NULL FROZEN
>>>
>>>
>>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>>> future of CQL will have FLOAT[n] and FROZEN, wher
