Love it. Thank you folks for coming to a decision on this. This is very
helpful for moving forward with planning for the current Python frameworks:
- Langchain.CassandraVectorStore
- Langchain.CassandraVectorRetriever
- Langchain.CassandraVectorStoreAgent
- LlamaIndex.CassandraVectorLoad
https://issues.apache.org/jira/browse/CASSANDRA-18504
> On May 5, 2023, at 12:27 PM, David Capwell wrote:
>
> Yep, fair point…. SPARSE VECTOR better maps to NON NULL MAP
>
>> On May 5, 2023, at 11:58 AM, David Capwell wrote:
>>
>>> If we ever add sparse vectors, we can assume that DENSE is t
Yep, fair point…. SPARSE VECTOR better maps to NON NULL MAP
> On May 5, 2023, at 11:58 AM, David Capwell wrote:
>
>> If we ever add sparse vectors, we can assume that DENSE is the default and
>> allow to use either DENSE, SPARSE or nothing.
>
> I have been feeling that sparse is just a fixed s
Sparse vector in ML has the semantics that elements not explicitly set are
zero. I believe most (all?) sparse vector implementations use a map under
the hood; the point is to save a lot of space when you have 10K zeros and
100 that are nonzero.
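To make the "map under the hood" point concrete, here is a minimal Python sketch. The names `densify` and `sparsify` are hypothetical helpers for illustration only and are not part of Cassandra or any ML library; the sketch assumes unset elements read as 0.0, per the semantics described above.

```python
from typing import Dict, List


def densify(sparse: Dict[int, float], dimension: int) -> List[float]:
    """Expand an index -> value map into a dense list; unset slots become 0.0."""
    if any(i < 0 or i >= dimension for i in sparse):
        raise IndexError("index out of range for the given dimension")
    return [sparse.get(i, 0.0) for i in range(dimension)]


def sparsify(dense: List[float]) -> Dict[int, float]:
    """Keep only the nonzero elements, keyed by their position."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


# Two nonzero elements out of four: the map stores 2 entries, the list stores 4.
sparse = {0: 42.0, 3: 17.0}
dense = densify(sparse, 4)
```

With 10K zeros and 100 nonzero values, the map holds 100 entries while the dense list holds 10,100, which is where the space saving comes from.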
On Fri, May 5, 2023 at 2:00 PM David Capwell wrote:
> If we ever add sparse vectors, we can assume that DENSE is the default and
> allow to use either DENSE, SPARSE or nothing.
I have been feeling that sparse is just a fixed size list with nulls… so
array… if you insert {0: 42, 3: 17} then you get an array of
[42, null, null, 17]? One negative d
My vote is:
1. VECTOR
2. DENSE VECTOR
3. type[dimension]
If we ever add sparse vectors, we can assume that DENSE is the default and
allow to use either DENSE, SPARSE or nothing.
Perhaps the dimension could be separated from the type, such as in
VECTOR[dimension] or VECTOR(dimension).
On Fri, 5
>> ...where, just to be clear, VECTOR means a frozen fixed
>> size array w/ no null values?
> Assuming this is the case
The current agreed requirements are:
1) non-null elements
2) fixed length
3) frozen
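As a rough illustration of those three requirements, the following Python sketch validates a value before accepting it. `make_vector` is a hypothetical helper, not Cassandra code, and modelling "frozen" as an immutable tuple is only an analogy assumed here for the example.

```python
from typing import Iterable, Optional, Tuple


def make_vector(values: Iterable[Optional[float]], dimension: int) -> Tuple[float, ...]:
    """Return an immutable vector, rejecting wrong lengths and null elements."""
    items = tuple(values)
    # Requirement 2: fixed length.
    if len(items) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(items)}")
    # Requirement 1: non-null elements.
    if any(v is None for v in items):
        raise ValueError("vector elements must not be null")
    # Requirement 3: frozen, modelled here as a tuple that cannot be
    # modified in place; any change means building a whole new value.
    return items


embedding = make_vector([0.1, 0.2, 0.3], dimension=3)
```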
You pointed out 3 isn’t actually required, but that would be a different
conversation to
>
> ...where, just to be clear, VECTOR means a frozen fixed
> size array w/ no null values?
>
Assuming this is the case, my vote is:
1. VECTOR
2. DENSE VECTOR
I don't really have a 3rd vote because I think that *type[dimension]* is
too ambiguous.
On Fri, 5 May 2023 at 18:32, Derek Chen-Becker
LOL, I'm holding you to that at the summit :) In all seriousness, I'm glad
to see a robust debate around it. I guess for completeness, my order of
preference is
1 - NONNULL FROZEN>
2 - NONNULL TYPE (which part of this implies frozen? The NONNULL or the
cardinality?)
3 - DENSE_VECTOR
I guess my ma
Derek, despite your preference, I would hang out with you at a party.
On Fri, May 5, 2023 at 9:44 AM Derek Chen-Becker
wrote:
> Speaking as someone who likes Erlang, maybe that's why I also like NONNULL
> FROZEN>. It's unambiguous what Cassandra is going to do with that
> type. DENSE VECTOR mean
My vote is:
1. DENSE VECTOR
2. VECTOR
3. ARRAY
On Fri, May 5, 2023 at 9:43 AM David Capwell wrote:
> Went through and created a spreadsheet of current votes… For Patrick and
> Mike, I don’t see a clear vote, so I put a ? where I “think” your
> preference is… for Mick, I only put one vote as the
Sorry, DENSE_VECTOR was pointing to the wrong row, updated score
Syntax                  Score
VECTOR                  16
DENSE VECTOR            11
type[dimension]         9
NON NULL [dimension]    6
VECTOR type[n]          5
DENSE_VECTOR            3
NON-NULL FROZEN         3
ARRAY                   0
> On May 5, 2023, at 10:01 AM, David Capwell wrote:
>
> Updated
>
> Syntax
> Jonathan Ellis
Updated
Syntax
Jonathan Ellis
David Capwell
Josh McKenzie
Caleb Rackliffe
Patrick McFadin
Brandon Williams
Mike Adamson
Benedict
Mick Semb Wever
Derek Chen-Becker
VECTOR
1
2
2
1
?
3
2
DENSE VECTOR
2
1
?
?
type[dimension]
3
3
3
1
3
2
DENSE_VECTOR
1
NON NULL [dimension]
1
On Fri, 5 May 2023 at 18:43, David Capwell wrote:
> Went through and created a spreadsheet of current votes… For Patrick and
> Mike, I don’t see a clear vote, so I put a ? where I “think” your
> preference is… for Mick, I only put one vote as the list looked like a
> summary, but you mentioned th
Speaking as someone who likes Erlang, maybe that's why I also like NONNULL
FROZEN>. It's unambiguous what Cassandra is going to do with that
type. DENSE VECTOR means I need to go read docs (and then probably
double-check in the source to be sure) to be sure what exactly is going on.
Cheers,
Derek
Went through and created a spreadsheet of current votes… For Patrick and Mike,
I don’t see a clear vote, so I put a ? where I “think” your preference is… for
Mick, I only put one vote as the list looked like a summary, but you mentioned
the first was your preference
Syntax
Jonathan Ellis
David
...where, just to be clear, VECTOR means a frozen fixed
size array w/ no null values?
On Fri, May 5, 2023 at 11:23 AM Jonathan Ellis wrote:
> +10 for not inflicting unwieldy keywords on ML users.
>
> Re Josh's summary, mostly agreed, my only objection to adding the DENSE
> keyword is that I don'
+10 for not inflicting unwieldy keywords on ML users.
Re Josh's summary, mostly agreed, my only objection to adding the DENSE
keyword is that I don't see a foreseeable future where we also support
sparse vectors, so it would end up being unnecessary extra verbosity. So
my preference would be
1.
> The hnsw index can be built just as easily from a non-frozen array.
I have 0 issues removing that limitation =)
> I am in favour of enforcing non-null on the elements of an array by default.
This is why I feel DENSE or NON NULL are the best prefix, as those both imply
elements may not be null
I hope we are willing to consider developers that use our system because if
I had to teach people to use "NON-NULL FROZEN" I'm pretty sure the
response would be:
Did you tell me to go write a distributed map-reduce job in Erlang? I
believe I did, Bob.
On Fri, May 5, 2023 at 8:05 AM Josh McKenzie
Idiomatically, to my mind, there's a question of "what space are we thinking
about this datatype in"?
- In the context of mathematics, nullability in a vector would be 0
- In the context of Cassandra, nullability tends to mean a tombstone (or
nothing)
- In the context of programming languages, i
I think we are still discussing implementation here when I'm talking about
developer experience. I want developers to adopt this quickly, easily and
be successful. Vector search is already a thing. People use it every day. A
successful outcome, in my view, is developers picking up this feature
with
>
> Then we can have the indexing apparatus only accept *frozen* for
> the HNSW case.
>
I'm inclined to agree with Benedict that the index will need to be
specifically selected by option rather than inferred based on type. As such
there is no real reason for the *frozen* requirement on the type. The
Even in the ML case, sparse can just mean zeros rather than nulls, and they
should compress similarly anyway.
If we really want null values, I'd rather leave that in collections space.
On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe
wrote:
> I actually still prefer *type[dimension]*, because I t
I actually still prefer *type[dimension]*, because I think I intuitively
read this as a primitive (meaning no null elements) array. Then we can have
the indexing apparatus only accept *frozen* for the HNSW case.
If that isn't intuitive to anyone else, I don't really have a strong
opinion...but...c
I agree with David's reasoning and the use of DENSE (and maybe eventually
SPARSE). This is terminology well established in the data world, and it
would lead to much easier adoption from users. VECTOR is close, but I can
see having to create a lot of content around "How to use it and not get in
trou
My views have changed over time on syntax and I feel type[dimension] may not be
the best, so it has gone lower in my own personal ranking… this is my current
preference
1) DENSE [dimension] | NON NULL [dimension]
2) VECTOR
3) type[dimension]
My reasoning for this order
* type[dimension] looks
1. VECTOR
2. VECTOR FLOAT[n]
3. FLOAT[N] (Non null by default)
Redundant or not, I think having the VECTOR keyword helps signify what
the app is generally about and helps get buy-in from ML stakeholders.
On Thu, May 4, 2023 at 3:45 AM Benedict wrote:
>
> Hurrah for initial agreement.
>
> For s
That's fair comment. In this case I would be happy with any of your
suggestions although I would prefer that the datatype did not support
nulls.
On Thu, 4 May 2023 at 11:55, Benedict wrote:
> I would expect that the type of index would be specified anyway?
>
> I don’t think it’s good API design
I would expect that the type of index would be specified anyway? I don’t think it’s good API design to have the field define the index you create - only to shape what is permitted. A HNSW index is very specific and should be asked for specifically, not implicitly, IMO. On 4 May 2023, at 11:47, Mike Ad
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
> think VECTOR should be used to simply imply non-null, as this would be very
> unintuitive. More logical would be NONNULL, if this is the only conditio
Hurrah for initial agreement.
For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR is
redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR
should be used to simply imply non-null, as this would be very unintuitive.
More logical would be NONNULL, if t
>
> Did we agree on a CQL syntax?
>
> I don’t believe there has been a poll on CQL syntax… my understanding
> reading all the threads is that there are ~4-5 options and none are -1ed, so
> I believe we are waiting for majority rule on this?
>
Re-reading that thread, IIUC the valid choices remaining
> Did we agree on a CQL syntax?
I don’t believe there has been a poll on CQL syntax… my understanding reading
all the threads is that there are ~4-5 options and none are -1ed, so I believe
we are waiting for majority rule on this?
> On May 3, 2023, at 1:23 PM, Jeremiah D Jordan wrote:
>
>> To be
> To be clear, I support the general agreement David and Jonathan seem to have
> reached.
+1 as well.
> On May 3, 2023, at 3:07 PM, Caleb Rackliffe wrote:
>
> To be clear, I support the general agreement David and Jonathan seem to have
> reached.
>
> On Wed, May 3, 2023 at 3:05 PM Caleb Rac
To be clear, I support the general agreement David and Jonathan seem to
have reached.
On Wed, May 3, 2023 at 3:05 PM Caleb Rackliffe
wrote:
> Did we agree on a CQL syntax?
>
> On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh <
> rahul.xavier.si...@gmail.com> wrote:
>
>> I like this approach. Th
Did we agree on a CQL syntax?
On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh <
rahul.xavier.si...@gmail.com> wrote:
> I like this approach. Thank you for those working on this vector search
> initiative.
>
> Here's the feedback from my "user" hat for someone who is looking at
> databases / ind
I like this approach. Thank you for those working on this vector search
initiative.
Here's the feedback from my "user" hat for someone who is looking at
databases / indexes for my next LLM app.
Can I take some python code and go from using an in memory vector store
like numpy or FAISS to somethin
\o/
Bring it in team. Group hug.
Now if you'll excuse me, I'm going to go build my preso on how Cassandra is
the only distributed database you can do vector search in an ACID
transaction.
Patrick
On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis wrote:
> I had a call with David. We agreed that w
I'm also in favor of having a general data type that is not tied to numeric
data types alone.
On 2023/05/02 22:27:24 Jonathan Ellis wrote:
> I had a call with David. We agreed that we want a "vector" data type with
> these properties
>
> - Fixed length
> - No nulls
> - Random access not support
I had a call with David. We agreed that we want a "vector" data type with
these properties
- Fixed length
- No nulls
- Random access not supported
Where we disagreed was on my proposal to restrict vectors to only numeric
data. David's points were that
(1) He has a use case today for a data typ
> How about it, David? Did you already make this?
I checked out the patch, fixed serialize/deserialize, added the constraints,
then added a composeForFloat(ByteBuffer), with this the impact to the POC patch
was the following
1) move away from VectorType.instance.serializer().deserialize(bb) to
I'm all for bringing more functionality to the masses sooner, but the original
idea has a very very specific use case. Do we have use cases for a general
purpose Vector/Array data structure? If so, awesome. I just wondered if
generalizing provides value, beyond being straightforward to implem
Yeah, it's a bit of a mess but mailing list yo. People reading this would
have no idea we are friends. ;) (Which we are, for anyone reading this
later!)
I must have missed the point of this already being done. How about it,
David? Did you already make this?
"FWIW, my interpretation of the votes t
But it’s so trivial it was already implemented by David in the span of ten minutes? If anything, we’re slowing progress down by refusing to do the extra types, as we’re busy arguing about it rather than delivering a feature? FWIW, my interpretation of the votes today is that we SHOULD NOT (ever) sup
I'll speak up on that one. If you look at my ranked voting, that is where
my head is. I get accused of scope creep (a lot) and looking at the initial
proposal Jonathan put on the ML it was mostly "Developers are adopting
vector search at a furious pace and I think I have a simple way of adding
supp
Could folk voting against a general purpose type (that could well be called a vector) briefly explain their reasoning? We established in the other thread that it’s technically trivial, meaning folk must think it is strictly superior to only support float rather than eg all numeric types (note: for t
A > B > C on both polls.
Having talked to several users in the community that are highly excited
about this change, this gets to what developers want to do at Cassandra
scale: store embeddings and retrieve them.
On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña
wrote:
> A > B > C
>
> I don't th
A > B > C
I don't think that ML is such a niche application that it can't have its
own CQL data type. Also, vectors are mathematical elements that have more
applications than ML.
On Tue, 2 May 2023 at 19:15, Mick Semb Wever wrote:
>
>
> On Tue, 2 May 2023 at 17:14, Jonathan Ellis wrote:
>
>> S
On Tue, 2 May 2023 at 17:14, Jonathan Ellis wrote:
> Should we add a vector type to Cassandra designed to meet the needs of
> machine learning use cases, specifically feature and embedding vectors for
> training, inference, and vector search?
>
> ML vectors are fixed-dimension (fixed-length) sequ
> B) Should we introduce a type that is general purpose, and supports all
> Cassandra types, so that this may be used to support ML (and perhaps other)
> workloads
I vote B only as well...
> On May 2, 2023, at 9:02 AM, Benedict wrote:
>
> This is not the poll I thought we would be conducting,
This is not the poll I thought we would be conducting, and I don’t really support its framing. There are two parallel questions: what the functionality should be and how they should be exposed. This poll compresses the optionality poorly. Whether or not we support a “vector” concept (or something is
My preference: A > B > C. Vectors are distinct enough from arrays that we
should not make adding the latter a prerequisite for adding the former.
On Tue, May 2, 2023 at 10:13 AM Jonathan Ellis wrote:
> Should we add a vector type to Cassandra designed to meet the needs of
> machine learning use