Re: [DISCUSS] New data type for vector search

David Capwell Mon, 01 May 2023 09:58:22 -0700

> In particular it makes no sense at all from an ML perspective to have vector 
> types of anything other than numerics


Back to what Benedict was saying: if the proposal were an ML plugin, then this 
limitation would make sense, but that is not the proposal at hand.  If you wish 
to change the scope to add pluggable types, then this type of plugin could 
follow whatever rules it desires.

> but we have no reasonable path towards supporting indexing and searches 
> against any other vector type.

The type system is different from the index system; the index system is allowed 
to limit the domain of possible types to a blessed set… so arguing that a new 
type should carry limitations that only exist for indexing doesn’t make much 
sense to me, as those are index-specific limitations…
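
To make that separation concrete, here is a minimal sketch of what I mean. 
Every name in it (ElementType, VectorTypeSketch, AnnIndexSketch) is 
hypothetical and not an existing Cassandra class; it only assumes the index 
gets a chance to inspect the column type at index creation.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names throughout; none of these are real Cassandra classes.
enum ElementType { FLOAT, INT, TEXT }

// The type itself places no restriction on its element type.
final class VectorTypeSketch {
    final ElementType elementType;
    final int dimension;

    VectorTypeSketch(ElementType elementType, int dimension) {
        if (dimension <= 0)
            throw new IllegalArgumentException("dimension must be positive");
        this.elementType = elementType;
        this.dimension = dimension;
    }
}

// The index, not the type, decides which element types it can serve.
final class AnnIndexSketch {
    private static final Set<ElementType> SUPPORTED = EnumSet.of(ElementType.FLOAT);

    // Intended to run at index creation; rejects columns the index cannot handle.
    static void validateColumn(VectorTypeSketch type) {
        if (!SUPPORTED.contains(type.elementType))
            throw new IllegalArgumentException(
                "ANN index only supports float elements, got " + type.elementType);
    }
}
```

With this shape, a VECTOR<TEXT, 42> column is a valid type, and only the 
attempt to put an ANN index on it is rejected.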

> 4. Add a vector type that composes with all Cassandra types.  I can't see a 
> reason to do this, nobody wants it, and we killed the most similar proposal 
> in the past as wontfix.

We don’t have a constraint system at the moment, so such constraints are 
normally implemented in the application… so I would not argue that nobody would 
want “VECTOR<TEXT, 42>”…

> Benedict, I don't quite see why that matters? 

The global public API is different from a plugin system… Right now we allow 
pluggable SSTables and Indexes… both could add any limitation they desire, as 
they are plugins… but when we are working with the top-level systems we do need 
to worry about compatibility…

> The argument is merely that this kind of vector, for this use case, a) is 
> different from arrays, and b) arrays apparently don't serve the use case well 
> enough (or at all).

I have listed every requirement so far, and they are all constraints on arrays… 
I am not arguing that this new type should “extend” from the array type in Java, 
or that CQL has any ability to convert between the two types… but every single 
requirement given so far is a constraint against arrays…

> But suggesting that Jonathan should work on implementing general purpose 
> arrays seems to fall outside the scope of this discussion, since the result 
> of such work wouldn't even fill the need Jonathan is targeting for here. 

In every comment I have made so far, I have argued that the v1 work doesn’t need 
to do some things, but that the limitations proposed so far are not real 
requirements; there is a big difference between what “could be allowed” and 
what is implemented day one… I am pushing back on what “could be allowed”; so 
far every justification has been that it slows down the ANN work…

Simple examples of this already exist in C* (every example could be enhanced 
logically, we just have yet to put in the work):

* updating an element of a list is only allowed for multi-cell
* appending to a list is only allowed for multi-cell
* etc.

By saying that the type "shall not support", you actively block future work and 
future possibilities...

> At most these say “ANN indexes should only support float types” which is 
> different, and not something I would dispute.


Agree here; the type limitation is a limitation for ANN, so it should live in 
ANN and not leak outside of it.

> By my superficial reading I get the impression that the main distinction is 
> that vectors don't need to support random access into a single element/float

They don’t “need” it for this single use case, but that doesn’t mean that no 
user will ever wish to write “SELECT my_vector[42]”.  Do we “need” to add such 
support day 1?  No.  Does Jonathan have to implement this once the ANN work is 
merged?  No.  I just want to be clear that I am pushing back on the idea that 
the type should never allow it, as every justification has been specific to a 
single use case, and the broader type having such capabilities does not 
actually impact the ANN work.

> I haven't looked at what Jonathan is doing, but I assume, and it seems 
> Jonathan assumes or knows that this makes implementation both easier and 
> allows for important optimizations

His patch has a function that goes from BB -> float[].  The fact that a vector 
could support non-float types does not actually impact that work, as you would 
still do BB -> float[]; you just need the index to validate that the type is 
vector<float> at the index create step (which you have to do already 
regardless)…
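
As a rough illustration of that BB -> float[] step, here is my own sketch; it 
is not the code from the patch. The class name FloatVectorDecoder is 
hypothetical, and the layout (4-byte big-endian floats packed back to back) is 
an assumption about the serialization.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not taken from the patch under discussion.
final class FloatVectorDecoder {
    // Assumes the value holds exactly 'dimension' 4-byte big-endian floats,
    // packed back to back; the real serialization may differ.
    static float[] decode(ByteBuffer value, int dimension) {
        if (value.remaining() != dimension * Float.BYTES)
            throw new IllegalArgumentException(
                "expected " + dimension + " floats, got " + value.remaining() + " bytes");
        float[] out = new float[dimension];
        // duplicate() keeps the caller's position untouched and reads big-endian.
        value.duplicate().asFloatBuffer().get(out);
        return out;
    }
}
```

Nothing here depends on whether the type could also hold non-float elements; 
the decoder is only ever reached for columns the index has already accepted.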

The same is true for multi-cell support: you “could” have a function that takes 
a List<BB> -> float[]… if people really feel that we shouldn’t allow 
multi-cell, that’s fine by me…  The biggest limitation I see for this is that 
SAI works at the Cell level, but from talking with Caleb that is a short-term 
limitation and something he would like to improve (treating frozen vs 
non-frozen differently isn’t desired).
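
A sketch of what such a List<BB> -> float[] function could look like, under 
the assumption that each cell holds exactly one 4-byte float and that the 
cells arrive in element order. MultiCellVectorDecoder is a hypothetical name, 
and this says nothing about how SAI would actually gather the cells.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical helper; assumes one 4-byte float per cell, in element order.
final class MultiCellVectorDecoder {
    static float[] decode(List<ByteBuffer> cells, int dimension) {
        if (cells.size() != dimension)
            throw new IllegalArgumentException(
                "expected " + dimension + " cells, got " + cells.size());
        float[] out = new float[dimension];
        for (int i = 0; i < dimension; i++) {
            ByteBuffer cell = cells.get(i);
            if (cell == null || cell.remaining() != Float.BYTES)
                throw new IllegalArgumentException("cell " + i + " is not a single float");
            // Absolute read, so the cell's position is left unchanged.
            out[i] = cell.getFloat(cell.position());
        }
        return out;
    }
}
```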



To repeat the list of requirements I have summarized so far, this is everything 
I have seen:

1) represents a fixed size array of element type
2) element may not be null
3) works for all types

I removed the frozen one, since as far as I can tell from the ANN work this 
isn’t a requirement for ANN; it just needs a way to create the float[] (which 
we can still do with multi-cell support).
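
For what it is worth, requirements 1) and 2) look like a small write-path 
check that does not care about the element type, which is why 3) costs little. 
A minimal sketch, with a hypothetical class name and with the elements assumed 
to be already deserialized into a List:

```java
import java.util.List;

// Hypothetical write-path check; works for any element type T.
final class VectorWriteValidator {
    static <T> void validate(List<T> elements, int dimension) {
        // Requirement 1: fixed size.
        if (elements.size() != dimension)
            throw new IllegalArgumentException(
                "expected " + dimension + " elements, got " + elements.size());
        // Requirement 2: no null elements.
        for (int i = 0; i < elements.size(); i++)
            if (elements.get(i) == null)
                throw new IllegalArgumentException("element " + i + " is null");
    }
}
```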

> On Apr 28, 2023, at 3:40 PM, Henrik Ingo <henrik.i...@datastax.com> wrote:
> 
> By my superficial reading I get the impression that the main distinction is 
> that vectors don't need to support random access into a single element/float. 
> I haven't looked at what Jonathan is doing, but I assume, and it seems 
> Jonathan assumes or knows that this makes implementation both easier and 
> allows for important optimizations. Am I following correctly here?
> 
> (Apologies if that is what your #1 is saying, I read yours as something about 
> secondary or maybe clustered indexes?)
> 
> Agree with #3 obviously.
> 
> #2... Vectors actually *could* support ordered (n-dimensional) indexes, since 
> they are vectors. But in practice it seems even asking for a simple 3D index 
> is too much and too niche for anything else than Postgis.
> 
> henrik
> 
> On Fri, Apr 28, 2023 at 8:50 PM Benedict <bened...@apache.org> wrote:
> I and others have claimed that an array concept will work, since it is 
> isomorphic with a vector. I have seen the following counterclaims:
> 
> 1. Vectors don’t need to support index lookups
> 2. Vectors don’t need to support ordered indexes
> 3. Vectors don’t need to support other types besides float
> 
> None of these say that vectors are not arrays. At most these say “ANN indexes 
> should only support float types” which is different, and not something I 
> would dispute.
> 
> If the claim is "there is no concept of arrays that is compatible with vector 
> search" then let’s focus on that, because that is probably the initial source 
> of the disconnect.
> 
> 
> 
> 
>> On 28 Apr 2023, at 18:13, Henrik Ingo <henrik.i...@datastax.com> wrote:
>> 
>> Benedict, I don't quite see why that matters? The argument is merely that 
>> this kind of vector, for this use case, a) is different from arrays, and b) 
>> arrays apparently don't serve the use case well enough (or at all).
>> 
>> Now, if from the above it follows a discussion that a vector type cannot be 
>> a first class Cassandra type... that is of course a possible argument. 
>> 
>> But suggesting that Jonathan should work on implementing general purpose 
>> arrays seems to fall outside the scope of this discussion, since the result 
>> of such work wouldn't even fill the need Jonathan is targeting for here. I 
>> could also ask Jonathan to work on a JSONB data type, and it similarly would 
>> not be an interesting proposal to Jonathan, as it wouldn't fill the need for 
>> the specific use case he is targeting.
>> 
>> 
>> But back to the main question... Why wouldn't a "vector for floats" type be 
>> general purpose enough that it should be delegated to some plugin? Machine 
>> Learning is a broad field in itself, with dozens of algorithms you could 
>> choose to use to build an AI model. And AI can be used in pretty much every 
>> industry vertical. If anything, I would claim DECIMAL is much more an 
>> industry specific special case type than these ML vectors would be. 
>> 
>> 
>> 
>> Back to Jonathan:
>> >So in order of what makes sense to me:
>> > 1. Add a vector type for just floats; consider adding bytes later if 
>> > demand materializes. This gives us 99% of the value and limits the scope 
>> > so we can deliver quickly.
>> > 2. Add a vector type for floats or bytes. This gives us another 1% of 
>> > value in exchange for an extra 20% or so of effort.
>> 
>> Is it possible to implement 1 in a way that makes 2 possible in a future 
>> version?
>> 
>> henrik
>> 
>> On Fri, Apr 28, 2023 at 7:33 PM Benedict <bened...@apache.org> wrote:
>> pgvector is a plug-in. If you were proposing a plug-in you could ignore 
>> these considerations.
>> 
>>> On 28 Apr 2023, at 16:58, Jonathan Ellis <jbel...@gmail.com> wrote:
>>> 
>>> I'm proposing a vector data type for ML use cases.  It's not the same 
>>> thing as an array or a list and it's not supposed to be.
>>> 
>>> While it's true that it would be possible to build a vector type on top of 
>>> an array type, it's not necessary to do it that way, and given the lack of 
>>> interest in an array type for its own sake I don't see why we would want to 
>>> make that a requirement.
>>> 
>>> It's relevant that pgvector, which among the systems offering vector search 
>>> is based on the most similar system to Cassandra in terms of its query 
>>> language, adds a vector data type that only supports floats *even though 
>>> postgresql already has an array data type* because the semantics are 
>>> different.  Random access doesn't make sense, string and collection and 
>>> other datatypes don't make sense, typical ordered indexes don't make sense, 
>>> etc.  It's just a different beast from arrays, for a different use case.
>>> 
>>> On Fri, Apr 28, 2023 at 10:40 AM Benedict <bened...@apache.org> wrote:
>>> But you’re proposing introducing a general purpose type - this isn’t an ML 
>>> plug-in, it’s modifying the core language in a manner that makes targeting 
>>> your workload easier. Which is fine, but that means you have to consider 
>>> its impact on the general language, not just your target use case.
>>> 
>>>> On 28 Apr 2023, at 16:29, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>> 
>>>> That's exactly right.
>>>> 
>>>> In particular it makes no sense at all from an ML perspective to have 
>>>> vector types of anything other than numerics.  And as I mentioned in the 
>>>> POC thread (but I did not mention here), float is overwhelmingly the most 
>>>> frequently used vector type, to the point that Pinecone (by far the most 
>>>> popular vector search engine) ONLY supports that type.
>>>> 
>>>> Lucene and Elastic also add support for vectors of bytes (8-bit ints), 
>>>> which are useful for optimizing models that you have already built with 
>>>> floats, but we have no reasonable path towards supporting indexing and 
>>>> searches against any other vector type.
>>>> 
>>>> So in order of what makes sense to me:
>>>> 
>>>> 1. Add a vector type for just floats; consider adding bytes later if 
>>>> demand materializes. This gives us 99% of the value and limits the scope 
>>>> so we can deliver quickly.
>>>> 
>>>> 2. Add a vector type for floats or bytes. This gives us another 1% of 
>>>> value in exchange for an extra 20% or so of effort.
>>>> 
>>>> 3. Add a vector type for all numeric primitives, but you can only index 
>>>> floats and bytes.  I think this is confusing to users and a bad idea.
>>>> 
>>>> 4. Add a vector type that composes with all Cassandra types.  I can't see 
>>>> a reason to do this, nobody wants it, and we killed the most similar 
>>>> proposal in the past as wontfix.
>>>> 
>>>> On Thu, Apr 27, 2023 at 7:49 PM Josh McKenzie <jmcken...@apache.org> wrote:
>>>> From a machine learning perspective, vectors are a well-known concept that 
>>>> are effectively immutable fixed-length n-dimensional values that are then 
>>>> later used either as part of a model or in conjunction with a model after 
>>>> the fact.
>>>> 
>>>> While we could have this be non-frozen and not call it a vector, I'd be 
>>>> inclined to still make the argument for a layer of syntactic sugar on top 
>>>> that met ML users where they were with concepts they understood rather 
>>>> than forcing them through the cognitive lift of figuring out the Cassandra 
>>>> specific contortions to replicate something that's ubiquitous in their 
>>>> space. We did the same "Cassandra-first" approach with our JSON support 
>>>> and that didn't do us any favors in terms of adoption and usage as far as 
>>>> I know.
>>>> 
>>>> So is the goal here to provide something specific and idiomatic for the ML 
>>>> community or is the goal to make a primitive that's C*-centric that then 
>>>> another layer can write to? I personally argue for the former; I don't see 
>>>> this specific data type going away any time soon.
>>>> 
>>>> On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>>>>>> but as you point out it has the problem of allowing nulls.
>>>>> 
>>>>> If nulls are not allowed for the elements, then either we need  a) a new 
>>>>> type, or b) add some way to say elements may not be null…. As much as I 
>>>>> do like b, I am leaning towards new type for this use case.
>>>>> 
>>>>> So, to flesh out the type requirements I have seen so far
>>>>> 
>>>>> 1) represents a fixed size array of element type
>>>>> * on write path we will need to validate this
>>>>> 2) element may not be null
>>>>> * on write path we will need to validate this
>>>>> 3) “frozen” (is this really a requirement for the type or is this just 
>>>>> simpler for the ANN work?  I feel that this shouldn’t be a requirement)
>>>>> 4) works for all types (my requirement; original proposal is float only, 
>>>>> but could logically expand to primitive types)
>>>>> 
>>>>> Anything else?
>>>>> 
>>>>>> The key thing about a vector is that unlike lists or tuples you really 
>>>>>> don't care about individual elements, you care about doing vector and 
>>>>>> matrix multiplications with the thing as a unit. 
>>>>> 
>>>>> That maybe true for this use case, but “should” this be true for the type 
>>>>> itself?  I feel like no… if a user wants the Nth element of a vector why 
>>>>> would we block them?  I am not saying the first patch, or even 5.0 adds 
>>>>> support for index access, I am just trying to push back saying that the 
>>>>> type should not block this.
>>>>> 
>>>>>> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT 
>>>>>> VECTOR[N].)
>>>>> 
>>>>> Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I 
>>>>> prefer this syntax but that limitation may not be desired for all use 
>>>>> cases… we could always add LIST<TYPE, N> and ARRAY<TYPE, N> later to 
>>>>> address that case.
>>>>> 
>>>>> In terms of syntax I have seen, here is my ordered preference:
>>>>> 
>>>>> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
>>>>> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this 
>>>>> semantic…. Could even be NON NULL TYPE[size]
>>>>> 
>>>>>> On Apr 27, 2023, at 9:00 AM, Benedict <bened...@apache.org> wrote:
>>>>>> 
>>>>>> 
>>>>>> That’s a bounded ring buffer, not a fixed length array.
>>>>>> 
>>>>>> This definitely isn’t a tuple because the types are all the same, which 
>>>>>> is pretty crucial for matrix operations. Matrix libraries generally work 
>>>>>> on arrays of known dimensionality, or sparse representations.
>>>>>> 
>>>>>> Whether we draw any semantic link between the frozen list and whatever 
>>>>>> we do here, it is fundamentally a frozen list with a restriction on its 
>>>>>> size. What we’re defining here are “statically” sized arrays, whereas a 
>>>>>> frozen list is essentially a dynamically sized array.
>>>>>> 
>>>>>> I do not think vector is a good name because vector is used in some 
>>>>>> other popular languages to mean a (dynamic) list, which is confusing 
>>>>>> when we also have a list concept.
>>>>>> 
>>>>>> I’m fine with just using the FLOAT[N] syntax, and drawing no direct link 
>>>>>> with list. Though it is a bit strange that this particular type 
>>>>>> declaration looks so different to other collection types.
>>>>>> 
>>>>>>> On 27 Apr 2023, at 16:48, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis <jbel...@gmail.com> 
>>>>>>> wrote:
>>>>>>> It's been a while, so I may be missing something, but do we already 
>>>>>>> have fixed-size lists?  If not, I don't see why we'd try to make this 
>>>>>>> fit into a List-shaped problem.
>>>>>>> 
>>>>>>> We do not. The proposal got closed as wont-fix  
>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-9110
>>>>>>> 
>>>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Jonathan Ellis
>>>> co-founder, http://www.datastax.com
>>>> @spyced
>>> 
>>> 
>>> -- 
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>> 
>> 
>> -- 
>> Henrik Ingo
>> c. +358 40 569 7354 
>> w. www.datastax.com
>>       
> 
> 
> -- 
> Henrik Ingo
> c. +358 40 569 7354 
> w. www.datastax.com
>
