Re: [DISCUSS] New data type for vector search

Henrik Ingo Fri, 28 Apr 2023 15:41:17 -0700

By my superficial reading I get the impression that the main distinction is
that vectors don't need to support random access into a single
element/float. I haven't looked at what Jonathan is doing, but I assume,
and it seems Jonathan assumes or knows, that this both makes implementation
easier and allows for important optimizations. Am I following correctly
here?


(Apologies if that is what your #1 is saying, I read yours as something
about secondary or maybe clustered indexes?)

Agree with #3 obviously.

#2... Vectors actually *could* support ordered (n-dimensional) indexes,
since they are vectors. But in practice it seems even asking for a simple
3D index is too much and too niche for anything other than PostGIS.
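
To make the distinction in the first paragraph concrete, here is a minimal
sketch in Python, purely for illustration. It is not Cassandra code and not
Jonathan's design; the class name FloatVector, its methods, and the float32
wire layout are all assumptions made up for this example. It shows a
fixed-dimension, non-null float vector that is validated once on write,
exposes no per-element access, and is only ever used as a whole.

```python
import math
import struct
from typing import Sequence


class FloatVector:
    """Hypothetical fixed-dimension float vector, handled as one opaque value."""

    __slots__ = ("_values",)

    def __init__(self, values: Sequence[float], dimension: int) -> None:
        # Write-path validation: the size is fixed and elements may not be null.
        if len(values) != dimension:
            raise ValueError(f"expected {dimension} elements, got {len(values)}")
        if any(v is None for v in values):
            raise ValueError("vector elements may not be null")
        self._values = tuple(float(v) for v in values)

    # Deliberately no __getitem__: callers cannot ask for the Nth element.

    def serialize(self) -> bytes:
        # Assumed layout: with a known dimension and no nulls, the value can be
        # stored as dimension * 4 bytes of big-endian float32, with no
        # per-element length or null markers.
        return struct.pack(f">{len(self._values)}f", *self._values)

    def cosine_similarity(self, other: "FloatVector") -> float:
        # The kind of whole-vector operation a similarity search relies on.
        if len(self._values) != len(other._values):
            raise ValueError("dimension mismatch")
        dot = sum(a * b for a, b in zip(self._values, other._values))
        norm_self = math.sqrt(sum(a * a for a in self._values))
        norm_other = math.sqrt(sum(b * b for b in other._values))
        if norm_self == 0.0 or norm_other == 0.0:
            raise ValueError("cosine similarity is undefined for a zero vector")
        return dot / (norm_self * norm_other)


if __name__ == "__main__":
    a = FloatVector([1.0, 0.0, 0.0], dimension=3)
    b = FloatVector([0.0, 1.0, 0.0], dimension=3)
    print(a.cosine_similarity(b))
    print(len(a.serialize()))
```

Whether a real implementation gets its optimizations from a layout like this
is exactly the part I am guessing at; the sketch only shows that dropping
element access and nulls leaves a very small surface to implement.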

henrik

On Fri, Apr 28, 2023 at 8:50 PM Benedict <bened...@apache.org> wrote:

> I and others have claimed that an array concept will work, since it is
> isomorphic with a vector. I have seen the following counterclaims:
>
> 1. Vectors don’t need to support index lookups
> 2. Vectors don’t need to support ordered indexes
> 3. Vectors don’t need to support other types besides float
>
> None of these say that *vectors are not arrays*. At most these say “ANN
> indexes should only support float types” which is different, and not
> something I would dispute.
>
> If the claim is "there is no concept of arrays that is compatible with
> vector search" then let’s focus on that, because that is probably the
> initial source of the disconnect.
>
>
>
>
> On 28 Apr 2023, at 18:13, Henrik Ingo <henrik.i...@datastax.com> wrote:
>
> 
> Benedict, I don't quite see why that matters? The argument is merely that
> this kind of vector, for this use case, a) is different from arrays, and b)
> arrays apparently don't serve the use case well enough (or at all).
>
> Now, if it follows from the above that a vector type cannot be a first
> class Cassandra type... that is of course a possible argument.
>
> But suggesting that Jonathan should work on implementing general purpose
> arrays seems to fall outside the scope of this discussion, since the result
> of such work wouldn't even fill the need Jonathan is targeting here. I
> could also ask Jonathan to work on a JSONB data type, and it similarly
> would not be an interesting proposal to Jonathan, as it wouldn't fill the
> need for the specific use case he is targeting.
>
>
> But back to the main question... Why wouldn't a "vector for floats" type
> be general purpose enough for core, rather than being delegated to some
> plugin? Machine Learning is a broad field in itself, with dozens of
> algorithms you could choose to use to build an AI model. And AI can be used
> in pretty much every industry vertical. If anything, I would claim DECIMAL
> is much more of an industry-specific special-case type than these ML
> vectors would be.
>
>
>
> Back to Jonathan:
> > So in order of what makes sense to me:
> > 1. Add a vector type for just floats; consider adding bytes later if
> > demand materializes. This gives us 99% of the value and limits the scope so
> > we can deliver quickly.
> > 2. Add a vector type for floats or bytes. This gives us another 1% of
> > value in exchange for an extra 20% or so of effort.
>
> Is it possible to implement 1 in a way that makes 2 possible in a future
> version?
>
> henrik
>
> On Fri, Apr 28, 2023 at 7:33 PM Benedict <bened...@apache.org> wrote:
>
>> pgvector is a plug-in. If you were proposing a plug-in you could ignore
>> these considerations.
>>
>> On 28 Apr 2023, at 16:58, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> 
>> I'm proposing a vector data type for ML use cases.  It's not the same
>> thing as an array or a list and it's not supposed to be.
>>
>> While it's true that it would be possible to build a vector type on top
>> of an array type, it's not necessary to do it that way, and given the lack
>> of interest in an array type for its own sake I don't see why we would want
>> to make that a requirement.
>>
>> It's relevant that pgvector, which among the systems offering vector
>> search is based on the most similar system to Cassandra in terms of its
>> query language, adds a vector data type that only supports floats *even
>> though postgresql already has an array data type* because the semantics are
>> different.  Random access doesn't make sense, string and collection and
>> other datatypes don't make sense, typical ordered indexes don't make sense,
>> etc.  It's just a different beast from arrays, for a different use case.
>>
>> On Fri, Apr 28, 2023 at 10:40 AM Benedict <bened...@apache.org> wrote:
>>
>>> But you’re proposing introducing a general purpose type - this isn’t an
>>> ML plug-in, it’s modifying the core language in a manner that makes
>>> targeting your workload easier. Which is fine, but that means you have to
>>> consider its impact on the general language, not just your target use case.
>>>
>>> On 28 Apr 2023, at 16:29, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>> 
>>> That's exactly right.
>>>
>>> In particular it makes no sense at all from an ML perspective to have
>>> vector types of anything other than numerics.  And as I mentioned in the
>>> POC thread (but I did not mention here), float is overwhelmingly the most
>>> frequently used vector type, to the point that Pinecone (by far the most
>>> popular vector search engine) ONLY supports that type.
>>>
>>> Lucene and Elastic also add support for vectors of bytes (8-bit ints),
>>> which are useful for optimizing models that you have already built with
>>> floats, but we have no reasonable path towards supporting indexing and
>>> searches against any other vector type.
>>>
>>> So in order of what makes sense to me:
>>>
>>> 1. Add a vector type for just floats; consider adding bytes later if
>>> demand materializes. This gives us 99% of the value and limits the scope so
>>> we can deliver quickly.
>>>
>>> 2. Add a vector type for floats or bytes. This gives us another 1% of
>>> value in exchange for an extra 20% or so of effort.
>>>
>>> 3. Add a vector type for all numeric primitives, but you can only index
>>> floats and bytes.  I think this is confusing to users and a bad idea.
>>>
>>> 4. Add a vector type that composes with all Cassandra types.  I can't
>>> see a reason to do this, nobody wants it, and we killed the most similar
>>> proposal in the past as wontfix.
>>>
>>> On Thu, Apr 27, 2023 at 7:49 PM Josh McKenzie <jmcken...@apache.org>
>>> wrote:
>>>
>>>> From a machine learning perspective, vectors are a well-known concept:
>>>> effectively immutable, fixed-length, n-dimensional values that are
>>>> later used either as part of a model or in conjunction with a model
>>>> after the fact.
>>>>
>>>> While we could have this be non-frozen and not call it a vector, I'd be
>>>> inclined to still make the argument for a layer of syntactic sugar on top
>>>> that met ML users where they were with concepts they understood rather than
>>>> forcing them through the cognitive lift of figuring out the Cassandra
>>>> specific contortions to replicate something that's ubiquitous in their
>>>> space. We did the same "Cassandra-first" approach with our JSON support and
>>>> that didn't do us any favors in terms of adoption and usage as far as I
>>>> know.
>>>>
>>>> So is the goal here to provide something specific and idiomatic for the
>>>> ML community or is the goal to make a primitive that's C*-centric that then
>>>> another layer can write to? I personally argue for the former; I don't see
>>>> this specific data type going away any time soon.
>>>>
>>>> On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>>>>
>>>> but as you point out it has the problem of allowing nulls.
>>>>
>>>>
>>>> If nulls are not allowed for the elements, then either we need a) a
>>>> new type, or b) to add some way to say elements may not be null. As much
>>>> as I do like b, I am leaning towards a new type for this use case.
>>>>
>>>> So, to flesh out the type requirements I have seen so far
>>>>
>>>> 1) represents a fixed size array of element type
>>>> * on write path we will need to validate this
>>>> 2) element may not be null
>>>> * on write path we will need to validate this
>>>> 3) “frozen” (is this really a requirement for the type or is this
>>>> just simpler for the ANN work?  I feel that this shouldn’t be a 
>>>> requirement)
>>>> 4) works for all types (my requirement; original proposal is float
>>>> only, but could logically expand to primitive types)
>>>>
>>>> Anything else?
>>>>
>>>> The key thing about a vector is that unlike lists or tuples you really
>>>> don't care about individual elements, you care about doing vector and
>>>> matrix multiplications with the thing as a unit.
>>>>
>>>>
>>>> That may be true for this use case, but “should” this be true for the
>>>> type itself?  I feel like no… if a user wants the Nth element of a vector
>>>> why would we block them?  I am not saying the first patch, or even 5.0,
>>>> adds support for index access; I am just pushing back to say that the
>>>> type should not block this.
>>>>
>>>> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT
>>>> VECTOR[N].)
>>>>
>>>>
>>>> Now that nulls are not allowed, I have mixed feelings about FLOAT[N]. I
>>>> prefer this syntax, but that limitation may not be desired for all use
>>>> cases… we could always add LIST<TYPE, N> and ARRAY<TYPE, N> later
>>>> to address that case.
>>>>
>>>> In terms of syntax I have seen, here is my ordered preference:
>>>>
>>>> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
>>>> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote
>>>> this semantic…. Could even be NON NULL TYPE[size]
>>>>
>>>> On Apr 27, 2023, at 9:00 AM, Benedict <bened...@apache.org> wrote:
>>>>
>>>>
>>>> That’s a bounded ring buffer, not a fixed length array.
>>>>
>>>> This definitely isn’t a tuple because the types are all the same, which
>>>> is pretty crucial for matrix operations. Matrix libraries generally work on
>>>> arrays of known dimensionality, or sparse representations.
>>>>
>>>> Whether we draw any semantic link between the frozen list and whatever
>>>> we do here, it is fundamentally a frozen list with a restriction on its
>>>> size. What we’re defining here are “statically” sized arrays, whereas a
>>>> frozen list is essentially a dynamically sized array.
>>>>
>>>> I do not think vector is a good name because vector is used in some
>>>> other popular languages to mean a (dynamic) list, which is confusing when
>>>> we also have a list concept.
>>>>
>>>> I’m fine with just using the FLOAT[N] syntax, and drawing no direct
>>>> link with list. Though it is a bit strange that this particular type
>>>> declaration looks so different to other collection types.
>>>>
>>>> On 27 Apr 2023, at 16:48, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>> 
>>>>
>>>>
>>>> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis <jbel...@gmail.com>
>>>> wrote:
>>>>
>>>> It's been a while, so I may be missing something, but do we already
>>>> have fixed-size lists?  If not, I don't see why we'd try to make this fit
>>>> into a List-shaped problem.
>>>>
>>>>
>>>> We do not. The proposal got closed as wont-fix
>>>> https://issues.apache.org/jira/browse/CASSANDRA-9110
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>>>
>>>
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>>
>>
>
> --
>
> Henrik Ingo
>
> c. +358 40 569 7354
>
> w. www.datastax.com
>
>
> <https://www.facebook.com/datastax>  <https://twitter.com/datastax>
> <https://www.linkedin.com/company/datastax/>  <https://github.com/datastax/>
>
>

-- 

Henrik Ingo

c. +358 40 569 7354

w. www.datastax.com

<https://www.facebook.com/datastax>  <https://twitter.com/datastax>
<https://www.linkedin.com/company/datastax/>  <https://github.com/datastax/>
