From a superficial reading, my impression is that the main distinction is that vectors don't need to support random access to a single element (float). I haven't looked at what Jonathan is doing, but I assume, and it seems Jonathan assumes or knows, that this both makes the implementation easier and allows for important optimizations. Am I following correctly here?
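To check my own understanding, here is a minimal Python sketch of that reading. It is purely my illustration, not Jonathan's implementation or anything that exists in Cassandra: the helper names (`encode_vector`, `decode_vector`, `dot`) and the big-endian float32 layout are assumptions made up for the example. The vector is validated once on write (fixed length, no nulls), stored as one opaque blob, and only ever read back and consumed as a whole.

```python
import struct


def encode_vector(values, dimension):
    """Validate and pack a fixed-length float vector into one opaque blob.

    Hypothetical helper: the layout (big-endian float32, no per-element
    framing) is an assumption for illustration only.
    """
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    if any(v is None for v in values):
        raise ValueError("vector elements may not be null")
    return struct.pack(f">{dimension}f", *values)


def decode_vector(blob, dimension):
    """Unpack the whole vector at once; there is no single-element accessor."""
    return struct.unpack(f">{dimension}f", blob)


def dot(a, b):
    """Consume two vectors as units, the way a similarity search would."""
    return sum(x * y for x, y in zip(a, b))


blob = encode_vector([1.0, 2.0, 3.0], 3)
vec = decode_vector(blob, 3)
print(len(blob), dot(vec, vec))  # prints: 12 14.0
```

Because nothing ever addresses element N on its own, the stored form needs no per-element offsets or null markers, which is where I would expect the simplification to come from.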
(Apologies if that is what your #1 is saying; I read yours as being about secondary or maybe clustered indexes.) I agree with #3, obviously. As for #2: vectors actually *could* support ordered (n-dimensional) indexes, since they are vectors. But in practice it seems that even a simple 3D index is too much to ask, and too niche, for anything other than PostGIS.

henrik

On Fri, Apr 28, 2023 at 8:50 PM Benedict <bened...@apache.org> wrote:

> I and others have claimed that an array concept will work, since it is
> isomorphic with a vector. I have seen the following counterclaims:
>
> 1. Vectors don’t need to support index lookups
> 2. Vectors don’t need to support ordered indexes
> 3. Vectors don’t need to support other types besides float
>
> None of these say that *vectors are not arrays*. At most these say “ANN
> indexes should only support float types” which is different, and not
> something I would dispute.
>
> If the claim is "there is no concept of arrays that is compatible with
> vector search" then let’s focus on that, because that is probably the
> initial source of the disconnect.
>
> On 28 Apr 2023, at 18:13, Henrik Ingo <henrik.i...@datastax.com> wrote:
>
> Benedict, I don't quite see why that matters? The argument is merely that
> this kind of vector, for this use case, a) is different from arrays, and b)
> arrays apparently don't serve the use case well enough (or at all).
>
> Now, if from the above it follows a discussion that a vector type cannot
> be a first class Cassandra type... that is of course a possible argument.
>
> But suggesting that Jonathan should work on implementing general purpose
> arrays seems to fall outside the scope of this discussion, since the result
> of such work wouldn't even fill the need Jonathan is targeting for here. I
> could also ask Jonathan to work on a JSONB data type, and it similarly
> would not be an interesting proposal to Jonathan, as it wouldn't fill the
> need for the specific use case he is targeting.
>
> But back to the main question... Why wouldn't a "vector for floats" type
> be general purpose enough that it should be delegated to some plugin?
> Machine Learning is a broad field in itself, with dozens of algorithms you
> could choose to use to build an AI model. And AI can be used in pretty much
> every industry vertical. If anything, I would claim DECIMAL is much more an
> industry specific special case type than these ML vectors would be.
>
> Back to Jonathan:
>
> >So in order of what makes sense to me:
>
> 1. Add a vector type for just floats; consider adding bytes later if
> demand materializes. This gives us 99% of the value and limits the scope so
> we can deliver quickly.
>
> 2. Add a vector type for floats or bytes. This gives us another 1% of
> value in exchange for an extra 20% or so of effort.
>
> Is it possible to implement 1 in a way that makes 2 possible in a future
> version?
>
> henrik
>
> On Fri, Apr 28, 2023 at 7:33 PM Benedict <bened...@apache.org> wrote:
>
>> pgvector is a plug-in. If you were proposing a plug-in you could ignore
>> these considerations.
>>
>> On 28 Apr 2023, at 16:58, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> I'm proposing a vector data type for ML use cases. It's not the same
>> thing as an array or a list and it's not supposed to be.
>>
>> While it's true that it would be possible to build a vector type on top
>> of an array type, it's not necessary to do it that way, and given the lack
>> of interest in an array type for its own sake I don't see why we would want
>> to make that a requirement.
>>
>> It's relevant that pgvector, which among the systems offering vector
>> search is based on the most similar system to Cassandra in terms of its
>> query language, adds a vector data type that only supports floats *even
>> though postgresql already has an array data type* because the semantics are
>> different.
>> Random access doesn't make sense, string and collection and
>> other datatypes don't make sense, typical ordered indexes don't make sense,
>> etc. It's just a different beast from arrays, for a different use case.
>>
>> On Fri, Apr 28, 2023 at 10:40 AM Benedict <bened...@apache.org> wrote:
>>
>>> But you’re proposing introducing a general purpose type - this isn’t an
>>> ML plug-in, it’s modifying the core language in a manner that makes
>>> targeting your workload easier. Which is fine, but that means you have to
>>> consider its impact on the general language, not just your target use case.
>>>
>>> On 28 Apr 2023, at 16:29, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>> That's exactly right.
>>>
>>> In particular it makes no sense at all from an ML perspective to have
>>> vector types of anything other than numerics. And as I mentioned in the
>>> POC thread (but I did not mention here), float is overwhelmingly the most
>>> frequently used vector type, to the point that Pinecone (by far the most
>>> popular vector search engine) ONLY supports that type.
>>>
>>> Lucene and Elastic also add support for vectors of bytes (8-bit ints),
>>> which are useful for optimizing models that you have already built with
>>> floats, but we have no reasonable path towards supporting indexing and
>>> searches against any other vector type.
>>>
>>> So in order of what makes sense to me:
>>>
>>> 1. Add a vector type for just floats; consider adding bytes later if
>>> demand materializes. This gives us 99% of the value and limits the scope so
>>> we can deliver quickly.
>>>
>>> 2. Add a vector type for floats or bytes. This gives us another 1% of
>>> value in exchange for an extra 20% or so of effort.
>>>
>>> 3. Add a vector type for all numeric primitives, but you can only index
>>> floats and bytes. I think this is confusing to users and a bad idea.
>>>
>>> 4. Add a vector type that composes with all Cassandra types.
>>> I can't see a reason to do this, nobody wants it, and we killed the most
>>> similar proposal in the past as wontfix.
>>>
>>> On Thu, Apr 27, 2023 at 7:49 PM Josh McKenzie <jmcken...@apache.org>
>>> wrote:
>>>
>>>> From a machine learning perspective, vectors are a well-known concept
>>>> that are effectively immutable fixed-length n-dimensional values that are
>>>> then later used either as part of a model or in conjunction with a model
>>>> after the fact.
>>>>
>>>> While we could have this be non-frozen and not call it a vector, I'd be
>>>> inclined to still make the argument for a layer of syntactic sugar on top
>>>> that met ML users where they were with concepts they understood rather than
>>>> forcing them through the cognitive lift of figuring out the Cassandra
>>>> specific contortions to replicate something that's ubiquitous in their
>>>> space. We did the same "Cassandra-first" approach with our JSON support and
>>>> that didn't do us any favors in terms of adoption and usage as far as I
>>>> know.
>>>>
>>>> So is the goal here to provide something specific and idiomatic for the
>>>> ML community or is the goal to make a primitive that's C*-centric that then
>>>> another layer can write to? I personally argue for the former; I don't see
>>>> this specific data type going away any time soon.
>>>>
>>>> On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>>>>
>>>> but as you point out it has the problem of allowing nulls.
>>>>
>>>> If nulls are not allowed for the elements, then either we need a) a
>>>> new type, or b) add some way to say elements may not be null…. As much as I
>>>> do like b, I am leaning towards new type for this use case.
>>>>
>>>> So, to flesh out the type requirements I have seen so far
>>>>
>>>> 1) represents a fixed size array of element type
>>>>    * on write path we will need to validate this
>>>> 2) element may not be null
>>>>    * on write path we will need to validate this
>>>> 3) “frozen” (is this really a requirement for the type or is this
>>>> just simpler for the ANN work? I feel that this shouldn’t be a
>>>> requirement)
>>>> 4) works for all types (my requirement; original proposal is float
>>>> only, but could logically expand to primitive types)
>>>>
>>>> Anything else?
>>>>
>>>> The key thing about a vector is that unlike lists or tuples you really
>>>> don't care about individual elements, you care about doing vector and
>>>> matrix multiplications with the thing as a unit.
>>>>
>>>> That may be true for this use case, but “should” this be true for the
>>>> type itself? I feel like no… if a user wants the Nth element of a vector
>>>> why would we block them? I am not saying the first patch, or even 5.0 adds
>>>> support for index access, I am just trying to push back saying that the
>>>> type should not block this.
>>>>
>>>> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT
>>>> VECTOR[N].)
>>>>
>>>> Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I
>>>> prefer this syntax but that limitation may not be desired for all use
>>>> cases… we could always add LIST<TYPE, N> and ARRAY<TYPE, N> later
>>>> to address that case.
>>>>
>>>> In terms of syntax I have seen, here is my ordered preference:
>>>>
>>>> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
>>>> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote
>>>> this semantic…. Could even be NON NULL TYPE[size]
>>>>
>>>> On Apr 27, 2023, at 9:00 AM, Benedict <bened...@apache.org> wrote:
>>>>
>>>> That’s a bounded ring buffer, not a fixed length array.
>>>>
>>>> This definitely isn’t a tuple because the types are all the same, which
>>>> is pretty crucial for matrix operations. Matrix libraries generally work on
>>>> arrays of known dimensionality, or sparse representations.
>>>>
>>>> Whether we draw any semantic link between the frozen list and whatever
>>>> we do here, it is fundamentally a frozen list with a restriction on its
>>>> size. What we’re defining here are “statically” sized arrays, whereas a
>>>> frozen list is essentially a dynamically sized array.
>>>>
>>>> I do not think vector is a good name because vector is used in some
>>>> other popular languages to mean a (dynamic) list, which is confusing when
>>>> we also have a list concept.
>>>>
>>>> I’m fine with just using the FLOAT[N] syntax, and drawing no direct
>>>> link with list. Though it is a bit strange that this particular type
>>>> declaration looks so different to other collection types.
>>>>
>>>> On 27 Apr 2023, at 16:48, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis <jbel...@gmail.com>
>>>> wrote:
>>>>
>>>> It's been a while, so I may be missing something, but do we already
>>>> have fixed-size lists? If not, I don't see why we'd try to make this fit
>>>> into a List-shaped problem.
>>>>
>>>> We do not. The proposal got closed as wont-fix
>>>> https://issues.apache.org/jira/browse/CASSANDRA-9110
>>>>
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>>>
>
> --
> Henrik Ingo
> c. +358 40 569 7354
> w. www.datastax.com

--
Henrik Ingo
c. +358 40 569 7354
w. www.datastax.com