I'm also in favor of having a general data type that is not tied to numeric data types alone.
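
To make that concrete, here is a rough Java sketch of a general fixed-length, no-nulls vector type where the float restriction sits at the index level rather than in the type itself. It is not the patch discussed below. VectorSketch, ElementCodec, FixedVector and isIndexable are hypothetical names made up for illustration; only composeForFloat(ByteBuffer) borrows its name from David's description, and its behavior here is my assumption.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; none of these classes exist in Cassandra.
final class VectorSketch {

    // Stand-in for an element type with a fixed-width encoding.
    interface ElementCodec<T> {
        int width();                           // bytes per element
        void write(ByteBuffer out, T value);
        T read(ByteBuffer in);
    }

    static final ElementCodec<Float> FLOAT = new ElementCodec<Float>() {
        public int width() { return Float.BYTES; }
        public void write(ByteBuffer out, Float value) { out.putFloat(value); }
        public Float read(ByteBuffer in) { return in.getFloat(); }
    };

    static final ElementCodec<Long> LONG = new ElementCodec<Long>() {
        public int width() { return Long.BYTES; }
        public void write(ByteBuffer out, Long value) { out.putLong(value); }
        public Long read(ByteBuffer in) { return in.getLong(); }
    };

    // A vector over any element type: fixed length, no nulls.
    static final class FixedVector<T> {
        final ElementCodec<T> element;
        final int dimension;

        FixedVector(ElementCodec<T> element, int dimension) {
            if (dimension <= 0) throw new IllegalArgumentException("dimension must be positive");
            this.element = element;
            this.dimension = dimension;
        }

        // Enforces the fixed-length and no-nulls constraints on write.
        ByteBuffer serialize(List<T> values) {
            if (values.size() != dimension)
                throw new IllegalArgumentException("expected " + dimension + " elements, got " + values.size());
            ByteBuffer out = ByteBuffer.allocate(dimension * element.width());
            for (T value : values) {
                if (value == null) throw new IllegalArgumentException("null elements are not allowed");
                element.write(out, value);
            }
            out.flip();
            return out;
        }

        List<T> deserialize(ByteBuffer in) {
            ByteBuffer bb = checkedCopy(in);
            List<T> values = new ArrayList<>(dimension);
            for (int i = 0; i < dimension; i++) values.add(element.read(bb));
            return values;
        }

        // Float-only fast path that avoids boxing; rejects other element types.
        float[] composeForFloat(ByteBuffer in) {
            if (element != FLOAT) throw new IllegalStateException("element type is not float");
            ByteBuffer bb = checkedCopy(in);
            float[] result = new float[dimension];
            for (int i = 0; i < dimension; i++) result[i] = bb.getFloat();
            return result;
        }

        // Duplicates the buffer so the caller's read position is left untouched.
        private ByteBuffer checkedCopy(ByteBuffer in) {
            ByteBuffer bb = in.duplicate();
            if (bb.remaining() != dimension * element.width())
                throw new IllegalArgumentException("wrong serialized size");
            return bb;
        }
    }

    // Index-level check: the type is general, but only float vectors are indexable.
    static boolean isIndexable(FixedVector<?> type) {
        return type.element == FLOAT;
    }

    public static void main(String[] args) {
        FixedVector<Float> embedding = new FixedVector<>(FLOAT, 3);
        ByteBuffer bytes = embedding.serialize(Arrays.asList(0.1f, 0.2f, 0.3f));
        System.out.println(Arrays.toString(embedding.composeForFloat(bytes)));

        FixedVector<Long> ids = new FixedVector<>(LONG, 2);
        System.out.println(isIndexable(embedding) + " " + isIndexable(ids));
    }
}
```

The point of the sketch is only that the type-level constraints (length, nulls) and the numeric restriction can live in different places; a real implementation would differ in many details.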

On 2023/05/02 22:27:24 Jonathan Ellis wrote:
> I had a call with David. We agreed that we want a "vector" data type with
> these properties
>
> - Fixed length
> - No nulls
> - Random access not supported
>
> Where we disagreed was on my proposal to restrict vectors to only numeric
> data. David's points were that
>
> (1) He has a use case today for a data type with the other vector
> properties,
> (2) It doesn't seem reasonable to create two data types with the same
> properties, one of which is restricted to numerics, and
> (3) The restrictions that I want for numeric vectors make more sense at the
> index and function level, than at the type level.
>
> I'm ready to concede that David has the better case here and move forward
> with a vector implementation without that restriction.
>
> On Tue, May 2, 2023 at 4:03 PM David Capwell <dcapw...@apple.com> wrote:
>
> > How about it, David? Did you already make this?
> >
> > I checked out the patch, fixed serialize/deserialize, added the
> > constraints, then added a composeForFloat(ByteBuffer), with this the impact
> > to the POC patch was the following
> >
> > 1) move away from VectorType.instance.serializer().deserialize(bb) to
> > type.composeForFloat(bb), both return float[]
> > 2) change the index validate logic to move away from checking VectorType
> > and instead check for that plus the element type == FloatType. I didn’t
> > bother to do this as it’s trivial
> >
> > David. End this argument. SHOW THE CODE!
> >
> > If this argument ends and people are cool with vector supporting abstract
> > type, more than glad to help get this in.
> >
> > On May 2, 2023, at 1:53 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com>
> > wrote:
> >
> > I'm all for bringing more functionality to the masses sooner, but the
> > original idea has a very very specific use case. Do we have use cases for
> > a general purpose Vector/Array data structure? If so, awesome. I just
> > wondered if generalizing provides value, beyond being straightforward to
> > implement. I'm just trying to be sensitive to the database code
> > maintenance and driver support for general types versus a single type for a
> > specific, well defined purpose.
> >
> > If it could easily be a plugin, that's great - but the full picture
> > involves drivers that need to support it or you end up getting binary blobs
> > you have to decode client side and then do stuff with. So ideally if you
> > have a well defined use case that you can build into the database, having
> > it just be part of the database and associated drivers - that makes the
> > experience much much better.
> >
> > I'm not trying to say B couldn't be valuable or that a plugin couldn't be
> > feasible. I'm just trying to enlarge the picture a bit to see what that
> > means for this use case and for the supporting drivers/clients.
> >
> > On May 2, 2023, at 3:04 PM, Benedict <bened...@apache.org> wrote:
> >
> > But it’s so trivial it was already implemented by David in the span of ten
> > minutes? If anything, we’re slowing progress down by refusing to do the
> > extra types, as we’re busy arguing about it rather than delivering a
> > feature?
> >
> > FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
> > support types beyond float. Not that we should start with float.
> >
> > So, this whole debate is a mess, I think. But hey ho.
> >
> > On 2 May 2023, at 20:57, Patrick McFadin <pmcfa...@gmail.com> wrote:
> >
> > I'll speak up on that one. If you look at my ranked voting, that is where
> > my head is. I get accused of scope creep (a lot) and looking at the initial
> > proposal Jonathan put on the ML it was mostly "Developers are adopting
> > vector search at a furious pace and I think I have a simple way of adding
> > support to keep Cassandra relevant for these use cases" Instead of just
> > focusing on this use case, I feel the arguments have bike shedded into
> > scope creep which means it will take forever to get into the project.
> >
> > My preference is to see one thing validated with an MVP and get it into
> > the hands of developers sooner so we can continue to iterate based on
> > actual usage.
> >
> > It doesn't say your points are wrong or your opinions are broken, I'm
> > voting for what I think will be awesome for users sooner.
> >
> > Patrick
> >
> > On Tue, May 2, 2023 at 12:29 PM Benedict <bened...@apache.org> wrote:
> >
> >> Could folk voting against a general purpose type (that could well be
> >> called a vector) briefly explain their reasoning?
> >>
> >> We established in the other thread that it’s technically trivial, meaning
> >> folk must think it is strictly superior to only support float rather than
> >> eg all numeric types (note: for the type, not the ANN).
> >>
> >> I am surprised, and the blurbs accompanying votes so far don’t seem to
> >> touch on this, mostly just endorsing the idea of a vector.
> >>
> >> On 2 May 2023, at 20:20, Patrick McFadin <pmcfa...@gmail.com> wrote:
> >>
> >> A > B > C on both polls.
> >>
> >> Having talked to several users in the community that are highly excited
> >> about this change, this gets to what developers want to do at Cassandra
> >> scale: store embeddings and retrieve them.
> >>
> >> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña <adelap...@apache.org>
> >> wrote:
> >>
> >>> A > B > C
> >>>
> >>> I don't think that ML is such a niche application that it can't have its
> >>> own CQL data type. Also, vectors are mathematical elements that have more
> >>> applications than ML.
> >>>
> >>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <m...@apache.org> wrote:
> >>>
> >>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <jbel...@gmail.com> wrote:
> >>>>
> >>>>> Should we add a vector type to Cassandra designed to meet the needs of
> >>>>> machine learning use cases, specifically feature and embedding vectors
> >>>>> for training, inference, and vector search?
> >>>>>
> >>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
> >>>>> types, with no nulls allowed, and with no need for random access. The ML
> >>>>> industry overwhelmingly uses float32 vectors, to the point that the
> >>>>> industry-leading special-purpose vector database ONLY supports that data
> >>>>> type.
> >>>>>
> >>>>> This poll is to gauge consensus subsequent to the recent discussion
> >>>>> thread at
> >>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
> >>>>>
> >>>>> Please rank the discussed options from most preferred option to least,
> >>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or
> >>>>> C > B = A (C is my preference, followed by B or A approximately equally.)
> >>>>>
> >>>>> (A) I am in favor of adding a vector type for floats; I do not believe
> >>>>> we need to tie it to any particular implementation details.
> >>>>>
> >>>>> (B) I am okay with adding a vector type but I believe we must add
> >>>>> array types that compose with all Cassandra types first, and make
> >>>>> vectors a special case of arrays-without-null-elements.
> >>>>>
> >>>>> (C) I am not in favor of adding a built-in vector type.
> >>>>>
> >>>>
> >>>> A > B > C
> >>>>
> >>>> B is stated as "must add array types…". I think this is a bit loaded.
> >>>> If B was the (A + the implementation needs to be a non-null frozen
> >>>> float32 array, serialisation forward compatible with other frozen arrays
> >>>> later implemented) I would put this before (A). Especially because it's
> >>>> been shown already this is easy to implement.
> >>>>
> >>>
> >
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced