> How about it, David? Did you already make this?

I checked out the patch, fixed serialize/deserialize, added the constraints, and then added a composeForFloat(ByteBuffer). With this, the impact on the POC patch was the following:
1) move away from VectorType.instance.serializer().deserialize(bb) to type.composeForFloat(bb); both return float[]
2) change the index validate logic to move away from checking VectorType and instead check for that plus the element type == FloatType. I didn’t bother to do this as it’s trivial.

> David. End this argument. SHOW THE CODE!

If this argument ends and people are cool with vector supporting abstract type, more than glad to help get this in.

> On May 2, 2023, at 1:53 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
>
> I'm all for bringing more functionality to the masses sooner, but the
> original idea has a very very specific use case. Do we have use cases for a
> general purpose Vector/Array data structure? If so, awesome. I just
> wondered if generalizing provides value, beyond being straightforward to
> implement. I'm just trying to be sensitive to the database code maintenance
> and driver support for general types versus a single type for a specific,
> well defined purpose.
>
> If it could easily be a plugin, that's great - but the full picture involves
> drivers that need to support it or you end up getting binary blobs you have
> to decode client side and then do stuff with. So ideally if you have a well
> defined use case that you can build into the database, having it just be part
> of the database and associated drivers - that makes the experience much much
> better.
>
> I'm not trying to say B couldn't be valuable or that a plugin couldn't be
> feasible. I'm just trying to enlarge the picture a bit to see what that
> means for this use case and for the supporting drivers/clients.
>
>> On May 2, 2023, at 3:04 PM, Benedict <bened...@apache.org> wrote:
>>
>> But it’s so trivial it was already implemented by David in the span of ten
>> minutes? If anything, we’re slowing progress down by refusing to do the
>> extra types, as we’re busy arguing about it rather than delivering a feature?
>>
>> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
>> support types beyond float. Not that we should start with float.
>>
>> So, this whole debate is a mess, I think. But hey ho.
>>
>>> On 2 May 2023, at 20:57, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>
>>>
>>> I'll speak up on that one. If you look at my ranked voting, that is where
>>> my head is. I get accused of scope creep (a lot) and looking at the initial
>>> proposal Jonathan put on the ML it was mostly "Developers are adopting
>>> vector search at a furious pace and I think I have a simple way of adding
>>> support to keep Cassandra relevant for these use cases" Instead of just
>>> focusing on this use case, I feel the arguments have bike shedded into
>>> scope creep which means it will take forever to get into the project.
>>>
>>> My preference is to see one thing validated with an MVP and get it into the
>>> hands of developers sooner so we can continue to iterate based on actual
>>> usage.
>>>
>>> It doesn't say your points are wrong or your opinions are broken, I'm
>>> voting for what I think will be awesome for users sooner.
>>>
>>> Patrick
>>>
>>> On Tue, May 2, 2023 at 12:29 PM Benedict <bened...@apache.org> wrote:
>>>> Could folk voting against a general purpose type (that could well be
>>>> called a vector) briefly explain their reasoning?
>>>>
>>>> We established in the other thread that it’s technically trivial, meaning
>>>> folk must think it is strictly superior to only support float rather than
>>>> eg all numeric types (note: for the type, not the ANN).
>>>>
>>>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>>>> touch on this, mostly just endorsing the idea of a vector.
>>>>
>>>>
>>>>> On 2 May 2023, at 20:20, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> A > B > C on both polls.
>>>>>
>>>>> Having talked to several users in the community that are highly excited
>>>>> about this change, this gets to what developers want to do at Cassandra
>>>>> scale: store embeddings and retrieve them.
>>>>>
>>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña <adelap...@apache.org> wrote:
>>>>>> A > B > C
>>>>>>
>>>>>> I don't think that ML is such a niche application that it can't have its
>>>>>> own CQL data type. Also, vectors are mathematical elements that have
>>>>>> more applications than ML.
>>>>>>
>>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>>>>>> machine learning use cases, specifically feature and embedding vectors
>>>>>>>> for training, inference, and vector search?
>>>>>>>>
>>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>>>>> types, with no nulls allowed, and with no need for random access. The
>>>>>>>> ML industry overwhelmingly uses float32 vectors, to the point that the
>>>>>>>> industry-leading special-purpose vector database ONLY supports that
>>>>>>>> data type.
>>>>>>>>
>>>>>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>>>>>> thread at
>>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>>>
>>>>>>>> Please rank the discussed options from most preferred option to least,
>>>>>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or
>>>>>>>> C > B = A (C is my preference, followed by B or A approximately
>>>>>>>> equally.)
>>>>>>>>
>>>>>>>> (A) I am in favor of adding a vector type for floats; I do not believe
>>>>>>>> we need to tie it to any particular implementation details.
>>>>>>>>
>>>>>>>> (B) I am okay with adding a vector type but I believe we must add
>>>>>>>> array types that compose with all Cassandra types first, and make
>>>>>>>> vectors a special case of arrays-without-null-elements.
>>>>>>>>
>>>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> A > B > C
>>>>>>>
>>>>>>> B is stated as "must add array types…". I think this is a bit loaded.
>>>>>>> If B was the (A + the implementation needs to be a non-null frozen
>>>>>>> float32 array, serialisation forward compatible with other frozen
>>>>>>> arrays later implemented) I would put this before (A). Especially
>>>>>>> because it's been shown already this is easy to implement.
>>>>>>>
>>>>>>>
>
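
For readers following the thread, the float-only path discussed in the top message (a composeForFloat(ByteBuffer) that returns float[] for a fixed-dimension vector with no nulls) might look roughly like the sketch below. This is not the POC patch and not Cassandra's VectorType: the class FloatVectorCodec, the explicit dimension parameter, and the packed 4-byte big-endian layout are all assumptions made for illustration. Only the name composeForFloat is borrowed from the message above.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical stand-alone codec for illustration only; not Cassandra's VectorType.
final class FloatVectorCodec {
    private FloatVectorCodec() {}

    // Packs exactly `dimension` floats as consecutive 4-byte big-endian values
    // (assumed layout). A primitive float[] cannot hold nulls, which matches the
    // "no nulls allowed" constraint from the poll text.
    static ByteBuffer serialize(float[] values, int dimension) {
        if (values == null || values.length != dimension) {
            throw new IllegalArgumentException("expected " + dimension + " elements");
        }
        ByteBuffer bb = ByteBuffer.allocate(dimension * Float.BYTES);
        for (float v : values) {
            bb.putFloat(v);
        }
        bb.flip(); // switch the buffer from writing to reading
        return bb;
    }

    // Decodes directly to float[] without boxing. The name mirrors the thread;
    // the signature (with an explicit dimension) is an assumption.
    static float[] composeForFloat(ByteBuffer bb, int dimension) {
        if (bb.remaining() != dimension * Float.BYTES) {
            throw new IllegalArgumentException(
                "expected " + (dimension * Float.BYTES) + " bytes, got " + bb.remaining());
        }
        ByteBuffer view = bb.duplicate(); // leave the caller's position untouched
        float[] out = new float[dimension];
        for (int i = 0; i < dimension; i++) {
            out[i] = view.getFloat();
        }
        return out;
    }

    public static void main(String[] args) {
        float[] embedding = {0.5f, -1.25f, 3.0f};
        ByteBuffer bb = serialize(embedding, 3);
        System.out.println(Arrays.toString(composeForFloat(bb, 3)));
    }
}
```

A generalized vector, as in option (B), would presumably replace the hard-coded getFloat/putFloat calls with a per-element serializer, while a float-specific fast path like this one could remain for the ANN index.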