Re: [POLL] Vector type for ML

David Capwell Tue, 02 May 2023 14:03:03 -0700

>  How about it, David? Did you already make this?

I checked out the patch, fixed serialize/deserialize, added the constraints, 
and then added a composeForFloat(ByteBuffer). With this, the impact on the POC 
patch was the following:


1) move away from VectorType.instance.serializer().deserialize(bb) to 
type.composeForFloat(bb); both return float[]
2) change the index validation logic to move away from checking only for 
VectorType and instead check for that plus element type == FloatType.  I didn’t 
bother to do this as it’s trivial
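
To make item 1 concrete, here is a minimal, self-contained sketch of what a 
composeForFloat(ByteBuffer) on a fixed-dimension vector type might look like. 
This is not the code from the patch: the class name FloatVectorSketch, the 
decompose method, and the wire layout (dimension * 4 bytes, no per-element 
length prefix) are all assumptions made for illustration only.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the vector type discussed above; NOT the actual patch.
final class FloatVectorSketch {
    private final int dimension;

    FloatVectorSketch(int dimension) {
        if (dimension <= 0)
            throw new IllegalArgumentException("dimension must be positive");
        this.dimension = dimension;
    }

    // Assumed layout: exactly dimension * 4 bytes, elements written back to back.
    ByteBuffer decompose(float[] values) {
        if (values.length != dimension)
            throw new IllegalArgumentException(
                "expected " + dimension + " elements, got " + values.length);
        ByteBuffer bb = ByteBuffer.allocate(dimension * Float.BYTES);
        for (float v : values)
            bb.putFloat(v);
        bb.flip(); // make the written bytes readable
        return bb;
    }

    // Reads the floats with absolute gets, so the caller's buffer position is untouched.
    float[] composeForFloat(ByteBuffer bb) {
        if (bb.remaining() != dimension * Float.BYTES)
            throw new IllegalArgumentException(
                "expected " + (dimension * Float.BYTES) + " bytes, got " + bb.remaining());
        float[] out = new float[dimension];
        for (int i = 0; i < dimension; i++)
            out[i] = bb.getFloat(bb.position() + i * Float.BYTES);
        return out;
    }
}
```

Under these assumptions a caller holds a type instance that knows its 
dimension and calls type.composeForFloat(bb) to get a float[] back, rather 
than going through a singleton serializer. Because the elements are primitive 
floats, the "no nulls" constraint falls out of the representation.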

> David. End this argument. SHOW THE CODE! 

If this argument ends and people are cool with vector supporting abstract type, 
I'm more than glad to help get this in.

> On May 2, 2023, at 1:53 PM, Jeremy Hanna <[email protected]> wrote:
> 
> I'm all for bringing more functionality to the masses sooner, but the 
> original idea has a very very specific use case.  Do we have use cases for a 
> general purpose Vector/Array data structure?  If so, awesome.  I just 
> wondered if generalizing provides value, beyond being straightforward to 
> implement.  I'm just trying to be sensitive to the database code maintenance 
> and driver support for general types versus a single type for a specific, 
> well defined purpose.
> 
> If it could easily be a plugin, that's great - but the full picture involves 
> drivers that need to support it or you end up getting binary blobs you have 
> to decode client side and then do stuff with.  So ideally if you have a well 
> defined use case that you can build into the database, having it just be part 
> of the database and associated drivers - that makes the experience much much 
> better.
> 
> I'm not trying to say B couldn't be valuable or that a plugin couldn't be 
> feasible.  I'm just trying to enlarge the picture a bit to see what that 
> means for this use case and for the supporting drivers/clients.
> 
>> On May 2, 2023, at 3:04 PM, Benedict <[email protected]> wrote:
>> 
>> But it’s so trivial it was already implemented by David in the span of ten 
>> minutes? If anything, we’re slowing progress down by refusing to do the 
>> extra types, as we’re busy arguing about it rather than delivering a feature?
>> 
>> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever) 
>> support types beyond float. Not that we should start with float.
>> 
>> So, this whole debate is a mess, I think. But hey ho.
>> 
>>> On 2 May 2023, at 20:57, Patrick McFadin <[email protected]> wrote:
>>> 
>>> 
>>> I'll speak up on that one. If you look at my ranked voting, that is where 
>>> my head is. I get accused of scope creep (a lot) and looking at the initial 
>>> proposal Jonathan put on the ML it was mostly "Developers are adopting 
>>> vector search at a furious pace and I think I have a simple way of adding 
>>> support to keep Cassandra relevant for these use cases." Instead of just 
>>> focusing on this use case, I feel the arguments have bike shedded into 
>>> scope creep which means it will take forever to get into the project.
>>> 
>>> My preference is to see one thing validated with an MVP and get it into the 
>>> hands of developers sooner so we can continue to iterate based on actual 
>>> usage. 
>>> 
>>> It doesn't say your points are wrong or your opinions are broken, I'm 
>>> voting for what I think will be awesome for users sooner. 
>>> 
>>> Patrick
>>> 
>>> On Tue, May 2, 2023 at 12:29 PM Benedict <[email protected]> wrote:
>>>> Could folk voting against a general purpose type (that could well be 
>>>> called a vector) briefly explain their reasoning?
>>>> 
>>>> We established in the other thread that it’s technically trivial, meaning 
>>>> folk must think it is strictly superior to only support float rather than 
>>>> eg all numeric types (note: for the type, not the ANN). 
>>>> 
>>>> I am surprised, and the blurbs accompanying votes so far don’t seem to 
>>>> touch on this, mostly just endorsing the idea of a vector.
>>>> 
>>>> 
>>>>> On 2 May 2023, at 20:20, Patrick McFadin <[email protected]> wrote:
>>>>> 
>>>>> 
>>>>> A > B > C on both polls. 
>>>>> 
>>>>> Having talked to several users in the community that are highly excited 
>>>>> about this change, this gets to what developers want to do at Cassandra 
>>>>> scale: store embeddings and retrieve them. 
>>>>> 
>>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña <[email protected]> wrote:
>>>>>> A > B > C
>>>>>> 
>>>>>> I don't think that ML is such a niche application that it can't have its 
>>>>>> own CQL data type. Also, vectors are mathematical elements that have 
>>>>>> more applications than ML.
>>>>>> 
>>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <[email protected]> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <[email protected]> wrote:
>>>>>>>> Should we add a vector type to Cassandra designed to meet the needs of 
>>>>>>>> machine learning use cases, specifically feature and embedding vectors 
>>>>>>>> for training, inference, and vector search?  
>>>>>>>> 
>>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric 
>>>>>>>> types, with no nulls allowed, and with no need for random access. The 
>>>>>>>> ML industry overwhelmingly uses float32 vectors, to the point that the 
>>>>>>>> industry-leading special-purpose vector database ONLY supports that 
>>>>>>>> data type.
>>>>>>>> 
>>>>>>>> This poll is to gauge consensus subsequent to the recent discussion 
>>>>>>>> thread at 
>>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>>> 
>>>>>>>> Please rank the discussed options from most preferred option to least, 
>>>>>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or 
>>>>>>>> C > B = A (C is my preference, followed by B or A approximately 
>>>>>>>> equally.)
>>>>>>>> 
>>>>>>>> (A) I am in favor of adding a vector type for floats; I do not believe 
>>>>>>>> we need to tie it to any particular implementation details.
>>>>>>>> 
>>>>>>>> (B) I am okay with adding a vector type but I believe we must add 
>>>>>>>> array types that compose with all Cassandra types first, and make 
>>>>>>>> vectors a special case of arrays-without-null-elements.
>>>>>>>> 
>>>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> A  > B > C
>>>>>>> 
>>>>>>> B is stated as "must add array types…".  I think this is a bit loaded.  
>>>>>>> If B was the (A + the implementation needs to be a non-null frozen 
>>>>>>> float32 array, serialisation forward compatible with other frozen 
>>>>>>> arrays later implemented) I would put this before (A).  Especially 
>>>>>>> because it's been shown already this is easy to implement.
>>>>>>> 