Re: [POLL] Vector type for ML

David Capwell Wed, 03 May 2023 15:38:24 -0700

> Did we agree on a CQL syntax?

I don’t believe there has been a poll on CQL syntax… my understanding from reading 
all the threads is that there are ~4-5 options and none are -1ed, so I believe we 
are waiting for majority rule on this?


> On May 3, 2023, at 1:23 PM, Jeremiah D Jordan <jerem...@datastax.com> wrote:
> 
>> To be clear, I support the general agreement David and Jonathan seem to have 
>> reached.
> 
> +1 as well.
> 
> 
>> On May 3, 2023, at 3:07 PM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>> 
>> To be clear, I support the general agreement David and Jonathan seem to have 
>> reached.
>> 
>> On Wed, May 3, 2023 at 3:05 PM Caleb Rackliffe <calebrackli...@gmail.com 
>> <mailto:calebrackli...@gmail.com>> wrote:
>>> Did we agree on a CQL syntax?
>>> 
>>> On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh 
>>> <rahul.xavier.si...@gmail.com <mailto:rahul.xavier.si...@gmail.com>> wrote:
>>>> I like this approach. Thank you for those working on this vector search 
>>>> initiative. 
>>>> 
>>>> Here's the feedback from my "user" hat for someone who is looking at 
>>>> databases / indexes for my next LLM app. 
>>>> 
>>>> Can I take some python code and go from using an in memory vector store 
>>>> like numpy or FAISS to something else? How easy is it for me to take my 
>>>> python code and get it to work with this new external service which is no 
>>>> longer just a library?
>>>> There are also tons of services that I can run on Docker, e.g. milvus, 
>>>> redissearch, typesense, elasticsearch, opensearch, and I may hit a hurdle 
>>>> when trying to handle a lot more data, so I look at Cassandra Vector Search. 
>>>> Because I am familiar with SQL, Cassandra looks appealing since I can 
>>>> potentially use a "cql_agent" lib (to be created for langchain; we're 
>>>> looking into that now) or an existing CassandraVectorStore class.
>>>> 
>>>> In most of these scenarios, if people are using langchain or llamaindex, the 
>>>> underlying implementation is not as important, since we shield the user 
>>>> from CQL data types except at schema creation, and most of these libs can be 
>>>> opinionated and just suggest a generic schema. 
>>>> 
>>>> The ideal world is if I can just dump text into a field and do a natural 
>>>> language query on it and have my DB do the embeddings for the document, 
>>>> and then for the query for me. For now the libs can manage all that and 
>>>> they do that well. We just need the interface to stay consistent and be 
>>>> relatively easy to query in CQL. The most popular index in LLM retrieval 
>>>> augmented patterns is pinecone. You make an index, you upsert, and then 
>>>> you query. It's not assumed that you are also giving it content, though 
>>>> you can send it metadata to have the document there. 
>>>> 
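The "make an index, upsert, query" workflow described above can be sketched with a brute-force in-memory stand-in. This is only an illustration: the class name `InMemoryVectorIndex` and its methods are hypothetical, it is not Pinecone's API, and it says nothing about what the eventual CQL will look like.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorIndex:
    """Hypothetical stand-in for an external vector index (exhaustive scan, no ANN)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._rows = {}  # key -> (vector, metadata)

    def upsert(self, key, vector, metadata=None):
        # Insert a new row or overwrite an existing one with the same key.
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} elements, got {len(vector)}")
        self._rows[key] = (tuple(vector), dict(metadata or {}))

    def query(self, vector, top_k=3):
        # Score every stored row, then return the top_k as (key, score, metadata).
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} elements, got {len(vector)}")
        scored = [
            (key, cosine_similarity(vector, stored), meta)
            for key, (stored, meta) in self._rows.items()
        ]
        scored.sort(key=lambda hit: hit[1], reverse=True)
        return scored[:top_k]
```

The embeddings are assumed to be computed by the caller (or a library such as langchain) before `upsert` and `query` are invoked; the metadata dict is optional, matching the point that content does not have to be stored alongside the vector.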
>>>> If we can have a similar workflow, e.g. create a table with a vector type 
>>>> OR create a table with an existing type and then add an index to it, no 
>>>> one is going to lose sleep over it as long as it works. Having the ability to 
>>>> take a table that has data and then add a vector index doesn't make it 
>>>> any different from adding a new field, since I've got to calculate the 
>>>> embeddings anyway. 
>>>> 
>>>> Would love to see what the CQL ends up looking like. 
>>>> Rahul Singh
>>>> Chief Executive Officer | Business Platform Architect
>>>> m: 202.905.2818 e: rahul.si...@anant.us <mailto:rahul.si...@anant.us>
>>>> li: http://linkedin.com/in/xingh
>>>> ca: http://calendly.com/xingh
>>>> We create, support, and manage real-time global data & analytics platforms 
>>>> for the modern enterprise.
>>>> 
>>>> Anant | https://anant.us
>>>> 3 Washington Circle, Suite 301
>>>> Washington, D.C. 20037
>>>> 
>>>> http://Cassandra.Link : The best resources for Apache Cassandra
>>>> 
>>>> 
>>>> On Tue, May 2, 2023 at 6:39 PM Patrick McFadin <pmcfa...@gmail.com 
>>>> <mailto:pmcfa...@gmail.com>> wrote:
>>>>> \o/
>>>>> 
>>>>> Bring it in team. Group hug. 
>>>>> 
>>>>> Now if you'll excuse me, I'm going to go build my preso on how Cassandra 
>>>>> is the only distributed database where you can do vector search in an ACID 
>>>>> transaction. 
>>>>> 
>>>>> Patrick
>>>>> 
>>>>> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis <jbel...@gmail.com 
>>>>> <mailto:jbel...@gmail.com>> wrote:
>>>>>> I had a call with David.  We agreed that we want a "vector" data type 
>>>>>> with these properties
>>>>>> 
>>>>>> - Fixed length
>>>>>> - No nulls
>>>>>> - Random access not supported
>>>>>> 
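As a rough illustration of the three properties listed above, here is a minimal sketch. The class `FixedVector` is hypothetical and is not Cassandra code; in particular, modelling "random access not supported" as "whole-value reads only, no indexing" is one assumed reading of that bullet.

```python
from typing import Iterator, Sequence


class FixedVector:
    """Hypothetical value type: fixed length, no nulls, no element-level access."""

    def __init__(self, dimension: int, elements: Sequence[float]):
        if dimension <= 0:
            raise ValueError("dimension must be positive")
        # Fixed length: the element count must match the declared dimension.
        if len(elements) != dimension:
            raise ValueError(f"expected {dimension} elements, got {len(elements)}")
        # No nulls: every slot must hold a value.
        if any(e is None for e in elements):
            raise ValueError("vector elements must not be null")
        self._dimension = dimension
        self._elements = tuple(elements)

    @property
    def dimension(self) -> int:
        return self._dimension

    def __iter__(self) -> Iterator[float]:
        # The vector is only ever read as a whole, front to back.
        return iter(self._elements)

    def __getitem__(self, index):
        # Assumed reading of "random access not supported": no v[i].
        raise TypeError("random access into a vector is not supported")
```

Note that nothing in this sketch restricts the element type to numerics, which is the point on which the two positions described next differed.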
>>>>>> Where we disagreed was on my proposal to restrict vectors to only 
>>>>>> numeric data.  David's points were that
>>>>>> 
>>>>>> (1) He has a use case today for a data type with the other vector 
>>>>>> properties,
>>>>>> (2) It doesn't seem reasonable to create two data types with the same 
>>>>>> properties, one of which is restricted to numerics, and
>>>>>> (3) The restrictions that I want for numeric vectors make more sense at 
>>>>>> the index and function level, than at the type level.
>>>>>> 
>>>>>> I'm ready to concede that David has the better case here and move 
>>>>>> forward with a vector implementation without that restriction.
>>>>>> 
>>>>>> On Tue, May 2, 2023 at 4:03 PM David Capwell <dcapw...@apple.com 
>>>>>> <mailto:dcapw...@apple.com>> wrote:
>>>>>>>>  How about it, David? Did you already make this?
>>>>>>> 
>>>>>>> I checked out the patch, fixed serialize/deserialize, added the 
>>>>>>> constraints, then added a composeForFloat(ByteBuffer). With this, the 
>>>>>>> impact on the POC patch was the following:
>>>>>>> 
>>>>>>> 1) move away from VectorType.instance.serializer().deserialize(bb) to 
>>>>>>> type.composeForFloat(bb); both return float[]
>>>>>>> 2) change the index validation logic to move away from checking only for 
>>>>>>> VectorType and instead check for that plus the element type == 
>>>>>>> FloatType.  I didn’t bother to do this as it’s trivial
>>>>>>> 
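The two changes above can be sketched as follows. This is a Python illustration, not the Java patch: the `VectorType` dataclass, the string element-type labels, and the byte layout (big-endian float32 values packed back to back with no length prefix) are all assumptions made for the example.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorType:
    """Hypothetical generic vector type: any element type, fixed dimension."""
    element_type: str  # assumed labels such as "float" or "int"
    dimension: int

    def serialize_floats(self, values):
        # Assumed layout: `dimension` big-endian float32 values, no length prefix.
        if self.element_type != "float":
            raise TypeError("serialize_floats requires element_type == 'float'")
        if len(values) != self.dimension:
            raise ValueError(f"expected {self.dimension} elements, got {len(values)}")
        return struct.pack(f">{self.dimension}f", *values)

    def compose_for_float(self, data: bytes):
        # Analogue of composeForFloat(ByteBuffer): decode straight to floats.
        if self.element_type != "float":
            raise TypeError("compose_for_float requires element_type == 'float'")
        if len(data) != 4 * self.dimension:
            raise ValueError(f"expected {4 * self.dimension} bytes, got {len(data)}")
        return list(struct.unpack(f">{self.dimension}f", data))


def validate_index_target(column_type) -> None:
    """Step 2: accept a vector column only when its elements are floats."""
    if not isinstance(column_type, VectorType):
        raise TypeError("index requires a vector column")
    if column_type.element_type != "float":
        raise TypeError("index requires a vector of floats")
```

With this split the type itself stays general, and the float-only restriction lives in the index validation and in the float-specific accessor.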
>>>>>>>> David. End this argument. SHOW THE CODE! 
>>>>>>> 
>>>>>>> If this argument ends and people are cool with vector supporting 
>>>>>>> abstract type, I'm more than glad to help get this in.
>>>>>>> 
>>>>>>>> On May 2, 2023, at 1:53 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com 
>>>>>>>> <mailto:jeremy.hanna1...@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> I'm all for bringing more functionality to the masses sooner, but the 
>>>>>>>> original idea has a very very specific use case.  Do we have use cases 
>>>>>>>> for a general purpose Vector/Array data structure?  If so, awesome.  I 
>>>>>>>> just wondered if generalizing provides value, beyond being 
>>>>>>>> straightforward to implement.  I'm just trying to be sensitive to the 
>>>>>>>> database code maintenance and driver support for general types versus 
>>>>>>>> a single type for a specific, well defined purpose.
>>>>>>>> 
>>>>>>>> If it could easily be a plugin, that's great - but the full picture 
>>>>>>>> involves drivers that need to support it or you end up getting binary 
>>>>>>>> blobs you have to decode client side and then do stuff with.  So 
>>>>>>>> ideally if you have a well defined use case that you can build into 
>>>>>>>> the database, having it just be part of the database and associated 
>>>>>>>> drivers - that makes the experience much much better.
>>>>>>>> 
>>>>>>>> I'm not trying to say B couldn't be valuable or that a plugin couldn't 
>>>>>>>> be feasible.  I'm just trying to enlarge the picture a bit to see what 
>>>>>>>> that means for this use case and for the supporting drivers/clients.
>>>>>>>> 
>>>>>>>>> On May 2, 2023, at 3:04 PM, Benedict <bened...@apache.org 
>>>>>>>>> <mailto:bened...@apache.org>> wrote:
>>>>>>>>> 
>>>>>>>>> But it’s so trivial it was already implemented by David in the span 
>>>>>>>>> of ten minutes? If anything, we’re slowing progress down by refusing 
>>>>>>>>> to do the extra types, as we’re busy arguing about it rather than 
>>>>>>>>> delivering a feature?
>>>>>>>>> 
>>>>>>>>> FWIW, my interpretation of the votes today is that we SHOULD NOT 
>>>>>>>>> (ever) support types beyond float. Not that we should start with 
>>>>>>>>> float.
>>>>>>>>> 
>>>>>>>>> So, this whole debate is a mess, I think. But hey ho.
>>>>>>>>> 
>>>>>>>>>> On 2 May 2023, at 20:57, Patrick McFadin <pmcfa...@gmail.com 
>>>>>>>>>> <mailto:pmcfa...@gmail.com>> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I'll speak up on that one. If you look at my ranked voting, that is 
>>>>>>>>>> where my head is. I get accused of scope creep (a lot) and looking 
>>>>>>>>>> at the initial proposal Jonathan put on the ML it was mostly 
>>>>>>>>>> "Developers are adopting vector search at a furious pace and I think 
>>>>>>>>>> I have a simple way of adding support to keep Cassandra relevant for 
>>>>>>>>>> these use cases." Instead of just focusing on this use case, I feel 
>>>>>>>>>> the arguments have bike shedded into scope creep which means it will 
>>>>>>>>>> take forever to get into the project.
>>>>>>>>>> 
>>>>>>>>>> My preference is to see one thing validated with an MVP and get it 
>>>>>>>>>> into the hands of developers sooner so we can continue to iterate 
>>>>>>>>>> based on actual usage. 
>>>>>>>>>> 
>>>>>>>>>> It doesn't say your points are wrong or your opinions are broken, 
>>>>>>>>>> I'm voting for what I think will be awesome for users sooner. 
>>>>>>>>>> 
>>>>>>>>>> Patrick
>>>>>>>>>> 
>>>>>>>>>> On Tue, May 2, 2023 at 12:29 PM Benedict <bened...@apache.org 
>>>>>>>>>> <mailto:bened...@apache.org>> wrote:
>>>>>>>>>>> Could folk voting against a general purpose type (that could well 
>>>>>>>>>>> be called a vector) briefly explain their reasoning?
>>>>>>>>>>> 
>>>>>>>>>>> We established in the other thread that it’s technically trivial, 
>>>>>>>>>>> meaning folk must think it is strictly superior to only support 
>>>>>>>>>>> float rather than eg all numeric types (note: for the type, not the 
>>>>>>>>>>> ANN). 
>>>>>>>>>>> 
>>>>>>>>>>> I am surprised, and the blurbs accompanying votes so far don’t seem 
>>>>>>>>>>> to touch on this, mostly just endorsing the idea of a vector.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 2 May 2023, at 20:20, Patrick McFadin <pmcfa...@gmail.com 
>>>>>>>>>>>> <mailto:pmcfa...@gmail.com>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> A > B > C on both polls. 
>>>>>>>>>>>> 
>>>>>>>>>>>> Having talked to several users in the community that are highly 
>>>>>>>>>>>> excited about this change, this gets to what developers want to do 
>>>>>>>>>>>> at Cassandra scale: store embeddings and retrieve them. 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña 
>>>>>>>>>>>> <adelap...@apache.org <mailto:adelap...@apache.org>> wrote:
>>>>>>>>>>>>> A > B > C
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I don't think that ML is such a niche application that it can't 
>>>>>>>>>>>>> have its own CQL data type. Also, vectors are mathematical 
>>>>>>>>>>>>> elements that have more applications than ML.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <m...@apache.org 
>>>>>>>>>>>>> <mailto:m...@apache.org>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <jbel...@gmail.com 
>>>>>>>>>>>>>> <mailto:jbel...@gmail.com>> wrote:
>>>>>>>>>>>>>>> Should we add a vector type to Cassandra designed to meet the 
>>>>>>>>>>>>>>> needs of machine learning use cases, specifically feature and 
>>>>>>>>>>>>>>> embedding vectors for training, inference, and vector search?  
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of 
>>>>>>>>>>>>>>> numeric types, with no nulls allowed, and with no need for 
>>>>>>>>>>>>>>> random access. The ML industry overwhelmingly uses float32 
>>>>>>>>>>>>>>> vectors, to the point that the industry-leading special-purpose 
>>>>>>>>>>>>>>> vector database ONLY supports that data type.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This poll is to gauge consensus subsequent to the recent 
>>>>>>>>>>>>>>> discussion thread at 
>>>>>>>>>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Please rank the discussed options from most preferred option to 
>>>>>>>>>>>>>>> least, e.g., A > B > C (A is my preference, followed by B, 
>>>>>>>>>>>>>>> followed by C) or C > B = A (C is my preference, followed by B 
>>>>>>>>>>>>>>> or A approximately equally.)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> (A) I am in favor of adding a vector type for floats; I do not 
>>>>>>>>>>>>>>> believe we need to tie it to any particular implementation 
>>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> (B) I am okay with adding a vector type but I believe we must 
>>>>>>>>>>>>>>> add array types that compose with all Cassandra types first, 
>>>>>>>>>>>>>>> and make vectors a special case of arrays-without-null-elements.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> A  > B > C
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> B is stated as "must add array types…".  I think this is a bit 
>>>>>>>>>>>>>> loaded.  If B were (A + the implementation needs to be a 
>>>>>>>>>>>>>> non-null frozen float32 array, with serialisation forward compatible 
>>>>>>>>>>>>>> with other frozen arrays implemented later), I would put it 
>>>>>>>>>>>>>> before (A).  Especially because it has already been shown that this is 
>>>>>>>>>>>>>> easy to implement.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>  
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Jonathan Ellis
>>>>>> co-founder, http://www.datastax.com <http://www.datastax.com/>
>>>>>> @spyced
>
