> Did we agree on a CQL syntax?

I don’t believe there has been a poll on CQL syntax… my understanding from reading all the threads is that there are ~4-5 options and none are -1ed, so I believe we are waiting for majority rule on this?
> On May 3, 2023, at 1:23 PM, Jeremiah D Jordan <jerem...@datastax.com> wrote:
>
>> To be clear, I support the general agreement David and Jonathan seem to have
>> reached.
>
> +1 as well.
>
>
>> On May 3, 2023, at 3:07 PM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>
>> To be clear, I support the general agreement David and Jonathan seem to have
>> reached.
>>
>> On Wed, May 3, 2023 at 3:05 PM Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>> Did we agree on a CQL syntax?
>>>
>>> On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh
>>> <rahul.xavier.si...@gmail.com> wrote:
>>>> I like this approach. Thank you for those working on this vector search
>>>> initiative.
>>>>
>>>> Here's the feedback from my "user" hat for someone who is looking at
>>>> databases / indexes for my next LLM app.
>>>>
>>>> Can I take some python code and go from using an in memory vector store
>>>> like numpy or FAISS to something else? How easy is it for me to take my
>>>> python code and get it to work with this new external service which is no
>>>> longer just a library?
>>>> There's also tons of services that I can run on docker e.g. milvus,
>>>> redissearch, typesense, elasticsearch, opensearch and I may hit a hurdle
>>>> when trying to do a lot more data, so I look at Cassandra Vector Search.
>>>> Because I am familiar with SQL, Cassandra looks appealing since I can
>>>> potentially use "cql_agent" lib (to be created for langchain and we're
>>>> looking into that now) or an existing CassandraVectorStore class?
>>>>
>>>> In most of these scenarios, if people are using langchain, llamaindex, the
>>>> underlying implementation is not as important since we shield the user
>>>> from CQL data types except at schema creation and most of these libs can be
>>>> opinionated and just suggest a generic schema.
>>>>
>>>> The ideal world is if I can just dump text into a field and do a natural
>>>> language query on it and have my DB do the embeddings for the document,
>>>> and then for the query for me. For now the libs can manage all that and
>>>> they do that well. We just need the interface to stay consistent and be
>>>> relatively easy to query in CQL. The most popular index in LLM retrieval
>>>> augmented patterns is pinecone. You make an index, you upsert, and then
>>>> you query. It's not assumed that you are also giving it content, though
>>>> you can send it metadata to have the document there.
>>>>
>>>> If we can have a similar workflow e.g. create a table with a vector type
>>>> OR create a table with an existing type and then add an index to it, no
>>>> one is going to lose sleep over it as long as it works. Having the ability to
>>>> take a table that has data, and then add a vector index doesn't make it
>>>> any different than adding a new field since I've got to calculate the
>>>> embeddings anyways.
>>>>
>>>> Would love to see what the CQL ends up looking like.
>>>> Rahul Singh
>>>> Chief Executive Officer | Business Platform Architect
>>>> m: 202.905.2818 e: rahul.si...@anant.us
>>>> li: http://linkedin.com/in/xingh
>>>> ca: http://calendly.com/xingh
>>>> We create, support, and manage real-time global data & analytics platforms
>>>> for the modern enterprise.
>>>>
>>>> Anant | https://anant.us
>>>> 3 Washington Circle, Suite 301
>>>> Washington, D.C. 20037
>>>>
>>>> http://Cassandra.Link : The best resources for Apache Cassandra
>>>>
>>>>
>>>> On Tue, May 2, 2023 at 6:39 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>> \o/
>>>>>
>>>>> Bring it in team. Group hug.
>>>>>
>>>>> Now if you'll excuse me, I'm going to go build my preso on how Cassandra
>>>>> is the only distributed database you can do vector search in an ACID
>>>>> transaction.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>>> I had a call with David. We agreed that we want a "vector" data type
>>>>>> with these properties
>>>>>>
>>>>>> - Fixed length
>>>>>> - No nulls
>>>>>> - Random access not supported
>>>>>>
>>>>>> Where we disagreed was on my proposal to restrict vectors to only
>>>>>> numeric data. David's points were that
>>>>>>
>>>>>> (1) He has a use case today for a data type with the other vector
>>>>>> properties,
>>>>>> (2) It doesn't seem reasonable to create two data types with the same
>>>>>> properties, one of which is restricted to numerics, and
>>>>>> (3) The restrictions that I want for numeric vectors make more sense at
>>>>>> the index and function level, than at the type level.
>>>>>>
>>>>>> I'm ready to concede that David has the better case here and move
>>>>>> forward with a vector implementation without that restriction.
>>>>>>
>>>>>> On Tue, May 2, 2023 at 4:03 PM David Capwell <dcapw...@apple.com> wrote:
>>>>>>>> How about it, David?
>>>>>>>> Did you already make this?
>>>>>>>
>>>>>>> I checked out the patch, fixed serialize/deserialize, added the
>>>>>>> constraints, then added a composeForFloat(ByteBuffer). With this, the
>>>>>>> impact to the POC patch was the following:
>>>>>>>
>>>>>>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>>>>>>> type.composeForFloat(bb), both return float[]
>>>>>>> 2) change the index validate logic to move away from checking
>>>>>>> VectorType and instead check for that plus the element type ==
>>>>>>> FloatType. I didn’t bother to do this as it’s trivial
>>>>>>>
>>>>>>>> David. End this argument. SHOW THE CODE!
>>>>>>>
>>>>>>> If this argument ends and people are cool with vector supporting
>>>>>>> abstract type, more than glad to help get this in.
>>>>>>>
>>>>>>>> On May 2, 2023, at 1:53 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I'm all for bringing more functionality to the masses sooner, but the
>>>>>>>> original idea has a very very specific use case. Do we have use cases
>>>>>>>> for a general purpose Vector/Array data structure? If so, awesome. I
>>>>>>>> just wondered if generalizing provides value, beyond being
>>>>>>>> straightforward to implement. I'm just trying to be sensitive to the
>>>>>>>> database code maintenance and driver support for general types versus
>>>>>>>> a single type for a specific, well defined purpose.
>>>>>>>>
>>>>>>>> If it could easily be a plugin, that's great - but the full picture
>>>>>>>> involves drivers that need to support it or you end up getting binary
>>>>>>>> blobs you have to decode client side and then do stuff with. So
>>>>>>>> ideally if you have a well defined use case that you can build into
>>>>>>>> the database, having it just be part of the database and associated
>>>>>>>> drivers - that makes the experience much much better.
>>>>>>>>
>>>>>>>> I'm not trying to say B couldn't be valuable or that a plugin couldn't
>>>>>>>> be feasible. I'm just trying to enlarge the picture a bit to see what
>>>>>>>> that means for this use case and for the supporting drivers/clients.
>>>>>>>>
>>>>>>>>> On May 2, 2023, at 3:04 PM, Benedict <bened...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> But it’s so trivial it was already implemented by David in the span
>>>>>>>>> of ten minutes? If anything, we’re slowing progress down by refusing
>>>>>>>>> to do the extra types, as we’re busy arguing about it rather than
>>>>>>>>> delivering a feature?
>>>>>>>>>
>>>>>>>>> FWIW, my interpretation of the votes today is that we SHOULD NOT
>>>>>>>>> (ever) support types beyond float. Not that we should start with
>>>>>>>>> float.
>>>>>>>>>
>>>>>>>>> So, this whole debate is a mess, I think. But hey ho.
>>>>>>>>>
>>>>>>>>>> On 2 May 2023, at 20:57, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'll speak up on that one. If you look at my ranked voting, that is
>>>>>>>>>> where my head is. I get accused of scope creep (a lot) and looking
>>>>>>>>>> at the initial proposal Jonathan put on the ML it was mostly
>>>>>>>>>> "Developers are adopting vector search at a furious pace and I think
>>>>>>>>>> I have a simple way of adding support to keep Cassandra relevant for
>>>>>>>>>> these use cases". Instead of just focusing on this use case, I feel
>>>>>>>>>> the arguments have bike shedded into scope creep which means it will
>>>>>>>>>> take forever to get into the project.
>>>>>>>>>>
>>>>>>>>>> My preference is to see one thing validated with an MVP and get it
>>>>>>>>>> into the hands of developers sooner so we can continue to iterate
>>>>>>>>>> based on actual usage.
>>>>>>>>>>
>>>>>>>>>> It doesn't say your points are wrong or your opinions are broken,
>>>>>>>>>> I'm voting for what I think will be awesome for users sooner.
>>>>>>>>>>
>>>>>>>>>> Patrick
>>>>>>>>>>
>>>>>>>>>> On Tue, May 2, 2023 at 12:29 PM Benedict <bened...@apache.org> wrote:
>>>>>>>>>>> Could folk voting against a general purpose type (that could well
>>>>>>>>>>> be called a vector) briefly explain their reasoning?
>>>>>>>>>>>
>>>>>>>>>>> We established in the other thread that it’s technically trivial,
>>>>>>>>>>> meaning folk must think it is strictly superior to only support
>>>>>>>>>>> float rather than eg all numeric types (note: for the type, not the
>>>>>>>>>>> ANN).
>>>>>>>>>>>
>>>>>>>>>>> I am surprised, and the blurbs accompanying votes so far don’t seem
>>>>>>>>>>> to touch on this, mostly just endorsing the idea of a vector.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On 2 May 2023, at 20:20, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> A > B > C on both polls.
>>>>>>>>>>>>
>>>>>>>>>>>> Having talked to several users in the community that are highly
>>>>>>>>>>>> excited about this change, this gets to what developers want to do
>>>>>>>>>>>> at Cassandra scale: store embeddings and retrieve them.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña
>>>>>>>>>>>> <adelap...@apache.org> wrote:
>>>>>>>>>>>>> A > B > C
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think that ML is such a niche application that it can't
>>>>>>>>>>>>> have its own CQL data type. Also, vectors are mathematical
>>>>>>>>>>>>> elements that have more applications than ML.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>>>>>>>>>>>> Should we add a vector type to Cassandra designed to meet the
>>>>>>>>>>>>>>> needs of machine learning use cases, specifically feature and
>>>>>>>>>>>>>>> embedding vectors for training, inference, and vector search?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of
>>>>>>>>>>>>>>> numeric types, with no nulls allowed, and with no need for
>>>>>>>>>>>>>>> random access. The ML industry overwhelmingly uses float32
>>>>>>>>>>>>>>> vectors, to the point that the industry-leading special-purpose
>>>>>>>>>>>>>>> vector database ONLY supports that data type.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This poll is to gauge consensus subsequent to the recent
>>>>>>>>>>>>>>> discussion thread at
>>>>>>>>>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please rank the discussed options from most preferred option to
>>>>>>>>>>>>>>> least, e.g., A > B > C (A is my preference, followed by B,
>>>>>>>>>>>>>>> followed by C) or C > B = A (C is my preference, followed by B
>>>>>>>>>>>>>>> or A approximately equally.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (A) I am in favor of adding a vector type for floats; I do not
>>>>>>>>>>>>>>> believe we need to tie it to any particular implementation
>>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (B) I am okay with adding a vector type but I believe we must
>>>>>>>>>>>>>>> add array types that compose with all Cassandra types first,
>>>>>>>>>>>>>>> and make vectors a special case of arrays-without-null-elements.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A > B > C
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> B is stated as "must add array types…". I think this is a bit
>>>>>>>>>>>>>> loaded. If B was the (A + the implementation needs to be a
>>>>>>>>>>>>>> non-null frozen float32 array, serialisation forward compatible
>>>>>>>>>>>>>> with other frozen arrays later implemented) I would put this
>>>>>>>>>>>>>> before (A). Especially because it's been shown already this is
>>>>>>>>>>>>>> easy to implement.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jonathan Ellis
>>>>>> co-founder, http://www.datastax.com
>>>>>> @spyced
>