I hope we are willing to consider developers that use our system because if I had to teach people to use "NON-NULL FROZEN<TYPE[n]>" I'm pretty sure the response would be:
Did you tell me to go write a distributed map-reduce job in Erlang? I believe I did, Bob.

On Fri, May 5, 2023 at 8:05 AM Josh McKenzie <jmcken...@apache.org> wrote:

> Idiomatically, to my mind, there's a question of "what space are we
> thinking about this datatype in"?
>
> - In the context of mathematics, nullability in a vector would be 0
> - In the context of Cassandra, nullability tends to mean a tombstone (or
>   nothing)
> - In the context of programming languages, it's all over the place
>
> Given many models are exploring quantizing to int8 and other data types,
> there's definitely the "support other data types easily in the future"
> piece that, to me, we need to keep in mind.
>
> So with the above and the "meet the user where they are and don't make
> them understand more of Cassandra than absolutely critical to use it", I
> lean:
>
> 1. DENSE_VECTOR<type, dimension>
> 2. VECTOR<type, dimension>
> 3. type[dimension]
>
> This leaves the path open for us to expand on it in the future with sparse
> support and allows us to introduce some semantics that indicate idioms
> around nullability for the users coming from a different space.
>
> "NON-NULL FROZEN<TYPE[n]>" is strictly correct, however it requires
> understanding idioms of how Cassandra thinks about data (nulls mean
> different things to us, we have differences between frozen and non-frozen
> due to constraints in our storage engine and materialization of data, etc)
> that get in the way of users doing things in the pattern they're familiar
> with without learning more about the DB than they're probably looking to
> learn. Historically this has been a challenge for us in adoption; the
> classic "Why can't I just write and delete and write as much as I want? Why
> are deletes filling up my disk?" problem comes to mind.
>
> I'd also be happy with us supporting:
> * NON-NULL FROZEN<TYPE[n]>
> * DENSE_VECTOR<type, dimension> as syntactic sugar for the above
>
> If getting into the "built-in syntactic sugar mapping for communities and
> specific use-cases" is something we're willing to consider.
>
> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>
> I think we are still discussing implementation here when I'm talking about
> developer experience. I want developers to adopt this quickly, easily and
> be successful. Vector search is already a thing. People use it every day. A
> successful outcome, in my view, is developers picking up this feature
> without reading a manual. (Because they don't anyway and get in trouble) I
> did some more extensive research about what other DBs are using for syntax.
> The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>
> Pinecone[1]: dense_vector, sparse_vector
> Elastic[2]: dense_vector
> Milvus[3]: float_vector, binary_vector
> pgvector[4]: vector
> Weaviate[5]: Different approach. All typed arrays can be indexed
>
> Based on that I'm advocating a similar syntax:
>
> - DENSE VECTOR
> or
> - VECTOR
>
> [1] https://docs.pinecone.io/docs/hybrid-search
> [2] https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
> [3] https://milvus.io/docs/create_collection.md
> [4] https://github.com/pgvector/pgvector
> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>
> On Fri, May 5, 2023 at 6:07 AM Mike Adamson <madam...@datastax.com> wrote:
>
> Then we can have the indexing apparatus only accept *frozen<float[n]>* for
> the HNSW case.
>
> I'm inclined to agree with Benedict that the index will need to be
> specifically selected by option rather than inferred based on type. As such
> there is no real reason for the *frozen* requirement on the type. The
> HNSW index can be built just as easily from a non-frozen array.
>
> I am in favour of enforcing non-null on the elements of an array by
> default.
> I would prefer that allowing nulls in the array would be a later
> addition if and when a use case arose for it.
>
> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>
> Even in the ML case, sparse can just mean zeros rather than nulls, and
> they should compress similarly anyway.
>
> If we really want null values, I'd rather leave that in collections space.
>
> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>
> I actually still prefer *type[dimension]*, because I think I intuitively
> read this as a primitive (meaning no null elements) array. Then we can have
> the indexing apparatus only accept *frozen<float[n]>* for the HNSW case.
>
> If that isn't intuitive to anyone else, I don't really have a strong
> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
> should indicate single vs. multi-cell, and the other the presence or
> absence of nulls/zeros/whatever.
>
> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>
> I agree with David's reasoning and the use of DENSE (and maybe eventually
> SPARSE). This is terminology well established in the data world, and it
> would lead to much easier adoption from users. VECTOR is close, but I can
> see having to create a lot of content around "How to use it and not get in
> trouble." (I have a lot of that content already)
>
> - We don't have to explain what it is. A lot of prior art out there
>   already [1][2][3]
> - We're matching an established term with what users would expect. No
>   surprises.
> - Shorter ramp-up time for users. Cassandra is being modernized.
>
> The implementation is flexible, but the interface should empower our users
> to be awesome.
>
> Patrick
>
> 1 - https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
> 2 - https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
> 3 - https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>
> On Thu, May 4, 2023 at 10:25 AM David Capwell <dcapw...@apple.com> wrote:
>
> My views have changed over time on syntax and I feel type[dimension] may
> not be the best, so it has gone lower in my own personal ranking… this is
> my current preference
>
> 1) DENSE <type>[dimension] | NON NULL <type>[dimension]
> 2) VECTOR<type, dimension>
> 3) type[dimension]
>
> My reasoning for this order
>
> * type[dimension] looks like syntax sugar for array<type, dimension>, so
>   users may assume list/array semantics, but we limit to non-null elements in
>   a frozen array
> * VECTOR as a prefix feels out of place, but VECTOR as a direct type
>   makes more sense… this also leads to a possible future of VECTOR<type>
>   which is the non-fixed length version of this type. What makes VECTOR
>   different from list/array? non-null elements and is frozen.
>   I don’t feel that VECTOR really tells users to expect non-null or frozen
>   semantics, as there exist different VECTOR types for those reasons
>   (sparse vs dense)…
> * DENSE may be confusing for people coming from languages where this just
>   means “sequential layout”, which is what our frozen array/list already are…
>   but since the target user is coming from a ML background, this shouldn’t
>   offer much confusion. DENSE just means FROZEN in Cassandra, with NON NULL
>   elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just acts as
>   syntax sugar for frozen<non null type[dimension]>
>
> On May 4, 2023, at 4:13 AM, Brandon Williams <dri...@gmail.com> wrote:
>
> 1. VECTOR<FLOAT,n>
> 2. VECTOR FLOAT[n]
> 3. FLOAT[N] (Non null by default)
>
> Redundant or not, I think having the VECTOR keyword helps signify what
> the app is generally about and helps get buy-in from ML stakeholders.
>
> On Thu, May 4, 2023 at 3:45 AM Benedict <bened...@apache.org> wrote:
>
> Hurrah for initial agreement.
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
> think VECTOR should be used to simply imply non-null, as this would be very
> unintuitive. More logical would be NONNULL, if this is the only condition
> being applied. Alternatively for arrays we could default to NONNULL and
> later introduce NULLABLE if we want to permit nulls.
>
> If the word vector is to be used it makes more sense to make it look like
> a list, so VECTOR<FLOAT, N> as here the word VECTOR is clearly not
> redundant.
>
> So, I vote:
>
> 1) (NON NULL) FLOAT[N]
> 2) FLOAT[N] (Non null by default)
> 3) VECTOR<FLOAT, N>
>
> On 4 May 2023, at 08:52, Mick Semb Wever <m...@apache.org> wrote:
>
> Did we agree on a CQL syntax?
>
> I don’t believe there has been a poll on CQL syntax… my understanding
> reading all the threads is that there are ~4-5 options and none are -1ed, so
> I believe we are waiting for majority rule on this?
>
> Re-reading that thread, IIUC the valid choices remaining are…
>
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR<FLOAT,n>
> 4. VECTOR[n]<FLOAT>
> 5. ARRAY<FLOAT, n>
> 6. NON-NULL FROZEN<FLOAT[n]>
>
> Yes I'm putting my preference (1) first ;) because (banging on) if the
> future of CQL will have FLOAT[n] and FROZEN<FLOAT[n]>, where the VECTOR
> keyword is: for general cql users; just meaning "non-null and frozen",
> these gel best together.
>
> Options (5) and (6) are for those that feel we can and should provide this
> type without introducing the vector keyword.