Re: [POLL] Vector type for ML

Patrick McFadin Fri, 05 May 2023 08:54:19 -0700

I hope we are willing to consider developers that use our system because if
I had to teach people to use "NON-NULL FROZEN<TYPE[n]>" I'm pretty sure the
response would be:


Did you tell me to go write a distributed map-reduce job in Erlang? I
believe I did, Bob.

On Fri, May 5, 2023 at 8:05 AM Josh McKenzie <[email protected]> wrote:

> Idiomatically, to my mind, there's a question of "what space are we
> thinking about this datatype in"?
>
> - In the context of mathematics, nullability in a vector would be 0
> - In the context of Cassandra, nullability tends to mean a tombstone (or
> nothing)
> - In the context of programming languages, it's all over the place
>
> Given many models are exploring quantizing to int8 and other data types,
> there's definitely a "support other data types easily in the future"
> piece that, to me, we need to keep in mind.
>
> So with the above and the "meet the user where they are and don't make
> them understand more of Cassandra than absolutely critical to use it", I
> lean:
>
> 1. DENSE_VECTOR<type, dimension>
> 2. VECTOR<type, dimension>
> 3. type[dimension]
>
> This leaves the path open for us to expand on it in the future with sparse
> support and allows us to introduce some semantics that indicate idioms
> around nullability for the users coming from a different space.
>
> "NON-NULL FROZEN<TYPE[n]>" is strictly correct, however it requires
> understanding idioms of how Cassandra thinks about data (nulls mean
> different things to us, we have differences between frozen and non-frozen
> due to constraints in our storage engine and materialization of data, etc)
> that get in the way of users doing things in the pattern they're familiar
> with without learning more about the DB than they're probably looking to
> learn. Historically this has been a challenge for us in adoption; the
> classic "Why can't I just write and delete and write as much as I want? Why
> are deletes filling up my disk?" problem comes to mind.
>
> I'd also be happy with us supporting:
> * NON-NULL FROZEN<TYPE[n]>
> * DENSE_VECTOR<type, dimension> as syntactic sugar for the above
>
> If getting into the "built-in syntactic sugar mapping for communities and
> specific use-cases" is something we're willing to consider.
>
> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>
> I think we are still discussing implementation here when I'm talking about
> developer experience. I want developers to adopt this quickly, easily and
> be successful. Vector search is already a thing. People use it every day. A
> successful outcome, in my view, is developers picking up this feature
> without reading a manual. (Because they don't anyway and get in trouble) I
> did some more extensive research about what other DBs are using for syntax.
> The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>
> Pinecone[1]: dense_vector, sparse_vector
> Elastic[2]: dense_vector
> Milvus[3]: float_vector, binary_vector
> pgvector[4]: vector
> Weaviate[5]: Different approach. All typed arrays can be indexed
>
> Based on that I'm advocating a similar syntax:
>
> - DENSE VECTOR
> or
> - VECTOR
>
> [1] https://docs.pinecone.io/docs/hybrid-search
> [2]
> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
> [3] https://milvus.io/docs/create_collection.md
> [4] https://github.com/pgvector/pgvector
> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>
> On Fri, May 5, 2023 at 6:07 AM Mike Adamson <[email protected]> wrote:
>
> Then we can have the indexing apparatus only accept *frozen<float[n]>* for
> the HNSW case.
>
> I'm inclined to agree with Benedict that the index will need to be
> specifically selected by option rather than inferred based on type. As such
> there is no real reason for the *frozen* requirement on the type. The
> HNSW index can be built just as easily from a non-frozen array.
>
> I am in favour of enforcing non-null on the elements of an array by
> default. I would prefer that allowing nulls in the array would be a later
> addition if and when a use case arose for it.
>
> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe <[email protected]>
> wrote:
>
> Even in the ML case, sparse can just mean zeros rather than nulls, and
> they should compress similarly anyway.
>
> If we really want null values, I'd rather leave that in collections space.
>
> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe <[email protected]>
> wrote:
>
> I actually still prefer *type[dimension]*, because I think I intuitively
> read this as a primitive (meaning no null elements) array. Then we can have
> the indexing apparatus only accept *frozen<float[n]>* for the HNSW case.
>
> If that isn't intuitive to anyone else, I don't really have a strong
> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
> should indicate single vs. multi-cell, and the other the presence or
> absence of nulls/zeros/whatever.
>
> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin <[email protected]>
> wrote:
>
> I agree with David's reasoning and the use of DENSE (and maybe eventually
> SPARSE). This is terminology well established in the data world, and it
> would lead to much easier adoption from users. VECTOR is close, but I can
> see having to create a lot of content around "How to use it and not get in
> trouble." (I have a lot of that content already)
>
>  - We don't have to explain what it is. A lot of prior art out there
> already [1][2][3]
>  - We're matching an established term with what users would expect. No
> surprises.
>  - Shorter ramp-up time for users. Cassandra is being modernized.
>
> The implementation is flexible, but the interface should empower our users
> to be awesome.
>
> Patrick
>
> 1 -
> https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
> 2 -
> https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
> 3 -
> https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>
> On Thu, May 4, 2023 at 10:25 AM David Capwell <[email protected]> wrote:
>
> My views have changed over time on syntax and I feel type[dimension] may
> not be the best, so it has gone lower in my own personal ranking… this is
> my current preference
>
> 1) DENSE <type>[dimension] | NON NULL <type>[dimension]
> 2) VECTOR<type, dimension>
> 3) type[dimension]
>
> My reasoning for this order
>
> * type[dimension] looks like syntax sugar for array<type, dimension>, so
> users may assume list/array semantics, but we limit to non-null elements in
> a frozen array
> * VECTOR as a prefix feels out of place, but VECTOR as a direct type
> makes more sense… this also leads to a possible future of VECTOR<type>
> which is the non-fixed length version of this type.  What makes VECTOR
> different from list/array?  non-null elements and is frozen.  I don’t feel
> that VECTOR really tells users to expect non-null or frozen semantics, as
> there exist different VECTOR types for those reasons (sparse vs dense)…
> * DENSE may be confusing for people coming from languages where this just
> means “sequential layout”, which is what our frozen array/list already are…
> but since the target user is coming from a ML background, this shouldn’t
> cause much confusion.  DENSE just means FROZEN in Cassandra, with NON NULL
> elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just acts as
> syntax sugar for frozen<non null type[dimension]>
>
>
> On May 4, 2023, at 4:13 AM, Brandon Williams <[email protected]> wrote:
>
> 1. VECTOR<FLOAT,n>
> 2. VECTOR FLOAT[n]
> 3. FLOAT[N]   (Non null by default)
>
> Redundant or not, I think having the VECTOR keyword helps signify what
> the app is generally about and helps get buy-in from ML stakeholders.
>
> On Thu, May 4, 2023 at 3:45 AM Benedict <[email protected]> wrote:
>
>
> Hurrah for initial agreement.
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
> think VECTOR should be used to simply imply non-null, as this would be very
> unintuitive. More logical would be NONNULL, if this is the only condition
> being applied. Alternatively for arrays we could default to NONNULL and
> later introduce NULLABLE if we want to permit nulls.
>
> If the word vector is to be used it makes more sense to make it look like
> a list, so VECTOR<FLOAT, N> as here the word VECTOR is clearly not
> redundant.
>
> So, I vote:
>
> 1) (NON NULL) FLOAT[N]
> 2) FLOAT[N]   (Non null by default)
> 3) VECTOR<FLOAT, N>
>
>
>
> On 4 May 2023, at 08:52, Mick Semb Wever <[email protected]> wrote:
>
>
>
>
> Did we agree on a CQL syntax?
>
> I don’t believe there has been a poll on CQL syntax… my understanding
> reading all the threads is that there are ~4-5 options and none are -1ed, so
> I believe we are waiting for majority rule on this?
>
>
>
>
> Re-reading that thread, IIUC the valid choices remaining are…
>
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR<FLOAT,n>
> 4. VECTOR[n]<FLOAT>
> 5. ARRAY<FLOAT, n>
> 6. NON-NULL FROZEN<FLOAT[n]>
>
>
> Yes I'm putting my preference (1) first ;) because (banging on) if the
> future of CQL will have FLOAT[n] and FROZEN<FLOAT[n]>, and the VECTOR
> keyword, for general CQL users, just means "non-null and frozen", then
> these gel best together.
>
> Options (5) and (6) are for those that feel we can and should provide this
> type without introducing the vector keyword.
>
>
>
>
>
