>
> ...where, just to be clear, VECTOR<type, dimension> means a frozen fixed
> size array w/ no null values?
>
Assuming this is the case, my vote is:

1. VECTOR<type, dimension>
2. DENSE VECTOR<type, dimension>

I don't really have a 3rd vote because I think that *type[dimension]* is
too ambiguous.
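
For anyone who wants the assumed semantics spelled out, here is a minimal Python sketch of "frozen, fixed size, no null values". It is only an illustration, not Cassandra code: the `DenseVector` class and its fields are hypothetical names, and none of the proposed CQL spellings is implied by it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after construction
class DenseVector:
    """Hypothetical model of VECTOR<float, dimension> as discussed in this thread."""
    dimension: int
    values: Tuple[float, ...]  # a tuple, so elements cannot be replaced in place

    def __post_init__(self):
        # Fixed size: the element count must equal the declared dimension.
        if len(self.values) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} elements, got {len(self.values)}"
            )
        # No nulls: every element must be present.
        if any(v is None for v in self.values):
            raise ValueError("null elements are not allowed")


v = DenseVector(3, (0.1, 0.2, 0.3))
```

Under this reading, a value with the wrong length or with a null element is rejected when it is built, and an existing value can only be replaced as a whole.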


On Fri, 5 May 2023 at 18:32, Derek Chen-Becker <de...@chen-becker.org>
wrote:

> LOL, I'm holding you to that at the summit :) In all seriousness, I'm glad
> to see a robust debate around it. I guess for completeness, my order of
> preference is
>
> 1 - NONNULL FROZEN<TYPE<N>>
> 2 - NONNULL TYPE<N> (which part of this implies frozen? The NONNULL or the
> cardinality?)
> 3 - DENSE_VECTOR<type, N>
>
> I guess my main concern with just "VECTOR" is that it's such an overloaded
> term. Maybe in ML it means something specific, but for anyone coming from
> C++, Rust, Java, etc, a Vector is both mutable and can carry null (or
> equivalent, e.g. None, in Rust). If the argument hadn't also been made that
> we should be working toward something that's not ML-specific maybe I would
> be less concerned.
>
> Cheers,
>
> Derek
>
>
> On Fri, May 5, 2023 at 11:14 AM Patrick McFadin <pmcfa...@gmail.com>
> wrote:
>
>> Derek, despite your preference, I would hang out with you at a party.
>>
>> On Fri, May 5, 2023 at 9:44 AM Derek Chen-Becker <de...@chen-becker.org>
>> wrote:
>>
>>> Speaking as someone who likes Erlang, maybe that's why I also like
>>> NONNULL FROZEN<TYPE<[n]>>. It's unambiguous what Cassandra is going to do
>>> with that type. DENSE VECTOR means I need to go read docs (and then
>>> probably double-check in the source to be sure) to be sure what exactly is
>>> going on.
>>>
>>> Cheers,
>>>
>>> Derek
>>>
>>> On Fri, May 5, 2023 at 9:54 AM Patrick McFadin <pmcfa...@gmail.com>
>>> wrote:
>>>
>>>> I hope we are willing to consider developers that use our system
>>>> because if I had to teach people to use "NON-NULL FROZEN<TYPE[n]>" I'm
>>>> pretty sure the response would be:
>>>>
>>>> Did you tell me to go write a distributed map-reduce job in Erlang? I
>>>> believe I did, Bob.
>>>>
>>>> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie <jmcken...@apache.org>
>>>> wrote:
>>>>
>>>>> Idiomatically, to my mind, there's a question of "what space are we
>>>>> thinking about this datatype in"?
>>>>>
>>>>> - In the context of mathematics, nullability in a vector would be 0
>>>>> - In the context of Cassandra, nullability tends to mean a tombstone
>>>>> (or nothing)
>>>>> - In the context of programming languages, it's all over the place
>>>>>
>>>>> Given many models are exploring quantizing to int8 and other data
>>>>> types, there's definitely the "support other data types easily in the
>>>>> future" piece to me we need to keep in mind.
>>>>>
>>>>> So with the above and the "meet the user where they are and don't make
>>>>> them understand more of Cassandra than absolutely critical to use it", I
>>>>> lean:
>>>>>
>>>>> 1. DENSE_VECTOR<type, dimension>
>>>>> 2. VECTOR<type, dimension>
>>>>> 3. type[dimension]
>>>>>
>>>>> This leaves the path open for us to expand on it in the future with
>>>>> sparse support and allows us to introduce some semantics that indicate
>>>>> idioms around nullability for the users coming from a different space.
>>>>>
>>>>> "NON-NULL FROZEN<TYPE[n]>" is strictly correct, however it requires
>>>>> understanding idioms of how Cassandra thinks about data (nulls mean
>>>>> different things to us, we have differences between frozen and non-frozen
>>>>> due to constraints in our storage engine and materialization of data, etc)
>>>>> that get in the way of users doing things in the pattern they're familiar
>>>>> with without learning more about the DB than they're probably looking to
>>>>> learn. Historically this has been a challenge for us in adoption; the
>>>>> classic "Why can't I just write and delete and write as much as I want?
>>>>> Why are deletes filling up my disk?" problem comes to mind.
>>>>>
>>>>> I'd also be happy with us supporting:
>>>>> * NON-NULL FROZEN<TYPE[n]>
>>>>> * DENSE_VECTOR<type, dimension> as syntactic sugar for the above
>>>>>
>>>>> If getting into the "built-in syntactic sugar mapping for communities
>>>>> and specific use-cases" is something we're willing to consider.
>>>>>
>>>>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>>>>
>>>>> I think we are still discussing implementation here when I'm talking
>>>>> about developer experience. I want developers to adopt this quickly,
>>>>> easily and be successful. Vector search is already a thing. People use
>>>>> it every day. A successful outcome, in my view, is developers picking up
>>>>> this feature without reading a manual. (Because they don't anyway and
>>>>> get in trouble) I did some more extensive research about what other DBs
>>>>> are using for syntax. The consensus is some variety of 'VECTOR', 'DENSE'
>>>>> and 'SPARSE'
>>>>> Pinecone[1]: dense_vector, sparse_vector
>>>>> Elastic[2]: dense_vector
>>>>> Milvus[3]: float_vector, binary_vector
>>>>> pgvector[4]: vector
>>>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>>>>
>>>>> Based on that I'm advocating a similar syntax:
>>>>>
>>>>> - DENSE VECTOR
>>>>> or
>>>>> - VECTOR
>>>>>
>>>>> [1] https://docs.pinecone.io/docs/hybrid-search
>>>>> [2] https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>>>>> [3] https://milvus.io/docs/create_collection.md
>>>>> [4] https://github.com/pgvector/pgvector
>>>>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>>>>
>>>>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson <madam...@datastax.com>
>>>>> wrote:
>>>>>
>>>>> Then we can have the indexing apparatus only accept *frozen<float[n]>* for
>>>>> the HNSW case.
>>>>>
>>>>> I'm inclined to agree with Benedict that the index will need to be
>>>>> specifically selected by option rather than inferred based on type. As
>>>>> such there is no real reason for the *frozen* requirement on the type.
>>>>> The HNSW index can be built just as easily from a non-frozen array.
>>>>>
>>>>> I am in favour of enforcing non-null on the elements of an array by
>>>>> default. I would prefer that allowing nulls in the array would be a later
>>>>> addition if and when a use case arose for it.
>>>>>
>>>>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe <calebrackli...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Even in the ML case, sparse can just mean zeros rather than nulls, and
>>>>> they should compress similarly anyway.
>>>>>
>>>>> If we really want null values, I'd rather leave that in collections
>>>>> space.
>>>>>
>>>>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe <
>>>>> calebrackli...@gmail.com> wrote:
>>>>>
>>>>> I actually still prefer *type[dimension]*, because I think I
>>>>> intuitively read this as a primitive (meaning no null elements) array.
>>>>> Then we can have the indexing apparatus only accept *frozen<float[n]>*
>>>>> for the HNSW case.
>>>>>
>>>>> If that isn't intuitive to anyone else, I don't really have a strong
>>>>> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
>>>>> should indicate single vs. multi-cell, and the other the presence or
>>>>> absence of nulls/zeros/whatever.
>>>>>
>>>>> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin <pmcfa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> I agree with David's reasoning and the use of DENSE (and maybe
>>>>> eventually SPARSE). This is terminology well established in the data
>>>>> world, and it would lead to much easier adoption from users. VECTOR is
>>>>> close, but I can see having to create a lot of content around "How to
>>>>> use it and not get in trouble." (I have a lot of that content already)
>>>>>
>>>>>  - We don't have to explain what it is. A lot of prior art out there
>>>>> already [1][2][3]
>>>>>  - We're matching an established term with what users would expect. No
>>>>> surprises.
>>>>>  - Shorter ramp-up time for users. Cassandra is being modernized.
>>>>>
>>>>> The implementation is flexible, but the interface should empower our
>>>>> users to be awesome.
>>>>>
>>>>> Patrick
>>>>>
>>>>> 1 - https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>>>>> 2 - https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>>>>> 3 - https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>>>>>
>>>>> On Thu, May 4, 2023 at 10:25 AM David Capwell <dcapw...@apple.com>
>>>>> wrote:
>>>>>
>>>>> My views have changed over time on syntax and I feel type[dimension]
>>>>> may not be the best, so it has gone lower in my own personal ranking… this
>>>>> is my current preference
>>>>>
>>>>> 1) DENSE <type>[dimension] | NON NULL <type>[dimension]
>>>>> 2) VECTOR<type, dimension>
>>>>> 3) type[dimension]
>>>>>
>>>>> My reasoning for this order
>>>>>
>>>>> * type[dimension] looks like syntax sugar for array<type, dimension>,
>>>>> so users may assume list/array semantics, but we limit to non-null
>>>>> elements in a frozen array
>>>>> * I feel VECTOR as a prefix is out of place, but VECTOR as a direct
>>>>> type makes more sense… this also leads to a possible future of
>>>>> VECTOR<type>, which is the non-fixed length version of this type.  What
>>>>> makes VECTOR different from list/array?  Non-null elements and being
>>>>> frozen.  I don’t feel that VECTOR really tells users to expect non-null
>>>>> or frozen semantics, as there exist different VECTOR types for those
>>>>> reasons (sparse vs dense)…
>>>>> * DENSE may be confusing for people coming from languages where this
>>>>> just means “sequential layout”, which is what our frozen array/list
>>>>> already are… but since the target user is coming from an ML background,
>>>>> this shouldn’t offer much confusion.  DENSE just means FROZEN in
>>>>> Cassandra, with NON NULL elements (SPARSE allows for NULL and isn’t
>>>>> frozen)… So DENSE just acts as syntax sugar for
>>>>> frozen<non null type[dimension]>
>>>>>
>>>>>
>>>>> On May 4, 2023, at 4:13 AM, Brandon Williams <dri...@gmail.com> wrote:
>>>>>
>>>>> 1. VECTOR<FLOAT,n>
>>>>> 2. VECTOR FLOAT[n]
>>>>> 3. FLOAT[N]   (Non null by default)
>>>>>
>>>>> Redundant or not, I think having the VECTOR keyword helps signify what
>>>>> the app is generally about and helps get buy-in from ML stakeholders.
>>>>>
>>>>> On Thu, May 4, 2023 at 3:45 AM Benedict <bened...@apache.org> wrote:
>>>>>
>>>>>
>>>>> Hurrah for initial agreement.
>>>>>
>>>>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>>>>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>>>>> think VECTOR should be used to simply imply non-null, as this would be
>>>>> very unintuitive. More logical would be NONNULL, if this is the only
>>>>> condition being applied. Alternatively for arrays we could default to
>>>>> NONNULL and later introduce NULLABLE if we want to permit nulls.
>>>>>
>>>>> If the word vector is to be used it makes more sense to make it look
>>>>> like a list, so VECTOR<FLOAT, N> as here the word VECTOR is clearly not
>>>>> redundant.
>>>>>
>>>>> So, I vote:
>>>>>
>>>>> 1) (NON NULL) FLOAT[N]
>>>>> 2) FLOAT[N]   (Non null by default)
>>>>> 3) VECTOR<FLOAT, N>
>>>>>
>>>>>
>>>>>
>>>>> On 4 May 2023, at 08:52, Mick Semb Wever <m...@apache.org> wrote:
>>>>>
>>>>> 
>>>>>
>>>>>
>>>>> Did we agree on a CQL syntax?
>>>>>
>>>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>>>> reading all the threads is that there are ~4-5 options and none are
>>>>> -1ed, so I believe we are waiting for majority rule on this?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Re-reading that thread, IIUC the valid choices remaining are…
>>>>>
>>>>> 1. VECTOR FLOAT[n]
>>>>> 2. FLOAT VECTOR[n]
>>>>> 3. VECTOR<FLOAT,n>
>>>>> 4. VECTOR[n]<FLOAT>
>>>>> 5. ARRAY<FLOAT, n>
>>>>> 6. NON-NULL FROZEN<FLOAT[n]>
>>>>>
>>>>>
>>>>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>>>>> future of CQL will have FLOAT[n] and FROZEN<FLOAT[n]>, where the VECTOR
>>>>> keyword, for general CQL users, just means "non-null and frozen",
>>>>> these gel best together.
>>>>>
>>>>> Options (5) and (6) are for those that feel we can and should provide
>>>>> this type without introducing the vector keyword.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Mike Adamson*
>>>>> Engineering
>>>>> +1 650 389 6000 <16503896000> | datastax.com
>>>>> <https://www.datastax.com/>
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> +---------------------------------------------------------------+
>>> | Derek Chen-Becker                                             |
>>> | GPG Key available at https://keybase.io/dchenbecker and       |
>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>> +---------------------------------------------------------------+
>>>
>>>
>
>
>

-- 
*Mike Adamson*
Engineering

+1 650 389 6000 <16503896000> | datastax.com <https://www.datastax.com/>
