Re: [POLL] Vector type for ML

Dinesh Joshi Tue, 02 May 2023 15:37:06 -0700

I'm also in favor of having a general data type that is not tied to numeric 
data types alone.
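
For readers following along, here is a minimal sketch of the split discussed in the quoted messages below: fixed length and no nulls enforced by the type for any element type, with the float-only restriction applied at the index level. It is an illustration under stated assumptions, not the POC patch. `SketchVectorType` and `SketchAnnIndex` are hypothetical names, and the body of `composeForFloat` is a guess at the behaviour David describes (a `ByteBuffer` in, a `float[]` out), assuming a plain big-endian float32 layout.

```java
import java.nio.ByteBuffer;
import java.util.List;

/** Hypothetical stand-in for a general vector type; not Cassandra's actual VectorType. */
final class SketchVectorType<T> {
    final Class<T> elementType;
    final int dimension;

    SketchVectorType(Class<T> elementType, int dimension) {
        if (dimension <= 0)
            throw new IllegalArgumentException("dimension must be positive");
        this.elementType = elementType;
        this.dimension = dimension;
    }

    /** Type-level constraints: fixed length and no nulls, for any element type. */
    List<T> validate(List<T> values) {
        if (values.size() != dimension)
            throw new IllegalArgumentException(
                "expected " + dimension + " elements, got " + values.size());
        for (T v : values)
            if (v == null)
                throw new IllegalArgumentException("null elements are not allowed");
        return values;
    }

    /** Decodes big-endian float32 values; only meaningful for float vectors. */
    float[] composeForFloat(ByteBuffer bb) {
        if (!Float.class.equals(elementType))
            throw new IllegalStateException("not a float vector");
        if (bb.remaining() != dimension * Float.BYTES)
            throw new IllegalArgumentException("wrong serialized size");
        ByteBuffer in = bb.duplicate(); // leave the caller's position untouched
        float[] out = new float[dimension];
        for (int i = 0; i < dimension; i++)
            out[i] = in.getFloat();
        return out;
    }
}

/** Hypothetical index-level check: the numeric restriction lives here, not in the type. */
final class SketchAnnIndex {
    static void validate(SketchVectorType<?> type) {
        if (!Float.class.equals(type.elementType))
            throw new IllegalArgumentException("ANN index requires a float vector");
    }
}
```

With this shape, a vector of strings is a valid column type but is rejected when someone tries to build an ANN index on it, which is the arrangement point (3) below argues for.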


On 2023/05/02 22:27:24 Jonathan Ellis wrote:
> I had a call with David.  We agreed that we want a "vector" data type with
> these properties
> 
> - Fixed length
> - No nulls
> - Random access not supported
> 
> Where we disagreed was on my proposal to restrict vectors to only numeric
> data.  David's points were that
> 
> (1) He has a use case today for a data type with the other vector
> properties,
> (2) It doesn't seem reasonable to create two data types with the same
> properties, one of which is restricted to numerics, and
> (3) The restrictions that I want for numeric vectors make more sense at the
> index and function level, than at the type level.
> 
> I'm ready to concede that David has the better case here and move forward
> with a vector implementation without that restriction.
> 
> On Tue, May 2, 2023 at 4:03 PM David Capwell <dcapw...@apple.com> wrote:
> 
> >  How about it, David? Did you already make this?
> >
> >
> > I checked out the patch, fixed serialize/deserialize, added the
> > constraints, then added a composeForFloat(ByteBuffer), with this the impact
> > to the POC patch was the following
> >
> > 1) move away from VectorType.instance.serializer().deserialize(bb) to
> > type.composeForFloat(bb), both return float[]
> > 2) change the index validate logic to move away from checking VectorType
> > and instead check for that plus the element type == FloatType.  I didn’t
> > bother to do this as it’s trivial
> >
> > David. End this argument. SHOW THE CODE!
> >
> >
> > If this argument ends and people are cool with vector supporting abstract
> > type, more than glad to help get this in.
> >
> > On May 2, 2023, at 1:53 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com>
> > wrote:
> >
> > I'm all for bringing more functionality to the masses sooner, but the
> > original idea has a very very specific use case.  Do we have use cases for
> > a general purpose Vector/Array data structure?  If so, awesome.  I just
> > wondered if generalizing provides value, beyond being straightforward to
> > implement.  I'm just trying to be sensitive to the database code
> > maintenance and driver support for general types versus a single type for a
> > specific, well defined purpose.
> >
> > If it could easily be a plugin, that's great - but the full picture
> > involves drivers that need to support it or you end up getting binary blobs
> > you have to decode client side and then do stuff with.  So ideally if you
> > have a well defined use case that you can build into the database, having
> > it just be part of the database and associated drivers - that makes the
> > experience much much better.
> >
> > I'm not trying to say B couldn't be valuable or that a plugin couldn't be
> > feasible.  I'm just trying to enlarge the picture a bit to see what that
> > means for this use case and for the supporting drivers/clients.
> >
> > On May 2, 2023, at 3:04 PM, Benedict <bened...@apache.org> wrote:
> >
> > But it’s so trivial it was already implemented by David in the span of ten
> > minutes? If anything, we’re slowing progress down by refusing to do the
> > extra types, as we’re busy arguing about it rather than delivering a
> > feature?
> >
> > FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
> > support types beyond float. Not that we should start with float.
> >
> > So, this whole debate is a mess, I think. But hey ho.
> >
> > On 2 May 2023, at 20:57, Patrick McFadin <pmcfa...@gmail.com> wrote:
> >
> > 
> > I'll speak up on that one. If you look at my ranked voting, that is where
> > my head is. I get accused of scope creep (a lot) and looking at the initial
> > proposal Jonathan put on the ML it was mostly "Developers are adopting
> > vector search at a furious pace and I think I have a simple way of adding
> > support to keep Cassandra relevant for these use cases." Instead of just
> > focusing on this use case, I feel the arguments have bikeshedded into
> > scope creep, which means it will take forever to get into the project.
> >
> > My preference is to see one thing validated with an MVP and get it into
> > the hands of developers sooner so we can continue to iterate based on
> > actual usage.
> >
> > It doesn't say your points are wrong or your opinions are broken, I'm
> > voting for what I think will be awesome for users sooner.
> >
> > Patrick
> >
> > On Tue, May 2, 2023 at 12:29 PM Benedict <bened...@apache.org> wrote:
> >
> >> Could folk voting against a general purpose type (that could well be
> >> called a vector) briefly explain their reasoning?
> >>
> >> We established in the other thread that it’s technically trivial, meaning
> >> folk must think it is strictly superior to only support float rather than
> >> eg all numeric types (note: for the type, not the ANN).
> >>
> >> I am surprised, and the blurbs accompanying votes so far don’t seem to
> >> touch on this, mostly just endorsing the idea of a vector.
> >>
> >>
> >> On 2 May 2023, at 20:20, Patrick McFadin <pmcfa...@gmail.com> wrote:
> >>
> >> 
> >> A > B > C on both polls.
> >>
> >> Having talked to several users in the community that are highly excited
> >> about this change, this gets to what developers want to do at Cassandra
> >> scale: store embeddings and retrieve them.
> >>
> >> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña <adelap...@apache.org>
> >> wrote:
> >>
> >>> A > B > C
> >>>
> >>> I don't think that ML is such a niche application that it can't have its
> >>> own CQL data type. Also, vectors are mathematical elements that have more
> >>> applications than ML.
> >>>
> >>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <m...@apache.org> wrote:
> >>>
> >>>>
> >>>>
> >>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <jbel...@gmail.com> wrote:
> >>>>
> >>>>> Should we add a vector type to Cassandra designed to meet the needs of
> >>>>> machine learning use cases, specifically feature and embedding vectors for
> >>>>> training, inference, and vector search?
> >>>>>
> >>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
> >>>>> types, with no nulls allowed, and with no need for random access. The ML
> >>>>> industry overwhelmingly uses float32 vectors, to the point that the
> >>>>> industry-leading special-purpose vector database ONLY supports that data
> >>>>> type.
> >>>>>
> >>>>> This poll is to gauge consensus subsequent to the recent discussion
> >>>>> thread at
> >>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
> >>>>>
> >>>>> Please rank the discussed options from most preferred option to least,
> >>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or
> >>>>> C > B = A (C is my preference, followed by B or A approximately equally.)
> >>>>>
> >>>>> (A) I am in favor of adding a vector type for floats; I do not believe
> >>>>> we need to tie it to any particular implementation details.
> >>>>>
> >>>>> (B) I am okay with adding a vector type but I believe we must add
> >>>>> array types that compose with all Cassandra types first, and make vectors a
> >>>>> special case of arrays-without-null-elements.
> >>>>>
> >>>>> (C) I am not in favor of adding a built-in vector type.
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> A  > B > C
> >>>>
> >>>> B is stated as "must add array types…".  I think this is a bit loaded.
> >>>> If B was the (A + the implementation needs to be a non-null frozen float32
> >>>> array, serialisation forward compatible with other frozen arrays later
> >>>> implemented) I would put this before (A).  Especially because it's been
> >>>> shown already this is easy to implement.
> >>>>
> >>>>
> >>>>
> >>>
> >
> >
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
