Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

David Capwell Wed, 17 May 2023 10:39:16 -0700
Thanks for the update, LGTM

> On May 17, 2023, at 5:35 AM, Jasonstack Zhao Yang <jasonstack.z...@gmail.com> 
> wrote:
> 
> Hi,
> 
> I have updated the CEP with some details about distributed queries in the 
> Approach section.
> 
> David:
> 
> > given results have a real ranking, the current 2i logic may yield incorrect 
> > results
> 
> C* internal iterators are all in primary key order. So we need two in-memory 
> top-k filters, one at replica side and one at coordinator side, to make sure 
> the returned rows are actually top-k but still primary key order.
> 
> > if 1 of the queries fails and can’t fall back to peers… does the query fail 
> > (I assume so)
> 
> yes, it will fail. we can make it pass if lower recall is acceptable.
> 
> Caleb:
> 
> > With smaller clusters or use-cases that are extremely 
> > write-heavy/read-light, it's possible that the full scatter/gather won't be 
> > too onerous, especially w/ a few small tweaks (on top of a non-vnode 
> > cluster)
> 
> You are right. Smaller cluster would definitely requires less coordinator 
> memory to cache all required replicas' responses.
> 
> 
> Jeremy:
> 
> >  With SAI, can you have partial results?  When you have a query that is 
> > non-key based, you need to have full token range coverage of the results.  
> > If that isn't possible, will Vector Search/SAI return partial results?
> 
> No partial result allowed. Query will failed with unavailability exception if 
> some required token range is not available. For ANN search, users might be 
> willing to have lower recall (partial results) with higher availability.
> 
> >  First, how is ordering/scoring done?
> > Each replica returns back to the coordinator a sorted set of results and 
> > the coordinator will have to see all of the results globally in order to do 
> > a global ordering.  You can't know what the top result is unless you've 
> > seen everything.  As to the scoring, I'm not sure how that will get 
> > calculated.
> 
> The results will be top-k but still in primary key order. Scores are computed 
> based on vector similarly function.
> 
> Top-K search need two top-k filter as described in CEP. 
> 
> > Second, if I am ordering the results like for a Vector Search and I want to 
> > have the top 1 result.  How is the scoring done and what happens if there 
> > are 20 that have the same score?  How will the coordinator decide which 1 
> > is returned out of 20?
> 
> It will be the row with smaller primary key order.
> 
> On Wed, 10 May 2023 at 05:39, Jeremy Hanna <jeremy.hanna1...@gmail.com 
> <mailto:jeremy.hanna1...@gmail.com>> wrote:
>> Just wanted to add that I don't have any special knowledge of CEP-30 beyond 
>> what Jonathan posted and just trying to help clarify and answer questions as 
>> I can with some knowledge and experience from DSE Search and SAI.  Thanks to 
>> Caleb for helping validate some things as well.  And to be clear about 
>> partial results - the default with DSE Search at least is to fail a query if 
>> it can't get the full token range coverage.  However there is an option to 
>> allow for shards being unavailable and return partial results.
>> 
>>> On May 9, 2023, at 3:38 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com 
>>> <mailto:jeremy.hanna1...@gmail.com>> wrote:
>>> 
>>> I talked to David and some others in slack to hopefully clarify:
>>> 
>>> With SAI, can you have partial results?  When you have a query that is 
>>> non-key based, you need to have full token range coverage of the results.  
>>> If that isn't possible, will Vector Search/SAI return partial results?
>>> 
>>> Anything can happen in the implementation, but for scoring, it may not make 
>>> sense to return partial results because it's misleading.  For non-global 
>>> queries, it could or couldn't return partial results depending on 
>>> implementation/configuration.  In DSE you could have partial results 
>>> depending on the options.   However I couldn't find partial results defined 
>>> in CEP-7 or CEP-30.
>>> 
>>> The other questions are about scoring.
>>> 
>>> First, how is ordering/scoring done?
>>> 
>>> Each replica returns back to the coordinator a sorted set of results and 
>>> the coordinator will have to see all of the results globally in order to do 
>>> a global ordering.  You can't know what the top result is unless you've 
>>> seen everything.  As to the scoring, I'm not sure how that will get 
>>> calculated.
>>> 
>>> Second, if I am ordering the results like for a Vector Search and I want to 
>>> have the top 1 result.  How is the scoring done and what happens if there 
>>> are 20 that have the same score?  How will the coordinator decide which 1 
>>> is returned out of 20?
>>> 
>>> It returns results in token/partition and then clustering order.
>>> 
>>>> On May 9, 2023, at 2:53 PM, Caleb Rackliffe <calebrackli...@gmail.com 
>>>> <mailto:calebrackli...@gmail.com>> wrote:
>>>> 
>>>> Anyone on this ML who still remembers DSE Search (or has experience w/ 
>>>> Elastic or SolrCloud) probably also knows that there are some significant 
>>>> pieces of an optimized scatter/gather apparatus for IR (even without 
>>>> sorting, which also doesn't exist yet) that do not exist in C* or it's 
>>>> range query system (which SAI and all other 2i implementations use). SAI, 
>>>> like all C* 2i implementations, is still a local index, and as that is the 
>>>> case, anything built on it will perform best in partition-scoped (at least 
>>>> on the read side) use-cases. (On the bright side, the project is moving 
>>>> toward larger partitions being a possibility.) With smaller clusters or 
>>>> use-cases that are extremely write-heavy/read-light, it's possible that 
>>>> the full scatter/gather won't be too onerous, especially w/ a few small 
>>>> tweaks (on top of a non-vnode cluster) to a.) keep fanout minimal and b.) 
>>>> keep range/index queries to a single pass to minimize latency.
>>>> 
>>>> Whatever we do, we just need to avoid a situation down the road where 
>>>> users don't understand these nuances and hit a wall where they try to use 
>>>> this in a way that is fundamentally incompatible w/ the way the database 
>>>> scales/works. (I've done my best to call this out in all discussions 
>>>> around SAI over time, and there may even end up being further guardrails 
>>>> put in place to make it even harder to misuse it...but I digress.)
>>>> 
>>>> Having said all that, I don't fundamentally have a problem w/ the proposal.
>>>> 
>>>> On Tue, May 9, 2023 at 2:11 PM Benedict <bened...@apache.org 
>>>> <mailto:bened...@apache.org>> wrote:
>>>>> HNSW can in principle be made into a distributed index. But that would be 
>>>>> quite a different paradigm to SAI.
>>>>> 
>>>>>> On 9 May 2023, at 19:30, Patrick McFadin <pmcfa...@gmail.com 
>>>>>> <mailto:pmcfa...@gmail.com>> wrote:
>>>>>> 
>>>>>> 
>>>>>> Under the goals section, there is this line:
>>>>>> 
>>>>>> Scatter/gather across replicas, combining topK from each to get global 
>>>>>> topK.
>>>>>> 
>>>>>> But what I'm hearing is, exactly how will that happen? Maybe this is an 
>>>>>> SAI question too. How is that verified in SAI?
>>>>>> 
>>>>>> On Tue, May 9, 2023 at 11:07 AM David Capwell <dcapw...@apple.com 
>>>>>> <mailto:dcapw...@apple.com>> wrote:
>>>>>>> Approach section doesn’t go over how this will handle cross replica 
>>>>>>> search, this would be good to flesh out… given results have a real 
>>>>>>> ranking, the current 2i logic may yield incorrect results… so would 
>>>>>>> think we need num_ranges / rf queries in the best case, with some new 
>>>>>>> capability to sort the results?  If my assumption is correct, then how 
>>>>>>> errors are handled should also be fleshed out… Example: 1k cluster 
>>>>>>> without vnode and RF=3, so 333 queries fanned out to match, then 
>>>>>>> coordinator needs to sort… if 1 of the queries fails and can’t fall 
>>>>>>> back to peers… does the query fail (I assume so)?
>>>>>>> 
>>>>>>>> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jbel...@gmail.com 
>>>>>>>> <mailto:jbel...@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> Following the recent discussion threads, I would like to propose 
>>>>>>>> CEP-30 to add Approximate Nearest Neighbor (ANN) Vector Search via 
>>>>>>>> Storage-Attached Indexes (SAI) to Apache Cassandra.
>>>>>>>> 
>>>>>>>> The primary goal of this proposal is to implement ANN vector search 
>>>>>>>> capabilities, making Cassandra more useful to AI developers and 
>>>>>>>> organizations managing large datasets that can benefit from fast 
>>>>>>>> similarity search.
>>>>>>>> 
>>>>>>>> The implementation will leverage Lucene's Hierarchical Navigable Small 
>>>>>>>> World (HNSW) library and introduce a new CQL data type for vector 
>>>>>>>> embeddings, a new SAI index for ANN search functionality, and a new 
>>>>>>>> CQL operator for performing ANN search queries.
>>>>>>>> 
>>>>>>>> We are targeting the 5.0 release for this feature, in conjunction with 
>>>>>>>> the release of SAI. The proposed changes will maintain compatibility 
>>>>>>>> with existing Cassandra functionality and compose well with the 
>>>>>>>> already-approved SAI features.
>>>>>>>> 
>>>>>>>> Please find the full CEP document here: 
>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> Jonathan Ellis
>>>>>>>> co-founder, http://www.datastax.com <http://www.datastax.com/>
>>>>>>>> @spyced
>>>>>>> 
>>> 
>>
Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Reply via email to