Anyone on this ML who still remembers DSE Search (or has experience w/
Elastic or SolrCloud) probably also knows that there are some significant
pieces of an optimized scatter/gather apparatus for IR (even without
sorting, which also doesn't exist yet) that do not exist in C* or its
range query system (which SAI and all other 2i implementations use). SAI,
like all C* 2i implementations, is still a local index, and as that is the
case, anything built on it will perform best in partition-scoped (at least
on the read side) use-cases. (On the bright side, the project is moving
toward larger partitions being a possibility.) With smaller clusters or
use-cases that are extremely write-heavy/read-light, it's possible that the
full scatter/gather won't be too onerous, especially w/ a few small tweaks
(on top of a non-vnode cluster) to a.) keep fanout minimal and b.) keep
range/index queries to a single pass to minimize latency.
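
To make the fanout and single-pass points concrete, here is a rough sketch
(Python, purely illustrative, not anything that exists in C* today). It
assumes a non-vnode cluster where each replica set can be covered by one
range query, and that each replica hands back its local topK as
(distance, row_id) pairs. The names min_fanout and merge_top_k are
hypothetical.

```python
import heapq
import math


def min_fanout(num_nodes: int, rf: int) -> int:
    """Best-case number of range queries for a full scatter/gather.

    Assumes single-token (non-vnode) nodes, so one query per disjoint
    replica set covers the ring: roughly num_ranges / rf.
    """
    return math.ceil(num_nodes / rf)


def merge_top_k(per_range_hits, k):
    """Combine each range's local topK into a global topK in one pass.

    per_range_hits: iterable of lists of (distance, row_id) tuples,
    where a smaller distance means a closer match.
    """
    all_hits = (hit for hits in per_range_hits for hit in hits)
    return heapq.nsmallest(k, all_hits)


if __name__ == "__main__":
    print(min_fanout(12, 3))
    print(merge_top_k([[(0.1, "a"), (0.4, "b")], [(0.2, "c"), (0.3, "d")]], 3))
```

The merge itself is cheap; the cost is in the number of range queries the
coordinator has to issue and wait on, which is why keeping fanout minimal
matters more than the sort.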

Whatever we do, we just need to avoid a situation down the road where users
don't understand these nuances and hit a wall where they try to use this in
a way that is fundamentally incompatible w/ the way the database
scales/works. (I've done my best to call this out in all discussions around
SAI over time, and there may even end up being further guardrails put in
place to make it even harder to misuse it...but I digress.)

Having said all that, I don't fundamentally have a problem w/ the proposal.

On Tue, May 9, 2023 at 2:11 PM Benedict <bened...@apache.org> wrote:

> HNSW can in principle be made into a distributed index. But that would be
> quite a different paradigm to SAI.
>
> On 9 May 2023, at 19:30, Patrick McFadin <pmcfa...@gmail.com> wrote:
>
> 
> Under the goals section, there is this line:
>
>
>    1. Scatter/gather across replicas, combining topK from each to get
>    global topK.
>
>
> But what I'm hearing is, exactly how will that happen? Maybe this is an
> SAI question too. How is that verified in SAI?
>
> On Tue, May 9, 2023 at 11:07 AM David Capwell <dcapw...@apple.com> wrote:
>
>> Approach section doesn’t go over how this will handle cross replica
>> search, this would be good to flesh out… given results have a real ranking,
>> the current 2i logic may yield incorrect results… so would think we need
>> num_ranges / rf queries in the best case, with some new capability to sort
>> the results?  If my assumption is correct, then how errors are handled
>> should also be fleshed out… Example: 1k cluster without vnode and RF=3, so
>> 333 queries fanned out to match, then coordinator needs to sort… if 1 of
>> the queries fails and can’t fall back to peers… does the query fail (I
>> assume so)?
>>
>> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> Hi all,
>>
>> Following the recent discussion threads, I would like to propose CEP-30
>> to add Approximate Nearest Neighbor (ANN) Vector Search via
>> Storage-Attached Indexes (SAI) to Apache Cassandra.
>>
>> The primary goal of this proposal is to implement ANN vector search
>> capabilities, making Cassandra more useful to AI developers and
>> organizations managing large datasets that can benefit from fast similarity
>> search.
>>
>> The implementation will leverage Lucene's Hierarchical Navigable Small
>> World (HNSW) library and introduce a new CQL data type for vector
>> embeddings, a new SAI index for ANN search functionality, and a new CQL
>> operator for performing ANN search queries.
>>
>> We are targeting the 5.0 release for this feature, in conjunction with
>> the release of SAI. The proposed changes will maintain compatibility with
>> existing Cassandra functionality and compose well with the already-approved
>> SAI features.
>>
>> Please find the full CEP document here:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>>
>>
>>
