At first I wasn’t sure about using ORDER BY, but the more I think about what is 
actually going on, the more sense it makes.

This also matches up with some ideas that have been floating around about being 
able to ORDER BY a sorted SAI index.

-Jeremiah

> On May 22, 2023, at 2:28 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> 
> Hi all,
> 
> I have a branch of vector search based on cep-7-sai at 
> https://github.com/datastax/cassandra/tree/cep-vsearch. Compared to the 
> original POC branch, this one is based on the SAI code that will be mainline 
> soon, and handles distributed scatter/gather.  Updates and deletes to vector 
> values are still not supported.
> 
> I also put together a demo that uses this branch to provide context to 
> OpenAI’s GPT, available here: https://github.com/jbellis/cassgpt.  
> 
> Here is the query that gets executed:
> 
>     SELECT id, start, end, text 
>     FROM {self.keyspace}.{self.table} 
>     WHERE embedding ANN OF %s 
>     LIMIT %s
> 
> The more I used the proposed `ANN OF` syntax, the less I liked it.  This is 
> because we don’t want an actual boolean predicate; we just want to order 
> results.  Put another way, `ANN OF` will include all rows of the table given 
> a high enough `LIMIT`, and that makes it a bad fit for expression processing 
> that expects to be able to filter out rows before it starts LIMIT-ing.  And 
> in fact the code to support executing the query looks suspiciously like what 
> you’d want for `ORDER BY`.
> 
> I propose that we adopt `ORDER BY` syntax, supporting it for vector indexes 
> first and eventually for all SAI indexes.  So this query would become
> 
>     SELECT id, start, end, text 
>     FROM {self.keyspace}.{self.table} 
>     ORDER BY embedding ANN OF %s 
>     LIMIT %s
> 
> And it would compose with other SAI indexes with syntax like
> 
>     SELECT id, start, end, text 
>     FROM {self.keyspace}.{self.table} 
>     WHERE publish_date > %s
>     ORDER BY embedding ANN OF %s 
>     LIMIT %s
> 
> Related work:
> 
> This is similar to the approach used by pgvector, except they invented the 
> symbolic operator `<->` that has the same semantics as `ANN OF`.  I am okay 
> with adopting their operator, but I think ANN OF is more readable.
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
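
To make the filter-versus-order point from the quoted message concrete, here is 
a minimal Python sketch. It is not the code on the cep-vsearch branch: it uses 
an exact brute-force sort in place of a real ANN index, it assumes cosine 
similarity as the metric (the message does not say which one is used), and the 
names `order_by_ann`, `cosine_similarity`, and the sample rows are all 
hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def order_by_ann(rows, query, limit, predicate=None):
    """Exact stand-in for `ORDER BY embedding ANN OF query LIMIT limit`.

    `predicate` plays the role of an optional WHERE clause: it removes rows.
    The similarity score never removes a row; it only decides the order.
    """
    candidates = [r for r in rows if predicate is None or predicate(r)]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["embedding"], query),
        reverse=True,  # most similar first
    )
    return ranked[:limit]


# Hypothetical sample data, not taken from the cassgpt demo.
rows = [
    {"id": 1, "publish_date": 2019, "embedding": [1.0, 0.0]},
    {"id": 2, "publish_date": 2021, "embedding": [0.8, 0.6]},
    {"id": 3, "publish_date": 2023, "embedding": [0.0, 1.0]},
    {"id": 4, "publish_date": 2022, "embedding": [-1.0, 0.0]},
]
query = [1.0, 0.0]

# A small LIMIT returns the nearest rows.
top_two = [r["id"] for r in order_by_ann(rows, query, 2)]

# A high enough LIMIT returns every row: nothing was filtered out.
everything = order_by_ann(rows, query, 100)

# A real predicate filters first, then the survivors are ordered.
recent_top_two = [
    r["id"]
    for r in order_by_ann(rows, query, 2, lambda r: r["publish_date"] > 2020)
]

print(top_two, len(everything), recent_top_two)
```

In this sketch only the `predicate` argument ever shrinks the candidate set, 
which is the behaviour expression processing expects from a WHERE clause. The 
vector comparison just ranks whatever is left, which is why it reads more 
naturally as ORDER BY.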
