At first I wasn’t sure about using ORDER BY, but the more I think about what is actually going on, I think it does make sense.
This also matches up with some ideas that have been floating around about being able to ORDER BY a sorted SAI index.

-Jeremiah

> On May 22, 2023, at 2:28 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> Hi all,
>
> I have a branch of vector search based on cep-7-sai at
> https://github.com/datastax/cassandra/tree/cep-vsearch. Compared to the
> original POC branch, this one is based on the SAI code that will be mainline
> soon, and handles distributed scatter/gather. Updates and deletes to vector
> values are still not supported.
>
> I also put together a demo that uses this branch to provide context to
> OpenAI’s GPT, available here: https://github.com/jbellis/cassgpt.
>
> Here is the query that gets executed:
>
> SELECT id, start, end, text
> FROM {self.keyspace}.{self.table}
> WHERE embedding ANN OF %s
> LIMIT %s
>
> The more I used the proposed `ANN OF` syntax, the less I liked it. This is
> because we don’t want an actual boolean predicate; we just want to order
> results. Put another way, `ANN OF` will include all rows of the table given
> a high enough `LIMIT`, and that makes it a bad fit for expression processing
> that expects to be able to filter out rows before it starts LIMIT-ing. And
> in fact the code to support executing the query looks suspiciously like what
> you’d want for `ORDER BY`.
>
> I propose that we adopt `ORDER BY` syntax, supporting it for vector indexes
> first and eventually for all SAI indexes. So this query would become
>
> SELECT id, start, end, text
> FROM {self.keyspace}.{self.table}
> ORDER BY embedding ANN OF %s
> LIMIT %s
>
> And it would compose with other SAI indexes with syntax like
>
> SELECT id, start, end, text
> FROM {self.keyspace}.{self.table}
> WHERE publish_date > %s
> ORDER BY embedding ANN OF %s
> LIMIT %s
>
> Related work:
>
> This is similar to the approach used by pgvector, except they invented the
> symbolic operator `<->` that has the same semantics as `ANN OF`. I am okay
> with adopting their operator, but I think ANN OF is more readable.
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com <http://www.datastax.com/>
> @spyced
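
To make the "ordering, not filtering" point concrete, here is a small Python sketch of the semantics as I read them from the thread. It is not the SAI implementation: it uses an exact brute-force Euclidean sort in place of an approximate index, and the function name, row layout, and sample data are made up for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ann_order_by(rows, query_vector, limit, predicate=None):
    """Hypothetical model of: WHERE <predicate> ORDER BY embedding ANN OF ? LIMIT ?

    Exact search stands in for the approximate index (an assumption made
    for clarity; a real ANN index may return slightly different rows).
    """
    # WHERE: the only step that can remove rows from the result.
    candidates = [r for r in rows if predicate is None or predicate(r)]
    # ORDER BY ... ANN OF: ranks rows by closeness, never drops any.
    candidates.sort(key=lambda r: distance(r["embedding"], query_vector))
    # LIMIT: truncates the ranked list.
    return candidates[:limit]


# Made-up sample data.
rows = [
    {"id": 1, "publish_date": 2019, "embedding": [0.0, 0.0]},
    {"id": 2, "publish_date": 2021, "embedding": [1.0, 1.0]},
    {"id": 3, "publish_date": 2022, "embedding": [5.0, 5.0]},
    {"id": 4, "publish_date": 2023, "embedding": [0.9, 1.2]},
]
query = [1.0, 1.0]

top2 = [r["id"] for r in ann_order_by(rows, query, limit=2)]
everything = [r["id"] for r in ann_order_by(rows, query, limit=10)]
recent = [
    r["id"]
    for r in ann_order_by(
        rows, query, limit=1, predicate=lambda r: r["publish_date"] > 2021
    )
]

print(top2)        # [2, 4]
print(everything)  # [2, 4, 1, 3]
print(recent)      # [4]
```

With a large enough limit every row comes back, just ranked, which is why the clause behaves like ORDER BY and not like a WHERE predicate. The publish_date filter composes naturally because it runs before the ranking.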