I think we need to briefly step back and think about what the syntax means and how it fits into existing syntax.

It seems that the dimensionality verbiage assumes we’re logically introducing N vector fields, so that each row adopts a value for all of the vector fields or none. But in practice we are actually introducing a fixed-length frozen list in Cassandra terms, and our API treats this as a per-row array/vector rather than a number of column vectors.

My inclination then would be to say you declare an ARRAY<FLOAT, N> (which is syntactic sugar for FROZEN<LIST<FLOAT, N>>). This is very consistent with our existing style. We then simply permit such columns to define ANN indexes.
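
As a rough sketch of what that might look like to a user: the keyspace, table and column names below are made up for illustration, and ARRAY<FLOAT, N> is only the syntax suggested here, not something CQL accepts today.

```cql
-- Hypothetical: ARRAY<FLOAT, N> is proposed syntax, not existing CQL.
-- ks.items, id and embedding are illustrative names only.
CREATE TABLE ks.items (
    id uuid PRIMARY KEY,
    embedding ARRAY<FLOAT, 3>  -- intended as a frozen, fixed-length list of 3 floats
);

-- Each row supplies the whole array as a single value;
-- there are no per-element columns to set individually.
INSERT INTO ks.items (id, embedding) VALUES (uuid(), [0.1, 0.2, 0.3]);
```

The point of the sketch is only that the value is written and read as one per-row array, which is why "array length" seems a more natural description than "dimensions".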

Otherwise, I think we should lean into the idea that this is a set of N vectors, as “dimensions” makes limited sense when discussing an array length. In this case I would lean towards declaring e.g. 1500 FLOAT VECTORS, maybe. But then I think we should reconsider our presentation a little, and perhaps the result set should treat each vector as a separate field (or something like this).


On 26 Apr 2023, at 15:31, Jonathan Ellis <jbel...@gmail.com> wrote:

Hi all,

Splitting this out per the suggestion in the initial VS thread so we can work on driver support in parallel with the server-side changes.

I propose adding a new data type for vector search indexes:

FLOAT VECTOR[N_DIMENSIONS]

In the initial commits and thread, this was DENSE FLOAT32. Nobody really loved that, so we considered a bunch of alternatives, including

- `FLOAT[N]`: This minimal option resembles C and Java array syntax, which would make it familiar for many users. However, this syntax raises the question of why arrays cannot be created for other types.  Additionally, the expectation for an array is to provide random access to its contents, which is not supported for vectors.
- `DENSE FLOAT[N]`: This option clarifies that we are supporting dense vectors, not sparse ones. However, since Lucene had sparse vector support in the past but removed it for lack of compelling use cases, it is unlikely that it will be added back, making the "DENSE" qualifier less relevant.
- `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the reasons mentioned above.
- `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less natural word order.
- `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again this would imply that random access is supported, which we want to avoid doing.
- `VECTOR[N]`: This syntax is not very clear about the vector's contents and could make it difficult to add other vector types, such as byte vectors (already supported by Lucene), in the future.

Finally, the original qualifier of 32 in `FLOAT32` was intended to allow consistency if we add other float types like FLOAT16 or FLOAT64, both of which are sometimes used in ML. However, we already have a CQL data type for a 64-bit float (`DOUBLE`), so it would make more sense to add future variants (which remain hypothetical at this point) along that line instead.

Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance of clarity, conciseness, and extensibility. It is more natural in its word order than the original proposal and avoids unnecessary qualifiers, while still being clear about the data type it represents. Finally, this syntax is straightforwardly extensible should we choose to support other vector types in the future.

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
