kaivalnp commented on issue #14758: URL: https://github.com/apache/lucene/issues/14758#issuecomment-4311081631
Thank you everyone for the suggestions here, very interesting! Having Lucene use explicit information about the graphs it has to create (either via `String[]` labels or referencing a `label` field) indeed has a lot of benefit -- Lucene does not need to "infer" that information on its own. I guess the main challenge is API-wise, about how to pass that information during indexing and search, without changing a lot of functions? The explicit `String[]` labels approach is what I had in mind [initially](https://github.com/apache/lucene/issues/14758#issue-3118912411), but as @jpountz pointed out [here](https://github.com/apache/lucene/issues/14758#issuecomment-2944618497), we'll need parallel APIs for most vector-related functions that also take label info, like: 1. [`Knn*VectorField` constructor](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/document/KnnFloatVectorField.java#L90-L103) (add set of labels) 2. [`KnnFieldVectorsWriter#addValue`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnFieldVectorsWriter.java#L33-L37) (pass set of labels) 3. [`KnnVectorsReader#get*VectorValues`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java#L60-L65) (optional label to read) 4. [`KnnVectorsReader#search`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java#L74-L101) (optional label to search) 5. [`CodecReader#searchNearestVectors`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/index/CodecReader.java#L261-L274) (optional label to search) 6. [`Knn*VectorQuery`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L81-L98) (optional label to search) ..among others, which sounds invasive to me :( The inter-field dependency suggestion relaxes some of these: 1. For (1), the `label` field can be specified in the format itself, like @benwtrent suggested [here](https://github.com/apache/lucene/issues/14758#issuecomment-4253615422). 2. For (2) the writer can read the `label` field contents during indexing and merge: 1. During indexing, we still need to wait until flush to create the smaller HNSW graphs (because this is where the `label` field is guaranteed to be complete). We also need some way of getting and making a reader for the `label` field available in [`KnnVectorsWriter#flush`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L52-L53), but this sounds doable from what @iverase suggested [here](https://github.com/apache/lucene/issues/14758#issuecomment-4259019384). 2. Accessing label information during merge also sounds trivial, by getting a view over the field across all segments in `mergeState`, which is [already available](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L85-L90). However, I don't see how we'll get around (3) (4) (5). Additionally, (6) is the tricky part: 1. The `Knn*VectorQuery` [accepts an arbitrary `Query filter`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L77), which will need to be mapped to an appropriate label. 2. While this could _potentially_ be as simple as mapping a `TermQuery` (having the same `label` field in the key) to the appropriate graph, the `label` field itself is format-specific, and unknown to the query? 3. i.e. how would the query know whether to collect matching docs into a `BitSet` (for when a smaller graph is not available), OR pass the `TermQuery` filter as a "hint" to the format to choose a smaller HNSW graph internally? (again, this "hint" needs to be added to the `AcceptDocs` abstraction [here](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java#L100)). I still see some challenges here, which is why I'd like to throw in the approach of Lucene de-duplicating vectors for the user back into the mix :) This approach shifts responsibility of resolving the correct field + label information to the user. For example, if the above approaches add a `vector` field + labels `[1, 2, 3]` for a document -- the user now needs to add three top-level vector fields, (say) `vector_1`, `vector_2`, `vector_3` with the same vector, and Lucene will store a single copy of the vector on disk. At search time, the user also needs to know the correct field to search, based on filter information. Some other advantages of this approach I can think of: 1. De-duplication is agnostic to labels, and one field does not have to be a pure subset of another to share backing vectors (e.g. if many, but not all vectors are shared). 2. De-duplication is index-level, and vectors across documents and different fields can be shared too (e.g. if the vector was generated on a subset of fields of the document, like say `title`, then a single vector can be shared across all documents with the same `title`). Here, IMO the only real issue is the additional work at indexing to "infer" duplicates? (that is known to the user, and should ideally be used by Lucene, but is not easily doable due to challenges listed above) Using an AI, I was able to whip up an implementation for a de-duplicating vector format (see #15979) that needs minimal API changes (drop-in replacement for [`Lucene99FlatVectorsFormat`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsFormat.java#L74)). As @mikecan mentioned [here](https://github.com/apache/lucene/issues/14758#issuecomment-4290267510), maybe this is a start to having multiple HNSW graphs backed by the same raw vectors, albeit in an expensive way for indexing, until we converge on how to do it safely / cleanly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
