msokolov opened a new pull request, #13683: URL: https://github.com/apache/lucene/pull/13683
This is a working tool for re-ordering an index using BP (binary partitioning) over a vector field. It will renumber an existing HNSW graph over the field without the need to re-index it. There are a bunch of TODOs to be worked out, but the results have been promising so far so I thought I should share it in its current state. Highlights: On a 2M-vector index it is able to compress the vex file more than 30% (from 89M to 61M). Additionally, query latency when searching the index dropped by 20% (from 0.7 ms to 0.56ms) with no loss in precision/recall. Limitations: The current tool only works with floating-point vectors. It doesn't provide any mechanism to maintain an index sort that uses BPV, nor is there a merge policy. I've tested with Euclidean distance only. In theory this will work for other distances as well, but I'm not fully convinced that grouping vectors by averaging (centroid) is the right thing to do for the angular distances. I think it should work OK though, just haven't had time to test. Other: To make the reordering efficient I needed to expose a few details of the HNSW index: I added `KnnVectorReader.getGraph()` and `HnswGraph.maxConn()`. In the past we've chosen not have the graph exposed by the vector reader since it is seen as an internal detail that is tied to a specific ANN implementation. The compromise I made here is that the method can return null. The alternative is to make Lucene99HnswVectorReader (and maybe some other classes?) be an HnswGraphProvider have class cast checks for HnswGraphProvider - I think a null check is cleaner, and with this change I was able to remove a bunch of casting. I enhanced `SortingCodecReader` to be able to sort vector values off-heap in case they are dense.This could be extended to the sparse case as well. Future ideas: To make this operationally-useful I think we need at a minimum the merge policy, but I think ultimately it belongs in the codec. I would like to make it compatilbe with a different (index) sort over docids. This would requires changes to the way nodes and docids are associated by the codec, and because of the way our vector codecs have different codecs for every possible combination of options (Byte vs Float, quantized vs not) this will be a lot of code changes, and I'd like to get the core feature working well before tackling that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org