[PR] HNSW BP reorder tool [lucene]

via GitHub Fri, 23 Aug 2024 13:18:36 -0700


msokolov opened a new pull request, #13683:
URL: https://github.com/apache/lucene/pull/13683


   This is a working tool for re-ordering an index using BP (binary 
partitioning) over a vector field. It will renumber an existing HNSW graph over 
the field without the need to re-index it.  There are a bunch of TODOs to be 
worked out, but the results have been promising so far so I thought I should 
share it in its current state.   
   
   Highlights:
   
   On a 2M-vector index it is able to compress the vex file more than 30% (from 
89M to 61M). Additionally, query latency when searching the index dropped by 
20% (from 0.7 ms to 0.56ms) with no loss in precision/recall.
   
   Limitations:
   
   The current tool only works with floating-point vectors. It doesn't provide 
any mechanism to maintain an index sort that uses BPV, nor is there a merge 
policy.
   
   I've tested with Euclidean distance only. In theory this will work for other 
distances as well, but I'm not fully convinced that grouping vectors by 
averaging (centroid) is the right thing to do for the angular distances. I 
think it should work OK though, just haven't had time to test.
   
   Other:
   
   To make the reordering efficient I needed to expose a few details of the 
HNSW index: I added `KnnVectorReader.getGraph()` and `HnswGraph.maxConn()`. In 
the past we've chosen not have the graph exposed  by the vector reader since it 
is seen as an internal detail that is tied to a specific ANN implementation. 
The compromise I made here is that the method can return null. The alternative 
is to make Lucene99HnswVectorReader (and maybe some other classes?) be an 
HnswGraphProvider have class cast checks for HnswGraphProvider - I think a null 
check is cleaner, and with this change I was able to remove a bunch of casting.
   
   I enhanced `SortingCodecReader` to be able to sort vector values off-heap in 
case they are dense.This could be extended to the sparse case as well.
   
   Future ideas:
   
   To make this operationally-useful I think we need at a minimum the merge 
policy, but I think ultimately it belongs in the codec. I would like to make it 
compatilbe with a different (index) sort over docids. This would requires 
changes to the way nodes and docids are associated by the codec, and because of 
the way our vector codecs have different codecs for every possible combination 
of options (Byte vs Float, quantized vs not) this will be a lot of code 
changes, and I'd like to get the core feature working well before tackling that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[PR] HNSW BP reorder tool [lucene]

Reply via email to