jtibshirani opened a new issue, #11830:
URL: https://github.com/apache/lucene/issues/11830

   ### Description
   
   HNSW search is most efficient when all the vector data fits in the page cache, so it's good to keep the size of the vector files as small as possible.
   
   We currently write all HNSW graph connections as fixed-size integers. This 
is wasteful since most graphs have far fewer nodes than the max integer value:
   
https://github.com/apache/lucene/blob/d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956/lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java#L478
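
   For context, the current writer emits the neighbor count and every neighbor id as full 4-byte integers. Roughly (a simplified paraphrase of the linked code, not the exact writer; `IndexOutput#writeInt` is the real API, the method and variable names here are illustrative):

   ```java
   import java.io.IOException;
   import org.apache.lucene.store.IndexOutput;

   // Sketch of the existing layout: the neighbor count and every neighbor id
   // are written as fixed 4-byte ints, regardless of how many nodes the graph
   // actually contains.
   static void writeNeighborsFixedWidth(IndexOutput out, int[] neighbors, int size)
       throws IOException {
     out.writeInt(size);
     for (int i = 0; i < size; i++) {
       out.writeInt(neighbors[i]); // always 4 bytes per connection
     }
   }
   ```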
   
   Maybe instead we could store the connection list using `PackedInts.Writer`. This would decrease the number of bits needed to store each connection. We could still ensure that every connection list takes the same number of bytes, so we can continue to index into the graph data easily.
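
   A minimal sketch of the packed variant, assuming `PackedInts.getWriterNoHeader` plus padding each list out to the graph's max connection count so every list stays the same size (the method and parameters like `maxNodeId`/`maxConn` are illustrative, not a proposed patch):

   ```java
   import java.io.IOException;
   import org.apache.lucene.store.IndexOutput;
   import org.apache.lucene.util.packed.PackedInts;

   // Each connection needs only enough bits to address the largest node id.
   // Padding every list to maxConn values keeps the per-list byte size
   // constant, so offsets into the graph data remain easy to compute.
   static void writeNeighborsPacked(
       IndexOutput out, int[] neighbors, int size, int maxNodeId, int maxConn)
       throws IOException {
     int bitsPerValue = PackedInts.bitsRequired(maxNodeId);
     out.writeInt(size); // keep the count fixed-width too
     PackedInts.Writer writer =
         PackedInts.getWriterNoHeader(
             out, PackedInts.Format.PACKED, maxConn, bitsPerValue, PackedInts.DEFAULT_BUFFER_SIZE);
     for (int i = 0; i < size; i++) {
       writer.add(neighbors[i]);
     }
     for (int i = size; i < maxConn; i++) {
       writer.add(0); // pad unused slots so all lists occupy the same bytes
     }
     writer.finish();
   }
   ```

   For ~1M nodes, `PackedInts.bitsRequired` returns 20, so each connection would take 20 bits instead of 32, which is in the same ballpark as the ~30% reduction measured below.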
   
   I quickly tested a version of the idea on the 1 million vector GloVe 
dataset, and saw the graph data size decrease by ~30%:
   
   ```
   Baseline
   155M luceneknn-100-16-100.train-16-100.index/_5_Lucene94HnswVectorsFormat_0.vex

   Hacky patch
   103M luceneknn-100-16-100.train-16-100.index/_6_Lucene94HnswVectorsFormat_0.vex
   ```

