jtibshirani opened a new issue, #11830: URL: https://github.com/apache/lucene/issues/11830
### Description

HNSW search is most efficient when all the vector data fits in the page cache, so it's good to keep the vector files as small as possible. We currently write all HNSW graph connections as fixed-size integers. This is wasteful, since most graphs have far fewer nodes than the max integer value: https://github.com/apache/lucene/blob/d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956/lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java#L478

Maybe instead we could store the connection list using `PackedInts.Writer`. This would decrease the number of bits needed to store each connection. We could still ensure that every connection list takes the same number of bytes, to continue being able to index into the graph data easily.

I quickly tested a version of this idea on the 1 million vector GloVe dataset and saw the graph data size decrease by ~30%:

```
Baseline
155M luceneknn-100-16-100.train-16-100.index/_5_Lucene94HnswVectorsFormat_0.vex

Hacky patch
103M luceneknn-100-16-100.train-16-100.index/_6_Lucene94HnswVectorsFormat_0.vex
```
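For illustration, here is a rough sketch of the packed-writing idea. It is not the actual `Lucene94HnswVectorsWriter` change, just a standalone method showing how one node's neighbor list could be written with `PackedInts.Writer` while keeping every per-node block the same size. The method name `writeNeighbors`, the fixed-width count, and the zero-padding are assumptions made for the sketch:

```java
import java.io.IOException;

import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.packed.PackedInts;

final class PackedNeighborSketch {

  /**
   * Writes one node's neighbor list using the minimum number of bits per connection.
   * Every list is padded out to maxConn entries so each node still occupies the same
   * number of bytes on disk and can be located with a simple offset calculation.
   */
  static void writeNeighbors(DataOutput out, int[] neighbors, int size, int maxConn, int maxNodeId)
      throws IOException {
    out.writeInt(size); // fixed-width count keeps the per-node block size constant
    int bitsPerValue = PackedInts.bitsRequired(maxNodeId); // usually well under 32 bits
    PackedInts.Writer writer = PackedInts.getWriter(out, maxConn, bitsPerValue, PackedInts.COMPACT);
    for (int i = 0; i < size; i++) {
      writer.add(neighbors[i]);
    }
    for (int i = size; i < maxConn; i++) {
      writer.add(0L); // pad unused slots; the stored count tells readers to ignore them
    }
    writer.finish();
  }
}
```

On the read side, something like `PackedInts.getReader` could decode each block, and because every block is the same length, a node's connection list can still be found by multiplying the node id by the fixed block size.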