mccullocht commented on issue #14997: URL: https://github.com/apache/lucene/issues/14997#issuecomment-3129188217
I agree that this is difficult to read. I've thought about this a bit and have a partial implementation of SPANN on top of a mutable store, so I'm reasonably familiar with the concepts. Some thoughts:

* The original paper chose medoids for roughly 1-in-8 vectors. I think you can choose many fewer (1-in-128?), but at 1-in-10k you will likely score quite a few more vectors than you would in HNSW, and that will negatively affect query latency even if everything is inline.
* You may want to use an index for centroids and keep that entire data structure off-heap. There's already an HNSW implementation available in Lucene, so I'd probably use that.
* The writer is probably going to want to pivot through different indexing strategies depending on size: exhaustive search (unindexed) for small segments, HNSW for medium segments, SPANN for larger segments (probably at least 100k vectors, maybe more). Unindexed segments can be represented as a single centroid with a posting list; "medium" segments can be represented as a 1-in-1 selection of centroids.
* [SPFresh](https://arxiv.org/abs/2410.14452) may be worth a look when merging large segments to attempt to balance partition sizes. IIUC it was attempting to maintain a policy regarding minimum and maximum partition sizes.
* It didn't seem like boundary handling was really addressed; in the paper, vectors are assigned to multiple partitions with RNG pruning for diversity.
* It didn't seem like pre-filtering was addressed either. If you take the top centroids but nothing within those centroids matches your filter, what do you do?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
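A minimal sketch of the size-based pivot described in the comment above, assuming plain vector-count thresholds. The class name `SegmentIndexPlanner`, its methods, and the 10k small/medium boundary are hypothetical and are not Lucene APIs; the 100k SPANN threshold and the 1-in-128 centroid ratio are only the tentative figures floated in the comment, not tuned values.

```java
// Hypothetical helper, not part of Lucene: picks an indexing strategy per segment
// and the number of centroids that strategy would imply.
final class SegmentIndexPlanner {
    enum Strategy { EXHAUSTIVE, HNSW, SPANN }

    // Assumption: the comment gives no boundary between "small" and "medium" segments.
    static final int HNSW_MIN_VECTORS = 10_000;
    // From the comment: SPANN for segments of "probably at least 100k" vectors.
    static final int SPANN_MIN_VECTORS = 100_000;
    // From the comment's tentative "1-in-128?" centroid selection ratio.
    static final int VECTORS_PER_CENTROID = 128;

    static Strategy choose(int numVectors) {
        if (numVectors < 0) {
            throw new IllegalArgumentException("numVectors must be >= 0");
        }
        if (numVectors >= SPANN_MIN_VECTORS) {
            return Strategy.SPANN;
        }
        if (numVectors >= HNSW_MIN_VECTORS) {
            return Strategy.HNSW;
        }
        return Strategy.EXHAUSTIVE;
    }

    static int centroidCount(int numVectors) {
        switch (choose(numVectors)) {
            case EXHAUSTIVE:
                // Unindexed segment: a single centroid with one posting list.
                return numVectors == 0 ? 0 : 1;
            case HNSW:
                // "Medium" segment: 1-in-1 selection, every vector is its own centroid.
                return numVectors;
            default:
                // SPANN: roughly one centroid per VECTORS_PER_CENTROID vectors, rounded up.
                return (numVectors - 1) / VECTORS_PER_CENTROID + 1;
        }
    }
}
```

Under these assumed thresholds, a 1,000,000-vector segment would get 7,813 centroids, while a 50,000-vector segment would be treated as HNSW with every vector acting as its own centroid.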
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org