mccullocht commented on issue #14997:
URL: https://github.com/apache/lucene/issues/14997#issuecomment-3129188217

   I agree that this is difficult to read.
   
   I've thought about this a bit and have a partial implementation of SPANN on 
top of a mutable store so I'm reasonably familiar with the concepts. Some 
thoughts:
   * The original paper chose medoids for roughly 1-in-8 vectors. I think you 
can choose many fewer (1-in-128?), but at 1-in-10k you will likely score quite a 
few more vectors than you would in HNSW, and that will negatively affect query 
latency even if everything is inline.
   * You may want to use an index for centroids and keep that entire data 
structure off-heap. There's already an HNSW implementation available in Lucene 
so I'd probably use that.
   * The writer is probably going to want to pivot through different indexing 
strategies depending on size: exhaustive search (unindexed) for small segments, 
HNSW for medium segments, SPANN for larger segments (probably 100k+ vectors, 
maybe more). Unindexed segments can be represented as a single centroid with a 
posting list; "medium" segments can be represented as 1-in-1 selection of 
centroids.
   * Consider [SPFresh](https://arxiv.org/abs/2410.14452) when merging large 
segments to attempt to balance partition sizes. IIUC it tries to maintain a 
policy on minimum and maximum partition sizes.
   * It didn't seem like boundary handling was really addressed. In the paper, 
vectors are assigned to multiple partitions, with RNG pruning for diversity.
   * It didn't seem like pre-filtering was addressed either. If you take the 
top centroids but nothing matches your filter within those centroids, what do 
you do?
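
   To put rough numbers on the centroid-ratio point above, here is a 
back-of-envelope sketch. It assumes each vector lives in exactly one posting 
list, so the average posting list holds about as many vectors as the selection 
ratio. The class name, method name, and probe count of 8 are made up for 
illustration and are not taken from the paper or from Lucene.

```java
/** Back-of-envelope estimate of posting-list vectors scored per query (illustrative only). */
final class PostingScanEstimate {

  /**
   * With one centroid per {@code vectorsPerCentroid} vectors and each vector stored in exactly
   * one posting list, the average posting list holds about {@code vectorsPerCentroid} vectors.
   * Replicating boundary vectors across partitions would raise this number.
   */
  static long vectorsScored(int vectorsPerCentroid, int centroidsProbed) {
    return (long) vectorsPerCentroid * centroidsProbed;
  }

  public static void main(String[] args) {
    int centroidsProbed = 8; // assumed probe count, not from the paper or the issue
    for (int ratio : new int[] {8, 128, 10_000}) {
      System.out.printf(
          "1-in-%d: ~%d vectors scored%n", ratio, vectorsScored(ratio, centroidsProbed));
    }
  }
}
```

   Under these assumptions, probing 8 centroids scores about 1,024 vectors at 
1-in-128 and about 80,000 at 1-in-10k, before counting the cost of searching 
the centroids themselves.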

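   The size-based pivot between indexing strategies could look roughly like 
the sketch below. All names are hypothetical and none of this is an existing 
Lucene API. The 100k threshold comes from the guess above; the 10k 
small/medium boundary and the 1-in-128 ratio are placeholders I picked for the 
example. `centroidCount` shows how all three cases fit one "centroids plus 
posting lists" layout.

```java
/** Hypothetical sketch: pick a per-segment vector index strategy from the vector count. */
final class SegmentIndexChooser {
  enum Strategy { EXHAUSTIVE, HNSW, SPANN }

  // Placeholder thresholds, not tuned or measured values.
  static final int HNSW_MIN_VECTORS = 10_000;          // assumed small/medium boundary
  static final int SPANN_MIN_VECTORS = 100_000;        // "probably 100k+, maybe more"
  static final int SPANN_VECTORS_PER_CENTROID = 128;   // assumed 1-in-128 selection

  static Strategy choose(int numVectors) {
    if (numVectors >= SPANN_MIN_VECTORS) return Strategy.SPANN;
    if (numVectors >= HNSW_MIN_VECTORS) return Strategy.HNSW;
    return Strategy.EXHAUSTIVE;
  }

  /** Number of centroids when every strategy is expressed as centroids plus posting lists. */
  static int centroidCount(int numVectors) {
    switch (choose(numVectors)) {
      case EXHAUSTIVE:
        return 1; // a single centroid whose posting list holds every vector
      case HNSW:
        return numVectors; // 1-in-1 selection: every vector is its own centroid
      default:
        // SPANN: one centroid per SPANN_VECTORS_PER_CENTROID vectors, rounded up
        return (numVectors + SPANN_VECTORS_PER_CENTROID - 1) / SPANN_VECTORS_PER_CENTROID;
    }
  }

  public static void main(String[] args) {
    for (int n : new int[] {500, 50_000, 1_000_000}) {
      System.out.println(n + " vectors -> " + choose(n) + ", " + centroidCount(n) + " centroids");
    }
  }
}
```
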

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


