benwtrent commented on issue #12440: URL: https://github.com/apache/lucene/issues/12440#issuecomment-1682640430
> What is your take on existing merge optimization https://github.com/apache/lucene/pull/12050?

I think it's a good start. One problem I have is that typical Lucene segment merging attempts to merge segments of equal size together, making "segment tiers" of similar sizes. Maybe an adequate "make merges faster" optimization is to have a better segment merge policy that takes advantage of the inherent advantages of this optimization.

> A given vector/document will go through many segment merge in its life time, so the benefit of this optimization accrue a lot. Caveat: I used random vectors.

This is true, but it depends on the merge policy and which segments are merged. When merging 10 segments that are of equal size, this optimization has almost no impact.

> have we consider integrating other - native - libraries (faiss, raft, nmslib...) like what is done in open search (at a higher abstraction level though).

I am unsure about this. A new codec could be made integrating those native libraries, but they should fit within the Lucene segment model and not use JNI. From what I can tell, those integrations don't do either of those things. Additionally, there shouldn't be any external dependencies (if directly integrated into the Lucene repo). See Discussion: https://github.com/apache/lucene/issues/12502

Another option for "making merges faster" is to just provide scalar quantization for users. This will make merges as a whole faster, as the computations required will be much cheaper.

It bugs me that we have all this distributed work across segments that just gets ignored. Whether or not this is a native implementation, merging similarly sized HNSW graphs from 9 segments into 1 will still be costly.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
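As a rough illustration of the equal-sized-segments point in the comment above, here is a back-of-envelope sketch. It assumes a simplified reading of the optimization in pull/12050: the merged graph is seeded from the largest incoming segment's graph, so only that segment's vectors skip re-insertion. The function name `reused_fraction` and the segment sizes are hypothetical, and deletions are ignored; this is not Lucene code.

```python
def reused_fraction(segment_sizes):
    """Share of vectors already covered by the largest segment's graph.

    Assumption (not taken from Lucene's source): a merge seeds the new HNSW
    graph from the largest incoming segment, so every other vector still
    has to be inserted from scratch.
    """
    total = sum(segment_sizes)
    if total == 0:
        return 0.0
    return max(segment_sizes) / total


# Ten equal-sized segments: only one tenth of the vectors are reused.
equal = [100_000] * 10
# One dominant segment plus ten small ones: most of the work is reused.
skewed = [900_000] + [10_000] * 10

print(reused_fraction(equal))
print(reused_fraction(skewed))
```

Under this model, a tiered policy that merges similarly sized segments keeps the reused share near 1/N, which is consistent with the "almost no impact" observation for 10 equal segments.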
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org