benwtrent commented on issue #12440:
URL: https://github.com/apache/lucene/issues/12440#issuecomment-1708370161

   > For example, if we have segments 1, 2, 3, and 4 that we want to merge into 
   > a new segment, can we just leave the HNSW graphs as-is, or merge only the 
   > smaller ones and leave the big ones as-is? And when searching, we could do 
   > a greedy search (always go to the closest node, whichever graph it is in), 
   > or let the user choose to trade multithreading for latency.
   
   This would be tricky. Search cost scales linearly with the number of graphs, 
   and recall is technically higher when searching multiple graphs than a 
   single one. It seems that the `efSearch` (Lucene's `k`) per nearest-neighbor 
   search should be adjusted internally if this approach is taken. Thinking 
   further, it seems like a nice experiment to run regardless, to see how 
   multi-segment search speed can be improved (potentially carrying the minimum 
   competitive score or the candidate set between graph searches...)
   
   @zhaih In short, I think this has merit, but we could look at how it can 
improve multi-segment search by keeping track of min_score or dynamically 
adjusting `efSearch` when multiple segments are being searched.
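   The min_score idea could be sketched roughly like this. This is only an 
   illustration, not Lucene's actual API: `SharedTopKSearch`, the brute-force 
   `score` loop, and the segment layout are hypothetical stand-ins for the real 
   per-segment HNSW traversal, which could consult the shared `minCompetitive` 
   bound to prune or terminate early:

   ```java
   import java.util.List;
   import java.util.PriorityQueue;

   // Hypothetical sketch: search several per-segment vector sets while sharing
   // one global top-k queue, so later segments can skip candidates that cannot
   // beat the current k-th best score (min_score). Brute-force scoring stands
   // in for the actual graph traversal.
   class SharedTopKSearch {

       // Dot-product similarity as the score (higher is better).
       static float score(float[] q, float[] v) {
           float s = 0;
           for (int i = 0; i < q.length; i++) s += q[i] * v[i];
           return s;
       }

       // Searches each "segment" in turn, keeping one global min-heap of size k.
       static PriorityQueue<Float> search(float[] q, List<float[][]> segments, int k) {
           PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap of scores
           for (float[][] segment : segments) {
               // minCompetitive is the global k-th best score so far; a real
               // HNSW search could use it to stop traversal early.
               float minCompetitive =
                   topK.size() == k ? topK.peek() : Float.NEGATIVE_INFINITY;
               for (float[] v : segment) {
                   float s = score(q, v);
                   if (s <= minCompetitive) continue; // prune: cannot enter top k
                   topK.add(s);
                   if (topK.size() > k) topK.poll();
                   minCompetitive =
                       topK.size() == k ? topK.peek() : Float.NEGATIVE_INFINITY;
               }
           }
           return topK;
       }

       public static void main(String[] args) {
           float[] q = {1f, 0f};
           List<float[][]> segments = List.of(
               new float[][] {{0.9f, 0.1f}, {0.1f, 0.9f}},
               new float[][] {{0.8f, 0.0f}, {0.95f, 0.05f}});
           // Collects the two highest scores across both segments.
           System.out.println(search(q, segments, 2));
       }
   }
   ```

   The later a segment is visited, the tighter the shared bound, which is why 
   ordering segments by size (largest first) might help in practice.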
   
   > An idea, instead of trying to merge the subgraph, is to do a union of 
   > subgraphs: when we merge, we build a disconnected graph which is the union 
   > of all the segment graphs.
   
   @mbrette 
   
   🤔 I am not sure about this. It seems to me that allowing the graph to be 
   disconnected like this (a pure union) would be very difficult to keep track 
   of within the Codec. We would need to store not only each node's doc_id but 
   also which graph it belongs to, etc. This would cause significant overhead 
   and doesn't seem worth the effort.
   
   > It's worth exploring some variation of this in my opinion.
   
   I agree @mbrette, which is why I linked the paper :). I think if we do 
   something like nnDescent while still adhering to the diversity of the 
   neighborhood, it could help. HNSW is fairly resilient, and I wouldn't expect 
   horrific recall changes.
   
   One tricky thing there would be deciding when to promote or demote a node to 
   a higher or lower level in the graph. I am not sure nodes should simply stay 
   at their levels as the graph grows.
   
   I haven't thought about what the logic could be for determining which nodes 
   get put on which NSW layer. It might be as simple as re-running the 
   probabilistic level assignment for every node. Since nodes at higher levels 
   are also on the lower levels, we could still get a significant performance 
   improvement, as we would only have to "start from scratch" on nodes that are 
   promoted from a lower level to a higher one (which is very rare). Demoted 
   nodes could still keep their previous connections (or keep whatever fraction 
   of their connections our implementation allows).
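   For reference, HNSW's level assignment is an exponentially decaying draw, 
   `level = floor(-ln(U) * mL)` with `U` uniform in (0, 1] and `mL = 1/ln(M)`, 
   so re-running it per node during a merge would be cheap. A small sketch (the 
   class name and `M = 16` are illustrative, not Lucene internals):

   ```java
   import java.util.Random;

   // Sketch of HNSW's probabilistic level assignment:
   //   level = floor(-ln(U) * mL), with mL = 1 / ln(M).
   // P(level >= L) = (1/M)^L, so about 1/M of nodes reach level 1, 1/M^2
   // reach level 2, and so on.
   class LevelAssignment {
       static int assignLevel(Random random, int m) {
           double mL = 1.0 / Math.log(m);
           double u = random.nextDouble();
           if (u == 0.0) u = Double.MIN_VALUE; // avoid ln(0)
           return (int) (-Math.log(u) * mL);
       }

       public static void main(String[] args) {
           Random random = new Random(42);
           int[] counts = new int[8];
           int n = 100_000;
           for (int i = 0; i < n; i++) {
               int level = Math.min(assignLevel(random, 16), counts.length - 1);
               counts[level]++;
           }
           // Expect roughly 93.75% of nodes on level 0 for M = 16.
           for (int level = 0; level < counts.length; level++) {
               System.out.println("level " + level + ": " + counts[level]);
           }
       }
   }
   ```

   Because promotions are this rare, re-drawing levels at merge time should 
   only force a "from scratch" insertion for a tiny fraction of nodes.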
   
   > What is the current way to measure Lucene knn recall/performance these 
   > days? I tried to reproduce your test from 
   > https://github.com/apache/lucene/issues/11354#issuecomment-1239961308, but 
   > was not able to (recall = 0.574).
   
   @mbrette 
   
   [Lucene Util](https://github.com/mikemccand/luceneutil) is the way to do 
   this. What issues are you having? It can be somewhat frustrating to get up 
   and running, but once you have the flow down and it's built, it's useful. I 
   will happily help with this, or even run what benchmarking I can myself once 
   you have some POC work.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
