benwtrent commented on PR #14078:
URL: https://github.com/apache/lucene/pull/14078#issuecomment-2706428269

   @lpld 
   
   I agree, both are doing similar things but there are some important 
distinctions. 
   
   `oversample` indicates that you are going to return that ratio more results 
from the index search. Flat or HNSW, it doesn't matter. So, when searching for 
`k=10` with an oversample ratio of 5, the searcher will actually return the top 
50. These results are then typically passed off to a rescorer-type query that 
recalculates the distances at some higher fidelity (e.g. raw floats). This type 
of oversampling (gathering more than the desired docs and then correcting their 
scores) is required at higher quantization levels.
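   
   As a rough illustration of the oversample-then-rescore flow described 
above, here is a minimal Python sketch. It is not Lucene code: the `quantize` 
scheme, the brute-force scan, and all function names are assumptions made up 
for the example. The only point is the two stages: gather `k * oversample` 
candidates by approximate distance, then rescore just those with raw floats.

```python
import heapq


def quantize(vec, levels=4):
    """Hypothetical coarse quantizer: snap each component in [0, 1) to a bucket."""
    return [min(int(x * levels), levels - 1) / levels for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_with_oversample(query, docs, k, oversample):
    """Return (top-k doc ids, number of candidates gathered before rescoring)."""
    n_candidates = k * oversample
    q_query = quantize(query)
    # Stage 1: approximate distances computed on quantized vectors only.
    approx = [(sq_dist(q_query, quantize(d)), i) for i, d in enumerate(docs)]
    candidates = heapq.nsmallest(n_candidates, approx)
    # Stage 2: recompute exact distances for the gathered candidates only.
    rescored = [(sq_dist(query, docs[i]), i) for _, i in candidates]
    top_k = [i for _, i in heapq.nsmallest(k, rescored)]
    return top_k, len(candidates)
```

   With `k=10` and `oversample=5`, stage 1 gathers 50 candidates and stage 2 
returns the best 10 of them by exact distance.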
   
   
   `fanout` enlarges the search queue used when searching the HNSW graph. 
However, the searcher will still only return `k` results. So, searching for top 
`k=10` with `fanout=20` indicates the HNSW search (ef_search, if you will) will 
gather the nearest 30 (`k + fanout`), but then only the top 10 of those are 
returned from the searcher. 
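   
   To make the `k + fanout` arithmetic concrete, here is a minimal Python 
sketch of a best-first search over a single graph layer. It is a toy, not 
Lucene's implementation: the adjacency-list graph, the `graph_search` name, and 
the return shape are all assumptions for illustration. The result queue holds 
up to `k + fanout` nodes, but only the best `k` are returned.

```python
import heapq


def graph_search(query, vectors, neighbors, entry, k, fanout):
    """Return (top-k node ids, number of nodes held in the result queue)."""
    ef = k + fanout  # queue size used while exploring the graph

    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(query, vectors[i]))

    visited = {entry}
    d0 = dist(entry)
    frontier = [(d0, entry)]  # min-heap: closest unexplored node first
    results = [(-d0, entry)]  # max-heap via negation: worst kept node on top
    while frontier:
        d, node = heapq.heappop(frontier)
        # Stop once the closest unexplored node is worse than the worst kept one.
        if len(results) >= ef and d > -results[0][0]:
            break
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst
    gathered = sorted((-nd, i) for nd, i in results)
    return [i for _, i in gathered[:k]], len(gathered)
```

   Calling this with `k=10` and `fanout=20` keeps up to 30 nodes in the queue 
during the search and then trims the output to 10.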
   
   
   Think of it this way:
   
    - `oversample` is for handling scoring approximations (e.g. quantized 
scoring)
    - `fanout` is for handling approximations in HNSW (e.g. exploring more of 
the graph to avoid getting stuck in local minima).
   
   
   Does that help?
   
   > Could you please also share other parameters of your benchmark (ndoc, 
maxConn, beamWidthIndex, fanout, etc.)
   
   I have lost my test environment and I regrettably didn't write all this 
down. However, these are some best guesses.
   
   
   For Cohere 768:
   
    - ndoc: 1_000_000
    - topk: 10
    - oversample: 5
    - maxConn: 16
    - beamWidth: 100
    - fanout: 50 (might have been 100)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

