Re: [PR] Add new Acorn-esque filtered HNSW search heuristic [lucene]

via GitHub Thu, 30 Jan 2025 08:22:40 -0800


benwtrent commented on PR #14160:
URL: https://github.com/apache/lucene/pull/14160#issuecomment-2624958162


   I ran this over the "nightly" dataset (8M 768-dim vectors) without force 
merging, which I think matches the nightly behavior. I tested a range of filter 
selectivities (I think nightly uses 5%).
   
   
   BASELINE
   ```
   recall  latency (ms)     nDoc  topK  fanout  visited  selectivity
    1.000       110.216  8000000   100      50    79846        0.010
    0.982       137.185  8000000   100      50   215393        0.050
    0.974        85.933  8000000   100      50   144953        0.100
    0.965        73.476  8000000   100      50    86333        0.200
    0.958        58.347  8000000   100      50    64055        0.300
    0.952        34.021  8000000   100      50    51634        0.400
    0.944        32.818  8000000   100      50    43643        0.500
    0.940        29.538  8000000   100      50    38200        0.600
    0.936        26.965  8000000   100      50    34205        0.700
    0.930        25.453  8000000   100      50    30989        0.800
    0.926        23.585  8000000   100      50    28482        0.900
    0.924        23.926  8000000   100      50    27318        0.950
    0.922        23.306  8000000   100      50    26481        0.990
   ```
   
   CANDIDATE
   
   ```
   recall  latency (ms)     nDoc  topK  fanout  visited  selectivity
    0.640        28.972  8000000   100      50    10709        0.010
    0.855        34.103  8000000   100      50    20845        0.050
    0.908        37.990  8000000   100      50    36339        0.100
    0.922        47.513  8000000   100      50    54472        0.200
    0.903        46.094  8000000   100      50    56451        0.300
    0.894        41.164  8000000   100      50    52235        0.400
    0.870        30.850  8000000   100      50    36989        0.500
    0.881        28.043  8000000   100      50    34102        0.600
    0.896        27.725  8000000   100      50    33346        0.700
    0.904        25.472  8000000   100      50    31135        0.800
    0.913        23.670  8000000   100      50    26715        0.900
    0.918        23.148  8000000   100      50    26193        0.950
    0.922        22.982  8000000   100      50    26425        0.990
   ```
   
   The goal is generally "higher recall with lower visited". A nice single 
value to show this is `recall/visited`: it rises as visited drops or as recall 
increases, so higher is better. 
   
   I graphed this ratio (multiplying by 100_000 to make the values saner 
looking)
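
For anyone who wants to reproduce the ratio, here is a minimal sketch. It assumes the first table above is the baseline and the second is the candidate, and it copies only three selectivities from each. The names `recall_per_visited`, `BASELINE`, and `CANDIDATE` are made up for this example and are not part of the benchmark tooling.

```python
# Rows copied from the two tables above: (selectivity, recall, visited).
# Assumption: the first table is the baseline, the second is the candidate.
BASELINE = [(0.01, 1.000, 79846), (0.10, 0.974, 144953), (0.50, 0.944, 43643)]
CANDIDATE = [(0.01, 0.640, 10709), (0.10, 0.908, 36339), (0.50, 0.870, 36989)]

SCALE = 100_000  # same multiplier as used for the graph


def recall_per_visited(recall, visited, scale=SCALE):
    """Scaled recall/visited ratio; higher is better."""
    return recall / visited * scale


def ratios(rows):
    """Map selectivity -> scaled recall/visited for a list of table rows."""
    return {sel: recall_per_visited(recall, visited) for sel, recall, visited in rows}


baseline_ratios = ratios(BASELINE)
candidate_ratios = ratios(CANDIDATE)

for sel in baseline_ratios:
    print(
        f"selectivity={sel:.2f}  "
        f"baseline={baseline_ratios[sel]:.3f}  "
        f"candidate={candidate_ratios[sel]:.3f}"
    )
```

Running the same function over every row of both tables should give the two curves to plot against selectivity.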
   
   <img width="602" alt="image" 
src="https://github.com/user-attachments/assets/868a6b59-9e45-46d0-8c82-457a51df4698"
 />
   
   So on nightly, the ratio is significantly improved, by as much as 5x. 
   
   I am currently force merging and will then rerun.
   
   Here is some more data for the candidate only, at 0.05 selectivity with 
increasing fanout:
   ```
   recall  latency (ms)     nDoc  topK  fanout  visited  selectivity
    0.855        29.257  8000000   100      50    20845        0.050
    0.859        30.215  8000000   100      60    21514        0.050
    0.862        31.189  8000000   100      70    22134        0.050
    0.866        31.998  8000000   100      80    22718        0.050
    0.868        32.896  8000000   100      90    23294        0.050
    0.871        33.569  8000000   100     100    23877        0.050
    0.873        29.677  8000000   100     110    24447        0.050
    0.875        34.983  8000000   100     120    24978        0.050
    0.877        34.644  8000000   100     130    25494        0.050
    0.879        36.034  8000000   100     140    26015        0.050
    0.881        36.557  8000000   100     150    26533        0.050
    0.883        36.708  8000000   100     160    27034        0.050
    0.884        36.946  8000000   100     170    27534        0.050
    0.886        38.691  8000000   100     180    27999        0.050
    0.888        39.257  8000000   100     190    28503        0.050
    0.890        39.152  8000000   100     200    28955        0.050
    0.891        40.726  8000000   100     210    29453        0.050
    0.892        41.062  8000000   100     220    29895        0.050
    0.893        40.994  8000000   100     230    30319        0.050
    0.895        41.713  8000000   100     240    30736        0.050
    0.896        42.321  8000000   100     250    31180        0.050
   ```
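
The same `recall/visited` value (scaled by 100_000) can be applied to the fanout sweep. This is only a sketch over five rows copied from the table above; the helper name `recall_per_visited` is made up for the example.

```python
# Rows copied from the fanout table above: (fanout, recall, visited),
# all at selectivity 0.05 with topK=100.
FANOUT_SWEEP = [
    (50, 0.855, 20845),
    (100, 0.871, 23877),
    (150, 0.881, 26533),
    (200, 0.890, 28955),
    (250, 0.896, 31180),
]

SCALE = 100_000  # same multiplier as used for the graph


def recall_per_visited(recall, visited, scale=SCALE):
    """Scaled recall/visited ratio; higher is better."""
    return recall / visited * scale


sweep_ratios = [
    (fanout, recall_per_visited(recall, visited))
    for fanout, recall, visited in FANOUT_SWEEP
]

for fanout, ratio in sweep_ratios:
    print(f"fanout={fanout:3d}  recall/visited x 100_000 = {ratio:.3f}")
```

On these rows the ratio falls as fanout grows, since visited increases faster than recall does.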



