Re: [I] New binary vector format doesn't perform well with small-dimension datasets [lucene]

via GitHub Tue, 18 Mar 2025 13:03:25 -0700


benwtrent commented on issue #14342:
URL: https://github.com/apache/lucene/issues/14342#issuecomment-2734567137


   OK, a colleague and I spent some time digging into this, and Option 0 (a bug) 
turned out to be the case. It's a 5-character change (like all good bugs). 
Here are the new recall numbers for fashion-mnist:
   
   Still not mid 90s at 5x oversampling, but WAY better than the abysmal 
results from before.
   
   ```
   FLAT Results:
   recall  latency(ms)   nDoc  topK  fanout  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
    0.444        4.110  60000    10      50     1 bits      0.00      Infinity             1          186.43       1.000       181.646        2.203       FLAT
    0.629        4.383  60000    10      50     1 bits      0.00      Infinity             1          186.43       2.000       181.646        2.203       FLAT
    0.730        4.437  60000    10      50     1 bits      0.00      Infinity             1          186.43       3.000       181.646        2.203       FLAT
    0.792        4.455  60000    10      50     1 bits      0.00      Infinity             1          186.43       4.000       181.646        2.203       FLAT
    0.833        4.445  60000    10      50     1 bits      0.00      Infinity             1          186.43       5.000       181.646        2.203       FLAT
    0.926        4.607  60000    10      50     1 bits      0.00      Infinity             1          186.43      10.000       181.646        2.203       FLAT
   ```
   
   ```
   HNSW Results:
   recall  latency(ms)   nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
    0.443        0.188  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       1.000       181.646        2.203       HNSW
    0.629        0.274  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       2.000       181.646        2.203       HNSW
    0.730        0.349  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       3.000       181.646        2.203       HNSW
    0.792        0.471  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       4.000       181.646        2.203       HNSW
    0.833        0.479  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       5.000       181.646        2.203       HNSW
    0.926        0.786  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55      10.000       181.646        2.203       HNSW
   ```
   
   For the curious, it had to do with shifting the normal distribution 
initialization parameters correctly given the standard deviation of the actual 
vector distribution. We had the mean & std flipped. When the vectors are well 
behaved, this sort of bug has a tiny effect (which is why we never caught it), 
but MNIST isn't well behaved and brought this nasty little bug to light.
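   
   To make the "mean & std flipped" point concrete, here is a minimal sketch 
of that kind of mix-up. It is not the Lucene code: the class and method names 
are hypothetical, and it only models the starting interval for 1-bit 
quantization, assuming it is taken from the MSE-optimal levels of a standard 
normal (+/- sqrt(2/pi)) scaled by the std and shifted by the mean. Any later 
clamping or optimization steps are not modeled.
   
   ```java
   // Hypothetical sketch of a flipped mean/std initialization; NOT the Lucene implementation.
   class IntervalInitSketch {
     // For a standard normal, the MSE-optimal 1-bit quantization levels are +/- sqrt(2/pi).
     static final double ONE_BIT_LEVEL = Math.sqrt(2.0 / Math.PI);

     // Arithmetic mean of the vector components.
     static double mean(float[] v) {
       double sum = 0;
       for (float x : v) sum += x;
       return sum / v.length;
     }

     // Population standard deviation of the vector components.
     static double std(float[] v) {
       double m = mean(v);
       double sq = 0;
       for (float x : v) sq += (x - m) * (x - m);
       return Math.sqrt(sq / v.length);
     }

     // Intended form: scale the standard-normal levels by std, then shift by mean.
     static double[] initInterval(double mean, double std) {
       return new double[] {mean - ONE_BIT_LEVEL * std, mean + ONE_BIT_LEVEL * std};
     }

     // Flipped form: the two arguments trade places, so the interval is
     // centered on std and scaled by mean.
     static double[] initIntervalFlipped(double mean, double std) {
       return initInterval(std, mean);
     }

     public static void main(String[] args) {
       // Made-up, pixel-like data: mostly zeros with a few bright values.
       float[] v = {0f, 0f, 0f, 0f, 0f, 0f, 0.1f, 0.4f, 0.9f, 1f};
       double m = mean(v);
       double s = std(v);
       System.out.println("intended: " + java.util.Arrays.toString(initInterval(m, s)));
       System.out.println("flipped:  " + java.util.Arrays.toString(initIntervalFlipped(m, s)));
     }
   }
   ```
   
   The two forms coincide only when the mean and std happen to be equal; 
how much the difference matters in practice depends on the steps that follow 
the initialization, which this sketch leaves out.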
   
   I am gonna run some more benchmarks and will open a PR soon with the fix. 
   
   
   As an aside, there are likely even more gains to be had for non-normally 
distributed vectors like MNIST, but they will take more time and effort.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


