shubhamvishu opened a new pull request, #16092:
URL: https://github.com/apache/lucene/pull/16092

   ### Description
   
   I worked with CC(Claude Code) to have this PR which adds the Hadamard 
rotation([Fast Walsh Hadamard 
Transform](https://en.wikipedia.org/wiki/Fast_Walsh%E2%80%93Hadamard_transform))
 to vector fields inspired from the @xande 's [TurboQuant 
PR](https://github.com/apache/lucene/pull/15903) (who works with me on Amazon 
Product Search) but a stripped down version just adding rotation to vectors in 
isolation. This address the 2nd item `Implement random rotation of vectors and 
queries.` from  [Data-blind scalar 
quantization](https://github.com/apache/lucene/issues/16029) issue @mccullocht 
is working on. I'm opening this to gather community feedback, as it shows 
promising recall improvements. I'd like to see whether we want to incorporate 
this into Lucene, reuse some of these ideas, or discard the approach if there 
are concerns.
   
   The recall improvement in luceneutil benchmarks with Cohere V3 and Amazon's 
internal benchmarks with 4K dim vectors. Current approach rotates the incoming 
float vectors at insertion (so index the vectors in rotated space) and rest of 
the flow continues as is. It stores whether to do rotation for a vector field 
or not info in the `FieldInfos`. 
   
   **TL;DR** : *Randomized orthogonal rotation (sign flips + permutation + 
FWHT) that Gaussianizes vector dimensions distributions to favor the scalar 
quantization(OSQ) accuracy while preserving distances.*
   
   ### Cohere V3
   
   #### Baseline (`main`) :
   
   ```
   Results:
   NOTE: nDoc = 500000 for all runs; skipping column
   NOTE: searchType = KNN for all runs; skipping column
   NOTE: topK = 100 for all runs; skipping column
   NOTE: fanout = 100 for all runs; skipping column
   NOTE: resultSimilarity = N/A for all runs; skipping column
   NOTE: decay = N/A for all runs; skipping column
   NOTE: resultCount = 100.000 for all runs; skipping column
   NOTE: maxConn = 64 for all runs; skipping column
   NOTE: beamWidth = 250 for all runs; skipping column
   NOTE: num_segments = 1 for all runs; skipping column
   NOTE: filterStrategy = null for all runs; skipping column
   NOTE: filterSelectivity = N/A for all runs; skipping column
   NOTE: overSample = 1.000 for all runs; skipping column
   NOTE: bp-reorder = false for all runs; skipping column
   NOTE: indexType = HNSW for all runs; skipping column
   NOTE: rerank = no for all runs; skipping column
   recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  
index_docs/s  force_merge(s)  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
    0.695        2.413   2.409        0.998     1 bits     9589     96.07       
5204.65          110.07         2102.03      2020.836       67.711
    0.816        2.512   2.510        0.999     2 bits     8614     98.17       
5093.00          137.36         2156.30      2081.871      128.746
    0.918        2.711   2.709        0.999     4 bits     8221     96.06       
5205.08          135.73         2276.80      2204.895      251.770
    0.970        3.789   3.788        1.000     7 bits     8129    101.13       
4944.28          228.06         2520.64      2449.036      495.911
    0.976        3.804   3.803        1.000     8 bits     8129    101.92       
4905.71          230.64         2520.66      2449.036      495.911
   
   ```
   
   #### Candidate (`main` + rotation i.e. this PR) :
   
   ```
   Results:
   NOTE: nDoc = 500000 for all runs; skipping column
   NOTE: searchType = KNN for all runs; skipping column
   NOTE: topK = 100 for all runs; skipping column
   NOTE: fanout = 100 for all runs; skipping column
   NOTE: resultSimilarity = N/A for all runs; skipping column
   NOTE: decay = N/A for all runs; skipping column
   NOTE: resultCount = 100.000 for all runs; skipping column
   NOTE: maxConn = 64 for all runs; skipping column
   NOTE: beamWidth = 250 for all runs; skipping column
   NOTE: num_segments = 1 for all runs; skipping column
   NOTE: filterStrategy = null for all runs; skipping column
   NOTE: filterSelectivity = N/A for all runs; skipping column
   NOTE: overSample = 1.000 for all runs; skipping column
   NOTE: bp-reorder = false for all runs; skipping column
   NOTE: indexType = HNSW for all runs; skipping column
   NOTE: rerank = no for all runs; skipping column
   recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  
index_docs/s  force_merge(s)  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
    0.729        2.243   2.241        0.999     1 bits     9174     96.59       
5176.47          100.21         2099.73      2020.836       67.711
    0.841        2.478   2.474        0.998     2 bits     8468     97.67       
5119.12          128.30         2155.75      2081.871      128.746
    0.937        2.613   2.612        1.000     4 bits     8162    100.38       
4980.97          124.06         2276.62      2204.895      251.770
    0.982        3.704   3.702        0.999     7 bits     8124    102.14       
4895.48          227.00         2520.65      2449.036      495.911
    0.985        3.685   3.684        1.000     8 bits     8120     99.64       
5017.91          227.74         2520.64      2449.036      495.911
   ```
   
    
   ### Recall
   
   | Bits | Baseline | Rotation | Delta | % Gain |
   |------|----------|----------|-------|--------|
   | 1    | 0.695    | 0.729    | +0.034 | **+4.9%** |
   | 2    | 0.816    | 0.841    | +0.025 | **+3.1%** |
   | 4    | 0.918    | 0.937    | +0.019 | **+2.1%** |
   | 7    | 0.970    | 0.982    | +0.012 | **+1.2%** |
   | 8    | 0.976    | 0.985    | +0.009 | **+0.9%** |
   
   <!--
   If this is your first contribution to Lucene, please make sure you have 
reviewed the contribution guide.
   https://github.com/apache/lucene/blob/main/CONTRIBUTING.md
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to