shubhamvishu opened a new pull request, #16092: URL: https://github.com/apache/lucene/pull/16092
### Description I worked with CC(Claude Code) to have this PR which adds the Hadamard rotation([Fast Walsh Hadamard Transform](https://en.wikipedia.org/wiki/Fast_Walsh%E2%80%93Hadamard_transform)) to vector fields inspired from the @xande 's [TurboQuant PR](https://github.com/apache/lucene/pull/15903) (who works with me on Amazon Product Search) but a stripped down version just adding rotation to vectors in isolation. This address the 2nd item `Implement random rotation of vectors and queries.` from [Data-blind scalar quantization](https://github.com/apache/lucene/issues/16029) issue @mccullocht is working on. I'm opening this to gather community feedback, as it shows promising recall improvements. I'd like to see whether we want to incorporate this into Lucene, reuse some of these ideas, or discard the approach if there are concerns. The recall improvement in luceneutil benchmarks with Cohere V3 and Amazon's internal benchmarks with 4K dim vectors. Current approach rotates the incoming float vectors at insertion (so index the vectors in rotated space) and rest of the flow continues as is. It stores whether to do rotation for a vector field or not info in the `FieldInfos`. **TL;DR** : *Randomized orthogonal rotation (sign flips + permutation + FWHT) that Gaussianizes vector dimensions distributions to favor the scalar quantization(OSQ) accuracy while preserving distances.* ### Cohere V3 #### Baseline (`main`) : ``` Results: NOTE: nDoc = 500000 for all runs; skipping column NOTE: searchType = KNN for all runs; skipping column NOTE: topK = 100 for all runs; skipping column NOTE: fanout = 100 for all runs; skipping column NOTE: resultSimilarity = N/A for all runs; skipping column NOTE: decay = N/A for all runs; skipping column NOTE: resultCount = 100.000 for all runs; skipping column NOTE: maxConn = 64 for all runs; skipping column NOTE: beamWidth = 250 for all runs; skipping column NOTE: num_segments = 1 for all runs; skipping column NOTE: filterStrategy = null for all runs; skipping column NOTE: filterSelectivity = N/A for all runs; skipping column NOTE: overSample = 1.000 for all runs; skipping column NOTE: bp-reorder = false for all runs; skipping column NOTE: indexType = HNSW for all runs; skipping column NOTE: rerank = no for all runs; skipping column recall latency(ms) netCPU avgCpuCount quantized visited index(s) index_docs/s force_merge(s) index_size(MB) vec_disk(MB) vec_RAM(MB) 0.695 2.413 2.409 0.998 1 bits 9589 96.07 5204.65 110.07 2102.03 2020.836 67.711 0.816 2.512 2.510 0.999 2 bits 8614 98.17 5093.00 137.36 2156.30 2081.871 128.746 0.918 2.711 2.709 0.999 4 bits 8221 96.06 5205.08 135.73 2276.80 2204.895 251.770 0.970 3.789 3.788 1.000 7 bits 8129 101.13 4944.28 228.06 2520.64 2449.036 495.911 0.976 3.804 3.803 1.000 8 bits 8129 101.92 4905.71 230.64 2520.66 2449.036 495.911 ``` #### Candidate (`main` + rotation i.e. this PR) : ``` Results: NOTE: nDoc = 500000 for all runs; skipping column NOTE: searchType = KNN for all runs; skipping column NOTE: topK = 100 for all runs; skipping column NOTE: fanout = 100 for all runs; skipping column NOTE: resultSimilarity = N/A for all runs; skipping column NOTE: decay = N/A for all runs; skipping column NOTE: resultCount = 100.000 for all runs; skipping column NOTE: maxConn = 64 for all runs; skipping column NOTE: beamWidth = 250 for all runs; skipping column NOTE: num_segments = 1 for all runs; skipping column NOTE: filterStrategy = null for all runs; skipping column NOTE: filterSelectivity = N/A for all runs; skipping column NOTE: overSample = 1.000 for all runs; skipping column NOTE: bp-reorder = false for all runs; skipping column NOTE: indexType = HNSW for all runs; skipping column NOTE: rerank = no for all runs; skipping column recall latency(ms) netCPU avgCpuCount quantized visited index(s) index_docs/s force_merge(s) index_size(MB) vec_disk(MB) vec_RAM(MB) 0.729 2.243 2.241 0.999 1 bits 9174 96.59 5176.47 100.21 2099.73 2020.836 67.711 0.841 2.478 2.474 0.998 2 bits 8468 97.67 5119.12 128.30 2155.75 2081.871 128.746 0.937 2.613 2.612 1.000 4 bits 8162 100.38 4980.97 124.06 2276.62 2204.895 251.770 0.982 3.704 3.702 0.999 7 bits 8124 102.14 4895.48 227.00 2520.65 2449.036 495.911 0.985 3.685 3.684 1.000 8 bits 8120 99.64 5017.91 227.74 2520.64 2449.036 495.911 ``` ### Recall | Bits | Baseline | Rotation | Delta | % Gain | |------|----------|----------|-------|--------| | 1 | 0.695 | 0.729 | +0.034 | **+4.9%** | | 2 | 0.816 | 0.841 | +0.025 | **+3.1%** | | 4 | 0.918 | 0.937 | +0.019 | **+2.1%** | | 7 | 0.970 | 0.982 | +0.012 | **+1.2%** | | 8 | 0.976 | 0.985 | +0.009 | **+0.9%** | <!-- If this is your first contribution to Lucene, please make sure you have reviewed the contribution guide. https://github.com/apache/lucene/blob/main/CONTRIBUTING.md --> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
