shbhar commented on PR #15903:
URL: https://github.com/apache/lucene/pull/15903#issuecomment-4238225775

   And a couple of other updates:
   
   I tried QJL again to realize the "2-stage process" for NN search that the 
paper mentions, but the QJL correction, at least at 1 bit, adds so much variance 
that recall gets much worse. So I'm not sure how to incorporate QJL and make the 
TurboQuant prod version work the way the paper describes (maybe it can work at 
higher bit widths). Others have observed the same thing, for example this blog 
post in the KV-cache compression context:
   
   https://dejan.ai/blog/turboquant/
   >The QJL stage produces a correction term that makes the inner product 
estimator unbiased. But when you add this correction back to the reconstructed 
vector and store it in the KV cache, you’re injecting noise into the vector 
itself. The result: cosine similarity dropped to 0.69 (terrible) and the model 
produced garbage.
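
   To make the variance point concrete, here is a minimal sketch of a 
QJL-style 1-bit sign-projection estimator of an inner product. It is not the 
code in this PR: the function names, the dimension `d`, the number of 
projections `m`, and the test vectors are all hypothetical, and it applies the 
estimator to a plain vector `k` rather than to a quantization residual. It only 
illustrates that the estimate is unbiased on average while a single estimate 
can be far from the true value.

```python
import math
import random


def qjl_estimate(q, k, m, rng):
    """One QJL-style estimate of <q, k> from m Gaussian sign projections of k."""
    d = len(k)
    k_norm = math.sqrt(sum(x * x for x in k))
    acc = 0.0
    for _ in range(m):
        # One random Gaussian projection row (hypothetical, regenerated per row).
        s = [rng.gauss(0.0, 1.0) for _ in range(d)]
        proj_q = sum(a * b for a, b in zip(s, q))
        proj_k = sum(a * b for a, b in zip(s, k))
        # Only the sign of the projected key is kept (1 bit per projection).
        sign_k = 1.0 if proj_k >= 0.0 else -1.0
        acc += proj_q * sign_k
    # The sqrt(pi/2) * ||k|| scaling makes the estimator unbiased.
    return math.sqrt(math.pi / 2.0) * k_norm * acc / m


def simulate(trials=2000, d=16, m=16, seed=0):
    """Return (true inner product, mean estimate, std of a single estimate)."""
    rng = random.Random(seed)
    q = [0.0] * d
    q[0] = 1.0
    k = [0.0] * d
    k[0] = 0.5
    k[1] = math.sqrt(0.75)  # unit vector with <q, k> = 0.5
    true_ip = sum(a * b for a, b in zip(q, k))
    estimates = [qjl_estimate(q, k, m, rng) for _ in range(trials)]
    mean = sum(estimates) / trials
    var = sum((e - mean) ** 2 for e in estimates) / (trials - 1)
    return true_ip, mean, math.sqrt(var)


if __name__ == "__main__":
    true_ip, mean, std = simulate()
    print(f"true={true_ip:.3f} mean={mean:.3f} std={std:.3f}")
```

   For unit vectors, each projection contributes variance `pi/2 - <q,k>^2`, so 
with these toy settings the expected per-estimate standard deviation is roughly 
`sqrt((pi/2 - 0.25) / 16)`, about 0.29. That is the same order as the inner 
product being estimated, which is the kind of noise that can reorder near 
neighbors.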
   
   I also got hold of a much larger ASIN production dataset to test on (also 
4096 dimensions), and it seems much better behaved (pairwise mean cosine 
similarity of ~0.05 vs. ~0.5 for the ASIN dataset I was using previously). The 
test below uses a random sample of 1M vectors and 10K randomly sampled queries. 
Graph: M=32, efConstruction=200. Search: fanout=50, topK=10. Force-merged to 1 
segment. Hardware: r7g.8xlarge (32 vCPU Graviton3, 256 GiB).
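
   For reference, one way to estimate a pairwise mean cosine similarity is to 
average over randomly sampled pairs. The sketch below is an assumption about 
how such a number could be computed, not the script actually used here; 
`mean_pairwise_cosine`, `num_pairs`, and `seed` are hypothetical names.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(vectors, num_pairs, seed=0):
    """Average cosine similarity over num_pairs random pairs of distinct vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        # Sample two different indices so a vector is never paired with itself.
        i, j = rng.sample(range(len(vectors)), 2)
        total += cosine(vectors[i], vectors[j])
    return total / num_pairs


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(mean_pairwise_cosine(toy, num_pairs=100))
```

   A value near 0 means the sampled vectors are close to orthogonal on 
average, while a value near 0.5 means they share a strong common direction.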
   
   | Method | Recall@10 | Latency (ms) | Search QPS | Index docs/s | Index time (s) | Merge time (s) | Size (MB) |
   |--------|-----------|-------------|------------|-------------|---------------|---------------|-----------|
   | Float32 | 0.972 | 0.810 | 3,670 | 1,016 | 984 | 272.5 | 15,682 |
   | OSQ-1bit | 0.836 | 0.569 | 10,921 | 973 | 1,028 | 91.6 | 16,188 |
   | OSQ-2bit | 0.850 | 0.694 | 9,044 | 1,001 | 999 | 110.6 | 16,674 |
   | OSQ-4bit | 0.915 | 0.795 | 6,964 | 1,018 | 982 | 143.6 | 17,650 |
   | OSQ-8bit | 0.950 | 1.242 | 5,036 | 1,021 | 979 | 198.6 | 19,604 |
   | OSQ-1bit-nocenter | 0.840 | 0.578 | 10,512 | 976 | 1,025 | 95.1 | 16,188 |
   | OSQ-2bit-nocenter | 0.843 | 0.609 | 9,244 | 991 | 1,009 | 108.2 | 16,674 |
   | OSQ-4bit-nocenter | 0.916 | 0.800 | 7,350 | 1,022 | 978 | 136.1 | 17,650 |
   | OSQ-8bit-nocenter | 0.945 | 1.239 | 5,311 | 1,021 | 979 | 188.3 | 19,604 |
   | TQ-1bit | 0.840 | 0.483 | 15,278 | 1,129 | 886 | 65.5 | 544 |
   | TQ-2bit | 0.879 | 3.537 | 12,647 | 1,040 | 962 | 79.1 | 1,036 |
   | TQ-4bit | 0.930 | 3.212 | 6,280 | 1,188 | 842 | 159.2 | 2,007 |
   | TQ-8bit | 0.960 | 0.876 | 8,194 | 1,190 | 840 | 122.0 | 3,960 |
   
   
   Note: I haven't made any attempt to optimize 2-bit/4-bit latency for TQ yet, 
so those latency numbers can be ignored. But by latency, TQ 1-bit is already 
~15% faster than OSQ 1-bit and TQ 8-bit is ~30% faster than OSQ 8-bit (I have a 
couple of other optimization ideas and will have Kiro try them later).
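
   The percentages above match the latency column when TQ is compared with OSQ 
at the same bit width. A small sketch of that arithmetic, using only values 
copied from the table (the helper name `latency_reduction` is hypothetical):

```python
# Latency values in milliseconds, copied from the table above.
latency_ms = {
    "OSQ-1bit": 0.569,
    "TQ-1bit": 0.483,
    "OSQ-8bit": 1.242,
    "TQ-8bit": 0.876,
}


def latency_reduction(baseline_ms, candidate_ms):
    """Fraction by which candidate latency is lower than baseline latency."""
    return (baseline_ms - candidate_ms) / baseline_ms


reduction_1bit = latency_reduction(latency_ms["OSQ-1bit"], latency_ms["TQ-1bit"])
reduction_8bit = latency_reduction(latency_ms["OSQ-8bit"], latency_ms["TQ-8bit"])

if __name__ == "__main__":
    print(f"1-bit: {reduction_1bit:.1%} lower latency")
    print(f"8-bit: {reduction_8bit:.1%} lower latency")
```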
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

