shbhar commented on PR #15903: URL: https://github.com/apache/lucene/pull/15903#issuecomment-4238225775
And a couple of other updates:

I tried QJL again to implement the "2 stage process" for NN search that the paper mentions, but the QJL correction, at least at 1 bit, adds so much variance that it makes recall much worse. So I'm not sure how to incorporate QJL and make the TurboQuant prod version actually work the way the paper describes (maybe it can work at higher bit widths). Others have observed this as well, for example this blog post in the KV compression context: https://dejan.ai/blog/turboquant/

> The QJL stage produces a correction term that makes the inner product estimator unbiased. But when you add this correction back to the reconstructed vector and store it in the KV cache, you’re injecting noise into the vector itself. The result: cosine similarity dropped to 0.69 (terrible) and the model produced garbage.

I also got hold of a much larger ASIN production dataset to test on (also 4096d), and it seems much better behaved (pairwise mean cosine similarity of ~0.05 vs. ~0.5 for the previous ASIN dataset I was using).

The test below uses 1M randomly sampled vectors and 10K randomly sampled queries. Graph: M=32, efConstruction=200. Search: fanout=50, topK=10. Force-merged to 1 segment. r7g.8xlarge (32 vCPU Graviton3, 256 GiB).
| Method | Recall@10 | Latency (ms) | Search QPS | Index docs/s | Index time (s) | Merge time (s) | Size (MB) |
|--------|-----------|-------------|------------|-------------|---------------|---------------|-----------|
| Float32 | 0.972 | 0.810 | 3,670 | 1,016 | 984 | 272.5 | 15,682 |
| OSQ-1bit | 0.836 | 0.569 | 10,921 | 973 | 1,028 | 91.6 | 16,188 |
| OSQ-2bit | 0.850 | 0.694 | 9,044 | 1,001 | 999 | 110.6 | 16,674 |
| OSQ-4bit | 0.915 | 0.795 | 6,964 | 1,018 | 982 | 143.6 | 17,650 |
| OSQ-8bit | 0.950 | 1.242 | 5,036 | 1,021 | 979 | 198.6 | 19,604 |
| OSQ-1bit-nocenter | 0.840 | 0.578 | 10,512 | 976 | 1,025 | 95.1 | 16,188 |
| OSQ-2bit-nocenter | 0.843 | 0.609 | 9,244 | 991 | 1,009 | 108.2 | 16,674 |
| OSQ-4bit-nocenter | 0.916 | 0.800 | 7,350 | 1,022 | 978 | 136.1 | 17,650 |
| OSQ-8bit-nocenter | 0.945 | 1.239 | 5,311 | 1,021 | 979 | 188.3 | 19,604 |
| TQ-1bit | 0.840 | 0.483 | 15,278 | 1,129 | 886 | 65.5 | 544 |
| TQ-2bit | 0.879 | 3.537 | 12,647 | 1,040 | 962 | 79.1 | 1,036 |
| TQ-4bit | 0.930 | 3.212 | 6,280 | 1,188 | 842 | 159.2 | 2,007 |
| TQ-8bit | 0.960 | 0.876 | 8,194 | 1,190 | 840 | 122.0 | 3,960 |

Note: I haven't made any attempt to optimize 2-bit/4-bit latency for TQ yet, so those latency numbers can be ignored. But by latency, TQ 1-bit is already ~15% faster and TQ 8-bit ~30% faster than OSQ at the same bit width. (I have a couple of other optimization ideas; will have Kiro try them later.)
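To make the "unbiased but high variance" point about the 1-bit QJL correction concrete, here is a minimal sketch of a QJL-style sign-sketch inner product estimator. It is not the PR's code and not the paper's two-stage pipeline: it applies the sketch to the whole vector rather than to a quantization residual, and the names `qjl_estimate`, `unit`, and `run_demo` as well as the toy sizes (`d=32`, `m=32`) are assumptions chosen for illustration.

```python
import math
import random


def qjl_estimate(q, k, m, rng):
    """One draw of a 1-bit sign-sketch estimate of <q, k>.

    For k only sign(S k) and ||k|| are used; q is projected at full precision.
    S has m rows of i.i.d. standard Gaussians (drawn fresh on every call).
    """
    d = len(k)
    k_norm = math.sqrt(sum(x * x for x in k))
    acc = 0.0
    for _ in range(m):
        s = [rng.gauss(0.0, 1.0) for _ in range(d)]  # one row of S
        sk = sum(a * b for a, b in zip(s, k))
        sq = sum(a * b for a, b in zip(s, q))
        acc += sq if sk >= 0 else -sq  # sign(s . k) * (s . q)
    # sqrt(pi/2) undoes the sqrt(2/pi) factor in E[sign(s.k) * (s.q)].
    return math.sqrt(math.pi / 2) * k_norm * acc / m


def unit(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def run_demo(d=32, m=32, trials=500, seed=0):
    """Return (true inner product, mean estimate, std of estimates)."""
    rng = random.Random(seed)
    q = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
    k = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
    true_ip = sum(a * b for a, b in zip(q, k))
    ests = [qjl_estimate(q, k, m, rng) for _ in range(trials)]
    mean = sum(ests) / trials
    std = math.sqrt(sum((e - mean) ** 2 for e in ests) / (trials - 1))
    return true_ip, mean, std
```

Averaged over many random sketches, the estimate converges to the true inner product, but for unit vectors a single estimate has a standard deviation of roughly `sqrt((pi/2 - <q,k>^2) / m)`. If the score gaps between near neighbors are smaller than that, the ranking can get noisier even though the estimator is unbiased, which may be part of what the recall drop reflects.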
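For reference, a statistic like the "pairwise mean cosine similarity" quoted for the two datasets can be estimated by sampling random pairs. The sketch below is an assumption about how such a number might be computed, not the script actually used; `mean_pairwise_cosine` and its parameters are hypothetical names.

```python
import math
import random


def mean_pairwise_cosine(vectors, num_pairs=10_000, seed=0):
    """Estimate the mean cosine similarity over random pairs of distinct vectors."""
    rng = random.Random(seed)
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    total = 0.0
    for _ in range(num_pairs):
        i, j = rng.sample(range(len(vectors)), 2)  # two distinct indices
        dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
        total += dot / (norms[i] * norms[j])
    return total / num_pairs
```

On synthetic data, isotropic Gaussian vectors give a value near 0, while adding the same offset to every vector pushes the value up (an offset of 1 per coordinate on unit-variance Gaussians gives about 0.5). A high mean is therefore one sign that a dataset has a large shared component.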
