Re: [PR] Add a new codec to implement OSQ for 4 and 8 bit quantized vectors [lucene]

via GitHub Mon, 15 Sep 2025 15:56:21 -0700


mccullocht commented on PR #15169:
URL: https://github.com/apache/lucene/pull/15169#issuecomment-3294250569


   Average visited count in the query path actually is exposed in luceneutil 
today; it just appears in the iteration summary rather than the overall summary. 
TIL. I've extracted it for my most recent run here:
   ```
   -4 3794
   -8 3857
    4 4294
    8 3849
   ```
   
   4 bit is doing about 12% more comparisons than 8 bit for the same fanout 
(4294 vs. 3849). That is more work in 4 bit, but it doesn't explain the size of 
the win in OSQ4. Workload CPU usage and latency are mostly driven by scoring 
costs, so let's start by looking at microbenchmark results for dot product on 
the same hardware:
   ```
   VectorUtilBenchmark.binaryDotProductVector         128  thrpt   15   47.741 ±  0.502  ops/us
   VectorUtilBenchmark.binaryDotProductVector         256  thrpt   15   26.198 ±  0.608  ops/us
   VectorUtilBenchmark.binaryDotProductVector         300  thrpt   15   22.857 ±  0.130  ops/us
   VectorUtilBenchmark.binaryDotProductVector         512  thrpt   15   13.864 ±  0.443  ops/us
   VectorUtilBenchmark.binaryDotProductVector         702  thrpt   15   10.003 ±  0.205  ops/us
   VectorUtilBenchmark.binaryDotProductVector        1024  thrpt   15    6.795 ±  0.098  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked     128  thrpt   15   72.590 ±  1.089  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked     256  thrpt   15   50.962 ±  0.197  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked     300  thrpt   15   39.587 ±  0.159  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked     512  thrpt   15   31.660 ±  0.187  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked     702  thrpt   15   22.251 ±  0.135  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked    1024  thrpt   15   18.117 ±  0.129  ops/us
   ```
   
   At 702 dimensions (next closest to our test data set), half byte is a bit 
more than twice as fast (22.251 vs. 10.003 ops/us). The performance difference 
between -4 and -8 makes sense with this context. I don't know why 4 is so slow; 
these numbers suggest it shouldn't be worse than 8, yet somehow it is 🤷. Not 
sure this is worth figuring out.
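
For a rough sanity check of that reasoning, the figures quoted in this comment can be combined into a back-of-the-envelope estimate. This is only a sketch: it assumes that `binaryDotProductVector` stands in for the 8 bit scorer and `binaryHalfByteVectorPacked` for the 4 bit scorer, and that scoring cost scales as visited count divided by dot product throughput. Neither assumption is confirmed by the benchmark output itself, and the variable names are made up for illustration.

```python
# Average visited counts from the luceneutil iteration summary quoted above.
visited = {"-4": 3794, "-8": 3857, "4": 4294, "8": 3849}

# JMH throughput at 702 dimensions, in ops/us (figures quoted above).
# Assumption: these two kernels correspond to the 8 bit and 4 bit scorers.
throughput_8bit = 10.003  # VectorUtilBenchmark.binaryDotProductVector
throughput_4bit = 22.251  # VectorUtilBenchmark.binaryHalfByteVectorPacked

# How many more vectors the 4 bit configuration visits than the 8 bit one.
visit_ratio = visited["4"] / visited["8"]

# How much faster a single half byte dot product is than a byte one.
speedup = throughput_4bit / throughput_8bit

# Assumed cost model: scoring cost is proportional to visited / throughput,
# so the 4 bit cost relative to 8 bit is visit_ratio / speedup.
relative_cost = visit_ratio / speedup

# Same estimate for the -4 and -8 configurations.
relative_cost_neg = (visited["-4"] / visited["-8"]) / speedup

print(f"visited ratio, 4 vs 8:     {visit_ratio:.2f}x")
print(f"dot product speedup:       {speedup:.2f}x")
print(f"estimated cost, 4 vs 8:    {relative_cost:.2f}")
print(f"estimated cost, -4 vs -8:  {relative_cost_neg:.2f}")
```

Under this simple model both 4 bit configurations would be expected to spend roughly half the scoring time of their 8 bit counterparts, which is why the slow 4 result looks surprising.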


