rmuir opened a new pull request, #12632: URL: https://github.com/apache/lucene/pull/12632
We can get these functions closer to optimal by just directly converting to 32 bits + `vpmulld`. See https://stackoverflow.com/a/69848057 for the motivation.

You can reproduce my results with `java -jar target/vectorbench.jar -p size=1024 Binary`. See https://github.com/rmuir/vectorbench for instructions. There is a README now! I don't touch Java that much, so it makes it easier on me.

Skylake (256-bit):
```
Benchmark                                    (size)   Mode  Cnt  Score   Error   Units
BinaryCosineBenchmark.cosineDistanceNew        1024  thrpt    5  3.252 ± 1.457  ops/us
BinaryCosineBenchmark.cosineDistanceNewNew     1024  thrpt    5  3.746 ± 0.069  ops/us
BinaryDotProductBenchmark.dotProductNew        1024  thrpt    5  7.080 ± 0.121  ops/us
BinaryDotProductBenchmark.dotProductNewNew     1024  thrpt    5  8.329 ± 0.288  ops/us
BinarySquareBenchmark.squareDistanceNew        1024  thrpt    5  6.208 ± 0.800  ops/us
BinarySquareBenchmark.squareDistanceNewNew     1024  thrpt    5  7.285 ± 0.629  ops/us
```

I'd appreciate it if someone could test AVX-512.

This codepath only impacts 256-bit+ vectors, so it won't change anything for your Mac with 128-bit vectors. I will look into that one again separately: I know how to speed it up, but I don't want to make things ugly. I like this change because it makes the code a little simpler.
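For readers unfamiliar with the "convert to 32 bits, then multiply" idea in the description, here is a minimal scalar sketch of the arithmetic that a single 32-bit lane would perform. It is not the code from this pull request: the class name `ByteWidenSketch` and both method names are hypothetical, and the real change is expressed with vector operations rather than a scalar loop.

```java
// Scalar illustration only: each loop iteration mirrors what one 32-bit
// SIMD lane would do. Class and method names are hypothetical.
final class ByteWidenSketch {

    // Dot product: widen both bytes to int first, then multiply and
    // accumulate entirely in 32 bits.
    static int dotProduct(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            int x = a[i]; // sign-extending byte -> int conversion
            int y = b[i];
            sum += x * y; // 32-bit multiply, the per-lane job of vpmulld
        }
        return sum;
    }

    // Squared distance: widen, subtract, then square in 32 bits.
    static int squareDistance(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            int diff = a[i] - b[i]; // both operands are promoted to int
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, -128};
        byte[] b = {4, 5, 6, 127};
        System.out.println(dotProduct(a, b));
        System.out.println(squareDistance(a, b));
    }
}
```

Because every product of two bytes fits comfortably in an `int`, widening before the multiply needs no intermediate narrowing or overflow handling, which is one plausible reason the approach keeps the code simple.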