msfroh commented on PR #13521: URL: https://github.com/apache/lucene/pull/13521#issuecomment-2313628836
The approach is pretty neat.

I'm wondering if `Bit21With3StepsEncoder` does better on aarch64 because of the explicitly unrolled loop? If so, I'm wondering if unrolling to a multiple of 2 longs would align better with processor cache lines. That is, unrolling the loop to process 3 longs per iteration is faster than processing 1 long per iteration. What about 2 longs per iteration? What about 4 longs per iteration?

Since I've been playing around with the incubating vector API recently, I'm going to try downloading your microbenchmark and adding a vectorized implementation. (I have access to an M1 Mac that should be able to process 2 longs at a time, plus an Intel Xeon whose AVX-512 operations should probably be able to do 8 longs.)
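To make the "2 longs per iteration" question concrete, here is a minimal scalar sketch of the kind of variant that could be added to the microbenchmark. It is not the PR's code: the class name `UnrollSketch`, the method names, and the packing layout (three 21-bit values per long, first value in the highest bits) are all assumptions made for illustration, and the real `Bit21With3StepsEncoder` may lay out its bits differently.

```java
// Hypothetical sketch: names and bit layout are assumptions, not taken from the PR.
final class UnrollSketch {
    private static final long MASK_21 = (1L << 21) - 1;

    // Packs three non-negative 21-bit values into each long.
    // Assumes values.length is a multiple of 3.
    static long[] pack(int[] values) {
        long[] packed = new long[values.length / 3];
        for (int i = 0; i < packed.length; i++) {
            packed[i] = ((values[3 * i] & MASK_21) << 42)
                    | ((values[3 * i + 1] & MASK_21) << 21)
                    | (values[3 * i + 2] & MASK_21);
        }
        return packed;
    }

    // Baseline: decodes one long (three values) per loop iteration.
    static void decodeOnePerIteration(long[] packed, int[] out) {
        for (int i = 0, j = 0; i < packed.length; i++, j += 3) {
            long l = packed[i];
            out[j] = (int) ((l >>> 42) & MASK_21);
            out[j + 1] = (int) ((l >>> 21) & MASK_21);
            out[j + 2] = (int) (l & MASK_21);
        }
    }

    // Unrolled variant: decodes two longs (six values) per loop iteration,
    // then handles a possible leftover long in a scalar tail loop.
    static void decodeTwoPerIteration(long[] packed, int[] out) {
        int i = 0;
        int j = 0;
        int limit = packed.length & ~1; // largest even count <= length
        for (; i < limit; i += 2, j += 6) {
            long a = packed[i];
            long b = packed[i + 1];
            out[j] = (int) ((a >>> 42) & MASK_21);
            out[j + 1] = (int) ((a >>> 21) & MASK_21);
            out[j + 2] = (int) (a & MASK_21);
            out[j + 3] = (int) ((b >>> 42) & MASK_21);
            out[j + 4] = (int) ((b >>> 21) & MASK_21);
            out[j + 5] = (int) (b & MASK_21);
        }
        for (; i < packed.length; i++, j += 3) {
            long l = packed[i];
            out[j] = (int) ((l >>> 42) & MASK_21);
            out[j + 1] = (int) ((l >>> 21) & MASK_21);
            out[j + 2] = (int) (l & MASK_21);
        }
    }
}
```

Both decoders produce the same output, so any difference measured between them would come from the loop shape alone. Whether the JIT already unrolls the baseline on a given platform is something only the benchmark can answer.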