msfroh commented on PR #13521:
URL: https://github.com/apache/lucene/pull/13521#issuecomment-2313628836

   The approach is pretty neat. 
   
   I wonder whether `Bit21With3StepsEncoder` does better on aarch64 because of 
the explicitly unrolled loop. If so, would unrolling to a multiple of 2 longs 
align better with processor cache lines?
   
   That is, unrolling the loop to process 3 longs per iteration is faster than 
processing 1 long per iteration. What about 2 longs per iteration? What about 4 
longs per iteration?
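
   To make the unroll-factor question concrete, here is a minimal sketch of the kind of variants that could be compared. It is not the PR's code: the class `UnrollSketch`, the methods `encode1`/`encode2`/`encode4`, and the layout (three 21-bit values packed into the low 63 bits of each long) are assumptions made purely for illustration.

```java
// Hypothetical sketch, not the PR's encoder. Assumes three 21-bit values
// are packed into the low 63 bits of each output long.
final class UnrollSketch {
    static final int BITS = 21;
    static final long MASK = (1L << BITS) - 1;

    // Packs in[i], in[i+1], in[i+2] into a single long.
    static long pack(int[] in, int i) {
        return (in[i] & MASK)
            | ((in[i + 1] & MASK) << BITS)
            | ((in[i + 2] & MASK) << (2 * BITS));
    }

    // Baseline: 1 long per iteration.
    static void encode1(int[] in, long[] out) {
        for (int i = 0, j = 0; j < out.length; i += 3, j++) {
            out[j] = pack(in, i);
        }
    }

    // Manually unrolled: 2 longs per iteration, scalar loop for the tail.
    static void encode2(int[] in, long[] out) {
        int i = 0, j = 0;
        for (; j + 2 <= out.length; i += 6, j += 2) {
            out[j] = pack(in, i);
            out[j + 1] = pack(in, i + 3);
        }
        for (; j < out.length; i += 3, j++) {
            out[j] = pack(in, i);
        }
    }

    // Manually unrolled: 4 longs per iteration, scalar loop for the tail.
    static void encode4(int[] in, long[] out) {
        int i = 0, j = 0;
        for (; j + 4 <= out.length; i += 12, j += 4) {
            out[j] = pack(in, i);
            out[j + 1] = pack(in, i + 3);
            out[j + 2] = pack(in, i + 6);
            out[j + 3] = pack(in, i + 9);
        }
        for (; j < out.length; i += 3, j++) {
            out[j] = pack(in, i);
        }
    }
}
```

   All three variants should produce identical output, so any difference measured in a microbenchmark would come from the loop shape alone. The JIT may also unroll the baseline on its own, so only measurement can answer the question.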
   
   Since I've been playing around with the incubating vector API recently, I'm 
going to try downloading your microbenchmark and adding a vectorized 
implementation. (I have access to an M1 Mac that should be able to process 2 
longs at a time, plus an Intel Xeon whose AVX-512 operations should be able to 
process 8 longs at a time.)
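
   The lane counts above follow from register width divided by the size of a long. A small sketch of that arithmetic, assuming 128-bit NEON registers on the M1 and 512-bit registers for AVX-512 (the class `LaneCountSketch` and method `longLanes` are hypothetical names):

```java
// Hypothetical helper: how many 64-bit longs fit in one vector register.
final class LaneCountSketch {
    static int longLanes(int vectorBits) {
        return vectorBits / Long.SIZE; // Long.SIZE is 64
    }

    public static void main(String[] args) {
        // Assumed register widths: 128-bit NEON, 512-bit AVX-512.
        System.out.println("128-bit: " + longLanes(128) + " longs");
        System.out.println("512-bit: " + longLanes(512) + " longs");
    }
}
```

   With the incubating vector API itself, `LongVector.SPECIES_PREFERRED.length()` reports the lane count the JVM actually selects on a given machine, which may differ from the hardware maximum.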


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

