ChrisHegarty commented on PR #12737:
URL: https://github.com/apache/lucene/pull/12737#issuecomment-1787403081

   > I tried naively writing the logic like this with a couple N (8, 16, 32, etc.)
   > with FMA both off and on to see if I can baby this compiler to vectorize,
   > nope, nothing. I don't think autovectorization works except for BitSet :)
   > 
   > ```
   > // loop one, where's my vectorization? no floating point excuses here. tried with fma too.
   > float acc[] = new float[32];
   > int upperBound = a.length & ~(32 - 1);
   > for (; i < upperBound; i += 32) {
   >   for (int j = 0; j < acc.length; j++) {
   >     acc[j] = a[i+j] * b[i+j] + acc[j];
   >   }
   > }
   > // second reduction loop
   > for (int j = 0; j < acc.length; j++) {
   >   res += acc[j];
   > }
   > ```
   
   I see the scalar code vectorize, but not optimally for the target CPU, e.g.
`vfmadd231ss %xmm9,%xmm10,%xmm4` (128-bit `xmm` registers) on my Rocket Lake,
whereas the Vector API compilation emits instructions that use wider registers,
e.g. `vfmadd231ps %zmm6,%zmm2,%zmm10` (512-bit `zmm` registers). My primitive
(and possibly out-of-date) understanding is that the register allocator will
not use the wider registers for these kinds of auto-vectorization scenarios, so
the advice is to use the Vector API!
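
For anyone who wants to reproduce the scalar side of this comparison, here is a minimal sketch of the quoted loop wrapped into a complete method. The class name `ScalarDotProductSketch`, the method name `dotProduct`, the `LANES` constant, the length check, and the tail loop are my own additions for illustration and are not from the PR; the quoted snippet only shows the blocked loop and the reduction.

```java
// Hypothetical harness, not code from the PR: names and the tail loop are assumptions.
final class ScalarDotProductSketch {
    // Block size used in the quoted snippet (the comment also mentions trying 8 and 16).
    private static final int LANES = 32;

    static float dotProduct(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector lengths differ");
        }
        float[] acc = new float[LANES];
        int i = 0;
        // Largest multiple of LANES that fits in the array (LANES is a power of two).
        int upperBound = a.length & ~(LANES - 1);
        // Loop one: independent accumulators, so no floating-point reordering is needed.
        for (; i < upperBound; i += LANES) {
            for (int j = 0; j < LANES; j++) {
                acc[j] = a[i + j] * b[i + j] + acc[j];
            }
        }
        // Second reduction loop: fold the accumulators into one result.
        float res = 0f;
        for (int j = 0; j < LANES; j++) {
            res += acc[j];
        }
        // Tail (added here, not in the quoted snippet): elements past the last full block.
        for (; i < a.length; i++) {
            res += a[i] * b[i];
        }
        return res;
    }

    public static void main(String[] args) {
        float[] a = new float[100];
        float[] b = new float[100];
        java.util.Arrays.fill(a, 1f);
        java.util.Arrays.fill(b, 2f);
        System.out.println(dotProduct(a, b)); // prints 200.0
    }
}
```

To inspect what C2 emits for a method like this, the usual route is `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly`, which needs the hsdis disassembler plugin installed; the method also has to be called often enough to get compiled.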


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

