ChrisHegarty commented on PR #12737: URL: https://github.com/apache/lucene/pull/12737#issuecomment-1787403081
> I tried naively writing the logic like this with a couple of N (8, 16, 32, etc.) with FMA both off and on to see if I can baby this compiler to vectorize, nope, nothing. I don't think autovectorization works except for BitSet :)
>
> ```
> // loop one, where's my vectorization? no floating point excuses here. tried with fma too.
> float acc[] = new float[32];
> int upperBound = a.length & ~(32 - 1);
> for (; i < upperBound; i += 32) {
>   for (int j = 0; j < acc.length; j++) {
>     acc[j] = a[i+j] * b[i+j] + acc[j];
>   }
> }
> // second reduction loop
> for (int j = 0; j < acc.length; j++) {
>   res += acc[j];
> }
> ```

I see the scalar code vectorize, but not optimally for the target CPU - e.g. `vfmadd231ss %xmm9,%xmm10,%xmm4` on my Rocket Lake. Whereas the Vector API compilation emits instructions that use wider registers, e.g. `vfmadd231ps %zmm6,%zmm2,%zmm10`.

My primitive (and possibly out of date) understanding is that the register allocator will not use the wider registers for these kinds of auto-vectorization scenarios - the advice is to use the Vector API!
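For anyone who wants to reproduce the experiment, a self-contained version of the quoted loop might look like the sketch below. The class and method names (`UnrolledDot`, `dotUnrolled`, `dotSimple`) are made up for illustration and do not come from the PR. The quoted snippet leaves `i` and `res` undeclared and has no tail handling, so those parts are assumptions filled in here.

```java
final class UnrolledDot {
    // Number of independent accumulators, matching the 32 used in the quoted snippet.
    private static final int LANES = 32;

    /** Dot product using LANES independent accumulators, then a reduction and a scalar tail. */
    static float dotUnrolled(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        float[] acc = new float[LANES];
        int i = 0;
        // Largest multiple of LANES that fits in the input (LANES is a power of two).
        int upperBound = a.length & ~(LANES - 1);
        for (; i < upperBound; i += LANES) {
            for (int j = 0; j < LANES; j++) {
                acc[j] = a[i + j] * b[i + j] + acc[j];
            }
        }
        // Second loop: reduce the accumulators to a single value.
        float res = 0f;
        for (int j = 0; j < LANES; j++) {
            res += acc[j];
        }
        // Tail (assumption, not in the quoted snippet): leftover elements handled one at a time.
        for (; i < a.length; i++) {
            res += a[i] * b[i];
        }
        return res;
    }

    /** Plain single-accumulator loop, used only as a reference result. */
    static float dotSimple(float[] a, float[] b) {
        float res = 0f;
        for (int i = 0; i < a.length; i++) {
            res += a[i] * b[i];
        }
        return res;
    }

    public static void main(String[] args) {
        float[] a = new float[100];
        float[] b = new float[100];
        for (int k = 0; k < a.length; k++) {
            a[k] = k % 7; // small integers keep the float sums exact in any order
            b[k] = 2f;
        }
        System.out.println(dotUnrolled(a, b) + " vs " + dotSimple(a, b));
    }
}
```

To inspect what the JIT emits for the hot loop, one option is running with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly`, which needs the hsdis disassembler plugin installed. When reading the output, note that the `ss` suffix (as in `vfmadd231ss`) denotes a scalar single-precision operation, while `ps` (as in `vfmadd231ps`) denotes a packed one.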