https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #25 from Peter Cordes <peter at cordes dot ca> ---
We're getting a spill/reload inside the loop with AVX512:
.L2:
vmovdqa64 (%esp), %zmm3
vpaddd (%eax), %zmm3, %zmm2
addl $64, %eax
vmovdqa64 %zmm2, (%esp)
cmpl %eax, %edx
jne .L2
The loop finishes with the accumulator in memory *and* in ZMM2. The copy in ZMM2
is ignored, and we get:
# narrow to 32 bytes using memory indexing instead of VEXTRACTI32X8 or
# VEXTRACTI64X4
vmovdqa 32(%esp), %ymm5
vpaddd (%esp), %ymm5, %ymm0
# braindead: vextracti128 can write a new reg instead of destroying xmm0
vmovdqa %xmm0, %xmm1
vextracti128 $1, %ymm0, %xmm0
vpaddd %xmm0, %xmm1, %xmm0
... then a sane 128b hsum as expected, so at least that part went
right.
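
For anyone following along, here is a scalar model of the reduction order discussed above: fold the high half onto the low half, 512 -> 256 -> 128 bits, then finish within the 128-bit vector. This is only an illustrative sketch, not GCC's output and not the testcase from this bug; the name hsum_narrowing and the fixed 16-lane width (one ZMM register of 32-bit elements) are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical scalar model of a narrowing horizontal sum.
   acc[] stands in for the 16 x 32-bit lanes of a ZMM accumulator. */
int32_t hsum_narrowing(const int32_t acc[16])
{
    uint32_t v[16];
    for (int i = 0; i < 16; i++)
        v[i] = (uint32_t)acc[i];

    /* width 16 -> 8 models ZMM -> YMM (the VEXTRACTI64X4 + vpaddd step),
       8 -> 4 models YMM -> XMM (vextracti128 + vpaddd),
       4 -> 2 -> 1 models the final 128-bit hsum. */
    for (int width = 16; width > 1; width /= 2) {
        int half = width / 2;
        for (int i = 0; i < half; i++)
            v[i] += v[i + half];   /* unsigned add wraps, like vpaddd */
    }
    return (int32_t)v[0];
}
```

With lanes holding 1..16 this returns 136. Each halving step corresponds to one extract plus one add on registers, so nothing in the reduction needs to go through memory.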