https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #25 from Peter Cordes <peter at cordes dot ca> ---
We're getting a spill/reload inside the loop with AVX512:

.L2:
        vmovdqa64       (%esp), %zmm3
        vpaddd  (%eax), %zmm3, %zmm2
        addl    $64, %eax
        vmovdqa64       %zmm2, (%esp)
        cmpl    %eax, %edx
        jne     .L2

Loop finishes with the accumulator in memory *and* in ZMM2.  The copy in ZMM2
is ignored, and we get

        # narrow to 32 bytes using memory indexing instead of VEXTRACTI32X8 or VEXTRACTI64X4
        vmovdqa 32(%esp), %ymm5
        vpaddd  (%esp), %ymm5, %ymm0

        # braindead: vextracti128 can write a new reg instead of destroying xmm0
        vmovdqa %xmm0, %xmm1
        vextracti128    $1, %ymm0, %xmm0
        vpaddd  %xmm0, %xmm1, %xmm0

        ... then a sane 128b hsum as expected, so at least that part went right.
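
For reference, here is a rough scalar model in C of the data flow being discussed: a 16-lane accumulator that stays live across the loop, followed by a reduction that narrows 512 -> 256 -> 128 bits by adding the upper half onto the lower half. This is only a sketch and not the test case from this bug. The names hsum16_model and sum_model are made up for illustration, and it assumes the element count is a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a narrowing horizontal sum over 16 x 32-bit lanes
   (one 512-bit vector).  Each step adds the upper half of the live lanes
   onto the lower half, which is what an extract-high + vpaddd pair does
   in registers. */
static uint32_t hsum16_model(const uint32_t lanes[16])
{
    uint32_t v[16];
    for (size_t i = 0; i < 16; i++)
        v[i] = lanes[i];

    /* width 8: 512b -> 256b, width 4: 256b -> 128b,
       widths 2 and 1: the remaining 128b hsum */
    for (size_t width = 8; width >= 1; width /= 2)
        for (size_t i = 0; i < width; i++)
            v[i] += v[i + width];   /* wraps mod 2^32, like vpaddd */

    return v[0];
}

/* Model of the loop: acc[] stands in for a single ZMM accumulator that
   should never need to round-trip through the stack.
   Assumption (hypothetical helper): n is a multiple of 16. */
static uint32_t sum_model(const uint32_t *a, size_t n)
{
    assert(n % 16 == 0);
    uint32_t acc[16] = {0};
    for (size_t i = 0; i < n; i += 16)
        for (size_t lane = 0; lane < 16; lane++)
            acc[lane] += a[i + lane];   /* one vpaddd from memory per 64 bytes */
    return hsum16_model(acc);
}
```

In the asm above, the width-8 step is the one done through memory (32(%esp) and (%esp)) rather than with an extract from the register copy, and the width-4 step is the one with the extra vmovdqa.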