https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2024-02-09

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>a redundant scalar load

I don't see any redundant load in that loop.
```
.L3:
        movq    (%rdi), %rax            ;; load a[i] from rdi
        vmovups (%rax), %xmm1           ;; load rax[0-3] into vector
        vdivps  %xmm0, %xmm1, %xmm1     ;; divide
        vmovups %xmm1, (%rax)           ;; store result back into rax[0-3]
        addq    $16, %rax               ;; add 4*4 to rax
        movq    %rax, (%rdi)            ;; store rax back into a[i] (the slot at rdi)
        addq    $8, %rdi                ;; add 8 to rdi
        cmpq    %rdi, %rdx
        jne     .L3                     ;; compare and loop back
```

That is, a[i] is different in each iteration, so the load from (%rdi) is not redundant.
Maybe you reduced this code too much?
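
For reference, here is a rough C sketch of what that loop appears to do, reconstructed only from the assembly above. It is not the reporter's testcase: the function name `divide_and_advance`, the parameter names, and the assumption that `%xmm0` holds one divisor broadcast to all four lanes are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical reconstruction of the loop; names are not from the bug report.
   Assumes a is an array of n pointers, each pointing at 4 or more floats,
   and that the divisor is the same for all four lanes. */
void divide_and_advance(float **a, size_t n, float d)
{
    for (size_t i = 0; i < n; i++) {
        float *p = a[i];            /* movq (%rdi), %rax */
        for (int k = 0; k < 4; k++)
            p[k] /= d;              /* vmovups, vdivps, vmovups */
        a[i] = p + 4;               /* addq $16, %rax; movq %rax, (%rdi) */
    }                               /* addq $8, %rdi moves on to a[i+1] */
}
```

In this reading each iteration loads a different pointer slot, `a[i]`, once and stores the advanced pointer back once, which matches the single `movq (%rdi), %rax` per iteration.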