https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449
rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org

--- Comment #1 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Interesting idea!  But I think the ideal thing here would be to do
the 8*step after the store:

.L2:
        add     v29.4s, v31.4s, v28.4s    # += 4*step
        stp     q31, q29, [x0]
        add     v31.4s, v31.4s, v27.4s    # += 8*step
        add     x0, x0, 32
        cmp     x1, x0
        bne     .L2

This has the advantage that the loop-carried dependency is only one ADD
instruction deep, rather than 2 ADDs deep.  I haven't looked how easy it
would be to do though…
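
For readers who want to see the dependency-chain point outside of assembly, here is a minimal scalar C sketch. It is not GCC code and not the test case from the PR: the function names (fill_chained, fill_short, add_splat, outputs_match) are hypothetical, and the "chained" variant is only an assumed shape of a loop where two 4*step adds sit on the loop-carried path. Each four-element array stands in for one .4s vector register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 4 };  /* one .4s vector holds four 32-bit lanes */

/* Hypothetical helper: lane-wise dst[l] = src[l] + k (vector add of a splat). */
static void add_splat(uint32_t *dst, const uint32_t *src, uint32_t k)
{
    for (int l = 0; l < LANES; l++)
        dst[l] = src[l] + k;
}

/* Assumed "2 ADDs deep" shape: the next iv is computed from hi, so both
   adds lie on the loop-carried chain. */
static void fill_chained(uint32_t *out, size_t n, uint32_t start, uint32_t step)
{
    uint32_t iv[LANES], hi[LANES];
    assert(n % (2 * LANES) == 0);
    for (int l = 0; l < LANES; l++)
        iv[l] = start + (uint32_t)l * step;
    for (size_t i = 0; i < n; i += 2 * LANES) {
        add_splat(hi, iv, 4 * step);            /* iv + 4*step */
        memcpy(out + i, iv, sizeof iv);
        memcpy(out + i + LANES, hi, sizeof hi);
        add_splat(iv, hi, 4 * step);            /* depends on the add above */
    }
}

/* Shape from the comment: hi is a side computation, and the only
   loop-carried operation is a single add of 8*step after the store. */
static void fill_short(uint32_t *out, size_t n, uint32_t start, uint32_t step)
{
    uint32_t iv[LANES], hi[LANES];
    assert(n % (2 * LANES) == 0);
    for (int l = 0; l < LANES; l++)
        iv[l] = start + (uint32_t)l * step;
    for (size_t i = 0; i < n; i += 2 * LANES) {
        add_splat(hi, iv, 4 * step);            /* off the carried chain */
        memcpy(out + i, iv, sizeof iv);         /* the two memcpy calls play */
        memcpy(out + i + LANES, hi, sizeof hi); /* the role of the stp       */
        add_splat(iv, iv, 8 * step);            /* the one carried add */
    }
}

/* Returns 1 if both variants write start + k*step for every element k. */
static int outputs_match(void)
{
    enum { N = 32 };
    uint32_t a[N], b[N];
    fill_chained(a, N, 3, 5);
    fill_short(b, N, 3, 5);
    for (size_t k = 0; k < N; k++)
        if (b[k] != 3 + 5 * (uint32_t)k)
            return 0;
    return memcmp(a, b, sizeof a) == 0;
}
```

Both variants store the same values; the difference is only in how many dependent adds separate one iteration's iv from the next. Unsigned lanes are used here so that wraparound is well defined in C.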