https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449
rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
--- Comment #1 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Interesting idea! But I think the ideal thing here would be
to do the 8*step after the store:
.L2:
        add     v29.4s, v31.4s, v28.4s  // += 4*step
        stp     q31, q29, [x0]
        add     v31.4s, v31.4s, v27.4s  // += 8*step
        add     x0, x0, 32
        cmp     x1, x0
        bne     .L2
This has the advantage that the loop-carried dependency
is only one ADD deep rather than two: the next value of v31
is computed from the previous v31, not from v29.
I haven't looked at how easy it would be to do, though…
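
To make the dependency argument concrete, here is a rough scalar C model of the two schedules. It is only a sketch: the `vec4` type and the helper names (`vadd`, `vdup`, `fill_chained`, `fill_short_chain`, `schedules_agree`) are made up for illustration, and `fill_chained` is my assumption about what the two-ADD-deep form looks like, not code taken from this PR or from GCC. Lanes are unsigned so that wraparound is well defined.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };

/* Hypothetical stand-in for a 4 x 32-bit vector register. */
typedef struct { uint32_t lane[LANES]; } vec4;

static vec4 vadd(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

static vec4 vdup(uint32_t x)
{
    vec4 r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = x;
    return r;
}

/* Initial induction vector: { start, start+step, start+2*step, start+3*step }. */
static vec4 iv_init(uint32_t start, uint32_t step)
{
    vec4 r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = start + (uint32_t)i * step;
    return r;
}

static void vstore(uint32_t *p, vec4 v)
{
    for (int i = 0; i < LANES; i++)
        p[i] = v.lane[i];
}

/* Assumed two-ADD-deep form: the next iv is derived from hi, so the
   loop-carried chain is iv -> hi -> iv. */
static void fill_chained(uint32_t *out, size_t pairs, uint32_t start, uint32_t step)
{
    vec4 iv = iv_init(start, step);
    vec4 step4 = vdup(4 * step);
    for (size_t i = 0; i < pairs; i++, out += 2 * LANES) {
        vec4 hi = vadd(iv, step4);      /* += 4*step */
        vstore(out, iv);
        vstore(out + LANES, hi);
        iv = vadd(hi, step4);           /* += 4*step again, depends on hi */
    }
}

/* Form from the comment: the next iv depends only on the previous iv,
   so the loop-carried chain is a single ADD. */
static void fill_short_chain(uint32_t *out, size_t pairs, uint32_t start, uint32_t step)
{
    vec4 iv = iv_init(start, step);
    vec4 step4 = vdup(4 * step);
    vec4 step8 = vdup(8 * step);
    for (size_t i = 0; i < pairs; i++, out += 2 * LANES) {
        vec4 hi = vadd(iv, step4);      /* += 4*step, not loop-carried */
        vstore(out, iv);
        vstore(out + LANES, hi);
        iv = vadd(iv, step8);           /* += 8*step, depends only on iv */
    }
}

/* Returns 1 if both schedules write start + i*step to every element. */
static int schedules_agree(uint32_t start, uint32_t step)
{
    enum { PAIRS = 4, N = PAIRS * 2 * LANES };
    uint32_t a[N], b[N];
    fill_chained(a, PAIRS, start, step);
    fill_short_chain(b, PAIRS, start, step);
    for (size_t i = 0; i < N; i++)
        if (a[i] != b[i] || a[i] != start + (uint32_t)i * step)
            return 0;
    return 1;
}
```

Both functions store the same values; the only difference is which register the final ADD reads, and that is what shortens the carried chain at the cost of one extra invariant (8*step) held in a register.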