https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515
Bug ID: 114515
Summary: [14 Regression] Failure to use aarch64 lane forms after PR101523
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Target Milestone: ---
The following test regressed on aarch64 after
g:839bc42772ba7af66af3bd16efed4a69511312ae (the fix for PR101523):
typedef float v4sf __attribute__((vector_size(16)));
void f (v4sf *ptr, float f)
{
  ptr[0] = ptr[0] * (v4sf) { f, f, f, f };
  ptr[1] = ptr[1] * (v4sf) { f, f, f, f };
}
Compiled with -O2, we previously generated:
        ldp     q1, q31, [x0]
        fmul    v1.4s, v1.4s, v0.s[0]
        fmul    v31.4s, v31.4s, v0.s[0]
        stp     q1, q31, [x0]
        ret
Now we generate:
        ldp     q1, q31, [x0]
        dup     v0.4s, v0.s[0]
        fmul    v1.4s, v1.4s, v0.4s
        fmul    v31.4s, v31.4s, v0.4s
        stp     q1, q31, [x0]
        ret
with the extra dup.
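For anyone less familiar with the lane forms, the two sequences can be modelled in plain C. This is only a sketch: the model_* names are invented for illustration, and each loop stands in for the AArch64 instruction named in its comment. It shows that multiplying by a lane gives the same result as a dup followed by a full vector multiply, so the dup is pure overhead.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4

/* Model of "dup vD.4s, vN.s[lane]": broadcast one lane to all four.
   (Hypothetical helper, not a GCC or ACLE function.)  */
static void
model_dup (float out[LANES], const float in[LANES], size_t lane)
{
  assert (lane < LANES);
  for (size_t i = 0; i < LANES; i++)
    out[i] = in[lane];
}

/* Model of "fmul vD.4s, vN.4s, vM.4s": elementwise multiply.  */
static void
model_fmul (float out[LANES], const float a[LANES], const float b[LANES])
{
  for (size_t i = 0; i < LANES; i++)
    out[i] = a[i] * b[i];
}

/* Model of "fmul vD.4s, vN.4s, vM.s[lane]": multiply every element
   of A by a single lane of B, with no separate broadcast.  */
static void
model_fmul_lane (float out[LANES], const float a[LANES],
                 const float b[LANES], size_t lane)
{
  assert (lane < LANES);
  for (size_t i = 0; i < LANES; i++)
    out[i] = a[i] * b[lane];
}
```

Calling model_dup and then model_fmul corresponds to the new code; a single model_fmul_lane call corresponds to the old code.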
The patch is trying to avoid cases where i3 is canonicalised by contextual
information provided by i2. But here we place a full copy of i2 into i3
(creating an instruction that is no more expensive). This is a benefit in its
own right because the two instructions can then execute in parallel rather than
serially. But it also means that, as here, we might be able to remove i2 with
later combinations.
Perhaps we could also check whether the new i3 still contains the destination
of i2, and only reject the combination if it does?
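As a rough illustration of that suggestion, here is a toy model of such a check. It is not combine's code and does not use GCC's rtx representation: toy_insn, toy_reads_reg and toy_i3_absorbed_i2 are hypothetical names, and an instruction is reduced to one destination register plus a short list of source registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SRCS 4

/* Toy stand-in for an instruction (hypothetical, not GCC's rtx).  */
struct toy_insn
{
  int dest;             /* register written by the instruction */
  int srcs[MAX_SRCS];   /* registers read by the instruction */
  size_t n_srcs;        /* number of valid entries in srcs */
};

/* Return true if INSN reads register REG.  */
static bool
toy_reads_reg (const struct toy_insn *insn, int reg)
{
  assert (insn->n_srcs <= MAX_SRCS);
  for (size_t i = 0; i < insn->n_srcs; i++)
    if (insn->srcs[i] == reg)
      return true;
  return false;
}

/* Assumed form of the suggested check: NEW_I3 has absorbed a full
   copy of I2 if it no longer reads I2's destination.  If it still
   reads that register, I3 was only rewritten using context from I2.  */
static bool
toy_i3_absorbed_i2 (const struct toy_insn *i2,
                    const struct toy_insn *new_i3)
{
  return !toy_reads_reg (new_i3, i2->dest);
}
```

In the testcase above, the new i3 would read the scalar input directly rather than the dup result, so this check would let the combination through and leave the dup removable once the second fmul is combined in the same way.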