https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116667
Bug ID: 116667
Summary: superfluous zero-extends of SVE values
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*
We've recently started vectorizing functions such as:
void
decode (unsigned char * restrict h, unsigned char * restrict p4,
        unsigned char * restrict p6, int f, int b, char * restrict e,
        char * restrict a, char * restrict i)
{
  int j = b % 8;
  for (int k = 0; k < 2; ++k)
    {
      p4[k] = i[a[k]] | e[k] << j;
      h[k] = p6[k] = a[k];
    }
}
due to the vectorizer now correctly eliding one of the loads, which makes the
loop profitable to vectorize. Compiling with -O3 -march=armv9-a now vectorizes
this and generates:
decode:
        ptrue   p7.s, vl2
        ptrue   p6.b, all
        ld1b    z31.s, p7/z, [x6]
        ld1b    z28.s, p7/z, [x5]
        and     w4, w4, 7
        movprfx z0, z31
        uxtb    z0.s, p6/m, z31.s
        mov     z30.s, w4
        ld1b    z29.s, p7/z, [x7, z0.s, uxtw]
        lslr    z30.s, p6/m, z30.s, z28.s
        orr     z30.d, z30.d, z29.d
        st1b    z30.s, p7, [x1]
        st1b    z31.s, p7, [x2]
        st1b    z31.s, p7, [x0]
        ret
whereas we used to generate:
decode:
        ptrue   p7.s, vl2
        and     w4, w4, 7
        ld1b    z0.s, p7/z, [x6]
        ld1b    z28.s, p7/z, [x5]
        ld1b    z29.s, p7/z, [x7, z0.s, uxtw]
        ld1b    z31.s, p7/z, [x6]
        mov     z30.s, w4
        ptrue   p6.b, all
        lslr    z30.s, p6/m, z30.s, z28.s
        orr     z30.d, z30.d, z29.d
        st1b    z30.s, p7, [x1]
        st1b    z31.s, p7, [x2]
        st1b    z31.s, p7, [x0]
        ret
This is great; however, we're let down by the RTL optimizers.
There are a couple of weird things here.
Cleaning up the sequence a bit, the problematic part is:
        ptrue   p7.s, vl2
        ptrue   p6.b, all
        ld1b    z31.s, p7/z, [x6]
        movprfx z0, z31
        uxtb    z0.s, p6/m, z31.s
        ld1b    z29.s, p7/z, [x7, z0.s, uxtw]
This zero-extends the same value in z31 three times: the ld1b already
zero-extends the loaded bytes into the 32-bit lanes, the uxtb zero-extends
those lanes again, and the gather then applies a uxtw extension to the
offsets. In the old code we actually loaded the same value twice, both
zero-extended and not zero-extended.
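To isolate this, a reduced variant (my own sketch, not part of the original
testcase) containing only the i[a[k]] lookup should be enough to reproduce the
ld1b + uxtb + extending-gather sequence:

/* Hypothetical reduced testcase: only the table lookup from decode.
   The loop loads bytes, zero-extends them to 32-bit lane indices, and
   uses them as offsets for a gather.  Built with -O3 -march=armv9-a I'd
   expect the same redundant uxtb between the ld1b and the gather.  */
void
gather_only (unsigned char * restrict p4, char * restrict a,
             char * restrict i)
{
  for (int k = 0; k < 2; ++k)
    p4[k] = i[a[k]];
}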
The RTL for the z31 load + extend is:
(insn 15 13 16 2 (set (reg:VNx4QI 110 [ vect__3.6 ])
        (unspec:VNx4QI [
                (subreg:VNx4BI (reg:VNx16BI 120) 0)
                (mem:VNx4QI (reg/v/f:DI 117 [ a ]) [0 S[4, 4] A8])
            ] UNSPEC_LD1_SVE)) "/app/example.c":9:24 5683 {maskloadvnx4qivnx4bi}
     (expr_list:REG_DEAD (reg/v/f:DI 117 [ a ])
        (expr_list:REG_EQUAL (unspec:VNx4QI [
                    (const_vector:VNx4BI [
                            (const_int 1 [0x1]) repeated x2
                            repeat [
                                (const_int 0 [0])
                                (const_int 0 [0])
                            ]
                        ])
                    (mem:VNx4QI (reg/v/f:DI 117 [ a ]) [0 S[4, 4] A8])
                ] UNSPEC_LD1_SVE)
            (nil))))
(insn 16 15 17 2 (set (reg:VNx16BI 122)
        (const_vector:VNx16BI repeat [
                (const_int 1 [0x1])
            ])) 5658 {*aarch64_sve_movvnx16bi}
     (nil))
(insn 17 16 20 2 (set (reg:VNx4SI 121 [ vect_patt_59.7_52 ])
        (unspec:VNx4SI [
                (subreg:VNx4BI (reg:VNx16BI 122) 0)
                (zero_extend:VNx4SI (reg:VNx4QI 110 [ vect__3.6 ]))
            ] UNSPEC_PRED_X)) 6943 {*zero_extendvnx4qivnx4si2}
     (expr_list:REG_EQUAL (zero_extend:VNx4SI (reg:VNx4QI 110 [ vect__3.6 ]))
        (nil)))
But combine refuses to merge the zero extend into the load:

deferring rescan insn with uid = 15.
allowing combination of insns 15 and 17
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
i2 didn't change, not doing this

and instead copies it into the gather load, leaving insn 17 alone, presumably
because of the predicate. So it looks like a bug in our backend costing: the
widening load is definitely cheaper than a load + extend. However, I'm not
sure, as the line "i2 didn't change, not doing this" seems to indicate that it
wasn't rejected because of cost?
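For what it's worth, here is a toy model of the two decisions visible in the
dump lines above (my paraphrase of the dump, not GCC source): the cost
comparison accepts a replacement that is no more expensive than the original,
so 8 vs 8 passes, and the refusal then comes from a separate check on i2:

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the combine dump above (my reading, not GCC source).
   The cost test passes since the replacement costs no more than the
   original; the combination is then refused because i2 is unchanged.  */
static bool
toy_try_combine (int original_costs, int replacement_costs, bool i2_changed)
{
  if (replacement_costs > original_costs)
    return false;               /* a cost-based rejection would be here */
  if (!i2_changed)
    {
      puts ("i2 didn't change, not doing this");
      return false;             /* refused, but not on cost grounds */
    }
  return true;
}

int
main (void)
{
  /* insns 15 and 17: equal costs (4 + 4 = 8), i2 left unchanged.  */
  printf ("combined: %d\n", toy_try_combine (4 + 4, 4 + 4, false));
  return 0;
}

If that reading is right, the combination is refused by a no-progress guard
rather than by the backend cost model.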
In the codegen there's a peculiarity: while the two loads

        ld1b    z31.s, p7/z, [x6]
        ld1b    z28.s, p7/z, [x5]

are both widening loads, they aren't modelled the same:

        ld1b    z31.s, p7/z, [x6]   // 15 [c=4 l=4] maskloadvnx4qivnx4bi
        ld1b    z28.s, p7/z, [x5]   // 50 [c=4 l=4] aarch64_load_zero_extendvnx4sivnx4qi

This is because the RTL pattern for the first load seems to want to keep the
same number of elements as the input vector size. So it ends up with a
separate extend feeding the gather, and I think we're relying on combine to
change one form into the other to remove the unneeded extends.
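As a contrast case (again my own sketch, not from the report), a plain
char -> unsigned int widening copy, where the extension is consumed directly
rather than feeding a gather, should get the combined form straight away:

/* Hypothetical contrast testcase: a direct widening copy.  I'd expect
   this to use the aarch64_load_zero_extend* pattern directly, i.e. a
   single ld1b z*.s with no separate uxtb.  */
void
widen (unsigned int * restrict dst, unsigned char * restrict src)
{
  for (int k = 0; k < 2; ++k)
    dst[k] = src[k];
}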