https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117978
Bug ID: 117978
Summary: Optimise 128-bit-predicated SVE loads to Advanced SIMD LDRs
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: aarch64-sve, missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

When it is known that the predicate on a zero-predicated SVE load selects only the bottom 128 bits, we should emit just a base LDR. For example:

  #include <arm_sve.h>

  svuint8_t foo(uint8_t *x) {
      return svld1(svwhilelt_b8(0, 16), x);
  }

generates:

  foo(unsigned char*):
          ptrue   p3.b, vl16
          ld1b    z0.b, p3/z, [x0]
          ret

but could be just:

  foo(unsigned char*):
          ldr     q0, [x0]
          ret

This has a number of benefits:
* Fewer instructions, as we avoid the PTRUE (though it would usually be hoisted outside of loops)
* Avoids using a predicate register
* Allows the LDR to be combined with other loads into an LDP, which SVE doesn't get until SVE2p1

The same optimisation probably applies to stores as well.

In terms of implementation, it seems that adding a split to the maskload<mode><vpred> pattern in aarch64-sve.md would do, though I haven't yet found a clean way of checking that the predicate operand has the required constant form (any pointers on what API to use that doesn't generate any temporary RTL?)