https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117978

            Bug ID: 117978
           Summary: Optimise 128-bit-predicated SVE loads to Advanced SIMD
                    LDRs
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: aarch64-sve, missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

When the governing predicate of a zeroing SVE load is known to select only the
bottom 128 bits, we should emit just a base Advanced SIMD LDR. For example:

#include <arm_sve.h>

svuint8_t foo(uint8_t *x) {
  return svld1(svwhilelt_b8(0, 16), x);
}

generates:
foo(unsigned char*):
        ptrue   p3.b, vl16
        ld1b    z0.b, p3/z, [x0]
        ret

but it could just be:
foo(unsigned char*):
        ldr     q0, [x0]
        ret

This can have a number of benefits:
* Fewer instructions, as we avoid the PTRUE (though in loops it would usually
be hoisted out anyway)
* Avoid using a predicate register
* Allow the LDR to be combined with other loads into an LDP, which SVE doesn't
get until SVE2p1
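As an illustration of the LDP point (the function name is made up, and the
assembly shown is the hoped-for output under this optimisation, not what GCC
currently produces), consider a pair of adjacent known-128-bit loads:

```c
#include <arm_sve.h>

/* Two adjacent loads whose vl16 zeroing predicates select only the
   bottom 128 bits; with the proposed optimisation each becomes a plain
   LDR, and the pair could then be merged into an LDP.  */
svuint8x2_t load_pair(uint8_t *x) {
  svbool_t pg = svwhilelt_b8(0, 16);
  return svcreate2(svld1(pg, x), svld1(pg, x + 16));
}
```

Since writes to a Q register zero the upper SVE bits, this could in principle
compile down to just:

load_pair(unsigned char*):
        ldp     q0, q1, [x0]
        ret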

The same optimisation probably applies to stores as well.
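For example (hypothetical function name), the store analogue of the load
testcase above:

```c
#include <arm_sve.h>

/* The vl16 zeroing predicate means only the low 128 bits of v are
   stored, so a plain Q-form STR would suffice.  */
void store16(uint8_t *x, svuint8_t v) {
  svst1(svwhilelt_b8(0, 16), x, v);
}
```

currently generates an analogous PTRUE + ST1B sequence, but could just be
"str q0, [x0]" followed by "ret".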
Implementation-wise, it seems that adding a split to the
maskload<mode><vpred> pattern in aarch64-sve.md would do the job, though I
haven't yet found a clean way of checking that the predicate operand has the
required constant form. (Any pointers on what API to use that doesn't generate
any temporary RTL?)
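For what it's worth, one way to do the check without creating any RTL might be
to walk the encoded elements of the predicate constant directly via the
CONST_VECTOR accessors in rtl.h. A rough sketch for the byte-element case only
(the helper name is made up, and the assumptions about how the vl16 constant
is encoded are mine):

```cpp
/* Sketch of a hypothetical predicate check for the proposed split.
   Returns true if PRED is a constant VNx16BI predicate whose first
   16 lanes are active and all remaining lanes are inactive.  */
static bool
aarch64_low_vl16_ptrue_p (rtx pred)
{
  if (GET_CODE (pred) != CONST_VECTOR
      || GET_MODE (pred) != VNx16BImode)
    return false;

  /* Reject stepped encodings so that, past the encoded prefix, the
     lanes are just repeats of the final per-pattern values.  */
  if (CONST_VECTOR_NELTS_PER_PATTERN (pred) > 2)
    return false;

  unsigned int npatterns = CONST_VECTOR_NPATTERNS (pred);
  unsigned int encoded_nelts
    = npatterns * CONST_VECTOR_NELTS_PER_PATTERN (pred);

  /* Checking one extra period past both the encoded prefix and lane 16
     is enough: beyond the prefix the lanes repeat with period
     NPATTERNS, and CONST_VECTOR_ELT extrapolates implicit elements.  */
  unsigned int limit = MAX (encoded_nelts, 16U) + npatterns;
  for (unsigned int i = 0; i < limit; ++i)
    if (CONST_VECTOR_ELT (pred, i) != (i < 16 ? const1_rtx : const0_rtx))
      return false;
  return true;
}
```

This only inspects the existing constant, so no temporary RTL is generated,
but whether there is already a tidier backend helper for this I don't know.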
