[Bug rtl-optimization/114515] New: [14 Regression] Failure to use aarch64 lane forms after PR101523

rsandifo at gcc dot gnu.org via Gcc-bugs Thu, 28 Mar 2024 03:01:10 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515


            Bug ID: 114515
           Summary: [14 Regression] Failure to use aarch64 lane forms
                    after PR101523
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---

The following test regressed on aarch64 after
g:839bc42772ba7af66af3bd16efed4a69511312ae (the fix for PR101523):

typedef float v4sf __attribute__((vector_size(16)));
void f (v4sf *ptr, float f)
{
  ptr[0] = ptr[0] * (v4sf) { f, f, f, f };
  ptr[1] = ptr[1] * (v4sf) { f, f, f, f };
}

With -O2, we previously generated:

        ldp     q1, q31, [x0]
        fmul    v1.4s, v1.4s, v0.s[0]
        fmul    v31.4s, v31.4s, v0.s[0]
        stp     q1, q31, [x0]
        ret

Now we generate:

        ldp     q1, q31, [x0]
        dup     v0.4s, v0.s[0]
        fmul    v1.4s, v1.4s, v0.4s
        fmul    v31.4s, v31.4s, v0.4s
        stp     q1, q31, [x0]
        ret

with the extra dup.
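
For anyone wanting to check that the two sequences are functionally
equivalent before comparing the assembly, here is a minimal sketch of a
harness around the test above.  The helper check_f is hypothetical (it is
not part of the original test), and it assumes a compiler that supports
GNU vector extensions, including subscripting of vector values:

```c
typedef float v4sf __attribute__((vector_size(16)));

/* The function from the report, unchanged.  */
void f (v4sf *ptr, float f)
{
  ptr[0] = ptr[0] * (v4sf) { f, f, f, f };
  ptr[1] = ptr[1] * (v4sf) { f, f, f, f };
}

/* Hypothetical helper, not part of the original test: return 1 if f
   multiplies each of the eight floats by S, and 0 otherwise.  */
int check_f (float s)
{
  v4sf data[2] = { { 1.0f, 2.0f, 3.0f, 4.0f }, { 5.0f, 6.0f, 7.0f, 8.0f } };

  f (data, s);
  for (int i = 0; i < 2; i++)
    for (int j = 0; j < 4; j++)
      if (data[i][j] != (float) (i * 4 + j + 1) * s)
        return 0;
  return 1;
}
```

This only checks the arithmetic result.  Whether the lane form or the
dup form is used still has to be read from the -O2 output for f itself.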

The patch is trying to avoid cases where i3 is canonicalised by contextual
information provided by i2.  But here we place a full copy of i2 into i3
(creating an instruction that is no more expensive).  This is a benefit in its
own right because the two instructions can then execute in parallel rather than
serially.  But it also means that, as here, we might be able to remove i2 with
later combinations.

Perhaps we could also check whether i3 still contains the destination of i2?
