https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121925

            Bug ID: 121925
           Summary: Idiom recognize FCMLA operations independently
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
            Blocks: 53947
  Target Milestone: ---
            Target: aarch64*

Given the following vectors

a = [A1 A0]
b = [C  B ]
c = [E  D ]

today we recognize such sequences as complex operations only when they match
the dataflow of a full complex-number operation.

For example, we recognize

   double ax = (b[i+1] * a[i]) + (b[i] * a[i]);
   double bx = (a[i+1] * b[i]) - (a[i+1] * b[i+1]);

   c[i] = c[i] - ax;
   c[i+1] = c[i+1] + bx;

and on AArch64 this results in two FCMLA[1] instructions with different
rotations.

[1]
https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/FCMLA--Floating-point-Complex-Multiply-Accumulate-
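
For reference, the best-known pairing of two rotations is a plain complex
multiply-accumulate, acc += a * b, which splits into a #0 step followed by a
#90 step. The sketch below models that split in scalar C and is only meant
to illustrate the dataflow; cmla_two_step is a hypothetical helper name, not
something that exists in GCC or ACLE.

```c
#include <complex.h>

/* Hypothetical helper: computes acc + a * b as the two partial updates
   that FCMLA #0 and FCMLA #90 would each perform on one complex lane.  */
static float complex
cmla_two_step (float complex acc, float complex a, float complex b)
{
  float re = crealf (acc);
  float im = cimagf (acc);

  /* Rotation #0: only the real part of a is used.  */
  re += crealf (a) * crealf (b);
  im += crealf (a) * cimagf (b);

  /* Rotation #90: only the imaginary part of a is used.  */
  re -= cimagf (a) * cimagf (b);
  im += cimagf (a) * crealf (b);

  return CMPLXF (re, im);
}
```

Each half on its own is one of the standalone forms this ticket is about.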

However, standalone FCMLA instructions are also common in scientific
computing workloads.

This ticket is about recognizing these operations on their own as well.

Given the vectors above this is about recognizing

rot0   = [E + A0 * C, D + A0 * B]
rot90  = [E - A1 * B, D + A1 * C]
rot180 = [E - A0 * C, D - A0 * B]
rot270 = [E + A1 * B, D - A1 * C]

With the usual DF requirement that {C,B} be sequential, {E,D} be sequential,
and {A1,A0} be sequential *or* splats.
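
As a scalar reference for the four forms, here is a minimal sketch that
follows the architectural FCMLA definition for one pair of lanes. The name
fcmla_ref and the array layout are assumptions made for illustration: index
0 is the even (real) lane and index 1 the odd (imaginary) lane, so
acc = {E, D}, a = {A0, A1} and b = {C, B}.

```c
/* Hypothetical scalar model of FCMLA on one even/odd lane pair.
   acc = {E, D}, a = {A0, A1}, b = {C, B}; rot is 0, 90, 180 or 270.
   Any other rot value leaves acc unchanged.  */
static void
fcmla_ref (float acc[2], const float a[2], const float b[2], int rot)
{
  switch (rot)
    {
    case 0:    /* [E + A0 * C, D + A0 * B] */
      acc[0] += a[0] * b[0];
      acc[1] += a[0] * b[1];
      break;
    case 90:   /* [E - A1 * B, D + A1 * C] */
      acc[0] -= a[1] * b[1];
      acc[1] += a[1] * b[0];
      break;
    case 180:  /* [E - A0 * C, D - A0 * B] */
      acc[0] -= a[0] * b[0];
      acc[1] -= a[0] * b[1];
      break;
    case 270:  /* [E + A1 * B, D - A1 * C] */
      acc[0] += a[1] * b[1];
      acc[1] -= a[1] * b[0];
      break;
    default:
      break;
    }
}
```

Note that #0 and #180 only ever read A0, while #90 and #270 only ever read
A1, which is why a splat of the used element is acceptable for the a operand.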

As an example, we want to recognize

void f (float *restrict a, float *restrict b,
        float *restrict c, float *restrict d, int n)
{
    for (int i = 0; i < (n & -2) / 2; i+=2)
      {
        d[i] = c[i] + (a[i] * b[i]);
        d[i+1] = c[i+1] + (a[i] * b[i+1]);
      }
}

which today is vectorized as

.L4:
        add     x9, x0, x4
        add     x8, x2, x4
        add     x7, x1, x4
        add     x6, x3, x4
        add     x4, x4, 32
        ld2     {v29.4s - v30.4s}, [x9]
        ld2     {v27.4s - v28.4s}, [x8]
        ld2     {v30.4s - v31.4s}, [x7]
        fmla    v27.4s, v29.4s, v30.4s
        fmla    v28.4s, v29.4s, v31.4s
        st2     {v27.4s - v28.4s}, [x6]
        cmp     x10, x4
        bne     .L4

and should instead be

.L4:
        ldr     q29, [x0], 16
        ldr     q30, [x1], 16
        ldr     q27, [x2], 16
        fcmla   v27.4s, v29.4s, v30.4s, #0
        str     q27, [x3], 16
        cmp     x10, x4
        bne     .L4

The corresponding SLP tree is

note:   Final SLP tree for instance 0x3dba70f0:
note:   node 0x3db4fe90 (max_nunits=2, refcnt=2) vector(2) float
note:   op template: *_10 = _11;
note:           stmt 0 *_10 = _11;
note:           stmt 1 *_19 = _20;
note:           children 0x3db4ff40
note:   node 0x3db4ff40 (max_nunits=2, refcnt=2) vector(2) float
note:   op template: _11 = _4 + _9;
note:           stmt 0 _11 = _4 + _9;
note:           stmt 1 _20 = _15 + _18;
note:           children 0x3db500a0 0x3db50150
note:   node 0x3db500a0 (max_nunits=2, refcnt=2) vector(2) float
note:   op template: _4 = *_3;
note:           stmt 0 _4 = *_3;
note:           stmt 1 _15 = *_14;
note:   node 0x3db50150 (max_nunits=2, refcnt=2) vector(2) float
note:   op template: _9 = _6 * _8;
note:           stmt 0 _9 = _6 * _8;
note:           stmt 1 _18 = _6 * _17;
note:           children 0x3db506d0 0x3db502b0
note:   node 0x3db506d0 (max_nunits=2, refcnt=2) vector(2) float
note:   op template: _6 = *_5;
note:           stmt 0 _6 = *_5;
note:           stmt 1 _6 = *_5;
note:           load permutation { 0 0 }
note:   node 0x3db502b0 (max_nunits=2, refcnt=2) vector(2) float
note:   op template: _8 = *_7;
note:           stmt 0 _8 = *_7;
note:           stmt 1 _17 = *_16;

The operation itself is easy to match; the issue is the splat in node
0x3db506d0. We have no way to linearize this node because the other element
just isn't there in the SLP tree, as it's not used.
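
To make the splat problem concrete, the sketch below shows (in scalar C, for
rotation #0 only) that the odd lane of the a operand is a don't-care: the
splat { a[i], a[i] } from load permutation { 0 0 } and the contiguous pair
{ a[i], a[i+1] } give the same result. This only illustrates why a linear
load would be semantically fine; it says nothing about how to represent that
in the SLP tree. Both function names are hypothetical.

```c
/* Hypothetical scalar model of FCMLA #0 on one lane pair:
   d = c + a[0] * b, element-wise.  a[1] is never read.  */
static void
fcmla_rot0 (float d[2], const float c[2], const float a[2], const float b[2])
{
  d[0] = c[0] + a[0] * b[0];
  d[1] = c[1] + a[0] * b[1];
}

/* Returns 1 if feeding the splat operand and feeding the contiguous
   operand produce identical results for the pair starting at a, b, c.  */
static int
splat_matches_contiguous (const float *a, const float *b, const float *c)
{
  const float splat[2] = { a[0], a[0] };   /* load permutation { 0 0 } */
  const float contig[2] = { a[0], a[1] };  /* plain two-element load */
  float d_splat[2], d_contig[2];

  fcmla_rot0 (d_splat, c, splat, b);
  fcmla_rot0 (d_contig, c, contig, b);
  return d_splat[0] == d_contig[0] && d_splat[1] == d_contig[1];
}
```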

It is unclear to me how to deal with the splat case and could use some advice
here.

Additionally, what would good names for the IFNs and optabs be? I haven't
been able to figure out good names for them.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
