https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121925
Bug ID: 121925
Summary: Idiom recognize FCMLA operations independently
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Blocks: 53947
Target Milestone: ---
Target: aarch64*
Given the following vectors
a = [A1 A0]
b = [C B ]
c = [E D ]
we currently recognize such sequences as complex operations when they match the
dataflow of a complex-number operation.
That is, we recognize, for example,
double ax = (b[i+1] * a[i]) + (b[i] * a[i]);
double bx = (a[i+1] * b[i]) - (a[i+1] * b[i+1]);
c[i] = c[i] - ax;
c[i+1] = c[i+1] + bx;
and on AArch64 this results in two FCMLA[1] instructions with different
rotations.
[1]
https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/FCMLA--Floating-point-Complex-Multiply-Accumulate-
However, standalone FCMLA operations are also common in scientific
computing workloads.
This ticket is about recognizing these standalone operations as well.
Given the vectors above this is about recognizing
rot0 = [E + A0 * C, D + A0 * B]
rot90 = [E + A1 * B, D - A1 * C]
rot180 = [E - A0 * C, D - A0 * B]
rot270 = [E - A1 * B, D + A1 * C]
With the usual DF requirement that {C,B} be sequential, {E,D} be sequential,
and {A1,A0} be sequential *or* a splat.
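To make the four rotations concrete, here is a minimal scalar sketch in C. It is a hypothetical reference model, not GCC code: `fcmla_ref` and `enum rot` are names made up for illustration. It assumes lane 0 is the low (real) element and lane 1 the high (imaginary) element, so a = [A1 A0] means a[0] = A0 and a[1] = A1, the element written B in the formulas is taken to be the low element of b, and the signs follow the architectural definition of FCMLA for each rotation.

```c
/* Hypothetical scalar model of one FCMLA lane pair (illustration only).
   Index 0 is the low/real element, index 1 the high/imaginary element.
   acc must not alias a or b.  */
enum rot { ROT0, ROT90, ROT180, ROT270 };

static void
fcmla_ref (float acc[2], const float a[2], const float b[2], enum rot r)
{
  switch (r)
    {
    case ROT0:   /* [E + A0 * C, D + A0 * B] */
      acc[0] += a[0] * b[0];
      acc[1] += a[0] * b[1];
      break;
    case ROT90:  /* [E + A1 * B, D - A1 * C] */
      acc[0] -= a[1] * b[1];
      acc[1] += a[1] * b[0];
      break;
    case ROT180: /* [E - A0 * C, D - A0 * B] */
      acc[0] -= a[0] * b[0];
      acc[1] -= a[0] * b[1];
      break;
    case ROT270: /* [E - A1 * B, D + A1 * C] */
      acc[0] += a[1] * b[1];
      acc[1] -= a[1] * b[0];
      break;
    }
}
```

Applying ROT0 and then ROT90 to the same accumulator gives the full complex multiply-accumulate acc + a * b, which is the two-instruction sequence recognized today; each case on its own is the standalone form this ticket is about.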
As an example, we want to recognize
void f (float *restrict a, float *restrict b,
float *restrict c, float *restrict d, int n)
{
for (int i = 0; i < (n & -2) / 2; i+=2)
{
d[i] = c[i] + (a[i] * b[i]);
d[i+1] = c[i+1] + (a[i] * b[i+1]);
}
}
which today is vectorized as
.L4:
add x9, x0, x4
add x8, x2, x4
add x7, x1, x4
add x6, x3, x4
add x4, x4, 32
ld2 {v29.4s - v30.4s}, [x9]
ld2 {v27.4s - v28.4s}, [x8]
ld2 {v30.4s - v31.4s}, [x7]
fmla v27.4s, v29.4s, v30.4s
fmla v28.4s, v29.4s, v31.4s
st2 {v27.4s - v28.4s}, [x6]
cmp x10, x4
bne .L4
and should instead be
.L4:
ldr q29, [x0], 16
ldr q30, [x1], 16
ldr q27, [x2], 16
fcmla v27.4s, v29.4s, v30.4s, #0
str q27, [x3], 16
cmp x10, x4
bne .L4
The corresponding SLP tree is
note: Final SLP tree for instance 0x3dba70f0:
note: node 0x3db4fe90 (max_nunits=2, refcnt=2) vector(2) float
note: op template: *_10 = _11;
note: stmt 0 *_10 = _11;
note: stmt 1 *_19 = _20;
note: children 0x3db4ff40
note: node 0x3db4ff40 (max_nunits=2, refcnt=2) vector(2) float
note: op template: _11 = _4 + _9;
note: stmt 0 _11 = _4 + _9;
note: stmt 1 _20 = _15 + _18;
note: children 0x3db500a0 0x3db50150
note: node 0x3db500a0 (max_nunits=2, refcnt=2) vector(2) float
note: op template: _4 = *_3;
note: stmt 0 _4 = *_3;
note: stmt 1 _15 = *_14;
note: node 0x3db50150 (max_nunits=2, refcnt=2) vector(2) float
note: op template: _9 = _6 * _8;
note: stmt 0 _9 = _6 * _8;
note: stmt 1 _18 = _6 * _17;
note: children 0x3db506d0 0x3db502b0
note: node 0x3db506d0 (max_nunits=2, refcnt=2) vector(2) float
note: op template: _6 = *_5;
note: stmt 0 _6 = *_5;
note: stmt 1 _6 = *_5;
note: load permutation { 0 0 }
note: node 0x3db502b0 (max_nunits=2, refcnt=2) vector(2) float
note: op template: _8 = *_7;
note: stmt 0 _8 = *_7;
note: stmt 1 _17 = *_16;
The operation itself is easy to match; the issue is the splat in node
0x3db506d0.
We have no way to linearize this node because the other element just isn't
there in the SLP tree, as it's not used.
It is unclear to me how to deal with the splat case, and I could use some
advice here.
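For reference, at the instruction level rotation #0 only ever reads the low element of each pair of a, so a sequential pair { a[i], a[i+1] } and the splat { a[i], a[i] } described by the load permutation { 0 0 } produce the same result. A minimal C sketch of that observation follows. The names `rot0_pair`, `rot0_sequential` and `rot0_splat` are hypothetical and not GCC functions, and the loop bound is simplified to whole pairs for illustration.

```c
/* Hypothetical per-pair model of FCMLA #0 (illustration only):
   only element 0 of `a` is read.  */
static void
rot0_pair (float d[2], const float c[2], const float a[2], const float b[2])
{
  d[0] = c[0] + a[0] * b[0];
  d[1] = c[1] + a[0] * b[1];
}

/* `a` supplied as the sequential pair { a[i], a[i+1] }.  */
static void
rot0_sequential (const float *a, const float *b, const float *c,
                 float *d, int n)
{
  for (int i = 0; i + 1 < n; i += 2)
    rot0_pair (&d[i], &c[i], &a[i], &b[i]);
}

/* `a` supplied as the splat { a[i], a[i] }, i.e. load permutation { 0 0 }.  */
static void
rot0_splat (const float *a, const float *b, const float *c,
            float *d, int n)
{
  for (int i = 0; i + 1 < n; i += 2)
    {
      float splat[2] = { a[i], a[i] };
      rot0_pair (&d[i], &c[i], splat, &b[i]);
    }
}
```

Both loops compute d[i] = c[i] + a[i] * b[i] and d[i+1] = c[i+1] + a[i] * b[i+1], matching the body of f above. The unused a[i+1] never affects the result, which is why it does not appear in the SLP tree.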
Additionally, what would good names for the IFNs and optabs be? I haven't been
able to come up with good names for them.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations