https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625
Bug ID: 110625
Summary: [AArch64] Vect: SLP fails to vectorize a loop as the
reduction_latency calculated by new costs is too large
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: hliu at amperecomputing dot com
Target Milestone: ---
This problem causes a performance regression in SPEC2017 538.imagick. Consider
the following simple test case (modified from pr96208):
typedef struct {
  unsigned short m1, m2, m3, m4;
} the_struct_t;

typedef struct {
  double m1, m2, m3, m4, m5;
} the_struct2_t;

double bar1 (the_struct2_t*);

double foo (double* k, unsigned int n, the_struct_t* the_struct) {
  unsigned int u;
  the_struct2_t result;
  for (u = 0; u < n; u++, k--) {
    result.m1 += (*k) * the_struct[u].m1;
    result.m2 += (*k) * the_struct[u].m2;
    result.m3 += (*k) * the_struct[u].m3;
    result.m4 += (*k) * the_struct[u].m4;
  }
  return bar1 (&result);
}
Compile it with "-Ofast -S -mcpu=neoverse-n2 -fdump-tree-vect-details
-fno-tree-slp-vectorize". SLP fails to vectorize the loop because the vector
body cost is inflated by an overly large "reduction latency". See the vect
pass dump:
Original vector body cost = 51
Scalar issue estimate:
...
reduction latency = 2
estimated min cycles per iteration = 2.000000
estimated cycles per vector iteration (for VF 2) = 4.000000
Vector issue estimate:
...
reduction latency = 8 <-- Too large
estimated min cycles per iteration = 8.000000
Increasing body cost to 102 because scalar code would issue more quickly
Cost model analysis:
Vector inside of loop cost: 102
...
Scalar iteration cost: 44
...
missed: cost model: the vector iteration cost = 102 divided by the scalar
iteration cost = 44 is greater or equal to the vectorization factor = 2.
missed: not vectorized: vectorization not profitable.
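For reference, the increase matches the issue estimates above: the scalar code
is estimated at 2.0 cycles per iteration, i.e. 2.0 * VF = 4.0 cycles per
vector iteration, while the vector body is estimated at 8.0 cycles, so the
body cost appears to be scaled by 8.0 / 4.0, giving 51 * 2 = 102. Since the
8.0 comes entirely from the (wrong) reduction latency, correcting the latency
should remove the penalty and keep the original body cost of 51.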
SLP succeeds with "-mcpu=neoverse-n1", as N1 doesn't use the new vector costs
and the vector body cost is not increased. The "reduction latency" is
calculated in count_ops() in aarch64.cc:
/* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
that's not yet the case. */
ops->reduction_latency = MAX (ops->reduction_latency, base * count);
For this case, "base" is 2 and "count" is 4. To my understanding, for SLP
"count" is the number of scalar stmts (i.e. result.m1 +=, ...) in a
permutation group that are merged into a vector stmt. Since those scalar
accumulators end up in separate vector lanes and are updated by independent
vector operations, each loop-carried dependence chain still only grows by
"base" cycles per iteration. It therefore seems unreasonable to multiply the
cost by "count" (the code probably doesn't account for the SLP situation).
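To illustrate, here is a rough sketch of the vectorized shape I'd expect
(illustrative only, written with GNU vector extensions; this is not the code
GCC actually emits, and the lane layout is an assumption):

  typedef double v2df __attribute__ ((vector_size (16)));

  double
  foo_sketch (double *k, unsigned int n, the_struct_t *the_struct)
  {
    v2df acc12 = { 0.0, 0.0 };   /* lanes hold result.m1, result.m2 */
    v2df acc34 = { 0.0, 0.0 };   /* lanes hold result.m3, result.m4 */
    for (unsigned int u = 0; u < n; u++, k--)
      {
        v2df kk = { *k, *k };
        v2df m12 = { (double) the_struct[u].m1, (double) the_struct[u].m2 };
        v2df m34 = { (double) the_struct[u].m3, (double) the_struct[u].m4 };
        /* The two updates are independent, so each loop-carried chain
           grows by one fused multiply-add per iteration: the reduction
           latency is "base", not "base * count".  */
        acc12 += kk * m12;
        acc34 += kk * m34;
      }
    the_struct2_t result;
    result.m1 = acc12[0]; result.m2 = acc12[1];
    result.m3 = acc34[0]; result.m4 = acc34[1];
    return bar1 (&result);
  }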
So, I'm thinking of calculating it differently for the SLP situation, e.g.:
  unsigned int latency = PURE_SLP_STMT (stmt_info) ? base : base * count;
  ops->reduction_latency = MAX (ops->reduction_latency, latency);
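In context, the change would look roughly like this (the surrounding
count_ops code is paraphrased from my reading of aarch64.cc, so it may not
match the current sources exactly):

  if (vect_is_reduction (stmt_info))
    {
      unsigned int base
        = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);

      /* For a pure SLP statement, the COUNT scalar reductions are merged
         into the lanes of the vector reduction and do not lengthen the
         loop-carried dependence chain, so don't multiply by COUNT.  */
      unsigned int latency
        = PURE_SLP_STMT (stmt_info) ? base : base * count;
      ops->reduction_latency = MAX (ops->reduction_latency, latency);
    }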
Is this reasonable?