https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091
Bug ID: 113091
Summary: Over-estimate SLP vector-to-scalar cost for non-live
pattern statement
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: fxue at os dot amperecomputing.com
Target Milestone: ---
GCC fails to vectorize the testcase below on aarch64.
int test(unsigned array[8]);
int foo(char *a, char *b)
{
  unsigned array[8];
  array[0] = (a[0] - b[0]);
  array[1] = (a[1] - b[1]);
  array[2] = (a[2] - b[2]);
  array[3] = (a[3] - b[3]);
  array[4] = (a[4] - b[4]);
  array[5] = (a[5] - b[5]);
  array[6] = (a[6] - b[6]);
  array[7] = (a[7] - b[7]);
  return test(array);
}
The dump shows that the loads of a[i] and b[i] are considered live as scalar
references, which results in an over-estimated vector-to-scalar cost.
*a_50(D) 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 1B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 3B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 5B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 7B] 1 times vec_to_scalar costs 2 in epilogue
*b_51(D) 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 1B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 3B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 5B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 7B] 1 times vec_to_scalar costs 2 in epilogue
Subtraction on the char type is recognized as a widening subtraction (widen-sub),
which involves two kinds of pattern replacement.
* Original
_1 = *a_50(D);
_2 = (int) _1;
_3 = *b_51(D);
_4 = (int) _3;
_5 = _2 - _4;
* After pattern replacement
patt_63 = (unsigned short) _1; // _2 = (int) _1;
patt_64 = (int) patt_63; // _2 = (int) _1;
patt_65 = (unsigned short) _3; // _4 = (int) _3;
patt_66 = (int) patt_65; // _4 = (int) _3;
patt_67 = .VEC_WIDEN_MINUS (_1, _3); // _5 = _2 - _4;
patt_68 = (signed short) patt_67; // _5 = _2 - _4;
patt_69 = (int) patt_68; // _5 = _2 - _4;
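To illustrate why the replacement of "_5 = _2 - _4" is value-preserving, here is a minimal scalar sketch. It assumes char is unsigned (as the "(unsigned short)" casts in the pattern statements suggest) and models .VEC_WIDEN_MINUS on a single lane as a 16-bit modular subtraction. The function names are hypothetical and are not part of GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Original form: both operands are promoted to int before subtracting. */
int sub_via_promotion(unsigned char a, unsigned char b)
{
    return (int)a - (int)b;
}

/* Pattern form, one lane: widen to 16 bits, subtract modulo 2^16,
   reinterpret as signed 16-bit, then extend to int. */
int sub_via_widen_minus(unsigned char a, unsigned char b)
{
    uint16_t wide = (uint16_t)((uint16_t)a - (uint16_t)b); /* like patt_67 */
    /* Out-of-range unsigned-to-signed conversion is implementation-defined
       in C; GCC reduces modulo 2^16, which is what this sketch relies on. */
    int16_t narrow_signed = (int16_t)wide;                  /* like patt_68 */
    return (int)narrow_signed;                              /* like patt_69 */
}

/* Exhaustively compare the two forms over all 8-bit operand pairs. */
void check_all_pairs(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            assert(sub_via_promotion((unsigned char)a, (unsigned char)b)
                   == sub_via_widen_minus((unsigned char)a, (unsigned char)b));
}
```

The difference of two 8-bit values always lies in [-255, 255], so it fits in a signed 16-bit lane, and the final sign extension to int recovers the same value as the original int subtraction.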
For the statement "_2 = (int) _1", its vectorization representative "patt_64 =
(int) patt_63" is not marked as PURE_SLP, so it is conservatively considered to
have a scalar use and to be live outside of the SLP bb (in the function
vect_bb_slp_mark_live_stmts). However, the pattern definition is actually dead
and should not contribute to the vector-to-scalar cost.
Defs from pattern statements are not part of the function body, so we cannot
track their def/use chains as we do for ordinary SSA names. We could probably
have a quick fix for one situation: if the original SSA name "_2" has a single
use, it should be covered solely by the vectorized operation, no matter how it
would be handled without pattern replacement.
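The effect of the proposed quick fix on the cost estimate can be sketched with a toy model. This is not GCC code: struct toy_use, its fields and the functions below are hypothetical stand-ins, and the per-extract cost of 2 and the count of 16 loads are taken from the dump above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a scalar statement that uses a vectorized load
   (e.g. "_2 = (int) _1").  Illustrative only, not GCC's stmt_vec_info. */
struct toy_use {
    bool pure_slp;          /* vectorization representative is PURE_SLP */
    bool single_use_result; /* original SSA result (e.g. "_2") has one use */
};

/* Conservative rule: any use whose representative is not PURE_SLP is
   treated as a scalar use, which keeps the load live. */
bool live_conservative(const struct toy_use *use)
{
    return !use->pure_slp;
}

/* Sketch of the quick fix: a use whose original result has a single use
   is assumed to be covered by the vectorized operation. */
bool live_with_single_use_fix(const struct toy_use *use)
{
    return !use->pure_slp && !use->single_use_result;
}

/* Sum the vec_to_scalar cost over all loads that the rule keeps live. */
int epilogue_cost(const struct toy_use *uses, size_t count,
                  int cost_per_extract,
                  bool (*is_live)(const struct toy_use *))
{
    assert(uses != NULL || count == 0);
    int total = 0;
    for (size_t i = 0; i < count; i++)
        if (is_live(&uses[i]))
            total += cost_per_extract;
    return total;
}
```

With 16 loads (8 from a, 8 from b), each used by a non-PURE_SLP pattern definition whose original result has a single use, the conservative rule charges 16 * 2 = 32 in the epilogue, while the single-use rule charges nothing in this model.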