On Mon, Jun 30, 2025 at 2:50 PM Juergen Christ via Gcc <gcc@gcc.gnu.org> wrote: > > Hi, > > I am just looking at SLP costing for this program: > > #include <stdint.h> > > typedef int16_t __attribute__((vector_size(16))) int16x8_t; > > int16x8_t __attribute__((noinline)) vec_padd(int16x8_t a, int16x8_t b) { > const int16x8_t result = { > a[0] + a[1], > a[2] + a[3], > a[4] + a[5], > a[6] + a[7], > b[0] + b[1], > b[2] + b[3], > b[4] + b[5], > b[6] + b[7], > }; > > return result; > } > > SLP costing adds two external vec_construct nodes: > > node 0x5210210 1 times vec_construct costs 7 in prologue > node 0x52103c0 1 times vec_construct costs 7 in prologue > > which lead to very high costs. I also see these two array constructs > in the SLP pass dump: > > _53 = {_1, _7, _13, _19, _25, _31, _37, _43}; > vect__2.3_54 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(_53); > _55 = {_3, _9, _15, _21, _27, _33, _39, _45}; > vect__4.4_56 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(_55); > > I also see in the SLP log messages like: > > missed: Build SLP failed: different BIT_FIELD_REF arguments in _25 = > BIT_FIELD_REF <b_50(D), 16, 0>;
This is because we do not yet try to combine refs from different dataref groups (or in this case vectors) into one SLP operand. Instead we want to defer to possible operand swapping upthread that might (not in this case) make the operands uniform. So we compete with the case of const int16x8_t result = { a[0] + b[1], a[2] + b[3], a[4] + b[5], a[6] + b[7], b[0] + a[1], b[2] + a[3], b[4] + a[5], b[6] + a[7], }; for loop vectorization we know we can force things, but for basic-block vectorization we have to be careful to not explore the exponential graph of possibilities here. > However, if I disable the cost mode (-fvect-cost-model=unlimited), > forwardprop later on transforms these two vector constructions into > two vector permute operations: > > _2 = VEC_PERM_EXPR <a_49(D), b_50(D), { 0, 2, 4, 6, 8, 10, 12, 14 }>; > vect__2.3_54 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(_2); > _4 = VEC_PERM_EXPR <a_49(D), b_50(D), { 1, 3, 5, 7, 9, 11, 13, 15 }>; > vect__4.4_56 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(_4); > > These would have lower costs and make vectorization profitable. > > My question now is: Should I mimic the behaviour of forwardprop in the > cost model, or is this something that eventually will be ported to SLP > vectorizer such that the cost model gets two vec_permute instead of > two vec_constructs? And if yes, are there other cases where a later > pass performs essential changes to the constructs passed to the cost > model? To fix these cases for good requires a different approach to SLP discovery. We could argue that for dataref groups as well as for vectors like 'a'' and 'b' above we never want to pick them as "external". Instead SLP discovery would need to split the graph upon running into sparate chains and combine using a permute node. This (badly) interferes with current backtracking doing permutation to get a better match of course, and esp. for basic-block vectorization where giving up on an edge, creating a CTOR, is a valid thing to do. Greedy SLP discovery for non-loop vectorization is still an active research topic (and GCC in addition wants something practical and compile-time limited...) I'd be OK with some special-casing of "two parts" here. Can you for now open a bugreport with the testcase (possibly use GCC generic vectors for portability)? Richard. > > Thanks, > > Juergen