On Mon, Jun 30, 2025 at 2:50 PM Juergen Christ via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
>
> I am just looking at SLP costing for this program:
>
> #include <stdint.h>
>
> typedef int16_t __attribute__((vector_size(16))) int16x8_t;
>
> int16x8_t __attribute__((noinline)) vec_padd(int16x8_t a, int16x8_t b) {
>   const int16x8_t result = {
>       a[0] + a[1],
>       a[2] + a[3],
>       a[4] + a[5],
>       a[6] + a[7],
>       b[0] + b[1],
>       b[2] + b[3],
>       b[4] + b[5],
>       b[6] + b[7],
>   };
>
>   return result;
> }
>
> SLP costing adds two external vec_construct nodes:
>
> node 0x5210210 1 times vec_construct costs 7 in prologue
> node 0x52103c0 1 times vec_construct costs 7 in prologue
>
> which lead to very high costs.  I also see these two array constructs
> in the SLP pass dump:
>
>   _53 = {_1, _7, _13, _19, _25, _31, _37, _43};
>   vect__2.3_54 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(_53);
>   _55 = {_3, _9, _15, _21, _27, _33, _39, _45};
>   vect__4.4_56 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(_55);
>
> I also see in the SLP log messages like:
>
> missed:   Build SLP failed: different BIT_FIELD_REF arguments in _25 = 
> BIT_FIELD_REF <b_50(D), 16, 0>;

This is because we do not yet try to combine refs from different
dataref groups (or, in this case, vectors) into one SLP operand.  Instead
we defer to possible operand swapping further up the SLP graph, which
might (though not in this case) make the operands uniform.  So we compete
with cases like

  const int16x8_t result = {
      a[0] + b[1],
      a[2] + b[3],
      a[4] + b[5],
      a[6] + b[7],
      b[0] + a[1],
      b[2] + a[3],
      b[4] + a[5],
      b[6] + a[7],
  };

For loop vectorization we know we can force things, but for basic-block
vectorization we have to be careful not to explore the exponential
graph of possibilities here.

> However, if I disable the cost mode (-fvect-cost-model=unlimited),
> forwardprop later on transforms these two vector constructions into
> two vector permute operations:
>
>   _2 = VEC_PERM_EXPR <a_49(D), b_50(D), { 0, 2, 4, 6, 8, 10, 12, 14 }>;
>   vect__2.3_54 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(_2);
>   _4 = VEC_PERM_EXPR <a_49(D), b_50(D), { 1, 3, 5, 7, 9, 11, 13, 15 }>;
>   vect__4.4_56 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(_4);
>
> These would have lower costs and make vectorization profitable.
>
> My question now is: Should I mimic the behaviour of forwardprop in the
> cost model, or is this something that eventually will be ported to SLP
> vectorizer such that the cost model gets two vec_permute instead of
> two vec_constructs?  And if yes, are there other cases where a later
> pass performs essential changes to the constructs passed to the cost
> model?

Fixing these cases for good requires a different approach to SLP discovery.
We could argue that for dataref groups, as well as for vectors like 'a' and 'b'
above, we never want to classify them as "external".  Instead SLP discovery
would need to split the graph upon running into separate chains and combine
them using a permute node.  This (badly) interferes with the current
backtracking, which permutes operands to get a better match, and especially
with basic-block vectorization, where giving up on an edge and creating a
CTOR is a valid thing to do.  Greedy SLP discovery for non-loop
vectorization is still an active research topic
(and GCC in addition wants something practical and compile-time limited...)

I'd be OK with some special-casing of "two parts" here.  Can you, for now,
open a bug report with the testcase (possibly using GCC generic vectors
for portability)?

Richard.

>
> Thanks,
>
> Juergen
