https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82255

            Bug ID: 82255
           Summary: Vectorizer cost model overcounts cost of some
                    vectorized loads
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wschmidt at gcc dot gnu.org
  Target Milestone: ---

Consider the following testcase, compiled at -O3 on powerpc64le:

extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__
\
((__const__));

static int
foo (unsigned char *w, int i, unsigned char *x, int j)
{
  int tot = 0;
  for (int a = 0; a < 16; a++)
    {
      for (int b = 0; b < 16; b++)
        tot += abs (w[b] - x[b]);  (*)
      w += i;
      x += j;
    }
  return tot;
}

void
bar (unsigned char *w, unsigned char *x, int i, int *result)
{
  *result = foo (w, 16, x, i);
}

The salient feature here is that we process 16 characters from "w" at
a time, and each time through the "a" loop we increment the "w"
pointer by 16; and we process 16 characters from "x" at a time, but
here each time through the "a" loop we increment the "x" pointer by an
unknown but unchanging value.

Since we don't know the alignment of either "w" or "x", it is logical
to assume that vectorizing this loop on a machine with a 128-bit
vector size will treat both "w" and "x" the same.  The 16 loads of
characters from each array can be combined into a single unaligned
load of a 16-byte vector.  And indeed, this is the code that the
vectorizer produces.  It is only the stride *between* vector loads
that differs; one is constant and the other is variable.

However, the vector cost model does NOT produce the same results for
"w" and "x".  For "w", the cost model represents the combined loads as
a single "unaligned_load" vect_cost_for_stmt.  For "x", the cost model
represents them as an "unaligned_load" followed by a "vec_construct".
This wrongly makes the load of "x" look a good deal more expensive
than the load of "w".

Now, how does this happen?  Prior to vectorization, the "b" loop is
completely unrolled, exposing the main computation (*) to SLP
vectorization.  In tree-vect-slp.c:vect_analyze_slp_cost_1, we find
that for "x" (but not "w"), STMT_VINFO_STRIDED_P (stmt_info) is TRUE,
and this causes us to assume a memory access type of
VMAT_STRIDED_SLP.  That value is passed to tree-vect-stmts.c:
vect_model_load_cost, which first records the cost of the load, and
then (because of the memory access type VMAT_STRIDED_SLP) also records
a vec_construct cost.

This would be appropriate if the loads making up the vectorized load
didn't already form a full vector, so that a strided load of elements
were being loaded.  For example, if we loaded 4 characters at a time
and then skipped ahead by an unknown stride amount to get the next 4,
a strided load of four 32-bit elements would indeed require a
vec_construct.  But that's not the case here.

Going back a bit further, STMT_VINFO_STRIDED_P (stmt_info) is set to
TRUE in tree-vect-data-refs.c:vect_analyze_data_refs, because the
group of data references appears within a loop, and the DR_STEP for
the group is not a constant.  Note that the analysis is done for the
outer loop, but is consumed when doing SLP vectorization on the
unrolled inner loop (on the loop body of the outer loop).

Looking at tree-vect-stmts.c:vectorizable_load, we don't actually
generate the constructor unless the number of loads we're going to
generate is greater than 1.  So we're missing the corresponding logic
in the analysis phase.

I'm working on a patch.

Reply via email to