https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82255
            Bug ID: 82255
           Summary: Vectorizer cost model overcounts cost of some
                    vectorized loads
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wschmidt at gcc dot gnu.org
  Target Milestone: ---

Consider the following testcase, compiled at -O3 on powerpc64le:

extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__))
     __attribute__ ((__const__));

static int foo (unsigned char *w, int i, unsigned char *x, int j)
{
  int tot = 0;
  for (int a = 0; a < 16; a++)
    {
      for (int b = 0; b < 16; b++)
        tot += abs (w[b] - x[b]);      (*)
      w += i;
      x += j;
    }
  return tot;
}

void bar (unsigned char *w, unsigned char *x, int i, int *result)
{
  *result = foo (w, 16, x, i);
}

The salient feature here is that we process 16 characters from "w" at a time,
and each time through the "a" loop we increment the "w" pointer by 16; we
likewise process 16 characters from "x" at a time, but each time through the
"a" loop we increment the "x" pointer by an unknown but unchanging value.

Since we don't know the alignment of either "w" or "x", it is logical to
expect that vectorizing this loop on a machine with a 128-bit vector size
will treat "w" and "x" the same.  The 16 character loads from each array can
be combined into a single unaligned load of a 16-byte vector, and indeed this
is the code that the vectorizer produces.  It is only the stride *between*
successive vector loads that differs: one is constant and the other is
variable.

However, the vector cost model does NOT produce the same results for "w" and
"x".  For "w", the cost model represents the combined loads as a single
"unaligned_load" vect_cost_for_stmt.  For "x", the cost model represents them
as an "unaligned_load" followed by a "vec_construct".  This wrongly makes the
load from "x" look a good deal more expensive than the load from "w".

Now, how does this happen?  Prior to vectorization, the "b" loop is
completely unrolled, exposing the main computation (*) to SLP vectorization.
In tree-vect-slp.c:vect_analyze_slp_cost_1, we find that for "x" (but not
"w") STMT_VINFO_STRIDED_P (stmt_info) is TRUE, and this causes us to assume a
memory access type of VMAT_STRIDED_SLP.  That value is passed to
tree-vect-stmts.c:vect_model_load_cost, which first records the cost of the
load and then (because of the memory access type VMAT_STRIDED_SLP) also
records a vec_construct cost.

This would be appropriate if the loads making up the vectorized access did
not already form a full vector, i.e., if individual pieces had to be loaded
with a strided access and then assembled into a vector.  For example, if we
loaded 4 characters at a time and then skipped ahead by an unknown stride to
get the next 4, a strided load of four 32-bit elements would indeed require a
vec_construct.  But that's not the case here.

Going back a bit further, STMT_VINFO_STRIDED_P (stmt_info) is set to TRUE in
tree-vect-data-refs.c:vect_analyze_data_refs, because the group of data
references appears within a loop and the DR_STEP for the group is not a
constant.  Note that this analysis is done for the outer loop, but its result
is consumed when doing SLP vectorization on the unrolled inner loop (that is,
on the body of the outer loop).

Looking at tree-vect-stmts.c:vectorizable_load, we don't actually generate
the constructor unless the number of loads we're going to generate is greater
than 1, so we're missing the corresponding logic in the analysis phase.  I'm
working on a patch.
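
For what it's worth, here is a rough C sketch, using GCC's generic vector
extensions, of what the vectorized "a" loop body amounts to.  The names
"load16" and "one_iteration" are mine, purely for illustration, and this is
not vectorizer output.  The point is that "w" and "x" each need exactly one
unaligned full-vector load per iteration; only the stride between successive
vector loads differs, so neither access should be charged a vec_construct:

#include <stdlib.h>

typedef unsigned char v16qi __attribute__ ((vector_size (16)));

/* Unaligned 16-byte load; memcpy sidesteps any alignment assumption.  */
static inline v16qi
load16 (const unsigned char *p)
{
  v16qi v;
  __builtin_memcpy (&v, p, sizeof v);
  return v;
}

/* Rough shape of one iteration of the vectorized "a" loop.  */
int
one_iteration (unsigned char **w, unsigned char **x, int j)
{
  v16qi vw = load16 (*w);         /* unaligned_load, constant step 16  */
  v16qi vx = load16 (*x);         /* unaligned_load, variable step j   */
  int tot = 0;
  for (int b = 0; b < 16; b++)    /* reduction kept scalar for brevity */
    tot += abs (vw[b] - vx[b]);
  *w += 16;                       /* constant stride between vectors   */
  *x += j;                        /* variable stride between vectors   */
  return tot;
}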
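
For contrast, here is a sketch of the genuinely strided situation described
above, where four characters are loaded at a time and the pointer then jumps
ahead by an unknown stride.  Again this is only illustrative C with made-up
names (the v16qi typedef is repeated so the snippet stands alone).  In this
case assembling the 16-byte vector from the four pieces is real extra work,
and charging a vec_construct on top of the loads would be justified:

typedef unsigned char v16qi __attribute__ ((vector_size (16)));

/* Gather four 4-byte pieces separated by an unknown stride and only then
   assemble them into a single 16-byte vector.  That final assembly is the
   work the vec_construct cost is meant to model; it simply isn't needed for
   the testcase above, where each access is already one full-vector load.  */
v16qi
strided_gather (const unsigned char *p, int stride)
{
  unsigned char buf[16];
  for (int chunk = 0; chunk < 4; chunk++)
    {
      __builtin_memcpy (buf + 4 * chunk, p, 4);  /* one 4-byte piece   */
      p += stride;                               /* unknown stride     */
    }
  v16qi v;
  __builtin_memcpy (&v, buf, sizeof v);          /* the vec_construct  */
  return v;
}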