https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70130

--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> ---
But that's ok - we are storing the same scalar element:

t.i:12:3: note: Load permutation 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t.i:12:3: note: no array mode for V8HI[16]
t.i:12:3: note: Final SLP tree for instance:
t.i:12:3: note: node
t.i:12:3: note:         stmt 0 pretmp_32->mprr_2[1][j_57][0] = _40;
t.i:12:3: note:         stmt 1 pretmp_32->mprr_2[1][j_57][1] = _40;
t.i:12:3: note:         stmt 2 pretmp_32->mprr_2[1][j_57][2] = _40;
t.i:12:3: note:         stmt 3 pretmp_32->mprr_2[1][j_57][3] = _40;
t.i:12:3: note:         stmt 4 pretmp_32->mprr_2[1][j_57][4] = _40;
t.i:12:3: note:         stmt 5 pretmp_32->mprr_2[1][j_57][5] = _40;
t.i:12:3: note:         stmt 6 pretmp_32->mprr_2[1][j_57][6] = _40;
t.i:12:3: note:         stmt 7 pretmp_32->mprr_2[1][j_57][7] = _40;
t.i:12:3: note:         stmt 8 pretmp_32->mprr_2[1][j_57][8] = _40;
t.i:12:3: note:         stmt 9 pretmp_32->mprr_2[1][j_57][9] = _40;
t.i:12:3: note:         stmt 10 pretmp_32->mprr_2[1][j_57][10] = _40;
t.i:12:3: note:         stmt 11 pretmp_32->mprr_2[1][j_57][11] = _40;
t.i:12:3: note:         stmt 12 pretmp_32->mprr_2[1][j_57][12] = _40;
t.i:12:3: note:         stmt 13 pretmp_32->mprr_2[1][j_57][13] = _40;
t.i:12:3: note:         stmt 14 pretmp_32->mprr_2[1][j_57][14] = _40;
t.i:12:3: note:         stmt 15 pretmp_32->mprr_2[1][j_57][15] = _40;
t.i:12:3: note: node
t.i:12:3: note:         stmt 0 _40 = (short int) _39;
t.i:12:3: note:         stmt 1 _40 = (short int) _39;
t.i:12:3: note:         stmt 2 _40 = (short int) _39;
t.i:12:3: note:         stmt 3 _40 = (short int) _39;
t.i:12:3: note:         stmt 4 _40 = (short int) _39;
t.i:12:3: note:         stmt 5 _40 = (short int) _39;
t.i:12:3: note:         stmt 6 _40 = (short int) _39;
t.i:12:3: note:         stmt 7 _40 = (short int) _39;
t.i:12:3: note:         stmt 8 _40 = (short int) _39;
t.i:12:3: note:         stmt 9 _40 = (short int) _39;
t.i:12:3: note:         stmt 10 _40 = (short int) _39;
t.i:12:3: note:         stmt 11 _40 = (short int) _39;
t.i:12:3: note:         stmt 12 _40 = (short int) _39;
t.i:12:3: note:         stmt 13 _40 = (short int) _39;
t.i:12:3: note:         stmt 14 _40 = (short int) _39;
t.i:12:3: note:         stmt 15 _40 = (short int) _39;
t.i:12:3: note: node
t.i:12:3: note:         stmt 0 _39 = *_38[1];
t.i:12:3: note:         stmt 1 _39 = *_38[1];
t.i:12:3: note:         stmt 2 _39 = *_38[1];
t.i:12:3: note:         stmt 3 _39 = *_38[1];
t.i:12:3: note:         stmt 4 _39 = *_38[1];
t.i:12:3: note:         stmt 5 _39 = *_38[1];
t.i:12:3: note:         stmt 6 _39 = *_38[1];
t.i:12:3: note:         stmt 7 _39 = *_38[1];
t.i:12:3: note:         stmt 8 _39 = *_38[1];
t.i:12:3: note:         stmt 9 _39 = *_38[1];
t.i:12:3: note:         stmt 10 _39 = *_38[1];
t.i:12:3: note:         stmt 11 _39 = *_38[1];
t.i:12:3: note:         stmt 12 _39 = *_38[1];
t.i:12:3: note:         stmt 13 _39 = *_38[1];
t.i:12:3: note:         stmt 14 _39 = *_38[1];
t.i:12:3: note:         stmt 15 _39 = *_38[1];

and

Creating dr for *_38[1]
analyze_innermost: success.
        base_address: s_9(D)
        offset from base address: 0
        constant offset from base address: 4
        step: 8
        aligned to: 128
        base_object: *s_9(D)
        Access function 0: 1
        Access function 1: {0B, +, 8}_1

this is an 'int' load accessing *_38[1] in the first and *_38[3] in the
second iteration.  So we load a v4si from *_38[1], and then advance
by half of a vector.  With my patch we still have the
__builtin_altivec_mask_for_load computed once, but that's bogus as the
alignment of the accesses changes each iteration.

Before the cited rev. we probably used interleaving and not SLP to vectorize
this loop.  I think that for this special permutation a scalar load plus
splat would have been best, cost-wise.

Now we first need to tell GCC that the re-align scheme doesn't work in
this case when using SLP.

The following seems to work and ends up generating unaligned loads (even with
-mcpu=power7):

Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c   (revision 234970)
+++ gcc/tree-vect-data-refs.c   (working copy)
@@ -5983,10 +5983,19 @@ vect_supportable_dr_alignment (struct da
              || targetm.vectorize.builtin_mask_for_load ()))
        {
          tree vectype = STMT_VINFO_VECTYPE (stmt_info);
-         if ((nested_in_vect_loop
-              && (TREE_INT_CST_LOW (DR_STEP (dr))
-                  != GET_MODE_SIZE (TYPE_MODE (vectype))))
-              || !loop_vinfo)
+
+         /* If we are doing SLP then the accesses need not have the
+            same alignment, instead it depends on the SLP group size.  */
+         if (loop_vinfo
+             && STMT_SLP_TYPE (stmt_info)
+             && (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
+                 * GROUP_SIZE (vinfo_for_stmt (GROUP_FIRST_ELEMENT (stmt_info))))
+                 % TYPE_VECTOR_SUBPARTS (vectype) != 0)
+           ;
+         else if (!loop_vinfo
+                  || (nested_in_vect_loop
+                      && (TREE_INT_CST_LOW (DR_STEP (dr))
+                          != GET_MODE_SIZE (TYPE_MODE (vectype)))))
            return dr_explicit_realign;
          else
            return dr_explicit_realign_optimized;


So there's still the optimization opportunity to use single-element loads
plus vector splat (which probably all targets support) for the case of an
SLP load using this kind of permutation.  I suppose I'll open a new bug for
that.

With -mcpu=power6 we fail to vectorize the loop (using interleaving would
require epilogue peeling which isn't possible here).

Can you check the above?  Also whether it regresses any of the testsuite?
