https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70130
--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> ---
But that's ok - we are storing the same scalar element:

t.i:12:3: note: Load permutation 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t.i:12:3: note: no array mode for V8HI[16]
t.i:12:3: note: Final SLP tree for instance:
t.i:12:3: note: node
t.i:12:3: note:     stmt 0 pretmp_32->mprr_2[1][j_57][0] = _40;
t.i:12:3: note:     stmt 1 pretmp_32->mprr_2[1][j_57][1] = _40;
t.i:12:3: note:     stmt 2 pretmp_32->mprr_2[1][j_57][2] = _40;
t.i:12:3: note:     stmt 3 pretmp_32->mprr_2[1][j_57][3] = _40;
t.i:12:3: note:     stmt 4 pretmp_32->mprr_2[1][j_57][4] = _40;
t.i:12:3: note:     stmt 5 pretmp_32->mprr_2[1][j_57][5] = _40;
t.i:12:3: note:     stmt 6 pretmp_32->mprr_2[1][j_57][6] = _40;
t.i:12:3: note:     stmt 7 pretmp_32->mprr_2[1][j_57][7] = _40;
t.i:12:3: note:     stmt 8 pretmp_32->mprr_2[1][j_57][8] = _40;
t.i:12:3: note:     stmt 9 pretmp_32->mprr_2[1][j_57][9] = _40;
t.i:12:3: note:     stmt 10 pretmp_32->mprr_2[1][j_57][10] = _40;
t.i:12:3: note:     stmt 11 pretmp_32->mprr_2[1][j_57][11] = _40;
t.i:12:3: note:     stmt 12 pretmp_32->mprr_2[1][j_57][12] = _40;
t.i:12:3: note:     stmt 13 pretmp_32->mprr_2[1][j_57][13] = _40;
t.i:12:3: note:     stmt 14 pretmp_32->mprr_2[1][j_57][14] = _40;
t.i:12:3: note:     stmt 15 pretmp_32->mprr_2[1][j_57][15] = _40;
t.i:12:3: note: node
t.i:12:3: note:     stmt 0 _40 = (short int) _39;
t.i:12:3: note:     stmt 1 _40 = (short int) _39;
t.i:12:3: note:     stmt 2 _40 = (short int) _39;
t.i:12:3: note:     stmt 3 _40 = (short int) _39;
t.i:12:3: note:     stmt 4 _40 = (short int) _39;
t.i:12:3: note:     stmt 5 _40 = (short int) _39;
t.i:12:3: note:     stmt 6 _40 = (short int) _39;
t.i:12:3: note:     stmt 7 _40 = (short int) _39;
t.i:12:3: note:     stmt 8 _40 = (short int) _39;
t.i:12:3: note:     stmt 9 _40 = (short int) _39;
t.i:12:3: note:     stmt 10 _40 = (short int) _39;
t.i:12:3: note:     stmt 11 _40 = (short int) _39;
t.i:12:3: note:     stmt 12 _40 = (short int) _39;
t.i:12:3: note:     stmt 13 _40 = (short int) _39;
t.i:12:3: note:     stmt 14 _40 = (short int) _39;
t.i:12:3: note:     stmt 15 _40 = (short int) _39;
t.i:12:3: note: node
t.i:12:3: note:     stmt 0 _39 = *_38[1];
t.i:12:3: note:     stmt 1 _39 = *_38[1];
t.i:12:3: note:     stmt 2 _39 = *_38[1];
t.i:12:3: note:     stmt 3 _39 = *_38[1];
t.i:12:3: note:     stmt 4 _39 = *_38[1];
t.i:12:3: note:     stmt 5 _39 = *_38[1];
t.i:12:3: note:     stmt 6 _39 = *_38[1];
t.i:12:3: note:     stmt 7 _39 = *_38[1];
t.i:12:3: note:     stmt 8 _39 = *_38[1];
t.i:12:3: note:     stmt 9 _39 = *_38[1];
t.i:12:3: note:     stmt 10 _39 = *_38[1];
t.i:12:3: note:     stmt 11 _39 = *_38[1];
t.i:12:3: note:     stmt 12 _39 = *_38[1];
t.i:12:3: note:     stmt 13 _39 = *_38[1];
t.i:12:3: note:     stmt 14 _39 = *_38[1];
t.i:12:3: note:     stmt 15 _39 = *_38[1];

and

Creating dr for *_38[1]
analyze_innermost: success.
        base_address: s_9(D)
        offset from base address: 0
        constant offset from base address: 4
        step: 8
        aligned to: 128
        base_object: *s_9(D)
        Access function 0: 1
        Access function 1: {0B, +, 8}_1

this is an 'int' load accessing *_38[1] in the first and *_38[3] in the
second iteration.  So we load a v4si from *_38[1] and then advance by half
a vector: the load's byte offset is 4 + 8*i, which modulo the 16-byte
vector size alternates between 4 and 12, so the misalignment is not
loop-invariant.  With my patch we still have the
__builtin_altivec_mask_for_load computed once, but that's bogus as the
alignment of the access changes each iteration.

Before the cited rev. we probably used interleaving and not SLP to
vectorize this loop.  I think that for this special permutation using a
scalar load and splatting it would have been best, cost-wise.  For now we
first need to tell GCC that when using SLP the realign scheme doesn't work
in this case.
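For illustration, a hypothetical C reduction of the loop shape under
discussion (not the actual testcase from this PR; the struct layout and
names are invented to match the dump, and n <= 64 is assumed):

struct pic { short mprr_2[4][64][16]; };

void
foo (struct pic *p, int (*s)[2], int n)
{
  /* One scalar 'int' load per outer iteration; &s[j][1] sits at byte
     offset 4 + 8*j, so its misalignment wrt a 16-byte V4SI vector
     alternates between 4 and 12.  The inner loop splats the truncated
     value into 16 consecutive 'short's, matching the all-zero load
     permutation in the dump.  */
  for (int j = 0; j < n; j++)
    for (int k = 0; k < 16; k++)
      p->mprr_2[1][j][k] = (short) s[j][1];
}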
The following seems to work and ends up generating unaligned loads (even
with -mcpu=power7):

Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c   (revision 234970)
+++ gcc/tree-vect-data-refs.c   (working copy)
@@ -5983,10 +5983,19 @@ vect_supportable_dr_alignment (struct da
           || targetm.vectorize.builtin_mask_for_load ()))
     {
       tree vectype = STMT_VINFO_VECTYPE (stmt_info);
-      if ((nested_in_vect_loop
-           && (TREE_INT_CST_LOW (DR_STEP (dr))
-               != GET_MODE_SIZE (TYPE_MODE (vectype))))
-          || !loop_vinfo)
+
+      /* If we are doing SLP then the accesses need not have the
+         same alignment, instead it depends on the SLP group size.  */
+      if (loop_vinfo
+          && STMT_SLP_TYPE (stmt_info)
+          && (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
+              * GROUP_SIZE (vinfo_for_stmt (GROUP_FIRST_ELEMENT (stmt_info))))
+             % TYPE_VECTOR_SUBPARTS (vectype) != 0)
+        ;
+      else if (!loop_vinfo
+               || (nested_in_vect_loop
+                   && (TREE_INT_CST_LOW (DR_STEP (dr))
+                       != GET_MODE_SIZE (TYPE_MODE (vectype)))))
         return dr_explicit_realign;
       else
         return dr_explicit_realign_optimized;

So there's still the optimization opportunity to use single-element loads
plus a vector splat (which probably all targets support) for the case of
an SLP load using this kind of permutation - see the sketch at the end of
this comment.  I suppose we should open a new bug for that.

With -mcpu=power6 we fail to vectorize the loop (using interleaving would
require epilogue peeling, which isn't possible here).

Can you check the above?  Also whether it regresses anything in the
testsuite?
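As for the splat idea, a minimal sketch using GNU C vector extensions
rather than any target intrinsics (load_and_splat is a made-up helper,
not existing GCC code):

typedef short v8hi __attribute__ ((vector_size (16)));

/* Load one 'int', truncate it to 'short' and broadcast it to all eight
   lanes - which is all the load permutation above asks for; no vector
   load and hence no realignment is needed.  */
static inline v8hi
load_and_splat (const int (*s)[2])
{
  short t = (short) (*s)[1];   /* single scalar load, any alignment */
  return (v8hi) { t, t, t, t, t, t, t, t };
}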