https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70130
--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> ---
But that's ok - we are storing the same scalar element:
t.i:12:3: note: Load permutation 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t.i:12:3: note: no array mode for V8HI[16]
t.i:12:3: note: Final SLP tree for instance:
t.i:12:3: note: node
t.i:12:3: note: stmt 0 pretmp_32->mprr_2[1][j_57][0] = _40;
t.i:12:3: note: stmt 1 pretmp_32->mprr_2[1][j_57][1] = _40;
t.i:12:3: note: stmt 2 pretmp_32->mprr_2[1][j_57][2] = _40;
t.i:12:3: note: stmt 3 pretmp_32->mprr_2[1][j_57][3] = _40;
t.i:12:3: note: stmt 4 pretmp_32->mprr_2[1][j_57][4] = _40;
t.i:12:3: note: stmt 5 pretmp_32->mprr_2[1][j_57][5] = _40;
t.i:12:3: note: stmt 6 pretmp_32->mprr_2[1][j_57][6] = _40;
t.i:12:3: note: stmt 7 pretmp_32->mprr_2[1][j_57][7] = _40;
t.i:12:3: note: stmt 8 pretmp_32->mprr_2[1][j_57][8] = _40;
t.i:12:3: note: stmt 9 pretmp_32->mprr_2[1][j_57][9] = _40;
t.i:12:3: note: stmt 10 pretmp_32->mprr_2[1][j_57][10] = _40;
t.i:12:3: note: stmt 11 pretmp_32->mprr_2[1][j_57][11] = _40;
t.i:12:3: note: stmt 12 pretmp_32->mprr_2[1][j_57][12] = _40;
t.i:12:3: note: stmt 13 pretmp_32->mprr_2[1][j_57][13] = _40;
t.i:12:3: note: stmt 14 pretmp_32->mprr_2[1][j_57][14] = _40;
t.i:12:3: note: stmt 15 pretmp_32->mprr_2[1][j_57][15] = _40;
t.i:12:3: note: node
t.i:12:3: note: stmt 0 _40 = (short int) _39;
t.i:12:3: note: stmt 1 _40 = (short int) _39;
t.i:12:3: note: stmt 2 _40 = (short int) _39;
t.i:12:3: note: stmt 3 _40 = (short int) _39;
t.i:12:3: note: stmt 4 _40 = (short int) _39;
t.i:12:3: note: stmt 5 _40 = (short int) _39;
t.i:12:3: note: stmt 6 _40 = (short int) _39;
t.i:12:3: note: stmt 7 _40 = (short int) _39;
t.i:12:3: note: stmt 8 _40 = (short int) _39;
t.i:12:3: note: stmt 9 _40 = (short int) _39;
t.i:12:3: note: stmt 10 _40 = (short int) _39;
t.i:12:3: note: stmt 11 _40 = (short int) _39;
t.i:12:3: note: stmt 12 _40 = (short int) _39;
t.i:12:3: note: stmt 13 _40 = (short int) _39;
t.i:12:3: note: stmt 14 _40 = (short int) _39;
t.i:12:3: note: stmt 15 _40 = (short int) _39;
t.i:12:3: note: node
t.i:12:3: note: stmt 0 _39 = *_38[1];
t.i:12:3: note: stmt 1 _39 = *_38[1];
t.i:12:3: note: stmt 2 _39 = *_38[1];
t.i:12:3: note: stmt 3 _39 = *_38[1];
t.i:12:3: note: stmt 4 _39 = *_38[1];
t.i:12:3: note: stmt 5 _39 = *_38[1];
t.i:12:3: note: stmt 6 _39 = *_38[1];
t.i:12:3: note: stmt 7 _39 = *_38[1];
t.i:12:3: note: stmt 8 _39 = *_38[1];
t.i:12:3: note: stmt 9 _39 = *_38[1];
t.i:12:3: note: stmt 10 _39 = *_38[1];
t.i:12:3: note: stmt 11 _39 = *_38[1];
t.i:12:3: note: stmt 12 _39 = *_38[1];
t.i:12:3: note: stmt 13 _39 = *_38[1];
t.i:12:3: note: stmt 14 _39 = *_38[1];
t.i:12:3: note: stmt 15 _39 = *_38[1];
and
Creating dr for *_38[1]
analyze_innermost: success.
base_address: s_9(D)
offset from base address: 0
constant offset from base address: 4
step: 8
aligned to: 128
base_object: *s_9(D)
Access function 0: 1
Access function 1: {0B, +, 8}_1
this is an 'int' load accessing *_38[1] in the first and *_38[3] in the
second iteration. So we load a v4si from *_38[1] and then advance
by half a vector. With my patch we still compute
__builtin_altivec_mask_for_load once, but that's bogus, as the
alignment of the accesses changes each iteration.
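The alternation is easy to see from the data-reference dump above (constant offset 4, step 8): relative to a 16-byte vector the misalignment flips between 4 and 12 every iteration, so a realignment mask computed once cannot be right. A minimal sketch of that arithmetic:

```c
/* Misalignment of the access in iteration ITER, using the values from
   the DR dump above: constant offset 4 from the base, step 8, measured
   against a 16-byte (v4si) vector.  Alternates 4, 12, 4, 12, ...  */
unsigned misalign (unsigned iter)
{
  return (4 + 8 * iter) % 16;
}
```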
Before the cited revision we probably used interleaving rather than SLP
to vectorize this loop. I think that for this special permutation a
scalar load plus a vector splat would have been best, cost-wise.
For now, though, we first need to tell GCC that when using SLP the
realign scheme doesn't work in this case.
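For reference, a hypothetical reconstruction of the loop shape behind the dump above (names follow the dump, but the surrounding types and bounds are guesses, not the actual testcase): each iteration loads one int through a pointer advancing by 8 bytes, narrows it to short, and splats it across 16 elements, which is what produces the all-same-element load permutation.

```c
/* Hypothetical sketch of the vectorized loop: one int load per
   iteration (step 8 bytes == 2 ints, element [1]), narrowed to short
   and stored 16 times -> the "0 0 0 ... 0" load permutation above.  */
void splat_rows (short mprr[][16], const int *s, int n)
{
  for (int j = 0; j < n; j++)
    {
      short v = (short) s[2 * j + 1];
      for (int k = 0; k < 16; k++)
        mprr[j][k] = v;
    }
}
```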
The following seems to work and ends up generating unaligned loads (even with
-mcpu=power7):
Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c (revision 234970)
+++ gcc/tree-vect-data-refs.c (working copy)
@@ -5983,10 +5983,19 @@ vect_supportable_dr_alignment (struct da
|| targetm.vectorize.builtin_mask_for_load ()))
{
tree vectype = STMT_VINFO_VECTYPE (stmt_info);
- if ((nested_in_vect_loop
- && (TREE_INT_CST_LOW (DR_STEP (dr))
- != GET_MODE_SIZE (TYPE_MODE (vectype))))
- || !loop_vinfo)
+
+ /* If we are doing SLP then the accesses need not have the
+ same alignment, instead it depends on the SLP group size. */
+ if (loop_vinfo
+ && STMT_SLP_TYPE (stmt_info)
+ && (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
+ * GROUP_SIZE (vinfo_for_stmt (GROUP_FIRST_ELEMENT (stmt_info))))
+ % TYPE_VECTOR_SUBPARTS (vectype) != 0)
+ ;
+ else if (!loop_vinfo
+ || (nested_in_vect_loop
+ && (TREE_INT_CST_LOW (DR_STEP (dr))
+ != GET_MODE_SIZE (TYPE_MODE (vectype)))))
return dr_explicit_realign;
else
return dr_explicit_realign_optimized;
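The guard the patch adds can be restated as plain arithmetic: the realign scheme only fits when each SLP instance covers whole vectors, i.e. when VF times the group size is a multiple of the number of vector lanes. A sketch with illustrative values (not taken from the dump):

```c
/* Restatement of the patch's condition: the (optimized) realign scheme
   is only usable when VF * group_size covers whole vectors, i.e. is a
   multiple of TYPE_VECTOR_SUBPARTS.  Values are illustrative only.  */
int realign_ok (int vf, int group_size, int subparts)
{
  return (vf * group_size) % subparts == 0;
}
```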
So there's still the optimization opportunity to use single-element loads
plus a vector splat (which probably all targets support) for the case of an
SLP load using this kind of permutation. I suppose we should open a new bug
for that.
With -mcpu=power6 we fail to vectorize the loop (using interleaving would
require epilogue peeling, which isn't possible here).
Can you check the above? Also whether it regresses anything in the testsuite?