https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70130
--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> ---
But that's ok - we are storing the same scalar element:

t.i:12:3: note: Load permutation 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t.i:12:3: note: no array mode for V8HI[16]
t.i:12:3: note: Final SLP tree for instance:
t.i:12:3: note: node
t.i:12:3: note:     stmt 0 pretmp_32->mprr_2[1][j_57][0] = _40;
t.i:12:3: note:     stmt 1 pretmp_32->mprr_2[1][j_57][1] = _40;
t.i:12:3: note:     stmt 2 pretmp_32->mprr_2[1][j_57][2] = _40;
t.i:12:3: note:     stmt 3 pretmp_32->mprr_2[1][j_57][3] = _40;
t.i:12:3: note:     stmt 4 pretmp_32->mprr_2[1][j_57][4] = _40;
t.i:12:3: note:     stmt 5 pretmp_32->mprr_2[1][j_57][5] = _40;
t.i:12:3: note:     stmt 6 pretmp_32->mprr_2[1][j_57][6] = _40;
t.i:12:3: note:     stmt 7 pretmp_32->mprr_2[1][j_57][7] = _40;
t.i:12:3: note:     stmt 8 pretmp_32->mprr_2[1][j_57][8] = _40;
t.i:12:3: note:     stmt 9 pretmp_32->mprr_2[1][j_57][9] = _40;
t.i:12:3: note:     stmt 10 pretmp_32->mprr_2[1][j_57][10] = _40;
t.i:12:3: note:     stmt 11 pretmp_32->mprr_2[1][j_57][11] = _40;
t.i:12:3: note:     stmt 12 pretmp_32->mprr_2[1][j_57][12] = _40;
t.i:12:3: note:     stmt 13 pretmp_32->mprr_2[1][j_57][13] = _40;
t.i:12:3: note:     stmt 14 pretmp_32->mprr_2[1][j_57][14] = _40;
t.i:12:3: note:     stmt 15 pretmp_32->mprr_2[1][j_57][15] = _40;
t.i:12:3: note: node
t.i:12:3: note:     stmt 0 _40 = (short int) _39;
t.i:12:3: note:     stmt 1 _40 = (short int) _39;
t.i:12:3: note:     stmt 2 _40 = (short int) _39;
t.i:12:3: note:     stmt 3 _40 = (short int) _39;
t.i:12:3: note:     stmt 4 _40 = (short int) _39;
t.i:12:3: note:     stmt 5 _40 = (short int) _39;
t.i:12:3: note:     stmt 6 _40 = (short int) _39;
t.i:12:3: note:     stmt 7 _40 = (short int) _39;
t.i:12:3: note:     stmt 8 _40 = (short int) _39;
t.i:12:3: note:     stmt 9 _40 = (short int) _39;
t.i:12:3: note:     stmt 10 _40 = (short int) _39;
t.i:12:3: note:     stmt 11 _40 = (short int) _39;
t.i:12:3: note:     stmt 12 _40 = (short int) _39;
t.i:12:3: note:     stmt 13 _40 = (short int) _39;
t.i:12:3: note:     stmt 14 _40 = (short int) _39;
t.i:12:3: note:     stmt 15 _40 = (short int) _39;
t.i:12:3: note: node
t.i:12:3: note:     stmt 0 _39 = *_38[1];
t.i:12:3: note:     stmt 1 _39 = *_38[1];
t.i:12:3: note:     stmt 2 _39 = *_38[1];
t.i:12:3: note:     stmt 3 _39 = *_38[1];
t.i:12:3: note:     stmt 4 _39 = *_38[1];
t.i:12:3: note:     stmt 5 _39 = *_38[1];
t.i:12:3: note:     stmt 6 _39 = *_38[1];
t.i:12:3: note:     stmt 7 _39 = *_38[1];
t.i:12:3: note:     stmt 8 _39 = *_38[1];
t.i:12:3: note:     stmt 9 _39 = *_38[1];
t.i:12:3: note:     stmt 10 _39 = *_38[1];
t.i:12:3: note:     stmt 11 _39 = *_38[1];
t.i:12:3: note:     stmt 12 _39 = *_38[1];
t.i:12:3: note:     stmt 13 _39 = *_38[1];
t.i:12:3: note:     stmt 14 _39 = *_38[1];
t.i:12:3: note:     stmt 15 _39 = *_38[1];

and

Creating dr for *_38[1]
analyze_innermost: success.
        base_address: s_9(D)
        offset from base address: 0
        constant offset from base address: 4
        step: 8
        aligned to: 128
        base_object: *s_9(D)
        Access function 0: 1
        Access function 1: {0B, +, 8}_1

this is an 'int' load accessing *_38[1] in the first and *_38[3] in the
second iteration.  So we load a v4si from *_38[1] and then advance by half
a vector: the load's byte offset is 4 + 8*i, which modulo the 16-byte
vector size alternates between 4 and 12, so the misalignment is not
loop-invariant.  With my patch we still have the
__builtin_altivec_mask_for_load computed once, but that's bogus as the
alignment of the access changes each iteration.

Before the cited rev. we probably used interleaving and not SLP to
vectorize this loop.  I think that for this special permutation using a
scalar load and splatting it would have been best, cost-wise.  For now we
first need to tell GCC that when using SLP the realign scheme doesn't work
in this case.
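For illustration, a hypothetical C reduction of the loop shape under
discussion (not the actual testcase from this PR; the struct layout and
names are invented to match the dump, and n <= 64 is assumed):

struct pic { short mprr_2[4][64][16]; };

void
foo (struct pic *p, int (*s)[2], int n)
{
  /* One scalar 'int' load per outer iteration; &s[j][1] sits at byte
     offset 4 + 8*j, so its misalignment wrt a 16-byte V4SI vector
     alternates between 4 and 12.  The inner loop splats the truncated
     value into 16 consecutive 'short's, matching the all-zero load
     permutation in the dump.  */
  for (int j = 0; j < n; j++)
    for (int k = 0; k < 16; k++)
      p->mprr_2[1][j][k] = (short) s[j][1];
}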
The following seems to work and ends up generating unaligned loads (even
with -mcpu=power7):

Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c   (revision 234970)
+++ gcc/tree-vect-data-refs.c   (working copy)
@@ -5983,10 +5983,19 @@ vect_supportable_dr_alignment (struct da
           || targetm.vectorize.builtin_mask_for_load ()))
     {
       tree vectype = STMT_VINFO_VECTYPE (stmt_info);
-      if ((nested_in_vect_loop
-           && (TREE_INT_CST_LOW (DR_STEP (dr))
-               != GET_MODE_SIZE (TYPE_MODE (vectype))))
-          || !loop_vinfo)
+
+      /* If we are doing SLP then the accesses need not have the
+         same alignment, instead it depends on the SLP group size.  */
+      if (loop_vinfo
+          && STMT_SLP_TYPE (stmt_info)
+          && (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
+              * GROUP_SIZE (vinfo_for_stmt (GROUP_FIRST_ELEMENT (stmt_info))))
+             % TYPE_VECTOR_SUBPARTS (vectype) != 0)
+        ;
+      else if (!loop_vinfo
+               || (nested_in_vect_loop
+                   && (TREE_INT_CST_LOW (DR_STEP (dr))
+                       != GET_MODE_SIZE (TYPE_MODE (vectype)))))
         return dr_explicit_realign;
       else
         return dr_explicit_realign_optimized;

So there's still the optimization opportunity to use single-element loads
plus a vector splat (which probably all targets support) for the case of
an SLP load using this kind of permutation - see the sketch at the end of
this comment.  I suppose we should open a new bug for that.

With -mcpu=power6 we fail to vectorize the loop (using interleaving would
require epilogue peeling, which isn't possible here).

Can you check the above?  Also whether it regresses anything in the
testsuite?
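As for the splat idea, a minimal sketch using GNU C vector extensions
rather than any target intrinsics (load_and_splat is a made-up helper,
not existing GCC code):

typedef short v8hi __attribute__ ((vector_size (16)));

/* Load one 'int', truncate it to 'short' and broadcast it to all eight
   lanes - which is all the load permutation above asks for; no vector
   load and hence no realignment is needed.  */
static inline v8hi
load_and_splat (const int (*s)[2])
{
  short t = (short) (*s)[1];   /* single scalar load, any alignment */
  return (v8hi) { t, t, t, t, t, t, t, t };
}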