https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138
--- Comment #4 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #3)
> 
> IIUC, in the current implementation, we get four grouped stores:
>   { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] } /i=0,1,2,3/ independently
> 
> When all these attempts fail, could we retry with the groups
>   { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] } /i=0,1,2,3/
> with one extra intermediate layer covering those original groups, then start
> from these newly adjusted groups? The built operands should be isomorphic
> then. Maybe too hackish?

This thought seems impractical in the current framework.

I was wondering whether we can use another vectorization strategy which is
sub-optimal but fits the current framework better. It focuses only on the
statements within one loop iteration and starts from the grouped stores
{ tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }. It can be:

  w1_0123 = load_word(p1)
  V1_0123 = scalar_to_vec(w1_0123)     // also with promotion
  w1_4567 = load_word(p1+4)
  V1_4567 = scalar_to_vec(w1_4567)
  w2_0123 = load_word(p2)
  V2_0123 = scalar_to_vec(w2_0123)
  w2_4567 = load_word(p2+4)
  V2_4567 = scalar_to_vec(w2_4567)

  V_a0123 = (V1_0123 - V2_0123) + ((V1_4567 - V2_4567) << 16)

  V1 = V_a0123                         // {a0, a1, a2, a3}
  V2 = vperm (V1, <1, 0, 3, 2>)        // {a1, a0, a3, a2}
  V3 = V1 - V2                         // {t1, -t1, t3, -t3}
  V4 = V1 + V2                         // {t0, t0, t2, t2}
  V5 = vperm (V3, V4, <4, 0, 6, 2>)    // {t0, t1, t2, t3}
  V6 = vperm (V3, V4, <6, 2, 4, 0>)    // {t2, t3, t0, t1}
  V7 = V5 - V6                         // {t0 - t2, t1 - t3, t2 - t0, t3 - t1}
  V8 = V5 + V6                         // {t0 + t2, t1 + t3, t2 + t0, t3 + t1}
  V9 = vperm (V7, V8, <4, 5, 0, 1>)
  vec_store (tmp[i], V9)

(A compilable sketch of the V1..V9 sequence is at the end of this comment.)

One simplified case can be:

extern void test(unsigned int t[4]);

void foo(int *p1, int i1, int *p2, int i2)
{
  unsigned int tmp[4];
  unsigned int a0, a1, a2, a3;

  a0 = (p1[0] - p2[0]);
  a1 = (p1[1] - p2[1]);
  a2 = (p1[2] - p2[2]);
  a3 = (p1[3] - p2[3]);

  int t0 = a0 + a1;
  int t1 = a0 - a1;
  int t2 = a2 + a3;
  int t3 = a2 - a3;

  tmp[0] = t0 + t2;
  tmp[2] = t0 - t2;
  tmp[1] = t1 + t3;
  tmp[3] = t1 - t3;

  test(tmp);
}

Currently we still can't vectorize even this simple case. SLP discovery
doesn't go deep enough to expose the grouped-load opportunity at the bottom
nodes loading from p1/p2. It stops early at the nodes { _13, _14, _13, _14 }
and { _15, _16, _15, _16 } (i.e. the {t0, t1, t0, t1} and {t2, t3, t2, t3}
operand sets of the stores), because their own operands, such as
{a0_28, a0_28, a0_28, a0_28}, satisfy the all_uniform_p condition, so we
fall back to "Building parent vector operands from scalars instead".

One rough idea is:

1) Relax the all_uniform_p condition somehow so that SLP instance building
goes deeper and builds those p1/p2 loads as SLP nodes (a toy sketch of what
this condition checks also follows at the end of this comment).

2) Introduce one more vect_pattern recognizer to catch this kind of pattern
and transform the SLP instance as we expect. Since the whole SLP instance is
visible at that point, we can transform it as we want. This pattern matching
probably needs to be gated by some costing condition.

3) If 2) fails, trim the SLP instance back at the nodes which satisfy the
all_uniform_p condition, so that the result is the same as before.

Does it sound reasonable?
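
For reference, here is a minimal compilable sketch of the V1..V9 sequence
for the simplified case, written with GCC's generic vector extensions
(__builtin_shuffle). It is only meant to show that the lane arithmetic works
out; the foo_vec name and the memcpy-based store are mine, not anything the
vectorizer would literally emit:

typedef unsigned int v4si __attribute__ ((vector_size (16)));

extern void test (unsigned int t[4]);

/* Hand-vectorized form of the simplified foo above (sketch only).  */
void
foo_vec (int *p1, int *p2)
{
  v4si v1 = { p1[0] - p2[0], p1[1] - p2[1],
              p1[2] - p2[2], p1[3] - p2[3] };                  /* {a0, a1, a2, a3} */
  v4si v2 = __builtin_shuffle (v1, (v4si) { 1, 0, 3, 2 });     /* {a1, a0, a3, a2} */
  v4si v3 = v1 - v2;                                           /* {t1, -t1, t3, -t3} */
  v4si v4 = v1 + v2;                                           /* {t0, t0, t2, t2} */
  v4si v5 = __builtin_shuffle (v3, v4, (v4si) { 4, 0, 6, 2 }); /* {t0, t1, t2, t3} */
  v4si v6 = __builtin_shuffle (v3, v4, (v4si) { 6, 2, 4, 0 }); /* {t2, t3, t0, t1} */
  v4si v7 = v5 - v6;                                           /* {t0-t2, t1-t3, ...} */
  v4si v8 = v5 + v6;                                           /* {t0+t2, t1+t3, ...} */
  v4si v9 = __builtin_shuffle (v7, v8, (v4si) { 4, 5, 0, 1 }); /* tmp[0..3] in order */
  unsigned int tmp[4];
  __builtin_memcpy (tmp, &v9, sizeof tmp);
  test (tmp);
}

Each vperm in the pseudocode maps one-to-one to a __builtin_shuffle here,
which makes it straightforward to check that v9 ends up holding
{t0 + t2, t1 + t3, t0 - t2, t1 - t3}, i.e. tmp[0..3] in order.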
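
And to make the condition in 1) concrete: it is essentially "every lane of
the operand set is the same scalar". A toy standalone illustration (not the
actual all_uniform_p code in tree-vect-slp.c, which works on SLP operand
lanes rather than plain ints):

#include <stdbool.h>

/* Sketch of the "all lanes uniform" test discussed above; lanes stands
   in for the SSA operands of one SLP child, e.g. {a0, a0, a0, a0}.
   When this holds, SLP discovery gives up on building a deeper node and
   builds the vector operand from the scalar instead.  */
static bool
lanes_all_uniform_p (const int *lanes, unsigned nlanes)
{
  for (unsigned i = 1; i < nlanes; i++)
    if (lanes[i] != lanes[0])
      return false;
  return true;
}

Relaxing 1) would mean not taking this early-out when such a uniform operand
still has a profitable subtree below it, here the p1/p2 loads feeding a0..a3.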