https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-09-12
                 CC|                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC just applies the general interleaving strategy here, which for existing
groups can indeed be quite bad. And it gets worse because of the splitting,
which isn't exposed to the vectorizer.

In the end the GIMPLE IL explains more nicely what the vectorizer tries to do
-- extract even/odd, mult/add and then interleave high/low:

  vect_x_13.2_26 = MEM[base: _2, offset: 0B];
  vect_x_13.3_22 = MEM[base: _2, offset: 32B];
  vect_perm_even_21 = VEC_PERM_EXPR <vect_x_13.2_26, vect_x_13.3_22, { 0, 2, 4, 6 }>;
  vect_perm_odd_20 = VEC_PERM_EXPR <vect_x_13.2_26, vect_x_13.3_22, { 1, 3, 5, 7 }>;
  vect__7.4_19 = vect_perm_odd_20 * vect_perm_even_21;
  vect__8.5_18 = vect_perm_odd_20 + vect_perm_even_21;
  vect_inter_high_34 = VEC_PERM_EXPR <vect__7.4_19, vect__8.5_18, { 0, 4, 1, 5 }>;
  vect_inter_low_29 = VEC_PERM_EXPR <vect__7.4_19, vect__8.5_18, { 2, 6, 3, 7 }>;
  MEM[base: _2, offset: 0B] = vect_inter_high_34;
  MEM[base: _2, offset: 32B] = vect_inter_low_29;

I'm not sure what ends up messing things up here (I guess AVX256 doesn't have
full-width extract even/odd and interleave high/low ...).

It looks like with -mprefer-avx128 we never try the larger vector size (Oops?).
At least we figure out that vectorization isn't profitable.

So all this probably boils down to the costs of permutes not being modeled.
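
As a rough illustration of the permute sequence in the dump, here is a minimal scalar model in C. It is only a sketch under stated assumptions: vectors hold four doubles (the 32-byte offsets suggest this), and the source loop maps each (x, y) pair to (x * y, x + y), which is inferred from the dump rather than taken from the testcase. The names vec_perm, pair_mul_add_scalar and pair_mul_add_permuted are made up for this example; they are not GCC internals.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4 /* assumed: four doubles per vector (32-byte offsets in the dump) */

/* Scalar model of VEC_PERM_EXPR <a, b, sel>: selector values 0..VLEN-1
   pick from a, values VLEN..2*VLEN-1 pick from b. */
void vec_perm(const double *a, const double *b, const int *sel, double *out)
{
    for (size_t i = 0; i < VLEN; i++) {
        assert(sel[i] >= 0 && sel[i] < 2 * VLEN);
        out[i] = sel[i] < VLEN ? a[sel[i]] : b[sel[i] - VLEN];
    }
}

/* Hypothetical scalar reference: each (x, y) pair becomes (x * y, x + y). */
void pair_mul_add_scalar(double *p)
{
    for (size_t i = 0; i < 2 * VLEN; i += 2) {
        double x = p[i], y = p[i + 1];
        p[i] = y * x;
        p[i + 1] = y + x;
    }
}

/* Same computation, following the statement order of the GIMPLE dump. */
void pair_mul_add_permuted(double *p)
{
    const int sel_even[VLEN] = { 0, 2, 4, 6 };
    const int sel_odd[VLEN]  = { 1, 3, 5, 7 };
    const int sel_high[VLEN] = { 0, 4, 1, 5 };
    const int sel_low[VLEN]  = { 2, 6, 3, 7 };
    double even[VLEN], odd[VLEN], mul[VLEN], add[VLEN], high[VLEN], low[VLEN];

    /* Extract even/odd lanes from the two loaded vectors. */
    vec_perm(p, p + VLEN, sel_even, even);
    vec_perm(p, p + VLEN, sel_odd, odd);

    /* Element-wise multiply and add. */
    for (size_t i = 0; i < VLEN; i++) {
        mul[i] = odd[i] * even[i];
        add[i] = odd[i] + even[i];
    }

    /* Interleave the products and sums back into pair order. */
    vec_perm(mul, add, sel_high, high);
    vec_perm(mul, add, sel_low, low);

    /* Store both result vectors. */
    for (size_t i = 0; i < VLEN; i++) {
        p[i] = high[i];
        p[VLEN + i] = low[i];
    }
}
```

Calling both functions on copies of the same eight-element array should give identical results. The model makes the cost concern visible: four two-input permutes surround just two arithmetic operations, so the permutes dominate unless the target can do each one cheaply.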