https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98516
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Meh, it's very hard to spot the actual problem :/

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index d8a2ceb0fa1..dee360307d0 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5058,8 +5059,7 @@ vect_slp_region (vec<basic_block> bbs, vec<data_reference_p> datarefs,
         bb_vinfo->shared->check_datarefs ();
       bb_vinfo->vector_mode = next_vector_mode;

-      if (vect_slp_analyze_bb_1 (bb_vinfo, n_stmts, fatal, dataref_groups)
-          && dbg_cnt (vect_slp))
+      if (vect_slp_analyze_bb_1 (bb_vinfo, n_stmts, fatal, dataref_groups))
         {
           if (dump_enabled_p ())
             {
@@ -5090,6 +5090,9 @@ vect_slp_region (vec<basic_block> bbs, vec<data_reference_p> datarefs,
                   continue;
                 }

+              if (!dbg_cnt (vect_slp))
+                continue;
+
               if (!vectorized && dump_enabled_p ())
                 dump_printf_loc (MSG_NOTE, vect_location,
                                  "Basic block will be vectorized "

helps to narrow down the bogus vectorization, -fdbg-cnt=vect_slp:2:2 triggers
it but the SLP region is quite big still.

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index d8a2ceb0fa1..dee360307d0 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -3310,6 +3310,7 @@ vect_optimize_slp (vec_info *vinfo)
   auto_vec<int> leafs;
   vect_slp_build_vertices (vinfo, vertices, leafs);

+#if 0
   struct graph *slpg = new_graph (vertices.length ());
   FOR_EACH_VEC_ELT (vertices, i, node)
     {
@@ -3619,7 +3620,7 @@ vect_optimize_slp (vec_info *vinfo)
   while (!perms.is_empty ())
     perms.pop ().release ();
   free_graph (slpg);
-
+#endif

   /* Now elide load permutations that are not necessary.  */
   for (i = 0; i < leafs.length (); ++i)

avoids the miscompilation.  The key transform we're doing is eliding
load permutations that swap real/imag parts and instead adjust the
lane permutation of a blend created for plus/minus ops which is where the
bug is I think.

We're changing

t.C:80:7: note:   node 0x4204018 (max_nunits=2, refcnt=1)
t.C:80:7: note:   op: VEC_PERM_EXPR
t.C:80:7: note:         stmt 0 _37 = _35 - _36;
t.C:80:7: note:         stmt 1 _34 = _32 + _33;
t.C:80:7: note:         lane permutation { 0[0] 1[1] }
t.C:80:7: note:         children 0x42045f0 0x4204678

to

t.C:80:7: note:   node 0x4207018 (max_nunits=2, refcnt=1)
t.C:80:7: note:   op: VEC_PERM_EXPR
t.C:80:7: note:         stmt 0 _37 = _35 - _36;
t.C:80:7: note:         stmt 1 _34 = _32 + _33;
t.C:80:7: note:         lane permutation { 1[1] 0[0] }
t.C:80:7: note:         children 0x42075f0 0x4207678

but that's not what is necessary - we have permuted the lanes of the
children, but permuting the blend will not materialize properly; instead
we need to generate { 0[1] 1[0] } I think.

I'm trying to create a simpler C testcase now.
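
To make the lane-permutation argument above concrete, here is a toy model in plain C. It is only a sketch, not GCC code. It assumes that a dump entry `c[l]` means "take lane l of child c", that child 0 is the all-minus node and child 1 the all-plus node, and that eliding the load permutation leaves both children with their two lanes swapped. The names `lane_sel`, `apply_lane_perm`, `build_children` and `blend_matches_original` are made up for illustration.

```c
#define NLANES 2

/* One lane-permutation entry; the dump prints it as child[lane].  */
struct lane_sel { int child; int lane; };

/* Model of the blend: out[i] = children[perm[i].child][perm[i].lane].  */
static void
apply_lane_perm (const struct lane_sel perm[NLANES],
                 int children[2][NLANES], int out[NLANES])
{
  for (int i = 0; i < NLANES; ++i)
    out[i] = children[perm[i].child][perm[i].lane];
}

/* Hypothetical helper: child 0 computes a - b, child 1 computes a + b.
   If SWAPPED is nonzero the two lanes of each child are exchanged,
   modelling children whose load permutation has been elided.  */
static void
build_children (const int a[NLANES], const int b[NLANES], int swapped,
                int children[2][NLANES])
{
  for (int i = 0; i < NLANES; ++i)
    {
      int src = swapped ? NLANES - 1 - i : i;
      children[0][i] = a[src] - b[src];
      children[1][i] = a[src] + b[src];
    }
}

/* Return 1 if PERM applied to the lane-swapped children produces the
   same vector as the original { 0[0] 1[1] } blend on unswapped children.  */
static int
blend_matches_original (const struct lane_sel perm[NLANES])
{
  const int a[NLANES] = { 10, 20 }, b[NLANES] = { 1, 2 };
  const struct lane_sel orig_perm[NLANES] = { { 0, 0 }, { 1, 1 } };
  int orig[2][NLANES], swapped[2][NLANES];
  int want[NLANES], got[NLANES];

  build_children (a, b, 0, orig);
  build_children (a, b, 1, swapped);
  apply_lane_perm (orig_perm, orig, want);
  apply_lane_perm (perm, swapped, got);
  return got[0] == want[0] && got[1] == want[1];
}
```

Under these assumptions the original blend yields { 9, 22 } for a = { 10, 20 } and b = { 1, 2 }. Applying { 1[1] 0[0] } to the swapped children yields { 11, 18 }, so plus and minus end up in the wrong lanes, while { 0[1] 1[0] } yields { 9, 22 } again.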