https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102789
Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bergner at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org,
                   |                            |wschmidt at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #5 from Kewen Lin <linkw at gcc dot gnu.org> ---
As Jakub noted, r12-4340 just exposed a latent bug: even without r12-4340, the
issue still shows up with -fvect-cost-model=dynamic. The key is whether
peeling for alignment is done in the prologue:

  unsigned max_allowed_peel = param_vect_max_peeling_for_alignment;
  if (flag_vect_cost_model <= VECT_COST_MODEL_CHEAP)
    max_allowed_peel = 0;

Using --param vect-max-peeling-for-alignment=14 disables the peeling and the
test passes.

I think this is a bug in the vectorizer. I reduced the culprit loop to (also
moving the first loop out of the function):

  for (i = n; i < o; i++)
    {
      k += m + 1;
      t = k + p[i];
      s2 += t;
      c[i]++;
    }

We have some temporary storage for the omp clauses, such as:

  int D.3802[16]; // for k
  int D.3800[16]; // for s2
  int D.3799[16]; // for t

After the peeling (one prologue), the addresses of k, s2 and t become:

  _187 = prolog_loop_niters.27_88 * 4;
  vectp.37_186 = &D.3802 + _187;
  _213 = prolog_loop_niters.27_88 * 4;
  vectp.46_212 = &D.3799 + _213;
  _222 = prolog_loop_niters.27_88 * 4;
  vectp.48_221 = &D.3800 + _222;

and the main vectorized loop body then acts on the biased addresses, which is
wrong:

  vect__61.49_223 = MEM <vector(4) int> [(int *)vectp.48_221];
  vectp.48_224 = vectp.48_221 + 16;
  vect__61.50_225 = MEM <vector(4) int> [(int *)vectp.48_224];
  vectp.48_226 = vectp.48_221 + 32;
  vect__61.51_227 = MEM <vector(4) int> [(int *)vectp.48_226];
  vectp.48_228 = vectp.48_221 + 48;
  vect__61.52_229 = MEM <vector(4) int> [(int *)vectp.48_228];
  _61 = D.3800[_56];
  vect__62.53_230 = vect__59.44_208 + vect__61.49_223;
  vect__62.53_231 = vect__59.44_209 + vect__61.50_225;
  vect__62.53_232 = vect__59.44_210 + vect__61.51_227;
  vect__62.53_233 = vect__59.44_211 + vect__61.52_229;
  _62 = _59 + _61;
  MEM <vector(4) int> [(int *)vectp.55_234] = vect__62.53_230;
  vectp.55_237 = vectp.55_234 + 16;
  MEM <vector(4) int> [(int *)vectp.55_237] = vect__62.53_231;
  vectp.55_239 = vectp.55_234 + 32;
  MEM <vector(4) int> [(int *)vectp.55_239] = vect__62.53_232;
  vectp.55_241 = vectp.55_234 + 48;
  MEM <vector(4) int> [(int *)vectp.55_241] = vect__62.53_233;

A fix looks to be avoiding the address biasing for these kinds of DRs, that
is, the omp-clause-specific storage. These DRs are mainly used in the main
vectorized loop (one slot per lane); for this case it's a reduction: the
prologue uses element 0, and the epilogue uses either the last element or the
reduc_op over all elements, according to the clause type. The small fix below
makes it pass:

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 4988c93fdb6..a447f457f93 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1820,7 +1820,7 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
   FOR_EACH_VEC_ELT (datarefs, i, dr)
     {
       dr_vec_info *dr_info = loop_vinfo->lookup_dr (dr);
-      if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt))
+      if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt) && !STMT_VINFO_SIMD_LANE_ACCESS_P (dr_info->stmt))
 	vect_update_init_of_dr (dr_info, niters, code);
     }
 }

I've not looked into the meaning of the different values (1,2,3,4) of
STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info); presumably they correspond to the
different omp clauses? The assumption behind the above fix is that for all
cases with STMT_VINFO_SIMD_LANE_ACCESS_P > 0, the related DR is used mainly
in the vectorized loop body, so no update is needed for it in the prologue.
I'm going to do a broader round of testing to see whether we need more
restrictions on that.