https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102789
Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bergner at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org,
                   |                            |wschmidt at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #5 from Kewen Lin <linkw at gcc dot gnu.org> ---
As Jakub noted, r12-4340 just exposed a latent bug: even without r12-4340, the
issue still shows up with -fvect-cost-model=dynamic. The key is whether
peeling for alignment is done in the prologue:

  unsigned max_allowed_peel = param_vect_max_peeling_for_alignment;
  if (flag_vect_cost_model <= VECT_COST_MODEL_CHEAP)
    max_allowed_peel = 0;

Using --param vect-max-peeling-for-alignment=14 disables the peeling and the
test passes.

I think this is a bug in the vectorizer. I reduced the culprit loop to (also
moving the first loop out of the function):

  for (i = n; i < o; i++)
    {
      k += m + 1;
      t = k + p[i];
      s2 += t;
      c[i]++;
    }

We have some temporary storage for the omp clauses, such as:

  int D.3802[16]; // for k
  int D.3800[16]; // for s2
  int D.3799[16]; // for t

After the peeling (one prologue), the addresses of k, s2 and t become:

  _187 = prolog_loop_niters.27_88 * 4;
  vectp.37_186 = &D.3802 + _187;
  _213 = prolog_loop_niters.27_88 * 4;
  vectp.46_212 = &D.3799 + _213;
  _222 = prolog_loop_niters.27_88 * 4;
  vectp.48_221 = &D.3800 + _222;

and the main vectorized loop body then acts on the biased addresses, which is
wrong:

  vect__61.49_223 = MEM <vector(4) int> [(int *)vectp.48_221];
  vectp.48_224 = vectp.48_221 + 16;
  vect__61.50_225 = MEM <vector(4) int> [(int *)vectp.48_224];
  vectp.48_226 = vectp.48_221 + 32;
  vect__61.51_227 = MEM <vector(4) int> [(int *)vectp.48_226];
  vectp.48_228 = vectp.48_221 + 48;
  vect__61.52_229 = MEM <vector(4) int> [(int *)vectp.48_228];
  _61 = D.3800[_56];
  vect__62.53_230 = vect__59.44_208 + vect__61.49_223;
  vect__62.53_231 = vect__59.44_209 + vect__61.50_225;
  vect__62.53_232 = vect__59.44_210 + vect__61.51_227;
  vect__62.53_233 = vect__59.44_211 + vect__61.52_229;
  _62 = _59 + _61;
  MEM <vector(4) int> [(int *)vectp.55_234] = vect__62.53_230;
  vectp.55_237 = vectp.55_234 + 16;
  MEM <vector(4) int> [(int *)vectp.55_237] = vect__62.53_231;
  vectp.55_239 = vectp.55_234 + 32;
  MEM <vector(4) int> [(int *)vectp.55_239] = vect__62.53_232;
  vectp.55_241 = vectp.55_234 + 48;
  MEM <vector(4) int> [(int *)vectp.55_241] = vect__62.53_233;

A fix looks to be avoiding the address biasing for these kinds of DRs, that
is, the omp-clause-specific storage. These DRs are mainly used in the main
vectorized loop (one slot per lane); for this case it's a reduction: the
prologue uses element 0, and the epilogue uses either the last element or the
reduc_op over all elements, according to the clause type. The small fix below
makes it pass:

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 4988c93fdb6..a447f457f93 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1820,7 +1820,7 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
   FOR_EACH_VEC_ELT (datarefs, i, dr)
     {
       dr_vec_info *dr_info = loop_vinfo->lookup_dr (dr);
-      if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt))
+      if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt) && !STMT_VINFO_SIMD_LANE_ACCESS_P (dr_info->stmt))
 	vect_update_init_of_dr (dr_info, niters, code);
     }
 }

I've not looked into the meaning of the different values (1,2,3,4) of
STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info); presumably they correspond to the
different omp clauses? The assumption behind the above fix is that for all
cases with STMT_VINFO_SIMD_LANE_ACCESS_P > 0, the related DR is used mainly
in the vectorized loop body, so no update is needed for it in the prologue.
I'm going to do a broader round of testing to see whether we need more
restrictions on that.