https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110310

            Bug ID: 110310
           Summary: vector epilogue handling is inefficient
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

It looks like we apply some analysis only when transforming the main vector
loop.  In particular, vect_do_peeling does the following, which elides a
vector epilogue after costing:

  /* If we know the number of scalar iterations for the main loop we should
     check whether after the main loop there are enough iterations left over
     for the epilogue.  */
  if (vect_epilogues
      && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
      && prolog_peeling >= 0
      && known_eq (vf, lowest_vf))
    {
      unsigned HOST_WIDE_INT eiters
        = (LOOP_VINFO_INT_NITERS (loop_vinfo)
           - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));

      eiters -= prolog_peeling;
      eiters
        = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);

      while (!vect_update_epilogue_niters (epilogue_vinfo, eiters))
        {
          delete epilogue_vinfo;
          epilogue_vinfo = NULL;
          if (loop_vinfo->epilogue_vinfos.length () == 0)
            {
              vect_epilogues = false;
              break;
            }
          epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
          loop_vinfo->epilogue_vinfos.ordered_remove (0);
        }
      vect_epilogues_updated_niters = true;

So for example for the loop

void foo (int * __restrict a, int *b)
{
  for (int i = 0; i < 20; ++i)
    a[i] = b[i] + 42;
}

we end up with no vectorized epilogue when using AVX512.  The AVX2 epilogue
is discarded, but in its place we'd like to use an SSE2 epilogue.  It
seems that what vect_determine_partial_vectors_and_peeling computes, as
called from vect_update_epilogue_niters, should already have been determined
when analyzing the epilogue, but during the epilogue costing the loop_vinfo
still inherits the main loop NITER.
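
To make the arithmetic for this testcase concrete, here is a minimal
standalone sketch (not GCC code).  epilogue_iters mirrors the eiters
computation quoted above.  pick_epilogue_vf is a hypothetical stand-in for
the retry loop around vect_update_epilogue_niters, and its "VF must not
exceed the remaining iterations" rule is a simplifying assumption.  The
vectorization factors 16, 8 and 4 assume 32-bit int elements with 512-bit,
256-bit and 128-bit vectors.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar iterations left over for the epilogue, following the eiters
   arithmetic in the vect_do_peeling excerpt.  */
static unsigned long
epilogue_iters (unsigned long niters, unsigned long prolog_peeling,
                unsigned long peeling_for_gaps, unsigned long main_vf)
{
  assert (main_vf > 0);
  unsigned long eiters = niters - peeling_for_gaps;
  eiters -= prolog_peeling;
  return eiters % main_vf + peeling_for_gaps;
}

/* Hypothetical helper: return the first candidate epilogue VF that does
   not exceed the remaining iterations, or 0 if none fits.  This is an
   assumed simplification, not the actual GCC check.  */
static unsigned long
pick_epilogue_vf (unsigned long eiters, const unsigned long *vfs, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    if (vfs[i] <= eiters)
      return vfs[i];
  return 0;
}
```

For foo, epilogue_iters (20, 0, 0, 16) leaves 4 iterations.  With only the
8-lane AVX2 candidate available, pick_epilogue_vf returns 0, which matches
the epilogue being dropped.  If a 4-lane SSE2 candidate were also in the
list, it would be selected.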

For the testcase at hand we're somewhat saved by BB vectorization, but when
doing partial loop vectorization we unnecessarily get an AVX512 masked
epilogue here.  The cost model doesn't get a chance to see the updated
known niter for the epilogue, nor would there be a meaningful way to
do this when costs are compared, because we have no way of estimating the
number of masked out lanes, for example.
