> > 
> >      So the unvectorized cost is
> >      SIC * niters
> > 
> >      The vectorized path is
> >      SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
> >      The scalar path of the vectorized loop is
> >      SIC * niters + SOC
> 
> Note that 'th' is used for the runtime profitability check which is
> done at the time the setup cost has already been taken (yes, we

Yes, I understand that.
> probably should make it more conservative but then guard the whole
> set of loops by the check, not only the vectorized path).
> See PR53355 for the general issue.

Yep, we may reduce SOC by emitting the early guard for the non-vectorized
path better than we do now.  However...
> >    Of course this is a very simple benchmark; in reality vectorization
> >    can be a lot more harmful by complicating more complex control flow.
> >
> >    So I guess we have two options
> >     1) go with the new formula and try to make the cost model a bit
> >        more realistic.
> >     2) stay with the original formula that is quite close to reality,
> >        but I think more by accident.
> 
> I think we need to improve it as whole, thus I'd prefer 2).

... I do not see why.
Even if we make the check cheaper we will only distribute part of SOC to vector
prologues/epilogues.

Still I think the formula is wrong, i.e. it accounts for SOC where it should not.

The cost of the scalar path without vectorization is
  niters * SIC
while with vectorization we have the scalar path
  niters * SIC + SOC
and the vector path
  SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC

So SOC cancels out in the runtime check.
I still think we need two formulas - one determining if vectorization is
profitable, the other specifying the threshold for the scalar path at runtime
(which will generally give lower values).
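To make the difference concrete, here is a toy computation with made-up
costs (SIC=4, VIC=6, VF=4, SOC=20, VOC=10; prologue/epilogue iterations
ignored) - this is just the algebra, not the vectorizer's code:

  #include <stdio.h>

  int
  main (void)
  {
    /* Made-up costs, for illustration only.  */
    double sic = 4, vic = 6, vf = 4, soc = 20, voc = 10;

    /* Formula 1, compile time: vectorizing pays off at all when
       SIC * niters > SOC + VIC * (niters / VF) + VOC.  */
    double th_profitable = (soc + voc) / (sic - vic / vf);

    /* Formula 2, runtime scalar/vector selection: SOC is already paid
       on both paths and cancels, so we only need
       SIC * niters > VIC * (niters / VF) + VOC.  */
    double th_runtime = voc / (sic - vic / vf);

    printf ("profitability threshold: %g iterations\n", th_profitable); /* 12 */
    printf ("runtime threshold: %g iterations\n", th_runtime);          /* 4 */
    return 0;
  }

With these numbers a loop that runs, say, 8 iterations should not have been
vectorized at compile time, but once the vectorized version exists the
runtime check should still send it down the vector path (cost 22 vs. 32).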
> > 2) Even when the loop iterates 2 times, it is estimated at 4 iterations by
> >    estimated_stmt_executions_int with profile feedback.
> >    The reason is the loop_ch pass.  Given a rolled loop with exit
> >    probability 30%, it proceeds by duplicating the header with the
> >    original probabilities.  This makes the loop be executed with 60%
> >    probability.  Because the loop body counts remain the same (and they
> >    should), the expected number of iterations increases by the decrease
> >    of the entry edge count to the header.
> > 
> >    I wonder what to do about this.  Obviously without path profiling
> >    loop_ch can not really do a good job.  We can artificially make the
> >    header more likely to succeed, which matches reality, but that
> >    requires non-trivial loop profile updating.
> > 
> >    We can also simply record the iteration bound into the loop structure
> >    and ignore that the profile is not realistic.
> 
> But we don't preserve loop structure from header copying ...

Since when do we keep the loop structure?  In general I would like to
eventually add value histograms specifying the number of iterations with
profile feedback to the loop structure.
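
To spell out the arithmetic behind the 2-vs-4 estimate above (illustrative
counts, not the actual testcase; the estimate is roughly the body count
divided by the entry edge count):

  #include <stdio.h>

  int
  main (void)
  {
    /* Illustrative counts only: the loop is entered 100 times and the
       body executes 200 times, i.e. 2 iterations on average.  */
    double body_count = 200, entry_count = 100;
    printf ("before loop_ch: %g iterations\n", body_count / entry_count);

    /* Header copying peels one header test, so only the non-exiting
       fraction of the entry count (assumed here to be half) reaches the
       remaining header, while the body counts stay untouched.  */
    entry_count = 50;
    printf ("after loop_ch: %g iterations\n", body_count / entry_count);
    return 0;
  }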
> 
> >    Finally we can duplicate loop headers before profiling.  I implemented
> >    that via an early_ch pass executed only with profile generation or
> >    feedback.  I guess it makes sense to do, even if it breaks the
> >    assumption that we should do strictly -Os generation on paths where
> 
> Well, there are CH cases that do not increase code size and I doubt
> that loop header copying is generally bad for -Os ... we are not
> good at handling non-copied loop headers.

There is a comment saying
  /* Loop header copying usually increases size of the code.  This used not to
     be true, since quite often it is possible to verify that the condition is
     satisfied in the first iteration and therefore to eliminate it.  Jump
     threading handles these cases now.  */
  if (optimize_loop_for_size_p (loop))
    return false;

I am not sure how much backing it has.  Scheduling loop_ch as part of the
early passes just after the profile pass makes optimize_loop_for_size_p
return true even for functions that are later found to be cold by profile
feedback.  I do not see that being a big issue.

I tested enabling loop_ch in the early passes with -fprofile-feedback and it
is SPEC neutral.  Given that it improves loop count estimates, I would still
like mainline to do that.  I do not like these quite important estimates
being wrong most of the time.

> 
> Btw, I added a "similar" check in vect_analyze_loop_operations:
> 
>   if ((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>        && (LOOP_VINFO_INT_NITERS (loop_vinfo) < vectorization_factor))
>       || ((max_niter = max_stmt_executions_int (loop)) != -1
>           && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
>     {
>       if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "not vectorized: iteration count too small.");
>       if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "not vectorized: iteration count smaller than "
>                          "vectorization factor.");
>       return false;
>     }
> 
> maybe you simply need to update that to also consider the profile?

Hmm, I am still getting familiar with the code.  Later we have
  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
      && LOOP_VINFO_INT_NITERS (loop_vinfo) <= th)
    {
      if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
        fprintf (vect_dump, "not vectorized: vectorization not "
                 "profitable.");
      if (vect_print_dump_info (REPORT_DETAILS))
        fprintf (vect_dump, "not vectorized: iteration count smaller than "
                 "user specified loop bound parameter or minimum "
                 "profitable iterations (whichever is more conservative).");
      return false;
    }

where th is always greater than or equal to vectorization_factor from the
cost model.  So this test seems redundant if max_stmt_executions_int was
pushed down to the second conditional?
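I.e. something like the following (an untested sketch only, just to show
what I mean; max_niter declared as in your snippet above):

  if ((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && LOOP_VINFO_INT_NITERS (loop_vinfo) <= th)
      || ((max_niter = max_stmt_executions_int (loop)) != -1
          && (unsigned HOST_WIDE_INT) max_niter
             <= (unsigned HOST_WIDE_INT) th))
    {
      if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
        fprintf (vect_dump, "not vectorized: vectorization not "
                 "profitable.");
      return false;
    }

That would let the profile-based upper bound participate in the cost check
as well.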

Honza
