> +      if (simd
> +         /*
> +         || (fd->sched_kind == OMP_CLAUSE_SCHEDULE_STATIC
> +         && !fd->have_ordered)*/)

Debugging leftovers or what?

> +  /* Enforce simdlen 1 in simd loops with data sharing clauses referencing
> +     variable sized vars.  That is unnecessarily hard to support and very
> +     unlikely to result in vectorized code anyway.  */
> +  if (is_simd)
> +    for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
> +      switch (OMP_CLAUSE_CODE (c))
> +       {
> +       case OMP_CLAUSE_REDUCTION:
> +         if (OMP_CLAUSE_REDUCTION_PLACEHOLDER (c))
> +           max_vf = 1;
> +         /* FALLTHRU */
> +       case OMP_CLAUSE_PRIVATE:
> +       case OMP_CLAUSE_FIRSTPRIVATE:
> +       case OMP_CLAUSE_LASTPRIVATE:
> +       case OMP_CLAUSE_LINEAR:
> +         if (is_variable_sized (OMP_CLAUSE_DECL (c)))
> +           max_vf = 1;
> +         break;
> +       default:
> +         continue;
> +       }

The comment is wrong, the code is right: it adjusts max_vf, not simdlen.

> +  /* If max_vf is non-NULL, then we can use only vectorization factor
> +     up to the max_vf we chose.  So stick it into safelen clause.  */
> +  if (max_vf)
> +    {
> +      tree c = find_omp_clause (gimple_omp_for_clauses (ctx->stmt),
> +                               OMP_CLAUSE_SAFELEN);

First, non-zero, not non-null -- max_vf is not a pointer.

Second, I don't understand why we're adjusting safelen.  The VF we choose
for optimizing the loop (even 1) does not change the fact that there are no
dependencies between loop iterations closer together than the original
safelen.

Did you want to be adding a _max_vf_ attribute to record stuff that we
learned while examining the omp loop?  E.g. the max_vf=1 reduction above?

Given the only adjustment we make to max_vf is to disable vectorization,
did we in fact want to discard the simd attribute instead, making this a
"regular" openmp loop?

> +       if (__builtin_expect (N32 cond3 N31, 0)) goto ZERO_ITER_BB;
> +       if (cond3 is <)
> +         adj = STEP3 - 1;
> +       else
> +         adj = STEP3 + 1;
> +       count3 = (adj + N32 - N31) / STEP3;
> +       if (__builtin_expect (N22 cond2 N21, 0)) goto ZERO_ITER_BB;
> +       if (cond2 is <)
> +         adj = STEP2 - 1;
> +       else
> +         adj = STEP2 + 1;
> +       count2 = (adj + N22 - N21) / STEP2;
> +       if (__builtin_expect (N12 cond1 N11, 0)) goto ZERO_ITER_BB;

Why not CEIL_DIV_EXPR, instead of TRUNC_DIV_EXPR with the adjustment done by
hand?  Unless we can't generate the same code in the end because, generically,
we don't know that the values involved must be negative for GT?
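
For the '<' case with a positive step, the by-hand adjustment is just a
ceiling division spelled out.  A minimal standalone sketch (mine, not the
patch's):

  /* For a '<' condition and positive step, the manual adjustment computes
     the same value CEIL_DIV_EXPR would.  */
  static long
  trip_count (long n1, long n2, long step)
  {
    long adj = step - 1;              /* the by-hand adjustment        */
    return (adj + n2 - n1) / step;    /* == ceil ((n2 - n1) / step)    */
  }

  /* e.g. trip_count (0, 10, 3) == 4 == ceil (10 / 3).  */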

I wonder if we do better mooshing all of the BBs together, creating one larger
BB with all the computation and the unexpected jump at the end.  I.e.

  bool zero3, zero2, zero1, zero;

  zero3 = N32 c3 N31;
  count3 = (N32 - N31) /[cl] STEP3;
  zero2 = N22 c2 N21;
  count2 = (N22 - N21) /[cl] STEP2;
  zero1 = N12 c1 N11;
  count1 = (N12 - N11) /[cl] STEP1;
  zero = zero3 || zero2 || zero1;
  count = count1 * count2 * count3;
  if (__builtin_expect(zero, false)) goto zero_iter_bb;

After all, we expect zero to be false, and thus we expect to have to evaluate
all of the comparison expressions, so short-circuiting oughtn't to be a win.
Since the conditions aren't protecting a denominator, we're not concerned
about divide-by-zero, so we can fully evaluate count even if a numerator
turns out to be wrong.

It seems like putting this all together would create much better scheduling
opportunities, and less pressure on the chip's branch predictor.
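
As a rough C-level illustration of that shape (a sketch only, reusing the
N11..N32/STEP names from the quoted expansion and assuming '<' conditions
with positive steps; the real expansion works on GIMPLE, not C):

  static long
  merged_count (long n11, long n12, long step1,
                long n21, long n22, long step2,
                long n31, long n32, long step3,
                int *zero)
  {
    /* Evaluate all three tests with a non-short-circuit '|' so every count
       is computed unconditionally; the result only matters when *zero == 0.  */
    *zero = (n32 < n31) | (n22 < n21) | (n12 < n11);
    long count3 = (step3 - 1 + n32 - n31) / step3;
    long count2 = (step2 - 1 + n22 - n21) / step2;
    long count1 = (step1 - 1 + n12 - n11) / step1;
    return count1 * count2 * count3;
  }

  /* ... count = merged_count (..., &zero);
     if (__builtin_expect (zero, 0)) goto zero_iter_bb;  */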

> @@ -291,6 +300,15 @@ vect_analyze_data_ref_dependence (struct data_dependence_re

You've two adjustments in this function.  First hunk for "dont_know", which is
fine; second hunk for "data is bad", which I suppose is really a form of
dont_know and thus also fine.

But it seems like there should be a third hunk in the "know" section, in which
we perform some MAX(dist, loop->safelen) computation inside that loop.

Alternately, why are we patching this function at all, rather than whatever bit
of code creates the data_dependence_relation data?
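
Standalone illustration of the kind of computation I mean (the helper and its
placement are mine, not the patch's; in the vectorizer this would sit inside
the loop over the recorded distance vectors):

  /* A dependence at distance 'dist' would normally bound the vectorization
     factor, but a safelen annotation asserts independence up to that many
     iterations, so take MAX (dist, safelen) as the effective bound.  */
  static int
  bound_max_vf (int dist, int safelen, int max_vf)
  {
    int bound = dist > safelen ? dist : safelen;   /* MAX (dist, safelen) */
    return max_vf < bound ? max_vf : bound;
  }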

> +  hash_table <simduid_to_vf> simduid_to_vf_htab;
> +  hash_table <decl_to_simduid> decl_to_simduid_htab;

Why two hashes?  Seems like you can map from decl to vf directly.  At what
point do you have a simduid integer, but not the decl from whence it came?



r~

