Re: RFC/RFH - proof of concept for GCC OpenMP 5.0 non-rectangular worksharing-loop implementation

Richard Biener Wed, 17 Jun 2020 05:30:21 -0700

On Tue, 16 Jun 2020, Jakub Jelinek wrote:

> Hi!
> 
> Before OpenMP 5.0, all OpenMP loop nests had to be rectangular (and OpenMP
> has various other restrictions that make implementation easier), so that it
> was very easy to compute the number of iterations of the collapsed loop
> by computing number of iterations of each of the loops in the loop nest
> and multiplying them; then either inline code or runtime library from that
> total number of iterations, the thread number, number of threads in the team
> and scheduling policies can determine an interval of logical iterations
> which the thread will handle and then it was fairly cheap to compute from
> that single 0 .. num_iterations-1 iterator what iterator values it starts
> with.
> 
> For non-rectangular loops, this is getting harder, both how to compute
> the number of iterations and how to cheaply if possible from the single
> logical number iteration start compute the different iterator values.
> 
> Below is a proof of concept of what I came up so far.  I have tried
> to list the relevant restrictions in OpenMP in the first big comment in bar
> function, foo function has #if 0 what I'm trying to implement and #else
> a version that does it serially in a single thread only, and then bar
> is an attempt to write in C what GCC could roughly emit for it.
> Note, the test uses (mostly) int types, but in reality it can be other
> integral types, signed or unsigned, and OpenMP just adds assurances that
> the loops will not wrap around and otherwise behave well.
> 
> For both number of iterations computation and the logical iteration to
> actual iterator values computations I have two versions, one lame fallback
> which worst case let's each thread basically run the whole loop as is except
> for the body.  This one is because I don't want to spend months on it and
> deal with Bernoulli constants for 120 nested loops etc., so something that
> will be standard compliant and for loops that have costly body will be
> beneficial too.  And then an optimized version, for now limited to
> triangular loops (can be used also if there are rectangular loops around
> those), where the first part tries to compute total number of iterations
> using Faulhaber's formula and the second part attempts to compute quadratic
> equation root using integer square root.
> 
> The proof of concept right now uses the fallback even if it sees that the
> inner loop will not have at least a single iteration for all values of the
> outer loop iterator, will try to change that to just artificially change
> the a and b values (bounds of the outer iterator) for the purposes of total
> number of iterations computation and for the purpose of transforming the
> quadratic equation root into the actual iterator values (and keep the
> original ones for the purpose of computing lastprivate iterator values).
> 
> Any thoughts on how to simplify this, what to do differently, whether e.g.
> using floating point math instead would be beneficial etc.?
> Any input appreciated.
> 
> /* Proof of concept for OpenMP non-rectangular worksharing-loop
>    implementation.
>    Copyright (C) 2020 Free Software Foundation, Inc.
> 
>    GCC is free software; you can redistribute it and/or modify it under
>    the terms of the GNU General Public License as published by the Free
>    Software Foundation; either version 3, or (at your option) any later
>    version.
> 
>    GCC is distributed in the hope that it will be useful, but WITHOUT ANY
>    WARRANTY; without even the implied warranty of MERCHANTABILITY or
>    FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
>    for more details.
> 
>    You should have received a copy of the GNU General Public License
>    along with GCC; see the file COPYING3.  If not see
>    <http://www.gnu.org/licenses/>.  */
> 
> #include <stdlib.h>
> #include <stdio.h>
> #ifdef _OPENMP
> #include <omp.h>
> #endif
> 
> int x, i, j, nitersv, niterscnt;
> 
> #ifdef DEBUG
> #define DPRINTF(...) printf (__VA_ARGS__)
> #else
> #define DPRINTF(...) do {} while (0)
> #endif
> #ifndef _OPENMP
> #define omp_get_num_threads() 1
> #define omp_get_thread_num() 0
> #endif
> 
> void
> foo (int a, int b, int c, int d, int e, int f, int g, int h)
> {
> #ifdef DEBUG
> #pragma omp single
>   nitersv = 0;
> #endif
> #if 0
>   #pragma omp for collapse(2) lastprivate (i, j, x)
> #else
>   #pragma omp single
> #endif
>   for (i = a; i < b; i += c)
>     for (j = d * i + e; j < f * i + g; j += h)
>       {
>         x = i * 1024 + (j & 1023);
> #ifdef DEBUG
>         nitersv++;
> #endif
>         DPRINTF ("%d %d %d %d\n", i, j, x, omp_get_thread_num ());
>       }
> #ifdef DEBUG
>   #pragma omp single
>   DPRINTF ("niters = %d\n", nitersv);
> #endif
> }
> 
> void
> bar (int a, int b, int c, int d, int e, int f, int g, int h)
> {
>   /* Proposed implementation of:
>   #pragma omp for collapse(2) lastprivate (i, j, x)
>   for (i = a; i < b; i += c)
>     for (j = d * i + e; j < f * i + g; j += h)  */
> 
>   /* OpenMP requires that ((f - d) * c) % h == 0
>      and that either the initializer and condition expressions
>      are outermost loop invariant, or have syntactic forms that can
>      be represented as integral a1 * var-outer + a2 where var-outer
>      is some outer loop iterator with compatible type and a1 and a2 are
>      integral expressions (and have compatible type too), which are outermost
>      loop invariant.  Comparisons can be <, <=, >, >= or != but in the last
>      case the step is required to be compile time constant so that one
>      can determine iteration direction and for the others the step has to
>      match the iteration direction of the comparison operator.
>      And also a requirement that essentially says that there
>      is no wrap-around in any of the iterators and that the number of
>      iterations can be computed without risks of overflows/wrap-arounds.
>      Any number of loops can be collapsed and all but the outermost can be
>      non-rectangular (or at least potentially one, basically where the
>      expressions refer to the outer loop iterator).  The step expressions
>      must be always outermost loop invariant.  */


So this remembers me of the loop scalarization pass Sebastian once
implemented - that is, with the above constraints it looks like we
can do the actual collapsing, replacing collapsed loops with a new
one with a canonical IV from 0 to N (thus another rectangular one).
The old IVs would be either directly computed from that N
(but that requires costly modulo) or be derived IVs that would
not be affine (because they are "wrapping", aka reset).

You could possibly ask ISL to compute the mappings and
the GRAPHITE pieces could help generating the actual code
for them.

For followup optimizations such loop structure is of course
highly complicated because SCEV will give up on all uses :/

That said, is it possible to do the above transform of the actual
loop during OMP lowering or expansion (or even earlier)?  I guess
what you sketch below is similar, but I don't understand the
different cases well enough.

Are the loops required to be perfectly nesting?  At least those
that collapse?

Sorry to not have some magic to help here ...

Richard.


>   /* First try to calculate the total number of iterations.
>      Can be simplified by computing all outermost rectangular loops
>      whose iterator vars are not referenced in the non-rectangular loops
>      separately, and similarly all innermost rectangular loops separately.  */
>   int niters = 0;
>   if (!(a < b))
>     {
>       /* No iterations at all, only i defined after loop.  */
>       i = a;
>       goto end;
>     }
>   /* If the (middle) non-rectangular loops are triangular (or perhaps in some
>      more cases using Faulhaber's formula?), check if for all the iterators 
> the
>      inner loop will have at least one iteration.
>      If all of a, b, c, d, e, f, g, h are compile time constants, we want
>      to compute niters at compile time obviously.  If only some of them
>      are constant, let the normal optimizations simplify the expressions
>      correspondingly.  */
>   int t4 = (b + (c - 1) - a) / c;
>   int t5 = a + ((t4 - 1) * c);
>   int t8 = d * a + e;
>   int t9 = f * a + g;
>   int t10, t11;
>   if (t8 < t9 && d * t5 + e < f * t5 + g)
>     {
>       t10 = ((t9 + (h - 1) - t8) / h);
>       t11 = ((f - d) * c / h);
>       niters = t4 * t10 + t11 * (((t4 - 1) * t4) / 2);
>     }
>   else
>     {
>       t10 = t11 = 0;
>       /* Fallback implementation, if it above gets too ugly/hard.  Repeat all
>        loops except the innermost, hope loop optimizations optimize at least
>        something.  */
>       for (int t1 = a; t1 < b; t1 += c)
>         {
>           int t2 = d * t1 + e;
>           int t3 = f * t1 + g;
>           if (t2 < t3)
>           niters += (t3 + (h - 1) - t2) / h;
>         }
>     }
> 
>   /* Second step, the usual GCC OpenMP schedule(static) computation
>      of which iterations should the current thread take.  */
>   int num_threads = omp_get_num_threads ();
>   int thr = omp_get_thread_num ();
>   int t6 = niters / num_threads;
>   int t7 = niters % num_threads;
>   if (thr < t7) { t7 = 0; t6++; }
>   int start = t6 * thr + t7;
>   int end = start + t6;
>   if (!(start < end))
>     goto end;
>   /* Now, start contains the first logical iteration that this
>      thread should handle (counted from 0) and end should be the
>      first one it should already not handle.  */
> 
>   /* Third step, from start try to determine the initial values of
>      the loop iteration variables.  */
>   int ipriv, jpriv;
>   if (t10)
>     {
>       /* We want to find maximum x such that
>          start >= x * t10 + t11 * (((x - 1) * x) / 2)
>          and from that x compute both indexes:
>          ipriv = a + x * c;
>          jpriv = d * ipriv + e + (start - (x * t10 + t11 * (((x - 1) * x) / 
> 2))) * h.  */
> 
>       /* x = (isqrt((t10-t11/2)*(t10-t11/2)+start)-(t10-t11/2)) / t11 would 
> be a rough
>          guess that needs to be verified.  */
>       int t12 = t10 - t11 / 2;
>       /* Quick overflow check for the ^2.  */
>       if (__builtin_expect (t12 + 45000U > 90000U, 0))
>         goto fallback;
>       unsigned t13 = t12 * t12 + (unsigned) start;
>       if (__builtin_expect (t13 == 0 || t11 == 0, 0))
>       goto fallback;
>       /* Compute isqrt(t13).  */
>       unsigned isqrtb = (1U << (__SIZEOF_INT__ * __CHAR_BIT__ + 1
>                               - __builtin_clz (t13)) /2) - 1;
>       unsigned isqrta = (isqrtb + 3) / 2;
>       do {
>       unsigned isqrtm = (isqrta + isqrtb) >> 1;
>       if (isqrtm * isqrtm > t13)
>         isqrtb = isqrtm - 1;
>       else
>         isqrta = isqrtm + 1;
>       } while (isqrtb >= isqrta);
>       unsigned isqrt = isqrta - 1;
>       unsigned t14 = (isqrt - t12) / t11;
>       unsigned t15 = t14 * t10 + t11 * (((t14 - 1) * t14) / 2);
>       if (__builtin_expect (start >= t15, 1))
>       {
>         unsigned t16 = t15 + t10 + t14;
>         if (__builtin_expect (start >= t16, 0))
>           goto fallback;
>       }
>       else
>       {
>         unsigned t16 = t15 - t10 - t14;
>         if (__builtin_expect (start < t16, 0))
>           goto fallback;
>         t14--;
>         t15 = t16;
>       }
>       ipriv = a + (int) t14 * c;
>       jpriv = d * ipriv + e + (start - (int) t15) * h;
>     }
>   else
>     {
>       /* Fallback implementation, if it above gets too ugly/hard.  Repeat all
>        loops except the innermost, hope loop optimizations optimize at least
>        something.  */
>      fallback:;
>       int cnt = 0;
>       for (int t1 = a; t1 < b; t1 += c)
>         {
>           int t2 = d * t1 + e;
>           int t3 = f * t1 + g;
>           if (t2 < t3)
>           {
>             int t8 = (t3 + (h - 1) - t2) / h;
>             if (cnt + t8 > start)
>               {
>                 ipriv = t1;
>                 jpriv = t2 + (start - cnt) * h;
>                 goto done;
>               }
>             else
>               cnt += t8;
>           }
>       }
>       done:;
>     }
> 
>   int jmax = f * ipriv + g;
> 
>   int xpriv;
>   do
>     {
>       /* Now the body, with the privatized copies of the loop iterators
>        as well as other privatized variables as usual in OpenMP.  */
>       {
>         xpriv = ipriv * 1024 + (jpriv & 1023);
>         DPRINTF ("%d %d %d %d\n", ipriv, jpriv, xpriv, omp_get_thread_num ());
>       }
> #ifdef DEBUG
>       if (ipriv < a || ipriv >= b)
>         abort ();
>       if ((ipriv - a) % c)
>         abort ();
>       if (jpriv < d * ipriv + e || jpriv >= f * ipriv + g)
>         abort ();
>       if ((jpriv - (d * ipriv + e)) % h)
>         abort ();
>       #pragma omp atomic
>       ++niterscnt;
> #endif
> 
>       /* Use start as the logical iteration counter.  */
>       start++;
>       if (!(start < end))
>       break;
> 
>       /* Now bump the innermost iterator.  */
>       jpriv += h;
>       if (jpriv < jmax)
>       continue;
> 
>       /* The outermost iterator doesn't need condition checking, we have done 
> that
>        already through the start < end check.  */
>       ipriv += c;
>       /* Or precompute earlier how to bump jmax and jmin less expensively?  */
>       jpriv = d * ipriv + e;
>       jmax = f * ipriv + g;
>     }
>   while (1);
>   /* Lastprivate handling.  */
>   if (start == niters)
>     {
>       /* The thread that has been assigned the last iteration will handle 
> this.  */
> 
>       /* The variables other than iterators are very easy.  */
>       x = xpriv;
> 
>       /* The iterators can be harder, at least in cases where the innermost
>        loop is not (or might not) be executed at all for some of the outer
>        loop iterator values.  */
>       /* Try to do something smarter for the cases where the first phase
>        proved that is not the case?  */
> 
>       /* As fallback, continue iterating with empty bodies the outer loops
>        until all the conditions fail.  */
>       jpriv += h;
>       do
>       {
>         ipriv += c;
>         if (!(ipriv < b))
>           break;
>           jpriv = d * ipriv + e;
>       }
>       while (1);
> 
>       /* And assign those to the shared variables.  */
>       i = ipriv;
>       j = jpriv;
>     }
> 
>   end:;
>   DPRINTF ("niters %d\n", niters);
> }
> 
> volatile int v;
> 
> int
> main ()
> {
>   x = i = j = -1;
>   #pragma omp parallel num_threads(15)
>   foo (v + 4, v + 10, v + 1, v + 2, v - 9, v + 1, v, v + 1);
>   DPRINTF ("last %d %d %d\n", i, j, x);
>   if (i != 10 || j != 9 || x != 8 * 1024 + 7)
>     abort ();
>   DPRINTF ("===\n");
>   x = i = j = -1;
>   #pragma omp parallel num_threads(15)
>   bar (v + 4, v + 10, v + 1, v + 2, v - 9, v + 1, v, v + 1);
>   DPRINTF ("last %d %d %d\n", i, j, x);
>   if (i != 10 || j != 9 || x != 8 * 1024 + 7)
>     abort ();
>   DPRINTF ("===\n");
> 
>   x = i = j = -1;
>   #pragma omp parallel num_threads(15)
>   foo (v + 1, v + 10, v + 2, v + 0, v + 1, v + 1, v + 1, v + 1);
>   DPRINTF ("last %d %d %d\n", i, j, x);
>   if (i != 11 || j != 10 || x != 9 * 1024 + 9)
>     abort ();
>   DPRINTF ("===\n");
>   x = i = j = -1;
>   #pragma omp parallel num_threads(15)
>   bar (v + 1, v + 10, v + 2, v + 0, v + 1, v + 1, v + 1, v + 1);
>   DPRINTF ("last %d %d %d\n", i, j, x);
>   if (i != 11 || j != 10 || x != 9 * 1024 + 9)
>     abort ();
>   DPRINTF ("===\n");
> 
>   x = i = j = -1;
>   #pragma omp parallel num_threads(15)
>   foo (v + 4, v + 8, v + 12, v - 8, v - 9, v - 3, v + 6, v + 15);
>   DPRINTF ("last %d %d %d\n", i, j, x);
>   if (i != 16 || j != 4 || x != 5 * 1024 - 11)
>     abort ();
>   DPRINTF ("===\n");
>   x = i = j = -1;
>   #pragma omp parallel num_threads(15)
>   bar (v + 4, v + 8, v + 12, v - 8, v - 9, v - 3, v + 6, v + 15);
>   DPRINTF ("last %d %d %d\n", i, j, x);
>   if (i != 16 || j != 4 || x != 5 * 1024 - 11)
>     abort ();
>   DPRINTF ("===\n");
> 
>   x = i = j = -1;
>   #pragma omp parallel num_threads(15)
>   foo (v - 13, v + 7, v + 12, v + 3, v + 5, v, v - 6, v + 1);
>   DPRINTF ("last %d %d %d\n", i, j, x);
>   if (i != 11 || j != 2 || x != -12 * 1024 - 7)
>     abort ();
>   DPRINTF ("===\n");
>   x = i = j = -1;
>   #pragma omp parallel num_threads(15)
>   bar (v - 13, v + 7, v + 12, v + 3, v + 5, v, v - 6, v + 1);
>   DPRINTF ("last %d %d %d\n", i, j, x);
>   if (i != 11 || j != 2 || x != -12 * 1024 - 7)
>     abort ();
>   DPRINTF ("===\n");
> 
> 
>   for (int idx = 0; idx < 16384 * 1024; idx++)
>     {
>       int a = (random () & 31) - 16;
>       int b = (random () & 31) - 16;
>       int c = (random () & 15) + 1;
>       int d = (random () & 31) - 16;
>       int e = (random () & 31) - 16;
>       int f = (random () & 31) - 16;
>       int g = (random () & 31) - 16;
>       int h = (random () & 15) + 1;
>       while (((f - d) * c % h) != 0)
>       h = (random () & 15) + 1;
>       x = i = j = -1;
>       #pragma omp parallel num_threads(15)
>       foo (a, b, c, d, e, f, g, h);
>       int xs = x;
>       int is = i;
>       int js = j;
>       x = i = j = -1;
>       niterscnt = 0;
>       #pragma omp parallel num_threads(15)
>       bar (a, b, c, d, e, f, g, h);
> #ifdef DEBUG
>       if (nitersv != niterscnt)
>       abort ();
> #endif
>     }
>   return 0;
> }
> 
>       Jakub
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

Re: RFC/RFH - proof of concept for GCC OpenMP 5.0 non-rectangular worksharing-loop implementation

Reply via email to