Hi! Now that the inclusive/exclusive scan support is pretty much done for simd, it is time to implement it also for the OpenMP worksharing-loop and the combined worksharing-loop SIMD constructs, so that the scan computation is not just vectorized, but also parallelized.
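For reference, the form users will write in the end (this is what foo in the testcase below contains, in its inclusive scan variant) is:

  #pragma omp parallel for simd reduction (inscan, +:r)
  for (int i = 0; i < n; i++)
    {
      r += a[i];
      #pragma omp scan inclusive(r)
      b[i] = r;
    }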
There are various parallel algorithms for inclusive and exclusive scan, see e.g.
https://www.researchgate.net/publication/319955406_Automatic_Scan_Parallelization_in_OpenMP
https://en.wikipedia.org/wiki/Prefix_sum
On GPGPUs, where non-divergent execution often means all threads doing the same thing at the same time and thread barriers are therefore quite cheap, it might be best to just use those parallel algorithms without any serial handling. For normal CPUs, I believe it is best to have each thread compute a partial prefix sum over a consecutive set of iterations, then perform a parallel inclusive scan over the final results of those per-thread computations (i.e. num_threads elements instead of all of them), and finally add the numbers from that parallel scan back to the partial prefix sums.

Before starting to code this up in the compiler, I thought it might be best to write it as a testcase where I can play with it more cheaply. The testcase shows both inclusive scan (with -UEXCLUSIVE_SCAN) and exclusive scan (with -DEXCLUSIVE_SCAN). The foo routine shows the user code we want to transform; with -DGCC_SUPPORTS_WORKSHARE_SCAN it will be rejected right now because GCC doesn't support that yet, but it is what users will write in the end; otherwise foo is just a serial version that can be (and is) vectorized. bar contains a proposal on how to implement it.

The OpenMP 5.0 spec says that the construct creates one or more private copies, but the input and scan phases can be pretty arbitrary user code (as long as it honors the restrictions, which allow the implementation to swap the two phases for exclusive scan, or to take them apart and perform all input phases in one loop and all scan phases in another loop), so at least in general we can't reuse the user's array (not to mention that there doesn't have to be any, it doesn't have to be an array, etc.). This implementation therefore chooses to use number of iterations + (small constant) * (number of threads) private copies.

One question is where to store the number-of-iterations private copies. They may, but don't have to, live in a single contiguous array; it is enough that each thread's set of iterations has corresponding private copies in a contiguous array. The testcase below uses alloca for them if small enough, otherwise a shared malloced array, though maybe it would be better to malloc the smaller per-thread array in each thread; thoughts on that? Then there is an array with number of threads elements, in the test malloced inside of #pragma omp single copyprivate, but in the end I'd like to use the GOMP_loop_start etc. APIs that allow allocating per-workshare memory by the thread that encounters the worksharing region first.

If the user writes #pragma omp parallel for simd reduction (inscan, +:r), the transformation would use the normal omp simd lowering/expansion for the two simd loops it creates out of the one; if the user writes #pragma omp parallel for reduction (inscan, +:r) instead, those loops would be normal serial ones. The if (thread_num < t) { t = 0; q++; } etc. stuff is what the compiler already emits for schedule(static) - and the spec disallows an explicit schedule clause on these loops. With -DSERIAL_MERGE=1 the testcase performs just a serial scan computation on the rpriva[0:num_threads] array section, otherwise it uses the work-efficient algorithm from the wiki (in the #else of #if 0, tweaked so that it is just a single loop and can have just a single combiner in there).
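To make the intended decomposition concrete before diving into the OpenMP version, here is a minimal serial sketch of the three phases (my illustration only, not the proposed lowering: scan_sketch is a made-up helper name, + is hardcoded, the initial value of r is taken as 0, and the per-thread chunks are computed the same way as in bar below):

#include <stdlib.h>

// Inclusive + scan of a[0..n-1] into b[0..n-1], decomposed the same
// way bar below splits the work across threads, but run serially.
static unsigned int
scan_sketch (const unsigned int *a, unsigned int *b, int n, int num_threads)
{
  unsigned int *rpriva = malloc (num_threads * sizeof (unsigned int));
  int q = n / num_threads, rem = n % num_threads;
  // Phase 1: each "thread" t computes a partial prefix sum over its
  // consecutive chunk of iterations and records the chunk total.
  for (int t = 0; t < num_threads; t++)
    {
      int start = t * q + (t < rem ? t : rem);
      int end = start + q + (t < rem);
      unsigned int s = 0;
      for (int i = start; i < end; i++)
        b[i] = s += a[i];
      rpriva[t] = s;
    }
  // Phase 2: inclusive scan over just the num_threads chunk totals;
  // this is the only step that needs cross-thread communication.
  for (int t = 1; t < num_threads; t++)
    rpriva[t] += rpriva[t - 1];
  // Phase 3: "thread" t > 0 adds the sum of everything before its
  // chunk (rpriva[t - 1]) to each of its partial results.
  for (int t = 1; t < num_threads; t++)
    {
      int start = t * q + (t < rem ? t : rem);
      int end = start + q + (t < rem);
      for (int i = start; i < end; i++)
        b[i] += rpriva[t - 1];
    }
  unsigned int total = rpriva[num_threads - 1];
  free (rpriva);
  return total;
}

E.g. for a = {1, 2, 3, 4, 5, 6, 7, 8} and num_threads = 2, phase 1 yields b = {1, 3, 6, 10, 5, 11, 18, 26} and rpriva = {10, 26}, phase 2 turns rpriva into {10, 36}, and phase 3 adds 10 to the second chunk, giving b = {1, 3, 6, 10, 15, 21, 28, 36}.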
The "shorter span, more parallel" algorithm would most likely need a twice as big rpriva array section, because the same elements are read and stored by different threads in the same step, which would be a data race. It could still be written even with that, by having even steps write one set of vars and read the other one and odd steps vice versa, but I'm not sure it is a good idea to use that algorithm on non-GPGPUs, especially when over-subscribing the machine. Another possibility is to use the work-efficient algorithm, but not do a barrier after every one of its 2*log2(num_threads) steps - instead let threads perform a couple of steps before doing a barrier, something in between SERIAL_MERGE 1 and the fully parallel version; or use SERIAL_MERGE say when num_threads is <= some small constant (say 32), and only use more than those two SERIAL_MERGE barriers when there are lots of threads. Any thoughts on this?

Of course, the testcase hardcodes the + operation where the compiler needs to support arbitrary reductions including UDRs, hardcodes 0 as the neutral element, and doesn't show the invocation of C++ constructors, destructors, assignment operators etc. that would be needed.

#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

#ifndef SERIAL_MERGE
#define SERIAL_MERGE 0
#endif

unsigned int r;

__attribute__((noipa)) void
foo (unsigned int *a, unsigned int *b, int n)
{
#ifdef GCC_SUPPORTS_WORKSHARE_SCAN
  #pragma omp parallel for simd reduction (inscan, +:r)
#else
  #pragma omp simd reduction (inscan, +:r)
#endif
  for (int i = 0; i < n; i++)
    {
#ifndef EXCLUSIVE_SCAN
      r += a[i];
      #pragma omp scan inclusive(r)
      b[i] = r;
#else
      b[i] = r;
      #pragma omp scan exclusive(r)
      r += a[i];
#endif
    }
}

__attribute__((noipa)) void
bar (unsigned int *a, unsigned int *b, int n)
{
  #pragma omp parallel
  {
    int num_threads = omp_get_num_threads ();
    int use_alloca = n * sizeof (unsigned int) < 16384;
    // This would be done during worksharing construct allocation
    // in the library; the first thread to encounter the worksharing
    // region would allocate memory similarly to how memory is allocated
    // for lastprivate(conditional:) helper variables, so it wouldn't
    // actually cost a barrier.
    unsigned int *rpriva;
    #pragma omp single copyprivate (rpriva)
    rpriva = malloc ((num_threads + (use_alloca ? 0 : n))
                     * sizeof (unsigned int));
    // By hand written #pragma omp for schedule(static),
    // because we need the exact start, end range for this thread
    // to be used explicitly in the algorithm.
    int thread_num = omp_get_thread_num ();
    int q = n / num_threads;
    int t = n % num_threads;
    if (thread_num < t) { t = 0; q++; }
    int start = 0 + q * thread_num + t;
    int end = start + q;
    unsigned int rpriv1 = thread_num == 0 ? r : 0;
    unsigned int *rprivb
      = (use_alloca
         ? __builtin_alloca ((end - start) * sizeof (unsigned int))
         : rpriva + num_threads + start);
    #pragma omp simd reduction (inscan, +:rpriv1)
    for (int i = start; i < end; i++)
      {
#ifndef EXCLUSIVE_SCAN
        // This is the actual user structured block with
        // r remapped to rpriv1.
        rpriv1 += a[i];
        #pragma omp scan inclusive(rpriv1)
        rprivb[i - start] = rpriv1;
#else
        rprivb[i - start] = rpriv1;
        #pragma omp scan exclusive(rpriv1)
        // This is the actual user structured block with
        // r remapped to rpriv1.
        rpriv1 += a[i];
#endif
      }
    rpriva[thread_num] = rpriv1;
    if (num_threads > 1)
      {
        // At this point, rpriva[thread_num] is the partial prefix sum:
        // for rpriva[0] of (r, a[0], ... a[end - 1]), for the
        // others of (a[start], ... a[end - 1]).
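        // The barrier below makes sure every thread has stored its
        // partial sum into rpriva[0:num_threads] before any thread
        // starts combining them.  The merge that follows turns rpriva
        // into an inclusive scan of the per-thread totals; e.g. for
        // num_threads == 4 and totals {10, 26, 7, 5}, the up-sweep
        // produces {10, 36, 7, 48} and the down-sweep {10, 36, 43, 48}.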
        #pragma omp barrier
        if (SERIAL_MERGE)
          {
            #pragma omp single
            for (int l = 1; l < num_threads; ++l)
              rpriva[l] = rpriva[l - 1] + rpriva[l];
          }
        else
          {
            unsigned int k;
#if 0
            // Up-sweep.
            for (k = 1; k * 2 <= (unsigned) num_threads; k <<= 1)
              #pragma omp for schedule(static)
              for (int l = k * 2 - 1; l < num_threads; l += k * 2)
                rpriva[l] = rpriva[l - k] + rpriva[l];
            k >>= 1;
            if (k == (unsigned) num_threads)
              k >>= 1;
            // Down-sweep.
            for (; k; k >>= 1)
              #pragma omp for schedule(static)
              for (int l = k * 3 - 1; l < num_threads; l += k * 2)
                rpriva[l] = rpriva[l - k] + rpriva[l];
#else
            // For up-sweep 0, for down-sweep 1.
            int down = 0;
            k = 1;
            do
              {
                unsigned int l;
                if ((k * 2) > (unsigned) num_threads)
                  {
                    down = 1;
                    k >>= 1;
                    if (k == (unsigned) num_threads)
                      k >>= 1;
                  }
                // For 64-bit arches we could actually just use
                // unsigned long long l = (thread_num + 1ULL) * (k * 2);
                // if (l < (unsigned long long) num_threads)
                //   rpriva[l] = rpriva[l - k] + rpriva[l];
                // Number of threads is int, so we can't ever have more
                // than 2G threads in a team, but more than 64K threads
                // might not be that unbelievable.
                if (!__builtin_mul_overflow (thread_num + 1U, k * 2, &l)
                    && (l += (down ? k : 0) - 1) < (unsigned) num_threads)
                  rpriva[l] = rpriva[l - k] + rpriva[l];
                if (down)
                  k >>= 1;
                else
                  k <<= 1;
                #pragma omp barrier
              }
            while (k);
#endif
          }
      }
    rpriv1 = thread_num == 0 ? 0 : rpriva[thread_num - 1];
    #pragma omp simd
    for (int i = start; i < end; i++)
      {
        rprivb[i - start] += rpriv1;
        // Below is the actual user structured block with
        // r remapped to rprivb[i - start].
        b[i] = rprivb[i - start];
      }
    if (thread_num == num_threads - 1)
      r = rpriva[thread_num];
    // Leak the rpriva array for now; if it is allocated by the runtime
    // library, it will actually not leak.  I could add a barrier and
    // freeing, say in omp master, before that.
  }
}

int
main (int argc, const char **argv)
{
  int n = atoi (argv[1]);
  unsigned int *a = malloc (n * sizeof (unsigned int));
  unsigned int *b = malloc (n * sizeof (unsigned int));
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    a[i] = i;
  struct timespec ts1, ts2, ts3, ts4;
  clock_gettime (CLOCK_MONOTONIC, &ts1);
  foo (a, b, n);
  clock_gettime (CLOCK_MONOTONIC, &ts2);
  long long ns1 = ts1.tv_sec * 1000000000LL + ts1.tv_nsec;
  long long ns2 = ts2.tv_sec * 1000000000LL + ts2.tv_nsec;
  printf ("%lld\n", ns2 - ns1);
  unsigned x = 0;
  for (int i = 0; i < n; ++i)
    {
#ifndef EXCLUSIVE_SCAN
      x += i;
#endif
      if (b[i] != x)
        abort ();
      else
        b[i] = -1;
#ifdef EXCLUSIVE_SCAN
      x += i;
#endif
    }
  if (r != x)
    abort ();
  r = 0;
  clock_gettime (CLOCK_MONOTONIC, &ts3);
  bar (a, b, n);
  clock_gettime (CLOCK_MONOTONIC, &ts4);
  long long ns3 = ts3.tv_sec * 1000000000LL + ts3.tv_nsec;
  long long ns4 = ts4.tv_sec * 1000000000LL + ts4.tv_nsec;
  printf ("%lld\n", ns4 - ns3);
  x = 0;
  for (int i = 0; i < n; ++i)
    {
#ifndef EXCLUSIVE_SCAN
      x += i;
#endif
      if (b[i] != x)
        abort ();
      else
        b[i] = -1;
#ifdef EXCLUSIVE_SCAN
      x += i;
#endif
    }
  if (r != x)
    abort ();
  return 0;
}

	Jakub