Hi! Now that the inclusive/exclusive scan support is pretty much done for simd, it is time to implement it also for the OpenMP worksharing-loop and the combined worksharing-loop SIMD constructs, so that the scan computation is not just vectorized, but also parallelized.
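For reference, the form users will write in the end (this is what foo in the testcase below contains, in its inclusive scan variant) is:

  #pragma omp parallel for simd reduction (inscan, +:r)
  for (int i = 0; i < n; i++)
    {
      r += a[i];
      #pragma omp scan inclusive(r)
      b[i] = r;
    }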
There are various parallel algorithms for inclusive and exclusive scan, see e.g.
https://www.researchgate.net/publication/319955406_Automatic_Scan_Parallelization_in_OpenMP
https://en.wikipedia.org/wiki/Prefix_sum
On GPGPUs, where non-divergent execution often means all threads doing the same thing at the same time and thread barriers are therefore quite cheap, it might be best to just use those parallel algorithms without any serial handling. For normal CPUs, I believe it is best to have each thread compute a partial prefix sum over a consecutive set of iterations, then perform a parallel inclusive scan over the final results of those per-thread computations (i.e. num_threads elements instead of all of them), and finally add the numbers from that parallel scan back to the partial prefix sums.

Before starting to code this up in the compiler, I thought it might be best to write it as a testcase where I can play with it more cheaply. The testcase shows both inclusive scan (with -UEXCLUSIVE_SCAN) and exclusive scan (with -DEXCLUSIVE_SCAN). The foo routine shows the user code we want to transform; with -DGCC_SUPPORTS_WORKSHARE_SCAN it will be rejected right now because GCC doesn't support that yet, but it is what users will write in the end; otherwise foo is just a serial version that can be (and is) vectorized. bar contains a proposal on how to implement it.

The OpenMP 5.0 spec says that the construct creates one or more private copies, but the input and scan phases can be pretty arbitrary user code (as long as it honors the restrictions, which allow the implementation to swap the two phases for exclusive scan, or to take them apart and perform all input phases in one loop and all scan phases in another loop), so at least in general we can't reuse the user's array (not to mention that there doesn't have to be any, it doesn't have to be an array, etc.). This implementation therefore chooses to use number of iterations + (small constant) * (number of threads) private copies.

One question is where to store the number-of-iterations private copies. They may, but don't have to, live in a single contiguous array; it is enough that each thread's set of iterations has corresponding private copies in a contiguous array. The testcase below uses alloca for them if small enough, otherwise a shared malloced array, though maybe it would be better to malloc the smaller per-thread array in each thread; thoughts on that? Then there is an array with number of threads elements, in the test malloced inside of #pragma omp single copyprivate, but in the end I'd like to use the GOMP_loop_start etc. APIs that allow allocating per-workshare memory by the thread that encounters the worksharing region first.

If the user writes #pragma omp parallel for simd reduction (inscan, +:r), the transformation would use the normal omp simd lowering/expansion for the two simd loops it creates out of the one; if the user writes #pragma omp parallel for reduction (inscan, +:r) instead, those loops would be normal serial ones. The if (thread_num < t) { t = 0; q++; } etc. stuff is what the compiler already emits for schedule(static) - and the spec disallows an explicit schedule clause on these loops. With -DSERIAL_MERGE=1 the testcase performs just a serial scan computation on the rpriva[0:num_threads] array section, otherwise it uses the work-efficient algorithm from the wiki (in the #else of #if 0, tweaked so that it is just a single loop and can have just a single combiner in there).
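To make the intended decomposition concrete before diving into the OpenMP version, here is a minimal serial sketch of the three phases (my illustration only, not the proposed lowering: scan_sketch is a made-up helper name, + is hardcoded, the initial value of r is taken as 0, and the per-thread chunks are computed the same way as in bar below):

#include <stdlib.h>

// Inclusive + scan of a[0..n-1] into b[0..n-1], decomposed the same
// way bar below splits the work across threads, but run serially.
static unsigned int
scan_sketch (const unsigned int *a, unsigned int *b, int n, int num_threads)
{
  unsigned int *rpriva = malloc (num_threads * sizeof (unsigned int));
  int q = n / num_threads, rem = n % num_threads;
  // Phase 1: each "thread" t computes a partial prefix sum over its
  // consecutive chunk of iterations and records the chunk total.
  for (int t = 0; t < num_threads; t++)
    {
      int start = t * q + (t < rem ? t : rem);
      int end = start + q + (t < rem);
      unsigned int s = 0;
      for (int i = start; i < end; i++)
        b[i] = s += a[i];
      rpriva[t] = s;
    }
  // Phase 2: inclusive scan over just the num_threads chunk totals;
  // this is the only step that needs cross-thread communication.
  for (int t = 1; t < num_threads; t++)
    rpriva[t] += rpriva[t - 1];
  // Phase 3: "thread" t > 0 adds the sum of everything before its
  // chunk (rpriva[t - 1]) to each of its partial results.
  for (int t = 1; t < num_threads; t++)
    {
      int start = t * q + (t < rem ? t : rem);
      int end = start + q + (t < rem);
      for (int i = start; i < end; i++)
        b[i] += rpriva[t - 1];
    }
  unsigned int total = rpriva[num_threads - 1];
  free (rpriva);
  return total;
}

E.g. for a = {1, 2, 3, 4, 5, 6, 7, 8} and num_threads = 2, phase 1 yields b = {1, 3, 6, 10, 5, 11, 18, 26} and rpriva = {10, 26}, phase 2 turns rpriva into {10, 36}, and phase 3 adds 10 to the second chunk, giving b = {1, 3, 6, 10, 15, 21, 28, 36}.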
The "shorter span, more parallel" algorithm would most likely need a twice as big rpriva array section, because the same elements are read and stored by different threads in the same step, which would be a data race. It could still be written even with that, by having even steps write one set of vars and read the other one and odd steps vice versa, but I'm not sure it is a good idea to use that algorithm on non-GPGPUs, especially when over-subscribing the machine. Another possibility is to use the work-efficient algorithm, but not do a barrier after every one of its 2*log2(num_threads) steps - instead let threads perform a couple of steps before doing a barrier, something in between SERIAL_MERGE 1 and the fully parallel version; or use SERIAL_MERGE say when num_threads is <= some small constant (say 32), and only use more than those two SERIAL_MERGE barriers when there are lots of threads. Any thoughts on this?

Of course, the testcase hardcodes the + operation where the compiler needs to support arbitrary reductions including UDRs, hardcodes 0 as the neutral element, and doesn't show the invocation of C++ constructors, destructors, assignment operators etc. that would be needed.

#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

#ifndef SERIAL_MERGE
#define SERIAL_MERGE 0
#endif

unsigned int r;

__attribute__((noipa)) void
foo (unsigned int *a, unsigned int *b, int n)
{
#ifdef GCC_SUPPORTS_WORKSHARE_SCAN
  #pragma omp parallel for simd reduction (inscan, +:r)
#else
  #pragma omp simd reduction (inscan, +:r)
#endif
  for (int i = 0; i < n; i++)
    {
#ifndef EXCLUSIVE_SCAN
      r += a[i];
      #pragma omp scan inclusive(r)
      b[i] = r;
#else
      b[i] = r;
      #pragma omp scan exclusive(r)
      r += a[i];
#endif
    }
}

__attribute__((noipa)) void
bar (unsigned int *a, unsigned int *b, int n)
{
  #pragma omp parallel
  {
    int num_threads = omp_get_num_threads ();
    int use_alloca = n * sizeof (unsigned int) < 16384;
    // This would be done during worksharing construct allocation
    // in the library; the first thread to encounter the worksharing
    // region would allocate memory similarly to how memory is allocated
    // for lastprivate(conditional:) helper variables, so it wouldn't
    // actually cost a barrier.
    unsigned int *rpriva;
    #pragma omp single copyprivate (rpriva)
    rpriva = malloc ((num_threads + (use_alloca ? 0 : n))
                     * sizeof (unsigned int));
    // By hand written #pragma omp for schedule(static),
    // because we need the exact start, end range for this thread
    // to be used explicitly in the algorithm.
    int thread_num = omp_get_thread_num ();
    int q = n / num_threads;
    int t = n % num_threads;
    if (thread_num < t) { t = 0; q++; }
    int start = 0 + q * thread_num + t;
    int end = start + q;
    unsigned int rpriv1 = thread_num == 0 ? r : 0;
    unsigned int *rprivb
      = (use_alloca
         ? __builtin_alloca ((end - start) * sizeof (unsigned int))
         : rpriva + num_threads + start);
    #pragma omp simd reduction (inscan, +:rpriv1)
    for (int i = start; i < end; i++)
      {
#ifndef EXCLUSIVE_SCAN
        // This is the actual user structured block with
        // r remapped to rpriv1.
        rpriv1 += a[i];
        #pragma omp scan inclusive(rpriv1)
        rprivb[i - start] = rpriv1;
#else
        rprivb[i - start] = rpriv1;
        #pragma omp scan exclusive(rpriv1)
        // This is the actual user structured block with
        // r remapped to rpriv1.
        rpriv1 += a[i];
#endif
      }
    rpriva[thread_num] = rpriv1;
    if (num_threads > 1)
      {
        // At this point, rpriva[thread_num] is the partial prefix sum:
        // for rpriva[0] of (r, a[0], ... a[end - 1]), for the
        // others of (a[start], ... a[end - 1]).
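        // The barrier below makes sure every thread has stored its
        // partial sum into rpriva[0:num_threads] before any thread
        // starts combining them.  The merge that follows turns rpriva
        // into an inclusive scan of the per-thread totals; e.g. for
        // num_threads == 4 and totals {10, 26, 7, 5}, the up-sweep
        // produces {10, 36, 7, 48} and the down-sweep {10, 36, 43, 48}.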
        #pragma omp barrier
        if (SERIAL_MERGE)
          {
            #pragma omp single
            for (int l = 1; l < num_threads; ++l)
              rpriva[l] = rpriva[l - 1] + rpriva[l];
          }
        else
          {
            unsigned int k;
#if 0
            // Up-sweep.
            for (k = 1; k * 2 <= (unsigned) num_threads; k <<= 1)
              #pragma omp for schedule(static)
              for (int l = k * 2 - 1; l < num_threads; l += k * 2)
                rpriva[l] = rpriva[l - k] + rpriva[l];
            k >>= 1;
            if (k == (unsigned) num_threads)
              k >>= 1;
            // Down-sweep.
            for (; k; k >>= 1)
              #pragma omp for schedule(static)
              for (int l = k * 3 - 1; l < num_threads; l += k * 2)
                rpriva[l] = rpriva[l - k] + rpriva[l];
#else
            // For up-sweep 0, for down-sweep 1.
            int down = 0;
            k = 1;
            do
              {
                unsigned int l;
                if ((k * 2) > (unsigned) num_threads)
                  {
                    down = 1;
                    k >>= 1;
                    if (k == (unsigned) num_threads)
                      k >>= 1;
                  }
                // For 64-bit arches we could actually just use
                // unsigned long long l = (thread_num + 1ULL) * (k * 2);
                // if (l < (unsigned long long) num_threads)
                //   rpriva[l] = rpriva[l - k] + rpriva[l];
                // Number of threads is int, so we can't ever have more
                // than 2G threads in a team, but more than 64K threads
                // might not be that unbelievable.
                if (!__builtin_mul_overflow (thread_num + 1U, k * 2, &l)
                    && (l += (down ? k : 0) - 1) < (unsigned) num_threads)
                  rpriva[l] = rpriva[l - k] + rpriva[l];
                if (down)
                  k >>= 1;
                else
                  k <<= 1;
                #pragma omp barrier
              }
            while (k);
#endif
          }
      }
    rpriv1 = thread_num == 0 ? 0 : rpriva[thread_num - 1];
    #pragma omp simd
    for (int i = start; i < end; i++)
      {
        rprivb[i - start] += rpriv1;
        // Below is the actual user structured block with
        // r remapped to rprivb[i - start].
        b[i] = rprivb[i - start];
      }
    if (thread_num == num_threads - 1)
      r = rpriva[thread_num];
    // Leak the rpriva array for now; if it is allocated by the runtime
    // library, it will actually not leak.  I could add a barrier and
    // freeing, say in omp master, before that.
  }
}

int
main (int argc, const char **argv)
{
  int n = atoi (argv[1]);
  unsigned int *a = malloc (n * sizeof (unsigned int));
  unsigned int *b = malloc (n * sizeof (unsigned int));
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    a[i] = i;
  struct timespec ts1, ts2, ts3, ts4;
  clock_gettime (CLOCK_MONOTONIC, &ts1);
  foo (a, b, n);
  clock_gettime (CLOCK_MONOTONIC, &ts2);
  long long ns1 = ts1.tv_sec * 1000000000LL + ts1.tv_nsec;
  long long ns2 = ts2.tv_sec * 1000000000LL + ts2.tv_nsec;
  printf ("%lld\n", ns2 - ns1);
  unsigned x = 0;
  for (int i = 0; i < n; ++i)
    {
#ifndef EXCLUSIVE_SCAN
      x += i;
#endif
      if (b[i] != x)
        abort ();
      else
        b[i] = -1;
#ifdef EXCLUSIVE_SCAN
      x += i;
#endif
    }
  if (r != x)
    abort ();
  r = 0;
  clock_gettime (CLOCK_MONOTONIC, &ts3);
  bar (a, b, n);
  clock_gettime (CLOCK_MONOTONIC, &ts4);
  long long ns3 = ts3.tv_sec * 1000000000LL + ts3.tv_nsec;
  long long ns4 = ts4.tv_sec * 1000000000LL + ts4.tv_nsec;
  printf ("%lld\n", ns4 - ns3);
  x = 0;
  for (int i = 0; i < n; ++i)
    {
#ifndef EXCLUSIVE_SCAN
      x += i;
#endif
      if (b[i] != x)
        abort ();
      else
        b[i] = -1;
#ifdef EXCLUSIVE_SCAN
      x += i;
#endif
    }
  if (r != x)
    abort ();
  return 0;
}

	Jakub