On Thu, 2015-09-24 at 20:32 +0200, Jakub Jelinek wrote:
> Torvald, can you please have a look at it, if I got all the atomics / memory
> models right?
More detailed comments below, but in general, I'd really suggest adding
more code comments for the synchronization parts.  In the end, the level
of detail of documentation of libgomp is your decision, but, for example,
the lack of comments in synchronization code in glibc has made
maintaining this code and fixing issues in it very costly.  It has also
been hard to understand for many.

My suggestion would be both to (1) document the high-level, abstract
synchronization scheme and (2) how that scheme is implemented.  The first
point is important in my experience because, typically, the high-level
scheme and the actual thinking behind it (or, IOW, the intent of the
original author) is much harder to reconstruct for concurrent code than
it is for sequential code; you can't simply follow the program line by
line, but have to consider interleavings.  Even if the synchronization
problem to solve is relatively straightforward as in this case (ie,
one-directional waiting), it's worth doing point (1), IMO.  If it is
simple, the high-level description will be simple, and it will assure
others that one really only has to solve that and that the original
author wasn't aware of any other issues.

Regarding point (2), what we're doing in glibc now is basically to
document how the specific things we do in the code make sure we implement
the high-level scheme.  So we'd say things like "this CAS here now
ensures consensus among the threads A, B, C".  For memory orders
specifically, it helps to document why they are sufficient and necessary;
this helps others understand the code, so that they don't need to go
hunting through all of the code looking for other accesses to the same
memory locations to be able to reconstruct the intended happens-before
relations.  I have some examples below.

Also, given that you don't use explicit atomic types but just atomic
operations, it's IMO a good idea to document which variables are supposed
to be accessed atomically so it becomes easier not to violate the
data-race-freedom requirement accidentally.

> The testcase obviously is not a good benchmark, we'll need
> some more realistic one.  But obviously when asking for oversubscription, it
> is quite expensive.  The question is how to implement a non-busy waiting
> fallback, whether we put some mutex and queue guarded by the mutex into the
> same (or some other?) cache-line, or just use atomics to queue it and how to
> make it cheap for the case where busy waiting is sufficient.

Atomics and futexes are probably the best approach if you want
performance; at least we need some efficient way for post() to figure out
that there are indeed waiters, and we don't want to use a lock for that
(see the sketch below for the kind of handshake I mean).  What specific
approach to use is a question of how much time we want to spend on this.
It's hard to estimate how often we'd really need a blocking wait in
practice, though.

> I'd say
> it should be sufficient to implement non-busy waiting in the flattened
> variant.

Sounds fine to me.
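To make that a bit more concrete, here is a rough standalone sketch of
the kind of handshake I mean (this is not the patch; the futex wrappers
are just minimal Linux stand-ins for whatever helpers the target
provides, and all names are made up):

#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void
do_futex_wait (unsigned int *addr, unsigned int val)
{
  /* Blocks only if *addr still equals val at the time of the check.  */
  syscall (SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void
do_futex_wake (unsigned int *addr, int count)
{
  syscall (SYS_futex, addr, FUTEX_WAKE, count, NULL, NULL, 0);
}

/* Hypothetical per-entry state: the counter post/wait already use, plus
   a futex word and a waiter count in the same cache line.  */
struct doacross_ent
{
  unsigned long counter;   /* written by post, read by wait */
  unsigned int wake_gen;   /* futex word, bumped on every wake */
  unsigned int nwaiters;   /* nonzero while somebody may block */
};

static void
ent_post (struct doacross_ent *e, unsigned long newval)
{
  /* seq-cst rather than plain release so that this store and the
     nwaiters load below cannot pass each other; this pairs with the
     nwaiters increment / counter re-check in ent_wait, so either we see
     the waiter here or the waiter sees the new counter value.  */
  __atomic_store_n (&e->counter, newval, __ATOMIC_SEQ_CST);
  if (__atomic_load_n (&e->nwaiters, __ATOMIC_SEQ_CST) != 0)
    {
      __atomic_fetch_add (&e->wake_gen, 1, __ATOMIC_RELEASE);
      do_futex_wake (&e->wake_gen, INT_MAX);
    }
}

static void
ent_wait (struct doacross_ent *e, unsigned long expected)
{
  while (__atomic_load_n (&e->counter, __ATOMIC_ACQUIRE) < expected)
    {
      /* Acquire MO: if we already see a bumped wake_gen, we also see
         the counter store that preceded the bump and won't block.  */
      unsigned int gen = __atomic_load_n (&e->wake_gen, __ATOMIC_ACQUIRE);
      __atomic_fetch_add (&e->nwaiters, 1, __ATOMIC_SEQ_CST);
      /* Re-check after announcing ourselves; see ent_post.  */
      if (__atomic_load_n (&e->counter, __ATOMIC_SEQ_CST) < expected)
        do_futex_wait (&e->wake_gen, gen);
      __atomic_fetch_sub (&e->nwaiters, 1, __ATOMIC_RELAXED);
    }
}

In the common case nobody has blocked, nwaiters is zero, and post() pays
for one extra load instead of a syscall; whether that is cheap enough,
and how this should be folded into the existing spinning, is exactly the
sort of thing a realistic benchmark would have to tell us.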
> --- libgomp/ordered.c.jj	2015-09-18 18:36:42.000000000 +0200
> +++ libgomp/ordered.c	2015-09-24 18:20:28.286244397 +0200
> @@ -252,14 +254,146 @@ GOMP_ordered_end (void)
>  {
>  }
>  
> +/* DOACROSS initialization.  */
> +
> +#define MAX_COLLAPSED_BITS (__SIZEOF_LONG__ * __CHAR_BIT__)
> +
> +void
> +gomp_doacross_init (unsigned ncounts, long *counts, long chunk_size)
> +{
> +  struct gomp_thread *thr = gomp_thread ();
> +  struct gomp_team *team = thr->ts.team;
> +  struct gomp_work_share *ws = thr->ts.work_share;
> +  unsigned int i, bits[MAX_COLLAPSED_BITS], num_bits = 0;
> +  unsigned long ent, num_ents, elt_sz, shift_sz;
> +  struct gomp_doacross_work_share *doacross;
> +
> +  if (team == NULL || team->nthreads == 1)
> +    return;
> +
> +  for (i = 0; i < ncounts; i++)
> +    {
> +      /* If any count is 0, GOMP_doacross_{post,wait} can't be called.  */
> +      if (counts[i] == 0)
> +        return;
> +
> +      if (num_bits <= MAX_COLLAPSED_BITS)
> +        {
> +          unsigned int this_bits;
> +          if (counts[i] == 1)
> +            this_bits = 1;
> +          else
> +            this_bits = __SIZEOF_LONG__ * __CHAR_BIT__
> +                        - __builtin_clzl (counts[i] - 1);
> +          if (num_bits + this_bits <= MAX_COLLAPSED_BITS)
> +            {
> +              bits[i] = this_bits;
> +              num_bits += this_bits;
> +            }
> +          else
> +            num_bits = MAX_COLLAPSED_BITS + 1;
> +        }
> +    }
> +
> +  if (ws->sched == GFS_STATIC)
> +    num_ents = team->nthreads;
> +  else
> +    num_ents = (counts[0] - 1) / chunk_size + 1;
> +  if (num_bits <= MAX_COLLAPSED_BITS)
> +    {
> +      elt_sz = sizeof (unsigned long);
> +      shift_sz = ncounts * sizeof (unsigned int);
> +    }
> +  else
> +    {
> +      elt_sz = sizeof (unsigned long) * ncounts;
> +      shift_sz = 0;
> +    }
> +  elt_sz = (elt_sz + 63) & ~63UL;
> +
> +  doacross = gomp_malloc (sizeof (*doacross) + 63 + num_ents * elt_sz
> +                          + shift_sz);
> +  doacross->chunk_size = chunk_size;
> +  doacross->elt_sz = elt_sz;
> +  doacross->ncounts = ncounts;
> +  doacross->flattened = false;
> +  doacross->boundary = 0;
> +  doacross->array = (unsigned char *)
> +                    ((((uintptr_t) (doacross + 1)) + 63 + shift_sz)
> +                     & ~(uintptr_t) 63);
> +  if (num_bits <= MAX_COLLAPSED_BITS)
> +    {
> +      unsigned int shift_count = 0;
> +      doacross->flattened = true;
> +      for (i = ncounts; i > 0; i--)
> +        {
> +          doacross->shift_counts[i - 1] = shift_count;
> +          shift_count += bits[i - 1];
> +        }
> +      for (ent = 0; ent < num_ents; ent++)
> +        *(unsigned long *) (doacross->array + ent * elt_sz) = 0;
> +    }
> +  else
> +    for (ent = 0; ent < num_ents; ent++)
> +      memset (doacross->array + ent * elt_sz, '\0',
> +              sizeof (unsigned long) * ncounts);
> +  if (ws->sched == GFS_STATIC && chunk_size == 0)
> +    {
> +      unsigned long q = counts[0] / num_ents;
> +      unsigned long t = counts[0] % num_ents;
> +      doacross->boundary = t * (q + 1);
> +      doacross->q = q;
> +      doacross->t = t;
> +    }
> +  ws->doacross = doacross;
> +}
> +
>  /* DOACROSS POST operation.  */
>  
>  void
> -GOMP_doacross_post (long first, ...)
> +GOMP_doacross_post (long *counts)
>  {
> -  va_list ap;
> -  va_start (ap, first);
> -  va_end (ap);
> +  struct gomp_thread *thr = gomp_thread ();
> +  struct gomp_work_share *ws = thr->ts.work_share;
> +  struct gomp_doacross_work_share *doacross = ws->doacross;
> +  unsigned long ent;
> +  unsigned int i;
> +
> +  if (__builtin_expect (doacross == NULL, 0))
> +    {
> +      __sync_synchronize ();

Is this necessary for OpenMP no-ops, or what is the reasoning behind
having the full barrier here?  Full barriers aren't cheap.  Also, given
that you use the new-style atomic ops elsewhere, any reason to not use a
seq-cst thread fence?
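For instance, assuming the barrier is needed at all for this corner case
and that libgomp's MEMMODEL_SEQ_CST enumerator is available here, I would
expect something like:

  if (__builtin_expect (doacross == NULL, 0))
    {
      /* If ordering really is required on this no-op path, the
         new-style equivalent of __sync_synchronize () is a seq-cst
         fence; if it is not required, dropping it avoids a full
         barrier here.  */
      __atomic_thread_fence (MEMMODEL_SEQ_CST);
      return;
    }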
> +      return;
> +    }
> +
> +  if (__builtin_expect (ws->sched == GFS_STATIC, 1))
> +    ent = thr->ts.team_id;
> +  else
> +    ent = counts[0] / doacross->chunk_size;
> +  unsigned long *array = (unsigned long *) (doacross->array
> +                                            + ent * doacross->elt_sz);
> +
> +  if (__builtin_expect (doacross->flattened, 1))
> +    {
> +      unsigned long flattened
> +        = (unsigned long) counts[0] << doacross->shift_counts[0];
> +
> +      for (i = 1; i < doacross->ncounts; i++)
> +        flattened |= (unsigned long) counts[i]
> +                     << doacross->shift_counts[i];
> +      flattened++;
> +      if (flattened == __atomic_load_n (array, MEMMODEL_ACQUIRE))
> +        __atomic_thread_fence (MEMMODEL_RELEASE);

Can the same value actually be posted more than once?  If so, I suppose
this would be a program that has two equal post() statements,
unnecessarily?  If this isn't a common thing, I'd just avoid the
additional load.  But *another* thread can never store the same value, or
can it?

Also, even if trying to avoid the load, it wouldn't work like this.  The
acquire MO load is fine for ensuring a synchronizes-with relation with
the thread that produced this value with a release MO store.  But the
release MO fence is in itself not effective; it only takes effect through
other atomic stores sequenced after the fence.  So, if this thread needs
to make its application data changes happen-before the accesses to that
data after a wait reads this value of flattened, it needs to store to
flattened.  But if it stores the same value, the wait can't actually
distinguish between the two stores to flattened (unless you ensure that
through other sync), so it can't rely on seeing this thread's newer
changes to application data anyway.

> +      else
> +        __atomic_store_n (array, flattened, MEMMODEL_RELEASE);

OK.  I'd add a code comment like this: "We need release MO so that the
respective acquire MO load in GOMP_doacross_wait synchronizes with us;
this ensures that the changes in application data produced by this thread
before the post happen-before the accesses to this data after a matching
GOMP_doacross_wait."  This tells readers of the code what this is
intended to synchronize with, and why we need this happens-before
relation to make the abstract synchronization scheme work.

> +      return;
> +    }
> +
> +  __atomic_thread_fence (MEMMODEL_ACQUIRE);

What is the acquire fence for?  It would apply to atomic loads sequenced
before the fence, but I don't see any relevant ones, and you haven't
documented which these are supposed to be.  If it's supposed to be in
effect together with the relaxed MO in the loop below, then it has the
wrong position in the code.

> +  for (i = doacross->ncounts; i-- > 0; )
> +    {
> +      if (counts[i] + 1UL != __atomic_load_n (&array[i], MEMMODEL_RELAXED))

This looks okay, but it's worth documenting why relaxed MO is fine.  The
question to ask (and answer) is whether anyone reading this value tries
to rely on happening after app data changes by this call of post.  I
think this is not the case because (1) two different threads should not
attempt to store the same value in *all* elements of counts[] (ie, same
line of thought as above), and (2) a wait() that does depend on seeing
our app data updates will check all elements of counts[] and will thus
read at least from one release MO store by us, thus synchronizing with
us.

Is counts[] always lexicographically increasing?  (You start at the
"least-significant" element of it.)  Or is this just the case with
something like dependence folding in place on the compiler side?  If it's
not the case, we need to figure out something else to make the atomic
snapshot in wait() work without running into ABA issues.  See below...
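Coming back to the comment suggestion for the flattened release store
above: concretely, I would expect something like this at that spot (same
code, just with the intent spelled out):

      else
        /* Release MO so that the acquire MO load of this entry in
           GOMP_doacross_wait synchronizes with this store; this ensures
           that the changes to application data made by this thread
           before the post happen before the accesses a matching waiter
           performs after reading this value.  */
        __atomic_store_n (array, flattened, MEMMODEL_RELEASE);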
> +        __atomic_store_n (&array[i], counts[i] + 1UL, MEMMODEL_RELEASE);

OK.  I'd add a comment like the one for the release MO store above.
Additionally, I'd also explain why a release MO store is required for
each element to make the atomic snapshot in wait() work.  BTW, I hope
this example clarifies why I think more detailed comments can make things
easier for readers of the code; showing that this *really* works needs
quite a bit more reasoning...

> +    }
>  }
>  
>  /* DOACROSS WAIT operation.  */
>  
> @@ -267,7 +401,81 @@ GOMP_doacross_post (long first, ...)
>  void
>  GOMP_doacross_wait (long first, ...)
>  {
> +  struct gomp_thread *thr = gomp_thread ();
> +  struct gomp_work_share *ws = thr->ts.work_share;
> +  struct gomp_doacross_work_share *doacross = ws->doacross;
>    va_list ap;
> -  va_start (ap, first);
> -  va_end (ap);
> +  unsigned long ent;
> +  unsigned int i;
> +
> +  if (__builtin_expect (doacross == NULL, 0))
> +    {
> +      __sync_synchronize ();
> +      return;

See above.

> +    }
> +
> +  if (__builtin_expect (ws->sched == GFS_STATIC, 1))
> +    {
> +      if (ws->chunk_size == 0)
> +        {
> +          if (first < doacross->boundary)
> +            ent = first / (doacross->q + 1);
> +          else
> +            ent = (first - doacross->boundary) / doacross->q
> +                  + doacross->t;
> +        }
> +      else
> +        ent = first / ws->chunk_size % thr->ts.team->nthreads;
> +    }
> +  else
> +    ent = first / doacross->chunk_size;
> +  unsigned long *array = (unsigned long *) (doacross->array
> +                                            + ent * doacross->elt_sz);
> +
> +  if (__builtin_expect (doacross->flattened, 1))
> +    {
> +      unsigned long flattened
> +        = (unsigned long) first << doacross->shift_counts[0];
> +      unsigned long cur;
> +
> +      va_start (ap, first);
> +      for (i = 1; i < doacross->ncounts; i++)
> +        flattened |= (unsigned long) va_arg (ap, long)
> +                     << doacross->shift_counts[i];
> +      cur = __atomic_load_n (array, MEMMODEL_ACQUIRE);

OK, but I'd add a small comment that this is supposed to synchronize with
just the release MO store in post(), and that post() describes why this
is necessary.

> +      if (flattened < cur)
> +        {
> +          __atomic_thread_fence (MEMMODEL_RELEASE);

What is this trying to accomplish?  Where is the atomic store that is to
be combined with the fence, and which happens-before relation do you want
to enforce here?

> +          va_end (ap);
> +          return;
> +        }
> +      doacross_spin (array, flattened, cur);

doacross_spin uses relaxed MO.  But you need acquire MO here, I believe.
Either doacross_spin needs to issue an acquire MO fence after reading the
expected value with relaxed MO, or you need to issue an acquire MO fence
here.  The latter might allow for better reuse of doacross_spin.  A quick
comment about the required acquire MO would be helpful here too, IMO.

> +      __atomic_thread_fence (MEMMODEL_RELEASE);

See above.

> +      va_end (ap);
> +      return;
> +    }
> +

The following relies on array[] being lexicographically monotonically
increasing, and on post changing array[] in such a way that the
least-significant element is modified first.  If post does not do that,
this will be prone to ABA issues, I think.

I really think it should be documented how the abstract synchronization
scheme is supposed to work, and why it works.  I can provide text for
that if you want.  If you or somebody else wants to practise reasoning
about concurrent code, this piece of synchronization would be a good
exercise.  Start with two elements, perhaps use cppmem, and explore which
values can be written and read.  Don't forget to work with the
assumptions we make.  Documenting why it works in clear terms is a good
test for whether the reasoning is sound.  (And that's another reason why
I think that documenting the concurrent code is helpful; it's like a test
for your code.  I've often found bugs in my concurrent code just by
noticing that I couldn't really thoroughly explain why it would have to
work...)
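To make the suggested exercise concrete, here is the simplest two-element
instance, written as a plain C litmus test rather than in cppmem syntax
(one post, one wait, made-up names; it deliberately ignores the
multiple-posters/ABA question raised above):

#include <pthread.h>
#include <stdio.h>

static unsigned long a0, a1;  /* array[0] (most significant), array[1] */
static int data;              /* application data published by the post */

static void
poster (void)                 /* post of the iteration vector (1, 3) */
{
  data = 42;
  /* Least-significant element first, each with release MO...  */
  __atomic_store_n (&a1, 3 + 1, __ATOMIC_RELEASE);
  /* ...most-significant element last.  */
  __atomic_store_n (&a0, 1 + 1, __ATOMIC_RELEASE);
}

static int
waiter (void)                 /* wait for the iteration vector (1, 3) */
{
  /* Most-significant element first, with acquire MO: if we read a value
     stored by poster, everything sequenced before that release store in
     poster (the store to a1 and to data) happens before our loads
     below.  */
  unsigned long c0 = __atomic_load_n (&a0, __ATOMIC_ACQUIRE);
  if (c0 < 1 + 1)
    return -1;                /* not posted yet: retry */
  unsigned long c1 = __atomic_load_n (&a1, __ATOMIC_ACQUIRE);
  if (c1 < 3 + 1)
    return -1;                /* retry */
  /* Guaranteed to read 42: the acquire load of a0 read from a release
     store that is sequenced after data = 42 in poster.  */
  return data;
}

static void *
poster_fn (void *arg)
{
  (void) arg;
  poster ();
  return NULL;
}

int
main (void)
{
  pthread_t t;
  int r;
  pthread_create (&t, NULL, poster_fn, NULL);
  while ((r = waiter ()) < 0)
    ;                         /* spin until the post is visible */
  printf ("%d\n", r);         /* always prints 42 */
  pthread_join (t, NULL);
  return 0;
}

The interesting part of the exercise starts when a second post reuses the
same entry with a larger vector: that is where the least-significant-
element-first store order, the per-element release MO, and the ABA
concern above come into play, and that is the reasoning I would like to
see written down.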
> +  do
> +    {
> +      va_start (ap, first);
> +      for (i = 0; i < doacross->ncounts; i++)
> +        {
> +          unsigned long thisv
> +            = (unsigned long) (i ? va_arg (ap, long) : first) + 1;
> +          unsigned long cur = __atomic_load_n (&array[i], MEMMODEL_RELAXED);

This needs to be acquire MO.  The key here is to ensure that changes to
less significant elements of array[] by the same thread that did the
store we read from here do happen before our load.  I'll leave the reason
for that as an exercise (see above) :)

> +          if (thisv < cur)

Why does this work?  Say we wait for (2,6), and post has stored the value
(2+1,3+1) (the +1 is the flattened++; in post()).  This will see 2<3 and
then break out of the loop, after which it will try the loop again, and
will never get to comparing the second element.

> +            {
> +              i = doacross->ncounts;
> +              break;
> +            }
> +          if (thisv > cur)
> +            break;

Can't you just use doacross_spin here and wait until this element has the
expected value (with an added acquire MO fence after it)?  This should be
better than just calling cpu_relax at the end of the loop.

> +        }
> +      va_end (ap);
> +      if (i == doacross->ncounts)
> +        break;
> +      cpu_relax ();
> +    }
> +  while (1);
> +  __sync_synchronize ();

What is this full barrier for?  Aren't the acquire MO loads above
sufficient?

> --- libgomp/libgomp_g.h.jj	2015-09-17 09:25:23.000000000 +0200
> +++ libgomp/libgomp_g.h	2015-09-24 13:33:32.726324481 +0200
> @@ -78,6 +78,36 @@ enum gomp_schedule_type
>    GFS_AUTO
>  };
>  
> +struct gomp_doacross_work_share
> +{
> +  union {
> +    /* chunk_size copy, as ws->chunk_size is multiplied by incr for
> +       GFS_DYNAMIC.  */
> +    long chunk_size;
> +    /* For schedule(static,0) this is the number
> +       of iterations assigned to the last thread, i.e. number of
> +       iterations / number of threads.  */
> +    long q;
> +  };
> +  /* Size of each array entry (padded to cache line size).  */
> +  unsigned long elt_sz;
> +  /* Number of dimensions in sink vectors.  */
> +  unsigned int ncounts;
> +  /* True if the iterations can be flattened.  */
> +  bool flattened;
> +  /* Actual array (of elt_sz sized units), aligned to cache line size.
> +     This is indexed by team_id for GFS_STATIC and outermost iteration
> +     / chunk_size for other schedules.  */
> +  unsigned char *array;
> +  /* These two are only used for schedule(static,0).  */
> +  /* This one is number of iterations % number of threads.  */
> +  long t;
> +  /* And this one is cached t * (q + 1).  */
> +  long boundary;
> +  /* Array of shift counts for each dimension if they can be flattened.  */
> +  unsigned int shift_counts[];
> +};

As mentioned above, I'd quickly document which of these must be accessed
with atomic operations to achieve data-race freedom.
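For instance, something along these lines on the array member (wording to
be adjusted to whatever the actual publication scheme for the rest of the
struct is; I'm assuming the struct itself is made visible to the team by
the work-share setup before any post or wait runs):

  /* Actual array (of elt_sz sized units), aligned to cache line size.
     This is indexed by team_id for GFS_STATIC and outermost iteration
     / chunk_size for other schedules.
     The unsigned long counters stored in the entries are written
     concurrently by GOMP_doacross_post and read by GOMP_doacross_wait
     and must therefore only be accessed with atomic operations; all
     other fields of this struct are written in gomp_doacross_init and
     only read afterwards.  */
  unsigned char *array;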
> --- libgomp/config/posix/doacross.h.jj	2015-09-23 12:17:53.217834221 +0200
> +++ libgomp/config/posix/doacross.h	2015-09-24 10:51:18.310081801 +0200
> @@ -0,0 +1,62 @@
> +/* Copyright (C) 2015 Free Software Foundation, Inc.
> +   Contributed by Jakub Jelinek <ja...@redhat.com>.
> +
> +   This file is part of the GNU Offloading and Multi Processing Library
> +   (libgomp).
> +
> +   Libgomp is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
> +   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
> +   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> +   more details.
> +
> +   Under Section 7 of GPL version 3, you are granted additional
> +   permissions described in the GCC Runtime Library Exception, version
> +   3.1, as published by the Free Software Foundation.
> +
> +   You should have received a copy of the GNU General Public License and
> +   a copy of the GCC Runtime Library Exception along with this program;
> +   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* This is a generic implementation of doacross spinning.  */
> +
> +#ifndef GOMP_DOACROSS_H
> +#define GOMP_DOACROSS_H 1
> +
> +#include "libgomp.h"
> +#include <errno.h>
> +
> +#ifdef HAVE_ATTRIBUTE_VISIBILITY
> +# pragma GCC visibility push(hidden)
> +#endif
> +
> +static inline void
> +cpu_relax (void)
> +{
> +  __asm volatile ("" : : : "memory");
> +}

IIRC, cpu_relax() is something like the pause instruction on x86, meant
to tell the CPU that busy-waiting is happening.  The compiler barrier in
the asm above doesn't do that.  It's also not necessary if atomic
operations are used for the spinning (as opposed to plain memory
accesses), because the atomics (even with relaxed MO) already prevent the
compiler from optimizing them away.  (Specifically, the standards require
that threads eventually observe the most recent store to a value; if the
compiler optimized away a spin loop over relaxed MO atomics, it would
violate this forward progress requirement.  This is different if plain
accesses are used, because the compiler is allowed to assume that a
program contains no endless loops without I/O or synchronization.)

> +
> +static inline void doacross_spin (unsigned long *addr, unsigned long expected,
> +                                  unsigned long cur)
> +{
> +  /* FIXME: back off depending on how large expected - cur is.  */
> +  do
> +    {
> +      cpu_relax ();
> +      cur = __atomic_load_n (addr, MEMMODEL_RELAXED);
> +      if (expected < cur)
> +        return;
> +    }
> +  while (1);
> +}
> +
> +#ifdef HAVE_ATTRIBUTE_VISIBILITY
> +# pragma GCC visibility pop
> +#endif
> +
> +#endif /* GOMP_DOACROSS_H */
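Concretely, for an x86 configuration I would expect something along these
lines (a sketch only; the pause hint and the placement of the acquire
fence are the two points above, and whether the fence lives here or at
the call sites is a matter of taste):

static inline void
cpu_relax (void)
{
#if defined __x86_64__ || defined __i386__
  __builtin_ia32_pause ();    /* tell the CPU we are busy-waiting */
#endif
}

static inline void
doacross_spin (unsigned long *addr, unsigned long expected,
               unsigned long cur)
{
  /* FIXME: back off depending on how large expected - cur is.  */
  do
    {
      cpu_relax ();
      cur = __atomic_load_n (addr, MEMMODEL_RELAXED);
    }
  while (expected >= cur);
  /* Relaxed MO is fine while spinning, but callers rely on
     synchronizing with the release MO store that produced the value we
     waited for, so upgrade the final load with an acquire fence (the
     alternative is an acquire fence after each call site).  */
  __atomic_thread_fence (MEMMODEL_ACQUIRE);
}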