On Thu, 2015-09-24 at 20:32 +0200, Jakub Jelinek wrote:
> Torvald, can you please have a look at it, if I got all the atomics / memory
> models right?
More detailed comments below, but in general, I'd really suggest adding
more code comments for the synchronization parts.  In the end, the level
of detail of documentation of libgomp is your decision, but, for example,
the lack of comments in synchronization code in glibc has made
maintaining this code and fixing issues in it very costly.  It has also
been hard to understand for many.

My suggestion would be both to (1) document the high-level, abstract
synchronization scheme and (2) how that scheme is implemented.  The first
point is important in my experience because, typically, the high-level
scheme and the actual thinking behind it (or, IOW, the intent of the
original author) is much harder to reconstruct for concurrent code than
it is for sequential code; you can't simply follow the program line by
line, but have to consider interleavings.  Even if the synchronization
problem to solve is relatively straightforward as in this case (ie,
one-directional waiting), it's worth doing point (1), IMO.  If it is
simple, the high-level description will be simple, and it will assure
others that one really only has to solve that and that the original
author wasn't aware of any other issues.

Regarding point (2), what we're doing in glibc now is basically to
document how the specific things we do in the code make sure we implement
the high-level scheme.  So we'd say things like "this CAS here now
ensures consensus among the threads A, B, C".  For memory orders
specifically, it helps to document why they are sufficient and necessary;
this helps others understand the code, so that they don't need to go
hunting through all of the code looking for other accesses to the same
memory locations to be able to reconstruct the intended happens-before
relations.  I have some examples below.

Also, given that you don't use explicit atomic types but just atomic
operations, it's IMO a good idea to document which variables are supposed
to be accessed atomically so it becomes easier not to violate the
data-race-freedom requirement accidentally.

> The testcase obviously is not a good benchmark, we'll need
> some more realistic one.  But obviously when asking for oversubscription, it
> is quite expensive.  The question is how to implement a non-busy waiting
> fallback, whether we put some mutex and queue guarded by the mutex into the
> same (or some other?) cache-line, or just use atomics to queue it and how to
> make it cheap for the case where busy waiting is sufficient.

Atomics and futexes are probably the best approach if you want
performance; at least we need some efficient way for post() to figure out
that there are indeed waiters, and we don't want to use a lock for that
(see the sketch below for the kind of handshake I mean).  What specific
approach to use is a question of how much time we want to spend on this.
It's hard to estimate how often we'd really need a blocking wait in
practice, though.

> I'd say
> it should be sufficient to implement non-busy waiting in the flattened
> variant.

Sounds fine to me.
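To make that a bit more concrete, here is a rough standalone sketch of
the kind of handshake I mean (this is not the patch; the futex wrappers
are just minimal Linux stand-ins for whatever helpers the target
provides, and all names are made up):

#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void
do_futex_wait (unsigned int *addr, unsigned int val)
{
  /* Blocks only if *addr still equals val at the time of the check.  */
  syscall (SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void
do_futex_wake (unsigned int *addr, int count)
{
  syscall (SYS_futex, addr, FUTEX_WAKE, count, NULL, NULL, 0);
}

/* Hypothetical per-entry state: the counter post/wait already use, plus
   a futex word and a waiter count in the same cache line.  */
struct doacross_ent
{
  unsigned long counter;   /* written by post, read by wait */
  unsigned int wake_gen;   /* futex word, bumped on every wake */
  unsigned int nwaiters;   /* nonzero while somebody may block */
};

static void
ent_post (struct doacross_ent *e, unsigned long newval)
{
  /* seq-cst rather than plain release so that this store and the
     nwaiters load below cannot pass each other; this pairs with the
     nwaiters increment / counter re-check in ent_wait, so either we see
     the waiter here or the waiter sees the new counter value.  */
  __atomic_store_n (&e->counter, newval, __ATOMIC_SEQ_CST);
  if (__atomic_load_n (&e->nwaiters, __ATOMIC_SEQ_CST) != 0)
    {
      __atomic_fetch_add (&e->wake_gen, 1, __ATOMIC_RELEASE);
      do_futex_wake (&e->wake_gen, INT_MAX);
    }
}

static void
ent_wait (struct doacross_ent *e, unsigned long expected)
{
  while (__atomic_load_n (&e->counter, __ATOMIC_ACQUIRE) < expected)
    {
      /* Acquire MO: if we already see a bumped wake_gen, we also see
         the counter store that preceded the bump and won't block.  */
      unsigned int gen = __atomic_load_n (&e->wake_gen, __ATOMIC_ACQUIRE);
      __atomic_fetch_add (&e->nwaiters, 1, __ATOMIC_SEQ_CST);
      /* Re-check after announcing ourselves; see ent_post.  */
      if (__atomic_load_n (&e->counter, __ATOMIC_SEQ_CST) < expected)
        do_futex_wait (&e->wake_gen, gen);
      __atomic_fetch_sub (&e->nwaiters, 1, __ATOMIC_RELAXED);
    }
}

In the common case nobody has blocked, nwaiters is zero, and post() pays
for one extra load instead of a syscall; whether that is cheap enough,
and how this should be folded into the existing spinning, is exactly the
sort of thing a realistic benchmark would have to tell us.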
> --- libgomp/ordered.c.jj	2015-09-18 18:36:42.000000000 +0200
> +++ libgomp/ordered.c	2015-09-24 18:20:28.286244397 +0200
> @@ -252,14 +254,146 @@ GOMP_ordered_end (void)
>  {
>  }
>  
> +/* DOACROSS initialization.  */
> +
> +#define MAX_COLLAPSED_BITS (__SIZEOF_LONG__ * __CHAR_BIT__)
> +
> +void
> +gomp_doacross_init (unsigned ncounts, long *counts, long chunk_size)
> +{
> +  struct gomp_thread *thr = gomp_thread ();
> +  struct gomp_team *team = thr->ts.team;
> +  struct gomp_work_share *ws = thr->ts.work_share;
> +  unsigned int i, bits[MAX_COLLAPSED_BITS], num_bits = 0;
> +  unsigned long ent, num_ents, elt_sz, shift_sz;
> +  struct gomp_doacross_work_share *doacross;
> +
> +  if (team == NULL || team->nthreads == 1)
> +    return;
> +
> +  for (i = 0; i < ncounts; i++)
> +    {
> +      /* If any count is 0, GOMP_doacross_{post,wait} can't be called.  */
> +      if (counts[i] == 0)
> +        return;
> +
> +      if (num_bits <= MAX_COLLAPSED_BITS)
> +        {
> +          unsigned int this_bits;
> +          if (counts[i] == 1)
> +            this_bits = 1;
> +          else
> +            this_bits = __SIZEOF_LONG__ * __CHAR_BIT__
> +                        - __builtin_clzl (counts[i] - 1);
> +          if (num_bits + this_bits <= MAX_COLLAPSED_BITS)
> +            {
> +              bits[i] = this_bits;
> +              num_bits += this_bits;
> +            }
> +          else
> +            num_bits = MAX_COLLAPSED_BITS + 1;
> +        }
> +    }
> +
> +  if (ws->sched == GFS_STATIC)
> +    num_ents = team->nthreads;
> +  else
> +    num_ents = (counts[0] - 1) / chunk_size + 1;
> +  if (num_bits <= MAX_COLLAPSED_BITS)
> +    {
> +      elt_sz = sizeof (unsigned long);
> +      shift_sz = ncounts * sizeof (unsigned int);
> +    }
> +  else
> +    {
> +      elt_sz = sizeof (unsigned long) * ncounts;
> +      shift_sz = 0;
> +    }
> +  elt_sz = (elt_sz + 63) & ~63UL;
> +
> +  doacross = gomp_malloc (sizeof (*doacross) + 63 + num_ents * elt_sz
> +                          + shift_sz);
> +  doacross->chunk_size = chunk_size;
> +  doacross->elt_sz = elt_sz;
> +  doacross->ncounts = ncounts;
> +  doacross->flattened = false;
> +  doacross->boundary = 0;
> +  doacross->array = (unsigned char *)
> +                    ((((uintptr_t) (doacross + 1)) + 63 + shift_sz)
> +                     & ~(uintptr_t) 63);
> +  if (num_bits <= MAX_COLLAPSED_BITS)
> +    {
> +      unsigned int shift_count = 0;
> +      doacross->flattened = true;
> +      for (i = ncounts; i > 0; i--)
> +        {
> +          doacross->shift_counts[i - 1] = shift_count;
> +          shift_count += bits[i - 1];
> +        }
> +      for (ent = 0; ent < num_ents; ent++)
> +        *(unsigned long *) (doacross->array + ent * elt_sz) = 0;
> +    }
> +  else
> +    for (ent = 0; ent < num_ents; ent++)
> +      memset (doacross->array + ent * elt_sz, '\0',
> +              sizeof (unsigned long) * ncounts);
> +  if (ws->sched == GFS_STATIC && chunk_size == 0)
> +    {
> +      unsigned long q = counts[0] / num_ents;
> +      unsigned long t = counts[0] % num_ents;
> +      doacross->boundary = t * (q + 1);
> +      doacross->q = q;
> +      doacross->t = t;
> +    }
> +  ws->doacross = doacross;
> +}
> +
>  /* DOACROSS POST operation.  */
>  
>  void
> -GOMP_doacross_post (long first, ...)
> +GOMP_doacross_post (long *counts)
>  {
> -  va_list ap;
> -  va_start (ap, first);
> -  va_end (ap);
> +  struct gomp_thread *thr = gomp_thread ();
> +  struct gomp_work_share *ws = thr->ts.work_share;
> +  struct gomp_doacross_work_share *doacross = ws->doacross;
> +  unsigned long ent;
> +  unsigned int i;
> +
> +  if (__builtin_expect (doacross == NULL, 0))
> +    {
> +      __sync_synchronize ();

Is this necessary for OpenMP no-ops, or what is the reasoning behind
having the full barrier here?  Full barriers aren't cheap.  Also, given
that you use the new-style atomic ops elsewhere, any reason to not use a
seq-cst thread fence?
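For instance, assuming the barrier is needed at all for this corner case
and that libgomp's MEMMODEL_SEQ_CST enumerator is available here, I would
expect something like:

  if (__builtin_expect (doacross == NULL, 0))
    {
      /* If ordering really is required on this no-op path, the
         new-style equivalent of __sync_synchronize () is a seq-cst
         fence; if it is not required, dropping it avoids a full
         barrier here.  */
      __atomic_thread_fence (MEMMODEL_SEQ_CST);
      return;
    }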
> +      return;
> +    }
> +
> +  if (__builtin_expect (ws->sched == GFS_STATIC, 1))
> +    ent = thr->ts.team_id;
> +  else
> +    ent = counts[0] / doacross->chunk_size;
> +  unsigned long *array = (unsigned long *) (doacross->array
> +                                            + ent * doacross->elt_sz);
> +
> +  if (__builtin_expect (doacross->flattened, 1))
> +    {
> +      unsigned long flattened
> +        = (unsigned long) counts[0] << doacross->shift_counts[0];
> +
> +      for (i = 1; i < doacross->ncounts; i++)
> +        flattened |= (unsigned long) counts[i]
> +                     << doacross->shift_counts[i];
> +      flattened++;
> +      if (flattened == __atomic_load_n (array, MEMMODEL_ACQUIRE))
> +        __atomic_thread_fence (MEMMODEL_RELEASE);

Can the same value actually be posted more than once?  If so, I suppose
this would be a program that has two equal post() statements,
unnecessarily?  If this isn't a common thing, I'd just avoid the
additional load.  But *another* thread can never store the same value, or
can it?

Also, even if trying to avoid the load, it wouldn't work like this.  The
acquire MO load is fine for ensuring a synchronizes-with relation with
the thread that produced this value with a release MO store.  But the
release MO fence is in itself not effective; it only takes effect through
other atomic stores sequenced after the fence.  So, if this thread needs
to make its application data changes happen-before the accesses to that
data after a wait reads this value of flattened, it needs to store to
flattened.  But if it stores the same value, the wait can't actually
distinguish between the two stores to flattened (unless you ensure that
through other sync), so it can't rely on seeing this thread's newer
changes to application data anyway.

> +      else
> +        __atomic_store_n (array, flattened, MEMMODEL_RELEASE);

OK.  I'd add a code comment like this: "We need release MO so that the
respective acquire MO load in GOMP_doacross_wait synchronizes with us;
this ensures that the changes in application data produced by this thread
before the post happen-before the accesses to this data after a matching
GOMP_doacross_wait."  This tells readers of the code what this is
intended to synchronize with, and why we need this happens-before
relation to make the abstract synchronization scheme work.

> +      return;
> +    }
> +
> +  __atomic_thread_fence (MEMMODEL_ACQUIRE);

What is the acquire fence for?  It would apply to atomic loads sequenced
before the fence, but I don't see any relevant ones, and you haven't
documented which these are supposed to be.  If it's supposed to be in
effect together with the relaxed MO in the loop below, then it has the
wrong position in the code.

> +  for (i = doacross->ncounts; i-- > 0; )
> +    {
> +      if (counts[i] + 1UL != __atomic_load_n (&array[i], MEMMODEL_RELAXED))

This looks okay, but it's worth documenting why relaxed MO is fine.  The
question to ask (and answer) is whether anyone reading this value tries
to rely on happening after app data changes by this call of post.  I
think this is not the case because (1) two different threads should not
attempt to store the same value in *all* elements of counts[] (ie, same
line of thought as above), and (2) a wait() that does depend on seeing
our app data updates will check all elements of counts[] and will thus
read at least from one release MO store by us, thus synchronizing with
us.

Is counts[] always lexicographically increasing?  (You start at the
"least-significant" element of it.)  Or is this just the case with
something like dependence folding in place on the compiler side?  If it's
not the case, we need to figure out something else to make the atomic
snapshot in wait() work without running into ABA issues.  See below...
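Coming back to the comment suggestion for the flattened release store
above: concretely, I would expect something like this at that spot (same
code, just with the intent spelled out):

      else
        /* Release MO so that the acquire MO load of this entry in
           GOMP_doacross_wait synchronizes with this store; this ensures
           that the changes to application data made by this thread
           before the post happen before the accesses a matching waiter
           performs after reading this value.  */
        __atomic_store_n (array, flattened, MEMMODEL_RELEASE);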
> +        __atomic_store_n (&array[i], counts[i] + 1UL, MEMMODEL_RELEASE);

OK.  I'd add a comment like the one for the release MO store above.
Additionally, I'd also explain why a release MO store is required for
each element to make the atomic snapshot in wait() work.  BTW, I hope
this example clarifies why I think more detailed comments can make things
easier for readers of the code; showing that this *really* works needs
quite a bit more reasoning...

> +    }
>  }
>  
>  /* DOACROSS WAIT operation.  */
>  
> @@ -267,7 +401,81 @@ GOMP_doacross_post (long first, ...)
>  void
>  GOMP_doacross_wait (long first, ...)
>  {
> +  struct gomp_thread *thr = gomp_thread ();
> +  struct gomp_work_share *ws = thr->ts.work_share;
> +  struct gomp_doacross_work_share *doacross = ws->doacross;
>    va_list ap;
> -  va_start (ap, first);
> -  va_end (ap);
> +  unsigned long ent;
> +  unsigned int i;
> +
> +  if (__builtin_expect (doacross == NULL, 0))
> +    {
> +      __sync_synchronize ();
> +      return;

See above.

> +    }
> +
> +  if (__builtin_expect (ws->sched == GFS_STATIC, 1))
> +    {
> +      if (ws->chunk_size == 0)
> +        {
> +          if (first < doacross->boundary)
> +            ent = first / (doacross->q + 1);
> +          else
> +            ent = (first - doacross->boundary) / doacross->q
> +                  + doacross->t;
> +        }
> +      else
> +        ent = first / ws->chunk_size % thr->ts.team->nthreads;
> +    }
> +  else
> +    ent = first / doacross->chunk_size;
> +  unsigned long *array = (unsigned long *) (doacross->array
> +                                            + ent * doacross->elt_sz);
> +
> +  if (__builtin_expect (doacross->flattened, 1))
> +    {
> +      unsigned long flattened
> +        = (unsigned long) first << doacross->shift_counts[0];
> +      unsigned long cur;
> +
> +      va_start (ap, first);
> +      for (i = 1; i < doacross->ncounts; i++)
> +        flattened |= (unsigned long) va_arg (ap, long)
> +                     << doacross->shift_counts[i];
> +      cur = __atomic_load_n (array, MEMMODEL_ACQUIRE);

OK, but I'd add a small comment that this is supposed to synchronize with
just the release MO store in post(), and that post() describes why this
is necessary.

> +      if (flattened < cur)
> +        {
> +          __atomic_thread_fence (MEMMODEL_RELEASE);

What is this trying to accomplish?  Where is the atomic store that is to
be combined with the fence, and which happens-before relation do you want
to enforce here?

> +          va_end (ap);
> +          return;
> +        }
> +      doacross_spin (array, flattened, cur);

doacross_spin uses relaxed MO.  But you need acquire MO here, I believe.
Either doacross_spin needs to issue an acquire MO fence after reading the
expected value with relaxed MO, or you need to issue an acquire MO fence
here.  The latter might allow for better reuse of doacross_spin.  A quick
comment about the required acquire MO would be helpful here too, IMO.

> +      __atomic_thread_fence (MEMMODEL_RELEASE);

See above.

> +      va_end (ap);
> +      return;
> +    }
> +

The following relies on array[] being lexicographically monotonically
increasing, and on post changing array[] in such a way that the
least-significant element is modified first.  If post does not do that,
this will be prone to ABA issues, I think.

I really think it should be documented how the abstract synchronization
scheme is supposed to work, and why it works.  I can provide text for
that if you want.  If you or somebody else wants to practise reasoning
about concurrent code, this piece of synchronization would be a good
exercise.  Start with two elements, perhaps use cppmem, and explore which
values can be written and read.  Don't forget to work with the
assumptions we make.  Documenting why it works in clear terms is a good
test for whether the reasoning is sound.  (And that's another reason why
I think that documenting the concurrent code is helpful; it's like a test
for your code.  I've often found bugs in my concurrent code just by
noticing that I couldn't really thoroughly explain why it would have to
work...)
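To make the suggested exercise concrete, here is the simplest two-element
instance, written as a plain C litmus test rather than in cppmem syntax
(one post, one wait, made-up names; it deliberately ignores the
multiple-posters/ABA question raised above):

#include <pthread.h>
#include <stdio.h>

static unsigned long a0, a1;  /* array[0] (most significant), array[1] */
static int data;              /* application data published by the post */

static void
poster (void)                 /* post of the iteration vector (1, 3) */
{
  data = 42;
  /* Least-significant element first, each with release MO...  */
  __atomic_store_n (&a1, 3 + 1, __ATOMIC_RELEASE);
  /* ...most-significant element last.  */
  __atomic_store_n (&a0, 1 + 1, __ATOMIC_RELEASE);
}

static int
waiter (void)                 /* wait for the iteration vector (1, 3) */
{
  /* Most-significant element first, with acquire MO: if we read a value
     stored by poster, everything sequenced before that release store in
     poster (the store to a1 and to data) happens before our loads
     below.  */
  unsigned long c0 = __atomic_load_n (&a0, __ATOMIC_ACQUIRE);
  if (c0 < 1 + 1)
    return -1;                /* not posted yet: retry */
  unsigned long c1 = __atomic_load_n (&a1, __ATOMIC_ACQUIRE);
  if (c1 < 3 + 1)
    return -1;                /* retry */
  /* Guaranteed to read 42: the acquire load of a0 read from a release
     store that is sequenced after data = 42 in poster.  */
  return data;
}

static void *
poster_fn (void *arg)
{
  (void) arg;
  poster ();
  return NULL;
}

int
main (void)
{
  pthread_t t;
  int r;
  pthread_create (&t, NULL, poster_fn, NULL);
  while ((r = waiter ()) < 0)
    ;                         /* spin until the post is visible */
  printf ("%d\n", r);         /* always prints 42 */
  pthread_join (t, NULL);
  return 0;
}

The interesting part of the exercise starts when a second post reuses the
same entry with a larger vector: that is where the least-significant-
element-first store order, the per-element release MO, and the ABA
concern above come into play, and that is the reasoning I would like to
see written down.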
> +  do
> +    {
> +      va_start (ap, first);
> +      for (i = 0; i < doacross->ncounts; i++)
> +        {
> +          unsigned long thisv
> +            = (unsigned long) (i ? va_arg (ap, long) : first) + 1;
> +          unsigned long cur = __atomic_load_n (&array[i], MEMMODEL_RELAXED);

This needs to be acquire MO.  The key here is to ensure that changes to
less significant elements of array[] by the same thread that did the
store we read from here do happen before our load.  I'll leave the reason
for that as an exercise (see above) :)

> +          if (thisv < cur)

Why does this work?  Say we wait for (2,6), and post has stored the value
(2+1,3+1) (the +1 is the flattened++; in post()).  This will see 2<3 and
then break out of the loop, after which it will try the loop again, and
will never get to comparing the second element.

> +            {
> +              i = doacross->ncounts;
> +              break;
> +            }
> +          if (thisv > cur)
> +            break;

Can't you just use doacross_spin here and wait until this element has the
expected value (with an added acquire MO fence after it)?  This should be
better than just calling cpu_relax at the end of the loop.

> +        }
> +      va_end (ap);
> +      if (i == doacross->ncounts)
> +        break;
> +      cpu_relax ();
> +    }
> +  while (1);
> +  __sync_synchronize ();

What is this full barrier for?  Aren't the acquire MO loads above
sufficient?

> --- libgomp/libgomp_g.h.jj	2015-09-17 09:25:23.000000000 +0200
> +++ libgomp/libgomp_g.h	2015-09-24 13:33:32.726324481 +0200
> @@ -78,6 +78,36 @@ enum gomp_schedule_type
>    GFS_AUTO
>  };
>  
> +struct gomp_doacross_work_share
> +{
> +  union {
> +    /* chunk_size copy, as ws->chunk_size is multiplied by incr for
> +       GFS_DYNAMIC.  */
> +    long chunk_size;
> +    /* For schedule(static,0) this is the number
> +       of iterations assigned to the last thread, i.e. number of
> +       iterations / number of threads.  */
> +    long q;
> +  };
> +  /* Size of each array entry (padded to cache line size).  */
> +  unsigned long elt_sz;
> +  /* Number of dimensions in sink vectors.  */
> +  unsigned int ncounts;
> +  /* True if the iterations can be flattened.  */
> +  bool flattened;
> +  /* Actual array (of elt_sz sized units), aligned to cache line size.
> +     This is indexed by team_id for GFS_STATIC and outermost iteration
> +     / chunk_size for other schedules.  */
> +  unsigned char *array;
> +  /* These two are only used for schedule(static,0).  */
> +  /* This one is number of iterations % number of threads.  */
> +  long t;
> +  /* And this one is cached t * (q + 1).  */
> +  long boundary;
> +  /* Array of shift counts for each dimension if they can be flattened.  */
> +  unsigned int shift_counts[];
> +};

As mentioned above, I'd quickly document which of these must be accessed
with atomic operations to achieve data-race freedom.
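For instance, something along these lines on the array member (wording to
be adjusted to whatever the actual publication scheme for the rest of the
struct is; I'm assuming the struct itself is made visible to the team by
the work-share setup before any post or wait runs):

  /* Actual array (of elt_sz sized units), aligned to cache line size.
     This is indexed by team_id for GFS_STATIC and outermost iteration
     / chunk_size for other schedules.
     The unsigned long counters stored in the entries are written
     concurrently by GOMP_doacross_post and read by GOMP_doacross_wait
     and must therefore only be accessed with atomic operations; all
     other fields of this struct are written in gomp_doacross_init and
     only read afterwards.  */
  unsigned char *array;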
> --- libgomp/config/posix/doacross.h.jj	2015-09-23 12:17:53.217834221 +0200
> +++ libgomp/config/posix/doacross.h	2015-09-24 10:51:18.310081801 +0200
> @@ -0,0 +1,62 @@
> +/* Copyright (C) 2015 Free Software Foundation, Inc.
> +   Contributed by Jakub Jelinek <ja...@redhat.com>.
> +
> +   This file is part of the GNU Offloading and Multi Processing Library
> +   (libgomp).
> +
> +   Libgomp is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
> +   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
> +   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> +   more details.
> +
> +   Under Section 7 of GPL version 3, you are granted additional
> +   permissions described in the GCC Runtime Library Exception, version
> +   3.1, as published by the Free Software Foundation.
> +
> +   You should have received a copy of the GNU General Public License and
> +   a copy of the GCC Runtime Library Exception along with this program;
> +   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* This is a generic implementation of doacross spinning.  */
> +
> +#ifndef GOMP_DOACROSS_H
> +#define GOMP_DOACROSS_H 1
> +
> +#include "libgomp.h"
> +#include <errno.h>
> +
> +#ifdef HAVE_ATTRIBUTE_VISIBILITY
> +# pragma GCC visibility push(hidden)
> +#endif
> +
> +static inline void
> +cpu_relax (void)
> +{
> +  __asm volatile ("" : : : "memory");
> +}

IIRC, cpu_relax() is something like the pause instruction on x86, meant
to tell the CPU that busy-waiting is happening.  The compiler barrier in
the asm above doesn't do that.  It's also not necessary if atomic
operations are used for the spinning (as opposed to plain memory
accesses), because the atomics (even with relaxed MO) already prevent the
compiler from optimizing them away.  (Specifically, the standards require
that threads eventually observe the most recent store to a value; if the
compiler optimized away a spin loop over relaxed MO atomics, it would
violate this forward progress requirement.  This is different if plain
accesses are used, because the compiler is allowed to assume that a
program contains no endless loops without I/O or synchronization.)

> +
> +static inline void doacross_spin (unsigned long *addr, unsigned long expected,
> +                                  unsigned long cur)
> +{
> +  /* FIXME: back off depending on how large expected - cur is.  */
> +  do
> +    {
> +      cpu_relax ();
> +      cur = __atomic_load_n (addr, MEMMODEL_RELAXED);
> +      if (expected < cur)
> +        return;
> +    }
> +  while (1);
> +}
> +
> +#ifdef HAVE_ATTRIBUTE_VISIBILITY
> +# pragma GCC visibility pop
> +#endif
> +
> +#endif /* GOMP_DOACROSS_H */
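Concretely, for an x86 configuration I would expect something along these
lines (a sketch only; the pause hint and the placement of the acquire
fence are the two points above, and whether the fence lives here or at
the call sites is a matter of taste):

static inline void
cpu_relax (void)
{
#if defined __x86_64__ || defined __i386__
  __builtin_ia32_pause ();    /* tell the CPU we are busy-waiting */
#endif
}

static inline void
doacross_spin (unsigned long *addr, unsigned long expected,
               unsigned long cur)
{
  /* FIXME: back off depending on how large expected - cur is.  */
  do
    {
      cpu_relax ();
      cur = __atomic_load_n (addr, MEMMODEL_RELAXED);
    }
  while (expected >= cur);
  /* Relaxed MO is fine while spinning, but callers rely on
     synchronizing with the release MO store that produced the value we
     waited for, so upgrade the final load with an acquire fence (the
     alternative is an acquire fence after each call site).  */
  __atomic_thread_fence (MEMMODEL_ACQUIRE);
}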