Re: Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime

2019-07-21 Thread 김규래
Hi Jakub,
About the snippet below:

  if (gomp_barrier_last_thread (state))
    {
      if (team->task_count == 0)
        {
          gomp_team_barrier_done (&team->barrier, state);
          gomp_mutex_unlock (&team->task_lock);
          gomp_team_barrier_wake (&team->barrier, 0);
          return;
        }
      gomp_team_barrier_set_waiting_for_tasks (&team->barrier);
    }

Am I safe to assume that gomp_barrier_last_thread is thread-safe?
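For what it's worth, the Linux implementation in config/linux/bar.h seems to
only inspect the per-thread state value returned by gomp_barrier_wait_start,
roughly like this (paraphrased from memory, so please correct me if this is
off):

/* Paraphrased: the check reads a flag in the caller's own copy of the
   barrier state, not any shared field of the barrier itself.  */
static inline bool
gomp_barrier_last_thread (gomp_barrier_state_t state)
{
  return state & BAR_WAS_LAST;
}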

Ray Kim


[GSoC'19] Parallelize GCC with Threads -- Second Evaluation Status

2019-07-21 Thread Giuliano Belinassi
Hi all,

Here is my second evaluation report, together with a simple program that
I was able to compile with my parallel version of GCC. Keep in mind that
I still have lots of concurrency issues inside the compiler, and therefore
my branch will fail to compile pretty much anything else.

To reproduce my current branch, use the following steps:

1-) Clone https://gitlab.com/flusp/gcc/tree/giulianob_parallel

2-) Edit gcc/cgraphunit.c's variable `num_threads` to 1.

3-) Configure with --disable-bootstrap --enable-languages=c

4-) make

5-) Edit gcc/cgraphunit.c's variable `num_threads` to 2, for instance.

6-) make install DESTDIR="somewhere_else_that_doesnt_break_your_gcc"

7-) Compile the attached program using -O2

I am attaching my report in markdown format, which you can convert to PDF
using `pandoc` if you find it difficult to read in the current format.

I am also open to suggestions. Please do not hesitate to comment :)

Thank you,
Giuliano.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#include <math.h>

#define MAX 1000

double
//__attribute__ ((noinline, optimize("-O0")))
sinatan2(double x, double y)
{
    return sin(atan2(x, y));
}

#define EPS 1e-160
#define HUGE 1e150
#define INF 1/0.


double
//__attribute__ ((noinline))
sinatan2_(double x, double y)
{
    static double hue;
    //if (!(*((uint64_t*) &x) & 0x7FFFULL))
    if (fpclassify(x) == FP_ZERO)
        return x;
    else
    {
        if (y == 0 || x > INF || x < -INF)
            return copysign(1., x);
        else
        {
            double x_abs = fabs(x);
            double y_abs = fabs(y);
            if (x_abs < EPS || y_abs < EPS)
            {
                double alpha = 1/EPS;
                x = alpha*x;
                y = alpha*y;
            }
            if (x_abs > HUGE || y_abs > HUGE)
            {
                double alpha = 1/HUGE;
                x = alpha*x;
                y = alpha*y;
            }
        }
        printf("x = %le, sqrt = %le\n", x, sqrt(x*x + y*y));
        return x / sqrt(x*x + y*y);
    }
}

static void generate_random_array(int n, double a[])
{
    int i;

    for (i = 0; i < n; ++i)
        a[i] = (double) rand() / (double) RAND_MAX;
}

int
main(int argc, char* argv[])
{
    //struct timespec start, end;
    unsigned long long delta_t;

    volatile double a;
    int i, j;

    double x, y, z1, z2;

    if (argc != 3)
        return 1;

    x = atof(argv[1]);
    y = atof(argv[2]);

    z1 = sinatan2(x, y);
    z2 = sinatan2_(x, y);

    printf("sin(atan2(x, y))  = %le\n", z1);
    printf("x / sqrt(x² + y²) = %le\n", z2);
    /*
    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);

    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    for (i = 0; i < MAX; ++i)
    {
        for (j = 0; j < MAX; ++j)
        {
            a = sinatan2_(A[i], A[j]);
        }
    }
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);
    delta_t = end.tv_nsec - start.tv_nsec;
    printf("time x/sqrt(x*x + y*y) = %llu\n", delta_t);

    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    for (i = 0; i < MAX; ++i)
    {
        for (j = 0; j < MAX; ++j)
        {
            a = sinatan2(A[i], A[j]);
        }
    }
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);
    delta_t = (end.tv_nsec - start.tv_nsec);
    printf("time sin(atan2(x, y))  = %llu\n", delta_t);
    */

    return 0;
}
# Parallelize GCC with Threads: Second Evaluation Report


## Tasks accomplished by the Second Evaluation

These are the tasks I have accomplished so far:

  - Continue mapping the GIMPLE global variables.

  - Compile a small program in parallel.

Although I aimed to have a correct version of the parallel GCC by this
evaluation, I could not achieve this result yet.

Here I explain some technical problems that I faced during this phase.

## Chapter 1 -- Global Variables

In this chapter, I discuss some global variables
that I found while debugging my parallel prototype.

By using the "algorithm" that I described in the first evaluation, sometimes
aided by helgrind or by gdb at the crash point, I was able to find some more
global variables. See the table in the Appendix for more details. There are
still lots of variables that I have not reached yet, and some fell through
the cracks, which I only found references to after debugging the resulting
crashes.

## Chapter 2 -- Garbage Collector (GGC)

I inserted a single mutex in GGC to ensure that memory allocation is done
sequentially, which avoids having to rewrite the GGC with threading support
for now. Of course, this will require a better strategy in the future, as
doing memory allocation sequentially might impact performance negatively.
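
As a rough illustration of the idea (made-up names, not the real ggc-page.c
entry points; `malloc` stands in for the real page allocator), every
allocation from a worker thread simply goes through one global mutex:

```c
/* Illustrative sketch only, not the real GGC code: one global mutex
   serializes every allocation request coming from the worker threads.  */
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t alloc_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *
serial_ggc_alloc (size_t size)
{
  void *p;
  pthread_mutex_lock (&alloc_mutex);
  p = malloc (size);   /* stand-in for the GGC page allocator */
  pthread_mutex_unlock (&alloc_mutex);
  return p;
}
```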

## Chapter 3 -- Object Pools

Some passes use object pools to store objects, avoiding frequent memory
allocation and deallocation. My previous approach was to have a thread-safe
memory pool that locked a mutex each time a new object was requested from, or
released to, this pool. This turned out to be a bad approach.
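
For reference, a minimal sketch of that previous mutex-per-operation design
(hypothetical names, not the actual object-pool code in GCC): every allocate
and release contends on the same lock.

```c
/* Hypothetical sketch of the previous design: every allocation from and
   release to the pool takes the same mutex, so all threads contend on it.
   object_size is assumed to be at least sizeof (struct pool_entry).  */
#include <pthread.h>
#include <stdlib.h>

struct pool_entry { struct pool_entry *next; };

struct locked_pool
{
  pthread_mutex_t lock;
  struct pool_entry *free_list;
  size_t object_size;
};

static void *
pool_allocate (struct locked_pool *pool)
{
  void *obj;
  pthread_mutex_lock (&pool->lock);
  if (pool->free_list)
    {
      obj = pool->free_list;
      pool->free_list = pool->free_list->next;
    }
  else
    obj = malloc (pool->object_size);
  pthread_mutex_unlock (&pool->lock);
  return obj;
}

static void
pool_release (struct locked_pool *pool, void *obj)
{
  struct pool_entry *e = obj;
  pthread_mutex_lock (&pool->lock);
  e->next = pool->free_list;
  pool->free_list = e;
  pthread_mutex_unlock (&pool->lock);
}
```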

[GSoC'19, libgomp work-stealing] GSoC 2nd Evaluation Status

2019-07-21 Thread 김규래
Hi,
Just submitted a WIP patch for my current status.
I've finished unifying the three queues and reducing the execution paths.
Next, I will shrink the locked region so that, in the end, only the queue
accesses are locked.
Once this is done, splitting the queues and implementing work-stealing will
follow.
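To illustrate the direction, here is a rough sketch of what I mean: a
per-thread deque where only the queue accesses themselves are locked, and
idle threads steal from the opposite end (made-up names, not the actual
libgomp structures).

/* Rough sketch, not the actual libgomp data structures: the mutex protects
   only the deque operations, and running the task happens outside the
   critical section.  The owner pops the newest task from the tail; idle
   threads steal the oldest task from the head.  */
#include <pthread.h>
#include <stddef.h>

struct task;                    /* stands in for struct gomp_task */

struct task_deque
{
  pthread_mutex_t lock;
  struct task **buf;            /* ring buffer of queued tasks */
  size_t head, tail, capacity;  /* head/tail are monotonic counters */
};

static struct task *
deque_pop_back (struct task_deque *q)    /* called by the owning thread */
{
  struct task *t = NULL;
  pthread_mutex_lock (&q->lock);
  if (q->head != q->tail)
    t = q->buf[--q->tail % q->capacity];
  pthread_mutex_unlock (&q->lock);
  return t;
}

static struct task *
deque_steal_front (struct task_deque *q) /* called by an idle thread */
{
  struct task *t = NULL;
  pthread_mutex_lock (&q->lock);
  if (q->head != q->tail)
    t = q->buf[q->head++ % q->capacity];
  pthread_mutex_unlock (&q->lock);
  return t;
}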
 
The link below is a simple benchmark result from the version submitted to
gcc-patches.
https://imgur.com/IvaBDwT
The benchmark problem is computing the LU decomposition of an NxN matrix.
PLASMA [1] is a task-parallel linear algebra library.
The upstream version of PLASMA uses OpenMP's task scheduling system.
 
Looking at the results, the '2nd eval' version (the currently submitted patch)
surpasses the upstream version's performance past N=4096.
Apparently, unifying the queues improved performance despite the more
frequent mutex locking and unlocking.
 
Ray Kim.
 
[1] https://bitbucket.org/icl/plasma/src


gcc-10-20190721 is now available

2019-07-21 Thread gccadmin
Snapshot gcc-10-20190721 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/10-20190721/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 10 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/trunk revision 273652

You'll find:

 gcc-10-20190721.tar.xz   Complete GCC

  SHA256=d9be23901b2d474e95382f3f708e00b600a2f7141853c978dd0f927aeaf08914
  SHA1=54ac6e6b5658e3f22a6061d74ec7a7a2f378d6d9

Diffs from 10-20190714 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-10
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: Doubts regarding the _Dependent_ptr keyword

2019-07-21 Thread Akshat Garg
Hi all,
Consider part of an example (figure 20) from document P0190R4
(http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0190r4.pdf), shown
below:

1.  void thread1 (void)
2.  {
3.    int * volatile p;
4.    p = rcu_dereference(gip);
5.    if (p)
6.      assert(*(p+p[0]) == 42);
7.  }
The .gimple code produced is:

1.  thread1 ()
2.  {
3.    atomic int * D.1992;
4.    int * volatile p;
5.    {
6.      atomic int * * __atomic_load_ptr;
7.      atomic int * __atomic_load_tmp;
8.      try
9.        {
10.         __atomic_load_ptr = &gip;
11.         _1 = __atomic_load_8 (__atomic_load_ptr, 1);
12.         _2 = (atomic int *) _1;
13.         __atomic_load_tmp = _2;
14.         D.1992 = __atomic_load_tmp;
15.       }
16.     finally
17.       {
18.         __atomic_load_tmp = {CLOBBER};
19.       }
20.   }
21.   p = D.1992;
22.   p.2_3 = p;
23.   if (p.2_3 != 0B) goto ; else goto ;
24.   :
25.   p.3_4 = p;
26.   p.4_5 = p;
27.   _6 = *p.4_5;
28.   _7 = (long unsigned int) _6;
29.   _8 = _7 * 4;
30.   _9 = p.3_4 + _8;
31.   _10 = *_9;
32.   _11 = _10 == 42;
33.   _12 = (int) _11;
34.   assert (_12);
35.   :
36. }

The assert at line 34 in the .gimple code still breaks the dependency given by
the user. I believe there should be some SSA-defined variable of p, or p
itself, in the assert. This happens when I consider the pointer as
volatile-qualified. If I consider it as _Dependent_ptr-qualified, then it
surely produces the dependency-breaking code. Let me know if I am wrong
somewhere.

-Akshat




On Wed, Jul 17, 2019 at 4:23 PM Akshat Garg  wrote:

> On Tue, Jul 2, 2019 at 9:06 PM Jason Merrill  wrote:
>
>> On Mon, Jul 1, 2019 at 8:59 PM Paul E. McKenney 
>> wrote:
>> >
>> > On Tue, Jul 02, 2019 at 05:58:48AM +0530, Akshat Garg wrote:
>> > > On Tue, Jun 25, 2019 at 9:49 PM Akshat Garg  wrote:
>> > >
>> > > > On Tue, Jun 25, 2019 at 4:04 PM Ramana Radhakrishnan <
>> > > > ramana@googlemail.com> wrote:
>> > > >
>> > > >> On Tue, Jun 25, 2019 at 11:03 AM Akshat Garg 
>> wrote:
>> > > >> >
>> > > >> > As we have some working front-end code for _Dependent_ptr, What
>> should
>> > > >> we do next? What I understand, we can start adding the library for
>> > > >> dependent_ptr and its functions for C corresponding to the ones we
>> created
>> > > >> as C++ template library. Then, after that, we can move on to
>> generating the
>> > > >> assembly code part.
>> > > >> >
>> > > >>
>> > > >>
>> > > >> I think the next step is figuring out how to model the Dependent
>> > > >> pointer information in the IR and figuring out what optimizations
>> to
>> > > >> allow or not with that information. At this point , I suspect we
>> need
>> > > >> a plan on record and have the conversation upstream on the lists.
>> > > >>
>> > > >> I think we need to put down a plan on record.
>> > > >>
>> > > >> Ramana
>> > > >
>> > > > [CCing gcc mailing list]
>> > > >
>> > > > So, shall I start looking over the pointer optimizations only and
>> see what
>> > > > information we may be needed on the same examples in the IR itself?
>> > > >
>> > > > - Akshat
>> > > >
>> > > I have coded an example where equality comparison kills dependency
>> from the
>> > > document P0190R4 as shown below :
>> > >
>> > > 1. struct rcutest rt = {1, 2, 3};
>> > > 2. void thread0 ()
>> > > 3. {
>> > > 4. rt.a = -42;
>> > > 5. rt.b = -43;
>> > > 6. rt.c = -44;
>> > > 7. rcu_assign_pointer(gp, &rt);
>> > > 8. }
>> > > 9.
>> > > 10. void thread1 ()
>> > > 11. {
>> > > 12.int i = -1;
>> > > 13.int j = -1;
>> > > 14._Dependent_ptr struct rcutest *p;
>> > > 15.
>> > > 16.p = rcu_dereference(gp);
>> > > 17.j = p->a;
>> > > 18.   if (p == &rt)
>> > > 19.i = p->b;  /*Dependency breaking point*/
>> > > 20.   else if(p)
>> > > 21.   i = p->c;
>> > > 22.   assert(i<0);
>> > > 23.   assert(j<0);
>> > > 24. }
>> > > The gimple unoptimized code produced for lines 17-24 is shown below
>> > >
>> > > 1. if (p_16 == &rt)
>> > > 2. goto ; [INV]
>> > > 3.   else
>> > > 4.goto ; [INV]
>> > > 5.
>> > > 6.   :
>> > > 7.  i_19 = p_16->b;
>> > > 8.  goto ; [INV]
>> > > 9.
>> > > 10.   :
>> > > 11.  if (p_16 != 0B)
>> > > 12.goto ; [INV]
>> > > 13.  else
>> > > 14.goto ; [INV]
>> > > 15.
>> > > 16.   :
>> > > 17.  i_18 = p_16->c;
>> > > 18.
>> > > 19.   :
>> > > 20.  # i_7 = PHI 
>> > > 21.  _3 = i_7 < 0;
>> > > 22.  _4 = (int) _3;
>> > > 23.  assert (_4);
>> > > 24.  _5 = j_17 < 0;
>> > > 25.  _6 = (int) _5;
>> > > 26.  assert (_6);
>> > > 27.  return;
>> > >
>> > > The optimized code after -O1 is applied for the same lines is hown
>> below :
>> > >
>> > > 1. if (_2 == &rt)
>> > > 2.goto ; [30.00%]
>> > > 3. else
>> > > 4.goto ; [70.00%]
>> > > 5.
>> > > 6.   [local count: 322122547]:
>> > > 7.   i_12 = rt.b;
>> > > 8.   goto ; [100.00%]
>> > > 9.
>> > > 10.   [local count: 751619277]:
>> > > 11.   if (_1 != 0)
>> > > 12.   goto ; [50.00%]
>> > > 13.   else
>> > > 14.goto ; [50.00%]
>> > > 15.
>> > > 16.   [local count: 375809638]:
>> > > 17.   i_11 = MEM[(dependen