Re: Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
Hi Jakub,

About the snippet below,

  if (gomp_barrier_last_thread (state))
    {
      if (team->task_count == 0)
	{
	  gomp_team_barrier_done (&team->barrier, state);
	  gomp_mutex_unlock (&team->task_lock);
	  gomp_team_barrier_wake (&team->barrier, 0);
	  return;
	}
      gomp_team_barrier_set_waiting_for_tasks (&team->barrier);
    }

am I safe to assume that gomp_barrier_last_thread is thread-safe?

Ray Kim
[GSoC'19] Parallelize GCC with Threads -- Second Evaluation Status
Hi all,

Here is my second evaluation report, together with a simple program that I was able to compile with my parallel version of GCC. Keep in mind that I still have lots of concurrency issues inside the compiler, and therefore my branch will fail to compile pretty much anything else.

To reproduce my current branch, use the following steps:

1-) Clone https://gitlab.com/flusp/gcc/tree/giulianob_parallel
2-) Edit gcc/cgraphunit.c's variable `num_threads` to 1.
3-) Configure with --disable-bootstrap --enable-languages=c
4-) make
5-) Edit gcc/cgraphunit.c's variable `num_threads` to 2, for instance.
6-) make install DESTDIR="somewhere_else_that_doesnt_break_your_gcc"
7-) Compile the program using -O2

I am attaching my report in markdown format, which you can convert to pdf using `pandoc` if you find it difficult to read in the current format. I am also open to suggestions. Please do not hesitate to comment :)

Thank you,
Giuliano.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX 1000

double
//__attribute__ ((noinline, optimize("-O0")))
sinatan2(double x, double y)
{
  return sin(atan2(x, y));
}

#define EPS 1e-160
#define HUGE 1e150
#define INF 1/0.
double
//__attribute__ ((noinline))
sinatan2_(double x, double y)
{
  static double hue;

  //if (!(*((uint64_t*) &x) & 0x7FFFULL))
  if (fpclassify(x) == FP_ZERO)
    return x;
  else
    {
      if (y == 0 || x > INF || x < -INF)
	return copysign(1., x);
      else
	{
	  double x_abs = fabs(x);
	  double y_abs = fabs(y);
	  if (x_abs < EPS || y_abs < EPS)
	    {
	      double alpha = 1/EPS;
	      x = alpha*x;
	      y = alpha*y;
	    }
	  if (x_abs > HUGE || y_abs > HUGE)
	    {
	      double alpha = 1/HUGE;
	      x = alpha*x;
	      y = alpha*y;
	    }
	}
      printf("x = %le, sqrt = %le\n", x, sqrt(x*x + y*y));
      return x / sqrt(x*x + y*y);
    }
}

static void
generate_random_array(int n, double a[])
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = (double) rand() / (double) RAND_MAX;
}

int
main(int argc, char* argv[])
{
  //struct timespec start, end;
  unsigned long long delta_t;
  volatile double a;
  int i, j;
  double x, y, z1, z2;

  if (argc != 3)
    return 1;

  x = atof(argv[1]);
  y = atof(argv[2]);

  z1 = sinatan2(x, y);
  z2 = sinatan2_(x, y);

  printf("sin(atan2(x, y)) = %le\n", z1);
  printf("x / sqrt(x² + y²) = %le\n", z2);

  /*
  clock_gettime(CLOCK_MONOTONIC_RAW, &start);
  clock_gettime(CLOCK_MONOTONIC_RAW, &end);

  clock_gettime(CLOCK_MONOTONIC_RAW, &start);
  for (i = 0; i < MAX; ++i)
    {
      for (j = 0; j < MAX; ++j)
	a = sinatan2_(A[i], A[j]);
    }
  clock_gettime(CLOCK_MONOTONIC_RAW, &end);
  delta_t = end.tv_nsec - start.tv_nsec;
  printf("time x/sqrt(x*x + y*y) = %llu\n", delta_t);

  clock_gettime(CLOCK_MONOTONIC_RAW, &start);
  for (i = 0; i < MAX; ++i)
    {
      for (j = 0; j < MAX; ++j)
	a = sinatan2(A[i], A[j]);
    }
  clock_gettime(CLOCK_MONOTONIC_RAW, &end);
  delta_t = (end.tv_nsec - start.tv_nsec);
  printf("time sin(atan2(x, y)) = %llu\n", delta_t);
  */

  return 0;
}

# Parallelize GCC with Threads: Second Evaluation Report

## Tasks accomplished until the Second Evaluation

These are the tasks which I accomplished until now:

- Keep on mapping the GIMPLE global variables.
- Compile a small program in parallel.
Although I aimed to have a correct version of the parallel GCC by this evaluation, I could not achieve this result yet. Here I explain some technical problems that I faced during this phase.

## Chapter 1 -- Global Variables

In this chapter, I discuss some global variables that I found while debugging my parallel prototype. By using the "algorithm" that I stated in the first evaluation, sometimes with the help of helgrind, or aided by gdb at the crash point, I was able to find some more global variables. See the table in the Appendix for more details. There are still many variables that I have not reached yet, and some fell through the cracks: I only find references to them after debugging a crash.

## Chapter 2 -- Garbage Collector (GGC)

I inserted a single mutex in GGC to ensure that memory allocation is done sequentially, sparing me from rewriting the GGC with threading support for now. Of course, this will require a better strategy in the future, as serializing memory allocation might impact performance negatively.

## Chapter 3 -- Object Pools

Some passes use object pools to store objects, avoiding frequent memory allocation and deallocation. My previous approach was to have a thread-safe memory pool that locked a mutex each time a new object was requested from, or released to, the pool. This turned out to be a bad approach
[GSoC'19, libgomp work-stealing] GSoC 2nd Evaluation Status
Hi,

Just submitted a WIP patch with my current status. I've finished unifying the three queues and reducing the execution paths. From now on, I will shrink the locked region so that, in the end, only the queue accesses are locked. Once this is done, splitting the queues and implementing work-stealing will follow.

The link below is a simple benchmark result from the gcc-patches submitted version:
https://imgur.com/IvaBDwT

The benchmark problem is computing the LU decomposition of an NxN matrix. PLASMA [1] is a task-parallel linear algebra library. The upstream version of PLASMA uses OpenMP's task scheduling system. Looking at the results, the '2nd eval' version (the currently submitted patch) surpasses the upstream version's performance past N=4096. Apparently, unifying the queues improved the performance despite the more frequent mutex lock/unlocks.

Ray Kim.

[1] https://bitbucket.org/icl/plasma/src
gcc-10-20190721 is now available
Snapshot gcc-10-20190721 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/10-20190721/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 10 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/trunk revision 273652

You'll find:

 gcc-10-20190721.tar.xz               Complete GCC

  SHA256=d9be23901b2d474e95382f3f708e00b600a2f7141853c978dd0f927aeaf08914
  SHA1=54ac6e6b5658e3f22a6061d74ec7a7a2f378d6d9

Diffs from 10-20190714 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-10
link is updated and a message is sent to the gcc list. Please do not use
a snapshot before it has been announced that way.
Re: Doubts regarding the _Dependent_ptr keyword
Hi all,

Consider part of an example (figure 20) from doc P0190R4 (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0190r4.pdf) shown below:

1. void thread1 (void)
2. {
3.   int * volatile p;
4.   p = rcu_dereference(gip);
5.   if (p)
6.     assert(*(p+p[0]) == 42);
7. }

The .gimple code produced is:

 1. thread1 ()
 2. {
 3.   atomic int * D.1992;
 4.   int * volatile p;
 5.   {
 6.     atomic int * * __atomic_load_ptr;
 7.     atomic int * __atomic_load_tmp;
 8.     try
 9.       {
10.         __atomic_load_ptr = &gip;
11.         _1 = __atomic_load_8 (__atomic_load_ptr, 1);
12.         _2 = (atomic int *) _1;
13.         __atomic_load_tmp = _2;
14.         D.1992 = __atomic_load_tmp;
15.       }
16.     finally
17.       {
18.         __atomic_load_tmp = {CLOBBER};
19.       }
20.   }
21.   p = D.1992;
22.   p.2_3 = p;
23.   if (p.2_3 != 0B) goto ; else goto ;
24.   :
25.   p.3_4 = p;
26.   p.4_5 = p;
27.   _6 = *p.4_5;
28.   _7 = (long unsigned int) _6;
29.   _8 = _7 * 4;
30.   _9 = p.3_4 + _8;
31.   _10 = *_9;
32.   _11 = _10 == 42;
33.   _12 = (int) _11;
34.   assert (_12);
35.   :
36. }

The assert at line 34 in the .gimple code still breaks the dependency given by the user. I believe there should be some SSA-defined variable of p, or p itself, in the assert. This is happening when I consider the pointer as volatile qualified. If I consider it as _Dependent_ptr qualified, then it surely produces the broken-dependency code. Let me know if I am wrong somewhere.

-Akshat

On Wed, Jul 17, 2019 at 4:23 PM Akshat Garg wrote:
> On Tue, Jul 2, 2019 at 9:06 PM Jason Merrill wrote:
>> On Mon, Jul 1, 2019 at 8:59 PM Paul E. McKenney wrote:
>>> On Tue, Jul 02, 2019 at 05:58:48AM +0530, Akshat Garg wrote:
>>>> On Tue, Jun 25, 2019 at 9:49 PM Akshat Garg wrote:
>>>>> On Tue, Jun 25, 2019 at 4:04 PM Ramana Radhakrishnan <ramana@googlemail.com> wrote:
>>>>>> On Tue, Jun 25, 2019 at 11:03 AM Akshat Garg wrote:
>>>>>>> As we have some working front-end code for _Dependent_ptr, what should we do next?
>>>>>>> What I understand, we can start adding the library for dependent_ptr and its functions for C corresponding to the ones we created as a C++ template library. Then, after that, we can move on to generating the assembly code part.
>>>>>>
>>>>>> I think the next step is figuring out how to model the Dependent pointer information in the IR and figuring out what optimizations to allow or not with that information. At this point, I suspect we need a plan on record and have the conversation upstream on the lists.
>>>>>>
>>>>>> I think we need to put down a plan on record.
>>>>>>
>>>>>> Ramana
>>>>>
>>>>> [CCing gcc mailing list]
>>>>>
>>>>> So, shall I start looking over the pointer optimizations only and see what information we may be needed on the same examples in the IR itself?
>>>>>
>>>>> - Akshat
>>>>
>>>> I have coded an example where equality comparison kills dependency from the document P0190R4 as shown below:
>>>>
>>>>  1. struct rcutest rt = {1, 2, 3};
>>>>  2. void thread0 ()
>>>>  3. {
>>>>  4.   rt.a = -42;
>>>>  5.   rt.b = -43;
>>>>  6.   rt.c = -44;
>>>>  7.   rcu_assign_pointer(gp, &rt);
>>>>  8. }
>>>>  9.
>>>> 10. void thread1 ()
>>>> 11. {
>>>> 12.   int i = -1;
>>>> 13.   int j = -1;
>>>> 14.   _Dependent_ptr struct rcutest *p;
>>>> 15.
>>>> 16.   p = rcu_dereference(gp);
>>>> 17.   j = p->a;
>>>> 18.   if (p == &rt)
>>>> 19.     i = p->b;  /* Dependency breaking point */
>>>> 20.   else if (p)
>>>> 21.     i = p->c;
>>>> 22.   assert(i < 0);
>>>> 23.   assert(j < 0);
>>>> 24. }
>>>>
>>>> The gimple unoptimized code produced for lines 17-24 is shown below:
>>>>
>>>>  1. if (p_16 == &rt)
>>>>  2.   goto ; [INV]
>>>>  3. else
>>>>  4.   goto ; [INV]
>>>>  5.
>>>>  6. :
>>>>  7. i_19 = p_16->b;
>>>>  8. goto ; [INV]
>>>>  9.
>>>> 10. :
>>>> 11. if (p_16 != 0B)
>>>> 12.   goto ; [INV]
>>>> 13. else
>>>> 14.   goto ; [INV]
>>>> 15.
>>>> 16. :
>>>> 17. i_18 = p_16->c;
>>>> 18.
>>>> 19. :
>>>> 20. # i_7 = PHI
>>>> 21. _3 = i_7 < 0;
>>>> 22. _4 = (int) _3;
>>>> 23. assert (_4);
>>>> 24. _5 = j_17 < 0;
>>>> 25. _6 = (int) _5;
>>>> 26. assert (_6);
>>>> 27. return;
>>>>
>>>> The optimized code after -O1 is applied for the same lines is shown below:
>>>>
>>>>  1. if (_2 == &rt)
>>>>  2.   goto ; [30.00%]
>>>>  3. else
>>>>  4.   goto ; [70.00%]
>>>>  5.
>>>>  6. [local count: 322122547]:
>>>>  7. i_12 = rt.b;
>>>>  8. goto ; [100.00%]
>>>>  9.
>>>> 10. [local count: 751619277]:
>>>> 11. if (_1 != 0)
>>>> 12.   goto ; [50.00%]
>>>> 13. else
>>>> 14.   goto ; [50.00%]
>>>> 15.
>>>> 16. [local count: 375809638]:
>>>> 17. i_11 = MEM[(dependen