On Mon, Aug 26, 2019 at 01:08:39AM +0900, Ray Kim wrote:
> This patch implements work-stealing task scheduling for the GSoC'19
> final evaluations.
> Currently there are some issues that need to be further addressed,
> however I think it is functional.
>
> Things that could be improved are as follows:
> 1. Currently the threads busy wait, even when the task queues are
> empty, until the stopping criteria are met.
> The previous way the task waits were implemented does not work well on
> this implementation.
> So I initially removed the task wait feature altogether.

For the GSoC submission I guess I can live with that, but not long term;
many apps oversubscribe the machine, running more threads than the hw has
available, and unbounded busy waiting in that case is unacceptable.
For latency reasons some busy waiting is desirable, but the user should
have control over how long that busy waiting is done and then have a
fallback to sleeping.  See e.g. the OMP_WAIT_POLICY or GOMP_SPINCOUNT env
variables that allow the user to control that behavior.

E.g. for the taskwait case, when you run gomp_execute_task and it returns
false, i.e. hasn't executed any task, can't you perhaps take the
depend_lock on the parent, or just atomically set some state bit saying
that you are about to sleep, and go to sleep; then on the other side, when
doing
  __atomic_sub_fetch (&parent->num_children, 1, MEMMODEL_ACQ_REL);
check the return value, and if num_children is 0, verify whether the
parent isn't sleeping, and if it could be sleeping, wake it up.  I guess
similarly for taskgroup wait or waiting for dependencies.
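To make the intent concrete, here is a minimal standalone sketch of such a
spin-then-sleep wait (made-up names, plain pthreads/C11 atomics instead of
the libgomp primitives; spin_count would be derived from
OMP_WAIT_POLICY/GOMP_SPINCOUNT):

/* Sketch only, not actual libgomp code: a taskwait that spins for a
   user-controllable number of iterations and then goes to sleep, plus
   the matching wake-up on the child-completion side.  */
#include <pthread.h>
#include <stdatomic.h>

struct task_waiter
{
  atomic_int num_children;      /* outstanding child tasks */
  atomic_int sleeping;          /* parent announced it may sleep */
  pthread_mutex_t lock;
  pthread_cond_t cond;
};

static void
task_wait (struct task_waiter *w, long spin_count)
{
  /* Busy wait first, bounded by spin_count (GOMP_SPINCOUNT-style).  */
  for (long i = 0; i < spin_count; i++)
    if (atomic_load_explicit (&w->num_children, memory_order_acquire) == 0)
      return;

  /* Fall back to sleeping.  Announce the intent and re-check
     num_children under the lock, so a concurrent child_done cannot
     miss us.  */
  pthread_mutex_lock (&w->lock);
  atomic_store_explicit (&w->sleeping, 1, memory_order_relaxed);
  while (atomic_load_explicit (&w->num_children, memory_order_acquire) != 0)
    pthread_cond_wait (&w->cond, &w->lock);
  atomic_store_explicit (&w->sleeping, 0, memory_order_relaxed);
  pthread_mutex_unlock (&w->lock);
}

static void
child_done (struct task_waiter *w)
{
  if (atomic_fetch_sub_explicit (&w->num_children, 1,
                                 memory_order_acq_rel) == 1)
    {
      /* Last child.  The parent may be asleep or about to sleep, so
         serialize with it through the lock before signalling.  */
      pthread_mutex_lock (&w->lock);
      if (atomic_load_explicit (&w->sleeping, memory_order_relaxed))
        pthread_cond_signal (&w->cond);
      pthread_mutex_unlock (&w->lock);
    }
}

The point is the ordering: the waiter only goes to sleep after announcing
it and re-checking num_children under the lock, and the last child takes
the same lock before deciding whether a wake-up is needed, so the wake-up
cannot be lost.  With OMP_WAIT_POLICY=passive one would e.g. pass a
spin_count of 0 and sleep immediately.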
> 2. The barrier interfaces use a big bold lock.
> Because the barrier related functions such as gomp_team_barrier_done
> are not atomic (as far as I believe), they are protected using
> team->barrier_lock.

Can't those operations that need protection just be turned into __atomic_*
instead?  E.g. along the lines of the __atomic_* sketch below.

> 3. The currently used work-stealing algorithm is extremely basic.
> Currently, if the thread dedicated queue is empty, work is stolen by
> uniformly picking a random queue.  While this is the way libomp does
> things, it could be improved by applying more advanced algorithms,
> depth-first scheduling and locality-aware work-stealing for example.
>
> I'll run some benchmarks and post the results.

Have you managed to run some benchmarks?  E.g. the EPCC microbenchmarks,
LCALS/CLOMP, etc.?

As for the implementation, e.g.

  /* This array contains the taskqueues and the structures for implicit
     tasks.  The implicit tasks start from
     &implicit_task + sizeof (gomp_taskqueue) * num_taskqueue.  */
  struct gomp_task taskqueues_and_implicit_tasks[];

is too ugly; you chose to put the taskqueues first, which have
struct gomp_taskqueue type, and only after that the struct gomp_task ones,
so using struct gomp_task for the flexible array member is wrong.
Either swap those two, i.e. put the implicit tasks first and the
taskqueues after them; then the implicit tasks can be accessed through
->implicit_tasks[num].  Or change the type to struct gomp_taskqueue,
access the taskqueues directly and use the inline accessor for the
implicit tasks.  Or, perhaps best, use a new type that puts
struct gomp_task and struct gomp_taskqueue next to each other and use that
type as the type of the flexible array member; then use
->implicit_tasks[i].task for what was previously ->implicit_tasks[i] and
->implicit_tasks[i].queue for the queues (see the combined-type sketch
below).

I must say I don't understand why you use atomics on the gomp_task_state
state as a kind of critical section, instead of say the depend_lock mutex.

I'd also note that e.g. the data structure layout needs to be carefully
considered, so that there is no cache line ping-pong on often atomically
modified cache lines from other threads (cf. the padding sketch below).
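To illustrate the __atomic_* direction (just a sketch; the flag name and
state layout are made up, not the actual libgomp barrier):

/* Sketch, not the actual libgomp barrier: a task-pending bit in the
   barrier state word is flipped with __atomic_* builtins, so no
   team->barrier_lock is needed around these operations.  */
#define BAR_TASK_PENDING 1U     /* made-up flag bit */

struct sketch_barrier
{
  unsigned state;               /* flag bits + generation counter */
};

static inline void
barrier_set_task_pending (struct sketch_barrier *bar)
{
  __atomic_fetch_or (&bar->state, BAR_TASK_PENDING, __ATOMIC_ACQ_REL);
}

static inline void
barrier_clear_task_pending (struct sketch_barrier *bar)
{
  __atomic_fetch_and (&bar->state, ~BAR_TASK_PENDING, __ATOMIC_ACQ_REL);
}

static inline int
barrier_task_pending_p (struct sketch_barrier *bar)
{
  return (__atomic_load_n (&bar->state, __ATOMIC_ACQUIRE)
          & BAR_TASK_PENDING) != 0;
}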
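The combined-type variant mentioned above could look like this (a sketch
only, with placeholder members):

/* Sketch of the flexible array member layout with a combined element
   type; the members of gomp_task/gomp_taskqueue are placeholders.  */
struct gomp_task { int state; /* ... */ };
struct gomp_taskqueue { struct gomp_task *head, *tail; /* ... */ };

struct gomp_task_queue_pair
{
  struct gomp_task task;
  struct gomp_taskqueue queue;
};

struct gomp_team_sketch
{
  unsigned nthreads;
  /* One pair per implicit task/thread.  */
  struct gomp_task_queue_pair implicit_tasks[];
};

/* Accesses are then type-correct without any pointer arithmetic:
   team->implicit_tasks[i].task   -- what used to be implicit_tasks[i]
   team->implicit_tasks[i].queue  -- the i-th taskqueue.  */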
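And for the cache line ping-pong concern, the usual mitigation is to
align/pad the per-thread queue data to the cache line size, e.g. (sketch;
64 bytes and the member names are assumptions):

/* Sketch: align each taskqueue to its own cache line so that a thief
   hammering one queue's lock doesn't invalidate the neighbouring
   queues' lines.  */
#define CACHELINE_SIZE 64       /* assumed; could be probed at configure time */

struct gomp_task;               /* details irrelevant for the sketch */

struct padded_taskqueue
{
  int queue_lock;               /* often atomically modified */
  struct gomp_task *head;
  struct gomp_task *tail;
} __attribute__ ((aligned (CACHELINE_SIZE)));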
Further incremental changes for things I've noticed when going through the
code:

--- libgomp/libgomp.h	2019-09-02 13:02:20.187427096 +0200
+++ libgomp/libgomp.h	2019-09-02 14:13:18.376464048 +0200
@@ -460,7 +460,7 @@
 
   /* Mutex for protecting the dependency hash table and the lifecycle of the
      task.  The lock is taken whenever dependencies are updated and the
-     a task lifecycle related critical section is entered (e.g. num_children
+     task lifecycle related critical section is entered (e.g. num_children
      becomes 0). */
   gomp_mutex_t depend_lock;
--- libgomp/task.c	2019-09-02 12:54:27.879483614 +0200
+++ libgomp/task.c	2019-09-02 13:57:33.900760986 +0200
@@ -819,15 +819,11 @@
 	 requeuing here.  */
       if (__atomic_load_n (&ttask->state, MEMMODEL_ACQUIRE)
 	  == GOMP_TARGET_TASK_FINISHED)
-	{
-	  gomp_target_task_completion (team, task, thr);
-	}
+	gomp_target_task_completion (team, task, thr);
       else if (__atomic_exchange_n (&ttask->state, GOMP_TARGET_TASK_RUNNING,
 				    MEMMODEL_ACQ_REL)
 	       == GOMP_TARGET_TASK_FINISHED)
-	{
-	  gomp_target_task_completion (team, task, thr);
-	}
+	gomp_target_task_completion (team, task, thr);
       return true;
     }
 
@@ -953,7 +949,7 @@
   gomp_mutex_lock (&parent->depend_lock);
   gomp_task_run_post_handle_depend_hash (child_task);
   if (child_task->dependers != NULL)
-      new_tasks = gomp_task_run_post_handle_dependers (child_task, team);
+    new_tasks = gomp_task_run_post_handle_dependers (child_task, team);
   gomp_mutex_unlock (&parent->depend_lock);
   return new_tasks;
 }
@@ -1035,7 +1031,7 @@
 
       cancelled = gomp_task_run_pre (next_task, team);
       if (__builtin_expect (cancelled, 0))
-	  goto finish_cancelled;
+	goto finish_cancelled;
 
       thr->task = next_task;
       if (__builtin_expect (next_task->fn == NULL, 0))
@@ -1052,15 +1048,11 @@
 	      if (__atomic_load_n (&ttask->state, MEMMODEL_ACQUIRE)
 		  == GOMP_TARGET_TASK_FINISHED)
-		{
-		  gomp_target_task_completion (team, task, thr);
-		}
+		gomp_target_task_completion (team, task, thr);
 	      else if (__atomic_exchange_n (&ttask->state,
 					    GOMP_TARGET_TASK_RUNNING,
 					    MEMMODEL_ACQ_REL)
 		       == GOMP_TARGET_TASK_FINISHED)
-		{
-		  gomp_target_task_completion (team, task, thr);
-		}
+		gomp_target_task_completion (team, task, thr);
 	      return true;
 	    }
 	}
@@ -1162,16 +1154,12 @@
 
 	      if (__atomic_load_n (&ttask->state, MEMMODEL_ACQUIRE)
 		  == GOMP_TARGET_TASK_FINISHED)
-		{
-		  gomp_target_task_completion (team, task, thr);
-		}
+		gomp_target_task_completion (team, task, thr);
 	      else if (__atomic_exchange_n (&ttask->state,
 					    GOMP_TARGET_TASK_RUNNING,
 					    MEMMODEL_ACQ_REL)
 		       == GOMP_TARGET_TASK_FINISHED)
-		{
-		  gomp_target_task_completion (team, task, thr);
-		}
+		gomp_target_task_completion (team, task, thr);
 	      continue;
 	    }
 	}

	Jakub