Re: My current idea for improving libgomp

2011-04-29 Thread Andreas Prell
Hi Sho,

I just came across your project and would like to add a few comments.

I think the biggest problem in libgomp is that tasks are stored in a
single queue. This may work for a few threads or for long-running tasks,
but certainly not for 48 threads. In fact, contention on the queue can
grow so high that performance starts to degrade when you add more
threads.

First of all, you need to think about data structures for storing tasks.
When you look at other parallel runtime systems/libraries, you will find
that many implement some form of work-stealing scheduling. In
work-stealing scheduling, each worker or group of workers have their own
queue on which they operate. Load balancing is done by stealing tasks
from the queues of other workers. The Nanos runtime is probably a good
place to start to get some ideas...  

Regarding Lazy Task Creation (LTC)... I agree with the advice here and
would say this is not what you should focus on. The most prominent use
of LTC is in the Cilk multithreaded language. Basically, we have two
options when we encounter a new task: (1) spawn the task as a child and
continue executing the parent, or, what LTC does, (2) start executing
the task and leave the continuation (the rest of the parent) for later
execution. Load balancing requires that the continuation is made
available so that it can be picked up and executed by other threads. The
Nanos runtime supports different scheduling strategies, including
work-first scheduling, which is similar in principle to LTC. Nanos
implements tasks as user-level threads (nano-threads), so it's easier to
switch between tasks or move tasks across threads. Something like
work-first scheduling that depends on untied tasks is nice to have. But
I think a decent scheduler (for tied tasks) is more important at this
point...

-Andreas



Re: My current idea for improving libgomp

2011-04-30 Thread Andreas Prell
Hey Sho,

> I totally agree with this point.
> Currently, I'm planning to implement tied task using breath-first
> scheduler wrote in
> section 3.1 of "Evaluation of OpenMP Task Scheduling Strategies" by Nanos 
> Group.
> http://www.sarc-ip.org/files/null/Workshop/1234128788173__TSchedStrat-iwomp08.pdf
> 
> That is:
> * A team has one team queue which contains tasks.
> * A team has some (user-level) threads.
> * A thread can have one running task.
> * A thread has private queue which contains tasks.
> * When a task is created, it is queued in team queue.
> * Each thread steals tasks from the team queue and inserts it in the
> private queue.
> * Once tied task is executed in a thread, it is queued only in the
> private queue in the thread
>   when it encounters `taskwait'.
> * Each thread runs a task from its private queue.
> 
> But I'm not sure how to achieve good load-balancing and what kind of
> cutoff strategy to take.
> As for load-balancing, I'll read Nanos4 implementations and ask Nanos
> Group for it.
> (Of course your advice will do :-) )
> 
> As for cutoff, basically I can choose `max-tasks' strategy or
> `max-levels' strategy.
> When number of tasks or recursion levels exceed this value, the
> scheduler stops its work
> and execute each task as sentences in sequential programs.
> But "Evaluation of OpenMP Task Scheduling Strategies" says better
> cutoff strategy is different
> from application to application.

Right, start with distributing the queues and then think about load
balancing.

I would say don't worry too much about cut-offs at this point. Finding a
good cut-off strategy that works without drawbacks is pretty much an
open research problem. Just spawn the tasks and focus on efficient task
creation and scheduling. In my experience, going from a centralized to a
distributed task pool already makes a huge difference.

To get a better overview of other implementations, which you can compare
to libgomp, I recommend a couple of papers. For example:

- OpenMP Tasks in IBM XL Compilers, X. Teruel et al.
- Support for OpenMP Tasks in Nanos v4, X. Teruel et al.
- OpenMP 3.0 Tasking Implementation in OpenUH, C. Addison et al.
- A Runtime Implementation of OpenMP Tasks, J. LaGrone et al.

You should be able to find copies online. If not, let me know.

-Andreas