Hi Sho, I just came across your project and would like to add a few comments.
I think the biggest problem in libgomp is that tasks are stored in a single queue. This may work for a few threads or for long-running tasks, but certainly not for 48 threads. In fact, contention on the queue can grow so high that performance starts to degrade when you add more threads. First of all, you need to think about data structures for storing tasks. When you look at other parallel runtime systems/libraries, you will find that many implement some form of work-stealing scheduling. In work-stealing scheduling, each worker or group of workers have their own queue on which they operate. Load balancing is done by stealing tasks from the queues of other workers. The Nanos runtime is probably a good place to start to get some ideas... Regarding Lazy Task Creation (LTC)... I agree with the advice here and would say this is not what you should focus on. The most prominent use of LTC is in the Cilk multithreaded language. Basically, we have two options when we encounter a new task: (1) spawn the task as a child and continue executing the parent, or, what LTC does, (2) start executing the task and leave the continuation (the rest of the parent) for later execution. Load balancing requires that the continuation is made available so that it can be picked up and executed by other threads. The Nanos runtime supports different scheduling strategies, including work-first scheduling, which is similar in principle to LTC. Nanos implements tasks as user-level threads (nano-threads), so it's easier to switch between tasks or move tasks across threads. Something like work-first scheduling that depends on untied tasks is nice to have. But I think a decent scheduler (for tied tasks) is more important at this point... -Andreas