On 02/07/2020 18:00, Jakub Jelinek wrote:
On Thu, Jul 02, 2020 at 05:15:20PM +0100, Andrew Stubbs wrote:
This patch, originally by Kwok, auto-adjusts the default OpenMP target
arguments to set num_threads(1) when there are no parallel regions. There
may still be multiple teams in this case.
The result is that libgomp will not attempt to launch GPU threads that will
never get used.
OK to commit?
That doesn't look safe to me.
My understanding of the patch is that it looks for a parallel construct
lexically in the target region, but that isn't sufficient; one can do that
only if no parallel construct can be encountered anywhere in the target
region (i.e. in the body and in all functions called from it at runtime).
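To illustrate the concern, here is a minimal sketch of a target region that
contains no lexical parallel construct but still reaches one at run time.
The function names scale and offload_scale are made up for this example
and are not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical helper.  Nothing at its call site reveals that it
   contains a parallel construct.  */
#pragma omp declare target
void scale(double *v, size_t n, double f)
{
  #pragma omp parallel for
  for (size_t i = 0; i < n; i++)
    v[i] *= f;
}
#pragma omp end declare target

/* The target region below has no lexical "parallel", yet it
   encounters one dynamically through the call to scale().  */
void offload_scale(double *v, size_t n, double f)
{
  #pragma omp target map(tofrom: v[0:n])
  scale(v, n, f);
}
```

A purely lexical scan of offload_scale would conclude that one thread is
enough, which would leave the loop in scale() with a single thread.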
OpenMP is complicated. :-(
Is it normally expected that the runtime will always launch the maximum
number of threads, just in case?
There's a cost to both launching and running excess threads that it
would be nice to avoid, but the real point of the optimization is that
launching fewer threads allows us to launch more teams.
AMD GPUs usually allow us to run 2040 or 2400 wavefronts simultaneously,
so if we're running 15 unused threads for each team then we're limiting
ourselves to 60 or 64 teams. If we limit each team to 1 thread then we
can run the full 2040 or 2400 teams. Potentially, that's a 16x speed
improvement on kernels that happen to not use parallel regions.
I would like to be able to do this, but it appears that the region data
is insufficient for complex cases. Can you suggest a good way to solve this?
Perhaps one could ignore some builtin calls, but only those where one can
assume that they contain no OpenMP code.
Also, it needs to avoid doing the optimization if omp_get_thread_limit ()
is called, or might be called indirectly, because if the optimization
forces thread_limit (1), that means that omp_get_thread_limit () in the
region will also return 1 rather than the expected value.
Would that not be the correct answer, if the number of threads actually
has been limited to 1?
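To make the question concrete, here is a minimal sketch of a probe that
reports the thread limit as seen from inside a target region. The name
observed_thread_limit is hypothetical, and the fallback value used when
compiling without OpenMP is an assumption of this example.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical probe: return the thread limit observed inside a
   target region.  Falls back to 1 when built without OpenMP.  */
int observed_thread_limit(void)
{
  int limit = 1;
#ifdef _OPENMP
  #pragma omp target map(from: limit)
  limit = omp_get_thread_limit();
#endif
  return limit;
}
```

If the compiler silently forced thread_limit (1) on this region, the probe
would report 1 even though the program never asked for that limit.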
Thanks for the prompt review.
Andrew