On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> (This patch serves as a straw man proposal to have something concrete for
> discussion and further patches)
>
> On PTX, stack memory is private to each thread. When the master thread
> constructs 'omp_data_o' on its own stack and passes it to other threads via
> GOMP_parallel by reference, other threads cannot use the resulting pointer.
> We need to arrange for structures passed between threads to be in global,
> or better, in PTX __shared__ memory (private to each CUDA thread block).

I guess the question is - why is it better? Do you have multiple thread
blocks active in your execution model, and do they require different
omp_data_o structures? Are accesses to it performance-critical (more so
than any other access)? If the answers are "no", then I think you
probably want to fall back to just normal malloced memory or a regular
static variable, as shared memory is a fairly limited resource.
It might be slightly cleaner to have the copy described as a new builtin
call that is always generated and expanded to nothing on normal targets
rather than modifying existing calls in the IL. Or maybe:

   p = __builtin_omp_select_location (&stack_local_var, size);
   ....
   __builtin_omp_maybe_free (p);

where the select_location could get simplified to a malloc for nvptx,
hopefully making the stack variable unused and discarded.
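
To make the shape of that idea concrete, here is a rough sketch in plain C.
Nothing below is an existing GCC builtin or libgomp entry point:
omp_select_location, omp_maybe_free, the STACK_IS_THREAD_PRIVATE macro,
struct omp_data, child_fn and run_region are all hypothetical stand-ins, and
the real thing would be expanded by the compiler rather than called as
ordinary functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical switch: 1 models a target with thread-private stacks
   (as on PTX), 0 models a "normal" target. */
#ifndef STACK_IS_THREAD_PRIVATE
#define STACK_IS_THREAD_PRIVATE 1
#endif

/* Hypothetical model of __builtin_omp_select_location: pick where the
   data block passed to other threads should live. */
static void *omp_select_location(void *stack_local, size_t size)
{
#if STACK_IS_THREAD_PRIVATE
    (void) stack_local;          /* stack copy is never used */
    return malloc(size);         /* memory other threads can reach */
#else
    (void) size;
    return stack_local;          /* expands to nothing extra */
#endif
}

/* Hypothetical model of __builtin_omp_maybe_free: release the block
   only if select_location allocated it. */
static void omp_maybe_free(void *p)
{
#if STACK_IS_THREAD_PRIVATE
    free(p);
#else
    (void) p;
#endif
}

/* Stand-in for the compiler-generated omp_data_o structure. */
struct omp_data { int a; int b; };

/* Stand-in for the outlined body of the parallel region. */
static int child_fn(const struct omp_data *d)
{
    return d->a + d->b;
}

/* What the expanded parallel region might look like in the caller. */
static int run_region(int a, int b)
{
    struct omp_data omp_data_o;
    struct omp_data *p = omp_select_location(&omp_data_o, sizeof omp_data_o);
    if (!p)
        abort();
    p->a = a;
    p->b = b;
    int result = child_fn(p);    /* GOMP_parallel would be called here */
    omp_maybe_free(p);
    return result;
}
```

With the macro set to 1, omp_data_o is only ever used to take its address,
which is the situation where one would hope the optimizers can discard it.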

> Using separate variables is wasteful: they should go into a union to reduce
> shared memory consumption.

Not sure what you mean by separate variables?
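
If the remark is about having one static variable per parallel region, then
my guess at the union idea would be something like the sketch below. This is
only an assumption about what was meant: region_a_data, region_b_data,
separate_layout and shared_layout are made-up names for illustration, not
anything from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up data blocks for two different parallel regions. */
struct region_a_data { int *array; int n; };
struct region_b_data { double scale; double bias; long count; };

/* One variable per region: the footprint is the sum of both blocks. */
struct separate_layout {
    struct region_a_data a;
    struct region_b_data b;
};

/* All regions overlaid in one union: the footprint is that of the
   largest block.  This is only valid if no two regions need their
   block at the same time. */
union shared_layout {
    struct region_a_data a;
    struct region_b_data b;
};
```

The saving only holds if the regions never have their data live
simultaneously, which would need to be guaranteed somehow.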
Bernd