On Wed, 21 Oct 2015, Bernd Schmidt wrote:
> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > (This patch serves as a straw man proposal to have something concrete
> > for discussion and further patches)
> >
> > On PTX, stack memory is private to each thread.  When the master thread
> > constructs 'omp_data_o' on its own stack and passes it to other threads
> > via GOMP_parallel by reference, the other threads cannot use the
> > resulting pointer.  We need to arrange for structures passed between
> > threads to be in global, or better, in PTX __shared__ memory (private
> > to each CUDA thread block).
>
> I guess the question is - why is it better? Do you have multiple thread
> blocks active in your execution model,

'#pragma omp teams' should map to spawning multiple thread blocks, so yes,
at least in my plans I do (but honestly I don't see how that affects the
heap-vs-shared-memory decision here).

> and do they require different omp_data_o structures?

Yes: each omp_data_o should be private to a team.

> Are accesses to it performance critical (more so than any other access?)

Not sure how to address the "more so than ..." part, but since omp_data_o
is accessed by all threads after entering a parallel region, potentially
many times throughout the region, it does seem helpful to place it in
shared memory.  I expect there will be other instances like this one,
where some on-stack data will need to be moved to team-shared storage for
nvptx.

> It might be slightly cleaner to have the copy described as a new builtin
> call that is always generated and expanded to nothing on normal targets
> rather than modifying existing calls in the IL. Or maybe:
>
>   p = __builtin_omp_select_location (&stack_local_var, size)
>   ....
>   __builtin_omp_maybe_free (p);
>
> where the select_location could get simplified to a malloc for nvptx,
> hopefully making the stack variable unused and discarded.

Agreed.

> > Using separate variables is wasteful: they should go into a union to
> > reduce shared memory consumption.
>
> Not sure what you mean by separate variables?

If two parallel regions are nested in a target region, there will be two
omp_data_o variables of potentially different types, but they can reuse
the same storage.  The patch does not achieve that, because it simply
emits a static __shared__ declaration for each original variable.

Alexander
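
P.S. To make the shared-memory point concrete, here is a minimal
hand-written CUDA sketch (field and function names are made up for the
example; this is not the code GCC would emit) of why omp_data_o has to
live in __shared__ rather than on the master thread's private stack:

  struct omp_data_o { int *arr; int n; };

  __global__ void
  kernel (int *arr, int n)
  {
    /* Per-block (per-team) slot: visible to every thread of the block,
       unlike a pointer into the master thread's private stack.  */
    __shared__ struct omp_data_o omp_data_o_shared;

    if (threadIdx.x == 0)          /* the "master" thread fills it in */
      {
        omp_data_o_shared.arr = arr;
        omp_data_o_shared.n = n;
      }
    __syncthreads ();              /* publish it to the rest of the team */

    /* Every thread of the block can now read the shared copy; a pointer
       to a stack-local omp_data_o would be unusable here.  */
    if (threadIdx.x < omp_data_o_shared.n)
      omp_data_o_shared.arr[threadIdx.x] += 1;
  }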
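
A rough model of how the builtin pair suggested above could expand (the
builtin names come from the quoted suggestion; the bodies below are only
my reading of the intended semantics, not an implementation):

  #include <stdlib.h>
  #include <string.h>

  /* On a normal (host) target the pair folds away: the data stays on
     the stack and nothing needs to be freed.  */
  static void *
  select_location_host (void *stack_local_var, size_t size)
  {
    (void) size;
    return stack_local_var;        /* no copy, no allocation */
  }

  static void
  maybe_free_host (void *p)        /* expands to nothing */
  {
    (void) p;
  }

  /* On nvptx, where stacks are thread-private, the same pair instead
     moves the data to storage visible to the whole team.  */
  static void *
  select_location_nvptx (void *stack_local_var, size_t size)
  {
    void *p = malloc (size);       /* or a per-team __shared__ buffer */
    memcpy (p, stack_local_var, size);
    return p;
  }

  static void
  maybe_free_nvptx (void *p)
  {
    free (p);
  }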
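
And similarly for the union point: with two parallel regions nested in
one target region, the idea is roughly this (invented member types again),
since the two regions are never live at the same time:

  struct omp_data_o_1 { int *a; int n; };        /* first parallel region */
  struct omp_data_o_2 { double *b; double s; };  /* second parallel region */

  /* One __shared__ slot backs both regions, instead of the patch's two
     separate static __shared__ declarations.  */
  static __shared__ union
  {
    struct omp_data_o_1 r1;
    struct omp_data_o_2 r2;
  } omp_data_o_shared_u;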