On Wed, Oct 21, 2015 at 06:18:25PM +0300, Alexander Monakov wrote:
> On Wed, 21 Oct 2015, Bernd Schmidt wrote:
> > On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > > This patch series ports enough of libgomp.c to get warp-level parallelism
> > > working for OpenMP offloading. The overall approach is as follows.
> >
> > Could you elaborate a bit what you mean by this just so we understand each
> > other in terms of terminology? "Warp-level" sounds to me like you have all
> > threads in a warp executing in lockstep at all times. If individual threads
> > can take different paths, I'd expect it to be called thread-level parallelism
> > or something like that.
>
> Sorry, that was unclear. What I meant is that there is a degree of
> parallelism available across different warps, but not across different teams
> (because only 1 team is spawned), nor across threads in a warp (because all
> threads in a warp except one exit immediately -- later on we'd need to
> keep them converged so they can enter a simd region together).
>
> > What is your end goal in terms of mapping GPU parallelism onto OpenMP?
>
> OpenMP team is mapped to a CUDA thread block, OpenMP thread is mapped to a
> warp, OpenMP simd lane is mapped to a CUDA thread. So, follow the OpenACC
> model. Like in OpenACC, we'd need to artificially deactivate/reactivate warp
> members on simd region boundaries.
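To make sure we're talking about the same thing, a rough hand-written CUDA
sketch of how I read that mapping (the function names are made up here, this
is not the actual libgomp.c/nvptx code):

#include <cstdio>

#define WARP_SIZE 32

__device__ void simd_body (int simd_lane, int omp_thread)
{
  /* Inside an OpenMP simd region: every CUDA thread (simd lane) is active.  */
  printf ("team %d thread %d lane %d\n", blockIdx.x, omp_thread, simd_lane);
}

__device__ void omp_thread_fn (int omp_thread)
{
  int lane = threadIdx.x % WARP_SIZE;

  /* Outside simd regions only lane 0 of each warp does work; the other
     lanes are "deactivated" (they skip the sequential part but do not
     exit, so the warp stays converged).  */
  if (lane == 0)
    { /* sequential per-OpenMP-thread code would run here */ }

  /* On entry to a simd region all lanes of the warp are reactivated.  */
  simd_body (lane, omp_thread);

  /* After the region, fall back to lane 0 only.  */
  if (lane == 0)
    { /* continue sequential per-OpenMP-thread code */ }
}

__global__ void team_fn (void)
{
  /* OpenMP team      -> CUDA thread block
     OpenMP thread    -> warp (WARP_SIZE consecutive CUDA threads)
     OpenMP simd lane -> CUDA thread  */
  int omp_thread = threadIdx.x / WARP_SIZE;
  omp_thread_fn (omp_thread);
}

int main ()
{
  /* One team with two OpenMP threads (two warps).  */
  team_fn<<<1, 2 * WARP_SIZE>>> ();
  cudaDeviceSynchronize ();
  return 0;
}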
Does that apply also to threads within a warp? I.e. is .local local to each
thread in the warp, or to the whole warp? And if the former, how can the
local vars be broadcast to the other threads at the start of a SIMD region,
and collected back at its end? Scalar vars are one thing; pointers,
references to various types, or even bigger indirection are another.

	Jakub
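P.S. For the scalar case, here is the kind of thing I have in mind, as a
rough CUDA sketch only (made-up example, not libgomp code): broadcast the
master lane's value on entry via a warp shuffle and collect the copies back
on exit. For pointers or references into .local this does not work, since a
.local address is only meaningful in the thread that owns it.

#include <cstdio>

#define WARP_SIZE 32
#define FULL_MASK 0xffffffffu

__global__ void simd_region_demo (void)
{
  int lane = threadIdx.x % WARP_SIZE;

  /* Per-CUDA-thread (.local) variable of the "OpenMP thread"; only the
     master lane's copy is meaningful outside the simd region.  */
  int x = 0;
  if (lane == 0)
    x = 42;

  /* Entry to the simd region: broadcast lane 0's value to all lanes.
     (__shfl_sync is the CUDA 9+ spelling; older toolkits had __shfl.)  */
  x = __shfl_sync (FULL_MASK, x, 0);

  /* ... simd body: each lane works on its own copy of x ...  */
  x += lane;

  /* Exit: collect the lanes' copies back, here by summing into lane 0.  */
  for (int off = WARP_SIZE / 2; off > 0; off /= 2)
    x += __shfl_down_sync (FULL_MASK, x, off);

  if (lane == 0)
    printf ("collected x = %d\n", x);
}

int main ()
{
  simd_region_demo<<<1, WARP_SIZE>>> ();
  cudaDeviceSynchronize ();
  return 0;
}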