On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:
> > I have some questions about nvptx:
> > 1) you've said that alloca isn't supported, but it seems
> >    to be wired up and uses the %alloca documented in the PTX
> >    manual, what is the issue with that?  %alloca not being actually
> >    implemented by the current PTX assembler or translator?
>
> Yes, it's unimplemented.  There's an internal declaration for it but that
> seems to be as far as it goes, and that declaration is 32-bit only anyway.
:(.  Does NVidia plan to fix that in the next version?

> > 2) what is the reason why TLS isn't supported by the port (well,
> >    __emutls is emitted, but I doubt pthread_[gs]etspecific is
> >    implementable and thus it will not really do anything).
> >    Can't the port just emit all DECL_THREAD_LOCAL_P variables
> >    into .local instead of .global address space?
>
> .local is stack frame memory, not TLS.  The ptx docs mention the use of
> .local at file-scope as occurring only in "legacy" ptx code and I get the
> impression it's discouraged.

:(.  So what other option does one have to implement something like TLS,
even using inline asm or similar?  There is %tid, so perhaps indexing some
array with %tid (see the sketch below for what I mean)?  The trouble with
that is that some thread can do #pragma omp parallel again, and I bet the
%tid afterwards would again be 0-(n-1), and if it is an index into a
global array, it wouldn't work well then.  Maybe without anything like TLS
we can't really support nested parallelism, only one level of
#pragma omp parallel inside of nvptx regions.  But, if we add support for
#pragma omp teams, we'd either need the array in gang-local memory, or
some other special register to give us the gang id.  BTW, one can still
invoke OpenMP target regions (even OpenACC regions) from multiple host
threads, so the question is how, without real TLS, we can actually do
anything at all.  Sure, we can pass parameters to the kernel, but we'd
need to propagate them through all functions.  Or can
cudaGetParameterBuffer be used for that?

> > 4) I had a brief look at what it would take to port libgomp to PTX,
> >    which is needed for OpenMP offloading.  OpenMP offloaded kernels
> >    should start with 1 team and 1 thread in it, if we ignore
> >    GOMP_teams for now.  I think the major things are:
> >    - right now libgomp is heavily pthread_* based, which is a no-go
> >      for nvptx I assume, I think we'll need some ifdefs in the sources
>
> I haven't looked into whether libpthread is doable.  I suspect it's a poor
> match.  I also haven't really looked into OpenMP, so I'm feeling a bit
> uncertain about answering your further questions.

What OpenMP needs is essentially:
- some way to spawn multiple threads (fork-join model), where the parent
  thread is the first one among those other threads, or, if that isn't
  possible, the first spawned thread pretends to be the parent thread
  while the parent thread sleeps
- something like pthread_mutex_lock/unlock (only basic; or say atomic ops
  + futex as we use on Linux)
- something like sem_* semaphores
- and some TLS or something similar (pthread_[gs]etspecific etc.)

> >    - the main thing is that I believe we just have to replace
> >      gomp_team_start for nvptx; it seems there are cudaLaunchDevice
> >      (and cudaGetParameterBuffer) functions one can use to spawn a
> >      selected kernel in a selected number of threads (and teams);
> >      from the docs it isn't exactly clear what the calling thread will
> >      do; if it is suspended and the HW core given to it is reused by
> >      something else (e.g. one of the newly spawned threads), then I
> >      think it should be usable.  Not sure what happens with .local
> >      memory of the parent task; if the children all have different
> >      .local memory, then perhaps one could just copy over what is
> >      needed from the invoking thread to the first invoked thread at
> >      start.
>
> I'm a bit confused here, it sounds as if you want to call cudaLaunchDevice
> from ptx code?  These are called from the host.  As mentioned above, .local
> is probably not useful for what you want.
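To make the %tid idea above concrete: roughly the following, a completely
untested sketch, where GOMP_PTX_MAX_THREADS, gomp_flat_tid and the
gomp_*specific names are all made up, the flattening only handles the x
dimension, and CUDA-style inline asm constraints are assumed.  It also has
exactly the nested #pragma omp parallel problem described above, since the
inner threads reuse the same %tid values.

/* One emulated-TLS slot per hardware thread, indexed by a flattened
   thread id.  */
#define GOMP_PTX_MAX_THREADS 4096  /* made-up upper bound */

static void *gomp_tls_slots[GOMP_PTX_MAX_THREADS];

static inline unsigned
gomp_flat_tid (void)
{
  unsigned tid, ntid, ctaid;
  asm ("mov.u32 %0, %%tid.x;" : "=r" (tid));      /* thread id in CTA */
  asm ("mov.u32 %0, %%ntid.x;" : "=r" (ntid));    /* CTA size */
  asm ("mov.u32 %0, %%ctaid.x;" : "=r" (ctaid));  /* CTA id in grid */
  return ctaid * ntid + tid;
}

static inline void *
gomp_getspecific (void)
{
  return gomp_tls_slots[gomp_flat_tid ()];
}

static inline void
gomp_setspecific (void *val)
{
  gomp_tls_slots[gomp_flat_tid ()] = val;
}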
In CUDA_Dynamic_Parallelism_Programming_Guide.pdf, C.3.2, it is mentioned
that calling cudaLaunchDevice from PTX should be possible; there is:

.extern .func (.param .b32 func_retval0) cudaLaunchDevice
(
  .param .b64 func,
  .param .b64 parameterBuffer,
  .param .align 4 .b8 gridDimension[12],
  .param .align 4 .b8 blockDimension[12],
  .param .b32 sharedMemSize,
  .param .b64 stream
);

(or s/.b64/.b32/ for -m32) that should be usable from within PTX.  The
Liao-OpenMP-Accelerator-Model-2013.pdf paper also mentions using dynamic
parallelism (because all other variants are just bad for OpenMP: you'd
need to preallocate all the gangs/threads without knowing how many you'll
need, and perhaps let them sleep on some barrier until you have work for
them).

> > - is it worth reusing cudaLaunchDevice "threads" or are they cheap
> >   enough to start that any "thread" pooling should be removed for
> >   nvptx?
>
> Sorry, I don't understand the question.

I meant: what is the cost of cudaLaunchDevice from within PTX compared to
keeping the threads around sleeping on a barrier?  As OpenMP doesn't
support threadprivate user vars in the offloaded regions, we don't have to
preserve any state and thus could always launch threads and tear them down
again.

> > - we'll need some synchronization primitives, I see atomic support is
> >   there, we need mutexes and semaphores I think, is that implementable
> >   using the bar instruction?
>
> It's probably membar you need.

That is a memory barrier; I need threads to wait on each other, wake one
another up, etc.  (A sketch of what I have in mind follows my signature.)

> > - is there any way to do any affinity management, or shall we just
> >   ignore affinity strategies?
>
> Not sure what they do in libgomp.  It's probably not a match for GPU
> architectures.

Ok.

> > - any way to query the time?
>
> There are %clock and %clock64 cycle counters.

Thanks.

	Jakub
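P.S. For the mutex question, roughly this kind of thing is what I have in
mind: a completely untested sketch, with made-up gomp_ptx_* names,
CUDA-style inline asm constraints assumed, and waiters that can only spin
(without something futex-like there is no way for them to sleep, which is
exactly the open question):

/* Spin mutex on a lock word in global memory: atomic compare-and-swap
   to take the lock, membar to order the critical section.  */
typedef volatile unsigned gomp_ptx_mutex_t;

static inline void
gomp_ptx_mutex_lock (gomp_ptx_mutex_t *lock)
{
  unsigned old;
  do
    /* Atomically: old = *lock; if (old == 0) *lock = 1.  */
    asm volatile ("atom.global.cas.b32 %0, [%1], 0, 1;"
                  : "=r" (old) : "l" (lock) : "memory");
  while (old != 0);
  asm volatile ("membar.gl;" ::: "memory");  /* acquire ordering */
}

static inline void
gomp_ptx_mutex_unlock (gomp_ptx_mutex_t *lock)
{
  unsigned old;
  asm volatile ("membar.gl;" ::: "memory");  /* release ordering */
  asm volatile ("atom.global.exch.b32 %0, [%1], 0;"
                : "=r" (old) : "l" (lock) : "memory");
}

And reading the cycle counter you mention would be just:

static inline unsigned long long
gomp_ptx_clock (void)
{
  unsigned long long t;
  asm volatile ("mov.u64 %0, %%clock64;" : "=l" (t));
  return t;
}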