Hi!  I have some questions about nvptx:

1) You've said that alloca isn't supported, but it seems to be wired
up and uses the %alloca documented in the PTX manual, so what is the
issue with that?  Is %alloca not actually implemented by the current
PTX assembler or translator?  Or is it some local vs. global address
space issue?  If the latter, could at least VLAs be supported?
(There is a sketch below of the kind of code I mean.)

2) What is the reason why TLS isn't supported by the port?  (Well,
__emutls is emitted, but I doubt pthread_[gs]etspecific is
implementable, and thus it will not really do anything.)  Can't the
port just emit all DECL_THREAD_LOCAL_P variables into the .local
instead of the .global address space?  Would one need to convert
those pointers to generic anyway?  I'm asking because e.g. libgomp
uses __thread heavily and it would be nice to be able to use that
(second sketch below).

3) In the assembly emitted by the nvptx port, I've noticed:

.visible .func (.param.u32 %out_retval)foo(.param.u64 %in_ar1, .param.u32 %in_ar2)
{
	.reg.u64 %ar1;
	.reg.u32 %ar2;
.reg.u32 %retval;
	.reg.u64 %hr10;
	.reg.u32 %r22;
	.reg.u64 %r25;

Is the missing \t before the %retval line intentional?

4) I had a brief look at what it would take to port libgomp to PTX,
which is needed for OpenMP offloading.  OpenMP offloaded kernels
should start with 1 team and 1 thread in it; if we ignore GOMP_teams
for now, I think the major things are:
- Right now libgomp is heavily pthread_* based, which is a no-go for
  nvptx I assume; I think we'll need some ifdefs in the sources.
- The main thing is that I believe we just have to replace
  gomp_team_start for nvptx; it seems there are cudaLaunchDevice (and
  cudaGetParameterBuffer) functions one can use to spawn a selected
  kernel in a selected number of threads (and teams).  From the docs
  it isn't exactly clear what the calling thread will do; if it is
  suspended and the HW core given to it is reused by something else
  (e.g. one of the newly spawned threads), then I think it should be
  usable.  I'm not sure what happens with the .local memory of the
  parent task; if the children all have different .local memory, then
  perhaps one could just copy over what is needed from the invoking
  thread to the first invoked thread at start.  The question is how
  to figure out what to pass to cudaLaunchDevice (e.g. how to get a
  handle of the current stream), and how to query how many teams
  and/or threads it is reasonable to ask for if the program wants
  defaults (and how many teams/threads are hard limits beyond which
  one can't go).  (A launch sketch is below.)
- Is it worth reusing cudaLaunchDevice "threads", or are they cheap
  enough to start that any "thread" pooling should be removed for
  nvptx?
- We'll need some synchronization primitives; I see atomic support is
  there, and we need mutexes and semaphores I think.  Is that
  implementable using the bar instruction?  (Spinlock sketch below.)
- The library uses __attribute__((constructor)) in 3 places or so;
  initialize_team is pthread-specific and can probably be ifdefed
  out, and we won't support dlclose on nvptx anyway, but at least we
  need some way to initialize the nvptx libgomp.  If the
  initialization is done in global memory, would it persist in
  between different kernels, so that the initialization could be run
  once as a separate kernel, or something else?  (Sketch below.)
- Is there any way to do any affinity management, or shall we just
  ignore affinity strategies?
- The target/offloading stuff should most likely be stubbed out in
  the library for nvptx; target data / target regions inside of
  target regions are undefined behavior in OpenMP, so there is no
  need to bloat things.
- Is there any way to query time?  (The %clock64 sketch below is all
  I've found.)

Other thoughts?
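For (1), the kind of code I'd like to see supported is a plain VLA
whose allocation never escapes the function's own frame, e.g.:

int
sum_squares (int n)
{
  int tmp[n];	/* VLA, size only known at run time.  */
  int i, s = 0;
  for (i = 0; i < n; i++)
    tmp[i] = i * i;
  for (i = 0; i < n; i++)
    s += tmp[i];
  return s;
}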
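For (2), this is roughly the libgomp idiom I'd like to keep working
(simplified here; gomp_thread () is used all over the library):

struct gomp_thread { void *data; /* ... */ };

static __thread struct gomp_thread gomp_tls_data;

static inline struct gomp_thread *
gomp_thread (void)
{
  return &gomp_tls_data;
}

If the port emitted gomp_tls_data into the .local address space
instead of .global, each thread would get its own copy for free; the
open question above is just whether the resulting pointers would then
need converting to generic.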
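For the gomp_team_start replacement in (4), here is a very rough
sketch of what I imagine, going purely by the CUDA dynamic parallelism
documentation (untested; the dim3/cudaError_t/cudaStream_t
declarations are just stand-ins for the device runtime's types,
nvptx_team_start is a made-up name, and what to pass for the stream is
exactly one of my questions):

#include <stddef.h>

typedef struct { unsigned int x, y, z; } dim3;
typedef int cudaError_t;
typedef void *cudaStream_t;

/* Device-side launch API as documented in the CUDA manual.  */
extern void *cudaGetParameterBuffer (size_t alignment, size_t size);
extern cudaError_t cudaLaunchDevice (void *func, void *parameter_buffer,
                                     dim3 grid_dim, dim3 block_dim,
                                     unsigned int shared_mem_size,
                                     cudaStream_t stream);

static void
nvptx_team_start (void *fn, void *data, unsigned nteams,
                  unsigned nthreads)
{
  dim3 grid = { nteams, 1, 1 };
  dim3 block = { nthreads, 1, 1 };
  /* Pass the single data pointer through the parameter buffer.  */
  void **buf = cudaGetParameterBuffer (__alignof__ (void *),
                                       sizeof (void *));
  *buf = data;
  cudaLaunchDevice (fn, buf, grid, block, 0,
                    /* which stream?  */ (cudaStream_t) 0);
}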
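For the synchronization primitives in (4): if bar only gives barriers,
I'd expect mutexes to sit on top of the atomic support instead,
something like the following (assuming the __sync builtins expand to
atom instructions on nvptx):

typedef unsigned int gomp_mutex_t;

static inline void
gomp_mutex_lock (gomp_mutex_t *mutex)
{
  /* atom.exch-style spinning; no idea yet what a polite way to back
     off on this HW would be.  */
  while (__sync_lock_test_and_set (mutex, 1))
    ;
}

static inline void
gomp_mutex_unlock (gomp_mutex_t *mutex)
{
  __sync_lock_release (mutex);
}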
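For the initialization in (4), if .global memory does persist between
kernel launches, something as simple as this (run from the first
kernel, or as a separate init kernel) might do instead of the
constructors:

static int gomp_nvptx_initialized;	/* lives in .global memory */

static void
gomp_nvptx_init (void)
{
  /* Racy if several threads can get here before the init finishes,
     but good enough as a sketch of the idea.  */
  if (__sync_bool_compare_and_swap (&gomp_nvptx_initialized, 0, 1))
    {
      /* ... do what the __attribute__((constructor)) functions
	 used to do ...  */
    }
}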
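And for querying time in (4), the only candidates I've found in the
PTX manual are the %clock/%clock64 special registers, i.e. something
like this (assuming the port supports inline asm of this form and
that "l" is the right constraint for a 64-bit register; turning
cycles into seconds would additionally need the SM clock rate from
somewhere):

static inline unsigned long long
nvptx_clock64 (void)
{
  unsigned long long t;
  asm volatile ("mov.u64 %0, %%clock64;" : "=l" (t));
  return t;
}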
Jakub