Hi Jakub,

I have some questions about nvptx:
1) you've said that alloca isn't supported, but it seems
    to be wired up and uses the %alloca documented in the PTX
    manual; what is the issue with that?  Is %alloca not actually
    implemented by the current PTX assembler or translator?

Yes, it's unimplemented. There's an internal declaration for it but that seems to be as far as it goes, and that declaration is 32-bit only anyway.

2) what is the reason why TLS isn't supported by the port? (Well,
    __emutls is emitted, but I doubt pthread_[gs]etspecific is
    implementable, and thus it will not really do anything.)
    Can't the port just emit all DECL_THREAD_LOCAL_P variables
    into .local instead of the .global address space?

.local is stack frame memory, not TLS. The ptx docs mention the use of .local at file scope as occurring only in "legacy" ptx code, and I get the impression it's discouraged.
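For illustration (CUDA-level, not something the port emits verbatim): an automatic object that can't live in registers is exactly what .local holds, and it exists per call, not as persistent per-thread state:

__device__ int
frame_demo (int i)
{
  int buf[16];  /* dynamically indexed, so it lands in .local:
                   per-thread stack frame memory, gone on return */
  buf[i & 15] = i;
  return buf[i & 15];
}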

(As an aside, there's a question of how to represent a different concept, gang-local memory, in gcc. That would be .shared memory. We're currently going with just using an internal attribute.)
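In CUDA terms that gang-local memory is __shared__; a minimal sketch, with the CUDA spelling used purely for illustration since the gcc attribute is still internal:

__global__ void
gang_demo (int *out)
{
  __shared__ int buf[32];  /* a .shared array in ptx: one copy per
                              thread block ("gang"); assumes a
                              32-thread block */
  buf[threadIdx.x] = threadIdx.x;
  __syncthreads ();        /* bar.sync across the gang */
  if (threadIdx.x == 0)
    *out = buf[31];
}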

3) in assembly emitted by the nvptx port, I've noticed:
.visible .func (.param.u32 %out_retval)foo(.param.u64 %in_ar1, .param.u32 %in_ar2)
{
        .reg.u64 %ar1;
        .reg.u32 %ar2;
.reg.u32 %retval;
        .reg.u64 %hr10;
        .reg.u32 %r22;
        .reg.u64 %r25;
    is the missing \t before the %retval line intentional?

No, I can fix that up.

4) I had a brief look at what it would take to port libgomp to PTX,
    which is needed for OpenMP offloading.  OpenMP offloaded kernels
    should start with 1 team and 1 thread in it; if we ignore
    GOMP_teams for now, I think the major things are:
    - right now libgomp is heavily pthread_* based, which is a no-go
      for nvptx I assume; I think we'll need some ifdefs in the sources

I haven't looked into whether libpthread is doable. I suspect it's a poor match. I also haven't really looked into OpenMP, so I'm feeling a bit uncertain about answering your further questions.
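For the ifdefs, I'd imagine a split along these lines; LIBGOMP_USE_PTHREADS is an invented macro, not existing configury, and the fallback type is just a placeholder:

#ifdef LIBGOMP_USE_PTHREADS
# include <pthread.h>
typedef pthread_mutex_t gomp_mutex_t;
#else
/* nvptx: no libpthread; a bare-metal primitive would go here.  */
typedef int gomp_mutex_t;
#endif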

    - the main thing is that I believe we just have to replace
      gomp_team_start for nvptx; it seems there are
      cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
      to spawn a selected kernel in a selected number of threads (and
      teams).  From the docs it isn't exactly clear what the calling
      thread will do; if it is suspended and the HW core given to it is
      reused by something else (e.g. one of the newly spawned threads),
      then I think it should be usable.  Not sure what happens with the
      .local memory of the parent task; if the children all have
      different .local memory, then perhaps one could just copy over
      what is needed from the invoking thread to the first invoked
      thread at start.

I'm a bit confused here: it sounds as if you want to call cudaLaunchDevice from ptx code? I had understood these to be host-side calls. As mentioned above, .local is probably not useful for what you want.
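If they do turn out to be callable from device code, the sequence in the dynamic parallelism docs would look roughly like the sketch below; untested, and the worker kernel and its argument are invented:

#include <cuda_runtime.h>

__global__ void worker (int *data);  /* hypothetical offloaded kernel */

__device__ void
spawn_team (int *data, int nteams, int nthreads)
{
  /* Obtain a parameter buffer (alignment, size) and store the
     kernel argument into it.  */
  void *buf = cudaGetParameterBuffer (sizeof (int *), sizeof (int *));
  *(int **) buf = data;
  /* Launch nteams blocks ("teams") of nthreads threads each on the
     default stream; needs sm_35+ and linking against cudadevrt.  */
  cudaLaunchDevice ((void *) worker, buf, dim3 (nteams), dim3 (nthreads),
                    /*sharedMemSize=*/0, /*stream=*/NULL);
}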

    - is it worth reusing cudaLaunchDevice "threads", or are they cheap
      enough to start that any "thread" pooling should be removed for nvptx?

Sorry, I don't understand the question.

    - we'll need some synchronization primitives; I see atomic support is
      there.  We need mutexes and semaphores, I think; is that implementable
      using the bar instruction?

It's probably membar you need.
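Atomics plus membar should be enough for a mutex; here's an untested sketch in CUDA spellings (atomicCAS/atomicExch map to ptx atom.cas/atom.exch, __threadfence to membar.gl):

__device__ void
mutex_lock (int *lock)
{
  /* Spin until we flip the lock from 0 to 1.  Caveat: if threads of
     the same warp contend, SIMT divergence can make this livelock,
     so treat it as a sketch of the mechanism only.  */
  while (atomicCAS (lock, 0, 1) != 0)
    ;
  __threadfence ();  /* membar: keep later accesses inside the lock */
}

__device__ void
mutex_unlock (int *lock)
{
  __threadfence ();  /* make protected writes visible first */
  atomicExch (lock, 0);
}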

    - the library uses __attribute__((constructor)) in 3 places or so;
      initialize_team is pthread specific and can probably be ifdefed out,
      and we won't support dlclose on nvptx anyway, but at least we need
      some way to initialize the nvptx libgomp.  If the initialization is
      done in global memory, would it persist between different kernels,
      so that the initialization could be run once as a separate kernel?
      Or something else?

I think that it would persist, and this would be my scheme for implementing constructors, but I haven't actually tried it.
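Concretely, I'd imagine something like the sketch below; the names are invented, and the persistence of .global state across kernel launches is exactly the unverified assumption:

__device__ int gomp_nvptx_initialized;  /* .global: assumed to persist
                                           across kernel launches */

__global__ void
gomp_nvptx_init (void)
{
  /* Hypothetical init kernel, launched once before any real kernel;
     it would call the __attribute__((constructor)) functions.  */
  if (!gomp_nvptx_initialized)
    {
      /* ... run constructors here ...  */
      gomp_nvptx_initialized = 1;
    }
}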

    - is there any way to do any affinity management, or shall we just
      ignore affinity strategies?

I'm not sure what affinity strategies do in libgomp; they're probably not a match for GPU architectures.

    - any way to query time?

There are %clock and %clock64 cycle counters.
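Note these count SM cycles rather than wall time. Reading one from CUDA-style code is just a special-register move (the clock64() builtin expands to the same thing):

__device__ unsigned long long
read_clock64 (void)
{
  unsigned long long t;
  /* mov.u64 of the %clock64 special register: cycles, not seconds.  */
  asm volatile ("mov.u64 %0, %%clock64;" : "=l" (t));
  return t;
}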


Bernd
