tra added a subscriber: rsmith.
tra added a comment.

In D73979#1857728 <https://reviews.llvm.org/D73979#1857728>, @yaxunl wrote:

> All extern shared vars are sharing the same address, however, they may be 
> used as different types in different functions.
>
> For example,
>
>   __device__ int foo() {
>     extern __shared__ int a;
>     for (...) a+=...;
>     return a;
>   }
>  
>   __device__ double bar(int x) {
>     extern __shared__ double b[10];
>     for(...) b[x]+=...;
>     return b[0];
>   }
>  
>   __global__ void k() {
>     foo();
>     //...
>     bar();
>   }
>  
>


I do agree that it is possible to use `extern __shared__` given enough care. My 
point is that, as a feature, it is ill-designed, very easy to misuse, and creates 
more problems than it's worth, especially in non-trivial code.
As an illustration, what if `bar` needs to call `foo`, and `foo` lives in a header 
file somewhere else? Whoever implements `foo` must make sure that nothing else in 
the transitive call chain uses `extern __shared__`. That's hard to guarantee in 
practice, and it's very easy to introduce new dependencies without even being 
aware of them -- e.g., via an intermediate function that is not aware that its 
caller and callee have this restriction.

There are no compiler checks to warn you about it; you will only know about 
the problem when you encounter data corruption at runtime, and in machine 
learning applications that may go unnoticed for a long time.
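
For illustration (a minimal, hypothetical sketch -- none of these helpers exist 
anywhere, the names are made up), the hazard looks like this:

  // lives in a header somewhere:
  __device__ int reduce_helper() {
    extern __shared__ int a;      // aliases the start of dynamic shared memory
    a = threadIdx.x;
    return a;
  }

  // user code, unaware of the helper's shared-memory use:
  __device__ double accumulate() {
    extern __shared__ double b[]; // aliases the very same base address as 'a'
    b[0] = 1.0;
    reduce_helper();              // silently clobbers the low bytes of b[0]
    return b[0];
  }

Nothing in the source of `accumulate` hints that calling `reduce_helper` 
overwrites its data.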

> In one function foo, users need to use the shared memory as an int. In 
> another function, users need to use the shared memory as a double array. 
> Users just need to make sure they request sufficient dynamic shared memory in 
> triple chevron to be greater than the max dynamic shared memory usage.

Again, that requires complete knowledge of who uses this construct. Without the 
compiler's help that's hard to guarantee outside of simple use cases. I cannot 
imagine using it as-is in something like Thrust or Eigen. In fact, Thrust does 
provide `extern_shared_ptr` specifically to serve the same kind of API that I 
proposed.
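
For context, the sizing in question is just the third launch parameter, and 
getting it right already requires knowing the worst-case shared-memory use of 
everything reachable from the kernel (the names below are placeholders):

  // max_dynamic_shmem_bytes must cover every extern __shared__ user in the
  // transitive call graph under k().
  k<<<num_blocks, threads_per_block, max_dynamic_shmem_bytes>>>();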

> Users do not need to pass values in extern shared var between functions. They 
> just treat it as an uninitialized variable.

I'm OK with that, but they *do* need to make it explicit that they are dealing 
with externally allocated memory and they do need to pass something to identify 
which chunk of that memory they operate upon. It may be the pointer, or it may 
be an offset to be used relative to an `extern __shared__` base.

> Forbidding different types for extern shared variable does not add any 
> benefit, just forcing users to work around the limitation and resulting in 
> less readable code.

That's where we disagree. I believe that in this case it would be a net benefit 
to be even more restrictive with `extern __shared__` than we are right now and 
get users to explicitly treat `extern __shared__` as externally allocated 
memory.

It does not take all that much code to make it work and it does result in more 
robust code. E.g. your example can be rewritten like this:

  // Hands out chunks of the dynamically allocated shared memory.
  __device__ void *get_shmem(size_t offset) {
    extern __shared__ char shmem[];
    return &shmem[offset];
  }

  __device__ int foo(int *a) {
    for (...) *a += ...;
    return *a;
  }

  __device__ double bar(double *bx) {
    for (...) *bx += ...;
    return *bx;
  }

  __global__ void k() {
    foo(static_cast<int *>(get_shmem(0)));
    //...
    bar(static_cast<double *>(get_shmem(0)));
  }

Net benefits that I see:

- Functions are composable now -- one can call foo from bar and vice versa and 
ensure they don't step on each other's toes.
- It's clear that they operate on the same buffer when called from `k` -- 
arguably, that's the place where it matters.
- It's easy to change if you need them to work on different sub-buffers.
- foo/bar are not limited to working on shared memory only.
- foo/bar can execute in diverged branches, if given non-overlapping buffers. 
The original example would potentially fail in interesting ways if `k` does 
something like this:

  __global__ void k() {
    if (threadIdx.x < 16)
      foo();
    else
      bar();
  }
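
(With the original `extern __shared__` version, foo's int and bar's doubles 
alias the same base address, so the two branches corrupt each other's data.) 
With the pointer-based rewrite the fix stays local to `k` -- a sketch, reusing 
the signatures from the example above; the offset value is only illustrative:

  __global__ void k() {
    if (threadIdx.x < 16)
      foo(static_cast<int *>(get_shmem(0)));
    else
      // Give bar a non-overlapping chunk. Offset 8 skips past foo's int
      // (plus padding) and keeps double alignment.
      bar(static_cast<double *>(get_shmem(8)));
  }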

It's much easier not to open this can of worms than to clean it up afterwards, 
once more users grow to depend on it.
I don't think it should be enabled for CUDA, and I don't think it would be a 
good idea for HIP, either.
Perhaps we need a third opinion from someone with a broader perspective.

@rsmith -- do you have an opinion on what should be done with tactically 
useful but strategically unsound features in general, and with this 
CUDA-specific oddity specifically?

