tra added inline comments.
================
Comment at: clang/lib/Sema/SemaOverload.cpp:9749
+  if (isBetterMultiversionCandidate(Cand1, Cand2))
+    return true;
+
----------------
yaxunl wrote:
> rjmccall wrote:
> > tra wrote:
> > > rjmccall wrote:
> > > > erichkeane wrote:
> > > > > yaxunl wrote:
> > > > > > echristo wrote:
> > > > > > > rjmccall wrote:
> > > > > > > > yaxunl wrote:
> > > > > > > > > rjmccall wrote:
> > > > > > > > > > If we move anything below this check, it needs to figure out a tri-state so that it can return false if `Cand2` is a better candidate than `Cand1`. Now, that only matters if multiversion functions are supported under CUDA, but if you're relying on them not being supported, that should at least be commented on.
> > > > > > > > > Multiversion host functions are orthogonal to CUDA and therefore should be supported. Multiversioning of device, host device, and global functions is not supported. However, this change does not make things worse, and it should continue to work if they are supported.
> > > > > > > > >
> > > > > > > > > Host/device-based overload resolution is mostly for determining the viability of a function. If two functions are both viable, other factors should take precedence. This general rule has been applied to cases other than multiversion; I think it should also apply to multiversion.
> > > > > > > > >
> > > > > > > > > I will make isBetterMultiversionCandidate return three states.
> > > > > > > > > This general rule has been applied to cases other than multiversion; I think it should also apply to multiversion.
> > > > > > > >
> > > > > > > > Well, but the multiversion people could say the same: that multiversioning is for picking an alternative among otherwise-identical functions, and HD and H functions are not otherwise-identical.
> > > > > > > >
> > > > > > > > CC'ing @echristo for his thoughts on the right ordering here.
> > > > > > > Adding @erichkeane here as well.
> > > > > > >
> > > > > > > I think this makes sense, but I can see a reason to multiversion a function that will run on host and device. A version of some matrix multiply that takes advantage of three host architectures and one CUDA one? Am I missing something here?
> > > > > > My understanding is that a multiversion function is for a specific CPU (or GPU). Let's say we want to have a function f for gfx900, gfx906, sandybridge, and ivybridge; shouldn't they be more like
> > > > > >
> > > > > > ```
> > > > > > __host__ __attribute__((cpu_specific(sandybridge))) f();
> > > > > > __host__ __attribute__((cpu_specific(ivybridge))) f();
> > > > > > __device__ __attribute__((cpu_specific(gfx900))) f();
> > > > > > __device__ __attribute__((cpu_specific(gfx906))) f();
> > > > > > ```
> > > > > > instead of all `__device__ __host__` functions?
> > > > > IMO, it doesn't make sense for functions to be BOTH host and device; they'd have to be just one. Otherwise I'm not sure how the resolver behavior is supposed to work. The whole idea is that the definition is chosen at runtime.
> > > > >
> > > > > Unless `__host__ __device__ void foo();` is TWO declaration chains (meaning two separate AST entries), it doesn't make sense to have multiversion on it (and then, how it would be spelled is awkward/confusing to me).
> > > > >
> > > > > In the above case, if those 4 declarations are not 2 separate root AST nodes, multiversioning won't work.
> > > > There are certainly functions that ought to be usable from either host or device context — any inline function that just does ordinary language things should be in that category. Also, IIUC, many declarations are *inferred* to be `__host__ __device__`, or can be mass-annotated with pragmas, and those reasons are probably the main ones this might matter — we might include a header in CUDA mode that declares a multi-versioned function, and we should handle it right.
> > > >
> > > > My read of how CUDA programmers expect this to work is that they see the `__host__` / `__device__` attributes as primarily a mechanism for catching problems where you're using the wrong functions for the current configuration. That is, while we allow overloading by `__host__`/`__device__`-ness, users expect those attributes to mostly be used as a filter for what's "really there" rather than really strictly segregating the namespace. So I would say that CUDA programmers would probably expect the interaction with multiversioning to be:
> > > >
> > > > - Programmers can put `__host__`, `__device__`, or both on a variant depending on where it is usable.
> > > > - Dispatches should simply ignore any variants that aren't usable for the current configuration.
> > > >
> > > > And specifically they would not expect e.g. a `__host__` dispatch function to only consider `__host__` variants — it should be able to dispatch to anything available, which is to say, it should also include `__host__ __device__` variants. Similarly (and probably more usefully), a `__host__ __device__` dispatch function being compiled for the device should also consider pure `__device__` functions, and so on.
> > > >
> > > > If we accept that, then I think it gives us a much better idea for how to resolve the priority of the overload rules. The main impact of `isBetterMultiversionCandidate` is to try to ensure that we're looking at the `__attribute__((cpu_dispatch))` function instead of one of the `__attribute__((cpu_specific))` variants. (It has no effect on `__attribute__((target))` multi-versioning, mostly because it doesn't need to: target-specific variants don't show up in lookup with `__attribute__((target))`.) That rule should take precedence over the CUDA preference for exact matches, because e.g. if we're compiling this:
> > > >
> > > > ```
> > > > __host__ __device__ int magic(void) __attribute__((cpu_dispatch("...")));
> > > > __host__ __device__ int magic(void) __attribute__((cpu_specific(generic)));
> > > > __host__ int magic(void) __attribute__((cpu_specific(mmx)));
> > > > __host__ int magic(void) __attribute__((cpu_specific(sse)));
> > > > __device__ int magic(void) __attribute__((cpu_specific(some_device_feature)));
> > > > __device__ int magic(void) __attribute__((cpu_specific(some_other_device_feature)));
> > > > ```
> > > >
> > > > then we don't want the compiler to prefer a CPU-specific variant over the dispatch function just because one of the variants was marked `__host__`.
> > > It's a bit more complicated and a bit less straightforward than that. :-(
> > > https://goo.gl/EXnymm
> > > Handling of target attributes is where clang is very different from NVCC, so no matter which mental model of "CUDA programmer" you pick, there's another one which will not match.
> > >
> > > In the existing code `__host__ __device__` is commonly used as a sledgehammer to work around NVCC's limitations. It does not allow attribute-based overloading, so the only way you can specialize a function for host/device is via something like this:
> > > ```
> > > __host__ __device__ void foo() {
> > > #if __CUDA_ARCH__ > 0
> > >   // GPU code
> > > #else
> > >   // CPU code.
> > > #endif
> > > }
> > > ```
> > >
> > > With clang you can write separate overloaded functions and we'll do our best to pick the one you meant to call. Alas, there are cases where it's ambiguous and depends on the callee's attributes, which may depend on theirs. When something ends up being called from different contexts, interesting things start happening. With more functions becoming constexpr (those are implicitly HD), we'll be running into such impossible-to-do-the-right-thing situations more often. The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolution considering only same-side functions, which will, in effect, completely separate the host and device namespaces.
> > >
> > > Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. On one hand, kernel launch is already a form of runtime dispatch, only it's the CUDA runtime that does the dispatching, based on the GPU one attempts to run the kernel on. `__device__` functions are always compiled for the specific GPU variant. Also, GPU variants often have different instruction sets and can't be mixed together in the same object file at all, so there are no variants once we're running the code, as it's already compiled for precisely the GPU we're running on. Almost. Technically GPUs in the same family do share the same instruction sets, but I'm not sure runtime dispatch would buy us much there, as the hardware differences are relatively minor.
> > >
> > > The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolution considering only same-side functions, which will, in effect, completely separate the host and device namespaces.
> >
> > Okay. Well, even if you completely split host and device functions, I think we'd still want to prefer dispatch functions over variant functions before preferring H over HD.
> >
> > Although... I suppose we *do* want to consider H vs. HD before looking at the more arbitrary factors that `isBetterMultiversionCandidate` looks at, like the number of architectures in the dispatch. Honestly, though, those just seem like bad rules that we should drop from the code.
> >
> > > Run-time dispatch is also somewhat irrelevant to CUDA. Sort of.
> >
> > I understand that there's very little reason (or even ability) to use multiversioning in device code, but it can certainly happen in host code, right? Still, I guess the easiest thing would just be to forbid multiversioned functions on the device.
> Will change back the precedence so that multiversion is considered before host/device.

@rjmccall I'm OK with your reasoning & this patch. As long as the change does not break existing code, I'm fine.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D77954/new/

https://reviews.llvm.org/D77954

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
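For readers unfamiliar with the attributes debated above, here is a minimal host-only sketch (not taken from the review) of `cpu_specific`/`cpu_dispatch` multiversioning. It assumes an x86 ELF target and uses a made-up function name; the CUDA annotations are deliberately left out. As rjmccall describes above, `isBetterMultiversionCandidate` exists so that a plain call resolves to the dispatcher rather than to one of the variants.

```
// Sketch only: plain host-side multiversioning, x86 target assumed.
#include <cstdio>

// Variant used on generic x86 CPUs.
__attribute__((cpu_specific(generic)))
void say_hello(void) { std::puts("generic variant"); }

// Variant used on Ivy Bridge CPUs.
__attribute__((cpu_specific(ivybridge)))
void say_hello(void) { std::puts("ivybridge variant"); }

// Dispatcher: lists the CPUs it dispatches among. The body stays empty;
// the compiler emits a resolver that picks a variant when the program loads.
__attribute__((cpu_dispatch(generic, ivybridge)))
void say_hello(void) {}

int main() {
  say_hello();  // overload resolution should pick the dispatcher, not a variant
  return 0;
}
```

Under the ordering settled on in the thread, adding `__host__`/`__device__` annotations to some of the variants should not cause a call to bypass the dispatcher in favor of a single variant.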