tra added inline comments.
================
Comment at: clang/lib/Sema/SemaOverload.cpp:9749
+ if (isBetterMultiversionCandidate(Cand1, Cand2))
+ return true;
+
----------------
yaxunl wrote:
> rjmccall wrote:
> > tra wrote:
> > > rjmccall wrote:
> > > > erichkeane wrote:
> > > > > yaxunl wrote:
> > > > > > echristo wrote:
> > > > > > > rjmccall wrote:
> > > > > > > > yaxunl wrote:
> > > > > > > > > rjmccall wrote:
> > > > > > > > > > If we move anything below this check, it needs to figure
> > > > > > > > > > out a tri-state so that it can return false if `Cand2` is a
> > > > > > > > > > better candidate than `Cand1`. Now, that only matters if
> > > > > > > > > > multiversion functions are supported under CUDA, but if
> > > > > > > > > > you're relying on them not being supported, that should at
> > > > > > > > > > least be commented on.
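> > > > > > > > > >
> > > > > > > > > > For concreteness, a standalone sketch of the tri-state shape
> > > > > > > > > > this would need; the names and types below are illustrative,
> > > > > > > > > > not clang's actual ones:
> > > > > > > > > >
> > > > > > > > > > ```
> > > > > > > > > > #include <optional>
> > > > > > > > > >
> > > > > > > > > > // Illustrative only; not clang's OverloadCandidate.
> > > > > > > > > > struct Candidate { /* fields used by the multiversion rules */ };
> > > > > > > > > >
> > > > > > > > > > // true: Cand1 is better, false: Cand2 is better, nullopt: this
> > > > > > > > > > // rule does not decide and the later checks should run.
> > > > > > > > > > std::optional<bool>
> > > > > > > > > > compareMultiversionCandidates(const Candidate &Cand1,
> > > > > > > > > >                               const Candidate &Cand2) {
> > > > > > > > > >   // ... the actual multiversion rules would go here ...
> > > > > > > > > >   return std::nullopt;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > bool isBetterCandidateSketch(const Candidate &Cand1,
> > > > > > > > > >                              const Candidate &Cand2) {
> > > > > > > > > >   if (auto Better = compareMultiversionCandidates(Cand1, Cand2))
> > > > > > > > > >     return *Better;
> > > > > > > > > >   // ... otherwise fall through to the remaining comparisons ...
> > > > > > > > > >   return false;
> > > > > > > > > > }
> > > > > > > > > > ```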
> > > > > > > > > Multiversioning of host functions is orthogonal to CUDA and
> > > > > > > > > therefore should be supported. Multiversioning of device, host
> > > > > > > > > device, and global functions is not supported. However, this
> > > > > > > > > change does not make things worse, and it should continue to
> > > > > > > > > work if they are ever supported.
> > > > > > > > >
> > > > > > > > > Host/device-based overload resolution is mostly for
> > > > > > > > > determining the viability of a function. If two functions are
> > > > > > > > > both viable, other factors should take precedence. This
> > > > > > > > > general rule has been applied to cases other than
> > > > > > > > > multiversioning, and I think it should also apply to
> > > > > > > > > multiversioning.
> > > > > > > > >
> > > > > > > > > I will make isBetterMultiversionCandidate three states.
> > > > > > > > > This general rule has been applied to cases other than
> > > > > > > > > multiversioning, and I think it should also apply to
> > > > > > > > > multiversioning.
> > > > > > > >
> > > > > > > > Well, but the multiversion people could say the same: that
> > > > > > > > multiversioning is for picking an alternative among
> > > > > > > > otherwise-identical functions, and HD and H functions are not
> > > > > > > > otherwise-identical.
> > > > > > > >
> > > > > > > > CC'ing @echristo for his thoughts on the right ordering here.
> > > > > > > Adding @erichkeane here as well.
> > > > > > >
> > > > > > > I think this makes sense, but I can see a reason to multiversion
> > > > > > > a function that will run on host and device. A version of some
> > > > > > > matrix mult that takes advantage of 3 host architectures and one
> > > > > > > CUDA one? Am I missing something here?
> > > > > > My understanding is that a multiversion function is for a specific
> > > > > > CPU (or GPU). Let's say we want to have a function `f` for gfx900,
> > > > > > gfx906, sandybridge, and ivybridge; shouldn't they be declared more
> > > > > > like
> > > > > >
> > > > > > ```
> > > > > > __host__ __attribute__((cpu_specific(sandybridge))) void f();
> > > > > > __host__ __attribute__((cpu_specific(ivybridge))) void f();
> > > > > > __device__ __attribute__((cpu_specific(gfx900))) void f();
> > > > > > __device__ __attribute__((cpu_specific(gfx906))) void f();
> > > > > > ```
> > > > > > instead of all `__device__ __host__` functions?
> > > > > IMO, it doesn't make sense for functions to be BOTH host and
> > > > > device; they'd have to be just one. Otherwise I'm not sure how
> > > > > the resolver behavior is supposed to work. The whole idea is that
> > > > > the definition is chosen at runtime.
> > > > >
> > > > > Unless `__host__ __device__ void foo();` is TWO declaration chains
> > > > > (meaning two separate AST entries), it doesn't make sense to have
> > > > > multiversioning on it (and then, how it would be spelled is
> > > > > awkward/confusing to me).
> > > > >
> > > > > In the above case, if those 4 declarations are not 2 separate
> > > > > root AST nodes, multiversioning won't work.
> > > > There are certainly functions that ought to be usable from either host
> > > > or device context — any inline function that just does ordinary
> > > > language things should be in that category. Also IIUC many
> > > > declarations are *inferred* to be `__host__ __device__`, or can be
> > > > mass-annotated with pragmas, and those are probably the main cases
> > > > where this might matter: we might include a header in CUDA mode that
> > > > declares a multi-versioned function, and we should handle it right.
> > > >
> > > > My read of how CUDA programmers expect this to work is that they see
> > > > the `__host__` / `__device__` attributes as primarily a mechanism for
> > > > catching problems where you're using the wrong functions for the
> > > > current configuration. That is, while we allow overloading by
> > > > `__host__`/`__device__`-ness, users expect those attributes to mostly
> > > > be used as a filter for what's "really there" rather than really
> > > > strictly segregating the namespace. So I would say that CUDA
> > > > programmers would probably expect the interaction with multiversioning
> > > > to be:
> > > >
> > > > - Programmers can put `__host__`, `__device__`, or both on a variant,
> > > > depending on where it is usable.
> > > > - Dispatches should simply ignore any variants that aren't usable for
> > > > the current configuration.
> > > >
> > > > And specifically they would not expect e.g. a `__host__` dispatch
> > > > function to only consider `__host__` variants — it should be able to
> > > > dispatch to anything available, which is to say, it should also include
> > > > `__host__ __device__` variants. Similarly (and probably more
> > > > usefully), a `__host__ __device__` dispatch function being compiled for
> > > > the device should also consider pure `__device__` functions, and so on.
> > > >
> > > > If we accept that, then I think it gives us a much better idea for how
> > > > to resolve the priority of the overload rules. The main impact of
> > > > `isBetterMultiversionCandidate` is to try to ensure that we're looking
> > > > at the `__attribute__((cpu_dispatch))` function instead of one of the
> > > > `__attribute__((cpu_specific))` variants. (It has no effect on
> > > > `__attribute__((target))` multi-versioning, mostly because it doesn't
> > > > need to: target-specific variants don't show up in lookup with
> > > > `__attribute__((target))`.) That rule should take precedence over the
> > > > CUDA preference for exact matches, because e.g. if we're compiling this:
> > > >
> > > > ```
> > > > __host__ __device__ int magic(void)
> > > >     __attribute__((cpu_dispatch("...")));
> > > > __host__ __device__ int magic(void)
> > > >     __attribute__((cpu_specific(generic)));
> > > > __host__ int magic(void) __attribute__((cpu_specific(mmx)));
> > > > __host__ int magic(void) __attribute__((cpu_specific(sse)));
> > > > __device__ int magic(void)
> > > >     __attribute__((cpu_specific(some_device_feature)));
> > > > __device__ int magic(void)
> > > >     __attribute__((cpu_specific(some_other_device_feature)));
> > > > ```
> > > >
> > > > then we don't want the compiler to prefer a CPU-specific variant over
> > > > the dispatch function just because one of the variants was marked
> > > > `__host__`.
> > > It's a bit more complicated and a bit less straightforward than that. :-(
> > > https://goo.gl/EXnymm
> > > Handling of target attributes is where clang is very different from
> > > NVCC, so no matter which mental model of "CUDA programmer" you pick,
> > > there's another one that will not match.
> > >
> > > In existing code, `__host__ __device__` is commonly used as a
> > > sledgehammer to work around NVCC's limitations. NVCC does not allow
> > > attribute-based overloading, so the only way you can specialize a
> > > function for host/device is via something like this:
> > > ```
> > > __host__ __device__ void foo() {
> > > #if __CUDA_ARCH__ > 0
> > > // GPU code
> > > #else
> > > // CPU code.
> > > #endif
> > > }
> > > ```
> > >
> > > With clang you can write separate overloaded functions and we'll do our
> > > best to pick the one you meant to call. Alas, there are cases where the
> > > choice is ambiguous and depends on the caller's attributes, which may in
> > > turn depend on their own callers. When something ends up being called
> > > from different contexts, interesting things start happening. With more
> > > functions becoming constexpr (those are implicitly HD), we'll be running
> > > into such impossible-to-do-the-right-thing situations more often. The
> > > only reliable way to avoid such ambiguity is to 'clone' HD functions into
> > > separate H & D functions and do overload resolution considering only
> > > same-side functions, which will, in effect, completely separate the host
> > > and device namespaces.
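> > >
> > > For illustration, the kind of attribute-based overloading clang accepts
> > > (the function names here are made up):
> > >
> > > ```
> > > __host__ int pick() { return 1; }    // host-only overload
> > > __device__ int pick() { return 2; }  // device-only overload
> > >
> > > // An HD function (or an implicitly HD constexpr function) calls pick();
> > > // which overload is meant depends on whether this body is compiled for
> > > // the host or for the device, i.e. on the caller's side.
> > > __host__ __device__ int caller() { return pick(); }
> > > ```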
> > >
> > > Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. On one
> > > hand, a kernel launch is already a form of runtime dispatch, only it's
> > > the CUDA runtime that does the dispatching, based on the GPU one
> > > attempts to run the kernel on. `__device__` functions are always
> > > compiled for a specific GPU variant. Also, GPU variants often have
> > > different instruction sets and can't be mixed together in the same
> > > object file at all, so there are no variants to choose from by the time
> > > we're running the code; it's already compiled for precisely the GPU
> > > we're running on. Almost. Technically, GPUs in the same family do share
> > > the same instruction set, but I'm not sure runtime dispatch would buy us
> > > much there, as the hardware differences are relatively minor.
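> > >
> > > As a rough sketch of that existing form of dispatch (the kernel name,
> > > sizes, and architectures below are arbitrary):
> > >
> > > ```
> > > __global__ void scale(float *data, float factor) {
> > >   data[threadIdx.x] *= factor;
> > > }
> > >
> > > void launch(float *device_data) {
> > >   // The fat binary may embed code for several GPU variants (e.g. built
> > >   // with --cuda-gpu-arch=sm_70 --cuda-gpu-arch=sm_80); at launch the
> > >   // CUDA runtime picks the embedded code matching the GPU it runs on.
> > >   scale<<<1, 256>>>(device_data, 2.0f);
> > > }
> > > ```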
> > >
> > > The only reliable way to avoid such ambiguity is to 'clone' HD functions
> > > into separate H & D functions and do overload resolution considering
> > > only same-side functions, which will, in effect, completely separate the
> > > host and device namespaces.
> >
> > Okay. Well, even if you completely split host and device functions, I
> > think we'd still want to prefer dispatch functions over variant functions
> > before preferring H over HD.
> >
> > Although... I suppose we *do* want to consider H vs. HD before looking at
> > the more arbitrary factors that `isBetterMultiversionCandidate` looks at,
> > like the number of architectures in the dispatch. Honestly, though, those
> > just seem like bad rules that we should drop from the code.
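> >
> > To make that ordering concrete, here's a standalone sketch; the struct
> > and field names are invented and this is not clang's actual candidate
> > machinery:
> >
> > ```
> > #include <optional>
> >
> > // Invented stand-in for the candidate properties under discussion.
> > struct Cand {
> >   bool CPUDispatch; // this is the __attribute__((cpu_dispatch)) function
> >   int CUDAPref;     // higher = better host/device match for the caller
> >   int NumArchs;     // the "arbitrary" tie-breaker mentioned above
> > };
> >
> > // true: Cand1 wins, false: Cand2 wins, nullopt: no preference here.
> > std::optional<bool> isBetterSketch(const Cand &C1, const Cand &C2) {
> >   // 1. Prefer the dispatch function over its cpu_specific variants.
> >   if (C1.CPUDispatch != C2.CPUDispatch)
> >     return C1.CPUDispatch;
> >   // 2. Then the CUDA host/device preference (e.g. H over HD on the host).
> >   if (C1.CUDAPref != C2.CUDAPref)
> >     return C1.CUDAPref > C2.CUDAPref;
> >   // 3. Only then the remaining multiversion tie-breakers, if kept at all.
> >   if (C1.NumArchs != C2.NumArchs)
> >     return C1.NumArchs > C2.NumArchs;
> >   return std::nullopt;
> > }
> > ```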
> >
> > > Run-time dispatch is also somewhat irrelevant to CUDA. Sort of.
> >
> > I understand that there's very little reason (or even ability) to use
> > multiversioning in device code, but it can certainly happen in host code,
> > right? Still, I guess the easiest thing would just be to forbid
> > multiversioned functions on the device.
> I will change the precedence back so that the multiversion check takes
> priority over the host/device preference.
@rjmccall I'm OK with your reasoning & this patch. As long as the change does
not break existing code, I'm fine.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D77954/new/
https://reviews.llvm.org/D77954