tra added inline comments.
================
Comment at: clang/lib/Sema/SemaOverload.cpp:9749
+  if (isBetterMultiversionCandidate(Cand1, Cand2))
+    return true;
+
----------------
yaxunl wrote:
> rjmccall wrote:
> > tra wrote:
> > > rjmccall wrote:
> > > > erichkeane wrote:
> > > > > yaxunl wrote:
> > > > > > echristo wrote:
> > > > > > > rjmccall wrote:
> > > > > > > > yaxunl wrote:
> > > > > > > > > rjmccall wrote:
> > > > > > > > > > If we move anything below this check, it needs to figure out a tri-state so that it can return false if `Cand2` is a better candidate than `Cand1`. Now, that only matters if multiversion functions are supported under CUDA, but if you're relying on them not being supported, that should at least be commented on.
> > > > > > > > > Multiversion host functions are orthogonal to CUDA and therefore should be supported. Multiversioning of device, host device, and global functions is not supported. However, this change does not make things worse, and it should continue to work if they are supported.
> > > > > > > > >
> > > > > > > > > Host/device-based overload resolution is mostly for determining the viability of a function. If two functions are both viable, other factors should take precedence. This general rule has been applied to cases other than multiversion; I think it should also apply to multiversion.
> > > > > > > > >
> > > > > > > > > I will make isBetterMultiversionCandidate return three states.
> > > > > > > > > This general rule has been applied to cases other than multiversion; I think it should also apply to multiversion.
> > > > > > > >
> > > > > > > > Well, but the multiversion people could say the same: that multiversioning is for picking an alternative among otherwise-identical functions, and HD and H functions are not otherwise-identical.
> > > > > > > >
> > > > > > > > CC'ing @echristo for his thoughts on the right ordering here.
> > > > > > > Adding @erichkeane here as well.
> > > > > > >
> > > > > > > I think this makes sense, but I can see a reason to multiversion a function that will run on host and device. A version of some matrix multiply that takes advantage of three host architectures and one CUDA one? Am I missing something here?
> > > > > > My understanding is that a multiversion function is for a specific CPU (or GPU). Let's say we want to have a function f for gfx900, gfx906, sandybridge, and ivybridge; shouldn't they be more like
> > > > > >
> > > > > > ```
> > > > > > __host__ __attribute__((cpu_specific(sandybridge))) f();
> > > > > > __host__ __attribute__((cpu_specific(ivybridge))) f();
> > > > > > __device__ __attribute__((cpu_specific(gfx900))) f();
> > > > > > __device__ __attribute__((cpu_specific(gfx906))) f();
> > > > > > ```
> > > > > > instead of all `__device__ __host__` functions?
> > > > > IMO, it doesn't make sense for functions to be BOTH host and device; they'd have to be just one. Otherwise I'm not sure how the resolver behavior is supposed to work. The whole idea is that the definition is chosen at runtime.
> > > > >
> > > > > Unless `__host__ __device__ void foo();` is TWO declaration chains (meaning two separate AST entries), it doesn't make sense to have multiversion on it (and then, how it would be spelled is awkward/confusing to me).
> > > > >
> > > > > In the above case, if those 4 declarations are not 2 separate root AST nodes, multiversioning won't work.
> > > > There are certainly functions that ought to be usable from either host or device context — any inline function that just does ordinary language things should be in that category. Also, IIUC, many declarations are *inferred* to be `__host__ __device__`, or can be mass-annotated with pragmas, and those reasons are probably the main ones this might matter — we might include a header in CUDA mode that declares a multi-versioned function, and we should handle it right.
> > > >
> > > > My read of how CUDA programmers expect this to work is that they see the `__host__` / `__device__` attributes as primarily a mechanism for catching problems where you're using the wrong functions for the current configuration. That is, while we allow overloading by `__host__`/`__device__`-ness, users expect those attributes to mostly be used as a filter for what's "really there" rather than really strictly segregating the namespace. So I would say that CUDA programmers would probably expect the interaction with multiversioning to be:
> > > >
> > > > - Programmers can put `__host__`, `__device__`, or both on a variant depending on where it is usable.
> > > > - Dispatches should simply ignore any variants that aren't usable for the current configuration.
> > > >
> > > > And specifically they would not expect e.g. a `__host__` dispatch function to only consider `__host__` variants — it should be able to dispatch to anything available, which is to say, it should also include `__host__ __device__` variants. Similarly (and probably more usefully), a `__host__ __device__` dispatch function being compiled for the device should also consider pure `__device__` functions, and so on.
> > > >
> > > > If we accept that, then I think it gives us a much better idea for how to resolve the priority of the overload rules. The main impact of `isBetterMultiversionCandidate` is to try to ensure that we're looking at the `__attribute__((cpu_dispatch))` function instead of one of the `__attribute__((cpu_specific))` variants. (It has no effect on `__attribute__((target))` multi-versioning, mostly because it doesn't need to: target-specific variants don't show up in lookup with `__attribute__((target))`.) That rule should take precedence over the CUDA preference for exact matches, because e.g. if we're compiling this:
> > > >
> > > > ```
> > > > __host__ __device__ int magic(void) __attribute__((cpu_dispatch("...")));
> > > > __host__ __device__ int magic(void) __attribute__((cpu_specific(generic)));
> > > > __host__ int magic(void) __attribute__((cpu_specific(mmx)));
> > > > __host__ int magic(void) __attribute__((cpu_specific(sse)));
> > > > __device__ int magic(void) __attribute__((cpu_specific(some_device_feature)));
> > > > __device__ int magic(void) __attribute__((cpu_specific(some_other_device_feature)));
> > > > ```
> > > >
> > > > then we don't want the compiler to prefer a CPU-specific variant over the dispatch function just because one of the variants was marked `__host__`.
> > > It's a bit more complicated and a bit less straightforward than that. :-(
> > > https://goo.gl/EXnymm
> > > Handling of target attributes is where clang is very different from NVCC, so no matter which mental model of "CUDA programmer" you pick, there's another one which will not match.
> > >
> > > In the existing code `__host__ __device__` is commonly used as a sledgehammer to work around NVCC's limitations. It does not allow attribute-based overloading, so the only way you can specialize a function for host/device is via something like this:
> > > ```
> > > __host__ __device__ void foo() {
> > > #if __CUDA_ARCH__ > 0
> > >   // GPU code
> > > #else
> > >   // CPU code.
> > > #endif
> > > }
> > > ```
> > >
> > > With clang you can write separate overloaded functions and we'll do our best to pick the one you meant to call. Alas, there are cases where it's ambiguous and depends on the callee's attributes, which may depend on theirs. When something ends up being called from different contexts, interesting things start happening. With more functions becoming constexpr (those are implicitly HD), we'll be running into such impossible-to-do-the-right-thing situations more often. The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolution considering only same-side functions, which will, in effect, completely separate the host and device namespaces.
> > >
> > > Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. On one hand, kernel launch is already a form of runtime dispatch, only it's the CUDA runtime that does the dispatching, based on the GPU one attempts to run the kernel on. `__device__` functions are always compiled for the specific GPU variant. Also, GPU variants often have different instruction sets and can't be mixed together in the same object file at all, so there are no variants once we're running the code, as it's already compiled for precisely the GPU we're running on. Almost. Technically GPUs in the same family do share the same instruction sets, but I'm not sure runtime dispatch would buy us much there, as the hardware differences are relatively minor.
> > >
> > > The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolution considering only same-side functions, which will, in effect, completely separate the host and device namespaces.
> >
> > Okay. Well, even if you completely split host and device functions, I think we'd still want to prefer dispatch functions over variant functions before preferring H over HD.
> >
> > Although... I suppose we *do* want to consider H vs. HD before looking at the more arbitrary factors that `isBetterMultiversionCandidate` looks at, like the number of architectures in the dispatch. Honestly, though, those just seem like bad rules that we should drop from the code.
> >
> > > Run-time dispatch is also somewhat irrelevant to CUDA. Sort of.
> >
> > I understand that there's very little reason (or even ability) to use multiversioning in device code, but it can certainly happen in host code, right? Still, I guess the easiest thing would just be to forbid multiversioned functions on the device.
> Will change back the precedence so that multiversion is considered before host/device.

@rjmccall I'm OK with your reasoning & this patch. As long as the change does not break existing code, I'm fine.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D77954/new/

https://reviews.llvm.org/D77954

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
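For readers unfamiliar with the attributes debated above, here is a minimal host-only sketch (not taken from the review) of `cpu_specific`/`cpu_dispatch` multiversioning. It assumes an x86 ELF target and uses a made-up function name; the CUDA annotations are deliberately left out. As rjmccall describes above, `isBetterMultiversionCandidate` exists so that a plain call resolves to the dispatcher rather than to one of the variants.

```
// Sketch only: plain host-side multiversioning, x86 target assumed.
#include <cstdio>

// Variant used on generic x86 CPUs.
__attribute__((cpu_specific(generic)))
void say_hello(void) { std::puts("generic variant"); }

// Variant used on Ivy Bridge CPUs.
__attribute__((cpu_specific(ivybridge)))
void say_hello(void) { std::puts("ivybridge variant"); }

// Dispatcher: lists the CPUs it dispatches among. The body stays empty;
// the compiler emits a resolver that picks a variant when the program loads.
__attribute__((cpu_dispatch(generic, ivybridge)))
void say_hello(void) {}

int main() {
  say_hello();  // overload resolution should pick the dispatcher, not a variant
  return 0;
}
```

Under the ordering settled on in the thread, adding `__host__`/`__device__` annotations to some of the variants should not cause a call to bypass the dispatcher in favor of a single variant.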