tra added a comment.

In https://reviews.llvm.org/D47849#1126925, @gtbercea wrote:

> I just stumbled upon a very interesting situation.
>
> I noticed that, for OpenMP, the use of device math functions happens as I 
> expected at -O0. At -O1 or higher, math functions such as "sqrt" resolve to 
> llvm builtins/intrinsics:
>
>   call double @llvm.sqrt.f64(double %1)
>
>
> instead of the nvvm variant.


I believe we do have a pass that attempts to replace some nvvm intrinsics with 
their llvm equivalents. It allows us to optimize the code better. My guess 
would be that the replacement does not happen at -O0.
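
For illustration, here's roughly what that looks like from the source level. 
The IR in the comments is a sketch of the before/after, not verbatim compiler 
output:

  // Sketch: device-side lowering of sqrt() at different optimization levels.
  // The IR lines in the comments are illustrative, not exact output.
  #include <math.h>
  __global__ void kernel(double *out, double x) {
    out[0] = sqrt(x);
    // -O0:  call double @__nv_sqrt(double %x)      ; libdevice function
    // -O1+: call double @llvm.sqrt.f64(double %x)  ; llvm intrinsic
  }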

> The surprising part (at least to me) is that the same llvm intrinsic is used 
> when I use Clang to compile CUDA kernel code calling the "sqrt" function. I 
> would have expected that the NVVM variant would be called for CUDA code.

What we may end up generating for any given standard library call from the 
device side depends on a number of factors and may vary.
Here's what typically happens:

- clang parses CUDA headers and pulls in 'standard' C math functions and bits 
of the C++ overloads. These usually call __something().
- CUDA versions up to 8.0 provided those __something() functions which 
*usually* called __nv_something() in libdevice.
- As of CUDA-9, __something() became NVCC's compiler builtins, so clang has to 
provide its own implementation -- __clang_cuda_device_functions.h. This 
implementation may use whatever gets the job done. Any of 
__builtin.../__nvvm.../__nv_... are fair game, as long as it works (see the 
sketch after this list).
- The CUDA wrapper headers in clang do some magic to make the math parts of 
the standard C++ library work, by providing functions that do the right thing. 
Usually those forward to the C math functions, but that may not always be the 
case.
- LLVM may update some __nvvm* intrinsics to their llvm equivalents.
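
To make the forwarding concrete, here's a simplified sketch of the kind of 
glue involved. This is not the verbatim header contents -- the real headers 
wrap all of this in macros, inline/linkage attributes, and many more overloads:

  // Simplified sketch of the forwarding chain; not the verbatim headers.
  extern "C" __device__ double __nv_sqrt(double);  // lives in libdevice.bc
  extern "C" __device__ float __nv_sqrtf(float);   // lives in libdevice.bc

  // __clang_cuda_device_functions.h-style forwarders for the C functions:
  extern "C" __device__ double sqrt(double __a) { return __nv_sqrt(__a); }
  extern "C" __device__ float sqrtf(float __a) { return __nv_sqrtf(__a); }

  // C++ wrapper-header-style overload, forwarding to the C function:
  __device__ float sqrt(float __a) { return sqrtf(__a); }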

In the end, the IR you get may differ somewhat depending on the function and 
on the CUDA version clang used.

> Is it ok for CUDA kernels to call llvm intrinsics instead of the device 
> specific math library functions?

It depends. We cannot lower all LLVM intrinsics. Generally, you can't use 
intrinsics that get lowered to an external library call -- there is no 
device-side library for such calls to resolve against.
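
As an illustration of the difference, as I understand the NVPTX lowering (a 
sketch; the exact set of intrinsics that work depends on the backend):

  // Sketch: not all llvm intrinsics survive NVPTX lowering.
  __device__ double ok(double x) {
    // llvm.sqrt.f64 maps onto a PTX instruction (sqrt.rn.f64), so it's fine.
    return __builtin_sqrt(x);
  }
  __device__ double probably_not_ok(double x, double y) {
    // llvm.pow.f64 has no PTX equivalent; legalization would turn it into
    // a call to an external pow() that nothing on the device side provides.
    return __builtin_pow(x, y);
  }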

> If it's ok for CUDA can this be ok for OpenMP NVPTX too?
>  If not we probably need to fix it for both toolchains.

I don't have an answer for these. OpenMP seems to have somewhat different 
requirements compared to C++ which we assume for CUDA.

One thing you do need to consider, though, is that the wrapper headers are 
rather unstable. Their goal is to provide glue between the half-broken CUDA 
headers and the user's code. They are not intended to provide any sort of 
stability to anyone else. Every new CUDA version brings new and exciting 
changes to its headers, which requires a fair amount of changes in the 
wrappers.

If all you need is C math functions, it *may* be OK, but there may be a better 
approach.
Why not compile a real math library to bitcode and avoid all this weirdness of 
gluing together half-broken pieces of CUDA that are broken by design? Unlike 
real CUDA compilation, you don't have the constraint of matching NVCC 1:1. If 
you have your own device-side math library, you could use regular math headers 
and link a real libm.bc instead of CUDA's libdevice. The rumors of "high 
performance" functions in libdevice are somewhat exaggerated, IMO. If you take 
a look at the IR in the libdevice of a recent CUDA version, you will see that 
a lot of the functions just call their llvm counterparts. If it turns out that 
in some cases llvm generates slower code than what NVIDIA provides, I'm sure 
it will be possible to implement a reasonably fast replacement.
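
As a rough sketch of what that could look like -- file names, function names, 
and the exact flags here are assumptions, not a worked-out recipe:

  // devmath.cu -- hypothetical device-side math library built as bitcode.
  // Build/link sketch (flags are assumptions, not a tested recipe):
  //   clang++ --cuda-device-only -emit-llvm -c devmath.cu -o libm.bc
  //   clang++ app.cu -Xclang -mlink-builtin-bitcode -Xclang libm.bc ...
  extern "C" __device__ double mylib_fabs(double x) {
    // A real libm.bc would implement the full set of math functions;
    // one trivial function here just illustrates the mechanics.
    return x < 0.0 ? -x : x;
  }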


Repository:
  rC Clang

https://reviews.llvm.org/D47849


