[PATCH] D47757: [Sema] Produce diagnostics when unavailable aligned allocation/deallocation functions are called

2018-08-09 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: test/SemaCUDA/call-host-fn-from-device.cu:88
 __host__ __device__ void class_specific_delete(T *t, U *u) {
-  delete t; // ok, call sized device delete even though host has preferable non-sized version
+  delete t; // expected-error {{reference to __host__ function 'operator delete' in __host__ __device__ function}}
   delete u; // ok, call non-sized HD delete rather than sized D delete

The C++ magic is way above my paygrade, but as far as CUDA goes this is a 
regression compared to what nvcc does. NVCC compiles this code with only a 
warning, and clang should not error out at this point either: calling a host 
function from an HD function is not an error by itself; it only becomes one if 
the HD function is actually emitted for the device, and we don't know how it's 
used until later. I think the error should be postponed until codegen. 






Repository:
  rC Clang

https://reviews.llvm.org/D47757



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47757: [Sema] Produce diagnostics when unavailable aligned allocation/deallocation functions are called

2018-08-13 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: test/SemaCUDA/call-host-fn-from-device.cu:88
 __host__ __device__ void class_specific_delete(T *t, U *u) {
-  delete t; // ok, call sized device delete even though host has preferable non-sized version
+  delete t; // expected-error {{reference to __host__ function 'operator delete' in __host__ __device__ function}}
   delete u; // ok, call non-sized HD delete rather than sized D delete

rsmith wrote:
> rsmith wrote:
> > tra wrote:
> > > The C++ magic is way above my paygrade, but as far as CUDA goes this is a 
> > > regression compared to what nvcc does. NVCC compiles this code with only a 
> > > warning, and clang should not error out at this point either: calling a host 
> > > function from an HD function is not an error by itself; it only becomes one 
> > > if the HD function is actually emitted for the device, and we don't know how 
> > > it's used until later. I think the error should be postponed until codegen. 
> > > 
> > > 
> > > 
> > > 
> > We're in `-fcuda-is-device` mode, so IIUC it's correct to reject a call to 
> > a host function here (because `__host__ __device__` is treated as basically 
> > meaning `__device__` in that mode for the purpose of checking whether a 
> > call is valid), right?
> > 
> > However, the comment suggests that the intent was that this would instead 
> > call the device version. Did that actually previously happen (in which case 
> > this patch is somehow affecting overload resolution and should be fixed), 
> > or is the comment prior to this patch wrong and we were silently calling a 
> > host function from a device function (in which case this patch is fine, but 
> > we should add a FIXME here to select the device delete function if we think 
> > that's appropriate)?
> OK, I see from prior review comments (that phab is helpfully hiding from 
> view) that this is just adding a diagnostic and the overload resolution 
> behavior is unchanged. So I think this change is correct. @tra, can you 
> confirm? My testing shows that
> 
> ```
> __host__ void f(); __host__ __device__ void g() { f(); }
> ```
> 
> is accepted by default but rejected in `-fcuda-is-device` mode, which is 
> consistent with the behavior after this patch is applied.
"-fcuda-is-device" does not necessarily mean that the __host__ __device__ 
function will be used.

In the example above the error is indeed correct, as g() is considered to be 
externally visible and we will attempt to generate code for it, and we can't 
call f() on device.

However, if you make it static, there should be no error:
```
__host__ void f(); 
static __host__ __device__ void g() { f(); }
```

CUDA is somewhat weird when it comes to what's considered available and what is 
not.
If you want some of the gory details, see D12453 and https://goo.gl/EXnymm

@jlebar has details on how we handle the errors in cases like that:
https://github.com/llvm-mirror/clang/blob/master/lib/Sema/SemaCUDA.cpp#L590



Repository:
  rC Clang

https://reviews.llvm.org/D47757



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

2018-08-16 Thread Artem Belevich via Phabricator via cfe-commits
tra added a subscriber: pcc.
tra added a comment.

In https://reviews.llvm.org/D50845#1202551, @ABataev wrote:

> In https://reviews.llvm.org/D50845#1202550, @Hahnfeld wrote:
>
> > In https://reviews.llvm.org/D50845#1202540, @ABataev wrote:
> >
> > > Maybe for device compilation we also should define `__NO_MATH_INLINES` 
> > > and `__NO_STRING_INLINES` macros to disable inline assembly in glibc?
> >
> >
> > The problem is that `__NO_MATH_INLINES` doesn't even avoid all inline 
> > assembly from `bits/mathinline.h` :-( Incidentally, Clang already defines 
> > `__NO_MATH_INLINES` for x86 (due to an old bug which was fixed long ago), 
> > and on CentOS we still have problems, as described in PR38464.
> >
> > As a second thought: This might be valid for NVPTX, but I don't think it's 
> > a good idea for x86-like offloading targets - they might well profit from 
> > inline assembly code.
>
>
> I'm not saying that we should define those macros for all targets, only for 
> NVPTX. But still, it may disable some inline assembly for other architectures.


IMO, trying to avoid inline assembly by defining (or not defining) some macros and 
hoping for the best is rather fragile, as we'll have to chase *all* the patches that 
the host's math.h may have on any given system.

If I understand it correctly, the root cause of this exercise is that we want 
to compile for the GPU using plain C. CUDA avoids this issue by separating device 
and host code via target attributes, and clang has a few special cases to ignore 
inline-assembly errors in host code when we're compiling for the device. For 
OpenMP there's no such separation, at least not in the system headers.

Perhaps we can just add another special case for inline assembly & OpenMP. If 
there's an error in inline assembly during device compilation and we see that 
the function came from the system headers, then ignore the error, poison the 
function, etc. That said, I don't know enough about OpenMP to tell whether 
that's feasible or whether that's sufficient.

Another option would be to implement some sort of attribute-based overloading. 
Then OpenMP can provide its own version of the device-side library function 
without clashing with system headers.
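As a rough sketch of what I mean by attribute-based overloading (the CUDA spelling, 
purely for illustration -- `do_math` is a made-up name and the OpenMP syntax would 
differ):

```
// The target attributes make the two declarations distinct overloads, so a
// device-side library version can coexist with the host declaration coming
// from the system header.
__host__ double do_math(double x);                  // host/system declaration
__device__ double do_math(double x) { return x; }   // device-side replacement
```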


On a side note, I spent about a year and got three almost-but-not-quite-working 
'solutions' to exactly this problem during the early days of adding CUDA support to 
clang. I'm very thoroughly convinced that taking headers from platform A verbatim 
and making them work on platform B is not feasible unless you have control of 
both sets of headers. Considering that system headers are *not* under our 
control, we do need a way for them to coexist without clashing. 
Preprocessor magic may work in limited circumstances (e.g. we only need to deal 
with two variants of headers that never change), but the cases where that 
approach falls apart are rather easy to come by. Clang's 
__clang_cuda_runtime_wrapper.h is a horrible example of that -- it sort of 
works, but every CUDA release I cross my fingers and hope that they didn't 
decide to change *anything* in their headers.




Comment at: test/SemaCUDA/builtins.cu:15-17
+#if !defined(__x86_64__)
+#error "Expected to see preprocessor macros from the host."
 #endif

Hahnfeld wrote:
> @tra I'm not sure here: Do we want `__PTX__` to be defined during host 
> compilation? I can't think of a valid use case, but you have more experience 
> with user code.
I'm not sure what the reason for adding this macro was. @pcc did it a long time 
ago in rL157173. Perhaps he has a better idea about its purpose.

AFAICT, it's not used by the CUDA headers, nor can I find any uses in any of the 
CUDA sources we have (excluding clang's own tests). The only use case I see is in 
cl_kernel.h in Apple's Xcode SDK. 
System/Library/Frameworks/OpenCL.framework/Versions/A/lib/clang/2.0/include/cl_kernel.h
It's used there to #define a lot of convert_FOO to __builtin_convert(FOO...).

Based on that single use case, not defining __PTX__ for the host compilation 
should probably be OK.



Repository:
  rC Clang

https://reviews.llvm.org/D50845



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

2018-08-16 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In https://reviews.llvm.org/D50845#1203031, @gtbercea wrote:

> In https://reviews.llvm.org/D50845#1202991, @hfinkel wrote:
>
> > In https://reviews.llvm.org/D50845#1202965, @Hahnfeld wrote:
> >
> > > In https://reviews.llvm.org/D50845#1202963, @hfinkel wrote:
> > >
> > > > As a result, we should really have a separate header that has those 
> > > > actually-available functions. When targeting NVPTX, why don't we have 
> > > > the included math.h be CUDA's math.h? In the end, those are the 
> > > > functions we need to call when we generate code. Right?
> > >
> > >
> > > That's what https://reviews.llvm.org/D47849 deals with.
> >
> >
> > Yes, but it doesn't get CUDA's math.h. Maybe I misunderstand how this works 
> > (and I very well might, because it's not clear that CUDA has a math.h by 
> > that name), but that patch tries to avoid problems with the host's math.h 
> > and then also injects __clang_cuda_device_functions.h into the device 
> > compilation. How does this compare to when you include math.h in Clang's 
> > CUDA mode? It seems to me that we want to somehow map standard includes, 
> > where applicable, to include files in CUDA's include/crt directory (e.g., 
> > crt/math_functions.h and crt/common_functions.h for stdio.h for printf), 
> > and nothing else ends up being available (because it is, in fact, not 
> > available).
>
>
> There's no CUDA specific math.h unless you want to regard 
> clang_cuda_device_functions.h as a math header.


True. We rely on the CUDA SDK, which defines a subset of the standard libc/libm 
functions with the `__device__` attribute.
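Roughly like this (a simplified sketch; the real declarations differ between CUDA 
versions):

```
// Device-side overload of a libm function, forwarding to libdevice.
extern "C" __device__ double __nv_sin(double);      // provided by libdevice
__device__ double sin(double x) { return __nv_sin(x); }
```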

__clang_cuda_device_functions.h just provides substitutes for the functions that 
became nvcc builtins and are no longer implemented in the CUDA headers.
It's not supposed to replace math.h, and it may change with the next version of 
CUDA if we need to cope with some other quirk of CUDA's headers.

> The patch is using the same approach as CUDA and redirecting the function 
> calls to device specific function calls. The parts of that patch which deal 
> with host header compatibility would more naturally belong in a patch like 
> this one so ultimately they won't be part of that patch. I'm currently 
> working on improving the patch though by eliminating the 
> clang_cuda_device_functions.h injection and eliminating the need to disable 
> the built-ins.

This sounds great. When you do have a device-side implementation of the math 
library, it would probably be worth considering making CUDA use it too, instead of 
the current hacks that adapt to CUDA's headers. This would simplify things a bit 
and would give us much better control over the implementation.


Repository:
  rC Clang

https://reviews.llvm.org/D50845



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50815: Establish the <bit> header

2018-08-16 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

It appears that libcxx/include/CMakeLists.txt needs to be updated to include the 
`bit` file in the file set.


https://reviews.llvm.org/D50815



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47757: [Sema] Produce diagnostics when unavailable aligned allocation/deallocation functions are called

2018-08-17 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In https://reviews.llvm.org/D47757#1204545, @ahatanak wrote:

> @tra and @rsmith: Can we move forward and fix the incorrect cuda diagnostics 
> in a separate patch?


Doing that in a separate patch is OK, provided that patch is committed along with 
this one.

It's a regression. There's a decent chance it breaks someone, and this patch, if 
committed by itself, will end up being rolled back.


Repository:
  rC Clang

https://reviews.llvm.org/D47757



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47757: [Sema] Produce diagnostics when unavailable aligned allocation/deallocation functions are called

2018-08-17 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In https://reviews.llvm.org/D47757#1204621, @ahatanak wrote:

> In https://reviews.llvm.org/D47757#1204561, @tra wrote:
>
> > It's a regression. There's a decent chance it breaks someone and this 
> > patch, if committed by itself, will end up being rolled back.
>
>
> Is the regression you are referring to about the static function case? I 
> don't see a difference between ToT clang and my patch in the diagnostics they 
> produce when I compile the following code:
>
>   __host__ void f();
>   static __host__ __device__ void g() { f(); }
>   __host__ __device__ void g2() { g(); } 
>
>
> Both error out when `-fcuda-is-device` is provided. If I comment out the 
> definition of g2, it compiles fine.


The example above *is* expected to produce the error on the device side, because 
g2() is externally visible and uses g(), which in turn uses the host-only f().

I'm talking about a case where g() { f(); } is present in the source code but 
will not be codegen'ed on the device side.

The code below is expected to compile. Note that g2() is host-only.

  __host__ void f(); 
  static __host__ __device__ void g() { f(); }
  __host__ void g2() { g(); } 


Repository:
  rC Clang

https://reviews.llvm.org/D47757



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47757: [Sema] Produce diagnostics when unavailable aligned allocation/deallocation functions are called

2018-08-17 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Talked to @ahatanak over IRC. It appears that this patch may have exposed a 
preexisting bug.
Apparently `delete t;` in test/SemaCUDA/call-host-fn-from-device.cu does 
actually end up calling `__host__ operator delete`.  It should've picked 
`__device__ operator delete`, but it does not, so reporting an error here 
appears to be correct.

It's visible in the AST and the IR.
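For reference, a reduced sketch of the overload set in question (my simplification, 
not verbatim from the test):

```
struct T {
  // Host-only, non-sized form -- preferable on the host side...
  __host__ void operator delete(void *ptr);
  // ...but during device compilation the sized __device__ form should win.
  __device__ void operator delete(void *ptr, __SIZE_TYPE__ size);
};
```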

@rsmith -- the original change was done a while back in 
https://reviews.llvm.org/rL283830.  I assume it worked at that time and wonder 
if it's a (possibly not-so-)recent regression.


Repository:
  rC Clang

https://reviews.llvm.org/D47757



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47849: [OpenMP][Clang][NVPTX] Enable math functions called in an OpenMP NVPTX target device region to be resolved as device-native function calls

2018-08-22 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

`__clang_cuda_device_functions.h` is not intended to be a device-side math.h, 
despite having a lot of overlap/similarities. It may change any time we get a new 
CUDA version.
I would suggest writing an OpenMP-specific replacement for math.h which would 
map to whatever device-specific function OpenMP needs. For NVPTX that may be 
libdevice, for which you have declarations in 
`__clang_cuda_libdevice_declares.h`. Using part of 
`__clang_cuda_device_functions.h` may be a decent starting point for NVPTX, but 
OpenMP will likely need to provide an equivalent for other back-ends, too.
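Something along these lines for NVPTX (a sketch only -- the wrapper itself is made 
up; the `__nv_*` declaration follows `__clang_cuda_libdevice_declares.h`/libdevice):

```
#pragma omp declare target
extern "C" double __nv_fabs(double);                 // libdevice entry point
static inline double fabs(double x) { return __nv_fabs(x); }
#pragma omp end declare target
```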




Comment at: lib/Basic/Targets/NVPTX.cpp:232
+  // getting inlined on the device.
+  Builder.defineMacro("__NO_MATH_INLINES");
 }

This relies on an implementation detail of a particular variant of the header file 
you're assuming all compilations will include. At best, this is a workaround for 
the real problem (attempting to use headers from machine X while targeting Y).

D50845 is dealing with the issue of headers for target code. Hopefully, they'll 
find a way to provide device-specific headers, so you don't have to rely on host 
headers being parseable during device-side compilation.



Comment at: lib/Driver/ToolChains/Clang.cpp:4758
+// toolchain.
+CmdArgs.push_back("-fno-math-builtin");
   }

Could you elaborate on why you don't want the builtins?
Builtins are enabled and are useful for CUDA. What makes their use different 
for OpenMP?
Are you doing it to guarantee that math functions remain unresolved in IR so 
you could link them in from external bitcode?



Repository:
  rC Clang

https://reviews.llvm.org/D47849



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D44435: CUDA ctor/dtor Module-Unique Symbol Name

2018-05-08 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Great! Let's close this review then.
And good luck with cling.


https://reviews.llvm.org/D44435



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D46471: [HIP] Add hip offload kind

2018-05-08 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.

Small nit. LGTM otherwise.




Comment at: lib/Driver/ToolChains/Clang.cpp:133-135
 Work(*C.getSingleOffloadToolChain<Action::OFK_Cuda>());
 
+  if (JA.isHostOffloading(Action::OFK_HIP))

CUDA and HIP are mutually exclusive, so this should probably be `else if`
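I.e. something along these lines (surrounding code paraphrased from the patch 
context, not copied verbatim):

```
if (JA.isHostOffloading(Action::OFK_Cuda))
  Work(*C.getSingleOffloadToolChain<Action::OFK_Cuda>());
else if (JA.isHostOffloading(Action::OFK_HIP))
  Work(*C.getSingleOffloadToolChain<Action::OFK_HIP>());
```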


https://reviews.llvm.org/D46471



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D46148: [CUDA] Added -f[no-]cuda-short-ptr option

2018-05-09 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rL331938: [CUDA] Added -f[no-]cuda-short-ptr option (authored 
by tra).
Herald added a subscriber: llvm-commits.

Changed prior to commit:
  https://reviews.llvm.org/D46148?vs=144419&id=146027#toc

Repository:
  rL LLVM

https://reviews.llvm.org/D46148

Files:
  cfe/trunk/include/clang/Basic/TargetOptions.h
  cfe/trunk/include/clang/Driver/Options.td
  cfe/trunk/lib/Basic/Targets/NVPTX.cpp
  cfe/trunk/lib/Driver/ToolChains/Clang.cpp
  cfe/trunk/lib/Driver/ToolChains/Cuda.cpp
  cfe/trunk/lib/Frontend/CompilerInvocation.cpp


Index: cfe/trunk/include/clang/Basic/TargetOptions.h
===
--- cfe/trunk/include/clang/Basic/TargetOptions.h
+++ cfe/trunk/include/clang/Basic/TargetOptions.h
@@ -63,6 +63,10 @@
 
   /// If given, enables support for __int128_t and __uint128_t types.
   bool ForceEnableInt128 = false;
+
+  /// \brief If enabled, use 32-bit pointers for accessing const/local/shared
+  /// address space.
+  bool NVPTXUseShortPointers = false;
 };
 
 }  // end namespace clang
Index: cfe/trunk/include/clang/Driver/Options.td
===
--- cfe/trunk/include/clang/Driver/Options.td
+++ cfe/trunk/include/clang/Driver/Options.td
@@ -581,6 +581,9 @@
 def fcuda_rdc : Flag<["-"], "fcuda-rdc">, Flags<[CC1Option]>,
   HelpText<"Generate relocatable device code, also known as separate 
compilation mode.">;
 def fno_cuda_rdc : Flag<["-"], "fno-cuda-rdc">;
+def fcuda_short_ptr : Flag<["-"], "fcuda-short-ptr">, Flags<[CC1Option]>,
+  HelpText<"Use 32-bit pointers for accessing const/local/shared address 
spaces.">;
+def fno_cuda_short_ptr : Flag<["-"], "fno-cuda-short-ptr">;
 def dA : Flag<["-"], "dA">, Group;
 def dD : Flag<["-"], "dD">, Group, Flags<[CC1Option]>,
   HelpText<"Print macro definitions in -E mode in addition to normal output">;
Index: cfe/trunk/lib/Basic/Targets/NVPTX.cpp
===
--- cfe/trunk/lib/Basic/Targets/NVPTX.cpp
+++ cfe/trunk/lib/Basic/Targets/NVPTX.cpp
@@ -68,6 +68,9 @@
 
   if (TargetPointerWidth == 32)
 resetDataLayout("e-p:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64");
+  else if (Opts.NVPTXUseShortPointers)
+resetDataLayout(
+
"e-p3:32:32-p4:32:32-p5:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64");
   else
 resetDataLayout("e-i64:64-i128:128-v16:16-v32:32-n16:32:64");
 
Index: cfe/trunk/lib/Frontend/CompilerInvocation.cpp
===
--- cfe/trunk/lib/Frontend/CompilerInvocation.cpp
+++ cfe/trunk/lib/Frontend/CompilerInvocation.cpp
@@ -2922,6 +2922,8 @@
 Opts.Triple = llvm::sys::getDefaultTargetTriple();
   Opts.OpenCLExtensionsAsWritten = Args.getAllArgValues(OPT_cl_ext_EQ);
   Opts.ForceEnableInt128 = Args.hasArg(OPT_fforce_enable_int128);
+  Opts.NVPTXUseShortPointers = Args.hasFlag(
+  options::OPT_fcuda_short_ptr, options::OPT_fno_cuda_short_ptr, false);
 }
 
 bool CompilerInvocation::CreateFromArgs(CompilerInvocation &Res,
Index: cfe/trunk/lib/Driver/ToolChains/Clang.cpp
===
--- cfe/trunk/lib/Driver/ToolChains/Clang.cpp
+++ cfe/trunk/lib/Driver/ToolChains/Clang.cpp
@@ -4714,6 +4714,9 @@
 
 if (Args.hasFlag(options::OPT_fcuda_rdc, options::OPT_fno_cuda_rdc, false))
   CmdArgs.push_back("-fcuda-rdc");
+if (Args.hasFlag(options::OPT_fcuda_short_ptr,
+ options::OPT_fno_cuda_short_ptr, false))
+  CmdArgs.push_back("-fcuda-short-ptr");
   }
 
   // OpenMP offloading device jobs take the argument -fopenmp-host-ir-file-path
Index: cfe/trunk/lib/Driver/ToolChains/Cuda.cpp
===
--- cfe/trunk/lib/Driver/ToolChains/Cuda.cpp
+++ cfe/trunk/lib/Driver/ToolChains/Cuda.cpp
@@ -635,8 +635,10 @@
 // CUDA-9.0 uses new instructions that are only available in PTX6.0+
 PtxFeature = "+ptx60";
   }
-  CC1Args.push_back("-target-feature");
-  CC1Args.push_back(PtxFeature);
+  CC1Args.append({"-target-feature", PtxFeature});
+  if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
+ options::OPT_fno_cuda_short_ptr, false))
+CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
 
   if (DeviceOffloadingKind == Action::OFK_OpenMP) {
 SmallVector LibraryPaths;



[PATCH] D46994: [test-suite] Test CUDA in C++14 mode with C++11 stdlibs.

2018-05-17 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added inline comments.
This revision is now accepted and ready to land.



Comment at: External/CUDA/CMakeLists.txt:339-345
# Same as above, but for libc++
# Tell clang to use libc++
# We also need to add compiler's include path for cxxabi.h
get_filename_component(_compiler_path ${CMAKE_CXX_COMPILER} DIRECTORY)
-   set(_Stdlib_CPPFLAGS -stdlib=libc++ -I${_compiler_path}/../include/c++-build)
+set(_Stdlib_CPPFLAGS -stdlib=libc++ -I${_compiler_path}/../include/c++-build -DSTDLIB_VERSION=2017)
set(_Stdlib_LDFLAGS  -stdlib=libc++)
set(_Stdlib_Libs libcxx)

Looks like the file should be un-tabified.


Repository:
  rT test-suite

https://reviews.llvm.org/D46994



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D46995: [test-suite] Enable CUDA complex tests with libc++ now that D25403 is resolved.

2018-05-17 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added inline comments.
This revision is now accepted and ready to land.



Comment at: External/CUDA/complex.cu:24
 // libstdc++ (compile errors in <complex>).
-#if __cplusplus >= 201103L && !defined(_LIBCPP_VERSION) && \
-(__cplusplus < 201402L || STDLIB_VERSION >= 2014)
+#if __cplusplus >= 201103L && (__cplusplus < 201402L || STDLIB_VERSION >= 2014)
 

Is this specific to C++14 only, or will we have similar conditions for 
C++17, C++20, etc.?
Perhaps we could express the library version requirement as 
`STDLIB_VERSION >= (__cplusplus / 100)`?
I'm OK with either way.
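For reference, how the generic form evaluates (integer division), and one way it 
could be combined with the existing C++11 carve-out -- my own combination, not 
part of the patch:

```
// C++14: 201402L / 100 == 2014;  C++17: 201703L / 100 == 2017;  C++20: 202002L / 100 == 2020
#if __cplusplus >= 201103L && \
    (__cplusplus < 201402L || STDLIB_VERSION >= (__cplusplus / 100))
// ... tests that need a standard library at least as new as the language mode ...
#endif
```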




Repository:
  rT test-suite

https://reviews.llvm.org/D46995



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47070: [CUDA] Upgrade linked bitcode to enable inlining

2018-05-18 Thread Artem Belevich via Phabricator via cfe-commits
tra added a subscriber: echristo.
tra added a comment.

This was not intended. :-( I was unaware that GetCPUAndFeaturesAttributes() 
would add any feature that looks like a valid CPU name to the target-cpu 
attribute.
All I needed was to make the builtins available or not. Setting them as function 
attributes is not what we need here.

I'm not sure what the best way to deal with this is. On one hand, I do need to 
make some builtins available depending on the combination of GPU arch and PTX 
version, and the only way to do that is via the features. On the other hand, the 
features appear to propagate to LLVM IR, which is something I don't need or 
want.

One way would be to introduce some sort of feature blacklist which would 
prevent them from being converted to function attributes.
Or, perhaps, we can change TARGET_BUILTIN or create something similar which 
would allow availability of builtins w/o relying on features.

As a short-term fix we can disable feature-to-function attribute propagation 
for NVPTX until we fix it.

@echristo -- any other suggestions?


Repository:
  rC Clang

https://reviews.llvm.org/D47070



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D45212: [HIP] Let CUDA toolchain support HIP language mode and amdgpu

2018-05-18 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Hi,

Sorry about the long silence. I'm back to continue the reviews. I'll handle 
what I can today and will continue with the rest on Tuesday.

It looks like the patch description needs to be updated:

> Use clang-offload-bindler to create binary for device ISA.

I don't see anything related to offload-bundler in this patch any more.




Comment at: include/clang/Driver/Options.td:582
 def fno_cuda_rdc : Flag<["-"], "fno-cuda-rdc">;
+def hip_device_lib_path_EQ : Joined<["--"], "hip-device-lib-path=">, Group<i_Group>,
+  HelpText<"HIP device library path">;

I'm not sure about `i_Group`. This will cause the option to be passed to all 
preprocessor jobs. It will also be passed to both the host- and device-side 
compilations, while you probably only want/need it on the device side.



Comment at: lib/Driver/ToolChains/Cuda.cpp:323
+C.getDriver().Diag(diag::err_drv_no_such_file) << BCName;
+  CmdArgs.push_back(Args.MakeArgString(FullName));
+  return FoundLibDevice;

FullName is already the result of Args.MakeArgString. You only need to do it once.



Comment at: lib/Driver/ToolChains/Cuda.cpp:329
+// object file. It calls llvm-link, opt, llc, then lld steps.
+void AMDGCN::Linker::ConstructJob(Compilation &C, const JobAction &JA,
+  const InputInfo &Output,

This function is too large to easily see that we're actually constructing a 
sequence of commands.
I'd probably split the construction of each individual tool's command line into 
its own function.



Comment at: lib/Driver/ToolChains/Cuda.cpp:336
+  assert(StringRef(JA.getOffloadingArch()).startswith("gfx") &&
+ " unless gfx processor, backend should be clang");
+  const auto &TC =

No need for the leading space in the message.



Comment at: lib/Driver/ToolChains/Cuda.cpp:344-345
+  // Add the input bc's created by compile step.
+  for (InputInfoList::const_iterator it = Inputs.begin(), ie = Inputs.end();
+   it != ie; ++it) {
+const InputInfo &II = *it;

`for (const InputInfo &II : Inputs)` ?



Comment at: lib/Driver/ToolChains/Cuda.cpp:350
+
+  std::string GFXNAME = JA.getOffloadingArch();
+

All-caps name looks like a macro. Rename to `GfxName` ?



Comment at: lib/Driver/ToolChains/Cuda.cpp:354-359
+  // Find in --hip-device-lib-path and HIP_LIBRARY_PATH.
+  for (auto Arg : Args) {
+if (Arg->getSpelling() == "--hip-device-lib-path=") {
+  LibraryPaths.push_back(Args.MakeArgString(Arg->getValue()));
+}
+  }

```
for (const auto &Path : Args.getAllArgValues(...)) {
  LibraryPaths.push_back(Args.MakeArgString(Path));
}
```



Comment at: lib/Driver/ToolChains/Cuda.cpp:375-378
+  addBCLib(C, Args, CmdArgs, LibraryPaths,
+   (Twine("oclc_isa_version_") + StringRef(GFXNAME).drop_front(3) +
+".amdgcn.bc")
+   .str());

This is somewhat unreadable. Perhaps you could construct the name in a temp 
variable.



Comment at: lib/Driver/ToolChains/Cuda.cpp:384
+  const char *ResultingBitcodeF =
+  C.addTempFile(C.getArgs().MakeArgString(TmpName.c_str()));
+  CmdArgs.push_back(ResultingBitcodeF);

You don't need to use c_str() for MakeArgString. It will happily accept 
std::string.



Comment at: lib/Driver/ToolChains/Cuda.cpp:394
+  // The input to opt is the output from llvm-link.
+  OptArgs.push_back(ResultingBitcodeF);
+  // Pass optimization arg to opt.

`BitcodeOutputFile`?



Comment at: lib/Driver/ToolChains/Cuda.cpp:417
+  const char *mcpustr = Args.MakeArgString("-mcpu=" + GFXNAME);
+  OptArgs.push_back(mcpustr);
+  OptArgs.push_back("-o");

I think you can get rid of the temp var here without hurting readability.



Comment at: lib/Driver/ToolChains/Cuda.cpp:420
+  std::string OptOutputFileName =
+  C.getDriver().GetTemporaryPath("OPT_OUTPUT", "bc");
+  const char *OptOutputFile =

I wonder if we could derive the temp file name from the input's name. That may 
make it easier to find the relevant temp files if/when we need to debug the 
compilation process.



Comment at: lib/Driver/ToolChains/Cuda.cpp:422
+  const char *OptOutputFile =
+  C.addTempFile(C.getArgs().MakeArgString(OptOutputFileName.c_str()));
+  OptArgs.push_back(OptOutputFile);

No need for c_str() here.



Comment at: lib/Driver/ToolChains/Cuda.cpp:439
+  const char *LlcOutputFile =
+  C.addTempFile(C.getArgs().MakeArgString(LlcOutputFileName.c_str()));
+  LlcArgs.push_back(LlcOutputFile);

c_str(), again.



Comment at: lib/Driver/ToolChains/Cuda.cpp:764
+  if (DriverArgs.hasArg(options::OPT_nocudalib) ||
+  DeviceOffloadingKind == Action::OFK_HIP)

[PATCH] D46476: [HIP] Add action builder for HIP

2018-05-18 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/Driver/Driver.cpp:2221
+CudaDeviceActions.clear();
+for (unsigned I = 0, E = GpuArchList.size(); I != E; ++I) {
+  CudaDeviceActions.push_back(UA);

`for (auto Arch : GpuArchList)`



Comment at: lib/Driver/Driver.cpp:2265-2272
+  assert(AssociatedOffloadKind == Action::OFK_Cuda || AssociatedOffloadKind == Action::OFK_HIP);
+
   // We don't need to support CUDA.
-  if (!C.hasOffloadToolChain<Action::OFK_Cuda>())
+  if (AssociatedOffloadKind == Action::OFK_Cuda && !C.hasOffloadToolChain<Action::OFK_Cuda>())
+    return false;
+
+  // We don't need to support HIP.

Please reformat.



Comment at: lib/Driver/Driver.cpp:2330-2332
+  for (CudaArch Arch : GpuArchs) {
 GpuArchList.push_back(Arch);
+  }

Single-statement for does not need braces.



Comment at: lib/Driver/Driver.cpp:2485-2493
+  // The host only depends on device action in the linking phase, when all
+  // the device images have to be embedded in the host image.
+  if (CurPhase == phases::Link) {
+DeviceLinkerInputs.resize(CudaDeviceActions.size());
+auto LI = DeviceLinkerInputs.begin();
+for (auto *A : CudaDeviceActions) {
+  LI->push_back(A);

I'm not sure I understand what happens here and the comment does not help.
We appear to add each element of CudaDeviceActions to the action list of each 
linker input.

Does the comment mean that *only in linking mode* do we need to add dependency 
on device actions?



https://reviews.llvm.org/D46476



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D45212: [HIP] Let CUDA toolchain support HIP language mode and amdgpu

2018-05-18 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

One more thing -- it would be really good to add some tests to make sure your 
commands are constructed the way you want.


https://reviews.llvm.org/D45212



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D46472: [HIP] Support offloading by linker script

2018-05-18 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: cfe/trunk/lib/Driver/ToolChains/CommonArgs.cpp:1371-1388
+  // machines.
+  LksStream << "/*\n";
+  LksStream << "   HIP Offload Linker Script\n";
+  LksStream << " *** Automatically generated by Clang ***\n";
+  LksStream << "*/\n";
+  LksStream << "TARGET(binary)\n";
+  LksStream << "INPUT(" << BundleFileName << ")\n";

Using this linker script may present a problem.

INSERT BEFORE is not going to work with ld.gold.
https://sourceware.org/bugzilla/show_bug.cgi?id=15373

LLD also does not handle it particularly well -- INSERT BEFORE can only be used 
to override an explicitly specified external linker script, and virtually nobody 
uses linker scripts with LLD.
See tests in https://reviews.llvm.org/D44380



Repository:
  rL LLVM

https://reviews.llvm.org/D46472



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47070: [CUDA] Upgrade linked bitcode to enable inlining

2018-05-22 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In https://reviews.llvm.org/D47070#1106018, @echristo wrote:

> > As a short-term fix we can disable feature-to-function attribute 
> > propagation for NVPTX until we fix it.
> > 
> > @echristo -- any other suggestions?
>
> This is some of what I was talking about when I was mentioning how function 
> attributes and the targets work. Ideally you'll have a compatible set of 
> features and it won't really cause an issue. The idea is that if you're 
> compiling for a minimum ptx feature of X, then any "compatible" set of ptx 
> should be able to inline into your code. I think you do want the features to 
> propagate in general, just specific use cases may not care one way or another 
> - that said, for those use cases you're probably just compiling everything 
> with the same feature anyhow.


The thing is that with NVPTX you cannot have incompatible functions in the PTX, 
period. PTXAS will just throw syntax errors at you. In that regard PTX is very 
different from x86, where the same binary can contain different functions with 
code for different x86 variants. For PTX, sm_50 and sm_60 mean entirely different 
GPUs with entirely different instruction sets/encodings. The PTX version is an 
approximation of a different language dialect: you cannot use anything from 
PTX 4.0 if your file says it's PTX 3.0. It's sort of like not being able to use 
C++17 features when you're compiling in C++98 mode. The bottom line is that 
features and target-cpu do not make much sense for NVPTX. Everything we generate 
in a TU must satisfy the minimum PTX version and minimum GPU variant, and it will 
all be compiled for and run on only one specific GPU. There's no mixing and 
matching.
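A CUDA-source way of looking at it (illustrative only): whatever arch-specific 
branches exist in the source, each device-side TU is still compiled for exactly 
one GPU.

```
__device__ int width() {
#if __CUDA_ARCH__ >= 600
  return 64;   // this branch only ever exists in a TU compiled for sm_60+
#else
  return 32;   // older GPUs get an entirely separate compilation
#endif
}
```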

The question is -- what's the best way to make things work as they did before 
I broke them?
@Hahnfeld's idea of ignoring features and target-cpu would get us there, but 
that may be a never-ending source of surprises if/when something else decides 
to pay attention to those attributes.
I think the best way to tackle that would be to
a) figure out how to make builtins available (or not) on the clang side, and
b) make the target-cpu and target-features attributes explicitly unsupported on 
NVPTX, as we cannot provide the functionality those attributes imply.

> I guess, ultimately, I'm not seeing what the concern here is for how features 
> are working or not working for the target so it's harder to help. What is the 
> problem you're running into, or can you try a different way of explaining it 
> to me? :)

Here's my understanding of what happens: 
We've started adding target-features and target-cpu to everything clang 
generates. 
We also need to link with libdevice (or with IR generated by clang which has 
functions without those attributes), or with IR produced by a clang that used a 
different CUDA SDK and thus a different PTX version in target-features.
Due to the attribute mismatch we fail to inline some of the functions, and that 
hurts performance.


Repository:
  rC Clang

https://reviews.llvm.org/D47070



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47154: Try to make builtin address space declarations not useless

2018-05-22 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

CUDA does not expose explicit address spaces on the clang side. All pointers are 
treated as generic, and we infer the specific address space only in LLVM.
The `__nvvm_atom_*_[sg]_*` builtins should probably be removed, as they are indeed 
useless without pointers with an explicit AS, and NVCC itself does not have such 
builtins either. Instead, we should convert the generic-AS builtins to 
address-space-specific instructions somewhere in LLVM.

Using `__attribute__((address_space(N)))` should probably produce an error during 
CUDA compilation.
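For example (a small sketch; `__nvvm_atom_add_gen_i` is the generic-AS builtin 
spelling I have in mind), the source only ever sees generic pointers and LLVM 
infers the address space later:

```
__device__ int counter;

__device__ int bump(int *p) {
  // Generic-address-space atomic add; LLVM can later infer whether p points
  // to global/shared memory and pick a specific instruction.
  return __nvvm_atom_add_gen_i(p, 1);
}

__global__ void kernel() { bump(&counter); }
```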


https://reviews.llvm.org/D47154



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47268: [CUDA] Fixed the list of GPUs supported by CUDA-9

2018-05-23 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
tra added reviewers: jlebar, klimek.
Herald added subscribers: bixia, sanjoy.

Removed sm_20 as it is not supported by CUDA-9.
Added sm_37.


https://reviews.llvm.org/D47268

Files:
  clang/lib/Driver/ToolChains/Cuda.cpp


Index: clang/lib/Driver/ToolChains/Cuda.cpp
===
--- clang/lib/Driver/ToolChains/Cuda.cpp
+++ clang/lib/Driver/ToolChains/Cuda.cpp
@@ -164,8 +164,8 @@
   std::string FilePath = LibDevicePath + "/libdevice.10.bc";
   if (FS.exists(FilePath)) {
 for (const char *GpuArchName :
- {"sm_20", "sm_30", "sm_32", "sm_35", "sm_50", "sm_52", "sm_53",
-   "sm_60", "sm_61", "sm_62", "sm_70", "sm_72"}) {
+ {"sm_30", "sm_32", "sm_35", "sm_37", "sm_50", "sm_52", "sm_53",
+  "sm_60", "sm_61", "sm_62", "sm_70", "sm_72"}) {
   const CudaArch GpuArch = StringToCudaArch(GpuArchName);
   if (Version >= MinVersionForCudaArch(GpuArch) &&
   Version <= MaxVersionForCudaArch(GpuArch))


___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47268: [CUDA] Fixed the list of GPUs supported by CUDA-9

2018-05-23 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rC333098: [CUDA] Fixed the list of GPUs supported by CUDA-9. 
(authored by tra).

Changed prior to commit:
  https://reviews.llvm.org/D47268?vs=148232&id=148236#toc

Repository:
  rC Clang

https://reviews.llvm.org/D47268

Files:
  lib/Driver/ToolChains/Cuda.cpp


Index: lib/Driver/ToolChains/Cuda.cpp
===
--- lib/Driver/ToolChains/Cuda.cpp
+++ lib/Driver/ToolChains/Cuda.cpp
@@ -164,8 +164,8 @@
   std::string FilePath = LibDevicePath + "/libdevice.10.bc";
   if (FS.exists(FilePath)) {
 for (const char *GpuArchName :
- {"sm_20", "sm_30", "sm_32", "sm_35", "sm_50", "sm_52", "sm_53",
-   "sm_60", "sm_61", "sm_62", "sm_70", "sm_72"}) {
+ {"sm_30", "sm_32", "sm_35", "sm_37", "sm_50", "sm_52", "sm_53",
+  "sm_60", "sm_61", "sm_62", "sm_70", "sm_72"}) {
   const CudaArch GpuArch = StringToCudaArch(GpuArchName);
   if (Version >= MinVersionForCudaArch(GpuArch) &&
   Version <= MaxVersionForCudaArch(GpuArch))


___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D45212: Add HIP toolchain

2018-05-23 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/Driver/ToolChains/HIP.cpp:29-47
+static bool addBCLib(Compilation &C, const ArgList &Args,
+ ArgStringList &CmdArgs, ArgStringList LibraryPaths,
+ StringRef BCName) {
+  StringRef FullName;
+  bool FoundLibDevice = false;
+  for (std::string LibraryPath : LibraryPaths) {
+SmallString<128> Path(LibraryPath);

FullName may remain uninitialized if LibraryPaths is empty, which will probably 
crash the compiler when you attempt to pass it to MakeArgString.
If an empty LibraryPaths is not expected, there should be an assert.

If the library is not found, we issue an error, but we still proceed to append 
FullName to CmdArgs. I don't think we should do that. FullName will be either 
NULL or pointing to the last directory in LibraryPaths. 

You seem to be relying on diagnostics to deal with errors and are not using the 
return value of the function. You may as well make it void.

I'd move `CmdArgs.push_back(...)` under `if (::exists(FullName))` and change 
`break` to `return`.
Then you can get rid of FoundLibDevice and just issue the error if we ever 
reach the end of the function.




Comment at: lib/Driver/ToolChains/HIP.cpp:79-81
+std::string ISAVerBC = "oclc_isa_version_";
+ISAVerBC = ISAVerBC + SubArchName.drop_front(3).str();
+ISAVerBC = ISAVerBC + ".amdgcn.bc";

No need for intermediate values here -- just '+' all parts together. 




Comment at: lib/Driver/ToolChains/HIP.cpp:133
+}
+OptArgs.push_back(Args.MakeArgString(llvm::Twine("-O") + OOpt));
+  }

Nit: I think explicit llvm::Twine is unnecessary here. 



Comment at: lib/Driver/ToolChains/HIP.cpp:155-160
+  ArgStringList LlcArgs;
+  LlcArgs.push_back(InputFileName);
+  LlcArgs.push_back("-mtriple=amdgcn-amd-amdhsa");
+  LlcArgs.push_back("-filetype=obj");
+  LlcArgs.push_back(Args.MakeArgString("-mcpu=" + SubArchName));
+  LlcArgs.push_back("-o");

Nit: This could be collapsed into `ArgStringList LlcArgs({...});`



Comment at: lib/Driver/ToolChains/HIP.cpp:179-181
+  ArgStringList LldArgs;
+  // The output from ld.lld is an HSA code object file.
+  LldArgs.append({"-flavor", "gnu", "--no-undefined", "-shared", "-o"});

Same here: `ArgStringList LldArgs({"-flavor", "gnu", "--no-undefined", 
"-shared", "-o"});`



Comment at: lib/Driver/ToolChains/HIP.cpp:212-215
+  TempFile =
+  constructOptCommand(C, JA, Inputs, Args, SubArchName, Prefix, TempFile);
+  TempFile =
+  constructLlcCommand(C, JA, Inputs, Args, SubArchName, Prefix, TempFile);

Right now the code is structured as if you're appending to the same TempFile 
string which is not the case here. I'd give intermediate variables their own 
names -- `OptCommand`, `LlcCommand`.
This would make it easier to see that you are **chaining** separate commands, 
each producing its own temp output file.


https://reviews.llvm.org/D45212



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D45212: Add HIP toolchain

2018-05-23 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

One small nit. LGTM otherwise.




Comment at: lib/Driver/ToolChains/HIP.cpp:44
+  }
+  if (!FoundLibDevice)
+C.getDriver().Diag(diag::err_drv_no_such_file) << BCName;

You don't need FoundLibDevice any more, as you will always return from inside 
the loop if it is ever true.


https://reviews.llvm.org/D45212



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain

2018-05-29 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

"Interoperability with other compilers" is probably a statement that's a bit 
too strong. At best it's kind of compatible with CUDA tools and I don't think 
it's feasible for other compilers. I.e. it will be useless for AMD GPUs and 
whatever compiler they use.

In general it sounds like you're going back to what the regular CUDA compilation 
pipeline does:

- [clang] C++->.ptx
- [ptxas] .ptx -> .cubin
- [fatbin] .cubin -> .fatbin
- [clang] C++ + .fatbin -> host .o

On one hand, I can see how being able to treat GPU-side binaries like any other 
host file is convenient. On the other hand, this convenience comes at the 
price of targeting only NVPTX. That seems contrary to OpenMP's goal of 
supporting many different kinds of accelerators. I'm not sure what the 
consensus in the OpenMP community is these days, but I vaguely recall that generic 
bundling/unbundling was explicitly chosen over vendor-specific encapsulation in 
the host .o when the bundling was implemented. If the underlying reasons have 
changed since then, it would be great to hear more details about that.

Assuming we do proceed with the back-to-CUDA approach, one thing I'd consider 
would be using clang's -fcuda-include-gpubinary option, which CUDA uses to include 
GPU code in the host object. You may be able to use it to avoid compiling and 
partially linking the .fatbin and host .o.


Repository:
  rC Clang

https://reviews.llvm.org/D47394



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D46476: [HIP] Add action builder for HIP

2018-05-29 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

One nit. LGTM otherwise.




Comment at: test/Driver/cuda-phases.cu:16
+// RUN: | FileCheck -check-prefixes=BIN,BIN_NV %s
+// RUN: %clang -x hip -target powerpc64le-ibm-linux-gnu -ccc-print-phases 
--cuda-gpu-arch=gfx803 %s 2>&1 \
+// RUN: | FileCheck -check-prefixes=BIN,BIN_AMD %s

Please wrap long RUN lines in all tests.


https://reviews.llvm.org/D46476



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38188: [CUDA] Fix names of __nvvm_vote* intrinsics.

2017-09-25 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In https://reviews.llvm.org/D38188#880318, @jlebar wrote:

> Should we add tests to the test-suite?  Or, are these already caught by the 
> existing tests we have?


That's the plan. Once clang can compile the CUDA-9 headers, I'll add CUDA-9-specific 
tests to the test-suite and update the buildbot to compile/run tests with CUDA-9.


https://reviews.llvm.org/D38188



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38188: [CUDA] Fix names of __nvvm_vote* intrinsics.

2017-09-25 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rL314129: [CUDA] Fix names of __nvvm_vote* intrinsics. 
(authored by tra).

Changed prior to commit:
  https://reviews.llvm.org/D38188?vs=116400&id=116576#toc

Repository:
  rL LLVM

https://reviews.llvm.org/D38188

Files:
  cfe/trunk/lib/Headers/__clang_cuda_intrinsics.h


Index: cfe/trunk/lib/Headers/__clang_cuda_intrinsics.h
===
--- cfe/trunk/lib/Headers/__clang_cuda_intrinsics.h
+++ cfe/trunk/lib/Headers/__clang_cuda_intrinsics.h
@@ -170,22 +170,22 @@
 }
 
 inline __device__ int __all_sync(unsigned int mask, int pred) {
-  return __nvvm_vote_sync_all(mask, pred);
+  return __nvvm_vote_all_sync(mask, pred);
 }
 
 inline __device__ int __any_sync(unsigned int mask, int pred) {
-  return __nvvm_vote_sync_any(mask, pred);
+  return __nvvm_vote_any_sync(mask, pred);
 }
 
 inline __device__ int __uni_sync(unsigned int mask, int pred) {
-  return __nvvm_vote_sync_uni(mask, pred);
+  return __nvvm_vote_uni_sync(mask, pred);
 }
 
 inline __device__ unsigned int __ballot_sync(unsigned int mask, int pred) {
-  return __nvvm_vote_sync_ballot(mask, pred);
+  return __nvvm_vote_ballot_sync(mask, pred);
 }
 
-inline __device__ activemask() { return __nvvm_vote.ballot(1); }
+inline __device__ unsigned int activemask() { return __nvvm_vote_ballot(1); }
 
 #endif // __CUDA_VERSION >= 9000 && (!defined(__CUDA_ARCH__) ||
// __CUDA_ARCH__ >= 300)


___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38191: [NVPTX] added match.{any, all}.sync instructions, intrinsics & builtins.

2017-09-25 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 116578.
tra marked an inline comment as done.
tra added a comment.

Addressed Justin's comments.


https://reviews.llvm.org/D38191

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/lib/Headers/__clang_cuda_intrinsics.h
  clang/test/CodeGen/builtins-nvptx-ptx60.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
  llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
  llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
  llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
  llvm/test/CodeGen/NVPTX/match.ll

Index: llvm/test/CodeGen/NVPTX/match.ll
===
--- /dev/null
+++ llvm/test/CodeGen/NVPTX/match.ll
@@ -0,0 +1,117 @@
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_70 -mattr=+ptx60 | FileCheck %s
+
+declare i32 @llvm.nvvm.match.any.sync.i32(i32, i32)
+declare i64 @llvm.nvvm.match.any.sync.i64(i32, i64)
+
+; CHECK-LABEL: .func{{.*}}match.any.sync.i32
+define i32 @match.any.sync.i32(i32 %mask, i32 %value) {
+  ; CHECK: ld.param.u32 	[[MASK:%r[0-9]+]], [match.any.sync.i32_param_0];
+  ; CHECK: ld.param.u32 	[[VALUE:%r[0-9]+]], [match.any.sync.i32_param_1];
+
+  ; CHECK:  match.any.sync.b32  [[V0:%r[0-9]+]], [[VALUE]], [[MASK]];
+  %v0 = call i32 @llvm.nvvm.match.any.sync.i32(i32 %mask, i32 %value)
+  ; CHECK:  match.any.sync.b32  [[V1:%r[0-9]+]], [[VALUE]], 1;
+  %v1 = call i32 @llvm.nvvm.match.any.sync.i32(i32 1, i32 %value)
+  ; CHECK:  match.any.sync.b32  [[V2:%r[0-9]+]], 2, [[MASK]];
+  %v2 = call i32 @llvm.nvvm.match.any.sync.i32(i32 %mask, i32 2)
+  ; CHECK:  match.any.sync.b32  [[V3:%r[0-9]+]], 4, 3;
+  %v3 = call i32 @llvm.nvvm.match.any.sync.i32(i32 3, i32 4)
+  %sum1 = add i32 %v0, %v1
+  %sum2 = add i32 %v2, %v3
+  %sum3 = add i32 %sum1, %sum2
+  ret i32 %sum3;
+}
+
+; CHECK-LABEL: .func{{.*}}match.any.sync.i64
+define i64 @match.any.sync.i64(i32 %mask, i64 %value) {
+  ; CHECK: ld.param.u32 	[[MASK:%r[0-9]+]], [match.any.sync.i64_param_0];
+  ; CHECK: ld.param.u64 	[[VALUE:%rd[0-9]+]], [match.any.sync.i64_param_1];
+
+  ; CHECK:  match.any.sync.b64  [[V0:%rd[0-9]+]], [[VALUE]], [[MASK]];
+  %v0 = call i64 @llvm.nvvm.match.any.sync.i64(i32 %mask, i64 %value)
+  ; CHECK:  match.any.sync.b64  [[V1:%rd[0-9]+]], [[VALUE]], 1;
+  %v1 = call i64 @llvm.nvvm.match.any.sync.i64(i32 1, i64 %value)
+  ; CHECK:  match.any.sync.b64  [[V2:%rd[0-9]+]], 2, [[MASK]];
+  %v2 = call i64 @llvm.nvvm.match.any.sync.i64(i32 %mask, i64 2)
+  ; CHECK:  match.any.sync.b64  [[V3:%rd[0-9]+]], 4, 3;
+  %v3 = call i64 @llvm.nvvm.match.any.sync.i64(i32 3, i64 4)
+  %sum1 = add i64 %v0, %v1
+  %sum2 = add i64 %v2, %v3
+  %sum3 = add i64 %sum1, %sum2
+  ret i64 %sum3;
+}
+
+declare {i32, i1} @llvm.nvvm.match.all.sync.i32p(i32, i32)
+declare {i64, i1} @llvm.nvvm.match.all.sync.i64p(i32, i64)
+
+; CHECK-LABEL: .func{{.*}}match.all.sync.i32p(
+define {i32,i1} @match.all.sync.i32p(i32 %mask, i32 %value) {
+  ; CHECK: ld.param.u32 	[[MASK:%r[0-9]+]], [match.all.sync.i32p_param_0];
+  ; CHECK: ld.param.u32 	[[VALUE:%r[0-9]+]], [match.all.sync.i32p_param_1];
+
+  ; CHECK:  match.all.sync.b32 {{%r[0-9]+\|%p[0-9]+}}, [[VALUE]], [[MASK]];
+  %r1 = call {i32, i1} @llvm.nvvm.match.all.sync.i32p(i32 %mask, i32 %value)
+  %v1 = extractvalue {i32, i1} %r1, 0
+  %p1 = extractvalue {i32, i1} %r1, 1
+
+  ; CHECK:  match.all.sync.b32 {{%r[0-9]+\|%p[0-9]+}}, 1, [[MASK]];
+  %r2 = call {i32, i1} @llvm.nvvm.match.all.sync.i32p(i32 %mask, i32 1)
+  %v2 = extractvalue {i32, i1} %r2, 0
+  %p2 = extractvalue {i32, i1} %r2, 1
+
+  ; CHECK:  match.all.sync.b32 {{%r[0-9]+\|%p[0-9]+}}, [[VALUE]], 2;
+  %r3 = call {i32, i1} @llvm.nvvm.match.all.sync.i32p(i32 2, i32 %value)
+  %v3 = extractvalue {i32, i1} %r3, 0
+  %p3 = extractvalue {i32, i1} %r3, 1
+
+  ; CHECK:  match.all.sync.b32 {{%r[0-9]+\|%p[0-9]+}}, 4, 3;
+  %r4 = call {i32, i1} @llvm.nvvm.match.all.sync.i32p(i32 3, i32 4)
+  %v4 = extractvalue {i32, i1} %r4, 0
+  %p4 = extractvalue {i32, i1} %r4, 1
+
+  %vsum1 = add i32 %v1, %v2
+  %vsum2 = add i32 %v3, %v4
+  %vsum3 = add i32 %vsum1, %vsum2
+  %psum1 = add i1 %p1, %p2
+  %psum2 = add i1 %p3, %p4
+  %psum3 = add i1 %psum1, %psum2
+  %ret0 = insertvalue {i32, i1} undef, i32 %vsum3, 0
+  %ret1 = insertvalue {i32, i1} %ret0, i1 %psum3, 1
+  ret {i32, i1} %ret1;
+}
+
+; CHECK-LABEL: .func{{.*}}match.all.sync.i64p(
+define {i64,i1} @match.all.sync.i64p(i32 %mask, i64 %value) {
+  ; CHECK: ld.param.u32 	[[MASK:%r[0-9]+]], [match.all.sync.i64p_param_0];
+  ; CHECK: ld.param.u64 	[[VALUE:%rd[0-9]+]], [match.all.sync.i64p_param_1];
+
+  ; CHECK:  match.all.sync.b64 {{%rd[0-9]+\|%p[0-9]+}}, [[VALUE]], [[MASK]];
+  %r1 = call {i64, i1} @llvm.nvvm.match.all.sync.i64p(i32 %mask, i64 %value)
+  %v1 = extractvalue {i64, i1} %r1, 0
+  %p1 = extractvalue {i64, i1} %r1, 1
+
+  ; CHECK:  match.all.sync.b64 {{%rd[0-9]+\|%p[0-9]+}}, 1, [[MASK]];
+  %r2 = call {i64, i1} @llvm.nvvm.match.all.sync.i64p(i32 %mask, i64 1)
+  %v2 = extractval

[PATCH] D38191: [NVPTX] added match.{any, all}.sync instructions, intrinsics & builtins.

2017-09-25 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/CodeGen/CGBuiltin.cpp:9603
+Value *Pred = Builder.CreateSExt(Builder.CreateExtractValue(ResultPair, 1),
+ PredOutPtr.getElementType());
+Builder.CreateStore(Pred, PredOutPtr);

jlebar wrote:
> Doing sext i1 -> i32 is going to cause us to store 0 or -1 in the pred 
> (right?).  The CUDA docs say
> 
> > Predicate pred is set to true if all threads in mask have the same value of 
> > value; otherwise the predicate is set to false.
> 
> I'd guess that "true" probably means 1 (i.e. uext i1 -> i32) rather than -1, 
> although, I guess we have to check.
Right. It should've been ZExt. In similar places CUDA headers use "selp %r1, 1, 
0, %p".


https://reviews.llvm.org/D38191



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38191: [NVPTX] added match.{any, all}.sync instructions, intrinsics & builtins.

2017-09-25 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rL314135: [NVPTX] added match.{any,all}.sync instructions, 
intrinsics & builtins. (authored by tra).

Changed prior to commit:
  https://reviews.llvm.org/D38191?vs=116578&id=116584#toc

Repository:
  rL LLVM

https://reviews.llvm.org/D38191

Files:
  cfe/trunk/include/clang/Basic/BuiltinsNVPTX.def
  cfe/trunk/lib/CodeGen/CGBuiltin.cpp
  cfe/trunk/lib/Headers/__clang_cuda_intrinsics.h
  cfe/trunk/test/CodeGen/builtins-nvptx-ptx60.cu
  llvm/trunk/include/llvm/IR/IntrinsicsNVVM.td
  llvm/trunk/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
  llvm/trunk/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
  llvm/trunk/lib/Target/NVPTX/NVPTXInstrInfo.td
  llvm/trunk/lib/Target/NVPTX/NVPTXIntrinsics.td
  llvm/trunk/test/CodeGen/NVPTX/match.ll

Index: llvm/trunk/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
===
--- llvm/trunk/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ llvm/trunk/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -714,6 +714,9 @@
 return false;
   case Intrinsic::nvvm_texsurf_handle_internal:
 SelectTexSurfHandle(N);
+  case Intrinsic::nvvm_match_all_sync_i32p:
+  case Intrinsic::nvvm_match_all_sync_i64p:
+SelectMatchAll(N);
 return true;
   }
 }
@@ -726,6 +729,36 @@
 MVT::i64, GlobalVal));
 }
 
+void NVPTXDAGToDAGISel::SelectMatchAll(SDNode *N) {
+  SDLoc DL(N);
+  enum { IS_I64 = 4, HAS_CONST_VALUE = 2, HAS_CONST_MASK = 1 };
+  unsigned IID = cast<ConstantSDNode>(N->getOperand(0))->getZExtValue();
+  unsigned OpcodeIndex =
+  (IID == Intrinsic::nvvm_match_all_sync_i64p) ? IS_I64 : 0;
+  SDValue MaskOp = N->getOperand(1);
+  SDValue ValueOp = N->getOperand(2);
+  if (ConstantSDNode *ValueConst = dyn_cast<ConstantSDNode>(ValueOp)) {
+OpcodeIndex |= HAS_CONST_VALUE;
+ValueOp = CurDAG->getTargetConstant(ValueConst->getZExtValue(), DL,
+ValueConst->getValueType(0));
+  }
+  if (ConstantSDNode *MaskConst = dyn_cast<ConstantSDNode>(MaskOp)) {
+OpcodeIndex |= HAS_CONST_MASK;
+MaskOp = CurDAG->getTargetConstant(MaskConst->getZExtValue(), DL,
+   MaskConst->getValueType(0));
+  }
+  // Maps {IS_I64, HAS_CONST_VALUE, HAS_CONST_MASK} -> opcode
+  unsigned Opcodes[8] = {
+  NVPTX::MATCH_ALLP_SYNC_32rr, NVPTX::MATCH_ALLP_SYNC_32ri,
+  NVPTX::MATCH_ALLP_SYNC_32ir, NVPTX::MATCH_ALLP_SYNC_32ii,
+  NVPTX::MATCH_ALLP_SYNC_64rr, NVPTX::MATCH_ALLP_SYNC_64ri,
+  NVPTX::MATCH_ALLP_SYNC_64ir, NVPTX::MATCH_ALLP_SYNC_64ii};
+  SDNode *NewNode = CurDAG->getMachineNode(Opcodes[OpcodeIndex], DL,
+   {ValueOp->getValueType(0), MVT::i1},
+   {MaskOp, ValueOp});
+  ReplaceNode(N, NewNode);
+}
+
 void NVPTXDAGToDAGISel::SelectAddrSpaceCast(SDNode *N) {
   SDValue Src = N->getOperand(0);
   AddrSpaceCastSDNode *CastN = cast<AddrSpaceCastSDNode>(N);
Index: llvm/trunk/lib/Target/NVPTX/NVPTXInstrInfo.td
===
--- llvm/trunk/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ llvm/trunk/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -158,6 +158,7 @@
 def hasPTX60 : Predicate<"Subtarget->getPTXVersion() >= 60">;
 
 def hasSM30 : Predicate<"Subtarget->getSmVersion() >= 30">;
+def hasSM70 : Predicate<"Subtarget->getSmVersion() >= 70">;
 
 def useFP16Math: Predicate<"Subtarget->allowFP16Math()">;
 
Index: llvm/trunk/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
===
--- llvm/trunk/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ llvm/trunk/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -58,6 +58,7 @@
   bool tryIntrinsicNoChain(SDNode *N);
   bool tryIntrinsicChain(SDNode *N);
   void SelectTexSurfHandle(SDNode *N);
+  void SelectMatchAll(SDNode *N);
   bool tryLoad(SDNode *N);
   bool tryLoadVector(SDNode *N);
   bool tryLDGLDU(SDNode *N);
Index: llvm/trunk/lib/Target/NVPTX/NVPTXIntrinsics.td
===
--- llvm/trunk/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ llvm/trunk/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -247,6 +247,63 @@
 defm VOTE_SYNC_UNI : VOTE_SYNC;
 defm VOTE_SYNC_BALLOT : VOTE_SYNC;
 
+multiclass MATCH_ANY_SYNC {
+  def ii : NVPTXInst<(outs regclass:$dest), (ins i32imm:$mask, ImmOp:$value),
+  "match.any.sync." # ptxtype # " \t$dest, $value, $mask;",
+  [(set regclass:$dest, (IntOp imm:$mask, imm:$value))]>,
+   Requires<[hasPTX60, hasSM70]>;
+  def ir : NVPTXInst<(outs regclass:$dest), (ins Int32Regs:$mask, ImmOp:$value),
+  "match.any.sync." # ptxtype # " \t$dest, $value, $mask;",
+  [(set regclass:$dest, (IntOp Int32Regs:$mask, imm:$value))]>,
+   Requires<[hasPTX60, hasSM70]>;
+  def ri : NVPTXInst<(outs regclass:$dest), (ins i32imm:$mask, regclass:$value),
+  "match.any.sync." # ptxtype # " \t$dest, $value, $mask;",
+   

[PATCH] D38191: [NVPTX] added match.{any, all}.sync instructions, intrinsics & builtins.

2017-09-26 Thread Artem Belevich via Phabricator via cfe-commits
tra reopened this revision.
tra added inline comments.
This revision is now accepted and ready to land.



Comment at: llvm/trunk/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp:716
   case Intrinsic::nvvm_texsurf_handle_internal:
 SelectTexSurfHandle(N);
+  case Intrinsic::nvvm_match_all_sync_i32p:

I've unintentionally killed `return true;` here and that's what broke the 
texture tests. I'm not sure yet why my local tests worked.


https://reviews.llvm.org/D38191



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38191: [NVPTX] added match.{any, all}.sync instructions, intrinsics & builtins.

2017-09-26 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 116674.
tra added a comment.

Added missing return. Tests pass now.


https://reviews.llvm.org/D38191

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/lib/Headers/__clang_cuda_intrinsics.h
  clang/test/CodeGen/builtins-nvptx-ptx60.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
  llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
  llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
  llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
  llvm/test/CodeGen/NVPTX/match.ll

Index: llvm/test/CodeGen/NVPTX/match.ll
===
--- /dev/null
+++ llvm/test/CodeGen/NVPTX/match.ll
@@ -0,0 +1,117 @@
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_70 -mattr=+ptx60 | FileCheck %s
+
+declare i32 @llvm.nvvm.match.any.sync.i32(i32, i32)
+declare i64 @llvm.nvvm.match.any.sync.i64(i32, i64)
+
+; CHECK-LABEL: .func{{.*}}match.any.sync.i32
+define i32 @match.any.sync.i32(i32 %mask, i32 %value) {
+  ; CHECK: ld.param.u32 	[[MASK:%r[0-9]+]], [match.any.sync.i32_param_0];
+  ; CHECK: ld.param.u32 	[[VALUE:%r[0-9]+]], [match.any.sync.i32_param_1];
+
+  ; CHECK:  match.any.sync.b32  [[V0:%r[0-9]+]], [[VALUE]], [[MASK]];
+  %v0 = call i32 @llvm.nvvm.match.any.sync.i32(i32 %mask, i32 %value)
+  ; CHECK:  match.any.sync.b32  [[V1:%r[0-9]+]], [[VALUE]], 1;
+  %v1 = call i32 @llvm.nvvm.match.any.sync.i32(i32 1, i32 %value)
+  ; CHECK:  match.any.sync.b32  [[V2:%r[0-9]+]], 2, [[MASK]];
+  %v2 = call i32 @llvm.nvvm.match.any.sync.i32(i32 %mask, i32 2)
+  ; CHECK:  match.any.sync.b32  [[V3:%r[0-9]+]], 4, 3;
+  %v3 = call i32 @llvm.nvvm.match.any.sync.i32(i32 3, i32 4)
+  %sum1 = add i32 %v0, %v1
+  %sum2 = add i32 %v2, %v3
+  %sum3 = add i32 %sum1, %sum2
+  ret i32 %sum3;
+}
+
+; CHECK-LABEL: .func{{.*}}match.any.sync.i64
+define i64 @match.any.sync.i64(i32 %mask, i64 %value) {
+  ; CHECK: ld.param.u32 	[[MASK:%r[0-9]+]], [match.any.sync.i64_param_0];
+  ; CHECK: ld.param.u64 	[[VALUE:%rd[0-9]+]], [match.any.sync.i64_param_1];
+
+  ; CHECK:  match.any.sync.b64  [[V0:%rd[0-9]+]], [[VALUE]], [[MASK]];
+  %v0 = call i64 @llvm.nvvm.match.any.sync.i64(i32 %mask, i64 %value)
+  ; CHECK:  match.any.sync.b64  [[V1:%rd[0-9]+]], [[VALUE]], 1;
+  %v1 = call i64 @llvm.nvvm.match.any.sync.i64(i32 1, i64 %value)
+  ; CHECK:  match.any.sync.b64  [[V2:%rd[0-9]+]], 2, [[MASK]];
+  %v2 = call i64 @llvm.nvvm.match.any.sync.i64(i32 %mask, i64 2)
+  ; CHECK:  match.any.sync.b64  [[V3:%rd[0-9]+]], 4, 3;
+  %v3 = call i64 @llvm.nvvm.match.any.sync.i64(i32 3, i64 4)
+  %sum1 = add i64 %v0, %v1
+  %sum2 = add i64 %v2, %v3
+  %sum3 = add i64 %sum1, %sum2
+  ret i64 %sum3;
+}
+
+declare {i32, i1} @llvm.nvvm.match.all.sync.i32p(i32, i32)
+declare {i64, i1} @llvm.nvvm.match.all.sync.i64p(i32, i64)
+
+; CHECK-LABEL: .func{{.*}}match.all.sync.i32p(
+define {i32,i1} @match.all.sync.i32p(i32 %mask, i32 %value) {
+  ; CHECK: ld.param.u32 	[[MASK:%r[0-9]+]], [match.all.sync.i32p_param_0];
+  ; CHECK: ld.param.u32 	[[VALUE:%r[0-9]+]], [match.all.sync.i32p_param_1];
+
+  ; CHECK:  match.all.sync.b32 {{%r[0-9]+\|%p[0-9]+}}, [[VALUE]], [[MASK]];
+  %r1 = call {i32, i1} @llvm.nvvm.match.all.sync.i32p(i32 %mask, i32 %value)
+  %v1 = extractvalue {i32, i1} %r1, 0
+  %p1 = extractvalue {i32, i1} %r1, 1
+
+  ; CHECK:  match.all.sync.b32 {{%r[0-9]+\|%p[0-9]+}}, 1, [[MASK]];
+  %r2 = call {i32, i1} @llvm.nvvm.match.all.sync.i32p(i32 %mask, i32 1)
+  %v2 = extractvalue {i32, i1} %r2, 0
+  %p2 = extractvalue {i32, i1} %r2, 1
+
+  ; CHECK:  match.all.sync.b32 {{%r[0-9]+\|%p[0-9]+}}, [[VALUE]], 2;
+  %r3 = call {i32, i1} @llvm.nvvm.match.all.sync.i32p(i32 2, i32 %value)
+  %v3 = extractvalue {i32, i1} %r3, 0
+  %p3 = extractvalue {i32, i1} %r3, 1
+
+  ; CHECK:  match.all.sync.b32 {{%r[0-9]+\|%p[0-9]+}}, 4, 3;
+  %r4 = call {i32, i1} @llvm.nvvm.match.all.sync.i32p(i32 3, i32 4)
+  %v4 = extractvalue {i32, i1} %r4, 0
+  %p4 = extractvalue {i32, i1} %r4, 1
+
+  %vsum1 = add i32 %v1, %v2
+  %vsum2 = add i32 %v3, %v4
+  %vsum3 = add i32 %vsum1, %vsum2
+  %psum1 = add i1 %p1, %p2
+  %psum2 = add i1 %p3, %p4
+  %psum3 = add i1 %psum1, %psum2
+  %ret0 = insertvalue {i32, i1} undef, i32 %vsum3, 0
+  %ret1 = insertvalue {i32, i1} %ret0, i1 %psum3, 1
+  ret {i32, i1} %ret1;
+}
+
+; CHECK-LABEL: .func{{.*}}match.all.sync.i64p(
+define {i64,i1} @match.all.sync.i64p(i32 %mask, i64 %value) {
+  ; CHECK: ld.param.u32 	[[MASK:%r[0-9]+]], [match.all.sync.i64p_param_0];
+  ; CHECK: ld.param.u64 	[[VALUE:%rd[0-9]+]], [match.all.sync.i64p_param_1];
+
+  ; CHECK:  match.all.sync.b64 {{%rd[0-9]+\|%p[0-9]+}}, [[VALUE]], [[MASK]];
+  %r1 = call {i64, i1} @llvm.nvvm.match.all.sync.i64p(i32 %mask, i64 %value)
+  %v1 = extractvalue {i64, i1} %r1, 0
+  %p1 = extractvalue {i64, i1} %r1, 1
+
+  ; CHECK:  match.all.sync.b64 {{%rd[0-9]+\|%p[0-9]+}}, 1, [[MASK]];
+  %r2 = call {i64, i1} @llvm.nvvm.match.all.sync.i64p(i32 %mask, i64 1)
+  %v2 = extractvalue {i64, i1} %r2, 0
+  %p2 = 

[PATCH] D38191: [NVPTX] added match.{any, all}.sync instructions, intrinsics & builtins.

2017-09-26 Thread Artem Belevich via Phabricator via cfe-commits
tra closed this revision.
tra added a comment.

Landed with fix in r314223.


https://reviews.llvm.org/D38191



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38326: [CUDA] Work around conflicting function definitions in CUDA-9 headers.

2017-09-27 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added a subscriber: sanjoy.

https://reviews.llvm.org/D38326

Files:
  clang/lib/Headers/__clang_cuda_runtime_wrapper.h


Index: clang/lib/Headers/__clang_cuda_runtime_wrapper.h
===
--- clang/lib/Headers/__clang_cuda_runtime_wrapper.h
+++ clang/lib/Headers/__clang_cuda_runtime_wrapper.h
@@ -173,7 +173,18 @@
 // __device__.
 #pragma push_macro("__forceinline__")
 #define __forceinline__ __device__ __inline__ __attribute__((always_inline))
+
+#pragma push_macro("__float2half_rn")
+#if CUDA_VERSION >= 9000
+// CUDA-9 has conflicting prototypes for __float2half_rn(float f) in
+// cuda_fp16.h[pp] and device_functions.hpp. We need to get the one in
+// device_functions.hpp out of the way.
+#define __float2half_rn  __float2half_rn_disabled
+#endif
+
 #include "device_functions.hpp"
+#pragma pop_macro("__float2half_rn")
+
 
 // math_function.hpp uses the __USE_FAST_MATH__ macro to determine whether we
 // get the slow-but-accurate or fast-but-inaccurate versions of functions like


Index: clang/lib/Headers/__clang_cuda_runtime_wrapper.h
===
--- clang/lib/Headers/__clang_cuda_runtime_wrapper.h
+++ clang/lib/Headers/__clang_cuda_runtime_wrapper.h
@@ -173,7 +173,18 @@
 // __device__.
 #pragma push_macro("__forceinline__")
 #define __forceinline__ __device__ __inline__ __attribute__((always_inline))
+
+#pragma push_macro("__float2half_rn")
+#if CUDA_VERSION >= 9000
+// CUDA-9 has conflicting prototypes for __float2half_rn(float f) in
+// cuda_fp16.h[pp] and device_functions.hpp. We need to get the one in
+// device_functions.hpp out of the way.
+#define __float2half_rn  __float2half_rn_disabled
+#endif
+
 #include "device_functions.hpp"
+#pragma pop_macro("__float2half_rn")
+
 
 // math_function.hpp uses the __USE_FAST_MATH__ macro to determine whether we
 // get the slow-but-accurate or fast-but-inaccurate versions of functions like
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38326: [CUDA] Work around conflicting function definitions in CUDA-9 headers.

2017-09-27 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rL314334: [CUDA] Work around conflicting function definitions 
in CUDA-9 headers. (authored by tra).

Changed prior to commit:
  https://reviews.llvm.org/D38326?vs=116856&id=116858#toc

Repository:
  rL LLVM

https://reviews.llvm.org/D38326

Files:
  cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h


Index: cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h
===
--- cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h
+++ cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h
@@ -173,7 +173,18 @@
 // __device__.
 #pragma push_macro("__forceinline__")
 #define __forceinline__ __device__ __inline__ __attribute__((always_inline))
+
+#pragma push_macro("__float2half_rn")
+#if CUDA_VERSION >= 9000
+// CUDA-9 has conflicting prototypes for __float2half_rn(float f) in
+// cuda_fp16.h[pp] and device_functions.hpp. We need to get the one in
+// device_functions.hpp out of the way.
+#define __float2half_rn  __float2half_rn_disabled
+#endif
+
 #include "device_functions.hpp"
+#pragma pop_macro("__float2half_rn")
+
 
 // math_function.hpp uses the __USE_FAST_MATH__ macro to determine whether we
 // get the slow-but-accurate or fast-but-inaccurate versions of functions like


Index: cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h
===
--- cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h
+++ cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h
@@ -173,7 +173,18 @@
 // __device__.
 #pragma push_macro("__forceinline__")
 #define __forceinline__ __device__ __inline__ __attribute__((always_inline))
+
+#pragma push_macro("__float2half_rn")
+#if CUDA_VERSION >= 9000
+// CUDA-9 has conflicting prototypes for __float2half_rn(float f) in
+// cuda_fp16.h[pp] and device_functions.hpp. We need to get the one in
+// device_functions.hpp out of the way.
+#define __float2half_rn  __float2half_rn_disabled
+#endif
+
 #include "device_functions.hpp"
+#pragma pop_macro("__float2half_rn")
+
 
 // math_function.hpp uses the __USE_FAST_MATH__ macro to determine whether we
 // get the slow-but-accurate or fast-but-inaccurate versions of functions like
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38742: [CUDA] Added __hmma_m16n16k16_* builtins to support mma instructions in sm_70

2017-10-10 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
Herald added subscribers: sanjoy, jholewinski.

https://reviews.llvm.org/D38742

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/test/CodeGen/builtins-nvptx-sm_70.cu

Index: clang/test/CodeGen/builtins-nvptx-sm_70.cu
===
--- /dev/null
+++ clang/test/CodeGen/builtins-nvptx-sm_70.cu
@@ -0,0 +1,166 @@
+// RUN: %clang_cc1 -triple nvptx64-unknown-unknown -target-cpu sm_70 \
+// RUN:-fcuda-is-device -target-feature +ptx60 \
+// RUN:-S -emit-llvm -o - -x cuda %s \
+// RUN:   | FileCheck -check-prefix=CHECK %s
+// RUN: %clang_cc1 -triple nvptx-unknown-unknown -target-cpu sm_60 \
+// RUN:   -fcuda-is-device -S -o /dev/null -x cuda -verify %s
+
+#if !defined(CUDA_VERSION)
+#define __device__ __attribute__((device))
+#define __global__ __attribute__((global))
+#define __shared__ __attribute__((shared))
+#define __constant__ __attribute__((constant))
+
+typedef unsigned long long uint64_t;
+#endif
+// We have to keep all builtins that depend on particular target feature in the
+// same function, because the codegen will stop after the very first function
+// that encounters an error, so -verify will not be able to find errors in
+// subsequent functions.
+
+// CHECK-LABEL: nvvm_wmma
+__device__ void nvvm_wmma(int *src, int *dst,
+  float *fsrc, float *fdst,
+  int ldm) {
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.a.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_a' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_a(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.a.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_a' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_a(dst, src+1, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.b.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_b' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_b(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.b.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_b' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_b(dst, src+2, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f16(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f16(dst, src, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.row.m16n16k16.stride.f32
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f32' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f32(fdst, fsrc, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.col.m16n16k16.stride.f32
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f32' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f32(fdst, fsrc, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.store.d.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_st_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_st_c_f16(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.store.d.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_st_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_st_c_f16(dst, src, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.store.d.sync.row.m16n16k16.stride.f32
+  // expected-error@+1 {{'__hmma_m16n16k16_st_c_f32' needs target feature ptx60}}
+  __hmma_m16n16k16_st_c_f32(fdst, fsrc, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.store.d.sync.col.m16n16k16.stride.f32
+  // expected-error@+1 {{'__hmma_m16n16k16_st_c_f32' needs target feature ptx60}}
+  __hmma_m16n16k16_st_c_f32(fdst, fsrc, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.mma.sync.row.row.m16n16k16.f16.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_mma_f16f16' needs target feature ptx60}}
+  __hmma_m16n16k16_mma_f16f16(dst, src, src, src, 0, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.mma.sync.row.row.m16n16k16.f16.f16.satfinite
+  // expected-error@+1 {{'__hmma_m16n16k16_mma_f16f16' needs target feature ptx60}}
+  __hmma_m16n16k16_mma_f16f16(dst, src, src, src, 0, 1);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.mma.sync.row.col.m16n16k16.f16.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_mma_f16f16' needs target feature ptx60}}
+  __hmma_m16n16k16_mma_f16f16(dst, src, src, src, 1, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.mma.sync.row.col.m16n16k16.f16.f16.satfinite
+  // expected-error@+1 {{'__hmma_m16n16k16_mma_f16f16' needs target feature ptx60}}
+  __hmma_m16n16k16_mma_f16f16(dst, src, src, src, 1, 1);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.mma.sync.col.row.m16n16k16.f16.f16
+  // expected-error@+1 {{'__hmma

[PATCH] D38742: [CUDA] Added __hmma_m16n16k16_* builtins to support mma instructions in sm_70

2017-10-11 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 118636.
tra marked 6 inline comments as done.
tra added a comment.

Addressed Justin's comments.


https://reviews.llvm.org/D38742

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/test/CodeGen/builtins-nvptx-sm_70.cu

Index: clang/test/CodeGen/builtins-nvptx-sm_70.cu
===
--- /dev/null
+++ clang/test/CodeGen/builtins-nvptx-sm_70.cu
@@ -0,0 +1,166 @@
+// RUN: %clang_cc1 -triple nvptx64-unknown-unknown -target-cpu sm_70 \
+// RUN:-fcuda-is-device -target-feature +ptx60 \
+// RUN:-S -emit-llvm -o - -x cuda %s \
+// RUN:   | FileCheck -check-prefix=CHECK %s
+// RUN: %clang_cc1 -triple nvptx-unknown-unknown -target-cpu sm_60 \
+// RUN:   -fcuda-is-device -S -o /dev/null -x cuda -verify %s
+
+#if !defined(CUDA_VERSION)
+#define __device__ __attribute__((device))
+#define __global__ __attribute__((global))
+#define __shared__ __attribute__((shared))
+#define __constant__ __attribute__((constant))
+
+typedef unsigned long long uint64_t;
+#endif
+// We have to keep all builtins that depend on particular target feature in the
+// same function, because the codegen will stop after the very first function
+// that encounters an error, so -verify will not be able to find errors in
+// subsequent functions.
+
+// CHECK-LABEL: nvvm_wmma
+__device__ void nvvm_wmma(int *src, int *dst,
+  float *fsrc, float *fdst,
+  int ldm) {
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.a.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_a' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_a(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.a.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_a' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_a(dst, src+1, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.b.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_b' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_b(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.b.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_b' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_b(dst, src+2, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f16(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f16(dst, src, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.row.m16n16k16.stride.f32
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f32' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f32(fdst, fsrc, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.col.m16n16k16.stride.f32
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f32' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f32(fdst, fsrc, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.store.d.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_st_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_st_c_f16(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.store.d.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_st_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_st_c_f16(dst, src, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.store.d.sync.row.m16n16k16.stride.f32
+  // expected-error@+1 {{'__hmma_m16n16k16_st_c_f32' needs target feature ptx60}}
+  __hmma_m16n16k16_st_c_f32(fdst, fsrc, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.store.d.sync.col.m16n16k16.stride.f32
+  // expected-error@+1 {{'__hmma_m16n16k16_st_c_f32' needs target feature ptx60}}
+  __hmma_m16n16k16_st_c_f32(fdst, fsrc, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.mma.sync.row.row.m16n16k16.f16.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_mma_f16f16' needs target feature ptx60}}
+  __hmma_m16n16k16_mma_f16f16(dst, src, src, src, 0, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.mma.sync.row.row.m16n16k16.f16.f16.satfinite
+  // expected-error@+1 {{'__hmma_m16n16k16_mma_f16f16' needs target feature ptx60}}
+  __hmma_m16n16k16_mma_f16f16(dst, src, src, src, 0, 1);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.mma.sync.row.col.m16n16k16.f16.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_mma_f16f16' needs target feature ptx60}}
+  __hmma_m16n16k16_mma_f16f16(dst, src, src, src, 1, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.mma.sync.row.col.m16n16k16.f16.f16.satfinite
+  // expected-error@+1 {{'__hmma_m16n16k16_mma_f16f16' needs target feature ptx60}}
+  __hmma_m16n16k16_mma_f16f16(dst, src, src, src, 1, 1);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.mma.sync.c

[PATCH] D38742: [CUDA] Added __hmma_m16n16k16_* builtins to support mma instructions in sm_70

2017-10-11 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/lib/CodeGen/CGBuiltin.cpp:9726
+  case NVPTX::BI__hmma_m16n16k16_ld_c_f16:
+case NVPTX::BI__hmma_m16n16k16_ld_c_f32:{
+Address Dst = EmitPointerWithAlignment(E->getArg(0));

jlebar wrote:
> weird indentation?
My emacs and clang-format keep fighting case indentation... Fixed.



Comment at: clang/lib/CodeGen/CGBuiltin.cpp:9733
+  return nullptr;
+bool isColMajor = isColMajorArg.getZExtValue();
+unsigned IID;

jlebar wrote:
> Urg, this isn't a bool?  Do we want it to be?
There are no explicit declarations for these builtins in the CUDA headers. Callers 
of these builtins pass 0/1, and the corresponding intrinsic described in the 
[[ http://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#nvvm-intrin-warp-level-matrix-ld | NVVM-IR spec ]] 
shows the argument type as i32, so I've made the type an integer in clang. 
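A minimal sketch of how the layout argument gets consumed on the clang side (the argument index and the constant-evaluation call are assumptions for illustration; only the last two lines appear verbatim in the patch):

```
// The layout operand must fold to an integer constant: 0 = row-major,
// 1 = column-major, mirroring the i32 operand of the NVVM intrinsic.
llvm::APSInt isColMajorArg;
if (!E->getArg(3)->isIntegerConstantExpr(isColMajorArg, getContext()))
  return nullptr;
bool isColMajor = isColMajorArg.getZExtValue();
```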





Comment at: clang/lib/CodeGen/CGBuiltin.cpp:9762
+//auto EltTy = cast(Src->getType())->getElementType();
+// Returned are 8 16x2 elements.
+for (unsigned i = 0; i < NumResults; ++i) {

jlebar wrote:
> s/8/NumElements/?
> s/16/f16/?
> 
> Maybe it would be better to write it as "Return value has type [[f16 x 2] x 
> NumResults]."?
That was part of the leftover block. The particular types are irrelevant here. All 
we care about is storing whatever the intrinsic call returned (4 or 8 elements of 
v2f16 or float) in the destination array (which is int[]). 
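A simplified sketch of that store loop (variable names follow the patch; `Result` and the exact GEP/alignment handling are approximations, not the committed code):

```
// Copy each element of the intrinsic's aggregate result into the caller's
// buffer, bitcasting to the buffer's element type along the way.
llvm::Type *EltTy = Dst.getElementType();
for (unsigned i = 0; i < NumResults; ++i) {
  llvm::Value *V =
      Builder.CreateBitCast(Builder.CreateExtractValue(Result, i), EltTy);
  Builder.CreateDefaultAlignedStore(
      V, Builder.CreateGEP(EltTy, Dst.getPointer(),
                           llvm::ConstantInt::get(Int32Ty, i)));
}
```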



Comment at: clang/lib/CodeGen/CGBuiltin.cpp:9802
+llvm::Type *ParamType = Intrinsic->getFunctionType()->getParamType(1);
+SmallVector Values;
+Values.push_back(Builder.CreatePointerCast(Dst, VoidPtrTy));

jlebar wrote:
> Nit, we know that there won't ever be more than 8 elements...
We have two extra arguments -- destination buffer and stride.
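Roughly, the argument list for the store/mma builtins is assembled like this (sketch only; the SmallVector inline size and the `Ldm` name are assumptions):

```
SmallVector<llvm::Value *, 10> Values; // dst + up to 8 fragments + stride
Values.push_back(Builder.CreatePointerCast(Dst, VoidPtrTy));
for (unsigned i = 0; i < NumResults; ++i) {
  // Load each fragment from the source buffer and cast it to the type the
  // intrinsic expects for its value operands.
  llvm::Value *Frag = Builder.CreateDefaultAlignedLoad(
      Builder.CreateGEP(Src.getElementType(), Src.getPointer(),
                        llvm::ConstantInt::get(Int32Ty, i)));
  Values.push_back(Builder.CreateBitCast(Frag, ParamType));
}
Values.push_back(Ldm); // stride
Builder.CreateCall(Intrinsic, Values);
```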


https://reviews.llvm.org/D38742



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38742: [CUDA] Added __hmma_m16n16k16_* builtins to support mma instructions in sm_70

2017-10-12 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rL315624: [CUDA] Added __hmma_m16n16k16_* builtins to support 
mma instructions on sm_70 (authored by tra).

Changed prior to commit:
  https://reviews.llvm.org/D38742?vs=118636&id=118848#toc

Repository:
  rL LLVM

https://reviews.llvm.org/D38742

Files:
  cfe/trunk/include/clang/Basic/BuiltinsNVPTX.def
  cfe/trunk/lib/CodeGen/CGBuiltin.cpp
  cfe/trunk/test/CodeGen/builtins-nvptx-sm_70.cu

Index: cfe/trunk/include/clang/Basic/BuiltinsNVPTX.def
===
--- cfe/trunk/include/clang/Basic/BuiltinsNVPTX.def
+++ cfe/trunk/include/clang/Basic/BuiltinsNVPTX.def
@@ -688,5 +688,18 @@
 BUILTIN(__nvvm_ldg_f4, "E4fE4fC*", "")
 BUILTIN(__nvvm_ldg_d2, "E2dE2dC*", "")
 
+// Builtins to support WMMA instructions on sm_70
+TARGET_BUILTIN(__hmma_m16n16k16_ld_a, "vi*iC*UiIi", "", "ptx60")
+TARGET_BUILTIN(__hmma_m16n16k16_ld_b, "vi*iC*UiIi", "", "ptx60")
+TARGET_BUILTIN(__hmma_m16n16k16_ld_c_f16, "vi*iC*UiIi", "", "ptx60")
+TARGET_BUILTIN(__hmma_m16n16k16_ld_c_f32, "vf*fC*UiIi", "", "ptx60")
+TARGET_BUILTIN(__hmma_m16n16k16_st_c_f16, "vi*i*UiIi", "", "ptx60")
+TARGET_BUILTIN(__hmma_m16n16k16_st_c_f32, "vf*f*UiIi", "", "ptx60")
+
+TARGET_BUILTIN(__hmma_m16n16k16_mma_f16f16, "vi*iC*iC*iC*IiIi", "", "ptx60")
+TARGET_BUILTIN(__hmma_m16n16k16_mma_f32f16, "vf*iC*iC*iC*IiIi", "", "ptx60")
+TARGET_BUILTIN(__hmma_m16n16k16_mma_f32f32, "vf*iC*iC*fC*IiIi", "", "ptx60")
+TARGET_BUILTIN(__hmma_m16n16k16_mma_f16f32, "vi*iC*iC*fC*IiIi", "", "ptx60")
+
 #undef BUILTIN
 #undef TARGET_BUILTIN
Index: cfe/trunk/test/CodeGen/builtins-nvptx-sm_70.cu
===
--- cfe/trunk/test/CodeGen/builtins-nvptx-sm_70.cu
+++ cfe/trunk/test/CodeGen/builtins-nvptx-sm_70.cu
@@ -0,0 +1,166 @@
+// RUN: %clang_cc1 -triple nvptx64-unknown-unknown -target-cpu sm_70 \
+// RUN:-fcuda-is-device -target-feature +ptx60 \
+// RUN:-S -emit-llvm -o - -x cuda %s \
+// RUN:   | FileCheck -check-prefix=CHECK %s
+// RUN: %clang_cc1 -triple nvptx-unknown-unknown -target-cpu sm_60 \
+// RUN:   -fcuda-is-device -S -o /dev/null -x cuda -verify %s
+
+#if !defined(CUDA_VERSION)
+#define __device__ __attribute__((device))
+#define __global__ __attribute__((global))
+#define __shared__ __attribute__((shared))
+#define __constant__ __attribute__((constant))
+
+typedef unsigned long long uint64_t;
+#endif
+// We have to keep all builtins that depend on particular target feature in the
+// same function, because the codegen will stop after the very first function
+// that encounters an error, so -verify will not be able to find errors in
+// subsequent functions.
+
+// CHECK-LABEL: nvvm_wmma
+__device__ void nvvm_wmma(int *src, int *dst,
+  float *fsrc, float *fdst,
+  int ldm) {
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.a.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_a' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_a(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.a.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_a' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_a(dst, src+1, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.b.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_b' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_b(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.b.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_b' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_b(dst, src+2, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f16(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f16(dst, src, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.row.m16n16k16.stride.f32
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f32' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f32(fdst, fsrc, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.load.c.sync.col.m16n16k16.stride.f32
+  // expected-error@+1 {{'__hmma_m16n16k16_ld_c_f32' needs target feature ptx60}}
+  __hmma_m16n16k16_ld_c_f32(fdst, fsrc, ldm, 1);
+
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.store.d.sync.row.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_st_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_st_c_f16(dst, src, ldm, 0);
+  // CHECK: call {{.*}} @llvm.nvvm.wmma.store.d.sync.col.m16n16k16.stride.f16
+  // expected-error@+1 {{'__hmma_m16n16k16_st_c_f16' needs target feature ptx60}}
+  __hmma_m16n16k16_st_c_f16(dst, src, ldm, 1)

[PATCH] D38883: [CMake][OpenMP] Customize default offloading arch

2017-10-13 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/Driver/ToolChains/Cuda.cpp:170-182
-// This code prevents IsValid from being set when
-// no libdevice has been found.
-bool allEmpty = true;
-std::string LibDeviceFile;
-for (auto key : LibDeviceMap.keys()) {
-  LibDeviceFile = LibDeviceMap.lookup(key);
-  if (!LibDeviceFile.empty())

I'd keep this code. It appears to serve a useful purpose, as it requires the CUDA 
installation to have at least some libdevice library in it.  It gives us a 
chance to find a valid installation, instead of failing some time later when we 
ask for a libdevice file and fail because there are none.



Comment at: lib/Driver/ToolChains/Cuda.cpp:556
+  DAL->AddJoinedArg(nullptr, Opts.getOption(options::OPT_march_EQ),
+CLANG_OPENMP_NVPTX_DEFAULT_ARCH);
 }

This sets the default GPU arch for *everyone* based on OpenMP requirements. Perhaps 
this should be predicated on this being an OpenMP compilation.

IMO, to avoid unnecessary surprises, the default for CUDA compilation should 
follow the defaults of nvcc. sm_30 becomes the default only in CUDA-9.



https://reviews.llvm.org/D38883



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38883: [CMake][OpenMP] Customize default offloading arch

2017-10-13 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/Driver/ToolChains/Cuda.cpp:170-182
-// This code prevents IsValid from being set when
-// no libdevice has been found.
-bool allEmpty = true;
-std::string LibDeviceFile;
-for (auto key : LibDeviceMap.keys()) {
-  LibDeviceFile = LibDeviceMap.lookup(key);
-  if (!LibDeviceFile.empty())

Hahnfeld wrote:
> tra wrote:
> > I'd keep this code. It appears to serve useful purpose as it requires CUDA 
> > installation to have at least some libdevice library in it.  It gives us a 
> > change to find a valid installation, instead of ailing some time later when 
> > we ask for a libdevice file and fail because there are none.
> We had some internal discussions about this after I submitted the patch here.
> 
> The main question is: Do we want to support CUDA installations without 
> libdevice and are there use cases for that? I'd say that the user should be 
> able to use a toolchain without libdevice together with `-nocudalib`.
Sounds reasonable. How about keeping the code but putting it under 
`if(!hasArg(nocudalib))`?
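Roughly something like this (sketch only; the exact option id and the surrounding detection loop are assumptions):

```
// Only require libdevice to be present when the user hasn't opted out.
if (!Args.hasArg(options::OPT_nocudalib)) {
  bool AllEmpty = llvm::all_of(LibDeviceMap.keys(), [&](StringRef K) {
    return LibDeviceMap.lookup(K).empty();
  });
  if (AllEmpty)
    continue; // this candidate CUDA installation has no libdevice at all
}
```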




Comment at: lib/Driver/ToolChains/Cuda.cpp:556
+  DAL->AddJoinedArg(nullptr, Opts.getOption(options::OPT_march_EQ),
+CLANG_OPENMP_NVPTX_DEFAULT_ARCH);
 }

Hahnfeld wrote:
> tra wrote:
> > This sets default GPU arch for *everyone* based on OPENMP requirements. 
> > Perhaps this should be predicated on this being openmp compilation.
> > 
> > IMO to avoid unnecessary surprises, the default for CUDA compilation should 
> > follow defaults of nvcc. sm_30 becomes default only in CUDA-9.
> > 
> This branch is only executed for OpenMP, see above
OK. I've missed that 'if'.


https://reviews.llvm.org/D38883



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38883: [CMake][OpenMP] Customize default offloading arch

2017-10-13 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/Driver/ToolChains/Cuda.cpp:170-182
-// This code prevents IsValid from being set when
-// no libdevice has been found.
-bool allEmpty = true;
-std::string LibDeviceFile;
-for (auto key : LibDeviceMap.keys()) {
-  LibDeviceFile = LibDeviceMap.lookup(key);
-  if (!LibDeviceFile.empty())

gtbercea wrote:
> gtbercea wrote:
> > Hahnfeld wrote:
> > > tra wrote:
> > > > Hahnfeld wrote:
> > > > > tra wrote:
> > > > > > I'd keep this code. It appears to serve useful purpose as it 
> > > > > > requires CUDA installation to have at least some libdevice library 
> > > > > > in it.  It gives us a change to find a valid installation, instead 
> > > > > > of ailing some time later when we ask for a libdevice file and fail 
> > > > > > because there are none.
> > > > > We had some internal discussions about this after I submitted the 
> > > > > patch here.
> > > > > 
> > > > > The main question is: Do we want to support CUDA installations 
> > > > > without libdevice and are there use cases for that? I'd say that the 
> > > > > user should be able to use a toolchain without libdevice together 
> > > > > with `-nocudalib`.
> > > > Sounds reasonable. How about keeping the code but putting it under 
> > > > `if(!hasArg(nocudalib))`?
> > > > 
> > > Ok, I'll do that in a separate patch and keep the code here for now.
> > The problem with nocudalib is that if for example you write a test, which 
> > looks to verify some device facing feature that requires a libdevice to be 
> > found (so you don't want to use nocudalib), it will probably work on your 
> > machine which has the correct CUDA setup but fail on another machine which 
> > does not (which is where you want to use nocudalib). You can see the 
> > contradiction there.
> Just to be clear I am arguing for keeping this code :)
@gtbercea: I'm not sure I follow your example. If you're talking about clang 
tests, we do have a fake CUDA installation set up under test/Driver/Inputs, which 
removes the dependency on whatever CUDA you may or may not have installed on your 
machine. I also don't see a contradiction -- if you do need libdevice, there is 
no point in picking a broken CUDA installation which does not have any 
libdevice files. If you explicitly tell the compiler that you don't need libdevice, 
that would make CUDA w/o libdevice acceptable. With --cuda-path you do have a 
way to tell clang which installation you want it to use. What am I missing?




https://reviews.llvm.org/D38883



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38883: [CMake][OpenMP] Customize default offloading arch

2017-10-13 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/Driver/ToolChains/Cuda.cpp:170-182
-// This code prevents IsValid from being set when
-// no libdevice has been found.
-bool allEmpty = true;
-std::string LibDeviceFile;
-for (auto key : LibDeviceMap.keys()) {
-  LibDeviceFile = LibDeviceMap.lookup(key);
-  if (!LibDeviceFile.empty())

tra wrote:
> gtbercea wrote:
> > gtbercea wrote:
> > > Hahnfeld wrote:
> > > > tra wrote:
> > > > > Hahnfeld wrote:
> > > > > > tra wrote:
> > > > > > > I'd keep this code. It appears to serve useful purpose as it 
> > > > > > > requires CUDA installation to have at least some libdevice 
> > > > > > > library in it.  It gives us a change to find a valid 
> > > > > > > installation, instead of ailing some time later when we ask for a 
> > > > > > > libdevice file and fail because there are none.
> > > > > > We had some internal discussions about this after I submitted the 
> > > > > > patch here.
> > > > > > 
> > > > > > The main question is: Do we want to support CUDA installations 
> > > > > > without libdevice and are there use cases for that? I'd say that 
> > > > > > the user should be able to use a toolchain without libdevice 
> > > > > > together with `-nocudalib`.
> > > > > Sounds reasonable. How about keeping the code but putting it under 
> > > > > `if(!hasArg(nocudalib))`?
> > > > > 
> > > > Ok, I'll do that in a separate patch and keep the code here for now.
> > > The problem with nocudalib is that if for example you write a test, which 
> > > looks to verify some device facing feature that requires a libdevice to 
> > > be found (so you don't want to use nocudalib), it will probably work on 
> > > your machine which has the correct CUDA setup but fail on another machine 
> > > which does not (which is where you want to use nocudalib). You can see 
> > > the contradiction there.
> > Just to be clear I am arguing for keeping this code :)
> @gtbercea: I'm not sure I follow your example. If you're talking about clang 
> tests, we do have fake CUDA installation setup under test/Driver/Inputs which 
> removes dependency on whatever CUDA you may or may not have installed on your 
> machine. I also don't see a contradiction -- you you do need libdevice, it 
> makes no point picking a broken CUDA installation which does not have any 
> libdevice files. If you explicitly tell compiler that you don't need 
> libdevice, that would make CUDA w/o libdevice acceptable. With --cuda-path 
> you do have a way to tell clang which installation you want it to use. What 
> do I miss?
> 
> 
Ah, you were arguing with Hahnfeld@'s -nocudalib example. Then I guess we're in 
violent agreement.


https://reviews.llvm.org/D38883



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38883: [CMake][OpenMP] Customize default offloading arch

2017-10-13 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/Driver/ToolChains/Cuda.h:90
-  }
 };
 

gtbercea wrote:
> gtbercea wrote:
> > I would also like to keep the spirit of this code if not in this exact form 
> > at least something that performs the same functionality.
> @tra what's your opinion on this code? Should this stay, stay but modified to 
> be more robust or taken out completely?
There are currently no users for this. In general, I would rather not have 
magically-changing default GPU based on how broken your CUDA installation is. 
IMO it would be better to keep defaults static and fail if prerequisites are 
not met.


https://reviews.llvm.org/D38883



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38901: [CUDA] Require libdevice only if needed

2017-10-13 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

The change could use a test.


https://reviews.llvm.org/D38901



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38883: [CMake][OpenMP] Customize default offloading arch

2017-10-13 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/Driver/ToolChains/Cuda.h:90
-  }
 };
 

gtbercea wrote:
> gtbercea wrote:
> > tra wrote:
> > > gtbercea wrote:
> > > > gtbercea wrote:
> > > > > I would also like to keep the spirit of this code if not in this 
> > > > > exact form at least something that performs the same functionality.
> > > > @tra what's your opinion on this code? Should this stay, stay but 
> > > > modified to be more robust or taken out completely?
> > > There are currently no users for this. In general, I would rather not 
> > > have magically-changing default GPU based on how broken your CUDA 
> > > installation is. IMO it would be better to keep defaults static and fail 
> > > if prerequisites are not met.
> > I would have thought that it is up to the compiler to select, as default, 
> > the lowest viable compute capability. This is what this code aims to do 
> > (whether it actually does that's a separate issue :) ).
> > 
> The reason I added this code in the first place was to overcome the fact that 
> something like a default of sm_30 may work on the K40 but once you go to 
> newer Pascal, Volta GPUs then you need a new minimum compute capability that 
> is supported.
> Should this stay, stay but modified to be more robust or taken out completely?

I'd take it out, at least for now as you've removed the only user of that 
function.

In general, though, compilers tend to use conservative defaults, and for CUDA 
that would be the lowest GPU variant supported by the compiler. In the case of CUDA 
it's determined by the CUDA SDK version. Figuring out the lowest supported version via 
the libdevice mapping we've created is wrong. E.g. with this patch and -nocudalib 
you may end up using CUDA-9 without any libdevice and would return sm_20.

If/when we need to figure out minimum supported version, it should be based 
directly on the value returned by version().
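E.g., something along these lines (illustrative sketch; the helper name is made up, and the mapping just restates the nvcc defaults mentioned above):

```
// Conservative default arch derived from the detected CUDA SDK version,
// not from whichever libdevice files happen to be present.
CudaArch DefaultCudaArch() const {
  return Version >= CudaVersion::CUDA_90 ? CudaArch::SM_30 : CudaArch::SM_20;
}
```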


https://reviews.llvm.org/D38883



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38901: [CUDA] Require libdevice only if needed

2017-10-13 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

Looks good. Thank you.


https://reviews.llvm.org/D38901



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D38978: [OpenMP] Enable the lowering of implicitly shared variables in OpenMP GPU-offloaded target regions to the GPU shared memory

2017-10-16 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Please add tests for the cases where such local->shared conversion should and 
should not happen.
I would appreciate it if you could add details on what exactly your passes are 
supposed to move to shared memory.

Considering that device-side code tends to be heavily inlined, it may be 
prudent to add an option to control the total size of shared memory we allow to 
be used for this purpose.
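For instance, a hidden knob along these lines (the name and default are made up for illustration):

```
#include "llvm/Support/CommandLine.h"

// Hypothetical option: cap how much shared memory the pass may claim for
// lowering implicitly shared locals.
static llvm::cl::opt<unsigned> MaxSharedDataSize(
    "nvptx-max-shared-data-size", llvm::cl::Hidden, llvm::cl::init(16 * 1024),
    llvm::cl::desc("Upper bound (in bytes) on shared memory used for "
                   "implicitly shared locals"));
```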

In case your passes are not executed (or didn't move anything to shared 
memory), is there any impact on the generated PTX? I.e., can ptxas successfully 
optimize the unused shared memory away?

If the code intentionally wants to allocate something in local memory, would 
the allocation ever be moved to shared memory by your pass? If so, how would I 
prevent that?




Comment at: lib/Target/NVPTX/NVPTXAsmPrinter.cpp:1749
+O << "\t.reg .b32 \t%SHSP;\n";
+O << "\t.reg .b32 \t%SHSPL;\n";
+  }

Nit: the name should end with `S`, as the L in `SPL` stood for the 'local' address 
space, which then gets converted to the generic AS. In your case it will be in 
the shared space, hence S would be more appropriate.



Comment at: lib/Target/NVPTX/NVPTXAssignValidGlobalNames.cpp:68
 
+void NVPTXAssignValidGlobalNames::generateCleanName(Value &V) {
+  std::string ValidName;

The name cleanup changes in this file should probably be committed by 
themselves as they have nothing to do with the rest of the patch.



Comment at: lib/Target/NVPTX/NVPTXFunctionDataSharing.cpp:9
+//===--===//
+//
+//

Please add details about what the pass is supposed to do.


Repository:
  rL LLVM

https://reviews.llvm.org/D38978



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D39005: [OpenMP] Clean up variable and function names for NVPTX backend

2017-10-17 Thread Artem Belevich via Phabricator via cfe-commits
tra requested changes to this revision.
tra added a comment.

Justin is right. I completely forgot about this. :-/
Hal offered a possible solution: https://reviews.llvm.org/D17738#661115


Repository:
  rL LLVM

https://reviews.llvm.org/D39005



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47757: [Sema] Produce diagnostics when unavailable aligned allocation/deallocation functions are called

2018-08-23 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

I've confirmed that the patch does not break anything in our CUDA code, so it's 
good to go as far as CUDA is concerned.

I'll fix the exposed CUDA issue in a separate patch.


Repository:
  rC Clang

https://reviews.llvm.org/D47757



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

2018-08-24 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

Please keep an eye on the CUDA buildbot 
http://lab.llvm.org:8011/builders/clang-cuda-build.
It runs a fair amount of tests with libc++ and a handful of libstdc++ versions, and 
may serve as a canary if these changes break something.


https://reviews.llvm.org/D50845



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51434: [HIP] Add -amdgpu-internalize-symbols option to opt

2018-08-29 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Could you elaborate on what exactly the problem is that this patch fixes?
I don't see how internalizing the symbols connects to PLTs. My understanding is 
that PLTs are used to provide stubs for symbols to be resolved by the dynamic 
linker at runtime. AFAICT AMD does not use shared libs on the device side. What 
am I missing?


https://reviews.llvm.org/D51434



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51434: [HIP] Add -amdgpu-internalize-symbols option to opt

2018-08-29 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

I could not find anything about PLTs in the AMDGPU-ABI documentation, nor could 
I find anything relevant on Google.
I still have no idea why PLTs are required in this case. Without that info, the 
problem may as well be due to an unintended requirement for PLTs that this patch 
would hide.

I'm going to defer to someone more familiar with amdgpu to tell whether that's 
the right fix for the problem.


https://reviews.llvm.org/D51434



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51441: Add predefined macro __gnu_linux__ for proper aux-triple

2018-08-29 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/Frontend/InitPreprocessor.cpp:1126
 Builder.defineMacro("__linux__");
+if (AuxTriple.getEnvironment() == llvm::Triple::GNU)
+  Builder.defineMacro("__gnu_linux__");

AFAICT, we always define `__gnu_linux__` on Linux:
https://github.com/llvm-mirror/clang/blob/master/lib/Basic/Targets/OSTargets.h#L306

I think it should be the case here, too.
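I.e., roughly this (sketch of the suggestion, not the final patch):

```
// Define __gnu_linux__ for a Linux aux triple unconditionally, matching
// what OSTargets.h does for native Linux compilation.
Builder.defineMacro("__linux__");
Builder.defineMacro("__gnu_linux__");
```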


https://reviews.llvm.org/D51441



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51441: Add predefined macro __gnu_linux__ for proper aux-triple

2018-08-29 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

While we're here, perhaps `Builder.defineMacro("__linux__")` should be changed 
to `DefineStd("linux")` which defines `linux/__linux/__linux__`?


https://reviews.llvm.org/D51441



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51434: [HIP] Add -fvisibility hidden option to clang

2018-08-29 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/Driver/ToolChains/HIP.cpp:255
+ options::OPT_fvisibility_ms_compat)) {
+CC1Args.push_back("-fvisibility");
+CC1Args.push_back("hidden");

Nit: You could collapse multiple `push_back` calls into a single 
`append({...})`:
`CC1Args.append({"-fvisibility", "hidden"});`


https://reviews.llvm.org/D51434



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

FYI, this breaks our CUDA compilation. I haven't figured out what exactly is 
wrong yet. I may need to revert the patch if the fix is not obvious.


Repository:
  rL LLVM

https://reviews.llvm.org/D50845



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In our case the headers come from a relatively old glibc and the compiler errors out on 
this:

  /* This function is used in the `isfinite' macro.  */
  __MATH_INLINE int
  __NTH (__finite (double __x))
  {
return (__extension__
  (union { double __d; int __i[2]; }) {__d: __x}).__i[1]
 | 0x800fu) + 1) >> 31));
  }

expanded to this:

  extern __inline __attribute__ ((__always_inline__)) __attribute__ 
((__gnu_inline__)) int
   __finite (double __x) throw ()
  {
return (__extension__
 (union { double __d; int __i[2]; }) {__d: __x}).__i[1]
| 0x800fu) + 1) >> 31));
  }

The error:

  .../include/bits/mathinline.h:945:9: error: '(anonymous union at 
.../include/bits/mathinline.h:945:9)' cannot be defined in a type specifier
(union { double __d; int __i[2]; }) {__d: __x}).__i[1]
 ^
  .../include/bits/mathinline.h:945:55: error: member reference base type 
'void' is not a structure or union
(union { double __d; int __i[2]; }) {__d: __x}).__i[1]
   ^~~~

Also, whatever macros we generate do not prevent the headers from using x86 inline 
assembly. I see quite a bit of inline asm code in the preprocessed output. The headers 
are from glibc ~2.19.


Repository:
  rL LLVM

https://reviews.llvm.org/D50845



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In https://reviews.llvm.org/D50845#1219733, @Hahnfeld wrote:

> In https://reviews.llvm.org/D50845#1219726, @gtbercea wrote:
>
> > In general, it looks like this patch leads to some host macros having to be 
> > defined again for the auxiliary triple case. It is not clear to me how to 
> > exhaustively identify the missing macros, so far it's been just trial and 
> > error.
>
>
> Well, that's the point of this patch, isn't it? Again, the current approach 
> is to just define all macros which is definitely broken.


I would agree that it does not work for OpenMP, which relies on host headers 
being usable for device compilation.
It works OK for CUDA, as device code can co-exist with the host code.

Perhaps the patch should keep the existing behavior for CUDA and cherry-pick macros 
for OpenMP compilation only.


Repository:
  rL LLVM

https://reviews.llvm.org/D50845



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51501: [CUDA] Fix CUDA compilation broken by D50845

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
tra added a reviewer: Hahnfeld.
Herald added subscribers: bixia, jlebar, sanjoy.

This keeps the predefined macros for CUDA working as they did before and lets 
OpenMP control the set of macros it needs.


https://reviews.llvm.org/D51501

Files:
  clang/lib/Frontend/InitPreprocessor.cpp


Index: clang/lib/Frontend/InitPreprocessor.cpp
===
--- clang/lib/Frontend/InitPreprocessor.cpp
+++ clang/lib/Frontend/InitPreprocessor.cpp
@@ -1158,8 +1158,16 @@
 
   // Install things like __POWERPC__, __GNUC__, etc into the macro table.
   if (InitOpts.UsePredefines) {
+// CUDA and OpenMP handle preprocessing differently.  CUDA wants to see
+// identical preprocessed output (as close to it as possible), so it
+// provides full set of predefined macros for both sides of compilation.
+if (LangOpts.CUDA && PP.getAuxTargetInfo())
+  InitializePredefinedMacros(*PP.getAuxTargetInfo(), LangOpts, FEOpts,
+ Builder);
 InitializePredefinedMacros(PP.getTargetInfo(), LangOpts, FEOpts, Builder);
-if ((LangOpts.CUDA || LangOpts.OpenMPIsDevice) && PP.getAuxTargetInfo())
+// OpenMP relies on selectively picking predefined macros to work around the
+// bits of host includes it can't compile during device-side compilation.
+if (LangOpts.OpenMPIsDevice && PP.getAuxTargetInfo())
   InitializePredefinedAuxMacros(*PP.getAuxTargetInfo(), LangOpts, Builder);
 
 // Install definitions to make Objective-C++ ARC work well with various
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

I've sent out https://reviews.llvm.org/D51501. It unbreaks CUDA compilation and 
keeps OpenMP unchanged.


Repository:
  rL LLVM

https://reviews.llvm.org/D50845



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In https://reviews.llvm.org/D50845#1219819, @Hahnfeld wrote:

> Ok, the top preprocessor condition for that function is `#ifndef 
> __SSE2_MATH__` - the exact same macro that was part of the motivation. Can 
> you please test compiling a simple C file (including `math.h`) with 
> `-mno-sse`? My guess would be that this is broken as well.
>  If yes I'm fine with reverting because I need to teach Clang to allow 
> anonymous unions in type specifiers to make that weird system header work 
> with this patch.


It compiles fine. The code that causes the problem is also conditional on
`__NO_MATH_INLINES`, which is always defined for X86, so compilation only
breaks when we compile for NVPTX.

Still, the issue seems way too hairy for a one-line fix, so I'll proceed with
the revert if you don't beat me to it.


Repository:
  rL LLVM

https://reviews.llvm.org/D50845



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

That, and r340967 https://reviews.llvm.org/D51441. I'm running check-clang now
and will land the reverts shortly.


Repository:
  rL LLVM

https://reviews.llvm.org/D50845



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Reverted in https://reviews.llvm.org/rL341115


Repository:
  rL LLVM

https://reviews.llvm.org/D50845



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51441: Add predefined macro __gnu_linux__ for proper aux-triple

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Reverted in https://reviews.llvm.org/rL341115.


Repository:
  rC Clang

https://reviews.llvm.org/D51441



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Tests reverted in https://reviews.llvm.org/rL341118


Repository:
  rL LLVM

https://reviews.llvm.org/D50845



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51312: [OpenMP][NVPTX] Use appropriate _CALL_ELF macro when offloading

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Reverted in https://reviews.llvm.org/rL341115 & 
https://reviews.llvm.org/rL341118.


Repository:
  rC Clang

https://reviews.llvm.org/D51312



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51441: Add predefined macro __gnu_linux__ for proper aux-triple

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Tests reverted in https://reviews.llvm.org/rL341118.


Repository:
  rC Clang

https://reviews.llvm.org/D51441



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51507: Allow all supportable attributes to be used with #pragma clang attribute.

2018-08-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: test/Misc/pragma-attribute-supported-attributes-list.test:27-32
+// CHECK-NEXT: CUDAConstant (SubjectMatchRule_variable)
+// CHECK-NEXT: CUDADevice (SubjectMatchRule_function, 
SubjectMatchRule_variable)
+// CHECK-NEXT: CUDAGlobal (SubjectMatchRule_function)
+// CHECK-NEXT: CUDAHost (SubjectMatchRule_function)
+// CHECK-NEXT: CUDALaunchBounds (SubjectMatchRule_objc_method, 
SubjectMatchRule_hasType_functionType)
+// CHECK-NEXT: CUDAShared (SubjectMatchRule_variable)

I don't see much practical use of this pragma for CUDA, but I also don't have 
any specific objections.
LGTM.

Theoretically, we could use it to apply the `__host__ __device__` attributes to
some portable header-only library so we could use it on the device side. In
practice, though, there will usually be a few things that have to remain
host-only (anything involving file I/O, for example), and we would need to be
more selective in applying the attributes or have a way to remove them from a
subset of objects later on.
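
If we ever wanted to try that, the usage would presumably look something like
the sketch below (the header name is made up, and I haven't verified that the
pragma accepts the CUDA attributes spelled this way):

```
// Hypothetical sketch: annotate everything declared by a header-only library
// as __host__ __device__ without editing the library itself. One attribute
// per push, applied to functions only.
#pragma clang attribute push(__attribute__((host)), apply_to = function)
#pragma clang attribute push(__attribute__((device)), apply_to = function)
#include "portable_math_library.h" // made-up header-only library
#pragma clang attribute pop
#pragma clang attribute pop
```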







Repository:
  rC Clang

https://reviews.llvm.org/D51507



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51554: [CUDA][OPENMP][NVPTX]Improve logic of the debug info support.

2018-08-31 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

Nice. So, in effect, for optimized builds we'll generate only pre-DWARF line
info unless --cuda-noopt-device-debug is specified.
Will this deal with the warnings about the back-end being unable to handle
particular debug options?

On a side note, when DWARF is functional in NVPTX we need to seriously consider
per-GPU control for it. Enabling debug info blows up the cubin size (ptxas
apparently packs compressed PTX inside the *cubin*), and we run into ELF reloc
overflows in some TensorFlow builds if all GPU variants carry it.


Repository:
  rC Clang

https://reviews.llvm.org/D51554



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51501: [CUDA] Fix CUDA compilation broken by D50845

2018-09-04 Thread Artem Belevich via Phabricator via cfe-commits
tra abandoned this revision.
tra added a comment.

> Not needed anymore after the reverts in https://reviews.llvm.org/rC341115 and 
> https://reviews.llvm.org/rC341118, right?

Correct.


https://reviews.llvm.org/D51501



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51808: [CUDA] Ignore uncallable functions when we check for usual deallocators.

2018-09-07 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
tra added a reviewer: rsmith.
Herald added subscribers: bixia, jlebar, sanjoy.

Previously, clang considered function variants from both sides of the
compilation, and that sometimes resulted in picking the wrong deallocation
function.
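
Roughly the kind of case affected (an illustrative example of mine, not one of
the patch's tests):

```
// In device compilation, the usual-deallocator lookup used to consider the
// __host__ overload as well, so `delete` here could end up referring to the
// host-only operator delete instead of the sized __device__ one.
struct S {
  __host__ void operator delete(void *);                   // host-only
  __device__ void operator delete(void *, __SIZE_TYPE__);  // wanted on device
};
__device__ void f(S *s) { delete s; }
```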


https://reviews.llvm.org/D51808

Files:
  clang-tools-extra/clang-tidy/readability/IdentifierNamingCheck.cpp
  clang/include/clang/AST/DeclCXX.h
  clang/include/clang/Sema/Sema.h
  clang/lib/AST/DeclCXX.cpp
  clang/lib/Sema/SemaDeclCXX.cpp
  clang/lib/Sema/SemaExprCXX.cpp
  clang/test/CodeGenCUDA/usual-deallocators.cu
  clang/test/SemaCUDA/call-host-fn-from-device.cu
  clang/test/SemaCUDA/usual-deallocators.cu

Index: clang/test/SemaCUDA/usual-deallocators.cu
===
--- /dev/null
+++ clang/test/SemaCUDA/usual-deallocators.cu
@@ -0,0 +1,95 @@
+// RUN: %clang_cc1 %s --std=c++11 -triple nvptx-unknown-unknown -fcuda-is-device \
+// RUN:   -emit-llvm -o /dev/null -verify=device
+// RUN: %clang_cc1 %s --std=c++11 -triple nvptx-unknown-unknown \
+// RUN:   -emit-llvm -o /dev/null -verify=host
+// RUN: %clang_cc1 %s --std=c++17 -triple nvptx-unknown-unknown -fcuda-is-device \
+// RUN:   -emit-llvm -o /dev/null -verify=device
+// RUN: %clang_cc1 %s --std=c++17 -triple nvptx-unknown-unknown \
+// RUN:   -emit-llvm -o /dev/null -verify=host
+
+#include "Inputs/cuda.h"
+extern __host__ void host_fn();
+extern __device__ void dev_fn();
+extern __host__ __device__ void hd_fn();
+
+struct H1D1 {
+  __host__ void operator delete(void *) { host_fn(); };
+  __device__ void operator delete(void *) { dev_fn(); };
+};
+
+struct h1D1 {
+  __host__ void operator delete(void *) = delete;
+  // host-note@-1 {{'operator delete' has been explicitly marked deleted here}}
+  __device__ void operator delete(void *) { dev_fn(); };
+};
+
+struct H1d1 {
+  __host__ void operator delete(void *) { host_fn(); };
+  __device__ void operator delete(void *) = delete;
+  // device-note@-1 {{'operator delete' has been explicitly marked deleted here}}
+};
+
+struct H1D2 {
+  __host__ void operator delete(void *) { host_fn(); };
+  __device__ void operator delete(void *, __SIZE_TYPE__) { dev_fn(); };
+};
+
+struct H2D1 {
+  __host__ void operator delete(void *, __SIZE_TYPE__) { host_fn(); };
+  __device__ void operator delete(void *) { dev_fn(); };
+};
+
+struct H2D2 {
+  __host__ void operator delete(void *, __SIZE_TYPE__) { host_fn(); };
+  __device__ void operator delete(void *, __SIZE_TYPE__) { dev_fn(); };
+};
+
+struct H1D1D2 {
+  __host__ void operator delete(void *) { host_fn(); };
+  __device__ void operator delete(void *) { dev_fn(); };
+  __device__ void operator delete(void *, __SIZE_TYPE__) { dev_fn(); };
+};
+
+struct H1H2D1 {
+  __host__ void operator delete(void *) { host_fn(); };
+  __host__ void operator delete(void *, __SIZE_TYPE__) { host_fn(); };
+  __device__ void operator delete(void *) { dev_fn(); };
+};
+
+struct H1H2D2 {
+  __host__ void operator delete(void *) { host_fn(); };
+  __host__ void operator delete(void *, __SIZE_TYPE__) { host_fn(); };
+  __device__ void operator delete(void *, __SIZE_TYPE__) { dev_fn(); };
+};
+
+struct H1H2D1D2 {
+  __host__ void operator delete(void *) { host_fn(); };
+  __host__ void operator delete(void *, __SIZE_TYPE__) { host_fn(); };
+  __device__ void operator delete(void *) { dev_fn(); };
+  __device__ void operator delete(void *, __SIZE_TYPE__) { dev_fn(); };
+};
+
+
+template 
+__host__ __device__ void test_hd(void *p) {
+  T *t = (T *)p;
+  delete t;
+  // host-error@-1 {{attempt to use a deleted function}}
+  // device-error@-2 {{attempt to use a deleted function}}
+}
+
+__host__ __device__ void tests_hd(void *t) {
+  test_hd(t);
+  test_hd(t);
+  // host-note@-1 {{in instantiation of function template specialization 'test_hd' requested here}}
+  test_hd(t);
+  // device-note@-1 {{in instantiation of function template specialization 'test_hd' requested here}}
+  test_hd(t);
+  test_hd(t);
+  test_hd(t);
+  test_hd(t);
+  test_hd(t);
+  test_hd(t);
+  test_hd(t);
+  test_hd(t);
+}
Index: clang/test/SemaCUDA/call-host-fn-from-device.cu
===
--- clang/test/SemaCUDA/call-host-fn-from-device.cu
+++ clang/test/SemaCUDA/call-host-fn-from-device.cu
@@ -41,12 +41,12 @@
   operator Dummy() { return Dummy(); }
   // expected-note@-1 {{'operator Dummy' declared here}}
 
-  __host__ void operator delete(void*);
-  __device__ void operator delete(void*, size_t);
+  __host__ void operator delete(void *) { host_fn(); };
+  __device__ void operator delete(void*, __SIZE_TYPE__);
 };
 
 struct U {
-  __device__ void operator delete(void*, size_t) = delete;
+  __device__ void operator delete(void*, __SIZE_TYPE__) = delete;
   __host__ __device__ void operator delete(void*);
 };
 
Index: clang/test/CodeGenCUDA/usual-deallocators.cu
===
--- /dev/null
+++ clang/test

[PATCH] D51809: [CUDA][HIP] Fix assertion in LookupSpecialMember

2018-09-07 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

@jlebar Justin, can you take a look?


https://reviews.llvm.org/D51809



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D51808: [CUDA] Ignore uncallable functions when we check for usual deallocators.

2018-09-13 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

@rsmith ping.


https://reviews.llvm.org/D51808



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D49274: [CUDA] Provide integer SIMD functions for CUDA-9.2

2018-07-19 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 156386.
tra added a comment.

Fixed inline asm syntax.
Added a workaround for the bug in __vmaxs2() discovered during testing.

I've got a set of tests for these functions that I'll add to the test-suite
shortly. AFAICT this implementation matches NVIDIA's bit-for-bit.


https://reviews.llvm.org/D49274

Files:
  clang/lib/Headers/__clang_cuda_device_functions.h
  clang/lib/Headers/__clang_cuda_libdevice_declares.h

Index: clang/lib/Headers/__clang_cuda_libdevice_declares.h
===
--- clang/lib/Headers/__clang_cuda_libdevice_declares.h
+++ clang/lib/Headers/__clang_cuda_libdevice_declares.h
@@ -372,6 +372,7 @@
 __device__ unsigned int __nv_urhadd(unsigned int __a, unsigned int __b);
 __device__ unsigned int __nv_usad(unsigned int __a, unsigned int __b,
   unsigned int __c);
+#if CUDA_VERSION >= 9000 && CUDA_VERSION < 9020
 __device__ int __nv_vabs2(int __a);
 __device__ int __nv_vabs4(int __a);
 __device__ int __nv_vabsdiffs2(int __a, int __b);
@@ -454,12 +455,12 @@
 __device__ int __nv_vsubss4(int __a, int __b);
 __device__ int __nv_vsubus2(int __a, int __b);
 __device__ int __nv_vsubus4(int __a, int __b);
+#endif  // CUDA_VERSION
 __device__ double __nv_y0(double __a);
 __device__ float __nv_y0f(float __a);
 __device__ double __nv_y1(double __a);
 __device__ float __nv_y1f(float __a);
 __device__ float __nv_ynf(int __a, float __b);
 __device__ double __nv_yn(int __a, double __b);
-
 } // extern "C"
 #endif // __CLANG_CUDA_LIBDEVICE_DECLARES_H__
Index: clang/lib/Headers/__clang_cuda_device_functions.h
===
--- clang/lib/Headers/__clang_cuda_device_functions.h
+++ clang/lib/Headers/__clang_cuda_device_functions.h
@@ -803,6 +803,8 @@
unsigned int __c) {
   return __nv_usad(__a, __b, __c);
 }
+
+#if CUDA_VERSION >= 9000 && CUDA_VERSION < 9020
 __DEVICE__ unsigned int __vabs2(unsigned int __a) { return __nv_vabs2(__a); }
 __DEVICE__ unsigned int __vabs4(unsigned int __a) { return __nv_vabs4(__a); }
 __DEVICE__ unsigned int __vabsdiffs2(unsigned int __a, unsigned int __b) {
@@ -1041,6 +1043,431 @@
 __DEVICE__ unsigned int __vsubus4(unsigned int __a, unsigned int __b) {
   return __nv_vsubus4(__a, __b);
 }
+#else // CUDA_VERSION >= 9020
+// CUDA no longer provides inline assembly (or bitcode) implementation of these
+// functions, so we have to reimplement them. The implementation is naive and is
+// not optimized for performance.
+
+// Helper function to convert N-bit boolean subfields into all-0 or all-1.
+// E.g. __bool2mask(0x01000100,8) -> 0xff00ff00
+//      __bool2mask(0x00010000,16) -> 0xffff0000
+__DEVICE__ unsigned int __bool2mask(unsigned int __a, int shift) {
+  return (__a << shift) - __a;
+}
+__DEVICE__ unsigned int __vabs2(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff2.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabs4(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff4.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsdiffs2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff2.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+
+__DEVICE__ unsigned int __vabsdiffs4(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff4.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsdiffu2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff2.u32.u32.u32.sat %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsdiffu4(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff4.u32.u32.u32.sat %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsss2(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff2.s32.s32.s32.sat %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsss4(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff2.s32.s32.s32.sat %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vadd2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vadd2.u32.u32.u32 %0,%1,%2,%3;" : "=r"(r) : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vadd4(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vadd4.u32.u32.u32 %0,%1,%2,%3;" : "=r"(r) : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vaddss2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vadd2.s32.s32.s32.sat %0,%1,%2,%3;"
+  : "=r"(r)
+  :

[PATCH] D49274: [CUDA] Provide integer SIMD functions for CUDA-9.2

2018-07-19 Thread Artem Belevich via Phabricator via cfe-commits
tra updated this revision to Diff 156397.
tra added a comment.

Fixed the issues pointed out by bkramer@.
Apparently, .sat does not matter for the vabsdiff instruction with unsigned operands.
My tests were also missing __vabsssN.


https://reviews.llvm.org/D49274

Files:
  clang/lib/Headers/__clang_cuda_device_functions.h
  clang/lib/Headers/__clang_cuda_libdevice_declares.h

Index: clang/lib/Headers/__clang_cuda_libdevice_declares.h
===
--- clang/lib/Headers/__clang_cuda_libdevice_declares.h
+++ clang/lib/Headers/__clang_cuda_libdevice_declares.h
@@ -372,6 +372,7 @@
 __device__ unsigned int __nv_urhadd(unsigned int __a, unsigned int __b);
 __device__ unsigned int __nv_usad(unsigned int __a, unsigned int __b,
   unsigned int __c);
+#if CUDA_VERSION >= 9000 && CUDA_VERSION < 9020
 __device__ int __nv_vabs2(int __a);
 __device__ int __nv_vabs4(int __a);
 __device__ int __nv_vabsdiffs2(int __a, int __b);
@@ -454,12 +455,12 @@
 __device__ int __nv_vsubss4(int __a, int __b);
 __device__ int __nv_vsubus2(int __a, int __b);
 __device__ int __nv_vsubus4(int __a, int __b);
+#endif  // CUDA_VERSION
 __device__ double __nv_y0(double __a);
 __device__ float __nv_y0f(float __a);
 __device__ double __nv_y1(double __a);
 __device__ float __nv_y1f(float __a);
 __device__ float __nv_ynf(int __a, float __b);
 __device__ double __nv_yn(int __a, double __b);
-
 } // extern "C"
 #endif // __CLANG_CUDA_LIBDEVICE_DECLARES_H__
Index: clang/lib/Headers/__clang_cuda_device_functions.h
===
--- clang/lib/Headers/__clang_cuda_device_functions.h
+++ clang/lib/Headers/__clang_cuda_device_functions.h
@@ -803,6 +803,8 @@
unsigned int __c) {
   return __nv_usad(__a, __b, __c);
 }
+
+#if CUDA_VERSION >= 9000 && CUDA_VERSION < 9020
 __DEVICE__ unsigned int __vabs2(unsigned int __a) { return __nv_vabs2(__a); }
 __DEVICE__ unsigned int __vabs4(unsigned int __a) { return __nv_vabs4(__a); }
 __DEVICE__ unsigned int __vabsdiffs2(unsigned int __a, unsigned int __b) {
@@ -1041,6 +1043,431 @@
 __DEVICE__ unsigned int __vsubus4(unsigned int __a, unsigned int __b) {
   return __nv_vsubus4(__a, __b);
 }
+#else // CUDA_VERSION >= 9020
+// CUDA no longer provides inline assembly (or bitcode) implementation of these
+// functions, so we have to reimplement them. The implementation is naive and is
+// not optimized for performance.
+
+// Helper function to convert N-bit boolean subfields into all-0 or all-1.
+// E.g. __bool2mask(0x01000100,8) -> 0xff00ff00
+//      __bool2mask(0x00010000,16) -> 0xffff0000
+__DEVICE__ unsigned int __bool2mask(unsigned int __a, int shift) {
+  return (__a << shift) - __a;
+}
+__DEVICE__ unsigned int __vabs2(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff2.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabs4(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff4.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsdiffs2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff2.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+
+__DEVICE__ unsigned int __vabsdiffs4(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff4.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsdiffu2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff2.u32.u32.u32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsdiffu4(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff4.u32.u32.u32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsss2(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff2.s32.s32.s32.sat %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsss4(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff4.s32.s32.s32.sat %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vadd2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vadd2.u32.u32.u32 %0,%1,%2,%3;" : "=r"(r) : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vadd4(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vadd4.u32.u32.u32 %0,%1,%2,%3;" : "=r"(r) : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vaddss2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vadd2.s32.s32.s32.sat %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vaddss

[PATCH] D49274: [CUDA] Provide integer SIMD functions for CUDA-9.2

2018-07-19 Thread Artem Belevich via Phabricator via cfe-commits
tra marked 2 inline comments as done.
tra added a comment.

Ben, PTAL.




Comment at: clang/lib/Headers/__clang_cuda_device_functions.h:1080
+  unsigned int r;
+  asm("vabsdiff2.u32.u32.u32.sat %0,%1,%2,0;" : "=r"(r) : "r"(__a), "r"(__b));
+  return r;

bkramer wrote:
> Should this really saturate?
Hmm. My tests didn't catch this. I wonder if ptxas just ignores .sat here.
Yup. I've confirmed that the tests do run on this function and do trigger if I 
intentionally introduce an error.
In any case, I've removed the .sat as it should not be there.



Comment at: clang/lib/Headers/__clang_cuda_device_functions.h:1095
+  unsigned int r;
+  asm("vabsdiff2.s32.s32.s32.sat %0,%1,0,0;" : "=r"(r) : "r"(__a));
+  return r;

bkramer wrote:
> vabsdiff4?
Ah. I've missed __vabsssN in my tests. Fixed both the header and the tests.


https://reviews.llvm.org/D49274



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D49274: [CUDA] Provide integer SIMD functions for CUDA-9.2

2018-07-20 Thread Artem Belevich via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
tra marked 2 inline comments as done.
Closed by commit rC337587: [CUDA] Provide integer SIMD functions for CUDA-9.2 
(authored by tra, committed by ).

Changed prior to commit:
  https://reviews.llvm.org/D49274?vs=156397&id=156542#toc

Repository:
  rC Clang

https://reviews.llvm.org/D49274

Files:
  lib/Headers/__clang_cuda_device_functions.h
  lib/Headers/__clang_cuda_libdevice_declares.h

Index: lib/Headers/__clang_cuda_libdevice_declares.h
===
--- lib/Headers/__clang_cuda_libdevice_declares.h
+++ lib/Headers/__clang_cuda_libdevice_declares.h
@@ -372,6 +372,7 @@
 __device__ unsigned int __nv_urhadd(unsigned int __a, unsigned int __b);
 __device__ unsigned int __nv_usad(unsigned int __a, unsigned int __b,
   unsigned int __c);
+#if CUDA_VERSION >= 9000 && CUDA_VERSION < 9020
 __device__ int __nv_vabs2(int __a);
 __device__ int __nv_vabs4(int __a);
 __device__ int __nv_vabsdiffs2(int __a, int __b);
@@ -454,12 +455,12 @@
 __device__ int __nv_vsubss4(int __a, int __b);
 __device__ int __nv_vsubus2(int __a, int __b);
 __device__ int __nv_vsubus4(int __a, int __b);
+#endif  // CUDA_VERSION
 __device__ double __nv_y0(double __a);
 __device__ float __nv_y0f(float __a);
 __device__ double __nv_y1(double __a);
 __device__ float __nv_y1f(float __a);
 __device__ float __nv_ynf(int __a, float __b);
 __device__ double __nv_yn(int __a, double __b);
-
 } // extern "C"
 #endif // __CLANG_CUDA_LIBDEVICE_DECLARES_H__
Index: lib/Headers/__clang_cuda_device_functions.h
===
--- lib/Headers/__clang_cuda_device_functions.h
+++ lib/Headers/__clang_cuda_device_functions.h
@@ -803,6 +803,8 @@
unsigned int __c) {
   return __nv_usad(__a, __b, __c);
 }
+
+#if CUDA_VERSION >= 9000 && CUDA_VERSION < 9020
 __DEVICE__ unsigned int __vabs2(unsigned int __a) { return __nv_vabs2(__a); }
 __DEVICE__ unsigned int __vabs4(unsigned int __a) { return __nv_vabs4(__a); }
 __DEVICE__ unsigned int __vabsdiffs2(unsigned int __a, unsigned int __b) {
@@ -1041,6 +1043,431 @@
 __DEVICE__ unsigned int __vsubus4(unsigned int __a, unsigned int __b) {
   return __nv_vsubus4(__a, __b);
 }
+#else // CUDA_VERSION >= 9020
+// CUDA no longer provides inline assembly (or bitcode) implementation of these
+// functions, so we have to reimplement them. The implementation is naive and is
+// not optimized for performance.
+
+// Helper function to convert N-bit boolean subfields into all-0 or all-1.
+// E.g. __bool2mask(0x01000100,8) -> 0xff00ff00
+//      __bool2mask(0x00010000,16) -> 0xffff0000
+__DEVICE__ unsigned int __bool2mask(unsigned int __a, int shift) {
+  return (__a << shift) - __a;
+}
+__DEVICE__ unsigned int __vabs2(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff2.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabs4(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff4.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsdiffs2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff2.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+
+__DEVICE__ unsigned int __vabsdiffs4(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff4.s32.s32.s32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsdiffu2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff2.u32.u32.u32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsdiffu4(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vabsdiff4.u32.u32.u32 %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsss2(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff2.s32.s32.s32.sat %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vabsss4(unsigned int __a) {
+  unsigned int r;
+  asm("vabsdiff4.s32.s32.s32.sat %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(0), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vadd2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vadd2.u32.u32.u32 %0,%1,%2,%3;" : "=r"(r) : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vadd4(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vadd4.u32.u32.u32 %0,%1,%2,%3;" : "=r"(r) : "r"(__a), "r"(__b), "r"(0));
+  return r;
+}
+__DEVICE__ unsigned int __vaddss2(unsigned int __a, unsigned int __b) {
+  unsigned int r;
+  asm("vadd2.s32.s32.s32.sat %0,%1,%2,%3;"
+  : "=r"(r)
+  : "r"(__a), "r"(__b), "

[PATCH] D48287: [HIP] Support -fcuda-flush-denormals-to-zero for amdgcn

2018-07-20 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

Thank you. That should work.


https://reviews.llvm.org/D48287



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D47849: [OpenMP][Clang][NVPTX] Enable math functions called in an OpenMP NVPTX target device region to be resolved as device-native function calls

2018-07-20 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In https://reviews.llvm.org/D47849#1126925, @gtbercea wrote:

> I just stumbled upon a very interesting situation.
>
> I noticed that, for OpenMP, the use of device math functions happens as I 
> expected for -O0. For -O1 or higher math functions such as "sqrt" resolve to 
> llvm builtins/intrinsics:
>
>   call double @llvm.sqrt.f64(double %1)
>
>
> instead of the nvvm variant.


I believe we do have a pass that attempts to replace some nvvm intrinsics with
their llvm equivalents. It allows us to optimize the code better. My guess
would be that the replacement does not happen at -O0.

> The surprising part (at least to me) is that the same llvm intrinsic is used 
> when I use Clang to compile CUDA kernel code calling the "sqrt" function. I 
> would have expected that the NVVM variant would be called for CUDA code.

What we may end up generating for any given standard library call from the
device side depends on a number of factors and may vary.
Here's what typically happens:

- clang parses CUDA headers and pulls 'standard' C math functions and bits of 
C++ overloads. These usually call __something.
- CUDA versions up to 8.0 provided those __something() functions which 
*usually* called __nv_something() in libdevice.
- As of CUDA-9, the __something() functions became NVCC compiler builtins and
clang has to provide its own implementation -- __clang_cuda_device_functions.h.
This implementation may use whatever does the job. Any of
__builtin.../__nvvm.../__nv_... are fair game, as long as it works.
- CUDA wrapper headers in clang do some magic to make the math parts of the
standard C++ library work by providing functions that do the right thing.
Usually those forward to the C math functions, but that may not always be the
case.
- LLVM may update some __nvvm* intrinsics to their llvm equivalent.

In the end you may end up with somewhat different IR depending on the function 
and the CUDA version clang used.
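
To make that chain concrete, here is roughly what one link looks like, modeled
on the current wrapper header (simplified; the exact contents vary between CUDA
versions, and __DEVICE__ stands in for the header's own "static __device__
inline" macro):

```
#define __DEVICE__ static __device__ inline

// libdevice (declared in __clang_cuda_libdevice_declares.h) provides the
// bitcode implementation:
extern "C" __device__ int __nv_vabs2(int __a);

// __clang_cuda_device_functions.h supplies the __something() "builtin" that
// the CUDA headers expect and forwards it to libdevice:
__DEVICE__ unsigned int __vabs2(unsigned int __a) { return __nv_vabs2(__a); }
```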

> Is it ok for CUDA kernels to call llvm intrinsics instead of the device 
> specific math library functions?

It depends. We cannot lower all LLVM intrinsics. Generally, you can't use
intrinsics that are lowered to an external library call.

> If it's ok for CUDA can this be ok for OpenMP NVPTX too?
>  If not we probably need to fix it for both toolchains.

I don't have an answer for these. OpenMP seems to have somewhat different 
requirements compared to C++ which we assume for CUDA.

One thing you do need to consider, though, is that the wrapper headers are
rather unstable. Their goal is to provide glue between half-broken CUDA headers
and the user's code. They are not intended to provide any sort of stability to
anyone else. Every new CUDA version brings new and exciting changes to its
headers, which requires a fair amount of changes in the wrappers.

If all you need is C math functions, it *may* be OK, but perhaps there is a
better approach.
Why not compile a real math library to bitcode and avoid all this weirdness of
gluing together half-broken pieces of CUDA that are broken by design? Unlike
real CUDA compilation, you don't have the constraint that you have to match
NVCC 1:1. If you have your own device-side math library, you could use regular
math headers and link a real libm.bc instead of CUDA's libdevice. The rumors of
"high performance" functions in libdevice are somewhat exaggerated, IMO. If you
take a look at the IR in the libdevice of a recent CUDA version, you will see
that a lot of the functions just call their llvm counterparts. If it turns out
that in some cases llvm generates slower code than what nvidia provides, I'm
sure it will be possible to implement a reasonably fast replacement.


Repository:
  rC Clang

https://reviews.llvm.org/D47849



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D49763: [CUDA] Call atexit() for CUDA destructor early on.

2018-07-24 Thread Artem Belevich via Phabricator via cfe-commits
tra created this revision.
tra added reviewers: jlebar, timshen.
Herald added subscribers: bixia, sanjoy.

There's apparently a race between the fatbin destructors registered by us
and some internal calls registered by the CUDA runtime from cudaRegisterFatbin.
Moving fatbin de-registration to atexit() was not sufficient to avoid a crash
in the CUDA runtime on exit when the runtime was linked statically but the CUDA
kernel was launched from a shared library.

Moving the atexit() call to before we call cudaRegisterFatbin appears to work
with both statically and dynamically linked CUDA TUs.


https://reviews.llvm.org/D49763

Files:
  clang/lib/CodeGen/CGCUDANV.cpp


Index: clang/lib/CodeGen/CGCUDANV.cpp
===
--- clang/lib/CodeGen/CGCUDANV.cpp
+++ clang/lib/CodeGen/CGCUDANV.cpp
@@ -375,6 +375,19 @@
 
   CtorBuilder.SetInsertPoint(CtorEntryBB);
 
+  // Create destructor and register it with atexit() the way NVCC does it. Doing
+  // it during regular destructor phase worked in CUDA before 9.2 but results in
+  // double-free in 9.2.
+  if (llvm::Function *CleanupFn = makeModuleDtorFunction()) {
+// extern "C" int atexit(void (*f)(void));
+llvm::FunctionType *AtExitTy =
+llvm::FunctionType::get(IntTy, CleanupFn->getType(), false);
+llvm::Constant *AtExitFunc =
+CGM.CreateRuntimeFunction(AtExitTy, "atexit", llvm::AttributeList(),
+  /*Local=*/true);
+CtorBuilder.CreateCall(AtExitFunc, CleanupFn);
+  }
+
   const char *FatbinConstantName;
   const char *FatbinSectionName;
   const char *ModuleIDSectionName;
@@ -530,19 +543,6 @@
 CtorBuilder.CreateCall(RegisterLinkedBinaryFunc, Args);
   }
 
-  // Create destructor and register it with atexit() the way NVCC does it. Doing
-  // it during regular destructor phase worked in CUDA before 9.2 but results in
-  // double-free in 9.2.
-  if (llvm::Function *CleanupFn = makeModuleDtorFunction()) {
-// extern "C" int atexit(void (*f)(void));
-llvm::FunctionType *AtExitTy =
-llvm::FunctionType::get(IntTy, CleanupFn->getType(), false);
-llvm::Constant *AtExitFunc =
-CGM.CreateRuntimeFunction(AtExitTy, "atexit", llvm::AttributeList(),
-  /*Local=*/true);
-CtorBuilder.CreateCall(AtExitFunc, CleanupFn);
-  }
-
   CtorBuilder.CreateRetVoid();
   return ModuleCtorFunc;
 }
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D49763: [CUDA] Call atexit() for CUDA destructor early on.

2018-07-24 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In https://reviews.llvm.org/D49763#1174283, @joerg wrote:

> Can this ever end up in a shared library? If yes, please use the normal logic 
> for creating a global destructor. atexit is not very friendly to dlopen...


Yes, it can end up in a shared library. What would be the normal logic in this 
case?

We used to use a regular global destructor, but that had even worse issues.
Alas, NVIDIA provides no documentation on how the compiler-generated glue is
expected to interact with the CUDA runtime, so we need to guess what it wants.
The NVCC-generated glue makes a call to atexit(). If we use global destructors,
then by the time they are executed, nvidia's runtime has already been
deinitialized and our attempt to call it causes the crash.
Deregistering the fatbin from atexit() works better, but apparently we still
race with the runtime. Calling atexit() before we register the fatbin appears
to work for all combinations of {static/dynamic, kernel/runtime}.
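
For context, a rough sketch of the glue in question (heavily simplified; the
names and signatures are approximations, not the exact code clang generates):

```
extern "C" int atexit(void (*)(void));
extern "C" void **__cudaRegisterFatBinary(void *fatbinWrapper);
extern "C" void __cudaUnregisterFatBinary(void **handle);

static void **__gpubin_handle;

static void __cuda_module_dtor(void) {
  // Must run before the CUDA runtime shuts itself down.
  __cudaUnregisterFatBinary(__gpubin_handle);
}

static void __cuda_module_ctor(void) {
  // Register the cleanup with atexit() first, the way NVCC does it, ...
  atexit(__cuda_module_dtor);
  // ... then register the fatbin and the kernel stubs.
  __gpubin_handle = __cudaRegisterFatBinary(/*fatbin wrapper=*/nullptr);
  // per-kernel __cudaRegisterFunction() calls would follow here
}
```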


https://reviews.llvm.org/D49763



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D49763: [CUDA] Call atexit() for CUDA destructor early on.

2018-07-24 Thread Artem Belevich via Phabricator via cfe-commits
tra planned changes to this revision.
tra added a comment.

Ugh. Apparently moving this code up just disabled the module destructor. :-(
That explains why we no longer crash.


https://reviews.llvm.org/D49763



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D49931: [CUDA][HIP] Allow function-scope static const variable

2018-07-27 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

> This patch also allows function-scope static const variable without device 
> memory qualifier and emits it as a global variable in constant address space.
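
For reference, the pattern in question is roughly this (a made-up example, not
code from the patch):

```
__device__ int lookup(int i) {
  // No __device__/__constant__ qualifier on the local; with this change it is
  // emitted as a global variable in the constant address space.
  static const int table[4] = {1, 2, 4, 8};
  return table[i];
}
```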

What does NVCC do with local static const variables?


https://reviews.llvm.org/D49931



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D49931: [CUDA][HIP] Allow function-scope static const variable

2018-07-27 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Looks OK overall except for the huge `if` below.




Comment at: lib/Sema/SemaDecl.cpp:11923-11930
+  if (getLangOpts().CUDA &&
+      !(VD->hasAttr<CUDASharedAttr>() ||
+        (VD->getType().isConstQualified() &&
+         !VD->hasAttr<CUDADeviceAttr>() &&
+         !VD->hasAttr<CUDAConstantAttr>())) &&
       CUDADiagIfDeviceCode(VD->getLocation(),
                            diag::err_device_static_local_var)

This is rather convoluted. It would be somewhat more readable if we could split
the CUDADiagIfDeviceCode() call into its own if().

Or maybe use a lambda + early exit, or perhaps even a goto, to break down this
huge if:

```
[&]() {
  if (VD->hasAttr<CUDASharedAttr>()) return;
  if (VD->getType().isConstQualified() &&
      !(VD->hasAttr<CUDADeviceAttr>() || VD->hasAttr<CUDAConstantAttr>()))
    return;
  if (CUDADiagIfDeviceCode(VD->getLocation(),
                           diag::err_device_static_local_var)
      << CurrentCUDATarget())
    VD->setInvalidDecl();
}()
```


https://reviews.llvm.org/D49931



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D49763: [CUDA] Call atexit() for CUDA destructor early on.

2018-07-30 Thread Artem Belevich via Phabricator via cfe-commits
tra abandoned this revision.
tra added a comment.

It appears that the issue that originally prompted this change is due to a
suspected bug in glibc triggered by specific details of our internal build.


https://reviews.llvm.org/D49763



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D49148: [DEBUGINFO] Disable unsupported debug info options for NVPTX target.

2018-08-01 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

I wonder what's the right thing to do to silence the warnings. For instance,
we compile everything with -Werror, and the warnings result in build breaks.

The easy way out is to pass `-Wno-unsupported-target-opt`. It works, but it
does not really solve anything. It also may mask other potential problems.

Another alternative is to change the clang driver to filter out unsupported
options so they are not passed to cc1. That will also work, but it looks wrong,
because now we have two patches that effectively cancel each other for no
observable benefit.

A third option is to grow a better way to specify target-specific
sub-compilation options and then consider fancy debug flags to be attributable
to the host compilation only. Anything beyond the "safe" subset would have to
be specified explicitly. This also sounds awkward -- I don't really want to
replicate a bunch of options times the number of GPUs I'm compiling for. That
may be alleviated by providing a more coarse-grained way to group options. E.g.
we could say "these are the options for *all* non-host compilations, and here
are a few specifically for sm_XY". I think @echristo and I discussed something
like this a long time ago.

Any other ideas?


Repository:
  rC Clang

https://reviews.llvm.org/D49148



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D49148: [DEBUGINFO] Disable unsupported debug info options for NVPTX target.

2018-08-01 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

We normally do not need to deviate from the host options all that often. I 
would argue that keeping options identical is a reasonable default for most 
options.
For some options the driver may be able to derive a sensible value based on the 
host options. E.g. some options can be ignored. Some can be downgraded. Some 
can be replaced with a target-specific equivalent.
For others we must require the user to provide the value.

So, at the very least we must be able to put all options into one of the 
categories.

We also need to figure out what kind of syntax we'll use to specify
target-specific options. We currently have a `-Xarch...` hack in some
toolchains, but it's rather awkward to use in practice, as it does not give you
much control over where those options are placed on cc1's command line; the
options are also rather verbose and usually do not support options with
arguments. We could make -Xarch=XYZ a sticky option which would consider the
following options to apply only to a particular arch, with, possibly, a few
special arch names to specify `host`, `device`, `all` subcompilations.

It's also possible that I'm reinventing the wheel here. Are there existing 
precedents for command-line options with this kind of multi-consumer 
functionality?


Repository:
  rC Clang

https://reviews.llvm.org/D49148



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D43045: Add NVPTX Support to ValidCPUList (enabling march notes)

2018-02-08 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: test/Misc/target-invalid-cpu-note.c:38
+// NVPTX: note: valid target CPU values are: sm_20, sm_21, sm_30, sm_32, sm_35,
+// NVPTX-SAME: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72

Nit: Generally speaking this note is false. For any given version of CUDA,
some of the listed GPU variants will not be accepted. E.g. CUDA versions before
9.1 do not know anything about sm_72, but CUDA-9.1 does not support sm_20.


https://reviews.llvm.org/D43045



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D42581: [NVPTX] Emit debug info in DWARF-2 by default for Cuda devices.

2018-02-08 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

LGTM


Repository:
  rC Clang

https://reviews.llvm.org/D42581



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D43045: Add NVPTX Support to ValidCPUList (enabling march notes)

2018-02-08 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: test/Misc/target-invalid-cpu-note.c:38
+// NVPTX: note: valid target CPU values are: sm_20, sm_21, sm_30, sm_32, sm_35,
+// NVPTX-SAME: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72

erichkeane wrote:
> tra wrote:
> > Nit: Generally speaking this note is false. For any given version of CUDA, 
> > some of the listed GPU variants will not be accepted. E.g. CUDA versions 
> > before 9.1 do not know anything about sm_72, but CUDA-9.1 does not
> > support sm_20.
> Is there somewhere else that this is checked?  It seems that the 'setCPU' 
> function here checks against this same list.  
We have CheckCudaVersionSupportsArch() in lib/Driver/ToolChains/Cuda.cpp.
I'm not sure if the CUDA version is already known at the time we check
-target-cpu, though.

I'm OK with the full list for now if it's hard to get to the CUDA version.




https://reviews.llvm.org/D43045



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D42920: [CUDA] Fix test cuda-external-tools.cu

2018-02-09 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added inline comments.
This revision is now accepted and ready to land.



Comment at: test/Driver/cuda-external-tools.cu:11
+// RUN: | FileCheck -check-prefix CHECK -check-prefix ARCH64 \
+// RUN: -check-prefix SM20 -check-prefix OPT0 %s
 // RUN: %clang -### -target x86_64-linux-gnu -O1 -c %s 2>&1 \

Nit: I'd use --check-prefixes=CHECK,ARCH64,SM20,OPT0. Up to you.


Repository:
  rC Clang

https://reviews.llvm.org/D42920



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D42921: [CUDA] Add option to generate relocatable device code

2018-02-09 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added inline comments.
This revision is now accepted and ready to land.



Comment at: include/clang/Driver/Options.td:572
+  HelpText<"Generate relocatable device code, also known as separate 
compilation mode.">;
+def fno_cuda_rdc : Flag<["-"], "fno-cuda-rdc">;
 def dA : Flag<["-"], "dA">, Group;

Does the option show up in clang --help?
If it does, and if you plan to commit patches one at a time, we may want to
make it hidden until everything is in place.


Repository:
  rC Clang

https://reviews.llvm.org/D42921



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D42923: [CUDA] Allow external variables in separate compilation

2018-02-12 Thread Artem Belevich via Phabricator via cfe-commits
tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

LGTM.




Comment at: test/SemaCUDA/extern-shared.cu:4
+// These declarations are fine in separate compilation mode!
+// RUN: %clang_cc1 -fsyntax-only -fcuda-rdc -verify=rdc %s
+// RUN: %clang_cc1 -fsyntax-only -fcuda-is-device -fcuda-rdc -verify=rdc %s

Nit. `-verify=rdc` is somewhat confusing as there are no rdc prefixes in the
checks below. Perhaps something along the lines of
`-verify=there-should-be-no-errors` would be more descriptive.


Repository:
  rC Clang

https://reviews.llvm.org/D42923



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D42923: [CUDA] Allow external variables in separate compilation

2018-02-12 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: test/SemaCUDA/extern-shared.cu:4
+// These declarations are fine in separate compilation mode!
+// RUN: %clang_cc1 -fsyntax-only -fcuda-rdc -verify=rdc %s
+// RUN: %clang_cc1 -fsyntax-only -fcuda-is-device -fcuda-rdc -verify=rdc %s

Hahnfeld wrote:
> tra wrote:
> > Nit. `-verify=rdc` is somewhat confusing as there's no rdc prefixes in the 
> > checks below. Perhaps something along the lines of 
> > `-verify=there-should-be-no-errors`  would be more descriptive.
> There is: `rdc-no-diagnostics`.
> 
> But given that you missed it, maybe I should move the comment `declarations 
> are fine` between `RUN` lines and `no-diagnostics`? Don't know if that helps 
> much though...
Oh! I did miss it.  Never mind then.


Repository:
  rC Clang

https://reviews.llvm.org/D42923



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D42922: [CUDA] Register relocatable GPU binaries

2018-02-16 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/CodeGen/CGCUDANV.cpp:330-331
   // the GPU side.
   for (const std::string &GpuBinaryFileName :
CGM.getCodeGenOpts().CudaGpuBinaryFileNames) {
     llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> GpuBinaryOrErr =

Hahnfeld wrote:
> Can we actually have multiple GPU binaries here? If yes, how do I get there?
Yes. `clang --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_50...` will compile for 
sm_35 and sm_50 and then will pass the names of GPU-side objects to the host 
compilation via `-fcuda-include-gpubinary`.


https://reviews.llvm.org/D42922



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D42922: [CUDA] Register relocatable GPU binaries

2018-02-16 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/CodeGen/CGCUDANV.cpp:330-331
   // the GPU side.
   for (const std::string &GpuBinaryFileName :
CGM.getCodeGenOpts().CudaGpuBinaryFileNames) {
     llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> GpuBinaryOrErr =

Hahnfeld wrote:
> tra wrote:
> > Hahnfeld wrote:
> > > Can we actually have multiple GPU binaries here? If yes, how do I get 
> > > there?
> > Yes. `clang --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_50...` will compile 
> > for sm_35 and sm_50 and then will pass the names of GPU-side objects to the 
> > host compilation via `-fcuda-include-gpubinary`.
> I'm not sure if that's true anymore: I think they are now combined by 
> `fatbinary`. This seems to be confirmed by `test/Driver/cuda-options.cu`. If 
> that was the only use case, we may try to get rid of this possibility, let me 
> check this.
You are correct. All GPU binaries are in a single fatbin now.
That said, you could still pass an extra -fcuda-include-gpubinary to cc1
manually, but I see no practical reason to do it -- a single fatbin serves the
purpose better.

We should remove this loop and make CGM.getCodeGenOpts().CudaGpuBinaryFileNames 
a scalar.



https://reviews.llvm.org/D42922



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D43461: [CUDA] Include single GPU binary, NFCI.

2018-02-20 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: lib/Driver/ToolChains/Clang.cpp:4659
   if (IsCuda) {
-// Host-side cuda compilation receives device-side outputs as Inputs[1...].
-// Include them with -fcuda-include-gpubinary.
+// Host-side cuda compilation receives device-side outputs as Inputs[1].
+// Include the binary with -fcuda-include-gpubinary.

Nit: Passing multiple things as a single input may need some more details.
E.g. `...receives all device-side outputs in a single fatbin as Inputs[1]`



Comment at: lib/Frontend/CompilerInvocation.cpp:1044-1045
 
-  Opts.CudaGpuBinaryFileNames =
-  Args.getAllArgValues(OPT_fcuda_include_gpubinary);
+  Opts.CudaGpuBinaryFileName =
+  Args.getLastArgValue(OPT_fcuda_include_gpubinary);
 

If more than one GPU binary is passed, all but the last will be ignored.
IMO in this case we would want to either warn that some inputs were ignored or 
report an error that there is more than one GPU binary.


Repository:
  rC Clang

https://reviews.llvm.org/D43461



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


  1   2   3   4   5   6   7   8   9   10   >