[clang] cuda clang: Add support for CUDA surfaces (PR #132883)

2025-03-24 Thread Austin Schuh via cfe-commits

https://github.com/AustinSchuh created 
https://github.com/llvm/llvm-project/pull/132883

This adds support for all the surface read and write calls to clang. It extends 
the pattern used for textures to surfaces too.

I tested this by generating all the various permutations of the calls and 
argument types with a Python script, compiling them with both clang and nvcc, 
and comparing the generated PTX for equivalence.  They all agree, ignoring 
register allocation and some places where Clang picks different memory write 
instructions.  An example kernel is:

__global__ void testKernel(cudaSurfaceObject_t surfObj, int x, float2* result) {
  *result = surf1Dread(surfObj, x, cudaBoundaryModeZero);
}
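The testing approach described above can be sketched as a small generator script. This is a hypothetical reconstruction, not the actual script from the PR: the type list, function table, and kernel-naming scheme are illustrative assumptions, and the real script covered every surface read/write variant and vector type.

```python
import itertools

# Illustrative subsets; the real script covered all CUDA vector types
# and every surfXDread/surfXDwrite variant.
TYPES = ["char", "uchar2", "int4", "float", "float2", "float4"]
FUNCS = {"surf1Dread": "int x", "surf2Dread": "int x, int y"}
MODES = ["cudaBoundaryModeZero", "cudaBoundaryModeClamp", "cudaBoundaryModeTrap"]

def make_kernel(func, args, ty, mode):
    # Build one test kernel exercising a single (function, type, mode) combo.
    params = ", ".join(a.split()[-1] for a in args.split(", "))
    return (f"__global__ void k_{func}_{ty}_{mode}(cudaSurfaceObject_t s, "
            f"{args}, {ty} *out) {{\n"
            f"  *out = {func}<{ty}>(s, {params}, {mode});\n"
            f"}}\n")

kernels = [make_kernel(f, a, t, m)
           for (f, a), t, m in itertools.product(FUNCS.items(), TYPES, MODES)]
print(len(kernels))  # 2 funcs * 6 types * 3 modes = 36
```

Each generated kernel would then be compiled to PTX with both compilers and the outputs diffed after normalizing register names.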

From d8ffb5acbd79869476c91433f85488f3088e38fd Mon Sep 17 00:00:00 2001
From: Austin Schuh 
Date: Mon, 24 Mar 2025 21:42:35 -0700
Subject: [PATCH] Add support for CUDA surfaces

This adds support for all the surface read and write calls to clang.
It extends the pattern used for textures to surfaces too.

I tested this by generating all the various permutations of the calls
and argument types with a Python script, compiling them with both clang
and nvcc, and comparing the generated PTX for equivalence.  They all
agree, ignoring register allocation and some places where Clang does
different memory writes.  An example kernel is:

__global__ void testKernel(cudaSurfaceObject_t surfObj, int x, float2* result)
{
*result = surf1Dread(surfObj, x, cudaBoundaryModeZero);
}

Signed-off-by: Austin Schuh 
---
 .../Headers/__clang_cuda_runtime_wrapper.h|   1 +
 .../Headers/__clang_cuda_texture_intrinsics.h | 419 +-
 2 files changed, 418 insertions(+), 2 deletions(-)

diff --git a/clang/lib/Headers/__clang_cuda_runtime_wrapper.h b/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
index d369c86fe1064..8182c961ec32f 100644
--- a/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
+++ b/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
@@ -386,6 +386,7 @@ __host__ __device__ void __nv_tex_surf_handler(const char *name, T *ptr,
 #endif // __cplusplus >= 201103L && CUDA_VERSION >= 9000
 #include "texture_fetch_functions.h"
 #include "texture_indirect_functions.h"
+#include "surface_indirect_functions.h"
 
 // Restore state of __CUDA_ARCH__ and __THROW we had on entry.
 #pragma pop_macro("__CUDA_ARCH__")
diff --git a/clang/lib/Headers/__clang_cuda_texture_intrinsics.h b/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
index a71952211237b..2ea83f66036d4 100644
--- a/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
+++ b/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
@@ -28,6 +28,7 @@
 #pragma push_macro("__Args")
 #pragma push_macro("__ID")
 #pragma push_macro("__IDV")
+#pragma push_macro("__OP_TYPE_SURFACE")
 #pragma push_macro("__IMPL_2DGATHER")
 #pragma push_macro("__IMPL_ALIAS")
 #pragma push_macro("__IMPL_ALIASI")
@@ -45,6 +46,64 @@
 #pragma push_macro("__IMPL_SI")
 #pragma push_macro("__L")
 #pragma push_macro("__STRIP_PARENS")
+#pragma push_macro("__SURF_WRITE_V2")
+#pragma push_macro("__SW_ASM_ARGS")
+#pragma push_macro("__SW_ASM_ARGS1")
+#pragma push_macro("__SW_ASM_ARGS2")
+#pragma push_macro("__SW_ASM_ARGS4")
+#pragma push_macro("__SURF_WRITE_V2")
+#pragma push_macro("__SURF_READ_V2")
+#pragma push_macro("__SW_ASM_ARGS")
+#pragma push_macro("__SW_ASM_ARGS1")
+#pragma push_macro("__SW_ASM_ARGS2")
+#pragma push_macro("__SW_ASM_ARGS4")
+#pragma push_macro("__SURF_READ1D");
+#pragma push_macro("__SURF_READ2D");
+#pragma push_macro("__SURF_READ3D");
+#pragma push_macro("__SURF_READ1DLAYERED");
+#pragma push_macro("__SURF_READ2DLAYERED");
+#pragma push_macro("__SURF_READCUBEMAP");
+#pragma push_macro("__SURF_READCUBEMAPLAYERED");
+#pragma push_macro("__1DV1");
+#pragma push_macro("__1DV2");
+#pragma push_macro("__1DV4");
+#pragma push_macro("__2DV1");
+#pragma push_macro("__2DV2");
+#pragma push_macro("__2DV4");
+#pragma push_macro("__1DLAYERV1");
+#pragma push_macro("__1DLAYERV2");
+#pragma push_macro("__1DLAYERV4");
+#pragma push_macro("__3DV1");
+#pragma push_macro("__3DV2");
+#pragma push_macro("__3DV4");
+#pragma push_macro("__2DLAYERV1");
+#pragma push_macro("__2DLAYERV2");
+#pragma push_macro("__2DLAYERV4");
+#pragma push_macro("__CUBEMAPV1");
+#pragma push_macro("__CUBEMAPV2");
+#pragma push_macro("__CUBEMAPV4");
+#pragma push_macro("__CUBEMAPLAYERV1");
+#pragma push_macro("__CUBEMAPLAYERV2");
+#pragma push_macro("__CUBEMAPLAYERV4");
+#pragma push_macro("__SURF_READXD_ALL");
+#pragma push_macro("__SURF_WRITE1D_V2");
+#pragma push_macro("__SURF_WRITE1DLAYERED_V2");
+#pragma push_macro("__SURF_WRITE2D_V2");
+#pragma push_macro("__SURF_WRITE2DLAYERED_V2");
+#pragma push_macro("__SURF_WRITE3D_V2");
+#pragma push_macro("__SURF_CUBEMAPWRITE_V2");
+#pragma push_macro("__SURF_CUBEMAPLAYEREDWRITE_V2");
+#pragma push_macro("__SURF_WRITEXD_V2_ALL");
+#pragma push_macro("__1DV1");
+#pragma push_macro("__1DV2");
+#pragma push_macro("__1DV4");
+#pragma push_macro("__2DV1");
+#pragma push_macro("__2DV2");

[clang] cuda clang: Add support for CUDA surfaces (PR #132883)

2025-03-24 Thread Austin Schuh via cfe-commits

AustinSchuh wrote:

@Artem-B I think you added texture support originally.

Much of the wording in that file refers only to textures, not textures and 
surfaces.  I am happy to adjust that if desired.  I figured slightly ugly but 
working code, with early feedback, was preferable.

https://github.com/llvm/llvm-project/pull/132883
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] cuda clang: Add support for CUDA surfaces (PR #132883)

2025-03-29 Thread Austin Schuh via cfe-commits

AustinSchuh wrote:

> If the header is compilable without the CUDA SDK (maybe with some stub 
> headers in test Inputs), then a source file with a lot of builtin calls and 
> [autogenerated 
> checks](https://github.com/llvm/llvm-project/blob/d724bab8064685c98cdded88157b6e2245e0d7bc/llvm/utils/update_cc_test_checks.py#L2)
>  to verify produced instructions would do.

That wasn't too bad, done!

https://github.com/llvm/llvm-project/pull/132883


[clang] cuda clang: Add support for CUDA surfaces (PR #132883)

2025-04-03 Thread Austin Schuh via cfe-commits


@@ -0,0 +1,3329 @@
+// RUN: %clang_cc1 -triple nvptx-unknown-unknown -fcuda-is-device -O3 -o - %s -emit-llvm | FileCheck %s
+// RUN: %clang_cc1 -triple nvptx64-unknown-unknown -fcuda-is-device -O3 -o - %s -emit-llvm | FileCheck %s
+#include "../Headers/Inputs/include/cuda.h"

AustinSchuh wrote:

Any suggestions for how best to avoid duplicating it?  I could copy it over, 
but that feels worse.

https://github.com/llvm/llvm-project/pull/132883


[clang] cuda clang: Add support for CUDA surfaces (PR #132883)

2025-03-24 Thread Austin Schuh via cfe-commits

https://github.com/AustinSchuh edited 
https://github.com/llvm/llvm-project/pull/132883


[clang] cuda clang: Clean up test dependency for CUDA surfaces (PR #134459)

2025-04-04 Thread Austin Schuh via cfe-commits

https://github.com/AustinSchuh created 
https://github.com/llvm/llvm-project/pull/134459

https://github.com/llvm/llvm-project/pull/132883 added support for CUDA 
surfaces but reached into clang/test/Headers/ from clang/test/CodeGen/ to grab 
the minimal cuda.h.  Duplicate that file instead, as suggested in the review, 
to fix remote test runs.

From 52d4f06540761507c4d81cead1278bff783fc54d Mon Sep 17 00:00:00 2001
From: Austin Schuh 
Date: Fri, 4 Apr 2025 15:23:15 -0700
Subject: [PATCH] cuda clang: Clean up test dependency for CUDA surfaces

https://github.com/llvm/llvm-project/pull/132883 added support for CUDA
surfaces but reached into clang/test/Headers/ from clang/test/CodeGen/
to grab the minimal cuda.h.  Duplicate that file instead, as suggested
in the review, to fix remote test runs.

Signed-off-by: Austin Schuh 
---
 clang/test/CodeGen/include/cuda.h| 194 +++
 clang/test/CodeGen/nvptx-surface.cu  |   2 +-
 clang/test/Headers/Inputs/include/cuda.h |   4 +-
 3 files changed, 198 insertions(+), 2 deletions(-)
 create mode 100644 clang/test/CodeGen/include/cuda.h

diff --git a/clang/test/CodeGen/include/cuda.h b/clang/test/CodeGen/include/cuda.h
new file mode 100644
index 0..58202442e1f8c
--- /dev/null
+++ b/clang/test/CodeGen/include/cuda.h
@@ -0,0 +1,194 @@
+/* Minimal declarations for CUDA support.  Testing purposes only.
+ * This should stay in sync with clang/test/Headers/Inputs/include/cuda.h
+ */
+#pragma once
+
+// Make this file work with nvcc, for testing compatibility.
+
+#ifndef __NVCC__
+#define __constant__ __attribute__((constant))
+#define __device__ __attribute__((device))
+#define __global__ __attribute__((global))
+#define __host__ __attribute__((host))
+#define __shared__ __attribute__((shared))
+#define __managed__ __attribute__((managed))
+#define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))
+
+struct dim3 {
+  unsigned x, y, z;
+  __host__ __device__ dim3(unsigned x, unsigned y = 1, unsigned z = 1) : x(x), y(y), z(z) {}
+};
+
+// Host- and device-side placement new overloads.
+void *operator new(__SIZE_TYPE__, void *p) { return p; }
+void *operator new[](__SIZE_TYPE__, void *p) { return p; }
+__device__ void *operator new(__SIZE_TYPE__, void *p) { return p; }
+__device__ void *operator new[](__SIZE_TYPE__, void *p) { return p; }
+
+#define CUDA_VERSION 10100
+
+struct char1 {
+  char x;
+  __host__ __device__ char1(char x = 0) : x(x) {}
+};
+struct char2 {
+  char x, y;
+  __host__ __device__ char2(char x = 0, char y = 0) : x(x), y(y) {}
+};
+struct char4 {
+  char x, y, z, w;
+  __host__ __device__ char4(char x = 0, char y = 0, char z = 0, char w = 0) : x(x), y(y), z(z), w(w) {}
+};
+
+struct uchar1 {
+  unsigned char x;
+  __host__ __device__ uchar1(unsigned char x = 0) : x(x) {}
+};
+struct uchar2 {
+  unsigned char x, y;
+  __host__ __device__ uchar2(unsigned char x = 0, unsigned char y = 0) : x(x), y(y) {}
+};
+struct uchar4 {
+  unsigned char x, y, z, w;
+  __host__ __device__ uchar4(unsigned char x = 0, unsigned char y = 0, unsigned char z = 0, unsigned char w = 0) : x(x), y(y), z(z), w(w) {}
+};
+
+struct short1 {
+  short x;
+  __host__ __device__ short1(short x = 0) : x(x) {}
+};
+struct short2 {
+  short x, y;
+  __host__ __device__ short2(short x = 0, short y = 0) : x(x), y(y) {}
+};
+struct short4 {
+  short x, y, z, w;
+  __host__ __device__ short4(short x = 0, short y = 0, short z = 0, short w = 0) : x(x), y(y), z(z), w(w) {}
+};
+
+struct ushort1 {
+  unsigned short x;
+  __host__ __device__ ushort1(unsigned short x = 0) : x(x) {}
+};
+struct ushort2 {
+  unsigned short x, y;
+  __host__ __device__ ushort2(unsigned short x = 0, unsigned short y = 0) : x(x), y(y) {}
+};
+struct ushort4 {
+  unsigned short x, y, z, w;
+  __host__ __device__ ushort4(unsigned short x = 0, unsigned short y = 0, unsigned short z = 0, unsigned short w = 0) : x(x), y(y), z(z), w(w) {}
+};
+
+struct int1 {
+  int x;
+  __host__ __device__ int1(int x = 0) : x(x) {}
+};
+struct int2 {
+  int x, y;
+  __host__ __device__ int2(int x = 0, int y = 0) : x(x), y(y) {}
+};
+struct int4 {
+  int x, y, z, w;
+  __host__ __device__ int4(int x = 0, int y = 0, int z = 0, int w = 0) : x(x), y(y), z(z), w(w) {}
+};
+
+struct uint1 {
+  unsigned x;
+  __host__ __device__ uint1(unsigned x = 0) : x(x) {}
+};
+struct uint2 {
+  unsigned x, y;
+  __host__ __device__ uint2(unsigned x = 0, unsigned y = 0) : x(x), y(y) {}
+};
+struct uint3 {
+  unsigned x, y, z;
+  __host__ __device__ uint3(unsigned x = 0, unsigned y = 0, unsigned z = 0) : x(x), y(y), z(z) {}
+};
+struct uint4 {
+  unsigned x, y, z, w;
+  __host__ __device__ uint4(unsigned x = 0, unsigned y = 0, unsigned z = 0, unsigned w = 0) : x(x), y(y), z(z), w(w) {}
+};
+
+struct longlong1 {
+  long long x;
+  __host__ __device__ longlong1(long long x = 0) : x(x) {}
+};
+struct longlong2 {
+  long long x, y;
+  __host__ __device__ longlong2(long long x = 0, long long y = 0) : x(x), y(y) {}
+};

[clang] cuda clang: Add support for CUDA surfaces (PR #132883)

2025-04-04 Thread Austin Schuh via cfe-commits


@@ -0,0 +1,3329 @@
+// RUN: %clang_cc1 -triple nvptx-unknown-unknown -fcuda-is-device -O3 -o - %s -emit-llvm | FileCheck %s
+// RUN: %clang_cc1 -triple nvptx64-unknown-unknown -fcuda-is-device -O3 -o - %s -emit-llvm | FileCheck %s
+#include "../Headers/Inputs/include/cuda.h"

AustinSchuh wrote:

https://github.com/llvm/llvm-project/pull/134459 has the fixup

https://github.com/llvm/llvm-project/pull/132883


[clang] cuda clang: Clean up test dependency for CUDA surfaces (PR #134459)

2025-04-04 Thread Austin Schuh via cfe-commits

AustinSchuh wrote:

@slackito and @Artem-B , hopefully this addresses your feedback!

https://github.com/llvm/llvm-project/pull/134459


[clang] cuda clang: Add support for CUDA surfaces (PR #132883)

2025-03-25 Thread Austin Schuh via cfe-commits

https://github.com/AustinSchuh updated 
https://github.com/llvm/llvm-project/pull/132883

From d8ffb5acbd79869476c91433f85488f3088e38fd Mon Sep 17 00:00:00 2001
From: Austin Schuh 
Date: Mon, 24 Mar 2025 21:42:35 -0700
Subject: [PATCH 1/4] Add support for CUDA surfaces

This adds support for all the surface read and write calls to clang.
It extends the pattern used for textures to surfaces too.

I tested this by generating all the various permutations of the calls
and argument types with a Python script, compiling them with both clang
and nvcc, and comparing the generated PTX for equivalence.  They all
agree, ignoring register allocation and some places where Clang does
different memory writes.  An example kernel is:

__global__ void testKernel(cudaSurfaceObject_t surfObj, int x, float2* result)
{
*result = surf1Dread(surfObj, x, cudaBoundaryModeZero);
}

Signed-off-by: Austin Schuh 
---
 .../Headers/__clang_cuda_runtime_wrapper.h|   1 +
 .../Headers/__clang_cuda_texture_intrinsics.h | 419 +-
 2 files changed, 418 insertions(+), 2 deletions(-)

diff --git a/clang/lib/Headers/__clang_cuda_runtime_wrapper.h b/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
index d369c86fe1064..8182c961ec32f 100644
--- a/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
+++ b/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
@@ -386,6 +386,7 @@ __host__ __device__ void __nv_tex_surf_handler(const char *name, T *ptr,
 #endif // __cplusplus >= 201103L && CUDA_VERSION >= 9000
 #include "texture_fetch_functions.h"
 #include "texture_indirect_functions.h"
+#include "surface_indirect_functions.h"
 
 // Restore state of __CUDA_ARCH__ and __THROW we had on entry.
 #pragma pop_macro("__CUDA_ARCH__")
diff --git a/clang/lib/Headers/__clang_cuda_texture_intrinsics.h b/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
index a71952211237b..2ea83f66036d4 100644
--- a/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
+++ b/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
@@ -28,6 +28,7 @@
 #pragma push_macro("__Args")
 #pragma push_macro("__ID")
 #pragma push_macro("__IDV")
+#pragma push_macro("__OP_TYPE_SURFACE")
 #pragma push_macro("__IMPL_2DGATHER")
 #pragma push_macro("__IMPL_ALIAS")
 #pragma push_macro("__IMPL_ALIASI")
@@ -45,6 +46,64 @@
 #pragma push_macro("__IMPL_SI")
 #pragma push_macro("__L")
 #pragma push_macro("__STRIP_PARENS")
+#pragma push_macro("__SURF_WRITE_V2")
+#pragma push_macro("__SW_ASM_ARGS")
+#pragma push_macro("__SW_ASM_ARGS1")
+#pragma push_macro("__SW_ASM_ARGS2")
+#pragma push_macro("__SW_ASM_ARGS4")
+#pragma push_macro("__SURF_WRITE_V2")
+#pragma push_macro("__SURF_READ_V2")
+#pragma push_macro("__SW_ASM_ARGS")
+#pragma push_macro("__SW_ASM_ARGS1")
+#pragma push_macro("__SW_ASM_ARGS2")
+#pragma push_macro("__SW_ASM_ARGS4")
+#pragma push_macro("__SURF_READ1D");
+#pragma push_macro("__SURF_READ2D");
+#pragma push_macro("__SURF_READ3D");
+#pragma push_macro("__SURF_READ1DLAYERED");
+#pragma push_macro("__SURF_READ2DLAYERED");
+#pragma push_macro("__SURF_READCUBEMAP");
+#pragma push_macro("__SURF_READCUBEMAPLAYERED");
+#pragma push_macro("__1DV1");
+#pragma push_macro("__1DV2");
+#pragma push_macro("__1DV4");
+#pragma push_macro("__2DV1");
+#pragma push_macro("__2DV2");
+#pragma push_macro("__2DV4");
+#pragma push_macro("__1DLAYERV1");
+#pragma push_macro("__1DLAYERV2");
+#pragma push_macro("__1DLAYERV4");
+#pragma push_macro("__3DV1");
+#pragma push_macro("__3DV2");
+#pragma push_macro("__3DV4");
+#pragma push_macro("__2DLAYERV1");
+#pragma push_macro("__2DLAYERV2");
+#pragma push_macro("__2DLAYERV4");
+#pragma push_macro("__CUBEMAPV1");
+#pragma push_macro("__CUBEMAPV2");
+#pragma push_macro("__CUBEMAPV4");
+#pragma push_macro("__CUBEMAPLAYERV1");
+#pragma push_macro("__CUBEMAPLAYERV2");
+#pragma push_macro("__CUBEMAPLAYERV4");
+#pragma push_macro("__SURF_READXD_ALL");
+#pragma push_macro("__SURF_WRITE1D_V2");
+#pragma push_macro("__SURF_WRITE1DLAYERED_V2");
+#pragma push_macro("__SURF_WRITE2D_V2");
+#pragma push_macro("__SURF_WRITE2DLAYERED_V2");
+#pragma push_macro("__SURF_WRITE3D_V2");
+#pragma push_macro("__SURF_CUBEMAPWRITE_V2");
+#pragma push_macro("__SURF_CUBEMAPLAYEREDWRITE_V2");
+#pragma push_macro("__SURF_WRITEXD_V2_ALL");
+#pragma push_macro("__1DV1");
+#pragma push_macro("__1DV2");
+#pragma push_macro("__1DV4");
+#pragma push_macro("__2DV1");
+#pragma push_macro("__2DV2");
+#pragma push_macro("__2DV4");
+#pragma push_macro("__3DV1");
+#pragma push_macro("__3DV2");
+#pragma push_macro("__3DV4");
+
 
 // Put all functions into anonymous namespace so they have internal linkage.
 // The device-only function here must be internal in order to avoid ODR
@@ -186,6 +245,20 @@ template <class __T> struct __TypeInfoT {
   using __fetch_t = typename __TypeInfoT<__base_t>::__fetch_t;
 };
 
+// Tag structs to distinguish operation types
+struct __texture_op_tag {};
+struct __surface_op_tag {};
+
+// Template specialization to determine operation type based on tag value
+template 
+struct __op_type

[clang] [llvm] cuda clang: Fix argument order for __reduce_max_sync (PR #132881)

2025-03-25 Thread Austin Schuh via cfe-commits

https://github.com/AustinSchuh updated 
https://github.com/llvm/llvm-project/pull/132881

From 381ba9f21b4c2a2c4028143b84e9f71c8a20692f Mon Sep 17 00:00:00 2001
From: Austin Schuh 
Date: Mon, 24 Mar 2025 21:42:45 -0700
Subject: [PATCH 1/2] cuda clang: Fix argument order for __reduce_max_sync

The following CUDA kernel would crash with an "an illegal instruction
was encountered" message.

__global__ void testcode(const float* data, unsigned *max_value) {
  unsigned r = static_cast<unsigned>(data[threadIdx.x]);

  const unsigned mask = __ballot_sync(0xffffffff, true);

  unsigned mx = __reduce_max_sync(mask, r);
  atomicMax(max_value, mx);
}

Digging into the PTX from both nvcc and clang, I discovered that the
arguments for the mask and value were swapped.  This swaps them back.

Fixes: https://github.com/llvm/llvm-project/issues/131415
Signed-off-by: Austin Schuh 
---
 llvm/lib/Target/NVPTX/NVPTXIntrinsics.td | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index b2e05a567b4fe..1943e94c3ee7a 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -315,7 +315,7 @@ defm MATCH_ALLP_SYNC_64 : MATCH_ALLP_SYNC {
   def : NVPTXInst<(outs Int32Regs:$dst), (ins Int32Regs:$src, Int32Regs:$mask),
   "redux.sync." # BinOp # "." # PTXType # " $dst, $src, $mask;",
-  [(set i32:$dst, (Intrin i32:$src, Int32Regs:$mask))]>,
+  [(set i32:$dst, (Intrin i32:$mask, Int32Regs:$src))]>,
 Requires<[hasPTX<70>, hasSM<80>]>;
 }
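The debugging step described in the commit message, comparing PTX from nvcc and clang modulo register allocation, can be approximated by renaming registers canonically before diffing. This is a hypothetical sketch, not the author's actual workflow; the PTX snippets and parameter names below are invented to illustrate why normalization still catches the swapped operands.

```python
import re

def normalize_ptx(ptx: str) -> str:
    """Rename virtual registers to canonical names in order of first use,
    so listings differing only in register allocation compare equal."""
    mapping = {}
    def rename(m):
        mapping.setdefault(m.group(0), f"%v{len(mapping)}")
        return mapping[m.group(0)]
    # PTX virtual registers look like %r1, %rd2, %f3, ...
    return re.sub(r"%[A-Za-z]+\d+", rename, ptx)

clang_ptx = """ld.param.u32 %r1, [mask_param];
ld.param.u32 %r2, [value_param];
redux.sync.max.u32 %r3, %r2, %r1;"""

nvcc_ptx = """ld.param.u32 %r7, [mask_param];
ld.param.u32 %r8, [value_param];
redux.sync.max.u32 %r9, %r8, %r7;"""

# Same code modulo register allocation -> normalized forms match.
assert normalize_ptx(clang_ptx) == normalize_ptx(nvcc_ptx)

# The bug: mask and value operands swapped.  Normalization pins each
# register's identity at its defining load, so the diff still shows it.
swapped_ptx = """ld.param.u32 %r1, [mask_param];
ld.param.u32 %r2, [value_param];
redux.sync.max.u32 %r3, %r1, %r2;"""
assert normalize_ptx(swapped_ptx) != normalize_ptx(nvcc_ptx)
```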
 

From aa9148ce6af19a799f37aea6d973087c8fb7c53c Mon Sep 17 00:00:00 2001
From: Austin Schuh 
Date: Tue, 25 Mar 2025 12:12:47 -0700
Subject: [PATCH 2/2] Swap in __clang_cuda_intrinsics.h instead

---
 clang/lib/Headers/__clang_cuda_intrinsics.h | 16 
 llvm/lib/Target/NVPTX/NVPTXIntrinsics.td|  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/clang/lib/Headers/__clang_cuda_intrinsics.h b/clang/lib/Headers/__clang_cuda_intrinsics.h
index a04e8b6de44d0..8b230af6f6647 100644
--- a/clang/lib/Headers/__clang_cuda_intrinsics.h
+++ b/clang/lib/Headers/__clang_cuda_intrinsics.h
@@ -515,32 +515,32 @@ __device__ inline cuuint32_t __nvvm_get_smem_pointer(void *__ptr) {
 #if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
 __device__ inline unsigned __reduce_add_sync(unsigned __mask,
  unsigned __value) {
-  return __nvvm_redux_sync_add(__mask, __value);
+  return __nvvm_redux_sync_add(__value, __mask);
 }
 __device__ inline unsigned __reduce_min_sync(unsigned __mask,
  unsigned __value) {
-  return __nvvm_redux_sync_umin(__mask, __value);
+  return __nvvm_redux_sync_umin(__value, __mask);
 }
 __device__ inline unsigned __reduce_max_sync(unsigned __mask,
  unsigned __value) {
-  return __nvvm_redux_sync_umax(__mask, __value);
+  return __nvvm_redux_sync_umax(__value, __mask);
 }
 __device__ inline int __reduce_min_sync(unsigned __mask, int __value) {
-  return __nvvm_redux_sync_min(__mask, __value);
+  return __nvvm_redux_sync_min(__value, __mask);
 }
 __device__ inline int __reduce_max_sync(unsigned __mask, int __value) {
-  return __nvvm_redux_sync_max(__mask, __value);
+  return __nvvm_redux_sync_max(__value, __mask);
 }
 __device__ inline unsigned __reduce_or_sync(unsigned __mask, unsigned __value) {
-  return __nvvm_redux_sync_or(__mask, __value);
+  return __nvvm_redux_sync_or(__value, __mask);
 }
 __device__ inline unsigned __reduce_and_sync(unsigned __mask,
  unsigned __value) {
-  return __nvvm_redux_sync_and(__mask, __value);
+  return __nvvm_redux_sync_and(__value, __mask);
 }
 __device__ inline unsigned __reduce_xor_sync(unsigned __mask,
  unsigned __value) {
-  return __nvvm_redux_sync_xor(__mask, __value);
+  return __nvvm_redux_sync_xor(__value, __mask);
 }
 
 __device__ inline void __nv_memcpy_async_shared_global_4(void *__dst,
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 1943e94c3ee7a..b2e05a567b4fe 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -315,7 +315,7 @@ defm MATCH_ALLP_SYNC_64 : MATCH_ALLP_SYNC {
   def : NVPTXInst<(outs Int32Regs:$dst), (ins Int32Regs:$src, Int32Regs:$mask),
   "redux.sync." # BinOp # "." # PTXType # " $dst, $src, $mask;",
-  [(set i32:$dst, (Intrin i32:$mask, Int32Regs:$src))]>,
+  [(set i32:$dst, (Intrin i32:$src, Int32Regs:$mask))]>,
 Requires<[hasPTX<70>, hasSM<80>]>;
 }
 



[clang] [llvm] cuda clang: Fix argument order for __reduce_max_sync (PR #132881)

2025-03-25 Thread Austin Schuh via cfe-commits


@@ -315,7 +315,7 @@ defm MATCH_ALLP_SYNC_64 : MATCH_ALLP_SYNC {
   def : NVPTXInst<(outs Int32Regs:$dst), (ins Int32Regs:$src, Int32Regs:$mask),
   "redux.sync." # BinOp # "." # PTXType # " $dst, $src, $mask;",
-  [(set i32:$dst, (Intrin i32:$src, Int32Regs:$mask))]>,
+  [(set i32:$dst, (Intrin i32:$mask, Int32Regs:$src))]>,

AustinSchuh wrote:

I swear I read the code multiple times and couldn't see the swap.  I agree 
with you now; that's the right place to swap it.  The argument order matches 
the docs in the other spots, but not in __clang_cuda_intrinsics.h.  That also 
keeps the tests working (and explains why the tests didn't catch it).

https://github.com/llvm/llvm-project/pull/132881


[clang] cuda clang: Add support for CUDA surfaces (PR #132883)

2025-03-25 Thread Austin Schuh via cfe-commits

https://github.com/AustinSchuh updated 
https://github.com/llvm/llvm-project/pull/132883

From d8ffb5acbd79869476c91433f85488f3088e38fd Mon Sep 17 00:00:00 2001
From: Austin Schuh 
Date: Mon, 24 Mar 2025 21:42:35 -0700
Subject: [PATCH 1/2] Add support for CUDA surfaces

This adds support for all the surface read and write calls to clang.
It extends the pattern used for textures to surfaces too.

I tested this by generating all the various permutations of the calls
and argument types with a Python script, compiling them with both clang
and nvcc, and comparing the generated PTX for equivalence.  They all
agree, ignoring register allocation and some places where Clang does
different memory writes.  An example kernel is:

__global__ void testKernel(cudaSurfaceObject_t surfObj, int x, float2* result)
{
*result = surf1Dread(surfObj, x, cudaBoundaryModeZero);
}

Signed-off-by: Austin Schuh 
---
 .../Headers/__clang_cuda_runtime_wrapper.h|   1 +
 .../Headers/__clang_cuda_texture_intrinsics.h | 419 +-
 2 files changed, 418 insertions(+), 2 deletions(-)

diff --git a/clang/lib/Headers/__clang_cuda_runtime_wrapper.h b/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
index d369c86fe1064..8182c961ec32f 100644
--- a/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
+++ b/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
@@ -386,6 +386,7 @@ __host__ __device__ void __nv_tex_surf_handler(const char *name, T *ptr,
 #endif // __cplusplus >= 201103L && CUDA_VERSION >= 9000
 #include "texture_fetch_functions.h"
 #include "texture_indirect_functions.h"
+#include "surface_indirect_functions.h"
 
 // Restore state of __CUDA_ARCH__ and __THROW we had on entry.
 #pragma pop_macro("__CUDA_ARCH__")
diff --git a/clang/lib/Headers/__clang_cuda_texture_intrinsics.h b/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
index a71952211237b..2ea83f66036d4 100644
--- a/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
+++ b/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
@@ -28,6 +28,7 @@
 #pragma push_macro("__Args")
 #pragma push_macro("__ID")
 #pragma push_macro("__IDV")
+#pragma push_macro("__OP_TYPE_SURFACE")
 #pragma push_macro("__IMPL_2DGATHER")
 #pragma push_macro("__IMPL_ALIAS")
 #pragma push_macro("__IMPL_ALIASI")
@@ -45,6 +46,64 @@
 #pragma push_macro("__IMPL_SI")
 #pragma push_macro("__L")
 #pragma push_macro("__STRIP_PARENS")
+#pragma push_macro("__SURF_WRITE_V2")
+#pragma push_macro("__SW_ASM_ARGS")
+#pragma push_macro("__SW_ASM_ARGS1")
+#pragma push_macro("__SW_ASM_ARGS2")
+#pragma push_macro("__SW_ASM_ARGS4")
+#pragma push_macro("__SURF_WRITE_V2")
+#pragma push_macro("__SURF_READ_V2")
+#pragma push_macro("__SW_ASM_ARGS")
+#pragma push_macro("__SW_ASM_ARGS1")
+#pragma push_macro("__SW_ASM_ARGS2")
+#pragma push_macro("__SW_ASM_ARGS4")
+#pragma push_macro("__SURF_READ1D");
+#pragma push_macro("__SURF_READ2D");
+#pragma push_macro("__SURF_READ3D");
+#pragma push_macro("__SURF_READ1DLAYERED");
+#pragma push_macro("__SURF_READ2DLAYERED");
+#pragma push_macro("__SURF_READCUBEMAP");
+#pragma push_macro("__SURF_READCUBEMAPLAYERED");
+#pragma push_macro("__1DV1");
+#pragma push_macro("__1DV2");
+#pragma push_macro("__1DV4");
+#pragma push_macro("__2DV1");
+#pragma push_macro("__2DV2");
+#pragma push_macro("__2DV4");
+#pragma push_macro("__1DLAYERV1");
+#pragma push_macro("__1DLAYERV2");
+#pragma push_macro("__1DLAYERV4");
+#pragma push_macro("__3DV1");
+#pragma push_macro("__3DV2");
+#pragma push_macro("__3DV4");
+#pragma push_macro("__2DLAYERV1");
+#pragma push_macro("__2DLAYERV2");
+#pragma push_macro("__2DLAYERV4");
+#pragma push_macro("__CUBEMAPV1");
+#pragma push_macro("__CUBEMAPV2");
+#pragma push_macro("__CUBEMAPV4");
+#pragma push_macro("__CUBEMAPLAYERV1");
+#pragma push_macro("__CUBEMAPLAYERV2");
+#pragma push_macro("__CUBEMAPLAYERV4");
+#pragma push_macro("__SURF_READXD_ALL");
+#pragma push_macro("__SURF_WRITE1D_V2");
+#pragma push_macro("__SURF_WRITE1DLAYERED_V2");
+#pragma push_macro("__SURF_WRITE2D_V2");
+#pragma push_macro("__SURF_WRITE2DLAYERED_V2");
+#pragma push_macro("__SURF_WRITE3D_V2");
+#pragma push_macro("__SURF_CUBEMAPWRITE_V2");
+#pragma push_macro("__SURF_CUBEMAPLAYEREDWRITE_V2");
+#pragma push_macro("__SURF_WRITEXD_V2_ALL");
+#pragma push_macro("__1DV1");
+#pragma push_macro("__1DV2");
+#pragma push_macro("__1DV4");
+#pragma push_macro("__2DV1");
+#pragma push_macro("__2DV2");
+#pragma push_macro("__2DV4");
+#pragma push_macro("__3DV1");
+#pragma push_macro("__3DV2");
+#pragma push_macro("__3DV4");
+
 
 // Put all functions into anonymous namespace so they have internal linkage.
 // The device-only function here must be internal in order to avoid ODR
@@ -186,6 +245,20 @@ template <class __T> struct __TypeInfoT {
   using __fetch_t = typename __TypeInfoT<__base_t>::__fetch_t;
 };
 
+// Tag structs to distinguish operation types
+struct __texture_op_tag {};
+struct __surface_op_tag {};
+
+// Template specialization to determine operation type based on tag value
+template 
+struct __op_type

[clang] [llvm] cuda clang: Fix argument order for __reduce_max_sync (PR #132881)

2025-03-26 Thread Austin Schuh via cfe-commits

AustinSchuh wrote:

> @AustinSchuh would you like me to merge the change for you, once the checks 
> are done?

That would be wonderful.  I don't know how to merge it (happy to learn, but I 
suspect I won't do it more than a couple of times).

https://github.com/llvm/llvm-project/pull/132881


[clang] [llvm] cuda clang: Fix argument order for __reduce_max_sync (PR #132881)

2025-04-05 Thread Austin Schuh via cfe-commits


@@ -315,7 +315,7 @@ defm MATCH_ALLP_SYNC_64 : MATCH_ALLP_SYNC {
   def : NVPTXInst<(outs Int32Regs:$dst), (ins Int32Regs:$src, Int32Regs:$mask),
   "redux.sync." # BinOp # "." # PTXType # " $dst, $src, $mask;",
-  [(set i32:$dst, (Intrin i32:$src, Int32Regs:$mask))]>,
+  [(set i32:$dst, (Intrin i32:$mask, Int32Regs:$src))]>,

AustinSchuh wrote:

Cool.  I pushed a second commit in this review to do that instead.  (I can 
squash if you want; I'm not sure of the best strategy to keep this readable.)

https://github.com/llvm/llvm-project/pull/132881


[clang] cuda clang: Add support for CUDA surfaces (PR #132883)

2025-04-05 Thread Austin Schuh via cfe-commits

https://github.com/AustinSchuh edited 
https://github.com/llvm/llvm-project/pull/132883


[clang] cuda clang: Add support for CUDA surfaces (PR #132883)

2025-04-05 Thread Austin Schuh via cfe-commits

https://github.com/AustinSchuh edited 
https://github.com/llvm/llvm-project/pull/132883