[PATCH] D128158: [AMDGPU] Add amdgcn_sched_group_barrier builtin

2022-07-01 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added a comment.

Hey Austin -- I like the removal of canAddMIs. In the original design, I was
leaving open the possibility for users to pass in canAddMIs rather than a mask
/ SchedGroup name, but it looks like this isn't the direction we're going, and
having the classification functions defined in a general canAddMI makes things
easier.

I see this is a WIP, but I've added some thoughts I had from reading it over. I 
may have more as I use the design for my patch.




Comment at: llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp:199
+  // SchedGroupMask of instructions that should be barred.
+  SchedGroupMask invertSchedBarrierMask(SchedGroupMask Mask) const;
+

I find it confusing that SchedBarrier uses inversion while SchedGroupBarrier 
doesn't.



Comment at: llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp:306
+bool SchedGroup::isFull() const {
+  return MaxSize.hasValue() && Collection.size() >= *MaxSize;
+}

As in the update to IGroupLP.cpp in trunk, it seems like we are not supposed to
use hasValue.



Comment at: llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp:349
+  add(InitSU);
+  assert(MaxSize.hasValue());
+  (*MaxSize)++;

Not possible to have unsized groups?
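
A minimal sketch of the two `MaxSize` comments above, using `std::optional` as a stand-in for the optional type in the patch (an assumption; the rest of `SchedGroup` is not reproduced). `SchedGroupSketch` and `growForInit` are hypothetical names. It shows `has_value()` (or the contextual `bool` conversion) in place of `hasValue()`, and that an unsized group, if one were allowed, would simply never report full.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for SchedGroup; only the size bookkeeping is modeled.
struct SchedGroupSketch {
  std::optional<unsigned> MaxSize; // empty optional = unsized group (no cap)
  std::vector<int> Collection;     // stand-in for the collected SUnits

  // Full only when a cap exists and has been reached.
  bool isFull() const {
    return MaxSize.has_value() && Collection.size() >= *MaxSize;
  }

  // Same check, relying on optional's contextual conversion to bool.
  bool isFullAlt() const { return MaxSize && Collection.size() >= *MaxSize; }

  // Mirrors the quoted hunk: add the initial unit and grow the cap by one.
  // The assert is what rules out unsized groups on this path.
  void growForInit(int InitSU) {
    Collection.push_back(InitSU);
    assert(MaxSize.has_value());
    ++*MaxSize;
  }
};
```

With this shape, only the path through `growForInit` requires a sized group; the fullness check itself tolerates an empty `MaxSize`.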



Comment at: llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp:445
+  // initialized all of the SCHED_GROUP_BARRIER SchedGroups.
+  addSchedGroupBarrierEdges();
 }

If both types of barriers are present -- the SchedBarriers are handled first. 
However, if there is a conflict between SchedBarrier and SchedGroupBarrier, 
should SchedBarrier always get the priority? Maybe SchedBarrier should only 
handle groups not present in SchedGroupBarrier?



Comment at: llvm/test/CodeGen/AMDGPU/sched-group-barrier-pre-RA.mir:104
+GLOBAL_STORE_DWORD_SADDR %1, %13, %0, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+; 1 VMEM_READ
+SCHED_GROUP_BARRIER 32, 1, 0

I think you are aware of this issue, but the ability of the mutation to match
the pipeline depends on which instructions go into which group (when an
instruction can be mapped to multiple groups).

Suppose we had SchedGroups: 2 VMEM_READ, 1 VALU, 1 MFMA, 2 VMEM_READ

and an initial schedule: VMEMR, VALU, VMEMR, MFMA, VMEMR, with a dependency
from the middle VMEMR to the MFMA (VMEMR->MFMA).

initSchedGroup will add the middle VMEMR to the last VMEMR group, but we could
get a more accurate pipeline by adding it to the first group.
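
The point can be made concrete with a small, hedged sketch. It does not model how `initSchedGroup` picks a group; it only checks whether a given instruction-to-group assignment is consistent with the dependency. `respectsPipeline` and the index encoding are hypothetical, and groups are assumed to be numbered in pipeline order.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// GroupOf[I] is the pipeline position of the group that instruction I was
// placed in. Deps holds (Pred, Succ) pairs: Pred must be scheduled before
// Succ. An assignment fits the pipeline if no predecessor sits in a later
// group than its successor.
bool respectsPipeline(
    const std::vector<int> &GroupOf,
    const std::vector<std::pair<std::size_t, std::size_t>> &Deps) {
  for (const auto &Dep : Deps)
    if (GroupOf[Dep.first] > GroupOf[Dep.second])
      return false;
  return true;
}

// Groups: 0 = 2 VMEM_READ, 1 = 1 VALU, 2 = 1 MFMA, 3 = 2 VMEM_READ.
// Instructions: 0 = VMEMR, 1 = VALU, 2 = VMEMR (middle), 3 = MFMA, 4 = VMEMR.
// Middle VMEMR placed in the last VMEM_READ group, as described above.
const std::vector<int> MiddleInLastGroup = {0, 1, 3, 2, 3};
// Middle VMEMR placed in the first VMEM_READ group instead.
const std::vector<int> MiddleInFirstGroup = {0, 1, 0, 2, 3};
// The single dependency: middle VMEMR (2) -> MFMA (3).
const std::vector<std::pair<std::size_t, std::size_t>> MiddleToMFMA = {{2, 3}};
```

Under these assumptions, `MiddleInLastGroup` fails the check because the VMEMR lands in a group after the MFMA it feeds, while `MiddleInFirstGroup` passes.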




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128158/new/

https://reviews.llvm.org/D128158

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128158: [AMDGPU] Add amdgcn_sched_group_barrier builtin

2022-07-26 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added a comment.

LGTM




[PATCH] D132079: [AMDGPU] Add iglp_opt builtin and MFMA GEMM Opt strategy

2022-08-17 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added a comment.

Hey Austin --

Just have a small question about the purpose of shouldApplyStrategy -- other 
than that, LGTM.




Comment at: llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp:758
+
+  bool shouldApplyStrategy(ScheduleDAGInstrs *DAG) override { return true; }
+

Is the plan to use heuristics on top of the builtin at some point? Not sure I 
understand this.



Comment at: llvm/lib/Target/AMDGPU/SIPostRABundler.cpp:135
+// Don't cluster with IGLP instructions.
+bool HasIGLPInstrs =
+std::any_of(MBB.instr_begin(), MBB.instr_end(), [](MachineInstr &MI) {

Maybe not in this patch due to time constraints, but perhaps in future work we
can extract the check for IGLP_OPT / SCHED_GROUP_BARRIER into an analysis pass
so we don't need to keep checking for it.
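
A rough sketch of the idea, not of any existing LLVM analysis: compute the answer once per block and reuse it. `IGLPPresenceSketch`, the `Opc` enum, and the integer block id are all hypothetical; the real check operates on `MachineInstr` opcodes.

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

// Hypothetical opcode set; the real check is on MachineInstr opcodes.
enum class Opc { Other, IGLPOpt, SchedGroupBarrier };

// Hypothetical per-block cache: computes the answer once per block id and
// reuses it, instead of rescanning the block in every pass that asks.
class IGLPPresenceSketch {
  std::unordered_map<int, bool> Cache;
  int Scans = 0; // number of full scans performed, for illustration

public:
  bool hasIGLPInstrs(int BlockId, const std::vector<Opc> &Instrs) {
    auto It = Cache.find(BlockId);
    if (It != Cache.end())
      return It->second;
    ++Scans;
    bool Found = std::any_of(Instrs.begin(), Instrs.end(), [](Opc O) {
      return O == Opc::IGLPOpt || O == Opc::SchedGroupBarrier;
    });
    Cache.emplace(BlockId, Found);
    return Found;
  }

  int scans() const { return Scans; }
};
```

A real analysis would also need an invalidation story when instructions are added or removed, which this sketch leaves out.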


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132079/new/

https://reviews.llvm.org/D132079



[PATCH] D132079: [AMDGPU] Add iglp_opt builtin and MFMA GEMM Opt strategy

2022-08-19 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added inline comments.



Comment at: llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp:1063
+} else if (Opc == AMDGPU::IGLP_OPT) {
+  if (!foundSB && !foundIGLP)
+initIGLPOpt(*R);

I think this makes more sense if you parse the entire DAG first, then check if
neither was found.
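
A hedged sketch of the two-pass shape suggested here: first record what the region contains, then decide once. The `Opc` enum, `ScanResult`, and both functions are hypothetical, `FoundSB` is assumed to mean "a SCHED_BARRIER was seen", and the decision rule is an illustrative assumption rather than the patch's policy.

```cpp
#include <vector>

// Hypothetical opcode set standing in for the AMDGPU opcodes in the hunk.
enum class Opc { Other, SchedBarrier, IGLPOpt };

struct ScanResult {
  bool FoundSB = false;
  bool FoundIGLP = false;
};

// Pass 1: walk every instruction and only record what is present.
ScanResult scanRegion(const std::vector<Opc> &Instrs) {
  ScanResult R;
  for (Opc O : Instrs) {
    if (O == Opc::SchedBarrier)
      R.FoundSB = true;
    else if (O == Opc::IGLPOpt)
      R.FoundIGLP = true;
  }
  return R;
}

// Pass 2: decide once, from the complete picture. The policy here (IGLP_OPT
// applies only when no SCHED_BARRIER is present) is an assumption made for
// illustration, not the patch's rule.
bool shouldInitIGLPOpt(const ScanResult &R) {
  return R.FoundIGLP && !R.FoundSB;
}
```

The benefit of this shape is that the decision no longer depends on the order in which the two kinds of instructions are visited.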




[PATCH] D132079: [AMDGPU] Add iglp_opt builtin and MFMA GEMM Opt strategy

2022-08-19 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added a comment.

LGTM again




[PATCH] D132079: [AMDGPU] Add iglp_opt builtin and MFMA GEMM Opt strategy

2022-08-19 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added a comment.

Just a couple nitpicks




Comment at: llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp:1071
 
   PipelineSolver PS(SyncedSchedGroups, SyncedInstrs, DAG);
   // PipelineSolver performs the mutation by adding the edges it

Having a fully unguarded entry point into PS construction / PS.solve() makes me
a bit uneasy -- and it is at best inefficient. Can you guard this with foundSGB
|| foundIGLP?
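
A minimal sketch of the requested guard, with a hypothetical `SolverSketch` standing in for `PipelineSolver` and a hypothetical `solveIfNeeded` wrapper; the flag names follow the comment above.

```cpp
// Hypothetical stand-in for PipelineSolver; it only counts how often it ran.
struct SolverSketch {
  int &Runs;
  explicit SolverSketch(int &Runs) : Runs(Runs) {}
  void solve() { ++Runs; }
};

// Construct and run the solver only when there is something to solve.
// Returns true if the solver ran.
bool solveIfNeeded(bool FoundSGB, bool FoundIGLP, int &Runs) {
  if (!FoundSGB && !FoundIGLP)
    return false; // early exit: no construction, no solve
  SolverSketch PS(Runs);
  PS.solve();
  return true;
}
```

Placing the construction inside the guard means regions without either kind of instruction pay for neither setup nor solving.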



Comment at: llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp:427
 DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
   DAG->addMutation(createIGroupLPDAGMutation());
   DAG->addMutation(createAMDGPUMacroFusionDAGMutation());

I think you can remove this as well since you're doing it from within the 
scheduler.




[PATCH] D132079: [AMDGPU] Add iglp_opt builtin and MFMA GEMM Opt strategy

2022-08-19 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes accepted this revision.
jrbyrnes added a comment.
This revision is now accepted and ready to land.

LGTM




Comment at: llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp:427
 DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
   DAG->addMutation(createIGroupLPDAGMutation());
   DAG->addMutation(createAMDGPUMacroFusionDAGMutation());

kerbowa wrote:
> jrbyrnes wrote:
> > I think you can remove this as well since you're doing it from within the 
> > scheduler.
> It's not added in the scheduler for plain SCHED_BARRIER.
Oh okay -- I see




[PATCH] D147732: [AMDGPU] Add type mangling for {read, write, readfirst, perm}lane intrinsics

2023-07-06 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added inline comments.



Comment at: llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp:187
 
+Value *AMDGPULateCodeGenPrepare::buildLegalLaneIntrinsic(
+IRBuilder<> &B, Intrinsic::ID IID, Value *Data0, Value *Data1, Value *Lane0,

arsenm wrote:
> You're not relying on this for correctness are you? This is an optimization 
> pass, you can't lower here. You also shouldn't need to handle this in the IR, 
> it should codegen normally 
This is the legalization for non-32-bit types -- I don't know exactly why it
wasn't handled via the normal codegen / selection process. @nhaehnle, I
believe you tried this in https://reviews.llvm.org/D86154 -- do you happen to
remember why we do legalization this way? If not, I'll rework the approach.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D147732/new/

https://reviews.llvm.org/D147732



[PATCH] D147732: [AMDGPU] Add type mangling for {read, write, readfirst, perm}lane intrinsics

2023-06-20 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes updated this revision to Diff 533080.
jrbyrnes marked 5 inline comments as done.
jrbyrnes added a comment.

Address comments + enable selection of ptr types


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D147732/new/

https://reviews.llvm.org/D147732

Files:
  clang/include/clang/Basic/BuiltinsAMDGPU.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/test/CodeGenOpenCL/builtins-amdgcn-gfx10.cl
  clang/test/CodeGenOpenCL/builtins-amdgcn.cl
  clang/test/SemaOpenCL/builtins-amdgcn-error-gfx10-param.cl
  llvm/include/llvm/IR/IntrinsicsAMDGPU.td
  llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
  llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
  llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
  llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
  llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
  llvm/lib/Target/AMDGPU/SIInstructions.td
  llvm/lib/Target/AMDGPU/VOP3Instructions.td
  llvm/test/Analysis/UniformityAnalysis/AMDGPU/intrinsics.ll
  llvm/test/Assembler/autoupgrade-amdgpu-intrinsics.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/atomic_optimizations_mul_one.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-amdgcn.readfirstlane.mir
  llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
  llvm/test/CodeGen/AMDGPU/global-atomic-scan.ll
  llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane.ll
  llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll
  llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readlane.ll
  llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll
  llvm/test/CodeGen/AMDGPU/permlane-ptr.ll
  llvm/test/Transforms/InstCombine/AMDGPU/amdgcn-intrinsics.ll
  llvm/test/Verifier/AMDGPU/intrinsic-immarg.ll



[PATCH] D147732: [AMDGPU] Add type mangling for {read, write, readfirst, perm}lane intrinsics

2023-06-23 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added a comment.

In D147732#4434557, @arsenm wrote:

> I think this may not hard break mesa. I believe mesa bypasses the intrinsic 
> creation API, and just declares the string name of the intrinsic. The type 
> name mangling suffix is technically irrelevant, and as long as you use a 
> consistent type with a consistent suffix things should work out (and the null 
> suffix also works). After committing mesa should still move to adding the 
> type suffix

I can echo this sentiment.

The main issue arises when there are untyped calls to CreateIntrinsic, as the
intrinsics are no longer defined with a type.

For {read, readfirst, write, perm}lanes, Mesa uses the LLVMAddFunction and
LLVMBuildCall2 APIs under its own ac_build_intrinsic -- these calls are all
typed in the current implementation. Also (as expected), the implementation
inserts bitcasts to cast to Int32Ty before inserting these calls, since only
that version of the intrinsic currently exists. This also implies they won't
have an issue with intrinsic / type declarations.

Unless I have missed something, I don't see why switching to type-mangling 
would cause an issue with Mesa's current implementation.
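
As a loose illustration of the "bitcast to Int32Ty, then call the i32 version" pattern described above, here is a plain C++ model. It is not Mesa or LLVM code: `laneOpI32` and `laneOpF32` are hypothetical names, and the "intrinsic" is an identity function used only to show the data flow.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for an i32-only lane intrinsic. A real lane operation
// moves data between lanes; this model just returns its operand unchanged.
std::uint32_t laneOpI32(std::uint32_t V) { return V; }

// Wrapper for a 32-bit float: reinterpret the bits as i32, call the i32
// version, then reinterpret the result back. memcpy is the portable way to
// express a bitcast in C++.
float laneOpF32(float V) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
  std::uint32_t Bits;
  std::memcpy(&Bits, &V, sizeof(Bits));
  std::uint32_t Result = laneOpI32(Bits);
  float Out;
  std::memcpy(&Out, &Result, sizeof(Out));
  return Out;
}
```

Because every caller in this model already goes through the i32 form with a consistent type, adding typed variants alongside it would not change what those callers see.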




[PATCH] D153667: [HIP]: Add gpu-link-output to control link job creation

2023-06-23 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes created this revision.
jrbyrnes added a reviewer: yaxunl.
Herald added a project: All.
jrbyrnes requested review of this revision.
Herald added subscribers: cfe-commits, MaskRay.
Herald added a project: clang.

Change-Id: Ia19a28867d15022d1400d3e18c61f14259057ff4


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D153667

Files:
  clang/include/clang/Driver/Options.td
  clang/lib/Driver/Driver.cpp
  clang/test/Driver/hip-device-compile.hip
  clang/test/Driver/hip-phases.hip

Index: clang/test/Driver/hip-phases.hip
===
--- clang/test/Driver/hip-phases.hip
+++ clang/test/Driver/hip-phases.hip
@@ -244,6 +244,43 @@
 // DASM-NOT: clang-offload-bundler
 // DASM-NOT: host
 
+//
+// Test single gpu architecture with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 %s --cuda-device-only --no-gpu-link-output 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC %s
+// RELOC-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC-NOT: [[P5:[0-9]+]]: linker, {[[P4]]}, image, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+
+//
+// Test two gpu architectures with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 %s --cuda-device-only --no-gpu-link-output 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC2 %s
+// RELOC2-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC2-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC2-NOT: [[P5:[0-9]+]]: linker, {[[P4]]}, image, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+// RELOC2-DAG: [[P6:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH2:gfx900]])
+// RELOC2-DAG: [[P7:[0-9]+]]: preprocessor, {[[P6]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P8:[0-9]+]]: compiler, {[[P7]]}, ir, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P9:[0-9]+]]: backend, {[[P8]]}, assembler, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P10:[0-9]+]]: assembler, {[[P9]]}, object, (device-[[T]], [[ARCH2]])
+// RELOC2-NOT: [[P11:[0-9]+]]: linker, {[[P10]]}, image, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P11:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH2]])" {[[P10]]}, object
+
 //
 // Test two gpu architectures with complete compilation in device-only
 // compilation mode.
Index: clang/test/Driver/hip-device-compile.hip
===
--- clang/test/Driver/hip-device-compile.hip
+++ clang/test/Driver/hip-device-compile.hip
@@ -45,6 +45,14 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
 // RUN: 2>&1 | FileCheck -check-prefixes=CHECK,ASM,NBUN %s
 
+// Output relocatable.
+// RUN: %clang -c --cuda-device-only -### --target=x86_64-linux-gnu \
+// RUN:   -o a.o -x hip --cuda-gpu-arch=gfx900 --no-gpu-link-output \
+// RUN:   --hip-device-lib=lib1.bc \
+// RUN:   --hip-device-lib-path=%S/Inputs/hip_multiple_inputs/lib1 \
+// RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
+// RUN: 2>&1 | FileCheck -check-prefixes=CHECK,NBUN,RELOC %s
+
 // Output bundled assembly.
 // RUN: %clang -c -S --cuda-device-only -### --target=x86_64-linux-gnu \
 // RUN:   -o a.s -x hip --cuda-gpu-arch=gfx900 --no-gpu-bundle-output \
@@ -68,6 +76,7 @@
 // LLBUN-SAME: "-o" "{{.*}}.ll"
 // ASM-SAME: "-o" "a.s"
 // ASMBUN-SAME: "-o" "{{.*}}.s"
+// RELOC-SAME: "-o" "a.o"
 // CHECK-SAME: {{".*a.cu"}}
 
 // CHECK-NOT: {{"*.llvm-link"}}
Index: clang/lib/Driver/Driver.cpp
===
--- clang/lib/Driver/Driver.cpp
+++ clang/lib/Driver/Driver.cpp
@@ -3322,16 +3322,22 @@
 // only compilation. Bundle other type of output files only if
 // --gpu-bundle-output is specified for device only compilation.
 std::optional BundleOutput;
+std::optional LinkOutput;
 
   public:
 HIPActionBuilder(Compilation &C, DerivedArgList &Args,
 

[PATCH] D153667: [HIP]: Add gpu-link-output to control link job creation

2023-06-23 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes updated this revision to Diff 534086.
jrbyrnes added a comment.

Formatting


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153667/new/

https://reviews.llvm.org/D153667

Files:
  clang/include/clang/Driver/Options.td
  clang/lib/Driver/Driver.cpp
  clang/test/Driver/hip-device-compile.hip
  clang/test/Driver/hip-phases.hip

Index: clang/test/Driver/hip-phases.hip
===
--- clang/test/Driver/hip-phases.hip
+++ clang/test/Driver/hip-phases.hip
@@ -244,6 +244,43 @@
 // DASM-NOT: clang-offload-bundler
 // DASM-NOT: host
 
+//
+// Test single gpu architecture with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 %s --cuda-device-only --no-gpu-link-output 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC %s
+// RELOC-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC-NOT: [[P5:[0-9]+]]: linker, {[[P4]]}, image, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+
+//
+// Test two gpu architectures with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 %s --cuda-device-only --no-gpu-link-output 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC2 %s
+// RELOC2-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC2-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC2-NOT: [[P5:[0-9]+]]: linker, {[[P4]]}, image, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+// RELOC2-DAG: [[P6:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH2:gfx900]])
+// RELOC2-DAG: [[P7:[0-9]+]]: preprocessor, {[[P6]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P8:[0-9]+]]: compiler, {[[P7]]}, ir, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P9:[0-9]+]]: backend, {[[P8]]}, assembler, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P10:[0-9]+]]: assembler, {[[P9]]}, object, (device-[[T]], [[ARCH2]])
+// RELOC2-NOT: [[P11:[0-9]+]]: linker, {[[P10]]}, image, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P11:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH2]])" {[[P10]]}, object
+
 //
 // Test two gpu architectures with complete compilation in device-only
 // compilation mode.
Index: clang/test/Driver/hip-device-compile.hip
===
--- clang/test/Driver/hip-device-compile.hip
+++ clang/test/Driver/hip-device-compile.hip
@@ -45,6 +45,14 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
 // RUN: 2>&1 | FileCheck -check-prefixes=CHECK,ASM,NBUN %s
 
+// Output relocatable.
+// RUN: %clang -c --cuda-device-only -### --target=x86_64-linux-gnu \
+// RUN:   -o a.o -x hip --cuda-gpu-arch=gfx900 --no-gpu-link-output \
+// RUN:   --hip-device-lib=lib1.bc \
+// RUN:   --hip-device-lib-path=%S/Inputs/hip_multiple_inputs/lib1 \
+// RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
+// RUN: 2>&1 | FileCheck -check-prefixes=CHECK,NBUN,RELOC %s
+
 // Output bundled assembly.
 // RUN: %clang -c -S --cuda-device-only -### --target=x86_64-linux-gnu \
 // RUN:   -o a.s -x hip --cuda-gpu-arch=gfx900 --no-gpu-bundle-output \
@@ -68,6 +76,7 @@
 // LLBUN-SAME: "-o" "{{.*}}.ll"
 // ASM-SAME: "-o" "a.s"
 // ASMBUN-SAME: "-o" "{{.*}}.s"
+// RELOC-SAME: "-o" "a.o"
 // CHECK-SAME: {{".*a.cu"}}
 
 // CHECK-NOT: {{"*.llvm-link"}}
Index: clang/lib/Driver/Driver.cpp
===
--- clang/lib/Driver/Driver.cpp
+++ clang/lib/Driver/Driver.cpp
@@ -3322,16 +3322,23 @@
 // only compilation. Bundle other type of output files only if
 // --gpu-bundle-output is specified for device only compilation.
 std::optional BundleOutput;
+std::optional LinkOutput;
 
   public:
 HIPActionBuilder(Compilation &C, DerivedArgList &Args,
  const Driver::InputList &Inputs)
 : CudaActionBuilderBase(C, Args, Inputs, Action::OFK_HIP) {
   Def

[PATCH] D153667: [HIP]: Add gpu-link-output to control link job creation

2023-06-26 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes updated this revision to Diff 534725.
jrbyrnes marked an inline comment as done.
jrbyrnes added a comment.

Fix tests + add tests. Add phase test for -fgpu-rdc --no-gpu-link-output (these 
are not intended to be used together)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153667/new/

https://reviews.llvm.org/D153667

Files:
  clang/include/clang/Driver/Options.td
  clang/lib/Driver/Driver.cpp
  clang/test/Driver/hip-device-compile.hip
  clang/test/Driver/hip-phases.hip
  clang/test/Driver/hip-rdc-device-only.hip

Index: clang/test/Driver/hip-rdc-device-only.hip
===
--- clang/test/Driver/hip-rdc-device-only.hip
+++ clang/test/Driver/hip-rdc-device-only.hip
@@ -5,7 +5,7 @@
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -c -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
-// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output --gpu-link-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
 
 // With `-emit-llvm`, the output should be the same as the aforementioned line
@@ -15,14 +15,14 @@
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -c -emit-llvm -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
-// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output --gpu-link-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
 
 // RUN: %clang -### --target=x86_64-linux-gnu \
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -S -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
-// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output --gpu-link-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITLL %s
 
 // With `-emit-llvm`, the output should be the same as the aforementioned line
@@ -32,7 +32,7 @@
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -S -emit-llvm -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
-// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output --gpu-link-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITLL %s
 
 // With `-save-temps`, commane lines for each steps are dumped. For assembly
@@ -43,7 +43,7 @@
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -S -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
-// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output --gpu-link-output \
 // RUN: 2>&1 | FileCheck -check-prefix=SAVETEMP %s
 
 // Check output one file without bundling cause error.
@@ -54,6 +54,12 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu -o %t.s --no-gpu-bundle-output \
 // RUN: 2>&1 | FileCheck -check-prefix=FAIL %s
 
+// RUN: %clang -### --target=x86_64-linux-gnu \
+// RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
+// RUN:   -S -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
+// RUN:   %S/Inputs/hip_multiple_inputs/a.cu -o %t.s --no-gpu-link-output \
+// RUN: 2>&1 | FileCheck -check-prefix=FAIL %s
+
 // COMMON: [[CLANG:".*clang.*"]] "-cc1" "-triple" "amdgcn-amd-amdhsa"
 // COMMON-SAME: "-aux-triple" "x86_64-unknown-linux-gnu"
 // EMITBC-SAME: "-emit-llvm-bc"
Index: clang/test/Driver/hip-phases.hip
===
--- clang/test/Driver/hip-phases.hip
+++ clang/test/Driver/hip-phases.hip
@@ -244,6 +244,53 @@
 // DASM-NOT: clang-offload-bundler
 // DASM-NOT: host
 
+//
+// Test single gpu architecture with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 %s --cuda-device-only --no-gpu-link-output 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC %s
+// RELOC-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC-NOT: linker
+// RELOC-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+
+// RUN: %clang -x hip --target=x86_64-un

[PATCH] D153667: [HIP]: Add gpu-link-output to control link job creation

2023-06-26 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added a comment.

In D153667#4450517, @jhuber6 wrote:

> What's the difference here between this and the existing `--hip-link`?

Hi @jhuber6

The commit is poorly named; the main purpose is to introduce
`-no-gpu-link-output`.

We want a way to produce a relocatable from source. In terms of the Driver, this
means building actions and jobs for phases up to `phases::Assemble`.
`-no-gpu-link-output` does this by overriding BuildActions to stop after
`phases::Assemble` (similar to `-no-gpu-bundle-output`). `-gpu-link-output` is
NFCI. COMGR would be the client of this, and it would be up to COMGR to handle
linking of the relocatable.

AFAICT, `-hip-link` allows for linking of offload-bundles, so it is
conceptually different. We can get (somewhat) close to what we want with
`-emit-llvm -hip-link`, but that is probably more due to `-emit-llvm`.
`-hip-link` by itself produces linker actions / jobs, which is what we are
trying to avoid here.
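
A toy model of the intended effect, assuming the device-side phase names printed by `-ccc-print-phases` in the tests of this patch. `devicePhases` is a hypothetical helper and does not reflect the Driver's internal API.

```cpp
#include <string>
#include <vector>

// Hypothetical model of the device-side phase list; names follow the
// -ccc-print-phases output in the tests, not the Driver's internal API.
std::vector<std::string> devicePhases(bool LinkOutput) {
  std::vector<std::string> Phases = {"preprocessor", "compiler", "backend",
                                     "assembler"};
  // With link output disabled, stop after the assemble phase.
  if (LinkOutput)
    Phases.push_back("linker");
  return Phases;
}
```

With `LinkOutput` false, the last phase is the assembler, which corresponds to the `RELOC-NOT` linker checks in hip-phases.hip.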




[PATCH] D153667: [HIP]: Add gpu-link-output to control link job creation

2023-06-28 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added a comment.

In D153667#4450724, @jhuber6 wrote:

> In D153667#4450705, @jrbyrnes wrote:
>
>> In D153667#4450517, @jhuber6 wrote:
>>
>>> What's the difference here between this and the existing `--hip-link`?
>>
>> Hi @jhuber6
>>
>> The commit is poorly named; the main purpose is to introduce
>> `-no-gpu-link-output`.
>>
>> We want a way to produce a relocatable from source. In terms of the Driver,
>> this means building actions and jobs for phases up to `phases::Assemble`.
>> `-no-gpu-link-output` does this by overriding BuildActions to stop after
>> `phases::Assemble` (similar to `-no-gpu-bundle-output`). `-gpu-link-output`
>> is NFCI. COMGR would be the client of this, and it would be up to COMGR to
>> handle linking of the relocatable.
>>
>> AFAICT, `-hip-link` allows for linking of offload-bundles, so it is
>> conceptually different. We can get (somewhat) close to what we want with
>> `-emit-llvm -hip-link`, but that is probably more due to `-emit-llvm`.
>> `-hip-link` by itself produces linker actions / jobs, which is what we are
>> trying to avoid here.
>
> So, you run the backend and obtain a relocatable ELF, but do not link it via 
> `lld`? If I'm understanding this correctly, that would be the difference 
> between `-flto` and `-fno-lto`, or `-foffload-lto` and `-fno-offload-lto`, 
> AMDGPU always having `-flto` on currently. Also I recall AMDGPU / HIP 
> completely disabling the backend step at some point, so it only emits LLVM-IR.

The whole point of this work is to give hiprtc a way to compile-to-bitcode and 
optimize sources in a single step, to make (user-passed) flag handling less 
weird. Since the intent of LTO is to defer this optimization step, I would 
assume any way we try to use it here would not be correct.




[PATCH] D153667: [HIP]: Add gpu-link-output to control link job creation

2023-06-28 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes updated this revision to Diff 535456.
jrbyrnes added a comment.

Naming + -cuda-device-only and -fno-gpu-rdc only


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153667/new/

https://reviews.llvm.org/D153667

Files:
  clang/include/clang/Driver/Options.td
  clang/lib/Driver/Driver.cpp
  clang/test/Driver/hip-device-compile.hip
  clang/test/Driver/hip-phases.hip
  clang/test/Driver/hip-rdc-device-only.hip

Index: clang/test/Driver/hip-rdc-device-only.hip
===
--- clang/test/Driver/hip-rdc-device-only.hip
+++ clang/test/Driver/hip-rdc-device-only.hip
@@ -18,6 +18,27 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
 
+// With `-fno-hip-emit-relocatable`, the output should be the same as the aforementioned line
+// as `-fgpu-rdc` in HIP implies `-fno-hip-emit-relocatable`.
+
+// RUN: %clang -### --target=x86_64-linux-gnu \
+// RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
+// RUN:   -c -fno-hip-emit-relocatable -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
+// RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
+// RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
+
+// With `-fhip-emit-relocatable`, the output should be the same as the aforementioned line
+// as `-fgpu-rdc` in HIP overrides `-fhip-emit-relocatable`.
+
+// RUN: %clang -### --target=x86_64-linux-gnu \
+// RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
+// RUN:   -c -fhip-emit-relocatable -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
+// RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
+// RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
+
+
 // RUN: %clang -### --target=x86_64-linux-gnu \
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -S -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
@@ -54,6 +75,12 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu -o %t.s --no-gpu-bundle-output \
 // RUN: 2>&1 | FileCheck -check-prefix=FAIL %s
 
+// RUN: %clang -### --target=x86_64-linux-gnu \
+// RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
+// RUN:   -S -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
+// RUN:   %S/Inputs/hip_multiple_inputs/a.cu -o %t.s \
+// RUN: 2>&1 | FileCheck -check-prefix=FAIL %s
+
 // COMMON: [[CLANG:".*clang.*"]] "-cc1" "-triple" "amdgcn-amd-amdhsa"
 // COMMON-SAME: "-aux-triple" "x86_64-unknown-linux-gnu"
 // EMITBC-SAME: "-emit-llvm-bc"
Index: clang/test/Driver/hip-phases.hip
===
--- clang/test/Driver/hip-phases.hip
+++ clang/test/Driver/hip-phases.hip
@@ -244,6 +244,59 @@
 // DASM-NOT: clang-offload-bundler
 // DASM-NOT: host
 
+//
+// Test single gpu architecture with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 %s --cuda-device-only -fhip-emit-relocatable 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC %s
+// RELOC-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC-NOT: linker
+// RELOC-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+
+
+// Compile to relocatable with fgpu-rdc is not allowed
+
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 %s --cuda-device-only -fhip-emit-relocatable -fgpu-rdc 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOCRDC %s
+// RELOCRDC-DAG: linker
+
+// Compile to relocatable is only allowed in device-only compilation mode
+
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 %s -fhip-emit-relocatable -fno-gpu-rdc 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOCHOST %s
+// RELOCHOST-DAG: linker
+
+
+//
+// Test two gpu architectures with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 %s --cuda-device-only -fhip-emit-relocatable 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC2 %s
+// RELOC2-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC2-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC

[PATCH] D153667: [HIP]: Add -fhip-emit-relocatable to override link job creation for -fno-gpu-rdc

2023-06-28 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes updated this revision to Diff 535484.
jrbyrnes added a comment.

Use member variables + add diagnostic + tests


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153667/new/

https://reviews.llvm.org/D153667

Files:
  clang/include/clang/Driver/Options.td
  clang/lib/Driver/Driver.cpp
  clang/test/Driver/hip-dependent-options.hip
  clang/test/Driver/hip-device-compile.hip
  clang/test/Driver/hip-phases.hip
  clang/test/Driver/hip-rdc-device-only.hip

Index: clang/test/Driver/hip-rdc-device-only.hip
===
--- clang/test/Driver/hip-rdc-device-only.hip
+++ clang/test/Driver/hip-rdc-device-only.hip
@@ -18,6 +18,16 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
 
+// With `-fno-hip-emit-relocatable`, the output should be the same as the aforementioned line
+// as `-fgpu-rdc` in HIP implies `-fno-hip-emit-relocatable`.
+
+// RUN: %clang -### --target=x86_64-linux-gnu \
+// RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
+// RUN:   -c -fno-hip-emit-relocatable -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
+// RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
+// RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
+
 // RUN: %clang -### --target=x86_64-linux-gnu \
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -S -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
Index: clang/test/Driver/hip-phases.hip
===
--- clang/test/Driver/hip-phases.hip
+++ clang/test/Driver/hip-phases.hip
@@ -244,6 +244,43 @@
 // DASM-NOT: clang-offload-bundler
 // DASM-NOT: host
 
+//
+// Test single gpu architecture with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 %s --cuda-device-only -fhip-emit-relocatable 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC %s
+// RELOC-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC-NOT: linker
+// RELOC-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+
+//
+// Test two gpu architectures with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 %s --cuda-device-only -fhip-emit-relocatable 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC2 %s
+// RELOC2-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC2-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC2-NOT: [[P5:[0-9]+]]: linker, {[[P4]]}, image, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+// RELOC2-DAG: [[P6:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH2:gfx900]])
+// RELOC2-DAG: [[P7:[0-9]+]]: preprocessor, {[[P6]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P8:[0-9]+]]: compiler, {[[P7]]}, ir, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P9:[0-9]+]]: backend, {[[P8]]}, assembler, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P10:[0-9]+]]: assembler, {[[P9]]}, object, (device-[[T]], [[ARCH2]])
+// RELOC2-NOT: linker
+// RELOC2-DAG: [[P11:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH2]])" {[[P10]]}, object
+
 //
 // Test two gpu architectures with complete compilation in device-only
 // compilation mode.
Index: clang/test/Driver/hip-device-compile.hip
===
--- clang/test/Driver/hip-device-compile.hip
+++ clang/test/Driver/hip-device-compile.hip
@@ -45,6 +45,14 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
 // RUN: 2>&1 | FileCheck -check-prefixes=CHECK,ASM,NBUN %s
 
+// Output relocatable.
+// RUN: %clang -c --cuda-device-only -### --target=x86_64-linux-gnu \
+// RUN:   -o a.o -x hip --cuda-gpu-arch=gfx900 --no-gpu-link-output \
+// RUN:   --hip-device-lib=lib1.bc \
+// RUN:   --hip-device-lib-path=%S/Inputs/hip_multiple_inputs/lib1

[PATCH] D153667: [HIP]: Add -fhip-emit-relocatable to override link job creation for -fno-gpu-rdc

2023-06-28 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes updated this revision to Diff 535519.
jrbyrnes marked 3 inline comments as done.
jrbyrnes added a comment.

Address Comment


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153667/new/

https://reviews.llvm.org/D153667

Files:
  clang/include/clang/Driver/Options.td
  clang/lib/Driver/Driver.cpp
  clang/test/Driver/hip-dependent-options.hip
  clang/test/Driver/hip-device-compile.hip
  clang/test/Driver/hip-phases.hip
  clang/test/Driver/hip-rdc-device-only.hip

Index: clang/test/Driver/hip-rdc-device-only.hip
===
--- clang/test/Driver/hip-rdc-device-only.hip
+++ clang/test/Driver/hip-rdc-device-only.hip
@@ -18,6 +18,16 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
 
+// With `-fno-hip-emit-relocatable`, the output should be the same as the aforementioned line
+// as `-fgpu-rdc` in HIP implies `-fno-hip-emit-relocatable`.
+
+// RUN: %clang -### --target=x86_64-linux-gnu \
+// RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
+// RUN:   -c -fno-hip-emit-relocatable -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
+// RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
+// RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
+
 // RUN: %clang -### --target=x86_64-linux-gnu \
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -S -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
Index: clang/test/Driver/hip-phases.hip
===
--- clang/test/Driver/hip-phases.hip
+++ clang/test/Driver/hip-phases.hip
@@ -244,6 +244,43 @@
 // DASM-NOT: clang-offload-bundler
 // DASM-NOT: host
 
+//
+// Test single gpu architecture with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 %s --cuda-device-only -fhip-emit-relocatable 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC %s
+// RELOC-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC-NOT: linker
+// RELOC-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+
+//
+// Test two gpu architectures with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 %s --cuda-device-only -fhip-emit-relocatable 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC2 %s
+// RELOC2-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC2-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC2-NOT: [[P5:[0-9]+]]: linker, {[[P4]]}, image, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+// RELOC2-DAG: [[P6:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH2:gfx900]])
+// RELOC2-DAG: [[P7:[0-9]+]]: preprocessor, {[[P6]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P8:[0-9]+]]: compiler, {[[P7]]}, ir, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P9:[0-9]+]]: backend, {[[P8]]}, assembler, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P10:[0-9]+]]: assembler, {[[P9]]}, object, (device-[[T]], [[ARCH2]])
+// RELOC2-NOT: linker
+// RELOC2-DAG: [[P11:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH2]])" {[[P10]]}, object
+
 //
 // Test two gpu architectures with complete compilation in device-only
 // compilation mode.
Index: clang/test/Driver/hip-device-compile.hip
===
--- clang/test/Driver/hip-device-compile.hip
+++ clang/test/Driver/hip-device-compile.hip
@@ -45,6 +45,14 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
 // RUN: 2>&1 | FileCheck -check-prefixes=CHECK,ASM,NBUN %s
 
+// Output relocatable.
+// RUN: %clang -c --cuda-device-only -### --target=x86_64-linux-gnu \
+// RUN:   -o a.o -x hip --cuda-gpu-arch=gfx900 -fhip-emit-relocatable \
+// RUN:   --hip-device-lib=lib1.bc \
+// RUN:   --hip-device-lib-path=%S/Inputs/hip_mult

[PATCH] D153667: [HIP]: Add -fhip-emit-relocatable to override link job creation for -fno-gpu-rdc

2023-06-28 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added inline comments.



Comment at: clang/lib/Driver/Driver.cpp:3328-3330
+  CompileDeviceOnly = C.getDriver().offloadDeviceOnly();
+  Relocatable = Args.hasFlag(options::OPT_fgpu_rdc,
+ options::OPT_fno_gpu_rdc, /*Default=*/false);

yaxunl wrote:
> probably needs to be moved to ctor of CudaActionBuilderBase since they are 
> needed by both Cuda and HIP action builders.
Thanks
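
For readers less familiar with the driver code, the `Args.hasFlag(Pos, Neg, Default)` call in the snippet above resolves a positive/negative option pair: the last occurrence on the command line wins, and the default is used when neither is present. A minimal standalone sketch of that rule follows. The `hasFlag` helper below is a hypothetical stand-in that works on a plain vector of strings; it is not the LLVM `ArgList` API and only illustrates the resolution order.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the "last one wins" flag resolution.
// Scans the arguments from the end: the first match found is the
// last occurrence on the command line.
inline bool hasFlag(const std::vector<std::string> &Args,
                    const std::string &Pos, const std::string &Neg,
                    bool Default) {
  for (auto It = Args.rbegin(); It != Args.rend(); ++It) {
    if (*It == Pos)
      return true;  // positive form seen last
    if (*It == Neg)
      return false; // negative form seen last
  }
  return Default;   // neither form was given
}
```

Under this rule, passing `-fgpu-rdc -fno-gpu-rdc` yields false, the reverse order yields true, and an empty command line yields the default.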


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153667/new/

https://reviews.llvm.org/D153667


[PATCH] D153667: [HIP]: Add -fhip-emit-relocatable to override link job creation for -fno-gpu-rdc

2023-06-29 Thread Jeffrey Byrnes via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rGbe8a65b598b3: [HIP]: Add -fhip-emit-relocatable to override 
link job creation for -fno-gpu-rdc (authored by jrbyrnes).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153667/new/

https://reviews.llvm.org/D153667

Files:
  clang/include/clang/Driver/Options.td
  clang/lib/Driver/Driver.cpp
  clang/test/Driver/hip-dependent-options.hip
  clang/test/Driver/hip-device-compile.hip
  clang/test/Driver/hip-phases.hip
  clang/test/Driver/hip-rdc-device-only.hip

Index: clang/test/Driver/hip-rdc-device-only.hip
===
--- clang/test/Driver/hip-rdc-device-only.hip
+++ clang/test/Driver/hip-rdc-device-only.hip
@@ -18,6 +18,16 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
 
+// With `-fno-hip-emit-relocatable`, the output should be the same as the aforementioned line
+// as `-fgpu-rdc` in HIP implies `-fno-hip-emit-relocatable`.
+
+// RUN: %clang -### --target=x86_64-linux-gnu \
+// RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
+// RUN:   -c -fno-hip-emit-relocatable -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
+// RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip --gpu-bundle-output \
+// RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
+
 // RUN: %clang -### --target=x86_64-linux-gnu \
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -S -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
Index: clang/test/Driver/hip-phases.hip
===
--- clang/test/Driver/hip-phases.hip
+++ clang/test/Driver/hip-phases.hip
@@ -244,6 +244,43 @@
 // DASM-NOT: clang-offload-bundler
 // DASM-NOT: host
 
+//
+// Test single gpu architecture with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 %s --cuda-device-only -fhip-emit-relocatable 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC %s
+// RELOC-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC-NOT: linker
+// RELOC-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+
+//
+// Test two gpu architectures with compile to relocatable in device-only
+// compilation mode.
+//
+// RUN: %clang -x hip --target=x86_64-unknown-linux-gnu -ccc-print-phases \
+// RUN: --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 %s --cuda-device-only -fhip-emit-relocatable 2>&1 \
+// RUN: | FileCheck -check-prefixes=RELOC2 %s
+// RELOC2-DAG: [[P0:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH:gfx803]])
+// RELOC2-DAG: [[P1:[0-9]+]]: preprocessor, {[[P0]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P2:[0-9]+]]: compiler, {[[P1]]}, ir, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P3:[0-9]+]]: backend, {[[P2]]}, assembler, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P4:[0-9]+]]: assembler, {[[P3]]}, object, (device-[[T]], [[ARCH]])
+// RELOC2-NOT: [[P5:[0-9]+]]: linker, {[[P4]]}, image, (device-[[T]], [[ARCH]])
+// RELOC2-DAG: [[P5:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH]])" {[[P4]]}, object
+// RELOC2-DAG: [[P6:[0-9]+]]: input, "{{.*}}hip-phases.hip", [[T:hip]], (device-[[T]], [[ARCH2:gfx900]])
+// RELOC2-DAG: [[P7:[0-9]+]]: preprocessor, {[[P6]]}, [[T]]-cpp-output, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P8:[0-9]+]]: compiler, {[[P7]]}, ir, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P9:[0-9]+]]: backend, {[[P8]]}, assembler, (device-[[T]], [[ARCH2]])
+// RELOC2-DAG: [[P10:[0-9]+]]: assembler, {[[P9]]}, object, (device-[[T]], [[ARCH2]])
+// RELOC2-NOT: linker
+// RELOC2-DAG: [[P11:[0-9]+]]: offload, "device-[[T]] (amdgcn-amd-amdhsa:[[ARCH2]])" {[[P10]]}, object
+
 //
 // Test two gpu architectures with complete compilation in device-only
 // compilation mode.
Index: clang/test/Driver/hip-device-compile.hip
===
--- clang/test/Driver/hip-device-compile.hip
+++ clang/test/Driver/hip-device-compile.hip
@@ -45,6 +45,14 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
 // RUN: 2>&1 | FileCheck -check-prefixes=CHECK,ASM,NBUN %s
 
+// Output relocatable.
+// RUN: %clang -c --cuda-device-only -### --target=x86_64-linux-gnu \
+// RUN:   -o a.o -x hip --cuda-gpu-arch=gfx900 -fhip-emit-relocatable \
+// RUN:  

[PATCH] D147732: [AMDGPU] Add f32 permlane{16, x16} builtin variants

2023-04-06 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes created this revision.
jrbyrnes added reviewers: rampitec, arsenm.
Herald added subscribers: kosarev, foad, kerbowa, hiraditya, tpr, dstuttard, 
yaxunl, jvesely, kzhuravl.
Herald added a project: All.
jrbyrnes requested review of this revision.
Herald added subscribers: llvm-commits, cfe-commits, wdng.
Herald added projects: clang, LLVM.

Add builtins which accept floats for these instructions. A user is requesting 
to have permlane builtins for floats without use of casts.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D147732

Files:
  clang/include/clang/Basic/BuiltinsAMDGPU.def
  clang/test/CodeGenOpenCL/builtins-amdgcn-gfx10.cl
  clang/test/SemaOpenCL/builtins-amdgcn-error-gfx10-param.cl
  llvm/include/llvm/IR/IntrinsicsAMDGPU.td
  llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
  llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
  llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
  llvm/lib/Target/AMDGPU/VOP3Instructions.td

Index: llvm/lib/Target/AMDGPU/VOP3Instructions.td
===
--- llvm/lib/Target/AMDGPU/VOP3Instructions.td
+++ llvm/lib/Target/AMDGPU/VOP3Instructions.td
@@ -663,7 +663,9 @@
 let OtherPredicates = [HasMADIntraFwdBug], SubtargetPredicate = isGFX11Only in
   defm : IMAD32_Pats;
 
-def VOP3_PERMLANE_Profile : VOP3_Profile, VOP3_OPSEL> {
+
+
+class VOP3_PERMLANE_Profile : VOP3_Profile, VOP3_OPSEL> {
   let InsVOP3OpSel = (ins IntOpSelMods:$src0_modifiers, VRegSrc_32:$src0,
   IntOpSelMods:$src1_modifiers, SSrc_b32:$src1,
   IntOpSelMods:$src2_modifiers, SSrc_b32:$src2,
@@ -679,9 +681,9 @@
 def gi_opsel_i1timm : GICustomOperandRenderer<"renderOpSelTImm">,
   GISDNodeXFormEquiv;
 
-class PermlanePat : GCNPat<
-  (permlane i32:$vdst_in, i32:$src0, i32:$src1, i32:$src2,
+  (permlane vt:$vdst_in, vt:$src0, i32:$src1, i32:$src2,
 timm:$fi, timm:$bc),
   (inst (opsel_i1timm $fi), VGPR_32:$src0, (opsel_i1timm $bc),
 SCSrc_b32:$src1, 0, SCSrc_b32:$src2, VGPR_32:$vdst_in)
@@ -695,12 +697,17 @@
   def : ThreeOp_i32_Pats;
 
   let Constraints = "$vdst = $vdst_in", DisableEncoding="$vdst_in" in {
-defm V_PERMLANE16_B32 : VOP3Inst<"v_permlane16_b32", VOP3_PERMLANE_Profile>;
-defm V_PERMLANEX16_B32 : VOP3Inst<"v_permlanex16_b32", VOP3_PERMLANE_Profile>;
+defm V_PERMLANE16_B32 : VOP3Inst<"v_permlane16_b32", VOP3_PERMLANE_Profile>;
+defm V_PERMLANEX16_B32 : VOP3Inst<"v_permlanex16_b32", VOP3_PERMLANE_Profile>;
+defm V_PERMLANE16_F32_B32 : VOP3Inst<"v_permlane16_b32", VOP3_PERMLANE_Profile>;
+defm V_PERMLANEX16_F32_B32 : VOP3Inst<"v_permlanex16_b32", VOP3_PERMLANE_Profile>;
   } // End $vdst = $vdst_in, DisableEncoding $vdst_in
 
-  def : PermlanePat;
-  def : PermlanePat;
+  def : PermlanePat;
+  def : PermlanePat;
+  def : PermlanePat;
+  def : PermlanePat;
+
 
   defm V_ADD_NC_U16 : VOP3Inst <"v_add_nc_u16", VOP3_Profile, add>;
   defm V_SUB_NC_U16 : VOP3Inst <"v_sub_nc_u16", VOP3_Profile, sub>;
Index: llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
===
--- llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
+++ llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
@@ -303,6 +303,8 @@
 def : SourceOfDivergence;
 def : SourceOfDivergence;
 def : SourceOfDivergence;
+def : SourceOfDivergence;
+def : SourceOfDivergence;
 def : SourceOfDivergence;
 def : SourceOfDivergence;
 def : SourceOfDivergence;
Index: llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
===
--- llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -2990,7 +2990,9 @@
   applyDefaultMapping(OpdMapper);
   return;
 case Intrinsic::amdgcn_permlane16:
-case Intrinsic::amdgcn_permlanex16: {
+case Intrinsic::amdgcn_permlanex16:
+case Intrinsic::amdgcn_permlane16_f32:
+case Intrinsic::amdgcn_permlanex16_f32: {
   // Doing a waterfall loop over these wouldn't make any sense.
   substituteSimpleCopyRegs(OpdMapper, 2);
   substituteSimpleCopyRegs(OpdMapper, 3);
@@ -4367,7 +4369,9 @@
   break;
 }
 case Intrinsic::amdgcn_permlane16:
-case Intrinsic::amdgcn_permlanex16: {
+case Intrinsic::amdgcn_permlanex16:
+case Intrinsic::amdgcn_permlane16_f32:
+case Intrinsic::amdgcn_permlanex16_f32: {
   unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
   OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
   OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
Index: llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
===
--- llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -892,7 +892,9 @@
 return IC.replaceOperand(II, 0, UndefValue::get(

[PATCH] D147732: [AMDGPU] Add f32 permlane{16, x16} builtin variants

2023-04-06 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added a comment.

In D147732#4249567, @rampitec wrote:

> Isn't it simpler to lower it to an existing int intrinsic and casts in clang?

Thanks for your comment Stas!

I think it would be ideal if clang inserted pure bitcasts for floats instead of 
fptoui when they are passed as operands to these builtins. My concern is: do you 
think we need to preserve the implicit casting behavior for compatibility?
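
To make the distinction concrete, here is a minimal sketch of the two behaviors in plain C++, assuming `float` is IEEE-754 binary32. The helper names (`valueConvert`, `bitCast`, `bitCastBack`) are hypothetical and are not part of the patch; they only contrast a value conversion (the kind that lowers to fptoui) with a bit-preserving cast.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Value conversion: what an implicit float -> unsigned conversion does.
// The numeric value is converted and the fractional part is discarded,
// so the original bit pattern is lost.
inline std::uint32_t valueConvert(float F) {
  return static_cast<std::uint32_t>(F);
}

// Bit-preserving cast: copies the 32 bits of the float unchanged.
inline std::uint32_t bitCast(float F) {
  static_assert(sizeof(std::uint32_t) == sizeof(float),
                "sketch assumes a 32-bit float");
  std::uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return Bits;
}

// Inverse of bitCast: reinterprets the 32 bits as a float again.
inline float bitCastBack(std::uint32_t Bits) {
  float F;
  std::memcpy(&F, &Bits, sizeof(F));
  return F;
}
```

With the value conversion, `1.5f` becomes the integer 1 and cannot be recovered after the lane permutation. With the bit cast, the payload round-trips exactly, which is what a user moving floats between lanes would expect.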


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D147732/new/

https://reviews.llvm.org/D147732


[PATCH] D147732: [AMDGPU] Add f32 permlane{16, x16} builtin variants

2023-04-13 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes updated this revision to Diff 513386.
jrbyrnes marked an inline comment as done.
jrbyrnes added a comment.

Use type mangling


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D147732/new/

https://reviews.llvm.org/D147732

Files:
  clang/include/clang/Basic/BuiltinsAMDGPU.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/test/CodeGenOpenCL/builtins-amdgcn-gfx10.cl
  clang/test/SemaOpenCL/builtins-amdgcn-error-gfx10-param.cl
  llvm/include/llvm/IR/IntrinsicsAMDGPU.td
  llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
  llvm/lib/Target/AMDGPU/VOP3Instructions.td
  llvm/test/Analysis/DivergenceAnalysis/AMDGPU/intrinsics.ll
  llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
  llvm/test/Transforms/InstCombine/AMDGPU/amdgcn-intrinsics.ll
  llvm/test/Verifier/AMDGPU/intrinsic-immarg.ll

Index: llvm/test/Verifier/AMDGPU/intrinsic-immarg.ll
===
--- llvm/test/Verifier/AMDGPU/intrinsic-immarg.ll
+++ llvm/test/Verifier/AMDGPU/intrinsic-immarg.ll
@@ -555,12 +555,12 @@
 define i32 @test_permlane16(ptr addrspace(1) %out, i32 %arg0, i32 %arg1, i32 %arg2, i1 %arg3, i1 %arg4) {
   ; CHECK: immarg operand has non-immediate parameter
   ; CHECK-NEXT: i1 %arg3
-  ; CHECK-NEXT: %v1 = call i32 @llvm.amdgcn.permlane16(i32 %arg0, i32 %arg0, i32 %arg1, i32 %arg2, i1 %arg3, i1 false)
+  ; CHECK-NEXT: %v1 = call i32 @llvm.amdgcn.permlane16.i32(i32 %arg0, i32 %arg0, i32 %arg1, i32 %arg2, i1 %arg3, i1 false)
   %v1 = call i32 @llvm.amdgcn.permlane16(i32 %arg0, i32 %arg0, i32 %arg1, i32 %arg2, i1 %arg3, i1 false)
 
   ; CHECK: immarg operand has non-immediate parameter
   ; CHECK-NEXT: i1 %arg4
-  ; CHECK-NEXT: call i32 @llvm.amdgcn.permlane16(i32 %v2, i32 %arg0, i32 %arg1, i32 %arg2, i1 false, i1 %arg4)
+  ; CHECK-NEXT: call i32 @llvm.amdgcn.permlane16.i32(i32 %v2, i32 %arg0, i32 %arg1, i32 %arg2, i1 false, i1 %arg4)
   %v2 = call i32 @llvm.amdgcn.permlane16(i32 %v2, i32 %arg0, i32 %arg1, i32 %arg2, i1 false, i1 %arg4)
   ret i32 %v2
 }
@@ -569,12 +569,12 @@
 define i32 @test_permlanex16(ptr addrspace(1) %out, i32 %arg0, i32 %arg1, i32 %arg2, i1 %arg3, i1 %arg4) {
   ; CHECK: immarg operand has non-immediate parameter
   ; CHECK-NEXT: i1 %arg3
-  ; CHECK-NEXT: %v1 = call i32 @llvm.amdgcn.permlanex16(i32 %arg0, i32 %arg0, i32 %arg1, i32 %arg2, i1 %arg3, i1 false)
+  ; CHECK-NEXT: %v1 = call i32 @llvm.amdgcn.permlanex16.i32(i32 %arg0, i32 %arg0, i32 %arg1, i32 %arg2, i1 %arg3, i1 false)
   %v1 = call i32 @llvm.amdgcn.permlanex16(i32 %arg0, i32 %arg0, i32 %arg1, i32 %arg2, i1 %arg3, i1 false)
 
   ; CHECK: immarg operand has non-immediate parameter
   ; CHECK-NEXT: i1 %arg4
-  ; CHECK-NEXT: call i32 @llvm.amdgcn.permlanex16(i32 %v2, i32 %arg0, i32 %arg1, i32 %arg2, i1 false, i1 %arg4)
+  ; CHECK-NEXT: call i32 @llvm.amdgcn.permlanex16.i32(i32 %v2, i32 %arg0, i32 %arg1, i32 %arg2, i1 false, i1 %arg4)
   %v2 = call i32 @llvm.amdgcn.permlanex16(i32 %v2, i32 %arg0, i32 %arg1, i32 %arg2, i1 false, i1 %arg4)
   ret i32 %v2
 }
@@ -600,7 +600,6 @@
   ; CHECK: immarg operand has non-immediate parameter
   ; CHECK-NEXT: i32 %arg2
   ; CHECK-NEXT: %val0 = call float @llvm.amdgcn.interp.p2(float %arg0, float %arg1, i32 %arg2, i32 0, i32 0)
-
   %val0 = call float @llvm.amdgcn.interp.p2(float %arg0, float %arg1, i32 %arg2, i32 0, i32 0)
   store volatile float %val0, ptr addrspace(1) undef
 
Index: llvm/test/Transforms/InstCombine/AMDGPU/amdgcn-intrinsics.ll
===
--- llvm/test/Transforms/InstCombine/AMDGPU/amdgcn-intrinsics.ll
+++ llvm/test/Transforms/InstCombine/AMDGPU/amdgcn-intrinsics.ll
@@ -66,7 +66,7 @@
 
 define float @test_constant_fold_rcp_f32_43_strictfp() nounwind strictfp {
 ; CHECK-LABEL: @test_constant_fold_rcp_f32_43_strictfp(
-; CHECK-NEXT:[[VAL:%.*]] = call float @llvm.amdgcn.rcp.f32(float 4.30e+01) #[[ATTR14:[0-9]+]]
+; CHECK-NEXT:[[VAL:%.*]] = call float @llvm.amdgcn.rcp.f32(float 4.30e+01) #[[ATTR13:[0-9]+]]
 ; CHECK-NEXT:ret float [[VAL]]
 ;
   %val = call float @llvm.amdgcn.rcp.f32(float 4.30e+01) strictfp nounwind readnone
@@ -107,7 +107,7 @@
 
 define half @test_constant_fold_sqrt_f16_0() nounwind {
 ; CHECK-LABEL: @test_constant_fold_sqrt_f16_0(
-; CHECK-NEXT:[[VAL:%.*]] = call half @llvm.amdgcn.sqrt.f16(half 0xH) #[[ATTR15:[0-9]+]]
+; CHECK-NEXT:[[VAL:%.*]] = call half @llvm.amdgcn.sqrt.f16(half 0xH) #[[ATTR14:[0-9]+]]
 ; CHECK-NEXT:ret half [[VAL]]
 ;
   %val = call half @llvm.amdgcn.sqrt.f16(half 0.0) nounwind readnone
@@ -116,7 +116,7 @@
 
 define float @test_constant_fold_sqrt_f32_0() nounwind {
 ; CHECK-LABEL: @test_constant_fold_sqrt_f32_0(
-; CHECK-NEXT:[[VAL:%.*]] = call float @llvm.amdgcn.sqrt.f32(float 0.00e+00) #[[ATTR15]]
+; CHECK-NEXT:[[VAL:%.*]] = call float @llvm.amdgcn.sqrt.f32(float 0.00e+00) #[[ATTR14]]
 ; CHECK-NEXT:ret float [[VAL]]
 ;
   %val = call float 

[PATCH] D135269: [AMDGPU] Disable bool range metadata to workaround backend issue

2022-12-08 Thread Jeffrey Byrnes via Phabricator via cfe-commits
jrbyrnes added a comment.

In D135269#3981856, @yaxunl wrote:

> In D135269#3981561, @nikic wrote:
>
>> Checking back here again on whether there is any progress on finding the 
>> root cause of the issue. If no progress is expected in the near future, I'd 
>> ask for this patch to be reverted.
>
> @jrbyrnes is working on the root cause of this issue. Any updates? Thanks.

Thanks for the ping. I would also like to see this reverted, as doing so enables 
some optimizations. I do not have a definitive answer at the moment (w.r.t. 
reverting this), but hope to provide one soon.

As of now, the issue we are seeing from 
https://github.com/llvm/llvm-project/commit/8018d6be3459780e81a5da128a9915eb27909902
seems most likely to be a source code issue. It was first documented in 
https://github.com/pytorch/pytorch/issues/54789, and upstream PyTorch currently 
skips the problematic test: 
https://github.com/pytorch/pytorch/blob/b738da8c8e4d9142ad38a1bd8c35d0bfef4b5e3c/torch/testing/_internal/common_methods_invocations.py#L14891
I will provide a better update soon.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D135269/new/

https://reviews.llvm.org/D135269
