[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)

via llvm-branch-commits Mon, 07 Jul 2025 02:12:07 -0700

llvmbot wrote:


<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-llvm-globalisel

Author: Sameer Sahasrabuddhe (ssahasra)

<details>
<summary>Changes</summary>

The memory legalizer is currently responsible for emitting wait instructions at 
ordering operations such as acquire and release. It tries to be efficient by 
emitting waits only when required. In particular, it does not emit a wait on 
vmcnt at workgroup scope since that ordering is already guaranteed by the 
architecture. But this is now incorrect because direct loads to LDS have an LDS 
component which needs explicit ordering on vmcnt. But it is inefficient to 
always emit a wait on vmcnt since majority of the programs do not use direct 
loads to LDS, and this will affect all workgroup scope operations.

As a first step to that, the memory legalizer now emits a soft wait instruction 
even if all counts are trivially ~0. This is a placeholder that the 
SIInsertWaitcnts pass will either optimize away or strenghthen based on its 
analysis of whether direct loads to LDS are pending at this point in the 
program.

---

Patch is 4.42 MiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/147257.diff


41 Files Affected:

- (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+25-33) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll (+6-6) 
- (modified) 
llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll (+112) 
- (modified) 
llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior2.ll 
(+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll (+6-6) 
- (modified) llvm/test/CodeGen/AMDGPU/flat-scratch.ll (+6-6) 
- (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+66-66) 
- (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+2) 
- (modified) llvm/test/CodeGen/AMDGPU/kernel-vgpr-spill-mubuf-with-voffset.ll 
(+1) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.update.dpp.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll 
(+64) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll 
(+168-6) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll (+220-4) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll (+160-32) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-singlethread.ll 
(+1420) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll (+160-32) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll (+14-2) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-wavefront.ll 
(+1410) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll 
(+576-68) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll (+192) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-singlethread.ll 
(+1152-52) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll (+168) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll 
(+14-1) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-wavefront.ll 
(+1152-52) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll 
(+706-82) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll (+940-98) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-nontemporal.ll 
(+3-2) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-singlethread.ll 
(+1548) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll 
(+940-98) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll 
(+27-7) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-wavefront.ll 
(+1548) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll 
(+940-98) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local.mir (+31) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-volatile.ll 
(+12) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-region.mir (+31) 
- (modified) 
llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands-non-ptr-intrinsics.ll (+4-4) 
- (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands.ll (+4-4) 
- (modified) llvm/test/CodeGen/AMDGPU/stacksave_stackrestore.ll (+6) 
- (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+5) 
- (modified) llvm/test/CodeGen/AMDGPU/wait-before-stores-with-scope_sys.mir 
(+1) 


``````````diff
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp 
b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 3212060f303a5..f015d3ad7811e 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -1074,8 +1074,6 @@ bool 
SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                     SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                     bool IsCrossAddrSpaceOrdering, Position 
Pos,
                                     AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -1149,21 +1147,19 @@ bool 
SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
     }
   }
 
-  if (VMCnt || LGKMCnt) {
-    unsigned WaitCntImmediate =
-      AMDGPU::encodeWaitcnt(IV,
-                            VMCnt ? 0 : getVmcntBitMask(IV),
-                            getExpcntBitMask(IV),
-                            LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
-        .addImm(WaitCntImmediate);
-    Changed = true;
-  }
+  // Always emit a soft wait count, even if it is trivially ~0. 
SIInsertWaitcnts
+  // will later use this marker to add additional waits such as those required
+  // from direct load to LDS (formerly known as LDS DMA).
+  unsigned WaitCntImmediate = AMDGPU::encodeWaitcnt(
+      IV, VMCnt ? 0 : getVmcntBitMask(IV), getExpcntBitMask(IV),
+      LGKMCnt ? 0 : getLgkmcntBitMask(IV));
+  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+      .addImm(WaitCntImmediate);
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx6CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -1966,8 +1962,6 @@ bool 
SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                      SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                      bool IsCrossAddrSpaceOrdering,
                                      Position Pos, AtomicOrdering Order) const 
{
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -2057,28 +2051,25 @@ bool 
SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
     }
   }
 
-  if (VMCnt || LGKMCnt) {
-    unsigned WaitCntImmediate =
-      AMDGPU::encodeWaitcnt(IV,
-                            VMCnt ? 0 : getVmcntBitMask(IV),
-                            getExpcntBitMask(IV),
-                            LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
-        .addImm(WaitCntImmediate);
-    Changed = true;
-  }
+  // Always emit a soft wait count, even if it is trivially ~0. 
SIInsertWaitcnts
+  // will later use this marker to add additional waits such as those required
+  // from direct load to LDS (formerly known as LDS DMA).
+  unsigned WaitCntImmediate = AMDGPU::encodeWaitcnt(
+      IV, VMCnt ? 0 : getVmcntBitMask(IV), getExpcntBitMask(IV),
+      LGKMCnt ? 0 : getLgkmcntBitMask(IV));
+  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+      .addImm(WaitCntImmediate);
 
   if (VSCnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
         .addReg(AMDGPU::SGPR_NULL, RegState::Undef)
         .addImm(0);
-    Changed = true;
   }
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -2287,8 +2278,6 @@ bool 
SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                      SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                      bool IsCrossAddrSpaceOrdering,
                                      Position Pos, AtomicOrdering Order) const 
{
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -2372,23 +2361,26 @@ bool 
SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
       BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_SAMPLECNT_soft)).addImm(0);
     }
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_LOADCNT_soft)).addImm(0);
-    Changed = true;
+  } else {
+    // Always emit a soft wait count, even if it is trivially ~0.
+    // SIInsertWaitcnts will later use this marker to add additional waits such
+    // as those required from direct load to LDS (formerly known as LDS DMA).
+    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_LOADCNT_soft))
+        .addImm(getLoadcntBitMask(IV));
   }
 
   if (STORECnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_STORECNT_soft)).addImm(0);
-    Changed = true;
   }
 
   if (DSCnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_DSCNT_soft)).addImm(0);
-    Changed = true;
   }
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx12CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll 
b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
index 8a80afd4a768f..1bbbec977b714 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
@@ -880,8 +880,8 @@ define amdgpu_kernel void 
@store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX10-NEXT:    s_add_u32 s0, 0x100, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -921,8 +921,8 @@ define amdgpu_kernel void 
@store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) | 
instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX11-NEXT:    scratch_store_b32 v0, v2, off offset:384 dlc
-; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX11-NEXT:    s_add_u32 s0, 0x100, s0
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -991,8 +991,8 @@ define amdgpu_kernel void 
@store_load_vindex_small_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; UNALIGNED_GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX10-NEXT:    s_add_u32 s0, 0x100, s0
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -1032,8 +1032,8 @@ define amdgpu_kernel void 
@store_load_vindex_small_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) 
| instid1(SALU_CYCLE_1)
 ; UNALIGNED_GFX11-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX11-NEXT:    scratch_store_b32 v0, v2, off offset:384 dlc
-; UNALIGNED_GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX11-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX11-NEXT:    s_add_u32 s0, 0x100, s0
 ; UNALIGNED_GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -1520,8 +1520,8 @@ define amdgpu_kernel void 
@store_load_vindex_large_offset_kernel(i32 %n) {
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x4004, v0
 ; GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX10-NEXT:    s_add_u32 s0, 0x4004, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -1633,8 +1633,8 @@ define amdgpu_kernel void 
@store_load_vindex_large_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x4004, v0
 ; UNALIGNED_GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX10-NEXT:    s_add_u32 s0, 0x4004, s0
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
diff --git 
a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll 
b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
index 66037615f0ba0..ea6a5d5e74d52 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
@@ -199,26 +199,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_acquire() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_acquire
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_acquire
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") acquire
@@ -228,26 +234,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_release() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_release
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_release
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") release
@@ -257,26 +269,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_acq_rel
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_acq_rel
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") acq_rel
@@ -286,26 +304,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_seq_cst() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_seq_cst
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_seq_cst
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_seq_cst
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_seq_cst
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_seq_cst
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_seq_cst
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") seq_cst
@@ -501,10 +525,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_acquire
@@ -516,6 +542,7 @@ define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_acquire
@@ -527,6 +554,7 @@ define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") acquire
@@ -536,10 +564,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_release
@@ -550,6 +580,7 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_release
@@ -560,6 +591,7 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") release
@@ -569,10 +601,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_acq_rel
@@ -584,6 +618,7 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_acq_rel
@@ -595,6 +630,7 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") acq_rel
@@ -604,10 +640,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_seq_cst
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_seq_cst
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_seq_cst
@@ -619,6 +657,7 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_seq_cst
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_seq_cst
@@ -630,6 +669,7 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_seq_cst
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") seq_cst
@@ -639,26 +679,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_acquire() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_acquire
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_acquire
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("wavefront-one-as") acquire
@@ -668,26 +714,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_release() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_release
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_release
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("wavefront-one-as") release
@@ -697,26 +749,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_acq_rel
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_acq_rel
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ;...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/147257
_______________________________________________
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits

[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)

Reply via email to