[llvm-branch-commits] [llvm] [Attributor][AMDGPU] Improve the handling of indirect calls (PR #100954)

2024-07-28 Thread Sameer Sahasrabuddhe via llvm-branch-commits

ssahasra wrote:

The apparent change here is simply to reverse the effect of #100952 on the lit 
test. It would be good to have a test that shows what the improvement is.

Also, I think #100952 merely enables AAIndirectCallInfo, and it feels like an 
integral part of this change itself. I would lean towards squashing it into 
this change.

https://github.com/llvm/llvm-project/pull/100954
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [Transforms] Refactor CreateControlFlowHub (PR #103013)

2024-08-12 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra created 
https://github.com/llvm/llvm-project/pull/103013

CreateControlFlowHub is a method that redirects control flow edges from a set 
of incoming blocks to a set of outgoing blocks through a new set of "guard" 
blocks. This is now refactored into a separate file with one enhancement: The 
input to the method is now a set of branches rather than two sets of blocks.

The original implementation reroutes every edge from incoming blocks to 
outgoing blocks. But it is possible that for some incoming block InBB, a 
successor S is in the set of outgoing blocks and yet that particular edge 
should not be rerouted. The new implementation supports this by allowing the 
user to specify, for each branch, which of its targets need to be rerouted.

This is needed when improving the implementation of FixIrreducible #101386. 
Current uses in FixIrreducible and UnifyLoopExits do not demonstrate this finer 
control over the edges being rerouted.
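
A minimal sketch in plain C++ of the data the new interface asks for, compared 
with the old one; the type and field names here are illustrative, not the 
patch's actual API:

```cpp
#include <utility>
#include <vector>

struct BasicBlock; // stand-in for llvm::BasicBlock

// Old interface: two sets of blocks; every edge from an incoming block to an
// outgoing block is rerouted through the hub.
using OldHubInput = std::pair<std::vector<BasicBlock *>,  // incoming blocks
                              std::vector<BasicBlock *>>; // outgoing blocks

// New interface: a list of branches, each naming only the successors that
// should be rerouted. A null successor leaves that edge untouched, even if it
// happens to target one of the outgoing blocks.
struct BranchToReroute {
  BasicBlock *BB;    // block whose terminator is being redirected
  BasicBlock *Succ0; // successor to reroute, or nullptr to keep the edge
  BasicBlock *Succ1; // successor to reroute, or nullptr to keep the edge
};
using NewHubInput = std::vector<BranchToReroute>;
```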

From 9b8d7f65680155f04bc754aebd4d820bad743581 Mon Sep 17 00:00:00 2001
From: Sameer Sahasrabuddhe 
Date: Tue, 13 Aug 2024 12:07:00 +0530
Subject: [PATCH] [Transforms] Refactor CreateControlFlowHub

CreateControlFlowHub is a method that redirects control flow edges from a set of
incoming blocks to a set of outgoing blocks through a new set of "guard" blocks.
This is now refactored into a separate file with one enhancement: The input to
the method is now a set of branches rather than two sets of blocks.

The original implementation reroutes every edge from incoming blocks to outgoing
blocks. But it is possible that for some incoming block InBB, some successor S
might be in the set of outgoing blocks, but that particular edge should not be
rerouted. The new implementation makes this possible by allowing the user to
specify the targets of each branch that need to be rerouted.

This is needed when improving the implementation of FixIrreducible #101386.
Current uses in FixIrreducible and UnifyLoopExits do not demonstrate this finer
control over the edges being rerouted.
---
 .../llvm/Transforms/Utils/BasicBlockUtils.h   |  75 
 .../llvm/Transforms/Utils/ControlFlowUtils.h  | 121 +++
 llvm/lib/Transforms/Utils/BasicBlockUtils.cpp | 314 
 llvm/lib/Transforms/Utils/CMakeLists.txt  |   1 +
 .../lib/Transforms/Utils/ControlFlowUtils.cpp | 341 ++
 llvm/lib/Transforms/Utils/FixIrreducible.cpp  |  35 +-
 llvm/lib/Transforms/Utils/UnifyLoopExits.cpp  |  69 ++--
 .../CodeGen/AMDGPU/local-atomicrmw-fadd.ll|  32 +-
 llvm/test/Transforms/FixIrreducible/basic.ll  |   4 +-
 .../Transforms/FixIrreducible/bug45623.ll |   3 +-
 llvm/test/Transforms/FixIrreducible/nested.ll |   3 +-
 llvm/test/Transforms/FixIrreducible/switch.ll |   3 +-
 .../Transforms/FixIrreducible/unreachable.ll  |   4 +-
 13 files changed, 545 insertions(+), 460 deletions(-)
 create mode 100644 llvm/include/llvm/Transforms/Utils/ControlFlowUtils.h
 create mode 100644 llvm/lib/Transforms/Utils/ControlFlowUtils.cpp

diff --git a/llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h 
b/llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h
index c99df6bf94d025..b447942ffbd676 100644
--- a/llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h
+++ b/llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h
@@ -602,81 +602,6 @@ bool SplitIndirectBrCriticalEdges(Function &F, bool 
IgnoreBlocksWithoutPHI,
   BranchProbabilityInfo *BPI = nullptr,
   BlockFrequencyInfo *BFI = nullptr);
 
-/// Given a set of incoming and outgoing blocks, create a "hub" such that every
-/// edge from an incoming block InBB to an outgoing block OutBB is now split
-/// into two edges, one from InBB to the hub and another from the hub to
-/// OutBB. The hub consists of a series of guard blocks, one for each outgoing
-/// block. Each guard block conditionally branches to the corresponding outgoing
-/// block, or the next guard block in the chain. These guard blocks are returned
-/// in the argument vector.
-///
-/// Since the control flow edges from InBB to OutBB have now been replaced, the
-/// function also updates any PHINodes in OutBB. For each such PHINode, the
-/// operands corresponding to incoming blocks are moved to a new PHINode in the
-/// hub, and the hub is made an operand of the original PHINode.
-///
-/// Input CFG:
-/// --
-///
-///Def
-/// |
-/// v
-///   In1  In2
-///||
-///||
-///vv
-///  Foo ---> Out1 Out2
-/// |
-/// v
-///Use
-///
-///
-/// Create hub: Incoming = {In1, In2}, Outgoing = {Out1, Out2}
-/// --
-///
-/// Def
-///  |
-///  v
-///  In1In2  Foo
-///   |Hub   ||
-///   

[llvm-branch-commits] [llvm] [Transforms] Refactor CreateControlFlowHub (PR #103013)

2024-08-12 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra edited 
https://github.com/llvm/llvm-project/pull/103013
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #103014)

2024-08-13 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra created 
https://github.com/llvm/llvm-project/pull/103014

1. CycleInfo efficiently locates all cycles in a single pass, while the SCC 
computation is repeated inside every natural loop.

2. CycleInfo provides a hierarchy of irreducible cycles, and the new 
implementation transforms each cycle in this hierarchy separately instead of 
reducing an entire irreducible SCC in a single step. This reduces the number of 
control-flow paths that pass through the header of each newly created loop. 
This is evidenced by the reduced number of predecessors on the "guard" blocks 
in the lit tests, and fewer operands on the corresponding PHI nodes.

3. When an entry of an irreducible cycle is the header of a child natural loop, 
the original implementation destroyed that loop. That loop is now preserved, 
since the incoming edges on non-header entries are not touched.
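
A rough sketch of the traversal this implies, assuming LLVM's existing CycleInfo 
API (`toplevel_cycles()`, `children()`, `getNumEntries()`); the helper name and 
body are illustrative only, not the patch itself:

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/CycleInfo.h"
using namespace llvm;

// Walk the cycle hierarchy once and handle each irreducible cycle separately,
// instead of re-running an SCC search inside every natural loop.
static void visitIrreducibleCycles(CycleInfo &CI) {
  SmallVector<Cycle *, 8> Worklist;
  for (Cycle *TopCycle : CI.toplevel_cycles())
    Worklist.push_back(TopCycle);
  while (!Worklist.empty()) {
    Cycle *C = Worklist.pop_back_val();
    for (Cycle *Child : C->children()) // nested cycles are visited as well
      Worklist.push_back(Child);
    if (C->getNumEntries() > 1) {
      // Multiple entries => irreducible: reroute only the entering edges
      // through guard blocks so that the cycle acquires a single header,
      // leaving edges into non-header entries of child loops untouched.
    }
  }
}
```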

From 0ba4872d47179a4d54a06224008cc160905360dc Mon Sep 17 00:00:00 2001
From: Sameer Sahasrabuddhe 
Date: Mon, 12 Aug 2024 14:44:13 +0530
Subject: [PATCH] [FixIrreducible] Use CycleInfo instead of a custom SCC
 traversal

1. CycleInfo efficiently locates all cycles in a single pass, while the SCC
   computation is repeated inside every natural loop.

2. CycleInfo provides a hierarchy of irreducible cycles, and the new
   implementation transforms each cycle in this hierarchy separately instead of
   reducing an entire irreducible SCC in a single step. This reduces the number
   of control-flow paths that pass through the header of each newly created
   loop. This is evidenced by the reduced number of predecessors on the "guard"
   blocks in the lit tests, and fewer operands on the corresponding PHI nodes.

3. When an entry of an irreducible cycle is the header of a child natural loop,
   the original implementation destroyed that loop. That loop is now preserved,
   since the incoming edges on non-header entries are not touched.
---
 llvm/include/llvm/ADT/GenericCycleInfo.h  |  28 +-
 llvm/lib/Transforms/Utils/FixIrreducible.cpp  | 364 +-
 llvm/test/CodeGen/AMDGPU/llc-pipeline.ll  |  15 +-
 llvm/test/Transforms/FixIrreducible/basic.ll  |  98 ++---
 .../Transforms/FixIrreducible/bug45623.ll |   9 +-
 llvm/test/Transforms/FixIrreducible/nested.ll | 143 ---
 llvm/test/Transforms/FixIrreducible/switch.ll |   8 +-
 .../Transforms/FixIrreducible/unreachable.ll  |   1 +
 .../workarounds/needs-fix-reducible.ll|  56 +--
 .../workarounds/needs-fr-ule.ll   | 173 +
 10 files changed, 500 insertions(+), 395 deletions(-)

diff --git a/llvm/include/llvm/ADT/GenericCycleInfo.h 
b/llvm/include/llvm/ADT/GenericCycleInfo.h
index b5d719c6313c43..cf13f8e95a35e3 100644
--- a/llvm/include/llvm/ADT/GenericCycleInfo.h
+++ b/llvm/include/llvm/ADT/GenericCycleInfo.h
@@ -107,6 +107,13 @@ template <typename ContextT> class GenericCycle {
 return is_contained(Entries, Block);
   }
 
+  /// \brief Replace all entries with \p Block as single entry.
+  void setSingleEntry(BlockT *Block) {
+assert(contains(Block));
+Entries.clear();
+Entries.push_back(Block);
+  }
+
   /// \brief Return whether \p Block is contained in the cycle.
   bool contains(const BlockT *Block) const { return Blocks.contains(Block); }
 
@@ -192,11 +199,16 @@ template <typename ContextT> class GenericCycle {
   //@{
   using const_entry_iterator =
      typename SmallVectorImpl<BlockT *>::const_iterator;
-
+  const_entry_iterator entry_begin() const { return Entries.begin(); }
+  const_entry_iterator entry_end() const { return Entries.end(); }
   size_t getNumEntries() const { return Entries.size(); }
   iterator_range entries() const {
-return llvm::make_range(Entries.begin(), Entries.end());
+return llvm::make_range(entry_begin(), entry_end());
   }
+  using const_reverse_entry_iterator =
+      typename SmallVectorImpl<BlockT *>::const_reverse_iterator;
+  const_reverse_entry_iterator entry_rbegin() const { return Entries.rbegin(); 
}
+  const_reverse_entry_iterator entry_rend() const { return Entries.rend(); }
   //@}
 
   Printable printEntries(const ContextT &Ctx) const {
@@ -257,12 +269,6 @@ template <typename ContextT> class GenericCycleInfo {
   /// the subtree.
   void moveTopLevelCycleToNewParent(CycleT *NewParent, CycleT *Child);
 
-  /// Assumes that \p Cycle is the innermost cycle containing \p Block.
-  /// \p Block will be appended to \p Cycle and all of its parent cycles.
-  /// \p Block will be added to BlockMap with \p Cycle and
-  /// BlockMapTopLevel with \p Cycle's top level parent cycle.
-  void addBlockToCycle(BlockT *Block, CycleT *Cycle);
-
 public:
   GenericCycleInfo() = default;
   GenericCycleInfo(GenericCycleInfo &&) = default;
@@ -280,6 +286,12 @@ template <typename ContextT> class GenericCycleInfo {
   unsigned getCycleDepth(const BlockT *Block) const;
   CycleT *getTopLevelParentCycle(BlockT *Block);
 
+  /// Assumes that \p Cycle is the innermost cycle containing \p Block.
+  /// \p Block will be appended to \p Cycle and all of its parent cycles.
+  /// \p Block will be added to BlockMap with \p Cyc

[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #103014)

2024-08-13 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra closed 
https://github.com/llvm/llvm-project/pull/103014
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)

2024-08-21 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra edited 
https://github.com/llvm/llvm-project/pull/101386
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)

2024-08-21 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -189,6 +195,21 @@ template <typename ContextT> class GenericCycle {
   //@{
   using const_entry_iterator =
      typename SmallVectorImpl<BlockT *>::const_iterator;
+  const_entry_iterator entry_begin() const {
+return const_entry_iterator{Entries.begin()};

ssahasra wrote:

Fixed.

https://github.com/llvm/llvm-project/pull/101386
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)

2024-08-21 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -107,6 +107,12 @@ template <typename ContextT> class GenericCycle {
 return is_contained(Entries, Block);
   }
 
+  /// \brief Replace all entries with \p Block as single entry.
+  void setSingleEntry(BlockT *Block) {
+Entries.clear();
+Entries.push_back(Block);

ssahasra wrote:

Fixed.

https://github.com/llvm/llvm-project/pull/101386
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)

2024-08-21 Thread Sameer Sahasrabuddhe via llvm-branch-commits

ssahasra wrote:

> This needs a finer method that redirects only specific edges. Either that, or 
> we let the pass destroy some cycles. But updating `CycleInfo` for these 
> missing subcycles may be a fair amount of work too, so I would rather do it 
> the right way.

This now depends on the newly refactored ControlFlowHub, which correctly 
reroutes only the relevant edges. The effect was already caught in an existing 
test with nested cycles and a common header, so no new test needs to be written 
for this.

https://github.com/llvm/llvm-project/pull/101386
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)

2024-08-21 Thread Sameer Sahasrabuddhe via llvm-branch-commits

ssahasra wrote:

> Note that I have not yet finished verifying all the lit tests. I might also 
> have to add a few more tests, especially involving a mix of irreducible and 
> reducible cycles that are siblings and/or nested inside each other in various 
> combinations. Especially with some overlap in the entry and header nodes.

- New tests added that involve nesting with common header or entry nodes. 
Existing tests also covered some relevant combinations.
- Verified all tests.

https://github.com/llvm/llvm-project/pull/101386
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)

2024-08-21 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra edited 
https://github.com/llvm/llvm-project/pull/101386
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)

2024-08-22 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra closed 
https://github.com/llvm/llvm-project/pull/101386
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)

2024-08-22 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra reopened 
https://github.com/llvm/llvm-project/pull/101386
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] c540ce9 - [AMDGPU] pin lit test divergent-unswitch.ll to the old pass manager

2021-01-20 Thread Sameer Sahasrabuddhe via llvm-branch-commits

Author: Sameer Sahasrabuddhe
Date: 2021-01-20T22:02:09+05:30
New Revision: c540ce9900ff99566b4951186e2f070b3b36cdbe

URL: 
https://github.com/llvm/llvm-project/commit/c540ce9900ff99566b4951186e2f070b3b36cdbe
DIFF: 
https://github.com/llvm/llvm-project/commit/c540ce9900ff99566b4951186e2f070b3b36cdbe.diff

LOG: [AMDGPU] pin lit test divergent-unswitch.ll to the old pass manager

The loop-unswitch transform should not be performed on a loop whose
condition is divergent. For this to happen correctly, divergence
analysis must be available. The existing divergence analysis has not
been ported to the new pass manager yet. As a result, loop unswitching
on the new pass manager is currently unsafe on targets that care about
divergence.

This test is temporarily disabled to unblock work on the new pass
manager. The issue is now tracked in bug 48819.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D95051

Added: 


Modified: 
llvm/test/Transforms/LoopUnswitch/AMDGPU/divergent-unswitch.ll

Removed: 




diff  --git a/llvm/test/Transforms/LoopUnswitch/AMDGPU/divergent-unswitch.ll 
b/llvm/test/Transforms/LoopUnswitch/AMDGPU/divergent-unswitch.ll
index 1f106bd894a8..873a7653973d 100644
--- a/llvm/test/Transforms/LoopUnswitch/AMDGPU/divergent-unswitch.ll
+++ b/llvm/test/Transforms/LoopUnswitch/AMDGPU/divergent-unswitch.ll
@@ -1,4 +1,7 @@
-; RUN: opt -mtriple=amdgcn-- -O3 -S %s | FileCheck %s
+; RUN: opt -mtriple=amdgcn-- -O3 -S -enable-new-pm=0 %s | FileCheck %s
+
+; This fails with the new pass manager:
+; https://bugs.llvm.org/show_bug.cgi?id=48819
 
 ; Check that loop unswitch happened and condition hoisted out of the loop.
 ; Condition is uniform so all targets should perform unswitching.



___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)

2025-01-30 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -342,6 +342,10 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
   typename SyncDependenceAnalysisT::DivergenceDescriptor;
   using BlockLabelMapT = typename SyncDependenceAnalysisT::BlockLabelMap;
 
+  // Use outside cycle with divergent exit
+  using UOCWDE =

ssahasra wrote:

Alternatively, UOCWDE can be renamed to ``TemporalDivergenceTuple``?

https://github.com/llvm/llvm-project/pull/124298
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)

2025-02-04 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra commented:

The changes to UA look good to me. I can't comment much about the actual patch 
itself.

https://github.com/llvm/llvm-project/pull/124298
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)

2025-01-30 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -188,6 +190,37 @@ void 
DivergenceLoweringHelper::constrainAsLaneMask(Incoming &In) {
   In.Reg = Copy.getReg(0);
 }
 
+void replaceUsesOfRegInInstWith(Register Reg, MachineInstr *Inst,
+Register NewReg) {
+  for (MachineOperand &Op : Inst->operands()) {
+if (Op.isReg() && Op.getReg() == Reg)
+  Op.setReg(NewReg);
+  }
+}
+
+bool DivergenceLoweringHelper::lowerTempDivergence() {
+  AMDGPU::IntrinsicLaneMaskAnalyzer ILMA(*MF);
+
+  for (auto [Inst, UseInst, _] : MUI->getUsesOutsideCycleWithDivergentExit()) {
+Register Reg = Inst->getOperand(0).getReg();
+if (MRI->getType(Reg) == LLT::scalar(1) || MUI->isDivergent(Reg) ||
+ILMA.isS32S64LaneMask(Reg))
+  continue;
+
+MachineInstr *MI = const_cast<MachineInstr *>(Inst);

ssahasra wrote:

I lean the other way. If you look at LoopInfoBase or LoopBase, their 
functions take const pointers as arguments but return non-const pointers when 
asked. Sure, an analysis should treat its inputs as const, but when it returns 
something to the client, that client owns it anyway, so forcing that to be 
const is just an inconvenience. I would rather have the analysis do the 
const_cast before returning a list of pointers to something I already own.

This seems to be the first time that the uniformity analysis returns 
something. Until now, the public interface has simply been a bunch of 
predicates like "isUniform" that take a const pointer as an argument.
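
A standalone model of the convention being argued for (plain C++, not LLVM's 
actual classes): the analysis stores const pointers internally and does the 
const_cast once when handing results back to the client, which owns the IR 
anyway.

```cpp
#include <vector>

struct Instr {}; // stand-in for MachineInstr

class Analysis {
  std::vector<const Instr *> Uses; // the analysis treats its input as const

public:
  void record(const Instr *I) { Uses.push_back(I); }

  // Results come back as non-const pointers so the client is not forced to
  // sprinkle const_cast at every use site.
  std::vector<Instr *> uses() const {
    std::vector<Instr *> Out;
    for (const Instr *I : Uses)
      Out.push_back(const_cast<Instr *>(I)); // cast once, inside the analysis
    return Out;
  }
};
```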

https://github.com/llvm/llvm-project/pull/124298
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [clang] [llvm] AMDGPU: Fix libcall recognition of image array types (PR #119832)

2024-12-15 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -622,9 +622,9 @@ bool ItaniumParamParser::parseItaniumParam(StringRef& param,
   if (isDigit(TC)) {
 res.ArgType =
 StringSwitch(eatLengthPrefixedName(param))
-.Case("ocl_image1darray", AMDGPULibFunc::IMG1DA)
-.Case("ocl_image1dbuffer", AMDGPULibFunc::IMG1DB)
-.Case("ocl_image2darray", AMDGPULibFunc::IMG2DA)
+.StartsWith("ocl_image1d_array", AMDGPULibFunc::IMG1DA)
+.StartsWith("ocl_image1d_buffer", AMDGPULibFunc::IMG1DB)
+.StartsWith("ocl_image2d_array", AMDGPULibFunc::IMG2DA)

ssahasra wrote:

Shouldn't this change also fix the mangling generated in `getItaniumTypeName`?

https://github.com/llvm/llvm-project/pull/119832
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [clang] [llvm] AMDGPU: Fix libcall recognition of image array types (PR #119832)

2024-12-15 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra approved this pull request.


https://github.com/llvm/llvm-project/pull/119832
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)

2025-01-28 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -395,6 +399,14 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
   }
 
   void print(raw_ostream &out) const;
+  SmallVector UsesOutsideCycleWithDivergentExit;
+  void recordUseOutsideCycleWithDivergentExit(const InstructionT *,

ssahasra wrote:

Everywhere in this patch, is there some reason to precisely say 
"UseOutsideCycleWithDivergentExit"? Can't we just say "TemporalDivergence"? 

https://github.com/llvm/llvm-project/pull/124298
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)

2025-01-29 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra edited 
https://github.com/llvm/llvm-project/pull/124298
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)

2025-01-29 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -395,6 +399,14 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
   }
 
   void print(raw_ostream &out) const;
+  SmallVector UsesOutsideCycleWithDivergentExit;
+  void recordUseOutsideCycleWithDivergentExit(const InstructionT *,

ssahasra wrote:

You're right. The LLVM doc does not actually define the term "temporal 
divergence", but it has always been used in a way that means "uniform inside 
the cycle, divergent outside the cycle, due to a divergent cycle exit". Whether 
the value is uniform inside the cycle is less important. What matters is that 
values arrive at the use on exits from different iterations by different 
threads. I think we should use the name TemporalDivergence here. It's shorter 
and will show up when someone greps for temporal divergence. Let's also not add 
"Candidate" ... it just makes the name longer with only a little bit of new 
information.
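
As a minimal illustration of that definition (plain C++ standing in for a GPU 
kernel; `tid` models a divergent lane id, and the function name is illustrative):

```cpp
// 'x' receives the same update on every lane while the lane stays in the
// loop, but lanes exit on different iterations (the exit is divergent), so
// the value reaching the use outside the cycle differs per lane. That use is
// what the patch records as temporally divergent.
int temporalDivergenceExample(int tid /* divergent lane id */) {
  int x = 0;
  do {
    ++x;             // uniform update inside the cycle
  } while (x < tid); // divergent exit: each lane leaves on a different trip
  return x;          // use outside the cycle: a per-lane value
}
```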

https://github.com/llvm/llvm-project/pull/124298
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)

2025-01-29 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -342,6 +342,10 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
   typename SyncDependenceAnalysisT::DivergenceDescriptor;
   using BlockLabelMapT = typename SyncDependenceAnalysisT::BlockLabelMap;
 
+  // Use outside cycle with divergent exit
+  using UOCWDE =

ssahasra wrote:

Just a suggestion, I would consider giving the name "TemporalDivergenceList" to 
the entire type ``SmallVector<...>``.

https://github.com/llvm/llvm-project/pull/124298
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)

2025-01-29 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -1210,6 +1240,13 @@ void GenericUniformityAnalysisImpl<ContextT>::print(raw_ostream &OS) const {
   }
 }
 
+template 
+iterator_range::UOCWDE *>

ssahasra wrote:

Just say ``auto`` as the return type here? Or if this needs to be exposed in an 
outer header file, then name a new type such as ``temporal_divergence_range``?

https://github.com/llvm/llvm-project/pull/124298
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)

2025-01-29 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -40,6 +40,10 @@ template <typename ContextT> class GenericUniformityInfo {
   using CycleInfoT = GenericCycleInfo;
   using CycleT = typename CycleInfoT::CycleT;
 
+  // Use outside cycle with divergent exit
+  using UOCWDE =

ssahasra wrote:

This declaration got repeated. One of them can be eliminated?

https://github.com/llvm/llvm-project/pull/124298
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [clang] [llvm] [clang] Redefine `noconvergent` and generate convergence control tokens (PR #136282)

2025-04-18 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra created 
https://github.com/llvm/llvm-project/pull/136282

This introduces the `-fconvergence-control` flag that emits convergence control 
intrinsics which are then used as the `convergencectrl` operand bundle on 
convergent calls.

This also redefines the `noconvergent` attribute in Clang. The existing simple 
interpretation is that if a statement is marked `noconvergent`, then every asm 
call is treated as a non-convergent operation in the emitted LLVM IR.

The new semantics introduces a more powerful notion that a `noconvergent` 
statement may contain convergent operations, but the resulting convergence 
constraints are limited to the scope of that statement. As a whole the 
statement itself does not place any convergence constraints on the control flow 
reaching it. When emitting convergence tokens, this attribute results in a call 
to the `anchor` intrinsic that determines convergence within the statement.
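
A hedged HIP-flavoured sketch of the intent; the statement-attribute spelling 
and the `convergent_op` helper below are assumptions for illustration, and the 
patch's own tests under clang/test/CodeGenHIP define the real syntax:

```cpp
__device__ float convergent_op(float v); // hypothetical convergent function

__global__ void kernel(float *out, float v) {
  float r;
  // Convergent operations inside the marked statement constrain only each
  // other. With -fconvergence-control, clang emits a call to
  // llvm.experimental.convergence.anchor() for the statement and attaches its
  // token as the convergencectrl bundle on the call below, so the statement
  // places no convergence constraint on the control flow reaching it.
  __attribute__((noconvergent)) { r = convergent_op(v); }
  out[0] = r;
}
```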

From 5681859e308283628da481c0ddc09a39345b3d46 Mon Sep 17 00:00:00 2001
From: Sameer Sahasrabuddhe 
Date: Tue, 15 Apr 2025 18:00:01 +0530
Subject: [PATCH] [clang] Redefine `noconvergent` and generate convergence
 control tokens

This introduces the `-fconvergence-control` flag that emits convergence control
intrinsics which are then used as the `convergencectrl` operand bundle on
convergent calls.

This also redefines the `noconvergent` attribute in Clang. The existing simple
interpretation is that if a statement is marked `noconvergent`, then every asm
call is treated as a non-convergent operation in the emitted LLVM IR.

The new semantics introduces a more powerful notion that a `noconvergent`
statement may contain convergent operations, but the resulting convergence
constraints are limited to the scope of that statement. As a whole the statement
itself does not place any convergence constraints on the control flow reaching
it. When emitting convergence tokens, this attribute results in a call to the
`anchor` intrinsic that determines convergence within the statement.
---
 clang/docs/ThreadConvergence.rst  |  27 +
 .../Analysis/Analyses/ConvergenceCheck.h  |   3 +-
 clang/include/clang/Basic/AttrDocs.td |  15 +-
 .../clang/Basic/DiagnosticSemaKinds.td|   2 +
 clang/include/clang/Basic/LangOptions.def |   2 +
 clang/include/clang/Driver/Options.td |   5 +
 clang/lib/Analysis/ConvergenceCheck.cpp   |  43 +-
 clang/lib/CodeGen/CGCall.cpp  |   8 +-
 clang/lib/CodeGen/CGStmt.cpp  |  44 +-
 clang/lib/CodeGen/CodeGenFunction.cpp |  23 +-
 clang/lib/CodeGen/CodeGenFunction.h   |  13 +-
 clang/lib/CodeGen/CodeGenModule.h |   2 +-
 clang/lib/Driver/ToolChains/Clang.cpp |   3 +
 clang/lib/Sema/AnalysisBasedWarnings.cpp  |   8 +-
 clang/test/CodeGenHIP/convergence-tokens.hip  | 687 ++
 .../CodeGenHIP/noconvergent-statement.hip | 109 +++
 .../noconvergent-errors/backwards_jump.hip|  23 +
 .../noconvergent-errors/jump-into-nest.hip|  32 +
 .../SemaHIP/noconvergent-errors/no-errors.hip |  83 +++
 .../noconvergent-errors/simple_jump.hip   |  23 +
 llvm/include/llvm/IR/InstrTypes.h |   8 +-
 llvm/include/llvm/IR/IntrinsicInst.h  |  12 +
 .../Transforms/Utils/FixConvergenceControl.h  |  21 +
 llvm/lib/IR/Instructions.cpp  |   7 +
 llvm/lib/IR/IntrinsicInst.cpp |  21 +
 llvm/lib/Transforms/Utils/CMakeLists.txt  |   1 +
 .../Utils/FixConvergenceControl.cpp   | 191 +
 27 files changed, 1365 insertions(+), 51 deletions(-)
 create mode 100644 clang/test/CodeGenHIP/convergence-tokens.hip
 create mode 100644 clang/test/CodeGenHIP/noconvergent-statement.hip
 create mode 100644 clang/test/SemaHIP/noconvergent-errors/backwards_jump.hip
 create mode 100644 clang/test/SemaHIP/noconvergent-errors/jump-into-nest.hip
 create mode 100644 clang/test/SemaHIP/noconvergent-errors/no-errors.hip
 create mode 100644 clang/test/SemaHIP/noconvergent-errors/simple_jump.hip
 create mode 100644 llvm/include/llvm/Transforms/Utils/FixConvergenceControl.h
 create mode 100644 llvm/lib/Transforms/Utils/FixConvergenceControl.cpp

diff --git a/clang/docs/ThreadConvergence.rst b/clang/docs/ThreadConvergence.rst
index d872ab9cb77f5..ce2ca2cbeacde 100644
--- a/clang/docs/ThreadConvergence.rst
+++ b/clang/docs/ThreadConvergence.rst
@@ -564,6 +564,33 @@ backwards ``goto`` instead of a ``while`` statement.
   ``outside_loop``. This includes threads that jumped from ``G2`` as well as
   threads that  reached ``outside_loop`` after executing ``C``.
 
+.. _noconvergent-statement:
+
+The ``noconvergent`` Statement
+==
+
+When a statement is marked as ``noconvergent`` the convergence of threads at the
+start of this statement is not constrained by any convergent operations inside
+the statement.
+
+- When two threads execute a statement marked ``noconvergent``, it is
+  implementation-

[llvm-branch-commits] [clang] [llvm] [clang] Redefine `noconvergent` and generate convergence control tokens (PR #136282)

2025-04-21 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra edited 
https://github.com/llvm/llvm-project/pull/136282
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [clang] [llvm] [clang] Redefine `noconvergent` and generate convergence control tokens (PR #136282)

2025-04-21 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra edited 
https://github.com/llvm/llvm-project/pull/136282
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [AMDGPU] efficiently wait for direct loads to LDS at all scopes (PR #147258)

2025-07-07 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra created 
https://github.com/llvm/llvm-project/pull/147258

Currently, the memory legalizer does not generate any wait on vmcnt at workgroup
scope. This is incorrect because direct loads to LDS are tracked using vmcnt and
they need to be released properly at workgroup scope.

The memory legalizer was previously updated to always emit a soft wait
instruction even when all counts are trivially ~0. SIInsertWaitcnts now examines
pending loads to LDS at each S_WAITCNT_soft instruction. If such pending loads
exist, the vmcnt (which could be ~0) is upgraded to a value that waits for any
such pending loads to LDS. After that, any soft instruction that has only
trivial ~0 counts is automatically dropped.

Thus, common programs that do not use direct loads to LDS remain unaffected, but
programs that do use such loads see a correct and efficient vmcnt even at
workgroup scope.
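
A standalone model of that policy in plain C++; the helper and its parameters 
are hypothetical (not the actual SIInsertWaitcnts code), and it conservatively 
waits for all pending direct loads to LDS with vmcnt(0):

```cpp
#include <optional>

// A soft wait starts out as "no wait" (~0). If there are direct-to-LDS loads
// still in flight at the release point, the vmcnt is tightened so the wait
// also covers their store component; a soft wait whose counts remain
// trivially ~0 is simply dropped.
constexpr unsigned NoWait = ~0u;

std::optional<unsigned> resolveSoftVmcnt(unsigned SoftVmcnt,
                                         bool HasPendingLdsDmaLoads) {
  unsigned Vmcnt = SoftVmcnt;
  if (HasPendingLdsDmaLoads)
    Vmcnt = 0;             // wait for all pending direct loads to LDS
  if (Vmcnt == NoWait)
    return std::nullopt;   // trivial soft wait: the instruction is removed
  return Vmcnt;            // otherwise emit s_waitcnt vmcnt(Vmcnt)
}
```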

From de111cd96570df7127722cb7df476cb833694f72 Mon Sep 17 00:00:00 2001
From: Sameer Sahasrabuddhe 
Date: Tue, 17 Jun 2025 13:11:55 +0530
Subject: [PATCH 1/2] [AMDGCN] pre-checkin test for LDS DMA and release
 operations

---
 .../AMDGPU/lds-dma-workgroup-release.ll   | 482 ++
 1 file changed, 482 insertions(+)
 create mode 100644 llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll

diff --git a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll 
b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
new file mode 100644
index 0..1db15c3c6099c
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
@@ -0,0 +1,482 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py 
UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck %s 
--check-prefixes=GFX900
+; RUN: llc -mtriple=amdgcn -mcpu=gfx90a < %s | FileCheck %s 
--check-prefixes=GFX90A
+; RUN: llc -mtriple=amdgcn -mcpu=gfx90a -mattr=+tgsplit < %s | FileCheck %s 
--check-prefixes=GFX90A-TGSPLIT
+; RUN: llc -mtriple=amdgcn -mcpu=gfx942 < %s | FileCheck %s 
--check-prefixes=GFX942
+; RUN: llc -mtriple=amdgcn -mcpu=gfx942 -mattr=+tgsplit < %s | FileCheck %s 
--check-prefixes=GFX942-TGSPLIT
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s 
--check-prefixes=GFX1010
+
+; In each of these tests, an LDS DMA operation is followed by a release pattern
+; at workgroup scope. The fence in such a release (implicit or explicit) should
+; wait for the store component in the LDS DMA. The additional noalias metadata
+; is just meant to ensure that the wait counts are not generated due to some
+; unintended aliasing.
+
+declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr 
addrspace(3) nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 
%aux)
+
+define amdgpu_kernel void @barrier_release(<4 x i32> inreg %rsrc,
+; GFX900-LABEL: barrier_release:
+; GFX900:   ; %bb.0: ; %main_body
+; GFX900-NEXT:s_load_dwordx8 s[8:15], s[4:5], 0x24
+; GFX900-NEXT:v_mov_b32_e32 v0, 0x800
+; GFX900-NEXT:v_mov_b32_e32 v1, 0
+; GFX900-NEXT:s_waitcnt lgkmcnt(0)
+; GFX900-NEXT:s_mov_b32 m0, s12
+; GFX900-NEXT:s_nop 0
+; GFX900-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX900-NEXT:v_mov_b32_e32 v0, s13
+; GFX900-NEXT:s_waitcnt vmcnt(0)
+; GFX900-NEXT:s_barrier
+; GFX900-NEXT:ds_read_b32 v0, v0
+; GFX900-NEXT:s_waitcnt lgkmcnt(0)
+; GFX900-NEXT:global_store_dword v1, v0, s[14:15]
+; GFX900-NEXT:s_endpgm
+;
+; GFX90A-LABEL: barrier_release:
+; GFX90A:   ; %bb.1:
+; GFX90A-NEXT:s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX90A-NEXT:s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX90A-NEXT:s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:s_branch .LBB0_0
+; GFX90A-NEXT:.p2align 8
+; GFX90A-NEXT:  ; %bb.2:
+; GFX90A-NEXT:  .LBB0_0: ; %main_body
+; GFX90A-NEXT:s_mov_b32 m0, s12
+; GFX90A-NEXT:v_mov_b32_e32 v0, 0x800
+; GFX90A-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX90A-NEXT:v_mov_b32_e32 v0, s13
+; GFX90A-NEXT:s_load_dwordx2 s[0:1], s[4:5], 0x3c
+; GFX90A-NEXT:s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:s_barrier
+; GFX90A-NEXT:s_waitcnt vmcnt(0)
+; GFX90A-NEXT:ds_read_b32 v0, v0
+; GFX90A-NEXT:v_mov_b32_e32 v1, 0
+; GFX90A-NEXT:s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:global_store_dword v1, v0, s[0:1]
+; GFX90A-NEXT:s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: barrier_release:
+; GFX90A-TGSPLIT:   ; %bb.1:
+; GFX90A-TGSPLIT-NEXT:s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX90A-TGSPLIT-NEXT:s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX90A-TGSPLIT-NEXT:s_waitcnt lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:s_branch .LBB0_0
+; GFX90A-TGSPLIT-NEXT:.p2align 8
+; GFX90A-TGSPLIT-NEXT:  ; %bb.2:
+; GFX90A-TGSPLIT-NEXT:  .LBB0_0: ; %main_body
+; GFX90A-TGSPLIT-NEXT:s_mov_b32 m0, s12
+; GFX90A-TGSPLIT-NEXT:v_mov_b32_e32 v0, 0x800
+; GFX90A-TGSPLIT-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX90A-TGSPLIT-NEXT:v_mov_b32_e32 v0, s13
+; G

[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)

2025-07-07 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -669,6 +679,7 @@ define amdgpu_kernel void @global_volatile_store_1(
 ; GFX12-WGP-NEXT:s_wait_kmcnt 0x0
 ; GFX12-WGP-NEXT:s_wait_storecnt 0x0
 ; GFX12-WGP-NEXT:global_store_b32 v0, v1, s[0:1] scope:SCOPE_SYS
+; GFX12-WGP-NEXT:s_wait_loadcnt 0x3f

ssahasra wrote:

Not directly related to this discussion, but this line does exist:
```
   1390   // Merge consecutive waitcnt of the same type by erasing multiples.
   1391   if (WaitcntInstr || (!Wait.hasWaitExceptStoreCnt() && TrySimplify)) {
```
It is meant to preserve S_WAITCNT_soft even if there is no actual wait 
required. @jayfoad, you had introduced `TrySimplify` ... do you think it is 
okay to relax its uses?

```
   1373   if (TrySimplify **|| (Opcode != II.getOpcode() && OldWait.hasValuesSetToMax())**)
   1374 ScoreBrackets.simplifyWaitcnt(OldWait);
```
Here, `hasValuesSetToMax()` is a hypothetical function that checks whether the 
encoding of each count separately has all bits set to 1, and not just ~0 in the 
data structure.

https://github.com/llvm/llvm-project/pull/147257
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)

2025-07-07 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra edited 
https://github.com/llvm/llvm-project/pull/147257
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)

2025-07-07 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -669,6 +679,7 @@ define amdgpu_kernel void @global_volatile_store_1(
 ; GFX12-WGP-NEXT:s_wait_kmcnt 0x0
 ; GFX12-WGP-NEXT:s_wait_storecnt 0x0
 ; GFX12-WGP-NEXT:global_store_b32 v0, v1, s[0:1] scope:SCOPE_SYS
+; GFX12-WGP-NEXT:s_wait_loadcnt 0x3f

ssahasra wrote:

> These should always be printed with the named counter syntax

I haven't checked what's different about this wait count for it to be printed 
like this. I will need to follow that up as a separate change.

https://github.com/llvm/llvm-project/pull/147257
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)

2025-07-07 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -669,6 +679,7 @@ define amdgpu_kernel void @global_volatile_store_1(
 ; GFX12-WGP-NEXT:s_wait_kmcnt 0x0
 ; GFX12-WGP-NEXT:s_wait_storecnt 0x0
 ; GFX12-WGP-NEXT:global_store_b32 v0, v1, s[0:1] scope:SCOPE_SYS
+; GFX12-WGP-NEXT:s_wait_loadcnt 0x3f

ssahasra wrote:

If we agree with the basic design, then these are expected. There's a whole 
bunch of tests that either stop at the memory legalizer, or they run llc with 
`-O0`, like this one. The "trivial" wait counts show up in all these tests 
because SIInsertWaitcnts did not get a chance to clean them up. In particular, 
see how `TrySimplify` in that pass controls whether or not to clean up these 
wait counts. They disappear in the optimized ISA output.

https://github.com/llvm/llvm-project/pull/147257
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)

2025-07-07 Thread Sameer Sahasrabuddhe via llvm-branch-commits


@@ -669,6 +679,7 @@ define amdgpu_kernel void @global_volatile_store_1(
 ; GFX12-WGP-NEXT:s_wait_kmcnt 0x0
 ; GFX12-WGP-NEXT:s_wait_storecnt 0x0
 ; GFX12-WGP-NEXT:global_store_b32 v0, v1, s[0:1] scope:SCOPE_SYS
+; GFX12-WGP-NEXT:s_wait_loadcnt 0x3f

ssahasra wrote:

Yes, I did consider that as an option. But there is the hypothetical corner 
case where the memory legalizer might deliberately compute the wait count to be 
so large that it gets clamped at the max value (not the same as ~0, strictly 
speaking). If that is not an issue, it will significantly reduce the diff for 
tests that don't stop after the legalizer.

https://github.com/llvm/llvm-project/pull/147257
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)

2025-07-07 Thread Sameer Sahasrabuddhe via llvm-branch-commits

ssahasra wrote:

This is part of a stack:

- #147258
- #147257 
- #147256 

https://github.com/llvm/llvm-project/pull/147257
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [AMDGPU] efficiently wait for direct loads to LDS at all scopes (PR #147258)

2025-07-07 Thread Sameer Sahasrabuddhe via llvm-branch-commits

ssahasra wrote:

This is part of a stack:

- #147258
- #147257 
- #147256 

https://github.com/llvm/llvm-project/pull/147258
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [AMDGPU] efficiently wait for direct loads to LDS at all scopes (PR #147258)

2025-07-09 Thread Sameer Sahasrabuddhe via llvm-branch-commits

ssahasra wrote:

Note that the best way to see the effect of this PR is to view only the second 
diff of the two in this PR. It shows how the missing vmcnt(0) shows up in the 
new test introduced by the first commit.

https://github.com/llvm/llvm-project/pull/147258
___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [AMDGPU] efficiently wait for direct loads to LDS at all scopes (PR #147258)

2025-07-09 Thread Sameer Sahasrabuddhe via llvm-branch-commits

https://github.com/ssahasra updated 
https://github.com/llvm/llvm-project/pull/147258

From 95ffad8e0c22f261999f8a87abde8592c0596395 Mon Sep 17 00:00:00 2001
From: Sameer Sahasrabuddhe 
Date: Tue, 17 Jun 2025 13:11:55 +0530
Subject: [PATCH 1/2] [AMDGCN] pre-checkin test for LDS DMA and release
 operations

---
 .../AMDGPU/lds-dma-workgroup-release.ll   | 482 ++
 1 file changed, 482 insertions(+)
 create mode 100644 llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll

diff --git a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll 
b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
new file mode 100644
index 0..1db15c3c6099c
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
@@ -0,0 +1,482 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py 
UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck %s 
--check-prefixes=GFX900
+; RUN: llc -mtriple=amdgcn -mcpu=gfx90a < %s | FileCheck %s 
--check-prefixes=GFX90A
+; RUN: llc -mtriple=amdgcn -mcpu=gfx90a -mattr=+tgsplit < %s | FileCheck %s 
--check-prefixes=GFX90A-TGSPLIT
+; RUN: llc -mtriple=amdgcn -mcpu=gfx942 < %s | FileCheck %s 
--check-prefixes=GFX942
+; RUN: llc -mtriple=amdgcn -mcpu=gfx942 -mattr=+tgsplit < %s | FileCheck %s 
--check-prefixes=GFX942-TGSPLIT
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s 
--check-prefixes=GFX1010
+
+; In each of these tests, an LDS DMA operation is followed by a release pattern
+; at workgroup scope. The fence in such a release (implicit or explicit) should
+; wait for the store component in the LDS DMA. The additional noalias metadata
+; is just meant to ensure that the wait counts are not generated due to some
+; unintended aliasing.
+
+declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr 
addrspace(3) nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 
%aux)
+
+define amdgpu_kernel void @barrier_release(<4 x i32> inreg %rsrc,
+; GFX900-LABEL: barrier_release:
+; GFX900:   ; %bb.0: ; %main_body
+; GFX900-NEXT:s_load_dwordx8 s[8:15], s[4:5], 0x24
+; GFX900-NEXT:v_mov_b32_e32 v0, 0x800
+; GFX900-NEXT:v_mov_b32_e32 v1, 0
+; GFX900-NEXT:s_waitcnt lgkmcnt(0)
+; GFX900-NEXT:s_mov_b32 m0, s12
+; GFX900-NEXT:s_nop 0
+; GFX900-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX900-NEXT:v_mov_b32_e32 v0, s13
+; GFX900-NEXT:s_waitcnt vmcnt(0)
+; GFX900-NEXT:s_barrier
+; GFX900-NEXT:ds_read_b32 v0, v0
+; GFX900-NEXT:s_waitcnt lgkmcnt(0)
+; GFX900-NEXT:global_store_dword v1, v0, s[14:15]
+; GFX900-NEXT:s_endpgm
+;
+; GFX90A-LABEL: barrier_release:
+; GFX90A:   ; %bb.1:
+; GFX90A-NEXT:s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX90A-NEXT:s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX90A-NEXT:s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:s_branch .LBB0_0
+; GFX90A-NEXT:.p2align 8
+; GFX90A-NEXT:  ; %bb.2:
+; GFX90A-NEXT:  .LBB0_0: ; %main_body
+; GFX90A-NEXT:s_mov_b32 m0, s12
+; GFX90A-NEXT:v_mov_b32_e32 v0, 0x800
+; GFX90A-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX90A-NEXT:v_mov_b32_e32 v0, s13
+; GFX90A-NEXT:s_load_dwordx2 s[0:1], s[4:5], 0x3c
+; GFX90A-NEXT:s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:s_barrier
+; GFX90A-NEXT:s_waitcnt vmcnt(0)
+; GFX90A-NEXT:ds_read_b32 v0, v0
+; GFX90A-NEXT:v_mov_b32_e32 v1, 0
+; GFX90A-NEXT:s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:global_store_dword v1, v0, s[0:1]
+; GFX90A-NEXT:s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: barrier_release:
+; GFX90A-TGSPLIT:   ; %bb.1:
+; GFX90A-TGSPLIT-NEXT:s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX90A-TGSPLIT-NEXT:s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX90A-TGSPLIT-NEXT:s_waitcnt lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:s_branch .LBB0_0
+; GFX90A-TGSPLIT-NEXT:.p2align 8
+; GFX90A-TGSPLIT-NEXT:  ; %bb.2:
+; GFX90A-TGSPLIT-NEXT:  .LBB0_0: ; %main_body
+; GFX90A-TGSPLIT-NEXT:s_mov_b32 m0, s12
+; GFX90A-TGSPLIT-NEXT:v_mov_b32_e32 v0, 0x800
+; GFX90A-TGSPLIT-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX90A-TGSPLIT-NEXT:v_mov_b32_e32 v0, s13
+; GFX90A-TGSPLIT-NEXT:s_load_dwordx2 s[0:1], s[4:5], 0x3c
+; GFX90A-TGSPLIT-NEXT:s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:s_barrier
+; GFX90A-TGSPLIT-NEXT:buffer_wbinvl1_vol
+; GFX90A-TGSPLIT-NEXT:ds_read_b32 v0, v0
+; GFX90A-TGSPLIT-NEXT:v_mov_b32_e32 v1, 0
+; GFX90A-TGSPLIT-NEXT:s_waitcnt lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:global_store_dword v1, v0, s[0:1]
+; GFX90A-TGSPLIT-NEXT:s_endpgm
+;
+; GFX942-LABEL: barrier_release:
+; GFX942:   ; %bb.1:
+; GFX942-NEXT:s_load_dwordx4 s[8:11], s[4:5], 0x0
+; GFX942-NEXT:s_load_dwordx2 s[12:13], s[4:5], 0x10
+; GFX942-NEXT:s_waitcnt lgkmcnt(0)
+; GFX942-NEXT:s_branch .LBB0_0
+; GFX942-NEXT:.p2align 8
+; GFX942-NEXT:  ; %bb.2:
+; GFX942-NEXT:  .LBB0_0: ; %main_body
+; GFX942-NEXT:s_mov_b32 m0, s1