[llvm-branch-commits] [llvm] [Attributor][AMDGPU] Improve the handling of indirect calls (PR #100954)
ssahasra wrote: The apparent change here is to simply reverse the effect of #100952 on the lit test. Would be good to have a test which shows what the improvement is. Also, I think #100952 merely enables AAIndirectCallInfo, and feels like an integral part of this change itself. I would lean towards squashing it into this change. https://github.com/llvm/llvm-project/pull/100954 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [Transforms] Refactor CreateControlFlowHub (PR #103013)
https://github.com/ssahasra created https://github.com/llvm/llvm-project/pull/103013 CreateControlFlowHub is a method that redirects control flow edges from a set of incoming blocks to a set of outgoing blocks through a new set of "guard" blocks. This is now refactored into a separate file with one enhancement: The input to the method is now a set of branches rather than two sets of blocks. The original implementation reroutes every edge from incoming blocks to outgoing blocks. But it is possible that for some incoming block InBB, some successor S might be in the set of outgoing blocks, but that particular edge should not be rerouted. The new implementation makes this possible by allowing the user to specify the targets of each branch that need to be rerouted. This is needed when improving the implementation of FixIrreducible #101386. Current uses in FixIrreducible and UnifyLoopExits do not demonstrate this finer control over the edges being rerouted. >From 9b8d7f65680155f04bc754aebd4d820bad743581 Mon Sep 17 00:00:00 2001 From: Sameer Sahasrabuddhe Date: Tue, 13 Aug 2024 12:07:00 +0530 Subject: [PATCH] [Transforms] Refactor CreateControlFlowHub CreateControlFlowHub is a method that redirects control flow edges from a set of incoming blocks to a set of outgoing blocks through a new set of "guard" blocks. This is now refactored into a separate file with one enhancement: The input to the method is now a set of branches rather than two sets of blocks. The original implementation reroutes every edge from incoming blocks to outgoing blocks. But it is possible that for some incoming block InBB, some successor S might be in the set of outgoing blocks, but that particular edge should not be rerouted. The new implementation makes this possible by allowing the user to specify the targets of each branch that need to be rerouted. This is needed when improving the implementation of FixIrreducible #101386. 
Current uses in FixIrreducible and UnifyLoopExits do not demonstrate this finer control over the edges being rerouted. --- .../llvm/Transforms/Utils/BasicBlockUtils.h | 75 .../llvm/Transforms/Utils/ControlFlowUtils.h | 121 +++ llvm/lib/Transforms/Utils/BasicBlockUtils.cpp | 314 llvm/lib/Transforms/Utils/CMakeLists.txt | 1 + .../lib/Transforms/Utils/ControlFlowUtils.cpp | 341 ++ llvm/lib/Transforms/Utils/FixIrreducible.cpp | 35 +- llvm/lib/Transforms/Utils/UnifyLoopExits.cpp | 69 ++-- .../CodeGen/AMDGPU/local-atomicrmw-fadd.ll| 32 +- llvm/test/Transforms/FixIrreducible/basic.ll | 4 +- .../Transforms/FixIrreducible/bug45623.ll | 3 +- llvm/test/Transforms/FixIrreducible/nested.ll | 3 +- llvm/test/Transforms/FixIrreducible/switch.ll | 3 +- .../Transforms/FixIrreducible/unreachable.ll | 4 +- 13 files changed, 545 insertions(+), 460 deletions(-) create mode 100644 llvm/include/llvm/Transforms/Utils/ControlFlowUtils.h create mode 100644 llvm/lib/Transforms/Utils/ControlFlowUtils.cpp diff --git a/llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h b/llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h index c99df6bf94d025..b447942ffbd676 100644 --- a/llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h +++ b/llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h @@ -602,81 +602,6 @@ bool SplitIndirectBrCriticalEdges(Function &F, bool IgnoreBlocksWithoutPHI, BranchProbabilityInfo *BPI = nullptr, BlockFrequencyInfo *BFI = nullptr); -/// Given a set of incoming and outgoing blocks, create a "hub" such that every -/// edge from an incoming block InBB to an outgoing block OutBB is now split -/// into two edges, one from InBB to the hub and another from the hub to -/// OutBB. The hub consists of a series of guard blocks, one for each outgoing -/// block. Each guard block conditionally branches to the corresponding outgoing -/// block, or the next guard block in the chain. These guard blocks are returned -/// in the argument vector. 
-/// -/// Since the control flow edges from InBB to OutBB have now been replaced, the -/// function also updates any PHINodes in OutBB. For each such PHINode, the -/// operands corresponding to incoming blocks are moved to a new PHINode in the -/// hub, and the hub is made an operand of the original PHINode. -/// -/// Input CFG: -/// -- -/// -///Def -/// | -/// v -/// In1 In2 -///|| -///|| -///vv -/// Foo ---> Out1 Out2 -/// | -/// v -///Use -/// -/// -/// Create hub: Incoming = {In1, In2}, Outgoing = {Out1, Out2} -/// -- -/// -/// Def -/// | -/// v -/// In1In2 Foo -/// |Hub || -///
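The edge-selective rerouting this PR enables can be illustrated with a small standalone sketch. All names and the graph representation below are invented for illustration; this is not the LLVM `CreateControlFlowHub` API, just a toy model of the guard-block chain it builds, including the new ability to leave some incoming-to-outgoing edges untouched:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy CFG: each block simply names its successors.
struct ToyCFG {
  std::map<std::string, std::vector<std::string>> Succs;
};

// Reroute each (In, Out) edge listed in EdgesToReroute through a chain of
// guard blocks, one guard per outgoing block. Each guard conditionally
// branches to its outgoing block or falls through to the next guard.
std::vector<std::string>
createControlFlowHub(ToyCFG &CFG,
                     const std::vector<std::pair<std::string, std::string>>
                         &EdgesToReroute,
                     const std::vector<std::string> &Outgoing) {
  std::vector<std::string> Guards;
  for (const std::string &Out : Outgoing)
    Guards.push_back("Guard." + Out);

  // Redirect only the requested edges to the first guard; other successors
  // of the same incoming block are left untouched. This is the finer
  // control the refactoring adds over the old two-sets-of-blocks interface.
  for (const auto &[In, Out] : EdgesToReroute)
    for (std::string &S : CFG.Succs[In])
      if (S == Out)
        S = Guards.front();

  // Guard I either exits to Outgoing[I] or continues to guard I+1; the
  // last guard branches unconditionally to its outgoing block.
  for (size_t I = 0; I < Guards.size(); ++I) {
    CFG.Succs[Guards[I]] = {Outgoing[I]};
    if (I + 1 < Guards.size())
      CFG.Succs[Guards[I]].push_back(Guards[I + 1]);
  }
  return Guards;
}
```

In a toy CFG where `Foo` also branches to `Out1`, rerouting only `{In1, Out1}` and `{In2, Out2}` leaves the `Foo -> Out1` edge intact, which the old incoming/outgoing-blocks interface could not express.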
[llvm-branch-commits] [llvm] [Transforms] Refactor CreateControlFlowHub (PR #103013)
https://github.com/ssahasra edited https://github.com/llvm/llvm-project/pull/103013
[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #103014)
https://github.com/ssahasra created https://github.com/llvm/llvm-project/pull/103014 1. CycleInfo efficiently locates all cycles in a single pass, while the SCC is repeated inside every natural loop. 2. CycleInfo provides a hierarchy of irreducible cycles, and the new implementation transforms each cycle in this hierarchy separately instead of reducing an entire irreducible SCC in a single step. This reduces the number of control-flow paths that pass through the header of each newly created loop. This is evidenced by the reduced number of predecessors on the "guard" blocks in the lit tests, and fewer operands on the corresponding PHI nodes. 3. When an entry of an irreducible cycle is the header of a child natural loop, the original implementation destroyed that loop. This is now preserved, since the incoming edges on non-header entries are not touched. >From 0ba4872d47179a4d54a06224008cc160905360dc Mon Sep 17 00:00:00 2001 From: Sameer Sahasrabuddhe Date: Mon, 12 Aug 2024 14:44:13 +0530 Subject: [PATCH] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal 1. CycleInfo efficiently locates all cycles in a single pass, while the SCC is repeated inside every natural loop. 2. CycleInfo provides a hierarchy of irreducible cycles, and the new implementation transforms each cycle in this hierarchy separately instead of reducing an entire irreducible SCC in a single step. This reduces the number of control-flow paths that pass through the header of each newly created loop. This is evidenced by the reduced number of predecessors on the "guard" blocks in the lit tests, and fewer operands on the corresponding PHI nodes. 3. When an entry of an irreducible cycle is the header of a child natural loop, the original implementation destroyed that loop. This is now preserved, since the incoming edges on non-header entries are not touched. 
--- llvm/include/llvm/ADT/GenericCycleInfo.h | 28 +- llvm/lib/Transforms/Utils/FixIrreducible.cpp | 364 +- llvm/test/CodeGen/AMDGPU/llc-pipeline.ll | 15 +- llvm/test/Transforms/FixIrreducible/basic.ll | 98 ++--- .../Transforms/FixIrreducible/bug45623.ll | 9 +- llvm/test/Transforms/FixIrreducible/nested.ll | 143 --- llvm/test/Transforms/FixIrreducible/switch.ll | 8 +- .../Transforms/FixIrreducible/unreachable.ll | 1 + .../workarounds/needs-fix-reducible.ll| 56 +-- .../workarounds/needs-fr-ule.ll | 173 + 10 files changed, 500 insertions(+), 395 deletions(-) diff --git a/llvm/include/llvm/ADT/GenericCycleInfo.h b/llvm/include/llvm/ADT/GenericCycleInfo.h index b5d719c6313c43..cf13f8e95a35e3 100644 --- a/llvm/include/llvm/ADT/GenericCycleInfo.h +++ b/llvm/include/llvm/ADT/GenericCycleInfo.h @@ -107,6 +107,13 @@ template class GenericCycle { return is_contained(Entries, Block); } + /// \brief Replace all entries with \p Block as single entry. + void setSingleEntry(BlockT *Block) { +assert(contains(Block)); +Entries.clear(); +Entries.push_back(Block); + } + /// \brief Return whether \p Block is contained in the cycle. 
bool contains(const BlockT *Block) const { return Blocks.contains(Block); } @@ -192,11 +199,16 @@ template class GenericCycle { //@{ using const_entry_iterator = typename SmallVectorImpl::const_iterator; - + const_entry_iterator entry_begin() const { return Entries.begin(); } + const_entry_iterator entry_end() const { return Entries.end(); } size_t getNumEntries() const { return Entries.size(); } iterator_range entries() const { -return llvm::make_range(Entries.begin(), Entries.end()); +return llvm::make_range(entry_begin(), entry_end()); } + using const_reverse_entry_iterator = + typename SmallVectorImpl::const_reverse_iterator; + const_reverse_entry_iterator entry_rbegin() const { return Entries.rbegin(); } + const_reverse_entry_iterator entry_rend() const { return Entries.rend(); } //@} Printable printEntries(const ContextT &Ctx) const { @@ -257,12 +269,6 @@ template class GenericCycleInfo { /// the subtree. void moveTopLevelCycleToNewParent(CycleT *NewParent, CycleT *Child); - /// Assumes that \p Cycle is the innermost cycle containing \p Block. - /// \p Block will be appended to \p Cycle and all of its parent cycles. - /// \p Block will be added to BlockMap with \p Cycle and - /// BlockMapTopLevel with \p Cycle's top level parent cycle. - void addBlockToCycle(BlockT *Block, CycleT *Cycle); - public: GenericCycleInfo() = default; GenericCycleInfo(GenericCycleInfo &&) = default; @@ -280,6 +286,12 @@ template class GenericCycleInfo { unsigned getCycleDepth(const BlockT *Block) const; CycleT *getTopLevelParentCycle(BlockT *Block); + /// Assumes that \p Cycle is the innermost cycle containing \p Block. + /// \p Block will be appended to \p Cycle and all of its parent cycles. + /// \p Block will be added to BlockMap with \p Cyc
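The per-cycle transformation described in points 1–3 can be sketched abstractly. This is a hypothetical toy model, not the actual `GenericCycle` classes; the traversal order and block naming are illustrative only:

```cpp
#include <cassert>
#include <string>
#include <vector>

struct ToyCycle {
  std::vector<std::string> Entries; // more than one entry => irreducible
  std::vector<ToyCycle> Children;
};

// Visit children before the parent, so that each irreducible cycle in the
// hierarchy is reduced separately rather than collapsing a whole SCC at
// once. A cycle with a single entry is already a natural loop and is left
// alone, which is how a child natural loop whose header is a non-header
// entry of the parent survives the transform.
void fixIrreducible(ToyCycle &C, std::vector<std::string> &NewHeaders) {
  for (ToyCycle &Child : C.Children)
    fixIrreducible(Child, NewHeaders);
  if (C.Entries.size() > 1) {
    // Route all entry edges through one new header block, then record it
    // as the single entry (mirrors GenericCycle::setSingleEntry above).
    std::string Header = "guard." + C.Entries.front();
    NewHeaders.push_back(Header);
    C.Entries.assign(1, Header);
  }
}
```

Reducing each cycle in the hierarchy separately is what keeps the number of predecessors on each new guard block, and hence the PHI operand counts, smaller than in the old whole-SCC approach.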
[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #103014)
https://github.com/ssahasra closed https://github.com/llvm/llvm-project/pull/103014
[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)
https://github.com/ssahasra edited https://github.com/llvm/llvm-project/pull/101386
[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)
@@ -189,6 +195,21 @@ template class GenericCycle { //@{ using const_entry_iterator = typename SmallVectorImpl::const_iterator; + const_entry_iterator entry_begin() const { +return const_entry_iterator{Entries.begin()}; ssahasra wrote: Fixed. https://github.com/llvm/llvm-project/pull/101386
[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)
@@ -107,6 +107,12 @@ template class GenericCycle { return is_contained(Entries, Block); } + /// \brief Replace all entries with \p Block as single entry. + void setSingleEntry(BlockT *Block) { +Entries.clear(); +Entries.push_back(Block); ssahasra wrote: Fixed. https://github.com/llvm/llvm-project/pull/101386
[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)
ssahasra wrote: > This needs a finer method that redirects only specific edges. Either that, or > we let the pass destroy some cycles. But updating `CycleInfo` for these > missing subcycles may be a fair amount of work too, so I would rather do it > the right way. This now depends on the newly refactored ControlFlowHub, which correctly reroutes only the relevant edges. The effect was already caught in an existing test with nested cycles and a common header, so no new test needs to be written for this. https://github.com/llvm/llvm-project/pull/101386
[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)
ssahasra wrote: > Note that I have not yet finished verifying all the lit tests. I might also > have to add a few more tests, especially involving a mix of irreducible and > reducible cycles that are siblings and/or nested inside each other in various > combinations. Especially with some overlap in the entry and header nodes. - New tests added that involve nesting with common header or entry nodes. Existing tests also covered some relevant combinations. - Verified all tests. https://github.com/llvm/llvm-project/pull/101386
[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)
https://github.com/ssahasra edited https://github.com/llvm/llvm-project/pull/101386
[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)
https://github.com/ssahasra closed https://github.com/llvm/llvm-project/pull/101386
[llvm-branch-commits] [llvm] [FixIrreducible] Use CycleInfo instead of a custom SCC traversal (PR #101386)
https://github.com/ssahasra reopened https://github.com/llvm/llvm-project/pull/101386
[llvm-branch-commits] [llvm] c540ce9 - [AMDGPU] pin lit test divergent-unswitch.ll to the old pass manager
Author: Sameer Sahasrabuddhe Date: 2021-01-20T22:02:09+05:30 New Revision: c540ce9900ff99566b4951186e2f070b3b36cdbe URL: https://github.com/llvm/llvm-project/commit/c540ce9900ff99566b4951186e2f070b3b36cdbe DIFF: https://github.com/llvm/llvm-project/commit/c540ce9900ff99566b4951186e2f070b3b36cdbe.diff LOG: [AMDGPU] pin lit test divergent-unswitch.ll to the old pass manager The loop-unswitch transform should not be performed on a loop whose condition is divergent. For this to happen correctly, divergence analysis must be available. The existing divergence analysis has not been ported to the new pass manager yet. As a result, loop unswitching on the new pass manager is currently unsafe on targets that care about divergence. This test is temporarily disabled to unblock work on the new pass manager. The issue is now tracked in bug 48819. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D95051 Added: Modified: llvm/test/Transforms/LoopUnswitch/AMDGPU/divergent-unswitch.ll Removed: diff --git a/llvm/test/Transforms/LoopUnswitch/AMDGPU/divergent-unswitch.ll b/llvm/test/Transforms/LoopUnswitch/AMDGPU/divergent-unswitch.ll index 1f106bd894a8..873a7653973d 100644 --- a/llvm/test/Transforms/LoopUnswitch/AMDGPU/divergent-unswitch.ll +++ b/llvm/test/Transforms/LoopUnswitch/AMDGPU/divergent-unswitch.ll @@ -1,4 +1,7 @@ -; RUN: opt -mtriple=amdgcn-- -O3 -S %s | FileCheck %s +; RUN: opt -mtriple=amdgcn-- -O3 -S -enable-new-pm=0 %s | FileCheck %s + +; This fails with the new pass manager: +; https://bugs.llvm.org/show_bug.cgi?id=48819 ; Check that loop unswitch happened and condition hoisted out of the loop. ; Condition is uniform so all targets should perform unswitching.
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
@@ -342,6 +342,10 @@ template class GenericUniformityAnalysisImpl { typename SyncDependenceAnalysisT::DivergenceDescriptor; using BlockLabelMapT = typename SyncDependenceAnalysisT::BlockLabelMap; + // Use outside cycle with divergent exit + using UOCWDE = ssahasra wrote: Alternatively, UOCWDE can be renamed to ``TemporalDivergenceTuple``? https://github.com/llvm/llvm-project/pull/124298
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
https://github.com/ssahasra commented: The changes to UA look good to me. I can't comment much about the actual patch itself. https://github.com/llvm/llvm-project/pull/124298
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
@@ -188,6 +190,37 @@ void DivergenceLoweringHelper::constrainAsLaneMask(Incoming &In) { In.Reg = Copy.getReg(0); } +void replaceUsesOfRegInInstWith(Register Reg, MachineInstr *Inst, +Register NewReg) { + for (MachineOperand &Op : Inst->operands()) { +if (Op.isReg() && Op.getReg() == Reg) + Op.setReg(NewReg); + } +} + +bool DivergenceLoweringHelper::lowerTempDivergence() { + AMDGPU::IntrinsicLaneMaskAnalyzer ILMA(*MF); + + for (auto [Inst, UseInst, _] : MUI->getUsesOutsideCycleWithDivergentExit()) { +Register Reg = Inst->getOperand(0).getReg(); +if (MRI->getType(Reg) == LLT::scalar(1) || MUI->isDivergent(Reg) || +ILMA.isS32S64LaneMask(Reg)) + continue; + +MachineInstr *MI = const_cast(Inst); ssahasra wrote: I lean on the other side. If you look at LoopInfoBase or LoopBase, their functions take const pointers as arguments but return non-const pointers when asked. Sure, an analysis should treat its inputs as const, but when it returns something to the client, that client owns it anyway, so forcing that to be const is just an inconvenience. I would rather have the analysis do the const_cast before returning a list of pointers to something I already own. This seems to be the first time that uniformity analysis is returning something. Until now, the public interface has simply been a bunch of predicates like "isUniform" that take a const pointer as arguments. https://github.com/llvm/llvm-project/pull/124298
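The const convention being argued for here, as in LoopInfoBase, can be sketched with a toy analysis. All names below are invented for illustration; this is not the actual uniformity analysis interface:

```cpp
#include <cassert>
#include <vector>

struct Instr {
  int Opcode;
};

// The analysis treats its input as const, but hands results back as
// non-const pointers, doing the const_cast internally. Clients that
// already own the IR can then mutate the results without casting.
class ToyAnalysis {
  std::vector<Instr *> Interesting;

public:
  // Input is inspected, never modified.
  void analyze(const std::vector<const Instr *> &Code) {
    for (const Instr *I : Code)
      if (I->Opcode == 42)
        Interesting.push_back(const_cast<Instr *>(I)); // cast lives here
  }

  // Results are non-const, like LoopInfoBase::getLoopFor and friends.
  const std::vector<Instr *> &results() const { return Interesting; }
};
```

The design choice is where the single `const_cast` lives: inside the analysis, once, rather than at every client use site.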
[llvm-branch-commits] [clang] [llvm] AMDGPU: Fix libcall recognition of image array types (PR #119832)
@@ -622,9 +622,9 @@ bool ItaniumParamParser::parseItaniumParam(StringRef& param, if (isDigit(TC)) { res.ArgType = StringSwitch(eatLengthPrefixedName(param)) -.Case("ocl_image1darray", AMDGPULibFunc::IMG1DA) -.Case("ocl_image1dbuffer", AMDGPULibFunc::IMG1DB) -.Case("ocl_image2darray", AMDGPULibFunc::IMG2DA) +.StartsWith("ocl_image1d_array", AMDGPULibFunc::IMG1DA) +.StartsWith("ocl_image1d_buffer", AMDGPULibFunc::IMG1DB) +.StartsWith("ocl_image2d_array", AMDGPULibFunc::IMG2DA) ssahasra wrote: Shouldn't this change also fix the mangling generated in `getItaniumTypeName`? https://github.com/llvm/llvm-project/pull/119832
[llvm-branch-commits] [clang] [llvm] AMDGPU: Fix libcall recognition of image array types (PR #119832)
https://github.com/ssahasra approved this pull request. https://github.com/llvm/llvm-project/pull/119832
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
@@ -395,6 +399,14 @@ template class GenericUniformityAnalysisImpl { } void print(raw_ostream &out) const; + SmallVector UsesOutsideCycleWithDivergentExit; + void recordUseOutsideCycleWithDivergentExit(const InstructionT *, ssahasra wrote: Everywhere in this patch, is there some reason to precisely say "UseOutsideCycleWithDivergentExit"? Can't we just say "TemporalDivergence"? https://github.com/llvm/llvm-project/pull/124298
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
https://github.com/ssahasra edited https://github.com/llvm/llvm-project/pull/124298
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
@@ -395,6 +399,14 @@ template class GenericUniformityAnalysisImpl { } void print(raw_ostream &out) const; + SmallVector UsesOutsideCycleWithDivergentExit; + void recordUseOutsideCycleWithDivergentExit(const InstructionT *, ssahasra wrote: You're right. The LLVM doc does not actually define the term "temporal divergence". But it has always been used in a way that means "uniform inside cycle, divergent outside cycle, due to divergent cycle exit". But whether the value is uniform inside the cycle is less important. What matters is that values arrive at the use on exits from different iterations by different threads. I think we should use the name TemporalDivergence here. It's shorter and will show up when someone greps for temporal divergence. Let's also not add "Candidate" ... it just makes the name longer with only a little bit of new information. https://github.com/llvm/llvm-project/pull/124298
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
@@ -342,6 +342,10 @@ template class GenericUniformityAnalysisImpl { typename SyncDependenceAnalysisT::DivergenceDescriptor; using BlockLabelMapT = typename SyncDependenceAnalysisT::BlockLabelMap; + // Use outside cycle with divergent exit + using UOCWDE = ssahasra wrote: Just a suggestion, I would consider giving the name "TemporalDivergenceList" to the entire type ``SmallVector<...>``. https://github.com/llvm/llvm-project/pull/124298
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
@@ -1210,6 +1240,13 @@ void GenericUniformityAnalysisImpl::print(raw_ostream &OS) const { } } +template +iterator_range<...::UOCWDE *> ssahasra wrote: Just say ``auto`` as the return type here? Or if this needs to be exposed in an outer header file, then name a new type such as ``temporal_divergence_range``? https://github.com/llvm/llvm-project/pull/124298
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
@@ -40,6 +40,10 @@ template class GenericUniformityInfo { using CycleInfoT = GenericCycleInfo; using CycleT = typename CycleInfoT::CycleT; + // Use outside cycle with divergent exit + using UOCWDE = ssahasra wrote: This declaration got repeated. One of them can be eliminated? https://github.com/llvm/llvm-project/pull/124298
[llvm-branch-commits] [clang] [llvm] [clang] Redefine `noconvergent` and generate convergence control tokens (PR #136282)
https://github.com/ssahasra created https://github.com/llvm/llvm-project/pull/136282 This introduces the `-fconvergence-control` flag that emits convergence control intrinsics which are then used as the `convergencectrl` operand bundle on convergent calls. This also redefines the `noconvergent` attribute in Clang. The existing simple interpretation is that if a statement is marked `noconvergent`, then every asm call is treated as a non-convergent operation in the emitted LLVM IR. The new semantics introduces a more powerful notion that a `noconvergent` statement may contain convergent operations, but the resulting convergence constraints are limited to the scope of that statement. As a whole the statement itself does not place any convergence constraints on the control flow reaching it. When emitting convergence tokens, this attribute results in a call to the `anchor` intrinsic that determines convergence within the statement. >From 5681859e308283628da481c0ddc09a39345b3d46 Mon Sep 17 00:00:00 2001 From: Sameer Sahasrabuddhe Date: Tue, 15 Apr 2025 18:00:01 +0530 Subject: [PATCH] [clang] Redefine `noconvergent` and generate convergence control tokens This introduces the `-fconvergence-control` flag that emits convergence control intrinsics which are then used as the `convergencectrl` operand bundle on convergent calls. This also redefines the `noconvergent` attribute in Clang. The existing simple interpretation is that if a statement is marked `noconvergent`, then every asm call is treated as a non-convergent operation in the emitted LLVM IR. The new semantics introduces a more powerful notion that a `noconvergent` statement may contain convergent operations, but the resulting convergence constraints are limited to the scope of that statement. As a whole the statement itself does not place any convergence constraints on the control flow reaching it. 
When emitting convergence tokens, this attribute results in a call to the `anchor` intrinsic that determines convergence within the statement. --- clang/docs/ThreadConvergence.rst | 27 + .../Analysis/Analyses/ConvergenceCheck.h | 3 +- clang/include/clang/Basic/AttrDocs.td | 15 +- .../clang/Basic/DiagnosticSemaKinds.td| 2 + clang/include/clang/Basic/LangOptions.def | 2 + clang/include/clang/Driver/Options.td | 5 + clang/lib/Analysis/ConvergenceCheck.cpp | 43 +- clang/lib/CodeGen/CGCall.cpp | 8 +- clang/lib/CodeGen/CGStmt.cpp | 44 +- clang/lib/CodeGen/CodeGenFunction.cpp | 23 +- clang/lib/CodeGen/CodeGenFunction.h | 13 +- clang/lib/CodeGen/CodeGenModule.h | 2 +- clang/lib/Driver/ToolChains/Clang.cpp | 3 + clang/lib/Sema/AnalysisBasedWarnings.cpp | 8 +- clang/test/CodeGenHIP/convergence-tokens.hip | 687 ++ .../CodeGenHIP/noconvergent-statement.hip | 109 +++ .../noconvergent-errors/backwards_jump.hip| 23 + .../noconvergent-errors/jump-into-nest.hip| 32 + .../SemaHIP/noconvergent-errors/no-errors.hip | 83 +++ .../noconvergent-errors/simple_jump.hip | 23 + llvm/include/llvm/IR/InstrTypes.h | 8 +- llvm/include/llvm/IR/IntrinsicInst.h | 12 + .../Transforms/Utils/FixConvergenceControl.h | 21 + llvm/lib/IR/Instructions.cpp | 7 + llvm/lib/IR/IntrinsicInst.cpp | 21 + llvm/lib/Transforms/Utils/CMakeLists.txt | 1 + .../Utils/FixConvergenceControl.cpp | 191 + 27 files changed, 1365 insertions(+), 51 deletions(-) create mode 100644 clang/test/CodeGenHIP/convergence-tokens.hip create mode 100644 clang/test/CodeGenHIP/noconvergent-statement.hip create mode 100644 clang/test/SemaHIP/noconvergent-errors/backwards_jump.hip create mode 100644 clang/test/SemaHIP/noconvergent-errors/jump-into-nest.hip create mode 100644 clang/test/SemaHIP/noconvergent-errors/no-errors.hip create mode 100644 clang/test/SemaHIP/noconvergent-errors/simple_jump.hip create mode 100644 llvm/include/llvm/Transforms/Utils/FixConvergenceControl.h create mode 100644 
llvm/lib/Transforms/Utils/FixConvergenceControl.cpp diff --git a/clang/docs/ThreadConvergence.rst b/clang/docs/ThreadConvergence.rst index d872ab9cb77f5..ce2ca2cbeacde 100644 --- a/clang/docs/ThreadConvergence.rst +++ b/clang/docs/ThreadConvergence.rst @@ -564,6 +564,33 @@ backwards ``goto`` instead of a ``while`` statement. ``outside_loop``. This includes threads that jumped from ``G2`` as well as threads that reached ``outside_loop`` after executing ``C``. +.. _noconvergent-statement: + +The ``noconvergent`` Statement +== + +When a statement is marked as ``noconvergent`` the convergence of threads at the +start of this statement is not constrained by any convergent operations inside +the statement. + +- When two threads execute a statement marked ``noconvergent``, it is + implementation-
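A hedged sketch of what the emitted IR might look like under this change. The surrounding code is illustrative only, based on the PR description rather than actual Clang output; the convergence control intrinsics and the `convergencectrl` operand bundle are the real LLVM mechanisms:

```llvm
define void @kernel() convergent {
entry:
  ; Ordinary case: the entry token ties convergence of this call to how
  ; threads converged when they entered the function.
  %entry.tok = call token @llvm.experimental.convergence.entry()
  call void @convergent_op() [ "convergencectrl"(token %entry.tok) ]

  ; For a `noconvergent` statement, an anchor token is used instead: the
  ; convergent ops inside the statement are constrained only relative to
  ; each other, not to the control flow that reached the statement.
  %anchor.tok = call token @llvm.experimental.convergence.anchor()
  call void @convergent_op() [ "convergencectrl"(token %anchor.tok) ]
  ret void
}

declare token @llvm.experimental.convergence.entry() convergent
declare token @llvm.experimental.convergence.anchor() convergent
declare void @convergent_op() convergent
```

This is the sense in which a `noconvergent` statement, as a whole, places no convergence constraints on the control flow reaching it.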
[llvm-branch-commits] [clang] [llvm] [clang] Redefine `noconvergent` and generate convergence control tokens (PR #136282)
https://github.com/ssahasra edited https://github.com/llvm/llvm-project/pull/136282
[llvm-branch-commits] [clang] [llvm] [clang] Redefine `noconvergent` and generate convergence control tokens (PR #136282)
https://github.com/ssahasra edited https://github.com/llvm/llvm-project/pull/136282
[llvm-branch-commits] [llvm] [AMDGPU] efficiently wait for direct loads to LDS at all scopes (PR #147258)
https://github.com/ssahasra created https://github.com/llvm/llvm-project/pull/147258 Currently, the memory legalizer does not generate any wait on vmcnt at workgroup scope. This is incorrect because direct loads to LDS are tracked using vmcnt and they need to be released properly at workgroup scope. The memory legalizer was previously updated to always emit a soft wait instruction even when all counts are trivially ~0. SIInsertWaitcnts now examines pending loads to LDS at each S_WAITCNT_soft instruction. If such instructions exist, the vmcnt (which could be ~0) is upgraded to a value that waits for any such pending loads to LDS. After that, any soft instruction that has only trivial ~0 counts is automatically dropped. Thus, common programs that do not use direct loads to LDS remain unaffected, but programs that do use such loads see a correct and efficient vmcnt even at workgroup scope. >From de111cd96570df7127722cb7df476cb833694f72 Mon Sep 17 00:00:00 2001 From: Sameer Sahasrabuddhe Date: Tue, 17 Jun 2025 13:11:55 +0530 Subject: [PATCH 1/2] [AMDGCN] pre-checkin test for LDS DMA and release operations --- .../AMDGPU/lds-dma-workgroup-release.ll | 482 ++ 1 file changed, 482 insertions(+) create mode 100644 llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll diff --git a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll new file mode 100644 index 0..1db15c3c6099c --- /dev/null +++ b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll @@ -0,0 +1,482 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 +; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck %s --check-prefixes=GFX900 +; RUN: llc -mtriple=amdgcn -mcpu=gfx90a < %s | FileCheck %s --check-prefixes=GFX90A +; RUN: llc -mtriple=amdgcn -mcpu=gfx90a -mattr=+tgsplit < %s | FileCheck %s --check-prefixes=GFX90A-TGSPLIT +; RUN: llc -mtriple=amdgcn -mcpu=gfx942 < %s | FileCheck %s
--check-prefixes=GFX942 +; RUN: llc -mtriple=amdgcn -mcpu=gfx942 -mattr=+tgsplit < %s | FileCheck %s --check-prefixes=GFX942-TGSPLIT +; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s --check-prefixes=GFX1010 + +; In each of these tests, an LDS DMA operation is followed by a release pattern +; at workgroup scope. The fence in such a release (implicit or explicit) should +; wait for the store component in the LDS DMA. The additional noalias metadata +; is just meant to ensure that the wait counts are not generated due to some +; unintended aliasing. + +declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 %aux) + +define amdgpu_kernel void @barrier_release(<4 x i32> inreg %rsrc, +; GFX900-LABEL: barrier_release: +; GFX900: ; %bb.0: ; %main_body +; GFX900-NEXT:s_load_dwordx8 s[8:15], s[4:5], 0x24 +; GFX900-NEXT:v_mov_b32_e32 v0, 0x800 +; GFX900-NEXT:v_mov_b32_e32 v1, 0 +; GFX900-NEXT:s_waitcnt lgkmcnt(0) +; GFX900-NEXT:s_mov_b32 m0, s12 +; GFX900-NEXT:s_nop 0 +; GFX900-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds +; GFX900-NEXT:v_mov_b32_e32 v0, s13 +; GFX900-NEXT:s_waitcnt vmcnt(0) +; GFX900-NEXT:s_barrier +; GFX900-NEXT:ds_read_b32 v0, v0 +; GFX900-NEXT:s_waitcnt lgkmcnt(0) +; GFX900-NEXT:global_store_dword v1, v0, s[14:15] +; GFX900-NEXT:s_endpgm +; +; GFX90A-LABEL: barrier_release: +; GFX90A: ; %bb.1: +; GFX90A-NEXT:s_load_dwordx4 s[8:11], s[4:5], 0x0 +; GFX90A-NEXT:s_load_dwordx2 s[12:13], s[4:5], 0x10 +; GFX90A-NEXT:s_waitcnt lgkmcnt(0) +; GFX90A-NEXT:s_branch .LBB0_0 +; GFX90A-NEXT:.p2align 8 +; GFX90A-NEXT: ; %bb.2: +; GFX90A-NEXT: .LBB0_0: ; %main_body +; GFX90A-NEXT:s_mov_b32 m0, s12 +; GFX90A-NEXT:v_mov_b32_e32 v0, 0x800 +; GFX90A-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds +; GFX90A-NEXT:v_mov_b32_e32 v0, s13 +; GFX90A-NEXT:s_load_dwordx2 s[0:1], s[4:5], 0x3c +; GFX90A-NEXT:s_waitcnt lgkmcnt(0) +; GFX90A-NEXT:s_barrier +; GFX90A-NEXT:s_waitcnt 
vmcnt(0) +; GFX90A-NEXT:ds_read_b32 v0, v0 +; GFX90A-NEXT:v_mov_b32_e32 v1, 0 +; GFX90A-NEXT:s_waitcnt lgkmcnt(0) +; GFX90A-NEXT:global_store_dword v1, v0, s[0:1] +; GFX90A-NEXT:s_endpgm +; +; GFX90A-TGSPLIT-LABEL: barrier_release: +; GFX90A-TGSPLIT: ; %bb.1: +; GFX90A-TGSPLIT-NEXT:s_load_dwordx4 s[8:11], s[4:5], 0x0 +; GFX90A-TGSPLIT-NEXT:s_load_dwordx2 s[12:13], s[4:5], 0x10 +; GFX90A-TGSPLIT-NEXT:s_waitcnt lgkmcnt(0) +; GFX90A-TGSPLIT-NEXT:s_branch .LBB0_0 +; GFX90A-TGSPLIT-NEXT:.p2align 8 +; GFX90A-TGSPLIT-NEXT: ; %bb.2: +; GFX90A-TGSPLIT-NEXT: .LBB0_0: ; %main_body +; GFX90A-TGSPLIT-NEXT:s_mov_b32 m0, s12 +; GFX90A-TGSPLIT-NEXT:v_mov_b32_e32 v0, 0x800 +; GFX90A-TGSPLIT-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds +; GFX90A-TGSPLIT-NEXT:v_mov_b32_e32 v0, s13 +; G
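The two-pass interplay described in the PR summary (the memory legalizer always emits a soft wait; SIInsertWaitcnts either upgrades its vmcnt for pending LDS DMA or drops it) can be sketched roughly as follows. This is a simplified hypothetical model for illustration only; the struct, its fields, and `resolveSoftWait` are not the actual LLVM data structures or APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical model of a soft waitcnt. ~0 in a field means
// "no wait requested" in the in-memory representation.
struct Waitcnt {
  uint32_t VmCnt = ~0u;
  uint32_t LgkmCnt = ~0u;
  bool isTriviallyEmpty() const { return VmCnt == ~0u && LgkmCnt == ~0u; }
};

// If any direct loads to LDS are still pending at the soft wait, upgrade
// vmcnt so the release actually waits for their store component; otherwise
// a soft wait whose counts are all trivially ~0 is simply dropped.
std::optional<Waitcnt> resolveSoftWait(Waitcnt Soft, unsigned PendingLdsDma) {
  if (PendingLdsDma > 0)
    Soft.VmCnt = 0; // wait for all outstanding vm events, incl. LDS DMA
  if (Soft.isTriviallyEmpty())
    return std::nullopt; // drop the soft instruction entirely
  return Soft;
}
```

Under this model, programs with no direct loads to LDS hit the `std::nullopt` path and see no extra waits, matching the "remain unaffected" claim above.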
[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)
@@ -669,6 +679,7 @@ define amdgpu_kernel void @global_volatile_store_1( ; GFX12-WGP-NEXT:s_wait_kmcnt 0x0 ; GFX12-WGP-NEXT:s_wait_storecnt 0x0 ; GFX12-WGP-NEXT:global_store_b32 v0, v1, s[0:1] scope:SCOPE_SYS +; GFX12-WGP-NEXT:s_wait_loadcnt 0x3f ssahasra wrote: Not directly related to this discussion, but this line does exist:
```
1390   // Merge consecutive waitcnt of the same type by erasing multiples.
1391   if (WaitcntInstr || (!Wait.hasWaitExceptStoreCnt() && TrySimplify)) {
```
It is meant to preserve S_WAITCNT_soft even if there is no actual wait required. @jayfoad , you had introduced `TrySimplify` ... do you think it is okay to relax its uses?
```
1373   if (TrySimplify || (Opcode != II.getOpcode() && OldWait.hasValuesSetToMax()))
1374     ScoreBrackets.simplifyWaitcnt(OldWait);
```
Here, `hasValuesSetToMax()` is a hypothetical function that checks the encoding of each count separately to have all bits set to 1, and not just a ~0 in the data structure. https://github.com/llvm/llvm-project/pull/147257 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)
https://github.com/ssahasra edited https://github.com/llvm/llvm-project/pull/147257 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)
@@ -669,6 +679,7 @@ define amdgpu_kernel void @global_volatile_store_1( ; GFX12-WGP-NEXT:s_wait_kmcnt 0x0 ; GFX12-WGP-NEXT:s_wait_storecnt 0x0 ; GFX12-WGP-NEXT:global_store_b32 v0, v1, s[0:1] scope:SCOPE_SYS +; GFX12-WGP-NEXT:s_wait_loadcnt 0x3f ssahasra wrote: > These should always be printed with the named counter syntax I haven't checked what's different about this wait count for it to be printed like this. I will need to follow this up as a separate change. https://github.com/llvm/llvm-project/pull/147257 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)
@@ -669,6 +679,7 @@ define amdgpu_kernel void @global_volatile_store_1( ; GFX12-WGP-NEXT:s_wait_kmcnt 0x0 ; GFX12-WGP-NEXT:s_wait_storecnt 0x0 ; GFX12-WGP-NEXT:global_store_b32 v0, v1, s[0:1] scope:SCOPE_SYS +; GFX12-WGP-NEXT:s_wait_loadcnt 0x3f ssahasra wrote: If we agree with the basic design, then these are expected. There's a whole bunch of tests that either stop at the memory legalizer, or they run llc with `-O0`, like this one. The "trivial" wait counts show up in all these tests because SIInsertWaitcnts did not get a chance to clean it up. In particular, see how `TrySimplify` in that pass controls whether or not to clean up these wait counts. They disappear in the optimized ISA output. https://github.com/llvm/llvm-project/pull/147257 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)
@@ -669,6 +679,7 @@ define amdgpu_kernel void @global_volatile_store_1( ; GFX12-WGP-NEXT:s_wait_kmcnt 0x0 ; GFX12-WGP-NEXT:s_wait_storecnt 0x0 ; GFX12-WGP-NEXT:global_store_b32 v0, v1, s[0:1] scope:SCOPE_SYS +; GFX12-WGP-NEXT:s_wait_loadcnt 0x3f ssahasra wrote: Yes, I did consider that as an option. But there is the hypothetical corner case where the memory legalizer might deliberately compute the wait count to be so large that it gets clamped at the max value (not the same as ~0, strictly speaking). If that is not an issue, it will significantly reduce the diff for tests that don't stop after the legalizer. https://github.com/llvm/llvm-project/pull/147257 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
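The corner case mentioned above — a requested wait count so large that it clamps at the field's encoded maximum, which is a real (if weak) wait and strictly not the same as ~0 — can be made concrete with a small sketch. The 6-bit width and both function names are illustrative assumptions, not actual LLVM code.

```cpp
#include <cassert>
#include <cstdint>

// ~0 in the in-memory form means "no wait requested".
constexpr uint32_t kNoWait = ~0u;

// Clamp a requested count to what the field can encode; a clamped count
// is still a genuine wait, just a weaker one than requested.
uint32_t clampToEncoding(uint32_t Requested, unsigned Bits) {
  const uint32_t Max = (1u << Bits) - 1;
  return Requested > Max ? Max : Requested;
}

// A count is a real wait unless it was never requested at all.
bool isRealWait(uint32_t InMemoryCount) { return InMemoryCount != kNoWait; }
```

This is why simplification logic that compares counts against ~0 cannot safely treat a clamped-to-maximum count as "no wait".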
[llvm-branch-commits] [llvm] [AMDGPU] always emit a soft wait even if it is trivially ~0 (PR #147257)
ssahasra wrote: This is part of a stack: - #147258 - #147257 - #147256 https://github.com/llvm/llvm-project/pull/147257 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] efficiently wait for direct loads to LDS at all scopes (PR #147258)
ssahasra wrote: This is part of a stack: - #147258 - #147257 - #147256 https://github.com/llvm/llvm-project/pull/147258 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] efficiently wait for direct loads to LDS at all scopes (PR #147258)
ssahasra wrote: Note that the best way to see the effect of this PR is to view only the second of the two diffs in this PR. It shows how the previously missing vmcnt(0) now shows up in the new test introduced by the first commit. https://github.com/llvm/llvm-project/pull/147258 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] efficiently wait for direct loads to LDS at all scopes (PR #147258)
https://github.com/ssahasra updated https://github.com/llvm/llvm-project/pull/147258 >From 95ffad8e0c22f261999f8a87abde8592c0596395 Mon Sep 17 00:00:00 2001 From: Sameer Sahasrabuddhe Date: Tue, 17 Jun 2025 13:11:55 +0530 Subject: [PATCH 1/2] [AMDGCN] pre-checkin test for LDS DMA and release operations --- .../AMDGPU/lds-dma-workgroup-release.ll | 482 ++ 1 file changed, 482 insertions(+) create mode 100644 llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll diff --git a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll new file mode 100644 index 0..1db15c3c6099c --- /dev/null +++ b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll @@ -0,0 +1,482 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 +; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck %s --check-prefixes=GFX900 +; RUN: llc -mtriple=amdgcn -mcpu=gfx90a < %s | FileCheck %s --check-prefixes=GFX90A +; RUN: llc -mtriple=amdgcn -mcpu=gfx90a -mattr=+tgsplit < %s | FileCheck %s --check-prefixes=GFX90A-TGSPLIT +; RUN: llc -mtriple=amdgcn -mcpu=gfx942 < %s | FileCheck %s --check-prefixes=GFX942 +; RUN: llc -mtriple=amdgcn -mcpu=gfx942 -mattr=+tgsplit < %s | FileCheck %s --check-prefixes=GFX942-TGSPLIT +; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s --check-prefixes=GFX1010 + +; In each of these tests, an LDS DMA operation is followed by a release pattern +; at workgroup scope. The fence in such a release (implicit or explicit) should +; wait for the store component in the LDS DMA. The additional noalias metadata +; is just meant to ensure that the wait counts are not generated due to some +; unintended aliasing. 
+ +declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 %aux) + +define amdgpu_kernel void @barrier_release(<4 x i32> inreg %rsrc, +; GFX900-LABEL: barrier_release: +; GFX900: ; %bb.0: ; %main_body +; GFX900-NEXT:s_load_dwordx8 s[8:15], s[4:5], 0x24 +; GFX900-NEXT:v_mov_b32_e32 v0, 0x800 +; GFX900-NEXT:v_mov_b32_e32 v1, 0 +; GFX900-NEXT:s_waitcnt lgkmcnt(0) +; GFX900-NEXT:s_mov_b32 m0, s12 +; GFX900-NEXT:s_nop 0 +; GFX900-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds +; GFX900-NEXT:v_mov_b32_e32 v0, s13 +; GFX900-NEXT:s_waitcnt vmcnt(0) +; GFX900-NEXT:s_barrier +; GFX900-NEXT:ds_read_b32 v0, v0 +; GFX900-NEXT:s_waitcnt lgkmcnt(0) +; GFX900-NEXT:global_store_dword v1, v0, s[14:15] +; GFX900-NEXT:s_endpgm +; +; GFX90A-LABEL: barrier_release: +; GFX90A: ; %bb.1: +; GFX90A-NEXT:s_load_dwordx4 s[8:11], s[4:5], 0x0 +; GFX90A-NEXT:s_load_dwordx2 s[12:13], s[4:5], 0x10 +; GFX90A-NEXT:s_waitcnt lgkmcnt(0) +; GFX90A-NEXT:s_branch .LBB0_0 +; GFX90A-NEXT:.p2align 8 +; GFX90A-NEXT: ; %bb.2: +; GFX90A-NEXT: .LBB0_0: ; %main_body +; GFX90A-NEXT:s_mov_b32 m0, s12 +; GFX90A-NEXT:v_mov_b32_e32 v0, 0x800 +; GFX90A-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds +; GFX90A-NEXT:v_mov_b32_e32 v0, s13 +; GFX90A-NEXT:s_load_dwordx2 s[0:1], s[4:5], 0x3c +; GFX90A-NEXT:s_waitcnt lgkmcnt(0) +; GFX90A-NEXT:s_barrier +; GFX90A-NEXT:s_waitcnt vmcnt(0) +; GFX90A-NEXT:ds_read_b32 v0, v0 +; GFX90A-NEXT:v_mov_b32_e32 v1, 0 +; GFX90A-NEXT:s_waitcnt lgkmcnt(0) +; GFX90A-NEXT:global_store_dword v1, v0, s[0:1] +; GFX90A-NEXT:s_endpgm +; +; GFX90A-TGSPLIT-LABEL: barrier_release: +; GFX90A-TGSPLIT: ; %bb.1: +; GFX90A-TGSPLIT-NEXT:s_load_dwordx4 s[8:11], s[4:5], 0x0 +; GFX90A-TGSPLIT-NEXT:s_load_dwordx2 s[12:13], s[4:5], 0x10 +; GFX90A-TGSPLIT-NEXT:s_waitcnt lgkmcnt(0) +; GFX90A-TGSPLIT-NEXT:s_branch .LBB0_0 +; GFX90A-TGSPLIT-NEXT:.p2align 8 +; GFX90A-TGSPLIT-NEXT: ; %bb.2: +; GFX90A-TGSPLIT-NEXT: .LBB0_0: ; 
%main_body +; GFX90A-TGSPLIT-NEXT:s_mov_b32 m0, s12 +; GFX90A-TGSPLIT-NEXT:v_mov_b32_e32 v0, 0x800 +; GFX90A-TGSPLIT-NEXT:buffer_load_dword v0, s[8:11], 0 offen lds +; GFX90A-TGSPLIT-NEXT:v_mov_b32_e32 v0, s13 +; GFX90A-TGSPLIT-NEXT:s_load_dwordx2 s[0:1], s[4:5], 0x3c +; GFX90A-TGSPLIT-NEXT:s_waitcnt vmcnt(0) lgkmcnt(0) +; GFX90A-TGSPLIT-NEXT:s_barrier +; GFX90A-TGSPLIT-NEXT:buffer_wbinvl1_vol +; GFX90A-TGSPLIT-NEXT:ds_read_b32 v0, v0 +; GFX90A-TGSPLIT-NEXT:v_mov_b32_e32 v1, 0 +; GFX90A-TGSPLIT-NEXT:s_waitcnt lgkmcnt(0) +; GFX90A-TGSPLIT-NEXT:global_store_dword v1, v0, s[0:1] +; GFX90A-TGSPLIT-NEXT:s_endpgm +; +; GFX942-LABEL: barrier_release: +; GFX942: ; %bb.1: +; GFX942-NEXT:s_load_dwordx4 s[8:11], s[4:5], 0x0 +; GFX942-NEXT:s_load_dwordx2 s[12:13], s[4:5], 0x10 +; GFX942-NEXT:s_waitcnt lgkmcnt(0) +; GFX942-NEXT:s_branch .LBB0_0 +; GFX942-NEXT:.p2align 8 +; GFX942-NEXT: ; %bb.2: +; GFX942-NEXT: .LBB0_0: ; %main_body +; GFX942-NEXT:s_mov_b32 m0, s1