[llvm-branch-commits] [llvm] [AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201) (PR #90582)
https://github.com/jayfoad closed https://github.com/llvm/llvm-project/pull/90582 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201) (PR #90582)
jayfoad wrote: Too late to backport - no more 18.x releases are planned. https://github.com/llvm/llvm-project/pull/90582 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] AMDGPU: Fix buffer intrinsic store of bfloat (PR #95377)
https://github.com/jayfoad approved this pull request. https://github.com/llvm/llvm-project/pull/95377 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] AMDGPU: Handle legal v2bf16 atomicrmw fadd for gfx12 (PR #95930)
@@ -1735,8 +1737,11 @@ defm : SIBufferAtomicPat<"SIbuffer_atomic_dec", i64, "BUFFER_ATOMIC_DEC_X2">; let OtherPredicates = [HasAtomicCSubNoRtnInsts] in defm : SIBufferAtomicPat<"SIbuffer_atomic_csub", i32, "BUFFER_ATOMIC_CSUB", ["noret"]>; -let SubtargetPredicate = isGFX12Plus in { +let SubtargetPredicate = HasAtomicBufferPkAddBF16Inst in { defm : SIBufferAtomicPat_Common<"SIbuffer_atomic_fadd", v2bf16, "BUFFER_ATOMIC_PK_ADD_BF16_VBUFFER">; jayfoad wrote: VBUFFER is a new encoding in GFX12 which replaces the old MTBUF and MUBUF encodings. We have different pseudos for VBUFFER (which should only be selected on GFX12+) and MTBUF/MUBUF (which should only be selected pre-GFX12). https://github.com/llvm/llvm-project/pull/95930 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
jayfoad wrote: This looks like it is affecting codegen even when xnack is disabled? That should not happen. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
jayfoad wrote: > > This looks like it is affecting codegen even when xnack is disabled? That > > should not happen. > > It shouldn't. I put the xnack replay subtarget check before using *_ec > equivalents. See the code here: > [65eb443#diff-35f4d1b6c4c17815f6989f86abbac2e606ca760f9d93f501ff503449048bf760R1735](https://github.com/llvm/llvm-project/commit/65eb44327cf32a83dbbf13eb70f9d8c03f3efaef#diff-35f4d1b6c4c17815f6989f86abbac2e606ca760f9d93f501ff503449048bf760R1735) You're checking `STI->hasXnackReplay()` which is true on all GFX8+ targets. You should be checking whether xnack support is enabled with `STI->isXNACKEnabled()`. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
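For reference, a minimal sketch of the gating being asked for here, using the two accessors named in the comment (`hasXnackReplay` as the static hardware capability, `isXNACKEnabled` as the per-compilation setting):

```cpp
// Sketch only: hasXnackReplay() is true on all GFX8+ targets, so keying the
// constrained (*_ec) loads on it changes codegen even with xnack disabled.
// The intended test is whether xnack is enabled for this compilation:
if (STI->isXNACKEnabled()) {
  // select the constrained *_ec load opcodes
}
```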
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -967,6 +967,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo, bool hasLDSFPAtomicAddF32() const { return GFX8Insts; } bool hasLDSFPAtomicAddF64() const { return GFX90AInsts; } + bool hasXnackReplay() const { return GFX8Insts; } jayfoad wrote: We already have a field SupportsXNACK for this, which is hooked up to the "xnack-support" target feature. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
@@ -867,13 +867,104 @@ def SMRDBufferImm : ComplexPattern; def SMRDBufferImm32 : ComplexPattern; def SMRDBufferSgprImm : ComplexPattern; +class SMRDAlignedLoadPat : PatFrag <(ops node:$ptr), (Op node:$ptr), [{ + // Returns true if it is a naturally aligned multi-dword load. + LoadSDNode *Ld = cast(N); + unsigned Size = Ld->getMemoryVT().getStoreSize(); + return (Size <= 4) || (Ld->getAlign().value() >= PowerOf2Ceil(Size)); jayfoad wrote: Right but the PowerOf2Ceil makes no difference. Either you test 16>=12 or 16>=16, the result is the same. Also you don't need most of the parens on this line. https://github.com/llvm/llvm-project/pull/96163 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
@@ -867,13 +867,104 @@ def SMRDBufferImm : ComplexPattern; def SMRDBufferImm32 : ComplexPattern; def SMRDBufferSgprImm : ComplexPattern; +class SMRDAlignedLoadPat : PatFrag <(ops node:$ptr), (Op node:$ptr), [{ + // Returns true if it is a naturally aligned multi-dword load. + LoadSDNode *Ld = cast(N); + unsigned Size = Ld->getMemoryVT().getStoreSize(); + return (Size <= 4) || (Ld->getAlign().value() >= PowerOf2Ceil(Size)); jayfoad wrote: `Ld->getAlign().value()` will never be 12. There's no such thing as a non-power-of-two alignment. https://github.com/llvm/llvm-project/pull/96163 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
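Taken together, these two comments point at a simpler predicate. A sketch (matching the shape quoted in a later revision of the patch further down; `Ld` is the load node from the PatFrag):

```cpp
// The alignment value is itself a power of two, so PowerOf2Ceil(Size) never
// changes the comparison (an alignment >= 12 is necessarily >= 16), and the
// extra parentheses can be dropped:
LoadSDNode *Ld = cast<LoadSDNode>(N);
unsigned Size = Ld->getMemoryVT().getStoreSize();
return Size <= 4 || Ld->getAlign().value() >= Size;
```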
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI, return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } case S_LOAD_IMM: -switch (Width) { -default: - return 0; -case 2: - return AMDGPU::S_LOAD_DWORDX2_IMM; -case 3: - return AMDGPU::S_LOAD_DWORDX3_IMM; -case 4: - return AMDGPU::S_LOAD_DWORDX4_IMM; -case 8: - return AMDGPU::S_LOAD_DWORDX8_IMM; +// For targets that support XNACK replay, use the constrained load opcode. +if (STI && STI->hasXnackReplay()) { + switch (Width) { jayfoad wrote: > currently the alignment is picked from the first MMO and that'd definitely be > smaller than the natural align requirement for the new load You don't know that - the alignment in the first MMO will be whatever alignment the compiler could deduce, which could be large. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/jayfoad edited https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] AMDGPU: Add subtarget feature for global atomic fadd denormal support (PR #96443)
@@ -167,6 +167,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo, bool HasAtomicFlatPkAdd16Insts = false; bool HasAtomicFaddRtnInsts = false; bool HasAtomicFaddNoRtnInsts = false; + bool HasAtomicMemoryAtomicFaddF32DenormalSupport = false; jayfoad wrote: What does "AtomicMemoryAtomic" mean? https://github.com/llvm/llvm-project/pull/96443 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] DAG: Call SimplifyDemandedBits on fcopysign sign value (PR #97151)
https://github.com/jayfoad edited https://github.com/llvm/llvm-project/pull/97151 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] DAG: Call SimplifyDemandedBits on fcopysign sign value (PR #97151)
@@ -17565,6 +17565,12 @@ SDValue DAGCombiner::visitFCOPYSIGN(SDNode *N) { if (CanCombineFCOPYSIGN_EXTEND_ROUND(N)) return DAG.getNode(ISD::FCOPYSIGN, SDLoc(N), VT, N0, N1.getOperand(0)); + // We only take the sign bit from the sign operand. + EVT SignVT = N1.getValueType(); + if (SimplifyDemandedBits(N1, jayfoad wrote: I think this should be able to subsume some of the optimizations above, e.g. `copysign(x, abs(y)) -> abs(x)` would fall out if SimplifyDemandedBits knew about extracting the sign bit from `abs(x)`. https://github.com/llvm/llvm-project/pull/97151 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
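The quoted diff cuts off mid-call; presumably it continues by demanding only the sign bit, roughly like this sketch (the exact mask construction is an assumption, not quoted from the patch):

```cpp
// If only the sign bit of the sign operand N1 is demanded, and
// SimplifyDemandedBits knows that bit is zero for fabs(y), then
// copysign(x, fabs(y)) collapses to fabs(x) with no dedicated combine.
EVT SignVT = N1.getValueType();
if (SimplifyDemandedBits(N1,
                         APInt::getSignMask(SignVT.getScalarSizeInBits())))
  return SDValue(N, 0); // N1 was simplified in place; revisit this node
```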
[llvm-branch-commits] [llvm] DAG: Call SimplifyDemandedBits on fcopysign sign value (PR #97151)
https://github.com/jayfoad approved this pull request. LGTM https://github.com/llvm/llvm-project/pull/97151 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Enable atomic optimizer for divergent i64 and double values (PR #96934)
@@ -313,8 +327,7 @@ void AMDGPUAtomicOptimizerImpl::visitIntrinsicInst(IntrinsicInst &I) { // value to the atomic calculation. We can only optimize divergent values if // we have DPP available on our subtarget, and the atomic operation is 32 // bits. - if (ValDivergent && - (!ST->hasDPP() || DL->getTypeSizeInBits(I.getType()) != 32)) { + if (ValDivergent && (!ST->hasDPP() || !isOptimizableAtomic(I.getType( { jayfoad wrote: Same here. https://github.com/llvm/llvm-project/pull/96934 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Enable atomic optimizer for divergent i64 and double values (PR #96934)
@@ -230,8 +245,7 @@ void AMDGPUAtomicOptimizerImpl::visitAtomicRMWInst(AtomicRMWInst &I) { // value to the atomic calculation. We can only optimize divergent values if // we have DPP available on our subtarget, and the atomic operation is 32 // bits. - if (ValDivergent && - (!ST->hasDPP() || DL->getTypeSizeInBits(I.getType()) != 32)) { + if (ValDivergent && (!ST->hasDPP() || !isOptimizableAtomic(I.getType( { jayfoad wrote: Pre-existing problem: this `hasDPP` check is in the wrong place. It should only be tested if we're using the DPP strategy. https://github.com/llvm/llvm-project/pull/96934 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Enable atomic optimizer for divergent i64 and double values (PR #96934)
@@ -178,6 +178,21 @@ bool AMDGPUAtomicOptimizerImpl::run(Function &F) { return Changed; } +static bool isOptimizableAtomic(Type *Ty) { + switch (Ty->getTypeID()) { + case Type::FloatTyID: + case Type::DoubleTyID: +return true; + case Type::IntegerTyID: { +unsigned size = Ty->getIntegerBitWidth(); jayfoad wrote:
```suggestion
unsigned Size = Ty->getIntegerBitWidth();
```
https://github.com/llvm/llvm-project/pull/96934 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Enable atomic optimizer for divergent i64 and double values (PR #96934)
jayfoad wrote: > [AMDGPU] Enable atomic optimizer for divergent i64 and double values Needs some i64 tests https://github.com/llvm/llvm-project/pull/96934 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1700,19 +1725,30 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI, case 8: return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } - case S_LOAD_IMM: + case S_LOAD_IMM: { +// If XNACK is enabled, use the constrained opcodes when the first load is +// under-aligned. +const MachineMemOperand *MMO = *CI.I->memoperands_begin(); +auto NeedsConstrainedOpc = [&MMO, Width](const GCNSubtarget &ST) { + return ST.isXNACKEnabled() && MMO->getAlign().value() < Width; jayfoad wrote: This doesn't look right since `Width` is in units of dwords here. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1212,8 +1228,17 @@ void SILoadStoreOptimizer::copyToDestRegs( // Copy to the old destination registers. const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY); - const auto *Dest0 = TII->getNamedOperand(*CI.I, OpName); - const auto *Dest1 = TII->getNamedOperand(*Paired.I, OpName); + auto *Dest0 = TII->getNamedOperand(*CI.I, OpName); + auto *Dest1 = TII->getNamedOperand(*Paired.I, OpName); + + // The constrained sload instructions in S_LOAD_IMM class will have + // `early-clobber` flag in the dst operand. Remove the flag before using the + // MOs in copies. + if (Dest0->isEarlyClobber()) +Dest0->setIsEarlyClobber(false); + + if (Dest1->isEarlyClobber()) +Dest1->setIsEarlyClobber(false); jayfoad wrote:
```suggestion
  Dest0->setIsEarlyClobber(false);
  Dest1->setIsEarlyClobber(false);
```
https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1700,19 +1725,30 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI, case 8: return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } - case S_LOAD_IMM: + case S_LOAD_IMM: { +// If XNACK is enabled, use the constrained opcodes when the first load is +// under-aligned. +const MachineMemOperand *MMO = *CI.I->memoperands_begin(); +auto NeedsConstrainedOpc = [&MMO, Width](const GCNSubtarget &ST) { jayfoad wrote: This doesn't need to be a lambda. It is always called, with identical arguments. Just calculate the result as a `bool` here. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
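Putting this together with the earlier units comment, the suggested shape is roughly the following sketch (a later revision, quoted below, lands essentially this form):

```cpp
// Evaluate once as a plain bool; no lambda needed. Width is in dwords, so
// scale by 4 before comparing against the MMO's byte alignment.
const MachineMemOperand *MMO = *CI.I->memoperands_begin();
bool NeedsConstrainedOpc =
    STM->isXNACKEnabled() && MMO->getAlign().value() < Width * 4;
```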
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
@@ -867,13 +867,61 @@ def SMRDBufferImm : ComplexPattern; def SMRDBufferImm32 : ComplexPattern; def SMRDBufferSgprImm : ComplexPattern; +class SMRDAlignedLoadPat : PatFrag <(ops node:$ptr), (Op node:$ptr), [{ + // Ignore the alignment check if XNACK support is disabled. + if (!Subtarget->isXNACKEnabled()) +return true; + + // Returns true if it is a naturally aligned multi-dword load. + LoadSDNode *Ld = cast(N); + unsigned Size = Ld->getMemoryVT().getStoreSize(); + return Size <= 4 || Ld->getAlign().value() >= Size; +}]> { + let GISelPredicateCode = [{ + if (!Subtarget->isXNACKEnabled()) +return true; + + auto &Ld = cast(MI); + TypeSize Size = Ld.getMMO().getSize().getValue(); + return Size <= 4 || Ld.getMMO().getAlign().value() >= Size; + }]; +} + +class SMRDUnalignedLoadPat : PatFrag <(ops node:$ptr), (Op node:$ptr), [{ + // Do the alignment check if XNACK support is enabled. + if (!Subtarget->isXNACKEnabled()) +return false; + + // Returns true if it is an under aligned multi-dword load. + LoadSDNode *Ld = cast(N); + unsigned Size = Ld->getMemoryVT().getStoreSize(); + return Size > 4 && (Ld->getAlign().value() < Size); jayfoad wrote: Don't need the parens https://github.com/llvm/llvm-project/pull/96163 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
@@ -867,13 +867,61 @@ def SMRDBufferImm : ComplexPattern; def SMRDBufferImm32 : ComplexPattern; def SMRDBufferSgprImm : ComplexPattern; +class SMRDAlignedLoadPat : PatFrag <(ops node:$ptr), (Op node:$ptr), [{ + // Ignore the alignment check if XNACK support is disabled. + if (!Subtarget->isXNACKEnabled()) +return true; + + // Returns true if it is a naturally aligned multi-dword load. jayfoad wrote: ... or if it's a non-multi-dword load. https://github.com/llvm/llvm-project/pull/96163 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
@@ -867,13 +867,61 @@ def SMRDBufferImm : ComplexPattern; def SMRDBufferImm32 : ComplexPattern; def SMRDBufferSgprImm : ComplexPattern; +class SMRDAlignedLoadPat : PatFrag <(ops node:$ptr), (Op node:$ptr), [{ + // Ignore the alignment check if XNACK support is disabled. + if (!Subtarget->isXNACKEnabled()) +return true; + + // Returns true if it is a naturally aligned multi-dword load. + LoadSDNode *Ld = cast(N); + unsigned Size = Ld->getMemoryVT().getStoreSize(); + return Size <= 4 || Ld->getAlign().value() >= Size; +}]> { + let GISelPredicateCode = [{ + if (!Subtarget->isXNACKEnabled()) +return true; + + auto &Ld = cast(MI); + TypeSize Size = Ld.getMMO().getSize().getValue(); + return Size <= 4 || Ld.getMMO().getAlign().value() >= Size; + }]; +} + +class SMRDUnalignedLoadPat : PatFrag <(ops node:$ptr), (Op node:$ptr), [{ + // Do the alignment check if XNACK support is enabled. + if (!Subtarget->isXNACKEnabled()) +return false; + + // Returns true if it is an under aligned multi-dword load. + LoadSDNode *Ld = cast(N); + unsigned Size = Ld->getMemoryVT().getStoreSize(); + return Size > 4 && (Ld->getAlign().value() < Size); +}]> { + let GISelPredicateCode = [{ + if (!Subtarget->isXNACKEnabled()) +return false; + + auto &Ld = cast(MI); + TypeSize Size = Ld.getMMO().getSize().getValue(); + return Size > 4 && (Ld.getMMO().getAlign().value() < Size); jayfoad wrote: Don't need the parens https://github.com/llvm/llvm-project/pull/96163 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1700,19 +1722,29 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI, case 8: return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } - case S_LOAD_IMM: + case S_LOAD_IMM: { +// If XNACK is enabled, use the constrained opcodes when the first load is +// under-aligned. +const MachineMemOperand *MMO = *CI.I->memoperands_begin(); +bool NeedsConstrainedOpc = +STM->isXNACKEnabled() && MMO->getAlign().value() < (Width << 2); jayfoad wrote:
```suggestion
STM->isXNACKEnabled() && MMO->getAlign().value() < Width * 4;
```
https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -658,17 +658,17 @@ define amdgpu_kernel void @image_bvh_intersect_ray_nsa_reassign(ptr %p_node_ptr, ; ; GFX1013-LABEL: image_bvh_intersect_ray_nsa_reassign: ; GFX1013: ; %bb.0: -; GFX1013-NEXT:s_load_dwordx8 s[0:7], s[0:1], 0x24 +; GFX1013-NEXT:s_load_dwordx8 s[4:11], s[0:1], 0x24 jayfoad wrote: I guess this code changes because xnack is enabled by default for GFX10.1? Is there anything we could do to add known alignment info here, to avoid the code pessimization? https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1212,8 +1228,14 @@ void SILoadStoreOptimizer::copyToDestRegs( // Copy to the old destination registers. const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY); - const auto *Dest0 = TII->getNamedOperand(*CI.I, OpName); - const auto *Dest1 = TII->getNamedOperand(*Paired.I, OpName); + auto *Dest0 = TII->getNamedOperand(*CI.I, OpName); + auto *Dest1 = TII->getNamedOperand(*Paired.I, OpName); + + // The constrained sload instructions in S_LOAD_IMM class will have + // `early-clobber` flag in the dst operand. Remove the flag before using the + // MOs in copies. + Dest0->setIsEarlyClobber(false); + Dest1->setIsEarlyClobber(false); jayfoad wrote: It's a bit ugly to modify in-place the operands of `CI.I` and `Paired.I`. But I guess it is harmless since they will be erased soon, when the merged load instruction is created. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
@@ -866,13 +866,61 @@ def SMRDBufferImm : ComplexPattern; def SMRDBufferImm32 : ComplexPattern; def SMRDBufferSgprImm : ComplexPattern; +class SMRDAlignedLoadPat : PatFrag <(ops node:$ptr), (Op node:$ptr), [{ + // Ignore the alignment check if XNACK support is disabled. + if (!Subtarget->isXNACKEnabled()) +return true; + + // Returns true if it is a single dword load or naturally aligned multi-dword load. + LoadSDNode *Ld = cast(N); + unsigned Size = Ld->getMemoryVT().getStoreSize(); + return Size <= 4 || Ld->getAlign().value() >= Size; +}]> { + let GISelPredicateCode = [{ + if (!Subtarget->isXNACKEnabled()) +return true; + + auto &Ld = cast(MI); + TypeSize Size = Ld.getMMO().getSize().getValue(); + return Size <= 4 || Ld.getMMO().getAlign().value() >= Size; + }]; +} + +class SMRDUnalignedLoadPat : PatFrag <(ops node:$ptr), (Op node:$ptr), [{ jayfoad wrote: I don't think you need this class at all, since the _ec forms should work in all cases. It's just an optimization to prefer the non-_ec forms when the load is suitably aligned, and you can handle that with DAG pattern priority (maybe by setting AddedComplexity). https://github.com/llvm/llvm-project/pull/96163 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -6,7 +6,7 @@ declare i32 @llvm.amdgcn.global.atomic.csub(ptr addrspace(1), i32) ; GCN-LABEL: {{^}}global_atomic_csub_rtn: ; PREGFX12: global_atomic_csub v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9:]+}}, s{{\[[0-9]+:[0-9]+\]}} glc -; GFX12PLUS: global_atomic_sub_clamp_u32 v0, v0, v1, s[0:1] th:TH_ATOMIC_RETURN +; GFX12PLUS: global_atomic_sub_clamp_u32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}} th:TH_ATOMIC_RETURN jayfoad wrote: You shouldn't need any changes in this file. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/jayfoad approved this pull request. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
https://github.com/jayfoad edited https://github.com/llvm/llvm-project/pull/96163 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
@@ -34,18 +34,17 @@ entry: } define amdgpu_kernel void @test_llvm_amdgcn_fdot2_bf16_bf16_dpp( -; SDAG-GFX11-LABEL: test_llvm_amdgcn_fdot2_bf16_bf16_dpp: -; SDAG-GFX11: ; %bb.0: ; %entry -; SDAG-GFX11-NEXT:s_load_b128 s[0:3], s[0:1], 0x24 -; SDAG-GFX11-NEXT:s_waitcnt lgkmcnt(0) -; SDAG-GFX11-NEXT:scratch_load_b32 v0, off, s2 -; SDAG-GFX11-NEXT:scratch_load_u16 v1, off, s3 -; SDAG-GFX11-NEXT:scratch_load_b32 v2, off, s1 -; SDAG-GFX11-NEXT:s_waitcnt vmcnt(0) -; SDAG-GFX11-NEXT:v_dot2_bf16_bf16_e64_dpp v0, v2, v0, v1 quad_perm:[1,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:1 -; SDAG-GFX11-NEXT:scratch_store_b16 off, v0, s0 -; SDAG-GFX11-NEXT:s_endpgm -; +; GFX11-LABEL: test_llvm_amdgcn_fdot2_bf16_bf16_dpp: +; GFX11: ; %bb.0: ; %entry +; GFX11-NEXT:s_load_b128 s[0:3], s[0:1], 0x24 +; GFX11-NEXT:s_waitcnt lgkmcnt(0) +; GFX11-NEXT:scratch_load_b32 v0, off, s2 +; GFX11-NEXT:scratch_load_u16 v1, off, s3 +; GFX11-NEXT:scratch_load_b32 v2, off, s1 +; GFX11-NEXT:s_waitcnt vmcnt(0) +; GFX11-NEXT:v_dot2_bf16_bf16_e64_dpp v0, v2, v0, v1 quad_perm:[1,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:1 +; GFX11-NEXT:scratch_store_b16 off, v0, s0 +; GFX11-NEXT:s_endpgm ; GISEL-GFX11-LABEL: test_llvm_amdgcn_fdot2_bf16_bf16_dpp: jayfoad wrote: Should probably remove these GISEL-GFX11 checks since the corresponding RUN line is disabled. https://github.com/llvm/llvm-project/pull/96163 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
https://github.com/jayfoad approved this pull request. LGTM. https://github.com/llvm/llvm-project/pull/96163 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] DAG: Lower is.fpclass fcSubnormal|fcZero to fabs(x) < smallest_normal (PR #100390)
https://github.com/jayfoad approved this pull request. Makes sense to me. For the ordered case I think this would only be profitable if fabs is free _and_ you don't have integer "test"-style instructions. https://github.com/llvm/llvm-project/pull/100390 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
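For illustration, a standalone sketch of the equivalence being approved (assuming IEEE single precision; FLT_MIN is the smallest normal float):

```cpp
#include <cfloat>
#include <cmath>

// fabs(x) < smallest_normal holds exactly for +/-0 and subnormals, and the
// ordered '<' is false for NaN, matching is.fpclass(fcSubnormal|fcZero).
bool isZeroOrSubnormal(float X) { return std::fabs(X) < FLT_MIN; }
```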
[llvm-branch-commits] [llvm] DAG: Lower fcNormal is.fpclass to compare with inf (PR #100389)
jayfoad wrote: > Looks worse for x86 without the fabs check. Not sure if this is useful for > any targets. Seems unlikely that this would ever be profitable in the ordered case, since you can implement that with pretty simple integer checks on the exponent field. (Check that it isn't 0 and isn't maximal.) https://github.com/llvm/llvm-project/pull/100389 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
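A sketch of the integer exponent check described here, for IEEE single precision (the function name and bit-level view are illustrative):

```cpp
#include <cstdint>

// A value is normal iff its biased exponent field is neither 0
// (zero/subnormal) nor all-ones (inf/nan); the sign bit is irrelevant.
bool isNormalBits(uint32_t Bits) {
  uint32_t Exp = (Bits >> 23) & 0xFF;
  return Exp != 0 && Exp != 0xFF;
}
```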
[llvm-branch-commits] [llvm] AMDGPU: Add baseline test for vectorize of integer min/max (PR #100513)
https://github.com/jayfoad edited https://github.com/llvm/llvm-project/pull/100513 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] AMDGPU: Add baseline test for vectorize of integer min/max (PR #100513)
https://github.com/jayfoad approved this pull request. LGTM. https://github.com/llvm/llvm-project/pull/100513 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] AMDGPU: Add baseline test for vectorize of integer min/max (PR #100513)
@@ -0,0 +1,366 @@ +; NOTE: Assertions have been autogenerated by utils/update_test_checks.py +; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii -passes=slp-vectorizer,instcombine %s | FileCheck -check-prefixes=GCN,GFX7 %s +; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -passes=slp-vectorizer,instcombine %s | FileCheck -check-prefixes=GCN,GFX8 %s +; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -passes=slp-vectorizer,instcombine %s | FileCheck -check-prefixes=GCN,GFX9 %s + +define <2 x i16> @uadd_sat_v2i16(<2 x i16> %arg0, <2 x i16> %arg1) { +; GFX7-LABEL: @uadd_sat_v2i16( +; GFX7-NEXT: bb: +; GFX7-NEXT:[[ARG0_0:%.*]] = extractelement <2 x i16> [[ARG0:%.*]], i64 0 +; GFX7-NEXT:[[ARG0_1:%.*]] = extractelement <2 x i16> [[ARG0]], i64 1 +; GFX7-NEXT:[[ARG1_0:%.*]] = extractelement <2 x i16> [[ARG1:%.*]], i64 0 +; GFX7-NEXT:[[ARG1_1:%.*]] = extractelement <2 x i16> [[ARG1]], i64 1 +; GFX7-NEXT:[[ADD_0:%.*]] = call i16 @llvm.umin.i16(i16 [[ARG0_0]], i16 [[ARG1_0]]) +; GFX7-NEXT:[[ADD_1:%.*]] = call i16 @llvm.umin.i16(i16 [[ARG0_1]], i16 [[ARG1_1]]) +; GFX7-NEXT:[[INS_0:%.*]] = insertelement <2 x i16> poison, i16 [[ADD_0]], i64 0 +; GFX7-NEXT:[[INS_1:%.*]] = insertelement <2 x i16> [[INS_0]], i16 [[ADD_1]], i64 1 +; GFX7-NEXT:ret <2 x i16> [[INS_1]] +; +; GFX8-LABEL: @uadd_sat_v2i16( +; GFX8-NEXT: bb: +; GFX8-NEXT:[[TMP0:%.*]] = call <2 x i16> @llvm.umin.v2i16(<2 x i16> [[ARG0:%.*]], <2 x i16> [[ARG1:%.*]]) +; GFX8-NEXT:ret <2 x i16> [[TMP0]] +; +; GFX9-LABEL: @uadd_sat_v2i16( +; GFX9-NEXT: bb: +; GFX9-NEXT:[[TMP0:%.*]] = call <2 x i16> @llvm.umin.v2i16(<2 x i16> [[ARG0:%.*]], <2 x i16> [[ARG1:%.*]]) +; GFX9-NEXT:ret <2 x i16> [[TMP0]] +; +bb: + %arg0.0 = extractelement <2 x i16> %arg0, i64 0 + %arg0.1 = extractelement <2 x i16> %arg0, i64 1 + %arg1.0 = extractelement <2 x i16> %arg1, i64 0 + %arg1.1 = extractelement <2 x i16> %arg1, i64 1 + %add.0 = call i16 @llvm.umin.i16(i16 %arg0.0, i16 %arg1.0) + %add.1 = call i16 @llvm.umin.i16(i16 %arg0.1, i16 %arg1.1) + %ins.0 = insertelement <2 x i16> undef, i16 %add.0, i64 0 + %ins.1 = insertelement <2 x i16> %ins.0, i16 %add.1, i64 1 + ret <2 x i16> %ins.1 +} + +define <2 x i16> @usub_sat_v2i16(<2 x i16> %arg0, <2 x i16> %arg1) { +; GFX7-LABEL: @usub_sat_v2i16( +; GFX7-NEXT: bb: +; GFX7-NEXT:[[ARG0_0:%.*]] = extractelement <2 x i16> [[ARG0:%.*]], i64 0 +; GFX7-NEXT:[[ARG0_1:%.*]] = extractelement <2 x i16> [[ARG0]], i64 1 +; GFX7-NEXT:[[ARG1_0:%.*]] = extractelement <2 x i16> [[ARG1:%.*]], i64 0 +; GFX7-NEXT:[[ARG1_1:%.*]] = extractelement <2 x i16> [[ARG1]], i64 1 +; GFX7-NEXT:[[ADD_0:%.*]] = call i16 @llvm.umax.i16(i16 [[ARG0_0]], i16 [[ARG1_0]]) +; GFX7-NEXT:[[ADD_1:%.*]] = call i16 @llvm.umax.i16(i16 [[ARG0_1]], i16 [[ARG1_1]]) +; GFX7-NEXT:[[INS_0:%.*]] = insertelement <2 x i16> poison, i16 [[ADD_0]], i64 0 +; GFX7-NEXT:[[INS_1:%.*]] = insertelement <2 x i16> [[INS_0]], i16 [[ADD_1]], i64 1 +; GFX7-NEXT:ret <2 x i16> [[INS_1]] +; +; GFX8-LABEL: @usub_sat_v2i16( +; GFX8-NEXT: bb: +; GFX8-NEXT:[[TMP0:%.*]] = call <2 x i16> @llvm.umax.v2i16(<2 x i16> [[ARG0:%.*]], <2 x i16> [[ARG1:%.*]]) +; GFX8-NEXT:ret <2 x i16> [[TMP0]] +; +; GFX9-LABEL: @usub_sat_v2i16( +; GFX9-NEXT: bb: +; GFX9-NEXT:[[TMP0:%.*]] = call <2 x i16> @llvm.umax.v2i16(<2 x i16> [[ARG0:%.*]], <2 x i16> [[ARG1:%.*]]) +; GFX9-NEXT:ret <2 x i16> [[TMP0]] +; +bb: + %arg0.0 = extractelement <2 x i16> %arg0, i64 0 + %arg0.1 = extractelement <2 x i16> %arg0, i64 1 + %arg1.0 = extractelement <2 x i16> %arg1, i64 0 + %arg1.1 = extractelement <2 x i16> %arg1, i64 1 + %add.0 = call 
i16 @llvm.umax.i16(i16 %arg0.0, i16 %arg1.0) + %add.1 = call i16 @llvm.umax.i16(i16 %arg0.1, i16 %arg1.1) + %ins.0 = insertelement <2 x i16> undef, i16 %add.0, i64 0 + %ins.1 = insertelement <2 x i16> %ins.0, i16 %add.1, i64 1 + ret <2 x i16> %ins.1 +} + +define <2 x i16> @sadd_sat_v2i16(<2 x i16> %arg0, <2 x i16> %arg1) { +; GFX7-LABEL: @sadd_sat_v2i16( +; GFX7-NEXT: bb: +; GFX7-NEXT:[[ARG0_0:%.*]] = extractelement <2 x i16> [[ARG0:%.*]], i64 0 +; GFX7-NEXT:[[ARG0_1:%.*]] = extractelement <2 x i16> [[ARG0]], i64 1 +; GFX7-NEXT:[[ARG1_0:%.*]] = extractelement <2 x i16> [[ARG1:%.*]], i64 0 +; GFX7-NEXT:[[ARG1_1:%.*]] = extractelement <2 x i16> [[ARG1]], i64 1 +; GFX7-NEXT:[[ADD_0:%.*]] = call i16 @llvm.smin.i16(i16 [[ARG0_0]], i16 [[ARG1_0]]) +; GFX7-NEXT:[[ADD_1:%.*]] = call i16 @llvm.smin.i16(i16 [[ARG0_1]], i16 [[ARG1_1]]) +; GFX7-NEXT:[[INS_0:%.*]] = insertelement <2 x i16> poison, i16 [[ADD_0]], i64 0 +; GFX7-NEXT:[[INS_1:%.*]] = insertelement <2 x i16> [[INS_0]], i16 [[ADD_1]], i64 1 +; GFX7-NEXT:ret <2 x i16> [[INS_1]] +; +; GFX8-LABEL: @sadd_sat_v2i16( +; GFX8-NEXT: bb: +; GFX8-NEXT:[[TMP0:%.*]] = call <2 x i16> @llvm.smin.v2i16(<2 x i16> [[ARG0:%.*]], <2 x i16> [[ARG1:%.*]]) +; GFX8-NEXT:ret <2 x i16> [[T
[llvm-branch-commits] [llvm] TTI: Check legalization cost of abs nodes (PR #100523)
@@ -54,11 +54,11 @@ define i32 @abs_nonpoison(i32 %arg) { ; FAST-NEXT: Cost Model: Found an estimated cost of 80 for instruction: %V16I32 = call <16 x i32> @llvm.abs.v16i32(<16 x i32> undef, i1 false) ; FAST-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %I16 = call i16 @llvm.abs.i16(i16 undef, i1 false) ; FAST-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V2I16 = call <2 x i16> @llvm.abs.v2i16(<2 x i16> undef, i1 false) -; FAST-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V4I16 = call <4 x i16> @llvm.abs.v4i16(<4 x i16> undef, i1 false) -; FAST-NEXT: Cost Model: Found an estimated cost of 34 for instruction: %V8I16 = call <8 x i16> @llvm.abs.v8i16(<8 x i16> undef, i1 false) -; FAST-NEXT: Cost Model: Found an estimated cost of 70 for instruction: %V16I16 = call <16 x i16> @llvm.abs.v16i16(<16 x i16> undef, i1 false) -; FAST-NEXT: Cost Model: Found an estimated cost of 114 for instruction: %V17I16 = call <17 x i16> @llvm.abs.v17i16(<17 x i16> undef, i1 false) -; FAST-NEXT: Cost Model: Found an estimated cost of 174 for instruction: %V32I16 = call <32 x i16> @llvm.abs.v32i16(<32 x i16> undef, i1 false) +; FAST-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V4I16 = call <4 x i16> @llvm.abs.v4i16(<4 x i16> undef, i1 false) +; FAST-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V8I16 = call <8 x i16> @llvm.abs.v8i16(<8 x i16> undef, i1 false) +; FAST-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V16I16 = call <16 x i16> @llvm.abs.v16i16(<16 x i16> undef, i1 false) +; FAST-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V17I16 = call <17 x i16> @llvm.abs.v17i16(<17 x i16> undef, i1 false) +; FAST-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V32I16 = call <32 x i16> @llvm.abs.v32i16(<32 x i16> undef, i1 false) jayfoad wrote: What is this demonstrating? 2 does not seem like the right cost for any VALU/SALU operation on v32i16. https://github.com/llvm/llvm-project/pull/100523 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] release/19.x: [AMDGPU] Fix folding clamp into pseudo scalar instructions (#100568) (PR #102446)
https://github.com/jayfoad approved this pull request. LGTM for backporting. https://github.com/llvm/llvm-project/pull/102446 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] release/19.x: [AMDGPU] Disable inline constants for pseudo scalar transcendentals (#104395) (PR #105472)
https://github.com/jayfoad approved this pull request. https://github.com/llvm/llvm-project/pull/105472 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] release/18.x: Convert many LivePhysRegs uses to LiveRegUnits (PR #84118)
https://github.com/jayfoad requested changes to this pull request. > this isn't fixing any known correctness issue Exactly. I don't think there is any reason to backport this. https://github.com/llvm/llvm-project/pull/84118 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix setting nontemporal in memory legalizer (#83815) (PR #90204)
https://github.com/jayfoad created https://github.com/llvm/llvm-project/pull/90204 Iterator MI can advance in insertWait() but we need original instruction to set temporal hint. Just move it before handling volatile. >From b544217fb31ffafb9b072de53a28c71acc169cf8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mirko=20Brku=C5=A1anin?= Date: Mon, 4 Mar 2024 15:05:31 +0100 Subject: [PATCH] [AMDGPU] Fix setting nontemporal in memory legalizer (#83815) Iterator MI can advance in insertWait() but we need original instruction to set temporal hint. Just move it before handling volatile. --- llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp | 10 +- .../memory-legalizer-flat-nontemporal.ll | 165 ++ .../memory-legalizer-global-nontemporal.ll| 158 ++ .../memory-legalizer-local-nontemporal.ll | 179 +++ .../memory-legalizer-private-nontemporal.ll | 203 ++ 5 files changed, 710 insertions(+), 5 deletions(-) diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp index 84b9330ef9633e..50d8bfa8750818 100644 --- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp +++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp @@ -2358,6 +2358,11 @@ bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal( bool Changed = false; + if (IsNonTemporal) { +// Set non-temporal hint for all cache levels. +Changed |= setTH(MI, AMDGPU::CPol::TH_NT); + } + if (IsVolatile) { Changed |= setScope(MI, AMDGPU::CPol::SCOPE_SYS); @@ -2370,11 +2375,6 @@ bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal( Position::AFTER); } - if (IsNonTemporal) { -// Set non-temporal hint for all cache levels. -Changed |= setTH(MI, AMDGPU::CPol::TH_NT); - } - return Changed; } diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-nontemporal.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-nontemporal.ll index a59c0394bebe20..ca7486536cf556 100644 --- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-nontemporal.ll +++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-nontemporal.ll @@ -582,5 +582,170 @@ entry: ret void } +define amdgpu_kernel void @flat_nontemporal_volatile_load( +; GFX7-LABEL: flat_nontemporal_volatile_load: +; GFX7: ; %bb.0: ; %entry +; GFX7-NEXT:s_load_dwordx4 s[0:3], s[4:5], 0x0 +; GFX7-NEXT:s_waitcnt lgkmcnt(0) +; GFX7-NEXT:v_mov_b32_e32 v0, s0 +; GFX7-NEXT:v_mov_b32_e32 v1, s1 +; GFX7-NEXT:flat_load_dword v2, v[0:1] glc +; GFX7-NEXT:s_waitcnt vmcnt(0) +; GFX7-NEXT:v_mov_b32_e32 v0, s2 +; GFX7-NEXT:v_mov_b32_e32 v1, s3 +; GFX7-NEXT:s_waitcnt lgkmcnt(0) +; GFX7-NEXT:flat_store_dword v[0:1], v2 +; GFX7-NEXT:s_endpgm +; +; GFX10-WGP-LABEL: flat_nontemporal_volatile_load: +; GFX10-WGP: ; %bb.0: ; %entry +; GFX10-WGP-NEXT:s_load_dwordx4 s[0:3], s[4:5], 0x0 +; GFX10-WGP-NEXT:s_waitcnt lgkmcnt(0) +; GFX10-WGP-NEXT:v_mov_b32_e32 v0, s0 +; GFX10-WGP-NEXT:v_mov_b32_e32 v1, s1 +; GFX10-WGP-NEXT:flat_load_dword v2, v[0:1] glc dlc +; GFX10-WGP-NEXT:s_waitcnt vmcnt(0) +; GFX10-WGP-NEXT:v_mov_b32_e32 v0, s2 +; GFX10-WGP-NEXT:v_mov_b32_e32 v1, s3 +; GFX10-WGP-NEXT:s_waitcnt lgkmcnt(0) +; GFX10-WGP-NEXT:flat_store_dword v[0:1], v2 +; GFX10-WGP-NEXT:s_endpgm +; +; GFX10-CU-LABEL: flat_nontemporal_volatile_load: +; GFX10-CU: ; %bb.0: ; %entry +; GFX10-CU-NEXT:s_load_dwordx4 s[0:3], s[4:5], 0x0 +; GFX10-CU-NEXT:s_waitcnt lgkmcnt(0) +; GFX10-CU-NEXT:v_mov_b32_e32 v0, s0 +; GFX10-CU-NEXT:v_mov_b32_e32 v1, s1 +; GFX10-CU-NEXT:flat_load_dword v2, v[0:1] glc dlc +; GFX10-CU-NEXT:s_waitcnt vmcnt(0) +; GFX10-CU-NEXT:v_mov_b32_e32 v0, s2 +; GFX10-CU-NEXT:v_mov_b32_e32 v1, s3 +; GFX10-CU-NEXT:s_waitcnt lgkmcnt(0) +; 
GFX10-CU-NEXT:flat_store_dword v[0:1], v2 +; GFX10-CU-NEXT:s_endpgm +; +; SKIP-CACHE-INV-LABEL: flat_nontemporal_volatile_load: +; SKIP-CACHE-INV: ; %bb.0: ; %entry +; SKIP-CACHE-INV-NEXT:s_load_dwordx4 s[0:3], s[0:1], 0x0 +; SKIP-CACHE-INV-NEXT:s_waitcnt lgkmcnt(0) +; SKIP-CACHE-INV-NEXT:v_mov_b32_e32 v0, s0 +; SKIP-CACHE-INV-NEXT:v_mov_b32_e32 v1, s1 +; SKIP-CACHE-INV-NEXT:flat_load_dword v2, v[0:1] glc +; SKIP-CACHE-INV-NEXT:s_waitcnt vmcnt(0) +; SKIP-CACHE-INV-NEXT:v_mov_b32_e32 v0, s2 +; SKIP-CACHE-INV-NEXT:v_mov_b32_e32 v1, s3 +; SKIP-CACHE-INV-NEXT:s_waitcnt lgkmcnt(0) +; SKIP-CACHE-INV-NEXT:flat_store_dword v[0:1], v2 +; SKIP-CACHE-INV-NEXT:s_endpgm +; +; GFX90A-NOTTGSPLIT-LABEL: flat_nontemporal_volatile_load: +; GFX90A-NOTTGSPLIT: ; %bb.0: ; %entry +; GFX90A-NOTTGSPLIT-NEXT:s_load_dwordx4 s[0:3], s[4:5], 0x0 +; GFX90A-NOTTGSPLIT-NEXT:s_waitcnt lgkmcnt(0) +; GFX90A-NOTTGSPLIT-NEXT:v_mov_b32_e32 v0, s0 +; GFX90A-NOTTGSPLIT-NEXT:v_mov_b32_e32 v1, s1 +; GFX90A-NOTTGSPLIT-NEXT:flat_load_dword v2, v[0:1] glc +; GFX90A-NOTTGSPLIT-NEXT:
[llvm-branch-commits] [llvm] [AMDGPU] Fix setting nontemporal in memory legalizer (#83815) (PR #90204)
https://github.com/jayfoad milestoned https://github.com/llvm/llvm-project/pull/90204 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] b544217 - [AMDGPU] Fix setting nontemporal in memory legalizer (#83815)
Author: Mirko Brkušanin Date: 2024-04-26T13:35:58+01:00 New Revision: b544217fb31ffafb9b072de53a28c71acc169cf8 URL: https://github.com/llvm/llvm-project/commit/b544217fb31ffafb9b072de53a28c71acc169cf8 DIFF: https://github.com/llvm/llvm-project/commit/b544217fb31ffafb9b072de53a28c71acc169cf8.diff LOG: [AMDGPU] Fix setting nontemporal in memory legalizer (#83815) Iterator MI can advance in insertWait() but we need original instruction to set temporal hint. Just move it before handling volatile. Added: Modified: llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-nontemporal.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-global-nontemporal.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-local-nontemporal.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-private-nontemporal.ll Removed: diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp index 84b9330ef9633e..50d8bfa8750818 100644 --- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp +++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp @@ -2358,6 +2358,11 @@ bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal( bool Changed = false; + if (IsNonTemporal) { +// Set non-temporal hint for all cache levels. +Changed |= setTH(MI, AMDGPU::CPol::TH_NT); + } + if (IsVolatile) { Changed |= setScope(MI, AMDGPU::CPol::SCOPE_SYS); @@ -2370,11 +2375,6 @@ bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal( Position::AFTER); } - if (IsNonTemporal) { -// Set non-temporal hint for all cache levels. -Changed |= setTH(MI, AMDGPU::CPol::TH_NT); - } - return Changed; } diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-nontemporal.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-nontemporal.ll index a59c0394bebe20..ca7486536cf556 100644 --- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-nontemporal.ll +++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-nontemporal.ll @@ -582,5 +582,170 @@ entry: ret void } +define amdgpu_kernel void @flat_nontemporal_volatile_load( +; GFX7-LABEL: flat_nontemporal_volatile_load: +; GFX7: ; %bb.0: ; %entry +; GFX7-NEXT:s_load_dwordx4 s[0:3], s[4:5], 0x0 +; GFX7-NEXT:s_waitcnt lgkmcnt(0) +; GFX7-NEXT:v_mov_b32_e32 v0, s0 +; GFX7-NEXT:v_mov_b32_e32 v1, s1 +; GFX7-NEXT:flat_load_dword v2, v[0:1] glc +; GFX7-NEXT:s_waitcnt vmcnt(0) +; GFX7-NEXT:v_mov_b32_e32 v0, s2 +; GFX7-NEXT:v_mov_b32_e32 v1, s3 +; GFX7-NEXT:s_waitcnt lgkmcnt(0) +; GFX7-NEXT:flat_store_dword v[0:1], v2 +; GFX7-NEXT:s_endpgm +; +; GFX10-WGP-LABEL: flat_nontemporal_volatile_load: +; GFX10-WGP: ; %bb.0: ; %entry +; GFX10-WGP-NEXT:s_load_dwordx4 s[0:3], s[4:5], 0x0 +; GFX10-WGP-NEXT:s_waitcnt lgkmcnt(0) +; GFX10-WGP-NEXT:v_mov_b32_e32 v0, s0 +; GFX10-WGP-NEXT:v_mov_b32_e32 v1, s1 +; GFX10-WGP-NEXT:flat_load_dword v2, v[0:1] glc dlc +; GFX10-WGP-NEXT:s_waitcnt vmcnt(0) +; GFX10-WGP-NEXT:v_mov_b32_e32 v0, s2 +; GFX10-WGP-NEXT:v_mov_b32_e32 v1, s3 +; GFX10-WGP-NEXT:s_waitcnt lgkmcnt(0) +; GFX10-WGP-NEXT:flat_store_dword v[0:1], v2 +; GFX10-WGP-NEXT:s_endpgm +; +; GFX10-CU-LABEL: flat_nontemporal_volatile_load: +; GFX10-CU: ; %bb.0: ; %entry +; GFX10-CU-NEXT:s_load_dwordx4 s[0:3], s[4:5], 0x0 +; GFX10-CU-NEXT:s_waitcnt lgkmcnt(0) +; GFX10-CU-NEXT:v_mov_b32_e32 v0, s0 +; GFX10-CU-NEXT:v_mov_b32_e32 v1, s1 +; GFX10-CU-NEXT:flat_load_dword v2, v[0:1] glc dlc +; GFX10-CU-NEXT:s_waitcnt vmcnt(0) +; GFX10-CU-NEXT:v_mov_b32_e32 v0, s2 +; GFX10-CU-NEXT:v_mov_b32_e32 v1, s3 +; GFX10-CU-NEXT:s_waitcnt lgkmcnt(0) +; GFX10-CU-NEXT:flat_store_dword v[0:1], v2 +; GFX10-CU-NEXT:s_endpgm +; 
+; SKIP-CACHE-INV-LABEL: flat_nontemporal_volatile_load: +; SKIP-CACHE-INV: ; %bb.0: ; %entry +; SKIP-CACHE-INV-NEXT:s_load_dwordx4 s[0:3], s[0:1], 0x0 +; SKIP-CACHE-INV-NEXT:s_waitcnt lgkmcnt(0) +; SKIP-CACHE-INV-NEXT:v_mov_b32_e32 v0, s0 +; SKIP-CACHE-INV-NEXT:v_mov_b32_e32 v1, s1 +; SKIP-CACHE-INV-NEXT:flat_load_dword v2, v[0:1] glc +; SKIP-CACHE-INV-NEXT:s_waitcnt vmcnt(0) +; SKIP-CACHE-INV-NEXT:v_mov_b32_e32 v0, s2 +; SKIP-CACHE-INV-NEXT:v_mov_b32_e32 v1, s3 +; SKIP-CACHE-INV-NEXT:s_waitcnt lgkmcnt(0) +; SKIP-CACHE-INV-NEXT:flat_store_dword v[0:1], v2 +; SKIP-CACHE-INV-NEXT:s_endpgm +; +; GFX90A-NOTTGSPLIT-LABEL: flat_nontemporal_volatile_load: +; GFX90A-NOTTGSPLIT: ; %bb.0: ; %entry +; GFX90A-NOTTGSPLIT-NEXT:s_load_dwordx4 s[0:3], s[4:5], 0x0 +; GFX90A-NOTTGSPLIT-NEXT:s_waitcnt lgkmcnt(0) +; GFX90A-NOTTGSPLIT-NEXT:v_mov_b32_e32 v0, s0 +; GFX90A-NOTTGSPLIT-NEXT:v_mov_b32_e32 v1, s1 +; GFX90A-NOTTGSPLIT-NEXT:flat_load_dword v2, v[0:1] glc +; GFX90A-NOTTGSPLIT-NEXT:s_waitcnt vmcnt(0) +;
[llvm-branch-commits] [llvm] [AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201) (PR #90582)
https://github.com/jayfoad created https://github.com/llvm/llvm-project/pull/90582 image_msaa_load is actually encoded as a VSAMPLE instruction and requires the appropriate waitcnt variant. >From 17b75a9517891d662e677a357713c920bb79c43c Mon Sep 17 00:00:00 2001 From: David Stuttard Date: Tue, 30 Apr 2024 10:41:51 +0100 Subject: [PATCH] [AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201) image_msaa_load is actually encoded as a VSAMPLE instruction and requires the appropriate waitcnt variant. --- llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | 8 -- .../AMDGPU/llvm.amdgcn.image.msaa.load.ll | 26 +-- 2 files changed, 19 insertions(+), 15 deletions(-) diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp index 6ecb1c8bf6e1db..97c55e4d9e41c2 100644 --- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp +++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp @@ -187,8 +187,12 @@ VmemType getVmemType(const MachineInstr &Inst) { const AMDGPU::MIMGInfo *Info = AMDGPU::getMIMGInfo(Inst.getOpcode()); const AMDGPU::MIMGBaseOpcodeInfo *BaseInfo = AMDGPU::getMIMGBaseOpcodeInfo(Info->BaseOpcode); - return BaseInfo->BVH ? VMEM_BVH - : BaseInfo->Sampler ? VMEM_SAMPLER : VMEM_NOSAMPLER; + // The test for MSAA here is because gfx12+ image_msaa_load is actually + // encoded as VSAMPLE and requires the appropriate s_waitcnt variant for that. + // Pre-gfx12 doesn't care since all vmem types result in the same s_waitcnt. + return BaseInfo->BVH ? VMEM_BVH + : BaseInfo->Sampler || BaseInfo->MSAA ? VMEM_SAMPLER + : VMEM_NOSAMPLER; } unsigned &getCounterRef(AMDGPU::Waitcnt &Wait, InstCounterType T) { diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.msaa.load.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.msaa.load.ll index 1348315e72e7bc..8da48551855570 100644 --- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.msaa.load.ll +++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.msaa.load.ll @@ -12,7 +12,7 @@ define amdgpu_ps <4 x float> @load_2dmsaa(<8 x i32> inreg %rsrc, i32 %s, i32 %t, ; GFX12-LABEL: load_2dmsaa: ; GFX12: ; %bb.0: ; %main_body ; GFX12-NEXT:image_msaa_load v[0:3], [v0, v1, v2], s[0:7] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm ; encoding: [0x06,0x20,0x46,0xe4,0x00,0x00,0x00,0x00,0x00,0x01,0x02,0x00] -; GFX12-NEXT:s_wait_loadcnt 0x0 ; encoding: [0x00,0x00,0xc0,0xbf] +; GFX12-NEXT:s_wait_samplecnt 0x0 ; encoding: [0x00,0x00,0xc2,0xbf] ; GFX12-NEXT:; return to shader part epilog main_body: %v = call <4 x float> @llvm.amdgcn.image.msaa.load.2dmsaa.v4f32.i32(i32 1, i32 %s, i32 %t, i32 %fragid, <8 x i32> %rsrc, i32 0, i32 0) @@ -32,7 +32,7 @@ define amdgpu_ps <4 x float> @load_2dmsaa_both(<8 x i32> inreg %rsrc, ptr addrsp ; GFX12: ; %bb.0: ; %main_body ; GFX12-NEXT:image_msaa_load v[0:4], [v0, v1, v2], s[0:7] dmask:0x2 dim:SQ_RSRC_IMG_2D_MSAA unorm tfe lwe ; encoding: [0x0e,0x20,0x86,0xe4,0x00,0x01,0x00,0x00,0x00,0x01,0x02,0x00] ; GFX12-NEXT:v_mov_b32_e32 v5, 0 ; encoding: [0x80,0x02,0x0a,0x7e] -; GFX12-NEXT:s_wait_loadcnt 0x0 ; encoding: [0x00,0x00,0xc0,0xbf] +; GFX12-NEXT:s_wait_samplecnt 0x0 ; encoding: [0x00,0x00,0xc2,0xbf] ; GFX12-NEXT:global_store_b32 v5, v4, s[8:9] ; encoding: [0x08,0x80,0x06,0xee,0x00,0x00,0x00,0x02,0x05,0x00,0x00,0x00] ; GFX12-NEXT:; return to shader part epilog main_body: @@ -53,7 +53,7 @@ define amdgpu_ps <4 x float> @load_2darraymsaa(<8 x i32> inreg %rsrc, i32 %s, i3 ; GFX12-LABEL: load_2darraymsaa: ; GFX12: ; %bb.0: ; %main_body ; GFX12-NEXT:image_msaa_load v[0:3], [v0, v1, v2, v3], s[0:7] dmask:0x4 dim:SQ_RSRC_IMG_2D_MSAA_ARRAY unorm 
; encoding: [0x07,0x20,0x06,0xe5,0x00,0x00,0x00,0x00,0x00,0x01,0x02,0x03] -; GFX12-NEXT:s_wait_loadcnt 0x0 ; encoding: [0x00,0x00,0xc0,0xbf] +; GFX12-NEXT:s_wait_samplecnt 0x0 ; encoding: [0x00,0x00,0xc2,0xbf] ; GFX12-NEXT:; return to shader part epilog main_body: %v = call <4 x float> @llvm.amdgcn.image.msaa.load.2darraymsaa.v4f32.i32(i32 4, i32 %s, i32 %t, i32 %slice, i32 %fragid, <8 x i32> %rsrc, i32 0, i32 0) @@ -73,7 +73,7 @@ define amdgpu_ps <4 x float> @load_2darraymsaa_tfe(<8 x i32> inreg %rsrc, ptr ad ; GFX12: ; %bb.0: ; %main_body ; GFX12-NEXT:image_msaa_load v[0:4], [v0, v1, v2, v3], s[0:7] dmask:0x8 dim:SQ_RSRC_IMG_2D_MSAA_ARRAY unorm tfe ; encoding: [0x0f,0x20,0x06,0xe6,0x00,0x00,0x00,0x00,0x00,0x01,0x02,0x03] ; GFX12-NEXT:v_mov_b32_e32 v5, 0 ; encoding: [0x80,0x02,0x0a,0x7e] -; GFX12-NEXT:s_wait_loadcnt 0x0 ; encoding: [0x00,0x00,0xc0,0xbf] +; GFX12-NEXT:s_wait_samplecnt 0x0 ; encoding: [0x00,0x00,0xc2,0xbf] ; GFX12-NEXT:global_store_b32 v5, v4, s[8:9] ; encoding: [0x08,0x80,0x06,0xee,0x00,0x00,0x00,0x02,0x05,0x00,0x00,0x00] ; GFX12-NEXT:; return to shader part epilog main_body: @@ -94,7 +94,7 @@ defin
[llvm-branch-commits] [llvm] [AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201) (PR #90582)
https://github.com/jayfoad milestoned https://github.com/llvm/llvm-project/pull/90582 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201) (PR #90582)
jayfoad wrote: Let's not backport this yet since @pendingchaos has pointed out a problem with #90201. https://github.com/llvm/llvm-project/pull/90582 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201) (PR #90582)
https://github.com/jayfoad converted_to_draft https://github.com/llvm/llvm-project/pull/90582 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Enhance s_waitcnt insertion before barrier for gfx12 (#90595) (PR #90719)
https://github.com/jayfoad created https://github.com/llvm/llvm-project/pull/90719 Code to determine if a waitcnt is required before a barrier instruction only considered S_BARRIER. gfx12 adds barrier_signal/wait so need to enhance the existing code to look for a barrier start (which is just an S_BARRIER for earlier architectures). >From e31113098e4669850f3ff924bead9e0fb9618f20 Mon Sep 17 00:00:00 2001 From: David Stuttard Date: Wed, 1 May 2024 11:37:13 +0100 Subject: [PATCH] [AMDGPU] Enhance s_waitcnt insertion before barrier for gfx12 (#90595) Code to determine if a waitcnt is required before a barrier instruction only considered S_BARRIER. gfx12 adds barrier_signal/wait so need to enhance the existing code to look for a barrier start (which is just an S_BARRIER for earlier architectures). --- llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | 2 +- llvm/lib/Target/AMDGPU/SIInstrInfo.h | 11 ++ .../CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll | 2 ++ .../AMDGPU/llvm.amdgcn.s.barrier.wait.ll | 22 +++ 4 files changed, 36 insertions(+), 1 deletion(-) diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp index 6ecb1c8bf6e1db..7a3198612f86fc 100644 --- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp +++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp @@ -1832,7 +1832,7 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI, // not, we need to ensure the subtarget is capable of backing off barrier // instructions in case there are any outstanding memory operations that may // cause an exception. Otherwise, insert an explicit S_WAITCNT 0 here. - if (MI.getOpcode() == AMDGPU::S_BARRIER && + if (TII->isBarrierStart(MI.getOpcode()) && !ST->hasAutoWaitcntBeforeBarrier() && !ST->supportsBackOffBarrier()) { Wait = Wait.combined( AMDGPU::Waitcnt::allZero(ST->hasExtendedWaitCounts(), ST->hasVscnt())); diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h index 1c9dacc09f8154..626d903c0c6958 100644 --- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h +++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h @@ -908,6 +908,17 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo { return MI.getDesc().TSFlags & SIInstrFlags::IsNeverUniform; } + // Check to see if opcode is for a barrier start. 
Pre gfx12 this is just the + // S_BARRIER, but after support for S_BARRIER_SIGNAL* / S_BARRIER_WAIT we want + // to check for the barrier start (S_BARRIER_SIGNAL*) + bool isBarrierStart(unsigned Opcode) const { +return Opcode == AMDGPU::S_BARRIER || + Opcode == AMDGPU::S_BARRIER_SIGNAL_M0 || + Opcode == AMDGPU::S_BARRIER_SIGNAL_ISFIRST_M0 || + Opcode == AMDGPU::S_BARRIER_SIGNAL_IMM || + Opcode == AMDGPU::S_BARRIER_SIGNAL_ISFIRST_IMM; + } + static bool doesNotReadTiedSource(const MachineInstr &MI) { return MI.getDesc().TSFlags & SIInstrFlags::TiedSourceNotRead; } diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll index a7d3115af29bff..47c021769aa56f 100644 --- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll +++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll @@ -96,6 +96,7 @@ define amdgpu_kernel void @test_barrier(ptr addrspace(1) %out, i32 %size) #0 { ; VARIANT4-NEXT:s_wait_kmcnt 0x0 ; VARIANT4-NEXT:v_xad_u32 v1, v0, -1, s2 ; VARIANT4-NEXT:global_store_b32 v3, v0, s[0:1] +; VARIANT4-NEXT:s_wait_storecnt 0x0 ; VARIANT4-NEXT:s_barrier_signal -1 ; VARIANT4-NEXT:s_barrier_wait -1 ; VARIANT4-NEXT:v_ashrrev_i32_e32 v2, 31, v1 @@ -142,6 +143,7 @@ define amdgpu_kernel void @test_barrier(ptr addrspace(1) %out, i32 %size) #0 { ; VARIANT6-NEXT:v_dual_mov_b32 v4, s1 :: v_dual_mov_b32 v3, s0 ; VARIANT6-NEXT:v_sub_nc_u32_e32 v1, s2, v0 ; VARIANT6-NEXT:global_store_b32 v5, v0, s[0:1] +; VARIANT6-NEXT:s_wait_storecnt 0x0 ; VARIANT6-NEXT:s_barrier_signal -1 ; VARIANT6-NEXT:s_barrier_wait -1 ; VARIANT6-NEXT:v_ashrrev_i32_e32 v2, 31, v1 diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.wait.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.wait.ll index 4ab5e97964a857..38a34ec6daf73c 100644 --- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.wait.ll +++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.wait.ll @@ -12,6 +12,7 @@ define amdgpu_kernel void @test1_s_barrier_signal(ptr addrspace(1) %out) #0 { ; GCN-NEXT:v_sub_nc_u32_e32 v0, v1, v0 ; GCN-NEXT:s_wait_kmcnt 0x0 ; GCN-NEXT:global_store_b32 v3, v2, s[0:1] +; GCN-NEXT:s_wait_storecnt 0x0 ; GCN-NEXT:s_barrier_signal -1 ; GCN-NEXT:s_barrier_wait -1 ; GCN-NEXT:global_store_b32 v3, v0, s[0:1] @@ -28,6 +29,7 @@ define amdgpu_kernel void @test1_s_barrier_signal(ptr addrspace(1) %out) #0 { ; GLOBAL-ISEL-NEXT:v_sub_nc_u32_e32 v0, v1, v0 ; GLOBAL-ISEL-NEXT:s_wait_kmcnt 0x0 ; GLOBAL-ISEL-N
[llvm-branch-commits] [llvm] [AMDGPU] Enhance s_waitcnt insertion before barrier for gfx12 (#90595) (PR #90719)
https://github.com/jayfoad milestoned https://github.com/llvm/llvm-project/pull/90719 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201) (PR #90582)
https://github.com/jayfoad updated https://github.com/llvm/llvm-project/pull/90582 >From 17b75a9517891d662e677a357713c920bb79c43c Mon Sep 17 00:00:00 2001 From: David Stuttard Date: Tue, 30 Apr 2024 10:41:51 +0100 Subject: [PATCH 1/2] [AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201) image_msaa_load is actually encoded as a VSAMPLE instruction and requires the appropriate waitcnt variant. --- llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | 8 -- .../AMDGPU/llvm.amdgcn.image.msaa.load.ll | 26 +-- 2 files changed, 19 insertions(+), 15 deletions(-) diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp index 6ecb1c8bf6e1db..97c55e4d9e41c2 100644 --- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp +++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp @@ -187,8 +187,12 @@ VmemType getVmemType(const MachineInstr &Inst) { const AMDGPU::MIMGInfo *Info = AMDGPU::getMIMGInfo(Inst.getOpcode()); const AMDGPU::MIMGBaseOpcodeInfo *BaseInfo = AMDGPU::getMIMGBaseOpcodeInfo(Info->BaseOpcode); - return BaseInfo->BVH ? VMEM_BVH - : BaseInfo->Sampler ? VMEM_SAMPLER : VMEM_NOSAMPLER; + // The test for MSAA here is because gfx12+ image_msaa_load is actually + // encoded as VSAMPLE and requires the appropriate s_waitcnt variant for that. + // Pre-gfx12 doesn't care since all vmem types result in the same s_waitcnt. + return BaseInfo->BVH ? VMEM_BVH + : BaseInfo->Sampler || BaseInfo->MSAA ? VMEM_SAMPLER + : VMEM_NOSAMPLER; } unsigned &getCounterRef(AMDGPU::Waitcnt &Wait, InstCounterType T) { diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.msaa.load.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.msaa.load.ll index 1348315e72e7bc..8da48551855570 100644 --- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.msaa.load.ll +++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.msaa.load.ll @@ -12,7 +12,7 @@ define amdgpu_ps <4 x float> @load_2dmsaa(<8 x i32> inreg %rsrc, i32 %s, i32 %t, ; GFX12-LABEL: load_2dmsaa: ; GFX12: ; %bb.0: ; %main_body ; GFX12-NEXT:image_msaa_load v[0:3], [v0, v1, v2], s[0:7] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm ; encoding: [0x06,0x20,0x46,0xe4,0x00,0x00,0x00,0x00,0x00,0x01,0x02,0x00] -; GFX12-NEXT:s_wait_loadcnt 0x0 ; encoding: [0x00,0x00,0xc0,0xbf] +; GFX12-NEXT:s_wait_samplecnt 0x0 ; encoding: [0x00,0x00,0xc2,0xbf] ; GFX12-NEXT:; return to shader part epilog main_body: %v = call <4 x float> @llvm.amdgcn.image.msaa.load.2dmsaa.v4f32.i32(i32 1, i32 %s, i32 %t, i32 %fragid, <8 x i32> %rsrc, i32 0, i32 0) @@ -32,7 +32,7 @@ define amdgpu_ps <4 x float> @load_2dmsaa_both(<8 x i32> inreg %rsrc, ptr addrsp ; GFX12: ; %bb.0: ; %main_body ; GFX12-NEXT:image_msaa_load v[0:4], [v0, v1, v2], s[0:7] dmask:0x2 dim:SQ_RSRC_IMG_2D_MSAA unorm tfe lwe ; encoding: [0x0e,0x20,0x86,0xe4,0x00,0x01,0x00,0x00,0x00,0x01,0x02,0x00] ; GFX12-NEXT:v_mov_b32_e32 v5, 0 ; encoding: [0x80,0x02,0x0a,0x7e] -; GFX12-NEXT:s_wait_loadcnt 0x0 ; encoding: [0x00,0x00,0xc0,0xbf] +; GFX12-NEXT:s_wait_samplecnt 0x0 ; encoding: [0x00,0x00,0xc2,0xbf] ; GFX12-NEXT:global_store_b32 v5, v4, s[8:9] ; encoding: [0x08,0x80,0x06,0xee,0x00,0x00,0x00,0x02,0x05,0x00,0x00,0x00] ; GFX12-NEXT:; return to shader part epilog main_body: @@ -53,7 +53,7 @@ define amdgpu_ps <4 x float> @load_2darraymsaa(<8 x i32> inreg %rsrc, i32 %s, i3 ; GFX12-LABEL: load_2darraymsaa: ; GFX12: ; %bb.0: ; %main_body ; GFX12-NEXT:image_msaa_load v[0:3], [v0, v1, v2, v3], s[0:7] dmask:0x4 dim:SQ_RSRC_IMG_2D_MSAA_ARRAY unorm ; encoding: [0x07,0x20,0x06,0xe5,0x00,0x00,0x00,0x00,0x00,0x01,0x02,0x03] -; GFX12-NEXT:s_wait_loadcnt 
0x0 ; encoding: [0x00,0x00,0xc0,0xbf] +; GFX12-NEXT:s_wait_samplecnt 0x0 ; encoding: [0x00,0x00,0xc2,0xbf] ; GFX12-NEXT:; return to shader part epilog main_body: %v = call <4 x float> @llvm.amdgcn.image.msaa.load.2darraymsaa.v4f32.i32(i32 4, i32 %s, i32 %t, i32 %slice, i32 %fragid, <8 x i32> %rsrc, i32 0, i32 0) @@ -73,7 +73,7 @@ define amdgpu_ps <4 x float> @load_2darraymsaa_tfe(<8 x i32> inreg %rsrc, ptr ad ; GFX12: ; %bb.0: ; %main_body ; GFX12-NEXT:image_msaa_load v[0:4], [v0, v1, v2, v3], s[0:7] dmask:0x8 dim:SQ_RSRC_IMG_2D_MSAA_ARRAY unorm tfe ; encoding: [0x0f,0x20,0x06,0xe6,0x00,0x00,0x00,0x00,0x00,0x01,0x02,0x03] ; GFX12-NEXT:v_mov_b32_e32 v5, 0 ; encoding: [0x80,0x02,0x0a,0x7e] -; GFX12-NEXT:s_wait_loadcnt 0x0 ; encoding: [0x00,0x00,0xc0,0xbf] +; GFX12-NEXT:s_wait_samplecnt 0x0 ; encoding: [0x00,0x00,0xc2,0xbf] ; GFX12-NEXT:global_store_b32 v5, v4, s[8:9] ; encoding: [0x08,0x80,0x06,0xee,0x00,0x00,0x00,0x02,0x05,0x00,0x00,0x00] ; GFX12-NEXT:; return to shader part epilog main_body: @@ -94,7 +94,7 @@ define amdgpu_ps <4 x float> @load_2dmsaa_glc(<8 x i32> inreg %rsrc, i32 %s, i32 ; GFX12-LABEL: load_2dmsaa
[llvm-branch-commits] [llvm] [AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201) (PR #90582)
https://github.com/jayfoad ready_for_review https://github.com/llvm/llvm-project/pull/90582 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201) (PR #90582)
jayfoad wrote: > Let's not backport this yet since @pendingchaos has pointed out a problem > with #90201. Fixed by #90710 which I have added to this PR. https://github.com/llvm/llvm-project/pull/90582 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix setting nontemporal in memory legalizer (#83815) (PR #90204)
jayfoad wrote: > Hi @jayfoad (or anyone else). If you would like to add a note about this fix > in the release notes (completely optional). Please reply to this comment with > a one or two sentence description of the fix. When you are done, please add > the release:note label to this PR. I don't think this fix is particularly noteworthy. Would there already be a list of bugs fixed in the release notes? https://github.com/llvm/llvm-project/pull/90204 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] release/18.x: [AMDGPU] Fix GFX12 encoding of s_wait_event export_ready (#89622) (PR #91034)
https://github.com/jayfoad approved this pull request. https://github.com/llvm/llvm-project/pull/91034 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] release/18.x: [AMDGPU] Fix GFX12 encoding of s_wait_event export_ready (#89622) (PR #91034)
jayfoad wrote: > Fixed encoding of AMDGPU instructions I don't think the release notes should say that. It makes it sound like all encodings were wrong. https://github.com/llvm/llvm-project/pull/91034 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] PR for llvm/llvm-project#79451 (PR #79457)
jayfoad wrote: > @jayfoad What do you think about merging this PR to the release branch? LGTM, but it was me that requested it. https://github.com/llvm/llvm-project/pull/79457 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] New llvm.amdgcn.wave.id intrinsic (#79325) (PR #79689)
https://github.com/jayfoad created https://github.com/llvm/llvm-project/pull/79689 This is only valid on targets with architected SGPRs. >From c5949b09b05e7417d0494b2301781b84d22b95ef Mon Sep 17 00:00:00 2001 From: Jay Foad Date: Thu, 25 Jan 2024 07:48:06 + Subject: [PATCH] [AMDGPU] New llvm.amdgcn.wave.id intrinsic (#79325) This is only valid on targets with architected SGPRs. --- llvm/include/llvm/IR/IntrinsicsAMDGPU.td | 4 ++ .../lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | 19 ++ llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h | 1 + llvm/lib/Target/AMDGPU/SIISelLowering.cpp | 14 + llvm/lib/Target/AMDGPU/SIISelLowering.h | 1 + .../CodeGen/AMDGPU/llvm.amdgcn.wave.id.ll | 61 +++ 6 files changed, 100 insertions(+) create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.id.ll diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td index 9eb1ac8e27befb..c5f43d17d1c148 100644 --- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td +++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td @@ -2777,6 +2777,10 @@ class AMDGPULoadTr: def int_amdgcn_global_load_tr : AMDGPULoadTr; +// i32 @llvm.amdgcn.wave.id() +def int_amdgcn_wave_id : + DefaultAttrsIntrinsic<[llvm_i32_ty], [], [IntrNoMem, IntrSpeculatable]>; + //===--===// // Deep learning intrinsics. //===--===// diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp index 32921bb248caf0..118c8b7c66690f 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp +++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp @@ -6848,6 +6848,23 @@ bool AMDGPULegalizerInfo::legalizeStackSave(MachineInstr &MI, return true; } +bool AMDGPULegalizerInfo::legalizeWaveID(MachineInstr &MI, + MachineIRBuilder &B) const { + // With architected SGPRs, waveIDinGroup is in TTMP8[29:25]. 
+ if (!ST.hasArchitectedSGPRs()) +return false; + LLT S32 = LLT::scalar(32); + Register DstReg = MI.getOperand(0).getReg(); + Register TTMP8 = + getFunctionLiveInPhysReg(B.getMF(), B.getTII(), AMDGPU::TTMP8, + AMDGPU::SReg_32RegClass, B.getDebugLoc(), S32); + auto LSB = B.buildConstant(S32, 25); + auto Width = B.buildConstant(S32, 5); + B.buildUbfx(DstReg, TTMP8, LSB, Width); + MI.eraseFromParent(); + return true; +} + bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper, MachineInstr &MI) const { MachineIRBuilder &B = Helper.MIRBuilder; @@ -6970,6 +6987,8 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper, case Intrinsic::amdgcn_workgroup_id_z: return legalizePreloadedArgIntrin(MI, MRI, B, AMDGPUFunctionArgInfo::WORKGROUP_ID_Z); + case Intrinsic::amdgcn_wave_id: +return legalizeWaveID(MI, B); case Intrinsic::amdgcn_lds_kernel_id: return legalizePreloadedArgIntrin(MI, MRI, B, AMDGPUFunctionArgInfo::LDS_KERNEL_ID); diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h index 56aabd4f6ab71b..ecbe42681c6690 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h +++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h @@ -212,6 +212,7 @@ class AMDGPULegalizerInfo final : public LegalizerInfo { bool legalizeFPTruncRound(MachineInstr &MI, MachineIRBuilder &B) const; bool legalizeStackSave(MachineInstr &MI, MachineIRBuilder &B) const; + bool legalizeWaveID(MachineInstr &MI, MachineIRBuilder &B) const; bool legalizeImageIntrinsic( MachineInstr &MI, MachineIRBuilder &B, diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp index d35b76c8ad54eb..9cbcf0012ea878 100644 --- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp +++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp @@ -7890,6 +7890,18 @@ SDValue SITargetLowering::lowerSBuffer(EVT VT, SDLoc DL, SDValue Rsrc, return Loads[0]; } +SDValue SITargetLowering::lowerWaveID(SelectionDAG &DAG, SDValue Op) const { + // With architected SGPRs, waveIDinGroup is in TTMP8[29:25]. + if (!Subtarget->hasArchitectedSGPRs()) +return {}; + SDLoc SL(Op); + MVT VT = MVT::i32; + SDValue TTMP8 = CreateLiveInRegister(DAG, &AMDGPU::SReg_32RegClass, + AMDGPU::TTMP8, VT, SL); + return DAG.getNode(AMDGPUISD::BFE_U32, SL, VT, TTMP8, + DAG.getConstant(25, SL, VT), DAG.getConstant(5, SL, VT)); +} + SDValue SITargetLowering::lowerWorkitemID(SelectionDAG &DAG, SDValue Op, unsigned Dim, const ArgDescriptor &Arg) const { @@ -8060,6 +8072,8 @@ SDValue SITargetLowering::Lower
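For readers decoding the buildUbfx call above: it extracts the 5-bit wave ID field from TTMP8. A self-contained sketch of the same bitfield extract (the field position is from the patch; the standalone function is illustrative):

```cpp
#include <cstdint>

// waveIDinGroup lives in TTMP8[29:25]: shift the field down and mask off
// 5 bits, matching buildUbfx(DstReg, TTMP8, /*LSB=*/25, /*Width=*/5).
uint32_t waveIdInGroup(uint32_t ttmp8) {
  return (ttmp8 >> 25) & 0x1f;
}
```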
[llvm-branch-commits] [llvm] [AMDGPU] New llvm.amdgcn.wave.id intrinsic (#79325) (PR #79689)
https://github.com/jayfoad milestoned https://github.com/llvm/llvm-project/pull/79689 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] New llvm.amdgcn.wave.id intrinsic (#79325) (PR #79689)
https://github.com/jayfoad edited https://github.com/llvm/llvm-project/pull/79689 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] New llvm.amdgcn.wave.id intrinsic (#79325) (PR #79689)
jayfoad wrote: @tstellar does this backport PR look OK? I created it with `gh pr create -f -B release/18.x` and I wasn't sure if I had to edit anything, apart from adding the release milestone. https://github.com/llvm/llvm-project/pull/79689 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] New llvm.amdgcn.wave.id intrinsic (#79325) (PR #79689)
https://github.com/jayfoad closed https://github.com/llvm/llvm-project/pull/79689 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] Backport 45d2d7757feb386186f69af6ef57bde7b5adc2db to release/18.x (PR #79839)
https://github.com/jayfoad created https://github.com/llvm/llvm-project/pull/79839 This just missed the branch creation and is the last piece of functionality required to get AMDGPU GFX12 support working in the 18.x release. >From c265c8527285075a58b2425198dbd4cca8b69477 Mon Sep 17 00:00:00 2001 From: Jay Foad Date: Thu, 25 Jan 2024 07:48:06 + Subject: [PATCH] [AMDGPU] New llvm.amdgcn.wave.id intrinsic (#79325) This is only valid on targets with architected SGPRs. --- llvm/include/llvm/IR/IntrinsicsAMDGPU.td | 4 ++ .../lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | 19 ++ llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h | 1 + llvm/lib/Target/AMDGPU/SIISelLowering.cpp | 14 + llvm/lib/Target/AMDGPU/SIISelLowering.h | 1 + .../CodeGen/AMDGPU/llvm.amdgcn.wave.id.ll | 61 +++ 6 files changed, 100 insertions(+) create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.id.ll diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td index 9eb1ac8e27befb1..c5f43d17d1c1481 100644 --- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td +++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td @@ -2777,6 +2777,10 @@ class AMDGPULoadTr: def int_amdgcn_global_load_tr : AMDGPULoadTr; +// i32 @llvm.amdgcn.wave.id() +def int_amdgcn_wave_id : + DefaultAttrsIntrinsic<[llvm_i32_ty], [], [IntrNoMem, IntrSpeculatable]>; + //===--===// // Deep learning intrinsics. //===--===// diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp index 615685822f91eeb..e98ede88a7e2db9 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp +++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp @@ -6883,6 +6883,23 @@ bool AMDGPULegalizerInfo::legalizeStackSave(MachineInstr &MI, return true; } +bool AMDGPULegalizerInfo::legalizeWaveID(MachineInstr &MI, + MachineIRBuilder &B) const { + // With architected SGPRs, waveIDinGroup is in TTMP8[29:25]. 
+ if (!ST.hasArchitectedSGPRs()) +return false; + LLT S32 = LLT::scalar(32); + Register DstReg = MI.getOperand(0).getReg(); + Register TTMP8 = + getFunctionLiveInPhysReg(B.getMF(), B.getTII(), AMDGPU::TTMP8, + AMDGPU::SReg_32RegClass, B.getDebugLoc(), S32); + auto LSB = B.buildConstant(S32, 25); + auto Width = B.buildConstant(S32, 5); + B.buildUbfx(DstReg, TTMP8, LSB, Width); + MI.eraseFromParent(); + return true; +} + bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper, MachineInstr &MI) const { MachineIRBuilder &B = Helper.MIRBuilder; @@ -7005,6 +7022,8 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper, case Intrinsic::amdgcn_workgroup_id_z: return legalizePreloadedArgIntrin(MI, MRI, B, AMDGPUFunctionArgInfo::WORKGROUP_ID_Z); + case Intrinsic::amdgcn_wave_id: +return legalizeWaveID(MI, B); case Intrinsic::amdgcn_lds_kernel_id: return legalizePreloadedArgIntrin(MI, MRI, B, AMDGPUFunctionArgInfo::LDS_KERNEL_ID); diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h index 56aabd4f6ab71b6..ecbe42681c6690c 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h +++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h @@ -212,6 +212,7 @@ class AMDGPULegalizerInfo final : public LegalizerInfo { bool legalizeFPTruncRound(MachineInstr &MI, MachineIRBuilder &B) const; bool legalizeStackSave(MachineInstr &MI, MachineIRBuilder &B) const; + bool legalizeWaveID(MachineInstr &MI, MachineIRBuilder &B) const; bool legalizeImageIntrinsic( MachineInstr &MI, MachineIRBuilder &B, diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp index d60f511302613e1..c5ad9da88ec2b31 100644 --- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp +++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp @@ -7920,6 +7920,18 @@ SDValue SITargetLowering::lowerSBuffer(EVT VT, SDLoc DL, SDValue Rsrc, return Loads[0]; } +SDValue SITargetLowering::lowerWaveID(SelectionDAG &DAG, SDValue Op) const { + // With architected SGPRs, waveIDinGroup is in TTMP8[29:25]. + if (!Subtarget->hasArchitectedSGPRs()) +return {}; + SDLoc SL(Op); + MVT VT = MVT::i32; + SDValue TTMP8 = CreateLiveInRegister(DAG, &AMDGPU::SReg_32RegClass, + AMDGPU::TTMP8, VT, SL); + return DAG.getNode(AMDGPUISD::BFE_U32, SL, VT, TTMP8, + DAG.getConstant(25, SL, VT), DAG.getConstant(5, SL, VT)); +} + SDValue SITargetLowering::lowerWorkitemID(SelectionDAG &DAG, SDValue Op, unsigned Dim,
[llvm-branch-commits] [llvm] Backport 45d2d7757feb386186f69af6ef57bde7b5adc2db to release/18.x (PR #79839)
https://github.com/jayfoad milestoned https://github.com/llvm/llvm-project/pull/79839 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] New llvm.amdgcn.wave.id intrinsic (#79325) (PR #79689)
jayfoad wrote: > jayfoad closed this by deleting the head repository 3 hours ago Sorry. Recreated as #79839 https://github.com/llvm/llvm-project/pull/79689 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] Backport 45d2d7757feb386186f69af6ef57bde7b5adc2db to release/18.x (PR #79839)
https://github.com/jayfoad updated https://github.com/llvm/llvm-project/pull/79839 >From c265c8527285075a58b2425198dbd4cca8b69477 Mon Sep 17 00:00:00 2001 From: Jay Foad Date: Thu, 25 Jan 2024 07:48:06 + Subject: [PATCH 1/2] [AMDGPU] New llvm.amdgcn.wave.id intrinsic (#79325) This is only valid on targets with architected SGPRs. --- llvm/include/llvm/IR/IntrinsicsAMDGPU.td | 4 ++ .../lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | 19 ++ llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h | 1 + llvm/lib/Target/AMDGPU/SIISelLowering.cpp | 14 + llvm/lib/Target/AMDGPU/SIISelLowering.h | 1 + .../CodeGen/AMDGPU/llvm.amdgcn.wave.id.ll | 61 +++ 6 files changed, 100 insertions(+) create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.id.ll diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td index 9eb1ac8e27befb..c5f43d17d1c148 100644 --- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td +++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td @@ -2777,6 +2777,10 @@ class AMDGPULoadTr: def int_amdgcn_global_load_tr : AMDGPULoadTr; +// i32 @llvm.amdgcn.wave.id() +def int_amdgcn_wave_id : + DefaultAttrsIntrinsic<[llvm_i32_ty], [], [IntrNoMem, IntrSpeculatable]>; + //===--===// // Deep learning intrinsics. //===--===// diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp index 615685822f91ee..e98ede88a7e2db 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp +++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp @@ -6883,6 +6883,23 @@ bool AMDGPULegalizerInfo::legalizeStackSave(MachineInstr &MI, return true; } +bool AMDGPULegalizerInfo::legalizeWaveID(MachineInstr &MI, + MachineIRBuilder &B) const { + // With architected SGPRs, waveIDinGroup is in TTMP8[29:25]. + if (!ST.hasArchitectedSGPRs()) +return false; + LLT S32 = LLT::scalar(32); + Register DstReg = MI.getOperand(0).getReg(); + Register TTMP8 = + getFunctionLiveInPhysReg(B.getMF(), B.getTII(), AMDGPU::TTMP8, + AMDGPU::SReg_32RegClass, B.getDebugLoc(), S32); + auto LSB = B.buildConstant(S32, 25); + auto Width = B.buildConstant(S32, 5); + B.buildUbfx(DstReg, TTMP8, LSB, Width); + MI.eraseFromParent(); + return true; +} + bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper, MachineInstr &MI) const { MachineIRBuilder &B = Helper.MIRBuilder; @@ -7005,6 +7022,8 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper, case Intrinsic::amdgcn_workgroup_id_z: return legalizePreloadedArgIntrin(MI, MRI, B, AMDGPUFunctionArgInfo::WORKGROUP_ID_Z); + case Intrinsic::amdgcn_wave_id: +return legalizeWaveID(MI, B); case Intrinsic::amdgcn_lds_kernel_id: return legalizePreloadedArgIntrin(MI, MRI, B, AMDGPUFunctionArgInfo::LDS_KERNEL_ID); diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h index 56aabd4f6ab71b..ecbe42681c6690 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h +++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h @@ -212,6 +212,7 @@ class AMDGPULegalizerInfo final : public LegalizerInfo { bool legalizeFPTruncRound(MachineInstr &MI, MachineIRBuilder &B) const; bool legalizeStackSave(MachineInstr &MI, MachineIRBuilder &B) const; + bool legalizeWaveID(MachineInstr &MI, MachineIRBuilder &B) const; bool legalizeImageIntrinsic( MachineInstr &MI, MachineIRBuilder &B, diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp index d60f511302613e..c5ad9da88ec2b3 100644 --- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp +++ 
b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp @@ -7920,6 +7920,18 @@ SDValue SITargetLowering::lowerSBuffer(EVT VT, SDLoc DL, SDValue Rsrc, return Loads[0]; } +SDValue SITargetLowering::lowerWaveID(SelectionDAG &DAG, SDValue Op) const { + // With architected SGPRs, waveIDinGroup is in TTMP8[29:25]. + if (!Subtarget->hasArchitectedSGPRs()) +return {}; + SDLoc SL(Op); + MVT VT = MVT::i32; + SDValue TTMP8 = CreateLiveInRegister(DAG, &AMDGPU::SReg_32RegClass, + AMDGPU::TTMP8, VT, SL); + return DAG.getNode(AMDGPUISD::BFE_U32, SL, VT, TTMP8, + DAG.getConstant(25, SL, VT), DAG.getConstant(5, SL, VT)); +} + SDValue SITargetLowering::lowerWorkitemID(SelectionDAG &DAG, SDValue Op, unsigned Dim, const ArgDescriptor &Arg) const { @@ -8090,6 +8102,8 @@ SDValue SITargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op, case Intrinsic::
[llvm-branch-commits] [llvm] Backport 45d2d7757feb386186f69af6ef57bde7b5adc2db to release/18.x (PR #79839)
@@ -6883,6 +6883,23 @@ bool AMDGPULegalizerInfo::legalizeStackSave(MachineInstr &MI, return true; } +bool AMDGPULegalizerInfo::legalizeWaveID(MachineInstr &MI, + MachineIRBuilder &B) const { + // With architected SGPRs, waveIDinGroup is in TTMP8[29:25]. + if (!ST.hasArchitectedSGPRs()) +return false; + LLT S32 = LLT::scalar(32); + Register DstReg = MI.getOperand(0).getReg(); + Register TTMP8 = + getFunctionLiveInPhysReg(B.getMF(), B.getTII(), AMDGPU::TTMP8, jayfoad wrote: True, 66c710ec9dcdbdec6cadd89b972d8945983dc92f improved this to avoid adding liveins. I wasn't going to bother backporting that since I didn't think it was required for correctness. But I have cherry-picked it into this PR now. https://github.com/llvm/llvm-project/pull/79839 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] GFX12 VMEM instructions can write VGPR results out of order (PR #105549)
https://github.com/jayfoad created https://github.com/llvm/llvm-project/pull/105549 Fix SIInsertWaitcnts to account for this by adding extra waits to avoid WAW dependencies. >From 9a2103df4094af38f59e1adce5414b94672e6d6e Mon Sep 17 00:00:00 2001 From: Jay Foad Date: Wed, 21 Aug 2024 16:23:49 +0100 Subject: [PATCH] [AMDGPU] GFX12 VMEM instructions can write VGPR results out of order Fix SIInsertWaitcnts to account for this by adding extra waits to avoid WAW dependencies. --- llvm/lib/Target/AMDGPU/AMDGPU.td | 23 ++- llvm/lib/Target/AMDGPU/GCNSubtarget.h | 3 +++ llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | 7 +++--- .../buffer-fat-pointer-atomicrmw-fadd.ll | 3 +++ .../buffer-fat-pointer-atomicrmw-fmax.ll | 5 .../buffer-fat-pointer-atomicrmw-fmin.ll | 5 amdgcn.struct.buffer.load.format.v3f16.ll | 1 + llvm/test/CodeGen/AMDGPU/load-constant-i16.ll | 10 +++- llvm/test/CodeGen/AMDGPU/load-global-i16.ll | 10 llvm/test/CodeGen/AMDGPU/load-global-i32.ll | 2 ++ .../AMDGPU/spill-csr-frame-ptr-reg-copy.ll| 1 + .../CodeGen/AMDGPU/waitcnt-vmcnt-loop.mir | 8 +++ 12 files changed, 64 insertions(+), 14 deletions(-) diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td index 7906e0ee9d7858..9efdbd751d96e3 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPU.td +++ b/llvm/lib/Target/AMDGPU/AMDGPU.td @@ -953,6 +953,12 @@ def FeatureRequiredExportPriority : SubtargetFeature<"required-export-priority", "Export priority must be explicitly manipulated on GFX11.5" >; +def FeatureVmemWriteVgprInOrder : SubtargetFeature<"vmem-write-vgpr-in-order", + "HasVmemWriteVgprInOrder", + "true", + "VMEM instructions of the same type write VGPR results in order" +>; + //======// // Subtarget Features (options and debugging) //======// @@ -1123,7 +1129,8 @@ def FeatureSouthernIslands : GCNSubtargetFeatureGeneration<"SOUTHERN_ISLANDS", FeatureDsSrc2Insts, FeatureLDSBankCount32, FeatureMovrel, FeatureTrigReducedRange, FeatureExtendedImageInsts, FeatureImageInsts, FeatureGDS, FeatureGWS, FeatureDefaultComponentZero, - FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF64GlobalInsts + FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF64GlobalInsts, + FeatureVmemWriteVgprInOrder ] >; @@ -1136,7 +1143,8 @@ def FeatureSeaIslands : GCNSubtargetFeatureGeneration<"SEA_ISLANDS", FeatureDsSrc2Insts, FeatureExtendedImageInsts, FeatureUnalignedBufferAccess, FeatureImageInsts, FeatureGDS, FeatureGWS, FeatureDefaultComponentZero, FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF64GlobalInsts, - FeatureAtomicFMinFMaxF32FlatInsts, FeatureAtomicFMinFMaxF64FlatInsts + FeatureAtomicFMinFMaxF32FlatInsts, FeatureAtomicFMinFMaxF64FlatInsts, + FeatureVmemWriteVgprInOrder ] >; @@ -1152,7 +1160,7 @@ def FeatureVolcanicIslands : GCNSubtargetFeatureGeneration<"VOLCANIC_ISLANDS", FeatureGFX7GFX8GFX9Insts, FeatureSMemTimeInst, FeatureMadMacF32Insts, FeatureDsSrc2Insts, FeatureExtendedImageInsts, FeatureFastDenormalF32, FeatureUnalignedBufferAccess, FeatureImageInsts, FeatureGDS, FeatureGWS, - FeatureDefaultComponentZero + FeatureDefaultComponentZero, FeatureVmemWriteVgprInOrder ] >; @@ -1170,7 +1178,8 @@ def FeatureGFX9 : GCNSubtargetFeatureGeneration<"GFX9", FeatureScalarFlatScratchInsts, FeatureScalarAtomics, FeatureR128A16, FeatureA16, FeatureSMemTimeInst, FeatureFastDenormalF32, FeatureSupportsXNACK, FeatureUnalignedBufferAccess, FeatureUnalignedDSAccess, - FeatureNegativeScratchOffsetBug, FeatureGWS, FeatureDefaultComponentZero + FeatureNegativeScratchOffsetBug, FeatureGWS, FeatureDefaultComponentZero, + 
FeatureVmemWriteVgprInOrder ] >; @@ -1193,7 +1202,8 @@ def FeatureGFX10 : GCNSubtargetFeatureGeneration<"GFX10", FeatureGDS, FeatureGWS, FeatureDefaultComponentZero, FeatureMaxHardClauseLength63, FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF64GlobalInsts, - FeatureAtomicFMinFMaxF32FlatInsts, FeatureAtomicFMinFMaxF64FlatInsts + FeatureAtomicFMinFMaxF32FlatInsts, FeatureAtomicFMinFMaxF64FlatInsts, + FeatureVmemWriteVgprInOrder ] >; @@ -1215,7 +1225,8 @@ def FeatureGFX11 : GCNSubtargetFeatureGeneration<"GFX11", FeatureUnalignedBufferAccess, FeatureUnalignedDSAccess, FeatureGDS, FeatureGWS, FeatureDefaultComponentZero, FeatureMaxHardClauseLength32, - FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts + FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts, + FeatureVmemWriteVgprInOrder ] >; diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h index 902f51ae358d59..9386bcf0d74b22 100644 --- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h +++ b/llvm/lib/Target/AMDGPU
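The effect of the new subtarget feature on waitcnt insertion can be modeled in a few lines. A hedged sketch (the function and flag names below are illustrative; the real check sits in SIInsertWaitcnts):

```cpp
// Whether a write-after-write to the same VGPR by two VMEM loads needs a
// wait. The WAW wait can be skipped only when the target guarantees that
// same-type VMEM loads write their VGPR results in order. On GFX12 the
// feature is absent, so the wait is always inserted: these are the extra
// waits this patch adds.
bool needsWAWWait(bool SameTypeVMEM, bool HasVmemWriteVgprInOrder) {
  return !(SameTypeVMEM && HasVmemWriteVgprInOrder);
}
```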
[llvm-branch-commits] [llvm] [AMDGPU] Remove one case of vmcnt loop header flushing for GFX12 (PR #105550)
https://github.com/jayfoad created https://github.com/llvm/llvm-project/pull/105550 When a loop contains a VMEM load whose result is only used outside the loop, do not bother to flush vmcnt in the loop head on GFX12. A wait for vmcnt will be required inside the loop anyway, because VMEM instructions can write their VGPR results out of order. >From e53f75835dd0f0fc9d11b17afbe40de9b4a8a35b Mon Sep 17 00:00:00 2001 From: Jay Foad Date: Wed, 21 Aug 2024 16:57:24 +0100 Subject: [PATCH] [AMDGPU] Remove one case of vmcnt loop header flushing for GFX12 When a loop contains a VMEM load whose result is only used outside the loop, do not bother to flush vmcnt in the loop head on GFX12. A wait for vmcnt will be required inside the loop anyway, because VMEM instructions can write their VGPR results out of order. --- llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | 2 +- llvm/test/CodeGen/AMDGPU/waitcnt-vmcnt-loop.mir | 10 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp index 4262e7b5d9c25..eafe20be17d5b 100644 --- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp +++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp @@ -2390,7 +2390,7 @@ bool SIInsertWaitcnts::shouldFlushVmCnt(MachineLoop *ML, } if (!ST->hasVscnt() && HasVMemStore && !HasVMemLoad && UsesVgprLoadedOutside) return true; - return HasVMemLoad && UsesVgprLoadedOutside; + return HasVMemLoad && UsesVgprLoadedOutside && ST->hasVmemWriteVgprInOrder(); } bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) { diff --git a/llvm/test/CodeGen/AMDGPU/waitcnt-vmcnt-loop.mir b/llvm/test/CodeGen/AMDGPU/waitcnt-vmcnt-loop.mir index bdef55ab956a0..0ddd2aa285b26 100644 --- a/llvm/test/CodeGen/AMDGPU/waitcnt-vmcnt-loop.mir +++ b/llvm/test/CodeGen/AMDGPU/waitcnt-vmcnt-loop.mir @@ -295,7 +295,7 @@ body: | # GFX12-LABEL: waitcnt_vm_loop2 # GFX12-LABEL: bb.0: # GFX12: BUFFER_LOAD_FORMAT_X_IDXEN -# GFX12: S_WAIT_LOADCNT 0 +# GFX12-NOT: S_WAIT_LOADCNT 0 # GFX12-LABEL: bb.1: # GFX12: S_WAIT_LOADCNT 0 # GFX12-LABEL: bb.2: @@ -342,7 +342,7 @@ body: | # GFX12-LABEL: waitcnt_vm_loop2_store # GFX12-LABEL: bb.0: # GFX12: BUFFER_LOAD_FORMAT_X_IDXEN -# GFX12: S_WAIT_LOADCNT 0 +# GFX12-NOT: S_WAIT_LOADCNT 0 # GFX12-LABEL: bb.1: # GFX12: S_WAIT_LOADCNT 0 # GFX12-LABEL: bb.2: @@ -499,9 +499,9 @@ body: | # GFX12-LABEL: waitcnt_vm_loop2_reginterval # GFX12-LABEL: bb.0: # GFX12: GLOBAL_LOAD_DWORDX4 -# GFX12: S_WAIT_LOADCNT 0 -# GFX12-LABEL: bb.1: # GFX12-NOT: S_WAIT_LOADCNT 0 +# GFX12-LABEL: bb.1: +# GFX12: S_WAIT_LOADCNT 0 # GFX12-LABEL: bb.2: name:waitcnt_vm_loop2_reginterval body: | @@ -600,7 +600,7 @@ body: | # GFX12-LABEL: bb.0: # GFX12: BUFFER_LOAD_FORMAT_X_IDXEN # GFX12: BUFFER_LOAD_FORMAT_X_IDXEN -# GFX12: S_WAIT_LOADCNT 0 +# GFX12-NOT: S_WAIT_LOADCNT 0 # GFX12-LABEL: bb.1: # GFX12: S_WAIT_LOADCNT 0 # GFX12-LABEL: bb.2: ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] GFX12 VMEM instructions can write VGPR results out of order (PR #105549)
jayfoad wrote: > [!WARNING] > This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. Learn more: https://graphite.dev/docs/merge-pull-requests * **#105550** * **#105549** 👈 * **#105548** * `main` This stack of pull requests is managed by Graphite. Learn more about stacking: https://stacking.dev/ https://github.com/llvm/llvm-project/pull/105549 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Remove one case of vmcnt loop header flushing for GFX12 (PR #105550)
jayfoad wrote: > [!WARNING] > This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. Learn more: https://graphite.dev/docs/merge-pull-requests * **#105550** 👈 * **#105549** * **#105548** * `main` This stack of pull requests is managed by Graphite. Learn more about stacking: https://stacking.dev/ https://github.com/llvm/llvm-project/pull/105550 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] GFX12 VMEM instructions can write VGPR results out of order (PR #105549)
https://github.com/jayfoad ready_for_review https://github.com/llvm/llvm-project/pull/105549 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Remove one case of vmcnt loop header flushing for GFX12 (PR #105550)
https://github.com/jayfoad ready_for_review https://github.com/llvm/llvm-project/pull/105550 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] GFX12 VMEM instructions can write VGPR results out of order (PR #105549)
@@ -953,6 +953,12 @@ def FeatureRequiredExportPriority : SubtargetFeature<"required-export-priority", "Export priority must be explicitly manipulated on GFX11.5" >; +def FeatureVmemWriteVgprInOrder : SubtargetFeature<"vmem-write-vgpr-in-order", jayfoad wrote: "Easier" how? You mean it would make the patch smaller? I prefer to have features that state things in a "positive" way, so that not having the feature still generates conservatively correct code. https://github.com/llvm/llvm-project/pull/105549 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] GFX12 VMEM instructions can write VGPR results out of order (PR #105549)
@@ -1778,11 +1778,12 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI, if (IsVGPR) { // RAW always needs an s_waitcnt. WAW needs an s_waitcnt unless the // previous write and this write are the same type of VMEM -// instruction, in which case they're guaranteed to write their -// results in order anyway. +// instruction, in which case they are (in some architectures) +// guaranteed to write their results in order anyway. jayfoad wrote: No, this has nothing to do with storing data to memory. We are only talking about loads (or atomics with results) and the order in which they write the loaded data into the result VGPR. https://github.com/llvm/llvm-project/pull/105549 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] GFX12 VMEM instructions can write VGPR results out of order (PR #105549)
@@ -4371,8 +4375,10 @@ define amdgpu_kernel void @global_sextload_v64i16_to_v64i32(ptr addrspace(1) %ou ; GCN-NOHSA-SI-NEXT:buffer_store_dwordx4 v[8:11], off, s[0:3], 0 offset:48 ; GCN-NOHSA-SI-NEXT:buffer_store_dwordx4 v[4:7], off, s[0:3], 0 ; GCN-NOHSA-SI-NEXT:buffer_load_dword v0, off, s[12:15], 0 ; 4-byte Folded Reload +; GCN-NOHSA-SI-NEXT:s_waitcnt vmcnt(0) jayfoad wrote: The first RUN line does not specify a CPU so it will get some generic CPU that does not have the new feature. https://github.com/llvm/llvm-project/pull/105549 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] GFX12 VMEM instructions can write VGPR results out of order (PR #105549)
@@ -754,13 +754,21 @@ define amdgpu_kernel void @constant_load_v16i16_align2(ptr addrspace(4) %ptr0) # ; GFX12-NEXT:global_load_u16 v6, v8, s[0:1] offset:8 ; GFX12-NEXT:global_load_u16 v5, v8, s[0:1] offset:4 ; GFX12-NEXT:global_load_u16 v4, v8, s[0:1] +; GFX12-NEXT:s_wait_loadcnt 0x7 jayfoad wrote: This wait is required to ensure that the global_load_u16 on line 749 writes to v3 before the global_load_d16_hi_b16 on line 758. https://github.com/llvm/llvm-project/pull/105549 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] GFX12 VMEM loads can write VGPR results out of order (PR #105549)
https://github.com/jayfoad edited https://github.com/llvm/llvm-project/pull/105549 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] GFX12 VMEM loads can write VGPR results out of order (PR #105549)
jayfoad wrote: ### Merge activity * **Aug 22, 6:34 AM EDT**: @jayfoad started a stack merge that includes this pull request via [Graphite](https://app.graphite.dev/github/pr/llvm/llvm-project/105549). https://github.com/llvm/llvm-project/pull/105549 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] release/19.x: [AMDGPU] Remove one case of vmcnt loop header flushing for GFX12 (#105550) (PR #105808)
jayfoad wrote: I'm not sure if I should have done three different backport requests for the three commits. It could be confusing if they get squash-and-merged onto the release branch. https://github.com/llvm/llvm-project/pull/105808 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix sign confusion in performMulLoHiCombine (PR #106977)
https://github.com/jayfoad milestoned https://github.com/llvm/llvm-project/pull/106977 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix sign confusion in performMulLoHiCombine (PR #106977)
https://github.com/jayfoad created https://github.com/llvm/llvm-project/pull/106977 SMUL_LOHI and UMUL_LOHI are different operations because the high part of the result is different, so it is not OK to optimize the signed version to MUL_U24/MULHI_U24 or the unsigned version to MUL_I24/MULHI_I24. >From 04226baceb4e2823a7ca3daac236f705b3c6c33e Mon Sep 17 00:00:00 2001 From: Jay Foad Date: Tue, 27 Aug 2024 17:09:40 +0100 Subject: [PATCH] [AMDGPU] Fix sign confusion in performMulLoHiCombine (#105831) SMUL_LOHI and UMUL_LOHI are different operations because the high part of the result is different, so it is not OK to optimize the signed version to MUL_U24/MULHI_U24 or the unsigned version to MUL_I24/MULHI_I24. --- llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp | 30 +++--- llvm/test/CodeGen/AMDGPU/mul_int24.ll | 98 +++ 2 files changed, 116 insertions(+), 12 deletions(-) diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp index 39ae7c96cf7729..a71c9453d968dd 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp +++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp @@ -4349,6 +4349,7 @@ AMDGPUTargetLowering::performMulLoHiCombine(SDNode *N, SelectionDAG &DAG = DCI.DAG; SDLoc DL(N); + bool Signed = N->getOpcode() == ISD::SMUL_LOHI; SDValue N0 = N->getOperand(0); SDValue N1 = N->getOperand(1); @@ -4363,20 +4364,25 @@ AMDGPUTargetLowering::performMulLoHiCombine(SDNode *N, // Try to use two fast 24-bit multiplies (one for each half of the result) // instead of one slow extending multiply. - unsigned LoOpcode, HiOpcode; - if (Subtarget->hasMulU24() && isU24(N0, DAG) && isU24(N1, DAG)) { -N0 = DAG.getZExtOrTrunc(N0, DL, MVT::i32); -N1 = DAG.getZExtOrTrunc(N1, DL, MVT::i32); -LoOpcode = AMDGPUISD::MUL_U24; -HiOpcode = AMDGPUISD::MULHI_U24; - } else if (Subtarget->hasMulI24() && isI24(N0, DAG) && isI24(N1, DAG)) { -N0 = DAG.getSExtOrTrunc(N0, DL, MVT::i32); -N1 = DAG.getSExtOrTrunc(N1, DL, MVT::i32); -LoOpcode = AMDGPUISD::MUL_I24; -HiOpcode = AMDGPUISD::MULHI_I24; + unsigned LoOpcode = 0; + unsigned HiOpcode = 0; + if (Signed) { +if (Subtarget->hasMulI24() && isI24(N0, DAG) && isI24(N1, DAG)) { + N0 = DAG.getSExtOrTrunc(N0, DL, MVT::i32); + N1 = DAG.getSExtOrTrunc(N1, DL, MVT::i32); + LoOpcode = AMDGPUISD::MUL_I24; + HiOpcode = AMDGPUISD::MULHI_I24; +} } else { -return SDValue(); +if (Subtarget->hasMulU24() && isU24(N0, DAG) && isU24(N1, DAG)) { + N0 = DAG.getZExtOrTrunc(N0, DL, MVT::i32); + N1 = DAG.getZExtOrTrunc(N1, DL, MVT::i32); + LoOpcode = AMDGPUISD::MUL_U24; + HiOpcode = AMDGPUISD::MULHI_U24; +} } + if (!LoOpcode) +return SDValue(); SDValue Lo = DAG.getNode(LoOpcode, DL, MVT::i32, N0, N1); SDValue Hi = DAG.getNode(HiOpcode, DL, MVT::i32, N0, N1); diff --git a/llvm/test/CodeGen/AMDGPU/mul_int24.ll b/llvm/test/CodeGen/AMDGPU/mul_int24.ll index be77a10380c49b..8f4c48fae6fb31 100644 --- a/llvm/test/CodeGen/AMDGPU/mul_int24.ll +++ b/llvm/test/CodeGen/AMDGPU/mul_int24.ll @@ -813,4 +813,102 @@ bb7: ret void } + +define amdgpu_kernel void @test_umul_i24(ptr addrspace(1) %out, i32 %arg) { +; SI-LABEL: test_umul_i24: +; SI: ; %bb.0: +; SI-NEXT:s_load_dword s1, s[2:3], 0xb +; SI-NEXT:v_mov_b32_e32 v0, 0xff803fe1 +; SI-NEXT:s_mov_b32 s0, 0 +; SI-NEXT:s_mov_b32 s3, 0xf000 +; SI-NEXT:s_waitcnt lgkmcnt(0) +; SI-NEXT:s_lshr_b32 s1, s1, 9 +; SI-NEXT:v_mul_hi_u32 v0, s1, v0 +; SI-NEXT:s_mul_i32 s1, s1, 0xff803fe1 +; SI-NEXT:v_alignbit_b32 v0, v0, s1, 1 +; SI-NEXT:s_mov_b32 s2, -1 +; SI-NEXT:s_mov_b32 s1, s0 +; SI-NEXT:buffer_store_dword v0, off, s[0:3], 0 +; 
SI-NEXT:s_endpgm +; +; VI-LABEL: test_umul_i24: +; VI: ; %bb.0: +; VI-NEXT:s_load_dword s0, s[2:3], 0x2c +; VI-NEXT:v_mov_b32_e32 v0, 0xff803fe1 +; VI-NEXT:s_mov_b32 s3, 0xf000 +; VI-NEXT:s_mov_b32 s2, -1 +; VI-NEXT:s_waitcnt lgkmcnt(0) +; VI-NEXT:s_lshr_b32 s0, s0, 9 +; VI-NEXT:v_mad_u64_u32 v[0:1], s[0:1], s0, v0, 0 +; VI-NEXT:s_mov_b32 s0, 0 +; VI-NEXT:s_mov_b32 s1, s0 +; VI-NEXT:v_alignbit_b32 v0, v1, v0, 1 +; VI-NEXT:s_nop 1 +; VI-NEXT:buffer_store_dword v0, off, s[0:3], 0 +; VI-NEXT:s_endpgm +; +; GFX9-LABEL: test_umul_i24: +; GFX9: ; %bb.0: +; GFX9-NEXT:s_load_dword s1, s[2:3], 0x2c +; GFX9-NEXT:s_mov_b32 s0, 0 +; GFX9-NEXT:s_mov_b32 s3, 0xf000 +; GFX9-NEXT:s_mov_b32 s2, -1 +; GFX9-NEXT:s_waitcnt lgkmcnt(0) +; GFX9-NEXT:s_lshr_b32 s1, s1, 9 +; GFX9-NEXT:s_mul_hi_u32 s4, s1, 0xff803fe1 +; GFX9-NEXT:s_mul_i32 s1, s1, 0xff803fe1 +; GFX9-NEXT:v_mov_b32_e32 v0, s1 +; GFX9-NEXT:v_alignbit_b32 v0, s4, v0, 1 +; GFX9-NEXT:s_mov_b32 s1, s0 +; GFX9-NEXT:buffer_store_dword v0, off, s[0:3], 0 +; GFX9-NEXT:s_endpgm +; +; EG-LABEL: test_umul_i24: +; EG: ; %bb.0: +; EG-
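The commit message's claim is easy to demonstrate with concrete values: the same 32-bit operand patterns produce identical low halves but different high halves under signed and unsigned widening multiplies, so the signed and unsigned 24-bit mulhi ops are not interchangeable. A self-contained check (values chosen for illustration only):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // 0xFFFFFFFF fits in i24 after sign extension (it is -1) but not in u24.
  uint32_t a = 0xFFFFFFFFu, b = 0xFFFFFFFFu;
  uint32_t uhi = (uint32_t)(((uint64_t)a * b) >> 32);        // UMUL_LOHI hi
  uint32_t shi =
      (uint32_t)(((int64_t)(int32_t)a * (int32_t)b) >> 32);  // SMUL_LOHI hi
  printf("umul_lohi hi = 0x%08x, smul_lohi hi = 0x%08x\n", uhi, shi);
  // Prints 0xfffffffe vs 0x00000000: the high halves differ, which is why
  // MULHI_I24 cannot implement UMUL_LOHI (nor MULHI_U24 SMUL_LOHI).
}
```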
[llvm-branch-commits] [llvm] [AMDGPU] Fix sign confusion in performMulLoHiCombine (PR #106977)
jayfoad wrote: This is a backport of #105831. https://github.com/llvm/llvm-project/pull/106977 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix sign confusion in performMulLoHiCombine (PR #106977)
https://github.com/jayfoad edited https://github.com/llvm/llvm-project/pull/106977 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [X86] Avoid generating nested CALLSEQ for TLS pointer function arguments (PR #106965)
jayfoad wrote: > > This sounds sketchy to me. Is it really valid to enter a second call inside > > another call's CALLSEQ markers, but only if we avoid adding a second nested > > set of markers? It feels like attacking the symptom of the issue, but not > > the root cause. (I'm not certain it's _not_ valid, but it just seems really > > suspicious...) > > From what I've gathered from the source comments and the > [patch](https://github.com/llvm/llvm-project/commit/228978c0dcfc9a9793f3dc8a69f42471192223bc) > introducing the code that inserts these CALLSEQ markers for TLSADDRs, their > only point here is to stop shrink-wrapping from moving the function > prologue/epilogue past the call to get the TLS address. This should also > hold when the TLSADDR is in another CALLSEQ. > > I am however by no means an expert on this topic; I'd appreciate more > insights on which uses of CALLSEQ markers are and are not valid (besides the > MachineVerifier checks). I also wondered about this. Are there other mechanisms that block shrink wrapping from moving the prologue? E.g. if a regular instruction (not a call) has to come after the prologue, how would that be marked? Maybe adding an implicit use or def of some particular physical register would be enough? https://github.com/llvm/llvm-project/pull/106965 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix sign confusion in performMulLoHiCombine (PR #106977)
jayfoad wrote: > Is this PR a fix for a regression or a critical issue? No, I believe it has been broken for about 3 years (since d7e03df719464354b20a845b7853be57da863924) but it was only reported to me recently. I guess this means it is not appropriate for 19.1.0. https://github.com/llvm/llvm-project/pull/106977 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU] Fix sign confusion in performMulLoHiCombine (PR #106977)
jayfoad wrote: > > Is this PR a fix for a regression or a critical issue? > > No, I believe it has been broken for about 3 years (since > [d7e03df](https://github.com/llvm/llvm-project/commit/d7e03df719464354b20a845b7853be57da863924)) > but it was only reported to me recently. > > I guess this means it is not appropriate for 19.1.0. Cc @marekolsak FYI. https://github.com/llvm/llvm-project/pull/106977 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] a1cba5b - [SelectionDAG] Make use of KnownBits::commonBits. NFC.
Author: Jay Foad Date: 2021-01-14T14:02:43Z New Revision: a1cba5b7a1fb09d2d4082967e2466a5a89ed698a URL: https://github.com/llvm/llvm-project/commit/a1cba5b7a1fb09d2d4082967e2466a5a89ed698a DIFF: https://github.com/llvm/llvm-project/commit/a1cba5b7a1fb09d2d4082967e2466a5a89ed698a.diff LOG: [SelectionDAG] Make use of KnownBits::commonBits. NFC. Differential Revision: https://reviews.llvm.org/D94587 Added: Modified: llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp Removed: diff --git a/llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp b/llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp index 669bca966a7d..0b830f462c90 100644 --- a/llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp @@ -509,8 +509,7 @@ void FunctionLoweringInfo::ComputePHILiveOutRegInfo(const PHINode *PN) { return; } DestLOI.NumSignBits = std::min(DestLOI.NumSignBits, SrcLOI->NumSignBits); -DestLOI.Known.Zero &= SrcLOI->Known.Zero; -DestLOI.Known.One &= SrcLOI->Known.One; +DestLOI.Known = KnownBits::commonBits(DestLOI.Known, SrcLOI->Known); } } diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp index 7ea0b09ef9c9..173e45a4b18e 100644 --- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp @@ -1016,10 +1016,8 @@ bool TargetLowering::SimplifyDemandedBits( Depth + 1)) return true; -if (!!DemandedVecElts) { - Known.One &= KnownVec.One; - Known.Zero &= KnownVec.Zero; -} +if (!!DemandedVecElts) + Known = KnownBits::commonBits(Known, KnownVec); return false; } @@ -1044,14 +1042,10 @@ bool TargetLowering::SimplifyDemandedBits( Known.Zero.setAllBits(); Known.One.setAllBits(); -if (!!DemandedSubElts) { - Known.One &= KnownSub.One; - Known.Zero &= KnownSub.Zero; -} -if (!!DemandedSrcElts) { - Known.One &= KnownSrc.One; - Known.Zero &= KnownSrc.Zero; -} +if (!!DemandedSubElts) + Known = KnownBits::commonBits(Known, KnownSub); +if (!!DemandedSrcElts) + Known = KnownBits::commonBits(Known, KnownSrc); // Attempt to avoid multi-use src if we don't need anything from it. if (!DemandedBits.isAllOnesValue() || !DemandedSubElts.isAllOnesValue() || @@ -1108,10 +1102,8 @@ bool TargetLowering::SimplifyDemandedBits( Known2, TLO, Depth + 1)) return true; // Known bits are shared by every demanded subvector element. - if (!!DemandedSubElts) { -Known.One &= Known2.One; -Known.Zero &= Known2.Zero; - } + if (!!DemandedSubElts) +Known = KnownBits::commonBits(Known, Known2); } break; } @@ -1149,15 +1141,13 @@ bool TargetLowering::SimplifyDemandedBits( if (SimplifyDemandedBits(Op0, DemandedBits, DemandedLHS, Known2, TLO, Depth + 1)) return true; -Known.One &= Known2.One; -Known.Zero &= Known2.Zero; +Known = KnownBits::commonBits(Known, Known2); } if (!!DemandedRHS) { if (SimplifyDemandedBits(Op1, DemandedBits, DemandedRHS, Known2, TLO, Depth + 1)) return true; -Known.One &= Known2.One; -Known.Zero &= Known2.Zero; +Known = KnownBits::commonBits(Known, Known2); } // Attempt to avoid multi-use ops if we don't need anything from them. @@ -1384,8 +1374,7 @@ bool TargetLowering::SimplifyDemandedBits( return true; // Only known if known in both the LHS and RHS. 
-Known.One &= Known2.One; -Known.Zero &= Known2.Zero; +Known = KnownBits::commonBits(Known, Known2); break; case ISD::SELECT_CC: if (SimplifyDemandedBits(Op.getOperand(3), DemandedBits, Known, TLO, @@ -1402,8 +1391,7 @@ bool TargetLowering::SimplifyDemandedBits( return true; // Only known if known in both the LHS and RHS. -Known.One &= Known2.One; -Known.Zero &= Known2.Zero; +Known = KnownBits::commonBits(Known, Known2); break; case ISD::SETCC: { SDValue Op0 = Op.getOperand(0); ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
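For context, the helper adopted throughout this commit intersects two known-bits facts: a bit stays known only if both inputs agree on it, which is exactly what the replaced One &= / Zero &= pairs computed. A simplified 64-bit model (the real class is llvm::KnownBits; this standalone struct is illustrative):

```cpp
#include <cstdint>

// Zero has a bit set where the value is known 0, One where it is known 1.
struct KB { uint64_t Zero, One; };

// Bits known in common between two possible values of the same expression.
KB commonBits(KB A, KB B) {
  return {A.Zero & B.Zero, A.One & B.One};
}
```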
[llvm-branch-commits] [llvm] 517196e - [Analysis, CodeGen] Make use of KnownBits::makeConstant. NFC.
Author: Jay Foad Date: 2021-01-14T14:02:43Z New Revision: 517196e569129677be32d6ebcfa57bac552268a4 URL: https://github.com/llvm/llvm-project/commit/517196e569129677be32d6ebcfa57bac552268a4 DIFF: https://github.com/llvm/llvm-project/commit/517196e569129677be32d6ebcfa57bac552268a4.diff LOG: [Analysis,CodeGen] Make use of KnownBits::makeConstant. NFC. Differential Revision: https://reviews.llvm.org/D94588 Added: Modified: llvm/lib/Analysis/ValueTracking.cpp llvm/lib/CodeGen/GlobalISel/GISelKnownBits.cpp llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp Removed: diff --git a/llvm/lib/Analysis/ValueTracking.cpp b/llvm/lib/Analysis/ValueTracking.cpp index b138caa05610..61c992d0eedf 100644 --- a/llvm/lib/Analysis/ValueTracking.cpp +++ b/llvm/lib/Analysis/ValueTracking.cpp @@ -1337,8 +1337,8 @@ static void computeKnownBitsFromOperator(const Operator *I, AccConstIndices += IndexConst.sextOrTrunc(BitWidth); continue; } else { -ScalingFactor.Zero = ~TypeSizeInBytes; -ScalingFactor.One = TypeSizeInBytes; +ScalingFactor = +KnownBits::makeConstant(APInt(IndexBitWidth, TypeSizeInBytes)); } IndexBits = KnownBits::computeForMul(IndexBits, ScalingFactor); @@ -1353,9 +1353,7 @@ static void computeKnownBitsFromOperator(const Operator *I, /*Add=*/true, /*NSW=*/false, Known, IndexBits); } if (!Known.isUnknown() && !AccConstIndices.isNullValue()) { - KnownBits Index(BitWidth); - Index.Zero = ~AccConstIndices; - Index.One = AccConstIndices; + KnownBits Index = KnownBits::makeConstant(AccConstIndices); Known = KnownBits::computeForAddSub( /*Add=*/true, /*NSW=*/false, Known, Index); } @@ -1818,8 +1816,7 @@ void computeKnownBits(const Value *V, const APInt &DemandedElts, const APInt *C; if (match(V, m_APInt(C))) { // We know all of the bits for a scalar constant or a splat vector constant! -Known.One = *C; -Known.Zero = ~Known.One; +Known = KnownBits::makeConstant(*C); return; } // Null and aggregate-zero are all-zeros. 
diff --git a/llvm/lib/CodeGen/GlobalISel/GISelKnownBits.cpp b/llvm/lib/CodeGen/GlobalISel/GISelKnownBits.cpp index 64c7fb486493..aac7a73e858f 100644 --- a/llvm/lib/CodeGen/GlobalISel/GISelKnownBits.cpp +++ b/llvm/lib/CodeGen/GlobalISel/GISelKnownBits.cpp @@ -217,8 +217,7 @@ void GISelKnownBits::computeKnownBitsImpl(Register R, KnownBits &Known, auto CstVal = getConstantVRegVal(R, MRI); if (!CstVal) break; -Known.One = *CstVal; -Known.Zero = ~Known.One; +Known = KnownBits::makeConstant(*CstVal); break; } case TargetOpcode::G_FRAME_INDEX: { diff --git a/llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp b/llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp index 0b830f462c90..32a4f60df097 100644 --- a/llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp @@ -458,8 +458,7 @@ void FunctionLoweringInfo::ComputePHILiveOutRegInfo(const PHINode *PN) { if (ConstantInt *CI = dyn_cast(V)) { APInt Val = CI->getValue().zextOrTrunc(BitWidth); DestLOI.NumSignBits = Val.getNumSignBits(); -DestLOI.Known.Zero = ~Val; -DestLOI.Known.One = Val; +DestLOI.Known = KnownBits::makeConstant(Val); } else { assert(ValueMap.count(V) && "V should have been placed in ValueMap when its" "CopyToReg node was created."); diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp index e080408bbe42..7084ab68524b 100644 --- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp @@ -3134,13 +3134,10 @@ KnownBits SelectionDAG::computeKnownBits(SDValue Op, const APInt &DemandedElts, } } else if (BitWidth == CstTy->getPrimitiveSizeInBits()) { if (auto *CInt = dyn_cast(Cst)) { -const APInt &Value = CInt->getValue(); -Known.One = Value; -Known.Zero = ~Value; +Known = KnownBits::makeConstant(CInt->getValue()); } else if (auto *CFP = dyn_cast(Cst)) { -APInt Value = CFP->getValueAPF().bitcastToAPInt(); -Known.One = Value; -Known.Zero = ~Value; +Known = +KnownBits::makeConstant(CFP->getValueAPF().bitcastToAPInt()); } } } diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp index 173e45a4b18e..6ae0a39962b3 100644 --- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp @@ -912,15 +912,14 @@ boo
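makeConstant is the dual convenience: it builds a fully-known KnownBits from a constant, replacing the One = C / Zero = ~C pairs above. In the same simplified model (illustrative, not the real API):

```cpp
#include <cstdint>

struct KB { uint64_t Zero, One; };

// Every bit of C is known: One holds its set bits, Zero its clear bits.
KB makeConstant(uint64_t C) {
  return {~C, C};
}
```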
[llvm-branch-commits] [llvm] 90b310f - [Support] Simplify KnownBits::icmp helpers. NFC.
Author: Jay Foad Date: 2021-01-14T14:02:43Z New Revision: 90b310f6caf0b356075c70407c338b3c751eebb3 URL: https://github.com/llvm/llvm-project/commit/90b310f6caf0b356075c70407c338b3c751eebb3 DIFF: https://github.com/llvm/llvm-project/commit/90b310f6caf0b356075c70407c338b3c751eebb3.diff LOG: [Support] Simplify KnownBits::icmp helpers. NFC. Remove some special cases that aren't really any simpler than the general case. Differential Revision: https://reviews.llvm.org/D94595 Added: Modified: llvm/lib/Support/KnownBits.cpp Removed: diff --git a/llvm/lib/Support/KnownBits.cpp b/llvm/lib/Support/KnownBits.cpp index 0147d21d153a..0f36c6a9ef1d 100644 --- a/llvm/lib/Support/KnownBits.cpp +++ b/llvm/lib/Support/KnownBits.cpp @@ -271,9 +271,6 @@ KnownBits KnownBits::ashr(const KnownBits &LHS, const KnownBits &RHS) { Optional<bool> KnownBits::eq(const KnownBits &LHS, const KnownBits &RHS) { if (LHS.isConstant() && RHS.isConstant()) return Optional<bool>(LHS.getConstant() == RHS.getConstant()); - if (LHS.getMaxValue().ult(RHS.getMinValue()) || - LHS.getMinValue().ugt(RHS.getMaxValue())) -return Optional<bool>(false); if (LHS.One.intersects(RHS.Zero) || RHS.One.intersects(LHS.Zero)) return Optional<bool>(false); return None; @@ -286,8 +283,6 @@ Optional<bool> KnownBits::ne(const KnownBits &LHS, const KnownBits &RHS) { } Optional<bool> KnownBits::ugt(const KnownBits &LHS, const KnownBits &RHS) { - if (LHS.isConstant() && RHS.isConstant()) -return Optional<bool>(LHS.getConstant().ugt(RHS.getConstant())); // LHS >u RHS -> false if umax(LHS) <= umax(RHS) if (LHS.getMaxValue().ule(RHS.getMinValue())) return Optional<bool>(false); @@ -312,8 +307,6 @@ Optional<bool> KnownBits::ule(const KnownBits &LHS, const KnownBits &RHS) { } Optional<bool> KnownBits::sgt(const KnownBits &LHS, const KnownBits &RHS) { - if (LHS.isConstant() && RHS.isConstant()) -return Optional<bool>(LHS.getConstant().sgt(RHS.getConstant())); // LHS >s RHS -> false if smax(LHS) <= smax(RHS) if (LHS.getSignedMaxValue().sle(RHS.getSignedMinValue())) return Optional<bool>(false); ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
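Why dropping the interval test from eq loses nothing: if the surviving bitwise test does not fire, then RHS.One is a submask of ~LHS.Zero, so min(RHS) <= max(LHS), and symmetrically min(LHS) <= max(RHS), meaning the deleted range test could not have fired either. A throwaway exhaustive check of that claim at 4 bits (standalone code, not part of the patch):

  #include <cassert>

  int main() {
    const unsigned W = 4, N = 1u << W, Mask = N - 1;
    for (unsigned LZ = 0; LZ < N; ++LZ)
      for (unsigned LO = 0; LO < N; ++LO) {
        if (LZ & LO) continue; // a bit cannot be known-zero and known-one
        for (unsigned RZ = 0; RZ < N; ++RZ)
          for (unsigned RO = 0; RO < N; ++RO) {
            if (RZ & RO) continue;
            unsigned MaxL = ~LZ & Mask, MinL = LO;
            unsigned MaxR = ~RZ & Mask, MinR = RO;
            bool Interval = MaxL < MinR || MinL > MaxR;        // deleted test
            bool Bitwise = (LO & RZ) != 0 || (RO & LZ) != 0;   // kept test
            if (Interval)
              assert(Bitwise && "the kept test subsumes the deleted one");
          }
      }
    return 0;
  }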
[llvm-branch-commits] [llvm] 868da2e - [SelectionDAG] Remove an early-out from computeKnownBits for smin/smax
Author: Jay Foad Date: 2021-01-14T18:15:17Z New Revision: 868da2ea939baf8c71a6dcb878cf6094ede9486e URL: https://github.com/llvm/llvm-project/commit/868da2ea939baf8c71a6dcb878cf6094ede9486e DIFF: https://github.com/llvm/llvm-project/commit/868da2ea939baf8c71a6dcb878cf6094ede9486e.diff LOG: [SelectionDAG] Remove an early-out from computeKnownBits for smin/smax Even if we know nothing about LHS, it can still be useful to know that smax(LHS, RHS) >= RHS and smin(LHS, RHS) <= RHS. Differential Revision: https://reviews.llvm.org/D87145 Added: Modified: llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp llvm/test/CodeGen/X86/known-bits-vector.ll Removed: diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp index 7084ab68524b5..82da553954d2f 100644 --- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp @@ -3416,7 +3416,6 @@ KnownBits SelectionDAG::computeKnownBits(SDValue Op, const APInt &DemandedElts, } Known = computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1); -if (Known.isUnknown()) break; // Early-out Known2 = computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1); if (IsMax) Known = KnownBits::smax(Known, Known2); diff --git a/llvm/test/CodeGen/X86/known-bits-vector.ll b/llvm/test/CodeGen/X86/known-bits-vector.ll index 3b6912a9d9461..05bf984101abc 100644 --- a/llvm/test/CodeGen/X86/known-bits-vector.ll +++ b/llvm/test/CodeGen/X86/known-bits-vector.ll @@ -435,11 +435,7 @@ define <4 x float> @knownbits_smax_smin_shuffle_uitofp(<4 x i32> %a0) { ; X32-NEXT:vpminsd {{\.LCPI.*}}, %xmm0, %xmm0 ; X32-NEXT:vpmaxsd {{\.LCPI.*}}, %xmm0, %xmm0 ; X32-NEXT:vpshufd {{.*#+}} xmm0 = xmm0[0,0,3,3] -; X32-NEXT:vpblendw {{.*#+}} xmm1 = xmm0[0],mem[1],xmm0[2],mem[3],xmm0[4],mem[5],xmm0[6],mem[7] -; X32-NEXT:vpsrld $16, %xmm0, %xmm0 -; X32-NEXT:vpblendw {{.*#+}} xmm0 = xmm0[0],mem[1],xmm0[2],mem[3],xmm0[4],mem[5],xmm0[6],mem[7] -; X32-NEXT:vsubps {{\.LCPI.*}}, %xmm0, %xmm0 -; X32-NEXT:vaddps %xmm0, %xmm1, %xmm0 +; X32-NEXT:vcvtdq2ps %xmm0, %xmm0 ; X32-NEXT:retl ; ; X64-LABEL: knownbits_smax_smin_shuffle_uitofp: @@ -447,11 +443,7 @@ define <4 x float> @knownbits_smax_smin_shuffle_uitofp(<4 x i32> %a0) { ; X64-NEXT:vpminsd {{.*}}(%rip), %xmm0, %xmm0 ; X64-NEXT:vpmaxsd {{.*}}(%rip), %xmm0, %xmm0 ; X64-NEXT:vpshufd {{.*#+}} xmm0 = xmm0[0,0,3,3] -; X64-NEXT:vpblendw {{.*#+}} xmm1 = xmm0[0],mem[1],xmm0[2],mem[3],xmm0[4],mem[5],xmm0[6],mem[7] -; X64-NEXT:vpsrld $16, %xmm0, %xmm0 -; X64-NEXT:vpblendw {{.*#+}} xmm0 = xmm0[0],mem[1],xmm0[2],mem[3],xmm0[4],mem[5],xmm0[6],mem[7] -; X64-NEXT:vsubps {{.*}}(%rip), %xmm0, %xmm0 -; X64-NEXT:vaddps %xmm0, %xmm1, %xmm0 +; X64-NEXT:vcvtdq2ps %xmm0, %xmm0 ; X64-NEXT:retq %1 = call <4 x i32> @llvm.x86.sse41.pminsd(<4 x i32> %a0, <4 x i32> ) %2 = call <4 x i32> @llvm.x86.sse41.pmaxsd(<4 x i32> %1, <4 x i32> ) ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
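The X86 diff above is this change in action: the pminsd/pmaxsd pair clamps the value into a known non-negative range, so even though nothing is known about the other smax operand, the result's sign bit is known zero and the unsigned-to-float expansion collapses to a single vcvtdq2ps. A small standalone sketch of the query, assuming only the KnownBits::smax helper the diff already calls:

  #include "llvm/Support/KnownBits.h"
  using namespace llvm;

  // Even with a completely unknown LHS, smax(LHS, RHS) >=s RHS, so a
  // non-negative RHS forces the sign bit of the result to be known zero.
  bool smaxKnownNonNegative(const KnownBits &RHS) {
    KnownBits LHS(RHS.getBitWidth()); // nothing known about LHS
    return KnownBits::smax(LHS, RHS).isNonNegative();
  }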
[llvm-branch-commits] [llvm] 49dce85 - [AMDGPU] Simplify AMDGPUInstPrinter::printExpSrcN. NFC.
Author: Jay Foad Date: 2021-01-19T10:39:56Z New Revision: 49dce85584e34ee7fb973da9ba617169fd0f103c URL: https://github.com/llvm/llvm-project/commit/49dce85584e34ee7fb973da9ba617169fd0f103c DIFF: https://github.com/llvm/llvm-project/commit/49dce85584e34ee7fb973da9ba617169fd0f103c.diff LOG: [AMDGPU] Simplify AMDGPUInstPrinter::printExpSrcN. NFC. Change-Id: Idd7f47647bc0faa3ad6f61f44728c0f20540ec00 Added: Modified: llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.h Removed: diff --git a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp index 574fba62f5f3..fcca32abdd5a 100644 --- a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp +++ b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp @@ -958,10 +958,9 @@ void AMDGPUInstPrinter::printSDWADstUnused(const MCInst *MI, unsigned OpNo, } } -template <unsigned N> void AMDGPUInstPrinter::printExpSrcN(const MCInst *MI, unsigned OpNo, - const MCSubtargetInfo &STI, - raw_ostream &O) { + const MCSubtargetInfo &STI, raw_ostream &O, + unsigned N) { unsigned Opc = MI->getOpcode(); int EnIdx = AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::en); unsigned En = MI->getOperand(EnIdx).getImm(); @@ -969,12 +968,8 @@ void AMDGPUInstPrinter::printExpSrcN(const MCInst *MI, unsigned OpNo, int ComprIdx = AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::compr); // If compr is set, print as src0, src0, src1, src1 - if (MI->getOperand(ComprIdx).getImm()) { -if (N == 1 || N == 2) - --OpNo; -else if (N == 3) - OpNo -= 2; - } + if (MI->getOperand(ComprIdx).getImm()) +OpNo = OpNo - N + N / 2; if (En & (1 << N)) printRegOperand(MI->getOperand(OpNo).getReg(), O, MRI); @@ -985,25 +980,25 @@ void AMDGPUInstPrinter::printExpSrcN(const MCInst *MI, unsigned OpNo, void AMDGPUInstPrinter::printExpSrc0(const MCInst *MI, unsigned OpNo, const MCSubtargetInfo &STI, raw_ostream &O) { - printExpSrcN<0>(MI, OpNo, STI, O); + printExpSrcN(MI, OpNo, STI, O, 0); } void AMDGPUInstPrinter::printExpSrc1(const MCInst *MI, unsigned OpNo, const MCSubtargetInfo &STI, raw_ostream &O) { - printExpSrcN<1>(MI, OpNo, STI, O); + printExpSrcN(MI, OpNo, STI, O, 1); } void AMDGPUInstPrinter::printExpSrc2(const MCInst *MI, unsigned OpNo, const MCSubtargetInfo &STI, raw_ostream &O) { - printExpSrcN<2>(MI, OpNo, STI, O); + printExpSrcN(MI, OpNo, STI, O, 2); } void AMDGPUInstPrinter::printExpSrc3(const MCInst *MI, unsigned OpNo, const MCSubtargetInfo &STI, raw_ostream &O) { - printExpSrcN<3>(MI, OpNo, STI, O); + printExpSrcN(MI, OpNo, STI, O, 3); } void AMDGPUInstPrinter::printExpTgt(const MCInst *MI, unsigned OpNo, diff --git a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.h b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.h index 64ccb9092ec4..8d13aa682211 100644 --- a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.h +++ b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.h @@ -179,10 +179,8 @@ class AMDGPUInstPrinter : public MCInstPrinter { void printDefaultVccOperand(unsigned OpNo, const MCSubtargetInfo &STI, raw_ostream &O); - - template <unsigned N> - void printExpSrcN(const MCInst *MI, unsigned OpNo, -const MCSubtargetInfo &STI, raw_ostream &O); + void printExpSrcN(const MCInst *MI, unsigned OpNo, const MCSubtargetInfo &STI, +raw_ostream &O, unsigned N); void printExpSrc0(const MCInst *MI, unsigned OpNo, const MCSubtargetInfo &STI, raw_ostream &O); void printExpSrc1(const MCInst *MI, unsigned OpNo, ___ llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
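The closed form OpNo - N + N / 2 in the rewritten function reproduces the deleted special cases exactly: source N of a compressed export lives at slot N / 2, so the adjustment is 0 for N == 0, 1 for N == 1 and N == 2, and 2 for N == 3. A quick throwaway check (hypothetical standalone code):

  #include <cassert>

  int main() {
    for (unsigned N = 0; N < 4; ++N) {
      unsigned OpNo = 10; // any base operand index for source N
      unsigned Old = OpNo; // the deleted special cases
      if (N == 1 || N == 2)
        --Old;
      else if (N == 3)
        Old -= 2;
      assert(Old == OpNo - N + N / 2);
    }
    return 0;
  }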
[llvm-branch-commits] [llvm] de2f942 - [AMDGPU] Simplify test case for D94010
Author: Jay Foad Date: 2021-01-19T16:36:43Z New Revision: de2f9423995d52a5457752256815dc54d317c8d1 URL: https://github.com/llvm/llvm-project/commit/de2f9423995d52a5457752256815dc54d317c8d1 DIFF: https://github.com/llvm/llvm-project/commit/de2f9423995d52a5457752256815dc54d317c8d1.diff LOG: [AMDGPU] Simplify test case for D94010 Added: Modified: llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fma.legacy.ll Removed: diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fma.legacy.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fma.legacy.ll index 03584312e2af..8df0215a6fe2 100644 --- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fma.legacy.ll +++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fma.legacy.ll @@ -10,7 +10,6 @@ define float @v_fma(float %a, float %b, float %c) { ; GCN-NEXT:v_fmac_legacy_f32_e64 v2, v0, v1 ; GCN-NEXT:v_mov_b32_e32 v0, v2 ; GCN-NEXT:s_setpc_b64 s[30:31] -; %fma = call float @llvm.amdgcn.fma.legacy(float %a, float %b, float %c) ret float %fma } @@ -22,7 +21,6 @@ define float @v_fabs_fma(float %a, float %b, float %c) { ; GCN-NEXT:s_waitcnt_vscnt null, 0x0 ; GCN-NEXT:v_fma_legacy_f32 v0, |v0|, v1, v2 ; GCN-NEXT:s_setpc_b64 s[30:31] -; %fabs.a = call float @llvm.fabs.f32(float %a) %fma = call float @llvm.amdgcn.fma.legacy(float %fabs.a, float %b, float %c) ret float %fma @@ -35,7 +33,6 @@ define float @v_fneg_fabs_fma(float %a, float %b, float %c) { ; GCN-NEXT:s_waitcnt_vscnt null, 0x0 ; GCN-NEXT:v_fma_legacy_f32 v0, v0, -|v1|, v2 ; GCN-NEXT:s_setpc_b64 s[30:31] -; %fabs.b = call float @llvm.fabs.f32(float %b) %neg.fabs.b = fneg float %fabs.b %fma = call float @llvm.amdgcn.fma.legacy(float %a, float %neg.fabs.b, float %c) @@ -49,92 +46,21 @@ define float @v_fneg_fma(float %a, float %b, float %c) { ; GCN-NEXT:s_waitcnt_vscnt null, 0x0 ; GCN-NEXT:v_fma_legacy_f32 v0, v0, v1, -v2 ; GCN-NEXT:s_setpc_b64 s[30:31] -; %neg.c = fneg float %c %fma = call float @llvm.amdgcn.fma.legacy(float %a, float %b, float %neg.c) ret float %fma } -define amdgpu_ps <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @main(<4 x i32> addrspace(6)* inreg noalias align 32 dereferenceable(18446744073709551615) %arg, <8 x i32> addrspace(6)* inreg noalias align 32 dereferenceable(18446744073709551615) %arg1, <4 x i32> addrspace(6)* inreg noalias align 32 dereferenceable(18446744073709551615) %arg2, <8 x i32> addrspace(6)* inreg noalias align 32 dereferenceable(18446744073709551615) %arg3, i32 inreg %arg4, i32 inreg %arg5, <2 x i32> %arg6, <2 x i32> %arg7, <2 x i32> %arg8, <3 x i32> %arg9, <2 x i32> %arg10, <2 x i32> %arg11, <2 x i32> %arg12, <3 x float> %arg13, float %arg14, float %arg15, float %arg16, float %arg17, i32 %arg18, i32 %arg19, float %arg20, i32 %arg21) #0 { -; SDAG-LABEL: main: -; SDAG: ; %bb.0: -; SDAG-NEXT:s_mov_b32 s16, exec_lo -; SDAG-NEXT:v_mov_b32_e32 v14, v2 -; SDAG-NEXT:s_mov_b32 s0, s5 -; SDAG-NEXT:s_wqm_b32 exec_lo, exec_lo -; SDAG-NEXT:s_mov_b32 s1, 0 -; SDAG-NEXT:s_mov_b32 m0, s7 -; SDAG-NEXT:s_clause 0x1 -; SDAG-NEXT:s_load_dwordx8 s[8:15], s[0:1], 0x400 -; SDAG-NEXT:s_load_dwordx4 s[0:3], s[0:1], 0x430 -; SDAG-NEXT:v_interp_p1_f32_e32 v2, v0, attr0.x -; SDAG-NEXT:v_interp_p1_f32_e32 v3, v0, attr0.y -; SDAG-NEXT:s_mov_b32 s4, s6 -; SDAG-NEXT:v_interp_p2_f32_e32 v2, v1, attr0.x -; SDAG-NEXT:v_interp_p2_f32_e32 v3, v1, attr0.y -; SDAG-NEXT:s_and_b32 exec_lo, exec_lo, s16 -; SDAG-NEXT:s_waitcnt lgkmcnt(0) -; SDAG-NEXT:image_sample v[0:3], v[2:3], s[8:15], s[0:3] dmask:0xf dim:SQ_RSRC_IMG_2D -; SDAG-NEXT:s_waitcnt vmcnt(0) -; 
SDAG-NEXT:v_fma_legacy_f32 v0, v0, 2.0, -1.0 -; SDAG-NEXT:v_fma_legacy_f32 v1, v1, 2.0, -1.0 -; SDAG-NEXT:; return to shader part epilog -; -; GISEL-LABEL: main: -; GISEL: ; %bb.0: -; GISEL-NEXT:s_mov_b32 s16, exec_lo -; GISEL-NEXT:s_mov_b32 s4, s6 -; GISEL-NEXT:s_mov_b32 m0, s7 -; GISEL-NEXT:s_wqm_b32 exec_lo, exec_lo -; GISEL-NEXT:s_add_u32 s0, s5, 0x400 -; GISEL-NEXT:s_mov_b32 s1, 0 -; GISEL-NEXT:v_interp_p1_f32_e32 v3, v0, attr0.y -; GISEL-NEXT:s_load_dwordx8 s[8:15], s[0:1], 0x0 -; GISEL-NEXT:s_add_u32 s0, s5, 0x430 -; GISEL-NEXT:v_mov_b32_e32 v14, v2 -; GISEL-NEXT:s_load_dwordx4 s[0:3], s[0:1], 0x0 -; GISEL-NEXT:v_interp_p1_f32_e32 v2, v0, attr0.x -; GISEL-NEXT:v_interp_p2_f32_e32 v3, v1, attr0.y -; GISEL-NEXT:v_interp_p2_f32_e32 v2, v1, attr0.x -; GISEL-NEXT:s_and_b32 exec_lo, exec_lo, s16 -; GISEL-NEXT:s_waitcnt lgkmcnt(0) -; GISEL-NEXT:image_sample v[0:3], v[2:3], s[8:15], s[0:3] dmask:0xf dim:SQ_RSRC_IMG_2D -; GISEL-NEXT:s_waitcnt vmcnt(0) -; GISEL-NEXT:v_fma_legacy_f32 v0, v0, 2.0, -1.0 -; GISEL-NEXT:v_fma_legacy_f32 v1,
[llvm-branch-commits] [llvm] 0808c70 - [AMDGPU] Fix test case for D94010
Author: Jay Foad Date: 2021-01-19T16:46:47Z New Revision: 0808c7009a06773e78772c7b74d254fd3572f0ea URL: https://github.com/llvm/llvm-project/commit/0808c7009a06773e78772c7b74d254fd3572f0ea DIFF: https://github.com/llvm/llvm-project/commit/0808c7009a06773e78772c7b74d254fd3572f0ea.diff LOG: [AMDGPU] Fix test case for D94010 Added: Modified: llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fma.legacy.ll Removed: diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fma.legacy.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fma.legacy.ll index 8df0215a6fe2..5c333f0ce97d 100644 --- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fma.legacy.ll +++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fma.legacy.ll @@ -1,6 +1,6 @@ ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py -; RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,SDAG %s -; RUN: llc -global-isel -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GISEL %s +; RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1030 < %s | FileCheck -check-prefix=GCN %s +; RUN: llc -global-isel -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1030 < %s | FileCheck -check-prefix=GCN %s define float @v_fma(float %a, float %b, float %c) { ; GCN-LABEL: v_fma: ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] 18cb744 - [AMDGPU] Simpler names for arch-specific ttmp registers. NFC.
Author: Jay Foad Date: 2021-01-19T18:47:14Z New Revision: 18cb7441b69a22565dcc340bac0e58bc9f301439 URL: https://github.com/llvm/llvm-project/commit/18cb7441b69a22565dcc340bac0e58bc9f301439 DIFF: https://github.com/llvm/llvm-project/commit/18cb7441b69a22565dcc340bac0e58bc9f301439.diff LOG: [AMDGPU] Simpler names for arch-specific ttmp registers. NFC. Rename the *_gfx9_gfx10 ttmp registers to *_gfx9plus for simplicity, and use the corresponding isGFX9Plus predicate to decide when to use them instead of the old *_vi versions. Differential Revision: https://reviews.llvm.org/D94975 Added: Modified: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp llvm/lib/Target/AMDGPU/SIDefines.h llvm/lib/Target/AMDGPU/SIRegisterInfo.td llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp Removed: diff --git a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp index 7f68174e506d..08b340c8fd66 100644 --- a/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp +++ b/llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp @@ -997,8 +997,8 @@ unsigned AMDGPUDisassembler::getTtmpClassId(const OpWidthTy Width) const { int AMDGPUDisassembler::getTTmpIdx(unsigned Val) const { using namespace AMDGPU::EncValues; - unsigned TTmpMin = isGFX9Plus() ? TTMP_GFX9_GFX10_MIN : TTMP_VI_MIN; - unsigned TTmpMax = isGFX9Plus() ? TTMP_GFX9_GFX10_MAX : TTMP_VI_MAX; + unsigned TTmpMin = isGFX9Plus() ? TTMP_GFX9PLUS_MIN : TTMP_VI_MIN; + unsigned TTmpMax = isGFX9Plus() ? TTMP_GFX9PLUS_MAX : TTMP_VI_MAX; return (TTmpMin <= Val && Val <= TTmpMax)? Val - TTmpMin : -1; } diff --git a/llvm/lib/Target/AMDGPU/SIDefines.h b/llvm/lib/Target/AMDGPU/SIDefines.h index b9a2bcf81903..f7555f0453bb 100644 --- a/llvm/lib/Target/AMDGPU/SIDefines.h +++ b/llvm/lib/Target/AMDGPU/SIDefines.h @@ -247,8 +247,8 @@ enum : unsigned { SGPR_MAX_GFX10 = 105, TTMP_VI_MIN = 112, TTMP_VI_MAX = 123, - TTMP_GFX9_GFX10_MIN = 108, - TTMP_GFX9_GFX10_MAX = 123, + TTMP_GFX9PLUS_MIN = 108, + TTMP_GFX9PLUS_MAX = 123, INLINE_INTEGER_C_MIN = 128, INLINE_INTEGER_C_POSITIVE_MAX = 192, // 64 INLINE_INTEGER_C_MAX = 208, diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td index 378fc5df21e5..92390f1f3297 100644 --- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td +++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td @@ -246,9 +246,9 @@ def TMA : RegisterWithSubRegs<"tma", [TMA_LO, TMA_HI]> { } foreach Index = 0...15 in { - defm TTMP#Index#_vi : SIRegLoHi16<"ttmp"#Index, !add(112, Index)>; - defm TTMP#Index#_gfx9_gfx10 : SIRegLoHi16<"ttmp"#Index, !add(108, Index)>; - defm TTMP#Index : SIRegLoHi16<"ttmp"#Index, 0>; + defm TTMP#Index#_vi : SIRegLoHi16<"ttmp"#Index, !add(112, Index)>; + defm TTMP#Index#_gfx9plus : SIRegLoHi16<"ttmp"#Index, !add(108, Index)>; + defm TTMP#Index : SIRegLoHi16<"ttmp"#Index, 0>; } multiclass FLAT_SCR_LOHI_m ci_e, bits<16> vi_e> { @@ -419,8 +419,8 @@ class TmpRegTuples.ret>; foreach Index = {0, 2, 4, 6, 8, 10, 12, 14} in { - def TTMP#Index#_TTMP#!add(Index,1)#_vi : TmpRegTuples<"_vi", 2, Index>; - def TTMP#Index#_TTMP#!add(Index,1)#_gfx9_gfx10 : TmpRegTuples<"_gfx9_gfx10", 2, Index>; + def TTMP#Index#_TTMP#!add(Index,1)#_vi : TmpRegTuples<"_vi", 2, Index>; + def TTMP#Index#_TTMP#!add(Index,1)#_gfx9plus : TmpRegTuples<"_gfx9plus", 2, Index>; } foreach Index = {0, 4, 8, 12} in { @@ -429,7 +429,7 @@ foreach Index = {0, 4, 8, 12} in { _TTMP#!add(Index,3)#_vi : TmpRegTuples<"_vi", 4, Index>; def TTMP#Index#_TTMP#!add(Index,1)# _TTMP#!add(Index,2)# - 
_TTMP#!add(Index,3)#_gfx9_gfx10 : TmpRegTuples<"_gfx9_gfx10", 4, Index>; + _TTMP#!add(Index,3)#_gfx9plus : TmpRegTuples<"_gfx9plus", 4, Index>; } foreach Index = {0, 4, 8} in { @@ -446,7 +446,7 @@ foreach Index = {0, 4, 8} in { _TTMP#!add(Index,4)# _TTMP#!add(Index,5)# _TTMP#!add(Index,6)# - _TTMP#!add(Index,7)#_gfx9_gfx10 : TmpRegTuples<"_gfx9_gfx10", 8, Index>; + _TTMP#!add(Index,7)#_gfx9plus : TmpRegTuples<"_gfx9plus", 8, Index>; } def TTMP0_TTMP1_TTMP2_TTMP3_TTMP4_TTMP5_TTMP6_TTMP7_TTMP8_TTMP9_TTMP10_TTMP11_TTMP12_TTMP13_TTMP14_TTMP15_vi : @@ -456,12 +456,12 @@ def TTMP0_TTMP1_TTMP2_TTMP3_TTMP4_TTMP5_TTMP6_TTMP7_TTMP8_TTMP9_TTMP10_TTMP11_TT TTMP8_vi, TTMP9_vi, TTMP10_vi, TTMP11_vi, TTMP12_vi, TTMP13_vi, TTMP14_vi, TTMP15_vi]>; -def TTMP0_TTMP1_TTMP2_TTMP3_TTMP4_TTMP5_TTMP6_TTMP7_TTMP8_TTMP9_TTMP10_TTMP11_TTMP12_TTMP13_TTMP14_TTMP15_gfx9_gfx10 : +def TTMP0_TTMP1_TTMP2_TTMP3_TTMP4_TTMP5_TTMP6_TTMP7_TTMP8_TTMP9_TTMP10_TTMP11_TTMP12_TTMP13_TTMP14_TTMP15_gfx9plu
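Behind the rename the encodings themselves are unchanged: trap temporaries occupy raw values [112, 123] in the VI encoding but [108, 123] from GFX9 onward, which is what the getTTmpIdx change above decodes. A standalone sketch of that mapping, with a hypothetical helper name mirroring the SIDefines.h constants in the diff:

  #include <cassert>

  int getTTmpIdxSketch(unsigned Val, bool IsGFX9Plus) {
    unsigned Min = IsGFX9Plus ? 108 : 112; // TTMP_GFX9PLUS_MIN : TTMP_VI_MIN
    unsigned Max = 123;                    // both TTMP_*_MAX values
    return (Min <= Val && Val <= Max) ? int(Val - Min) : -1;
  }

  int main() {
    assert(getTTmpIdxSketch(112, /*IsGFX9Plus=*/false) == 0);  // ttmp0 on VI
    assert(getTTmpIdxSketch(108, /*IsGFX9Plus=*/true) == 0);   // ttmp0 on GFX9+
    assert(getTTmpIdxSketch(108, /*IsGFX9Plus=*/false) == -1); // not a ttmp on VI
    return 0;
  }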
[llvm-branch-commits] [llvm] c0b3c5a - [AMDGPU][GlobalISel] Run SIAddImgInit
Author: Jay Foad Date: 2021-01-21T15:54:54Z New Revision: c0b3c5a06451aad4351e35c74ccf2fe5da917a41 URL: https://github.com/llvm/llvm-project/commit/c0b3c5a06451aad4351e35c74ccf2fe5da917a41 DIFF: https://github.com/llvm/llvm-project/commit/c0b3c5a06451aad4351e35c74ccf2fe5da917a41.diff LOG: [AMDGPU][GlobalISel] Run SIAddImgInit This pass is required to get correct codegen for image instructions with the tfe or lwe bits set. Differential Revision: https://reviews.llvm.org/D95132 Added: Modified: llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.1d.d16.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.1d.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.2d.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.2darraymsaa.a16.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.2darraymsaa.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.3d.a16.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.3d.ll Removed: diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp index 58c436836d19..7d8e8486602b 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp +++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp @@ -1109,6 +1109,10 @@ bool GCNPassConfig::addRegBankSelect() { bool GCNPassConfig::addGlobalInstructionSelect() { addPass(new InstructionSelect()); + // TODO: Fix instruction selection to do the right thing for image + // instructions with tfe or lwe in the first place, instead of running a + // separate pass to fix them up? + addPass(createSIAddIMGInitPass()); return false; } diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.1d.d16.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.1d.d16.ll index 36f3e63598ca..99ab3580b91d 100644 --- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.1d.d16.ll +++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.1d.d16.ll @@ -655,6 +655,7 @@ define amdgpu_ps <4 x half> @load_1d_v4f16_xyzw(<8 x i32> inreg %rsrc, i32 %s) { define amdgpu_ps float @load_1d_f16_tfe_dmask_x(<8 x i32> inreg %rsrc, i32 %s) { ; GFX8-UNPACKED-LABEL: load_1d_f16_tfe_dmask_x: ; GFX8-UNPACKED: ; %bb.0: +; GFX8-UNPACKED-NEXT:v_mov_b32_e32 v1, 0 ; GFX8-UNPACKED-NEXT:s_mov_b32 s0, s2 ; GFX8-UNPACKED-NEXT:s_mov_b32 s1, s3 ; GFX8-UNPACKED-NEXT:s_mov_b32 s2, s4 @@ -663,13 +664,15 @@ define amdgpu_ps float @load_1d_f16_tfe_dmask_x(<8 x i32> inreg %rsrc, i32 %s) { ; GFX8-UNPACKED-NEXT:s_mov_b32 s5, s7 ; GFX8-UNPACKED-NEXT:s_mov_b32 s6, s8 ; GFX8-UNPACKED-NEXT:s_mov_b32 s7, s9 -; GFX8-UNPACKED-NEXT:image_load v[0:1], v0, s[0:7] dmask:0x1 unorm tfe d16 +; GFX8-UNPACKED-NEXT:v_mov_b32_e32 v2, v1 +; GFX8-UNPACKED-NEXT:image_load v[1:2], v0, s[0:7] dmask:0x1 unorm tfe d16 ; GFX8-UNPACKED-NEXT:s_waitcnt vmcnt(0) -; GFX8-UNPACKED-NEXT:v_mov_b32_e32 v0, v1 +; GFX8-UNPACKED-NEXT:v_mov_b32_e32 v0, v2 ; GFX8-UNPACKED-NEXT:; return to shader part epilog ; ; GFX8-PACKED-LABEL: load_1d_f16_tfe_dmask_x: ; GFX8-PACKED: ; %bb.0: +; GFX8-PACKED-NEXT:v_mov_b32_e32 v1, 0 ; GFX8-PACKED-NEXT:s_mov_b32 s0, s2 ; GFX8-PACKED-NEXT:s_mov_b32 s1, s3 ; GFX8-PACKED-NEXT:s_mov_b32 s2, s4 @@ -678,13 +681,15 @@ define amdgpu_ps float @load_1d_f16_tfe_dmask_x(<8 x i32> inreg %rsrc, i32 %s) { ; GFX8-PACKED-NEXT:s_mov_b32 s5, s7 ; GFX8-PACKED-NEXT:s_mov_b32 s6, s8 ; GFX8-PACKED-NEXT:s_mov_b32 s7, s9 -; GFX8-PACKED-NEXT:image_load v[0:1], v0, s[0:7] dmask:0x1 unorm tfe d16 +; 
GFX8-PACKED-NEXT:v_mov_b32_e32 v2, v1 +; GFX8-PACKED-NEXT:image_load v[1:2], v0, s[0:7] dmask:0x1 unorm tfe d16 ; GFX8-PACKED-NEXT:s_waitcnt vmcnt(0) -; GFX8-PACKED-NEXT:v_mov_b32_e32 v0, v1 +; GFX8-PACKED-NEXT:v_mov_b32_e32 v0, v2 ; GFX8-PACKED-NEXT:; return to shader part epilog ; ; GFX9-LABEL: load_1d_f16_tfe_dmask_x: ; GFX9: ; %bb.0: +; GFX9-NEXT:v_mov_b32_e32 v1, 0 ; GFX9-NEXT:s_mov_b32 s0, s2 ; GFX9-NEXT:s_mov_b32 s1, s3 ; GFX9-NEXT:s_mov_b32 s2, s4 @@ -693,13 +698,15 @@ define amdgpu_ps float @load_1d_f16_tfe_dmask_x(<8 x i32> inreg %rsrc, i32 %s) { ; GFX9-NEXT:s_mov_b32 s5, s7 ; GFX9-NEXT:s_mov_b32 s6, s8 ; GFX9-NEXT:s_mov_b32 s7, s9 -; GFX9-NEXT:image_load v[0:1], v0, s[0:7] dmask:0x1 unorm tfe d16 +; GFX9-NEXT:v_mov_b32_e32 v2, v1 +; GFX9-NEXT:image_load v[1:2], v0, s[0:7] dmask:0x1 unorm tfe d16 ; GFX9-NEXT:s_waitcnt vmcnt(0) -; GFX9-NEXT:v_mov_b32_e32 v0, v1 +; GFX9-NEXT:v_mov_b32_e32 v0, v2 ; GFX9-NEXT:; return to shader part epilog ; ; GFX10-LABEL: load_1d_f16_tfe_dmask_x: ; GFX10: ; %bb.0: +; GFX10-NEXT:v_mov_b32_e32 v1, 0 ; GFX10-NEXT:s
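The recurring new check line v_mov_b32_e32 v1, 0 in these tests is the pass at work: with tfe or lwe set, the image instruction writes its extra status dword only when a fault actually occurs, so that lane has to be zero-initialized up front to have a well-defined value on the no-fault path. A very rough sketch of the core fix-up, with a hypothetical helper name (the real pass also has to rebuild the register sequence around the load):

  #include "llvm/CodeGen/MachineInstrBuilder.h"
  #include "llvm/CodeGen/TargetInstrInfo.h"
  using namespace llvm;

  // Insert "v_mov_b32 StatusReg, 0" ahead of an image load with tfe/lwe so the
  // conditionally written status dword reads as zero when no fault occurs.
  static void zeroInitTfeDst(MachineBasicBlock &MBB, MachineInstr &ImgLoad,
                             const TargetInstrInfo &TII, Register StatusReg,
                             unsigned MovOpc /* e.g. AMDGPU::V_MOV_B32_e32 */) {
    BuildMI(MBB, ImgLoad, ImgLoad.getDebugLoc(), TII.get(MovOpc), StatusReg)
        .addImm(0);
  }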
[llvm-branch-commits] [llvm] 14eea6b - [LegacyPM] Update InversedLastUser on the fly. NFC.
Author: Jay Foad Date: 2021-01-22T09:48:54Z New Revision: 14eea6b0ecddfe7d1c68754a8bfb7c21cde82df8 URL: https://github.com/llvm/llvm-project/commit/14eea6b0ecddfe7d1c68754a8bfb7c21cde82df8 DIFF: https://github.com/llvm/llvm-project/commit/14eea6b0ecddfe7d1c68754a8bfb7c21cde82df8.diff LOG: [LegacyPM] Update InversedLastUser on the fly. NFC. This speeds up setLastUser enough to give a 5% to 10% speed up on trivial invocations of opt and llc, as measured by: perf stat -r 100 opt -S -o /dev/null -O3 /dev/null perf stat -r 100 llc -march=amdgcn /dev/null -filetype null Don't dump last use information unless -debug-pass=Details to avoid printing lots of spam that will break some existing lit tests. Before this patch, dumping last use information was broken anyway, because it used InversedLastUser before it had been populated. Differential Revision: https://reviews.llvm.org/D92309 Added: Modified: llvm/include/llvm/IR/LegacyPassManagers.h llvm/lib/IR/LegacyPassManager.cpp Removed: diff --git a/llvm/include/llvm/IR/LegacyPassManagers.h b/llvm/include/llvm/IR/LegacyPassManagers.h index 498e736a0100..f4fae184e428 100644 --- a/llvm/include/llvm/IR/LegacyPassManagers.h +++ b/llvm/include/llvm/IR/LegacyPassManagers.h @@ -230,11 +230,11 @@ class PMTopLevelManager { // Map to keep track of last user of the analysis pass. // LastUser->second is the last user of Lastuser->first. + // This is kept in sync with InversedLastUser. DenseMap<Pass *, Pass *> LastUser; // Map to keep track of passes that are last used by a pass. - // This inverse map is initialized at PM->run() based on - // LastUser map. + // This is kept in sync with LastUser. DenseMap<Pass *, SmallPtrSet<Pass *, 8> > InversedLastUser; /// Immutable passes are managed by top level manager. diff --git a/llvm/lib/IR/LegacyPassManager.cpp b/llvm/lib/IR/LegacyPassManager.cpp index 5575bc469a87..4547c3a01239 100644 --- a/llvm/lib/IR/LegacyPassManager.cpp +++ b/llvm/lib/IR/LegacyPassManager.cpp @@ -568,7 +568,12 @@ PMTopLevelManager::setLastUser(ArrayRef<Pass *> AnalysisPasses, Pass *P) { PDepth = P->getResolver()->getPMDataManager().getDepth(); for (Pass *AP : AnalysisPasses) { -LastUser[AP] = P; +// Record P as the new last user of AP. +auto &LastUserOfAP = LastUser[AP]; +if (LastUserOfAP) + InversedLastUser[LastUserOfAP].erase(AP); +LastUserOfAP = P; +InversedLastUser[P].insert(AP); if (P == AP) continue; @@ -598,13 +603,13 @@ PMTopLevelManager::setLastUser(ArrayRef<Pass *> AnalysisPasses, Pass *P) { if (P->getResolver()) setLastUser(LastPMUses, P->getResolver()->getPMDataManager().getAsPass()); - // If AP is the last user of other passes then make P last user of // such passes. -for (auto &LU : LastUser) { - if (LU.second == AP) -LU.second = P; -} +auto &LastUsedByAP = InversedLastUser[AP]; +for (Pass *L : LastUsedByAP) + LastUser[L] = P; +InversedLastUser[P].insert(LastUsedByAP.begin(), LastUsedByAP.end()); +LastUsedByAP.clear(); } } @@ -850,11 +855,6 @@ void PMTopLevelManager::initializeAllAnalysisInfo() { // Initailize other pass managers for (PMDataManager *IPM : IndirectPassManagers) IPM->initializeAnalysisInfo(); - - for (auto LU : LastUser) { -SmallPtrSet<Pass *, 8> &L = InversedLastUser[LU.second]; -L.insert(LU.first); - } } /// Destructor @@ -1151,6 +1151,8 @@ Pass *PMDataManager::findAnalysisPass(AnalysisID AID, bool SearchParent) { // Print list of passes that are last used by P.
void PMDataManager::dumpLastUses(Pass *P, unsigned Offset) const{ + if (PassDebugging < Details) +return; SmallVector<Pass *, 12> LUses; ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
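The core of the change is keeping a map and its inverse in sync on every write, so the linear scan over LastUser inside setLastUser becomes a constant-time inverse lookup and the batch population in initializeAllAnalysisInfo is no longer needed. A standalone sketch of the pattern with hypothetical names, using plain standard containers rather than the LLVM ADTs:

  #include <map>
  #include <set>

  struct TwoWayLastUser {
    std::map<int, int> LastUser;          // analysis -> its last user
    std::map<int, std::set<int>> Inverse; // user -> analyses it last uses

    void setLastUser(int AP, int P) {
      auto It = LastUser.find(AP);
      if (It != LastUser.end())
        Inverse[It->second].erase(AP); // drop the stale inverse entry
      LastUser[AP] = P;
      Inverse[P].insert(AP);           // both maps updated together
    }
  };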
[llvm-branch-commits] [llvm] 4e6054a - [AMDGPU] Split out new helper function macToMad in SIFoldOperands. NFC.
Author: Jay Foad Date: 2021-01-05T11:54:48Z New Revision: 4e6054a86c0cb0697913007c99b59f3f65c9d04b URL: https://github.com/llvm/llvm-project/commit/4e6054a86c0cb0697913007c99b59f3f65c9d04b DIFF: https://github.com/llvm/llvm-project/commit/4e6054a86c0cb0697913007c99b59f3f65c9d04b.diff LOG: [AMDGPU] Split out new helper function macToMad in SIFoldOperands. NFC. Differential Revision: https://reviews.llvm.org/D94009 Added: Modified: llvm/lib/Target/AMDGPU/SIFoldOperands.cpp Removed: diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp index d86527df5c3c..6dc01c3d3c21 100644 --- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp +++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp @@ -129,6 +129,21 @@ char SIFoldOperands::ID = 0; char &llvm::SIFoldOperandsID = SIFoldOperands::ID; +// Map multiply-accumulate opcode to corresponding multiply-add opcode if any. +static unsigned macToMad(unsigned Opc) { + switch (Opc) { + case AMDGPU::V_MAC_F32_e64: +return AMDGPU::V_MAD_F32; + case AMDGPU::V_MAC_F16_e64: +return AMDGPU::V_MAD_F16; + case AMDGPU::V_FMAC_F32_e64: +return AMDGPU::V_FMA_F32; + case AMDGPU::V_FMAC_F16_e64: +return AMDGPU::V_FMA_F16_gfx9; + } + return AMDGPU::INSTRUCTION_LIST_END; +} + // Wrapper around isInlineConstant that understands special cases when // instruction types are replaced during operand folding. static bool isInlineConstantIfFolded(const SIInstrInfo *TII, @@ -139,31 +154,18 @@ static bool isInlineConstantIfFolded(const SIInstrInfo *TII, return true; unsigned Opc = UseMI.getOpcode(); - switch (Opc) { - case AMDGPU::V_MAC_F32_e64: - case AMDGPU::V_MAC_F16_e64: - case AMDGPU::V_FMAC_F32_e64: - case AMDGPU::V_FMAC_F16_e64: { + unsigned NewOpc = macToMad(Opc); + if (NewOpc != AMDGPU::INSTRUCTION_LIST_END) { // Special case for mac. Since this is replaced with mad when folded into // src2, we need to check the legality for the final instruction. int Src2Idx = AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::src2); if (static_cast(OpNo) == Src2Idx) { - bool IsFMA = Opc == AMDGPU::V_FMAC_F32_e64 || - Opc == AMDGPU::V_FMAC_F16_e64; - bool IsF32 = Opc == AMDGPU::V_MAC_F32_e64 || - Opc == AMDGPU::V_FMAC_F32_e64; - - unsigned Opc = IsFMA ? -(IsF32 ? AMDGPU::V_FMA_F32 : AMDGPU::V_FMA_F16_gfx9) : -(IsF32 ? AMDGPU::V_MAD_F32 : AMDGPU::V_MAD_F16); - const MCInstrDesc &MadDesc = TII->get(Opc); + const MCInstrDesc &MadDesc = TII->get(NewOpc); return TII->isInlineConstant(OpToFold, MadDesc.OpInfo[OpNo].OperandType); } -return false; - } - default: -return false; } + + return false; } // TODO: Add heuristic that the frame index might not fit in the addressing mode @@ -346,17 +348,8 @@ static bool tryAddToFoldList(SmallVectorImpl &FoldList, if (!TII->isOperandLegal(*MI, OpNo, OpToFold)) { // Special case for v_mac_{f16, f32}_e64 if we are trying to fold into src2 unsigned Opc = MI->getOpcode(); -if ((Opc == AMDGPU::V_MAC_F32_e64 || Opc == AMDGPU::V_MAC_F16_e64 || - Opc == AMDGPU::V_FMAC_F32_e64 || Opc == AMDGPU::V_FMAC_F16_e64) && -(int)OpNo == AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::src2)) { - bool IsFMA = Opc == AMDGPU::V_FMAC_F32_e64 || - Opc == AMDGPU::V_FMAC_F16_e64; - bool IsF32 = Opc == AMDGPU::V_MAC_F32_e64 || - Opc == AMDGPU::V_FMAC_F32_e64; - unsigned NewOpc = IsFMA ? -(IsF32 ? AMDGPU::V_FMA_F32 : AMDGPU::V_FMA_F16_gfx9) : -(IsF32 ? 
AMDGPU::V_MAD_F32 : AMDGPU::V_MAD_F16); - +unsigned NewOpc = macToMad(Opc); +if (NewOpc != AMDGPU::INSTRUCTION_LIST_END) { // Check if changing this to a v_mad_{f16, f32} instruction will allow us // to fold the operand. MI->setDesc(TII->get(NewOpc)); ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits