[llvm-branch-commits] [llvm] release/20.x: [SLP] Check for PHI nodes (potentially cycles!) when checking dependencies (PR #127294)
https://github.com/nikic commented: Looks like there is a test failure in Transforms/SLPVectorizer/X86/perfect-matched-reused-bv.ll. https://github.com/llvm/llvm-project/pull/127294 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] release/20.x: [RISCV] [MachineOutliner] Analyze all candidates (#127659) (PR #128146)
https://github.com/wangpc-pp approved this pull request. LGTM. https://github.com/llvm/llvm-project/pull/128146
[llvm-branch-commits] [clang] [clang] Fix preprocessor output from #embed (#126742) (PR #127222)
https://github.com/Fznamznon updated https://github.com/llvm/llvm-project/pull/127222 >From 95cf7310c15324f25e9e5276772278fa58ba6926 Mon Sep 17 00:00:00 2001 From: Mariya Podchishchaeva Date: Thu, 13 Feb 2025 10:59:21 +0100 Subject: [PATCH] [clang] Fix preprocessor output from #embed (#126742) When bytes with negative signed char values appear in the data, make sure to use raw bytes from the data string when preprocessing, not char values. Fixes https://github.com/llvm/llvm-project/issues/102798 --- clang/docs/ReleaseNotes.rst| 2 ++ clang/lib/Frontend/PrintPreprocessedOutput.cpp | 5 ++--- clang/test/Preprocessor/embed_preprocess_to_file.c | 8 3 files changed, 12 insertions(+), 3 deletions(-) diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst index ad1a5e7ae282e..08f8491e2928d 100644 --- a/clang/docs/ReleaseNotes.rst +++ b/clang/docs/ReleaseNotes.rst @@ -897,6 +897,8 @@ Bug Fixes in This Version - No longer return ``false`` for ``noexcept`` expressions involving a ``delete`` which resolves to a destroying delete but the type of the object being deleted has a potentially throwing destructor (#GH118660). +- Clang now outputs correct values when #embed data contains bytes with negative + signed char values (#GH102798). Bug Fixes to Compiler Builtins ^^ diff --git a/clang/lib/Frontend/PrintPreprocessedOutput.cpp b/clang/lib/Frontend/PrintPreprocessedOutput.cpp index 1005825441b3e..2ae355fb33885 100644 --- a/clang/lib/Frontend/PrintPreprocessedOutput.cpp +++ b/clang/lib/Frontend/PrintPreprocessedOutput.cpp @@ -974,11 +974,10 @@ static void PrintPreprocessedTokens(Preprocessor &PP, Token &Tok, // Loop over the contents and print them as a comma-delimited list of // values. 
bool PrintComma = false; - for (auto Iter = Data->BinaryData.begin(), End = Data->BinaryData.end(); - Iter != End; ++Iter) { + for (unsigned char Byte : Data->BinaryData.bytes()) { if (PrintComma) *Callbacks->OS << ", "; -*Callbacks->OS << static_cast(*Iter); +*Callbacks->OS << static_cast(Byte); PrintComma = true; } } else if (Tok.isAnnotation()) { diff --git a/clang/test/Preprocessor/embed_preprocess_to_file.c b/clang/test/Preprocessor/embed_preprocess_to_file.c index 9895d958cf96d..b3c99d36f784a 100644 --- a/clang/test/Preprocessor/embed_preprocess_to_file.c +++ b/clang/test/Preprocessor/embed_preprocess_to_file.c @@ -37,3 +37,11 @@ const char even_more[] = { // DIRECTIVE-NEXT: #embed prefix(4, 5,) suffix(, 6, 7) /* clang -E -dE */ // DIRECTIVE-NEXT: , 8, 9, 10 // DIRECTIVE-NEXT: }; + +constexpr char big_one[] = { +#embed +}; + +// EXPANDED: constexpr char big_one[] = {255 +// DIRECTIVE: constexpr char big_one[] = { +// DIRECTIVE-NEXT: #embed ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
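The patch above fixes the printed values by iterating the embedded data as raw unsigned bytes rather than (possibly signed) chars. A minimal standalone sketch of the idea, using a plain std::string as a hypothetical stand-in for Data->BinaryData (this is illustrative, not the actual clang code):

```cpp
#include <sstream>
#include <string>

// Print a comma-delimited list of byte values, as the preprocessed
// #embed output does. Iterating as unsigned char guarantees that bytes
// >= 0x80 print as 128..255, never as negative signed char values.
std::string printBytes(const std::string &Data) {
  std::ostringstream OS;
  bool PrintComma = false;
  for (unsigned char Byte : Data) {
    if (PrintComma)
      OS << ", ";
    OS << static_cast<int>(Byte);
    PrintComma = true;
  }
  return OS.str();
}
```

With the pre-fix signed iteration, a 0xFF byte in the embedded data would have printed as -1 instead of 255, which is exactly the bug in #GH102798.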
[llvm-branch-commits] [clang] [clang] Fix preprocessor output from #embed (#126742) (PR #127222)
https://github.com/AaronBallman approved this pull request. LGTM! https://github.com/llvm/llvm-project/pull/127222
[llvm-branch-commits] [clang] [libc] release/20.x: [Clang] Fix cross-lane scan when given divergent lanes (#127703) (PR #128085)
https://github.com/shiltian approved this pull request. https://github.com/llvm/llvm-project/pull/128085
[llvm-branch-commits] [llvm] release/20.x: AMDGPU: Widen f16 minimum/maximum to v2f16 on gfx950 (#128121) (PR #128132)
https://github.com/shiltian approved this pull request. https://github.com/llvm/llvm-project/pull/128132
[llvm-branch-commits] [clang] release/20.x: [clang][CodeGen] `sret` args should always point to the `alloca` AS, so use that (#114062) (PR #127552)
tstellar wrote: @arsenm What do you think? https://github.com/llvm/llvm-project/pull/127552
[llvm-branch-commits] [clang] release/20.x: [CUDA] Add support for sm101 and sm120 target architectures (#127187) (PR #127918)
jodelek wrote: @Artem-B, just curious - is there anything additional that needs to happen before you can approve this? https://github.com/llvm/llvm-project/pull/127918
[llvm-branch-commits] [llvm] release/20.x: [RISCV] [MachineOutliner] Analyze all candidates (#127659) (PR #128146)
github-actions[bot] wrote: @svs-quic (or anyone else): If you would like to add a note about this fix in the release notes (completely optional), please reply to this comment with a one- or two-sentence description of the fix. When you are done, please add the release:note label to this PR. https://github.com/llvm/llvm-project/pull/128146
[llvm-branch-commits] [clang] release/20.x: [CUDA] Add support for sm101 and sm120 target architectures (#127187) (PR #127918)
https://github.com/Artem-B approved this pull request. I was the one proposing to merge this change, so I assumed that it's the release maintainers who'd need to stamp it. I am all for merging it. https://github.com/llvm/llvm-project/pull/127918
[llvm-branch-commits] [clang] release/20.x: [CSKY] Default to unsigned char (PR #126436)
https://github.com/llvmbot updated https://github.com/llvm/llvm-project/pull/126436 >From 77195a5edb332947a991a1f0c4e915f5f1d9411f Mon Sep 17 00:00:00 2001 From: Alexander Richardson Date: Sun, 9 Feb 2025 12:18:52 -0800 Subject: [PATCH] [CSKY] Default to unsigned char This matches the ABI document found at https://github.com/c-sky/csky-doc/blob/master/C-SKY_V2_CPU_Applications_Binary_Interface_Standards_Manual.pdf Partially addresses https://github.com/llvm/llvm-project/issues/115957 Reviewed By: zixuan-wu Pull Request: https://github.com/llvm/llvm-project/pull/115961 (cherry picked from commit d2047242e6d0f0deb7634ff22ab164354c520c79) --- clang/lib/Driver/ToolChains/Clang.cpp | 1 + clang/test/Driver/csky-toolchain.c| 1 + 2 files changed, 2 insertions(+) diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp index ec5ee29ece434..57b7d2bd46989 100644 --- a/clang/lib/Driver/ToolChains/Clang.cpp +++ b/clang/lib/Driver/ToolChains/Clang.cpp @@ -1358,6 +1358,7 @@ static bool isSignedCharDefault(const llvm::Triple &Triple) { return true; return false; + case llvm::Triple::csky: case llvm::Triple::hexagon: case llvm::Triple::msp430: case llvm::Triple::ppcle: diff --git a/clang/test/Driver/csky-toolchain.c b/clang/test/Driver/csky-toolchain.c index 66485464652ac..638ce64ec98cd 100644 --- a/clang/test/Driver/csky-toolchain.c +++ b/clang/test/Driver/csky-toolchain.c @@ -3,6 +3,7 @@ // RUN: %clang -### %s --target=csky 2>&1 | FileCheck -check-prefix=CC1 %s // CC1: "-cc1" "-triple" "csky" +// CC1: "-fno-signed-char" // In the below tests, --rtlib=platform is used so that the driver ignores // the configure-time CLANG_DEFAULT_RTLIB option when choosing the runtime lib ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
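The behavioral difference this default controls can be shown without a CSKY toolchain by spelling the signedness out explicitly. A small target-independent sketch (illustrative only, not CSKY-specific code): the same bit pattern reads differently through signed char and unsigned char, and CSKY's ABI wants the unsigned reading, hence the new -fno-signed-char default.

```cpp
#include <cstdint>

// The bit pattern 0x80 is -128 when plain char is signed and 128 when
// it is unsigned. A target's default for plain char decides which of
// these two readings ordinary `char` arithmetic and comparisons get.
int asSignedChar(uint8_t Bits) { return static_cast<signed char>(Bits); }
int asUnsignedChar(uint8_t Bits) { return static_cast<unsigned char>(Bits); }
```

Code that compares plain `char` values against 0, or uses them as array indices, can therefore change behavior when a target flips this default, which is why it is an ABI-document matter rather than a style choice.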
[llvm-branch-commits] [llvm] 1504fc5 - AMDGPU: Stop emitting an error on illegal addrspacecasts (#127487) (#127751)
Author: Matt Arsenault Date: 2025-02-21T09:35:52-08:00 New Revision: 1504fc57d88d5d700d5f8053ebc46b33e8bb12bf URL: https://github.com/llvm/llvm-project/commit/1504fc57d88d5d700d5f8053ebc46b33e8bb12bf DIFF: https://github.com/llvm/llvm-project/commit/1504fc57d88d5d700d5f8053ebc46b33e8bb12bf.diff LOG: AMDGPU: Stop emitting an error on illegal addrspacecasts (#127487) (#127751) These cannot be static compile errors, and should be treated as poison. Invalid casts may be introduced which are dynamically dead. For example: ``` void foo(volatile generic int* x) { __builtin_assume(is_shared(x)); *x = 4; } void bar() { private int y; foo(&y); // violation, wrong address space } ``` This could produce a compile time backend error or not depending on the optimization level. Similarly, the new test demonstrates a failure on a lowered atomicrmw which required inserting runtime address space checks. The invalid cases are dynamically dead, we should not error, and the AtomicExpand pass shouldn't have to consider the details of the incoming pointer to produce valid IR. This should go to the release branch. This fixes broken -O0 compiles with 64-bit atomics which would have started failing in 1d03708. 
(cherry picked from commit 18ea6c9) Added: Modified: llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp llvm/lib/Target/AMDGPU/SIISelLowering.cpp llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll llvm/test/CodeGen/AMDGPU/invalid-addrspacecast.ll Removed: diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp index e9e47eaadd557..e84f0f5fa615a 100644 --- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp +++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp @@ -2426,11 +2426,8 @@ bool AMDGPULegalizerInfo::legalizeAddrSpaceCast( return true; } - DiagnosticInfoUnsupported InvalidAddrSpaceCast( - MF.getFunction(), "invalid addrspacecast", B.getDebugLoc()); - - LLVMContext &Ctx = MF.getFunction().getContext(); - Ctx.diagnose(InvalidAddrSpaceCast); + // Invalid casts are poison. + // TODO: Should return poison B.buildUndef(Dst); MI.eraseFromParent(); return true; diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp index b632c50dae0e3..e09df53995d61 100644 --- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp +++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp @@ -7340,11 +7340,8 @@ SDValue SITargetLowering::lowerADDRSPACECAST(SDValue Op, // global <-> flat are no-ops and never emitted. - const MachineFunction &MF = DAG.getMachineFunction(); - DiagnosticInfoUnsupported InvalidAddrSpaceCast( - MF.getFunction(), "invalid addrspacecast", SL.getDebugLoc()); - DAG.getContext()->diagnose(InvalidAddrSpaceCast); - + // Invalid casts are poison. 
+ // TODO: Should return poison return DAG.getUNDEF(Op->getValueType(0)); } diff --git a/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll b/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll index f5c9b1a79b476..9b446896db590 100644 --- a/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll +++ b/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll @@ -444,6 +444,761 @@ define float @no_unsafe(ptr %addr, float %val) { ret float %res } +@global = hidden addrspace(1) global i64 0, align 8 + +; Make sure there is no error on an invalid addrspacecast without optimizations +define i64 @optnone_atomicrmw_add_i64_expand(i64 %val) #1 { +; GFX908-LABEL: optnone_atomicrmw_add_i64_expand: +; GFX908: ; %bb.0: +; GFX908-NEXT:s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GFX908-NEXT:s_mov_b64 s[4:5], src_private_base +; GFX908-NEXT:s_mov_b32 s6, 32 +; GFX908-NEXT:s_lshr_b64 s[4:5], s[4:5], s6 +; GFX908-NEXT:s_getpc_b64 s[6:7] +; GFX908-NEXT:s_add_u32 s6, s6, global@rel32@lo+4 +; GFX908-NEXT:s_addc_u32 s7, s7, global@rel32@hi+12 +; GFX908-NEXT:s_cmp_eq_u32 s7, s4 +; GFX908-NEXT:s_cselect_b64 s[4:5], -1, 0 +; GFX908-NEXT:v_cndmask_b32_e64 v2, 0, 1, s[4:5] +; GFX908-NEXT:s_mov_b64 s[4:5], -1 +; GFX908-NEXT:s_mov_b32 s6, 1 +; GFX908-NEXT:v_cmp_ne_u32_e64 s[6:7], v2, s6 +; GFX908-NEXT:s_and_b64 vcc, exec, s[6:7] +; GFX908-NEXT:; implicit-def: $vgpr3_vgpr4 +; GFX908-NEXT:s_cbranch_vccnz .LBB4_3 +; GFX908-NEXT: .LBB4_1: ; %Flow +; GFX908-NEXT:v_cndmask_b32_e64 v2, 0, 1, s[4:5] +; GFX908-NEXT:s_mov_b32 s4, 1 +; GFX908-NEXT:v_cmp_ne_u32_e64 s[4:5], v2, s4 +; GFX908-NEXT:s_and_b64 vcc, exec, s[4:5] +; GFX908-NEXT:s_cbranch_vccnz .LBB4_4 +; GFX908-NEXT: ; %bb.2: ; %atomicrmw.private +; GFX908-NEXT:s_waitcnt lgkmcnt(0) +; GFX908-NEXT:buffer_load_dword v3, v0, s[0:3], 0 offen +; GFX908-NEXT:s_waitcnt vmcnt(0) +; GFX908-NEXT:v_mov_b32_e32 v4, v3 +; GFX908-NEXT:v_add_co_u32_e64 v0, s[4:5], v3, v0 +; GFX908-NEXT:v_addc_co_u32_e64 v1, s[4:5], v4, v1, s[4
[llvm-branch-commits] [clang] release/20.x: [CSKY] Default to unsigned char (PR #126436)
arichardson wrote: > @arichardson Would you be able to create a follow-up PR with a release note entry? Sure, will do this. https://github.com/llvm/llvm-project/pull/126436
[llvm-branch-commits] [clang] deb63e7 - [clang] Track function template instantiation from definition (#125266) (#127777)
Author: Matheus Izvekov Date: 2025-02-21T10:49:10-08:00 New Revision: deb63e72d6c9ed98a2fbf4f8249ca6911bd189b8 URL: https://github.com/llvm/llvm-project/commit/deb63e72d6c9ed98a2fbf4f8249ca6911bd189b8 DIFF: https://github.com/llvm/llvm-project/commit/deb63e72d6c9ed98a2fbf4f8249ca6911bd189b8.diff LOG: [clang] Track function template instantiation from definition (#125266) (#12) This fixes instantiation of definition for friend function templates, when the declaration found and the one containing the definition have different template contexts. In these cases, the function declaration corresponding to the definition is not available; it may not even be instantiated at all. So this patch adds a bit which tracks which function template declaration was instantiated from the member template. It's used to find which primary template serves as a context for the purpose of obtaining the template arguments needed to instantiate the definition. Fixes #55509 Added: clang/test/SemaTemplate/GH55509.cpp Modified: clang/docs/ReleaseNotes.rst clang/include/clang/AST/Decl.h clang/include/clang/AST/DeclBase.h clang/include/clang/AST/DeclTemplate.h clang/lib/AST/Decl.cpp clang/lib/Sema/SemaTemplateDeduction.cpp clang/lib/Sema/SemaTemplateInstantiate.cpp clang/lib/Sema/SemaTemplateInstantiateDecl.cpp clang/lib/Serialization/ASTReaderDecl.cpp clang/lib/Serialization/ASTWriterDecl.cpp Removed: diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst index e716efa46a1f0..a2518042cb5b0 100644 --- a/clang/docs/ReleaseNotes.rst +++ b/clang/docs/ReleaseNotes.rst @@ -1057,6 +1057,7 @@ Bug Fixes to C++ Support - Fix that some dependent immediate expressions did not cause immediate escalation (#GH119046) - Fixed a substitution bug in transforming CTAD aliases when the type alias contains a non-pack template argument corresponding to a pack parameter (#GH124715) +- Clang is now better at keeping track of friend function template instance contexts.
(#GH55509) Bug Fixes to AST Handling ^ diff --git a/clang/include/clang/AST/Decl.h b/clang/include/clang/AST/Decl.h index 9593bab576412..362a2741a0cdd 100644 --- a/clang/include/clang/AST/Decl.h +++ b/clang/include/clang/AST/Decl.h @@ -2298,6 +2298,13 @@ class FunctionDecl : public DeclaratorDecl, FunctionDeclBits.IsLateTemplateParsed = ILT; } + bool isInstantiatedFromMemberTemplate() const { +return FunctionDeclBits.IsInstantiatedFromMemberTemplate; + } + void setInstantiatedFromMemberTemplate(bool Val = true) { +FunctionDeclBits.IsInstantiatedFromMemberTemplate = Val; + } + /// Whether this function is "trivial" in some specialized C++ senses. /// Can only be true for default constructors, copy constructors, /// copy assignment operators, and destructors. Not meaningful until diff --git a/clang/include/clang/AST/DeclBase.h b/clang/include/clang/AST/DeclBase.h index 3bb82c1572ef9..648dae2838e03 100644 --- a/clang/include/clang/AST/DeclBase.h +++ b/clang/include/clang/AST/DeclBase.h @@ -1780,6 +1780,8 @@ class DeclContext { uint64_t HasImplicitReturnZero : 1; LLVM_PREFERRED_TYPE(bool) uint64_t IsLateTemplateParsed : 1; +LLVM_PREFERRED_TYPE(bool) +uint64_t IsInstantiatedFromMemberTemplate : 1; /// Kind of contexpr specifier as defined by ConstexprSpecKind. LLVM_PREFERRED_TYPE(ConstexprSpecKind) @@ -1830,7 +1832,7 @@ class DeclContext { }; /// Number of inherited and non-inherited bits in FunctionDeclBitfields. - enum { NumFunctionDeclBits = NumDeclContextBits + 31 }; + enum { NumFunctionDeclBits = NumDeclContextBits + 32 }; /// Stores the bits used by CXXConstructorDecl. If modified /// NumCXXConstructorDeclBits and the accessor @@ -1841,12 +1843,12 @@ class DeclContext { LLVM_PREFERRED_TYPE(FunctionDeclBitfields) uint64_t : NumFunctionDeclBits; -/// 20 bits to fit in the remaining available space. +/// 19 bits to fit in the remaining available space. 
/// Note that this makes CXXConstructorDeclBitfields take /// exactly 64 bits and thus the width of NumCtorInitializers /// will need to be shrunk if some bit is added to NumDeclContextBitfields, /// NumFunctionDeclBitfields or CXXConstructorDeclBitfields. -uint64_t NumCtorInitializers : 17; +uint64_t NumCtorInitializers : 16; LLVM_PREFERRED_TYPE(bool) uint64_t IsInheritingConstructor : 1; @@ -1860,7 +1862,7 @@ class DeclContext { }; /// Number of inherited and non-inherited bits in CXXConstructorDeclBitfields. - enum { NumCXXConstructorDeclBits = NumFunctionDeclBits + 20 }; + enum { NumCXXConstructorDeclBits = NumFunctionDeclBits + 19 }; /// Stores the bits used by ObjCMethodDecl. /// If modified NumObjCMethodDeclBits and the accessor diff --git a/clang/include/clang/AST/Dec
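The NumCtorInitializers change in the hunk above is a bit-budget trade: the constructor bitfield word must stay exactly 64 bits, so the new one-bit IsInstantiatedFromMemberTemplate flag added to FunctionDeclBitfields is paid for by shrinking the counter from 17 to 16 bits. A simplified sketch of the constraint (field names and widths are illustrative stand-ins, not the real clang layout):

```cpp
#include <cstdint>

// 45 stand-in inherited bits + a 16-bit counter + 3 one-bit flags fill
// the 64-bit word exactly. Widening any field, or adding another flag,
// would force NumCtorInitializers to shrink further (16 -> 15), which
// is the trade-off the patch comment describes.
struct ConstructorBits {
  uint64_t InheritedBits : 45;       // stand-in for inherited DeclContext/FunctionDecl bits
  uint64_t NumCtorInitializers : 16; // was 17 before the extra flag upstream
  uint64_t Flag0 : 1;
  uint64_t Flag1 : 1;
  uint64_t Flag2 : 1;
};
```

The practical cost of the patch is therefore that a constructor can now record at most 65535 initializers instead of 131071, which is still far beyond anything real code reaches.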
[llvm-branch-commits] [clang] Backport: [clang] Track function template instantiation from definition (#125266) (PR #127777)
https://github.com/tstellar closed https://github.com/llvm/llvm-project/pull/12
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
nhaehnle wrote: How about this comment from earlier: > Every Inst may potentially appear with many UseInsts in the temporal > divergence list. The current code will create multiple new registers and > multiple COPY instructions, which seems wasteful even if downstream passes > can often clean it up. > > I would suggest capturing the created register in a DenseMap<MachineInstr *, Register> for re-use. > > Also, how about inserting the COPY at the end of Inst->getParent()? That way, > the live range of the VGPR is reduced. ? https://github.com/llvm/llvm-project/pull/124298
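The reuse the reviewer asks for is a standard memoization pattern. A minimal sketch with simplified stand-in types (int in place of MachineInstr * and llvm::Register, std::unordered_map in place of llvm::DenseMap; none of this is the actual pass code):

```cpp
#include <unordered_map>

using Inst = int;     // stand-in for MachineInstr *
using Register = int; // stand-in for llvm::Register

struct CopyBuilder {
  std::unordered_map<Inst, Register> Cache; // DenseMap in real code
  Register NextReg = 100;
  int CopiesEmitted = 0;

  // Create the vreg + COPY for an Inst at most once; every later
  // (Inst, UseInst) pair from the temporal divergence list reuses it.
  Register getOrCreateCopy(Inst I) {
    auto [It, Inserted] = Cache.try_emplace(I, Register(0));
    if (Inserted) {
      It->second = NextReg++; // would also build the COPY instruction here
      ++CopiesEmitted;
    }
    return It->second;
  }
};
```

The point of the suggestion is visible in the counters: however many UseInsts reference the same Inst, only one copy is materialized per Inst.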
[llvm-branch-commits] [clang] release/20.x: [CMake][Release] Statically link clang with stage1 runtimes (#127268) (PR #127949)
https://github.com/llvmbot updated https://github.com/llvm/llvm-project/pull/127949 >From dc1bd6a8fa6a5f4fc38f7c3ce77c0ffcfcaa66e9 Mon Sep 17 00:00:00 2001 From: Tom Stellard Date: Wed, 19 Feb 2025 17:46:29 -0800 Subject: [PATCH] [CMake][Release] Statically link clang with stage1 runtimes (#127268) This change will cause clang and the other tools to statically link against the runtimes built in stage1. This will make the built binaries more portable by eliminating dependencies on system libraries like libgcc and libstdc++. (cherry picked from commit f5b311e47de044160aeb25221095898c35c4847f) --- clang/cmake/caches/Release.cmake | 27 +++ 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/clang/cmake/caches/Release.cmake b/clang/cmake/caches/Release.cmake index 23e99493087ff..a1c68fc51dbd0 100644 --- a/clang/cmake/caches/Release.cmake +++ b/clang/cmake/caches/Release.cmake @@ -48,10 +48,8 @@ set(CLANG_ENABLE_BOOTSTRAP ON CACHE BOOL "") set(STAGE1_PROJECTS "clang") -# Building Flang on Windows requires compiler-rt, so we need to build it in -# stage1. compiler-rt is also required for building the Flang tests on -# macOS. -set(STAGE1_RUNTIMES "compiler-rt") +# Build all runtimes so we can statically link them into the stage2 compiler. +set(STAGE1_RUNTIMES "compiler-rt;libcxx;libcxxabi;libunwind") if (LLVM_RELEASE_ENABLE_PGO) list(APPEND STAGE1_PROJECTS "lld") @@ -90,9 +88,20 @@ else() set(CLANG_BOOTSTRAP_TARGETS ${LLVM_RELEASE_FINAL_STAGE_TARGETS} CACHE STRING "") endif() +if (LLVM_RELEASE_ENABLE_LTO) + # Enable LTO for the runtimes. We need to configure stage1 clang to default + # to using lld as the linker because the stage1 toolchain will be used to + # build and link the runtimes. + # FIXME: We can't use LLVM_ENABLE_LTO=Thin here, because it causes the CMake + # step for the libcxx build to fail. CMAKE_INTERPROCEDURAL_OPTIMIZATION does + # enable ThinLTO, though. 
+ set(RUNTIMES_CMAKE_ARGS "-DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON -DLLVM_ENABLE_LLD=ON" CACHE STRING "") +endif() + # Stage 1 Common Config set(LLVM_ENABLE_RUNTIMES ${STAGE1_RUNTIMES} CACHE STRING "") set(LLVM_ENABLE_PROJECTS ${STAGE1_PROJECTS} CACHE STRING "") +set(LIBCXX_STATICALLY_LINK_ABI_IN_STATIC_LIBRARY ON CACHE STRING "") # stage2-instrumented and Final Stage Config: # Options that need to be set in both the instrumented stage (if we are doing @@ -102,6 +111,16 @@ set_instrument_and_final_stage_var(LLVM_ENABLE_LTO "${LLVM_RELEASE_ENABLE_LTO}" if (LLVM_RELEASE_ENABLE_LTO) set_instrument_and_final_stage_var(LLVM_ENABLE_LLD "ON" BOOL) endif() +set_instrument_and_final_stage_var(LLVM_ENABLE_LIBCXX "ON" BOOL) +set_instrument_and_final_stage_var(LLVM_STATIC_LINK_CXX_STDLIB "ON" BOOL) +set(RELEASE_LINKER_FLAGS "-rtlib=compiler-rt --unwindlib=libunwind") +if(NOT ${CMAKE_HOST_SYSTEM_NAME} MATCHES "Darwin") + set(RELEASE_LINKER_FLAGS "${RELEASE_LINKER_FLAGS} -static-libgcc") +endif() + +set_instrument_and_final_stage_var(CMAKE_EXE_LINKER_FLAGS ${RELEASE_LINKER_FLAGS} STRING) +set_instrument_and_final_stage_var(CMAKE_SHARED_LINKER_FLAGS ${RELEASE_LINKER_FLAGS} STRING) +set_instrument_and_final_stage_var(CMAKE_MODULE_LINKER_FLAGS ${RELEASE_LINKER_FLAGS} STRING) # Final Stage Config (stage2) set_final_stage_var(LLVM_ENABLE_RUNTIMES "${LLVM_RELEASE_ENABLE_RUNTIMES}" STRING) ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] 3076a68 - [RISCV] [MachineOutliner] Analyze all candidates (#127659)
Author: Sudharsan Veeravalli Date: 2025-02-21T10:56:29-08:00 New Revision: 3076a68f69aac3f87195eec12f38908a499263cb URL: https://github.com/llvm/llvm-project/commit/3076a68f69aac3f87195eec12f38908a499263cb DIFF: https://github.com/llvm/llvm-project/commit/3076a68f69aac3f87195eec12f38908a499263cb.diff LOG: [RISCV] [MachineOutliner] Analyze all candidates (#127659) #117700 made a change from analyzing all the candidates to analyzing just the first candidate before deciding to either delete or keep all of them. Even though the candidates all have the same instructions, the basic blocks in which they are present are different and we will need to check each of them before deciding whether to keep or erase them. Particularly, `isAvailableAcrossAndOutOfSeq` checks to see if the register (x5 in this case) is available from the end of the MBB to the beginning of the candidate and not checking this for each candidate led to incorrect candidates being outlined resulting in correctness issues in a few downstream benchmarks. Similarly, deleting all the candidates if the first one is not viable will result in missed outlining opportunities. (cherry picked from commit 6757cf4e6f1c7767d605e579930a24758c0778dc) Added: llvm/test/CodeGen/RISCV/machine-outliner-call-x5-liveout.mir Modified: llvm/lib/Target/RISCV/RISCVInstrInfo.cpp Removed: diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp index 12a7af0750813..87f1f35835cbe 100644 --- a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp +++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp @@ -3010,30 +3010,25 @@ static bool cannotInsertTailCall(const MachineBasicBlock &MBB) { return false; } -static std::optional -analyzeCandidate(outliner::Candidate &C) { +static bool analyzeCandidate(outliner::Candidate &C) { // If last instruction is return then we can rely on // the verification already performed in the getOutliningTypeImpl. 
if (C.back().isReturn()) { assert(!cannotInsertTailCall(*C.getMBB()) && "The candidate who uses return instruction must be outlined " "using tail call"); -return MachineOutlinerTailCall; +return false; } - auto CandidateUsesX5 = [](outliner::Candidate &C) { -const TargetRegisterInfo *TRI = C.getMF()->getSubtarget().getRegisterInfo(); -if (std::any_of(C.begin(), C.end(), [TRI](const MachineInstr &MI) { - return isMIModifiesReg(MI, TRI, RISCV::X5); -})) - return true; -return !C.isAvailableAcrossAndOutOfSeq(RISCV::X5, *TRI); - }; - - if (!CandidateUsesX5(C)) -return MachineOutlinerDefault; + // Filter out candidates where the X5 register (t0) can't be used to setup + // the function call. + const TargetRegisterInfo *TRI = C.getMF()->getSubtarget().getRegisterInfo(); + if (std::any_of(C.begin(), C.end(), [TRI](const MachineInstr &MI) { +return isMIModifiesReg(MI, TRI, RISCV::X5); + })) +return true; - return std::nullopt; + return !C.isAvailableAcrossAndOutOfSeq(RISCV::X5, *TRI); } std::optional> @@ -3042,35 +3037,32 @@ RISCVInstrInfo::getOutliningCandidateInfo( std::vector &RepeatedSequenceLocs, unsigned MinRepeats) const { - // Each RepeatedSequenceLoc is identical. - outliner::Candidate &Candidate = RepeatedSequenceLocs[0]; - auto CandidateInfo = analyzeCandidate(Candidate); - if (!CandidateInfo) -RepeatedSequenceLocs.clear(); + // Analyze each candidate and erase the ones that are not viable. + llvm::erase_if(RepeatedSequenceLocs, analyzeCandidate); // If the sequence doesn't have enough candidates left, then we're done. if (RepeatedSequenceLocs.size() < MinRepeats) return std::nullopt; + // Each RepeatedSequenceLoc is identical. + outliner::Candidate &Candidate = RepeatedSequenceLocs[0]; unsigned InstrSizeCExt = Candidate.getMF()->getSubtarget().hasStdExtCOrZca() ? 2 : 4; unsigned CallOverhead = 0, FrameOverhead = 0; - MachineOutlinerConstructionID MOCI = CandidateInfo.value(); - switch (MOCI) { - case MachineOutlinerDefault: -// call t0, function = 8 bytes. 
-CallOverhead = 8; -// jr t0 = 4 bytes, 2 bytes if compressed instructions are enabled. -FrameOverhead = InstrSizeCExt; -break; - case MachineOutlinerTailCall: + MachineOutlinerConstructionID MOCI = MachineOutlinerDefault; + if (Candidate.back().isReturn()) { +MOCI = MachineOutlinerTailCall; // tail call = auipc + jalr in the worst case without linker relaxation. CallOverhead = 4 + InstrSizeCExt; // Using tail call we move ret instruction from caller to callee. FrameOverhead = 0; -break; + } else { +// call t0, function = 8 bytes. +CallOverhead = 8; +// jr t0 = 4 bytes, 2 bytes if compressed instructions are enabled. +FrameOverhead
[llvm-branch-commits] [llvm] release/20.x: [RISCV] [MachineOutliner] Analyze all candidates (#127659) (PR #128146)
https://github.com/tstellar closed https://github.com/llvm/llvm-project/pull/128146
[llvm-branch-commits] [llvm] release/20.x: [RISCV] [MachineOutliner] Analyze all candidates (#127659) (PR #128146)
https://github.com/llvmbot updated https://github.com/llvm/llvm-project/pull/128146 >From 3076a68f69aac3f87195eec12f38908a499263cb Mon Sep 17 00:00:00 2001 From: Sudharsan Veeravalli Date: Fri, 21 Feb 2025 12:53:13 +0530 Subject: [PATCH] [RISCV] [MachineOutliner] Analyze all candidates (#127659) #117700 made a change from analyzing all the candidates to analyzing just the first candidate before deciding to either delete or keep all of them. Even though the candidates all have the same instructions, the basic blocks in which they are present are different and we will need to check each of them before deciding whether to keep or erase them. Particularly, `isAvailableAcrossAndOutOfSeq` checks to see if the register (x5 in this case) is available from the end of the MBB to the beginning of the candidate and not checking this for each candidate led to incorrect candidates being outlined resulting in correctness issues in a few downstream benchmarks. Similarly, deleting all the candidates if the first one is not viable will result in missed outlining opportunities. (cherry picked from commit 6757cf4e6f1c7767d605e579930a24758c0778dc) --- llvm/lib/Target/RISCV/RISCVInstrInfo.cpp | 52 +++ .../machine-outliner-call-x5-liveout.mir | 136 ++ 2 files changed, 158 insertions(+), 30 deletions(-) create mode 100644 llvm/test/CodeGen/RISCV/machine-outliner-call-x5-liveout.mir diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp index 12a7af0750813..87f1f35835cbe 100644 --- a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp +++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp @@ -3010,30 +3010,25 @@ static bool cannotInsertTailCall(const MachineBasicBlock &MBB) { return false; } -static std::optional -analyzeCandidate(outliner::Candidate &C) { +static bool analyzeCandidate(outliner::Candidate &C) { // If last instruction is return then we can rely on // the verification already performed in the getOutliningTypeImpl. 
if (C.back().isReturn()) { assert(!cannotInsertTailCall(*C.getMBB()) && "The candidate who uses return instruction must be outlined " "using tail call"); -return MachineOutlinerTailCall; +return false; } - auto CandidateUsesX5 = [](outliner::Candidate &C) { -const TargetRegisterInfo *TRI = C.getMF()->getSubtarget().getRegisterInfo(); -if (std::any_of(C.begin(), C.end(), [TRI](const MachineInstr &MI) { - return isMIModifiesReg(MI, TRI, RISCV::X5); -})) - return true; -return !C.isAvailableAcrossAndOutOfSeq(RISCV::X5, *TRI); - }; - - if (!CandidateUsesX5(C)) -return MachineOutlinerDefault; + // Filter out candidates where the X5 register (t0) can't be used to setup + // the function call. + const TargetRegisterInfo *TRI = C.getMF()->getSubtarget().getRegisterInfo(); + if (std::any_of(C.begin(), C.end(), [TRI](const MachineInstr &MI) { +return isMIModifiesReg(MI, TRI, RISCV::X5); + })) +return true; - return std::nullopt; + return !C.isAvailableAcrossAndOutOfSeq(RISCV::X5, *TRI); } std::optional> @@ -3042,35 +3037,32 @@ RISCVInstrInfo::getOutliningCandidateInfo( std::vector &RepeatedSequenceLocs, unsigned MinRepeats) const { - // Each RepeatedSequenceLoc is identical. - outliner::Candidate &Candidate = RepeatedSequenceLocs[0]; - auto CandidateInfo = analyzeCandidate(Candidate); - if (!CandidateInfo) -RepeatedSequenceLocs.clear(); + // Analyze each candidate and erase the ones that are not viable. + llvm::erase_if(RepeatedSequenceLocs, analyzeCandidate); // If the sequence doesn't have enough candidates left, then we're done. if (RepeatedSequenceLocs.size() < MinRepeats) return std::nullopt; + // Each RepeatedSequenceLoc is identical. + outliner::Candidate &Candidate = RepeatedSequenceLocs[0]; unsigned InstrSizeCExt = Candidate.getMF()->getSubtarget().hasStdExtCOrZca() ? 2 : 4; unsigned CallOverhead = 0, FrameOverhead = 0; - MachineOutlinerConstructionID MOCI = CandidateInfo.value(); - switch (MOCI) { - case MachineOutlinerDefault: -// call t0, function = 8 bytes. 
-CallOverhead = 8; -// jr t0 = 4 bytes, 2 bytes if compressed instructions are enabled. -FrameOverhead = InstrSizeCExt; -break; - case MachineOutlinerTailCall: + MachineOutlinerConstructionID MOCI = MachineOutlinerDefault; + if (Candidate.back().isReturn()) { +MOCI = MachineOutlinerTailCall; // tail call = auipc + jalr in the worst case without linker relaxation. CallOverhead = 4 + InstrSizeCExt; // Using tail call we move ret instruction from caller to callee. FrameOverhead = 0; -break; + } else { +// call t0, function = 8 bytes. +CallOverhead = 8; +// jr t0 = 4 bytes, 2 bytes if compressed instructions are enabled. +FrameOverhead = InstrSizeCExt; } fo
[llvm-branch-commits] [clang] release/20.x: [CSKY] Default to unsigned char (PR #126436)
tstellar wrote: @arichardson Would you be able to create a follow-up PR with a release note entry? https://github.com/llvm/llvm-project/pull/126436 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] release/20.x: AMDGPU: Stop emitting an error on illegal addrspacecasts (PR #127751)
github-actions[bot] wrote: @arsenm (or anyone else). If you would like to add a note about this fix in the release notes (completely optional). Please reply to this comment with a one or two sentence description of the fix. When you are done, please add the release:note label to this PR. https://github.com/llvm/llvm-project/pull/127751
[llvm-branch-commits] [llvm] [AMDGPU][Attributor] Rework update of `AAAMDWavesPerEU` (PR #123995)
shiltian wrote: ping @arsenm @jdoerfert https://github.com/llvm/llvm-project/pull/123995
[llvm-branch-commits] [llvm] release/20.x: AMDGPU: Stop emitting an error on illegal addrspacecasts (PR #127751)
https://github.com/tstellar closed https://github.com/llvm/llvm-project/pull/127751
[llvm-branch-commits] [llvm] d51f233 - AMDGPU: Add some release 20 notes (#128136)
Author: Matt Arsenault Date: 2025-02-21T11:23:06-08:00 New Revision: d51f23377a77eace4ef006e0e6b23460ed05576c URL: https://github.com/llvm/llvm-project/commit/d51f23377a77eace4ef006e0e6b23460ed05576c DIFF: https://github.com/llvm/llvm-project/commit/d51f23377a77eace4ef006e0e6b23460ed05576c.diff LOG: AMDGPU: Add some release 20 notes (#128136) Added: Modified: llvm/docs/ReleaseNotes.md Removed: diff --git a/llvm/docs/ReleaseNotes.md b/llvm/docs/ReleaseNotes.md index c80aecfdea084..e654509792652 100644 --- a/llvm/docs/ReleaseNotes.md +++ b/llvm/docs/ReleaseNotes.md @@ -159,6 +159,17 @@ Changes to the AArch64 Backend Changes to the AMDGPU Backend - +* Initial support for gfx950 + +* Improved ``llvm.memcpy``, ``llvm.memmove`` and ``llvm.memset`` lowering + +* Fixed expansion of 64-bit flat address space ``atomicrmw`` and + ``cmpxchg`` operations which may access private + memory. `noalias.addrspace` metadat may be used to avoid the + expansion if the target address is known to not be on the stack. + +* Fix compile failures when emitting unreachable functions. + * Removed `llvm.amdgcn.flat.atomic.fadd` and `llvm.amdgcn.global.atomic.fadd` intrinsics. Users should use the {ref}`atomicrmw ` instruction with `fadd` and ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] AMDGPU: Add some release 20 notes (PR #128136)
https://github.com/tstellar closed https://github.com/llvm/llvm-project/pull/128136
[llvm-branch-commits] [llvm] AMDGPU: Add some release 20 notes (PR #128136)
github-actions[bot] wrote: @arsenm (or anyone else). If you would like to add a note about this fix in the release notes (completely optional). Please reply to this comment with a one or two sentence description of the fix. When you are done, please add the release:note label to this PR. https://github.com/llvm/llvm-project/pull/128136
[llvm-branch-commits] [clang] release/20.x: [CUDA] Add support for sm101 and sm120 target architectures (#127187) (PR #127918)
jodelek wrote: Do you know who is the person I should bother? https://github.com/llvm/llvm-project/pull/127918
[llvm-branch-commits] [clang] release/20.x: [CUDA] Add support for sm101 and sm120 target architectures (#127187) (PR #127918)
efriedma-quic wrote: The process is that the patch is first reviewed by someone familiar with the code. They approve the patch, and describe how the fix meets the release branch patch requirements (https://llvm.org/docs/HowToReleaseLLVM.html#release-patch-rules). Once it's approved, the release manager will look at the patch, and either merge or request changes. You don't need to specifically ping the release manager; they track all the pending pull requests. https://github.com/llvm/llvm-project/pull/127918
[llvm-branch-commits] [clang] release/20.x: [CUDA] Add support for sm101 and sm120 target architectures (#127187) (PR #127918)
Artem-B wrote: > patch is first reviewed by someone familiar with the code. That would be me, as I am the maintainer of CUDA code and had reviewed the original PR. > They approve the patch, and describe how the fix meets the release branch > patch requirements > (https://llvm.org/docs/HowToReleaseLLVM.html#release-patch-rules). This patch fits item #3 on the rule list "or completion of features that were started before the branch was created. " These changes allow clang users to compile CUDA code with just-released cuda-12.8 which adds these new GPU variants. https://github.com/llvm/llvm-project/pull/127918
[llvm-branch-commits] [llvm] release/20.x: Add Wasm, RISC-V, BPF, and NVPTX targets back to Windows release packaging (#127794) (PR #127982)
https://github.com/llvmbot updated https://github.com/llvm/llvm-project/pull/127982 >From b727a13fecc4e29b6f8499afd95626795c9f6a8e Mon Sep 17 00:00:00 2001 From: Hans Wennborg Date: Thu, 20 Feb 2025 11:02:33 +0100 Subject: [PATCH] Add Wasm, RISC-V, BPF, and NVPTX targets back to Windows release packaging (#127794) In #106059 we reduced the targets to those supported by Windows (X86 and ARM) to avoid running into size limitations of the NSIS compiler. Since then, people complained about the lack of Wasm [1], RISC-V [2], BPF [3], and NVPTX [4]. These do seem to fit in the installer (at least for 20.1.0-rc2), so let's add them back. [1] https://discourse.llvm.org/t/llvm-19-x-release-third-party-binaries/80374/26 [2] https://discourse.llvm.org/t/llvm-19-x-release-third-party-binaries/80374/53 [3] https://github.com/llvm/llvm-project/issues/127120 [4] https://github.com/llvm/llvm-project/pull/127794#issuecomment-2668677203 (cherry picked from commit 6e047a5ab42698165a4746ef681396fab1698327) --- llvm/utils/release/build_llvm_release.bat | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/llvm/utils/release/build_llvm_release.bat b/llvm/utils/release/build_llvm_release.bat index dd041d7d384ec..1c30673cf88bd 100755 --- a/llvm/utils/release/build_llvm_release.bat +++ b/llvm/utils/release/build_llvm_release.bat @@ -150,7 +150,7 @@ set common_cmake_flags=^ -DCMAKE_BUILD_TYPE=Release ^ -DLLVM_ENABLE_ASSERTIONS=OFF ^ -DLLVM_INSTALL_TOOLCHAIN_ONLY=ON ^ - -DLLVM_TARGETS_TO_BUILD="AArch64;ARM;X86" ^ + -DLLVM_TARGETS_TO_BUILD="AArch64;ARM;X86;BPF;WebAssembly;RISCV;NVPTX" ^ -DLLVM_BUILD_LLVM_C_DYLIB=ON ^ -DCMAKE_INSTALL_UCRT_LIBRARIES=ON ^ -DPython3_FIND_REGISTRY=NEVER ^ ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] release/20.x: Add Wasm, RISC-V, BPF, and NVPTX targets back to Windows release packaging (#127794) (PR #127982)
github-actions[bot] wrote: @zmodem (or anyone else). If you would like to add a note about this fix in the release notes (completely optional). Please reply to this comment with a one or two sentence description of the fix. When you are done, please add the release:note label to this PR. https://github.com/llvm/llvm-project/pull/127982
[llvm-branch-commits] [clang] release/20.x: [CSKY] Default to unsigned char (PR #126436)
arichardson wrote: > @arichardson (or anyone else). If you would like to add a note about this fix > in the release notes (completely optional). Please reply to this comment with > a one or two sentence description of the fix. When you are done, please add > the release:note label to this PR. Prior to Clang 20, the CSKY target used an incorrect ABI with `char` being signed instead of unsigned. Code that relies on the incorrect definition can restore the old behavior by passing `-fsigned-char`. https://github.com/llvm/llvm-project/pull/126436
[llvm-branch-commits] [libcxx] af9d7dd - [libc++] Fix stray usage of _LIBCPP_HAS_NO_WIDE_CHARACTERS on Windows
Author: Louis Dionne Date: 2025-02-21T14:08:41-08:00 New Revision: af9d7dda2125c2ab10758ce6b5a968fd56af5048 URL: https://github.com/llvm/llvm-project/commit/af9d7dda2125c2ab10758ce6b5a968fd56af5048 DIFF: https://github.com/llvm/llvm-project/commit/af9d7dda2125c2ab10758ce6b5a968fd56af5048.diff LOG: [libc++] Fix stray usage of _LIBCPP_HAS_NO_WIDE_CHARACTERS on Windows (cherry picked from commit bcfd9f81e1bc9954d616ffbb8625099916bebd5b) Added: Modified: libcxx/include/__locale_dir/support/windows.h Removed: diff --git a/libcxx/include/__locale_dir/support/windows.h b/libcxx/include/__locale_dir/support/windows.h index ff89d3e87eb44..f0f76c527264a 100644 --- a/libcxx/include/__locale_dir/support/windows.h +++ b/libcxx/include/__locale_dir/support/windows.h @@ -215,7 +215,7 @@ inline _LIBCPP_HIDE_FROM_ABI size_t __strxfrm(char* __dest, const char* __src, s return ::_strxfrm_l(__dest, __src, __n, __loc); } -#ifndef _LIBCPP_HAS_NO_WIDE_CHARACTERS +#if _LIBCPP_HAS_WIDE_CHARACTERS inline _LIBCPP_HIDE_FROM_ABI int __iswctype(wint_t __c, wctype_t __type, __locale_t __loc) { return ::_iswctype_l(__c, __type, __loc); } @@ -240,7 +240,7 @@ inline _LIBCPP_HIDE_FROM_ABI int __wcscoll(const wchar_t* __ws1, const wchar_t* inline _LIBCPP_HIDE_FROM_ABI size_t __wcsxfrm(wchar_t* __dest, const wchar_t* __src, size_t __n, __locale_t __loc) { return ::_wcsxfrm_l(__dest, __src, __n, __loc); } -#endif // !_LIBCPP_HAS_NO_WIDE_CHARACTERS +#endif // _LIBCPP_HAS_WIDE_CHARACTERS #if defined(__MINGW32__) && __MSVCRT_VERSION__ < 0x0800 _LIBCPP_EXPORTED_FROM_ABI size_t __strftime(char*, size_t, const char*, const struct tm*, __locale_t); ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [libcxx] release/20.x: [libc++] Reduce the dependency of the locale base API on the base system from the headers (#117764) (PR #128009)
https://github.com/llvmbot updated https://github.com/llvm/llvm-project/pull/128009 >From af9d7dda2125c2ab10758ce6b5a968fd56af5048 Mon Sep 17 00:00:00 2001 From: Louis Dionne Date: Wed, 5 Feb 2025 08:33:14 -0500 Subject: [PATCH 1/2] [libc++] Fix stray usage of _LIBCPP_HAS_NO_WIDE_CHARACTERS on Windows (cherry picked from commit bcfd9f81e1bc9954d616ffbb8625099916bebd5b) --- libcxx/include/__locale_dir/support/windows.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/libcxx/include/__locale_dir/support/windows.h b/libcxx/include/__locale_dir/support/windows.h index ff89d3e87eb44..f0f76c527264a 100644 --- a/libcxx/include/__locale_dir/support/windows.h +++ b/libcxx/include/__locale_dir/support/windows.h @@ -215,7 +215,7 @@ inline _LIBCPP_HIDE_FROM_ABI size_t __strxfrm(char* __dest, const char* __src, s return ::_strxfrm_l(__dest, __src, __n, __loc); } -#ifndef _LIBCPP_HAS_NO_WIDE_CHARACTERS +#if _LIBCPP_HAS_WIDE_CHARACTERS inline _LIBCPP_HIDE_FROM_ABI int __iswctype(wint_t __c, wctype_t __type, __locale_t __loc) { return ::_iswctype_l(__c, __type, __loc); } @@ -240,7 +240,7 @@ inline _LIBCPP_HIDE_FROM_ABI int __wcscoll(const wchar_t* __ws1, const wchar_t* inline _LIBCPP_HIDE_FROM_ABI size_t __wcsxfrm(wchar_t* __dest, const wchar_t* __src, size_t __n, __locale_t __loc) { return ::_wcsxfrm_l(__dest, __src, __n, __loc); } -#endif // !_LIBCPP_HAS_NO_WIDE_CHARACTERS +#endif // _LIBCPP_HAS_WIDE_CHARACTERS #if defined(__MINGW32__) && __MSVCRT_VERSION__ < 0x0800 _LIBCPP_EXPORTED_FROM_ABI size_t __strftime(char*, size_t, const char*, const struct tm*, __locale_t); >From 43a04b1db60414089bc7f864feb7cd8be7600498 Mon Sep 17 00:00:00 2001 From: Louis Dionne Date: Thu, 20 Feb 2025 08:38:42 -0500 Subject: [PATCH 2/2] [libc++] Reduce the dependency of the locale base API on the base system from the headers (#117764) Many parts of the locale base API are only required when building the shared/static library, but not from the headers. 
Document those functions and carve out a few of those that don't work when _XOPEN_SOURCE is defined to something old. Fixes #117630 (cherry picked from commit f00b32e2d0ee666d32f1ddd0c687e269fab95b44) --- libcxx/include/__locale_dir/locale_base_api.h | 56 --- .../include/__locale_dir/support/bsd_like.h | 22 +--- libcxx/include/__locale_dir/support/fuchsia.h | 9 ++- .../support/no_locale/characters.h| 8 ++- libcxx/include/__locale_dir/support/windows.h | 18 -- libcxx/test/libcxx/xopen_source.gen.py| 53 ++ 6 files changed, 128 insertions(+), 38 deletions(-) create mode 100644 libcxx/test/libcxx/xopen_source.gen.py diff --git a/libcxx/include/__locale_dir/locale_base_api.h b/libcxx/include/__locale_dir/locale_base_api.h index bbee9f49867fd..c1e73caeecced 100644 --- a/libcxx/include/__locale_dir/locale_base_api.h +++ b/libcxx/include/__locale_dir/locale_base_api.h @@ -23,12 +23,16 @@ // Variadic functions may be implemented as templates with a parameter pack instead // of C-style variadic functions. // +// Most of these functions are only required when building the library. Functions that are also +// required when merely using the headers are marked as such below. +// // TODO: __localeconv shouldn't take a reference, but the Windows implementation doesn't allow copying __locale_t +// TODO: Eliminate the need for any of these functions from the headers. 
// // Locale management // - // namespace __locale { -// using __locale_t = implementation-defined; +// using __locale_t = implementation-defined; // required by the headers // using __lconv_t = implementation-defined; // __locale_t __newlocale(int, const char*, __locale_t); // void__freelocale(__locale_t); @@ -36,6 +40,7 @@ // __lconv_t* __localeconv(__locale_t&); // } // +// // required by the headers // #define _LIBCPP_COLLATE_MASK /* implementation-defined */ // #define _LIBCPP_CTYPE_MASK /* implementation-defined */ // #define _LIBCPP_MONETARY_MASK /* implementation-defined */ @@ -48,6 +53,7 @@ // Strtonum functions // -- // namespace __locale { +// // required by the headers // float __strtof(const char*, char**, __locale_t); // double __strtod(const char*, char**, __locale_t); // long double __strtold(const char*, char**, __locale_t); @@ -60,8 +66,8 @@ // namespace __locale { // int __islower(int, __locale_t); // int __isupper(int, __locale_t); -// int __isdigit(int, __locale_t); -// int __isxdigit(int, __locale_t); +// int __isdigit(int, __locale_t); // required by the headers +// int __isxdigit(int, __locale_t); // required by the headers // int __toupper(int, __locale_t); // int __tolower(int, __locale_t); // int __strcoll(const char*, const char*, __locale_t); @@ -99,9 +105,10 @@ // int __mbtowc(wchar_t*, const char*, size_t, __l
[llvm-branch-commits] [libcxx] 43a04b1 - [libc++] Reduce the dependency of the locale base API on the base system from the headers (#117764)
Author: Louis Dionne Date: 2025-02-21T14:08:41-08:00 New Revision: 43a04b1db60414089bc7f864feb7cd8be7600498 URL: https://github.com/llvm/llvm-project/commit/43a04b1db60414089bc7f864feb7cd8be7600498 DIFF: https://github.com/llvm/llvm-project/commit/43a04b1db60414089bc7f864feb7cd8be7600498.diff LOG: [libc++] Reduce the dependency of the locale base API on the base system from the headers (#117764) Many parts of the locale base API are only required when building the shared/static library, but not from the headers. Document those functions and carve out a few of those that don't work when _XOPEN_SOURCE is defined to something old. Fixes #117630 (cherry picked from commit f00b32e2d0ee666d32f1ddd0c687e269fab95b44) Added: libcxx/test/libcxx/xopen_source.gen.py Modified: libcxx/include/__locale_dir/locale_base_api.h libcxx/include/__locale_dir/support/bsd_like.h libcxx/include/__locale_dir/support/fuchsia.h libcxx/include/__locale_dir/support/no_locale/characters.h libcxx/include/__locale_dir/support/windows.h Removed: diff --git a/libcxx/include/__locale_dir/locale_base_api.h b/libcxx/include/__locale_dir/locale_base_api.h index bbee9f49867fd..c1e73caeecced 100644 --- a/libcxx/include/__locale_dir/locale_base_api.h +++ b/libcxx/include/__locale_dir/locale_base_api.h @@ -23,12 +23,16 @@ // Variadic functions may be implemented as templates with a parameter pack instead // of C-style variadic functions. // +// Most of these functions are only required when building the library. Functions that are also +// required when merely using the headers are marked as such below. +// // TODO: __localeconv shouldn't take a reference, but the Windows implementation doesn't allow copying __locale_t +// TODO: Eliminate the need for any of these functions from the headers. 
// // Locale management // - // namespace __locale { -// using __locale_t = implementation-defined; +// using __locale_t = implementation-defined; // required by the headers // using __lconv_t = implementation-defined; // __locale_t __newlocale(int, const char*, __locale_t); // void__freelocale(__locale_t); @@ -36,6 +40,7 @@ // __lconv_t* __localeconv(__locale_t&); // } // +// // required by the headers // #define _LIBCPP_COLLATE_MASK /* implementation-defined */ // #define _LIBCPP_CTYPE_MASK /* implementation-defined */ // #define _LIBCPP_MONETARY_MASK /* implementation-defined */ @@ -48,6 +53,7 @@ // Strtonum functions // -- // namespace __locale { +// // required by the headers // float __strtof(const char*, char**, __locale_t); // double __strtod(const char*, char**, __locale_t); // long double __strtold(const char*, char**, __locale_t); @@ -60,8 +66,8 @@ // namespace __locale { // int __islower(int, __locale_t); // int __isupper(int, __locale_t); -// int __isdigit(int, __locale_t); -// int __isxdigit(int, __locale_t); +// int __isdigit(int, __locale_t); // required by the headers +// int __isxdigit(int, __locale_t); // required by the headers // int __toupper(int, __locale_t); // int __tolower(int, __locale_t); // int __strcoll(const char*, const char*, __locale_t); @@ -99,9 +105,10 @@ // int __mbtowc(wchar_t*, const char*, size_t, __locale_t); // size_t __mbrlen(const char*, size_t, mbstate_t*, __locale_t); // size_t __mbsrtowcs(wchar_t*, const char**, size_t, mbstate_t*, __locale_t); -// int __snprintf(char*, size_t, __locale_t, const char*, ...); -// int __asprintf(char**, __locale_t, const char*, ...); -// int __sscanf(const char*, __locale_t, const char*, ...); +// +// int __snprintf(char*, size_t, __locale_t, const char*, ...); // required by the headers +// int __asprintf(char**, __locale_t, const char*, ...);// required by the headers +// int __sscanf(const char*, __locale_t, const char*, ...); // required by the headers // } #if defined(__APPLE__) @@ 
-143,8 +150,19 @@ namespace __locale { // // Locale management // +# define _LIBCPP_COLLATE_MASK LC_COLLATE_MASK +# define _LIBCPP_CTYPE_MASK LC_CTYPE_MASK +# define _LIBCPP_MONETARY_MASK LC_MONETARY_MASK +# define _LIBCPP_NUMERIC_MASK LC_NUMERIC_MASK +# define _LIBCPP_TIME_MASK LC_TIME_MASK +# define _LIBCPP_MESSAGES_MASK LC_MESSAGES_MASK +# define _LIBCPP_ALL_MASK LC_ALL_MASK +# define _LIBCPP_LC_ALL LC_ALL + using __locale_t _LIBCPP_NODEBUG = locale_t; -using __lconv_t _LIBCPP_NODEBUG = lconv; + +# if defined(_LIBCPP_BUILDING_LIBRARY) +using __lconv_t _LIBCPP_NODEBUG = lconv; inline _LIBCPP_HIDE_FROM_ABI __locale_t __newlocale(int __category_mask, const char* __name, __locale_t __loc) { return newlocale(__category_mask, __name, __loc); @@ -157,15 +175,7 @@ inline _LIBCPP_HIDE_FROM_ABI char* __setlocale(int __category,
[llvm-branch-commits] [libcxx] release/20.x: [libc++] Reduce the dependency of the locale base API on the base system from the headers (#117764) (PR #128009)
https://github.com/tstellar closed https://github.com/llvm/llvm-project/pull/128009
[llvm-branch-commits] [clang] [libc] release/20.x: [Clang] Fix cross-lane scan when given divergent lanes (#127703) (PR #128085)
github-actions[bot] wrote: @jhuber6 (or anyone else). If you would like to add a note about this fix in the release notes (completely optional). Please reply to this comment with a one or two sentence description of the fix. When you are done, please add the release:note label to this PR. https://github.com/llvm/llvm-project/pull/128085
[llvm-branch-commits] [clang] [libc] release/20.x: [Clang] Fix cross-lane scan when given divergent lanes (#127703) (PR #128085)
https://github.com/tstellar closed https://github.com/llvm/llvm-project/pull/128085
[llvm-branch-commits] [llvm] release/20.x: Add Wasm, RISC-V, BPF, and NVPTX targets back to Windows release packaging (#127794) (PR #127982)
https://github.com/tstellar closed https://github.com/llvm/llvm-project/pull/127982
[llvm-branch-commits] [clang] release/20.x: [CUDA] Add support for sm101 and sm120 target architectures (#127187) (PR #127918)
https://github.com/llvmbot updated https://github.com/llvm/llvm-project/pull/127918 >From b84ffb9f3b349dd4548a9d3c0ead91021b7905a3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Sebastian=20Jod=C5=82owski?= Date: Wed, 19 Feb 2025 14:41:07 -0800 Subject: [PATCH] [CUDA] Add support for sm101 and sm120 target architectures (#127187) Add support for sm101 and sm120 target architectures. It requires CUDA 12.8. - Co-authored-by: Sebastian Jodlowski (cherry picked from commit 0127f169dc8e0b5b6c2a24f74cd42d9d277916f6) --- clang/include/clang/Basic/BuiltinsNVPTX.td| 8 --- clang/include/clang/Basic/Cuda.h | 4 clang/lib/Basic/Cuda.cpp | 8 +++ clang/lib/Basic/Targets/NVPTX.cpp | 23 +++ clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | 4 .../test/Misc/target-invalid-cpu-note/nvptx.c | 4 6 files changed, 43 insertions(+), 8 deletions(-) diff --git a/clang/include/clang/Basic/BuiltinsNVPTX.td b/clang/include/clang/Basic/BuiltinsNVPTX.td index 9d24a992563a4..b550fff8567df 100644 --- a/clang/include/clang/Basic/BuiltinsNVPTX.td +++ b/clang/include/clang/Basic/BuiltinsNVPTX.td @@ -21,12 +21,14 @@ class SM newer_list> : SMFeatures { !strconcat(f, "|", newer.Features)); } +let Features = "sm_120a" in def SM_120a : SMFeatures; +let Features = "sm_101a" in def SM_101a : SMFeatures; let Features = "sm_100a" in def SM_100a : SMFeatures; - -def SM_100 : SM<"100", [SM_100a]>; - let Features = "sm_90a" in def SM_90a : SMFeatures; +def SM_120 : SM<"120", [SM_120a]>; +def SM_101 : SM<"101", [SM_101a, SM_120]>; +def SM_100 : SM<"100", [SM_100a, SM_101]>; def SM_90 : SM<"90", [SM_90a, SM_100]>; def SM_89 : SM<"89", [SM_90]>; def SM_87 : SM<"87", [SM_89]>; diff --git a/clang/include/clang/Basic/Cuda.h b/clang/include/clang/Basic/Cuda.h index f33ba46233a7a..5c909a8e9ca11 100644 --- a/clang/include/clang/Basic/Cuda.h +++ b/clang/include/clang/Basic/Cuda.h @@ -82,6 +82,10 @@ enum class OffloadArch { SM_90a, SM_100, SM_100a, + SM_101, + SM_101a, + SM_120, + SM_120a, GFX600, GFX601, GFX602, diff --git 
a/clang/lib/Basic/Cuda.cpp b/clang/lib/Basic/Cuda.cpp index 1bfec0b37c5ee..79cac0ec119dd 100644 --- a/clang/lib/Basic/Cuda.cpp +++ b/clang/lib/Basic/Cuda.cpp @@ -100,6 +100,10 @@ static const OffloadArchToStringMap arch_names[] = { SM(90a), // Hopper SM(100), // Blackwell SM(100a),// Blackwell +SM(101), // Blackwell +SM(101a),// Blackwell +SM(120), // Blackwell +SM(120a),// Blackwell GFX(600), // gfx600 GFX(601), // gfx601 GFX(602), // gfx602 @@ -230,6 +234,10 @@ CudaVersion MinVersionForOffloadArch(OffloadArch A) { return CudaVersion::CUDA_120; case OffloadArch::SM_100: case OffloadArch::SM_100a: + case OffloadArch::SM_101: + case OffloadArch::SM_101a: + case OffloadArch::SM_120: + case OffloadArch::SM_120a: return CudaVersion::CUDA_128; default: llvm_unreachable("invalid enum"); diff --git a/clang/lib/Basic/Targets/NVPTX.cpp b/clang/lib/Basic/Targets/NVPTX.cpp index a03f4983b9d03..9be12cbe7ac19 100644 --- a/clang/lib/Basic/Targets/NVPTX.cpp +++ b/clang/lib/Basic/Targets/NVPTX.cpp @@ -176,7 +176,7 @@ void NVPTXTargetInfo::getTargetDefines(const LangOptions &Opts, if (Opts.CUDAIsDevice || Opts.OpenMPIsTargetDevice || !HostTarget) { // Set __CUDA_ARCH__ for the GPU specified. 
-std::string CUDAArchCode = [this] { +llvm::StringRef CUDAArchCode = [this] { switch (GPU) { case OffloadArch::GFX600: case OffloadArch::GFX601: @@ -283,14 +283,27 @@ void NVPTXTargetInfo::getTargetDefines(const LangOptions &Opts, case OffloadArch::SM_100: case OffloadArch::SM_100a: return "1000"; + case OffloadArch::SM_101: + case OffloadArch::SM_101a: + return "1010"; + case OffloadArch::SM_120: + case OffloadArch::SM_120a: + return "1200"; } llvm_unreachable("unhandled OffloadArch"); }(); Builder.defineMacro("__CUDA_ARCH__", CUDAArchCode); -if (GPU == OffloadArch::SM_90a) - Builder.defineMacro("__CUDA_ARCH_FEAT_SM90_ALL", "1"); -if (GPU == OffloadArch::SM_100a) - Builder.defineMacro("__CUDA_ARCH_FEAT_SM100_ALL", "1"); +switch(GPU) { + case OffloadArch::SM_90a: + case OffloadArch::SM_100a: + case OffloadArch::SM_101a: + case OffloadArch::SM_120a: +Builder.defineMacro("__CUDA_ARCH_FEAT_SM" + CUDAArchCode.drop_back() + "_ALL", "1"); +break; + default: +// Do nothing if this is not an enhanced architecture. +break; +} } } diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp index c13928f61a748..dc417880a50e9 10064
[llvm-branch-commits] [clang] release/20.x: [CUDA] Add support for sm101 and sm120 target architectures (#127187) (PR #127918)
https://github.com/tstellar closed https://github.com/llvm/llvm-project/pull/127918
[llvm-branch-commits] [clang] b84ffb9 - [CUDA] Add support for sm101 and sm120 target architectures (#127187)
Author: Sebastian Jodłowski Date: 2025-02-21T14:06:54-08:00 New Revision: b84ffb9f3b349dd4548a9d3c0ead91021b7905a3 URL: https://github.com/llvm/llvm-project/commit/b84ffb9f3b349dd4548a9d3c0ead91021b7905a3 DIFF: https://github.com/llvm/llvm-project/commit/b84ffb9f3b349dd4548a9d3c0ead91021b7905a3.diff LOG: [CUDA] Add support for sm101 and sm120 target architectures (#127187) Add support for sm101 and sm120 target architectures. It requires CUDA 12.8. - Co-authored-by: Sebastian Jodlowski (cherry picked from commit 0127f169dc8e0b5b6c2a24f74cd42d9d277916f6) Added: Modified: clang/include/clang/Basic/BuiltinsNVPTX.td clang/include/clang/Basic/Cuda.h clang/lib/Basic/Cuda.cpp clang/lib/Basic/Targets/NVPTX.cpp clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp clang/test/Misc/target-invalid-cpu-note/nvptx.c Removed: diff --git a/clang/include/clang/Basic/BuiltinsNVPTX.td b/clang/include/clang/Basic/BuiltinsNVPTX.td index 9d24a992563a4..b550fff8567df 100644 --- a/clang/include/clang/Basic/BuiltinsNVPTX.td +++ b/clang/include/clang/Basic/BuiltinsNVPTX.td @@ -21,12 +21,14 @@ class SM newer_list> : SMFeatures { !strconcat(f, "|", newer.Features)); } +let Features = "sm_120a" in def SM_120a : SMFeatures; +let Features = "sm_101a" in def SM_101a : SMFeatures; let Features = "sm_100a" in def SM_100a : SMFeatures; - -def SM_100 : SM<"100", [SM_100a]>; - let Features = "sm_90a" in def SM_90a : SMFeatures; +def SM_120 : SM<"120", [SM_120a]>; +def SM_101 : SM<"101", [SM_101a, SM_120]>; +def SM_100 : SM<"100", [SM_100a, SM_101]>; def SM_90 : SM<"90", [SM_90a, SM_100]>; def SM_89 : SM<"89", [SM_90]>; def SM_87 : SM<"87", [SM_89]>; diff --git a/clang/include/clang/Basic/Cuda.h b/clang/include/clang/Basic/Cuda.h index f33ba46233a7a..5c909a8e9ca11 100644 --- a/clang/include/clang/Basic/Cuda.h +++ b/clang/include/clang/Basic/Cuda.h @@ -82,6 +82,10 @@ enum class OffloadArch { SM_90a, SM_100, SM_100a, + SM_101, + SM_101a, + SM_120, + SM_120a, GFX600, GFX601, GFX602, diff --git 
a/clang/lib/Basic/Cuda.cpp b/clang/lib/Basic/Cuda.cpp index 1bfec0b37c5ee..79cac0ec119dd 100644 --- a/clang/lib/Basic/Cuda.cpp +++ b/clang/lib/Basic/Cuda.cpp @@ -100,6 +100,10 @@ static const OffloadArchToStringMap arch_names[] = { SM(90a), // Hopper SM(100), // Blackwell SM(100a),// Blackwell +SM(101), // Blackwell +SM(101a),// Blackwell +SM(120), // Blackwell +SM(120a),// Blackwell GFX(600), // gfx600 GFX(601), // gfx601 GFX(602), // gfx602 @@ -230,6 +234,10 @@ CudaVersion MinVersionForOffloadArch(OffloadArch A) { return CudaVersion::CUDA_120; case OffloadArch::SM_100: case OffloadArch::SM_100a: + case OffloadArch::SM_101: + case OffloadArch::SM_101a: + case OffloadArch::SM_120: + case OffloadArch::SM_120a: return CudaVersion::CUDA_128; default: llvm_unreachable("invalid enum"); diff --git a/clang/lib/Basic/Targets/NVPTX.cpp b/clang/lib/Basic/Targets/NVPTX.cpp index a03f4983b9d03..9be12cbe7ac19 100644 --- a/clang/lib/Basic/Targets/NVPTX.cpp +++ b/clang/lib/Basic/Targets/NVPTX.cpp @@ -176,7 +176,7 @@ void NVPTXTargetInfo::getTargetDefines(const LangOptions &Opts, if (Opts.CUDAIsDevice || Opts.OpenMPIsTargetDevice || !HostTarget) { // Set __CUDA_ARCH__ for the GPU specified. 
-std::string CUDAArchCode = [this] { +llvm::StringRef CUDAArchCode = [this] { switch (GPU) { case OffloadArch::GFX600: case OffloadArch::GFX601: @@ -283,14 +283,27 @@ void NVPTXTargetInfo::getTargetDefines(const LangOptions &Opts, case OffloadArch::SM_100: case OffloadArch::SM_100a: return "1000"; + case OffloadArch::SM_101: + case OffloadArch::SM_101a: + return "1010"; + case OffloadArch::SM_120: + case OffloadArch::SM_120a: + return "1200"; } llvm_unreachable("unhandled OffloadArch"); }(); Builder.defineMacro("__CUDA_ARCH__", CUDAArchCode); -if (GPU == OffloadArch::SM_90a) - Builder.defineMacro("__CUDA_ARCH_FEAT_SM90_ALL", "1"); -if (GPU == OffloadArch::SM_100a) - Builder.defineMacro("__CUDA_ARCH_FEAT_SM100_ALL", "1"); +switch(GPU) { + case OffloadArch::SM_90a: + case OffloadArch::SM_100a: + case OffloadArch::SM_101a: + case OffloadArch::SM_120a: +Builder.defineMacro("__CUDA_ARCH_FEAT_SM" + CUDAArchCode.drop_back() + "_ALL", "1"); +break; + default: +// Do nothing if this is not an enhanced architecture. +break; +} } } diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp index c13928f61a748..dc417
[llvm-branch-commits] [clang] release/20.x: [CUDA] Add support for sm101 and sm120 target architectures (#127187) (PR #127918)
github-actions[bot] wrote: @Artem-B (or anyone else). If you would like to add a note about this fix in the release notes (completely optional). Please reply to this comment with a one or two sentence description of the fix. When you are done, please add the release:note label to this PR. https://github.com/llvm/llvm-project/pull/127918 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [clang] [libc] release/20.x: [Clang] Fix cross-lane scan when given divergent lanes (#127703) (PR #128085)
https://github.com/llvmbot updated https://github.com/llvm/llvm-project/pull/128085 >From e0c4a3397fd2f80740d776de85360dc12cd0bcc7 Mon Sep 17 00:00:00 2001 From: Joseph Huber Date: Wed, 19 Feb 2025 16:46:59 -0600 Subject: [PATCH] [Clang] Fix cross-lane scan when given divergent lanes (#127703) Summary: The scan operation implemented here only works if there are contiguous ones in the executation mask that can be used to propagate the result. There are two solutions to this, one is to enter 'whole-wave-mode' and forcibly turn them back on, or to do this serially. This implementation does the latter because it's more portable, but checks to see if the parallel fast-path is applicable. Needs to be backported for correct behavior and because it fixes a failing libc test. (cherry picked from commit 6cc7ca084a5bbb7ccf606cab12065604453dde59) --- clang/lib/Headers/gpuintrin.h | 74 --- clang/lib/Headers/nvptxintrin.h | 5 +- .../src/__support/GPU/scan_reduce.cpp | 49 3 files changed, 102 insertions(+), 26 deletions(-) diff --git a/clang/lib/Headers/gpuintrin.h b/clang/lib/Headers/gpuintrin.h index 11c87e85cd497..efdc3d94ac0b3 100644 --- a/clang/lib/Headers/gpuintrin.h +++ b/clang/lib/Headers/gpuintrin.h @@ -150,35 +150,33 @@ __gpu_shuffle_idx_f64(uint64_t __lane_mask, uint32_t __idx, double __x, __builtin_bit_cast(uint64_t, __x), __width)); } -// Gets the sum of all lanes inside the warp or wavefront. 
-#define __DO_LANE_SUM(__type, __suffix) \ - _DEFAULT_FN_ATTRS static __inline__ __type __gpu_lane_sum_##__suffix( \ - uint64_t __lane_mask, __type __x) { \ -for (uint32_t __step = __gpu_num_lanes() / 2; __step > 0; __step /= 2) { \ - uint32_t __index = __step + __gpu_lane_id(); \ - __x += __gpu_shuffle_idx_##__suffix(__lane_mask, __index, __x, \ - __gpu_num_lanes()); \ -} \ -return __gpu_read_first_lane_##__suffix(__lane_mask, __x); \ - } -__DO_LANE_SUM(uint32_t, u32); // uint32_t __gpu_lane_sum_u32(m, x) -__DO_LANE_SUM(uint64_t, u64); // uint64_t __gpu_lane_sum_u64(m, x) -__DO_LANE_SUM(float, f32);// float __gpu_lane_sum_f32(m, x) -__DO_LANE_SUM(double, f64); // double __gpu_lane_sum_f64(m, x) -#undef __DO_LANE_SUM - // Gets the accumulator scan of the threads in the warp or wavefront. #define __DO_LANE_SCAN(__type, __bitmask_type, __suffix) \ _DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_lane_scan_##__suffix( \ uint64_t __lane_mask, uint32_t __x) { \ -for (uint32_t __step = 1; __step < __gpu_num_lanes(); __step *= 2) { \ - uint32_t __index = __gpu_lane_id() - __step; \ - __bitmask_type bitmask = __gpu_lane_id() >= __step; \ - __x += __builtin_bit_cast( \ - __type, -bitmask & __builtin_bit_cast(__bitmask_type, \ -__gpu_shuffle_idx_##__suffix( \ -__lane_mask, __index, __x, \ -__gpu_num_lanes(; \ +uint64_t __first = __lane_mask >> __builtin_ctzll(__lane_mask); \ +bool __divergent = __gpu_read_first_lane_##__suffix( \ +__lane_mask, __first & (__first + 1)); \ +if (__divergent) { \ + __type __accum = 0; \ + for (uint64_t __mask = __lane_mask; __mask; __mask &= __mask - 1) { \ +__type __index = __builtin_ctzll(__mask); \ +__type __tmp = __gpu_shuffle_idx_##__suffix(__lane_mask, __index, __x, \ +__gpu_num_lanes()); \ +__x = __gpu_lane_id() == __index ? 
__accum + __tmp : __x; \ +__accum += __tmp; \ + } \ +} else { \ + for (uint32_t __step = 1; __step < __gpu_num_lanes(); __step *= 2) { \ +uint32_t __index = __gpu_lane_id() - __step; \ +__bitmask_type bitmask = __gpu_lane_id() >= __step; \ +__x += __builtin_bit_cast( \ +__type,
[llvm-branch-commits] [libc] e0c4a33 - [Clang] Fix cross-lane scan when given divergent lanes (#127703)
Author: Joseph Huber Date: 2025-02-21T14:10:17-08:00 New Revision: e0c4a3397fd2f80740d776de85360dc12cd0bcc7 URL: https://github.com/llvm/llvm-project/commit/e0c4a3397fd2f80740d776de85360dc12cd0bcc7 DIFF: https://github.com/llvm/llvm-project/commit/e0c4a3397fd2f80740d776de85360dc12cd0bcc7.diff LOG: [Clang] Fix cross-lane scan when given divergent lanes (#127703) Summary: The scan operation implemented here only works if there are contiguous ones in the executation mask that can be used to propagate the result. There are two solutions to this, one is to enter 'whole-wave-mode' and forcibly turn them back on, or to do this serially. This implementation does the latter because it's more portable, but checks to see if the parallel fast-path is applicable. Needs to be backported for correct behavior and because it fixes a failing libc test. (cherry picked from commit 6cc7ca084a5bbb7ccf606cab12065604453dde59) Added: Modified: clang/lib/Headers/gpuintrin.h clang/lib/Headers/nvptxintrin.h libc/test/integration/src/__support/GPU/scan_reduce.cpp Removed: diff --git a/clang/lib/Headers/gpuintrin.h b/clang/lib/Headers/gpuintrin.h index 11c87e85cd497..efdc3d94ac0b3 100644 --- a/clang/lib/Headers/gpuintrin.h +++ b/clang/lib/Headers/gpuintrin.h @@ -150,35 +150,33 @@ __gpu_shuffle_idx_f64(uint64_t __lane_mask, uint32_t __idx, double __x, __builtin_bit_cast(uint64_t, __x), __width)); } -// Gets the sum of all lanes inside the warp or wavefront. 
-#define __DO_LANE_SUM(__type, __suffix) \ - _DEFAULT_FN_ATTRS static __inline__ __type __gpu_lane_sum_##__suffix( \ - uint64_t __lane_mask, __type __x) { \ -for (uint32_t __step = __gpu_num_lanes() / 2; __step > 0; __step /= 2) { \ - uint32_t __index = __step + __gpu_lane_id(); \ - __x += __gpu_shuffle_idx_##__suffix(__lane_mask, __index, __x, \ - __gpu_num_lanes()); \ -} \ -return __gpu_read_first_lane_##__suffix(__lane_mask, __x); \ - } -__DO_LANE_SUM(uint32_t, u32); // uint32_t __gpu_lane_sum_u32(m, x) -__DO_LANE_SUM(uint64_t, u64); // uint64_t __gpu_lane_sum_u64(m, x) -__DO_LANE_SUM(float, f32);// float __gpu_lane_sum_f32(m, x) -__DO_LANE_SUM(double, f64); // double __gpu_lane_sum_f64(m, x) -#undef __DO_LANE_SUM - // Gets the accumulator scan of the threads in the warp or wavefront. #define __DO_LANE_SCAN(__type, __bitmask_type, __suffix) \ _DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_lane_scan_##__suffix( \ uint64_t __lane_mask, uint32_t __x) { \ -for (uint32_t __step = 1; __step < __gpu_num_lanes(); __step *= 2) { \ - uint32_t __index = __gpu_lane_id() - __step; \ - __bitmask_type bitmask = __gpu_lane_id() >= __step; \ - __x += __builtin_bit_cast( \ - __type, -bitmask & __builtin_bit_cast(__bitmask_type, \ -__gpu_shuffle_idx_##__suffix( \ -__lane_mask, __index, __x, \ -__gpu_num_lanes(; \ +uint64_t __first = __lane_mask >> __builtin_ctzll(__lane_mask); \ +bool __divergent = __gpu_read_first_lane_##__suffix( \ +__lane_mask, __first & (__first + 1)); \ +if (__divergent) { \ + __type __accum = 0; \ + for (uint64_t __mask = __lane_mask; __mask; __mask &= __mask - 1) { \ +__type __index = __builtin_ctzll(__mask); \ +__type __tmp = __gpu_shuffle_idx_##__suffix(__lane_mask, __index, __x, \ +__gpu_num_lanes()); \ +__x = __gpu_lane_id() == __index ? 
__accum + __tmp : __x; \ +__accum += __tmp; \ + } \ +} else { \ + for (uint32_t __step = 1; __step < __gpu_num_lanes(); __step *= 2) { \ +uint32_t __index = __gpu_lane_id() - __step; \ +__bitmask_type bitmask = __gpu_lane_id() >= __step; \ +__x += __builtin_bit_cast(
[llvm-branch-commits] [libcxx] release/20.x: [libc++] Reduce the dependency of the locale base API on the base system from the headers (#117764) (PR #128009)
github-actions[bot] wrote: @ldionne (or anyone else). If you would like to add a note about this fix in the release notes (completely optional). Please reply to this comment with a one or two sentence description of the fix. When you are done, please add the release:note label to this PR. https://github.com/llvm/llvm-project/pull/128009

[llvm-branch-commits] [llvm] b727a13 - Add Wasm, RISC-V, BPF, and NVPTX targets back to Windows release packaging (#127794)
Author: Hans Wennborg Date: 2025-02-21T14:03:27-08:00 New Revision: b727a13fecc4e29b6f8499afd95626795c9f6a8e URL: https://github.com/llvm/llvm-project/commit/b727a13fecc4e29b6f8499afd95626795c9f6a8e DIFF: https://github.com/llvm/llvm-project/commit/b727a13fecc4e29b6f8499afd95626795c9f6a8e.diff LOG: Add Wasm, RISC-V, BPF, and NVPTX targets back to Windows release packaging (#127794) In #106059 we reduced the targets to those supported by Windows (X86 and ARM) to avoid running into size limitations of the NSIS compiler. Since then, people complained about the lack of Wasm [1], RISC-V [2], BPF [3], and NVPTX [4]. These do seem to fit in the installer (at least for 20.1.0-rc2), so let's add them back. [1] https://discourse.llvm.org/t/llvm-19-x-release-third-party-binaries/80374/26 [2] https://discourse.llvm.org/t/llvm-19-x-release-third-party-binaries/80374/53 [3] https://github.com/llvm/llvm-project/issues/127120 [4] https://github.com/llvm/llvm-project/pull/127794#issuecomment-2668677203 (cherry picked from commit 6e047a5ab42698165a4746ef681396fab1698327) Added: Modified: llvm/utils/release/build_llvm_release.bat Removed: diff --git a/llvm/utils/release/build_llvm_release.bat b/llvm/utils/release/build_llvm_release.bat index dd041d7d384ec..1c30673cf88bd 100755 --- a/llvm/utils/release/build_llvm_release.bat +++ b/llvm/utils/release/build_llvm_release.bat @@ -150,7 +150,7 @@ set common_cmake_flags=^ -DCMAKE_BUILD_TYPE=Release ^ -DLLVM_ENABLE_ASSERTIONS=OFF ^ -DLLVM_INSTALL_TOOLCHAIN_ONLY=ON ^ - -DLLVM_TARGETS_TO_BUILD="AArch64;ARM;X86" ^ + -DLLVM_TARGETS_TO_BUILD="AArch64;ARM;X86;BPF;WebAssembly;RISCV;NVPTX" ^ -DLLVM_BUILD_LLVM_C_DYLIB=ON ^ -DCMAKE_INSTALL_UCRT_LIBRARIES=ON ^ -DPython3_FIND_REGISTRY=NEVER ^ ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] release/20.x: AMDGPU: Widen f16 minimum/maximum to v2f16 on gfx950 (#128121) (PR #128132)
https://github.com/llvmbot updated https://github.com/llvm/llvm-project/pull/128132 >From e6d4fd035fdf90348fbeba6e73f90feb6e66b30b Mon Sep 17 00:00:00 2001 From: Matt Arsenault Date: Fri, 21 Feb 2025 12:08:49 +0700 Subject: [PATCH] AMDGPU: Widen f16 minimum/maximum to v2f16 on gfx950 (#128121) Unfortunately we only have the vector versions of v2f16 minimum3 and maximum. Widen to v2f16 so we can lower as minimum333(x, y, y). (cherry picked from commit e729dc759d052de122c8a918fe51b05ac796bb50) --- llvm/lib/Target/AMDGPU/SIISelLowering.cpp| 40 +- llvm/lib/Target/AMDGPU/SIISelLowering.h | 1 + llvm/test/CodeGen/AMDGPU/fmaximum3.ll| 689 --- llvm/test/CodeGen/AMDGPU/fminimum3.ll| 689 --- llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll | 66 +- llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll | 66 +- 6 files changed, 966 insertions(+), 585 deletions(-) diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp index e09df53995d61..d45ae7398e25d 100644 --- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp +++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp @@ -869,8 +869,13 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM, if (Subtarget->hasMinimum3Maximum3F32()) setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::f32, Legal); -if (Subtarget->hasMinimum3Maximum3PKF16()) +if (Subtarget->hasMinimum3Maximum3PKF16()) { setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::v2f16, Legal); + + // If only the vector form is available, we need to widen to a vector. 
+ if (!Subtarget->hasMinimum3Maximum3F16()) +setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::f16, Custom); +} } setOperationAction(ISD::INTRINSIC_WO_CHAIN, @@ -5963,6 +5968,9 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const { case ISD::FMINNUM: case ISD::FMAXNUM: return lowerFMINNUM_FMAXNUM(Op, DAG); + case ISD::FMINIMUM: + case ISD::FMAXIMUM: +return lowerFMINIMUM_FMAXIMUM(Op, DAG); case ISD::FLDEXP: case ISD::STRICT_FLDEXP: return lowerFLDEXP(Op, DAG); @@ -5984,8 +5992,6 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const { case ISD::FMUL: case ISD::FMINNUM_IEEE: case ISD::FMAXNUM_IEEE: - case ISD::FMINIMUM: - case ISD::FMAXIMUM: case ISD::FMINIMUMNUM: case ISD::FMAXIMUMNUM: case ISD::UADDSAT: @@ -6840,6 +6846,34 @@ SDValue SITargetLowering::lowerFMINNUM_FMAXNUM(SDValue Op, return Op; } +SDValue SITargetLowering::lowerFMINIMUM_FMAXIMUM(SDValue Op, + SelectionDAG &DAG) const { + EVT VT = Op.getValueType(); + if (VT.isVector()) +return splitBinaryVectorOp(Op, DAG); + + assert(!Subtarget->hasIEEEMinMax() && !Subtarget->hasMinimum3Maximum3F16() && + Subtarget->hasMinimum3Maximum3PKF16() && VT == MVT::f16 && + "should not need to widen f16 minimum/maximum to v2f16"); + + // Widen f16 operation to v2f16 + + // fminimum f16:x, f16:y -> + // extract_vector_elt (fminimum (v2f16 (scalar_to_vector x)) + //(v2f16 (scalar_to_vector y))), 0 + SDLoc SL(Op); + SDValue WideSrc0 = + DAG.getNode(ISD::SCALAR_TO_VECTOR, SL, MVT::v2f16, Op.getOperand(0)); + SDValue WideSrc1 = + DAG.getNode(ISD::SCALAR_TO_VECTOR, SL, MVT::v2f16, Op.getOperand(1)); + + SDValue Widened = + DAG.getNode(Op.getOpcode(), SL, MVT::v2f16, WideSrc0, WideSrc1); + + return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, MVT::f16, Widened, + DAG.getConstant(0, SL, MVT::i32)); +} + SDValue SITargetLowering::lowerFLDEXP(SDValue Op, SelectionDAG &DAG) const { bool IsStrict = Op.getOpcode() == ISD::STRICT_FLDEXP; EVT VT = Op.getValueType(); diff --git 
a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h index 1cd7f1b29e077..9b2c14862407a 100644 --- a/llvm/lib/Target/AMDGPU/SIISelLowering.h +++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h @@ -146,6 +146,7 @@ class SITargetLowering final : public AMDGPUTargetLowering { /// Custom lowering for ISD::FP_ROUND for MVT::f16. SDValue lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const; SDValue lowerFMINNUM_FMAXNUM(SDValue Op, SelectionDAG &DAG) const; + SDValue lowerFMINIMUM_FMAXIMUM(SDValue Op, SelectionDAG &DAG) const; SDValue lowerFLDEXP(SDValue Op, SelectionDAG &DAG) const; SDValue promoteUniformOpToI32(SDValue Op, DAGCombinerInfo &DCI) const; SDValue lowerMUL(SDValue Op, SelectionDAG &DAG) const; diff --git a/llvm/test/CodeGen/AMDGPU/fmaximum3.ll b/llvm/test/CodeGen/AMDGPU/fmaximum3.ll index f0fa621e3b4bc..6724c37605eb4 100644 --- a/llvm/test/CodeGen/AMDGPU/fmaximum3.ll +++ b/llvm/test/CodeGen/AMDGPU/fmaximum3.ll @@ -1251,19 +1251,27 @@ define half @v_fmaximum3_f16(half %a, half %b, half %c) { ; GFX12-NEXT:v_maximum3_f16 v0, v0, v1, v2 ; GFX12-NEXT:s_setpc_b64 s[30:3
[llvm-branch-commits] [llvm] e6d4fd0 - AMDGPU: Widen f16 minimum/maximum to v2f16 on gfx950 (#128121)
Author: Matt Arsenault Date: 2025-02-21T14:14:20-08:00 New Revision: e6d4fd035fdf90348fbeba6e73f90feb6e66b30b URL: https://github.com/llvm/llvm-project/commit/e6d4fd035fdf90348fbeba6e73f90feb6e66b30b DIFF: https://github.com/llvm/llvm-project/commit/e6d4fd035fdf90348fbeba6e73f90feb6e66b30b.diff LOG: AMDGPU: Widen f16 minimum/maximum to v2f16 on gfx950 (#128121) Unfortunately we only have the vector versions of v2f16 minimum3 and maximum. Widen to v2f16 so we can lower as minimum333(x, y, y). (cherry picked from commit e729dc759d052de122c8a918fe51b05ac796bb50) Added: Modified: llvm/lib/Target/AMDGPU/SIISelLowering.cpp llvm/lib/Target/AMDGPU/SIISelLowering.h llvm/test/CodeGen/AMDGPU/fmaximum3.ll llvm/test/CodeGen/AMDGPU/fminimum3.ll llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll Removed: diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp index e09df53995d61..d45ae7398e25d 100644 --- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp +++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp @@ -869,8 +869,13 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM, if (Subtarget->hasMinimum3Maximum3F32()) setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::f32, Legal); -if (Subtarget->hasMinimum3Maximum3PKF16()) +if (Subtarget->hasMinimum3Maximum3PKF16()) { setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::v2f16, Legal); + + // If only the vector form is available, we need to widen to a vector. 
+ if (!Subtarget->hasMinimum3Maximum3F16()) +setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::f16, Custom); +} } setOperationAction(ISD::INTRINSIC_WO_CHAIN, @@ -5963,6 +5968,9 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const { case ISD::FMINNUM: case ISD::FMAXNUM: return lowerFMINNUM_FMAXNUM(Op, DAG); + case ISD::FMINIMUM: + case ISD::FMAXIMUM: +return lowerFMINIMUM_FMAXIMUM(Op, DAG); case ISD::FLDEXP: case ISD::STRICT_FLDEXP: return lowerFLDEXP(Op, DAG); @@ -5984,8 +5992,6 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const { case ISD::FMUL: case ISD::FMINNUM_IEEE: case ISD::FMAXNUM_IEEE: - case ISD::FMINIMUM: - case ISD::FMAXIMUM: case ISD::FMINIMUMNUM: case ISD::FMAXIMUMNUM: case ISD::UADDSAT: @@ -6840,6 +6846,34 @@ SDValue SITargetLowering::lowerFMINNUM_FMAXNUM(SDValue Op, return Op; } +SDValue SITargetLowering::lowerFMINIMUM_FMAXIMUM(SDValue Op, + SelectionDAG &DAG) const { + EVT VT = Op.getValueType(); + if (VT.isVector()) +return splitBinaryVectorOp(Op, DAG); + + assert(!Subtarget->hasIEEEMinMax() && !Subtarget->hasMinimum3Maximum3F16() && + Subtarget->hasMinimum3Maximum3PKF16() && VT == MVT::f16 && + "should not need to widen f16 minimum/maximum to v2f16"); + + // Widen f16 operation to v2f16 + + // fminimum f16:x, f16:y -> + // extract_vector_elt (fminimum (v2f16 (scalar_to_vector x)) + //(v2f16 (scalar_to_vector y))), 0 + SDLoc SL(Op); + SDValue WideSrc0 = + DAG.getNode(ISD::SCALAR_TO_VECTOR, SL, MVT::v2f16, Op.getOperand(0)); + SDValue WideSrc1 = + DAG.getNode(ISD::SCALAR_TO_VECTOR, SL, MVT::v2f16, Op.getOperand(1)); + + SDValue Widened = + DAG.getNode(Op.getOpcode(), SL, MVT::v2f16, WideSrc0, WideSrc1); + + return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, MVT::f16, Widened, + DAG.getConstant(0, SL, MVT::i32)); +} + SDValue SITargetLowering::lowerFLDEXP(SDValue Op, SelectionDAG &DAG) const { bool IsStrict = Op.getOpcode() == ISD::STRICT_FLDEXP; EVT VT = Op.getValueType(); diff --git 
a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h index 1cd7f1b29e077..9b2c14862407a 100644 --- a/llvm/lib/Target/AMDGPU/SIISelLowering.h +++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h @@ -146,6 +146,7 @@ class SITargetLowering final : public AMDGPUTargetLowering { /// Custom lowering for ISD::FP_ROUND for MVT::f16. SDValue lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const; SDValue lowerFMINNUM_FMAXNUM(SDValue Op, SelectionDAG &DAG) const; + SDValue lowerFMINIMUM_FMAXIMUM(SDValue Op, SelectionDAG &DAG) const; SDValue lowerFLDEXP(SDValue Op, SelectionDAG &DAG) const; SDValue promoteUniformOpToI32(SDValue Op, DAGCombinerInfo &DCI) const; SDValue lowerMUL(SDValue Op, SelectionDAG &DAG) const; diff --git a/llvm/test/CodeGen/AMDGPU/fmaximum3.ll b/llvm/test/CodeGen/AMDGPU/fmaximum3.ll index f0fa621e3b4bc..6724c37605eb4 100644 --- a/llvm/test/CodeGen/AMDGPU/fmaximum3.ll +++ b/llvm/test/CodeGen/AMDGPU/fmaximum3.ll @@ -1251,19 +1251,27 @@ define half @v_fmaximum3_f16(half %a, half %b, half %c) { ; GFX12-NEXT:v_maximum3_f16 v0
[llvm-branch-commits] [llvm] release/20.x: AMDGPU: Widen f16 minimum/maximum to v2f16 on gfx950 (#128121) (PR #128132)
github-actions[bot] wrote: @arsenm (or anyone else). If you would like to add a note about this fix in the release notes (completely optional). Please reply to this comment with a one or two sentence description of the fix. When you are done, please add the release:note label to this PR. https://github.com/llvm/llvm-project/pull/128132
[llvm-branch-commits] [clang] release/20.x: [CUDA] Add support for sm101 and sm120 target architectures (#127187) (PR #127918)
Artem-B wrote: ``` # CUDA - Clang now supports CUDA compilation with CUDA SDK up to v12.8 - Clang can now target sm_100, sm_101, and sm_120 GPUs (Blackwell) ``` https://github.com/llvm/llvm-project/pull/127918
[llvm-branch-commits] [llvm] release/20.x: AMDGPU: Widen f16 minimum/maximum to v2f16 on gfx950 (#128121) (PR #128132)
https://github.com/tstellar closed https://github.com/llvm/llvm-project/pull/128132
[llvm-branch-commits] [llvm] release/20.x: [RISCV] [MachineOutliner] Analyze all candidates (#127659) (PR #128146)
https://github.com/llvmbot created https://github.com/llvm/llvm-project/pull/128146 Backport 6757cf4 Requested by: @svs-quic >From 8f71b2383f1da600e396fbc912795362659adba1 Mon Sep 17 00:00:00 2001 From: Sudharsan Veeravalli Date: Fri, 21 Feb 2025 12:53:13 +0530 Subject: [PATCH] [RISCV] [MachineOutliner] Analyze all candidates (#127659) #117700 made a change from analyzing all the candidates to analyzing just the first candidate before deciding to either delete or keep all of them. Even though the candidates all have the same instructions, the basic blocks in which they are present are different and we will need to check each of them before deciding whether to keep or erase them. Particularly, `isAvailableAcrossAndOutOfSeq` checks to see if the register (x5 in this case) is available from the end of the MBB to the beginning of the candidate and not checking this for each candidate led to incorrect candidates being outlined resulting in correctness issues in a few downstream benchmarks. Similarly, deleting all the candidates if the first one is not viable will result in missed outlining opportunities. 
(cherry picked from commit 6757cf4e6f1c7767d605e579930a24758c0778dc) --- llvm/lib/Target/RISCV/RISCVInstrInfo.cpp | 52 +++ .../machine-outliner-call-x5-liveout.mir | 136 ++ 2 files changed, 158 insertions(+), 30 deletions(-) create mode 100644 llvm/test/CodeGen/RISCV/machine-outliner-call-x5-liveout.mir diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp index 12a7af0750813..87f1f35835cbe 100644 --- a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp +++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp @@ -3010,30 +3010,25 @@ static bool cannotInsertTailCall(const MachineBasicBlock &MBB) { return false; } -static std::optional -analyzeCandidate(outliner::Candidate &C) { +static bool analyzeCandidate(outliner::Candidate &C) { // If last instruction is return then we can rely on // the verification already performed in the getOutliningTypeImpl. if (C.back().isReturn()) { assert(!cannotInsertTailCall(*C.getMBB()) && "The candidate who uses return instruction must be outlined " "using tail call"); -return MachineOutlinerTailCall; +return false; } - auto CandidateUsesX5 = [](outliner::Candidate &C) { -const TargetRegisterInfo *TRI = C.getMF()->getSubtarget().getRegisterInfo(); -if (std::any_of(C.begin(), C.end(), [TRI](const MachineInstr &MI) { - return isMIModifiesReg(MI, TRI, RISCV::X5); -})) - return true; -return !C.isAvailableAcrossAndOutOfSeq(RISCV::X5, *TRI); - }; - - if (!CandidateUsesX5(C)) -return MachineOutlinerDefault; + // Filter out candidates where the X5 register (t0) can't be used to setup + // the function call. 
+ const TargetRegisterInfo *TRI = C.getMF()->getSubtarget().getRegisterInfo(); + if (std::any_of(C.begin(), C.end(), [TRI](const MachineInstr &MI) { +return isMIModifiesReg(MI, TRI, RISCV::X5); + })) +return true; - return std::nullopt; + return !C.isAvailableAcrossAndOutOfSeq(RISCV::X5, *TRI); } std::optional> @@ -3042,35 +3037,32 @@ RISCVInstrInfo::getOutliningCandidateInfo( std::vector &RepeatedSequenceLocs, unsigned MinRepeats) const { - // Each RepeatedSequenceLoc is identical. - outliner::Candidate &Candidate = RepeatedSequenceLocs[0]; - auto CandidateInfo = analyzeCandidate(Candidate); - if (!CandidateInfo) -RepeatedSequenceLocs.clear(); + // Analyze each candidate and erase the ones that are not viable. + llvm::erase_if(RepeatedSequenceLocs, analyzeCandidate); // If the sequence doesn't have enough candidates left, then we're done. if (RepeatedSequenceLocs.size() < MinRepeats) return std::nullopt; + // Each RepeatedSequenceLoc is identical. + outliner::Candidate &Candidate = RepeatedSequenceLocs[0]; unsigned InstrSizeCExt = Candidate.getMF()->getSubtarget().hasStdExtCOrZca() ? 2 : 4; unsigned CallOverhead = 0, FrameOverhead = 0; - MachineOutlinerConstructionID MOCI = CandidateInfo.value(); - switch (MOCI) { - case MachineOutlinerDefault: -// call t0, function = 8 bytes. -CallOverhead = 8; -// jr t0 = 4 bytes, 2 bytes if compressed instructions are enabled. -FrameOverhead = InstrSizeCExt; -break; - case MachineOutlinerTailCall: + MachineOutlinerConstructionID MOCI = MachineOutlinerDefault; + if (Candidate.back().isReturn()) { +MOCI = MachineOutlinerTailCall; // tail call = auipc + jalr in the worst case without linker relaxation. CallOverhead = 4 + InstrSizeCExt; // Using tail call we move ret instruction from caller to callee. FrameOverhead = 0; -break; + } else { +// call t0, function = 8 bytes. +CallOverhead = 8; +// jr t0 = 4 bytes, 2 bytes if compressed instructions are enabled. +
[llvm-branch-commits] [llvm] release/20.x: [RISCV] [MachineOutliner] Analyze all candidates (#127659) (PR #128146)
https://github.com/llvmbot milestoned https://github.com/llvm/llvm-project/pull/128146
[llvm-branch-commits] [llvm] release/20.x: [RISCV] [MachineOutliner] Analyze all candidates (#127659) (PR #128146)
llvmbot wrote: @wangpc-pp What do you think about merging this PR to the release branch? https://github.com/llvm/llvm-project/pull/128146
[llvm-branch-commits] [llvm] release/20.x: [RISCV] [MachineOutliner] Analyze all candidates (#127659) (PR #128146)
llvmbot wrote: @llvm/pr-subscribers-backend-risc-v Author: None (llvmbot) Changes Backport 6757cf4 Requested by: @svs-quic --- Full diff: https://github.com/llvm/llvm-project/pull/128146.diff 2 Files Affected: - (modified) llvm/lib/Target/RISCV/RISCVInstrInfo.cpp (+22-30) - (added) llvm/test/CodeGen/RISCV/machine-outliner-call-x5-liveout.mir (+136) ``diff diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp index 12a7af0750813..87f1f35835cbe 100644 --- a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp +++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp @@ -3010,30 +3010,25 @@ static bool cannotInsertTailCall(const MachineBasicBlock &MBB) { return false; } -static std::optional -analyzeCandidate(outliner::Candidate &C) { +static bool analyzeCandidate(outliner::Candidate &C) { // If last instruction is return then we can rely on // the verification already performed in the getOutliningTypeImpl. if (C.back().isReturn()) { assert(!cannotInsertTailCall(*C.getMBB()) && "The candidate who uses return instruction must be outlined " "using tail call"); -return MachineOutlinerTailCall; +return false; } - auto CandidateUsesX5 = [](outliner::Candidate &C) { -const TargetRegisterInfo *TRI = C.getMF()->getSubtarget().getRegisterInfo(); -if (std::any_of(C.begin(), C.end(), [TRI](const MachineInstr &MI) { - return isMIModifiesReg(MI, TRI, RISCV::X5); -})) - return true; -return !C.isAvailableAcrossAndOutOfSeq(RISCV::X5, *TRI); - }; - - if (!CandidateUsesX5(C)) -return MachineOutlinerDefault; + // Filter out candidates where the X5 register (t0) can't be used to setup + // the function call. 
+ const TargetRegisterInfo *TRI = C.getMF()->getSubtarget().getRegisterInfo(); + if (std::any_of(C.begin(), C.end(), [TRI](const MachineInstr &MI) { +return isMIModifiesReg(MI, TRI, RISCV::X5); + })) +return true; - return std::nullopt; + return !C.isAvailableAcrossAndOutOfSeq(RISCV::X5, *TRI); } std::optional> @@ -3042,35 +3037,32 @@ RISCVInstrInfo::getOutliningCandidateInfo( std::vector &RepeatedSequenceLocs, unsigned MinRepeats) const { - // Each RepeatedSequenceLoc is identical. - outliner::Candidate &Candidate = RepeatedSequenceLocs[0]; - auto CandidateInfo = analyzeCandidate(Candidate); - if (!CandidateInfo) -RepeatedSequenceLocs.clear(); + // Analyze each candidate and erase the ones that are not viable. + llvm::erase_if(RepeatedSequenceLocs, analyzeCandidate); // If the sequence doesn't have enough candidates left, then we're done. if (RepeatedSequenceLocs.size() < MinRepeats) return std::nullopt; + // Each RepeatedSequenceLoc is identical. + outliner::Candidate &Candidate = RepeatedSequenceLocs[0]; unsigned InstrSizeCExt = Candidate.getMF()->getSubtarget().hasStdExtCOrZca() ? 2 : 4; unsigned CallOverhead = 0, FrameOverhead = 0; - MachineOutlinerConstructionID MOCI = CandidateInfo.value(); - switch (MOCI) { - case MachineOutlinerDefault: -// call t0, function = 8 bytes. -CallOverhead = 8; -// jr t0 = 4 bytes, 2 bytes if compressed instructions are enabled. -FrameOverhead = InstrSizeCExt; -break; - case MachineOutlinerTailCall: + MachineOutlinerConstructionID MOCI = MachineOutlinerDefault; + if (Candidate.back().isReturn()) { +MOCI = MachineOutlinerTailCall; // tail call = auipc + jalr in the worst case without linker relaxation. CallOverhead = 4 + InstrSizeCExt; // Using tail call we move ret instruction from caller to callee. FrameOverhead = 0; -break; + } else { +// call t0, function = 8 bytes. +CallOverhead = 8; +// jr t0 = 4 bytes, 2 bytes if compressed instructions are enabled. 
+FrameOverhead = InstrSizeCExt; } for (auto &C : RepeatedSequenceLocs) diff --git a/llvm/test/CodeGen/RISCV/machine-outliner-call-x5-liveout.mir b/llvm/test/CodeGen/RISCV/machine-outliner-call-x5-liveout.mir new file mode 100644 index 0..f7bea33e52885 --- /dev/null +++ b/llvm/test/CodeGen/RISCV/machine-outliner-call-x5-liveout.mir @@ -0,0 +1,136 @@ +# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5 +# RUN: llc -mtriple=riscv32 -x mir -run-pass=machine-outliner -simplify-mir -verify-machineinstrs < %s \ +# RUN: | FileCheck -check-prefixes=RV32I-MO %s +# RUN: llc -mtriple=riscv64 -x mir -run-pass=machine-outliner -simplify-mir -verify-machineinstrs < %s \ +# RUN: | FileCheck -check-prefixes=RV64I-MO %s + +# MIR has been edited by hand to have x5 as live out in @dont_outline + +--- + +name:outline_0 +tracksRegLiveness: true +isOutlined: false +body: | + bb.0: +liveins: $x10, $x11 + +; RV32I-MO-LABEL: name: outline_0 +; RV32I-MO: liveins: $x10, $x11 +; RV32I-M
[llvm-branch-commits] [flang] [flang][OpenMP] Extend `do concurrent` mapping to multi-range loops (PR #127634)
@@ -102,6 +105,47 @@ mlir::Operation *findLoopIterationVarMemDecl(fir::DoLoopOp doLoop) { return result.getDefiningOp(); } +/// Collects the op(s) responsible for updating a loop's iteration variable with +/// the current iteration number. For example, for the input IR: +/// ``` +/// %i = fir.alloca i32 {bindc_name = "i"} +/// %i_decl:2 = hlfir.declare %i ... +/// ... +/// fir.do_loop %i_iv = %lb to %ub step %step unordered { +/// %1 = fir.convert %i_iv : (index) -> i32 +/// fir.store %1 to %i_decl#1 : !fir.ref +/// ... +/// } +/// ``` +/// this function would return the first 2 ops in the `fir.do_loop`'s region. +llvm::SetVector +extractIndVarUpdateOps(fir::DoLoopOp doLoop) { + mlir::Value indVar = doLoop.getInductionVar(); + llvm::SetVector indVarUpdateOps; + + llvm::SmallVector toProcess; + toProcess.push_back(indVar); + + llvm::DenseSet done; + + while (!toProcess.empty()) { +mlir::Value val = toProcess.back(); +toProcess.pop_back(); + +if (!done.insert(val).second) + continue; + +for (mlir::Operation *user : val.getUsers()) { + indVarUpdateOps.insert(user); + + for (mlir::Value result : user->getResults()) +toProcess.push_back(result); +} + } + + return std::move(indVarUpdateOps); skatrak wrote: Returning containers goes a bit against general recommendations, but if you prefer to keep this approach rather than populating an output `SmallVectorImpl &` argument with help of `llvm::is_contained()` (which is what `SetVector` does for small vectors), I'd suggest considering the following: ```suggestion return std::move(indVarUpdateOps.takeVector()); ``` That way, at least we don't leak implementation details of the structure we used to avoid duplicates. https://github.com/llvm/llvm-project/pull/127634 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [flang] [flang][OpenMP] Extend `do concurrent` mapping to multi-range loops (PR #127634)
https://github.com/skatrak edited https://github.com/llvm/llvm-project/pull/127634
[llvm-branch-commits] [flang] [flang][OpenMP] Extend `do concurrent` mapping to multi-range loops (PR #127634)
@@ -102,6 +105,47 @@ mlir::Operation *findLoopIterationVarMemDecl(fir::DoLoopOp doLoop) { return result.getDefiningOp(); } +/// Collects the op(s) responsible for updating a loop's iteration variable with +/// the current iteration number. For example, for the input IR: skatrak wrote: This function seems to do something more generic than that: it collects all of the ops that either take the loop's induction variable as argument or take a value as argument that has been calculated based on the result of another operation that directly or indirectly took the loop's induction variable as argument. I guess that, similarly to another comment I left at a previous PR in the stack https://github.com/llvm/llvm-project/pull/127633#discussion_r1963571510, it's doing something more general than it states. If, like the other case, the idea is to just store the associated `fir.convert` and `fir.store` operations, perhaps it makes more sense to match that pattern specifically. https://github.com/llvm/llvm-project/pull/127634 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [OpenMPIRBuilder] Split calculation of canonical loop trip count, NFC (PR #127820)
https://github.com/Meinersbur approved this pull request. LGTM https://github.com/llvm/llvm-project/pull/127820
[llvm-branch-commits] [flang] [Flang][OpenMP] Allow host evaluation of loop bounds for distribute (PR #127822)
https://github.com/Meinersbur approved this pull request. LGTM https://github.com/llvm/llvm-project/pull/127822
[llvm-branch-commits] [llvm] [OpenMPIRBuilder] Add support for distribute-parallel-for/do constructs (PR #127818)
https://github.com/Meinersbur approved this pull request. LGTM https://github.com/llvm/llvm-project/pull/127818
[llvm-branch-commits] [clang] release/20.x: [CSKY] Default to unsigned char (PR #126436)
https://github.com/AaronBallman approved this pull request. LGTM though this definitely needs a release note so that anyone relying on the old behavior has some amount of notice (and a suggestion as to how to get the old behavior back). https://github.com/llvm/llvm-project/pull/126436
[llvm-branch-commits] [flang] [flang][OpenMP] Extend `do concurrent` mapping to multi-range loops (PR #127634)
@@ -102,6 +105,47 @@ mlir::Operation *findLoopIterationVarMemDecl(fir::DoLoopOp doLoop) { return result.getDefiningOp(); } +/// Collects the op(s) responsible for updating a loop's iteration variable with +/// the current iteration number. For example, for the input IR: +/// ``` +/// %i = fir.alloca i32 {bindc_name = "i"} +/// %i_decl:2 = hlfir.declare %i ... +/// ... +/// fir.do_loop %i_iv = %lb to %ub step %step unordered { +/// %1 = fir.convert %i_iv : (index) -> i32 +/// fir.store %1 to %i_decl#1 : !fir.ref +/// ... +/// } +/// ``` +/// this function would return the first 2 ops in the `fir.do_loop`'s region. +llvm::SetVector +extractIndVarUpdateOps(fir::DoLoopOp doLoop) { + mlir::Value indVar = doLoop.getInductionVar(); + llvm::SetVector indVarUpdateOps; + + llvm::SmallVector toProcess; + toProcess.push_back(indVar); + + llvm::DenseSet done; + + while (!toProcess.empty()) { +mlir::Value val = toProcess.back(); +toProcess.pop_back(); + +if (!done.insert(val).second) + continue; skatrak wrote: If I understand it correctly, this set could potentially be avoided if we checked below whether `indVarUpdateOps.insert(user)` actually inserted something before adding its results to the processing list. https://github.com/llvm/llvm-project/pull/127634 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [flang] [flang][OpenMP] Extend `do concurrent` mapping to multi-range loops (PR #127634)
@@ -30,6 +30,9 @@ namespace looputils { struct InductionVariableInfo { /// the operation allocating memory for iteration variable, mlir::Operation *iterVarMemDef; + /// the operation(s) updating the iteration variable with the current + /// iteration number. + llvm::SetVector indVarUpdateOps; skatrak wrote: Is there any reason why this has to be a set? It seems like an implementation detail that leaked out of `extractIndVarUpdateOps`. https://github.com/llvm/llvm-project/pull/127634
[llvm-branch-commits] [flang] [flang][OpenMP] Extend `do concurrent` mapping to multi-range loops (PR #127634)
https://github.com/skatrak commented: Thank you Kareem, some small comments from me. https://github.com/llvm/llvm-project/pull/127634
[llvm-branch-commits] [flang] [flang][OpenMP] Map simple `do concurrent` loops to OpenMP host constructs (PR #127633)
@@ -152,26 +231,136 @@ class DoConcurrentConversion : public mlir::OpConversionPattern { public: using mlir::OpConversionPattern::OpConversionPattern; - DoConcurrentConversion(mlir::MLIRContext *context, bool mapToDevice) - : OpConversionPattern(context), mapToDevice(mapToDevice) {} + DoConcurrentConversion(mlir::MLIRContext *context, bool mapToDevice, + llvm::DenseSet &concurrentLoopsToSkip) + : OpConversionPattern(context), mapToDevice(mapToDevice), +concurrentLoopsToSkip(concurrentLoopsToSkip) {} mlir::LogicalResult matchAndRewrite(fir::DoLoopOp doLoop, OpAdaptor adaptor, mlir::ConversionPatternRewriter &rewriter) const override { -looputils::LoopNest loopNest; +looputils::LoopNestToIndVarMap loopNest; bool hasRemainingNestedLoops = failed(looputils::collectLoopNest(doLoop, loopNest)); if (hasRemainingNestedLoops) mlir::emitWarning(doLoop.getLoc(), "Some `do concurent` loops are not perfectly-nested. " "These will be serialized."); -// TODO This will be filled in with the next PRs that upstreams the rest of -// the ROCm implementaion. +mlir::IRMapping mapper; +genParallelOp(doLoop.getLoc(), rewriter, loopNest, mapper); +mlir::omp::LoopNestOperands loopNestClauseOps; +genLoopNestClauseOps(doLoop.getLoc(), rewriter, loopNest, mapper, + loopNestClauseOps); + +mlir::omp::LoopNestOp ompLoopNest = +genWsLoopOp(rewriter, loopNest.back().first, mapper, loopNestClauseOps, +/*isComposite=*/mapToDevice); skatrak wrote: This will at the moment cause invalid MLIR to be produced (composite `omp.wsloop` with no other loop wrappers). We should probably emit a not yet implemented error if `mapToDevice=true` at the beginning of the function instead, unless you intend to merge host and target support at the same time. https://github.com/llvm/llvm-project/pull/127633 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
https://github.com/petar-avramovic updated https://github.com/llvm/llvm-project/pull/124298 >From 3f039f909b91cc5ad1f92208944e0b66447346df Mon Sep 17 00:00:00 2001 From: Petar Avramovic Date: Fri, 21 Feb 2025 14:33:44 +0100 Subject: [PATCH] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) Record all uses outside cycle with divergent exit during propagateTemporalDivergence in Uniformity analysis. With this list of candidates for temporal divergence lowering, excluding known lane masks from control flow intrinsics, find sources from inside the cycle that are not i1 and uniform. Temporal divergence lowering (non i1): create copy(v_mov) to vgpr, with implicit exec (to stop other passes from moving this copy outside of the cycle) and use this vgpr outside of the cycle instead of original uniform source. --- llvm/include/llvm/ADT/GenericUniformityImpl.h | 46 ++- llvm/include/llvm/ADT/GenericUniformityInfo.h | 5 ++ llvm/lib/Analysis/UniformityAnalysis.cpp | 3 +- .../lib/CodeGen/MachineUniformityAnalysis.cpp | 6 +-- .../AMDGPUGlobalISelDivergenceLowering.cpp| 44 +- .../lib/Target/AMDGPU/AMDGPURegBankSelect.cpp | 25 -- llvm/lib/Target/AMDGPU/SILowerI1Copies.h | 6 +++ ...divergent-i1-phis-no-lane-mask-merging.mir | 7 +-- ...ergence-divergent-i1-used-outside-loop.mir | 19 .../divergence-temporal-divergent-reg.ll | 18 .../divergence-temporal-divergent-reg.mir | 3 +- .../AMDGPU/GlobalISel/regbankselect-mui.ll| 17 +++ 12 files changed, 157 insertions(+), 42 deletions(-) diff --git a/llvm/include/llvm/ADT/GenericUniformityImpl.h b/llvm/include/llvm/ADT/GenericUniformityImpl.h index bd09f4fe43e08..6411fc9b4b974 100644 --- a/llvm/include/llvm/ADT/GenericUniformityImpl.h +++ b/llvm/include/llvm/ADT/GenericUniformityImpl.h @@ -51,7 +51,10 @@ #include "llvm/ADT/SmallPtrSet.h" #include "llvm/ADT/SparseBitVector.h" #include "llvm/ADT/StringExtras.h" +#include "llvm/CodeGen/MachineInstr.h" +#include "llvm/Support/Debug.h" #include "llvm/Support/raw_ostream.h" +#include #define 
DEBUG_TYPE "uniformity" @@ -342,6 +345,9 @@ template class GenericUniformityAnalysisImpl { typename SyncDependenceAnalysisT::DivergenceDescriptor; using BlockLabelMapT = typename SyncDependenceAnalysisT::BlockLabelMap; + using TemporalDivergenceTuple = + std::tuple; + GenericUniformityAnalysisImpl(const DominatorTreeT &DT, const CycleInfoT &CI, const TargetTransformInfo *TTI) : Context(CI.getSSAContext()), F(*Context.getFunction()), CI(CI), @@ -396,6 +402,11 @@ template class GenericUniformityAnalysisImpl { void print(raw_ostream &out) const; + SmallVector TemporalDivergenceList; + + void recordTemporalDivergence(ConstValueRefT, const InstructionT *, +const CycleT *); + protected: /// \brief Value/block pair representing a single phi input. struct PhiInput { @@ -1129,6 +1140,13 @@ void GenericUniformityAnalysisImpl::compute() { } } +template +void GenericUniformityAnalysisImpl::recordTemporalDivergence( +ConstValueRefT Val, const InstructionT *User, const CycleT *Cycle) { + TemporalDivergenceList.emplace_back(Val, const_cast(User), + Cycle); +} + template bool GenericUniformityAnalysisImpl::isAlwaysUniform( const InstructionT &Instr) const { @@ -1146,6 +1164,12 @@ template void GenericUniformityAnalysisImpl::print(raw_ostream &OS) const { bool haveDivergentArgs = false; + // When we print Value, LLVM IR instruction, we want to print extra new line. + // In LLVM IR print function for Value does not print new line at the end. + // In MIR print for MachineInstr prints new line at the end. + constexpr bool IsMIR = std::is_same::value; + std::string NewLine = IsMIR ? "" : "\n"; + // Control flow instructions may be divergent even if their inputs are // uniform. Thus, although exceedingly rare, it is possible to have a program // with no divergent values but with divergent control structures. 
@@ -1180,6 +1204,16 @@ void GenericUniformityAnalysisImpl::print(raw_ostream &OS) const { } } + if (!TemporalDivergenceList.empty()) { +OS << "\nTEMPORAL DIVERGENCE LIST:\n"; + +for (auto [Val, UseInst, Cycle] : TemporalDivergenceList) { + OS << "Value :" << Context.print(Val) << NewLine + << "Used by :" << Context.print(UseInst) << NewLine + << "Outside cycle :" << Cycle->print(Context) << "\n\n"; +} + } + for (auto &block : F) { OS << "\nBLOCK " << Context.print(&block) << '\n'; @@ -1191,7 +1225,7 @@ void GenericUniformityAnalysisImpl::print(raw_ostream &OS) const { OS << " DIVERGENT: "; else OS << " "; - OS << Context.print(value) << '\n'; + OS << Context.print(value) << NewLine; } OS << "TERMINATORS\n"; @@ -1203,13 +1237,21
[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering (non i1) (PR #124298)
@@ -188,6 +190,35 @@ void DivergenceLoweringHelper::constrainAsLaneMask(Incoming &In) { In.Reg = Copy.getReg(0); } +void replaceUsesOfRegInInstWith(Register Reg, MachineInstr *Inst, +Register NewReg) { + for (MachineOperand &Op : Inst->operands()) { +if (Op.isReg() && Op.getReg() == Reg) + Op.setReg(NewReg); + } +} + +bool DivergenceLoweringHelper::lowerTemporalDivergence() { + AMDGPU::IntrinsicLaneMaskAnalyzer ILMA(*MF); + + for (auto [Inst, UseInst, _] : MUI->getTemporalDivergenceList()) { petar-avramovic wrote: Updated types for recording TemporalDivergence and prints, improved new line prints. https://github.com/llvm/llvm-project/pull/124298
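The patch's compile-time selection of the line separator (LLVM-IR value printers emit no trailing newline, while MIR instruction printers do) can be sketched in isolation. `IRValue` and `MachineInstr` below are empty hypothetical stand-ins, not the real LLVM classes:

```cpp
#include <string>
#include <type_traits>

struct IRValue {};      // stand-in: its printer emits no trailing newline
struct MachineInstr {}; // stand-in: its printer already appends a newline

// Pick the extra separator at compile time so one templated print routine
// renders both kinds of instruction streams with uniform line breaks.
template <typename InstructionT>
std::string lineSeparator() {
  constexpr bool IsMIR = std::is_same_v<InstructionT, MachineInstr>;
  return IsMIR ? "" : "\n";
}
```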
[llvm-branch-commits] [flang] [flang][OpenMP] Handle "loop-local values" in `do concurrent` nests (PR #127635)
@@ -202,6 +202,57 @@ variables: `i` and `j`. These are locally allocated inside the parallel/target OpenMP region similar to what the single-range example in previous section shows. +### Data environment + +By default, variables that are used inside a `do concurrent` loop nest are +either treated as `shared` in case of mapping to `host`, or mapped into the +`target` region using a `map` clause in case of mapping to `device`. The only +exceptions to this are: + 1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that + case, for each IV, we allocate a local copy as shown by the mapping + examples above. + 1. any values that are from allocations outside the loop nest and used + exclusively inside of it. In such cases, a local privatized + copy is created in the OpenMP region to prevent multiple teams of threads + from accessing and destroying the same memory block, which causes runtime + issues. For an example of such cases, see + `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`. + +Implicit mapping detection (for mapping to the target device) is still quite +limited and work to make it smarter is underway for both OpenMP in general +and `do concurrent` mapping. skatrak wrote: Nit: There's no mapping support at this stage, so maybe state that to avoid misleading anyone reading it. https://github.com/llvm/llvm-project/pull/127635 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [flang] [flang][OpenMP] Handle "loop-local values" in `do concurrent` nests (PR #127635)
https://github.com/skatrak approved this pull request. I have a couple of nits, but LGTM otherwise. Thank you! https://github.com/llvm/llvm-project/pull/127635
[llvm-branch-commits] [flang] [flang][OpenMP] Handle "loop-local values" in `do concurrent` nests (PR #127635)
@@ -361,6 +361,64 @@ void sinkLoopIVArgs(mlir::ConversionPatternRewriter &rewriter, ++idx; } } + +/// Collects values that are local to a loop: "loop-local values". A loop-local +/// value is one that is used exclusively inside the loop but allocated outside +/// of it. This usually corresponds to temporary values that are used inside the +/// loop body for initialzing other variables for example. skatrak wrote: ```suggestion /// loop body for initializing other variables for example. ``` https://github.com/llvm/llvm-project/pull/127635 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [flang] [flang][OpenMP] Handle "loop-local values" in `do concurrent` nests (PR #127635)
https://github.com/skatrak edited https://github.com/llvm/llvm-project/pull/127635
[llvm-branch-commits] [flang] [flang][OpenMP] Handle "loop-local values" in `do concurrent` nests (PR #127635)
@@ -361,6 +361,64 @@ void sinkLoopIVArgs(mlir::ConversionPatternRewriter &rewriter, ++idx; } } + +/// Collects values that are local to a loop: "loop-local values". A loop-local +/// value is one that is used exclusively inside the loop but allocated outside +/// of it. This usually corresponds to temporary values that are used inside the +/// loop body for initialzing other variables for example. +/// +/// See `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90` for an +/// example of why we need this. +/// +/// \param [in] doLoop - the loop within which the function searches for values +/// used exclusively inside. +/// +/// \param [out] locals - the list of loop-local values detected for \p doLoop. +void collectLoopLocalValues(fir::DoLoopOp doLoop, +llvm::SetVector &locals) { + doLoop.walk([&](mlir::Operation *op) { +for (mlir::Value operand : op->getOperands()) { + if (locals.contains(operand)) +continue; + + bool isLocal = true; + + if (!mlir::isa_and_present(operand.getDefiningOp())) +continue; + + // Values defined inside the loop are not interesting since they do not + // need to be localized. + if (doLoop->isAncestor(operand.getDefiningOp())) +continue; + + for (auto *user : operand.getUsers()) { +if (!doLoop->isAncestor(user)) { + isLocal = false; + break; +} + } + + if (isLocal) +locals.insert(operand); skatrak wrote: Nit: I think something like this might be a bit more concise, but feel free to disagree. In that case, the `isLocal` declaration might be good to move it closer to the loop. ```suggestion auto users = operand.getUsers(); if (llvm::find_if(users, [&](mlir::Operation *user) { return !doLoop->isAncestor(user); }) == users.end()) locals.insert(operand); ``` https://github.com/llvm/llvm-project/pull/127635 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
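The reviewer's condition boils down to: a value is loop-local iff every one of its users is nested inside the loop. That reads naturally as a single standard-algorithm call; `Loop` below is a hypothetical stand-in for the MLIR op and its `isAncestor` check:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for the loop op: `isAncestor` mimics doLoop->isAncestor(user).
struct Loop {
  std::vector<int> containedOps; // ids of ops nested inside the loop body
  bool isAncestor(int op) const {
    return std::find(containedOps.begin(), containedOps.end(), op) !=
           containedOps.end();
  }
};

// A value is "loop-local" when all of its users live inside the loop.
bool isLoopLocal(const Loop &loop, const std::vector<int> &users) {
  return std::all_of(users.begin(), users.end(),
                     [&](int user) { return loop.isAncestor(user); });
}
```

Using `std::all_of` (or `llvm::all_of` in-tree) avoids the mutable `isLocal` flag and the manual early-`break` of the original loop.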
[llvm-branch-commits] [flang] [flang][OpenMP] Handle "loop-local values" in `do concurrent` nests (PR #127635)
@@ -202,6 +202,57 @@ variables: `i` and `j`. These are locally allocated inside the parallel/target OpenMP region similar to what the single-range example in previous section shows. +### Data environment + +By default, variables that are used inside a `do concurrent` loop nest are +either treated as `shared` in case of mapping to `host`, or mapped into the +`target` region using a `map` clause in case of mapping to `device`. The only +exceptions to this are: + 1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that + case, for each IV, we allocate a local copy as shown by the mapping + examples above. + 1. any values that are from allocations outside the loop nest and used + exclusively inside of it. In such cases, a local privatized + copy is created in the OpenMP region to prevent multiple teams of threads skatrak wrote: Nit: In the OpenMP parallel region? https://github.com/llvm/llvm-project/pull/127635 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [flang] [flang][OpenMP] Handle "loop-local values" in `do concurrent` nests (PR #127635)
@@ -0,0 +1,62 @@ +! Tests that "loop-local values" are properly handled by localizing them to the +! body of the loop nest. See `collectLoopLocalValues` and `localizeLoopLocalValue` +! for a definition of "loop-local values" and how they are handled. skatrak wrote: Nit: Maybe it's better to just generally point at the `DoConcurrentConversion` pass for more information, since this comment will easily get out of sync of the actual implementation otherwise. https://github.com/llvm/llvm-project/pull/127635
[llvm-branch-commits] [flang] [flang][OpenMP] Extend `do concurrent` mapping to multi-range loops (PR #127634)
@@ -102,6 +105,47 @@ mlir::Operation *findLoopIterationVarMemDecl(fir::DoLoopOp doLoop) { return result.getDefiningOp(); } +/// Collects the op(s) responsible for updating a loop's iteration variable with +/// the current iteration number. For example, for the input IR: +/// ``` +/// %i = fir.alloca i32 {bindc_name = "i"} +/// %i_decl:2 = hlfir.declare %i ... +/// ... +/// fir.do_loop %i_iv = %lb to %ub step %step unordered { +/// %1 = fir.convert %i_iv : (index) -> i32 +/// fir.store %1 to %i_decl#1 : !fir.ref +/// ... +/// } +/// ``` +/// this function would return the first 2 ops in the `fir.do_loop`'s region. +llvm::SetVector +extractIndVarUpdateOps(fir::DoLoopOp doLoop) { + mlir::Value indVar = doLoop.getInductionVar(); + llvm::SetVector indVarUpdateOps; + + llvm::SmallVector toProcess; + toProcess.push_back(indVar); + + llvm::DenseSet done; + + while (!toProcess.empty()) { +mlir::Value val = toProcess.back(); +toProcess.pop_back(); skatrak wrote: ```suggestion mlir::Value val = toProcess.pop_back_val(); ``` https://github.com/llvm/llvm-project/pull/127634 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [mlir] [MLIR][OpenMP] Host lowering of distribute-parallel-do/for (PR #127819)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127819 >From 33d5af4e9d8aaf9464aa74f5031d60001d77c610 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Tue, 18 Feb 2025 13:07:51 + Subject: [PATCH] [MLIR][OpenMP] Host lowering of distribute-parallel-do/for This patch adds support for translating composite `omp.parallel` + `omp.distribute` + `omp.wsloop` loops to LLVM IR on the host. This is done by passing an updated `WorksharingLoopType` to the call to `applyWorkshareLoop` associated to the lowering of the `omp.wsloop` operation, so that `__kmpc_dist_for_static_init` is called at runtime in place of `__kmpc_for_static_init`. Existing translation rules take care of creating a parallel region to hold the workshared and workdistributed loop. --- .../OpenMP/OpenMPToLLVMIRTranslation.cpp | 21 -- mlir/test/Target/LLVMIR/openmp-llvm.mlir | 65 +++ mlir/test/Target/LLVMIR/openmp-todo.mlir | 19 -- 3 files changed, 81 insertions(+), 24 deletions(-) diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp index 987f18fc7bc47..fbea278b2511f 100644 --- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp +++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp @@ -257,10 +257,6 @@ static LogicalResult checkImplementationStatus(Operation &op) { LogicalResult result = success(); llvm::TypeSwitch(op) .Case([&](omp::DistributeOp op) { -if (op.isComposite() && -isa_and_present(op.getNestedWrapper())) - result = op.emitError() << "not yet implemented: " - "composite omp.distribute + omp.wsloop"; checkAllocate(op, result); checkDistSchedule(op, result); checkOrder(op, result); @@ -1990,6 +1986,14 @@ convertOmpWsloop(Operation &opInst, llvm::IRBuilderBase &builder, bool isSimd = wsloopOp.getScheduleSimd(); bool loopNeedsBarrier = !wsloopOp.getNowait(); + // The only legal way for the direct parent to be omp.distribute is that this 
+ // represents 'distribute parallel do'. Otherwise, this is a regular + // worksharing loop. + llvm::omp::WorksharingLoopType workshareLoopType = + llvm::isa_and_present(opInst.getParentOp()) + ? llvm::omp::WorksharingLoopType::DistributeForStaticLoop + : llvm::omp::WorksharingLoopType::ForStaticLoop; + llvm::OpenMPIRBuilder::LocationDescription ompLoc(builder); llvm::Expected regionBlock = convertOmpOpRegions( wsloopOp.getRegion(), "omp.wsloop.region", builder, moduleTranslation); @@ -2005,7 +2009,8 @@ convertOmpWsloop(Operation &opInst, llvm::IRBuilderBase &builder, ompLoc.DL, loopInfo, allocaIP, loopNeedsBarrier, convertToScheduleKind(schedule), chunk, isSimd, scheduleMod == omp::ScheduleModifier::monotonic, - scheduleMod == omp::ScheduleModifier::nonmonotonic, isOrdered); + scheduleMod == omp::ScheduleModifier::nonmonotonic, isOrdered, + workshareLoopType); if (failed(handleError(wsloopIP, opInst))) return failure(); @@ -3791,6 +3796,12 @@ convertOmpDistribute(Operation &opInst, llvm::IRBuilderBase &builder, return regionBlock.takeError(); builder.SetInsertPoint(*regionBlock, (*regionBlock)->begin()); +// Skip applying a workshare loop below when translating 'distribute +// parallel do' (it's been already handled by this point while translating +// the nested omp.wsloop). +if (isa_and_present(distributeOp.getNestedWrapper())) + return llvm::Error::success(); + // TODO: Add support for clauses which are valid for DISTRIBUTE constructs. // Static schedule is the default. 
auto schedule = omp::ClauseScheduleKind::Static; diff --git a/mlir/test/Target/LLVMIR/openmp-llvm.mlir b/mlir/test/Target/LLVMIR/openmp-llvm.mlir index a5a490e527d79..d85b149c66811 100644 --- a/mlir/test/Target/LLVMIR/openmp-llvm.mlir +++ b/mlir/test/Target/LLVMIR/openmp-llvm.mlir @@ -3307,3 +3307,68 @@ llvm.func @distribute() { // CHECK: store i64 1, ptr %[[STRIDE]] // CHECK: %[[TID:.*]] = call i32 @__kmpc_global_thread_num({{.*}}) // CHECK: call void @__kmpc_for_static_init_{{.*}}(ptr @{{.*}}, i32 %[[TID]], i32 92, ptr %[[LASTITER]], ptr %[[LB]], ptr %[[UB]], ptr %[[STRIDE]], i64 1, i64 0) + +// - + +llvm.func @distribute_wsloop(%lb : i32, %ub : i32, %step : i32) { + omp.parallel { +omp.distribute { + omp.wsloop { +omp.loop_nest (%iv) : i32 = (%lb) to (%ub) step (%step) { + omp.yield +} + } {omp.composite} +} {omp.composite} +omp.terminator + } {omp.composite} + llvm.return +} + +// CHECK-LABEL: define void @distribute_wsloop +// CHECK: call void{{.*}}@__kmpc_fork_call({{.*}}, ptr @[[OUTLINED_PARALLEL:.*]], + +// CHECK: define internal void @[[OUTLINE
[llvm-branch-commits] [llvm] [OpenMPIRBuilder] Split calculation of canonical loop trip count, NFC (PR #127820)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127820 >From 082d8e12a622e2315dd4503ce460f9a0e6f29007 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Tue, 18 Feb 2025 14:19:30 + Subject: [PATCH] [OpenMPIRBuilder] Split calculation of canonical loop trip count, NFC This patch splits off the calculation of canonical loop trip counts from the creation of canonical loops. This makes it possible to reuse this logic to, for instance, populate the `__tgt_target_kernel` runtime call for SPMD kernels. This feature is used to simplify one of the existing OpenMPIRBuilder tests. --- .../llvm/Frontend/OpenMP/OMPIRBuilder.h | 38 +++ llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | 27 - .../Frontend/OpenMPIRBuilderTest.cpp | 16 ++-- 3 files changed, 52 insertions(+), 29 deletions(-) diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h index 9ad85413acd34..207ca7fb05f62 100644 --- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h +++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h @@ -728,13 +728,12 @@ class OpenMPIRBuilder { LoopBodyGenCallbackTy BodyGenCB, Value *TripCount, const Twine &Name = "loop"); - /// Generator for the control flow structure of an OpenMP canonical loop. + /// Calculate the trip count of a canonical loop. /// - /// Instead of a logical iteration space, this allows specifying user-defined - /// loop counter values using increment, upper- and lower bounds. To - /// disambiguate the terminology when counting downwards, instead of lower - /// bounds we use \p Start for the loop counter value in the first body - /// iteration. + /// This allows specifying user-defined loop counter values using increment, + /// upper- and lower bounds. To disambiguate the terminology when counting + /// downwards, instead of lower bounds we use \p Start for the loop counter + /// value in the first body iteration. 
/// /// Consider the following limitations: /// @@ -758,7 +757,32 @@ class OpenMPIRBuilder { /// /// for (int i = 0; i < 42; i -= 1u) /// - // + /// \param Loc The insert and source location description. + /// \param Start Value of the loop counter for the first iterations. + /// \param Stop Loop counter values past this will stop the loop. + /// \param Step Loop counter increment after each iteration; negative + /// means counting down. + /// \param IsSigned Whether Start, Stop and Step are signed integers. + /// \param InclusiveStop Whether \p Stop itself is a valid value for the loop + /// counter. + /// \param Name Base name used to derive instruction names. + /// + /// \returns The value holding the calculated trip count. + Value *calculateCanonicalLoopTripCount(const LocationDescription &Loc, + Value *Start, Value *Stop, Value *Step, + bool IsSigned, bool InclusiveStop, + const Twine &Name = "loop"); + + /// Generator for the control flow structure of an OpenMP canonical loop. + /// + /// Instead of a logical iteration space, this allows specifying user-defined + /// loop counter values using increment, upper- and lower bounds. To + /// disambiguate the terminology when counting downwards, instead of lower + /// bounds we use \p Start for the loop counter value in the first body + /// + /// It calls \see calculateCanonicalLoopTripCount for trip count calculations, + /// so limitations of that method apply here as well. + /// /// \param Loc The insert and source location description. /// \param BodyGenCB Callback that will generate the loop body code. /// \param Start Value of the loop counter for the first iterations. 
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp index 7788897fc0795..eee6e3e54d615 100644 --- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp +++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp @@ -4059,10 +4059,9 @@ OpenMPIRBuilder::createCanonicalLoop(const LocationDescription &Loc, return CL; } -Expected OpenMPIRBuilder::createCanonicalLoop( -const LocationDescription &Loc, LoopBodyGenCallbackTy BodyGenCB, -Value *Start, Value *Stop, Value *Step, bool IsSigned, bool InclusiveStop, -InsertPointTy ComputeIP, const Twine &Name) { +Value *OpenMPIRBuilder::calculateCanonicalLoopTripCount( +const LocationDescription &Loc, Value *Start, Value *Stop, Value *Step, +bool IsSigned, bool InclusiveStop, const Twine &Name) { // Consider the following difficulties (assuming 8-bit signed integers): // * Adding \p Step to the loop counter which passes \p Stop may overflow: @@ -4075,9 +4074,7 @@ Expected OpenMPIRBuilder::createCanonicalLoop( assert(IndVarTy ==
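The arithmetic being factored out into `calculateCanonicalLoopTripCount` can be illustrated for the unsigned, counting-up case. This is a simplifying assumption: the real helper also handles signed counters and negative (downward) steps, and `step > 0` is assumed here:

```cpp
#include <cassert>
#include <cstdint>

// Trip count of `for (i = start; i < stop (or <= stop); i += step)`,
// computed without overflowing: the naive ceil-div form
// (stop - start + step - 1) / step can wrap, so subtract 1 from the
// span instead of adding step - 1 to it.
uint64_t canonicalTripCount(uint64_t start, uint64_t stop, uint64_t step,
                            bool inclusiveStop) {
  // Empty loop: the first iteration's condition already fails.
  if (start > stop || (start == stop && !inclusiveStop))
    return 0;
  uint64_t span = stop - start; // cannot underflow after the guard
  if (inclusiveStop)
    return span / step + 1;     // i <= stop
  return (span - 1) / step + 1; // i < stop, overflow-safe ceiling division
}
```

Splitting this computation out of loop construction is what lets a caller (e.g. the SPMD `__tgt_target_kernel` setup mentioned in the commit message) reuse the trip count without also materializing the loop.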
[llvm-branch-commits] [mlir] [MLIR][OpenMP] Host lowering of standalone distribute (PR #127817)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127817 >From 55089ba79ac352b05553d3d930ffca3f94562dc1 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Tue, 18 Feb 2025 11:22:43 + Subject: [PATCH] [MLIR][OpenMP] Host lowering of standalone distribute This patch adds MLIR to LLVM IR translation support for standalone `omp.distribute` operations, as well as `distribute simd` through ignoring SIMD information (similarly to `do/for simd`). Co-authored-by: Dominik Adamski --- .../OpenMP/OpenMPToLLVMIRTranslation.cpp | 83 +++ mlir/test/Target/LLVMIR/openmp-llvm.mlir | 37 + mlir/test/Target/LLVMIR/openmp-todo.mlir | 66 ++- 3 files changed, 183 insertions(+), 3 deletions(-) diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp index 1344f992c116e..87b690912620b 100644 --- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp +++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp @@ -161,6 +161,10 @@ static LogicalResult checkImplementationStatus(Operation &op) { if (op.getDevice()) result = todo("device"); }; + auto checkDistSchedule = [&todo](auto op, LogicalResult &result) { +if (op.getDistScheduleChunkSize()) + result = todo("dist_schedule with chunk_size"); + }; auto checkHasDeviceAddr = [&todo](auto op, LogicalResult &result) { if (!op.getHasDeviceAddrVars().empty()) result = todo("has_device_addr"); @@ -252,6 +256,16 @@ static LogicalResult checkImplementationStatus(Operation &op) { LogicalResult result = success(); llvm::TypeSwitch(op) + .Case([&](omp::DistributeOp op) { +if (op.isComposite() && +isa_and_present(op.getNestedWrapper())) + result = op.emitError() << "not yet implemented: " + "composite omp.distribute + omp.wsloop"; +checkAllocate(op, result); +checkDistSchedule(op, result); +checkOrder(op, result); +checkPrivate(op, result); + }) .Case([&](omp::OrderedRegionOp op) { checkParLevelSimd(op, 
result); }) .Case([&](omp::SectionsOp op) { checkAllocate(op, result); @@ -3754,6 +3768,72 @@ convertOmpTargetData(Operation *op, llvm::IRBuilderBase &builder, return success(); } +static LogicalResult +convertOmpDistribute(Operation &opInst, llvm::IRBuilderBase &builder, + LLVM::ModuleTranslation &moduleTranslation) { + llvm::OpenMPIRBuilder *ompBuilder = moduleTranslation.getOpenMPBuilder(); + auto distributeOp = cast(opInst); + if (failed(checkImplementationStatus(opInst))) +return failure(); + + using InsertPointTy = llvm::OpenMPIRBuilder::InsertPointTy; + auto bodyGenCB = [&](InsertPointTy allocaIP, + InsertPointTy codeGenIP) -> llvm::Error { +// Save the alloca insertion point on ModuleTranslation stack for use in +// nested regions. +LLVM::ModuleTranslation::SaveStack frame( +moduleTranslation, allocaIP); + +// DistributeOp has only one region associated with it. +builder.restoreIP(codeGenIP); + +llvm::OpenMPIRBuilder *ompBuilder = moduleTranslation.getOpenMPBuilder(); +llvm::OpenMPIRBuilder::LocationDescription ompLoc(builder); +llvm::Expected regionBlock = +convertOmpOpRegions(distributeOp.getRegion(), "omp.distribute.region", +builder, moduleTranslation); +if (!regionBlock) + return regionBlock.takeError(); +builder.SetInsertPoint(*regionBlock, (*regionBlock)->begin()); + +// TODO: Add support for clauses which are valid for DISTRIBUTE constructs. +// Static schedule is the default. 
+auto schedule = omp::ClauseScheduleKind::Static; +bool isOrdered = false; +std::optional scheduleMod; +bool isSimd = false; +llvm::omp::WorksharingLoopType workshareLoopType = +llvm::omp::WorksharingLoopType::DistributeStaticLoop; +bool loopNeedsBarrier = false; +llvm::Value *chunk = nullptr; + +llvm::CanonicalLoopInfo *loopInfo = findCurrentLoopInfo(moduleTranslation); +llvm::OpenMPIRBuilder::InsertPointOrErrorTy wsloopIP = +ompBuilder->applyWorkshareLoop( +ompLoc.DL, loopInfo, allocaIP, loopNeedsBarrier, +convertToScheduleKind(schedule), chunk, isSimd, +scheduleMod == omp::ScheduleModifier::monotonic, +scheduleMod == omp::ScheduleModifier::nonmonotonic, isOrdered, +workshareLoopType); + +if (!wsloopIP) + return wsloopIP.takeError(); +return llvm::Error::success(); + }; + + llvm::OpenMPIRBuilder::InsertPointTy allocaIP = + findAllocaInsertPoint(builder, moduleTranslation); + llvm::OpenMPIRBuilder::LocationDescription ompLoc(builder); + llvm::OpenMPIRBuilder::InsertPointOrErrorTy afterIP = + ompBuilder->createDistribute(ompLoc,
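For readers unfamiliar with the construct being lowered above, here is a minimal C++ sketch of a standalone `distribute` loop nested in `teams`, the kind of code that produces an `omp.distribute` operation. This is an invented illustration, not taken from the patch; when compiled without OpenMP support the pragmas are ignored and the loop simply runs serially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Doubles every element using a standalone `distribute` construct nested
// inside `teams`: loop iterations are divided among the teams of the league.
// Each iteration writes a distinct element, so there are no races, and a
// serial fallback (pragmas ignored) computes the identical result.
std::vector<int> distribute_double(std::vector<int> v) {
  #pragma omp teams
  #pragma omp distribute
  for (std::size_t i = 0; i < v.size(); ++i)
    v[i] *= 2;
  return v;
}
```

A `distribute simd` variant would add `simd` after `distribute`; as the commit message notes, the translation currently ignores the SIMD information in that case.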
[llvm-branch-commits] [llvm] [OpenMPIRBuilder] Add support for distribute-parallel-for/do constructs (PR #127818)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127818 >From ba9ea8c2cbe7848ca36c92e4c3ee464bcf0e6c39 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Tue, 18 Feb 2025 12:04:53 + Subject: [PATCH] [OpenMPIRBuilder] Add support for distribute-parallel-for/do constructs This patch adds codegen for `kmpc_dist_for_static_init` runtime calls, used to support worksharing a single loop across teams and threads. This can be used to implement `distribute parallel for/do` support. --- llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | 34 --- 1 file changed, 30 insertions(+), 4 deletions(-) diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp index 9e380bf2d3dbe..7788897fc0795 100644 --- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp +++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp @@ -4130,6 +4130,23 @@ Expected OpenMPIRBuilder::createCanonicalLoop( return createCanonicalLoop(LoopLoc, BodyGen, TripCount, Name); } +// Returns an LLVM function to call for initializing loop bounds using OpenMP +// static scheduling for composite `distribute parallel for` depending on +// `type`. Only i32 and i64 are supported by the runtime. Always interpret +// integers as unsigned similarly to CanonicalLoopInfo. +static FunctionCallee +getKmpcDistForStaticInitForType(Type *Ty, Module &M, +OpenMPIRBuilder &OMPBuilder) { + unsigned Bitwidth = Ty->getIntegerBitWidth(); + if (Bitwidth == 32) +return OMPBuilder.getOrCreateRuntimeFunction( +M, omp::RuntimeFunction::OMPRTL___kmpc_dist_for_static_init_4u); + if (Bitwidth == 64) +return OMPBuilder.getOrCreateRuntimeFunction( +M, omp::RuntimeFunction::OMPRTL___kmpc_dist_for_static_init_8u); + llvm_unreachable("unknown OpenMP loop iterator bitwidth"); +} + // Returns an LLVM function to call for initializing loop bounds using OpenMP // static scheduling depending on `type`. Only i32 and i64 are supported by the // runtime. 
Always interpret integers as unsigned similarly to @@ -4164,7 +4181,10 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::applyStaticWorkshareLoop( // Declare useful OpenMP runtime functions. Value *IV = CLI->getIndVar(); Type *IVTy = IV->getType(); - FunctionCallee StaticInit = getKmpcForStaticInitForType(IVTy, M, *this); + FunctionCallee StaticInit = + LoopType == WorksharingLoopType::DistributeForStaticLoop + ? getKmpcDistForStaticInitForType(IVTy, M, *this) + : getKmpcForStaticInitForType(IVTy, M, *this); FunctionCallee StaticFini = getOrCreateRuntimeFunction(M, omp::OMPRTL___kmpc_for_static_fini); @@ -4200,9 +4220,15 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::applyStaticWorkshareLoop( // Call the "init" function and update the trip count of the loop with the // value it produced. - Builder.CreateCall(StaticInit, - {SrcLoc, ThreadNum, SchedulingType, PLastIter, PLowerBound, - PUpperBound, PStride, One, Zero}); + SmallVector Args( + {SrcLoc, ThreadNum, SchedulingType, PLastIter, PLowerBound, PUpperBound}); + if (LoopType == WorksharingLoopType::DistributeForStaticLoop) { +Value *PDistUpperBound = +Builder.CreateAlloca(IVTy, nullptr, "p.distupperbound"); +Args.push_back(PDistUpperBound); + } + Args.append({PStride, One, Zero}); + Builder.CreateCall(StaticInit, Args); Value *LowerBound = Builder.CreateLoad(IVTy, PLowerBound); Value *InclusiveUpperBound = Builder.CreateLoad(IVTy, PUpperBound); Value *TripCountMinusOne = Builder.CreateSub(InclusiveUpperBound, LowerBound); ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
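To sketch what the new `__kmpc_dist_for_static_init_{4u,8u}` entry points enable at the source level, the following hypothetical C++ loop uses the composite `distribute parallel for` construct: iterations are worksharing-divided first across teams and then across the threads of each team. The example is illustrative only and is serially equivalent when the pragmas are ignored.

```cpp
#include <cstddef>
#include <vector>

// AXPY-style update workshared across both teams and threads via the
// composite `distribute parallel for` construct. Every iteration touches a
// distinct element of `y`, so the parallel and serial executions agree.
void dist_par_for_axpy(std::vector<float> &y, const std::vector<float> &x,
                       float a) {
  #pragma omp teams distribute parallel for
  for (std::size_t i = 0; i < y.size(); ++i)
    y[i] += a * x[i];
}
```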
[llvm-branch-commits] [llvm] [OpenMPIRBuilder] Split calculation of canonical loop trip count, NFC (PR #127820)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127820 >From 033091e14c76c3e9c7adb0deae2451a298a7fe9e Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Tue, 18 Feb 2025 14:19:30 + Subject: [PATCH] [OpenMPIRBuilder] Split calculation of canonical loop trip count, NFC This patch splits off the calculation of canonical loop trip counts from the creation of canonical loops. This makes it possible to reuse this logic to, for instance, populate the `__tgt_target_kernel` runtime call for SPMD kernels. This feature is used to simplify one of the existing OpenMPIRBuilder tests. --- .../llvm/Frontend/OpenMP/OMPIRBuilder.h | 38 +++ llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | 27 - .../Frontend/OpenMPIRBuilderTest.cpp | 16 ++-- 3 files changed, 52 insertions(+), 29 deletions(-) diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h index 9ad85413acd34..207ca7fb05f62 100644 --- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h +++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h @@ -728,13 +728,12 @@ class OpenMPIRBuilder { LoopBodyGenCallbackTy BodyGenCB, Value *TripCount, const Twine &Name = "loop"); - /// Generator for the control flow structure of an OpenMP canonical loop. + /// Calculate the trip count of a canonical loop. /// - /// Instead of a logical iteration space, this allows specifying user-defined - /// loop counter values using increment, upper- and lower bounds. To - /// disambiguate the terminology when counting downwards, instead of lower - /// bounds we use \p Start for the loop counter value in the first body - /// iteration. + /// This allows specifying user-defined loop counter values using increment, + /// upper- and lower bounds. To disambiguate the terminology when counting + /// downwards, instead of lower bounds we use \p Start for the loop counter + /// value in the first body iteration. 
/// /// Consider the following limitations: /// @@ -758,7 +757,32 @@ class OpenMPIRBuilder { /// /// for (int i = 0; i < 42; i -= 1u) /// - // + /// \param Loc The insert and source location description. + /// \param Start Value of the loop counter for the first iterations. + /// \param Stop Loop counter values past this will stop the loop. + /// \param Step Loop counter increment after each iteration; negative + /// means counting down. + /// \param IsSigned Whether Start, Stop and Step are signed integers. + /// \param InclusiveStop Whether \p Stop itself is a valid value for the loop + /// counter. + /// \param Name Base name used to derive instruction names. + /// + /// \returns The value holding the calculated trip count. + Value *calculateCanonicalLoopTripCount(const LocationDescription &Loc, + Value *Start, Value *Stop, Value *Step, + bool IsSigned, bool InclusiveStop, + const Twine &Name = "loop"); + + /// Generator for the control flow structure of an OpenMP canonical loop. + /// + /// Instead of a logical iteration space, this allows specifying user-defined + /// loop counter values using increment, upper- and lower bounds. To + /// disambiguate the terminology when counting downwards, instead of lower + /// bounds we use \p Start for the loop counter value in the first body + /// + /// It calls \see calculateCanonicalLoopTripCount for trip count calculations, + /// so limitations of that method apply here as well. + /// /// \param Loc The insert and source location description. /// \param BodyGenCB Callback that will generate the loop body code. /// \param Start Value of the loop counter for the first iterations. 
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp index 7788897fc0795..eee6e3e54d615 100644 --- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp +++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp @@ -4059,10 +4059,9 @@ OpenMPIRBuilder::createCanonicalLoop(const LocationDescription &Loc, return CL; } -Expected OpenMPIRBuilder::createCanonicalLoop( -const LocationDescription &Loc, LoopBodyGenCallbackTy BodyGenCB, -Value *Start, Value *Stop, Value *Step, bool IsSigned, bool InclusiveStop, -InsertPointTy ComputeIP, const Twine &Name) { +Value *OpenMPIRBuilder::calculateCanonicalLoopTripCount( +const LocationDescription &Loc, Value *Start, Value *Stop, Value *Step, +bool IsSigned, bool InclusiveStop, const Twine &Name) { // Consider the following difficulties (assuming 8-bit signed integers): // * Adding \p Step to the loop counter which passes \p Stop may overflow: @@ -4075,9 +4074,7 @@ Expected OpenMPIRBuilder::createCanonicalLoop( assert(IndVarTy ==
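The arithmetic being factored out into `calculateCanonicalLoopTripCount` can be illustrated with a small standalone helper. This is a simplified model covering only the unsigned, counting-up case; the real builder method also handles signed and downward-counting loops and emits IR instructions rather than computing a value directly.

```cpp
#include <cstdint>

// Number of iterations of the canonical loop
//   for (uint64_t i = start; i <  stop; i += step)   (exclusive stop)
//   for (uint64_t i = start; i <= stop; i += step)   (inclusive stop)
// Integers are interpreted as unsigned, mirroring CanonicalLoopInfo.
std::uint64_t tripCount(std::uint64_t start, std::uint64_t stop,
                        std::uint64_t step, bool inclusiveStop) {
  if (stop < start || (!inclusiveStop && stop == start))
    return 0;                  // empty iteration space
  std::uint64_t span = stop - start;
  if (!inclusiveStop)
    span -= 1;                 // last reachable counter value is stop - 1
  return span / step + 1;
}
```

Splitting this computation out lets the same logic populate, for instance, the trip-count argument of a `__tgt_target_kernel` call for SPMD kernels without materializing the loop itself.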
[llvm-branch-commits] [mlir] [MLIR][OpenMP] Support target SPMD (PR #127821)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127821 >From 32e696f446082a50b60032f1f5b656e494db5570 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Wed, 19 Feb 2025 14:41:12 + Subject: [PATCH 1/2] [MLIR][OpenMP] Support target SPMD This patch implements MLIR to LLVM IR translation of host-evaluated loop bounds, completing initial support for `target teams distribute parallel do [simd]` and `target teams distribute [simd]`. --- .../OpenMP/OpenMPToLLVMIRTranslation.cpp | 83 .../Target/LLVMIR/openmp-target-spmd.mlir | 96 +++ mlir/test/Target/LLVMIR/openmp-todo.mlir | 24 - 3 files changed, 159 insertions(+), 44 deletions(-) create mode 100644 mlir/test/Target/LLVMIR/openmp-target-spmd.mlir diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp index adac89988a2da..a7d2a00a1bd90 100644 --- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp +++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp @@ -173,15 +173,6 @@ static LogicalResult checkImplementationStatus(Operation &op) { if (op.getHint()) op.emitWarning("hint clause discarded"); }; - auto checkHostEval = [](auto op, LogicalResult &result) { -// Host evaluated clauses are supported, except for loop bounds. 
-for (BlockArgument arg : - cast(*op).getHostEvalBlockArgs()) - for (Operation *user : arg.getUsers()) -if (isa(user)) - result = op.emitError("not yet implemented: host evaluation of loop " -"bounds in omp.target operation"); - }; auto checkInReduction = [&todo](auto op, LogicalResult &result) { if (!op.getInReductionVars().empty() || op.getInReductionByref() || op.getInReductionSyms()) @@ -318,7 +309,6 @@ static LogicalResult checkImplementationStatus(Operation &op) { checkBare(op, result); checkDevice(op, result); checkHasDeviceAddr(op, result); -checkHostEval(op, result); checkInReduction(op, result); checkIsDevicePtr(op, result); checkPrivate(op, result); @@ -4058,9 +4048,13 @@ createDeviceArgumentAccessor(MapInfoData &mapData, llvm::Argument &arg, /// /// Loop bounds and steps are only optionally populated, if output vectors are /// provided. -static void extractHostEvalClauses(omp::TargetOp targetOp, Value &numThreads, - Value &numTeamsLower, Value &numTeamsUpper, - Value &threadLimit) { +static void +extractHostEvalClauses(omp::TargetOp targetOp, Value &numThreads, + Value &numTeamsLower, Value &numTeamsUpper, + Value &threadLimit, + llvm::SmallVectorImpl *lowerBounds = nullptr, + llvm::SmallVectorImpl *upperBounds = nullptr, + llvm::SmallVectorImpl *steps = nullptr) { auto blockArgIface = llvm::cast(*targetOp); for (auto item : llvm::zip_equal(targetOp.getHostEvalVars(), blockArgIface.getHostEvalBlockArgs())) { @@ -4085,11 +4079,26 @@ static void extractHostEvalClauses(omp::TargetOp targetOp, Value &numThreads, llvm_unreachable("unsupported host_eval use"); }) .Case([&](omp::LoopNestOp loopOp) { -// TODO: Extract bounds and step values. Currently, this cannot be -// reached because translation would have been stopped earlier as a -// result of `checkImplementationStatus` detecting and reporting -// this situation. 
-llvm_unreachable("unsupported host_eval use"); +auto processBounds = +[&](OperandRange opBounds, +llvm::SmallVectorImpl *outBounds) -> bool { + bool found = false; + for (auto [i, lb] : llvm::enumerate(opBounds)) { +if (lb == blockArg) { + found = true; + if (outBounds) +(*outBounds)[i] = hostEvalVar; +} + } + return found; +}; +bool found = +processBounds(loopOp.getLoopLowerBounds(), lowerBounds); +found = processBounds(loopOp.getLoopUpperBounds(), upperBounds) || +found; +found = processBounds(loopOp.getLoopSteps(), steps) || found; +if (!found) + llvm_unreachable("unsupported host_eval use"); }) .Default([](Operation *) { llvm_unreachable("unsupported host_eval use"); @@ -4226,6 +4235,7 @@ initTargetDefaultAttrs(omp::TargetOp targetOp, combinedMaxThreadsVal = maxThreadsVal; // Update kernel bounds structure for the `OpenMPIRBuilder` to use. + attrs.ExecFlags = targetOp.getKernelExecFla
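The host-evaluated loop bounds handled above matter because, for a target SPMD kernel, the host must be able to compute the loop trip count before launching the kernel. A hypothetical C++ source construct of this shape is sketched below (names are invented; without OpenMP the pragma is ignored and the loop runs serially with the same result).

```cpp
#include <vector>

// Target SPMD kernel whose loop bounds (0, n, step 1) are visible on the
// host: an OpenMP compiler can evaluate them before launch to derive the
// kernel trip count. Each iteration writes a distinct element.
void target_spmd_init(std::vector<int> &v, int n) {
  int *data = v.data();
  #pragma omp target teams distribute parallel for map(tofrom : data[0 : n])
  for (int i = 0; i < n; ++i)
    data[i] = i * i;
}
```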
[llvm-branch-commits] [llvm] [MachineBasicBlock][NFC] Decouple SplitCriticalEdges from pass manager (PR #128151)
https://github.com/cdevadas approved this pull request. https://github.com/llvm/llvm-project/pull/128151
[llvm-branch-commits] [llvm] [CodeGen][NewPM] Port MachineSink to NPM (PR #115434)
@@ -189,30 +198,19 @@ class MachineSinking : public MachineFunctionPass { bool EnableSinkAndFold; public: - static char ID; // Pass identification - - MachineSinking() : MachineFunctionPass(ID) { -initializeMachineSinkingPass(*PassRegistry::getPassRegistry()); - } - - bool runOnMachineFunction(MachineFunction &MF) override; - - void getAnalysisUsage(AnalysisUsage &AU) const override { -MachineFunctionPass::getAnalysisUsage(AU); -AU.addRequired(); -AU.addRequired(); -AU.addRequired(); -AU.addRequired(); -AU.addRequired(); -AU.addPreserved(); -AU.addPreserved(); -AU.addRequired(); -if (UseBlockFreqInfo) - AU.addRequired(); -AU.addRequired(); - } - - void releaseMemory() override { + MachineSinking(bool EnableSinkAndFold, MachineDominatorTree *DT, + MachinePostDominatorTree *PDT, LiveVariables *LV, + MachineLoopInfo *MLI, SlotIndexes *SI, LiveIntervals *LIS, + MachineCycleInfo *CI, ProfileSummaryInfo *PSI, + MachineBlockFrequencyInfo *MBFI, + const MachineBranchProbabilityInfo *MBPI, AliasAnalysis *AA) + : DT(DT), PDT(PDT), CI(CI), PSI(PSI), MBFI(MBFI), MBPI(MBPI), AA(AA), cdevadas wrote: Should have used `RequiredAnalyses` instead of this long list of arguments. https://github.com/llvm/llvm-project/pull/115434 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
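The refactor the reviewer suggests, bundling the many analysis pointers into a single aggregate instead of a long constructor parameter list, can be sketched in plain C++. The types below are stand-ins, not the actual LLVM `RequiredAnalyses` definition.

```cpp
// Group the analysis pointers one aggregate; adding a new analysis then
// means one new field rather than editing every constructor call site.
struct RequiredAnalyses {
  int *domTree = nullptr;      // stand-ins for MachineDominatorTree, etc.
  int *postDomTree = nullptr;
  int *loopInfo = nullptr;
};

class MachineSinkingModel {
  bool enableSinkAndFold;
  RequiredAnalyses analyses;

public:
  MachineSinkingModel(bool enableSinkAndFold, const RequiredAnalyses &a)
      : enableSinkAndFold(enableSinkAndFold), analyses(a) {}

  bool hasDomTree() const { return analyses.domTree != nullptr; }
  bool sinkAndFoldEnabled() const { return enableSinkAndFold; }
};
```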
[llvm-branch-commits] [llvm] [NFC][MachineBasicBlock] Decouple SplitCriticalEdges from pass manager (PR #128151)
https://github.com/optimisan edited https://github.com/llvm/llvm-project/pull/128151
[llvm-branch-commits] [flang] [Flang][OpenMP] Allow host evaluation of loop bounds for distribute (PR #127822)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127822 >From c3db0a39f6515911deece48d61e7ee5bfb7219b1 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Wed, 19 Feb 2025 15:15:01 + Subject: [PATCH] [Flang][OpenMP] Allow host evaluation of loop bounds for distribute This patch adds `target teams distribute [simd]` and equivalent construct nests to the list of cases where loop bounds can be evaluated in the host, as they represent Generic-SPMD kernels for which the trip count must also be evaluated in advance to the kernel call. --- flang/lib/Lower/OpenMP/OpenMP.cpp | 12 +-- flang/test/Lower/OpenMP/host-eval.f90 | 103 ++ 2 files changed, 110 insertions(+), 5 deletions(-) diff --git a/flang/lib/Lower/OpenMP/OpenMP.cpp b/flang/lib/Lower/OpenMP/OpenMP.cpp index bd794033cdf11..8c80453610473 100644 --- a/flang/lib/Lower/OpenMP/OpenMP.cpp +++ b/flang/lib/Lower/OpenMP/OpenMP.cpp @@ -562,8 +562,11 @@ static void processHostEvalClauses(lower::AbstractConverter &converter, [[fallthrough]]; case OMPD_distribute_parallel_do: case OMPD_distribute_parallel_do_simd: - cp.processCollapse(loc, eval, hostInfo.ops, hostInfo.iv); cp.processNumThreads(stmtCtx, hostInfo.ops); + [[fallthrough]]; +case OMPD_distribute: +case OMPD_distribute_simd: + cp.processCollapse(loc, eval, hostInfo.ops, hostInfo.iv); break; // Cases where 'teams' clauses might be present, and target SPMD is @@ -573,10 +576,8 @@ static void processHostEvalClauses(lower::AbstractConverter &converter, [[fallthrough]]; case OMPD_target_teams: cp.processNumTeams(stmtCtx, hostInfo.ops); - processSingleNestedIf([](Directive nestedDir) { -return nestedDir == OMPD_distribute_parallel_do || - nestedDir == OMPD_distribute_parallel_do_simd; - }); + processSingleNestedIf( + [](Directive nestedDir) { return topDistributeSet.test(nestedDir); }); break; // Cases where only 'teams' host-evaluated clauses might be present. 
@@ -586,6 +587,7 @@ static void processHostEvalClauses(lower::AbstractConverter &converter, [[fallthrough]]; case OMPD_target_teams_distribute: case OMPD_target_teams_distribute_simd: + cp.processCollapse(loc, eval, hostInfo.ops, hostInfo.iv); cp.processNumTeams(stmtCtx, hostInfo.ops); break; diff --git a/flang/test/Lower/OpenMP/host-eval.f90 b/flang/test/Lower/OpenMP/host-eval.f90 index 32c52462b86a7..65258c91e5daf 100644 --- a/flang/test/Lower/OpenMP/host-eval.f90 +++ b/flang/test/Lower/OpenMP/host-eval.f90 @@ -155,3 +155,106 @@ subroutine distribute_parallel_do_simd() !$omp end distribute parallel do simd !$omp end teams end subroutine distribute_parallel_do_simd + +! BOTH-LABEL: func.func @_QPdistribute +subroutine distribute() + ! BOTH: omp.target + + ! HOST-SAME: host_eval(%{{.*}} -> %[[LB:.*]], %{{.*}} -> %[[UB:.*]], %{{.*}} -> %[[STEP:.*]] : i32, i32, i32) + + ! DEVICE-NOT: host_eval({{.*}}) + ! DEVICE-SAME: { + + ! BOTH: omp.teams + !$omp target teams + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.loop_nest + + ! HOST-SAME: (%{{.*}}) : i32 = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) + !$omp distribute + do i=1,10 +call foo() + end do + !$omp end distribute + !$omp end target teams + + ! BOTH: omp.target + ! BOTH-NOT: host_eval({{.*}}) + ! BOTH-SAME: { + ! BOTH: omp.teams + !$omp target teams + call foo() !< Prevents this from being Generic-SPMD. + + ! BOTH: omp.distribute + !$omp distribute + do i=1,10 +call foo() + end do + !$omp end distribute + !$omp end target teams + + ! BOTH: omp.teams + !$omp teams + + ! BOTH: omp.distribute + !$omp distribute + do i=1,10 +call foo() + end do + !$omp end distribute + !$omp end teams +end subroutine distribute + +! BOTH-LABEL: func.func @_QPdistribute_simd +subroutine distribute_simd() + ! BOTH: omp.target + + ! HOST-SAME: host_eval(%{{.*}} -> %[[LB:.*]], %{{.*}} -> %[[UB:.*]], %{{.*}} -> %[[STEP:.*]] : i32, i32, i32) + + ! DEVICE-NOT: host_eval({{.*}}) + ! DEVICE-SAME: { + + ! 
BOTH: omp.teams + !$omp target teams + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.simd + ! BOTH-NEXT: omp.loop_nest + + ! HOST-SAME: (%{{.*}}) : i32 = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) + !$omp distribute simd + do i=1,10 +call foo() + end do + !$omp end distribute simd + !$omp end target teams + + ! BOTH: omp.target + ! BOTH-NOT: host_eval({{.*}}) + ! BOTH-SAME: { + ! BOTH: omp.teams + !$omp target teams + call foo() !< Prevents this from being Generic-SPMD. + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.simd + !$omp distribute simd + do i=1,10 +call foo() + end do + !$omp end distribute simd + !$omp end target teams + + ! BOTH: omp.teams + !$omp teams + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.simd + !$omp distribute simd + do i=1,10 +call foo() + en
[llvm-branch-commits] [mlir] [MLIR][OpenMP] Host lowering of standalone distribute (PR #127817)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127817 >From 128819a704e4c3497c55fe21d0b588f24240af34 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Tue, 18 Feb 2025 11:22:43 + Subject: [PATCH] [MLIR][OpenMP] Host lowering of standalone distribute This patch adds MLIR to LLVM IR translation support for standalone `omp.distribute` operations, as well as `distribute simd` through ignoring SIMD information (similarly to `do/for simd`). Co-authored-by: Dominik Adamski --- .../OpenMP/OpenMPToLLVMIRTranslation.cpp | 78 +++ mlir/test/Target/LLVMIR/openmp-llvm.mlir | 37 + mlir/test/Target/LLVMIR/openmp-todo.mlir | 66 +++- 3 files changed, 178 insertions(+), 3 deletions(-) diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp index 1344f992c116e..987f18fc7bc47 100644 --- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp +++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp @@ -161,6 +161,10 @@ static LogicalResult checkImplementationStatus(Operation &op) { if (op.getDevice()) result = todo("device"); }; + auto checkDistSchedule = [&todo](auto op, LogicalResult &result) { +if (op.getDistScheduleChunkSize()) + result = todo("dist_schedule with chunk_size"); + }; auto checkHasDeviceAddr = [&todo](auto op, LogicalResult &result) { if (!op.getHasDeviceAddrVars().empty()) result = todo("has_device_addr"); @@ -252,6 +256,16 @@ static LogicalResult checkImplementationStatus(Operation &op) { LogicalResult result = success(); llvm::TypeSwitch(op) + .Case([&](omp::DistributeOp op) { +if (op.isComposite() && +isa_and_present(op.getNestedWrapper())) + result = op.emitError() << "not yet implemented: " + "composite omp.distribute + omp.wsloop"; +checkAllocate(op, result); +checkDistSchedule(op, result); +checkOrder(op, result); +checkPrivate(op, result); + }) .Case([&](omp::OrderedRegionOp op) { checkParLevelSimd(op, 
result); }) .Case([&](omp::SectionsOp op) { checkAllocate(op, result); @@ -3754,6 +3768,67 @@ convertOmpTargetData(Operation *op, llvm::IRBuilderBase &builder, return success(); } +static LogicalResult +convertOmpDistribute(Operation &opInst, llvm::IRBuilderBase &builder, + LLVM::ModuleTranslation &moduleTranslation) { + llvm::OpenMPIRBuilder *ompBuilder = moduleTranslation.getOpenMPBuilder(); + auto distributeOp = cast(opInst); + if (failed(checkImplementationStatus(opInst))) +return failure(); + + using InsertPointTy = llvm::OpenMPIRBuilder::InsertPointTy; + auto bodyGenCB = [&](InsertPointTy allocaIP, + InsertPointTy codeGenIP) -> llvm::Error { +// DistributeOp has only one region associated with it. +builder.restoreIP(codeGenIP); + +llvm::OpenMPIRBuilder *ompBuilder = moduleTranslation.getOpenMPBuilder(); +llvm::OpenMPIRBuilder::LocationDescription ompLoc(builder); +llvm::Expected regionBlock = +convertOmpOpRegions(distributeOp.getRegion(), "omp.distribute.region", +builder, moduleTranslation); +if (!regionBlock) + return regionBlock.takeError(); +builder.SetInsertPoint(*regionBlock, (*regionBlock)->begin()); + +// TODO: Add support for clauses which are valid for DISTRIBUTE constructs. +// Static schedule is the default. 
+auto schedule = omp::ClauseScheduleKind::Static; +bool isOrdered = false; +std::optional scheduleMod; +bool isSimd = false; +llvm::omp::WorksharingLoopType workshareLoopType = +llvm::omp::WorksharingLoopType::DistributeStaticLoop; +bool loopNeedsBarrier = false; +llvm::Value *chunk = nullptr; + +llvm::CanonicalLoopInfo *loopInfo = *findCurrentLoopInfo(moduleTranslation); +llvm::OpenMPIRBuilder::InsertPointOrErrorTy wsloopIP = +ompBuilder->applyWorkshareLoop( +ompLoc.DL, loopInfo, allocaIP, loopNeedsBarrier, +convertToScheduleKind(schedule), chunk, isSimd, +scheduleMod == omp::ScheduleModifier::monotonic, +scheduleMod == omp::ScheduleModifier::nonmonotonic, isOrdered, +workshareLoopType); + +if (!wsloopIP) + return wsloopIP.takeError(); +return llvm::Error::success(); + }; + + llvm::OpenMPIRBuilder::InsertPointTy allocaIP = + findAllocaInsertPoint(builder, moduleTranslation); + llvm::OpenMPIRBuilder::LocationDescription ompLoc(builder); + llvm::OpenMPIRBuilder::InsertPointOrErrorTy afterIP = + ompBuilder->createDistribute(ompLoc, allocaIP, bodyGenCB); + + if (failed(handleError(afterIP, opInst))) +return failure(); + + builder.restoreIP(*afterIP); + return success(); +} + /// Lowers the FlagsAttr which is
[llvm-branch-commits] [llvm] [OpenMPIRBuilder] Add support for distribute-parallel-for/do constructs (PR #127818)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127818 >From dbe0d70c0d1c83027ffc9b6eda637257a362adc5 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Tue, 18 Feb 2025 12:04:53 + Subject: [PATCH] [OpenMPIRBuilder] Add support for distribute-parallel-for/do constructs This patch adds codegen for `kmpc_dist_for_static_init` runtime calls, used to support worksharing a single loop across teams and threads. This can be used to implement `distribute parallel for/do` support. --- llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | 34 --- 1 file changed, 30 insertions(+), 4 deletions(-) diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp index 9e380bf2d3dbe..7788897fc0795 100644 --- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp +++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp @@ -4130,6 +4130,23 @@ Expected OpenMPIRBuilder::createCanonicalLoop( return createCanonicalLoop(LoopLoc, BodyGen, TripCount, Name); } +// Returns an LLVM function to call for initializing loop bounds using OpenMP +// static scheduling for composite `distribute parallel for` depending on +// `type`. Only i32 and i64 are supported by the runtime. Always interpret +// integers as unsigned similarly to CanonicalLoopInfo. +static FunctionCallee +getKmpcDistForStaticInitForType(Type *Ty, Module &M, +OpenMPIRBuilder &OMPBuilder) { + unsigned Bitwidth = Ty->getIntegerBitWidth(); + if (Bitwidth == 32) +return OMPBuilder.getOrCreateRuntimeFunction( +M, omp::RuntimeFunction::OMPRTL___kmpc_dist_for_static_init_4u); + if (Bitwidth == 64) +return OMPBuilder.getOrCreateRuntimeFunction( +M, omp::RuntimeFunction::OMPRTL___kmpc_dist_for_static_init_8u); + llvm_unreachable("unknown OpenMP loop iterator bitwidth"); +} + // Returns an LLVM function to call for initializing loop bounds using OpenMP // static scheduling depending on `type`. Only i32 and i64 are supported by the // runtime. 
Always interpret integers as unsigned similarly to @@ -4164,7 +4181,10 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::applyStaticWorkshareLoop( // Declare useful OpenMP runtime functions. Value *IV = CLI->getIndVar(); Type *IVTy = IV->getType(); - FunctionCallee StaticInit = getKmpcForStaticInitForType(IVTy, M, *this); + FunctionCallee StaticInit = + LoopType == WorksharingLoopType::DistributeForStaticLoop + ? getKmpcDistForStaticInitForType(IVTy, M, *this) + : getKmpcForStaticInitForType(IVTy, M, *this); FunctionCallee StaticFini = getOrCreateRuntimeFunction(M, omp::OMPRTL___kmpc_for_static_fini); @@ -4200,9 +4220,15 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::applyStaticWorkshareLoop( // Call the "init" function and update the trip count of the loop with the // value it produced. - Builder.CreateCall(StaticInit, - {SrcLoc, ThreadNum, SchedulingType, PLastIter, PLowerBound, - PUpperBound, PStride, One, Zero}); + SmallVector Args( + {SrcLoc, ThreadNum, SchedulingType, PLastIter, PLowerBound, PUpperBound}); + if (LoopType == WorksharingLoopType::DistributeForStaticLoop) { +Value *PDistUpperBound = +Builder.CreateAlloca(IVTy, nullptr, "p.distupperbound"); +Args.push_back(PDistUpperBound); + } + Args.append({PStride, One, Zero}); + Builder.CreateCall(StaticInit, Args); Value *LowerBound = Builder.CreateLoad(IVTy, PLowerBound); Value *InclusiveUpperBound = Builder.CreateLoad(IVTy, PUpperBound); Value *TripCountMinusOne = Builder.CreateSub(InclusiveUpperBound, LowerBound); ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [CodeGen][NewPM] Port MachineSink to NPM (PR #115434)
https://github.com/optimisan updated https://github.com/llvm/llvm-project/pull/115434 >From 17ae43cbf8e8aad79f3cba192079c3841e1425f5 Mon Sep 17 00:00:00 2001 From: Akshat Oke Date: Wed, 30 Oct 2024 04:56:54 + Subject: [PATCH 1/4] [CodeGen][NewPM] Port MachineSink to NPM Targets can set the EnableSinkAndFold option in CGPassBuilderOptions for the NPM pipeline in buildCodeGenPipeline(... &Opts, ...) --- llvm/include/llvm/CodeGen/MachineSink.h | 29 llvm/include/llvm/CodeGen/Passes.h| 2 +- llvm/include/llvm/InitializePasses.h | 2 +- llvm/include/llvm/Passes/CodeGenPassBuilder.h | 3 +- .../llvm/Passes/MachinePassRegistry.def | 9 +- .../include/llvm/Target/CGPassBuilderOption.h | 1 + llvm/lib/CodeGen/CodeGen.cpp | 2 +- llvm/lib/CodeGen/MachineSink.cpp | 136 -- llvm/lib/CodeGen/TargetPassConfig.cpp | 4 +- llvm/lib/Passes/PassBuilder.cpp | 6 + llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp | 2 +- llvm/test/CodeGen/AArch64/loop-sink.mir | 1 + .../sink-and-fold-preserve-debugloc.mir | 2 + ...e-sink-temporal-divergence-swdev407790.mir | 2 + .../CodeGen/ARM/machine-sink-multidef.mir | 2 + .../Hexagon/machine-sink-float-usr.mir| 2 + .../PowerPC/sink-down-more-instructions-1.mir | 2 + .../CodeGen/RISCV/MachineSink-implicit-x0.mir | 1 + .../CodeGen/SystemZ/machinesink-dead-cc.mir | 3 + .../CodeGen/X86/machinesink-debug-inv-0.mir | 3 + .../DebugInfo/MIR/X86/sink-leaves-undef.mir | 1 + 21 files changed, 166 insertions(+), 49 deletions(-) create mode 100644 llvm/include/llvm/CodeGen/MachineSink.h diff --git a/llvm/include/llvm/CodeGen/MachineSink.h b/llvm/include/llvm/CodeGen/MachineSink.h new file mode 100644 index 0..1eee9d7f7e2a4 --- /dev/null +++ b/llvm/include/llvm/CodeGen/MachineSink.h @@ -0,0 +1,29 @@ +//===- MachineSink.h *- C++ -*-===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===--===// + +#ifndef LLVM_CODEGEN_MACHINESINK_H +#define LLVM_CODEGEN_MACHINESINK_H + +#include "llvm/CodeGen/MachinePassManager.h" + +namespace llvm { + +class MachineSinkingPass : public PassInfoMixin { + bool EnableSinkAndFold; + +public: + MachineSinkingPass(bool EnableSinkAndFold = false) + : EnableSinkAndFold(EnableSinkAndFold) {} + + PreservedAnalyses run(MachineFunction &MF, MachineFunctionAnalysisManager &); + + void printPipeline(raw_ostream &OS, function_ref MapClassName2PassName); +}; + +} // namespace llvm +#endif // LLVM_CODEGEN_MACHINESINK_H diff --git a/llvm/include/llvm/CodeGen/Passes.h b/llvm/include/llvm/CodeGen/Passes.h index b5d2a7e6bf035..581c38e5c1a52 100644 --- a/llvm/include/llvm/CodeGen/Passes.h +++ b/llvm/include/llvm/CodeGen/Passes.h @@ -353,7 +353,7 @@ namespace llvm { extern char &EarlyMachineLICMID; /// MachineSinking - This pass performs sinking on machine instructions. - extern char &MachineSinkingID; + extern char &MachineSinkingLegacyID; /// MachineCopyPropagation - This pass performs copy propagation on /// machine instructions. 
diff --git a/llvm/include/llvm/InitializePasses.h b/llvm/include/llvm/InitializePasses.h index 30c7402bd6606..5c45a405663b1 100644 --- a/llvm/include/llvm/InitializePasses.h +++ b/llvm/include/llvm/InitializePasses.h @@ -208,7 +208,7 @@ void initializeMachinePostDominatorTreeWrapperPassPass(PassRegistry &); void initializeMachineRegionInfoPassPass(PassRegistry &); void initializeMachineSanitizerBinaryMetadataPass(PassRegistry &); void initializeMachineSchedulerLegacyPass(PassRegistry &); -void initializeMachineSinkingPass(PassRegistry &); +void initializeMachineSinkingLegacyPass(PassRegistry &); void initializeMachineTraceMetricsWrapperPassPass(PassRegistry &); void initializeMachineUniformityInfoPrinterPassPass(PassRegistry &); void initializeMachineUniformityAnalysisPassPass(PassRegistry &); diff --git a/llvm/include/llvm/Passes/CodeGenPassBuilder.h b/llvm/include/llvm/Passes/CodeGenPassBuilder.h index 12781e2b84623..1967a323129c1 100644 --- a/llvm/include/llvm/Passes/CodeGenPassBuilder.h +++ b/llvm/include/llvm/Passes/CodeGenPassBuilder.h @@ -51,6 +51,7 @@ #include "llvm/CodeGen/MachineModuleInfo.h" #include "llvm/CodeGen/MachinePassManager.h" #include "llvm/CodeGen/MachineScheduler.h" +#include "llvm/CodeGen/MachineSink.h" #include "llvm/CodeGen/MachineVerifier.h" #include "llvm/CodeGen/OptimizePHIs.h" #include "llvm/CodeGen/PHIElimination.h" @@ -1042,7 +1043,7 @@ void CodeGenPassBuilder::addMachineSSAOptimization( addPass(EarlyMachineLICMPass()); addPass(MachineCSEPass()); - addPass(MachineSinkingPass()); + addPass(MachineSin
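The port above follows the standard new-pass-manager shape: a `PassInfoMixin`-derived class whose options are constructor parameters rather than globals, plus a `printPipeline` override so the option round-trips through the textual pipeline. A toy Python analogue of that shape is sketched below; the textual option spelling is an assumption for illustration, not taken from the patch.

```python
class MachineSinkingPassModel:
    """Toy analogue of the NPM pass in the patch: the EnableSinkAndFold
    option becomes constructor state, and print_pipeline renders it back
    into a textual pipeline element."""
    def __init__(self, enable_sink_and_fold=False):
        self.enable_sink_and_fold = enable_sink_and_fold

    def print_pipeline(self):
        # Hypothetical spelling: options appear inside angle brackets when set.
        if self.enable_sink_and_fold:
            return "machine-sink<enable-sink-fold>"
        return "machine-sink"
```

The same pattern is what lets `buildCodeGenPipeline` thread `CGPassBuilderOptions::EnableSinkAndFold` through to the pass without a `cl::opt` global.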
[llvm-branch-commits] [flang] [Flang][OpenMP] Allow host evaluation of loop bounds for distribute (PR #127822)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127822 >From 25e308a580946e40e4d74aae7f04d570723bb267 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Wed, 19 Feb 2025 15:15:01 + Subject: [PATCH] [Flang][OpenMP] Allow host evaluation of loop bounds for distribute This patch adds `target teams distribute [simd]` and equivalent construct nests to the list of cases where loop bounds can be evaluated in the host, as they represent Generic-SPMD kernels for which the trip count must also be evaluated in advance to the kernel call. --- flang/lib/Lower/OpenMP/OpenMP.cpp | 12 +-- flang/test/Lower/OpenMP/host-eval.f90 | 103 ++ 2 files changed, 110 insertions(+), 5 deletions(-) diff --git a/flang/lib/Lower/OpenMP/OpenMP.cpp b/flang/lib/Lower/OpenMP/OpenMP.cpp index bd794033cdf11..8c80453610473 100644 --- a/flang/lib/Lower/OpenMP/OpenMP.cpp +++ b/flang/lib/Lower/OpenMP/OpenMP.cpp @@ -562,8 +562,11 @@ static void processHostEvalClauses(lower::AbstractConverter &converter, [[fallthrough]]; case OMPD_distribute_parallel_do: case OMPD_distribute_parallel_do_simd: - cp.processCollapse(loc, eval, hostInfo.ops, hostInfo.iv); cp.processNumThreads(stmtCtx, hostInfo.ops); + [[fallthrough]]; +case OMPD_distribute: +case OMPD_distribute_simd: + cp.processCollapse(loc, eval, hostInfo.ops, hostInfo.iv); break; // Cases where 'teams' clauses might be present, and target SPMD is @@ -573,10 +576,8 @@ static void processHostEvalClauses(lower::AbstractConverter &converter, [[fallthrough]]; case OMPD_target_teams: cp.processNumTeams(stmtCtx, hostInfo.ops); - processSingleNestedIf([](Directive nestedDir) { -return nestedDir == OMPD_distribute_parallel_do || - nestedDir == OMPD_distribute_parallel_do_simd; - }); + processSingleNestedIf( + [](Directive nestedDir) { return topDistributeSet.test(nestedDir); }); break; // Cases where only 'teams' host-evaluated clauses might be present. 
@@ -586,6 +587,7 @@ static void processHostEvalClauses(lower::AbstractConverter &converter, [[fallthrough]]; case OMPD_target_teams_distribute: case OMPD_target_teams_distribute_simd: + cp.processCollapse(loc, eval, hostInfo.ops, hostInfo.iv); cp.processNumTeams(stmtCtx, hostInfo.ops); break; diff --git a/flang/test/Lower/OpenMP/host-eval.f90 b/flang/test/Lower/OpenMP/host-eval.f90 index 32c52462b86a7..65258c91e5daf 100644 --- a/flang/test/Lower/OpenMP/host-eval.f90 +++ b/flang/test/Lower/OpenMP/host-eval.f90 @@ -155,3 +155,106 @@ subroutine distribute_parallel_do_simd() !$omp end distribute parallel do simd !$omp end teams end subroutine distribute_parallel_do_simd + +! BOTH-LABEL: func.func @_QPdistribute +subroutine distribute() + ! BOTH: omp.target + + ! HOST-SAME: host_eval(%{{.*}} -> %[[LB:.*]], %{{.*}} -> %[[UB:.*]], %{{.*}} -> %[[STEP:.*]] : i32, i32, i32) + + ! DEVICE-NOT: host_eval({{.*}}) + ! DEVICE-SAME: { + + ! BOTH: omp.teams + !$omp target teams + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.loop_nest + + ! HOST-SAME: (%{{.*}}) : i32 = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) + !$omp distribute + do i=1,10 +call foo() + end do + !$omp end distribute + !$omp end target teams + + ! BOTH: omp.target + ! BOTH-NOT: host_eval({{.*}}) + ! BOTH-SAME: { + ! BOTH: omp.teams + !$omp target teams + call foo() !< Prevents this from being Generic-SPMD. + + ! BOTH: omp.distribute + !$omp distribute + do i=1,10 +call foo() + end do + !$omp end distribute + !$omp end target teams + + ! BOTH: omp.teams + !$omp teams + + ! BOTH: omp.distribute + !$omp distribute + do i=1,10 +call foo() + end do + !$omp end distribute + !$omp end teams +end subroutine distribute + +! BOTH-LABEL: func.func @_QPdistribute_simd +subroutine distribute_simd() + ! BOTH: omp.target + + ! HOST-SAME: host_eval(%{{.*}} -> %[[LB:.*]], %{{.*}} -> %[[UB:.*]], %{{.*}} -> %[[STEP:.*]] : i32, i32, i32) + + ! DEVICE-NOT: host_eval({{.*}}) + ! DEVICE-SAME: { + + ! 
BOTH: omp.teams + !$omp target teams + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.simd + ! BOTH-NEXT: omp.loop_nest + + ! HOST-SAME: (%{{.*}}) : i32 = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) + !$omp distribute simd + do i=1,10 +call foo() + end do + !$omp end distribute simd + !$omp end target teams + + ! BOTH: omp.target + ! BOTH-NOT: host_eval({{.*}}) + ! BOTH-SAME: { + ! BOTH: omp.teams + !$omp target teams + call foo() !< Prevents this from being Generic-SPMD. + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.simd + !$omp distribute simd + do i=1,10 +call foo() + end do + !$omp end distribute simd + !$omp end target teams + + ! BOTH: omp.teams + !$omp teams + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.simd + !$omp distribute simd + do i=1,10 +call foo() + en
[llvm-branch-commits] [llvm] [OpenMPIRBuilder] Add support for distribute constructs (PR #127816)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127816 >From 40d140e6bc0be9556bc09524b38e642cb9885a9d Mon Sep 17 00:00:00 2001 From: Dominik Adamski Date: Mon, 17 Feb 2025 14:25:40 + Subject: [PATCH] [OpenMPIRBuilder] Add support for distribute constructs This patch adds the `OpenMPIRBuilder::createDistribute()` function and updates `OpenMPIRBuilder::applyStaticWorkshareLoop()` in preparation for adding `distribute` support to flang. Co-authored-by: Sergio Afonso --- .../llvm/Frontend/OpenMP/OMPIRBuilder.h | 17 -- llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | 59 --- 2 files changed, 64 insertions(+), 12 deletions(-) diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h index d25077cae63e4..9ad85413acd34 100644 --- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h +++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h @@ -1004,12 +1004,12 @@ class OpenMPIRBuilder { /// preheader of the loop. /// \param NeedsBarrier Indicates whether a barrier must be inserted after /// the loop. + /// \param LoopType Type of workshare loop. /// /// \returns Point where to insert code after the workshare construct. - InsertPointOrErrorTy applyStaticWorkshareLoop(DebugLoc DL, -CanonicalLoopInfo *CLI, -InsertPointTy AllocaIP, -bool NeedsBarrier); + InsertPointOrErrorTy applyStaticWorkshareLoop( + DebugLoc DL, CanonicalLoopInfo *CLI, InsertPointTy AllocaIP, + omp::WorksharingLoopType LoopType, bool NeedsBarrier); /// Modifies the canonical loop a statically-scheduled workshare loop with a /// user-specified chunk size. @@ -2660,6 +2660,15 @@ class OpenMPIRBuilder { Value *NumTeamsLower = nullptr, Value *NumTeamsUpper = nullptr, Value *ThreadLimit = nullptr, Value *IfExpr = nullptr); + /// Generator for `#omp distribute` + /// + /// \param Loc The location where the distribute construct was encountered. + /// \param AllocaIP The insertion points to be used for alloca instructions. 
+ /// \param BodyGenCB Callback that will generate the region code. + InsertPointOrErrorTy createDistribute(const LocationDescription &Loc, +InsertPointTy AllocaIP, +BodyGenCallbackTy BodyGenCB); + /// Generate conditional branch and relevant BasicBlocks through which private /// threads copy the 'copyin' variables from Master copy to threadprivate /// copies. diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp index 04acab1e5765e..9e380bf2d3dbe 100644 --- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp +++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp @@ -2295,7 +2295,8 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::createSections( return LoopInfo.takeError(); InsertPointOrErrorTy WsloopIP = - applyStaticWorkshareLoop(Loc.DL, *LoopInfo, AllocaIP, !IsNowait); + applyStaticWorkshareLoop(Loc.DL, *LoopInfo, AllocaIP, + WorksharingLoopType::ForStaticLoop, !IsNowait); if (!WsloopIP) return WsloopIP.takeError(); InsertPointTy AfterIP = *WsloopIP; @@ -4145,10 +4146,9 @@ static FunctionCallee getKmpcForStaticInitForType(Type *Ty, Module &M, llvm_unreachable("unknown OpenMP loop iterator bitwidth"); } -OpenMPIRBuilder::InsertPointOrErrorTy -OpenMPIRBuilder::applyStaticWorkshareLoop(DebugLoc DL, CanonicalLoopInfo *CLI, - InsertPointTy AllocaIP, - bool NeedsBarrier) { +OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::applyStaticWorkshareLoop( +DebugLoc DL, CanonicalLoopInfo *CLI, InsertPointTy AllocaIP, +WorksharingLoopType LoopType, bool NeedsBarrier) { assert(CLI->isValid() && "Requires a valid canonical loop"); assert(!isConflictIP(AllocaIP, CLI->getPreheaderIP()) && "Require dedicated allocate IP"); @@ -4191,8 +4191,12 @@ OpenMPIRBuilder::applyStaticWorkshareLoop(DebugLoc DL, CanonicalLoopInfo *CLI, Value *ThreadNum = getOrCreateThreadID(SrcLoc); - Constant *SchedulingType = ConstantInt::get( - I32Type, static_cast(OMPScheduleType::UnorderedStatic)); + OMPScheduleType SchedType = + (LoopType == 
WorksharingLoopType::DistributeStaticLoop) + ? OMPScheduleType::OrderedDistribute + : OMPScheduleType::UnorderedStatic; + Constant *SchedulingType = + ConstantInt::get(I32Type, static_cast(SchedType)); // Call the "init" function and update the trip count of the loop with the // value it produced. @@ -4452,6 +4456,7 @@ static void createTargetLoopWorkshareCall( RealArgs.push_back(TripCount); if (LoopType == Worksh
[llvm-branch-commits] [llvm] [OpenMPIRBuilder] Split calculation of canonical loop trip count, NFC (PR #127820)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127820 >From f14d964b8b744ebbf2f981ff07e0051c338db335 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Tue, 18 Feb 2025 14:19:30 + Subject: [PATCH] [OpenMPIRBuilder] Split calculation of canonical loop trip count, NFC This patch splits off the calculation of canonical loop trip counts from the creation of canonical loops. This makes it possible to reuse this logic to, for instance, populate the `__tgt_target_kernel` runtime call for SPMD kernels. This feature is used to simplify one of the existing OpenMPIRBuilder tests. --- .../llvm/Frontend/OpenMP/OMPIRBuilder.h | 38 +++ llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | 27 - .../Frontend/OpenMPIRBuilderTest.cpp | 16 ++-- 3 files changed, 52 insertions(+), 29 deletions(-) diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h index 9ad85413acd34..207ca7fb05f62 100644 --- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h +++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h @@ -728,13 +728,12 @@ class OpenMPIRBuilder { LoopBodyGenCallbackTy BodyGenCB, Value *TripCount, const Twine &Name = "loop"); - /// Generator for the control flow structure of an OpenMP canonical loop. + /// Calculate the trip count of a canonical loop. /// - /// Instead of a logical iteration space, this allows specifying user-defined - /// loop counter values using increment, upper- and lower bounds. To - /// disambiguate the terminology when counting downwards, instead of lower - /// bounds we use \p Start for the loop counter value in the first body - /// iteration. + /// This allows specifying user-defined loop counter values using increment, + /// upper- and lower bounds. To disambiguate the terminology when counting + /// downwards, instead of lower bounds we use \p Start for the loop counter + /// value in the first body iteration. 
/// /// Consider the following limitations: /// @@ -758,7 +757,32 @@ class OpenMPIRBuilder { /// /// for (int i = 0; i < 42; i -= 1u) /// - // + /// \param Loc The insert and source location description. + /// \param Start Value of the loop counter for the first iterations. + /// \param Stop Loop counter values past this will stop the loop. + /// \param Step Loop counter increment after each iteration; negative + /// means counting down. + /// \param IsSigned Whether Start, Stop and Step are signed integers. + /// \param InclusiveStop Whether \p Stop itself is a valid value for the loop + /// counter. + /// \param Name Base name used to derive instruction names. + /// + /// \returns The value holding the calculated trip count. + Value *calculateCanonicalLoopTripCount(const LocationDescription &Loc, + Value *Start, Value *Stop, Value *Step, + bool IsSigned, bool InclusiveStop, + const Twine &Name = "loop"); + + /// Generator for the control flow structure of an OpenMP canonical loop. + /// + /// Instead of a logical iteration space, this allows specifying user-defined + /// loop counter values using increment, upper- and lower bounds. To + /// disambiguate the terminology when counting downwards, instead of lower + /// bounds we use \p Start for the loop counter value in the first body + /// + /// It calls \see calculateCanonicalLoopTripCount for trip count calculations, + /// so limitations of that method apply here as well. + /// /// \param Loc The insert and source location description. /// \param BodyGenCB Callback that will generate the loop body code. /// \param Start Value of the loop counter for the first iterations. 
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp index 7788897fc0795..eee6e3e54d615 100644 --- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp +++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp @@ -4059,10 +4059,9 @@ OpenMPIRBuilder::createCanonicalLoop(const LocationDescription &Loc, return CL; } -Expected OpenMPIRBuilder::createCanonicalLoop( -const LocationDescription &Loc, LoopBodyGenCallbackTy BodyGenCB, -Value *Start, Value *Stop, Value *Step, bool IsSigned, bool InclusiveStop, -InsertPointTy ComputeIP, const Twine &Name) { +Value *OpenMPIRBuilder::calculateCanonicalLoopTripCount( +const LocationDescription &Loc, Value *Start, Value *Stop, Value *Step, +bool IsSigned, bool InclusiveStop, const Twine &Name) { // Consider the following difficulties (assuming 8-bit signed integers): // * Adding \p Step to the loop counter which passes \p Stop may overflow: @@ -4075,9 +4074,7 @@ Expected OpenMPIRBuilder::createCanonicalLoop( assert(IndVarTy ==
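The factored-out trip-count logic can be modelled independently of LLVM: for a canonical loop running from `Start` toward `Stop` by `Step` (negative meaning counting down), the count is the distance divided by |Step|, rounded up, plus one extra iteration when `Stop` itself is a valid counter value. A Python sketch follows; Python integers do not overflow, so the fixed-bit-width subtleties the patch's comment worries about (e.g. `Start + Step` passing `Stop` and wrapping) are deliberately elided here.

```python
def trip_count(start, stop, step, inclusive_stop=False):
    """Count iterations of a canonical loop from start toward stop by step.
    A conceptual model of the trip-count computation split out in the patch,
    ignoring fixed-width overflow concerns."""
    if step == 0:
        raise ValueError("step must be nonzero")
    # Distance to cover, oriented so positive means "loop runs".
    span = stop - start if step > 0 else start - stop
    if inclusive_stop:
        span += 1
    if span <= 0:
        return 0
    mag = abs(step)
    return (span + mag - 1) // mag  # ceiling division
```

Splitting this out is what lets the trip count be computed on the host ahead of a `__tgt_target_kernel` call, without materializing the loop itself.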
[llvm-branch-commits] [mlir] [MLIR][OpenMP] Support target SPMD (PR #127821)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127821 >From 27139f8f6260de93a0e6d6163b9562c7daa451b8 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Wed, 19 Feb 2025 14:41:12 + Subject: [PATCH 1/2] [MLIR][OpenMP] Support target SPMD This patch implements MLIR to LLVM IR translation of host-evaluated loop bounds, completing initial support for `target teams distribute parallel do [simd]` and `target teams distribute [simd]`. --- .../OpenMP/OpenMPToLLVMIRTranslation.cpp | 83 .../Target/LLVMIR/openmp-target-spmd.mlir | 96 +++ mlir/test/Target/LLVMIR/openmp-todo.mlir | 24 - 3 files changed, 159 insertions(+), 44 deletions(-) create mode 100644 mlir/test/Target/LLVMIR/openmp-target-spmd.mlir diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp index fbea278b2511f..9d07bf7b5d224 100644 --- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp +++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp @@ -173,15 +173,6 @@ static LogicalResult checkImplementationStatus(Operation &op) { if (op.getHint()) op.emitWarning("hint clause discarded"); }; - auto checkHostEval = [](auto op, LogicalResult &result) { -// Host evaluated clauses are supported, except for loop bounds. 
-for (BlockArgument arg : - cast(*op).getHostEvalBlockArgs()) - for (Operation *user : arg.getUsers()) -if (isa(user)) - result = op.emitError("not yet implemented: host evaluation of loop " -"bounds in omp.target operation"); - }; auto checkInReduction = [&todo](auto op, LogicalResult &result) { if (!op.getInReductionVars().empty() || op.getInReductionByref() || op.getInReductionSyms()) @@ -318,7 +309,6 @@ static LogicalResult checkImplementationStatus(Operation &op) { checkBare(op, result); checkDevice(op, result); checkHasDeviceAddr(op, result); -checkHostEval(op, result); checkInReduction(op, result); checkIsDevicePtr(op, result); checkPrivate(op, result); @@ -4053,9 +4043,13 @@ createDeviceArgumentAccessor(MapInfoData &mapData, llvm::Argument &arg, /// /// Loop bounds and steps are only optionally populated, if output vectors are /// provided. -static void extractHostEvalClauses(omp::TargetOp targetOp, Value &numThreads, - Value &numTeamsLower, Value &numTeamsUpper, - Value &threadLimit) { +static void +extractHostEvalClauses(omp::TargetOp targetOp, Value &numThreads, + Value &numTeamsLower, Value &numTeamsUpper, + Value &threadLimit, + llvm::SmallVectorImpl *lowerBounds = nullptr, + llvm::SmallVectorImpl *upperBounds = nullptr, + llvm::SmallVectorImpl *steps = nullptr) { auto blockArgIface = llvm::cast(*targetOp); for (auto item : llvm::zip_equal(targetOp.getHostEvalVars(), blockArgIface.getHostEvalBlockArgs())) { @@ -4080,11 +4074,26 @@ static void extractHostEvalClauses(omp::TargetOp targetOp, Value &numThreads, llvm_unreachable("unsupported host_eval use"); }) .Case([&](omp::LoopNestOp loopOp) { -// TODO: Extract bounds and step values. Currently, this cannot be -// reached because translation would have been stopped earlier as a -// result of `checkImplementationStatus` detecting and reporting -// this situation. 
-llvm_unreachable("unsupported host_eval use"); +auto processBounds = +[&](OperandRange opBounds, +llvm::SmallVectorImpl *outBounds) -> bool { + bool found = false; + for (auto [i, lb] : llvm::enumerate(opBounds)) { +if (lb == blockArg) { + found = true; + if (outBounds) +(*outBounds)[i] = hostEvalVar; +} + } + return found; +}; +bool found = +processBounds(loopOp.getLoopLowerBounds(), lowerBounds); +found = processBounds(loopOp.getLoopUpperBounds(), upperBounds) || +found; +found = processBounds(loopOp.getLoopSteps(), steps) || found; +if (!found) + llvm_unreachable("unsupported host_eval use"); }) .Default([](Operation *) { llvm_unreachable("unsupported host_eval use"); @@ -4221,6 +4230,7 @@ initTargetDefaultAttrs(omp::TargetOp targetOp, combinedMaxThreadsVal = maxThreadsVal; // Update kernel bounds structure for the `OpenMPIRBuilder` to use. + attrs.ExecFlags = targetOp.getKernelExecFla
[llvm-branch-commits] [mlir] [MLIR][OpenMP] Host lowering of distribute-parallel-do/for (PR #127819)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127819 >From aad04faf1796c328ac2a4280939a7fb9d7503ab1 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Tue, 18 Feb 2025 13:07:51 + Subject: [PATCH 1/2] [MLIR][OpenMP] Host lowering of distribute-parallel-do/for This patch adds support for translating composite `omp.parallel` + `omp.distribute` + `omp.wsloop` loops to LLVM IR on the host. This is done by passing an updated `WorksharingLoopType` to the call to `applyWorkshareLoop` associated to the lowering of the `omp.wsloop` operation, so that `__kmpc_dist_for_static_init` is called at runtime in place of `__kmpc_for_static_init`. Existing translation rules take care of creating a parallel region to hold the workshared and workdistributed loop. --- .../OpenMP/OpenMPToLLVMIRTranslation.cpp | 21 -- mlir/test/Target/LLVMIR/openmp-llvm.mlir | 65 +++ mlir/test/Target/LLVMIR/openmp-todo.mlir | 19 -- 3 files changed, 81 insertions(+), 24 deletions(-) diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp index 87b690912620b..adac89988a2da 100644 --- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp +++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp @@ -257,10 +257,6 @@ static LogicalResult checkImplementationStatus(Operation &op) { LogicalResult result = success(); llvm::TypeSwitch(op) .Case([&](omp::DistributeOp op) { -if (op.isComposite() && -isa_and_present(op.getNestedWrapper())) - result = op.emitError() << "not yet implemented: " - "composite omp.distribute + omp.wsloop"; checkAllocate(op, result); checkDistSchedule(op, result); checkOrder(op, result); @@ -1990,6 +1986,14 @@ convertOmpWsloop(Operation &opInst, llvm::IRBuilderBase &builder, bool isSimd = wsloopOp.getScheduleSimd(); bool loopNeedsBarrier = !wsloopOp.getNowait(); + // The only legal way for the direct parent to be omp.distribute is that 
this + // represents 'distribute parallel do'. Otherwise, this is a regular + // worksharing loop. + llvm::omp::WorksharingLoopType workshareLoopType = + llvm::isa_and_present(opInst.getParentOp()) + ? llvm::omp::WorksharingLoopType::DistributeForStaticLoop + : llvm::omp::WorksharingLoopType::ForStaticLoop; + llvm::OpenMPIRBuilder::LocationDescription ompLoc(builder); llvm::Expected regionBlock = convertOmpOpRegions( wsloopOp.getRegion(), "omp.wsloop.region", builder, moduleTranslation); @@ -2005,7 +2009,8 @@ convertOmpWsloop(Operation &opInst, llvm::IRBuilderBase &builder, ompLoc.DL, loopInfo, allocaIP, loopNeedsBarrier, convertToScheduleKind(schedule), chunk, isSimd, scheduleMod == omp::ScheduleModifier::monotonic, - scheduleMod == omp::ScheduleModifier::nonmonotonic, isOrdered); + scheduleMod == omp::ScheduleModifier::nonmonotonic, isOrdered, + workshareLoopType); if (failed(handleError(wsloopIP, opInst))) return failure(); @@ -3796,6 +3801,12 @@ convertOmpDistribute(Operation &opInst, llvm::IRBuilderBase &builder, return regionBlock.takeError(); builder.SetInsertPoint(*regionBlock, (*regionBlock)->begin()); +// Skip applying a workshare loop below when translating 'distribute +// parallel do' (it's been already handled by this point while translating +// the nested omp.wsloop). +if (isa_and_present(distributeOp.getNestedWrapper())) + return llvm::Error::success(); + // TODO: Add support for clauses which are valid for DISTRIBUTE constructs. // Static schedule is the default. 
auto schedule = omp::ClauseScheduleKind::Static; diff --git a/mlir/test/Target/LLVMIR/openmp-llvm.mlir b/mlir/test/Target/LLVMIR/openmp-llvm.mlir index a5a490e527d79..d85b149c66811 100644 --- a/mlir/test/Target/LLVMIR/openmp-llvm.mlir +++ b/mlir/test/Target/LLVMIR/openmp-llvm.mlir @@ -3307,3 +3307,68 @@ llvm.func @distribute() { // CHECK: store i64 1, ptr %[[STRIDE]] // CHECK: %[[TID:.*]] = call i32 @__kmpc_global_thread_num({{.*}}) // CHECK: call void @__kmpc_for_static_init_{{.*}}(ptr @{{.*}}, i32 %[[TID]], i32 92, ptr %[[LASTITER]], ptr %[[LB]], ptr %[[UB]], ptr %[[STRIDE]], i64 1, i64 0) + +// - + +llvm.func @distribute_wsloop(%lb : i32, %ub : i32, %step : i32) { + omp.parallel { +omp.distribute { + omp.wsloop { +omp.loop_nest (%iv) : i32 = (%lb) to (%ub) step (%step) { + omp.yield +} + } {omp.composite} +} {omp.composite} +omp.terminator + } {omp.composite} + llvm.return +} + +// CHECK-LABEL: define void @distribute_wsloop +// CHECK: call void{{.*}}@__kmpc_fork_call({{.*}}, ptr @[[OUTLINED_PARALLEL:.*]], + +// CHECK: define internal void @[[OU
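The parent check driving the new lowering is compact enough to restate on its own: per the comment in the patch, the only legal way for `omp.wsloop`'s direct parent to be `omp.distribute` is the composite `distribute parallel do`, which selects the dist variant of the worksharing-loop type so `__kmpc_dist_for_static_init` is emitted instead of `__kmpc_for_static_init`. A one-function sketch:

```python
def workshare_loop_type(parent_op_name):
    """Mirror of the parent check in convertOmpWsloop: a wsloop directly
    wrapped by omp.distribute lowers as 'distribute parallel do' and must
    use the dist_for_static runtime entry point."""
    if parent_op_name == "omp.distribute":
        return "DistributeForStaticLoop"
    return "ForStaticLoop"
```

The second half of the patch is the dual of this check: `convertOmpDistribute` skips applying its own workshare loop when its nested wrapper is an `omp.wsloop`, since the wsloop translation already handled it.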
[llvm-branch-commits] [mlir] [MLIR][OpenMP] Support target SPMD (PR #127821)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127821 >From e965e0e637551c9b5b5f7fb526a809d1186ef261 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Wed, 19 Feb 2025 14:41:12 + Subject: [PATCH 1/2] [MLIR][OpenMP] Support target SPMD This patch implements MLIR to LLVM IR translation of host-evaluated loop bounds, completing initial support for `target teams distribute parallel do [simd]` and `target teams distribute [simd]`. --- .../OpenMP/OpenMPToLLVMIRTranslation.cpp | 83 .../Target/LLVMIR/openmp-target-spmd.mlir | 96 +++ mlir/test/Target/LLVMIR/openmp-todo.mlir | 24 - 3 files changed, 159 insertions(+), 44 deletions(-) create mode 100644 mlir/test/Target/LLVMIR/openmp-target-spmd.mlir diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp index adac89988a2da..a7d2a00a1bd90 100644 --- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp +++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp @@ -173,15 +173,6 @@ static LogicalResult checkImplementationStatus(Operation &op) { if (op.getHint()) op.emitWarning("hint clause discarded"); }; - auto checkHostEval = [](auto op, LogicalResult &result) { -// Host evaluated clauses are supported, except for loop bounds. 
-for (BlockArgument arg : - cast(*op).getHostEvalBlockArgs()) - for (Operation *user : arg.getUsers()) -if (isa(user)) - result = op.emitError("not yet implemented: host evaluation of loop " -"bounds in omp.target operation"); - }; auto checkInReduction = [&todo](auto op, LogicalResult &result) { if (!op.getInReductionVars().empty() || op.getInReductionByref() || op.getInReductionSyms()) @@ -318,7 +309,6 @@ static LogicalResult checkImplementationStatus(Operation &op) { checkBare(op, result); checkDevice(op, result); checkHasDeviceAddr(op, result); -checkHostEval(op, result); checkInReduction(op, result); checkIsDevicePtr(op, result); checkPrivate(op, result); @@ -4058,9 +4048,13 @@ createDeviceArgumentAccessor(MapInfoData &mapData, llvm::Argument &arg, /// /// Loop bounds and steps are only optionally populated, if output vectors are /// provided. -static void extractHostEvalClauses(omp::TargetOp targetOp, Value &numThreads, - Value &numTeamsLower, Value &numTeamsUpper, - Value &threadLimit) { +static void +extractHostEvalClauses(omp::TargetOp targetOp, Value &numThreads, + Value &numTeamsLower, Value &numTeamsUpper, + Value &threadLimit, + llvm::SmallVectorImpl *lowerBounds = nullptr, + llvm::SmallVectorImpl *upperBounds = nullptr, + llvm::SmallVectorImpl *steps = nullptr) { auto blockArgIface = llvm::cast(*targetOp); for (auto item : llvm::zip_equal(targetOp.getHostEvalVars(), blockArgIface.getHostEvalBlockArgs())) { @@ -4085,11 +4079,26 @@ static void extractHostEvalClauses(omp::TargetOp targetOp, Value &numThreads, llvm_unreachable("unsupported host_eval use"); }) .Case([&](omp::LoopNestOp loopOp) { -// TODO: Extract bounds and step values. Currently, this cannot be -// reached because translation would have been stopped earlier as a -// result of `checkImplementationStatus` detecting and reporting -// this situation. 
-llvm_unreachable("unsupported host_eval use"); +auto processBounds = +[&](OperandRange opBounds, +llvm::SmallVectorImpl *outBounds) -> bool { + bool found = false; + for (auto [i, lb] : llvm::enumerate(opBounds)) { +if (lb == blockArg) { + found = true; + if (outBounds) +(*outBounds)[i] = hostEvalVar; +} + } + return found; +}; +bool found = +processBounds(loopOp.getLoopLowerBounds(), lowerBounds); +found = processBounds(loopOp.getLoopUpperBounds(), upperBounds) || +found; +found = processBounds(loopOp.getLoopSteps(), steps) || found; +if (!found) + llvm_unreachable("unsupported host_eval use"); }) .Default([](Operation *) { llvm_unreachable("unsupported host_eval use"); @@ -4226,6 +4235,7 @@ initTargetDefaultAttrs(omp::TargetOp targetOp, combinedMaxThreadsVal = maxThreadsVal; // Update kernel bounds structure for the `OpenMPIRBuilder` to use. + attrs.ExecFlags = targetOp.getKernelExecFla
[llvm-branch-commits] [flang] [Flang][OpenMP] Allow host evaluation of loop bounds for distribute (PR #127822)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127822 >From de75db239e6725be6509c06057a338842339bc0a Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Wed, 19 Feb 2025 15:15:01 + Subject: [PATCH] [Flang][OpenMP] Allow host evaluation of loop bounds for distribute This patch adds `target teams distribute [simd]` and equivalent construct nests to the list of cases where loop bounds can be evaluated in the host, as they represent Generic-SPMD kernels for which the trip count must also be evaluated in advance to the kernel call. --- flang/lib/Lower/OpenMP/OpenMP.cpp | 12 +-- flang/test/Lower/OpenMP/host-eval.f90 | 103 ++ 2 files changed, 110 insertions(+), 5 deletions(-) diff --git a/flang/lib/Lower/OpenMP/OpenMP.cpp b/flang/lib/Lower/OpenMP/OpenMP.cpp index bd794033cdf11..8c80453610473 100644 --- a/flang/lib/Lower/OpenMP/OpenMP.cpp +++ b/flang/lib/Lower/OpenMP/OpenMP.cpp @@ -562,8 +562,11 @@ static void processHostEvalClauses(lower::AbstractConverter &converter, [[fallthrough]]; case OMPD_distribute_parallel_do: case OMPD_distribute_parallel_do_simd: - cp.processCollapse(loc, eval, hostInfo.ops, hostInfo.iv); cp.processNumThreads(stmtCtx, hostInfo.ops); + [[fallthrough]]; +case OMPD_distribute: +case OMPD_distribute_simd: + cp.processCollapse(loc, eval, hostInfo.ops, hostInfo.iv); break; // Cases where 'teams' clauses might be present, and target SPMD is @@ -573,10 +576,8 @@ static void processHostEvalClauses(lower::AbstractConverter &converter, [[fallthrough]]; case OMPD_target_teams: cp.processNumTeams(stmtCtx, hostInfo.ops); - processSingleNestedIf([](Directive nestedDir) { -return nestedDir == OMPD_distribute_parallel_do || - nestedDir == OMPD_distribute_parallel_do_simd; - }); + processSingleNestedIf( + [](Directive nestedDir) { return topDistributeSet.test(nestedDir); }); break; // Cases where only 'teams' host-evaluated clauses might be present. 
@@ -586,6 +587,7 @@ static void processHostEvalClauses(lower::AbstractConverter &converter, [[fallthrough]]; case OMPD_target_teams_distribute: case OMPD_target_teams_distribute_simd: + cp.processCollapse(loc, eval, hostInfo.ops, hostInfo.iv); cp.processNumTeams(stmtCtx, hostInfo.ops); break; diff --git a/flang/test/Lower/OpenMP/host-eval.f90 b/flang/test/Lower/OpenMP/host-eval.f90 index 32c52462b86a7..65258c91e5daf 100644 --- a/flang/test/Lower/OpenMP/host-eval.f90 +++ b/flang/test/Lower/OpenMP/host-eval.f90 @@ -155,3 +155,106 @@ subroutine distribute_parallel_do_simd() !$omp end distribute parallel do simd !$omp end teams end subroutine distribute_parallel_do_simd + +! BOTH-LABEL: func.func @_QPdistribute +subroutine distribute() + ! BOTH: omp.target + + ! HOST-SAME: host_eval(%{{.*}} -> %[[LB:.*]], %{{.*}} -> %[[UB:.*]], %{{.*}} -> %[[STEP:.*]] : i32, i32, i32) + + ! DEVICE-NOT: host_eval({{.*}}) + ! DEVICE-SAME: { + + ! BOTH: omp.teams + !$omp target teams + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.loop_nest + + ! HOST-SAME: (%{{.*}}) : i32 = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) + !$omp distribute + do i=1,10 +call foo() + end do + !$omp end distribute + !$omp end target teams + + ! BOTH: omp.target + ! BOTH-NOT: host_eval({{.*}}) + ! BOTH-SAME: { + ! BOTH: omp.teams + !$omp target teams + call foo() !< Prevents this from being Generic-SPMD. + + ! BOTH: omp.distribute + !$omp distribute + do i=1,10 +call foo() + end do + !$omp end distribute + !$omp end target teams + + ! BOTH: omp.teams + !$omp teams + + ! BOTH: omp.distribute + !$omp distribute + do i=1,10 +call foo() + end do + !$omp end distribute + !$omp end teams +end subroutine distribute + +! BOTH-LABEL: func.func @_QPdistribute_simd +subroutine distribute_simd() + ! BOTH: omp.target + + ! HOST-SAME: host_eval(%{{.*}} -> %[[LB:.*]], %{{.*}} -> %[[UB:.*]], %{{.*}} -> %[[STEP:.*]] : i32, i32, i32) + + ! DEVICE-NOT: host_eval({{.*}}) + ! DEVICE-SAME: { + + ! 
BOTH: omp.teams + !$omp target teams + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.simd + ! BOTH-NEXT: omp.loop_nest + + ! HOST-SAME: (%{{.*}}) : i32 = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) + !$omp distribute simd + do i=1,10 +call foo() + end do + !$omp end distribute simd + !$omp end target teams + + ! BOTH: omp.target + ! BOTH-NOT: host_eval({{.*}}) + ! BOTH-SAME: { + ! BOTH: omp.teams + !$omp target teams + call foo() !< Prevents this from being Generic-SPMD. + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.simd + !$omp distribute simd + do i=1,10 +call foo() + end do + !$omp end distribute simd + !$omp end target teams + + ! BOTH: omp.teams + !$omp teams + + ! BOTH: omp.distribute + ! BOTH-NEXT: omp.simd + !$omp distribute simd + do i=1,10 +call foo() + en
[llvm-branch-commits] [mlir] [MLIR][OpenMP] Host lowering of distribute-parallel-do/for (PR #127819)
https://github.com/skatrak updated https://github.com/llvm/llvm-project/pull/127819 >From aad04faf1796c328ac2a4280939a7fb9d7503ab1 Mon Sep 17 00:00:00 2001 From: Sergio Afonso Date: Tue, 18 Feb 2025 13:07:51 + Subject: [PATCH] [MLIR][OpenMP] Host lowering of distribute-parallel-do/for This patch adds support for translating composite `omp.parallel` + `omp.distribute` + `omp.wsloop` loops to LLVM IR on the host. This is done by passing an updated `WorksharingLoopType` to the call to `applyWorkshareLoop` associated to the lowering of the `omp.wsloop` operation, so that `__kmpc_dist_for_static_init` is called at runtime in place of `__kmpc_for_static_init`. Existing translation rules take care of creating a parallel region to hold the workshared and workdistributed loop. --- .../OpenMP/OpenMPToLLVMIRTranslation.cpp | 21 -- mlir/test/Target/LLVMIR/openmp-llvm.mlir | 65 +++ mlir/test/Target/LLVMIR/openmp-todo.mlir | 19 -- 3 files changed, 81 insertions(+), 24 deletions(-) diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp index 87b690912620b..adac89988a2da 100644 --- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp +++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp @@ -257,10 +257,6 @@ static LogicalResult checkImplementationStatus(Operation &op) { LogicalResult result = success(); llvm::TypeSwitch(op) .Case([&](omp::DistributeOp op) { -if (op.isComposite() && -isa_and_present(op.getNestedWrapper())) - result = op.emitError() << "not yet implemented: " - "composite omp.distribute + omp.wsloop"; checkAllocate(op, result); checkDistSchedule(op, result); checkOrder(op, result); @@ -1990,6 +1986,14 @@ convertOmpWsloop(Operation &opInst, llvm::IRBuilderBase &builder, bool isSimd = wsloopOp.getScheduleSimd(); bool loopNeedsBarrier = !wsloopOp.getNowait(); + // The only legal way for the direct parent to be omp.distribute is that this 
+ // represents 'distribute parallel do'. Otherwise, this is a regular + // worksharing loop. + llvm::omp::WorksharingLoopType workshareLoopType = + llvm::isa_and_present(opInst.getParentOp()) + ? llvm::omp::WorksharingLoopType::DistributeForStaticLoop + : llvm::omp::WorksharingLoopType::ForStaticLoop; + llvm::OpenMPIRBuilder::LocationDescription ompLoc(builder); llvm::Expected regionBlock = convertOmpOpRegions( wsloopOp.getRegion(), "omp.wsloop.region", builder, moduleTranslation); @@ -2005,7 +2009,8 @@ convertOmpWsloop(Operation &opInst, llvm::IRBuilderBase &builder, ompLoc.DL, loopInfo, allocaIP, loopNeedsBarrier, convertToScheduleKind(schedule), chunk, isSimd, scheduleMod == omp::ScheduleModifier::monotonic, - scheduleMod == omp::ScheduleModifier::nonmonotonic, isOrdered); + scheduleMod == omp::ScheduleModifier::nonmonotonic, isOrdered, + workshareLoopType); if (failed(handleError(wsloopIP, opInst))) return failure(); @@ -3796,6 +3801,12 @@ convertOmpDistribute(Operation &opInst, llvm::IRBuilderBase &builder, return regionBlock.takeError(); builder.SetInsertPoint(*regionBlock, (*regionBlock)->begin()); +// Skip applying a workshare loop below when translating 'distribute +// parallel do' (it's been already handled by this point while translating +// the nested omp.wsloop). +if (isa_and_present(distributeOp.getNestedWrapper())) + return llvm::Error::success(); + // TODO: Add support for clauses which are valid for DISTRIBUTE constructs. // Static schedule is the default. 
auto schedule = omp::ClauseScheduleKind::Static; diff --git a/mlir/test/Target/LLVMIR/openmp-llvm.mlir b/mlir/test/Target/LLVMIR/openmp-llvm.mlir index a5a490e527d79..d85b149c66811 100644 --- a/mlir/test/Target/LLVMIR/openmp-llvm.mlir +++ b/mlir/test/Target/LLVMIR/openmp-llvm.mlir @@ -3307,3 +3307,68 @@ llvm.func @distribute() { // CHECK: store i64 1, ptr %[[STRIDE]] // CHECK: %[[TID:.*]] = call i32 @__kmpc_global_thread_num({{.*}}) // CHECK: call void @__kmpc_for_static_init_{{.*}}(ptr @{{.*}}, i32 %[[TID]], i32 92, ptr %[[LASTITER]], ptr %[[LB]], ptr %[[UB]], ptr %[[STRIDE]], i64 1, i64 0) + +// - + +llvm.func @distribute_wsloop(%lb : i32, %ub : i32, %step : i32) { + omp.parallel { +omp.distribute { + omp.wsloop { +omp.loop_nest (%iv) : i32 = (%lb) to (%ub) step (%step) { + omp.yield +} + } {omp.composite} +} {omp.composite} +omp.terminator + } {omp.composite} + llvm.return +} + +// CHECK-LABEL: define void @distribute_wsloop +// CHECK: call void{{.*}}@__kmpc_fork_call({{.*}}, ptr @[[OUTLINED_PARALLEL:.*]], + +// CHECK: define internal void @[[OUTLINE
[llvm-branch-commits] [llvm] [MachineBasicBlock][NFC] Decouple SplitCriticalEdges from pass manager (PR #128151)
https://github.com/optimisan edited https://github.com/llvm/llvm-project/pull/128151