[PATCH] D79754: [OpenMP][AMDGCN] Support OpenMP offloading for AMDGCN architecture - Part 1

2020-05-27 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds accepted this revision.
sameerds added a comment.
This revision is now accepted and ready to land.

LGTM!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79754/new/

https://reviews.llvm.org/D79754





[PATCH] D79400: [CMAKE] Fix build failure when source directory is read only

2020-05-29 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rG16fef6d0b46f: Fix build failure when source is read only 
(authored by pdhaliwal, committed by sameerds).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79400/new/

https://reviews.llvm.org/D79400

Files:
  llvm/cmake/modules/AddLLVM.cmake
  llvm/include/llvm/Support/CMakeLists.txt


Index: llvm/include/llvm/Support/CMakeLists.txt
===
--- llvm/include/llvm/Support/CMakeLists.txt
+++ llvm/include/llvm/Support/CMakeLists.txt
@@ -5,12 +5,19 @@
 
 set(generate_vcs_version_script "${LLVM_CMAKE_PATH}/GenerateVersionFromVCS.cmake")
 
-if(llvm_vc AND LLVM_APPEND_VC_REV)
+if(LLVM_APPEND_VC_REV)
   set(llvm_source_dir ${LLVM_MAIN_SRC_DIR})
+
+  # A fake version file that is not expected to exist. It is used to force
+  # regeneration of VCSRevision.h for a source directory with no write
+  # permission available.
+  if (NOT llvm_vc)
+    set(fake_version_inc "${CMAKE_CURRENT_BINARY_DIR}/__FakeVCSRevision.h")
+  endif()
 endif()
 
 # Create custom target to generate the VC revision include.
-add_custom_command(OUTPUT "${version_inc}"
+add_custom_command(OUTPUT "${version_inc}" "${fake_version_inc}"
   DEPENDS "${llvm_vc}" "${generate_vcs_version_script}"
   COMMAND ${CMAKE_COMMAND} "-DNAMES=LLVM"
           "-DLLVM_SOURCE_DIR=${llvm_source_dir}"
@@ -22,5 +29,5 @@
   PROPERTIES GENERATED TRUE
              HEADER_FILE_ONLY TRUE)
 
-add_custom_target(llvm_vcsrevision_h DEPENDS "${version_inc}")
+add_custom_target(llvm_vcsrevision_h ALL DEPENDS "${version_inc}" "${fake_version_inc}")
 set_target_properties(llvm_vcsrevision_h PROPERTIES FOLDER "Misc")
Index: llvm/cmake/modules/AddLLVM.cmake
===
--- llvm/cmake/modules/AddLLVM.cmake
+++ llvm/cmake/modules/AddLLVM.cmake
@@ -2118,7 +2118,13 @@
     get_filename_component(git_dir ${git_output} ABSOLUTE BASE_DIR ${path})
     # Some branchless cases (e.g. 'repo') may not yet have .git/logs/HEAD
     if (NOT EXISTS "${git_dir}/logs/HEAD")
-      file(WRITE "${git_dir}/logs/HEAD" "")
+      execute_process(COMMAND ${CMAKE_COMMAND} -E touch HEAD
+        WORKING_DIRECTORY "${git_dir}/logs"
+        RESULT_VARIABLE touch_head_result
+        ERROR_QUIET)
+      if (NOT touch_head_result EQUAL 0)
+        return()
+      endif()
     endif()
     set(${out_var} "${git_dir}/logs/HEAD" PARENT_SCOPE)
   endif()


[PATCH] D80804: [AMDGPU] Expose llvm atomic inc/dec instructions as clang builtins for AMDGPU target

2020-05-29 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

The commit description needs fixing. These are not "llvm instructions"; they
are "AMDGCN intrinsics", and they don't exist in the LangRef. Also, I would
recommend inverting the first line of the description: "Introduce Clang
builtins that are mapped to AMDGCN intrinsics".

EDIT: Another nitpick in the description: it's the Clang C ABI, not LLVM C ABI.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80804/new/

https://reviews.llvm.org/D80804





[PATCH] D80804: [AMDGPU] Expose llvm atomic inc/dec instructions as clang builtins for AMDGPU target

2020-05-29 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/CodeGen/CGBuiltin.cpp:14529-14530
+
+  // llvm.amdgcn.atomic.inc and llvm.amdgcn.atomic.dec expect ordering and
+  // scope as unsigned values
+  Value *MemOrder = Builder.getInt32(static_cast<unsigned>(AO));

arsenm wrote:
> We should fix this (or move these into atomicrmw)
I am not sure why these intrinsics exist separately from atomicrmw. But while
they do, taking a numerical scope is not a problem since they are
target-specific. The LLVM instructions take the scope as an opaque string just
to keep target-specific bits out of the IR.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80804/new/

https://reviews.llvm.org/D80804





[PATCH] D80804: [AMDGPU] Expose llvm atomic inc/dec instructions as clang builtins for AMDGPU target

2020-05-29 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

Actually, the question really is about why inc/dec are needed as separate 
operations either as IR intrinsics or Clang builtins. Why not just expose a 
__builtin_amdgcn_atomicrmw that takes a scope, and map it to the LLVM 
atomicrmw? That would be way cleaner. The language can provide convenience 
functions for inc/dec that internally call the rmw builtin.
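To make the suggestion concrete, here is a sketch of what that could look like
at the language level. Everything in it is hypothetical: neither
__builtin_amdgcn_atomicrmw nor the operation tag exists today; this only
illustrates the shape being proposed:

  // Hypothetical single scoped-rmw builtin, wrapped by convenience functions.
  // The builtin name, the operation tag, and the argument order are all
  // assumptions, not an existing Clang API.
  enum amdgcn_rmw_op { AMDGCN_RMW_INC, AMDGCN_RMW_DEC };

  __device__ unsigned atomic_inc(unsigned *ptr, unsigned bound) {
    return __builtin_amdgcn_atomicrmw(ptr, bound, AMDGCN_RMW_INC,
                                      __ATOMIC_SEQ_CST, "agent");
  }

  __device__ unsigned atomic_dec(unsigned *ptr, unsigned bound) {
    return __builtin_amdgcn_atomicrmw(ptr, bound, AMDGCN_RMW_DEC,
                                      __ATOMIC_SEQ_CST, "agent");
  }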


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80804/new/

https://reviews.llvm.org/D80804





[PATCH] D80804: [AMDGPU] Introduce Clang builtins to be mapped to AMDGCN atomic inc/dec intrinsics

2020-05-29 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D80804#2063522, @saiislam wrote:

> In D80804#2063451, @sameerds wrote:
>
> > Actually, the question really is about why inc/dec are needed as separate 
> > operations either as IR intrinsics or Clang builtins. Why not just expose a 
> > __builtin_amdgcn_atomicrmw that takes a scope, and map it to the LLVM 
> > atomicrmw? That would be way cleaner. The language can provide convenience 
> > functions for inc/dec that internally call the rmw builtin.
>
>
> At the moment, atomic inc/dec exist along with atomicrmw. This patch only 
> aims to devise a way to make it (inc/dec) accessible from the language level.


Just to clarify that a bit, the point is that atomic inc/dec are not available 
as operations in atomicrmw, and cannot be treated as an optimization of (add 1) 
or (sub 1). So the intrinsics are required until atomicrmw is enhanced, which 
is out of scope for this patch.
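For reference, a sketch of why the wrapping inc/dec cannot be modeled as
(add 1) or (sub 1): both operations wrap at a bound supplied as the second
operand. The semantics are paraphrased below as plain functions over the old
value; the helper names are illustrative:

  // Value produced by the wrapping atomic increment/decrement.
  unsigned inc_wrap(unsigned old_val, unsigned bound) {
    return (old_val >= bound) ? 0 : old_val + 1;             // wraps to 0
  }
  unsigned dec_wrap(unsigned old_val, unsigned bound) {
    return (old_val == 0 || old_val > bound) ? bound         // wraps to bound
                                             : old_val - 1;
  }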


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80804/new/

https://reviews.llvm.org/D80804





[PATCH] D80804: [AMDGPU] Introduce Clang builtins to be mapped to AMDGCN atomic inc/dec intrinsics

2020-05-29 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds accepted this revision.
sameerds added a comment.

LGTM, but please wait for approval from @arsenm


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80804/new/

https://reviews.llvm.org/D80804





[PATCH] D75917: Expose llvm fence instruction as clang intrinsic

2020-04-05 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds requested changes to this revision.
sameerds added inline comments.
This revision now requires changes to proceed.



Comment at: clang/include/clang/Basic/Builtins.def:1583
+// Second argument : target specific sync scope string
+BUILTIN(__builtin_memory_fence, "vUicC*", "n")
+

This should be moved to be near line 786, where atomic builtins are listed
under the comment "// GCC does not support these, they are a Clang extension."



Comment at: clang/lib/CodeGen/CGBuiltin.cpp:13630
+
+  // Map C11/C++11 memory ordering to LLVM memory ordering
+  switch (static_cast<llvm::AtomicOrderingCABI>(ord)) {

There should be no mention of any high-level language here. The correct enum to
validate against is llvm::AtomicOrdering from llvm/Support/AtomicOrdering.h,
and not the C ABI or any other language ABI.



Comment at: clang/lib/CodeGen/CGBuiltin.cpp:13651
+  llvm::getConstantStringInfo(Scope, scp);
+  SSID = getLLVMContext().getOrInsertSyncScopeID(scp);
+

This seems to be creating a new ID for any arbitrary string passed as sync 
scope. This should be validated against LLVMContext::getSyncScopeNames(). 
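A minimal sketch of the validation being asked for, assuming the surrounding
CodeGenFunction context from the patch (Scope, scp and SSID as quoted above;
the call expression E and the diagnostic wording are assumptions):

  // Reject scope strings the context does not already know about, instead of
  // silently minting a new SyncScope::ID for an arbitrary string.
  llvm::SmallVector<llvm::StringRef, 8> ValidScopes;
  getLLVMContext().getSyncScopeNames(ValidScopes);
  if (!llvm::is_contained(ValidScopes, scp))
    CGM.Error(E->getExprLoc(), "invalid sync scope '" + scp.str() + "'");
  else
    SSID = getLLVMContext().getOrInsertSyncScopeID(scp);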



Comment at: clang/test/CodeGenHIP/builtin_memory_fence.cpp:5
+
+void test_memory_fence_success() {
+// CHECK-LABEL: test_memory_fence_success

There should be a line that tries to do:
  __builtin_memory_fence(__ATOMIC_SEQ_CST, "foobar");



Comment at: clang/test/CodeGenHIP/builtin_memory_fence.cpp:9
+  // CHECK: fence syncscope("workgroup") seq_cst
+  __builtin_memory_fence(__ATOMIC_SEQ_CST,  "workgroup");
+  

Orderings like `__ATOMIC_SEQ_CST` are defined for C/C++ memory models. They 
should not be used with the new builtin because this new builtin does not 
follow any specific language model. For user convenience, the right thing to do 
is to introduce new tokens in the Clang preprocessor, similar to the 
`__ATOMIC_*` tokens. The convenient shortcut is to just tell the user to supply 
numerical values by looking at the LLVM source code.

From llvm/Support/AtomicOrdering.h, note how the numerical value for 
`__ATOMIC_SEQ_CST` is 5, but the numerical value for the LLVM 
SequentiallyConsistent ordering is 7. The numerical value 5 refers to the LLVM 
ordering "release". So, if the implementation were correct, this line should 
result in the following unexpected LLVM IR:
  fence syncscope("workgroup") release
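A sketch of the kind of mapping being asked for, given the C ABI ordering
value (called `ord` in the patch under review); it validates the value and
translates it into the LLVM enum instead of passing the raw integer through:

  // The two enums do not share numerical values: __ATOMIC_SEQ_CST is 5 in the
  // C ABI, but 5 in llvm::AtomicOrdering is Release. Translate explicitly.
  llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
  switch (static_cast<llvm::AtomicOrderingCABI>(ord)) {
  case llvm::AtomicOrderingCABI::acquire:
    AO = llvm::AtomicOrdering::Acquire;
    break;
  case llvm::AtomicOrderingCABI::release:
    AO = llvm::AtomicOrdering::Release;
    break;
  case llvm::AtomicOrderingCABI::acq_rel:
    AO = llvm::AtomicOrdering::AcquireRelease;
    break;
  case llvm::AtomicOrderingCABI::seq_cst:
    AO = llvm::AtomicOrdering::SequentiallyConsistent;
    break;
  default:
    break; // relaxed/consume are not meaningful for a fence
  }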


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75917/new/

https://reviews.llvm.org/D75917





[PATCH] D75917: Expose llvm fence instruction as clang intrinsic

2020-04-06 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/CodeGen/CGBuiltin.cpp:13651
+  llvm::getConstantStringInfo(Scope, scp);
+  SSID = getLLVMContext().getOrInsertSyncScopeID(scp);
+

saiislam wrote:
> sameerds wrote:
> > This seems to be creating a new ID for any arbitrary string passed as sync 
> > scope. This should be validated against LLVMContext::getSyncScopeNames(). 
> As the FE is not aware of the specific target and the implementation of sync
> scope for that target, getSyncScopeNames() here returns llvm's default sync
> scopes, which only support singlethread and system as valid scopes. Validity
> checking of the memory scope string is being intentionally left for the
> later stages which deal with the generated IR.
That's pretty strange. At this point, Clang should know what the target is, and 
it should have a chance to update the list of sync scopes somewhere. @arsenm, 
do you see a way around this?



Comment at: clang/test/CodeGenHIP/builtin_memory_fence.cpp:9
+  // CHECK: fence syncscope("workgroup") seq_cst
+  __builtin_memory_fence(__ATOMIC_SEQ_CST,  "workgroup");
+  

JonChesterfield wrote:
> saiislam wrote:
> > sameerds wrote:
> > > Orderings like `__ATOMIC_SEQ_CST` are defined for C/C++ memory models. 
> > > They should not be used with the new builtin because this new builtin 
> > > does not follow any specific language model. For user convenience, the 
> > > right thing to do is to introduce new tokens in the Clang preprocessor, 
> > > similar to the `__ATOMIC_*` tokens. The convenient shortcut is to just 
> > > tell the user to supply numerical values by looking at the LLVM source 
> > > code.
> > > 
> > > From llvm/Support/AtomicOrdering.h, note how the numerical value for 
> > > `__ATOMIC_SEQ_CST` is 5, but the numerical value for the LLVM 
> > > SequentiallyConsistent ordering is 7. The numerical value 5 refers to the 
> > > LLVM ordering "release". So, if the implementation were correct, this 
> > > line should result in the following unexpected LLVM IR:
> > >   fence syncscope("workgroup") release
> > As you pointed out, the range of acquire to sequentially consistent memory
> > orders for llvm::AtomicOrdering is [4, 7], while for
> > llvm::AtomicOrderingCABI it is [2, 5]. The enums of the C ABI were chosen
> > for ease of use by users who are familiar with the C/C++ standard memory
> > model. It allows them to use macros like __ATOMIC_ACQUIRE etc.
> > Clang CodeGen of the builtin internally maps the C ABI ordering to the
> > llvm atomic ordering.
> What language, implemented in clang, do you have in mind that reusing the 
> existing __ATOMIC_* macros would be incorrect for?
I think we agreed that this builtin exposes the LLVM fence exactly. That would 
mean it takes arguments defined by LLVM. If you are implementing something 
different from that, then it first needs to be specified properly. Perhaps you 
could say that this is a C ABI compatible builtin, that happens to take target 
specific scopes? That should cover OpenCL whose scope enum is designed to be 
compatible with C.

Whatever it is that you are trying to implement here, it definitely does not 
expose a raw LLVM fence.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75917/new/

https://reviews.llvm.org/D75917





[PATCH] D75917: Expose llvm fence instruction as clang intrinsic

2020-04-06 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds requested changes to this revision.
sameerds added inline comments.
This revision now requires changes to proceed.



Comment at: clang/lib/CodeGen/CGBuiltin.cpp:13651
+  llvm::getConstantStringInfo(Scope, scp);
+  SSID = getLLVMContext().getOrInsertSyncScopeID(scp);
+

JonChesterfield wrote:
> sameerds wrote:
> > saiislam wrote:
> > > sameerds wrote:
> > > > This seems to be creating a new ID for any arbitrary string passed as 
> > > > sync scope. This should be validated against 
> > > > LLVMContext::getSyncScopeNames(). 
> > > As the FE is not aware of the specific target and the implementation of
> > > sync scope for that target, getSyncScopeNames() here returns llvm's
> > > default sync scopes, which only support singlethread and system as valid
> > > scopes. Validity checking of the memory scope string is being
> > > intentionally left for the later stages which deal with the generated IR.
> > That's pretty strange. At this point, Clang should know what the target is, 
> > and it should have a chance to update the list of sync scopes somewhere. 
> > @arsenm, do you see a way around this?
> There is already sufficient IR level checking for the string at the 
> instruction level. Warning in clang as well could be a nicer user experience, 
> but that seems low priority to me.
If there is some checking happening anywhere, then that needs to be 
demonstrated in a testcase where the input high-level program passes an illegal 
string as the scope argument.



Comment at: clang/test/CodeGenHIP/builtin_memory_fence.cpp:9
+  // CHECK: fence syncscope("workgroup") seq_cst
+  __builtin_memory_fence(__ATOMIC_SEQ_CST,  "workgroup");
+  

JonChesterfield wrote:
> sameerds wrote:
> > JonChesterfield wrote:
> > > saiislam wrote:
> > > > sameerds wrote:
> > > > > Orderings like `__ATOMIC_SEQ_CST` are defined for C/C++ memory 
> > > > > models. They should not be used with the new builtin because this new 
> > > > > builtin does not follow any specific language model. For user 
> > > > > convenience, the right thing to do is to introduce new tokens in the 
> > > > > Clang preprocessor, similar to the `__ATOMIC_*` tokens. The 
> > > > > convenient shortcut is to just tell the user to supply numerical 
> > > > > values by looking at the LLVM source code.
> > > > > 
> > > > > From llvm/Support/AtomicOrdering.h, note how the numerical value for 
> > > > > `__ATOMIC_SEQ_CST` is 5, but the numerical value for the LLVM 
> > > > > SequentiallyConsistent ordering is 7. The numerical value 5 refers to 
> > > > > the LLVM ordering "release". So, if the implementation were correct, 
> > > > > this line should result in the following unexpected LLVM IR:
> > > > >   fence syncscope("workgroup") release
> > > > As you pointed out, the range of acquire to sequentially consistent
> > > > memory orders for llvm::AtomicOrdering is [4, 7], while for
> > > > llvm::AtomicOrderingCABI it is [2, 5]. The enums of the C ABI were
> > > > chosen for ease of use by users who are familiar with the C/C++
> > > > standard memory model. It allows them to use macros like
> > > > __ATOMIC_ACQUIRE etc.
> > > > Clang CodeGen of the builtin internally maps the C ABI ordering to
> > > > the llvm atomic ordering.
> > > What language, implemented in clang, do you have in mind that reusing the 
> > > existing __ATOMIC_* macros would be incorrect for?
> > I think we agreed that this builtin exposes the LLVM fence exactly. That 
> > would mean it takes arguments defined by LLVM. If you are implementing 
> > something different from that, then it first needs to be specified 
> > properly. Perhaps you could say that this is a C ABI compatible builtin, 
> > that happens to take target specific scopes? That should cover OpenCL whose 
> > scope enum is designed to be compatible with C.
> > 
> > Whatever it is that you are trying to implement here, it definitely does 
> > not expose a raw LLVM fence.
> The llvm fence, in text form, uses a symbol for the memory scope. Not an enum.
> 
> This symbol is set using these macros for the existing atomic builtins. Using 
> an implementation detail of clang instead is strictly worse, by layering and 
> by precedent.
> 
> ABI is not involved here. Nor is OpenCl.
The `__ATOMIC_*` symbols in Clang quite literally represent the C/C++ ABI. See
the details in AtomicOrdering.h and InitPreprocessor.cpp. I am not opposed to
specifying that the builtin expects these symbols, but then it is incorrect to
say that the builtin exposes the raw LLVM instruction. It is a
C-ABI-compatible builtin that happens to take a target-specific scope as a
string argument. And that would also make it an overload of the already
existing builtin __atomic_fence().


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75917/new/

https://reviews.llvm.org/D75917




[PATCH] D75917: Expose llvm fence instruction as clang intrinsic

2020-04-06 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/test/CodeGenHIP/builtin_memory_fence.cpp:9
+  // CHECK: fence syncscope("workgroup") seq_cst
+  __builtin_memory_fence(__ATOMIC_SEQ_CST,  "workgroup");
+  

JonChesterfield wrote:
> JonChesterfield wrote:
> > sameerds wrote:
> > > JonChesterfield wrote:
> > > > sameerds wrote:
> > > > > JonChesterfield wrote:
> > > > > > saiislam wrote:
> > > > > > > sameerds wrote:
> > > > > > > > Orderings like `__ATOMIC_SEQ_CST` are defined for C/C++ memory 
> > > > > > > > models. They should not be used with the new builtin because 
> > > > > > > > this new builtin does not follow any specific language model. 
> > > > > > > > For user convenience, the right thing to do is to introduce new 
> > > > > > > > tokens in the Clang preprocessor, similar to the `__ATOMIC_*` 
> > > > > > > > tokens. The convenient shortcut is to just tell the user to 
> > > > > > > > supply numerical values by looking at the LLVM source code.
> > > > > > > > 
> > > > > > > > From llvm/Support/AtomicOrdering.h, note how the numerical 
> > > > > > > > value for `__ATOMIC_SEQ_CST` is 5, but the numerical value for 
> > > > > > > > the LLVM SequentiallyConsistent ordering is 7. The numerical 
> > > > > > > > value 5 refers to the LLVM ordering "release". So, if the 
> > > > > > > > implementation were correct, this line should result in the 
> > > > > > > > following unexpected LLVM IR:
> > > > > > > >   fence syncscope("workgroup") release
> > > > > > > As you pointed out, the range of acquire to sequentially
> > > > > > > consistent memory orders for llvm::AtomicOrdering is [4, 7],
> > > > > > > while for llvm::AtomicOrderingCABI it is [2, 5]. The enums of
> > > > > > > the C ABI were chosen for ease of use by users who are familiar
> > > > > > > with the C/C++ standard memory model. It allows them to use
> > > > > > > macros like __ATOMIC_ACQUIRE etc.
> > > > > > > Clang CodeGen of the builtin internally maps the C ABI ordering
> > > > > > > to the llvm atomic ordering.
> > > > > > What language, implemented in clang, do you have in mind that 
> > > > > > reusing the existing __ATOMIC_* macros would be incorrect for?
> > > > > I think we agreed that this builtin exposes the LLVM fence exactly. 
> > > > > That would mean it takes arguments defined by LLVM. If you are 
> > > > > implementing something different from that, then it first needs to be 
> > > > > specified properly. Perhaps you could say that this is a C ABI 
> > > > > compatible builtin, that happens to take target specific scopes? That 
> > > > > should cover OpenCL whose scope enum is designed to be compatible 
> > > > > with C.
> > > > > 
> > > > > Whatever it is that you are trying to implement here, it definitely 
> > > > > does not expose a raw LLVM fence.
> > > > The llvm fence, in text form, uses a symbol for the memory scope. Not 
> > > > an enum.
> > > > 
> > > > This symbol is set using these macros for the existing atomic builtins. 
> > > > Using an implementation detail of clang instead is strictly worse, by 
> > > > layering and by precedent.
> > > > 
> > > > ABI is not involved here. Nor is OpenCl.
> > > The `__ATOMIC_*` symbols in Clang quite literally represent the C/C++
> > > ABI. See the details in AtomicOrdering.h and InitPreprocessor.cpp. I am
> > > not opposed to specifying that the builtin expects these symbols, but
> > > then it is incorrect to say that the builtin exposes the raw LLVM
> > > instruction. It is a C-ABI-compatible builtin that happens to take a
> > > target-specific scope as a string argument. And that would also make it
> > > an overload of the already existing builtin __atomic_fence().
> > I don't know what you mean by "raw", but I am guessing you're asking for
> > documentation for the intrinsic. Said documentation should indeed be added
> > for this builtin - it'll probably be in a tablegen file.
> I will try to stop using builtin and intrinsic as synonyms.
Right. It's actually an LLVM instruction, not even an intrinsic. I am guilty of 
using the wrong term quite often, but usually the context helps. I think the 
following is still needed:

  # A testcase that exercises the builtin with an invalid string argument for
    scope.
  # An update to the change description describing what is being introduced
    here. It is more than a direct mapping from builtin to instruction. The C
    ABI is involved.
  # An update to the Clang documentation describing this new builtin:
    https://clang.llvm.org/docs/LanguageExtensions.html#builtin-functions



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75917/new/

https://reviews.llvm.org/D75917





[PATCH] D80996: [AMDGPU][OpenMP] Fix duplicate copies of arguments in commands

2020-06-03 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/Driver/ToolChains/HIP.cpp:389
 
-  for (Arg *A : Args) {
-DAL->append(A);
+  if (DeviceOffloadKind != Action::OFK_OpenMP) {
+for (Arg *A : Args) {

pdhaliwal wrote:
> arsenm wrote:
> > Needs a comment? I don't understand why openmp is any different here
> `HostTC.TranslateArgs` (HostTC = Generic_GCC, Line#383) returns `DAL`, which
> contains `Args` when the offload kind is OFK_OpenMP (for all other cases, it
> returns nullptr). Thus, Line#{390,391} is just duplicating the arguments
> which then propagate to the opt, llc, etc. commands.
> Ref: https://clang.llvm.org/doxygen/Gnu_8cpp_source.html#l02966
I think @arsenm was asking for a comment in the code itself.

Also, I am not sufficiently familiar with the design here, but why is the HIP 
driver checking for OpenMP offloading? Is the offloaded region treated as HIP 
code? It seems a bit strange that we are mixing two different things in the 
same driver.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80996/new/

https://reviews.llvm.org/D80996





[PATCH] D78900: [HIP][AMDGPU] Enable structurizer workarounds

2020-06-04 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds abandoned this revision.
sameerds added a comment.

Abandoned in favour of enabling the workarounds in the AMDGPU backend: 
https://reviews.llvm.org/D81211


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D78900/new/

https://reviews.llvm.org/D78900





[PATCH] D82087: AMDGPU/clang: Add builtins for llvm.amdgcn.ballot

2020-07-08 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

The documentation for HIP __ballot seems to indicate that the user does not
have to explicitly specify the warp size. How is that achieved with these new
builtins? Can this be captured in a lit test here?

https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_kernel_language.md#warp-vote-and-ballot-functions




Comment at: clang/lib/Basic/Targets/AMDGPU.cpp:288
+  if (!IsNullCPU) {
+    // Default to wave32 if available, or wave64 if not
+    if (Features.count("wavefrontsize32") == 0 &&

So the implication here is that wave32 is the preferred choice on newer 
architectures, and hence the default when available?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82087/new/

https://reviews.llvm.org/D82087





[PATCH] D82087: AMDGPU/clang: Add builtins for llvm.amdgcn.ballot

2020-07-12 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

>> https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_kernel_language.md#warp-vote-and-ballot-functions
> 
> I think if the language interface insists on fixing the wave size, then I 
> think the correct solution is to implement this in the header based on a wave 
> size macro (which we're desperately missing). The library implementation 
> should be responsible for inserting the extension to 64-bit for wave32

Not sure if the frontend should try to infer the warp size and the mask size,
or even whether it can in all cases. This can result in wrong behaviour when
the program passes a 32-bit mask but then gets compiled for a 64-bit mask.
It's easy to say that the programmer must not assume a warp size, but it would
be useful if the language could catch such mismatches.

In D82087#2144712, @arsenm wrote:

> In D82087#2140778, @sameerds wrote:
>
> > The documentation for HIP __ballot seems to indicate that the user does not 
> > have to explicitly specify the warp size. How is that achieved with these 
> > new builtins? Can this  be captured in a lit test here?
>
>
> This seems like a defect in the design to me
>
> > https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_kernel_language.md#warp-vote-and-ballot-functions


I tend to agree. The HIP vote/ballot builtins are also missing a mask 
parameter, whose type needs to match the wavesize.
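For illustration, a header-level __ballot along the lines suggested above
might look like the following sketch. The wavefront-size macro is exactly the
missing piece noted above, so its name here is hypothetical; the size-suffixed
builtins are the ones introduced by this patch:

  // Dispatch on a (hypothetical) wavefront-size macro so the user never
  // names the wave size; a wave32 result zero-extends to 64 bits.
  __device__ inline unsigned long long __ballot(int pred) {
  #if defined(__AMDGCN_WAVEFRONT_SIZE__) && __AMDGCN_WAVEFRONT_SIZE__ == 32
    return __builtin_amdgcn_ballot_w32(pred);
  #else
    return __builtin_amdgcn_ballot_w64(pred);
  #endif
  }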


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82087/new/

https://reviews.llvm.org/D82087





[PATCH] D75917: Expose llvm fence instruction as clang intrinsic

2020-04-22 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

Can you please submit a squashed single commit for review against the master 
branch? All the recent commits seem to be relative to previous commits, and I 
am having trouble looking at the change as a whole.

The commit description needs some fixes:

1. Remove use of Title Case, in places like "using Fence instruction" and "LLVM 
Atomic Memory Ordering" and "as a String". They are simply common nouns, not 
proper nouns. When in doubt, look at the LangRef to see how these words are 
used as common nouns.
2. Don't mention that enum values are okay too. I am sure they will always be 
okay, but it's better to encourage the use of symbols.
3. Don't mention HSA even if the AMDGPU user manual does so.
4. In fact the last two sentences in the description are not strictly necessary 
... they are the logical outcome of the scopes being target-specific strings.
5. Note that the general practice in documentation is to say "AMDGPU" which 
covers the LLVM Target, while "amdgcn" is only used in the code because it is 
one of multiple architectures in the AMDGPU backend.




Comment at: clang/docs/LanguageExtensions.rst:2458
 
+AMDGCN specific builtins
+------------------------

We probably don't need to document the new builtin here. Clearly, we have not 
documented any other AMDGPU builtin, and there is no need to start now. If 
necessary, we can take that up as a separate task covering all builtins.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75917/new/

https://reviews.llvm.org/D75917





[PATCH] D75917: Expose llvm fence instruction as clang intrinsic

2020-04-23 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds accepted this revision.
sameerds added a comment.
This revision is now accepted and ready to land.

Thanks @saiislam ... this looks much better!

Two nitpicks that must be fixed, but it is okay if you directly submit after
fixing them.

1. The change description should use "const char *" in the signature and not 
"String".
2. Can you please add a test that passes an integer constant as the scope? I am 
assuming that the signature check will complain that it is not a string.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75917/new/

https://reviews.llvm.org/D75917





[PATCH] D75917: Expose llvm fence instruction as clang intrinsic

2020-04-26 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rG06bdffb2bb45: [AMDGPU] Expose llvm fence instruction as 
clang intrinsic (authored by saiislam, committed by sameerds).

Changed prior to commit:
  https://reviews.llvm.org/D75917?vs=260053&id=260207#toc

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75917/new/

https://reviews.llvm.org/D75917

Files:
  clang/include/clang/Basic/BuiltinsAMDGPU.def
  clang/include/clang/Sema/Sema.h
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/lib/Sema/SemaChecking.cpp
  clang/test/CodeGenCXX/builtin-amdgcn-fence.cpp
  clang/test/Sema/builtin-amdgcn-fence-failure.cpp
  clang/test/SemaOpenCL/builtins-amdgcn-error.cl

Index: clang/test/SemaOpenCL/builtins-amdgcn-error.cl
===
--- clang/test/SemaOpenCL/builtins-amdgcn-error.cl
+++ clang/test/SemaOpenCL/builtins-amdgcn-error.cl
@@ -128,3 +128,14 @@
   *out = __builtin_amdgcn_ds_fmaxf(out, src, 0, a, false); // expected-error {{argument to '__builtin_amdgcn_ds_fmaxf' must be a constant integer}}
   *out = __builtin_amdgcn_ds_fmaxf(out, src, 0, 0, a); // expected-error {{argument to '__builtin_amdgcn_ds_fmaxf' must be a constant integer}}
 }
+
+void test_fence() {
+  __builtin_amdgcn_fence(__ATOMIC_SEQ_CST + 1, "workgroup"); // expected-warning {{memory order argument to atomic operation is invalid}}
+  __builtin_amdgcn_fence(__ATOMIC_ACQUIRE - 1, "workgroup"); // expected-warning {{memory order argument to atomic operation is invalid}}
+  __builtin_amdgcn_fence(4); // expected-error {{too few arguments to function call, expected 2}}
+  __builtin_amdgcn_fence(4, 4, 4); // expected-error {{too many arguments to function call, expected 2}}
+  __builtin_amdgcn_fence(3.14, ""); // expected-warning {{implicit conversion from 'double' to 'unsigned int' changes value from 3.14 to 3}}
+  __builtin_amdgcn_fence(__ATOMIC_ACQUIRE, 5); // expected-warning {{incompatible integer to pointer conversion passing 'int' to parameter of type 'const char *'}}
+  const char ptr[] = "workgroup";
+  __builtin_amdgcn_fence(__ATOMIC_ACQUIRE, ptr); // expected-error {{expression is not a string literal}}
+}
Index: clang/test/Sema/builtin-amdgcn-fence-failure.cpp
===
--- /dev/null
+++ clang/test/Sema/builtin-amdgcn-fence-failure.cpp
@@ -0,0 +1,9 @@
+// REQUIRES: amdgpu-registered-target
+// RUN: not %clang_cc1 %s -S \
+// RUN:   -triple=amdgcn-amd-amdhsa 2>&1 | FileCheck %s
+
+void test_amdgcn_fence_failure() {
+
+  // CHECK: error: Unsupported atomic synchronization scope
+  __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "foobar");
+}
Index: clang/test/CodeGenCXX/builtin-amdgcn-fence.cpp
===
--- /dev/null
+++ clang/test/CodeGenCXX/builtin-amdgcn-fence.cpp
@@ -0,0 +1,22 @@
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 %s -emit-llvm -O0 -o - \
+// RUN:   -triple=amdgcn-amd-amdhsa  | opt -S | FileCheck %s
+
+void test_memory_fence_success() {
+  // CHECK-LABEL: test_memory_fence_success
+
+  // CHECK: fence syncscope("workgroup") seq_cst
+  __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup");
+
+  // CHECK: fence syncscope("agent") acquire
+  __builtin_amdgcn_fence(__ATOMIC_ACQUIRE, "agent");
+
+  // CHECK: fence seq_cst
+  __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "");
+
+  // CHECK: fence syncscope("agent") acq_rel
+  __builtin_amdgcn_fence(4, "agent");
+
+  // CHECK: fence syncscope("workgroup") release
+  __builtin_amdgcn_fence(3, "workgroup");
+}
Index: clang/lib/Sema/SemaChecking.cpp
===
--- clang/lib/Sema/SemaChecking.cpp
+++ clang/lib/Sema/SemaChecking.cpp
@@ -1920,6 +1920,10 @@
     if (CheckPPCBuiltinFunctionCall(BuiltinID, TheCall))
       return ExprError();
     break;
+  case llvm::Triple::amdgcn:
+    if (CheckAMDGCNBuiltinFunctionCall(BuiltinID, TheCall))
+      return ExprError();
+    break;
   default:
     break;
   }
@@ -3033,6 +3037,46 @@
   return SemaBuiltinConstantArgRange(TheCall, i, l, u);
 }
 
+bool Sema::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
+                                          CallExpr *TheCall) {
+  switch (BuiltinID) {
+  case AMDGPU::BI__builtin_amdgcn_fence: {
+    ExprResult Arg = TheCall->getArg(0);
+    auto ArgExpr = Arg.get();
+    Expr::EvalResult ArgResult;
+
+    if (!ArgExpr->EvaluateAsInt(ArgResult, Context))
+      return Diag(ArgExpr->getExprLoc(), diag::err_typecheck_expect_int)
+             << ArgExpr->getType();
+    int ord = ArgResult.Val.getInt().getZExtValue();
+
+    // Check validity of memory ordering as per C11 / C++11's memory model.
+    switch (static_cast<llvm::AtomicOrderingCABI>(ord)) {
+    case llvm::AtomicOrderingCABI::acquire:
+    case llvm::AtomicOrderingCABI::release:
+    case llvm::AtomicOrderingCABI::acq_rel:
+   

[PATCH] D78900: [HIP][AMDGPU] Enable structurizer workarounds

2020-04-27 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds created this revision.
Herald added subscribers: llvm-commits, cfe-commits, kerbowa, hiraditya, t-tye, 
tpr, dstuttard, yaxunl, nhaehnle, wdng, jvesely, kzhuravl, arsenm.
Herald added projects: clang, LLVM.
sameerds added reviewers: yaxunl, scchan, arsenm, rampitec, dstuttard.

The StructurizeCFG pass has two known limitations that are addressed by
the FixIrreducible pass and the UnifyLoopExits pass. These passes are
invoked by the AMDGPU backend, but are currently hidden behind a
command-line option which is disabled by default.

This change enables the workaround when compiling a HIP program. The
option can now be specified multiple times. The user can thus
override the HIP default using:

  clang -mllvm --amdgpu-enable-structurizer-workarounds=false

The HIP toolchain passes all "-mllvm" options to both opt and
llc. But this particular option has no effect on the opt invocation
since it is only used by the AMDGPU target during code generation.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D78900

Files:
  clang/lib/Driver/ToolChains/HIP.cpp
  clang/test/Driver/hip-structurizer-workarounds.hip
  llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp


Index: llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
===
--- llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -195,7 +195,7 @@
 static cl::opt EnableStructurizerWorkarounds(
 "amdgpu-enable-structurizer-workarounds",
 cl::desc("Enable workarounds for the StructurizeCFG pass"), cl::init(false),
-cl::Hidden);
+cl::Hidden, cl::ZeroOrMore);
 
 extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   // Register the target
Index: clang/test/Driver/hip-structurizer-workarounds.hip
===
--- /dev/null
+++ clang/test/Driver/hip-structurizer-workarounds.hip
@@ -0,0 +1,11 @@
+// REQUIRES: clang-driver
+// REQUIRES: x86-registered-target
+// REQUIRES: amdgpu-registered-target
+
+// RUN: %clang -### -x hip %s 2>&1 | FileCheck %s -check-prefix=CHECK
+// RUN: %clang -### -x hip -mllvm --amdgpu-enable-structurizer-workarounds=false %s 2>&1 | FileCheck %s -check-prefix=CHECK -check-prefix=MLLVM
+
+// CHECK: "{{.*llc}}"
+// CHECK-SAME: "-mtriple=amdgcn-amd-amdhsa"
+// CHECK-SAME: "--amdgpu-enable-structurizer-workarounds"
+// MLLVM-SAME: "--amdgpu-enable-structurizer-workarounds=false"
Index: clang/lib/Driver/ToolChains/HIP.cpp
===
--- clang/lib/Driver/ToolChains/HIP.cpp
+++ clang/lib/Driver/ToolChains/HIP.cpp
@@ -155,6 +155,7 @@
   LlcArgs.push_back(Args.MakeArgString("-mcpu=" + SubArchName));
   LlcArgs.push_back(
   Args.MakeArgString(Twine("-filetype=") + (OutputIsAsm ? "asm" : "obj")));
+  LlcArgs.push_back("--amdgpu-enable-structurizer-workarounds");
 
   // Extract all the -m options
  std::vector<StringRef> Features;


[PATCH] D78900: [HIP][AMDGPU] Enable structurizer workarounds

2020-04-27 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/Driver/ToolChains/HIP.cpp:158
   Args.MakeArgString(Twine("-filetype=") + (OutputIsAsm ? "asm" : "obj")));
+  LlcArgs.push_back("--amdgpu-enable-structurizer-workarounds");
 

arsenm wrote:
> We should just flip this in the backend. llc flags are not intended for 
> frontends to set options
Admittedly, this is a broken workflow. I am not even sure if Clang should be 
invoking llc at all. Ideally, we would want to eliminate this flag entirely. 
But flipping it is okay too if graphics workloads are sensitive to it. The
priority right now is to enable the workarounds officially only for HIP, and
this seems to be the most concise way to do it. We can flip the flag once
other users of the AMDGPU backend are on board.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D78900/new/

https://reviews.llvm.org/D78900





[PATCH] D78900: [HIP][AMDGPU] Enable structurizer workarounds

2020-05-10 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds marked an inline comment as done.
sameerds added inline comments.



Comment at: clang/lib/Driver/ToolChains/HIP.cpp:158
   Args.MakeArgString(Twine("-filetype=") + (OutputIsAsm ? "asm" : "obj")));
+  LlcArgs.push_back("--amdgpu-enable-structurizer-workarounds");
 

sameerds wrote:
> arsenm wrote:
> > We should just flip this in the backend. llc flags are not intended for 
> > frontends to set options
> Admittedly, this is a broken workflow. I am not even sure if Clang should be 
> invoking llc at all. Ideally, we would want to eliminate this flag entirely. 
> But flipping it is okay too if graphics workloads are sensitive to it. The 
> priority right now is to enable the workarounds officially only for HIP and 
> this seems to be most concise way to do it. We can flip the flag once other 
> users of the AMDGPU are on board.
To clarify, we can flip it in the backend if graphics workloads are **not**
sensitive to it. But HIP needs this earlier.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D78900/new/

https://reviews.llvm.org/D78900





[PATCH] D79754: [OpenMP][AMDGCN] Support OpenMP offloading for AMDGCN architecture - Part 1

2020-05-11 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/AST/Decl.cpp:3224
 
+  if (Context.getTargetInfo().getTriple().isAMDGCN() &&
+      Context.BuiltinInfo.isPredefinedLibFunction(BuiltinID) &&

arsenm wrote:
> This is also identical to the cuda handling above, can you merge them
It's not identical, because CUDA is filtering out host code and its check is 
target independent.

But then, Saiyed, it seems that the new patch disallows library builtins on all 
languages that reach this point, both on host and device code. Is that the 
intention? Does this point in the flow preclude any side-effects outside of 
"OpenMP on AMDGCN"?



Comment at: llvm/include/llvm/ADT/Triple.h:696
+  /// Tests whether the target is AMDGCN
+  bool isAMDGCN() const { return getArch() == Triple::amdgcn; }
+

Why not just isAMDGPU()? I myself don't have an opinion either way, but still 
curious.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79754/new/

https://reviews.llvm.org/D79754





[PATCH] D79754: [OpenMP][AMDGCN] Support OpenMP offloading for AMDGCN architecture - Part 1

2020-05-11 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3162
+  Opts.OpenMPIsDevice && (T.isNVPTX() || T.isAMDGCN()) &&
   Args.hasArg(options::OPT_fopenmp_cuda_force_full_runtime);
 

jdoerfert wrote:
> Can we please not call it CUDA mode for non-CUDA targets? Or do you expect 
> the compilation to "identify" as CUDA compilation?
I suspect it's just a lot of water under the bridge. All over Clang, HIP has 
traditionally co-opted CUDA code without introducing new identifiers, like 
"-fcuda-is-device". It won't be easy to redo that now, and definitely looks out 
of scope for this change. A typical HIP compilation usually does identify 
itself as a HIP compilation like setting the isHIP flag. This allows the 
frontend to distinguish between CUDA and HIP when it matters.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79754/new/

https://reviews.llvm.org/D79754





[PATCH] D79754: [OpenMP][AMDGCN] Support OpenMP offloading for AMDGCN architecture - Part 1

2020-05-11 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: llvm/include/llvm/ADT/Triple.h:700
 return getArch() == Triple::r600 || getArch() == Triple::amdgcn;
   }
 

yaxunl wrote:
> jdoerfert wrote:
> > What's the difference between an AMDGPU and AMDGCN?
> AMDGPU includes r600 and amdgcn. r600 are old AMD GPUs. They do not support
> the flat address space and therefore cannot support CUDA or HIP.
As a separate topic, does that mean that Clang should reject r600 at an earlier 
entry point itself?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79754/new/

https://reviews.llvm.org/D79754





[PATCH] D79754: [OpenMP][AMDGCN] Support OpenMP offloading for AMDGCN architecture - Part 1

2020-05-11 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/AST/Decl.cpp:3227
+  !(BuiltinID == Builtin::BIprintf || BuiltinID == Builtin::BImalloc))
+    return 0;
+

This needs a test.



Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3102
   // handling code for those requiring so.
-  if ((Opts.OpenMPIsDevice && T.isNVPTX()) || Opts.OpenCLCPlusPlus) {
+  if ((Opts.OpenMPIsDevice && (T.isNVPTX() || T.isAMDGCN())) ||
+      Opts.OpenCLCPlusPlus) {

Needs a test.



Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3137
 TT.getArch() == llvm::Triple::nvptx64 ||
+TT.getArch() == llvm::Triple::amdgcn ||
 TT.getArch() == llvm::Triple::x86 ||

This is probably too fundamental to need a test on its own.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79754/new/

https://reviews.llvm.org/D79754





[PATCH] D79754: [OpenMP][AMDGCN] Support OpenMP offloading for AMDGCN architecture - Part 1

2020-05-15 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/include/clang/Basic/TargetInfo.h:1261
+  /// Currently only supports NVPTX and AMDGCN
+  static bool isOpenMPGPU(llvm::Triple &T) {
+return T.isNVPTX() || T.isAMDGCN();

How is "OpenMP-compatible GPU" defined? I think it's too early to start 
designing predicates about whether a target is a GPU and whether it supports 
OpenMP.



Comment at: clang/lib/AST/Decl.cpp:3221
+!hasAttr()) ||
+   Context.getTargetInfo().getTriple().isAMDGCN()) &&
   !(BuiltinID == Builtin::BIprintf || BuiltinID == Builtin::BImalloc))

This seems awkward to me. Why mix it up with only CUDA and HIP? The earlier 
factoring is better, where CUDA/HIP took care of their own business, and the 
catch-all case of AMDGCN was a separate clause by itself. It doesn't matter 
that the builtins being checked for AMDGCN on OpenMP are //currently// 
identical to CUDA/HIP. When this situation later changes (I am sure OpenMP will 
support more builtins), we will have to split it out again anyway.



Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3100
 
+  bool IsOpenMPGPU = clang::TargetInfo::isOpenMPGPU(T);
+

I am not particularly in favour of introducing a variable just because it looks 
smaller than a call at each appropriate location. If you really want it this 
way, at least make it a const.



Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3104
   // handling code for those requiring so.
-  if ((Opts.OpenMPIsDevice && T.isNVPTX()) || Opts.OpenCLCPlusPlus) {
+  if ((Opts.OpenMPIsDevice && IsOpenMPGPU) || Opts.OpenCLCPlusPlus) {
 Opts.Exceptions = 0;

Looking at the comment before this line, the correct predicate would be
"target supports exceptions with OpenMP". Is it always true that every GPU
that supports OpenMP will not support exception handling? I would recommend
just checking individual targets for now instead of inventing predicates.



Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3157
+  // Set CUDA mode for OpenMP target NVPTX/AMDGCN if specified in options
+  Opts.OpenMPCUDAMode = Opts.OpenMPIsDevice && IsOpenMPGPU &&
 Args.hasArg(options::OPT_fopenmp_cuda_mode);

Is there any reason to believe that every future GPU added to this predicate 
will also want the CUDA mode set? I would recommend using individual targets 
for now instead of inventing a new predicate.



Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3162
   Opts.OpenMPCUDAForceFullRuntime =
-  Opts.OpenMPIsDevice && T.isNVPTX() &&
+  Opts.OpenMPIsDevice && IsOpenMPGPU &&
   Args.hasArg(options::OPT_fopenmp_cuda_force_full_runtime);

Same doubt about this use of an artificial predicate as commented earlier.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79754/new/

https://reviews.llvm.org/D79754





[PATCH] D79754: [OpenMP][AMDGCN] Support OpenMP offloading for AMDGCN architecture - Part 1

2020-05-16 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/AST/Decl.cpp:3221
+!hasAttr()) ||
+   Context.getTargetInfo().getTriple().isAMDGCN()) &&
   !(BuiltinID == Builtin::BIprintf || BuiltinID == Builtin::BImalloc))

sameerds wrote:
> This seems awkward to me. Why mix it up with only CUDA and HIP? The earlier 
> factoring is better, where CUDA/HIP took care of their own business, and the 
> catch-all case of AMDGCN was a separate clause by itself. It doesn't matter 
> that the builtins being checked for AMDGCN on OpenMP are //currently// 
> identical to CUDA/HIP. When this situation later changes (I am sure OpenMP 
> will support more builtins), we will have to split it out again anyway.
This clause seems to assume that AMDGCN automatically means OpenMP whenever it 
is not HIP. That does not sound right. Is there a Context.getLangOpts().OpenMP 
flag? If not, why not? There also seems to be an Opts.OpenMPIsDevice ... 
perhaps that could be used here to write a separate OpenMP clause?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79754/new/

https://reviews.llvm.org/D79754





[PATCH] D79754: [OpenMP][AMDGCN] Support OpenMP offloading for AMDGCN architecture - Part 1

2020-05-17 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/AST/Decl.cpp:3227
+  !(BuiltinID == Builtin::BIprintf || BuiltinID == Builtin::BImalloc))
+    return 0;
+

sameerds wrote:
> This needs a test.
This still needs a test. One that shows that specific builtins are allowed and 
others are not.
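Something along these lines would do (a sketch only: whether the disallowed
case surfaces as a diagnostic or merely as a plain call in the IR is left open
here, and the declarations stand in for headers):

  // Compiled for the OpenMP AMDGCN device: printf and malloc should still
  // resolve as builtins, while other predefined library functions must not.
  extern "C" int printf(const char *, ...);
  extern "C" void *malloc(unsigned long);
  extern "C" char *strcpy(char *, const char *);

  void test_device_builtins() {
    printf("supported\n");   // allowed by the device-side runtime
    void *p = malloc(8);     // allowed by the device-side runtime
    char buf[8];
    strcpy(buf, "x");        // must not be treated as a builtin here
  }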



Comment at: clang/test/Driver/openmp-offload-gpu.c:257
 // RUN:   | FileCheck -check-prefix=CUDA_MODE %s
-// CUDA_MODE: clang{{.*}}"-cc1"{{.*}}"-triple" "nvptx64-nvidia-cuda"
-// CUDA_MODE-SAME: "-fopenmp-cuda-mode"

Is there a functional reason to move these lines? Changes to existing files 
should be minimized to show only functional changes. Any NFC rearrangement 
should be a separate patch.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79754/new/

https://reviews.llvm.org/D79754





[PATCH] D79754: [OpenMP][AMDGCN] Support OpenMP offloading for AMDGCN architecture - Part 1

2020-05-17 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/AST/Decl.cpp:3224
 
+  // HIP does not have device-side standard library. printf and malloc are
+  // the only special cases that are supported by device-side runtime.

This should say "OpenMP does not have ..."?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79754/new/

https://reviews.llvm.org/D79754





[PATCH] D79754: [OpenMP][AMDGCN] Support OpenMP offloading for AMDGCN architecture - Part 1

2020-05-18 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/AST/Decl.cpp:3226
+  // the only special cases that are supported by device-side runtime.
+  if (Context.getTargetInfo().getTriple().isAMDGCN() &&
+      Context.getLangOpts().OpenMPIsDevice &&

Why is the check for AMDGCN required here? Doesn't the language define the set 
of supported builtins in a language-independent way? At least that seems to be 
true for OpenCL and CUDA above.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79754/new/

https://reviews.llvm.org/D79754





[PATCH] D121951: [AMDGPU][OpenCL] Add "amdgpu-no-hostcall-ptr" in Clang codegen pre-COV_5

2022-03-24 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/CodeGen/TargetInfo.cpp:9381
+  M.getTarget().getTargetOpts().CodeObjectVersion != 500) {
+F->addFnAttr("amdgpu-no-hostcall-ptr");
+  }

The frontend does not need to worry about this attribute. See the comment in 
the MetadataStreamer. A worthwhile check would be to generate an error if we 
are able to detect that some hostcall service is being used in OpenCL on 
code-object-v4 or lower. None exists right now, but we should add the check if 
such services show up. But those checks are likely to be in a different place. 
For example, enabling asan on OpenCL for code-object-v4 should result in an 
error in the place where asan commandline options are parsed.
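
A hypothetical sketch of such a check at option-parsing time (the diagnostic
exists in the driver, but the surrounding variables here are illustrative):

```
// While parsing sanitizer arguments in the driver:
if (IsOpenCL && CodeObjectVersion < 5 && SanArgs.needsAsanRt())
  D.Diag(diag::err_drv_unsupported_opt_for_target)
      << "-fsanitize=address" << TC.getTriple().str();
```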



Comment at: llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp:406-408
 if (Func.getParent()->getNamedMetadata("llvm.printf.fmts"))
   emitKernelArg(DL, Int8PtrTy, Align(8), ValueKind::HiddenPrintfBuffer);
+else if (!Func.hasFnAttribute("amdgpu-no-hostcall-ptr"))

I would structure this differently: If this is code-object-v4 or lower, then if 
 "llvm.printf.fmts" is present, then this kernel clearly contains OpenCL bits, 
and cannot use hostcall. So it's okay to just assume that the no-hostcall-ptr 
attribute is always present in this situation, which means the only metadata 
generated is for ValueKind::HiddenPrintfBuffer. Else if this is code-object-v5, 
then proceed to emit both metadata.
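
Roughly like this (a sketch of the suggested structure; COV stands for the
code object version, which in the actual patch is implicit in the choice of
streamer class):

```
if (COV <= 4) {
  // OpenCL printf implies no hostcall on these code objects, so printf
  // metadata is the only hidden argument that can be emitted here.
  if (Func.getParent()->getNamedMetadata("llvm.printf.fmts"))
    emitKernelArg(DL, Int8PtrTy, Align(8), ValueKind::HiddenPrintfBuffer);
} else {
  // COV 5: both hidden arguments may coexist.
  if (Func.getParent()->getNamedMetadata("llvm.printf.fmts"))
    emitKernelArg(DL, Int8PtrTy, Align(8), ValueKind::HiddenPrintfBuffer);
  if (!Func.hasFnAttribute("amdgpu-no-hostcall-ptr"))
    emitKernelArg(DL, Int8PtrTy, Align(8), ValueKind::HiddenHostcallBuffer);
}
```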




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121951/new/

https://reviews.llvm.org/D121951

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D121951: [AMDGPU][OpenCL] Add "amdgpu-no-hostcall-ptr" in Clang codegen pre-COV_5

2022-03-29 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D121951#3411856, @scott.linder 
wrote:

> @yaxunl Does excluding device-libs via COV_None make sense?
>
> @sameerds Would you still rather we just not add this attribute in the 
> frontend at all? I'm OK with it if that is the consensus

Yes, I still think there is no need to emit that attribute in the frontend. It 
will always be inferred by the Attributor when optimization is enabled. This 
also eliminates the check for COV_None and there seems to be some uncertainty 
about COV_None anyway. This also eliminates the updates to all the tests where 
the no-hostcall-ptr attribute does not actually matter. If ever we need to 
check if hostcall is being used on OpenCL and COV < 5, we should do it per 
feature and inform the user appropriately.




Comment at: llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp:406-408
 if (Func.getParent()->getNamedMetadata("llvm.printf.fmts"))
   emitKernelArg(DL, Int8PtrTy, Align(8), ValueKind::HiddenPrintfBuffer);
+else if (!Func.hasFnAttribute("amdgpu-no-hostcall-ptr"))

scott.linder wrote:
> sameerds wrote:
> > I would structure this differently: If this is code-object-v4 or lower, 
> > then if  "llvm.printf.fmts" is present, then this kernel clearly contains 
> > OpenCL bits, and cannot use hostcall. So it's okay to just assume that the 
> > no-hostcall-ptr attribute is always present in this situation, which means 
> > the only metadata generated is for ValueKind::HiddenPrintfBuffer. Else if 
> > this is code-object-v5, then proceed to emit both metadata.
> > 
> > 
> I'm not sure I follow; doesn't the code as-is implement what you're 
> describing?
> 
> If the printf metadata is present, this will only ever emit the 
> `HiddenPrintfBuffer`, irrespective of the hostcall attribute. Otherwise, this 
> respects `amdgpu-no-hostcall-ptr` (i.e. for HIP and other languages).
> 
> The "if this is code-object-v4 or lower" portion is implicit, as this is just 
> the `MetadataStreamerV2` impl. The `MetadataStreamerV3` (below) and 
> `MetadataStreamerV4` (inherits from V3) impls below are similar. The 
> `MetadataStreamerV5` impl is already correct for V5 (i.e. supports both 
> argument kinds for the same kernel).
Your last para about different streamers cleared the confusion. So, this looks 
good.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121951/new/

https://reviews.llvm.org/D121951

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D121951: [AMDGPU][OpenCL] Remove "printf and hostcall" diagnostic

2022-03-29 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds accepted this revision.
sameerds added a comment.

LGTM! But not sure if @yaxunl and/or @arsenm should have another look at the 
final version.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121951/new/

https://reviews.llvm.org/D121951

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D121951: [AMDGPU] Only warn when mixing printf and hostcall

2022-03-18 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

The check for "__ockl_hostcall_internal" is no longer relevant with the new 
hostcall attribute. Can we simply remove this check? What is the situation 
where the warning is still useful?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121951/new/

https://reviews.llvm.org/D121951

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D121951: [AMDGPU] Only warn when mixing printf and hostcall

2022-03-18 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

Yeah, it happened in D119216.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121951/new/

https://reviews.llvm.org/D121951

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D121951: [AMDGPU] Only warn when mixing printf and hostcall

2022-03-18 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D121951#3392470, @scott.linder 
wrote:

> In D121951#3391472, @sameerds 
> wrote:
>
>> The check for "__ockl_hostcall_internal" is no longer relevant with the new 
>> hostcall attribute. Can we simply remove this check? What is the situation 
>> where the warning is still useful?
>
> I wasn't aware of the new attribute, I'm happy to just delete it.

Yeah, it happened in D119216.

But I am still curious about the check itself. Do we need to worry about a 
situation where it is important to check whether OpenCL printf (non-hostcall) 
and some other hostcall service are active in the same application?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121951/new/

https://reviews.llvm.org/D121951

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-07 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds created this revision.
Herald added subscribers: sdasgup3, wenzhicui, wrengr, Chia-hungDuan, foad, 
dcaballe, cota, teijeong, rdzhabarov, tatianashp, okura, jdoerfert, msifontes, 
jurahul, kuter, Kayjukh, grosul1, uenoku, Joonsoo, kerbowa, liufengdb, aartbik, 
mgester, arpith-jacob, csigg, antiagainst, shauheen, rriddle, mehdi_amini, 
hiraditya, t-tye, tpr, dstuttard, yaxunl, nhaehnle, jvesely, kzhuravl, arsenm.
Herald added a reviewer: uenoku.
Herald added a reviewer: bondhugula.
sameerds requested review of this revision.
Herald added subscribers: llvm-commits, cfe-commits, stephenneuendorffer, 
nicolasvasilache, wdng.
Herald added a reviewer: jdoerfert.
Herald added a reviewer: sstefan1.
Herald added a reviewer: herhut.
Herald added a reviewer: baziotis.
Herald added projects: clang, MLIR, LLVM.

The module flag to indicate use of hostcall is insufficient to catch
all cases where hostcall might be in use by a kernel. This is now
replaced by a function attribute that gets propagated to top-level
kernel functions via their respective call-graph.

If the attribute "amdgpu-no-hostcall-ptr" is absent on a kernel, the
default behaviour is to emit kernel metadata indicating that the
kernel uses the hostcall buffer pointer passed as an implicit
argument.

The attribute may be placed explicitly by the user, or inferred by the
AMDGPU attributor by examining the call-graph. The attribute is
inferred only if the function is not being sanitized, and the
implicitarg_ptr does not result in a load of any byte in the hostcall
pointer argument.
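
For illustration, a hand-written IR sketch (not taken from the patch's tests)
of a kernel on which the attribute has been inferred, so no hostcall metadata
would be emitted for it:

```
define amdgpu_kernel void @no_hostcall_needed() #0 {
  ret void
}
attributes #0 = { "amdgpu-no-hostcall-ptr" }
```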


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D119216

Files:
  clang/lib/CodeGen/CodeGenModule.cpp
  clang/test/CodeGenCUDA/amdgpu-asan-printf.cu
  clang/test/CodeGenCUDA/amdgpu-asan.cu
  llvm/lib/Target/AMDGPU/AMDGPUAttributes.def
  llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
  llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
  llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
  llvm/test/CodeGen/AMDGPU/addrspacecast-constantexpr.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa-call.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features.ll
  llvm/test/CodeGen/AMDGPU/direct-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/duplicate-attribute-indirect.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v5.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3-asan.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v5.ll
  llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
  llvm/test/CodeGen/AMDGPU/simple-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-attribute-missing.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-multistep.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-nested-function-calls.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-prevent-attribute-propagation.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
  mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp

Index: mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
===
--- mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
+++ mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
@@ -308,11 +308,6 @@
 }
   }
 
-  // Set amdgpu_hostcall if host calls have been linked, as needed by newer LLVM
-  // FIXME: Is there a way to set this during printf() lowering that makes sense
-  if (ret->getFunction("__ockl_hostcall_internal"))
-if (!ret->getModuleFlag("amdgpu_hostcall"))
-  ret->addModuleFlag(llvm::Module::Override, "amdgpu_hostcall", 1);
   return ret;
 }
 
Index: llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
===
--- llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
+++ llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
@@ -61,5 +61,5 @@
 
 attributes #0 = { "uniform-work-group-size"="false" }
 ;.
-; CHECK: attributes #[[ATTR0]] = { "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem

[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-08 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp:566
+  return false;
+};
+

jdoerfert wrote:
> You can use AAPointerInfo for the call site return IRPosition. It will 
> (through the iterations) gather all accesses and put them into "bins" based 
> on offset and size. It deals with uses in calls, etc. and if there is stuff 
> missing it is better to add it in one place so we benefit throughout. 
I am not following what you have in mind. "implicitarg_ptr" is a pointer 
returned by an intrinsic that reads an ABI-defined register. I need to check 
that for a given call-graph, a particular range of bytes relative to that base 
pointer is never accessed. The above DFS on the uses conservatively assumes 
that such a load exists unless it can conclusively trace every use of the base 
pointer. This may include the pointer being passed to an extern function or 
being stored into a different memory location (although we don't expect ABI 
registers being captured this way). I am not seeing how to construct this around 
AAPointerInfo. As far as I can see, the public interface only talks about uses 
that are recognized as loads and stores.
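
For reference, a greatly simplified sketch of such a walk (hypothetical
helper; unlike the actual patch it ignores offsets and treats every load as
overlapping):

```
static bool mayAccessHostcallSlot(Value *Base) {
  SmallVector<Use *, 16> WorkList;
  SmallPtrSet<Use *, 16> Visited;
  for (Use &U : Base->uses())
    WorkList.push_back(&U);
  while (!WorkList.empty()) {
    Use *U = WorkList.pop_back_val();
    if (!Visited.insert(U).second)
      continue;
    User *Usr = U->getUser();
    if (isa<LoadInst>(Usr))
      return true; // conservatively assume the load overlaps the slot
    if (isa<GetElementPtrInst>(Usr) || isa<BitCastInst>(Usr)) {
      for (Use &UU : Usr->uses()) // keep tracing derived pointers
        WorkList.push_back(&UU);
      continue;
    }
    return true; // opaque use (call, store, phi, ...): assume the worst
  }
  return false;
}
```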


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119216/new/

https://reviews.llvm.org/D119216

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-08 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds updated this revision to Diff 406785.
sameerds added a comment.

eliminated NeedsHostcall by simply checking the return value of the walk over 
all calls


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119216/new/

https://reviews.llvm.org/D119216

Files:
  clang/lib/CodeGen/CodeGenModule.cpp
  clang/test/CodeGenCUDA/amdgpu-asan-printf.cu
  clang/test/CodeGenCUDA/amdgpu-asan.cu
  llvm/lib/Target/AMDGPU/AMDGPUAttributes.def
  llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
  llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
  llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
  llvm/test/CodeGen/AMDGPU/addrspacecast-constantexpr.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa-call.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features.ll
  llvm/test/CodeGen/AMDGPU/direct-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/duplicate-attribute-indirect.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v5.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3-asan.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v5.ll
  llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
  llvm/test/CodeGen/AMDGPU/simple-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-attribute-missing.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-multistep.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-nested-function-calls.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-prevent-attribute-propagation.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
  mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp

Index: mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
===
--- mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
+++ mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
@@ -308,11 +308,6 @@
 }
   }
 
-  // Set amdgpu_hostcall if host calls have been linked, as needed by newer LLVM
-  // FIXME: Is there a way to set this during printf() lowering that makes sense
-  if (ret->getFunction("__ockl_hostcall_internal"))
-if (!ret->getModuleFlag("amdgpu_hostcall"))
-  ret->addModuleFlag(llvm::Module::Override, "amdgpu_hostcall", 1);
   return ret;
 }
 
Index: llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
===
--- llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
+++ llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
@@ -61,5 +61,5 @@
 
 attributes #0 = { "uniform-work-group-size"="false" }
 ;.
-; CHECK: attributes #[[ATTR0]] = { "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR0]] = { "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ;.
Index: llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
===
--- llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
+++ llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
@@ -101,7 +101,7 @@
 attributes #0 = { nounwind readnone }
 attributes #1 = { "uniform-work-group-size"="true" }
 ;.
-; CHECK: attributes #[[ATTR0]] = { nounwind readnone "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR1]] = { nounwind readnone "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgrou

[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-08 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp:589
+A.checkForAllCallLikeInstructions(CheckForHostcallAccess, *this,
+  UsedAssumedInformation);
+

jdoerfert wrote:
> Always check the return value, if it is false something went wrong and some 
> calls might have been missed. Basically,
> ```
> if (!A.check...)
>   NeedsHostCallPtr = true;
> ```
> then you also can just return false in the callback w/o setting it.
Thanks! This eliminated the variable. Also used the opportunity to get more 
creative with the callback names!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119216/new/

https://reviews.llvm.org/D119216

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-08 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds updated this revision to Diff 406860.
sameerds added a comment.

further simplified the callback return value; moved hostcall info out of the 
subtarget


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119216/new/

https://reviews.llvm.org/D119216

Files:
  clang/lib/CodeGen/CodeGenModule.cpp
  clang/test/CodeGenCUDA/amdgpu-asan-printf.cu
  clang/test/CodeGenCUDA/amdgpu-asan.cu
  llvm/lib/Target/AMDGPU/AMDGPUAttributes.def
  llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
  llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
  llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
  llvm/test/CodeGen/AMDGPU/addrspacecast-constantexpr.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa-call.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features.ll
  llvm/test/CodeGen/AMDGPU/direct-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/duplicate-attribute-indirect.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v5.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3-asan.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v5.ll
  llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
  llvm/test/CodeGen/AMDGPU/simple-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-attribute-missing.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-multistep.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-nested-function-calls.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-prevent-attribute-propagation.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
  mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp

Index: mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
===
--- mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
+++ mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
@@ -308,11 +308,6 @@
 }
   }
 
-  // Set amdgpu_hostcall if host calls have been linked, as needed by newer LLVM
-  // FIXME: Is there a way to set this during printf() lowering that makes sense
-  if (ret->getFunction("__ockl_hostcall_internal"))
-if (!ret->getModuleFlag("amdgpu_hostcall"))
-  ret->addModuleFlag(llvm::Module::Override, "amdgpu_hostcall", 1);
   return ret;
 }
 
Index: llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
===
--- llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
+++ llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
@@ -61,5 +61,5 @@
 
 attributes #0 = { "uniform-work-group-size"="false" }
 ;.
-; CHECK: attributes #[[ATTR0]] = { "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR0]] = { "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ;.
Index: llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
===
--- llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
+++ llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
@@ -101,7 +101,7 @@
 attributes #0 = { nounwind readnone }
 attributes #1 = { "uniform-work-group-size"="true" }
 ;.
-; CHECK: attributes #[[ATTR0]] = { nounwind readnone "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR1]] = { nounwind readnone "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup

[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-08 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp:198
+const auto *STI = TM.getMCSubtargetInfo();
+return llvm::AMDGPU::getHostcallImplicitArgPosition(STI);
+  }

arsenm wrote:
> The ABI should not be a property of the subtarget, and the global subtarget 
> shouldn't be used
I don't understand what the objection is. Existing functions that check the ABI 
version clearly do so by accessing the subtarget. I am merely following 
existing practice.

I have now rearranged the code a bit. Maybe this works? To be honest, I am not 
very familiar with how ABI information is tracked. I will heartily welcome any 
advice on how to retrieve the ABI version and then determine the location of 
the hostcall implicit arg.



Comment at: llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp:521
+
+  while (!WorkList.empty()) {
+auto UseInfo = WorkList.back();

arsenm wrote:
> Can you use checkForAllUses instead of creating your own worklist?
checkForAllUses does not track sufficient state to catch every load that 
happens to overlap with the hostcall pointer. The only pattern likely to happen 
in real life is "GEP to offset 24, typecast to i64*, load 8 bytes". But unless 
we can guarantee that this is the only pattern, we need something robust.

@jdoerfert pointed me in the correct direction to make full use of existing 
machinery, but that is now dependent on D119249.
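
In IR, the "GEP to offset 24" pattern mentioned above looks roughly like this
(a hand-written sketch in the typed-pointer syntax of the time, not copied
from the tests):

```
define amdgpu_kernel void @uses_hostcall() {
  %p = call i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr()
  %q = getelementptr i8, i8 addrspace(4)* %p, i64 24
  %c = bitcast i8 addrspace(4)* %q to i64 addrspace(4)*
  %h = load i64, i64 addrspace(4)* %c, align 8
  ret void
}
declare i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr()
```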



Comment at: llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp:581-584
+  if (ArgUsedToRetrieveHostcallPtr(I)) {
+return false;
+  }
+  return true;

jdoerfert wrote:
> I'd suggest to switch the return value of ArgUsed... so it matches this 
> functions (and other callbacks).
Agreed. In fact switching the return value allowed me to merge the two 
callbacks into one. I am still keeping the positive name on the outermost 
function "funcRetrievesHostcallPtr" because it matches the sense near the 
callsite.



Comment at: llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp:566
+  return false;
+};
+

jdoerfert wrote:
> jdoerfert wrote:
> > sameerds wrote:
> > > jdoerfert wrote:
> > > > You can use AAPointerInfo for the call site return IRPosition. It will 
> > > > (through the iterations) gather all accesses and put them into "bins" 
> > > > based on offset and size. It deals with uses in calls, etc. and if 
> > > > there is stuff missing it is better to add it in one place so we 
> > > > benefit throughout. 
> > > I am not following what you have in mind. "implicitarg_ptr" is a pointer 
> > > returned by an intrinsic that reads an ABI-defined register. I need to 
> > > check that for a given call-graph, a particular range of bytes relative 
> > > to that base pointer is never accessed. The above DFS on the uses 
> > > conservatively assumes that such a load exists unless it can conclusively 
> > > trace every use of the base pointer. This may include the pointer being 
> > > passed to an extern function or being stored into a different memory 
> > > location (although we don't expect ABI registers being captured this way). 
> > > I am not seeing how to construct this around AAPointerInfo. As far as I 
> > > can see, the public interface only talks about uses that are recognized 
> > > as loads and stores.
> > Not actually tested, replaces the function body. Depends on D119249.
> > ```
> > const auto &PointerInfoAA = A.getAAFor<AAPointerInfo>(*this, 
> > IRPosition::callsite_returned(cast<CallBase>(Ptr)), DepClassTy::Required);
> > if (!PointerInfoAA.getState().isValidState())
> >   return true; // Abort (which is weird as false is abort in the other CB).
> > AAPointerInfo::OffsetAndSize OAS(*Position, /* probably look pointer width 
> > up in DL */ 8);
> > return !forallInterferingAccesses(OAS, [](const AAPointerInfo::Access &Acc, 
> > bool IsExact) {
> >return Acc.getRemoteInst()->isDroppable(); });
> > ```
> You don't actually need the state check.
> And as I said, this will take care of following pointers passed into callees 
> or through memory to other places, all while ignoring dead code, etc.
I see now. forallInterferingAccesses() does check for valid state on entry, 
which is sufficient to take care of all the opaque uses like a call to an 
extern function or a complicated phi or a capturing store. Thanks a ton ... 
this has been very educational!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119216/new/

https://reviews.llvm.org/D119216

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-08 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds updated this revision to Diff 406889.
sameerds added a comment.

added tests for i128 load. hostcall position is now independent of subtarget.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119216/new/

https://reviews.llvm.org/D119216

Files:
  clang/lib/CodeGen/CodeGenModule.cpp
  clang/test/CodeGenCUDA/amdgpu-asan-printf.cu
  clang/test/CodeGenCUDA/amdgpu-asan.cu
  llvm/lib/Target/AMDGPU/AMDGPUAttributes.def
  llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
  llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
  llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
  llvm/test/CodeGen/AMDGPU/addrspacecast-constantexpr.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa-call.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features.ll
  llvm/test/CodeGen/AMDGPU/direct-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/duplicate-attribute-indirect.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v5.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3-asan.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v5.ll
  llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
  llvm/test/CodeGen/AMDGPU/simple-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-attribute-missing.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-multistep.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-nested-function-calls.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-prevent-attribute-propagation.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
  mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp

Index: mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
===
--- mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
+++ mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
@@ -308,11 +308,6 @@
 }
   }
 
-  // Set amdgpu_hostcall if host calls have been linked, as needed by newer LLVM
-  // FIXME: Is there a way to set this during printf() lowering that makes sense
-  if (ret->getFunction("__ockl_hostcall_internal"))
-if (!ret->getModuleFlag("amdgpu_hostcall"))
-  ret->addModuleFlag(llvm::Module::Override, "amdgpu_hostcall", 1);
   return ret;
 }
 
Index: llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
===
--- llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
+++ llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
@@ -61,5 +61,5 @@
 
 attributes #0 = { "uniform-work-group-size"="false" }
 ;.
-; CHECK: attributes #[[ATTR0]] = { "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR0]] = { "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ;.
Index: llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
===
--- llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
+++ llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
@@ -101,7 +101,7 @@
 attributes #0 = { nounwind readnone }
 attributes #1 = { "uniform-work-group-size"="true" }
 ;.
-; CHECK: attributes #[[ATTR0]] = { nounwind readnone "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR1]] = { nounwind readnone "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "am

[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-08 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp:566
+  return false;
+};
+

jdoerfert wrote:
> sameerds wrote:
> > jdoerfert wrote:
> > > jdoerfert wrote:
> > > > sameerds wrote:
> > > > > jdoerfert wrote:
> > > > > > You can use AAPointerInfo for the call site return IRPosition. It 
> > > > > > will (through the iterations) gather all accesses and put them into 
> > > > > > "bins" based on offset and size. It deals with uses in calls, etc. 
> > > > > > and if there is stuff missing it is better to add it in one place 
> > > > > > so we benefit throughout. 
> > > > > I am not following what you have in mind. "implicitarg_ptr" is a 
> > > > > pointer returned by an intrinsic that reads an ABI-defined register. 
> > > > > I need to check that for a given call-graph, a particular range of 
> > > > > bytes relative to that base pointer are never accessed. The above DFS 
> > > > > on the uses conservatively assumes that such a load exists unless it 
> > > > > can conclusively trace every use of the base pointer. This may 
> > > > > include the pointer being passed to an extern function or being 
> > > > > stored into a different memory location (although we don't expect ABI 
> > > > > registers being capture this way). I am not seeing how to construct 
> > > > > this around AAPointerInfo. As far as I can see, the public interface 
> > > > > only talks about uses that are recognized as loads and stores.
> > > > Not actually tested, replaces the function body. Depends on D119249.
> > > > ```
> > > > const auto &PointerInfoAA = A.getAAFor<AAPointerInfo>(*this, 
> > > > IRPosition::callsite_returned(cast<CallBase>(Ptr)), 
> > > > DepClassTy::Required);
> > > > if (!PointerInfoAA.getState().isValidState())
> > > >   return true; // Abort (which is weird as false is abort in the other 
> > > > CB).
> > > > AAPointerInfo::OffsetAndSize OAS(*Position, /* probably look pointer 
> > > > width up in DL */ 8);
> > > > return !forallInterferingAccesses(OAS, [](const AAPointerInfo::Access 
> > > > &Acc, bool IsExact) {
> > > >return Acc.getRemoteInst()->isDroppable(); });
> > > > ```
> > > You don't actually need the state check.
> > > And as I said, this will take care of following pointers passed into 
> > > callees or through memory to other places, all while ignoring dead code, 
> > > etc.
> > I see now. forallInterferingAccesses() does check for valid state on entry, 
> > which is sufficient to take care of all the opaque uses like a call to an 
> > extern function or a complicated phi or a capturing store. Thanks a ton ... 
> > this has been very educational!
> Yes, all "forAll" functions will return `false` if we cannot visit "all". 
> Though, these methods will utilize the smarts, e.g., ignore what is dead, 
> look at loads if the value is stored in a way we can track it through memory, 
> transfer accesses in a callee to the caller "space" if a pointer is passed to 
> the callee,... etc.

@jdoerfert, do you see D119249 landing soon? We are kinda on a short runway 
here (less than a handful of days) and hoping to land this review quickly 
because it solves an important issue. I would prefer to have the check that you 
outlined, but the alternative is to let my version through now, and then update 
it when the new interface becomes available.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119216/new/

https://reviews.llvm.org/D119216

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-09 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds marked 2 inline comments as not done.
sameerds added inline comments.



Comment at: llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp:566
+  return false;
+};
+

jdoerfert wrote:
> sameerds wrote:
> > jdoerfert wrote:
> > > sameerds wrote:
> > > > jdoerfert wrote:
> > > > > jdoerfert wrote:
> > > > > > sameerds wrote:
> > > > > > > jdoerfert wrote:
> > > > > > > > You can use AAPointerInfo for the call site return IRPosition. 
> > > > > > > > It will (through the iterations) gather all accesses and put 
> > > > > > > > them into "bins" based on offset and size. It deals with uses 
> > > > > > > > in calls, etc. and if there is stuff missing it is better to 
> > > > > > > > add it in one place so we benefit throughout. 
> > > > > > > I am not following what you have in mind. "implicitarg_ptr" is a 
> > > > > > > pointer returned by an intrinsic that reads an ABI-defined 
> > > > > > > register. I need to check that for a given call-graph, a 
> > > > > > > particular range of bytes relative to that base pointer is never 
> > > > > > > accessed. The above DFS on the uses conservatively assumes that 
> > > > > > > such a load exists unless it can conclusively trace every use of 
> > > > > > > the base pointer. This may include the pointer being passed to an 
> > > > > > > extern function or being stored into a different memory location 
> > > > > > > (although we don't expect ABI registers being captured this way). 
> > > > > > > I am not seeing how to construct this around AAPointerInfo. As 
> > > > > > > far as I can see, the public interface only talks about uses that 
> > > > > > > are recognized as loads and stores.
> > > > > > Not actually tested, replaces the function body. Depends on D119249.
> > > > > > ```
> > > > > > const auto &PointerInfoAA = A.getAAFor<AAPointerInfo>(*this, 
> > > > > > IRPosition::callsite_returned(cast<CallBase>(Ptr)), 
> > > > > > DepClassTy::Required);
> > > > > > if (!PointerInfoAA.getState().isValidState())
> > > > > >   return true; // Abort (which is weird as false is abort in the 
> > > > > > other CB).
> > > > > > AAPointerInfo::OffsetAndSize OAS(*Position, /* probably look 
> > > > > > pointer width up in DL */ 8);
> > > > > > return !forallInterferingAccesses(OAS, [](const 
> > > > > > AAPointerInfo::Access &Acc, bool IsExact) {
> > > > > >return Acc.getRemoteInst()->isDroppable(); });
> > > > > > ```
> > > > > You don't actually need the state check.
> > > > > And as I said, this will take care of following pointers passed into 
> > > > > callees or through memory to other places, all while ignoring dead 
> > > > > code, etc.
> > > > I see now. forallInterferingAccesses() does check for valid state on 
> > > > entry, which is sufficient to take care of all the opaque uses like a 
> > > > call to an extern function or a complicated phi or a capturing store. 
> > > > Thanks a ton ... this has been very educational!
> > > Yes, all "forAll" functions will return `false` if we cannot visit "all". 
> > > Though, these methods will utilize the smarts, e.g., ignore what is dead, 
> > > look at loads if the value is stored in a way we can track it through 
> > > memory, transfer accesses in a callee to the caller "space" if a pointer 
> > > is passed to the callee,... etc.
> > 
> > @jdoerfert, do you see  D119249 landing soon? We are kinda on a short 
> > runway here (less than a handful of days) and hoping to land this review 
> > quickly because it solves an important issue. I would prefer to have the 
> > check that you outlined, but the alternative is to let my version through 
> > now, and then update it when the new interface becomes available.
> Did you test my proposed code? I'll land my patch tomorrow, if yours works as 
> you expect with the proposed AAPointerInfo use, great. If it doesn't I don't 
> necessarily mind you merging something else with a clear intention to address 
> issues and remove duplication as we go.
> 
> I can lift my commit block as we go and @arsenm can also give it a good-to-go 
> if need be.
Lifting your commit block would be useful in general. I certainly do not intend 
to submit something in a hurry.

I applied your change and tested the proposed code. It gives more pessimistic 
results than my original crude version, i.e., it marks some uses of the 
`implicitarg_ptr` as accessing the target range, when I am pretty sure it does 
not. See `test_kernel42` in the lit test `hsa-metadata-hostcall-v3.ll` in this 
review. It is intended to access eight bytes from 16 onwards, which clearly 
does not overlap with offset 24, which is the start of the target range 
(hostcall pointer).

[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-09 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp:566
+  return false;
+};
+

sameerds wrote:
> jdoerfert wrote:
> > sameerds wrote:
> > > jdoerfert wrote:
> > > > sameerds wrote:
> > > > > jdoerfert wrote:
> > > > > > jdoerfert wrote:
> > > > > > > sameerds wrote:
> > > > > > > > jdoerfert wrote:
> > > > > > > > > You can use AAPointerInfo for the call site return 
> > > > > > > > > IRPosition. It will (through the iterations) gather all 
> > > > > > > > > accesses and put them into "bins" based on offset and size. 
> > > > > > > > > It deals with uses in calls, etc. and if there is stuff 
> > > > > > > > > missing it is better to add it in one place so we benefit 
> > > > > > > > > throughout. 
> > > > > > > > I am not following what you have in mind. "implicitarg_ptr" is 
> > > > > > > > a pointer returned by an intrinsic that reads an ABI-defined 
> > > > > > > > register. I need to check that for a given call-graph, a 
> > > > > > > > particular range of bytes relative to that base pointer are 
> > > > > > > > never accessed. The above DFS on the uses conservatively 
> > > > > > > > assumes that such a load exists unless it can conclusively 
> > > > > > > > trace every use of the base pointer. This may include the 
> > > > > > > > pointer being passed to an extern function or being stored into 
> > > > > > > > a different memory location (although we don't expect ABI 
> > > > > > > > registers being capture this way). I am not seeing how to 
> > > > > > > > construct this around AAPointerInfo. As far as I can see, the 
> > > > > > > > public interface only talks about uses that are recognized as 
> > > > > > > > loads and stores.
> > > > > > > Not actually tested, replaces the function body. Depends on 
> > > > > > > D119249.
> > > > > > > ```
> > > > > > > const auto &PointerInfoAA = A.getAAFor<AAPointerInfo>(*this, 
> > > > > > > IRPosition::callsite_returned(cast<CallBase>(Ptr)), 
> > > > > > > DepClassTy::Required);
> > > > > > > if (!PointerInfoAA.getState().isValidState())
> > > > > > >   return true; // Abort (which is weird as false is abort in the 
> > > > > > > other CB).
> > > > > > > AAPointerInfo::OffsetAndSize OAS(*Position, /* probably look 
> > > > > > > pointer width up in DL */ 8);
> > > > > > > return !forallInterferingAccesses(OAS, [](const 
> > > > > > > AAPointerInfo::Access &Acc, bool IsExact) {
> > > > > > >return Acc.getRemoteInst()->isDroppable(); });
> > > > > > > ```
> > > > > > You don't actually need the state check.
> > > > > > And as I said, this will take care of following pointers passed 
> > > > > > into callees or through memory to other places, all while ignoring 
> > > > > > dead code, etc.
> > > > > I see now. forallInterferingAccesses() does check for valid state on 
> > > > > entry, which is sufficient to take care of all the opaque uses like a 
> > > > > call to an extern function or a complicated phi or a capturing store. 
> > > > > Thanks a ton ... this has been very educational!
> > > > Yes, all "forAll" functions will return `false` if we cannot visit 
> > > > "all". Though, these methods will utilize the smarts, e.g., ignore what 
> > > > is dead, look at loads if the value is stored in a way we can track it 
> > > > through memory, transfer accesses in a callee to the caller "space" if 
> > > > a pointer is passed to the callee,... etc.
> > > 
> > > @jdoerfert, do you see  D119249 landing soon? We are kinda on a short 
> > > runway here (less than a handful of days) and hoping to land this review 
> > > quickly because it solves an important issue. I would prefer to have the 
> > > check that you outlined, but the alternative is to let my version through 
> > > now, and then update it when the new interface becomes available.
> > Did you test my proposed code? I'll land my patch tomorrow, if yours works 
> > as you expect with the proposed AAPointerInfo use, great. If it doesn't I 
> > don't necessarily mind you merging something else with a clear intention to 
> > address issues and remove duplication as we go.
> > 
> > I can lift my commit block as we go and @arsenm can also give it a 
> > good-to-go if need be.
> Lifting your commit block would be useful in general. I certainly do not 
> intend to submit something in a hurry.
> 
> I applied your change and tested the proposed code. It gives more pessimistic 
> results than my original crude version, i.e., it marks some uses of the 
> `implicitarg_ptr` as accessing the target range, when I am pretty sure it 
> does not. See `test_kernel42` in the lit test `hsa-metadata-hostcall-v3.ll

[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-10 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds updated this revision to Diff 407762.
sameerds added a comment.

use AAPointerInfo; add more tests to demonstrate the same


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119216/new/

https://reviews.llvm.org/D119216

Files:
  clang/lib/CodeGen/CodeGenModule.cpp
  clang/test/CodeGenCUDA/amdgpu-asan-printf.cu
  clang/test/CodeGenCUDA/amdgpu-asan.cu
  llvm/lib/Target/AMDGPU/AMDGPUAttributes.def
  llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
  llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
  llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
  llvm/test/CodeGen/AMDGPU/addrspacecast-constantexpr.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa-call.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features.ll
  llvm/test/CodeGen/AMDGPU/direct-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/duplicate-attribute-indirect.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v5.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3-asan.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v5.ll
  llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
  llvm/test/CodeGen/AMDGPU/simple-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-attribute-missing.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-multistep.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-nested-function-calls.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-prevent-attribute-propagation.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
  mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp

Index: mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
===
--- mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
+++ mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
@@ -307,11 +307,6 @@
 }
   }
 
-  // Set amdgpu_hostcall if host calls have been linked, as needed by newer LLVM
-  // FIXME: Is there a way to set this during printf() lowering that makes sense
-  if (ret->getFunction("__ockl_hostcall_internal"))
-if (!ret->getModuleFlag("amdgpu_hostcall"))
-  ret->addModuleFlag(llvm::Module::Override, "amdgpu_hostcall", 1);
   return ret;
 }
 
Index: llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
===
--- llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
+++ llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
@@ -61,5 +61,5 @@
 
 attributes #0 = { "uniform-work-group-size"="false" }
 ;.
-; CHECK: attributes #[[ATTR0]] = { "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR0]] = { "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ;.
Index: llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
===
--- llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
+++ llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
@@ -101,7 +101,7 @@
 attributes #0 = { nounwind readnone }
 attributes #1 = { "uniform-work-group-size"="true" }
 ;.
-; CHECK: attributes #[[ATTR0]] = { nounwind readnone "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR1]] = { nounwind readnone "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id

[PATCH] D119216: [AMDGPU] replace hostcall module flag with function attribute

2022-02-11 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rGd8f99bb6e064: [AMDGPU] replace hostcall module flag with 
function attribute (authored by sameerds).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119216/new/

https://reviews.llvm.org/D119216

Files:
  clang/lib/CodeGen/CodeGenModule.cpp
  clang/test/CodeGenCUDA/amdgpu-asan-printf.cu
  clang/test/CodeGenCUDA/amdgpu-asan.cu
  llvm/lib/Target/AMDGPU/AMDGPUAttributes.def
  llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
  llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
  llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
  llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
  llvm/test/CodeGen/AMDGPU/addrspacecast-constantexpr.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa-call.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features-hsa.ll
  llvm/test/CodeGen/AMDGPU/annotate-kernel-features.ll
  llvm/test/CodeGen/AMDGPU/direct-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/duplicate-attribute-indirect.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-enqueue-kernel.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args-v5.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hidden-args.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-absent.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3-asan.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-present.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v3.ll
  llvm/test/CodeGen/AMDGPU/hsa-metadata-hostcall-v5.ll
  llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
  llvm/test/CodeGen/AMDGPU/simple-indirect-call.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-attribute-missing.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-multistep.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-nested-function-calls.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-prevent-attribute-propagation.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
  llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
  mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp

Index: mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
===
--- mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
+++ mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp
@@ -307,11 +307,6 @@
 }
   }
 
-  // Set amdgpu_hostcall if host calls have been linked, as needed by newer LLVM
-  // FIXME: Is there a way to set this during printf() lowering that makes sense
-  if (ret->getFunction("__ockl_hostcall_internal"))
-if (!ret->getModuleFlag("amdgpu_hostcall"))
-  ret->addModuleFlag(llvm::Module::Override, "amdgpu_hostcall", 1);
   return ret;
 }
 
Index: llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
===
--- llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
+++ llvm/test/CodeGen/AMDGPU/uniform-work-group-test.ll
@@ -61,5 +61,5 @@
 
 attributes #0 = { "uniform-work-group-size"="false" }
 ;.
-; CHECK: attributes #[[ATTR0]] = { "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR0]] = { "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ;.
Index: llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
===
--- llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
+++ llvm/test/CodeGen/AMDGPU/uniform-work-group-recursion-test.ll
@@ -101,7 +101,7 @@
 attributes #0 = { nounwind readnone }
 attributes #1 = { "uniform-work-group-size"="true" }
 ;.
-; CHECK: attributes #[[ATTR0]] = { nounwind readnone "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR1]] = { nounwind readnone "amdgpu-no-dispatch-id" "amdgpu-

[PATCH] D124813: [HLSL] Add clang builtin for HLSL.

2022-05-04 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds resigned from this revision.
sameerds added a comment.

I am not very familiar with DirectX and HLSL, so I am unable to review this patch.
It might help to post on the Discourse under Clang Frontend:

https://discourse.llvm.org/c/clang/6


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124813/new/

https://reviews.llvm.org/D124813



[PATCH] D128907: [Clang] Disable noundef attribute for languages which allow uninitialized function arguments

2022-07-13 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

The current set of reviewers is mostly loaded with HIP engineers who are 
familiar with the issue and the proposed solution. But this solution affects 
all languages with convergent functions, which is visible from the affected 
tests. It will be good to seek comments from people responsible for CUDA, 
OpenMP and OpenCL too.




Comment at: clang/include/clang/Basic/LangOptions.h:534
+  /// Return true if uninitialized function arguments are allowed.
+  bool allowUninitializedFunctionsArgs() const {
+/// CUDA/HIP etc. cross-lane APIs are convergent functions

The spelling should be "allowUninitializedFunctionArgs()", where the noun 
"Function" is singular.



Comment at: clang/lib/CodeGen/CGCall.cpp:2306
 
+  // Enable noundef attribute based on codegen options and
+  // skip adding the attribute for languages which allows uninitialized

This comment is merely an English translation of the Boolean expression. It 
should instead provide a bit of context. Like the fact that on HIP, CUDA, etc, 
some functions have multi-threaded semantics where it is enough for only one or 
some threads to provide defined arguments. Depending on semantics, undef 
arguments in some threads don't produce undefined results in the function call.

It might be worth adding this longer explanation to the commit description too, 
rather than just saying "strict constraints".
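For example, in a hypothetical HIP kernel (sketched here only to illustrate the 
point; not taken from this patch), only one lane provides a defined argument, 
yet the result is well-defined for every lane:

  __global__ void k(int *out) {
    int x;                            // uninitialized in most lanes
    if (threadIdx.x % warpSize == 0)
      x = 42;                         // only lane 0 defines x
    out[threadIdx.x] = __shfl(x, 0);  // reads lane 0's x; defined for all
  }

Marking such an argument noundef would wrongly let the optimizer treat the 
whole call as undefined behaviour.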


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128907/new/

https://reviews.llvm.org/D128907



[PATCH] D71365: expand printf when compiling HIP to AMDGPU

2020-01-06 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds updated this revision to Diff 236312.
sameerds edited the summary of this revision.
sameerds added a comment.

Improved the test defined in clang/test/CodeGenHIP/printf.cpp


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71365/new/

https://reviews.llvm.org/D71365

Files:
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/lib/CodeGen/CGGPUBuiltin.cpp
  clang/lib/CodeGen/CodeGenFunction.h
  clang/lib/Driver/ToolChains/HIP.cpp
  clang/test/CodeGenHIP/printf-aggregate.cpp
  clang/test/CodeGenHIP/printf.cpp
  clang/test/Driver/hip-printf.hip
  llvm/include/llvm/Transforms/Utils/AMDGPUEmitPrintf.h
  llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
  llvm/lib/Transforms/Utils/CMakeLists.txt

Index: llvm/lib/Transforms/Utils/CMakeLists.txt
===
--- llvm/lib/Transforms/Utils/CMakeLists.txt
+++ llvm/lib/Transforms/Utils/CMakeLists.txt
@@ -1,4 +1,5 @@
 add_llvm_component_library(LLVMTransformUtils
+  AMDGPUEmitPrintf.cpp
   ASanStackFrameLayout.cpp
   AddDiscriminators.cpp
   BasicBlockUtils.cpp
Index: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
===
--- /dev/null
+++ llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
@@ -0,0 +1,243 @@
+//===- AMDGPUEmitPrintf.cpp ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// Utility function to lower a printf call into a series of device
+// library calls on the AMDGPU target.
+//
+// WARNING: This file knows about certain library functions. It recognizes them
+// by name, and hardwires knowledge of their semantics.
+//
+//===--===//
+
+#include "llvm/Transforms/Utils/AMDGPUEmitPrintf.h"
+#include "llvm/ADT/SparseBitVector.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/IR/IRBuilder.h"
+
+#include 
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-emit-printf"
+
+static bool isCString(const Value *Arg) {
+  auto Ty = Arg->getType();
+  auto PtrTy = dyn_cast<PointerType>(Ty);
+  if (!PtrTy)
+return false;
+
+  auto IntTy = dyn_cast<IntegerType>(PtrTy->getElementType());
+  if (!IntTy)
+return false;
+
+  return IntTy->getBitWidth() == 8;
+}
+
+static Value *fitArgInto64Bits(IRBuilder<> &Builder, Value *Arg) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto Ty = Arg->getType();
+  if (auto IntTy = dyn_cast<IntegerType>(Ty)) {
+switch (IntTy->getBitWidth()) {
+case 32:
+  return Builder.CreateZExt(Arg, Int64Ty);
+case 64:
+  return Arg;
+}
+  } else if (Ty->getTypeID() == Type::DoubleTyID) {
+return Builder.CreateBitCast(Arg, Int64Ty);
+  } else if (auto PtrTy = dyn_cast<PointerType>(Ty)) {
+return Builder.CreatePtrToInt(Arg, Int64Ty);
+  }
+
+  llvm_unreachable("unexpected type");
+  return Builder.getInt64(0);
+}
+
+static Value *callPrintfBegin(IRBuilder<> &Builder, Value *Version) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto M = Builder.GetInsertBlock()->getModule();
+  auto Fn = M->getOrInsertFunction("__ockl_printf_begin", Int64Ty, Int64Ty);
+  return Builder.CreateCall(Fn, Version);
+}
+
+static Value *callAppendArgs(IRBuilder<> &Builder, Value *Desc, int NumArgs,
+ Value *Arg0, Value *Arg1, Value *Arg2, Value *Arg3,
+ Value *Arg4, Value *Arg5, Value *Arg6,
+ bool IsLast) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto Int32Ty = Builder.getInt32Ty();
+  auto M = Builder.GetInsertBlock()->getModule();
+  auto Fn = M->getOrInsertFunction("__ockl_printf_append_args", Int64Ty,
+   Int64Ty, Int32Ty, Int64Ty, Int64Ty, Int64Ty,
+   Int64Ty, Int64Ty, Int64Ty, Int64Ty, Int32Ty);
+  auto IsLastValue = Builder.getInt32(IsLast);
+  auto NumArgsValue = Builder.getInt32(NumArgs);
+  return Builder.CreateCall(Fn, {Desc, NumArgsValue, Arg0, Arg1, Arg2, Arg3,
+ Arg4, Arg5, Arg6, IsLastValue});
+}
+
+static Value *appendArg(IRBuilder<> &Builder, Value *Desc, Value *Arg,
+bool IsLast) {
+  auto Arg0 = fitArgInto64Bits(Builder, Arg);
+  auto Zero = Builder.getInt64(0);
+  return callAppendArgs(Builder, Desc, 1, Arg0, Zero, Zero, Zero, Zero, Zero,
+Zero, IsLast);
+}
+
+// The device library does not provide strlen, so we build our own loop
+// here. While we are at it, we also include the terminating null in the length.
+static Value *getStrlenWithNull(IRBuilder<> &Builder, Value *Str) {
+  auto *Prev = Builder.GetInsertBlock();
+  Module *M = Prev->getModule();
+
+  auto CharZero = Builder.getInt8(0);
+  auto One = Builder.g

[PATCH] D71365: expand printf when compiling HIP to AMDGPU

2020-01-06 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/test/CodeGenHIP/printf.cpp:18
+}
+
+// CHECK: [[BEGIN:%.*]]   = call i64 @__ockl_printf_begin(i64 0)

arsenm wrote:
> This could use a lot more testcases. Can you add some half, float, and double 
> as well as pointers (including different address spaces) and vectors?
I am not sure what exactly should be tested. The validity of this expansion 
depends on the signature of the builtin printf function. Since printf is 
variadic, the "default argument promotions" in the C/C++ spec guarantee that 
the arguments are 32/64 bit integers or doubles if they are not pointers. The 
printf signature as well as the device library functions are defined using only 
generic pointers, so the address space does not matter. Non-scalar arguments 
are not supported, which is checked by another test using a struct. I could add 
a vector there, but that doesn't seem to be adding any value.
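For the record, a sketch of what those promotions guarantee (hypothetical 
device code, not from this patch):

  __global__ void k(float f, short s, char c, const char *p) {
    // Default argument promotions for the variadic call:
    // f -> double, s -> int, c -> int; p stays a pointer.
    printf("%f %d %c %s\n", f, s, c, p);
  }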


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71365/new/

https://reviews.llvm.org/D71365





[PATCH] D71365: expand printf when compiling HIP to AMDGPU

2020-01-07 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D71365#1807872 , @b-sumner wrote:

> Should this be looking forward to also handling OpenCL, which does require 
> vector support?


Sure. The implementation is general enough that we can point at the two places 
in AMDGPUEmitPrintf.cpp that need to be enhanced for OpenCL:

1. processArg() should recognize a vector argument and transmit its elements to 
the host correctly.
2. locateCStrings() should deal with all the additional OpenCL specifiers (like 
vectors and half data-types) as needed.

Then depending on the input language, Clang codegen can indicate whether to 
support these extensions or emit an error.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71365/new/

https://reviews.llvm.org/D71365





[PATCH] D71365: expand printf when compiling HIP to AMDGPU

2020-01-09 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/test/CodeGenHIP/printf.cpp:18
+}
+
+// CHECK: [[BEGIN:%.*]]   = call i64 @__ockl_printf_begin(i64 0)

sameerds wrote:
> arsenm wrote:
> > This could use a lot more testcases. Can you add some half, float, and 
> > double as well as pointers (including different address spaces) and vectors?
> I am not sure what exactly should be tested. The validity of this expansion 
> depends on the signature of the builtin printf function. Since printf is 
> variadic, the "default argument promotions" in the C/C++ spec guarantee that 
> the arguments are 32/64 bit integers or doubles if they are not pointers. The 
> printf signature as well as the device library functions are defined using 
> only generic pointers, so the address space does not matter. Non-scalar 
> arguments are not supported, which is checked by another test using a struct. 
> I could add a vector there, but that doesn't seem to be adding any value.
Bump!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71365/new/

https://reviews.llvm.org/D71365





[PATCH] D71365: expand printf when compiling HIP to AMDGPU

2020-01-14 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds updated this revision to Diff 237864.
sameerds added a comment.

- Added a test for address spaces
- Removed an unnecessary addrspacecast in the strlen expansion
- Updated the tests to pass -fcuda-is-device
- printf is not a builtin for device code, so removed the code and tests 
related to that


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71365/new/

https://reviews.llvm.org/D71365

Files:
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/lib/CodeGen/CGGPUBuiltin.cpp
  clang/lib/CodeGen/CodeGenFunction.h
  clang/lib/Driver/ToolChains/HIP.cpp
  clang/test/CodeGenHIP/printf-aggregate.cpp
  clang/test/CodeGenHIP/printf.cpp
  clang/test/Driver/hip-printf.hip
  llvm/include/llvm/Transforms/Utils/AMDGPUEmitPrintf.h
  llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
  llvm/lib/Transforms/Utils/CMakeLists.txt

Index: llvm/lib/Transforms/Utils/CMakeLists.txt
===
--- llvm/lib/Transforms/Utils/CMakeLists.txt
+++ llvm/lib/Transforms/Utils/CMakeLists.txt
@@ -1,4 +1,5 @@
 add_llvm_component_library(LLVMTransformUtils
+  AMDGPUEmitPrintf.cpp
   ASanStackFrameLayout.cpp
   AddDiscriminators.cpp
   BasicBlockUtils.cpp
Index: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
===
--- /dev/null
+++ llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
@@ -0,0 +1,246 @@
+//===- AMDGPUEmitPrintf.cpp ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// Utility function to lower a printf call into a series of device
+// library calls on the AMDGPU target.
+//
+// WARNING: This file knows about certain library functions. It recognizes them
+// by name, and hardwires knowledge of their semantics.
+//
+//===--===//
+
+#include "llvm/Transforms/Utils/AMDGPUEmitPrintf.h"
+#include "llvm/ADT/SparseBitVector.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/IR/IRBuilder.h"
+
+#include 
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-emit-printf"
+
+static bool isCString(const Value *Arg) {
+  auto Ty = Arg->getType();
+  auto PtrTy = dyn_cast<PointerType>(Ty);
+  if (!PtrTy)
+return false;
+
+  auto IntTy = dyn_cast<IntegerType>(PtrTy->getElementType());
+  if (!IntTy)
+return false;
+
+  return IntTy->getBitWidth() == 8;
+}
+
+static Value *fitArgInto64Bits(IRBuilder<> &Builder, Value *Arg) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto Ty = Arg->getType();
+
+  if (auto IntTy = dyn_cast<IntegerType>(Ty)) {
+switch (IntTy->getBitWidth()) {
+case 32:
+  return Builder.CreateZExt(Arg, Int64Ty);
+case 64:
+  return Arg;
+}
+  }
+
+  if (Ty->getTypeID() == Type::DoubleTyID) {
+return Builder.CreateBitCast(Arg, Int64Ty);
+  }
+
+  if (auto PtrTy = dyn_cast<PointerType>(Ty)) {
+return Builder.CreatePtrToInt(Arg, Int64Ty);
+  }
+
+  llvm_unreachable("unexpected type");
+}
+
+static Value *callPrintfBegin(IRBuilder<> &Builder, Value *Version) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto M = Builder.GetInsertBlock()->getModule();
+  auto Fn = M->getOrInsertFunction("__ockl_printf_begin", Int64Ty, Int64Ty);
+  return Builder.CreateCall(Fn, Version);
+}
+
+static Value *callAppendArgs(IRBuilder<> &Builder, Value *Desc, int NumArgs,
+ Value *Arg0, Value *Arg1, Value *Arg2, Value *Arg3,
+ Value *Arg4, Value *Arg5, Value *Arg6,
+ bool IsLast) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto Int32Ty = Builder.getInt32Ty();
+  auto M = Builder.GetInsertBlock()->getModule();
+  auto Fn = M->getOrInsertFunction("__ockl_printf_append_args", Int64Ty,
+   Int64Ty, Int32Ty, Int64Ty, Int64Ty, Int64Ty,
+   Int64Ty, Int64Ty, Int64Ty, Int64Ty, Int32Ty);
+  auto IsLastValue = Builder.getInt32(IsLast);
+  auto NumArgsValue = Builder.getInt32(NumArgs);
+  return Builder.CreateCall(Fn, {Desc, NumArgsValue, Arg0, Arg1, Arg2, Arg3,
+ Arg4, Arg5, Arg6, IsLastValue});
+}
+
+static Value *appendArg(IRBuilder<> &Builder, Value *Desc, Value *Arg,
+bool IsLast) {
+  auto Arg0 = fitArgInto64Bits(Builder, Arg);
+  auto Zero = Builder.getInt64(0);
+  return callAppendArgs(Builder, Desc, 1, Arg0, Zero, Zero, Zero, Zero, Zero,
+Zero, IsLast);
+}
+
+// The device library does not provide strlen, so we build our own loop
+// here. While we are at it, we also include the terminating null in the length.
+static Value *getStrlenWithNull(IRBuilder<> &Builder, Value *Str) {
+  auto *Prev = Builder.GetInsertBlock();
+  

[PATCH] D71365: expand printf when compiling HIP to AMDGPU

2020-01-14 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds marked 2 inline comments as done.
sameerds added inline comments.



Comment at: clang/test/CodeGenHIP/printf.cpp:18
+}
+
+// CHECK: [[BEGIN:%.*]]   = call i64 @__ockl_printf_begin(i64 0)

arsenm wrote:
> sameerds wrote:
> > sameerds wrote:
> > > arsenm wrote:
> > > > This could use a lot more testcases. Can you add some half, float, and 
> > > > double as well as pointers (including different address spaces) and 
> > > > vectors?
> > > I am not sure what exactly should be tested. The validity of this 
> > > expansion depends on the signature of the builtin printf function. Since 
> > > printf is variadic, the "default argument promotions" in the C/C++ spec 
> > > guarantee that the arguments are 32/64 bit integers or doubles if they 
> > > are not pointers. The printf signature as well as the device library 
> > > functions are defined using only generic pointers, so the address space 
> > > does not matter. Non-scalar arguments are not supported, which is checked 
> > > by another test using a struct. I could add a vector there, but that 
> > > doesn't seem to be adding any value.
> > Bump!
> The address space shouldn't matter, but it's good to make sure it doesn't. 
> 
> Vector arguments are 100% supported in OpenCL printf, and are not subject to 
> argument promotion. You have to use a width modifier for them though.
Added a test with address spaces on the pointer arguments.

Vectors are supported in OpenCL, but this initial implementation is 
specifically for HIP. The printf expansion is invoked by Clang CodeGen only 
when the language is HIP.

Vector support will arrive later. It's not sufficient to just implement the 
transmission via hostcall; it also needs changes to the printf processing 
performed by the hostcall listener thread on the host.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71365/new/

https://reviews.llvm.org/D71365





[PATCH] D71365: expand printf when compiling HIP to AMDGPU

2020-01-16 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rGed181efa175d: [HIP][AMDGPU] expand printf when compiling HIP 
to AMDGPU (authored by sameerds).
Herald added a subscriber: kerbowa.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71365/new/

https://reviews.llvm.org/D71365

Files:
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/lib/CodeGen/CGGPUBuiltin.cpp
  clang/lib/CodeGen/CodeGenFunction.h
  clang/lib/Driver/ToolChains/HIP.cpp
  clang/test/CodeGenHIP/printf-aggregate.cpp
  clang/test/CodeGenHIP/printf.cpp
  clang/test/Driver/hip-printf.hip
  llvm/include/llvm/Transforms/Utils/AMDGPUEmitPrintf.h
  llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
  llvm/lib/Transforms/Utils/CMakeLists.txt

Index: llvm/lib/Transforms/Utils/CMakeLists.txt
===
--- llvm/lib/Transforms/Utils/CMakeLists.txt
+++ llvm/lib/Transforms/Utils/CMakeLists.txt
@@ -1,4 +1,5 @@
 add_llvm_component_library(LLVMTransformUtils
+  AMDGPUEmitPrintf.cpp
   ASanStackFrameLayout.cpp
   AddDiscriminators.cpp
   BasicBlockUtils.cpp
Index: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
===
--- /dev/null
+++ llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
@@ -0,0 +1,246 @@
+//===- AMDGPUEmitPrintf.cpp ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// Utility function to lower a printf call into a series of device
+// library calls on the AMDGPU target.
+//
+// WARNING: This file knows about certain library functions. It recognizes them
+// by name, and hardwires knowledge of their semantics.
+//
+//===--===//
+
+#include "llvm/Transforms/Utils/AMDGPUEmitPrintf.h"
+#include "llvm/ADT/SparseBitVector.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/IR/IRBuilder.h"
+
+#include 
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-emit-printf"
+
+static bool isCString(const Value *Arg) {
+  auto Ty = Arg->getType();
+  auto PtrTy = dyn_cast<PointerType>(Ty);
+  if (!PtrTy)
+return false;
+
+  auto IntTy = dyn_cast<IntegerType>(PtrTy->getElementType());
+  if (!IntTy)
+return false;
+
+  return IntTy->getBitWidth() == 8;
+}
+
+static Value *fitArgInto64Bits(IRBuilder<> &Builder, Value *Arg) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto Ty = Arg->getType();
+
+  if (auto IntTy = dyn_cast<IntegerType>(Ty)) {
+switch (IntTy->getBitWidth()) {
+case 32:
+  return Builder.CreateZExt(Arg, Int64Ty);
+case 64:
+  return Arg;
+}
+  }
+
+  if (Ty->getTypeID() == Type::DoubleTyID) {
+return Builder.CreateBitCast(Arg, Int64Ty);
+  }
+
+  if (auto PtrTy = dyn_cast<PointerType>(Ty)) {
+return Builder.CreatePtrToInt(Arg, Int64Ty);
+  }
+
+  llvm_unreachable("unexpected type");
+}
+
+static Value *callPrintfBegin(IRBuilder<> &Builder, Value *Version) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto M = Builder.GetInsertBlock()->getModule();
+  auto Fn = M->getOrInsertFunction("__ockl_printf_begin", Int64Ty, Int64Ty);
+  return Builder.CreateCall(Fn, Version);
+}
+
+static Value *callAppendArgs(IRBuilder<> &Builder, Value *Desc, int NumArgs,
+ Value *Arg0, Value *Arg1, Value *Arg2, Value *Arg3,
+ Value *Arg4, Value *Arg5, Value *Arg6,
+ bool IsLast) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto Int32Ty = Builder.getInt32Ty();
+  auto M = Builder.GetInsertBlock()->getModule();
+  auto Fn = M->getOrInsertFunction("__ockl_printf_append_args", Int64Ty,
+   Int64Ty, Int32Ty, Int64Ty, Int64Ty, Int64Ty,
+   Int64Ty, Int64Ty, Int64Ty, Int64Ty, Int32Ty);
+  auto IsLastValue = Builder.getInt32(IsLast);
+  auto NumArgsValue = Builder.getInt32(NumArgs);
+  return Builder.CreateCall(Fn, {Desc, NumArgsValue, Arg0, Arg1, Arg2, Arg3,
+ Arg4, Arg5, Arg6, IsLastValue});
+}
+
+static Value *appendArg(IRBuilder<> &Builder, Value *Desc, Value *Arg,
+bool IsLast) {
+  auto Arg0 = fitArgInto64Bits(Builder, Arg);
+  auto Zero = Builder.getInt64(0);
+  return callAppendArgs(Builder, Desc, 1, Arg0, Zero, Zero, Zero, Zero, Zero,
+Zero, IsLast);
+}
+
+// The device library does not provide strlen, so we build our own loop
+// here. While we are at it, we also include the terminating null in the length.
+static Value *getStrlenWithNull(IRBuilder<> &Builder, Value *Str) {
+  auto *Prev = Builder.GetInsertBlock();
+  Module *M = Prev->getModule();
+
+  auto CharZero = Builder.getInt8(0);
+  auto On

[PATCH] D82087: AMDGPU/clang: Add builtins for llvm.amdgcn.ballot

2022-12-23 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/test/CodeGenOpenCL/amdgpu-features.cl:7
+// RUN: %clang_cc1 -triple amdgcn -S -emit-llvm -o - %s | FileCheck 
--check-prefix=NOCPU %s
+// RUN: %clang_cc1 -triple amdgcn -target-feature +wavefrontsize32 -S 
-emit-llvm -o - %s | FileCheck --check-prefix=NOCPU-WAVE32 %s
+// RUN: %clang_cc1 -triple amdgcn -target-feature +wavefrontsize64 -S 
-emit-llvm -o - %s | FileCheck --check-prefix=NOCPU-WAVE64 %s

yaxunl wrote:
> what happens if both +wavefrontsize32 and +wavefrontsize64 are specified?
Shouldn't this be an error in itself, diagnosed separately? Is it tested elsewhere?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82087/new/

https://reviews.llvm.org/D82087



[PATCH] D82087: AMDGPU/clang: Add builtins for llvm.amdgcn.ballot

2022-12-26 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds accepted this revision.
sameerds added inline comments.
This revision is now accepted and ready to land.



Comment at: clang/lib/Basic/Targets/AMDGPU.cpp:353
+  if (HaveWave32 && HaveWave64) {
+Diags.Report(diag::err_invalid_feature_combination)
+<< "'wavefrontsize32' and 'wavefrontsize64' are mutually exclusive";

I would have preferred this to be a separate change, just like the FIXME for 
diagnosing wavefrontsize32 on targets that don't support it. But not feeling 
strongly enough to block this change!


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82087/new/

https://reviews.llvm.org/D82087



[PATCH] D82087: AMDGPU/clang: Add builtins for llvm.amdgcn.ballot

2022-12-26 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

And note that the change description is written in a first-person train of 
thought. Please do rewrite it!


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82087/new/

https://reviews.llvm.org/D82087



[PATCH] D141700: AMDGPU: Move enqueued block handling into clang

2023-02-05 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

LGTM, to the extent that I can see that the change does what is advertised, and 
the ultimately emitted HSA metadata preserves the current contract with the 
runtime.

A couple of tests can use a little more explanatory comments as noted.




Comment at: clang/lib/CodeGen/TargetInfo.cpp:12581
+  Mod, HandleTy,
+  /*isConstant=*/true, llvm::GlobalValue::InternalLinkage,
+  /*Initializer=*/RuntimeHandleInitializer, RuntimeHandleName,

jmmartinez wrote:
> Just a cosmetical remark: Is there any reason to keep the `/*isConstant=*/`, 
> `/*Initializer=*/`, ... comments? I think it would be better to avoid them.
FWIW, I find these comments very helpful when spelunking through code. I could 
sympathise with not needing `Initializer=` because the value name makes it 
clear. But an undecorated constant literal like "true" or "10" or "nullptr" 
works best when accompanied by a comment.



Comment at: llvm/test/Bitcode/amdgpu-autoupgrade-enqueued-block.ll:69
+
+; __enqueue_kernel* functions may get inlined
+define amdgpu_kernel void @inlined_caller(ptr addrspace(1) %a, i8 %b, ptr 
addrspace(1) %c, i64 %d) {

I did not understand what is being tested here.



Comment at: llvm/test/CodeGen/AMDGPU/amdgpu-export-kernel-runtime-handles.ll:2
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py 
UTC_ARGS: --function-signature --check-attributes --check-globals
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-export-kernel-runtime-handles 
< %s | FileCheck %s
+

Is there any visible effect of the pass being tested? Or is the intention 
simply to check that the output is the same as the input, and that there is no error?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D141700/new/

https://reviews.llvm.org/D141700



[PATCH] D71365: expand printf when compiling HIP to AMDGPU

2019-12-11 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds created this revision.
Herald added subscribers: llvm-commits, cfe-commits, hiraditya, t-tye, tpr, 
dstuttard, yaxunl, mgorny, nhaehnle, wdng, jvesely, kzhuravl.
Herald added projects: clang, LLVM.

This change implements the expansion in two parts:

- Add a utility function emitAMDGPUPrintfCall() in LLVM.
- Invoke the above function from Clang CodeGen, when processing a HIP program 
for the AMDGPU target.

The printf expansion has undefined behaviour if the format string is
not a compile-time constant. As a sufficient condition, the HIP
ToolChain now emits -Werror=format-nonliteral.
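For illustration, this is the pattern the warning guards against (hypothetical 
device code, not part of this patch):

  __global__ void bad(const char *fmt) {
    printf(fmt, 42); // error: format string is not a string literal
  }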



Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D71365

Files:
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/lib/CodeGen/CGGPUBuiltin.cpp
  clang/lib/CodeGen/CodeGenFunction.h
  clang/lib/Driver/ToolChains/HIP.cpp
  clang/test/CodeGenHIP/printf-aggregate.cpp
  clang/test/CodeGenHIP/printf.cpp
  clang/test/Driver/hip-printf.hip
  llvm/include/llvm/Transforms/Utils/AMDGPUEmitPrintf.h
  llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
  llvm/lib/Transforms/Utils/CMakeLists.txt

Index: llvm/lib/Transforms/Utils/CMakeLists.txt
===
--- llvm/lib/Transforms/Utils/CMakeLists.txt
+++ llvm/lib/Transforms/Utils/CMakeLists.txt
@@ -1,4 +1,5 @@
 add_llvm_component_library(LLVMTransformUtils
+  AMDGPUEmitPrintf.cpp
   ASanStackFrameLayout.cpp
   AddDiscriminators.cpp
   BasicBlockUtils.cpp
Index: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
===
--- /dev/null
+++ llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp
@@ -0,0 +1,243 @@
+//===- AMDGPUEmitPrintf.cpp ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// Utility function to lower a printf call into a series of device
+// library calls on the AMDGPU target.
+//
+// WARNING: This file knows about certain library functions. It recognizes them
+// by name, and hardwires knowledge of their semantics.
+//
+//===--===//
+
+#include "llvm/Transforms/Utils/AMDGPUEmitPrintf.h"
+#include "llvm/ADT/SparseBitVector.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/IR/IRBuilder.h"
+
+#include 
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-emit-printf"
+
+static bool isCString(const Value *Arg) {
+  auto Ty = Arg->getType();
+  auto PtrTy = dyn_cast<PointerType>(Ty);
+  if (!PtrTy)
+return false;
+
+  auto IntTy = dyn_cast<IntegerType>(PtrTy->getElementType());
+  if (!IntTy)
+return false;
+
+  return IntTy->getBitWidth() == 8;
+}
+
+static Value *fitArgInto64Bits(IRBuilder<> &Builder, Value *Arg) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto Ty = Arg->getType();
+  if (auto IntTy = dyn_cast<IntegerType>(Ty)) {
+switch (IntTy->getBitWidth()) {
+case 32:
+  return Builder.CreateZExt(Arg, Int64Ty);
+case 64:
+  return Arg;
+}
+  } else if (Ty->getTypeID() == Type::DoubleTyID) {
+return Builder.CreateBitCast(Arg, Int64Ty);
+  } else if (auto PtrTy = dyn_cast<PointerType>(Ty)) {
+return Builder.CreatePtrToInt(Arg, Int64Ty);
+  }
+
+  llvm_unreachable("unexpected type");
+  return Builder.getInt64(0);
+}
+
+static Value *callPrintfBegin(IRBuilder<> &Builder, Value *Version) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto M = Builder.GetInsertBlock()->getModule();
+  auto Fn = M->getOrInsertFunction("__ockl_printf_begin", Int64Ty, Int64Ty);
+  return Builder.CreateCall(Fn, Version);
+}
+
+static Value *callAppendArgs(IRBuilder<> &Builder, Value *Desc, int NumArgs,
+ Value *Arg0, Value *Arg1, Value *Arg2, Value *Arg3,
+ Value *Arg4, Value *Arg5, Value *Arg6,
+ bool IsLast) {
+  auto Int64Ty = Builder.getInt64Ty();
+  auto Int32Ty = Builder.getInt32Ty();
+  auto M = Builder.GetInsertBlock()->getModule();
+  auto Fn = M->getOrInsertFunction("__ockl_printf_append_args", Int64Ty,
+   Int64Ty, Int32Ty, Int64Ty, Int64Ty, Int64Ty,
+   Int64Ty, Int64Ty, Int64Ty, Int64Ty, Int32Ty);
+  auto IsLastValue = Builder.getInt32(IsLast);
+  auto NumArgsValue = Builder.getInt32(NumArgs);
+  return Builder.CreateCall(Fn, {Desc, NumArgsValue, Arg0, Arg1, Arg2, Arg3,
+ Arg4, Arg5, Arg6, IsLastValue});
+}
+
+static Value *appendArg(IRBuilder<> &Builder, Value *Desc, Value *Arg,
+bool IsLast) {
+  auto Arg0 = fitArgInto64Bits(Builder, Arg);
+  auto Zero = Builder.getInt64(0);
+  return callAppendArgs(Builder, Desc, 1, Arg0, Zero, Zero, Zero, Zero, Zero,
+Zero, IsLast);
+}

[PATCH] D75917: Expose llvm fence instruction as clang intrinsic

2020-03-10 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

The commit summary needs improvement. The syntax is not really necessary there; 
instead, it needs a better explanation of how this builtin fits in with 
the overall scheme of language-specific and target-specific details of an 
atomic operation. For example, is this meant only for OpenCL? Does it work with 
CUDA? Or HIP? What is the behaviour for scope in C++?




Comment at: clang/include/clang/Basic/DiagnosticSemaKinds.td:7860-7863
+def err_memory_fence_has_invalid_memory_order : Error<
+  "memory order argument to fence operation is invalid">;
+def err_memory_fence_has_invalid_synch_scope : Error<
+  "synchronization scope argument to fence operation is invalid">;

Just above this addition, atomic op seems to emit a warning for invalid memory 
order. Should that be the case with fence too?



Comment at: clang/lib/CodeGen/CGBuiltin.cpp:3707
+Value *Scope = EmitScalarExpr(E->getArg(1));
+auto ScopeModel = AtomicScopeModel::create(AtomicScopeModelKind::OpenCL);
+

The proposed builtin does not claim to be an OpenCL builtin, so it's probably 
not correct to simply assume the OpenCL model. Should the model be chosen based 
on the source language specified?


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75917/new/

https://reviews.llvm.org/D75917





[PATCH] D75917: Expose llvm fence instruction as clang intrinsic

2020-03-11 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D75917#1916664 , @JonChesterfield 
wrote:

> In D75917#1916160 , @sameerds wrote:
>
> > how this builtin fits in with the overall scheme of language-specific and 
> > target-specific details of an atomic operation. For example, is this meant 
> > only for OpenCL? Does it work with CUDA? Or HIP? What is the behaviour for 
> > scope in C++?
>
>
> Identical to the fence instruction. Which is assumed well thought through 
> already, given it's an IR instruction.
>
> As far as I can tell, fence composes sensibly with other IR then generates 
> the right thing at the back end. So it looks fit for purpose, just not 
> currently available from clang.


Well, there is a problem: The LangRef says that scopes are target-defined. This 
change says that scopes are defined by the high-level language and further 
assumes that OpenCL scopes make sense in all languages. Besides conflicting 
with the LangRef, this does not seem to work with C++, which has no scopes, nor 
with CUDA or HIP, whose scopes are not represented in any AtomicScopeModel.


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75917/new/

https://reviews.llvm.org/D75917





[PATCH] D75917: Expose llvm fence instruction as clang intrinsic

2020-03-16 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D75917#1917296 , @JonChesterfield 
wrote:

> In D75917#1916972 , @sameerds wrote:
>
> > Well, there is a problem: The LangRef says that scopes are target-defined. 
> > This change says that scopes are defined by the high-level language and 
> > further assumes that OpenCL scopes make sense in all languages. Besides 
> > conflicting with the LangRef, this does not seem to work with C++, which has 
> > no scopes, nor with CUDA or HIP, whose scopes are not represented in any 
> > AtomicScopeModel.
>
>
> I don't follow. IR has a fence instruction. This builtin maps directly to it, 
> passing whatever integer arguments were given to the intrinsic along 
> unchanged. It's exactly as valid, or invalid, as said fence instruction.


Is it really? The scope argument of the IR fence is a target-specific string:
http://llvm.org/docs/LangRef.html#syncscope

The change that I see here is assuming a numerical argument, and also assuming 
that the numbers used must conform to the OpenCL enum. That would certainly 
make the builtin quite different from the IR fence.


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75917/new/

https://reviews.llvm.org/D75917





[PATCH] D75917: Expose llvm fence instruction as clang intrinsic

2020-03-16 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D75917#1925700 , @JonChesterfield 
wrote:

>




> I think I follow. The syncscope takes a string, therefore the builtin that 
> maps onto fence should also take a string for that parameter? That's fine by 
> me. Will help if a new non-opencl syncscope is introduced later.

Right. To be precise, it is a target-specific string, and should not be 
processed as if it were an OpenCL scope. The builtin should allow anything that 
the IR fence would allow in a .ll file created for the specified target.
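A minimal sketch of such a lowering, assuming the scope string is forwarded 
verbatim from the builtin to the IR fence (names are illustrative, not from 
this patch):

  static llvm::FenceInst *emitFence(llvm::IRBuilder<> &Builder,
                                    llvm::AtomicOrdering Order,
                                    llvm::StringRef Scope) {
    llvm::LLVMContext &Ctx = Builder.getContext();
    // Forward the target-specific string unmodified; for example,
    // Scope == "agent" emits: fence syncscope("agent") acquire
    return Builder.CreateFence(Order, Ctx.getOrInsertSyncScopeID(Scope));
  }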


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75917/new/

https://reviews.llvm.org/D75917





[PATCH] D153953: Revert "[AMDGPU] Mark mbcnt as convergent"

2023-06-28 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds created this revision.
Herald added subscribers: kerbowa, tpr, dstuttard, yaxunl, jvesely, kzhuravl.
Herald added a project: All.
sameerds requested review of this revision.
Herald added subscribers: llvm-commits, cfe-commits, wdng.
Herald added projects: clang, LLVM.

This reverts commit 37114036aa57e53217a57afacd7f47b36114edfb 
.

The output of mbcnt does not depend on other active lanes, and hence it is not
convergent. The original change was made as a possible fix for

https://github.com/ROCm-Developer-Tools/HIP/issues/3172

But changing mbcnt does not fix that issue.
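For reference, the per-lane semantics that make mbcnt non-convergent, sketched 
as C++-style pseudocode (my paraphrase of the ISA behaviour, not from the patch):

  // Each lane computes a pure function of its own lane id and the explicit
  // mask operand; nothing is read from other lanes or from the exec mask:
  //   mbcnt_lo(mask, add) == add + popcount(mask & ((1u << lane_id) - 1))
  // (for lane_id < 32; mbcnt_hi covers the upper half of a wave64 mask)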


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D153953

Files:
  clang/test/CodeGenOpenCL/builtins-amdgcn.cl
  llvm/include/llvm/IR/IntrinsicsAMDGPU.td


Index: llvm/include/llvm/IR/IntrinsicsAMDGPU.td
===
--- llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -1833,12 +1833,12 @@
 def int_amdgcn_mbcnt_lo :
   ClangBuiltin<"__builtin_amdgcn_mbcnt_lo">,
   DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
-   [IntrNoMem, IntrConvergent]>;
+   [IntrNoMem]>;
 
 def int_amdgcn_mbcnt_hi :
   ClangBuiltin<"__builtin_amdgcn_mbcnt_hi">,
   DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
-[IntrNoMem, IntrConvergent]>;
+[IntrNoMem]>;
 
 // llvm.amdgcn.ds.swizzle src offset
 def int_amdgcn_ds_swizzle :
Index: clang/test/CodeGenOpenCL/builtins-amdgcn.cl
===
--- clang/test/CodeGenOpenCL/builtins-amdgcn.cl
+++ clang/test/CodeGenOpenCL/builtins-amdgcn.cl
@@ -690,14 +690,12 @@
 
 // CHECK-LABEL: @test_mbcnt_lo(
 // CHECK: call i32 @llvm.amdgcn.mbcnt.lo(i32 %src0, i32 %src1)
-// CHECK: declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32) #[[$MBCNT_ATTRS:[0-9]+]]
 kernel void test_mbcnt_lo(global uint* out, uint src0, uint src1) {
   *out = __builtin_amdgcn_mbcnt_lo(src0, src1);
 }
 
 // CHECK-LABEL: @test_mbcnt_hi(
 // CHECK: call i32 @llvm.amdgcn.mbcnt.hi(i32 %src0, i32 %src1)
-// CHECK: declare i32 @llvm.amdgcn.mbcnt.hi(i32, i32) #[[$MBCNT_ATTRS]]
 kernel void test_mbcnt_hi(global uint* out, uint src0, uint src1) {
   *out = __builtin_amdgcn_mbcnt_hi(src0, src1);
 }
@@ -834,7 +832,6 @@
 // CHECK-DAG: [[$WS_RANGE]] = !{i16 1, i16 1025}
 // CHECK-DAG: attributes #[[$NOUNWIND_READONLY]] = { mustprogress nocallback nofree nosync nounwind willreturn memory(read) }
 // CHECK-DAG: attributes #[[$READ_EXEC_ATTRS]] = { convergent }
-// CHECK-DAG: attributes #[[$MBCNT_ATTRS]] = {{.* convergent .*}}
 // CHECK-DAG: ![[$EXEC]] = !{!"exec"}
 // CHECK-DAG: ![[$EXEC_LO]] = !{!"exec_lo"}
 // CHECK-DAG: ![[$EXEC_HI]] = !{!"exec_hi"}


[PATCH] D153953: Revert "[AMDGPU] Mark mbcnt as convergent"

2023-06-28 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D153953#4455794 , @yaxunl wrote:

> Marking mbcnt as convergent, together with https://reviews.llvm.org/D144756 
> prevent mbcnt to be merged, which fixed the reported issue.
>
> Do you have an alternative fix for the issue?

I completely disagree with this line of thought. The change to mbcnt is 
fundamentally incorrect and not related to the issue. There are no grounds to 
ask for changes to this revision. Why was the change to mbcnt committed if it 
was not an actual fix for anything?

Also, please note that the issue itself is invalid. I have put a comment in 
github explaining the same. The original program itself is invalid.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153953/new/

https://reviews.llvm.org/D153953



[PATCH] D153953: Revert "[AMDGPU] Mark mbcnt as convergent"

2023-06-28 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds requested review of this revision.
sameerds added a comment.

@pravinjagtap @arsenm ... reverting the convergent attribute on the mbcnt 
intrinsics affects tests that were added for atomic optimizations. In 
particular, mbcnt is now being moved across/into/out of control flow because it 
is no longer convergent. I eyeballed one example and it seemed okay to me, but 
a more thorough check would be useful.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153953/new/

https://reviews.llvm.org/D153953



[PATCH] D153953: Revert "[AMDGPU] Mark mbcnt as convergent"

2023-06-29 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D153953#4458386 , @foad wrote:

> In D153953#4458134 , @sameerds 
> wrote:
>
>> @pravinjagtap @arsenm ... reverting the convergent attribute on the mbcnt 
>> intrinsics affects tests that were added for atomic optimizations. In 
>> particular, mbcnt is now being moved across/into/out of control flow because 
>> it is no longer convergent. I eyeballed one example and it seemed okay to 
>> me, but a more thorough check would be useful.
>
> They are just being moved from before the loop to after the loop. This is 
> fine. It is even a bit weird that the atomic optimizer pass emits them before 
> the loop in the first place.

@foad just to respect the process, are you okay with approving the review 
again? I had set it back to "needs review".


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153953/new/

https://reviews.llvm.org/D153953



[PATCH] D150427: [AMDGPU] Non hostcall printf support for HIP

2023-05-22 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:184
 // specify a string, i.e, the "%s" specifier with optional '*' characters.
-static void locateCStrings(SparseBitVector<8> &BV, Value *Fmt) {
-  StringRef Str;
-  if (!getConstantStringInfo(Fmt, Str) || Str.empty())
-return;
+static void locateCStrings(SparseBitVector<8> &BV, Value *Fmt, StringRef &Str) 
{
 

Where is Fmt used in this function? Also, StringRef should be passed by value; 
it's already a lightweight reference, as its name suggests.



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:210
+// helper struct to package the string related data
+typedef struct StringData {
+  std::string Str = "";

typedef is not required for this in C++



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:486-487
+  {ConstantInt::get(Int32Ty, 8)});
+}
+else {
+  // Include a dummy metadata instance in case of only non constant

else should be on the same line as the closing brace. Please do run 
clang-format on the entire change.



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:458
+auto CreateControlDWord = M->getOrInsertFunction(
+StringRef("__ockl_create_control_dword"), Builder.getInt32Ty(),
+Builder.getInt32Ty(), Int1Ty, Int1Ty);

vikramRH wrote:
> arsenm wrote:
> > vikramRH wrote:
> > > arsenm wrote:
> > > > Do we really need another ockl control variable for this? Why isn't it 
> > > > a parameter? printf=stdout always 
> > > There are certain HIP API's such as "hip_assert" that output to stderr. 
> > > currently such API's are supported via hostcalls. Although this 
> > > implementation does not currently support the API's ,its kept as an 
> > > option. 
> > Right but the way to handle that would be a parameter for where to output, 
> > not an externally set global 
> I am not clear here, you expect additional inputs to device lib function ?
@arsenm, this "control word" written into the buffer. In that sense, it is 
indeed a parameter passed from device to host as part of the printf packet. It 
is not yet another global variable.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150427/new/

https://reviews.llvm.org/D150427



[PATCH] D150427: [AMDGPU] Non hostcall printf support for HIP

2023-05-17 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:212
+// helper struct to package the string related data
+typedef struct S {
+  std::string Str;

you can just say "struct StringData"



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:218-219
+
+  S(std::string str = "", bool IC = true, Value *RS = nullptr,
+Value *AS = nullptr)
+  : Str(str), isConst(IC), RealSize(RS), AlignedSize(AS) {}

Every argument has a default value. Why not simply assign them as member 
initializers?
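
That is, a sketch of the suggestion (field names as in the patch):

  struct StringData {
    std::string Str;
    bool isConst = true;
    Value *RealSize = nullptr;
    Value *AlignedSize = nullptr;
  };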



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:238
+
+  // First 8 bytes to be reserved for control dword
+  size_t BufSize = 4;

Should this say 4?



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:256
+
+  StringRef ArgStr;
+  for (size_t i = 1; i < Args.size(); i++) {

move this closer to first use



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:313
+  SparseBitVector<8> &SpecIsCString,
+  SmallVector<StringData> &StringContents,
+  bool isConstFmtStr) {

function argument should use SmallVectorImpl, and not SmallVector



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:321
+SmallVector<Value *> WhatToStore;
+if ((i == 0) || SpecIsCString.test(i)) {
+  if ((*StrIt).isConst) {

This "if (...) { ... }" has a really long body. Can it be moved into a separate 
function for readability?



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:324
+Str = (*StrIt).Str;
+const uint64_t ReadSize = 4;
+

Try to move declarations to the smallest scope that they are used in. For 
ReadSize, this seems to be the while loop later.



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:352
+
+  // TODO: Should not bothering aligning up.
+  if (ReadNow < ReadSize)

typo: bother



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:426-427
+
+// The buffered version still follows OpenCL printf standards for
+// printf return value, i.e 0 on success, 1 on failure.
+ConstantPointerNull *zeroIntPtr =

So we cannot use a buffered printf in HIP?



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:451
+{ArgSize,
+ ConstantInt::get(Int1Ty, !FmtStr.empty() ? 1 : 0, false),
+ ConstantInt::get(Int1Ty, 0, false)});

FmtStr.empty() gets checked in multiple places. Can this be captured as one or 
more boolean variables with meaningful names?



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:487
+  if(metaD->getNumOperands() == 0) {
+MDString *fmtStrArray = MDString::get(Ctx, "0:0:deadbeef,\"\"");
+MDNode *myMD = MDNode::get(Ctx, fmtStrArray);

Probably not a good choice if it shows up where the user can see it. It's also 
not very informative. A more descriptive string would be better.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150427/new/

https://reviews.llvm.org/D150427



[PATCH] D150427: [AMDGPU] Non hostcall printf support for HIP

2023-05-17 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:184-185
 // specify a string, i.e, the "%s" specifier with optional '*' characters.
-static void locateCStrings(SparseBitVector<8> &BV, Value *Fmt) {
-  StringRef Str;
-  if (!getConstantStringInfo(Fmt, Str) || Str.empty())
+static void locateCStrings(SparseBitVector<8> &BV, Value *Fmt, StringRef 
&FmtStr) {
+  if (!getConstantStringInfo(Fmt, FmtStr) || FmtStr.empty())
 return;

Seems like the only purpose of this change is to make the new StringRef 
available outside. It will be less confusing to just  move the call to 
getConstantStringInfo() out of this function, and pass the StringRef as an 
input parameter instead of output. Also, try not to change existing variable 
names unless it is really relevant. That reduces the number of lines affected 
by mostly non-functional changes.
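
A sketch of the suggested shape at the call site (using the patch's names):

  StringRef FmtStr;
  if (!getConstantStringInfo(Fmt, FmtStr) || FmtStr.empty())
    return;
  locateCStrings(BV, FmtStr); // StringRef passed by value as an input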



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:223
+
+static size_t alignUp(size_t Value, uint alignment) {
+  return (Value + alignment - 1) & ~(alignment - 1);

Needs consistent capitalization in the argument names. Since the rest of the 
code seems to prefer TitleCase, maybe alignment should start with "A"?



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:369
+// to alignment padding is not populated with any specific value
+// here, I feel this would be safe as long as runtime is sync with
+// the offsets.

Avoid first person comments. 


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150427/new/

https://reviews.llvm.org/D150427



[PATCH] D150427: [AMDGPU] Non hostcall printf support for HIP

2023-06-07 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp:386-387
+} else {
+  auto IntTy = dyn_cast<IntegerType>(Args[i]->getType());
+  if (IntTy && IntTy->getBitWidth() == 32)
+WhatToStore.push_back(

vikramRH wrote:
> arsenm wrote:
> > vikramRH wrote:
> > > arsenm wrote:
> > > > isIntegerTy(32).
> > > > 
> > > > I also do not understand why only 32-bit integers would be promoted to 
> > > > 64-bit (or why this would be zext). This doesn't match varargs ABI 
> > > > handling, where everything smaller would be promoted to i32.
> > > Right, But default promotions would have happened already, this 
> > > additional promotion is due to runtime processing requirement
> > Is this documented somewhere? If it's promote everything to i64, I'd prefer 
> > to handle all types rather than special casing 32
> That's precisely what the function "fitArgInto64Bits()" did. We eliminated it 
> due to some redundant instructions generated. int32 was the only case (apart 
> from float16 now) that required this special casting.
I think the problem is that the check for "isIntegerTy(32)" makes it look like 
it is some sort of special case. What we really mean is that if the type is an 
integer with width less than 64 bits, then zext it with padding to fit the 
64-bit store into the buffer. We can do that with just "isIntegerTy()".

Separately, we know from auto promotion of varargs that the integer can only be 
of size 32 or 64. We can assert that instead of assuming it in the code.
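
That is, roughly (a sketch of the suggested shape, not the committed code):

  if (auto *IntTy = dyn_cast<IntegerType>(Args[i]->getType())) {
    unsigned Width = IntTy->getBitWidth();
    assert((Width == 32 || Width == 64) &&
           "varargs promotion should leave only i32/i64");
    WhatToStore.push_back(Width == 64
                              ? Args[i]
                              : Builder.CreateZExt(Args[i], Int64Ty));
  }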


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150427/new/

https://reviews.llvm.org/D150427



[PATCH] D115283: [AMDGPU] Set "amdgpu_hostcall" module flag if an AMDGPU function has calls to device lib functions that use hostcalls.

2022-01-31 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added inline comments.



Comment at: clang/lib/CodeGen/TargetInfo.cpp:9434
+  if (Callee &&
+  (Callee->getName() == "__ockl_call_host_function" ||
+   Callee->getName() == "__ockl_fprintf_stderr_begin")) {

Just to confirm what others have probably disovered, the only function whose 
presence should be checked is ``__ockl_hostcall_internal``. All others are 
wrappers that are free to disappear during optimization.
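
So a sufficient check is simply (a sketch, assuming the device libraries have 
already been linked in):

  // Wrappers like __ockl_call_host_function can be inlined or optimized
  // away; the one non-eliminable entry point is the internal function.
  bool UsesHostcall = M.getFunction("__ockl_hostcall_internal") != nullptr;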


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D115283/new/

https://reviews.llvm.org/D115283



[PATCH] D69498: IR: Invert convergent attribute handling

2021-04-22 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D69498#2708990 , @arsenm wrote:

> You're still saying the same thing. This needs to be defined generically. 
> Frontends don't *have* to do anything, they'll just get the assumed 
> convergent behavior by default. Either frontends could always add 
> noconvergent to avoid any possible optimization hit, or we could have a pass 
> add noconvergent depending on the target. I don't want to change the 
> interpretation of core attributes based on the target

Like I said earlier, this is not a reinterpretation based on the target. I think we are 
not talking about the same "convergent" here. There is currently an attribute 
in LLVM with the spelling "c-o-n-v-e-r-g-e-n-t". It does not do what its name 
says. It has a prescriptive definition that says "thou shalt not add control 
dependences to this function". This is not what we actually need, because it 
fails in two ways:

1. It is actually okay add control dependences to a convergent function as long 
as the new branches are uniform.
2. Some times we hack a transform to avoid doing things that cannot be 
described as "add new control dependence", and still talk about the function 
having the above named attribute.
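
To make point 1 concrete, here is a hedged, HIP-flavored sketch (__ballot and 
threadIdx are assumed GPU extensions; the exact builtin does not matter, only 
the uniformity of the branch does):

  #include <hip/hip_runtime.h>

  // __ballot is convergent: its result depends on which threads execute it
  // together.
  __global__ void example(int *out, int n) {
    unsigned long long mask = __ballot(out[threadIdx.x] > 0);
    if (n > 0)                      // uniform: n is the same for every thread
      out[threadIdx.x] = (int)mask;
  }

Sinking the __ballot call into the "if (n > 0)" block adds a control 
dependence, but the branch is uniform, so the set of threads executing the 
ballot is unchanged and the motion is safe.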

There is another important bug that obfuscates the whole discussion: most 
places that use "isConvergent()" should in fact first check if it really 
matters to the surrounding control flow. There is too much loaded onto that one 
attribute, without sufficient context. The definition of "noconvergent" that I 
am proposing starts out by first fixing the definition of convergent itself. 
This definition is independent of the target, and only talks about the 
properties of the control flow that reaches the callsite. To repeat, this does 
not 
reinterpret the IR in a target-defined way. Like I said, in the new definition, 
all function calls are convergent even on CPUs, so I don't see where the target 
comes in. If you still insist on talking about interpretation of core 
attributes, please start by deconstructing the definition that I propose so I 
can see what I am missing.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69498/new/

https://reviews.llvm.org/D69498

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D69498: IR: Invert convergent attribute handling

2021-04-23 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D69498#2710675 , @arsenm wrote:

> In D69498#2709218 , @sameerds wrote:
>
>> 1. It is actually okay add control dependences to a convergent function as 
>> long as the new branches are uniform.
>> 2. Some times we hack a transform to avoid doing things that cannot be 
>> described as "add new control dependence", and still talk about the function 
>> having the above named attribute.
>>
>> There is another important bug that obfuscates the whole discussion: most 
>> places that use "isConvergent()" should in fact first check if it really 
>> matters to the surrounding control flow. There is too much loaded onto that 
>> one attribute, without sufficient context. The definition of "noconvergent" 
>> that I am proposing starts out by first fixing the definition of convergent 
>> itself. This definition is independent of target, and only talks about the 
>> properties of the control flow that reach the callsite. To repeat, this does 
>> not reinterpret the IR in a target-defined way. Like I said, in the new 
>> definition, all function calls are convergent even on CPUs, so I don't see 
>> where the target comes in. If you still insist on talking about 
>> interpretation of core attributes, please start by deconstructing the 
>> definition that I propose so I can see what I am missing.
>
> Yes, the current concept of convergent is broken. I think this whole recent 
> burst of conversation has been too focused on the current convergent 
> situation and this patch in particular. I think we need to conceptually 
> rebase this into the world that D85603  
> creates.

Actually that is exactly what this current conversation is doing. My definition 
of `noconvergent` first rebases the definition of `convergent` into what D85603 
 does. My wording is no different than the one 
you see here: https://reviews.llvm.org/D85603#change-WRNH9XUSSoR8

> The expected convergence behavior is not only a target property, but of the 
> source language. The frontend ultimately has to express the intended 
> semantics of these cross lane regions, and the convergence tokens gives this 
> ability.

That is true, and it is covered by the simple fact that we both agree that 
`convergent` needs to be the default unless specified by the frontend using 
`noconvergent`.

> My point about not making this target dependent is we shouldn't be trying to 
> shoehorn in TTI checks around convergent operations just to save frontends 
> from the inconvenience of adding another attribute to function declarations. 
> We can interprocedurally infer noconvergent like any other attribute most of 
> the time. The convergent calls and their tokens should be interpreted the 
> same even if the target CPU doesn't really have cross lane behavior.

This is not an attempt to shoehorn TTI checks for mere convenience. You missed 
the part about how the current use of `CallInst::isConvergent()` is broken. 
This is where the wording in D85603  inherits 
a major fault from the existing definition of convergent. It is wrong to say 
that optimizations are plain forbidden by the mere knowledge/assumption that 
some call is convergent. Those optimizations are forbidden only if divergent 
control flow is involved. The definition of `convergent` needs to refrain from 
being prescriptive about what the compiler can and cannot do. See the 
definition of `nosync` for comparison.

The convergent property merely says that the call is special, and then it is up 
to each optimization to decide what happens. That decision is based on whether 
the optimization affects divergent control flow incident on the call, and 
whether that effect is relevant to a convergent function. Now that's where the 
following reasoning comes into play:

1. Every use of isConvergent() needs to be tempered with knowledge of the 
control flow incident on it (a sketch follows this list).
2. The only way to check for divergent control flow is to call on divergence 
analysis.
3. When divergence analysis is called on a target with no divergence, it 
returns `nullptr`, which is treated as equivalent to returning `uniform` for 
every query. This is not hypothetical, it's already happening whenever a 
transform does `if (!DA) { ... }`.
4. Pending an attempt to actually call DA at each of these places, a correct 
simplification is to assume uniform control flow when the target does not have 
divergence, and assume divergent control flow otherwise.
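
A hedged sketch of what point 1 could look like at a query site (not a final 
API; a null DA means the target has no divergence, so every branch is treated 
as uniform, per point 3):

  #include "llvm/Analysis/LegacyDivergenceAnalysis.h"
  #include "llvm/IR/Instructions.h"

  static bool convergenceBlocksMotion(const llvm::CallInst &CI,
                                      const llvm::LegacyDivergenceAnalysis *DA,
                                      const llvm::BranchInst &Br) {
    if (!CI.isConvergent())
      return false;                 // not special at all
    if (!DA || !Br.isConditional())
      return false;                 // uniform control flow cannot matter
    return DA->isDivergent(Br.getCondition());
  }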

That's all there is to this. Like I keep saying, this is not a 
reinterpretation, nor is it a hack. From my first comment on the thread, it is 
an attempt to recast the whole question to be one about combining optimizations 
with target-specific information. Now before you balk at yet another use of the 
phrase "target-specific", please note that DA is totally target-specific. The 
"sources 

[PATCH] D69498: IR: Invert convergent attribute handling

2021-04-24 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D69498#2712706 , @nhaehnle wrote:

> In D69498#2711396 , @foad wrote:
>
>> I don't have much to add to the conversation except to point out that this 
>> definition defines `noconvergent` in terms of divergent control flow, but 
>> the langref itself doesn't define what divergent control flow //is//, which 
>> makes it an incomplete spec. (Perhaps I'm just restating @arsenm's 
>> objections.) This seems unsatisfactory to me but I have no idea what to do 
>> about it. I agree with @sameerds that the current definition of `convergent` 
>> is too restrictive because in practice we really do want to be able to move 
>> convergent calls past uniform control flow.
>
> That is one of the things that D85603  
> addresses. I suspect Sameer was assuming the language from there in the 
> background.
>
> I agree with Matt that it would be best to avoid too much TTI dependence. The 
> existing situation with the intrinsics is already a bit of a strange one. I 
> wonder if it is possible to move to a world where isSourceOfDivergence need 
> not be target-dependent. (This requires e.g. new attributes on function 
> arguments as well as new information about address spaces in the data layout, 
> plus some new attributes to define intrinsic data flow. Likely beyond the 
> scope of this patch.)

In general, yes, it is great to avoid dependence on TTI. But this particular 
instance is actually a dependence on DA. The fact that DA depends on TTI is a 
separate problem. But I think that's a natural fallout of the fact that LLVM is 
not (and should not be) in the business of defining an execution model for 
multiple threads. Every optimization that affects control flow needs to be able 
to reason about an execution model that LLVM itself does not define. So what we 
end up with is an approximate model in the form of information gleaned from the 
frontend as well as the target. Neither source is "more correct" or "less 
ideal" than the other.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69498/new/

https://reviews.llvm.org/D69498

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D69498: IR: Invert convergent attribute handling

2021-04-30 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D69498#2728451 , @JonChesterfield 
wrote:

> I wonder if it is necessary for the exec mask to be implicit state, managed 
> by a convergent/divergent abstraction.
>
> We could reify it as an argument passed to kernels (where all bits are set), 
> and adjusted by phi nodes at entry to basic blocks. Various intrinsics take 
> and return that reified value. __ballot, a comparison op, possibly load/store.
>
> At that point all the awkward control flow constraints are raised to data 
> flow, and I think (but haven't really chased this into the dark corners) that 
> solves the problems I know about. Notably, a uniform branch would be one 
> where the exec mask value was unchanged, so the associated phi is constant 
> folded.
>
> Makes a mess of the human readable IR, but possibly at a lower overall 
> complexity than continuing to handle divergence as control flow.

That is essentially what the current divergence analysis (Karrenberg and Hack) 
does anyway. Except that since we don't know how many threads are running 
concurrently, the mask is a symbolic value that can either be "divergent" or 
"uniform". If I am looking at this the right way, you seem to be saying two 
separate things: the notion of divergence/uniformity is what solves the problem 
at hand, and separately, the notion of divergence can be turned into explicit 
dataflow.

> edit: said reified mask, if integer, would also be the value post-volta sync 
> intrinsics take, where the developer is supposed to compute the mask across 
> branches and pass it around everywhere. Clang could therefore provide 
> builtins that lower to those sync intrinsics with an always correct mask 
> automatically passed. That would make volta much easier to program.

Additionally, the explicit mask (if we could agree on its type) would make it 
possible to have "dynamic uniformity". Currently we can only prove static 
uniformity.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69498/new/

https://reviews.llvm.org/D69498

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D69498: IR: Invert convergent attribute handling

2021-04-21 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

The way I see it, the notion of convergence is relevant only to a certain class 
of targets (usually represented by GPUs) and it only affects certain 
optimizations. Then why not have only these optimizations check `TTI` to see if 
convergence matters? `TTI.hasBranchDivergence()` seems like a sufficient proxy 
for this information.

1. `convergent` becomes the default in LLVM IR, but it does not affect 
optimizations on non-GPU targets.
2. This is not a reinterpretation of the same IR on different targets. The 
notional execution model of LLVM IR will say that all function calls are 
convergent. Targets that only care about one thread at a time represent the 
degenerate case where all executions are convergent anyway.

This recasts the whole question to be one about combining optimizations with 
target-specific information. The only changes required are in transforms that 
check `CallInst::isConvergent()`. These should now also check `TTI`, possibly 
adding a dependency on the `TTI` analysis where it didn't exist earlier.
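
A hedged sketch of the minimal change in such a transform (both queries exist 
today; the real cost being discussed is threading TTI into passes that do not 
already have it):

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/Instructions.h"

  // Convergence constrains the transform only on targets with divergence.
  static bool constrainsCodeMotion(const llvm::CallInst &CI,
                                   const llvm::TargetTransformInfo &TTI) {
    return TTI.hasBranchDivergence() && CI.isConvergent();
  }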


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69498/new/

https://reviews.llvm.org/D69498

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D69498: IR: Invert convergent attribute handling

2021-04-21 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D69498#2704364 , @foad wrote:

>> This recasts the whole question to be one about combining optimizations with 
>> target-specific information. The only changes required are in transforms 
>> that check `CallInst::isConvergent()`. These should now also check `TTI`, 
>> possibly adding a dependency on the `TTI` analysis where it didn't exist 
>> earlier.
>
> @sameerds I agree with your conclusions but I would describe the situation a 
> little differently. As I understand it, the optimizations that check 
> isConvergent really only care about moving convergent calls past control flow 
> //that might be divergent//. !hasBranchDivergence is a promise that there are 
> no possible sources of divergence for a target, so you can run a divergence 
> analysis if you like but it will just tell you that everything is uniform, so 
> all control flow is uniform, so it's OK to move isConvergent calls around.
>
> In practice the optimizations that check isConvergent don't seem to use 
> divergence analysis, they just pessimistically assume that any control flow 
> might be divergent (if hasBranchDivergence). But they could and perhaps 
> should use divergence analysis, and then it would all just fall out in the 
> wash with no need for an explicit hasBranchDivergence test.

Sure, it is formally the same thing. But in practice, the main issue for 
everyone is the effect on compile time for targets that don't care about 
convergence/divergence. For such targets, running even the divergence analysis 
is an unnecessary cost. Any checks for uniform control flow will still be 
hidden under TTI.hasBranchDivergence() for such targets. We see that happening 
already in places where divergence analysis is constructed optionally.
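
Roughly this pattern, as a fragment-style sketch (hedged: TTI, F, and Cond are 
assumed from the surrounding pass, and buildDivergenceAnalysis is a 
hypothetical stand-in for however the pass obtains the analysis):

  std::unique_ptr<llvm::LegacyDivergenceAnalysis> DA;
  if (TTI.hasBranchDivergence())
    DA = buildDivergenceAnalysis(F); // only paid for on divergent targets
  // A null DA reads as "all control flow is uniform":
  bool Divergent = DA && DA->isDivergent(Cond);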


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69498/new/

https://reviews.llvm.org/D69498

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D69498: IR: Invert convergent attribute handling

2021-04-21 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

I realize now that what @foad says above puts the idea in a clearer context. 
Essentially, any check for isConvergent() isn't happening in a vacuum. It is 
relevant only in the presence of divergent control flow, which in turn is 
relevant only when the target has divergence. Any standalone check for 
isConvergent() is merely making a pessimistic assumption that all the control 
flow incident on it is divergent. This has two consequences:

1. The meaning of the `convergent` attribute has always been target-dependent.
2. Dependence on TTI is not a real cost at all. We may eventually update every 
use of isConvergent() to depend on a check for divergence. The check for TTI is 
only the first step towards that.

I would propose refining the definition of the `noconvergent` attribute as 
follows:

> noconvergent:
>
> Some targets with a parallel execution model provide cross-thread operations 
> whose outcome is affected by the presence of divergent control flow. We call 
> such operations convergent. Optimizations that change control flow may affect 
> the correctness of a program that uses convergent operations. In the presence 
> of divergent control flow, such optimizations conservatively treat every 
> call/invoke instruction as convergent by default. The noconvergent attribute 
> relaxes this constraint as follows:
>
> - The noconvergent attribute can be added to a call/invoke to indicate that 
> it is not affected by changes to the control flow that reaches it.
> - The noconvergent attribute can be added to a function to indicate that it 
> does not execute any convergent operations. A call/invoke automatically 
> inherits the noconvergent attribute if it is set on the callee.
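
A hedged sketch of the inheritance rule in this wording (the string spelling 
of the attribute is illustrative, not final IR syntax):

  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstrTypes.h"

  static bool isNoConvergent(const llvm::CallBase &CB) {
    if (CB.hasFnAttr("noconvergent"))   // set directly on the call/invoke
      return true;
    const llvm::Function *Callee = CB.getCalledFunction();
    return Callee && Callee->hasFnAttribute("noconvergent");
  }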


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69498/new/

https://reviews.llvm.org/D69498

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D69498: IR: Invert convergent attribute handling

2021-04-22 Thread Sameer Sahasrabuddhe via Phabricator via cfe-commits
sameerds added a comment.

In D69498#2706524 , @arsenm wrote:

> In D69498#2705441 , @sameerds wrote:
>
>> I realize now that what @foad says above puts the idea in a clearer context. 
>> Essentially, any check for isConvergent() isn't happening in a vacuum. It is 
>> relevant only in the presence of divergent control flow, which in turn is 
>> relevant only when the target has divergence. Any standalone check for 
>> isConvergent() is merely making a pessimistic assumption that all the 
>> control flow incident on it is divergent. This has two consequences:
>>
>> 1. The meaning of the `convergent` attribute has always been 
>> target-dependent.
>> 2. Dependence on TTI is not a real cost at all. We may eventually update 
>> every use of isConvergent() to depend on a check for divergence. The check 
>> for TTI is only the first step towards that.
>
> The core IR semantics must *not* depend on target interpretation. convergent 
> has never been target dependent, and was defined in an abstract way. Only 
> certain targets will really care, but we cannot directly define it to mean 
> what we want it to mean for particular targets

My bad. I should have said "the implementation of convergent" is target 
dependent. But even that is not precise enough. It is more correct to say that 
the convergent attribute can be implemented inside LLVM by depending on TTI so 
that it only impacts targets that have divergence. This dependence is not new; 
it's merely missing because the current uses of convergent pessimistically 
assume divergent control flow on all targets.

The important point is that frontends have to do nothing special for targets 
that don't have divergence, because the concepts of convergence and divergence 
have a trivial effect on optimizations. I had already observed in a previous 
comment that this new definition does not amount to reinterpreting the same 
program differently on different targets. It is perfectly okay to say that all 
calls are convergent on a CPU, because all branches are trivially uniform on 
the CPU (this part is inspired by @foad's observation), and hence the default 
convergent nature of calls has no effect on optimization for CPUs. If we were 
to rename `isConvergent()` to `hasConvergenceConstraints()` then we can see 
that it trivially returns `false` for any target that has no divergence.

Note that my proposed definition for `noconvergent` makes no mention of the 
target. The first sentence starts with "some targets", but that's just for a 
helpful context. The concept of convergent calls is defined purely in terms of 
divergent control flow. This one sentence very nicely ties everything up:

> In the presence of divergent control flow, such optimizations conservatively 
> treat every call/invoke instruction as convergent by default.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69498/new/

https://reviews.llvm.org/D69498

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits