gemini-code-assist[bot] commented on code in PR #18528:
URL: https://github.com/apache/tvm/pull/18528#discussion_r2573015008


##########
python/tvm/tir/tensor_intrin/cuda.py:
##########
@@ -1465,7 +1465,7 @@ def get_index_C(elem_offset, stride):
     stride_b = stride // 8
     bi = i // 8
     bj = j // 8
-    return (bi // 2) * 2 * stride_b + bi % 2 + bj * 2
+    return ((bi // 2) * 2 * stride_b + bi % 2 + bj * 2) * 2

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   This change correctly fixes the offset calculation. For better maintainability, please consider adding a comment explaining the multiplication by 2: for example, how it relates to the register layout of the C fragment for `mma.sync`, where each element may occupy two addressable units in the `ptx_mma` intrinsic.
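   For illustration, here is a minimal sketch of how such a comment might read. It is not the actual helper from `cuda.py`: `i` and `j` are passed as parameters here rather than captured from the enclosing scope, and the stated rationale for the factor of 2 is an assumption to be verified against the `m16n8k8` fragment layout.
   
   ```python
   def get_index_C(i, j, stride):
       """Illustrative stand-in for the fragment-index helper in cuda.py."""
       stride_b = stride // 8  # stride counted in 8-element blocks
       bi = i // 8             # 8x8 block row
       bj = j // 8             # 8x8 block column
       # Hypothetical rationale: each element of the C fragment occupies two
       # addressable units in the ptx_mma register layout for mma.sync, so
       # the block-level offset is scaled by 2.
       return ((bi // 2) * 2 * stride_b + bi % 2 + bj * 2) * 2
   ```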



##########
src/tir/transforms/memhammer_intermediate_stage.cc:
##########
@@ -263,8 +263,12 @@ std::pair<Stmt, SeqStmt> InsertCacheStage(Stmt stmt, bool is_write_cache, ffi::S
   for (const For& loop : outer_loops) {
     if (loop->kind == ForKind::kThreadBinding) {
       const ffi::String& thread_tag = loop->thread_binding.value()->thread_tag;
-      if (CanRelaxStorageUnderThread(runtime::StorageScope::Create(storage_scope),
-                                     runtime::ThreadScope::Create(thread_tag))) {
+      auto thread_scope = runtime::ThreadScope::Create(thread_tag);
+      if (CanRelaxStorageUnderThread(runtime::StorageScope::Create(storage_scope), thread_scope)) {
+        if (is_write_cache && thread_scope.dim_index == 0) {
+          // writing C_reindex_m16n8k8_matrixC_shared_dyn is warp execution
+          continue;

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The logic to skip `threadIdx.x` for write caches is correct. The added comment is helpful, but it could be more descriptive to aid future maintenance. Consider expanding it to explain *why* warp execution implies skipping this dimension and how that prevents overestimation of shared memory.
   
   ```c
   // For write caches that are part of a warp-level execution (e.g., storing MMA results),
   // the threadIdx.x dimension should not contribute to the cache size calculation, as all
   // threads in the warp access the same memory region. Skipping it prevents overestimation
   // of shared memory.
   continue;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

