[PATCH] D133956: git push Cuda Check for ignored errors after calling a CUDA kernel

Artem Belevich via Phabricator via cfe-commits Thu, 15 Sep 2022 12:03:51 -0700

tra added a comment.

> Our linter cannot easily check whether the error was reset


It can not in principle. Many CUDA errors are 'sticky' and can only be cleared 
by resetting the GPU or exiting the application and the former is virtually 
never used beyond toy examples (resetting a GPU would clear a lot of state, 
including memory allocations and restoring it is usually not feasible in 
practice). 
E.g:
https://stackoverflow.com/questions/43659314/how-can-i-reset-the-cuda-error-to-success-with-driver-api-after-a-trap-instructi
https://stackoverflow.com/questions/56329377/reset-cuda-context-after-exception/56330491

The checker has no way to tell whether the returned error is sticky and the 
stickiness can start at any CUDA runtime call, so, generally speaking, all CUDA 
API calls must be checked and any of them may be the one producing sticky 
errors due to preceding calls. At the very minimum, in addition to `<<<...>>>` 
kernel launches, user may also launch kernels via `cudaLaunchKernel()` and, I 
believe, CUDA runtime itself may launch some helper kernels under the hood, so 
I would not be surprised to see other sources of errors.

I think ultimately the checker should be generalized to flag all unchecked CUDA 
runtime calls. The problem is that that is going to be exceedingly noisy in 
practice as a lot of real code does not bother to check for the errors 
consistently. Limiting the checks to kernel launches may be a reasonable 
starting point as it would give us the ability to zero in on the culprit kernel 
by running the app with "CUDA_LAUNCH_BLOCKING".



================
Comment at: 
clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:53
+  }
+  auto err = cudaGetLastError();
+
----------------
Just curious -- is it sufficient to just call `cudaGetLastError();` ? Or does 
the checker require using its result, too? I.e. in practice this particular 
check will not really do anything useful. The tests below look somewhat 
inconsistent. 


================
Comment at: 
clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:75
+  // CHECK-MESSAGES: :[[@LINE-1]]:{{[0-9]+}}: warning: Possible unchecked 
error after a kernel launch.
+  // Calling an error-checking function after a kernel is not considered safe.
+  errorCheck();
----------------
WDYM by "is not considered safe" here? How is that different from calling 
`cudaGetLastError()` and checking its value?


================
Comment at: 
clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:80-82
+  b<<<1, 2>>>();; /* The semicolons are here because the
+    detection of the macro is done with a lexer */ ;
+  CUDA_CHECK_KERNEL();
----------------
What would happen with a single `;` as would be seen in the normal user code?


================
Comment at: 
clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:112
+  do {
+    auto err2 = cudaGetLastError();
+  } while(0);
----------------
Why does this case produce no warning, while a very similar case above does? In 
both cases result of `cudaGetLastError()` is assigned to an unused variable 
within the loop body.

```
  b<<<1, 2>>>();
  // CHECK-MESSAGES: :[[@LINE-1]]:{{[0-9]+}}: warning: Possible unchecked error 
after a kernel launch.
  for(;;)
    auto err2 = cudaGetLastError(); // Brackets omitted purposefully, since 
they create an additional AST node

```


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133956/new/

https://reviews.llvm.org/D133956

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D133956: git push Cuda Check for ignored errors after calling a CUDA kernel

Reply via email to