Hi:
I am investigating the capabilities of TVM primitives (CUDA backend). I take
CUTLASS as a baseline of a highly-optimized CUDA library.
I think most of the optimization techniques used in CUTLASS, such as tiling and
shared-memory management, are supported by TVM primitives.
Streaming is also an important technique; is it supported?
---
I don't think we expose a CUDA stream abstraction to the Python frontend. We
typically don't care about CUDA streams (we don't support any concurrency at
runtime).
What is your use case?
---
[Visit Topic](https://discuss.tvm.ai/t/how-cuda-kernel-is-launched-in-tvm-stack/6167/7)
Hi:
Thanks for your answer. I will check autotvm to see how it tunes grid/block,
since in my experience grid/block dimensions affect performance.
Another question: I see there is an argument for a **CUDA stream**:
```
CUstream strm = static_cast<CUstream>(CUDAThreadEntry::ThreadLocal()->stream);
```
---
Correct. You can tweak the schedule to change the launch config, but as a user
you shouldn't care about the exact size of grid/block.
If you really want the best performance, use autotvm to tune your schedule; the
resulting grid/block size is optimal based on real measurements.
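For concreteness, here is a minimal sketch (using the `te` API from the test
program below; the shape `n = 1024`, the names `A`/`B`, and the split factor of
64 are arbitrary assumptions, not anything prescribed by TVM) showing that the
schedule's `split`/`bind` decisions are what determine the launch config:
```
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

s = te.create_schedule(B.op)
# Splitting the loop and binding the pieces to blockIdx.x / threadIdx.x
# is what fixes the kernel's grid/block sizes at launch time.
bx, tx = s[B].split(B.op.axis[0], factor=64)   # blockDim.x = 64
s[B].bind(bx, te.thread_axis("blockIdx.x"))    # gridDim.x = n // 64 = 16
s[B].bind(tx, te.thread_axis("threadIdx.x"))
func = tvm.build(s, [A, B], target="cuda")
```
Changing the split factor changes the grid/block sizes without touching the
compute definition, which is exactly the kind of knob autotvm searches over.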
---
Hi:
Thank you for your help!
So, based on my understanding of this code: calling
```
func(a, b, c)
```
in Python will invoke this C++ method:
```
// CUDAWrappedFunc::operator() in src/runtime/cuda/cuda_module.cc
void operator()(TVMArgs args,
                TVMRetValue* rv,
                void** void_args) const
```
And grid_dim and block_dim are inferred from **TVMArgs args**?
---
The answer is that we use the CUDA driver API to launch kernels from C++ code.
`kernel<<<grid, block>>>(a, b, c)` is not the only way to launch a kernel, and
it requires compiling with NVCC.
See
https://github.com/apache/incubator-tvm/blob/e0122c0ea68043372220e4e02b81692c34832227/src/runtime/cuda/cuda_module.cc#L1
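To see this from Python, here is a small sketch (assuming `func` was built with
`tvm.build(..., target="cuda")`, as in the test program below): the CUDA kernel
lives in a separate device module imported by the host module, and the runtime
loads and launches it through the driver API (`cuLaunchKernel`), not through
the `<<<...>>>` syntax:
```
# Assumes `func` came from tvm.build(s, args, target="cuda").
dev_module = func.imported_modules[0]  # device module holding the CUDA kernel(s)
print(dev_module.get_source())         # the generated CUDA C source
```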
---
BTW, I am also wondering whether the TVM stack supports CUDA streaming features
like those described here:
(https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/)
---
Hi all:
I am learning the TVM CUDA backend and have a question about how a CUDA kernel
is launched.
Below is my simple test program:
```
import tvm
from tvm import te
import numpy as np
dtype = "float32"
# GEMM size
M = 16; K = 8; N = 16
# declare algorithm
k = te.reduce_axis((0, K), 'k')  # reduction axis over K