[TVM Discuss] [Questions] How CUDA kernel is launched in TVM stack

2020-04-02 Thread Wei Sun via TVM Discuss
Hi: I am investigating the capabilities of TVM primitives (CUDA backend). I am taking CUTLASS as a baseline for a highly optimized CUDA library. I think most of the optimization techniques used in CUTLASS, such as tiling and shared-memory management, are supported by TVM primitives. Streaming is also an important
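The tiling technique mentioned here can be sketched independently of any GPU API. The following pure-Python sketch (tile size and loop structure are illustrative choices, not taken from CUTLASS or TVM) shows the loop blocking that schedule primitives like `split`/`reorder` express:

```python
# Blocked (tiled) matrix multiply: the loop structure that tiling
# produces. On a GPU, each (i0, j0) output tile would map to one
# thread block, with the K-tile staged in shared memory.
def matmul_tiled(A, B, M, K, N, tile=4):
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):          # tiles of output rows
        for j0 in range(0, N, tile):      # tiles of output cols
            for k0 in range(0, K, tile):  # tiles of the reduction axis
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        for k in range(k0, min(k0 + tile, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The result is identical to the untiled triple loop; only the iteration order (and hence locality) changes.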

[TVM Discuss] [Questions] How CUDA kernel is launched in TVM stack

2020-04-02 Thread masahi via TVM Discuss
I don't think we are exposing the CUDA stream abstraction to the Python frontend. We typically don't care about CUDA streams (we don't support any concurrency at runtime). What is your use case? --- [Visit Topic](https://discuss.tvm.ai/t/how-cuda-kernel-is-launched-in-tvm-stack/6167/7)

[TVM Discuss] [Questions] How CUDA kernel is launched in TVM stack

2020-04-02 Thread Wei Sun via TVM Discuss
Hi: Thanks for your answer. I will check autotvm to see how it tunes grid/block, because in my experience grid/block dims affect performance. Another question: I see there is an arg for a **cuda stream** ``` CUstream strm = static_cast<CUstream>(CUDAThreadEntry::ThreadLocal()->stream); ```

[TVM Discuss] [Questions] How CUDA kernel is launched in TVM stack

2020-04-01 Thread masahi via TVM Discuss
Correct. You can tweak the schedule to change the launch config, but as a user you shouldn't care about the exact size of grid/block. If you really want the best perf, use autotvm to tune your schedule; the resulting grid/block size is optimal based on real measurement. --- [Visit T
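The measurement-driven tuning described above can be pictured as a search over candidate configurations scored by measured runtime. A toy pure-Python sketch (the candidates and the `measure` callback are made up for illustration; real autotvm compiles each candidate schedule and times it on the actual device):

```python
# Toy illustration of measurement-driven tuning: try candidate block
# sizes, "measure" each resulting launch config, keep the fastest.
def tune_block_size(n_elems, candidates, measure):
    best_cfg, best_time = None, float("inf")
    for block in candidates:
        # Number of blocks needed to cover all elements.
        grid = (n_elems + block - 1) // block
        t = measure(grid, block)
        if t < best_time:
            best_cfg, best_time = (grid, block), t
    return best_cfg, best_time
```

In real use, `measure` would run the compiled kernel and return its wall-clock time; the point is only that the final grid/block size falls out of measurement, not out of anything the user specifies by hand.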

[TVM Discuss] [Questions] How CUDA kernel is launched in TVM stack

2020-04-01 Thread Wei Sun via TVM Discuss
Hi: Thank you for your help! So, based on my understanding of this code: in Python, ``` func(a,b,c) ``` will call ``` void operator() (TVMArgs args, TVMRetValue* rv, void** void_args) const ``` and grid_dim, block_dim are inferred from **TVMArgs args**(
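To make that inference concrete: as I understand it, the compiled CUDA module records which thread-axis tags (e.g. `blockIdx.x`, `threadIdx.x`) the host code appends as trailing arguments, and the wrapped function splits the incoming args into kernel arguments and launch dimensions (in TVM's C++ runtime this is handled around `ThreadAxisConfig` in `cuda_module.cc`). A simplified pure-Python model of that split, with illustrative names:

```python
# Simplified model: the module stores the thread-axis tags at compile
# time; at call time the trailing args supply each axis extent.
def split_launch_args(args, axis_tags):
    n_kernel_args = len(args) - len(axis_tags)
    kernel_args = list(args[:n_kernel_args])
    dims = {"blockIdx.x": 1, "blockIdx.y": 1, "blockIdx.z": 1,
            "threadIdx.x": 1, "threadIdx.y": 1, "threadIdx.z": 1}
    for tag, extent in zip(axis_tags, args[n_kernel_args:]):
        dims[tag] = extent
    grid = (dims["blockIdx.x"], dims["blockIdx.y"], dims["blockIdx.z"])
    block = (dims["threadIdx.x"], dims["threadIdx.y"], dims["threadIdx.z"])
    return kernel_args, grid, block
```

So grid/block are not guessed from the tensor shapes at call time; they were fixed by the schedule when the module was built.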

[TVM Discuss] [Questions] How CUDA kernel is launched in TVM stack

2020-04-01 Thread masahi via TVM Discuss
The answer is that we use the CUDA driver API to launch kernels from C++ code. ```kernel<<<grid, block>>>(a,b,c)``` is not the only way to launch a kernel, and it requires compiling with NVCC. See https://github.com/apache/incubator-tvm/blob/e0122c0ea68043372220e4e02b81692c34832227/src/runtime/cuda/cuda_module.cc#L1
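For readers unfamiliar with the driver API, a minimal sketch of the launch sequence (kernel name, dims, and error handling are placeholders; the real implementation lives in the `cuda_module.cc` file linked above, and this requires a CUDA-capable machine to run):

```cpp
// Hypothetical sketch: load runtime-generated PTX and launch a kernel
// via the CUDA driver API -- no nvcc and no <<<...>>> syntax involved.
#include <cuda.h>

void launch_from_ptx(const char* ptx, void* args[]) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    // PTX produced at runtime (e.g. by NVRTC or LLVM) is loaded here.
    cuModuleLoadData(&mod, ptx);
    cuModuleGetFunction(&fn, mod, "my_kernel");  // placeholder name
    // Grid/block dims come from the schedule, not from the user.
    cuLaunchKernel(fn,
                   /*gridDimX,Y,Z=*/16, 1, 1,
                   /*blockDimX,Y,Z=*/128, 1, 1,
                   /*sharedMemBytes=*/0,
                   /*stream=*/0,
                   args, /*extra=*/nullptr);
    cuCtxSynchronize();
}
```

This is why TVM can JIT-compile and launch kernels from a running process: `cuModuleLoadData` takes a PTX string at runtime, whereas `<<<...>>>` launches are fixed at NVCC compile time.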

[TVM Discuss] [Questions] How CUDA kernel is launched in TVM stack

2020-04-01 Thread Wei Sun via TVM Discuss
BTW, I am also wondering if the TVM stack supports CUDA streaming features like those described in https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/ --- [Visit Topic](https://discuss.tvm.ai/t/how-cuda-kernel-is-launched-in-tvm-stack/6167/2) to respond.

[TVM Discuss] [Questions] How CUDA kernel is launched in TVM stack

2020-04-01 Thread Wei Sun via TVM Discuss
Hi all: I am learning the TVM CUDA backend. I have a question about how a CUDA kernel is launched. Below is my simple test program:

```
import tvm
from tvm import te
import numpy as np

dtype = "float32"
# GEMM size
M = 16; K = 8; N = 16
# declare algorithm
k = te.reduce_axis((0, K), 'k')
# loop over d
```
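For reference, the declaration above (cut off in this digest) sets up an M×K by K×N matrix multiply with a reduction over `k`; the NumPy equivalent of the computation a `te.compute` with that `reduce_axis` would express is:

```python
import numpy as np

# Same shapes as the test program above.
dtype = "float32"
M, K, N = 16, 8, 16

A = np.random.rand(M, K).astype(dtype)
B = np.random.rand(K, N).astype(dtype)
# C[m, n] = sum_k A[m, k] * B[k, n] -- the reduction over axis k.
C = np.einsum("mk,kn->mn", A, B)
```

When a schedule for this computation binds axes to `blockIdx`/`threadIdx` and is built for the CUDA target, the resulting module carries the launch configuration discussed in the replies above.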