Hi! I have been studying how TVM works, and I tried out the
[tune_simple_template.py](https://github.com/apache/incubator-tvm/blob/master/tutorials/autotvm/tune_simple_template.py)
tutorial example from the website. Running it with a cuda (or OpenCL) target
produces errors.
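
For context, the only change I made to the script was the target in the final
build step, roughly like this (paraphrased from memory, so treat the exact
lines as approximate; `matmul` and `matmul.log` are the tutorial's own names):

```python
# apply the best config found during tuning, then build for the GPU
with autotvm.apply_history_best("matmul.log"):
    with tvm.target.Target("cuda"):  # was "llvm" in the tutorial
        s, arg_bufs = matmul(N, L, M, "float32")
        func = tvm.build(s, arg_bufs)
```

With that change I get errors like: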

> Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 
> -thread_warp_size=32, workload=('tutorial/matmul', 512, 512, 512, 'float32'). 
> A fallback configuration is used, which may bring great performance 
> regression.
> Traceback (most recent call last):
>   File "tune_simple_template.py", line 321, in <module>
>     func = tvm.build(s, arg_bufs)
>   File 
> "/root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/driver/build_module.py",
>  line 413, in build
>     mod_host, mdev = _build_for_device(input_mod, tar, target_host)
>   File 
> "/root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/driver/build_module.py",
>  line 255, in _build_for_device
>     mod_mixed = tvm.transform.Sequential(opt_mixed)(mod_mixed)
>   File 
> "/root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/ir/transform.py",
>  line 127, in __call__
>     return _ffi_transform_api.RunPass(self, mod)
>   File "tvm/_ffi/_cython/./packed_func.pxi", line 321, in 
> tvm._ffi._cy3.core.PackedFuncBase.__call__
>   File "tvm/_ffi/_cython/./packed_func.pxi", line 256, in 
> tvm._ffi._cy3.core.FuncCall
>   File "tvm/_ffi/_cython/./packed_func.pxi", line 245, in 
> tvm._ffi._cy3.core.FuncCall3
>   File "tvm/_ffi/_cython/./base.pxi", line 160, in tvm._ffi._cy3.core.CALL
> tvm._ffi.base.TVMError: Traceback (most recent call last):
>   [bt] (5) 
> /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(TVMFuncCall+0x65)
>  [0x7f0f613a6035]
>   [bt] (4) 
> /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(+0x6d4af6)
>  [0x7f0f6097caf6]
>   [bt] (3) 
> /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule,
>  tvm::transform::PassContext const&) const+0x2c8) [0x7f0f6097b8f8]
>   [bt] (2) 
> /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::transform::ModulePassNode::operator()(tvm::IRModule,
>  tvm::transform::PassContext const&) const+0x12f) [0x7f0f6097c5af]
>   [bt] (1) 
> /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(+0x8c352d)
>  [0x7f0f60b6b52d]
>   [bt] (0) 
> /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(+0x8c00a2)
>  [0x7f0f60b680a2]
>   Did you forget to bind?
>     Variable `B` is directly accessed by host memory (it is not contained in 
> a thread environment or in the function arguments.
>     Variable `A` is directly accessed by host memory (it is not contained in 
> a thread environment or in the function arguments.
>     Variable `C` is directly accessed by host memory (it is not contained in 
> a thread environment or in the function arguments.
>     Variable `C` is directly accessed by host memory (it is not contained in 
> a thread environment or in the function arguments.
>     Variable `C` is directly accessed by host memory (it is not contained in 
> a thread environment or in the function arguments.
>   File "/local/incubator-tvm/src/tir/analysis/verify_memory.cc", line 202
> RuntimeError: Memory verification failed with the following errors:
> PrimFunc([A, B, C]) attrs={"global_symbol": "default_function", 
> "tir.noalias": (bool)1, "target": cuda -keys=cuda,gpu -max_num_threads=1024 
> -thread_warp_size=32} {
>   for (i.outer, 0, 512) {
>     for (j.outer, 0, 512) {
>       C[((i.outer*512) + j.outer)] = 0f
>       for (k, 0, 512) {
>         C[((i.outer*512) + j.outer)] = (C[((i.outer*512) + j.outer)] + 
> (A[((i.outer*512) + k)]*B[((k*512) + j.outer)]))
>       }
>     }
>   }
> }


Is there a quick fix I can apply to demonstrate GEMM optimization on GPUs?
Any pointers are appreciated!
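
From the "Did you forget to bind?" hint, my guess is that the schedule needs
its axes bound to GPU blocks/threads before `tvm.build` will accept a cuda
target. Something like this minimal sketch is what I have in mind (the
one-axis-per-index binding is just my guess, not taken from the tutorial):

```python
import tvm
from tvm import te

# plain 512x512x512 matmul, same shapes as the tutorial workload
N = 512
A = te.placeholder((N, N), name="A", dtype="float32")
B = te.placeholder((N, N), name="B", dtype="float32")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
i, j = s[C].op.axis
# bind the spatial axes so the computation lives in a thread environment
s[C].bind(i, te.thread_axis("blockIdx.x"))
s[C].bind(j, te.thread_axis("threadIdx.x"))

func = tvm.build(s, [A, B, C], target="cuda")
```

But I am not sure how to fold bindings like this into the tutorial's autotvm
template, so corrections are welcome.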




