[Apache TVM Discuss] [Questions] Issues with Autotuning a Convolutional Network for VTA Simulator
Same issue, anyone can help? 😂 --- [Visit Topic](https://discuss.tvm.apache.org/t/issues-with-autotuning-a-convolutional-network-for-vta-simulator/5821/5) to respond.
[Apache TVM Discuss] [Questions] Failure [INSTALL_FAILED_OLDER_SDK]
Hi @tkat0, Do you have any suggestions on this issue? Thanks and Regards, Raju --- [Visit Topic](https://discuss.tvm.apache.org/t/failure-install-failed-older-sdk/7792/3) to respond.
[Apache TVM Discuss] [Questions] I can't find directory 'device' in tvm/micro
Hi Andrew, Thank you for your reply; I have successfully worked out that problem with a TVM repository posted in March. However, when I test the *.bin after flashing it to the NUCLEO-F746ZG, something goes wrong. There are several errors, but I think the one that makes the program abort is probably: `monitor thread exiting while process still alive; killing process`. Is there something wrong with my settings? How should I solve this problem? I would really appreciate any suggestions. Mike --- [Visit Topic](https://discuss.tvm.apache.org/t/i-cant-find-directorydevice-in-tvm-micro/8272/3) to respond.
[Apache TVM Discuss] [Questions] Tutorial, "How to optimize matmul with Auto TensorCore CodeGen", cannot work on my machine
My machine has a V100 with CUDA 10.2. The tutorial does not generate code that uses Tensor Core primitives; it only emits ordinary CUDA code. Can anyone help with this? Thanks! Here is the code I am trying to run:

```python
import logging
import sys
import numpy as np
import tvm
from tvm import te
import tvm.testing
from tvm import autotvm
from tvm.contrib import nvcc


def matmul_nn(A, B, L, dtype="float16", layout="NN"):
    k = te.reduce_axis((0, L), name="k")
    if dtype == "float16":
        out_type = "float"
    elif dtype == "int8":
        out_type = "int"
    elif dtype == "int4" or dtype == "int1":
        out_type = "int"
    if layout == "NN":
        return te.compute(
            (N, M), lambda i, j: te.sum(A[i, k].astype(out_type) * B[k, j].astype(out_type), axis=k)
        )
    if layout == "NT":
        return te.compute(
            (N, M), lambda i, j: te.sum(A[k, i].astype(out_type) * B[k, j].astype(out_type), axis=k)
        )
    if layout == "TN":
        return te.compute(
            (N, M), lambda i, j: te.sum(A[i, k].astype(out_type) * B[j, k].astype(out_type), axis=k)
        )
    if layout == "TT":
        return te.compute(
            (N, M), lambda i, j: te.sum(A[k, i].astype(out_type) * B[j, k].astype(out_type), axis=k)
        )


def test_gemm(N, L, M, dtype, layout):
    if layout == "NN":
        shape_a = (N, L)
        shape_b = (L, M)
    elif layout == "NT":
        shape_a = (L, N)
        shape_b = (L, M)
    elif layout == "TN":
        shape_a = (N, L)
        shape_b = (M, L)
    elif layout == "TT":
        shape_a = (L, N)
        shape_b = (M, L)
    else:
        print("Unsupported layout:", layout)
        sys.exit(1)
    A = te.placeholder(shape_a, name="A", dtype=dtype)
    B = te.placeholder(shape_b, name="B", dtype=dtype)
    C = matmul_nn(A, B, L, dtype, layout)

    s = te.create_schedule(C.op)
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    # storage_align params
    factor = 16
    offset = 8
    if dtype == "int8":
        factor = 32
        offset = 16
    elif dtype == "int4":
        factor = 64
        offset = 32
    elif dtype == "int1":
        factor = 256
        offset = 128

    # create cache stages
    AA = s.cache_read(A, "shared", [C])
    if layout == "NN" or layout == "TN":
        s[AA].storage_align(AA.op.axis[0], factor, offset)
    AL = s.cache_read(AA, "local", [C])
    BB = s.cache_read(B, "shared", [C])
    if layout == "TT" or layout == "NT":
        s[BB].storage_align(BB.op.axis[0], factor, offset)
    BL = s.cache_read(BB, "local", [C])
    CL = s.cache_write(C, "local")

    bx = 4
    by = 32
    step_k = 16
    v = 8

    # thread tile
    TX = 8
    TY = 1
    if dtype == "int4" or dtype == "int1":
        TX = 2
    # warp tile
    warp_tile_m = 16  # it could also be 8 or 32 on CUDA version >= 10.0
    warp_tile_k = 16  # it must be 16 for fp16/int8 data type
    if dtype == "int4":
        warp_tile_m = 8
        warp_tile_k = 32
    elif dtype == "int1":
        warp_tile_m = 8
        warp_tile_k = 128
    # block tile
    tile_x = bx * TX
    tile_y = by * TY

    yo, ty = s[C].split(y, tile_y)
    ty, yi = s[C].split(ty, TY)

    # schedule for C stage
    xo, xi = s[C].split(x, tile_x)
    WX = min(warp_tile_m, tile_x)
    tz, xi = s[C].split(xi, WX)
    tx, xi = s[C].split(xi, TX)
    s[C].reorder(yo, xo, tz, ty, tx, yi, xi)
    s[C].bind(yo, te.thread_axis("blockIdx.y"))
    s[C].bind(xo, te.thread_axis("blockIdx.x"))
    s[C].bind(ty, te.thread_axis("threadIdx.y"))
    s[C].bind(tz, te.thread_axis("threadIdx.z"))
    s[C].bind(tx, te.thread_axis("threadIdx.x"))

    # schedule for CL stage
    ko, ki = s[CL].split(k, step_k * warp_tile_k)
    kl, ki = s[CL].split(ki, warp_tile_k)
    s[CL].compute_at(s[C], tx)
    yo, xo = CL.op.axis
    s[CL].reorder(ko, kl, ki, yo, xo)

    # schedule for AA stage
    s[AA].compute_at(s[CL], ko)
    xo, xi = s[AA].split(s[AA].op.axis[1], factor=bx * v)
    tz, tx = s[AA].split(xi, factor=(WX // TX) * v)
    tx, vec = s[AA].split(tx, factor=v)
    fused = s[AA].fuse(s[AA].op.axis[0], xo)
    _, ty = s[AA].split(fused, factor=by)
    s[AA].bind(ty, te.thread_axis("threadIdx.y"))
    s[AA].bind(tz, te.thread_axis("threadIdx.z"))
    s[AA].bind(tx, te.thread_axis("threadIdx.x"))
    # vectorization is very important for float16/int8 inputs
    s[AA].vectorize(vec)

    # schedule for BB stage
    s[BB].compute_at(s[CL], ko)
    xo, xi = s[BB].split(s[B
```
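In case it helps to narrow this down: one quick check is to dump the generated CUDA source and look for `wmma` intrinsics. This is only a sketch; it assumes `func` is the module returned by `tvm.build(...)` in the later part of the script, which is cut off in the quoted post.

```python
# Hypothetical check, assuming something like `func = tvm.build(s, [A, B, C], "cuda")`
# was called later in the script (that part of the post is cut off above).
dev_module = func.imported_modules[0]  # the CUDA device module
cuda_src = dev_module.get_source()     # generated CUDA C source
if "wmma" in cuda_src:
    print("Tensor Core (wmma) intrinsics were emitted.")
else:
    print("Only ordinary CUDA code was generated.")
```

If I remember the tutorial correctly, the Tensor Core rewrite only triggers when the data type and matrix sizes satisfy its constraints (for example fp16 with M, N, L divisible by 16), so it is also worth double-checking the sizes passed to `test_gemm`.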
[Apache TVM Discuss] [Questions] TVM int8 graph Optimization relu problem
In TVM there are many operators that cannot be computed in int8, such as mean and avg_pool2d. Taking the mean op as an example, TVM will automatically insert dequantize and quantize ops around it. In the graph optimization process I get the result shown on the right in the pictures. However, in my mind the left one would be the better optimized graph, because it has lower memory access. So I wonder how to modify the graph optimization to generate the result on the left. --- [Visit Topic](https://discuss.tvm.apache.org/t/tvm-int8-graph-optimization-relu-problem/8347/1) to respond.
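Without the pictures it is hard to be specific, but if the goal is to keep an element-wise op like relu on the narrow integer tensor instead of on the dequantized float tensor, one way to experiment is a Relay dataflow-pattern rewrite applied before the standard optimization passes. The sketch below only illustrates the mechanism; the matched pattern (`nn.relu` directly over `qnn.dequantize`) and the replacement are assumptions about what your graph looks like, not something TVM does by default.

```python
from tvm import relay
from tvm.relay.dataflow_pattern import DFPatternCallback, is_op, wildcard, rewrite


class PushReluBelowDequantize(DFPatternCallback):
    """Illustrative rewrite: relu(dequantize(x, s, zp)) -> dequantize(maximum(x, zp)).

    Mathematically equivalent for a positive scale, and it keeps the
    element-wise max on the narrow integer tensor (lower memory traffic).
    """

    def __init__(self):
        super().__init__()
        self.data = wildcard()
        self.scale = wildcard()
        self.zero_point = wildcard()
        self.dequant = is_op("qnn.dequantize")(self.data, self.scale, self.zero_point)
        self.pattern = is_op("nn.relu")(self.dequant)

    def callback(self, pre, post, node_map):
        data = node_map[self.data][0]
        scale = node_map[self.scale][0]
        zp = node_map[self.zero_point][0]
        # relu in float == clamp at the zero point in the integer domain
        # (assumes the quantized tensor is int8; adjust the cast for other dtypes)
        clipped = relay.maximum(data, relay.cast(zp, "int8"))
        return relay.qnn.op.dequantize(clipped, scale, zp)


# usage sketch: mod["main"] = rewrite(PushReluBelowDequantize(), mod["main"])
```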
[Apache TVM Discuss] [Questions] Issues with Autotuning a Convolutional Network for VTA Simulator
The same issue: I'm trying to tune some conv2d_packed workloads on the VTA fast simulator and measure the speed of these ops. What should I do? How should I set the `target` of the autotvm task and the `runner` of `measure_option`? --- [Visit Topic](https://discuss.tvm.apache.org/t/issues-with-autotuning-a-convolutional-network-for-vta-simulator/5821/6) to respond.
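Not a definitive answer, but the official `tune_relay_vta.py` tutorial tunes VTA workloads using the VTA environment's target and an RPC-based runner, and the same pattern should apply to the fast simulator. A minimal sketch along those lines, assuming your `vta_config.json` sets `TARGET` to `"sim"` and that an RPC tracker is running with a device registered under the key `env.TARGET` (the tracker host and port below are placeholders):

```python
from tvm import autotvm
import vta

env = vta.get_env()  # reads your vta_config.json; TARGET should be "sim" for the fast simulator

# target for the autotvm tasks: the VTA target, with the CPU side as target_host
target = env.target
target_host = env.target_host

# measurement: build locally, run through an RPC tracker that has a device
# registered under the key env.TARGET
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.RPCRunner(
        env.TARGET,        # device key registered on the tracker
        host="127.0.0.1",  # placeholder tracker host
        port=9190,         # placeholder tracker port
        number=5,
        timeout=60,
    ),
)
```

Depending on the TVM version, the tutorial also passes a VTA-specific build function or module loader to the builder/runner so the simulator is set up on the remote side, so it is worth copying those extra arguments from `vta/tutorials/autotvm/tune_relay_vta.py` in your checkout.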