Thanks @tqchen. After the steps below, I ran into a new problem: I don't know how to send the wasm binary via the RPC tracker. I passed
`session_constructor_args=["rpc.WasmSession", wasm_binary]` in `rpc.connect`.
1. launch rpc tracker
> python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9192
2. laun
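To make the question concrete, this is roughly what I am attempting. It is only a sketch, assuming the tracker path mirrors the proxy path: `"wasm"` is a placeholder for whatever key the browser/proxy registers with the tracker, the wasm path is a placeholder, and I am assuming `TrackerSession.request` forwards `session_constructor_args` the same way `rpc.connect` does.

```python
from tvm import rpc

# Sketch: load the wasm module (path is a placeholder) and request a session
# through the tracker, assuming request() accepts session_constructor_args.
wasm_binary = open("path/to/module.wasm", "rb").read()

tracker = rpc.connect_tracker("127.0.0.1", 9192)
remote = tracker.request(
    "wasm",  # placeholder device key
    priority=1,
    session_timeout=60,
    session_constructor_args=["rpc.WasmSession", wasm_binary],
)
```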
I'm not sure why opt_level=0, 1, and 2 produce different outputs. Maybe TensorRT
assumes an opt_level 3 pass is always on, so correctness without that pass
is not guaranteed, but this is just my guess.
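If it helps to narrow this down, one way to check that guess is to build the same module at two opt_levels and compare the outputs directly. This is only a sketch; `mod`, `params`, `input_name`, and `input_data` are placeholders for whatever is already in your script.

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def run(opt_level):
    # Build and run the same module at the given opt_level.
    with tvm.transform.PassContext(opt_level=opt_level):
        lib = relay.build(mod, target="cuda", params=params)
    m = graph_executor.GraphModule(lib["default"](tvm.gpu(0)))
    m.set_input(input_name, input_data)  # placeholders for your own input
    m.run()
    return m.get_output(0).asnumpy()

# If TensorRT really depends on an opt_level 3 pass, the mismatch should
# show up in this comparison.
np.testing.assert_allclose(run(0), run(3), rtol=1e-3, atol=1e-3)
```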
As for the performance, although again I'm not sure if this is the reason, you can
add this
Thanks for the suggestion, @comaniac. Adding a matmul operator with
implementations for every combination of input layouts seems like overkill to me.
Instead, adding a target-specific Relay pass to handle such a target-specific
case would be a better solution, which is lightweight and orthogonal to
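To illustrate the kind of pass I mean, here is a rough sketch. It reuses the documented `ConvertLayout` pass with `nn.conv2d` purely as an example, since the actual matmul/dense rewrite isn't written yet, and the target check is a placeholder.

```python
import tvm
from tvm import relay

def apply_target_specific_layout(mod, target):
    # Placeholder condition: only rewrite layouts for the target that needs it.
    if "cuda" not in str(target):
        return mod
    # Example op/layout mapping; the real pass would target matmul/dense.
    desired_layouts = {"nn.conv2d": ["NHWC", "HWIO"]}
    seq = tvm.transform.Sequential(
        [
            relay.transform.RemoveUnusedFunctions(),
            relay.transform.ConvertLayout(desired_layouts),
        ]
    )
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)
```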
For int8, there are intrinsics like DP4A that can accelerate it. Currently, the auto-scheduler doesn't support them.
Is your issue fixed? I was stuck on the same issue: with opt_level set to 0/1 it
runs well, but with opt_level >= 2, CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES shows up
when calling `module.run()`:
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
my gpu is Tesla V1
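In case it is useful for narrowing this down, I have been thinking of keeping opt_level=3 but disabling suspected passes one at a time with `disabled_pass`; this is only a sketch, and the pass names below are just examples, not a recommendation.

```python
import tvm
from tvm import relay

# Sketch: rebuild with individual optimizations disabled to find which one
# triggers CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES. Pass names are examples only.
for disabled in (["AlterOpLayout"], ["EliminateCommonSubexpr"]):
    with tvm.transform.PassContext(opt_level=3, disabled_pass=disabled):
        lib = relay.build(mod, target=target, params=params)
    # ...create the executor and call module.run() here to check whether the
    # error still appears with this pass disabled...
```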
Okay cool, then I was on the right track after all :smile:
Thanks for the quick clarification @comaniac !
Thanks for your replies!
I checked the result with the code below, and it seems the results are the same:
# tvm_trt_compare.py
...
mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
tgt = tvm.target.cuda()
ctx = tvm.gpu(0)
### Same input
data
I am a bit confused; maybe I misunderstood your suggestion.
I am using the debug executor to measure the latency of the individual (fused)
TIR functions, but I cannot tell which function corresponds to which part of the
original/optimized Relay graph.
(Example of a TIR function name: fused_layout_t
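To make it concrete, the closest I have found so far is printing the module after optimization so the fused functions can be lined up against the names the debug executor reports; a rough sketch, assuming `relay.optimize` applies the same pass pipeline as `relay.build`:

```python
import tvm
from tvm import relay

# Sketch: print the optimized Relay module so its fused functions can be
# compared against the per-function names reported by the debug executor.
with tvm.transform.PassContext(opt_level=3):
    opt_mod, _ = relay.optimize(mod, target=target, params=params)
print(opt_mod["main"])
```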
According to TLCBench, on CNNs like resnet50 and mobilenet, models tuned by the
auto-scheduler tend to be faster than those tuned by AutoTVM. With the same
settings, I tested tuning Yolo v3 on a Tesla T4, and the results are as follows.
| | AutoTVM | Auto-Scheduler (nchw) | Auto-Scheduler (nhwc) |
| --- | --- | --- | --- |
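For reference, the tuning flow follows the standard auto-scheduler Relay integration. The sketch below is condensed: the log file name and trial budget are placeholders, and `mod`/`params` stand in for the imported Yolo v3 model.

```python
import tvm
from tvm import auto_scheduler, relay

log_file = "yolov3_autoscheduler.json"  # placeholder name
target = tvm.target.cuda()

# Extract tuning tasks from the Relay module and tune them jointly.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(
    auto_scheduler.TuningOptions(
        num_measure_trials=20000,  # placeholder budget
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
)

# Compile with the tuned schedules applied.
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)
```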