[TVM Discuss] [Questions] Autotvm.task_extract_from_program in TFLite

2020-05-06 Thread Animesh Jain via TVM Discuss
Thanks for sharing. The failure occurs while calling tune_graph. Graph tuning assumes the data to be float32. Additionally, last time I tried, graph tuning can't work with QNN ops. One way to handle this is to call QnnCanonicalize (python/tvm/relay/qnn/transform.py) before calling graph tuning.
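
A minimal sketch of that workaround, assuming a module `mod` already parsed from a quantized TFLite model; `CanonicalizeOps` is the public wrapper exposed in python/tvm/relay/qnn/transform.py:

```python
from tvm import relay

# mod: a tvm.IRModule parsed from a quantized TFLite model (assumed to exist).
# CanonicalizeOps lowers QNN ops (qnn.conv2d, qnn.dense, ...) into plain Relay
# ops, so that graph tuning only sees standard operators afterwards.
mod = relay.qnn.transform.CanonicalizeOps()(mod)

# mod can now be passed to the usual AutoTVM graph-tuning flow (tune_graph).
```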

[TVM Discuss] [Questions] Autotvm.task_extract_from_program in TFLite

2020-05-06 Thread Animesh Jain via TVM Discuss
Hmm, this is weird. My script seems to work well. Is it possible for you to share the script? If not, does your run for the quantized model reach the point where relay_NHWC.txt is printed, or does it fail before that? --- [Visit Topic](https://discuss.tvm.ai/t/autotvm-task-extract-from-program-in-tflite/6578/15)

[TVM Discuss] [Questions] Autotvm.task_extract_from_program in TFLite

2020-05-05 Thread Animesh Jain via TVM Discuss
[quote="alopez_13, post:7, topic:6578"] This is part of the Relay code: ``` %0 = layout_transform(%input, src_layout="NHWC", dst_layout="NCHW"); %1 = layout_transform(%v_param_1, src_layout="HWIO", dst_layout="OIHW"); %2 = qnn.conv2d(%0, %1, 128, 122, 0.0078125f, 0.0339689f, strides=[2, 2]

[TVM Discuss] [Questions] Autotvm.task_extract_from_program in TFLite

2020-05-05 Thread Animesh Jain via TVM Discuss
Just to confirm, can you please double-check your script? We specify the input shape and dtype for the model while parsing (`from_tflite`). So, even though most of the AutoTVM script can be the same, there needs to be a small change while passing the input shape and dtype for the FP32 and quantized models.
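
For illustration, a hedged sketch of that parsing step; the input name `input`, the shape, and the `tflite_model` object are placeholders for whatever your model actually uses:

```python
from tvm import relay

# tflite_model: a tflite.Model parsed from the .tflite flatbuffer (assumed).
# The only difference between the FP32 and quantized flows is the dtype
# handed to the frontend; the rest of the AutoTVM script stays the same.

# FP32 model
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "float32"},
)

# Quantized model: TFLite pre-quantized models expect uint8 input
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "uint8"},
)
```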

[TVM Discuss] [Questions] Autotvm.task_extract_from_program in TFLite

2020-05-05 Thread Animesh Jain via TVM Discuss
IIUC, simple compilation (no auto-tuning) of both the FP32 and quantized models works. But auto-tuning + compilation fails for the quantized model (while the same script works for FP32), right? --- [Visit Topic](https://discuss.tvm.ai/t/autotvm-task-extract-from-program-in-tflite/6578/11)

[TVM Discuss] [Questions] Autotvm.task_extract_from_program in TFLite

2020-05-05 Thread Animesh Jain via TVM Discuss
Are you giving the right input dtypes to the model? TFLite quantized models need the `uint8` dtype. --- [Visit Topic](https://discuss.tvm.ai/t/autotvm-task-extract-from-program-in-tflite/6578/9)
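
As a quick sanity check, a sketch of feeding a `uint8` input at runtime; the module `graph_mod` and the input name `input` are assumptions:

```python
import numpy as np

# graph_mod: a graph runtime GraphModule built from the quantized model
# (assumed). Pre-quantized TFLite models take raw uint8 tensors, not float32.
data = np.random.randint(0, 256, size=(1, 224, 224, 3), dtype="uint8")
graph_mod.set_input("input", data)
graph_mod.run()
```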

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

2020-04-15 Thread Animesh Jain via TVM Discuss
> [[topi] add ARM v8.2 udot (uint8) support #3978](https://github.com/apache/incubator-tvm/pull/3978)

This works if you have a machine/device with ARM v8.2 and the DOT instruction. Rasp3b and 4b don't have it. --- [Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256)
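
For reference, a hedged sketch of the target string that enables those dot-product instructions on an ARM v8.2 CPU; the exact `-mattr` flags depend on your device and toolchain:

```python
from tvm import relay

# Target string for an AArch64 CPU with the v8.2 dot-product extension
# (+dotprod); relay.build accepts target strings directly. On CPUs without
# the DOT instruction (e.g. Rasp3b/4b), LLVM cannot emit udot.
target = "llvm -device=arm_cpu -mtriple=aarch64-linux-gnu -mattr=+v8.2a,+dotprod"

# mod, params: assumed to come from a frontend such as from_tflite.
graph, lib, params = relay.build(mod, target=target, params=params)
```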

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

2020-04-15 Thread Animesh Jain via TVM Discuss
I have mostly worked on pre-quantized models, so I can't comment on the performance of Relay-quantized models on ARM. There might be a few missing pieces there. I am planning to write a tutorial by next week on how to read pre-quantized models from TFLite. You can also try @masahi's tutorial.

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

2020-04-10 Thread Animesh Jain via TVM Discuss
It is very difficult to estimate. Different people code at a different pace. I can share my experience, but I am not sure if you should treat it seriously. My first task in TVM was to use Intel VNNI instructions in a conv2d schedule. This took me around a month. I am not sure how involved QNNPACK is.

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

2020-04-10 Thread Animesh Jain via TVM Discuss
QNNPACK is for ARM, whereas VNNI instructions are for Intel. So, not exactly that reason. But the underlying statement might still be the case: that we don't have good TVM schedules. Regarding schedules to get the same speedup as QNNPACK, we can write an assembly implementation inside a TVM schedule and use it via tensorize.

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

2020-04-09 Thread Animesh Jain via TVM Discuss
Yes, that seems plausible. Please note that one might also make the FP32 schedules better by working on low-level optimizations :) So, it is relative. --- [Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/36)

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

2020-04-09 Thread Animesh Jain via TVM Discuss
Yeah, the work by AliOS is not available yet. They worked a lot on very low-level optimizations. Over time, this work will hopefully be upstreamed. For now, on master, QNNPACK is faster. --- [Visit Topic](https://discuss.tvm.ai/t/is-there-any-speed-comparison-of-quantization-on-cpu/6256/3)

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

2020-04-09 Thread Animesh Jain via TVM Discuss
Yes, that's the selling point of TVM. The TVM community works together on these TVM schedules. As we get more people interested in quantization, we can add more TVM schedules, e.g., for the AVX2 machine you are talking about. We don't want to fully rely on FBGEMM or QNNPACK, because it might cause conflicts.

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

2020-04-09 Thread Animesh Jain via TVM Discuss
For rasp3 and rasp4, we saw a 1.3x-1.5x speedup going from FP32 to Int8. The work linked above comparing QNNPACK and TVM is not upstreamed yet. If I understand correctly, it will be some time before the authors of that work are able to upstream it. There are some differences in the underlying implementation.

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

2020-04-09 Thread Animesh Jain via TVM Discuss
@kindlehe TVM might not be optimized for the target 'llvm -mcpu=core-avx2'. I would suggest running it on CascadeLake; you would see a major benefit. For rasp4, if you are comparing FP32 vs Int8, yes, I have seen performance improvements. However, if you compare PyTorch (backed by QNNPACK) int8 vs TVM int8, QNNPACK is currently faster on master.
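
For illustration, a hedged sketch of the two target strings in question; `-mcpu=cascadelake` lets LLVM use the AVX-512 VNNI instructions that int8 schedules benefit from:

```python
from tvm import relay

# AVX2-only target: no VNNI, so int8 sees limited gains over FP32 here.
target_avx2 = "llvm -mcpu=core-avx2"

# Cascade Lake target: enables the AVX-512 VNNI int8 dot-product instructions.
target_clx = "llvm -mcpu=cascadelake"

# mod, params: assumed from a frontend; build once per target to compare.
graph, lib, params = relay.build(mod, target=target_clx, params=params)
```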

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

2020-04-07 Thread Animesh Jain via TVM Discuss
You are correct. I forgot about the PyTorch frontend for quantizing. This is true for MXNet as well. We can also make a tutorial for all frameworks. You can take care of PyTorch, I can take care of MXNet (similar to PyTorch) and TFLite (easy). It can be just one tutorial with different sections.

[TVM Discuss] [Questions] Is there any speed comparison of quantization on cpu

2020-04-07 Thread Animesh Jain via TVM Discuss
Thanks @kindlehe @masahi. Masa explained it correctly. For a long time, processors had higher FP32 throughput than Int8 throughput. So, it is not fair to assume that quantization will give you performance benefits on all machines. Check Intel VNNI, Nvidia DP4A and tensor cores, and ARM v8.2 DOT instructions.

[TVM Discuss] [Questions] Relay Op(fast_exp) can't be built

2020-03-22 Thread Animesh Jain via TVM Discuss
Thank you. Can you also register it for fast_tanh? Also, a better way to use the FastMath pass is as follows: https://github.com/apache/incubator-tvm/blob/a5d7bdab8771430be052c22d07ebe2df6b320be4/tests/python/relay/test_pass_fast_math.py#L32-L33
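
The linked test exercises the pass roughly like this; a hedged sketch, assuming a Relay module `mod` containing ops such as exp or tanh:

```python
from tvm import relay

# mod: a tvm.IRModule with exp/tanh calls (assumed). Requiring the FastMath
# pass during build rewrites them to fast_exp/fast_tanh approximations.
# (Newer TVM versions use tvm.transform.PassContext instead of build_config.)
with relay.build_config(opt_level=3, required_pass=["FastMath"]):
    graph, lib, params = relay.build(mod, target="llvm", params=None)
```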

[TVM Discuss] [Questions] Disabling LLVM Unrolling

2020-03-21 Thread Animesh Jain via TVM Discuss
Thanks for sharing your thoughts. Let me share some more background. To achieve high performance for compute-heavy ops (close to hand-written kernels like MKLDNN or ACL), we need to perform vector register tiling. This is one level lower than cache tiling. Here, we have to carefully craft the innermost loop nest so that values stay in vector registers.
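
A minimal sketch of what register-level tiling looks like in a TVM schedule; the shapes and tile factors are illustrative, not tuned:

```python
import tvm
from tvm import te

# Toy elementwise compute; real cases are conv2d/dense, but the scheduling
# primitives are the same.
M, N = 64, 64
A = te.placeholder((M, N), name="A")
B = te.compute((M, N), lambda i, j: A[i, j] * 2.0, name="B")

s = te.create_schedule(B.op)
# Cache tiling picks large factors; register tiling picks factors small
# enough that the inner tile lives entirely in vector registers.
io, jo, ii, ji = s[B].tile(B.op.axis[0], B.op.axis[1], x_factor=4, y_factor=8)
s[B].unroll(ii)       # unroll for register reuse
s[B].vectorize(ji)    # map the innermost axis to vector lanes

print(tvm.lower(s, [A, B], simple_mode=True))
```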

[TVM Discuss] [Questions] Disabling LLVM Unrolling

2020-03-20 Thread Animesh Jain via TVM Discuss
I have been working on TVM schedules for ARM. One thing that I notice is that LLVM has its own unrolling heuristics that can completely mess up the analysis that one does for unrolling in TVM. For example, a developer can choose to unroll a particular axis with the goal of better register reuse and utilization, and LLVM may then unroll further on its own.
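
One way to see whether LLVM re-unrolled a loop is to inspect the generated assembly; a hedged sketch on a toy schedule:

```python
import tvm
from tvm import te

# Explicitly unroll the inner axis in TVM, then inspect the assembly LLVM
# emitted, since its own unroller may rewrite the loop structure further.
n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=8)
s[B].unroll(xi)

f = tvm.build(s, [A, B], target="llvm")
print(f.get_source("asm"))  # compare the loop body against the TVM-level plan
```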