I dropped a `print` statement into the [default AVX x86 conv2d schedule](https://github.com/apache/tvm/blob/70884e957aa5c8de9c02c25a14d30563d7300cb9/python/tvm/topi/x86/conv2d_avx_common.py#L87), so I know this is the schedule being run.
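For reference, the check was nothing more than a marker print at the top of the schedule function at that line, roughly like this (a sketch only; the function name and signature below are illustrative, the real one is whatever sits at the linked line):

```python
# python/tvm/topi/x86/conv2d_avx_common.py -- sketch of the debug marker.
# The function name/signature here are illustrative; the print just confirms
# that this particular schedule is the one being dispatched.
def _schedule_conv_NCHWc(s, cfg, data_vec, kernel_vec, conv_out, last):
    print("[debug] x86 AVX common conv2d schedule selected")
    # ... unchanged schedule body ...
```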
To check whether there is an int16 fallback, I can look at the code generated at each stage. However, wouldn't int16 still be faster than float32, unless there is a large casting overhead? It doesn't look like an int16 fallback is happening; I explain how I checked below.

### After quantization, before compilation

This is the same regardless of the backend I use, since nothing has actually been compiled at this point. I get the following output from running `mod = quantize(mod, params, mode=mode); print(mod)`:

```
def @main(%data: Tensor[(1, 3, 64, 64), float32]) -> Tensor[(1, 16, 64, 64), float32] {
  %0 = nn.conv2d(%data, meta[relay.Constant][0] /* ty=Tensor[(32, 3, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=32, kernel_size=[3, 3]) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %1 = nn.relu(%0) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %2 = annotation.stop_fusion(%1) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %3 = multiply(%2, 16f /* ty=float32 */) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %4 = round(%3) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %5 = clip(%4, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %6 = cast(%5, dtype="int8") /* ty=Tensor[(1, 32, 64, 64), int8] */;
  %7 = nn.conv2d(%6, meta[relay.Constant][1] /* ty=Tensor[(16, 32, 3, 3), int8] */, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3], out_dtype="int32") /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %8 = nn.relu(%7) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %9 = add(%8, 1024 /* ty=int32 */) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %10 = right_shift(%9, 11 /* ty=int32 */) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %11 = clip(%10, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %12 = cast(%11, dtype="int8") /* ty=Tensor[(1, 16, 64, 64), int8] */;
  %13 = annotation.stop_fusion(%12) /* ty=Tensor[(1, 16, 64, 64), int8] */;
  %14 = cast(%13, dtype="float32") /* ty=Tensor[(1, 16, 64, 64), float32] */;
  multiply(%14, 0.0625f /* ty=float32 */) /* ty=Tensor[(1, 16, 64, 64), float32] */
}
```

### After compilation

Instead of creating a GraphModule, I compile using `relay.build`, i.e.:

```python
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, target_host=target)
```

#### Print graph

If I run `print(graph)`, I see that the dtypes look fine:

```
"attrs": {
  "dltype": [
    "list_str",
    [
      "float32",
      "float32",
      "float32",
      "float32",
      "uint8",
      "int8",
      "int32",
      "int8",
      "float32"
    ]
  ],
```

#### LLVM source

The only way I know to inspect the generated code directly is to dump the LLVM IR with `lib.get_source()`. The output is of course very verbose, and I see lots of i16 and i8 instructions.
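In case it helps anyone checking the same thing, this is roughly how I skim those two outputs without reading them by hand. It is a sketch that assumes the `graph` JSON string and `lib` returned by the `relay.build` call above; the i16 count is only a heuristic, since i16 can also appear as an ordinary widening type inside an int8 schedule rather than as evidence of a fallback.

```python
import json
import re
from collections import Counter

# (a) per-node storage dtypes recorded in the graph JSON
dltypes = json.loads(graph)["attrs"]["dltype"][1]
print("graph dtypes:", Counter(dltypes))

# (b) rough census of integer widths in the generated LLVM IR
llvm_ir = lib.get_source()
for ty in ("i8", "i16", "i32"):
    print(ty, len(re.findall(rf"\b{ty}\b", llvm_ir)))
```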