I dropped a `print` statement into the [default AVX x86 conv2d 
schedule](https://github.com/apache/tvm/blob/70884e957aa5c8de9c02c25a14d30563d7300cb9/python/tvm/topi/x86/conv2d_avx_common.py#L87),
 so I know this is the schedule being run.


To check whether there is an int16 fallback, I can look at the code generated 
at each stage.  However, wouldn't int16 still be faster than float32, unless 
there is a big casting overhead?

It doesn't look like an int16 fallback is happening; I explain how I checked 
below:

### After quantization, before compilation

This is the same regardless of the backend I use, since we haven't actually 
compiled at this point.

I get the following output from running `mod = quantize(mod, params, mode=mode); 
print(mod)` (a sketch of the `quantize` helper itself is included after the dump).

```
def @main(%data: Tensor[(1, 3, 64, 64), float32]) -> Tensor[(1, 16, 64, 64), 
float32] {
  %0 = nn.conv2d(%data, meta[relay.Constant][0] /* ty=Tensor[(32, 3, 3, 3), 
float32] */, padding=[1, 1, 1, 1], channels=32, kernel_size=[3, 3]) /* 
ty=Tensor[(1, 32, 64, 64), float32] */;
  %1 = nn.relu(%0) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %2 = annotation.stop_fusion(%1) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %3 = multiply(%2, 16f /* ty=float32 */) /* ty=Tensor[(1, 32, 64, 64), 
float32] */;
  %4 = round(%3) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %5 = clip(%4, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 32, 64, 64), float32] 
*/;
  %6 = cast(%5, dtype="int8") /* ty=Tensor[(1, 32, 64, 64), int8] */;
  %7 = nn.conv2d(%6, meta[relay.Constant][1] /* ty=Tensor[(16, 32, 3, 3), int8] 
*/, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3], out_dtype="int32") 
/* ty=Tensor[(1, 16, 64, 64), int32] */;
  %8 = nn.relu(%7) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %9 = add(%8, 1024 /* ty=int32 */) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %10 = right_shift(%9, 11 /* ty=int32 */) /* ty=Tensor[(1, 16, 64, 64), int32] 
*/;
  %11 = clip(%10, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 16, 64, 64), int32] 
*/;
  %12 = cast(%11, dtype="int8") /* ty=Tensor[(1, 16, 64, 64), int8] */;
  %13 = annotation.stop_fusion(%12) /* ty=Tensor[(1, 16, 64, 64), int8] */;
  %14 = cast(%13, dtype="float32") /* ty=Tensor[(1, 16, 64, 64), float32] */;
  multiply(%14, 0.0625f /* ty=float32 */) /* ty=Tensor[(1, 16, 64, 64), 
float32] */
}
```
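
For context, `quantize` here is a small wrapper of my own around 
`tvm.relay.quantize`, not a TVM API. A minimal sketch of what it does, assuming 
`mode` is simply forwarded as the calibration mode to `relay.quantize.qconfig`:

```python
from tvm import relay

def quantize(mod, params, mode="global_scale"):
    # Hypothetical helper: `mode` selects the calibration mode for qconfig
    # ("global_scale" or "kl_divergence"; the latter also needs a dataset).
    # Everything else is left at its defaults, e.g. skip_conv_layers=[0],
    # which is why the first conv2d in the dump above stays float32, and
    # global_scale=8.0 (consistent with the 16f / 0.0625f factors above).
    with relay.quantize.qconfig(calibrate_mode=mode, global_scale=8.0):
        return relay.quantize.quantize(mod, params=params)
```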


### After compilation

Instead of creating a GraphModule, I compile using `relay.build`, i.e.:

```python
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, target_host=target)

```
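
For completeness, a sketch of how the built artifacts can then be run and timed 
(on older TVM the module is `tvm.contrib.graph_runtime` instead of 
`graph_executor`, and `benchmark` can be replaced by a `time_evaluator` call):

```python
import numpy as np
import tvm
from tvm.contrib import graph_executor

dev = tvm.cpu(0)
module = graph_executor.create(graph, lib, dev)  # `graph` is the JSON string from relay.build
module.set_input(**params)
module.set_input("data", np.random.rand(1, 3, 64, 64).astype("float32"))

# `benchmark` is only available on newer TVM; otherwise:
# print(module.module.time_evaluator("run", dev, number=100)())
print(module.benchmark(dev, repeat=10))
```
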
#### Print graph

If I run `print(graph)`, I see that the dtypes look fine:

```
  "attrs": {
    "dltype": [
      "list_str",
      [
        "float32",
        "float32",
        "float32",
        "float32",
        "uint8",
        "int8",
        "int32",
        "int8",
        "float32"
      ]
    ],

```
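
Since `graph` is just a JSON string, the per-node dtypes can also be pulled out 
programmatically instead of eyeballing the dump (a small sketch based on the 
structure above; note the dtype list is indexed per output tensor, not strictly 
per node):

```python
import json

g = json.loads(graph)
dtypes = g["attrs"]["dltype"][1]          # second element holds the actual list
names = [node["name"] for node in g["nodes"]]
for name, dtype in zip(names, dtypes):
    print(f"{name:40s} {dtype}")
```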

#### LLVM source

The only way I know to look at the generated code directly is by dumping the 
LLVM IR with `lib.get_source()`.  This is of course very verbose, and I see 
lots of i16 and i8 instructions.
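
To make that a bit more systematic than scrolling, the IR string can be 
searched and counted; a sketch (I believe `get_source("asm")` also works for 
the LLVM target if you want the final assembly):

```python
import re
from collections import Counter

llvm_ir = lib.get_source()  # LLVM IR as one big string
# Rough heuristic: how often each integer width shows up as a type.
print(Counter(re.findall(r"\bi(?:8|16|32|64)\b", llvm_ir)))

# Same idea on the assembly, e.g. to check whether vpmaddubsw / vpmaddwd
# (the AVX2 int8 dot-product instructions) are actually being emitted.
asm = lib.get_source("asm")
print(Counter(re.findall(r"\bvpmadd\w+\b", asm)))
```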




