## Motivation

Although TVM provides a quantization flow for pre-quantized models, we find 
that some developers prefer to use their own quantization flow for their 
accelerators, since they may have specialized calibration and quantization 
flows other than TVM QNN. However, the current BYOC flow offers limited 
support for this scenario. One workaround requires two compilation passes: in 
the first pass, we partition the graph and run it through the graph runtime 
to collect the calibration data; in the second pass, the calibration results 
are used along with the BYOC flow to generate the final quantized code for 
the accelerator.

## Proposal

In this RFC, we propose a clean and easy-to-use interface for developers to 
collect calibration data to feed into their own calibration and quantization 
flows. With this interface, they can obtain the calibration data along with 
the subgraph information needed for final code generation through a single 
API call.

### Programming Model

```python
from tvm import relay
from tvm.relay import analysis, testing, transform

mod, params = relay.testing.mobilenet.get_workload(...)

# passes for generating partitioned graphs
mod = transform.AnnotateTarget(["dnnl"])(mod)
mod = transform.MergeCompilerRegions()(mod)
mod = transform.PartitionGraph()(mod)

# proposed calibration flow and API
i_data = ... # the input data to be calibrated
calib_data = analysis.calibrate_partition_graph(mod, i_data, params)

# pass the calibration data to the external codegen and build the program
with transform.PassContext(opt_level=3, config={'calib_data': calib_data}):
    relay.build(mod, ...)
```

We propose a new analysis API ``calibrate_partition_graph`` (better name 
suggestions are welcome) that takes three inputs: the partitioned module, the 
input data to be calibrated, and the parameters. It returns the calibration 
data, which is a mapping from each subgraph name to all of its input and 
output values. Below we show a synthetic example.

The Relay graph after partitioning:

```text
def @dnnl0(%dnnl0_i0: Tensor[(3, 3), float32],
           %dnnl0_i1: Tensor[(3, 3), float32]) -> Tensor[(3, 3), float32] {
  add(%dnnl0_i0, %dnnl0_i1)
}

def @dnnl1(%dnnl1_i0: Tensor[(3, 3), float32],
           %dnnl1_i1: Tensor[(3, 3), float32]) -> Tensor[(3, 3), float32] {
  sub(%dnnl1_i0, %dnnl1_i1)
}

def @main(%data0: Tensor[(3, 3), float32],
          %data1: Tensor[(3, 3), float32],
          %data2: Tensor[(3, 3), float32]) -> Tensor[(3, 3), float32] {
  %0 = @dnnl0(%data0, %data1);
  @dnnl1(%0, %data2)
}
```

Then this will be the calibration data we get:

```
{“main”: {“inputs”: [**data0**, **data1**, **data2**], 
          “outputs”: [**output**]},
 “dnnl0”: {“inputs”: [**data0**, **data1**],
           “outputs”: [**%0**]}
 “dnnl1”: {“intputs”: [**%0**, **data2**],
           “outputs”: [**output**]}}
```

Note that if we have multiple sets of data to be calibrated, the final result 
will be a list of such mappings, one per set. Finally, to use the calibration 
data during code generation, we send it to the ``PassContext``, as sketched 
below.
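
To make the consumption side concrete, here is a minimal sketch of how an 
external codegen might read the calibration data back from the current 
``PassContext`` and derive a quantization scale per subgraph. The helper 
``quantize_subgraph`` and the symmetric min/max scale are purely illustrative 
assumptions, not part of this proposal; only the ``calib_data`` config key 
comes from the code above.

```python
import numpy as np
import tvm

# Hypothetical consumer inside an external codegen (illustrative only).
def quantize_subgraph(subgraph_name):
    # Read back the data registered via PassContext(config={'calib_data': ...}).
    ctx = tvm.transform.PassContext.current()
    calib_data = ctx.config["calib_data"]  # a list of mappings, one per data set

    # Gather every observed input/output tensor for this subgraph.
    tensors = []
    for record in calib_data:
        tensors.extend(record[subgraph_name]["inputs"])
        tensors.extend(record[subgraph_name]["outputs"])

    # Toy symmetric min/max calibration: a single int8 scale per subgraph.
    max_abs = max(float(np.abs(t).max()) for t in tensors)
    return max_abs / 127.0
```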

## Implementation Details

We implement two passes to get the calibration results. The first pass 
removes all backend-specific attributes and marks every intermediate tensor 
as a final output; we then use the graph runtime to obtain the tensor values. 
The second pass builds the mapping between each subgraph name and its tensor 
values. Finally, some post-processing produces the calibration data in the 
format shown above.
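
For reference, the overall driver could look like the sketch below. The two 
helpers are hypothetical stand-ins for the passes described above (the actual 
implementation lives in the POC branch), and we assume ``i_data`` is a dict 
mapping input names to arrays.

```python
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

def calibrate_partition_graph(mod, i_data, params):
    # Pass 1 (hypothetical helper): strip backend-specific attributes and
    # rewrite the module so every subgraph input/output is also returned
    # from @main, remembering which output index belongs to which subgraph.
    calib_mod, output_map = get_calibrate_module(mod)

    # Run the instrumented module with the graph runtime to record values.
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(calib_mod, target="llvm", params=params)
    rt = graph_runtime.GraphModule(lib["default"](tvm.cpu()))
    for name, value in i_data.items():
        rt.set_input(name, value)
    rt.run()
    values = [rt.get_output(i).asnumpy() for i in range(rt.get_num_outputs())]

    # Pass 2 (hypothetical helper): regroup the flat list of recorded
    # tensors into the per-subgraph inputs/outputs mapping shown above.
    return group_by_subgraph(values, output_map)
```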

The POC branch is available 
[here](https://github.com/seanlatias/incubator-tvm/tree/calibrate).

cc @zhiics, @comaniac, @masahi, @matt-arm, @tqchen




