Overview
--------

When a user wants to use AutoTVM to tune a model, she often lets AutoTVM tune every task extracted from the model sequentially. Assuming each task requires an hour or so, tuning a model with 10 to 100+ tasks takes days. This RFC proposes a lightweight solution that reduces the tuning time for a model while still achieving decent model performance.

We achieve this goal with a selective tuning mechanism for AutoTVM, which is mainly composed of two parts: task selection and schedule sharing.

Task Selection
--------------

Task selection picks a few representative tasks from all tasks in a model. The idea behind it is that if a schedule works well for a conv2d on a certain platform, it may also work well for another conv2d on the same platform. This implies that we can identify a few representative tasks and apply their top schedules to the other tasks, so the tuning time of the other tasks can be saved.

The task selection algorithm is based on the similarity rate (SM), defined as the ratio of tuning space overlap between two tasks. By computing the similarity rate between every pair of tasks, we create a pairwise similarity matrix (PSM). With the PSM, we build a graph with tasks as nodes and SMs as edge weights, and find all maximal cliques in the graph to cluster the tasks. For each cluster, we select the task with the highest weight sum to all other tasks in the same cluster as the representative.
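To make the clustering step concrete, here is a minimal sketch (not the actual implementation) of maximal-clique clustering with `networkx`, which the implementation already depends on (see the notes at the end). The precomputed PSM `psm` and the edge cutoff `sim_threshold` are illustrative assumptions.

```python
# A minimal sketch of the clustering step, assuming `psm` is a precomputed
# N x N pairwise similarity matrix and `sim_threshold` is an illustrative
# cutoff for adding an edge between two tasks.
import itertools
import networkx as nx

def cluster_tasks(psm, sim_threshold=0.5):
    """Group task indices by finding maximal cliques on the similarity graph."""
    n_tasks = len(psm)
    graph = nx.Graph()
    graph.add_nodes_from(range(n_tasks))
    for i, j in itertools.combinations(range(n_tasks), 2):
        if psm[i][j] >= sim_threshold:
            graph.add_edge(i, j, weight=psm[i][j])

    clusters = []
    for clique in nx.find_cliques(graph):  # enumerates maximal cliques
        # The representative is the task with the highest similarity sum
        # to all other tasks in the same clique.
        rep = max(clique, key=lambda t: sum(psm[t][o] for o in clique if o != t))
        clusters.append((rep, sorted(clique)))
    return clusters

# For example, tasks 0 and 1 below overlap heavily and form one cluster,
# while task 2 stays on its own.
psm = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
print(cluster_tasks(psm))  # e.g. [(0, [0, 1]), (2, [2])]
```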
The API for task selection is straightforward. A user only needs to call `mark_depend` after the tasks have been created:

```python
tasks = autotvm.task.extract_from_program(...)
autotvm.task.mark_depend(tasks)
```

After the call, each unselected task has an attribute `depend` referring to its selected (representative) task.

Schedule Sharing
----------------

We follow the current AutoTVM process to tune the representative tasks first, and then tune the other tasks. When tuning the other tasks, a user can specify how many schedules should be shared from the dependent task. For example, `top3` means sharing the best 3 schedules; `5%` means sharing the best 5% of schedules. An example of tuning tasks when some of them are selected is shown below:

```python
sel_tasks = [t for t in tasks if t.depend == t]
other_tasks = [t for t in tasks if t.depend != t]

# The first pass tunes the representative tasks.
for i, tsk in enumerate(reversed(sel_tasks)):
    prefix = "[Sel Task %2d/%2d] " % (i + 1, len(sel_tasks))
    tuner_obj = create_tuner(tuner)

    # Do tuning.
    curr_trial = min(n_trial, len(tsk.config_space))
    tuner_obj.tune(n_trial=curr_trial,
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(curr_trial, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])

# The second pass tunes the rest of the tasks.
for i, tsk in enumerate(reversed(other_tasks)):
    prefix = "[Other Task %2d/%2d] " % (i + 1, len(other_tasks))
    tuner_obj = create_tuner(tuner)

    # Do tuning over the shared schedules only.
    curr_trial = n_trial
    tuner_obj.tune(n_trial=curr_trial,
                   depend_mode='top10',  # Share the 10 best schedules.
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(10, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])
```

In the above example, the second loop tunes the unselected tasks. Since the tuned schedules are cached in the selected tasks, the tuner uses those schedules as its tuning space, whose size is 10 in this example.

This mechanism has two important advantages:

1. It is fully compatible with the current AutoTVM use case. Existing AutoTVM scripts keep working without any change.
2. Even if an unselected task fails to achieve decent performance with the shared schedules, users can still re-tune it with the normal AutoTVM process. This costs little, because the time spent on the shared schedules is only minutes.
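As a usage note, the records produced by both passes end up in the same log file, so they can be applied at compile time through the usual AutoTVM workflow. The sketch below assumes `mod`, `params`, and `target` come from the surrounding tuning script.

```python
from tvm import autotvm, relay

# Pick the best record for each task from the log and build the model with it.
with autotvm.apply_history_best(log_file):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, params=params)
```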
Results
-------

Here are the experimental results of using selective tuning for a set of models on an EC2 P3 instance. We evaluated 7 models from the Gluon CV model zoo and tuned the selected tasks for 2,000 trials each. Selective tuning achieves **113% of the baseline performance on average while using only 28% of the tuning time**. We suspect that it can outperform the original AutoTVM because the generated tuning space is not general enough to cover better schedules with non-factor tile sizes.

| Model | Time w/o Sel. (mins) | Perf. w/o Sel. (ms) | Time w/ Sel. (mins) | Perf. w/ Sel. (ms) | Time Used | Perf. Achieved |
|:-----:|:--------------------:|:-------------------:|:-------------------:|:-------------------:|:---------:|:--------------:|
| MobileNet V2 1.0 | 1185 | 0.74 | 404 | 0.78 | 34% | 95% |
| ResNet 50 V1 | 833 | 2.69 | 179 | 3.7 | 21% | 73% |
| VGG 19 BN | 479 | 5.08 | 169 | 6.36 | 35% | 80% |
| SqueezeNet 1.1 | 574 | 0.54 | 167 | 0.5 | 29% | 108% |
| DenseNet 121 | 2670 | 2.99 | 377 | 3.02 | 14% | 99% |
| Yolo3 MobileNet1.0 voc | 1392 | 5.4 | 387 | 3.63 | 28% | 149% |
| SSD512 ResNet50 V1 voc | 1713 | 10.67 | 575 | 5.65 | 34% | 189% |

Note
----

1. The task selection implementation uses the Python graph package networkx, which introduces a new dependency to AutoTVM.
2. This mechanism works well on GPU but not on CPU, because the NCHWc layout limits the choices of C.

Tasks
-----

- [x] Implementation #4187
- [ ] Tutorial

cc @kevinthesun @icemelon9 @tqchen