Overview
---------
When a user wants to use AutoTVM to tune a model, she often lets AutoTVM tune
every task extracted from the model sequentially. Assuming each task requires
an hour or so, tuning a model with 10 to 100+ tasks takes days. This RFC
proposes a lightweight solution that reduces the tuning time for a model while
still achieving decent model performance.
We achieve this goal by proposing a selective tuning mechanism for AutoTVM. It
is mainly composed of two parts: task selection and schedule sharing.
Task Selection
--------------
Task selection picks a few representative tasks from all tasks in a model.
The idea behind it is that if a schedule works well for one conv2d on a certain
platform, it may also work well for another conv2d on the same platform. This
implies that we can identify a few representative tasks and apply their top
schedules to the other tasks, saving the tuning time of those tasks.
The task selection algorithm is based on the similarity rate (SM), defined as
the ratio of tuning space overlap between two tasks. By computing the
similarity rate between every pair of tasks, we create a pairwise similarity
matrix (PSM). With the PSM, we build a graph with tasks as nodes and SM values
as edge weights, and then find all maximum cliques in the graph to cluster the
tasks. For each cluster, we select the task with the highest weight sum to all
other tasks in the same cluster as the representative.
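For illustration, below is a minimal sketch of such a clustering with networkx
(the graph package the implementation depends on; see the notes at the end).
The `similarity_rate` helper and the `threshold` cutoff are assumptions made
for this sketch and not necessarily the actual code in #4187:
```python
import itertools
import networkx as nx

def cluster_and_select(tasks, similarity_rate, threshold=0.5):
    """Sketch only: cluster tasks by similarity rate and pick one
    representative per cluster. `similarity_rate(t1, t2)` is assumed to
    return the tuning-space overlap ratio in [0, 1]."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(tasks)))
    # Pairwise similarity matrix (PSM), stored as weighted edges.
    for i, j in itertools.combinations(range(len(tasks)), 2):
        sm = similarity_rate(tasks[i], tasks[j])
        if sm >= threshold:
            graph.add_edge(i, j, weight=sm)

    representatives = {}
    # Cluster tasks by (maximal) cliques; in each clique, pick the task with
    # the highest weight sum to the other tasks in the same clique.
    for clique in nx.find_cliques(graph):
        def weight_sum(node):
            return sum(graph[node][other]["weight"]
                       for other in clique if other != node)
        rep = max(clique, key=weight_sum)
        for node in clique:
            representatives.setdefault(node, rep)

    # Mimic `mark_depend`: each task points to its representative.
    for node, rep in representatives.items():
        tasks[node].depend = tasks[rep]
    return [tasks[rep] for rep in set(representatives.values())]
```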
The API for task selection is straightforward: a user only needs to call
`mark_depend` after the tasks have been created:
```python
tasks = autotvm.task.extract_from_program(...)
autotvm.task.mark_depend(tasks)
```
After the call, each unselected task will have an attribute `depend` referring
to its selected representative task, while a selected task refers to itself.
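For example, one can check how many tasks were selected (a selected task's
`depend` attribute points to itself):
```python
selected = [t for t in tasks if t.depend == t]
print("Selected %d representative tasks out of %d" % (len(selected), len(tasks)))
```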
Schedule Sharing
------------------
We follow the current AutoTVM process to tune the representative tasks first,
and then tune the other tasks. When tuning the other tasks, a user can specify
how many schedules should be shared from the task they depend on. For example,
`top3` means sharing the best 3 schedules; `5%` means sharing the best 5% of
schedules.
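To make the sharing modes concrete, here is a hedged sketch (not the RFC's
actual implementation) of how `topN` / `P%` could be interpreted: collect the
dependent task's records from the tuning log, sort them by measured cost, and
keep the best N or the best P%:
```python
import numpy as np
from tvm import autotvm

def shared_configs(log_file, dep_task, mode="top3"):
    """Sketch only: collect the best configs of `dep_task` from `log_file`
    according to the sharing mode ('topN' or 'P%')."""
    records = [(inp, res)
               for inp, res in autotvm.record.load_from_file(log_file)
               if inp.task.workload == dep_task.workload and res.error_no == 0]
    # Sort by mean measured cost (fastest first).
    records.sort(key=lambda rec: np.mean(rec[1].costs))
    if mode.startswith("top"):
        k = int(mode[len("top"):])             # e.g. 'top3' -> best 3 schedules
    else:
        ratio = float(mode.rstrip("%")) / 100.0
        k = max(1, int(len(records) * ratio))  # e.g. '5%' -> best 5% of schedules
    return [inp.config for inp, _ in records[:k]]
```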
An example of tuning tasks after selection is shown below:
```python
sel_tasks = [t for t in tasks if t.depend == t]
other_tasks = [t for t in tasks if t.depend != t]

# The first pass tunes the representative tasks as usual.
for i, tsk in enumerate(reversed(sel_tasks)):
    prefix = "[Sel Task %2d/%2d] " % (i + 1, len(sel_tasks))
    tuner_obj = create_tuner(tsk, tuner)  # helper that creates e.g. XGBTuner(tsk)
    # do tuning
    curr_trial = min(n_trial, len(tsk.config_space))
    tuner_obj.tune(n_trial=curr_trial,
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(curr_trial, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])

# The second pass tunes the rest of the tasks with shared schedules.
for i, tsk in enumerate(reversed(other_tasks)):
    prefix = "[Other Task %2d/%2d] " % (i + 1, len(other_tasks))
    tuner_obj = create_tuner(tsk, tuner)
    # do tuning
    curr_trial = n_trial
    tuner_obj.tune(n_trial=curr_trial,
                   depend_mode='top10',  # share the 10 best schedules
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       # only the 10 shared schedules are tried
                       autotvm.callback.progress_bar(10, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])
```
In the above example, the second loop tunes the unselected tasks. Since the
tuned schedules are cached in the selected tasks, the tuner uses those
schedules as the tuning space, whose size is 10 in this example.
This mechanism has two important advantages:
1. It is fully compatible with the current AutoTVM use case. Existing AutoTVM
scripts still work without any issue.
2. Even if an unselected task fails to achieve decent performance with the
shared schedules, users can still re-tune it with the normal AutoTVM process
(see the sketch after this list). This costs little because the time spent on
the shared schedules is just minutes.
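For instance, a task that did not reach the desired performance with shared
schedules can simply go through the normal flow again. The sketch below reuses
the hypothetical `create_tuner` helper from the example above:
```python
# Fallback: re-tune one unselected task over its full config space,
# exactly as in the normal AutoTVM flow (no depend_mode).
tsk = other_tasks[0]  # placeholder: the task that needs re-tuning
tuner_obj = create_tuner(tsk, tuner)
tuner_obj.tune(n_trial=min(n_trial, len(tsk.config_space)),
               early_stopping=early_stopping,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file(log_file)])
```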
Results
-------
Here are the experimental results of using selective tuning for a set of models
on an EC2 P3 instance. We evaluated 7 models from the Gluon CV model zoo; the
selected tasks were tuned for 2,000 trials each. The results are shown below.
Selective tuning achieves **on average 113% of the original performance while
using only 28% of the tuning time**. We believe it can even outperform the
original AutoTVM because the generated tuning space is not general enough to
cover better schedules with non-factor tile sizes.
| Model | Time w/o Sel. (mins) | Perf. w/o Sel. (ms) | Time w/ Sel. (mins) | Perf. w/ Sel. (ms) | Used Time | Achieved Perf. |
|:-----:|:--------------------:|:-------------------:|:-------------------:|:------------------:|:---------:|:--------------:|
| MobileNet V2 1.0 | 1185 | 0.74 | 404 | 0.78 | 34% | 95% |
| ResNet 50 V1 | 833 | 2.69 | 179 | 3.7 | 21% | 73% |
| VGG 19 BN | 479 | 5.08 | 169 | 6.36 | 35% | 80% |
| SqueezeNet 1.1 | 574 | 0.54 | 167 | 0.5 | 29% | 108% |
| DenseNet 121 | 2670 | 2.99 | 377 | 3.02 | 14% | 99% |
| Yolo3 MobileNet1.0 voc | 1392 | 5.4 | 387 | 3.63 | 28% | 149% |
| SSD512 ResNet50 V1 voc | 1713 | 10.67 | 575 | 5.65 | 34% | 189% |
Note
----
1. The task selection implementation uses the Python graph package networkx,
which introduces a new dependency to AutoTVM.
2. This mechanism works well on GPU but not on CPU, because the NCHWc layout
limits the choices of C.
Tasks
------
- [x] Implementation #4187
- [ ] Tutorial
cc @kevinthesun @icemelon9 @tqchen