Overview
--------

When a user wants to use AutoTVM to tune a model, she often lets AutoTVM tune every task extracted from the model sequentially. Assuming each task requires an hour or so, tuning a model with 10 to 100+ tasks takes days. This RFC proposes a lightweight solution that reduces the tuning time for a model while still achieving decent model performance.

We achieve this goal with a selective tuning mechanism for AutoTVM, which is mainly composed of two parts: task selection and schedule sharing.

Task Selection
--------------

Task selection picks a few representative tasks from all tasks in a model. The idea behind it is that if a schedule works well for a conv2d on a certain platform, it may also work well for another conv2d on the same platform. This implies that we can identify a few representative tasks and apply their top schedules to the other tasks, so the tuning time of the other tasks can be saved.

The task selection algorithm is based on the similarity rate (SM), defined as the ratio of tuning space overlap between two tasks. By computing the similarity rate between every pair of tasks, we create a pairwise similarity matrix (PSM). With the PSM, we build a graph with tasks as nodes and SMs as edge weights, and find all maximal cliques in the graph to cluster the tasks. For each cluster, we select the task with the highest weight sum to all other tasks in the same cluster as the representative.
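To make the clustering step concrete, here is a minimal sketch (not the actual implementation) of maximal-clique clustering with `networkx`, which the implementation already depends on (see the notes at the end). The precomputed PSM `psm` and the edge cutoff `sim_threshold` are illustrative assumptions.

```python
# A minimal sketch of the clustering step, assuming `psm` is a precomputed
# N x N pairwise similarity matrix and `sim_threshold` is an illustrative
# cutoff for adding an edge between two tasks.
import itertools
import networkx as nx

def cluster_tasks(psm, sim_threshold=0.5):
    """Group task indices by finding maximal cliques on the similarity graph."""
    n_tasks = len(psm)
    graph = nx.Graph()
    graph.add_nodes_from(range(n_tasks))
    for i, j in itertools.combinations(range(n_tasks), 2):
        if psm[i][j] >= sim_threshold:
            graph.add_edge(i, j, weight=psm[i][j])

    clusters = []
    for clique in nx.find_cliques(graph):  # enumerates maximal cliques
        # The representative is the task with the highest similarity sum
        # to all other tasks in the same clique.
        rep = max(clique, key=lambda t: sum(psm[t][o] for o in clique if o != t))
        clusters.append((rep, sorted(clique)))
    return clusters

# For example, tasks 0 and 1 below overlap heavily and form one cluster,
# while task 2 stays on its own.
psm = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
print(cluster_tasks(psm))  # e.g. [(0, [0, 1]), (2, [2])]
```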
The API for task selection is straightforward. A user only needs to call `mark_depend` after the tasks have been created:

```python
tasks = autotvm.task.extract_from_program(...)
autotvm.task.mark_depend(tasks)
```

After the call, each unselected task has an attribute `depend` referring to its selected (representative) task.

Schedule Sharing
----------------

We follow the current AutoTVM process to tune the representative tasks first, and then tune the other tasks. When tuning the other tasks, a user can specify how many schedules should be shared from the dependent task. For example, `top3` means sharing the best 3 schedules; `5%` means sharing the best 5% of schedules. An example of tuning tasks when some of them are selected is shown below:

```python
sel_tasks = [t for t in tasks if t.depend == t]
other_tasks = [t for t in tasks if t.depend != t]

# The first pass tunes the representative tasks.
for i, tsk in enumerate(reversed(sel_tasks)):
    prefix = "[Sel Task %2d/%2d] " % (i + 1, len(sel_tasks))
    tuner_obj = create_tuner(tuner)

    # Do tuning.
    curr_trial = min(n_trial, len(tsk.config_space))
    tuner_obj.tune(n_trial=curr_trial,
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(curr_trial, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])

# The second pass tunes the rest of the tasks.
for i, tsk in enumerate(reversed(other_tasks)):
    prefix = "[Other Task %2d/%2d] " % (i + 1, len(other_tasks))
    tuner_obj = create_tuner(tuner)

    # Do tuning over the shared schedules only.
    curr_trial = n_trial
    tuner_obj.tune(n_trial=curr_trial,
                   depend_mode='top10',  # Share the 10 best schedules.
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(10, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])
```

In the above example, the second loop tunes the unselected tasks. Since the tuned schedules are cached in the selected tasks, the tuner uses those schedules as its tuning space, whose size is 10 in this example.

This mechanism has two important advantages:

1. It is fully compatible with the current AutoTVM use case. Existing AutoTVM scripts keep working without any change.
2. Even if an unselected task fails to achieve decent performance with the shared schedules, users can still re-tune it with the normal AutoTVM process. This costs little, because the time spent on the shared schedules is only minutes.
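As a usage note, the records produced by both passes end up in the same log file, so they can be applied at compile time through the usual AutoTVM workflow. The sketch below assumes `mod`, `params`, and `target` come from the surrounding tuning script.

```python
from tvm import autotvm, relay

# Pick the best record for each task from the log and build the model with it.
with autotvm.apply_history_best(log_file):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, params=params)
```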
Results
-------

Here are the experimental results of using selective tuning for a set of models on an EC2 P3 instance. We evaluated 7 models from the Gluon CV model zoo and tuned the selected tasks for 2,000 trials each. Selective tuning achieves **113% of the baseline performance on average while using only 28% of the tuning time**. We suspect that it can outperform the original AutoTVM because the generated tuning space is not general enough to cover better schedules with non-factor tile sizes.

| Model | Time w/o Sel. (mins) | Perf. w/o Sel. (ms) | Time w/ Sel. (mins) | Perf. w/ Sel. (ms) | Time Used | Perf. Achieved |
|:-----:|:--------------------:|:-------------------:|:-------------------:|:-------------------:|:---------:|:--------------:|
| MobileNet V2 1.0 | 1185 | 0.74 | 404 | 0.78 | 34% | 95% |
| ResNet 50 V1 | 833 | 2.69 | 179 | 3.7 | 21% | 73% |
| VGG 19 BN | 479 | 5.08 | 169 | 6.36 | 35% | 80% |
| SqueezeNet 1.1 | 574 | 0.54 | 167 | 0.5 | 29% | 108% |
| DenseNet 121 | 2670 | 2.99 | 377 | 3.02 | 14% | 99% |
| Yolo3 MobileNet1.0 voc | 1392 | 5.4 | 387 | 3.63 | 28% | 149% |
| SSD512 ResNet50 V1 voc | 1713 | 10.67 | 575 | 5.65 | 34% | 189% |

Note
----

1. The task selection implementation uses the Python graph package networkx, which introduces a new dependency to AutoTVM.
2. This mechanism works well on GPU but not on CPU, because the NCHWc layout limits the choices of C.

Tasks
-----

- [x] Implementation #4187
- [ ] Tutorial

cc @kevinthesun @icemelon9 @tqchen