## Background

This doc aims to discuss how we should do operator scheduling (and optionally 
tuning) following the [TVM Core strategy for emerging 
needs](https://discuss.tvm.apache.org/t/discuss-tvm-core-strategy-for-emerging-needs/15751).
 Operator scheduling refers to the phase where we take a high-level 
description of an operator and transform it into more performant low-level 
code.

Over the course of development, we have built several components:

- TE-Compute: the ability to describe high-level computations via tensor 
expressions (a minimal sketch follows this list).
- TE-Schedule: schedule primitives that transform those tensor-expression 
computations.
- TensorIR: IRModule compatible abstraction for low-level tensor programs
- MetaSchedule: scheduling and auto-scheduling support for TensorIR.
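
To make TE-Compute concrete, here is a minimal sketch of a matmul compute 
definition written with tensor expressions; the 128x128 shapes are arbitrary 
placeholders for illustration:

```python
import tvm
from tvm import te

# Declare a 128x128 matmul purely as a compute definition; no schedule yet.
A = te.placeholder((128, 128), name="A")
B = te.placeholder((128, 128), name="B")
k = te.reduce_axis((0, 128), name="k")
C = te.compute(
    (128, 128),
    lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
    name="C",
)
```

Such a definition can later be turned into a TensorIR function via 
`te.create_prim_func` (see S1 below).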

Additionally, we have libraries that declare a set of common compute 
definitions and optimized schedules around them:

- TOPI-Compute: the compute definition
- TOPI-Schedule: a collection of manual schedules that accompany the compute

Most of the prior compilation flows are build-centric, with scheduling deeply 
coupled into the compilation as a fixed step. There are several levels of 
coupling here:

- C0: The topi-compute and topi-schedule are usually deeply coupled, as the 
schedule needs to operate on a particular kind of compute.
- C1: Tuning and application of tuned results are coupled, causing the build 
step to always involve a long tuning time.
- C2: Manual schedules, library dispatch, and auto-scheduling are coupled with 
one another because operator dispatching has to be done in a single shot.

## Proposed Strategy

TVM's core strategy for emerging needs provides a high-level guideline to 
address the above challenges. In this section, we will elaborate on the general 
technical strategy we should take moving forward. The strategy contains the 
following key principles:

- K0: Scheduling of an operator is done via an IRModule⇒IRModule transform, 
where we update each TensorIR function (see the sketch after this list).
- K1: Tuning of the TensorIR functions is decoupled from the application of 
tuning results. We can have a preprocessing pipeline to extract tasks. After 
tuning, we simply have an ApplyBest pass that pattern-matches the TensorIR and 
rewrites it to the best-tuned functions. This approach aligns with K0.
- K2: Library dispatch and BYOC are done through a separate pass that also 
follows the K0 strategy, and we can customize the order and selection 
strategies to enable composition of library dispatching, manual schedules, 
and optional tuning.
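
To illustrate K0, below is a minimal sketch of an IRModule⇒IRModule pass that 
visits every TensorIR function. The pass name and the `"k0.visited"` attribute 
are made up for illustration; a real pass would construct a `tir.Schedule` and 
apply schedule primitives instead of just tagging the function:

```python
import tvm
from tvm import tir


@tvm.transform.module_pass(opt_level=0, name="ExampleK0Pass")
def example_k0_pass(mod, ctx):
    """K0 sketch: take an IRModule, rewrite its TensorIR functions, return it."""
    updates = {}
    for gvar, func in mod.functions.items():
        if isinstance(func, tir.PrimFunc):
            # A real pass would build tir.Schedule(func), apply primitives
            # (split / bind / vectorize / ...), and read back sch.mod["main"].
            updates[gvar] = func.with_attr("k0.visited", True)
    for gvar, func in updates.items():
        mod.update_func(gvar, func)
    return mod
```

Because the pass is a plain IRModule transform, it can be dropped into any 
pass pipeline (e.g., composed via `tvm.transform.Sequential`) without special 
build support.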

K0 is especially important, as it helps us create a modular set of optimizing 
components that do not need special support within a build pipeline. Such an 
approach also allows multiple developer communities to bring new approaches 
and compose them concurrently. We highlight a few pieces of current 
infrastructure that follow these principles.

- MetaSchedule contains infrastructure for manual scheduling and for applying 
tuning records from its database as an independent pass.
- dlight is a strategy-based module that detects patterns and performs general 
rewrites for GPU kernels (see the first sketch after this list). We are also 
starting to see fast-dlight, which combines domain knowledge and tuning.
- Sometimes, it is helpful to leverage structural hash and equality to create 
manual dispatch from a TensorIR function to an already optimized one (see the 
second sketch after this list); we find this approach useful for quick 
performance exploration before capturing the rewrites in more generic passes.
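
For reference, this is roughly how dlight composes with the pass pipeline 
today; treat it as a sketch, since the exact set of schedule rules depends on 
your TVM build, and `mod` is assumed to be an IRModule containing TensorIR 
functions:

```python
import tvm
from tvm import dlight as dl

# dlight's rules are applied as an ordinary IRModule -> IRModule pass;
# each rule pattern-matches a kind of kernel, with Fallback as a catch-all.
with tvm.target.Target("cuda"):
    mod = dl.ApplyDefaultSchedule(
        dl.gpu.Matmul(),
        dl.gpu.GEMV(),
        dl.gpu.Reduction(),
        dl.gpu.Fallback(),
    )(mod)
```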
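The structural-equality dispatch idea can be sketched as follows; 
`OPTIMIZED_PAIRS` and the pass name are hypothetical, standing in for a table 
of hand-optimized kernels you maintain yourself:

```python
import tvm
from tvm import tir

# Hypothetical table of (reference PrimFunc, hand-optimized PrimFunc) pairs,
# where both sides share the same calling interface.
OPTIMIZED_PAIRS = []


@tvm.transform.module_pass(opt_level=0, name="DispatchByStructuralEqual")
def dispatch_by_structural_equal(mod, ctx):
    """Swap in a hand-optimized PrimFunc when a function matches a reference."""
    updates = {}
    for gvar, func in mod.functions.items():
        if not isinstance(func, tir.PrimFunc):
            continue
        for reference, optimized in OPTIMIZED_PAIRS:
            if tvm.ir.structural_equal(func, reference):
                updates[gvar] = optimized
                break
    for gvar, func in updates.items():
        mod.update_func(gvar, func)
    return mod
```

For larger tables, `tvm.ir.structural_hash` can bucket candidates first so the 
full equality check only runs on hash matches.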

## Suggestions for Future Development

While we will continue to maintain some of the existing approaches, this post 
aims to provide a guideline for our new development thrusts. Here are some of 
the suggestions:

- S0: Move away from TE-Schedule and TOPI-Schedule for new development, and 
use TensorIR/MetaSchedule instead.
- S1: Continue to leverage TE-Compute only as a way to create TensorIR 
functions (the sketch after this list covers S1 and S2 together).
- S2: Leverage the MetaSchedule infrastructure for autotuning. While we can 
create different modules to form different search spaces (e.g., the 
fast-dlight example), we can reuse most of the infrastructure for tuning and 
for applying the best result.
- S3: Supply manually specialized TensorIR when there is demand, as long as 
we isolate such rewrites as K0 passes.
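
As a concrete end-to-end sketch of S1 and S2: the work directory, trial 
budget, and target string below are placeholders, and the APIs assume a 
recent TVM build where `ms.tune_tir` and `ms.tir_integration.compile_tir` are 
available:

```python
import tvm
from tvm import te
from tvm import meta_schedule as ms

# S1: TE-Compute is used only to create the TensorIR function.
A = te.placeholder((128, 128), name="A")
B = te.placeholder((128, 128), name="B")
k = te.reduce_axis((0, 128), name="k")
C = te.compute((128, 128), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
mod = tvm.IRModule({"main": te.create_prim_func([A, B, C])})

# S2: tune with MetaSchedule. Tuning and application are decoupled (K1):
# the database persists in work_dir and can be applied later as a pass.
target = tvm.target.Target("llvm -num-cores 4")
database = ms.tune_tir(
    mod=mod,
    target=target,
    work_dir="./tune_logs",   # placeholder path
    max_trials_global=64,     # placeholder trial budget
)
sch = ms.tir_integration.compile_tir(database, mod, target)
if sch is not None:
    lib = tvm.build(sch.mod, target=target)
```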

## Discussions

We have already successfully applied these principles to GenAI developments 
such as LLMs and have seen great development productivity from the approach. 
So, this post aims to summarize these lessons as a strategy that we think can 
benefit the overall project. It would also be good to bring in examples 
around these principles, along with new ideas to make scheduling and 
optimization better.

We would also love to see specific examples and use cases you have in mind, 
so we can discuss how the new paradigm can help those cases.




