Foundation models are important workloads. By pushing local and server LLM inference to the extreme in the TVM stack, I believe we can move the resolution of long-standing pain points to a new stage and make TVM THE deep learning compiler for general scenarios, which is necessary for us to stay competitive (and alive) and to keep delivering productivity.
To name a few:

**1. Out-of-the-box performance**

Triton has gradually become the solution for people writing custom ops on CUDA. TorchInductor uses Triton to generate code for long-tail operators (reduction/elementwise) by default, and will add Triton GEMM to the search space if more tuning is turned on. An LLM typically has two phases, prefill and decoding: prefill is bound by GEMMs and decoding is bound by (quantized) GEMVs, which are representative workloads for GEMM ops and long-tail ops respectively (see the first sketch at the end of this post for the shapes involved). The PyTorch team tried to use TVM as a backend, but MetaSchedule takes too long to tune and generates sub-optimal programs, so the resulting performance is poor for a given amount of compilation time. We also see an interesting future for other, less popular backends, especially those equipped with unified memory, which lets larger models run. It's necessary to understand kernel scheduling on these backends and to fully leverage TVM's advantage of transferring schedules between different backends.

**2. Distributed inference**

No matter how good the quantization scheme and memory-planning algorithms are, people are always greedy for the size of the model they can run. Even with 4-bit quantization, 70B Llama2 still requires 2x 40GB A100s simply to hold the model (see the rough estimate at the end of this post). Instead of swapping the model between much slower storage and GPU memory, unified memory provides an interesting solution here (a 64GB MacBook Pro can serve 70B Llama2 on its own). Another, orthogonal solution is distributed inference, and the two can be combined.

**3. More hackable infrastructure and fewer new concepts in general**

We have seen some recent DL compilers written purely in Python (TorchInductor, Hidet), which gives engineers a much easier debugging and hacking experience. We have also seen projects like llama.cpp and llama2.c, which implement the whole model and its kernels in pure C++/C. People are actively contributing to them, and I believe one important reason is that they are straightforward to understand by reading through the code, so people with little knowledge of how DL compilers generally work can still hack into and debug the infra. Being able to import and modify the model, insert new layers, substitute generated kernels with different implementations in a shader language, and change operator fusion as smoothly as people expect, while also providing reasonable debugging tools such as setting breakpoints and inspecting intermediate outputs (in Python), as people have been doing since the day they started learning to program, can enable more people to come and contribute to our stack.
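To make the prefill/decoding distinction in point 1 concrete, here is a minimal NumPy-only sketch of the matrix shapes each phase produces. The hidden size (4096) and prompt length (512) are illustrative assumptions, not taken from any specific model.

```python
import numpy as np

hidden = 4096          # hypothetical hidden size
prompt_len = 512       # tokens processed at once during prefill

W = np.random.randn(hidden, hidden).astype(np.float32)  # one projection weight

# Prefill: the whole prompt goes through each projection at once,
# so the core op is a GEMM of shape [prompt_len, hidden] x [hidden, hidden].
prompt_acts = np.random.randn(prompt_len, hidden).astype(np.float32)
prefill_out = prompt_acts @ W        # compute-bound GEMM

# Decoding: tokens are generated one at a time, so each step is a
# [1, hidden] x [hidden, hidden] GEMV; with weight quantization it becomes
# a (quantized) GEMV bound by weight-memory bandwidth rather than FLOPs.
token_act = np.random.randn(1, hidden).astype(np.float32)
decode_out = token_act @ W           # memory-bound GEMV
```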
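And a back-of-the-envelope estimate behind the 70B Llama2 memory numbers in point 2. The ~15% overhead for group-wise quantization scales is an assumed figure, and the KV cache and runtime buffers are not counted here; they depend on context length, batch size, and cache precision.

```python
params = 70e9  # 70B parameters

fp16_weights_gb   = params * 2 / 1e9         # ~140 GB: far beyond a single 40/80 GB GPU
int4_weights_gb   = params * 0.5 / 1e9       # ~35 GB of packed 4-bit weights
int4_with_scales  = int4_weights_gb * 1.15   # assumed ~15% overhead for group-wise scales

print(f"fp16 weights : {fp16_weights_gb:6.1f} GB")
print(f"4-bit weights: {int4_with_scales:6.1f} GB (+ KV cache and activations)")
# Once the KV cache and runtime buffers are added, a single 40 GB A100 is not
# enough, while a 64 GB unified-memory machine can hold weights + cache in one pool.
```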