With the popularity of LLMs, NLP models are growing larger and larger. Even with quantization, it is still hard to fit a single LLM into the RAM of one GPU, so running LLMs across multiple hosts and multiple GPUs may be the practical solution right now.

I'm wondering: can we deploy a large model on multiple GPUs with tvm-unity?

Based on my previous understanding of TVM, if one wants to run a large model with TVM, he/she can:

1. Split the model into multiple sub-graphs, compile them one by one, and run inference with the tvm pipeline executor, where different sub-graphs can be placed on different GPUs according to the available hardware resources (see the first sketch after this list).
2. Maybe use heterogeneous execution for Relax with `VDevice`, as proposed in [[RFC][Unity][Relax] Heterogeneous Execution for Relax - Development / unity - Apache TVM Discuss](https://discuss.tvm.apache.org/t/rfc-unity-relax-heterogeneous-execution-for-relax/14670) (see the second sketch below).
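For reference, here is a minimal sketch of option 1 using the Relay pipeline executor from `tvm.contrib`. The two one-op stages only stand in for real model shards, and the exact `PipelineConfig` wiring/indexing API may vary between TVM versions, so treat this as an outline rather than a definitive recipe:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import pipeline_executor, pipeline_executor_build


def make_stage():
    # a trivial one-op Relay module standing in for one shard of a large model
    data = relay.var("data", shape=(1, 16), dtype="float32")
    return tvm.IRModule.from_expr(relay.Function([data], relay.nn.relu(data)))


mod1, mod2 = make_stage(), make_stage()

pipe_config = pipeline_executor_build.PipelineConfig()
# place each sub-graph on its own GPU
pipe_config[mod1].target = "cuda"
pipe_config[mod1].dev = tvm.cuda(0)
pipe_config[mod2].target = "cuda"
pipe_config[mod2].dev = tvm.cuda(1)

# wire the stages: global input -> mod1 -> mod2 -> global output
pipe_config["input"]["data"].connect(pipe_config[mod1]["input"]["data"])
pipe_config[mod1]["output"][0].connect(pipe_config[mod2]["input"]["data"])
pipe_config[mod2]["output"][0].connect(pipe_config["output"][0])

with tvm.transform.PassContext(opt_level=3):
    factory = pipeline_executor_build.build(pipe_config)

pipeline = pipeline_executor.PipelineModule(factory)
pipeline.set_input("data", tvm.nd.array(np.random.rand(1, 16).astype("float32")))
pipeline.run()
outputs = pipeline.get_output()
```

Note that this path is Relay-based and requires TVM to be built with the pipeline executor enabled (`USE_PIPELINE_EXECUTOR=ON`).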
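And a sketch of option 2, adapted from the TVMScript examples in the RFC linked above. The `vdevice` global-info list, the `"cuda:0"`/`"cuda:1"` annotations, and `R.to_vdevice` follow the RFC discussion; whether multiple GPUs of the same kind are fully supported end-to-end is exactly the open question here:

```python
import tvm
from tvm.script import ir as I
from tvm.script import relax as R


@I.ir_module
class TwoGPUModule:
    # declare the virtual devices this module can place tensors on
    I.module_global_infos({"vdevice": [I.vdevice("cuda", 0), I.vdevice("cuda", 1)]})

    @R.function
    def main(
        x: R.Tensor((2, 3), "float32", "cuda:0"),
        y: R.Tensor((2, 3), "float32", "cuda:0"),
    ) -> R.Tensor((2, 3), "float32", "cuda:1"):
        with R.dataflow():
            lv0 = R.add(x, y)                  # runs on GPU 0, where both inputs live
            lv1 = R.to_vdevice(lv0, "cuda:1")  # explicit cross-device copy to GPU 1
            gv = R.multiply(lv1, lv1)          # runs on GPU 1
            R.output(gv)
        return gv
```

The RFC also describes an `R.hint_on_device` op plus a `RealizeVDevice` pass for propagating placements, instead of annotating every tensor by hand.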




