With the popularity of LLMs, NLP models are becoming larger and larger. Even after quantization, it is still hard to fit a single LLM into one GPU's RAM, so running LLMs across multiple hosts with multiple GPUs may be the practical solution right now.
I'm wondering: can we deploy a large model on multiple GPUs with tvm-unity? Based on my previous understanding of TVM, if one wants to run a large model with TVM, they can:

1. Split the model into multiple sub-graphs, compile them one by one, and run inference with the TVM pipeline executor, placing different sub-graphs on different GPUs according to the available hardware resources (rough sketch below).
2. Maybe use heterogeneous execution for Relax with `VDevice`, as proposed in [[RFC][Unity][Relax] Heterogeneous Execution for Relax](https://discuss.tvm.apache.org/t/rfc-unity-relax-heterogeneous-execution-for-relax/14670) (second sketch below).
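Something like this is what I have in mind for idea 1. It is only a minimal sketch: `submod_a` / `submod_b`, their params, and the input name `"data"` are placeholders for whatever the actual graph-splitting step produces, and the boundary tensor is copied between GPUs through the host.

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def build_on(mod, params, dev):
    # Compile one Relay sub-graph for CUDA and load it onto `dev`.
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="cuda", params=params)
    return graph_executor.GraphModule(lib["default"](dev))

dev0, dev1 = tvm.cuda(0), tvm.cuda(1)
stage_a = build_on(submod_a, params_a, dev0)  # first half on GPU 0
stage_b = build_on(submod_b, params_b, dev1)  # second half on GPU 1

x = np.random.rand(1, 3, 224, 224).astype("float32")
stage_a.set_input("data", tvm.nd.array(x, dev0))
stage_a.run()

# Move the boundary activation from GPU 0 to GPU 1 (via host memory in
# this sketch; a real deployment would want a faster device-to-device path).
mid = stage_a.get_output(0).numpy()
stage_b.set_input("data", tvm.nd.array(mid, dev1))
stage_b.run()
out = stage_b.get_output(0)
```

I guess `tvm.contrib.pipeline_executor` could automate this kind of chaining, which is what I meant by running the sub-graphs with the TVM pipeline.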
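And here is a rough TVMScript sketch of idea 2, based on the syntax proposed in the RFC linked above. The exact spellings (`I.vdevice`, `R.to_vdevice`, the `"cuda:0"` / `"cuda:1"` annotations) are my reading of the RFC and may differ from what the unity branch actually accepts:

```python
from tvm.script import ir as I, relax as R

@I.ir_module
class TwoGPUModule:
    # Declare two virtual devices: GPU 0 and GPU 1 (syntax from the RFC).
    I.module_global_infos({"vdevice": [I.vdevice("cuda", 0), I.vdevice("cuda", 1)]})

    @R.function
    def main(
        x: R.Tensor((1, 4096), "float16", "cuda:0"),
        w0: R.Tensor((4096, 4096), "float16", "cuda:0"),
        w1: R.Tensor((4096, 4096), "float16", "cuda:1"),
    ) -> R.Tensor((1, 4096), "float16", "cuda:1"):
        # First matmul is placed on GPU 0.
        h = R.matmul(x, w0)
        # Explicitly transfer the activation to GPU 1 ...
        h1 = R.to_vdevice(h, "cuda:1")
        # ... so the second matmul runs there.
        out = R.matmul(h1, w1)
        return out
```

If this works, each weight shard could stay resident on its own GPU and only the activations would cross devices, which is basically pipeline parallelism expressed inside a single Relax module.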