The goal of this RFC is to offload subgraph inference from user devices to 
high-performance edge servers. The initial code is available 
[here](https://github.com/kazum/tvm/tree/remote_runtime); it implements 
inference offloading based on BYOC (Bring Your Own Codegen).

# Motivation

The benefits of offloading inference are as follows:

- In the 5G era, network latency is very low, so we can take advantage of 
high-spec hardware in the cloud for better performance.
- In some cases, we don't want to expose the whole network structure or the 
weight data to users, in order to protect intellectual property.

Implementing efficient inference offloading by hand for each neural network is 
laborious. If TVM has runtime support for offloading, it can be done 
automatically.

# Use case

The figure illustrates Mask R-CNN inference on an iPhone device.

![mec_mask_rcnn|690x314, 50%](upload://9ZqnMWRrXlqy59mxQBw4f1MAz3V.png)

With the subgraph offloading feature, we can run the Mask R-CNN backbone on the 
iPhone, send an encoded feature map to the MEC server, and run the head parts 
there. The stages can be parallelized in a pipeline fashion, as sketched below.
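
For illustration, here is a minimal sketch of such a pipeline in Python. The 
stage functions `run_backbone` (on the device) and `run_heads_remote` 
(offloaded via RPC) are hypothetical placeholders, not part of the PoC API; 
`camera_frames` and `handle_detections` are likewise placeholders:

```python
import queue
import threading

# Hypothetical stage functions (assumptions for illustration only):
#   run_backbone(frame)     -> encoded feature map, computed on the iPhone
#   run_heads_remote(feats) -> detection results, computed on the MEC server

feature_queue = queue.Queue(maxsize=2)  # bounded to limit buffering

def backbone_stage(frames):
    for frame in frames:
        feature_queue.put(run_backbone(frame))  # stage 1, on the device
    feature_queue.put(None)                     # end-of-stream sentinel

def heads_stage(on_result):
    while True:
        feats = feature_queue.get()
        if feats is None:
            break
        on_result(run_heads_remote(feats))      # stage 2, on the server

# Running the stages concurrently lets frame N's backbone pass overlap
# with frame N-1's remote head computation.
threading.Thread(target=backbone_stage, args=(camera_frames,)).start()
heads_stage(handle_detections)
```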

We shouldn't send the raw input image to the server: the original picture is 
privacy-sensitive data and, in addition, too big to send over the network 
efficiently. The encoded feature map, in contrast, is smaller and less 
sensitive than the original input.

I've implemented a PoC application for this and confirmed that it runs at more 
than 70 FPS. Such performance is unlikely to be achievable on the iPhone alone.

Here is a demo video: https://youtu.be/7MHIfdq2SKU

# Proposal

## Workflow

1. Build

   - Annotate the graph to specify which part should be offloaded to the 
remote edge server (see the build sketch after this list).  [[PoC 
code](https://github.com/kazum/tvm/blob/remote_runtime/apps/ios_rpc/tests/mask_rcnn.py)]

   - Unlike the other BYOC examples, we do nothing in `relay.ext.remote`.  
This is because:
     - TVM doesn't allow calling another `relay.build` inside `relay.build`.
     - The contents of the subgraph should be updatable separately.

     Instead, we build the subgraph part separately. [[PoC 
code](https://github.com/kazum/tvm/blob/remote_runtime/tests/python/contrib/test_remote_runtime.py#L34-L39)]

2. Deploy
    - Place the separately built library on the remote server.  [[PoC 
code](https://github.com/kazum/tvm/blob/remote_runtime/python/tvm/contrib/target/remote.py#L72-L85)]
    - Run an inference server to process inference requests from edge devices 
(see the example command after the sketch below).
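
As a rough sketch of the build step, the following uses the standard BYOC 
passes (`AnnotateTarget`, `MergeCompilerRegions`, `PartitionGraph`) and then 
builds the offloaded part separately. The `"remote"` target name follows the 
PoC; the helper `extract_remote_subgraph` and the file names are assumptions 
for illustration, not a fixed API:

```python
import tvm
from tvm import relay

# mod, params: the full Relay model (e.g., a converted Mask R-CNN).
# Mark the operators to offload; the "remote" annotator/target name
# follows the PoC and is an assumption here.
mod = relay.transform.AnnotateTarget("remote")(mod)
mod = relay.transform.MergeCompilerRegions()(mod)
mod = relay.transform.PartitionGraph()(mod)

# Build the device-side module.  Since relay.build cannot be nested,
# the partitioned "remote" functions are left uncompiled at this point.
with tvm.transform.PassContext(opt_level=3):
    device_lib = relay.build(mod, target="llvm", params=params)

# Build the offloaded subgraph separately so that it can be deployed
# (and later updated) independently of the device binary.
# extract_remote_subgraph is a hypothetical helper standing in for the
# extraction step done in the PoC.
remote_mod = extract_remote_subgraph(mod)
with tvm.transform.PassContext(opt_level=3):
    server_lib = relay.build(remote_mod, target="llvm")
server_lib.export_library("subgraph.so")
```

For the deploy step, one possible starting point is the stock TVM RPC server 
on the edge machine, e.g.:

```
python -m tvm.exec.rpc_server --host 0.0.0.0 --port 9090
```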

## Architecture

Two modules are introduced.

![modules|690x327, 50%](upload://mse8ZITBqkUz5cWJiuzzeXsjrb1.png)

- RemoteModule

  This module is implemented based on BYOC.  It calls the WrapGraphRuntime 
module via RPC.  We cannot call the remote GraphRuntime directly because the 
subgraph structure and weight data are located only on the remote server.

- WrapGraphRuntime

  This module creates and calls a local GraphRuntime on the server, using the 
deployed library.
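
To make the division of labor concrete, here is a minimal Python sketch of 
what a WrapGraphRuntime-style wrapper could look like on the server. The class 
name, constructor arguments, and file layout are assumptions for illustration; 
the PoC's actual interfaces differ:

```python
import tvm
from tvm.contrib import graph_runtime

class WrapGraphRuntimeSketch:
    """Hypothetical server-side wrapper: it owns the deployed subgraph
    library, graph JSON, and weights, so clients never see them."""

    def __init__(self, lib_path, graph_json_path, params_path):
        lib = tvm.runtime.load_module(lib_path)
        with open(graph_json_path) as f:
            graph_json = f.read()
        # Create a GraphRuntime locally on the server from the deployed
        # artifacts and load the weights into it.
        self._rt = graph_runtime.create(graph_json, lib, tvm.cpu(0))
        with open(params_path, "rb") as f:
            self._rt.load_params(f.read())

    def run(self, inputs):
        # inputs: dict of input name -> numpy array, received via RPC.
        for name, data in inputs.items():
            self._rt.set_input(name, data)
        self._rt.run()
        return [self._rt.get_output(i).asnumpy()
                for i in range(self._rt.get_num_outputs())]
```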

## RPC protocol

Since we don't have an official inference server for TVM, I'm thinking of 
starting with the TVM RPC server to serve inference requests.  There are some 
points that should be improved.

- Bulk read/write

  `dmlc::Stream::{ReadArray,WriteArray}` repeat a read or write once per 
element, which is inefficient; the whole buffer should be transferred at once.

- Handle requests from multiple clients at the same time

  I'm not sure why concurrent RPC requests are not allowed now.  My PoC 
implementation supports them temporarily with a quick 
[patch](https://github.com/kazum/tvm/commit/33f8e71).

- Reduce the number of round-trips

  This is probably beyond the scope of the TVM RPC, but it would be more 
efficient if we could do all of the following with a single RPC (a 
hypothetical sketch follows this list):
  - Send input tensors from local to remote
  - Run the remote function
  - Receive output tensors from remote to local
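
For illustration, a combined call could look roughly like the sketch below. 
`remote.run_subgraph` is an imagined server-side function, not an existing TVM 
RPC API; only `rpc.connect` and `get_function` are standard:

```python
import numpy as np
from tvm import rpc

# Connect to the inference server with the standard TVM RPC client.
sess = rpc.connect("mec-server.example.com", 9090)

# Hypothetical combined entry point (assumed name): one request carries
# the serialized inputs, the server runs the subgraph, and the reply
# carries the serialized outputs -- a single round-trip.
remote_run = sess.get_function("remote.run_subgraph")

feature_map = np.zeros((1, 256, 100, 152), dtype="float32")  # example input
output_blob = remote_run(feature_map.tobytes())
```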

Supporting more standard protocols like gRPC and HTTP is future work.  I think 
it's also possible to cooperate with other inference servers like TensorFlow 
Serving, the TensorRT Inference Server, and so on.

---
Any comments would be appreciated.

@tqchen @zhiics @haichen @masahi