See [µTVM RPC Server Draft 
PR](https://github.com/apache/incubator-tvm/pull/6334).

## Motivation

The µTVM project can be thought of as two logical components that work together
to execute models on device:

1. A compiler that transforms Relay functions into a set of fused Relay 
operators, and then generates portable C functions to implement each group of 
fused operators. This is largely just the TVM compiler with a few modifications 
to target a minimal runtime.
2. A minimal runtime compatible with bare-metal/RTOS environments.

To achieve its end goals, µTVM needs to be able to execute compiled Relay 
operators under two different workflows:

1. Production workflow. The driver is compiled into the device firmware; it must allocate tensor memory and invoke operator implementations in graph order. This workflow is not yet supported at `HEAD`, and a variety of implementation strategies will be explored in the coming weeks.
2. AutoTVM/evaluation workflow. An attached host machine can drive overall model execution for evaluation without complete firmware being written, or invoke one operator at a time for AutoTVM. It must also be able to time operator execution for AutoTVM.

This RFC is concerned primarily with the AutoTVM/evaluation workflow, which is supported at `HEAD` today with substantial limitations. At present,
µTVM loads a small 
[runtime](https://github.com/apache/incubator-tvm/blob/99745a44407f2d1bd06b8c6a47e6c6c5239ec665/src/runtime/micro/host_driven/utvm_runtime.c)
 into RAM, writes `TVMArgs` using GDB, populates a task list, and sets the 
device PC to the runtime entry point. This process can be invoked remotely on a 
TVM RPC Server by using the TVM Device API with a `micro_dev` context.

This strategy uses a very minimal on-device runtime; however, it has some 
drawbacks:

- ISRs raised by the SoC aren't handled and appear as timeouts. If the SoC enters an exception handler, it must be reset (sometimes a software reset is sufficient; in other cases a hard reset or board power cycle is necessary).
- The SoC needs to be configured by a program loaded in flash. There are a 
bunch of features that typically affect CPU performance: oscillator 
configuration, caches, and power modes, among others. Currently, the µTVM 
blogpost eval repo expects this [mBED-based 
program](https://github.com/areusch/utvm-mbed-runtime/tree/utvm-blogpost-1) to 
live in flash and execute on device startup to configure the SoC. However, this 
isn't enforced or checked by TVM.
- For higher-bandwidth communication, device peripherals need to be configured. 
Drivers for these peripherals are typically written in C (rather than something 
usable from GDB) and expect to be able to use ISRs.

This RFC proposes to move the TVM RPC server onto the bare metal target, taking 
advantage of the [RPC modularization 
PR](https://github.com/apache/incubator-tvm/pull/5484) and the tendency for 
embedded devices to contain stream-oriented peripherals. Because an embedded device is generally resource-constrained, some limitations will exist in the µTVM RPC Server:

- Only the C++ [RPC Endpoint API](https://github.com/apache/incubator-tvm/blob/master/src/runtime/rpc/rpc_endpoint.cc) will be exposed. Features that live behind PackedFuncs, such as RPC proxying, won't necessarily be included.
- Dynamic code loading won't be supported initially (but may be possible in a limited fashion in a future RFC).
- Some message length and tensor rank limits will be stricter than those on the full Python-hosted runtime.

The goals of the µTVM On-Device RPC server are to allow users to evaluate 
models and to run AutoTVM. A non-goal of the µTVM On-Device RPC server is to 
handle model deployment.

## Approach

Breaking from the previous µTVM strategy, this RFC proposes that µTVM build
binary images meant to be placed in device flash like any other long-lived 
firmware. This means that the µTVM RPC server binary is responsible for the 
following (in a typical AutoTVM session):

- SoC initialization (i.e. oscillator configuration, cache setup, etc)
- Handling interrupts
- Transmitting and receiving RPC protocol data over some peripheral
- Running the RPC server and resulting remote-triggered code
- Timing execution of TVM functions

### Code Organization

A µTVM RPC Server binary can be thought of as having three parts:

1. **SoC Initialization, ISR Handlers, and Device Drivers.**
In order to achieve reproducible results, the SoC needs to be configured from a known good state, e.g. from device reset. In some cases, a known good state is
power-on, so this code needs to live in the SoC flash and be invoked directly 
from reset. This code is expected to live in repos outside the TVM repo, and 
should be configured per-device or per-project. The `main()` function exists 
here.
2. **TVM MinRPC Server and C Runtime**
Supplied from the TVM repo and invoked by the code in part #1. Implements the 
TVM RPC server using the [C 
Runtime](https://github.com/apache/incubator-tvm/tree/master/src/runtime/crt).
3. **Compiled TVM model functions**
Built per target and integrated as the System library.

Each piece is discussed in detail below.

### SoC Initialization, ISR Handlers, and Device Drivers

This code is intended to be specific to the targeted development board. It can 
be based on anything from a `printf("Hello, world!\n")` demo to a fully-fledged 
RTOS; the requirements are:

1. It needs to deterministically configure the SoC in terms of CPU performance.
2. It needs to facilitate UART-like communication over any peripheral the host can access (e.g. USB, Ethernet, semihosting).
3. It needs to handle device ISRs and understand when the device has entered a bad state.
4. It needs to provide memory for the µTVM RPC server to allocate function arguments and intermediate tensors.

This code does not live in the TVM repo, and is intended to be referenced from autotuning scripts. Examples exist using [mBED](https://github.com/areusch/utvm-mbed-runtime) and the [Zephyr](https://github.com/areusch/utvm-zephyr-runtime) RTOS; a sketch of such an entry point follows.
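
For example, a per-board entry point might look like the following sketch. Every name here (`board_init_clocks`, `board_uart_init`, `utvm_rpc_server_*`) is a placeholder rather than an API from the proposed PR; the comments map each piece back to the requirements above:

```cpp
#include <stddef.h>
#include <stdint.h>

// Placeholder board-support functions covering requirements 1-3 above.
extern "C" void board_init_clocks();                       // oscillators, caches, power modes
extern "C" void board_uart_init(uint32_t baud);            // host-accessible transport
extern "C" int board_uart_read(uint8_t* buf, size_t len);  // non-blocking; returns bytes read

// Placeholder server API; the real entry points live in the TVM C runtime.
typedef void* utvm_rpc_server_t;
utvm_rpc_server_t utvm_rpc_server_init(uint8_t* memory, size_t memory_bytes);
void utvm_rpc_server_receive_byte(utvm_rpc_server_t server, uint8_t byte);

// Requirement 4: memory the server can use for function arguments and
// intermediate tensors.
static uint8_t g_tvm_memory[64 * 1024];

int main() {
  board_init_clocks();      // requirement 1: deterministic CPU configuration
  board_uart_init(115200);  // requirement 2: UART-like channel to the host

  utvm_rpc_server_t server =
      utvm_rpc_server_init(g_tvm_memory, sizeof(g_tvm_memory));
  for (;;) {  // requirement 3 (ISR handling) lives in the BSP/RTOS, not here
    uint8_t byte;
    if (board_uart_read(&byte, 1) == 1) {
      utvm_rpc_server_receive_byte(server, byte);  // feed the framing layer
    }
  }
}
```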

As a secondary design goal, this code should be able to make third-party libraries available to the µTVM RPC Server as PackedFuncs. These may be used to validate preprocessing steps or to capture data from an onboard sensor.
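
For instance, assuming the C runtime provides `TVMFuncRegisterGlobal` and `TVMCFuncSetReturn` from `c_runtime_api.h`, a hypothetical temperature-sensor driver could be exposed like this (`board_read_temperature` and the registered name are illustrative only):

```cpp
#include <tvm/runtime/c_runtime_api.h>

extern "C" float board_read_temperature();  // hypothetical third-party driver

// Wrapper following the TVMPackedCFunc signature from c_runtime_api.h.
static int SensorReadTemp(TVMValue* args, int* type_codes, int num_args,
                          TVMRetValueHandle ret, void* resource_handle) {
  (void)args; (void)type_codes; (void)num_args; (void)resource_handle;
  TVMValue ret_value;
  ret_value.v_float64 = board_read_temperature();
  int ret_type_code = kTVMArgFloat;
  return TVMCFuncSetReturn(ret, &ret_value, &ret_type_code, 1);
}

// Called once at startup, before entering the RPC server loop.
void RegisterBoardFunctions() {
  TVMFuncRegisterGlobal("board.read_temperature",
                        reinterpret_cast<TVMFunctionHandle>(SensorReadTemp), 0);
}
```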

### TVM MinRPC Server and C Runtime

The basic approach is to instantiate the [MinRPC 
server](https://github.com/apache/incubator-tvm/blob/master/src/runtime/rpc/minrpc/minrpc_server.h),
 drive it using a message buffer, and use the MISRA-C runtime to handle the 
lower-level details of RPC calls. To facilitate this, some changes were 
necessary in the MISRA-C runtime (See "Changes to the MISRA-C Runtime").
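
A minimal sketch of the shape of this, not the PR's exact code: `MinRPCServer` is templated over an IO handler, whose precise required interface is documented in `minrpc_server.h`. Here, `transport_write` and `soc_reset` are placeholder board functions:

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

extern "C" ssize_t transport_write(const uint8_t* data, size_t size);  // placeholder
extern "C" void soc_reset();                                           // placeholder

// An IO handler whose read side is fed from an in-RAM message buffer and
// whose write side goes straight out over the UART-like transport.
class BufferedIOHandler {
 public:
  // Called by the framing layer as decoded message bytes arrive.
  void Push(uint8_t byte) {
    if (write_ptr_ < sizeof(buf_)) buf_[write_ptr_++] = byte;
  }

  // MinRPCServer's Read() calls land here; overrunning the buffered message
  // is a CHECK failure, which on this device means resetting the SoC.
  ssize_t PosixRead(uint8_t* data, size_t size) {
    if (read_ptr_ + size > write_ptr_) soc_reset();
    memcpy(data, buf_ + read_ptr_, size);
    read_ptr_ += size;
    return static_cast<ssize_t>(size);
  }

  ssize_t PosixWrite(const uint8_t* data, size_t size) {
    return transport_write(data, size);
  }

  void Close() {}
  void Exit(int code) { (void)code; soc_reset(); }

 private:
  uint8_t buf_[1024];  // sized per device; see constraint C3 below
  size_t read_ptr_ = 0;
  size_t write_ptr_ = 0;
};
```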

### Compiled TVM Functions

This portion contains the `SystemLib` TVMModule instance, plus functions to 
register it as such with the runtime.

## MinRPC Server Design

The MinRPC server uses a blocking strategy, which isn't particularly friendly to microcontrollers without an RTOS, especially those with watchdog timers or other peripherals that need periodic servicing. However, the TVM RPC protocol is message-oriented, and each message begins with a length:

```bash
+---------------------------+
| Message Length (uint64_t) |
+---------------------------+
|       Message Body        |
+---------------------------+
```

This means that each message boundary is well-defined, so for the µTVM RPC server, an event-driven approach can safely be used as follows:

1. A message buffer accumulates data until a full message has been received. This part is non-blocking, as it doesn't involve the MinRPC Server.
2. `MinRPCServer::ProcessOnePacket` is invoked. `Read()` calls consume data from the message buffer. If `Read()` calls overrun the message buffer, it is a `CHECK` failure.
3. The process repeats until the MinRPC Server indicates it has shut down (see the sketch below).
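
A sketch of that loop, reusing the hypothetical `BufferedIOHandler` from the earlier sketch; `transport_read_nonblocking` and `frame_decoder_feed` are placeholders for the transport driver and the framing layer described in the next section:

```cpp
#include <stdint.h>

#include "minrpc_server.h"  // lives under src/runtime/rpc/minrpc/ in the TVM repo

extern "C" int transport_read_nonblocking(uint8_t* byte);  // 1 if a byte was read
// Placeholder framing hook: pushes decoded bytes into `io`, returning true
// once a complete message has been accumulated.
bool frame_decoder_feed(BufferedIOHandler* io, uint8_t byte);

void ServerLoop(BufferedIOHandler* io) {
  tvm::runtime::MinRPCServer<BufferedIOHandler> rpc(io);
  for (;;) {
    // Step 1: accumulate transport bytes; MinRPCServer is not yet involved.
    bool have_message = false;
    while (!have_message) {
      uint8_t byte;
      if (transport_read_nonblocking(&byte) == 1) {
        have_message = frame_decoder_feed(io, byte);
      }
      // Other device work (e.g. petting a watchdog) can interleave here.
    }
    // Step 2: Read() calls inside ProcessOnePacket() consume the buffered
    // message. Step 3: repeat until the server signals shutdown.
    // (Resetting the message buffer between packets is omitted for brevity.)
    if (!rpc.ProcessOnePacket()) break;
  }
}
```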

### Framing and Session

MinRPC Server assumes that the underlying transport provides the same guarantees as UNIX pipes or TCP. Some additional components are needed to provide these guarantees over a UART. Specifically, the following challenges arise:

- C1. The microcontroller's `CHECK` failure strategy is to reset. This means 
that some wire protocol is needed for the µC to indicate that it has reset, 
even if only half of the previous message had been transmitted. This can be 
roughly thought of as a way to signal **Connection Reset** or **Broken Pipe** 
in a UNIX socket. However, details of CHECK failures can only be read after the 
microcontroller has rebooted, so there are some additional points to consider 
here.

- C2. As a protocol agnostic to the underlying transport, some level of error 
detection needs to be provided.

- C3. A design constraint of the transport is that it should use very little memory and code space, yet be able to receive buffers that are large relative to on-device RAM (i.e. >50%). This means that implementations which expect to buffer messages while performing error detection will limit the RPC protocol on device. By contrast, µTVM doesn't care if the payload is written to a large DLTensor before a CRC error is detected. While the blocking nature of the MinRPC server currently limits this, any error detection should pass the payload through even if it may contain invalid data.

A **Framing** layer addresses parts of C1 and all of C2. The wire format of one 
message is as follows:

```bash
+----------------------------------+
|  Packet Start Escape (0xff 0xfd) |
+----------------------------------+
|  Packet Length Bytes (uint32_t)  |
+----------------------------------+
|               Payload            |
+----------------------------------+
|   CRC-16 (CCITT, little-endian)  |
+----------------------------------+
```

An **escape character** (`0xff`) is used to start a framing-layer control sequence. All fields (except the packet start field) need to be escaped on the wire. Control sequences are at most two bytes long, with the second byte indicating the sequence type. Possible values of the second byte are:

- `0xff` - Escaped `0xff` (so, translate `ff ff` on the wire to a single `ff` of payload/length/CRC data)
- `0xfe` - Nop. Used to signal device reset.
- `0xfd` - Packet Start. Signals the beginning of a new packet. If a framing 
layer receives Packet Start while already decoding a packet, the packet being 
decoded is dropped.

While the RPC server is implemented using blocking `Read()` calls, a maximum packet length is also enforced.

The exact values used here might be adjusted, since `0xff` is likely a fairly 
common byte in `DLTensor`s.
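
As a sketch of the transmit side of this framing: `transport_write_byte` is a placeholder driver call, the CRC variant (CCITT-FALSE, polynomial 0x1021, initial value 0xFFFF) is an assumption, and whether the CRC covers the length field as well as the payload is likewise assumed here:

```cpp
#include <stddef.h>
#include <stdint.h>

extern "C" void transport_write_byte(uint8_t byte);  // placeholder driver call

// Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF (assumed).
static uint16_t Crc16Ccitt(uint16_t crc, const uint8_t* data, size_t size) {
  for (size_t i = 0; i < size; ++i) {
    crc ^= static_cast<uint16_t>(data[i]) << 8;
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                           : static_cast<uint16_t>(crc << 1);
    }
  }
  return crc;
}

// Write bytes, escaping each 0xff as the two-byte control sequence ff ff.
static void WriteEscaped(const uint8_t* data, size_t size) {
  for (size_t i = 0; i < size; ++i) {
    transport_write_byte(data[i]);
    if (data[i] == 0xff) transport_write_byte(0xff);
  }
}

// Emit one frame: packet start, then escaped length, payload, and CRC.
void SendFrame(const uint8_t* payload, uint32_t length) {
  transport_write_byte(0xff);  // Packet Start is itself a control sequence,
  transport_write_byte(0xfd);  // so it is written unescaped.
  uint8_t len_le[4] = {static_cast<uint8_t>(length),
                       static_cast<uint8_t>(length >> 8),
                       static_cast<uint8_t>(length >> 16),
                       static_cast<uint8_t>(length >> 24)};
  uint16_t crc = Crc16Ccitt(0xFFFF, len_le, sizeof(len_le));
  crc = Crc16Ccitt(crc, payload, length);
  WriteEscaped(len_le, sizeof(len_le));
  WriteEscaped(payload, length);
  uint8_t crc_le[2] = {static_cast<uint8_t>(crc),
                       static_cast<uint8_t>(crc >> 8)};  // little-endian
  WriteEscaped(crc_le, sizeof(crc_le));
}
```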

A **Session** layer handles out-of-band signaling and addresses the remainder 
of C1 and C3. Session Messages have the following structure:

```bash
+----------------------------+
| Message Type Code (1 byte) |
+----------------------------+
|    Session ID (2 bytes)    |
+----------------------------+
|       Message Payload      |
+----------------------------+
```

The following message types are supported:

1. **Session Start Init.** Starts a new session. Either party to the link can send this message; the sending side is termed the *initiator*. This message contains the initiator's nonce, which forms half of the session id. Should two Session Start Init messages be sent simultaneously, the message containing the numerically-lower nonce wins (the other message is ignored).
2. **Session Start Reply.** Confirms the new session as started. The party 
sending this message is termed the *responder*. Contains the full session id to 
be used in subsequent traffic.
3. **Terminate Session.** Contains no session id; invalidates any 
previously-established session. Devices should send this message after 
resetting, in case the other party is awaiting a reply.
4. **Log Message**. Allows the device, which typically has no connected 
display, to asynchronously print diagnostic log messages on the host. Mostly 
helpful for debugging. Log messages are always sent with session id 0 and are 
valid regardless of whether a session is established.
5. **Normal Traffic**. Standard µTVM RPC traffic. Each Session message contains 
exactly one TVM RPC message. The session id must match the session id 
established during the **Session Start** handshake.
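
For illustration only, the session-layer header might be modeled as below; the concrete numeric codes are placeholders, since the RFC does not specify on-the-wire values:

```cpp
#include <stdint.h>

// Placeholder codes: the RFC does not pin down on-the-wire values.
enum class MessageType : uint8_t {
  kSessionStartInit = 0x00,
  kSessionStartReply = 0x01,
  kTerminateSession = 0x02,
  kLogMessage = 0x03,
  kNormalTraffic = 0x10,
};

// The 3-byte header preceding each session-layer payload.
struct SessionHeader {
  MessageType message_type;
  uint16_t session_id;  // {initiator nonce, responder nonce}; always 0 for
                        // Log Messages, ignored for Terminate Session
};
```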

**Session Handshake**

Before normal traffic can be exchanged, a *session ID* is established using a 
two-way handshake. *Session IDs* are 2 bytes: 1 byte populated by the initiator 
and 1 by the responder. The handshake is as follows:

```bash
Initiator:                                  Responder:
+--------------------------+
| Type: Session Start Init |
+--------------------------+ --->
|    I_Nonce     0x00      |
+--------------------------+

                                            +---------------------------+
                                            | Type: Session Start Reply |
                                       <--- +---------------------------+
                                            |    I_Nonce     R_Nonce    |
                                            +---------------------------+
                      (session established, ID is {I_Nonce, R_Nonce})
```

**Session Termination**

When a *Terminate Session* message is received, the receiving party should 
assume that the sender has lost all state. The proposed PR raises an exception 
back to Python in this case.

**Long Messages**

µTVM RPC server faces a somewhat unique challenge in that some messages (e.g. 
CopyToRemote) may have very large payloads relative to the amount of available 
memory. At present, the proposed implementation can't receive messages like 
this; however, a future PR could rewrite MinRPCServer to handle the message 
header and payload separately. Then, CopyToRemote could progressively write the 
payload directly to the allocated tensor space in a zero-copy fashion.

**Testing**

Initially, testing will be done by compiling a µTVM RPC server targeted to the host machine, invoking it as a subprocess, and using stdin/stdout as the transport pipes. Most black-box testing should be achievable this way. To catch cross-compilation errors, a QEMU-based Cortex-M3 target could be used.
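
A sketch of the subprocess plumbing (POSIX-only; `utvm_rpc_server_host` is a placeholder binary name, not one defined by the PR):

```cpp
#include <sys/types.h>
#include <unistd.h>

// Launch the host-targeted server binary with its stdin/stdout connected to
// pipes, which stand in for the device's UART transport during testing.
pid_t LaunchHostServer(int* to_server_fd, int* from_server_fd) {
  int to_server[2], from_server[2];
  if (pipe(to_server) != 0 || pipe(from_server) != 0) return -1;
  pid_t pid = fork();
  if (pid == 0) {
    dup2(to_server[0], STDIN_FILENO);     // server reads RPC bytes from stdin
    dup2(from_server[1], STDOUT_FILENO);  // server writes RPC bytes to stdout
    close(to_server[1]);
    close(from_server[0]);
    execl("./utvm_rpc_server_host", "utvm_rpc_server_host", (char*)nullptr);
    _exit(1);  // only reached if exec fails
  }
  close(to_server[0]);    // parent keeps the write end toward the server...
  close(from_server[1]);  // ...and the read end coming back from it
  *to_server_fd = to_server[1];
  *from_server_fd = from_server[0];
  return pid;
}
```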

Some additional unit testing is done using googletest; this could also be ported to a device target for validation. However, that is somewhat more involved, so it isn't done in the PR yet.

**Points for Discussion**

1. Is the CRC layer adequate given packet sizes?
    1. Use a 16-bit CRC as done here, and add an explicit packet length limit of around 16K. Tensors longer than 16K, and modules (if loadable modules are implemented in the future to alleviate flash stress), will need to be split into multiple messages.
    2. Use a 32-bit CRC, which will take more flash space and/or longer to execute, but allow longer packets.




