In collaboration with @tqchen

See also: [PoC](https://github.com/apache/incubator-tvm/pull/6917)

## Overview

In RAM-limited deployment scenarios (e.g. µTVM), it's desirable to place as 
much constant data as possible in a separate binary section and use it directly 
from that section. To that end, this RFC proposes a way for TVM to include 
pre-linked parameters in the generated `runtime::Module`.

Depending on the target and available codegen, the solution to this problem 
could be quite expansive. For example, some architectures could benefit from a 
specific way of encoding parameters, while others may prefer to encode 
parameters for consumption by specific hardware accelerators. This RFC doesn't 
aim to preclude future work in those directions, but in the interest of forward 
progress, we constrain our goal to simply removing the need for GraphRuntime to 
allocate RAM for parameter tensors used by the `tvm.cpu()` context. Only 
the `c` and `llvm` codegens are considered here. At the end, some future 
directions are discussed.

## Challenges

There are several challenges to be solved here:

C1. Indicating to the Relay compiler that the user wants to enable this feature.

C2. Passing the set of parameters from GraphRuntimeCodegen to the 
target-specific codegen.

C3. Loading linked parameters at runtime.

We start from the end and work backwards.

### C3. Loading Linked Parameters at Runtime

Parameters can be stored either separately or as a single binary blob. 
Following are some storage schemes considered:

S1. The `data` field of each parameter's `DLTensor` is stored as a symbol named 
`__tvm_param__pN`, where `pN` is the parameter's name after passing through 
`GraphRuntimeCodegen`.

S2. Similar to S1, but the `DLTensor` structure is stored as well.

S3. Place parameters in the module metadata blob.
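Under scheme S1, for example, the emitted code might look like the following sketch. The symbol names follow the `__tvm_param__pN` convention above; the array types, shapes, and values are placeholders, not real weights:

```c
#include <stdint.h>

/* Hypothetical emission under scheme S1: the raw data of each parameter
 * becomes its own symbol. Because each array is a distinct symbol,
 * tools like nm/objdump can report per-parameter sizes, and a linker
 * script can place the symbols in a flash-resident section. */
static const float __tvm_param__p0[4] = {1.0f, 2.0f, 3.0f, 4.0f};
static const int32_t __tvm_param__p1[2] = {10, 20};
```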

S3 is most compatible with the existing codegen, but it has these disadvantages:

* Since parameters are encoded as a single metadata blob, traditional binary 
size analysis tools (e.g. objdump, nm) will report only the size of the whole 
blob rather than per-parameter sizes.
* Parameters can't be pinned in memory or assigned to specific sections (unless 
the entire metadata blob fits in the desired section).
* At runtime, parameter pointers will be initially encoded as offsets into the 
metadata blob, requiring knowledge of the metadata layout at debug time.

S2 is the easiest to reason about logically (a `DLTensor` is a concept that 
users are likely to understand). However, doing this would require encoding the 
`DLTensor` struct layout into each codegen, which could become hard to 
maintain. It's also overkill, since `DLTensor` metadata are stored in the JSON 
graph given to the GraphRuntime and are also sent over RPC.

S1 provides the benefit of linked parameters without much overhead.

Schemes S1 and S2 don't specify how parameters are looked up at runtime. We now 
consider this problem. At runtime, `GraphRuntime` knows the string `name` and 
integer `storage_id` of each parameter. Either of these can be used to identify 
the tensor to be loaded (in some cases, `GraphRuntime` reuses `storage_id` 
between tensors, but it does not do this for parameters). The linked parameter 
load process can then be thought of as a function that accepts this identifier 
and returns a `DLTensor*` or `NDArray` (depending on C or C++ runtimes) whose 
`data` field points to the pre-loaded parameter array.
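Conceptually, this load process is just a map from identifier to pre-linked data, plus a patch of the tensor's `data` pointer. A minimal C sketch of the idea (the tensor struct is a stand-in, since the real `DLTensor` comes from DLPack, and the parameter data is placeholder):

```c
#include <stddef.h>

/* Stand-in for the runtime's tensor type; only the field relevant to
 * linked parameters is modeled here. */
typedef struct { const void* data; } ParamTensor;

/* Placeholder linked-parameter data. */
static const float __tvm_param__p0[4] = {1.0f, 2.0f, 3.0f, 4.0f};

/* The lookup concept: map a parameter's storage_id to its pre-linked
 * data, or NULL when the parameter was not linked. */
static const void* lookup_linked_param(int storage_id) {
  switch (storage_id) {
    case 0: return __tvm_param__p0;
    default: return NULL;
  }
}

/* At load time, the runtime patches the parameter tensor's data
 * pointer instead of allocating RAM and copying into it. */
static int load_param(ParamTensor* t, int storage_id) {
  const void* data = lookup_linked_param(storage_id);
  if (data == NULL) return -1;  /* fall back to normal parameter loading */
  t->data = data;
  return 0;
}
```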

This function could be implemented in a few different ways:

F1. Each [model 
runtime](https://discuss.tvm.apache.org/t/discuss-module-based-model-runtime-interface/5025) 
could accept a standard data structure mapping `storage_id` to `void* data`.

F2. Each model runtime could invoke a function in the TVM system runtime (i.e. 
CRT or C++ runtime) to do the same lookup as in F1.

F3. Each generated module could expose a standard function, 
`__lookup_linked_param`.

F4. Each system runtime could load parameters given a standard data structure 
mapping model name and parameter string name to `void*`, and then invoke 
`SetParam` on the model runtime.

F4 is difficult to implement, because the model-name and parameter-name lookups 
are more complex and more expensive, and the API to set parameters (e.g. 
`TVMSetModelParameters(Module* m, const char* model_name, void* param_mapping)`) 
is harder for the user to invoke. It's also difficult to make automatic, 
because the TVM runtime has limited knowledge of when a new model-specific TVM 
Module is instantiated.

F2 suffers from a similar complexity problem (needing to key on both 
`storage_id` and `model_name`).

F1 is simple, but the data structure is not as easy to generate as it might 
seem. `storage_id` is not contiguous over the set of parameters, so the best 
implementation is as a list of pairs. This is awkward to work with and slow. 
Additionally, user code would need to separately keep track of this list and 
provide it to the model runtime to load parameters.

F3 is the best compromise: while no data-driven map exists, it offloads 
lookup-speed optimization onto the compiler (e.g. via a switch statement). It 
also gives hardware-accelerated loaders a chance to execute any initialization 
code needed at parameter-load time, such as waiting for background DMA 
transfers or decompression/decryption to complete. While this RFC doesn't 
consider heterogeneous execution contexts, this choice doesn't preclude their 
use at a later time.

In summary, the `llvm` and `c` codegens will generate an additional PackedFunc, 
`__lookup_linked_param`, in the generated `runtime::Module`. It accepts a 
unique integer `id` identifying the parameter and returns a `void*` that should 
be used to populate the `data` member of that parameter's `DLTensor`.

### C2. Passing parameters from Model-level to Target-level Codegen

Now that the job of the codegen is clear, the next challenge is passing 
parameters from model-level to target-level codegen. Because the target-level 
codegen needs to include a new Module function, and the C runtime cannot rely 
on dynamic lookup such as `dlsym`, parameters need to be included in the same 
module as the generated functions.

However, at present, TVM is not guaranteed to invoke a target-level codegen for 
every model. It's possible that trivial models (e.g. `p0 + p1`) may be fully 
realized at compile time, in which case an empty module is returned. This can 
also happen when all functions are offloaded to accelerators.

Because of this, when linked parameters are generated, `BuildRelay` emits an 
additional function, `__lookup_linked_param`. At present, this function 
contains no TIR code; the target-specific codegen is expected to provide an 
implementation. However, the parameters for the given module are attached to it 
as the attribute `tir.linked_params`.

When the target-specific codegen sees this function and sees that linked 
parameters are included, it translates those parameters' data into `static 
const` arrays and outputs the `__lookup_linked_param` implementation. This 
provides one global symbol per parameter, easing the task of analyzing binary 
bloat.
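Putting C2 and C3 together, the `c` codegen's output might look roughly like the following sketch. Note the hedges: `TVMValue` is simplified from the real FFI types in `c_runtime_api.h`, the return-type code is illustrative, and the parameter data is placeholder:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for TVM's FFI value union. */
typedef union { int64_t v_int64; void* v_handle; } TVMValue;

/* One `static const` array per linked parameter, giving one symbol per
 * parameter for binary-size analysis. */
static const float __tvm_param__p0[4] = {1.0f, 2.0f, 3.0f, 4.0f};

/* Sketch of the generated PackedFunc-style entry point: map a parameter
 * id to the address of its linked data. */
int32_t __lookup_linked_param(TVMValue* args, int* type_codes, int num_args,
                              TVMValue* ret_value, int* ret_type_code,
                              void* resource_handle) {
  (void)type_codes; (void)num_args; (void)resource_handle;
  switch ((int)args[0].v_int64) {
    case 0:
      ret_value->v_handle = (void*)__tvm_param__p0;
      *ret_type_code = 3;  /* opaque-handle type code (illustrative value) */
      return 0;
    default:
      return -1;  /* id does not name a linked parameter */
  }
}
```

The switch statement is what lets the compiler optimize the lookup, as discussed under F3, while each `static const` array remains an individually analyzable symbol.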

This approach is somewhat hacky: outside of the metadata module, TVM has no 
mechanism for including model-specific constant blobs. Since we prefer to avoid 
the metadata module due to the aforementioned linking concerns, we feel it's 
best to avoid defining another generic model-level blob packager until more 
examples appear.

### C1. Enabling Linked Parameters

Linked parameters could be enabled a number of different ways:

W1. By marking each parameter with a special attribute. Each parameter with the 
attribute would be linked.

W2. With a target flag, `--link-params`.

W3. With an additional parameter to `relay.build`.

W4. With a `PassContext` option.

W1 gives the finest-grained control, but is complex because the generated 
parameters may differ from those passed to `relay.build` due to parameter 
simplification. It may be worth revisiting this approach when heterogeneous 
execution is considered.

W2 is the simplest, but it does mean that linked parameters require different 
autotuning schedules. It's not clear whether this is a good or bad thing; for 
µTVM, parameter access time may differ when loading from flash vs RAM, so 
separating the autotuning schedules is actually desirable.

W3 is a fairly high-level API change for such a specific feature. It also means 
that, unlike with W2, the option is not propagated to target-level codegens. 
Those codegens then need to rely on other signals (e.g. checking for the 
presence of a `__lookup_linked_param` TIR function) to identify a 
linked-parameter situation.

W4 is a reasonable choice, but it would not invalidate autotuning schedules, 
and it is a bit odd since, at present, linked parameters are not implemented as 
a TIR pass. One could envision the implementation moving into a TIR pass, 
though, so it's up for debate.

### Future Directions

This RFC doesn't tackle a number of challenges with pre-linking parameters, 
such as:

* Specifying a section for parameters
* Pinning each parameter to a specific memory location
* Supporting heterogeneous execution scenarios (e.g. offloading some parameters 
to BYOC)

In the future, additional configuration may be needed per parameter (e.g. 
section specifications, specific address pinning, etc.). This could be done by 
expanding the `LinkedParamNode` class implemented in the PoC PR. It may instead 
be desirable to place this information as an IRModule-level attribute. In a 
world where some parameters are linked using external BYOC codegen, parameters 
could either be omitted or, better, marked as such using `LinkedParamNode`.





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-linked-parameters-for-cpu-targets/8452/1) to respond.
