## Motivation

[Arm Compute Library](https://github.com/ARM-software/ComputeLibrary) (ACL) is 
an open source project that provides hand-crafted assembler routines for Arm 
CPUs and GPUs. This integration looks at how we can accelerate CPU 
performance for Arm devices in TVM using ACL. The idea is that by offloading 
select operators from a Relay graph to ACL we can achieve faster inference 
times thanks to these hand-optimized routines. The initial intention is to 
improve performance for FP32 models; with further improvements to the 
integration, this will extend to quantized models and support for a wider 
range of operators.

## Proposal

We have been working on integrating ACL using the BYOC (Bring Your Own 
Codegen) infrastructure, which is itself still under active development. Our 
current implementation uses JSON as a level of abstraction between Relay 
operators and ACL functions (or layers). Here is an overview of the flow from 
compilation to runtime we aim to achieve:

* Front-end graph (currently only NHWC is supported).
* Lower to a Relay graph.
* Run the MergeComposite pass to create a one-to-one mapping of Relay 
operators to ACL functions (a sketch of these partitioning passes follows 
this list).
* Annotate the graph for ACL and partition it (we currently do this without 
the MergeCompilerRegions pass; the reasoning for this will follow shortly).
* Pre-process the parts of the graph destined for ACL to align with expected 
formats.
* Use the codegen stage to convert Relay operators annotated for ACL to JSON.
* Serialize the JSON and constant tensors into `mod.so`.
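
To make these compile-time steps concrete, below is a minimal sketch of the 
partitioning passes using TVM's generic BYOC transforms. The pattern table, 
the "acl" compiler name, and the `partition_for_acl` helper are illustrative 
assumptions rather than the final API:

```python
import tvm
from tvm import relay
from tvm.relay.dataflow_pattern import is_op, wildcard

def conv2d_pattern():
    """Match conv2d with optional bias_add and relu (an optional
    preceding pad can be handled in the same way)."""
    pattern = is_op("nn.conv2d")(wildcard(), wildcard())
    pattern = pattern.optional(lambda x: is_op("nn.bias_add")(x, wildcard()))
    pattern = pattern.optional(is_op("nn.relu"))
    return pattern

# Hypothetical pattern table mapping composite function names to patterns.
acl_patterns = [("acl.conv2d", conv2d_pattern())]

def partition_for_acl(mod):
    """Wrap each supported operator in its own "acl" function.
    Note: MergeCompilerRegions is deliberately not run (see below)."""
    seq = tvm.transform.Sequential([
        relay.transform.MergeComposite(acl_patterns),
        relay.transform.AnnotateTarget("acl"),
        relay.transform.PartitionGraph(),
    ])
    return seq(mod)
```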

*ACL runtime module context*

* Load `mod.so` and deserialize the JSON and constant tensors.
* Create ACL functions from the JSON representation and cache them.
* The cached functions are exposed to the graph runtime as packed functions.

*Running the generated module*

* Run each ACL function as required by the graph runtime, retrieving it from 
the cache and supplying input and output buffers.

### Building with ACL support

The current implementation has two separate build options in CMake. The reason 
for this split is that ACL itself cannot run on an x86 machine; however, we 
still want to be able to compile an ACL runtime module on an x86 machine.

* USE_ACL - Enabling this flag will add support for compiling an ACL runtime 
module.
* USE_GRAPH_RUNTIME_ACL - Enabling this flag will allow the graph runtime to 
compute the ACL offloaded functions when running a module.

We expect typical usage will see USE_ACL enabled on an x86 "host" machine and 
USE_GRAPH_RUNTIME_ACL enabled on an AArch64 device, as in the sketch below.
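
For example, a host-side (x86) compilation might look like the following. This 
is a sketch assuming the hypothetical `partition_for_acl` helper from the 
earlier sketch; the cross-compiler name will vary with your toolchain:

```python
import tvm
from tvm import relay

# mod/params come from a front-end importer,
# e.g. relay.frontend.from_tensorflow.
mod = partition_for_acl(mod)

# Build for an AArch64 target on the x86 host.
target = "llvm -mtriple=aarch64-linux-gnu"
with tvm.transform.PassContext(opt_level=3):
    graph_json, lib, params = relay.build(mod, target=target, params=params)

# Export mod.so with a cross-compiler so it can run on the AArch64 device.
lib.export_library("mod.so", cc="aarch64-linux-gnu-g++")
```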

We include a script under `docker/ubuntu_install_acl.sh` which pulls ACL from 
the GitHub repository and makes it easy to build ACL cross-compiled for 
AArch64 for use within TVM. This script can also be added to the ci_cpu docker 
container build.

### ACL Graph representation

ACL doesn't have its own graph representation and should be viewed as a 
library that computes a single operator and returns the result, rather than 
computing a whole sub-graph. For this reason we offload operators to ACL 
one by one, i.e. we wrap each ACL-supported operator in its own function in 
Relay and offload to the ACL runtime one operator at a time. This is the reason 
we don't run the MergeCompilerRegions pass mentioned above.

As new features are added to BYOC this approach may change in the future.

### Codegen and compilation

The process of offloading partitioned subgraphs to ACL starts at compilation. 
The aim here is to align with the expectations of ACL and convert the Relay 
"sub-graphs" into a format that the ACL runtime understands. First we 
pre-process the functions that the codegen receives. Currently, we make use of 
the ConvertLayout and FoldConstant passes to convert the kernel layout from 
HWIO (TVM's default for NHWC data) to OHWI (the layout ACL expects), as 
sketched below. After this, we codegen JSON from the operators we receive. The 
output of the codegen module is an ACLModule which contains a serialized JSON 
representation of the operators and serialized constant tensors.
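
A sketch of this pre-processing using the standard pass APIs (the helper name 
is ours):

```python
import tvm
from tvm import relay

def preprocess_for_acl(mod):
    """Convert conv2d kernels from HWIO to OHWI for NHWC data, then fold
    the inserted layout_transform ops into the constant weights."""
    seq = tvm.transform.Sequential([
        relay.transform.InferType(),
        relay.transform.ConvertLayout({"nn.conv2d": ["NHWC", "OHWI"]}),
        relay.transform.FoldConstant(),
    ])
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)
```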

### Runtime Support

We implement ACLModule to translate from the JSON produced by codegen to the 
ACL API. When an ACL module is compiled it only contains the JSON operator 
descriptions and constant data necessary for creating each layer in ACL - we 
defer the actual creation of ACL layers until runtime, because ACL is not 
cross-platform. When the graph runtime is created, each ACL layer is 
configured and cached in the module separately. This way we eliminate the 
overhead of repeatedly creating and preparing these layers across multiple 
inferences. ACL also applies a further weight transformation for specific 
convolutions; since these layers are cached, this transformation only occurs 
when creating the runtime.
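
Device-side usage is unchanged from any other graph runtime module; the sketch 
below (with illustrative input names, shapes, and file names) shows how the 
one-time layer configuration is amortized across inferences:

```python
import numpy as np
import tvm
from tvm.contrib import graph_runtime

# On the AArch64 device: load the artifacts produced on the host.
# graph_json is the graph runtime JSON produced by relay.build
# (the file name here is hypothetical).
lib = tvm.runtime.load_module("mod.so")
graph_json = open("mod.json").read()

# Creating the runtime configures and caches each ACL layer once.
module = graph_runtime.create(graph_json, lib, tvm.cpu(0))

module.set_input("data",
                 np.random.uniform(size=(1, 224, 224, 3)).astype("float32"))
for _ in range(10):
    module.run()  # cached ACL layers are reused on every inference
out = module.get_output(0)
```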

Another optimization we implement at this level is to allow ACL to request 
auxiliary memory from TVM. This is working memory that ACL needs to perform 
some operations (currently this only affects convolution). By requesting 
memory directly from the TVM device API we see a performance improvement of 
around 1.2x compared to using ACL-backed memory.

### Operator Support

Currently the integration provides support for the following operators using 
FP32 precision:

* [pad] + conv2d + [bias_add] + [relu], where [] denotes an optional operator.
* maxpool2d
* reshape

This RFC is only intended to be an "initial" integration; further support for 
a wider range of operators will follow. A hypothetical Relay snippet matching 
the composite pattern is shown below.
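
For illustration, a Relay function that matches the full composite pattern 
above might look like this (shapes and names are arbitrary):

```python
import numpy as np
from tvm import relay

data = relay.var("data", shape=(1, 14, 14, 32), dtype="float32")
weight = relay.const(np.random.uniform(size=(3, 3, 32, 32)).astype("float32"))
bias = relay.const(np.zeros((32,), dtype="float32"))

# pad + conv2d + bias_add + relu in NHWC layout, as supported above.
x = relay.nn.pad(data, pad_width=((0, 0), (1, 1), (1, 1), (0, 0)))
x = relay.nn.conv2d(x, weight, kernel_size=(3, 3), channels=32,
                    data_layout="NHWC", kernel_layout="HWIO")
x = relay.nn.bias_add(x, bias, axis=3)
func = relay.Function([data], relay.nn.relu(x))
```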

### Testing

We currently have 4 different types of tests, all of which reside under 
`python/contrib/test_acl`.

* Under `test_operatorname.py` we have 2 types of tests:
  * The first requires that USE_ACL and USE_GRAPH_RUNTIME_ACL are set (or the 
use of a remote device). These tests check that each individual operator runs 
end-to-end and that the output matches that of TVM.
  * The second only requires USE_ACL and simply tests that the codegen JSON 
output is as expected. These tests bypass the ACL runtime.
* `test_network.py` includes end-to-end network tests (currently vgg16 and 
mobilenet). These tests offload only the implemented operators to ACL, with 
the unsupported operators continuing through the regular TVM stack. Again, the 
results are compared against the output of TVM (a condensed sketch of such a 
comparison follows this list).
* `test_runtime.py` tests elements of the runtime that haven't been covered by 
the individual operator tests. Currently this consists of running multiple 
inferences on the same model and testing that multiple operators offloaded to 
ACL work as intended.
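
A condensed sketch of such an end-to-end comparison, reusing the hypothetical 
`partition_for_acl` helper from earlier:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

def verify_acl(mod, params, inputs):
    """Build with and without ACL offloading and compare the outputs.
    Assumes this runs on an AArch64 device (or via a remote device)."""
    outputs = []
    for offload in (False, True):
        test_mod = partition_for_acl(mod) if offload else mod
        with tvm.transform.PassContext(opt_level=3):
            graph_json, lib, p = relay.build(test_mod, target="llvm",
                                             params=params)
        m = graph_runtime.create(graph_json, lib, tvm.cpu(0))
        m.set_input(**inputs, **p)
        m.run()
        outputs.append(m.get_output(0).asnumpy())
    np.testing.assert_allclose(outputs[0], outputs[1], rtol=1e-5, atol=1e-5)
```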

Our hope is that in the future the whole ACL implementation can be checked 
with an AArch64 CI setup.

## Future improvements

The integration in its current form doesn't add support for most of the 
operators ACL provides; it is mostly a proof of concept. Below is a series of 
items we hope to add or improve upon in the near future.

* Support a wider range of operators for FP32 (and FP16).
* Support for quantized operators.
* As BYOC evolves, we may offload whole sub-graphs to ACL (hopefully improving 
performance) and change the way constants are serialized (with the use of the 
upcoming JSON runtime).

Thanks, any thoughts are appreciated.




