We at Arm would like to introduce a new optimization technique to TVM which we 
refer to as 'cascading', which aims to reduce a network's peak memory usage. 
While many of the systems TVM targets today have plentiful memory available, 
for the embedded devices that uTVM aims to target, working memory in the form 
of SRAM is at a premium. This means that if peak memory usage exceeds the size 
of the SRAM, the network simply cannot run.

Cascading works by breaking operations up into a series of smaller sub-ops 
(acting on sub-tensors) which can be executed 'deeply' to avoid having to store 
large intermediate tensors. Let's look at a simple example:

Say there are two NHWC convolutions (with weight format HWIO), one after the 
other: A → B. A has a 3x3 kernel (HxW) and B a 1x1 kernel. In this small 
network, A greatly increases the number of channels and B subsequently 
decreases them, a fairly typical pattern in convolutional networks. Now we 
tabulate the tensors present in this network (sizes assume 8-bit elements):

|Operation|Input Tensor|Weights|Output Tensor|Output Size|
| --- | --- | --- | --- | --- |
|Convolution A|(1, 26, 26, 3)|(3, 3, 3, 32)|(1, 24, 24, 32)|18.4 kB|
|Convolution B|(1, 24, 24, 32)|(1, 1, 32, 8)|(1, 24, 24, 8)|4.6 kB|
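
To see where these sizes come from, here's a quick sanity check (plain Python; 
the 1-byte-per-element assumption is mine, inferred from the table):

```python
import numpy as np

# Element counts at 1 byte per element give the sizes in the table.
print(np.prod((1, 24, 24, 32)))  # 18432 -> 18.4 kB (A's output)
print(np.prod((1, 24, 24, 8)))   # 4608  -> ~4.6 kB (B's output)
```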

We can see that the intermediate tensor (the output of A) is quite large 
compared to the final output tensor. If the available SRAM on the system were 
less than 18.4 kB, we would not be able to execute this network in a 
layer-by-layer fashion. However, producing a single output element of B does 
not require *all* of A's output to be in memory, due to the spatial locality 
of convolutions. We can therefore tile A, which we refer to as 'striping', 
such that it is computed in, for instance, 16 output stripes of shape 
(1, 6, 6, 32). B can act on each stripe to produce part of its output, after 
which the intermediate stripe can be overwritten with the next one. Only 
1/16th of the intermediate memory (~1.2 kB) is then required at any one time, 
significantly decreasing the peak memory of the network and enabling it to be 
run on the device.
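
As a rough illustration, here's how such a cascade can be expressed with 
today's TE scheduling primitives (a minimal sketch, not a proposed API; the 
int8 dtype is assumed so the buffer sizes match the table, ignoring 
accumulation precision):

```python
import tvm
from tvm import te, topi

# Shapes from the example above.
data = te.placeholder((1, 26, 26, 3), dtype="int8", name="data")
kern_a = te.placeholder((3, 3, 3, 32), dtype="int8", name="kern_a")
kern_b = te.placeholder((1, 1, 32, 8), dtype="int8", name="kern_b")

# Two back-to-back NHWC convolutions (stride 1, no padding).
conv_a = topi.nn.conv2d_nhwc(data, kern_a, 1, 0, 1, out_dtype="int8")
conv_b = topi.nn.conv2d_nhwc(conv_a, kern_b, 1, 0, 1, out_dtype="int8")

s = te.create_schedule(conv_b.op)

# Stripe B's output into 6x6 spatial tiles (16 stripes in total)...
n, h, w, c = s[conv_b].op.axis
ho, hi = s[conv_b].split(h, factor=6)
wo, wi = s[conv_b].split(w, factor=6)
s[conv_b].reorder(n, ho, wo, hi, wi, c)

# ...and compute A inside B's stripe loops, so bound inference shrinks
# A's realized region to one stripe rather than the full (1, 24, 24, 32).
s[conv_a].compute_at(s[conv_b], wo)

print(tvm.lower(s, [data, kern_a, kern_b, conv_b], simple_mode=True))
```

This reproduces the striping behaviour, though each stripe's intermediate is 
simply re-allocated rather than rolled, which is where the rolling buffer 
support discussed below comes in.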

While here I consider only the case of two operations, in principle this can 
extend through many. However, I have also ignored a subtlety by making 
convolution B a 1x1 convolution. If the kernel size is larger, there will be 
an overlap in data usage between adjacent stripes. This means it is not 
appropriate to entirely overwrite each stripe (unless you're happy to 
recompute some elements). For this case, rolling buffers should be used to 
store the intermediate stripes; these allow the unneeded data to be partially 
overwritten while retaining the important 'edge' data (see my question on 
what's necessary to make this possible here: 
https://discuss.tvm.apache.org/t/creating-store-at-in-tvm/8083).
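
To illustrate the rolling-buffer idea concretely (a plain-Python sketch with 
assumed sizes, not TVM code): if B had a 3x3 kernel, adjacent 6-row stripes of 
A's output would share two rows, so the intermediate buffer can be indexed 
modulo its height rather than being wholly overwritten:

```python
import numpy as np

STRIPE_H = 6   # rows of A's output produced per stripe
OVERLAP = 2    # rows shared between adjacent stripes for a 3x3 kernel in B
BUF_H = STRIPE_H + OVERLAP

# Rolling buffer for A's output: only BUF_H rows live at once instead of 24.
buf = np.zeros((BUF_H, 24, 32), dtype=np.int8)

def store_row(h, row):
    # Row h of A's output lands at slot h % BUF_H, overwriting only rows
    # that B no longer needs while the shared 'edge' rows survive.
    buf[h % BUF_H] = row

def load_row(h):
    return buf[h % BUF_H]
```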

By making extensive use of this technique, the peak memory usage of several 
well-known networks (e.g. MobileNet and VGG16) can be decreased to around a 
third of the initial value. It is therefore of considerable interest for uTVM, 
where such a decrease can be the difference between supporting a network and 
not.

I've been doing some experiments recently into how this could be implemented 
in TVM and have found that the scheduling language allows me to construct such 
cascades reasonably simply (although it does not support rolling buffers). 
However, there is a conflict in this area: in order to schedule multiple 
convolutions together, they must end up in the same primitive function. 
Further to this, there's currently no concept of 'hierarchical scheduling' in 
TVM, so if we schedule for the cascade we can't also take advantage of the 
schedules from TOPI/AutoTVM. This suggests to me that a two-stage approach 
would be more appropriate: first the cascading to break the ops into sub-ops, 
then the performance scheduling, which could take place on a sub-op by sub-op 
basis.
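
To make that concrete, here is a purely hypothetical sketch of the two-stage 
flow (none of these functions exist in TVM today; the names are illustrative 
only):

```python
def schedule_network(graph, sram_budget):
    # Stage 1 (hypothetical): break ops into sub-ops, choosing stripe
    # shapes so the live intermediate stripes fit within the SRAM budget.
    sub_ops = cascade_into_sub_ops(graph, sram_budget)
    # Stage 2 (hypothetical): apply existing TOPI/AutoTVM performance
    # schedules to each sub-op independently.
    return [schedule_sub_op(sub_op) for sub_op in sub_ops]
```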

In this initial RFC I just wanted to introduce the technique, to see whether 
it's something others have worked on and to gather thoughts on the best way to 
integrate it into the stack. I'd also be interested to know whether anyone 
else would find such a technique useful. We intend to arrive at a more formal 
proposal eventually, but need to see more detail on the new TensorIR before 
then.

If you have any further questions/comments I'd be happy to answer them, thanks!




