Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)

2021-12-14 Thread wrongtest
@Lunderberg Hi, I am very interested in `transform_layout`, but my team depends entirely on the TensorIR schedule instead of TE. Could you kindly provide more design details on the TensorIR side? It would be great if we could enjoy this preview feature in TensorIR; it would be really useful for us.

We have implemented some TensorIR primitives that serve similar purposes, in the form below, to mutate the `Buffer` object's layout, strides and dtype.
```python
s.replace_buffer_obj(block, write_buffer_idx, *rewrite_callbacks)  # a set of rewrite callbacks
```
Since all buffer accesses are generally multi-dimensional in the TensorIR schedule phase, the implementation is a bit easier than in TE (just something like a pass to substitute the buffer object), if no extra representative variables are introduced. Would `transform_layout` also look like the following?
```python
s.transform_layout(block, buffer_idx, remap_func)
```
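For example, a purely illustrative guess at a `remap_func` that rewrites an NCHW buffer into an NCHW4c-style layout (whether `transform_layout` accepts exactly this lambda form on the TensorIR side is an assumption on my part):
```python
# hypothetical index-remapping callback: maps old NCHW indices to NCHW4c indices
remap_func = lambda n, c, h, w: (n, c // 4, h, w, c % 4)
```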

Another form we use is just a dual of the loop transformation primitives, where representative variables for a buffer or a buffer axis are tracked in the TensorIR schedule state.
```python
n, c, h, w = s.get_buffer_axes(block, buffer_idx)
c_outer, c_inner = s.split_for_buffer(c)
s.reorder_for_buffer(n, c_outer, h, w, c_inner)

# would `transform_layout` look like this?
buffer_rv = s.get_buffer(some identifier)
new_buffer_rv = s.transform_layout(buffer_rv, remap_func)
```
Is it possible to provide both an integrated `transform_layout` primitive and step-by-step primitives, for users' convenience?

Very glad to know your opinions!  :)

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-993336900

Re: [apache/tvm-rfcs] [RFC] Introducing DeclBuffer (PR #70)

2022-05-11 Thread wrongtest
Thanks a lot! I think we can then handle buffer-related issues in customized passes in a more explicit and robust way.

I have one question on TIR script: for certain algorithms in DL workloads, users may want to write non-S-TIR-formed script like
  ```python
  x = T.allocate((), "int32", "")
  x[()] = 0
  while x[()] < 128:
      x[()] = x[()] + 1
      # ...
  ```
   Could the parser still support writing things like that (though the underlying IR structure has changed), instead of
   ```python
   x_data = T.allocate((), "int32", "")
   x = T.decl_buffer(data=x_data, ...)
   x[()] = 0
   # ...
   ```

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/70#issuecomment-1123341991

Re: [apache/tvm-rfcs] [RFC] Introducing DeclBuffer (PR #70)

2022-06-08 Thread wrongtest
Reusing `T.alloc_buffer` seems good, as long as there is no ambiguity for the parser implementation :)

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/70#issuecomment-1149935097

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-06-11 Thread wrongtest
Thanks for all the great discussions! It is so exciting that we will have a more powerful ability to handle things like paddings and imperfect tiles.

Since our team relies on the s-tir code path, we are extremely interested in the story on s-tir. I would really appreciate some details on s-tir padding. I would like to use a [127, 127, 127] matmul to depict my questions :)

```python
@T.prim_func
def matmul(A: T.Buffer[(127, 127), "float32"], B: T.Buffer[(127, 127), "float32"], C: T.Buffer[(127, 127), "float32"]):
    for i, j, k in T.grid(127, 127, 127):
        with T.block("compute"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = 0.0
            C[vi, vj] += A[vi, vk] * B[vk, vj]
```

In the current s-tir state, we can construct padded loops and buffers using existing primitives via a "split and then fuse" trick:
```python
s = tvm.tir.Schedule(matmul)
blk = s.get_block("compute")
i, j, k = s.get_loops(blk)
s.fuse(*s.split(i, factors=[4, 32]))
s.fuse(*s.split(j, factors=[4, 32]))
s.fuse(*s.split(k, factors=[4, 32]))
s.transform_layout(blk, "A", lambda i,k: ((i // 32) * 32 + i % 32, (k // 32) * 
32 + k % 32))
s.transform_layout(blk, "B", lambda k,j: ((k // 32) * 32 + k % 32, (j // 32) * 
32 + j % 32))
s.transform_layout(blk, "C", lambda i,j: ((i // 32) * 32 + i % 32, (j // 32) * 
32 + j % 32))
```
We will get (after simplification):
```python
@T.prim_func
def func(A: T.Buffer[(128, 128), "float32"], B: T.Buffer[(128, 128), "float32"], C: T.Buffer[(128, 128), "float32"]):
    for i_0_i_1_fused, j_0_j_1_fused, k_0_k_1_fused in T.grid(128, 128, 128):
        with T.block("compute"):
            vi = T.axis.spatial(127, i_0_i_1_fused)
            vj = T.axis.spatial(127, j_0_j_1_fused)
            vk = T.axis.reduce(127, k_0_k_1_fused)
            T.where(i_0_i_1_fused < 127 and j_0_j_1_fused < 127 and k_0_k_1_fused < 127)
            T.reads(A[vi, vk], B[vk, vj])
            T.writes(C[vi, vj])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]
```
Then the only thing left is the condition for padding: `T.where(i_0_i_1_fused < 127 and j_0_j_1_fused < 127 and k_0_k_1_fused < 127)`. I believe we now get to the point of the current RFC about the over-computation vs. branching tradeoff. Below are some of my questions:

1. What happens when we change to `s.transform_layout(..., pad_value=0)`? (if we want over-computation)
   - (possible behavior 1) Insert padding-filling code as a producer block of `compute` (a rough sketch of this behavior is given after these questions).
     - Since the effect is immediate, maybe we do not need `BufferConstraint` annotations afterwards?
   - (possible behavior 2) Annotate buffers and let lowering passes handle it.
     - We may require `BufferConstraint` to direct the lowering passes.
   - (possible behavior 3) Pass `BufferConstraint` upwards into the graph level.
     - Thus assume the param buffer matches the constraint, and do not write edge values.
   
2. For (1.2)(1.3), it seems encoding the `BufferConstraint` into the buffer object is not the only choice.
   - For s-tir, correct me if I am wrong, but at least for common cases the constraint could be treated as local with respect to the transformed block. What if we encode the constraint just into the block, as part of its memory access properties?
     We found previously that the block memory annotations `T.reads`/`T.writes` (`BufferRegion`) have the limitation that they lose conditional access information. Maybe we can also combine `BufferConstraint` with `BufferRegion`?
   - For graph-level annotations, IIUC, the graph level uses "Tensor"-typed values instead of "Buffer" conceptually. Maybe we still need another construct instead of a `Buffer` with a `BufferConstraint` field?
     We could also consider instantiating graph-level transformations explicitly. This is our current solution: https://discuss.tvm.apache.org/t/introducing-ty-nnp-backend-with-end2end-tensorir-integration/11807/4.
   - Nevertheless, if we finally decide to extend the buffer node structure, I hope we can have an explicit lifetime for the `BufferConstraint` in TIR lowering, so that storage-related passes afterwards are not affected, especially customized passes developed by vendors.

3. For the reduce axis padding mentioned in https://github.com/apache/tvm-rfcs/pull/77#discussion_r894899301:
   - At the TIR level, since schedule primitives should preserve semantic correctness, how do we prove that the padding along the `k` dimension must be zero, especially when we do not know that the op is a "matmul" in general? I think this is important if we want to use padded `transform_layout` in auto-scheduling applications.
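For concreteness, here is the rough hand-written sketch referenced in question 1 (my own assumption, not necessarily the RFC's actual output) of what possible behavior 1 might produce for buffer `A` alone: an explicit producer block materializes the padded copy immediately, so no `BufferConstraint` annotation would be needed afterwards.
```python
@T.prim_func
def matmul_padded_a(A: T.Buffer[(127, 127), "float32"],
                    B: T.Buffer[(127, 127), "float32"],
                    C: T.Buffer[(127, 127), "float32"]) -> None:
    # padded copy of A; the pad region is written explicitly by the producer block
    A_pad = T.alloc_buffer([128, 128], dtype="float32")
    for i, k in T.grid(128, 128):
        with T.block("A_pad"):
            vi, vk = T.axis.remap("SS", [i, k])
            A_pad[vi, vk] = T.if_then_else(
                vi < 127 and vk < 127, A[vi, vk], T.float32(0), dtype="float32"
            )
    for i, j, k in T.grid(127, 127, 127):
        with T.block("compute"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A_pad[vi, vk] * B[vk, vj]
```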

cc @Lunderberg @tqchen @vinx13 @Hzfengsy 

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1152928725

Re: [apache/tvm-rfcs] [RFC] Adding initial SVE implementation (#18)

2022-07-02 Thread wrongtest
Hi~ here are my two questions :) 
cc @kparzysz-quic 

- > 2\. Make vector length a parameter to `stage.vectorize`.

  What is the difference between
  - `sch[C].vectorize(v, vector_length=32)` and
  - `vo, vi = sch[C].split(v, 32)` followed by `sch[C].vectorize(vi)`?

  It seems that we could also choose to properly lower the split's predicate to reach the same goal as proposed. For example, the machinery introduced in RFC https://github.com/apache/tvm-rfcs/pull/77 may help?


- > 3\. Introduce "predicate" to `BufferLoad` and `BufferStore`.

  Our team also got confused about how to represent predicated loads/stores when, several months ago, upstream upgraded `T.load`/`T.store` (which had a 1-D predicate field) to `BufferLoad`/`BufferStore`. Now that `BufferLoad`/`BufferStore` are multi-dimensional, would the predicate also become multi-dimensional?

  Another concern is whether embedding a predicate into `BufferLoad`/`BufferStore` increases the complexity of (or breaks) buffer-region-related analysis in existing implementations. Could we leverage `T.select(pred, A[...], undef)` to represent `A[..., pred]`, or just match the predicated memory access pattern like `if (pred) C[...] = ...`?

Thanks!







-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/18#issuecomment-1173003679

Re: [apache/tvm-rfcs] [RFC] Relax Upstreaming (PR #89)

2022-10-05 Thread wrongtest
At Intellif, people have built, maintained and extended the DL compilation stack with Relay over the past years. However, we do not think the upstreaming of a new module would break existing functionalities or cause confusion; rather, it is a huge opportunity to solve many technical issues that have proven not so easy to handle in Relay, which are already emphasized in the discussion thread.

From my perspective the TVM community is a very inclusive community. We do have modules with a certain amount of overlapped functionality co-existing without much debate. As examples we can see the different runtime implementations for Relay, TE-schedule and TensorIR-schedule, Ansor and meta-schedule, etc. I hope it will also not be a problem for the graph AST infra.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1268210986

[TVM Discuss] [Development/RFC] Yet another dense op combine strategy

2020-06-30 Thread wrongtest via TVM Discuss


Hello there. The idea is essentially the same as the existing IR pass described in https://discuss.tvm.ai/t/discussion-new-ir-pass-proposal-combineparalleldense/3813 by @jonso. Many sequential network structures conduct a group of matmul operations on the same input tensor, such as:

- gate projections on state within GRU/LSTM 
- Q/K/V projections on input within transformer layer

Thanks to the `CombineParallelDense` pass, such operations can be combined to fully utilize the performance of matmul kernels.

The currently implemented strategy is to transform multiple matmuls into a batched matmul op:

- before:

  Y_1: [M, N] = matmul(X: [M, K], W_1: [K, N]), ..., Y_B = matmul(X, W_B: [K, N])

- after:

  Y: [B, M, N] = batch_matmul(stack(X, ..., X), stack(W_1, ..., W_B))

However, there seems to be another, simpler choice: just combine them into one matmul instead of a batched matmul, which also works even with different output channel sizes (a quick numpy check follows below):

- before:

   Y_1 = matmul(X: [M, K], W_1: [K, N_1]), ..., Y_B = matmul(X, W_B: [K, N_B])

- after:

   Y: [M, N_1 + N_2 + ... + N_B] = matmul(X, concat(W_1, ..., W_B, axis=1))
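As a quick sanity check of the algebra (plain numpy, not TVM code): concatenating the weights along the output-channel axis reproduces the separate matmul results.

```python
import numpy as np

# minimal check of the proposed combination for different output channel sizes
M, K = 8, 16
Ns = [4, 5, 7]                                   # N_1 .. N_B
X = np.random.randn(M, K).astype("float32")
Ws = [np.random.randn(K, n).astype("float32") for n in Ns]

Y_separate = [X @ W for W in Ws]                 # B independent matmuls
Y_combined = X @ np.concatenate(Ws, axis=1)      # one combined matmul
Y_split = np.split(Y_combined, np.cumsum(Ns)[:-1], axis=1)

for a, b in zip(Y_separate, Y_split):
    np.testing.assert_allclose(a, b, rtol=1e-5)
```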

Since matmul and batch_matmul are different op implementations, the performance of the combined op may differ. The output layouts are also different, which may affect the performance of downstream ops.

We can conduct some comparisons between matmul and the equivalent batch_matmul with a fixed LHS matrix. Using cuBLAS as a reference, I find that a single `cublasSgemm` call is significantly faster than `cublasSgemmStridedBatched` in certain circumstances with small B (typically 3).

The proposed strategy can be an option in the current `CombineParallelDense` pass, and I think the basic implementation logic will highly resemble `CombineParallelConv2d`. The `CombineParallelDense` pass could then select the better strategy between the two for more performance benefit.





---
[Visit Topic](https://discuss.tvm.ai/t/yet-another-dense-op-combine-strategy/7126/1) to respond.



[Apache TVM Discuss] [Development/RFC] [RFC] Differentiable tensor expression (Create and verify backward op automatically)

2020-09-21 Thread wrongtest via Apache TVM Discuss


As there are more and more demands on TVM's training support, one of the most tedious but important pieces of work is to write backward implementations for operators. It would be of great benefit if we could provide automation tools to help this process. Such a tool can serve two functions:

- Automatically create a backward definition from a forward definition
- Check gradients given forward and backward definitions

Traditional deep learning frameworks (except perhaps Theano :wink:) conduct auto back-propagation at the op graph level, that is, they have to implement one backward op for each forward op. Theoretically they need as many backward op definitions as they have forward ops.

For TVM, however, there is an opportunity to conduct back-propagation at the tensor expression level. There are far fewer tensor expression operations than operators in the whole neural network operator set, so this would greatly reduce human work at the higher level (relay ops).

### Backward tensor expression generator
#### Interface
Since a tensor expression defines how to compute the output from the inputs symbolically, we can just try to apply the back-propagation rule to it. E.g., we can provide a utility interface like
```python
def auto_backprop(inputs: List[Tensor], output: Tensor) -> (List[Tensor], Tensor):
    """
    Given the input tensor list and the output tensor, generate the backward computation.
    - The generated computation takes a placeholder for the gradient with respect to
      the original output (plus any other necessary original tensors) as its inputs.
    - Its outputs are the gradients with respect to each of the original inputs.
    """
    pass
```
Now if we have already defined some forward computation, then we can extract a 
"default" backward computation definition:
```python
x = te.placeholder((n, k))
y = te.placeholder((m, k))
z = te.compute((n, m), ...)
  
((grad_x, grad_y), grad_z_placeholder) = te.auto_backprop((x, y), z)
sched = te.create_schedule(grad_x.op)
# Do schedule and tune backward ops...
```

The transformation should happen before `create_schedule()`, since the forward and backward definitions are generally different and may not share the same optimization strategies.

We can wrap this sort of utility in topi and relay, where we try our best to provide default backward op definitions automatically without hand-written definitions. Some pros and cons are listed below:

- Pros
  - Avoids hand-written work for at least some portion of operations.
  - Auto-generated definitions may be more robust on boundary behaviors and corner cases.
- Cons
  - It is not all-powerful. Not all operators can be differentiated automatically.
  - Some optimization hints may be lost (the backward of matmul is also a matmul, the backward of conv2d is also a conv2d).
 
#### Transformation logic
At the beginning we may just focus on `te.compute()`, and not support tensor intrinsics / hybrid / extern.
- ```te.compute()```
  - Use a simple matmul as an example:
    ```python
    te.compute((m, n), lambda i, j: tvm.sum(data[i, k] * weight[j, k], axis=k))
    ```
    If we want to compute the gradient with respect to `weight[w1][w2]`, we have to know how the output is related to this weight position. Thus we "remap" the iter vars related to weight:
    ```python
    j = w1, k = w2
    ```
    Then all iter vars in the compute expression can be represented with [w1, w2] via affine transformations:
    ```python
    tvm.sum(data[i, w2] * weight[w1, w2], axis=..)  # for i, j = w1
    ```
    `i` is a free variable here; each `weight[w1, w2]` contributes to `output[i, w1]` for every feasible `i`. For each `i`, the gradient of `tvm.sum(...)` with respect to `weight[w1, w2]` is `data[i, w2]`. According to the chain rule, the gradient of the loss with respect to `weight[w1, w2]` can be computed as (a hand-written TE sketch of this result is given after this list):
    ```python
    tvm.sum(data[i, w2] * grad_output[i, w1], axis=i)
    ```
  - The actual back-propagation logic should carefully handle iter var relationships. For each occurrence in the expression of the target tensor we are differentiating with respect to, the feasible integer set of each free iter var is inferred from the iter var remapping. With the free vars fixed, we compute the gradient expression of the output expression with respect to the target tensor position. Finally, the chain rule is applied to sum the gradient expression over each free var's feasible set. Unsupported cases should be detected explicitly.

- ```te.scan()``` is also an interesting operation that would be valuable to support for back-propagation, with which we can get backward implementations of RNN/LSTM/GRU directly.
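Here is the hand-written TE sketch referenced above, showing what the generated gradient for the matmul example could look like if spelled out directly (the shapes and the name `grad_output` are my own assumptions for illustration):

```python
import tvm
from tvm import te

# forward: output[i, j] = sum_k data[i, k] * weight[j, k]
n, m, kdim = 128, 128, 64
data = te.placeholder((n, kdim), name="data")
weight = te.placeholder((m, kdim), name="weight")
k = te.reduce_axis((0, kdim), name="k")
output = te.compute((n, m), lambda i, j: te.sum(data[i, k] * weight[j, k], axis=k), name="output")

# placeholder for d(loss)/d(output), i.e. the incoming gradient
grad_output = te.placeholder((n, m), name="grad_output")

# chain rule: d(loss)/d(weight[w1, w2]) = sum_i data[i, w2] * grad_output[i, w1]
i = te.reduce_axis((0, n), name="i")
grad_weight = te.compute(
    (m, kdim),
    lambda w1, w2: te.sum(data[i, w2] * grad_output[i, w1], axis=i),
    name="grad_weight",
)

sched = te.create_schedule(grad_weight.op)
```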

### Gradient checking between forward && backward ops
Given a forward and backward implementation pair, we can verify correctness with approximate gradients. This helps developers detect implementation errors in general and corner cases. One such method is well described in https://datascience-enthusiast.com/DL/Improving_DeepNeural_Networks_Gradient_Checking.html
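A minimal sketch of such a finite-difference check in plain numpy (the matmul forward/backward pair and shapes are assumptions for illustration):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-4):
    # central finite-difference approximation of d f(x) / d x, elementwise
    g = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        orig = x[idx]
        x[idx] = orig + eps; f_plus = f(x)
        x[idx] = orig - eps; f_minus = f(x)
        x[idx] = orig
        g[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return g

X = np.random.randn(4, 3)
W = np.random.randn(5, 3)
dY = np.random.randn(4, 5)                        # upstream gradient
analytic_dW = dY.T @ X                            # backward of Y = X @ W.T
numeric_dW = numeric_grad(lambda w: np.sum((X @ w.T) * dY), W)
np.testing.assert_allclose(analytic_dW, numeric_dW, rtol=1e-3, atol=1e-3)
```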






[Apache TVM Discuss] [Development/RFC] [RFC] Differentiable tensor expression (Create and verify backward op automatically)

2020-09-21 Thread wrongtest via Apache TVM Discuss


Glad to see autodiff is already in progress! I think this RFC can be withdrawn since this is exactly what autodiff is doing.

Now I am very curious about the current progress of autodiff, with some questions:
- If I have a common neural network structure such as ResNet-50 at hand, can I just use autodiff to get the backward computation graph?
- Is there some description of the common ops that can be covered by autodiff?
- Can `te.scan()` be supported?





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-differentiable-tensor-expression-create-and-verify-backward-op-automatically/7960/3) to respond.



[Apache TVM Discuss] [Development/pre-RFC] Introducing TY-NNP backend with end2end TensorIR integration

2021-12-31 Thread wrongtest via Apache TVM Discuss


Hi, all~

This RFC is to upstream the support for our TY-NNP accelerator backend. We are from the AI accelerator toolchain team of [Intellifusion](https://www.intellif.com/), which has been focusing on developing vision processors that accelerate deep neural networks for visual recognition and search on endpoints, such as IP cameras and robots, as well as in the cloud.

Nowadays, TVM has become the most important component in our AI software stack and we would like to contribute our work back upstream. We believe participating in the open-source ecosystem will benefit both our internal software infrastructure and our customers!

# Overall architecture

TY-NNP refers to the neural network accelerator architecture serving a wide range of our edge AI scenarios. TY-NNP follows a typical NPU design, offloading neural network computation workloads to various kinds of domain-specific computing units. Generally, there are three kinds of computing units:

* NU (neural units)

NU is designed for high-throughput computation of typical neural-network workloads such as Conv/Matmul. Compared to TensorCores in NVIDIA GPUs, NU works in a coarse-grained fashion from a software perspective. Instead of software programming of fine-grained M * N * K mma intrinsics, NU provides CISC-style instructions and a bundle of hardware configurations to developers. The NU components automatically load input/weight data from input buffers, execute fine-grained mma operations with hardware tiling control, and store results to output buffers.

In TVM, we program NU with customized TIR intrinsics. Developers should use 
schedules to lower the specified computation patterns to NU intrinsics, arrange 
the on-chip input/output buffers, and perform tuning to determine the best 
hardware configurations.

* VU (vector units)

VU accelerates general computation workloads which cannot fit NU. TY-NNP provides a set of on-chip VU cores, each with its own on-chip buffer (called VM) and a set of vectorized/scalar function units and physical registers. VU programming is just like general vectorized programming on CPUs.

In TVM, to offload the computation to VU, developers should schedule the 
computations into vectorizable form, arrange the on-chip input/output buffers, 
and mark the proper computation axis with `vectorize` or replace it with VU 
intrinsics.

* CU (control units)

CU can be seen as a small on-chip core and does not provide high computation ability. It aims to control the on-chip execution flow, and the whole on-chip kernel execution will start from the CU.

TY-NNP uses an explicitly managed memory hierarchy: each computing unit has its own buffer, and there is a global on-chip buffer (called DM) to transfer data between units. Data transfer is done explicitly by asynchronous DMA operations, and explicit/implicit synchronizations are used to avoid hazards. In TVM, DMA and synchronization are also represented by TIR intrinsics.

An off-chip storage (called DDR) is used to transfer data between host and device; it has much larger capacity than the on-chip buffers and supports dynamic memory allocation. In TVM, the DDR storage just corresponds to the storage scope `kGlobal` and is managed by the runtime.

# Implementation design

The current TVM compilation stack for TY-NNP is as follows:

### Relay level

* We use a fusion pass based on a dedicated hardware cost model. Beyond traditional heuristic-based fusion for `conv-bn-relu`-like patterns, it performs a much more aggressive strategy to merge multiple anchor ops like conv into a single device kernel. This brings opportunities to schedule multiple anchor ops simultaneously, which we think is essential to saturate our NPU hardware.
* A schedule-aware layout rewrite mechanism is added. Our TIR schedule phase rewrites tensor layouts to fit hardware features, so we modify the compile engine to give a chance for compatible updates at the Relay level.

### TIR level

A key difference from the current CPU/GPU design is that we try to schedule & tune blocks of multiple ops. It is OK to compute a single heavy op in a single kernel on a GPU device, but we think an NPU may prefer to launch a block of consecutive ops to avoid frequent kernel launches. The fusion pass described above is a way to achieve this.

Also, since the main efforts of the TVM community are on the CPU/GPU backends, there do exist pain points when developing TIR support for an NPU-style backend. It took us some struggle to make it work through the standard schedule -> lower flow.

* We use TensorIR schedule 
(https://discuss.tvm.apache.org/t/rfc-tensorir-a-schedulable-ir-for-tvm/7872) 
to schedule the computations. **This is the first trial of TensorIR schedule on 
NPU infrastructures as far as we know.**
* A set of new schedule primitives are added to utilize hardware features.
* A set of new tir passes are added to utilize hardware features.
* We use `device_scope` att

[Apache TVM Discuss] [Development/pre-RFC] Introducing TY-NNP backend with end2end TensorIR integration

2022-01-04 Thread wrongtest via Apache TVM Discuss


Thanks for your comments:)

[quote="areusch, post:3, topic:11807"]
could you say more here? is this a Relay-level thing or a TIR thing? presuming 
you’ve implemented this as a pass, how do you plan to ensure that the 
Relay-level pass makes the same scheduling decision as the TIR pass?
[/quote]

Perhaps I can use a toy Conv2d example to describe it:

fn (%arg0: Tensor[(1, 32, 224, 224), int8], %nn.conv2d_arg: Tensor[(32, 3, 7, 7), int8]) {
  %conv_fn = fn (%data: Tensor[(1, 3, 224, 224), int8], %weight: Tensor[(32, 3, 7, 7), int8], Primitive=1) {
    nn.conv2d(%data, %weight, padding=[1, 1, 1, 1], kernel_size=[7, 7], out_dtype="int32")
  };
  %conv_fn(%arg0, %nn.conv2d_arg)
}

and the corresponding PrimFunc for the primitive call `%conv_fn` would be like
```python
@T.prim_func
def main(x: T.Buffer[...], weight: T.Buffer[(32, 3, 7, 7), "int8"], y: T.Buffer[...]) -> None:
    # body
```
Assume that, to utilize the specific hardware, we want to arrange the I/O channels into 4*4 tiles. There are two extra notes:
- We do not know the "best" weight layout until a TIR schedule/tuning is done.
- The required layout is outside the scope of common representations like "OIHW", "OHWI", etc.

The TIR schedule part would do the following transformations on `weight`:

```python
o, i, h, w = s.get_read_buffer_axes(conv_block)
o_outer, o_inner = s.buffer_split(o, factor=4)  # [32, 3, 7, 7] -> [8, 4, 3, 7, 7]
i_outer, i_inner = s.buffer_split(i, factor=4)  # [8, 4, 3, 7, 7] -> [8, 4, 1, 4, 7, 7]
s.buffer_reorder(o_outer, o_inner, i_outer, i_inner, h, w)  # [8, 4, 1, 4, 7, 7] -> [8, 1, 4, 4, 7, 7]
```

Above we use a set of extended TensorIR primitives, but they can just be seen as sugar over the ongoing schedule primitive `transform_layout`:
https://github.com/apache/tvm-rfcs/pull/39

The point is that they are not arbitrary index remappings (compared to a general `transform_layout`). We ensure that every such schedule step has an exact equivalent relay transformation.

In the TIR schedule phase, we trace every buffer layout change on the function param buffers (we can do that since these primitives are what we implement), generate the transform (and reverse transform) in relay for each step, and finally compose them into single layout transform (and reverse transform) functions in relay.

For the example used here, it would be:

- `s.buffer_split(o, factor=4)`
  - x -> relay.reshape(x, [-1, 4, 3, 7, 7])
  - (reverse) x -> relay.reshape(x, [32, 3, 7, 7])

- `s.buffer_split(i, factor=4)`
  - x -> relay.reshape(relay.nn.pad(x, [..., (0, 1), ...]), [8, 4, -1, 4, 7, 7])
  - (reverse) x -> relay.strided_slice(relay.reshape(x, [8, 4, 4, 7, 7]), begin=..., end=...)

- `s.buffer_reorder(...)`
  - x -> relay.transpose(x, [...])
  - (reverse) x -> relay.transpose(x, [...])

Finally, all transforms (and reverse transforms) are composed into two `relay.Function` objects to rewrite relay-level layouts, which accept the original relay params and return the updated params as a tuple:

fn (%p0: Tensor[..., int8], %p1: Tensor[(32, 3, 7, 7), int8]) {
  %0 = reshape(%p1, newshape=[...]);
  %1 = nn.pad(%0, pad_width=[...]);
  %2 = reshape(%1, newshape=[...]);
  %3 = transpose(%2, axes=[...]);
  (%p0, %3)
}

and the reverse direction is:

fn (%p0: Tensor[..., int8], %p1: Tensor[(8, 4, 1, 4, 7, 7), int8]) {
  %0 = transpose(%p1, axes=[...]);
  %1 = reshape(%0, newshape=[...]);
  %2 = strided_slice(%1, begin=[...], end=[...], strides=[...]);
  %3 = reshape(%2, newshape=[32, 3, 7, 7]);
  (%p0, %3)
}
   

A relay pass can now perform a "pre"-schedule for each primitive function, fetch the layout transform functions from the schedule result, and perform the relay-level layout update. Finally, an extra `FoldConstant` pass can typically eliminate all the extra transformations outside the primitive calls.

fn (%arg0: Tensor[(1, 32, 224, 224), int8], %nn.conv2d_arg: Tensor[(32, 3, 7, 7), int8]) {
  %0 = reshape(%nn.conv2d_arg, newshape=[...]);
  %1 = nn.pad(%0, pad_width=[...]);
  %2 = reshape(%1, newshape=[...]);
  %3 = transpose(%2, axes=[...]);
  %conv_fn = fn (%data: Tensor[(1, 3, 224, 224), int8], %weight: Tensor[(8, 4, 1, 4, 7, 7), int8], Primitive=1, DevicePrimFuncKey=873487) {
    %4 = transpose(%weight, axes=[...]);
    %5 = reshape(%4, newshape=[...]);
    %6 = strided_slice(%5, begin=[...], end=[...], strides=[...]);
    %7 = reshape(%6, newshape=[32, 3, 7, 7]);
    nn.conv2d(%data, %7, padding=[1, 1, 1, 1], kernel_size=[7, 7], out_dtype="int32");
  };
  %conv_fn(%arg0, %3)
}

The actual params are transformed before the call into `%conv_fn`, and the formal params are reverse-transformed within `%conv_fn`'s body. The reason we need the reverse transforms is that we currently cannot represent a "lowered" function call in relay (correct me if I am wrong). It is a workaround for us to keep a valid primitive function body; that is, the relay module after the pass can still be safely evaluated on a CPU.
All things d

[Apache TVM Discuss] [Development/pre-RFC] Introducing TY-NNP backend with end2end TensorIR integration

2022-01-04 Thread wrongtest via Apache TVM Discuss


[quote="areusch, post:3, topic:11807"]
it seems like this could either be integrated into `ci-cpu` or as a separate 
`ci-` image, so long as the binaries are publicly available. do you have an 
estimate of the size of the docker image? also, just for my curiosity, would 
you be able to share a rough timeline of when you’d like to land this?
[/quote]

If a separate image is possible (it can be based on `ci-cpu`), we may prefer that, since future upgrades would not affect `ci-cpu`'s usage. The incremental file sizes would be as below:
- LLVM: an x86+device target build is about 140M; some libLLVM* parts may be unused
- Other toolchains: full toolchain binaries occupy about 500M; simulator-only binaries can be kept under 100M





---
[Visit Topic](https://discuss.tvm.apache.org/t/introducing-ty-nnp-backend-with-end2end-tensorir-integration/11807/5) to respond.



[Apache TVM Discuss] [Development/pre-RFC] Introducing TY-NNP backend with end2end TensorIR integration

2022-01-09 Thread wrongtest via Apache TVM Discuss


@mbs-octoml Hi~ Many thanks for your reply! Here are several questions from me:
1. What does `call_lowered` mean? Does it mean we can now put PrimFuncs and relay functions into the same IRModule and make calls between them?

2. For `VirtualDevice`, it would be the interface that keeps all the information we require across the relay-tir boundary; is my understanding right? Would this be a closed set (including device, mem scope, etc.) or allow third-party extensions?

3. Just out of curiosity, what is the difference between the ongoing `Relax` work and the mechanism described here?





---
[Visit Topic](https://discuss.tvm.apache.org/t/introducing-ty-nnp-backend-with-end2end-tensorir-integration/11807/7) to respond.



[Apache TVM Discuss] [Development] Can we lift tir.AttrStmt value type to ObjectRef?

2022-02-18 Thread wrongtest via Apache TVM Discuss


Schedule annotations of `For` and `Block` are all `Map<String, ObjectRef>`. But certain pragma annotations cannot get lowered to `T.attr`; only expression-typed values are allowed.





---
[Visit Topic](https://discuss.tvm.apache.org/t/can-we-lift-tir-attrstmt-value-type-to-objectref/12118/1) to respond.



[Apache TVM Discuss] [Development] Can we lift tir.AttrStmt value type to ObjectRef?

2022-02-18 Thread wrongtest via Apache TVM Discuss


Hi~ I think this is not an issue of TVMScript. For example, though `List[Integer]` is supported by the script, it would fail in lowering with `Illegal attribute of key pragma_key, value type Array not supported`, since the annotation cannot be converted to an attr stmt.
```python
import tvm
from tvm.script import tir as T

@T.prim_func
def fun(x: T.Buffer((16,), "int32")) -> None:
    for k in T.serial(0, 16, annotations={"pragma_key": [1, 2, 3]}):
        x[k] = 1

print(tvm.lower(fun).script())
```
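For contrast, my understanding (an assumption on my side, please correct me) is that a scalar expression-typed value does lower, e.g.:
```python
import tvm
from tvm.script import tir as T

@T.prim_func
def fun_ok(x: T.Buffer((16,), "int32")) -> None:
    # an integer annotation value becomes a PrimExpr, so it can be lowered to T.attr
    for k in T.serial(0, 16, annotations={"pragma_key": 1}):
        x[k] = 1

print(tvm.lower(fun_ok).script())
```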





---
[Visit Topic](https://discuss.tvm.apache.org/t/can-we-lift-tir-attrstmt-value-type-to-objectref/12118/3) to respond.
