## Background and Motivation
TVM is an end-to-end deep learning compiler with two levels of IR and 
optimization. TVM translates popular DL frameworks into Relay and optimizes the 
computation graph, after which it lowers each graph node into Tensor 
Expression (TE), performs another round of function-level optimization, and 
finally lowers it into TIR and generates backend code. In brief, the current 
workflow is
```
TF/PyTorch/ONNX -> Relay -> TE (based on TE schedule) -> TIR -> C++/CUDA
```
Currently, low-level optimization is done through TE scheduling, which has 
several limitations:
- Scheduling is based on an accessory data structure, the schedule tree. 
Schedule primitives operate on the schedule tree rather than on the TIR itself, 
which makes the scheduling result less intuitive.
- All primitives are coupled to the schedule tree representation, so it is not 
easy to add new primitives while ensuring the correctness of the transformation.
- Support for high-dimension instructions and tensorization is limited. TE is 
not schedulable after tensorization, and the description of tensor intrinsics 
is not user-friendly.

## Introduction
TensorIR is a brand-new low-level IR with full scheduling support. Here are 
some key features and novel techniques.

### Core data structure: Block and Block Realize
To generalize high-dimension tensor expressions, we introduce a new statement 
structure, *Block*. A block can wrap any part of the IR to provide isolation. 
A block is *the minimal unit* for scheduling and tensorization: it stores all 
the fundamental information of the computation, including the block iteration 
vars, the regions the block reads and writes, the buffers allocated inside the 
block, and the critical body statement. A block declares the expression 
computation with its iteration vars and their types.
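
For illustration, a single block in the proposed hybrid syntax could look like 
the sketch below (the buffers `A` and `B` and the element-wise computation are 
assumptions, used only to show the block form):
```python
# A block computing B[i] = A[i] + 1 over a 128-element buffer.
# It declares one spatial iteration var `vi` with domain [0, 128);
# its read region is A[vi] and its write region is B[vi].
with tir.block([128], "B") as [vi]:
    B[vi] = A[vi] + 1.0
```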

### Workload Example
1. First of all, we define the workload (GEMM in this example) with hybrid 
script. Note that the hybrid script directly generates a **TIR** program rather 
than TE stages. (We can auto-complete the loop nesting from the block 
definition by default.)
```python
@tvm.hybrid.script
def matmul(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    C = tir.match_buffer(c, (1024, 1024), "float32")
    A = tir.match_buffer(a, (1024, 1024), "float32")
    B = tir.match_buffer(b, (1024, 1024), "float32")
    reducer = tir.comm_reducer(lambda x, y: x + y, tir.float32(0))

    with tir.block([1024, 1024, tir.reduce_axis(0, 1024)], "C") as [vi, vj, vk]:
        reducer.step(C[vi, vj], A[vi, vk] * B[vk, vj])
```

2. Then create the schedule from **TIR** and tile the loops using primitives.
```python
bn = 32  # tile size used for the splits below
s = tir.create_schedule(matmul)
update = s.get_block("C")
i, j, k = s.get_axes(update)
i_o, i_i = s.split(i, bn)
j_o, j_i = s.split(j, bn)
k_o, k_i = s.split(k, 4)
s.reorder(i_o, j_o, k_o, k_i, i_i, j_i)
```
Result
```python
def func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    # function attr dict
    B = tir.match_buffer(b, [1024, 1024])
    A = tir.match_buffer(a, [1024, 1024])
    C = tir.match_buffer(c, [1024, 1024])
    reducer = tir.comm_reducer(lambda x, y: x + y, tir.float32(0))
    # body
    for i0_outer, i1_outer, i2_outer, i2_inner, i0_inner, i1_inner in tir.grid(32, 32, 256, 4, 32, 32):
        with tir.block([1024, 1024, tir.reduce_axis(0, 1024)], "C") as [vi, vj, vk]:
            tir.bind(vi, ((i0_outer*32) + i0_inner))
            tir.bind(vj, ((i1_outer*32) + i1_inner))
            tir.bind(vk, ((i2_outer*4) + i2_inner))
            reducer.step(C[vi, vj], (A[vi, vk]*B[vk, vj]))
```

3. Vectorize and decompose reduction
```python
s.vectorize(j_i)
s.decompose_reduction(update, j_o)
```
Result
```python
@tvm.hybrid.script
def func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    # function attr dict
    B = tir.match_buffer(b, [1024, 1024])
    A = tir.match_buffer(a, [1024, 1024])
    C = tir.match_buffer(c, [1024, 1024])
    # body
    for i0_outer in range(0, 32):
        for i1_outer_init, i0_inner_init in tir.grid(32, 32):
            for i1_inner_init in range(0, 32, annotation = {"loop_type":"vectorize"}):
                with tir.block([1024, 1024], "C_init") as [vi_init, vj_init]:
                    tir.bind(vi_init, ((i0_outer*32) + i0_inner_init))
                    tir.bind(vj_init, ((i1_outer_init*32) + i1_inner_init))
                    C[vi_init, vj_init] = tir.float32(0)
        for i1_outer, i2_outer, i2_inner, i0_inner in tir.grid(32, 256, 4, 32):
            for i1_inner in range(0, 32, annotation = {"loop_type":"vectorize"}):
                with tir.block([1024, 1024, tir.reduce_axis(0, 1024)], "C_update") as [vi, vj, vk]:
                    tir.bind(vi, ((i0_outer*32) + i0_inner))
                    tir.bind(vj, ((i1_outer*32) + i1_inner))
                    tir.bind(vk, ((i2_outer*4) + i2_inner))
                    C[vi, vj] = (C[vi, vj] + (A[vi, vk]*B[vk, vj]))
```

4. Finally, build and run it
```python
build_func = tvm.build(s.func, target=target)
build_func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), np.matmul(a.asnumpy(), b.asnumpy()), rtol=1e-5)
evaluator = build_func.time_evaluator(build_func.entry_name, ctx, number=1)
```
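
For completeness, a hedged sketch of the setup that the snippet above assumes 
and of reporting the measured time (`target`, `ctx`, and the random inputs are 
assumptions; the shapes follow the 1024x1024 matmul):
```python
import numpy as np
import tvm

# Assumed setup for the snippet above: an LLVM CPU target and random inputs.
target = "llvm"
ctx = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(1024, 1024).astype("float32"), ctx)
b = tvm.nd.array(np.random.rand(1024, 1024).astype("float32"), ctx)
c = tvm.nd.array(np.zeros((1024, 1024), dtype="float32"), ctx)

# time_evaluator returns a callable whose result carries timing statistics.
print("matmul: %.3f ms" % (evaluator(a, b, c).mean * 1e3))
```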

In this example, we change the IR imperatively rather than waiting until the 
end of scheduling (as TE does). It is very important for users and developers 
to see directly what happens during scheduling. Also, at every stage of the 
schedule we can get a *verifiable* IR; that is the major improvement for both 
user experience and correctness proofs.

## Key Features
### Independent scheduling based on IR itself
Different from the TE schedule, TensorIR has a complete set of scheduling 
algorithms that needs neither a schedule tree nor any other extra data 
structure. We will introduce a brand-new set of schedule primitives with full 
backward compatibility with the TE schedule.
This simplifies the compilation workflow and its concepts.
```
TF/PyTorch/ONNX -> Relay -> TIR -> schedule -> TIR -> schedule -> TIR -> C++/CUDA
```
Now there are no stages during the schedule. Rather than lowering a schedule 
into TIR, we directly mutate the TIR itself. This also enables sequential 
scheduling (scheduling a single workload several times), as sketched below.
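
For example, a hedged sketch of two scheduling rounds over the same workload, 
reusing the API from the matmul example above (the second round only 
illustrates that scheduled TIR can be scheduled again):
```python
# First round of scheduling.
s = tir.create_schedule(matmul)
blk = s.get_block("C")
i, j, k = s.get_axes(blk)
i_o, i_i = s.split(i, 32)

# The result is plain TIR again, so a second, independent round can start from it.
s2 = tir.create_schedule(s.func)
blk2 = s2.get_block("C")
i_o2, i_i2, j2, k2 = s2.get_axes(blk2)
j_o2, j_i2 = s2.split(j2, 32)
```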


### Stronger Expressiveness and Optimization Ability
TE has limited expressiveness since each stage is defined by `Stage = 
te.compute(lambda_expr)`, while TensorIR is a full, C-like IR. We can write any 
program we want in TensorIR. Although not every program can be scheduled, many 
more workloads can be optimized with TensorIR.

One of the improved tasks is concatenation: 
#### TE
```python
B = te.compute((30,), lambda i: tvm.tir.if_then_else(
    i < 10, A0[i], tvm.tir.if_then_else(i < 20, A1[i - 10], A2[i - 20])))
```
#### TensorIR:
```python
with tir.block([10]) as vi:
    B[vi] = A0[vi]
with tir.block([10]) as vi:
    B[vi + 10] = A1[vi]
with tir.block([10]) as vi:
    B[vi + 20] = A2[vi]
```
The critical improvement is performance. In TIR we can optimize the program by 
eliminating the `if` branches, which is impossible in the TE schedule.

### Memory and Execution Scope
Hardware accelerators led by GPUs increasingly use hierarchical architectures, 
with both a memory hierarchy (global, shared, local/wmma on NVIDIA GPUs) and an 
execution hierarchy (SM, warp, thread on NVIDIA GPUs). TVM defines the memory 
hierarchy, and TensorIR provides the corresponding native hierarchical blocks 
and execution scopes. Different execution scopes can access different memory 
scopes.

TensorIR natively supports hierarchy checks. We check memory accesses and 
thread bindings, including warp-level instruction (wmma) validation, during the 
schedule. The following is an example of the GPU hierarchy.

```python
for bx in range(0, 32, annotation = {"loop_type":"blockIdx.x"}):
    with block(exec_scope="gpu_block"):
        for ty in range(0, 32, annotation = {"loop_type":"threadIdx.y"}):
            with block(exec_scope="gpu_warp"):
                for tx in range(0, 32, annotation = {"loop_type":"threadIdx.x"}):
                    with block(exec_scope="gpu_thread"):
                        A[i] = B[i] + 1
```
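
Loops annotated like this would typically come from a thread-binding primitive 
on the schedule. A minimal sketch under that assumption (the `bind` primitive, 
its string thread-axis argument, and the `elemwise` workload are not part of 
the example above and may differ in the final API):
```python
# Hypothetical sketch: split a loop and bind the pieces to the GPU hierarchy.
s = tir.create_schedule(elemwise)      # `elemwise` is an assumed 1-D workload
blk = s.get_block("B")
i, = s.get_axes(blk)
bx, tx = s.split(i, 1024)
s.bind(bx, "blockIdx.x")               # produces a loop annotated with blockIdx.x
s.bind(tx, "threadIdx.x")              # produces a loop annotated with threadIdx.x
```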

### High-dimension Scheduling and Tensorization
As more and more backends provide tensor operators and instructions, the 
limitations of the TE schedule show up: it is hard to tensorize a complex 
stage. TensorIR chooses the block (a wrapper around a high-dimension 
computation) as the minimal schedule unit, so we can natively tensorize a 
sub-program and even keep scheduling after tensorization, as sketched below.
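
A hedged sketch of what this could look like for the matmul above, assuming a 
`tensorize` primitive and a pre-registered 16x16x16 `wmma` intrinsic handle 
(both are assumptions, not the finalized API):
```python
# Hypothetical sketch: tile the matmul to match a 16x16x16 tensor intrinsic,
# tensorize the inner tile, and keep scheduling the surrounding loops afterwards.
s = tir.create_schedule(matmul)
blk = s.get_block("C")
i, j, k = s.get_axes(blk)
i_o, i_i = s.split(i, 16)
j_o, j_i = s.split(j, 16)
k_o, k_i = s.split(k, 16)
s.reorder(i_o, j_o, k_o, i_i, j_i, k_i)
s.tensorize(i_i, wmma_sync_intrin)     # assumed intrinsic description
s.reorder(i_o, k_o, j_o)               # the outer loops remain schedulable
```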

### Decouple Primitives and Correctness Proof
Since every primitive directly rewrites the AST (it works like a `StmtMutator`), 
the primitives are naturally decoupled from one another. Developers can add a 
schedule primitive as easily as adding a pass. It is also easy to ensure that 
the scheduled program behaves the same as the original one, by proving the 
correctness of each primitive and checking the intermediate IR during the 
schedule.

### Round-trip Python Hybrid Syntax (Already upstream)
Hybrid script enables developers to write TensorIR in Python syntax, one of the 
most popular languages. It also provides a way to store the IR after, or at any 
point during, the schedule. Please see the details at 
https://discuss.tvm.apache.org/t/rfc-hybrid-script-support-for-tir/7516
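
A minimal round-trip sketch, assuming the printer proposed in that RFC (the 
exact function name is taken from the linked proposal and may differ):
```python
# The @tvm.hybrid.script decorator parses Python source into TIR; the printer
# turns (possibly scheduled) TIR back into hybrid script text that can be
# stored and parsed again later.
script = tvm.hybrid.ashybrid(s.func)   # assumed printer name; see the RFC above
print(script)
```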


## Migration Plan
1. Upstream TensorIR data structure.
2. Upstream TensorIR schedule primitives.
3. Support AutoTVM/Ansor on TensorIR.

Co-authors: @spectrometerHBH @tqchen @junrushao1994 

I appreciate the discussion and advice from @merrymercy




