To reiterate: my original concern was that the first RFC proposed changes
to the target-independent part of TVM to add support for a very target-specific
feature. However, I do think that we can move this forward in a way that would
be useful overall.
Here is the outline of my thoughts on this. Let me know what you think.
First, a couple of observations:
1. Architectures that support vectors can be assumed to also support vector
predication. I'm talking specifically about masked operations, and in
particular about predicated loads and stores.
2. For ARM/AArch64, it may be beneficial to distinguish vectorization via
fixed-length vectors from vectorization via scalable vectors. If this choice is
to be made by auto-scheduling, it should be expressible in TIR.
What this RFC proposes is very close to allowing vectorization of countable
loops with variable iteration count, and I insist that we keep this in mind as
a goal.
The way that vectorization works right now is that a loop like
```
for (i : [0, 130)) {
  C[i] = A[i] + B[i]
  D[i] = A[i] * B[i]
}
```
will be replaced with statements
```
C[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] + B[Ramp(0, 1, 130)]
D[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] * B[Ramp(0, 1, 130)]
```
The expressions within these statements are all `PrimExpr`, whose type must be
expressible by `DataType`. All parameters in `DataType` are compile-time
integers, which means that a single statement can only represent vectors with a
known number of lanes. In other words, neither variable iteration count (VIC)
nor vector-length-agnostic (VLA) vectorization can be implemented
without some changes. These changes may be in how types are represented in
`DataType`, or in how vectorization is done (or a combination of the two).
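To make the `Ramp` notation concrete, here is a minimal pure-Python model (not TVM code). The helpers `ramp` and `vectorized_add` are hypothetical names introduced only for illustration; the point is that the index vector cannot even be built unless `lanes` is a known integer.

```python
def ramp(base, stride, lanes):
    """Model of Ramp(base, stride, lanes): base, base+stride, ... (lanes entries)."""
    if not isinstance(lanes, int) or lanes <= 0:
        # Mirrors the restriction that the lane count is a compile-time integer.
        raise TypeError("lanes must be a known positive integer")
    return [base + stride * k for k in range(lanes)]


def vectorized_add(a, b, n):
    """Model of C[Ramp(0, 1, n)] = A[Ramp(0, 1, n)] + B[Ramp(0, 1, n)]."""
    idx = ramp(0, 1, n)
    return [a[i] + b[i] for i in idx]


A = list(range(130))
B = [2 * x for x in A]
C = vectorized_add(A, B, 130)  # one "statement" covering all 130 lanes
```

Passing a symbolic count such as `"N"` to `ramp` fails in this model, which is the analogue of the limitation described above.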
We are already considering a special value for `DataType::lanes` that would
represent the yet-unknown vector length (VL). Following Halide's approach to
vectorization, I propose that we change vectorization to take an explicit
vector length as a parameter. As a special case for SVE, the scalable VL could
be represented by the same constant we chose for `DataType::lanes`. For
compatibility with existing code, `stage.vectorize()` would be equivalent to
`stage.vectorize(vector_length=iter_count)`, since currently only loops with
known iteration count can be vectorized. The argument value `vector_length=VL`
would indicate using SVE. With `vectorize(vector_length=32)`, the loop above
would be turned into
```
for (i : [0, (130+31)/32)) {
  // i-th vector is [32*i..32*(i+1))
  C[Ramp(32*i, 1, 32), pred=(Ramp(32*i, 1, 32) < Broadcast(130, 32))] =
    A[Ramp..., pred=...] + ...
  ...
}
```
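The predicated loop can be simulated in plain Python to check that it computes the same result as the scalar loop. This is only a sketch of the semantics, not TVM code: `masked_load`, `masked_store`, and `vectorize_predicated` are hypothetical helpers, and the `pred` list stands in for `Ramp(32*i, 1, 32) < Broadcast(130, 32)`.

```python
def ramp(base, stride, lanes):
    """Model of Ramp(base, stride, lanes)."""
    return [base + stride * k for k in range(lanes)]


def masked_load(buf, idx, pred, fill=0):
    """Read buf[i] only for active lanes; inactive lanes get `fill`."""
    return [buf[i] if p else fill for i, p in zip(idx, pred)]


def masked_store(buf, idx, pred, values):
    """Write values[k] to buf[idx[k]] only where pred[k] is true."""
    for i, p, v in zip(idx, pred, values):
        if p:
            buf[i] = v


def vectorize_predicated(a, b, n, vector_length):
    """Strip-mine [0, n) into vectors of `vector_length` lanes with a tail mask."""
    c = [0] * n
    d = [0] * n
    num_vectors = (n + vector_length - 1) // vector_length  # e.g. (130+31)/32
    for i in range(num_vectors):
        idx = ramp(vector_length * i, 1, vector_length)
        pred = [lane < n for lane in idx]  # Ramp(...) < Broadcast(n, lanes)
        va = masked_load(a, idx, pred)
        vb = masked_load(b, idx, pred)
        masked_store(c, idx, pred, [x + y for x, y in zip(va, vb)])
        masked_store(d, idx, pred, [x * y for x, y in zip(va, vb)])
    return c, d


A = list(range(130))
B = [x + 1 for x in A]
C, D = vectorize_predicated(A, B, 130, 32)
```

In the last vector only two lanes (128 and 129) are active, so neither the loads nor the stores touch memory past the end of the buffers.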
If the loop iteration count changed from a known integer `130` to some
expression `N`, the generated code would remain mostly the same: the structure
does not depend on the fact that `130` is a compile-time constant. Similarly,
the `32` indicating the vector length could be replaced with the predefined
value for "scalable vector length"; the only potential issue is calculating
the iteration count of the `for` loop above. If we were to allow an explicit
"stride" in `For`, the issue would go away (the RFC proposes something like
that).
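As a rough illustration of why an explicit stride helps, the sketch below models a strided loop in plain Python. It is an assumption-laden model, not the RFC's design: `run_strided` is a hypothetical name, and `vl` plays the role of a vector length that is only known at run time. The loop advances by `vl` and never computes a trip count up front.

```python
def run_strided(a, b, n, vl):
    """Model of a For loop with stride `vl` over [0, n), with a predicated tail."""
    c = [0] * n
    base = 0
    while base < n:                     # loop bound is n itself, not ceil(n / vl)
        active = min(vl, n - base)      # number of lanes with index < n
        for k in range(active):         # stands in for one predicated vector op
            c[base + k] = a[base + k] + b[base + k]
        base += vl                      # explicit stride
    return c


A = list(range(130))
B = [10 * x for x in A]
# The same code gives the same result for any vector length.
results = [run_strided(A, B, 130, vl) for vl in (4, 16, 32, 64, 256)]
```

Both `n` and `vl` are ordinary run-time values here, which corresponds to the VIC and VLA cases discussed above.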
To summarize:
1. Introduce `kScalableVectorLaneMark` (as suggested by @tqchen).
2. Make vector length a parameter to `stage.vectorize`.
3. Introduce "predicate" to `BufferLoad` and `BufferStore`.
4. Allow non-unit strides in `For` loops (as per the RFC).