Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)
@Lunderberg Hi, I am very interested in `transform_layout`, but my team depends entirely on the TensorIR schedule instead of TE. Could you kindly share more design points on the TensorIR side? It would be great if we could enjoy this preview feature in TensorIR; it is really useful for us.

We have implemented some TensorIR primitives that serve a similar purpose, mutating the `Buffer` object's layout, strides, and dtype, in the form below:

```python
s.replace_buffer_obj(block, write_buffer_idx, *a set of rewrite callbacks)
```

Since all buffer accesses are generally multi-dimensional in the TensorIR schedule phase, the implementation is a bit easier than in TE (essentially a pass that substitutes the buffer object), as long as no extra representative variables are introduced. Would `transform_layout` also look like the above?

```python
s.transform_layout(block, buffer_idx, remap_func)
```

Another form we use is simply the dual of the loop transformation primitives, where representative variables for a buffer, or for an axis of a buffer, are tracked in the TensorIR schedule state:

```python
n, c, h, w = s.get_buffer_axes(block, buffer_idx)
c_outer, c_inner = s.split_for_buffer(c)
s.reorder_for_buffer(n, c_outer, h, w, c_inner)

# or would `transform_layout` look like this?
buffer_rv = s.get_buffer(some_identifier)
new_buffer_rv = s.transform_layout(buffer_rv, remap_func)
```

Would it be possible to provide both an integrated `transform_layout` primitive and the step-by-step primitives, for users' convenience? Very glad to hear your opinions! :)
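For concreteness, here is a plain-Python sketch of the kind of index-remapping callback I have in mind for `remap_func` (my own illustration with a hypothetical `nchw_to_nchw4c` helper, not tied to any finalized API): it packs an NCHW index into NCHW4c and can be checked to be a bijection over the logical index domain.

```python
# Hypothetical sketch of a remap callback, written in plain Python so the
# mapping itself can be checked without any schedule machinery.
def nchw_to_nchw4c(n, c, h, w):
    # Map a logical NCHW index to a packed NCHW4c index.
    return n, c // 4, h, w, c % 4

# Every logical index of a (1, 8, 2, 2) buffer lands in a distinct slot of the
# packed (1, 2, 2, 2, 4) buffer, i.e. the remap is a bijection on its domain.
packed = {nchw_to_nchw4c(n, c, h, w)
          for n in range(1) for c in range(8) for h in range(2) for w in range(2)}
assert len(packed) == 1 * 8 * 2 * 2
```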
Re: [apache/tvm-rfcs] [RFC] Introducing DeclBuffer (PR #70)
Thanks a lot! I think we can then handle buffer-related issues in customized passes in a more explicit and robust way. I have one question on TIR script: for certain algorithms in DL workloads, users may want to write non-S-TIR-style scripts like

```python
x = T.allocate((), "int32", "")
x[()] = 0
while x[()] < 128:
    x[()] = x[()] + 1
    # ...
```

Could the parser still support writing things like that (even though the underlying IR structure has changed), instead of

```python
x_data = T.allocate((), "int32", "")
x = T.decl_buffer(data=x_data, ...)
x[()] = 0
# ...
```
Re: [apache/tvm-rfcs] [RFC] Introducing DeclBuffer (PR #70)
Reusing `T.alloc_buffer` seems good, as long as there is no ambiguity for the parser implementation :)
Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
Thanks for all the great discussions! It is exciting that we will have a more powerful way to handle things like padding and imperfect tiles. Since our team relies on the S-TIR code path, we are extremely interested in the story on the S-TIR side, and would really appreciate some details on S-TIR padding. I would like to use a [127, 127, 127] matmul to illustrate my questions :)

```python
@T.prim_func
def matmul(A: T.Buffer[(127, 127), "float32"],
           B: T.Buffer[(127, 127), "float32"],
           C: T.Buffer[(127, 127), "float32"]):
    for i, j, k in T.grid(127, 127, 127):
        with T.block("compute"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = 0.0
            C[vi, vj] += A[vi, vk] * B[vk, vj]
```

In the current S-TIR state, we can already construct the padded loops and buffers with existing primitives via a "split and then fuse" trick:

```python
s = tvm.tir.Schedule(matmul)
blk = s.get_block("compute")
i, j, k = s.get_loops(blk)
s.fuse(*s.split(i, factors=[4, 32]))
s.fuse(*s.split(j, factors=[4, 32]))
s.fuse(*s.split(k, factors=[4, 32]))
s.transform_layout(blk, "A", lambda i, k: ((i // 32) * 32 + i % 32, (k // 32) * 32 + k % 32))
s.transform_layout(blk, "B", lambda k, j: ((k // 32) * 32 + k % 32, (j // 32) * 32 + j % 32))
s.transform_layout(blk, "C", lambda i, j: ((i // 32) * 32 + i % 32, (j // 32) * 32 + j % 32))
```

We will get (after simplification):

```python
@T.prim_func
def func(A: T.Buffer[(128, 128), "float32"],
         B: T.Buffer[(128, 128), "float32"],
         C: T.Buffer[(128, 128), "float32"]):
    for i_0_i_1_fused, j_0_j_1_fused, k_0_k_1_fused in T.grid(128, 128, 128):
        with T.block("compute"):
            vi = T.axis.spatial(127, i_0_i_1_fused)
            vj = T.axis.spatial(127, j_0_j_1_fused)
            vk = T.axis.reduce(127, k_0_k_1_fused)
            T.where(i_0_i_1_fused < 127 and j_0_j_1_fused < 127 and k_0_k_1_fused < 127)
            T.reads(A[vi, vk], B[vk, vj])
            T.writes(C[vi, vj])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]
```

The only thing left is the condition guarding the padding: `T.where(i_0_i_1_fused < 127 and j_0_j_1_fused < 127 and k_0_k_1_fused < 127)`. I believe this brings us exactly to the point of the current RFC about the trade-off between over-computation and branching. Below are some of my questions.

1. What happens when we change to `s.transform_layout(..., pad_value=0)` (if we want over-computation)?
   - (possible behavior 1) Insert padding-filling code as a producer block of `compute`.
     - Since the effect is immediate, maybe we do not need `BufferConstraint` annotations afterwards?
   - (possible behavior 2) Annotate the buffers and let lowering passes handle it.
     - We may require `BufferConstraint` to direct the lowering passes.
   - (possible behavior 3) Pass `BufferConstraint` upwards into the graph level.
     - That is, assume the param buffers match the constraint and do not write the edge values.
2. For (1.2) and (1.3), encoding the `BufferConstraint` into the buffer object does not seem to be the only choice.
   - For S-TIR (correct me if I am wrong), at least in common cases the constraint could be treated as local with respect to the transformed block. What if we encode the constraint just into the block, as part of its memory-access properties? We found previously that the block memory annotations `T.reads` / `T.writes` (`BufferRegion`) have the limitation that they lose conditional access information. Maybe we could also combine `BufferConstraint` with `BufferRegion`?
   - For graph-level annotations, IIUC, the graph level conceptually uses "Tensor"-typed values instead of "Buffer". Maybe we would still need another construct rather than a `Buffer` with a `BufferConstraint` field?
   - We could also consider instantiating the graph-level transformation explicitly; this is our current solution: https://discuss.tvm.apache.org/t/introducing-ty-nnp-backend-with-end2end-tensorir-integration/11807/4
   - Nevertheless, if we finally decide to extend the buffer node structure, we hope the `BufferConstraint` can have an explicit lifetime during TIR lowering, so that storage-related passes afterwards are not affected, especially customized passes developed by vendors.
3. On reduce-axis padding, mentioned in https://github.com/apache/tvm-rfcs/pull/77#discussion_r894899301:
   - At the TIR level, since schedule primitives should preserve semantic correctness, how do we prove that the padding along the `k` dimension must be zero, especially when we do not know the op is a "matmul" in general? I think this is important if we want to use padded `transform_layout` in auto-scheduling style applications.

cc @Lunderberg @tqchen @vinx13 @Hzfengsy
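To make question 3 concrete, here is a small numpy check (my own sketch, not a schedule-level proof) of the invariant involved: zero-padding the reduction axis leaves the result unchanged, which is exactly the property a primitive would have to establish without knowing the op is a matmul.

```python
import numpy as np

a = np.random.rand(127, 127)
b = np.random.rand(127, 127)
a_pad = np.zeros((128, 128)); a_pad[:127, :127] = a
b_pad = np.zeros((128, 128)); b_pad[:127, :127] = b

# The padded rows/columns along k multiply zeros, so they contribute nothing
# to the accumulation; any other pad value would change the result.
np.testing.assert_allclose(a @ b, (a_pad @ b_pad)[:127, :127])
```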
Re: [apache/tvm-rfcs] [RFC] Adding initial SVE implementation (#18)
Hi~ Here are my two questions :) cc @kparzysz-quic

> 2. Make vector length a parameter to `stage.vectorize`.

What is the difference between
- `sch[C].vectorize(v, vector_length=32)` and
- `vo, vi = sch[C].split(v, 32)` followed by `sch[C].vectorize(vi)`?

It seems we could also choose to properly lower the split's predicate to reach the same goal as proposed. For example, the machinery introduced in RFC https://github.com/apache/tvm-rfcs/pull/77 may help?

> 3. Introduce "predicate" to `BufferLoad` and `BufferStore`.

Our team also got confused about how to represent predicated loads/stores when, several months ago, upstream upgraded `T.load`/`T.store` (which had a 1-D predicate field) to `BufferLoad`/`BufferStore`. Now that `BufferLoad`/`BufferStore` are multi-dimensional, does the predicate also become multi-dimensional? Another concern is whether embedding a predicate into `BufferLoad`/`BufferStore` increases the complexity of (or breaks) buffer-region-related analysis in existing implementations. Could we instead leverage `T.select(pred, A[...], undef)` to represent `A[..., pred]`, or just pattern-match predicated memory accesses like `if (pred) C[...] = ...`? Thanks!
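To make the semantics I am asking about concrete, here is a plain-numpy model of a 1-D predicated load/store (my own sketch with hypothetical `predicated_load` / `predicated_store` helpers, unrelated to whatever TIR encoding the RFC settles on): inactive lanes are left untouched on stores and filled with a placeholder on loads.

```python
import numpy as np

def predicated_load(buf, idx, pred, fill=0):
    # Read only the lanes where pred is True; other lanes get a fill value
    # (standing in for "undefined").
    out = np.full(idx.shape, fill, dtype=buf.dtype)
    out[pred] = buf[idx[pred]]
    return out

def predicated_store(buf, idx, pred, value):
    # Write only the lanes where pred is True.
    buf[idx[pred]] = value[pred]

a = np.arange(8)
idx = np.arange(4, 12)            # lanes 4..11; the last four are out of bounds
pred = idx < a.shape[0]           # mask off the out-of-bounds lanes
print(predicated_load(a, idx, pred))        # [4 5 6 7 0 0 0 0]

predicated_store(a, idx, pred, np.full(8, -1))
print(a)                                    # [ 0  1  2  3 -1 -1 -1 -1]
```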
Re: [apache/tvm-rfcs] [RFC] Relax Upstreaming (PR #89)
At Intellif, people have built, maintained, and extended a DL compilation stack with Relay over the past years. However, we do not think the upstreaming of a new module would break existing functionality or cause confusion; rather, it is a huge opportunity to solve many technical issues that have proven not so easy to handle in Relay, as already emphasized in the discussion thread.

From my perspective, the TVM community is a very inclusive community. We already have modules with overlapping functionality co-existing without much debate: for example, the different runtime implementations for Relay, the TE schedule and the TensorIR schedule, Ansor and meta-schedule, etc. I hope it will likewise not be a problem for the graph AST infrastructure.
[TVM Discuss] [Development/RFC] Yet another dense op combine strategy
Hello there. The idea is much the same as the existing IR pass described in https://discuss.tvm.ai/t/discussion-new-ir-pass-proposal-combineparalleldense/3813 by @jonso. Many sequential network structures perform a group of matmul operations on the same input tensor, such as
- gate projections on the state within GRU/LSTM
- Q/K/V projections on the input within a transformer layer

Thanks to the `CombineParallelDense` pass, such operations can be combined to fully utilize the performance of matmul kernels. The currently implemented strategy transforms multiple matmuls into a batched matmul op:
- before: Y_1: [M, N] = matmul(X: [M, K], W_1: [K, N]), ..., Y_B = matmul(X, W_B: [K, N])
- after: Y: [B, M, N] = batch_matmul(stack(X, ..., X), stack(W_1, ..., W_B))

However, there seems to be another, simpler choice: combine them into one matmul instead of a batched matmul. It even works with different output channel sizes:
- before: Y_1 = matmul(X: [M, K], W_1: [K, N_1]), ..., Y_B = matmul(X, W_B: [K, N_B])
- after: Y: [M, N_1 + N_2 + ... + N_B] = matmul(X, concatenate(W_1, ..., W_B))

Since matmul and batch_matmul have different op implementations, the performance of the combined op may differ. The output layouts are also different, which may affect the performance of downstream ops. We can compare matmul against the equivalent batch_matmul with a fixed LHS matrix: using cuBLAS as a reference, I find that a single cublasSgemm is significantly faster than cublasSgemmStridedBatched in certain circumstances with small B (typically 3).

The proposed strategy can be an option in the current `CombineParallelDense` pass, and I think the basic implementation logic will closely resemble `CombineParallelConv2d`. The `CombineParallelDense` pass could then select the better of the two strategies for more performance benefit.
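A quick numpy check of the equivalence the single-matmul strategy relies on (my own sketch): running the matmuls separately and laying the results side by side matches one matmul against the weights concatenated along the output-channel axis, even when the N_i differ.

```python
import numpy as np

X = np.random.rand(4, 16)
Ws = [np.random.rand(16, n) for n in (8, 8, 12)]   # different output widths N_i

# B separate matmuls, results concatenated along the output-channel axis
separate = np.concatenate([X @ W for W in Ws], axis=1)

# one matmul against the concatenated weight matrix
combined = X @ np.concatenate(Ws, axis=1)

np.testing.assert_allclose(separate, combined)
```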
[Apache TVM Discuss] [Development/RFC] [RFC] Differentiable tensor expression (Create and verify backward op automatically)
As there are more and more demands on TVM's training support, one of the most tedious but important pieces of work is writing backward implementations for operators. It could be of great benefit to provide automation tools to help with this process. Such a tool can serve two purposes:

- Automatically create a backward definition from a forward definition
- Check gradients given a forward and a backward definition

Traditional deep learning frameworks (perhaps except Theano :wink:) conduct auto back-propagation at the op-graph level; that is, they have to implement one backward op for each forward op, so theoretically there have to be as many backward op definitions as forward ops. For TVM, however, there is an opportunity to conduct back-propagation at the tensor expression level. Tensor expression operations are far fewer than a whole neural-network operator set, so this would greatly reduce the human work at the higher (Relay op) level.

### Backward tensor expression generator

**Interface**

Since a tensor expression defines how to compute the output from the inputs symbolically, we can just try to apply back-propagation rules to it. For example, we can provide a utility interface like

```python
def auto_backprop(inputs: List[Tensor], output: Tensor) -> (List[Tensor], Tensor):
    """Given the input tensor list and the output tensor, generate the backward computation.

    - The inputs of the backward computation are a placeholder representing the gradient
      with respect to the original output, plus any other necessary original tensors.
    - The outputs are the gradients with respect to each of the original inputs.
    """
    pass
```

Now if we have already defined some forward computation, we can extract a "default" backward computation definition:

```python
x = te.placeholder((n, k))
y = te.placeholder((m, k))
z = te.compute((n, m), ...)

((grad_x, grad_y), grad_z_placeholder) = te.auto_backprop((x, y), z)
sched = te.create_schedule(grad_x.op)
# Do schedule and tune backward ops...
```

The transformation should happen before `create_schedule()`, since in general the forward and backward definitions are different and may not share the same optimization strategies. We can wrap this sort of utility in topi and Relay, where we can try our best to provide default backward op definitions automatically, without hand-written definitions. Some pros and cons are listed below:

- Pros
  - Avoids hand-written work for at least some portion of operations.
  - An auto-generated definition may be more robust on boundary behaviors and corner cases.
- Cons
  - It is not all-powerful; not all operators can be differentiated automatically.
  - Some optimization hints may be lost (the backward of matmul is also a matmul, the backward of conv2d is also a conv2d).

**Transformation logic**

At the beginning we may just focus on `te.compute()`, and not support tensor intrinsics / hybrid / extern.

- `te.compute()`
  - Use a simple matmul as an example:

    ```python
    te.compute((m, n), lambda i, j: tvm.sum(data[i, k] * weight[j, k], axis=k))
    ```

    If we want to compute the gradient with respect to `weight[w1, w2]`, we have to know how the output is related to this weight position. Thus we "remap" the iter vars related to `weight`:

    ```python
    j = w1, k = w2
    ```

    Then all iter vars in the compute expression can be represented with [w1, w2] plus affine transformations:

    ```python
    tvm.sum(data[i, w2] * weight[w1, w2], axis=..)  # for each i, with j = w1
    ```

    `i` is a free variable here; each `weight[w1, w2]` contributes to every `output[i, w1]` for each feasible `i`. For each `i`, the gradient of `tvm.sum(...)` with respect to `weight[w1, w2]` is `data[i, w2]`.
    According to the chain rule, the gradient of the loss with respect to `weight[w1, w2]` can then be computed as

    ```python
    tvm.sum(data[i, w2] * grad_output[i, w1], axis=i)
    ```

  - The actual back-propagation logic should carefully handle the iter-var relationships. For each occurrence of the target tensor (the one we compute the gradient for) in the expression, the feasible integer set of each free iter var is inferred based on the iter-var remapping. With the free vars fixed, we compute the gradient expression of the output expression with respect to the target tensor position. Finally, the chain rule is applied to sum the gradient expression over the feasible sets of the free vars. Unsupported cases should be detected explicitly.
- `te.scan()` is also an interesting operation worth supporting in back-propagation; with it we could get backward implementations of RNN/LSTM/GRU directly.

### Gradient checking between forward && backward ops

Given a forward and backward implementation pair, we can verify correctness with approximate gradients. This helps developers detect implementation errors in general and corner cases. One such method is well described in https://datascience-enthusiast.com/DL/Improving_DeepNeural_Networks_Gradient_Checking.html
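As a concrete instance of the gradient-checking idea, here is a small numpy sketch (my own, reusing the matmul example above) that compares the hand-derived backward rule against a central-difference approximation:

```python
import numpy as np

X = np.random.rand(3, 4)          # data:   [M, K]
W = np.random.rand(5, 4)          # weight: [N, K], Y[i, j] = sum_k X[i, k] * W[j, k]
grad_Y = np.ones((3, 5))          # gradient of loss = sum(Y) with respect to Y

# analytic gradient from the chain-rule derivation above:
# grad_W[w1, w2] = sum_i data[i, w2] * grad_output[i, w1]
grad_W = grad_Y.T @ X

# numeric gradient of sum(Y) with respect to each W entry (central differences)
eps = 1e-6
numeric = np.zeros_like(W)
for w1 in range(W.shape[0]):
    for w2 in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[w1, w2] += eps
        Wm[w1, w2] -= eps
        numeric[w1, w2] = ((X @ Wp.T).sum() - (X @ Wm.T).sum()) / (2 * eps)

np.testing.assert_allclose(grad_W, numeric, rtol=1e-4)
```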
[Apache TVM Discuss] [Development/RFC] [RFC] Differentiable tensor expression (Create and verify backward op automatically)
Glad to see autodiff is already in progress! I think this RFC can be withdrawn, since this is exactly what autodiff is doing. Now I am very curious about the current progress of autodiff, with some questions:

- If I have a common neural network structure such as ResNet-50 at hand, can I just use autodiff to get the backward computation graph?
- Is there some description of the common ops that can be covered by autodiff?
- Can `te.scan()` be supported?
[Apache TVM Discuss] [Development/pre-RFC] Introducing TY-NNP backend with end2end TensorIR integration
Hi, all~

This RFC is to upstream the support for our TY-NNP accelerator backend. We are from the AI accelerator toolchain team of [Intellifusion](https://www.intellif.com/), which has been focusing on developing vision processors that accelerate deep neural networks for visual recognition and search, both in endpoints such as IP cameras and robots, and in the cloud. Nowadays, TVM has become the most important component in our AI software stack, and we would like to upstream our work back. We believe participating in the open-source ecosystem will benefit both our internal software infrastructure and our customers!

# Overall architecture

TY-NNP refers to the neural network accelerator architecture serving a wide range of our edge AI scenarios. TY-NNP takes a typical NPU design, offloading neural network computation workloads to various kinds of domain-specific computing units. Generally, there are three kinds of computing units:

* NU (neural units)

  NU is designed for high-throughput computation of typical neural-network workloads such as Conv/Matmul. Compared to TensorCores in NVGPU, NU works in a coarse-grained fashion from a software perspective. Instead of software programming of fine-grained M * N * K mma intrinsics, NU provides CISC-style instructions and a bundle of hardware configurations to developers. The NU components automatically load input/weight data from input buffers, execute fine-grained mma operations with hardware tiling control, and store the results to output buffers.

  In TVM, we program NU with customized TIR intrinsics. Developers use schedules to lower the specified computation patterns to NU intrinsics, arrange the on-chip input/output buffers, and perform tuning to determine the best hardware configurations.

* VU (vector units)

  VU accelerates general computation workloads that do not fit NU. TY-NNP provides a set of on-chip VU cores, each with its own on-chip buffer (called VM) and a set of vectorized/scalar function units and physical registers. VU programming is just like general vectorized programming on CPUs.

  In TVM, to offload a computation to VU, developers schedule the computation into a vectorizable form, arrange the on-chip input/output buffers, and mark the proper computation axis with `vectorize` or replace it with VU intrinsics.

* CU (control units)

  CU can be seen as a small on-chip core and does not provide high computation abilities. It aims to control the on-chip execution flow, and the whole on-chip kernel execution starts from CU.

TY-NNP takes an explicitly managed memory hierarchy: each computing unit has its own buffer, and there is a global on-chip buffer (called DM) to transfer data between the units. Data transfer is explicitly done by asynchronous DMA operations, and explicit/implicit synchronizations are used to avoid hazards. In TVM, DMA and synchronization are also represented by TIR intrinsics.

An off-chip storage (called DDR) is used to transfer data between host and device; it is much larger than the on-chip buffers and supports dynamic memory allocation. In TVM, the DDR storage simply corresponds to the storage scope `kGlobal` and is managed by the runtime.

# Implementation design

The current TVM compilation stack for TY-NNP is as follows:

### Relay level

* We use a fusion pass based on a dedicated hardware cost model. Beyond traditional heuristic-based fusion for `conv-bn-relu`-like patterns, it performs a much more aggressive strategy to merge multiple anchor ops (like conv) into a single device kernel.
  This brings opportunities to schedule multiple anchor ops simultaneously, which we think is essential to saturate our NPU hardware.

* A schedule-aware layout rewrite mechanism is added. Our TIR schedule phase rewrites tensor layouts to fit hardware features, so we modify the compile engine to give a chance for compatible updates at the Relay level.

### TIR level

A key difference from the current CPU/GPU design is that we try to schedule & tune blocks covering multiple ops. It is fine to compute a single heavy op per kernel on a GPU device, but we think an NPU may prefer to launch a block of consecutive ops to avoid frequent kernel launches; the fusion pass proposed above is a way to achieve this. Also, since the main efforts of the TVM community are on the CPU/GPU backends, there do exist pain points when developing TIR support for an NPU-style backend, and it took us some struggle to make it work through the standard schedule -> lower flow.

* We use the TensorIR schedule (https://discuss.tvm.apache.org/t/rfc-tensorir-a-schedulable-ir-for-tvm/7872) to schedule the computations. **As far as we know, this is the first trial of the TensorIR schedule on NPU infrastructure.**
* A set of new schedule primitives is added to utilize hardware features.
* A set of new TIR passes is added to utilize hardware features.
* We use `device_scope` att
[Apache TVM Discuss] [Development/pre-RFC] Introducing TY-NNP backend with end2end TensorIR integration
Thanks for your comments :)

[quote="areusch, post:3, topic:11807"]
could you say more here? is this a Relay-level thing or a TIR thing? presuming you’ve implemented this as a pass, how do you plan to ensure that the Relay-level pass makes the same scheduling decision as the TIR pass?
[/quote]

Perhaps I could use a fake Conv2d example to describe it:

    fn (%arg0: Tensor[(1, 32, 224, 224), int8], %nn.conv2d_arg: Tensor[(32, 3, 7, 7), int8]) {
      %conv_fn = fn (%data: Tensor[(1, 3, 224, 224), int8], %weight: Tensor[(32, 3, 7, 7), int8], Primitive=1) {
        nn.conv2d(%data, %weight, padding=[1, 1, 1, 1], kernel_size=[7, 7], out_dtype="int32")
      };
      %conv_fn(%arg0, %nn.conv2d_arg)
    }

and the corresponding PrimFunc for the primitive call `%conv_fn` would be like

```python
@T.prim_func
def main(x: T.Buffer[...], weight: T.Buffer[(32, 3, 7, 7), "int8"], y: T.Buffer[...]) -> None:
    # body
```

Assume that to utilize the specific hardware, we want to arrange the I/O channels into 4*4 tiles. Two extra notes:

- We only get to know the "best" weight layout after a TIR schedule/tuning is done.
- The required layout is out of the scope of common representations like "OIHW", "OHWI", etc.

The TIR schedule part would do the following transformation on `weight`:

```python
o, i, h, w = s.get_read_buffer_axes(conv_block)
o_outer, o_inner = s.buffer_split(o, factor=4)  # [32, 3, 7, 7] -> [8, 4, 3, 7, 7]
i_outer, i_inner = s.buffer_split(i, factor=4)  # [8, 4, 3, 7, 7] -> [8, 4, 1, 4, 7, 7]
s.buffer_reorder(o_outer, o_inner, i_outer, i_inner, h, w)  # [8, 4, 1, 4, 7, 7] -> [8, 1, 4, 4, 7, 7]
```

Above we use a set of extended TensorIR primitives, but they can just be seen as sugar over the ongoing `transform_layout` schedule primitive: https://github.com/apache/tvm-rfcs/pull/39

The point is that they are not arbitrary index remappings (compared to a general `transform_layout`): we ensure that every such schedule step has an exact equivalent Relay transformation. In the TIR schedule phase, we trace every buffer layout change on the function param buffers (we can do that since they are what we implement), generate the transform (and reverse transform) in Relay for each step, and finally compose them into a single layout-transform (and reverse-transform) function in Relay. For the example above, it would be:

- `s.buffer_split(o, factor=4)`
  - x -> relay.reshape(x, [-1, 4, 3, 7, 7])
  - (reverse) x -> relay.reshape(x, [32, 3, 7, 7])
- `s.buffer_split(i, factor=4)`
  - x -> relay.reshape(relay.nn.pad(x, [..., (0, 1), ...]), [8, 4, -1, 4, 7, 7])
  - (reverse) x -> relay.strided_slice(relay.reshape(x, [8, 4, 4, 7, 7]), begin=..., end=...)
- `s.buffer_reorder(...)`
  - x -> relay.transpose(x, [...])
  - (reverse) x -> relay.transpose(x, [...])

Finally, all transforms (and reverse transforms) are composed into two `relay.Function` objects to rewrite the Relay-level layouts. Each accepts the original Relay params and returns the updated params as a tuple:

    fn (%p0: Tensor[..., int8], %p1: Tensor[(32, 3, 7, 7), int8]) {
      %0 = reshape(%p1, newshape=[...]);
      %1 = nn.pad(%0, pad_width=[...]);
      %2 = reshape(%1, newshape=[...]);
      %3 = transpose(%2, axes=[...]);
      (%p0, %3)
    }

and the reverse direction is:

    fn (%p0: Tensor[..., int8], %p1: Tensor[(8, 4, 1, 4, 7, 7), int8]) {
      %0 = transpose(%p1, axes=[...]);
      %1 = reshape(%0, newshape=[...]);
      %2 = strided_slice(%1, begin=[...], end=[...], strides=[...]);
      %3 = reshape(%2, newshape=[32, 3, 7, 7]);
      (%p0, %3)
    }

A Relay pass can now perform a "pre"-schedule for each primitive function, fetch the layout-transform functions from the schedule result, and perform the Relay-level layout update.
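To double-check the reasoning that each step has an exact Relay inverse, here is a small numpy model of the two `buffer_split` steps and their reverse (my own sketch; the `buffer_reorder` step would just add a `transpose` and its inverse on top):

```python
import numpy as np

w = np.random.randint(-128, 127, (32, 3, 7, 7), dtype="int8")

# forward: split O by 4, then pad I to a multiple of 4 and split it by 4
fwd = w.reshape(8, 4, 3, 7, 7)                                # buffer_split(o, factor=4)
fwd = np.pad(fwd, [(0, 0), (0, 0), (0, 1), (0, 0), (0, 0)])   # pad I from 3 to 4
fwd = fwd.reshape(8, 4, 1, 4, 7, 7)                           # buffer_split(i, factor=4)

# reverse: undo the splits and slice the padding away
rev = fwd.reshape(8, 4, 4, 7, 7)[:, :, :3, :, :].reshape(32, 3, 7, 7)
np.testing.assert_array_equal(rev, w)
```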
Finally, an extra `FoldConstants` can typically eliminate all of the extra transformations outside the primitive calls:

    fn (%arg0: Tensor[(1, 32, 224, 224), int8], %nn.conv2d_arg: Tensor[(32, 3, 7, 7), int8]) {
      %0 = reshape(%nn.conv2d_arg, newshape=[...]);
      %1 = nn.pad(%0, pad_width=[...]);
      %2 = reshape(%1, newshape=[...]);
      %3 = transpose(%2, axes=[...]);
      %conv_fn = fn (%data: Tensor[(1, 3, 224, 224), int8], %weight: Tensor[(8, 4, 1, 4, 7, 7), int8], Primitive=1, DevicePrimFuncKey=873487) {
        %4 = transpose(%weight, axes=[...]);
        %5 = reshape(%4, newshape=[...]);
        %6 = strided_slice(%5, begin=[...], end=[...], strides=[...]);
        %7 = reshape(%6, newshape=[32, 3, 7, 7]);
        nn.conv2d(%data, %7, padding=[1, 1, 1, 1], kernel_size=[7, 7], out_dtype="int32");
      };
      %conv_fn(%arg0, %3)
    }

The actual params are transformed before the call into `%conv_fn`, and the formal params are reverse-transformed within `%conv_fn`'s body. The reason we need the reverse transforms is that we currently cannot represent a "lowered" function call in Relay (correct me if I am wrong). It is a workaround for us to keep a valid primitive function body; that is, the Relay module after the pass can still be safely evaluated on a CPU. All things d
[Apache TVM Discuss] [Development/pre-RFC] Introducing TY-NNP backend with end2end TensorIR integration
[quote="areusch, post:3, topic:11807"] it seems like this could either be integrated into `ci-cpu` or as a separate `ci-` image, so long as the binaries are publicly available. do you have an estimate of the size of the docker image? also, just for my curiosity, would you be able to share a rough timeline of when you’d like to land this? [/quote] If a separate image is possible (it can be based on `ci-cpu`), we may prefer it since future upgration will not bother the `ci-cpu`'s usages. The incremental file size would be as below: - LLVM: x86+device target build is about 140M, some libLLVM* maybe unused - Other toolchains: full toolchain binaries will occupy 500M, simulator only binaries can control down to <100M --- [Visit Topic](https://discuss.tvm.apache.org/t/introducing-ty-nnp-backend-with-end2end-tensorir-integration/11807/5) to respond. You are receiving this because you enabled mailing list mode. To unsubscribe from these emails, [click here](https://discuss.tvm.apache.org/email/unsubscribe/8b8acb181464bdb8219d22bd9f8657d5ce92fb87787c369135aa2b18199bddd8).
[Apache TVM Discuss] [Development/pre-RFC] Introducing TY-NNP backend with end2end TensorIR integration
@mbs-octoml Hi~ Many thanks for your reply! Here are a few questions from me:

1. What does `call_lowered` mean? Does it mean we can now put PrimFuncs and Relay functions into the same IRModule and make calls between them?
2. For `VirtualDevice`, is my understanding right that it would be the interface carrying all the information we need across the Relay-TIR boundary? Would it be a closed set (device, memory scope, etc.) or allow third-party extensions?
3. Just out of curiosity, what is the difference between the ongoing `Relax` work and the mechanism described here?
[Apache TVM Discuss] [Development] Can we lift tir.AttrStmt value type to ObjectRef?
Schedule annotations on `For` and `Block` are all of `Map` type, but certain pragma annotations cannot be lowered to `T.attr`; only expression-typed values are allowed.
[Apache TVM Discuss] [Development] Can we lift tir.AttrStmt value type to ObjectRef?
Hi~ I think this is not an issue of TVMScript. For example, although `List[Integer]` is supported by the script parser, it fails in lowering with `Illegal attribute of key pragma_key, value type Array not supported`, since the annotation cannot be converted to an attr stmt.

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def fun(x: T.Buffer((16,), "int32")) -> None:
    for k in T.serial(0, 16, annotations={"pragma_key": [1, 2, 3]}):
        x[k] = 1

print(tvm.lower(fun).script())
```
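For contrast, a minimal sketch of the case described above as allowed, where the annotation value is a plain expression (same environment as the snippet above; behavior as I understand the current attr conversion):

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def fun_ok(x: T.Buffer((16,), "int32")) -> None:
    # expression-typed annotation value: this is the kind that can become a T.attr pragma
    for k in T.serial(0, 16, annotations={"pragma_key": 1}):
        x[k] = 1

print(tvm.lower(fun_ok).script())
```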