Re: [apache/tvm] [VOTE] Adopt the New RFC Process (#7991)
+1

One small question: Under "Tracking Issue", the tracking issue is closed "when the RFC is either completed or abandoned." Is "abandoned" the same as "postponed" in this context, or is there a distinction between them?
Re: [apache/tvm] [VOTE] Adopt the New RFC Process (#7991)
> @Lunderberg @junrushao1994 good catch, would changing "abandoned" to "postponed" resolve the ambiguity sufficiently?

Thank you! That resolves my question.
[apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)
This RFC introduces a hard boundary between the "logical layout" of a mathematical tensor and the "physical layout" of a buffer in memory, along with a specification for defining the conversion between the two.

You can view, comment on, or merge this pull request online at: https://github.com/apache/tvm-rfcs/pull/39

Commit Summary:
* [RFC][TIR] Separate physical and logical layout of buffers (https://github.com/apache/tvm-rfcs/pull/39/commits/5e3773ec22dbf44f6a3d771afba7e46d360b9c47)

File Changes:
* A rfcs/-buffer-physical-layout.md (399)

Patch Links:
* https://github.com/apache/tvm-rfcs/pull/39.patch
* https://github.com/apache/tvm-rfcs/pull/39.diff
[apache/tvm-rfcs] [RFC][TIR] 1-d AllocateNode::extent (#40)
This RFC changes `Array<PrimExpr> AllocateNode::extents` to `PrimExpr AllocateNode::extent`, giving the 1-d size of the buffer to be allocated. This is part of the separation between logical layout and physical layout, proposed in [RFC#0039](https://github.com/apache/tvm-rfcs/pull/0039).

TODO: Rendered markdown link

You can view, comment on, or merge this pull request online at: https://github.com/apache/tvm-rfcs/pull/40

Commit Summary:
* [RFC][TIR] 1-d AllocateNode::extent (https://github.com/apache/tvm-rfcs/pull/40/commits/06c8dea05cc1eb154e6799656650b333b2b97771)

File Changes:
* A rfcs/-allocate-node-1d-extent.md (111)

Patch Links:
* https://github.com/apache/tvm-rfcs/pull/40.patch
* https://github.com/apache/tvm-rfcs/pull/40.diff
Re: [apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)
Thank you for the comments, @vinx13.

> For example, another way is to use a mapping function: (n, c, h, w) -> (n, tir.floordiv(c, 4), h, w, tir.floormod(c, 4)). This would allow arbitrary mapping (we can add more restrictions like requiring affine mapping though, to make analysis easier).

~~Having an arbitrary mapping function was something that I considered as an alternative, and would like to use as part of the internal representation. However, for the user-facing API, I prefer having a representation that cannot express an illegal internal state, rather than relying on validation. The arbitrary mapping function could have more than one logical index map to the same physical index (e.g. `(i,j) -> (i+j,)`), which would need to be caught after the fact.~~ Never mind, I'm convinced after going through your point below about CUDA's permuted layout.

> A possible use cases of more complex mapping is [permuted layout](https://github.com/NVIDIA/cutlass/blob/master/media/docs/implicit_gemm_convolution.md#shared-memory-layouts) for shared memory on CUDA.

Thank you for the example; this sort of example was what I was hoping to get in order to determine whether the representation of reorders/splits is sufficiently flexible. If I'm reading the link correctly, I agree that the physical layout in shared memory cannot be expressed in the proposed notation. If the global memory is 2-d memory `(i,j)`, the closest this notation comes to representing the physical layout in shared memory would be `((i,2), (j,4), (j,8), SEP, (i,4))`. This would maintain the 128-bit vectors `(j,8)` and perform the transpose, because `(j,8)` appears in the first physical coordinate, but it wouldn't handle the XOR condition needed for conflict-free access. It also couldn't specify that the `(i,2)` in the first physical coordinate needs to take precedence over the `(i,4)` in the second physical coordinate. In the functional notation, this would be `(i,j) -> (floormod(i,2), (floormod(row,8)) ^ (floordiv(j,8)), floormod(j,8), SEP, row)`, where `row = floordiv(i, 2)` (sketched below).

I am convinced that, while the earlier notation may be useful as a more readable input in some cases, it doesn't cover all use cases. At most, it could be an alternative form of input, but the function definition mapping from logical indices to physical indices should be the primary representation. In terms of how to represent this in the TIR graph, I'll have to think a bit on it. I'm picturing either `BufferNode` holding `Array<Var> index_vars` and `Array<PrimExpr> physical_layout`, or making a TIR variant of `tvm::relay::FunctionNode`.

> Also, there are related affine analysis infrastructure available, it would be great if we can reuse it for loop analysis and rewriting.

Thank you for mentioning these. I intend to use these utilities for the implementation, and will add them to the detail section.
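To make the functional notation concrete, here is a pure-Python sketch of that mapping; the function name and the extents checked are mine, for illustration only.

```python
# Pure-Python sketch of the functional-notation mapping above; `permuted_layout`
# is an illustrative name, not a proposed API.
def permuted_layout(i, j):
    """Map logical indices (i, j) to the permuted shared-memory layout."""
    row = i // 2
    return (
        i % 2,                 # floormod(i, 2)
        (row % 8) ^ (j // 8),  # XOR'd vector index, for conflict-free access
        j % 8,                 # element within a 128-bit vector
        # --- SEP: boundary between the two physical axes ---
        row,                   # floordiv(i, 2)
    )

# Sanity check over a small extent: no two logical indices may collide.
seen = set()
for i in range(16):
    for j in range(64):
        assert permuted_layout(i, j) not in seen
        seen.add(permuted_layout(i, j))
```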
Re: [apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)
True, and that would allow both for user-defined mappings, and for specifying standard layouts.

I have a bit of concern with using the `scope` parameter to also describe layouts, since in my mind "scope" refers to where the entire buffer is located, while "layout" refers to how individual elements are arranged within the buffer. Two buffers that have the same layout and different scopes could be `memcpy`-ed between the two scopes without additional rearrangement, and that would be obscured if the `scope` parameter is used to define both the scope and the layout.

I really like the idea of building the TIR representation based on the provided function. As with `te.compute`, that gives a convenient way to let the library handle the var/primexpr building, so long as the user provides the function to compute the physical indices (a small sketch of this division of labor follows below).
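As a rough illustration of that division of labor, a pure-Python sketch with an invented helper name; the real implementation would build `Var`/`PrimExpr` objects rather than brute-force evaluating indices.

```python
# Pure-Python sketch: the user provides only the index-mapping function, and
# the library derives everything else.  `physical_shape` is an invented helper.
import itertools

def physical_shape(logical_shape, mapping):
    """Brute-force the physical extents implied by `mapping`."""
    extents = None
    for idx in itertools.product(*(range(e) for e in logical_shape)):
        phys = mapping(*idx)
        if extents is None:
            extents = [0] * len(phys)
        extents = [max(e, p + 1) for e, p in zip(extents, phys)]
    return tuple(extents)

# NCHW -> NCHW4c, matching the mapping discussed earlier in the thread.
print(physical_shape((1, 8, 4, 4), lambda n, c, h, w: (n, c // 4, h, w, c % 4)))
# (1, 2, 4, 4, 4)
```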
Re: [apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)
> How will vectorization work? If there is a `vectorize` directive spanning a logical extent, will the vectorization pass create multidimensional `ramp`s? How will vector loads and stores be represented?

While in principle a vectorized load/store could fall back to a non-vectorized load/store, this would result in ignoring the `vectorize` directive entirely, which would be rather unexpected from the user's perspective. Therefore, my plan for the initial implementation is to require the elements to be contiguous in both the logical and physical layouts if they are vectorized. In both the physical and logical layouts, the vector load/store would be represented using a `RampNode` with `stride==1` (see the sketch below).

This restriction could then be loosened by allowing schedules to be expressed in terms of the physical layout, rather than the logical layout. I haven't thought through the full implications of this, which would likely involve either introducing an inlined Stage whose axes are the physical axes, or exposing the physical axes as a separate parameter within the same stage.
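For reference, a small sketch of what the stride==1 requirement looks like with the existing TVM python API; illustrative only.

```python
# A vectorized access over 4 contiguous elements starting at 4*i would be
# represented with a RampNode whose stride is 1.  Illustrative only.
import tvm
from tvm import tir

i = tir.Var("i", "int32")
index = tir.Ramp(4 * i, 1, 4)  # base = 4*i, stride = 1, lanes = 4
print(index)
```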
Re: [apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)
Following a video chat with @csullivan, documenting some of the key points of the conversation.

* Setting the physical layout in a TE-based schedule has two roles. One is to rewrite the buffer itself, and the other is to define the order of iteration when writing to the buffer. In the latter use case, the schedule should have access to the physical axes for use in the schedule.
* Setting the physical layout is roughly equivalent to using an inlined `te.compute` to define a re-shaped tensor. However, using `te.compute` requires the compute definition to change, whereas `set_physical_layout` is entirely contained in the schedule.
* For a given tensor, either the physical axes or the logical axes may be used in scheduling, not both. We were unable to think of use-cases where you would want to change the physical layout, but still maintain the loop-order of the logical layout. One idea that came out of this conversation was to have `set_physical_layout` act similarly to `cache_read` and `cache_write`, which introduce a new stage (see the sketch below).
* If a physical layout is defined for a cache stage, then the generated code should perform the layout transformation as part of generating that cache.
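A sketch of the usage we have in mind, with `set_physical_layout` shown as the proposed, not-yet-existing primitive; it is commented out inside an otherwise-runnable TE schedule.

```python
import tvm
from tvm import te

A = te.placeholder((64, 64), name="A")
B = te.compute(A.shape, lambda i, j: 2 * A[i, j], name="B")
s = te.create_schedule(B.op)

# Proposed usage, analogous to cache_read/cache_write introducing a new stage.
# The API below is the proposal under discussion, not an existing primitive:
#
#   BB = s[B].set_physical_layout(lambda i, j: (i // 4, j, i % 4))
#   io, j, ii = BB.op.axis   # physical axes, usable in the schedule
#   s[BB].reorder(j, io, ii)
```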
Re: [apache/tvm-rfcs] [RFC][TIR] Change AllocateNode::extent to 1-D (#40)
This is the first in a series of proposed changes, and this one on its own won't be able to support the `PHYSICAL_AXIS_SEPARATOR`. In the [Impacted TIR Nodes](https://github.com/Lunderberg/tvm-rfcs/blob/data_layout/rfcs/0039-buffer-physical-layout.md#impacted-tir-nodes) section of RFC #39, under `AllocateNode`, there is a description of the changes being made. This first change removes all information about the logical shape from the `AllocateNode`, and the follow-up change will then extend it to allow for N-d physical shapes.
Re: [apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)
Following discussion with @tqchen, this RFC has had significant updates made. The major change is that instead of extending the capabilities of `Store` and `Load` nodes to support N-d indices, they would instead be removed in favor of keeping `BufferStore` and `BufferLoad` nodes throughout the lowering process.
Re: [apache/tvm-rfcs] [RFC][TIR] Change AllocateNode::extent to 1-D (#40)
Closing this RFC, as it is no longer applicable after the significant changes made to #39.
Re: [apache/tvm-rfcs] [RFC][TIR] Change AllocateNode::extent to 1-D (#40)
Closed #40.
Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)
> I'd suggest adding the BufferTransform data structure here which will be very helpful to other audience.

Sounds good, and I've added a description of it, and a possible data structure for it.
Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)
> Usage of te.AXIS_SEPARATOR: It seems this is only used in the API side but not in BufferTransform, would be good to get some clarification.

That's correct, the `te.AXIS_SEPARATOR` only appears in the API for the TE schedules, and not in the TIR graph generated from the TE schedule. I've updated the RFC with a paragraph on this distinction.

For the axes returned, I'd want to add a third option.

* T2: Using `te.AXIS_SEPARATOR` to separate groups of axes, e.g. `AA = s[A].transform_layout(lambda i, j: [j//4, te.AXIS_SEPARATOR, i, j%4])`, directly implies fusion of the `i` and `j%4` axes, but exposes each transformed axis for use in the TE schedule. That is, `AA.op.axis` has three elements, whose iteration bounds are `[A.shape[1]//4, A.shape[0], 4]` (sketched below).

Then my thoughts on each option are below, with T2 as my preferred option.

* T0: This is easy to write in TE, but any additional manipulation of the axes (e.g. changing the loop ordering independent of the data layout) would require splitting the returned axis to match the layout transformation. It would also mean that `lambda *indices: indices` isn't the identity function, because it would flatten the buffer as part of that transformation.
* T1: As you mentioned, this would require explicitly writing down the transformation to fuse the axes, which would be error-prone. It would also make the default behavior, where all buffer axes are fused into a single index, more of a special case.
* T2: Doesn't require the user to specify the transformation to fuse the axes, and allows the corresponding loops to be easily reorganized. The fusion in the default case, where no `transform_layout` is applied, then follows the same pattern, where a `BufferTransform` node is inserted to fuse all remaining axes into `num_axis_separators+1 == 1` axes.

> From TIR's perspective, the same primitive can be introduced to TIR as well. It will be an eager transformation (without needing to create BufferTransform node) that rewrite the both producer/consumer immediately upon calling it. This is required if we want to further scheduling the transformed block. I can follow up implementing it in TIR later

When the transformation is performed, would that include fusing axes together? If it does, then the `FlattenBuffer` pass would need to distinguish between a 2-d buffer that should be flattened to a flat 1-d buffer, and a 2-d buffer that has already been flattened from some larger dimension. If it doesn't, then the `FlattenBuffer` pass would need to distinguish between an N-d buffer that should be flattened to a 1-d buffer, and an N-d buffer that should be flattened to a 2-d buffer. In either case, `FlattenBuffer` would need some additional metadata to know how the buffer should be handled. I like the idea of having the transformation be done eagerly, to avoid needing to pass through additional information, but I'm not quite seeing how it could be done. (I had been trying an initial implementation that way on the TE side.)
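A usage sketch of option T2, assuming the proposed `transform_layout`/`te.AXIS_SEPARATOR` API; the proposed calls are commented out, since they do not exist yet at the time of writing.

```python
import tvm
from tvm import te

A = te.placeholder((16, 64), name="A")
B = te.compute(A.shape, lambda i, j: A[i, j] + 1, name="B")
s = te.create_schedule(B.op)

# Proposed usage for option T2:
#
#   AA = s[A].transform_layout(lambda i, j: [j // 4, te.AXIS_SEPARATOR, i, j % 4])
#   jo, i, ji = AA.op.axis    # bounds [64 // 4, 16, 4]; the separator only
#   s[AA].reorder(i, jo, ji)  # affects the later flattening into 2 physical axes
```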
Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)
> Since Option2 suggests the transform is global, shall we consider BufferTransform being part of function attribute?

I had initially placed `BufferTransform` as a statement so that it could later be extended to have a transformation defined by references to variables within the function body. This would allow other transforms that require rewriting memory layouts to add a `BufferTransform` defining the layout, relying on a later pass to implement the transformation itself. For example, `tir.transform.InjectDoubleBuffer` could be implemented by adding a buffer transform `lambda *indices: [loop_iter%2, *indices]`, and the load/store rewrites of `tir.transform.VectorizeLoop` could be implemented by adding a buffer transform `lambda *indices: [*indices, loop_iter]`. On further thought, this potential future extension would still be possible with the definition in the `PrimFunc`, so long as visitors that modify the `loop_iter` also modify the transformation. So I support having the transformations as a global definition, and will update the RFC.

Having a single global definition of the buffer transformations would also make the order clearer in cases of multiple transformations. A transformation may be more easily expressed as multiple sequential transformations, such as an initial transformation of a matrix into tiles for transfer to a GPU, followed by another transformation ordering the elements into groups of size `warp_size`. If these were defined in different portions of the TIR call graph, the order of these transformations would depend on the traversal order.

> This implies separation between buffer transform and the underlying physical layout requirement (e.g. 1-D by default in most cases).

That would minimize the number of changes to the TIR, though it would increase the number of distinct changes made while lowering. I had been thinking of the flattening to the underlying physical layout requirement as a special case of a transformation, one which happens to define a row-major traversal. Good point, though, that the current flattening depends on parameters outside of the buffer itself. As such, the transformation representing the flattening couldn't be generated initially. The concepts could be combined at some point in the future, if the flattening is made to depend only on the buffer, but that should be a separate change.

Between those two points, I think the conclusion is that `PrimFuncAttrs::attrs` should include a map from buffers to a list of layout transformations, and a map from buffers to a list of axis separator locations (a hypothetical shape for this attribute is sketched below). The layout transformations would be consumed by the pass that rearranges the layout, and the list of axis separator locations would be consumed by either `StorageFlatten` or `FlattenBuffer`.
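For concreteness, a hypothetical shape for that attribute; the key names, and the use of buffer names as keys, are placeholders shown in plain Python for illustration only.

```python
# Hypothetical layout of the proposed PrimFunc attributes; not an agreed spec.
func_attrs = {
    # Consumed by the pass that rewrites layouts; applied in list order, so a
    # sequence of transformations has a well-defined ordering.
    "layout_transforms": {
        "A": [
            # Tile a matrix for transfer to the GPU...
            lambda i, j: (i // 64, j // 64, i % 64, j % 64),
            # ...then group the elements of each tile by warp_size == 32.
            lambda bi, bj, ti, tj: (bi, bj, (ti * 64 + tj) // 32, (ti * 64 + tj) % 32),
        ],
    },
    # Consumed by StorageFlatten/FlattenBuffer: flatten down to
    # len(separators) + 1 physical axes.
    "axis_separators": {"A": [2]},
}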
Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)
Following a video chat discussion with @vinx13, we touched on a number of points, summarized below. Also, we are adding @vinx13 as a co-author on this RFC.

- Are there cases where the flattening in `StorageFlatten`/`FlattenBuffer` should be inferred from buffer properties, rather than explicitly specified by the user? For example, if a buffer has `"texture"` scope, then we know it must be flattened to a 2-d buffer. We concluded that this wouldn't be possible, because the number of resulting dimensions isn't sufficient to define the flattening being applied. For example, if a 4-d buffer is being flattened to 2-d for use in texture memory, the four initial axes `[A, B, C, D]` could be flattened to `[A, fuse(B,C,D)]`, `[fuse(A,B), fuse(C,D)]`, or `[fuse(A,B,C), D]`, without any clear method that is better or worse (enumerated in the sketch after this list).
- How will buffer layout transformations be represented in TensorIR schedules? `buffer_transform` will be a primitive transformation in TensorIR, which is eagerly applied to the TensorIR computation.
  - In all cases, this would rewrite the buffer shape, and would rewrite loads/stores of that buffer.
  - If these loads/stores occur within a series of nested loops that cover all values of the buffer, and there is no additional computation (e.g. cache read/write) in the body of these loops, then the loops will be rewritten to be along the transformed axes, and the remainder of the schedule can be written in terms of the transformed axes. Otherwise, rewriting the loops would not be well-defined, and will not be done.
  - The recommendation will be to apply layout transformations prior to any other scheduling steps that could impact the loop structure, so that rewriting of the loops is possible.
- Should buffer flattening be implemented as a special case of layout transformation? Buffer flattening should remain a separate concept from the layout transforms. Where all other layout transformations can be performed eagerly, and should come before other scheduling steps, buffer flattening must be performed after other scheduling steps. If it were done eagerly, other passes wouldn't have sufficient information about the structure of the buffer.
- Is deprecating Store/Load acceptable, instead using BufferStore/BufferLoad throughout all lowering steps? Yes, as this gives a single uniform way to access buffers, regardless of the lowering step. The one concern is that we should port all existing functionality. For example, the vload/vstore methods in Buffer, which currently return Load/Store respectively, should not be removed, and should instead be updated to return flattened BufferLoad/BufferStore.
- RampNode should be treated as a compiler internal, and shouldn't be easily constructible by users as an index into buffers. The preferred method to represent vectorized access is to have a buffer access within a vectorized loop, then allow `tir.transform.VectorizeLoop` to insert the RampNode. This matches previous behavior, where RampNode could occur in flattened Store/Load, while BufferLoad/BufferStore avoided RampNodes to maintain easy analysis of accessed locations.
- Passes that change buffer dimensionality (e.g. InjectDoubleBuffer) should either be moved before StorageFlatten/FlattenBuffer, or should be rewritten to resize the buffer rather than changing its dimensionality. The former would require the pass to also update the axis separators to be used when flattening.
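For the first point, a pure-Python enumeration of why the output dimensionality alone is ambiguous; the helper name is invented, for illustration only.

```python
# All ways to flatten a 4-d buffer to 2-d by fusing adjacent axes; the target
# dimensionality alone cannot pick between them.
from itertools import combinations

def flattenings(axes, ndim_out):
    n = len(axes)
    for cuts in combinations(range(1, n), ndim_out - 1):
        bounds = (0, *cuts, n)
        yield [axes[lo:hi] for lo, hi in zip(bounds, bounds[1:])]

for groups in flattenings(["A", "B", "C", "D"], 2):
    print(groups)
# [['A'], ['B', 'C', 'D']]
# [['A', 'B'], ['C', 'D']]
# [['A', 'B', 'C'], ['D']]
```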
Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)
Sounds good! I've updated the RFC with examples of scheduling with the returned new axes, which work in the implementation posted in [PR#9727](https://github.com/apache/tvm/pull/9727).
Re: [apache/tvm] [VOTE] Replace codeowners with more relevant automation (Issue #10471)
+1
Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)
> Based on my previous re-review of LLVM, thanks to @tqchen, it might help to use my_target.features.dsp rather than my_target.arch.has_dsp and clarifying these are features available to the Target? What do you think?

I like that, and the renaming makes it clear which are boolean parameters and which are variable attributes. That would also be useful for cleaning up the [Vulkan target](https://github.com/apache/tvm/blob/main/src/target/target_kind.cc#L341), which right now has a large number of boolean attributes that would be better expressed as feature support.

The renaming would also help with avoiding architecture-specific checks at the code review stage. Since the information defined by the architecture would be pulled over into `my_target.features`, checks that directly access `my_target.arch` should only occur for case (1).
[apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
This RFC introduces a method to specify padding to be applied as part of a buffer layout transformation, to be used when the desired layout does not evenly tile the buffer being transformed, and simplifications that can be performed based on these padded buffers. The motivating examples are primarily in the "Implementation options" section, which goes through several desired usages of the buffer padding, and how they can be automatically derived using the TIR primitives/transformations described in earlier sections.

TODO: Rendered Markdown link

You can view, comment on, or merge this pull request online at: https://github.com/apache/tvm-rfcs/pull/77

Commit Summary:
* [RFC] Buffer Layout Padding

File Changes:
* A rfcs/-layout-transform-padding.md (2521)

Patch Links:
* https://github.com/apache/tvm-rfcs/pull/77.patch
* https://github.com/apache/tvm-rfcs/pull/77.diff
Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
> Introducing changes to TIR would needs some additional thoughts that deserves some extra consideration. Due to the N*M complexity (where N is the TIR possibilities and M is the number of primitives to be supported) that needs to be handled in implementation (by backend implementers and primitive implementers)

This was part of the design consideration, to minimize the impact of the proposed changes on primitives, lowering transformations, and backends.

* The `BufferConstraint` annotations do not need specific handling at the codegen level, as they are only present to enable compile-time optimizations.
* Use of the `BufferConstraint` hints would occur within existing utilities, primarily as additional information available in the `arith::Analyzer` utilities. This minimizes the need for other primitives/transforms to be aware of the buffer constraints, while still benefiting from them.
* The `T.undef()` built-in does not need specific handling at the codegen level, as it is removed during lowering.
* The `T.undef()` built-in does not require specific handling from other primitives, as stores of `T.undef()` can be treated the same as stores of any other value.

> Right now it is possible to do non-local constraint rewriting flowings as part of the graph pass. Note that while E1 is indeed less "compact" on one hand, we can use it to reconstruct the desirable compact data structure(something like BufferConstraint that represents the layout mapping) that we can use to flow the decisions across the graph node during the pass.

I definitely agree that graph-level transforms are where the layouts and constraints should be decided. The `BufferConstraint` annotations are not intended as a way to override in TIR what was already decided at the graph level, but rather as a way to communicate to TIR transformations what has been decided at the graph level.

> E1: Composing a stage that transforms the layout(a loop that represents the mapping)

I'm still a bit confused by this approach, specifically how one would avoid having a separate compute definition for each workload on a new target (initially brought up by @csullivan [here](https://github.com/apache/tvm-rfcs/pull/77#discussion_r893701372)). In my mind, if I'm going to compose a layout transformation stage, it would need to be followed by a compute stage that takes the transformed layout as input. So rather than having a single conv2d that can be generalized over layouts, each transformed layout would still need to have a compute stage for it.

> Note that intiially such data structure do not need to live beyond the life of a pass, because they can be reconstructed at anytime from the other representation.

How would this be represented while optimizing the performance of a subgraph? My concern would be how to express the non-local constraints while keeping a small search space for optimization.

* Ensure that the producer and consumer stages are within the same subgraph. Since the constraints provided to a consumer depend not only on the producer, but also on the constraints provided to the producer, this might require fusing the entire end-to-end model into a single monolithic kernel. My understanding is that this would result in a search space that is too large to effectively optimize, though I haven't explicitly tested it.
* Insert a transformation stage into the subgraph, in which the constraint is written. Later portions of the subgraph could then rely on the constraint without examining other subgraphs. This would need some way to indicate that the transformation stage shouldn't be altered during optimization, nor should it be part of the performance timing.
* Express the graph-level constraints to a subgraph, so that it can optimize using those constraints. This was my intent with the `BufferConstraint` annotations, since the subgraphs could then take advantage of them.

> E1 also enables some additional capabilities (e.g.) expressing future memory remappings that do not necessarily fit into padding/packing.

Is there an existing annotation to indicate that a stage should be removed entirely during lowering? That might be an effective way to allow more general usage, by annotating a stage that can be assumed to have been performed prior to the subgraph. This would be a way to express the second option of an extra transformation stage, while still providing enough information to remove the transformation stage during lowering.
Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
> Indeed it is important to avoid having a separate compute definition for each workload on a new target. In this particular case, all computation definition would start with the original layout. Then there is a "schedule transformation" like transform layout which will generate the new stage as part of the scheduling process.

Thank you, and that is roughly how I'm seeing it as well: everything starts with the base compute definition and is modified from there. If I understand correctly, the main differences are below.

* Option A: Layout transformations of inputs are allowed, but only during initial graph-level optimization. When optimizing an individual PrimFunc, layout transformations of inputs and outputs are not allowed.
* Option B: Layout transformations of inputs and outputs are not allowed. If this is desired, it should be done by first introducing a cache stage in TIR, then transforming the layout of the cache, and finally by a graph-level transformation that inspects each PrimFunc and hoists the cache stage out.

> The particular stage can be marked, which contains effectively the same information as BufferConstraint, except that it does not introduce new data structures. During global layout reflowing, such information can be used to guide the reflowing to reconstruct a data structure like BufferConstraint or other Layout mappings and use that to serve the same purpose.

So long as the constraints can be statically searched for, this approach makes sense to me. I would be more concerned about adding additional semantics to existing nodes, such as an AttrStmt node, since that requires passes to be aware not only of the existence of the constraint, but also that it must be reconstructed from the existing data structure. This approach would make it much more difficult for a static analysis tool to identify locations where the constraints must be updated.

As a way to potentially find a way forward, what if we start by implementing pad values only for buffers that are allocated internally to a function? This would be allowed behavior under both Option A and Option B, and would help determine how difficult reconstruction of the constraints would be from the transformation block without any additional annotation. This could help motivate whether additional annotations are necessary, regardless of whether they are stored alongside the Buffer itself or in a separate attribute/annotation.
Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
> It doesn't add additional semantic, the computation semantic stays the same, it is a hint to the graph compiler.

My apologies, I had meant the semantics of a node from the perspective of a TIR transformation, not the semantics from the perspective of the computation being described. For a TIR transformation, if an object is replaced, whatever attributes describe that object must be updated to refer to the new object. So if constraints are added to the block annotation, I had been thinking of that as a change to the semantics of the `BlockRealizeNode::annotations`, from "does not need to be updated when a buffer is replaced" to "must be updated when a buffer is replaced".
Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
Writing out some of my thoughts, to see if there's a way to express the constraints while only using existing TIR features. The main goals would be as follows.

1. Allow simplification of expressions based on the values present in the padding.
2. Allow local simplifications to take advantage of non-local constraints, without requiring a full end-to-end analysis.
3. Specify the non-local constraints in some deducible manner that doesn't impose a runtime performance penalty.

Next, working through various options for how the constraints could be stored. The examples below sketch how these would apply to the element-wise operation which starts as below.

```python
@T.prim_func
def func(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    for i in T.serial(14):
        B[i] = 2 * A[i]
```

1. Apply layout transforms on local caches. Here, the full lifetime of a buffer is known. All TIR optimizations are done prior to hoisting the cache and layout transformation into the graph level.

   - For read caches, the pad value is whatever gets conditionally written to the padding while generating it. In the example below, `AC` could be recognized as being padded.

     ```python
     @T.prim_func
     def func(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
         AC = T.alloc_buffer([4, 4], "int32")
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 AC[io, ii] = A[4 * io + ii]
             else:
                 AC[io, ii] = 0
         for i in T.serial(14):
             B[i] = 2 * AC[i // 4, i % 4]
     ```

   - For write caches, the pad value is whatever is in the padding after the last write to the cache. In the example below, `BC` could be recognized as being padded.

     ```python
     @T.prim_func
     def func(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
         BC = T.alloc_buffer([4, 4], "int32")
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 BC[io, ii] = 2 * A[4 * io + ii]
             else:
                 BC[io, ii] = 0
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 B[4 * io + ii] = BC[io, ii]
     ```

   - Downside: either of the `else` statements could be eliminated as a no-op, since they don't contribute to the output `B` value. After that elimination, there wouldn't be any way to reconstruct the pad value.

2. When hoisting an allocation+transformation, write the pad value to the buffer at the start of the function from which it was hoisted. This way, the pad value can still be used in local reasoning.

   - No change needed in producers, since they would already write the pad value to the buffer.
   - For consumers, this would be represented as writing `pad_value` into the padding at the start of the function.

     ```python
     @T.prim_func
     def func(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
         for io, ii in T.grid(4, 4):
             if 4 * io + ii >= 14:
                 AC[io, ii] = 0
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 B[4 * io + ii] = 2 * AC[io, ii]
     ```

   - Downside: repeated unnecessary effort at the beginning of each consumer. Avoiding it with this representation would require knowing that the producer had written `pad_value` already, which is exactly the information we're trying to avoid.

3. When hoisting an allocation+transformation, write the pad value to the buffer at the start of the function from which it was hoisted, and write `T.undef()` at the end. This way, the pad value can still be used in local reasoning, and no-op removal can remove the repeated writing when lowering.

   - No change needed in producers, since they would already write the pad value to the buffer.
   - For consumers, this would be like option 2, but with an additional write of `T.undef()` at the end of the function. When lowering, the write of `T.undef()` would allow the first write to be removed as a no-op, because it is overwritten. The `T.undef()` can then be removed as described in the RFC.

     ```python
     @T.prim_func
     def func(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
         for io, ii in T.grid(4, 4):
             if 4 * io + ii >= 14:
                 AC[io, ii] = 0
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 B[4 * io + ii] = 2 * AC[io, ii]
         for io, ii in T.grid(4, 4):
             if 4 * io + ii >= 14:
                 AC[io, ii] = T.undef()
     ```

   - Downside: no way to distinguish between "can assume the pad value is zero" and "can overwrite the pad value at will". The writing of `T.undef()` would allow any writes to the padding to be inserted as a no-op.
   - Downside: wouldn't actually simplify out in cases where the pad value is used. The first in a pair of repeated writes to the same location
Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
> For example, we may introduce explicit cache stage to add the padding, and mark this block for later processing.

Wouldn't that require a "remove entirely" annotation that was suggested against [here](https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163019805)? I could see how we could mark a transformation to be hoisted out later, but when some simplifications require the constraint to be expressed in the producer, and others in the consumer, exposing it to both `PrimFuncs` for local simplifications would require either duplication of the block, or maintaining non-local information only for a single pass. If the stage is duplicated, all but one of the duplicates would need to be marked as temporary. If the information is only retained for a single pass, then any scheduling/optimization of a single subgraph would require walking through the entire end-to-end model.
Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
> Talking about “constraints”, it is also useful to talk about categories of them, roughly we can divide them into three categories.

I like this breakdown, and agree. In this categorization, what I've been calling "constraints" would be "assumptions". Double-checking in `builtin.h`, it looks like we don't currently have a TIR equivalent of `__builtin_assume`.

For usage of assumptions, I think the key would be to insert an assumption whenever the information that could otherwise prove it is hoisted out of the PrimFunc. That would provide non-local information that could be used by the PrimFunc to allow local simplifications.

> transformation of PrimFunc do not change the PrimFunc interface: this is really important so we can transform a PrimFunc without worrying about how the graph interacts with it(as the interface remains the same, we can lift out the blocks earlier)

I don't think we can make this strong of a statement, as it would also forbid fusing operators together or hoisting a stage out of a PrimFunc. In both cases, the signature of the resulting PrimFunc may be different than it was before. This shows up in the example, as the interface of `grow` is different from the transformed `grow_packed`.

As a slightly less general statement, I would say that transformations of a PrimFunc *in isolation* may not change the PrimFunc's interface. So an optimization search to improve the performance of a single subgraph may not change the layout of its own arguments, nor may it change assumptions about what is present in the padding, as those would change its interface. However, a graph-level transform would be allowed to fuse subgraphs, to hoist stages out of a PrimFunc, to alter the layout of a PrimFunc's input, or to alter the assumptions provided about the inputs. In general, a PrimFunc's interface could only be changed when calls into the PrimFunc are also modified to remain compatible.

Is there a better term than "scheduling primitive" to describe layout transformations that impact input/output buffers? I think the difference is between context-independent transformations that may be performed on a PrimFunc without changing its interface, as opposed to context-dependent transformations that may only be performed as part of a graph-level transformation.

> Each function needs to have its own TIR analysis of how it flows things back, for example, in the case of `addone`, we can safely flow PadMapping back, changing `addone` to `addone_packed` by analyzing the TIR. If the addone is elemwise exp however, we need to insert a select operator(because `exp(0)=1` ) the message to input becomes `PadMapping(constraint, pad_value=undef)`.

Would this handle cases where there are multiple different options for how an operator could be implemented? Otherwise, I'm not sure how this would handle cases where multiple different sets of layouts/constraints could be inferred from different TIR-level schedules of the same operator. As examples, the drop-down has 6 different implementations of `addone`, each of which would allow different hoistable pad/crop operations.

```python
# Implementation 1, no preproc/postproc are present.
#
# No hoistable layout transformations.  Could be fused with a layout
# transformation, but doesn't otherwise provide any constraints.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    for i in T.serial(14):
        with T.block("compute"):
            B[i] = A[i] + 1


# Implementation 2, pad input/output, but never access the padding of
# either input or output.
#
# In back-propagation of constraints, the T.undef() that is cropped
# from BC could be narrowed to a known value provided from the
# successor.  However, AC's padding is never written to, so could
# propagate T.undef() back to the preceding function.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            if 4 * io + ii < 14:
                AC[io, ii] = A[4 * io + ii]
    for i in T.serial(14):
        with T.block("compute"):
            BC[i // 4, i % 4] = AC[i // 4, i % 4] + 1
    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", T.undef()])
            B[i] = BC[i // 4, i % 4]


# Implementation 3, pad input with known value, but never access
# padding of output.
#
# In back-propagation of constraints, the T.undef() that is cropped
# from BC could be narrowed to a known value provided from the
# successor.  AC's padding is written to, so this would propagate
# `PadMapping(predicate, pad_value=0)` to the previous operator.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)
```
Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
> Our design principle at TIR level ideally we start with one instance of possibility, then use probabilistic space of meta-schedule to represent multiple choices.

For this, would the layout re-flowing occur periodically during optimization? Otherwise, including transformations in the performance benchmarking of candidates would unfairly penalize candidates that add a transformation step, while excluding transformations would unfairly bias toward transformations, even when sequential operators require separate layout transformations. Representing different options as different allowed steps in a search space makes sense to me, so long as the candidates are reasonably exposed to the optimizer.

> In our particular example, however, the idea is that the schedule primitive do not modify the input/output buffer, but introduce preproc and postproc stages with clear hint that they should be lifted out (aka we are doing the same thing in two steps)

I think I understand. That would effectively be treating the preproc/postproc stages as separate function bodies, which happen to exist within the same TIR PrimFunc for ease of use. With this representation, I think the biggest part would be determining when to fix a previously free parameter, in order to expose it as an assumption to another TIR PrimFunc. Maybe in "Step 2: Reflowing of Layouts", this isn't used to cancel any statements out, but instead to create a dynamic performance penalty if an assertion is no longer held, with the performance penalty equal to the time required to do the transformation.

> As a quick intermediate middle ground. For most intermediate stage(really like add or exp), we would ideally not insert any layout decisions and allow decisions from other key ops(conv and matmul) to backprop their decision to them.

I'd agree, though I would phrase it somewhat differently: the element-wise operations impose a constraint that the input and output have identical layouts (a toy sketch of this reflow follows below).
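A toy sketch of that back-propagation in plain Python; the data structures are invented for illustration, and the real reflow would operate on the graph IR.

```python
# Toy model: element-wise ops require identical input/output layouts, so a
# layout decided at a key op (conv/matmul) flows backward through them.
ELEMENTWISE = {"add", "exp", "relu"}

def reflow_layouts(producer_of, op_kind, decided):
    """producer_of: op -> its input op; decided: op -> layout from key ops."""
    layouts = dict(decided)
    changed = True
    while changed:
        changed = False
        for op, producer in producer_of.items():
            if op_kind[op] in ELEMENTWISE and op in layouts and producer not in layouts:
                layouts[producer] = layouts[op]  # identical in/out layout
                changed = True
    return layouts

producer_of = {"exp0": "add0", "add0": "input"}
op_kind = {"exp0": "exp", "add0": "add", "input": "placeholder"}
print(reflow_layouts(producer_of, op_kind, {"exp0": "NCHW4c"}))
# {'exp0': 'NCHW4c', 'add0': 'NCHW4c', 'input': 'NCHW4c'}
```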
Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
These make sense, and agreed that the TIR->global feedback is important for enabling the layout reflow. Going back through the discussion, I think we're converging on agreement on what features are required, and the main questions remaining are how to best provide annotation for non-local information, and how best to express layout transformations while scheduling.

I've made some updates to the text of the RFC, based on the discussions here, primarily to remove the proposed changes to TIR data structures. This follows your comment from a few days ago, which brought up `__builtin_assume` as a comparison.

* Adding an intrinsic `tir::builtin::assume`, which corresponds to the `__builtin_assume` LLVM intrinsic. The emphasis is that these assumptions are primarily to expose non-local information for use in local simplifications (usage sketched after this list).
* Removing `BufferConstraint` entirely. The RFC no longer proposes any changes to TIR data structures, only the `assume` and `undef` intrinsics.
* Describing what assumptions can/should be placed into a PrimFunc when hoisting stages out into independent PrimFuncs, and what transformations are legal based on the choice of exposed assumptions.
* Capturing some of the discussion here about the dangers of altering a PrimFunc's interface, and the limited cases where it may be altered.
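The intended usage of the proposed intrinsic, sketched in TVMScript; `T.assume` here is the intrinsic this RFC proposes, mirroring `__builtin_assume`, and is not yet part of TVM at the time of writing.

```python
from tvm.script import tir as T

@T.prim_func
def consumer(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
    # Non-local fact, provided because the producer that wrote the padding
    # has been hoisted out: the padding of AC holds zero.
    for io, ii in T.grid(4, 4):
        T.assume(4 * io + ii < 14 or AC[io, ii] == 0)
    for io, ii in T.grid(4, 4):
        if 4 * io + ii < 14:
            B[4 * io + ii] = 2 * AC[io, ii]
```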
Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
> In general it is helpful to first keep schedule decision local, e.g. introducing a caching stage (AC, BC in the example), the compose with another reflowing pass to bring the decision to consumer/producers.

My goal with the latest update wasn't to require global decisions, but to make local changes only, which could be used in different contexts. For the auto-scheduler, since the context requires maintaining the same PrimFunc interface, local optimization would be restricted to transformations of the caching stage. For stand-alone usage, such as preparing a single PrimFunc for a unit test, the context allows the interface to change. That way, the restrictions on the transformations are imposed by the level of abstraction that requires them.

> While it is possible to express padding with a loop and another loop that writes the padded value, it is harder to schedule the resulting blocks as there are more than one producers. Having a single loop and use `T.if_then_else` will express such pattern in a single shot and makes future rewriting easier.

I definitely agree that this makes the later analysis/rewrites easier. I had maintained them as two separate loops both to minimize the extent of changes being made in any one scheduling change, and to maintain the current behavior of `Schedule.transform_layout`, which does not alter the surrounding loop iterators ([previous conversation](https://github.com/apache/tvm/pull/10538#discussion_r826209815) with @vinx13). I see four main options for how the loop nests could be handled:

1. When a buffer is transformed, all loops in the producer stage over the buffer's pre-transformation axes are replaced with loops over the buffer's post-transformation spatial dimensions. It is an error if this replacement cannot be done (e.g. the pre-transformation loops have been fused/split/reordered).
   - Pro: Allows the `T.if_then_else` to be inserted at the time of the transformation.
   - Pro: Removes the need for the follow-up stage.
   - Con: May restrict the search space, since earlier manipulation of the loop iterators would prevent later buffer transformations.
   - Con: Doesn't help consumers of a transformed buffer. In a reduction, it may be desirable to iterate over the input buffer, but this couldn't be expressed in terms of an output.
   - Con: For buffers whose padding is not written to, must either insert a conditional statement or maintain the pre-transformation loop structure.
2. When a buffer is transformed, an attempt is made to replace all loops in the producer stage over the buffer's pre-transformation axes with loops over the buffer's post-transformation spatial dimensions. If this replacement cannot be done (e.g. the pre-transformation loops have been fused/split/reordered), and if `pad_value` is not `None`, then an error should be raised.
   - Pro: Always valid to apply a transform.
   - Pro: Avoids undoing scheduling benefits from previous changes to iterators.
   - Pro: Later calls to `reduce_branching_through_overcompute` could still introduce a value for the padding, if the full life cycle of the buffer is known.
   - Con: Allowing the follow-up stage at all requires just as much analysis to identify as if it were always present.
3. When a buffer is transformed, all loops in the producer stage over the buffer's pre-transformation axes are replaced with loops over the buffer's post-transformation spatial dimensions. If this replacement cannot be done (e.g. the pre-transformation loops have been fused/split/reordered), then the follow-up stage is inserted.
   - Pro: Always valid to apply a transform.
   - Pro: Avoids undoing scheduling benefits from previous changes to iterators.
   - Con: Allowing the follow-up stage at all requires just as much analysis to identify as if it were always present.
4. When a buffer is transformed, all loops over spatial dimensions in the producer are replaced with loops over the post-transformation buffer axes.
   - Pro: Always valid to apply a transform.
   - Con: May undo scheduling that has previously provided useful performance improvements.
   - Con: Loop iterators over pre-transformation indices may have been fused with reduction axes. Would need to undo the fusion to apply.

The current proposed version is option 4, but I think I'd prefer option 2, in order to reduce the number of follow-up simplifications required.

> Some of the complications of duplicated condition(and their simplification) roots from the fact that we do layout transform of output and input separately(each introducing their own conditions which then needs to be simplified). It might be helpful to do a global transformation, usually driven from the output, then "backprop" the implication of that decisions to the input. Doing such transformation at a single shot will likely alleviate the need of generating extra conditions then simplifying them.

At the TIR level, I suppose
Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)
Thank you very much for the comments, suggestions, and discussion, and I'm quite happy with how the design evolved over the course of the discussions!
Re: [apache/tvm] [VOTE] Commit Messages RFC (Issue #12583)
+1
Re: [apache/tvm-rfcs] [RFC] Add Commit Message Guideline (PR #88)
[Rendered link](https://github.com/gromero/tvm-rfcs/blob/cmg/rfcs/0088-commit-message-guideline.md)
Re: [apache/tvm-rfcs] [RFC] Introduce PresburgerSet (PR #99)
I very much like the proposed improvements, especially the use cases for inner-block and inter-block analysis. While I have done some development [for similar applications](https://github.com/apache/tvm/blob/main/src/tir/analysis/control_flow_graph.h), the additional formalism and reliability proposed here would be very useful.
Re: [apache/tvm] [VOTE] Clarify Community Strategy Decision Process (Issue #15521)
+1
Re: [apache/tvm-rfcs] [RFC] Scalable vectors in TIR (PR #104)
> What I'm aiming at is to be able to lower the TIR to a generic CPU, that is to an architecture that does not support SVE. The TIR will need to have some default lowering in CodeGenLLVM/CodeGenCPU, so being able to do that is important.

Could it instead be in a target-dependent lowering pass? That is, since a lowering pass after `BindTarget` ([here](https://github.com/apache/tvm/blob/main/src/driver/driver_api.cc#L573) in `driver_api.cc`) would know whether the target CPU supports SVE or not, we could make a pass that either returns the IRModule unmodified for CPUs that support SVE, or converts it to non-SVE instructions otherwise (sketched below). I'd like to avoid adding more complexity to the `CodeGenLLVM` and `CodeGenCPU` steps, as they are more difficult to test than IRModule-to-IRModule transformations.
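A sketch of what that target-dependent pass could look like; the pass decorator is TVM's standard one, while the SVE feature check and the rewrite call are hypothetical stand-ins.

```python
import tvm

def _supports_sve(target) -> bool:
    # Hypothetical check, standing in for the real target-feature query.
    return "+sve" in str(target)

@tvm.transform.module_pass(opt_level=0, name="LowerSVEForGenericCPU")
def lower_sve_for_generic_cpu(mod, ctx):
    # Assumes BindTarget has already attached a "target" attribute.
    for gvar, func in mod.functions.items():
        if not _supports_sve(func.attrs["target"]):
            # mod[gvar] = _rewrite_scalable_to_fixed(func)  # hypothetical rewrite
            pass
    return mod  # unmodified for targets that support SVE
```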
Re: [apache/tvm-rfcs] [RFC] Scalable vectors in TIR (PR #104)
Agreeing with @kparzysz-quic, changes that update the `DLDataType` would need to be approached very cautiously. I usually lean toward allowing short-term breakages if they lead to better long-term code health, but updating the `DLDataType` would be very wide-reaching, even for my tastes.

One way to limit the scope of the change might be to introduce a distinction between the runtime `DLDataType` and some new compile-time data type (sketched below). This would be analogous to the distinction between the runtime `DLTensor` and the compile-time `tir::Buffer`. It would still involve massive changes inside of TVM, but would preserve the runtime types, avoiding the ABI breakage of compiled functions.
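To illustrate the split, a sketch of the idea; all names here are invented, and this is not a proposed implementation.

```python
# Invented sketch of a compile-time data type that can represent scalable
# lanes, while lowering to the unchanged runtime DLDataType when fixed-width.
from dataclasses import dataclass

@dataclass(frozen=True)
class CompileTimeDataType:
    code: int                  # kDLInt, kDLUInt, kDLFloat, ...
    bits: int
    lanes: int                 # base lane count
    is_scalable: bool = False  # when True, actual lanes = lanes * vscale

    def to_runtime(self):
        """Lower to a fixed-width runtime type; scalable types must be
        resolved to a concrete width before reaching the runtime ABI."""
        assert not self.is_scalable
        return (self.code, self.bits, self.lanes)  # stand-in for DLDataType
```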
Re: [apache/tvm] [VOTE] Transition Main to Unity (Issue #16368)
+1
Re: [apache/tvm] [VOTE] Release Apache TVM v0.16.0.rc0 (Issue #16912)
+1 -- Reply to this email directly or view it on GitHub: https://github.com/apache/tvm/issues/16912#issuecomment-2080793568 You are receiving this because you are subscribed to this thread. Message ID:
[apache/tvm] [Bugfix][NCCL] Release NCCL thread_local resources in destructor (PR #17078)
Prior to this commit, allocations performed by `ncclCommInitRank` had no corresponding call to `ncclCommDestroy`. While `ncclCommDestroy` does occur in the `CCLThreadLocalContext::Clear` method, there are no calls into this method.

On worker processes, the failure to call `ncclCommDestroy` typically had little effect. Any destruction would occur shortly before the process exits, at which point the OS reclaims the resources anyway. However, worker0 of a Disco session is a separate thread, rather than a separate process. While this allows it to easily receive data from the controller thread, resources allocated by worker0 are not reclaimed by the OS until the entire process terminates. As a result, `CCLThreadLocalContext` leaked GPU memory, as the communicator created by `ncclCommInitRank` at the start of each `tvm.runtime.disco.ProcessSession` was never destroyed. The increase in GPU memory usage was about 1 gigabyte for each `ProcessSession`.

This commit updates `CCLThreadLocalContext` to have a destructor that calls the `Clear` method. For worker0, this runs when the thread is joined to the main thread.

You can view, comment on, or merge this pull request online at: https://github.com/apache/tvm/pull/17078

-- Commit Summary --

* [Bugfix][NCCL] Release NCCL thread_local resources in destructor

-- File Changes --

M src/runtime/disco/nccl/nccl.cc (12)
M src/runtime/disco/nccl/nccl_context.h (15)

-- Patch Links --

https://github.com/apache/tvm/pull/17078.patch
https://github.com/apache/tvm/pull/17078.diff

-- Reply to this email directly or view it on GitHub: https://github.com/apache/tvm/pull/17078 You are receiving this because you are subscribed to this thread. Message ID:
[Apache TVM Discuss] [Development/RFC] [RFC] Parametrized Unit Tests
Adding notes from a few video chats, so that there is a record of the discussion.

From @tkonolige: confirmed that the current implementation of `@tvm.testing.parametrize_targets` shows skipped targets if they are explicitly listed in the decorator, but not if they come from the `TVM_TEST_TARGETS` environment variable.

From @tkonolige: some of the unit tests have a significant amount of setup required before the loop over `enabled_targets()`, and repeating the setup for many targets would increase the runtime of the tests. I agree, this definitely would be an issue and should have a recommended style to avoid duplicating work. I think the best style would be to use xunit-style `setup_class` to perform the setup, then have the test methods within the class implemented using parametrized fixtures (see the sketch below). I'll test out a few options and get back on it.

From @areusch: recommended using xfail instead of skip for tests marked as known failing. I agree, and will make that change to the PR.

From @areusch: recommended removing the global variables `tvm_excluded_targets` and `tvm_known_failing_targets`, making the decorator the only method by which to apply these traits. The global variables can silently have a typo, whereas a typo in a decorator throws an error. I agree, and will make that change to the PR.

From @areusch: though perhaps out of scope of this particular RFC, it would be good to have some discussion as to which unit tests should be required to have parametrized targets, and which are allowed to be target-specific (e.g. where changes should be requested in PRs for having non-parametrized targets). My first thought would be to rearrange the current `tvm/tests/python` folder to mimic the current organization of the top-level `src` and `python` directories, rather than the current division into unit/integration tests. Then the tests specific to the `target` subdirectory and the target-specific subdirectories in `runtime` would be allowed to have non-parametrized tests, while all others would be parametrized.

Someone mentioned that @masahi was looking into ways to compare cuda/vulkan correctness across all topi models, and that this might be of interest to him. The topi tests were exactly the ones I had in mind as a starting point for converting over to the parametrized targets, for exactly the same reason.

--- [Visit Topic](https://discuss.tvm.apache.org/t/rfc-parametrized-unit-tests/9946/4) to respond. You are receiving this because you enabled mailing list mode. To unsubscribe from these emails, [click here](https://discuss.tvm.apache.org/email/unsubscribe/561dfa5e8ed159e2f3325fad182bda38726ae6c015f9f9dc62147d66e5ebfe8c).
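As a sketch of that `setup_class` style, under the assumptions that the array shape and the assertion are placeholders and that `@tvm.testing.parametrize_targets` is used in its bare-decorator form:

```
import numpy as np

import tvm.testing


class TestWithSharedSetup:
    @classmethod
    def setup_class(cls):
        # xunit-style setup: runs once for the whole class, so the
        # expensive work is not repeated for every target.
        cls.input_data = np.random.uniform(size=(1024,)).astype("float32")

    @tvm.testing.parametrize_targets
    def test_runs_on_each_target(self, target, dev):
        # `target` and `dev` are supplied per-target by the decorator,
        # while the shared setup above is reused across all of them.
        assert self.input_data.shape == (1024,)
```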
[Apache TVM Discuss] [Development/RFC] [RFC] Parametrized Unit Tests
Correct, the different cases are intended to show the entire contents of a test file. The names in this example are chosen so that it can be run with minimal interaction between cases.

For `fixture(scope="module")`, the scope indicates when pytest should clean up a fixture, but the fixture is only available to tests that explicitly accept it as a function argument. With the default scope of `"function"`, the cleanup is done between each test. Absent any caching that we add, this scope is what prevents the fixture from being regenerated (repeating the expensive setup) for each target/size parameter.

For the caching, using the cached values should be no different than passing an argument into a `verify_*` function, or a `verify_*` function accessing a nonlocal variable. In either case, we're relying on the test not to make modifications to the setup that is shared between multiple tests.

That said, if we decide to use fixtures as the method to handle expensive setup, I think that we should add a helper function `tvm.testing.fixture` to reduce the amount of boilerplate, avoid potential mistakes (e.g. having `@lru_cache` before `@pytest.fixture` instead of after), and serve as a place to disable the cache entirely based on an environment variable. I haven't implemented it yet, but the behavior would be something like the following.

```
import functools

import pytest

import tvm.testing

# Each pair below shows the direct pytest form and the proposed helper
# form; a real test file would contain only one of each pair.


# Using pytest directly, defining a parametrized fixture
@pytest.fixture(scope="module", params=[1, 10, 100])
def size(request):
    return request.param


# With the helper function, same behavior as above
tvm.testing.fixture(name="size", params=[1, 10, 100])


# Using pytest directly, defining a cached parameter derived from `size`
@pytest.fixture(scope="module")
@functools.lru_cache
def expensive_size_dependent(size):
    ...


# With the helper function, same behavior as above, except that caching
# can be overridden with an environment variable.
@tvm.testing.fixture(cache=True)
def expensive_size_dependent(size):
    ...
```

--- [Visit Topic](https://discuss.tvm.apache.org/t/rfc-parametrized-unit-tests/9946/9) to respond. You are receiving this because you enabled mailing list mode. To unsubscribe from these emails, [click here](https://discuss.tvm.apache.org/email/unsubscribe/b5a3a077ef1325d8f5ba6f44ab839f7ca1110666c71d1c01fc1a7a4055bbf550).
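For what it's worth, one way such a helper could be implemented is sketched below. This is not the actual `tvm.testing.fixture`, the environment-variable name is hypothetical, and only the decorator form is handled; the `name=`/`params=` registration form shown above is omitted.

```
import functools
import os

import pytest


def fixture(func=None, *, cache=False, **kwargs):
    # Hypothetical sketch of the proposed helper (decorator form only).
    def wrap(f):
        if cache and not os.environ.get("TVM_TEST_DISABLE_CACHE"):
            # Apply lru_cache underneath pytest.fixture, so callers
            # cannot get the decorator ordering wrong.  A real
            # implementation would also need to preserve the wrapped
            # signature for pytest's fixture resolution.
            f = functools.lru_cache(maxsize=None)(f)
        return pytest.fixture(**kwargs)(f)

    return wrap(func) if func is not None else wrap
```

Centralizing the decorator order this way is exactly what removes the `@lru_cache`-before-`@pytest.fixture` failure mode mentioned above, and the environment-variable check gives a single switch for disabling the cache when debugging test interactions.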
[Apache TVM Discuss] [Development/pre-RFC] [docs][discuss] Convert ReStructuredText docs to Markdown?
I think I'd lean towards markdown for consistency with other services, but only if all other features were equal. Markdown would be nicer for reviewing, since it can be viewed on GitHub in the browser, but I think cross-references are the more important feature. Would either md or rST allow for cross-references between the python API documentation and the C++ documentation? I think that would be the biggest feature for me, since it would give a clear way to indicate which functions are intended to only be wrappers. --- [Visit Topic](https://discuss.tvm.apache.org/t/docs-discuss-convert-restructuredtext-docs-to-markdown/10264/3) to respond. You are receiving this because you enabled mailing list mode. To unsubscribe from these emails, [click here](https://discuss.tvm.apache.org/email/unsubscribe/cb0d61960a4d8efed6c9dc969d5a5c7c595df05ccf969468135fb695d3ae41a9).
[Apache TVM Discuss] [Development/pre-RFC] [pre-RFC] Vectorized TIR Buffers
@masahi Tagging you, following the comments on https://github.com/apache/tvm/pull/8528#pullrequestreview-718506978 --- [Visit Topic](https://discuss.tvm.apache.org/t/pre-rfc-vectorized-tir-buffers/10615/5) to respond. You are receiving this because you enabled mailing list mode. To unsubscribe from these emails, [click here](https://discuss.tvm.apache.org/email/unsubscribe/2b20409dd28ff339a6a9b23c17ad88ce918e6a63561c0b46db2304e8981c9bfc).