Re: [apache/tvm] [VOTE] Adopt the New RFC Process (#7991)

2021-05-06 Thread Lunderberg
+1

One small question: Under "Tracking Issue", the tracking issue is closed "when 
the RFC is either completed or abandoned."  Is "abandoned" the same as 
"postponed" in this context, or is there a distinction between them?

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm/issues/7991#issuecomment-833718493

Re: [apache/tvm] [VOTE] Adopt the New RFC Process (#7991)

2021-05-06 Thread Lunderberg
> @Lunderberg @junrushao1994 good catch, would changing "abandoned" to 
> "postponed" resolve the ambiguity sufficiently?

Thank you!  That resolves my question.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm/issues/7991#issuecomment-833786122

[apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)

2021-10-05 Thread Lunderberg
This RFC introduces a hard boundary between the “logical layout” of a 
mathematical tensor and the “physical layout” of a buffer in memory, along with 
a specification for defining the conversion between the two.
You can view, comment on, or merge this pull request online at:

  https://github.com/apache/tvm-rfcs/pull/39

-- Commit Summary --

  * [RFC][TIR] Separate physical and logical layout of buffers
    (https://github.com/apache/tvm-rfcs/pull/39/commits/5e3773ec22dbf44f6a3d771afba7e46d360b9c47)

-- File Changes --

A rfcs/0039-buffer-physical-layout.md (399)

-- Patch Links --

https://github.com/apache/tvm-rfcs/pull/39.patch
https://github.com/apache/tvm-rfcs/pull/39.diff

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39


[apache/tvm-rfcs] [RFC][TIR] 1-d AllocateNode::extent (#40)

2021-10-06 Thread Lunderberg
This RFC changes `Array<PrimExpr> AllocateNode::extents` to `PrimExpr 
AllocateNode::extent`, giving the 1-d size of the buffer to be allocated.  This 
is part of the separation between logical layout and physical layout, proposed 
in [RFC#0039](https://github.com/apache/tvm-rfcs/pull/39).

TODO: Rendered markdown link
You can view, comment on, or merge this pull request online at:

  https://github.com/apache/tvm-rfcs/pull/40

-- Commit Summary --

  * [RFC][TIR] 1-d AllocateNode::extent
    (https://github.com/apache/tvm-rfcs/pull/40/commits/06c8dea05cc1eb154e6799656650b333b2b97771)

-- File Changes --

A rfcs/-allocate-node-1d-extent.md (111)

-- Patch Links --

https://github.com/apache/tvm-rfcs/pull/40.patch
https://github.com/apache/tvm-rfcs/pull/40.diff

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/40


Re: [apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)

2021-10-06 Thread Lunderberg
Thank you for the comments, @vinx13.

> For example, another way is to use a mapping function: (n, c, h, w) -> (n, 
> tir.floordiv(c, 4), h, w, tir.floormod(c, 4)). This would allow arbitrary 
> mapping (we can add more restrictions like requiring affine mapping though, 
> to make analysis easier).

~~Having an arbitrary mapping function was something that I considered as an 
alternative, and would like to use as part of the internal representation.  
However, for the user-facing API, I prefer having a representation that cannot 
express an illegal internal state, rather than relying on validation.  An 
arbitrary mapping function could map more than one logical index to the same 
physical index (e.g. `(i,j) -> (i+j,)`), which would need to be caught after 
the fact.~~

Never mind, I'm convinced after going through your point below about CUDA's 
permuted layout.

> A possible use cases of more complex mapping is [permuted 
> layout](https://github.com/NVIDIA/cutlass/blob/master/media/docs/implicit_gemm_convolution.md#shared-memory-layouts)
>  for shared memory on CUDA.

Thank you for the example; this sort of example was what I was hoping to get 
in order to determine whether the representation of reorders/splits is 
sufficiently flexible.  If I'm reading the link correctly, I agree that the 
physical layout in shared memory cannot be expressed in the proposed notation.  
If the global memory is 2-d memory `(i,j)`, the closest representation of the 
physical layout in shared memory would be `((i,2), (j,4), (j,8), SEP, 
(i,4))`.  This would maintain the 128-bit vectors `(j,8)` and would transpose 
because `(j,8)` appears in the first physical coordinate, but it wouldn't 
handle the XOR condition needed for conflict-free access.  It also couldn't 
specify that the `(i,2)` in the first physical coordinate would need to take 
precedence over the `(i,4)` in the second physical coordinate.  In the 
functional notation, this would be `(i,j) -> (floormod(i,2), (floormod(row,8)) ^ 
(floordiv(j,8)), floormod(j,8), SEP, row)`, where `row = floordiv(i, 2)`.
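
To make the XOR swizzle concrete, here is a small plain-Python sketch of that 
functional notation (illustration only, not TVM API):

```python
# Plain-Python evaluation of the mapping above; the entry after SEP is the
# second physical coordinate.
def physical_index(i, j):
    row = i // 2
    return (i % 2, (row % 8) ^ (j // 8), j % 8, row)

# e.g. physical_index(5, 19) == (1, 2 ^ 2, 3, 2) == (1, 0, 3, 2)
```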

I am convinced that, while the earlier notation may be useful as a more 
readable input in some cases, it doesn't cover all use cases.  At most, it 
could be an alternative form of input; the function definition mapping from 
logical indices to physical indices should be the primary representation.

In terms of how to represent this in the TIR graph, I'll have to think a bit on 
it.  I'm picturing either `BufferNode` holding `Array<Var> index_vars` and 
`Array<PrimExpr> physical_layout`, or making a TIR variant of 
`tvm::relay::FunctionNode`.

> Also, there are related affine analysis infrastructure available, it would be 
> great if we can reuse it for loop analysis and rewriting.

Thank you for mentioning these.  I intend to use these utilities for the 
implementation, and will add them to the detail section.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-936560130

Re: [apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)

2021-10-06 Thread Lunderberg
True, and that would allow both for user-defined mappings, and for specifying 
standard layouts.  I have a bit of concern with using the `scope` parameter to 
also describe layouts, since in my mind "scope" refers to where the entire 
buffer is located, while "layout" refers to how individual elements are 
arranged within the buffer.  Two buffers that have the same layout and 
different scopes could be `memcpy`-ed between the two scopes without additional 
rearrangement, and that would be obscured if the `scope` parameter is used to 
define both the scope and the layout.

I really like that idea of building the TIR representation based on the 
provided function.  Like with `te.compute`, that gives a convenient way to 
allow the library to handle the var/primexpr building, so long as the user 
provides the function to compute the physical indices.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-937008706

Re: [apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)

2021-10-07 Thread Lunderberg
> How will vectorization work? If there is a `vectorize` directive spanning a 
> logical extent, will the vectorization pass create multidimensional `ramp`s? 
> How will vector loads and stores be represented?

While in principle a vectorized load/store could fall back to a non-vectorized 
load/store, this would result in ignoring the `vectorize` directive entirely, 
which would be rather unexpected from the user's perspective.  Therefore, my 
plan for the initial implementation is to require that vectorized elements be 
contiguous in both the logical and physical layouts.  In both the physical and 
logical layouts, the vector load/store would be represented using a `RampNode` 
with `stride==1`.
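
As a small sketch of that representation, constructing the nodes directly with 
TVM's Python API (the surrounding lowering is omitted):

```python
from tvm import tir

i = tir.Var("i", "int32")
buf = tir.decl_buffer((64,), "float32", name="A")

# A contiguous 4-lane vector access: the index is a Ramp with stride == 1.
vec_index = tir.Ramp(4 * i, 1, 4)            # covers [4*i, 4*i + 4)
vec_load = tir.BufferLoad(buf, [vec_index])  # loads a "float32x4" value
```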

This restriction could then be loosened by allowing schedules to be expressed 
in terms of the physical layout, rather than the logical layout.  I haven't 
thought through the full implications of this; it would likely involve either 
introducing an inlined Stage whose axes are the physical axes, or exposing the 
physical axes as a separate parameter within the same stage.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-937824488

Re: [apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)

2021-10-07 Thread Lunderberg
Following a video chat with @csullivan, documenting some of the key points of 
the conversation.

* Setting the physical layout in a TE-based schedule has two roles.  One is to 
rewrite the buffer itself, and the other is to define the order of iteration 
when writing to the buffer.  In the latter use case, the schedule should have 
access to the physical axes for use in the schedule.
* Setting the physical layout is roughly equivalent to using an inlined 
`te.compute` to define a re-shaped tensor.  However, using `te.compute` 
requires the compute definition to change, whereas `set_physical_layout` is 
entirely contained in the schedule.
* For a given tensor, either the physical axes or the logical axes may be used 
in scheduling, but not both.  We were unable to think of use-cases where you 
would want to change the physical layout, but still maintain the loop order of 
the logical layout.  One idea that came out of this conversation was to have 
`set_physical_layout` act similarly to `cache_read` and `cache_write`, which 
introduce a new stage.
* If a physical layout is defined for a cache stage, then the generated code 
should perform the layout transformation as part of generating that cache.
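
For illustration, a rough TE sketch of the last two points; `set_physical_layout` 
was only a working name at this point, so the call is left commented out:

```python
from tvm import te

A = te.placeholder((16, 64), name="A")
B = te.compute(A.shape, lambda i, j: A[i, j] + 1, name="B")
s = te.create_schedule(B.op)

# Introduces a new stage, analogous to what set_physical_layout might do:
AA = s.cache_read(A, "shared", [B])
# Proposed (hypothetical signature): transform the cache's physical layout,
# so that generating the cache also performs the layout transformation.
# s[AA].set_physical_layout(...)
```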

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-938057599

Re: [apache/tvm-rfcs] [RFC][TIR] Change AllocateNode::extent to 1-D (#40)

2021-10-08 Thread Lunderberg
This is the first in a series of proposed changes, and this one on its own 
won't be able to support the `PHYSICAL_AXIS_SEPARATOR`.  The [Impacted TIR 
Nodes](https://github.com/Lunderberg/tvm-rfcs/blob/data_layout/rfcs/0039-buffer-physical-layout.md#impacted-tir-nodes) 
section of RFC #39 describes the changes being made under `AllocateNode`.  This 
first change removes all information about the logical shape from the 
`AllocateNode`; a follow-up change will then extend it to allow N-d physical 
shapes.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/40#issuecomment-939073862

Re: [apache/tvm-rfcs] [RFC][TIR] Separate physical and logical layout of buffers (#39)

2021-10-28 Thread Lunderberg
Following discussion with @tqchen, this RFC has received significant updates.  
The major change is that, instead of extending the capabilities of `Store` and 
`Load` nodes to support N-d indices, those nodes would be removed in favor of 
keeping `BufferStore` and `BufferLoad` nodes throughout the lowering process.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-954215388

Re: [apache/tvm-rfcs] [RFC][TIR] Change AllocateNode::extent to 1-D (#40)

2021-10-28 Thread Lunderberg
Closing this RFC, as it is no longer applicable after significant changes made 
to #39.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/40#issuecomment-954215697

Re: [apache/tvm-rfcs] [RFC][TIR] Change AllocateNode::extent to 1-D (#40)

2021-10-28 Thread Lunderberg
Closed #40.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/40#event-5536746113

Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)

2021-11-01 Thread Lunderberg
> I'd suggest adding the BufferTransform data structure here which will be very 
> helpful to other audience.

Sounds good; I've added a description of it, along with a possible data 
structure for it.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-956454255

Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)

2021-11-03 Thread Lunderberg
> Usage of te.AXIS_SEPARATOR: It seems this is only used in the API side but 
> not in BufferTransform, would be good to get some clarification.

That's correct: the `te.AXIS_SEPARATOR` only appears in the API for the TE 
schedules, not in the TIR graph generated from the TE schedule.  I've updated 
the RFC with a paragraph on this distinction.

For the axes returned, I'd want to add a third option.

* T2: Using `te.AXIS_SEPARATOR` to separate groups of axes, e.g. `AA = 
s[A].transform_layout(lambda i, j: [j//4, te.AXIS_SEPARATOR, i, j%4])`, 
directly implies fusion of the `i` and `j%4` axes, but exposes each transformed 
axis for use in the TE schedule.  That is, `AA.op.axis` has three elements, 
whose iteration bounds are `[A.shape[1]//4, A.shape[0], 4]`.
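
A hedged sketch of the T2 usage, assuming the proposed `transform_layout` and 
`te.AXIS_SEPARATOR` API:

```python
from tvm import te

A = te.placeholder((16, 64), name="A")
B = te.compute(A.shape, lambda i, j: A[i, j] + 1, name="B")
s = te.create_schedule(B.op)

# Proposed usage; under T2 the transformed stage would expose three axes,
# with extents [64 // 4, 16, 4] == [16, 16, 4]:
# AA = s[A].transform_layout(lambda i, j: [j // 4, te.AXIS_SEPARATOR, i, j % 4])
```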

Then my thoughts on each option are below, with T2 as my preferred option.

* T0: This is easy to write in TE, but any additional manipulation of the axes 
(e.g. changing the loop ordering independent of the data layout) would require 
splitting the returned axis to match the layout transformation.  It would also 
mean that `lambda *indices: indices` isn't the identity function, because it 
would flatten the buffer as part of that transformation.
* T1: As you mentioned, this would require explicitly writing down the 
transformation to fuse the axes, which would be error-prone.  It would also 
make the default behavior, where all buffer axes are fused into a single index, 
more of a special case.
* T2: Doesn't require the user to specify the transformation to fuse the axes, 
and allows the corresponding loops to be easily reorganized.  The fusion in the 
default case, where no `transform_layout` is applied, then follows the same 
pattern, where a `BufferTransform` node is inserted to fuse all remaining axes 
into `num_axis_separators+1 == 1` axes.


> From TIR's perspective, the same primitive can be introduced to TIR as well. 
> It will be an eager transformation (without needing to create BufferTransform 
> node) that rewrite the both producer/consumer immediately upon calling it. 
> This is required if we want to further scheduling the transformed block. I 
> can follow up implementing it in TIR later

When the transformation is performed, would that include fusing axes together?  
If it does, then the `FlattenBuffer` pass would need to distinguish between a 
2-d buffer that should be flattened to a flat 1-d buffer and a 2-d buffer that 
has already been flattened from some larger dimension.  If it doesn't, then the 
`FlattenBuffer` pass would need to distinguish between an N-d buffer that 
should be flattened to a 1-d buffer and an N-d buffer that should be flattened 
to a 2-d buffer.  In either case, `FlattenBuffer` would need some additional 
metadata to know how the buffer should be handled.

I like the idea of having the transformation be done eagerly, to avoid needing 
to pass through additional information, but I'm not quite seeing how it could 
be done.  (I had been trying an initial implementation that way on the TE side.)

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-959696021

Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)

2021-11-03 Thread Lunderberg
> Since Option2 suggests the transform is global, shall we consider 
> BufferTransform being part of function attribute?

I had initially placed `BufferTransform` as a statement so that it could later 
be extended to have a transformation defined by references to variables within 
the function body.  This would allow other transforms that require rewriting 
memory layouts to add a `BufferTransform` defining the layout, relying on a 
later pass to implement the transformation itself.  For example, 
`tir.transform.InjectDoubleBuffer` could be implemented by adding a buffer 
transform `lambda *indices: [loop_iter%2, *indices]`, and the load/store 
rewrites of `tir.transform.VectorizeLoop` could be implemented by adding a 
buffer transform `lambda *indices: [*indices, loop_iter]`.
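
As a sketch of the double-buffering example, with `loop_iter` standing in for 
the pipelined loop's iteration variable:

```python
from tvm import tir

loop_iter = tir.Var("loop_iter", "int32")
double_buffer = lambda *indices: [loop_iter % 2, *indices]

# An access buf[i] would become buf[loop_iter % 2, i], doubling the
# leading dimension of the rewritten buffer.
print(double_buffer(tir.Var("i", "int32")))
```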

On further thought, this potential future extension would still be possible 
with the definition in the `PrimFunc`, so long as visitors that modify the 
`loop_iter` also modify the transformation.  I therefore support having the 
transformations as a global definition, and will update the RFC.

Having a single global definition of the buffer transformations would also make 
the order clearer in cases of multiple transformations.  A transformation may 
be more easily expressed as multiple sequential transformations, such as an 
initial transformation of a matrix into tiles for transfer to a GPU, followed 
by another transformation that orders the elements into groups of size 
`warp_size`.  If these were defined in different portions of the TIR call 
graph, the order of the transformations would depend on the traversal order.

> This implies separation between buffer transform and the underlying physical 
> layout requirement (e.g. 1-D by default in most cases).

That would minimize the number of changes to the TIR, though it would increase 
the number of distinct changes that are made while lowering.  I had been 
thinking of the flattening to the underlying physical layout requirement as a 
special case of a transformation, which happens to define a row-major traversal.

Good point, though, on how the current flattening depends on the parameters 
outside of the buffer itself.  As such, the transformation representing the 
flattening couldn't be generated initially.  The concepts could be combined at 
some point in the future, if the flattening is made to depend only on the 
buffer, but that should be a separate change.

Between those two points, I think the upshot is that `PrimFuncAttrs::attrs` 
should include a map from buffers to a list of layout transformations, and a 
map from buffers to a list of axis-separator locations.  The layout 
transformations would be consumed by the pass that rearranges the layout, and 
the axis-separator locations would be consumed by either `StorageFlatten` or 
`FlattenBuffer`.
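
A rough sketch of what those attributes could look like; the attribute keys are 
illustrative rather than a settled API, and `tir.IndexMap` stands in for 
whatever representation the transformations use:

```python
from tvm import tir

A = tir.decl_buffer((16, 64), "float32", name="A")
index_map = tir.IndexMap.from_func(lambda i, j: [j // 4, i, j % 4])

# Assuming `func` is an existing tir.PrimFunc:
# func = func.with_attr("layout_transform_map", {A: [index_map]})
# func = func.with_attr("axis_separators", {A: [1]})
```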

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-960254384

Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)

2021-11-04 Thread Lunderberg
Following a video chat discussion with @vinx13, we touched on a number of 
points, summarized below.  Also, we are adding @vinx13 as a co-author on this 
RFC.

- Are there cases where the flattening in `StorageFlatten`/`FlattenBuffer` 
should be inferred from buffer properties, rather than explicitly specified by 
the user?  For example, if a buffer has `"texture"` scope, then we know it must 
be flattened to a 2-d buffer.  We concluded that this wouldn't be possible, 
because the number of resulting dimensions isn't sufficient to define the 
flattening being applied.  For example, if a 4-d buffer is being flattened to 
2-d for use in texture memory, the four initial axes `[A, B, C, D]` could be 
flattened to `[A, fuse(B,C,D)]`, `[fuse(A,B), fuse(C,D)]`, or `[fuse(A,B,C), 
D]`, with no clear basis for preferring one over another.  (A sketch after this 
list makes this concrete.)

- How will buffer layout transformations be represented in TensorIR schedules?  
 `buffer_transform` will be a primitive transformation in TensorIR, which is 
eagerly applied on the TensorIR computation.
  - In all cases, this would rewrite the buffer shape, and would rewrite 
loads/stores of that buffer.
  - If these loads/stores occur within a series of nested loops that cover all 
values of the buffer, and have no additional computation (e.g. cache 
read/write) in the body of these loops, then the loops will be rewritten to 
iterate along the transformed axes, and the remainder of the schedule can be 
written in terms of the transformed axes.  Otherwise, rewriting the loops would 
not be well-defined, and will not be done.
  - The recommendation for use will be to apply the layout transformations 
prior to any other scheduling passes that could impact the loop structure, so 
that rewriting of the loops is possible.

- Should buffer flattening be implemented as a special case of layout 
transformation?  Buffer flattening should remain a separate concept from the 
layout transforms.  Whereas all other layout transformations can be performed 
eagerly, and should be applied before other scheduling passes, buffer 
flattening must be performed after other scheduling passes.  If it were done 
eagerly, other passes wouldn't have sufficient information about the structure 
of the buffer.

- Is deprecating Store/Load acceptable, instead using BufferStore/BufferLoad 
throughout all lowering steps?  Yes, as this gives a single uniform way to 
access buffers, regardless of the lowering step.  The one concern is that we 
should port all existing functionality.  For example, the vload/vstore methods 
in Buffer, which currently return Load/Store respectively, should not be 
removed, and instead should be updated to return flattened 
BufferLoad/BufferStore.

- RampNode should be treated as a compiler internal, and shouldn't be easily 
constructible by users as indices into buffers.  The preferred method to 
represent vectorized access is to have a buffer access within a vectorized 
loop, then allow `tir.transform.VectorizeLoop` to insert the RampNode.  This 
matches previous behavior, where RampNode could occur in flattened Store/Load, 
while BufferLoad/BufferStore avoided RampNodes to maintain easy analysis of 
accessed locations.

- Passes that change buffer dimensionality (e.g. InjectDoubleBuffer) should 
either be moved before StorageFlatten/FlattenBuffer, or should be rewritten to 
resize the buffer rather than changing its dimensionality.  The former would 
require the pass to also update the axis separators used when flattening.
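
To illustrate the first point above (the resulting dimensionality alone doesn't 
determine the flattening), a small plain-Python sketch:

```python
from functools import reduce
from operator import mul

def flatten_shape(shape, axis_separators):
    """Fuse the axes between consecutive separators into single extents."""
    groups, prev = [], 0
    for sep in list(axis_separators) + [len(shape)]:
        groups.append(reduce(mul, shape[prev:sep], 1))
        prev = sep
    return groups

# Three distinct ways to flatten a 4-d shape to 2-d:
print(flatten_shape([2, 3, 5, 7], [1]))  # [2, 105] -> [A, fuse(B,C,D)]
print(flatten_shape([2, 3, 5, 7], [2]))  # [6, 35]  -> [fuse(A,B), fuse(C,D)]
print(flatten_shape([2, 3, 5, 7], [3]))  # [30, 7]  -> [fuse(A,B,C), D]
```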

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-961358086

Re: [apache/tvm-rfcs] [RFC][TIR] Layout transformations on buffer access (#39)

2021-12-13 Thread Lunderberg
Sounds good!  I've updated the RFC with examples of scheduling with the 
returned new axes, which work in the implementation posted in 
[PR#9727](https://github.com/apache/tvm/pull/9727).

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-992608499

Re: [apache/tvm] [VOTE] Replace codeowners with more relevant automation (Issue #10471)

2022-03-07 Thread Eric Lunderberg
+1

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm/issues/10471#issuecomment-1060835861
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)

2022-05-18 Thread Eric Lunderberg
> Based on my previous re-review of LLVM, thanks to @tqchen, it might help to 
> use my_target.features.dsp rather than my_target.arch.has_dsp and clarifying 
> these are features available to the Target? What do you think?

I like that, and the renaming makes it clear which are boolean parameters and 
which are variable attributes.  That would also be useful for cleaning up the 
[Vulkan 
target](https://github.com/apache/tvm/blob/main/src/target/target_kind.cc#L341), 
which right now has a large number of boolean attributes that would be better 
expressed as feature support.

The renaming would also help with avoiding architecture-specific checks at the 
code review stage.  Since the information defined by the architecture would be 
pulled over into `my_target.features`, checks that directly access 
`my_target.arch` should only occur for case (1).

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/71#issuecomment-1130250850
You are receiving this because you are subscribed to this thread.


[apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-06-06 Thread Eric Lunderberg
This RFC introduces a method to specify padding to be applied as part of a 
buffer layout transformation, for use when the desired layout does not evenly 
tile the buffer being transformed, along with simplifications that can be 
performed based on these padded buffers.

The motivating examples are primarily in the "Implementation options" 
section, which goes through several desired usages of the buffer padding, and 
how they can be automatically derived using the TIR primitives/transformations 
described in earlier sections.

TODO: Rendered Markdown link
You can view, comment on, or merge this pull request online at:

  https://github.com/apache/tvm-rfcs/pull/77

-- Commit Summary --

  * [RFC] Buffer Layout Padding

-- File Changes --

A rfcs/-layout-transform-padding.md (2521)

-- Patch Links --

https://github.com/apache/tvm-rfcs/pull/77.patch
https://github.com/apache/tvm-rfcs/pull/77.diff

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77
You are receiving this because you are subscribed to this thread.



Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-06-21 Thread Eric Lunderberg
> Introducing changes to TIR would needs some additional thoughts that deserves 
> some extra consideration. Due to the N*M complexity (where N is the TIR 
> possibilities and M is the number of primitives to be supported) that needs 
> to be handled in implementation (by backend implementers and primitive 
> implementers)

This was part of the design consideration, to minimize the impact of the 
proposed changes to primitives, lowering transformations, and backends.

* The `BufferConstraint` annotations do not need specific handling at the 
codegen level, as it is only present to enable compile-time optimizations.
  
* Use of the `BufferConstraint` hints would occur within existing utilities, 
primarily as additional information available in `arith::Analyzer` utilities.  
This minimizes the need for other primitives/transforms to be aware of the 
buffer constraints, while still benefiting from them.
  
* The `T.undef()` built-in does not need specific handling at the codegen 
level, as it is removed during lowering.
  
* The `T.undef()` built-in does not require specific handling from other 
primitives, as stores of `T.undef()` can be treated the same as stores of any 
other value.
  
> Right now it is possible to do non-local constraint rewriting flowings as 
> part of the graph pass. Note that while E1 is indeed less "compact" on one 
> hand, we can use it to reconstruct the desirable compact data 
> structure(something like BufferConstraint that represents the layout mapping) 
> that we can use to flow the decisions across the graph node during the pass.
  
I definitely agree that graph-level transforms are where the layouts and 
constraints should be decided.  The `BufferConstraint` annotations are not 
intended as a way to override in TIR what was already decided at the graph 
level, but rather a way to communicate to TIR transformations what has been 
decided at the graph level.

> E1: Composing a stage that transforms the layout(a loop that represents the 
> mapping)

I'm still a bit confused with this approach, specifically how one would avoid 
having a separate compute definition for each workload on a new target 
(Initially brought up by @csullivan 
[here](https://github.com/apache/tvm-rfcs/pull/77#discussion_r893701372).) In 
my mind, if I'm going to compose a layout transformation stage, it would need 
to be followed by a compute stage that takes a transformed layout as input.  So 
rather than having a single conv2d that can be generalized over layouts, each 
transformed layout would still need to have a compute stage for it.

> Note that intiially such data structure do not need to live beyond the life 
> of a pass, because they can be reconstructed at anytime from the other 
> representation.

How would this be represented while optimizing the performance of a subgraph?  
My concern would be how to express the non-local constraints while keeping a 
small search space for optimization.

* Ensure that the producer and consumer stages are within the same subgraph.  
Since the constraints provided to a consumer depend not only on the producer, 
but also on the constraints provided to the producer, this might require 
fusing the entire end-to-end model into a single monolithic kernel.
  
  My understanding is that this would result in a search space that is too 
large to effectively optimize, though I haven't explicitly tested it.
  
* Insert a transformation stage into the subgraph, in which the constraint is 
written.  Later portions of the subgraph could then rely on the constraint 
without examining other subgraphs.
  
  Would need to have some way to indicate that the transformation stage 
shouldn't be altered during optimization, nor should it be part of the 
performance timing.
  
* Express the graph-level constraints to a subgraph, so that it can optimize 
using those constraints.
  
  This was my intent with the `BufferConstraint` annotations, since then the 
subgraphs could take advantage of the graph-level constraints.
  
> E1 also enables some additional capabilities (e.g.) expressing future memory 
> remappings that do not necessarily fit into padding/packing.

Is there an existing annotation to indicate that a stage should be removed 
entirely during lowering?  That might be an effective way to allow more general 
usage by annotating a stage that can be assumed to have been performed prior to 
the subgraph.  This would be a way to express the second option of an extra 
transformation stage, while still providing enough information to remove the 
transformation stage during lowering.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1162392893
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-06-22 Thread Eric Lunderberg
> Indeed it is important to avoid having a separate compute definition for each 
> workload on a new target. In this particular case, all computation definition 
> would start with the original layout. Then there is a "schedule 
> transformation" like transform layout which will generate the new stage as 
> part of the scheduling process.

Thank you, and that is roughly how I'm seeing it as well: everything starts 
with the base compute definition and is modified from there.  If I understand 
correctly, the main differences are below.

* Option A: Layout transformations of inputs are allowed, but only during 
initial graph-level optimization.  When optimizing an individual PrimFunc, 
layout transformations of inputs and outputs are not allowed.

* Option B: Layout transformations of inputs and outputs are not allowed.  If 
this is desired, it should be done by first introducing a cache stage in TIR, 
then transforming the layout of the cache, and finally applying a graph-level 
transformation that inspects each PrimFunc and hoists the cache stage out.
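
For concreteness, a minimal sketch of Option B's first two steps, assuming the 
TIR schedule primitives `cache_read` and `transform_layout`; the graph-level 
hoist that follows is not shown:

```python
import tvm
from tvm import tir
from tvm.script import tir as T

@T.prim_func
def func(A: T.Buffer[(16, 16), "int32"], B: T.Buffer[(16, 16), "int32"]):
    for i, j in T.grid(16, 16):
        with T.block("compute"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = 2 * A[vi, vj]

sch = tir.Schedule(func)
block = sch.get_block("compute")
cache = sch.cache_read(block, 0, "global")  # step 1: introduce a cache stage
sch.transform_layout(cache, ("write", 0),   # step 2: transform the cache
                     lambda i, j: [i // 4, j, i % 4])
```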

> The particular stage can be marked, which contains effectively the same 
> information as BufferConstraint, except that it does not introduce new data 
> structures. During global layout reflowing, such information can be used to 
> guide the reflowing to reconstruct a data structure like BufferConstraint or 
> other Layout mappings and use that to serve the same purpose.

So long as the constraints can be statically searched for, this approach makes 
sense to me.  I would be more concerned about adding additional semantics to 
existing nodes, such as an `AttrStmt` node, since it then requires passes to be 
aware not only of the existence of the constraint, but also of the fact that it 
must be reconstructed from the existing data structure.  This approach would 
make it much more difficult for a static analysis tool to identify locations 
where the constraints must be updated.

As a way to potentially find a way forward, what if we start by implementing 
pad values only for buffers that are allocated internally to a function?  This 
would be allowed behavior under both Option A and Option B, and would help 
determine how difficult reconstruction of the constraints would be from the 
transformation block without any additional annotation.  This could help 
motivate whether additional annotations are necessary, regardless of whether 
they are stored alongside the Buffer itself or in a separate 
attribute/annotation.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163436177
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-06-22 Thread Eric Lunderberg
> It doesn't add additional semantic, the computation semantic stays the same, 
> it is a hint to the graph compiler.

My apologies, I had meant the semantics of a node from the perspective of a TIR 
transformation, not the semantics from the perspective of the computation being 
described.  For a TIR transformation, if an object is replaced, whatever 
attributes describe that object must be updated to refer to the new object.  So 
if constraints are added to the block annotation, I had been thinking of that 
as a change to the semantics of the `BlockRealizeNode::annotations` from "does 
not need to be updated when a buffer is replaced" to "must be updated when a 
buffer is replaced".

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163510231
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-06-22 Thread Eric Lunderberg
Writing out some of my thoughts, to see if there's a way to express the 
constraints while only using existing TIR features.  The main goals would be as 
follows.

1. Allow simplification of expressions based on the values present in the 
padding.
2. Allow local simplifications to take advantage of non-local constraints, 
without requiring a full end-to-end analysis.
3. Specify the non-local constraints in some deducible manner that doesn't 
impose a runtime performance penalty.
   
Next, working through various options for how the constraints could be stored.  
The examples below sketch how each option would apply to an element-wise 
operation that starts as below.

```python
from tvm.script import tir as T

@T.prim_func
def func(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    for i in T.serial(14):
        B[i] = 2 * A[i]
```

1. Apply layout transforms on local caches.  Here, the full lifetime of a 
buffer is known.  All TIR optimizations are done prior to hoisting the cache 
and layout transformation into the graph level.
   
   - For read caches, the pad value is whatever gets conditionally written to 
the padding while generating it.  In the example below, `AC` could be 
recognized as being padded.

     ```python
     @T.prim_func
     def func(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
         AC = T.alloc_buffer([4, 4], "int32")
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 AC[io, ii] = A[4 * io + ii]
             else:
                 AC[io, ii] = 0

         for i in T.serial(14):
             B[i] = 2 * AC[i // 4, i % 4]
     ```
 
   - For write caches, the pad value is whatever is in the padding after the 
last write to the cache.  In the example below, `BC` could be recognized as 
being padded.

     ```python
     @T.prim_func
     def func(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
         BC = T.alloc_buffer([4, 4], "int32")
         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 BC[io, ii] = 2 * A[4 * io + ii]
             else:
                 BC[io, ii] = 0

         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 B[4 * io + ii] = BC[io, ii]
     ```

   - Downside, either of the `else` statements could be eliminated as a no-op, 
since they don't contribute to the output `B` value. After that elimination, 
there wouldn't be any way to reconstruct the pad value.
 
2. When hoisting an allocation+transformation, write the pad value to the 
buffer at the start of the function from which it was hoisted.  This way, the 
pad value can still be used in local reasoning.
   
   - No change needed in producers, since they would already write the pad 
value to the buffer.
   
   - For consumers, this would be represented as writing `pad_value` into the 
padding at the start of the function.
   
     ```python
     @T.prim_func
     def func(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
         for io, ii in T.grid(4, 4):
             if 4 * io + ii >= 14:
                 AC[io, ii] = 0

         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 B[4 * io + ii] = 2 * AC[io, ii]
     ```
 
   - Downside, repeated unnecessary effort at the beginning of each consumer.  
Avoiding it with this representation would require knowing that the producer 
had written `pad_value` already, which is exactly the non-local information 
we're trying to avoid passing through.
 
3. When hoisting an allocation+transformation, write the pad value to the 
buffer at the start of the function from which it was hoisted, and write 
`T.undef()` at the end.  This way, the pad value can still be used in local 
reasoning, and no-op removal can remove the repeated writes when lowering.
   
   - No change needed in producers, since they would already write the pad 
value to the buffer.
 
   - For consumers, would be like option 2, but with an additional write of 
`T.undef()` at the end of the function.  When lowering, the write of 
`T.undef()` would allow the first write to be removed as a no-op because it is 
overwritten.  The `T.undef()` can then be removed as described in the RFC.
   
     ```python
     @T.prim_func
     def func(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
         for io, ii in T.grid(4, 4):
             if 4 * io + ii >= 14:
                 AC[io, ii] = 0

         for io, ii in T.grid(4, 4):
             if 4 * io + ii < 14:
                 B[4 * io + ii] = 2 * AC[io, ii]

         for io, ii in T.grid(4, 4):
             if 4 * io + ii >= 14:
                 AC[io, ii] = T.undef()
     ```
 
   - Downside, no way to distinguish between "can assume the pad value is zero" 
and "can overwrite the pad value at will".  The writing of `T.undef()` would 
allow any writes to the padding to be inserted as a no-op.
 
   - Downside, wouldn't actually simplify out in cases where the pad value is 
used.  The first in a pair of repeated writes to the same location can only be 
removed as a no-op if there is no read of that location between the two writes.

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-06-22 Thread Eric Lunderberg
> For example, we may introduce explicit cache stage to add the padding, and 
> mark this block for later processing.

Wouldn't that require a "remove entirely" annotation that was suggested against 
[here](https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163019805)?  I 
could see how we could mark a transformation to be hoisted out later, but when 
some simplifications require the constraint to be expressed in the producer, 
and others in the consumer, exposing it to both `PrimFuncs` for local 
simplifications would require either duplication of the block, or maintaining 
non-local information only for a single pass.  If the stage is duplicated, all 
but one of the duplicates would need to be marked as temporary.  If the 
information is only retained for a single pass, then any 
scheduling/optimization of a single subgraph would require walking through the 
entire end-to-end model.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1163616169
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-06-24 Thread Eric Lunderberg
> Talking about “constraints”, it is also useful to talk about categories of 
> them, roughly we can divide them into three categories.

I like this breakdown, and agree.  In this categorization, what I've been 
calling "constraints" would be "assumptions".  Double-checking in `builtin.h`, 
it looks like we don't currently have a TIR equivalent of `__builtin_assume`.

For usage of assumptions, I think the key would be to insert an assumption 
whenever the information that could otherwise prove it is hoisted out of the 
PrimFunc.  That would provide non-local information that could be used by the 
PrimFunc to allow local simplifications.
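
As a sketch of how such an assumption might look once the intrinsic exists 
(`T.assume` is an assumed TVMScript spelling here; the condition records a zero 
pad value established by the hoisted-out producer):

```python
from tvm.script import tir as T

@T.prim_func
def consumer(AC: T.Buffer[(4, 4), "int32"], B: T.Buffer[14, "int32"]):
    for io, ii in T.grid(4, 4):
        # Non-local fact, standing in for the hoisted-out producer:
        # the padding of AC was written as zero.
        T.evaluate(T.assume(4 * io + ii < 14 or AC[io, ii] == 0))

    for io, ii in T.grid(4, 4):
        if 4 * io + ii < 14:
            B[4 * io + ii] = 2 * AC[io, ii]
```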

> transformation of PrimFunc do not change the PrimFunc interface: this is 
> really important so we can transform a PrimFunc without worrying about how 
> the graph interacts with it(as the interface remains the same, we can lift 
> out the blocks earlier)

I don't think we can make quite so strong a statement, as it would also forbid 
fusing operators together or hoisting a stage out of a PrimFunc.  In both 
cases, the signature of the resulting PrimFunc may be different from what it 
was before.  This shows up in the example, as the interface of `grow` is 
different from the transformed `grow_packed`.

As a slightly less general statement, I would say that transformations of a 
PrimFunc *in isolation* may not change the PrimFunc's interface. So an 
optimization search to improve the performance of a single subgraph may not 
change the layout of its own arguments, nor may it change assumptions of what 
is present in the padding, as those would change its interface.  However, a 
graph-level transform would be allowed to fuse subgraphs, to hoist stages out 
of a PrimFunc, to alter the layout of a PrimFunc's input, or to alter the 
assumptions provided about the inputs.  In general, a PrimFunc's interface 
could only be changed when calls into the PrimFunc are also modified to remain 
compatible.

Is there a better term than "scheduling primitive" to describe layout 
transformations that impact input/output buffers?  I think the difference is 
between context-independent transformations that may be performed on a PrimFunc 
without changing its interface, and context-dependent transformations that may 
only be performed as part of a graph-level transformation.



> Each function needs to have its own TIR analysis of how it flows things back, 
> for example, in the case of `addone`, we can safely flow PadMapping back, 
> changing `addone` to `addone_packed` by analyzing the TIR. If the addone is 
> elemwise exp however, we need to insert a select operator(because `exp(0)=1` 
> ) the message to input becomes `PadMapping(constraint, pad_value=undef)`.

Would this handle cases where there are multiple options for how an operator 
could be implemented?  Otherwise, I'm not sure how it would handle multiple 
different sets of layouts/constraints inferred from different TIR-level 
schedules of the same operator.  As examples, the drop-down below has 6 
different implementations of `addone`, each of which would allow different 
hoistable pad/crop operations.


Click to expand


```python
# Implementation 1, no preproc/postproc are present.
#
# No hoistable layout transformations.  Could be fused with a layout
# transformation, but doesn't otherwise provide any constraints.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    for i in T.serial(14):
        with T.block("compute"):
            B[i] = A[i] + 1


# Implementation 2, pad input/output, but never access the padding of
# either input or output.
#
# In back-propagation of constraints, the T.undef() that is cropped
# from BC could be narrowed to a known value provided from the
# successor.  However, AC's padding is never written to, so could
# propagate T.undef() back to preceding function.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    # Allocations assumed; they were implicit in the original sketch.
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr({"preproc": "pad"})
            if 4 * io + ii < 14:
                AC[io, ii] = A[4 * io + ii]

    for i in T.serial(14):
        with T.block("compute"):
            BC[i // 4, i % 4] = AC[i // 4, i % 4] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr({"postproc": ["crop", T.undef()]})
            B[i] = BC[i // 4, i % 4]


# Implementation 3, pad input with known value, but never access
# padding of output.
#
# In back-propagation of constraints, the T.undef() that is cropped
# from BC could be narrowed to a known value provided from the
# successor.  AC's padding is written to, so this would propagate
# `PadMapping(predicate, pad_value=0)` to the previous operator.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr({"preproc": "pad"})
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)
```

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-06-24 Thread Eric Lunderberg
> Our design principle at TIR level ideally we start with one instance of 
> possibility, then use probabilistic space of meta-schedule to represent 
> multiple choices.

For this, would the layout re-flowing occur periodically during optimization?  
Otherwise, including the transformations in the performance benchmarking of 
candidates would unfairly penalize candidates that add a transformation step, 
while excluding them would unfairly bias the search toward adding 
transformations, even when sequential operators require separate layout 
transformations.

Representing different options as different allowed steps in a search space 
makes sense to me, so long as the candidates are reasonably exposed to the 
optimizer.

> In our particular example, however, the idea is that the schedule primitive 
> do not modify the input/output buffer, but introduce preproc and postproc 
> stages with clear hint that they should be lifted out (aka we are doing the 
> same thing in two steps)

I think I understand.  That would effectively be treating the preproc/postproc 
stages as separate function bodies, but ones which happen to exist within the 
same TIR PrimFunc for ease of use.

With this representation, I think the biggest part would be determining when to 
fix a previously free parameter, in order to expose it as an assumption to 
another TIR PrimFunc.  Maybe in the "Step 2: Reflowing of Layouts", this isn't 
used to cancel any statements out, but instead to create a dynamic performance 
penalty if an assertion is no longer held, with the performance penalty equal 
to the time required to do the transformation.

> As a quick intermediate middle ground. For most intermediate stage(really 
> like add or exp), we would ideally not insert any layout decisions and allow 
> decisions from other key ops(conv and matmul) to backprop their decision to 
> them.

I'd agree, though would phrase it somewhat differently: the element-wise 
operations impose a constraint that the input and output have identical 
layouts.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1165889831
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-06-28 Thread Eric Lunderberg
These make sense, and agreed that the TIR->global feedback is important for 
enabling the layout reflow.  Going back through the discussion, I think we're 
converging on agreement on what features are required, and the main questions 
remaining are how best to provide annotations for non-local information, and 
how best to express layout transformations while scheduling.

I've made some updates to the text of the RFC, based on the discussions here, 
primarily to remove the proposed changes to TIR data structures.  This follows 
your comment from a few days ago, which brought up `__builtin_assume` as a 
comparison.

* Adding an intrinsic `tir::builtin::assume`, which corresponds to the 
`__builtin_assume` LLVM intrinsic.  The emphasis is that these assumptions are 
primarily to expose non-local information for use in local simplifications.
* Removing `BufferConstraint` entirely.  The RFC no longer proposes any changes 
to TIR data structures, only the `assume` and `undef` intrinsics.
* Describing what assumptions can/should be placed into a PrimFunc when 
hoisting stages out into independent PrimFuncs, and what transformations are 
legal based on the choice of exposed assumptions.
* Captured some of the discussion here about the dangers of altering a 
PrimFunc's interface, and the limited cases where it may be altered.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1169188372
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-06-30 Thread Eric Lunderberg
> In general it is helpful to first keep schedule decision local, e.g. 
> introducing a caching stage (AC, BC in the example), the compose with another 
> reflowing pass to bring the decision to consumer/producers.

My goal with the latest update wasn't to require global decisions, but to make 
local changes only, which could be used in different contexts.  For the 
auto-scheduler, since the context requires maintaining the same PrimFunc 
interface, local optimization would be restricted to transformations of the 
caching stage.  For stand-alone usage, such as preparing a single PrimFunc for 
a unit test, the context allows the interface to change.  That way, the 
restrictions to the transformations are imposed by the level of abstraction 
that requires them.

> While it is possible to express padding with a loop and another loop that 
> writes the padded value, it is harder to schedule the resulting blocks as 
> there are more than one producers. Having a single loop and use 
> `T.if_then_else` will express such pattern in a single shot and makes future 
> rewriting easier.

I definitely agree that this makes the later analysis/rewrites easier. I had 
maintained them as two separate loops both to minimize the extent of changes 
being made in any one scheduling change, and to maintain the current behavior 
of `Schedule.transform_layout` which does not alter the surrounding loop 
iterators ([previous 
conversation](https://github.com/apache/tvm/pull/10538#discussion_r826209815) 
with @vinx13).

I see four main options for how the loop nests could be handled:

1. When a buffer is transformed, all loops in the producer stage over the 
buffer's pre-transformation axes are replaced with loops over the buffer's 
post-transformation spatial dimensions.  It is an error if this replacement 
cannot be done (e.g. the pre-transformation loops have been 
fused/split/reordered).

   - Pro: Allows the `T.if_then_else` to be inserted at the time of the 
transformation.
   - Pro: Removes the need for a follow-up stage.
   - Con: May restrict the search space, since earlier manipulation of the loop 
iterators would prevent later buffer transformations.
   - Con: Doesn't help consumers of a transformed buffer.  In a reduction, it 
may be desirable to iterate over the input buffer, but this couldn't be 
expressed in terms of an output.
   - Con: For buffers whose padding is not written to, must either insert a 
conditional statement or maintain the pre-transformation loop structure.

2. When a buffer is transformed, an attempt is made to replace all loops in the 
producer stage over the buffer's pre-transformation axes with loops over the 
buffer's post-transformation spatial dimensions.  If this replacement cannot be 
done (e.g. the pre-transformation loops have been fused/split/reordered), and 
if `pad_value` is not `None`, then an error should be raised.

   - Pro: Always valid to apply a transform
   - Pro: Avoids undoing scheduling benefits from previous changes to iterators.
   - Pro: Later calls to `reduce_branching_through_overcompute` could still 
introduce a value for the padding, if the full life cycle of the buffer is 
known.
   - Con: Allowing the follow-up stage at all requires just as much analysis to 
identify as if it were always present.

3. When a buffer is transformed, all loops in the producer stage over the 
buffer's pre-transformation axes are replaced with loops over the buffer's 
post-transformation spatial dimensions.  If this replacement cannot be done 
(e.g. the pre-transformation loops have been fused/split/reordered), then the 
follow-up stage is inserted.

   - Pro: Always valid to apply a transform
   - Pro: Avoids undoing scheduling benefits from previous changes to iterators.
   - Con: Allowing the follow-up stage at all requires just as much analysis to 
identify as if it were always present.

4. When a buffer is transformed, all loops over spatial dimensions in the 
producer are replaced with loops over the post-transformation buffer axes.

   - Pro: Always valid to apply a transform.
   - Con: May undo scheduling that has previously provided useful performance 
improvements.
   - Con: Loop iterators over pre-transformation indices may have been fused 
with reduction axes.  Would need to undo the fusion to apply.
 
The current proposed version would be option 4, but I think I'd prefer option 2 
in order to reduce the number of follow-up simplifications required.

> Some of the complications of duplicated condition(and their simplification) 
> roots from the fact that we do layout transform of output and input 
> separately(each introducing their own conditions which then needs to be 
> simplified). It might be helpful to do a global transformation, usually 
> driven from the output, then "backprop" the implication of that decisions to 
> the input. Doing such transformation at a single shot will likely alleviate 
> the need of generating extra conditions then simplifying them.

At the TIR level, I suppose

Re: [apache/tvm-rfcs] [RFC] Buffer Layout Padding (PR #77)

2022-07-12 Thread Eric Lunderberg
Thank you very much for the comments, suggestions, and discussion, and I'm 
quite happy with how the design evolved over the course of the discussions!

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1182157349
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm] [VOTE] Commit Messages RFC (Issue #12583)

2022-08-30 Thread Eric Lunderberg
+1

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm/issues/12583#issuecomment-1232123344
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm-rfcs] [RFC] Add Commit Message Guideline (PR #88)

2022-08-31 Thread Eric Lunderberg
[Rendered 
link](https://github.com/gromero/tvm-rfcs/blob/cmg/rfcs/0088-commit-message-guideline.md)

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/88#issuecomment-1233170752
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm-rfcs] [RFC] Introduce PresburgerSet (PR #99)

2023-02-21 Thread Eric Lunderberg
I very much like the proposed improvements, especially the use cases for 
inner-block and inter-block analysis.  While I have done some development [for 
similar applications](https://github.com/apache/tvm/blob/main/src/tir/analysis/control_flow_graph.h), 
the additional formalism and reliability proposed here would be very useful.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/99#issuecomment-1438568053
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm] [VOTE] Clarify Community Strategy Decision Process (Issue #15521)

2023-08-11 Thread Eric Lunderberg
+1

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm/issues/15521#issuecomment-1675087792
You are receiving this because you are subscribed to this thread.


Re: [apache/tvm-rfcs] [RFC] Scalable vectors in TIR (PR #104)

2023-10-06 Thread Eric Lunderberg
> What I'm aiming at is to be able to lower the TIR to a generic CPU, that is 
> to an architecture that does not support SVE. The TIR will need to have some 
> default lowering in CodeGenLLVM/CodeGenCPU, so being able to do that is 
> important.

Could it instead be in a target-dependent lowering pass?  That is, since a 
lowering pass after `BindTarget` 
([here](https://github.com/apache/tvm/blob/main/src/driver/driver_api.cc#L573) 
in `driver_api.cc`) would know whether the target CPU supports SVE or not, we 
could make a pass that either returns the IRModule unmodified for CPUs that 
support SVE, or converts it to non-SVE instructions otherwise.

I'd like to avoid adding more complexity to the `CodeGenLLVM` and `CodeGenCPU`
steps, as they are more difficult to test than IRModule-to-IRModule
transformations.
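
As a rough sketch of what I have in mind (hedged: `has_sve` and
`rewrite_scalable_to_fixed` are placeholder names for illustration, not
existing TVM functions, and I'm assuming `BindTarget` has attached a `target`
attribute to each `PrimFunc`):

```
import tvm

def has_sve(target):
    # Assumption: SVE support is discoverable from the target's mattr.
    return any("sve" in str(attr) for attr in (target.mattr or []))

def rewrite_scalable_to_fixed(func):
    # Placeholder for the actual legalization of scalable vectors.
    return func

@tvm.tir.transform.prim_func_pass(opt_level=0, name="LegalizeSVE")
def legalize_sve(func, mod, ctx):
    target = func.attrs["target"] if func.attrs and "target" in func.attrs else None
    if target is not None and has_sve(target):
        return func  # Target supports SVE: leave scalable vectors as-is.
    return rewrite_scalable_to_fixed(func)
```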

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/104#issuecomment-1751100613
You are receiving this because you are subscribed to this thread.

Message ID: 

Re: [apache/tvm-rfcs] [RFC] Scalable vectors in TIR (PR #104)

2023-10-10 Thread Eric Lunderberg
Agreeing with @kparzysz-quic, changes that update the `DLDataType` would need
to be approached very cautiously.  I usually lean toward allowing short-term
breakages if they lead to better long-term code health, but updating the
`DLDataType` would be too wide-reaching even for my tastes.

One way to limit the scope of the change might be to introduce a distinction 
between the runtime `DLDataType`, and some new compile-time data-type.  This 
would be analogous to the distinction between the runtime `DLTensor` and the 
compile-time `tir::Buffer`.  It would still involve massive changes inside of 
TVM, but would preserve the runtime types, avoiding the ABI breakage of 
compiled functions.
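
As a loose illustration of that split (plain Python, purely hypothetical, not
proposed TVM code), the compile-time type could carry information that the
runtime `DLDataType` cannot:

```
from dataclasses import dataclass

@dataclass(frozen=True)
class CompileTimeDataType:
    code: int                  # same type-code convention as DLDataType
    bits: int
    lanes: int                 # base lane count
    is_scalable: bool = False  # True for "vscale x lanes" vectors

    def to_runtime(self):
        # Lower to the unchanged runtime representation when possible.
        if self.is_scalable:
            raise ValueError("scalable vectors have no DLDataType equivalent")
        return (self.code, self.bits, self.lanes)
```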

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/104#issuecomment-1755890786
You are receiving this because you are subscribed to this thread.

Message ID: 

Re: [apache/tvm] [VOTE] Transition Main to Unity (Issue #16368)

2024-01-11 Thread Eric Lunderberg
+1

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm/issues/16368#issuecomment-1887308146
You are receiving this because you are subscribed to this thread.

Message ID: 

Re: [apache/tvm] [VOTE] Release Apache TVM v0.16.0.rc0 (Issue #16912)

2024-04-27 Thread Eric Lunderberg
+1

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm/issues/16912#issuecomment-2080793568
You are receiving this because you are subscribed to this thread.

Message ID: 

[apache/tvm] [Bugfix][NCCL] Release NCCL thread_local resources in destructor (PR #17078)

2024-06-10 Thread Eric Lunderberg
Prior to this commit, allocations performed by `ncclCommInitRank` had no 
corresponding call to `ncclCommDestroy`.  While `ncclCommDestroy` does occur in 
the `CCLThreadLocalContext::Clear` method, there are no calls into this method. 
 On worker processes, the failure to call `ncclCommDestroy` typically had 
little effect.  Any destruction would occur shortly before the process closes, 
and so resources would be reclaimed by the OS when the process terminates.

However, worker0 of a Disco session is a separate thread, rather than a 
separate process.  While this allows it to easily receive data from the 
controller thread, resources allocated by worker0 are not reclaimed by the OS 
until the entire process terminates.  As a result, the `CCLThreadLocalContext`
leaked GPU memory, as the resources allocated by `ncclCommInitRank` at the
start of each `tvm.runtime.disco.ProcessSession` were never released.  The
increase in GPU memory usage was about 1 gigabyte for each `ProcessSession`.

This commit updates `CCLThreadLocalContext` to have a destructor that calls the 
`Clear` method.  For worker0, this is called when the thread is joined to the 
main thread.
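
As a loose Python analogy of the fix (not the actual C++ change), cleanup is
tied to object destruction, so the thread-local state is released when
worker0's thread is joined rather than at process exit:

```
class CCLThreadLocalContext:
    """Stand-in for the C++ thread-local context (illustrative only)."""

    def __init__(self):
        self.comm = object()  # Stand-in for the NCCL communicator.

    def clear(self):
        # Stand-in for ncclCommDestroy; previously defined but never called.
        self.comm = None

    def __del__(self):
        # The fix: invoke Clear() from the destructor, so cleanup runs when
        # the context is destroyed, not only at process termination.
        self.clear()
```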
You can view, comment on, or merge this pull request online at:

  https://github.com/apache/tvm/pull/17078

-- Commit Summary --

  * [Bugfix][NCCL] Release NCCL thread_local resources in destructor

-- File Changes --

M src/runtime/disco/nccl/nccl.cc (12)
M src/runtime/disco/nccl/nccl_context.h (15)

-- Patch Links --

https://github.com/apache/tvm/pull/17078.patch
https://github.com/apache/tvm/pull/17078.diff

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm/pull/17078
You are receiving this because you are subscribed to this thread.

Message ID: 


[Apache TVM Discuss] [Development/RFC] [RFC] Parametrized Unit Tests

2021-05-11 Thread Eric Lunderberg via Apache TVM Discuss


Adding notes from a few video chats, so that there is a record of the discussion

From @tkonolige, confirmed that the current implementation of
`@tvm.testing.parametrize_targets` shows skipped targets if they are
explicitly listed in the decorator, but not if they come from the
`TVM_TEST_TARGETS` environment variable.

From @tkonolige, some of the unit tests have a significant amount of setup
required before the loop over `enabled_targets()`, and repeating the setup for
many targets would increase the runtime of the tests.  I agree, this
definitely would be an issue and should have a recommended style to avoid
duplicating work.  I think the best style would be to use xunit-style
`setup_class` to perform the setup, then have the test methods within the
class implemented using parametrized fixtures (see the sketch below).  I'll
test out a few options and get back on it.
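
A minimal sketch of that style (the operator, fixture, and class names here
are hypothetical, chosen only for illustration):

```
import pytest

@pytest.fixture(params=["llvm", "cuda"])
def target(request):
    return request.param

class TestTopiOperator:
    @classmethod
    def setup_class(cls):
        # Expensive, target-independent setup, performed once per class.
        cls.reference_data = [i * i for i in range(1024)]

    def test_matches_reference(self, target):
        # Each `target` parameter reuses the shared class-level setup.
        assert len(self.reference_data) == 1024
```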

From @areusch, recommended using xfail instead of skip for tests marked as
known failing.  I agree, and will make that change to the PR.

From @areusch, recommended removing the global variables of
`tvm_excluded_targets` and `tvm_known_failing_targets`, having the decorator
as the only method by which to apply these traits.  The global variables can
silently have a typo, whereas a typo in a decorator throws an error.  I agree,
and will make that change to the PR.

From @areusch, though perhaps out of scope of this particular RFC, it would
be good to have some discussion as to which unit tests should be required to
have parametrized targets, and which are allowed to be target-specific (e.g.
where should changes in PRs be requested for having non-parametrized targets).
My first thoughts would be to rearrange the current `tvm/tests/python` folder
to mimic the current organization of the top-level `src` and `python`
directories, rather than the current division into unit/integration tests.
Then, the tests that are specific to the `target` subdirectory and the
target-specific subdirectories in `runtime` would be allowed to have
non-parametrized tests, while all others would be parametrized.

Someone mentioned that @masahi was looking into ways to compare cuda/vulkan 
correctness across all topi models, and that this might be of interest to him.  
The topi tests were exactly the ones I had in mind as a starting point for 
converting over to the parametrized targets, for exactly the same reason.





---
[Visit 
Topic](https://discuss.tvm.apache.org/t/rfc-parametrized-unit-tests/9946/4) to 
respond.

You are receiving this because you enabled mailing list mode.

To unsubscribe from these emails, [click 
here](https://discuss.tvm.apache.org/email/unsubscribe/561dfa5e8ed159e2f3325fad182bda38726ae6c015f9f9dc62147d66e5ebfe8c).


[Apache TVM Discuss] [Development/RFC] [RFC] Parametrized Unit Tests

2021-05-17 Thread Eric Lunderberg via Apache TVM Discuss


Correct, the different cases are intended to show the entire contents of a test 
file.  The names in this example are chosen so that it can be run with minimal 
interaction between cases.

For the `fixture(scope="module")`, the scope indicates when pytest should
clean up a fixture, but the fixture is only available to tests that explicitly
accept it as a function argument.  With the default scope of `"function"`, the
cleanup is done between each test.  Absent any caching that we add, the
module-level scope is what prevents the fixture from being regenerated
(repeating the expensive setup) for each target/size parameter.

For the caching, using the cached values should be no different than passing an 
argument into a `verify_*` function, or a `verify_*` function accessing a 
nonlocal variable.  In either case, we're relying on the test not to make 
modifications to the setup that is shared between multiple tests.

That said, if we decide to use fixtures as the method to handle expensive 
setup, I think that we should add a helper function `tvm.testing.fixture` to 
reduce the amount of boilerplate, avoid potential mistakes (e.g. having 
`@lru_cache` before `@pytest.fixture` instead of after), and as a place to 
disable the cache entirely based on an environment variable.  I haven't 
implemented it yet, but the behavior would be something like the following.


```
import functools

import pytest
import tvm.testing  # tvm.testing.fixture below is the proposed helper, not yet implemented

# Using pytest directly, defining a parametrized fixture
@pytest.fixture(scope="module", params=[1, 10, 100])
def size(request):
    return request.param

# With the helper function, same behavior as above
tvm.testing.fixture(name="size", params=[1, 10, 100])

# Using pytest directly, defining a cached parameter derived from `size`
@pytest.fixture(scope="module")
@functools.lru_cache
def expensive_size_dependent(size):
    ...

# With the helper function, same behavior as above,
# except that caching can be overridden with an environment variable.
@tvm.testing.fixture(cache=True)
def expensive_size_dependent(size):
    ...
```
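
For context, a test would consume either fixture simply by accepting it as an
argument (illustrative only):

```
def test_with_expensive_setup(size, expensive_size_dependent):
    # pytest supplies `size` for each parameter, and reuses the cached
    # `expensive_size_dependent` value within the module scope.
    assert expensive_size_dependent is not None
```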





---
[Visit 
Topic](https://discuss.tvm.apache.org/t/rfc-parametrized-unit-tests/9946/9) to 
respond.

You are receiving this because you enabled mailing list mode.

To unsubscribe from these emails, [click 
here](https://discuss.tvm.apache.org/email/unsubscribe/b5a3a077ef1325d8f5ba6f44ab839f7ca1110666c71d1c01fc1a7a4055bbf550).


[Apache TVM Discuss] [Development/pre-RFC] [docs][discuss] Convert ReStructuredText docs to Markdown?

2021-06-17 Thread Eric Lunderberg via Apache TVM Discuss


I think I'd lean towards markdown for consistency with other services, but
that's only if all other features were equal.  Markdown would be nicer for
reviewing, since it can be viewed on GitHub in the browser, but I think
cross-references are the more important feature.

Would either md or rST allow for cross-references between the Python API
documentation and the C++ documentation?  I think that would be the biggest
feature for me, since it would give a clear way to indicate which functions
are intended to only be wrappers.





---
[Visit 
Topic](https://discuss.tvm.apache.org/t/docs-discuss-convert-restructuredtext-docs-to-markdown/10264/3)
 to respond.

You are receiving this because you enabled mailing list mode.

To unsubscribe from these emails, [click 
here](https://discuss.tvm.apache.org/email/unsubscribe/cb0d61960a4d8efed6c9dc969d5a5c7c595df05ccf969468135fb695d3ae41a9).


[Apache TVM Discuss] [Development/pre-RFC] [pre-RFC] Vectorized TIR Buffers

2021-07-29 Thread Eric Lunderberg via Apache TVM Discuss


@masahi Tagging you, following the comments on
https://github.com/apache/tvm/pull/8528#pullrequestreview-718506978





---
[Visit 
Topic](https://discuss.tvm.apache.org/t/pre-rfc-vectorized-tir-buffers/10615/5) 
to respond.

You are receiving this because you enabled mailing list mode.

To unsubscribe from these emails, [click 
here](https://discuss.tvm.apache.org/email/unsubscribe/2b20409dd28ff339a6a9b23c17ad88ce918e6a63561c0b46db2304e8981c9bfc).