Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)

2022-05-20 Thread Junru Shao
Thanks @Mousius for drafting this RFC!

First of all, I completely agree on the importance of handling `arch`-specific 
checks. Using our own experience as an example: on CUDA we might want to check 
whether the PTX intrinsic `cp.async.commit_group` is available on a given 
architecture before tensorizing with it, and the existing approach is to 
add extra flags in `TargetKind`.
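To make that kind of check concrete, here is a minimal sketch (not TVM's actual API; the function names are made up) of gating a PTX intrinsic on compute capability — `cp.async` was introduced with sm_80 (Ampere):

```python
# Hypothetical sketch of an arch check for a PTX intrinsic such as
# `cp.async.commit_group`. cp.async requires compute capability >= 8.0,
# so the check reduces to comparing architecture numbers.

def parse_sm_arch(arch: str) -> int:
    """Turn an arch string like 'sm_86' into the integer 86."""
    assert arch.startswith("sm_")
    return int(arch[len("sm_"):])

def supports_cp_async(arch: str) -> bool:
    """cp.async.* PTX intrinsics require compute capability >= 8.0 (sm_80)."""
    return parse_sm_arch(arch) >= 80

print(supports_cp_async("sm_75"))  # Turing: False
print(supports_cp_async("sm_80"))  # Ampere: True
```

The point is that today each such check tends to live as its own ad-hoc helper, which is exactly the fragmentation discussed below.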

To motivate discussion, I would love to provide more context on the system 
design, then talk about some specific points I noticed in the current RFC, 
and finally propose some ideas inspired by it.

## Background

**Design principles.** Just wanted to share some of my design principles when 
developing TVM, a compiler that I believe should work across hardware models:
- A1. Generic. By design, TVM serves the common interest of all vendors. 
Therefore, it would be great if we could discuss broadly how a design choice 
would benefit all hardware platforms, including ARM, NV, Intel, Qual, etc.
- A2. Minimal. Ideally, practitioners should only have to learn the bare 
minimum to participate in TVM development.
  - Take TIR as an example: for a developer who knows the IR and its passes, 
the only extra concept needed to use and develop TIR is `Target`.
  - Relay is admittedly in a less ideal state, but Relax is designed to address 
this issue, so I wouldn't worry too much.
- A3. Customizable. We want to develop the infrastructure to provide 
customization opportunities for all vendors to independently work on their 
extension in a collaborative way, making TVM a platform for everyone. A few 
examples:
  - TIR scheduling is heading in this direction: all the schedule primitives 
are decoupled from each other, so that vendors can develop their own primitives 
without affecting each other.
  - TVMScript is going to be re-designed this way, treating TIR and Relax as 
independent dialects while the core infra itself supports any third-party IR.
  - Admittedly, Relay is much more limited, with its assumption of the 
Relay => TE => TIR lowering path and some hooks to hack in other compilation 
paths, but Relax is going to change this, so again I wouldn't worry...

**Current `arch`-specific checks.** Most of the 6 groups of `arch`-specific 
helper functions mentioned in the "Motivation" section were developed by 
concurrent efforts from multiple parties, so I would say fragmentation was 
almost certain to happen. On the other hand, fragmentation of those helper 
functions, which are indeed ad hoc, is currently confined locally to 
independent checks without yet polluting the design of the global 
infrastructure.

**Special target attributes.** Below are a few existing target attributes that 
carry special semantics, the design of which I don't fully agree with, but 
which are now preserved as legacy:
- **keys.** This is used to guide TOPI dispatch. For example, "llvm 
--keys=arm_cpu,cpu" first checks whether any `GenericFunc` is registered for 
`arm_cpu`, and if not, falls back to `cpu`.
- **libs.** This is used in TOPI to dispatch to vendor library. For example, 
"cuda -libs=cudnn" prefers dispatching to cuDNN.
- **device.** and **model.** These two sometimes control the behavior of 
auto-tuning.
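As a rough illustration of the `keys` fallback behaviour described above (the registry and function names here are invented; TVM's real mechanism is `GenericFunc` dispatch):

```python
# Toy sketch of keys-based dispatch: a target's keys are tried in order,
# falling back to the next key when no implementation is registered.

REGISTRY = {}  # key -> implementation

def register(key, fn):
    REGISTRY[key] = fn

def dispatch(keys):
    for key in keys:
        if key in REGISTRY:
            return REGISTRY[key]
    raise KeyError(f"no implementation for any of {keys}")

register("cpu", lambda: "generic cpu schedule")
register("arm_cpu", lambda: "arm-specific schedule")

# "llvm --keys=arm_cpu,cpu" tries arm_cpu first, then falls back to cpu:
print(dispatch(["arm_cpu", "cpu"])())  # arm-specific schedule
print(dispatch(["x86", "cpu"])())      # generic cpu schedule (fallback)
```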

**Existing `arch`-like attributes.** Note that the design of `Target` 
attributes in `TargetKind` is intended to describe the capability of hardware, 
according to the [Target 
RFC](https://discuss.tvm.apache.org/t/rfc-tvm-target-specification/6844):
> Pass in additional target-specific attributes like ISA/library extension to a 
> target.
Therefore, currently all the `arch`-like flags are centered around `Target`. As 
an example:
- The [Vulkan 
target](https://github.com/apache/tvm/blob/a6a34046c432b3766e7c32bbd85c098812a12a68/src/target/target_kind.cc#L345-L370)
 comprehensively includes hardware feature support (e.g. whether or not fp16 is 
supported), physical limits of the device (e.g. max number of threads allowed).
- The [CUDA 
target](https://github.com/apache/tvm/blob/a6a34046c432b3766e7c32bbd85c098812a12a68/src/target/target_kind.cc#L292-L296)
 is defined in a similar way but is not yet as comprehensive; growing its 
attributes doesn't require architectural change.

Note that the number of attributes may grow when new hardware features appear, 
but to the best of my knowledge such bloat should be uncommon, mainly because 
there is a concrete cost to growing features in hardware.

**Target tag system.** Given the fact that existing hardware models are 
enumerable, the [Target 
RFC](https://discuss.tvm.apache.org/t/rfc-tvm-target-specification/6844) 
proposes to use "tags" to allow easy creation of targets. For example, 
`Target("nvidia/jetson-agx-xavier")` gives full specification of this device, 
including the cuda target and the ARM host. At the time of writing, it only 
takes 200 tags to describe [all the CUDA 
hardware](https://github.com/apache/tvm/blob/a6a34046c432b3766e7c32bbd85c098812a12a68/src/target/tag.cc#L107-L348).
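A toy sketch of the tag idea — one string expands to a full device-plus-host configuration (the attribute names and values below are illustrative, not the exact registered config):

```python
# Hypothetical tag registry: a tag name maps to a complete target
# configuration (device target plus host), so users type one string
# instead of spelling out every flag themselves.

TAG_REGISTRY = {
    "nvidia/jetson-agx-xavier": {
        "kind": "cuda",
        "arch": "sm_72",  # Xavier's Volta-based GPU (compute capability 7.2)
        "max_threads_per_block": 1024,
        "host": {"kind": "llvm", "mtriple": "aarch64-linux-gnu"},
    },
}

def target_from_tag(tag: str) -> dict:
    """Expand a tag string into its full target configuration."""
    return dict(TAG_REGISTRY[tag])

t = target_from_tag("nvidia/jetson-agx-xavier")
print(t["kind"], t["host"]["mtriple"])  # cuda aarch64-linux-gnu
```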

Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)

2022-05-20 Thread Christopher Sidebottom
Hi @junrushao1994, thanks for the elaborate reply 😸  I don't want to debate our 
personal principles but I appreciate you sharing them and will reference them 
where I can.

> **Current `arch`-specifc checks.** Most of the 6 groups of `arch`-specific 
> helper functions, mentioned in the "Motivation" section, are developed by 
> concurrent efforts from multiple parties, and therefore I would say 
> fragmentation is almost a certain thing to happen. On the other hand, 
> fragmentation of those helper functions, which are indeed ad-hoc, is 
> currently confined locally as independent checks without yet polluting the 
> design of global infrastructure.

Yip, this fragmentation occurs due to the lack of a standardised mechanism for 
these groups to use, which the RFC aims to provide. I'm not sure why we 
consider that pollution, given that it should have a positive impact on all 
groups and aid customisation through a well-defined route established by all 
vendors, providing simplicity and customisability (A1, A2 and A3).

> Existing arch-like attributes. Note that the design of Target attributes in 
> TargetKind is intended to describe the capability of hardware, according to 
> the [Target 
> RFC](https://discuss.tvm.apache.org/t/rfc-tvm-target-specification/6844):

I believe this still holds given the crossover between hardware and code 
generators: for `c`/`llvm` you will still be able to pass `mattr`/`mcpu`/etc., 
which are the standard way to specify CPU, ISA and extensions in existing 
compilers (A1). This also follows the TVM guideline to "Be consistent with 
existing well-known package's APIs if the features overlap. For example, 
tensor operation APIs should always be consistent with the numpy API."

> Target tag system. Given the fact that existing hardware models are 
> enumerable, the [Target 
> RFC](https://discuss.tvm.apache.org/t/rfc-tvm-target-specification/6844) 
> proposes to use "tags" to allow easy creation of targets. For example, 
> Target("nvidia/jetson-agx-xavier") gives full specification of this device, 
> including the cuda target and the ARM host. At the time of writing, it only 
> takes 200 tags to describe [all the CUDA 
> hardware](https://github.com/apache/tvm/blob/a6a34046c432b3766e7c32bbd85c098812a12a68/src/target/tag.cc#L107-L348).

> **What on earth are `Target`s.** Actually, `target` in TVM not only refers to 
> the hardware, but also the codegen targets. For example, LLVM targets means 
> TVM codegens to LLVM, and let LLVM do the rest of compilation. CUDA targets 
> codegens to CUDA source code (i.e. `*.cu` files), and invokes NVCC for 
> compilation.

There's definitely an existing amount of confusion in the `Target` system, but 
I think it's even more confusing than this, judging by the tagged entries, 
such as:

```
TVM_REGISTER_TARGET_TAG("raspberry-pi/4b-aarch64")
    .set_config({
        {"kind", String("llvm")},
        {"mtriple", String("aarch64-linux-gnu")},
        {"mcpu", String("cortex-a72")},
        {"mattr", Array<String>{"+neon"}},
        {"num-cores", Integer(4)},
        {"host", Map<String, ObjectRef>{
            {"kind", String("llvm")},
            {"mtriple", String("aarch64-linux-gnu")},
            {"mcpu", String("cortex-a72")},
            {"mattr", Array<String>{"+neon"}},
            {"num-cores", Integer(4)}}},
    });
```

This defines a system configuration that then uses the `llvm` code generator 
with a specific CPU profile; the `Target` system can therefore represent **at 
minimum** three distinct layers: systems, hardware and code generators. Given 
the principle of having to understand a minimal amount (A2), `Target` needs to 
be streamlined into understandable parts. This is a digression from the actual 
RFC, though, as the features are pre-processed from the tagged `Target`'s 
information.

The problem I've faced trying to figure out how to implement this is that 
`llvm` takes `mcpu`/`mattr`/etc., which are used to infer features later, 
whereas the `cuda` `Target` has a more complete reflection of the `attrs` 
available for CUDA. These inconsistencies make it difficult for a user to 
approach TVM, and conflict with requiring developers to learn only the bare 
minimum (A2). It also diverges from other compilers, where you'd expect 
`mcpu`/`mtriple`/etc. to infer much of this information for you (A1).
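To illustrate that expectation, here is a hedged sketch of inferring features from `mcpu`/`mattr` the way LLVM/GCC users would expect (the CPU-to-feature table is illustrative, not a real database):

```python
# Hypothetical sketch: derive a feature set from mcpu, then apply mattr
# overrides ("+feature" enables, "-feature" disables), mirroring how
# LLVM/GCC-style flags behave.

CPU_FEATURES = {
    # illustrative entries only
    "cortex-a72": {"neon", "fp-armv8"},
    "cortex-a55": {"neon", "fp-armv8", "dotprod"},
}

def infer_features(mcpu: str, mattr=()):
    features = set(CPU_FEATURES.get(mcpu, set()))
    for attr in mattr:
        if attr.startswith("+"):
            features.add(attr[1:])
        elif attr.startswith("-"):
            features.discard(attr[1:])
    return features

print(sorted(infer_features("cortex-a72", ["+sve"])))  # ['fp-armv8', 'neon', 'sve']
```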

> Do we need multiple target parsers? My answer is no. If the parsers are 
> maintained by different vendors separately on their own interest, then they 
> could decide how to implement parsers for "keys", "arch", together, without 
> conflicting with other contributors. Therefore, I would say it's already 
> consistent with our principle A3 without having to develop multiple parsers.

> Naming choice. When implementing the Target RFC, I came up with the 
> preprocessor to be backward compatible with existing funct

Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)

2022-05-20 Thread Tianqi Chen
Thanks folks for the discussions. I think they summarize to the following points:

- Q0: Subfield grouping (e.g. features) or simply leave as top-level attrs
- Q1: Folder structure: `target/preprocessors/cuda.cc` vs 
`target/cuda/cuda_preprocessor.cc`
 - Note that code reuse is unlikely to be an issue, as there can be 
common includes among sub-folder structures. 
- Q2: How to reconcile attributes already specified for (part of) a subfield 
group: `llvm` can use `mattr` to specify the features necessary. What happens to 
CUDA (e.g. we propose grouping hw-related limitations into features), and how 
can we specify some of the existing feature fields during tag registration?
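One possible shape for Q2, sketched with invented attribute names: a kind-specific parser normalises both styles into a `features` subfield — `llvm` infers it from `mattr`, while `cuda` regroups existing attributes:

```python
# Hypothetical sketch of reconciling the two styles into a "features"
# subfield. The attribute names and parser structure are illustrative,
# not TVM's actual implementation.

def parse_llvm(attrs):
    # llvm: infer features from mattr entries like "+neon"
    feats = {a[1:]: True for a in attrs.get("mattr", []) if a.startswith("+")}
    return {**attrs, "features": feats}

def parse_cuda(attrs):
    # cuda: regroup existing top-level attrs (e.g. arch) into features
    arch = int(attrs["arch"].replace("sm_", ""))
    return {**attrs, "features": {"cp_async": arch >= 80}}

PARSERS = {"llvm": parse_llvm, "cuda": parse_cuda}

def preprocess(target):
    return PARSERS[target["kind"]](target)

print(preprocess({"kind": "llvm", "mattr": ["+neon"]})["features"])
print(preprocess({"kind": "cuda", "arch": "sm_86"})["features"])
```

A tag registration could then either specify the raw attrs (and let the parser fill in `features`) or write into the subfield directly, which is exactly the open question.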



-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/71#issuecomment-1133013784
You are receiving this because you are subscribed to this thread.

Message ID: 

Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)

2022-05-20 Thread Tianqi Chen
Thanks folks for the discussions. I think they summarize to the following points:

- Q0: Subfield grouping (e.g. features) or simply leave as top-level attrs
- Q1: Folder structure: `target/preprocessors/cuda.cc` vs 
`target/cuda/cuda_preprocessor.cc`
 - Note that code reuse is unlikely to be an issue, as there can be 
common includes among sub-folder structures.
- Q2: How to reconcile attributes already specified for (part of) a subfield 
group: `llvm` can use `mattr` to specify the features necessary. What happens to 
CUDA (e.g. we propose grouping hw-related limitations into features), and how 
can we specify some of the fields (which under the proposal will go into a 
subfield `feature`) during tag registration?


-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/71#issuecomment-1133026002

Re: [apache/tvm-rfcs] [RFC] UMA Universal Modular Accelerator Interface (PR #60)

2022-05-20 Thread Mark Shields
One more Collage/UMA overlap aspect: Collage distinguishes 'registered' 
backends (i.e. just `TargetKind`s) from 'activated' backends (i.e. `Target` 
objects in the provided build targets). I think, though, the proposal here is 
that the act of registration is also activation? I need help understanding how 
this will look from the user's point of view in combination with targets.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/60#issuecomment-1133139602

Re: [apache/tvm-rfcs] [RFC] Introducing DeclBuffer (PR #70)

2022-05-20 Thread Wuwei Lin
@wrongtest I've thought about option A3 vs A4. For the parsing/translation 
from TVM script to TIR, it is acceptable to have `T.allocate` translated to 
the two nodes `Allocate + DeclBuffer`. But it will be tricky for 
`TVMScriptPrinter`: we will need to find both the `Allocate` and `DeclBuffer` 
nodes and then print `T.allocate`, and these two nodes do not have to be 
parent/child of each other. I'm not sure this behavior, which breaks the 
1-to-1 mapping between TVM script and TIR, is desirable. Alternatively, we can 
add an option to `T.allocate`, such as `def allocate(..., return_buffer: 
bool)`. What do you think?

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/70#issuecomment-1133153361

Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)

2022-05-20 Thread Junru Shao
@Mousius Thank you so much for your response! This makes a lot of sense to me! 
Also, thanks for including my personal principles in the discussion! They're 
just my personal principles, which it's completely okay to disagree with :-)

> I'm not sure why we consider that pollution given it should have a positive 
> impact on all groups and aids in providing customisation through a well 
> defined route established by all vendors, providing simplicity and 
> customisation (A1, A2 and A3).

Sorry, my words may have led to some miscommunication. As I mentioned at the 
very beginning, I completely agree with the idea of handling `arch` more 
systematically, and it's important to us, so by no means do I consider it 
pollution. I mean the fragmentation itself is not good, but it hasn't become 
extremely bad yet, because these checks are not used globally at the 
infrastructural level, so there is no namespace pollution.

> I believe this is still held given the crossover between hardware and code 
> generators, for c/llvm you will still be able to pass mattr/mcpu/etc which 
> are the standard way to specify CPU, ISA and extensions in existing compilers.

Agreed. It is designed to be consistent with existing wisdom, i.e. LLVM/GCC 
mattr/mcpu/etc., so I believe it's not controversial.

> There's definitely an existing amount of confusion in the Target system, but 
> I think it's even more confused than this by looking at the tagged entries, 
> such as:  Which defines a system configuration that then uses a code 
> generator llvm with a specific CPU profile, therefore the Target system can 
> represent at minimum 3 distinct layers: systems/hardware/code generators.

Yes, I think your understanding is correct. Per the [Target 
RFC](https://discuss.tvm.apache.org/t/rfc-tvm-target-specification/6844), this 
object represents host compiler info (C0), heterogeneous compilation settings 
(C1), compiler behavior control (C2), and additional ISA/library (C4). The tag 
system is therefore introduced to reduce cognitive overhead: end users can use 
a simple plain string tag `Target("raspberry-pi/4b-aarch64")` to represent all 
of C0/C1/C2/C4, instead of understanding every detail of the hardware 
themselves (A2). In other words, to end users, the vision of the tag system is 
that they only need a simple string to configure TVM codegen for their 
hardware.

> where-as the cuda Target has a more complete reflection of the attrs 
> available for CUDA. These inconsistencies make it difficult for a user to 
> approach TVM and that isn't requiring a developers to learn the bare minimum 
> (A2). It also diverges from other compilers where you'd expect 
> mcpu/mtriple/etc to infer much of this information for you (A1).

Right, in fact I share the same confusion when it comes to CUDA/Vulkan/etc., 
where (to the best of my knowledge) there is no precedent for 
`mcpu`/`mattr`/`mtriple`/etc. (except the LLVM NVPTX backend). Certainly, we 
should develop a unified approach to address this confusion, which could 
possibly be the `arch` proposed in the RFC.

> I don't mind using one parser, I was just making it more granular and 
> avoiding stepping on peoples toes with the existing pre-processor. Either 
> design seems to fulfil your principle of customisability (A3), more 
> granularly means you only have to implement a small piece of customisation 
> where-as with the single parser it requires some thought as to how to cater 
> for all attributes within a given TargetKind parser.

Thanks for sharing your thoughts! Definitely. The issue with having multiple 
parsers is that the order of their application is less explicit, so I lean 
slightly towards a single one, while the implementation could share common 
helper functions.
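A sketch of that single-parser-with-shared-helpers idea (all names hypothetical): one entry point applies helper functions in a fixed, explicit order, avoiding the ordering ambiguity of multiple independent parsers:

```python
# Hypothetical sketch: a single parser per TargetKind composed from
# shared helper functions applied in an explicit, fixed order.

def fill_defaults(attrs):
    """Shared helper: fill in default attributes."""
    attrs.setdefault("num-cores", 1)
    return attrs

def infer_keys(attrs):
    """Shared helper: derive dispatch keys from the triple."""
    is_arm = "aarch64" in attrs.get("mtriple", "")
    attrs.setdefault("keys", ["arm_cpu", "cpu"] if is_arm else ["cpu"])
    return attrs

def parse_aarch64_target(attrs):
    # The single entry point makes the ordering explicit: defaults
    # first, then key inference.
    for step in (fill_defaults, infer_keys):
        attrs = step(attrs)
    return attrs

t = parse_aarch64_target({"kind": "llvm", "mtriple": "aarch64-linux-gnu"})
print(t["keys"], t["num-cores"])  # ['arm_cpu', 'cpu'] 1
```

Vendors could still contribute their own helpers without owning the whole pipeline, which keeps the customisability (A3) while making ordering deterministic.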

> > Where to dispatch target parsers. Currently, the preprocessor is dispatched 
> > solely based on TargetKind, which is admittedly a bit limited and 
> > overlooked the possibility that aarch64 and x86 may need completely
> > different parsers. Therefore, here we would love to raise a question: based 
> > on which attribute the parser should be dispatched? Previous wisdom in 
> > clang seems to suggest dispatching according to mtriple (IIUC), so shall we 
> > introduce anything similar, for example, dispatching based on --device 
> > aarch64-foo-bar?
>
> This requires developers to learn more than the bare minimum (A2), as users 
> are required to learn the specifics of how to specify a Target in TVM. If an 
> llvm Target supports -mcpu/-mattr/-mtriple then all of these fields can be 
> reasonably expected to allow the Target to infer features given the relation 
> to GCC/LLVM (A1).

Right, I agree. `TargetKind`-based dispatch is certainly the clearest approach, 
as you mentioned (A2). The reason I'm thinking about other directions (not 
100% sure though) is: what if different vendors (say Qualcomm and ARM) want to 
customize their own target parsing logic separately, while their `TargetKind` 
is th