Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)
Thanks @Mousius for drafting this RFC! First of all, I completely agree on the importance of handling `arch`-specific checks. Using our experience as an example, on CUDA we might want to check whether the PTX intrinsic `cp.async.commit_group` is available on a certain architecture before tensorizing with that PTX intrinsic, and the existing approach is to add extra flags in `TargetKind`. To motivate discussion, I would love to provide more context on the system design, then talk about some specific points I noticed in the current RFC, and finally propose some ideas inspired by this RFC.

## Background

**Design principles.** Just wanted to share some of my design principles when developing TVM, the compiler I believe aims to work across hardware models:

- A1. Generic. By design, TVM serves the common interest of all vendors. Therefore, it would be great if we could discuss broadly how a design choice would benefit all hardware platforms, including ARM, NV, Intel, Qual, etc.
- A2. Minimal. Ideally, practitioners are supposed to learn only the bare minimum to participate in TVM development.
  - Take TIR as an example: for a developer who knows the IR and passes, the only extra concept needed to use and develop TIR is `Target`.
  - Indeed Relay is in a less ideal state, but Relax is designed to address this issue, so I wouldn't worry too much.
- A3. Customizable. We want to develop the infrastructure to provide customization opportunities for all vendors to independently work on their extensions in a collaborative way, making TVM a platform for everyone. A few examples:
  - TIR scheduling is heading in this direction: all the schedule primitives are decoupled from each other, so that vendors can develop their own primitives without affecting each other.
  - TVMScript is going to be re-designed this way, treating TIR and Relax as independent dialects while the core infra itself supports any 3rd-party IR.
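To make the motivating check concrete, here is a hedged sketch (the helper name is hypothetical, not TVM's actual API) of the kind of ad-hoc `arch` check the RFC wants to standardise. The `cp.async` family of PTX instructions was introduced with sm_80 (Ampere), so the check gates on compute capability:

```python
# Hypothetical helper, sketching the ad-hoc checks that today require
# extra flags in TargetKind. cp.async.commit_group is available from
# sm_80 (Ampere) onward, so we gate on compute capability >= 8.0.
def has_cp_async(compute_version: str) -> bool:
    major, minor = (int(x) for x in compute_version.split("."))
    return (major, minor) >= (8, 0)

print(has_cp_async("8.6"))  # Ampere: True
print(has_cp_async("7.5"))  # Turing: False
```

Each backend currently reinvents checks like this one with its own flag plumbing, which is exactly the fragmentation discussed below.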
- Admittedly Relay is much more limited, with the assumption of the Relay => TE => TIR lowering path and some hooks to hack in other compilation paths, but Relax is going to change this, so again I wouldn't worry...

**Current `arch`-specific checks.** Most of the 6 groups of `arch`-specific helper functions mentioned in the "Motivation" section were developed by concurrent efforts from multiple parties, so I would say fragmentation was almost certain to happen. On the other hand, the fragmentation of those helper functions, which are indeed ad hoc, is currently confined locally to independent checks, without yet polluting the design of the global infrastructure.

**Special target attributes.** Below are a few existing target attributes that serve special semantics; I don't fully agree with their design, but they are now preserved as legacy:

- **keys.** This is used to guide TOPI dispatch. For example, `llvm --keys=arm_cpu,cpu` first checks whether any `GenericFunc` is registered with `arm_cpu`, and if not, falls back to `cpu`.
- **libs.** This is used in TOPI to dispatch to a vendor library. For example, `cuda -libs=cudnn` prefers dispatching to cuDNN.
- **device** and **model.** These two sometimes control the behavior of auto-tuning.

**Existing `arch`-like attributes.** Note that the design of `Target` attributes in `TargetKind` is intended to describe the capability of hardware, according to the [Target RFC](https://discuss.tvm.apache.org/t/rfc-tvm-target-specification/6844):

> Pass in additional target-specific attributes like ISA/library extension to a
> target.

Therefore, currently all the `arch`-like flags are centered around `Target`. As an example:

- The [Vulkan target](https://github.com/apache/tvm/blob/a6a34046c432b3766e7c32bbd85c098812a12a68/src/target/target_kind.cc#L345-L370) comprehensively includes hardware feature support (e.g. whether or not fp16 is supported) and physical limits of the device (e.g. the max number of threads allowed).
- The [CUDA target](https://github.com/apache/tvm/blob/a6a34046c432b3766e7c32bbd85c098812a12a68/src/target/target_kind.cc#L292-L296) is defined in a similar way, but is less comprehensive as yet. Growing its attributes doesn't require architectural change. Note that the number of attributes may grow as new hardware features appear, but to the best of my knowledge those hardware features are unlikely to bloat, mainly because there is a concrete cost to growing features in hardware.

**Target tag system.** Given that existing hardware models are enumerable, the [Target RFC](https://discuss.tvm.apache.org/t/rfc-tvm-target-specification/6844) proposes "tags" to allow easy creation of targets. For example, `Target("nvidia/jetson-agx-xavier")` gives the full specification of this device, including the CUDA target and the ARM host. At the time of writing, it only takes 200 tags to describe [all the CUDA hardware](https://github.com/apache/tvm/blob/a6a34046c432b3766e7c32bbd85c098812a12a68/src/t
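The tag idea can be sketched as a name-to-configuration table: one string expands into the full device-plus-host config. This toy registry and its attribute names are illustrative only, not TVM's implementation (the Jetson AGX Xavier's GPU is compute capability 7.2, i.e. `sm_72`):

```python
# Illustrative tag registry: a tag is just a name bound to a complete
# target configuration, so users write one string instead of every attr.
# The dict schema here is an example, not TVM's exact attribute set.
TARGET_TAGS = {
    "nvidia/jetson-agx-xavier": {
        "kind": "cuda",
        "arch": "sm_72",
        "max_threads_per_block": 1024,
        "host": {"kind": "llvm", "mtriple": "aarch64-linux-gnu"},
    },
}

def target_from_tag(tag: str) -> dict:
    # A real implementation constructs a Target object; a dict
    # suffices to show the expansion.
    return TARGET_TAGS[tag]

config = target_from_tag("nvidia/jetson-agx-xavier")
print(config["host"]["mtriple"])  # aarch64-linux-gnu
```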
Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)
Hi @junrushao1994, thanks for the elaborate reply 😸 I don't want to debate our personal principles, but I appreciate you sharing them and will reference them where I can.

> **Current `arch`-specific checks.** Most of the 6 groups of `arch`-specific helper functions, mentioned in the "Motivation" section, are developed by concurrent efforts from multiple parties, and therefore I would say fragmentation is almost a certain thing to happen. On the other hand, fragmentation of those helper functions, which are indeed ad-hoc, is currently confined locally as independent checks without yet polluting the design of global infrastructure.

Yip, this fragmentation occurs due to the lack of a standardised mechanism for these groups to use, which this RFC aims to provide. I'm not sure why we'd consider that pollution, given it should have a positive impact on all groups and aids in providing customisation through a well-defined route established by all vendors, providing both simplicity and customisation (A1, A2 and A3).

> Existing arch-like attributes. Note that the design of Target attributes in TargetKind is intended to describe the capability of hardware, according to the [Target RFC](https://discuss.tvm.apache.org/t/rfc-tvm-target-specification/6844):

I believe this still holds given the crossover between hardware and code generators; for `c`/`llvm` you will still be able to pass `mattr`/`mcpu`/etc., which are the standard way to specify CPU, ISA and extensions in existing compilers (A1). This also follows the TVM guideline: "Be consistent with existing well-known package’s APIs if the features overlap. For example, tensor operation APIs should always be consistent with the numpy API."

> Target tag system. Given the fact that existing hardware models are enumerable, the [Target RFC](https://discuss.tvm.apache.org/t/rfc-tvm-target-specification/6844) proposes to use "tags" to allow easy creation of targets.
> For example, `Target("nvidia/jetson-agx-xavier")` gives full specification of this device, including the cuda target and the ARM host. At the time of writing, it only takes 200 tags to describe [all the CUDA hardware](https://github.com/apache/tvm/blob/a6a34046c432b3766e7c32bbd85c098812a12a68/src/target/tag.cc#L107-L348).

> **What on earth are `Target`s.** Actually, `target` in TVM not only refers to the hardware, but also the codegen targets. For example, LLVM targets mean TVM codegens to LLVM and lets LLVM do the rest of the compilation; CUDA targets codegen to CUDA source code (i.e. `*.cu` files) and invoke NVCC for compilation.

There's definitely an existing amount of confusion in the `Target` system, but I think it's even more confused than this, looking at the tagged entries, such as:

```
TVM_REGISTER_TARGET_TAG("raspberry-pi/4b-aarch64")
    .set_config({{"kind", String("llvm")},
                 {"mtriple", String("aarch64-linux-gnu")},
                 {"mcpu", String("cortex-a72")},
                 {"mattr", Array<String>{"+neon"}},
                 {"num-cores", Integer(4)},
                 {"host", Map<String, ObjectRef>{{"kind", String("llvm")},
                                                 {"mtriple", String("aarch64-linux-gnu")},
                                                 {"mcpu", String("cortex-a72")},
                                                 {"mattr", Array<String>{"+neon"}},
                                                 {"num-cores", Integer(4)}}}});
```

This defines a system configuration that then uses the code generator `llvm` with a specific CPU profile; therefore the `Target` system can represent **at minimum** 3 distinct layers: systems, hardware and code generators. Given the principle of having to understand a minimal amount (A2), `Target` needs to be streamlined into understandable parts. This is a digression from the actual RFC, as the features are pre-processed from the tagged `Target`'s information.

The problem I've faced trying to figure out how to implement this is that `llvm` takes `mcpu`/`mattr`/etc., which are used to infer features later, whereas the `cuda` `Target` has a more complete reflection of the `attrs` available for CUDA.
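The feature inference being contrasted here, deriving capabilities from `mcpu`/`mattr` the way LLVM/GCC users expect, might look like the following hedged sketch. The CPU-to-feature table is illustrative (though cortex-a72 implying NEON matches the tag registration above); this is not TVM's parser:

```python
# Illustrative mcpu -> implied-features table, not a real compiler's.
MCPU_FEATURES = {
    "cortex-a72": {"+neon"},
    "cortex-m55": {"+mve"},
}

def infer_features(mcpu, mattr=()):
    """Start from what the CPU implies, then apply +/- overrides from
    mattr, mirroring LLVM-style attribute strings."""
    features = set(MCPU_FEATURES.get(mcpu, set()))
    for attr in mattr:
        if attr.startswith("+"):
            features.add(attr)
        elif attr.startswith("-"):
            features.discard("+" + attr[1:])
    return features

print(infer_features("cortex-a72"))             # {'+neon'}
print(infer_features("cortex-a72", ["-neon"]))  # set()
```

Under this model `cuda` would need its own inference from `arch`/compute capability, which is where the inconsistency with `llvm` shows up.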
These inconsistencies make it difficult for a user to approach TVM, and that conflicts with requiring developers to learn only the bare minimum (A2). It also diverges from other compilers, where you'd expect `mcpu`/`mtriple`/etc. to infer much of this information for you (A1).

> Do we need multiple target parsers? My answer is no. If the parsers are maintained by different vendors separately in their own interest, then they could decide how to implement parsers for "keys" and "arch" together, without conflicting with other contributors. Therefore, I would say it's already consistent with our principle A3 without having to develop multiple parsers.

> Naming choice. When implementing the Target RFC, I came up with the preprocessor to be backward compatible with existing funct
Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)
Thanks folks for the discussions. I think they summarize to the following points:

- Q0: Subfield grouping (e.g. features) or simply leave as top-level attrs
- Q1: Folder structure: `target/preprocessors/cuda.cc` vs `target/cuda/cuda_preprocessor.cc`
  - Note that code reuse is unlikely to be an issue, as there can be common includes among sub-folder structures.
- Q2: How to reconcile attributes already specified for (part of) a subfield group: `llvm` can use `mattr` to specify the necessary feature. What happens to CUDA (e.g. we propose to group hw-related limitations into features), and how can we specify some of the existing feature fields during tag registration?

--
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/71#issuecomment-1133013784
You are receiving this because you are subscribed to this thread.
Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)
Thanks folks for the discussions. I think they summarize to the following points:

- Q0: Subfield grouping (e.g. features) or simply leave as top-level attrs
- Q1: Folder structure: `target/preprocessors/cuda.cc` vs `target/cuda/cuda_preprocessor.cc`
  - Note that code reuse is unlikely to be an issue, as there can be common includes among sub-folder structures.
- Q2: How to reconcile attributes already specified for (part of) a subfield group: `llvm` can use `mattr` to specify the necessary feature. What happens to CUDA (e.g. we propose to group hw-related limitations into features), and how can we specify some of the fields (which under the proposal will go into a subfield `feature`) during tag registration?

--
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/71#issuecomment-1133026002
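To make Q0 concrete, here are the two shapes being compared, using hypothetical CUDA attribute names (the values match common CUDA per-block limits, but the schema is illustrative, not the proposed one):

```python
# Option 1 (status quo): everything as flat, top-level target attrs.
flat = {
    "kind": "cuda",
    "arch": "sm_86",
    "max_threads_per_block": 1024,
    "max_shared_memory_per_block": 49152,
}

# Option 2 (the proposal): hardware limitations grouped under a
# "features" subfield, keeping the top level small.
grouped = {
    "kind": "cuda",
    "arch": "sm_86",
    "features": {
        "max_threads_per_block": 1024,
        "max_shared_memory_per_block": 49152,
    },
}

# Q2 then asks how tag registration pre-sets entries inside the subfield:
print(grouped["features"]["max_threads_per_block"])  # 1024
```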
Re: [apache/tvm-rfcs] [RFC] UMA Universal Modular Accelerator Interface (PR #60)
One more Collage/UMA overlap aspect: Collage distinguishes 'registered' backends (i.e. just `TargetKind`s) from 'activated' backends (i.e. `Target` objects in the provided build targets). I think, though, the proposal here is that the act of registration is also activation? I need help understanding how this will look from the user's point of view in combination with targets.

--
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/60#issuecomment-1133139602
Re: [apache/tvm-rfcs] [RFC] Introducing DeclBuffer (PR #70)
@wrongtest I've thought about option A3 vs A4. For parsing/translation from TVM script to TIR, it is acceptable to have `T.allocate` translated to the two nodes `Allocate` + `DeclBuffer`. But it will be tricky for `TVMScriptPrinter`: we will need to find both the `Allocate` and `DeclBuffer` nodes and then print `T.allocate`, and these two nodes do not have to be parent/child of each other. I'm not sure this behavior, which breaks the 1-to-1 mapping between TVM script and TIR, is desirable. Alternatively, we can add an option to `T.allocate`, such as `def allocate(..., return_buffer: bool)`. What do you think?

--
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/70#issuecomment-1133153361
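A toy sketch of why the printer gets tricky: using hypothetical stand-in node classes (not TIR's real ones), the `DeclBuffer` paired with an `Allocate` may sit anywhere inside the allocate's body, so the printer must search for it rather than read a direct child:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for TIR's Allocate and DeclBuffer nodes.
@dataclass
class DeclBuffer:
    buffer_var: str

@dataclass
class Allocate:
    buffer_var: str
    body: list = field(default_factory=list)

def find_decl(stmts, var):
    """Search for the DeclBuffer paired with an Allocate's variable.
    Because the two nodes need not be parent/child, printing a single
    T.allocate requires this kind of search, which is what breaks the
    1-to-1 node-to-syntax mapping."""
    for s in stmts:
        if isinstance(s, DeclBuffer) and s.buffer_var == var:
            return s
        if isinstance(s, Allocate):
            found = find_decl(s.body, var)
            if found is not None:
                return found
    return None

# DeclBuffer("A") is nested two levels below its Allocate:
alloc = Allocate("A", [Allocate("B", [DeclBuffer("A")])])
print(find_decl(alloc.body, "A") is not None)  # True
```

The `return_buffer` option sidesteps this by keeping one script construct per node pair.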
Re: [apache/tvm-rfcs] Add Target Pre-processing RFC (PR #71)
@Mousius Thank you so much for your response! This makes lots of sense to me! Also, thanks for including my personal principles in the discussion! They're my personal principles, so it's completely okay to disagree with them :-)

> I'm not sure why we consider that pollution given it should have a positive impact on all groups and aids in providing customisation through a well defined route established by all vendors, providing simplicity and customisation (A1, A2 and A3).

Sorry, my words may have led to some miscommunication. As I mentioned at the very beginning, I completely agree with the idea of handling `arch` more systematically, and it's important to us, so by no means do I consider it pollution. I mean the fragmentation itself is not good, but it hasn't become extremely bad yet, because these checks are not used globally at the infrastructural level, so there is no namespace pollution.

> I believe this is still held given the crossover between hardware and code generators, for c/llvm you will still be able to pass mattr/mcpu/etc which are the standard way to specify CPU, ISA and extensions in existing compilers.

Agreed. It is designed to be consistent with existing wisdom, i.e. LLVM/GCC mattr/mcpu/etc., so I believe it's not controversial.

> There's definitely an existing amount of confusion in the Target system, but I think it's even more confused than this by looking at the tagged entries [...] Which defines a system configuration that then uses a code generator llvm with a specific CPU profile, therefore the Target system can represent at minimum 3 distinct layers: systems/hardware/code generators.

Yes, I think your understanding is correct. Per the [Target RFC](https://discuss.tvm.apache.org/t/rfc-tvm-target-specification/6844), this object represents host compiler info (C0), heterogeneous compilation settings (C1), compiler behavior control (C2), and additional ISA/library (C4).
Therefore, the tag system (C4) is introduced to reduce cognitive overhead: end users can use a simple plain-string tag, `Target("raspberry-pi/4b-aarch64")`, to represent all of C0/1/2/4 instead of understanding every detail of the hardware themselves (A2). In other words, for end users the vision of the tag system is that they only need a simple string to configure TVM codegen for their hardware.

> where-as the cuda Target has a more complete reflection of the attrs available for CUDA. These inconsistencies make it difficult for a user to approach TVM and that isn't requiring a developers to learn the bare minimum (A2). It also diverges from other compilers where you'd expect mcpu/mtriple/etc to infer much of this information for you (A1).

Right. In fact I share the same confusion as you when it comes to CUDA/Vulkan/etc., where (to the best of my knowledge) there is no precedent of mcpu/mattr/mtriple/etc. (except for the LLVM NVPTX backend). Certainly, we should develop a unified approach to address this confusion, which could possibly be the `arch` proposed in the RFC.

> I don't mind using one parser, I was just making it more granular and avoiding stepping on peoples toes with the existing pre-processor. Either design seems to fulfil your principle of customisability (A3), more granularly means you only have to implement a small piece of customisation where-as with the single parser it requires some thought as to how to cater for all attributes within a given TargetKind parser.

Thanks for sharing your thoughts! Definitely. The issue with having multiple parsers is that the order of their application is less explicit, so I would lean a little towards a single one, while the implementation could share common helper functions.

> > Where to dispatch target parsers.
> > Currently, the preprocessor is dispatched solely based on `TargetKind`, which is admittedly a bit limited and overlooks the possibility that aarch64 and x86 may need completely different parsers. Therefore, here we would love to raise a question: based on which attribute should the parser be dispatched? Previous wisdom in clang seems to suggest dispatching according to mtriple (IIUC), so shall we introduce anything similar, for example, dispatching based on `--device aarch64-foo-bar`?
>
> This requires developers to learn more than the bare minimum (A2), as users are required to learn the specifics of how to specify a Target in TVM. If an llvm Target supports -mcpu/-mattr/-mtriple, then all of these fields can reasonably be expected to allow the Target to infer features, given the relation to GCC/LLVM (A1).

Right, I agree. `TargetKind`-based dispatch is certainly the clearest approach, as you mentioned (A2). The reason I'm thinking about other directions (not 100% sure though) is: what if different vendors (say Qualcomm and ARM) want to customize their own target parsing logic separately, while their `TargetKind` is th