[quote="kparzysz, post:11, topic:6844"]
a composite target looks like a better solution. As the next step I suggest 
that we **drop the target host** completely. A function can be split into 
multiple parts meant for different targets. Instead of explicitly designating a 
certain target as a *target host* , we can use the same mechanism that assigns 
the individual parts to their targets to assign the “host” part to its target. 
This would remove the distinction between a host and a device code for the 
purposes of code generation.
[/quote]

I agree that it is tempting to remove the target host completely. However, that 
does bring some trouble to our analysis in the early stages.

Specifically, when we talk about, say, a "GPU program", there are two mindsets 
at play.

At the high level (relay, topi), we treat a composite GPU kernel (e.g. softmax) 
as a "GPU program". The softmax could actually contain multiple kernel launches 
and need host code for dimension calculations, but because the code itself only 
reads/writes GPU memory, we view such a kernel as a GPU program. It is also 
useful to view it that way, because ML kernel writers and schedulers think of 
them as GPU programs, rather than heterogeneous programs, during high-level 
scheduling.


At the lowest level, the "GPU program" refers only to the device code, not the 
host code that drives it.

So the design choice really boils down to how we view a device program (the two 
variants are sketched after this list):
- V0: a gpu(device) program is a program that involves a single device target 
plus the related host code that drives that target.
- V1: a gpu(device) program is a program that involves only device code, not 
the host driving part.
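As an illustration, here is what the two views might look like as target specs. The schema below is purely hypothetical; the exact field names are precisely what this RFC is deciding:

```python
# Hypothetical target schemas -- illustration only.

# V0: the device target carries its driving host as an attribute.
target_v0 = {
    "kind": "cuda",
    "host": {"kind": "llvm", "mtriple": "x86_64-linux-gnu"},
}

# V1: host and device are just two peer parts of a composite target,
# with no host/device distinction at the spec level.
target_v1 = {
    "kind": "composite",
    "targets": [
        {"kind": "llvm", "mtriple": "x86_64-linux-gnu"},  # "host" part
        {"kind": "cuda"},                                 # device part
    ],
}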

While from the low-level driver's PoV it is certainly easier to take the V1 
view, the V0 view can be more useful in the following regards:
- It provides a useful device key for dispatching high level schedules (see 
the sketch after this list).
- It is the natural way high level developers think about a program with a 
single device target.
- It offers simplicity for users who want to specify the target (e.g. they 
don't have to spell out cuda as a composite target).
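A hypothetical sketch of the first point (this is not an actual TVM API, just plain Python to illustrate the idea): with the V0 view, a single device key is enough to dispatch a high level schedule, even though the lowered artifact will also contain host code.

```python
# Hypothetical schedule registry keyed by a single device key.
SCHEDULES = {}

def register_schedule(device_key):
    def _wrap(fn):
        SCHEDULES[device_key] = fn
        return fn
    return _wrap

@register_schedule("cuda")
def schedule_softmax_cuda(op):
    return f"cuda schedule for {op}"

@register_schedule("cpu")
def schedule_softmax_cpu(op):
    return f"cpu schedule for {op}"

def dispatch(device_key, op):
    # One key per "GPU program"; no composite lookup needed.
    return SCHEDULES[device_key](op)

print(dispatch("cuda", "softmax"))
```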

It also acknowledges the fact that there is a difference between a 
single-target (host/device mix) program and a program with multiple device 
targets. We can still use the composite target for the latter. That does mean, 
though, that such a per-target split would usually happen earlier, at the 
graph stage, rather than later.
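For the multi-device case, a hedged sketch using the relay annotation API of this era (the exact splitting pass and build interface may differ):

```python
import tvm
from tvm import relay

# Pin one subexpression to the GPU; everything else falls back to the
# default device, and a graph-level pass performs the per-target split.
x = relay.var("x", shape=(10,))
gpu_part = relay.annotation.on_device(relay.exp(x), tvm.gpu(0))
func = relay.Function([x], relay.add(gpu_part, x))
```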

As some additional food for thought, V0 and V1 also correspond to the two 
different mindsets that the CUDA and OpenCL programming models advocate. As we 
know, nvcc allows GPU kernels to blend directly into ".cu" files, and to 
programmers the .cu file becomes what we know as a GPU program. The OpenCL 
model is closer to V1. As we know, the CUDA model "won" GPGPU programming over 
the other one, in my opinion due to the mindset offered in V0.