[quote="kparzysz, post:11, topic:6844"] a composite target looks like a better solution. As the next step I suggest that we **drop the target host** completely. A function can be split into multiple parts meant for different targets. Instead of explicitly designating a certain target as a *target host* , we can use the same mechanism that assigns the individual parts to their targets to assign the “host” part to its target. This would remove the distinction between a host and a device code for the purposes of code generation. [/quote]
I agree that it is tempting to remove the target host completely. However, that does bring some trouble to our analysis in the early stages.

Specifically, when we talk about, say, a "GPU program", there are two mindsets at play. At the high level (Relay, TOPI), we treat a composite GPU kernel (e.g. softmax) as a "GPU program". The softmax could actually contain multiple kernel launches and need host code for dimension calculations, but because the code itself only reads/writes GPU memory, we view such a kernel as a GPU program. It is also useful to view it that way, because in high-level scheduling the ML kernel writer and the scheduler treat it as a GPU program rather than a heterogeneous program. At the lowest level, however, a "GPU program" refers only to the device code, not to the host code that drives it.

So the design choice really boils down to how we view a device program (see the sketch at the end of this post for a concrete comparison):

- V0: a GPU (device) program is a program that involves a single device target plus the related host code to drive that target.
- V1: a GPU (device) program is a program that involves only device code, without the host driving part.

While the V1 view is certainly easier to take from the low-level driver's PoV, the V0 view can be more useful in the following regards:

- It provides a useful device key for dispatching high-level schedules.
- It is the natural way high-level developers think about a program with a single device target.
- It offers simplicity for users who want to specify the target (e.g. they don't have to spell out cuda as a composite target).

It also acknowledges the fact that there is a difference between a single-target program (a host/device mix) and a program with multiple device targets. We can still use composite targets for the latter. That does mean, though, that such a per-target split would usually happen earlier, in the graph stage, rather than at a later stage.

As some additional food for thought, V0 and V1 also correspond to the two different mindsets advocated by the CUDA and OpenCL programming models. As we know, nvcc allows GPU kernels to blend directly into .cu files, and to programmers those .cu files become what we think of as the GPU program. The OpenCL model is closer to V1. And as we know, the CUDA model "won" GPGPU programming over the other one, in my opinion partly due to the mindset offered in V0.
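To make the two views concrete, here is a rough sketch of how a user might specify the target for a single-GPU program under each view. As above, the field names are hypothetical and only meant for illustration:

```python
# V0: the "gpu program" target bundles the device target with the host
# code needed to drive it (dimension calculations, kernel launches).
# The user just says "cuda"; the host part is implied or defaulted.
v0_target = {"kind": "cuda", "arch": "sm_75", "host": {"kind": "llvm"}}

# V1: "cuda" covers only device code, so even a single-GPU program has
# to be spelled out as a composite of a host part and a device part.
v1_target = {
    "kind": "composite",
    "targets": [
        {"kind": "llvm"},                   # host driving code
        {"kind": "cuda", "arch": "sm_75"},  # device kernels only
    ],
}
```

Under V0 the common single-device case stays simple, and composite targets are reserved for programs that genuinely span multiple device targets.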