tra added a comment.
In http://reviews.llvm.org/D9888#257904, @sfantao wrote:
> This diff refactors the original patch and is rebased on top of the latest
> offloading changes inserted for CUDA.
>
> Here I don't touch the CUDA support. I tried, however, to have the
> implementation modular enough so that it could eventually be combined with
> the CUDA implementation. In my view OpenMP offloading is more general in the
> sense that it does not refer to a given toolchain; instead, it uses existing
> toolchains to generate code for offloading devices. So, I believe that a
> toolchain (which I did not include in this patch) targeting NVPTX will be able to
> handle both CUDA and OpenMP offloading models.
What do you mean by "does not refer to a given toolchain"? Do you have the
toolchain patch available?
Creating a separate toolchain for CUDA was a crutch that made it possible to
craft an appropriate cc1 command line for device-side compilation using an
existing toolchain. It works, but it's a rather rigid arrangement. Creating an
NVPTX toolchain which can be parameterized to produce CUDA or OpenMP would be
an improvement.
Ideally, toolchain tweaking should probably be done outside of the toolchain
itself, so that it can be used with any combination of {CUDA or OpenMP target
tweaks} x {toolchains capable of generating target code}.
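To make the idea concrete, here's a purely hypothetical sketch of that
separation (none of these names exist in the driver; they're just
illustration):

  // Hypothetical sketch only -- not actual clang driver code. The point is
  // that the offloading model contributes its flags independently of the
  // toolchain that generates the target code.
  #include <string>
  #include <vector>

  enum class OffloadKind { None, CUDA, OpenMP };

  struct DeviceToolChain {               // stand-in for e.g. an NVPTX toolchain
    std::vector<std::string> BaseCC1Args;
  };

  // Tweaks live outside the toolchain, so any {kind} x {toolchain} combination
  // works without subclassing the toolchain per offloading model.
  std::vector<std::string> buildDeviceCC1Args(const DeviceToolChain &TC,
                                              OffloadKind Kind) {
    std::vector<std::string> Args = TC.BaseCC1Args;
    if (Kind == OffloadKind::CUDA)
      Args.push_back("-fcuda-is-device"); // cc1 flag used for CUDA device side
    else if (Kind == OffloadKind::OpenMP)
      Args.push_back("-fopenmp");         // plus whatever the device side needs
    return Args;
  }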
> b) The building of the driver actions is unchanged.
>
> I don't create device-specific actions. Instead, only the bundling/unbundling
> are inserted as the first or last action if the file type requires that.
Could you elaborate on that? The way I read it, the driver sees a linear chain
of compilation steps, plus bundling/unbundling at the beginning/end, and each
action would result in multiple compiler invocations, presumably one per target.
If that's the case, then it may present a bit of a challenge when one part of
the compilation depends on the results of another. That's the case for CUDA,
where the results of device-side compilation must be present for host-side
compilation so that we can generate additional code to initialize it at runtime.
> c) Add offloading kind to `ToolChain`
>
> Offloading does not require a new toolchain to be created. Existent
> toolchains are used and the offloading kind is used to drive specific
> behavior in each toolchain so that valid device code is generated.
>
> This is a major difference from what is currently done for CUDA. But I guess
> the CUDA implementation easily fits this design and the Nvidia GPU toolchain
> could be reused for both CUDA and OpenMP offloading.
Sounds good. I'd be happy to make the changes necessary to have CUDA support use it.
> d) Use Job results cache to easily use host results in device actions and
> vice-versa.
>
> An array of the results for each job is kept so that the device job can use
> the result previously generated for the host and use it as input, or
> vice versa.
Nice. That's something that will be handy for CUDA and may help to avoid
passing bits of info about other jobs explicitly throughout the driver.
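For reference, the rough shape I'd picture for such a cache, as a simplified
hypothetical illustration rather than the actual types used in this patch:

  // Hypothetical illustration of a per-action result cache; the real patch
  // works with the driver's own Action/InputInfo types instead.
  #include <map>
  #include <string>
  #include <vector>

  struct Result { std::string Filename; };  // stand-in for InputInfo
  using ActionID = const void *;            // stand-in for the Action*

  // One entry per action, filled in as jobs are built for host and devices.
  std::map<ActionID, std::vector<Result>> OffloadingResults;

  // A device job can then pick up the host result produced earlier (or vice
  // versa) instead of having that info threaded explicitly through the driver.
  const Result *getHostResult(ActionID A) {
    auto It = OffloadingResults.find(A);
    return (It != OffloadingResults.end() && !It->second.empty())
               ? &It->second.front()
               : nullptr;
  }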
> The result cache can also be updated to keep the required information for the
> CUDA implementation to decide how host/device binaries are combined (injection
> is the term used in the code). I don't have a concrete proposal for that,
> however, given that it is not clear to me what the plans are for CUDA to
> support separate compilation; I understand that the CUDA binary is inserted
> directly into the host IR (Art, can you shed some light on this?).
Currently CUDA depends on libcudart, which assumes that GPU code and its
initialization are done the way nvcc does them. We do include the PTX assembly
(as in readable text) generated on the device side into the host-side IR *and*
generate some host data structures and init code to register GPU binaries with
libcudart. I haven't figured out a way to compile the host/device sides of CUDA
without the host-side compilation depending on device results.
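Schematically, the host side ends up with a shape like the following
self-contained sketch (the real registration is emitted as IR and goes through
internal libcudart entry points such as __cudaRegisterFatBinary and
__cudaRegisterFunction, which are only hinted at in the comments here):

  // Simplified sketch of the host-side shape; registration with the real
  // runtime is replaced by a stub so this stays self-contained.
  #include <cstdio>

  // Device-side PTX is embedded into the host object as readable text.
  static const char GPUCode[] =
      ".version 4.2\n.target sm_35\n// ... rest of the PTX ...\n";

  // Stand-in for handing the embedded image to libcudart.
  static void registerWithRuntime(const char *Image) {
    std::printf("registering %zu bytes of GPU code\n", sizeof(GPUCode));
    (void)Image;
  }

  // A constructor runs before main() and registers the GPU code; that's what
  // lets the <<<...>>> launch syntax find device functions at run time.
  __attribute__((constructor)) static void RegisterGPUBinary() {
    registerWithRuntime(GPUCode);
  }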
Long-term, we're considering implementing CUDA runtime support based on the
plain driver interface, which would give us more control over where we keep GPU
code and how we initialize it. Then we could simplify things and, for example,
incorporate GPU code via a linker script. Alas, for the time being we're stuck
with libcudart and sequential device and host compilation phases.
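For illustration, if we embedded the image at link time (e.g. with
`ld -r -b binary`, which defines _binary_<name>_start/_end symbols), the host
code consuming it could look roughly like this; the file name kernels.cubin is
just a placeholder:

  // Sketch of consuming a GPU image embedded at link time, e.g. via
  //   ld -r -b binary -o kernels.o kernels.cubin
  // which defines _binary_kernels_cubin_start/_end in the resulting object.
  #include <cstddef>

  extern "C" const char _binary_kernels_cubin_start[];
  extern "C" const char _binary_kernels_cubin_end[];

  static const char *gpuImage() { return _binary_kernels_cubin_start; }
  static std::size_t gpuImageSize() {
    return static_cast<std::size_t>(_binary_kernels_cubin_end -
                                    _binary_kernels_cubin_start);
  }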
As for separate compilation -- the compilation part is doable. It's using the
results of such a compilation that becomes tricky. CUDA's triple-bracket kernel
launch syntax depends on libcudart and will not work, because we would not
generate the init code. You can still launch kernels manually using the raw
driver API, but it's quite a bit more convoluted.
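To give a feel for "more convoluted", a minimal driver-API launch looks roughly
like this (error checking omitted; the kernel name "kernel" is a placeholder):

  // Launching a kernel through the raw CUDA driver API, bypassing libcudart
  // and the <<<...>>> syntax. Error checking omitted for brevity.
  #include <cuda.h>

  void launchViaDriverAPI(const void *Image, void **KernelArgs) {
    cuInit(0);

    CUdevice Dev;
    cuDeviceGet(&Dev, 0);

    CUcontext Ctx;
    cuCtxCreate(&Ctx, 0, Dev);

    CUmodule Mod;
    cuModuleLoadData(&Mod, Image);            // Image: PTX text or a cubin

    CUfunction Fn;
    cuModuleGetFunction(&Fn, Mod, "kernel");  // placeholder kernel name

    // One block of 128 threads, no dynamic shared memory, default stream.
    cuLaunchKernel(Fn, 1, 1, 1, 128, 1, 1, 0, nullptr, KernelArgs, nullptr);

    cuCtxSynchronize();
    cuCtxDestroy(Ctx);
  }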
--Artem
================
Comment at: include/clang/Driver/Driver.h:208
@@ +207,3 @@
+ /// CreateUnbundledOffloadingResult - Create a command to unbundle the input
+ /// and use the resulting input info. If there re inputs already cached in
+ /// OffloadingHostResults for that action use them instead. If no offloading
----------------
re -> are
================
Comment at: include/clang/Driver/Driver.h:210
@@ +209,3 @@
+ /// OffloadingHostResults for that action use them instead. If no offloading
+ /// is being support just return the provided input info.
+ InputInfo CreateUnbundledOffloadingResult(
----------------
"If offloading is not supported" perhaps?
================
Comment at: lib/Driver/Driver.cpp:2090
@@ +2089,3 @@
+ dyn_cast<OffloadUnbundlingJobAction>(A)) {
+ // The input of the unbundling job has to a single input non-source file,
+ // so we do not consider it having multiple architectures. We just use the
----------------
"has to be"
http://reviews.llvm.org/D9888