SimeonEhrig added inline comments.

================
Comment at: unittests/CodeGen/IncrementalProcessingTest.cpp:176-178
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for 
+// every statement if a fatbinary file exists.
----------------
tra wrote:
> SimeonEhrig wrote:
> > tra wrote:
> > > I don't understand the comment. What is 'CUDA incremental processing' and 
> > > what exactly is meant by 'statement' here? I'd appreciate if you could 
> > > give me more details. My understanding is that ctor/dtor are generated 
> > > once per TU. I suspect "incremental processing" may change that, but I 
> > > have no idea what exactly does it do.
> > A CUDA ctor/dtor is generated for every llvm::Module. A TU can also be 
> > composed of many modules. In our interpreter, we add new code to our AST 
> > with new modules at runtime. 
> > The ctor/dtor generation depends on the fatbinary code. CodeGen checks 
> > whether a path to a fatbinary file is set. If it is, it generates a ctor 
> > with at least a __cudaRegisterFatBinary() call. So the generation is 
> > independent of the source code in the module, and any statement triggers 
> > it. A statement can be an expression, a declaration, a definition, and so 
> > on.
> I still don't understand how it's going to work. Do you have some sort of 
> design document outlining how the interpreter is going to work with CUDA?
> 
> The purpose of the ctor/dtor is to stitch together host-side kernel launch 
> with the GPU-side kernel binary which resides in the GPU binary created by 
> device-side compilation. 
> 
> So, the question #1 -- if you pass GPU-side binary to the compiler, where did 
> you get it? Normally it's the result of device-side compilation of the same 
> TU. In your case it's not quite clear what exactly would that be, if you feed 
> the source to the compiler incrementally. I.e. do you somehow recompile 
> everything we've seen on device side so far for each new chunk of host-side 
> source you feed to the compiler? 
> 
> Next question is -- assuming that device side does have correct GPU-side 
> binary, when do you call those ctors/dtors? JIT model does not quite fit the 
> assumptions that drive regular CUDA compilation.
> 
> Let's consider this:
> ```
> __global__ void foo();
> __global__ void bar();
> 
> // If that's all we've fed to the compiler so far, we have no GPU code yet,
> // so there should be no fatbin file. If we do have it, what's in it?
> 
> void launch() {
>   foo<<<1,1>>>();
>   bar<<<1,1>>>();
> }
> // If you've generated ctors/dtors at this point they would be 
> // useless as no GPU code exists in the preceding code.
> 
> __global__ void foo() {}
> // Now we'd have some GPU code, but we somehow need to retrofit it into
> // all the ctors/dtors we've generated before.
> __global__ void bar() {}
> // Does bar end up in its own fatbinary? Or is it combined into a new
> // fatbin which contains both foo and bar?
> // If it's a new fatbin, you somehow need to update existing ctors/dtors,
> // unless you want to leak CUDA resources fast.
> // If it's a separate fatbin, then you will need to at the very least
> // change the way ctors/dtors are generated by the 'launch' function,
> // because now they need to tie each kernel launch to a different fatbin.
> 
> ```
> 
> It looks to me that if you want to JIT CUDA code you will need to take over 
> GPU-side kernel management.
> ctors/dtors do that for full-TU compilation, but they rely on device-side 
> code being compiled and available during host-side compilation. For JIT, the 
> interpreter should be in charge of registering new kernels with the CUDA 
> runtime and unregistering/unloading them when a kernel goes away. This makes 
> ctors/dtors completely irrelevant.
At the moment there is no documentation, because we are still developing the 
feature. I will try to describe how it works.

Device-side compilation is done with a second compiler (a normal clang), 
which we start via a syscall. In the interpreter, we check whether the input 
line is a kernel definition or a kernel launch. If so, we write the source 
code to a file and compile it with clang to a PCH file. The PCH file is then 
compiled to PTX, and the PTX to a fatbin. When we add a new kernel, we send 
its source code together with the existing PCH file to the clang compiler. 
This way we easily extend the AST and generate a PTX file containing all 
defined kernels.

You can see an implementation of this feature in my prototype: 
<https://github.com/SimeonEhrig/CUDA-Runtime-Interpreter>

Running the ctor/dtor isn't hard. I look up the JITSymbol, obtain a function 
pointer, and then simply call it. You can also see this feature in my 
prototype. So we can run the ctor whenever new fatbin code has been 
generated, and the dtor beforehand if code was already registered. The CUDA 
runtime also allows the (un)register functions to be called many times.

  __global__ void foo();
  __global__ void bar();

  // At this point there is no fatbin file, and none will be generated.

  void launch() {
    foo<<<1,1>>>();
    bar<<<1,1>>>();
  }

  // Defining launch() is not possible in cling's direct input mode (typing
  // in code line by line). At this point we would need definitions of foo()
  // and bar(). But there is an exception: we have a function that reads in a
  // piece of code from a file, and that piece of code is translated as a
  // single module.
  

  __global__ void foo() {}
  __global__ void bar() {}

  // In that case, cling compiles these 8 lines of code as a single module
  // and sends them to the CUDA device JIT, too.

  // There is one fatbinary file, which is extended with new kernels. The
  // file has to be unregistered and re-registered every time it changes.
  // Which ctor/dtor has to run, and when, is managed by the interpreter.

I'm not sure I understand you correctly. Do you mean we should implement the 
content of the ctor/dtor directly in our cling source code? For example, 
calling the `__cudaRegisterFatBinary()` function directly from cling's source 
code after generating a new fatbin file, as opposed to calling 
`__cuda_module_ctor`, which we generate with the JIT backend of our 
interpreter.


Repository:
  rC Clang

https://reviews.llvm.org/D44435



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
