[PATCH] D24715: [OpenCL] Block captured variables in dynamic parallelism - OpenCL 2.0

2016-10-03 Thread bekket mcclane via cfe-commits
mshockwave removed a reviewer: bader.
mshockwave updated this revision to Diff 73254.
mshockwave added a comment.

@Anastasia
Sorry for late responding, I'd just attach a new version of patch that fixes 
block function use cases as normal lambda functions. But testing code is not 
included in this revision, we're still working on it.

About the questions you asked in the previous comment, I'm going to explain 
them from another aspect: How will one implement __enqueue_kernel_XXX? It might 
be classified into two categories:

1. Library implementation like libclc/libclcxx written in OpenCL-C
2. Implement builtins directly in compiler.

If we choose the first one, which most of people would do regarding its 
simplicity and flexibility, and we want to fetch captured variables inside the 
implementation of __enqueue_kernel_XXX, the possible approach would be:

  void* block_as_voidptr = (void*)arg_child_kernel;
  block_literal_ty *block = (block_literal_ty*)block_as_voidptr;
  block->capA;
  block->capB;

This seems promise, but what exactly `block_literal_ty` looks like? We all know 
that `block_literal_ty` would look similar to:

  typedef struct {
/*
* Fields of block header. 
* e.g. isa, block_descriptor...
*/
  
int capA;
int capB;
...
  } block_literal_ty;

But since we're discussing a static type language, the definition of this 
struct must be known. However, the EXACT appearence of `block_literal_ty` would 
vary among programs, or even functions. That's the thing cap_copy_helper want 
to aid.

Of course there is another library approach: Keep the child kernel's 
invoke_function prototype untouched, pass block_literal variable(in void 
pointer type) as its first function argument. Since instructions for extracting 
captured variables had been generated during the codegen of invoke_function 
body. Also, we don't need to tackle any captured variables inside 
__enqueue_kernel_XXX. 
However, the OpenCL spec says that global address space is the only address 
space shared between parent and child kernel; and the block_literal variable 
itself, is allocated as private(stack) variable in parent kernel. So we need to 
copy the block_literal variable(not its pointer) into some global space. 
Nevertheless, OpenCL doesn't allow dynamic-sized memory in global space, so we 
need to define a block of static size memory, perhaps array, in our library 
implementation. Here is the place might require global memory management since 
static size implies potential risk of running out pre-allocated space.

Regarding the improvement proposed by us which "flatten" captured variables 
into invoke_function argument list and block_literal pointer wouldn't be passed 
as first argument(to invoke_function) anymore. The reason why it doesn't 
require global memory management is that we can retrieve captured variables 
with cap_num field and cap_copy_helper routine INSIDE __enqueue_kernel_XXX and 
passed those captures as arguments to child kernel, rather than saving 
block_literal variable globally and postpone the retrieving actions until 
invoke_function, the child kernel body.


https://reviews.llvm.org/D24715

Files:
  lib/CodeGen/CGBlocks.cpp
  lib/CodeGen/CGBlocks.h
  lib/CodeGen/CGExpr.cpp
  lib/CodeGen/CodeGenFunction.cpp
  lib/CodeGen/CodeGenFunction.h
  lib/CodeGen/CodeGenModule.h

Index: lib/CodeGen/CodeGenModule.h
===
--- lib/CodeGen/CodeGenModule.h
+++ lib/CodeGen/CodeGenModule.h
@@ -464,6 +464,7 @@
 
   llvm::Type *BlockDescriptorType = nullptr;
   llvm::Type *GenericBlockLiteralType = nullptr;
+  llvm::Type *GenericOCLBlockLiteralType = nullptr;
 
   struct {
 int GlobalUniqueCount;
@@ -768,6 +769,8 @@
   /// The type of a generic block literal.
   llvm::Type *getGenericBlockLiteralType();
 
+  llvm::Type *getGenericOCLBlockLiteralType();
+
   /// Gets the address of a block which requires no captures.
   llvm::Constant *GetAddrOfGlobalBlock(const BlockExpr *BE, const char *);
   
Index: lib/CodeGen/CodeGenFunction.h
===
--- lib/CodeGen/CodeGenFunction.h
+++ lib/CodeGen/CodeGenFunction.h
@@ -288,6 +288,7 @@
 
   const CodeGen::CGBlockInfo *BlockInfo;
   llvm::Value *BlockPointer;
+  bool IsOCLChildKernelInvoke;
 
   llvm::DenseMap LambdaCaptureFields;
   FieldDecl *LambdaThisCaptureField;
@@ -1341,8 +1342,12 @@
 const DeclMapTy &ldm,
 bool IsLambdaConversionToBlock);
 
+  llvm::Constant* GenerateBlockFunctionWrapper(const CGBlockInfo &blockInfo,
+   const llvm::Function *invokeFunc);
+
   llvm::Constant *GenerateCopyHelperFunction(const CGBlockInfo &blockInfo);
   llvm::Constant *GenerateDestroyHelperFunction(const CGBlockInfo &blockInfo);
+  llvm::Constant *GenerateOCLCapturesCopyFunction(const CGBlockInfo &blockInfo);
   llvm::Constant 

[PATCH] D24715: [OpenCL] Block captured variables in dynamic parallelism - OpenCL 2.0

2016-10-05 Thread bekket mcclane via cfe-commits
mshockwave added a comment.

> From this picture I don't see how the flattening itself can help us to avoid 
> using global memory. Surely in both cases the captures content will have to 
> be copied into the memory accessible for the enqueued kernel (which is a 
> global memory in a general case, but doesn't have to be in some cases I am 
> guessing). Perhaps I am missing some extra step in the approach you are 
> proposing. If you rely on the parameter passing using normal function call 
> into the block_invoke then in both cases we can skip the memcpy of captures 
> at all. Otherwise both appoaches will need to make a copy of the captures.

Now I agree with the necessity of global `accessable_memory`. After all, kernel 
enqueuing functions in the host side(clEnqueueNDRangeXXX) also require 
pre-allocated `__global` memory, we should follow the same fashion.

> What we can improve though is avoiding extra data copy using the copy helpers 
> you are proposing (this can also be achieved by calling mempy passing the 
> capture offset pointer into block_literal and captures size instead of the 
> the whole block_literal as highlighted above). We can also potentially avoid 
> reloading of the captures in the enqueued kernel though the capture 
> flattening, but this depends on the calling convention (or rather enqueueing 
> convension I would say).

It seems that the flattening approach leave little space for the 
implementation, which violate the generic property of llvm. Also, although 
cap_copy_helper looks helpful, I think there is little chance for one to copy 
individual captured variables - copy the entire block_literal is sufficient. Of 
course, we can reduce block_literal size by removing redundant fields for the 
sake of optimization but that's another discussion topic I think.

I would like to remove this code review. 
@Anastasia thanks for your patient discussing with newbie llvm/clang developer 
like me :)


https://reviews.llvm.org/D24715



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits