lildmh marked 2 inline comments as done.
lildmh added inline comments.

================
Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:8739
+///     // For each component specified by this mapper:
+///     if (currentComponent.hasMapper())
+///       (*currentComponent.Mapper())(rt_mapper_handle, arg_base, arg_begin,
----------------
ABataev wrote:
> lildmh wrote:
> > ABataev wrote:
> > > lildmh wrote:
> > > > ABataev wrote:
> > > > > lildmh wrote:
> > > > > > ABataev wrote:
> > > > > > > lildmh wrote:
> > > > > > > > ABataev wrote:
> > > > > > > > > lildmh wrote:
> > > > > > > > > > ABataev wrote:
> > > > > > > > > > > Currently `currentComponent` is generated by the 
> > > > > > > > > > > compiler. But can we instead pass this data as an extra 
> > > > > > > > > > > parameter to this `omp_mapper` function.
> > > > > > > > > > Emm, I think this scheme will be very difficult and 
> > > > > > > > > > inefficient. If we pass components as an argument of 
> > > > > > > > > > `omp_mapper` function, it means that the runtime needs to 
> > > > > > > > > > generate all components related to a map clause. I don't 
> > > > > > > > > > think the runtime is able to do that efficiently. On the 
> > > > > > > > > > other hand, in the current scheme, these components are 
> > > > > > > > > > naturally generated by the compiler, and the runtime only 
> > > > > > > > > > needs to know the base pointer, pointer, type, size. etc.
> > > > > > > > > With the current scheme, we may end with the code blowout. We 
> > > > > > > > > need to generate very similar code for different types and 
> > > > > > > > > variables. The worst thing here is that we will be unable to 
> > > > > > > > > optimize this huge amount of code because the codegen relies 
> > > > > > > > > on the runtime functions and the code cannot be inlined. 
> > > > > > > > > That's why I would like to move as much as possible code to 
> > > > > > > > > the runtime rather than to emit it in the compiler. 
> > > > > > > > I understand your concerns. I think this is the best we can do 
> > > > > > > > right now.
> > > > > > > > 
> > > > > > > > The most worrisome case will be when we have nested mappers 
> > > > > > > > within each other. In this case, a mapper function will call 
> > > > > > > > another mapper function. We can inline the inner mapper 
> > > > > > > > functions in this scenario, so that these mapper function can 
> > > > > > > > be properly optimized. As a result, I think the performance 
> > > > > > > > should be fine.
> > > > > > > Instead, we can use indirect function calls passed in the array 
> > > > > > > to the runtime. Do you think it is going to be slower? In your 
> > > > > > > current scheme, we generate many runtime calls instead. Could you 
> > > > > > > try to estimate the number of calls in cases if we'll call the 
> > > > > > > mappers through the indirect function calls and in your cuurent 
> > > > > > > scheme, where we need to call the runtime functions many times in 
> > > > > > > each particular mapper?
> > > > > > Hi Alexey,
> > > > > > 
> > > > > > Sorry I don't understand your idea. What indirect function calls do 
> > > > > > you propose to be passed to the runtime? What are these functions 
> > > > > > supposed to do?
> > > > > > 
> > > > > > The number of function calls will be exactly equal to the number of 
> > > > > > components mapped, no matter whether there are nested mappers or 
> > > > > > not. The number of components depend on the program. E.g., if we 
> > > > > > map a large array section, then there will be many more function 
> > > > > > calls.
> > > > > I mean the pointers to the mapper function, generated by the 
> > > > > compiler. In your comment, it is `c.Mapper()`
> > > > If we pass nested mapper functions to the runtime, I think it will slow 
> > > > down execution because of the extra level of indirect function calls. 
> > > > E.g., the runtime will call `omp_mapper1`, which calls the runtime 
> > > > back, which calls `omp_mapper2`, .... This can result in a deep call 
> > > > stack.
> > > > 
> > > > I think the current implementation will be more efficient, which 
> > > > doesn't pass nested mappers to the runtime. One call to the outer most 
> > > > mapper function will have all data mapping done. The call stack will be 
> > > > 2 level deep (the first level is the mapper function, and the second 
> > > > level is `__tgt_push_mapper_component`) in this case from the runtime. 
> > > > There are also more compiler optimization space when we inline all 
> > > > nested mapper functions.
> > > Yes, if we leave it as is. But if instead of the bunch unique functions 
> > > we'll have the common one, that accept list if indirect pointers to 
> > > functions additionally, and move it to the runtime library, we won't need 
> > > those 2 functions we have currently. We'll have full access to the 
> > > mapping data vector in the runtime library and won't need to use those 2 
> > > accessors we have currently. Instead, we'll need just one runtime 
> > > functions, which implements the whole mapping logic. We still need to 
> > > call it recursively, but I assume the number of calls will remain the 
> > > same as in the current scheme. Did you understand the idea? If yes, it 
> > > would good if you coild try to estimate the number of function calls in 
> > > current scheme and in this new scheme to estimate possible pros and cons.
> > Hi Alexey,
> > 
> > Could you give an example for this scheme? 1) I don't understand how the 
> > mapper function can have full access to the mapping data vector without 
> > providing these 2 accessors. 2) I don't think it is possible to have a 
> > common function instead of bunch of unique functions for each mapper 
> > declared.
> Hi Lingda, something like this.
> ```
> void __tgt_mapper(void *base, void *begin, size_t size, int64_t type, auto 
> components[]) {
>   // Allocate space for an array section first.
>   if (size > 1 && !maptype.IsDelete)
>      <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
>    
>   // Map members.
>   for (unsigned i = 0; i < size; i++) {
>      // For each component specified by this mapper:
>      for (auto c : components) {
>        if (c.hasMapper())
>          (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
>        else
>          <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
>      }
>   }
>   // Delete the array section.
>   if (size > 1 && maptype.IsDelete)
>     <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
> }
> 
> void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
>  auto sub_components[] = {...};
>  __tgt_mapper(base, begin, size, type, sub_components);
> }
> ```
> 
Hi Alexey,

I don't think this scheme is more efficient than the current scheme. My reasons 
are:

1) Most code here is essentially to generate `components`, i.e., we need to 
generate `c.arg_base, c.arg_begin, c.arg_size, c.arg_type` for each `c` in 
`components`, so there will still be a lot of code in `<type>.mapper`. It will 
not reduce the mapper function code, i.e., we will still have a bunch of unique 
mapper functions.

2) This scheme will prevent a lot of compiler optimization from happening. In 
reality, a lot of computation should be redundant. E.g., for two components 
`c1` and `c2`, `c1`'s base may be the same as `c2`'s begin, so the compiler 
will be able to eliminate these reduction computation, especially when we 
inline all nested mapper functions together. If we move these computation into 
the runtime, the compiler will not be able to do such optimization.

3) In terms of the number of `push` function calls, this scheme has the exact 
same number of calls as the current scheme, so I don't think this scheme can 
bring performance benefits. The scheme should perform worse than the current 
scheme, because it reduces the opportunities of compiler optimization as 
mentioned above.


================
Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:747
+  // Call to void __tgt_push_mapper_component(void *rt_mapper_handle, void
+  // *base, void *begin, size_t size, int64_t type);
+  OMPRTL__tgt_push_mapper_component,
----------------
ABataev wrote:
> Check the declaration of the runtime function in the runtime patch and here. 
> `size` parameter here has type `size_t`, while in runtime part it is `int64_t`
The runtime part uses `int64_t` because I see every other runtime function use 
`int64_t` instead of `size_t` for `size`, e.g., `__tgt_target, 
__tgt_target_teams`, etc., despite that they are declared to be `size_t` in 
Clang codegen. So I guess it is done on purpose? Otherwise we need to modify 
all these runtime function interface in the future.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59474/new/

https://reviews.llvm.org/D59474



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to