lildmh marked 2 inline comments as done.
lildmh added inline comments.

================
Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:8739
+///     // For each component specified by this mapper:
+///     if (currentComponent.hasMapper())
+///       (*currentComponent.Mapper())(rt_mapper_handle, arg_base, arg_begin,
----------------
ABataev wrote:
> lildmh wrote:
> > ABataev wrote:
> > > lildmh wrote:
> > > > ABataev wrote:
> > > > > lildmh wrote:
> > > > > > ABataev wrote:
> > > > > > > lildmh wrote:
> > > > > > > > ABataev wrote:
> > > > > > > > > lildmh wrote:
> > > > > > > > > > ABataev wrote:
> > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > Currently `currentComponent` is generated by the 
> > > > > > > > > > > > > compiler. But can we instead pass this data as an 
> > > > > > > > > > > > > extra parameter to this `omp_mapper` function.
> > > > > > > > > > > > Emm, I think this scheme will be very difficult and 
> > > > > > > > > > > > inefficient. If we pass components as an argument of 
> > > > > > > > > > > > `omp_mapper` function, it means that the runtime needs 
> > > > > > > > > > > > to generate all components related to a map clause. I 
> > > > > > > > > > > > don't think the runtime is able to do that efficiently. 
> > > > > > > > > > > > On the other hand, in the current scheme, these 
> > > > > > > > > > > > components are naturally generated by the compiler, and 
> > > > > > > > > > > > the runtime only needs to know the base pointer, 
> > > > > > > > > > > > pointer, type, size. etc.
> > > > > > > > > > > With the current scheme, we may end with the code 
> > > > > > > > > > > blowout. We need to generate very similar code for 
> > > > > > > > > > > different types and variables. The worst thing here is 
> > > > > > > > > > > that we will be unable to optimize this huge amount of 
> > > > > > > > > > > code because the codegen relies on the runtime functions 
> > > > > > > > > > > and the code cannot be inlined. That's why I would like 
> > > > > > > > > > > to move as much as possible code to the runtime rather 
> > > > > > > > > > > than to emit it in the compiler. 
> > > > > > > > > > I understand your concerns. I think this is the best we can 
> > > > > > > > > > do right now.
> > > > > > > > > > 
> > > > > > > > > > The most worrisome case will be when we have nested mappers 
> > > > > > > > > > within each other. In this case, a mapper function will 
> > > > > > > > > > call another mapper function. We can inline the inner 
> > > > > > > > > > mapper functions in this scenario, so that these mapper 
> > > > > > > > > > function can be properly optimized. As a result, I think 
> > > > > > > > > > the performance should be fine.
> > > > > > > > > Instead, we can use indirect function calls passed in the 
> > > > > > > > > array to the runtime. Do you think it is going to be slower? 
> > > > > > > > > In your current scheme, we generate many runtime calls 
> > > > > > > > > instead. Could you try to estimate the number of calls in 
> > > > > > > > > cases if we'll call the mappers through the indirect function 
> > > > > > > > > calls and in your cuurent scheme, where we need to call the 
> > > > > > > > > runtime functions many times in each particular mapper?
> > > > > > > > Hi Alexey,
> > > > > > > > 
> > > > > > > > Sorry I don't understand your idea. What indirect function 
> > > > > > > > calls do you propose to be passed to the runtime? What are 
> > > > > > > > these functions supposed to do?
> > > > > > > > 
> > > > > > > > The number of function calls will be exactly equal to the 
> > > > > > > > number of components mapped, no matter whether there are nested 
> > > > > > > > mappers or not. The number of components depend on the program. 
> > > > > > > > E.g., if we map a large array section, then there will be many 
> > > > > > > > more function calls.
> > > > > > > I mean the pointers to the mapper function, generated by the 
> > > > > > > compiler. In your comment, it is `c.Mapper()`
> > > > > > If we pass nested mapper functions to the runtime, I think it will 
> > > > > > slow down execution because of the extra level of indirect function 
> > > > > > calls. E.g., the runtime will call `omp_mapper1`, which calls the 
> > > > > > runtime back, which calls `omp_mapper2`, .... This can result in a 
> > > > > > deep call stack.
> > > > > > 
> > > > > > I think the current implementation will be more efficient, which 
> > > > > > doesn't pass nested mappers to the runtime. One call to the outer 
> > > > > > most mapper function will have all data mapping done. The call 
> > > > > > stack will be 2 level deep (the first level is the mapper function, 
> > > > > > and the second level is `__tgt_push_mapper_component`) in this case 
> > > > > > from the runtime. There are also more compiler optimization space 
> > > > > > when we inline all nested mapper functions.
> > > > > Yes, if we leave it as is. But if instead of the bunch unique 
> > > > > functions we'll have the common one, that accept list if indirect 
> > > > > pointers to functions additionally, and move it to the runtime 
> > > > > library, we won't need those 2 functions we have currently. We'll 
> > > > > have full access to the mapping data vector in the runtime library 
> > > > > and won't need to use those 2 accessors we have currently. Instead, 
> > > > > we'll need just one runtime functions, which implements the whole 
> > > > > mapping logic. We still need to call it recursively, but I assume the 
> > > > > number of calls will remain the same as in the current scheme. Did 
> > > > > you understand the idea? If yes, it would good if you coild try to 
> > > > > estimate the number of function calls in current scheme and in this 
> > > > > new scheme to estimate possible pros and cons.
> > > > Hi Alexey,
> > > > 
> > > > Could you give an example for this scheme? 1) I don't understand how 
> > > > the mapper function can have full access to the mapping data vector 
> > > > without providing these 2 accessors. 2) I don't think it is possible to 
> > > > have a common function instead of bunch of unique functions for each 
> > > > mapper declared.
> > > Hi Lingda, something like this.
> > > ```
> > > void __tgt_mapper(void *base, void *begin, size_t size, int64_t type, 
> > > auto components[]) {
> > >   // Allocate space for an array section first.
> > >   if (size > 1 && !maptype.IsDelete)
> > >      <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
> > >    
> > >   // Map members.
> > >   for (unsigned i = 0; i < size; i++) {
> > >      // For each component specified by this mapper:
> > >      for (auto c : components) {
> > >        if (c.hasMapper())
> > >          (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
> > >        else
> > >          <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
> > >      }
> > >   }
> > >   // Delete the array section.
> > >   if (size > 1 && maptype.IsDelete)
> > >     <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
> > > }
> > > 
> > > void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
> > >  auto sub_components[] = {...};
> > >  __tgt_mapper(base, begin, size, type, sub_components);
> > > }
> > > ```
> > > 
> > Hi Alexey,
> > 
> > I don't think this scheme is more efficient than the current scheme. My 
> > reasons are:
> > 
> > 1) Most code here is essentially to generate `components`, i.e., we need to 
> > generate `c.arg_base, c.arg_begin, c.arg_size, c.arg_type` for each `c` in 
> > `components`, so there will still be a lot of code in `<type>.mapper`. It 
> > will not reduce the mapper function code, i.e., we will still have a bunch 
> > of unique mapper functions.
> > 
> > 2) This scheme will prevent a lot of compiler optimization from happening. 
> > In reality, a lot of computation should be redundant. E.g., for two 
> > components `c1` and `c2`, `c1`'s base may be the same as `c2`'s begin, so 
> > the compiler will be able to eliminate these reduction computation, 
> > especially when we inline all nested mapper functions together. If we move 
> > these computation into the runtime, the compiler will not be able to do 
> > such optimization.
> > 
> > 3) In terms of the number of `push` function calls, this scheme has the 
> > exact same number of calls as the current scheme, so I don't think this 
> > scheme can bring performance benefits. The scheme should perform worse than 
> > the current scheme, because it reduces the opportunities of compiler 
> > optimization as mentioned above.
> Hi Lingda, I'm trying to simplify the code generated by clang and avoid some 
> unnecessary code duplications. If the complexity of this scheme is the same 
> as proposed by you, I would prefer to use this scheme unless there are some 
> other opinions.
> 1. It is not a problem. This code is unique and is not duplicated in the 
> different mappers.
> 2. Inlining is no solution here. We still generate to much code, which is 
> almost the same in many cases and it will lead to very ineffective codegen 
> because we still end up with a lot of almost the same code. This also might 
> lead to poor performance.
> 3. Yes, the number of pushes is always the same, in all possible schemes. It 
> would be good to compare somehow the performance of both schemes, at least 
> preliminary.
> 
> Also, this solution reduces the number of required runtime functions, instead 
> of 2 we need just 1 and, thus, we need to make fewer runtime functions calls.
> 
> I think it would better to propose this scheme as an alternate design and 
> discuss it in the OpenMP telecon. What do you think? Or we can try to discuss 
> it in the offline mode via the e-mail with other members.
> I'm not trying to convince you to implement this scheme right now, but it 
> would be good to discuss it. Maybe it will lead to some better ideas from 
> others?
Hi Alexey,

I still prefer the current scheme, because:
1) I don't like recursive mapper calls, which goes back to my original scheme a 
little bit. I really think inlining can make a big difference when we have 
nested mappers. These compiler optimizations are the keys to have better 
performance for mappers.
2) I don't think the codegen here is inefficient. Yes there is duplicated code 
across different mapper functions, but why that will lead to poor performance?
3) Although we have 2 runtime functions now, the `__tgt_mapper_num_components` 
is called only once per mapper. It should have very negligible performance 
impact.

But if you have a different option, we can discuss it next time in the meeting. 
I do have a time constraint to work on the mapper implementation. I'll no 
longer work in this project starting this September, and I have about 30% of my 
time working on it until then.


================
Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:747
+  // Call to void __tgt_push_mapper_component(void *rt_mapper_handle, void
+  // *base, void *begin, size_t size, int64_t type);
+  OMPRTL__tgt_push_mapper_component,
----------------
ABataev wrote:
> lildmh wrote:
> > ABataev wrote:
> > > Check the declaration of the runtime function in the runtime patch and 
> > > here. `size` parameter here has type `size_t`, while in runtime part it 
> > > is `int64_t`
> > The runtime part uses `int64_t` because I see every other runtime function 
> > use `int64_t` instead of `size_t` for `size`, e.g., `__tgt_target, 
> > __tgt_target_teams`, etc., despite that they are declared to be `size_t` in 
> > Clang codegen. So I guess it is done on purpose? Otherwise we need to 
> > modify all these runtime function interface in the future.
> I recently committed the patch that fixes this problem in clang. If you're 
> using `int64_t` in the runtime, the same type must be used in clang codegen.
That's nice to know. Then I will change the type of `size` to `int64_t` as well.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59474/new/

https://reviews.llvm.org/D59474



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to