gtbercea added a comment.

In https://reviews.llvm.org/D52434#1248844, @Hahnfeld wrote:

> Just tested this and got very weird results for register usage:
>
>   void func(double *a) {
>     #pragma omp target teams distribute parallel for map(a[0:100]) // dist_schedule(static)
>     for (int i = 0; i < 100; i++) {
>       a[i]++;
>     }
>   }
>
>
> Compiling with current trunk for `sm_60` (Pascal): 29 registers
>  Adding `dist_schedule(static)` (the previous default): 19 registers
>  For reference: `dist_schedule(static, 128)` also uses 29 registers
>
> Any ideas? This significantly slows down STREAM...


Jonas, without an explicit dist_schedule clause the program will run with 
schedule(static, <number of threads in block>). It looks like that is happening 
correctly, since you get the same register count for the explicit static-chunk 
variant as for the default case.

The difference you see in register count is (I suspect) driven by the runtime 
code (fewer registers for the non-chunked scheme than for the chunked one). I am 
currently investigating this and trying to find ways to reduce that number.

One big problem with your code is that the trip count is incredibly small, 
especially for STREAM and especially on GPUs. You need a much larger loop size, 
otherwise the timings will be dominated by OpenMP setup costs.


Repository:
  rC Clang

https://reviews.llvm.org/D52434



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits