gtbercea added a comment.
In https://reviews.llvm.org/D52434#1248844, @Hahnfeld wrote:
> Just tested this and got very weird results for register usage:
>
> void func(double *a) {
> #pragma omp target teams distribute parallel for map(a[0:100]) // dist_schedule(static)
>   for (int i = 0; i < 100; i++) {
>     a[i]++;
>   }
> }
>
>
> Compiling with current trunk for `sm_60` (Pascal): 29 registers
> Adding `dist_schedule(static)` (the previous default): 19 registers
> For reference: `dist_schedule(static, 128)` also uses 29 registers
>
> Any ideas? This significantly slows down STREAM...
Jonas, without an explicit dist_schedule clause the program will run with
schedule(static, <number of threads in block>). That appears to be working as
intended, since the explicit static-chunk variant yields the same register
count as the default case.
The difference you see in register count is (I suspect) driven by the runtime
code (fewer registers for non-chunked than for chunked). I am currently
investigating this and trying to find ways to reduce this number.
One big problem with your code is that the trip count is incredibly small,
especially for STREAM and especially on GPUs. You need a much larger loop
size; otherwise the timings will be dominated by OpenMP setup costs.
Repository:
rC Clang
https://reviews.llvm.org/D52434
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits