Hahnfeld added a comment. In https://reviews.llvm.org/D52434#1249399, @gtbercea wrote:
> In https://reviews.llvm.org/D52434#1249186, @Hahnfeld wrote:
>
> > In https://reviews.llvm.org/D52434#1249102, @gtbercea wrote:
> >
> > > You report a slowdown which I am actually not able to reproduce. Do you
> > > use any additional clauses not present in your previous post?
> >
> > No, only `dist_schedule(static)`, which is faster. Tested on a `Tesla P100`
> > with today's trunk version:
> >
> > | `#pragma omp target teams distribute parallel for` (new defaults)          | 190 - 250 GB/s |
> > | adding clauses for old defaults: `schedule(static) dist_schedule(static)`  | 30 - 50 GB/s   |
> > | same directive with only `dist_schedule(static)` added (fewer registers)   | 320 - 400 GB/s |
>
> Which loop size are you using? What runtime does nvprof report for these
> kernels?

Sorry, forgot to mention: I'm using the original STREAM code with 80,000,000 `double` elements in each vector. Output from `nvprof`:

             Type  Time(%)      Time  Calls       Avg       Min       Max  Name
  GPU activities:   70.05%  676.71ms      9  75.191ms  1.3760us  248.09ms  [CUDA memcpy DtoH]
                     7.67%  74.102ms     10  7.4102ms  7.3948ms  7.4220ms  __omp_offloading_34_b871a7d5_main_l307
                     7.63%  73.679ms     10  7.3679ms  7.3457ms  7.3811ms  __omp_offloading_34_b871a7d5_main_l301
                     6.78%  65.516ms     10  6.5516ms  6.5382ms  6.5763ms  __omp_offloading_34_b871a7d5_main_l295
                     6.77%  65.399ms     10  6.5399ms  6.5319ms  6.5495ms  __omp_offloading_34_b871a7d5_main_l289
                     0.68%  6.6106ms      1  6.6106ms  6.6106ms  6.6106ms  __omp_offloading_34_b871a7d5_main_l264
                     0.41%  3.9659ms      1  3.9659ms  3.9659ms  3.9659ms  __omp_offloading_34_b871a7d5_main_l245
                     0.00%  1.1200us      1  1.1200us  1.1200us  1.1200us  [CUDA memcpy HtoD]
        API calls:   51.12%  678.90ms      9  75.434ms  24.859us  248.70ms  cuMemcpyDtoH
                    22.40%  297.51ms     42  7.0835ms  4.0042ms  7.6802ms  cuCtxSynchronize
                    20.31%  269.72ms      1  269.72ms  269.72ms  269.72ms  cuCtxCreate
                     5.32%  70.631ms      1  70.631ms  70.631ms  70.631ms  cuCtxDestroy
                     0.46%  6.1607ms      1  6.1607ms  6.1607ms  6.1607ms  cuModuleLoadDataEx
                     0.28%  3.7628ms      1  3.7628ms  3.7628ms  3.7628ms  cuModuleUnload
                     0.10%  1.2977ms     42  30.898us  13.930us  60.092us  cuLaunchKernel
                     0.00%  56.142us     42  1.3360us     677ns  2.0930us  cuFuncGetAttribute
                     0.00%  43.957us     46     955ns     454ns  1.7670us  cuCtxSetCurrent
                     0.00%  15.179us      1  15.179us  15.179us  15.179us  cuMemcpyHtoD
                     0.00%  7.2780us     10     727ns     358ns  1.4760us  cuModuleGetGlobal
                     0.00%  6.9910us      2  3.4950us  2.2660us  4.7250us  cuDeviceGetPCIBusId
                     0.00%  5.7500us      6     958ns     333ns  3.5270us  cuModuleGetFunction
                     0.00%  3.7530us      9     417ns     184ns  1.0850us  cuDeviceGetAttribute
                     0.00%  2.6790us      3     893ns     370ns  1.9300us  cuDeviceGetCount
                     0.00%  2.0090us      3     669ns     484ns     767ns  cuDeviceGet

The memcpy comes from a `target update` to verify the results on the host. It is not included in the measurement itself, so STREAM only evaluates the kernel execution time:

  Function    Best Rate MB/s  Avg time   Min time   Max time
  Copy:             190819.6  0.006781   0.006708   0.006841
  Scale:            189065.7  0.006800   0.006770   0.006831
  Add:              253831.7  0.007616   0.007564   0.007646
  Triad:            253432.3  0.007668   0.007576   0.007737

Repository:
  rC Clang

https://reviews.llvm.org/D52434
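For illustration, here is a minimal sketch of the kind of STREAM-style loop and the clause combinations compared above. This is not the actual STREAM source: the array names, vector length, initialization, verification, and the surrounding `target data` region are assumptions made to keep the example self-contained; only the directive and clause spellings match the discussion.

  #include <stdio.h>

  /* Illustrative vector length; the discussion above uses 80,000,000 doubles. */
  #define N 80000000L

  static double a[N], b[N], c[N];

  int main(void) {
    for (long i = 0; i < N; ++i) {
      a[i] = 1.0;
      b[i] = 2.0;
      c[i] = 0.0;
    }

    /* Keep the vectors resident on the device across the kernels. */
    #pragma omp target data map(to: a, b) map(tofrom: c)
    {
      /* New defaults: no schedule clauses (190 - 250 GB/s above). */
      #pragma omp target teams distribute parallel for
      for (long i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

      /* Old defaults spelled out explicitly (30 - 50 GB/s above). */
      #pragma omp target teams distribute parallel for schedule(static) dist_schedule(static)
      for (long i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

      /* Only dist_schedule(static) added (320 - 400 GB/s above). */
      #pragma omp target teams distribute parallel for dist_schedule(static)
      for (long i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

      /* Copy the result back to check it on the host; this transfer shows up
         as the large [CUDA memcpy DtoH] in nvprof and is excluded from the
         STREAM timing. */
      #pragma omp target update from(c)
      for (long i = 0; i < N; ++i)
        if (c[i] != 3.0) {
          printf("mismatch at %ld\n", i);
          break;
        }
    }
    return 0;
  }

Built with offloading enabled (something like `clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda`), each combined directive is emitted as one `__omp_offloading_..._main_l<line>` kernel of the kind listed in the nvprof output above.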