ABataev added a comment.

In D99432#2738859 <https://reviews.llvm.org/D99432#2738859>, @estewart08 wrote:
> In D99432#2736981 <https://reviews.llvm.org/D99432#2736981>, @ABataev wrote:
>
>> In D99432#2736970 <https://reviews.llvm.org/D99432#2736970>, @estewart08 wrote:
>>
>>> In D99432#2728788 <https://reviews.llvm.org/D99432#2728788>, @ABataev wrote:
>>>
>>>> In D99432#2726997 <https://reviews.llvm.org/D99432#2726997>, @estewart08 wrote:
>>>>
>>>>> In D99432#2726845 <https://reviews.llvm.org/D99432#2726845>, @ABataev wrote:
>>>>>
>>>>>> In D99432#2726588 <https://reviews.llvm.org/D99432#2726588>, @estewart08 wrote:
>>>>>>
>>>>>>> In D99432#2726391 <https://reviews.llvm.org/D99432#2726391>, @ABataev wrote:
>>>>>>>
>>>>>>>> In D99432#2726337 <https://reviews.llvm.org/D99432#2726337>, @estewart08 wrote:
>>>>>>>>
>>>>>>>>> In D99432#2726060 <https://reviews.llvm.org/D99432#2726060>, @ABataev wrote:
>>>>>>>>>
>>>>>>>>>> In D99432#2726050 <https://reviews.llvm.org/D99432#2726050>, @estewart08 wrote:
>>>>>>>>>>
>>>>>>>>>>> In D99432#2726025 <https://reviews.llvm.org/D99432#2726025>, @ABataev wrote:
>>>>>>>>>>>
>>>>>>>>>>>> In D99432#2726019 <https://reviews.llvm.org/D99432#2726019>, @estewart08 wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> In reference to https://bugs.llvm.org/show_bug.cgi?id=48851, I do not see how this helps SPMD mode with team privatization of declarations in between target teams and parallel regions.
>>>>>>>>>>>>
>>>>>>>>>>>> Did you try the reproducer with the applied patch?
>>>>>>>>>>>
>>>>>>>>>>> Yes, I still saw the test fail, although it was not with the latest llvm-project. Are you saying the reproducer passes for you?
>>>>>>>>>>
>>>>>>>>>> I don't have CUDA installed, but from what I see in the LLVM IR it should pass. Do you have a debug log? Does it crash or produce incorrect results?
>>>>>>>>>
>>>>>>>>> This is on an AMDGPU, but I assume the behavior would be similar for NVPTX.
>>>>>>>>>
>>>>>>>>> It produces incorrect/incomplete results in the dist[0] index after a manual reduction, and in turn the final global gpu_results array is incorrect. When thread 0 does a reduction into dist[0], it has no knowledge of dist[1] having been updated by thread 1, which tells me the array is still thread-private. Adding some printfs and looking at one team's output:
>>>>>>>>>
>>>>>>>>> SPMD:
>>>>>>>>>
>>>>>>>>>   Thread 0: dist[0]: 1
>>>>>>>>>   Thread 0: dist[1]: 0 // This should be 1
>>>>>>>>>   After reduction into dist[0]: 1 // This should be 2
>>>>>>>>>   gpu_results = [1,1] // [2,2] expected
>>>>>>>>>
>>>>>>>>> Generic mode:
>>>>>>>>>
>>>>>>>>>   Thread 0: dist[0]: 1
>>>>>>>>>   Thread 0: dist[1]: 1
>>>>>>>>>   After reduction into dist[0]: 2
>>>>>>>>>   gpu_results = [2,2]
>>>>>>>>
>>>>>>>> Hmm, I would expect a crash if the array were allocated in local memory. Could you try adding some more printfs (with the data and addresses of the array) to check the results? Maybe there is a data race somewhere in the code?
>>>>>>>
>>>>>>> As a reminder, each thread updates a unique index in the dist array and each team updates a unique index in gpu_results.
>>>>>>>
>>>>>>> SPMD - shows each thread has a unique address for the dist array:
>>>>>>>
>>>>>>>   Team 0 Thread 1: dist[0]: 0, 0x7f92e24a8bf8
>>>>>>>   Team 0 Thread 1: dist[1]: 1, 0x7f92e24a8bfc
>>>>>>>
>>>>>>>   Team 0 Thread 0: dist[0]: 1, 0x7f92e24a8bf0
>>>>>>>   Team 0 Thread 0: dist[1]: 0, 0x7f92e24a8bf4
>>>>>>>
>>>>>>>   Team 0 Thread 0: After reduction into dist[0]: 1
>>>>>>>   Team 0 Thread 0: gpu_results address: 0x7f92a5000000
>>>>>>>   --------------------------------------------------
>>>>>>>   Team 1 Thread 1: dist[0]: 0, 0x7f92f9ec5188
>>>>>>>   Team 1 Thread 1: dist[1]: 1, 0x7f92f9ec518c
>>>>>>>
>>>>>>>   Team 1 Thread 0: dist[0]: 1, 0x7f92f9ec5180
>>>>>>>   Team 1 Thread 0: dist[1]: 0, 0x7f92f9ec5184
>>>>>>>
>>>>>>>   Team 1 Thread 0: After reduction into dist[0]: 1
>>>>>>>   Team 1 Thread 0: gpu_results address: 0x7f92a5000000
>>>>>>>
>>>>>>>   gpu_results[0]: 1
>>>>>>>   gpu_results[1]: 1
>>>>>>>
>>>>>>> Generic - shows each team shares the dist array address amongst its threads:
>>>>>>>
>>>>>>>   Team 0 Thread 1: dist[0]: 1, 0x7fac01938880
>>>>>>>   Team 0 Thread 1: dist[1]: 1, 0x7fac01938884
>>>>>>>
>>>>>>>   Team 0 Thread 0: dist[0]: 1, 0x7fac01938880
>>>>>>>   Team 0 Thread 0: dist[1]: 1, 0x7fac01938884
>>>>>>>
>>>>>>>   Team 0 Thread 0: After reduction into dist[0]: 2
>>>>>>>   Team 0 Thread 0: gpu_results address: 0x7fabc5000000
>>>>>>>   --------------------------------------------------
>>>>>>>   Team 1 Thread 1: dist[0]: 1, 0x7fac19354e10
>>>>>>>   Team 1 Thread 1: dist[1]: 1, 0x7fac19354e14
>>>>>>>
>>>>>>>   Team 1 Thread 0: dist[0]: 1, 0x7fac19354e10
>>>>>>>   Team 1 Thread 0: dist[1]: 1, 0x7fac19354e14
>>>>>>>
>>>>>>>   Team 1 Thread 0: After reduction into dist[0]: 2
>>>>>>>   Team 1 Thread 0: gpu_results address: 0x7fabc5000000
>>>>>>
>>>>>> Could you check if it works with the `-fno-openmp-cuda-parallel-target-regions` option?
>>>>>
>>>>> Unfortunately that crashes:
>>>>>
>>>>>   llvm-project/llvm/lib/IR/Instructions.cpp:495: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a bad signature!"' failed.
>>>>
>>>> Hmm, could you provide a full stack trace?
>>>
>>> At this point I am not sure I want to dig into that crash, as our llvm branch is not caught up to trunk.
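(For context while reading these dumps: the pattern being exercised is a per-team array declared between the `target teams` and `parallel` regions, with each thread writing its own element and thread 0 doing a manual reduction. The following is only a minimal sketch reconstructed from the description in this thread; it is not the exact PR48851 reproducer, and the names `dist` and `gpu_results`, the two-team/two-thread sizes, and the expected values are taken from the quoted output.)

  #include <omp.h>
  #include <cstdio>

  int main() {
    int gpu_results[2] = {0, 0};

    // Two teams, two threads each. "dist" is declared between the target
    // teams and parallel regions, so each team should get one copy that is
    // shared by that team's threads.
    #pragma omp target teams num_teams(2) thread_limit(2) map(tofrom: gpu_results[0:2])
    {
      int dist[2] = {0, 0};

      #pragma omp parallel num_threads(2)
      {
        int tid = omp_get_thread_num();
        dist[tid] = 1;                          // each thread writes a unique index

        #pragma omp barrier
        if (tid == 0) {
          dist[0] += dist[1];                   // manual reduction by thread 0
          gpu_results[omp_get_team_num()] = dist[0];
        }
      }
    }

    // Expected: 2 2. The failing SPMD runs quoted above print 1 1 because
    // each thread ends up with a private copy of dist.
    std::printf("gpu_results: %d %d\n", gpu_results[0], gpu_results[1]);
    return 0;
  }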
>>>
>>> I did build trunk and ran some tests on an sm_70:
>>>
>>> - Without this patch: code fails with incomplete results.
>>> - Without this patch and with -fno-openmp-cuda-parallel-target-regions: code fails with incomplete results.
>>> - With this patch: code fails with incomplete results (thread-private array):
>>>
>>>   Team 0 Thread 1: dist[0]: 0, 0x7c1e800000a8
>>>   Team 0 Thread 1: dist[1]: 1, 0x7c1e800000ac
>>>
>>>   Team 0 Thread 0: dist[0]: 1, 0x7c1e800000a0
>>>   Team 0 Thread 0: dist[1]: 0, 0x7c1e800000a4
>>>
>>>   Team 0 Thread 0: After reduction into dist[0]: 1
>>>   Team 0 Thread 0: gpu_results address: 0x7c1ebc800000
>>>
>>>   Team 1 Thread 1: dist[0]: 0, 0x7c1e816f27c8
>>>   Team 1 Thread 1: dist[1]: 1, 0x7c1e816f27cc
>>>
>>>   Team 1 Thread 0: dist[0]: 1, 0x7c1e816f27c0
>>>   Team 1 Thread 0: dist[1]: 0, 0x7c1e816f27c4
>>>
>>>   Team 1 Thread 0: After reduction into dist[0]: 1
>>>   Team 1 Thread 0: gpu_results address: 0x7c1ebc800000
>>>
>>>   gpu_results[0]: 1
>>>   gpu_results[1]: 1
>>>   FAIL
>>>
>>> - With this patch and with -fno-openmp-cuda-parallel-target-regions: pass:
>>>
>>>   Team 0 Thread 1: dist[0]: 1, 0x7a5b56000018
>>>   Team 0 Thread 1: dist[1]: 1, 0x7a5b5600001c
>>>
>>>   Team 0 Thread 0: dist[0]: 1, 0x7a5b56000018
>>>   Team 0 Thread 0: dist[1]: 1, 0x7a5b5600001c
>>>
>>>   Team 0 Thread 0: After reduction into dist[0]: 2
>>>   Team 0 Thread 0: gpu_results address: 0x7a5afc800000
>>>
>>>   Team 1 Thread 1: dist[0]: 1, 0x7a5b56000018
>>>   Team 1 Thread 1: dist[1]: 1, 0x7a5b5600001c
>>>
>>>   Team 1 Thread 0: dist[0]: 1, 0x7a5b56000018
>>>   Team 1 Thread 0: dist[1]: 1, 0x7a5b5600001c
>>>
>>>   Team 1 Thread 0: After reduction into dist[0]: 2
>>>   Team 1 Thread 0: gpu_results address: 0x7a5afc800000
>>>
>>>   gpu_results[0]: 2
>>>   gpu_results[1]: 2
>>>   PASS
>>>
>>> I am concerned about team 0 and team 1 having the same address for the dist array here.
>>
>> It is caused by a problem in the runtime. It should work with the `-fno-openmp-cuda-parallel-target-regions` option (I think), since that path uses a different runtime function, and I just wanted to check that it really works. It looks like the runtime currently allocates a unique array for each thread.
>
> Unfortunately, this patch plus the flag does not work for the larger reproducer: the CPU check passes, but the GPU check fails with incorrect results.
> https://github.com/zjin-lcf/oneAPI-DirectProgramming/blob/master/all-pairs-distance-omp/main.cpp

Ok, I'll drop this patch and prepare a more conservative one, where the kernel will be executed in non-SPMD mode instead. Later it will be improved by the LLVM optimizations.
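As a rough usage note for anyone trying to reproduce the two behaviors discussed here: a standalone test along the lines of the sketch above could be built for offloading with something like `clang++ -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda test.cpp`, and then again with `-fno-openmp-cuda-parallel-target-regions` added, to compare the two runtime schemes. The offload triple and optimization level here are assumptions; the AMDGPU runs quoted above would use the corresponding AMDGPU offload target instead.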