https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88940
Bug ID: 88940 Summary: [openacc, libgomp] cuModuleLoadData error for asynchronous parallel with abort Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: vries at gcc dot gnu.org Target Milestone: --- Consider abort-6.c (the async version of abort-1.c): ... /* { dg-do run } */ #include <stdio.h> #include <stdlib.h> int main (void) { fprintf (stderr, "CheCKpOInT\n"); #pragma acc parallel async { abort (); } #pragma acc wait return 0; } /* { dg-output "CheCKpOInT" } */ /* { dg-shouldfail "" } */ ... It passes at O2 like this: ... CheCKpOInT libgomp: cuModuleLoadData error: an illegal instruction was encountered libgomp: Cannot map target functions or variables (expected 1, have 4294967295) PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/abort-6.c \ -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -O2 execution test ... Normally, the compiler can optimize away the code following an abort, because the abort is a noreturn call. In the case of an async parallel however, the parallel region is executed asynchronously, that is, the expected behaviour is: - the kernel representing the parallel region is launched - the code launching the kernel does not wait for the launch to start or finish, - the code following the parallel region is executed So, in the case of the test-case above, we can eliminate all code after the abort inside the parallel async region, but we can't eliminated any code after the parallel async region. The compiler presently does precisely that: it removes the code after the parallel async region. At eh, we have: ... main () { int D.3104; stderr.0_1 = stderr; __builtin_fwrite ("CheCKpOInT\n", 1, 11, stderr.0_1); #pragma omp target oacc_parallel async(-1) [child fn: main._omp_fn.0 (???)] abort (); #pragma omp return __builtin_GOACC_wait (-2, 0); D.3104 = 0; goto <D.3109>; D.3104 = 0; goto <D.3109>; <D.3109>: return D.3104; } ... and at cfg, we have: ... main () { int D.3104; <bb 2> : stderr.0_1 = stderr; __builtin_fwrite ("CheCKpOInT\n", 1, 11, stderr.0_1); #pragma omp target oacc_parallel async(-1) [child fn: main._omp_fn.0 (???)] <bb 3> : abort (); } ... And at optimized: ... main () { struct _IO_FILE * stderr.0_1; <bb 2> [local count: 1073741824]: stderr.0_1 = stderr; fwrite ("CheCKpOInT\n", 1, 11, stderr.0_1); GOACC_parallel_keyed (-1, main._omp_fn.0, 0, 0B, 0B, 0B, 536936447, -1, 0); __builtin_unreachable (); } ... The __builtin_unreachable seems to expand to nothing, so at runtime, when executing main, we return from GOACC_parallel_keyed, step out of main, and start executing the code after main, which happens to be _start, which eventually leads to cuModuleLoadData picking up the result of the abort, which leads to the "Cannot map" error message. This problem is encountered for the nvptx target, but I think this is a generic openacc problem.