https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98321
--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
Ok, let's first make a runnable test-case:
...
$ cat src/libgomp/testsuite/libgomp.oacc-c/test.c
#include <stdio.h>

#define TYPE float

TYPE a = 1;
TYPE b = 2;

int
main (void)
{
  printf ("A: %f\n", a);

#pragma acc parallel num_gangs (1) num_workers (1) copy (a, b)
#pragma acc atomic update
  a += b;

  printf ("A: %f\n", a);

  return !(a == 3);
}
...

Indeed we see the cas, but that has nothing to do with support in the nvptx
port:
...
atom.cas.b32 %r29, [%r25], %r22, %r28;
...

This appears already at ompexp on the host, where we expand:
...
  #pragma omp atomic_load relaxed
    D.2555 = *D.2568

  <bb 4> :
  D.2557 = D.2555 + b.1;
  #pragma omp atomic_store relaxed (D.2557)
...
into:
...
  D.2583 = __atomic_load_4 (D.2582, 0);
  D.2584 = D.2583;

  <bb 4> :
  D.2585 = VIEW_CONVERT_EXPR<float>(D.2584);
  D.2586 = D.2585 + b.1;
  D.2587 = VIEW_CONVERT_EXPR<unsigned int>(D.2586);
  D.2588 = __sync_val_compare_and_swap_4 (D.2582, D.2584, D.2587);
...

This is part of a generic problem with offloading, where choices are made in
the host compiler which are suboptimal or even unsupported in the offload
compiler.

Ideally this should be addressed in the host compiler.

It may be possible to address this in the nvptx port by trying to detect the
suboptimal pattern and converting it to the optimal atom.add.f32. But
ultimately that's a workaround, and it's better to fix this at the source.
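
For readers less familiar with the expanded form, the following is a rough C sketch of what the ompexp output above amounts to: load the 32-bit word, reinterpret it as float, add, reinterpret back, and retry with compare-and-swap until no other thread has changed the word in between. It is only an illustration, not the code GCC generates. The function and helper names (atomic_add_float_cas, float_to_bits, bits_to_float) are hypothetical, and the sketch assumes the float is stored as its bit pattern in a uint32_t, so that the GCC __atomic_load_n and __sync_val_compare_and_swap builtins operate on an integer object.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration; not part of GCC or libgomp.  */
static_assert (sizeof (float) == sizeof (uint32_t),
               "this sketch assumes a 32-bit float");

/* Counterpart of VIEW_CONVERT_EXPR<unsigned int>: reinterpret the bits.  */
uint32_t
float_to_bits (float f)
{
  uint32_t u;
  memcpy (&u, &f, sizeof u);
  return u;
}

/* Counterpart of VIEW_CONVERT_EXPR<float>: reinterpret the bits.  */
float
bits_to_float (uint32_t u)
{
  float f;
  memcpy (&f, &u, sizeof f);
  return f;
}

/* Atomically add VAL to the float whose bit pattern is stored in *SLOT,
   using a compare-and-swap retry loop.  Returns the float value that was
   stored before the addition.  */
float
atomic_add_float_cas (uint32_t *slot, float val)
{
  /* Corresponds to the __atomic_load_4 (..., 0) call, i.e. a relaxed load.  */
  uint32_t expected = __atomic_load_n (slot, __ATOMIC_RELAXED);

  for (;;)
    {
      float old = bits_to_float (expected);
      uint32_t desired = float_to_bits (old + val);

      /* Corresponds to __sync_val_compare_and_swap_4: store DESIRED only if
         *SLOT still holds EXPECTED, and return what was actually there.  */
      uint32_t seen = __sync_val_compare_and_swap (slot, expected, desired);
      if (seen == expected)
        return old;

      /* Another thread updated the word; retry with the value we saw.  */
      expected = seen;
    }
}
```

With the slot initialised to the bits of 1.0f, calling atomic_add_float_cas (&slot, 2.0f) leaves the bits of 3.0f in the slot, matching the test-case above. On a target with a native floating-point atomic add, such as atom.add.f32 on nvptx, the whole loop could in principle be a single instruction, which is why the cas form is suboptimal there.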