https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
E.g. the builtin is often used in a loop where the user does his own atomic
load first and decides what to do based on that.

Say for

  float f;

  void
  foo ()
  {
    #pragma omp atomic
    f += 3.0f;
  }

with -O2 -fopenmp we emit:

  D.2113 = &f;
  D.2115 = __atomic_load_4 (D.2113, 0);
  D.2114 = D.2115;

  <bb 3> :
  D.2112 = VIEW_CONVERT_EXPR<float>(D.2114);
  _1 = D.2112 + 3.0e+0;
  D.2116 = VIEW_CONVERT_EXPR<unsigned int>(_1);
  D.2117 = .ATOMIC_COMPARE_EXCHANGE (D.2113, D.2114, D.2116, 4, 0, 0);
  D.2118 = REALPART_EXPR <D.2117>;
  D.2119 = D.2114;
  D.2114 = D.2118;
  if (D.2118 != D.2119)
    goto <bb 3>; [0.00%]
  else
    goto <bb 4>; [100.00%]

  <bb 4> :
  return;

which is essentially

  void
  foo ()
  {
    int x = __atomic_load_4 ((int *) &f, __ATOMIC_RELAXED), y;
    float g;
    do
      {
        __builtin_memcpy (&g, &x, 4);
        g += 3.0f;
        __builtin_memcpy (&y, &g, 4);
      }
    while (!__atomic_compare_exchange_n ((int *) &f, &x, y, false,
                                         __ATOMIC_RELAXED, __ATOMIC_RELAXED));
  }

Can you explain how your proposed change would improve this? It would just
slow it down and make it larger.
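
For anyone who wants to compile the pattern outside of OpenMP, here is a minimal standalone sketch of the same loop, assuming GCC or Clang `__atomic` builtins and a 32-bit `float`. The helper name `atomic_add_float` is hypothetical and is not something GCC emits or provides. It does one relaxed load before the loop; when the compare-exchange fails, the builtin writes the value it found back into `expected`, so the loop retries without issuing another separate load.

```c
#include <stdbool.h>
#include <string.h>

float f;  /* shared variable, as in the example above */

/* Hypothetical helper: atomically add INC to *P with a compare-and-swap
   loop on the 32-bit representation.  Returns the value that was stored.
   Like the original example, it accesses the float through an integer
   pointer and assumes sizeof (float) == sizeof (unsigned int).  */
static float
atomic_add_float (float *p, float inc)
{
  /* The user's own initial load; its result seeds the first CAS attempt.  */
  unsigned int expected
    = __atomic_load_n ((unsigned int *) p, __ATOMIC_RELAXED);
  unsigned int desired;
  float g;

  do
    {
      memcpy (&g, &expected, sizeof g);        /* bits -> float */
      g += inc;
      memcpy (&desired, &g, sizeof desired);   /* float -> bits */
      /* On failure the builtin stores the current contents of *p into
         EXPECTED, so the next iteration starts from fresh data.  */
    }
  while (!__atomic_compare_exchange_n ((unsigned int *) p, &expected,
                                       desired, false,
                                       __ATOMIC_RELAXED, __ATOMIC_RELAXED));
  return g;
}

void
foo (void)
{
  atomic_add_float (&f, 3.0f);
}
```

Calling `foo ()` once on a zero-initialized `f` leaves `f` equal to `3.0f`.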