https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912
--- Comment #4 from Erik Schnetter <schnetter at gmail dot com> ---
I build with the compiler options

/Users/eschnett/src/CarpetX/Cactus/view-compilers/bin/g++ -fopenmp -Wall -pipe -g -march=skylake -std=gnu++17 -O3 -fcx-limited-range -fexcess-precision=fast -fno-math-errno -fno-rounding-math -fno-signaling-nans -funsafe-math-optimizations -c -o configs/sim/build/Z4c/rhs.cxx.o configs/sim/build/Z4c/rhs.cxx.ii

One of the kernels in question (the one I describe above) is the C++ lambda in lines 281013 to 281119. The call to the "noinline" function ensures that the kernel (and surrounding for loops) is compiled as a separate function, which produces more efficient code.

The function "grid.loop_int_device" contains essentially three nested for loops, and the actual kernel is the C++ lambda in lines 281015 to 281118.

I'll have a look at -fdump-tree-optimized.
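
For readers without the preprocessed source, the structure described above can be sketched roughly as follows. This is only an illustration, not the CarpetX code: the names loop_int, run_kernel and sum_of_indices are hypothetical, the kernel body is a placeholder, and __attribute__((noinline)) is assumed as the way the "noinline" function is declared.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for grid.loop_int_device: three nested for loops
// that call the kernel once per grid point.
template <typename F>
inline void loop_int(int ni, int nj, int nk, const F &kernel) {
  for (int k = 0; k < nk; ++k)
    for (int j = 0; j < nj; ++j)
      for (int i = 0; i < ni; ++i)
        kernel(i, j, k);
}

// Hypothetical "noinline" wrapper: the loops and the lambda can be inlined
// into this function, but this function itself is not inlined into its
// caller, so loops plus kernel are compiled as one separate function.
template <typename F>
__attribute__((noinline)) void run_kernel(int ni, int nj, int nk,
                                          const F &kernel) {
  loop_int(ni, nj, nk, kernel);
}

// Placeholder kernel: stores i + j + k at each point, then sums the array.
double sum_of_indices(int ni, int nj, int nk) {
  std::vector<double> out(static_cast<std::size_t>(ni) * nj * nk);
  run_kernel(ni, nj, nk, [&](int i, int j, int k) {
    const std::size_t idx = (static_cast<std::size_t>(k) * nj + j) * ni + i;
    out[idx] = i + j + k;
  });
  double sum = 0.0;
  for (double v : out)
    sum += v;
  return sum;
}
```

Calling sum_of_indices(2, 2, 2) returns 12. Adding -fdump-tree-optimized to a g++ command line like the one above makes GCC write the final GIMPLE of each function to a dump file next to the output, which is where a separately compiled function such as run_kernel would show up.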