https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Which function does the loop kernel reside in?  I see you have some
lambdas in Z4c_RHS, done fancy as out-of-line functions, that do look like
they could comprise the actual kernels.

In apply_upwind_diss I see cases without stack usage.  I'm looking at
-O2 -march=skylake compiles.

Note that with C++ it's easy to retain some abstraction and thus misinterpret
stack accesses as spilling where they are aggregates not eliminated.  For
example in one of the lambdas I see

  _61489 = __builtin_ia32_maskloadpd256 (_104487, _61513);
  D.545024[1].elts.car = _61489;
...
  MEM[(struct vect *)&D.544982].elts._M_elems[1] = MEM[(const struct simd &)&D.545024 + 32];
...
  MEM[(struct mat3 *)&vars + 992B] = MEM[(const struct mat3 &)&D.544982];

and D.544982 is later variable indexed in some MIN/MAX, FMA using code
(instead of using 'vars' there).

Looking at what -fdump-tree-optimized produces is sometimes pointing at
problems.

That said, the code is large so please point at some source lines within
the important kernel(s) (of the preprocessed source, that is) and the
compile options used.
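
To illustrate the point about aggregates that are not eliminated, here is a minimal sketch. It is not taken from the reporter's code: vec3, pick_dynamic and sum_static are hypothetical names made up for this example. The assumption is that a by-value temporary indexed with a runtime value is harder for the compiler to split into scalars, so it may stay in stack memory, while a temporary accessed only with constant indices can usually be kept in registers. Stack traffic from the first kind can look like register spilling in the assembly even though it is not.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the small aggregate types seen in the dump.
struct vec3 {
    std::array<double, 3> elts;
};

// The temporary is indexed with a runtime value.  The compiler may have
// to keep 'tmp' in memory (on the stack) to perform the indexed load.
double pick_dynamic(const double *p, std::size_t i) {
    vec3 tmp{{{p[0], p[1], p[2]}}};
    return tmp.elts[i % 3];
}

// The temporary is only accessed with constant indices.  It can usually
// be scalarized, so no stack slot is needed for it.
double sum_static(const double *p) {
    vec3 tmp{{{p[0], p[1], p[2]}}};
    return tmp.elts[0] + tmp.elts[1] + tmp.elts[2];
}
```

Compiling this with -O2 -march=skylake -fdump-tree-optimized and comparing the two functions in the resulting dump should show whether 'tmp' survives as a memory object in either of them. The outcome depends on the GCC version, so treat the comments above as expectations, not guarantees.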