https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99785
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org
            Version|unknown                     |11.0

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
Did anybody check the actual output from clang as to whether it performs
the desired optimizations?  I only have clang 9 around and that rejects
the TU (maybe there are clang-specific code paths and the preprocessed
source is not representative here).

Inlining blend_pixels without first constant propagating 'blend_key'
(I suppose on all call paths that's eventually supposed to be constant
propagated somehow?) looks quite stupid given the large switch.  Sure,
saving %xmm registers around calls can have a cost but trashing the
icache should be worse.

If all of this is auto-generated the auto-generation might also be able
to improve the blend_key dispatch.

Another strategy might be to not put always_inline on everything
(because that in turn will cause exponential growth) but instead inline
everything into the finally important function(s) via 'flatten'.  That
is, you do something like

static __attribute__((always_inline)) inline void large_leaf () { /* large */ }
static __attribute__((always_inline)) inline void inter1 () { large_leaf (); }
static __attribute__((always_inline)) inline void inter2 () { inter1 (); inter1 (); }
static __attribute__((always_inline)) inline void inter3 () { inter2 (); inter2 (); }

and what you get is (intermediate) 8 copies of the large_leaf body.
That is because we inline expand from the leaves rather than first
inlining the small always-inline wrappers (and throwing them away
before inlining into them).

I suppose we could try to not inline into always-inline functions at
the expense of needing to iterate on inlined always-inline bodies.  Or
somehow at least delay inlining large bodies into always-inline bodies.

Anyway, marking such large functions as always-inline is asking for
trouble.
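
For illustration, the 'flatten' variant suggested above might look roughly
like the sketch below.  It reuses the large_leaf/inter1..inter3 names from
the snippet; the 'key' parameter, the tiny switch body and the two entry
points hot_entry_add/hot_entry_double are hypothetical stand-ins and are not
taken from the testcase.  The helpers are plain 'static inline', and only
the externally visible entry points carry the attribute, so the whole call
tree is inlined once per entry point.  Whether the switch is then folded for
the constant key is up to the optimizer and is not guaranteed.

```c
#include <stdint.h>

/* Placeholder for a large leaf such as blend_pixels.  The real body would
   be a big switch; this one is deliberately tiny (assumption). */
static inline uint32_t large_leaf (unsigned key, uint32_t v)
{
  switch (key)
    {
    case 0:  return v + 1u;   /* hypothetical "add" mode */
    case 1:  return v * 2u;   /* hypothetical "double" mode */
    default: return v;
    }
}

/* Small wrappers: no always_inline here, so no copies of large_leaf are
   forced into these bodies. */
static inline uint32_t inter1 (unsigned key, uint32_t v)
{
  return large_leaf (key, v);
}

static inline uint32_t inter2 (unsigned key, uint32_t v)
{
  return inter1 (key, inter1 (key, v));
}

static inline uint32_t inter3 (unsigned key, uint32_t v)
{
  return inter2 (key, inter2 (key, v));
}

/* The "finally important" functions.  With flatten, calls in the body are
   inlined, including the calls that inlining exposes, where possible.
   The key is a compile-time constant here, which gives the optimizer the
   chance to drop the unused switch cases. */
__attribute__((flatten)) uint32_t hot_entry_add (uint32_t v)
{
  return inter3 (0, v);   /* four leaf applications: v + 4 */
}

__attribute__((flatten)) uint32_t hot_entry_double (uint32_t v)
{
  return inter3 (1, v);   /* four leaf applications: v * 16 */
}
```

Comparing the assembly for hot_entry_add with and without the attribute
(for example with -O2 -S) is one way to check what the inliner actually did
for a given compiler version.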