https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108086
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---

Samples: 877K of event 'cycles:u', Event count (approx.): 779618998855
Overhead       Samples  Command  Shared Object  Symbol
  52.64%        461458  cc1plus  cc1plus        [.] ggc_internal_alloc
   3.18%         27836  cc1plus  cc1plus        [.] bitmap_set_bit
   2.66%         23374  cc1plus  cc1plus        [.] hash_table<hash_map<tree_
   2.37%         20813  cc1plus  cc1plus        [.] insert_decl_map
   1.57%         13954  cc1plus  cc1plus        [.] hash_table<hash_map<tree_
   1.55%         13852  cc1plus  [unknown]      [k] 0xffffffffad200b47
   1.30%         11443  cc1plus  cc1plus        [.] copy_bb

 callgraph ipa passes               : 238.97 ( 79%)  31.98 ( 96%) 270.98 ( 81%) 12908M ( 90%)
 integration                        :  80.40 ( 27%)  19.17 ( 57%)  99.91 ( 30%) 11659M ( 81%)
 tree eh                            :  23.57 (  8%)   0.05 (  0%)  23.65 (  7%)   153M (  1%)
 tree operand scan                  : 134.50 ( 45%)  12.74 ( 38%) 146.64 ( 44%)   892M (  6%)

I think this is the "known" issue of always-inline functions calling
always-inline functions, eventually leading to exponential growth in the size
of the callgraph, the size estimation, and the generated code.  We fail to
elide the intermediate functions early (and re-use the body for the last
inline instance).

It's often better to use the flatten attribute on the outermost function
implementing a computation kernel.  A smaller "main" program is

void bar(__m512i *);
void foo(__m512i *input)
{
  __m512i transVecs[64];
  Transpose<0>::_transpose(input, transVecs);
  bar (transVecs);
}

Reducing 64 to 32 (also in the templates) makes it compile almost instantly,
but still:

 callgraph ipa passes               :  11.74 ( 84%)   3.00 ( 91%)  14.74 ( 85%)  1617M ( 92%)
 integration                        :   6.88 ( 49%)   1.75 ( 53%)   8.63 ( 50%)  1460M ( 83%)
 tree operand scan                  :   3.04 ( 22%)   1.17 ( 35%)   4.31 ( 25%)   113M (  7%)

I can halve the operand scan time.