https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108086
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Samples: 877K of event 'cycles:u', Event count (approx.): 779618998855
Overhead  Samples  Command  Shared Object  Symbol
  52.64%   461458  cc1plus  cc1plus        [.] ggc_internal_alloc
   3.18%    27836  cc1plus  cc1plus        [.] bitmap_set_bit
   2.66%    23374  cc1plus  cc1plus        [.] hash_table<hash_map<tree_
   2.37%    20813  cc1plus  cc1plus        [.] insert_decl_map
   1.57%    13954  cc1plus  cc1plus        [.] hash_table<hash_map<tree_
   1.55%    13852  cc1plus  [unknown]      [k] 0xffffffffad200b47
   1.30%    11443  cc1plus  cc1plus        [.] copy_bb
 callgraph ipa passes  : 238.97 ( 79%)  31.98 ( 96%) 270.98 ( 81%) 12908M ( 90%)
 integration           :  80.40 ( 27%)  19.17 ( 57%)  99.91 ( 30%) 11659M ( 81%)
 tree eh               :  23.57 (  8%)   0.05 (  0%)  23.65 (  7%)   153M (  1%)
 tree operand scan     : 134.50 ( 45%)  12.74 ( 38%) 146.64 ( 44%)   892M (  6%)
I think this is the "known" issue of always-inline functions calling
always-inline functions, which eventually leads to exponential growth in the
callgraph, in the size estimation, and in the generated code.  We fail to
elide the intermediate functions early (and to re-use the body for the last
inline instance).
It's often better to put the flatten attribute on the outermost function
implementing a computation kernel.
A smaller "main" program is

#include <immintrin.h>  // for __m512i

void bar(__m512i *);
void foo(__m512i *input)
{
  __m512i transVecs[64];
  Transpose<0>::_transpose(input, transVecs);  // Transpose<> as in the original testcase
  bar (transVecs);
}
Reducing 64 to 32 (also in the templates) makes it compile almost instantly,
but still:
 callgraph ipa passes  :  11.74 ( 84%)   3.00 ( 91%)  14.74 ( 85%)  1617M ( 92%)
 integration           :   6.88 ( 49%)   1.75 ( 53%)   8.63 ( 50%)  1460M ( 83%)
 tree operand scan     :   3.04 ( 22%)   1.17 ( 35%)   4.31 ( 25%)   113M (  7%)
I can halve the operand scan time.