https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108086

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Samples: 877K of event 'cycles:u', Event count (approx.): 779618998855          
Overhead       Samples  Command  Shared Object     Symbol                       
  52.64%        461458  cc1plus  cc1plus           [.] ggc_internal_alloc
   3.18%         27836  cc1plus  cc1plus           [.] bitmap_set_bit
   2.66%         23374  cc1plus  cc1plus           [.] hash_table<hash_map<tree_
   2.37%         20813  cc1plus  cc1plus           [.] insert_decl_map
   1.57%         13954  cc1plus  cc1plus           [.] hash_table<hash_map<tree_
   1.55%         13852  cc1plus  [unknown]         [k] 0xffffffffad200b47
   1.30%         11443  cc1plus  cc1plus           [.] copy_bb

 callgraph ipa passes               : 238.97 ( 79%)  31.98 ( 96%) 270.98 ( 81%) 12908M ( 90%)
 integration                        :  80.40 ( 27%)  19.17 ( 57%)  99.91 ( 30%) 11659M ( 81%)
 tree eh                            :  23.57 (  8%)   0.05 (  0%)  23.65 (  7%)   153M (  1%)
 tree operand scan                  : 134.50 ( 45%)  12.74 ( 38%) 146.64 ( 44%)   892M (  6%)

I think this is the "known" issue of always-inline functions calling
always-inline functions, eventually leading to some exponential growth in size
of the callgraph, size estimation and code.  We fail to elide the intermediate
functions early (and re-use the body for the last inline instance).

It's often better to use the flatten attribute on the outermost function
implementing a computation kernel.
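
As a rough sketch of that suggestion (not the reporter's testcase): the helper
names swap_pair, transpose_4x4 and kernel below are made up for illustration,
and the helpers are plain inline functions rather than always_inline ones.
Only the outermost entry point carries the flatten attribute, which asks GCC
to inline every call in its body, recursively, where possible.

```cpp
#include <cstddef>

// Hypothetical helpers standing in for the pieces of a computation kernel.
// They are ordinary inline functions, NOT marked always_inline.
static inline void swap_pair(int *a, int *b)
{
  int t = *a;
  *a = *b;
  *b = t;
}

static inline void transpose_4x4(int *m)
{
  // In-place transpose of a row-major 4x4 matrix.
  for (std::size_t i = 0; i < 4; ++i)
    for (std::size_t j = i + 1; j < 4; ++j)
      swap_pair(&m[i * 4 + j], &m[j * 4 + i]);
}

// Only the outermost function is annotated; GCC then tries to inline
// the whole call tree below it into this one body.
__attribute__((flatten))
void kernel(int *m)
{
  transpose_4x4(m);
}
```

Whether this avoids the blow-up for a given testcase would need to be checked
with -ftime-report; the sketch only shows where the attribute goes.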

A smaller "main" program is

void bar(__m512i *);
void foo(__m512i *input)
{
  __m512i transVecs[64];
  Transpose<0>::_transpose(input, transVecs);
  bar (transVecs);
}

Reducing 64 to 32 (also in the templates) makes it compile almost instantly,
but the time report is still dominated by inlining:

 callgraph ipa passes               :  11.74 ( 84%)   3.00 ( 91%)  14.74 ( 85%)  1617M ( 92%)
 integration                        :   6.88 ( 49%)   1.75 ( 53%)   8.63 ( 50%)  1460M ( 83%)
 tree operand scan                  :   3.04 ( 22%)   1.17 ( 35%)   4.31 ( 25%)   113M (  7%)

I can halve the operand scan time.
