https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99386
--- Comment #4 from Milian Wolff <mail at milianw dot de> ---
Ah, but LTO only helps with the variant that contains a single type. The
variant with two types remains very slow.
Variant with a single type:
```
 Performance counter stats for './variant 1' (5 runs):
            264.14 msec task-clock                #    0.999 CPUs utilized              ( +-  0.13% )
                 0      context-switches          #    0.001 K/sec                      ( +-100.00% )
                 0      cpu-migrations            #    0.000 K/sec
               380      page-faults               #    0.001 M/sec                      ( +-  0.13% )
     1,182,582,454      cycles                    #    4.477 GHz                        ( +-  0.06% )  (62.52%)
           634,015      stalled-cycles-frontend   #    0.05% frontend cycles idle       ( +-  3.72% )  (62.52%)
     1,044,218,220      stalled-cycles-backend    #   88.30% backend cycles idle        ( +-  0.16% )  (62.52%)
     1,187,317,899      instructions              #    1.00  insn per cycle
                                                  #    0.88  stalled cycles per insn    ( +-  0.11% )  (62.52%)
       132,470,519      branches                  #  501.512 M/sec                      ( +-  0.09% )  (62.53%)
             2,967      branch-misses             #    0.00% of all branches            ( +-  7.80% )  (62.47%)
       788,740,131      L1-dcache-loads           # 2986.044 M/sec                      ( +-  0.16% )  (62.47%)
        16,466,669      L1-dcache-load-misses     #    2.09% of all L1-dcache accesses  ( +-  0.16% )  (62.46%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
          0.264412 +- 0.000379 seconds time elapsed  ( +-  0.14% )
```
The above measurements are in the same ballpark as the no-variant baseline
without LTO. But check out the following for a variant with two types:
```
 Performance counter stats for './variant 2' (5 runs):
          1,807.01 msec task-clock                #    1.000 CPUs utilized              ( +-  0.04% )
                 4      context-switches          #    0.002 K/sec                      ( +- 11.59% )
                 0      cpu-migrations            #    0.000 K/sec                      ( +- 61.24% )
               383      page-faults               #    0.212 K/sec                      ( +-  0.27% )
     8,093,139,812      cycles                    #    4.479 GHz                        ( +-  0.01% )  (62.35%)
         1,393,308      stalled-cycles-frontend   #    0.02% frontend cycles idle       ( +-  5.84% )  (62.52%)
     7,257,955,665      stalled-cycles-backend    #   89.68% backend cycles idle        ( +-  0.08% )  (62.62%)
     4,728,542,717      instructions              #    0.58  insn per cycle
                                                  #    1.53  stalled cycles per insn    ( +-  0.02% )  (62.65%)
       395,189,246      branches                  #  218.698 M/sec                      ( +-  0.02% )  (62.65%)
            17,570      branch-misses             #    0.00% of all branches            ( +- 12.38% )  (62.55%)
     3,806,321,294      L1-dcache-loads           # 2106.424 M/sec                      ( +-  0.02% )  (62.39%)
        16,753,910      L1-dcache-load-misses     #    0.44% of all L1-dcache accesses  ( +-  0.11% )  (62.28%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
          1.807335 +- 0.000776 seconds time elapsed  ( +-  0.04% )
```
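Again, performance suffers dramatically: the two-type variant takes about 6.8x
the time (1.807 s vs. 0.264 s) and executes about 4x the instructions of the
single-type variant.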
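
To put numbers on the gap between the two runs, here is a small sketch that computes the ratios from the counters quoted above. The struct and function names (`PerfSample`, `ratio`, `ipc`, `print_slowdown`) are made up for this illustration; only the counter values come from the `perf stat` output. It is not the benchmark itself, whose source is not part of this comment.

```cpp
#include <cassert>
#include <cstdio>

// Counter values copied by hand from the two perf stat runs above.
// PerfSample is a hypothetical helper type, not part of any tool.
struct PerfSample {
    double task_clock_msec;
    double cycles;
    double instructions;
    double l1_dcache_loads;
};

constexpr PerfSample single_type{264.14, 1182582454.0, 1187317899.0, 788740131.0};
constexpr PerfSample two_types{1807.01, 8093139812.0, 4728542717.0, 3806321294.0};

// How many times larger `slow` is than `fast`.
inline double ratio(double slow, double fast) {
    assert(fast > 0.0);
    return slow / fast;
}

// Instructions per cycle for one run.
inline double ipc(const PerfSample& s) {
    return ratio(s.instructions, s.cycles);
}

// Print the slowdown of `slow` relative to `fast` for each counter.
inline void print_slowdown(const PerfSample& fast, const PerfSample& slow) {
    std::printf("task-clock:      %.2fx\n", ratio(slow.task_clock_msec, fast.task_clock_msec));
    std::printf("cycles:          %.2fx\n", ratio(slow.cycles, fast.cycles));
    std::printf("instructions:    %.2fx\n", ratio(slow.instructions, fast.instructions));
    std::printf("L1-dcache-loads: %.2fx\n", ratio(slow.l1_dcache_loads, fast.l1_dcache_loads));
    std::printf("IPC:             %.2f -> %.2f\n", ipc(fast), ipc(slow));
}
```

Calling `print_slowdown(single_type, two_types)` from `main` shows roughly 6.8x the task-clock and cycles, about 4x the instructions, and about 4.8x the L1 data cache loads. So the two-type case both executes more instructions and retires them at a lower rate.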