https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803
--- Comment #12 from Yann Droneaud <yann at droneaud dot fr> ---
(In reply to Yann Droneaud from comment #8)
> Created attachment 46903 [details]
> An artificial test case for gcc to emit 17 calls to __tls_get_addr()
>

It's possible to work around this issue with an __asm__ statement such as
OPTIMIZER_HIDE_VAR() from the Linux kernel, see
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/compiler.h?h=v5.3#n169

    __asm__ ("" : "=r" (var) : "0" (var))

As a baseline, an artificial benchmark using attachment #46903 in an ELF
shared library gives the following results when run through perf:

 Performance counter stats for './case' (100 runs):

            932,83 msec task-clock:u         #    0,999 CPUs utilized     ( +-  0,38% )
                 0      context-switches:u   #    0,000 K/sec
                 0      cpu-migrations:u     #    0,000 K/sec
                51      page-faults:u        #    0,054 K/sec             ( +-  0,19% )
     2 528 214 520      cycles:u             #    2,710 GHz               ( +-  0,30% )
     4 150 148 436      instructions:u       #    1,64  insn per cycle    ( +-  0,00% )
     1 530 031 707      branches:u           # 1640,205 M/sec             ( +-  0,00% )
            51 236      branch-misses:u      #    0,00% of all branches   ( +-  7,24% )

           0,93364 +- 0,00359 seconds time elapsed  ( +-  0,38% )

Applying the following change to attachment #46903:

--- case.c~     2019-09-20 15:39:41.852356614 +0200
+++ case.c      2019-09-25 21:38:47.620696710 +0200
@@ -1,8 +1,15 @@
 int process(int *);

+#define HIDE_VAR(v) __asm__ ("" : "=r" (v) : "0" (v))
+
 static int *state(void)
 {
-       static __thread int s[16];
+       static __thread int __s[16];
+       int *s = __s;
+
+       HIDE_VAR(s);
+
        return s;
 }

and running the benchmark again through perf, the results are:

 Performance counter stats for './case' (100 runs):

            540,54 msec task-clock:u         #    0,999 CPUs utilized     ( +-  0,22% )
                 0      context-switches:u   #    0,000 K/sec
                 0      cpu-migrations:u     #    0,000 K/sec
                50      page-faults:u        #    0,093 K/sec             ( +-  0,17% )
     1 469 494 111      cycles:u             #    2,719 GHz               ( +-  0,12% )
     1 530 148 010      instructions:u       #    1,04  insn per cycle    ( +-  0,00% )
       730 031 281      branches:u           # 1350,554 M/sec             ( +-  0,00% )
            74 613      branch-misses:u      #    0,01% of all branches   ( +- 18,61% )

           0,54102 +- 0,00119 seconds time elapsed  ( +-  0,22% )

With the workaround, the benchmark took 42% less time (932,83 msec down to
540,54 msec of task-clock), which makes it about 1.73x faster.
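
For readers without access to the attachment, here is a minimal standalone
sketch of the same workaround. It is not the attached test case: the names
tls_buf and fill_and_sum() are made up for illustration, and only the
HIDE_VAR() macro and the shape of state() come from the change above. The
empty asm claims to read and rewrite the pointer, so the compiler has to
treat the result as an opaque register value instead of something it can
recompute from the TLS symbol at each use.

```c
#include <stddef.h>

/* Same constraint trick as OPTIMIZER_HIDE_VAR(): the asm body is empty, but
 * the "=r"/"0" pair tells the compiler v is consumed and produced here. */
#define HIDE_VAR(v) __asm__ ("" : "=r" (v) : "0" (v))

/* Hypothetical per-thread buffer, standing in for s[16] in the test case. */
static __thread int tls_buf[16];

static int *state(void)
{
    int *s = tls_buf;   /* TLS address is computed once here */

    HIDE_VAR(s);        /* after this, s is just a value in a register */

    return s;
}

/* Hypothetical caller that uses the returned pointer many times. */
int fill_and_sum(void)
{
    int *s = state();
    int sum = 0;

    for (size_t i = 0; i < 16; i++)
        s[i] = (int)i;

    for (size_t i = 0; i < 16; i++)
        sum += s[i];

    return sum;
}
```

To see whether it makes a difference for a given compiler, one option is to
build this as position-independent code (for example with -O2 -fPIC) with and
without the HIDE_VAR() line and compare how often __tls_get_addr appears in
the generated assembly. The numbers will depend on the GCC version and target.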