https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803

--- Comment #12 from Yann Droneaud <yann at droneaud dot fr> ---
(In reply to Yann Droneaud from comment #8)
> Created attachment 46903 [details]
> An artificial test case for gcc to emit 17 calls to __tls_get_addr()
>

It's possible to work around this issue with an __asm__ statement such as
OPTIMIZER_HIDE_VAR() from the Linux kernel, see
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/compiler.h?h=v5.3#n169

   __asm__ ("" : "=r" (var) : "0" (var))
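To make the pattern concrete, here is a minimal, self-contained sketch of how such a statement can be applied to a pointer to thread-local storage. The names tls_buffer() and fill_and_sum() are hypothetical and only for illustration. It relies on GNU C extensions (__thread, extended asm), so it assumes GCC or a compatible compiler. The comment about why this helps is my understanding of the mechanism, not something verified in the compiler's output here.

```c
/* Same constraint pattern as the statement above: the empty asm is declared
 * to read v and write it back to the same register. The value is unchanged,
 * but the compiler can no longer see where it came from. */
#define HIDE_VAR(v) __asm__ ("" : "=r" (v) : "0" (v))

/* Hypothetical accessor returning the address of a thread-local array. */
static int *tls_buffer(void)
{
    static __thread int buf[16];
    int *p = buf;

    /* After this point p is opaque to the optimizer. Presumably this keeps
     * the compiler from recomputing the TLS address at each use once the
     * accessor has been inlined into its callers. */
    HIDE_VAR(p);

    return p;
}

/* Hypothetical caller: fetch the address once, then reuse the pointer. */
static int fill_and_sum(void)
{
    int *p = tls_buffer();
    int sum = 0;

    for (int i = 0; i < 16; i++) {
        p[i] = i;
        sum += p[i];
    }

    return sum;
}
```

The asm emits no instructions. It only constrains what the optimizer may assume about the pointer, so the program's behaviour is unchanged.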

As a baseline, an artificial benchmark using attachment #46903 in an ELF shared
library gives the following results when run through perf:

 Performance counter stats for './case' (100 runs):

        932,83 msec task-clock:u       #    0,999 CPUs utilized   ( +-  0,38% )
             0      context-switches:u #    0,000 K/sec                  
             0      cpu-migrations:u   #    0,000 K/sec                  
            51      page-faults:u      #    0,054 K/sec           ( +-  0,19% )
 2 528 214 520      cycles:u           #    2,710 GHz             ( +-  0,30% )
 4 150 148 436      instructions:u     #    1,64  insn per cycle  ( +-  0,00% )
 1 530 031 707      branches:u         # 1640,205 M/sec           ( +-  0,00% )
        51 236      branch-misses:u    #    0,00% of all branches ( +-  7,24% )

       0,93364 +- 0,00359 seconds time elapsed  ( +-  0,38% )

Applying the following change to attachment #46903:

  --- case.c~   2019-09-20 15:39:41.852356614 +0200
  +++ case.c    2019-09-25 21:38:47.620696710 +0200
  @@ -1,8 +1,15 @@
   int process(int *);

  +#define HIDE_VAR(v) __asm__ ("" : "=r" (v) : "0" (v))
  +
   static int *state(void)
   {
  -     static __thread int s[16];
  +     static __thread int __s[16];
  +     int *s = __s;
  +
  +     HIDE_VAR(s);
  +
        return s;
   }

Running the benchmark again through perf with this change gives:

 Performance counter stats for './case' (100 runs):

        540,54 msec task-clock:u       #    0,999 CPUs utilized   ( +-  0,22% )
             0      context-switches:u #    0,000 K/sec                  
             0      cpu-migrations:u   #    0,000 K/sec                  
            50      page-faults:u      #    0,093 K/sec           ( +-  0,17% )
 1 469 494 111      cycles:u           #    2,719 GHz             ( +-  0,12% )
 1 530 148 010      instructions:u     #    1,04  insn per cycle  ( +-  0,00% )
   730 031 281      branches:u         # 1350,554 M/sec           ( +-  0,00% )
        74 613      branch-misses:u    #    0,01% of all branches ( +- 18,61% )

       0,54102 +- 0,00119 seconds time elapsed  ( +-  0,22% )

With the workaround, the benchmark took 42% less time. It's 1.72x faster.
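For reference, the ratios can be recomputed from the two perf reports above. The struct and helper names below are hypothetical; the numbers are copied from the reports, with the decimal comma converted to a decimal point.

```c
/* Figures copied from the two perf runs quoted in this comment. */
struct perf_run {
    double elapsed_s;     /* "seconds time elapsed" */
    double cycles;        /* cycles:u */
    double instructions;  /* instructions:u */
};

static const struct perf_run baseline   = { 0.93364, 2528214520.0, 4150148436.0 };
static const struct perf_run workaround = { 0.54102, 1469494111.0, 1530148010.0 };

/* Relative reduction in percent, e.g. "took N% less time". */
static double reduction_pct(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* Ratio before/after, e.g. "N times faster". */
static double speedup(double before, double after)
{
    return before / after;
}
```

On elapsed time this gives a reduction of about 42% and a ratio of about 1.73. On cycles the ratio is about 1.72. The instruction count drops by about 63%.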
