[Bug target/86952] Avoid jump table for switch statement with -mindirect-branch=thunk

2019-03-01 Thread daniel at iogearbox dot net
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952

Daniel Borkmann  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |daniel at iogearbox dot net

--- Comment #12 from Daniel Borkmann  ---
I've been looking into this issue recently and improved the benchmark
tool a bit along the way. Several ways of traversing the switch cases need
to be considered; the case here is a round-robin walk, but additional
distributions / tests could be added. Pushed here just in case:
https://github.com/borkmann/microbenchmark

Numbers I'm getting are stable:

* Xeon E3-1240, packet.net c1.small.x86 instance:

 # make prep
 [...]
 # make
 gcc -g -I. -O2   -c -o test.o test.c
 gcc -g -I. -O2 -mindirect-branch=thunk --param=case-values-threshold=20   -c -o switch-no-table.o switch-no-table.c
 gcc -g -I. -O2 -mindirect-branch=thunk   -c -o switch.o switch.c
 gcc -g -I. -O2   -c -o switch-no-retpol.o switch-no-retpol.c
 gcc -o test test.o switch-no-table.o switch.o switch-no-retpol.o
 taskset 1 ./test
 no retpoline :  6098325270
 no jump table:  6298192058 (no retpoline: 103.28%)
 jump table   : 22081802856 (no retpoline: 362.10%, no jump table: 350.61%)
 # make
 taskset 1 ./test
 no retpoline :  6098439816
 no jump table:  6298242270 (no retpoline: 103.28%)
 jump table   : 22107872854 (no retpoline: 362.52%, no jump table: 351.02%)
 # make
 taskset 1 ./test
 no retpoline :  6098187038
 no jump table:  6298308128 (no retpoline: 103.28%)
 jump table   : 22071053524 (no retpoline: 361.93%, no jump table: 350.43%)

* Xeon Gold 5120, packet.net m2.xlarge.x86 instance:

 # make prep
 [...]
 # make
 gcc -g -I. -O2   -c -o test.o test.c
 gcc -g -I. -O2 -mindirect-branch=thunk --param=case-values-threshold=20   -c -o switch-no-table.o switch-no-table.c
 gcc -g -I. -O2 -mindirect-branch=thunk   -c -o switch.o switch.c
 gcc -g -I. -O2   -c -o switch-no-retpol.o switch-no-retpol.c
 gcc -o test test.o switch-no-table.o switch.o switch-no-retpol.o
 taskset 1 ./test
 no retpoline :  5450356814
 no jump table:  5620673036 (no retpoline: 103.12%)
 jump table   : 21448285314 (no retpoline: 393.52%, no jump table: 381.60%)
 # make
 taskset 1 ./test
 no retpoline :  5450356100
 no jump table:  5620678302 (no retpoline: 103.12%)
 jump table   : 21448119720 (no retpoline: 393.52%, no jump table: 381.59%)
 # make
 taskset 1 ./test
 no retpoline :  5450331258
 no jump table:  5620839740 (no retpoline: 103.13%)
 jump table   : 21446922902 (no retpoline: 393.50%, no jump table: 381.56%)
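
The percentages in the listings above are plain ratios of one measurement to
another. A minimal sketch of that arithmetic follows; the helper name
percent_of() is an assumption made for illustration and is not taken from the
benchmark repository:

```c
#include <assert.h>
#include <stdint.h>

/* Express `value` as a percentage of `baseline` (100.0 means equal).
   Hypothetical helper, not part of the benchmark tool.  */
double percent_of(uint64_t value, uint64_t baseline)
{
    assert(baseline != 0);  /* a zero baseline has no meaningful ratio */
    return 100.0 * (double)value / (double)baseline;
}
```

For the first E3-1240 run, percent_of(6298192058, 6098325270) comes out at
roughly 103.28 and percent_of(22081802856, 6098325270) at roughly 362.10,
matching the figures printed next to "no retpoline".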

I've also looked into clang for their -mretpoline flag, and they generally turn
off jump table generation in this case. For gcc, the s390 folks implemented a
target override for the default case-values-threshold to raise it to 20. For
x86 something similar could be done. Anyway, H.J. Lu asked me to reopen this
issue (but it seems I cannot make this change from my account).
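
To make the discussion concrete, here is a small switch of the kind affected.
The function dispatch_example() is a hypothetical illustration and not code
from the benchmark repository. Built like switch.c above (-O2
-mindirect-branch=thunk), a dense switch with distinct case bodies may be
lowered to a jump table, whose indirect jump is then routed through the thunk.
Built like switch-no-table.c (with --param=case-values-threshold=20 added), an
8-case switch stays below the threshold and is emitted as conditional
branches. Whether a particular compiler version really picks a jump table
here is an assumption; checking the -S output is the way to confirm.

```c
/* Hypothetical dispatch function with eight dense cases.  Each case has a
   different body so the switch cannot simply be turned into a data lookup.  */
unsigned int dispatch_example(unsigned int op, unsigned int a, unsigned int b)
{
    switch (op) {
    case 0: return a + b;
    case 1: return a - b;
    case 2: return a * b;
    case 3: return a & b;
    case 4: return a | b;
    case 5: return a ^ b;
    case 6: return a << (b & 31);   /* mask keeps the shift count in range */
    case 7: return a >> (b & 31);
    default: return 0;
    }
}
```

The result is the same either way, e.g. dispatch_example(2, 6, 3) is 18; only
the generated control flow differs.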

[Bug target/86952] Avoid jump table for switch statement with -mindirect-branch=thunk

2019-03-01 Thread daniel at iogearbox dot net
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952

--- Comment #16 from Daniel Borkmann  ---
(In reply to Martin Liška from comment #15)
> (In reply to Daniel Borkmann from comment #12)
> > I've been looking into this issue recently and improved the benchmark
> > tool a bit along the way. Several ways of traversing the switch cases need
> > to be considered; the case here is a round-robin walk, but additional
> > distributions / tests could be added. Pushed here just in case:
> > https://github.com/borkmann/microbenchmark
> 
> Thanks a lot for the benchmark.
> 
> > Numbers I'm getting are stable:
> > 
> > [...]
> 
> I can confirm the numbers. I've got:
> model name: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
> 
> taskset 1 ./test
> no retpoline :  4311969467
> no jump table:  5146081372 (no retpoline: 119.34%)
> jump table   : 18845846887 (no retpoline: 437.06%, no jump table: 366.22%)

Ok, great, thanks for testing on your side as well!

> > I've also looked into clang for their -mretpoline flag, and they generally
> > turn off jump table generation in this case. For gcc, the s390 folks
> > implemented a target override for the default case-values-threshold to raise
> > it to 20. 
> 
> Note that GCC has similar parameter:
> 
> --param case-values-threshold
>     The smallest number of different values for which it is best to use a
>     jump-table instead of a tree of conditional branches.  If the value is
>     0, use the default for the machine.  The default is 0.

Yeah, I know, I've used it above for the test case (see the gcc cmdline parts).

> For 20 branches, I've got even worse numbers:
> https://github.com/marxin/microbenchmark-1/tree/retpoline-table
> 
> taskset 1 ./test
> no retpoline :  5096377521
> no jump table:  5169400990 (no retpoline: 101.43%)
> jump table   : 28830137876 (no retpoline: 565.70%, no jump table: 557.71%)
> 
> So are you suggesting to disable jump tables with retpolines at all?

I'll leave that up to you, but at a minimum I would probably implement
something like what the s390 folks did for gcc in commit db7a90aa0de5 ("S/390:
Disable prediction of indirect branches"); see s390_case_values_threshold(),
which does:

+unsigned int
+s390_case_values_threshold (void)
+{
+  /* Disabling branch prediction for indirect jumps makes jump tables
+     much more expensive.  */
+  if (TARGET_INDIRECT_BRANCH_NOBP_JUMP)
+    return 20;
+
+  return default_case_values_threshold ();
+}
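
As a rough standalone model of what such a hook changes (a simplified sketch,
not GCC code: both function names and the boolean parameter are assumptions,
and GCC's real decision also weighs things like case density), the threshold
is the smallest number of distinct case values for which a jump table is
considered at all:

```c
#include <stdbool.h>

/* Simplified model of the s390-style override: raise the threshold to 20
   when indirect branches are protected, otherwise keep the target default.
   Hypothetical names, not GCC internals.  */
unsigned int model_case_values_threshold(bool indirect_branch_protected,
                                         unsigned int target_default)
{
    return indirect_branch_protected ? 20 : target_default;
}

/* A switch is a jump table candidate only once it has at least
   `threshold` distinct case values.  */
bool model_jump_table_candidate(unsigned int distinct_cases,
                                unsigned int threshold)
{
    return distinct_cases >= threshold;
}
```

Under this model, a threshold of 20 keeps a 16-case switch as conditional
branches and leaves a 32-case switch eligible for a jump table.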

> > For x86 something similar could be done. Anyway, H.J. Lu asked me
> > to reopen this issue (but seems like I cannot make this change from my
> > account).
> 
> Yep, you would need an account ending with @gcc.gnu.org to change a bug.

[Bug target/86952] Avoid jump table for switch statement with -mindirect-branch=thunk

2019-03-06 Thread daniel at iogearbox dot net
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952

--- Comment #20 from Daniel Borkmann  ---
(In reply to Martin Liška from comment #19)
> Ok, I updated the benchmark and pushed it here:
> https://github.com/marxin/microbenchmark-1
> 
> And I see the following on my Haswell machine:

Thanks for working on it! It's a bit strange that some of your numbers
fluctuate quite a lot, e.g. in your 'normal' column. What do you use to tune
your setup for testing? I've been running the `make prep` step which I added
back then, and the numbers I see are quite stable. I ran a quick test this
morning with your repo, and here's what I got for the round-robin walk:

* Xeon E3-1240 (3.7GHz):

# ./test.py
             normal       retpoline    retpo+no-JT  retpo+JT=20  retpo+JT=40
cases:    8: 0.49 (100%)  2.09 (426%)  0.53 (108%)  0.53 (108%)  0.53 (108%)
cases:   16: 0.49 (100%)  2.09 (426%)  0.58 (119%)  0.58 (119%)  0.58 (119%)
cases:   32: 0.49 (100%)  2.09 (426%)  0.61 (125%)  2.09 (426%)  0.61 (125%)
cases:   64: 0.49 (100%)  2.26 (458%)  0.69 (140%)  2.27 (459%)  2.27 (459%)
cases:  128: 0.50 (100%)  2.37 (476%)  0.76 (153%)  2.32 (466%)  2.41 (483%)
cases:  256: 0.52 (100%)  2.33 (451%)  0.91 (175%)  2.33 (450%)  2.36 (456%)
cases: 1024: 1.05 (100%)  2.54 (242%)  1.08 (103%)  2.59 (246%)  2.54 (242%)
cases: 2048: 1.63 (100%)  2.56 (157%)  1.94 (119%)  2.61 (160%)  2.59 (159%)
cases: 4096: 2.19 (100%)  3.12 (143%)  3.22 (147%)  3.09 (142%)  3.13 (143%)

* Xeon Gold 5120 (2.6GHz):

# ./test.py
             normal       retpoline    retpo+no-JT  retpo+JT=20  retpo+JT=40
cases:    8: 0.70 (100%)  2.98 (425%)  0.75 (107%)  0.75 (107%)  0.75 (107%)
cases:   16: 0.70 (100%)  2.98 (425%)  0.82 (117%)  0.82 (117%)  0.82 (117%)
cases:   32: 0.70 (100%)  3.01 (430%)  0.87 (124%)  2.98 (426%)  0.87 (124%)
cases:   64: 0.70 (100%)  3.52 (501%)  0.94 (134%)  3.52 (501%)  3.52 (501%)
cases:  128: 0.71 (100%)  3.51 (495%)  1.07 (151%)  3.50 (495%)  3.50 (494%)
cases:  256: 0.76 (100%)  3.14 (414%)  1.27 (167%)  3.14 (414%)  3.14 (414%)
cases: 1024: 1.46 (100%)  3.36 (230%)  1.49 (102%)  3.36 (230%)  3.36 (230%)
cases: 2048: 2.25 (100%)  3.19 (142%)  2.70 (120%)  3.19 (142%)  3.19 (142%)
cases: 4096: 2.90 (100%)  3.74 (129%)  4.48 (155%)  3.73 (129%)  3.72 (129%)

It probably makes sense to also add other walk tests, i.e. input distributions,
for foo{,_no_table,_no_retpol}() for further comparison if the plan is to
disable jump tables entirely.
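
As a sketch of what a second input distribution could look like next to the
round-robin walk: the two helpers below precompute the sequence of case
values to feed into foo{,_no_table,_no_retpol}(), so that generating inputs
stays out of the timed loop. Both function names, and the choice of a
xorshift32 generator for the roughly uniform walk, are assumptions made for
illustration and not what the benchmark repositories implement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round-robin walk: 0, 1, ..., ncases - 1, 0, 1, ...  */
void fill_round_robin(unsigned int *seq, size_t len, unsigned int ncases)
{
    assert(ncases != 0);
    for (size_t i = 0; i < len; i++)
        seq[i] = (unsigned int)(i % ncases);
}

/* Pseudo-random walk driven by a xorshift32 generator.  The seed must be
   nonzero; the modulo makes the distribution only roughly uniform.  */
void fill_pseudo_random(unsigned int *seq, size_t len, unsigned int ncases,
                        uint32_t seed)
{
    assert(ncases != 0 && seed != 0);
    uint32_t x = seed;
    for (size_t i = 0; i < len; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        seq[i] = x % ncases;
    }
}
```

A fixed seed keeps the pseudo-random walk reproducible across runs, which
matters when comparing the retpoline and non-retpoline builds against each
other.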