https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107827

            Bug ID: 107827
           Summary: switch table compression could shrink large tables
                    (but missing), testcase cerf-lib
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: aleks at physik dot tu-berlin.de
  Target Milestone: ---

The cerf library (a self-contained numeric library that provides an efficient
and accurate implementation of complex error functions) contains two switch
blocks with about 100 cases each. Each case evaluates a piece-wise Taylor
polynomial approximation (aiming at nearly machine precision, type double).
https://jugit.fz-juelich.de/mlz/libcerf/-/tree/main/lib

Compiler Explorer (with -O3 selected for x86-64 gcc 12) https://godbolt.org/
shows assembly output containing huge jump tables. Even though the code
compiles fine on my PC with gcc 11, you have to shrink the switch to about 20
cases in Compiler Explorer to see the result (some web-service limit, but that
is not the issue here). Let's focus on two static functions.

1. erfcx.c / static double erfcx_y100() / case (int) 0..99 (100 cases, no gap,
each case is a Taylor polynomial of degree 6). This gives a jump table of
100 x 8 bytes, plus code. The table could easily be compressed or replaced by a
linear formula.
a) Assuming no single function ever exceeds 4 GB of code, it would be possible
to use only 4-byte offsets (jump distances), and for 99% of functions a 65 KB
range would be enough (at least at -O2 this makes sense, but even at -O3 it
could be faster to decrease memory/cache usage; this could become a new option
to control it). A sketch follows after this list.
b) For this erfcx_y100 sample it would even be possible to calculate the offset
by a linear formula.
c) Of course the coefficients could go into a separate lookup table, but that
gave no speedup, for other reasons (I tried it these days; see the sketch after
the code sample below). gcc could detect the similarity (identical cases apart
from the 7 hard-coded coefficients), which would be another change request.
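
To make idea a) concrete, here is a minimal hand-written sketch of what
compressed dispatch could look like, using the "labels as values" extension and
the label-difference initializers documented in the GCC manual. The three dummy
cases and their coefficients are placeholders, not the real libcerf
polynomials:

/* Dispatch through a table of 4-byte label differences instead of
 * N x 8-byte absolute addresses (2-byte entries would work the same
 * way whenever the function stays within a 16-bit range). */
static double dispatch_sketch(int i, double t)
{
    static const int offs[] = { &&c0 - &&c0, &&c1 - &&c0, &&c2 - &&c0 };
    goto *(&&c0 + offs[i]);
c0: return 0.7 + 0.1*t;   /* stands in for the case 0 polynomial */
c1: return 0.2 + 0.3*t;   /* stands in for the case 1 polynomial */
c2: return 0.4 + 0.8*t;   /* stands in for the case 2 polynomial */
}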

sample (I truncated the coefficient values for readability):
switch ((int) y100) {
  case 0: {
    double t = 2*y100 - 1;
    return 0.7 + (0.7 + (0.3 + (0.1 + (0.8 + (0.3 + 0.1*t) *t) *t) *t) *t) *t;
  }
  case 1: {
    double t = 2*y100 - 3;
    return 0.2 + (0.7 + (0.3 + (0.1 + (0.8 + (0.3 + 0.1*t) *t) *t) *t) *t) *t;
  }
  .. until case 99:
  // instead of an 8 x 100 byte table it could be 2 x 100 bytes, or a
  // ~10-byte linear formula.
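
For comparison, idea c) in source form; a sketch with the same placeholder
coefficients as above (the real values are truncated here), and with the
"2*y100 - 1, - 3, ..." pattern folded into one formula. As noted above, this
gave no speedup in practice; the sketch only makes the transformation concrete:

/* One shared Horner evaluation; the 100 cases collapse into a
 * coefficient table.  Values are placeholders, not the real ones. */
static const double coef[100][7] = {
    { 0.7, 0.7, 0.3, 0.1, 0.8, 0.3, 0.1 },  /* case 0 */
    { 0.2, 0.7, 0.3, 0.1, 0.8, 0.3, 0.1 },  /* case 1 */
    /* ... cases 2..99 ... */
};

static double erfcx_y100_table(double y100)
{
    int i = (int) y100;
    double t = 2*y100 - (2*i + 1);          /* generalizes -1, -3, ... */
    const double *c = coef[i];
    return c[0] + (c[1] + (c[2] + (c[3] + (c[4] + (c[5]
           + c[6]*t) *t) *t) *t) *t) *t;
}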

2. im_w_of_x.c / static double w_im_y100() / case (int) 0..96 (97 cases, no
gap, similar polynomials of degree 6 to 8, plus another unique final formula).
Here the code is not identical (even neglecting the coefficients), but still
similar. Address-offset compression could still be used (though not a linear
formula all the way).
d) Same as a) + b).
e) Side note: "2*y100" seems not to be detected and calculated once at the
beginning of the switch.
f) Detection of similar code (same equation, different parameters) could be
used to decrease object size (.text), as there are only 4 kinds of equations
here; only the offset into some generated "const double parameters[]" array
differs. A sketch follows after this list.
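
A sketch of idea f), with hypothetical names (eq_kind, params[], case_info)
and placeholder values; each case shrinks to a kind tag plus an offset into
one shared coefficient array, so only the 4 equation shapes remain in .text:

/* All names and values here are illustrative, not libcerf code. */
enum eq_kind { POLY6, POLY7, POLY8, FINAL };

struct case_info { unsigned char kind; unsigned short par_off; };

static const double params[] = {
    /* case 0 (placeholders) */ 0.7, 0.7, 0.3, 0.1, 0.8, 0.3, 0.1,
    /* ... cases 1..96 ... */
};

static const struct case_info cases[97] = {
    { POLY6, 0 },   /* case 0 */
    /* ... */
};

static double w_im_y100_shared(double y100)
{
    int i = (int) y100;
    const double *p = params + cases[i].par_off;
    double t = 2*y100 - (2*i + 1);
    switch (cases[i].kind) {
    case POLY6:
        return p[0] + (p[1] + (p[2] + (p[3] + (p[4] + (p[5]
               + p[6]*t) *t) *t) *t) *t) *t;
    /* POLY7, POLY8 and the FINAL formula handled analogously */
    default:
        return 0.0;
    }
}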

PS: No generated code is wrong or produces different results; this is purely a
read-only memory consumption and timing issue.

Thanks for reading.
