[Bug c/107827] New: switch table compression could shrink large tables (but missing), testcase cerf-lib

2022-11-22 Thread aleks at physik dot tu-berlin.de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107827

Bug ID: 107827
   Summary: switch table compression could shrink large tables
(but missing), testcase cerf-lib
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: aleks at physik dot tu-berlin.de
  Target Milestone: ---

The cerf-lib (a self-contained numeric library that provides an efficient and
accurate implementation of the complex error functions) contains 2 switch blocks
with about 100 cases each. Each case evaluates a piece-wise Taylor polynomial
approximation (aiming at nearly machine precision, type double).
https://jugit.fz-juelich.de/mlz/libcerf/-/tree/main/lib

Compiler Explorer (https://godbolt.org/, -O3 selected for x86-64 gcc 12) shows
assembly output with huge jump tables. Although the code compiles fine on my PC with
gcc 11, in Compiler Explorer you have to shrink the switch to about 20 cases to see a
result (some web-service limit, but that is not the issue here). Let's focus on 2 static
functions.

1. erfcx.c / static double erfcx_y100() / case (int) 0..99 (100 cases, no gap,
each case is a Taylor polynomial of degree 6). This produces a jump table of 100 x 8
bytes, plus code. The table could easily be compressed or replaced by a linear
formula.
a) Assuming no function ever exceeds 4 GB of code, 4-byte offsets (jump distances)
would be enough, and for 99% of functions a 64 KB range (2-byte offsets) would
suffice. At least at -O2 this makes sense, but even at -O3 it could be faster to
reduce memory/cache usage; this could become a new option to control it (see the
toy sketch after this list).
b) For this erfcx_y100 sample it would even be possible to compute the offset by a
formula.
c) Of course the coefficients could be moved into a separate lookup table, but that
gives no speedup, for other reasons (I tried it these days). gcc could detect the
similarity (identical cases apart from the 7 hard-coded coefficients), which would be
another change request.
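
Below is a toy sketch of what idea a) looks like at the C level, using the GNU
"labels as values" extension as documented in the GCC manual: the dispatch goes
through a table of small label offsets relative to one base label instead of
absolute 8-byte addresses. This is only my illustration (placeholder coefficients,
3 cases, hypothetical function name); the report of course asks gcc to do the
equivalent internally, with 2-byte entries in the emitted jump table.

/* Toy sketch of idea a): dispatch through small label offsets instead of
 * absolute addresses (GNU C "labels as values", see the GCC manual).
 * Assumes 0 <= y100 < 3; coefficients are placeholders. */
static double dispatch_demo(double y100) {
    static const int off[3] = {             /* offsets, not 8-byte pointers */
        &&c0 - &&c0, &&c1 - &&c0, &&c2 - &&c0
    };
    goto *(&&c0 + off[(int) y100]);         /* base label + small offset */
c0: { double t = 2*y100 - 1; return 0.7 + 0.1*t; }
c1: { double t = 2*y100 - 3; return 0.2 + 0.1*t; }
c2: { double t = 2*y100 - 5; return 0.3 + 0.1*t; }
}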

sample (I truncated the coefficient values for readability):
switch ((int) y100) {
  case 0: {
    double t = 2*y100 - 1;
    return 0.7 + (0.7 + (0.3 + (0.1 + (0.8 + (0.3 + 0.1*t) *t) *t) *t) *t) *t;
  }
  case 1: {
    double t = 2*y100 - 3;
    return 0.2 + (0.7 + (0.3 + (0.1 + (0.8 + (0.3 + 0.1*t) *t) *t) *t) *t) *t;
  }
  // ... until case 99
}
// instead of an 8 x 100 byte jump table this could be 2 x 100 bytes,
// or a ~10-byte linear formula.
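
And a minimal sketch of idea c) for comparison (hypothetical function name,
placeholder coefficients rather than the real libcerf values): the 100 structurally
identical cases collapse into a coefficient array plus one shared Horner evaluation,
which is exactly the similarity the report suggests gcc could detect on its own.

/* Sketch of idea c): one coefficient row per former case, one shared
 * Horner evaluation.  Coefficient values are placeholders. */
static const double coeff[100][7] = {
    {0.7, 0.7, 0.3, 0.1, 0.8, 0.3, 0.1},    /* was case 0 */
    {0.2, 0.7, 0.3, 0.1, 0.8, 0.3, 0.1},    /* was case 1 */
    /* ... remaining rows ... */
};

static double erfcx_y100_table(double y100) {   /* hypothetical name */
    int i = (int) y100;                         /* 0..99, as in the switch */
    double t = 2*y100 - (2*i + 1);              /* case k used 2*y100 - (2k+1) */
    const double *c = coeff[i];
    return c[0] + (c[1] + (c[2] + (c[3] + (c[4] + (c[5] + c[6]*t)*t)*t)*t)*t)*t;
}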

2. im_w_of_x.c / static double w_im_y100() / case (int) 0..96 (97 cases, no
gap, similar polynomials of degree 6 to 8, plus one unique final formula). Here the
code is (even neglecting the coefficients) not identical, but still similar.
Address-offset compression could still be used (but not a purely linear formula).
d) same as a) + b)
e) side note: "2*y100" does not seem to be detected and computed once at the start
of the switch (see the small sketch after this list).
f) detection of similar code (same formula, different parameters) could be used to
decrease object size (text): only 4 kinds of formulas occur here, only the offset
into some generated "const double parameters[]" array differs.
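
A tiny sketch of point e) (placeholder coefficients, not the real w_im_y100 cases):
the factor 2*y100 computed once before the switch instead of inside every case.

/* Point e): hoist the common subexpression 2*y100 out of the cases. */
static double hoist_demo(double y100) {
    const double y2 = 2.0 * y100;   /* computed once for all cases */
    switch ((int) y100) {
    case 0: { double t = y2 - 1.0; return 0.7 + 0.1*t; }
    case 1: { double t = y2 - 3.0; return 0.2 + 0.1*t; }
    default: return 0.0;
    }
}

(Point f) essentially amounts to the coefficient-table sketch shown for erfcx_y100
above.)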

PS: none of the generated code is wrong or produces different results; this is
purely a read-only-memory-consumption and timing issue.

Thanks for reading.

[Bug target/107827] switch table compression could shrink large tables (but missing), testcase cerf-lib

2022-11-22 Thread aleks at physik dot tu-berlin.de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107827

--- Comment #4 from Alexander Kleinsorge  ---
(In reply to Andrew Pinski from comment #2)
> The x86_64 target does not do jump table compression at all.
> But aarch64 and arm targets do:
> .L4:
> .2byte  (.L103 - .Lrtx4) / 4
> .2byte  (.L102 - .Lrtx4) / 4
> .2byte  (.L101 - .Lrtx4) / 4
> .2byte  (.L100 - .Lrtx4) / 4
> .2byte  (.L99 - .Lrtx4) / 4
> .2byte  (.L98 - .Lrtx4) / 4
> .2byte  (.L97 - .Lrtx4) / 4
> .2byte  (.L96 - .Lrtx4) / 4
> .2byte  (.L95 - .Lrtx4) / 4
> 
> 
> Confirmed. Note the /4 can't be done for x86 but the rest I think can be
> done. The /4 can't be done because some instructions are one byte in length
> on x86_64.

I think that by inserting 1-byte NOPs it should be possible to get the jump targets
aligned: https://www.felixcloutier.com/x86/nop
So cases containing 1-byte instructions could be padded up to the required alignment.

[Bug target/107827] switch table compression could shrink large tables (but missing), testcase cerf-lib

2022-11-22 Thread aleks at physik dot tu-berlin.de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107827

--- Comment #5 from Alexander Kleinsorge  ---
And what about a linear formula for same-sized cases (each case has the same
code size)? That should be possible in case 1 (I think).

[Bug target/108281] New: float value range estimation missing (vs. integer)

2023-01-03 Thread aleks at physik dot tu-berlin.de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108281

Bug ID: 108281
   Summary: float value range estimation missing (vs. integer)
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: aleks at physik dot tu-berlin.de
  Target Milestone: ---

"gcc -O3" and optional: "-funsafe-math-optimizations" (isnan)
GCC ignores ranges of float numbers for optimization, tested via
https://godbolt.org/
For many frequent used functions (or even all in math.h) gcc could know a range
limit and concidering it for comparisons.

#include <math.h>
int ranges(float x) {
    if (x != x) return -1;          // optional NaN check
    if (cos(x)  < -1.0f) return -2;
    if (sin(x)  >  1.0f) return -3;
    if (fabs(x) <  0.0f) return -4;
    if (atan(x) < -2.0f) return -5; // range is +-PI/2
    if (exp(x)  <  0.0f) return -6;
    if (sqrt(x) <  0.0f) return -7;
    if (log(x)  > 90.0f) return -8; // ln(FLT_MAX)=88.8, ln(DBL_MAX)=709.8
    return 0; // the only valid return (beside -1)
}
int sqr2(float x) { // squares give non-negative results
    return x*x < 0.0f; // == false
}
int ax2(float x) {
    return fabs(x) > -1.0f; // == true
}
int cmp_sqrt(float x, float y) { // similar (!sadly very often seen!)
    //x = fabs(x); y = fabs(y); // optional sign removal line
    return sqrtf(x) < sqrtf(y); // == (x < y), hint: sqrt is slow
}
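
For illustration, the hand-written form of the last example that gcc could derive
on its own (hypothetical function name; valid only when x and y are non-negative
and not NaN, e.g. after the optional fabs() line or under finite-math assumptions),
since sqrtf is monotonically increasing:

/* cmp_sqrt() folded by hand: with non-negative, finite x and y the two
 * slow square roots can be dropped from the comparison. */
int cmp_sqrt_fast(float x, float y) {
    return x < y;
}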

[Bug target/108281] float value range estimation missing (vs. integer)

2023-01-03 Thread aleks at physik dot tu-berlin.de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108281

--- Comment #1 from Alexander Kleinsorge  ---
(the same applies to the types double and long double)

[Bug middle-end/119943] -O3 forgets trivial code shift. causing significant slowdown

2025-04-25 Thread aleks at physik dot tu-berlin.de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119943

--- Comment #2 from Alexander Kleinsorge  ---
Created attachment 61197
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61197&action=edit
full sample code (1 file)

The version above has the issue active.
gcc -o radix.exe -march=native -O3 radix.c
./radix 60 50

[Bug c/119943] New: -O3 forgets trivial code shift. causing significant slowdown

2025-04-25 Thread aleks at physik dot tu-berlin.de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119943

Bug ID: 119943
   Summary: -O3 forgets trivial code shift. causing significant
slowdown
   Product: gcc
   Version: 12.4.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: aleks at physik dot tu-berlin.de
  Target Milestone: ---

The test below causes a significant slowdown (3 vs 11 us) with "-O3" on x86-64
(Intel 12th gen).
The critical line of code is a struct initialization of 4x uint8.

This indicates 2 issues:

1. Why does moving the initialization around change the runtime at -O3?
gcc should detect this easy trick (low-hanging fruit).

2. Moving up the other line "uint8 aa,bb=0,cc=3,dd=4;" is not harmful.
Struct filling is more critical, even though {0,0,0,0} is just 0u (32 bit).


// the code, reduced to the idea
typedef unsigned int uint32;
typedef unsigned char uint8;

// time measured via clock_gettime (average over 10 runs)
struct timespec ts0 = {0, 0};
clock_gettime(CLOCK_REALTIME, &ts0);

typedef struct {
    uint8 a, b, c, d;
} cfg_t;

// test (2 vs 10 us)
uint32 test1(elem_t array[], const uint32 n) {  // n=60

    // if the critical line is here, then slowdown !!
    if (n <= 60u) {
        if (n > 1u) quickSort(array, 0, n - 1);
        return 0u;
    }
    // we never get here, as (n <= 60)
    uint8 aa, bb = 0, cc = 3, dd = 4; // the other, not harmful line

    cfg_t cfg = {0, 0, 0, 0}; // critical line !!!

    // something ABC, same in test2()
    // we use the cfg struct here many times

    return n + cfg.a;
}
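
For contrast, a sketch of the fast placement described above (hypothetical name;
it relies on elem_t, quickSort and cfg_t from the attached radix.c, and differs
from test1 only in where the struct initialization sits):

// fast variant (sketch): the struct initialization sits after the
// early-return branch, on the only path that actually uses cfg
uint32 test1_fast(elem_t array[], const uint32 n) {
    if (n <= 60u) {
        if (n > 1u) quickSort(array, 0, n - 1);
        return 0u;
    }
    cfg_t cfg = {0, 0, 0, 0}; // moved down: no slowdown observed
    return n + cfg.a;
}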

[Bug middle-end/119943] -O3 forgets trivial code shift. causing significant slowdown

2025-04-25 Thread aleks at physik dot tu-berlin.de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119943

--- Comment #4 from Alexander Kleinsorge  ---
I dont think this is the a vectorization issue here!
Moving a struct initialization around, seems not related to the duplicate you
mentioned, for me. Could you please check again?

[Bug middle-end/50481] builtin to reverse the bit order

2025-05-19 Thread aleks at physik dot tu-berlin.de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50481

Alexander Kleinsorge  changed:

   What|Removed |Added

 CC||aleks at physik dot tu-berlin.de

--- Comment #13 from Alexander Kleinsorge  ---
For single bytes (uint8) there could be a faster way (x86 + x64):
only logical ops and shifts (the multiplications and divisions by powers of two
compile down to shifts and masks), nothing else.

typedef unsigned char      uint8;   /* typedefs assumed by the snippet */
typedef unsigned long long uint64;

static inline uint8 byte_rev(uint8 v) {
    /* 16-nibble LUT: nibble i of BREV64 holds the 4-bit reversal of i
       (the constant expands to 0xF7B3D591E6A2C480) */
    const uint64 BREV64 = ~0x084c2a6e195d3b7fLLu;
    uint8 a = BREV64 >> ((v % 16u) * 4u); /* reversed low nibble of v */
    uint8 b = BREV64 >> ((v / 16u) * 4u); /* reversed high nibble of v */
    return (a * 16u) | (b % 16u);         /* swap the nibble positions */
}
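
A quick self-check (hypothetical test harness, using byte_rev() and the typedefs
above) against a naive bit loop, to "verify this number":

#include <stdio.h>

/* naive reference: shift the bits of v out from the LSB into r */
static uint8 byte_rev_naive(uint8 v) {
    uint8 r = 0;
    for (int i = 0; i < 8; i++)
        r = (uint8)((r << 1) | ((v >> i) & 1u));
    return r;
}

int main(void) {
    for (unsigned v = 0; v < 256; v++)
        if (byte_rev((uint8) v) != byte_rev_naive((uint8) v))
            printf("mismatch at 0x%02x\n", v);
    return 0;
}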