[Bug c/107827] New: switch table compression could shrink large tables (but missing), testcase cerf-lib
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107827 Bug ID: 107827 Summary: switch table compression could shrink large tables (but missing), testcase cerf-lib Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: aleks at physik dot tu-berlin.de Target Milestone: --- The cerf-lib (self-contained numeric library that provides an efficient and accurate implementation of complex error functions) contains 2 switch-blocks having about 100 entries. Each entry represents a piece-wise taylor-polynomial-approximation (aiming nearly machine precision, type double). https://jugit.fz-juelich.de/mlz/libcerf/-/tree/main/lib Compiler Explorer (selected -O3 for x86-64-gcc-12) https://godbolt.org/ shows ASM output having huge jump-tables. Even code compiles fine on my PC with gcc11, you have to shrink to 20 cases in Compiler-Explorer to see result (some webservice limit, but this is not the issue here). Lets focus on 2 static functions. 1. erfcx.c / static double erfcx_y100() / case (int) 0..99 (100 cases, no gap, each case is a taylor-polynomial of grade 6). Gives Jump-Table of 100 x 8 bytes, plus code. The table could be easy compressed or replaced by linear formula. a) assumning no function ever exceeded 4 GB of code, it would be possible to use only 4 byte offset (jumping-distance), and for 99% of funtions 65 KB range would be enough (at least in O2 is makes sence, but also in O3 it could be faster to decrease memory/cache usage - this could become a new option to control it). b) for this erfcx_y100-sample it would be even possible to calc the offset by formula. c) of course the parameters could be a separate lookup-table, but no speedup for other reasons (I tried these days). gcc could detect the similarity (idential cases beside the 7 hard-coded coefficients) which would be antother change request.. sample (I cutted the coefficients numbers for readability) switch ((int) y100) { case 0: { double t = 2*y100 - 1; return 0.7 + (0.7 + (0.3 + (0.1 + (0.8 + (0.3 + 0.1 * t) *t) *t) *t) *t) *t; } case 1: { double t = 2*y100 - 3; return 0.2 + (0.7 + (0.3 + (0.1 + (0.8 + (0.3 + 0.1 *t) *t) *t) *t) *t) *t; } .. until case 99: // instead 8 x 100 byte table it could be 2 x 100 byte, or 10 byte linear equation. 2. im_w_of_x.c / static double w_im_y100() / case (int) 0..96 (97 cases, no gap, similar polynomials between 6th to 8th grade, plus another unique final formula). Here the code is (even neglecting the coefficients) not identical, but still similar. Address-Offset-Compression could still be used (but not linear all the way). d) same as a) + b) e) side note: " 2*y100 " seems no to be detected and calculated once at begin of switch. f) detection of similar code (same equation, different parameters) could be used to decrease object-size (text) as only 4 kinds of equations are here, only offset in some created "const double parameters[]"-array differs. PS: no code is wrong or shows different results, this is a ro-memory-consumption and timing issue only. Thanks for reading.
[Bug target/107827] switch table compression could shrink large tables (but missing), testcase cerf-lib
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107827 --- Comment #4 from Alexander Kleinsorge --- (In reply to Andrew Pinski from comment #2) > x86_64 target does not dojump table compression at all. > But aarch64 and arm targets do: > .L4: > .2byte (.L103 - .Lrtx4) / 4 > .2byte (.L102 - .Lrtx4) / 4 > .2byte (.L101 - .Lrtx4) / 4 > .2byte (.L100 - .Lrtx4) / 4 > .2byte (.L99 - .Lrtx4) / 4 > .2byte (.L98 - .Lrtx4) / 4 > .2byte (.L97 - .Lrtx4) / 4 > .2byte (.L96 - .Lrtx4) / 4 > .2byte (.L95 - .Lrtx4) / 4 > > > Confirmed. Note the /4 can't be done for x86 but the rest I think can be > done. The /4 can't be done because some instructions one byte in length on > x86_64. I think by inserting 1 byte NOPs should be possible to get jump-point-alignment https://www.felixcloutier.com/x86/nop So the other 1 byte ops could be filled up for alignment.
[Bug target/107827] switch table compression could shrink large tables (but missing), testcase cerf-lib
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107827 --- Comment #5 from Alexander Kleinsorge --- And what about linear equation for same-sized cases (each case has same code-size). Which should be possible in Case 1 (I think).
[Bug target/108281] New: float value range estimation missing (vs. integer)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108281 Bug ID: 108281 Summary: float value range estimation missing (vs. integer) Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: aleks at physik dot tu-berlin.de Target Milestone: --- "gcc -O3" and optional: "-funsafe-math-optimizations" (isnan) GCC ignores ranges of float numbers for optimization, tested via https://godbolt.org/ For many frequent used functions (or even all in math.h) gcc could know a range limit and concidering it for comparisons. #include int ranges(float x) { if (x!=x) return -1; // optional NAN check if (cos(x) < -1.0f) return -2; if (sin(x) > 1.0f) return -3; if (fabs(x) < 0.0f) return -4; if (atan(x) < -2.0f) return -5; // +-PI/2 if (exp(x) < 0.0f) return -6; if (sqrt(x) < 0.0f) return -7; if (log(x) < 90.0f) return -8; // ln(FLT_MAX)=88.8 // ln(DBL_MAX) = 709.8 return 0; // the only valid return (beside -1) } int sqr2(float x) { // squares give non-negative results return x*x < 0.0f; // == false } int ax2(float x) { return fabs(x) > -1.0f; // == true } int cmp_sqrt(float x, float y) { // similar (!sadly very often seen!) //x = fabs(x); y = fabs(y); // optional sign removal line return sqrtf(x) < sqrtf(y); // == (x < y), hint: sqrt=slow }
[Bug target/108281] float value range estimation missing (vs. integer)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108281 --- Comment #1 from Alexander Kleinsorge --- (same for types double and long-double)
[Bug middle-end/119943] -O3 forgets trivial code shift. causing significant slowdown
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119943 --- Comment #2 from Alexander Kleinsorge --- Created attachment 61197 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61197&action=edit full sample code (1 file) version above has the issue active. gcc -o radix.exe -march=native -O3 radix.c ./radix 60 50
[Bug c/119943] New: -O3 forgets trivial code shift. causing significant slowdown
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119943 Bug ID: 119943 Summary: -O3 forgets trivial code shift. causing significant slowdown Product: gcc Version: 12.4.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: aleks at physik dot tu-berlin.de Target Milestone: --- the test below caused significant slowdown (3 vs 11 us) using "-O3" optimization on x64 arch (Intel 12 gen). the critical line of code is a struct initialization of 4x uint8. this indicates 2 issues: 1. why moving the intialization around changes runtime in O3 ? gcc should detect this easy trick (low hanging fruit). 2. moving up another line "uint8 aa,bb=0,cc=3,dd=4;" is not harmful. struct filling is more critical, even if only {0,0,0,0} = 0u (32bit) // the code reduced to the idea typedef unsigned int uint32; typedef unsigned char uint8; // time measure via (average over 10 runs) struct timespec ts0 = {0, 0}; clock_gettime(CLOCK_REALTIME, &ts0); typedef struct { uint8 a, b, c, d; } cfg_t; // test (2 vs 10 us) uint32 test1(elem_t array[], const uint32 n) { // n=60 // if critical line is here, then slowdown !! if (n <= 60u) { if (n > 1u) quickSort(array, 0, n - 1); return 0u; } // we never go here, as (n<=60) uint8 aa, bb = 0, cc = 3, dd = 4; // the other not harmful line cfg_t cfg = {0, 0, 0, 0}; // critical line !!! // something ABC, same in test2() // we use cfg struct here many times return n + cfg.a; }
[Bug middle-end/119943] -O3 forgets trivial code shift. causing significant slowdown
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119943 --- Comment #4 from Alexander Kleinsorge --- I dont think this is the a vectorization issue here! Moving a struct initialization around, seems not related to the duplicate you mentioned, for me. Could you please check again?
[Bug middle-end/50481] builtin to reverse the bit order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50481 Alexander Kleinsorge changed: What|Removed |Added CC||aleks at physik dot tu-berlin.de --- Comment #13 from Alexander Kleinsorge --- for single bytes (uint8), there could be a faster way (x86 + x64). there are only logical ops and shifts, nothing else. static inline uint8 byte_rev(uint8 v) { const uint64 BREV64 = ~0x084c2a6e195d3b7fLLu; // verify this number (LUT like) uint8 a = (BREV64) >> ((v % 16u) * 4u); // from low uint8 b = (BREV64) >> ((v / 16u) * 4u); // from high return (a * 16u) | (b % 16u); }