Spurious register spill with volatile function argument
Seems I had misused volatile. I removed 'volatile' from the function argument on test_0 and it prevented the spill through the stack. I added volatile because I was trying to stop the compiler optimising away the call to test_0 (as it has no side effects), but it appears volatile was unnecessary and a misuse of the qualifier (which is intended to indicate that storage may change outside the control of the compiler).

However it is an interesting case, as register arguments don't have storage. GCC, Clang folk, any ideas on why there is a stack spill for a volatile register argument passed in esi? Does volatile force the argument to have storage allocated on the stack? Is this a corner case in the C standard? In the x86_64 calling convention this argument only has a register, so technically it can't change outside the control of the C "virtual machine", and volatile has a vague meaning here. This seems to be a case of interpreting the C standard in such a way as to make sure that a volatile argument "can be changed" outside the control of the C "virtual machine", by explicitly giving it a storage location on the stack. I think volatile scalar arguments are a special case and that the volatile qualifier shouldn't widen the scope beyond the register unless the value actually *needs* storage to spill. This is not a volatile stack-scoped variable unless the C standard interprets ABI register parameters as actually having 'storage', so this is questionable. Maybe I should have gotten a warning, or the volatile type qualifier on a scalar register argument should have been ignored. volatile for scalar function arguments seems to mean "make this volatile and subject to change outside of the compiler" rather than being a qualifier for its storage (which is a register).

# gcc

test_0:
        mov     DWORD PTR [rsp-4], esi
        mov     ecx, DWORD PTR [rsp-4]
        mov     eax, edi
        cdq
        idiv    ecx
        mov     eax, edx
        ret

# clang

test_0:
        mov     dword ptr [rsp - 4], esi
        xor     edx, edx
        mov     eax, edi
        div     dword ptr [rsp - 4]
        mov     eax, edx
        ret

/* Test program compiled on x86_64 with:

   cc -O3 -fomit-frame-pointer -masm=intel -S test.c -o test.S */

#include <stdio.h>
#include <limits.h>

static const int p = 8191;
static const int s = 13;

int __attribute__ ((noinline)) test_0(unsigned int k, volatile int p)
{
    return k % p;
}

int __attribute__ ((noinline)) test_1(unsigned int k)
{
    return k % p;
}

int __attribute__ ((noinline)) test_2(unsigned int k)
{
    int i = (k&p) + (k>>s);
    i = (i&p) + (i>>s);
    if (i>=p) i -= p;
    return i;
}

int main()
{
    test_0(1, 8191); /* control */
    for (int i = INT_MIN; i < INT_MAX; i++) {
        int r1 = test_1(i), r2 = test_2(i);
        if (r1 != r2) printf("%d %d %d\n", i, r1, r2);
    }
}

> On 27 Mar 2016, at 2:32 PM, Andrew Waterman wrote:
>
> It would be good to figure out how to get rid of the spurious register spills.
>
> The strength reduction optimization isn't always profitable on Rocket,
> as it increases instruction count and code size. The divider has an
> early out and for small numbers is quite fast.
>
> On Fri, Mar 25, 2016 at 5:43 PM, Michael Clark wrote:
>> Now considering I have no idea how many cycles it takes for an integer
>> divide on the Rocket so the optimisation may not be a win.
>>
>> Trying to read MuDiv in multiplier.scala, and will at some point run some
>> timings in the cycle-accurate simulator.
>>
>> In either case, the spurious stack moves emitted by GCC are curious...
>>
>>> On 26 Mar 2016, at 9:42 AM, Michael Clark wrote:
>>>
>>> Hi All,
>>>
>>> I have found an interesting case where an optimisation is not being applied
>>> by GCC on RISC-V.
>>> And also some strange assembly output from GCC on RISC-V.
>>>
>>> Both GCC and Clang appear to optimise division by a constant Mersenne prime
>>> on x86_64, however GCC on RISC-V is not applying this optimisation.
>>>
>>> See test program and assembly output for these platforms:
>>>
>>> * GCC -O3 on RISC-V
>>> * GCC -O3 on x86_64
>>> * LLVM/Clang -O3 on x86_64
>>>
>>> Another strange observation is GCC on RISC-V is moving a1 to a5 via a stack
>>> store followed by a stack load. Odd? GCC 5 also seems to be doing odd stuff
>>> with stack 'moves' on x86_64, moving esi to ecx via the stack (I think
>>> recent x86 micro-architecture treats the tip of the stack like an extended
>>> register file so this may only have a small penalty on x86).
>>>
>>> See GCC on RISC-V is emitting this:
>>>
>>> test_0:
>>>     add     sp,sp,-16
>>
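For what it's worth, a way to keep the control call alive without qualifying the parameter as volatile is to launder the argument and the result through an empty asm at the call site. This is only a sketch of the idea and I haven't timed it; opaque_int is my own helper name, not anything from the test program above:

    /* Defeat constant propagation into the call and dead code elimination
       of the result, without changing test_0's signature or codegen. */
    static inline int opaque_int(int x)
    {
        __asm__ volatile ("" : "+r"(x));   /* compiler treats x as unknown */
        return x;
    }

    /* at the call site in main() */
    int r0 = test_0(1, opaque_int(8191));
    __asm__ volatile ("" :: "r"(r0));      /* keep the result "used" */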
Re: coding question
Hi Cesar,

> On 12/12/2018, at 10:31 AM, Moreno, Cesar A wrote:
>
> How do I malloc or new an array of complex numbers in GNU C++ code?
> MALLOC does not work, how do I use MALLOC correctly or NEW?
>
> I made a struct called complex (for a complex number with realp and imagp)
> And I want to make an array of 128 locations (where each location includes
> a realp and imagp)

use a vector and a range-based for loop, and don't hard code a constant for the cardinality of a set. the compiler knows how to do this safely e.g.:

    #include <cstdio>
    #include <vector>

    struct ℂ { float r; float i; };

    std::vector<ℂ> v = { { 1, 0 }, { 0, 1 } };

    size_t i = 0;
    for (auto &c : v) {
        printf("%zu: (%f, %f)\n", i++, c.r, c.i);
    }

I haven't compiled this so it might have some coding bugs. also your local type aliases may differ from mine.

Michael

> Struct complex
> {
> Float realp;
> Float imagp;
> }
>
> // tdin is a pointer to an array of complex numbers
> complex* tdin;
>
>
> // form the array with MALLOC or whatever works in GNU C++
> tdin = (complex*) malloc ( 128 * sizeof (complex) );
>
>
> then I can print the contents of the array with
>
> for (int k; k<128; k++)
> { cout << tdin[k].realp << " and " << tdin[k].imagp << "endl";
>
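To answer the question as literally asked: malloc and new[] both work for a plain struct like this, you just have to give new[] the element count and malloc the total byte size, and remember malloc does not run constructors. A rough sketch (names are mine; I renamed the struct to avoid confusion with std::complex):

    #include <cstdlib>
    #include <vector>

    struct my_complex { float realp; float imagp; };

    my_complex *a = new my_complex[128];                                   // new[] an array of 128
    my_complex *b = (my_complex*) std::malloc(128 * sizeof(my_complex));   // C-style, uninitialised
    std::vector<my_complex> c(128);                                        // what I'd actually use

    /* ... use a[k].realp, b[k].imagp, c[k].realp ... */

    delete [] a;
    std::free(b);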
x86 branches vs conditional moves
Hi,

Curious about this codegen:

- https://godbolt.org/g/5XxP5S

Why does gcc branch on _Bool, but emit a conditional move for an integer? Can it emit cmovne instead of branching? I'm also curious where one would change this, to learn about GCC internals. It's not a bug, but it is a performance issue [1]. I was just curious under which conditions the ternary operator is lowered to cmov on x86 and found this difference in lowering.

Michael

[1] https://github.com/xiadz/cmov

    #include <stdbool.h>

    extern int a;
    extern int b;
    extern int c;
    extern _Bool C;

    int select_int() { return c ? a : b; }
    _Bool select_bool() { return C ? a : b; }
    _Bool a_bool() { return 2; }

select_int():
        mov     eax, DWORD PTR c[rip]
        test    eax, eax
        mov     eax, DWORD PTR a[rip]
        cmove   eax, DWORD PTR b[rip]
        ret
select_bool():
        cmp     BYTE PTR C[rip], 0
        jne     .L8
        mov     eax, DWORD PTR b[rip]
        test    eax, eax
        setne   al
        ret
.L8:
        mov     edx, DWORD PTR a[rip]
        test    edx, edx
        setne   al
        ret
a_bool():
        mov     eax, 1
        ret
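A source-level workaround I have been experimenting with, offered only as an untested sketch: do the select in int, the shape GCC was willing to lower to cmove above, and convert to _Bool at the end.

    /* same externs as above */
    _Bool select_bool2() {
        int t = C ? a : b;   /* same shape as select_int() */
        return t != 0;
    }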
Redundant loads for bitfield accesses
Hi,

Is there any reason for 3 loads being issued for these bitfield accesses, given two of the loads are bytes, and one is a half; the compiler appears to know the structure is aligned at a half word boundary. Secondly, the riscv code is using a mixture of 32-bit and 64-bit adds and shifts. Thirdly, with -Os the riscv code size is the same, but the schedule is less than optimal, i.e. the 3rd load is issued much later.

- https://cx.rv8.io/g/2YDLTA

code:

    struct foo {
      unsigned int a : 5;
      unsigned int b : 5;
      unsigned int c : 5;
    };

    unsigned int proc_foo(struct foo *p)
    {
        return p->a + p->b + p->c;
    }

riscv asm:

    proc_foo(foo*):
      lhu     a3,0(a0)
      lbu     a4,0(a0)
      lbu     a5,1(a0)
      srliw   a3,a3,5
      andi    a0,a4,31
      srli    a5,a5,2
      andi    a4,a3,31
      addw    a0,a0,a4
      andi    a5,a5,31
      add     a0,a0,a5
      ret

x86_64 asm:

    proc_foo(foo*):
      movzx   edx, BYTE PTR [rdi]
      movzx   eax, WORD PTR [rdi]
      mov     ecx, edx
      shr     ax, 5
      and     eax, 31
      and     ecx, 31
      lea     edx, [rcx+rax]
      movzx   eax, BYTE PTR [rdi+1]
      shr     al, 2
      and     eax, 31
      add     eax, edx
      ret

hand coded riscv asm:

    proc_foo(foo*):
      lhu     a1,0(a0)
      srli    a2,a1,5
      srli    a3,a1,10
      andi    a0,a1,31
      andi    a2,a2,31
      andi    a3,a3,31
      add     a0,a0,a2
      add     a0,a0,a3
      ret

Michael
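In C terms, this is roughly the transformation I would like the compiler to make (my own sketch, not part of the original report; it assumes GCC's usual little-endian bit-field layout, so the three 5-bit fields occupy the low 15 bits of the first half word):

    #include <string.h>

    unsigned int proc_foo_manual(const struct foo *p)
    {
        unsigned short w;
        memcpy(&w, p, sizeof w);   /* single half-word load */
        return (w & 31) + ((w >> 5) & 31) + ((w >> 10) & 31);
    }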
Re: Redundant loads for bitfield accesses
Here's a more extreme example:

- https://cx.rv8.io/g/2HWQje

The bitfield type is unsigned int, so one or two 32-bit loads should suffice (depending on register pressure). GCC is issuing a lw at some point in the asm.

    struct foo {
      unsigned int a : 3;
      unsigned int b : 3;
      unsigned int c : 3;
      unsigned int d : 3;
      unsigned int e : 3;
      unsigned int f : 3;
      unsigned int g : 3;
      unsigned int h : 3;
      unsigned int i : 3;
      unsigned int j : 3;
    };

    unsigned int proc_foo(struct foo *p)
    {
        return p->a + p->b + p->c + p->d + p->d +
               p->e + p->f + p->g + p->h + p->i + p->j;
    }

> On 17 Aug 2017, at 10:29 AM, Michael Clark wrote:
>
> Hi,
>
> Is there any reason for 3 loads being issued for these bitfield accesses,
> given two of the loads are bytes, and one is a half; the compiler appears to
> know the structure is aligned at a half word boundary. Secondly, the riscv
> code is using a mixture of 32-bit and 64-bit adds and shifts. Thirdly, with
> -Os the riscv code size is the same, but the schedule is less than optimal.
> i.e. the 3rd load is issued much later.
>
> - https://cx.rv8.io/g/2YDLTA
>
> code:
>
> struct foo {
>   unsigned int a : 5;
>   unsigned int b : 5;
>   unsigned int c : 5;
> };
>
> unsigned int proc_foo(struct foo *p)
> {
>     return p->a + p->b + p->c;
> }
>
> riscv asm:
>
> proc_foo(foo*):
>   lhu     a3,0(a0)
>   lbu     a4,0(a0)
>   lbu     a5,1(a0)
>   srliw   a3,a3,5
>   andi    a0,a4,31
>   srli    a5,a5,2
>   andi    a4,a3,31
>   addw    a0,a0,a4
>   andi    a5,a5,31
>   add     a0,a0,a5
>   ret
>
> x86_64 asm:
>
> proc_foo(foo*):
>   movzx   edx, BYTE PTR [rdi]
>   movzx   eax, WORD PTR [rdi]
>   mov     ecx, edx
>   shr     ax, 5
>   and     eax, 31
>   and     ecx, 31
>   lea     edx, [rcx+rax]
>   movzx   eax, BYTE PTR [rdi+1]
>   shr     al, 2
>   and     eax, 31
>   add     eax, edx
>   ret
>
> hand coded riscv asm:
>
> proc_foo(foo*):
>   lhu     a1,0(a0)
>   srli    a2,a1,5
>   srli    a3,a1,10
>   andi    a0,a1,31
>   andi    a2,a2,31
>   andi    a3,a3,31
>   add     a0,a0,a2
>   add     a0,a0,a3
>   ret
>
> Michael
Re: Redundant loads for bitfield accesses
> On 17 Aug 2017, at 10:41 AM, Andrew Pinski wrote:
>
> On Wed, Aug 16, 2017 at 3:29 PM, Michael Clark wrote:
>> Hi,
>>
>> Is there any reason for 3 loads being issued for these bitfield accesses,
>> given two of the loads are bytes, and one is a half; the compiler appears to
>> know the structure is aligned at a half word boundary. Secondly, the riscv
>> code is using a mixture of 32-bit and 64-bit adds and shifts. Thirdly, with
>> -Os the riscv code size is the same, but the schedule is less than optimal.
>> i.e. the 3rd load is issued much later.
>
> Well one thing is most likely SLOW_BYTE_ACCESS is set to 0. This
> forces byte access for bit-field accesses. The macro is misnamed now
> as it only controls bit-field accesses right now (and one thing in
> dojump dealing with comparisons with and and a constant but that might
> be dead code). This should allow for you to get the code in hand
> written form.
> I suspect SLOW_BYTE_ACCESS support should be removed and be assumed to
> be 1 but I have not time to look into each backend to see if it is
> correct to do or not. Maybe it is wrong for AVR.

Thanks, that's interesting. So I should try compiling the riscv backend with SLOW_BYTE_ACCESS = 1? Less risk than making a change to x86.

This is clearly distinct from slow unaligned access. It seems odd that -O3 doesn't coalesce the loads even if byte access is slow: one would expect the cost of the additional loads to outweigh the fact that byte accesses are not slow, unless something weird is happening with the costs of loads of different widths.

x86 could also be helped here too. I guess subsequent loads will be served from L1, but that's not really an excuse for this codegen when the element is 32-bit aligned (unsigned int).
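Concretely, the experiment would just be the following one-liner (assuming gcc/config/riscv/riscv.h in the riscv-gcc tree is the right place for the target macro; I haven't verified the exact location):

    /* gcc/config/riscv/riscv.h -- tell GCC byte access is not cheap, so
       bit-field accesses are not forced down to byte width. */
    #define SLOW_BYTE_ACCESS 1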
Bit-field struct member sign extension pattern results in redundant
Sorry I had to send again as my Apple mailer is munging emails. I've disabled RTF.

This one is quite interesting:

- https://cx.rv8.io/g/WXWMTG

It's another target independent bug. x86 is using some LEA followed by SAR trick with a 3 bit shift. Surely SHL 27, SAR 27 would suffice. In any case RISC-V seems like a nice target to try to fix this codegen for, as it's less risk than attempting a fix in x86 ;-)

- https://github.com/riscv/riscv-gcc/issues/89

code:

    template <typename T, unsigned B>
    inline T signextend(const T x)
    {
        struct {T x:B;} s;
        return s.x = x;
    }

    int sx5(int x) {
        return signextend<int,5>(x);
    }

riscv asm:

    sx5(int):
      slliw   a0,a0,3
      slliw   a0,a0,24
      sraiw   a0,a0,24
      sraiw   a0,a0,3
      ret

hand coded riscv asm

    sx5(int):
      slliw   a0,a0,27
      sraiw   a0,a0,27
      ret

x86 asm:

    sx5(int):
      lea     eax, [0+rdi*8]
      sar     al, 3
      movsx   eax, al
      ret

hand coded x86 asm (no worse because the sar depends on the lea)

    sx5(int):
      shl     edi, 27
      sar     edi, 27
      movsx   eax, dl
      ret
Re: Bit-field struct member sign extension pattern results in redundant
> On 18 Aug 2017, at 10:41 AM, Andrew Pinski wrote:
>
> On Thu, Aug 17, 2017 at 3:29 PM, Michael Clark wrote:
>> Sorry I had to send again as my Apple mailer is munging emails. I've
>> disabled RTF.
>>
>>
>> This one is quite interesting:
>>
>> - https://cx.rv8.io/g/WXWMTG
>>
>> It's another target independent bug. x86 is using some LEA followed by SAR
>> trick with a 3 bit shift. Surely SHL 27, SAR 27 would suffice. In any case
>> RISC-V seems like a nice target to try to fix this codegen for, as its less
>> risk than attempting a fix in x86 ;-)
>>
>> - https://github.com/riscv/riscv-gcc/issues/89
>>
>> code:
>>
>>    template <typename T, unsigned B>
>>    inline T signextend(const T x)
>>    {
>>        struct {T x:B;} s;
>>        return s.x = x;
>>    }
>>
>>    int sx5(int x) {
>>        return signextend<int,5>(x);
>>    }
>
> on AARCH64 I get:
>    sbfiz   w0, w0, 3, 5
>    asr     w0, w0, 3
>    ret

So it is a bug on arm too? and can it be done with one sbfiz instruction? (assuming I'm understanding sbfiz from my first reading) e.g.

    sbfiz   w0, w0, 0, 2
    ret

> Which is:
> (insn 7 6 8 2 (set (reg:SI 80)
>    (sign_extend:SI (ashift:QI (reg:QI 0 x0 [ x+3 ])
>        (const_int 3 [0x3] t.cc:5 557 {*extendsi_ashlqi}
>   (expr_list:REG_DEAD (reg:SI 0 x0 [ x ])
>        (nil)))
>
> (insn 14 9 17 2 (set (reg/i:SI 0 x0)
>    (ashiftrt:SI (reg:SI 80)
>        (const_int 3 [0x3]))) t.cc:10 530 {*aarch64_ashr_sisd_or_int_si3}
>   (expr_list:REG_DEAD (reg:SI 80)
>        (nil)))
>
> What I suspect is we get a QImode register in there and that is what
> is happening to both x86 and riscv.

Curious. Thanks. I still need to learn more about gcc pattern matching and lowering.

I actually have this exact code pattern using a struct member to do arbitrary width sign extend, as it's the first method for arbitrary bit width extension in Google and on the Stanford Bit Twiddling Hacks page. I understand I'm better off changing my pattern to an explicit shift left and shift right, given I'm now aware this is unambiguously the optimal lowering for ISAs with 1 cycle shifters and no field extract instructions; however, given others are likely to stumble on the Stanford page, I wouldn't be surprised if I'm not the only one using this pattern:

- https://graphics.stanford.edu/~seander/bithacks.html#FixedSignExtend

The x86 sequence is kind of a trick and it's no better off in instruction count by using two shifts. I'm also annoyed that BZHI in BMI2 is an arbitrary width zero extend versus a bit extend. I'd like a BEXT-like instruction that copies a bit at offset n, but given it is only a two instruction saving it should also do a right shift so it can be used for signed and unsigned field extract (with a bit copy or bit zero operand bit).

Thanks,
Michael.
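For reference, the explicit-shift version of the same helper that I'm planning to switch to would look something like this (my own untested sketch; the left shift is done on the unsigned type to avoid signed-overflow UB, and the final arithmetic right shift of a negative value is implementation-defined but is what GCC does):

    #include <type_traits>

    template <typename T, unsigned B>
    inline T signextend_shift(const T x)
    {
        typedef typename std::make_unsigned<T>::type U;
        const unsigned shift = sizeof(T) * 8 - B;
        return (T)((U)x << shift) >> shift;   /* arithmetic right shift of T */
    }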
Re: Bit-field struct member sign extension pattern results in redundant
> On 18 Aug 2017, at 11:13 AM, Michael Clark wrote:
>
> So it is a bug on arm too? and can it be done with one sbfiz instruction?
> (assuming I'm understanding sbfiz from my first reading) e.g.
>
>    sbfiz   w0, w0, 0, 2
>    ret

Getting my 3's and 5's swapped. Confused by gcc.

    sbfiz   w0, w0, 0, 4

>> Which is:
>> (insn 7 6 8 2 (set (reg:SI 80)
>>     (sign_extend:SI (ashift:QI (reg:QI 0 x0 [ x+3 ])
>>         (const_int 3 [0x3] t.cc:5 557 {*extendsi_ashlqi}
>>    (expr_list:REG_DEAD (reg:SI 0 x0 [ x ])
>>         (nil)))
>>
>> (insn 14 9 17 2 (set (reg/i:SI 0 x0)
>>     (ashiftrt:SI (reg:SI 80)
>>         (const_int 3 [0x3]))) t.cc:10 530 {*aarch64_ashr_sisd_or_int_si3}
>>    (expr_list:REG_DEAD (reg:SI 80)
>>         (nil)))
>>
>> What I suspect is we get a QImode register in there and that is what
>> is happening to both x86 and riscv.
>
Re: Bit-field struct member sign extension pattern results in redundant
> On 18 Aug 2017, at 10:41 PM, Gabriel Paubert wrote:
>
> On Fri, Aug 18, 2017 at 10:29:04AM +1200, Michael Clark wrote:
>> Sorry I had to send again as my Apple mailer is munging emails. I've
>> disabled RTF.
>>
>>
>> This one is quite interesting:
>>
>> - https://cx.rv8.io/g/WXWMTG
>>
>> It's another target independent bug. x86 is using some LEA followed by SAR
>> trick with a 3 bit shift. Surely SHL 27, SAR 27 would suffice. In any case
>> RISC-V seems like a nice target to try to fix this codegen for, as its less
>> risk than attempting a fix in x86 ;-)
>>
>> - https://github.com/riscv/riscv-gcc/issues/89
>>
>> code:
>>
>> template <typename T, unsigned B>
>> inline T signextend(const T x)
>> {
>>     struct {T x:B;} s;
>>     return s.x = x;
>> }
>>
>> int sx5(int x) {
>>     return signextend<int,5>(x);
>> }
>>
>> riscv asm:
>>
>> sx5(int):
>>    slliw   a0,a0,3
>>    slliw   a0,a0,24
>>    sraiw   a0,a0,24
>>    sraiw   a0,a0,3
>>    ret
>>
>> hand coded riscv asm
>>
>> sx5(int):
>>    slliw   a0,a0,27
>>    sraiw   a0,a0,27
>>    ret
>>
>> x86 asm:
>>
>> sx5(int):
>>    lea     eax, [0+rdi*8]
>>    sar     al, 3
>>    movsx   eax, al
>>    ret
>>
>> hand coded x86 asm (no worse because the sar depends on the lea)
>>
>> sx5(int):
>>    shl     edi, 27
>>    sar     edi, 27
>>    movsx   eax, dl
>
> Huh? dl is not a subreg of edi!
>
> s/edi/edx/ and it may work.
>
> dil can also be used, but only on 64 bit.

Sorry, I meant dil on x86-64. I was sure that it was possible to extend into another register. I have not done much i386 asm so I am unaware of the constraints. Can the source and dest registers for movsx not differ on i386? I thought they could.

In any case, the plot thickens…

I believe we have bugs on both RISC-V and Aarch64.

I found that it at least appears like it is transitioning to a char or short, as the break is at 24 and 16 depending on the width, and values over 16 work as one would expect.

Here is an updated test program: https://cx.rv8.io/g/M9ewNf

    template <typename T, unsigned B>
    inline T signextend(const T x)
    {
        struct {T x:B;} s;
        return s.x = x;
    }

    int sx3(int x)  { return signextend<int,3>(x); }
    int sx5(int x)  { return signextend<int,5>(x); }
    int sx11(int x) { return signextend<int,11>(x); }
    int sx14(int x) { return signextend<int,14>(x); }
    int sx19(int x) { return signextend<int,19>(x); }

I filed a bug on riscv-gcc but I think it is target independent code given there appears to be an issue on Aarch64. AFAICT, Aarch64 should generate a single sbfx for all of the test functions.

- https://github.com/riscv/riscv-gcc/issues/89

Should I file a bug on GCC bugzilla given it looks to be target independent?

On RISC-V, the codegen is much more obviously wrong, but essentially the same thing is happening on Aarch64; there is only one additional instruction instead of two.

    sx3(int):
      slliw   a0,a0,5
      slliw   a0,a0,24
      sraiw   a0,a0,24
      sraiw   a0,a0,5
      ret
    sx5(int):
      slliw   a0,a0,3
      slliw   a0,a0,24
      sraiw   a0,a0,24
      sraiw   a0,a0,3
      ret
    sx11(int):
      slliw   a0,a0,5
      slliw   a0,a0,16
      sraiw   a0,a0,16
      sraiw   a0,a0,5
      ret
    sx14(int):
      slliw   a0,a0,2
      slliw   a0,a0,16
      sraiw   a0,a0,16
      sraiw   a0,a0,2
      ret
    sx19(int):
      slliw   a0,a0,13
      sraiw   a0,a0,13
      ret
Re: Bit-field struct member sign extension pattern results in redundant
> On 18 Aug 2017, at 10:56 PM, Michael Clark wrote:
>
>>
>> On 18 Aug 2017, at 10:41 PM, Gabriel Paubert wrote:
>>
>> On Fri, Aug 18, 2017 at 10:29:04AM +1200, Michael Clark wrote:
>>> Sorry I had to send again as my Apple mailer is munging emails. I've
>>> disabled RTF.
>>>
>>>
>>> This one is quite interesting:
>>>
>>> - https://cx.rv8.io/g/WXWMTG
>>>
>>> It's another target independent bug. x86 is using some LEA followed by SAR
>>> trick with a 3 bit shift. Surely SHL 27, SAR 27 would suffice. In any case
>>> RISC-V seems like a nice target to try to fix this codegen for, as its less
>>> risk than attempting a fix in x86 ;-)
>>>
>>> - https://github.com/riscv/riscv-gcc/issues/89
>>>
>>> code:
>>>
>>> template <typename T, unsigned B>
>>> inline T signextend(const T x)
>>> {
>>>     struct {T x:B;} s;
>>>     return s.x = x;
>>> }
>>>
>>> int sx5(int x) {
>>>     return signextend<int,5>(x);
>>> }
>>>
>>> riscv asm:
>>>
>>> sx5(int):
>>>   slliw   a0,a0,3
>>>   slliw   a0,a0,24
>>>   sraiw   a0,a0,24
>>>   sraiw   a0,a0,3
>>>   ret
>>>
>>> hand coded riscv asm
>>>
>>> sx5(int):
>>>   slliw   a0,a0,27
>>>   sraiw   a0,a0,27
>>>   ret
>>>
>>> x86 asm:
>>>
>>> sx5(int):
>>>   lea     eax, [0+rdi*8]
>>>   sar     al, 3
>>>   movsx   eax, al
>>>   ret
>>>
>>> hand coded x86 asm (no worse because the sar depends on the lea)
>>>
>>> sx5(int):
>>>   shl     edi, 27
>>>   sar     edi, 27
>>>   movsx   eax, dl
>>
>> Huh? dl is not a subreg of edi!
>>
>> s/edi/edx/ and it may work.
>>
>> dil can also be used, but only on 64 bit.

In any case it is more straightforward given we are just dealing with an int bitfield member in this case, so a regular move would suffice. The mov gets retired in rename (movsx and movzx can both be retired in rename I think). The same pattern could be used for all 5 function examples, with predictable performance, versus 5 different solutions for the 5 different bitfield positions (it's actually very amazing that the pattern matching on x86 finds 5 unique ways to lower this code pattern). Pardon my Intel syntax. e.g.

    sx5(int):
      shl     edi, 27
      sar     edi, 27
      mov     eax, edi

However I'm more interested in what the fix would be for RISC-V and Aarch64. It's less obvious how to fix it on x86-64 because there are too many different ways. On RISC-V there is only one obvious way, and that is to not artificially narrow the type when we are extracting from an int aligned struct bit field member to an int. Likewise on Aarch64, one sbfiz instruction.

> Sorry I meant dil on x86-64. I was sure that it was possible to extend into
> another register. I have not done much i386 asm so I am unaware of the
> constraints. Can the source and dest registers for movsx not differ on i386?
> I thought they could.
>
> In any case, the plot thickens…
>
> I believe we have bugs on both RISC-V and Aarch64.
>
> I found that it at least appears like it is transitioning to a char or short
> as the break is at 24 and 16 depending on the width, and values over 16 work
> as one would expect.
>
> Here is an updated test program: https://cx.rv8.io/g/M9ewNf
>
> template <typename T, unsigned B>
> inline T signextend(const T x)
> {
>     struct {T x:B;} s;
>     return s.x = x;
> }
>
> int sx3(int x)  { return signextend<int,3>(x); }
> int sx5(int x)  { return signextend<int,5>(x); }
> int sx11(int x) { return signextend<int,11>(x); }
> int sx14(int x) { return signextend<int,14>(x); }
> int sx19(int x) { return signextend<int,19>(x); }
>
> I filed a bug on riscv-gcc but I think it is target independent code given
> there appears to be an issue on Aarch64. AFAICT, Aarch64 should generate a
> single sbfx for all of the test functions.
>
> - https://github.com/riscv/riscv-gcc/issues/89
>
> Should I file a bug on GCC bugzilla given it looks to be target independent?
>
> On RISC-V, the codegen is much more obviously wrong, but essentially the same
> thing is happening on Aarch64 but there is only one additional instruction
> instead of two.
>
> sx3(int):
>   slliw   a0,a0,5
>   slliw   a0,a0,24
>   sraiw   a0,a0,24
>   sraiw   a0,a0,5
>   ret
> sx5(int):
>   slliw   a0,a0,3
>   slliw   a0,a0,24
>   sraiw   a0,a0,24
>   sraiw   a0,a0,3
>   ret
> sx11(int):
>   slliw   a0,a0,5
>   slliw   a0,a0,16
>   sraiw   a0,a0,16
>   sraiw   a0,a0,5
>   ret
> sx14(int):
>   slliw   a0,a0,2
>   slliw   a0,a0,16
>   sraiw   a0,a0,16
>   sraiw   a0,a0,2
>   ret
> sx19(int):
>   slliw   a0,a0,13
>   sraiw   a0,a0,13
>   ret
Quantitative analysis of -Os vs -O3
Dear GCC folk,

I have to say that GCC's -Os caught me by surprise after several years using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year and a half I have been working on RISC-V development and have been exclusively using GCC for RISC-V builds, and initially I was using -Os. After performing a qualitative/quantitative assessment I don't believe GCC's current -Os is particularly useful, at least for my needs, as it doesn't provide a commensurate saving in size given the sometimes quite huge drop in performance.

I'm quoting an extract from Eric's earlier email on the "Overwhelmed by GCC frustration" thread, as I think Apple's documentation, which presumably documents Clang/LLVM -Os policy, is what I would call an ideal -Os (perhaps using -O2 as a starting point), with the idea that the current -Os is renamed to -Oz.

    -Oz
        (APPLE ONLY) Optimize for size, regardless of performance. -Oz
        enables the same optimization flags that -Os uses, but -Oz also
        enables other optimizations intended solely to reduce code size.
        In particular, instructions that encode into fewer bytes are
        preferred over longer instructions that execute in fewer cycles.
        -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
        -Oz employs the same inlining limits and avoids string instructions
        just like -Os.

    -Os
        Optimize for size, but not at the expense of speed. -Os enables all
        -O2 optimizations that do not typically increase code size.
        However, instructions are chosen for best performance, regardless
        of size. To optimize solely for size on Darwin, use -Oz (APPLE ONLY).

I have recently been working on a benchmark suite to test a RISC-V JIT engine. I have performed all testing using GCC 7.1 as the baseline compiler, and during the process I have collected several performance metrics, some that are neutral to the JIT runtime environment. In particular I have made performance comparisons between -Os and -O3 on x86, along with capturing executable file sizes, dynamic retired instruction and micro-op counts for x86, dynamic retired instruction counts for RISC-V, as well as dynamic register and instruction usage histograms for RISC-V, for both -Os and -O3.

See the Optimisation section for a charted performance comparison between -O3 and -Os. There are dozens of other plots that show the differences between -Os and -O3.

- https://rv8.io/bench

The Geomean on x86 shows a 19% performance hit for -Os vs -O3. The Geomean of course smooths over some pathological cases where -Os performance is severely degraded versus -O3, but not with significant, or commensurate, savings in size.

I don't currently have -O2 in my results, however it seems like I should add -O2 to the benchmark suite. If you take a look at the web page you'll see that there is already a huge amount of data, given we have captured dynamic register frequencies and dynamic instruction frequencies for -Os and -O3. The tables and charts are all generated by scripts, so if there is interest I could add -O2. I can also pretty easily perform runs with new compiler versions as everything is completely automated. The biggest factor is that it currently takes 4 hours for a full run, as we run all of the benchmarks in a simulator to capture dynamic register usage and dynamic instruction usage.

After looking at the results, one has to question the utility of -Os in its present form, and indeed question how it is actually used in practice, given the proportion of savings in executable size.
After my assessment I would not recommend that anyone use -Os, because its savings in size are not proportionate to the loss in performance. I feel discouraged from using it after looking at the results. I don't believe -Os makes the right trades: reducing icache pressure via smaller code can indeed lead to better performance, but that is not what these results show. I also wonder whether -O2 level optimisations may be a good starting point for a more useful -Os, and how one would proceed towards selecting optimisations to add back to -Os to increase its usability, or rename the current -Os to -Oz and make -Os an alias for -O2. A similar profile to -O2 would probably produce less shock for anyone who does quantitative performance analysis of -Os.

In fact there are some interesting issues for the RISC-V backend, given the assembler performs RVC compression and GCC doesn't really see the size of emitted instructions. It would be an interesting backend to investigate improving -Os, presuming that a backend can opt in to various optimisations for a given optimisation level. RISC-V would gain most of its size and runtime icache pressure reduction improvements by getting the highest frequency registers allocated within the 8 register set that is accessi
Re: Quantitative analysis of -Os vs -O3
> On 26 Aug 2017, at 8:39 PM, Andrew Pinski wrote: > > On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark wrote: >> Dear GCC folk, >> I have to say that’s GCC’s -Os caught me by surprise after several years >> using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year >> and a half I have been working on RISC-V development and have been >> exclusively using GCC for RISC-V builds, and initially I was using -Os. >> After performing a qualitative/quantitative assessment I don’t believe GCC’s >> current -Os is particularly useful, at least for my needs as it doesn’t >> provide a commensurate saving in size given the sometimes quite huge drop in >> performance. >> >> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC >> frustration thread, as I think Apple’s documentation which presumably >> documents Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps >> using -O2 as a starting point) with the idea that the current -Os is renamed >> to -Oz. >> >>-Oz >> (APPLE ONLY) Optimize for size, regardless of performance. -Oz >> enables the same optimization flags that -Os uses, but -Oz also >> enables other optimizations intended solely to reduce code >> size. >> In particular, instructions that encode into fewer bytes are >> preferred over longer instructions that execute in fewer >> cycles. >> -Oz on Darwin is very similar to -Os in FSF distributions of >> GCC. >> -Oz employs the same inlining limits and avoids string >> instructions >> just like -Os. >> >>-Os >> Optimize for size, but not at the expense of speed. -Os >> enables all >> -O2 optimizations that do not typically increase code size. >> However, instructions are chosen for best performance, >> regardless >> of size. To optimize solely for size on Darwin, use -Oz (APPLE >> ONLY). >> >> I have recently been working on a benchmark suite to test a RISC-V JIT >> engine. I have performed all testing using GCC 7.1 as the baseline compiler, >> and during the process I have collected several performance metrics, some >> that are neutral to the JIT runtime environment. In particular I have made >> performance comparisons between -Os and -O3 on x86, along with capturing >> executable file sizes, dynamic retired instruction and micro-op counts for >> x86, dynamic retired instruction counts for RISC-V as well as dynamic >> register and instruction usage histograms for RISC-V, for both -Os and -O3. >> >> See the Optimisation section for a charted performance comparison between >> -O3 and -Os. There are dozens of other plots that show the differences >> between -Os and -O3. >> >>- https://rv8.io/bench >> >> The Geomean on x86 shows a 19% performance hit for -Os vs -O3 on x86. The >> Geomean of course smooths over some pathological cases where -Os performance >> is severely degraded versus -O3 but not with significant, or commensurate >> savings in size. > > > First let me put into some perspective on -Os usage and some history: > 1) -Os is not useful for non-embedded users > 2) the embedded folks really need the smallest code possible and > usually will be willing to afford the performance hit > 3) -Os was a mistake for Apple to use in the first place; they used it > and then GCC got better for PowerPC to use the string instructions > which is why -Oz was added :) > 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications. > > Comparing -O3 to -Os is not totally fair on x86 due to the many > different instructions and encodings. 
> Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
> big issue.
> I soon have a need to keep overall (bare-metal) application size down
> to just 256k.
> Micro-controllers are places where -Os matters the most.

Fair points.

- Size at all costs is useful for the embedded case where there is a restricted footprint.
- It's fair to compare on RISC-V, which has the RVC compressed ISA extension, which is conceptually similar to Thumb-2.
- Understand renaming -Os to -Oz would cause a few downstream issues for those who expect size at all costs.
- There is an achievable use-case for good RVC compression and good performance on RISC-V.

However the question remains: what options does one choose for size, but not size at the expense of speed. -O2 and an -mtune? I'm probably interested in an -O2 with an -mtune that
Re: Quantitative analysis of -Os vs -O3
FYI - I’ve updated the stats to include -O2 in addition to -O3 and -Os: - https://rv8.io/bench#optimisation There are 57 plots and 31 tables. It’s quite a bit of data. It will be quite interesting to run these on new gcc releases to monitor changes. The Geomean for -O2 is 0.98 of -O3 on x86-64. I probably need to add some tables that show file sizes per architecture side by side, versus the current grouping which is by optimisation level to allow comparisons between architectures. If I pivot the data, we can add file size ratios by optimisation level per architecture. Note: these are relatively small benchmark programs, however the stats are still interesting. I’m most interested in RISC-V register allocation at present. -O2 does pretty well on file size compared to -O3, on all architectures. At a glance, the -O2 file sizes are slightly larger than the -Os file sizes but the performance increase is considerably more. I could perhaps show ratios of performance vs size between -O2 and -Os. > On 26 Aug 2017, at 10:05 PM, Michael Clark wrote: > >> >> On 26 Aug 2017, at 8:39 PM, Andrew Pinski wrote: >> >> On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark wrote: >>> Dear GCC folk, >>> I have to say that’s GCC’s -Os caught me by surprise after several years >>> using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year >>> and a half I have been working on RISC-V development and have been >>> exclusively using GCC for RISC-V builds, and initially I was using -Os. >>> After performing a qualitative/quantitative assessment I don’t believe >>> GCC’s current -Os is particularly useful, at least for my needs as it >>> doesn’t provide a commensurate saving in size given the sometimes quite >>> huge drop in performance. >>> >>> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC >>> frustration thread, as I think Apple’s documentation which presumably >>> documents Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps >>> using -O2 as a starting point) with the idea that the current -Os is >>> renamed to -Oz. >>> >>> -Oz >>> (APPLE ONLY) Optimize for size, regardless of performance. -Oz >>> enables the same optimization flags that -Os uses, but -Oz also >>> enables other optimizations intended solely to reduce code >>> size. >>> In particular, instructions that encode into fewer bytes are >>> preferred over longer instructions that execute in fewer >>> cycles. >>> -Oz on Darwin is very similar to -Os in FSF distributions of >>> GCC. >>> -Oz employs the same inlining limits and avoids string >>> instructions >>> just like -Os. >>> >>> -Os >>> Optimize for size, but not at the expense of speed. -Os >>> enables all >>> -O2 optimizations that do not typically increase code size. >>> However, instructions are chosen for best performance, >>> regardless >>> of size. To optimize solely for size on Darwin, use -Oz (APPLE >>> ONLY). >>> >>> I have recently been working on a benchmark suite to test a RISC-V JIT >>> engine. I have performed all testing using GCC 7.1 as the baseline >>> compiler, and during the process I have collected several performance >>> metrics, some that are neutral to the JIT runtime environment. In >>> particular I have made performance comparisons between -Os and -O3 on x86, >>> along with capturing executable file sizes, dynamic retired instruction and >>> micro-op counts for x86, dynamic retired instruction counts for RISC-V as >>> well as dynamic register and instruction usage histograms for RISC-V, for >>> both -Os and -O3. 
>>> >>> See the Optimisation section for a charted performance comparison between >>> -O3 and -Os. There are dozens of other plots that show the differences >>> between -Os and -O3. >>> >>> - https://rv8.io/bench >>> >>> The Geomean on x86 shows a 19% performance hit for -Os vs -O3 on x86. The >>> Geomean of course smooths over some pathological cases where -Os >>> performance is severely degraded versus -O3 but not with significant, or >>> commensurate savings in size. >> >> >> First let me put into some perspective on -Os usage and some history: >> 1) -Os is not useful for non-embedded users >> 2) the embedded
Redundant sign-extension instructions on RISC-V
Dear GCC folk,

# Issue Background

We're investigating an issue with redundant sign-extension instructions being emitted with the riscv backend. Firstly I would like to state that riscv is possibly a unique backend with respect to its canonical sign-extended register form due to the following:

- POINTERS_EXTEND_UNSIGNED is false, and the canonical form after 32-bit operations on RV64 is sign-extended not zero-extended.

- PROMOTE_MODE is defined on RV64 such that SI mode values are promoted to signed DI mode values holding SI mode subregs.

- RISC-V does not have register aliases for these different modes, rather different instructions are selected to operate on different register widths.

- The 32-bit instructions sign-extend results, e.g. all shifts, add, sub, etc.

- Some instructions such as logical ops only have full word width variants (AND, OR, XOR) but these instructions maintain canonical form as there is no horizontal movement of bits.

# Issue Details

During porting from GCC 6.1 to GCC 7.1, the RISC-V port was changed to define TRULY_NOOP_TRUNCATION to 1 and PROMOTE_MODE was set up to promote values to wider modes. This was done to ensure ABI correctness so that registers holding narrower modes always contain canonical sign extended values.

The result of this is that sign_extend operations are inserted but later eliminated in ree/simplify_rtx. In testcase 3 the sign_extend is correctly eliminated, and indeed most 32-bit instructions are handled correctly. This is what the pattern looks like for a typical 32-bit instruction after expand:

    ;;
    ;; Full RTL generated for this function:
    ;;
    (note 1 0 4 NOTE_INSN_DELETED)
    (note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
    (insn 2 4 3 2 (set (reg/v:DI 73 [ rs1 ])
            (reg:DI 10 a0 [ rs1 ])) "lshift.c":1 -1
         (nil))
    (note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
    (insn 6 3 7 2 (set (reg:SI 74)
            (ashift:SI (subreg/s/u:SI (reg/v:DI 73 [ rs1 ]) 0)
                (const_int 24 [0x18]))) "lshift.c":1 -1
         (nil))
    (insn 7 6 8 2 (set (reg:DI 75)
            (sign_extend:DI (reg:SI 74))) "lshift.c":1 -1
         (nil))
    (insn 8 7 12 2 (set (reg:DI 72 [  ])
            (reg:DI 75)) "lshift.c":1 -1
         (nil))
    (insn 12 8 13 2 (set (reg/i:DI 10 a0)
            (reg:DI 72 [  ])) "lshift.c":1 -1
         (nil))
    (insn 13 12 0 2 (use (reg/i:DI 10 a0)) "lshift.c":1 -1
         (nil))

# Problem test cases

- testcase 1 - open coded bswap - redundant sign extension instructions
- testcase 2 - open coded right shift - redundant sign extension instructions
- testcase 3 - open coded left shift - control / correct behaviour

It appears that the downside to the PROMOTE_MODE transition is that several redundant sign-extension operations are cropping up under specific circumstances. One of the first small isolators is testcase 1 (below), which is an open-coded bswap. You'll notice the SEXT.W emitted before the return. This was then further isolated to an even simpler testcase 2, which is a single right shift that also emits SEXT.W before the return. A 32-bit left shift or other 32-bit operations correctly have the redundant sign_extend eliminated.

We have isolated the code that prevented the right shift from having its redundant sign extension eliminated and it is target independent code in simplify-rtx. Code in simplify_unary_operation_1 assumes zero extension is more efficient, likely an assumption based on the fact that most platforms define POINTERS_EXTEND_UNSIGNED to true and naturally promote words to zero extended form vs sign extended form.
The redundant sign extension appears to be due to an optimisation in simplify-rtx that generates a zero extend for right shifts where the shift amount is non zero, based on the assumption that zero extend is "better". This is not true on platforms that define POINTERS_EXTEND_UNSIGNED to false, i.e. that naturally promote subregister width operations to canonical sign-extended form internally.

# Partial resolution of test case 2

We have prepared a patch that disables the zero extend optimisation on platforms that define POINTERS_EXTEND_UNSIGNED to false. It fixes the issue in testcase 2. We believe this fix is very low risk because a logical right shift by a non-zero amount always produces a zero sign bit, thus either sign or zero extension can be used (this is in the comment); however sign-extension is "better" on RISC-V and likely all other platforms with POINTERS_EXTEND_UNSIGNED false. Platforms like x86, Aarch64 and most others define POINTERS_EXTEND_UNSIGNED to 1 so would not be affected by this change.

Please review this candidate patch to fix the codegen issue for testcase 2.

    diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
    index ce632ae..25dd70f 100644
    --- a/gcc/simplify-rtx.c
    +++ b/gcc/simplify-rtx.c
    @@ -1503,6 +1503,10 @@ simplify_unary_operation_1 (enum rtx_code code, machine_mode mode, rtx op)
           /* (sign_extend:M (lshiftrt:N (const_int I))) is better as
              (zero_extend:M (lshiftrt:N (const_int I))) if I is not 0. */
           if (GET_
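The isolated test cases themselves didn't make it into this message, so for reference here is a sketch of the shape of testcases 2 and 3 (my reconstruction from the RTL dump above and the "lshift.c" file name; the originals may differ):

    /* testcase 3 - open coded left shift - control, sign_extend eliminated */
    int lshift(unsigned int rs1) { return rs1 << 24; }

    /* testcase 2 - open coded right shift - emits a spurious sext.w on RV64 */
    int rshift(unsigned int rs1) { return rs1 >> 24; }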
Re: Redundant sign-extension instructions on RISC-V
> On 30 Aug 2017, at 12:36 PM, Michael Clark wrote: > > Dear GCC folk, > > > # Issue Background > > We’re investigating an issue with redundant sign-extension instructions being > emitted with the riscv backend. Firstly I would like to state that riscv is > possibly a unique backend with respect to its canonical sign-extended > register form due to the following: > > - POINTERS_EXTEND_UNSIGNED is false, and the canonical form after 32-bit > operations on RV64 is sign-extended not zero-extended. > > - PROMOTE_MODE is defined on RV64 such that SI mode values are promoted to > signed DI mode values holding SI mode subregs > > - RISC-V does not have register aliases for these different modes, rather > different instructions are selected to operate on different register widths. > > - The 32-bit instructions sign-extend results. e.g. all shifts, add, sub, etc. > > - Some instructions such as logical ops only have full word width variants > (AND, OR, XOR) but these instructions maintain canonical form as there is no > horizontal movement of bits. > > > # Issue Details > > During porting from GCC 6.1 to GCC 7.1, the RISC-V port was changed to define > TRULY_NOOP_TRUNCATION to 1 and PROMOTE_MODE was set up to promote values to > wider modes. This was done to ensure ABI correctness so that registers > holding narrower modes always contain canonical sign extended values. > > The result of this is that sign_extend operations are inserted but later > eliminated in ree/simplyfy_rtx. In testcase 3 the sign_extend is correctly > eliminated, and indeed most 32-bit instructions are handled correctly. This > is what the pattern looks like for a typical 32-bit instruction after expand: > > ;; > ;; Full RTL generated for this function: > ;; > (note 1 0 4 NOTE_INSN_DELETED) > (note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK) > (insn 2 4 3 2 (set (reg/v:DI 73 [ rs1 ]) > (reg:DI 10 a0 [ rs1 ])) "lshift.c":1 -1 >(nil)) > (note 3 2 6 2 NOTE_INSN_FUNCTION_BEG) > (insn 6 3 7 2 (set (reg:SI 74) > (ashift:SI (subreg/s/u:SI (reg/v:DI 73 [ rs1 ]) 0) > (const_int 24 [0x18]))) "lshift.c":1 -1 >(nil)) > (insn 7 6 8 2 (set (reg:DI 75) > (sign_extend:DI (reg:SI 74))) "lshift.c":1 -1 >(nil)) > (insn 8 7 12 2 (set (reg:DI 72 [ ]) > (reg:DI 75)) "lshift.c":1 -1 >(nil)) > (insn 12 8 13 2 (set (reg/i:DI 10 a0) > (reg:DI 72 [ ])) "lshift.c":1 -1 >(nil)) > (insn 13 12 0 2 (use (reg/i:DI 10 a0)) "lshift.c":1 -1 >(nil)) > > > # Problem test cases > > - testcase 1 - open coded bswap - redundant sign extension instructions > - testcase 2 - open coded right shift - redundant sign extension instructions > - testcase 3 - open coded left shift - control / correct behaviour > > It appears that the downside to the PROMOTE_MODE transition is that several > redundant sign-extension operations are cropping up under specific > circumstances. One of the first small isolators is testcase 1 (below), which > is an open-coded bswap. You’ll notice the SEXT.W emitted before the return. > This was then further isolated to an even simpler testcase 2 which is a > single right shift which also emits SEXT.W before the return. A 32-bit left > shift or other 32-bit operations correctly have the redundant sign_extend > eliminated. > > We have isolated the code that prevented the right shift from having its > redundant sign extension eliminated and it is target independent code in > simplify-rtx. 
Code in simplify_unary_operation_1 assumes zero extension is > more efficient, likely an assumption based on the fact that most platforms > define POINTERS_EXTEND_UNSIGNED to true and naturally promote words to zero > extended form vs sign extended form. The redundant sign extension appears to > be due to an optimisation in simply_rtx that generates zero extend for right > shifts where the value is non zero, based on the assumption that zero extend > is” better”. This is not true on platforms that define > POINTERS_EXTEND_UNSIGNED to false i.e. naturally promote subregister width > operations to canonical sign-extended form internally. > > > # Partial resolution of test case 2 > > We have prepared a patch that disables the zero extend optimisation on > platforms that define POINTERS_EXTEND_UNSIGNED. It fixes the issue in > testcase 2. We believe this fix is very low risk because right shift for > values above zero will always have a zero sign bit, thus either sign or zero > extension can be used (this is in the comment), however sign-extension is > “better” on RISC-V and likely all other platforms with >
Re: Redundant sign-extension instructions on RISC-V
> On 30 Aug 2017, at 9:14 PM, Richard Biener wrote: > > On Wed, Aug 30, 2017 at 2:36 AM, Michael Clark wrote: >> Dear GCC folk, >> >> >> # Issue Background >> >> We’re investigating an issue with redundant sign-extension instructions >> being emitted with the riscv backend. Firstly I would like to state that >> riscv is possibly a unique backend with respect to its canonical >> sign-extended register form due to the following: >> >> - POINTERS_EXTEND_UNSIGNED is false, and the canonical form after 32-bit >> operations on RV64 is sign-extended not zero-extended. >> >> - PROMOTE_MODE is defined on RV64 such that SI mode values are promoted to >> signed DI mode values holding SI mode subregs >> >> - RISC-V does not have register aliases for these different modes, rather >> different instructions are selected to operate on different register widths. >> >> - The 32-bit instructions sign-extend results. e.g. all shifts, add, sub, >> etc. >> >> - Some instructions such as logical ops only have full word width variants >> (AND, OR, XOR) but these instructions maintain canonical form as there is no >> horizontal movement of bits. >> >> >> # Issue Details >> >> During porting from GCC 6.1 to GCC 7.1, the RISC-V port was changed to >> define TRULY_NOOP_TRUNCATION to 1 and PROMOTE_MODE was set up to promote >> values to wider modes. This was done to ensure ABI correctness so that >> registers holding narrower modes always contain canonical sign extended >> values. >> >> The result of this is that sign_extend operations are inserted but later >> eliminated in ree/simplyfy_rtx. In testcase 3 the sign_extend is correctly >> eliminated, and indeed most 32-bit instructions are handled correctly. This >> is what the pattern looks like for a typical 32-bit instruction after expand: >> >> ;; >> ;; Full RTL generated for this function: >> ;; >> (note 1 0 4 NOTE_INSN_DELETED) >> (note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK) >> (insn 2 4 3 2 (set (reg/v:DI 73 [ rs1 ]) >> (reg:DI 10 a0 [ rs1 ])) "lshift.c":1 -1 >>(nil)) >> (note 3 2 6 2 NOTE_INSN_FUNCTION_BEG) >> (insn 6 3 7 2 (set (reg:SI 74) >> (ashift:SI (subreg/s/u:SI (reg/v:DI 73 [ rs1 ]) 0) >> (const_int 24 [0x18]))) "lshift.c":1 -1 >>(nil)) >> (insn 7 6 8 2 (set (reg:DI 75) >> (sign_extend:DI (reg:SI 74))) "lshift.c":1 -1 >>(nil)) >> (insn 8 7 12 2 (set (reg:DI 72 [ ]) >> (reg:DI 75)) "lshift.c":1 -1 >>(nil)) >> (insn 12 8 13 2 (set (reg/i:DI 10 a0) >> (reg:DI 72 [ ])) "lshift.c":1 -1 >>(nil)) >> (insn 13 12 0 2 (use (reg/i:DI 10 a0)) "lshift.c":1 -1 >>(nil)) >> >> >> # Problem test cases >> >> - testcase 1 - open coded bswap - redundant sign extension instructions >> - testcase 2 - open coded right shift - redundant sign extension instructions >> - testcase 3 - open coded left shift - control / correct behaviour >> >> It appears that the downside to the PROMOTE_MODE transition is that several >> redundant sign-extension operations are cropping up under specific >> circumstances. One of the first small isolators is testcase 1 (below), which >> is an open-coded bswap. You’ll notice the SEXT.W emitted before the return. >> This was then further isolated to an even simpler testcase 2 which is a >> single right shift which also emits SEXT.W before the return. A 32-bit left >> shift or other 32-bit operations correctly have the redundant sign_extend >> eliminated. >> >> We have isolated the code that prevented the right shift from having its >> redundant sign extension eliminated and it is target independent code in >> simplify-rtx. 
Code in simplify_unary_operation_1 assumes zero extension is >> more efficient, likely an assumption based on the fact that most platforms >> define POINTERS_EXTEND_UNSIGNED to true and naturally promote words to zero >> extended form vs sign extended form. The redundant sign extension appears to >> be due to an optimisation in simply_rtx that generates zero extend for right >> shifts where the value is non zero, based on the assumption that zero extend >> is” better”. This is not true on platforms that define >> POINTERS_EXTEND_UNSIGNED to false i.e. naturally promote subregister width >> operations to canonical sign-extended form internally. >> >> >> # Partial resolution of test case 2 >> >
Re: Redundant sign-extension instructions on RISC-V
> On 30 Aug 2017, at 9:43 PM, Michael Clark wrote:
>
>>> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
>>> index ce632ae..25dd70f 100644
>>> --- a/gcc/simplify-rtx.c
>>> +++ b/gcc/simplify-rtx.c
>>> @@ -1503,6 +1503,10 @@ simplify_unary_operation_1 (enum rtx_code code, machine_mode mode, rtx op)
>>>       /* (sign_extend:M (lshiftrt:N (const_int I))) is better as
>>>          (zero_extend:M (lshiftrt:N (const_int I))) if I is not 0. */
>>>       if (GET_CODE (op) == LSHIFTRT
>>> +#if defined(POINTERS_EXTEND_UNSIGNED)
>>> +      /* we skip this optimisation if pointers naturally extend signed */
>>> +      && POINTERS_EXTEND_UNSIGNED
>>> +#endif
>>>          && CONST_INT_P (XEXP (op, 1))
>>>          && XEXP (op, 1) != const0_rtx)
>>>        return simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op));
>>
>> Is it just me or does this miss a || mode != Pmode || GET_MODE (op) != ptr_mode
>> check? Note the comment says exactly the opposite as the transform...
>>
>> I'm not even sure why this simplification is correct in the first place?!
>
> I hope you are not confusing my use of POINTERS_EXTEND_UNSIGNED as a proxy
> for the property that defines whether sub width operations sign-extend to the
> full width of the register vs zero extend. Are you talking about our added
> comment?

Read it as if there were a hypothetical riscv.h macro:

    #define REGISTERS_EXTEND_UNSIGNED false

    +#if defined(REGISTERS_EXTEND_UNSIGNED)
    +      /* we skip this optimisation if registers naturally extend signed */
    +      && REGISTERS_EXTEND_UNSIGNED
    +#endif

The comment is the inverse because RISC-V registers naturally extend signed and REGISTERS_EXTEND_UNSIGNED is false. GCC defines unsigned false for RISC-V promotions.
Re: Redundant sign-extension instructions on RISC-V
> On 31 Aug 2017, at 7:20 AM, Matthew Fortune > wrote: > > Jeff Law writes: >> On 08/30/2017 06:52 AM, Richard Biener wrote: >>> On Wed, Aug 30, 2017 at 11:53 AM, Michael Clark >>> wrote: >>>> >>>>> On 30 Aug 2017, at 9:43 PM, Michael Clark wrote: >>>>> >>>>>>> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c >>>>>>> index ce632ae..25dd70f 100644 >>>>>>> --- a/gcc/simplify-rtx.c >>>>>>> +++ b/gcc/simplify-rtx.c >>>>>>> @@ -1503,6 +1503,10 @@ simplify_unary_operation_1 (enum rtx_code code, >>>>>>> machine_mode >> mode, rtx op) >>>>>>>/* (sign_extend:M (lshiftrt:N (const_int I))) is better as >>>>>>> (zero_extend:M (lshiftrt:N (const_int I))) if I is not 0. */ >>>>>>>if (GET_CODE (op) == LSHIFTRT >>>>>>> +#if defined(POINTERS_EXTEND_UNSIGNED) >>>>>>> + /* we skip this optimisation if pointers naturally extend signed >>>>>>> */ >>>>>>> + && POINTERS_EXTEND_UNSIGNED >>>>>>> +#endif >>>>>>> && CONST_INT_P (XEXP (op, 1)) >>>>>>> && XEXP (op, 1) != const0_rtx) >>>>>>> return simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op)); >>>>>> >>>>>> Is it just me or does this miss a || mode != Pmode || GET_MODE (op) != >>>>>> ptr_mode >>>>>> check? Note the comment says exactly the opposite as the transform... >>>>>> >>>>>> I’m not even sure why this simplification is correct in the first place?! >>>>> >>>>> I hope you are not confusing my use of POINTERS_EXTEND_UNSIGNED as a >>>>> proxy for the >> property that defines whether sub width operations sign-extend to the full >> width of the >> register vs zero extend. Are you taking about our added comment? >>> >>> I'm talking about using POINTERS_EXTEND_UNSIGNED for sth that looks >>> unrelated (and that has no easy way to be queried as you noted). >> Right. I was going to make the same observation. I can't see how >> POINTER_EXTEND_UNSIGNED plays a significant role here. >> >> MIPS has similar properties and my recollection is they did some >> interesting tricks in the target files to fold the extension back into >> the arithmetic insns (beyond the usual LOAD_EXTEND_OP, >> WORD_REGISTER_OPERATIONS, TRULY_NOOP_TRUNCATION, and PROMOTE_MODE stuff). > > If there is a condition to add I would have expected it to be based around > WORD_REGISTER_OPERATIONS etc too. Okay. Thanks. To disable the right shift zero extend optimisation we can perhaps then look into an expression that uses WORD_REGISTER_OPERATIONS && LOAD_EXTEND_OP(mode) == SIGN_EXTEND. What ever semantically means, sub word operations on SIMode are sign-extended vs zero-extended. >> My recollection was they defined their key insns with match_operators >> that allowed the sign extension to occur in the arithmetic insns. BUt I >> don't see any evidence of that anymore. But I can distinctly remember >> discussing it with Ian and Meissner eons ago and its impact on reload in >> particular. > > I see that riscv has chosen to not allow ior/and/xor with SImode as named > patterns but instead just for combine to pick up. Given that the > architecture has almost all the same properties as MIPS I don't follow why > the SImode version is not allowed at expand time. MIPS relies on all SImode > values being in a canonical sign extended form at all points and can > therefore freely represent the dual (or rather no) mode operations, such as > comparisons and logical operations, as both SI and DI mode. This pretty much > solves the redundant sign extension issues. 
> Just because the logical operations only exist as '64-bit' operations in the
> 64-bit architecture does not mean you can't tell GCC there are 32-bit
> versions as well; you just have to present a logical view of the architecture
> rather than being overly strict. LLVM for MIPS went through similar issues
> and I suspect RISCV will hit the same kind of issues but the same solution
> was used and both 32-bit and 64-bit operations described with the same
> underlying instruction.
>
> Is there an architectural difference that means riscv can't do the same
> thing?

I don't believe so. We were seeking guidance, and are in the process of discovering and figuring out how to best solve these issues as they crop up.

Changing the logical op patterns to match 32-bit operations on RV64 seems like a sensible approach, as sketched below. We'll also have to look at the SLT/SLTU (Set Less Than / Set Less Than Unsigned) patterns as they don't have 32-bit versions, and verify they are doing the right thing with respect to modes and sign extension.

We'll work on a revised patch and circle back.

Thanks,
Michael
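To make that concrete, the kind of pattern I have in mind is something like the following untested sketch (the name is illustrative, not the actual riscv.md change): expose an SImode AND on RV64 that is implemented with the full-width instruction, since the operands are already in canonical sign-extended form.

    ;; Hypothetical sketch: let GCC see a 32-bit AND on RV64; the 64-bit
    ;; "and" preserves the canonical sign-extended form of its operands,
    ;; so no sext.w is needed afterwards.
    (define_insn "*andsi3_rv64"
      [(set (match_operand:SI 0 "register_operand" "=r")
            (and:SI (match_operand:SI 1 "register_operand" "r")
                    (match_operand:SI 2 "register_operand" "r")))]
      "TARGET_64BIT"
      "and\t%0,%1,%2")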
Re: Redundant sign-extension instructions on RISC-V
> On 31 Aug 2017, at 2:12 PM, Michael Clark wrote: > >> >> On 31 Aug 2017, at 7:20 AM, Matthew Fortune >> wrote: >> >> Jeff Law writes: >>> On 08/30/2017 06:52 AM, Richard Biener wrote: >>>> On Wed, Aug 30, 2017 at 11:53 AM, Michael Clark >>>> wrote: >>>>> >>>>>> On 30 Aug 2017, at 9:43 PM, Michael Clark wrote: >>>>>> >>>>>>>> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c >>>>>>>> index ce632ae..25dd70f 100644 >>>>>>>> --- a/gcc/simplify-rtx.c >>>>>>>> +++ b/gcc/simplify-rtx.c >>>>>>>> @@ -1503,6 +1503,10 @@ simplify_unary_operation_1 (enum rtx_code code, >>>>>>>> machine_mode >>> mode, rtx op) >>>>>>>> /* (sign_extend:M (lshiftrt:N (const_int I))) is better as >>>>>>>> (zero_extend:M (lshiftrt:N (const_int I))) if I is not 0. */ >>>>>>>> if (GET_CODE (op) == LSHIFTRT >>>>>>>> +#if defined(POINTERS_EXTEND_UNSIGNED) >>>>>>>> + /* we skip this optimisation if pointers naturally extend >>>>>>>> signed */ >>>>>>>> + && POINTERS_EXTEND_UNSIGNED >>>>>>>> +#endif >>>>>>>> && CONST_INT_P (XEXP (op, 1)) >>>>>>>> && XEXP (op, 1) != const0_rtx) >>>>>>>>return simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op)); >>>>>>> >>>>>>> Is it just me or does this miss a || mode != Pmode || GET_MODE (op) != >>>>>>> ptr_mode >>>>>>> check? Note the comment says exactly the opposite as the transform... >>>>>>> >>>>>>> I’m not even sure why this simplification is correct in the first >>>>>>> place?! >>>>>> >>>>>> I hope you are not confusing my use of POINTERS_EXTEND_UNSIGNED as a >>>>>> proxy for the >>> property that defines whether sub width operations sign-extend to the full >>> width of the >>> register vs zero extend. Are you taking about our added comment? >>>> >>>> I'm talking about using POINTERS_EXTEND_UNSIGNED for sth that looks >>>> unrelated (and that has no easy way to be queried as you noted). >>> Right. I was going to make the same observation. I can't see how >>> POINTER_EXTEND_UNSIGNED plays a significant role here. >>> >>> MIPS has similar properties and my recollection is they did some >>> interesting tricks in the target files to fold the extension back into >>> the arithmetic insns (beyond the usual LOAD_EXTEND_OP, >>> WORD_REGISTER_OPERATIONS, TRULY_NOOP_TRUNCATION, and PROMOTE_MODE stuff). >> >> If there is a condition to add I would have expected it to be based around >> WORD_REGISTER_OPERATIONS etc too. > > Okay. Thanks. > > To disable the right shift zero extend optimisation we can perhaps then look > into an expression that uses WORD_REGISTER_OPERATIONS && LOAD_EXTEND_OP(mode) > == SIGN_EXTEND. What ever semantically means, sub word operations on SIMode > are sign-extended vs zero-extended. I have a new patch to fix the right shift optimisation for platforms that zero-extend. The sign extend op is DImode so we check the mode of the enclosed shift expression hence XEXP (op, 0). Hopefully we have used the correct macros this time. The comments read a little more clearly in this version. I’ve amended the prelude comment to mention the optimisation works better for platforms that zero extend, and state that the optimisation is skipped for platforms that sign extend. diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c index ce632ae..3b6e5eb 100644 --- a/gcc/simplify-rtx.c +++ b/gcc/simplify-rtx.c @@ -1501,8 +1501,14 @@ simplify_unary_operation_1 (enum rtx_code code, machine_mode mode, rtx op) } /* (sign_extend:M (lshiftrt:N (const_int I))) is better as - (zero_extend:M (lshiftrt:N (const_int I))) if I is not 0. 
*/
+     (zero_extend:M (lshiftrt:N (const_int I))) if I is not 0
+     and the platform zero extends narrower operations */
   if (GET_CODE (op) == LSHIFTRT
+#if defined(WORD_REGISTER_OPERATIONS) && defined(LOAD_EXTEND_OP)
+      /* we skip on platform that sign extend narrower operations */
+      && !(WORD_REGISTER_OPERATIONS &&
+           LOAD_EXTEND_OP(GET_MODE (XEXP (op, 0))) == SIGN_EXTEND)
+#endif
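For context, here is a minimal C function (my own illustration, not one of the original test cases behind the cx.rv8.io links) where the (sign_extend:DI (lshiftrt:SI ...)) form arises on RV64: the SImode shift result is promoted to DImode with a sign_extend for the return value, which is the shape the guarded simplification would otherwise rewrite into a zero_extend.

    /* illustrative only: SImode logical right shift whose int result is
       sign-extended to DImode for the RV64 ABI */
    int shift_ret(unsigned int x)
    {
        return x >> 2;
    }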
Re: Bit-field struct member sign extension pattern results in redundant
> On 19 Aug 2017, at 4:10 AM, Richard Henderson wrote: > > On 08/17/2017 03:29 PM, Michael Clark wrote: >> hand coded x86 asm (no worse because the sar depends on the lea) >> >> sx5(int): >>shl edi, 27 >>sar edi, 27 >>movsx eax, dl > > Typo in the register, but I know what you mean. More interestingly, edi > already has the sign-extended value, so "mov eax, edi" sufficies (saving one > byte or perhaps allowing better register allocation). > > That said, if anyone is tweaking x86-64 patterns on this, consider > > sx5(int): > rorxeax, edi, 5 > sar eax, 27 > > where the (newish) bmi2 rorx instruction allows the data to be moved into > place > while rotating, avoiding the extra move entirely.

The patterns might be hard to match unless we can get the gimple expansions for bitfield access to use the mode of the enclosing type.

For bitfield accesses to SI mode fields under 8 bits on RISC-V, gcc is generating two shifts on QI mode sub-registers, each with a sign-extend. For bitfield accesses to SI mode fields under 16 bits on RISC-V, gcc is generating two shifts on HI mode sub-registers, each with a sign-extend.

The bitfield expansion logic is selecting the narrowest type that can hold the bitwidth of the field, versus the bitwidth of the enclosing type, and this appears to be in the gimple to rtl transform, possibly in expr.c. Using the mode of the enclosing type for bitfield member access expansion would likely solve this problem. I suspect that with the use of caches and word sized register accesses, this sort of change would be reasonable to make for 32-bit and 64-bit targets.

I also noticed something I didn’t spot earlier. On RV32 the sign extend and shifts are correctly coalesced into 27 bit shifts in the combine stage. We are presently looking into another redundant sign extension issue on RV64 that could potentially be related. It could be that the shift coalescing optimisation doesn’t happen unless the redundant sign extensions are eliminated early in combine by simplify_rtx. i.e. the pattern is more complex due to sign_extend ops that are not eliminated.

- https://cx.rv8.io/g/2FxpNw

RV64 and Aarch64 both appear to have the issue but with different expansions for the shift and extend pattern due to the mode of access (QI or HI). Field accesses above 16 bits create SI mode accesses and generate the expected code. The RISC-V compiler has the #define SLOW_BYTE_ACCESS 1 patch however it appears to make no difference in this case. SLOW_BYTE_ACCESS suppresses QI mode and HI mode loads in some bitfield test cases when a struct is passed by pointer but has no effect on this particular issue. This shows the codegen for the fix to the SLOW_BYTE_ACCESS issue, i.e. proof that the compiler has SLOW_BYTE_ACCESS defined to 1.

- https://cx.rv8.io/g/TyXnoG

A target independent fix that would solve the issue on ARM and RISC-V would be to access bitfield members with the mode of the bitfield member's enclosing type instead of the smallest mode that can hold the bitwidth of the type. If we had a char or short member in the struct, I can understand the use of QI and HI mode, as we would need narrower accesses due to alignment issues, but in this case the member is an int and one would expect this to expand to SI mode accesses if the enclosing type is SI mode.
riscv64:

sx5(int):
  slliw a0,a0,3
  slliw a0,a0,24
  sraiw a0,a0,24
  sraiw a0,a0,3
  ret

sx17(int):
  slliw a0,a0,15
  sraiw a0,a0,15
  ret

riscv32:

sx5(int):
  slli a0,a0,27
  srai a0,a0,27
  ret

sx17(int):
  slli a0,a0,15
  srai a0,a0,15
  ret

aarch64:

sx5(int):
  sbfiz w0, w0, 3, 5
  asr w0, w0, 3
  ret

sx17(int):
  sbfx w0, w0, 0, 17
  ret
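The source for these test functions is not reproduced in the thread; a guess at its shape (an assumption on my part, consistent with the 27-bit and 15-bit shift pairs above) is a signed bit-field narrower than int, stored and read back:

    /* assumed shape of the sx5/sx17 test cases, not the original source */
    int sx5(int x)
    {
        struct { int v : 5; } s = { x };
        return s.v;
    }

    int sx17(int x)
    {
        struct { int v : 17; } s = { x };
        return s.v;
    }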
Re: RFC: Improving GCC8 default option settings
> On 13 Sep 2017, at 1:57 AM, Wilco Dijkstra wrote: > > Hi all, > > At the GNU Cauldron I was inspired by several interesting talks about > improving > GCC in various ways. While GCC has many great optimizations, a common theme is > that its default settings are rather conservative. As a result users are > required to enable several additional optimizations by hand to get good code. > Other compilers enable more optimizations at -O2 (loop unrolling in LLVM was > mentioned repeatedly) which GCC could/should do as well.

There are some nuances to -O2. Please consider -O2 users who wish to use it like Clang/LLVM’s -Os (-O2 without loop vectorisation IIRC).

Clang/LLVM has an -Os that is like -O2, so adding optimisations that increase code size can be skipped from -Os without drastically affecting performance. This is not the case with GCC, where -Os is a size at all costs optimisation mode. The GCC user's option for size, but not at the expense of speed, is -O2.

Clang        GCC
-Oz    ~=    -Os
-Os    ~=    -O2

So if adding optimisations to -O2 that increase code size, please consider adding an -O2s that maintains the compact code size of -O2. -O2 generates pretty compact code as many performance optimisations tend to reduce code size, or otherwise add optimisations that increase code size to -O3. Adding loop unrolling makes sense in the Clang/LLVM context where they have a compact code model with good performance, i.e. -Os. In GCC this is -O2. So if you want to enable more optimisations at -O2, please copy the -O2 optimisations to -O2s, or rename -Os to -Oz and copy the -O2 optimisation defaults to a new -Os.

The present reality is that any project that wishes to optimize for size at all costs will need to run a configure test for -Oz, and then fall back to -Os, given the current disparity between Clang/LLVM and GCC flags here.

> Here are a few concrete proposals to improve GCC's option settings which will > enable better code generation for most targets: > > * Make -fno-math-errno the default - this mostly affects the code generated > for > sqrt, which should be treated just like floating point division and not set > errno by default (unless you explicitly select C89 mode). > > * Make -fno-trapping-math the default - another obvious one. From the docs: > "Compile code assuming that floating-point operations cannot generate > user-visible traps." > There isn't a lot of code that actually uses user-visible traps (if any - > many CPUs don't even support user traps as it's an optional IEEE feature). > So assuming trapping math by default is way too conservative since there is > no obvious benefit to users. > > * Make -fno-common the default - this was originally needed for pre-ANSI C, > but > is optional in C (not sure whether it is still in C99/C11). This can > significantly improve code generation on targets that use anchors for globals > (note the linker could report a more helpful message when ancient code that > requires -fcommon fails to link). > > * Make -fomit-frame-pointer the default - various targets already do this at > higher optimization levels, but this could easily be done for all targets. > Frame pointers haven't been needed for debugging for decades, however if > there > are still good reasons to keep it enabled with -O0 or -O1 (I can't think of > any > unless it is for last-resort backtrace when there is no unwind info at a > crash), > we could just disable the frame pointer from -O2 onwards. > > These are just a few ideas to start. What do people think?
I'd welcome > discussion > and other proposals for similar improvements. > > Wilco
Re: RFC: Improving GCC8 default option settings
> On 13 Sep 2017, at 12:47 PM, Segher Boessenkool > wrote: > > On Wed, Sep 13, 2017 at 09:27:22AM +1200, Michael Clark wrote: >>> Other compilers enable more optimizations at -O2 (loop unrolling in LLVM was >>> mentioned repeatedly) which GCC could/should do as well. >> >> There are some nuances to -O2. Please consider -O2 users who wish use it >> like Clang/LLVM’s -Os (-O2 without loop vectorisation IIRC). >> >> Clang/LLVM has an -Os that is like -O2 so adding optimisations that increase >> code size can be skipped from -Os without drastically effecting performance. >> >> This is not the case with GCC where -Os is a size at all costs optimisation >> mode. GCC users option for size not at the expense of speed is to use -O2. > > "Size not at the expense of speed" exists in neither compiler. Just the > tradeoffs are different between GCC and LLVM. It would be a silly > optimisation target -- it's exactly the same as just "speed"! Unless > “speed" means "let's make it faster, and bigger just because" ;-)

I would like to be able to quantify stats on a well known benchmark suite, say SPECint 2006 or SPECint 2017, but in my own small benchmark suite I saw a disproportionately small difference in size between -O2 and -Os, but a significant drop in performance with -Os vs -O2.

- https://rv8.io/bench#optimisation
- https://rv8.io/bench#executable-file-sizes

-O2 is 98% perf of -O3 on x86-64
-Os is 81% perf of -O3 on x86-64

-O2 saves 5% space on -O3 on x86-64
-Os saves 8% space on -Os on x86-64

17% drop in performance for 3% saving in space is not a good trade for a “general” size optimisation. It’s more like executable compression.

-O2 seems to be a sweet spot for size versus speed. I could only recommend GCC’s -Os if the user is trying to squeeze something down to fit the last few bytes of a ROM, and -Oz seems like a more appropriate name. -O2 is the current sweet spot in GCC and is likely closest in semantics to LLVM/Clang -Os, and I’d like -O2 binaries to stay lean. I don’t think -O2 should slow down, nor should the binaries get larger. Turning up knobs that affect code size should be reserved for -O3 until GCC makes a distinction between -O2/-O2s and -Os/-Oz.

On RISC-V I believe we could shrink binaries at -O2 further with no sacrifice in performance, perhaps with a performance improvement by reducing icache bandwidth…

BTW -O2 gets great compression and performance improvements compared to -O0 ;-D it’s the points after -O2 where the trade-offs don’t correlate. I like -O2. My 2c.

> GCC's -Os is not "size at all costs" either; there are many options (mostly > --params) that can decrease code size significantly. To tune code size > down for your particular program you have to play with options a bit. This > shouldn't be news to anyone. > > '-Os' > Optimize for size. '-Os' enables all '-O2' optimizations that do > not typically increase code size. It also performs further > optimizations designed to reduce code size. > >> So if adding optimisations to -O2 that increase code size, please >> considering adding an -O2s that maintains the compact code size of -O2. -O2 >> generates pretty compact code as many performance optimisations tend to >> reduce code size, or otherwise add optimisations that increase code size to >> -O3. Adding loop unrolling on makes sense in the Clang/LLVM context where >> they have a compact code model with good performance i.e. -Os. In GCC this >> is -O2.
>> >> So if you want to enable more optimisations at -O2, please copy -O2 >> optimisations to -O2s or rename -Os to -Oz and copy -O2 optimisation >> defaults to a new -Os. > > '-O2' > Optimize even more. GCC performs nearly all supported > optimizations that do not involve a space-speed tradeoff. As > compared to '-O', this option increases both compilation time and > the performance of the generated code. > >> The present reality is that any project that wishes to optimize for size at >> all costs will need to run a configure test for -Oz, and then fall back to >> -Os, given the current disparity between Clang/LLVM and GCC flags here. > > The present reality is that any project that wishes to support both GCC and > LLVM needs to do configure tests, because LLVM chose to do many things > differently (sometimes unavoidably). If GCC would change some options > to be more like LLVM, all users only ever using GCC would be affected, > while all other incompatibilities would remain. Not a good tradeoff at > all. > > > Segher
Re: RFC: Improving GCC8 default option settings
> On 13 Sep 2017, at 1:15 PM, Michael Clark wrote: > > - https://rv8.io/bench#optimisation > - https://rv8.io/bench#executable-file-sizes > > -O2 is 98% perf of -O3 on x86-64 > -Os is 81% perf of -O3 on x86-64 > > -O2 saves 5% space on -O3 on x86-64 > -Os saves 8% space on -Os on x86-64 > > 17% drop in performance for 3% saving in space is not a good trade for a > “general” size optimisation. It’s more like executable compression.

Sorry, fixed typo:

-O2 is 98% perf of -O3 on x86-64
-Os is 81% perf of -O3 on x86-64

-O2 saves 5% space on -O3 on x86-64
-Os saves 8% space on -O3 on x86-64

The extra ~3% space saving for a ~17% drop in performance doesn’t seem like a good general size option, given the cost in performance. Again, I really like GCC’s -O2 and hope that its binaries don’t grow in size nor slow down.
Re: Bit-field struct member sign extension pattern results in redundant
> On 5 Sep 2017, at 9:35 AM, Michael Clark wrote: > >> >> On 19 Aug 2017, at 4:10 AM, Richard Henderson wrote: >> >> On 08/17/2017 03:29 PM, Michael Clark wrote: >>> hand coded x86 asm (no worse because the sar depends on the lea) >>> >>> sx5(int): >>> shl edi, 27 >>> sar edi, 27 >>> movsx eax, dl >> >> Typo in the register, but I know what you mean. More interestingly, edi >> already has the sign-extended value, so "mov eax, edi" sufficies (saving one >> byte or perhaps allowing better register allocation). >> >> That said, if anyone is tweaking x86-64 patterns on this, consider >> >> sx5(int): >> rorxeax, edi, 5 >> sar eax, 27 >> >> where the (newish) bmi2 rorx instruction allows the data to be moved into >> place >> while rotating, avoiding the extra move entirely. > > The patterns might be hard to match unless we can get the gimple expansions > for bitfield access to use the mode of the enclosing type. > > For bitfield accesses to SI mode fields under 8 bits on RISC-V, gcc is > generating two shifts on QI mode sub-registers, each with a sign-extend. > > For bitfield accesses to SI mode fields under 16 bits on RISC-V, gcc is > generating two shifts on HI mode sub-registers, each with a sign-extend. > > The bitfield expansion logic is selecting the narrowest type that can hold > the bitwidth of the field, versus the bitwidth of the enclosing type and this > appears to be in the gimple to rtl transform, possibly in expr.c. > > Using the mode of the enclosing type for bitfield member access expansion > would likely solve this problem. I suspect that with the use of caches and > word sized register accesses, that this sort of change would be reasonable to > make for 32-bit and 64-bit targets. > > I also noticed something I didn’t spot earlier. On RV32 the sign extend and > shifts are correctly coalesced into 27 bit shifts in the combine stage. We > are presently looking into another redundant sign extension issue on RV64 > that could potentially be related. It could be that the shift coalescing > optimisation doesn’t happen unless the redundant sign extensions are > eliminated early in combine by simplify_rtx. i.e. the pattern is more complex > due to sign_extend ops that are not eliminated. > > - https://cx.rv8.io/g/2FxpNw > > RV64 and Aarch64 both appear to have the issue but with different expansions > for the shift and extend pattern due to the mode of access (QI or HI). Field > accesses above 16 bits create SI mode accesses and generate the expected > code. The RISC-V compiler has the #define SLOW_BYTE_ACCESS 1 patch however it > appears to make no difference in this case. SLOW_BYTE_ACCESS suppresses QI > mode and HI mode loads in some bitfield test cases when a struct is passed by > pointer but has no effect on this particular issue. This shows the codegen > for the fix to the SLOW_BYTE_ACCESS issue. i.e. proof that the compiler as > SLOW_BYTE_ACCESS defined to 1. > > - https://cx.rv8.io/g/TyXnoG > > A target independent fix that would solve the issue on ARM and RISC-V would > be to access bitfield members with the mode of the bitfield member's > enclosing type instead of the smallest mode that can hold the bitwidth of the > type. If we had a char or short member in the struct, I can understand the > use of QI and HI mode, as we would need narrorwer accesses due to alignment > issues, but in this case the member is in int and one would expect this to > expand to SI mode accesses if the enclosing type is SI mode. 
It appears that on 64-bit targets, promotion to an SI mode subregister of a DI mode register prevents combine from coalescing the shift and sign_extend. The only difference I can spot between RV32, which coalesces the shift during combine, and RV64 and other 64-bit targets, which do not coalesce the shifts during combine, is that the 64-bit targets have promoted the shift operand to an SI mode subregister of a DI mode register, whereas on RV32 the shift operand is simply an SI mode register. I suspect there is some code in simplify-rtx.c that is missing the subregister case; however, I’m still trying to isolate the code that coalesces shift and sign_extend.

I’ve found a similar, perhaps smaller, but related case where a shift is not coalesced; however, in this case it is also not coalesced on RV32:

https://cx.rv8.io/g/fPdk2F

> riscv64: > > sx5(int): > slliw a0,a0,3 > slliw a0,a0,24 > sraiw a0,a0,24 > sraiw a0,a0,3 > ret > > sx17(int): > slliw a0,a0,15 >
Re: RFC: Improving GCC8 default option settings
> On 14 Sep 2017, at 3:06 AM, Allan Sandfeld Jensen wrote: > > On Dienstag, 12. September 2017 23:27:22 CEST Michael Clark wrote: >>> On 13 Sep 2017, at 1:57 AM, Wilco Dijkstra wrote: >>> >>> Hi all, >>> >>> At the GNU Cauldron I was inspired by several interesting talks about >>> improving GCC in various ways. While GCC has many great optimizations, a >>> common theme is that its default settings are rather conservative. As a >>> result users are required to enable several additional optimizations by >>> hand to get good code. Other compilers enable more optimizations at -O2 >>> (loop unrolling in LLVM was mentioned repeatedly) which GCC could/should >>> do as well. >> >> There are some nuances to -O2. Please consider -O2 users who wish use it >> like Clang/LLVM’s -Os (-O2 without loop vectorisation IIRC). >> >> Clang/LLVM has an -Os that is like -O2 so adding optimisations that increase >> code size can be skipped from -Os without drastically effecting >> performance. >> >> This is not the case with GCC where -Os is a size at all costs optimisation >> mode. GCC users option for size not at the expense of speed is to use -O2. >> >> ClangGCC >> -Oz ~= -Os >> -Os ~= -O2 >> > No. Clang's -Os is somewhat limited compared to gcc's, just like the clang > -Og > is just -O1. AFAIK -Oz is a proprietary Apple clang parameter, and not in > clang proper.

It appears to be in mainline clang.

mclark@anarch128:~$ clang -Oz -c a.c -o a.o
mclark@anarch128:~$ clang -Ox -c a.c -o a.o
error: invalid integral value 'x' in '-Ox'
error: invalid integral value 'x' in '-Ox'
mclark@anarch128:~$ uname -a
Linux anarch128.org 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u3 (2017-08-06) x86_64 GNU/Linux
mclark@anarch128:~$ clang --version
clang version 3.8.1-24 (tags/RELEASE_381/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

I still think it would be unfortunate to lose the size/speed sweet spot of -O2 by adding optimisations that increase code size, unless there was a size optimisation option derived from -O2 at the point -O2 is souped up, i.e. create an -O2s (or rename -Os to -Oz and derive the new -Os from the current -O2).

I’m going to start looking at this to see what’s involved in making a patch. Distros that want a balance of size and speed might even pick it up, even if it is not accepted in mainline.
Re: [RFC] type promotion pass
> On 16 Sep 2017, at 1:04 AM, David Edelsohn wrote: > > On Tue, Sep 5, 2017 at 5:26 AM, Prathamesh Kulkarni > wrote: >> Hi, >> I have attached revamped version of Kugan's original patch for type promotion >> (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00472.html) >> rebased on r249469. The motivation of the pass is to minimize >> generation of subregs >> to avoid redundant zero/sign extensions by carrying out computations >> in PROMOTE_MODE >> as much as possible on tree-level. >> >> * Working of pass >> The pass is dominator-based, and tries to promote type of def to >> PROMOTE_MODE in each gimple stmt. Before beginning domwalk, all the >> default definitions are promoted to PROMOTE_MODE >> in promote_all_ssa_defined_with_nop (). The patch adds a new tree code >> SEXT_EXPR to represent sign-extension on tree level. CONVERT_EXPR is >> either replaced by an explicit sign or zero extension depending on the >> signedness of operands. >> >> The core of the pass is in following two routines: >> a) promote_ssa: This pass looks at the def of gimple_stmt, and >> promotes type of def to promoted_type in-place if possible. If it >> cannot promote def in-place, then >> it transforms: >> def = def_stmt >> to >> new_def = def_stmt >> def = convert_expr new_def >> where new_def is a clone of def, and type of def is set to promoted_type. >> >> b) fixup_use: The main intent is to "fix" uses of a promoted variable >> to preserve semantics >> of the code, for instance if the variable is used in stmt where it's >> original type is required. >> Another case is when def is not promoted by promote_ssa, but some uses >> could be promoted. >> >> promote_all_stmts () is the driver function that calls fixup_use and >> promote_ssa for each stmt >> within the basic block. The pass relies extensively on dom and vrp to >> remove redundancies generated by the pass and is thus enabled only if >> vrp is enabled. >> >> Issues: >> 1] Pass ordering: type-promote pass generates too many redundancies >> which can hamper other optimizations. One case I observed was on arm >> when it inhibited path splitting optimization because the number of >> stmts in basic block exceeded the value of >> param-max-jump-thread-duplication-stmts. So I placed the pass just >> before dom. I am not sure if this is a good approach. Maybe the pass >> itself >> should avoid generating redundant statements and not rely too much on >> dom and vrp ? >> >> 2] Redundant copies left after the pass: When it's safe, vrp optimzies >> _1 = op1 sext op2 >> into >> _1 = op1 >> >> which leaves redundant copies that are not optimized away at GIMPLE level. >> I placed pass_copy_prop after vrp to eliminate these copies but not sure if >> that >> is ideal. Maybe we should do this during vrp itself as this is the >> only case that >> generates redundant copies ? >> >> 3] Backward phi's: Since we traverse in dominated order, fixup_use() >> gets called first on a ssa_name that is an argument of a backward-phi. >> If it promotes the type and later if promote_ssa() decides that the >> ssa_name should not be promoted, then we have to again walk the >> backward phi's to undo the "incorrect" promotion, which is done by >> emitting a convert_expr back to the original type from promoted_type. >> While I suppose it shouldn't affect correctness, it generates >> redundant casts and was wondering if there's a better approach to >> handle this issue. 
>> >> * SPEC2k6 benchmarking: >> >> Results of benchmarking the patch for aarch64-linux-gnu cortex-a57: >> >> Improved: >> 401.bzip2 +2.11% >> 459.GemsFDTD +1.42% >> 464.h264ref +1.96% >> 471.omnetpp +1.05% >> 481.wrf+0.99% >> >> Regressed: >> 447.dealII-1.34% >> 450.soplex -1.54% >> 456.hmmer -3.79% >> 482.sphinx3 -2.95% >> >> The remaining benchmarks didn't have much difference. However there >> was some noise >> in the above run, and I suppose the numbers are not precise. >> I will give another run with benchmarking. There was no significant >> difference in code-size with and without patch for SPEC2k6. >> >> * Validation >> >> The patch passes bootstrap+test on x86_64-unknown-linux-gnu, >> arm-linux-gnueabihf, >> aarch64-linux-gnu and ppc64le-linux-gnu. On arm-linux-gnueabihf, and >> aarch64-linux-gnu, there is following fallout: >> >> 1] gcc.dg/attr-alloc_size-11.c missing range info for signed char >> (test for warnings, line 50) >> The issue seems to be that we call reset_flow_sensitive_info(def) if >> def is promoted and that invalidates value range which is probably >> causing the regression. >> >> 2] gcc.dg/fold-narrowbopcst-1.c scan-tree-dump optimized " = _.* + 156" >> This requires adjusting the scan-dump. >> >> On ppc64le-linux-gnu, I am observing several regressions in the >> testsuite. Most of these seem to be because type-promotion is >> interfering with other optimizations, especially widening_mul and >> bswap pass. Also observed
Re: [RFC] type promotion pass
> On 16 Sep 2017, at 8:59 AM, Segher Boessenkool > wrote: > > Hi! > > On Sat, Sep 16, 2017 at 08:47:03AM +1200, Michael Clark wrote: >> RISC-V defines promote_mode on RV64 to promote SImode to signed DImode >> subregisters. I did an experiment on RISC-V to not promote SImode to DImode >> and it improved codegen for many of my regression test cases, but >> unfortunately it breaks the RISC-V ABI. > > It sounds like you should implement TARGET_PROMOTE_FUNCTION_MODE as well?

riscv currently has default_promote_function_mode_always_promote:

gcc/config/riscv/riscv.c:#define TARGET_PROMOTE_FUNCTION_MODE default_promote_function_mode_always_promote

I see that default_promote_function_mode_always_promote just calls promote_mode.

Is TARGET_PROMOTE_FUNCTION_MODE used to perform canonicalisation before calls and returns? i.e. would it be possible to have promote_mode as a no-op (leave SImode values as SImode in the RTL), but have TARGET_PROMOTE_FUNCTION_MODE perform promotions similar to our current PROMOTE_MODE definition? i.e. is TARGET_PROMOTE_FUNCTION_MODE the hook that is used to canonicalise values *before* calls and *before* returns? I’ll do some reading…

I was also curious about benchmarking the alternate ABI choice that leaves the upper bits undefined, and does narrowing in the caller. It would be an ABI break so is a no go, but I was curious about the performance of this option. Any 32-bit op causes narrowing on demand. It’s only the logical ops that would need to be explicitly narrowed in this alternate ABI.

In any case I don’t think the RISC-V maintainers would accept an ABI break; however, I’m curious as to the advantages and disadvantages of ABIs that do and don’t define the upper bits for narrower modes when passing values to and from functions.

Thanks, Michael.
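As a rough sketch of the shape such a hook could take (this is my own illustration against the GCC 7-era hook signature, not the patch that was eventually written, and it glosses over the signedness adjustments the RV64 ABI needs for unsigned 32-bit values): promote narrow integer arguments and return values to word_mode at call boundaries, while leaving PROMOTE_MODE undefined so values stay in SImode inside function bodies.

    /* sketch only; riscv_promote_function_mode here is our name, not the
       actual riscv.c implementation */
    static machine_mode
    riscv_promote_function_mode (const_tree type, machine_mode mode,
                                 int *punsignedp ATTRIBUTE_UNUSED,
                                 const_tree fntype ATTRIBUTE_UNUSED,
                                 int for_return ATTRIBUTE_UNUSED)
    {
      /* widen narrow integer arguments/returns for the ABI only */
      if (type != NULL_TREE
          && INTEGRAL_TYPE_P (type)
          && GET_MODE_CLASS (mode) == MODE_INT
          && GET_MODE_SIZE (mode) < UNITS_PER_WORD)
        return word_mode;
      return mode;
    }

    #undef  TARGET_PROMOTE_FUNCTION_MODE
    #define TARGET_PROMOTE_FUNCTION_MODE riscv_promote_function_mode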
Re: Redundant sign-extension instructions on RISC-V
Hi, I’ve spun an alternative version of RISC-V GCC which preserves RV64 ABI sign-extended canonicalisation for passed values but keeps values in SImode until return.

https://github.com/michaeljclark/riscv-gcc/commits/riscv-gcc-7-mc

These are the changes:

- #undef PROMOTE_MODE
- #define TARGET_PROMOTE_FUNCTION_MODE
- Implement riscv_function_promote to promote arguments and return values
- Add SImode logical op patterns to riscv.md

The RTL for the sign extension regression test cases now contains all SImode ops, however REE is still not able to remove sign_extend after and/ior/xor/not with SImode inputs that are in canonical form. I’m still seeing shifts not being coalesced as this likely happens in combine, and the redundant sign extensions are not eliminated until much later in REE with this particular configuration.

Observations:

- Aarch64 leaves upper bits undefined instead of canonicalising them and PROMOTE_MODE promotes QImode and HImode to SImode.
- RISC-V defines explicit sign_extend + op pattern matches in its metadata to collapse redundant sign extends.
- Explicit _extend metadata patterns appear to be required for REE to eliminate redundant sign_extend.
- I removed these patterns and then all functions returning int had: sext.w a0, a0; ret
- Explicit _extend metadata patterns span only two ops
- e.g. (sign_extend:DI (plus:SI)), (sign_extend:DI (minus:SI)), (sign_extend:DI (any_shift:SI)), etc
- REE may not be able to cross enough transitive dependencies to handle sign_extend and/ior/xor/not ops with canonical inputs
- e.g. (sign_extend:DI (logical_op:SI (plus_minus_shift:SI) (plus_minus_shift:SI)))
- e.g. (sign_extend:DI (logical_op:SI (logical_op:SI (logical_op:SI (plus_minus_shift:SI) (plus_minus_shift:SI)
- The metadata pattern matching approach is untenable where there is a series of logical ops on canonical inputs, e.g. the bswap test case.
- Shift coalescing is done before REE in combine so this may be why we are seeing missed optimisations.
- There may be more optimisations that are missed due to the sign_extend promotions changing patterns.
- Keeping values in SImode until return doesn’t appear to help much however it may change how the problem is solved.

Actions:

- Backport Prathamesh’s type promotion patch to riscv-gcc. It may suit the eager definition of PROMOTE_MODE.
- Benchmark the alternative ABI that leaves upper bits undefined, out of curiosity’s sake…

I’ve been reasoning about the alternate ABI. I can see it has the cost of widening narrower return values when they are promoted to wider types, but the general case of adding integers to pointers would not be affected when it is inline. Deferring canonicalisation when there are several routines that return ints and work on ints as inputs would seem to indicate generation of fewer operations. I’d really like to see SPECint run on the alternate ABI. However the current ABI could be improved if the compiler is better able to reason about cases where the sign_extend operations don’t need to be emitted, so it would not be a fair test against the current version of the port.

Michael.

> On 30 Aug 2017, at 12:36 PM, Michael Clark wrote: > > Dear GCC folk, > > > # Issue Background > > We’re investigating an issue with redundant sign-extension instructions being > emitted with the riscv backend.
Firstly I would like to state that riscv is > possibly a unique backend with respect to its canonical sign-extended > register form due to the following: > > - POINTERS_EXTEND_UNSIGNED is false, and the canonical form after 32-bit > operations on RV64 is sign-extended not zero-extended. > > - PROMOTE_MODE is defined on RV64 such that SI mode values are promoted to > signed DI mode values holding SI mode subregs > > - RISC-V does not have register aliases for these different modes, rather > different instructions are selected to operate on different register widths. > > - The 32-bit instructions sign-extend results. e.g. all shifts, add, sub, etc. > > - Some instructions such as logical ops only have full word width variants > (AND, OR, XOR) but these instructions maintain canonical form as there is no > horizontal movement of bits. > > > # Issue Details > > During porting from GCC 6.1 to GCC 7.1, the RISC-V port was changed to define > TRULY_NOOP_TRUNCATION to 1 and PROMOTE_MODE was set up to promote values to > wider modes. This was done to ensure ABI correctness so that registers > holding narrower modes always contain canonical sign extended values. > > The result of this is that sign_extend operations are inserted but later > eliminated in ree/simplyfy_rtx. In testcase 3 the sign_ex
Re: Register Allocation Graph Coloring algorithm and Others
Hi Leslie,

I suggest adding these 3 papers to your reading list.

Register allocation for programs in SSA-form
Sebastian Hack, Daniel Grund, and Gerhard Goos
http://www.rw.cdl.uni-saarland.de/~grund/papers/cc06-ra_ssa.pdf

Simple and Efficient Construction of Static Single Assignment Form
Matthias Braun, Sebastian Buchwald, Sebastian Hack, Roland Leißa, Christoph Mallon, and Andreas Zwinkau
https://www.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf

Optimal register allocation for SSA-form programs in polynomial time
Sebastian Hack, Gerhard Goos
http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoos-ipl06.pdf

A lot of the earlier literature regarding the register allocation problem describes the general graph colouring problem as NP-complete; however, previous research in register allocation has been in the context of programs that were not in SSA form, i.e. the Chaitin-Briggs paper states that register allocation is NP-complete and proposes an iterative algorithm. If one reads Hack and Goos, there is the discovery that programs that are in SSA form have chordal interference graphs, and the colouring algorithm for chordal graphs can be completed in polynomial time. After the cost of building the interference graph, it is possible to perform register allocation in a single pass. The key is in not modifying the graph.

If one has a frequency for each basic block, then one can sort basic blocks by frequency, allocating the highest frequency blocks first, and where possible assigning the same physical register to all virtual registers in each phi node (to avoid copies). At program points where the live set is larger than k, the set of physical registers, one spills the register that has the largest distance to its next use. At least that is how I am thinking about this problem. I am also a novice with regards to register allocation.

I intend to develop a register allocator for use in this JIT engine:

rv8: a high performance RISC-V to x86 binary translator
Michael Clark, Bruce Hoult
https://carrv.github.io/papers/clark-rv8-carrv2017.pdf

Our JIT already performs almost twice as fast as QEMU and we are using a static register allocation, and QEMU, I believe, has a register allocator. We are mapping a 31-register RISC architecture to a 16-register CISC architecture. I have statistics on the current slow-down, and it appears to mainly be L1 accesses to spilled registers. I believe after we have implemented a register allocator in our JIT we will be able to run RISC-V binaries at near native performance on x86-64. In fact, given we are performing runtime profile guided optimisation, it may even be possible for some benchmarks to run faster than register allocations that are based on static estimates of loop frequencies.

https://anarch128.org/~mclark/rv8-slides.pdf
https://rv8.io/bench

We’ve still got a long way to go to get to ~1:1 performance for RISC-V on x86-64, but I think it is possible, at least for some codes…

Regards, Michael.
> On 15/12/2017, at 4:18 PM, Leslie Zhai wrote: > > Hi GCC and LLVM developers, > > I am learning Register Allocation algorithms and I am clear that: > > * Unlimited VirtReg (pseudo) -> limited or fixed or alias[1] PhysReg (hard) > > * Memory (20 - 100 cycles) is expensive than Register (1 cycle), but it has > to spill code when PhysReg is unavailable > > * Folding spill code into instructions, handling register coallescing, > splitting live ranges, doing rematerialization, doing shrink wrapping are > harder than RegAlloc > > * LRA and IRA is default Passes in RA for GCC: > > $ /opt/gcc-git/bin/gcc hello.c > DEBUG: ../../gcc/lra.c, lra_init_once, line 2441 > DEBUG: ../../gcc/ira-build.c, ira_build, line 3409 > > * Greedy is default Pass for LLVM > > But I have some questions, please give me some hint, thanks a lot! > > * IRA is regional register allocator performing graph coloring on a top-down > traversal of nested regions, is it Global? compares with Local LRA > > * The papers by Briggs and Chaiten contradict[2] themselves when examine the > text of the paper vs. the pseudocode provided? > > * Why interference graph is expensive to build[3]? > > And I am practicing[4] to use HEA, developed by Dr. Rhydian Lewis, for LLVM > firstly. > > > [1] https://reviews.llvm.org/D39712 > > [2] http://lists.llvm.org/pipermail/llvm-dev/2008-March/012940.html > > [3] https://github.com/joaotavio/llvm-register-allocator > > [4] https://github.com/xiangzhai/llvm/tree/avr/include/llvm/CodeGen/GCol > > -- > Regards, > Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/ > > >
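To make the spill choice described in the reply above concrete, here is a minimal sketch (illustrative only; the names and data layout are my own assumptions) of picking the live value whose next use is furthest away when the live set exceeds the k available physical registers:

    /* illustrative sketch: furthest-next-use spill selection */
    typedef struct {
        int vreg;      /* virtual register id */
        int next_use;  /* instruction index of the next use */
    } live_range;

    static int pick_spill(const live_range *live, int nlive)
    {
        int best = 0;
        for (int i = 1; i < nlive; i++)
            if (live[i].next_use > live[best].next_use)
                best = i;
        return best;   /* spill the value used furthest in the future */
    }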
Re: RISC-V ELF multilibs
> On 1/06/2018, at 6:16 PM, Sebastian Huber > wrote: > > On 29/05/18 20:02, Jim Wilson wrote: >>> Most variants include the C extension. Would it be possible to add >>> -march=rv32g and -march=rv64g variants? >> >> The expectation is that most implementations will include the C extension. >> It reduces code size, improves performance, and I think I read somewhere >> that it takes only 400 gates to implement. >> >> It isn't practical to try to support every possible combination of >> architecture and ABI here, as there are too many possible combinations. But >> if there is a major RISC-V target that is rv32g or rv64g then we should >> consider it. You can of course define your own set of multilibs. > > I am not a hardware developer, but I heard that the C extension is not easy > to implement in out of order machines.

Easy is relative. The RISC-V ISA encoding has been designed so that wide fetch is relatively easy, possibly an order of magnitude easier than an ISA with a complex variable length encoding like x86. RV64GC instruction lengths can be derived from the 2 least significant bits in the first half-word packet of each instruction for mixed 16/32 bit instructions. It does not require an attempt to parse multiple prefixes, with instructions ranging from 1 byte up to 15 bytes, before arriving at an instruction length (x86). From that perspective RV64GC is decidedly simple. That said, RV64GC is a little more complex than regular 32-bit encodings like RV64G, PowerPC, AArch64, or SPARC.

I don’t think the paper you have linked to draws the conclusion you are arriving at. I spoke to Chris Celio about RVC and he said he just didn’t get around to implementing RVC in BOOM; not necessarily that its absence is a statement of its difficulty, but rather of the academic objectives of BOOM, one of a very small number of Open Source OoO processors. Sure, if it was “trivial” then BOOM would include it, so it’s certainly non-trivial.

For BOOM, RVC requires an instruction buffer that handles unaligned fetches, and the way the PC is derived further down the pipeline needs to be modified, as the same micro-op may have a different width depending on whether it was expanded from an RVC encoding. It may or may not need an extra pre-decoder pipe stage. I have a fair understanding of what’s required. The misaligned fetch is probably the most complex part, but BOOM already needs to handle complex cases like the fetch buffer spanning cache lines and page boundaries. Unaligned instruction fetches add to this.

> For example: > > https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-157.html > > -- > Sebastian Huber, embedded brains GmbH > > Address : Dornierstr. 4, D-82178 Puchheim, Germany > Phone : +49 89 189 47 41-16 > Fax : +49 89 189 47 41-09 > E-Mail : sebastian.hu...@embedded-brains.de > PGP : Public key available on request. > > Diese Nachricht ist keine geschäftliche Mitteilung im Sinne des EHUG. >
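The length-decoding point above can be stated in a few lines of C (a sketch written from the published RISC-V encoding rule, not taken from any particular decoder): for RV64GC only the low two bits of the first 16-bit parcel are needed.

    #include <stdint.h>

    /* 16-bit (compressed) if the low two bits are not 0b11, else 32-bit.
       The spec reserves longer encodings, but RV64GC does not use them. */
    static inline unsigned rv_insn_length(uint16_t first_parcel)
    {
        return (first_parcel & 0x3) == 0x3 ? 4 : 2;
    }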
Re: [RFC] Adding Python as a possible language and it's usage
On 20/07/2018, at 4:12 AM, Richard Earnshaw (lists) wrote: > On 19/07/18 12:30, Florian Weimer wrote: >> * Segher Boessenkool: >> >>> What would the advantage of using Python be? I haven't heard any yet. >>> Awk may be a bit clunky but at least it is easily readable for anyone. >> >> I'm not an experienced awk programmer, but I don't think plain awk >> supports arrays of arrays, so there's really no good way to emulate >> user-defined data types and structure the code. >> > > You can do multi-dimentional arrays in awk. They're flattened a bit > like tcl ones are, but there are ways of iterating over a dimention. > See, for example, config/arm/parsecpu.awk which gets up to tricks like that.

Has it occurred to anyone to write a lean and fast tool in the host C++ subset that is allowable for use in the host tools? I guess this is currently C++98/GNU++98. This adds no additional dependencies. Sure it is a slightly higher level of effort than writing an awk script, but with modern C++ this is less of an issue than it has ever been in the past. I personally use C++11/14 as a substitute for Python-type programs that would normally be considered script language material, mostly due to fluency and the fact that modern C++ has grown more tractable as a replacement for “fast to code in” languages, given it is much faster to code in than C.

LLVM discussed changing the host compiler language feature dependencies to C++11/C++14. There are obvious but not insurmountable bootstrap requirements, i.e. for very old systems it will require an intermediate C++11/C++14 compiler to bootstrap LLVM 6.0. Here is LLVM's new compiler baseline, and it seems to require CentOS 7:

- Clang 3.1
- GCC 4.8
- Visual Studio 2015 (Update 3)

[1] https://llvm.org/docs/GettingStarted.html#getting-a-modern-host-c-toolchain

I find I can be very productive and nearly as concise in C++11 as I can in many script languages due to modern versions of the standard library headers, auto, lambdas, etc. It’s relatively easy to write memory-clean code from the get go using std::unique_ptr<> and sparing amounts of std::shared_ptr<>, and the new C++11 for comprehensions, initializer lists and various other enhancements can make coding in “modern C++” relatively friendly and productive. In the words of Bjarne Stroustrup: It feels like a new language.

I can point to examples of small text processing utilities that I’ve written that could have been written in Python with a relatively similar amount of effort, given fluency with the STL and the use of lean idioms: the STL and plain structs (data hiding is only a header tweak away from being broken in a language like C++, and using plain structs is similar to a language like Python, which resorts to underscores or “idiomatic enforcement”). i.e. there are lightweight, fast and productive modern C++ idioms that work well with vectors, sets, maps and unique_ptr or shared_ptr automatic memory management. I find that with modern idioms these programs are almost always valgrind-clean.

Would modern C++ (for generated files) be considered for GCC 9? The new idioms may make parts of the code considerably more concise and could allow use of some of the new automatic memory management features. The reason I’m suggesting this is that for anything other than a trivial command line invocation of sed or awk, I would tend to write a modern C++ program to do text processing versus a script language like Python.
Firstly, it is faster. Secondly, I am proficient enough, and the set and map functionality combined with the new automatic memory management is sufficient that complex nested data structures and text processing can be handled with relative ease. Note: I do tend to avoid iostream and instead use stdc FILE * and fopen/fread/fwrite/fclose, or POSIX open/read/write/close if I want to avoid buffering. I find iostream performance is not that great.

How conservative are we? Is C++11 going to be available for use in GCC before C++2x in 202x? Indeed it would improve some of the Windows/UNIX interoperability. I’ve found that writing C++11/14 allows me to write in an idiomatic C/C++ subset that is quite stable across platforms. We now even have on Windows. There has been quite a bit of convergence.

Having the constraint that modern C++11/14 can only be used for generated files lessens the burden as the distribution build can maintain the same base compiler dependencies.

Michael.
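As an example of the kind of "script-like" modern C++ being described (my own illustration, not one of the utilities mentioned above), here is a complete word-frequency counter using stdio rather than iostream:

    #include <cstdio>
    #include <cstddef>
    #include <map>
    #include <string>

    int main()
    {
        std::map<std::string, std::size_t> counts;
        char buf[4096];
        /* count whitespace-delimited words read from stdin */
        while (std::fscanf(stdin, "%4095s", buf) == 1)
            counts[buf]++;
        for (const auto &kv : counts)
            std::printf("%zu\t%s\n", kv.second, kv.first.c_str());
        return 0;
    }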
Re: That light at the end of the tunnel?
> On 31/07/2018, at 12:47 AM, Eric S. Raymond wrote: > > Richard Biener : >> Ok, so let me ask whether you can currently convert trunk and >> gcc-{6,7,8}-branch >> successfully, ignoring "merges" into them (shouldn't have happened). All >> other >> branches can in theory be converted later if required, right? > > Trunk conversionm is currently broken. That's what makes this serious > problem; > it was working before. Didn’t you mention the trunk conversion issue was due to a bug introduced into reposurgeon at some date recently as a fix for some other conversion? Can you not roll back to a version of reposurgeon used for the last successful conversion? Assuming you know the date of the last successful conversion, then you can use the reposurgeon version from that date. What were the problems in the last “successful” conversion that led to the subsequent changes in reposurgeon? Were the conversion problems in the last successful conversion major enough that the resultant tree was not usable for trunk and the major development branches: gcc6, gcc7 and gcc8 branches? Or were the issues on older branches? If you are bisecting reposurgeon then would you not search for the date of the message when you had reported successful conversion here on this list (with whatever remaining issues that version had), and use the version of reposurgeon from that date for your next trial conversion? Last known good version of reposurgeon is the problem right? (with whatever outstanding issues that version had with the conversion).
[RFC] zip_vector: in-memory block compression of integer arrays
Hi Folks,

This is an edited version of a message posted on the LLVM Discourse. I want to share what I have been working on as I feel it may be of interest to the GCC compiler developers, specifically concerning alias analysis and optimizations for iteration of sparse block-based multi-arrays. I also have questions about optimization related to this implementation, specifically the observability of alias analysis pessimization and memory to register optimizations.

I have been working on _zip_vector_. _zip_vector_ is a compressed variable length array that uses vectorized block codecs to compress and decompress integers using dense variable bit width deltas as well as compressing constant values and sequences. _zip_vector_ employs integer block codecs optimized for vector instruction sets using the Google Highway C++ library for portable SIMD/vector intrinsics.

The high-level class supports 32-bit and 64-bit compressed integer arrays:

- `zip_vector<i32>`
  - { 8, 16, 24 } bit signed and unsigned fixed-width values.
  - { 8, 16, 24 } bit signed deltas and per block IV.
  - constants and sequences using per block IV and delta.
- `zip_vector<i64>`
  - { 8, 16, 24, 32, 48 } bit signed and unsigned fixed-width values.
  - { 8, 16, 24, 32, 48 } bit signed deltas with per block IV.
  - constants and sequences using per block IV and delta.

Here is a link to the implementation:

- https://github.com/metaparadigm/zvec/

The README has a background on the delta encoding scheme. If you read the source, "zvec_codecs.h" contains the low-level vectorized block codecs while "zvec_block.h" contains a high-level interface to the block codecs using cpuid-based dynamic dispatch. The high-level sparse integer vector class leveraging the block codecs is in "zip_vector.h". It has been tested with GCC and LLVM on x86-64 using SSE3, AVX, and AVX-512.

The principle of operation is to employ simplified block codecs dedicated to only compressing fixed-width integers, which are thus extremely fast, unlike typical compression algorithms. They are _on the order of 30-150 GiB/sec_ on a single core when running within the L1 cache on Skylake AVX-512.

From this perspective, zip_vector achieves its performance by reducing global memory bandwidth because it fetches and stores compressed data to and from RAM and then uses extremely fast vector codecs to pack and unpack compressed blocks within the L1 cache. In that sense, it is similar to texture compression codecs, but the specific use case is closer to storage for index arrays because the block codecs are lossless integer codecs. The performance is striking in that it can be faster for in-order read-only traversal than a regular array, while the primary goal is footprint reduction.

The design use case is an offsets array that might contain 64-bit values but usually contains smaller values. We wanted the convenience of simply using `zip_vector<i32>` or `zip_vector<i64>` while benefiting from the space advantages of storing data using 8, 16, 24, 32, and 48-bit deltas.

Q. Why is it specifically of interest to GCC developers?

I think the best way to answer this is with questions. How can we model a block-based iterator for a sparse array that is amenable to vectorization? There are aspects to the zip_vector iterator design that are *not done yet* concerning its current implementation. The iteration has two phases.
There is an inter-block phase at the boundary of each block (the logic inside of `switch_page`) that scans and compresses the previously active block, updates the page index, and decompresses the next block. Then there is a _broad-phase_ for intra-block accesses, which is amenable to vectorization due to the use of fixed-size blocks. *Making 1D iteration as fast as 2D iteration* Firstly I have to say that there is a lot of analysis for the optimization of the iterator that I would like to discuss. There is the issue of hoisting the inter-block boundary test from the fast path so that during block boundary traversal, subsequent block endings are calculated in advance so that the broad phase only requires a pointer increment and comparison with the addresses held in registers. The challenge is getting past compiler alias analysis. Alias analysis seems to prevent caching of the sum of the slab base address and active area offset in a register versus being demoted to memory accesses. These member variables hold the location of the slab and the offset to the uncompressed page which are both on the critical path. When these values are in memory, _it adds 4 or more cycles of latency for base address calculation on every access_. There is also the possibility to hoist and fold the active page check as we know we can make constructive proofs concerning changes to that value. Benchmarks compare the performance of 1D and 2D style iterators. At certain times the compiler would hoist the base and offset point
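For readers who don't follow the README link, the per-block IV plus narrow signed delta idea amounts to something like the following scalar sketch. This is only my simplified illustration: the library's real codecs are vectorized and also handle constants and sequences.

    #include <stdint.h>
    #include <stddef.h>

    /* decode one block: start from the block's initial value (IV) and
       accumulate narrow signed deltas into full-width values */
    static void delta_decode_block(int64_t iv, const int8_t *deltas,
                                   int64_t *out, size_t n)
    {
        int64_t v = iv;
        for (size_t i = 0; i < n; i++) {
            v += deltas[i];
            out[i] = v;
        }
    }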
Re: [RFC] zip_vector: in-memory block compression of integer arrays
On 17/08/22 7:10 pm, Richard Biener wrote: Q. Why is it specifically of interest to GCC developers? I think the best way to answer this is with questions. How can we model a block-based iterator for a sparse array that is amenable to vectorization? There are aspects to the zip_vector iterator design that are *not done yet* concerning its current implementation. The iteration has two phases. There is an inter-block phase at the boundary of each block (the logic inside of `switch_page`) that scans and compresses the previously active block, updates the page index, and decompresses the next block. Then there is a _broad-phase_ for intra-block accesses, which is amenable to vectorization due to the use of fixed-size blocks. *Making 1D iteration as fast as 2D iteration* Firstly I have to say that there is a lot of analysis for the optimization of the iterator that I would like to discuss. There is the issue of hoisting the inter-block boundary test from the fast path so that during block boundary traversal, subsequent block endings are calculated in advance so that the broad phase only requires a pointer increment and comparison with the addresses held in registers. The challenge is getting past compiler alias analysis. Alias analysis seems to prevent caching of the sum of the slab base address and active area offset in a register versus being demoted to memory accesses. These member variables hold the location of the slab and the offset to the uncompressed page which are both on the critical path. When these values are in memory, _it adds 4 or more cycles of latency for base address calculation on every access_. There is also the possibility to hoist and fold the active page check as we know we can make constructive proofs concerning changes to that value. Benchmarks compare the performance of 1D and 2D style iterators. At certain times the compiler would hoist the base and offset pointers from member variable accesses into registers in the 1D version making a noticeable difference in performance. In some respects, from the perspective of single-threaded code, the only way the pointer to the active region can change is inside `switch_page(size_t y)`. The potential payoff is huge because one may be able to access data ~ 0.9X - 3.5X faster than simply accessing integers in RAM when combining the reduction in global memory bandwidth with auto-vectorization, but the challenge is generating safe code for the simpler 1D iteration case that is as efficient as explicit 2D iteration. 1D iteration: for (auto x : vec) x2 += x; 2D iteration: for (size_t i = 0; i < n; i += decltype(vec)::page_interval) { i64 *cur = &vec[i], *end = cur + decltype(vec)::page_interval; while(cur != end) x2 += *cur++; } Note: In this example, I avoid having a different size loop tail but that is also a consideration. I trialled several techniques using a simplified version of the `zip_vector` class where `switch_page` was substituted with simple logic so that it was possible to get the compiler to coalesce the slab base pointer and active area offset into a single calculation upon page crossings. There is also hoisting of the active_page check (_y-parameter_) to only occur on block crossings. I found that when the `switch_page` implementation became more complex, i.e. probably adding an extern call to `malloc`, the compiler would resort to more conservatively fetching through a pointer to a member variable for the base pointer and offset calculation. 
See here: https://github.com/metaparadigm/zvec/blob/756e583472028fcc36e94c0519926978094dbb00/src/zip_vector.h#L491-L496 So I got to the point where I thought it would help to get input from compiler developers to figure out how to observe which internal constraints are violated by "switch_page" preventing the base pointer and offset address calculation from being cached in registers. "slab_data" and "active_area" are neither volatile nor atomic, so threads should not expect their updates to be atomic or go through memory. I tried a large number of small experiments. e.g. so let's try and collapse "slab_data" and "active_area" into one pointer at the end of "switch_page" so that we only have one pointer to access. Also, the "active_page" test doesn't necessarily need to be in the broad phase. I had attempted to manually hoist these variables by modifying the iterators but found it was necessary to keep them where they were to avoid introducing stateful invariants to the iterators that could become invalidated due to read accesses. Stack/register-based coroutines could help due to the two distinct states in the iterator. I want to say that it is not as simple as one might think on the surface. I tried several approaches to coalesce address calculations and move them into the page switch logic, all leading to performance fall-off, almost like the compiler was carrying some pessimization that forced touched member variables to be accessed via
Re: stack arenas using alloca
Hi Folks, *sending again with Thunderbird because Apple Mail munged the message*. I wanted to share a seed of an idea I have been ruminating on for a while, and that is being able to return alloca memory from a function. I think it’s trivially possible by hacking the epilogue to unlink the frame pointer but not pop the stack. Stack traces should work fine and it just looks like the parent function called an alloca which covers the fixed child frame and the alloca that it returned. I tend to use alloca quite a lot in my programming style because I like to avoid the heap. I have a style where I run a loop with a size pass, then call alloca, then run the loop again without suppressing writes. I would like to be able to return, initially, one alloca to a parent function, but more than one should also be possible. I haven't thought a lot about the mechanics of how one might signal to the compiler to suppress popping the stack, but I imagined one could return a pointer or structure with _Stack qualifiers to signal that stack memory from the child frame is being returned. This is just a quick note as there is a deeper conversation about compacting stack arenas that would be possible if one had a runtime *and* compile time reflection API for frame metadata, where the frame is expressed as an anonymous C structure perhaps accessible via a function on a reference to a function. If you had frame metadata, you could emit one or many memmove calls. trivially one just needs the fixed frame length to recover the fixed frame stack memory and with full frame reflection, it would be possible to defer to a runtime routine in the case two or more stack pointers are returned, assuming the compiler tracks sizes of alloca calls. Ideally, the reflection API would also be constexpr-able meaning the runtime compaction could be inlined. This won't work for types that have pointers unless one has a full single-thread compacting GC that uses type metadata, but I tend to use indices. But unlike heap compaction, it doesn't have to worry about races. The beauty of stack arenas, and compacting stack arenas, as a subtype, are many. Firstly the memory has no thread aliasing issues unlike heap memory, meaning analysis is much simpler and there are no pause issues like there are for GC. Secondly, it would be quite a neat way to constexpr variable-length arrays or a _List type because, unlike malloc, it’s much easier to reason about. I have a relatively complete C reflection API that could be used as a starting point for C reflection. It currently works as an LLVM plugin. I am not sure how I would do this in GCC. A Python parser came to mind but it would be better if it somehow ran inside the compiler. It would also be great to have access to the frame of a function as an anonymous structure. It’s only runtime at present and it could benefit from a for comprehension because looping on variable length lists is a little clumsy. If I could return alloca memory in a structure that would largely solve this problem. - https://github.com/michaeljclark/crefl Regards, Michael
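The size-pass-then-alloca style mentioned above, as I read it, looks roughly like this (my own minimal example, not taken from any of the author's code):

    #include <alloca.h>
    #include <stddef.h>

    /* one routine used for both passes: with out == NULL it only counts,
       with out != NULL it writes the results */
    static size_t keep_positive(int *out, const int *in, size_t n)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            if (in[i] > 0) {
                if (out)
                    out[count] = in[i];
                count++;
            }
        return count;
    }

    static void use(const int *in, size_t n)
    {
        size_t count = keep_positive(NULL, in, n);      /* size pass */
        int *buf = (int *)alloca(count * sizeof *buf);  /* stack arena */
        keep_positive(buf, in, n);                      /* write pass */
        /* ... use buf; the storage vanishes when use() returns ... */
    }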
Re: stack arenas using alloca
On 8/15/24 06:24, Michael Clark wrote:
> I wanted to share a seed of an idea I have been ruminating on for a while, and that is being able to return alloca memory from a function. [...]

It seems to work. I am hacking on LLVM. here are some notes.

- the calling convention could return with the new stack pointer covering the dynamic area, or it could return with the dynamic delta in a spare register for the caller to issue an alloca. I prefer that the stack pointer covers the dynamic region on return, because it makes the boundary clear for functions returning alloca memory: the stack pointer always covers the extent of stack-allocated memory.

- the epilogue code that subtracts the dynamic stack offset instead needs to not do that, which slightly changes the epilogue register restore code if it uses pop instructions, or it needs the delta for frame-pointer-relative moves. in my experiment I stashed the stack pointer in %rdi, let the x86 epilogue pop the callee-save restores, then fetched the return address, restored the stack pointer, and pushed the return address.

- as mentioned, the epilogue needs to push the return address to the end of the stack so that the return instruction works, or it could move it into a register and make an indirect branch. similar. this is unless folks went with an alternative convention that returns the delta and issues an alloca in the parent frame after the call. I don't like that.

- either way, metadata needs to make the call look pretty much like it is also a dynamic alloca instruction, so that there is a boundary and metadata like there is around dynamic alloca, and stack pointer calculations after the function work properly.

- emitting a write of the stack pointer in alloca instructions is an interesting consideration. it wouldn't be depended on, so it wouldn't add much latency, just a store slot and a tiny bit of store bandwidth. it's already right there. this is for the debugger.
Re: stack arenas using alloca
On 8/23/24 15:24, Michael Clark wrote:
> On 8/15/24 06:24, Michael Clark wrote:
>> Hi Folks,

like I said this is crazy talk, as alloca isn't even in the C standard. but VLAs are, and the current implementation of VLAs depends on alloca.

one more thing. it doesn't require PT_GNU_STACK or writable stacks like GCC nested functions. :-) so I think it is safer, but it does have safety issues, mostly related to stack overflows, and it's going to need some careful analysis with respect to ROP. returning deduced VLA types with bounds references in frame metadata and a chain of frame pointers for dynamic alloca might help.

also my experiment in LLVM needs clobbers in the parent frame because, while I was able to modify the epilogue based on an attribute, I so far haven't figured out how to make it look to the translator like an alloca in the parent frame so that subsequent references to earlier locals save their stack pointer in a register.

this was my first test function:

char* f(int n) __attribute__((allocareturn))
{
    static const char* fmt = "num-%d";
    int len = snprintf(0, 0, fmt, n);
    char *str = __builtin_alloca(len+1);
    snprintf(str, len+1, fmt, n);
    return str;
}

but this also works:

char* f(int n) __attribute__((allocareturn))
{
    static const char* fmt = "num-%d";
    int len = snprintf(0, 0, fmt, n);
    char str[len+1];
    snprintf(str, len+1, fmt, n);
    return str;
}

and this is even better:

f(int n) -> auto
{
    static const char* fmt = "num-%d";
    int len = snprintf(0, 0, fmt, n);
    char str[len+1];
    snprintf(str, len+1, fmt, n);
    return str;
}

like a reference in an implicit copy constructor for a structure.

struct anon123456 {
    int &n;
    int arr[n];
};

that would be positioned at the end of the fixed frame so that the parent function knows where the bound is. like I said. crazy talk.

Michael.
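PS: for completeness, a hedged sketch of the call site under the proposed convention, reusing f() from above; no released compiler accepts this, it only illustrates the intended lifetime:

#include <stdio.h>

void g(void)
{
    char *str = f(42);   /* f is the allocareturn function above           */
    puts(str);           /* str is covered by g's (extended) stack pointer */
}                        /* the dynamic area is released when g returns    */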
Re: stack arenas using alloca
On 8/23/24 15:46, Michael Clark wrote:
> one more thing. it doesn't require PT_GNU_STACK or writable stacks like GCC nested functions. 🙂 so I think it is safer but it does have safety issues, mostly related to stack overflows but it's going to need some careful analysis with respect to ROP.

brain isn't working. of course it needs writable stacks. it doesn't need executable stacks. starting to think about [[allocareturn]] ROP...
Re: stack arenas using alloca
On 8/23/24 15:57, Michael Clark wrote:
> On 8/23/24 15:46, Michael Clark wrote:
>> one more thing. it doesn't require PT_GNU_STACK or writable stacks like GCC nested functions. 🙂 [...]
>
> brain isn't working. of course it needs writable stacks. it doesn't need executable stacks. starting to think about [[allocareturn]] ROP...

if you loaded the return address from the fixed frame into a temporary in the epilogue, during the restore of callee-save registers that usually unwinds the dynamic offset, you could translate the return instruction into an indirect branch. that gives similar ROP protection to functions without dynamic alloca returns. then the biggest problem is stack overflows.

we can use alloca anyway, we just have to call the function twice with some funky convention that suppresses array writes for null pointers, because we have to call it in the parent frame. this just makes it somewhat more useful for functions returning lists where you don't want to allocate on the heap. if you don't have any side effects in loops besides array writes, you could have some funky feature where you run the loop once for a size. I tend to write a lot of code like that by hand and hate it. just look at code that uses the Vulkan API. it is full of client code that does that. that's sort of where this line of thinking emerged.

it makes it harder if you want to write to either stack or heap/data though, because you need a couple of different translations of the function. I would like implicitly deduced memory region qualifiers like _StackRegion, _DataRegion, _HeapRegion, and _ThreadRegion for TLS. and I'm making the assumption that the stack is data race free, which it is 99.9% of the time. then you can implicitly typedef _StackRegion MyType MyType in metadata for a translation that has an input type with _StackRegion.

I'd like an OS where stack memory can only escape to fibres and fibres can only run on one OS thread. that way we could allocate page table roots or overlay page tables per thread to make stack fault when accessed by another thread. grsecurity already does this for Linux kernel threads.

Michael.
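PS: the Vulkan idiom I mean is the standard two-call dance, roughly like this (error handling elided):

#include <vulkan/vulkan.h>
#include <alloca.h>

void list_devices(VkInstance instance)
{
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, NULL);     /* size pass */

    VkPhysicalDevice *devs = alloca(count * sizeof *devs);
    vkEnumeratePhysicalDevices(instance, &count, devs);     /* fill pass */

    /* devs lives only in this frame, which is the restriction the
     * allocareturn idea would remove */
    (void)devs;
}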
Re: #pragma once behavior
I like the sound of resolved path identity from search, including the constructed full path and the index within include paths. if I was writing a compiler from scratch, I'd probably do something like this:

  SHA2(include-path-search-offset-integer-of-found-header-to-support-include-next)
+ SHA2(container-relative-constructed-full-path-to-selected-header-during-include-search)
+ SHA2(selected-header-content)

and make it timeless. if you evaluate the search yourself, you have the constructed full path identity while you are searching, and you don't need to do tricks to get a full container-root-relative path because you already have it during the search. it's a forward-search-path-constructed relative root.

what are the problems? it can include the same file twice if it has two path identities (hardlink or symlink), yet that seems the safer option to me because constant search-configuration-relative identity sounds better, as the two could be defined to be different identities wrt #pragma once. it's the semantically safer option. and the canonical example that fails (same content, different path, same mtime) would succeed with this choice.

pick holes in this. I might be overlooking something.
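to make that concrete, here is a rough sketch in C, reading the '+' above as concatenation of the three digests; it uses OpenSSL's SHA256 only for brevity, and the real thing would of course live inside the preprocessor's include search:

#include <openssl/sha.h>
#include <stdint.h>
#include <string.h>

/* identity = SHA2(search index) || SHA2(constructed path) || SHA2(content) */
typedef struct { unsigned char id[3 * SHA256_DIGEST_LENGTH]; } header_identity;

header_identity header_id(uint32_t search_index, const char *constructed_path,
                          const unsigned char *content, size_t content_len)
{
    header_identity h;
    SHA256((const unsigned char *)&search_index, sizeof search_index, h.id);
    SHA256((const unsigned char *)constructed_path, strlen(constructed_path),
           h.id + SHA256_DIGEST_LENGTH);
    SHA256(content, content_len, h.id + 2 * SHA256_DIGEST_LENGTH);
    return h;
}

/* two headers are "the same" for #pragma once iff their identities
 * compare equal with memcmp */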
Re: Passing a hidden argument in a dedicated register
> On 17 Dec 2024, at 5:44 AM, Alexander Monakov wrote:
>
>> On Mon, 16 Dec 2024, Florian Weimer via Gcc wrote:
>>
>> I would like to provide a facility to create wrapper functions without
>> lots of argument shuffling. To achieve that, the wrapping function and
>> the wrapped function should have the same prototype. There will be a
>> trampoline that puts additional data somewhere (possibly including the
>> address of the wrapped function, but that interpretation is up to the
>> wrapping function) and then transfers control to the wrapper function
>> with an indirect jump (tail call).
>>
>> For signal safety, I think the hidden argument needs to be in a register
>> (instead of, say, thread-local storage). Most System V ABI variants
>> seem to reserve a register for use by the dynamic linker, or for the
>> static chain pointer of nested functions.
>>
>> Is there a way to reuse either register for this purpose and assign it
>> to a local variable reliably at the start of the wrapper function
>> implementation?
>
> Not in a way that will work with LLVM, I'm afraid, and with GCC
> you'll have to shield wrappers from LTO:
>
> register void *r10 asm("r10");
> void f(int, int);
> void f_wrap(int a, int b)
> {
>     r10 = f;
>     f(a, b);
> }
>
> LLVM refuses to translate this. With GCC you must compile with -ffixed-r10,
> otherwise r10 is not reserved, and GCC will warn:
>
> warning: call-clobbered register used for global register variable
>     1 | register void *r10 asm("r10");
>       |                ^~~
>
> This is the only approach I'm aware of, apart of generating wrappers
> in asm (speaking of, is there some reason that wouldn't work for you?).

I have a similar use case for replacing a scalar version of a function with a vector version. the vector version has some constant reference data that I want to hoist into the caller because it is called in a loop. I don't know how to do that without breaking interface encapsulation and using ifdef to manually hoist the uniform reference data into the caller. also I want it to be explicitly controlled, not left at the mercy of random optimizations.

in some trusted environment, if I had a transient field in a context structure passed by pointer that the function loaded into a vector register, then if the caller checked an identity invariant (the context pointer identity does not change in the loop) it could hoist some vector loads into the caller’s loop header. but that wouldn’t work for extern. you would need some vector extern ABI for functions called in loops that have a setup call that needs to be called in a loop header. it’s due to loads of vector reference data combined with the requirement of interface encapsulation for scalar code, which tends to make you want to preserve a simple calling convention for the function user. in vulkan you would bind a uniform into the function’s environment. but we don't want the caller to need to know in advance what might be hoisted.

so I want a function designed for loops to export an implicit loop header setup stub contingent on the identity of a context pointer. the calling convention says the caller needs to call the setup stub if the context identity changes. you could do that with attributes on the context identity and an outline function containing the hoist. I could then have a loop header setup call that is consistent with both scalar and vector code, and the scalar version is probably empty. reference data loads for shuffles are huge.
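a hedged sketch of the convention I'm imagining, with made-up names; the only point is that the caller re-runs the setup stub whenever the context pointer identity changes, which gives the callee a place to hoist loads of constant reference data (e.g. shuffle tables) out of the loop body:

struct ctx;                                       /* opaque context holding reference data */

void transform_setup(const struct ctx *c);        /* hoisted loads live here; empty for scalar */
void transform(const struct ctx *c, float *dst, const float *src, int n);

void run(const struct ctx *const *frames, float *const *dst,
         const float *const *src, int nframes, int n)
{
    const struct ctx *cur = 0;
    for (int i = 0; i < nframes; i++) {
        if (frames[i] != cur) {                   /* context identity changed? */
            cur = frames[i];
            transform_setup(cur);                 /* loop-header setup call */
        }
        transform(cur, dst[i], src[i], n);
    }
}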
a super regular RISC that encodes constants in immediate blocks.
Hi Folks, here is an idea expressed as a simple proof-of-concept simulator.

- https://github.com/michaeljclark/glyph/

I have been working on a proof-of-concept simulator for a RISC architecture with an immediate base register next to the program counter to split the front-end stream into independent instruction and constant streams. I named it glyph. it features a super-regular encoding scheme designed for vectorized decoding, and it uses a _i32x2_ vector of relative displacements in the link register to branch both instructions and constants at the same time. this evolved from thinking about a "virt" machine that was good for software to decode but could potentially be hardware.

I am still writing tests for it, but it is possible to test out the relative link register _(pc,ib)_ vector concept using subroutine branch displacements and constant block displacements in constant memory, which is read-only like instruction text. it has some interesting consequences. the stack does not leak spilled absolute addresses, so ASLR may be harder to bypass. it uses flags for compare. register 0 can currently contain data. near branches are +/-512 bytes _(9-bit is the largest embedded immediate in the 16-bit format)_.

the proof-of-concept has 8 registers, and there is a Python script for generating vectorized switch decoders for the variable-length instruction scheme. however, the proof-of-concept simulator currently only supports the 16-bit compressed opcode space. it doesn't have a compiler or assembler yet, but considerable thought has gone into the design of the instruction format and the split instruction and constant streams. the 16-bit opcode space can access 64-bit constants, but the 32-bit opcodes will support all typical constant sizes, up to 64 read-write registers, and more constants. linker relocations are quite neat with this architecture. return from procedure needs a reloc due to the use of relative displacements in the link register.

I am keen for feedback. there are a lot of details in the repo, including a Python script to explore combinatoric expansion for vectorized decoders. it has a simpler length coding scheme than RISC-V at the expense of one bit of coding space in the 16-bit packet. as a result, it uses fewer wires and less logic for length decoding of 64-bit instructions. it borrows an idea from LEB128, i.e. it is _super regular_. but we won't know how it will do on density until we have the 32-bit opcodes.

Michael
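PS: to make the _(pc,ib)_ link register concrete, here is roughly what return looks like in C. the field names and widths here are illustrative, and the authoritative details are in the repo:

#include <stdint.h>

typedef struct { int32_t pc_disp, ib_disp; } link_t;   /* relative displacement pair */

typedef struct {
    uint64_t pc;     /* program counter: instruction stream     */
    uint64_t ib;     /* immediate base:  constant stream        */
    link_t   lr;     /* link register holding the relative pair */
} hart_t;

/* return: branch the instruction and constant streams together by applying
 * the relative pair; the displacements themselves are fixed up by linker
 * relocations, per the note above. */
void ret_both(hart_t *h)
{
    h->pc += (int64_t)h->lr.pc_disp;
    h->ib += (int64_t)h->lr.ib_disp;
}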
a super regular RISC that encodes constants in immediate blocks
Hi GCC folks, introducing a new RISC instruction set with a variable-length instruction packet using a super regular extension scheme, with compression techniques that separate the instruction stream into two streams, one for instructions and one for constants:

- latest: https://metaparadigm.com/~mclark/glyph.pdf
- current: https://metaparadigm.com/~mclark/glyph-20250507.pdf

I am currently working on an assembler for it. I intend to come up with something that lets one manually emit constants into a .const section, plus directives to tie together instructions and constants for local and global symbols, but I may also come up with a meta-assembler mode that lets one use inline constants, whereby the assembler splits things into streams and allocates address space for constants.

I am writing a prototype assembler in Python. it should hopefully be ready in a few months. I would like to start with a Python assembler instead of using binutils because it will make for faster compression tuning when I get to the layout of the 32-bit opcodes, which I intend to contain 128-bit packed SIMD; the 64-bit opcodes I intend to map to the 5-operand EVEX opcodes in AMD and Intel's AVX-512.

this is a preview release. note: the 16-bit packet is designed as a standalone ISA, but it needs read-modify-write for bytes. we have focused on utility for function prologue and epilogue. other types and atomics will come in the 32-bit packet. it is intended to be a virtual machine target and I wish to recode AVX-512 to it. AMD and Intel can bite me if they don't want me to do this.

there are interpreters in C, Python, and Go. and I also have a related project that I am working on for translation to x86.

- https://github.com/michaeljclark/glyph/
- https://github.com/michaeljclark/x86/

this work has been mentioned on the RISC-V mailing list but this is the first time here. I think it is ready for early feedback.

Michael.