Suboptimal code generated for __builtin_rint on AMD64 without SSE4.1
Hi,

targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (12 instructions using 51 bytes, plus 4 quadwords using
32 bytes) for __builtin_rint() when -msse4.1 is NOT given:

.text
   0:  f2 0f 10 15 10 00 00 00  movsd   .LC1(%rip), %xmm2
           4: R_X86_64_PC32  .rdata
   8:  f2 0f 10 1d 00 00 00 00  movsd   .LC0(%rip), %xmm3
           c: R_X86_64_PC32  .rdata
  10:  66 0f 28 c8              movapd  %xmm0, %xmm1
  14:  66 0f 54 ca              andpd   %xmm2, %xmm1
  18:  66 0f 2f d9              comisd  %xmm1, %xmm3
  1c:  76 14                    jbe     32
  1e:  f2 0f 58 cb              addsd   %xmm3, %xmm1
  22:  66 0f 55 d0              andnpd  %xmm0, %xmm2
  26:  f2 0f 5c cb              subsd   %xmm3, %xmm1
  2a:  66 0f 56 ca              orpd    %xmm2, %xmm1
  2e:  66 0f 28 c1              movapd  %xmm1, %xmm0
  32:  c3                       retq

.rdata
       .align 8
   0:  00 00 00 00  .LC0:  .quad 0x1.0p52
       00 00 30 43
       00 00 00 00
       00 00 00 00
       .align 16
  10:  ff ff ff ff  .LC1:  .quad ~(-0.0)
       ff ff ff 7f
  18:  00 00 00 00         .quad 0.0
       00 00 00 00
.end

JFTR: in the best case, the memory accesses cost several cycles, while
in the worst case they yield a page fault!

Properly optimized, faster and shorter code, using just 9 instructions
in only 33 bytes, WITHOUT superfluous constants, thus avoiding costly
memory accesses and saving at least 16 + 32 bytes, follows:

.intel_syntax
.text
   0:  f2 48 0f 2c c0  cvtsd2si  rax, xmm0   # rax = llrint(argument)
   5:  48 f7 d8        neg       rax
               #       jz        .L0         # argument zero?
   8:  70 16           jo        .L0         # argument indefinite?
                                             # argument overflows 64-bit integer?
   a:  48 f7 d8        neg       rax
   d:  f2 48 0f 2a c8  cvtsi2sd  xmm1, rax   # xmm1 = rint(argument)
  12:  66 0f 73 d0 3f  psrlq     xmm0, 63
  17:  66 0f 73 f0 3f  psllq     xmm0, 63    # xmm0 = (argument & -0.0) ? -0.0 : 0.0
  1c:  66 0f 56 c1     orpd      xmm0, xmm1  # xmm0 = round(argument)
  20:  c3        .L0:  ret
.end

regards
Stefan
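For reference, a minimal reproducer for the sequence above (a sketch;
the exact output depends on the GCC build and version):

    /* Compile with: gcc-10 -O3 -mno-sse4.1 -S rint.c
       With -msse4.1 instead, the whole body collapses to a single
       "roundsd" using the current rounding mode.  */
    double rint_wrapper (double x)
    {
        return __builtin_rint (x);
    }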
Suboptimal code generated for __builtin_ceil on AMD64 without SSE4.1
Hi,

targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (17 instructions using 78 bytes, plus 6 quadwords using
48 bytes) for __builtin_ceil() when -msse4.1 is NOT given:

.text
   0:  f2 0f 10 15 10 00 00 00  movsd     .LC1(%rip), %xmm2
           4: R_X86_64_PC32  .rdata
   8:  f2 0f 10 25 00 00 00 00  movsd     .LC0(%rip), %xmm4
           c: R_X86_64_PC32  .rdata
  10:  66 0f 28 d8              movapd    %xmm0, %xmm3
  14:  66 0f 28 c8              movapd    %xmm0, %xmm1
  18:  66 0f 54 da              andpd     %xmm2, %xmm3
  1c:  66 0f 2e e3              ucomisd   %xmm3, %xmm4
  20:  76 2b                    jbe       4d <_ceil+0x4d>
  22:  f2 48 0f 2c c0           cvttsd2si %xmm0, %rax
  27:  66 0f ef db              pxor      %xmm3, %xmm3
  2b:  f2 0f 10 25 20 00 00 00  movsd     0x20(%rip), %xmm4
           2f: R_X86_64_PC32  .rdata
  33:  66 0f 55 d1              andnpd    %xmm1, %xmm2
  37:  f2 48 0f 2a d8           cvtsi2sd  %rax, %xmm3
  3c:  f2 0f c2 c3 06           cmpnlesd  %xmm3, %xmm0
  41:  66 0f 54 c4              andpd     %xmm4, %xmm0
  45:  f2 0f 58 c3              addsd     %xmm3, %xmm0
  49:  66 0f 56 c2              orpd      %xmm2, %xmm0
  4d:  c3                       retq

.rdata
       .align 8
   0:  00 00 00 00  .LC0:  .quad 0x1.0p52
       00 00 30 43
       00 00 00 00
       00 00 00 00
       .align 16
  10:  ff ff ff ff  .LC1:  .quad ~(-0.0)
       ff ff ff 7f
  18:  00 00 00 00         .quad 0.0
       00 00 00 00
       .align 8
  20:  00 00 00 00  .LC2:  .quad 0x1.0p0
       00 00 f0 3f
       00 00 00 00
       00 00 00 00
.end

JFTR: in the best case, the memory accesses cost several cycles, while
in the worst case they yield a page fault!

Properly optimized, faster and shorter code, using just 15 instructions
in 65 bytes, WITHOUT superfluous constants, thus avoiding costly memory
accesses and saving at least 32 bytes, follows:

.intel_syntax
.equ BIAS, 1023
.text
   0:  f2 48 0f 2c c0  cvttsd2si rax, xmm0   # rax = trunc(argument)
   5:  48 f7 d8        neg       rax
               #       jz        .L0         # argument zero?
   8:  70 36           jo        .L0         # argument indefinite?
                                             # argument overflows 64-bit integer?
   a:  48 f7 d8        neg       rax
   d:  f2 48 0f 2a c8  cvtsi2sd  xmm1, rax   # xmm1 = trunc(argument)
  12:  48 a1 00 00 00  mov       rax, BIAS << 52
  19:  00 00 00 f0 3f
  1c:  66 48 0f 6e d0  movq      xmm2, rax   # xmm2 = 0x1.0p0
  21:  f2 0f 10 d8     movsd     xmm3, xmm0  # xmm3 = argument
  25:  f2 0f c2 d9 02  cmplesd   xmm3, xmm1  # xmm3 = (argument <= trunc(argument)) ? ~0L : 0L
  2a:  66 0f 55 da     andnpd    xmm3, xmm2  # xmm3 = (argument <= trunc(argument)) ? 0.0 : 1.0
  2e:  f2 0f 58 d9     addsd     xmm3, xmm1  # xmm3 = (argument > trunc(argument)) ? 1.0 : 0.0
                                             #        + trunc(argument)
                                             #      = ceil(argument)
  32:  66 0f 73 d0 3f  psrlq     xmm0, 63
  37:  66 0f 73 f0 3f  psllq     xmm0, 63    # xmm0 = (argument & -0.0) ? -0.0 : 0.0
  3c:  66 0f 56 c3     orpd      xmm0, xmm3  # xmm0 = ceil(argument)
  40:  c3        .L0:  ret
.end

regards
Stefan
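The adjustment performed by the replacement sequence can be written in
C as follows (a sketch with a hypothetical helper name, valid only for
values that fit in a signed 64-bit integer, the case the jo path
filters out):

    double ceil_sketch (double x)
    {
        double t = (double)(long long)x;   /* cvttsd2si + cvtsi2sd        */
        t += (x > t) ? 1.0 : 0.0;          /* cmplesd/andnpd/addsd        */
        return __builtin_copysign (t, x);  /* psrlq/psllq/orpd: -0.0 case */
    }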
Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
Hi,

targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (13 instructions using 57 bytes, plus 4 quadwords using
32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:

.text
   0:  f2 0f 10 15 10 00 00 00  movsd     .LC1(%rip), %xmm2
           4: R_X86_64_PC32  .rdata
   8:  f2 0f 10 25 00 00 00 00  movsd     .LC0(%rip), %xmm4
           c: R_X86_64_PC32  .rdata
  10:  66 0f 28 d8              movapd    %xmm0, %xmm3
  14:  66 0f 28 c8              movapd    %xmm0, %xmm1
  18:  66 0f 54 da              andpd     %xmm2, %xmm3
  1c:  66 0f 2e e3              ucomisd   %xmm3, %xmm4
  20:  76 16                    jbe       38 <_trunc+0x38>
  22:  f2 48 0f 2c c0           cvttsd2si %xmm0, %rax
  27:  66 0f ef c0              pxor      %xmm0, %xmm0
  2b:  66 0f 55 d1              andnpd    %xmm1, %xmm2
  2f:  f2 48 0f 2a c0           cvtsi2sd  %rax, %xmm0
  34:  66 0f 56 c2              orpd      %xmm2, %xmm0
  38:  c3                       retq

.rdata
       .align 8
   0:  00 00 00 00  .LC0:  .quad 0x1.0p52
       00 00 30 43
       00 00 00 00
       00 00 00 00
       .align 16
  10:  ff ff ff ff  .LC1:  .quad ~(-0.0)
       ff ff ff 7f
  18:  00 00 00 00         .quad 0.0
       00 00 00 00
.end

JFTR: in the best case, the memory accesses cost several cycles, while
in the worst case they yield a page fault!

Properly optimized, shorter and faster code, using only 9 instructions
in just 33 bytes, WITHOUT any constants, thus avoiding costly memory
accesses and saving at least 16 + 32 bytes, follows:

.intel_syntax
.text
   0:  f2 48 0f 2c c0  cvttsd2si rax, xmm0   # rax = trunc(argument)
   5:  48 f7 d8        neg       rax
               #       jz        .L0         # argument zero?
   8:  70 16           jo        .L0         # argument indefinite?
                                             # argument overflows 64-bit integer?
   a:  48 f7 d8        neg       rax
   d:  f2 48 0f 2a c8  cvtsi2sd  xmm1, rax   # xmm1 = trunc(argument)
  12:  66 0f 73 d0 3f  psrlq     xmm0, 63
  17:  66 0f 73 f0 3f  psllq     xmm0, 63    # xmm0 = (argument & -0.0) ? -0.0 : 0.0
  1c:  66 0f 56 c1     orpd      xmm0, xmm1  # xmm0 = trunc(argument)
  20:  c3        .L0:  ret
.end

regards
Stefan
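The replacement sequence corresponds to this C (a sketch under the same
caveat: only valid when the value fits in a signed 64-bit integer,
which the jo branch checks):

    double trunc_sketch (double x)
    {
        long long i = (long long)x;        /* cvttsd2si: truncates         */
        double t = (double)i;              /* cvtsi2sd                     */
        return __builtin_copysign (t, x);  /* restores -0.0, as the
                                              psrlq/psllq/orpd trailer does */
    }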
Re: Suboptimal code generated for __builtin_ceil on AMD64 without SSE4.1
Could you file a bugzilla for that?
https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc

On Thu, Aug 5, 2021 at 3:34 PM Stefan Kanthak wrote:
>
> Hi,
>
> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> following code (17 instructions using 78 bytes, plus 6 quadwords
> using 48 bytes) for __builtin_ceil() when -msse4.1 is NOT given:
>
> [...snip...]
>
> regards
> Stefan

--
BR,
Hongtao
Suboptimal code generated for __builtin_floor on AMD64 without SSE4.1
Hi,

targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (19 instructions using 86 bytes, plus 6 quadwords using
48 bytes) for __builtin_floor() when -msse4.1 is NOT given:

.text
   0:  f2 0f 10 15 10 00 00 00  movsd     .LC1(%rip), %xmm2
           4: R_X86_64_PC32  .rdata
   8:  f2 0f 10 25 00 00 00 00  movsd     .LC0(%rip), %xmm4
           c: R_X86_64_PC32  .rdata
  10:  66 0f 28 d8              movapd    %xmm0, %xmm3
  14:  66 0f 28 c8              movapd    %xmm0, %xmm1
  18:  66 0f 54 da              andpd     %xmm2, %xmm3
  1c:  66 0f 2e e3              ucomisd   %xmm3, %xmm4
  20:  76 33                    jbe       55 <_floor+0x55>
  22:  f2 48 0f 2c c0           cvttsd2si %xmm0, %rax
  27:  66 0f ef db              pxor      %xmm3, %xmm3
  2b:  66 0f 55 d1              andnpd    %xmm1, %xmm2
  2f:  f2 48 0f 2a d8           cvtsi2sd  %rax, %xmm3
  34:  66 0f 28 e3              movapd    %xmm3, %xmm4
  38:  f2 0f c2 e0 06           cmpnlesd  %xmm0, %xmm4
  3d:  f2 0f 10 05 20 00 00 00  movsd     .LC2(%rip), %xmm0
           41: R_X86_64_PC32  .rdata
  45:  66 0f 54 e0              andpd     %xmm0, %xmm4
  49:  f2 0f 5c dc              subsd     %xmm4, %xmm3
  4d:  66 0f 28 c3              movapd    %xmm3, %xmm0
  51:  66 0f 56 c2              orpd      %xmm2, %xmm0
  55:  c3                       retq

.rdata
       .align 8
   0:  00 00 00 00  .LC0:  .quad 0x1.0p52
       00 00 30 43
       00 00 00 00
       00 00 00 00
       .align 16
  10:  ff ff ff ff  .LC1:  .quad ~(-0.0)
       ff ff ff 7f
  18:  00 00 00 00         .quad 0.0
       00 00 00 00
       .align 8
  20:  00 00 00 00  .LC2:  .quad 0x1.0p0
       00 00 f0 3f
       00 00 00 00
       00 00 00 00
.end

JFTR: in the best case, the memory accesses cost several cycles, while
in the worst case they yield a page fault!

Properly optimized, shorter and faster code, using only 15 instructions
in just 65 bytes, WITHOUT superfluous constants, thus avoiding costly
memory accesses and saving at least 16 + 48 bytes, follows:

.intel_syntax
.equ BIAS, 1023
.text
   0:  f2 48 0f 2c c0  cvttsd2si rax, xmm0   # rax = trunc(argument)
   5:  48 f7 d8        neg       rax
               #       jz        .L0         # argument zero?
   8:  70 36           jo        .L0         # argument indefinite?
                                             # argument overflows 64-bit integer?
   a:  48 f7 d8        neg       rax
   d:  f2 48 0f 2a c8  cvtsi2sd  xmm1, rax   # xmm1 = trunc(argument)
  12:  48 a1 00 00 00  mov       rax, (1 << 63) | (BIAS << 52)
  19:  00 00 00 f0 bf
  1c:  66 48 0f 6e d0  movq      xmm2, rax   # xmm2 = -0x1.0p0
  21:  f2 0f 10 d8     movsd     xmm3, xmm0  # xmm3 = argument
  25:  f2 0f c2 d9 01  cmpltsd   xmm3, xmm1  # xmm3 = (argument < trunc(argument)) ? ~0L : 0L
  2a:  66 0f 54 da     andpd     xmm3, xmm2  # xmm3 = (argument < trunc(argument)) ? -1.0 : 0.0
  2e:  f2 0f 58 d9     addsd     xmm3, xmm1  # xmm3 = (argument < trunc(argument)) ? -1.0 : 0.0
                                             #        + trunc(argument)
                                             #      = floor(argument)
  32:  66 0f 73 d0 3f  psrlq     xmm0, 63
  37:  66 0f 73 f0 3f  psllq     xmm0, 63    # xmm0 = (argument & -0.0) ? -0.0 : 0.0
  3c:  66 0f 56 c3     orpd      xmm0, xmm3  # xmm0 = floor(argument)
  40:  c3        .L0:  ret
.end

regards
Stefan
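Analogously to the ceil variant upthread, the floor adjustment reads in
C as (a sketch, in-range values only; overflow and NaN take the jo
path):

    double floor_sketch (double x)
    {
        double t = (double)(long long)x;   /* trunc via cvttsd2si/cvtsi2sd */
        t += (x < t) ? -1.0 : 0.0;         /* cmpltsd/andpd/addsd          */
        return __builtin_copysign (t, x);  /* sign-of-zero trailer         */
    }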
Re: [RFC] Adding a new attribute to function param to mark it as constant
On Wed, 4 Aug 2021 at 18:30, Richard Earnshaw wrote:
>
> On 04/08/2021 13:46, Segher Boessenkool wrote:
> > On Wed, Aug 04, 2021 at 05:20:58PM +0530, Prathamesh Kulkarni wrote:
> >> On Wed, 4 Aug 2021 at 15:49, Segher Boessenkool wrote:
> >>> Both __builtin_constant_p and __is_constexpr will not work in your use
> >>> case (since a function argument is not a constant, let alone an ICE).
> >>> It only becomes a constant value later on.  The manual (for the former)
> >>> says:
> >>>   You may use this built-in function in either a macro or an inline
> >>>   function.  However, if you use it in an inlined function and pass an
> >>>   argument of the function as the argument to the built-in, GCC never
> >>>   returns 1 when you call the inline function with a string constant or
> >>>   compound literal (see Compound Literals) and does not return 1 when
> >>>   you pass a constant numeric value to the inline function unless you
> >>>   specify the -O option.
> >> Indeed, that's why I was thinking if we should use an attribute to mark
> >> param as a constant, so during type-checking the function call, the
> >> compiler can emit a diagnostic if the passed arg is not a constant.
> >
> > That will depend on the vagaries of what optimisations the compiler
> > managed to do :-(
> >
> >> Alternatively -- as you suggest, we could define a new builtin, say
> >> __builtin_ice(x) that returns true if 'x' is an ICE.
> >
> > (That is a terrible name, it's not clear at all to the reader, just
> > write it out?  It is fun if you know what it means, but infuriating
> > otherwise.)
> >
> >> And wrap the intrinsic inside a macro that would check if the arg is an
> >> ICE ?
> >
> > That will work yeah.  Maybe not as elegant as you'd like, but not all
> > that bad, and it *works*.  Well, hopefully it does :-)
> >
> >> For eg:
> >>
> >> __extension__ extern __inline int32x2_t
> >> __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
> >> vshl_n_s32_1 (int32x2_t __a, const int __b)
> >> {
> >>   return __builtin_neon_vshl_nv2si (__a, __b);
> >> }
> >>
> >> #define vshl_n_s32(__a, __b) \
> >> ({ typeof (__a) a = (__a); \
> >>    _Static_assert (__builtin_constant_p ((__b)), #__b " is not an integer constant"); \
> >>    vshl_n_s32_1 (a, (__b)); })
> >>
> >> void f(int32x2_t x, const int y)
> >> {
> >>   vshl_n_s32 (x, 2);
> >>   vshl_n_s32 (x, y);
> >>
> >>   int z = 1;
> >>   vshl_n_s32 (x, z);
> >> }
> >>
> >> With this, the compiler rejects vshl_n_s32 (x, y) and vshl_n_s32 (x,
> >> z) at all optimization levels since neither 'y' nor 'z' is an ICE.
> >
> > You used __builtin_constant_p though, which works differently, so the
> > test is not conclusive, might not show what you want to show.
> >
> >> Instead of __builtin_constant_p, we could use __builtin_ice.
> >> Would that be a reasonable approach ?
> >
> > I think it will work, yes.
> >
> >> But this changes the semantics of intrinsic from being an inline
> >> function to a macro, and I am not sure if that's a good idea.
> >
> > Well, what happens if you call the actual builtin directly, with some
> > non-constant parameter?  That just fails with a more cryptic error,
> > right?  So you can view this as some syntactic sugar to make these
> > intrinsics easier to use.
> >
> > Hrm I now remember a place I could have used this:
> >
> > #define mtspr(n, x) do { asm("mtspr %1,%0" : : "r"(x), "n"(n)); } while (0)
> > #define mfspr(n) ({ \
> >   u32 x; asm volatile("mfspr %0,%1" : "=r"(x) : "n"(n)); x; \
> > })
> >
> > It is quite similar to your builtin code really, and I did resort to
> > macros there, for similar reasons :-)
> >
> >
> > Segher
> >
>
> We don't want to have to resort to macros.  Not least because at some
> point we want to replace the content of arm_neon.h with a single #pragma
> directive to remove all the parsing of the header that's needed.  What's
> more, if we had a suitable pragma we'd stand a fighting chance of being
> able to extend support to other languages as well that don't use the
> pre-processor, such as Fortran or Ada (not that that is on the cards
> right now).

Hi,

IIUC, a more general issue here, is that the intrinsics require special
type-checking of arguments, beyond what is dictated by the Standard.
An argument needing to be an ICE could be seen as one instance.

So perhaps, should there be some mechanism to tell the FE to let the
target do additional checking for a particular function call, say by
explicitly marking it with "intrinsic" attribute ?

So while type checking a call to a function marked with "intrinsic"
attribute, FE can invoke target handler with name of function and
corresponding arguments passed, and then leave it to the target for
further checking ?

For vshl_n case, the target hook would check that the 2nd arg is an
integer constant within the permissible range.

I propose to do this only for intrinsics that need special checking and
can be entirely implemented with C extensions and won't
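The optimization-level dependence described in the quoted manual text is
easy to demonstrate (a sketch):

    /* Inside an inline function, __builtin_constant_p on a parameter
       only becomes 1 once inlining and constant propagation have run,
       so the result differs between -O0 and -O1 and above.  */
    static inline int arg_is_constant (int x)
    {
        return __builtin_constant_p (x);
    }

    int f (void)
    {
        return arg_is_constant (2);  /* 0 at -O0, typically 1 at -O2 */
    }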
Re: [RFC] Adding a new attribute to function param to mark it as constant
On 04/08/2021 18:59, Segher Boessenkool wrote:
> On Wed, Aug 04, 2021 at 07:08:08PM +0200, Florian Weimer wrote:
>> * Segher Boessenkool:
>>
>>> On Wed, Aug 04, 2021 at 03:27:00PM +0100, Richard Earnshaw wrote:
>>>> On 04/08/2021 14:40, Segher Boessenkool wrote:
>>>>> On Wed, Aug 04, 2021 at 02:00:42PM +0100, Richard Earnshaw wrote:
>>>>>> We don't want to have to resort to macros.  Not least because at some
>>>>>> point we want to replace the content of arm_neon.h with a single
>>>>>> #pragma directive to remove all the parsing of the header that's
>>>>>> needed.  What's more, if we had a suitable pragma we'd stand a
>>>>>> fighting chance of being able to extend support to other languages
>>>>>> as well that don't use the pre-processor, such as Fortran or Ada
>>>>>> (not that that is on the cards right now).
>>>>>
>>>>> So how do you want to handle constants-that-are-not-yet-constant, say
>>>>> before inlining?  And how do you want to deal with those possibly not
>>>>> ever becoming constant, perhaps because you used a too low "n" in -On
>>>>> (but there are very many random other causes)?  And, what *is* a
>>>>> constant, anyway?  This is even more fuzzy if you consider those
>>>>> other languages as well.
>>>>>
>>>>> (Does skipping parsing of some trivial header save so much time?  Huh!)
>>>>
>>>> Trivial?  arm_neon.h is currently 20k lines of source.  What's more,
>>>> it has to support inline functions that might not be available when
>>>> the header is parsed, but might become available if the user
>>>> subsequently compiles a function with different attributes enabled.
>>>> It is very definitely *NOT* trivial.
>>>
>>> Ha yes :-)  I just assumed without looking that it would be like other
>>> architectures' intrinsics headers.  Whoops.
>>
>> But isn't it?
>>
>> $ echo '#include <arm_neon.h>' | gcc -E - | wc -l
>> 41045
>
> $ echo '#include <altivec.h>' | gcc -E - -maltivec | wc -l
> 9
>
> Most of this file (774 lines) is #define's, which take essentially no
> time at all.  And none of the other archs I have looked at have big
> headers either!
>
>
> Segher
>

arm_sve.h isn't large either, but that's because all it contains (other
than a couple of typedefs) is

#pragma GCC aarch64 "arm_sve.h"

:)

R.
Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> Hi,
>
> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> following code (13 instructions using 57 bytes, plus 4 quadwords
> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>
> [...snip...]
>
> JFTR: in the best case, the memory accesses cost several cycles,
> while in the worst case they yield a page fault!
>
> Properly optimized, shorter and faster code, using only 9 instructions
> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory
> accesses and saving at least 16 + 32 bytes, follows:
>
> [...snip...]

There is one important difference, namely setting the invalid exception
flag when the parameter can't be represented in a signed integer.  So
using your code may require some option (-ffast-math comes to mind), or
you need at least a check on the exponent before cvttsd2si.

The last part of your code then goes to take into account the special
case of -0.0, which I most often don't care about (I'd like to have a
-fdont-split-hairs-about-the-sign-of-zero option).

Potentially generating spurious invalid operation and then carefully
taking into account the sign of zero does not seem very consistent.

Apart from this, in your code, after cvttsd2si I'd rather use:

	mov rcx,rax   # make a second copy to a scratch register
	neg rcx
	jo .L0
	cvtsi2sd xmm1,rax

The reason is latency; in an OoO engine, splitting the two paths is
almost always a win.

With your patch:

	cvttsd2si-->neg-?->neg-->cvtsi2sd

where the ? means that the following instructions are speculated.

With an auxiliary register there are two dependency chains:

	cvttsd2si-?->cvtsi2sd
	         |->mov->neg->jump

Actually some OoO cores just eliminate register copies using the
register renaming mechanism.  But even this is probably completely
irrelevant in this case where the latency is dominated by the two
conversion instructions.

	Regards,
	Gabriel
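Gabriel's point about the invalid exception flag can be checked with
fenv.h (a sketch; compile without -ffast-math, and link with -lm if
needed):

    #include <fenv.h>
    #include <stdio.h>

    int main (void)
    {
        volatile double huge = 0x1.0p80;  /* does not fit in int64 */
        feclearexcept (FE_INVALID);
        volatile double r = __builtin_trunc (huge);
        /* A conforming trunc must leave FE_INVALID clear here; a bare
           cvttsd2si on the same value would raise it.  */
        printf ("FE_INVALID raised: %d\n", fetestexcept (FE_INVALID) != 0);
        (void) r;
        return 0;
    }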
Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
On Thu, Aug 5, 2021 at 11:44 AM Gabriel Paubert wrote:
>
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> > Hi,
> >
> > targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> > following code (13 instructions using 57 bytes, plus 4 quadwords
> > using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >
> > [...snip...]
>
> [...snip...]
>
> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.

Btw, the code to emit these sequences is in
gcc/config/i386/i386-expand.c:ix86_expand_trunc and friends.

Richard.

> Regards,
> Gabriel
Re: Function attribute to indicate a likely (or unlikely) return value
On 7/25/21 7:33 PM, Dominique Pellé via Gcc wrote:
> Hi

Hello.

> I read https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
> but was left wondering: is there a way to annotate a function to
> indicate that a return value is likely (or unlikely)?

Interesting idea :)  No, we don't support that right now.

> For example, let's say we have this function:
>
> // Return OK (=0) in case of success (frequent case)
> // or an error code != 0 in case of failure (rare case).
> int do_something();
>
> If it's unlikely to fail, I wish I could declare the function like
> this (pseudo-code!):
>
> int do_something() __likely_return(OK);
>
> So wherever it's used, the optimizer can optimize branch prediction
> and the instruction cache.  In other words, lines like this:
>
> if (do_something() == OK)
>   ...
>
> would implicitly be similar to:
>
> // LIKELY defined as __builtin_expect((x), 1).
> if (LIKELY(do_something() == OK))
>
> The advantage of being able to annotate the declaration, is that we
> only need to annotate once in the header, and all uses of the function
> can benefit from the optimization without polluting/modifying all code
> where the function is called.

I see your point, seems like a good idea.  The question is, how much
would it take to implement and what's the benefit of the suggested hints.

> Another example: a function that would be unlikely to return NULL
> could be declared as:
>
> void *foo() __unlikely_returns(NULL);

Note that modern CPUs have branch predictors and a condition of
'if (ptr == 0)' can be guessed quite easily.

> This last example would be a bit similar to the __attribute__((malloc))
> since I read about it in the doc:
>
>   In addition, the GCC predicts that a function with the attribute
>   returns non-null in most cases.
>
> Of course __attribute__((malloc)) gives other guarantees (return value
> cannot alias any other pointer) so it's not equivalent.

Note we have a special branch probability for malloc: gcc/predict.def:54

> Would attribute __likely_return() and __unlikely_return() make sense?

Similarly, we have now:

/* Branch to basic block containing call marked by noreturn attribute.  */
DEF_PREDICTOR (PRED_NORETURN, "noreturn call", PROB_VERY_LIKELY,
	       PRED_FLAG_FIRST_MATCH)

Thanks for the ideas,
Martin

> Is there already a way to achieve this which I missed in the doc?
>
> Regards
> Dominique
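Until such an attribute exists, the hint has to be repeated at every
call site, which is what motivates the proposal (a sketch of today's
workaround, with hypothetical names):

    #define LIKELY(x) __builtin_expect ((x), 1)

    int do_something (void);  /* returns 0 (OK) in the frequent case */

    void caller (void)
    {
        if (LIKELY (do_something () == 0))  /* hint repeated per call */
        {
            /* ... fast path ... */
        }
    }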
Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
Gabriel Paubert wrote:

> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> Hi,
>>
>> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
>> following code (13 instructions using 57 bytes, plus 4 quadwords
>> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>>
>> [...snip...]
>
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.

Right, I overlooked this fault.  Thanks for pointing it out.

> So using your code may require some option (-ffast-math comes to mind),
> or you need at least a check on the exponent before cvttsd2si.

The whole idea behind these implementations is to get rid of loading
floating-point constants to perform comparisons.

> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).

Preserving the sign of -0.0 is explicitly specified in the standard,
and is cheap, as shown in my code.

> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
> 	mov rcx,rax   # make a second copy to a scratch register
> 	neg rcx
> 	jo .L0
> 	cvtsi2sd xmm1,rax

I don't know how GCC generates the code for builtins, and what kind of
templates it uses: the second goal was to minimize register usage.

> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> 	cvttsd2si-->neg-?->neg-->cvtsi2sd
>
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> 	cvttsd2si-?->cvtsi2sd
> 	         |->mov->neg->jump

Correct; see above: I expect the template(s) for builtins to give the
register allocator some freedom to split code paths and resolve
dependency chains.

> Actually some OoO cores just eliminate register copies using register
> renaming mechanism.  But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.

Right, the conversions dominate both the original and the code I
posted.  It's easy to get rid of them, with still slightly shorter and
faster branchless code (17 instructions, 84 bytes, instead of
Re: Optional machine prefix for programs in for -B dirs, matching Clang
Hello,

On Wed, 4 Aug 2021, John Ericson wrote:

> On Wed, Aug 4, 2021, at 10:48 AM, Michael Matz wrote:
> > ... the 'as' and 'ld' executables should be simply found within the
> > version and target specific GCC libexecsubdir, possibly by being
> > symlinks to whatever you want.  That's at least how my crosses are
> > configured and installed, without any --with-{as,ld} options.
>
> Yes that does work, and that's probably the best option today.  I'm just
> a little wary of unprefixing things programmatically.

The libexecsubdir _is_ the prefix in above case :)

> For some context, this is NixOS where we assemble a ton of cross
> compilers automatically and each package gets its own isolated many FHS.
> For that reason I would like to eventually avoid the target-specific
> subdirs entirely, as I have the separate package trees to disambiguate
> things.  Now, I know that exact same argument could also be used to say
> target prefixing is also superfluous, but eventually things on the PATH
> need to be disambiguated.

Sure, which is why (e.g.) cross binutils do install with an arch prefix
into ${bindir}.  But as GCC has the capability to look into
libexecsubdir for binaries as well (which quite surely should never be
in $PATH on any system), I don't see the conflict.

> There is no requirement that the libexec things be named like the bin
> things, but I sort of feel it's one less thing to remember and makes
> debugging easier.

Well, the naming scheme of binaries in libexecsubdir reflects the scheme
that the compilers are using: cc1, cc1plus etc.  Not
aarch64-unknown-linux-cc1.

> I am sympathetic to the issue that if GCC accepts everything Clang does
> and vice-versa, we'll Postel's-law ourselves over time into madness as
> mistakes are accumulated rather than weeded out.

Right.  I supposed it wouldn't hurt to also look for "${targettriple}-as"
in $PATH before looking for 'as' (in $PATH).  But I don't think we can
(or should) switch off looking for 'as' in libexecsubdir.  I don't even
see why that behaviour should depend on an option, it could just be
added by default.

> I now have some patches for this change I suppose I could also submit.

Even better :)

Ciao,
Michael.
Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
On 8/5/21 11:42 AM, Gabriel Paubert wrote:
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> Hi,
>>
>> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
>> following code (13 instructions using 57 bytes, plus 4 quadwords
>> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>>
>> [...snip...]
>
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.  So
> using your code may require some option (-ffast-math comes to mind), or
> you need at least a check on the exponent before cvttsd2si.
>
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).

`-fno-signed-zeros` does that, if you need it.

> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> [...snip...]
>
> Regards,
> Gabriel

--
Gabriel RAVIER
First year student at Epitech
+33 6 36 46 16 43
gabriel.rav...@epitech.eu
11 Quai Finkwiller
67000 STRASBOURG
Question about finding parameters in function bodies from SSA variables
Hello Richard,

I'm still working on the points-to analysis and I am happy to say that
after reviewing the ipa-cp code I was able to generate summaries for
local variables, ssa variables, heap variables, global variables and
functions.  I am also using the callback hooks to find out if
cgraph_nodes and varpool_nodes are added or deleted between
read_summaries and execute.  Even though I don't update the solutions
between execute and function_transform yet, I am reading the points-to
pairs and remapping the constraint variables back to trees during
function_transform and printing the names of pointer-pointee pairs.

This is still very much a work in progress and a very weak points-to
analysis.  I have almost finished my Andersen's / field insensitive /
context insensitive / flow-insensitive / intraprocedural analysis with
the LTO framework (without interacting with other transformations yet).
The only thing that I am missing is assigning parameters to be pointing
to NONLOCAL memory upon entry to the function, and perhaps some corner
cases where gimple is not exactly how I expect it to be.

I am wondering: none of the variables in function->gimple_df->ssa_names
and function->local_decls are PARM_DECL.  I'm also not entirely sure if
I should be looking for PARM_DECLs, since looking at function bodies'
gimple representation I don't see the formal parameters being used
inside the function.  Instead, it appears that some SSA variables are
automatically initialized with the parameter value.  Is this the case?

For example, for a function:

   foo (struct a* $NAME)

the variable $NAME is nowhere used inside the function.  I also found
that there is an ssa variable in location X (in
function->gimple_df->ssa_names[X]) named with a variation like
$NAME_$X(D), and this seems to correspond to the parameter $NAME.  How
can one (preferably looking only at function->gimple_df->ssa_names[$X])
find out that this tree corresponds to a parameter?

Many thanks!

-Erick
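For what it's worth, the "$NAME_$X(D)" names described above are the
default definitions of the parameters, and can be recognized with GCC's
internal macros roughly like this (an untested sketch, assuming the
usual tree and SSA headers are in scope):

    /* True if T is the SSA name carrying the incoming value of a
       parameter, i.e. a default definition whose underlying variable
       is a PARM_DECL -- the "name_5(D)" form seen in GIMPLE dumps.  */
    static bool
    ssa_name_is_parm (tree t)
    {
        return TREE_CODE (t) == SSA_NAME
               && SSA_NAME_IS_DEFAULT_DEF (t)
               && SSA_NAME_VAR (t) != NULL_TREE
               && TREE_CODE (SSA_NAME_VAR (t)) == PARM_DECL;
    }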
Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1
Hi,

On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert wrote:
>
> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> >> Hi,
> >>
> >> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> >> following code (13 instructions using 57 bytes, plus 4 quadwords
> >> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >>
> >> [...snip...]
> >
> > There is one important difference, namely setting the invalid exception
> > flag when the parameter can't be represented in a signed integer.
>
> Right, I overlooked this fault.  Thanks for pointing it out.
>
> > So using your code may require some option (-ffast-math comes to mind),
> > or you need at least a check on the exponent before cvttsd2si.
>
> The whole idea behind these implementations is to get rid of loading
> floating-point constants to perform comparisons.

Indeed, but what I had in mind was something along the following lines:

	movq rax,xmm0       # and copy rax to say rcx, if needed later
	shrq rax,52         # move sign and exponent to 12 LSBs
	andl eax,0x7ff      # mask the sign
	cmpl eax,0x434      # value to be checked
	ja return           # exponent too large, we're done (what about NaNs?)
	cvttsd2si rax,xmm0  # safe after exponent check
	cvtsi2sd xmm0,rax   # conversion done

and a bit more to handle the corner cases (essentially preserve the
sign to be correct between -1 and -0.0).

But the CPU can (speculatively) start the conversions early, so the
dependency chain is rather short.  I don't know if it's faster than
your new code, I'm almost sure that it's shorter.  Your new code also
has a fairly long dependency chain.

> > The last part of your code then goes to take into account the special
> > case of -0.0, which I most often don't care about (I'd like to have a
> > -fdont-split-hairs-about-the-sign-of-zero option).
>
> Preserving the sign of -0.0 is explicitly specified in the standard,
> and is cheap, as shown in my code.
>
> > Potentially generating spurious invalid operation and then carefully
> > taking into account the sign of zero does not seem very consistent.
> >
> > Apart from this, in your code, after cvttsd2si I'd rather use:
> > 	mov rcx,rax   # make a second copy to a scratch register
> > 	neg rcx
> > 	jo .L0
> > 	cvtsi2sd xmm1,r
Re: Noob question about simple customization of GCC.
On Wed, 2021-08-04 at 00:17 -0700, Alacaster Soi via Gcc wrote:
> How hard would it be to add a tree-like structure and headers/sections
> to the -v gcc option so you can see the call structure.  Would this be
> a reasonable first contribution/customization for a noob?  It'll be a
> while before I can reasonably work on this.
>
> GCC
>   version
>   config
>
>   cc1 main.c
>   | cc1 config and
>   | output
>   -> tempfile.s
>   '*extra space' *between each lowest level command
>
>   as -v
>   | output
>   -> tempfile.o
>
>   collect2.exe
>   | output
>   |- ld.exe
>   |  output
>   -> tempfile.exe

I really like this UI idea, but I don't know how easy/hard it would be
to implement.

The code that implements figuring out what to invoke (the "driver") is
in gcc/gcc.c, which is a big source file.

FWIW there's also code in gcc/tree-diagnostic-path.cc to emit ASCII art
that does something a bit similar to your idea, which might be worth
looking at (in this case, to visualize function calls and returns along
a code path).

Hope this is helpful
Dave
Re: daily report on extending static analyzer project [GSoC]
> On 05-Aug-2021, at 4:56 AM, David Malcolm wrote:
>
> On Wed, 2021-08-04 at 21:32 +0530, Ankur Saini wrote:
>
> [...snip...]
>>
>> - From observation, a typical vfunc call that isn't devirtualised by
>> the compiler's front end looks something like this
>> "OBJ_TYPE_REF(_2;(struct A)a_ptr_5(D)->0) (a_ptr_5(D))"
>> where "a_ptr_5(D)" is the pointer that is being used to call the
>> virtual function.
>>
>> - We can access its region to see what is the type of the object the
>> pointer is actually pointing to.
>>
>> - This is then used to find a call with DECL_CONTEXT of the object
>> from all the possible targets of that polymorphic call.
>
> [...]
>
>>
>> Patch file ( prototype ) :
>>
>> +  /* Call is possibly a polymorphic call.
>> +
>> +     In such case, use devirtisation tools to find
>> +     possible callees of this function call.  */
>> +
>> +  function *fun = get_current_function ();
>> +  gcall *stmt = const_cast<gcall *> (call);
>> +  cgraph_edge *e = cgraph_node::get (fun->decl)->get_edge (stmt);
>> +  if (e->indirect_info->polymorphic)
>> +    {
>> +      void *cache_token;
>> +      bool final;
>> +      vec<cgraph_node *> targets
>> +        = possible_polymorphic_call_targets (e, &final, &cache_token, true);
>> +      if (!targets.is_empty ())
>> +        {
>> +          tree most_propbable_taget = NULL_TREE;
>> +          if (targets.length () == 1)
>> +            return targets[0]->decl;
>> +
>> +          /* From the current state, check which subclass the pointer that
>> +             is being used to this polymorphic call points to, and use to
>> +             filter out correct function call.  */
>> +          tree t_val = gimple_call_arg (call, 0);
>
> Maybe rename to "this_expr"?
>
>> +          const svalue *sval = get_rvalue (t_val, ctxt);
>
> and "this_sval"?

ok

> ...assuming that that's what the value is.
>
> Probably should reject the case where there are zero arguments.

Ideally it should always have one argument representing the pointer
used to call the function.

For example, if the function is called like this :-

a_ptr->foo(arg);  // where foo() is a virtual function and a_ptr is
                  // a pointer to an object of a subclass

I saw that its GIMPLE representation is as follows :-

OBJ_TYPE_REF(_2;(struct A)a_ptr_5(D)->0) (a_ptr_5, arg);

>
>> +
>> +          const region *reg
>> +            = [&]()->const region *
>> +            {
>> +              switch (sval->get_kind ())
>> +                {
>> +                  case SK_INITIAL:
>> +                    {
>> +                      const initial_svalue *initial_sval
>> +                        = sval->dyn_cast_initial_svalue ();
>> +                      return initial_sval->get_region ();
>> +                    }
>> +                    break;
>> +                  case SK_REGION:
>> +                    {
>> +                      const region_svalue *region_sval
>> +                        = sval->dyn_cast_region_svalue ();
>> +                      return region_sval->get_pointee ();
>> +                    }
>> +                    break;
>> +
>> +                  default:
>> +                    return NULL;
>> +                }
>> +            } ();
>
> I think the above should probably be a subroutine.
>
> That said, it's not clear to me what it's doing, or that this is correct.

Sorry, I think I should have explained it earlier.
Let's take an example code snippet :-

Derived d;
Base *base_ptr;
base_ptr = &d;
base_ptr->foo();    // where foo() is a virtual function

This generates the following GIMPLE dump :-

Derived::Derived (&d);
base_ptr_6 = &d.D.3779;
_1 = base_ptr_6->_vptr.Base;
_2 = _1 + 8;
_3 = *_2;
OBJ_TYPE_REF(_3;(struct Base)base_ptr_6->1) (base_ptr_6);

Here, instead of trying to extract the virtual pointer from the call
and see which subclass it belongs to, I found it simpler to extract the
actual pointer which is used to call the function itself (which, from
observation, is always the first parameter of the call) and use the
region model at that point to figure out what type of object it
actually points to, to ultimately get the actual subclass whose
function is being called here. :)

Now let me try to explain how I actually executed it ( A lot of
assumptions here are based on observation, so please correct me
wherever you think I made a false interpretation or forgot about a
certain special case ) :

- once it is confirmed that the call that we are dealing with is a
  polymorphic call ( via the cgraph edge representing the call ), I
  used "possible_polymorphic_call_targets ()" from ipa-utils.h
  ( defined in ipa-devirt.c ) to get the possible callees of that call.

function *fun = get_current_function ();
gcall *stmt = const_cast<gcall *> (call);
cgraph_edge *e = cgraph_node::get (fun->decl)->get_edge (stmt);
if (e->indirect_info->polymorphic)
  {
    void *cache_token;
    bool final;
    vec<cgraph_node *> targets
      = possible_polymorphic_call_targ
Re: [RFC] Adding a new attribute to function param to mark it as constant
On Thu, Aug 05, 2021 at 02:31:02PM +0530, Prathamesh Kulkarni wrote:
> On Wed, 4 Aug 2021 at 18:30, Richard Earnshaw wrote:
> > We don't want to have to resort to macros.  Not least because at some
> > point we want to replace the content of arm_neon.h with a single #pragma
> > directive to remove all the parsing of the header that's needed.  What's
> > more, if we had a suitable pragma we'd stand a fighting chance of being
> > able to extend support to other languages as well that don't use the
> > pre-processor, such as Fortran or Ada (not that that is on the cards
> > right now).
> Hi,
> IIUC, a more general issue here, is that the intrinsics require
> special type-checking of arguments, beyond what is dictated by the
> Standard.
> An argument needing to be an ICE could be seen as one instance.
>
> So perhaps, should there be some mechanism to tell the FE to let the
> target do additional checking for a particular function call, say by

An integer constant expression can be checked by the frontend itself,
it does not depend on optimisation etc.  That is the beauty of it: it
is a) more local, and b) a more reliable / less surprising thing to
use.

But it *is* less powerful than "it is a constant integer after a travel
through the bowels of the compiler".  Which of course is less reliable
and more surprising (think what happens if you use -O0 or -O1 or -Og or
-Os or any -fno- etc.)  So it will be a lot more maintenance work
(answering PRs about it is only the start).


Segher
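The front-end/optimizer distinction drawn here can be illustrated in
plain C (a sketch):

    enum { N = 4 };   /* N is an integer constant expression (ICE)   */
    const int m = 4;  /* in C, m is NOT an ICE, even though the
                         optimizer may later prove it constant       */

    _Static_assert (N == 4, "checked by the front end at any -O level");
    /* _Static_assert (m == 4, "...");  -- rejected in C: not an ICE */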
gcc-9-20210805 is now available
Snapshot gcc-9-20210805 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/9-20210805/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 9 git branch
with the following options:
  git://gcc.gnu.org/git/gcc.git branch releases/gcc-9
  revision 11e2ac8f75060d9be432e8db1f358298a75c98d4

You'll find:

 gcc-9-20210805.tar.xz         Complete GCC

  SHA256=4ee185d8c6144cebf81cd01ab68c8d64f8b097765f2278ec00882368e9dcfbcc
  SHA1=39fe1b99542d66d02d17131a7f297958439bc2ed

Diffs from 9-20210729 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-9
link is updated and a message is sent to the gcc list.  Please do not
use a snapshot before it has been announced that way.
Re: daily report on extending static analyzer project [GSoC]
On Thu, 2021-08-05 at 20:27 +0530, Ankur Saini wrote:
>
> > On 05-Aug-2021, at 4:56 AM, David Malcolm wrote:
> >
> > On Wed, 2021-08-04 at 21:32 +0530, Ankur Saini wrote:
> >
> > [...snip...]
>
> [...snip...]
>
> Let's take an example code snippet :-
>
> Derived d;
> Base *base_ptr;
> base_ptr = &d;
> base_ptr->foo();    // where foo() is a virtual function
>
> This generates the following GIMPLE dump :-
>
> Derived::Derived (&d);
> base_ptr_6 = &d.D.3779;
> _1 = base_ptr_6->_vptr.Base;
> _2 = _1 + 8;
> _3 = *_2;
> OBJ_TYPE_REF(_3;(struct Base)base_ptr_6->1) (base_ptr_6);
>
> [...snip...]

I did a bit of playing with this example, and tried adding:

1876	    case OBJ_TYPE_REF:
1877	      gcc_unreachable ();
1878	      break;

to region_model::get_rvalue_1, and running cc1plus under the debugger.

The debugger hits the "gcc_unreachable ();", at this stmt:

  OBJ_TYPE_REF(_2;(struct Base)base_ptr_5->0) (base_ptr_5);

Looking at the region_model with region_model::debug() shows:

(gdb) call debug()
stack depth: 1
  frame (index 0): frame: ‘test’@1
clusters within frame: ‘test’@1
  cluster for: Derived d
    key:   {bytes 0-7}
    value: ‘int (*) () *’ {(&constexpr int (* Derived::_ZTV7Derived [3])(...)+(sizetype)16)}
  cluster for: base_ptr_5: &Derived d.
  cluster for: _2: &‘foo’
m_called_unknown_fn: FALSE
constraint_manager:
  equiv classes:
    ec0: {&Derived d.}
    ec1: {&
Re: [RFC] Adding a new attribute to function param to mark it as constant
On 8/4/21 3:46 AM, Richard Earnshaw wrote: On 03/08/2021 18:44, Martin Sebor wrote: On 8/3/21 4:11 AM, Prathamesh Kulkarni via Gcc wrote: On Tue, 27 Jul 2021 at 13:49, Richard Biener wrote: On Mon, Jul 26, 2021 at 11:06 AM Prathamesh Kulkarni via Gcc wrote: On Fri, 23 Jul 2021 at 23:29, Andrew Pinski wrote: On Fri, Jul 23, 2021 at 3:55 AM Prathamesh Kulkarni via Gcc wrote: Hi, Continuing from this thread, https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575920.html The proposal is to provide a mechanism to mark a parameter in a function as a literal constant. Motivation: Consider the following intrinsic vshl_n_s32 from arrm/arm_neon.h: __extension__ extern __inline int32x2_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) vshl_n_s32 (int32x2_t __a, const int __b) { return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b); } and it's caller: int32x2_t f (int32x2_t x) { return vshl_n_s32 (x, 1); } Can't you do similar to what is done already in the aarch64 back-end: #define __AARCH64_NUM_LANES(__v) (sizeof (__v) / sizeof (__v[0])) #define __AARCH64_LANE_CHECK(__vec, __idx) \ __builtin_aarch64_im_lane_boundsi (sizeof(__vec), sizeof(__vec[0]), __idx) ? Yes this is about lanes but you could even add one for min/max which is generic and such; add an argument to say the intrinsics name even. You could do this as a non-target builtin if you want and reuse it also for the aarch64 backend. Hi Andrew, Thanks for the suggestions. IIUC, we could use this approach to check if the argument falls within a certain range (min / max), but I am not sure how it will help to determine if the arg is a constant immediate ? AFAIK, vshl_n intrinsics require that the 2nd arg is immediate ? Even the current RTL builtin checking is not consistent across optimization levels: For eg: int32x2_t f(int32_t *restrict a) { int32x2_t v = vld1_s32 (a); int b = 2; return vshl_n_s32 (v, b); } With pristine trunk, compiling with -O2 results in no errors because constant propagation replaces 'b' with 2, and during expansion, expand_builtin_args is happy. But at -O0, it results in the error - "argument 2 must be a constant immediate". So I guess we need some mechanism to mark a parameter as a constant ? I guess you want to mark it in a way that the frontend should force constant evaluation and error if that's not possible? C++ doesn't allow to declare a parameter as 'constexpr' but something like void foo (consteval int i); since I guess you do want to allow passing constexpr arguments in C++ or in C extended forms of constants like static const int a[4]; foo (a[1]); ? But yes, this looks useful to me. Hi Richard, Thanks for the suggestions and sorry for late response. I have attached a prototype patch that implements consteval attribute. As implemented, the attribute takes at least one argument(s), which refer to parameter position, and the corresponding parameter must be const qualified, failing which, the attribute is ignored. I'm curious why the argument must be const-qualified. If it's to keep it from being changed in ways that would prevent it from being evaluated at compile-time in the body of the function then to be effective, the enforcement of the constraint should be on the definition of the function. 
> > Otherwise, the const qualifier could be used in a declaration of a
> > function but left out from a subsequent definition of it, letting the
> > definition modify the parameter, like so:
> >
> >   __attribute__ ((consteval (1))) void f (const int);
> >
> >   inline __attribute__ ((always_inline)) void
> >   f (int i) { ++i; }
>
> In this particular case it's because the inline function is implementing
> an intrinsic operation in the architecture and the instruction only
> supports a literal constant value. At present we catch this while trying
> to expand the intrinsic, but that can lead to poor diagnostics because we
> really want to report against the line of code calling the intrinsic.

Presumably the intrinsics can accept (or can be made to accept) any
constant integer expressions, not just literals. E.g., the aarch64 builtin
below accepts them. For example, this is accepted in C++:

  __Int64x2_t f (__Int32x2_t a)
  {
    constexpr int n = 2;
    return __builtin_aarch64_vshll_nv2si (a, n + 1);
  }

Making the intrinsics accept constant arguments in constexpr-like functions
and introducing a constexpr-lite attribute (for C code) was what I was
suggesting by the constexpr comment below. I'd find that a much more
general and more powerful design.

But my comment above was to highlight that if requiring the function
argument referenced by the proposed consteval attribute to be const is
necessary to prevent it from being modified, then the requirement needs to
be enforced not on the declaration but on the definition. You may rightly
say: "but we get to define the inline arm function wrappers so we'll make
sure to never declare them that way." I don't have a problem with that.
What I am s
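To make the thread's proposal concrete, here is how the attribute from the
prototype patch might be applied to the motivating intrinsic. This is
hypothetical: the consteval attribute exists only in that prototype patch,
and its exact spelling and semantics are still under discussion. Note also,
as background to the loophole discussed above, that a top-level const on a
parameter is not part of the function's type in either C or C++, so a
const-qualified declaration cannot by itself constrain the definition.

  /* Hypothetical: requires the prototype consteval patch.  Parameter 2
     (__b) is const-qualified, as the patch requires, and each call site
     must then pass a compile-time constant for it.  */
  __extension__ extern __inline int32x2_t
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__,
                  consteval (2)))
  vshl_n_s32 (int32x2_t __a, const int __b)
  {
    return (int32x2_t) __builtin_neon_vshl_nv2si (__a, __b);
  }

  int32x2_t f (int32x2_t x)
  {
    int b = 2;
    return vshl_n_s32 (x, b);  /* would be diagnosed at the call site,
                                  identically at -O0 and -O2 */
  }

The point of the attribute is to move the diagnostic from expansion time,
where it depends on what constant propagation happened to do, to the front
end, where it can be reported against the offending call regardless of
optimization level.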
Re: Optional machine prefix for programs in -B dirs, matching Clang
On Thu, Aug 5, 2021, at 8:30 AM, Michael Matz wrote:
> Hello,
>
> On Wed, 4 Aug 2021, John Ericson wrote:
>
> > On Wed, Aug 4, 2021, at 10:48 AM, Michael Matz wrote:
> > > ... the 'as' and 'ld' executables should be simply found within the
> > > version and target specific GCC libexecsubdir, possibly by being
> > > symlinks to whatever you want. That's at least how my crosses are
> > > configured and installed, without any --with-{as,ld} options.
> >
> > Yes that does work, and that's probably the best option today. I'm just
> > a little wary of unprefixing things programmatically.
>
> The libexecsubdir _is_ the prefix in above case :)

Right. I meant stripping off the `cpu-vendor-os-` (conventionally) that ld
and as are prefixed with. Stripping off leading directories is easier.

> > For some context, this is NixOS where we assemble a ton of cross
> > compilers automatically and each package gets its own isolated mini
> > FHS. For that reason I would like to eventually avoid the
> > target-specific subdirs entirely, as I have the separate package trees
> > to disambiguate things. Now, I know that exact same argument could also
> > be used to say target prefixing is also superfluous, but eventually
> > things on the PATH need to be disambiguated.
>
> Sure, which is why (e.g.) cross binutils do install with an arch prefix
> into ${bindir}. But as GCC has the capability to look into libexecsubdir
> for binaries as well (which quite surely should never be in $PATH on any
> system), I don't see the conflict.

Yes, there is no actual conflict. Our original wrapper scripts may have
been confused about this at some point, but that's on us.

> > There is no requirement that the libexec things be named like the bin
> > things, but I sort of feel it's one less thing to remember and makes
> > debugging easier.
>
> Well, the naming scheme of binaries in libexecsubdir reflects the scheme
> that the compilers are using: cc1, cc1plus etc. Not
> aarch64-unknown-linux-cc1.

Right.

> > I am sympathetic to the issue that if GCC accepts everything Clang does
> > and vice-versa, we'll Postel's-law ourselves over time into madness as
> > mistakes are accumulated rather than weeded out.
>
> Right. I suppose it wouldn't hurt to also look for "${targettriple}-as"
> in $PATH before looking for 'as' (in $PATH). But I don't think we can (or
> should) switch off looking for 'as' in libexecsubdir. I don't even see
> why that behaviour should depend on an option, it could just be added by
> default.

OK, I agree with that. So if someone passes -B$x, how about looking for

- $x/$machine/$version/$prog
- $x/$machine/$prog
- $x/$machine-$prog
- $x/$prog

so no prefixing in the subdir, only in the main dir? ($libexecsubdir is
morally $libexec being a search dir + subdir IIRC)

> > I now have some patches for this change I suppose I could also submit.
>
> Even better :)

Great! I will continue improving my patch based on the above. In the
meantime, I posted
https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576725.html which is
a small cleanup that, while helping with my changes, doesn't change the
behavior and I hope is good in any event.
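As a worked example of that proposed order (paths purely illustrative):
with -B/opt/cross, $machine = aarch64-unknown-linux-gnu, $version = 12.1.0,
and $prog = as, the driver would probe, in order,

  /opt/cross/aarch64-unknown-linux-gnu/12.1.0/as
  /opt/cross/aarch64-unknown-linux-gnu/as
  /opt/cross/aarch64-unknown-linux-gnu-as
  /opt/cross/as

so a Clang-style prefixed binary is found in the -B directory itself, while
the subdirectories keep the unprefixed names that GCC's libexecsubdir
convention already uses.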
Re: Why vectorization didn't turn on by -O2
On Thu, Aug 5, 2021 at 5:20 AM Segher Boessenkool wrote:
>
> On Wed, Aug 04, 2021 at 11:22:53AM +0100, Richard Sandiford wrote:
> > Segher Boessenkool writes:
> > > On Wed, Aug 04, 2021 at 10:10:36AM +0100, Richard Sandiford wrote:
> > >> Richard Biener writes:
> > >> > Alternatively only enable loop vectorization at -O2 (the above
> > >> > checks flag_tree_slp_vectorize as well). At least the cost model
> > >> > kind does not have any influence on BB vectorization, that is, we
> > >> > get the same pros and cons as we do for -O3.
> > >>
> > >> Yeah, but a lot of the loop vector cost model choice is about
> > >> controlling code size growth and avoiding excessive runtime
> > >> versioning tests.
> > >
> > > Both of those depend a lot on the target, and target-specific
> > > conditions as well (which CPU model is selected for example). Can we
> > > factor that in somehow? Maybe we need some target hook that returns
> > > the expected percentage code growth for vectorising a given loop, for
> > > example, and -O2 vs. -O3 then selects what percentage is acceptable.
> > >
> > >> BB SLP should be a win on both code size and performance (barring
> > >> significant target costing issues).
> > >
> > > Yeah -- but this could use a similar hook as well (just a straightline
> > > piece of code instead of a loop).
> >
> > I think anything like that should be driven by motivating use cases.
> > It's not something that we can easily decide in the abstract.
> >
> > The results so far with using very-cheap at -O2 have been promising,
> > so I don't think new hooks should block that becoming the default.
>
> Right, but it wouldn't hurt to think a sec if we are on the right path
> forward. It is crystal clear that to make good decisions about what and
> how to vectorise you need to take *some* target characteristics into
> account, and that will have to happen sooner rather than later.
>
> This was all in reply to
>
> > >> Yeah, but a lot of the loop vector cost model choice is about
> > >> controlling code size growth and avoiding excessive runtime
> > >> versioning tests.
>
> It was not meant to hold up these patches :-)
>
> > >> PR100089 was an exception because we ended up keeping unvectorised
> > >> scalar code that would never have existed otherwise. BB SLP proper
> > >> shouldn't have that problem.
> > >
> > > It also is a tiny piece of code. There will always be tiny examples
> > > that are much worse (or much better) than average.
> >
> > Yeah, what makes PR100089 important isn't IMO the test itself, but the
> > underlying problem that the PR exposed. Enabling this “BB SLP in loop
> > vectorisation” code can lead to the generation of scalar COND_EXPRs
> > even though we know that ifcvt doesn't have a proper cost model for
> > deciding whether scalar COND_EXPRs are a win.
> >
> > Introducing scalar COND_EXPRs at -O3 is arguably an acceptable risk
> > (although still dubious), but I think it's something we need to avoid
> > for -O2, even if that means losing the optimisation.
>
> Yeah -- -O2 should almost always do the right thing, while -O3 can do
> bad things more often, it just has to be better "on average".
>
> Segher

Moving the thread to gcc-patches and gcc.

--
BR,
Hongtao
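For a concrete illustration (an assumed example, not one from the thread):
under the very-cheap cost model, a loop whose trip count is known at
compile time can be vectorized without any runtime versioning tests or
scalar epilogue, which is exactly the code-size-neutral case that makes it
a plausible default for -O2:

  /* e.g. gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap
     (explicit flags as of GCC 11-era trunk, before -O2 enabled this
     by default).  */
  void add (int *__restrict a, const int *__restrict b)
  {
    /* 1024 is a multiple of any power-of-two vector length, so no
       scalar tail and no versioning checks are needed.  */
    for (int i = 0; i < 1024; i++)
      a[i] += b[i];
  }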