Re: x86: making better use of vpternlog{d,q}
On 24.05.2023 11:01, Hongtao Liu wrote: > On Wed, May 24, 2023 at 3:58 PM Jan Beulich via Gcc wrote: >> >> Hello, >> >> for a couple of years I was meaning to extend the use of these AVX512F >> insns beyond the pretty minimalistic ones there are so far. Now that I've >> got around to at least draft something, I ran into a couple of issues I >> cannot explain. I'd like to start with understanding the unexpected >> effects of a change to an existing insn I have made (reproduced at the >> bottom). I certainly was prepared to observe testsuite failures, but it >> ends up failing tests I didn't expect it would fail, and - upon looking >> at sibling ones - also ends up leaving intact tests which I would expect >> would then need adjustment (because of using the new alternative). >> >> In particular (all mentioned tests are in gcc.target/i386/) >> - avx512f-andn-si-zmm-1.c (and its AVX512VL counterparts) fails because >> for whatever reason generated code reverts back to using vpbroadcastd, >> - avx512f-andn-di-zmm-1.c, otoh, is unaffected (i.e. continues to use >> vpandnq with embedded broadcast), >> - avx512f-andn-si-zmm-2.c doesn't use the new 4th insn alternative when >> at the same time a made-up DI variant of the test (akin to what might >> be an avx512f-andn-di-zmm-2.c testcase) does. >> IOW: How is SI mode element size different here from DI mode one? Is >> there anything wrong with the 4th alternative I'm adding, or is this >> hinting at some anomaly elsewhere? > __m512i is defined as __v8di, when it's used for _mm512_andnot_epi32, > it's explicitlt converted to (__v16si) and creates an extra subreg > which is not needed for DImode cases. > And pass_combine try to match the below pattern but failed due to the > condition REG_P (operands[1]) || REG_P (operands[2]). Here I think you > want register_operand instead of REG_P. Thanks, this has indeed made things match my expectations wrt testsuite results. Sadly similar adjustments for other (new) insns didn't make any difference with the further issues I'm facing. I may therefore need to ask more questions; I hope they're not going to be too dumb. Jan
Re: [patch]: Implement PR104327 for avr
Am 25.05.23 um 08:35 schrieb Richard Biener: On Wed, May 24, 2023 at 5:44 PM Georg-Johann Lay wrote: Am 24.05.23 um 11:38 schrieb Richard Biener: On Tue, May 23, 2023 at 2:56 PM Georg-Johann Lay wrote: PR target/104327 not only affects s390 but also avr: The avr backend pre-sets some options depending on optimization level. The inliner then thinks that always_inline functions are not eligible for inlining and terminates with an error. Proposing the following patch that implements TARGET_CAN_INLINE_P. Ok to apply? Johann target/104327: Allow more inlining between different optimization levels. avr-common.cc introduces the following options that are set depending on optimization level: -mgas-isr-prologues, -mmain-is-OS-task and -fsplit-wide-types-early. The inliner thinks that different options disallow cross-optimization inlining, so provide can_inline_p. gcc/ PR target/104327 * config/avr/avr.cc (avr_can_inline_p): New static function. (TARGET_CAN_INLINE_P): Define to that function. diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc index 9fa50ca230d..55b48f63865 100644 --- a/gcc/config/avr/avr.cc +++ b/gcc/config/avr/avr.cc @@ -1018,6 +1018,22 @@ avr_no_gccisr_function_p (tree func) return avr_lookup_function_attribute1 (func, "no_gccisr"); } + +/* Implement `TARGET_CAN_INLINE_P'. */ +/* Some options like -mgas_isr_prologues depend on optimization level, + and the inliner might think that due to different options, inlining + is not permitted; see PR104327. */ + +static bool +avr_can_inline_p (tree /* caller */, tree callee) +{ + // For now, dont't allow to inline ISRs. If the user actually wants + // to inline ISR code, they have to turn the body of the ISR into an + // ordinary function. + + return ! avr_interrupt_function_p (callee); I'm not sure if AVR has ISA extensions but the above will likely break things like void __attribute__((target("-mX"))) foo () { asm ("isa X opcode"); stmt-that-generates-X-ISA; } This yields warning: target attribute is not supported on this machine [-Wattributes] Ah, that's an interesting fact. So that indeed leaves __attribute__((optimize(...))) influencing the set of active target attributes via the generic option target hooks like in your case the different defaults. avr has -mmcu= target options, but switching them in mid-air won't work because the file prologue might already be different and incompatible across different architectures. And I never saw any user requesting such a thing, and I can't imagine any reasonable use case... If the warning is not strong enough, may be it can be turned into an error, but -Wattributes is not specific enough for that. Note the target attribute is then simply ignored. void bar () { if (cpu-has-X) foo (); } if always-inlines are the concern you can use bool always_inline = (DECL_DISREGARD_INLINE_LIMITS (callee) && lookup_attribute ("always_inline", DECL_ATTRIBUTES (callee))); /* Do what the user says. */ if (always_inline) return true; return default_target_can_inline_p (caller, callee); The default implementation of can_inline_p worked fine for avr. As far as I understand, the new behavior is due to clean-up of global states for options? I think the last change was r8-2658-g9b25e12d2d940a which for targets without target attribute support made it more likely to run into the default hook actually comparing the options. Previously the "default" was oddly special-cased but you could have still run into compares with two different set of defaults when there's another "default" default. Say, compile with -O2 and have one optimize(0) and one optimize(Os) function it would compare the optimize(0) and optimize(Os) set if they were distinct from the -O2 set. That probably never happened for AVR. So I need to take into account inlining costs and decide on that whether it's preferred to inline a function or not? No, the hook isn't about cost, it's about full incompatibility. So if the different -m options that could be in effect for AVR in a single TU for different functions never should prevent inlining then simply make the hook return true. If there's a specific option (that can differ from what specified on the compiler command line!) that should, then you should compare the setting of that option from the DECL_FUNCTION_SPECIFIC_TARGET of the caller and the callee. But as far as I can see simply returning true should be correct for AVR, or like your patch handle interrupts differently (though the -Winline diagnostic will tell the user there's a mismatch in target options which might be confusing). Ok, simply "true" sounds reasonable. Is that change ok then? Johann Richard. Johann +} + /* Implement `TARGET_SET_CURRENT_FUNCTION'. */ /* Sanity cheching for above function attributes. */ @@ -14713,6 +14729,9 @@ avr_float_lib_compare_retur
Re: Wrong cost computation / conclusion ins insn combine?
Am 24.05.23 um 14:31 schrieb Richard Earnshaw (lists): On 23/05/2023 19:41, Georg-Johann Lay wrote: For some time now I am staring at the following test case and what combine does with it: typedef struct { unsigned b0 : 1; unsigned b1 : 1; unsigned b2 : 1; unsigned b3 : 1; unsigned b4 : 1; unsigned b5 : 1; unsigned b6 : 1; unsigned b7 : 1; } b_t; Prior to combine, there is: insn_cost 4 for 18: r52:QI = r24:QI insn_cost 4 for 2: r47:QI = r52:QI insn_cost 4 for 6: r48:QI = zero_extract(r47:QI,0x1,0) insn_cost 4 for 7: r50:QI = 0x1 insn_cost 4 for 8: r49:QI = r48:QI ^ r50:QI insn_cost 8 for 9: zero_extract(r47:QI,0x1,0) = r49:QI insn_cost 4 for 15: r24:QI = r47:QI So insn 6 extracts bit 0, insn 8 flips it, and insn 9 inserts it as bit 0 again. Combine then starts looking for combinations, and at some point comes up with: Trying 7 -> 9: 7: r50:QI = ~r47:QI 9: zero_extract(r47:QI,0x1,0) = r50:QI Successfully matched this instruction: (set (zero_extract:QI (reg/v:QI 47 [ a ]) (const_int 1 [0x1]) (const_int 0 [0])) (not:QI (reg/v:QI 47 [ a ]))) allowing combination of insns 7 and 9 original costs 4 + 8 = 12 replacement cost 12 deferring deletion of insn with uid = 7. modifying insn i3 9: zero_extract(r47:QI,0x1,0)=~r47:QI deferring rescan insn with uid = 9. So the cost is 12 and this insn is accepted. But down the line, combine cooks up this: Trying 2, 9 -> 15: 2: r47:QI = r52:QI 9: zero_extract(r47:QI,0x1,0) = ~r47:QI 15: r24:QI = r47:QI ... Successfully matched this instruction: (set (reg/i:QI 24 r24) (ior:QI (and:QI (reg:QI 52) (const_int -2 [0xfffe])) (and:QI (reg/v:QI 47 [ a ]) (const_int 1 [0x1] allowing combination of insns 2, 9 and 15 original costs 4 + 12 + 4 = 20 replacement costs 4 + 12 = 16 deferring deletion of insn with uid = 2. modifying insn i2 9: r47:QI=~r52:QI deferring rescan insn with uid = 9. modifying insn i3 15: r24:QI=r52:QI&0xfffe|r47:QI&0x1 deferring rescan insn with uid = 15. So this one has a cost of 16 which is more expensive than the first combination. For example it still needs to flip the bit. So why is combine choosing the expensive replacement over the cheap one? Because it thinks it is cheaper. As your log shows, the calculation is: original costs 4 + 12 + 4 = 20 replacement costs 4 + 12 = 16 But the real problem is that two of the instructions in this example are simple register-register move operations which will likely be eliminated during register allocation anyway (due to coalescing) and this throws off the cost calculations. I've seen this sort of thing before; perhaps the best solution would be to override the cost of a simple (register to register) set and give it a cost of zero. Then we'd see that this new sequence is worse than the original. Are you proposing to temporarily patch some cost hook during combine? Combine can't work out allocation costs because it won't look at constraints. And when function need special classes, register alloc has to kick out the hard regs again and re-alloc them... And when constraints like "+r" or "0" are present, combine won't do a great job in computing (spared) register allocation costs. For some insns there is even predicates that disallow combine to drop in hard-regs like combine_pseudo_register_operand, but that' a bit too much for this case. Johann Also it combines hard-registers in the 2nd case, but not in the 1st one. So the costs get biased towards 2nd. Can someone explain why combine takes the more expensive solution? Target is avr, compiled with $ avr-gcc-14 bits.c -dumpbase "" -S -Os -fdump-rtl-combine-details -dp Johann R.
Re: [patch]: Implement PR104327 for avr
> Am 25.05.2023 um 16:22 schrieb Georg-Johann Lay : > > > >> Am 25.05.23 um 08:35 schrieb Richard Biener: >>> On Wed, May 24, 2023 at 5:44 PM Georg-Johann Lay wrote: >>> Am 24.05.23 um 11:38 schrieb Richard Biener: On Tue, May 23, 2023 at 2:56 PM Georg-Johann Lay wrote: > > PR target/104327 not only affects s390 but also avr: > The avr backend pre-sets some options depending on optimization level. > The inliner then thinks that always_inline functions are not eligible > for inlining and terminates with an error. > > Proposing the following patch that implements TARGET_CAN_INLINE_P. > > Ok to apply? > > Johann > > target/104327: Allow more inlining between different optimization levels. > > avr-common.cc introduces the following options that are set depending > on optimization level: -mgas-isr-prologues, -mmain-is-OS-task and > -fsplit-wide-types-early. The inliner thinks that different options > disallow cross-optimization inlining, so provide can_inline_p. > > gcc/ > PR target/104327 > * config/avr/avr.cc (avr_can_inline_p): New static function. > (TARGET_CAN_INLINE_P): Define to that function. > diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc > index 9fa50ca230d..55b48f63865 100644 > --- a/gcc/config/avr/avr.cc > +++ b/gcc/config/avr/avr.cc > @@ -1018,6 +1018,22 @@ avr_no_gccisr_function_p (tree func) > return avr_lookup_function_attribute1 (func, "no_gccisr"); >} > > + > +/* Implement `TARGET_CAN_INLINE_P'. */ > +/* Some options like -mgas_isr_prologues depend on optimization level, > + and the inliner might think that due to different options, inlining > + is not permitted; see PR104327. */ > + > +static bool > +avr_can_inline_p (tree /* caller */, tree callee) > +{ > + // For now, dont't allow to inline ISRs. If the user actually wants > + // to inline ISR code, they have to turn the body of the ISR into an > + // ordinary function. > + > + return ! avr_interrupt_function_p (callee); I'm not sure if AVR has ISA extensions but the above will likely break things like void __attribute__((target("-mX"))) foo () { asm ("isa X opcode"); stmt-that-generates-X-ISA; } >>> >>> This yields >>> >>> warning: target attribute is not supported on this machine [-Wattributes] >> Ah, that's an interesting fact. So that indeed leaves >> __attribute__((optimize(...))) >> influencing the set of active target attributes via the generic option target >> hooks like in your case the different defaults. >>> avr has -mmcu= target options, but switching them in mid-air >>> won't work because the file prologue might already be different >>> and incompatible across different architectures. And I never >>> saw any user requesting such a thing, and I can't imagine >>> any reasonable use case... If the warning is not strong enough, >>> may be it can be turned into an error, but -Wattributes is not >>> specific enough for that. >> Note the target attribute is then simply ignored. void bar () { if (cpu-has-X) foo (); } if always-inlines are the concern you can use bool always_inline = (DECL_DISREGARD_INLINE_LIMITS (callee) && lookup_attribute ("always_inline", DECL_ATTRIBUTES (callee))); /* Do what the user says. */ if (always_inline) return true; return default_target_can_inline_p (caller, callee); >>> >>> The default implementation of can_inline_p worked fine for avr. >>> As far as I understand, the new behavior is due to clean-up >>> of global states for options? >> I think the last change was r8-2658-g9b25e12d2d940a which >> for targets without target attribute support made it more likely >> to run into the default hook actually comparing the options. >> Previously the "default" was oddly special-cased but you >> could have still run into compares with two different set of >> defaults when there's another "default" default. Say, compile >> with -O2 and have one optimize(0) and one optimize(Os) >> function it would compare the optimize(0) and optimize(Os) >> set if they were distinct from the -O2 set. That probably never >> happened for AVR. >>> So I need to take into account inlining costs and decide on that >>> whether it's preferred to inline a function or not? >> No, the hook isn't about cost, it's about full incompatibility. So >> if the different -m options that could be in effect for AVR in >> a single TU for different functions never should prevent inlining >> then simply make the hook return true. If there's a specific >> option (that can differ from what specified on the compiler >> command line!) that should, then you should compare the >> setting of that option from the
gcc-11-20230525 is now available
Snapshot gcc-11-20230525 is now available on https://gcc.gnu.org/pub/gcc/snapshots/11-20230525/ and on various mirrors, see http://gcc.gnu.org/mirrors.html for details. This snapshot has been generated from the GCC 11 git branch with the following options: git://gcc.gnu.org/git/gcc.git branch releases/gcc-11 revision 7e289c12ce0da962ef258c992f7c52710198e3a6 You'll find: gcc-11-20230525.tar.xz Complete GCC SHA256=d1f035bb013ab720161bd2ac5586e57b84929b16d88e39db55382128bccb94dd SHA1=40f31830cd8afdc5c918c39b6aa292a477b879e0 Diffs from 11-20230518 are available in the diffs/ subdirectory. When a particular snapshot is ready for public consumption the LATEST-11 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
Will GCC eventually support SSE2 or SSE4.1?
Hi, compile the following function on a system with Core2 processor (released January 2008) for the 32-bit execution environment: --- demo.c --- int ispowerof2(unsigned long long argument) { return (argument & argument - 1) == 0; } --- EOF --- GCC 13.3: gcc -m32 -O3 demo.c NOTE: -mtune=native is the default! # https://godbolt.org/z/b43cjGdY9 ispowerof2(unsigned long long): movqxmm1, [esp+4] pcmpeqd xmm0, xmm0 paddq xmm0, xmm1 pandxmm0, xmm1 movdedx, xmm0 #pxorxmm1, xmm1 psrlq xmm0, 32 #pcmpeqb xmm0, xmm1 movdeax, xmm0 #pmovmskb eax, xmm0 or edx, eax #cmp al, 255 seteal #seteal movzx eax, al# ret 11 instructions in 40 bytes# 10 instructions in 36 bytes OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set" here instead of the native SSE4.1 alias "Penryn New Instruction Set" of the Core2 (and all later processors)? OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the right side? Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1 alias "Penryn New Instruction Set" of the Core2 processor: GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c # https://godbolt.org/z/svhEoYT11 ispowerof2(unsigned long long): #xor eax, eax movqxmm1, [esp+4] #movq xmm1, [esp+4] pcmpeqd xmm0, xmm0 #pcmpeqq xmm0, xmm0 paddq xmm0, xmm1 #paddqxmm0, xmm1 pandxmm0, xmm1 #ptestxmm0, xmm1 movdedx, xmm0 # psrlq xmm0, 32 # movdeax, xmm0 # or edx, eax # seteal #sete al movzx eax, al# ret#ret 11 instructions in 40 bytes# 7 instructions in 26 bytes OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side. ~~~ Last compile with -mtune=i386 for the i386 processor: GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c # https://godbolt.org/z/e76W6dsMj ispowerof2(unsigned long long): pushebx# mov ecx, [esp+8] #moveax, [esp+4] mov ebx, [esp+12] #movedx, [esp+8] mov eax, ecx # mov edx, ebx # add eax, -1#addeax, -1 adc edx, -1#adcedx, -1 and eax, ecx #andeax, [esp+4] and edx, ebx #andedx, [esp+8] or eax, edx #or eax, edx seteal #negeax movzx eax, al#sbbeax, eax pop ebx#inceax ret#ret 14 instructions in 33 bytes# 11 instructions in 32 bytes OUCH: why does GCC abuse EBX (and ECX too) and performs a superfluous memory write? Stefan Kanthak