Re: x86: making better use of vpternlog{d,q}

2023-05-25 Thread Jan Beulich via Gcc
On 24.05.2023 11:01, Hongtao Liu wrote:
> On Wed, May 24, 2023 at 3:58 PM Jan Beulich via Gcc  wrote:
>>
>> Hello,
>>
>> for a couple of years I have been meaning to extend the use of these AVX512F
>> insns beyond the pretty minimalistic ones there are so far. Now that I've
>> got around to at least drafting something, I ran into a couple of issues I
>> cannot explain. I'd like to start with understanding the unexpected
>> effects of a change to an existing insn I have made (reproduced at the
>> bottom). I certainly was prepared to observe testsuite failures, but it
>> ends up failing tests I didn't expect it would fail, and - upon looking
>> at sibling ones - also ends up leaving intact tests which I would expect
>> would then need adjustment (because of using the new alternative).
>>
>> In particular (all mentioned tests are in gcc.target/i386/)
>> - avx512f-andn-si-zmm-1.c (and its AVX512VL counterparts) fails because
>>   for whatever reason generated code reverts back to using vpbroadcastd,
>> - avx512f-andn-di-zmm-1.c, otoh, is unaffected (i.e. continues to use
>>   vpandnq with embedded broadcast),
>> - avx512f-andn-si-zmm-2.c doesn't use the new 4th insn alternative, while
>>   at the same time a made-up DI variant of the test (akin to what might
>>   be an avx512f-andn-di-zmm-2.c testcase) does.
>> IOW: How is SI mode element size different here from DI mode one? Is
>> there anything wrong with the 4th alternative I'm adding, or is this
>> hinting at some anomaly elsewhere?
> __m512i is defined as __v8di; when it's used for _mm512_andnot_epi32,
> it's explicitly converted to (__v16si), which creates an extra subreg
> that is not needed in the DImode cases.
> And pass_combine tries to match the below pattern but fails due to the
> condition REG_P (operands[1]) || REG_P (operands[2]). Here I think you
> want register_operand instead of REG_P.
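(For reference, a minimal sketch of the kind of source involved, assuming the
usual AVX512F intrinsics; this is not the literal avx512f-andn-*-zmm testcase:)

#include <immintrin.h>

/* Sketch only, compile with -mavx512f.  */
__m512i
andn_si (__m512i x, int *p)        /* expected: vpandnd with an embedded {1to16} broadcast */
{
  return _mm512_andnot_epi32 (x, _mm512_set1_epi32 (*p));
}

__m512i
andn_di (__m512i x, long long *p)  /* expected: vpandnq with an embedded {1to8} broadcast */
{
  return _mm512_andnot_epi64 (x, _mm512_set1_epi64 (*p));
}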

Thanks, this has indeed made things match my expectations wrt testsuite
results. Sadly similar adjustments for other (new) insns didn't make
any difference with the further issues I'm facing. I may therefore need
to ask more questions; I hope they're not going to be too dumb.

Jan


Re: [patch]: Implement PR104327 for avr

2023-05-25 Thread Georg-Johann Lay




On 25.05.23 at 08:35, Richard Biener wrote:

On Wed, May 24, 2023 at 5:44 PM Georg-Johann Lay  wrote:

On 24.05.23 at 11:38, Richard Biener wrote:

On Tue, May 23, 2023 at 2:56 PM Georg-Johann Lay  wrote:


PR target/104327 not only affects s390 but also avr:
The avr backend pre-sets some options depending on optimization level.
The inliner then thinks that always_inline functions are not eligible
for inlining and terminates with an error.
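(Illustration only, not taken from the PR: a hedged sketch of the kind of
translation unit that triggers this, where a per-function optimization level
gives the caller and the always_inline callee different option sets:)

static inline __attribute__((always_inline))
unsigned char low_bit (unsigned char x)     /* hypothetical example */
{
  return x & 1;
}

__attribute__((optimize("O0")))
unsigned char use (unsigned char x)
{
  /* With the backend pre-setting options per optimization level, the
     default can_inline_p sees differing option sets here and the
     inliner errors out on the always_inline callee.  */
  return low_bit (x);
}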

Proposing the following patch that implements TARGET_CAN_INLINE_P.

Ok to apply?

Johann

target/104327: Allow more inlining between different optimization levels.

avr-common.cc introduces the following options that are set depending
on optimization level: -mgas-isr-prologues, -mmain-is-OS-task and
-fsplit-wide-types-early.  The inliner thinks that different options
disallow cross-optimization inlining, so provide can_inline_p.

gcc/
  PR target/104327
  * config/avr/avr.cc (avr_can_inline_p): New static function.
  (TARGET_CAN_INLINE_P): Define to that function.
diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc
index 9fa50ca230d..55b48f63865 100644
--- a/gcc/config/avr/avr.cc
+++ b/gcc/config/avr/avr.cc
@@ -1018,6 +1018,22 @@ avr_no_gccisr_function_p (tree func)
  return avr_lookup_function_attribute1 (func, "no_gccisr");
}

+
+/* Implement `TARGET_CAN_INLINE_P'.  */
+/* Some options like -mgas_isr_prologues depend on optimization level,
+   and the inliner might think that due to different options, inlining
+   is not permitted; see PR104327.  */
+
+static bool
+avr_can_inline_p (tree /* caller */, tree callee)
+{
+  // For now, don't allow inlining of ISRs.  If the user actually wants
+  // to inline ISR code, they have to turn the body of the ISR into an
+  // ordinary function.
+
+  return ! avr_interrupt_function_p (callee);


I'm not sure if AVR has ISA extensions but the above will likely break
things like

void __attribute__((target("-mX"))) foo () { asm ("isa X opcode");
stmt-that-generates-X-ISA; }


This yields

warning: target attribute is not supported on this machine [-Wattributes]


Ah, that's an interesting fact.  So that indeed leaves
__attribute__((optimize(...)))
influencing the set of active target attributes via the generic option target
hooks like in your case the different defaults.


avr has -mmcu= target options, but switching them in mid-air
won't work because the file prologue might already be different
and incompatible across different architectures.  And I never
saw any user requesting such a thing, and I can't imagine
any reasonable use case...  If the warning is not strong enough,
maybe it can be turned into an error, but -Wattributes is not
specific enough for that.


Note the target attribute is then simply ignored.


void bar ()
{
if (cpu-has-X)
  foo ();
}

if always-inlines are the concern you can use

  bool always_inline
    = (DECL_DISREGARD_INLINE_LIMITS (callee)
       && lookup_attribute ("always_inline",
                            DECL_ATTRIBUTES (callee)));
  /* Do what the user says.  */
  if (always_inline)
    return true;

  return default_target_can_inline_p (caller, callee);
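(For concreteness, a hedged sketch of how that suggestion could be folded into
the posted hook; it assumes the existing avr_interrupt_function_p helper and
is not a tested patch:)

static bool
avr_can_inline_p (tree caller, tree callee)
{
  /* Sketch only: keep the ISR special case from the posted patch.  */
  if (avr_interrupt_function_p (callee))
    return false;

  /* Do what the user says for always_inline.  */
  if (DECL_DISREGARD_INLINE_LIMITS (callee)
      && lookup_attribute ("always_inline", DECL_ATTRIBUTES (callee)))
    return true;

  return default_target_can_inline_p (caller, callee);
}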


The default implementation of can_inline_p worked fine for avr.
As far as I understand, the new behavior is due to clean-up
of global states for options?


I think the last change was r8-2658-g9b25e12d2d940a which
for targets without target attribute support made it more likely
to run into the default hook actually comparing the options.
Previously the "default" was oddly special-cased but you
could have still run into compares with two different set of
defaults when there's another "default" default.  Say, compile
with -O2 and have one optimize(0) and one optimize(Os)
function; it would compare the optimize(0) and optimize(Os)
sets if they were distinct from the -O2 set.  That probably never
happened for AVR.


So I need to take into account inlining costs and decide on that
whether it's preferred to inline a function or not?


No, the hook isn't about cost, it's about full incompatibility.  So
if the different -m options that could be in effect for AVR in
a single TU for different functions never should prevent inlining
then simply make the hook return true.  If there's a specific
option (that can differ from what is specified on the compiler
command line!) that should, then you should compare the
setting of that option from the DECL_FUNCTION_SPECIFIC_TARGET
of the caller and the callee.

But as far as I can see simply returning true should be correct
for AVR, or like your patch handle interrupts differently (though
the -Winline diagnostic will tell the user there's a mismatch in
target options which might be confusing).


Ok, simply "true" sounds reasonable.  Is that change ok then?

Johann



Richard.


Johann


+}
+
/* Implement `TARGET_SET_CURRENT_FUNCTION'.  */
/* Sanity cheching for above function attributes.  */

@@ -14713,6 +14729,9 @@ avr_float_lib_compare_retur

Re: Wrong cost computation / conclusion in insn combine?

2023-05-25 Thread Georg-Johann Lay




On 24.05.23 at 14:31, Richard Earnshaw (lists) wrote:

On 23/05/2023 19:41, Georg-Johann Lay wrote:

For some time now I have been staring at the following test case and what
combine does with it:

typedef struct
{
 unsigned b0 : 1;
 unsigned b1 : 1;
 unsigned b2 : 1;
 unsigned b3 : 1;
 unsigned b4 : 1;
 unsigned b5 : 1;
 unsigned b6 : 1;
 unsigned b7 : 1;
} b_t;
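(The function being compiled is not quoted here; judging from the RTL below,
it is presumably something along these lines, which is an assumption and not
the actual test case:)

b_t flip_b0 (b_t a)   /* hypothetical name */
{
  a.b0 = !a.b0;
  return a;
}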

Prior to combine, there is:

insn_cost 4 for    18: r52:QI = r24:QI
insn_cost 4 for 2: r47:QI = r52:QI
insn_cost 4 for 6: r48:QI = zero_extract(r47:QI,0x1,0)
insn_cost 4 for 7: r50:QI = 0x1
insn_cost 4 for 8: r49:QI = r48:QI ^ r50:QI
insn_cost 8 for 9: zero_extract(r47:QI,0x1,0) = r49:QI
insn_cost 4 for    15: r24:QI = r47:QI

So insn 6 extracts bit 0, insn 8 flips it, and insn 9 inserts
it as bit 0 again.

Combine then starts looking for combinations, and at some point
comes up with:


Trying 7 -> 9:
 7: r50:QI = ~r47:QI
 9: zero_extract(r47:QI,0x1,0) = r50:QI
Successfully matched this instruction:
(set (zero_extract:QI (reg/v:QI 47 [ a ])
   (const_int 1 [0x1])
   (const_int 0 [0]))
  (not:QI (reg/v:QI 47 [ a ])))
allowing combination of insns 7 and 9
original costs 4 + 8 = 12
replacement cost 12
deferring deletion of insn with uid = 7.
modifying insn i3 9: zero_extract(r47:QI,0x1,0)=~r47:QI
deferring rescan insn with uid = 9.

So the cost is 12 and this insn is accepted. But down the line,
combine cooks up this:

Trying 2, 9 -> 15:
 2: r47:QI = r52:QI
 9: zero_extract(r47:QI,0x1,0) = ~r47:QI
    15: r24:QI = r47:QI
...
Successfully matched this instruction:
(set (reg/i:QI 24 r24)
     (ior:QI (and:QI (reg:QI 52)
                     (const_int -2 [0xfffe]))
             (and:QI (reg/v:QI 47 [ a ])
                     (const_int 1 [0x1]))))
allowing combination of insns 2, 9 and 15
original costs 4 + 12 + 4 = 20
replacement costs 4 + 12 = 16
deferring deletion of insn with uid = 2.
modifying insn i2 9: r47:QI=~r52:QI
deferring rescan insn with uid = 9.
modifying insn i3    15: r24:QI=r52:QI&0xfffe|r47:QI&0x1
deferring rescan insn with uid = 15.

So this one has a cost of 16, which is more expensive than the first
combination.  For example, it still needs to flip the bit.

So why is combine choosing the expensive replacement over the cheap one?


Because it thinks it is cheaper.  As your log shows, the calculation is:

original costs 4 + 12 + 4 = 20
replacement costs 4 + 12 = 16

But the real problem is that two of the instructions in this example are 
simple register-register move operations which will likely be eliminated 
during register allocation anyway (due to coalescing) and this throws 
off the cost calculations.


I've seen this sort of thing before; perhaps the best solution would be 
to override the cost of a simple (register to register) set and give it 
a cost of zero.  Then we'd see that this new sequence is worse than the 
original.
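(For concreteness, one hedged way such an override could be prototyped is a
TARGET_INSN_COST hook along these lines; avr_insn_cost is a hypothetical name
and this is a sketch, not a tested patch:)

static int
avr_insn_cost (rtx_insn *insn, bool speed)
{
  /* Sketch only: treat plain reg-reg copies as free so combine's cost
     comparison is not skewed by moves that register allocation will
     likely coalesce away; everything else gets the usual pattern cost.  */
  rtx set = single_set (insn);
  if (set && REG_P (SET_DEST (set)) && REG_P (SET_SRC (set)))
    return 0;
  return pattern_cost (PATTERN (insn), speed);
}

Whether such a change should apply only while combine runs is exactly the
question raised below.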


Are you proposing to temporarily patch some cost hook during combine?

Combine can't work out allocation costs because it won't look at
constraints.  And when a function needs special classes, register alloc
has to kick out the hard regs again and re-alloc them...

And when constraints like "+r" or "0" are present, combine won't do
a great job in computing (spared) register allocation costs.

For some insns there are even predicates, like combine_pseudo_register_operand,
that disallow combine from dropping in hard regs, but that's
a bit too much for this case.

Johann



Also it combines hard-registers in the 2nd case, but not in the
1st one.  So the costs get biased towards the 2nd.

Can someone explain why combine takes the more expensive solution?

Target is avr, compiled with

$ avr-gcc-14 bits.c -dumpbase "" -S -Os -fdump-rtl-combine-details -dp

Johann


R.




Re: [patch]: Implement PR104327 for avr

2023-05-25 Thread Richard Biener via Gcc



> On 25.05.2023 at 16:22, Georg-Johann Lay wrote:
> 
> 
> 
>> On 25.05.23 at 08:35, Richard Biener wrote:
>>> On Wed, May 24, 2023 at 5:44 PM Georg-Johann Lay  wrote:
>>> On 24.05.23 at 11:38, Richard Biener wrote:
 On Tue, May 23, 2023 at 2:56 PM Georg-Johann Lay  wrote:
> 
> PR target/104327 not only affects s390 but also avr:
> The avr backend pre-sets some options depending on optimization level.
> The inliner then thinks that always_inline functions are not eligible
> for inlining and terminates with an error.
> 
> Proposing the following patch that implements TARGET_CAN_INLINE_P.
> 
> Ok to apply?
> 
> Johann
> 
> target/104327: Allow more inlining between different optimization levels.
> 
> avr-common.cc introduces the following options that are set depending
> on optimization level: -mgas-isr-prologues, -mmain-is-OS-task and
> -fsplit-wide-types-early.  The inliner thinks that different options
> disallow cross-optimization inlining, so provide can_inline_p.
> 
> gcc/
>  PR target/104327
>  * config/avr/avr.cc (avr_can_inline_p): New static function.
>  (TARGET_CAN_INLINE_P): Define to that function.
> diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc
> index 9fa50ca230d..55b48f63865 100644
> --- a/gcc/config/avr/avr.cc
> +++ b/gcc/config/avr/avr.cc
> @@ -1018,6 +1018,22 @@ avr_no_gccisr_function_p (tree func)
>  return avr_lookup_function_attribute1 (func, "no_gccisr");
>}
> 
> +
> +/* Implement `TARGET_CAN_INLINE_P'.  */
> +/* Some options like -mgas_isr_prologues depend on optimization level,
> +   and the inliner might think that due to different options, inlining
> +   is not permitted; see PR104327.  */
> +
> +static bool
> +avr_can_inline_p (tree /* caller */, tree callee)
> +{
> +  // For now, don't allow inlining of ISRs.  If the user actually wants
> +  // to inline ISR code, they have to turn the body of the ISR into an
> +  // ordinary function.
> +
> +  return ! avr_interrupt_function_p (callee);
 
 I'm not sure if AVR has ISA extensions but the above will likely break
 things like
 
 void __attribute__((target("-mX"))) foo () { asm ("isa X opcode");
 stmt-that-generates-X-ISA; }
>>> 
>>> This yields
>>> 
>>> warning: target attribute is not supported on this machine [-Wattributes]
>> Ah, that's an interesting fact.  So that indeed leaves
>> __attribute__((optimize(...)))
>> influencing the set of active target attributes via the generic option target
>> hooks like in your case the different defaults.
>>> avr has -mmcu= target options, but switching them in mid-air
>>> won't work because the file prologue might already be different
>>> and incompatible across different architectures.  And I never
>>> saw any user requesting such a thing, and I can't imagine
>>> any reasonable use case...  If the warning is not strong enough,
>>> maybe it can be turned into an error, but -Wattributes is not
>>> specific enough for that.
>> Note the target attribute is then simply ignored.
 void bar ()
 {
if (cpu-has-X)
  foo ();
 }
 
 if always-inlines are the concern you can use
 
bool always_inline
  = (DECL_DISREGARD_INLINE_LIMITS (callee)
 && lookup_attribute ("always_inline",
  DECL_ATTRIBUTES (callee)));
/* Do what the user says.  */
if (always_inline)
  return true;
 
return default_target_can_inline_p (caller, callee);
>>> 
>>> The default implementation of can_inline_p worked fine for avr.
>>> As far as I understand, the new behavior is due to clean-up
>>> of global states for options?
>> I think the last change was r8-2658-g9b25e12d2d940a which
>> for targets without target attribute support made it more likely
>> to run into the default hook actually comparing the options.
>> Previously the "default" was oddly special-cased but you
>> could have still run into compares with two different set of
>> defaults when there's another "default" default.  Say, compile
>> with -O2 and have one optimize(0) and one optimize(Os)
>> function; it would compare the optimize(0) and optimize(Os)
>> sets if they were distinct from the -O2 set.  That probably never
>> happened for AVR.
>>> So I need to take into account inlining costs and decide on that
>>> whether it's preferred to inline a function or not?
>> No, the hook isn't about cost, it's about full incompatibility.  So
>> if the different -m options that could be in effect for AVR in
>> a single TU for different functions never should prevent inlining
>> then simply make the hook return true.  If there's a specific
>> option (that can differ from what is specified on the compiler
>> command line!) that should, then you should compare the
>> setting of that option from the 

gcc-11-20230525 is now available

2023-05-25 Thread GCC Administrator via Gcc
Snapshot gcc-11-20230525 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/11-20230525/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 11 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch 
releases/gcc-11 revision 7e289c12ce0da962ef258c992f7c52710198e3a6

You'll find:

 gcc-11-20230525.tar.xz   Complete GCC

  SHA256=d1f035bb013ab720161bd2ac5586e57b84929b16d88e39db55382128bccb94dd
  SHA1=40f31830cd8afdc5c918c39b6aa292a477b879e0

Diffs from 11-20230518 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-11
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Will GCC eventually support SSE2 or SSE4.1?

2023-05-25 Thread Stefan Kanthak
Hi,

compile the following function on a system with Core2 processor
(released January 2008) for the 32-bit execution environment:

--- demo.c ---
int ispowerof2(unsigned long long argument)
{
return (argument & argument - 1) == 0;
}
--- EOF ---
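
(The test relies on the usual identity: for a power of two such as 8 = 0b1000,
argument - 1 = 0b0111 clears exactly the single set bit, so
argument & (argument - 1) is 0; note that an argument of 0 passes the test
as well.)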

GCC 13.3: gcc -m32 -O3 demo.c

NOTE: -mtune=native is the default!

# https://godbolt.org/z/b43cjGdY9
ispowerof2(unsigned long long):
        movq    xmm1, [esp+4]
        pcmpeqd xmm0, xmm0
        paddq   xmm0, xmm1
        pand    xmm0, xmm1
        movd    edx, xmm0            # pxor     xmm1, xmm1
        psrlq   xmm0, 32             # pcmpeqb  xmm0, xmm1
        movd    eax, xmm0            # pmovmskb eax, xmm0
        or      edx, eax             # cmp      al, 255
        sete    al                   # sete     al
        movzx   eax, al              #
        ret

11 instructions in 40 bytes          # 10 instructions in 36 bytes

OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set"
  here instead of the native SSE4.1 alias "Penryn New Instruction Set"
  of the Core2 (and all later processors)?

OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the
  right side?


Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1
alias "Penryn New Instruction Set" of the Core2 processor:

GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c

# https://godbolt.org/z/svhEoYT11
ispowerof2(unsigned long long):
                                     # xor      eax, eax
        movq    xmm1, [esp+4]        # movq     xmm1, [esp+4]
        pcmpeqd xmm0, xmm0           # pcmpeqq  xmm0, xmm0
        paddq   xmm0, xmm1           # paddq    xmm0, xmm1
        pand    xmm0, xmm1           # ptest    xmm0, xmm1
        movd    edx, xmm0            #
        psrlq   xmm0, 32             #
        movd    eax, xmm0            #
        or      edx, eax             #
        sete    al                   # sete     al
        movzx   eax, al              #
        ret                          # ret

11 instructions in 40 bytes          # 7 instructions in 26 bytes

OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.
  ~~~
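
(For reference, the right-hand column corresponds roughly to the following
intrinsics sketch; ispowerof2_sse41 is a made-up name, this is not GCC output,
and it assumes -m32 -msse4.1:)

#include <immintrin.h>

/* Sketch only: roughly the SSE4.1 sequence from the right-hand column,
   written with intrinsics.  */
int ispowerof2_sse41 (unsigned long long argument)
{
    __m128i x = _mm_loadl_epi64 ((const __m128i *) &argument); /* movq  */
    __m128i m = _mm_add_epi64 (x, _mm_set1_epi64x (-1));       /* paddq */
    return _mm_testz_si128 (m, x);                             /* ptest + sete */
}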

Last compile with -mtune=i386 for the i386 processor:

GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c

# https://godbolt.org/z/e76W6dsMj
ispowerof2(unsigned long long):
        push    ebx                  #
        mov     ecx, [esp+8]         # mov      eax, [esp+4]
        mov     ebx, [esp+12]        # mov      edx, [esp+8]
        mov     eax, ecx             #
        mov     edx, ebx             #
        add     eax, -1              # add      eax, -1
        adc     edx, -1              # adc      edx, -1
        and     eax, ecx             # and      eax, [esp+4]
        and     edx, ebx             # and      edx, [esp+8]
        or      eax, edx             # or       eax, edx
        sete    al                   # neg      eax
        movzx   eax, al              # sbb      eax, eax
        pop     ebx                  # inc      eax
        ret                          # ret

14 instructions in 33 bytes          # 11 instructions in 32 bytes

OUCH: why does GCC abuse EBX (and ECX too) and perform a superfluous
  memory write?


Stefan Kanthak