Re: [PATCH] rs6000: ROP - Emit hashst and hashchk insns on Power8 and later [PR114759]

2024-07-04 Thread Kewen.Lin
on 2024/7/3 23:05, Peter Bergner wrote:
> On 7/3/24 4:01 AM, Kewen.Lin wrote:
>>> -  if (TARGET_POWER10
>>> +  if (TARGET_POWER8
>>>&& info->calls_p
>>>&& DEFAULT_ABI == ABI_ELFv2
>>>&& rs6000_rop_protect)
>>
>> Nit: I noticed that this is the only place to change
>> info->rop_hash_size to non-zero, and ...
>>
>>> @@ -3277,7 +3277,7 @@ rs6000_emit_prologue (void)
>>>/* NOTE: The hashst isn't needed if we're going to do a sibcall,
>>>   but there's no way to know that here.  Harmless except for
>>>   performance, of course.  */
>>> -  if (TARGET_POWER10 && rs6000_rop_protect && info->rop_hash_size != 0)
>>> +  if (TARGET_POWER8 && rs6000_rop_protect && info->rop_hash_size != 0)
>>
>> ... this condition and ...
>>
>>>  {
>>>gcc_assert (DEFAULT_ABI == ABI_ELFv2);
>>>rtx stack_ptr = gen_rtx_REG (Pmode, STACK_POINTER_REGNUM);
>>> @@ -5056,7 +5056,7 @@ rs6000_emit_epilogue (enum epilogue_type epilogue_type)
>>>  
>>>/* The ROP hash check must occur after the stack pointer is restored
>>>   (since the hash involves r1), and is not performed for a sibcall.  */
>>> -  if (TARGET_POWER10
>>> +  if (TARGET_POWER8
>>>&& rs6000_rop_protect
>>>&& info->rop_hash_size != 0
>>
>> ... here, both check that info->rop_hash_size isn't zero, so I think we can
>> drop these two TARGET_POWER10 (TARGET_POWER8) and rs6000_rop_protect checks.
>> Instead, just update the inner gcc_assert (now checking DEFAULT_ABI ==
>> ABI_ELFv2) with extra checks on TARGET_POWER8 && rs6000_rop_protect?
>>
>> The rest looks good to me; ok for trunk with this nit tweaked (if you
>> agree with it and it re-tests well), thanks!
> 
> I agree with you, because the next patch I haven't submitted yet (waiting
> on this to get in) makes that simplification as part of adding earlier
> checking of invalid options. :-)  The follow-on patch will not only remove
> the TARGET_* and the 2nd/3rd rs6000_rop_protect usages, but will also remove
> the test and asserts of ELFv2...because we've already verified valid option
> usage earlier in the normal options handling code.
> 
> Therefore, I'd like to keep this patch as simple as possible and limited to
> the TARGET_POWER10 -> TARGET_POWER8 change; the cleanup of those tests is
> coming in the next patch...which has already been tested.

Looking forward to the upcoming patch, then this patch is ok for trunk, thanks!
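
For reference, the consolidation suggested above would look roughly like
this in both rs6000_emit_prologue and rs6000_emit_epilogue (a sketch only;
the actual follow-on patch may shape it differently):

  if (info->rop_hash_size != 0)
    {
      /* Option handling already guaranteed these whenever rop_hash_size
         was set to a non-zero value.  */
      gcc_assert (TARGET_POWER8
                  && rs6000_rop_protect
                  && DEFAULT_ABI == ABI_ELFv2);
      /* ... emit the hashst/hashchk sequence as before ...  */
    }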

BR,
Kewen



Re: [PATCH] rs6000, update vec_ld, vec_lde, vec_st and vec_ste documentation

2024-07-04 Thread Kewen.Lin
Hi Carl,

on 2024/7/4 01:23, Carl Love wrote:
> 
> On 7/3/24 2:36 AM, Kewen.Lin wrote:
>> Hi Carl,
>>
>> on 2024/6/27 01:05, Carl Love wrote:
>>> GCC maintainers:
>>>
>>> The following patch updates the user documentation for the vec_ld, vec_lde, 
>>> vec_st and vec_ste built-ins to make it clearer that there are data 
>>> alignment requirements for these built-ins.  If the data alignment 
>>> requirements are not followed, the data loaded or stored by these built-ins 
>>> will be wrong.
>>>
>>> Please let me know if this patch is acceptable for mainline.  Thanks.
>>>
>>>    Carl
>>>
>>> 
>>> rs6000, update vec_ld, vec_lde, vec_st and vec_ste documentation
>>>
>>> Use of the vec_ld and vec_st built-ins require that the data be 16-byte
>>> aligned to work properly.  Add some additional text to the existing
>>> documentation to make this clearer to the user.
>>>
>>> Similarly, the vec_lde and vec_ste built-ins also have data alignment
>>> requirements based on the size of the vector element.  Update the
>>> documentation to make this clear to the user.
>>>
>>> gcc/ChangeLog:
>>> * doc/extend.texi: Add clarification for the use of the vec_ld
>>> vec_st, vec_lde and vec_ste built-ins.
>>> ---
>>>   gcc/doc/extend.texi | 15 +++
>>>   1 file changed, 11 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
>>> index ee3644a5264..55faded17b9 100644
>>> --- a/gcc/doc/extend.texi
>>> +++ b/gcc/doc/extend.texi
>>> @@ -22644,10 +22644,17 @@ vector unsigned char vec_xxsldi (vector unsigned char,
>>>   @end smallexample
>>>     Note that the @samp{vec_ld} and @samp{vec_st} built-in functions always
>>> -generate the AltiVec @samp{LVX} and @samp{STVX} instructions even
>>> -if the VSX instruction set is available.  The @samp{vec_vsx_ld} and
>>> -@samp{vec_vsx_st} built-in functions always generate the VSX @samp{LXVD2X},
>>> -@samp{LXVW4X}, @samp{STXVD2X}, and @samp{STXVW4X} instructions.
>>> +generate the AltiVec @samp{LVX}, and @samp{STVX} instructions.  The
>> This change removed "even if the VSX instruction set is available."; I think
>> that's not intentional?  vec_ld and vec_st are well defined in PVIPR, and
>> this paragraph is not meant to document them IMHO.  Since we document
>> vec_vsx_ld and vec_vsx_st here, it aims to note the difference between these
>> two pairs.  But I'm not opposed to adding more words to emphasize the
>> special masking off; I prefer to use the same wording as PVIPR, "ignoring
>> the four low-order bits of the calculated address".  And IMHO we should not
>> say "it requires the data to be 16-byte aligned to work properly": users may
>> be well aware of this behavior, have some non-16-byte-aligned data, and
>> expect it to behave exactly like that, so it's arguable to define "it" as
>> not working properly.
> 
> Yea, probably should have left "even if the VSX instruction set is available."
> 
> I was looking to make it clear that if the data is not 16-byte aligned you
> may not get the expected data loaded/stored.
> 
> So how about the following instead:
> 
>    Note that the @samp{vec_ld} and @samp{vec_st} built-in functions always
>    generate the AltiVec @samp{LVX} and @samp{STVX} instructions even if the
>    VSX instruction set is available.  The instructions mask off the lower
>    4 bits of the calculated address.  The use of these instructions on data
>    that is not 16-byte aligned may result in unexpected bytes being loaded
>    or stored.

Sorry for nitpicking: to avoid the implicit conclusion between "not 16-byte
aligned" and "unexpected bytes" (considering that even if the given address
isn't 16-byte aligned, users may want to leverage this masking-off trick, and
they can do that since the behavior is well defined, the results are still
expected for them, and PVIPR doesn't object to this), maybe "... of the
calculated address, so be careful about the alignment of the calculated
address when you meet unexpected load or store data."?
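
To make the masking concrete, a minimal example (a sketch, not part of the
patch; assumes a VMX-enabled target and altivec.h):

  #include <altivec.h>

  vector signed int
  load_16 (signed int *p)
  {
    /* lvx loads from p with the four low-order bits of the calculated
       address ignored, so a non-16-byte-aligned p silently reads from
       the preceding 16-byte boundary.  */
    return vec_ld (0, p);
  }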

> 
>>> +instructions mask off the lower 4 bits of the effective address thus requiring
>>> +the data to be 16-byte aligned to work properly.  The @samp{vec_lde} and
>>> +@samp{vec_ste} built-in functions operate on vectors of bytes, short integer,
>>> +integer, and float.  The corresponding AltiVec instructions @samp{LVEBX},
>>> +@samp{LVEHX}, @samp{LVEWX}, @samp{STVEBX}, @samp{STVEHX}, @samp{STVEWX} mask
>>> +off the lower bits of the effective address based on the size of the data.
>>> +Thus the data must be aligned to the size of the vector element to work
>>> +properly.  The @samp{vec_vsx_ld} and @samp{vec_vsx_st} built-in functions
>>> +always generate the VSX @samp{LXVD2X}, @samp{LXVW4X}, @samp{STXVD2X}, and
>>> +@samp{STXVW4X} instructions.
>> As above, there was a reason to mention vec_ld and vec_st here, but not one
>> for vec_lde and vec_ste IMHO, so let's not mention vec_lde

Re: [PATCH 2/13 ver5] rs6000, __builtin_vsx_xvcv{sp{sx,u}ws,dpuxds_uns}

2024-07-04 Thread Kewen.Lin
Hi,

on 2024/7/4 07:33, Carl Love wrote:
> GCC maintainers:
> 
> Per the comments on patch 2 from version 4, I have moved the removal of 
> built-ins __builtin_vsx_xvcvdpsxws and __builtin_vsx_xvcvdpuxws from patch 4 
> to this patch.
> 
> Please let me know if this patch is acceptable.  Thanks.
> 
>     Carl
> 
> 
> 
> rs6000, __builtin_vsx_xvcv{sp{sx,u}ws,dpuxds_uns}

Nit: incomplete subject
rs6000: Remove built-ins __builtin_vsx_xvcv{sp{sx,u}ws,dpuxds_uns}

OK for trunk with this nit fixed, thanks!

BR,
Kewen

> 
> The built-in __builtin_vsx_xvcvspsxws is covered by the vec_signed
> built-in that is documented in the PVIPR.  The __builtin_vsx_xvcvspsxws
> built-in is not documented and there are no test cases for it.
> 
> The built-in __builtin_vsx_xvcvdpuxds_uns is redundant as it is covered by
> vec_unsigned, remove.
> 
> The __builtin_vsx_xvcvspuxws is redundant as it is covered by
> vec_unsigned, remove.
> 
> The built-in __builtin_vsx_xvcvdpsxws is redundant as it is covered by
> vec_signed{e,o}, remove.
> 
> The built-in __builtin_vsx_xvcvdpuxws is redundant as it is covered by
> vec_unsigned{e,o}, remove.
> 
> This patch removes the redundant built-ins.
> 
> gcc/ChangeLog:
>     * config/rs6000/rs6000-builtins.def (__builtin_vsx_xvcvspsxws,
>     __builtin_vsx_xvcvdpuxds_uns, __builtin_vsx_xvcvspuxws,
>     __builtin_vsx_xvcvdpsxws, __builtin_vsx_xvcvdpuxws): Remove
>     built-in definitions.
> ---
>  gcc/config/rs6000/rs6000-builtins.def | 15 ---
>  1 file changed, 15 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000-builtins.def b/gcc/config/rs6000/rs6000-builtins.def
> index 7c36976a089..60ccc5542be 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -1688,36 +1688,21 @@
>    const vsll __builtin_vsx_xvcvdpsxds_scale (vd, const int);
>  XVCVDPSXDS_SCALE vsx_xvcvdpsxds_scale {}
> 
> -  const vsi __builtin_vsx_xvcvdpsxws (vd);
> -    XVCVDPSXWS vsx_xvcvdpsxws {}
> -
>    const vsll __builtin_vsx_xvcvdpuxds (vd);
>  XVCVDPUXDS vsx_fixuns_truncv2dfv2di2 {}
> 
>    const vsll __builtin_vsx_xvcvdpuxds_scale (vd, const int);
>  XVCVDPUXDS_SCALE vsx_xvcvdpuxds_scale {}
> 
> -  const vull __builtin_vsx_xvcvdpuxds_uns (vd);
> -    XVCVDPUXDS_UNS vsx_fixuns_truncv2dfv2di2 {}
> -
> -  const vsi __builtin_vsx_xvcvdpuxws (vd);
> -    XVCVDPUXWS vsx_xvcvdpuxws {}
> -
>    const vd __builtin_vsx_xvcvspdp (vf);
>  XVCVSPDP vsx_xvcvspdp {}
> 
>    const vsll __builtin_vsx_xvcvspsxds (vf);
>  XVCVSPSXDS vsx_xvcvspsxds {}
> 
> -  const vsi __builtin_vsx_xvcvspsxws (vf);
> -    XVCVSPSXWS vsx_fix_truncv4sfv4si2 {}
> -
>    const vsll __builtin_vsx_xvcvspuxds (vf);
>  XVCVSPUXDS vsx_xvcvspuxds {}
> 
> -  const vsi __builtin_vsx_xvcvspuxws (vf);
> -    XVCVSPUXWS vsx_fixuns_truncv4sfv4si2 {}
> -
>    const vd __builtin_vsx_xvcvsxddp (vsll);
>  XVCVSXDDP vsx_floatv2div2df2 {}
> 
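
As a usage illustration for the removals above (a sketch, not part of the
patch), the PVIPR-documented overload generates the same instruction as the
removed built-in:

  #include <altivec.h>

  vector signed int
  conv (vector float vf)
  {
    /* Emits xvcvspsxws, just as __builtin_vsx_xvcvspsxws did.  */
    return vec_signed (vf);
  }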


Re: [PATCH 4/13 ver5] rs6000, extend the current vec_{un, }signed{e, o} built-ins

2024-07-04 Thread Kewen.Lin
Hi,

on 2024/7/4 07:40, Carl Love wrote:
> 
> GCC maintainers:
> 
> I moved the removal of built-ins __builtin_vsx_xvcvdpsxws and
> __builtin_vsx_xvcvdpuxws from patch 4 to patch 2.
> 
> I fixed various issues with the ChangeLog wording, spaces and descriptions.
> 
> Fixed the comments in file gcc/config/rs6000/vsx.md.
> 
> Updated the built-in description in gcc/doc/extend.texi.
> 
> Please let me know if the patch is acceptable for mainline. Thanks.
> 
> Carl
> 
> 
> 
>  rs6000, extend the current vec_{un,}signed{e,o}  built-ins

Nit: s/  / /

> 
> The built-ins __builtin_vsx_xvcvspsxds and __builtin_vsx_xvcvspuxds
> convert a vector of floats to a vector of signed/unsigned long long ints.
> Extend the existing vec_{un,}signed{e,o} built-ins to handle an argument
> vector of floats, returning a vector of even/odd signed/unsigned long
> long integers.
> 
> The define_expands vsignede_v4sf, vsignedo_v4sf, vunsignede_v4sf and
> vunsignedo_v4sf are added to support the new vec_{un,}signed{e,o}
> built-ins.
> 
> The built-ins __builtin_vsx_xvcvspsxds and __builtin_vsx_xvcvspuxds are
> now for internal use only. They are not documented and they do not
> have test cases.
> 
> Add testcases and update documentation.

OK for trunk, thanks!

BR,
Kewen

> 
> gcc/ChangeLog:
>     * config/rs6000/rs6000-builtins.def (__builtin_vsx_xvcvspsxds,
>     __builtin_vsx_xvcvspuxds): Rename to
>     __builtin_vsignede_v4sf, __builtin_vunsignede_v4sf respectively.
>     (XVCVSPSXDS, XVCVSPUXDS): Rename to VEC_VSIGNEDE_V4SF,
>     VEC_VUNSIGNEDE_V4SF respectively.
>     (__builtin_vsignedo_v4sf, __builtin_vunsignedo_v4sf): New
>     built-in definitions.
>     * config/rs6000/rs6000-overload.def (vec_signede, vec_signedo,
>     vec_unsignede, vec_unsignedo): Add new overloaded specifications.
>     * config/rs6000/vsx.md (vsignede_v4sf, vsignedo_v4sf,
>     vunsignede_v4sf, vunsignedo_v4sf): New define_expands.
>     * doc/extend.texi (vec_signedo, vec_signede, vec_unsignedo,
>     vec_unsignede): Add documentation for new overloaded built-ins to
>     convert vector float to vector {un,}signed long long.
> 
> gcc/testsuite/ChangeLog:
>     * gcc.target/powerpc/builtins-3-runnable.c
>     (test_unsigned_int_result, test_ll_unsigned_int_result): Add
>     new argument.
>     (vec_signede, vec_signedo, vec_unsignede, vec_unsignedo): New
>     tests for the overloaded built-ins.
> ---
>  gcc/config/rs6000/rs6000-builtins.def | 14 +++-
>  gcc/config/rs6000/rs6000-overload.def |  8 ++
>  gcc/config/rs6000/vsx.md  | 84 +++
>  gcc/doc/extend.texi   | 10 +++
>  .../gcc.target/powerpc/builtins-3-runnable.c  | 49 +--
>  5 files changed, 154 insertions(+), 11 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000-builtins.def b/gcc/config/rs6000/rs6000-builtins.def
> index 43d5c229dc3..29a9deb3410 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -1697,11 +1697,17 @@
>    const vd __builtin_vsx_xvcvspdp (vf);
>  XVCVSPDP vsx_xvcvspdp {}
> 
> -  const vsll __builtin_vsx_xvcvspsxds (vf);
> -    XVCVSPSXDS vsx_xvcvspsxds {}
> +  const vsll __builtin_vsignede_v4sf (vf);
> +    VEC_VSIGNEDE_V4SF vsignede_v4sf {}
> 
> -  const vsll __builtin_vsx_xvcvspuxds (vf);
> -    XVCVSPUXDS vsx_xvcvspuxds {}
> +  const vsll __builtin_vsignedo_v4sf (vf);
> +    VEC_VSIGNEDO_V4SF vsignedo_v4sf {}
> +
> +  const vull __builtin_vunsignede_v4sf (vf);
> +    VEC_VUNSIGNEDE_V4SF vunsignede_v4sf {}
> +
> +  const vull __builtin_vunsignedo_v4sf (vf);
> +    VEC_VUNSIGNEDO_V4SF vunsignedo_v4sf {}
> 
>    const vd __builtin_vsx_xvcvsxddp (vsll);
>  XVCVSXDDP vsx_floatv2div2df2 {}
> diff --git a/gcc/config/rs6000/rs6000-overload.def b/gcc/config/rs6000/rs6000-overload.def
> index 84bd9ae6554..4d857bb1af3 100644
> --- a/gcc/config/rs6000/rs6000-overload.def
> +++ b/gcc/config/rs6000/rs6000-overload.def
> @@ -3307,10 +3307,14 @@
>  [VEC_SIGNEDE, vec_signede, __builtin_vec_vsignede]
>    vsi __builtin_vec_vsignede (vd);
>  VEC_VSIGNEDE_V2DF
> +  vsll __builtin_vec_vsignede (vf);
> +    VEC_VSIGNEDE_V4SF
> 
>  [VEC_SIGNEDO, vec_signedo, __builtin_vec_vsignedo]
>    vsi __builtin_vec_vsignedo (vd);
>  VEC_VSIGNEDO_V2DF
> +  vsll __builtin_vec_vsignedo (vf);
> +    VEC_VSIGNEDO_V4SF
> 
>  [VEC_SIGNEXTI, vec_signexti, __builtin_vec_signexti]
>    vsi __builtin_vec_signexti (vsc);
> @@ -4433,10 +4437,14 @@
>  [VEC_UNSIGNEDE, vec_unsignede, __builtin_vec_vunsignede]
>    vui __builtin_vec_vunsignede (vd);
>  VEC_VUNSIGNEDE_V2DF
> +  vull __builtin_vec_vunsignede (vf);
> +    VEC_VUNSIGNEDE_V4SF
> 
>  [VEC_UNSIGNEDO, vec_unsignedo, __builtin_vec_vunsignedo]
>    vui __builtin_vec_vunsignedo (vd);
>  VEC_VUNSIGNEDO_V2DF
> +  vull __builtin_vec_vunsignedo (vf);
> +    VEC_VUNSIGNEDO_V4SF
> 
>  [VEC_VEE, vec_extract_exp, __builtin_vec_extract_exp]
>    vui __builtin_vec_extract_exp (vf);
> diff --git a/gcc/conf
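
As a usage illustration for the new overloads (a sketch; the actual tests
live in builtins-3-runnable.c):

  #include <altivec.h>

  vector signed long long
  even_elems (vector float vf)
  {
    /* With this patch, vec_signede accepts vector float and converts the
       even-numbered elements to signed long long, via the new
       vsignede_v4sf expander.  */
    return vec_signede (vf);
  }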

Re: [PATCH 13/13 ver5] rs6000, remove vector set and vector init built-ins.

2024-07-04 Thread Kewen.Lin
Hi Carl,

on 2024/7/4 07:51, Carl Love wrote:
>  GCC maintainers:
> 
> The patch has been updated to remove the customized vec_init built-in code.
> Specifically: the init identifier, the related generated code for the init
> built-in attribute bit, the function altivec_expand_vec_init_builtin, and
> calls to that function.
> 
> Please let me know if the patch is acceptable for mainline. Thanks.
> 
>   Carl
> 
> ---
> 
> rs6000, remove vector set and vector init built-ins.
> 
> The vector init built-ins:
> 
>   __builtin_vec_init_v16qi, __builtin_vec_init_v8hi,
>   __builtin_vec_init_v4si, __builtin_vec_init_v4sf,
>   __builtin_vec_init_v2di, __builtin_vec_init_v2df,
>   __builtin_vec_init_v1ti
> 
> perform the same operation as initializing the vector in C code. For
> example:
> 
>   result_v4si = __builtin_vec_init_v4si (1, 2, 3, 4);
>   result_v4si = {1, 2, 3, 4};
> 
> These two constructs were tested and verified they generate identical
> assembly instructions with no optimization and -O3 optimization.
> 
> The vector set built-ins:
> 
>   __builtin_vec_set_v16qi, __builtin_vec_set_v8hi,
>   __builtin_vec_set_v4si, __builtin_vec_set_v4sf,
>   __builtin_vec_set_v1ti, __builtin_vec_set_v2di,
>   __builtin_vec_set_v2df
> 
> perform the same operation as setting a specific element in the vector in
> C code.  For example:
> 
>   src_v4si = __builtin_vec_set_v4si (src_v4si, int_val, index);
>   src_v4si[index] = int_val;
> 
> The built-in actually generates more instructions than the inline C code
> with no optimization but is identical with -O3 optimizations.
> 
> All of the above built-ins that are removed do not have test cases and
> are not documented.
> 
> Built-ins __builtin_vec_set_v1ti, __builtin_vec_set_v2di and
> __builtin_vec_set_v2df are not removed as they are used in function
> resolve_vec_insert() in file rs6000-c.cc.
> 
> The built-ins are removed as they don't provide any benefit over just
> using C code.
> 
> The code to define the bif_init_bit, bif_is_init, as well as their uses
> is removed.  The function altivec_expand_vec_init_builtin is also removed.

Nit: s/is removed/are removed/ ?

> 
> gcc/ChangeLog:
>     * config/rs6000/rs6000-builtin.cc (altivec_expand_vec_init_builtin):
>     Removed the function.

Nit: s/Removed/Remove/, applied for the other changelog entries.

>     (rs6000_expand_builtin): Removed the if bif_is_init check to call
>     the altivec_expand_vec_init_builtin function.


>     * config/rs6000/rs6000-builtins.def: Removed the attribute string
>     comment for init.
>     (__builtin_vec_init_v16qi,
>     __builtin_vec_init_v4sf, __builtin_vec_init_v4si,
>     __builtin_vec_init_v8hi, __builtin_vec_init_v1ti,
>     __builtin_vec_init_v2df, __builtin_vec_init_v2di,
>     __builtin_vec_set_v16qi, __builtin_vec_set_v4sf,
>     __builtin_vec_set_v4si, __builtin_vec_set_v8hi): Remove
>     built-in definitions.
>     * config/rs6000/rs6000-gen-builtins.cc: Removed comment for init attribute
>     string.
>     (struct attrinfo): Removed isint entry.

Typo: s/isint/isinit/

>     (parse_bif_attrs): Removed the if statement to check for attribute
>     init.
>     (ifdef DEBUG): Removed print for init attribute string.
>     (write_decls): Removed print for define bif_init_bit and
>     define for bif_is_init.
>     (write_bif_static_init): Removed if bifp->attrs.isinit statement.
> ---
>  gcc/config/rs6000/rs6000-builtin.cc  | 40 -
>  gcc/config/rs6000/rs6000-builtins.def    | 45 +++-
>  gcc/config/rs6000/rs6000-gen-builtins.cc | 16 +++--
>  3 files changed, 8 insertions(+), 93 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
> index 646e740774e..0a24d20a58c 100644
> --- a/gcc/config/rs6000/rs6000-builtin.cc
> +++ b/gcc/config/rs6000/rs6000-builtin.cc
> @@ -2313,43 +2313,6 @@ altivec_expand_predicate_builtin (enum insn_code icode, tree exp, rtx target)
>    return target;
>  }
> 
> -/* Expand vec_init builtin.  */
> -static rtx
> -altivec_expand_vec_init_builtin (tree type, tree exp, rtx target)
> -{
> -  machine_mode tmode = TYPE_MODE (type);
> -  machine_mode inner_mode = GET_MODE_INNER (tmode);
> -  int i, n_elt = GET_MODE_NUNITS (tmode);
> -
> -  gcc_assert (VECTOR_MODE_P (tmode));
> -  gcc_assert (n_elt == call_expr_nargs (exp));
> -
> -  if (!target || !register_operand (target, tmode))
> -    target = gen_reg_rtx (tmode);
> -
> -  /* If we have a vector compromised of a single element, such as V1TImode, do
> - the initialization directly.  */
> -  if (n_elt == 1 && GET_MODE_SIZE (tmode) == GET_MODE_SIZE (inner_mode))
> -    {
> -  rtx x = expand_normal (CALL_EXPR_ARG (exp, 0));
> -  emit_move_insn (target, gen_lowpart (tmode, x));
> -    }
> -  else
> -    {
> -  rtvec v = rtvec_alloc (n_elt);
> -
> -  for (i = 0; i < n_elt; ++i)
> -    {
> -    

Re: [PATCH v1 0/2] Aarch64: addp NEON big-endian fix [PR114890]

2024-07-04 Thread Kyrylo Tkachov


> On 3 Jul 2024, at 12:50, Alfie Richards  wrote:
> 
> Hi Kyrill,
> 
> Okay, noted for the future!
> Yes, happy for someone to commit this.
> 

Ok, I’ve pushed the two patches to mainline for you.
Thanks!
Kyrill

> Kind regards,
> Alfie
> 
> Sent from Outlook for iOS
> From: Kyrylo Tkachov 
> Sent: Wednesday, July 3, 2024 11:23:37 AM
> To: Alfie Richards 
> Cc: gcc-patches@gcc.gnu.org 
> Subject: Re: [PATCH v1 0/2] Aarch64: addp NEON big-endian fix [PR114890]
> 
> Hi Alfie,
> 
> > On 3 Jul 2024, at 12:10, alfie.richa...@arm.com wrote:
> > 
> > 
> > From: Alfie Richards 
> > 
> > Hi All,
> > 
> > This fixes a case where the operands for the addp NEON intrinsic were 
> > erroneously swapped.
> > 
> > Regtested on aarch64-unknown-linux-gnu
> > 
> > Ok for master and GCC14.2?
> 
> This is okay, though I would have done it as a single commit rather than 
> splitting the test case separately and then updating it to remove the xfail.
> But it’s okay as is anyway.
> Do you need someone to commit it for you?
> Thanks,
> Kyrill
> 
> > 
> > Kind regards,
> > Alfie Richards
> > 
> > Alfie Richards (2):
> >  Aarch64: Add test for non-commutative SIMD intrinsic
> >  Aarch64, bugfix: Fix NEON bigendian addp intrinsic [PR114890]
> > 
> > gcc/config/aarch64/aarch64-simd.md|   2 -
> > .../aarch64/vector_intrinsics_asm.c   | 371 ++
> > 2 files changed, 371 insertions(+), 2 deletions(-)
> > create mode 100644 gcc/testsuite/gcc.target/aarch64/vector_intrinsics_asm.c
> > 
> > --
> > 2.34.1
> > 
> 



[PATCH 1/3] expr: Allow same precision modes conversion between {ibm_extended, ieee_quad}_format [PR112993]

2024-07-04 Thread Kewen.Lin
Hi,

For historical reasons, rs6000 defines KFmode, TFmode
and IFmode to have different mode precisions, which causes
some issues and requires workarounds such as r14-6478 for
PR112788.  So we are going to make all rs6000 128-bit scalar
FP modes have 128-bit precision.  To prepare for that, this
patch makes function convert_mode_scalar allow same-precision
FP mode conversions if the underlying formats are
ibm_extended_format and ieee_quad_format respectively, just
like the existing special treatment for arm_bfloat_half_format
<-> ieee_half_format.  It also factors out all the relevant
checks into a lambda function.

Bootstrapped and regtested on x86_64-redhat-linux and
powerpc64{,le}-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-
PR target/112993

gcc/ChangeLog:

* expr.cc (convert_mode_scalar): Allow same precision conversion
between scalar floating point modes if their underlying formats are
ibm_extended_format and ieee_quad_format, and refactor the assertion
with the new lambda function acceptable_same_precision_modes.
---
 gcc/expr.cc | 30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/gcc/expr.cc b/gcc/expr.cc
index ffbac513692..eac4dcc982e 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -338,6 +338,29 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
   enum rtx_code equiv_code = (unsignedp < 0 ? UNKNOWN
  : (unsignedp ? ZERO_EXTEND : SIGN_EXTEND));

+  auto acceptable_same_precision_modes
+= [] (scalar_mode from_mode, scalar_mode to_mode) -> bool
+{
+  if (DECIMAL_FLOAT_MODE_P (from_mode) != DECIMAL_FLOAT_MODE_P (to_mode))
+   return true;
+
+  /* arm_bfloat_half_format <-> ieee_half_format */
+  if ((REAL_MODE_FORMAT (from_mode) == &arm_bfloat_half_format
+  && REAL_MODE_FORMAT (to_mode) == &ieee_half_format)
+ || (REAL_MODE_FORMAT (to_mode) == &arm_bfloat_half_format
+ && REAL_MODE_FORMAT (from_mode) == &ieee_half_format))
+   return true;
+
+  /* ibm_extended_format <-> ieee_quad_format */
+  if ((REAL_MODE_FORMAT (from_mode) == &ibm_extended_format
+  && REAL_MODE_FORMAT (to_mode) == &ieee_quad_format)
+ || (REAL_MODE_FORMAT (from_mode) == &ieee_quad_format
+ && REAL_MODE_FORMAT (to_mode) == &ibm_extended_format))
+   return true;
+
+  return false;
+};
+
   if (to_real)
 {
   rtx value;
@@ -346,12 +369,7 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)

   gcc_assert ((GET_MODE_PRECISION (from_mode)
   != GET_MODE_PRECISION (to_mode))
- || (DECIMAL_FLOAT_MODE_P (from_mode)
- != DECIMAL_FLOAT_MODE_P (to_mode))
- || (REAL_MODE_FORMAT (from_mode) == &arm_bfloat_half_format
- && REAL_MODE_FORMAT (to_mode) == &ieee_half_format)
- || (REAL_MODE_FORMAT (to_mode) == &arm_bfloat_half_format
- && REAL_MODE_FORMAT (from_mode) == &ieee_half_format));
+ || acceptable_same_precision_modes (from_mode, to_mode));

   if (GET_MODE_PRECISION (from_mode) == GET_MODE_PRECISION (to_mode))
{
--
2.39.1
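
As an illustration, the kind of conversion this newly permits (a sketch;
assumes a powerpc64le target where, after patch 2/3, both types map to
128-bit-precision modes):

  /* With -mabi=ibmlongdouble, long double is ibm_extended_format while
     __float128 is ieee_quad_format: same 128-bit precision, different
     formats, so convert_mode_scalar must allow the conversion (expanded
     via a libcall) instead of asserting.  */
  __float128
  to_ieee (long double x)
  {
    return (__float128) x;
  }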


[PATCH 2/3 v2] rs6000: Make all 128 bit scalar FP modes have 128 bit precision [PR112993]

2024-07-04 Thread Kewen.Lin
Hi,

On rs6000, there are three 128-bit scalar floating point
modes: TFmode, IFmode and KFmode.  For historical reasons,
we define them with different mode precisions, that is,
KFmode 126, TFmode 127 and IFmode 128.  But in fact all of
them should have the same mode precision, 128.  This special
setting has caused some issues, like the unexpected failures
mentioned in [1], and has also made us introduce some
workarounds, such as the workaround in
build_common_tree_nodes for KFmode 126 and the workaround in
range_compatible_p for the same-mode-but-different-precision
issue.

This patch makes these three 128-bit scalar floating
point modes TFmode, IFmode and KFmode have 128-bit mode
precision, and keeps their order the same as before in order
to make machine independent parts of the compiler not try
to widen IFmode to TFmode.  Besides, build_common_tree_nodes
adopts the newly added hook mode_for_floating_type, so we
don't need to worry about an unexpected mode for the long
double type node.

Function convert_mode_scalar adopts sext_optab for
same-precision mode conversions if !DECIMAL_FLOAT_MODE_P,
so we only need to support sext_optab for any possible
conversion.  Thus this patch removes some useless trunc
optab support, supplements one new sext_optab which calls
the common handler rs6000_expand_float128_convert, and
unnames two define_insn_and_splits to avoid conflicts and
make them clearer.  Considering that in the current
implementation there is no chance to have a KF <-> IF
conversion (since either of them would be TF already), it
adds two dummy define_expands to assert this.

Bootstrapped and regtested on x86_64-redhat-linux,
powerpc64{,le}-linux-gnu (ibm128 long double default)
and powerpc64le-linux-gnu (ieee128 long double default).

Compared to v1 [2], it makes use of the new hook
mode_for_floating_type and factors out the change to the
generic part of the code into 1/3.

I'm going to push this once the generic part 1/3 gets
approved and no objection on this one.

btw, two related patches on fortran[3] and ranger[4] have been
approved.

[1] https://inbox.sourceware.org/gcc-patches/718677e7-614d-7977-312d-05a75e1fd...@linux.ibm.com/
[2] https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651017.html
[3] https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651018.html
[4] https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651019.html

BR,
Kewen
-
PR target/112993

gcc/ChangeLog:

* config/rs6000/rs6000-modes.def (IFmode, KFmode, TFmode): Define
with FLOAT_MODE instead of FRACTIONAL_FLOAT_MODE, don't use special
precisions any more.
(rs6000-modes.h): Remove include.
* config/rs6000/rs6000-modes.h: Remove.
* config/rs6000/rs6000.h (rs6000-modes.h): Remove include.
* config/rs6000/t-rs6000: Remove rs6000-modes.h include.
* config/rs6000/rs6000.cc (rs6000_option_override_internal): Replace
all uses of FLOAT_PRECISION_TFmode with 128.
(rs6000_c_mode_for_floating_type): Likewise.
* config/rs6000/rs6000.md (define_expand trunciftf2): Remove.
(define_expand truncifkf2): Remove.
(define_expand trunckftf2): Remove.
(define_expand trunctfif2): Remove.
(define_expand expandtfkf2, expandtfif2): Merge to ...
(define_expand expandtf2): ... this, new.
(define_expand expandiftf2): Merge to ...
(define_expand expandtf2): ... this, new.
(define_expand expandiftf2): Update with assert.
(define_expand expandkfif2): New.
(define_insn_and_split extendkftf2): Rename to  ...
(define_insn_and_split *extendkftf2): ... this.
(define_insn_and_split trunctfkf2): Rename to ...
(define_insn_and_split *extendtfkf2): ... this.
---
 gcc/config/rs6000/rs6000-modes.def | 31 +
 gcc/config/rs6000/rs6000-modes.h   | 36 ---
 gcc/config/rs6000/rs6000.cc|  9 +---
 gcc/config/rs6000/rs6000.h |  5 ---
 gcc/config/rs6000/rs6000.md| 72 --
 gcc/config/rs6000/t-rs6000 |  1 -
 6 files changed, 32 insertions(+), 122 deletions(-)
 delete mode 100644 gcc/config/rs6000/rs6000-modes.h

diff --git a/gcc/config/rs6000/rs6000-modes.def b/gcc/config/rs6000/rs6000-modes.def
index 094b246c834..b69593c40a6 100644
--- a/gcc/config/rs6000/rs6000-modes.def
+++ b/gcc/config/rs6000/rs6000-modes.def
@@ -18,12 +18,11 @@
along with GCC; see the file COPYING3.  If not see
.  */

-/* We order the 3 128-bit floating point types so that IFmode (IBM 128-bit
-   floating point) is the 128-bit floating point type with the highest
-   precision (128 bits).  This so that machine independent parts of the
-   compiler do not try to widen IFmode to TFmode on ISA 3.0 (power9) that has
-   hardware support for IEEE 128-bit.  We set TFmode (long double mode) in
-   between, and KFmode (explicit __float128) below it.
+/* We order the 3 128-bit floating point type modes here as KFmod

[PATCH] gcov: Cache source files

2024-07-04 Thread Jørgen Kvalsvik
Cache the source files as they are read, rather than discarding them at
the end of output_lines (), and move the reading of the source file to
the new function slurp.

This patch does not really change anything other than moving the file
reading out of output_lines, but it sets gcov up for more interaction
with the source file.  The motivating example is reporting coverage on
functions from different source files, notably C++ headers and
__attribute__((always_inline)) functions.

Here is an example of what gcov does today:

hello.h:
inline __attribute__((always_inline))
int hello (const char *s)
{
  if (s)
printf ("hello, %s!\n", s);
  else
printf ("hello, world!\n");
  return 0;
}

hello.c:
int notmain(const char *entity)
{
  return hello (entity);
}

int main()
{
  const char *empty = 0;
  if (!empty)
hello (empty);
  else
puts ("Goodbye!");
}

$ gcov -abc hello
function notmain called 0 returned 0% blocks executed 0%
#:4:int notmain(const char *entity)
%:4-block 2
branch  0 never executed (fallthrough)
branch  1 never executed
-:5:{
#:6:  return hello (entity);
%:6-block 7
-:7:}

Clearly there is a branch in notmain, but the branch comes from the
inlining of hello. This is not very obvious from looking at the output.
Here is hello.h.gcov:

-:3:inline __attribute__((always_inline))
-:4:int hello (const char *s)
-:5:{
#:6:  if (s)
%:6-block 3
branch  0 never executed (fallthrough)
branch  1 never executed
%:6-block 2
branch  2 never executed (fallthrough)
branch  3 never executed
#:7:printf ("hello, %s!\n", s);
%:7-block 4
call0 never executed
%:7-block 3
call1 never executed
-:8:  else
#:9:printf ("hello, world!\n");
%:9-block 5
call0 never executed
%:9-block 4
call1 never executed
#:   10:  return 0;
%:   10-block 6
%:   10-block 5
-:   11:}

The blocks from the different call sites have all been interleaved.

The reporting could be tuned to list the inlined function, too, like
this:

1:4:int notmain(const char *entity)
-: == inlined from hello.h ==
1:6:  if (s)
branch  0 taken 0 (fallthrough)
branch  1 taken 1
#:7:printf ("hello, %s!\n", s);
%:7-block 3
call0 never executed
-:8:  else
1:9:printf ("hello, world!\n");
1:9-block 4
call0 returned 1
1:   10:  return 0;
1:   10-block 5
-: == inlined from hello.h (end) ==
-:5:{
1:6:  return hello (entity);
1:6-block 7
-:7:}

Implementing something to this effect relies on having the sources for
both files (hello.c, hello.h) available, which is what this patch sets
up.
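
A minimal sketch of the caching idea (hypothetical names, plain C; the
real patch works with gcov's own data structures):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct source_file
{
  char *name;
  char *contents;               /* Whole file, NUL-terminated.  */
  struct source_file *next;
};

static struct source_file *cache;

/* Return the contents of NAME, reading the file at most once.  */
static const char *
slurp (const char *name)
{
  for (struct source_file *f = cache; f; f = f->next)
    if (!strcmp (f->name, name))
      return f->contents;

  FILE *fp = fopen (name, "r");
  if (!fp)
    return NULL;
  fseek (fp, 0, SEEK_END);
  long size = ftell (fp);
  rewind (fp);
  char *buf = (char *) malloc (size + 1);
  size_t n = fread (buf, 1, size, fp);
  buf[n] = '\0';
  fclose (fp);

  struct source_file *f = (struct source_file *) malloc (sizeof *f);
  f->name = strdup (name);
  f->contents = buf;
  f->next = cache;
  cache = f;
  return f->contents;
}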

Note that the previous reading code would leak the source file content,
so explicitly storing it is not a huge departure nor a performance
concern.  I verified this with valgrind:

With slurp:

$ valgrind gcov ./hello
== == Memcheck, a memory error detector
== == Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
== == Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
== == Command: ./gcc/gcov demo
== ==
File 'hello.c'
Lines executed:100.00% of 4
Creating 'hello.c.gcov'

File 'hello.h'
Lines executed:75.00% of 4
Creating 'hello.h.gcov'
== ==
== == HEAP SUMMARY:
== == in use at exit: 84,907 bytes in 54 blocks
== ==   total heap usage: 254 allocs, 200 frees, 137,156 bytes allocated
== ==
== == LEAK SUMMARY:
== ==definitely lost: 1,237 bytes in 22 blocks
== ==indirectly lost: 562 bytes in 18 blocks
== ==  possibly lost: 0 bytes in 0 blocks
== ==still reachable: 83,108 bytes in 14 blocks
== ==   of which reachable via heuristic:
== == newarray   : 1,544 bytes in 1 blocks
== == suppressed: 0 bytes in 0 blocks
== == Rerun with --leak-check=full to see details of leaked memory
== ==
== == For lists of detected and suppressed errors, rerun with: -s
== == ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Without slurp:

$ valgrind gcov ./demo
== == Memcheck, a memory error detector
== == Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
== == Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
== == Command: ./gcc/gcov demo
== ==
File 'hello.c'
Lines executed:100.00% of 4
Creating 'hello.c.gcov'

File 'hello.h'
Lines executed:75.00% of 4
Creating 'hello.h.gcov'

Lines executed:87.50% of 8
== ==
== == HEAP SUMMARY:
== == in use at exit: 85,316 bytes in 82 blocks
== ==   total heap usage: 250 allocs, 168 frees, 137,084 bytes allocated
== ==
== == LEAK SUMMARY:
== ==definitely lost: 1,646 bytes in 50 blocks
== ==indirectly lost: 562 bytes in 18 blocks
== ==  possibly lost: 0 bytes in 0 blocks
==

[PATCH 3/3] tree: Remove KFmode workaround [PR112993]

2024-07-04 Thread Kewen.Lin
Hi,

The fix for PR112993 will make KFmode have 128-bit mode precision,
so we won't need this workaround to fix up the type precision any
more and can just go with the mode precision.  This patch removes
the KFmode workaround.

Bootstrapped and regtested on x86_64-redhat-linux,
powerpc64{,le}-linux-gnu (ibm128 long double default)
and powerpc64le-linux-gnu (ieee128 long double default).

Is it OK for trunk if {1,2}/3 in this series get landed?

BR,
Kewen
-

PR target/112993

gcc/ChangeLog:

* tree.cc (build_common_tree_nodes): Drop the workaround for rs6000
KFmode precision adjustment.
---
 gcc/tree.cc | 9 -
 1 file changed, 9 deletions(-)

diff --git a/gcc/tree.cc b/gcc/tree.cc
index f801712c9dd..f730981ec8b 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -9575,15 +9575,6 @@ build_common_tree_nodes (bool signed_char)
   if (!targetm.floatn_mode (n, extended).exists (&mode))
continue;
   int precision = GET_MODE_PRECISION (mode);
-  /* Work around the rs6000 KFmode having precision 113 not
-128.  */
-  const struct real_format *fmt = REAL_MODE_FORMAT (mode);
-  gcc_assert (fmt->b == 2 && fmt->emin + fmt->emax == 3);
-  int min_precision = fmt->p + ceil_log2 (fmt->emax - fmt->emin);
-  if (!extended)
-   gcc_assert (min_precision == n);
-  if (precision < min_precision)
-   precision = min_precision;
   FLOATN_NX_TYPE_NODE (i) = make_node (REAL_TYPE);
   TYPE_PRECISION (FLOATN_NX_TYPE_NODE (i)) = precision;
   layout_type (FLOATN_NX_TYPE_NODE (i));
--
2.39.1


Re: [PATCH 1/1] ada: Make the names of uninstalled cross-gnattools consistent across builds

2024-07-04 Thread Arnaud Charlet
The change is OK, thanks.

> We suffer from an inconsistency in the names of uninstalled gnattools 
> executables in cross-compiler configurations.  The cause is a recipe we 
> have:
> 
> ada.all.cross:
>   for tool in $(ADA_TOOLS) ; do \
> if [ -f $$tool$(exeext) ] ; \
> then \
>   $(MV) $$tool$(exeext) $$tool-cross$(exeext); \
> fi; \
>   done
> 
> the intent of which is to give the names of gnattools executables the 
> '-cross' suffix, consistently with the compiler drivers: 'gcc-cross', 
> 'g++-cross', etc.
> 
> A problem with the recipe is that this 'make' target is called too early 
> in the build process, before gnattools have been made.  Consequently no 
> renames happen and, since they are conditional on the presence of the 
> individual executables, the recipe succeeds doing nothing.
> 
> However if a target is requested later on such as 'make pdf' that does 
> not cause gnattools executables to be rebuilt, then 'ada.all.cross' does 
> succeed in renaming the executables already present in the build tree.  
> Then if the 'gnat' testsuite is run later on which expects non-suffixed 
> 'gnatmake' executable, it does not find the 'gnatmake-cross' executable 
> in the build tree and may either catastrophically fail or incorrectly 
> use a system-installed copy of 'gnatmake'.
> 
> Of course if a target is requested such as `make all' that does cause 
> gnattools executables to be rebuilt, then both suffixed and non-suffixed 
> uninstalled executables result.
> 
> Fix the problem by moving the renaming of gnattools to a separate 'make' 
> recipe, pasted into a new 'gnattools-cross-mv' target and the existing 
> legacy 'cross-gnattools' target.  Then invoke the new target explicitly 
> from the 'gnattools-cross' recipe in gnattools/.
> 
> Update the test harness accordingly, so that suffixed gnattools are used 
> in cross-compilation testsuite runs.
> 
>   gcc/
>   * ada/gcc-interface/Make-lang.in (ada.all.cross): Move recipe 
>   to...
>   (GNATTOOLS_CROSS_MV): ... this new variable.
>   (cross-gnattools): Paste it here.
>   (gnattools-cross-mv): New target.
> 
>   gnattools/
>   * Makefile.in (gnattools-cross): Also build 'gnattools-cross-mv' 
>   in GCC_DIR.
> 
>   gcc/testsuite/
>   * lib/gnat.exp (local_find_gnatmake, find_gnatclean): Use 
>   '-cross' suffix where testing a cross-compiler.
> ---
>  gcc/ada/gcc-interface/Make-lang.in |   19 ---
>  gcc/testsuite/lib/gnat.exp |   22 ++
>  gnattools/Makefile.in  |1 +
>  3 files changed, 31 insertions(+), 11 deletions(-)
> 
> gcc-ada-all-cross-gnattools.diff
> Index: gcc/gcc/ada/gcc-interface/Make-lang.in
> ===
> --- gcc.orig/gcc/ada/gcc-interface/Make-lang.in
> +++ gcc/gcc/ada/gcc-interface/Make-lang.in
> @@ -780,6 +780,7 @@ gnattools: $(GCC_PARTS) $(CONFIG_H) pref
>  cross-gnattools: force
>   $(MAKE) -C ada $(ADA_TOOLS_FLAGS_TO_PASS) gnattools1-re
>   $(MAKE) -C ada $(ADA_TOOLS_FLAGS_TO_PASS) gnattools2
> + $(GNATTOOLS_CROSS_MV)
>  
>  canadian-gnattools: force
>   $(MAKE) -C ada $(ADA_TOOLS_FLAGS_TO_PASS) gnattools1-re
> @@ -795,19 +796,23 @@ gnatlib gnatlib-sjlj gnatlib-zcx gnatlib
>  FORCE_DEBUG_ADAFLAGS="$(FORCE_DEBUG_ADAFLAGS)" \
>  $@
>  
> +gnattools-cross-mv:
> + $(GNATTOOLS_CROSS_MV)
> +
> +GNATTOOLS_CROSS_MV=\
> +  for tool in $(ADA_TOOLS) ; do \
> +if [ -f $$tool$(exeext) ] ; \
> +then \
> +  $(MV) $$tool$(exeext) $$tool-cross$(exeext); \
> +fi; \
> +  done
> +
>  # use only for native compiler
>  gnatlib_and_tools: gnatlib gnattools
>  
>  # Build hooks:
>  
>  ada.all.cross:
> - for tool in $(ADA_TOOLS) ; do \
> -   if [ -f $$tool$(exeext) ] ; \
> -   then \
> - $(MV) $$tool$(exeext) $$tool-cross$(exeext); \
> -   fi; \
> - done
> -
>  ada.start.encap:
>  ada.rest.encap:
>  ada.man:
> Index: gcc/gcc/testsuite/lib/gnat.exp
> ===
> --- gcc.orig/gcc/testsuite/lib/gnat.exp
> +++ gcc/gcc/testsuite/lib/gnat.exp
> @@ -199,12 +199,19 @@ proc prune_gnat_output { text } {
>  # which prevent multilib from working, so define a new one.
>  
>  proc local_find_gnatmake {} {
> +global target_triplet
>  global tool_root_dir
> +global host_triplet
>  
>  if ![is_remote host] {
> -set file [lookfor_file $tool_root_dir gnatmake]
> + if { "$host_triplet" == "$target_triplet" } {
> + set gnatmake gnatmake
> + } else {
> + set gnatmake gnatmake-cross
> + }
> + set file [lookfor_file $tool_root_dir $gnatmake]
>  if { $file == "" } {
> - set file [lookfor_file $tool_root_dir gcc/gnatmake]
> + set file [lookfor_file $tool_root_dir gcc/$gnatmake]
>  }
>  if { $file != "" } {
>   set root [file dirname $fil

Re: [patch,avr] PR87376: Disable -ftree-ter

2024-07-04 Thread Richard Biener
On Wed, Jul 3, 2024 at 9:26 PM Georg-Johann Lay  wrote:
>
>
>
> Am 02.07.24 um 15:48 schrieb Richard Biener:
> > On Tue, Jul 2, 2024 at 3:43 PM Georg-Johann Lay  wrote:
> >>
> >> Hi Jeff,
> >>
> >> This is a patch to get correct code out of 64-bit
> >> loads from address-space __memx.
> >>
> >> The AVR address-spaces may require that move insns issue
> >> calls to library support functions, a fact that -ftree-ter
> >> doesn't account for.  tree-ssa-ter.cc then replaces an
> >> expression across such a library call, resulting in wrong code.
> >>
> >> This patch disables that pass by default on avr, as there is no
> >> finer-grained way to avoid the invalid optimizations.
> >> The pass can still be re-enabled by means of an explicit -ftree-ter.
> >>
> >> Ok to apply?
> >
> > I think this requires more details on what goes wrong - I assume
> > it's not stmt reordering that effectively happens but recursive
> > expand_expr on SSA defs when those invoke libcalls?  In that
> > case this would point to a deeper issue.
>
> The difference is that with TER, we get a hard reg in .expand
> for a movdi from 24-bit address-space __memx.
>
> Such moves require library calls, which in turn require
> specific hard registers.  As avr backend has no movdi, the
> moddi gets expanded as 8 * movqi, and that does not work
> when the target registers are hard regs, as some of them
> are clobbered by the libcalls.
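
For context, a reduced shape of the failing case (a sketch; PR87376 has
the actual testcase):

  /* avr; __memx is the 24-bit named address space.  Reading the 64-bit
     object expands to 8 movqi loads, each a libcall clobbering r21, r22
     and r30, as the RTL dump below shows.  */
  extern const __memx unsigned long long aa;

  unsigned long long
  get (void)
  {
    return aa;
  }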

So I see

(insn 18 17 19 2 (parallel [
(set (reg:QI 22 r22 [+4 ])
(mem/u/c:QI (reg/f:PSI 55) [1 aa+4 S1 A8 AS7]))
(clobber (reg:QI 22 r22))
(clobber (reg:QI 21 r21))
(clobber (reg:HI 30 r30))
]) "t.c":12:13 38 {xloadqi_A}
 (nil))
(insn 19 18 20 2 (set (reg:PSI 56)
(reg/f:PSI 47)) "t.c":12:13 112 {*movpsi_split}
 (nil))
(insn 20 19 21 2 (parallel [
(set (reg/f:PSI 57)
(plus:PSI (reg/f:PSI 47)
(const_int 5 [0x5])))
(clobber (scratch:QI))
]) "t.c":12:13 205 {addpsi3}
 (nil))
(insn 21 20 22 2 (parallel [
(set (reg:QI 23 r23 [+5 ])
(mem/u/c:QI (reg/f:PSI 57) [1 aa+5 S1 A8 AS7]))
(clobber (reg:QI 22 r22))
(clobber (reg:QI 21 r21))
(clobber (reg:HI 30 r30))
]) "t.c":12:13 38 {xloadqi_A}
 (nil))

for example - insn 21 clobbers r22 which is also the destination of insn 18.

With -fno-tree-ter those oddly get _no_ intermediate reg but we have

(insn 9 8 10 2 (parallel [
(set (subreg:QI (reg:DI 43 [ aa.0_1 ]) 1)
(mem/u/c:QI (reg/f:PSI 48) [1 aa+1 S1 A8 AS7]))
(clobber (reg:QI 22 r22))
(clobber (reg:QI 21 r21))
(clobber (reg:HI 30 r30))
]) "t.c":12:13 38 {xloadqi_A}
 (nil))

but since on GIMPLE we have DImode loads I don't see how TER comes into
play here - TER should favor the second code generation, not the first ...
(or TER shouldn't play into here at all).

with -fno-tree-ter we come via

#0  expand_expr_real (exp=, target=0x7716c9a8,
tmode=E_DImode, modifier=EXPAND_NORMAL, alt_rtl=0x7fffcff8,
inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:9433
#1  0x0109fe63 in store_expr (exp=,
target=0x7716c9a8, call_param_p=0, nontemporal=false, reverse=false)
at /space/rguenther/src/gcc/gcc/expr.cc:6740
#2  0x0109e626 in expand_assignment (to=,
from=, nontemporal=false)
at /space/rguenther/src/gcc/gcc/expr.cc:6461

while with TER we instead have

#0  expand_expr_real (exp=, target=0x0,
tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:9433
#1  0x010b279f in expand_expr_real_gassign (g=0x771613c0,
target=0x0, tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:11100
#2  0x010b3294 in expand_expr_real_1 (exp=,
target=0x0, tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:11278

the difference is -fno-tree-ter has a target (reg:DI 43 [...]) but with TER we
are not passing a target or a mode.

I think the issue is that the expansion at some point doesn't expect
the result to end up in
a hard register.  Maybe define_expand are not supposed to do that but maybe
expansion needs to fix up.

A first thought was

diff --git a/gcc/expr.cc b/gcc/expr.cc
index ffbac513692..1509acad02e 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -1,6 +1,12 @@ expand_expr_real_gassign (gassign *g, rtx target, machine_mode tmode,
   gcc_unreachable ();
 }
   set_curr_insn_location (saved_loc);
+  if (!target && REG_P (r) && REGNO (r) < FIRST_PSEUDO_REGISTER)
+{
+  rtx tem = gen_reg_rtx (GET_MODE (r));
+  emit_move_insn (tem, r);
+  r = tem;
+}
   if (REG_P (r) && !REG_EXPR (r))
 set_reg_attrs_for_decl_rtl (lh

Re: [PATCH v3] ARM: thumb1: Use LDMIA/STMIA for DI/DF loads/stores

2024-07-04 Thread Richard Earnshaw (lists)
On 20/06/2024 08:24, Siarhei Volkau wrote:
> If the address register is dead after load/store operation it looks
> beneficial to use LDMIA/STMIA instead of pair of LDR/STR instructions,
> at least if optimizing for size.
> 
> Changes v2 -> v3:
>  - switching to mixed approach (insn+peep2)
>  - keep memory attributes in peepholes
>  - handle stmia corner cases
> 
> Changes v1 -> v2:
>  - switching to peephole2 approach
>  - added test case
> 
> gcc/ChangeLog:
> 
> * config/arm/arm.cc (thumb_load_double_from_address): Emit ldmia
> when address reg rewritten by load.
> 
> * config/arm/thumb1.md (peephole2 to rewrite DI/DF load): New.
> (peephole2 to rewrite DI/DF store): New.
> 
>   * config/arm/iterators.md (DIDF): New.
> 
> gcc/testsuite:
> 
> * gcc.target/arm/thumb1-load-store-64bit.c: Add new test.

I made a couple of cleanups and pushed this.  My testing of the cleanup also 
identified another corner case for the ldm instruction: if the result of the 
load is not used (but it can't be eliminated because the address is marked 
volatile), then we could end up with
ldm r0!, {r0, r1}
which of course is unpredictable.  So we need to test not only that r0 is 
dead but also that it isn't written by the load either.
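
Concretely, the ldm peephole's condition needs something like the following
in addition to the dead-base test (a sketch; the actual fix may phrase it
differently):

   && !reg_overlap_mentioned_p (XEXP (operands[1], 0), operands[0])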

Anyway, thanks for the patch.

R.
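
For reference, the transformation the peepholes in the patch below aim for
is roughly (a sketch; Thumb-1, base register dead after the access):

    ldr   r0, [r3]        @ two 16-bit instructions ...
    ldr   r1, [r3, #4]

    ldmia r3!, {r0, r1}   @ ... become this one, when r3 is dead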

> 
> Signed-off-by: Siarhei Volkau 
> ---
>  gcc/config/arm/arm.cc |  8 ++--
>  gcc/config/arm/iterators.md   |  3 ++
>  gcc/config/arm/thumb1.md  | 37 ++-
>  .../gcc.target/arm/thumb1-load-store-64bit.c  | 16 
>  4 files changed, 58 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/arm/thumb1-load-store-64bit.c
> 
> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
> index b8c32db0a1d..2734ab3bce1 100644
> --- a/gcc/config/arm/arm.cc
> +++ b/gcc/config/arm/arm.cc
> @@ -28368,15 +28368,13 @@ thumb_load_double_from_address (rtx *operands)
>switch (GET_CODE (addr))
>  {
>  case REG:
> -  operands[2] = adjust_address (operands[1], SImode, 4);
> -
> -  if (REGNO (operands[0]) == REGNO (addr))
> +  if (reg_overlap_mentioned_p (addr, operands[0]))
>   {
> -   output_asm_insn ("ldr\t%H0, %2", operands);
> -   output_asm_insn ("ldr\t%0, %1", operands);
> +   output_asm_insn ("ldmia\t%m1, {%0, %H0}", operands);
>   }
>else
>   {
> +   operands[2] = adjust_address (operands[1], SImode, 4);
> output_asm_insn ("ldr\t%0, %1", operands);
> output_asm_insn ("ldr\t%H0, %2", operands);
>   }
> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
> index 8d066fcf05d..09046bff83b 100644
> --- a/gcc/config/arm/iterators.md
> +++ b/gcc/config/arm/iterators.md
> @@ -50,6 +50,9 @@ (define_mode_iterator QHSD [QI HI SI DI])
>  ;; A list of the 32bit and 64bit integer modes
>  (define_mode_iterator SIDI [SI DI])
>  
> +;; A list of the 64bit modes for thumb1.
> +(define_mode_iterator DIDF [DI DF])
> +
>  ;; A list of atomic compare and swap success return modes
>  (define_mode_iterator CCSI [(CC_Z "TARGET_32BIT") (SI "TARGET_THUMB1")])
>  
> diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md
> index d7074b43f60..c9a4201fcc9 100644
> --- a/gcc/config/arm/thumb1.md
> +++ b/gcc/config/arm/thumb1.md
> @@ -2055,4 +2055,39 @@ (define_insn "thumb1_stack_protect_test_insn"
> (set_attr "conds" "clob")
> (set_attr "type" "multiple")]
>  )
> -
> +
> +;; match patterns usable by ldmia/stmia
> +(define_peephole2
> +  [(set (match_operand:DIDF 0 "low_register_operand" "")
> + (match_operand:DIDF 1 "memory_operand" ""))]
> +  "TARGET_THUMB1
> +   && REG_P (XEXP (operands[1], 0))
> +   && REGNO_REG_CLASS (REGNO (XEXP (operands[1], 0))) == LO_REGS
> +   && peep2_reg_dead_p (1, XEXP (operands[1], 0))"
> +  [(set (match_dup 0)
> + (match_dup 1))]
> +  "
> +  {
> +operands[1] = change_address (operands[1], VOIDmode,
> +   gen_rtx_POST_INC (SImode,
> + XEXP (operands[1], 0)));
> +  }"
> +)
> +
> +(define_peephole2
> +  [(set (match_operand:DIDF 1 "memory_operand" "")
> + (match_operand:DIDF 0 "low_register_operand" ""))]
> +  "TARGET_THUMB1
> +   && REG_P (XEXP (operands[1], 0))
> +   && REGNO_REG_CLASS (REGNO (XEXP (operands[1], 0))) == LO_REGS
> +   && peep2_reg_dead_p (1, XEXP (operands[1], 0))
> +   && REGNO (XEXP (operands[1], 0)) != (REGNO (operands[0]) + 1)"
> +  [(set (match_dup 1)
> + (match_dup 0))]
> +  "
> +  {
> +operands[1] = change_address (operands[1], VOIDmode,
> +   gen_rtx_POST_INC (SImode,
> + XEXP (operands[1], 0)));
> +  }"
> +)
> diff --git a/gcc/testsuite/gcc.target/arm/thumb1-load-store-64bit.c b/gcc/testsuite/gcc.target/arm/thumb1-load-store-64bit.c
> new file mode 100644
> index 000..167fa9ec876
> --- /dev/null
> +++ b


[PATCH 1/2] LoongArch: TFmode is not allowed to be stored in the float register.

2024-07-04 Thread Lulu Cheng
PR target/115752

gcc/ChangeLog:

* config/loongarch/loongarch.cc
(loongarch_hard_regno_mode_ok_uncached): Replace
UNITS_PER_FPVALUE with UNITS_PER_HWFPVALUE.
* config/loongarch/loongarch.h (UNITS_PER_FPVALUE): Delete.

gcc/testsuite/ChangeLog:

* gcc.target/loongarch/pr115752.c: New test.
---
 gcc/config/loongarch/loongarch.cc | 2 +-
 gcc/config/loongarch/loongarch.h  | 7 ---
 gcc/testsuite/gcc.target/loongarch/pr115752.c | 8 
 3 files changed, 9 insertions(+), 8 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/loongarch/pr115752.c

diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc
index e2ff2af89e2..803ed0575bd 100644
--- a/gcc/config/loongarch/loongarch.cc
+++ b/gcc/config/loongarch/loongarch.cc
@@ -6705,7 +6705,7 @@ loongarch_hard_regno_mode_ok_uncached (unsigned int regno, machine_mode mode)
   if (mclass == MODE_FLOAT
  || mclass == MODE_COMPLEX_FLOAT
  || mclass == MODE_VECTOR_FLOAT)
-   return size <= UNITS_PER_FPVALUE;
+   return size <= UNITS_PER_HWFPVALUE;
 
   /* Allow integer modes that fit into a single register.  We need
 to put integers into FPRs when using instructions like CVT
diff --git a/gcc/config/loongarch/loongarch.h b/gcc/config/loongarch/loongarch.h
index b9323aba394..5efeae53be6 100644
--- a/gcc/config/loongarch/loongarch.h
+++ b/gcc/config/loongarch/loongarch.h
@@ -146,13 +146,6 @@ along with GCC; see the file COPYING3.  If not see
 #define UNITS_PER_HWFPVALUE \
   (TARGET_SOFT_FLOAT ? 0 : UNITS_PER_FP_REG)
 
-/* The largest size of value that can be held in floating-point
-   registers.  */
-#define UNITS_PER_FPVALUE \
-  (TARGET_SOFT_FLOAT ? 0 \
-   : TARGET_SINGLE_FLOAT ? UNITS_PER_FP_REG \
-: LA_LONG_DOUBLE_TYPE_SIZE / BITS_PER_UNIT)
-
 /* The number of bytes in a double.  */
 #define UNITS_PER_DOUBLE (TYPE_PRECISION (double_type_node) / BITS_PER_UNIT)
 
diff --git a/gcc/testsuite/gcc.target/loongarch/pr115752.c b/gcc/testsuite/gcc.target/loongarch/pr115752.c
new file mode 100644
index 000..df4bae524f7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/loongarch/pr115752.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+
+long double
+test (long double xx)
+{
+   __asm ("" :: "f"(xx)); /* { dg-error "inconsistent operand constraints in 
an 'asm'" } */
+   return xx + 1;
+}
-- 
2.39.3



[PATCH 2/2] LoongArch: Remove unreachable codes.

2024-07-04 Thread Lulu Cheng
gcc/ChangeLog:

* config/loongarch/loongarch.cc
(loongarch_split_move): Delete.
(loongarch_hard_regno_mode_ok_uncached): Likewise.
* config/loongarch/loongarch.md
(move_doubleword_fpr): Likewise.
(load_low): Likewise.
(load_high): Likewise.
(store_word): Likewise.
(movgr2frh): Likewise.
(movfrh2gr): Likewise.
---
 gcc/config/loongarch/loongarch.cc |  47 +++--
 gcc/config/loongarch/loongarch.md | 109 --
 2 files changed, 8 insertions(+), 148 deletions(-)

diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc
index 803ed0575bd..ebd418ab115 100644
--- a/gcc/config/loongarch/loongarch.cc
+++ b/gcc/config/loongarch/loongarch.cc
@@ -4382,42 +4382,13 @@ loongarch_split_move_p (rtx dest, rtx src)
 void
 loongarch_split_move (rtx dest, rtx src)
 {
-  rtx low_dest;
-
   gcc_checking_assert (loongarch_split_move_p (dest, src));
   if (LSX_SUPPORTED_MODE_P (GET_MODE (dest)))
 loongarch_split_128bit_move (dest, src);
   else if (LASX_SUPPORTED_MODE_P (GET_MODE (dest)))
 loongarch_split_256bit_move (dest, src);
-  else if (FP_REG_RTX_P (dest) || FP_REG_RTX_P (src))
-{
-  if (!TARGET_64BIT && GET_MODE (dest) == DImode)
-   emit_insn (gen_move_doubleword_fprdi (dest, src));
-  else if (!TARGET_64BIT && GET_MODE (dest) == DFmode)
-   emit_insn (gen_move_doubleword_fprdf (dest, src));
-  else if (TARGET_64BIT && GET_MODE (dest) == TFmode)
-   emit_insn (gen_move_doubleword_fprtf (dest, src));
-  else
-   gcc_unreachable ();
-}
   else
-{
-  /* The operation can be split into two normal moves.  Decide in
-which order to do them.  */
-  low_dest = loongarch_subword (dest, false);
-  if (REG_P (low_dest) && reg_overlap_mentioned_p (low_dest, src))
-   {
- loongarch_emit_move (loongarch_subword (dest, true),
-  loongarch_subword (src, true));
- loongarch_emit_move (low_dest, loongarch_subword (src, false));
-   }
-  else
-   {
- loongarch_emit_move (low_dest, loongarch_subword (src, false));
- loongarch_emit_move (loongarch_subword (dest, true),
-  loongarch_subword (src, true));
-   }
-}
+gcc_unreachable ();
 }
 
 /* Check if adding an integer constant value for a specific mode can be
@@ -6688,20 +6659,18 @@ loongarch_hard_regno_mode_ok_uncached (unsigned int regno, machine_mode mode)
   size = GET_MODE_SIZE (mode);
   mclass = GET_MODE_CLASS (mode);
 
-  if (GP_REG_P (regno) && !LSX_SUPPORTED_MODE_P (mode)
+  if (GP_REG_P (regno)
+  && !LSX_SUPPORTED_MODE_P (mode)
   && !LASX_SUPPORTED_MODE_P (mode))
 return ((regno - GP_REG_FIRST) & 1) == 0 || size <= UNITS_PER_WORD;
 
-  /* For LSX, allow TImode and 128-bit vector modes in all FPR.  */
-  if (FP_REG_P (regno) && LSX_SUPPORTED_MODE_P (mode))
-return true;
-
-  /* FIXED ME: For LASX, allow TImode and 256-bit vector modes in all FPR.  */
-  if (FP_REG_P (regno) && LASX_SUPPORTED_MODE_P (mode))
-return true;
-
   if (FP_REG_P (regno))
 {
+  /* Allow 128-bit or 256-bit vector modes in all FPR.  */
+  if (LSX_SUPPORTED_MODE_P (mode)
+ || LASX_SUPPORTED_MODE_P (mode))
+   return true;
+
   if (mclass == MODE_FLOAT
  || mclass == MODE_COMPLEX_FLOAT
  || mclass == MODE_VECTOR_FLOAT)
diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
index 25c1d323ba0..21890a2d94b 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -400,9 +400,6 @@ (define_mode_iterator X [(SI "!TARGET_64BIT") (DI "TARGET_64BIT")])
 ;; 64-bit modes for which we provide move patterns.
 (define_mode_iterator MOVE64 [DI DF])
 
-;; 128-bit modes for which we provide move patterns on 64-bit targets.
-(define_mode_iterator MOVE128 [TI TF])
-
 ;; Iterator for sub-32-bit integer modes.
 (define_mode_iterator SHORT [QI HI])
 
@@ -421,12 +418,6 @@ (define_mode_iterator ANYF [(SF "TARGET_HARD_FLOAT")
 (define_mode_iterator ANYFI [(SI "TARGET_HARD_FLOAT")
 (DI "TARGET_DOUBLE_FLOAT")])
 
-;; A mode for which moves involving FPRs may need to be split.
-(define_mode_iterator SPLITF
-  [(DF "!TARGET_64BIT && TARGET_DOUBLE_FLOAT")
-   (DI "!TARGET_64BIT && TARGET_DOUBLE_FLOAT")
-   (TF "TARGET_64BIT && TARGET_DOUBLE_FLOAT")])
-
 ;; A mode for anything with 32 bits or more, and able to be loaded with
 ;; the same addressing mode as ld.w.
 (define_mode_iterator LD_AT_LEAST_32_BIT [GPR ANYF])
@@ -2421,41 +2412,6 @@ (define_insn "*movdf_softfloat"
   [(set_attr "move_type" "move,load,store")
(set_attr "mode" "DF")])
 
-;; Emit a doubleword move in which exactly one of the operands is
-;; a floating-point register.  We can't just emit two normal moves
-;; because of the constraints imposed by the FPU register model;
-;; see loongarch_can_change_

[PATCH][committed][testsuite]: Update test for PR115537 to use SVE.

2024-07-04 Thread Tamar Christina
Hi All,

The PR was about SVE codegen; the testcase accidentally used neoverse-n1
instead of neoverse-v1, which the original report used.

This updates the tool options.

Regtested on aarch64-none-linux-gnu and no issues.

Committed under the obvious rule.

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR tree-optimization/115537
* gcc.dg/vect/pr115537.c: Update flag from neoverse-n1 to neoverse-v1.

---
diff --git a/gcc/testsuite/gcc.dg/vect/pr115537.c b/gcc/testsuite/gcc.dg/vect/pr115537.c
index 99ed467feb884b745e1d499339fd6ef58aa0e6d2..9f7347a5f2adf03dadf428cc1cfab46c3f930466 100644
--- a/gcc/testsuite/gcc.dg/vect/pr115537.c
+++ b/gcc/testsuite/gcc.dg/vect/pr115537.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-additional-options "-mcpu=neoverse-n1" { target aarch64*-*-* } } */
+/* { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } } */
 
 char *a;
 int b;




[pushed] wwwdocs: gcc-12: Tweak RISC-V default ISA announcement

2024-07-04 Thread Gerald Pfeifer
"bump" instead of "bumped" triggered by attention, and while I was there 
already I tweaked the whole entry.

Pushed.

Gerald
---
 htdocs/gcc-12/changes.html | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/htdocs/gcc-12/changes.html b/htdocs/gcc-12/changes.html
index bf332c86..b97dc376 100644
--- a/htdocs/gcc-12/changes.html
+++ b/htdocs/gcc-12/changes.html
@@ -899,9 +899,9 @@ function Multiply (S1, S2 : Sign) return Sign is
 
 RISC-V
 
-<li>Default ISA spec version was bump to 20191213, more detail see this <a
+<li>The default ISA spec version was bumped to 20191213; see <a
 href="https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/aE1ZeHHCYf4">
-announcement</a>.</li>
+this announcement</a> for details.</li>
 New ISA extension support for zba, zbb, zbc, zbs was added.
 New ISA extension support for vector and scalar crypto was added, only
support architecture testing macro and -march= 
parsing.
-- 
2.45.2


[PATCH] RISC-V: Support group size of three in SLP store permute lowering

2024-07-04 Thread Richard Biener
The following implements the group-size three scheme from
vect_permute_store_chain in SLP grouped store permute lowering
and extends it to power-of-two multiples of group size three.

The scheme goes from vectors A, B and C to
{ A[0], B[0], C[0], A[1], B[1], C[1], ... } by first producing
{ A[0], B[0], X, A[1], B[1], X, ... } (with X random but chosen
to be A[n]) and then permuting in C[n] in the appropriate places.

The extension replaces single vector elements with a
power-of-two number of lanes, so you get pairwise interleaving
until the final three-input permutes happen.

The last permute step could be seen as extending C to { C[0], C[0],
C[0], ... } and then performing a blend.
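
As a concrete sketch of the scheme (the function name and index
choices here are illustrative, not GCC's internal masks): for V4SI
inputs a, b and c, the first of the three interleaved output vectors
can be built with two two-input permutes, taking the do-not-care lane
from a:

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  interleave3_out0 (v4si a, v4si b, v4si c)
  {
    /* Step 1: { A0, B0, X, A1 }, with the do-not-care X taken from a.  */
    v4si ab = __builtin_shufflevector (a, b, 0, 4, 0, 1);
    /* Step 2: blend C0 into the do-not-care slot: { A0, B0, C0, A1 }.  */
    return __builtin_shufflevector (ab, c, 0, 1, 4, 3);
  }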

VLA archs will want to use store-lanes here I guess; I'm not sure
whether the three-vector interleave operation is also available with
a register source and destination and thus usable for a shuffle.

* tree-vect-slp.cc (vect_build_slp_instance): Special case
three input permute with the same number of lanes in store
permute lowering.

* gcc.dg/vect/slp-53.c: New testcase.
* gcc.dg/vect/slp-54.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-53.c | 15 +++
 gcc/testsuite/gcc.dg/vect/slp-54.c | 18 +
 gcc/tree-vect-slp.cc   | 65 +-
 3 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-53.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-54.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-53.c b/gcc/testsuite/gcc.dg/vect/slp-53.c
new file mode 100644
index 000..d8cd5f85b3c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-53.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  x = __builtin_assume_aligned (x, __BIGGEST_ALIGNMENT__);
+  y = __builtin_assume_aligned (y, __BIGGEST_ALIGNMENT__);
+  for (int i = 0; i < 1024; ++i)
+{
+  x[3*i+0] = y[2*i+0] * 7 + 5;
+  x[3*i+1] = y[2*i+1] * 2;
+  x[3*i+2] = y[2*i+0] + 3;
+}
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } xfail vect_load_lanes } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-54.c b/gcc/testsuite/gcc.dg/vect/slp-54.c
new file mode 100644
index 000..ab66b349d1f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-54.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  x = __builtin_assume_aligned (x, __BIGGEST_ALIGNMENT__);
+  y = __builtin_assume_aligned (y, __BIGGEST_ALIGNMENT__);
+  for (int i = 0; i < 1024; ++i)
+{
+  x[6*i+0] = y[4*i+0] * 7 + 5;
+  x[6*i+1] = y[4*i+1] * 2;
+  x[6*i+2] = y[4*i+2] + 3;
+  x[6*i+3] = y[4*i+3] * 7 + 5;
+  x[6*i+4] = y[4*i+0] * 2;
+  x[6*i+5] = y[4*i+3] + 3;
+}
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } xfail riscv*-*-* } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 170d0cf7fa1..2dc6d365303 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3714,6 +3714,69 @@ vect_build_slp_instance (vec_info *vinfo,
 when the number of lanes is even.  */
  while (SLP_TREE_CHILDREN (perm).length () > 2)
{
+ /* When we have three equal sized groups left the pairwise
+reduction does not result in a scheme that avoids using
+three vectors.  Instead merge the first two groups
+to the final size with do-not-care elements (chosen
+from the first group) and then merge with the third.
+  { A0, B0,  x, A1, B1,  x, ... }
+   -> { A0, B0, C0, A1, B1, C1, ... }
+This handles group size of three (and at least
+power-of-two multiples of that).  */
+ if (SLP_TREE_CHILDREN (perm).length () == 3
+ && (SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[0])
+ == SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[1]))
+ && (SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[0])
+ == SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[2])))
+   {
+ int ai = 0;
+ int bi = 1;
+ slp_tree a = SLP_TREE_CHILDREN (perm)[ai];
+ slp_tree b = SLP_TREE_CHILDREN (perm)[bi];
+ unsigned n = SLP_TREE_LANES (perm);
+
+ slp_tree permab
+   = vect_create_new_slp_node (2, VEC_PERM_EXPR);
+ SLP_TREE_LANES (permab) = n;
+ SLP_TREE_LANE_PERMUTATION (permab).create (n);
+ SLP_TREE_VECTYPE (permab) = SLP_TREE_VECTYPE (perm);
+ /* ???  Should be NULL but that's not expected.  */
+ SLP_TREE_REPRESENTATIVE (permab)
+  

[PATCH] RISC-V: Support group size of three in SLP store permute lowering

2024-07-04 Thread Richard Biener
The following implements the group-size three scheme from
vect_permute_store_chain in SLP grouped store permute lowering
and extends it to power-of-two multiples of group size three.

The scheme goes from vectors A, B and C to
{ A[0], B[0], C[0], A[1], B[1], C[1], ... } by first producing
{ A[0], B[0], X, A[1], B[1], X, ... } (with X random but chosen
to be A[n]) and then permuting in C[n] in the appropriate places.

The extension replaces single vector elements with a
power-of-two number of lanes, so you get pairwise interleaving
until the final three-input permutes happen.

The last permute step could be seen as extending C to { C[0], C[0],
C[0], ... } and then performing a blend.

VLA archs will want to use store-lanes here I guess; I'm not sure
whether the three-vector interleave operation is also available with
a register source and destination and thus usable for a shuffle.

* tree-vect-slp.cc (vect_build_slp_instance): Special case
three input permute with the same number of lanes in store
permute lowering.

* gcc.dg/vect/slp-53.c: New testcase.
* gcc.dg/vect/slp-54.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-53.c | 15 +++
 gcc/testsuite/gcc.dg/vect/slp-54.c | 18 +
 gcc/tree-vect-slp.cc   | 65 +-
 3 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-53.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-54.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-53.c b/gcc/testsuite/gcc.dg/vect/slp-53.c
new file mode 100644
index 000..d8cd5f85b3c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-53.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  x = __builtin_assume_aligned (x, __BIGGEST_ALIGNMENT__);
+  y = __builtin_assume_aligned (y, __BIGGEST_ALIGNMENT__);
+  for (int i = 0; i < 1024; ++i)
+{
+  x[3*i+0] = y[2*i+0] * 7 + 5;
+  x[3*i+1] = y[2*i+1] * 2;
+  x[3*i+2] = y[2*i+0] + 3;
+}
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } xfail vect_load_lanes } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-54.c b/gcc/testsuite/gcc.dg/vect/slp-54.c
new file mode 100644
index 000..ab66b349d1f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-54.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  x = __builtin_assume_aligned (x, __BIGGEST_ALIGNMENT__);
+  y = __builtin_assume_aligned (y, __BIGGEST_ALIGNMENT__);
+  for (int i = 0; i < 1024; ++i)
+{
+  x[6*i+0] = y[4*i+0] * 7 + 5;
+  x[6*i+1] = y[4*i+1] * 2;
+  x[6*i+2] = y[4*i+2] + 3;
+  x[6*i+3] = y[4*i+3] * 7 + 5;
+  x[6*i+4] = y[4*i+0] * 2;
+  x[6*i+5] = y[4*i+3] + 3;
+}
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } xfail riscv*-*-* } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 6e77ec2c193..0f830c1ad9c 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3713,6 +3713,69 @@ vect_build_slp_instance (vec_info *vinfo,
 most two vector inputs to produce a single vector output.  */
  while (SLP_TREE_CHILDREN (perm).length () > 2)
{
+ /* When we have three equal sized groups left the pairwise
+reduction does not result in a scheme that avoids using
+three vectors.  Instead merge the first two groups
+to the final size with do-not-care elements (chosen
+from the first group) and then merge with the third.
+  { A0, B0,  x, A1, B1,  x, ... }
+   -> { A0, B0, C0, A1, B1, C1, ... }
+This handles group size of three (and at least
+power-of-two multiples of that).  */
+ if (SLP_TREE_CHILDREN (perm).length () == 3
+ && (SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[0])
+ == SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[1]))
+ && (SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[0])
+ == SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[2])))
+   {
+ int ai = 0;
+ int bi = 1;
+ slp_tree a = SLP_TREE_CHILDREN (perm)[ai];
+ slp_tree b = SLP_TREE_CHILDREN (perm)[bi];
+ unsigned n = SLP_TREE_LANES (perm);
+
+ slp_tree permab
+   = vect_create_new_slp_node (2, VEC_PERM_EXPR);
+ SLP_TREE_LANES (permab) = n;
+ SLP_TREE_LANE_PERMUTATION (permab).create (n);
+ SLP_TREE_VECTYPE (permab) = SLP_TREE_VECTYPE (perm);
+ /* ???  Should be NULL but that's not expected.  */
+ SLP_TREE_REP

Re: [patch,avr] PR87376: Disable -ftree-ter

2024-07-04 Thread Georg-Johann Lay




Am 04.07.24 um 11:49 schrieb Richard Biener:

On Thu, Jul 4, 2024 at 11:24 AM Richard Biener
 wrote:


On Wed, Jul 3, 2024 at 9:26 PM Georg-Johann Lay  wrote:




Am 02.07.24 um 15:48 schrieb Richard Biener:

On Tue, Jul 2, 2024 at 3:43 PM Georg-Johann Lay  wrote:


Hi Jeff,

This is a patch to get correct code out of 64-bit
loads from address-space __memx.

The AVR address-spaces may require that move insns issue
calls to library support functions, a fact that -ftree-ter
doesn't account for.  tree-ssa-ter.cc then replaces an
expression across such a library call, resulting in wrong code.

This patch disables that pass per default on avr, as there is no
more fine grained way to avoid malicious optimizations.
The pass can still be re-enabled by means of explicit -ftree-ter.

Ok to apply?


I think this requires more details on what goes wrong - I assume
it's not stmt reordering that effectively happens but recursive
expand_expr on SSA defs when those invoke libcalls?  In that
case this would point to a deeper issue.


The difference is that with TER, we get a hard reg in .expand
for a movdi from 24-bit address-space __memx.

Such moves require library calls, which in turn require
specific hard registers.  As avr backend has no movdi, the
movdi gets expanded as 8 * movqi, and that does not work
when the target registers are hard regs, as some of them
are clobbered by the libcalls.


So I see

(insn 18 17 19 2 (parallel [
 (set (reg:QI 22 r22 [+4 ])
 (mem/u/c:QI (reg/f:PSI 55) [1 aa+4 S1 A8 AS7]))
 (clobber (reg:QI 22 r22))
 (clobber (reg:QI 21 r21))
 (clobber (reg:HI 30 r30))
 ]) "t.c":12:13 38 {xloadqi_A}
  (nil))
(insn 19 18 20 2 (set (reg:PSI 56)
 (reg/f:PSI 47)) "t.c":12:13 112 {*movpsi_split}
  (nil))
(insn 20 19 21 2 (parallel [
 (set (reg/f:PSI 57)
 (plus:PSI (reg/f:PSI 47)
 (const_int 5 [0x5])))
 (clobber (scratch:QI))
 ]) "t.c":12:13 205 {addpsi3}
  (nil))
(insn 21 20 22 2 (parallel [
 (set (reg:QI 23 r23 [+5 ])
 (mem/u/c:QI (reg/f:PSI 57) [1 aa+5 S1 A8 AS7]))
 (clobber (reg:QI 22 r22))
 (clobber (reg:QI 21 r21))
 (clobber (reg:HI 30 r30))
 ]) "t.c":12:13 38 {xloadqi_A}
  (nil))

for example - insn 21 clobbers r22 which is also the destination of insn 18.

With -fno-tree-ter those oddly get _no_ intermediate reg but we have

(insn 9 8 10 2 (parallel [
 (set (subreg:QI (reg:DI 43 [ aa.0_1 ]) 1)
 (mem/u/c:QI (reg/f:PSI 48) [1 aa+1 S1 A8 AS7]))
 (clobber (reg:QI 22 r22))
 (clobber (reg:QI 21 r21))
 (clobber (reg:HI 30 r30))
 ]) "t.c":12:13 38 {xloadqi_A}
  (nil))

but since on GIMPLE we have DImode loads I don't see how TER comes into
play here - TER should favor the second code generation, not the first ...
(or TER shouldn't play into here at all).

with -fno-tree-ter we come via

#0  expand_expr_real (exp=, target=0x7716c9a8,
 tmode=E_DImode, modifier=EXPAND_NORMAL, alt_rtl=0x7fffcff8,
 inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:9433
#1  0x0109fe63 in store_expr (exp=,
 target=0x7716c9a8, call_param_p=0, nontemporal=false, reverse=false)
 at /space/rguenther/src/gcc/gcc/expr.cc:6740
#2  0x0109e626 in expand_assignment (to=,
 from=, nontemporal=false)
 at /space/rguenther/src/gcc/gcc/expr.cc:6461

while with TER we instead have

#0  expand_expr_real (exp=, target=0x0,
 tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
 inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:9433
#1  0x010b279f in expand_expr_real_gassign (g=0x771613c0,
 target=0x0, tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
 inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:11100
#2  0x010b3294 in expand_expr_real_1 (exp=,
 target=0x0, tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
 inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:11278

the difference is -fno-tree-ter has a target (reg:DI 43 [...]) but with TER we
are not passing a target or a mode.

I think the issue is that the expansion at some point doesn't expect
the result to end up in
a hard register.  Maybe define_expand are not supposed to do that but maybe
expansion needs to fix up.

A first thought was

diff --git a/gcc/expr.cc b/gcc/expr.cc
index ffbac513692..1509acad02e 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -1,6 +1,12 @@ expand_expr_real_gassign (gassign *g, rtx
target, machine_mode tmode,
gcc_unreachable ();
  }
set_curr_insn_location (saved_loc);
+  if (!target && REG_P (r) && REGNO (r) < FIRST_PSEUDO_REGISTER)
+{
+  rtx tem = gen_reg_rtx (GET_MODE (r));
+  emit_move_insn (tem, r);
+  r = tem;
+}
if (REG_P (r) && !REG_EXPR (

Re: [PATCH 1/2] aarch64: PR target/115457 Implement missing __ARM_FEATURE_BF16 macro

2024-07-04 Thread Kyrylo Tkachov


> On 3 Jul 2024, at 11:59, Kyrylo Tkachov  wrote:
> 
> Hi all,
> 
> The ACLE asks the user to test for __ARM_FEATURE_BF16 before using the
> <arm_bf16.h> header but GCC doesn't set this up.
> LLVM does, so this is an inconsistency between the compilers.
> 
> This patch enables that macro for TARGET_BF16_FP.
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Pushing to trunk.
> I think it makes sense to back port this to branches that support the 
> arm_bf16.h header.
> I’ll prepare patches for those separately.
> 

The patches apply cleanly and pass bootstrap and testing on the 12, 13, 14 
branches so I’ll back port them there.
They should also go to GCC 11 as that branch supports the intrinsics but the 
patches need a bit of adjusting for the patch context. I’ll prepare those 
separately.
Thanks,
Kyrill

> Thanks,
> Kyrill
> 
> gcc/
> 
> PR target/115457
> * config/aarch64/aarch64-c.c (aarch64_update_cpp_builtins):
> Define __ARM_FEATURE_BF16 for TARGET_BF16_FP.
> 
> gcc/testsuite/
> 
> PR target/115457
> * gcc.target/aarch64/acle/bf16_feature.c: New test.
> 
> Signed-off-by: Kyrylo Tkachov 
> <0001-aarch64-PR-target-115457-Implement-missing-__ARM_FEA.patch>
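
For reference, the guarded usage pattern this macro enables (a minimal
sketch; bfloat16_t and vcvth_bf16_f32 come from the arm_bf16.h header):

  #ifdef __ARM_FEATURE_BF16
  #include <arm_bf16.h>

  bfloat16_t
  to_bf16 (float x)
  {
    return vcvth_bf16_f32 (x);
  }
  #endif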



[PATCH 1/2]AArch64: make aarch64_simd_vec_unpack_lo_/_hi_ consistent.

2024-07-04 Thread Tamar Christina
Hi All,

The fix for PR18127 reworked the uxtl to zip optimization.
In doing so it undid the changes in aarch64_simd_vec_unpack_lo_ and this now
no longer matches aarch64_simd_vec_unpack_hi_.  It still works because the
RTL generated by aarch64_simd_vec_unpack_lo_ overlaps with the general zero
extend RTL, and because that one is listed before the lo pattern, recog picks
it instead.

This just makes aarch64_simd_vec_unpack_lo_ mirror
aarch64_simd_vec_unpack_hi_ for consistency and so we're not relying on the
order of patterns.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md
(aarch64_simd_vec_unpack_lo_): Add same split as
aarch64_simd_vec_unpack_hi_.
* config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
comment.

---
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca294faaf8d5ba847 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1904,7 +1904,7 @@ (define_insn "*aarch64_topbits_shuffle_be"
 
 ;; Widening operations.
 
-(define_insn "aarch64_simd_vec_unpack_lo_"
+(define_insn_and_split "aarch64_simd_vec_unpack_lo_"
   [(set (match_operand: 0 "register_operand" "=w")
 (ANY_EXTEND: (vec_select:
   (match_operand:VQW 1 "register_operand" "w")
@@ -1912,6 +1912,19 @@ (define_insn "aarch64_simd_vec_unpack_lo_"
)))]
   "TARGET_SIMD"
   "xtl\t%0., %1."
+  "&&  == ZERO_EXTEND
+   && aarch64_split_simd_shift_p (insn)"
+  [(const_int 0)]
+  {
+/* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero,
+   provided that the cost of the zero can be amortized over several
+   operations.  We'll later recombine the zero and zip if there are
+   not sufficient uses of the zero to make the split worthwhile.  */
+rtx res = simplify_gen_subreg (mode, operands[0], mode, 0);
+rtx zero = aarch64_gen_shareable_zero (mode);
+emit_insn (gen_aarch64_zip1 (res, operands[1], zero));
+DONE;
+  }
   [(set_attr "type" "neon_shift_imm_long")]
 )
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ae7e21d90b2aeec51b7626471ccf7f036fa9b3db..6f49d1482042efabedbe723aa59ecf129b84f4ad 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23159,7 +23159,8 @@ aarch64_gen_shareable_zero (machine_mode mode)
to split without that restriction and instead recombine shared zeros
if they turn out not to be worthwhile.  This would allow splits in
single-block functions and would also cope more naturally with
-   rematerialization.  */
+   rematerialization.  The downside of not doing this is that we lose the
+   optimizations for vector epilogues as well.  */
 
 bool
 aarch64_split_simd_shift_p (rtx_insn *insn)
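
For reference, a minimal little-endian illustration of the equivalence
the split relies on (register numbers chosen arbitrarily): zero-extending
the low half of a V16QI to V8HI with

  uxtl    v0.8h, v1.8b

is the same as interleaving with a zero vector,

  movi    v31.16b, 0
  zip1    v0.16b, v1.16b, v31.16b

since ZIP1 places a zero byte after each low-half source byte, forming
the 16-bit zero-extended lanes.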





[PATCH 2/2]AArch64: lower 2 reg TBL permutes with one zero register to 1 reg TBL.

2024-07-04 Thread Tamar Christina
Hi All,

When a two reg TBL is performed with one operand being a zero vector we can
instead use a single reg TBL and map the indices for accessing the zero vector
to an out of range constant.

On AArch64 out of range indices into a TBL have a defined semantics of setting
the element to zero.  Many uArches have a slower 2-reg TBL than 1-reg TBL.

Before this change we had:

typedef unsigned int v4si __attribute__ ((vector_size (16)));

v4si f1 (v4si a)
{
  v4si zeros = {0,0,0,0};
  return __builtin_shufflevector (a, zeros, 0, 5, 1, 6);
}

which generates:

f1:
mov v30.16b, v0.16b
moviv31.4s, 0
adrpx0, .LC0
ldr q0, [x0, #:lo12:.LC0]
tbl v0.16b, {v30.16b - v31.16b}, v0.16b
ret

.LC0:
.byte   0
.byte   1
.byte   2
.byte   3
.byte   20
.byte   21
.byte   22
.byte   23
.byte   4
.byte   5
.byte   6
.byte   7
.byte   24
.byte   25
.byte   26
.byte   27

and with the patch:

f1:
adrpx0, .LC0
ldr q31, [x0, #:lo12:.LC0]
tbl v0.16b, {v0.16b}, v31.16b
ret

.LC0:
.byte   0
.byte   1
.byte   2
.byte   3
.byte   -1
.byte   -1
.byte   -1
.byte   -1
.byte   4
.byte   5
.byte   6
.byte   7
.byte   -1
.byte   -1
.byte   -1
.byte   -1

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

This sequence is often generated by OpenMP, and aside from the
direct performance benefit of this change, it also gives better
register allocation, since we no longer have the consecutive
register limitation.
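
For the mirrored case, where the zero vector is the first operand, a
hypothetical sketch of what the remapping has to handle (indices that
select from the zero vector become -1, and indices into the real
vector are reduced modulo the lane count), reusing the v4si typedef
from f1:

v4si f2 (v4si a)
{
  v4si zeros = {0,0,0,0};
  return __builtin_shufflevector (zeros, a, 0, 5, 2, 6);
}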

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64.cc (struct expand_vec_perm_d): Add zero_op0 and
zero_op1.
(aarch64_evpc_tbl): Implement register value remapping.
(aarch64_vectorize_vec_perm_const): Detect if operand is a zero dup
before it's forced to a reg.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/tbl_with_zero.c: New test.

---
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 6f49d1482042efabedbe723aa59ecf129b84f4ad..655b93e65a7b686a2b82e8ada64cac084ca397d4 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25384,6 +25384,7 @@ struct expand_vec_perm_d
   unsigned int vec_flags;
   unsigned int op_vec_flags;
   bool one_vector_p;
+  bool zero_op0, zero_op1;
   bool testing_p;
 };
 
@@ -25880,13 +25881,38 @@ aarch64_evpc_tbl (struct expand_vec_perm_d *d)
   /* to_constant is safe since this routine is specific to Advanced SIMD
  vectors.  */
   unsigned int nelt = d->perm.length ().to_constant ();
+
+  /* If one register is the constant vector of 0 then we only need
+ a one reg TBL and we map any accesses to the vector of 0 to -1.  We can't
+ do this earlier since vec_perm_indices clamps elements to within range so
+ we can only do it during codegen.  */
+  if (d->zero_op0)
+d->op0 = d->op1;
+  else if (d->zero_op1)
+d->op1 = d->op0;
+
   for (unsigned int i = 0; i < nelt; ++i)
-/* If big-endian and two vectors we end up with a weird mixed-endian
-   mode on NEON.  Reverse the index within each word but not the word
-   itself.  to_constant is safe because we checked is_constant above.  */
-rperm[i] = GEN_INT (BYTES_BIG_ENDIAN
-   ? d->perm[i].to_constant () ^ (nelt - 1)
-   : d->perm[i].to_constant ());
+{
+  auto val = d->perm[i].to_constant ();
+
+  /* If we're selecting from a 0 vector, we can just use an out of range
+index instead.  */
+  if ((d->zero_op0 && val < nelt) || (d->zero_op1 && val >= nelt))
+   rperm[i] = constm1_rtx;
+  else
+   {
+ /* If we are remapping a zero register as the first parameter we need
+to adjust the indices of the non-zero register.  */
+ if (d->zero_op0)
+   val = val % nelt;
+
+ /* If big-endian and two vectors we end up with a weird mixed-endian
+mode on NEON.  Reverse the index within each word but not the word
+itself.  to_constant is safe because we checked is_constant
+above.  */
+ rperm[i] = GEN_INT (BYTES_BIG_ENDIAN ? val ^ (nelt - 1) : val);
+   }
+}
 
   sel = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (nelt, rperm));
   sel = force_reg (vmode, sel);
@@ -26132,6 +26158,8 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
  const vec_perm_indices &sel)
 {
   struct expand_vec_perm_d d;
+  d.zero_op0 = d.zero_op1 = false;
+  rtx e;
 
   /* Check whether the mask can be applied to a single vector.  */
   if (sel.ninputs () == 1
@@ -26147,6 +26175,12 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, 
machine_mod

Re: [patch,avr] PR87376: Disable -ftree-ter

2024-07-04 Thread Richard Biener
On Thu, Jul 4, 2024 at 1:08 PM Georg-Johann Lay  wrote:
>
>
>
> Am 04.07.24 um 11:49 schrieb Richard Biener:
> > On Thu, Jul 4, 2024 at 11:24 AM Richard Biener
> >  wrote:
> >>
> >> On Wed, Jul 3, 2024 at 9:26 PM Georg-Johann Lay  wrote:
> >>>
> >>>
> >>>
> >>> Am 02.07.24 um 15:48 schrieb Richard Biener:
>  On Tue, Jul 2, 2024 at 3:43 PM Georg-Johann Lay  wrote:
> >
> > Hi Jeff,
> >
> > This is a patch to get correct code out of 64-bit
> > loads from address-space __memx.
> >
> > The AVR address-spaces may require that move insns issue
> > calls to library support functions, a fact that -ftree-ter
> > doesn't account for.  tree-ssa-ter.cc then replaces an
> > expression across such a library call, resulting in wrong code.
> >
> > This patch disables that pass per default on avr, as there is no
> > more fine grained way to avoid malicious optimizations.
> > The pass can still be re-enabled by means of explicit -ftree-ter.
> >
> > Ok to apply?
> 
>  I think this requires more details on what goes wrong - I assume
>  it's not stmt reordering that effectively happens but recursive
>  expand_expr on SSA defs when those invoke libcalls?  In that
>  case this would point to a deeper issue.
> >>>
> >>> The difference is that with TER, we get a hard reg in .expand
> >>> for a movdi from 24-bit address-space __memx.
> >>>
> >>> Such moves require library calls, which in turn require
> >>> specific hard registers.  As avr backend has no movdi, the
> >>> movdi gets expanded as 8 * movqi, and that does not work
> >>> when the target registers are hard regs, as some of them
> >>> are clobbered by the libcalls.
> >>
> >> So I see
> >>
> >> (insn 18 17 19 2 (parallel [
> >>  (set (reg:QI 22 r22 [+4 ])
> >>  (mem/u/c:QI (reg/f:PSI 55) [1 aa+4 S1 A8 AS7]))
> >>  (clobber (reg:QI 22 r22))
> >>  (clobber (reg:QI 21 r21))
> >>  (clobber (reg:HI 30 r30))
> >>  ]) "t.c":12:13 38 {xloadqi_A}
> >>   (nil))
> >> (insn 19 18 20 2 (set (reg:PSI 56)
> >>  (reg/f:PSI 47)) "t.c":12:13 112 {*movpsi_split}
> >>   (nil))
> >> (insn 20 19 21 2 (parallel [
> >>  (set (reg/f:PSI 57)
> >>  (plus:PSI (reg/f:PSI 47)
> >>  (const_int 5 [0x5])))
> >>  (clobber (scratch:QI))
> >>  ]) "t.c":12:13 205 {addpsi3}
> >>   (nil))
> >> (insn 21 20 22 2 (parallel [
> >>  (set (reg:QI 23 r23 [+5 ])
> >>  (mem/u/c:QI (reg/f:PSI 57) [1 aa+5 S1 A8 AS7]))
> >>  (clobber (reg:QI 22 r22))
> >>  (clobber (reg:QI 21 r21))
> >>  (clobber (reg:HI 30 r30))
> >>  ]) "t.c":12:13 38 {xloadqi_A}
> >>   (nil))
> >>
> >> for example - insn 21 clobbers r22 which is also the destination of insn 
> >> 18.
> >>
> >> With -fno-tree-ter those oddly get _no_ intermediate reg but we have
> >>
> >> (insn 9 8 10 2 (parallel [
> >>  (set (subreg:QI (reg:DI 43 [ aa.0_1 ]) 1)
> >>  (mem/u/c:QI (reg/f:PSI 48) [1 aa+1 S1 A8 AS7]))
> >>  (clobber (reg:QI 22 r22))
> >>  (clobber (reg:QI 21 r21))
> >>  (clobber (reg:HI 30 r30))
> >>  ]) "t.c":12:13 38 {xloadqi_A}
> >>   (nil))
> >>
> >> but since on GIMPLE we have DImode loads I don't see how TER comes into
> >> play here - TER should favor the second code generation, not the first ...
> >> (or TER shouldn't play into here at all).
> >>
> >> with -fno-tree-ter we come via
> >>
> >> #0  expand_expr_real (exp=, 
> >> target=0x7716c9a8,
> >>  tmode=E_DImode, modifier=EXPAND_NORMAL, alt_rtl=0x7fffcff8,
> >>  inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:9433
> >> #1  0x0109fe63 in store_expr (exp=,
> >>  target=0x7716c9a8, call_param_p=0, nontemporal=false, 
> >> reverse=false)
> >>  at /space/rguenther/src/gcc/gcc/expr.cc:6740
> >> #2  0x0109e626 in expand_assignment (to= >> 1>,
> >>  from=, nontemporal=false)
> >>  at /space/rguenther/src/gcc/gcc/expr.cc:6461
> >>
> >> while with TER we instead have
> >>
> >> #0  expand_expr_real (exp=, target=0x0,
> >>  tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
> >>  inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:9433
> >> #1  0x010b279f in expand_expr_real_gassign (g=0x771613c0,
> >>  target=0x0, tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
> >>  inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:11100
> >> #2  0x010b3294 in expand_expr_real_1 (exp= >> 1>,
> >>  target=0x0, tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
> >>  inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:11278
> >>
> >> the difference is -fno-tree-ter has a target (reg:DI 43 [...]) but with 
> >> TER we
> >> are not passing a target or a mode.
> >>
> >> I 

Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack_lo_/_hi_ consistent.

2024-07-04 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> The fix for PR18127 reworked the uxtl to zip optimization.
> In doing so it undid the changes in aarch64_simd_vec_unpack_lo_ and this 
> now
> no longer matches aarch64_simd_vec_unpack_hi_.  It still works because the
> RTL generated by aarch64_simd_vec_unpack_lo_ overlaps with the general 
> zero
> extend RTL and so because that one is listed before the lo pattern recog picks
> it instead.
>
> This just makes aarch64_simd_vec_unpack_lo_ mirror
> aarch64_simd_vec_unpack_hi_ for consistency and so we're not relying on 
> the
> order of patterns.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md
>   (aarch64_simd_vec_unpack_lo_): Add same split as
>   aarch64_simd_vec_unpack_hi_.
>   * config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
>   comment.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca294faaf8d5ba847
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1904,7 +1904,7 @@ (define_insn "*aarch64_topbits_shuffle_be"
>  
>  ;; Widening operations.
>  
> -(define_insn "aarch64_simd_vec_unpack_lo_"
> +(define_insn_and_split "aarch64_simd_vec_unpack_lo_"
>[(set (match_operand: 0 "register_operand" "=w")
>  (ANY_EXTEND: (vec_select:
>  (match_operand:VQW 1 "register_operand" "w")
> @@ -1912,6 +1912,19 @@ (define_insn "aarch64_simd_vec_unpack_lo_"
>   )))]
>"TARGET_SIMD"
>"xtl\t%0., %1."
> +  "&&  == ZERO_EXTEND
> +   && aarch64_split_simd_shift_p (insn)"
> +  [(const_int 0)]
> +  {
> +/* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero,
> +   provided that the cost of the zero can be amortized over several
> +   operations.  We'll later recombine the zero and zip if there are
> +   not sufficient uses of the zero to make the split worthwhile.  */
> +rtx res = simplify_gen_subreg (mode, operands[0], mode, 0);
> +rtx zero = aarch64_gen_shareable_zero (mode);
> +emit_insn (gen_aarch64_zip1 (res, operands[1], zero));
> +DONE;
> +  }
>[(set_attr "type" "neon_shift_imm_long")]
>  )

I think we should just remove the pattern instead, and update users
to generate an extension of the 64-bit lowpart.  That form is now the
canonical one, so the current aarch64_simd_vec_unpack_lo_
pattern hard-codes unsimplified rtl.

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> ae7e21d90b2aeec51b7626471ccf7f036fa9b3db..6f49d1482042efabedbe723aa59ecf129b84f4ad
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -23159,7 +23159,8 @@ aarch64_gen_shareable_zero (machine_mode mode)
> to split without that restriction and instead recombine shared zeros
> if they turn out not to be worthwhile.  This would allow splits in
> single-block functions and would also cope more naturally with
> -   rematerialization.  */
> +   rematerialization.  The downside of not doing this is that we lose the
> +   optimizations for vector epilogues as well.  */
>  
>  bool
>  aarch64_split_simd_shift_p (rtx_insn *insn)

This part is ok.  late-combine was supposed to help with this,
but we might need to update the propagation code so that it handle
changes in mode.

Thanks,
Richard


RE: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack_lo_/_hi_ consistent.

2024-07-04 Thread Tamar Christina
> -Original Message-
> From: Richard Sandiford 
> Sent: Thursday, July 4, 2024 12:46 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> ; Marcus Shawcroft
> ; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack_lo_/_hi_
> consistent.
> 
> Tamar Christina  writes:
> > Hi All,
> >
> > The fix for PR18127 reworked the uxtl to zip optimization.
> > In doing so it undid the changes in aarch64_simd_vec_unpack_lo_ and this
> now
> > no longer matches aarch64_simd_vec_unpack_hi_.  It still works because
> the
> > RTL generated by aarch64_simd_vec_unpack_lo_ overlaps with the general
> zero
> > extend RTL and so because that one is listed before the lo pattern recog 
> > picks
> > it instead.
> >
> > This just makes aarch64_simd_vec_unpack_lo_ mirror
> > aarch64_simd_vec_unpack_hi_ for consistency and so we're not relying on
> the
> > order of patterns.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > * config/aarch64/aarch64-simd.md
> > (aarch64_simd_vec_unpack_lo_): Add same split as
> > aarch64_simd_vec_unpack_hi_.
> > * config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
> > comment.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> > index
> 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca
> 294faaf8d5ba847 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -1904,7 +1904,7 @@ (define_insn
> "*aarch64_topbits_shuffle_be"
> >
> >  ;; Widening operations.
> >
> > -(define_insn "aarch64_simd_vec_unpack_lo_"
> > +(define_insn_and_split "aarch64_simd_vec_unpack_lo_"
> >[(set (match_operand: 0 "register_operand" "=w")
> >  (ANY_EXTEND: (vec_select:
> >(match_operand:VQW 1 "register_operand" "w")
> > @@ -1912,6 +1912,19 @@ (define_insn
> "aarch64_simd_vec_unpack_lo_"
> > )))]
> >"TARGET_SIMD"
> >"xtl\t%0., %1."
> > +  "&&  == ZERO_EXTEND
> > +   && aarch64_split_simd_shift_p (insn)"
> > +  [(const_int 0)]
> > +  {
> > +/* On many cores, it is cheaper to implement UXTL using a ZIP1 with 
> > zero,
> > +   provided that the cost of the zero can be amortized over several
> > +   operations.  We'll later recombine the zero and zip if there are
> > +   not sufficient uses of the zero to make the split worthwhile.  */
> > +rtx res = simplify_gen_subreg (mode, operands[0], mode,
> 0);
> > +rtx zero = aarch64_gen_shareable_zero (mode);
> > +emit_insn (gen_aarch64_zip1 (res, operands[1], zero));
> > +DONE;
> > +  }
> >[(set_attr "type" "neon_shift_imm_long")]
> >  )
> 
> I think we should just remove the pattern instead, and update users
> to generate an extension of the 64-bit lowpart.  That form is now the
> canonical one, so the current aarch64_simd_vec_unpack_lo_
> pattern hard-codes unsimplified rtl.
> 

Ok, but then it's very weird to have _hi still.  The reason for these 
indirections
seems to be to model the hi/lo split for RTL simplifications. 

The aarch64_simd_vec_unpack_lo models a vec_select as well.  This RTL is
different from the one it gets rewritten to (dropping the vec_select).

So dropping it means we lose this, and I guess maybe fold-rtx does it now?
It's not immediately clear to me that dropping the additional RTL for recog
is beneficial.

Thanks,
Tamar

> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index
> ae7e21d90b2aeec51b7626471ccf7f036fa9b3db..6f49d1482042efabedbe723aa
> 59ecf129b84f4ad 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -23159,7 +23159,8 @@ aarch64_gen_shareable_zero (machine_mode
> mode)
> > to split without that restriction and instead recombine shared zeros
> > if they turn out not to be worthwhile.  This would allow splits in
> > single-block functions and would also cope more naturally with
> > -   rematerialization.  */
> > +   rematerialization.  The downside of not doing this is that we lose the
> > +   optimizations for vector epilogues as well.  */
> >
> >  bool
> >  aarch64_split_simd_shift_p (rtx_insn *insn)
> 
> This part is ok.  late-combine was supposed to help with this,
> but we might need to update the propagation code so that it handle
> changes in mode.
> 
> Thanks,
> Richard


Re: [PATCH 2/2]AArch64: lower 2 reg TBL permutes with one zero register to 1 reg TBL.

2024-07-04 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> When a two reg TBL is performed with one operand being a zero vector we can
> instead use a single reg TBL and map the indices for accessing the zero vector
> to an out of range constant.
>
> On AArch64 out of range indices into a TBL have a defined semantics of setting
> the element to zero.  Many uArches have a slower 2-reg TBL than 1-reg TBL.
>
> Before this change we had:
>
> typedef unsigned int v4si __attribute__ ((vector_size (16)));
>
> v4si f1 (v4si a)
> {
>   v4si zeros = {0,0,0,0};
>   return __builtin_shufflevector (a, zeros, 0, 5, 1, 6);
> }
>
> which generates:
>
> f1:
> mov v30.16b, v0.16b
> moviv31.4s, 0
> adrpx0, .LC0
> ldr q0, [x0, #:lo12:.LC0]
> tbl v0.16b, {v30.16b - v31.16b}, v0.16b
> ret
>
> .LC0:
> .byte   0
> .byte   1
> .byte   2
> .byte   3
> .byte   20
> .byte   21
> .byte   22
> .byte   23
> .byte   4
> .byte   5
> .byte   6
> .byte   7
> .byte   24
> .byte   25
> .byte   26
> .byte   27
>
> and with the patch:
>
> f1:
> adrpx0, .LC0
> ldr q31, [x0, #:lo12:.LC0]
> tbl v0.16b, {v0.16b}, v31.16b
> ret

Nice!

>
> .LC0:
> .byte   0
> .byte   1
> .byte   2
> .byte   3
> .byte   -1
> .byte   -1
> .byte   -1
> .byte   -1
> .byte   4
> .byte   5
> .byte   6
> .byte   7
> .byte   -1
> .byte   -1
> .byte   -1
> .byte   -1
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> This sequence is generated often by openmp and aside from the
> strict performance impact of this change, it also gives better
> register allocation as we no longer have the consecutive
> register limitation.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (struct expand_vec_perm_d): Add zero_op0 and
>   zero_op1.
>   (aarch64_evpc_tbl): Implement register value remapping.
>   (aarch64_vectorize_vec_perm_const): Detect if operand is a zero dup
>   before it's forced to a reg.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/tbl_with_zero.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 6f49d1482042efabedbe723aa59ecf129b84f4ad..655b93e65a7b686a2b82e8ada64cac084ca397d4
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -25384,6 +25384,7 @@ struct expand_vec_perm_d
>unsigned int vec_flags;
>unsigned int op_vec_flags;
>bool one_vector_p;
> +  bool zero_op0, zero_op1;

Nit: might as well add _p suffixes for consistency with the existing fields.

>bool testing_p;
>  };
>  
> @@ -25880,13 +25881,38 @@ aarch64_evpc_tbl (struct expand_vec_perm_d *d)
>/* to_constant is safe since this routine is specific to Advanced SIMD
>   vectors.  */
>unsigned int nelt = d->perm.length ().to_constant ();
> +
> +  /* If one register is the constant vector of 0 then we only need
> + a one reg TBL and we map any accesses to the vector of 0 to -1.  We 
> can't
> + do this earlier since vec_perm_indices clamps elements to within range 
> so
> + we can only do it during codegen.  */
> +  if (d->zero_op0)
> +d->op0 = d->op1;
> +  else if (d->zero_op1)
> +d->op1 = d->op0;
> +
>for (unsigned int i = 0; i < nelt; ++i)
> -/* If big-endian and two vectors we end up with a weird mixed-endian
> -   mode on NEON.  Reverse the index within each word but not the word
> -   itself.  to_constant is safe because we checked is_constant above.  */
> -rperm[i] = GEN_INT (BYTES_BIG_ENDIAN
> - ? d->perm[i].to_constant () ^ (nelt - 1)
> - : d->perm[i].to_constant ());
> +{
> +  auto val = d->perm[i].to_constant ();
> +
> +  /* If we're selecting from a 0 vector, we can just use an out of range
> +  index instead.  */
> +  if ((d->zero_op0 && val < nelt) || (d->zero_op1 && val >= nelt))
> + rperm[i] = constm1_rtx;
> +  else
> + {
> +   /* If we are remapping a zero register as the first parameter we need
> +  to adjust the indices of the non-zero register.  */
> +   if (d->zero_op0)
> + val = val % nelt;
> +
> +   /* If big-endian and two vectors we end up with a weird mixed-endian
> +  mode on NEON.  Reverse the index within each word but not the word
> +  itself.  to_constant is safe because we checked is_constant
> +  above.  */
> +   rperm[i] = GEN_INT (BYTES_BIG_ENDIAN ? val ^ (nelt - 1) : val);
> + }
> +}
>  
>sel = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (nelt, rperm));
>sel = force_reg (vmode, sel);
> @@ -26132,6 +26158,8 @@ aarch64_vectorize_vec_perm_const (

Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack_lo_/_hi_ consistent.

2024-07-04 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Thursday, July 4, 2024 12:46 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; ktkac...@gcc.gnu.org
>> Subject: Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack_lo_/_hi_
>> consistent.
>> 
>> Tamar Christina  writes:
>> > Hi All,
>> >
>> > The fix for PR18127 reworked the uxtl to zip optimization.
>> > In doing so it undid the changes in aarch64_simd_vec_unpack_lo_ and 
>> > this
>> now
>> > no longer matches aarch64_simd_vec_unpack_hi_.  It still works because
>> the
>> > RTL generated by aarch64_simd_vec_unpack_lo_ overlaps with the general
>> zero
>> > extend RTL and so because that one is listed before the lo pattern recog 
>> > picks
>> > it instead.
>> >
>> > This just makes aarch64_simd_vec_unpack_lo_ mirror
>> > aarch64_simd_vec_unpack_hi_ for consistency and so we're not relying on
>> the
>> > order of patterns.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> >* config/aarch64/aarch64-simd.md
>> >(aarch64_simd_vec_unpack_lo_): Add same split as
>> >aarch64_simd_vec_unpack_hi_.
>> >* config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
>> >comment.
>> >
>> > ---
>> >
>> > diff --git a/gcc/config/aarch64/aarch64-simd.md
>> b/gcc/config/aarch64/aarch64-simd.md
>> > index
>> 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca
>> 294faaf8d5ba847 100644
>> > --- a/gcc/config/aarch64/aarch64-simd.md
>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>> > @@ -1904,7 +1904,7 @@ (define_insn
>> "*aarch64_topbits_shuffle_be"
>> >
>> >  ;; Widening operations.
>> >
>> > -(define_insn "aarch64_simd_vec_unpack_lo_"
>> > +(define_insn_and_split "aarch64_simd_vec_unpack_lo_"
>> >[(set (match_operand: 0 "register_operand" "=w")
>> >  (ANY_EXTEND: (vec_select:
>> >   (match_operand:VQW 1 "register_operand" "w")
>> > @@ -1912,6 +1912,19 @@ (define_insn
>> "aarch64_simd_vec_unpack_lo_"
>> >)))]
>> >"TARGET_SIMD"
>> >"xtl\t%0., %1."
>> > +  "&&  == ZERO_EXTEND
>> > +   && aarch64_split_simd_shift_p (insn)"
>> > +  [(const_int 0)]
>> > +  {
>> > +/* On many cores, it is cheaper to implement UXTL using a ZIP1 with 
>> > zero,
>> > +   provided that the cost of the zero can be amortized over several
>> > +   operations.  We'll later recombine the zero and zip if there are
>> > +   not sufficient uses of the zero to make the split worthwhile.  */
>> > +rtx res = simplify_gen_subreg (mode, operands[0], mode,
>> 0);
>> > +rtx zero = aarch64_gen_shareable_zero (mode);
>> > +emit_insn (gen_aarch64_zip1 (res, operands[1], zero));
>> > +DONE;
>> > +  }
>> >[(set_attr "type" "neon_shift_imm_long")]
>> >  )
>> 
>> I think we should just remove the pattern instead, and update users
>> to generate an extension of the 64-bit lowpart.  That form is now the
>> canonical one, so the current aarch64_simd_vec_unpack_lo_
>> pattern hard-codes unsimplified rtl.
>> 
>
> Ok, but then it's very weird to have _hi still.  The reason for these 
> indirections
> seems to be to model the hi/lo split for RTL simplifications. 
>
> The aarch64_simd_vec_unpack_lo models a vec_select as well.  This RTL is
> different from the one it gets rewritten to (dropping the vec_select).
>
> So dropping it means we lose this, and I guess maybe fold-rtx does it now?
> It's not immediately clear to me that dropping the additional RTL for recog
>> is beneficial.

The principle is that, say:

  (vec_select:V2SI (reg:V2DI R) (parallel [(const_int 0) (const_int 1)]))

is (for little-endian) equivalent to:

  (subreg:V2SI (reg:V2DI R) 0)

and similarly for the equivalent big-endian pair.  Simplification rules
are now supposed to ensure that only the second (simpler) form is generated
by target-independent code.  We should fix any cases where that doesn't
happen, since it would be a missed optimisation for any instructions
that take (in this case) V2SI inputs.

There's no equivalent simplification for _hi because it isn't possible
to refer directly to the upper 64 bits of a 128-bit register using subregs.
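
For the little-endian V8QI->V8HI case, a minimal sketch of a pattern
matching the canonical (simpler) form, where the caller supplies the
64-bit lowpart directly (purely illustrative, not the actual aarch64
pattern):

  (define_insn "*sketch_vec_unpacku_lo_v8qi"
    [(set (match_operand:V8HI 0 "register_operand" "=w")
          (zero_extend:V8HI
            (match_operand:V8QI 1 "register_operand" "w")))]
    "TARGET_SIMD"
    "uxtl\t%0.8h, %1.8b"
  )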

Thanks,
Richard


Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack_lo_/_hi_ consistent.

2024-07-04 Thread Richard Sandiford
Richard Sandiford  writes:
> Tamar Christina  writes:
>>> -Original Message-
>>> From: Richard Sandiford 
>>> Sent: Thursday, July 4, 2024 12:46 PM
>>> To: Tamar Christina 
>>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>>> ; Marcus Shawcroft
>>> ; ktkac...@gcc.gnu.org
>>> Subject: Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack_lo_/_hi_
>>> consistent.
>>> 
>>> Tamar Christina  writes:
>>> > Hi All,
>>> >
>>> > The fix for PR18127 reworked the uxtl to zip optimization.
>>> > In doing so it undid the changes in aarch64_simd_vec_unpack_lo_ and 
>>> > this
>>> now
>>> > no longer matches aarch64_simd_vec_unpack_hi_.  It still works because
>>> the
>>> > RTL generated by aarch64_simd_vec_unpack_lo_ overlaps with the general
>>> zero
>>> > extend RTL and so because that one is listed before the lo pattern recog 
>>> > picks
>>> > it instead.
>>> >
>>> > This just makes aarch64_simd_vec_unpack_lo_ mirror
>>> > aarch64_simd_vec_unpack_hi_ for consistency and so we're not relying 
>>> > on
>>> the
>>> > order of patterns.
>>> >
>>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>>> >
>>> > Ok for master?
>>> >
>>> > Thanks,
>>> > Tamar
>>> >
>>> > gcc/ChangeLog:
>>> >
>>> >   * config/aarch64/aarch64-simd.md
>>> >   (aarch64_simd_vec_unpack_lo_): Add same split as
>>> >   aarch64_simd_vec_unpack_hi_.
>>> >   * config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
>>> >   comment.
>>> >
>>> > ---
>>> >
>>> > diff --git a/gcc/config/aarch64/aarch64-simd.md
>>> b/gcc/config/aarch64/aarch64-simd.md
>>> > index
>>> 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca
>>> 294faaf8d5ba847 100644
>>> > --- a/gcc/config/aarch64/aarch64-simd.md
>>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>>> > @@ -1904,7 +1904,7 @@ (define_insn
>>> "*aarch64_topbits_shuffle_be"
>>> >
>>> >  ;; Widening operations.
>>> >
>>> > -(define_insn "aarch64_simd_vec_unpack_lo_"
>>> > +(define_insn_and_split "aarch64_simd_vec_unpack_lo_"
>>> >[(set (match_operand: 0 "register_operand" "=w")
>>> >  (ANY_EXTEND: (vec_select:
>>> >  (match_operand:VQW 1 "register_operand" "w")
>>> > @@ -1912,6 +1912,19 @@ (define_insn
>>> "aarch64_simd_vec_unpack_lo_"
>>> >   )))]
>>> >"TARGET_SIMD"
>>> >"xtl\t%0., %1."
>>> > +  "&&  == ZERO_EXTEND
>>> > +   && aarch64_split_simd_shift_p (insn)"
>>> > +  [(const_int 0)]
>>> > +  {
>>> > +/* On many cores, it is cheaper to implement UXTL using a ZIP1 with 
>>> > zero,
>>> > +   provided that the cost of the zero can be amortized over several
>>> > +   operations.  We'll later recombine the zero and zip if there are
>>> > +   not sufficient uses of the zero to make the split worthwhile.  */
>>> > +rtx res = simplify_gen_subreg (mode, operands[0], mode,
>>> 0);
>>> > +rtx zero = aarch64_gen_shareable_zero (mode);
>>> > +emit_insn (gen_aarch64_zip1 (res, operands[1], zero));
>>> > +DONE;
>>> > +  }
>>> >[(set_attr "type" "neon_shift_imm_long")]
>>> >  )
>>> 
>>> I think we should just remove the pattern instead, and update users
>>> to generate an extension of the 64-bit lowpart.  That form is now the
>>> canonical one, so the current aarch64_simd_vec_unpack_lo_
>>> pattern hard-codes unsimplified rtl.
>>> 
>>
>> Ok, but then it's very weird to have _hi still.  The reason for these 
>> indirections
>> seems to be to model the hi/lo split for RTL simplifications. 
>>
>> The aarch64_simd_vec_unpack_lo models a vec_select as well.  This RTL is
>> different from the one it gets rewritten to (dropping the vec_select).
>>
>> So dropping it means we lose this, and I guess maybe fold-rtx does it now?
>> It's not immediately clear to me that dropping the additional RTL for recog
>> is beneficial.
>
> The principle is that, say:
>
>   (vec_select:V2SI (reg:V2DI R) (parallel [(const_int 0) (const_int 1)]))
>
> is (for little-endian) equivalent to:
>
>   (subreg:V2SI (reg:V2DI R) 0)

Sigh, of course I meant V4SI rather than V2DI in the above :)

> and similarly for the equivalent big-endian pair.  Simplification rules
> are now supposed to ensure that only the second (simpler) form is generated
> by target-independent code.  We should fix any cases where that doesn't
> happen, since it would be a missed optimisation for any instructions
> that take (in this case) V2SI inputs.
>
> There's no equivalent simplification for _hi because it isn't possible
> to refer directly to the upper 64 bits of a 128-bit register using subregs.
>
> Thanks,
> Richard


Re: [PATCH v3] ARM: thumb1: Use LDMIA/STMIA for DI/DF loads/stores

2024-07-04 Thread Siarhei Volkau
On Thu, 4 Jul 2024 at 12:45, Richard Earnshaw (lists) wrote:
>
> On 20/06/2024 08:24, Siarhei Volkau wrote:
> > If the address register is dead after load/store operation it looks
> > beneficial to use LDMIA/STMIA instead of pair of LDR/STR instructions,
> > at least if optimizing for size.
> >
> > Changes v2 -> v3:
> >  - switching to mixed approach (insn+peep2)
> >  - keep memory attributes in peepholes
> >  - handle stmia corner cases
> >
> > Changes v1 -> v2:
> >  - switching to peephole2 approach
> >  - added test case
> >
> > gcc/ChangeLog:
> >
> > * config/arm/arm.cc (thumb_load_double_from_address): Emit ldmia
> > when address reg rewritten by load.
> >
> > * config/arm/thumb1.md (peephole2 to rewrite DI/DF load): New.
> > (peephole2 to rewrite DI/DF store): New.
> >
> >   * config/arm/iterators.md (DIDF): New.
> >
> > gcc/testsuite:
> >
> > * gcc.target/arm/thumb1-load-store-64bit.c: Add new test.
>
> I made a couple of cleanups and pushed this.  My testing of the cleanup
> also identified another corner case for the ldm instruction: if the
> result of the load is not used (but it can't be eliminated because the
> address is marked volatile), then we could end up with
> ldm r0!, {r0, r1}
> which of course is unpredictable.  So we need to test not only that r0
> is dead but that it isn't written by the load either.
>

Good catch.
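
If I read it right, the tightened condition would look roughly like this
(just a sketch reusing the predicates from the posted peephole2; the
committed cleanup may spell it differently):

  rtx addr = XEXP (operands[1], 0);
  bool ldm_ok = REG_P (addr)
                && REGNO_REG_CLASS (REGNO (addr)) == LO_REGS
                /* The base register must die after the load ...  */
                && peep2_reg_dead_p (1, addr)
                /* ... and must not be written by the load itself.  */
                && !reg_overlap_mentioned_p (addr, operands[0]);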

Regarding thumb2, I investigated it a bit and found that it needs a bit
more effort:
1. DI mode is split into two insns during the subreg1 pass, so it won't
be easy to catch as it was for t1.
2. DF on soft-float t2 targets should have its own rules to transform
into LDM/STM.
3. LDM/STM are slower than LDRD/STRD on dual-issue cores, so they are
profitable only for -Os.
I think my experience does not allow me to propose such a patch, though.

But I have some ideas to improve thumb1, specifically armv6-m.
Will propose RFCs one by one.

> Anyway, thanks for the patch.
>
> R.
>

Thank you for your hard work.

Siarhei

> >
> > Signed-off-by: Siarhei Volkau 
> > ---
> >  gcc/config/arm/arm.cc |  8 ++--
> >  gcc/config/arm/iterators.md   |  3 ++
> >  gcc/config/arm/thumb1.md  | 37 ++-
> >  .../gcc.target/arm/thumb1-load-store-64bit.c  | 16 
> >  4 files changed, 58 insertions(+), 6 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/arm/thumb1-load-store-64bit.c
> >
> > diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
> > index b8c32db0a1d..2734ab3bce1 100644
> > --- a/gcc/config/arm/arm.cc
> > +++ b/gcc/config/arm/arm.cc
> > @@ -28368,15 +28368,13 @@ thumb_load_double_from_address (rtx *operands)
> >switch (GET_CODE (addr))
> >  {
> >  case REG:
> > -  operands[2] = adjust_address (operands[1], SImode, 4);
> > -
> > -  if (REGNO (operands[0]) == REGNO (addr))
> > +  if (reg_overlap_mentioned_p (addr, operands[0]))
> >   {
> > -   output_asm_insn ("ldr\t%H0, %2", operands);
> > -   output_asm_insn ("ldr\t%0, %1", operands);
> > +   output_asm_insn ("ldmia\t%m1, {%0, %H0}", operands);
> >   }
> >else
> >   {
> > +   operands[2] = adjust_address (operands[1], SImode, 4);
> > output_asm_insn ("ldr\t%0, %1", operands);
> > output_asm_insn ("ldr\t%H0, %2", operands);
> >   }
> > diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
> > index 8d066fcf05d..09046bff83b 100644
> > --- a/gcc/config/arm/iterators.md
> > +++ b/gcc/config/arm/iterators.md
> > @@ -50,6 +50,9 @@ (define_mode_iterator QHSD [QI HI SI DI])
> >  ;; A list of the 32bit and 64bit integer modes
> >  (define_mode_iterator SIDI [SI DI])
> >
> > +;; A list of the 64bit modes for thumb1.
> > +(define_mode_iterator DIDF [DI DF])
> > +
> >  ;; A list of atomic compare and swap success return modes
> >  (define_mode_iterator CCSI [(CC_Z "TARGET_32BIT") (SI "TARGET_THUMB1")])
> >
> > diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md
> > index d7074b43f60..c9a4201fcc9 100644
> > --- a/gcc/config/arm/thumb1.md
> > +++ b/gcc/config/arm/thumb1.md
> > @@ -2055,4 +2055,39 @@ (define_insn "thumb1_stack_protect_test_insn"
> > (set_attr "conds" "clob")
> > (set_attr "type" "multiple")]
> >  )
> > -
> > +
> > +;; match patterns usable by ldmia/stmia
> > +(define_peephole2
> > +  [(set (match_operand:DIDF 0 "low_register_operand" "")
> > + (match_operand:DIDF 1 "memory_operand" ""))]
> > +  "TARGET_THUMB1
> > +   && REG_P (XEXP (operands[1], 0))
> > +   && REGNO_REG_CLASS (REGNO (XEXP (operands[1], 0))) == LO_REGS
> > +   && peep2_reg_dead_p (1, XEXP (operands[1], 0))"
> > +  [(set (match_dup 0)
> > + (match_dup 1))]
> > +  "
> > +  {
> > +operands[1] = change_address (operands[1], VOIDmode,
> > +   gen_rtx_POST_INC (SImode,
> > + XEXP (operands[1], 0)));
> > +  }"
> > +)
> > +
> > +(define_peephole

[committed] libstdc++-v3: Skip 30_threads/future/members/poll.cc on hppa*-*-linux*

2024-07-04 Thread John David Anglin
Fixes test fail on hppa*-*-linux*.

Committed to trunk.

Dave
---

Skip 30_threads/future/members/poll.cc on hppa*-*-linux*

hppa*-*-linux* lacks high resolution timer support; timer resolution
ranges from 1 to 10ms. As a result, a large number of iterations are
needed for the wait_for_0 and ready loops. This causes the
wait_until_sys_epoch and wait_until_steady_epoch loops to time out,
since there the loop wait time is determined by the timer resolution.
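
(For illustration only: a runtime guard along these lines could detect
the coarse timer, though the test is simply skipped instead.
CLOCK_MONOTONIC and the 1ms threshold here are assumptions, not part of
the change.)

#include <time.h>

static int
coarse_timer_p (void)
{
  struct timespec res;
  if (clock_getres (CLOCK_MONOTONIC, &res) != 0)
    return 1;
  /* Treat a resolution of 1ms or worse as "no high resolution timer".  */
  return res.tv_sec > 0 || res.tv_nsec >= 1000000;
}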

2024-07-04  John David Anglin  

libstdc++-v3/ChangeLog:
PR libstdc++/98678
* testsuite/30_threads/future/members/poll.cc: Skip on hppa*-*-linux*.

diff --git a/libstdc++-v3/testsuite/30_threads/future/members/poll.cc 
b/libstdc++-v3/testsuite/30_threads/future/members/poll.cc
index 4fa282bd87f..2bdbe7a48ce 100644
--- a/libstdc++-v3/testsuite/30_threads/future/members/poll.cc
+++ b/libstdc++-v3/testsuite/30_threads/future/members/poll.cc
@@ -19,6 +19,7 @@
 // { dg-do run { target c++11 } }
 // { dg-additional-options "-pthread" { target pthread } }
 // { dg-require-gthreads "" }
+// { dg-skip-if "no high resolution timer support" { hppa*-*-linux* } }
 
 #include 
 #include 




Re: [PATCH v3] ARM: thumb1: Use LDMIA/STMIA for DI/DF loads/stores

2024-07-04 Thread Richard Earnshaw (lists)
On 04/07/2024 13:50, Siarhei Volkau wrote:
> On Thu, 4 Jul 2024 at 12:45, Richard Earnshaw (lists) wrote:
>>
>> On 20/06/2024 08:24, Siarhei Volkau wrote:
>>> If the address register is dead after load/store operation it looks
>>> beneficial to use LDMIA/STMIA instead of pair of LDR/STR instructions,
>>> at least if optimizing for size.
>>>
>>> Changes v2 -> v3:
>>>  - switching to mixed approach (insn+peep2)
>>>  - keep memory attributes in peepholes
>>>  - handle stmia corner cases
>>>
>>> Changes v1 -> v2:
>>>  - switching to peephole2 approach
>>>  - added test case
>>>
>>> gcc/ChangeLog:
>>>
>>> * config/arm/arm.cc (thumb_load_double_from_address): Emit ldmia
>>> when address reg rewritten by load.
>>>
>>> * config/arm/thumb1.md (peephole2 to rewrite DI/DF load): New.
>>> (peephole2 to rewrite DI/DF store): New.
>>>
>>>   * config/arm/iterators.md (DIDF): New.
>>>
>>> gcc/testsuite:
>>>
>>> * gcc.target/arm/thumb1-load-store-64bit.c: Add new test.
>>
>> I made a couple of cleanups and pushed this.  My testing of the cleanup
>> also identified another corner case for the ldm instruction: if the
>> result of the load is not used (but it can't be eliminated because the
>> address is marked volatile), then we could end up with
>> ldm r0!, {r0, r1}
>> which of course is unpredictable.  So we need to test not only that r0
>> is dead but that it isn't written by the load either.
>>
> 
> Good catch.
> 
> Regarding thumb2, I investigated it a bit and found that it needs a bit
> more effort:
> 1. DI mode is split into two insns during the subreg1 pass, so it won't
> be easy to catch as it was for t1.

We need to enable the new pair fusion pass for thumb2 to address this.  But it 
will mostly only result in new ldrd/strd instructions I suspect.

> 2. DF on soft-float t2 targets should have its own rules to transform
> into LDM/STM.
> 3. LDM/STM are slower than LDRD/STRD on dual-issue cores, so they are
> profitable only for -Os.

Speed tends to be micro-architecture (implementation) specific, but yes, it may 
well be the case that this will only be a win at -Os.

R.



Re: [patch,avr] PR87376: Disable -ftree-ter

2024-07-04 Thread Georg-Johann Lay

On 04.07.24 13:25, Richard Biener wrote:

On Thu, Jul 4, 2024 at 1:08 PM Georg-Johann Lay  wrote:

On 04.07.24 11:49, Richard Biener wrote:

On Thu, Jul 4, 2024 at 11:24 AM Richard Biener
 wrote:

On Wed, Jul 3, 2024 at 9:26 PM Georg-Johann Lay  wrote:

On 02.07.24 15:48, Richard Biener wrote:

On Tue, Jul 2, 2024 at 3:43 PM Georg-Johann Lay  wrote:


This is a patch to get correct code out of 64-bit
loads from address-space __memx.

The AVR address-spaces may require that move insns issue
calls to library support functions, a fact that -ftree-ter
doesn't account for.  tree-ssa-ter.cc then replaces an
expression across such a library call, resulting in wrong code.

This patch disables that pass by default on avr, as there is no
more fine-grained way to avoid the malicious optimization.
The pass can still be re-enabled by means of an explicit -ftree-ter.

Ok to apply?


I think this requires more details on what goes wrong - I assume
it's not stmt reordering that effectively happens but recursive
expand_expr on SSA defs when those invoke libcalls?  In that
case this would point to a deeper issue.


The difference is that with TER, we get a hard reg in .expand
for a movdi from 24-bit address-space __memx.

Such moves require library calls, which in turn require
specific hard registers.  As the avr backend has no movdi, the
movdi gets expanded as 8 * movqi, and that does not work
when the target registers are hard regs, as some of them
are clobbered by the libcalls.


So I see

(insn 18 17 19 2 (parallel [
  (set (reg:QI 22 r22 [+4 ])
  (mem/u/c:QI (reg/f:PSI 55) [1 aa+4 S1 A8 AS7]))
  (clobber (reg:QI 22 r22))
  (clobber (reg:QI 21 r21))
  (clobber (reg:HI 30 r30))
  ]) "t.c":12:13 38 {xloadqi_A}
   (nil))
(insn 19 18 20 2 (set (reg:PSI 56)
  (reg/f:PSI 47)) "t.c":12:13 112 {*movpsi_split}
   (nil))
(insn 20 19 21 2 (parallel [
  (set (reg/f:PSI 57)
  (plus:PSI (reg/f:PSI 47)
  (const_int 5 [0x5])))
  (clobber (scratch:QI))
  ]) "t.c":12:13 205 {addpsi3}
   (nil))
(insn 21 20 22 2 (parallel [
  (set (reg:QI 23 r23 [+5 ])
  (mem/u/c:QI (reg/f:PSI 57) [1 aa+5 S1 A8 AS7]))
  (clobber (reg:QI 22 r22))
  (clobber (reg:QI 21 r21))
  (clobber (reg:HI 30 r30))
  ]) "t.c":12:13 38 {xloadqi_A}
   (nil))

for example - insn 21 clobbers r22 which is also the destination of insn 18.

With -fno-tree-ter those oddly get _no_ intermediate reg but we have

(insn 9 8 10 2 (parallel [
  (set (subreg:QI (reg:DI 43 [ aa.0_1 ]) 1)
  (mem/u/c:QI (reg/f:PSI 48) [1 aa+1 S1 A8 AS7]))
  (clobber (reg:QI 22 r22))
  (clobber (reg:QI 21 r21))
  (clobber (reg:HI 30 r30))
  ]) "t.c":12:13 38 {xloadqi_A}
   (nil))

but since on GIMPLE we have DImode loads I don't see how TER comes into
play here - TER should favor the second code generation, not the first ...
(or TER shouldn't play into here at all).

with -fno-tree-ter we come via

#0  expand_expr_real (exp=, target=0x7716c9a8,
  tmode=E_DImode, modifier=EXPAND_NORMAL, alt_rtl=0x7fffcff8,
  inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:9433
#1  0x0109fe63 in store_expr (exp=,
  target=0x7716c9a8, call_param_p=0, nontemporal=false, reverse=false)
  at /space/rguenther/src/gcc/gcc/expr.cc:6740
#2  0x0109e626 in expand_assignment (to=,
  from=, nontemporal=false)
  at /space/rguenther/src/gcc/gcc/expr.cc:6461

while with TER we instead have

#0  expand_expr_real (exp=, target=0x0,
  tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
  inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:9433
#1  0x010b279f in expand_expr_real_gassign (g=0x771613c0,
  target=0x0, tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
  inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:11100
#2  0x010b3294 in expand_expr_real_1 (exp=,
  target=0x0, tmode=E_VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0,
  inner_reference_p=false) at /space/rguenther/src/gcc/gcc/expr.cc:11278

the difference is -fno-tree-ter has a target (reg:DI 43 [...]) but with TER we
are not passing a target or a mode.

I think the issue is that the expansion at some point doesn't expect
the result to end up in a hard register.  Maybe define_expands are not
supposed to do that, but maybe expansion needs to fix things up.

A first thought was

diff --git a/gcc/expr.cc b/gcc/expr.cc
index ffbac513692..1509acad02e 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -1,6 +1,12 @@ expand_expr_real_gassign (gassign *g, rtx
target, machine_mode tmode,
 gcc_unreachable ();
   }
 set_curr_insn_location (saved_loc);
+  if (!target && REG_P (r) && REGNO (r) < FIRST_PSEUDO_REGISTER)
+{
+ 

[patch, rtl-optimization, loop-unroll] v1: Loop unroll factor based on available registers over regs needed inside loops

2024-07-04 Thread Ajit Agarwal
Hello Richard:

Based on your feedback I have changed the logic for determining the
unroll factor of a loop.

The unroll factor is now calculated from the registers available to the
loop and the registers needed inside it: it is the quotient of the
maximum number of registers available in the loop over the number of
registers needed inside the loop.

-fsplit-ivs-in-unroller and -fvariable-expansion-in-unroller are
considered when controlling the unroll factor for loops.

The unroll factor is thus directly proportional to the number of
available registers and inversely proportional to the number of
registers needed inside the loop.

The registers needed are those for the loop variables/loop induction
variables inside the loop.

The available registers are determined by the number of hard registers
available for each register class minus the registers needed inside the
loop for that register class.
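
As a sketch, one plausible reading of this computation, per pressure
class (estimate_unroll_factor is a hypothetical name used only for
illustration; the patch itself works on LOOP_DATA):

static unsigned
estimate_unroll_factor (class loop *loop)
{
  unsigned factor = 1;
  for (int i = 0; i < ira_pressure_classes_num; i++)
    {
      enum reg_class cl = ira_pressure_classes[i];
      int needed = LOOP_DATA (loop)->regs_needed[cl];
      /* Available = hard regs in the class minus regs needed in the loop.  */
      int avail = ira_class_hard_regs_num[cl] - needed;
      if (needed > 0 && avail > 0)
        factor = MAX (factor, (unsigned) (avail / needed));
    }
  return factor;
}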


Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rtl-optimization, loop-unroll: Loop unroll factor based on
available registers over regs needed inside loops

The unroll factor is calculated from the registers available to the
loop and the registers needed inside it: it is the quotient of the
maximum number of registers available in the loop over the number of
registers needed inside the loop.

-fsplit-ivs-in-unroller and -fvariable-expansion-in-unroller are
considered when controlling the unroll factor for loops.

The unroll factor is directly proportional to the number of available
registers and inversely proportional to the number of registers needed
inside the loop.

The registers needed are those for the loop variables/loop induction
variables inside the loop.

The available registers are determined by the number of hard registers
available for each register class minus the registers needed inside the
loop for that register class.

2024-07-04  Ajit Kumar Agarwal  

gcc/ChangeLog:

* loop-unroll.cc: Add calculation of the unroll factor and use
it when unrolling loops.
---
 gcc/loop-unroll.cc | 427 -
 1 file changed, 424 insertions(+), 3 deletions(-)

diff --git a/gcc/loop-unroll.cc b/gcc/loop-unroll.cc
index bfdfe6c2bb7..86e0cc84eec 100644
--- a/gcc/loop-unroll.cc
+++ b/gcc/loop-unroll.cc
@@ -35,6 +35,11 @@ along with GCC; see the file COPYING3.  If not see
 #include "dojump.h"
 #include "expr.h"
 #include "dumpfile.h"
+#include "regs.h"
+#include "ira.h"
+#include "rtl-iter.h"
+#include "regset.h"
+#include "df.h"
 
 /* This pass performs loop unrolling.  We only perform this
optimization on innermost loops (with single exception) because
@@ -65,6 +70,39 @@ along with GCC; see the file COPYING3.  If not see
showed that this choice may affect performance in order of several %.
*/
 
+class loop_data
+{
+public:
+  class loop *outermost_exit;  /* The outermost exit of the loop.  */
+  bool has_call;   /* True if the loop contains a call.  */
+  /* Maximal register pressure inside loop for given register class
+ (defined only for the pressure classes).  */
+  int max_reg_pressure[N_REG_CLASSES];
+  int regs_needed[N_REG_CLASSES];
+  /* Loop regs referenced and live pseudo-registers.  */
+  bitmap_head regs_ref;
+  bitmap_head regs_live;
+};
+
+#define LOOP_DATA(LOOP) ((class loop_data *) (LOOP)->aux)
+
+/* Record all regs that are set in any one insn.  Communication from
+   mark_reg_{store,clobber} and global_conflicts.  Asm can refer to
+   all hard-registers.  */
+static rtx regs_set[(FIRST_PSEUDO_REGISTER > MAX_RECOG_OPERANDS
+? FIRST_PSEUDO_REGISTER : MAX_RECOG_OPERANDS) * 2];
+/* Number of regs stored in the previous array.  */
+static int n_regs_set;
+
+/* Currently processed loop.  */
+static class loop *curr_loop;
+
+/* Registers currently living.  */
+static bitmap_head curr_regs_live;
+
+/* Current reg pressure for each pressure class.  */
+static int curr_reg_pressure[N_REG_CLASSES];
+
 /* Information about induction variables to split.  */
 
 struct iv_to_split
@@ -102,6 +140,7 @@ struct iv_split_hasher : free_ptr_hash 
   static inline bool equal (const iv_to_split *, const iv_to_split *);
 };
 
+void mark_insn_with_unroller (class loop *loop, basic_block bb);
 
 /* A hash function for information about insns to split.  */
 
@@ -272,11 +311,263 @@ decide_unrolling (int flags)
 }
 }
 
+/* Return pressure class and number of needed hard registers (through
+   *NREGS) of register REGNO.  */
+static enum reg_class
+get_regno_pressure_class (int regno, int *nregs)
+{
+  if (regno >= FIRST_PSEUDO_REGISTER)
+{
+  enum reg_class pressure_class;
+  pressure_class = reg_allocno_class (regno);
+  pressure_class = ira_pressure_class_translate[pressure_class];
+  *nregs
+   = ira_reg_class_max_nregs[pressure_class][PSEUDO_REGNO_MODE (regno)];
+  return pressure_class;
+}
+  else if (! TEST_HARD_REG_BIT (ira_no_alloc_regs, regno)
+  && ! TEST_HARD_REG_BIT (eliminable_regset, regno))
+{
+  *nregs = 1;
+  return ira_pressure_class_translate[REGNO_REG_CLASS (regno)];
+}
+  else
+{
+  *nregs = 0;
+  r

Re: [Patch, rtl-optimization]: Loop unroll factor based on register pressure

2024-07-04 Thread Ajit Agarwal
Hello Richard:

On 03/07/24 2:18 pm, Richard Biener wrote:
> On Sun, Jun 30, 2024 at 4:15 AM Ajit Agarwal  wrote:
>>
>> Hello All:
>>
>> This patch determines Unroll factor based on loop register pressure.
>>
>> Unroll factor is quotient of max of available registers in loop
>> by number of liveness.
>>
>> If available registers increases unroll factor increases.
>> Wherein unroll factor decreases if number of liveness increases.
> 
> Unrolling as implemented does not increase register lifetime unless
> -fsplit-ivs-in-unroller or -fvariable-expansion-in-unroller.  But I do not
> see you looking at those transforms at all.
> 
Based on your feedback I have changed the logic for determining the
unroll factor of a loop in version 1 of the patch.

The unroll factor is now calculated from the registers available to the
loop and the registers needed inside it: it is the quotient of the
maximum number of registers available in the loop over the number of
registers needed inside the loop.

-fsplit-ivs-in-unroller and -fvariable-expansion-in-unroller are
considered when controlling the unroll factor for loops.

The unroll factor is directly proportional to the number of available
registers and inversely proportional to the number of registers needed
inside the loop.

The registers needed are those for the loop variables/loop induction
variables inside the loop.

The available registers are determined by the number of hard registers
available for each register class minus the registers needed inside the
loop for that register class.

Thanks & Regards
Ajit

> Richard.
> 
>> Loop unrolling is based on loop variables that determines unroll
>> factor. Loop variables of the loop are the variables that increases
>> register pressure and take advantage of existing register pressure
>> calculation.
>>
>> Available registers are determined by the number of hard registers
>> available for each register class minus max reg pressure of loop
>> for given register class.
>>
>> Bootstrapped and regtested on powerpc64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>>
>> rtl-optimization: Loop unroll factor based on register pressure
>>
>> Unroll factor is calculated based on loop register pressure.
>>
>> Unroll factor is quotient of max of available registers in loop
>> by number of liveness.
>>
>> If available registers increases unroll factor increases.
>> Wherein unroll factor decreases if number of liveness increases.
>>
>> Loop unrolling is based on loop variables that determines unroll
>> factor. Loop variables of the loop are the variables that increases
>> register pressure and take advantage of existing register pressure
>> calculation.
>>
>> Available registers are determined by the number of hard registers
>> available for each register class minus max reg pressure of loop
>> for given register class.
>>
>> 2024-06-29  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>> * loop-unroll.cc: Add calculation of register pressure of
>> the loop and use of that to calculate unroll factor.
>> ---
>>  gcc/loop-unroll.cc | 331 -
>>  1 file changed, 328 insertions(+), 3 deletions(-)
>>
>> diff --git a/gcc/loop-unroll.cc b/gcc/loop-unroll.cc
>> index bfdfe6c2bb7..6936ba7afb9 100644
>> --- a/gcc/loop-unroll.cc
>> +++ b/gcc/loop-unroll.cc
>> @@ -35,6 +35,11 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "dojump.h"
>>  #include "expr.h"
>>  #include "dumpfile.h"
>> +#include "regs.h"
>> +#include "ira.h"
>> +#include "rtl-iter.h"
>> +#include "regset.h"
>> +#include "df.h"
>>
>>  /* This pass performs loop unrolling.  We only perform this
>> optimization on innermost loops (with single exception) because
>> @@ -65,6 +70,38 @@ along with GCC; see the file COPYING3.  If not see
>> showed that this choice may affect performance in order of several %.
>> */
>>
>> +class loop_data
>> +{
>> +public:
>> +  class loop *outermost_exit;  /* The outermost exit of the loop.  */
>> +  bool has_call;   /* True if the loop contains a call.  */
>> +  /* Maximal register pressure inside loop for given register class
>> + (defined only for the pressure classes).  */
>> +  int max_reg_pressure[N_REG_CLASSES];
>> +  /* Loop regs referenced and live pseudo-registers.  */
>> +  bitmap_head regs_ref;
>> +  bitmap_head regs_live;
>> +};
>> +
>> +#define LOOP_DATA(LOOP) ((class loop_data *) (LOOP)->aux)
>> +
>> +/* Record all regs that are set in any one insn.  Communication from
>> +   mark_reg_{store,clobber} and global_conflicts.  Asm can refer to
>> +   all hard-registers.  */
>> +static rtx regs_set[(FIRST_PSEUDO_REGISTER > MAX_RECOG_OPERANDS
>> +? FIRST_PSEUDO_REGISTER : MAX_RECOG_OPERANDS) * 2];
>> +/* Number of regs stored in the previous array.  */
>> +static int n_regs_set;
>> +
>> +/* Currently processed loop.  */
>> +static class loop *curr_loop;
>> +
>> +/* Registers currently living.  */
>> +static bitmap_head curr_regs_live;
>> +
>> +/* Current reg pressure for each pressure class.  */
>> +static int curr_reg_pressure[N_REG_C

[committed][RISC-V] Fix test expectations after recent late-combine changes

2024-07-04 Thread Jeff Law


With the recent DCE-related adjustment to late-combine, the
rvv/base/vcreate.c test no longer has those undesirable vmvNr instructions.


It's a bit unclear why this wasn't written as a scan-assembler-not and
xfailed, given the comment says we don't want to see vmvNr instructions.
I must have missed that during review.


This patch adjusts the test to expect no vmvNr instructions, so if they're
ever re-introduced, we'll get a nice unexpected failure.


Pushing to the trunk.

Jeff

gcc/testsuite
* gcc.target/riscv/rvv/base/vcreate.c: Update expected output.

diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/vcreate.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/vcreate.c
index 01006de7c81..1c7c154637e 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/vcreate.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/vcreate.c
@@ -256,6 +256,6 @@ test_vcreate_v_i64m2x4 (vint64m2_t v0, vint64m2_t v1, 
vint64m2_t v2,
 }
 
 // Ideally with O3, should find 0 instances of any vmvnr.v PR113913
-/* { dg-final { scan-assembler-times {vmv1r.v\s+v[0-9]+,\s*v[0-9]+} 72 } } */
-/* { dg-final { scan-assembler-times {vmv2r.v\s+v[0-9]+,\s*v[0-9]+} 36 } } */
-/* { dg-final { scan-assembler-times {vmv4r.v\s+v[0-9]+,\s*v[0-9]+} 16 } } */
+/* { dg-final { scan-assembler-not {vmv1r.v\s+v[0-9]+,\s*v[0-9]+} } } */
+/* { dg-final { scan-assembler-not {vmv2r.v\s+v[0-9]+,\s*v[0-9]+} } } */
+/* { dg-final { scan-assembler-not {vmv4r.v\s+v[0-9]+,\s*v[0-9]+} } } */


Re: [PATCH 1/3] expr: Allow same precision modes conversion between {ibm_extended, ieee_quad}_format [PR112993]

2024-07-04 Thread Richard Sandiford
"Kewen.Lin"  writes:
> Hi,
>
> For some historical reasons, rs6000 defines KFmode, TFmode
> and IFmode to have different mode precision, but it causes
> some issues and needs some workarounds such as r14-6478 for
> PR112788.  So we are going to make all rs6000 128 bit scalar
> FP modes have 128 bit precision.  Be prepared for that, this
> patch is to make function convert_mode_scalar allow same
> precision FP modes conversion if their underlying formats are
> ibm_extended_format and ieee_quad_format respectively, just
> like the existing special treatment on arm_bfloat_half_format
> <-> ieee_half_format.  It also factors out all the relevant
> checks into a lambda function.
>
> Bootstrapped and regtested on x86_64-redhat-linux and
> powerpc64{,le}-linux-gnu.
>
> Is it ok for trunk?
>
> BR,
> Kewen
> -
>   PR target/112993
>
> gcc/ChangeLog:
>
>   * expr.cc (convert_mode_scalar): Allow same precision conversion
>   between scalar floating point modes if whose underlying format is
>   ibm_extended_format or ieee_quad_format, and refactor assertion
>   with new lambda function acceptable_same_precision_modes.
> ---
>  gcc/expr.cc | 30 --
>  1 file changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index ffbac513692..eac4dcc982e 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -338,6 +338,29 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
>enum rtx_code equiv_code = (unsignedp < 0 ? UNKNOWN
> : (unsignedp ? ZERO_EXTEND : SIGN_EXTEND));
>
> +  auto acceptable_same_precision_modes
> += [] (scalar_mode from_mode, scalar_mode to_mode) -> bool
> +{
> +  if (DECIMAL_FLOAT_MODE_P (from_mode) != DECIMAL_FLOAT_MODE_P (to_mode))
> + return true;
> +
> +  /* arm_bfloat_half_format <-> ieee_half_format */
> +  if ((REAL_MODE_FORMAT (from_mode) == &arm_bfloat_half_format
> +&& REAL_MODE_FORMAT (to_mode) == &ieee_half_format)
> +   || (REAL_MODE_FORMAT (to_mode) == &arm_bfloat_half_format
> +   && REAL_MODE_FORMAT (from_mode) == &ieee_half_format))
> + return true;
> +
> +  /* ibm_extended_format <-> ieee_quad_format */
> +  if ((REAL_MODE_FORMAT (from_mode) == &ibm_extended_format
> +&& REAL_MODE_FORMAT (to_mode) == &ieee_quad_format)
> +   || (REAL_MODE_FORMAT (from_mode) == &ieee_quad_format
> +   && REAL_MODE_FORMAT (to_mode) == &ibm_extended_format))
> + return true;
> +
> +  return false;
> +};
> +
>if (to_real)
>  {
>rtx value;
> @@ -346,12 +369,7 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
>
>gcc_assert ((GET_MODE_PRECISION (from_mode)
>  != GET_MODE_PRECISION (to_mode))
> -   || (DECIMAL_FLOAT_MODE_P (from_mode)
> -   != DECIMAL_FLOAT_MODE_P (to_mode))
> -   || (REAL_MODE_FORMAT (from_mode) == &arm_bfloat_half_format
> -   && REAL_MODE_FORMAT (to_mode) == &ieee_half_format)
> -   || (REAL_MODE_FORMAT (to_mode) == &arm_bfloat_half_format
> -   && REAL_MODE_FORMAT (from_mode) == &ieee_half_format));
> +   || acceptable_same_precision_modes (from_mode, to_mode));
>
>if (GET_MODE_PRECISION (from_mode) == GET_MODE_PRECISION (to_mode))
>   {
> --
> 2.39.1

This part looks good to me FWIW, but what's the correct behaviour of:

  if (GET_MODE_PRECISION (from_mode) == GET_MODE_PRECISION (to_mode))
{
  if (REAL_MODE_FORMAT (to_mode) == &arm_bfloat_half_format
  && REAL_MODE_FORMAT (from_mode) == &ieee_half_format)
/* libgcc implements just __trunchfbf2, not __extendhfbf2.  */
tab = trunc_optab;
  else
/* Conversion between decimal float and binary float, same
   size.  */
tab = DECIMAL_FLOAT_MODE_P (from_mode) ? trunc_optab : sext_optab;

for the new pairing?  The intent for bfloat/half seems to be that bfloat
is treated as arbitrarily “lesser than” half, so half->bfloat is a
truncation and bfloat->half is an extension.  It seems like it would be
good to do something similar for the new pair, so that the float modes
still form a total order in terms of extension & truncation.
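
For instance, a sketch of the analogous special case, arbitrarily
treating ibm_extended_format as "lesser than" ieee_quad_format (the
opposite convention would do equally well, as long as it is applied
consistently):

  if (REAL_MODE_FORMAT (to_mode) == &arm_bfloat_half_format
      && REAL_MODE_FORMAT (from_mode) == &ieee_half_format)
    tab = trunc_optab;
  else if (REAL_MODE_FORMAT (to_mode) == &ibm_extended_format
           && REAL_MODE_FORMAT (from_mode) == &ieee_quad_format)
    tab = trunc_optab;  /* IEEE quad -> IBM double-double.  */
  else if (REAL_MODE_FORMAT (to_mode) == &ieee_quad_format
           && REAL_MODE_FORMAT (from_mode) == &ibm_extended_format)
    tab = sext_optab;   /* IBM double-double -> IEEE quad.  */
  else
    tab = DECIMAL_FLOAT_MODE_P (from_mode) ? trunc_optab : sext_optab;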

Thanks,
Richard


Re: [PATCH 3/3] tree: Remove KFmode workaround [PR112993]

2024-07-04 Thread Richard Sandiford
"Kewen.Lin"  writes:
> Hi,
>
> The fix for PR112993 will make KFmode have 128 bit mode precision,
> we don't need this workaround to fix up the type precision any
> more, and just go with the mode precision.  So this patch is to
> remove KFmode workaround.
>
> Bootstrapped and regtested on x86_64-redhat-linux,
> powerpc64{,le}-linux-gnu (ibm128 long double default)
> and powerpc64le-linux-gnu (ieee128 long double default).
>
> Is it OK for trunk if {1,2}/3 in this series get landed?

Yes.  Thanks for doing this!

Richard

> BR,
> Kewen
> -
>
>   PR target/112993
>
> gcc/ChangeLog:
>
>   * tree.cc (build_common_tree_nodes): Drop the workaround for rs6000
>   KFmode precision adjustment.
> ---
>  gcc/tree.cc | 9 -
>  1 file changed, 9 deletions(-)
>
> diff --git a/gcc/tree.cc b/gcc/tree.cc
> index f801712c9dd..f730981ec8b 100644
> --- a/gcc/tree.cc
> +++ b/gcc/tree.cc
> @@ -9575,15 +9575,6 @@ build_common_tree_nodes (bool signed_char)
>if (!targetm.floatn_mode (n, extended).exists (&mode))
>   continue;
>int precision = GET_MODE_PRECISION (mode);
> -  /* Work around the rs6000 KFmode having precision 113 not
> -  128.  */
> -  const struct real_format *fmt = REAL_MODE_FORMAT (mode);
> -  gcc_assert (fmt->b == 2 && fmt->emin + fmt->emax == 3);
> -  int min_precision = fmt->p + ceil_log2 (fmt->emax - fmt->emin);
> -  if (!extended)
> - gcc_assert (min_precision == n);
> -  if (precision < min_precision)
> - precision = min_precision;
>FLOATN_NX_TYPE_NODE (i) = make_node (REAL_TYPE);
>TYPE_PRECISION (FLOATN_NX_TYPE_NODE (i)) = precision;
>layout_type (FLOATN_NX_TYPE_NODE (i));
> --
> 2.39.1


[pushed] analyzer: pass sm_context by reference rather than pointer

2024-07-04 Thread David Malcolm
These are never nullptr and never change, so use a reference rather
than a pointer.

No functional change intended.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Successful run of analyzer integration tests on x86_64-pc-linux-gnu.
Pushed to trunk as r15-1846-gf8c130cdf53165.

gcc/analyzer/ChangeLog:
* diagnostic-manager.cc
(diagnostic_manager::add_events_for_eedge): Pass sm_ctxt by
reference.
* engine.cc (impl_region_model_context::on_condition): Likewise.
(impl_region_model_context::on_bounded_ranges): Likewise.
(impl_region_model_context::on_phi): Likewise.
(exploded_node::on_stmt): Likewise.
* sm-fd.cc: Update all uses of sm_context * to sm_context &.
* sm-file.cc: Likewise.
* sm-malloc.cc: Likewise.
* sm-pattern-test.cc: Likewise.
* sm-sensitive.cc: Likewise.
* sm-signal.cc: Likewise.
* sm-taint.cc: Likewise.
* sm.h: Likewise.
* varargs.cc: Likewise.

gcc/testsuite/ChangeLog:
* gcc.dg/plugin/analyzer_gil_plugin.c: Update all uses of
sm_context * to sm_context &.

Signed-off-by: David Malcolm 
---
 gcc/analyzer/diagnostic-manager.cc|   2 +-
 gcc/analyzer/engine.cc|   8 +-
 gcc/analyzer/sm-fd.cc | 318 +-
 gcc/analyzer/sm-file.cc   |  38 +--
 gcc/analyzer/sm-malloc.cc | 194 +--
 gcc/analyzer/sm-pattern-test.cc   |  14 +-
 gcc/analyzer/sm-sensitive.cc  |  22 +-
 gcc/analyzer/sm-signal.cc |  20 +-
 gcc/analyzer/sm-taint.cc  | 122 ---
 gcc/analyzer/sm.h |   8 +-
 gcc/analyzer/varargs.cc   |  54 +--
 .../gcc.dg/plugin/analyzer_gil_plugin.c   |  46 +--
 12 files changed, 419 insertions(+), 427 deletions(-)

diff --git a/gcc/analyzer/diagnostic-manager.cc 
b/gcc/analyzer/diagnostic-manager.cc
index 20e793d72c19..fe943ac61c9e 100644
--- a/gcc/analyzer/diagnostic-manager.cc
+++ b/gcc/analyzer/diagnostic-manager.cc
@@ -2239,7 +2239,7 @@ diagnostic_manager::add_events_for_eedge (const 
path_builder &pb,
&iter_point,
emission_path,
pb.get_ext_state 
());
-   sm.on_stmt (&sm_ctxt, dst_point.get_supernode (), stmt);
+   sm.on_stmt (sm_ctxt, dst_point.get_supernode (), stmt);
// TODO: what about phi nodes?
  }
  }
diff --git a/gcc/analyzer/engine.cc b/gcc/analyzer/engine.cc
index f5fad5b2e470..c9f204b13e70 100644
--- a/gcc/analyzer/engine.cc
+++ b/gcc/analyzer/engine.cc
@@ -969,7 +969,7 @@ impl_region_model_context::on_condition (const svalue *lhs,
   m_old_state->m_checker_states[sm_idx],
   m_new_state->m_checker_states[sm_idx],
   m_path_ctxt);
-  sm.on_condition (&sm_ctxt,
+  sm.on_condition (sm_ctxt,
   (m_enode_for_diag
? m_enode_for_diag->get_supernode ()
: NULL),
@@ -996,7 +996,7 @@ impl_region_model_context::on_bounded_ranges (const svalue 
&sval,
   m_old_state->m_checker_states[sm_idx],
   m_new_state->m_checker_states[sm_idx],
   m_path_ctxt);
-  sm.on_bounded_ranges (&sm_ctxt,
+  sm.on_bounded_ranges (sm_ctxt,
(m_enode_for_diag
 ? m_enode_for_diag->get_supernode ()
 : NULL),
@@ -1037,7 +1037,7 @@ impl_region_model_context::on_phi (const gphi *phi, tree 
rhs)
   m_old_state->m_checker_states[sm_idx],
   m_new_state->m_checker_states[sm_idx],
   m_path_ctxt);
-  sm.on_phi (&sm_ctxt, m_enode_for_diag->get_supernode (), phi, rhs);
+  sm.on_phi (sm_ctxt, m_enode_for_diag->get_supernode (), phi, rhs);
 }
 }
 
@@ -1559,7 +1559,7 @@ exploded_node::on_stmt (exploded_graph &eg,
   unknown_side_effects);
 
   /* Allow the state_machine to handle the stmt.  */
-  if (sm.on_stmt (&sm_ctxt, snode, stmt))
+  if (sm.on_stmt (sm_ctxt, snode, stmt))
unknown_side_effects = false;
 }
 
diff --git a/gcc/analyzer/sm-fd.cc b/gcc/analyzer/sm-fd.cc
index ded20576fd10..3396b1d11228 100644
--- a/gcc/analyzer/sm-fd.cc
+++ b/gcc/analyzer/sm-fd.cc
@@ -119,10 +119,10 @@ public:
 return m_start;
   }
 
-  bool on_stmt (sm_context *sm_ctxt, const supernode *node,
+  bool on_stmt (sm_context &sm_ctxt, const supernode *node,
const gimple *stmt) const final override;
 
-

[pushed] analyzer: handle at -O0 [PR115724]

2024-07-04 Thread David Malcolm
At -O0, glibc's:

__extern_always_inline void
error (int __status, int __errnum, const char *__format, ...)
{
  if (__builtin_constant_p (__status) && __status != 0)
__error_noreturn (__status, __errnum, __format, __builtin_va_arg_pack ());
  else
__error_alias (__status, __errnum, __format, __builtin_va_arg_pack ());
}

becomes just:

__extern_always_inline void
error (int __status, int __errnum, const char *__format, ...)
{
  if (0)
__error_noreturn (__status, __errnum, __format, __builtin_va_arg_pack ());
  else
__error_alias (__status, __errnum, __format, __builtin_va_arg_pack ());
}

and thus calls to "error" are calls to "__error_alias" by the
time -fanalyzer "sees" them.

Handle them with more special-casing in kf.cc.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Successful run of analyzer integration tests on x86_64-pc-linux-gnu.
Pushed to trunk as r15-1845-ga6fdb1a2a29061.

gcc/analyzer/ChangeLog:
PR analyzer/115724
* kf.cc (register_known_functions): Add __error_alias and
__error_at_line_alias.

gcc/testsuite/ChangeLog:
PR analyzer/115724
* c-c++-common/analyzer/error-pr115724.c: New test.

Signed-off-by: David Malcolm 
---
 gcc/analyzer/kf.cc|  4 +
 .../c-c++-common/analyzer/error-pr115724.c| 86 +++
 2 files changed, 90 insertions(+)
 create mode 100644 gcc/testsuite/c-c++-common/analyzer/error-pr115724.c

diff --git a/gcc/analyzer/kf.cc b/gcc/analyzer/kf.cc
index 4213b89ac9fb..5c3a71fbb49c 100644
--- a/gcc/analyzer/kf.cc
+++ b/gcc/analyzer/kf.cc
@@ -2325,6 +2325,10 @@ register_known_functions (known_function_manager &kfm,
 kfm.add ("__errno_location", make_unique ());
 kfm.add ("error", make_unique (3));
 kfm.add ("error_at_line", make_unique (5));
+/* Variants of "error" and "error_at_line" seen by the
+   analyzer at -O0 (PR analyzer/115724).  */
+kfm.add ("__error_alias", make_unique (3));
+kfm.add ("__error_at_line_alias", make_unique (5));
   }
 
   /* Other implementations of C standard library.  */
diff --git a/gcc/testsuite/c-c++-common/analyzer/error-pr115724.c 
b/gcc/testsuite/c-c++-common/analyzer/error-pr115724.c
new file mode 100644
index ..ae606ad89d6a
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/analyzer/error-pr115724.c
@@ -0,0 +1,86 @@
+/* Verify that the analyzer handles the no-optimization case in
+   glibc's <error.h> when error,error_at_line calls become
+   __error_alias and __error_at_line_alias.  */
+
+typedef __SIZE_TYPE__ size_t;
+#define EXIT_FAILURE 1
+#define __extern_always_inline  extern inline __attribute__ 
((__always_inline__)) __attribute__ ((__gnu_inline__))
+
+int errno;
+
+/* Adapted from glibc's bits/error.h.  */
+
+extern void __error_alias (int __status, int __errnum,
+  const char *__format, ...)
+  __attribute__ ((__format__ (__printf__, 3, 4)));
+extern void __error_noreturn (int __status, int __errnum,
+  const char *__format, ...)
+  __attribute__ ((__noreturn__, __format__ (__printf__, 3, 4)));
+
+/* If we know the function will never return make sure the compiler
+   realizes that, too.  */
+__extern_always_inline void
+error (int __status, int __errnum, const char *__format, ...)
+{
+  if (__builtin_constant_p (__status) && __status != 0)
+__error_noreturn (__status, __errnum, __format, __builtin_va_arg_pack ());
+  else
+__error_alias (__status, __errnum, __format, __builtin_va_arg_pack ());
+}
+
+extern void __error_at_line_alias (int __status, int __errnum,
+  const char *__fname,
+  unsigned int __line,
+  const char *__format, ...)
+  __attribute__ ((__format__ (__printf__, 5, 6)));
+extern void __error_at_line_noreturn (int __status, int __errnum,
+ const char *__fname,
+ unsigned int __line,
+ const char *__format,
+ ...)
+  __attribute__ ((__noreturn__, __format__ (__printf__, 5, 6)));
+
+/* If we know the function will never return make sure the compiler
+   realizes that, too.  */
+__extern_always_inline void
+error_at_line (int __status, int __errnum, const char *__fname,
+  unsigned int __line, const char *__format, ...)
+{
+  if (__builtin_constant_p (__status) && __status != 0)
+__error_at_line_noreturn (__status, __errnum, __fname, __line, __format,
+ __builtin_va_arg_pack ());
+  else
+__error_at_line_alias (__status, __errnum, __fname, __line,
+  __format, __builtin_va_arg_pack ());
+}
+
+
+struct list {
+size_t size;
+void (*destroy)(void *data);
+struct list_node *head;
+struct list_node *tail;
+};
+
+struct list *list_create(void (*destroy)(void *data))
+{
+struct list *

[x86 SSE PATCH] PR target/115751: Avoid force_reg in ix86_expand_ternlog.

2024-07-04 Thread Roger Sayle

This patch fixes a problem with splitting of complex AVX512 ternlog
instructions on x86_64.  A recent change allows the ternlog pattern
to have multiple mem-like operands prior to reload, by emitting any
"reloads" as necessary during split1, before register allocation.
The issue is that this code calls force_reg to place the mem-like
operand into a register, but unfortunately the vec_duplicate (broadcast)
form of operands supported by ternlog isn't considered a "general_operand",
i.e. supported by all instructions.  This mismatch triggers an ICE in
the middle-end's force_reg, even though the x86 supports loading these
vec_duplicate operands into a vector register in a single (move)
instruction.

This patch resolves this problem by replacing force_reg with calls
to gen_reg_rtx and emit_move (as the i386 backend, unlike the middle-end,
knows these will be recognized by recog).
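
To make the failure mode concrete, here is a hypothetical illustration
(not taken from the PR; "ptr" stands in for some address rtx) of the
operand shape and the replacement sequence:

  rtx mem  = gen_rtx_MEM (SImode, ptr);
  rtx bcst = gen_rtx_VEC_DUPLICATE (V16SImode, mem);  /* broadcast form */
  /* force_reg (V16SImode, bcst) would ICE, as bcst is not a
     general_operand, but an explicit move is matched by the
     broadcast move pattern:  */
  rtx reg = gen_reg_rtx (V16SImode);
  emit_move_insn (reg, bcst);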

I'll admit that I've been unable to reproduce this error without a
testcase, but my assumption when developing the previous patch was
that it was safe to call force_reg on a vec_duplicate, which this PR
shows to be wrong (currently).  I'll let smarter minds pronounce on
whether changing i386.md's definition of general_operand may be an
alternate solution, but such a change can be independent of this fix.
[I've a related patch to expand the CONST_VECTORs allowed in
ix86_legitimate_constant_p before reload, but keeping everything
happy is tricky].

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2024-07-04  Roger Sayle  

gcc/ChangeLog
PR target/115751
* config/i386/i386-expand.c (ix86_expand_ternlog): Avoid use of
force_reg to "reload" non-register operands, as these may contain
vec_duplicate (broadcast) operands that aren't supported by
force_reg.  Use (safer) gen_reg_rtx and emit_move instead.


Thanks in advance (sorry for the inconvenience),
Roger
--

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index dd2c3a8..178dd0b 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -26049,14 +26049,25 @@ ix86_expand_ternlog (machine_mode mode, rtx op0, rtx op1, rtx op2, int idx,
       break;
     }
 
-  tmp0 = register_operand (op0, mode) ? op0 : force_reg (mode, op0);
+  if (!register_operand (op0, mode))
+    {
+      /* We can't use force_reg (mode, op0).  */
+      tmp0 = gen_reg_rtx (GET_MODE (op0));
+      emit_move_insn (tmp0, op0);
+    }
+  else
+    tmp0 = op0;
   if (GET_MODE (tmp0) != mode)
     tmp0 = gen_lowpart (mode, tmp0);
 
   if (!op1 || rtx_equal_p (op0, op1))
     tmp1 = copy_rtx (tmp0);
   else if (!register_operand (op1, mode))
-    tmp1 = force_reg (mode, op1);
+    {
+      /* We can't use force_reg (mode, op1).  */
+      tmp1 = gen_reg_rtx (GET_MODE (op1));
+      emit_move_insn (tmp1, op1);
+    }
   else
     tmp1 = op1;
   if (GET_MODE (tmp1) != mode)


[PATCH, gfortran] libgfortran: implement fpu-macppc for Darwin, support IEEE arithmetic

2024-07-04 Thread Sergey Fedorov
From 50fc05566ba7479844949d727233c04a5e8151e8 Mon Sep 17 00:00:00 2001
From: Sergey Fedorov 
Date: Sat, 29 Apr 2023 14:55:44 +0800
Subject: [PATCH] libgfortran: implement fpu-macppc for Darwin, support IEEE
 arithmetic

Signed-off-by: Sergey Fedorov 
---
 .../gfortran.dg/ieee/signaling_2_c.c  |   9 +
 libgfortran/config/fpu-macppc.h   | 414 ++
 libgfortran/configure.host|   9 +
 3 files changed, 432 insertions(+)
 create mode 100644 libgfortran/config/fpu-macppc.h

diff --git a/gcc/testsuite/gfortran.dg/ieee/signaling_2_c.c
b/gcc/testsuite/gfortran.dg/ieee/signaling_2_c.c
index ea7fc0467bd..4a8f72c5bf2 100644
--- a/gcc/testsuite/gfortran.dg/ieee/signaling_2_c.c
+++ b/gcc/testsuite/gfortran.dg/ieee/signaling_2_c.c
@@ -1,3 +1,11 @@
+#ifdef __POWERPC__ // No support for issignaling in math.h on Darwin PPC
+
+int isnansf (float x)   { return __builtin_issignaling (x) ? 1 : 0; }
+int isnans  (double x)  { return __builtin_issignaling (x) ? 1 : 0; }
+int isnansl (long double x) { return __builtin_issignaling (x) ? 1 : 0; }
+
+#else
+
 #define _GNU_SOURCE
 #include 
 #include 
@@ -6,3 +14,4 @@ int isnansf (float x)   { return issignaling (x) ? 1 : 0; }
 int isnans  (double x)  { return issignaling (x) ? 1 : 0; }
 int isnansl (long double x) { return issignaling (x) ? 1 : 0; }

+#endif
diff --git a/libgfortran/config/fpu-macppc.h
b/libgfortran/config/fpu-macppc.h
new file mode 100644
index 000..d43d3caa3dd
--- /dev/null
+++ b/libgfortran/config/fpu-macppc.h
@@ -0,0 +1,414 @@
+/* FPU-related code for PowerPC.
+   Copyright (C) 2023-2024 Free Software Foundation, Inc.
+   Contributed by Sergey Fedorov 
+
+This file is part of the GNU Fortran runtime library (libgfortran).
+
+Libgfortran is free software; you can redistribute it and/or
+modify it under the terms of the GNU General Public
+License as published by the Free Software Foundation; either
+version 3 of the License, or (at your option) any later version.
+
+Libgfortran is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
+the GNU General Public License for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version 3.1,
+as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
+<http://www.gnu.org/licenses/>. */
+
+/* While some of the definitions and functions used here are available
+   with Apple libm, we on purpose avoid pulling it in: to avoid potential
+   conflicts and not self-impose unnecessary constraints. */
+
+/* FP exception flags */
+#define FE_INEXACT  0x0200
+#define FE_DIVBYZERO0x0400
+#define FE_UNDERFLOW0x0800
+#define FE_OVERFLOW 0x1000
+#define FE_INVALID  0x2000
+
+#define FE_ALL_EXCEPT   0x3E00
+#define FE_NO_EXCEPT0xC1FF
+
+/* Extra invalid flags */
+#define FE_INVALID_SNAN 0x0100
+#define FE_INVALID_ISI  0x0080
+#define FE_INVALID_IDI  0x0040
+#define FE_INVALID_ZDZ  0x0020
+#define FE_INVALID_IMZ  0x0010
+#define FE_INVALID_XVC  0x0008
+#define FE_INVALID_SOFT 0x0400
+#define FE_INVALID_SQRT 0x0200
+#define FE_INVALID_CVI  0x0100
+
+/* Rounding modes */
+#define FE_TONEAREST0x
+#define FE_TOWARDZERO   0x0001
+#define FE_UPWARD   0x0002
+#define FE_DOWNWARD 0x0003
+
+/* There is no consistency re what is to be included in all_invalid.
+   Apple libm has 0x01f80300, Openlibm/FreeBSD/NetBSD has 0x21f80700
+   and OpenBSD has 0x01f80700. In particular, FE_INVALID_SOFT and
+   FE_INVALID are contested. */
+#define FE_ALL_INVALID  0x01F80700
+#define FE_NO_INVALID   0xFE07F8FF
+
+/* Apple libm has 0xFFF80300 and 0x0007FCFF here. */
+#define FE_ALL_FLAGS0xFFF80700
+#define FE_NO_FLAGS 0x0007F8FF
+
+#define FE_ALL_RND  0x0003
+#define FE_NO_RND   0xFFFC
+
+/* Floating-point exception summary (FX) bit. */
+#define FE_SET_FX   0x8000
+#define FE_CLR_FX   0x7FFF
+
+/* Some implementations use FE_INVALID_SOFT here. */
+#define SET_INVALID 0x0100
+
+#define FE_EXCEPT_SHIFT 22
+#define EXCEPT_MASK (FE_ALL_EXCEPT >> FE_EXCEPT_SHIFT)
+
+typedef unsigned int    fenv_t;
+typedef unsigned int    fexcept_t;
+
+/* default environment object */
+extern const fenv_t _FE_DFL_ENV;
+/* pointer to default environment */
+#define FE_DFL_ENV  &_FE_DFL_ENV
+
+typedef union {
+struct {
+unsigned int hi;
+fenv_t   lo;
+} i;
+double   d;
+} hexdouble;
+
+#define HEXDOUBLE(hi, lo) {{ hi, lo }}
+
+
+/* Check we can actually store the FPU state in the allocated size. */
+_S

Re: [PATCH, gfortran] libgfortran: implement fpu-macppc for Darwin, support IEEE arithmetic

2024-07-04 Thread Sergey Fedorov
Below is the diff of tests for gfortran on powerpc-apple-darwin10.8.0
without the patch (unmodified upstream gcc) vs. with the gfortran patch added.

 === gfortran Summary ===

-# of expected passes69273
-# of unexpected failures36
+# of expected passes69646
+# of unexpected failures85
 # of unexpected successes12
 # of expected failures287
-# of unsupported tests357
+# of unsupported tests363

I.e., we get +373 passes and +49 failures. With one exception (see below),
the failures are in tests which were not run before.
AFAICT, the present failures are not specific to my patch, but rather caused
by constraints of 32-bit powerpc or some existing Darwin-specific issues: I
have tried a few versions of my patch, including an almost direct copy of
the SysV code, which is substantially different, and test results were
either identical or a few extra tests failed (the SysV-based version was
marginally worse, for instance); the tests which fail here failed
consistently for every version.

There is one test which changes from PASS to FAIL (round_4); however, I
think this is not a real regression, and the test in its present form is
expected to fail.
See comments in the test code in the source and my post:
https://gcc.gnu.org/pipermail/fortran/2024-July/060612.html

---
/Users/svacchanda/Develop/FORTRAN/gfortran_15.0.0_20240608_no_ieee.sum
2024-06-29 00:25:18.0 +0800
+++
/Users/svacchanda/Develop/FORTRAN/gfortran_15.0.0_20240608_ieee_added.sum
2024-06-29 08:11:29.0 +0800
@@ -1,4 +1,4 @@
-Test run by svacchanda on Fri Jun 28 23:25:22 2024
+Test run by svacchanda on Sat Jun 29 07:02:39 2024
 Native configuration is powerpc-apple-darwin10.8.0

 === gfortran tests ===
@@ -48800,17 +48800,17 @@
 PASS: gfortran.dg/round_3.f08   -Os  (test for excess errors)
 PASS: gfortran.dg/round_3.f08   -Os  execution test
 PASS: gfortran.dg/round_4.f90   -O0  (test for excess errors)
-PASS: gfortran.dg/round_4.f90   -O0  execution test
+FAIL: gfortran.dg/round_4.f90   -O0  execution test
 PASS: gfortran.dg/round_4.f90   -O1  (test for excess errors)
-PASS: gfortran.dg/round_4.f90   -O1  execution test
+FAIL: gfortran.dg/round_4.f90   -O1  execution test
 PASS: gfortran.dg/round_4.f90   -O2  (test for excess errors)
-PASS: gfortran.dg/round_4.f90   -O2  execution test
+FAIL: gfortran.dg/round_4.f90   -O2  execution test
 PASS: gfortran.dg/round_4.f90   -O3 -fomit-frame-pointer -funroll-loops
-fpeel-loops -ftracer -finline-functions  (test for excess errors)
-PASS: gfortran.dg/round_4.f90   -O3 -fomit-frame-pointer -funroll-loops
-fpeel-loops -ftracer -finline-functions  execution test
+FAIL: gfortran.dg/round_4.f90   -O3 -fomit-frame-pointer -funroll-loops
-fpeel-loops -ftracer -finline-functions  execution test
 PASS: gfortran.dg/round_4.f90   -O3 -g  (test for excess errors)
-PASS: gfortran.dg/round_4.f90   -O3 -g  execution test
+FAIL: gfortran.dg/round_4.f90   -O3 -g  execution test
 PASS: gfortran.dg/round_4.f90   -Os  (test for excess errors)
-PASS: gfortran.dg/round_4.f90   -Os  execution test
+FAIL: gfortran.dg/round_4.f90   -Os  execution test
 PASS: gfortran.dg/rrspacing_1.f90   -O0  (test for excess errors)
 PASS: gfortran.dg/rrspacing_1.f90   -O0  execution test
 PASS: gfortran.dg/rrspacing_1.f90   -O1  (test for excess errors)
@@ -49289,8 +49289,7 @@
 PASS: gfortran.dg/select_type_2.f03   -O1  execution test
 PASS: gfortran.dg/select_type_2.f03   -O2  (test for excess errors)
 PASS: gfortran.dg/select_type_2.f03   -O2  execution test
-WARNING: gfortran.dg/select_type_2.f03   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess
errors) program timed out.
-FAIL: gfortran.dg/select_type_2.f03   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess
errors)
+PASS: gfortran.dg/select_type_2.f03   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess
errors)
 PASS: gfortran.dg/select_type_2.f03   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
 PASS: gfortran.dg/select_type_2.f03   -O3 -g  (test for excess errors)
 PASS: gfortran.dg/select_type_2.f03   -O3 -g  execution test
@@ -64745,6 +64744,434 @@
 PASS: gfortran.dg/graphite/vect-pr94043.f90   -O  (test for excess errors)
 Running
/Users/svacchanda/Develop/gcc-git/gcc/testsuite/gfortran.dg/guality/guality.exp
...
 Running
/Users/svacchanda/Develop/gcc-git/gcc/testsuite/gfortran.dg/ieee/ieee.exp
...
+PASS: gfortran.dg/ieee/comparisons_1.f90   -O0  (test for excess errors)
+PASS: gfortran.dg/ieee/comparisons_1.f90   -O0  execution test
+PASS: gfortran.dg/ieee/comparisons_1.f90   -O1  (test for excess errors)
+PASS: gfortran.dg/ieee/comparisons_1.f90   -O1  execution test
+PASS: gfortran.dg/ieee/comparisons_1.f90   -O2  (test for excess errors)
+PASS: gfortran.dg/ieee/comparisons_1.f90   -O2  execution test
+PASS: gfortran.dg/ieee/compa

[pushed] wwwdocs: news: Remove some last links to our previous /java sub-site

2024-07-04 Thread Gerald Pfeifer
On the way, fix the spelling of SUSE in one case.

Pushed.
Gerald
---
 htdocs/news.html | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/htdocs/news.html b/htdocs/news.html
index e78abfc3..4a104520 100644
--- a/htdocs/news.html
+++ b/htdocs/news.html
@@ -10,8 +10,7 @@
 
 GCC news and announcements
 
-More current news, or some (older)
-GCJ news.
+More current news.
 
 
 
@@ -1068,8 +1067,8 @@ Annual GCC Developers' Summit, which took place May 
25-27, 2003.
 
 January 29, 2003
 
-Andrew Haley of Red Hat completed the work begun by Bo Thorsen of SuSE
-to port GCJ to the AMD x86-64
+Andrew Haley of Red Hat completed the work begun by Bo Thorsen of SUSE
+to port GCJ to the AMD x86-64
 architecture.  This is the first implementation of the Java
 programming language to be made available on that platform.  It will
 be part of the GCC 3.3 release.
@@ -1352,10 +1351,9 @@ stabilization for branching for GCC 3.0.
 
 December 19, 2000
 
-The runtime library for the 
-Java front end, libgcj, has been moved into the GCC 
-tree. This means that a separate download will no longer be required for 
-Java support.
+The runtime library for the Java front end, libgcj, has
+been moved into the GCC tree. A separate download is no longer required
+for Java support.
 
 
 December 4, 2000
-- 
2.45.2


Re: [PATCH, gfortran] libgfortran: implement fpu-macppc for Darwin, support IEEE arithmetic

2024-07-04 Thread FX Coudert
Hi,

The core of the powerpc-FPU manipulation is okay for me. Some comments below.


> --- a/gcc/testsuite/gfortran.dg/ieee/signaling_2_c.c
> +++ b/gcc/testsuite/gfortran.dg/ieee/signaling_2_c.c
> @@ -1,3 +1,11 @@
> +#ifdef __POWERPC__ // No support for issignaling in math.h on Darwin PPC

Two things:

1. I don’t understand why that needs to be added. The companion test in 
signaling_2.f90 has:

! { dg-require-effective-target issignaling } */
! The companion C source needs access to the issignaling macro.

Therefore, if the issignaling macro is not available, the test should not even 
be run.

2. Maybe this is actually moot. Is the __builtin_issignaling() built-in
available on all targets? The test was written before it was added to the
middle-end. If the answer to the previous question is yes, we should simply
use it instead of the macro.
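
I.e., something like the following sketch for the companion C file,
dropping the issignaling path entirely (assuming the built-in really is
available everywhere):

int isnansf (float x)       { return __builtin_issignaling (x); }
int isnans  (double x)      { return __builtin_issignaling (x); }
int isnansl (long double x) { return __builtin_issignaling (x); }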


> +/* There is no consistency re what is to be included in all_invalid.
> +   Apple libm has 0x01f80300, Openlibm/FreeBSD/NetBSD has 0x21f80700
> +   and OpenBSD has 0x01f80700. In particular, FE_INVALID_SOFT and
> +   FE_INVALID are contested. */
> +#define FE_ALL_INVALID  0x01F80700
> +#define FE_NO_INVALID   0xFE07F8FF
> +
> +/* Apple libm has 0xFFF80300 and 0x0007FCFF here. */
> +#define FE_ALL_FLAGS0xFFF80700
> +#define FE_NO_FLAGS 0x0007F8FF

Since it’s a darwin-only file, why not follow Apple libm conventions?


> There is one test which changes from PASS to FAIL (round_4), however I think 
> this is not a real regression, and the test in its present form is expected 
> to fail.
> See comments in the test code in the source and my post: 
> https://gcc.gnu.org/pipermail/fortran/2024-July/060612.html


I agree that round_4.f90 should be added to the dg-skip-if.
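
Something along these lines in round_4.f90, say (the target selector and
the reason string are only illustrative):

! { dg-skip-if "double-double rounding is not expected to work" { powerpc*-*-darwin* } }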
But I’d like to understand better what the other new failures are, and why
they arise.

FX

[pushed] wwwdocs: *: Normalize links to the www.gnu.org main page

2024-07-04 Thread Gerald Pfeifer
No need for a trailing slash, and switch to https.

---
 htdocs/gcc-3.0/index.html | 2 +-
 htdocs/gcc-3.1/index.html | 4 ++--
 htdocs/gccmission.html| 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/htdocs/gcc-3.0/index.html b/htdocs/gcc-3.0/index.html
index a1f454e0..7d39658b 100644
--- a/htdocs/gcc-3.0/index.html
+++ b/htdocs/gcc-3.0/index.html
@@ -16,7 +16,7 @@
 
 February 20, 2002
 
-The <a href="http://www.gnu.org/">GNU project</a> and the GCC
+The <a href="https://www.gnu.org">GNU project</a> and the GCC
 developers are pleased to announce the release of GCC 3.0.4, which is
 a bug-fix release for the GCC 3.0 series.
 
diff --git a/htdocs/gcc-3.1/index.html b/htdocs/gcc-3.1/index.html
index 0c85bcb9..6f25a878 100644
--- a/htdocs/gcc-3.1/index.html
+++ b/htdocs/gcc-3.1/index.html
@@ -16,14 +16,14 @@
 
 July 27, 2002
 
-The <a href="http://www.gnu.org/">GNU project</a> and the GCC
+The GNU project and the GCC
 developers are pleased to announce the release of GCC 3.1.1.
 
 The links below still apply to GCC 3.1.1.
 
 May 15, 2002
 
-The <a href="http://www.gnu.org/">GNU project</a> and the GCC
+The <a href="https://www.gnu.org">GNU project</a> and the GCC
 developers are pleased to announce the release of GCC 3.1.
 
 GCC used to stand for the GNU C Compiler, but since the compiler
diff --git a/htdocs/gccmission.html b/htdocs/gccmission.html
index 1124fe9f..2eff0453 100644
--- a/htdocs/gccmission.html
+++ b/htdocs/gccmission.html
@@ -11,7 +11,7 @@
 
 GCC Development Mission Statement (2021-06-01)
 
-GCC development is a part of the <a href="http://www.gnu.org/">GNU
+GCC development is a part of the <a href="https://www.gnu.org">GNU
 Project</a>, aiming to improve the compiler used in the GNU system
 including the GNU/Linux variant.
 The GCC development effort uses an open development environment and
-- 
2.45.2


Re: [x86 SSE PATCH] PR target/115751: Avoid force_reg in ix86_expand_ternlog.

2024-07-04 Thread Hongtao Liu
On Fri, Jul 5, 2024 at 2:54 AM Roger Sayle  wrote:
>
>
> This patch fixes a problem with splitting of complex AVX512 ternlog
> instructions on x86_64.  A recent change allows the ternlog pattern
> to have multiple mem-like operands prior to reload, by emitting any
> "reloads" as necessary during split1, before register allocation.
> The issue is that this code calls force_reg to place the mem-like
> operand into a register, but unfortunately the vec_duplicate (broadcast)
> form of operands supported by ternlog isn't considered a "general_operand",
> i.e. supported by all instructions.  This mismatch triggers an ICE in
> the middle-end's force_reg, even though the x86 supports loading these
> vec_duplicate operands into a vector register in a single (move)
> instruction.
>
> This patch resolves this problem by replacing force_reg with calls
> to gen_reg_rtx and emit_move (as the i386 backend, unlike the middle-end,
> knows these will be recognized by recog).
>
> I'll admit that I've been unable to reproduce this error without a
> testcase, but my assumption when developing the previous patch was
> that it was safe to call force_reg on a vec_duplicate, which this PR
> shows to be wrong (currently).  I'll let smarter minds pronounce on
> whether changing i386.md's definition of general_operand may be an
> alternate solution, but such a change can be independent of this fix.
> [I've a related patch to expand the CONST_VECTORs allowed in
> ix86_legitimate_constant_p before reload, but keeping everything
> happy is tricky].
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
Ok.
>
>
> 2024-07-04  Roger Sayle  
>
> gcc/ChangeLog
> PR target/115751
> * config/i386/i386-expand.c (ix86_expand_ternlog): Avoid use of
> force_reg to "reload" non-register operands, as these may contain
> vec_duplicate (broadcast) operands that aren't supported by
> force_reg.  Use (safer) gen_reg_rtx and emit_move instead.
>
>
> Thanks in advance (sorry for the inconvenience),
> Roger
> --
>


-- 
BR,
Hongtao


Re: [PATCH, gfortran] libgfortran: implement fpu-macppc for Darwin, support IEEE arithmetic

2024-07-04 Thread Sergey Fedorov
On Fri, Jul 5, 2024 at 5:23 AM FX Coudert  wrote:

> Hi,
> The core of the powerpc-FPU manipulation is okay for me. Some comments
> below.
>

Thank you for reviewing!


> > --- a/gcc/testsuite/gfortran.dg/ieee/signaling_2_c.c
> > +++ b/gcc/testsuite/gfortran.dg/ieee/signaling_2_c.c
> > @@ -1,3 +1,11 @@
> > +#ifdef __POWERPC__ // No support for issignaling in math.h on Darwin PPC
>
> Two things:
>
> 1. I don’t understand why that needs to be added. The companion test in
> signaling_2.f90 has:
>
> ! { dg-require-effective-target issignaling } */
> ! The companion C source needs access to the issignaling macro.
>
> Therefore, if the issignaling macro is not available, the test should not
> even be run.
>

This part of the patch is quite old, but from the remaining log it looks like
I got an error here:
```
issignaling2483.c: In function 'main':
issignaling2483.c:6:18: warning: implicit declaration of function
'issignaling' [-Wimplicit-function-declaration]
Undefined symbols:
  "_issignaling", referenced from:
  _main in ccORNr9V.o
ld: symbol(s) not found
collect2: error: ld returned 1 exit status
```
Now, on second thought, perhaps this did not require a fix. We can drop it.

> 2. Maybe this is actually moot. Is the __builtin_issignaling() macro
> available on all targets? The test was written before it was added to the
> middle-end. If the answer to the previous question is yes, we should simply
> use it instead of the macro.
>

I am not really sure about support for it across targets. (If we are
unclear on this, changes to tests themselves could be moved to a separate
issue, I guess.)

> > +/* There is no consistency re what is to be included in all_invalid.
> > +   Apple libm has 0x01f80300, Openlibm/FreeBSD/NetBSD has 0x21f80700
> > +   and OpenBSD has 0x01f80700. In particular, FE_INVALID_SOFT and
> > +   FE_INVALID are contested. */
> > +#define FE_ALL_INVALID  0x01F80700
> > +#define FE_NO_INVALID   0xFE07F8FF
> > +
> > +/* Apple libm has 0xFFF80300 and 0x0007FCFF here. */
> > +#define FE_ALL_FLAGS    0xFFF80700
> > +#define FE_NO_FLAGS     0x0007F8FF
>
> Since it’s a darwin-only file, why not follow Apple libm conventions?
>

I came to the conclusion that they are not exactly correct, and moreover
look somewhat internally inconsistent (i.e. different Apple headers disagree
with each other about assumptions on powerpc hardware capabilities).
However, I am no expert here, so please take it with a grain of salt.

Do we know exactly whether FE_ALL_INVALID should or should not include a)
VX and b) VX_SOFT bits?

(This could be a matter of purely academic relevance though, since I did
try switching the values for those macros to add/remove questionable bits,
and the only one to make a difference, FWIW, was the SET_INVALID define.)
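
For context, a minimal sketch of how such a mask pair is typically
consumed; get_fpscr/set_fpscr are hypothetical stand-ins for the actual
FPSCR accessors in the fpu-macppc implementation:

```
/* Illustrative only: test and clear the invalid-operation sticky bits
   using the mask pair quoted above.  */
#define FE_ALL_INVALID  0x01F80700
#define FE_NO_INVALID   0xFE07F8FF   /* == ~FE_ALL_INVALID */

extern unsigned int get_fpscr (void);   /* hypothetical accessor */
extern void set_fpscr (unsigned int);   /* hypothetical accessor */

static int
test_and_clear_invalid (void)
{
  unsigned int fpscr = get_fpscr ();
  int raised = (fpscr & FE_ALL_INVALID) != 0;
  set_fpscr (fpscr & FE_NO_INVALID);
  return raised;
}
```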

> > There is one test which changes from PASS to FAIL (round_4), however I
> > think this is not a real regression, and the test in its present form is
> > expected to fail.
> > See comments in the test code in the source and my post:
> > https://gcc.gnu.org/pipermail/fortran/2024-July/060612.html
>
>
> I agree that round_4.f90 should be added to the dg-skip-if.
> But I’d like to understand better what the other new failures are, and why
> they arise.
>
> FX


I suspect that a chunk of the failures is due to incorrect assumptions about
precision; there are also some obscure differences between the Power and
PowerPC architectures in regard to floating point, which might not be
accounted for in the tests.
Since I was getting consistent failures with different versions of the
patch, I am reasonably sure that the failures are not caused by fine details of
the implementation here; at least these tests failed on every attempt (and
aside from one version, extra failures, if present, were in single digit
numbers), and whenever the output contained numerical values, those also
matched, AFAICT.

I think I cannot attach a log here, even compressed, but I have put it
here:
http://macos-powerpc.org/gcc/gfortran/gfortran_macos_ppc_2024.06.28_ieee_added.log.zip

If you or anyone could help me to debug, I can rerun tests, modify
something etc. It is desirable to have it working correctly, of course.

By the way, do we have some point of comparison from other ppc32 platforms,
Linux or BSD (for the recent gcc master)?

Sergey


Re: [x86 SSE PATCH] PR target/115751: Avoid force_reg in ix86_expand_ternlog.

2024-07-04 Thread Hongtao Liu
On Fri, Jul 5, 2024 at 8:06 AM Hongtao Liu  wrote:
>
> On Fri, Jul 5, 2024 at 2:54 AM Roger Sayle  wrote:
> >
> >
> > This patch fixes a problem with splitting of complex AVX512 ternlog
> > instructions on x86_64.  A recent change allows the ternlog pattern
> > to have multiple mem-like operands prior to reload, by emitting any
> > "reloads" as necessary during split1, before register allocation.
> > The issue is that this code calls force_reg to place the mem-like
> > operand into a register, but unfortunately the vec_duplicate (broadcast)
> > form of operands supported by ternlog isn't considered a "general_operand",
> > i.e. supported by all instructions.  This mismatch triggers an ICE in
> > the middle-end's force_reg, even though the x86 supports loading these
> > vec_duplicate operands into a vector register in a single (move)
> > instruction.
> >
> > This patch resolves this problem by replacing force_reg with calls
> > to gen_reg_rtx and emit_move (as the i386 backend, unlike the middle-end,
> > knows these will be recognized by recog).
> >
> > I'll admit that I've been unable to reproduce this error without a
> > testcase, but my assumption when developing the previous patch was
> > that it was safe to call force_reg on a vec_duplicate, which this PR
> > shows to be wrong (currently).  I'll let smarter minds pronounce on
> > whether changing i386.md's definition of general_operand may be an
> > alternate solution, but such a change can be independent of this fix.
> > [I've a related patch to expand the CONST_VECTORs allowed in
> > ix86_legitimate_constant_p before reload, but keeping everything
> > happy is tricky].
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> Ok.
There are also some other places that use force_reg or gen_lowpart in
switch (idx & 0xff); I think they're also buggy when op is
vec_duplicate.
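
For reference, the safe reload shape being discussed is roughly the
following sketch; the helper name is hypothetical, not the literal patch:

```
/* Sketch only: reload an operand into a fresh pseudo without force_reg,
   which rejects vec_duplicate since it is not a general_operand.  */
static rtx
ternlog_reload_operand (rtx op, machine_mode mode)
{
  if (register_operand (op, mode))
    return op;
  rtx tmp = gen_reg_rtx (mode);   /* fresh pseudo of the vector mode */
  emit_move_insn (tmp, op);       /* single move insn, recog-able on i386 */
  return tmp;
}
```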

> >
> >
> > 2024-07-04  Roger Sayle  
> >
> > gcc/ChangeLog
> > PR target/115751
> > * config/i386/i386-expand.c (ix86_expand_ternlog): Avoid use of
> > force_reg to "reload" non-register operands, as these may contain
> > vec_duplicate (broadcast) operands that aren't supported by
> > force_reg.  Use (safer) gen_reg_rtx and emit_move instead.
> >
> >
> > Thanks in advance (sorry for the inconvenience),
> > Roger
> > --
> >
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


[PATCH v1] RISC-V: Implement .SAT_TRUNC for vector unsigned int

2024-07-04 Thread pan2 . li
From: Pan Li 

This patch would like to implement the .SAT_TRUNC for the RISC-V
backend, with the help of the RVV Vector Narrowing Fixed-Point
Clip instructions.  The below SEWs are supported:

* e64 => e32
* e64 => e16
* e64 => e8
* e32 => e16
* e32 => e8
* e16 => e8

Take below example to see the changes to asm.
Form 1:
  #define DEF_VEC_SAT_U_TRUNC_FMT_1(NT, WT) \
  void __attribute__((noinline))\
  vec_sat_u_trunc_##NT##_##WT##_fmt_1 (NT *out, WT *in, unsigned limit) \
  { \
unsigned i; \
for (i = 0; i < limit; i++) \
  { \
WT x = in[i];   \
bool overflow = x > (WT)(NT)(-1);   \
out[i] = ((NT)x) | (NT)-overflow;   \
  } \
  }

DEF_VEC_SAT_U_TRUNC_FMT_1 (uint32_t, uint64_t)
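
The branchless idiom in Form 1 relies on (NT)-overflow being all-ones
exactly when overflow is true; as a reading aid, here is a scalar sketch of
the same computation for the uint64_t to uint32_t instance:

```
#include <stdint.h>
#include <stdbool.h>

/* Illustrative scalar version of Form 1 for uint64_t -> uint32_t.  */
static inline uint32_t
sat_u_trunc_u64_u32 (uint64_t x)
{
  bool overflow = x > (uint64_t) UINT32_MAX;
  /* (uint32_t) -overflow is 0xffffffff on overflow, 0 otherwise, so the
     OR saturates the truncated value to the narrow maximum.  */
  return (uint32_t) x | (uint32_t) -overflow;
}
```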

Before this patch:
.L3:
  vsetvli  a5,a2,e64,m1,ta,ma
  vle64.v  v1,0(a1)
  vmsgtu.vv    v0,v1,v2
  vsetvli  zero,zero,e32,mf2,ta,ma
  vncvt.x.x.w  v1,v1
  vmerge.vim   v1,v1,-1,v0
  vse32.v  v1,0(a0)
  slli a4,a5,3
  add  a1,a1,a4
  slli a4,a5,2
  add  a0,a0,a4
  sub  a2,a2,a5
  bne  a2,zero,.L3

After this patch:
.L3:
  vsetvli  a5,a2,e32,mf2,ta,ma
  vle64.v  v1,0(a1)
  vnclipu.wi   v1,v1,0
  vse32.v  v1,0(a0)
  slli a4,a5,3
  add  a1,a1,a4
  slli a4,a5,2
  add  a0,a0,a4
  sub  a2,a2,a5
  bne  a2,zero,.L3

Passed the rv64gcv fully regression tests.

gcc/ChangeLog:

* config/riscv/autovec.md (ustrunc<mode><v_double_trunc>2): Add
new pattern for double truncation.
(ustrunc<mode><v_quad_trunc>2): Ditto but for quad truncation.
(ustrunc<mode><v_oct_trunc>2): Ditto but for oct truncation.
* config/riscv/riscv-protos.h (expand_vec_ustrunc): Add new decl
to expand vec ustrunc.
* config/riscv/riscv-v.cc (emit_vec_fixed_binary_rnu): Add new
help func to emit vnclipu.
(expand_vec_ustrunc): Add new func impl to expand vector ustrunc.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/binop/vec_sat_arith.h: Add helper
test macros.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_data.h: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-2.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-3.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-4.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-5.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-6.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-run-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-run-2.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-run-3.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-run-4.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-run-5.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-run-6.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vec_sat_unary_vv_run.h: New test.

Signed-off-by: Pan Li 
---
 gcc/config/riscv/autovec.md   |  34 ++
 gcc/config/riscv/riscv-protos.h   |   1 +
 gcc/config/riscv/riscv-v.cc   |  64 +++
 .../riscv/rvv/autovec/binop/vec_sat_arith.h   |  22 +
 .../riscv/rvv/autovec/unop/vec_sat_data.h | 394 ++
 .../rvv/autovec/unop/vec_sat_u_trunc-1.c  |  19 +
 .../rvv/autovec/unop/vec_sat_u_trunc-2.c  |  21 +
 .../rvv/autovec/unop/vec_sat_u_trunc-3.c  |  23 +
 .../rvv/autovec/unop/vec_sat_u_trunc-4.c  |  19 +
 .../rvv/autovec/unop/vec_sat_u_trunc-5.c  |  21 +
 .../rvv/autovec/unop/vec_sat_u_trunc-6.c  |  19 +
 .../rvv/autovec/unop/vec_sat_u_trunc-run-1.c  |  16 +
 .../rvv/autovec/unop/vec_sat_u_trunc-run-2.c  |  16 +
 .../rvv/autovec/unop/vec_sat_u_trunc-run-3.c  |  16 +
 .../rvv/autovec/unop/vec_sat_u_trunc-run-4.c  |  16 +
 .../rvv/autovec/unop/vec_sat_u_trunc-run-5.c  |  16 +
 .../rvv/autovec/unop/vec_sat_u_trunc-run-6.c  |  16 +
 .../rvv/autovec/unop/vec_sat_unary_vv_run.h   |  23 +
 18 files changed, 756 insertions(+)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vec_sat_data.h
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-1.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-2.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vec_sat_u_trunc-3.c
 create mo

Re: [PATCH] [i386] restore recompute to override opts after change [PR113719]

2024-07-04 Thread Alexandre Oliva
Hello, Rainer,

Thanks for the report!

On Jul  3, 2024, Rainer Orth  wrote:

> unfortunately this patch caused two regressions on Solaris/x86:

> FAIL: gcc.dg/ipa/iinline-attr.c scan-ipa-dump inline "hooray[^n]*inline 
> copy in test"

> both 32 and 64-bit.  Solaris/x86 does default to -fno-omit-frame-pointer.

> This failure was fixed before by

> commit 499d00127d39ba894b0f7216d73660b380bdc325
> Author: Hongyu Wang 
> Date:   Wed May 15 11:24:34 2024 +0800

> i386: Fix ix86_option override after change [PR 113719]

> Obviously the two patches interact badly.

Interesting.

The earlier patch had regressed this very test on other x86 targets
configured with --enable-frame-pointer.  The effect of this configure
flag should be the same as defaulting to -fno-omit-frame-pointer, no?  I
wonder what it is that Solaris does differently.  Hmm, I wonder if the
leaf frame pointer handling has something to do with that.  I've seen the
machinery for that enabling the -fomit-frame-pointer infrastructure with
some nonstandard value, which confused at least the dumping of active
optimization options.  I wonder if that's related to the failure on
Solaris.

I'll have a look, hopefully no later than next week.

-- 
Alexandre Oliva, happy hackerhttps://FSFLA.org/blogs/lxo/
   Free Software Activist   GNU Toolchain Engineer
More tolerance and less prejudice are key for inclusion and diversity
Excluding neuro-others for not behaving ""normal"" is *not* inclusive


Re: [PATCH 1/3] expr: Allow same precision modes conversion between {ibm_extended, ieee_quad}_format [PR112993]

2024-07-04 Thread Kewen.Lin
Hi Richard,

Thanks for the review comments!

on 2024/7/4 23:58, Richard Sandiford wrote:
> "Kewen.Lin"  writes:
>> Hi,
>>
>> With some historical reasons, rs6000 defines KFmode, TFmode
>> and IFmode to have different mode precision, but it causes
>> some issues and needs some workarounds such as r14-6478 for
>> PR112788.  So we are going to make all rs6000 128 bit scalar
>> FP modes have 128 bit precision.  Be prepared for that, this
>> patch is to make function convert_mode_scalar allow same
>> precision FP modes conversion if their underlying formats are
>> ibm_extended_format and ieee_quad_format respectively, just
>> like the existing special treatment on arm_bfloat_half_format
>> <-> ieee_half_format.  It also factors out all the relevant
>> checks into a lambda function.
>>
>> Bootstrapped and regtested on x86_64-redhat-linux and
>> powerpc64{,le}-linux-gnu.
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
>>  PR target/112993
>>
>> gcc/ChangeLog:
>>
>>  * expr.cc (convert_mode_scalar): Allow same precision conversion
>>  between scalar floating point modes if whose underlying format is
>>  ibm_extended_format or ieee_quad_format, and refactor assertion
>>  with new lambda function acceptable_same_precision_modes.
>> ---
>>  gcc/expr.cc | 30 --
>>  1 file changed, 24 insertions(+), 6 deletions(-)
>>
>> diff --git a/gcc/expr.cc b/gcc/expr.cc
>> index ffbac513692..eac4dcc982e 100644
>> --- a/gcc/expr.cc
>> +++ b/gcc/expr.cc
>> @@ -338,6 +338,29 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
>>enum rtx_code equiv_code = (unsignedp < 0 ? UNKNOWN
>>: (unsignedp ? ZERO_EXTEND : SIGN_EXTEND));
>>
>> +  auto acceptable_same_precision_modes
>> += [] (scalar_mode from_mode, scalar_mode to_mode) -> bool
>> +{
>> +  if (DECIMAL_FLOAT_MODE_P (from_mode) != DECIMAL_FLOAT_MODE_P (to_mode))
>> +return true;
>> +
>> +  /* arm_bfloat_half_format <-> ieee_half_format */
>> +  if ((REAL_MODE_FORMAT (from_mode) == &arm_bfloat_half_format
>> +   && REAL_MODE_FORMAT (to_mode) == &ieee_half_format)
>> +  || (REAL_MODE_FORMAT (to_mode) == &arm_bfloat_half_format
>> +  && REAL_MODE_FORMAT (from_mode) == &ieee_half_format))
>> +return true;
>> +
>> +  /* ibm_extended_format <-> ieee_quad_format */
>> +  if ((REAL_MODE_FORMAT (from_mode) == &ibm_extended_format
>> +   && REAL_MODE_FORMAT (to_mode) == &ieee_quad_format)
>> +  || (REAL_MODE_FORMAT (from_mode) == &ieee_quad_format
>> +  && REAL_MODE_FORMAT (to_mode) == &ibm_extended_format))
>> +return true;
>> +
>> +  return false;
>> +};
>> +
>>if (to_real)
>>  {
>>rtx value;
>> @@ -346,12 +369,7 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
>>
>>gcc_assert ((GET_MODE_PRECISION (from_mode)
>> != GET_MODE_PRECISION (to_mode))
>> -  || (DECIMAL_FLOAT_MODE_P (from_mode)
>> -  != DECIMAL_FLOAT_MODE_P (to_mode))
>> -  || (REAL_MODE_FORMAT (from_mode) == &arm_bfloat_half_format
>> -  && REAL_MODE_FORMAT (to_mode) == &ieee_half_format)
>> -  || (REAL_MODE_FORMAT (to_mode) == &arm_bfloat_half_format
>> -  && REAL_MODE_FORMAT (from_mode) == &ieee_half_format));
>> +  || acceptable_same_precision_modes (from_mode, to_mode));
>>
>>if (GET_MODE_PRECISION (from_mode) == GET_MODE_PRECISION (to_mode))
>>  {
>> --
>> 2.39.1
> 
> This part looks good to me FWIW, but what's the correct behaviour of:
> 
>   if (GET_MODE_PRECISION (from_mode) == GET_MODE_PRECISION (to_mode))
>   {
> if (REAL_MODE_FORMAT (to_mode) == &arm_bfloat_half_format
> && REAL_MODE_FORMAT (from_mode) == &ieee_half_format)
>   /* libgcc implements just __trunchfbf2, not __extendhfbf2.  */
>   tab = trunc_optab;
> else
>   /* Conversion between decimal float and binary float, same
>  size.  */
>   tab = DECIMAL_FLOAT_MODE_P (from_mode) ? trunc_optab : sext_optab;
> 
> for the new pairing?  The intent for bfloat/half seems to be that bfloat
> is treated as arbitrarily “lesser than” half, so half->bfloat is a
> truncation and bfloat->half is an extension.  It seems like it would be
> good to do something similar for the new pair, so that the float modes
> still form a total order in terms of extension & truncation.

Good question!  If I read it right, this special handling for half->bfloat is
to align with the previous implementation in libgcc and to be easy for
backporting, but isn't meant to make the modes form a total order, as per
Jakub's comments PR114907 #c6 and #c8.  Similar to half vs. bfloat, neither
of ibm128 and ieee128 is a subset or superset of the other; the current
behavior for this new pairing is that:
  1) define sext_optab for any two of TF/KF/IF (also bi-direction), since
     generic code above adopts sext_optab

RE: [PATCH v2] RISC-V: Implement the .SAT_TRUNC for scalar

2024-07-04 Thread Li, Pan2
Hi Jeff,

I had a try to only allow SI/DI modes in the iterator of the ustrunc2
pattern in the backend.  But then direct_internal_fn_supported_p returns
false when the middle end queries it for HImode, and so it of course fails
to detect the .SAT_TRUNC.

Indeed, most patterns in riscv.md take only GPR instead of ANYI, and I am not
sure whether we need to adjust the middle end for the fn_supported check
(I failed to find a similar case in tree-ssa-math-opts.cc).

Additionally, we may need to improve the usadd/ussub for almost the same 
scenarios.

Pan

-Original Message-
From: Li, Pan2 
Sent: Thursday, July 4, 2024 10:07 AM
To: Jeff Law ; gcc-patches@gcc.gnu.org
Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; rdapp@gmail.com
Subject: RE: [PATCH v2] RISC-V: Implement the .SAT_TRUNC for scalar

> But if you look at what the hardware can actually support, it doesn't 
> have HImode or QImode operations other than load/store and for rv64 
> there are no SImode logicals.

> That's what WORD_REGISTER_OPERATIONS is designed to support.  Regardless 
> of what happens at the source level, the generic parts of gimple->RTL 
> expansion arrange to widen the types appropriately.

> I haven't looked at the expansion of the SAT_* builtins, but the way 
> this is generally supposed to work is you just have to have your 
> expander only accept the modes the processor actually supports and 
> generic code will handle the widening for you.

Thanks Jeff.
Got it, you mean the widening ops will be covered automatically before
expanding.  I am not sure which part takes care of the auto-widening, but
the SAT_TRUNC expansion from the middle end may look like below.

Could you please help to enlighten me as to whether there is something missing here?

static void
match_unsigned_saturation_trunc (gimple_stmt_iterator *gsi, gassign *stmt)
{
  tree ops[1];
  tree lhs = gimple_assign_lhs (stmt);
  tree type = TREE_TYPE (lhs);

  if (gimple_unsigned_integer_sat_trunc (lhs, ops, NULL)
&& direct_internal_fn_supported_p (IFN_SAT_TRUNC,
   tree_pair (type, TREE_TYPE (ops[0])),
   OPTIMIZE_FOR_BOTH))
{
  gcall *call = gimple_build_call_internal (IFN_SAT_TRUNC, 1, ops[0]);
  gimple_call_set_lhs (call, lhs);
  gsi_replace (gsi, call, /* update_eh_info */ true);
}
}
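
(For reference, my understanding is that the supported check above roughly
reduces to a two-mode conversion optab lookup, sketched below; so if the
backend only provides ustrunc patterns for SI/DI, the HImode query yields
CODE_FOR_nothing and the match is rejected.)

```
/* Sketch of what direct_internal_fn_supported_p boils down to here.  */
bool supported
  = (convert_optab_handler (ustrunc_optab,
                            TYPE_MODE (type),               /* narrow dest */
                            TYPE_MODE (TREE_TYPE (ops[0]))) /* wide source */
     != CODE_FOR_nothing);
```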

Pan


-Original Message-
From: Jeff Law  
Sent: Thursday, July 4, 2024 9:52 AM
To: Li, Pan2 ; gcc-patches@gcc.gnu.org
Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; rdapp@gmail.com
Subject: Re: [PATCH v2] RISC-V: Implement the .SAT_TRUNC for scalar



On 7/3/24 6:48 PM, Li, Pan2 wrote:
> Thanks Jeff for comments.
> 
>> I would expect that QI/HI shouldn't be happening in practice due to the
>> definition of WORD_REGISTER_OPERATIONS.
> 
> Sorry I don't get the point here, I suppose there may be 6 kinds of 
> truncation for scalar.
> 
> uint64_t => uint32_t  
> uint64_t => uint16_t
> uint64_t => uint8_t
> uint32_t => uint16_t
> uint32_t => uint8_t
> uint16_t => uint8_t
> 
> Take uint16_t to uint8_t as example:
> 
> uint8_t   test (uint16_t x)
> {
>bool overflow = x > (uint16_t)(uint8_t)(-1);
>return ((uint8_t)x) | (uint8_t)-overflow;
> }
> 
> Will be expand to:
> 
> uint8_t sat_u_truc_uint16_t_to_uint8_t_fmt_1 (uint16_t x)
> {
>uint8_t _6;
>_6 = .SAT_TRUNC (x_4(D)); [tail call]
>return _6;
> }
> 
> Then we will have HImode as src and the QImode as the dest when enter 
> riscv_expand_ustrunc.
But if you look at what the hardware can actually support, it doesn't 
have HImode or QImode operations other than load/store and for rv64 
there are no SImode logicals.

That's what WORD_REGISTER_OPERATIONS is designed to support.  Regardless 
of what happens at the source level, the generic parts of gimple->RTL 
expansion arrange to widen the types appropriately.

I haven't looked at the expansion of the SAT_* builtins, but the way 
this is generally supposed to work is you just have to have your 
expander only accept the modes the processor actually supports and 
generic code will handle the widening for you.


Jeff



[PATCH] rs6000: Consider explicit VSX when masking off ALTIVEC [PR115688]

2024-07-04 Thread Kewen.Lin
Hi,

PR115688 exposes an inconsistent state in which we have VSX
enabled but ALTIVEC disabled.  There is one hunk:

  if (main_target_opt && !main_target_opt->x_rs6000_altivec_abi)
rs6000_isa_flags &= ~((OPTION_MASK_VSX | OPTION_MASK_ALTIVEC)
  & ~rs6000_isa_flags_explicit);

which disables both VSX and ALTIVEC together, considering only whether
they were explicitly set or not.  For the given case, VSX is explicitly
specified, while altivec is implicitly enabled as part of the set
ISA_2_6_MASKS_SERVER.  When falling into the above hunk, vsx is kept as
it's explicitly enabled, but altivec gets masked off, which is
unexpected.

This patch is to consider explicit VSX when masking off ALTIVEC, that
is, not to mask it off if TARGET_VSX is set and VSX was enabled
explicitly.

Bootstrapped and regtested on powerpc64-linux-gnu P8/P9 and
powerpc64le-linux-gnu P9 and P10.

I'm going to push this next week if no objections.

BR,
Kewen
-
PR target/115688

gcc/ChangeLog:

* config/rs6000/rs6000.cc (rs6000_option_override_internal): Consider
explicit VSX when masking off ALTIVEC.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr115688.c: New test.
---
 gcc/config/rs6000/rs6000.cc |  8 ++--
 gcc/testsuite/gcc.target/powerpc/pr115688.c | 14 ++
 2 files changed, 20 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr115688.c

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 58553ff66f4..2cbea6ea2d7 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -3933,8 +3933,12 @@ rs6000_option_override_internal (bool global_init_p)
  not for 32-bit.  Don't move this before the above code using ignore_masks,
  since it can reset the cleared VSX/ALTIVEC flag again.  */
   if (main_target_opt && !main_target_opt->x_rs6000_altivec_abi)
-rs6000_isa_flags &= ~((OPTION_MASK_VSX | OPTION_MASK_ALTIVEC)
- & ~rs6000_isa_flags_explicit);
+{
+  rs6000_isa_flags &= ~(OPTION_MASK_VSX & ~rs6000_isa_flags_explicit);
+  /* Don't mask off ALTIVEC if it is enabled by an explicit VSX.  */
+  if (!TARGET_VSX)
+   rs6000_isa_flags &= ~(OPTION_MASK_ALTIVEC & ~rs6000_isa_flags_explicit);
+}

   if (TARGET_CRYPTO && !TARGET_ALTIVEC)
 {
diff --git a/gcc/testsuite/gcc.target/powerpc/pr115688.c 
b/gcc/testsuite/gcc.target/powerpc/pr115688.c
new file mode 100644
index 000..5222e66ef17
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr115688.c
@@ -0,0 +1,14 @@
+/* { dg-do compile { target powerpc*-*-linux* } } */
+/* { dg-options "-mdejagnu-cpu=power5 -O2" } */
+
+/* Ignore some error messages on "target attribute or
+   pragma changes AltiVec ABI".  */
+/* { dg-excess-errors "pr115688" { target ilp32 } } */
+
+/* Verify there is no ICE under 32 bit env.  */
+
+__attribute__((target("vsx")))
+int test (void)
+{
+  return 0;
+}


[PATCH] i386: Refactor ssedoublemode

2024-07-04 Thread Hu, Lin1
Hi, all

ssedoublemode's double should mean a doubled element type, like SI -> DI.
And we need to refactor some patterns to use <ssedoublevecmode> instead
of <ssedoublemode>.

Bootstrapped and regtested on x86-64-linux-gnu, OK for trunk?

BRs,
Lin

gcc/ChangeLog:

* config/i386/sse.md (ssedoublemode): Fix the mode_attr.
---
 gcc/config/i386/sse.md | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index d71b0f2567e..d06ce94fa55 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -808,13 +808,12 @@ (define_mode_attr ssedoublemodelower
(V8HI "v8si")   (V16HI "v16si") (V32HI "v32si")
(V4SI "v4di")   (V8SI "v8di")   (V16SI "v16di")])
 
+;; ssedoublemode means vector mode with same number of double-size.
 (define_mode_attr ssedoublemode
-  [(V4SF "V8SF") (V8SF "V16SF") (V16SF "V32SF")
-   (V2DF "V4DF") (V4DF "V8DF") (V8DF "V16DF")
+  [(V4SF "V4DF") (V8SF "V8DF") (V16SF "V16DF")
(V16QI "V16HI") (V32QI "V32HI") (V64QI "V64HI")
(V8HI "V8SI") (V16HI "V16SI") (V32HI "V32SI")
-   (V4SI "V4DI") (V8SI "V16SI") (V16SI "V32SI")
-   (V4DI "V8DI") (V8DI "V16DI")])
+   (V4SI "V4DI") (V8SI "V8DI") (V16SI "V16DI")])
 
 (define_mode_attr ssebytemode
   [(V8DI "V64QI") (V4DI "V32QI") (V2DI "V16QI")
@@ -3319,7 +3318,7 @@ (define_split
 (define_split
   [(set (match_operand:VF_128_256 0 "register_operand")
(match_operator:VF_128_256 7 "addsub_vs_operator"
- [(vec_concat:<ssedoublemode>
+ [(vec_concat:<ssedoublevecmode>
 (minus:VF_128_256
   (match_operand:VF_128_256 1 "register_operand")
   (match_operand:VF_128_256 2 "vector_operand"))
@@ -3353,7 +3352,7 @@ (define_split
 (define_split
   [(set (match_operand:VF_128_256 0 "register_operand")
(match_operator:VF_128_256 7 "addsub_vs_operator"
- [(vec_concat:<ssedoublemode>
+ [(vec_concat:<ssedoublevecmode>
 (plus:VF_128_256
   (match_operand:VF_128_256 1 "vector_operand")
   (match_operand:VF_128_256 2 "vector_operand"))
@@ -19869,7 +19868,7 @@ (define_expand "avx512dq_shuf_<shuffletype>64x2_mask"
 (define_insn "<mask_codefor>avx512dq_shuf_<shuffletype>64x2_1<mask_name>"
   [(set (match_operand:VI8F_256 0 "register_operand" "=x,v")
(vec_select:VI8F_256
- (vec_concat:<ssedoublemode>
+ (vec_concat:<ssedoublevecmode>
(match_operand:VI8F_256 1 "register_operand" "x,v")
(match_operand:VI8F_256 2 "nonimmediate_operand" "xjm,vm"))
  (parallel [(match_operand 3 "const_0_to_3_operand")
@@ -19922,7 +19921,7 @@ (define_expand "avx512f_shuf_<shuffletype>64x2_mask"
 (define_insn "<mask_codefor>avx512f_shuf_<shuffletype>64x2_1<mask_name>"
   [(set (match_operand:V8FI 0 "register_operand" "=v")
(vec_select:V8FI
- (vec_concat:<ssedoublemode>
+ (vec_concat:<ssedoublevecmode>
(match_operand:V8FI 1 "register_operand" "v")
(match_operand:V8FI 2 "nonimmediate_operand" "vm"))
  (parallel [(match_operand 3 "const_0_to_7_operand")
@@ -20020,7 +20019,7 @@ (define_expand "avx512vl_shuf_<shuffletype>32x4_mask"
 (define_insn "<mask_codefor>avx512vl_shuf_<shuffletype>32x4_1<mask_name>"
   [(set (match_operand:VI4F_256 0 "register_operand" "=x,v")
(vec_select:VI4F_256
- (vec_concat:<ssedoublemode>
+ (vec_concat:<ssedoublevecmode>
(match_operand:VI4F_256 1 "register_operand" "x,v")
(match_operand:VI4F_256 2 "nonimmediate_operand" "xjm,vm"))
  (parallel [(match_operand 3 "const_0_to_7_operand")
@@ -20091,7 +20090,7 @@ (define_expand "avx512f_shuf_<shuffletype>32x4_mask"
 (define_insn "<mask_codefor>avx512f_shuf_<shuffletype>32x4_1<mask_name>"
   [(set (match_operand:V16FI 0 "register_operand" "=v")
(vec_select:V16FI
- (vec_concat:<ssedoublemode>
+ (vec_concat:<ssedoublevecmode>
(match_operand:V16FI 1 "register_operand" "v")
(match_operand:V16FI 2 "nonimmediate_operand" "vm"))
  (parallel [(match_operand 3 "const_0_to_15_operand")
-- 
2.31.1



Re: [PATCH] i386: Refactor ssedoublemode

2024-07-04 Thread Uros Bizjak
On Fri, Jul 5, 2024 at 7:48 AM Hu, Lin1  wrote:
>
> Hi, all
>
> ssedoublemode's double should mean a doubled element type, like SI -> DI.
> And we need to refactor some patterns to use <ssedoublevecmode> instead
> of <ssedoublemode>.
>
> Bootstrapped and regtested on x86-64-linux-gnu, OK for trunk?
>
> BRs,
> Lin
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (ssedoublemode): Fix the mode_attr.

Please be more descriptive, like ": Remove mappings to double of
elements". Please also add names of changed patterns to ChangeLog.

> ---
>  gcc/config/i386/sse.md | 19 +--
>  1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index d71b0f2567e..d06ce94fa55 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -808,13 +808,12 @@ (define_mode_attr ssedoublemodelower
> (V8HI "v8si")   (V16HI "v16si") (V32HI "v32si")
> (V4SI "v4di")   (V8SI "v8di")   (V16SI "v16di")])
>
> +;; ssedoublemode means vector mode with same number of double-size.

Better say: Map vector mode to the same number of double sized elements.

Uros.

>  (define_mode_attr ssedoublemode
> -  [(V4SF "V8SF") (V8SF "V16SF") (V16SF "V32SF")
> -   (V2DF "V4DF") (V4DF "V8DF") (V8DF "V16DF")
> +  [(V4SF "V4DF") (V8SF "V8DF") (V16SF "V16DF")
> (V16QI "V16HI") (V32QI "V32HI") (V64QI "V64HI")
> (V8HI "V8SI") (V16HI "V16SI") (V32HI "V32SI")
> -   (V4SI "V4DI") (V8SI "V16SI") (V16SI "V32SI")
> -   (V4DI "V8DI") (V8DI "V16DI")])
> +   (V4SI "V4DI") (V8SI "V8DI") (V16SI "V16DI")])
>
>  (define_mode_attr ssebytemode
>[(V8DI "V64QI") (V4DI "V32QI") (V2DI "V16QI")
> @@ -3319,7 +3318,7 @@ (define_split
>  (define_split
>[(set (match_operand:VF_128_256 0 "register_operand")
> (match_operator:VF_128_256 7 "addsub_vs_operator"
> - [(vec_concat:<ssedoublemode>
> + [(vec_concat:<ssedoublevecmode>
>  (minus:VF_128_256
>(match_operand:VF_128_256 1 "register_operand")
>(match_operand:VF_128_256 2 "vector_operand"))
> @@ -3353,7 +3352,7 @@ (define_split
>  (define_split
>[(set (match_operand:VF_128_256 0 "register_operand")
> (match_operator:VF_128_256 7 "addsub_vs_operator"
> - [(vec_concat:<ssedoublemode>
> + [(vec_concat:<ssedoublevecmode>
>  (plus:VF_128_256
>(match_operand:VF_128_256 1 "vector_operand")
>(match_operand:VF_128_256 2 "vector_operand"))
> @@ -19869,7 +19868,7 @@ (define_expand "avx512dq_shuf_<shuffletype>64x2_mask"
>  (define_insn "<mask_codefor>avx512dq_shuf_<shuffletype>64x2_1<mask_name>"
>[(set (match_operand:VI8F_256 0 "register_operand" "=x,v")
> (vec_select:VI8F_256
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:VI8F_256 1 "register_operand" "x,v")
> (match_operand:VI8F_256 2 "nonimmediate_operand" "xjm,vm"))
>   (parallel [(match_operand 3 "const_0_to_3_operand")
> @@ -19922,7 +19921,7 @@ (define_expand "avx512f_shuf_<shuffletype>64x2_mask"
>  (define_insn "<mask_codefor>avx512f_shuf_<shuffletype>64x2_1<mask_name>"
>[(set (match_operand:V8FI 0 "register_operand" "=v")
> (vec_select:V8FI
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:V8FI 1 "register_operand" "v")
> (match_operand:V8FI 2 "nonimmediate_operand" "vm"))
>   (parallel [(match_operand 3 "const_0_to_7_operand")
> @@ -20020,7 +20019,7 @@ (define_expand "avx512vl_shuf_<shuffletype>32x4_mask"
>  (define_insn "<mask_codefor>avx512vl_shuf_<shuffletype>32x4_1<mask_name>"
>[(set (match_operand:VI4F_256 0 "register_operand" "=x,v")
> (vec_select:VI4F_256
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:VI4F_256 1 "register_operand" "x,v")
> (match_operand:VI4F_256 2 "nonimmediate_operand" "xjm,vm"))
>   (parallel [(match_operand 3 "const_0_to_7_operand")
> @@ -20091,7 +20090,7 @@ (define_expand "avx512f_shuf_<shuffletype>32x4_mask"
>  (define_insn "<mask_codefor>avx512f_shuf_<shuffletype>32x4_1<mask_name>"
>[(set (match_operand:V16FI 0 "register_operand" "=v")
> (vec_select:V16FI
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:V16FI 1 "register_operand" "v")
> (match_operand:V16FI 2 "nonimmediate_operand" "vm"))
>   (parallel [(match_operand 3 "const_0_to_15_operand")
> --
> 2.31.1
>