Re: [PATCH] Add emulated gather capability to the vectorizer

2021-08-02 Thread Richard Biener
On Mon, 2 Aug 2021, Kewen.Lin wrote:

> > on 2021/7/30 10:04 PM, Kewen.Lin via Gcc-patches wrote:
> > Hi Richi,
> > 
> >> on 2021/7/30 7:34 PM, Richard Biener wrote:
> >> This adds a gather vectorization capability to the vectorizer
> >> without target support by decomposing the offset vector, doing
> >> scalar loads and then building a vector from the result.  This
> >> is aimed mainly at cases where vectorizing the rest of the loop
> >> offsets the cost of vectorizing the gather.
> >>
> >> Note it's difficult to avoid vectorizing the offset load, but in
> >> some cases later passes can turn the vector load + extract into
> >> scalar loads, see the followup patch.
> >>
> >> On SPEC CPU 2017 510.parest_r this improves runtime from 250s
> >> to 219s on a Zen2 CPU which has its native gather instructions
> >> disabled (using those the runtime instead increases to 254s)
> >> using -Ofast -march=znver2 [-flto].  It turns out the critical
> >> loops in this benchmark all perform gather operations.
> >>
> > 
> > Wow, it sounds promising!
> > 
> >> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >>
> >> Any comments?  I still plan to run this over full SPEC and
> >> I have to apply TLC to the followup patch before I can post it.
> >>
> >> I think neither power nor z has gather so I'm curious if the
> >> patch helps 510.parest_r there, I'm unsure about neon/advsimd.
> > 
> > Yes, Power (latest Power10) doesn't support gather load.
> > I just measured 510.parest_r with this patch on Power9 at option
> > -Ofast -mcpu=power9 {,-funroll-loops}, both are neutral.
> > 
> > It fails to vectorize the loop in vect-gather-1.c:
> > 
> > vect-gather.c:12:28: missed:  failed: evolution of base is not affine.
> > vect-gather.c:11:46: missed:   not vectorized: data ref analysis failed _6 
> > = *_5;
> > vect-gather.c:12:28: missed:   not vectorized: data ref analysis failed: _6 
> > = *_5;
> > vect-gather.c:11:46: missed:  bad data references.
> > vect-gather.c:11:46: missed: couldn't vectorize loop
> > 
> 
> By further investigation, it's because rs6000 fails to make
> maybe_gather true in:
> 
> bool maybe_gather
>   = DR_IS_READ (dr)
> && !TREE_THIS_VOLATILE (DR_REF (dr))
> && (targetm.vectorize.builtin_gather != NULL
> || supports_vec_gather_load_p ());
> 
> With the hacking defining TARGET_VECTORIZE_BUILTIN_GATHER (as
> well as TARGET_VECTORIZE_BUILTIN_SCATTER) for rs6000, the
> case gets vectorized as expected.

Ah, yeah - this check needs to be elided.  Checking with a cross
that's indeed what is missing.
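Concretely, eliding it here means dropping the target capability part of
the check quoted above (in vect_analyze_data_refs, IIRC), so that it
becomes roughly

  bool maybe_gather
    = DR_IS_READ (dr)
      && !TREE_THIS_VOLATILE (DR_REF (dr));

i.e. gather detection no longer depends on targetm.vectorize.builtin_gather
or supports_vec_gather_load_p ().  That's just a sketch of the direction,
not the final hunk.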

> But re-evaluated 510.parest_r with this extra hacking, the
> runtime performance doesn't have any changes.

OK, it likely needs the followup as well then.  For the added
testcase I end up with

.L4:
lwz %r10,8(%r9)
lwz %r31,12(%r9)
addi %r8,%r8,32
addi %r9,%r9,16
lwz %r7,-16(%r9)
lwz %r0,-12(%r9)
lxv %vs10,-16(%r8)
lxv %vs12,-32(%r8)
extswsli %r10,%r10,3
extswsli %r31,%r31,3
extswsli %r7,%r7,3
extswsli %r0,%r0,3
ldx %r10,%r6,%r10
ldx %r31,%r6,%r31
ldx %r7,%r6,%r7
ldx %r12,%r6,%r0
mtvsrdd %vs0,%r31,%r10
mtvsrdd %vs11,%r12,%r7
xvmuldp %vs0,%vs0,%vs10
xvmaddadp %vs0,%vs12,%vs11
xvadddp %vs32,%vs32,%vs0
bdnz .L4

as the inner vectorized loop.  Looks like power doesn't have
extending SI->DImode loads?  Otherwise the code looks
straight-forward (I've used -mcpu=power10), but eventually
the whole gather, which I think boils down to

lwz %r10,8(%r9)
lwz %r31,12(%r9)
addi %r9,%r9,16
lwz %r7,-16(%r9)
lwz %r0,-12(%r9)
extswsli %r10,%r10,3
extswsli %r31,%r31,3
extswsli %r7,%r7,3
extswsli %r0,%r0,3
ldx %r10,%r6,%r10
ldx %r31,%r6,%r31
ldx %r7,%r6,%r7
ldx %r12,%r6,%r0
mtvsrdd %vs0,%r31,%r10
mtvsrdd %vs11,%r12,%r7

is too difficult for the integer/load side of the pipeline.  It's
of course the same in the scalar loop but there the FP ops only
need to wait for one lane to complete.  Note the above is with
the followup patch.
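To spell out what the emulation does, the kernel corresponds roughly to
the hand-written sketch below (GCC vector extensions, VF 2 for brevity
while the loop above handles four elements per iteration; names are made
up and this is not the GIMPLE the vectorizer actually emits):

  typedef int v2si __attribute__ ((vector_size (8)));
  typedef double v2df __attribute__ ((vector_size (16)));

  /* Sketch of the emulated gather for res += *luval * dst[*col]:
     the offsets are still loaded as a vector, each lane is extracted
     and used for a scalar load, and the scalar results are put back
     into a vector so the multiply/add stays vectorized.  Assumes n
     is a multiple of 2.  */
  double
  vmul_emulated (const int *col, const double *luval, const double *dst, int n)
  {
    v2df acc = { 0.0, 0.0 };
    for (int i = 0; i < n; i += 2)
      {
        v2si idx;
        __builtin_memcpy (&idx, col + i, sizeof idx);  /* vector offset load */
        double lane0 = dst[idx[0]];                    /* scalar load, lane 0 */
        double lane1 = dst[idx[1]];                    /* scalar load, lane 1 */
        v2df gathered = { lane0, lane1 };              /* vector CONSTRUCTOR */
        v2df lu;
        __builtin_memcpy (&lu, luval + i, sizeof lu);
        acc += lu * gathered;
      }
    return acc[0] + acc[1];
  }

Only the element loads from dst stay scalar; the offset load, the vector
construction and the FP math remain vector operations.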

Richard.

> BR,
> Kewen
> 
> >> Both might need the followup patch - I was surprised about
> >> the speedup without it on Zen (the followup improves runtime
> >> to 198s there).
> >>
> >> Thanks,
> >> Richard.
> >>
> >> 2021-07-30  Richard Biener  
> >>
> >>* tree-vect-data-refs.c (vect_check_gather_scatter):
> >>Include widening conversions only when the result is
> >>still handled by native gather or the current offset
> >>size does not already match the data size.
> >>Also succeed analysis in case there's no native support,
> >>noted by an IFN_LAST ifn and a NULL decl.
> >>* tree-vect-patterns.c (vect_recog_gather_scatter_pattern):
> >>Test for no IFN gather rather than decl gather.
> >>* tree-vect-stmts.c (vect_model_load_cost): 

Re: [PATCH] Add emulated gather capability to the vectorizer

2021-08-02 Thread Richard Biener
On Fri, 30 Jul 2021, Richard Sandiford wrote:

> Richard Biener  writes:
> > This adds a gather vectorization capability to the vectorizer
> > without target support by decomposing the offset vector, doing
> > scalar loads and then building a vector from the result.  This
> > is aimed mainly at cases where vectorizing the rest of the loop
> > offsets the cost of vectorizing the gather.
> >
> > Note it's difficult to avoid vectorizing the offset load, but in
> > some cases later passes can turn the vector load + extract into
> > scalar loads, see the followup patch.
> >
> > On SPEC CPU 2017 510.parest_r this improves runtime from 250s
> > to 219s on a Zen2 CPU which has its native gather instructions
> > disabled (using those the runtime instead increases to 254s)
> > using -Ofast -march=znver2 [-flto].  It turns out the critical
> > loops in this benchmark all perform gather operations.
> 
> Nice!
> 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > Any comments?  I still plan to run this over full SPEC and
> > I have to apply TLC to the followup patch before I can post it.
> >
> > I think neither power nor z has gather so I'm curious if the
> > patch helps 510.parest_r there, I'm unsure about neon/advsimd.
> > Both might need the followup patch - I was surprised about
> > the speedup without it on Zen (the followup improves runtime
> > to 198s there).
> >
> > Thanks,
> > Richard.
> >
> > 2021-07-30  Richard Biener  
> >
> > * tree-vect-data-refs.c (vect_check_gather_scatter):
> > Include widening conversions only when the result is
> > still handled by native gather or the current offset
> > size does not already match the data size.
> > Also succeed analysis in case there's no native support,
> > noted by an IFN_LAST ifn and a NULL decl.
> > * tree-vect-patterns.c (vect_recog_gather_scatter_pattern):
> > Test for no IFN gather rather than decl gather.
> > * tree-vect-stmts.c (vect_model_load_cost): Pass in the
> > gather-scatter info and cost emulated gathers accordingly.
> > (vect_truncate_gather_scatter_offset): Properly test for
> > no IFN gather.
> > (vect_use_strided_gather_scatters_p): Likewise.
> > (get_load_store_type): Handle emulated gathers and its
> > restrictions.
> > (vectorizable_load): Likewise.  Emulate them by extracting
> > scalar offsets, doing scalar loads and a vector construct.
> >
> > * gcc.target/i386/vect-gather-1.c: New testcase.
> > * gfortran.dg/vect/vect-8.f90: Adjust.
> > ---
> >  gcc/testsuite/gcc.target/i386/vect-gather-1.c |  18 
> >  gcc/testsuite/gfortran.dg/vect/vect-8.f90 |   2 +-
> >  gcc/tree-vect-data-refs.c |  29 +++--
> >  gcc/tree-vect-patterns.c  |   2 +-
> >  gcc/tree-vect-stmts.c | 100 --
> >  5 files changed, 136 insertions(+), 15 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect-gather-1.c
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/vect-gather-1.c 
> > b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> > new file mode 100644
> > index 000..134aef39666
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details" } */
> > +
> > +#ifndef INDEXTYPE
> > +#define INDEXTYPE int
> > +#endif
> > +double vmul(INDEXTYPE *rowstart, INDEXTYPE *rowend,
> > +   double *luval, double *dst)
> > +{
> > +  double res = 0;
> > +  for (const INDEXTYPE * col = rowstart; col != rowend; ++col, ++luval)
> > +res += *luval * dst[*col];
> > +  return res;
> > +}
> > +
> > +/* With gather emulation this should be profitable to vectorize
> > +   even with plain SSE2.  */
> > +/* { dg-final { scan-tree-dump "loop vectorized" "vect" } } */
> > diff --git a/gcc/testsuite/gfortran.dg/vect/vect-8.f90 
> > b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> > index 9994805d77f..cc1aebfbd84 100644
> > --- a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> > +++ b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> > @@ -706,5 +706,5 @@ END SUBROUTINE kernel
> >  
> >  ! { dg-final { scan-tree-dump-times "vectorized 24 loops" 1 "vect" { 
> > target aarch64_sve } } }
> >  ! { dg-final { scan-tree-dump-times "vectorized 23 loops" 1 "vect" { 
> > target { aarch64*-*-* && { ! aarch64_sve } } } } }
> > -! { dg-final { scan-tree-dump-times "vectorized 2\[23\] loops" 1 "vect" { 
> > target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
> > +! { dg-final { scan-tree-dump-times "vectorized 2\[234\] loops" 1 "vect" { 
> > target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
> >  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { 
> > target { { ! vect_intdouble_cvt } && { ! aarch64*-*-* } } } } }
> > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> > index 6995efba899..0279e75fa8e 100644
> > --- a/gcc/tree-vect-data-refs.c
> > +++

Re: [PATCH] Add emulated gather capability to the vectorizer

2021-08-02 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Fri, 30 Jul 2021, Richard Sandiford wrote:
>> > @@ -9456,6 +9499,51 @@ vectorizable_load (vec_info *vinfo,
>> >data_ref = NULL_TREE;
>> >break;
>> >  }
>> > +  else if (memory_access_type == VMAT_GATHER_SCATTER)
>> > +{
>> > +  /* Emulated gather-scatter.  */
>> > +  gcc_assert (!final_mask);
>> 
>> For this to be safe, we need to make sure that
>> check_load_store_for_partial_vectors clears
>> LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P.  Is that already guaranteed?
>> (Maybe yes, I didn't look closely. :-))
>
> I've made sure to fail earlier:
>
> @@ -8692,6 +8725,15 @@ vectorizable_load (vec_info *vinfo,
>  "unsupported access type for masked 
> load.\n");
>   return false;
> }
> +  else if (memory_access_type == VMAT_GATHER_SCATTER
> +  && gs_info.ifn == IFN_LAST
> +  && !gs_info.decl)
> +   {
> + if (dump_enabled_p ())
> +   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +"unsupported masked emulated gather.\n");
> + return false;
> +   }
>  }
>
> at least I don't see a way to emulate masked gathers ;)

Yeah, that handles masked loads created by ifcvt.  But we also need to
protect against an unconditional load being converted into a predicated
load for partial loop vectorisation.  I think we still need a check in
check_load_store_for_partial_vectors to handle that.  The check wouldn't
prevent vectorisation, it would just prevent using fully-predicated loops.

Thanks,
Richard


Re: [PATCH] Add emulated gather capability to the vectorizer

2021-08-02 Thread Richard Biener
On Mon, 2 Aug 2021, Richard Sandiford wrote:

> Richard Biener  writes:
> > On Fri, 30 Jul 2021, Richard Sandiford wrote:
> >> > @@ -9456,6 +9499,51 @@ vectorizable_load (vec_info *vinfo,
> >> >  data_ref = NULL_TREE;
> >> >  break;
> >> >}
> >> > +else if (memory_access_type == VMAT_GATHER_SCATTER)
> >> > +  {
> >> > +/* Emulated gather-scatter.  */
> >> > +gcc_assert (!final_mask);
> >> 
> >> For this to be safe, we need to make sure that
> >> check_load_store_for_partial_vectors clears
> >> LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P.  Is that already guaranteed?
> >> (Maybe yes, I didn't look closely. :-))
> >
> > I've made sure to fail earlier:
> >
> > @@ -8692,6 +8725,15 @@ vectorizable_load (vec_info *vinfo,
> >  "unsupported access type for masked 
> > load.\n");
> >   return false;
> > }
> > +  else if (memory_access_type == VMAT_GATHER_SCATTER
> > +  && gs_info.ifn == IFN_LAST
> > +  && !gs_info.decl)
> > +   {
> > + if (dump_enabled_p ())
> > +   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +"unsupported masked emulated gather.\n");
> > + return false;
> > +   }
> >  }
> >
> > at least I don't see a way to emulate masked gathers ;)
> 
> Yeah, that handles masked loads created by ifcvt.  But we also need to
> protect against an unconditional load being converted into a predicated
> load for partial loop vectorisation.  I think we still need a check in
> check_load_store_for_partial_vectors to handle that.  The check wouldn't
> prevent vectorisation, it would just prevent using fully-predicated loops.

But that seems to check for availability of the masked gather IFN which
of course isn't available and then sets 
LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P to false:

  if (memory_access_type == VMAT_LOAD_STORE_LANES)
{
  if (is_load
  ? !vect_load_lanes_supported (vectype, group_size, true)
  : !vect_store_lanes_supported (vectype, group_size, true))
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 "can't operate on partial vectors because"
 " the target doesn't have an appropriate"
 " load/store-lanes instruction.\n");
  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
  return;

Richard.

> Thanks,
> Richard
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


Re: [PATCH] Add emulated gather capability to the vectorizer

2021-08-02 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Mon, 2 Aug 2021, Richard Sandiford wrote:
>
>> Richard Biener  writes:
>> > On Fri, 30 Jul 2021, Richard Sandiford wrote:
>> >> > @@ -9456,6 +9499,51 @@ vectorizable_load (vec_info *vinfo,
>> >> > data_ref = NULL_TREE;
>> >> > break;
>> >> >   }
>> >> > +   else if (memory_access_type == VMAT_GATHER_SCATTER)
>> >> > + {
>> >> > +   /* Emulated gather-scatter.  */
>> >> > +   gcc_assert (!final_mask);
>> >> 
>> >> For this to be safe, we need to make sure that
>> >> check_load_store_for_partial_vectors clears
>> >> LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P.  Is that already guaranteed?
>> >> (Maybe yes, I didn't look closely. :-))
>> >
>> > I've made sure to fail earlier:
>> >
>> > @@ -8692,6 +8725,15 @@ vectorizable_load (vec_info *vinfo,
>> >  "unsupported access type for masked 
>> > load.\n");
>> >   return false;
>> > }
>> > +  else if (memory_access_type == VMAT_GATHER_SCATTER
>> > +  && gs_info.ifn == IFN_LAST
>> > +  && !gs_info.decl)
>> > +   {
>> > + if (dump_enabled_p ())
>> > +   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> > +"unsupported masked emulated gather.\n");
>> > + return false;
>> > +   }
>> >  }
>> >
>> > at least I don't see a way to emulate masked gathers ;)
>> 
>> Yeah, that handles masked loads created by ifcvt.  But we also need to
>> protect against an unconditional load being converted into a predicated
>> load for partial loop vectorisation.  I think we still need a check in
>> check_load_store_for_partial_vectors to handle that.  The check wouldn't
>> prevent vectorisation, it would just prevent using fully-predicated loops.
>
> But that seems to check for availability of the masked gather IFN which
> of course isn't available and then sets 
> LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P to false:

Ah, yeah, OK.  I wrongly thought that the relaxed check for gather/scatter
support might have affected that too.

Should be OK then. :-)

Richard


Re: [PATCH] Add emulated gather capability to the vectorizer

2021-08-02 Thread Kewen.Lin via Gcc-patches
on 2021/8/2 3:09 PM, Richard Biener wrote:
> On Mon, 2 Aug 2021, Kewen.Lin wrote:
> 
>> on 2021/7/30 10:04 PM, Kewen.Lin via Gcc-patches wrote:
>>> Hi Richi,
>>>
>>> on 2021/7/30 7:34 PM, Richard Biener wrote:
 This adds a gather vectorization capability to the vectorizer
 without target support by decomposing the offset vector, doing
 scalar loads and then building a vector from the result.  This
 is aimed mainly at cases where vectorizing the rest of the loop
 offsets the cost of vectorizing the gather.

 Note it's difficult to avoid vectorizing the offset load, but in
 some cases later passes can turn the vector load + extract into
 scalar loads, see the followup patch.

 On SPEC CPU 2017 510.parest_r this improves runtime from 250s
 to 219s on a Zen2 CPU which has its native gather instructions
 disabled (using those the runtime instead increases to 254s)
 using -Ofast -march=znver2 [-flto].  It turns out the critical
 loops in this benchmark all perform gather operations.

>>>
>>> Wow, it sounds promising!
>>>
 Bootstrapped and tested on x86_64-unknown-linux-gnu.

 Any comments?  I still plan to run this over full SPEC and
 I have to apply TLC to the followup patch before I can post it.

 I think neither power nor z has gather so I'm curious if the
 patch helps 510.parest_r there, I'm unsure about neon/advsimd.
>>>
>>> Yes, Power (latest Power10) doesn't support gather load.
>>> I just measured 510.parest_r with this patch on Power9 at option
>>> -Ofast -mcpu=power9 {,-funroll-loops}, both are neutral.
>>>
>>> It fails to vectorize the loop in vect-gather-1.c:
>>>
>>> vect-gather.c:12:28: missed:  failed: evolution of base is not affine.
>>> vect-gather.c:11:46: missed:   not vectorized: data ref analysis failed _6 
>>> = *_5;
>>> vect-gather.c:12:28: missed:   not vectorized: data ref analysis failed: _6 
>>> = *_5;
>>> vect-gather.c:11:46: missed:  bad data references.
>>> vect-gather.c:11:46: missed: couldn't vectorize loop
>>>
>>
>> By further investigation, it's because rs6000 fails to make
>> maybe_gather true in:
>>
>>bool maybe_gather
>>  = DR_IS_READ (dr)
>>&& !TREE_THIS_VOLATILE (DR_REF (dr))
>>&& (targetm.vectorize.builtin_gather != NULL
>>|| supports_vec_gather_load_p ());
>>
>> With the hacking defining TARGET_VECTORIZE_BUILTIN_GATHER (as
>> well as TARGET_VECTORIZE_BUILTIN_SCATTER) for rs6000, the
>> case gets vectorized as expected.
> 
> Ah, yeah - this check needs to be elided.  Checking with a cross
> that's indeed what is missing.
> 
>> But re-evaluated 510.parest_r with this extra hacking, the
>> runtime performance doesn't have any changes.
> 
> OK, it likely needs the followup as well then.  For the added
> testcase I end up with
> 
> .L4:
> lwz %r10,8(%r9)
> lwz %r31,12(%r9)
> addi %r8,%r8,32
> addi %r9,%r9,16
> lwz %r7,-16(%r9)
> lwz %r0,-12(%r9)
> lxv %vs10,-16(%r8)
> lxv %vs12,-32(%r8)
> extswsli %r10,%r10,3
> extswsli %r31,%r31,3
> extswsli %r7,%r7,3
> extswsli %r0,%r0,3
> ldx %r10,%r6,%r10
> ldx %r31,%r6,%r31
> ldx %r7,%r6,%r7
> ldx %r12,%r6,%r0
> mtvsrdd %vs0,%r31,%r10
> mtvsrdd %vs11,%r12,%r7
> xvmuldp %vs0,%vs0,%vs10
> xvmaddadp %vs0,%vs12,%vs11
> xvadddp %vs32,%vs32,%vs0
> bdnz .L4
> 
> as the inner vectorized loop.  Looks like power doesn't have
> extending SI->DImode loads?  Otherwise the code looks

You meant one instruction doing both the left shift and the load?
If so, it doesn't.  btw, the above code sequence looks nice!
> straight-forward (I've used -mcpu=power10), but eventually
> the whole gather, which I think boils down to
> 
> lwz %r10,8(%r9)
> lwz %r31,12(%r9)
> addi %r9,%r9,16
> lwz %r7,-16(%r9)
> lwz %r0,-12(%r9)
> extswsli %r10,%r10,3
> extswsli %r31,%r31,3
> extswsli %r7,%r7,3
> extswsli %r0,%r0,3
> ldx %r10,%r6,%r10
> ldx %r31,%r6,%r31
> ldx %r7,%r6,%r7
> ldx %r12,%r6,%r0
> mtvsrdd %vs0,%r31,%r10
> mtvsrdd %vs11,%r12,%r7
> 
> is too difficult for the integer/load side of the pipeline.  It's
> of course the same in the scalar loop but there the FP ops only
> need to wait for one lane to complete.  Note the above is with
> the followup patch.
> 

Good point!  Will check the actual effect after this and its follow-up
are landed; it may need some cost tweaking.

BR,
Kewen

> Richard.
> 
>> BR,
>> Kewen
>>
 Both might need the followup patch - I was surprised about
 the speedup without it on Zen (the followup improves runtime
 to 198s there).

 Thanks,
 Richard.

 2021-07-30  Richard Biener  

* tree-vect-data-refs.c (vect_check_gather_scatter):
Include widening conversions o

Re: [PATCH] Add emulated gather capability to the vectorizer

2021-08-02 Thread Richard Biener
On Mon, 2 Aug 2021, Kewen.Lin wrote:

> on 2021/8/2 3:09 PM, Richard Biener wrote:
> > On Mon, 2 Aug 2021, Kewen.Lin wrote:
> > 
> >> on 2021/7/30 10:04 PM, Kewen.Lin via Gcc-patches wrote:
> >>> Hi Richi,
> >>>
> >>> on 2021/7/30 7:34 PM, Richard Biener wrote:
>  This adds a gather vectorization capability to the vectorizer
>  without target support by decomposing the offset vector, doing
>  scalar loads and then building a vector from the result.  This
>  is aimed mainly at cases where vectorizing the rest of the loop
>  offsets the cost of vectorizing the gather.
> 
>  Note it's difficult to avoid vectorizing the offset load, but in
>  some cases later passes can turn the vector load + extract into
>  scalar loads, see the followup patch.
> 
>  On SPEC CPU 2017 510.parest_r this improves runtime from 250s
>  to 219s on a Zen2 CPU which has its native gather instructions
>  disabled (using those the runtime instead increases to 254s)
>  using -Ofast -march=znver2 [-flto].  It turns out the critical
>  loops in this benchmark all perform gather operations.
> 
> >>>
> >>> Wow, it sounds promising!
> >>>
>  Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
>  Any comments?  I still plan to run this over full SPEC and
>  I have to apply TLC to the followup patch before I can post it.
> 
>  I think neither power nor z has gather so I'm curious if the
>  patch helps 510.parest_r there, I'm unsure about neon/advsimd.
> >>>
> >>> Yes, Power (latest Power10) doesn't support gather load.
> >>> I just measured 510.parest_r with this patch on Power9 at option
> >>> -Ofast -mcpu=power9 {,-funroll-loops}, both are neutral.
> >>>
> >>> It fails to vectorize the loop in vect-gather-1.c:
> >>>
> >>> vect-gather.c:12:28: missed:  failed: evolution of base is not affine.
> >>> vect-gather.c:11:46: missed:   not vectorized: data ref analysis failed 
> >>> _6 = *_5;
> >>> vect-gather.c:12:28: missed:   not vectorized: data ref analysis failed: 
> >>> _6 = *_5;
> >>> vect-gather.c:11:46: missed:  bad data references.
> >>> vect-gather.c:11:46: missed: couldn't vectorize loop
> >>>
> >>
> >> By further investigation, it's because rs6000 fails to make
> >> maybe_gather true in:
> >>
> >>  bool maybe_gather
> >>= DR_IS_READ (dr)
> >>  && !TREE_THIS_VOLATILE (DR_REF (dr))
> >>  && (targetm.vectorize.builtin_gather != NULL
> >>  || supports_vec_gather_load_p ());
> >>
> >> With the hacking defining TARGET_VECTORIZE_BUILTIN_GATHER (as
> >> well as TARGET_VECTORIZE_BUILTIN_SCATTER) for rs6000, the
> >> case gets vectorized as expected.
> > 
> > Ah, yeah - this check needs to be elided.  Checking with a cross
> > that's indeed what is missing.
> > 
> >> But re-evaluated 510.parest_r with this extra hacking, the
> >> runtime performance doesn't have any changes.
> > 
> > OK, it likely needs the followup as well then.  For the added
> > testcase I end up with
> > 
> > .L4:
> > lwz %r10,8(%r9)
> > lwz %r31,12(%r9)
> > addi %r8,%r8,32
> > addi %r9,%r9,16
> > lwz %r7,-16(%r9)
> > lwz %r0,-12(%r9)
> > lxv %vs10,-16(%r8)
> > lxv %vs12,-32(%r8)
> > extswsli %r10,%r10,3
> > extswsli %r31,%r31,3
> > extswsli %r7,%r7,3
> > extswsli %r0,%r0,3
> > ldx %r10,%r6,%r10
> > ldx %r31,%r6,%r31
> > ldx %r7,%r6,%r7
> > ldx %r12,%r6,%r0
> > mtvsrdd %vs0,%r31,%r10
> > mtvsrdd %vs11,%r12,%r7
> > xvmuldp %vs0,%vs0,%vs10
> > xvmaddadp %vs0,%vs12,%vs11
> > xvadddp %vs32,%vs32,%vs0
> > bdnz .L4
> > 
> > as the inner vectorized loop.  Looks like power doesn't have
> > extending SI->DImode loads?  Otherwise the code looks
> 
> You meant one instruction doing both the left shift and the load?

No, I meant doing

 lwz %r10,8(%r9)
 extswsli %r10,%r10,3

in one step.  I think extswsli should be a SImode->DImode sign-extend.
Ah - it does the multiplication by 8 and the sign-extend.  OK, that's
nice as well.  On x86 the multiplication by 8 is done via complex
addressing mode on the dependent load and the scalar load does
the sign-extension part.
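I.e. on x86 each gather lane is essentially (an illustrative C fragment,
not actual compiler output):

  /* Illustrative only: per-lane work for dst[*col] on x86.  */
  static double
  load_one_lane (const int *col, const double *dst, int i)
  {
    long long off = col[i];  /* movslq: the scalar load does the SI->DI sign-extension */
    return dst[off];         /* the *8 scaling folds into the addressing mode */
  }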

> If so, it doesn't.  btw, the above code sequence looks nice!
> > straight-forward (I've used -mcpu=power10), but eventually
> > the whole gather, which I think boils down to
> > 
> > lwz %r10,8(%r9)
> > lwz %r31,12(%r9)
> > addi %r9,%r9,16
> > lwz %r7,-16(%r9)
> > lwz %r0,-12(%r9)
> > extswsli %r10,%r10,3
> > extswsli %r31,%r31,3
> > extswsli %r7,%r7,3
> > extswsli %r0,%r0,3
> > ldx %r10,%r6,%r10
> > ldx %r31,%r6,%r31
> > ldx %r7,%r6,%r7
> > ldx %r12,%r6,%r0
> > mtvsrdd %vs0,%r31,%r10
> > mtvsrdd %vs11,%r12,%r7
> > 
> > is too difficult for the integer/load side of the pipeline.  It's
> > of

Re: [PATCH] Add emulated gather capability to the vectorizer

2021-08-02 Thread Kewen.Lin via Gcc-patches
on 2021/8/2 5:11 PM, Richard Biener wrote:
> On Mon, 2 Aug 2021, Kewen.Lin wrote:
> 
>> on 2021/8/2 3:09 PM, Richard Biener wrote:
>>> On Mon, 2 Aug 2021, Kewen.Lin wrote:
>>>
 on 2021/7/30 10:04 PM, Kewen.Lin via Gcc-patches wrote:
> Hi Richi,
>
> on 2021/7/30 7:34 PM, Richard Biener wrote:
>> This adds a gather vectorization capability to the vectorizer
>> without target support by decomposing the offset vector, doing
>> scalar loads and then building a vector from the result.  This
>> is aimed mainly at cases where vectorizing the rest of the loop
>> offsets the cost of vectorizing the gather.
>>
>> Note it's difficult to avoid vectorizing the offset load, but in
>> some cases later passes can turn the vector load + extract into
>> scalar loads, see the followup patch.
>>
>> On SPEC CPU 2017 510.parest_r this improves runtime from 250s
>> to 219s on a Zen2 CPU which has its native gather instructions
>> disabled (using those the runtime instead increases to 254s)
>> using -Ofast -march=znver2 [-flto].  It turns out the critical
>> loops in this benchmark all perform gather operations.
>>
>
> Wow, it sounds promising!
>
>> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>>
>> Any comments?  I still plan to run this over full SPEC and
>> I have to apply TLC to the followup patch before I can post it.
>>
>> I think neither power nor z has gather so I'm curious if the
>> patch helps 510.parest_r there, I'm unsure about neon/advsimd.
>
> Yes, Power (latest Power10) doesn't support gather load.
> I just measured 510.parest_r with this patch on Power9 at option
> -Ofast -mcpu=power9 {,-funroll-loops}, both are neutral.
>
> It fails to vectorize the loop in vect-gather-1.c:
>
> vect-gather.c:12:28: missed:  failed: evolution of base is not affine.
> vect-gather.c:11:46: missed:   not vectorized: data ref analysis failed 
> _6 = *_5;
> vect-gather.c:12:28: missed:   not vectorized: data ref analysis failed: 
> _6 = *_5;
> vect-gather.c:11:46: missed:  bad data references.
> vect-gather.c:11:46: missed: couldn't vectorize loop
>

 By further investigation, it's because rs6000 fails to make
 maybe_gather true in:

  bool maybe_gather
= DR_IS_READ (dr)
  && !TREE_THIS_VOLATILE (DR_REF (dr))
  && (targetm.vectorize.builtin_gather != NULL
  || supports_vec_gather_load_p ());

 With the hacking defining TARGET_VECTORIZE_BUILTIN_GATHER (as
 well as TARGET_VECTORIZE_BUILTIN_SCATTER) for rs6000, the
 case gets vectorized as expected.
>>>
>>> Ah, yeah - this check needs to be elided.  Checking with a cross
>>> that's indeed what is missing.
>>>
 But re-evaluated 510.parest_r with this extra hacking, the
 runtime performance doesn't have any changes.
>>>
>>> OK, it likely needs the followup as well then.  For the added
>>> testcase I end up with
>>>
>>> .L4:
>>> lwz %r10,8(%r9)
>>> lwz %r31,12(%r9)
>>> addi %r8,%r8,32
>>> addi %r9,%r9,16
>>> lwz %r7,-16(%r9)
>>> lwz %r0,-12(%r9)
>>> lxv %vs10,-16(%r8)
>>> lxv %vs12,-32(%r8)
>>> extswsli %r10,%r10,3
>>> extswsli %r31,%r31,3
>>> extswsli %r7,%r7,3
>>> extswsli %r0,%r0,3
>>> ldx %r10,%r6,%r10
>>> ldx %r31,%r6,%r31
>>> ldx %r7,%r6,%r7
>>> ldx %r12,%r6,%r0
>>> mtvsrdd %vs0,%r31,%r10
>>> mtvsrdd %vs11,%r12,%r7
>>> xvmuldp %vs0,%vs0,%vs10
>>> xvmaddadp %vs0,%vs12,%vs11
>>> xvadddp %vs32,%vs32,%vs0
>>> bdnz .L4
>>>
>>> as the inner vectorized loop.  Looks like power doesn't have
>>> extending SI->DImode loads?  Otherwise the code looks
>>
>> You meant one instruction doing both the left shift and the load?
> 
> No, I meant doing
> 
>  lwz %r10,8(%r9)
>  extswsli %r10,%r10,3
> 
> in one step.  I think extswsli should be a SImode->DImode sign-extend.
> Ah - it does the multiplication by 8 and the sign-extend.  OK, that's
> nice as well.  

Ah, OK.  Yeah, extswsli is short for "Extend Sign Word and Shift Left
Immediate", which was introduced with Power9.  It does do both the
sign-extend and the multiplication by 8 as you noted.  :)

> On x86 the multiplication by 8 is done via complex
> addressing mode on the dependent load and the scalar load does
> the sign-extension part.
> 

OK, thanks for the clarification!

BR,
Kewen

>> If so, it doesn't.  btw, the above code sequence looks nice!
>>> straight-forward (I've used -mcpu=power10), but eventually
>>> the whole gather, which I think boils down to
>>>
>>> lwz %r10,8(%r9)
>>> lwz %r31,12(%r9)
>>> addi %r9,%r9,16
>>> lwz %r7,-16(%r9)
>>> lwz %r0,-12(%r9)
>>> extswsli %r10,%r10,3
>>> extswsli %r31,%r31,3
>>> extsw

Re: [PATCH] Fix typos in move_sese_region_to_fn

2021-08-02 Thread Richard Biener via Gcc-patches
On Fri, Jul 30, 2021 at 1:15 PM Kewen.Lin  wrote:
>
> Hi,
>
> This patch is to fix the typos in the move_sese_region_to_fn.
> As mentioned here [1], I tried to debug the test case
> gcc.dg/graphite/pr83359.c with trunk, but I found it didn't
> go into the hunk guarded with "if (moved_orig_loop_num)".  So
> I switched to commit 555758de90074 (also reproduced the ICE
> with 555758de90074~ to ensure my used command step is correct),
> I noticed the compilation of the test case only covers the
> hunk
>
> else
>   {
> moved_orig_loop_num[dloop->orig_loop_num] = -1;
> dloop->orig_loop_num = 0;
>   }
>
> it doesn't touch the hunk
>
> if ((*larray)[dloop->orig_loop_num] != NULL
> && get_loop (saved_cfun, dloop->orig_loop_num) == NULL)
>   {
> if (moved_orig_loop_num[dloop->orig_loop_num] >= 0
> && moved_orig_loop_num[dloop->orig_loop_num] < 2)
>   moved_orig_loop_num[dloop->orig_loop_num]++;
> dloop->orig_loop_num = (*larray)[dloop->orig_loop_num]->num;
>   }
>
> so the following hunk using dloop and guarded with
> "if (moved_orig_loop_num[orig_loop_num] == 2)" doesn't get executed.
>
> It explains why the problem doesn't get exposed before.
>
> By looking at the code using dloop, I think it's a copy/paste typo,
> the assertion
>
>   gcc_assert ((*larray)[dloop->orig_loop_num] != NULL
>   && (get_loop (saved_cfun, dloop->orig_loop_num)
>   == NULL));
>
> would like to ensure the condition in the previous
> loop iterating is true, that is:
>
> if ((*larray)[dloop->orig_loop_num] != NULL
> && get_loop (saved_cfun, dloop->orig_loop_num) == NULL)
>
> But in that context, I think the expected original number has been
> assigned to variable orig_loop_num by extracting from the arg0
> of IFN_LOOP_DIST_ALIAS call.  So replace those ones.
>
> Is it ok for trunk?

OK.

Thanks,
Richard.

> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576367.html
>
> BR,
> Kewen
> -
> gcc/ChangeLog:
>
> * tree-cfg.c (move_sese_region_to_fn): Fix typos on dloop.


Re: [PATCH] Add a simple fraction class

2021-08-02 Thread Richard Biener via Gcc-patches
On Fri, Jul 30, 2021 at 5:59 PM Richard Sandiford via Gcc-patches
 wrote:
>
> This patch adds a simple class for holding A/B fractions.
> As the comments in the patch say, the class isn't designed
> to have nice numerical properties at the extremes.
>
> The motivating use case was some aarch64 costing work,
> where being able to represent fractions was much easier
> than using single integers and avoided the rounding errors
> that would come with using floats.  (Unlike things like
> COSTS_N_INSNS, there was no sensible constant base factor
> that could be used.)
>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

Hmm, we use the sreal type for profiles.  I don't see any overflow/underflow
handling in your class - I suppose you're going to use it on integer types
given we're not allowed to use native FP?  I mean, how exactly does
the class solve the problem of rounding errors?

Thanks,
Richard.

> Thanks,
> Richard
>
>
> gcc/
> * simple-fraction.h: New file.
> * simple-fraction.cc: Likewise.
> * Makefile.in (OBJS): Add simple-fraction.o.
> * selftest.h (simple_fraction_cc_tests): Declare.
> * selftest-run-tests.c (selftest::run_tests): Call it.
> ---
>  gcc/Makefile.in  |   1 +
>  gcc/selftest-run-tests.c |   1 +
>  gcc/selftest.h   |   1 +
>  gcc/simple-fraction.cc   | 160 
>  gcc/simple-fraction.h| 308 +++
>  5 files changed, 471 insertions(+)
>  create mode 100644 gcc/simple-fraction.cc
>  create mode 100644 gcc/simple-fraction.h
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 1666ef84d6a..8eaaab84143 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1572,6 +1572,7 @@ OBJS = \
> selftest-run-tests.o \
> sese.o \
> shrink-wrap.o \
> +   simple-fraction.o \
> simplify-rtx.o \
> sparseset.o \
> spellcheck.o \
> diff --git a/gcc/selftest-run-tests.c b/gcc/selftest-run-tests.c
> index 1b5583e476a..f17d4e24884 100644
> --- a/gcc/selftest-run-tests.c
> +++ b/gcc/selftest-run-tests.c
> @@ -80,6 +80,7 @@ selftest::run_tests ()
>opt_problem_cc_tests ();
>ordered_hash_map_tests_cc_tests ();
>splay_tree_cc_tests ();
> +  simple_fraction_cc_tests ();
>
>/* Mid-level data structures.  */
>input_c_tests ();
> diff --git a/gcc/selftest.h b/gcc/selftest.h
> index 80459d63a39..716ba41f6bf 100644
> --- a/gcc/selftest.h
> +++ b/gcc/selftest.h
> @@ -254,6 +254,7 @@ extern void read_rtl_function_c_tests ();
>  extern void rtl_tests_c_tests ();
>  extern void sbitmap_c_tests ();
>  extern void selftest_c_tests ();
> +extern void simple_fraction_cc_tests ();
>  extern void simplify_rtx_c_tests ();
>  extern void spellcheck_c_tests ();
>  extern void spellcheck_tree_c_tests ();
> diff --git a/gcc/simple-fraction.h b/gcc/simple-fraction.h
> new file mode 100644
> index 000..8d3ff2bdd2d
> --- /dev/null
> +++ b/gcc/simple-fraction.h
> @@ -0,0 +1,308 @@
> +// Simple fraction utilities
> +// Copyright (C) 2021 Free Software Foundation, Inc.
> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it under
> +// the terms of the GNU General Public License as published by the Free
> +// Software Foundation; either version 3, or (at your option) any later
> +// version.
> +//
> +// GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +// WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +// FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +// for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// .
> +
> +// A simple fraction with nominator and denominator of integral type T.
> +// There is little attempt to handle multiplication overflow, so the class
> +// shouldn't be used in cases where that's a risk.  It also doesn't cope
> +// gracefully with the minimum T value, if T is signed.
> +template 
> +class simple_fraction
> +{
> +public:
> +  // Construct a fraction equal to NOMINATOR.
> +  template
> +  constexpr simple_fraction (T1 nominator = 0)
> +: m_nominator (nominator), m_denominator (1) {}
> +
> +  // Construct a fraction equal to NOMINATOR / DENOMINATOR (simplifying
> +  // where possible).
> +  template
> +  simple_fraction (T1 nominator, T2 denominator)
> +: simple_fraction (nominator, denominator, gcd (nominator, denominator)) 
> {}
> +
> +  simple_fraction operator- () const;
> +  simple_fraction operator+ (const simple_fraction &) const;
> +  simple_fraction operator- (const simple_fraction &) const;
> +  simple_fraction operator* (const simple_fraction &) const;
> +  simple_fraction operator/ (const simple_fraction &) const;
> +
> +  simple_fraction &operator+= (const simple_fraction &);
> +  simple_fraction &operator-= (const simple_frac

Re: [PATCH] gcov-profile/71672 Fix indirect call inlining with AutoFDO

2021-08-02 Thread Richard Biener via Gcc-patches
On Fri, Jul 30, 2021 at 9:09 AM Eugene Rozenfeld via Gcc-patches
 wrote:
>
> This patch has the following changes:
>
> 1. The main fix is in auto-profile.c: the histogram value for
>indirect calls was incorrectly set up. That is fixed now.
>
> 2. Several tests now have -fdump-ipa-afdo-optimized instead of -fdump-ipa-afdo
>in dg-options so that the expected output can be found.
>
> 3. I increased the number of iterations in several tests so that perf can have
>enough sampling events.
>
> 4. indir-call-prof-2.c has -fno-early-inlining but AutoFDO can't work without
>early inlining (it needs to match the inlining of the profiled binary).
>I changed profopt.exp to always pass -fearly-inlining for AutoFDO.
>With that the indirect call inlining in indir-call-prof-2.c happens in the 
> early inliner
>so I changed the dg-final-use-autofdo.
>
> 5. create_gcov tool doesn't currently support dwarf 5 so I made a change in 
> profopt.exp
>to pass -gdwarf-4 when compiling the binary to profile.
>
> 6. I updated the invocation of create_gcov in profopt.exp to pass 
> -gcov_version=2.
>I recently made a change to create_gcov to support version 2:
>https://github.com/google/autofdo/pull/117
>
> 7. I removed useless -o perf.data from the invocation of gcc-auto-profile in
>target-supports.exp.
>
> With these changes the tests checking indirect call inlining in gcc.dg and 
> g++.dg
> are passing.

OK.

Thanks,
Richard.

> gcc/ChangeLog:
> PR gcov-profile/71672
> * auto-profile.c (afdo_indirect_call): Fix the setup of the 
> histogram value for indirect calls.
>
> gcc/testsuite/ChangeLog:
> PR gcov-profile/71672
> * g++.dg/tree-prof/indir-call-prof.C: Fix options, increase the 
> number of iterations.
> * g++.dg/tree-prof/morefunc.C: Fix options, increase the number of 
> iterations.
> * g++.dg/tree-prof/reorder.C: Fix options, increase the number of 
> iterations.
> * gcc.dg/tree-prof/indir-call-prof-2.c: Fix options, fix 
> dg-final-use-autofdo, increase the number of iterations.
> * gcc.dg/tree-prof/indir-call-prof.c: Fix options.
> * lib/profopt.exp: Pass gdwarf-4 when compiling binary to profile; 
> pass -fearly-inlining when compiling with AutoFDO; pass -gcov_version=2 to 
> create_gcov.
> * lib/target-supports.exp: Remove unnecessary -o perf.data passed to 
> gcc-auto-profile.
> ---
>  gcc/auto-profile.c | 13 +
>  gcc/testsuite/g++.dg/tree-prof/indir-call-prof.C   |  4 ++--
>  gcc/testsuite/g++.dg/tree-prof/morefunc.C  |  7 ---
>  gcc/testsuite/g++.dg/tree-prof/reorder.C   |  6 +++---
>  gcc/testsuite/gcc.dg/tree-prof/indir-call-prof-2.c |  8 
>  gcc/testsuite/gcc.dg/tree-prof/indir-call-prof.c   |  2 +-
>  gcc/testsuite/lib/profopt.exp  |  6 +++---
>  gcc/testsuite/lib/target-supports.exp  |  2 +-
>  8 files changed, 27 insertions(+), 21 deletions(-)
>
> diff --git a/gcc/auto-profile.c b/gcc/auto-profile.c
> index b23b82b2df4..4c1fc6b536b 100644
> --- a/gcc/auto-profile.c
> +++ b/gcc/auto-profile.c
> @@ -1009,13 +1009,18 @@ afdo_indirect_call (gimple_stmt_iterator *gsi, const 
> icall_target_map &map,
>
>histogram_value hist = gimple_alloc_histogram_value (
>cfun, HIST_TYPE_INDIR_CALL, stmt, callee);
> -  hist->n_counters = 3;
> +  hist->n_counters = 4;
>hist->hvalue.counters = XNEWVEC (gcov_type, hist->n_counters);
>gimple_add_histogram_value (cfun, stmt, hist);
>
> -  hist->hvalue.counters[0] = direct_call->profile_id;
> -  hist->hvalue.counters[1] = max_iter->second;
> -  hist->hvalue.counters[2] = total;
> +  // Total counter
> +  hist->hvalue.counters[0] = total;
> +  // Number of value/counter pairs
> +  hist->hvalue.counters[1] = 1;
> +  // Value
> +  hist->hvalue.counters[2] = direct_call->profile_id;
> +  // Counter
> +  hist->hvalue.counters[3] = max_iter->second;
>
>if (!transform)
>  return;
> diff --git a/gcc/testsuite/g++.dg/tree-prof/indir-call-prof.C 
> b/gcc/testsuite/g++.dg/tree-prof/indir-call-prof.C
> index 3374744613e..b45417106d0 100644
> --- a/gcc/testsuite/g++.dg/tree-prof/indir-call-prof.C
> +++ b/gcc/testsuite/g++.dg/tree-prof/indir-call-prof.C
> @@ -1,4 +1,4 @@
> -/* { dg-options "-O2 -fdump-tree-optimized -fdump-ipa-profile-optimized 
> -fdump-ipa-afdo" } */
> +/* { dg-options "-O2 -fdump-tree-optimized -fdump-ipa-profile-optimized 
> -fdump-ipa-afdo-optimized" } */
>
>  struct A {
>A () {}
> @@ -26,7 +26,7 @@ main (void)
>
>int i;
>
> -  for (i = 0; i < 100; i++)
> +  for (i = 0; i < 1000; i++)
>  {
>p = (A *)wrap ((void *)&a);
>p->AA ();
> diff --git a/gcc/testsuite/g++.dg/tree-prof/morefunc.C 
> b/gcc/testsuite/g++.dg/tree-prof/morefunc.C
> index 621d09aec5b..96e0073ca8f 100644
> --- a/gcc/testsuite/g++.dg/tree-prof/morefunc.C
> +++ b/gcc/testsuite/g++.dg/tree-prof/morefunc.C
> @@ -1,4 +1,5 @@
> -/* {

Re: [PATCH] Fix PR 101683: FP exceptions for float->unsigned

2021-08-02 Thread Richard Biener via Gcc-patches
On Fri, Jul 30, 2021 at 9:22 PM apinski--- via Gcc-patches
 wrote:
>
> From: Andrew Pinski 
>
> Just like the old bug PR9651, unsigned_fix rtl should
> also be handled as a trapping instruction.
>
> OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.

OK.
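For reference, the kind of conversion this is about is simply

  /* Illustrative only.  */
  unsigned int
  f (double d)
  {
    return (unsigned int) d;  /* raises FE_INVALID for NaN or out-of-range d */
  }

so with -ftrapping-math an UNSIGNED_FIX has to be treated exactly like FIX
and must not be speculated.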

> gcc/ChangeLog:
>
> PR rtl-optimization/101683
> * rtlanal.c (may_trap_p_1): Handle UNSIGNED_FIX.
> ---
>  gcc/rtlanal.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 3b8d88afd4d..f7f3acb75db 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -3261,6 +3261,7 @@ may_trap_p_1 (const_rtx x, unsigned flags)
>break;
>
>  case FIX:
> +case UNSIGNED_FIX:
>/* Conversion of floating point might trap.  */
>if (flag_trapping_math && HONOR_NANS (XEXP (x, 0)))
> return 1;
> --
> 2.27.0
>


Re: [PATCH] Optimize x ? bswap(x) : 0 in tree-ssa-phiopt

2021-08-02 Thread Richard Biener via Gcc-patches
On Sat, Jul 31, 2021 at 9:56 AM Roger Sayle  wrote:
>
>
> Many thanks again to Jakub Jelinek for a speedy fix for PR 101642.
> Interestingly, that test case "bswap16(x) ? : x" also reveals a
> missed optimization opportunity.  The resulting "x ? bswap(x) : 0"
> can be further simplified to just bswap(x).
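> For illustration, the reduced pattern is (a sketch, not the actual
> phi-opt-25.c testcase):
>
>   unsigned short
>   f (unsigned short x)
>   {
>     return x ? __builtin_bswap16 (x) : 0;  /* bswap16 (0) == 0 */
>   }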
>
> Conveniently, tree-ssa-phiopt.c already recognizes/optimizes the
> related "x ? popcount(x) : 0", so this patch simply makes that
> transformation more general, additionally handling bswap, parity,
> ffs and clrsb.  All of the required infrastructure is already
> present thanks to Jakub previously adding support for clz/ctz.
> To reflect this generalization, the name of the function is changed
> from cond_removal_in_popcount_clz_ctz_pattern to the hopefully
> equally descriptive cond_removal_in_builtin_zero_pattern.
>
> The following patch has been tested on x86_64-pc-linux-gnu with a
> "make bootstrap" and "make -k check" with no new failures.
>
> Ok for mainline?

OK.

Thanks,
Richard.

>
> 2021-07-31  Roger Sayle  
>
> gcc/ChangeLog
> * tree-ssa-phiopt.c (cond_removal_in_builtin_zero_pattern):
> Renamed from cond_removal_in_popcount_clz_ctz_pattern.
> Add support for BSWAP, FFS, PARITY and CLRSB builtins.
> (tree_ssa_phiop_worker): Update call to function above.
>
> gcc/testsuite/ChangeLog
> * gcc.dg/tree-ssa/phi-opt-25.c: New test case.
>
>
> Roger
> --
> Roger Sayle
> NextMove Software
> Cambridge, UK
>


Re: [PATCH] Fix ICE when mixing VLAs and statement expressions [PR91038]

2021-08-02 Thread Richard Biener via Gcc-patches
On Sun, Aug 1, 2021 at 7:37 PM Uecker, Martin
 wrote:
>
>
>
> Here is an attempt to fix some old and annoying bugs related
> to VLAs and statement expressions. In particular, this seems
> to fix the issues with variably-modified types which are
> returned from statement expressions (which works on clang),
> but there are still bugs remaining related to structs
> with VLA members (which seems to be a FE bug).
>
> Of course, I might be doing something stupid...

How's evaluation order of (f())[g()] defined (with f returning a pointer)?
Isn't that just f() + g()*sizeof(int) and thus undefined?

If it's undefined then I think the incoming GENERIC is ill-defined.

> The patch survives bootstrapping and regresstion testing
> on x86_64.
>
>
> --Martin
>
>
>
>
> Fix ICE when mixing VLAs and statement expressions [PR91038]
>
> When returning VM-types from statement expressions, this can
> lead to an ICE when declarations from the statement expression
> are referred to later. Some of these issues can be addressed by
> gimplifying the base expression earlier in gimplify_compound_lval.
> This fixes PR91038 and some of the test cases from PR29970
> (structs with VLA members need further work).
>
>
> 2021-08-01  Martin Uecker  
>
> gcc/
> PR c/91038
> PR c/29970
> * gimplify.c (gimplify_var_or_parm_decl): Update comment.
> (gimplify_compound_lval): Gimplify base expression first.
>
> gcc/testsuite/
> PR c/91038
> PR c/29970
> * gcc.dg/vla-stexp-01.c: New test.
> * gcc.dg/vla-stexp-02.c: New test.
> * gcc.dg/vla-stexp-03.c: New test.
> * gcc.dg/vla-stexp-04.c: New test.
>
>
> diff --git a/gcc/gimplify.c b/gcc/gimplify.c
> index 21ff32ee4aa..885d5f73585 100644
> --- a/gcc/gimplify.c
> +++ b/gcc/gimplify.c
> @@ -2839,7 +2839,10 @@ gimplify_var_or_parm_decl (tree *expr_p)
>   declaration, for which we've already issued an error.  It would
>   be really nice if the front end wouldn't leak these at all.
>   Currently the only known culprit is C++ destructors, as seen
> - in g++.old-deja/g++.jason/binding.C.  */
> + in g++.old-deja/g++.jason/binding.C.
> + Another culprit is size expressions for variably modified
> + types which are lost in the FE or not gimplified correctly.
> +  */
>if (VAR_P (decl)
>&& !DECL_SEEN_IN_BIND_EXPR_P (decl)
>&& !TREE_STATIC (decl) && !DECL_EXTERNAL (decl)
> @@ -2984,9 +2987,23 @@ gimplify_compound_lval (tree *expr_p, gimple_seq 
> *pre_p, gimple_seq *post_p,
>   expression until we deal with any variable bounds, sizes, or
>   positions in order to deal with PLACEHOLDER_EXPRs.
>
> - So we do this in three steps.  First we deal with the annotations
> - for any variables in the components, then we gimplify the base,
> - then we gimplify any indices, from left to right.  */
> + So we do this in three steps.  First we gimplify the base,
> + then we deal with the annotations for any variables in the
> + components, then we gimplify any indices, from left to right.
> +
> + The base expression may contain a statement expression that
> + has declarations used in size expressions, so has to be
> + gimplified first. */
> +
> +  /* Step 1 is to gimplify the base expression.  Make sure lvalue is set
> + so as to match the min_lval predicate.  Failure to do so may result
> + in the creation of large aggregate temporaries.  */
> +  tret = gimplify_expr (p, pre_p, post_p, is_gimple_min_lval,
> +   fallback | fb_lvalue);
> +
> +  ret = MIN (ret, tret);
> +
> +
>for (i = expr_stack.length () - 1; i >= 0; i--)
>  {
>tree t = expr_stack[i];
> @@ -3076,12 +3093,6 @@ gimplify_compound_lval (tree *expr_p, gimple_seq 
> *pre_p, gimple_seq *post_p,
> }
>  }
>
> -  /* Step 2 is to gimplify the base expression.  Make sure lvalue is set
> - so as to match the min_lval predicate.  Failure to do so may result
> - in the creation of large aggregate temporaries.  */
> -  tret = gimplify_expr (p, pre_p, post_p, is_gimple_min_lval,
> -   fallback | fb_lvalue);
> -  ret = MIN (ret, tret);
>
>/* And finally, the indices and operands of ARRAY_REF.  During this
>   loop we also remove any useless conversions.  */
> diff --git a/gcc/testsuite/gcc.dg/vla-stexpr-01.c 
> b/gcc/testsuite/gcc.dg/vla-stexpr-01.c
> new file mode 100644
> index 000..27e1817eb63
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vla-stexpr-01.c
> @@ -0,0 +1,94 @@
> +/* PR29970, PR91038 */
> +/* { dg-do run } */
> +/* { dg-options "-O0" } */
> +
> +int foo3b(void)   // should not return 0
> +{
> +int n = 0;
> +return sizeof *({ n = 10; int x[n]; x; &x; });
> +}
> +
> +int foo4(void)   // should not ICE
> +{
> +return (*({
> +int n = 20;
> +char (*x)[n][n] = __builtin_malloc(n * n);
> +(*x)[12]

Re: [PATCH] Support cond_add/sub/mul/div for vector float/double.

2021-08-02 Thread Richard Biener via Gcc-patches
On Mon, Aug 2, 2021 at 6:33 AM liuhongt  wrote:
>
> Hi:
>   This patch supports cond_add/sub/mul/div expanders for vector float/double.
>   There're still cond_fma/fms/fnms/fma/max/min/xor/ior/and left which I 
> failed to figure out a testcase to validate them.
> Also cond_add/sub/mul for vector integer.

I think FMA can be verified by combining cond_mul and cond_add since the
FMA recognizing code will try to match those up with cond_fma (plus the
other variants).  Eventually combine can handle this on RTL already.

The max/min/ior/and will only show up with masked epilogues when
vectorizing reductions.
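Something like the following might do as a testcase for the cond_fma
case (an untested sketch in the style of the cond_op_* tests below;
names and options are made up):

  /* Untested sketch: a conditional multiply-add that should be vectorized
     via cond_mul + cond_add and then fused into cond_fma, e.g. with
     -Ofast -march=skylake-avx512.  */
  #define N 1024
  double a[N], b[N], c[N], d[N];
  long cond[N];

  void
  foo (void)
  {
    for (int i = 0; i < N; i++)
      a[i] = cond[i] ? b[i] * c[i] + d[i] : d[i];
  }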

Richard.

>
>   Bootstrap is ok and it survives the regression test on x86_64-linux-gnu{-m32,}.
>   Pushed to trunk if there're no objections.
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (cond_):New expander.
> (cond_mul): Ditto.
> (cond_div): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/cond_op_addsubmuldiv_double-1.c: New test.
> * gcc.target/i386/cond_op_addsubmuldiv_double-2.c: New test.
> * gcc.target/i386/cond_op_addsubmuldiv_float-1.c: New test.
> * gcc.target/i386/cond_op_addsubmuldiv_float-2.c: New test.
> ---
>  gcc/config/i386/sse.md| 54 
>  .../i386/cond_op_addsubmuldiv_double-1.c  | 31 +++
>  .../i386/cond_op_addsubmuldiv_double-2.c  | 85 +++
>  .../i386/cond_op_addsubmuldiv_float-1.c   |  9 ++
>  .../i386/cond_op_addsubmuldiv_float-2.c   |  4 +
>  5 files changed, 183 insertions(+)
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv_double-1.c
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv_double-2.c
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv_float-1.c
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv_float-2.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index b5a08988590..8bf1764d3d5 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -1891,6 +1891,24 @@ (define_insn_and_split "*nabs2"
>  }
>[(set_attr "isa" "noavx,noavx,avx,avx")])
>
> +(define_expand "cond_"
> +  [(set (match_operand:VF 0 "register_operand")
> +   (vec_merge:VF
> + (plusminus:VF
> +   (match_operand:VF 2 "vector_operand")
> +   (match_operand:VF 3 "vector_operand"))
> + (match_operand:VF 4 "nonimm_or_0_operand")
> + (match_operand: 1 "register_operand")))]
> +  " == 64 || TARGET_AVX512VL"
> +{
> +  emit_insn (gen_3_mask (operands[0],
> +operands[2],
> +operands[3],
> +operands[4],
> +operands[1]));
> +  DONE;
> +})
> +
>  (define_expand "3"
>[(set (match_operand:VF 0 "register_operand")
> (plusminus:VF
> @@ -1953,6 +1971,24 @@ (define_insn 
> "_vm3"
> (set_attr "prefix" "")
> (set_attr "mode" "")])
>
> +(define_expand "cond_mul"
> +  [(set (match_operand:VF 0 "register_operand")
> +   (vec_merge:VF
> + (mult:VF
> +   (match_operand:VF 2 "vector_operand")
> +   (match_operand:VF 3 "vector_operand"))
> + (match_operand:VF 4 "nonimm_or_0_operand")
> + (match_operand: 1 "register_operand")))]
> +  " == 64 || TARGET_AVX512VL"
> +{
> +  emit_insn (gen_mul3_mask (operands[0],
> +operands[2],
> +operands[3],
> +operands[4],
> +operands[1]));
> +  DONE;
> +})
> +
>  (define_expand "mul3"
>[(set (match_operand:VF 0 "register_operand")
> (mult:VF
> @@ -2041,6 +2077,24 @@ (define_expand "div3"
>  }
>  })
>
> +(define_expand "cond_div"
> +  [(set (match_operand:VF 0 "register_operand")
> +   (vec_merge:VF
> + (div:VF
> +   (match_operand:VF 2 "register_operand")
> +   (match_operand:VF 3 "vector_operand"))
> + (match_operand:VF 4 "nonimm_or_0_operand")
> + (match_operand: 1 "register_operand")))]
> +  " == 64 || TARGET_AVX512VL"
> +{
> +  emit_insn (gen__div3_mask (operands[0],
> +   operands[2],
> +   operands[3],
> +   operands[4],
> +   operands[1]));
> +  DONE;
> +})
> +
>  (define_insn "_div3"
>[(set (match_operand:VF 0 "register_operand" "=x,v")
> (div:VF
> diff --git a/gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv_double-1.c 
> b/gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv_double-1.c
> new file mode 100644
> index 000..1092cba9876
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv_double-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake-avx512 -fdump-tree-v

Re: [PATCH] Add a simple fraction class

2021-08-02 Thread Richard Sandiford via Gcc-patches
Richard Biener via Gcc-patches  writes:
> On Fri, Jul 30, 2021 at 5:59 PM Richard Sandiford via Gcc-patches
>  wrote:
>>
>> This patch adds a simple class for holding A/B fractions.
>> As the comments in the patch say, the class isn't designed
>> to have nice numerical properties at the extremes.
>>
>> The motivating use case was some aarch64 costing work,
>> where being able to represent fractions was much easier
>> than using single integers and avoided the rounding errors
>> that would come with using floats.  (Unlike things like
>> COSTS_N_INSNS, there was no sensible constant base factor
>> that could be used.)
>>
>> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>
> Hmm, we use the sreal type for profiles.  I don't see any overflow/underflow
> handling in your class - I suppose you're going to use it on integer types
> given we're not allowed to use native FP?

Yeah, I'm going to use it on integer types.  And it's not designed
to have nice properties at extremes, including handling underflow and
overflow.

I want to use it in costing code, where we already happily multiply
and add “int”-sized costs without worrying about overflow.  I'll be
using uint64_t for the fractions though, just in case. :-)

sreal doesn't help because it's still significand/exponent.  That matters
because…

> I mean, how exactly does
> the class solve the problem of rounding errors?

…I wanted something that represented the results exactly (barring any of
integer ops overflowing).  This makes it meaningful to compare costs for
equality.  It also means we can use ordered comparisons without having
to introduce a fudge factor to cope with one calculation having different
intermediate rounding from the other.

E.g. aarch64 has code like:

  if (scalar_cycles_per_iter < sve_estimate)
{
  unsigned int min_cost
= orig_body_cost * estimated_poly_value (BYTES_PER_SVE_VECTOR);
  if (body_cost < min_cost)
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_NOTE, vect_location,
 "Increasing body cost to %d because the"
 " scalar code could issue within the limit"
 " imposed by predicate operations\n",
 min_cost);
  body_cost = min_cost;
  should_disparage = true;
}
}

I want to be able to keep this while making scalar_cycles_per_iter and
sve_estimate non-integral.
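For a concrete picture (illustrative only, and assuming the ordering and
equality operators the rest of the patch provides):

  #include "simple-fraction.h"

  static bool
  example ()
  {
    /* Exact fractions: ordering and equality are meaningful without an
       epsilon or fudge factor.  */
    simple_fraction<uint64_t> scalar_cycles_per_iter (7, 3);  /* 7 cycles over 3 iterations */
    simple_fraction<uint64_t> sve_estimate (9, 4);
    bool exact = simple_fraction<uint64_t> (1, 3) * 3
                 == simple_fraction<uint64_t> (1);
    return exact && scalar_cycles_per_iter < sve_estimate;
  }

Two costs that are mathematically equal always compare equal, however
they were computed.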

Thanks,
Richard


Re: [PATCH] Add a simple fraction class

2021-08-02 Thread Richard Biener via Gcc-patches
On Mon, Aug 2, 2021 at 12:43 PM Richard Sandiford
 wrote:
>
> Richard Biener via Gcc-patches  writes:
> > On Fri, Jul 30, 2021 at 5:59 PM Richard Sandiford via Gcc-patches
> >  wrote:
> >>
> >> This patch adds a simple class for holding A/B fractions.
> >> As the comments in the patch say, the class isn't designed
> >> to have nice numerical properties at the extremes.
> >>
> >> The motivating use case was some aarch64 costing work,
> >> where being able to represent fractions was much easier
> >> than using single integers and avoided the rounding errors
> >> that would come with using floats.  (Unlike things like
> >> COSTS_N_INSNS, there was no sensible constant base factor
> >> that could be used.)
> >>
> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
> >
> > Hmm, we use the sreal type for profiles.  I don't see any overflow/underflow
> > handling in your class - I suppose you're going to use it on integer types
> > given we're not allowed to use native FP?
>
> Yeah, I'm going to use it on integer types.  And it's not designed
> to have nice properties at extremes, including handling underflow and
> overflow.

So maybe assert that it doesn't?  In particular nominator/denominator
are prone to overflowing in fractional representations.

There's the option to round or ICE.  Or rather, the only option then
is to round (or use a more expensive arbitrary precision representation).

So the question is whether the fractional behavior is better in more
cases than the sreal behavior (I can easily believe it is).

> I want to use it in costing code, where we already happily multiply
> and add “int”-sized costs without worrying about overflow.  I'll be
> using uint64_t for the fractions though, just in case. :-)
>
> sreal doesn't help because it's still significand/exponent.  That matters
> because…
>
> > I mean, how exactly does
> > the class solve the problem of rounding errors?
>
> …I wanted something that represented the results exactly (barring any of
> integer ops overflowing).  This makes it meaningful to compare costs for
> equality.  It also means we can use ordered comparisons without having
> to introduce a fudge factor to cope with one calculation having different
> intermediate rounding from the other.

I think you're underestimating how quickly your denominator will overflow?
So I suppose all factors of all possible denominators are known, in fact
what's your main source for the divisions?  The VF?

> E.g. aarch64 has code like:
>
>   if (scalar_cycles_per_iter < sve_estimate)
>     {
>       unsigned int min_cost
>         = orig_body_cost * estimated_poly_value (BYTES_PER_SVE_VECTOR);
>       if (body_cost < min_cost)
>         {
>           if (dump_enabled_p ())
>             dump_printf_loc (MSG_NOTE, vect_location,
>                              "Increasing body cost to %d because the"
>                              " scalar code could issue within the limit"
>                              " imposed by predicate operations\n",
>                              min_cost);
>           body_cost = min_cost;
>           should_disparage = true;
>         }
>     }
>
> I want to be able to keep this while making scalar_cycles_per_iter and
> sve_estimate non-integral.
>
> Thanks,
> Richard


Re: [PATCH v6 03/10] x86: Update piecewise move and store

2021-08-02 Thread Uros Bizjak via Gcc-patches
On Fri, Jul 30, 2021 at 11:32 PM H.J. Lu  wrote:
>
> We can use TImode/OImode/XImode integers for piecewise move and store.
>
> 1. Define MAX_MOVE_MAX to 64, which is the constant maximum number of
> bytes that a single instruction can move quickly between memory and
> registers or between two memory locations.
> 2. Define MOVE_MAX to MOVE_MAX_PIECES, which is the maximum number of
> bytes we can move from memory to memory in one reasonably fast instruction.
> The difference between MAX_MOVE_MAX and MOVE_MAX is that MAX_MOVE_MAX
> must be a constant, independent of compiler options, since it is used in
> reload.h to define struct target_reload and MOVE_MAX can vary, depending
> on compiler options.
> 3. When vector register is used for piecewise move and store, we don't
> increase stack_alignment_needed since vector register spill isn't
> required for piecewise move and store.  Since stack_realign_needed is
> set to true by checking stack_alignment_estimated set by pseudo vector
> register usage, we also need to check stack_realign_needed to eliminate
> frame pointer.
>
> gcc/
>
> * config/i386/i386.c (ix86_finalize_stack_frame_flags): Also
> check stack_realign_needed for stack realignment.
> (ix86_legitimate_constant_p): Always allow CONST_WIDE_INT smaller
> than the largest integer supported by vector register.
> * config/i386/i386.h (MAX_MOVE_MAX): New.  Set to 64.
> (MOVE_MAX_PIECES): Set to bytes of the largest integer supported
> by vector register.
> (MOVE_MAX): Defined to MOVE_MAX_PIECES.
> (STORE_MAX_PIECES): New.
>
> gcc/testsuite/
>
> * gcc.target/i386/pr90773-1.c: Adjust to expect movq for 32-bit.
> * gcc.target/i386/pr90773-4.c: Also run for 32-bit.
> * gcc.target/i386/pr90773-15.c: Likewise.
> * gcc.target/i386/pr90773-16.c: Likewise.
> * gcc.target/i386/pr90773-17.c: Likewise.
> * gcc.target/i386/pr90773-24.c: Likewise.
> * gcc.target/i386/pr90773-25.c: Likewise.
> * gcc.target/i386/pr100865-1.c: Likewise.
> * gcc.target/i386/pr100865-2.c: Likewise.
> * gcc.target/i386/pr100865-3.c: Likewise.
> * gcc.target/i386/pr90773-14.c: Also run for 32-bit and expect
> XMM movd to store 4 bytes.
> * gcc.target/i386/pr100865-4a.c: Also run for 32-bit and expect
> YMM registers.
> * gcc.target/i386/pr100865-4b.c: Likewise.
> * gcc.target/i386/pr100865-10a.c: Expect YMM registers.
> * gcc.target/i386/pr100865-10b.c: Likewise.
> ---
>  gcc/config/i386/i386.c   | 21 --
>  gcc/config/i386/i386.h   | 40 
>  gcc/testsuite/gcc.target/i386/pr100865-1.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/pr100865-10a.c |  4 +-
>  gcc/testsuite/gcc.target/i386/pr100865-10b.c |  4 +-
>  gcc/testsuite/gcc.target/i386/pr100865-2.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/pr100865-3.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/pr100865-4a.c  |  6 +--
>  gcc/testsuite/gcc.target/i386/pr100865-4b.c  |  8 ++--
>  gcc/testsuite/gcc.target/i386/pr90773-1.c| 10 ++---
>  gcc/testsuite/gcc.target/i386/pr90773-14.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/pr90773-15.c   |  6 +--
>  gcc/testsuite/gcc.target/i386/pr90773-16.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/pr90773-17.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/pr90773-24.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/pr90773-25.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/pr90773-4.c|  2 +-
>  17 files changed, 76 insertions(+), 41 deletions(-)
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 5d20ca2067f..842eb0e6786 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -7953,8 +7953,17 @@ ix86_finalize_stack_frame_flags (void)
>   assumed stack realignment might be needed or -fno-omit-frame-pointer
>   is used, but in the end nothing that needed the stack alignment had
>   been spilled nor stack access, clear frame_pointer_needed and say we
> - don't need stack realignment.  */
> -  if ((stack_realign || (!flag_omit_frame_pointer && optimize))
> + don't need stack realignment.
> +
> + When vector register is used for piecewise move and store, we don't
> + increase stack_alignment_needed as there is no register spill for
> + piecewise move and store.  Since stack_realign_needed is set to true
> + by checking stack_alignment_estimated which is updated by pseudo
> + vector register usage, we also need to check stack_realign_needed to
> + eliminate frame pointer.  */
> +  if ((stack_realign
> +   || (!flag_omit_frame_pointer && optimize)
> +   || crtl->stack_realign_needed)
>&& frame_pointer_needed
>&& crtl->is_leaf
>&& crtl->sp_is_unchanging
> @@ -10418,7 +10427,13 @@ ix86_legitimate_constant_p (machine_mode mode, rtx x)
>   /* FALLTHRU */
> case E_OImode:
>

Re: [PATCH] Add a simple fraction class

2021-08-02 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Mon, Aug 2, 2021 at 12:43 PM Richard Sandiford
>  wrote:
>>
>> Richard Biener via Gcc-patches  writes:
>> > On Fri, Jul 30, 2021 at 5:59 PM Richard Sandiford via Gcc-patches
>> >  wrote:
>> >>
>> >> This patch adds a simple class for holding A/B fractions.
>> >> As the comments in the patch say, the class isn't designed
>> >> to have nice numerical properties at the extremes.
>> >>
>> >> The motivating use case was some aarch64 costing work,
>> >> where being able to represent fractions was much easier
>> >> than using single integers and avoided the rounding errors
>> >> that would come with using floats.  (Unlike things like
>> >> COSTS_N_INSNS, there was no sensible constant base factor
>> >> that could be used.)
>> >>
>> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>> >
>> > Hmm, we use the sreal type for profiles.  I don't see any 
>> > overflow/underflow
>> > handling in your class - I suppose you're going to use it on integer types
>> > given we're not allowed to use native FP?
>>
>> Yeah, I'm going to use it on integer types.  And it's not designed
>> to have nice properties at extremes, including handling underflow and
>> overflow.
>
> So maybe assert that it doesn't?  In particular numerator/denominator
> are prone to overflowing in fractional representations.
>
> There's the option to round or ICE.  Or rather, the only viable option
> is to round (or use a more expensive arbitrary-precision representation).

Yeah, I guess we could do that, but it seems inconsistent to assert
for these costs and not do it for vector costs in general.  I think it's
difficult to guarantee that there is no user input for which the current
vector costs overflow.  And if we assert, we have to have a reason for
believing that no such user input exists (modulo bugs).

E.g. vect-inner-loop-cost-factor has an upper limit of 999999, so the
existing code only needs a cost of 2148 to overflow “int”.

> So the question is whether the fractional behavior is better in more
> cases than the sreal behavior (I can easily believe it is).
>
>> I want to use it in costing code, where we already happily multiply
>> and add “int”-sized costs without worrying about overflow.  I'll be
>> using uint64_t for the fractions though, just in case. :-)
>>
>> sreal doesn't help because it's still significand/exponent.  That matters
>> because…
>>
>> > I mean, how exactly does
>> > the class solve the problem of rounding errors?
>>
>> …I wanted something that represented the results exactly (barring any of
>> integer ops overflowing).  This makes it meaningful to compare costs for
>> equality.  It also means we can use ordered comparisons without having
>> to introduce a fudge factor to cope with one calculation having different
>> intermediate rounding from the other.
>
> I think you're underestimating how quickly your denominator will overflow?

Well, it depends on how you use it. :-)  I agree you have to go into
this knowing the risks of the representation (but then I'd argue that's
true for floats/sreals too, if you use them for costs).

> So I suppose all factors of all possible denominators are known, in fact
> what's your main source for the divisions?  The VF?

Yeah, the set of possible denominators is fixed at compile time and
relatively small, but not easily enumerable.  The VF is one source,
but we also have “number of X per cycle” values.  The problem with sreal
is that sometimes those “X per cycle” values are 3, and 1/3 is where the
rounding problems with floats/sreals start to come in.

I'm fairly sure that using a uint64_t fractional representation for
int costs and this set of denominator values is safe.  But if we
think that this is just too dangerous to advertise as a general
class within GCC, we could make it local to the aarch64 cost code
instead.  Would that be OK?

Thanks,
Richard


Re: [PATCH] Add a simple fraction class

2021-08-02 Thread Richard Biener via Gcc-patches
On Mon, Aug 2, 2021 at 1:31 PM Richard Sandiford
 wrote:
>
> Richard Biener  writes:
> > On Mon, Aug 2, 2021 at 12:43 PM Richard Sandiford
> >  wrote:
> >>
> >> Richard Biener via Gcc-patches  writes:
> >> > On Fri, Jul 30, 2021 at 5:59 PM Richard Sandiford via Gcc-patches
> >> >  wrote:
> >> >>
> >> >> This patch adds a simple class for holding A/B fractions.
> >> >> As the comments in the patch say, the class isn't designed
> >> >> to have nice numerical properties at the extremes.
> >> >>
> >> >> The motivating use case was some aarch64 costing work,
> >> >> where being able to represent fractions was much easier
> >> >> than using single integers and avoided the rounding errors
> >> >> that would come with using floats.  (Unlike things like
> >> >> COSTS_N_INSNS, there was no sensible constant base factor
> >> >> that could be used.)
> >> >>
> >> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
> >> >
> >> > Hmm, we use the sreal type for profiles.  I don't see any 
> >> > overflow/underflow
> >> > handling in your class - I suppose you're going to use it on integer 
> >> > types
> >> > given we're not allowed to use native FP?
> >>
> >> Yeah, I'm going to use it on integer types.  And it's not designed
> >> to have nice properties at extremes, including handling underflow and
> >> overflow.
> >
> > So maybe assert that it doesn't?  In particular numerator/denominator
> > are prone to overflowing in fractional representations.
> >
> > There's the option to round or ICE.  Or rather, the only viable option
> > is to round (or use a more expensive arbitrary-precision representation).
>
> Yeah, I guess we could do that, but it seems inconsistent to assert
> for these costs and not do it for vector costs in general.  I think it's
> difficult to guarantee that there is no user input for which the current
> vector costs overflow.  And if we assert, we have to have a reason for
> believing that no such user input exists (modulo bugs).
>
> E.g. vect-inner-loop-cost-factor has an upper limit of 999999, so the
> existing code only needs a cost of 2148 to overflow “int”.

I'd argue those are of course bugs.  The 999999 upper bound is way
too big given REG_BR_PROB_BASE is only 10000.  But then we're now
set up to initialize vinfo->inner_loop_cost_factor based on profile data
(if it is reliable).

> > So the question is whether the fractional behavior is better in more
> > cases than the sreal behavior (I can easily believe it is).
> >
> >> I want to use it in costing code, where we already happily multiply
> >> and add “int”-sized costs without worrying about overflow.  I'll be
> >> using uint64_t for the fractions though, just in case. :-)
> >>
> >> sreal doesn't help because it's still significand/exponent.  That matters
> >> because…
> >>
> >> > I mean, how exactly does
> >> > the class solve the problem of rounding errors?
> >>
> >> …I wanted something that represented the results exactly (barring any of
> >> integer ops overflowing).  This makes it meaningful to compare costs for
> >> equality.  It also means we can use ordered comparisons without having
> >> to introduce a fudge factor to cope with one calculation having different
> >> intermediate rounding from the other.
> >
> > I think you're underestimating how quickly your denominator will overflow?
>
> Well, it depends on how you use it. :-)  I agree you have to go into
> this knowing the risks of the representation (but then I'd argue that's
> true for floats/sreals too, if you use them for costs).

Yeah, and sreals handle overflow/underflow in a well-defined way because
profile info tends to be crap ;)

> > So I suppose all factors of all possible denominators are known, in fact
> > what's your main source for the divisions?  The VF?
>
> Yeah, the set of possible denominators is fixed at compile time and
> relatively small, but not easily enumerable.  The VF is one source,
> but we also have “number of X per cycle” values.  The problem with sreal
> is that sometimes those “X per cycle” values are 3, and 1/3 is where the
> rounding problems with floats/sreals start to come in.
>
> I'm fairly sure that using a uint64_t fractional representation for
> int costs and this set of denominator values is safe.  But if we
> think that this is just too dangerous to advertise as a general
> class within GCC, we could make it local to the aarch64 cost code
> instead.  Would that be OK?

I think we should instead make its use safe, that is, simply round when
the denominator gets too big.  The gcd compute is already expensive
and so is the division; I suppose a practical way would be to use
uint32 for the representation and [u]int64 for the intermediate compute?
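
For illustration, a rough sketch of that scheme (made-up names, not a
patch): 32-bit storage, 64-bit intermediate compute, and rounding only
when the exact result no longer fits:

#include <cstdint>

struct frac32 { uint32_t num, den; };   /* stored representation */

static frac32
frac32_mul (frac32 a, frac32 b)
{
  uint64_t n = (uint64_t) a.num * b.num;   /* exact in the wide type */
  uint64_t d = (uint64_t) a.den * b.den;
  /* Lose precision only on overflow: scale both parts back into 32 bits
     instead of ICEing.  Rounding up keeps den from reaching zero.  */
  while (n > UINT32_MAX || d > UINT32_MAX)
    {
      n = (n + 1) >> 1;
      d = (d + 1) >> 1;
    }
  frac32 r = { (uint32_t) n, (uint32_t) d };
  return r;
}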

One could put extra debugging that dumps to the active dumpfile
whenever this happens as well (but likely with an editable #define,
disabled by default).

Maybe even gcc_checking_assert()s would do if we document that
the set of denominators needs to be fixed in a way to ensure overflow
doesn't happen.

Re: [PATCH 2/2] Ada: Remove debug line number for DECL_IGNORED_P functions

2021-08-02 Thread Eric Botcazou
> It was pointed out in PR101598 that it is inappropriate for
> ignored Ada decls to receive the source line number which was
> recorded in the function decl's DECL_SOURCE_LOCATION.
> Therefore set the DECL_SOURCE_LOCATION of all front-end-generated
> Ada decls with DECL_IGNORED_P to UNKNOWN_LOCATION.
> 
> 2021-07-24  Bernd Edlinger  
> 
>   PR debug/101598
>   * gcc-interface/trans.c (Subprogram_Body_to_gnu): Set the
>   DECL_SOURCE_LOCATION of DECL_IGNORED_P gnu_subprog_decl to
>   UNKNOWN_LOCATION.

Is that really needed in DWARF 5?  If no, I'm not sure that we want it.

-- 
Eric Botcazou




Re: [PATCH, libgomp, OpenMP 5.0] Implement omp_get_device_num

2021-08-02 Thread Chung-Lin Tang




On 2021/7/23 7:01 PM, Tobias Burnus wrote:

I personally prefer having:
    int initial_dev;
and inside 'omp target' (with 'map(from:initial_dev)'):
    initial_dev = omp_is_initial_device();

Then the check would be:
   if (initial_dev && host_device_num != device_num)
     abort();
   if (!initial_dev && host_device_num == device_num)
     abort();

(Likewise for Fortran.)


Thanks, I've adjusted the new testcases to use this style.
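
For reference, a minimal sketch of a testcase in that style (illustrative
only, not the committed test):

#include <omp.h>
#include <stdlib.h>

int
main (void)
{
  int host_device_num = omp_get_initial_device ();
  int initial_dev, device_num;

  #pragma omp target map(from: initial_dev, device_num)
  {
    initial_dev = omp_is_initial_device ();
    device_num = omp_get_device_num ();
  }

  /* On the host the two numbers must match; on an offload device
     they must not.  */
  if (initial_dev && host_device_num != device_num)
    abort ();
  if (!initial_dev && host_device_num == device_num)
    abort ();
  return 0;
}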


And instead of restricting the target to nvptx/gcn, we could just add
dg-xfail-run-if for *-intelmic-* and *-intelmicemul-*.


I've added an 'offload_target_intelmic' to use on the new testcases.


Additionally, offload_target_nvptx/...amdgcn only check whether
compilation support is available not whether a device exists
at run time.
(The device availability is checked by target_offload_device,
using omp_is_initial_device().)


I guess there is value in testing compilation as long as the compiler
is properly configured, and leaving the execution as an independent test.
OTOH, I think the OpenMP execution tests are not properly forcing offload
(or not) using the environment variables, unlike what we have for OpenACC.

Thanks,
Chung-Lin


[PATCH, v2, libgomp, OpenMP 5.0] Implement omp_get_device_num

2021-08-02 Thread Chung-Lin Tang

On 2021/7/23 6:39 PM, Jakub Jelinek wrote:

On Fri, Jul 23, 2021 at 06:21:41PM +0800, Chung-Lin Tang wrote:

--- a/libgomp/icv-device.c
+++ b/libgomp/icv-device.c
@@ -61,8 +61,17 @@ omp_is_initial_device (void)
return 1;
  }
  
+int

+omp_get_device_num (void)
+{
+  /* By specification, this is equivalent to omp_get_initial_device
+ on the host.  */
+  return omp_get_initial_device ();
+}
+


I think this won't work properly with the Intel MIC offloading, where the host
libgomp is used in the offloaded code.
For omp_is_initial_device, the plugin solves it by:
liboffloadmic/plugin/offload_target_main.cpp
overriding it:
/* Override the corresponding functions from libgomp.  */
extern "C" int
omp_is_initial_device (void) __GOMP_NOTHROW
{
   return 0;
}

extern "C" int32_t

omp_is_initial_device_ (void)
{
   return omp_is_initial_device ();
}
but I guess it will need slightly more work because we need to copy the value
to the offloading device too.
It can be done incrementally though.


I guess this part of intelmic functionality will just have to wait until later.
There seem to be other parts of liboffloadmic that need re-work,
e.g. omp_get_num_devices() returning mic_engines_total, where it should actually
return the number of all devices (not just intelmic), omp_get_initial_device()
returning -1 (which I don't quite understand), etc.

I'd really suggest re-working intelmic support as an offload plugin inside
libgomp, rather than having it float outside by itself.


--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -102,6 +102,12 @@ struct addr_pair
uintptr_t end;
  };
  
+/* This symbol is to name a target side variable that holds the designated

+   'device number' of the target device. The symbol needs to be available to
+   libgomp code and the  offload plugin (which in the latter case must be
+   stringified).  */
+#define GOMP_DEVICE_NUM_VAR __gomp_device_num


For a single var it is acceptable (though, please avoid the double space
before offload plugin in the comment), but once we have more than one
variable, I think we should simply have a struct which will contain all the
parameters that need to be copied from the host to the offloading device at
image load time (and have eventually another struct that holds parameters
that we'll need to copy to the device on each kernel launch, I bet some ICVs
will be one category, other ICVs another one).


Actually, if you look at the 5.[01] specifications, omp_get_device_num() is not
defined in terms of an ICV. Maybe it conceptually ought to be, but the current
description of "the device number of the device on which the calling thread is
executing" is not one of the defined ICVs.

It looks like there will eventually be some kind of ICV block handled in a
similar way, but I think that the modifications will be straightforward then. For now,
I think it's okay for GOMP_DEVICE_NUM_VAR to just be a normal global variable.


diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 8ea27b5565f..ffcb98ae99e 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -197,6 +197,8 @@ OMP_5.0.1 {
omp_get_supported_active_levels_;
omp_fulfill_event;
omp_fulfill_event_;
+   omp_get_device_num;
+   omp_get_device_num_;
  } OMP_5.0;


This is wrong.  We've already released GCC 11.1 with the OMP_5.0.1
symbol version, so we must not add any further symbols into that symbol
version.  OpenMP 5.0 routines added in GCC 12 should be OMP_5.0.2 symbol
version.


I've adjusted this into 5.0.2, in between 5.0.1 and the new 5.1 added by the
recent omp_display_env[_] routines. omp_get_device_num is an API function
introduced in OpenMP 5.0, so I think this is the correct handling (instead of
stashing it into 5.1).

There is a new function check_effective_target_offload_target_intelmic() in
testsuite/lib/libgomp.exp, used to test for non-intelmic offloading situations.

Re-tested with no regressions, seeking approval for trunk.

Thanks,
Chung-Lin

2021-08-02  Chung-Lin Tang  

libgomp/ChangeLog

* icv-device.c (omp_get_device_num): New API function, host side.
* fortran.c (omp_get_device_num_): New interface function.
* libgomp-plugin.h (GOMP_DEVICE_NUM_VAR): Define macro symbol.
* libgomp.map (OMP_5.0.2): New version space with omp_get_device_num,
omp_get_device_num_.
* libgomp.texi (omp_get_device_num): Add documentation for new API
function.
* omp.h.in (omp_get_device_num): Add declaration.
* omp_lib.f90.in (omp_get_device_num): Likewise.
* omp_lib.h.in (omp_get_device_num): Likewise.
* target.c (gomp_load_image_to_device): If additional entry for device
number exists at end of returned entries from 'load_image_func' hook,
copy the assigned device number over to the device variable.

* config/gcn/icv-device.c (GOMP_DEVICE_NUM_VAR): Define static global.
(omp_get_device_num): New API function, device side.

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-02 Thread Maged Michael via Gcc-patches
Please find attached an updated patch after incorporating Jonathan's
suggestions.

Changes from the last patch include:
- Add a TSAN macro to bits/c++config.
- Use separate constexpr bool-s for the conditions for lock-freedom,
double-width and alignment.
- Move the code in the optimized path to a separate function
_M_release_double_width_cas.

Thanks,
Maged


On Fri, Jul 16, 2021 at 11:55 AM Maged Michael 
wrote:

> Thank you, Jonathan, for the detailed comments! I'll update the patch
> accordingly.
>
> On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely 
> wrote:
>
>> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
>> >
>> > Please find a proposed patch for _Sp_counted_base::_M_release to skip
>> the
>> > two atomic instructions that decrement each of the use count and the
>> weak
>> > count when both are 1. I proposed the general idea in an earlier thread
>> (
>> > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and
>> got
>> > useful feedback on a draft patch and responses to related questions
>> about
>> > multi-granular atomicity and alignment. This patch is based on that
>> > feedback.
>> >
>> >
>> > I added a check for thread sanitizer to use the current algorithm in
>> that
>> > case because TSAN does not support multi-granular atomicity. I'd like to
>> > add a check of __has_feature(thread_sanitizer) for building using LLVM.
>> I
>> > found examples of __has_feature in libstdc++
>>
>> There are no uses of __has_feature in libstdc++. We do use
>> __has_builtin (which GCC also supports) and Clang's __is_identifier
>> (which GCC doesn't support) to work around some weird semantics of
>> __has_builtin in older versions of Clang.
>>
>>
>> > but it doesn't seem to be
>> > recognized in shared_ptr_base.h. Any guidance on how to check
>> > __has_feature(thread_sanitizer) in this patch?
>>
>> I think we want to do something like this in include/bits/c++config
>>
>> #if __SANITIZE_THREAD__
>> #  define _GLIBCXX_TSAN 1
>> #elif defined __has_feature
>> # if __has_feature(thread_sanitizer)
>> #  define _GLIBCXX_TSAN 1
>> # endif
>> #endif
>>
>> Then in bits/shared_ptr_base.h
>>
>> #if _GLIBCXX_TSAN
>> _M_release_orig();
>> return;
>> #endif
>>
>>
>>
>> > GCC generates code for _M_release that is larger and more complex than
>> that
>> > generated by LLVM. I'd like to file a bug report about that. Jonathan,
>>
>> Is this the same issue as https://gcc.gnu.org/PR101406 ?
>>
>> Partly yes. Even when using __atomic_add_dispatch I noticed that clang
> generated less code than gcc. I see in the response to the issue that the
> new glibc is expected to optimize better. So maybe this will eliminate the
> issue.
>
>
>> > would you please create a bugzilla account for me (
>> > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>>
>> Done (sorry, I didn't notice the request in this mail until coming
>> back to it to review the patch properly).
>>
>> Thank you!
>
>
>>
>>
>> >
>> >
>> > Information about the patch:
>> >
>> > - Benefits of the patch: Save the cost of the last atomic decrements of
>> > each of the use count and the weak count in _Sp_counted_base. Atomic
>> > instructions are significantly slower than regular loads and stores
>> across
>> > major architectures.
>> >
>> > - How current code works: _M_release() atomically decrements the use
>> count,
>> > checks if it was 1, if so calls _M_dispose(), atomically decrements the
>> > weak count, checks if it was 1, and if so calls _M_destroy().
>> >
>> > - How the proposed patch works: _M_release() loads both use count and
>> weak
>> > count together atomically (when properly aligned), checks if the value
>> is
>> > equal to the value of both counts equal to 1 (e.g., 0x100000001), and
>> if so
>> > calls _M_dispose() and _M_destroy(). Otherwise, it follows the original
>> > algorithm.
>> >
>> > - Why it works: When the current thread executing _M_release() finds
>> each
>> > of the counts is equal to 1, then (when _lock_policy is _S_atomic) no
>> other
>> > threads could possibly hold use or weak references to this control
>> block.
>> > That is, no other threads could possibly access the counts or the
>> protected
>> > object.
>> >
>> > - The proposed patch is intended to interact correctly with current code
>> > (under certain conditions: _Lock_policy is _S_atomic, proper alignment,
>> and
>> > native lock-free support for atomic operations). That is, multiple
>> threads
>> > using different versions of the code with and without the patch
>> operating
>> > on the same objects should always interact correctly. The intent for the
>> > patch is to be ABI compatible with the current implementation.
>> >
>> > - The proposed patch involves a performance trade-off between saving the
>> > costs of two atomic instructions when the counts are both 1 vs adding
>> the
>> > cost of loading the combined counts and comparison with two ones (e.g.,
>> > 0x100000001).
>> >
>> > - The patch has been in use (built using LLVM) in a l
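
To make the proposed fast path concrete, here is a self-contained sketch
of the idea (struct, function and callback names are placeholders, not
the actual libstdc++ members or the patch itself):

#include <atomic>
#include <cstdint>

/* Stand-in for the two adjacent 32-bit counts in the control block.  */
struct counts { std::atomic<int32_t> use; std::atomic<int32_t> weak; };

template <typename Dispose, typename Destroy, typename Fallback>
void
release_sketch (counts *c, Dispose dispose, Destroy destroy, Fallback slow)
{
  /* Assumes the pair is 8-byte aligned, 64-bit atomics are lock-free and
     the double-width access is permitted, mirroring the conditions the
     patch checks before taking this path.  */
  constexpr uint64_t both_one = 1ULL | (1ULL << 32);
  auto *combined = reinterpret_cast<std::atomic<uint64_t> *> (c);
  if (combined->load (std::memory_order_acquire) == both_one)
    {
      /* Both counts are 1: no other thread can hold a use or weak
         reference, so the two atomic decrements can be skipped.  */
      dispose ();
      destroy ();
    }
  else
    slow ();   /* original algorithm: decrement use, then weak */
}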

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-02 Thread Maged Michael via Gcc-patches
This is the right patch. The previous one is missing noexcept. Sorry.


On Mon, Aug 2, 2021 at 9:23 AM Maged Michael 
wrote:

> Please find attached an updated patch after incorporating Jonathan's
> suggestions.
>
> Changes from the last patch include:
> - Add a TSAN macro to bits/c++config.
> - Use separate constexpr bool-s for the conditions for lock-freedom,
> double-width and alignment.
> - Move the code in the optimized path to a separate function
> _M_release_double_width_cas.
>
> Thanks,
> Maged
>
>
> On Fri, Jul 16, 2021 at 11:55 AM Maged Michael 
> wrote:
>
>> Thank you, Jonathan, for the detailed comments! I'll update the patch
>> accordingly.
>>
>> On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely 
>> wrote:
>>
>>> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
>>> >
>>> > Please find a proposed patch for _Sp_counted_base::_M_release to skip
>>> the
>>> > two atomic instructions that decrement each of the use count and the
>>> weak
>>> > count when both are 1. I proposed the general idea in an earlier
>>> thread (
>>> > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html)
>>> and got
>>> > useful feedback on a draft patch and responses to related questions
>>> about
>>> > multi-granular atomicity and alignment. This patch is based on that
>>> > feedback.
>>> >
>>> >
>>> > I added a check for thread sanitizer to use the current algorithm in
>>> that
>>> > case because TSAN does not support multi-granular atomicity. I'd like
>>> to
>>> > add a check of __has_feature(thread_sanitizer) for building using
>>> LLVM. I
>>> > found examples of __has_feature in libstdc++
>>>
>>> There are no uses of __has_feature in libstdc++. We do use
>>> __has_builtin (which GCC also supports) and Clang's __is_identifier
>>> (which GCC doesn't support) to work around some weird semantics of
>>> __has_builtin in older versions of Clang.
>>>
>>>
>>> > but it doesn't seem to be
>>> > recognized in shared_ptr_base.h. Any guidance on how to check
>>> > __has_feature(thread_sanitizer) in this patch?
>>>
>>> I think we want to do something like this in include/bits/c++config
>>>
>>> #if __SANITIZE_THREAD__
>>> #  define _GLIBCXX_TSAN 1
>>> #elif defined __has_feature
>>> # if __has_feature(thread_sanitizer)
>>> #  define _GLIBCXX_TSAN 1
>>> # endif
>>> #endif
>>>
>>> Then in bits/shared_ptr_base.h
>>>
>>> #if _GLIBCXX_TSAN
>>> _M_release_orig();
>>> return;
>>> #endif
>>>
>>>
>>>
>>> > GCC generates code for _M_release that is larger and more complex than
>>> that
>>> > generated by LLVM. I'd like to file a bug report about that. Jonathan,
>>>
>>> Is this the same issue as https://gcc.gnu.org/PR101406 ?
>>>
>>> Partly yes. Even when using __atomic_add_dispatch I noticed that clang
>> generated less code than gcc. I see in the response to the issue that the
>> new glibc is expected to optimize better. So maybe this will eliminate the
>> issue.
>>
>>
>>> > would you please create a bugzilla account for me (
>>> > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>>>
>>> Done (sorry, I didn't notice the request in this mail until coming
>>> back to it to review the patch properly).
>>>
>>> Thank you!
>>
>>
>>>
>>>
>>> >
>>> >
>>> > Information about the patch:
>>> >
>>> > - Benefits of the patch: Save the cost of the last atomic decrements of
>>> > each of the use count and the weak count in _Sp_counted_base. Atomic
>>> > instructions are significantly slower than regular loads and stores
>>> across
>>> > major architectures.
>>> >
>>> > - How current code works: _M_release() atomically decrements the use
>>> count,
>>> > checks if it was 1, if so calls _M_dispose(), atomically decrements the
>>> > weak count, checks if it was 1, and if so calls _M_destroy().
>>> >
>>> > - How the proposed patch works: _M_release() loads both use count and
>>> weak
>>> > count together atomically (when properly aligned), checks if the value
>>> is
>>> > equal to the value of both counts equal to 1 (e.g., 0x100000001), and
>>> if so
>>> > calls _M_dispose() and _M_destroy(). Otherwise, it follows the original
>>> > algorithm.
>>> >
>>> > - Why it works: When the current thread executing _M_release() finds
>>> each
>>> > of the counts is equal to 1, then (when _lock_policy is _S_atomic) no
>>> other
>>> > threads could possibly hold use or weak references to this control
>>> block.
>>> > That is, no other threads could possibly access the counts or the
>>> protected
>>> > object.
>>> >
>>> > - The proposed patch is intended to interact correctly with current
>>> code
>>> > (under certain conditions: _Lock_policy is _S_atomic, proper
>>> alignment, and
>>> > native lock-free support for atomic operations). That is, multiple
>>> threads
>>> > using different versions of the code with and without the patch
>>> operating
>>> > on the same objects should always interact correctly. The intent for
>>> the
>>> > patch is to be ABI compatible with the current implementation.
>>> >
>>> > - The proposed patch 

Re: [PATCH 1/2] Fix debug info for ignored decls at start of assembly

2021-08-02 Thread Richard Biener
On Fri, 30 Jul 2021, Bernd Edlinger wrote:

> 
> 
> On 7/29/21 9:23 AM, Richard Biener wrote:
> > On Wed, 28 Jul 2021, Bernd Edlinger wrote:
> > 
> >> On 7/28/21 2:51 PM, Richard Biener wrote:
> >>> On Mon, 26 Jul 2021, Bernd Edlinger wrote:
> >>>
>  Ignored function decls that are compiled at the start of
>  the assembly have bogus line numbers until the first .file
>  directive, as reported in PR101575.
> 
>  The work around for this issue is to emit a dummy .file
>  directive when the first function is DECL_IGNORED_P, when
>  that is not already done, mostly for -fdwarf-4.
> >>>
> >>> I wonder if it makes sense to unconditionally announce the
> >>> TU with a .file directive at the beginning.  ISTR this is
> >>> what we now do with -gdwarf-5.
> >>>
> >>
> >> Yes, that would work, even when the file name is not guessed
> >> correctly.
> >>
> >> Initially I had "" unconditionally here, and it did
> >> not really hurt, except that it is visible with readelf.
> > 
> > I think I'd prefer that, since if we don't announce a .file
> > before the first assembler statement but ask gas to produce
> > line info it might be tempted to create line info referencing
> > the possibly temporary filename of the assembler file which
> > is undesirable from a build reproducability point.
> > 
> 
> Yeah, I understand.
> 
> Meanwhile I found a simple C test case without ignored functions
> 
> $ cat test1.c
> asm("nop");
> int main () 
> {
>   return 0;
> }
> 
> $ gcc -g test1.c
> $ readelf --debug-dump=decodedline a.out 
> Contents of the .debug_line section:
> 
> CU: ./test1.c:
> File name    Line number    Starting address    View    Stmt
> test1.c                 5          0x401106                 x
> test1.c                 3          0x401107                 x
> test1.c                 4          0x40110b                 x
> test1.c                 5          0x401110                 x
> test1.c                 -          0x401112
> 
> even with the proposed patch, so I agree it is incomplete.
> 
> I tried the gdb test case and compiled it with different LTO
> options, but the gen_AT_string was always valid; in some
> cases an LTO debug section together with a couple of .file
> directives was output before the .file 0.
> So I'd like to use the file name from gen_AT_string, since
> it's accurate most of the time, and that avoids unnecessary
> confusion for readers of the produced debug info.
> 
> So I'd propose the attached patch instead.
> Is it OK for trunk?

Works for me - please give others a chance to comment though.

Thanks,
Richard.

> 
> > Richard.
> > 
> >>> Note get_AT_string (comp_unit_die (), DW_AT_name) doesn't
> >>> work with LTO, you'll get  then.
> >>>
> >>
> >> Yeah, that's why I wanted to restrict that to the case where
> >> it's absolutely necessary.
> >>
> >>> Is the dwarf assembler bug reported/fixed?  Can you include
> >>> a reference please?
> >>>
> >>
> >> I've just added a bug report, it's unlikely to be fixed IMHO:
> >> https://sourceware.org/bugzilla/show_bug.cgi?id=28149
> >>
> >> I will add that to the patch description:
> >>
>> Ignored function decls that are compiled at the start of
> >> the assembly have bogus line numbers until the first .file
> >> directive, as reported in PR101575.
> >>
> >> The corresponding binutils bug report is
> >> https://sourceware.org/bugzilla/show_bug.cgi?id=28149
> >>
> >> The work around for this issue is to emit a dummy .file
> >> directive when the first function is DECL_IGNORED_P, when
> >> that is not already done, mostly for -fdwarf-4.
> >>
> >>
> >> Thanks
> >> Bernd.
> >>
> >>> Thanks,
> >>> Richard.
> >>>
>  2021-07-24  Bernd Edlinger  
> 
>   PR ada/101575
>   * dwarf2out.c (dwarf2out_begin_prologue): Move init
>   of fde->ignored_debug to dwarf2out_set_ignored_loc.
>   (dwarf2out_set_ignored_loc): This is now also called
>   when no .loc statement is to be generated, in that case
>   we emit a dummy .file statement when needed.
>   * final.c (final_start_function_1,
>   final_scan_insn_1): Call debug_hooks->set_ignored_loc
>   for all DECL_IGNORED_P functions.
>  ---
>   gcc/dwarf2out.c | 29 +
>   gcc/final.c |  5 ++---
>   2 files changed, 27 insertions(+), 7 deletions(-)
> 
>  diff --git a/gcc/dwarf2out.c b/gcc/dwarf2out.c
>  index 884f1e1..8de0d6f 100644
>  --- a/gcc/dwarf2out.c
>  +++ b/gcc/dwarf2out.c
>  @@ -1115,7 +1115,6 @@ dwarf2out_begin_prologue (unsigned int line 
>  ATTRIBUTE_UNUSED,
> fde->dw_fde_current_label = dup_label;
> fde->in_std_section = (fnsec == text_section
>    || (cold_text_section && fnsec == 
>  cold_text_section));
>  -  fde->ignored_debug = DECL_IGNORED_P (current_function

Re: [PATCH 42/55] rs6000: Handle gimple folding of target built-ins

2021-08-02 Thread Bill Schmidt via Gcc-patches

Hi Will,

On 7/29/21 7:42 AM, Bill Schmidt wrote:

On 7/28/21 4:21 PM, will schmidt wrote:

On Thu, 2021-06-17 at 10:19 -0500, Bill Schmidt via Gcc-patches wrote:


+/* Vector compares; EQ, NE, GE, GT, LE.  */
+case RS6000_BIF_VCMPEQUB:
+case RS6000_BIF_VCMPEQUH:
+case RS6000_BIF_VCMPEQUW:
+case RS6000_BIF_VCMPEQUD:
+  fold_compare_helper (gsi, EQ_EXPR, stmt);
+  return true;
+
+case RS6000_BIF_VCMPNEB:
+case RS6000_BIF_VCMPNEH:
+case RS6000_BIF_VCMPNEW:
+  fold_compare_helper (gsi, NE_EXPR, stmt);
+  return true;
+
Noting that entries for _CMPNET, _VCMPEQUT, etc. are missing from this
version versus the non-new version of this function.
I believe this was/is deliberate and by design.
Same with entries for P10V_BUILTIN_CMPLE_1TI, etc. below.


Indeed not!  This is something I missed when new code was added after I
posted the original patch series.  I'll reinstate the quadword
compares.  Thanks for spotting this!



Interestingly, when the quadword compares are expanded at GIMPLE time, 
we generate worse code involving individual 64-bit compares.  For the 
time being, I will not expand these at GIMPLE time; independently, this 
bears looking at to see why expressions like (uint128_1 < uint128_2) 
will generate poor code.
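
For reference, the construct in question is simply a 128-bit integer
comparison; a minimal example (names made up):

typedef unsigned __int128 u128;

int
cmp_u128 (u128 a, u128 b)
{
  /* Per the above, this currently ends up as individual 64-bit
     compares and branches rather than a tight sequence.  */
  return a < b;
}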


Bill



Bill



[PATCH 1/2] Add emulated gather capability to the vectorizer

2021-08-02 Thread Richard Biener
This adds a gather vectorization capability to the vectorizer
without target support by decomposing the offset vector, doing
scalar loads and then building a vector from the result.  This
is aimed mainly at cases where vectorizing the rest of the loop
offsets the cost of vectorizing the gather.
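
As a rough sketch (types and the function name are mine, not part of the
patch), the decomposition for one 2-lane SSE2 iteration amounts to
something like:

typedef double v2df __attribute__ ((vector_size (16)));

static inline v2df
emulated_gather_v2df (const double *base, const int *offset)
{
  int o0 = offset[0];          /* extract the scalar offsets */
  int o1 = offset[1];
  double d0 = base[o0];        /* scalar loads instead of a gather insn */
  double d1 = base[o1];
  v2df res = { d0, d1 };       /* build the vector from the results */
  return res;
}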

Note it's difficult to avoid vectorizing the offset load, but in
some cases later passes can turn the vector load + extract into
scalar loads, see the followup patch.

On SPEC CPU 2017 510.parest_r this improves runtime from 250s
to 219s on a Zen2 CPU which has its native gather instructions
disabled (using those the runtime instead increases to 254s)
using -Ofast -march=znver2 [-flto].  It turns out the critical
loops in this benchmark all perform gather operations.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

2021-07-30  Richard Biener  

* tree-vect-data-refs.c (vect_check_gather_scatter):
Include widening conversions only when the result is
still handled by native gather or the current offset
size does not already match the data size.
Also succeed analysis in case there's no native support,
noted by an IFN_LAST ifn and a NULL decl.
(vect_analyze_data_refs): Always consider gathers.
* tree-vect-patterns.c (vect_recog_gather_scatter_pattern):
Test for no IFN gather rather than decl gather.
* tree-vect-stmts.c (vect_model_load_cost): Pass in the
gather-scatter info and cost emulated gathers accordingly.
(vect_truncate_gather_scatter_offset): Properly test for
no IFN gather.
(vect_use_strided_gather_scatters_p): Likewise.
(get_load_store_type): Handle emulated gathers and its
restrictions.
(vectorizable_load): Likewise.  Emulate them by extracting
scalar offsets, doing scalar loads and a vector construct.

* gcc.target/i386/vect-gather-1.c: New testcase.
* gfortran.dg/vect/vect-8.f90: Adjust.
---
 gcc/testsuite/gcc.target/i386/vect-gather-1.c |  18 
 gcc/testsuite/gfortran.dg/vect/vect-8.f90 |   2 +-
 gcc/tree-vect-data-refs.c |  34 --
 gcc/tree-vect-patterns.c  |   2 +-
 gcc/tree-vect-stmts.c | 100 --
 5 files changed, 138 insertions(+), 18 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-gather-1.c

diff --git a/gcc/testsuite/gcc.target/i386/vect-gather-1.c 
b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
new file mode 100644
index 000..134aef39666
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details" } */
+
+#ifndef INDEXTYPE
+#define INDEXTYPE int
+#endif
+double vmul(INDEXTYPE *rowstart, INDEXTYPE *rowend,
+   double *luval, double *dst)
+{
+  double res = 0;
+  for (const INDEXTYPE * col = rowstart; col != rowend; ++col, ++luval)
+res += *luval * dst[*col];
+  return res;
+}
+
+/* With gather emulation this should be profitable to vectorize
+   even with plain SSE2.  */
+/* { dg-final { scan-tree-dump "loop vectorized" "vect" } } */
diff --git a/gcc/testsuite/gfortran.dg/vect/vect-8.f90 
b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
index 9994805d77f..cc1aebfbd84 100644
--- a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
+++ b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
@@ -706,5 +706,5 @@ END SUBROUTINE kernel
 
 ! { dg-final { scan-tree-dump-times "vectorized 24 loops" 1 "vect" { target 
aarch64_sve } } }
 ! { dg-final { scan-tree-dump-times "vectorized 23 loops" 1 "vect" { target { 
aarch64*-*-* && { ! aarch64_sve } } } } }
-! { dg-final { scan-tree-dump-times "vectorized 2\[23\] loops" 1 "vect" { 
target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
+! { dg-final { scan-tree-dump-times "vectorized 2\[234\] loops" 1 "vect" { 
target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
 ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { 
{ ! vect_intdouble_cvt } && { ! aarch64*-*-* } } } } }
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 6995efba899..3c29ff04fd8 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4007,8 +4007,27 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, 
loop_vec_info loop_vinfo,
  continue;
}
 
- if (TYPE_PRECISION (TREE_TYPE (op0))
- < TYPE_PRECISION (TREE_TYPE (off)))
+ /* Include the conversion if it is widening and we're using
+the IFN path or the target can handle the converted from
+offset or the current size is not already the same as the
+data vector element size.  */
+ if ((TYPE_PRECISION (TREE_TYPE (op0))
+  < TYPE_PRECISION (TREE_TYPE (off)))
+ && ((!use_ifn_p
+  && (DR_IS_READ (dr)
+  ? (targetm.vectorize.builtin_gather
+  

[PATCH 2/2] Rewrite more vector loads to scalar loads

2021-08-02 Thread Richard Biener
This teaches forwprop to rewrite more vector loads that are only
used in BIT_FIELD_REFs as scalar loads.  This provides the
remaining uplift to SPEC CPU 2017 510.parest_r on Zen 2 which
has CPU gathers disabled.

In particular vector load + vec_unpack + bit-field-ref is turned
into (extending) scalar loads which avoids costly XMM/GPR
transitions.  To not conflict with vector load + bit-field-ref
+ vector constructor matching to vector load + shuffle the
extended transform is only done after vector lowering.
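
As a rough conceptual illustration (C with GCC vector extensions, not the
GIMPLE the pass actually emits), the effect on one widened lane of the
index vector is along these lines:

typedef int v4si __attribute__ ((vector_size (16)));

/* Before: vector load, then widen and extract the lane
   (vec_unpack + BIT_FIELD_REF), forcing an XMM/GPR transition.  */
static long
lane_before (const int *col, int i)
{
  v4si v = *(const v4si *) col;   /* alignment assumed for the sketch */
  return (long) v[i];
}

/* After: a single (sign-extending) scalar load per lane.  */
static long
lane_after (const int *col, int i)
{
  return (long) col[i];
}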

Overall the two patches provide a 22% speedup of 510.parest_r.

I'm in the process of confirming speedups of 500.perlbench_r,
557.xz_r, 549.fotonik3d_r and 554.roms_r as well as slowdowns
of 503.bwaves_r, 507.cactuBSSN_r and 538.imagick_r.

2021-07-30  Richard Biener  

* tree-ssa-forwprop.c (pass_forwprop::execute): Split
out code to decompose vector loads ...
(optimize_vector_load): ... here.  Generalize it to
handle intermediate widening and TARGET_MEM_REF loads
and apply it to loads with a supported vector mode as well.

* gcc.target/i386/vect-gather-1.c: Amend.
---
 gcc/testsuite/gcc.target/i386/vect-gather-1.c |   4 +-
 gcc/tree-ssa-forwprop.c   | 244 +-
 2 files changed, 185 insertions(+), 63 deletions(-)

diff --git a/gcc/testsuite/gcc.target/i386/vect-gather-1.c 
b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
index 134aef39666..261b66be061 100644
--- a/gcc/testsuite/gcc.target/i386/vect-gather-1.c
+++ b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details" } */
+/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details -fdump-tree-forwprop4" 
} */
 
 #ifndef INDEXTYPE
 #define INDEXTYPE int
@@ -16,3 +16,5 @@ double vmul(INDEXTYPE *rowstart, INDEXTYPE *rowend,
 /* With gather emulation this should be profitable to vectorize
even with plain SSE2.  */
 /* { dg-final { scan-tree-dump "loop vectorized" "vect" } } */
+/* The index vector loads and promotions should be scalar after forwprop.  */
+/* { dg-final { scan-tree-dump-not "vec_unpack" "forwprop4" } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index db3b18b275c..bd64b8e46bc 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -2757,6 +2757,182 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
 }
 
 
+/* Rewrite the vector load at *GSI to component-wise loads if the load
+   is only used in BIT_FIELD_REF extractions with eventual intermediate
+   widening.  */
+
+static void
+optimize_vector_load (gimple_stmt_iterator *gsi)
+{
+  gimple *stmt = gsi_stmt (*gsi);
+  tree lhs = gimple_assign_lhs (stmt);
+  tree rhs = gimple_assign_rhs1 (stmt);
+
+  /* Gather BIT_FIELD_REFs to rewrite, looking through
+ VEC_UNPACK_{LO,HI}_EXPR.  */
+  use_operand_p use_p;
+  imm_use_iterator iter;
+  bool rewrite = true;
+  auto_vec<gimple *> bf_stmts;
+  auto_vec<tree> worklist;
+  worklist.quick_push (lhs);
+  do
+{
+  tree def = worklist.pop ();
+  unsigned HOST_WIDE_INT def_eltsize
+   = TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (def))));
+  FOR_EACH_IMM_USE_FAST (use_p, iter, def)
+   {
+ gimple *use_stmt = USE_STMT (use_p);
+ if (is_gimple_debug (use_stmt))
+   continue;
+ if (!is_gimple_assign (use_stmt))
+   {
+ rewrite = false;
+ break;
+   }
+ enum tree_code use_code = gimple_assign_rhs_code (use_stmt);
+ tree use_rhs = gimple_assign_rhs1 (use_stmt);
+ if (use_code == BIT_FIELD_REF
+ && TREE_OPERAND (use_rhs, 0) == def
+ /* If its on the VEC_UNPACK_{HI,LO}_EXPR
+def need to verify it is element aligned.  */
+ && (def == lhs
+ || (known_eq (bit_field_size (use_rhs), def_eltsize)
+ && constant_multiple_p (bit_field_offset (use_rhs),
+ def_eltsize
+   {
+ bf_stmts.safe_push (use_stmt);
+ continue;
+   }
+ /* Walk through one level of VEC_UNPACK_{LO,HI}_EXPR.  */
+ if (def == lhs
+ && (use_code == VEC_UNPACK_HI_EXPR
+ || use_code == VEC_UNPACK_LO_EXPR)
+ && use_rhs == lhs)
+   {
+ worklist.safe_push (gimple_assign_lhs (use_stmt));
+ continue;
+   }
+ rewrite = false;
+ break;
+   }
+  if (!rewrite)
+   break;
+}
+  while (!worklist.is_empty ());
+
+  if (!rewrite)
+{
+  gsi_next (gsi);
+  return;
+}
+  /* We now have all ultimate uses of the load to rewrite in bf_stmts.  */
+
+  /* Prepare the original ref to be wrapped in adjusted BIT_FIELD_REFs.
+ For TARGET_MEM_REFs we have to separate the LEA from the reference.  */
+  tree load_rhs = rhs;
+  if (TREE_CODE (load_rhs) == TARGET_MEM_REF)
+{
+  

[PATCH 0/3] arm: fix problems when targetting extended FPUs [PR101723]

2021-08-02 Thread Richard Earnshaw via Gcc-patches
This patch series addresses an issue that has come to light due to a
change in the way GAS handles .fpu directives in the assembler.  A fix
to the assembler made in binutils 2.34 to clear out all features
related to the FPU when .fpu is emitted has started causing problems
for GCC because of the order in which we emit .fpu and .arch_extension
directives.  To fully address this we need to re-organize the way in
which the compiler does this.

I'll hold off pushing the patches for a couple of days.  Although I've
gone through the testsuite quite carefully and run this through
several configurations, it's possible that this may have some impact
on the testsuite that I've missed.  Christophe, is there any chance you
can run this through your test environment before I commit this?

R.

Richard Earnshaw (3):
  arm: ensure the arch_name is always set for the build target
  arm: Don't reconfigure globals in arm_configure_build_target
  arm: reorder assembler architecture directives [PR101723]

 gcc/config/arm/arm-c.c|   1 +
 gcc/config/arm/arm-cpus.in|   1 +
 gcc/config/arm/arm.c  | 190 --
 gcc/testsuite/gcc.target/arm/attr-neon.c  |   9 +-
 gcc/testsuite/gcc.target/arm/attr-neon2.c |  35 +++-
 gcc/testsuite/gcc.target/arm/attr-neon3.c |  43 +++-
 .../arm/cortex-m55-nofp-flag-hard.c   |   2 +-
 .../arm/cortex-m55-nofp-flag-softfp.c |   2 +-
 .../arm/cortex-m55-nofp-nomve-flag-softfp.c   |   2 +-
 .../gcc.target/arm/mve/intrinsics/mve_fpu1.c  |   5 +-
 .../gcc.target/arm/mve/intrinsics/mve_fpu2.c  |   5 +-
 gcc/testsuite/gcc.target/arm/pr98636.c|   3 +-
 12 files changed, 153 insertions(+), 145 deletions(-)

-- 
2.25.1



[PATCH 1/3] arm: ensure the arch_name is always set for the build target

2021-08-02 Thread Richard Earnshaw via Gcc-patches

This should never happen now if GCC is invoked by the driver, but in
the unusual case of calling cc1 (or its ilk) directly from the command
line the build target's arch_name string can remain NULL.  This can
complicate later processing meaning that we need to check for this
case explicitly in some circumstances.  Nothing should rely on this
behaviour, so it's simpler to always set the arch_name when
configuring the build target and be done with it.

gcc:

* config/arm/arm.c (arm_configure_build_target): Ensure the target's
arch_name is always set.
---
 gcc/config/arm/arm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 6d781e23ee9..b2dd58d8751 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -3432,6 +3432,8 @@ arm_configure_build_target (struct arm_build_target *target,
   const cpu_tune *tune_data = &all_tunes[arm_selected_tune - all_cores];
 
   /* Finish initializing the target structure.  */
+  if (!target->arch_name)
+target->arch_name = arm_selected_arch->common.name;
   target->arch_pp_name = arm_selected_arch->arch;
   target->base_arch = arm_selected_arch->base_arch;
   target->profile = arm_selected_arch->profile;


[PATCH 2/3] arm: Don't reconfigure globals in arm_configure_build_target

2021-08-02 Thread Richard Earnshaw via Gcc-patches

arm_configure_build_target is usually used to reconfigure the
arm_active_target structure, which is then used to reconfigure a
number of other global variables describing the current target.
Occasionally, however, we need to use arm_configure_build_target to
construct a temporary target structure and in that case it is wrong to
try to reconfigure the global variables (although probably harmless,
since arm_option_reconfigure_globals() only looks at
arm_active_target).  At the very least, however, this is wasted work,
so it is best not to do it unless needed.  What's more, several
callers of arm_configure_build_target call
arm_option_reconfigure_globals themselves within a few lines, making
the call from within arm_configure_build_target completely redundant.

So this patch moves the responsibility for calling
arm_option_reconfigure_globals to the callers of
arm_configure_build_target (only two places needed
updating).

gcc:
* config/arm/arm.c (arm_configure_build_target): Don't call
arm_option_reconfigure_globals.
(arm_option_restore): Call arm_option_reconfigure_globals after
reconfiguring the target.
* config/arm/arm-c.c (arm_pragma_target_parse): Likewise.
---
 gcc/config/arm/arm-c.c | 1 +
 gcc/config/arm/arm.c   | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/config/arm/arm-c.c b/gcc/config/arm/arm-c.c
index ae2139c4bfa..cc7901bca8d 100644
--- a/gcc/config/arm/arm-c.c
+++ b/gcc/config/arm/arm-c.c
@@ -409,6 +409,7 @@ arm_pragma_target_parse (tree args, tree pop_target)
   target_option_current_node = cur_tree;
   arm_configure_build_target (&arm_active_target,
   TREE_TARGET_OPTION (cur_tree), false);
+  arm_option_reconfigure_globals ();
 }
 
   /* Update macros if target_node changes. The global state will be restored
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index b2dd58d8751..273202ac2fd 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -3058,6 +3058,7 @@ arm_option_restore (struct gcc_options */* opts */,
 		struct cl_target_option *ptr)
 {
   arm_configure_build_target (&arm_active_target, ptr, false);
+  arm_option_reconfigure_globals ();
 }
 
 /* Reset options between modes that the user has specified.  */
@@ -3441,7 +3442,6 @@ arm_configure_build_target (struct arm_build_target *target,
   target->tune_flags = tune_data->tune_flags;
   target->tune = tune_data->tune;
   target->tune_core = tune_data->scheduler;
-  arm_option_reconfigure_globals ();
 }
 
 /* Fix up any incompatible options that the user has specified.  */


[PATCH 3/3] arm: reorder assembler architecture directives [PR101723]

2021-08-02 Thread Richard Earnshaw via Gcc-patches

A change to the way gas interprets the .fpu directive in binutils-2.34
means that issuing .fpu will clear any features set by .arch_extension
that apply to the floating point or simd units.  This unfortunately
causes problems for more recent versions of the architecture because
we currently emit .arch, .arch_extension and .fpu directives at
different times and try to suppress redundant changes.

This change addresses this by firstly unifying all the places where we
emit these directives to a single block of code and secondly
(re)emitting all the directives if any changes have been made to the
target options.  Whilst this is slightly more than the strict minimum
it should be enough to catch all cases where a change could have
happened.  The new code also emits the directives in the order: .arch,
.fpu, .arch_extension.  This ensures that the additional architectural
extensions are not removed by a later .fpu directive.

Whilst writing this patch I also noticed that in the corner case where
the last function to be compiled had a non-standard set of
architecture flags, the assembler would add an incorrect set of
derived attributes for the file as a whole.  Instead of reflecting the
command-line options it would reflect the flags from the last function in
the file.  To address this I've also added a call to re-emit the
flags from the asm_file_end callback so the assembler will be in the
correct state when it finishes processing the input.

There's some slight churn to the testsuite as a consequence of this,
because previously we had a hack to suppress emitting a .fpu directive
for one specific case, but with the new order this is no-longer
necessary.

gcc/ChangeLog:

PR target/101723
* config/arm/arm-cpus.in (generic-armv7-a): Add quirk to suppress
writing .cpu directive in asm output.
* config/arm/arm.c (arm_identify_fpu_from_isa): New variable.
(arm_last_printed_arch_string): Delete.
(arm_last_printed_fpu_string): Delete.
(arm_configure_build_target): If use of floating-point/SIMD is
disabled, remove all fp/simd related features from the target ISA.
(last_arm_targ_options): New variable.
(arm_print_asm_arch_directives): Add new parameters.  Change order
of emitted directives and handle all cases here.
(arm_file_start): Always call arm_print_asm_arch_directives, move
all generation of .arch/.arch_extension here.
(arm_file_end): Call arm_print_asm_arch.
(arm_declare_function_name): Call arm_print_asm_arch_directives
instead of printing .arch/.fpu directives directly.

gcc/testsuite/ChangeLog:

PR target/101723
* gcc.target/arm/cortex-m55-nofp-flag-hard.c: Update expected output.
* gcc.target/arm/cortex-m55-nofp-flag-softfp.c: Likewise.
* gcc.target/arm/cortex-m55-nofp-nomve-flag-softfp.c: Likewise.
* gcc.target/arm/mve/intrinsics/mve_fpu1.c: Convert to dg-do assemble.
Add a non-no-op function body.
* gcc.target/arm/mve/intrinsics/mve_fpu2.c: Likewise.
* gcc.target/arm/pr98636.c (dg-options): Add -mfloat-abi=softfp.
* gcc.target/arm/attr-neon.c: Tighten scan-assembler tests.
* gcc.target/arm/attr-neon2.c: Use -Ofast, convert test to use
check-function-bodies.
* gcc.target/arm/attr-neon3.c: Likewise.
---
 gcc/config/arm/arm-cpus.in|   1 +
 gcc/config/arm/arm.c  | 186 --
 gcc/testsuite/gcc.target/arm/attr-neon.c  |   9 +-
 gcc/testsuite/gcc.target/arm/attr-neon2.c |  35 ++--
 gcc/testsuite/gcc.target/arm/attr-neon3.c |  43 ++--
 .../arm/cortex-m55-nofp-flag-hard.c   |   2 +-
 .../arm/cortex-m55-nofp-flag-softfp.c |   2 +-
 .../arm/cortex-m55-nofp-nomve-flag-softfp.c   |   2 +-
 .../gcc.target/arm/mve/intrinsics/mve_fpu1.c  |   5 +-
 .../gcc.target/arm/mve/intrinsics/mve_fpu2.c  |   5 +-
 gcc/testsuite/gcc.target/arm/pr98636.c|   3 +-
 11 files changed, 149 insertions(+), 144 deletions(-)

diff --git a/gcc/config/arm/arm-cpus.in b/gcc/config/arm/arm-cpus.in
index ab4b6acf5ea..249995a6bca 100644
--- a/gcc/config/arm/arm-cpus.in
+++ b/gcc/config/arm/arm-cpus.in
@@ -1080,6 +1080,7 @@ begin cpu generic-armv7-a
  cname genericv7a
  tune flags LDSCHED
  architecture armv7-a+fp
+ isa quirk_no_asmcpu
  option mp add mp
  option sec add sec
  option vfpv3-d16 add VFPv3 FP_DBL
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 273202ac2fd..11dafc70067 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -79,10 +79,6 @@
 typedef struct minipool_node    Mnode;
 typedef struct minipool_fixup   Mfix;
 
-/* The last .arch and .fpu assembly strings that we printed.  */
-static std::string arm_last_printed_arch_string;
-static std::string arm_last_printed_fpu_string;
-
 void (*arm_lang_output_object_attributes_hook)(void);
 
 struct four_ints
@@ -334,6 +330,7 @@ static rtx_insn *thumb1_md_asm_

[PATCH v7 03/10] x86: Update piecewise move and store

2021-08-02 Thread H.J. Lu via Gcc-patches
On Mon, Aug 2, 2021 at 4:20 AM Uros Bizjak  wrote:
>
> On Fri, Jul 30, 2021 at 11:32 PM H.J. Lu  wrote:
> >
> > We can use TImode/OImode/XImode integers for piecewise move and store.
> >
> > 1. Define MAX_MOVE_MAX to 64, which is the constant maximum number of
> > bytes that a single instruction can move quickly between memory and
> > registers or between two memory locations.
> > 2. Define MOVE_MAX to MOVE_MAX_PIECES, which is the maximum number of
> > bytes we can move from memory to memory in one reasonably fast instruction.
> > The difference between MAX_MOVE_MAX and MOVE_MAX is that MAX_MOVE_MAX
> > must be a constant, independent of compiler options, since it is used in
> > reload.h to define struct target_reload and MOVE_MAX can vary, depending
> > on compiler options.
> > 3. When vector register is used for piecewise move and store, we don't
> > increase stack_alignment_needed since vector register spill isn't
> > required for piecewise move and store.  Since stack_realign_needed is
> > set to true by checking stack_alignment_estimated set by pseudo vector
> > register usage, we also need to check stack_realign_needed to eliminate
> > frame pointer.
> >
> > gcc/
> >
> > * config/i386/i386.c (ix86_finalize_stack_frame_flags): Also
> > check stack_realign_needed for stack realignment.
> > (ix86_legitimate_constant_p): Always allow CONST_WIDE_INT smaller
> > than the largest integer supported by vector register.
> > * config/i386/i386.h (MAX_MOVE_MAX): New.  Set to 64.
> > (MOVE_MAX_PIECES): Set to bytes of the largest integer supported
> > by vector register.
> > (MOVE_MAX): Defined to MOVE_MAX_PIECES.
> > (STORE_MAX_PIECES): New.
> >
> > gcc/testsuite/
> >
> > * gcc.target/i386/pr90773-1.c: Adjust to expect movq for 32-bit.
> > * gcc.target/i386/pr90773-4.c: Also run for 32-bit.
> > * gcc.target/i386/pr90773-15.c: Likewise.
> > * gcc.target/i386/pr90773-16.c: Likewise.
> > * gcc.target/i386/pr90773-17.c: Likewise.
> > * gcc.target/i386/pr90773-24.c: Likewise.
> > * gcc.target/i386/pr90773-25.c: Likewise.
> > * gcc.target/i386/pr100865-1.c: Likewise.
> > * gcc.target/i386/pr100865-2.c: Likewise.
> > * gcc.target/i386/pr100865-3.c: Likewise.
> > * gcc.target/i386/pr90773-14.c: Also run for 32-bit and expect
> > XMM movd to store 4 bytes.
> > * gcc.target/i386/pr100865-4a.c: Also run for 32-bit and expect
> > YMM registers.
> > * gcc.target/i386/pr100865-4b.c: Likewise.
> > * gcc.target/i386/pr100865-10a.c: Expect YMM registers.
> > * gcc.target/i386/pr100865-10b.c: Likewise.
> > ---
> >  gcc/config/i386/i386.c   | 21 --
> >  gcc/config/i386/i386.h   | 40 
> >  gcc/testsuite/gcc.target/i386/pr100865-1.c   |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr100865-10a.c |  4 +-
> >  gcc/testsuite/gcc.target/i386/pr100865-10b.c |  4 +-
> >  gcc/testsuite/gcc.target/i386/pr100865-2.c   |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr100865-3.c   |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr100865-4a.c  |  6 +--
> >  gcc/testsuite/gcc.target/i386/pr100865-4b.c  |  8 ++--
> >  gcc/testsuite/gcc.target/i386/pr90773-1.c| 10 ++---
> >  gcc/testsuite/gcc.target/i386/pr90773-14.c   |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr90773-15.c   |  6 +--
> >  gcc/testsuite/gcc.target/i386/pr90773-16.c   |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr90773-17.c   |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr90773-24.c   |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr90773-25.c   |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr90773-4.c|  2 +-
> >  17 files changed, 76 insertions(+), 41 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > index 5d20ca2067f..842eb0e6786 100644
> > --- a/gcc/config/i386/i386.c
> > +++ b/gcc/config/i386/i386.c
> > @@ -7953,8 +7953,17 @@ ix86_finalize_stack_frame_flags (void)
> >   assumed stack realignment might be needed or -fno-omit-frame-pointer
> >   is used, but in the end nothing that needed the stack alignment had
> >   been spilled nor stack access, clear frame_pointer_needed and say we
> > - don't need stack realignment.  */
> > -  if ((stack_realign || (!flag_omit_frame_pointer && optimize))
> > + don't need stack realignment.
> > +
> > + When vector register is used for piecewise move and store, we don't
> > + increase stack_alignment_needed as there is no register spill for
> > + piecewise move and store.  Since stack_realign_needed is set to true
> > + by checking stack_alignment_estimated which is updated by pseudo
> > + vector register usage, we also need to check stack_realign_needed to
> > + eliminate frame pointer.  */
> > +  if ((stack_realign
> > +   || (!flag_omit_frame_pointer && optimize)
> > +   || crtl->stack_realign_nee

[PUSHED] Remove --param=threader-iterative.

2021-08-02 Thread Aldy Hernandez via Gcc-patches
This was meant to be an internal construct, but I see folks are using
it and submitting PRs against it.  Let's just remove this to avoid
further confusion.

Tested on x86-64 Linux.

Pushed to keep PRs down and Jeff happy :).

gcc/ChangeLog:

PR tree-optimization/101724
* params.opt: Remove --param=threader-iterative.
* tree-ssa-threadbackward.c (pass_thread_jumps::execute): Remove
iterative mode.
---
 gcc/params.opt|  4 
 gcc/tree-ssa-threadbackward.c | 18 ++
 2 files changed, 2 insertions(+), 20 deletions(-)

diff --git a/gcc/params.opt b/gcc/params.opt
index f1f47b44215..aa2fb4047b6 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1010,10 +1010,6 @@ Maximum depth of DFS walk used by modref escape analysis.
 Common Joined UInteger Var(param_modref_max_escape_points) Init(256) Param 
Optimization
 Maximum number of escape points tracked by modref per SSA-name.
 
--param=threader-iterative=
-Common Joined UInteger Var(param_threader_iterative) Init(0) Param Optimization
-Run backwards threader in iterative mode.
-
 -param=threader-mode=
 Common Joined Var(param_threader_mode) Enum(threader_mode) 
Init(THREADER_MODE_RANGER) Param Optimization
 --param=threader-mode=[legacy|ranger] Specifies the mode the backwards 
threader should run in.
diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
index 2c0e9751101..91ce443b1b2 100644
--- a/gcc/tree-ssa-threadbackward.c
+++ b/gcc/tree-ssa-threadbackward.c
@@ -1342,24 +1342,10 @@ pass_thread_jumps::execute (function *fun)
 {
   loop_optimizer_init (LOOPS_HAVE_PREHEADERS | LOOPS_HAVE_SIMPLE_LATCHES);
 
-  // Iterative mode is a testing construct and is not meant for public
-  // consumption.  It is OFF by default.
-  bool iterative = param_threader_iterative;
-
-  bool changed = false;
-  while (try_thread_blocks (fun))
-{
-  changed = true;
-
-  if (!iterative)
-   break;
-
-  if ((param_threader_mode & THREADER_MODE_RANGER) == 0)
-   break;
-  cleanup_tree_cfg (TODO_update_ssa);
-}
+  bool changed = try_thread_blocks (fun);
 
   loop_optimizer_finalize ();
+
   return changed ? TODO_cleanup_cfg : 0;
 }
 
-- 
2.31.1



[committed] libstc++: Add dg-error for additional error in C++11 mode

2021-08-02 Thread Jonathan Wakely via Gcc-patches
When the comparison with a nullptr_t is ill-formed, there is an
additional error for C++11 mode due to the constexpr function body being
invalid.

Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

* testsuite/20_util/tuple/comparison_operators/overloaded2.cc:
Add dg-error for c++11_only target.

Tested powerpc64le-linux. Committed to trunk.

commit 2aaf69133f39f2409baa146e755a6c52eee4452f
Author: Jonathan Wakely 
Date:   Mon Aug 2 14:41:17 2021

libstc++: Add dg-error for additional error in C++11 mode

When the comparison with a nullptr_t is ill-formed, there is an
additional error for C++11 mode due to the constexpr function body being
invalid.

Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

* testsuite/20_util/tuple/comparison_operators/overloaded2.cc:
Add dg-error for c++11_only target.

diff --git 
a/libstdc++-v3/testsuite/20_util/tuple/comparison_operators/overloaded2.cc 
b/libstdc++-v3/testsuite/20_util/tuple/comparison_operators/overloaded2.cc
index fc3118f85a8..a66a9315902 100644
--- a/libstdc++-v3/testsuite/20_util/tuple/comparison_operators/overloaded2.cc
+++ b/libstdc++-v3/testsuite/20_util/tuple/comparison_operators/overloaded2.cc
@@ -50,3 +50,4 @@ auto a = std::make_tuple(nullptr, Compares{}, 2, 'U');
 auto b = a < a;
 
 // { dg-error "ordered comparison" "" { target *-*-* } 0 }
+// { dg-error "not a return-statement" "" { target c++11_only } 0 }


[committed] libstdc++: Fix filesystem::temp_directory_path [PR101709]

2021-08-02 Thread Jonathan Wakely via Gcc-patches

On 30/07/21 18:13 +0100, Jonathan Wakely wrote:

This adds a configure check for the GNU extension secure_getenv and then
uses it for looking up TMPDIR and similar variables.

Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

PR libstdc++/65018
* configure.ac: Check for secure_getenv.
* config.h.in: Regenerate.
* configure: Regenerate.
* src/filesystem/ops-common.h (get_temp_directory_from_env): New
helper function to obtain path from the environment.
* src/c++17/fs_ops.cc (fs::temp_directory_path): Use new helper.
* src/filesystem/ops.cc (fs::temp_directory_path): Likewise.
* testsuite/27_io/filesystem/operations/temp_directory_path.cc:
Print messages if test cannot be run.
* testsuite/experimental/filesystem/operations/temp_directory_path.cc:
Likewise. Fix incorrect condition. Use "TMP" to work with
Windows as well as POSIX.


This refactoring broke the Windows build. Fixed like so.

Tested powerpc64le-linux and x86_64-w64-mingw32 (cross-compiled and
run under Wine). Committed to trunk.


commit 38fb24ba4d67254cea78731fc8d961903dad9646
Author: Jonathan Wakely 
Date:   Mon Aug 2 15:52:41 2021

libstdc++: Fix filesystem::temp_directory_path [PR101709]

Signed-off-by: Jonathan Wakely 

libstdc++-v3/ChangeLog:

PR libstdc++/101709
* src/filesystem/ops-common.h (get_temp_directory_from_env):
Add error_code parameter.
* src/c++17/fs_ops.cc (fs::temp_directory_path): Pass error_code
argument to get_temp_directory_from_env and check it.
* src/filesystem/ops.cc (fs::temp_directory_path): Likewise.

diff --git a/libstdc++-v3/src/c++17/fs_ops.cc b/libstdc++-v3/src/c++17/fs_ops.cc
index db2250e4841..0e2d952673f 100644
--- a/libstdc++-v3/src/c++17/fs_ops.cc
+++ b/libstdc++-v3/src/c++17/fs_ops.cc
@@ -1604,7 +1604,9 @@ fs::temp_directory_path()
 fs::path
 fs::temp_directory_path(error_code& ec)
 {
-  path p = fs::get_temp_directory_from_env();
+  path p = fs::get_temp_directory_from_env(ec);
+  if (ec)
+return p;
   auto st = status(p, ec);
   if (ec)
 p.clear();
diff --git a/libstdc++-v3/src/filesystem/ops-common.h b/libstdc++-v3/src/filesystem/ops-common.h
index b8bbf446883..304e5b263fb 100644
--- a/libstdc++-v3/src/filesystem/ops-common.h
+++ b/libstdc++-v3/src/filesystem/ops-common.h
@@ -572,7 +572,7 @@ _GLIBCXX_BEGIN_NAMESPACE_FILESYSTEM
   // Caller must check that the path is an accessible directory.
 #ifdef _GLIBCXX_FILESYSTEM_IS_WINDOWS
   inline wstring
-  get_temp_directory_from_env()
+  get_temp_directory_from_env(error_code& ec)
   {
 unsigned len = 1024;
 std::wstring buf;
@@ -583,17 +583,18 @@ _GLIBCXX_BEGIN_NAMESPACE_FILESYSTEM
   } while (len > buf.size());
 
 if (len == 0)
-  {
-	ec.assign((int)GetLastError(), std::system_category());
-	return p;
-  }
+  ec.assign((int)GetLastError(), std::system_category());
+else
+  ec.clear();
+
 buf.resize(len);
 return buf;
   }
 #else
   inline const char*
-  get_temp_directory_from_env() noexcept
+  get_temp_directory_from_env(error_code& ec) noexcept
   {
+ec.clear();
 for (auto env : { "TMPDIR", "TMP", "TEMP", "TEMPDIR" })
   {
 #if _GLIBCXX_HAVE_SECURE_GETENV
diff --git a/libstdc++-v3/src/filesystem/ops.cc b/libstdc++-v3/src/filesystem/ops.cc
index 3494cbd19f9..b0a0f15e98f 100644
--- a/libstdc++-v3/src/filesystem/ops.cc
+++ b/libstdc++-v3/src/filesystem/ops.cc
@@ -1302,7 +1302,9 @@ fs::temp_directory_path()
 fs::path
 fs::temp_directory_path(error_code& ec)
 {
-  path p = fs::get_temp_directory_from_env();
+  path p = fs::get_temp_directory_from_env(ec);
+  if (ec)
+return p;
   auto st = status(p, ec);
   if (ec)
 p.clear();


Re: [PATCH v7 03/10] x86: Update piecewise move and store

2021-08-02 Thread Uros Bizjak via Gcc-patches
On Mon, Aug 2, 2021 at 4:57 PM H.J. Lu  wrote:
>
> On Mon, Aug 2, 2021 at 4:20 AM Uros Bizjak  wrote:
> >
> > On Fri, Jul 30, 2021 at 11:32 PM H.J. Lu  wrote:
> > >
> > > We can use TImode/OImode/XImode integers for piecewise move and store.
> > >
> > > 1. Define MAX_MOVE_MAX to 64, which is the constant maximum number of
> > > bytes that a single instruction can move quickly between memory and
> > > registers or between two memory locations.
> > > 2. Define MOVE_MAX to MOVE_MAX_PIECES, which is the maximum number of
> > > bytes we can move from memory to memory in one reasonably fast 
> > > instruction.
> > > The difference between MAX_MOVE_MAX and MOVE_MAX is that MAX_MOVE_MAX
> > > must be a constant, independent of compiler options, since it is used in
> > > reload.h to define struct target_reload and MOVE_MAX can vary, depending
> > > on compiler options.
> > > 3. When vector register is used for piecewise move and store, we don't
> > > increase stack_alignment_needed since vector register spill isn't
> > > required for piecewise move and store.  Since stack_realign_needed is
> > > set to true by checking stack_alignment_estimated set by pseudo vector
> > > register usage, we also need to check stack_realign_needed to eliminate
> > > frame pointer.
> > >
> > > gcc/
> > >
> > > * config/i386/i386.c (ix86_finalize_stack_frame_flags): Also
> > > check stack_realign_needed for stack realignment.
> > > (ix86_legitimate_constant_p): Always allow CONST_WIDE_INT smaller
> > > than the largest integer supported by vector register.
> > > * config/i386/i386.h (MAX_MOVE_MAX): New.  Set to 64.
> > > (MOVE_MAX_PIECES): Set to bytes of the largest integer supported
> > > by vector register.
> > > (MOVE_MAX): Defined to MOVE_MAX_PIECES.
> > > (STORE_MAX_PIECES): New.
> > >
> > > gcc/testsuite/
> > >
> > > * gcc.target/i386/pr90773-1.c: Adjust to expect movq for 32-bit.
> > > * gcc.target/i386/pr90773-4.c: Also run for 32-bit.
> > > * gcc.target/i386/pr90773-15.c: Likewise.
> > > * gcc.target/i386/pr90773-16.c: Likewise.
> > > * gcc.target/i386/pr90773-17.c: Likewise.
> > > * gcc.target/i386/pr90773-24.c: Likewise.
> > > * gcc.target/i386/pr90773-25.c: Likewise.
> > > * gcc.target/i386/pr100865-1.c: Likewise.
> > > * gcc.target/i386/pr100865-2.c: Likewise.
> > > * gcc.target/i386/pr100865-3.c: Likewise.
> > > * gcc.target/i386/pr90773-14.c: Also run for 32-bit and expect
> > > XMM movd to store 4 bytes.
> > > * gcc.target/i386/pr100865-4a.c: Also run for 32-bit and expect
> > > YMM registers.
> > > * gcc.target/i386/pr100865-4b.c: Likewise.
> > > * gcc.target/i386/pr100865-10a.c: Expect YMM registers.
> > > * gcc.target/i386/pr100865-10b.c: Likewise.
> > > ---
> > >  gcc/config/i386/i386.c   | 21 --
> > >  gcc/config/i386/i386.h   | 40 
> > >  gcc/testsuite/gcc.target/i386/pr100865-1.c   |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr100865-10a.c |  4 +-
> > >  gcc/testsuite/gcc.target/i386/pr100865-10b.c |  4 +-
> > >  gcc/testsuite/gcc.target/i386/pr100865-2.c   |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr100865-3.c   |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr100865-4a.c  |  6 +--
> > >  gcc/testsuite/gcc.target/i386/pr100865-4b.c  |  8 ++--
> > >  gcc/testsuite/gcc.target/i386/pr90773-1.c| 10 ++---
> > >  gcc/testsuite/gcc.target/i386/pr90773-14.c   |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr90773-15.c   |  6 +--
> > >  gcc/testsuite/gcc.target/i386/pr90773-16.c   |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr90773-17.c   |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr90773-24.c   |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr90773-25.c   |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr90773-4.c|  2 +-
> > >  17 files changed, 76 insertions(+), 41 deletions(-)
> > >
> > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > > index 5d20ca2067f..842eb0e6786 100644
> > > --- a/gcc/config/i386/i386.c
> > > +++ b/gcc/config/i386/i386.c
> > > @@ -7953,8 +7953,17 @@ ix86_finalize_stack_frame_flags (void)
> > >   assumed stack realignment might be needed or -fno-omit-frame-pointer
> > >   is used, but in the end nothing that needed the stack alignment had
> > >   been spilled nor stack access, clear frame_pointer_needed and say we
> > > - don't need stack realignment.  */
> > > -  if ((stack_realign || (!flag_omit_frame_pointer && optimize))
> > > + don't need stack realignment.
> > > +
> > > + When vector register is used for piecewise move and store, we don't
> > > + increase stack_alignment_needed as there is no register spill for
> > > + piecewise move and store.  Since stack_realign_needed is set to true
> > > + by checking stack_alignment_estimated which is updated by pseudo
> 

Re: [PATCH 2/2] Ada: Remove debug line number for DECL_IGNORED_P functions

2021-08-02 Thread Bernd Edlinger
On 8/2/21 3:07 PM, Eric Botcazou wrote:
>> It was pointed out in PR101598 to be inappropriate, that
>> ignored Ada decls receive the source line number which was
>> recorded in the function decl's DECL_SOURCE_LOCATION.
>> Therefore set all front-end-generated Ada decls with
>> DECL_IGNORED_P to UNKNOWN_LOCATION.
>>
>> 2021-07-24  Bernd Edlinger  
>>
>>  PR debug/101598
>>  * gcc-interface/trans.c (Subprogram_Body_to_gnu): Set the
>>  DECL_SOURCE_LOCATION of DECL_IGNORED_P gnu_subprog_decl to
>>  UNKNOWN_LOCATION.
> 
> Is that really needed in DWARF 5?  If no, I'm not sure that we want it.
> 

No, this one is completely optional, only DWARF 4 may have additional
issues without part 1/2 of this patch.

The location of these ignored Ada decls looks completely sane to me.
However, it was an unintentional side effect of the patch to allow
minimal debugging of ignored decls.  This means we can now step into
those functions or set line breakpoints there, while previously that
was not possible.  And I guess it could be considered an improvement.

So it's your choice, how you want these functions to be debugged.


Thanks
Bernd.


[PATCH] x86: Use XMM31 for scratch SSE register

2021-08-02 Thread H.J. Lu via Gcc-patches
In 64-bit mode, use XMM31 for scratch SSE register to avoid vzeroupper
if possible.

gcc/

* config/i386/i386.c (ix86_gen_scratch_sse_rtx): In 64-bit mode,
try XMM31 to avoid vzeroupper.

gcc/testsuite/

* gcc.target/i386/avx-vzeroupper-14.c: Pass -mno-avx512f to
disable XMM31.
* gcc.target/i386/avx-vzeroupper-15.c: Likewise.
* gcc.target/i386/pr82941-1.c: Updated.  Check for vzeroupper.
* gcc.target/i386/pr82942-1.c: Likewise.
* gcc.target/i386/pr82990-1.c: Likewise.
* gcc.target/i386/pr82990-3.c: Likewise.
* gcc.target/i386/pr82990-5.c: Likewise.
* gcc.target/i386/pr100865-4b.c: Likewise.
* gcc.target/i386/pr100865-6b.c: Likewise.
* gcc.target/i386/pr100865-7b.c: Likewise.
* gcc.target/i386/pr100865-10b.c: Likewise.
* gcc.target/i386/pr100865-8b.c: Updated.
* gcc.target/i386/pr100865-9b.c: Likewise.
* gcc.target/i386/pr100865-11b.c: Likewise.
* gcc.target/i386/pr100865-12b.c: Likewise.
---
 gcc/config/i386/i386.c | 18 +++---
 .../gcc.target/i386/avx-vzeroupper-14.c|  2 +-
 .../gcc.target/i386/avx-vzeroupper-15.c|  2 +-
 gcc/testsuite/gcc.target/i386/pr100865-10b.c   |  1 +
 gcc/testsuite/gcc.target/i386/pr100865-11b.c   |  2 +-
 gcc/testsuite/gcc.target/i386/pr100865-12b.c   |  2 +-
 gcc/testsuite/gcc.target/i386/pr100865-4b.c|  2 ++
 gcc/testsuite/gcc.target/i386/pr100865-6b.c|  5 -
 gcc/testsuite/gcc.target/i386/pr100865-7b.c|  5 -
 gcc/testsuite/gcc.target/i386/pr100865-8b.c|  2 +-
 gcc/testsuite/gcc.target/i386/pr100865-9b.c|  2 +-
 gcc/testsuite/gcc.target/i386/pr82941-1.c  |  3 ++-
 gcc/testsuite/gcc.target/i386/pr82942-1.c  |  3 ++-
 gcc/testsuite/gcc.target/i386/pr82990-1.c  |  3 ++-
 gcc/testsuite/gcc.target/i386/pr82990-3.c  |  3 ++-
 gcc/testsuite/gcc.target/i386/pr82990-5.c  |  3 ++-
 16 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 842eb0e6786..ec0690876b7 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -23335,9 +23335,21 @@ rtx
 ix86_gen_scratch_sse_rtx (machine_mode mode)
 {
   if (TARGET_SSE && !lra_in_progress)
-return gen_rtx_REG (mode, (TARGET_64BIT
-  ? LAST_REX_SSE_REG
-  : LAST_SSE_REG));
+{
+  unsigned int regno;
+  if (TARGET_64BIT)
+   {
+ /* In 64-bit mode, use XMM31 to avoid vzeroupper and always
+use XMM31 for CSE.  */
+ if (ix86_hard_regno_mode_ok (LAST_EXT_REX_SSE_REG, mode))
+   regno = LAST_EXT_REX_SSE_REG;
+ else
+   regno = LAST_REX_SSE_REG;
+   }
+  else
+   regno = LAST_SSE_REG;
+  return gen_rtx_REG (mode, regno);
+}
   else
 return gen_reg_rtx (mode);
 }
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c 
b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c
index a31b4a2a63a..9590f25da22 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
+/* { dg-options "-O2 -mavx -mno-avx512f -mtune=generic -dp" } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c 
b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c
index 803936eef01..36dcf7367f1 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
+/* { dg-options "-O2 -mavx -mno-avx512f -mtune=generic -dp" } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10b.c 
b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
index e5616d8d258..77ace86ffe8 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-10b.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
@@ -5,3 +5,4 @@
 
 /* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, 
%ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]%ymm\[0-9\]+, " 8 } } */
+/* { dg-final { scan-assembler-not "vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-11b.c 
b/gcc/testsuite/gcc.target/i386/pr100865-11b.c
index 12d55b9a642..7e458e85cdd 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-11b.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-11b.c
@@ -5,4 +5,4 @@
 
 /* { dg-final { scan-assembler-times "movabsq" 1 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%(?:r|e)\[^\n\]*, 
%xmm\[0-9\]+" 1 } } */
-/* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
+/* { dg-final { scan-assembler-times "vmovdqa64\[\\t \]%xmm\[0-9\]+, " 16 } } 
*/
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-12b.c 
b/gcc/testsuite/gcc.target/i386/pr100865-12b.c

Re: [PATCH 4/6] Support -fexcess-precision=16 which will enable FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 when backend supports _Float16.

2021-08-02 Thread Joseph Myers
On Mon, 2 Aug 2021, liuhongt via Gcc-patches wrote:

> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 7979e240426..dc673c89bc8 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -23352,6 +23352,8 @@ ix86_get_excess_precision (enum excess_precision_type 
> type)
>   return (type == EXCESS_PRECISION_TYPE_STANDARD
>   ? FLT_EVAL_METHOD_PROMOTE_TO_FLOAT
>   : FLT_EVAL_METHOD_UNPREDICTABLE);
> +  case EXCESS_PRECISION_TYPE_FLOAT16:
> + return FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16;
>default:
>   gcc_unreachable ();
>  }

I'd expect an error for -fexcess-precision=16 with -mfpmath=387 (since x87 
doesn't do float or double arithmetic, but -fexcess-precision=16 implies 
that all of _Float16, float and double are represented to the range and 
precision of their type without any excess precision).
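
For context, a hand-written sketch (not from the patch series) of where
the choice of evaluation method shows up:

/* With FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 both additions are performed
   and rounded in _Float16; with promotion to float the intermediate
   a + b keeps float precision and only the final result is rounded back
   to _Float16, which can give a different answer.  */
_Float16
sum3 (_Float16 a, _Float16 b, _Float16 c)
{
  return a + b + c;
}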

-- 
Joseph S. Myers
jos...@codesourcery.com


[PATCH, rs6000] Add store fusion support for Power10

2021-08-02 Thread Pat Haugen via Gcc-patches
Enable store fusion on Power10.

Use the SCHED_REORDER hook to implement Power10 specific ready list reordering.
As of now, pairing stores for store fusion is the only function being
performed.
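
For illustration only (a hand-written sketch, not one of the cases the
patch was tuned on), this is the sort of adjacent base+displacement
stores the reordering tries to pair up:

/* Two doubleword-sized, non-update, non-indexed stores; placing them
   back to back in the ready list lets the hardware fuse them.  */
void
store_pair (long *p, long a, long b)
{
  p[0] = a;
  p[1] = b;
}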

Bootstrap/regtest on powerpc64le(Power10) with no new regressions. Ok for
master?

-Pat


2021-08-02  Pat Haugen  

gcc/ChangeLog:

* config/rs6000/rs6000-cpus.def (ISA_3_1_MASKS_SERVER): Add new flag.
(POWERPC_MASKS): Likewise.
* config/rs6000/rs6000.c (rs6000_option_override_internal): Enable
store fusion for Power10.
(is_load_insn1): Verify destination is a register.
(is_store_insn1): Verify source is a register.
(is_fusable_store): New.
(power10_sched_reorder): Likewise.
(rs6000_sched_reorder): Do Power10 specific reordering.
(rs6000_sched_reorder2): Likewise.
* config/rs6000/rs6000.opt: Add new option.



diff --git a/gcc/config/rs6000/rs6000-cpus.def 
b/gcc/config/rs6000/rs6000-cpus.def
index 6758296c0fd..f5812da0184 100644
--- a/gcc/config/rs6000/rs6000-cpus.def
+++ b/gcc/config/rs6000/rs6000-cpus.def
@@ -90,7 +90,8 @@
 | OPTION_MASK_P10_FUSION_2LOGICAL  \
 | OPTION_MASK_P10_FUSION_LOGADD\
 | OPTION_MASK_P10_FUSION_ADDLOG\
-| OPTION_MASK_P10_FUSION_2ADD)
+| OPTION_MASK_P10_FUSION_2ADD  \
+| OPTION_MASK_P10_FUSION_2STORE)
 
 /* Flags that need to be turned off if -mno-power9-vector.  */
 #define OTHER_P9_VECTOR_MASKS  (OPTION_MASK_FLOAT128_HW\
@@ -143,6 +144,7 @@
 | OPTION_MASK_P10_FUSION_LOGADD\
 | OPTION_MASK_P10_FUSION_ADDLOG\
 | OPTION_MASK_P10_FUSION_2ADD  \
+| OPTION_MASK_P10_FUSION_2STORE\
 | OPTION_MASK_HTM  \
 | OPTION_MASK_ISEL \
 | OPTION_MASK_MFCRF\
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 279f00cc648..1460a0d7c5c 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4490,6 +4490,10 @@ rs6000_option_override_internal (bool global_init_p)
   && (rs6000_isa_flags_explicit & OPTION_MASK_P10_FUSION_2ADD) == 0)
 rs6000_isa_flags |= OPTION_MASK_P10_FUSION_2ADD;
 
+  if (TARGET_POWER10
+  && (rs6000_isa_flags_explicit & OPTION_MASK_P10_FUSION_2STORE) == 0)
+rs6000_isa_flags |= OPTION_MASK_P10_FUSION_2STORE;
+
   /* Turn off vector pair/mma options on non-power10 systems.  */
   else if (!TARGET_POWER10 && TARGET_MMA)
 {
@@ -18357,7 +18361,7 @@ is_load_insn1 (rtx pat, rtx *load_mem)
   if (!pat || pat == NULL_RTX)
 return false;
 
-  if (GET_CODE (pat) == SET)
+  if (GET_CODE (pat) == SET && REG_P (SET_DEST (pat)))
 return find_mem_ref (SET_SRC (pat), load_mem);
 
   if (GET_CODE (pat) == PARALLEL)
@@ -18394,7 +18398,8 @@ is_store_insn1 (rtx pat, rtx *str_mem)
   if (!pat || pat == NULL_RTX)
 return false;
 
-  if (GET_CODE (pat) == SET)
+  if (GET_CODE (pat) == SET
+  && (REG_P (SET_SRC (pat)) || SUBREG_P (SET_SRC (pat
 return find_mem_ref (SET_DEST (pat), str_mem);
 
   if (GET_CODE (pat) == PARALLEL)
@@ -18859,6 +18864,96 @@ power9_sched_reorder2 (rtx_insn **ready, int lastpos)
   return cached_can_issue_more;
 }
 
+/* Determine if INSN is a store to memory that can be fused with a similar
+   adjacent store.  */
+
+static bool
+is_fusable_store (rtx_insn *insn, rtx *str_mem)
+{
+  /* Exit early if not doing store fusion.  */
+  if (!(TARGET_P10_FUSION && TARGET_P10_FUSION_2STORE))
+return false;
+
+  /* Insn must be a non-prefixed base+disp form store.  */
+  if (is_store_insn (insn, str_mem)
+  && get_attr_prefixed (insn) == PREFIXED_NO
+  && get_attr_update (insn) == UPDATE_NO
+  && get_attr_indexed (insn) == INDEXED_NO)
+{
+  /* Further restrictions by mode and size.  */
+  machine_mode mode = GET_MODE (*str_mem);
+  HOST_WIDE_INT size;
+  if MEM_SIZE_KNOWN_P (*str_mem)
+   size = MEM_SIZE (*str_mem);
+  else
+   return false;
+
+  if INTEGRAL_MODE_P (mode)
+   {
+ /* Must be word or dword size.  */
+ return (size == 4 || size == 8);
+   }
+  else if FLOAT_MODE_P (mode)
+   {
+ /* Must be dword size.  */
+ return (size == 8);
+   }
+}
+
+  return false;
+}
+
+/* Do Power10 specific reordering of the ready list.  */
+
+static int
+power10_sched_reorder (rtx_insn **ready, int lastpos)
+{
+  int pos;
+  rtx mem1, mem2;
+
+  /* Do store fusion during sched2 only.  */
+  if (!reload_completed)
+return cached_can_issue_more;
+
+  /* If the prior insn finished off a sto

Re: [PATCH] Objective-C: don't require redundant -fno-objc-sjlj-exceptions for the NeXT v2 ABI

2021-08-02 Thread Eric Gallager via Gcc-patches
On Wed, Jul 28, 2021 at 11:36 PM Matt Jacobson via Gcc-patches
 wrote:
>
> As is, an invocation of GCC with -fnext-runtime -fobjc-abi-version=2 crashes,
> unless target-specific code adds an implicit -fno-objc-sjlj-exceptions (which
> Darwin does).
>
> This patch makes the general case not crash.
>
> I don't have commit access, so if this patch is suitable, I'd need someone 
> else
> to commit it for me.  Thanks.

Is there a bug open for the issue that this fixes? Just wondering for
cross-referencing purposes...

>
> gcc/objc/ChangeLog:
>
> 2021-07-28  Matt Jacobson  
>
> * objc-next-runtime-abi-02.c (objc_next_runtime_abi_02_init): Warn
> about and reset flag_objc_sjlj_exceptions regardless of
> flag_objc_exceptions.
>
>
> gcc/c-family/ChangeLog:
>
> 2021-07-28  Matt Jacobson  
>
> * c-opts.c (c_common_post_options): Default to
> flag_objc_sjlj_exceptions = 1 only when flag_objc_abi < 2.
>
> diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> index c51d6d34726..2568df67972 100644
> --- a/gcc/c-family/c-opts.c
> +++ b/gcc/c-family/c-opts.c
> @@ -840,9 +840,9 @@ c_common_post_options (const char **pfilename)
>else if (!flag_gnu89_inline && !flag_isoc99)
>  error ("%<-fno-gnu89-inline%> is only supported in GNU99 or C99 mode");
>
> -  /* Default to ObjC sjlj exception handling if NeXT runtime.  */
> +  /* Default to ObjC sjlj exception handling if NeXT runtime and ABI < 2.  */
>if (flag_objc_sjlj_exceptions < 0)
> -flag_objc_sjlj_exceptions = flag_next_runtime;
> +flag_objc_sjlj_exceptions = (flag_next_runtime && flag_objc_abi < 2);
>if (flag_objc_exceptions && !flag_objc_sjlj_exceptions)
>  flag_exceptions = 1;
>
> diff --git a/gcc/objc/objc-next-runtime-abi-02.c 
> b/gcc/objc/objc-next-runtime-abi-02.c
> index 66c13ad0db2..9a0868410a8 100644
> --- a/gcc/objc/objc-next-runtime-abi-02.c
> +++ b/gcc/objc/objc-next-runtime-abi-02.c
> @@ -245,7 +245,7 @@ objc_next_runtime_abi_02_init (objc_runtime_hooks 
> *rthooks)
>  {
>extern_names = ggc_cleared_vec_alloc (SIZEHASHTABLE);
>
> -  if (flag_objc_exceptions && flag_objc_sjlj_exceptions)
> +  if (flag_objc_sjlj_exceptions)
>  {
>inform (UNKNOWN_LOCATION,
>   "%<-fobjc-sjlj-exceptions%> is ignored for "
>


Re: [PATCH] Objective-C: don't require redundant -fno-objc-sjlj-exceptions for the NeXT v2 ABI

2021-08-02 Thread Matt Jacobson via Gcc-patches



> On Aug 2, 2021, at 5:09 PM, Eric Gallager  wrote:
> 
> On Wed, Jul 28, 2021 at 11:36 PM Matt Jacobson via Gcc-patches
>  wrote:
>> 
>> As is, an invocation of GCC with -fnext-runtime -fobjc-abi-version=2 crashes,
>> unless target-specific code adds an implicit -fno-objc-sjlj-exceptions (which
>> Darwin does).
>> 
>> This patch makes the general case not crash.
>> 
>> I don't have commit access, so if this patch is suitable, I'd need someone 
>> else
>> to commit it for me.  Thanks.
> 
> Is there a bug open for the issue that this fixes? Just wondering for
> cross-referencing purposes...

No, I didn’t file a bug for this one, just sent the patch directly.  Hope 
that’s OK.  If not, happy to file one.

Matt

Warn for reads from write-only arguments [PR101734]

2021-08-02 Thread Martin Sebor via Gcc-patches

The write_only mode of attribute access specifies that the pointer it
applies to is used to write to the referenced object but not read
from it.

A function that uses the pointer to read the referenced object might
rely on the contents of uninitialized memory, and so such attempts
should be diagnosed.  The attached enhancement makes that happen.
It was tested on x86_64-linux and by building Glibc where it found
an inappropriate use of the attribute on a function documented to
read from the argument as an extension [BZ 28170].
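
As a quick illustration (a hand-written sketch mirroring the new test,
not an excerpt from it):

__attribute__ ((access (write_only, 1, 2)))
int warn_read (int *p, int n)
{
  return *p;	/* now diagnosed: '*p' is used uninitialized */
}

__attribute__ ((access (write_only, 1, 2)))
int no_warn (int *p, int n)
{
  *p = n;	/* writing first makes the subsequent read well defined */
  return *p;
}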

I plan to implement a similar warning for writes to read-only objects
(pointed to by arguments declared with attribute read_only, const, or
const restrict) in a followup (as part of my solution for PR 90404).

Martin

BZ 28170: https://sourceware.org/bugzilla/show_bug.cgi?id=28170
Warn for reads from write-only arguments [PR101734].

Resolves:
PR middle-end/101734 - missing warning reading from a write-only object

gcc/ChangeLog:

	PR middle-end/101734
	* tree-ssa-uninit.c (maybe_warn_read_write_only): New function.
	(maybe_warn_operand): Call it.

gcc/testsuite/ChangeLog:

	PR middle-end/101734
	* gcc.dg/uninit-42.c: New test.

diff --git a/gcc/testsuite/gcc.dg/uninit-42.c b/gcc/testsuite/gcc.dg/uninit-42.c
new file mode 100644
index 000..5e0a2b4ea5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/uninit-42.c
@@ -0,0 +1,85 @@
+/* Verify that reading objects pointed to by arguments declared with
+   attribute access none or write-only is diagnosed.
+   { dg-do compile }
+   { dg-options "-Wall" } */
+
+#define A(mode, ...) __attribute__ ((access (mode, __VA_ARGS__)))
+
+void sink (void *, ...);
+
+
+A (write_only, 1, 2)
+int nowarn_wo_assign_r0 (int *p, int n)
+{
+  *p = n;
+  return *p;
+}
+
+A (write_only, 1, 2)
+int nowarn_wo_sink_r0 (int *p, int n)
+{
+  sink (p, p + 1, p + n);
+  return *p;
+}
+
+A (write_only, 1, 2)
+int warn_wo_r0 (int *p, int n)
+{
+  return *p;// { dg-warning "'\\*p' is used uninitialized" }
+}
+
+
+A (write_only, 1, 2)
+int nowarn_wo_w1_r1 (int *p, int n)
+{
+  p[1] = n;
+  return p[1];
+}
+
+A (write_only, 1, 2)
+int warn_wo_r1 (int *p, int n)
+{
+  return p[1];  // { dg-warning "'p\\\[1]' is used uninitialized" }
+}
+
+
+A (write_only, 1, 2)
+int nowarn_wo_rwi_rj (int *p, int n, int i, int j)
+{
+  p[i] = n;
+  return p[j];
+}
+
+A (write_only, 1, 2)
+  int warn_wo_ri (int *p, int n, int i)
+{
+  return p[i];  // { dg-warning " is used uninitialized" }
+}
+
+
+
+A (none, 1, 2)
+int* nowarn_none_sink_return (int *p, int n)
+{
+  sink (p, p + 1, p + n);
+  return p;
+}
+
+A (none, 1, 2)
+int warn_none_r0 (int *p, int n)
+{
+  (void)&n;
+  return *p;// { dg-warning "'\\*p' is used uninitialized" }
+}
+
+A (none, 1, 2)
+int warn_none_r1 (int *p, int n)
+{
+  return p[1];  // { dg-warning "'p\\\[1]' is used uninitialized" }
+}
+
+A (write_only, 1, 2)
+int warn_none_ri (int *p, int n, int i)
+{
+  return p[i];  // { dg-warning " is used uninitialized" }
+}
diff --git a/gcc/tree-ssa-uninit.c b/gcc/tree-ssa-uninit.c
index 5d7bc800419..62cd680b1db 100644
--- a/gcc/tree-ssa-uninit.c
+++ b/gcc/tree-ssa-uninit.c
@@ -283,6 +283,64 @@ builtin_call_nomodifying_p (gimple *stmt)
   return true;
 }
 
+/* If ARG is a FNDECL parameter declared with attribute access none or
+   write_only issue a warning for its read access via PTR.  */
+
+static void
+maybe_warn_read_write_only (tree fndecl, gimple *stmt, tree arg, tree ptr)
+{
+  if (!fndecl)
+return;
+
+  if (get_no_uninit_warning (arg))
+return;
+
+  tree fntype = TREE_TYPE (fndecl);
+  if (!fntype)
+return;
+
+  /* Initialize a map of attribute access specifications for arguments
+ to the function call.  */
+  rdwr_map rdwr_idx;
+  init_attr_rdwr_indices (&rdwr_idx, TYPE_ATTRIBUTES (fntype));
+
+  unsigned argno = 0;
+  tree parms = DECL_ARGUMENTS (fndecl);
+  for (tree parm = parms; parm; parm = TREE_CHAIN (parm), ++argno)
+{
+  if (parm != arg)
+	continue;
+
+  const attr_access* access = rdwr_idx.get (argno);
+  if (!access)
+	break;
+
+  if (access->mode != access_none
+	  && access->mode != access_write_only)
+	continue;
+
+  location_t stmtloc
+	= linemap_resolve_location (line_table, gimple_location (stmt),
+LRK_SPELLING_LOCATION, NULL);
+
+  if (!warning_at (stmtloc, OPT_Wuninitialized,
+		   "%qE is used uninitialized", ptr))
+	break;
+
+  suppress_warning (arg, OPT_Wuninitialized);
+
+  const char* const access_str =
+	TREE_STRING_POINTER (access->to_external_string ());
+
+  location_t parmloc = DECL_SOURCE_LOCATION (parm);
+  inform (parmloc, "accessing argument %u of a function declared with "
+	  "attribute %<%s%>",
+	  argno + 1, access_str);
+
+  break;
+}
+}
+
 /* Callback for walk_aliased_vdefs.  */
 
 static bool
@@ -350,7 +408,9 @@ struct wlimits
 };
 
 /* Determine if REF references an uninitialized operand and diagnose
-   it if so.  */
+   it if so.  STMS is the referenc

Go patch committed: Support unsafe.Add and unsafe.Slice

2021-08-02 Thread Ian Lance Taylor via Gcc-patches
The upcoming Go 1.17 release adds two new functions to the unsafe
package: unsafe.Add and unsafe.Slice.  These functions must be
implemented in the compiler.  This patch implements them for gccgo.
Bootstrapped and ran Go testsuite on x86_64-pc-linux-gnu.  Committed
to mainline.

Ian
06d0437d4a5faca2b695918cbe1d54a61935c98b
diff --git a/gcc/go/gofrontend/MERGE b/gcc/go/gofrontend/MERGE
index 5323e186eab..19fbd647840 100644
--- a/gcc/go/gofrontend/MERGE
+++ b/gcc/go/gofrontend/MERGE
@@ -1,4 +1,4 @@
-920549b6382a2623538d31001271941f0e9e5a51
+ad667e7c70cea9fa5730660d72ad891b5753eb62
 
 The first line of this file holds the git revision number of the last
 merge done from the gofrontend repository.
diff --git a/gcc/go/gofrontend/expressions.cc b/gcc/go/gofrontend/expressions.cc
index a0472acc209..46e71e5a13b 100644
--- a/gcc/go/gofrontend/expressions.cc
+++ b/gcc/go/gofrontend/expressions.cc
@@ -8252,12 +8252,16 @@ Builtin_call_expression::Builtin_call_expression(Gogo* 
gogo,
 this->code_ = BUILTIN_REAL;
   else if (name == "recover")
 this->code_ = BUILTIN_RECOVER;
+  else if (name == "Add")
+this->code_ = BUILTIN_ADD;
   else if (name == "Alignof")
 this->code_ = BUILTIN_ALIGNOF;
   else if (name == "Offsetof")
 this->code_ = BUILTIN_OFFSETOF;
   else if (name == "Sizeof")
 this->code_ = BUILTIN_SIZEOF;
+  else if (name == "Slice")
+this->code_ = BUILTIN_SLICE;
   else
 go_unreachable();
 }
@@ -8694,6 +8698,119 @@ Builtin_call_expression::do_flatten(Gogo* gogo, 
Named_object* function,
 
 return Runtime::make_call(code, loc, 3, e1, e2, e3);
   }
+
+case BUILTIN_ADD:
+  {
+   Expression* ptr = this->args()->front();
+   Type* uintptr_type = Type::lookup_integer_type("uintptr");
+   ptr = Expression::make_cast(uintptr_type, ptr, loc);
+   Expression* len = this->args()->back();
+   len = Expression::make_cast(uintptr_type, len, loc);
+   Expression* add = Expression::make_binary(OPERATOR_PLUS, ptr, len,
+ loc);
+   return Expression::make_cast(this->args()->front()->type(), add, loc);
+  }
+
+case BUILTIN_SLICE:
+  {
+   Expression* ptr = this->args()->front();
+   Temporary_statement* ptr_temp = NULL;
+   if (!ptr->is_multi_eval_safe())
+ {
+   ptr_temp = Statement::make_temporary(NULL, ptr, loc);
+   inserter->insert(ptr_temp);
+   ptr = Expression::make_temporary_reference(ptr_temp, loc);
+ }
+
+   Expression* len = this->args()->back();
+   Temporary_statement* len_temp = NULL;
+   if (!len->is_multi_eval_safe())
+ {
+   len_temp = Statement::make_temporary(NULL, len, loc);
+   inserter->insert(len_temp);
+   len = Expression::make_temporary_reference(len_temp, loc);
+ }
+
+   bool fits_in_int;
+   Numeric_constant nc;
+   if (this->args()->back()->numeric_constant_value(&nc))
+ {
+   // We gave an error for constants that don't fit in int in
+   // check_types.
+   fits_in_int = true;
+ }
+   else
+ {
+   Integer_type* itype = this->args()->back()->type()->integer_type();
+   go_assert(itype != NULL);
+   int ebits = itype->bits();
+   int intbits =
+ Type::lookup_integer_type("int")->integer_type()->bits();
+
+   // We can treat ebits == intbits as small even for an
+   // unsigned integer type, because we will convert the
+   // value to int and then reject it in the runtime if it is
+   // negative.
+
+   fits_in_int = ebits <= intbits;
+ }
+
+   Runtime::Function code = (fits_in_int
+ ? Runtime::UNSAFESLICE
+ : Runtime::UNSAFESLICE64);
+   Expression* td =
+ Expression::make_type_descriptor(ptr->type()->points_to(), loc);
+   Expression* check = Runtime::make_call(code, loc, 3,
+  td, ptr, len);
+
+   if (ptr_temp == NULL)
+ ptr = ptr->copy();
+   else
+ ptr = Expression::make_temporary_reference(ptr_temp, loc);
+   Expression* nil = Expression::make_nil(loc);
+   nil = Expression::make_cast(ptr->type(), nil, loc);
+   Expression* is_nil = Expression::make_binary(OPERATOR_EQEQ, ptr, nil,
+loc);
+
+   if (len_temp == NULL)
+ len = len->copy();
+   else
+ len = Expression::make_temporary_reference(len_temp, loc);
+   Expression* zero = Expression::make_integer_ul(0, len->type(), loc);
+   Expression* is_zero = Expression::make_binary(OPERATOR_EQEQ, len, zero,
+ loc);
+
+   Expression* cond = Expression::make_binary(OPERATOR_ANDAND, is_nil,
+  is_zero, loc);
+
+  

Re: [PATCH v3 1/2] rs6000: Add support for _mm_minpos_epu16

2021-08-02 Thread Segher Boessenkool
Hi!

On Thu, Jul 15, 2021 at 06:29:17PM -0500, Paul A. Clarke wrote:
> Add a naive implementation of the subject x86 intrinsic to
> ease porting.

> --- a/gcc/config/rs6000/smmintrin.h
> +++ b/gcc/config/rs6000/smmintrin.h
> @@ -172,4 +172,31 @@ _mm_test_mix_ones_zeros (__m128i __A, __m128i __mask)
>return any_ones * any_zeros;
>  }
>  
> +/* Return horizontal packed word minimum and its index in bits [15:0]
> +   and bits [18:16] respectively.  */
> +__inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_minpos_epu16 (__m128i __A)
> +{
> +  union __u
> +{
> +  __m128i __m;
> +  __v8hu __uh;
> +};
> +  union __u __u = { .__m = __A }, __r = { .__m = {0} };
> +  unsigned short __ridx = 0;
> +  unsigned short __rmin = __u.__uh[__ridx];
> +  for (unsigned long __i = 1; __i < 8; __i++)
> +{
> +  if (__u.__uh[__i] < __rmin)
> + {
> +   __rmin = __u.__uh[__i];
> +   __ridx = __i;
> + }
> +}
> +  __r.__uh[0] = __rmin;
> +  __r.__uh[1] = __ridx;
> +  return __r.__m;
> +}

As before: does this work correctly on BE?  Was it tested there?

Okay for trunk if so.  Thanks!


Segher


Go patch committed: Allow converting from slice to pointer-to-array

2021-08-02 Thread Ian Lance Taylor via Gcc-patches
The upcoming Go 1.17 release has a new language feature: it permits
conversions from slice types to pointer-to-array types.  If the slice
is too short, the conversion panics.  This patch implements this new
feature in gccgo.  Bootstrapped and ran Go testsuite on
x86_64-pc-linux-gnu.  Committed to mainline.

Ian
7459bfa8a37a4fbd6ed5153bff76f49d372b4ace
diff --git a/gcc/go/gofrontend/MERGE b/gcc/go/gofrontend/MERGE
index 19fbd647840..95b9340b42d 100644
--- a/gcc/go/gofrontend/MERGE
+++ b/gcc/go/gofrontend/MERGE
@@ -1,4 +1,4 @@
-ad667e7c70cea9fa5730660d72ad891b5753eb62
+0a4d612e6b211780b294717503fc739bbd1f509c
 
 The first line of this file holds the git revision number of the last
 merge done from the gofrontend repository.
diff --git a/gcc/go/gofrontend/expressions.cc b/gcc/go/gofrontend/expressions.cc
index 46e71e5a13b..15c9eabc6bf 100644
--- a/gcc/go/gofrontend/expressions.cc
+++ b/gcc/go/gofrontend/expressions.cc
@@ -3866,11 +3866,12 @@ Type_conversion_expression::do_traverse(Traverse* 
traverse)
   return TRAVERSE_CONTINUE;
 }
 
-// Convert to a constant at lowering time.
+// Convert to a constant at lowering time.  Also lower conversions
+// from slice to pointer-to-array, as they can panic.
 
 Expression*
 Type_conversion_expression::do_lower(Gogo*, Named_object*,
-Statement_inserter*, int)
+Statement_inserter* inserter, int)
 {
   Type* type = this->type_;
   Expression* val = this->expr_;
@@ -3958,6 +3959,54 @@ Type_conversion_expression::do_lower(Gogo*, 
Named_object*,
}
 }
 
+  if (type->points_to() != NULL
+  && type->points_to()->array_type() != NULL
+  && !type->points_to()->is_slice_type()
+  && val->type()->is_slice_type())
+{
+  Temporary_statement* val_temp = NULL;
+  if (!val->is_multi_eval_safe())
+   {
+ val_temp = Statement::make_temporary(val->type(), NULL, location);
+ inserter->insert(val_temp);
+ val = Expression::make_set_and_use_temporary(val_temp, val,
+  location);
+   }
+
+  Type* int_type = Type::lookup_integer_type("int");
+  Temporary_statement* vallen_temp =
+   Statement::make_temporary(int_type, NULL, location);
+  inserter->insert(vallen_temp);
+
+  Expression* arrlen = type->points_to()->array_type()->length();
+  Expression* vallen =
+   Expression::make_slice_info(val, Expression::SLICE_INFO_LENGTH,
+   location);
+  vallen = Expression::make_set_and_use_temporary(vallen_temp, vallen,
+ location);
+  Expression* cond = Expression::make_binary(OPERATOR_GT, arrlen, vallen,
+location);
+
+  vallen = Expression::make_temporary_reference(vallen_temp, location);
+  Expression* panic = Runtime::make_call(Runtime::PANIC_SLICE_CONVERT,
+location, 2, arrlen, vallen);
+
+  Expression* nil = Expression::make_nil(location);
+  Expression* check = Expression::make_conditional(cond, panic, nil,
+  location);
+
+  if (val_temp == NULL)
+   val = val->copy();
+  else
+   val = Expression::make_temporary_reference(val_temp, location);
+  Expression* ptr =
+   Expression::make_slice_info(val, Expression::SLICE_INFO_VALUE_POINTER,
+   location);
+  ptr = Expression::make_unsafe_cast(type, ptr, location);
+
+  return Expression::make_compound(check, ptr, location);
+}
+
   return this;
 }
 
diff --git a/gcc/go/gofrontend/runtime.def b/gcc/go/gofrontend/runtime.def
index 231d92d2661..fad8cebc012 100644
--- a/gcc/go/gofrontend/runtime.def
+++ b/gcc/go/gofrontend/runtime.def
@@ -582,6 +582,11 @@ DEF_GO_RUNTIME(PANIC_EXTEND_SLICE3_C, 
"runtime.goPanicExtendSlice3C",
 DEF_GO_RUNTIME(PANIC_EXTEND_SLICE3_C_U, "runtime.goPanicExtendSlice3CU",
   P2(UINT64, INT), R0())
 
+// Panic for conversion of slice to pointer-to-array if the slice is
+// too short.
+DEF_GO_RUNTIME(PANIC_SLICE_CONVERT, "runtime.goPanicSliceConvert",
+  P2(INT, INT), R0())
+
 // Remove helper macros.
 #undef ABFT6
 #undef ABFT2
diff --git a/gcc/go/gofrontend/types.cc b/gcc/go/gofrontend/types.cc
index ab7166b8b1b..7c7b2eb8271 100644
--- a/gcc/go/gofrontend/types.cc
+++ b/gcc/go/gofrontend/types.cc
@@ -842,6 +842,13 @@ Type::are_convertible(const Type* lhs, const Type* rhs, 
std::string* reason)
return true;
 }
 
+  // A slice may be converted to a pointer-to-array.
+  if (rhs->is_slice_type()
+  && lhs->points_to() != NULL
+  && lhs->points_to()->array_type() != NULL
+  && !lhs->points_to()->is_slice_type())
+return true;
+
   // An unsafe.Pointer type may be converted to any pointer type or to
   // a type whose underlying type is uintptr, and v

Re: [PATCH v3 2/2] rs6000: Add test for _mm_minpos_epu16

2021-08-02 Thread Segher Boessenkool
Hi!

On Thu, Jul 15, 2021 at 06:29:18PM -0500, Paul A. Clarke wrote:
> Copy the test for _mm_minpos_epu16 from
> gcc/testsuite/gcc.target/i386/sse4_1-phminposuw.c, with
> a few adjustments:

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-phminposuw.c
> @@ -0,0 +1,68 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -mpower8-vector -Wno-psabi" } */
> +/* { dg-require-effective-target p8vector_hw } */

What does this need P8 for?  Please test for just what you need, and
don't use -mpower8-vector at all, it is never needed?

Okay for trunk with that fixed.  Thanks!


Segher


Re: [PATCH 42/55] rs6000: Handle gimple folding of target built-ins

2021-08-02 Thread Segher Boessenkool
On Mon, Aug 02, 2021 at 08:31:43AM -0500, Bill Schmidt wrote:
> Interestingly, when the quadword compares are expanded at GIMPLE time, 
> we generate worse code involving individual 64-bit compares.  For the 
> time being, I will not expand these at GIMPLE time; independently, this 
> bears looking at to see why expressions like (uint128_1 < uint128_2) 
> will generate poor code.

Details like this should probably not be exposed before RTL anyway?
Everything else is at a more abstracted level as well before expand?

It will be interesting to see what causes the worse code though :-)


Segher


Re: [PATCH] Support cond_add/sub/mul/div for vector float/double.

2021-08-02 Thread Hongtao Liu via Gcc-patches
On Mon, Aug 2, 2021 at 6:20 PM Richard Biener via Gcc-patches
 wrote:
>
> On Mon, Aug 2, 2021 at 6:33 AM liuhongt  wrote:
> >
> > Hi:
> >   This patch supports cond_add/sub/mul/div expanders for vector 
> > float/double.
> >   There're still cond_fma/fms/fnms/fma/max/min/xor/ior/and left which I 
> > failed to figure out a testcase to validate them.
> > Also cond_add/sub/mul for vector integer.
>
> I think FMA can be verified by combining cond_mul and cond_add since the
> FMA recognizing code will try to match those up with cond_fma (plus the
> other variants).  Eventually combine can handle this on RTL already.
>
> The max/min/ior/and will only show up with masked epilogues when
> vectorizing reductions.
>
Oh, thanks, adding --param=vect-partial-vector-usage=1 helps.
> Richard.
>
> >
> >   Bootstrap is ok, survive the regression test on x86_64-linux-gnu{-m32,}.
> >   Pushed to trunk if there're no objections.
> >
> > gcc/ChangeLog:
> >
> > * config/i386/sse.md (cond_):New expander.
> > (cond_mul): Ditto.
> > (cond_div): Ditto.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/cond_op_addsubmuldiv_double-1.c: New test.
> > * gcc.target/i386/cond_op_addsubmuldiv_double-2.c: New test.
> > * gcc.target/i386/cond_op_addsubmuldiv_float-1.c: New test.
> > * gcc.target/i386/cond_op_addsubmuldiv_float-2.c: New test.
> > ---
> >  gcc/config/i386/sse.md| 54 
> >  .../i386/cond_op_addsubmuldiv_double-1.c  | 31 +++
> >  .../i386/cond_op_addsubmuldiv_double-2.c  | 85 +++
> >  .../i386/cond_op_addsubmuldiv_float-1.c   |  9 ++
> >  .../i386/cond_op_addsubmuldiv_float-2.c   |  4 +
> >  5 files changed, 183 insertions(+)
> >  create mode 100644 
> > gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv_double-1.c
> >  create mode 100644 
> > gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv_double-2.c
> >  create mode 100644 
> > gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv_float-1.c
> >  create mode 100644 
> > gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv_float-2.c
> >
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index b5a08988590..8bf1764d3d5 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -1891,6 +1891,24 @@ (define_insn_and_split "*nabs2"
> >  }
> >[(set_attr "isa" "noavx,noavx,avx,avx")])
> >
> > +(define_expand "cond_"
> > +  [(set (match_operand:VF 0 "register_operand")
> > +   (vec_merge:VF
> > + (plusminus:VF
> > +   (match_operand:VF 2 "vector_operand")
> > +   (match_operand:VF 3 "vector_operand"))
> > + (match_operand:VF 4 "nonimm_or_0_operand")
> > + (match_operand: 1 "register_operand")))]
> > +  " == 64 || TARGET_AVX512VL"
> > +{
> > +  emit_insn (gen_3_mask (operands[0],
> > +operands[2],
> > +operands[3],
> > +operands[4],
> > +operands[1]));
> > +  DONE;
> > +})
> > +
> >  (define_expand "3"
> >[(set (match_operand:VF 0 "register_operand")
> > (plusminus:VF
> > @@ -1953,6 +1971,24 @@ (define_insn 
> > "_vm3"
> > (set_attr "prefix" "")
> > (set_attr "mode" "")])
> >
> > +(define_expand "cond_mul"
> > +  [(set (match_operand:VF 0 "register_operand")
> > +   (vec_merge:VF
> > + (mult:VF
> > +   (match_operand:VF 2 "vector_operand")
> > +   (match_operand:VF 3 "vector_operand"))
> > + (match_operand:VF 4 "nonimm_or_0_operand")
> > + (match_operand: 1 "register_operand")))]
> > +  " == 64 || TARGET_AVX512VL"
> > +{
> > +  emit_insn (gen_mul3_mask (operands[0],
> > +operands[2],
> > +operands[3],
> > +operands[4],
> > +operands[1]));
> > +  DONE;
> > +})
> > +
> >  (define_expand "mul3"
> >[(set (match_operand:VF 0 "register_operand")
> > (mult:VF
> > @@ -2041,6 +2077,24 @@ (define_expand "div3"
> >  }
> >  })
> >
> > +(define_expand "cond_div"
> > +  [(set (match_operand:VF 0 "register_operand")
> > +   (vec_merge:VF
> > + (div:VF
> > +   (match_operand:VF 2 "register_operand")
> > +   (match_operand:VF 3 "vector_operand"))
> > + (match_operand:VF 4 "nonimm_or_0_operand")
> > + (match_operand: 1 "register_operand")))]
> > +  " == 64 || TARGET_AVX512VL"
> > +{
> > +  emit_insn (gen__div3_mask (operands[0],
> > +   operands[2],
> > +   operands[3],
> > +   operands[4],
> > +   operands[1]));
> > +  DONE;
> > +})
> > +
> >  (define_insn "_div3"
> >[(set (match_operand:VF 0 "register_operand" "=x,v")
> > (div:V

Re: [PATCH] x86: Use XMM31 for scratch SSE register

2021-08-02 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 3, 2021 at 1:48 AM H.J. Lu via Gcc-patches
 wrote:
>
> In 64-bit mode, use XMM31 for scratch SSE register to avoid vzeroupper
> if possible.
>
> gcc/
>
> * config/i386/i386.c (ix86_gen_scratch_sse_rtx): In 64-bit mode,
> try XMM31 to avoid vzeroupper.
LGTM.
>
> gcc/testsuite/
>
> * gcc.target/i386/avx-vzeroupper-14.c: Pass -mno-avx512f to
> disable XMM31.
> * gcc.target/i386/avx-vzeroupper-15.c: Likewise.
> * gcc.target/i386/pr82941-1.c: Updated.  Check for vzeroupper.
> * gcc.target/i386/pr82942-1.c: Likewise.
> * gcc.target/i386/pr82990-1.c: Likewise.
> * gcc.target/i386/pr82990-3.c: Likewise.
> * gcc.target/i386/pr82990-5.c: Likewise.
> * gcc.target/i386/pr100865-4b.c: Likewise.
> * gcc.target/i386/pr100865-6b.c: Likewise.
> * gcc.target/i386/pr100865-7b.c: Likewise.
> * gcc.target/i386/pr100865-10b.c: Likewise.
> * gcc.target/i386/pr100865-8b.c: Updated.
> * gcc.target/i386/pr100865-9b.c: Likewise.
> * gcc.target/i386/pr100865-11b.c: Likewise.
> * gcc.target/i386/pr100865-12b.c: Likewise.
> ---
>  gcc/config/i386/i386.c | 18 +++---
>  .../gcc.target/i386/avx-vzeroupper-14.c|  2 +-
>  .../gcc.target/i386/avx-vzeroupper-15.c|  2 +-
>  gcc/testsuite/gcc.target/i386/pr100865-10b.c   |  1 +
>  gcc/testsuite/gcc.target/i386/pr100865-11b.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/pr100865-12b.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/pr100865-4b.c|  2 ++
>  gcc/testsuite/gcc.target/i386/pr100865-6b.c|  5 -
>  gcc/testsuite/gcc.target/i386/pr100865-7b.c|  5 -
>  gcc/testsuite/gcc.target/i386/pr100865-8b.c|  2 +-
>  gcc/testsuite/gcc.target/i386/pr100865-9b.c|  2 +-
>  gcc/testsuite/gcc.target/i386/pr82941-1.c  |  3 ++-
>  gcc/testsuite/gcc.target/i386/pr82942-1.c  |  3 ++-
>  gcc/testsuite/gcc.target/i386/pr82990-1.c  |  3 ++-
>  gcc/testsuite/gcc.target/i386/pr82990-3.c  |  3 ++-
>  gcc/testsuite/gcc.target/i386/pr82990-5.c  |  3 ++-
>  16 files changed, 42 insertions(+), 16 deletions(-)
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 842eb0e6786..ec0690876b7 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -23335,9 +23335,21 @@ rtx
>  ix86_gen_scratch_sse_rtx (machine_mode mode)
>  {
>if (TARGET_SSE && !lra_in_progress)
> -return gen_rtx_REG (mode, (TARGET_64BIT
> -  ? LAST_REX_SSE_REG
> -  : LAST_SSE_REG));
> +{
> +  unsigned int regno;
> +  if (TARGET_64BIT)
> +   {
> + /* In 64-bit mode, use XMM31 to avoid vzeroupper and always
> +use XMM31 for CSE.  */
> + if (ix86_hard_regno_mode_ok (LAST_EXT_REX_SSE_REG, mode))
> +   regno = LAST_EXT_REX_SSE_REG;
> + else
> +   regno = LAST_REX_SSE_REG;
> +   }
> +  else
> +   regno = LAST_SSE_REG;
> +  return gen_rtx_REG (mode, regno);
> +}
>else
>  return gen_reg_rtx (mode);
>  }
> diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c 
> b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c
> index a31b4a2a63a..9590f25da22 100644
> --- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c
> +++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
> +/* { dg-options "-O2 -mavx -mno-avx512f -mtune=generic -dp" } */
>
>  #include 
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c 
> b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c
> index 803936eef01..36dcf7367f1 100644
> --- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c
> +++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
> +/* { dg-options "-O2 -mavx -mno-avx512f -mtune=generic -dp" } */
>
>  #include 
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10b.c 
> b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
> index e5616d8d258..77ace86ffe8 100644
> --- a/gcc/testsuite/gcc.target/i386/pr100865-10b.c
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
> @@ -5,3 +5,4 @@
>
>  /* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, 
> %ymm\[0-9\]+" 1 } } */
>  /* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]%ymm\[0-9\]+, " 8 } } 
> */
> +/* { dg-final { scan-assembler-not "vzeroupper" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr100865-11b.c 
> b/gcc/testsuite/gcc.target/i386/pr100865-11b.c
> index 12d55b9a642..7e458e85cdd 100644
> --- a/gcc/testsuite/gcc.target/i386/pr100865-11b.c
> +++ b/gcc/testsuite/gcc.target/i386/pr100865-11b.c
> @@ -5,4 +5,4 @@
>
>  /* { dg-final { scan-assembler-times "movabsq" 1 } } */
>  /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%(?:r|e)\[^\n\]*,
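
For context, a minimal sketch of the situation the scratch-register change
targets (hypothetical source and options, not part of the patch):

/* Compiled with e.g. -O2 -mavx512f, the memset expansion below needs a
   vector scratch register holding zero.  If that scratch is one of
   %xmm0-%xmm15, the upper %ymm state is considered live and a
   vzeroupper has to be emitted before returning; %xmm31 is only
   reachable through EVEX encodings and does not dirty the legacy
   %ymm0-%ymm15 upper state, so the vzeroupper pass can drop the
   instruction.  */
void
foo (char *p)
{
  __builtin_memset (p, 0, 64);
}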

Re: [PATCH 4/6] Support -fexcess-precision=16 which will enable FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 when backend supports _Float16.

2021-08-02 Thread Hongtao Liu via Gcc-patches
On Tue, Aug 3, 2021 at 3:34 AM Joseph Myers  wrote:
>
> On Mon, 2 Aug 2021, liuhongt via Gcc-patches wrote:
>
> > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > index 7979e240426..dc673c89bc8 100644
> > --- a/gcc/config/i386/i386.c
> > +++ b/gcc/config/i386/i386.c
> > @@ -23352,6 +23352,8 @@ ix86_get_excess_precision (enum 
> > excess_precision_type type)
> >   return (type == EXCESS_PRECISION_TYPE_STANDARD
> >   ? FLT_EVAL_METHOD_PROMOTE_TO_FLOAT
> >   : FLT_EVAL_METHOD_UNPREDICTABLE);
> > +  case EXCESS_PRECISION_TYPE_FLOAT16:
> > + return FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16;
> >default:
> >   gcc_unreachable ();
> >  }
>
> I'd expect an error for -fexcess-precision=16 with -mfpmath=387 (since x87
> doesn't do arithmetic directly in float or double precision, but
> -fexcess-precision=16 implies that all of _Float16, float and double are
> represented to the range and precision of their type without any excess
> precision).
>
Yes, additional changes like this.

modified   gcc/config/i386/i386.c
@@ -23443,6 +23443,9 @@ ix86_get_excess_precision (enum excess_precision_type type)
  ? FLT_EVAL_METHOD_PROMOTE_TO_FLOAT
  : FLT_EVAL_METHOD_UNPREDICTABLE);
   case EXCESS_PRECISION_TYPE_FLOAT16:
+ if (TARGET_80387
+ && !(TARGET_SSE_MATH && TARGET_SSE))
+   error ("%<-fexcess-precision=16%> is not compatible with %<-mfpmath=387%>");
  return FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16;
   default:
  gcc_unreachable ();
new file   gcc/testsuite/gcc.target/i386/float16-7.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mfpmath=387 -fexcess-precision=16" } */
+/* { dg-excess-errors "'-fexcess-precision=16' is not compatible with '-mfpmath=387'" } */
+_Float16
+foo (_Float16 a, _Float16 b)
+{
+  return a + b; /* { dg-error "'-fexcess-precision=16' is not compatible with '-mfpmath=387'" } */
+}
+
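
To illustrate the semantics, a rough sketch (assuming an AVX512FP16-capable
target with SSE math; the instruction names are only indicative):

/* With -fexcess-precision=16 the _Float16 addition below can be done
   directly in half precision (a single vaddsh), while the default
   standard excess precision promotes both operands to float, adds, and
   converts back on the return (vcvtsh2ss + vaddss + vcvtss2sh).  x87
   arithmetic does not evaluate float or double in their own type,
   which is why the combination with -mfpmath=387 is rejected above.  */
_Float16
add_f16 (_Float16 a, _Float16 b)
{
  return a + b;
}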

> --
> Joseph S. Myers
> jos...@codesourcery.com



-- 
BR,
Hongtao


[PATCH] x86: Add testcases for PR target/80566

2021-08-02 Thread H.J. Lu via Gcc-patches
PR target/80566
* g++.target/i386/pr80566-1.C: New test.
* g++.target/i386/pr80566-2.C: Likewise.
---
 gcc/testsuite/g++.target/i386/pr80566-1.C | 15 +++
 gcc/testsuite/g++.target/i386/pr80566-2.C | 14 ++
 2 files changed, 29 insertions(+)
 create mode 100644 gcc/testsuite/g++.target/i386/pr80566-1.C
 create mode 100644 gcc/testsuite/g++.target/i386/pr80566-2.C

diff --git a/gcc/testsuite/g++.target/i386/pr80566-1.C 
b/gcc/testsuite/g++.target/i386/pr80566-1.C
new file mode 100644
index 000..753f9740529
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/pr80566-1.C
@@ -0,0 +1,15 @@
+// { dg-do compile }
+// { dg-options "-O2 -march=haswell" }
+
+#include 
+
+int *
+foo()
+{
+  int * p = new int[16];
+  memset(p,0,16*sizeof(int));
+  return p;
+}
+
+/* { dg-final { scan-assembler-times "vpxor\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 2 } } */
diff --git a/gcc/testsuite/g++.target/i386/pr80566-2.C 
b/gcc/testsuite/g++.target/i386/pr80566-2.C
new file mode 100644
index 000..2a2e82d0a3a
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/pr80566-2.C
@@ -0,0 +1,14 @@
+// { dg-do compile }
+// { dg-options "-O2 -march=haswell" }
+
+#include 
+
+int *
+foo(int * q)
+{
+  int * p = new int[16];
+  memcpy(q,p,16*sizeof(int));
+  return p;
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 4 } } */
-- 
2.31.1



Re: Re: [PATCH] Fix ICE when mixing VLAs and statement expressions [PR91038]

2021-08-02 Thread Martin Uecker



(resending from a different account, as emails do not seem to go out
from my other account at this time)

Am Montag, den 02.08.2021, 16:05 +0200 schrieb Martin Uecker:
> > On Sun, Aug 1, 2021 at 7:37 PM Uecker, Martin
> >  wrote:
> > > 
> > > 
> > > Here is an attempt to fix some old and annoying bugs related
> > > to VLAs and statement expressions. In particulary, this seems
> > > to fix the issues with variably-modified types which are
> > > returned from statement expressions (which works on clang),
> > > but there are still bugs remaining related to structs
> > > with VLA members (which seems to be a FE bug).
> > > 
> > > Of course, I might be doing something stupid...
> > 
> > How's evaluation order of (f())[g()] defined (with f returning a
> > pointer)?
> > Isn't that just f() + g()*sizeof(int) and thus undefined?
> 
> Yes, in C it is
> 
> f() + g()
> 
> and it is unsequenced. But the order of 'f' and 'g'
> is not relevant here and also the patch does not change 
> it (the base expression is gimplified before the index).
> 
> Essentially, we have
> 
> ({ ... }) + g() * sizeof(X) 
> 
> where X refers to a declaration in the statement expression.
> Without the patch the size expressions are gimplified before
> the base expression and also before the index expression. 
> With the patch the ({ ... }) is gimplified also before the
> size expression.
> 
> > If it's undefined then I think the incoming GENERIC is ill-defined.
> 
> I think it is OK because the arguments are evaluated 
> before the operation.  Without the patch, parts of the 
> operation (the size expressions) are gimplified before
> the arguments and this seems wrong to me.
> 
> 
> Martin
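
For the record, a sketch of the kind of expression under discussion
(illustrative only, not the PR testcase):

int g (void);

/* The statement expression yields a pointer to a VLA type whose size
   refers to the inner declaration of m, and the subscript lowers to
   base + g () * sizeof (int[m]).  Gimplifying sizeof (int[m]) before
   the statement expression would reference m before it is evaluated,
   which is what the patch changes.  */
int
f (int n)
{
  return ({ int m = n;
            (int (*)[m]) __builtin_malloc (sizeof (int[m])); })[g ()][0];
}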



[PATCH] Add cond_add/sub/mul for vector integer modes.

2021-08-02 Thread liuhongt via Gcc-patches
Hi:
  This is a follow-up to [1].
  Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
  Pushed to trunk.
[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576514.html
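
To show the kind of source these expanders target, a rough sketch
(hypothetical loop and options, not one of the new testcases):

/* With e.g. -O3 -mavx512vl the conditional addition below can be
   vectorized through the new cond_add expander into a masked vpaddd
   (the middle end emits an IFN_COND_ADD internal call), instead of an
   unconditional add followed by a blend.  */
void
cond_add_loop (int *restrict dst, const int *restrict a,
               const int *restrict b, const int *restrict pred, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = pred[i] ? a[i] + b[i] : a[i];
}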

gcc/ChangeLog:

* config/i386/sse.md (cond_): New expander.
(cond_mul): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_addsubmul_d-1.c: New test.
* gcc.target/i386/cond_op_addsubmul_d-2.c: New test.
* gcc.target/i386/cond_op_addsubmul_q-1.c: New test.
* gcc.target/i386/cond_op_addsubmul_q-2.c: New test.
* gcc.target/i386/cond_op_addsubmul_w-1.c: New test.
* gcc.target/i386/cond_op_addsubmul_w-2.c: New test.
---
 gcc/config/i386/sse.md| 88 +--
 .../gcc.target/i386/cond_op_addsubmul_d-1.c   | 32 +++
 .../gcc.target/i386/cond_op_addsubmul_d-2.c   | 76 
 .../gcc.target/i386/cond_op_addsubmul_q-1.c   |  7 ++
 .../gcc.target/i386/cond_op_addsubmul_q-2.c   |  4 +
 .../gcc.target/i386/cond_op_addsubmul_w-1.c   |  6 ++
 .../gcc.target/i386/cond_op_addsubmul_w-2.c   |  5 ++
 7 files changed, 210 insertions(+), 8 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_addsubmul_d-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_addsubmul_d-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_addsubmul_q-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_addsubmul_q-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_addsubmul_w-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_addsubmul_w-2.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8bf1764d3d5..52b2b4214d7 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -333,6 +333,14 @@ (define_mode_iterator VI48_AVX512VL
   [V16SI (V8SI  "TARGET_AVX512VL") (V4SI  "TARGET_AVX512VL")
V8DI  (V4DI  "TARGET_AVX512VL") (V2DI  "TARGET_AVX512VL")])
 
+(define_mode_iterator VI1248_AVX512VLBW
+  [(V64QI "TARGET_AVX512BW") (V32QI "TARGET_AVX512VL && TARGET_AVX512BW")
+   (V16QI "TARGET_AVX512VL && TARGET_AVX512BW")
+   (V32HI "TARGET_AVX512BW") (V16HI "TARGET_AVX512VL && TARGET_AVX512BW")
+   (V8HI "TARGET_AVX512VL && TARGET_AVX512BW")
+   V16SI (V8SI "TARGET_AVX512VL") (V4SI "TARGET_AVX512VL")
+   V8DI (V4DI "TARGET_AVX512VL") (V2DI "TARGET_AVX512VL")])
+
 (define_mode_iterator VF_AVX512VL
   [V16SF (V8SF "TARGET_AVX512VL") (V4SF "TARGET_AVX512VL")
V8DF (V4DF "TARGET_AVX512VL") (V2DF "TARGET_AVX512VL")])
@@ -11803,6 +11811,24 @@ (define_expand "3"
   "TARGET_SSE2"
   "ix86_fixup_binary_operands_no_copy (, mode, operands);")
 
+(define_expand "cond_"
+  [(set (match_operand:VI1248_AVX512VLBW 0 "register_operand")
+   (vec_merge:VI1248_AVX512VLBW
+ (plusminus:VI1248_AVX512VLBW
+   (match_operand:VI1248_AVX512VLBW 2 "nonimmediate_operand")
+   (match_operand:VI1248_AVX512VLBW 3 "nonimmediate_operand"))
+ (match_operand:VI1248_AVX512VLBW 4 "nonimm_or_0_operand")
+ (match_operand: 1 "register_operand")))]
+  "TARGET_AVX512F"
+{
+  emit_insn (gen_3_mask (operands[0],
+operands[2],
+operands[3],
+operands[4],
+operands[1]));
+  DONE;
+})
+
 (define_expand "3_mask"
   [(set (match_operand:VI48_AVX512VL 0 "register_operand")
(vec_merge:VI48_AVX512VL
@@ -11929,6 +11955,24 @@ (define_expand "mul3"
   DONE;
 })
 
+(define_expand "cond_mul"
+  [(set (match_operand:VI2_AVX512VL 0 "register_operand")
+   (vec_merge:VI2_AVX512VL
+ (mult:VI2_AVX512VL
+   (match_operand:VI2_AVX512VL 2 "vector_operand")
+   (match_operand:VI2_AVX512VL 3 "vector_operand"))
+ (match_operand:VI2_AVX512VL 4 "nonimm_or_0_operand")
+ (match_operand: 1 "register_operand")))]
+  "TARGET_AVX512BW"
+{
+  emit_insn (gen_mul3_mask (operands[0],
+ operands[2],
+ operands[3],
+ operands[4],
+ operands[1]));
+  DONE;
+})
+
 (define_expand "mul3"
   [(set (match_operand:VI2_AVX2 0 "register_operand")
(mult:VI2_AVX2 (match_operand:VI2_AVX2 1 "vector_operand")
@@ -12363,6 +12407,24 @@ (define_insn "*sse2_pmaddwd"
(set_attr "prefix" "orig,vex")
(set_attr "mode" "TI")])
 
+(define_expand "cond_mul"
+  [(set (match_operand:VI8_AVX512VL 0 "register_operand")
+   (vec_merge:VI8_AVX512VL
+ (mult:VI8_AVX512VL
+   (match_operand:VI8_AVX512VL 2 "vector_operand")
+   (match_operand:VI8_AVX512VL 3 "vector_operand"))
+ (match_operand:VI8_AVX512VL 4 "nonimm_or_0_operand")
+ (match_operand: 1 "register_operand")))]
+  "TARGET_AVX512DQ"
+{
+  emit_insn (gen_avx512dq_mul3_mask (operands[0],
+  operands[2],
+  operands[3],
+