Re: [PATCH] aarch64: Improve on ldp-stp policies code structure.

2023-09-29 Thread Richard Sandiford
Thanks for the update.

Manos Anagnostakis  writes:
> Improves on: 834fc2bf
>
> This improves the code structure of the ldp-stp policies
> patch introduced in 834fc2bf
>
> Bootstrapped and regtested on aarch64-linux.
>
> gcc/ChangeLog:
>   * config/aarch64/aarch64-opts.h (enum aarch64_ldp_policy):
>   Added AARCH64 prefix.
>   (enum aarch64_stp_policy): Added AARCH64 prefix.
>   * config/aarch64/aarch64-protos.h (struct tune_params):
>   Merged enums aarch64_ldp_policy_model and aarch64_stp_policy_model
>   to aarch64_ldp_stp_policy_model.
>   * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): Removed.
>   (aarch64_parse_ldp_stp_policy): New function.
>   (aarch64_parse_stp_policy): Removed.
>   (aarch64_override_options_internal): Added call to
>   new parsing function and removed superseded ones.
>   (aarch64_mem_ok_with_ldpstp_policy_model): Improved
>   code quality based on the new changes.
>   * config/aarch64/aarch64.opt: Added AARCH64 prefix.
>
> gcc/testsuite/ChangeLog:
>   * gcc.target/aarch64/ldp_aligned.c: Split into this and
>   ldp_unaligned.
>   * gcc.target/aarch64/stp_aligned.c: Split into this and
>   stp_unaligned.
>   * gcc.target/aarch64/ldp_unaligned.c: New test.
>   * gcc.target/aarch64/stp_unaligned.c: New test.
>
> Signed-off-by: Manos Anagnostakis 
> ---
>  gcc/config/aarch64/aarch64-opts.h |  16 +-
>  gcc/config/aarch64/aarch64-protos.h   |  30 +--
>  gcc/config/aarch64/aarch64.cc | 184 --
>  gcc/config/aarch64/aarch64.opt|  20 +-
>  .../gcc.target/aarch64/ldp_aligned.c  |  28 ---
>  .../gcc.target/aarch64/ldp_unaligned.c|  40 
>  .../gcc.target/aarch64/stp_aligned.c  |  25 ---
>  .../gcc.target/aarch64/stp_unaligned.c|  37 
>  8 files changed, 189 insertions(+), 191 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_unaligned.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_unaligned.c
>
> diff --git a/gcc/config/aarch64/aarch64-opts.h 
> b/gcc/config/aarch64/aarch64-opts.h
> index db8348507a3..e23e1ea200e 100644
> --- a/gcc/config/aarch64/aarch64-opts.h
> +++ b/gcc/config/aarch64/aarch64-opts.h
> @@ -110,18 +110,18 @@ enum aarch64_key_type {
>  
>  /* Load pair policy type.  */
>  enum aarch64_ldp_policy {
> -  LDP_POLICY_DEFAULT,
> -  LDP_POLICY_ALWAYS,
> -  LDP_POLICY_NEVER,
> -  LDP_POLICY_ALIGNED
> +  AARCH64_LDP_POLICY_DEFAULT,
> +  AARCH64_LDP_POLICY_ALWAYS,
> +  AARCH64_LDP_POLICY_NEVER,
> +  AARCH64_LDP_POLICY_ALIGNED
>  };
>  
>  /* Store pair policy type.  */
>  enum aarch64_stp_policy {
> -  STP_POLICY_DEFAULT,
> -  STP_POLICY_ALWAYS,
> -  STP_POLICY_NEVER,
> -  STP_POLICY_ALIGNED
> +  AARCH64_STP_POLICY_DEFAULT,
> +  AARCH64_STP_POLICY_ALWAYS,
> +  AARCH64_STP_POLICY_NEVER,
> +  AARCH64_STP_POLICY_ALIGNED
>  };
>  
>  #endif

I was hoping it'd be possible to have a single enum here too, i.e.:

/* Load/store pair policy.  */
enum aarch64_ldp_stp_policy {
  AARCH64_LDP_STP_POLICY_DEFAULT,
  AARCH64_LDP_STP_POLICY_ALWAYS,
  AARCH64_LDP_STP_POLICY_NEVER,
  AARCH64_LDP_STP_POLICY_ALIGNED
};

Similarly, in the .opt file, we would just need an aarch64_ldp_stp_policy
that is shared between LDP and STP.  Maybe the error for unrecognised
options could be:

  UnknownError(unknown LDP/STP policy %qs)

> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 5c6802b4fe8..7d19111a215 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -568,30 +568,20 @@ struct tune_params
>/* Place prefetch struct pointer at the end to enable type checking
>   errors when tune_params misses elements (e.g., from erroneous merges).  
> */
>const struct cpu_prefetch_tune *prefetch;
> -/* An enum specifying how to handle load pairs using a fine-grained policy:
> -   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> -   to at least double the alignment of the type.
> -   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> -   - LDP_POLICY_NEVER: Do not emit ldp.  */
>  
> -  enum aarch64_ldp_policy_model
> -  {
> -LDP_POLICY_ALIGNED,
> -LDP_POLICY_ALWAYS,
> -LDP_POLICY_NEVER
> -  } ldp_policy_model;
> -/* An enum specifying how to handle store pairs using a fine-grained policy:
> -   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> +/* An enum specifying how to handle load and store pairs using
> +   a fine-grained policy:
> +   - LDP_STP_POLICY_ALIGNED: Emit ldp/stp if the source pointer is aligned
> to at least double the alignment of the type.
> -   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> -   - STP_POLICY_NEVER: Do not emit stp.  */
> +   - LDP_STP_POLICY_ALWAYS: Emit ldp/stp regardless of alignment.
> +   - LDP_STP_POLICY_NEVER: Do not emit ldp/stp.  */
>  
> -  enum aarch64_stp_policy_model
> +  en

Re: [RFC] > WIDE_INT_MAX_PREC support in wide_int and widest_int

2023-09-29 Thread Richard Sandiford
Richard Biener  writes:
> On Thu, 28 Sep 2023, Jakub Jelinek wrote:
>
>> Hi!
>> 
>> On Tue, Aug 29, 2023 at 05:09:52PM +0200, Jakub Jelinek via Gcc-patches 
>> wrote:
>> > On Tue, Aug 29, 2023 at 11:42:48AM +0100, Richard Sandiford wrote:
>> > > > I'll note tree-ssa-loop-niter.cc also uses GMP in some cases, 
>> > > > widest_int
>> > > > is really trying to be poor-mans GMP by limiting the maximum precision.
>> > > 
>> > > I'd characterise widest_int as "a wide_int that is big enough to hold
>> > > all supported integer types, without losing sign information".  It's
>> > > not big enough to do arbitrary arithmetic without losing precision
>> > > (in the way that GMP is).
>> > > 
>> > > If the new limit on integer sizes is 65535 bits for all targets,
>> > > then I think that means that widest_int needs to become a 65536-bit type.
>> > > (But not with all bits represented all the time, of course.)
>> > 
>> > If the widest_int storage would be dependent on the len rather than
>> > precision for how it is stored, then I think we'd need a new method which
>> > would be called at the start of filling the limbs where we'd tell how many
>> > limbs there would be (i.e. what will set_len be called with later on), and
>> > do nothing for all storages but the new widest_int_storage.
>> 
>> So, I've spent some time on this.  While wide_int is in the patch a 
>> fixed/variable
>> number of limbs (aka len) storage depending on precision (precision >
>> WIDE_INT_MAX_PRECISION means heap allocated limb array, otherwise it is
>> inline), widest_int has always very large precision
>> (WIDEST_INT_MAX_PRECISION, currently defined to the INTEGER_CST imposed
>> limitation of 255 64-bit limbs) but uses inline array for length
>> corresponding up to WIDE_INT_MAX_PRECISION bits and for larger one uses
>> similarly to wide_int a heap allocated array of limbs.
>> These changes make both wide_int and widest_int obviously non-POD, not
>> trivially default constructible, nor trivially copy constructible, trivially
>> destructible, trivially copyable, so not a good fit for GC and some vec
>> operations.
>> One common use of wide_int in GC structures was in dwarf2out.{h,cc}; but as
>> large _BitInt constants don't appear in RTL, we really don't need such large
>> precisions there.
>> So, for wide_int the patch introduces rwide_int, restricted wide_int, which
>> acts like the old wide_int (except that it is now trivially default
>> constructible and has assertions precision isn't set above
>> WIDE_INT_MAX_PRECISION).
>> For widest_int, the nastiness is that because it always has huge precision
>> of 16320 right now,
>> a) we need to be told upfront in wide-int.h before calling the large
>>value internal functions in wide-int.cc how many elements we'll need for
>>the result (some reasonable upper estimate is fine)
>> b) various of the wide-int.cc functions were lazy and assumed precision is
>>small enough and often used up to that many elements, which is
>>undesirable; so, it now tries to decrease that and use xi.len etc. based
>>estimates instead if possible (sometimes only if precision is above
>>WIDE_INT_MAX_PRECISION)
>> c) with the higher precision, behavior changes for lrshift (-1, 2) etc. or
>>unsigned division with dividend having most significant bit set in
>>widest_int - while such values were considered to be above or equal to
>>1 << (WIDE_INT_MAX_PRECISION - 2), now they are with
>>WIDEST_INT_MAX_PRECISION and so much larger; but lrshift on widest_int
>>is I think only done in ccp and I'd strongly hope that we treat the
>>values as unsigned and so usually much smaller length; so it is just
>>when we call wi::lrshift (-1, 2) or similar that results change.
>> I've noticed that for wide_int or widest_int references even simple
>> operations like eq_p liked to allocate and immediately free huge buffers,
>> which was caused by wide_int doing allocation on creation with a particular
>> precision and e.g. get_binary_precision running into that.  So, I've
>> duplicated that to avoid the allocations when all we need is just a
>> precision.
>> 
>> The patch below doesn't actually build anymore since the vec.h asserts
>> (which point to useful stuff though), so temporarily I've applied it also
>> with
>> --- gcc/vec.h.xx 2023-09-28 12:56:09

Re: [PATCH v2] aarch64: Improve on ldp-stp policies code structure.

2023-09-29 Thread Richard Sandiford
Manos Anagnostakis  writes:
> Improves on: 834fc2bf
>
> This improves the code structure of the ldp-stp policies
> patch introduced in 834fc2bf
>
> Bootstrapped and regtested on aarch64-linux.
>
> gcc/ChangeLog:
>   * config/aarch64/aarch64-opts.h (enum aarch64_ldp_policy): Removed.
>   (enum aarch64_ldp_stp_policy): Merged enums aarch64_ldp_policy
>   and aarch64_stp_policy to aarch64_ldp_stp_policy.
>   (enum aarch64_stp_policy): Removed.
>   * config/aarch64/aarch64-protos.h (struct tune_params): Removed
>   aarch64_ldp_policy_model and aarch64_stp_policy_model enum types
>   and left only the definitions to the aarch64-opts one.
>   * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): Removed.
>   (aarch64_parse_stp_policy): Removed.
>   (aarch64_override_options_internal): Removed calls to parsing
>   functions and added obvious direct assignments.
>   (aarch64_mem_ok_with_ldpstp_policy_model): Improved
>   code quality based on the new changes.
>   * config/aarch64/aarch64.opt: Use single enum type
>   aarch64_ldp_stp_policy for both ldp and stp options.
>
> gcc/testsuite/ChangeLog:
>   * gcc.target/aarch64/ldp_aligned.c: Split into this and
>   ldp_unaligned.
>   * gcc.target/aarch64/stp_aligned.c: Split into this and
>   stp_unaligned.
>   * gcc.target/aarch64/ldp_unaligned.c: New test.
>   * gcc.target/aarch64/stp_unaligned.c: New test.

Nice!  OK for trunk, thanks.

Sorry again for my mix-up with the original review.

Richard

> Signed-off-by: Manos Anagnostakis 
> ---
>  gcc/config/aarch64/aarch64-opts.h |  26 ++-
>  gcc/config/aarch64/aarch64-protos.h   |  25 +--
>  gcc/config/aarch64/aarch64.cc | 160 +++---
>  gcc/config/aarch64/aarch64.opt|  29 +---
>  .../gcc.target/aarch64/ldp_aligned.c  |  28 ---
>  .../gcc.target/aarch64/ldp_unaligned.c|  40 +
>  .../gcc.target/aarch64/stp_aligned.c  |  25 ---
>  .../gcc.target/aarch64/stp_unaligned.c|  37 
>  8 files changed, 155 insertions(+), 215 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_unaligned.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_unaligned.c
>
> diff --git a/gcc/config/aarch64/aarch64-opts.h 
> b/gcc/config/aarch64/aarch64-opts.h
> index db8348507a3..831e28ab52a 100644
> --- a/gcc/config/aarch64/aarch64-opts.h
> +++ b/gcc/config/aarch64/aarch64-opts.h
> @@ -108,20 +108,18 @@ enum aarch64_key_type {
>AARCH64_KEY_B
>  };
>  
> -/* Load pair policy type.  */
> -enum aarch64_ldp_policy {
> -  LDP_POLICY_DEFAULT,
> -  LDP_POLICY_ALWAYS,
> -  LDP_POLICY_NEVER,
> -  LDP_POLICY_ALIGNED
> -};
> -
> -/* Store pair policy type.  */
> -enum aarch64_stp_policy {
> -  STP_POLICY_DEFAULT,
> -  STP_POLICY_ALWAYS,
> -  STP_POLICY_NEVER,
> -  STP_POLICY_ALIGNED
> +/* An enum specifying how to handle load and store pairs using
> +   a fine-grained policy:
> +   - LDP_STP_POLICY_DEFAULT: Use the policy defined in the tuning structure.
> +   - LDP_STP_POLICY_ALIGNED: Emit ldp/stp if the source pointer is aligned
> +   to at least double the alignment of the type.
> +   - LDP_STP_POLICY_ALWAYS: Emit ldp/stp regardless of alignment.
> +   - LDP_STP_POLICY_NEVER: Do not emit ldp/stp.  */
> +enum aarch64_ldp_stp_policy {
> +  AARCH64_LDP_STP_POLICY_DEFAULT,
> +  AARCH64_LDP_STP_POLICY_ALIGNED,
> +  AARCH64_LDP_STP_POLICY_ALWAYS,
> +  AARCH64_LDP_STP_POLICY_NEVER
>  };
>  
>  #endif
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 5c6802b4fe8..60a55f4bc19 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -568,30 +568,9 @@ struct tune_params
>/* Place prefetch struct pointer at the end to enable type checking
>   errors when tune_params misses elements (e.g., from erroneous merges).  
> */
>const struct cpu_prefetch_tune *prefetch;
> -/* An enum specifying how to handle load pairs using a fine-grained policy:
> -   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> -   to at least double the alignment of the type.
> -   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> -   - LDP_POLICY_NEVER: Do not emit ldp.  */
>  
> -  enum aarch64_ldp_policy_model
> -  {
> -LDP_POLICY_ALIGNED,
> -LDP_POLICY_ALWAYS,
> -LDP_POLICY_NEVER
> -  } ldp_policy_model;
> -/* An enum specifying how to handle store pairs using a fine-grained policy:
> -   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> -   to at least double the alignment of the type.
> -   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> -   - STP_POLICY_NEVER: Do not emit stp.  */
> -
> -  enum aarch64_stp_policy_model
> -  {
> -STP_POLICY_ALIGNED,
> -STP_POLICY_ALWAYS,
> -STP_POLICY_NEVER
> -  } stp_policy_model;
> +  /* Define models for the aarch64_ldp_stp_policy.  */
> +  enum aarch64_ldp_stp_po

[pushed] Fix profiledbootstrap poly_int fallout [PR111642]

2023-10-01 Thread Richard Sandiford
rtl-tests.cc and simplify-rtx.cc used partial specialisation
to try to restrict the NUM_POLY_INT_COEFFS>1 tests without
resorting to preprocessor tests.  That now triggers an error
in some configurations, since the NUM_POLY_INT_COEFFS>1 tests
used the global poly_int64, whose definition does not depend
on the template parameter.

This patch uses local types that do depend on the template parameter.

Tested using profiledbootstrap and bootstrap4 on x86_64-linux-gnu,
both of which failed for me for unrelated reasons later.  But Sergei
confirms in the PR that the patch does fix the bug.  Pushed as obvious.

Richard


gcc/
PR bootstrap/111642
* rtl-tests.cc (const_poly_int_tests::run): Use a local
poly_int64 typedef.
* simplify-rtx.cc (simplify_const_poly_int_tests::run): Likewise.
---
 gcc/rtl-tests.cc| 1 +
 gcc/simplify-rtx.cc | 1 +
 2 files changed, 2 insertions(+)

diff --git a/gcc/rtl-tests.cc b/gcc/rtl-tests.cc
index ae8669419b6..96656c54a48 100644
--- a/gcc/rtl-tests.cc
+++ b/gcc/rtl-tests.cc
@@ -246,6 +246,7 @@ template<unsigned int N>
 void
const_poly_int_tests<N>::run ()
 {
+  using poly_int64 = poly_int<N, HOST_WIDE_INT>;
   rtx x1 = gen_int_mode (poly_int64 (1, 1), QImode);
   rtx x255 = gen_int_mode (poly_int64 (1, 255), QImode);
 
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 170406aa28b..bd9443dbcc2 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -8689,6 +8689,7 @@ template<unsigned int N>
 void
simplify_const_poly_int_tests<N>::run ()
 {
+  using poly_int64 = poly_int<N, HOST_WIDE_INT>;
   rtx x1 = gen_int_mode (poly_int64 (1, 1), QImode);
   rtx x2 = gen_int_mode (poly_int64 (-80, 127), QImode);
   rtx x3 = gen_int_mode (poly_int64 (-79, -128), QImode);
-- 
2.25.1



Re: [PATCH] Remove poly_int_pod

2023-10-02 Thread Richard Sandiford
Jan-Benedict Glaw  writes:
> Hi Richard,
>
> On Thu, 2023-09-28 10:55:46 +0100, Richard Sandiford 
>  wrote:
>> poly_int was written before the switch to C++11 and so couldn't
>> use explicit default constructors.  This led to an awkward split
>> between poly_int_pod and poly_int.  poly_int simply inherited from
>> poly_int_pod and added constructors, with the argumentless constructor
>> having an empty body.  But inheritance meant that poly_int had to
>> repeat the assignment operators from poly_int_pod (again, no C++11,
>> so no "using" to inherit base-class implementations).
> [...]
>
> I haven't bisected it, but I guess your patch caused this:
>
> [all 2023-10-02 06:59:02] 
> /var/lib/laminar/run/gcc-local/75/local-toolchain-install/bin/g++ -std=c++11  
> -fno-PIE -c   -g -O2   -DIN_GCC-fno-exceptions -fno-rtti 
> -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings 
> -Wcast-qual -Wmissing-format-attribute -Wconditionally-supported 
> -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros 
> -Wno-overlength-strings -Werror -fno-common  -DHAVE_CONFIG_H -fno-PIE -I. -I. 
> -I../../gcc/gcc -I../../gcc/gcc/. -I../../gcc/gcc/../include  
> -I../../gcc/gcc/../libcpp/include -I../../gcc/gcc/../libcody  
> -I../../gcc/gcc/../libdecnumber -I../../gcc/gcc/../libdecnumber/bid 
> -I../libdecnumber -I../../gcc/gcc/../libbacktrace   -o rtl-tests.o -MT 
> rtl-tests.o -MMD -MP -MF ./.deps/rtl-tests.TPo ../../gcc/gcc/rtl-tests.cc
> [all 2023-10-02 06:59:04] In file included from ../../gcc/gcc/coretypes.h:480,
> [all 2023-10-02 06:59:04]  from ../../gcc/gcc/rtl-tests.cc:22:
> [all 2023-10-02 06:59:04] ../../gcc/gcc/poly-int.h: In instantiation of 
> 'constexpr poly_int::poly_int(poly_int_full, const Cs& ...) [with Cs = 
> {int, int}; unsigned int N = 1; C = long int]':
> [all 2023-10-02 06:59:04] ../../gcc/gcc/poly-int.h:439:13:   required from 
> here
> [all 2023-10-02 06:59:04] ../../gcc/gcc/rtl-tests.cc:249:25:   in 'constexpr' 
> expansion of 'poly_int<1, long int>(1, 1)'
> [all 2023-10-02 06:59:04] ../../gcc/gcc/poly-int.h:453:5: error: too many 
> initializers for 'long int [1]'
> [all 2023-10-02 06:59:04]   453 |   : coeffs { (typename 
> poly_coeff_traits::
> [all 2023-10-02 06:59:04]   | 
> ^
> [all 2023-10-02 06:59:04]   454 |   template init_cast::type 
> (cs))... } {}
> [all 2023-10-02 06:59:04]   |   
> ~~~
> [all 2023-10-02 06:59:04] make[1]: *** [Makefile:1188: rtl-tests.o] Error 1
> [all 2023-10-02 06:59:04] make[1]: Leaving directory 
> '/var/lib/laminar/run/gcc-local/75/toolchain-build/gcc'
> [all 2023-10-02 06:59:05] make: *** [Makefile:4993: all-gcc] Error 2
>
>
> (Full build log at
> http://toolchain.lug-owl.de/laminar/jobs/gcc-local/75 .  That's in a
> Docker container on amd64-linux with the host gcc being at fairly new
> at basepoints/gcc-14-3827-g30e6ee07458)

Yeah, this was PR111642.  I pushed a fix this morning.

Thanks,
Richard


Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-05 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This adds an implementation for masked copysign along with an optimized
> pattern for masked copysign (x, -1).

It feels like we're ending up with a lot of AArch64-specific code that
just hard-codes the observation that changing the sign is equivalent to
changing the top bit.  We then need to make sure that we choose the best
way of changing the top bit for any given situation.

Hard-coding the -1/negative case is one instance of that.  But it looks
like we also fail to use the best sequence for SVE2.  E.g.
[https://godbolt.org/z/ajh3MM5jv]:

#include <stdint.h>

void f(double *restrict a, double *restrict b) {
    for (int i = 0; i < 100; ++i)
        a[i] = __builtin_copysign(a[i], b[i]);
}

void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
    for (int i = 0; i < 100; ++i)
        a[i] = (a[i] & ~c) | (b[i] & c);
}

gives:

f:
        mov     x2, 0
        mov     w3, 100
        whilelo p7.d, wzr, w3
.L2:
        ld1d    z30.d, p7/z, [x0, x2, lsl 3]
        ld1d    z31.d, p7/z, [x1, x2, lsl 3]
        and     z30.d, z30.d, #0x7fff
        and     z31.d, z31.d, #0x8000
        orr     z31.d, z31.d, z30.d
        st1d    z31.d, p7, [x0, x2, lsl 3]
        incd    x2
        whilelo p7.d, w2, w3
        b.any   .L2
        ret
g:
        mov     x3, 0
        mov     w4, 100
        mov     z29.d, x2
        whilelo p7.d, wzr, w4
.L6:
        ld1d    z30.d, p7/z, [x0, x3, lsl 3]
        ld1d    z31.d, p7/z, [x1, x3, lsl 3]
        bsl     z31.d, z31.d, z30.d, z29.d
        st1d    z31.d, p7, [x0, x3, lsl 3]
        incd    x3
        whilelo p7.d, w3, w4
        b.any   .L6
        ret

I saw that you originally tried to do this in match.pd and that the
decision was to fold to copysign instead.  But perhaps there's a compromise
where isel does something with the (new) copysign canonical form?
I.e. could we go with your new version of the match.pd patch, and add
some isel stuff as a follow-on?

Not saying no to this patch, just thought that the above was worth
considering.

[I agree with Andrew's comments FWIW.]

Thanks,
Richard

>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   PR tree-optimization/109154
>   * config/aarch64/aarch64-sve.md (cond_copysign): New.
>
> gcc/testsuite/ChangeLog:
>
>   PR tree-optimization/109154
>   * gcc.target/aarch64/sve/fneg-abs_5.c: New test.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index 
> 071400c820a5b106ddf9dc9faebb117975d74ea0..00ca30c24624dc661254568f45b61a14aa11c305
>  100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -6429,6 +6429,57 @@ (define_expand "copysign3"
>}
>  )
>  
> +(define_expand "cond_copysign"
> +  [(match_operand:SVE_FULL_F 0 "register_operand")
> +   (match_operand: 1 "register_operand")
> +   (match_operand:SVE_FULL_F 2 "register_operand")
> +   (match_operand:SVE_FULL_F 3 "nonmemory_operand")
> +   (match_operand:SVE_FULL_F 4 "aarch64_simd_reg_or_zero")]
> +  "TARGET_SVE"
> +  {
> +rtx sign = gen_reg_rtx (mode);
> +rtx mant = gen_reg_rtx (mode);
> +rtx int_res = gen_reg_rtx (mode);
> +int bits = GET_MODE_UNIT_BITSIZE (mode) - 1;
> +
> +rtx arg2 = lowpart_subreg (mode, operands[2], mode);
> +rtx arg3 = lowpart_subreg (mode, operands[3], mode);
> +rtx arg4 = lowpart_subreg (mode, operands[4], mode);
> +
> +rtx v_sign_bitmask
> +  = aarch64_simd_gen_const_vector_dup (mode,
> +HOST_WIDE_INT_M1U << bits);
> +
> +/* copysign (x, -1) should instead be expanded as orr with the sign
> +   bit.  */
> +if (!REG_P (operands[3]))
> +  {
> + auto r0
> +   = CONST_DOUBLE_REAL_VALUE (unwrap_const_vec_duplicate (operands[3]));
> + if (-1 == real_to_integer (r0))
> +   {
> + arg3 = force_reg (mode, v_sign_bitmask);
> + emit_insn (gen_cond_ior (int_res, operands[1], arg2,
> +   arg3, arg4));
> + emit_move_insn (operands[0], gen_lowpart (mode, int_res));
> + DONE;
> +   }
> +  }
> +
> +operands[2] = force_reg (mode, operands[3]);
> +emit_insn (gen_and3 (sign, arg3, v_sign_bitmask));
> +emit_insn (gen_and3
> +(mant, arg2,
> + aarch64_simd_gen_const_vector_dup (mode,
> +~(HOST_WIDE_INT_M1U
> +  << bits;
> +emit_insn (gen_cond_ior (int_res, operands[1], sign, mant,
> +   arg4));
> +emit_move_insn (operands[0], gen_lowpart (mode, int_res));
> +DONE;
> +  }
> +)
> +
>  (define_expand "xorsign3"
>[(match_operand:SVE_FULL_F 0 "register_operand")
> (match_operand:SVE_FULL_F 1 "register_operand")
> diff -

Re: [PATCH]AArch64 Add special patterns for creating DI scalar and vector constant 1 << 63 [PR109154]

2023-10-05 Thread Richard Sandiford
Tamar Christina  writes:
> Hi,
>
>> The lowpart_subreg should simplify this back into CONST0_RTX (mode),
>> making it no different from:
>> 
>> emit_move_insn (target, CONST0_RTX (mode));
>> 
>> If the intention is to share zeros between modes (sounds good!), then I think
>> the subreg needs to be on the lhs instead.
>> 
>> > +  rtx neg = lowpart_subreg (V2DFmode, target, mode);
>> > +  emit_insn (gen_negv2df2 (neg, lowpart_subreg (V2DFmode, target,
>> > + mode)));
>> 
>> The rhs seems simpler as copy_rtx (neg).  (Even the copy_rtx shouldn't be
>> needed after RA, but it's probably more future-proof to keep it.)
>> 
>> > +  emit_move_insn (target, lowpart_subreg (mode, neg, V2DFmode));
>> 
>> This shouldn't be needed, since neg is already a reference to target.
>> 
>> Overall, looks like a nice change/framework.
>
> Updated the patch, and in te process also realized this can be used for the
> vector variants:
>
> Hi All,
>
> This adds a way to generate special sequences for creation of constants for
> which we don't have single instructions sequences which would have normally
> lead to a GP -> FP transfer or a literal load.
>
> The patch starts out by adding support for creating 1 << 63 using fneg (mov 
> 0).
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   PR tree-optimization/109154
>   * config/aarch64/aarch64-protos.h (aarch64_simd_special_constant_p,
>   aarch64_maybe_generate_simd_constant): New.
>   * config/aarch64/aarch64-simd.md (*aarch64_simd_mov,
>   *aarch64_simd_mov): Add new coden for special constants.
>   * config/aarch64/aarch64.cc (aarch64_extract_vec_duplicate_wide_int):
>   Take optional mode.
>   (aarch64_simd_special_constant_p,
>   aarch64_maybe_generate_simd_constant): New.
>   * config/aarch64/aarch64.md (*movdi_aarch64): Add new codegen for
>   special constants.
>   * config/aarch64/constraints.md (Dx): new.
>
> gcc/testsuite/ChangeLog:
>
>   PR tree-optimization/109154
>   * gcc.target/aarch64/fneg-abs_1.c: Updated.
>   * gcc.target/aarch64/fneg-abs_2.c: Updated.
>   * gcc.target/aarch64/fneg-abs_4.c: Updated.
>   * gcc.target/aarch64/dbl_mov_immediate_1.c: Updated.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 
> 60a55f4bc1956786ea687fc7cad7ec9e4a84e1f0..36d6c688bc888a51a9de174bd3665aebe891b8b1
>  100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -831,6 +831,8 @@ bool aarch64_sve_ptrue_svpattern_p (rtx, struct 
> simd_immediate_info *);
>  bool aarch64_simd_valid_immediate (rtx, struct simd_immediate_info *,
>   enum simd_immediate_check w = AARCH64_CHECK_MOV);
>  rtx aarch64_check_zero_based_sve_index_immediate (rtx);
> +bool aarch64_maybe_generate_simd_constant (rtx, rtx, machine_mode);
> +bool aarch64_simd_special_constant_p (rtx, machine_mode);
>  bool aarch64_sve_index_immediate_p (rtx);
>  bool aarch64_sve_arith_immediate_p (machine_mode, rtx, bool);
>  bool aarch64_sve_sqadd_sqsub_immediate_p (machine_mode, rtx, bool);
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 81ff5bad03d598fa0d48df93d172a28bc0d1d92e..33eceb436584ff73c7271f93639f2246d1af19e0
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -142,26 +142,35 @@ (define_insn "aarch64_dup_lane_"
>[(set_attr "type" "neon_dup")]
>  )
>  
> -(define_insn "*aarch64_simd_mov"
> +(define_insn_and_split "*aarch64_simd_mov"
>[(set (match_operand:VDMOV 0 "nonimmediate_operand")
>   (match_operand:VDMOV 1 "general_operand"))]
>"TARGET_FLOAT
> && (register_operand (operands[0], mode)
> || aarch64_simd_reg_or_zero (operands[1], mode))"
> -  {@ [cons: =0, 1; attrs: type, arch]
> - [w , m ; neon_load1_1reg , *   ] ldr\t%d0, %1
> - [r , m ; load_8 , *   ] ldr\t%x0, %1
> - [m , Dz; store_8, *   ] str\txzr, %0
> - [m , w ; neon_store1_1reg, *   ] str\t%d1, %0
> - [m , r ; store_8, *   ] str\t%x1, %0
> - [w , w ; neon_logic  , simd] mov\t%0., %1.
> - [w , w ; neon_logic  , *   ] fmov\t%d0, %d1
> - [?r, w ; neon_to_gp  , simd] umov\t%0, %1.d[0]
> - [?r, w ; neon_to_gp  , *   ] fmov\t%x0, %d1
> - [?w, r ; f_mcr  , *   ] fmov\t%d0, %1
> - [?r, r ; mov_reg, *   ] mov\t%0, %1
> - [w , Dn; neon_move   , simd] << 
> aarch64_output_simd_mov_immediate (operands[1], 64);
> - [w , Dz; f_mcr  , *   ] fmov\t%d0, xzr
> +  {@ [cons: =0, 1; attrs: type, arch, length]
> + [w , m ; neon_load1_1reg , *   , *] ldr\t%d0, %1
> + [r , m ; load_8 , *   , *] ldr\t%x0, %1
> + [m , Dz; store_8, *   , *] str\txzr, %0
> + [m , w ; neo

Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-05 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Thursday, October 5, 2023 8:29 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; Kyrylo Tkachov 
>> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> 
>> Tamar Christina  writes:
>> > Hi All,
>> >
>> > This adds an implementation for masked copysign along with an
>> > optimized pattern for masked copysign (x, -1).
>> 
>> It feels like we're ending up with a lot of AArch64-specific code that just 
>> hard-
>> codes the observation that changing the sign is equivalent to changing the 
>> top
>> bit.  We then need to make sure that we choose the best way of changing the
>> top bit for any given situation.
>> 
>> Hard-coding the -1/negative case is one instance of that.  But it looks like 
>> we
>> also fail to use the best sequence for SVE2.  E.g.
>> [https://godbolt.org/z/ajh3MM5jv]:
>> 
>> #include 
>> 
>> void f(double *restrict a, double *restrict b) {
>> for (int i = 0; i < 100; ++i)
>> a[i] = __builtin_copysign(a[i], b[i]); }
>> 
>> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
>> for (int i = 0; i < 100; ++i)
>> a[i] = (a[i] & ~c) | (b[i] & c); }
>> 
>> gives:
>> 
>> f:
>> mov x2, 0
>> mov w3, 100
>> whilelo p7.d, wzr, w3
>> .L2:
>> ld1dz30.d, p7/z, [x0, x2, lsl 3]
>> ld1dz31.d, p7/z, [x1, x2, lsl 3]
>> and z30.d, z30.d, #0x7fff
>> and z31.d, z31.d, #0x8000
>> orr z31.d, z31.d, z30.d
>> st1dz31.d, p7, [x0, x2, lsl 3]
>> incdx2
>> whilelo p7.d, w2, w3
>> b.any   .L2
>> ret
>> g:
>> mov x3, 0
>> mov w4, 100
>> mov z29.d, x2
>> whilelo p7.d, wzr, w4
>> .L6:
>> ld1dz30.d, p7/z, [x0, x3, lsl 3]
>> ld1dz31.d, p7/z, [x1, x3, lsl 3]
>> bsl z31.d, z31.d, z30.d, z29.d
>> st1dz31.d, p7, [x0, x3, lsl 3]
>> incdx3
>> whilelo p7.d, w3, w4
>> b.any   .L6
>> ret
>> 
>> I saw that you originally tried to do this in match.pd and that the decision 
>> was
>> to fold to copysign instead.  But perhaps there's a compromise where isel 
>> does
>> something with the (new) copysign canonical form?
>> I.e. could we go with your new version of the match.pd patch, and add some
>> isel stuff as a follow-on?
>> 
>
> Sure if that's what's desired But..
>
> The example you posted above is for instance worse for x86 
> https://godbolt.org/z/x9ccqxW6T
> where the first operation has a dependency chain of 2 and the latter of 3.  
> It's likely any
> open coding of this operation is going to hurt a target.
>
> So I'm unsure what isel transform this into...

I didn't mean that we should go straight to using isel for the general
case, just for the new case.  The example above was instead trying to
show the general point that hiding the logic ops in target code is a
double-edged sword.

The x86_64 example for the -1 case would be https://godbolt.org/z/b9s6MaKs8
where the isel change would be an improvement.  Without that, I guess
x86_64 will need to have a similar patch to the AArch64 one.

That said, https://godbolt.org/z/e6nqoqbMh suggests that powerpc64
is probably relying on the current copysign -> neg/abs transform.
(Not sure why the second function uses different IVs from the first.)

Personally, I wouldn't be against a target hook that indicated whether
float bit manipulation is "free" for a given mode, if it comes to that.

Thanks,
Richard


Re: [PATCH V2] Emit funcall external declarations only if actually used.

2023-10-05 Thread Richard Sandiford
"Jose E. Marchesi"  writes:
> ping

I don't know this code very well, and AFAIR haven't worked
with an assembler that requires external declarations, but since
it's at a second ping :)

>
>> ping
>>
>>> [Differences from V1:
>>> - Prototype for call_from_call_insn moved before comment block.
>>> - Reuse the `call' flag for SYMBOL_REF_LIBCALL.
>>> - Fallback to check REG_CALL_DECL in non-direct calls.
>>> - New test to check correct behavior for non-direct calls.]
>>>
>>> There are many places in GCC where alternative local sequences are
>>> tried in order to determine what is the cheapest or best alternative
>>> to use in the current target.  When any of these sequences involve a
>>> libcall, the current implementation of emit_library_call_value_1
>>> introduce a side-effect consisting on emitting an external declaration
>>> for the funcall (such as __divdi3) which is thus emitted even if the
>>> sequence that does the libcall is not retained.
>>>
>>> This is problematic in targets such as BPF, because the kernel loader
>>> chokes on the spurious symbol __divdi3 and makes the resulting BPF
>>> object unloadable.  Note that BPF objects are not linked before being
>>> loaded.
>>>
>>> This patch changes emit_library_call_value_1 to mark the target
>>> SYMBOL_REF as a libcall.  Then, the emission of the external
>>> declaration is done in the first loop of final.cc:shorten_branches.
>>> This happens only if the corresponding sequence has been kept.
>>>
>>> Regtested in x86_64-linux-gnu.
>>> Tested with host x86_64-linux-gnu with target bpf-unknown-none.

I'm not sure that shorten_branches is a natural place to do this.
It isn't something that would normally emit asm text.

Would it be OK to emit the declaration at the same point as for decls,
which IIUC is process_pending_assemble_externals?  If so, how about
making assemble_external_libcall add the symbol to a list when
!SYMBOL_REF_USED, instead of calling targetm.asm_out.external_libcall
directly?  assemble_external_libcall could then also call get_identifier
on the name (perhaps after calling strip_name_encoding -- can't
remember whether assemble_external_libcall sees the encoded or
unencoded name).

All being well, the call to get_identifier should cause
assemble_name_resolve to record when the name is used, via
TREE_SYMBOL_REFERENCED.  Then process_pending_assemble_externals could
go through the list of libcalls recorded by assemble_external_libcall
and check whether TREE_SYMBOL_REFERENCED is set on the get_identifier.
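Very roughly, and only to sketch the shape (completely untested; the
variable name and its exact placement below are just for illustration,
not a concrete proposal):

  /* varasm.cc: record the libcall instead of declaring it immediately.  */
  static GTY(()) vec<rtx, va_gc> *pending_libcall_symbols;

  void
  assemble_external_libcall (rtx fun)
  {
    if (!SYMBOL_REF_USED (fun))
      {
        /* Interning the name here lets assemble_name_resolve record any
           later use via TREE_SYMBOL_REFERENCED on the same identifier.  */
        get_identifier (targetm.strip_name_encoding (XSTR (fun, 0)));
        vec_safe_push (pending_libcall_symbols, fun);
        SYMBOL_REF_USED (fun) = 1;
      }
  }

  /* ...and at the end of process_pending_assemble_externals:  */
  unsigned int i;
  rtx fun;
  FOR_EACH_VEC_SAFE_ELT (pending_libcall_symbols, i, fun)
    {
      tree id = get_identifier (targetm.strip_name_encoding (XSTR (fun, 0)));
      if (TREE_SYMBOL_REFERENCED (id))
        targetm.asm_out.external_libcall (fun);
    }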

Not super elegant, but it seems to fit within the existing scheme.
And I don't think there should be any problem with using get_identifier
for libcalls, since it isn't valid to use libcall names for other
types of symbol.

Thanks,
Richard

>>>
>>> gcc/ChangeLog
>>>
>>> * rtl.h (SYMBOL_REF_LIBCALL): Define.
>>> * calls.cc (emit_library_call_value_1): Do not emit external
>>> libcall declaration here.
>>> * final.cc (shorten_branches): Do it here.
>>>
>>> gcc/testsuite/ChangeLog
>>>
>>> * gcc.target/bpf/divmod-libcall-1.c: New test.
>>> * gcc.target/bpf/divmod-libcall-2.c: Likewise.
>>> * gcc.c-torture/compile/libcall-2.c: Likewise.
>>> ---
>>>  gcc/calls.cc  |  9 +++---
>>>  gcc/final.cc  | 30 +++
>>>  gcc/rtl.h |  5 
>>>  .../gcc.c-torture/compile/libcall-2.c |  8 +
>>>  .../gcc.target/bpf/divmod-libcall-1.c | 19 
>>>  .../gcc.target/bpf/divmod-libcall-2.c | 16 ++
>>>  6 files changed, 83 insertions(+), 4 deletions(-)
>>>  create mode 100644 gcc/testsuite/gcc.c-torture/compile/libcall-2.c
>>>  create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-libcall-1.c
>>>  create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-libcall-2.c
>>>
>>> diff --git a/gcc/calls.cc b/gcc/calls.cc
>>> index 1f3a6d5c450..219ea599b16 100644
>>> --- a/gcc/calls.cc
>>> +++ b/gcc/calls.cc
>>> @@ -4388,9 +4388,10 @@ emit_library_call_value_1 (int retval, rtx orgfun, 
>>> rtx value,
>>> || argvec[i].partial != 0)
>>>update_stack_alignment_for_call (&argvec[i].locate);
>>>  
>>> -  /* If this machine requires an external definition for library
>>> - functions, write one out.  */
>>> -  assemble_external_libcall (fun);
>>> +  /* Mark the emitted target as a libcall.  This will be used by final
>>> + in order to emit an external symbol declaration if the libcall is
>>> + ever used.  */
>>> +  SYMBOL_REF_LIBCALL (fun) = 1;
>>>  
>>>original_args_size = args_size;
>>>args_size.constant = (aligned_upper_bound (args_size.constant
>>> @@ -4735,7 +4736,7 @@ emit_library_call_value_1 (int retval, rtx orgfun, 
>>> rtx value,
>>>valreg,
>>>old_inhibit_defer_pop + 1, call_fusage, flags, args_so_far);
>>>  
>>> -  if (flag_ipa_ra)
>>> +  if (flag_ipa_ra || SYMBOL_REF_LIBCALL (orgfun))
>>>  {
>>>rtx datum = orgfun;
>>>

Re: [PATCH]middle-end match.pd: optimize fneg (fabs (x)) to x | (1 << signbit(x)) [PR109154]

2023-10-07 Thread Richard Sandiford
Richard Biener  writes:
> On Thu, 5 Oct 2023, Tamar Christina wrote:
>
>> > I suppose the idea is that -abs(x) might be easier to optimize with other
>> > patterns (consider a - copysign(x,...), optimizing to a + abs(x)).
>> > 
>> > For abs vs copysign it's a canonicalization, but (negate (abs @0)) is less
>> > canonical than copysign.
>> > 
>> > > Should I try removing this?
>> > 
>> > I'd say yes (and put the reverse canonicalization next to this pattern).
>> > 
>> 
>> This patch transforms fneg (fabs (x)) into copysign (x, -1) which is more
>> canonical and allows a target to expand this sequence efficiently.  Such
>> sequences are common in scientific code working with gradients.
>> 
>> various optimizations in match.pd only happened on COPYSIGN but not 
>> COPYSIGN_ALL
>> which means they exclude IFN_COPYSIGN.  COPYSIGN however is restricted to 
>> only
>
> That's not true:
>
> (define_operator_list COPYSIGN
> BUILT_IN_COPYSIGNF
> BUILT_IN_COPYSIGN
> BUILT_IN_COPYSIGNL
> IFN_COPYSIGN)
>
> but they miss the extended float builtin variants like
> __builtin_copysignf16.  Also see below
>
>> the C99 builtins and so doesn't work for vectors.
>> 
>> The patch expands these optimizations to work on COPYSIGN_ALL.
>> 
>> There is an existing canonicalization of copysign (x, -1) to fneg (fabs (x))
>> which I remove since this is a less efficient form.  The testsuite is also
>> updated in light of this.
>> 
>> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> 
>> Ok for master?
>> 
>> Thanks,
>> Tamar
>> 
>> gcc/ChangeLog:
>> 
>>  PR tree-optimization/109154
>>  * match.pd: Add new neg+abs rule, remove inverse copysign rule and
>>  expand existing copysign optimizations.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>  PR tree-optimization/109154
>>  * gcc.dg/fold-copysign-1.c: Updated.
>>  * gcc.dg/pr55152-2.c: Updated.
>>  * gcc.dg/tree-ssa/abs-4.c: Updated.
>>  * gcc.dg/tree-ssa/backprop-6.c: Updated.
>>  * gcc.dg/tree-ssa/copy-sign-2.c: Updated.
>>  * gcc.dg/tree-ssa/mult-abs-2.c: Updated.
>>  * gcc.target/aarch64/fneg-abs_1.c: New test.
>>  * gcc.target/aarch64/fneg-abs_2.c: New test.
>>  * gcc.target/aarch64/fneg-abs_3.c: New test.
>>  * gcc.target/aarch64/fneg-abs_4.c: New test.
>>  * gcc.target/aarch64/sve/fneg-abs_1.c: New test.
>>  * gcc.target/aarch64/sve/fneg-abs_2.c: New test.
>>  * gcc.target/aarch64/sve/fneg-abs_3.c: New test.
>>  * gcc.target/aarch64/sve/fneg-abs_4.c: New test.
>> 
>> --- inline copy of patch ---
>> 
>> diff --git a/gcc/match.pd b/gcc/match.pd
>> index 
>> 4bdd83e6e061b16dbdb2845b9398fcfb8a6c9739..bd6599d36021e119f51a4928354f580ffe82c6e2
>>  100644
>> --- a/gcc/match.pd
>> +++ b/gcc/match.pd
>> @@ -1074,45 +1074,43 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>  
>>  /* cos(copysign(x, y)) -> cos(x).  Similarly for cosh.  */
>>  (for coss (COS COSH)
>> - copysigns (COPYSIGN)
>> - (simplify
>> -  (coss (copysigns @0 @1))
>> -   (coss @0)))
>> + (for copysigns (COPYSIGN_ALL)
>
> So this ends up generating for example the match
> (cosf (copysignl ...)) which doesn't make much sense.
>
> The lock-step iteration did
> (cosf (copysignf ..)) ... (ifn_cos (ifn_copysign ...))
> which is leaner but misses the case of
> (cosf (ifn_copysign ..)) - that's probably what you are
> after with this change.
>
> That said, there isn't a nice solution (without altering the match.pd
> IL).  There's the explicit solution, spelling out all combinations.
>
> So if we want to go with yout pragmatic solution changing this
> to use COPYSIGN_ALL isn't necessary, only changing the lock-step
> for iteration to a cross product for iteration is.
>
> Changing just this pattern to
>
> (for coss (COS COSH)
>  (for copysigns (COPYSIGN)
>   (simplify
>(coss (copysigns @0 @1))
>(coss @0
>
> increases the total number of gimple-match-x.cc lines from
> 234988 to 235324.

I guess the difference between this and the later suggestions is that
this one allows builtin copysign to be paired with ifn cos, which would
be potentially useful in other situations.  (It isn't here because
ifn_cos is rarely provided.)  How much of the growth is due to that,
and how much of it is from nonsensical combinations like
(builtin_cosf (builtin_copysignl ...))?

If it's mostly from nonsensical combinations then would it be possible
to make genmatch drop them?

> The alternative is to do
>
> (for coss (COS COSH)
>  copysigns (COPYSIGN)
>  (simplify
>   (coss (copysigns @0 @1))
>(coss @0))
>  (simplify
>   (coss (IFN_COPYSIGN @0 @1))
>(coss @0)))
>
> which properly will diagnose a duplicate pattern.  There are
> currently no operator lists with just builtins defined (that
> could be fixed, see gencfn-macros.cc), supposed we'd have
> COS_C we could do
>
> (for coss (COS_C COSH_C IFN_COS IFN_COSH)
>  copysigns (COPYSIGN_C COPYSIGN_C IFN_COPYSIGN IFN_COPYSIGN 
> IFN_COPYSIGN IFN_COPYSIGN IFN_COPYSIGN IFN_COPYSIGN IFN_COPYSIGN 
> 

Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-07 Thread Richard Sandiford
Richard Biener  writes:
> On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina  
> wrote:
>>
>> > -Original Message-----
>> > From: Richard Sandiford 
>> > Sent: Thursday, October 5, 2023 9:26 PM
>> > To: Tamar Christina 
>> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> > ; Marcus Shawcroft
>> > ; Kyrylo Tkachov 
>> > Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> >
>> > Tamar Christina  writes:
>> > >> -Original Message-
>> > >> From: Richard Sandiford 
>> > >> Sent: Thursday, October 5, 2023 8:29 PM
>> > >> To: Tamar Christina 
>> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> > >> ; Marcus Shawcroft
>> > >> ; Kyrylo Tkachov
>> > 
>> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> > >>
>> > >> Tamar Christina  writes:
>> > >> > Hi All,
>> > >> >
>> > >> > This adds an implementation for masked copysign along with an
>> > >> > optimized pattern for masked copysign (x, -1).
>> > >>
>> > >> It feels like we're ending up with a lot of AArch64-specific code
>> > >> that just hard- codes the observation that changing the sign is
>> > >> equivalent to changing the top bit.  We then need to make sure that
>> > >> we choose the best way of changing the top bit for any given situation.
>> > >>
>> > >> Hard-coding the -1/negative case is one instance of that.  But it
>> > >> looks like we also fail to use the best sequence for SVE2.  E.g.
>> > >> [https://godbolt.org/z/ajh3MM5jv]:
>> > >>
>> > >> #include 
>> > >>
>> > >> void f(double *restrict a, double *restrict b) {
>> > >> for (int i = 0; i < 100; ++i)
>> > >> a[i] = __builtin_copysign(a[i], b[i]); }
>> > >>
>> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
>> > >> for (int i = 0; i < 100; ++i)
>> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
>> > >>
>> > >> gives:
>> > >>
>> > >> f:
>> > >> mov x2, 0
>> > >> mov w3, 100
>> > >> whilelo p7.d, wzr, w3
>> > >> .L2:
>> > >> ld1dz30.d, p7/z, [x0, x2, lsl 3]
>> > >> ld1dz31.d, p7/z, [x1, x2, lsl 3]
>> > >> and z30.d, z30.d, #0x7fff
>> > >> and z31.d, z31.d, #0x8000
>> > >> orr z31.d, z31.d, z30.d
>> > >> st1dz31.d, p7, [x0, x2, lsl 3]
>> > >> incdx2
>> > >> whilelo p7.d, w2, w3
>> > >> b.any   .L2
>> > >> ret
>> > >> g:
>> > >> mov x3, 0
>> > >> mov w4, 100
>> > >> mov z29.d, x2
>> > >> whilelo p7.d, wzr, w4
>> > >> .L6:
>> > >> ld1dz30.d, p7/z, [x0, x3, lsl 3]
>> > >> ld1dz31.d, p7/z, [x1, x3, lsl 3]
>> > >> bsl z31.d, z31.d, z30.d, z29.d
>> > >> st1dz31.d, p7, [x0, x3, lsl 3]
>> > >> incdx3
>> > >> whilelo p7.d, w3, w4
>> > >> b.any   .L6
>> > >> ret
>> > >>
>> > >> I saw that you originally tried to do this in match.pd and that the
>> > >> decision was to fold to copysign instead.  But perhaps there's a
>> > >> compromise where isel does something with the (new) copysign canonical
>> > form?
>> > >> I.e. could we go with your new version of the match.pd patch, and add
>> > >> some isel stuff as a follow-on?
>> > >>
>> > >
>> > > Sure if that's what's desired But..
>> > >
>> > > The example you posted above is for instance worse for x86
>> > > https://godbolt.org/z/x9ccqxW6T where the first operation has a
>> > > dependency chain of 2 and the latter of 3.  It's likely any open coding 
>> > > of this
>> > operation is going to hurt a target.
>> > >
>> > >

Re: [PATCH]middle-end match.pd: optimize fneg (fabs (x)) to x | (1 << signbit(x)) [PR109154]

2023-10-07 Thread Richard Sandiford
Richard Biener  writes:
> On 07.10.2023 at 11:23, Richard Sandiford wrote:
>
>> Richard Biener  writes:
>>> On Thu, 5 Oct 2023, Tamar Christina wrote:
>>> 
>>>>> I suppose the idea is that -abs(x) might be easier to optimize with other
>>>>> patterns (consider a - copysign(x,...), optimizing to a + abs(x)).
>>>>> 
>>>>> For abs vs copysign it's a canonicalization, but (negate (abs @0)) is less
>>>>> canonical than copysign.
>>>>> 
>>>>>> Should I try removing this?
>>>>> 
>>>>> I'd say yes (and put the reverse canonicalization next to this pattern).
>>>>> 
>>>> 
>>>> This patch transforms fneg (fabs (x)) into copysign (x, -1) which is more
>>>> canonical and allows a target to expand this sequence efficiently.  Such
>>>> sequences are common in scientific code working with gradients.
>>>> 
>>>> various optimizations in match.pd only happened on COPYSIGN but not 
>>>> COPYSIGN_ALL
>>>> which means they exclude IFN_COPYSIGN.  COPYSIGN however is restricted to 
>>>> only
>>> 
>>> That's not true:
>>> 
>>> (define_operator_list COPYSIGN
>>>BUILT_IN_COPYSIGNF
>>>BUILT_IN_COPYSIGN
>>>BUILT_IN_COPYSIGNL
>>>IFN_COPYSIGN)
>>> 
>>> but they miss the extended float builtin variants like
>>> __builtin_copysignf16.  Also see below
>>> 
>>>> the C99 builtins and so doesn't work for vectors.
>>>> 
>>>> The patch expands these optimizations to work on COPYSIGN_ALL.
>>>> 
>>>> There is an existing canonicalization of copysign (x, -1) to fneg (fabs 
>>>> (x))
>>>> which I remove since this is a less efficient form.  The testsuite is also
>>>> updated in light of this.
>>>> 
>>>> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>>>> 
>>>> Ok for master?
>>>> 
>>>> Thanks,
>>>> Tamar
>>>> 
>>>> gcc/ChangeLog:
>>>> 
>>>>PR tree-optimization/109154
>>>>* match.pd: Add new neg+abs rule, remove inverse copysign rule and
>>>>expand existing copysign optimizations.
>>>> 
>>>> gcc/testsuite/ChangeLog:
>>>> 
>>>>PR tree-optimization/109154
>>>>* gcc.dg/fold-copysign-1.c: Updated.
>>>>* gcc.dg/pr55152-2.c: Updated.
>>>>* gcc.dg/tree-ssa/abs-4.c: Updated.
>>>>* gcc.dg/tree-ssa/backprop-6.c: Updated.
>>>>* gcc.dg/tree-ssa/copy-sign-2.c: Updated.
>>>>* gcc.dg/tree-ssa/mult-abs-2.c: Updated.
>>>>* gcc.target/aarch64/fneg-abs_1.c: New test.
>>>>* gcc.target/aarch64/fneg-abs_2.c: New test.
>>>>* gcc.target/aarch64/fneg-abs_3.c: New test.
>>>>* gcc.target/aarch64/fneg-abs_4.c: New test.
>>>>* gcc.target/aarch64/sve/fneg-abs_1.c: New test.
>>>>* gcc.target/aarch64/sve/fneg-abs_2.c: New test.
>>>>* gcc.target/aarch64/sve/fneg-abs_3.c: New test.
>>>>* gcc.target/aarch64/sve/fneg-abs_4.c: New test.
>>>> 
>>>> --- inline copy of patch ---
>>>> 
>>>> diff --git a/gcc/match.pd b/gcc/match.pd
>>>> index 
>>>> 4bdd83e6e061b16dbdb2845b9398fcfb8a6c9739..bd6599d36021e119f51a4928354f580ffe82c6e2
>>>>  100644
>>>> --- a/gcc/match.pd
>>>> +++ b/gcc/match.pd
>>>> @@ -1074,45 +1074,43 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>>> 
>>>> /* cos(copysign(x, y)) -> cos(x).  Similarly for cosh.  */
>>>> (for coss (COS COSH)
>>>> - copysigns (COPYSIGN)
>>>> - (simplify
>>>> -  (coss (copysigns @0 @1))
>>>> -   (coss @0)))
>>>> + (for copysigns (COPYSIGN_ALL)
>>> 
>>> So this ends up generating for example the match
>>> (cosf (copysignl ...)) which doesn't make much sense.
>>> 
>>> The lock-step iteration did
>>> (cosf (copysignf ..)) ... (ifn_cos (ifn_copysign ...))
>>> which is leaner but misses the case of
>>> (cosf (ifn_copysign ..)) - that's probably what you are
>>> after with this change.
>>> 
>>> That said, there isn't a nice solution (without altering the match.pd
>>> IL).  There's the explicit solution, spelling out all combinat

Re: [PATCH 6/6] aarch64: Add front-end argument type checking for target builtins

2023-10-07 Thread Richard Sandiford
Richard Earnshaw  writes:
> On 03/10/2023 16:18, Victor Do Nascimento wrote:
>> In implementing the ACLE read/write system register builtins it was
>> observed that leaving argument type checking to be done at expand-time
>> meant that poorly-formed function calls were being "fixed" by certain
>> optimization passes, meaning bad code wasn't being properly picked up
>> in checking.
>> 
>> Example:
>> 
>>const char *regname = "amcgcr_el0";
>>long long a = __builtin_aarch64_rsr64 (regname);
>> 
>> is reduced by the ccp1 pass to
>> 
>>long long a = __builtin_aarch64_rsr64 ("amcgcr_el0");
>> 
>> As these functions require an argument of STRING_CST type, there needs
>> to be a check carried out by the front-end capable of picking this up.
>> 
>> The introduced `check_general_builtin_call' function will be called by
>> the TARGET_CHECK_BUILTIN_CALL hook whenever a call to a builtin
>> belonging to the AARCH64_BUILTIN_GENERAL category is encountered,
>> carrying out any appropriate checks associated with a particular
>> builtin function code.
>
> Doesn't this prevent reasonable wrapping of the __builtin... names with 
> something more palatable?  Eg:
>
> static inline __attribute__((always_inline)) long long get_sysreg_ll (const char *regname)
> {
>return __builtin_aarch64_rsr64 (regname);
> }
>
> ...
>long long x = get_sysreg_ll("amcgcr_el0");
> ...

I think it's a case of picking your poison.  If we didn't do this,
and only checked later, then it's unlikely that GCC and Clang would
be consistent about when a constant gets folded soon enough.

But yeah, it means that the above would need to be a macro in C.
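(E.g., keeping the hypothetical wrapper name from above, something as
simple as:

  #define get_sysreg_ll(regname) __builtin_aarch64_rsr64 (regname)

so that the string literal still appears directly in the builtin call.)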
Enlightened souls using C++ could instead do:

  template<const char *regname>
  long long get_sysreg_ll()
  {
    return __builtin_aarch64_rsr64(regname);
  }

  ... get_sysreg_ll<"amcgcr_el0">() ...

Or at least I hope so.  Might be nice to have a test for this.

Thanks,
Richard


Re: [PATCH v2][GCC] aarch64: Enable Cortex-X4 CPU

2023-10-07 Thread Richard Sandiford
Saurabh Jha  writes:
> On 10/6/2023 2:24 PM, Saurabh Jha wrote:
>> Hey,
>>
>> This patch adds support for the Cortex-X4 CPU to GCC.
>>
>> Regression testing for aarch64-none-elf target and found no regressions.
>>
>> Okay for gcc-master? I don't have commit access so if it looks okay, 
>> could someone please help me commit this?
>>
>>
>> Thanks,
>>
>> Saurabh
>>
>>
>> gcc/ChangeLog
>>
>>   * config/aarch64/aarch64-cores.def (AARCH64_CORE): Add support for 
>> cortex-x4 core.
>>   * config/aarch64/aarch64-tune.md: Regenerated.
>>   * doc/invoke.texi: Add command-line option for cortex-x4 core.
>
> Apologies, I forgot to add the patch file on my previous email.

Thanks, pushed to trunk.

Richard


Re: [PATCH] RFC: Add late-combine pass [PR106594]

2023-10-07 Thread Richard Sandiford
Robin Dapp  writes:
> Hi Richard,
>
> cool, thanks.  I just gave it a try with my test cases and it does what
> it is supposed to do, at least if I disable the register pressure check :)
> A cursory look over the test suite showed no major regressions and just
> some overly specific tests.
>
> My test case only works before split, though, as the UNSPEC predicates will
> prevent further combination afterwards.
>
> Right now the (pre-RA) code combines every instance disregarding the actual
> pressure and just checking if the "new" value does not occupy more registers
> than the old one.
>
> - Shouldn't the "pressure" also depend on the number of available hard regs
> (i.e. an nregs = 2 is not necessarily worse than nregs = 1 if we have 32
> hard regs in the new class vs 16 in the old one)?

Right, that's what I meant by extending/tweaking the pressure heuristics
for your case.

> - I assume/hope you expected my (now obsolete) fwprop change could be re-used?

Yeah, I was hoping you'd be able to apply similar heuristics to the new pass.
(I didn't find time to look at the old heuristics in detail, though, sorry.)

I suppose the point of comparison would then be "new pass with current
heuristics" vs. "new pass with relaxed heuristics".

It'd be a good/interesting test of the new heuristics to apply them
without any constraint on the complexity of the SET_SRC.

> Otherwise we wouldn't want to unconditionally "propagate" into a loop for 
> example?
> For my test case the combination of the vec_duplicate into all insns leads
> to "high" register pressure that we could avoid.
>
> How should we continue here?  I suppose you'll first want to get this version
> to the trunk before complicating it further.

Yeah, that'd probably be best.  I need to split the patch up into a
proper submission sequence, do more testing, and make it RFA quality.
Jeff has also found a couple of regressions that I need to look at.

But the substance probably won't change much, so I don't think you'd
be wasting your time if you developed the heuristics based on the
current version.  I'd be happy to review them on that basis too
(though time is short at the moment).

Thanks,
Richard


Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-08 Thread Richard Sandiford
Robin Dapp  writes:
> Hi Tamar,
>
>> The only comment I have is whether you actually need this helper
>> function? It looks like all the uses of it are in cases you have, or
>> will call conditional_internal_fn_code directly.
> removed the cond_fn_p entirely in the attached v3.
>
> Bootstrapped and regtested on x86_64, aarch64 and power10.
>
> Regards
>  Robin
>
> Subject: [PATCH v3] ifcvt/vect: Emit COND_ADD for conditional scalar
>  reduction.
>
> As described in PR111401 we currently emit a COND and a PLUS expression
> for conditional reductions.  This makes it difficult to combine both
> into a masked reduction statement later.
> This patch improves that by directly emitting a COND_ADD during ifcvt and
> adjusting some vectorizer code to handle it.
>
> It also makes neutral_op_for_reduction return -0 if HONOR_SIGNED_ZEROS
> is true.
>
> gcc/ChangeLog:
>
>   PR middle-end/111401
>   * tree-if-conv.cc (convert_scalar_cond_reduction): Emit COND_ADD
>   if supported.
>   (predicate_scalar_phi): Add whitespace.
>   * tree-vect-loop.cc (fold_left_reduction_fn): Add IFN_COND_ADD.
>   (neutral_op_for_reduction): Return -0 for PLUS.
>   (vect_is_simple_reduction): Don't count else operand in
>   COND_ADD.
>   (vect_create_epilog_for_reduction): Fix whitespace.
>   (vectorize_fold_left_reduction): Add COND_ADD handling.
>   (vectorizable_reduction): Don't count else operand in COND_ADD.
>   (vect_transform_reduction): Add COND_ADD handling.
>   * tree-vectorizer.h (neutral_op_for_reduction): Add default
>   parameter.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c: New test.
>   * gcc.target/riscv/rvv/autovec/cond/pr111401.c: New test.

The patch LGTM too FWIW, except...

> ---
>  .../vect-cond-reduc-in-order-2-signed-zero.c  | 141 
>  .../riscv/rvv/autovec/cond/pr111401.c | 139 
>  gcc/tree-if-conv.cc   |  63 ++--
>  gcc/tree-vect-loop.cc | 150 ++
>  gcc/tree-vectorizer.h |   2 +-
>  5 files changed, 451 insertions(+), 44 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/cond/pr111401.c
>
> diff --git 
> a/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c 
> b/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c
> new file mode 100644
> index 000..7b46e7d8a2a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c
> @@ -0,0 +1,141 @@
> +/* Make sure a -0 stays -0 when we perform a conditional reduction.  */
> +/* { dg-do run } */
> +/* { dg-require-effective-target vect_double } */
> +/* { dg-add-options ieee } */
> +/* { dg-additional-options "-std=gnu99 -fno-fast-math" } */
> +
> +#include "tree-vect.h"
> +
> +#include 
> +
> +#define N (VECTOR_BITS * 17)
> +
> +double __attribute__ ((noinline, noclone))
> +reduc_plus_double (double *restrict a, double init, int *cond, int n)
> +{
> +  double res = init;
> +  for (int i = 0; i < n; i++)
> +if (cond[i])
> +  res += a[i];
> +  return res;
> +}
> +
> +double __attribute__ ((noinline, noclone, optimize ("0")))
> +reduc_plus_double_ref (double *restrict a, double init, int *cond, int n)
> +{
> +  double res = init;
> +  for (int i = 0; i < n; i++)
> +if (cond[i])
> +  res += a[i];
> +  return res;
> +}
> +
> +double __attribute__ ((noinline, noclone))
> +reduc_minus_double (double *restrict a, double init, int *cond, int n)
> +{
> +  double res = init;
> +  for (int i = 0; i < n; i++)
> +if (cond[i])
> +  res -= a[i];
> +  return res;
> +}
> +
> +double __attribute__ ((noinline, noclone, optimize ("0")))
> +reduc_minus_double_ref (double *restrict a, double init, int *cond, int n)
> +{
> +  double res = init;
> +  for (int i = 0; i < n; i++)
> +if (cond[i])
> +  res -= a[i];
> +  return res;
> +}
> +
> +int __attribute__ ((optimize (1)))
> +main ()
> +{
> +  int n = 19;
> +  double a[N];
> +  int cond1[N], cond2[N];
> +
> +  for (int i = 0; i < N; i++)
> +{
> +  a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
> +  cond1[i] = 0;
> +  cond2[i] = i & 4 ? 1 : 0;
> +  asm volatile ("" ::: "memory");
> +}
> +
> +  double res1 = reduc_plus_double (a, -0.0, cond1, n);
> +  double ref1 = reduc_plus_double_ref (a, -0.0, cond1, n);
> +  double res2 = reduc_minus_double (a, -0.0, cond1, n);
> +  double ref2 = reduc_minus_double_ref (a, -0.0, cond1, n);
> +  double res3 = reduc_plus_double (a, -0.0, cond1, n);
> +  double ref3 = reduc_plus_double_ref (a, -0.0, cond1, n);
> +  double res4 = reduc_minus_double (a, -0.0, cond1, n);
> +  double ref4 = reduc_minus_double_ref (a, -0.0, cond1, n);
> +
> +  if (res1 != ref1 || signbit (res1) != signbit (ref1))
> +__builtin_abort ();
> +  if (res2 != ref2 || sign

Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-09 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Saturday, October 7, 2023 10:58 AM
>> To: Richard Biener 
>> Cc: Tamar Christina ; gcc-patches@gcc.gnu.org;
>> nd ; Richard Earnshaw ;
>> Marcus Shawcroft ; Kyrylo Tkachov
>> 
>> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> 
>> Richard Biener  writes:
>> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
>>  wrote:
>> >>
>> >> > -Original Message-
>> >> > From: Richard Sandiford 
>> >> > Sent: Thursday, October 5, 2023 9:26 PM
>> >> > To: Tamar Christina 
>> >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> >> > ; Marcus Shawcroft
>> >> > ; Kyrylo Tkachov
>> 
>> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> cond_copysign.
>> >> >
>> >> > Tamar Christina  writes:
>> >> > >> -Original Message-
>> >> > >> From: Richard Sandiford 
>> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
>> >> > >> To: Tamar Christina 
>> >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> >> > >> ; Marcus Shawcroft
>> >> > >> ; Kyrylo Tkachov
>> >> > 
>> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> cond_copysign.
>> >> > >>
>> >> > >> Tamar Christina  writes:
>> >> > >> > Hi All,
>> >> > >> >
>> >> > >> > This adds an implementation for masked copysign along with an
>> >> > >> > optimized pattern for masked copysign (x, -1).
>> >> > >>
>> >> > >> It feels like we're ending up with a lot of AArch64-specific
>> >> > >> code that just hard- codes the observation that changing the
>> >> > >> sign is equivalent to changing the top bit.  We then need to
>> >> > >> make sure that we choose the best way of changing the top bit for any
>> given situation.
>> >> > >>
>> >> > >> Hard-coding the -1/negative case is one instance of that.  But
>> >> > >> it looks like we also fail to use the best sequence for SVE2.  E.g.
>> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
>> >> > >>
>> >> > >> #include <stdint.h>
>> >> > >>
>> >> > >> void f(double *restrict a, double *restrict b) {
>> >> > >> for (int i = 0; i < 100; ++i)
>> >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
>> >> > >>
>> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
>> >> > >> for (int i = 0; i < 100; ++i)
>> >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
>> >> > >>
>> >> > >> gives:
>> >> > >>
>> >> > >> f:
>> >> > >> mov x2, 0
>> >> > >> mov w3, 100
>> >> > >> whilelo p7.d, wzr, w3
>> >> > >> .L2:
>> >> > >> ld1d z30.d, p7/z, [x0, x2, lsl 3]
>> >> > >> ld1d z31.d, p7/z, [x1, x2, lsl 3]
>> >> > >> and z30.d, z30.d, #0x7fffffffffffffff
>> >> > >> and z31.d, z31.d, #0x8000000000000000
>> >> > >> orr z31.d, z31.d, z30.d
>> >> > >> st1d z31.d, p7, [x0, x2, lsl 3]
>> >> > >> incd x2
>> >> > >> whilelo p7.d, w2, w3
>> >> > >> b.any   .L2
>> >> > >> ret
>> >> > >> g:
>> >> > >> mov x3, 0
>> >> > >> mov w4, 100
>> >> > >> mov z29.d, x2
>> >> > >> whilelo p7.d, wzr, w4
>> >> > >> .L6:
>> >> > >> ld1d z30.d, p7/z, [x0, x3, lsl 3]
>> >> > >> ld1d z31.d, p7/z, [x1, x3, lsl 3]
>> >> > >> bsl z31.d, z31.d, z30.d, z29.d
>> >> > >> st1d z31.d, p7, [x0, x3, lsl 3]
>> >> >

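To make the point above concrete (changing the sign of a double is just forcing or clearing its top bit), here is a small standalone C sketch, independent of the patch, that models the and/and/orr sequence from the f: loop on scalars.  It only assumes IEEE binary64 with the sign in bit 63.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Scalar model of the vector sequence: clear the sign bit of A, isolate
   the sign bit of B, then OR the two results together.  */
static double
copysign_bits (double a, double b)
{
  uint64_t ua, ub;
  memcpy (&ua, &a, sizeof ua);
  memcpy (&ub, &b, sizeof ub);
  ua &= 0x7fffffffffffffffULL;   /* like: and z30.d, z30.d, #0x7ff...f  */
  ub &= 0x8000000000000000ULL;   /* like: and z31.d, z31.d, #0x800...0  */
  ua |= ub;                      /* like: orr z31.d, z31.d, z30.d  */
  memcpy (&a, &ua, sizeof a);
  return a;
}

int
main (void)
{
  printf ("%f\n", copysign_bits (3.5, -1.0));   /* -3.500000 */
  printf ("%f\n", copysign_bits (-2.0, 1.0));   /* 2.000000 */
  return 0;
}
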
Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-09 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Monday, October 9, 2023 10:56 AM
>> To: Tamar Christina 
>> Cc: Richard Biener ; gcc-patches@gcc.gnu.org;
>> nd ; Richard Earnshaw ;
>> Marcus Shawcroft ; Kyrylo Tkachov
>> 
>> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> 
>> Tamar Christina  writes:
>> >> -Original Message-
>> >> From: Richard Sandiford 
>> >> Sent: Saturday, October 7, 2023 10:58 AM
>> >> To: Richard Biener 
>> >> Cc: Tamar Christina ;
>> >> gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> >> ; Marcus Shawcroft
>> >> ; Kyrylo Tkachov
>> 
>> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> >>
>> >> Richard Biener  writes:
>> >> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
>> >>  wrote:
>> >> >>
>> >> >> > -Original Message-
>> >> >> > From: Richard Sandiford 
>> >> >> > Sent: Thursday, October 5, 2023 9:26 PM
>> >> >> > To: Tamar Christina 
>> >> >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> >> >> > ; Marcus Shawcroft
>> >> >> > ; Kyrylo Tkachov
>> >> 
>> >> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> >> cond_copysign.
>> >> >> >
>> >> >> > Tamar Christina  writes:
>> >> >> > >> -Original Message-
>> >> >> > >> From: Richard Sandiford 
>> >> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
>> >> >> > >> To: Tamar Christina 
>> >> >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard
>> >> >> > >> Earnshaw ; Marcus Shawcroft
>> >> >> > >> ; Kyrylo Tkachov
>> >> >> > 
>> >> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> >> cond_copysign.
>> >> >> > >>
>> >> >> > >> Tamar Christina  writes:
>> >> >> > >> > Hi All,
>> >> >> > >> >
>> >> >> > >> > This adds an implementation for masked copysign along with
>> >> >> > >> > an optimized pattern for masked copysign (x, -1).
>> >> >> > >>
>> >> >> > >> It feels like we're ending up with a lot of AArch64-specific
>> >> >> > >> code that just hard- codes the observation that changing the
>> >> >> > >> sign is equivalent to changing the top bit.  We then need to
>> >> >> > >> make sure that we choose the best way of changing the top bit
>> >> >> > >> for any
>> >> given situation.
>> >> >> > >>
>> >> >> > >> Hard-coding the -1/negative case is one instance of that.
>> >> >> > >> But it looks like we also fail to use the best sequence for SVE2. 
>> >> >> > >>  E.g.
>> >> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
>> >> >> > >>
>> >> >> > >> #include <stdint.h>
>> >> >> > >>
>> >> >> > >> void f(double *restrict a, double *restrict b) {
>> >> >> > >> for (int i = 0; i < 100; ++i)
>> >> >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
>> >> >> > >>
>> >> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
>> >> >> > >> for (int i = 0; i < 100; ++i)
>> >> >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
>> >> >> > >>
>> >> >> > >> gives:
>> >> >> > >>
>> >> >> > >> f:
>> >> >> > >> mov x2, 0
>> >> >> > >> mov w3, 100
>> >> >> > >> whilelo p7.d, wzr, w3
>> >> >> > >> .L2:
>> >> >> > >> ld1d z30.d, p7/z, [x0, x2, lsl 3]
>> >> >> > >> ld1d z31.d, p7/z, [x1, 

Re: PR111648: Fix wrong code-gen due to incorrect VEC_PERM_EXPR folding

2023-10-09 Thread Richard Sandiford
Prathamesh Kulkarni  writes:
> Hi,
> The attached patch attempts to fix PR111648.
> As mentioned in PR, the issue is when a1 is a multiple of vector
> length, we end up creating following encoding in result: { base_elem,
> arg[0], arg[1], ... } (assuming S = 1),
> where arg is chosen input vector, which is incorrect, since the
> encoding originally in arg would be: { arg[0], arg[1], arg[2], ... }
>
> For the test-case mentioned in PR, vectorizer pass creates
> VEC_PERM_EXPR where:
> arg0: { -16, -9, -10, -11 }
> arg1: { -12, -5, -6, -7 }
> sel = { 3, 4, 5, 6 }
>
> arg0, arg1 and sel are encoded with npatterns = 1 and nelts_per_pattern = 3.
> Since a1 = 4 and arg_len = 4, it ended up creating the result with
> following encoding:
> res = { arg0[3], arg1[0], arg1[1] } // npatterns = 1, nelts_per_pattern = 3
>   = { -11, -12, -5 }
>
> So for res[3], it used S = (-5) - (-12) = 7
> And hence computed it as -5 + 7 = 2.
> instead of selecting arg1[2], ie, -6.
>
> The patch tweaks valid_mask_for_fold_vec_perm_cst_p to punt if a1 is a 
> multiple
> of vector length, so a1 ... ae select elements only from stepped part
> of the pattern
> from input vector and return false for this case.
>
> Since the vectors are VLS, fold_vec_perm_cst then sets:
> res_npatterns = res_nelts
> res_nelts_per_pattern  = 1
> which seems to fix the issue by encoding all the elements.
>
> The patch resulted in Case 4 and Case 5 failing from test_nunits_min_2 because
> they used sel = { 0, 0, 1, ... } and {len, 0, 1, ... } respectively,
> which used a1 = 0, and thus selected arg1[0].
>
> I removed Case 4 because it was already covered in test_nunits_min_4,
> and moved Case 5 to test_nunits_min_4, with sel = { len, 1, 2, ... }
> and added a new Case 9 to test for this issue.
>
> Passes bootstrap+test on aarch64-linux-gnu with and without SVE,
> and on x86_64-linux-gnu.
> Does the patch look OK ?
>
> Thanks,
> Prathamesh
>
> [PR111648] Fix wrong code-gen due to incorrect VEC_PERM_EXPR folding.
>
> gcc/ChangeLog:
>   PR tree-optimization/111648
>   * fold-const.cc (valid_mask_for_fold_vec_perm_cst_p): Punt if a1
>   is a multiple of vector length.
>   (test_nunits_min_2): Remove Case 4 and move Case 5 to ...
>   (test_nunits_min_4): ... here and rename case numbers. Also add
>   Case 9.
>
> gcc/testsuite/ChangeLog:
>   PR tree-optimization/111648
>   * gcc.dg/vect/pr111648.c: New test.
>
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 4f8561509ff..c5f421d6b76 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -10682,8 +10682,8 @@ valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree 
> arg1,
> return false;
>   }
>  
> -  /* Ensure that the stepped sequence always selects from the same
> -  input pattern.  */
> +  /* Ensure that the stepped sequence always selects from the stepped
> +  part of same input pattern.  */
>unsigned arg_npatterns
>   = ((q1 & 1) == 0) ? VECTOR_CST_NPATTERNS (arg0)
> : VECTOR_CST_NPATTERNS (arg1);
> @@ -10694,6 +10694,20 @@ valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree 
> arg1,
>   *reason = "step is not multiple of npatterns";
> return false;
>   }
> +
> +  /* If a1 is a multiple of len, it will select base element of input
> +  vector resulting in following encoding:
> +  { base_elem, arg[0], arg[1], ... } where arg is the chosen input
> +  vector. This encoding is not originally present in arg, since it's
> +  defined as:
> +  { arg[0], arg[1], arg[2], ... }.  */
> +
> +  if (multiple_p (a1, arg_len))
> + {
> +   if (reason)
> + *reason = "selecting base element of input vector";
> +   return false;
> + }

That wouldn't catch (for example) cases where a1 == arg_len + 1 and the
second argument has 2 stepped patterns.

The equivalent condition that handles multiple patterns would
probably be to reject q1 < arg_npatterns.  But that's only necessary if:

(1) the argument has three elements per pattern (i.e. has a stepped
sequence) and

(2) element 2 - element 1 != element 1 - element 0

I think we should check those to avoid pessimising VLA cases.

Thanks,
Richard

>  }
>  
>return true;
> @@ -17425,47 +17439,6 @@ test_nunits_min_2 (machine_mode vmode)
>   tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
>   validate_res (2, 2, res, expected_res);
>}
> -
> -  /* Case 4: mask = {0, 0, 1, ...} // (1, 3)
> -  Test that the stepped sequence of the pattern selects from
> -  same input pattern. Since input vectors have npatterns = 2,
> -  and step (a2 - a1) = 1, step is not a multiple of npatterns
> -  in input vector. So return NULL_TREE.  */
> -  {
> - tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> - tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> - poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> -
> - vec_perm_builder buil

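For readers unfamiliar with the encoding arithmetic discussed in this thread, the following standalone C sketch (not the fold-const.cc code; the names and the int element type are only illustrative) models how an element beyond the encoded prefix is extrapolated from the step of its pattern.  It reproduces the wrong value 2 versus the wanted arg1[2] = -6 from the PR.

#include <stdio.h>

/* Element I of a vector constant encoded with NPATTERNS patterns and
   NELTS_PER_PATTERN encoded elements per pattern.  Pattern P = I % NPATTERNS;
   its encoded elements sit at ENC[P + K * NPATTERNS].  Elements beyond the
   encoded prefix repeat the step between the last two encoded elements of
   their pattern.  */
static int
encoded_elt (const int *enc, int npatterns, int nelts_per_pattern, int i)
{
  int p = i % npatterns;
  int k = i / npatterns;
  if (k < nelts_per_pattern)
    return enc[p + k * npatterns];
  int last = enc[p + (nelts_per_pattern - 1) * npatterns];
  int prev = enc[p + (nelts_per_pattern - 2) * npatterns];
  int step = last - prev;
  return last + (k - (nelts_per_pattern - 1)) * step;
}

int
main (void)
{
  /* The folded encoding from the PR: { arg0[3], arg1[0], arg1[1] }.  */
  int bad[] = { -11, -12, -5 };
  /* arg1 itself: { -12, -5, -6, ... }, npatterns = 1, 3 elts per pattern.  */
  int arg1[] = { -12, -5, -6 };

  printf ("res[3] from the bad encoding: %d\n", encoded_elt (bad, 1, 3, 3));
  printf ("element actually wanted (arg1[2]): %d\n",
	  encoded_elt (arg1, 1, 3, 2));
  return 0;
}
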
Re: [PATCH] wide-int: Allow up to 16320 bits wide_int and change widest_int precision to 32640 bits [PR102989]

2023-10-09 Thread Richard Sandiford
Jakub Jelinek  writes:
> Hi!
>
> As mentioned in the _BitInt support thread, _BitInt(N) is currently limited
> by the wide_int/widest_int maximum precision limitation, which is depending
> on target 191, 319, 575 or 703 bits (one less than WIDE_INT_MAX_PRECISION).
> That is fairly low limit for _BitInt, especially on the targets with the 191
> bit limitation.
>
> The following patch bumps that limit to 16319 bits on all arches, which is
> the limit imposed by INTEGER_CST representation (unsigned char members
> holding number of HOST_WIDE_INT limbs).
>
> In order to achieve that, wide_int is changed from a trivially copyable type
> which contained just an inline array of WIDE_INT_MAX_ELTS (3, 5, 9 or
> 11 limbs depending on target) limbs into a non-trivially copy constructible,
> copy assignable and destructible type which for the usual small cases (up
> to WIDE_INT_MAX_INL_ELTS which is the former WIDE_INT_MAX_ELTS) still uses
> an inline array of limbs, but for larger precisions uses heap allocated
> limb array.  This makes wide_int unusable in GC structures, so for dwarf2out
> which was the only place which needed it there is a new rwide_int type
> (restricted wide_int) which supports only up to RWIDE_INT_MAX_ELTS limbs
> inline and is trivially copyable (dwarf2out should never deal with large
> _BitInt constants, those should have been lowered earlier).
>
> Similarly, widest_int has been changed from a trivially copyable type which
> contained also an inline array of WIDE_INT_MAX_ELTS limbs (but unlike
> wide_int didn't contain precision and assumed that to be
> WIDE_INT_MAX_PRECISION) into a non-trivially copy constructible, copy
> assignable and destructible type which has always WIDEST_INT_MAX_PRECISION
> precision (32640 bits currently, twice as much as INTEGER_CST limitation
> allows) and unlike wide_int decides depending on get_len () value whether
> it uses an inline array (again, up to WIDE_INT_MAX_INL_ELTS) or heap
> allocated one.  In wide-int.h this means we need to estimate an upper
> bound on how many limbs will wide-int.cc (usually, sometimes wide-int.h)
> need to write, heap allocate if needed based on that estimation and upon
> set_len which is done at the end if we guessed over WIDE_INT_MAX_INL_ELTS
> and allocated dynamically, while we actually need less than that
> copy/deallocate.  The unexact guesses are needed because the exact
> computation of the length in wide-int.cc is sometimes quite complex and
> especially canonicalize at the end can decrease it.  widest_int is again
> because of this not usable in GC structures, so cfgloop.h has been changed
> to use fixed_wide_int_storage  and punt if
> we'd have larger _BitInt based iterators, programs having more than 128-bit
> iterators will be hopefully rare and I think it is fine to treat loops with
> more than 2^127 iterations as effectively possibly infinite, omp-general.cc
> is changed to use fixed_wide_int_storage <1024>, as it better should support
> scores with the same precision on all arches.
>
> Code which used WIDE_INT_PRINT_BUFFER_SIZE sized buffers for printing
> wide_int/widest_int into buffer had to be changed to use XALLOCAVEC for
> larger lengths.
>
> On x86_64, the patch in --enable-checking=yes,rtl,extra configured
> bootstrapped cc1plus enlarges the .text section by 1.01% - from
> 0x25725a5 to 0x25e and similarly at least when compiling insn-recog.cc
> with the usual bootstrap option slows compilation down by 1.01%,
> user 4m22.046s and 4m22.384s on vanilla trunk vs.
> 4m25.947s and 4m25.581s on patched trunk.  I'm afraid some code size growth
> and compile time slowdown is unavoidable in this case, we use wide_int and
> widest_int everywhere, and while the rare cases are marked with UNLIKELY
> macros, it still means extra checks for it.

Yeah, it's unfortunate, but like you say, it's probably unavoidable.
Having effectively arbitrary-size integers breaks most of the simplifying
assumptions.

> The patch also regresses
> +FAIL: gm2/pim/fail/largeconst.mod,  -O  
> +FAIL: gm2/pim/fail/largeconst.mod,  -O -g  
> +FAIL: gm2/pim/fail/largeconst.mod,  -O3 -fomit-frame-pointer  
> +FAIL: gm2/pim/fail/largeconst.mod,  -O3 -fomit-frame-pointer 
> -finline-functions  
> +FAIL: gm2/pim/fail/largeconst.mod,  -Os  
> +FAIL: gm2/pim/fail/largeconst.mod,  -g  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O -g  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O3 -fomit-frame-pointer  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O3 -fomit-frame-pointer 
> -finline-functions  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -Os  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -g  
> tests, which previously were rejected with
> error: constant literal 
> ‘12345678912345678912345679123456789123456789123456789123456789123456791234567891234567891234567891234567891234567912345678912345678912345678912345678912345679123456789123456789’
>  exceeds internal ZTYPE range
> kind of errors, but now are accepted.  Seems the F

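The storage arrangement being described here (an inline limb buffer for the common small sizes, a heap allocation only beyond WIDE_INT_MAX_INL_ELTS) is the familiar small-buffer pattern.  A minimal standalone C sketch of just that mechanism follows; the names are invented, and none of the wide-int semantics, copy management or the set_len shrink-back step are modelled.

#include <stdlib.h>

#define MAX_INL_LIMBS 9   /* stand-in for WIDE_INT_MAX_INL_ELTS */

struct limb_storage
{
  unsigned int len;              /* limbs currently in use */
  long long *heap;               /* non-null only for large values */
  long long inl[MAX_INL_LIMBS];  /* inline buffer for the common case */
};

/* Return a buffer able to hold LEN limbs, heap-allocating only when the
   inline buffer is too small.  Allocation failure handling is omitted.  */
static long long *
write_val (struct limb_storage *x, unsigned int len)
{
  free (x->heap);
  x->heap = len > MAX_INL_LIMBS ? malloc (len * sizeof (long long)) : NULL;
  x->len = len;
  return x->heap ? x->heap : x->inl;
}

static const long long *
read_val (const struct limb_storage *x)
{
  return x->heap ? x->heap : x->inl;
}

static void
release (struct limb_storage *x)
{
  free (x->heap);
  x->heap = NULL;
  x->len = 0;
}

int
main (void)
{
  struct limb_storage x = { 0, 0, { 0 } };
  long long *p = write_val (&x, 12);   /* 12 > 9, so heap-allocated */
  if (p)
    p[0] = 1;
  const long long *q = read_val (&x);
  (void) q;
  release (&x);
  return 0;
}
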
Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-09 Thread Richard Sandiford
Robin Dapp  writes:
>> It'd be good to expand on this comment a bit.  What kind of COND are you
>> anticipating?  A COND with the neutral op as the else value, so that the
>> PLUS_EXPR (or whatever) can remain unconditional?  If so, it would be
>> good to sketch briefly how that happens, and why it's better than using
>> the conditional PLUS_EXPR.
>> 
>> If that's the reason, perhaps we want a single-use check as well.
>> It's possible that OP1 is used elsewhere in the loop body, in a
>> context that would prefer a different else value.
>
> Would something like the following on top work?
>
> -  /* If possible try to create an IFN_COND_ADD instead of a COND_EXPR and
> - a PLUS_EXPR.  Don't do this if the reduction def operand itself is
> +  /* If possible create a COND_OP instead of a COND_EXPR and an OP_EXPR.
> + The COND_OP will have a neutral_op else value.
> +
> + This allows re-using the mask directly in a masked reduction instead
> + of creating a vector merge (or similar) and then an unmasked reduction.
> +
> + Don't do this if the reduction def operand itself is
>   a vectorizable call as we can create a COND version of it directly.  */

It wasn't very clear, sorry, but it was the last sentence I was asking
for clarification on, not the other bits.  Why do we want to avoid
generating a COND_ADD when the operand is a vectorisable call?

Thanks,
Richard

>
>if (ifn != IFN_LAST
>&& vectorized_internal_fn_supported_p (ifn, TREE_TYPE (lhs))
> -  && try_cond_op && !swap)
> +  && use_cond_op && !swap && has_single_use (op1))
>
> Regards
>  Robin


Re: [PATCH] wide-int: Allow up to 16320 bits wide_int and change widest_int precision to 32640 bits [PR102989]

2023-10-10 Thread Richard Sandiford
Jakub Jelinek  writes:
> On Mon, Oct 09, 2023 at 03:44:10PM +0200, Jakub Jelinek wrote:
>> Thanks, just quick answers, will work on patch adjustments after trying to
>> get rid of rwide_int (seems dwarf2out has very limited needs from it, just
>> some routine to construct it in GCed memory (and never change afterwards)
>> from const wide_int_ref & or so, and then working operator ==,
>> get_precision, elt, get_len and get_val methods, so I think we could just
>> have a struct dw_wide_int { unsigned int prec, len; HOST_WIDE_INT val[1]; };
>> and perform the methods on it after converting to a storage ref.
>
> Now in patch form (again, incremental).
>
>> > Does the variable-length memcpy pay for itself?  If so, perhaps that's a
>> > sign that we should have a smaller inline buffer for this class (say 2 
>> > HWIs).
>> 
>> Guess I'll try to see what results in smaller .text size.
>
> I've left the memcpy changes into a separate patch (incremental, attached).
> Seems that second patch results in .text growth by 16256 bytes (0.04%),
> though I'd bet it probably makes compile time tiny bit faster because it
> replaces an out of line memcpy (caused by variable length) with inlined one.
>
> With even the third one it shrinks by 84544 bytes (0.21% down), but the
> extra statistics patch then shows massive number of allocations after
> running make check-gcc check-g++ check-gfortran for just a minute or two.
> On the widest_int side, I see (first number from sort | uniq -c | sort -nr,
> second the estimated or final len)
> 7289034 4
>  173586 5
>   21819 6
> i.e. there are tons of widest_ints which need len 4 (or perhaps just
> have it as upper estimation), maybe even 5 would be nice.

Thanks for running the stats.  That's definitely a lot more than I expected.

> On the wide_int side, I see
>  155291 576
> (supposedly because of bound_wide_int, where we create wide_int_ref from
> the 576-bit precision bound_wide_int and then create 576-bit wide_int when
> using unary or binary operation on that).

576 bits seems quite a lot for a loop bound.  We're entering near-infinite
territory with 128 bits. :)  But I don't want to rehash old discussions.
If we take this size for wide_int as given, then...

> So, perhaps we could get away with say WIDEST_INT_MAX_INL_ELTS of 5 or 6
> instead of 9 but keep WIDE_INT_MAX_INL_ELTS at 9 (or whatever is computed
> from MAX_BITSIZE_MODE_ANY_INT?).  Or keep it at 9 for both (i.e. without
> the third patch).

...I agree we might as well keep the widest_int size the same for
simplicity.  It'd only be worth distinguishing them if we have positive
proof that it's worthwhile.

So going with patches 1 + 2 sounds good to me, but I don't have a strong
preference.

On the wide-int.cc changes:

> @@ -1469,6 +1452,36 @@ wi::mul_internal (HOST_WIDE_INT *val, co
>return 1;
>  }
> 
> +  /* The sizes here are scaled to support a 2x WIDE_INT_MAX_INL_PRECISION by 
> 2x
> + WIDE_INT_MAX_INL_PRECISION yielding a 4x WIDE_INT_MAX_INL_PRECISION
> + result.  */
> +
> +  unsigned HOST_HALF_WIDE_INT
> +ubuf[4 * WIDE_INT_MAX_INL_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
> +  unsigned HOST_HALF_WIDE_INT
> +vbuf[4 * WIDE_INT_MAX_INL_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
> +  /* The '2' in 'R' is because we are internally doing a full
> + multiply.  */
> +  unsigned HOST_HALF_WIDE_INT
> +rbuf[2 * 4 * WIDE_INT_MAX_INL_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
> +  const HOST_WIDE_INT mask = ((HOST_WIDE_INT)1 << 
> HOST_BITS_PER_HALF_WIDE_INT) - 1;
> +  unsigned HOST_HALF_WIDE_INT *u = ubuf;
> +  unsigned HOST_HALF_WIDE_INT *v = vbuf;
> +  unsigned HOST_HALF_WIDE_INT *r = rbuf;
> +
> +  if (prec > WIDE_INT_MAX_INL_PRECISION && !high)
> +prec = (op1len + op2len + 1) * HOST_BITS_PER_WIDE_INT;

Changing the precision looked a bit dangerous at first, but I agree it
looks correct in context, in that nothing later on seems to require prec
to be the real precision of the number.  But I wonder whether we should
instead do:

  if (!high)
prec = MIN ((op1len + op2len + 1) * HOST_BITS_PER_WIDE_INT, prec);

so that the assumption gets a bit more testing.  Same idea for the others.
I.e. in any case where we think it's safe to reduce a precision or
length for out-of-line buffers, I think we should try to do the same
for inline ones.

I'm not sure off-hand why + 1 is safe here but + 2 is needed for the
write_val estimate.  Might be worth a comment in one place or the other.

> +  unsigned int blocks_needed = BLOCKS_NEEDED (prec);
> +  unsigned int half_blocks_needed = blocks_needed * 2;
> +  if (UNLIKELY (prec > WIDE_INT_MAX_INL_PRECISION))
> +{
> +  unsigned HOST_HALF_WIDE_INT *buf
> + = XALLOCAVEC (unsigned HOST_HALF_WIDE_INT, 4 * 4 * blocks_needed);
> +  u = buf;
> +  v = u + 4 * blocks_needed;
> +  r = v + 4 * blocks_needed;
> +}
> +
>/* We do unsigned mul and then correct it.  */
>wi_unpack (u, op1val, op1len, half_blocks_needed, prec, SIGNED)

Re: [PATCH 02/11] Handle epilogues that contain jumps

2023-10-12 Thread Richard Sandiford
Richard Biener  writes:
> On Tue, Aug 22, 2023 at 12:42 PM Szabolcs Nagy via Gcc-patches
>  wrote:
>>
>> From: Richard Sandiford 
>>
>> The prologue/epilogue pass allows the prologue sequence
>> to contain jumps.  The sequence is then partitioned into
>> basic blocks using find_many_sub_basic_blocks.
>>
>> This patch treats epilogues in the same way.  It's needed for
>> a follow-on aarch64 patch that adds conditional code to both
>> the prologue and the epilogue.
>>
>> Tested on aarch64-linux-gnu (including with a follow-on patch)
>> and x86_64-linux-gnu.  OK to install?
>>
>> Richard
>>
>> gcc/
>> * function.cc (thread_prologue_and_epilogue_insns): Handle
>> epilogues that contain jumps.
>> ---
>>
>> This is a previously approved patch that was not committed
>> because it was not needed at the time, but i'd like to commit
>> it as it is needed for the followup aarch64 eh_return changes:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-November/605769.html
>>
>> ---
>>  gcc/function.cc | 10 ++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/gcc/function.cc b/gcc/function.cc
>> index dd2c1136e07..70d1cd65303 100644
>> --- a/gcc/function.cc
>> +++ b/gcc/function.cc
>> @@ -6120,6 +6120,11 @@ thread_prologue_and_epilogue_insns (void)
>>   && returnjump_p (BB_END (e->src)))
>> e->flags &= ~EDGE_FALLTHRU;
>> }
>> +
>> + auto_sbitmap blocks (last_basic_block_for_fn (cfun));
>> + bitmap_clear (blocks);
>> +   bitmap_set_bit (blocks, BLOCK_FOR_INSN (epilogue_seq)->index);
>> + find_many_sub_basic_blocks (blocks);
>> }
>>else if (next_active_insn (BB_END (exit_fallthru_edge->src)))
>> {
>> @@ -6218,6 +6223,11 @@ thread_prologue_and_epilogue_insns (void)
>>   set_insn_locations (seq, epilogue_location);
>>
>>   emit_insn_before (seq, insn);
>> +
>> + auto_sbitmap blocks (last_basic_block_for_fn (cfun));
>> + bitmap_clear (blocks);
>> + bitmap_set_bit (blocks, BLOCK_FOR_INSN (insn)->index);
>> + find_many_sub_basic_blocks (blocks);
>
> I'll note that clearing a full sbitmap to pass down a single basic block
> to find_many_sub_basic_blocks is a quite expensive operation.  May I suggest
> to add an overload operating on a single basic block?  It's only
>
>   FOR_EACH_BB_FN (bb, cfun)
> SET_STATE (bb,
>bitmap_bit_p (blocks, bb->index) ? BLOCK_TO_SPLIT :
> BLOCK_ORIGINAL);
>
> using the bitmap, so factoring the rest of the function and customizing this
> walk would do the trick.  Note that the whole function could be refactored to
> handle single blocks more efficiently.

Sorry for the late reply, but does this look OK?  Tested on
aarch64-linux-gnu and x86_64-linux-gnu.

Thanks,
Richard

---

The prologue/epilogue pass allows the prologue sequence to contain
jumps.  The sequence is then partitioned into basic blocks using
find_many_sub_basic_blocks.

This patch treats epilogues in a similar way.  Since only one block
might need to be split, the patch (re)introduces a find_sub_basic_blocks
routine to handle a single block.

The new routine hard-codes the assumption that split_block will chain
the new block immediately after the original block.  The routine doesn't
try to replicate the fix for PR81030, since that was specific to
gimple->rtl expansion.

The patch is needed for follow-on aarch64 patches that add conditional
code to the epilogue.  The tests are part of those patches.

gcc/
* cfgbuild.h (find_sub_basic_blocks): Declare.
* cfgbuild.cc (update_profile_for_new_sub_basic_block): New function,
split out from...
(find_many_sub_basic_blocks): ...here.
(find_sub_basic_blocks): New function.
* function.cc (thread_prologue_and_epilogue_insns): Handle
epilogues that contain jumps.
---
 gcc/cfgbuild.cc | 95 +
 gcc/cfgbuild.h  |  1 +
 gcc/function.cc |  4 +++
 3 files changed, 70 insertions(+), 30 deletions(-)

diff --git a/gcc/cfgbuild.cc b/gcc/cfgbuild.cc
index 15ed4deb5f7..9a6b34fb4b1 100644
--- a/gcc/cfgbuild.cc
+++ b/gcc/cfgbuild.cc
@@ -693,6 +693,43 @@ compute_outgoing_frequencies (basic_block b)
 }
 }
 
+/* Update the profile information for BB, which was created by splitting
+   an RTL block that had a non-final jump.  */
+
+static void
+update_profile_for_new_sub_basic_block (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  bool initialized_src = false, uninitialized_sr

Re: [PATCH 6/6] aarch64: Add front-end argument type checking for target builtins

2023-10-12 Thread Richard Sandiford
"Richard Earnshaw (lists)"  writes:
> On 09/10/2023 14:12, Victor Do Nascimento wrote:
>> 
>> 
>> On 10/7/23 12:53, Richard Sandiford wrote:
>>> Richard Earnshaw  writes:
>>>> On 03/10/2023 16:18, Victor Do Nascimento wrote:
>>>>> In implementing the ACLE read/write system register builtins it was
>>>>> observed that leaving argument type checking to be done at expand-time
>>>>> meant that poorly-formed function calls were being "fixed" by certain
>>>>> optimization passes, meaning bad code wasn't being properly picked up
>>>>> in checking.
>>>>>
>>>>> Example:
>>>>>
>>>>>     const char *regname = "amcgcr_el0";
>>>>>     long long a = __builtin_aarch64_rsr64 (regname);
>>>>>
>>>>> is reduced by the ccp1 pass to
>>>>>
>>>>>     long long a = __builtin_aarch64_rsr64 ("amcgcr_el0");
>>>>>
>>>>> As these functions require an argument of STRING_CST type, there needs
>>>>> to be a check carried out by the front-end capable of picking this up.
>>>>>
>>>>> The introduced `check_general_builtin_call' function will be called by
>>>>> the TARGET_CHECK_BUILTIN_CALL hook whenever a call to a builtin
>>>>> belonging to the AARCH64_BUILTIN_GENERAL category is encountered,
>>>>> carrying out any appropriate checks associated with a particular
>>>>> builtin function code.
>>>>
>>>> Doesn't this prevent reasonable wrapping of the __builtin... names with
>>>> something more palatable?  Eg:
>>>>
>>>> static inline __attribute__((always_inline)) long long get_sysreg_ll
>>>> (const char *regname)
>>>> {
>>>>     return __builtin_aarch64_rsr64 (regname);
>>>> }
>>>>
>>>> ...
>>>>     long long x = get_sysreg_ll("amcgcr_el0");
>>>> ...
>>>
>>> I think it's case of picking your poison.  If we didn't do this,
>>> and only checked later, then it's unlikely that GCC and Clang would
>>> be consistent about when a constant gets folded soon enough.
>>>
>>> But yeah, it means that the above would need to be a macro in C.
>>> Enlightened souls using C++ could instead do:
>>>
>>>    template
>>>    long long get_sysreg_ll()
>>>    {
>>>  return __builtin_aarch64_rsr64(regname);
>>>    }
>>>
>>>    ... get_sysreg_ll<"amcgcr_el0">() ...
>>>
>>> Or at least I hope so.  Might be nice to have a test for this.
>>>
>>> Thanks,
>>> Richard
>> 
>> As Richard Earnshaw mentioned, this does break the use of `static inline 
>> __attribute__((always_inline))', something I had found out in my testing.
>> My chosen implementation was indeed, to quote Richard Sandiford, a case of 
>> "picking your poison" to have things line up with Clang and behaving 
>> consistently across optimization levels.
>> 
>> Relaxing the the use of `TARGET_CHECK_BUILTIN_CALL' meant optimizations were 
>> letting too many things through. Example:
>> 
>> const char *regname = "amcgcr_el0";
>> long long a = __builtin_aarch64_rsr64 (regname);
>> 
>> gets folded to
>> 
>> long long a = __builtin_aarch64_rsr64 ("amcgcr_el0");
>> 
>> and compilation passes at -O1 even though it fails at -O0.
>> 
>> I had, however, not given any thought to the use of a template as a valid 
>> C++ alternative.
>> 
>> I will evaluate the use of templates and add tests accordingly.
>
> This just seems inconsistent with all the builtins we already have that 
> require literal constants for parameters.  For example (to pick just one of 
> many), vshr_n_q8(), where the second parameter must be a literal value.  In 
> practice we accept anything that resolves to a compile-time constant integer 
> expression and rely on that to avoid having to have hundreds of macros 
> binding the ACLE names to the underlying builtin equivalents.

That's true for the way that GCC handles things like Advanced SIMD.
But Clang behaves differently.  So does GCC's SVE ACLE implementation.
Both of those (try to) follow the language rules about what is a constant
expression.

Thanks,
Richard

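For completeness, the C-only wrapping that remains possible under this checking scheme is a function-like macro rather than an always_inline function, since the string literal then reaches the builtin call unmodified.  A hypothetical sketch (the wrapper name is invented; the builtin is the one discussed above):

/* The macro keeps the register name a literal at the point of the
   builtin call, so the front-end check still sees a STRING_CST.  */
#define GET_SYSREG_LL(REGNAME) __builtin_aarch64_rsr64 (REGNAME)

/* ... */
long long x = GET_SYSREG_LL ("amcgcr_el0");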

Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-12 Thread Richard Sandiford
Robin Dapp  writes:
>> It wasn't very clear, sorry, but it was the last sentence I was asking
>> for clarification on, not the other bits.  Why do we want to avoid
>> generating a COND_ADD when the operand is a vectorisable call?
>
> Ah, I see, apologies.  Upon thinking about it a bit more (thanks)
> I figured this hunk is not necessary.  I added it early in the process
> in order to keep the current behavior for situations like the following:
>
>  before:
>  _1 = .FMA (...)
>  _2 = COND (cond, .FMA, 0.0)
>  _3 = COND_ADD (true, result, _2, result)
>
>  This we would simplify to:
>  _2 = COND_FMA (cond, ...)
>  _3 = COND_ADD (true, result, _2, result)
>
>  with the patch we have:
>  _1 = .FMA (...)
>  _2 = .COND_ADD (cond, arg1, _1, arg1)
>
> Due to differences in expansion we'd end up with a masked
> vfmacc ("a += a + b * c") before and now emit an unmasked
> vfmadd ("a += a * b + c") and a masked result add.  This shouldn't
> be worse from a vector spec point of view, so I just changed the
> test expectation for now.

Thanks, sounds good.

> The attached v4 also includes Richi's suggestion for the HONOR...
> stuff.
>
> Bootstrap and regtest unchanged on aarch64, x86 and power10.

I'm reluctant to comment on the signed zeros/MINUS_EXPR parts,
but FWIW, the rest looks good to me.

Thanks,
Richard

>
> Regards
>  Robin
>
>
> From 1752507ce22c22b50b96f889dc0a9c2fc8e50859 Mon Sep 17 00:00:00 2001
> From: Robin Dapp 
> Date: Wed, 13 Sep 2023 22:19:35 +0200
> Subject: [PATCH v4] ifcvt/vect: Emit COND_ADD for conditional scalar
>  reduction.
>
> As described in PR111401 we currently emit a COND and a PLUS expression
> for conditional reductions.  This makes it difficult to combine both
> into a masked reduction statement later.
> This patch improves that by directly emitting a COND_ADD during ifcvt and
> adjusting some vectorizer code to handle it.
>
> It also makes neutral_op_for_reduction return -0 if HONOR_SIGNED_ZEROS
> is true.
>
> gcc/ChangeLog:
>
>   PR middle-end/111401
>   * tree-if-conv.cc (convert_scalar_cond_reduction): Emit COND_ADD
>   if supported.
>   (predicate_scalar_phi): Add whitespace.
>   * tree-vect-loop.cc (fold_left_reduction_fn): Add IFN_COND_ADD.
>   (neutral_op_for_reduction): Return -0 for PLUS.
>   (vect_is_simple_reduction): Don't count else operand in
>   COND_ADD.
>   (vect_create_epilog_for_reduction): Fix whitespace.
>   (vectorize_fold_left_reduction): Add COND_ADD handling.
>   (vectorizable_reduction): Don't count else operand in COND_ADD.
>   (vect_transform_reduction): Add COND_ADD handling.
>   * tree-vectorizer.h (neutral_op_for_reduction): Add default
>   parameter.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c: New test.
>   * gcc.target/riscv/rvv/autovec/cond/pr111401.c: New test.
>   * gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/reduc/reduc_call-4.c: Ditto.
> ---
>  .../vect-cond-reduc-in-order-2-signed-zero.c  | 141 
>  .../riscv/rvv/autovec/cond/pr111401.c | 139 
>  .../riscv/rvv/autovec/reduc/reduc_call-2.c|   4 +-
>  .../riscv/rvv/autovec/reduc/reduc_call-4.c|   4 +-
>  gcc/tree-if-conv.cc   |  49 --
>  gcc/tree-vect-loop.cc | 156 ++
>  gcc/tree-vectorizer.h |   2 +-
>  7 files changed, 446 insertions(+), 49 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/cond/pr111401.c
>
> diff --git 
> a/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c 
> b/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c
> new file mode 100644
> index 000..7b46e7d8a2a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c
> @@ -0,0 +1,141 @@
> +/* Make sure a -0 stays -0 when we perform a conditional reduction.  */
> +/* { dg-do run } */
> +/* { dg-require-effective-target vect_double } */
> +/* { dg-add-options ieee } */
> +/* { dg-additional-options "-std=gnu99 -fno-fast-math" } */
> +
> +#include "tree-vect.h"
> +
> +#include <math.h>
> +
> +#define N (VECTOR_BITS * 17)
> +
> +double __attribute__ ((noinline, noclone))
> +reduc_plus_double (double *restrict a, double init, int *cond, int n)
> +{
> +  double res = init;
> +  for (int i = 0; i < n; i++)
> +if (cond[i])
> +  res += a[i];
> +  return res;
> +}
> +
> +double __attribute__ ((noinline, noclone, optimize ("0")))
> +reduc_plus_double_ref (double *restrict a, double init, int *cond, int n)
> +{
> +  double res = init;
> +  for (int i = 0; i < n; i++)
> +if (cond[i])
> +  res += a[i];
> +  return res;
> +}
> +
> +double __attribute__ ((noinline, noclone))
> +reduc_minus_double (double *restrict a, double init, int *cond, int n)

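The reason the test above insists that a -0 stays -0, and the reason neutral_op_for_reduction has to hand back -0.0 rather than +0.0 for PLUS when signed zeros are honoured, can be checked with a tiny standalone program (compile without -ffast-math); this is only an illustration, not part of the patch.

#include <stdio.h>

int
main (void)
{
  double minus_zero = -0.0;

  /* +0.0 is not neutral: it turns a -0.0 accumulator into +0.0.  */
  printf ("%+f\n", minus_zero + 0.0);    /* +0.000000 */

  /* -0.0 is neutral for every value, including both zeros.  */
  printf ("%+f\n", minus_zero + -0.0);   /* -0.000000 */
  printf ("%+f\n", 0.0 + -0.0);          /* +0.000000 */
  return 0;
}
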
Re: [PATCH] wide-int: Allow up to 16320 bits wide_int and change widest_int precision to 32640 bits [PR102989]

2023-10-12 Thread Richard Sandiford
Jakub Jelinek  writes:
> @@ -2036,11 +2075,20 @@ wi::lrshift_large (HOST_WIDE_INT *val, c
>  unsigned int xlen, unsigned int xprecision,
>  unsigned int precision, unsigned int shift)
>  {
> -  unsigned int len = rshift_large_common (val, xval, xlen, xprecision, 
> shift);
> +  /* Work out how many blocks are needed to store the significant bits
> + (excluding the upper zeros or signs).  */
> +  unsigned int blocks_needed = BLOCKS_NEEDED (xprecision - shift);
> +  unsigned int len = blocks_needed;
> +  if (UNLIKELY (len > WIDE_INT_MAX_INL_ELTS)
> +  && len > xlen
> +  && xval[xlen - 1] >= 0)
> +len = xlen;

I think here too it would be worth dropping the:

  UNLIKELY (len > WIDE_INT_MAX_INL_ELTS)

part of the condition, since presumably the change should be safe
regardless of that.

> +
> +  rshift_large_common (val, xval, xlen, shift, len);
>  
>/* The value we just created has precision XPRECISION - SHIFT.
>   Zero-extend it to wider precisions.  */
> -  if (precision > xprecision - shift)
> +  if (precision > xprecision - shift && len == blocks_needed)
>  {
>unsigned int small_prec = (xprecision - shift) % 
> HOST_BITS_PER_WIDE_INT;
>if (small_prec)
> @@ -2063,11 +2111,18 @@ wi::arshift_large (HOST_WIDE_INT *val, c
>  unsigned int xlen, unsigned int xprecision,
>  unsigned int precision, unsigned int shift)
>  {
> -  unsigned int len = rshift_large_common (val, xval, xlen, xprecision, 
> shift);
> +  /* Work out how many blocks are needed to store the significant bits
> + (excluding the upper zeros or signs).  */
> +  unsigned int blocks_needed = BLOCKS_NEEDED (xprecision - shift);
> +  unsigned int len = blocks_needed;
> +  if (UNLIKELY (len > WIDE_INT_MAX_INL_ELTS) && len > xlen)
> +len = xlen;
> +

Same here.

OK for thw wide-int parts with those changes.

Thanks,
Richard


Re: [PATCH-1v2, expand] Enable vector mode for compare_by_pieces [PR111449]

2023-10-12 Thread Richard Sandiford
HAO CHEN GUI  writes:
> Hi,
>   Vector mode instructions are efficient on some targets (e.g. ppc64).
> This patch enables vector mode for compare_by_pieces. The non-member
> function widest_fixed_size_mode_for_size takes by_pieces_operation
> as the second argument and decide whether vector mode is enabled or
> not by the type of operations. Currently only set and compare enabled
> vector mode and do the optab checking correspondingly.
>
>   The test case is in the second patch which is rs6000 specific.
>
>   Compared to last version, the main change is to enable vector mode
> for compare_by_pieces in smallest_fixed_size_mode_for_size which
> is used for overlapping compare.
>
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions.
>
> Thanks
> Gui Haochen
>
> ChangeLog
> Expand: Enable vector mode for pieces compares
>
> Vector mode compare instructions are efficient for equality compares on
> rs6000. This patch refactors the by-pieces code to enable vector mode
> for compare.
>
> gcc/
>   PR target/111449
>   * expr.cc (widest_fixed_size_mode_for_size): Enable vector mode
>   for compare.  Replace the second argument with the type of pieces
>   operation.  Add optab checks for vector mode used in compare.
>   (by_pieces_ninsns): Pass the type of pieces operation to
>   widest_fixed_size_mode_for_size.
>   (class op_by_pieces_d): Define virtual function
>   widest_fixed_size_mode_for_size and optab_checking.
>   (op_by_pieces_d::op_by_pieces_d): Call outer function
>   widest_fixed_size_mode_for_size.
>   (op_by_pieces_d::get_usable_mode): Call class function
>   widest_fixed_size_mode_for_size.
>   (op_by_pieces_d::smallest_fixed_size_mode_for_size): Call
>   optab_checking for different types of operations.
>   (op_by_pieces_d::run): Call class function
>   widest_fixed_size_mode_for_size.
>   (class move_by_pieces_d): Declare function
>   widest_fixed_size_mode_for_size.
>   (move_by_pieces_d::widest_fixed_size_mode_for_size): Implement.
>   (class store_by_pieces_d): Declare function
>   widest_fixed_size_mode_for_size and optab_checking.
>   (store_by_pieces_d::optab_checking): Implement.
>   (store_by_pieces_d::widest_fixed_size_mode_for_size): Implement.
>   (can_store_by_pieces): Pass the type of pieces operation to
>   widest_fixed_size_mode_for_size.
>   (class compare_by_pieces_d): Declare function
>   widest_fixed_size_mode_for_size and optab_checking.
>   (compare_by_pieces_d::compare_by_pieces_d): Set m_qi_vector_mode
>   to true to enable vector mode.
>   (compare_by_pieces_d::widest_fixed_size_mode_for_size): Implement.
>   (compare_by_pieces_d::optab_checking): Implement.
>
> patch.diff
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 9a37bff1fdd..e83c0a378ed 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -992,8 +992,9 @@ alignment_for_piecewise_move (unsigned int max_pieces, 
> unsigned int align)
> that is narrower than SIZE bytes.  */
>
>  static fixed_size_mode
> -widest_fixed_size_mode_for_size (unsigned int size, bool qi_vector)
> +widest_fixed_size_mode_for_size (unsigned int size, by_pieces_operation op)

The comment above the function needs to be updated to describe the
new parameter.

>  {
> +  bool qi_vector = ((op == COMPARE_BY_PIECES) || op == SET_BY_PIECES);

Nit: redundant brackets around the first comparison.

>fixed_size_mode result = NARROWEST_INT_MODE;
>
>gcc_checking_assert (size > 1);
> @@ -1009,8 +1010,13 @@ widest_fixed_size_mode_for_size (unsigned int size, 
> bool qi_vector)
> {
>   if (GET_MODE_SIZE (candidate) >= size)
> break;
> - if (optab_handler (vec_duplicate_optab, candidate)
> - != CODE_FOR_nothing)
> + if ((op == SET_BY_PIECES
> +  && optab_handler (vec_duplicate_optab, candidate)
> +!= CODE_FOR_nothing)
> +  || (op == COMPARE_BY_PIECES
> +  && optab_handler (mov_optab, mode)
> + != CODE_FOR_nothing
> +  && can_compare_p (EQ, mode, ccp_jump)))
> result = candidate;
> }
>
> @@ -1061,8 +1067,7 @@ by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned 
> int align,
>  {
>/* NB: Round up L and ALIGN to the widest integer mode for
>MAX_SIZE.  */
> -  mode = widest_fixed_size_mode_for_size (max_size,
> -   op == SET_BY_PIECES);
> +  mode = widest_fixed_size_mode_for_size (max_size, op);
>if (optab_handler (mov_optab, mode) != CODE_FOR_nothing)
>   {
> unsigned HOST_WIDE_INT up = ROUND_UP (l, GET_MODE_SIZE (mode));
> @@ -1076,8 +1081,7 @@ by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned 
> int align,
>
>while (max_size > 1 && l > 0)
>  {
> -  mode = widest_fixed_size_mode_for_size (max_size,
> -  

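For context on what the by-pieces machinery expands here: an equality compare by pieces is essentially a chunked memcmp, using the widest mode that is cheap on the target, and with this patch that widest mode may be a vector mode.  A rough scalar model in C (ignoring alignment, the overlapping-tail trick and the ninsns costing):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return nonzero iff A and B are byte-wise equal over LEN bytes,
   comparing in 8-byte chunks first, then byte by byte.  A target with
   cheap vector compares would simply use wider chunks here.  */
static int
equal_by_pieces (const void *a, const void *b, size_t len)
{
  const unsigned char *pa = a, *pb = b;
  while (len >= sizeof (uint64_t))
    {
      uint64_t xa, xb;
      memcpy (&xa, pa, sizeof xa);
      memcpy (&xb, pb, sizeof xb);
      if (xa != xb)
	return 0;
      pa += sizeof xa;
      pb += sizeof xb;
      len -= sizeof xa;
    }
  while (len--)
    if (*pa++ != *pb++)
      return 0;
  return 1;
}

int
main (void)
{
  char x[32] = "hello, by-pieces world!";
  char y[32] = "hello, by-pieces world!";
  return equal_by_pieces (x, y, sizeof x) ? 0 : 1;
}
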
Re: [PATCH] wide-int: Allow up to 16320 bits wide_int and change widest_int precision to 32640 bits [PR102989]

2023-10-12 Thread Richard Sandiford
Jakub Jelinek  writes:
> On Thu, Oct 12, 2023 at 11:54:14AM +0100, Richard Sandiford wrote:
>> Jakub Jelinek  writes:
>> > @@ -2036,11 +2075,20 @@ wi::lrshift_large (HOST_WIDE_INT *val, c
>> >   unsigned int xlen, unsigned int xprecision,
>> >   unsigned int precision, unsigned int shift)
>> >  {
>> > -  unsigned int len = rshift_large_common (val, xval, xlen, xprecision, 
>> > shift);
>> > +  /* Work out how many blocks are needed to store the significant bits
>> > + (excluding the upper zeros or signs).  */
>> > +  unsigned int blocks_needed = BLOCKS_NEEDED (xprecision - shift);
>> > +  unsigned int len = blocks_needed;
>> > +  if (UNLIKELY (len > WIDE_INT_MAX_INL_ELTS)
>> > +  && len > xlen
>> > +  && xval[xlen - 1] >= 0)
>> > +len = xlen;
>> 
>> I think here too it would be worth dropping the:
>> 
>>   UNLIKELY (len > WIDE_INT_MAX_INL_ELTS)
>> 
>> part of the condition, since presumably the change should be safe
>> regardless of that.
>
> If so, there is also one spot in lshift_large as well.  So incrementally:
>
> --- gcc/wide-int.cc   2023-10-11 14:41:23.719132402 +0200
> +++ gcc/wide-int.cc   2023-10-11 14:41:23.719132402 +0200
> @@ -2013,8 +2013,7 @@
>  
>/* The whole-block shift fills with zeros.  */
>unsigned int len = BLOCKS_NEEDED (precision);
> -  if (UNLIKELY (len > WIDE_INT_MAX_INL_ELTS))
> -len = xlen + skip + 1;
> +  len = MIN (xlen + skip + 1, len);
>for (unsigned int i = 0; i < skip; ++i)
>  val[i] = 0;
>  
> @@ -2079,9 +2078,7 @@
>   (excluding the upper zeros or signs).  */
>unsigned int blocks_needed = BLOCKS_NEEDED (xprecision - shift);
>unsigned int len = blocks_needed;
> -  if (UNLIKELY (len > WIDE_INT_MAX_INL_ELTS)
> -  && len > xlen
> -  && xval[xlen - 1] >= 0)
> +  if (len > xlen && xval[xlen - 1] >= 0)
>  len = xlen;
>  
>rshift_large_common (val, xval, xlen, shift, len);
> @@ -2114,9 +2111,7 @@
>/* Work out how many blocks are needed to store the significant bits
>   (excluding the upper zeros or signs).  */
>unsigned int blocks_needed = BLOCKS_NEEDED (xprecision - shift);
> -  unsigned int len = blocks_needed;
> -  if (UNLIKELY (len > WIDE_INT_MAX_INL_ELTS) && len > xlen)
> -len = xlen;
> +  unsigned int len = MIN (xlen, blocks_needed);
>  
>rshift_large_common (val, xval, xlen, shift, len);
>  
> which I'll test soon.

LGTM.

>> OK for thw wide-int parts with those changes.
>
> Thanks.  What do you think about that
> --- gcc/wide-int.h.jj 2023-10-11 12:05:47.718059477 +0200
> +++ gcc/wide-int.h2023-10-11 13:51:56.081552500 +0200
> @@ -1635,6 +1635,8 @@ widest_int_storage ::write_val (unsig
>u.valp = XNEWVEC (HOST_WIDE_INT, l);
>return u.valp;
>  }
> +  else if (CHECKING_P && l < WIDE_INT_MAX_INL_ELTS)
> +u.val[l] = HOST_WIDE_INT_UC (0xbaaddeadbeef);
>return u.val;
>  }
>  
> @@ -1650,6 +1652,9 @@ widest_int_storage ::set_len (unsigne
>memcpy (u.val, valp, l * sizeof (u.val[0]));
>XDELETEVEC (valp);
>  }
> +  else if (len && len < WIDE_INT_MAX_INL_ELTS)
> +gcc_checking_assert ((unsigned HOST_WIDE_INT) u.val[len]
> +  == HOST_WIDE_INT_UC (0xbaaddeadbeef));
>len = l;
>/* There are no excess bits in val[len - 1].  */
>STATIC_ASSERT (N % HOST_BITS_PER_WIDE_INT == 0);
>
> part, shall that go into trunk as well or is that too much slowdown
> for checking builds?

I don't have a good intuition about how big the slowdown will be,
but FWIW I agree with Richi that it'd be better to include the change.
We can always take it out again if it proves to be unexpectedly expensive.

Thanks,
Richard


Re: [PATCH V2] Emit funcall external declarations only if actually used.

2023-10-12 Thread Richard Sandiford
"Jose E. Marchesi"  writes:
> Hi Richard.
> Thanks for looking at this! :)
>
>
>> "Jose E. Marchesi"  writes:
>>> ping
>>
>> I don't know this code very well, and have AFAIR haven't worked
>> with an assembler that requires external declarations, but since
>> it's at a second ping :)
>>
>>>
 ping

> [Differences from V1:
> - Prototype for call_from_call_insn moved before comment block.
> - Reuse the `call' flag for SYMBOL_REF_LIBCALL.
> - Fallback to check REG_CALL_DECL in non-direct calls.
> - New test to check correct behavior for non-direct calls.]
>
> There are many places in GCC where alternative local sequences are
> tried in order to determine what is the cheapest or best alternative
> to use in the current target.  When any of these sequences involve a
> libcall, the current implementation of emit_library_call_value_1
> introduce a side-effect consisting on emitting an external declaration
> for the funcall (such as __divdi3) which is thus emitted even if the
> sequence that does the libcall is not retained.
>
> This is problematic in targets such as BPF, because the kernel loader
> chokes on the spurious symbol __divdi3 and makes the resulting BPF
> object unloadable.  Note that BPF objects are not linked before being
> loaded.
>
> This patch changes emit_library_call_value_1 to mark the target
> SYMBOL_REF as a libcall.  Then, the emission of the external
> declaration is done in the first loop of final.cc:shorten_branches.
> This happens only if the corresponding sequence has been kept.
>
> Regtested in x86_64-linux-gnu.
> Tested with host x86_64-linux-gnu with target bpf-unknown-none.
>>
>> I'm not sure that shorten_branches is a natural place to do this.
>> It isn't something that would normally emit asm text.
>
> Well, that was the approach suggested by another reviewer (Jakub) once
> my initial approach (in the V1) got rejected.  He explicitly suggested
> to use shorten_branches.
>
>> Would it be OK to emit the declaration at the same point as for decls,
>> which IIUC is process_pending_assemble_externals?  If so, how about
>> making assemble_external_libcall add the symbol to a list when
>> !SYMBOL_REF_USED, instead of calling targetm.asm_out.external_libcall
>> directly?  assemble_external_libcall could then also call get_identifier
>> on the name (perhaps after calling strip_name_encoding -- can't
>> remember whether assemble_external_libcall sees the encoded or
>> unencoded name).
>>
>> All being well, the call to get_identifier should cause
>> assemble_name_resolve to record when the name is used, via
>> TREE_SYMBOL_REFERENCED.  Then process_pending_assemble_externals could
>> go through the list of libcalls recorded by assemble_external_libcall
>> and check whether TREE_SYMBOL_REFERENCED is set on the get_identifier.
>>
>> Not super elegant, but it seems to fit within the existing scheme.
>> And I don't there should be any problem with using get_identifier
>> for libcalls, since it isn't valid to use libcall names for other
>> types of symbol.
>
> This sounds way more complicated to me than the approach in V2, which
> seems to work and is thus a clear improvement compared to the current
> situation in the trunk.  The approach in V2 may be ugly, but it is
> simple and easy to understand.  Is the proposed more convoluted
> alternative really worth the extra complexity, given it is "not super
> elegant"?

Is it really that much more convoluted?  I was thinking of something
like the attached, which seems a bit shorter than V2, and does seem
to fix the bpf tests.

I think most (all?) libcalls already have an associated decl due to
optabs-libfuncs.cc, so an alternative to get_identifier would be to
set the SYMBOL_REF_DECL.  Using get_identiifer seems a bit more
lightweight though.

Richard


diff --git a/gcc/varasm.cc b/gcc/varasm.cc
index b0eff17b8b5..073e3eb2579 100644
--- a/gcc/varasm.cc
+++ b/gcc/varasm.cc
@@ -2461,6 +2461,10 @@ contains_pointers_p (tree type)
it all the way to final.  See PR 17982 for further discussion.  */
 static GTY(()) tree pending_assemble_externals;
 
+/* A similar list of pending libcall symbols.  We only want to declare
+   symbols that are actually used in the final assembly.  */
+static GTY(()) rtx pending_libcall_symbols;
+
 #ifdef ASM_OUTPUT_EXTERNAL
 /* Some targets delay some output to final using TARGET_ASM_FILE_END.
As a result, assemble_external can be called after the list of externals
@@ -2516,12 +2520,20 @@ void
 process_pending_assemble_externals (void)
 {
 #ifdef ASM_OUTPUT_EXTERNAL
-  tree list;
-  for (list = pending_assemble_externals; list; list = TREE_CHAIN (list))
+  for (tree list = pending_assemble_externals; list; list = TREE_CHAIN (list))
 assemble_external_real (TREE_VALUE (list));
 
+  for (rtx list = pending_libcall_symbols; list; list = XEXP (list, 1))
+{
+  rtx symbol = XEXP (list, 0);
+  tree id = 

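Stripped of the GC and varasm details, the mechanism sketched in the patch above is: record a libcall symbol when a candidate sequence first references it, note which names actually survive into the final output, and only then emit the external declarations.  A standalone model of that shape follows; the names are invented and the printf stands in for targetm.asm_out.external_libcall.

#include <stdio.h>
#include <string.h>

#define MAX_PENDING 64

/* Libcall names recorded while expanding candidate sequences, and
   whether the final output actually referenced them.  */
static const char *pending[MAX_PENDING];
static int used[MAX_PENDING];
static int npending;

static void
record_libcall (const char *name)
{
  for (int i = 0; i < npending; i++)
    if (strcmp (pending[i], name) == 0)
      return;
  if (npending < MAX_PENDING)
    {
      pending[npending] = name;
      used[npending++] = 0;
    }
}

static void
mark_used (const char *name)
{
  for (int i = 0; i < npending; i++)
    if (strcmp (pending[i], name) == 0)
      used[i] = 1;
}

/* Run once at the end: declare only the names that were kept.  */
static void
flush_pending (void)
{
  for (int i = 0; i < npending; i++)
    if (used[i])
      printf ("\t.extern\t%s\n", pending[i]);
}

int
main (void)
{
  record_libcall ("__divdi3");   /* recorded while trying a sequence */
  record_libcall ("__moddi3");
  mark_used ("__moddi3");        /* only this sequence was kept */
  flush_pending ();              /* declares __moddi3, not __divdi3 */
  return 0;
}
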
Re: [PATCH] gimple-match: Do not try UNCOND optimization with COND_LEN.

2023-10-12 Thread Richard Sandiford
Robin Dapp via Gcc-patches  writes:
> Hi,
>
> as Juzhe noticed in gcc.dg/pr92301.c there was still something missing in
> the last patch.  The attached v2 makes sure we always have a COND_LEN 
> operation
> before returning true and initializes len and bias even if they are unused.
>
> Bootstrapped and regtested on aarch64 and x86.

Sorry for the slow review.  I was hoping Richi would take it,
but I see he was hoping the same from me.

> Regards
>  Robin
>
> Subject: [PATCH v2] gimple-match: Do not try UNCOND optimization with
>  COND_LEN.
>
> On riscv we mis-optimize conditional (length) operations into
> unconditional operations e.g. in slp-reduc-7.c and
> gcc.dg/pr92301.c.
>
> This patch prevents optimizing e.g.
>  COND_LEN_ADD ({-1, ... }, a, 0, c, len, bias)
> unconditionally into just "a".
>
> Currently, we assume that COND_LEN operations can be optimized similarly
> to COND operations.  As the length is part of the mask (and usually not
> compile-time constant), we must not perform any optimization that relies
> on just the mask being "true".  This patch ensures that we still have a
> COND_LEN pattern after optimization.
>
> gcc/ChangeLog:
>
>   PR target/111311
>   * gimple-match-exports.cc (maybe_resimplify_conditional_op):
>   Check for length masking.
>   (try_conditional_simplification): Check that the result is still
>   length masked.
> ---
>  gcc/gimple-match-exports.cc | 38 ++---
>  gcc/gimple-match.h  |  3 ++-
>  2 files changed, 33 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/gimple-match-exports.cc b/gcc/gimple-match-exports.cc
> index b36027b0bad..d41de98a3d3 100644
> --- a/gcc/gimple-match-exports.cc
> +++ b/gcc/gimple-match-exports.cc
> @@ -262,7 +262,8 @@ maybe_resimplify_conditional_op (gimple_seq *seq, 
> gimple_match_op *res_op,
>if (!res_op->cond.cond)
>  return false;
>  
> -  if (!res_op->cond.else_value
> +  if (!res_op->cond.len
> +  && !res_op->cond.else_value
>&& res_op->code.is_tree_code ())
>  {
>/* The "else" value doesn't matter.  If the "then" value is a

Why are the contents of this if statement wrong for COND_LEN?
If the "else" value doesn't matter, then the masked form can use
the "then" value for all elements.  I would have expected the same
thing to be true of COND_LEN.

> @@ -301,9 +302,12 @@ maybe_resimplify_conditional_op (gimple_seq *seq, 
> gimple_match_op *res_op,
>  
>/* If the "then" value is a gimple value and the "else" value matters,
>   create a VEC_COND_EXPR between them, then see if it can be further
> - simplified.  */
> + simplified.
> + Don't do this if we have a COND_LEN_ as that would make us lose the
> + length masking.  */
>gimple_match_op new_op;
> -  if (res_op->cond.else_value
> +  if (!res_op->cond.len
> +  && res_op->cond.else_value
>&& VECTOR_TYPE_P (res_op->type)
>&& gimple_simplified_result_is_gimple_val (res_op))
>  {

The change LGTM, but it would be nice to phrase the comment to avoid
the "Do A.  Don't do A if B" pattern.  Maybe:

  /* If the condition represents MASK ? THEN : ELSE, where THEN is a gimple
 value and ELSE matters, create a VEC_COND_EXPR between them, then see
 if it can be further simplified.  */

> @@ -314,7 +318,7 @@ maybe_resimplify_conditional_op (gimple_seq *seq, 
> gimple_match_op *res_op,
>return gimple_resimplify3 (seq, res_op, valueize);
>  }
>  
> -  /* Otherwise try rewriting the operation as an IFN_COND_* call.
> +  /* Otherwise try rewriting the operation as an IFN_COND_(LEN_)* call.
>   Again, this isn't a simplification in itself, since it's what
>   RES_OP already described.  */
>if (convert_conditional_op (res_op, &new_op))
> @@ -386,9 +390,29 @@ try_conditional_simplification (internal_fn ifn, 
> gimple_match_op *res_op,
>  default:
>gcc_unreachable ();
>  }
> -  *res_op = cond_op;
> -  maybe_resimplify_conditional_op (seq, res_op, valueize);
> -  return true;
> +
> +  if (len)
> +{
> +  /* If we had a COND_LEN before we need to ensure that it stays that
> +  way.  */
> +  gimple_match_op old_op = *res_op;
> +  *res_op = cond_op;
> +  maybe_resimplify_conditional_op (seq, res_op, valueize);
> +
> +  auto cfn = combined_fn (res_op->code);
> +  if (internal_fn_p (cfn)
> +   && internal_fn_len_index (as_internal_fn (cfn)) != -1)
> + return true;

Why isn't it enough to check the result of maybe_resimplify_conditional_op?

Thanks,
Richard

> +
> +  *res_op = old_op;
> +  return false;
> +}
> +  else
> +{
> +  *res_op = cond_op;
> +  maybe_resimplify_conditional_op (seq, res_op, valueize);
> +  return true;
> +}
>  }
>  
>  /* Helper for the autogenerated code, valueize OP.  */
> diff --git a/gcc/gimple-match.h b/gcc/gimple-match.h
> index bec3ff42e3e..d192b7dae3e 100644
> --- a/gcc/gimple-match.h
> +++ b/gcc/gimple-match.h
> @@ -56,7 +56,

Re: [PATCH V2] Emit funcall external declarations only if actually used.

2023-10-12 Thread Richard Sandiford
"Jose E. Marchesi"  writes:
>> "Jose E. Marchesi"  writes:
>>> Hi Richard.
>>> Thanks for looking at this! :)
>>>
>>>
 "Jose E. Marchesi"  writes:
> ping

 I don't know this code very well, and AFAIR haven't worked
 with an assembler that requires external declarations, but since
 it's at a second ping :)

>
>> ping
>>
>>> [Differences from V1:
>>> - Prototype for call_from_call_insn moved before comment block.
>>> - Reuse the `call' flag for SYMBOL_REF_LIBCALL.
>>> - Fallback to check REG_CALL_DECL in non-direct calls.
>>> - New test to check correct behavior for non-direct calls.]
>>>
>>> There are many places in GCC where alternative local sequences are
>>> tried in order to determine what is the cheapest or best alternative
>>> to use in the current target.  When any of these sequences involve a
>>> libcall, the current implementation of emit_library_call_value_1
>>> introduces a side-effect consisting of emitting an external declaration
>>> for the funcall (such as __divdi3), which is thus emitted even if the
>>> sequence that does the libcall is not retained.
>>>
>>> This is problematic in targets such as BPF, because the kernel loader
>>> chokes on the spurious symbol __divdi3 and makes the resulting BPF
>>> object unloadable.  Note that BPF objects are not linked before being
>>> loaded.
>>>
>>> This patch changes emit_library_call_value_1 to mark the target
>>> SYMBOL_REF as a libcall.  Then, the emission of the external
>>> declaration is done in the first loop of final.cc:shorten_branches.
>>> This happens only if the corresponding sequence has been kept.
>>>
>>> Regtested in x86_64-linux-gnu.
>>> Tested with host x86_64-linux-gnu with target bpf-unknown-none.

 I'm not sure that shorten_branches is a natural place to do this.
 It isn't something that would normally emit asm text.
>>>
>>> Well, that was the approach suggested by another reviewer (Jakub) once
>>> my initial approach (in the V1) got rejected.  He explicitly suggested
>>> to use shorten_branches.
>>>
 Would it be OK to emit the declaration at the same point as for decls,
 which IIUC is process_pending_assemble_externals?  If so, how about
 making assemble_external_libcall add the symbol to a list when
 !SYMBOL_REF_USED, instead of calling targetm.asm_out.external_libcall
 directly?  assemble_external_libcall could then also call get_identifier
 on the name (perhaps after calling strip_name_encoding -- can't
 remember whether assemble_external_libcall sees the encoded or
 unencoded name).

 All being well, the call to get_identifier should cause
 assemble_name_resolve to record when the name is used, via
 TREE_SYMBOL_REFERENCED.  Then process_pending_assemble_externals could
 go through the list of libcalls recorded by assemble_external_libcall
 and check whether TREE_SYMBOL_REFERENCED is set on the get_identifier.

 Not super elegant, but it seems to fit within the existing scheme.
 And I don't think there should be any problem with using get_identifier
 for libcalls, since it isn't valid to use libcall names for other
 types of symbol.
>>>
>>> This sounds way more complicated to me than the approach in V2, which
>>> seems to work and is thus a clear improvement compared to the current
>>> situation in the trunk.  The approach in V2 may be ugly, but it is
>>> simple and easy to understand.  Is the proposed more convoluted
>>> alternative really worth the extra complexity, given it is "not super
>>> elegant"?
>>
>> Is it really that much more convoluted?  I was thinking of something
>> like the attached, which seems a bit shorter than V2, and does seem
>> to fix the bpf tests.
>
> o_O
> Ok, I clearly misunderstood what you were proposing.  This is way simpler!
>
> How does the magic of TREE_SYMBOL_REFERENCED work?  How is it set to
> `true' only if the RTL containing the call is retained in the final
> chain?

It happens in assemble_name, via assemble_name_resolve.  The system
relies on code using that rather than assemble_name_raw for symbols
that might need to be declared, or that might need visibility
information attached.  (It relies on that in general, I mean,
not just for this patch.)
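
As a rough illustration of that scheme (only a sketch, not the attached
patch; the name pending_libcalls is made up here):

  static vec<rtx> pending_libcalls;

  void
  assemble_external_libcall (rtx fun)
  {
    if (!SYMBOL_REF_USED (fun))
      {
	/* get_identifier (possibly after strip_name_encoding, as noted
	   above) gives assemble_name_resolve somewhere to record, via
	   TREE_SYMBOL_REFERENCED, whether the name is ever emitted.  */
	get_identifier (XSTR (fun, 0));
	pending_libcalls.safe_push (fun);
      }
  }

  /* ...and process_pending_assemble_externals would then only declare
     the libcalls whose names were actually referenced:  */
  for (rtx fun : pending_libcalls)
    if (TREE_SYMBOL_REFERENCED (get_identifier (XSTR (fun, 0))))
      targetm.asm_out.external_libcall (fun);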

Thanks,
Richard



Re: [PATCH] gimple-match: Do not try UNCOND optimization with COND_LEN.

2023-10-12 Thread Richard Sandiford
Richard Sandiford  writes:
> Robin Dapp via Gcc-patches  writes:
>> [...]
>> @@ -386,9 +390,29 @@ try_conditional_simplification (internal_fn ifn, 
>> gimple_match_op *res_op,
>>  default:
>>gcc_unreachable ();
>>  }
>> -  *res_op = cond_op;
>> -  maybe_resimplify_conditional_op (seq, res_op, valueize);
>> -  return true;
>> +
>> +  if (len)
>> +{
>> +  /* If we had a COND_LEN before we need to ensure that it stays that
>> + way.  */
>> +  gimple_match_op old_op = *res_op;
>> +  *res_op = cond_op;
>> +  maybe_resimplify_conditional_op (seq, res_op, valueize);
>> +
>> +  auto cfn = combined_fn (res_op->code);
>> +  if (internal_fn_p (cfn)
>> +  && internal_fn_len_index (as_internal_fn (cfn)) != -1)
>> +return true;
>
> Why isn't it enough to check the result of maybe_resimplify_conditional_op?

Sorry, ignore that part.  I get it now.

But isn't the test whether res_op->code itself is an internal_function?
In other words, shouldn't it just be:

  if (internal_fn_p (res_op->code)
  && internal_fn_len_index (as_internal_fn (res_op->code)) != -1)
return true;

maybe_resimplify_conditional_op should already have converted to an
internal function where possible, and if combined_fn (res_op->code)
does any extra conversion on the fly, that conversion won't be reflected
in res_op.

Thanks,
Richard


Re: [PATCH 5/6]AArch64: Fix Armv9-a warnings that get emitted whenever a ACLE header is used.

2023-10-12 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> At the moment, trying to use -march=armv9-a with any ACLE header such as
> arm_neon.h results in rows and rows of warnings saying:
>
> : warning: "__ARM_ARCH" redefined
> : note: this is the location of the previous definition
>
> This is obviously not useful and happens because the header was defined at
> __ARM_ARCH == 8 and the commandline changes it.
>
> The Arm port solves this by undefining the macro during argument processing,
> and we do the same on AArch64 for the majority of macros.  However, we define
> this macro using a different helper, which requires the manual undef.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins): Add undef.

OK!  Thanks for fixing this.

Richard.

>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/armv9_warning.c: New test.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
> index 
> 578ec6f45b06347d90f951b37064006786baf10f..ab8844f6049dc95b97648b651bfcd3a4ccd3ca0b
>  100644
> --- a/gcc/config/aarch64/aarch64-c.cc
> +++ b/gcc/config/aarch64/aarch64-c.cc
> @@ -82,6 +82,7 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
>  {
>aarch64_def_or_undef (flag_unsafe_math_optimizations, "__ARM_FP_FAST", 
> pfile);
>  
> +  cpp_undef (pfile, "__ARM_ARCH");
>builtin_define_with_int_value ("__ARM_ARCH", AARCH64_ISA_V9A ? 9 : 8);
>  
>builtin_define_with_int_value ("__ARM_SIZEOF_MINIMAL_ENUM",
> diff --git a/gcc/testsuite/gcc.target/aarch64/armv9_warning.c 
> b/gcc/testsuite/gcc.target/aarch64/armv9_warning.c
> new file mode 100644
> index 
> ..35690d5bce790e11331788aacef00f3f35cdf216
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/armv9_warning.c
> @@ -0,0 +1,5 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-march=armv9-a -Wpedantic -Werror" } */
> +
> +#include <arm_neon.h>
> +


Re: [PATCH] Support g++ 4.8 as a host compiler.

2023-10-15 Thread Richard Sandiford
"Roger Sayle"  writes:
> I'd like to ping my patch for restoring bootstrap using g++ 4.8.5
> (the system compiler on RHEL 7 and later systems).
> https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632008.html
>
> Note the preprocessor #ifs can be removed; they are only there to document
> why the union u must have an explicit, empty (but not default) constructor.
>
> I completely agree with the various opinions that we might consider
> upgrading the minimum host compiler for many good reasons (Ada,
> D, newer C++ features etc.).  It's inevitable that older compilers and
> systems can't be supported indefinitely.
>
> Having said that I don't think that this unintentional trivial breakage,
> that has a safe one-line workaround, is sufficient cause (or non-negligible
> risk or support burden) to inconvenience a large number of GCC users
> (the impact/disruption to cfarm has already been mentioned).
>
> Interestingly, "scl enable devtoolset-XX" to use a newer host compiler,
> v10 or v11, results in a significant increase (100+) in unexpected failures I 
> see
> during mainline regression testing using "make -k check" (on RedHat 7.9).
> (Older) system compilers, despite their flaws, are selected for their
> (overall) stability and maturity.
>
> If another patch/change hits the compiler next week that reasonably
> means that 4.8.5 can no longer be supported, so be it, but it's an
> annoying (and unnecessary?) inconvenience in the meantime.
>
> Perhaps we should file a Bugzilla PR indicating that the documentation
> and release notes need updating, if my fix isn't considered acceptable?
>
> Why this patch is a trigger issue (that requires significant discussion
> and deliberation) is somewhat of a mystery.

It seemed like there was considerable support for bumping the minimum
to beyond 4.8.  I think we should wait until a decision has been made
before adding more 4.8 workarounds.

Having a conditional explicit constructor is dangerous because it changes
semantics.  E.g. consider:

  #include <new>

  union u { int x; };
  void f(u *ptr) { new(ptr) u; }
  void g(u *ptr) { new(ptr) u(); }

g(ptr) zeros ptr->x whereas f(ptr) doesn't.  If we add "u() {}" then g()
does not zero ptr->x.

So if we did add the workaround, it would need to be unconditional,
like you say.

Thanks,
Richard


Re: PR111648: Fix wrong code-gen due to incorrect VEC_PERM_EXPR folding

2023-10-16 Thread Richard Sandiford
Prathamesh Kulkarni  writes:
> On Wed, 11 Oct 2023 at 16:57, Prathamesh Kulkarni
>  wrote:
>>
>> On Wed, 11 Oct 2023 at 16:42, Prathamesh Kulkarni
>>  wrote:
>> >
>> > On Mon, 9 Oct 2023 at 17:05, Richard Sandiford
>> >  wrote:
>> > >
>> > > Prathamesh Kulkarni  writes:
>> > > > Hi,
>> > > > The attached patch attempts to fix PR111648.
>> > > > As mentioned in PR, the issue is when a1 is a multiple of vector
>> > > > length, we end up creating following encoding in result: { base_elem,
>> > > > arg[0], arg[1], ... } (assuming S = 1),
>> > > > where arg is chosen input vector, which is incorrect, since the
>> > > > encoding originally in arg would be: { arg[0], arg[1], arg[2], ... }
>> > > >
>> > > > For the test-case mentioned in PR, vectorizer pass creates
>> > > > VEC_PERM_EXPR where:
>> > > > arg0: { -16, -9, -10, -11 }
>> > > > arg1: { -12, -5, -6, -7 }
>> > > > sel = { 3, 4, 5, 6 }
>> > > >
>> > > > arg0, arg1 and sel are encoded with npatterns = 1 and 
>> > > > nelts_per_pattern = 3.
>> > > > Since a1 = 4 and arg_len = 4, it ended up creating the result with
>> > > > following encoding:
>> > > > res = { arg0[3], arg1[0], arg1[1] } // npatterns = 1, 
>> > > > nelts_per_pattern = 3
>> > > >   = { -11, -12, -5 }
>> > > >
>> > > > So for res[3], it used S = (-5) - (-12) = 7
>> > > > And hence computed it as -5 + 7 = 2.
>> > > > instead of selecting arg1[2], ie, -6.
>> > > >
>> > > > The patch tweaks valid_mask_for_fold_vec_perm_cst_p to punt if a1 is a 
>> > > > multiple
>> > > > of vector length, so a1 ... ae select elements only from stepped part
>> > > > of the pattern
>> > > > from input vector and return false for this case.
>> > > >
>> > > > Since the vectors are VLS, fold_vec_perm_cst then sets:
>> > > > res_npatterns = res_nelts
>> > > > res_nelts_per_pattern  = 1
>> > > > which seems to fix the issue by encoding all the elements.
>> > > >
>> > > > The patch resulted in Case 4 and Case 5 failing from test_nunits_min_2 
>> > > > because
>> > > > they used sel = { 0, 0, 1, ... } and {len, 0, 1, ... } respectively,
>> > > > which used a1 = 0, and thus selected arg1[0].
>> > > >
>> > > > I removed Case 4 because it was already covered in test_nunits_min_4,
>> > > > and moved Case 5 to test_nunits_min_4, with sel = { len, 1, 2, ... }
>> > > > and added a new Case 9 to test for this issue.
>> > > >
>> > > > Passes bootstrap+test on aarch64-linux-gnu with and without SVE,
>> > > > and on x86_64-linux-gnu.
>> > > > Does the patch look OK ?
>> > > >
>> > > > Thanks,
>> > > > Prathamesh
>> > > >
>> > > > [PR111648] Fix wrong code-gen due to incorrect VEC_PERM_EXPR folding.
>> > > >
>> > > > gcc/ChangeLog:
>> > > >   PR tree-optimization/111648
>> > > >   * fold-const.cc (valid_mask_for_fold_vec_perm_cst_p): Punt if a1
>> > > >   is a multiple of vector length.
>> > > >   (test_nunits_min_2): Remove Case 4 and move Case 5 to ...
>> > > >   (test_nunits_min_4): ... here and rename case numbers. Also add
>> > > >   Case 9.
>> > > >
>> > > > gcc/testsuite/ChangeLog:
>> > > >   PR tree-optimization/111648
>> > > >   * gcc.dg/vect/pr111648.c: New test.
>> > > >
>> > > >
>> > > > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> > > > index 4f8561509ff..c5f421d6b76 100644
>> > > > --- a/gcc/fold-const.cc
>> > > > +++ b/gcc/fold-const.cc
>> > > > @@ -10682,8 +10682,8 @@ valid_mask_for_fold_vec_perm_cst_p (tree arg0, 
>> > > > tree arg1,
>> > > > return false;
>> > > >   }
>> > > >
>> > > > -  /* Ensure that the stepped sequence always selects from the same
>> > > > -  input pattern.  */
>> > > > +  /* Ensure that the stepped sequence always selects from the 
>> > > > stepped
>>

Re: [PATCH V3] VECT: Enhance SLP of MASK_LEN_GATHER_LOAD[PR111721]

2023-10-16 Thread Richard Sandiford
Juzhe-Zhong  writes:
> This patch fixes this following FAILs in RISC-V regression:
>
> FAIL: gcc.dg/vect/vect-gather-1.c -flto -ffat-lto-objects  scan-tree-dump 
> vect "Loop contains only SLP stmts"
> FAIL: gcc.dg/vect/vect-gather-1.c scan-tree-dump vect "Loop contains only SLP 
> stmts"
> FAIL: gcc.dg/vect/vect-gather-3.c -flto -ffat-lto-objects  scan-tree-dump 
> vect "Loop contains only SLP stmts"
> FAIL: gcc.dg/vect/vect-gather-3.c scan-tree-dump vect "Loop contains only SLP 
> stmts"
>
> The root cause of these FAIL is that GCC SLP failed on MASK_LEN_GATHER_LOAD.
>
> We have 2 following situations of scalar recognized MASK_LEN_GATHER_LOAD:
>
> 1. conditional gather load: MASK_LEN_GATHER_LOAD (base, offset, scale, zero, 
> conditional mask).
>
>   In this situation we just need to leverage the current MASK_GATHER_LOAD, which 
> can achieve SLP MASK_LEN_GATHER_LOAD.
>
> 2. un-conditional gather load: MASK_LEN_GATHER_LOAD (base, offset, scale, 
> zero, -1)
>
>   Current SLP check will fail on dummy mask -1, so we relax the check in 
> tree-vect-slp.cc and allow it to be materialized.
> 
> Consider this following case:
>
> void __attribute__((noipa))
> f (int *restrict y, int *restrict x, int *restrict indices, int n)
> {
>   for (int i = 0; i < n; ++i)
> {
>   y[i * 2] = x[indices[i * 2]] + 1;
>   y[i * 2 + 1] = x[indices[i * 2 + 1]] + 2;
> }
> }
>
> https://godbolt.org/z/WG3M3n7Mo
>
> GCC is unable to SLP and instead uses VEC_LOAD_LANES/VEC_STORE_LANES:
>
> f:
> ble a3,zero,.L5
> .L3:
> vsetvli a5,a3,e8,mf4,ta,ma
> vsetvli zero,a5,e32,m1,ta,ma
> vlseg2e32.v v6,(a2)
> vsetvli a4,zero,e64,m2,ta,ma
> vsext.vf2   v2,v6
> vsll.vi v2,v2,2
> vsetvli zero,a5,e32,m1,ta,ma
> vluxei64.v  v1,(a1),v2
> vsetvli a4,zero,e64,m2,ta,ma
> vsext.vf2   v2,v7
> vsetvli zero,zero,e32,m1,ta,ma
> vadd.vi v4,v1,1
> vsetvli zero,zero,e64,m2,ta,ma
> vsll.vi v2,v2,2
> vsetvli zero,a5,e32,m1,ta,ma
> vluxei64.v  v2,(a1),v2
> vsetvli a4,zero,e32,m1,ta,ma
> slli a6,a5,3
> vadd.vi v5,v2,2
> sub a3,a3,a5
> vsetvli zero,a5,e32,m1,ta,ma
> vsseg2e32.v v4,(a0)
> add a2,a2,a6
> add a0,a0,a6
> bne a3,zero,.L3
> .L5:
> ret
>
> After this patch:
>
> f:
>   ble a3,zero,.L5
>   li  a5,1
>   csrr    t1,vlenb
>   slli    a5,a5,33
>   srli    a7,t1,2
>   addi    a5,a5,1
>   slli    a3,a3,1
>   neg t3,a7
>   vsetvli a4,zero,e64,m1,ta,ma
>   vmv.v.x v4,a5
> .L3:
>   minu    a5,a3,a7
>   vsetvli zero,a5,e32,m1,ta,ma
>   vle32.v v1,0(a2)
>   vsetvli a4,zero,e64,m2,ta,ma
>   vsext.vf2   v2,v1
>   vsll.vi v2,v2,2
>   vsetvli zero,a5,e32,m1,ta,ma
>   vluxei64.v  v2,(a1),v2
>   vsetvli a4,zero,e32,m1,ta,ma
>   mv  a6,a3
>   vadd.vv v2,v2,v4
>   vsetvli zero,a5,e32,m1,ta,ma
>   vse32.v v2,0(a0)
>   add a2,a2,t1
>   add a0,a0,t1
>   add a3,a3,t3
>   bgtu    a6,a7,.L3
> .L5:
>   ret
>
> Note that I found we are missing a conditional mask gather_load SLP test,
> so I append one in this patch.

Yeah, we're missing a target-independent test.  I'm afraid I used
aarch64-specific tests for a lot of this stuff, since (a) I wanted
to check the quality of the asm output and (b) it's very hard to write
gcc.dg/vect tests that don't fail on some target or other.  Thanks for
picking this up.

>
> Tested on RISC-V and Bootstrap && Regression on X86 passed.
>
> Ok for trunk ?
>
> gcc/ChangeLog:
>
>   * tree-vect-slp.cc (vect_get_operand_map): Add MASK_LEN_GATHER_LOAD.
>   (vect_get_and_check_slp_defs): Ditto.
>   (vect_build_slp_tree_1): Ditto.
>   (vect_build_slp_tree_2): Ditto.
>   * tree-vect-stmts.cc (vectorizable_load): Ditto.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.dg/vect/vect-gather-6.c: New test.
>
> ---
>  gcc/testsuite/gcc.dg/vect/vect-gather-6.c | 15 +++
>  gcc/tree-vect-slp.cc  | 22 ++
>  gcc/tree-vect-stmts.cc| 10 +-
>  3 files changed, 42 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-gather-6.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-gather-6.c 
> b/gcc/testsuite/gcc.dg/vect/vect-gather-6.c
> new file mode 100644
> index 000..ff55f321854
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-gather-6.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +
> +void
> +f (int *restrict y, int *restrict x, int *restrict indices, int *restrict 
> cond, int n)
> +{
> +  for (int i = 0; i < n; ++i)
> +{
> +  if (cond[i * 2])
> + y[i * 2] = x[indices[i * 2]] + 1;
> +  if (cond[i * 2 + 1])
> + y[i * 2 + 1] = x[indices[i * 2 + 1]] + 2;
> +}
> +}
> +
> +/* { dg-final { scan-tr

Re: PATCH-1v3, expand] Enable vector mode for compare_by_pieces [PR111449]

2023-10-16 Thread Richard Sandiford
Thanks for the update.  The comments below are mostly asking for
cosmetic changes.

HAO CHEN GUI  writes:
> Hi,
>   Vector mode instructions are efficient for compare on some targets.
> This patch enables vector mode for compare_by_pieces. Currently,
> vector mode is enabled for compare, set and clear. Helper function
> "qi_vector_p" decides if vector mode is enabled for certain by pieces
> operations.  optabs_checking checks if optabs are available for the
> mode and certain by pieces operations. Both of them are called in
> fixed_size_mode finding functions. A member is added to class
> op_by_pieces_d in order to record the type of by pieces operations.
>
>   The test case is in the second patch which is rs6000 specific.
>
>   Compared to last version, the main change is to create two helper
> functions and call them in mode finding function.
>
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions.
>
> Thanks
> Gui Haochen
>
> ChangeLog
> Expand: Enable vector mode for pieces compares
>
> Vector mode compare instructions are efficient for equality compare on
> rs6000. This patch refactors the codes of pieces operation to enable
> vector mode for compare.
>
> gcc/
>   PR target/111449
>   * expr.cc (qi_vector_p): New function to indicate if vector mode
>   is enabled for certain by pieces operations.
>   (optabs_checking): New function to check if optabs are available
>   for certain by pieces operations.
>   (widest_fixed_size_mode_for_size): Replace the second argument
>   with the type of by pieces operations.  Call qi_vector_p to check
>   if vector mode is enabled.  Call optabs_checking to check if optabs
>   are available for the candidate vector mode.
>   (by_pieces_ninsns): Pass the type of by pieces operation to
>   widest_fixed_size_mode_for_size.
>   (class op_by_pieces_d): Add a protected member m_op to record the
>   type of by pieces operations.  Declare member function
>   fixed_size_mode widest_fixed_size_mode_for_size.
>   (op_by_pieces_d::op_by_pieces_d): Change last argument to the type
>   of by pieces operations, initialize m_op with it.  Call non-member
>   function widest_fixed_size_mode_for_size.
>   (op_by_pieces_d::get_usable_mode): Call member function
>   widest_fixed_size_mode_for_size.
>   (op_by_pieces_d::smallest_fixed_size_mode_for_size): Call
>   qi_vector_p to check if vector mode is enabled.  Call
>   optabs_checking to check if optabs are available for the candidate
>   vector mode.
>   (op_by_pieces_d::run): Call member function
>   widest_fixed_size_mode_for_size.
>   (op_by_pieces_d::widest_fixed_size_mode_for_size): Implement.
>   (move_by_pieces_d::move_by_pieces_d): Set m_op to MOVE_BY_PIECES.
>   (store_by_pieces_d::store_by_pieces_d): Set m_op with the op.
>   (can_store_by_pieces): Pass the type of by pieces operations to
>   widest_fixed_size_mode_for_size.
>   (clear_by_pieces): Initialize class store_by_pieces_d with
>   CLEAR_BY_PIECES.
>   (compare_by_pieces_d::compare_by_pieces_d): Set m_op to
>   COMPARE_BY_PIECES.
>
> patch.diff
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index d87346dc07f..8ec3f5465a9 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -988,18 +988,43 @@ alignment_for_piecewise_move (unsigned int max_pieces, 
> unsigned int align)
>return align;
>  }
>
> -/* Return the widest QI vector, if QI_MODE is true, or integer mode
> -   that is narrower than SIZE bytes.  */
> +/* Return true if vector mode is enabled for the op.  */

Maybe:

/* Return true if we know how to implement OP using vectors of bytes.  */

"enabled" would normally imply target support.

> +static bool
> +qi_vector_p (by_pieces_operation op)

And maybe call the function "can_use_qi_vectors"

> +{
> +  return (op == COMPARE_BY_PIECES
> +   || op == SET_BY_PIECES
> +   || op == CLEAR_BY_PIECES);
> +}
> +
> +/* Return true if optabs are available for the mode and by pieces
> +   operations.  */

Maybe:

/* Return true if the target supports operation OP using vector mode MODE.  */

> +static bool
> +optabs_checking (fixed_size_mode mode, by_pieces_operation op)

And maybe call the function "qi_vector_mode_supported_p".

> +{
> +  if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES)
> +  && optab_handler (vec_duplicate_optab, mode) != CODE_FOR_nothing)
> +return true;
> +  else if (op == COMPARE_BY_PIECES
> +&& optab_handler (mov_optab, mode) != CODE_FOR_nothing
> +&& can_compare_p (EQ, mode, ccp_jump))
> +return true;
> +
> +  return false;
> +}
> +
> +/* Return the widest QI vector, if vector mode is enabled for the op,
> +   or integer mode that is narrower than SIZE bytes.  */

Maybe:

/* Return the widest mode that can be used to perform part of
   an operation OP on SIZE bytes.  Try to use QI vector modes
   where possible.  */
>
>  static fixed_size_mode
> -widest_fix
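
Putting those renaming suggestions together, the two helpers might end up
looking something like this (a sketch assembled from the hunks and comments
above, not necessarily the final committed form):

  /* Return true if we know how to implement OP using vectors of bytes.  */
  static bool
  can_use_qi_vectors (by_pieces_operation op)
  {
    return (op == COMPARE_BY_PIECES
	    || op == SET_BY_PIECES
	    || op == CLEAR_BY_PIECES);
  }

  /* Return true if the target supports operation OP using vector mode MODE.  */
  static bool
  qi_vector_mode_supported_p (fixed_size_mode mode, by_pieces_operation op)
  {
    if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES)
	&& optab_handler (vec_duplicate_optab, mode) != CODE_FOR_nothing)
      return true;
    if (op == COMPARE_BY_PIECES
	&& optab_handler (mov_optab, mode) != CODE_FOR_nothing
	&& can_compare_p (EQ, mode, ccp_jump))
      return true;
    return false;
  }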

Re: [PATCH] gimple-match: Do not try UNCOND optimization with COND_LEN.

2023-10-16 Thread Richard Sandiford
Robin Dapp  writes:
>> Why are the contents of this if statement wrong for COND_LEN?
>> If the "else" value doesn't matter, then the masked form can use
>> the "then" value for all elements.  I would have expected the same
>> thing to be true of COND_LEN.
>
> Right, that one was overly pessimistic.  Removed.
>
>> But isn't the test whether res_op->code itself is an internal_function?
>> In other words, shouldn't it just be:
>> 
>>   if (internal_fn_p (res_op->code)
>>&& internal_fn_len_index (as_internal_fn (res_op->code)) != -1)
>>  return true;
>> 
>> maybe_resimplify_conditional_op should already have converted to an
>> internal function where possible, and if combined_fn (res_op->code)
>> does any extra conversion on the fly, that conversion won't be reflected
>> in res_op.
>
> I went through some of our test cases and believe most of the problems
> are due to situations like the following:
>
> In vect-cond-arith-2.c we have (on riscv)
>   vect_neg_xi_14.4_23 = -vect_xi_13.3_22;
>   vect_res_2.5_24 = .COND_LEN_ADD ({ -1, ... }, vect_res_1.0_17, 
> vect_neg_xi_14.4_23, vect_res_1.0_17, _29, 0);
>
> On aarch64 this is a situation that matches the VEC_COND_EXPR
> simplification that I disabled with this patch.  We valueized
> to _26 = vect_res_1.0_17 - vect_xi_13.3_22 and then create
> vect_res_2.5_24 = VEC_COND_EXPR ;
> This is later re-assembled into a COND_SUB.
>
> As we have two masks or COND_LEN we cannot use a VEC_COND_EXPR to
> achieve the same thing.  Would it be possible to create a COND_OP
> directly instead, though?  I tried the following (not very polished
> obviously):
>
> -  new_op.set_op (VEC_COND_EXPR, res_op->type,
> -res_op->cond.cond, res_op->ops[0],
> -res_op->cond.else_value);
> -  *res_op = new_op;
> -  return gimple_resimplify3 (seq, res_op, valueize);
> +  if (!res_op->cond.len)
> +   {
> + new_op.set_op (VEC_COND_EXPR, res_op->type,
> +res_op->cond.cond, res_op->ops[0],
> +res_op->cond.else_value);
> + *res_op = new_op;
> + return gimple_resimplify3 (seq, res_op, valueize);
> +   }
> +  else if (seq && *seq && is_gimple_assign (*seq))
> +   {
> + new_op.code = gimple_assign_rhs_code (*seq);
> + new_op.type = res_op->type;
> + new_op.num_ops = gimple_num_ops (*seq) - 1;
> + new_op.ops[0] = gimple_assign_rhs1 (*seq);
> + if (new_op.num_ops > 1)
> +   new_op.ops[1] = gimple_assign_rhs2 (*seq);
> + if (new_op.num_ops > 2)
> +   new_op.ops[2] = gimple_assign_rhs2 (*seq);
> +
> + new_op.cond = res_op->cond;
> +
> + gimple_match_op bla2;
> + if (convert_conditional_op (&new_op, &bla2))
> +   {
> + *res_op = bla2;
> + // SEQ should now be dead.
> + return true;
> +   }
> +   }
>
> This would make the other hunk (check whether it was a LEN
> and try to recreate it) redundant I hope.
>
> I don't know enough about valueization, whether it's always
> safe to do that and other implications.  On riscv this seems
> to work, though and the other backends never go through the LEN
> path.  If, however, this is a feasible direction it could also
> be done for the non-LEN targets?

I don't know much about valueisation either :)  But it does feel
like we're working around the lack of a LEN form of COND_EXPR.
In other words, it seems odd that we can do:

  IFN_COND_LEN_ADD (mask, a, 0, b, len, bias)

but we can't do:

  IFN_COND_LEN (mask, a, b, len, bias)

There seems to be no way of applying a length without also finding an
operation to perform.

Does IFN_COND_LEN make conceptual sense on RVV?  If so, would defining
it solve some of these problems?

I suppose in the worst case, IFN_COND_LEN is equivalent to IFN_COND_LEN_IOR
with a zero input (and extended to floats).  So if the target can do
IFN_COND_LEN_IOR, it could implement IFN_COND_LEN using the same instruction.
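
Spelling out that worst case for integer element types (argument order as
in the COND_LEN_* calls above, with the else value before the length):

  IFN_COND_LEN (mask, a, b, len, bias)
    == IFN_COND_LEN_IOR (mask, a, { 0, ... }, b, len, bias)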

Thanks,
Richard



Re: [PATCH] gimple-match: Do not try UNCOND optimization with COND_LEN.

2023-10-17 Thread Richard Sandiford
Robin Dapp  writes:
>>> I don't know much about valueisation either :)  But it does feel
>>> like we're working around the lack of a LEN form of COND_EXPR.
>>> In other words, it seems odd that we can do:
>>>
>>>   IFN_COND_LEN_ADD (mask, a, 0, b, len, bias)
>>>
>>> but we can't do:
>>>
>>>   IFN_COND_LEN (mask, a, b, len, bias)
>>>
>>> There seems to be no way of applying a length without also finding an
>>> operation to perform.
>> 
>> Indeed .. maybe - _maybe_ we want to scrap VEC_COND_EXPR for
>> IFN_COND{,_LEN} to be more consistent here?
>
> So, yes we could define IFN_COND_LEN (or VCOND_MASK_LEN) but I'd
> assume that there would be a whole lot of follow-up things to
> consider.
>
> I'm wondering if we really gain something from the the round-trip
> via VEC_COND_EXPR when we eventually create a COND_(LEN_)_OP anyway?

The main purpose of the VEC_COND_EXPR isn't as an intermediate step,
but as an end in its own right.  E.g. it allows:

  IFN_COND_ADD (mask, cst1, cst2, else)

to be folded to:

  VEC_COND_EXPR <mask, cst1 + cst2, else>

This is especially useful when vectorisation has the effect of completely
unrolling a loop.

The VEC_COND_EXPR is only used if the equivalent unconditional rule
folds to a gimple value.

> Sure, if the target doesn't have the particular operation we would
> want a VEC_COND_EXPR.  Same if SEQ is somehow more complicated.
>
> So the IFN_COND(_LEN) =? VCOND_MASK(_LEN) discussion notwithstanding,
> couldn't what I naively proposed be helpful as well?

I don't think it's independently useful, since the fold that it's
attempting is one that match.pd should be able to do.  match.pd can
also do it in a more general way, since it isn't restricted to looking
at the currenct sequence.

> Or do we
> potentially lose optimizations during the time where e.g. a
>  _foo = a BINOP b
>  VEC_COND_EXPR (cond, foo, else)
> has not yet been converted into a
>  COND_OP?

Yeah, it would miss out on that too.  

> We already create COND_OPs for the other paths
> (via convert_conditional_op) so why not for this one?  Or am I missing
> some interdependence with SEQ?

The purpose of this code is to see what happens if we apply the
usual folds for unconditional ops to the corresponding conditional forms.
E.g. for IFN_COND_ADD (mask, a, b, c) it sees what a + b would fold to,
then tries to reapply the VEC_COND_EXPR (mask, ..., c) to the result.

If a + b folds to a gimple value, we can fold to a VEC_COND_EXPR
involving that gimple value, as discussed above.  This could happen
if a + b folds to a constant, or for things like a + 0 -> a.
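
For example, taking the a + 0 -> a case:

  IFN_COND_ADD (mask, a, { 0, ... }, c)  -->  VEC_COND_EXPR <mask, a, c>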

If instead a + b folds to a new operation (say a + b' or a - b'),
we need to construct the equivalent conditional form of that operation,
with the same mask and else values.  This is a correctness issue rather
than an optimisation.  As the comment in:

  /* Otherwise try rewriting the operation as an IFN_COND_* call.
 Again, this isn't a simplification in itself, since it's what
 RES_OP already described.  */
  if (convert_conditional_op (res_op, &new_op))
*res_op = new_op;

says, it's just reconstituting what RES_OP describes in gimple form.
If that isn't possible then the simplification must fail.
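
For example, paralleling the riscv add-of-negation case earlier in the thread:

  IFN_COND_ADD (mask, a, -b, c)  -->  IFN_COND_SUB (mask, a, b, c)

with the same mask and else value, rather than an unconditional a - b.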

In some cases we could, as a follow-on, try to make a a' op b' fold
result fall back to an unconditional a' op b' followed by a VEC_COND_EXPR.
But we don't do that currently.  It isn't safe in all cases, since
IFN_COND_ADD only adds active elements, whereas an unconditional a' op b'
would operate on all elements.  I also don't know of any specific example
where this would be useful on SVE.

Thanks,
Richard

>
> FWIW I did a full bootstrap and testsuite run on the usual architectures
> showing no changes with the attached patch.
>
> Regards
>  Robin
>
> Subject: [PATCH] gimple-match: Create COND_OP directly if possible.
>
> This patch converts simplified sequences into conditional operations
> instead of VEC_COND_EXPRs if the target supports them.
> This helps for len-masked targets which cannot directly use a
> VEC_COND_EXPR in the presence of length masking.
>
> gcc/ChangeLog:
>
>   * gimple-match-exports.cc (directly_supported_p): Define.
>   (maybe_resimplify_conditional_op): Create COND_OP directly.
>   * gimple-match.h (gimple_match_cond::gimple_match_cond):
>   Initialize length and bias.
> ---
>  gcc/gimple-match-exports.cc | 40 -
>  gcc/gimple-match.h  |  7 +--
>  2 files changed, 36 insertions(+), 11 deletions(-)
>
> diff --git a/gcc/gimple-match-exports.cc b/gcc/gimple-match-exports.cc
> index b36027b0bad..ba3bd1450db 100644
> --- a/gcc/gimple-match-exports.cc
> +++ b/gcc/gimple-match-exports.cc
> @@ -98,6 +98,8 @@ static bool gimple_resimplify5 (gimple_seq *, 
> gimple_match_op *, tree (*)(tree))
>  static bool gimple_resimplify6 (gimple_seq *, gimple_match_op *, tree 
> (*)(tree));
>  static bool gimple_resimplify7 (gimple_seq *, gimple_match_op *, tree 
> (*)(tree));
>  
> +bool directly_supported_p (code_h

Re: [PATCH] gimple-match: Do not try UNCOND optimization with COND_LEN.

2023-10-17 Thread Richard Sandiford
Richard Biener  writes:
> On Mon, Oct 16, 2023 at 11:59 PM Richard Sandiford
>  wrote:
>>
>> Robin Dapp  writes:
>> >> Why are the contents of this if statement wrong for COND_LEN?
>> >> If the "else" value doesn't matter, then the masked form can use
>> >> the "then" value for all elements.  I would have expected the same
>> >> thing to be true of COND_LEN.
>> >
>> > Right, that one was overly pessimistic.  Removed.
>> >
>> >> But isn't the test whether res_op->code itself is an internal_function?
>> >> In other words, shouldn't it just be:
>> >>
>> >>   if (internal_fn_p (res_op->code)
>> >>&& internal_fn_len_index (as_internal_fn (res_op->code)) != -1)
>> >>  return true;
>> >>
>> >> maybe_resimplify_conditional_op should already have converted to an
>> >> internal function where possible, and if combined_fn (res_op->code)
>> >> does any extra conversion on the fly, that conversion won't be reflected
>> >> in res_op.
>> >
>> > I went through some of our test cases and believe most of the problems
>> > are due to situations like the following:
>> >
>> > In vect-cond-arith-2.c we have (on riscv)
>> >   vect_neg_xi_14.4_23 = -vect_xi_13.3_22;
>> >   vect_res_2.5_24 = .COND_LEN_ADD ({ -1, ... }, vect_res_1.0_17, 
>> > vect_neg_xi_14.4_23, vect_res_1.0_17, _29, 0);
>> >
>> > On aarch64 this is a situation that matches the VEC_COND_EXPR
>> > simplification that I disabled with this patch.  We valueized
>> > to _26 = vect_res_1.0_17 - vect_xi_13.3_22 and then create
>> > vect_res_2.5_24 = VEC_COND_EXPR ;
>> > This is later re-assembled into a COND_SUB.
>> >
>> > As we have two masks or COND_LEN we cannot use a VEC_COND_EXPR to
>> > achieve the same thing.  Would it be possible to create a COND_OP
>> > directly instead, though?  I tried the following (not very polished
>> > obviously):
>> >
>> > -  new_op.set_op (VEC_COND_EXPR, res_op->type,
>> > -res_op->cond.cond, res_op->ops[0],
>> > -res_op->cond.else_value);
>> > -  *res_op = new_op;
>> > -  return gimple_resimplify3 (seq, res_op, valueize);
>> > +  if (!res_op->cond.len)
>> > +   {
>> > + new_op.set_op (VEC_COND_EXPR, res_op->type,
>> > +res_op->cond.cond, res_op->ops[0],
>> > +res_op->cond.else_value);
>> > + *res_op = new_op;
>> > + return gimple_resimplify3 (seq, res_op, valueize);
>> > +   }
>> > +  else if (seq && *seq && is_gimple_assign (*seq))
>> > +   {
>> > + new_op.code = gimple_assign_rhs_code (*seq);
>> > + new_op.type = res_op->type;
>> > + new_op.num_ops = gimple_num_ops (*seq) - 1;
>> > + new_op.ops[0] = gimple_assign_rhs1 (*seq);
>> > + if (new_op.num_ops > 1)
>> > +   new_op.ops[1] = gimple_assign_rhs2 (*seq);
>> > + if (new_op.num_ops > 2)
>> > +   new_op.ops[2] = gimple_assign_rhs2 (*seq);
>> > +
>> > + new_op.cond = res_op->cond;
>> > +
>> > + gimple_match_op bla2;
>> > + if (convert_conditional_op (&new_op, &bla2))
>> > +   {
>> > + *res_op = bla2;
>> > + // SEQ should now be dead.
>> > + return true;
>> > +   }
>> > +   }
>> >
>> > This would make the other hunk (check whether it was a LEN
>> > and try to recreate it) redundant I hope.
>> >
>> > I don't know enough about valueization, whether it's always
>> > safe to do that and other implications.  On riscv this seems
>> > to work, though and the other backends never go through the LEN
>> > path.  If, however, this is a feasible direction it could also
>> > be done for the non-LEN targets?
>>
>> I don't know much about valueisation either :)  But it does feel
>> like we're working around the lack of a LEN form of COND_EXPR.
>> In other words, it seems odd that we can do:
>>
>>   IFN_COND_LEN_ADD (mask, a, 0, b, len, bias)
>>
>> but we can't do:
>>
>>   IFN_COND_LEN (mask, a, b, len, bias)
>>
>> There seems to be no way of applying a length without also finding an
>> operation to perform.
>
> Indeed .. maybe - _maybe_ we want to scrap VEC_COND_EXPR for
> IFN_COND{,_LEN} to be more consistent here?

Yeah, sounds like it could be worthwhile.  But I suppose we still need
VEC_COND_EXPR itself because it's a generic front-end operation that
needs to be lowered.  So it might be worth starting with an ifn for the
LEN form and seeing whether the non-LEN form should switch over.

Thanks,
Richard


Re: [PATCH] gimple-match: Do not try UNCOND optimization with COND_LEN.

2023-10-17 Thread Richard Sandiford
Robin Dapp  writes:
> Thank you for the explanation.
>
> So, assuming I added an IFN_VCOND_MASK and IFN_VCOND_MASK_LEN along
> with the respective helper and expand functions, what would be the
> way forward?

IMO it'd be worth starting with the _LEN form only.

> Generate an IFN_VCOND_MASK(_LEN) here instead of a VEC_COND_EXPR?
> How would I make sure all of match.pd's vec_cond optimizations
> applied to it as well?

I think the most important ones are:

/* Simplify:

 a = a1 op a2
 r = c ? a : b;

   to:

 r = c ? a1 op a2 : b;

   if the target can do it in one go.  This makes the operation conditional
   on c, so could drop potentially-trapping arithmetic, but that's a valid
   simplification if the result of the operation isn't needed.

   Avoid speculatively generating a stand-alone vector comparison
   on targets that might not support them.  Any target implementing
   conditional internal functions must support the same comparisons
   inside and outside a VEC_COND_EXPR.  */

It would be nice if there was some match.pd syntax that automatically
extended these rules to IFN_VCOND_MASK_LEN, but I don't know how easy
that would be, due to the extra two parameters.

Perhaps that itself could be done in gimple-match-exports.cc, in a similar
way to the current conditional stuff.  That is:

- for IFN_VCOND_MASK_LEN, try folding as a VEC_COND_EXPR and then "adding
  the length back"

- for IFN_COND_LEN_FOO, try folding as an IFN_COND_FOO and then
  "add the length back"

Not sure how important the second one is.

Thanks,
Richard

> Right now AFAIK IFN_VCOND_MASK only gets created in isel and
> everything is just a VEC_COND before.  But that does not provide
> length masking so is not the way to go?
>
> Thanks.
>
> Regards
>  Robin


[PATCH 1/2] aarch64: Use vecs to store register save order

2023-10-17 Thread Richard Sandiford
aarch64_save/restore_callee_saves looped over registers in register
number order.  This in turn meant that we could only use LDP and STP
for registers that were consecutive both number-wise and
offset-wise (after unsaved registers are excluded).

This patch instead builds lists of the registers that we've decided to
save, in offset order.  We can then form LDP/STP pairs regardless of
register number order, which in turn means that we can put the LR save
slot first without losing LDP/STP opportunities.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
* config/aarch64/aarch64.h (aarch64_frame): Add vectors that
store the lists of saved GPRs, FPRs and predicate registers.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Initialize
the lists of saved registers.  Use them to choose push candidates.
Invalidate pop candidates if we're not going to do a pop.
(aarch64_next_callee_save): Delete.
(aarch64_save_callee_saves): Take a list of registers,
rather than a range.  Make !skip_wb select only write-back
candidates.
(aarch64_expand_prologue): Update calls accordingly.
(aarch64_restore_callee_saves): Take a list of registers,
rather than a range.  Always skip pop candidates.  Also skip
LR if shadow call stacks are enabled.
(aarch64_expand_epilogue): Update calls accordingly.

gcc/testsuite/
* gcc.target/aarch64/sve/pcs/stack_clash_2.c: Expect restores
to happen in offset order.
* gcc.target/aarch64/sve/pcs/stack_clash_2_128.c: Likewise.
* gcc.target/aarch64/sve/pcs/stack_clash_2_256.c: Likewise.
* gcc.target/aarch64/sve/pcs/stack_clash_2_512.c: Likewise.
* gcc.target/aarch64/sve/pcs/stack_clash_2_1024.c: Likewise.
* gcc.target/aarch64/sve/pcs/stack_clash_2_2048.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc | 203 +-
 gcc/config/aarch64/aarch64.h  |   9 +-
 .../aarch64/sve/pcs/stack_clash_2.c   |   6 +-
 .../aarch64/sve/pcs/stack_clash_2_1024.c  |   6 +-
 .../aarch64/sve/pcs/stack_clash_2_128.c   |   6 +-
 .../aarch64/sve/pcs/stack_clash_2_2048.c  |   6 +-
 .../aarch64/sve/pcs/stack_clash_2_256.c   |   6 +-
 .../aarch64/sve/pcs/stack_clash_2_512.c   |   6 +-
 8 files changed, 128 insertions(+), 120 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9fbfc548a89..e8b5dfe4d58 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8527,13 +8527,17 @@ aarch64_save_regs_above_locals_p ()
 static void
 aarch64_layout_frame (void)
 {
-  int regno, last_fp_reg = INVALID_REGNUM;
+  unsigned regno, last_fp_reg = INVALID_REGNUM;
   machine_mode vector_save_mode = aarch64_reg_save_mode (V8_REGNUM);
   poly_int64 vector_save_size = GET_MODE_SIZE (vector_save_mode);
   bool frame_related_fp_reg_p = false;
   aarch64_frame &frame = cfun->machine->frame;
   poly_int64 top_of_locals = -1;
 
+  vec_safe_truncate (frame.saved_gprs, 0);
+  vec_safe_truncate (frame.saved_fprs, 0);
+  vec_safe_truncate (frame.saved_prs, 0);
+
   frame.emit_frame_chain = aarch64_needs_frame_chain ();
 
   /* Adjust the outgoing arguments size if required.  Keep it in sync with what
@@ -8618,6 +8622,7 @@ aarch64_layout_frame (void)
   for (regno = P0_REGNUM; regno <= P15_REGNUM; regno++)
 if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED))
   {
+   vec_safe_push (frame.saved_prs, regno);
if (frame.sve_save_and_probe == INVALID_REGNUM)
  frame.sve_save_and_probe = regno;
frame.reg_offset[regno] = offset;
@@ -8639,7 +8644,7 @@ aarch64_layout_frame (void)
 If we don't have any vector registers to save, and we know how
 big the predicate save area is, we can just round it up to the
 next 16-byte boundary.  */
-  if (last_fp_reg == (int) INVALID_REGNUM && offset.is_constant ())
+  if (last_fp_reg == INVALID_REGNUM && offset.is_constant ())
offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
   else
{
@@ -8653,10 +8658,11 @@ aarch64_layout_frame (void)
 }
 
   /* If we need to save any SVE vector registers, add them next.  */
-  if (last_fp_reg != (int) INVALID_REGNUM && crtl->abi->id () == ARM_PCS_SVE)
+  if (last_fp_reg != INVALID_REGNUM && crtl->abi->id () == ARM_PCS_SVE)
 for (regno = V0_REGNUM; regno <= V31_REGNUM; regno++)
   if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED))
{
+ vec_safe_push (frame.saved_fprs, regno);
  if (frame.sve_save_and_probe == INVALID_REGNUM)
frame.sve_save_and_probe = regno;
  frame.reg_offset[regno] = offset;
@@ -8677,13 +8683,8 @@ aarch64_layout_frame (void)
 
   auto allocate_gpr_slot = [&](unsigned int regno)
 {
-  if (frame.hard_fp_save_and_probe == INVALID_REGNUM)
-   frame.hard_fp_save_and_probe = regno;
+  vec_safe_push (f

[PATCH 2/2] aarch64: Put LR save slot first in more cases

2023-10-17 Thread Richard Sandiford
Now that the prologue and epilogue code iterates over saved
registers in offset order, we can put the LR save slot first
without compromising LDP/STP formation.

This isn't worthwhile when shadow call stacks are enabled, since the
first two registers are also push/pop candidates, and LR cannot be
popped when shadow call stacks are enabled.  (LR is instead loaded
first and compared against the shadow stack's value.)

But otherwise, it seems better to put the LR save slot first,
to reduce unnecessary variation with the layout for stack clash
protection.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Don't make
the position of the LR save slot dependent on stack clash
protection unless shadow call stacks are enabled.

gcc/testsuite/
* gcc.target/aarch64/test_frame_2.c: Expect x30 to come before x19.
* gcc.target/aarch64/test_frame_4.c: Likewise.
* gcc.target/aarch64/test_frame_7.c: Likewise.
* gcc.target/aarch64/test_frame_10.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc| 2 +-
 gcc/testsuite/gcc.target/aarch64/test_frame_10.c | 4 ++--
 gcc/testsuite/gcc.target/aarch64/test_frame_2.c  | 4 ++--
 gcc/testsuite/gcc.target/aarch64/test_frame_4.c  | 4 ++--
 gcc/testsuite/gcc.target/aarch64/test_frame_7.c  | 4 ++--
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e8b5dfe4d58..62b1ae0652f 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8694,7 +8694,7 @@ aarch64_layout_frame (void)
   allocate_gpr_slot (R29_REGNUM);
   allocate_gpr_slot (R30_REGNUM);
 }
-  else if (flag_stack_clash_protection
+  else if ((flag_stack_clash_protection || !frame.is_scs_enabled)
   && known_eq (frame.reg_offset[R30_REGNUM], SLOT_REQUIRED))
 /* Put the LR save slot first, since it makes a good choice of probe
for stack clash purposes.  The idea is that the link register usually
diff --git a/gcc/testsuite/gcc.target/aarch64/test_frame_10.c 
b/gcc/testsuite/gcc.target/aarch64/test_frame_10.c
index c19505082fa..c54ab2d0ccb 100644
--- a/gcc/testsuite/gcc.target/aarch64/test_frame_10.c
+++ b/gcc/testsuite/gcc.target/aarch64/test_frame_10.c
@@ -14,6 +14,6 @@
 t_frame_pattern_outgoing (test10, 480, "x19", 24, a[8], a[9], a[10])
 t_frame_run (test10)
 
-/* { dg-final { scan-assembler-times "stp\tx19, x30, \\\[sp, \[0-9\]+\\\]" 1 } 
} */
-/* { dg-final { scan-assembler "ldp\tx19, x30, \\\[sp, \[0-9\]+\\\]" } } */
+/* { dg-final { scan-assembler-times "stp\tx30, x19, \\\[sp, \[0-9\]+\\\]" 1 } 
} */
+/* { dg-final { scan-assembler "ldp\tx30, x19, \\\[sp, \[0-9\]+\\\]" } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/test_frame_2.c 
b/gcc/testsuite/gcc.target/aarch64/test_frame_2.c
index 7e5df84cf5f..0d715314cb8 100644
--- a/gcc/testsuite/gcc.target/aarch64/test_frame_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/test_frame_2.c
@@ -14,6 +14,6 @@ t_frame_pattern (test2, 200, "x19")
 t_frame_run (test2)
 
 
-/* { dg-final { scan-assembler-times "stp\tx19, x30, \\\[sp, -\[0-9\]+\\\]!" 1 
} } */
-/* { dg-final { scan-assembler "ldp\tx19, x30, \\\[sp\\\], \[0-9\]+" } } */
+/* { dg-final { scan-assembler-times "stp\tx30, x19, \\\[sp, -\[0-9\]+\\\]!" 1 
} } */
+/* { dg-final { scan-assembler "ldp\tx30, x19, \\\[sp\\\], \[0-9\]+" } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/test_frame_4.c 
b/gcc/testsuite/gcc.target/aarch64/test_frame_4.c
index ed13487a094..b41229c42f4 100644
--- a/gcc/testsuite/gcc.target/aarch64/test_frame_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/test_frame_4.c
@@ -13,6 +13,6 @@
 t_frame_pattern (test4, 400, "x19")
 t_frame_run (test4)
 
-/* { dg-final { scan-assembler-times "stp\tx19, x30, \\\[sp, -\[0-9\]+\\\]!" 1 
} } */
-/* { dg-final { scan-assembler "ldp\tx19, x30, \\\[sp\\\], \[0-9\]+" } } */
+/* { dg-final { scan-assembler-times "stp\tx30, x19, \\\[sp, -\[0-9\]+\\\]!" 1 
} } */
+/* { dg-final { scan-assembler "ldp\tx30, x19, \\\[sp\\\], \[0-9\]+" } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/test_frame_7.c 
b/gcc/testsuite/gcc.target/aarch64/test_frame_7.c
index 96452794956..5702656a5da 100644
--- a/gcc/testsuite/gcc.target/aarch64/test_frame_7.c
+++ b/gcc/testsuite/gcc.target/aarch64/test_frame_7.c
@@ -13,6 +13,6 @@
 t_frame_pattern (test7, 700, "x19")
 t_frame_run (test7)
 
-/* { dg-final { scan-assembler-times "stp\tx19, x30, \\\[sp]" 1 } } */
-/* { dg-final { scan-assembler "ldp\tx19, x30, \\\[sp\\\]" } } */
+/* { dg-final { scan-assembler-times "stp\tx30, x19, \\\[sp]" 1 } } */
+/* { dg-final { scan-assembler "ldp\tx30, x19, \\\[sp\\\]" } } */
 
-- 
2.25.1



Re: [PATCH] Support g++ 4.8 as a host compiler.

2023-10-18 Thread Richard Sandiford
Jakub Jelinek  writes:
> On Sun, Oct 15, 2023 at 12:43:10PM +0100, Richard Sandiford wrote:
>> It seemed like there was considerable support for bumping the minimum
>> to beyond 4.8.  I think we should wait until a decision has been made
>> before adding more 4.8 workarounds.
>
> I think adding a workaround until that decision is made and perhaps
> removing it afterwards will make life easier for people still using gcc 4.8.
>
>> Having a conditional explicit constructor is dangerous because it changes
>> semantics.  E.g. consider:
>> 
>>   #include 
>> 
>>   union u { int x; };
>>   void f(u *ptr) { new(ptr) u; }
>>   void g(u *ptr) { new(ptr) u(); }
>> 
>> g(ptr) zeros ptr->x whereas f(ptr) doesn't.  If we add "u() {}" then g()
>> does not zero ptr->x.
>> 
>> So if we did add the workaround, it would need to be unconditional,
>> like you say.
>
> What about using more directed workaround then?
>
> Like (just stage1 build tested, perhaps with comment why we do that)
> below?  Seems at least in stage1 it is the only problematic spot.
>
> --- a/gcc/cse.cc
> +++ b/gcc/cse.cc
> @@ -4951,8 +4951,14 @@ cse_insn (rtx_insn *insn)
>   && is_a  (mode, &int_mode)
>   && (extend_op = load_extend_op (int_mode)) != UNKNOWN)
> {
> +#if GCC_VERSION >= 5000
>   struct rtx_def memory_extend_buf;
>   rtx memory_extend_rtx = &memory_extend_buf;
> +#else
> + alignas (alignof (rtx_def)) unsigned char
> +   memory_extended_buf[sizeof (rtx_def)];

Looks like the simpler "alignas (rtx_def)" should work.
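
i.e. just:

	  alignas (rtx_def) unsigned char
	    memory_extended_buf[sizeof (rtx_def)];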

LGTM otherwise FWIW.

Richard

> + rtx memory_extend_rtx = (rtx) &memory_extended_buf[0];
> +#endif
>  
>   /* Set what we are trying to extend and the operation it might
>  have been extended with.  */
>
>
>   Jakub


[Backport RFA] lra: Avoid unfolded plus-0

2023-10-18 Thread Richard Sandiford
Vlad, is it OK if I backport the patch below to fix
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111528 ?  Jakub has
given a conditional OK on irc.

Thanks,
Richard

Richard Sandiford  writes:
> While backporting another patch to an earlier release, I hit a
> situation in which lra_eliminate_regs_1 would eliminate an address to:
>
> (plus (reg:P R) (const_int 0))
>
> This address compared not-equal to plain:
>
> (reg:P R)
>
> which caused an ICE in a later peephole2.  (The ICE showed up in
> gfortran.fortran-torture/compile/pr80464.f90 on the branch but seems
> to be latent on trunk.)
>
> These unfolded PLUSes shouldn't occur in the insn stream, and later code
> in the same function tried to avoid them.
>
> Tested on aarch64-linux-gnu so far, but I'll test on x86_64-linux-gnu too.
> Does this look OK?
>
> There are probably other instances of the same thing elsewhere,
> but it seemed safer to stick to the one that caused the issue.
>
> Thanks,
> Richard
>
>
> gcc/
>   * lra-eliminations.cc (lra_eliminate_regs_1): Use simplify_gen_binary
>   rather than gen_rtx_PLUS.
> ---
>  gcc/lra-eliminations.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/lra-eliminations.cc b/gcc/lra-eliminations.cc
> index df613cdda76..4daaff1a124 100644
> --- a/gcc/lra-eliminations.cc
> +++ b/gcc/lra-eliminations.cc
> @@ -406,7 +406,7 @@ lra_eliminate_regs_1 (rtx_insn *insn, rtx x, machine_mode 
> mem_mode,
>   elimination_fp2sp_occured_p = true;
>  
> if (! update_p && ! full_p)
> - return gen_rtx_PLUS (Pmode, to, XEXP (x, 1));
> + return simplify_gen_binary (PLUS, Pmode, to, XEXP (x, 1));
>  
> if (maybe_ne (update_sp_offset, 0))
>   offset = ep->to_rtx == stack_pointer_rtx ? update_sp_offset : 0;


Re: PR111648: Fix wrong code-gen due to incorrect VEC_PERM_EXPR folding

2023-10-18 Thread Richard Sandiford
Prathamesh Kulkarni  writes:
> On Tue, 17 Oct 2023 at 02:40, Richard Sandiford
>  wrote:
>> Prathamesh Kulkarni  writes:
>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> > index 4f8561509ff..55a6a68c16c 100644
>> > --- a/gcc/fold-const.cc
>> > +++ b/gcc/fold-const.cc
>> > @@ -10684,9 +10684,8 @@ valid_mask_for_fold_vec_perm_cst_p (tree arg0, 
>> > tree arg1,
>> >
>> >/* Ensure that the stepped sequence always selects from the same
>> >input pattern.  */
>> > -  unsigned arg_npatterns
>> > - = ((q1 & 1) == 0) ? VECTOR_CST_NPATTERNS (arg0)
>> > -   : VECTOR_CST_NPATTERNS (arg1);
>> > +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
>> > +  unsigned arg_npatterns = VECTOR_CST_NPATTERNS (arg);
>> >
>> >if (!multiple_p (step, arg_npatterns))
>> >   {
>> > @@ -10694,6 +10693,29 @@ valid_mask_for_fold_vec_perm_cst_p (tree arg0, 
>> > tree arg1,
>> >   *reason = "step is not multiple of npatterns";
>> > return false;
>> >   }
>> > +
>> > +  /* If a1 chooses base element from arg, ensure that it's a natural
>> > +  stepped sequence, ie, (arg[2] - arg[1]) == (arg[1] - arg[0])
>> > +  to preserve arg's encoding.  */
>> > +
>> > +  unsigned HOST_WIDE_INT index;
>> > +  if (!r1.is_constant (&index))
>> > + return false;
>> > +  if (index < arg_npatterns)
>> > + {
>>
>> I don't know whether it matters in practice, but I think the two conditions
>> above are more natural as:
>>
>> if (maybe_lt (r1, arg_npatterns))
>>   {
>> unsigned HOST_WIDE_INT index;
>> if (!r1.is_constant (&index))
>>   return false;
>>
>> ...[code below]...
>>   }
>>
>> > +   tree arg_elem0 = vector_cst_elt (arg, index);
>> > +   tree arg_elem1 = vector_cst_elt (arg, index + arg_npatterns);
>> > +   tree arg_elem2 = vector_cst_elt (arg, index + arg_npatterns * 2);
>> > +
>> > +   if (!operand_equal_p (const_binop (MINUS_EXPR, arg_elem2, 
>> > arg_elem1),
>> > + const_binop (MINUS_EXPR, arg_elem1, 
>> > arg_elem0),
>> > + 0))
>>
>> This needs to check whether const_binop returns null.  Maybe:
>>
>>tree step1, step2;
>>if (!(step1 = const_binop (MINUS_EXPR, arg_elem1, arg_elem0))
>>|| !(step2 = const_binop (MINUS_EXPR, arg_elem2, arg_elem1))
>>|| !operand_equal_p (step1, step2, 0))
>>
>> OK with those changes, thanks.
> Hi Richard,
> Thanks for the suggestions, updated the attached patch accordingly.
> Bootstrapped+tested with and without SVE on aarch64-linux-gnu and
> x86_64-linux-gnu.
> OK to commit ?

Yes, thanks.

Richard

>
> Thanks,
> Prathamesh
>>
>> Richard
>>
>> > + {
>> > +   if (reason)
>> > + *reason = "not a natural stepped sequence";
>> > +   return false;
>> > + }
>> > + }
>> >  }
>> >
>> >return true;
>> > @@ -17161,7 +17183,8 @@ namespace test_fold_vec_perm_cst {
>> >  static tree
>> >  build_vec_cst_rand (machine_mode vmode, unsigned npatterns,
>> >   unsigned nelts_per_pattern,
>> > - int step = 0, int threshold = 100)
>> > + int step = 0, bool natural_stepped = false,
>> > + int threshold = 100)
>> >  {
>> >tree inner_type = lang_hooks.types.type_for_mode (GET_MODE_INNER 
>> > (vmode), 1);
>> >tree vectype = build_vector_type_for_mode (inner_type, vmode);
>> > @@ -17176,17 +17199,28 @@ build_vec_cst_rand (machine_mode vmode, unsigned 
>> > npatterns,
>> >
>> >// Fill a1 for each pattern
>> >for (unsigned i = 0; i < npatterns; i++)
>> > -builder.quick_push (build_int_cst (inner_type, rand () % threshold));
>> > -
>> > +{
>> > +  tree a1;
>> > +  if (natural_stepped)
>> > + {
>> > +   tree a0 = builder[i];
>> > +   wide_int a0_val = wi::to_wide (a0);
>> > +   wide_int a1_val = a0_val + step;
>> > +   a1 = wide_int_to_tree (inner_type, a1_val);
>> > + }
>>

Re: [PATCH 01/11] rtl-ssa: Fix bug in function_info::add_insn_after

2023-10-18 Thread Richard Sandiford
Alex Coplan  writes:
> In the case that !insn->is_debug_insn () && next->is_debug_insn (), this
> function was missing an update of the prev pointer on the first nondebug
> insn following the sequence of debug insns starting at next.
>
> This can lead to corruption of the insn chain, in that we end up with:
>
>   insn->next_any_insn ()->prev_any_insn () != insn
>
> in this case.  This patch fixes that.
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
>
> gcc/ChangeLog:
>
>   * rtl-ssa/insns.cc (function_info::add_insn_after): Ensure we
>   update the prev pointer on the following nondebug insn in the
>   case that !insn->is_debug_insn () && next->is_debug_insn ().

OK, thanks.

Richard

> ---
>  gcc/rtl-ssa/insns.cc | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/rtl-ssa/insns.cc b/gcc/rtl-ssa/insns.cc
> index a0c2fec2b70..f970375d906 100644
> --- a/gcc/rtl-ssa/insns.cc
> +++ b/gcc/rtl-ssa/insns.cc
> @@ -291,9 +291,17 @@ function_info::add_insn_after (insn_info *insn, 
> insn_info *after)
> first->set_last_debug_insn (insn);
>   }
>else // !insn->is_debug_insn () && next->is_debug_insn ()
> - // At present we don't (need to) support inserting a nondebug
> - // instruction between two existing debug instructions.
> - gcc_assert (!after->is_debug_insn ());
> + {
> +   // At present we don't (need to) support inserting a nondebug
> +   // instruction between two existing debug instructions.
> +   gcc_assert (!after->is_debug_insn ());
> +
> +   // Find the next nondebug insn and update its previous pointer
> +   // to point to INSN.
> +   auto next_nondebug = next->last_debug_insn ()->next_any_insn ();
> +   gcc_checking_assert (!next_nondebug->is_debug_insn ());
> +   next_nondebug->set_prev_sametype_insn (insn);
> + }
>  
>// If AFTER and NEXT are separated by at least two points, we can
>// use a unique point number for INSN.  Otherwise INSN will have


Re: [PATCH 02/11] rtl-ssa: Add drop_memory_access helper

2023-10-18 Thread Richard Sandiford
Alex Coplan  writes:
> Add a helper routine to access-utils.h which removes the memory access
> from an access_array, if it has one.
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
>
> gcc/ChangeLog:
>
>   * rtl-ssa/access-utils.h (drop_memory_access): New.
> ---
>  gcc/rtl-ssa/access-utils.h | 11 +++
>  1 file changed, 11 insertions(+)
>
> diff --git a/gcc/rtl-ssa/access-utils.h b/gcc/rtl-ssa/access-utils.h
> index fb2c2d3..0c108b18bb8 100644
> --- a/gcc/rtl-ssa/access-utils.h
> +++ b/gcc/rtl-ssa/access-utils.h
> @@ -51,6 +51,17 @@ memory_access (T accesses) -> decltype (accesses[0])
>return nullptr;
>  }
>  
> +template<typename T>
> +inline T
> +drop_memory_access (T accesses)
> +{
> +  if (!memory_access (accesses))
> +return accesses;
> +
> +  access_array arr (accesses);
> +  return T (arr.begin (), accesses.size () - 1);
> +}

There ought to be a comment above the function.  OK with that change, thanks.
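
Something like this would do (wording only a suggestion, following the
comment style used further down in the file):

  // If sorted array ACCESSES includes a memory access, return a copy
  // of ACCESSES with that access dropped, otherwise return ACCESSES
  // unchanged.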

Richard

> +
>  // If sorted array ACCESSES includes a reference to REGNO, return the
>  // access, otherwise return null.
>  template<typename T>


Re: [PATCH 03/11] rtl-ssa: Add entry point to allow re-parenting uses

2023-10-18 Thread Richard Sandiford
Alex Coplan  writes:
> This is needed by the upcoming aarch64 load pair pass, as it can
> re-order stores (when alias analysis determines this is safe) and thus
> change which mem def a given use consumes (in the RTL-SSA view, there is
> no alias disambiguation of memory).
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
>
> gcc/ChangeLog:
>
>   * rtl-ssa/accesses.cc (function_info::reparent_use): New.
>   * rtl-ssa/functions.h (function_info): Declare new member
>   function reparent_use.

OK, thanks.

Richard

> ---
>  gcc/rtl-ssa/accesses.cc | 8 
>  gcc/rtl-ssa/functions.h | 3 +++
>  2 files changed, 11 insertions(+)
>
> diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
> index f12b5f4dd77..774ab9d99ee 100644
> --- a/gcc/rtl-ssa/accesses.cc
> +++ b/gcc/rtl-ssa/accesses.cc
> @@ -1239,6 +1239,14 @@ function_info::add_use (use_info *use)
>  insert_use_before (use, neighbor->value ());
>  }
>  
> +void
> +function_info::reparent_use (use_info *use, set_info *new_def)
> +{
> +  remove_use (use);
> +  use->set_def (new_def);
> +  add_use (use);
> +}
> +
>  // If USE has a known definition, remove USE from that definition's list
>  // of uses.  Also remove if it from the associated splay tree, if any.
>  void
> diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h
> index 8b53b264064..d7da9774213 100644
> --- a/gcc/rtl-ssa/functions.h
> +++ b/gcc/rtl-ssa/functions.h
> @@ -159,6 +159,9 @@ public:
>// Like change_insns, but for a single change CHANGE.
>void change_insn (insn_change &change);
>  
> +  // Given a use USE, re-parent it to get its def from NEW_DEF.
> +  void reparent_use (use_info *use, set_info *new_def);
> +
>// If the changes that have been made to instructions require updates
>// to the CFG, perform those updates now.  Return true if something 
> changed.
>// If it did:


Re: [PATCH 04/11] rtl-ssa: Support inferring uses of mem in change_insns

2023-10-18 Thread Richard Sandiford
Alex Coplan  writes:
> Currently, rtl_ssa::change_insns requires all new uses and defs to be
> specified explicitly.  This turns out to be rather inconvenient for
> forming load pairs in the new aarch64 load pair pass, as the pass has to
> determine which mem def the final load pair consumes, and then obtain or
> create a suitable use (i.e. significant bookkeeping, just to keep the
> RTL-SSA IR consistent).  It turns out to be much more convenient to
> allow change_insns to infer which def is consumed and create a suitable
> use of mem itself.  This patch does that.
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
>
> gcc/ChangeLog:
>
>   * rtl-ssa/changes.cc (function_info::finalize_new_accesses): Add new
>   parameter to give final insn position, infer use of mem if it isn't
>   specified explicitly.
>   (function_info::change_insns): Pass down final insn position to
>   finalize_new_accesses.
>   * rtl-ssa/functions.h: Add parameter to finalize_new_accesses.

OK, thanks.

Richard

> ---
>  gcc/rtl-ssa/changes.cc  | 31 ---
>  gcc/rtl-ssa/functions.h |  2 +-
>  2 files changed, 29 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
> index c48ddd2463c..523ad60d7d8 100644
> --- a/gcc/rtl-ssa/changes.cc
> +++ b/gcc/rtl-ssa/changes.cc
> @@ -370,8 +370,11 @@ update_insn_in_place (insn_change &change)
>  // Finalize the new list of definitions and uses in CHANGE, removing
>  // any uses and definitions that are no longer needed, and converting
>  // pending clobbers into actual definitions.
> +//
> +// POS gives the final position of INSN, which hasn't yet been moved into
> +// place.
>  void
> -function_info::finalize_new_accesses (insn_change &change)
> +function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
>  {
>insn_info *insn = change.insn ();
>  
> @@ -462,13 +465,34 @@ function_info::finalize_new_accesses (insn_change 
> &change)
>// Add (possibly temporary) uses to m_temp_uses for each resource.
>// If there are multiple references to the same resource, aggregate
>// information in the modes and flags.
> +  use_info *mem_use = nullptr;
>for (rtx_obj_reference ref : properties.refs ())
>  if (ref.is_read ())
>{
>   unsigned int regno = ref.regno;
>   machine_mode mode = ref.is_reg () ? ref.mode : BLKmode;
>   use_info *use = find_access (unshared_uses, ref.regno);
> - gcc_assert (use);
> + if (!use)
> +   {
> + // For now, we only support inferring uses of mem.
> + gcc_assert (regno == MEM_REGNO);
> +
> + if (mem_use)
> +   {
> + mem_use->record_reference (ref, false);
> + continue;
> +   }
> +
> + resource_info resource { mode, regno };
> + auto def = find_def (resource, pos).prev_def (pos);
> + auto set = safe_dyn_cast <set_info *> (def);
> + gcc_assert (set);
> + mem_use = allocate<use_info> (insn, resource, set);
> + mem_use->record_reference (ref, true);
> + m_temp_uses.safe_push (mem_use);
> + continue;
> +   }
> +
>   if (use->m_has_been_superceded)
> {
>   // This is the first reference to the resource.
> @@ -656,7 +680,8 @@ function_info::change_insns (array_slice 
> changes)
>  
> // Finalize the new list of accesses for the change.  Don't install
> // them yet, so that we still have access to the old lists below.
> -   finalize_new_accesses (change);
> +   finalize_new_accesses (change,
> +  placeholder ? placeholder : insn);
>   }
>placeholders[i] = placeholder;
>  }
> diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h
> index d7da9774213..73690a0e63b 100644
> --- a/gcc/rtl-ssa/functions.h
> +++ b/gcc/rtl-ssa/functions.h
> @@ -265,7 +265,7 @@ private:
>  
>insn_info *add_placeholder_after (insn_info *);
>void possibly_queue_changes (insn_change &);
> -  void finalize_new_accesses (insn_change &);
> +  void finalize_new_accesses (insn_change &, insn_info *);
>void apply_changes_to_insn (insn_change &);
>  
>void init_function_data ();


Re: [PATCH 07/11] aarch64, testsuite: Prevent stp in lr_free_1.c

2023-10-18 Thread Richard Sandiford
Alex Coplan  writes:
> The test is looking for individual stores which are able to be merged
> into stp instructions.  The test currently passes -fno-schedule-fusion
> -fno-peephole2, presumably to prevent these stores from being turned
> into stps, but this is no longer sufficient with the new ldp/stp fusion
> pass.
>
> As such, we add --param=aarch64-stp-policy=never to prevent stps being
> formed.
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/lr_free_1.c: Add
>   --param=aarch64-stp-policy=never to dg-options.

OK.  Thanks to Manos for adding this --param.

Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/lr_free_1.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/lr_free_1.c 
> b/gcc/testsuite/gcc.target/aarch64/lr_free_1.c
> index 50dcf04e697..9949061096e 100644
> --- a/gcc/testsuite/gcc.target/aarch64/lr_free_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/lr_free_1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do run } */
> -/* { dg-options "-fno-inline -O2 -fomit-frame-pointer -ffixed-x2 -ffixed-x3 
> -ffixed-x4 -ffixed-x5 -ffixed-x6 -ffixed-x7 -ffixed-x8 -ffixed-x9 -ffixed-x10 
> -ffixed-x11 -ffixed-x12 -ffixed-x13 -ffixed-x14 -ffixed-x15 -ffixed-x16 
> -ffixed-x17 -ffixed-x18 -ffixed-x19 -ffixed-x20 -ffixed-x21 -ffixed-x22 
> -ffixed-x23 -ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-28 
> -ffixed-29 --save-temps -mgeneral-regs-only -fno-ipa-cp -fno-schedule-fusion 
> -fno-peephole2" } */
> +/* { dg-options "-fno-inline -O2 -fomit-frame-pointer -ffixed-x2 -ffixed-x3 
> -ffixed-x4 -ffixed-x5 -ffixed-x6 -ffixed-x7 -ffixed-x8 -ffixed-x9 -ffixed-x10 
> -ffixed-x11 -ffixed-x12 -ffixed-x13 -ffixed-x14 -ffixed-x15 -ffixed-x16 
> -ffixed-x17 -ffixed-x18 -ffixed-x19 -ffixed-x20 -ffixed-x21 -ffixed-x22 
> -ffixed-x23 -ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-28 
> -ffixed-29 --save-temps -mgeneral-regs-only -fno-ipa-cp -fno-schedule-fusion 
> -fno-peephole2 --param=aarch64-stp-policy=never" } */
>  
>  extern void abort ();
>  


Re: [PATCH 08/11] aarch64, testsuite: Tweak sve/pcs/args_9.c to allow stps

2023-10-18 Thread Richard Sandiford
Alex Coplan  writes:
> With the new ldp/stp pass enabled, there is a change in the codegen for
> this test as follows:
>
> add x8, sp, 16
> ptrue   p3.h, mul3
> str p3, [x8]
> -   str x8, [sp, 8]
> -   str x9, [sp]
> +   stp x9, x8, [sp]
> ptrue   p3.d, vl8
> ptrue   p2.s, vl7
> ptrue   p1.h, vl6
>
> i.e. we now form an stp that we were missing previously. This patch
> adjusts the scan-assembler such that it should pass whether or not
> we form the stp.
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/pcs/args_9.c: Adjust scan-assemblers to
>   allow for stp.

OK, thanks.

Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/sve/pcs/args_9.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pcs/args_9.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/pcs/args_9.c
> index ad9affadf02..942a44ab448 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/pcs/args_9.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pcs/args_9.c
> @@ -45,5 +45,5 @@ caller (int64_t *x0, int16_t *x1, svbool_t p0)
>return svcntp_b8 (res, res);
>  }
>  
> -/* { dg-final { scan-assembler {\tptrue\t(p[0-9]+)\.b, mul3\n\tstr\t\1, 
> \[(x[0-9]+)\]\n.*\tstr\t\2, \[sp\]\n} } } */
> -/* { dg-final { scan-assembler {\tptrue\t(p[0-9]+)\.h, mul3\n\tstr\t\1, 
> \[(x[0-9]+)\]\n.*\tstr\t\2, \[sp, 8\]\n} } } */
> +/* { dg-final { scan-assembler {\tptrue\t(p[0-9]+)\.b, mul3\n\tstr\t\1, 
> \[(x[0-9]+)\]\n.*\t(?:str\t\2, \[sp\]|stp\t\2, x[0-9]+, \[sp\])\n} } } */
> +/* { dg-final { scan-assembler {\tptrue\t(p[0-9]+)\.h, mul3\n\tstr\t\1, 
> \[(x[0-9]+)\]\n.*\t(?:str\t\2, \[sp, 8\]|stp\tx[0-9]+, \2, \[sp\])\n} } } */


Re: [PATCH 09/11] aarch64, testsuite: Fix up pr71727.c

2023-10-18 Thread Richard Sandiford
Alex Coplan  writes:
> The test is trying to check that we don't use q-register stores with
> -mstrict-align, so actually check specifically for that.
>
> This is a prerequisite to avoid regressing:
>
> scan-assembler-not "add\tx0, x0, :"
>
> with the upcoming ldp fusion pass, as we change where the ldps are
> formed such that a register is used rather than a symbolic (lo_sum)
> address for the first load.
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/pr71727.c: Adjust scan-assembler-not to
>   make sure we don't have q-register stores with -mstrict-align.

OK, thanks.

Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/pr71727.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr71727.c 
> b/gcc/testsuite/gcc.target/aarch64/pr71727.c
> index 41fa72bc67e..226258a76fe 100644
> --- a/gcc/testsuite/gcc.target/aarch64/pr71727.c
> +++ b/gcc/testsuite/gcc.target/aarch64/pr71727.c
> @@ -30,4 +30,4 @@ _start (void)
>  }
>  
>  /* { dg-final { scan-assembler-times "mov\tx" 5 {target lp64} } } */
> -/* { dg-final { scan-assembler-not "add\tx0, x0, :" {target lp64} } } */
> +/* { dg-final { scan-assembler-not {st[rp]\tq[0-9]+} {target lp64} } } */


Re: [PATCH 10/11] aarch64: Generalise TFmode load/store pair patterns

2023-10-18 Thread Richard Sandiford
Alex Coplan  writes:
> This patch generalises the TFmode load/store pair patterns to TImode and
> TDmode.  This brings them in line with the DXmode patterns, and uses the
> same technique with separate mode iterators (TX and TX2) to allow for
> distinct modes in each arm of the load/store pair.
>
> For example, in combination with the post-RA load/store pair fusion pass
> in the following patch, this improves the codegen for the following
> varargs testcase involving TImode stores:
>
> void g(void *);
> int foo(int x, ...)
> {
> __builtin_va_list ap;
> __builtin_va_start (ap, x);
> g(&ap);
> __builtin_va_end (ap);
> }
>
> from:
>
> foo:
> .LFB0:
>   stp x29, x30, [sp, -240]!
> .LCFI0:
>   mov w9, -56
>   mov w8, -128
>   mov x29, sp
>   add x10, sp, 176
>   stp x1, x2, [sp, 184]
>   add x1, sp, 240
>   add x0, sp, 16
>   stp x1, x1, [sp, 16]
>   str x10, [sp, 32]
>   stp w9, w8, [sp, 40]
>   str q0, [sp, 48]
>   str q1, [sp, 64]
>   str q2, [sp, 80]
>   str q3, [sp, 96]
>   str q4, [sp, 112]
>   str q5, [sp, 128]
>   str q6, [sp, 144]
>   str q7, [sp, 160]
>   stp x3, x4, [sp, 200]
>   stp x5, x6, [sp, 216]
>   str x7, [sp, 232]
>   bl  g
>   ldp x29, x30, [sp], 240
> .LCFI1:
>   ret
>
> to:
>
> foo:
> .LFB0:
>   stp x29, x30, [sp, -240]!
> .LCFI0:
>   mov w9, -56
>   mov w8, -128
>   mov x29, sp
>   add x10, sp, 176
>   stp x1, x2, [sp, 184]
>   add x1, sp, 240
>   add x0, sp, 16
>   stp x1, x1, [sp, 16]
>   str x10, [sp, 32]
>   stp w9, w8, [sp, 40]
>   stp q0, q1, [sp, 48]
>   stp q2, q3, [sp, 80]
>   stp q4, q5, [sp, 112]
>   stp q6, q7, [sp, 144]
>   stp x3, x4, [sp, 200]
>   stp x5, x6, [sp, 216]
>   str x7, [sp, 232]
>   bl  g
>   ldp x29, x30, [sp], 240
> .LCFI1:
>   ret
>
> Note that this patch isn't needed if we only use the mode
> canonicalization approach in the new ldp fusion pass (since we
> canonicalize T{I,F,D}mode to V16QImode), but we seem to get slightly
> better performance with mode canonicalization disabled (see
> --param=aarch64-ldp-canonicalize-modes in the following patch).
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.md (load_pair_dw_tftf): Rename to ...
>   (load_pair_dw_<TX:mode><TX2:mode>): ... this.
>   (store_pair_dw_tftf): Rename to ...
>   (store_pair_dw_<TX:mode><TX2:mode>): ... this.
>   * config/aarch64/iterators.md (TX2): New.

OK, thanks.  It would be nice to investigate & fix the reasons for
the regressions with canonicalised modes, but I agree that this patch
is a strict improvement, since it fixes a hole in the current scheme.

Richard

> ---
>  gcc/config/aarch64/aarch64.md   | 22 +++---
>  gcc/config/aarch64/iterators.md |  3 +++
>  2 files changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 32c7adc8928..e6af09c2e8b 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1757,16 +1757,16 @@ (define_insn "load_pair_dw_"
>}
>  )
>  
> -(define_insn "load_pair_dw_tftf"
> -  [(set (match_operand:TF 0 "register_operand" "=w")
> - (match_operand:TF 1 "aarch64_mem_pair_operand" "Ump"))
> -   (set (match_operand:TF 2 "register_operand" "=w")
> - (match_operand:TF 3 "memory_operand" "m"))]
> +(define_insn "load_pair_dw_<TX:mode><TX2:mode>"
> +  [(set (match_operand:TX 0 "register_operand" "=w")
> + (match_operand:TX 1 "aarch64_mem_pair_operand" "Ump"))
> +   (set (match_operand:TX2 2 "register_operand" "=w")
> + (match_operand:TX2 3 "memory_operand" "m"))]
> "TARGET_SIMD
>  && rtx_equal_p (XEXP (operands[3], 0),
>   plus_constant (Pmode,
>  XEXP (operands[1], 0),
> -GET_MODE_SIZE (TFmode)))"
> +GET_MODE_SIZE (<TX:MODE>mode)))"
>"ldp\\t%q0, %q2, %z1"
>[(set_attr "type" "neon_ldp_q")
> (set_attr "fp" "yes")]
> @@ -1805,11 +1805,11 @@ (define_insn "store_pair_dw_"
>}
>  )
>  
> -(define_insn "store_pair_dw_tftf"
> -  [(set (match_operand:TF 0 "aarch64_mem_pair_operand" "=Ump")
> - (match_operand:TF 1 "register_operand" "w"))
> -   (set (match_operand:TF 2 "memory_operand" "=m")
> - (match_operand:TF 3 "register_operand" "w"))]
> +(define_insn "store_pair_dw_<TX:mode><TX2:mode>"
> +  [(set (match_operand:TX 0 "aarch64_mem_pair_operand" "=Ump")
> + (match_operand:TX 1 "register_operand" "w"))
> +   (set (match_operand:TX2 2 "memory_operand" "=m")
> + (match_operand:TX2 3 "register_operand" "w"))]
> "TARGET_SIMD &&
>  rtx_equal_p (XEXP (operands[2], 0),
>pl

Re: [PATCH V2 3/7] aarch64: Implement system register validation tools

2023-10-18 Thread Richard Sandiford
Generally looks really good.  Some comments below.

Victor Do Nascimento  writes:
> Given the implementation of a mechanism of encoding system registers
> into GCC, this patch provides the mechanism of validating their use by
> the compiler.  In particular, this involves:
>
>   1. Ensuring a supplied string corresponds to a known system
>  register name.  System registers can be accessed either via their
>  name (e.g. `SPSR_EL1') or their encoding (e.g. `S3_0_C4_C0_0').
>  Register names are validated using a hash map, mapping known
>  system register names to their corresponding `sysreg_t' structs,
>  which are populated from the `aarch64_system_regs.def' file.
>  Register name validation is done via `lookup_sysreg_map', while
>  the encoding naming convention is validated via a parser
>  implemented in this patch - `is_implem_def_reg'.
>   2. Once a given register name is deemed to be valid, it is checked
>  against a further 2 criteria:
>a. Is the referenced register implemented in the target
>   architecture?  This is achieved by comparing the ARCH field
> in the relevant SYSREG entry from `aarch64_system_regs.def'
> against `aarch64_feature_flags' flags set at compile-time.
>b. Is the register being used correctly?  Check the requested
> operation against the FLAGS specified in SYSREG.
> This prevents operations like writing to a read-only system
> register.
>
> gcc/ChangeLog:
>
>   * gcc/config/aarch64/aarch64-protos.h (aarch64_valid_sysreg_name_p): 
> New.
>   (aarch64_retrieve_sysreg): Likewise.
>   * gcc/config/aarch64/aarch64.cc (is_implem_def_reg): Likewise.
>   (aarch64_valid_sysreg_name_p): Likewise.
>   (aarch64_retrieve_sysreg): Likewise.
>   (aarch64_register_sysreg): Likewise.
>   (aarch64_init_sysregs): Likewise.
>   (aarch64_lookup_sysreg_map): Likewise.
>   * gcc/config/aarch64/predicates.md (aarch64_sysreg_string): New.
> ---
>  gcc/config/aarch64/aarch64-protos.h |   2 +
>  gcc/config/aarch64/aarch64.cc   | 146 
>  gcc/config/aarch64/predicates.md|   4 +
>  3 files changed, 152 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 60a55f4bc19..a134e2fcf8e 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -830,6 +830,8 @@ bool aarch64_simd_shift_imm_p (rtx, machine_mode, bool);
>  bool aarch64_sve_ptrue_svpattern_p (rtx, struct simd_immediate_info *);
>  bool aarch64_simd_valid_immediate (rtx, struct simd_immediate_info *,
>   enum simd_immediate_check w = AARCH64_CHECK_MOV);
> +bool aarch64_valid_sysreg_name_p (const char *);
> +const char *aarch64_retrieve_sysreg (char *, bool);
>  rtx aarch64_check_zero_based_sve_index_immediate (rtx);
>  bool aarch64_sve_index_immediate_p (rtx);
>  bool aarch64_sve_arith_immediate_p (machine_mode, rtx, bool);
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 69de2366424..816c4b69fc8 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -85,6 +85,7 @@
>  #include "config/arm/aarch-common.h"
>  #include "config/arm/aarch-common-protos.h"
>  #include "ssa.h"
> +#include "hash-map.h"
>  
>  /* This file should be included last.  */
>  #include "target-def.h"
> @@ -2845,6 +2846,52 @@ const sysreg_t sysreg_structs[] =
>  const unsigned nsysreg = TOTAL_ITEMS;
>  #undef TOTAL_ITEMS
>  
> +using sysreg_map_t = hash_map;
> +static sysreg_map_t *sysreg_map = nullptr;

One concern with static, non-GTY, runtime-initialised data is "does it
work with PCH?".  I suspect it does, since all uses of the map go through
aarch64_lookup_sysreg_map, and since nothing seems to rely on persistent
pointer values.  But it would be good to have a PCH test just to make sure.

I'm thinking of something like the tests in gcc/testsuite/gcc.dg/pch.
The header file (.hs) would define a function that does sysreg reads
and writes.  When the .hs is included from the .c file, the reads and
writes would be imported through a PCH load, rather than through the
normal frontend route.

> +
> +/* Map system register names to their hardware metadata: Encoding,

s/Encoding/encoding/

> +   feature flags and architectural feature requirements, all of which
> +   are encoded in a sysreg_t struct.  */
> +void
> +aarch64_register_sysreg (const char *name, const sysreg_t *metadata)
> +{
> +  bool dup = sysreg_map->put (name, metadata);
> +  gcc_checking_assert (!dup);
> +}
> +
> +/* Lazily initialize hash table for system register validation,
> +   checking the validity of supplied register name and returning
> +   register's associated metadata.  */
> +static void
> +aarch64_init_sysregs (void)
> +{
> +  gcc_assert (!sysreg_map);
> +  sysreg_map = new sysreg_map_t;
> +  gcc_assert (sysreg_map);

This assert seems redundant.  new

Re: [PATCH V2 5/7] aarch64: Implement system register r/w arm ACLE intrinsic functions

2023-10-18 Thread Richard Sandiford
Victor Do Nascimento  writes:
> Implement the aarch64 intrinsics for reading and writing system
> registers with the following signatures:
>
>   uint32_t __arm_rsr(const char *special_register);
>   uint64_t __arm_rsr64(const char *special_register);
>   void* __arm_rsrp(const char *special_register);
>   float __arm_rsrf(const char *special_register);
>   double __arm_rsrf64(const char *special_register);
>   void __arm_wsr(const char *special_register, uint32_t value);
>   void __arm_wsr64(const char *special_register, uint64_t value);
>   void __arm_wsrp(const char *special_register, const void *value);
>   void __arm_wsrf(const char *special_register, float value);
>   void __arm_wsrf64(const char *special_register, double value);
>
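For illustration only, typical uses of the 64-bit variants would look
like:

  #include <arm_acle.h>

  unsigned long long
  get_tpidr (void)
  {
    return __arm_rsr64 ("tpidr_el0");
  }

  void
  set_tpidr (unsigned long long x)
  {
    __arm_wsr64 ("tpidr_el0", x);
  }

with the register picked out either by name, as here, or by its generic
encoding string.
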
> gcc/ChangeLog:
>
>   * gcc/config/aarch64/aarch64-builtins.cc (enum aarch64_builtins):
>   Add enums for new builtins.
>   (aarch64_init_rwsr_builtins): New.
>   (aarch64_general_init_builtins): Call aarch64_init_rwsr_builtins.
>   (aarch64_expand_rwsr_builtin):  New.
>   (aarch64_general_expand_builtin): Call aarch64_general_expand_builtin.
>   * gcc/config/aarch64/aarch64.md (read_sysregdi): New insn_and_split.
>   (write_sysregdi): Likewise.
>   * gcc/config/aarch64/arm_acle.h (__arm_rsr): New.
>   (__arm_rsrp): Likewise.
>   (__arm_rsr64): Likewise.
>   (__arm_rsrf): Likewise.
>   (__arm_rsrf64): Likewise.
>   (__arm_wsr): Likewise.
>   (__arm_wsrp): Likewise.
>   (__arm_wsr64): Likewise.
>   (__arm_wsrf): Likewise.
>   (__arm_wsrf64): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc/testsuite/gcc.target/aarch64/acle/rwsr.c: New.
>   * gcc/testsuite/gcc.target/aarch64/acle/rwsr-1.c: Likewise.
> ---
>  gcc/config/aarch64/aarch64-builtins.cc| 200 ++
>  gcc/config/aarch64/aarch64.md |  17 ++
>  gcc/config/aarch64/arm_acle.h |  30 +++
>  .../gcc.target/aarch64/acle/rwsr-1.c  |  20 ++
>  gcc/testsuite/gcc.target/aarch64/acle/rwsr.c  | 144 +
>  5 files changed, 411 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/acle/rwsr-1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/acle/rwsr.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 04f59fd9a54..d8bb2a989a5 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -808,6 +808,17 @@ enum aarch64_builtins
>AARCH64_RBIT,
>AARCH64_RBITL,
>AARCH64_RBITLL,
> +  /* System register builtins.  */
> +  AARCH64_RSR,
> +  AARCH64_RSRP,
> +  AARCH64_RSR64,
> +  AARCH64_RSRF,
> +  AARCH64_RSRF64,
> +  AARCH64_WSR,
> +  AARCH64_WSRP,
> +  AARCH64_WSR64,
> +  AARCH64_WSRF,
> +  AARCH64_WSRF64,
>AARCH64_BUILTIN_MAX
>  };
>  
> @@ -1798,6 +1809,65 @@ aarch64_init_rng_builtins (void)
>  AARCH64_BUILTIN_RNG_RNDRRS);
>  }
>  
> +/* Add builtins for reading system register.  */
> +static void
> +aarch64_init_rwsr_builtins (void)
> +{
> +  tree fntype = NULL;
> +  tree const_char_ptr_type
> += build_pointer_type (build_type_variant (char_type_node, true, false));
> +
> +#define AARCH64_INIT_RWSR_BUILTINS_DECL(F, N, T) \
> +  aarch64_builtin_decls[AARCH64_##F] \
> += aarch64_general_add_builtin ("__builtin_aarch64_"#N, T, AARCH64_##F);
> +
> +  fntype
> += build_function_type_list (uint32_type_node, const_char_ptr_type, NULL);
> +  AARCH64_INIT_RWSR_BUILTINS_DECL (RSR, rsr, fntype);
> +
> +  fntype
> += build_function_type_list (ptr_type_node, const_char_ptr_type, NULL);
> +  AARCH64_INIT_RWSR_BUILTINS_DECL (RSRP, rsrp, fntype);
> +
> +  fntype
> += build_function_type_list (uint64_type_node, const_char_ptr_type, NULL);
> +  AARCH64_INIT_RWSR_BUILTINS_DECL (RSR64, rsr64, fntype);
> +
> +  fntype
> += build_function_type_list (float_type_node, const_char_ptr_type, NULL);
> +  AARCH64_INIT_RWSR_BUILTINS_DECL (RSRF, rsrf, fntype);
> +
> +  fntype
> += build_function_type_list (double_type_node, const_char_ptr_type, NULL);
> +  AARCH64_INIT_RWSR_BUILTINS_DECL (RSRF64, rsrf64, fntype);
> +
> +  fntype
> += build_function_type_list (void_type_node, const_char_ptr_type,
> + uint32_type_node, NULL);
> +
> +  AARCH64_INIT_RWSR_BUILTINS_DECL (WSR, wsr, fntype);
> +
> +  fntype
> += build_function_type_list (void_type_node, const_char_ptr_type,
> + const_ptr_type_node, NULL);
> +  AARCH64_INIT_RWSR_BUILTINS_DECL (WSRP, wsrp, fntype);
> +
> +  fntype
> += build_function_type_list (void_type_node, const_char_ptr_type,
> + uint64_type_node, NULL);
> +  AARCH64_INIT_RWSR_BUILTINS_DECL (WSR64, wsr64, fntype);
> +
> +  fntype
> += build_function_type_list (void_type_node, const_char_ptr_type,
> + float_type_node, NULL);
> +  AARCH64_INIT

Re: [PATCH V2 2/7] aarch64: Add support for aarch64-sys-regs.def

2023-10-18 Thread Richard Sandiford
Victor Do Nascimento  writes:
> This patch defines the structure of a new .def file used for
> representing the aarch64 system registers, what information it should
> hold and the basic framework in GCC to process this file.
>
> Entries in the aarch64-system-regs.def file should be as follows:
>
>   SYSREG (NAME, CPENC (sn,op1,cn,cm,op2), FLAG1 | ... | FLAGn, ARCH)
>
> Where the arguments to SYSREG correspond to:
>   - NAME:  The system register name, as used in the assembly language.
>   - CPENC: The system register encoding, mapping to:
>
>  s<sn>_<op1>_c<cn>_c<cm>_<op2>
>
>   - FLAG: The entries in the FLAGS field are bitwise-OR'd together to
> encode extra information required to ensure proper use of
> the system register.  For example, a read-only system
> register will have the flag F_REG_READ, while write-only
> registers will be labeled F_REG_WRITE.  Such flags are
> tested against at compile-time.
>   - ARCH: The architectural features the system register is associated
> with.  This is encoded via one of three possible macros:
> 1. When a system register is universally implemented, we say
> it has no feature requirements, so we tag it with the
> AARCH64_NO_FEATURES macro.
> 2. When a register is only implemented for a single
> architectural extension EXT, the AARCH64_FEATURE (EXT), is
> used.
> 3. When a given system register is made available by any of N
> possible architectural extensions, the AARCH64_FEATURES(N, ...)
> macro is used to combine them accordingly.
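
For instance, an entry for SPSR_EL1 (the FLAGS value here is purely
illustrative) would look something like:

  SYSREG ("spsr_el1", CPENC (3,0,4,0,0), F_REG_READ | F_REG_WRITE,
	  AARCH64_NO_FEATURES)

where CPENC (3,0,4,0,0) corresponds to the s3_0_c4_c0_0 encoding form.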
>
> In order to enable proper interpretation of the SYSREG entries by the
> compiler, flags defining system register behavior such as `F_REG_READ'
> and `F_REG_WRITE' are also defined here, so they can later be used for
> the validation of system register properties.
>
> Finally, any architectural feature flags from Binutils missing from GCC
> have appropriate aliases defined here so as to ensure
> cross-compatibility of SYSREG entries across the toolchain.
>
> gcc/ChangeLog:
>
>   * gcc/config/aarch64/aarch64.cc (sysreg_t): New.
>   (sysreg_structs): Likewise.
>   (nsysreg): Likewise.
>   (AARCH64_FEATURE): Likewise.
>   (AARCH64_FEATURES): Likewise.
>   (AARCH64_NO_FEATURES): Likewise.
>   * gcc/config/aarch64/aarch64.h (AARCH64_ISA_V8A): Add missing
>   ISA flag.
>   (AARCH64_ISA_V8_1A): Likewise.
>   (AARCH64_ISA_V8_7A): Likewise.
>   (AARCH64_ISA_V8_8A): Likewise.
>   (AARCH64_NO_FEATURES): Likewise.
>   (AARCH64_FL_RAS): New ISA flag alias.
>   (AARCH64_FL_LOR): Likewise.
>   (AARCH64_FL_PAN): Likewise.
>   (AARCH64_FL_AMU): Likewise.
>   (AARCH64_FL_SCXTNUM): Likewise.
>   (AARCH64_FL_ID_PFR2): Likewise.
>   (F_DEPRECATED): New.
>   (F_REG_READ): Likewise.
>   (F_REG_WRITE): Likewise.
>   (F_ARCHEXT): Likewise.
>   (F_REG_ALIAS): Likewise.
> ---
>  gcc/config/aarch64/aarch64.cc | 38 +++
>  gcc/config/aarch64/aarch64.h  | 36 +
>  2 files changed, 74 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 9fbfc548a89..69de2366424 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -2807,6 +2807,44 @@ static const struct processor all_cores[] =
>{NULL, aarch64_none, aarch64_none, aarch64_no_arch, 0, NULL}
>  };
>  
> +typedef struct {
> +  const char* name;
> +  const char* encoding;

Formatting nit, but GCC style is:

  const char *foo

rather than:

  const char* foo;

> +  const unsigned properties;
> +  const unsigned long long arch_reqs;

I don't think these two should be const.  There's no reason in principle
why a sysreg_t can't be created and modified dynamically.

It would be useful to have some comments above the fields to say what
they represent.  E.g. the definition on its own doesn't make clear what
"properties" refers to.

arch_reqs should use aarch64_feature_flags rather than unsigned long long.
We're running out of feature flags in GCC too, so aarch64_feature_flags
is soon likely to be a C++ class.
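
I.e. roughly (a sketch only, with the comment wording made up):

  typedef struct {
    /* The register name, as used in the assembly language.  */
    const char *name;
    /* The underlying encoding, as a string (e.g. "s3_0_c4_c0_0").  */
    const char *encoding;
    /* F_* flags describing how the register may be used.  */
    unsigned properties;
    /* Architectural features required for the register to be available.  */
    aarch64_feature_flags arch_reqs;
  } sysreg_t;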

> +} sysreg_t;
> +
> +/* An aarch64_feature_set initializer for a single feature,
> +   AARCH64_FEATURE_.  */
> +#define AARCH64_FEATURE(FEAT) AARCH64_FL_##FEAT
> +
> +/* Used by AARCH64_FEATURES.  */
> +#define AARCH64_OR_FEATURES_1(X, F1) \
> +  AARCH64_FEATURE (F1)
> +#define AARCH64_OR_FEATURES_2(X, F1, F2) \
> +  (AARCH64_FEATURE (F1) | AARCH64_OR_FEATURES_1 (X, F2))
> +#define AARCH64_OR_FEATURES_3(X, F1, ...) \
> +  (AARCH64_FEATURE (F1) | AARCH64_OR_FEATURES_2 (X, __VA_ARGS__))
> +
> +/* An aarch64_feature_set initializer for the N features listed in "...".  */
> +#define AARCH64_FEATURES(N, ...) \
> +  AARCH64_OR_FEATURES_##N (0, __VA_ARGS__)
> +
> +/* Database of system registers, their encodings and architectural
> +   requirements.  */
> +const sysreg_t sysreg_struct

Re: [PATCH V2 4/7] aarch64: Add basic target_print_operand support for CONST_STRING

2023-10-18 Thread Richard Sandiford
Victor Do Nascimento  writes:
> Motivated by the need to print system register names in output
> assembly, this patch adds the required logic to
> `aarch64_print_operand' to accept rtxs of type CONST_STRING and
> process these accordingly.
>
> Consequently, an rtx such as:
>
>   (set (reg/i:DI 0 x0)
>  (unspec:DI [(const_string ("s3_3_c13_c2_2"))])
>
> can now be output correctly using the following output pattern when
> composing `define_insn's:
>
>   "mrs\t%x0, %1"
>
> gcc/ChangeLog
>
>   * gcc/config/aarch64/aarch64.cc (aarch64_print_operand): Add
>   support for CONST_STRING.
> ---
>  gcc/config/aarch64/aarch64.cc | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 816c4b69fc8..d187e171beb 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -12430,6 +12430,12 @@ aarch64_print_operand (FILE *f, rtx x, int code)
>  
>switch (GET_CODE (x))
>   {
> + case CONST_STRING:
> +   {
> + const char *output_op = XSTR (x, 0);
> + asm_fprintf (f, "%s", output_op);
> + break;
> +   }

LGTM, but it seems slightly neater to avoid the temporary:

case CONST_STRING:
  asm_fprintf (f, "%s", XSTR (x, 0));
  break;

(Sorry for the micro-comment.)

Thanks,
Richard

>   case REG:
> if (aarch64_sve_data_mode_p (GET_MODE (x)))
>   {


Re: [PATCH V2 6/7] aarch64: Add front-end argument type checking for target builtins

2023-10-18 Thread Richard Sandiford
Victor Do Nascimento  writes:
> In implementing the ACLE read/write system register builtins it was
> observed that leaving argument type checking to be done at expand-time
> meant that poorly-formed function calls were being "fixed" by certain
> optimization passes, meaning bad code wasn't being properly picked up
> in checking.
>
> Example:
>
>   const char *regname = "amcgcr_el0";
>   long long a = __builtin_aarch64_rsr64 (regname);
>
> is reduced by the ccp1 pass to
>
>   long long a = __builtin_aarch64_rsr64 ("amcgcr_el0");
>
> As these functions require an argument of STRING_CST type, there needs
> to be a check carried out by the front-end capable of picking this up.
>
> The introduced `check_general_builtin_call' function will be called by
> the TARGET_CHECK_BUILTIN_CALL hook whenever a call to a builtin
> belonging to the AARCH64_BUILTIN_GENERAL category is encountered,
> carrying out any appropriate checks associated with a particular
> builtin function code.
>
> gcc/ChangeLog:
>
>   * gcc/config/aarch64/aarch64-builtins.cc (check_general_builtin_call):
>   New.
>   * gcc/config/aarch64/aarch64-c.cc (aarch64_check_builtin_call):
>   Add check_general_builtin_call call.
>   * gcc/config/aarch64/aarch64-protos.h (check_general_builtin_call):
>   New.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc/testsuite/gcc.target/aarch64/acle/rwsr-2.c: New.
> ---
>  gcc/config/aarch64/aarch64-builtins.cc| 33 +++
>  gcc/config/aarch64/aarch64-c.cc   |  4 +--
>  gcc/config/aarch64/aarch64-protos.h   |  3 ++
>  .../gcc.target/aarch64/acle/rwsr-2.c  | 15 +
>  4 files changed, 53 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/acle/rwsr-2.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index d8bb2a989a5..6734361f4f4 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -2126,6 +2126,39 @@ aarch64_general_builtin_decl (unsigned code, bool)
>return aarch64_builtin_decls[code];
>  }
>  
> +bool
> +check_general_builtin_call (location_t location, vec,
> + unsigned int code, tree fndecl,
> + unsigned int nargs ATTRIBUTE_UNUSED, tree *args)
> +{

How about aarch64_general_check_builtin_call?  It's better to use
aarch64_* prefixes where possible.

> +  switch (code)
> +{
> +case AARCH64_RSR:
> +case AARCH64_RSRP:
> +case AARCH64_RSR64:
> +case AARCH64_RSRF:
> +case AARCH64_RSRF64:
> +case AARCH64_WSR:
> +case AARCH64_WSRP:
> +case AARCH64_WSR64:
> +case AARCH64_WSRF:
> +case AARCH64_WSRF64:
> +  if (TREE_CODE (args[0]) == VAR_DECL
> +   || TREE_CODE (TREE_TYPE (args[0])) != POINTER_TYPE
> +   || TREE_CODE (TREE_OPERAND (TREE_OPERAND (args[0], 0) , 0))
> +   != STRING_CST)

Similarly to the expand code in 5/7, I think this should check
positively for specific tree codes rather than negatively for a
VAR_DECL.  That is, we should ensure TREE_CODE (x) is something
(rather than isn't something) before accessing TREE_OPERAND (x, 0).

> + {
> +   const char  *fn_name, *err_msg;
> +   fn_name = IDENTIFIER_POINTER (DECL_NAME (fndecl));
> +   err_msg = "first argument to %<%s%> must be a string literal";
> +   error_at (location, err_msg, fn_name);

The error message needs to remain part of the error_at call,
since being in error_at ensures that it gets picked up for translation.
It's simpler to use %qD rather than %<%s%>, and pass fndecl directly.
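
I.e. something like (a sketch only; the exact tree shape of the argument
might need an extra level of stripping):

  if (TREE_CODE (args[0]) != ADDR_EXPR
      || TREE_CODE (TREE_OPERAND (args[0], 0)) != STRING_CST)
    {
      error_at (location, "first argument to %qD must be a string literal",
		fndecl);
      return false;
    }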

> +   return false;
> + }
> +}
> +  /* Default behavior.  */
> +  return true;
> +}
> +
>  typedef enum
>  {
>SIMD_ARG_COPY_TO_REG,
> diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
> index ab8844f6049..c2a9a59df73 100644
> --- a/gcc/config/aarch64/aarch64-c.cc
> +++ b/gcc/config/aarch64/aarch64-c.cc
> @@ -339,8 +339,8 @@ aarch64_check_builtin_call (location_t loc, 
> vec arg_loc,
>switch (code & AARCH64_BUILTIN_CLASS)
>  {
>  case AARCH64_BUILTIN_GENERAL:
> -  return true;
> -
> +  return check_general_builtin_call (loc, arg_loc, subcode, orig_fndecl,
> +  nargs, args);
>  case AARCH64_BUILTIN_SVE:
>return aarch64_sve::check_builtin_call (loc, arg_loc, subcode,
> orig_fndecl, nargs, args);
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index a134e2fcf8e..9ef96ff511f 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -990,6 +990,9 @@ tree aarch64_general_builtin_rsqrt (unsigned int);
>  void handle_arm_acle_h (void);
>  void handle_arm_neon_h (void);
>  
> +bool check_general_builtin_call (location_t, vec, unsigned int,
> +   tree, unsigned int, tree

Re: [PATCH V2 7/7] aarch64: Add system register duplication check selftest

2023-10-18 Thread Richard Sandiford
Victor Do Nascimento  writes:
> Add a build-time test to check whether system register data, as
> imported from `aarch64-sys-reg.def' has any duplicate entries.
>
> Duplicate entries are defined as any two SYSREG entries in the .def
> file which share the same encoding values (as specified by its `CPENC'
> field) and where the relationship amongst the two does not fit into
> one of the following categories:
>
>   * Simple aliasing: In some cases, it is observed that one
>   register name serves as an alias to another.  One example of
>   this is where TRCEXTINSELR aliases TRCEXTINSELR0.
>   * Expressing intent: It is possible that when a given register
>   serves two distinct functions depending on how it is used, it
>   is given two distinct names whose use should match the context
>   under which it is being used.  Example:  Debug Data Transfer
>   Register. When used to receive data, it should be accessed as
>   DBGDTRRX_EL0 while when transmitting data it should be
>   accessed via DBGDTRTX_EL0.
>   * Register deprecation: Some register names have been
>   deprecated and should no longer be used, but backwards-
>   compatibility requires that such names continue to be
>   recognized, as is the case for the SPSR_EL1 register, whose
>   access via the SPSR_SVC name is now deprecated.
>   * Same encoding different target: Some encodings are given
>   different meaning depending on the target architecture and, as
> such, are given different names in each of these contexts.
>   We see an example of this for CPENC(3,4,2,0,0), which
>   corresponds to TTBR0_EL2 for Armv8-A targets and VSCTLR_EL2
>   in Armv8-R targets.
>
> A consequence of these observations is that `CPENC' duplication is
> acceptable iff at least one of the `properties' or `arch_reqs' fields
> of the `sysreg_t' structs associated with the two registers in
> question differ and it's this condition that is checked by the new
> `aarch64_test_sysreg_encoding_clashes' function.
>
> gcc/ChangeLog:
>
>   * gcc/config/aarch64/aarch64.cc
>   (aarch64_test_sysreg_encoding_clashes): New.
>   (aarch64_run_selftests): add call to
>   aarch64_test_sysreg_encoding_clashes selftest.
> ---
>  gcc/config/aarch64/aarch64.cc | 53 +++
>  1 file changed, 53 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index d187e171beb..e0be2877ede 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22,6 +22,7 @@
>  
>  #define INCLUDE_STRING
>  #define INCLUDE_ALGORITHM
> +#define INCLUDE_VECTOR
>  #include "config.h"
>  #include "system.h"
>  #include "coretypes.h"
> @@ -28332,6 +28333,57 @@ aarch64_test_fractional_cost ()
>ASSERT_EQ (cf (1, 2).as_double (), 0.5);
>  }
>  
> +/* Calculate whether our system register data, as imported from
> +   `aarch64-sys-reg.def' has any duplicate entries.  */
> +static void
> +aarch64_test_sysreg_encoding_clashes (void)
> +{
> +  using dup_counters_t = hash_map;
> +  using dup_instances_t = hash_map +std::vector>;
> +
> +  dup_counters_t duplicate_counts;
> +  dup_instances_t duplicate_instances;
> +
> +  /* Every time an encoding is established to come up more than once
> +  we add it to a "clash-analysis queue", which is then used to extract
> +  necessary information from our hash map when establishing whether
> +  repeated encodings are valid.  */

Formatting nit, sorry, but second and subsequent lines should be
indented to line up with the "E".

> +
> +  /* 1) Collect recurrence information.  */
> +  std::vector testqueue;
> +
> +  for (unsigned i = 0; i < nsysreg; i++)
> +{
> +  const sysreg_t *reg = sysreg_structs + i;
> +
> +  unsigned *tbl_entry = &duplicate_counts.get_or_insert (reg->encoding);
> +  *tbl_entry += 1;
> +
> +  std::vector *tmp
> + = &duplicate_instances.get_or_insert (reg->encoding);
> +
> +  tmp->push_back (reg);
> +  if (*tbl_entry > 1)
> +   testqueue.push_back (reg->encoding);
> +}

Do we need two hash maps here?  It looks like the length of the vector
is always equal to the count.  Also...

> +
> +  /* 2) Carry out analysis on collected data.  */
> +  for (auto enc : testqueue)

...hash_map itself is iterable.  We could iterate over that instead,
which would avoid the need for the queue.
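
E.g. something like (untested, and assuming hash_map's iterator yields
(key, value) pairs, which is worth double-checking):

  for (auto kv : duplicate_instances)
    {
      auto &regs = kv.second;	// all sysreg_t entries with this encoding
      for (unsigned i = 0; i < regs.size (); i++)
	for (unsigned j = i + 1; j < regs.size (); j++)
	  ASSERT_TRUE (regs[i]->properties != regs[j]->properties
		       || regs[i]->arch_reqs != regs[j]->arch_reqs);
    }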

> +{
> +  unsigned nrep = *duplicate_counts.get (enc);
> +  for (unsigned i = 0; i < nrep; i++)
> + for (unsigned j = i+1; j < nrep; j++)

Formatting nit, but "i + 1" rather than "i+1".

Overall, it looks like really nice work.  Thanks for doing this.

Richard

> +   {
> + std::vector *tmp2 = duplicate_instances.get (enc);
> + const sysreg_t *a = (*tmp2)[i];
> + const sysreg_t *b = (*tmp2)[j];
> + ASSERT_TRUE ((a->properties != b->properties)
> +  || (a->arch_req

Re: [PATCH, aarch64 3/4] aarch64: Add movprfx alternatives for predicate patterns

2018-07-02 Thread Richard Sandiford
Richard Henderson  writes:
> @@ -2687,34 +2738,60 @@
>aarch64_sve_prepare_conditional_op (operands, 5, );
>  })
>  
> -;; Predicated floating-point operations.
> -(define_insn "*cond_"
> -  [(set (match_operand:SVE_F 0 "register_operand" "=w")
> +;; Predicated floating-point operations with select matching output.
> +(define_insn "*cond__0"
> +  [(set (match_operand:SVE_F 0 "register_operand" "+w, w, ?&w")
>   (unspec:SVE_F
> -   [(match_operand: 1 "register_operand" "Upl")
> +   [(match_operand: 1 "register_operand" "Upl, Upl, Upl")
>  (unspec:SVE_F
> -  [(match_operand:SVE_F 2 "register_operand" "0")
> -   (match_operand:SVE_F 3 "register_operand" "w")]
> +  [(match_dup 1)
> +   (match_operand:SVE_F 2 "register_operand" "0, w, w")
> +   (match_operand:SVE_F 3 "register_operand" "w, 0, w")]
> +  SVE_COND_FP_BINARY)
> +(match_dup 0)]
> +   UNSPEC_SEL))]
> +  "TARGET_SVE"
> +  "@
> +   \t%0., %1/m, %0., %3.
> +   \t%0., %1/m, %0., %2.
> +   movprfx\t%0, %1/m, %2\;\t%0., %1/m, %0., 
> %3."
> +  [(set_attr "movprfx" "*,*,yes")]
> +)

Reintroduces a (match_dup 1) into the SVE_COND_FP_BINARY.

OK otherwise, thanks.

The original reason for using SVE_COND_FP_BINARY rather than rtx codes
was to emphasise that nothing happens for inactive lanes: this is really
a predicated operation that returns "don't care" values for inactive lanes
fused with a select that "happens" to use (but in fact always uses) the same
predicate.  So from that point of view it seemed natural for both unspecs
to have the predicate.

OTOH, since SVE_COND_FP_BINARY is never used independently, and since it's
an unspec, I guess it doesn't matter much either way.

Richard


Re: [PATCH, aarch64 1/4] aarch64: Add movprfx alternatives for unpredicated patterns

2018-07-02 Thread Richard Sandiford
Richard Henderson  writes:
>   * config/aarch64/aarch64.md (movprfx): New attr.
>   (length): Default movprfx to 8.
>   * config/aarch64/aarch64-sve.md (*mul3): Add movprfx alt.
>   (*madd, *msub   (*mul3_highpart): Likewise.
>   (*3): Likewise.
>   (*v3): Likewise.
>   (*3): Likewise.
>   (*3): Likewise.
>   (*fma4, *fnma4): Likewise.
>   (*fms4, *fnms4): Likewise.
>   (*div4): Likewise.

OK, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64-sve.md | 184 ++
>  gcc/config/aarch64/aarch64.md |  11 +-
>  2 files changed, 116 insertions(+), 79 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index 8e2433385a8..3dee6a4376d 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -937,47 +937,53 @@
>  ;; to gain much and would make the instruction seem less uniform to the
>  ;; register allocator.
>  (define_insn "*mul3"
> -  [(set (match_operand:SVE_I 0 "register_operand" "=w, w")
> +  [(set (match_operand:SVE_I 0 "register_operand" "=w, w, ?&w")
>   (unspec:SVE_I
> -   [(match_operand: 1 "register_operand" "Upl, Upl")
> +   [(match_operand: 1 "register_operand" "Upl, Upl, Upl")
>  (mult:SVE_I
> -  (match_operand:SVE_I 2 "register_operand" "%0, 0")
> -  (match_operand:SVE_I 3 "aarch64_sve_mul_operand" "vsm, w"))]
> +  (match_operand:SVE_I 2 "register_operand" "%0, 0, w")
> +  (match_operand:SVE_I 3 "aarch64_sve_mul_operand" "vsm, w, w"))]
> UNSPEC_MERGE_PTRUE))]
>"TARGET_SVE"
>"@
> mul\t%0., %0., #%3
> -   mul\t%0., %1/m, %0., %3."
> +   mul\t%0., %1/m, %0., %3.
> +   movprfx\t%0, %2\;mul\t%0., %1/m, %0., %3."
> +  [(set_attr "movprfx" "*,*,yes")]
>  )
>  
>  (define_insn "*madd"
> -  [(set (match_operand:SVE_I 0 "register_operand" "=w, w")
> +  [(set (match_operand:SVE_I 0 "register_operand" "=w, w, ?&w")
>   (plus:SVE_I
> (unspec:SVE_I
> - [(match_operand: 1 "register_operand" "Upl, Upl")
> -  (mult:SVE_I (match_operand:SVE_I 2 "register_operand" "%0, w")
> -  (match_operand:SVE_I 3 "register_operand" "w, w"))]
> + [(match_operand: 1 "register_operand" "Upl, Upl, Upl")
> +  (mult:SVE_I (match_operand:SVE_I 2 "register_operand" "%0, w, w")
> +  (match_operand:SVE_I 3 "register_operand" "w, w, w"))]
>   UNSPEC_MERGE_PTRUE)
> -   (match_operand:SVE_I 4 "register_operand" "w, 0")))]
> +   (match_operand:SVE_I 4 "register_operand" "w, 0, w")))]
>"TARGET_SVE"
>"@
> mad\t%0., %1/m, %3., %4.
> -   mla\t%0., %1/m, %2., %3."
> +   mla\t%0., %1/m, %2., %3.
> +   movprfx\t%0, %4\;mla\t%0., %1/m, %2., %3."
> +  [(set_attr "movprfx" "*,*,yes")]
>  )
>  
>  (define_insn "*msub3"
> -  [(set (match_operand:SVE_I 0 "register_operand" "=w, w")
> +  [(set (match_operand:SVE_I 0 "register_operand" "=w, w, ?&w")
>   (minus:SVE_I
> -   (match_operand:SVE_I 4 "register_operand" "w, 0")
> +   (match_operand:SVE_I 4 "register_operand" "w, 0, w")
> (unspec:SVE_I
> - [(match_operand: 1 "register_operand" "Upl, Upl")
> -  (mult:SVE_I (match_operand:SVE_I 2 "register_operand" "%0, w")
> -  (match_operand:SVE_I 3 "register_operand" "w, w"))]
> + [(match_operand: 1 "register_operand" "Upl, Upl, Upl")
> +  (mult:SVE_I (match_operand:SVE_I 2 "register_operand" "%0, w, w")
> +  (match_operand:SVE_I 3 "register_operand" "w, w, w"))]
>   UNSPEC_MERGE_PTRUE)))]
>"TARGET_SVE"
>"@
> msb\t%0., %1/m, %3., %4.
> -   mls\t%0., %1/m, %2., %3."
> +   mls\t%0., %1/m, %2., %3.
> +   movprfx\t%0, %4\;mls\t%0., %1/m, %2., %3."
> +  [(set_attr "movprfx" "*,*,yes")]
>  )
>  
>  ;; Unpredicated highpart multiplication.
> @@ -997,15 +1003,18 @@
>  
>  ;; Predicated highpart multiplication.
>  (define_insn "*mul3_highpart"
> -  [(set (match_operand:SVE_I 0 "register_operand" "=w")
> +  [(set (match_operand:SVE_I 0 "register_operand" "=w, ?&w")
>   (unspec:SVE_I
> -   [(match_operand: 1 "register_operand" "Upl")
> -(unspec:SVE_I [(match_operand:SVE_I 2 "register_operand" "%0")
> -   (match_operand:SVE_I 3 "register_operand" "w")]
> +   [(match_operand: 1 "register_operand" "Upl, Upl")
> +(unspec:SVE_I [(match_operand:SVE_I 2 "register_operand" "%0, w")
> +   (match_operand:SVE_I 3 "register_operand" "w, w")]
>MUL_HIGHPART)]
> UNSPEC_MERGE_PTRUE))]
>"TARGET_SVE"
> -  "mulh\t%0., %1/m, %0., %3."
> +  "@
> +   mulh\t%0., %1/m, %0., %3.
> +   movprfx\t%0, %2\;mulh\t%0., %1/m, %0., %3."
> +  [(set_attr "movprfx" "*,yes")]
>  )
>  
>  ;; Unpredicated division.
> @@ -1025,17 +1034,19 @@
>  
>  ;; Division predicated with a PTRUE.
>  (define_insn "*3"
> -  [(set (match_operand:SVE_SDI 0 "register_operan

Re: [PATCH, aarch64 2/4] aarch64: Remove predicate from inside SVE_COND_FP_BINARY

2018-07-02 Thread Richard Sandiford
Richard Henderson  writes:
> The predicate is present within the containing UNSPEC_SEL;
> there is no need to duplicate it.
>
>   * config/aarch64/aarch64-sve.md (cond_):
>   Remove match_dup 1 from the inner unspec.
>   (*cond_): Likewise.

OK, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64-sve.md | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index 3dee6a4376d..2aceef65c80 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -2677,8 +2677,7 @@
>   (unspec:SVE_F
> [(match_operand: 1 "register_operand")
>  (unspec:SVE_F
> -  [(match_dup 1)
> -   (match_operand:SVE_F 2 "register_operand")
> +  [(match_operand:SVE_F 2 "register_operand")
> (match_operand:SVE_F 3 "register_operand")]
>SVE_COND_FP_BINARY)
>  (match_operand:SVE_F 4 "register_operand")]
> @@ -2694,8 +2693,7 @@
>   (unspec:SVE_F
> [(match_operand: 1 "register_operand" "Upl")
>  (unspec:SVE_F
> -  [(match_dup 1)
> -   (match_operand:SVE_F 2 "register_operand" "0")
> +  [(match_operand:SVE_F 2 "register_operand" "0")
> (match_operand:SVE_F 3 "register_operand" "w")]
>SVE_COND_FP_BINARY)
>  (match_dup 2)]
> @@ -2710,8 +2708,7 @@
>   (unspec:SVE_F
> [(match_operand: 1 "register_operand" "Upl")
>  (unspec:SVE_F
> -  [(match_dup 1)
> -   (match_operand:SVE_F 2 "register_operand" "w")
> +  [(match_operand:SVE_F 2 "register_operand" "w")
> (match_operand:SVE_F 3 "register_operand" "0")]
>SVE_COND_FP_BINARY)
>  (match_dup 3)]


Re: [PATCH, aarch64 4/4] aarch64: Add movprfx patterns for zero and unmatched select

2018-07-02 Thread Richard Sandiford
Richard Henderson  writes:
>   * config/aarch64/aarch64-protos.h, config/aarch64/aarch64.c
>   (aarch64_sve_prepare_conditional_op): Remove.
>   * config/aarch64/aarch64-sve.md (cond_):
>   Allow aarch64_simd_reg_or_zero as select operand; remove
>   the aarch64_sve_prepare_conditional_op call.
>   (cond_): Likewise.
>   (cond_): Likewise.
>   (*cond__z): New pattern.
>   (*cond__z): New pattern.
>   (*cond__z): New pattern.
>   (*cond__any): New pattern.
>   (*cond__any): New pattern.
>   (*cond__any): New pattern
>   and a splitters to match all of the *_any patterns.
>   * config/aarch64/predicates.md (aarch64_sve_any_binary_operator): New.
> ---
>  gcc/config/aarch64/aarch64-protos.h |   1 -
>  gcc/config/aarch64/aarch64.c|  54 --
>  gcc/config/aarch64/aarch64-sve.md   | 154 
>  gcc/config/aarch64/predicates.md|   3 +
>  4 files changed, 136 insertions(+), 76 deletions(-)

OK, thanks.

Richard

>
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 87c6ae20278..514ddc457ca 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -513,7 +513,6 @@ bool aarch64_gen_adjusted_ldpstp (rtx *, bool, 
> scalar_mode, RTX_CODE);
>  void aarch64_expand_sve_vec_cmp_int (rtx, rtx_code, rtx, rtx);
>  bool aarch64_expand_sve_vec_cmp_float (rtx, rtx_code, rtx, rtx, bool);
>  void aarch64_expand_sve_vcond (machine_mode, machine_mode, rtx *);
> -void aarch64_sve_prepare_conditional_op (rtx *, unsigned int, bool);
>  #endif /* RTX_CODE */
>  
>  void aarch64_init_builtins (void);
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 3af7e98e166..d75d45f4b8b 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -16058,60 +16058,6 @@ aarch64_expand_sve_vcond (machine_mode data_mode, 
> machine_mode cmp_mode,
>emit_set_insn (ops[0], gen_rtx_UNSPEC (data_mode, vec, UNSPEC_SEL));
>  }
>  
> -/* Prepare a cond_ operation that has the operands
> -   given by OPERANDS, where:
> -
> -   - operand 0 is the destination
> -   - operand 1 is a predicate
> -   - operands 2 to NOPS - 2 are the operands to an operation that is
> - performed for active lanes
> -   - operand NOPS - 1 specifies the values to use for inactive lanes.
> -
> -   COMMUTATIVE_P is true if operands 2 and 3 are commutative.  In that case,
> -   no pattern is provided for a tie between operands 3 and NOPS - 1.  */
> -
> -void
> -aarch64_sve_prepare_conditional_op (rtx *operands, unsigned int nops,
> - bool commutative_p)
> -{
> -  /* We can do the operation directly if the "else" value matches one
> - of the other inputs.  */
> -  for (unsigned int i = 2; i < nops - 1; ++i)
> -if (rtx_equal_p (operands[i], operands[nops - 1]))
> -  {
> - if (i == 3 && commutative_p)
> -   std::swap (operands[2], operands[3]);
> - return;
> -  }
> -
> -  /* If the "else" value is different from the other operands, we have
> - the choice of doing a SEL on the output or a SEL on an input.
> - Neither choice is better in all cases, but one advantage of
> - selecting the input is that it can avoid a move when the output
> - needs to be distinct from the inputs.  E.g. if operand N maps to
> - register N, selecting the output would give:
> -
> - MOVPRFX Z0.S, Z2.S
> - ADD Z0.S, P1/M, Z0.S, Z3.S
> - SEL Z0.S, P1, Z0.S, Z4.S
> -
> - whereas selecting the input avoids the MOVPRFX:
> -
> - SEL Z0.S, P1, Z2.S, Z4.S
> - ADD Z0.S, P1/M, Z0.S, Z3.S.
> -
> - ??? Matching the other input can produce
> -
> - MOVPRFX Z4.S, P1/M, Z2.S
> - ADD Z4.S, P1/M, Z4.S, Z3.S
> -   */
> -  machine_mode mode = GET_MODE (operands[0]);
> -  rtx temp = gen_reg_rtx (mode);
> -  rtvec vec = gen_rtvec (3, operands[1], operands[2], operands[nops - 1]);
> -  emit_set_insn (temp, gen_rtx_UNSPEC (mode, vec, UNSPEC_SEL));
> -  operands[2] = operands[nops - 1] = temp;
> -}
> -
>  /* Implement TARGET_MODES_TIEABLE_P.  In principle we should always return
> true.  However due to issues with register allocation it is preferable
> to avoid tieing integer scalar and FP scalar modes.  Executing integer
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index db16affc093..b16d0455159 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -1817,13 +1817,10 @@
>  (SVE_INT_BINARY:SVE_I
>(match_operand:SVE_I 2 "register_operand")
>(match_operand:SVE_I 3 "register_operand"))
> -(match_operand:SVE_I 4 "register_operand")]
> +(match_operand:SVE_I 4 "aarch64_simd_reg_or_zero")]
> UNSPEC_SEL))]
>"TARGET_SVE"
> -{
> -  bool commutative_p = (GET_RTX_CLASS () == RTX_COMM_ARITH);
> -  aarch64_sve_prepare_conditional_op (oper

Re: [14/n] PR85694: Rework overwidening detection

2018-07-02 Thread Richard Sandiford
Christophe Lyon  writes:
> On Fri, 29 Jun 2018 at 13:36, Richard Sandiford
>  wrote:
>>
>> Richard Sandiford  writes:
>> > This patch is the main part of PR85694.  The aim is to recognise at least:
>> >
>> >   signed char *a, *b, *c;
>> >   ...
>> >   for (int i = 0; i < 2048; i++)
>> > c[i] = (a[i] + b[i]) >> 1;
>> >
>> > as an over-widening pattern, since the addition and shift can be done
>> > on shorts rather than ints.  However, it ended up being a lot more
>> > general than that.
>> >
>> > The current over-widening pattern detection is limited to a few simple
>> > cases: logical ops with immediate second operands, and shifts by a
>> > constant.  These cases are enough for common pixel-format conversion
>> > and can be detected in a peephole way.
>> >
>> > The loop above requires two generalisations of the current code: support
>> > for addition as well as logical ops, and support for non-constant second
>> > operands.  These are harder to detect in the same peephole way, so the
>> > patch tries to take a more global approach.
>> >
>> > The idea is to get information about the minimum operation width
>> > in two ways:
>> >
>> > (1) by using the range information attached to the SSA_NAMEs
>> > (effectively a forward walk, since the range info is
>> > context-independent).
>> >
>> > (2) by back-propagating the number of output bits required by
>> > users of the result.
>> >
>> > As explained in the comments, there's a balance to be struck between
>> > narrowing an individual operation and fitting in with the surrounding
>> > code.  The approach is pretty conservative: if we could narrow an
>> > operation to N bits without changing its semantics, it's OK to do that if:
>> >
>> > - no operations later in the chain require more than N bits; or
>> >
>> > - all internally-defined inputs are extended from N bits or fewer,
>> >   and at least one of them is single-use.
>> >
>> > See the comments for the rationale.
>> >
>> > I didn't bother adding STMT_VINFO_* wrappers for the new fields
>> > since the code seemed more readable without.
>> >
>> > Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>>
>> Here's a version rebased on top of current trunk.  Changes from last time:
>>
>> - reintroduce dump_generic_expr_loc, with the obvious change to the
>>   prototype
>>
>> - fix a typo in a comment
>>
>> - use vect_element_precision from the new version of 12/n.
>>
>> Tested as before.  OK to install?
>>
>
> Hi Richard,
>
> This patch introduces regressions on arm-none-linux-gnueabihf:
> gcc.dg/vect/vect-over-widen-1-big-array.c -flto -ffat-lto-objects
> scan-tree-dump-times vect "vect_recog_widen_shift_pattern: detected" 2
> gcc.dg/vect/vect-over-widen-1-big-array.c scan-tree-dump-times
> vect "vect_recog_widen_shift_pattern: detected" 2
> gcc.dg/vect/vect-over-widen-1.c -flto -ffat-lto-objects
> scan-tree-dump-times vect "vect_recog_widen_shift_pattern: detected" 2
> gcc.dg/vect/vect-over-widen-1.c scan-tree-dump-times vect
> "vect_recog_widen_shift_pattern: detected" 2
> gcc.dg/vect/vect-over-widen-4-big-array.c -flto -ffat-lto-objects
> scan-tree-dump-times vect "vect_recog_widen_shift_pattern: detected" 2
> gcc.dg/vect/vect-over-widen-4-big-array.c scan-tree-dump-times
> vect "vect_recog_widen_shift_pattern: detected" 2
> gcc.dg/vect/vect-over-widen-4.c -flto -ffat-lto-objects
> scan-tree-dump-times vect "vect_recog_widen_shift_pattern: detected" 2
> gcc.dg/vect/vect-over-widen-4.c scan-tree-dump-times vect
> "vect_recog_widen_shift_pattern: detected" 2
> gcc.dg/vect/vect-widen-shift-s16.c -flto -ffat-lto-objects
> scan-tree-dump-times vect "vect_recog_widen_shift_pattern: detected" 8
> gcc.dg/vect/vect-widen-shift-s16.c scan-tree-dump-times vect
> "vect_recog_widen_shift_pattern: detected" 8
> gcc.dg/vect/vect-widen-shift-s8.c -flto -ffat-lto-objects
> scan-tree-dump-times vect "vect_recog_widen_shift_pattern: detected" 1
> gcc.dg/vect/vect-widen-shift-s8.c scan-tree-dump-times vect
> "vect_recog_widen_shift_pattern: detected" 1
> gcc.dg/vect/vect-widen-shift-u16.c -flto -ffat-lto-objects
> scan-tree-dump-times vect "vect_recog_

Re: [PATCH] [RFC] Higher-level reporting of vectorization problems

2018-07-02 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, 22 Jun 2018, David Malcolm wrote:
>
>> NightStrike and I were chatting on IRC last week about
>> issues with trying to vectorize the following code:
>> 
>> #include <vector>
>> std::size_t f(std::vector<std::vector<int>> const & v) {
>>  std::size_t ret = 0;
>>  for (auto const & w: v)
>>  ret += w.size();
>>  return ret;
>> }
>> 
>> icc could vectorize it, but gcc couldn't, and neither of us could
>> immediately figure out what the problem was.
>> 
>> Using -fopt-info leads to a wall of text.
>> 
>> I tried using my patch here:
>> 
>>  "[PATCH] v3 of optinfo, remarks and optimization records"
>>   https://gcc.gnu.org/ml/gcc-patches/2018-06/msg01267.html
>> 
>> It improved things somewhat, by showing:
>> (a) the nesting structure via indentation, and
>> (b) the GCC line at which each message is emitted (by using the
>> "remark" output)
>> 
>> but it's still a wall of text:
>> 
>>   https://dmalcolm.fedorapeople.org/gcc/2018-06-18/test.cc.remarks.html
>>   
>> https://dmalcolm.fedorapeople.org/gcc/2018-06-18/test.cc.d/..%7C..%7Csrc%7Ctest.cc.html#line-4
>> 
>> It doesn't yet provide a simple high-level message to a
>> tech-savvy user on what they need to do to get GCC to
>> vectorize their loop.
>
> Yeah, in particular the vectorizer is way too noisy in its low-level
> functions.  IIRC -fopt-info-vec-missed is "somewhat" better:
>
> t.C:4:26: note: step unknown.
> t.C:4:26: note: vector alignment may not be reachable
> t.C:4:26: note: not ssa-name.
> t.C:4:26: note: use not simple.
> t.C:4:26: note: not ssa-name.
> t.C:4:26: note: use not simple.
> t.C:4:26: note: no array mode for V2DI[3]
> t.C:4:26: note: Data access with gaps requires scalar epilogue loop
> t.C:4:26: note: can't use a fully-masked loop because the target doesn't 
> have the appropriate masked load or store.
> t.C:4:26: note: not ssa-name.
> t.C:4:26: note: use not simple.
> t.C:4:26: note: not ssa-name.
> t.C:4:26: note: use not simple.
> t.C:4:26: note: no array mode for V2DI[3]
> t.C:4:26: note: Data access with gaps requires scalar epilogue loop
> t.C:4:26: note: op not supported by target.
> t.C:4:26: note: not vectorized: relevant stmt not supported: _15 = _14 
> /[ex] 4;
> t.C:4:26: note: bad operation or unsupported loop bound.
> t.C:4:26: note: not vectorized: no grouped stores in basic block.
> t.C:4:26: note: not vectorized: no grouped stores in basic block.
> t.C:6:12: note: not vectorized: not enough data-refs in basic block.
>
>
>> The pertinent dump messages are:
>> 
>> test.cc:4:23: remark: === try_vectorize_loop_1 === 
>> [../../src/gcc/tree-vectorizer.c:674:try_vectorize_loop_1]
>> cc1plus: remark:
>> Analyzing loop at test.cc:4 
>> [../../src/gcc/dumpfile.c:735:ensure_pending_optinfo]
>> test.cc:4:23: remark:  === analyze_loop_nest === 
>> [../../src/gcc/tree-vect-loop.c:2299:vect_analyze_loop]
>> [...snip...]
>> test.cc:4:23: remark:   === vect_analyze_loop_operations === 
>> [../../src/gcc/tree-vect-loop.c:1520:vect_analyze_loop_operations]
>> [...snip...]
>> test.cc:4:23: remark:==> examining statement: ‘_15 = _14 /[ex] 4;’ 
>> [../../src/gcc/tree-vect-stmts.c:9382:vect_analyze_stmt]
>> test.cc:4:23: remark:vect_is_simple_use: operand ‘_14’ 
>> [../../src/gcc/tree-vect-stmts.c:10064:vect_is_simple_use]
>> test.cc:4:23: remark:def_stmt: ‘_14 = _8 - _7;’ 
>> [../../src/gcc/tree-vect-stmts.c:10098:vect_is_simple_use]
>> test.cc:4:23: remark:type of def: internal 
>> [../../src/gcc/tree-vect-stmts.c:10112:vect_is_simple_use]
>> test.cc:4:23: remark:vect_is_simple_use: operand ‘4’ 
>> [../../src/gcc/tree-vect-stmts.c:10064:vect_is_simple_use]
>> test.cc:4:23: remark:op not supported by target. 
>> [../../src/gcc/tree-vect-stmts.c:5932:vectorizable_operation]
>> test.cc:4:23: remark:not vectorized: relevant stmt not supported: ‘_15 = 
>> _14 /[ex] 4;’ [../../src/gcc/tree-vect-stmts.c:9565:vect_analyze_stmt]
>> test.cc:4:23: remark:   bad operation or unsupported loop bound. 
>> [../../src/gcc/tree-vect-loop.c:2043:vect_analyze_loop_2]
>> cc1plus: remark: vectorized 0 loops in function. 
>> [../../src/gcc/tree-vectorizer.c:904:vectorize_loops]
>> 
>> In particular, that complaint from
>>   [../../src/gcc/tree-vect-stmts.c:9565:vect_analyze_stmt]
>> is coming from:
>> 
>>   if (!ok)
>> {
>>   if (dump_enabled_p ())
>> {
>>   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>"not vectorized: relevant stmt not ");
>>   dump_printf (MSG_MISSED_OPTIMIZATION, "supported: ");
>>   dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
>> }
>> 
>>   return false;
>> }
>> 
>> This got me thinking: the user presumably wants to know several
>> things:
>> 
>> * the location of the loop that can't be vectorized (vect_location
>>   captures this)
>> * location of the problematic statement
>> * why it's problematic
>> * the problematic statement itself.
>> 
>> The following i

Avoid matching the same pattern statement twice

2018-07-03 Thread Richard Sandiford
r262275 allowed pattern matching on pattern statements.  Testing for
SVE on more benchmarks showed a case where this interacted badly
with 14/n.

The new over-widening detection could narrow a COND_EXPR A to another
COND_EXPR B, which mixed_size_cond could then match.  This was working
as expected.  However, we left B (now dead) in the pattern definition
sequence with a non-null PATTERN_DEF_SEQ.  mask_conversion also
matched B, and unlike most recognisers, didn't clear PATTERN_DEF_SEQ
before adding statements to it.  This meant that the statements
created by mixed_size_cond appeared in two supposedly separate
sequences, causing much confusion.

This patch removes pattern statements that are replaced by further
pattern statements.  As a belt-and-braces fix, it also nullifies
PATTERN_DEF_SEQ on failure, in the same way Richard B. did recently
for RELATED_STMT.

I have patches to clean up the PATTERN_DEF_SEQ handling, but they
only apply after the complete PR85694 sequence, whereas this needs
to go in before 14/n.

Tested on aarch64-linux-gnu, arm-linux-gnueabihf and x86_64-linux-gnu.
OK to install?

Richard


2018-07-03  Richard Sandiford  

gcc/
* tree-vect-patterns.c (vect_mark_pattern_stmts): Remove pattern
statements that have been replaced by further pattern statements.
(vect_pattern_recog_1): Clear STMT_VINFO_PATTERN_DEF_SEQ on failure.

gcc/testsuite/
* gcc.dg/vect/vect-mixed-size-cond-1.c: New test.

Index: gcc/tree-vect-patterns.c
===
--- gcc/tree-vect-patterns.c2018-07-02 14:34:45.857732632 +0100
+++ gcc/tree-vect-patterns.c2018-07-03 08:56:56.610251460 +0100
@@ -4295,6 +4295,9 @@ vect_mark_pattern_stmts (gimple *orig_st
   gimple_stmt_iterator gsi = gsi_for_stmt (orig_stmt, orig_def_seq);
   gsi_insert_seq_before_without_update (&gsi, def_seq, GSI_SAME_STMT);
   gsi_insert_before_without_update (&gsi, pattern_stmt, GSI_SAME_STMT);
+
+  /* Remove the pattern statement that this new pattern replaces.  */
+  gsi_remove (&gsi, false);
 }
   else
 vect_set_pattern_stmt (pattern_stmt, orig_stmt_info, pattern_vectype);
@@ -4358,6 +4361,8 @@ vect_pattern_recog_1 (vect_recog_func *r
  if (!is_pattern_stmt_p (stmt_info))
STMT_VINFO_RELATED_STMT (stmt_info) = NULL;
}
+  /* Clear any half-formed pattern definition sequence.  */
+  STMT_VINFO_PATTERN_DEF_SEQ (stmt_info) = NULL;
   return;
 }
 
Index: gcc/testsuite/gcc.dg/vect/vect-mixed-size-cond-1.c
===
--- /dev/null   2018-06-13 14:36:57.192460992 +0100
+++ gcc/testsuite/gcc.dg/vect/vect-mixed-size-cond-1.c  2018-07-03 
08:56:56.610251460 +0100
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+
+int
+f (unsigned char *restrict x, short *restrict y)
+{
+  for (int i = 0; i < 100; ++i)
+{
+  unsigned short a = (x[i] + 11) >> 1;
+  unsigned short b = (x[i] + 42) >> 2;
+  unsigned short cmp = y[i] == 0 ? a : b;
+  int res = cmp + 1;
+  x[i] = res;
+}
+}


Clean up interface to vector pattern recognisers

2018-07-03 Thread Richard Sandiford
The PR85694 series removed the only cases in which a pattern recogniser
could attach patterns to more than one statement.  I think it would be
better to avoid adding any new instances of that, since it interferes
with the normal matching order.

This patch therefore switches the interface back to passing a single
statement instead of a vector.  It also gets rid of the clearing of
STMT_VINFO_RELATED_STMT on failure, since no recognisers use it now.

Tested on aarch64-linux-gnu, arm-linux-gnueabihf and x86_64-linux-gnu.
OK to install?

Richard


2018-07-03  Richard Sandiford  

gcc/
* tree-vect-patterns.c (vect_recog_dot_prod_pattern):
(vect_recog_sad_pattern, vect_recog_widen_op_pattern)
(vect_recog_widen_mult_pattern, vect_recog_pow_pattern):
(vect_recog_widen_sum_pattern, vect_recog_over_widening_pattern)
(vect_recog_average_pattern, vect_recog_cast_forwprop_pattern)
(vect_recog_widen_shift_pattern, vect_recog_rotate_pattern)
(vect_recog_vector_vector_shift_pattern, vect_synth_mult_by_constant)
(vect_recog_mult_pattern, vect_recog_divmod_pattern)
(vect_recog_mixed_size_cond_pattern, vect_recog_bool_pattern)
(vect_recog_mask_conversion_pattern): Replace vec<gimple *>
parameter with a single stmt_vec_info.
(vect_recog_func_ptr): Likewise.
(vect_recog_gather_scatter_pattern): Likewise, folding in...
(vect_try_gather_scatter_pattern): ...this.
(vect_pattern_recog_1): Remove stmts_to_replace and just pass
the stmt_vec_info of the statement to be matched.  Don't clear
STMT_VINFO_RELATED_STMT.
(vect_pattern_recog): Update call accordingly.

Index: gcc/tree-vect-patterns.c
===
--- gcc/tree-vect-patterns.c2018-07-03 09:03:39.0 +0100
+++ gcc/tree-vect-patterns.c2018-07-03 09:03:39.834882009 +0100
@@ -888,7 +888,7 @@ vect_reassociating_reduction_p (stmt_vec
 
Input:
 
-   * STMTS: Contains a stmt from which the pattern search begins.  In the
+   * STMT_VINFO: The stmt from which the pattern search begins.  In the
example, when this function is called with S7, the pattern {S3,S4,S5,S6,S7}
will be detected.
 
@@ -909,11 +909,10 @@ vect_reassociating_reduction_p (stmt_vec
  inner-loop nested in an outer-loop that us being vectorized).  */
 
 static gimple *
-vect_recog_dot_prod_pattern (vec<gimple *> *stmts, tree *type_out)
+vect_recog_dot_prod_pattern (stmt_vec_info stmt_vinfo, tree *type_out)
 {
-  gimple *last_stmt = (*stmts)[0];
   tree oprnd0, oprnd1;
-  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  gimple *last_stmt = stmt_vinfo->stmt;
   vec_info *vinfo = stmt_vinfo->vinfo;
   tree type, half_type;
   gimple *pattern_stmt;
@@ -1021,7 +1020,7 @@ vect_recog_dot_prod_pattern (vec *stmts, tree *type_out)
+vect_recog_sad_pattern (stmt_vec_info stmt_vinfo, tree *type_out)
 {
-  gimple *last_stmt = (*stmts)[0];
-  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  gimple *last_stmt = stmt_vinfo->stmt;
   vec_info *vinfo = stmt_vinfo->vinfo;
   tree half_type;
 
@@ -1182,12 +1180,11 @@ vect_recog_sad_pattern (vec *s
name of the pattern being matched, for dump purposes.  */
 
 static gimple *
-vect_recog_widen_op_pattern (vec<gimple *> *stmts, tree *type_out,
+vect_recog_widen_op_pattern (stmt_vec_info last_stmt_info, tree *type_out,
 tree_code orig_code, tree_code wide_code,
 bool shift_p, const char *name)
 {
-  gimple *last_stmt = stmts->pop ();
-  stmt_vec_info last_stmt_info = vinfo_for_stmt (last_stmt);
+  gimple *last_stmt = last_stmt_info->stmt;
 
   vect_unpromoted_value unprom[2];
   tree half_type;
@@ -1231,7 +1228,6 @@ vect_recog_widen_op_pattern (vecsafe_push (last_stmt);
   return vect_convert_output (last_stmt_info, type, pattern_stmt, vecitype);
 }
 
@@ -1239,9 +1235,9 @@ vect_recog_widen_op_pattern (vec *stmts, tree *type_out)
+vect_recog_widen_mult_pattern (stmt_vec_info last_stmt_info, tree *type_out)
 {
-  return vect_recog_widen_op_pattern (stmts, type_out, MULT_EXPR,
+  return vect_recog_widen_op_pattern (last_stmt_info, type_out, MULT_EXPR,
  WIDEN_MULT_EXPR, false,
  "vect_recog_widen_mult_pattern");
 }
@@ -1257,7 +1253,7 @@ vect_recog_widen_mult_pattern (vec *stmts, tree *type_out)
+vect_recog_pow_pattern (stmt_vec_info stmt_vinfo, tree *type_out)
 {
-  gimple *last_stmt = (*stmts)[0];
+  gimple *last_stmt = stmt_vinfo->stmt;
   tree base, exp;
   gimple *stmt;
   tree var;
@@ -1344,7 +1340,6 @@ vect_recog_pow_pattern (vec *s
  *type_out = get_vectype_for_scalar_type (TREE_TYPE (base));
  if (!*type_out)
return NULL;
- stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
  tree def = vect_recog_temp_ssa_var (TREE_TYPE (b

Ensure PATTERN_DEF_SEQ is empty before recognising patterns

2018-07-03 Thread Richard Sandiford
Various recognisers set PATTERN_DEF_SEQ to null before adding
statements to it, but it should always be null at that point anyway.
This patch asserts for that in vect_pattern_recog_1 and removes
the redundant code.

Tested on aarch64-linux-gnu, arm-linux-gnueabihf and x86_64-linux-gnu.
OK to install?

Richard


2018-07-03  Richard Sandiford  

gcc/
* tree-vect-patterns.c (new_pattern_def_seq): Delete.
(vect_recog_dot_prod_pattern, vect_recog_sad_pattern)
(vect_recog_widen_op_pattern, vect_recog_over_widening_pattern)
(vect_recog_rotate_pattern, vect_synth_mult_by_constant): Don't set
STMT_VINFO_PATTERN_DEF_SEQ to null here.
(vect_recog_pow_pattern, vect_recog_vector_vector_shift_pattern)
(vect_recog_mixed_size_cond_pattern, vect_recog_bool_pattern): Use
append_pattern_def_seq instead of new_pattern_def_seq.
(vect_recog_divmod_pattern): Do both of the above.
(vect_pattern_recog_1): Assert that STMT_VINFO_PATTERN_DEF_SEQ
is null.

Index: gcc/tree-vect-patterns.c
===
--- gcc/tree-vect-patterns.c2018-07-03 09:03:39.834882009 +0100
+++ gcc/tree-vect-patterns.c2018-07-03 09:06:43.861330261 +0100
@@ -150,13 +150,6 @@ append_pattern_def_seq (stmt_vec_info st
  new_stmt);
 }
 
-static inline void
-new_pattern_def_seq (stmt_vec_info stmt_info, gimple *stmt)
-{
-  STMT_VINFO_PATTERN_DEF_SEQ (stmt_info) = NULL;
-  append_pattern_def_seq (stmt_info, stmt);
-}
-
 /* The caller wants to perform new operations on vect_external variable
VAR, so that the result of the operations would also be vect_external.
Return the edge on which the operations can be performed, if one exists.
@@ -983,7 +976,6 @@ vect_recog_dot_prod_pattern (stmt_vec_in
 return NULL;
 
   /* Get the inputs in the appropriate types.  */
-  STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo) = NULL;
   tree mult_oprnd[2];
   vect_convert_inputs (stmt_vinfo, 2, mult_oprnd, half_type,
   unprom0, half_vectype);
@@ -1142,7 +1134,6 @@ vect_recog_sad_pattern (stmt_vec_info st
 return NULL;
 
   /* Get the inputs to the SAD_EXPR in the appropriate types.  */
-  STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo) = NULL;
   tree sad_oprnd[2];
   vect_convert_inputs (stmt_vinfo, 2, sad_oprnd, half_type,
   unprom, half_vectype);
@@ -1220,7 +1211,6 @@ vect_recog_widen_op_pattern (stmt_vec_in
   if (!*type_out)
 return NULL;
 
-  STMT_VINFO_PATTERN_DEF_SEQ (last_stmt_info) = NULL;
   tree oprnd[2];
   vect_convert_inputs (last_stmt_info, 2, oprnd, half_type, unprom, vectype);
 
@@ -1342,7 +1332,7 @@ vect_recog_pow_pattern (stmt_vec_info st
return NULL;
  tree def = vect_recog_temp_ssa_var (TREE_TYPE (base), NULL);
  gimple *g = gimple_build_assign (def, MULT_EXPR, exp, logc);
- new_pattern_def_seq (stmt_vinfo, g);
+ append_pattern_def_seq (stmt_vinfo, g);
  tree res = vect_recog_temp_ssa_var (TREE_TYPE (base), NULL);
  g = gimple_build_call (exp_decl, 1, def);
  gimple_call_set_lhs (g, res);
@@ -1687,7 +1677,6 @@ vect_recog_over_widening_pattern (stmt_v
 }
 
   /* Calculate the rhs operands for an operation on NEW_TYPE.  */
-  STMT_VINFO_PATTERN_DEF_SEQ (last_stmt_info) = NULL;
   tree ops[3] = {};
   for (unsigned int i = 1; i < first_op; ++i)
 ops[i - 1] = gimple_op (last_stmt, i);
@@ -2073,7 +2062,6 @@ vect_recog_rotate_pattern (stmt_vec_info
def = rhs1;
 }
 
-  STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo) = NULL;
   if (def == NULL_TREE)
 {
   def = vect_recog_temp_ssa_var (type, NULL);
@@ -2269,7 +2257,7 @@ vect_recog_vector_vector_shift_pattern (
  set_vinfo_for_stmt (def_stmt, new_stmt_info);
  STMT_VINFO_VECTYPE (new_stmt_info)
= get_vectype_for_scalar_type (TREE_TYPE (rhs1));
- new_pattern_def_seq (stmt_vinfo, def_stmt);
+ append_pattern_def_seq (stmt_vinfo, def_stmt);
}
}
 }
@@ -2278,7 +2266,7 @@ vect_recog_vector_vector_shift_pattern (
 {
   def = vect_recog_temp_ssa_var (TREE_TYPE (oprnd0), NULL);
   def_stmt = gimple_build_assign (def, NOP_EXPR, oprnd1);
-  new_pattern_def_seq (stmt_vinfo, def_stmt);
+  append_pattern_def_seq (stmt_vinfo, def_stmt);
 }
 
   /* Pattern detected.  */
@@ -2472,7 +2460,6 @@ vect_synth_mult_by_constant (tree op, tr
   tree accumulator;
 
   /* Clear out the sequence of statements so we can populate it below.  */
-  STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo) = NULL;
   gimple *stmt = NULL;
 
   if (cast_to_unsigned_p)
@@ -2769,7 +2756,7 @@ vect_recog_divmod_pattern (stmt_vec_info
   fold_build2 (MINUS_EXPR, itype, oprnd1,
build_int_cst (

Pass more vector types to append_pattern_def_seq

2018-07-03 Thread Richard Sandiford
The PR85694 series added a vectype argument to append_pattern_def_seq.
This patch makes more callers use it.

Tested on aarch64-linux-gnu, arm-linux-gnueabihf and x86_64-linux-gnu.
OK to install?

Richard


2018-07-03  Richard Sandiford  

gcc/
* tree-vect-patterns.c (vect_recog_rotate_pattern)
(vect_recog_vector_vector_shift_pattern, vect_recog_divmod_pattern)
(vect_recog_mixed_size_cond_pattern, adjust_bool_pattern_cast)
(adjust_bool_pattern, vect_recog_bool_pattern): Pass the vector
type to append_pattern_def_seq instead of creating a stmt_vec_info
directly.
(build_mask_conversion): Likewise.  Remove vinfo argument.
(vect_add_conversion_to_patterm): Likewise, renaming to...
(vect_add_conversion_to_pattern): ...this.
(vect_recog_mask_conversion_pattern): Update call to
build_mask_conversion.  Pass the vector type to
append_pattern_def_seq here too.
(vect_recog_gather_scatter_pattern): Update call to
vect_add_conversion_to_pattern.

Index: gcc/tree-vect-patterns.c
===
--- gcc/tree-vect-patterns.c2018-07-03 09:06:43.861330261 +0100
+++ gcc/tree-vect-patterns.c2018-07-03 09:09:41.627853962 +0100
@@ -2090,7 +2090,6 @@ vect_recog_rotate_pattern (stmt_vec_info
   else
 {
   tree vecstype = get_vectype_for_scalar_type (stype);
-  stmt_vec_info def_stmt_vinfo;
 
   if (vecstype == NULL_TREE)
return NULL;
@@ -2103,12 +2102,7 @@ vect_recog_rotate_pattern (stmt_vec_info
  gcc_assert (!new_bb);
}
   else
-   {
- def_stmt_vinfo = new_stmt_vec_info (def_stmt, vinfo);
- set_vinfo_for_stmt (def_stmt, def_stmt_vinfo);
- STMT_VINFO_VECTYPE (def_stmt_vinfo) = vecstype;
- append_pattern_def_seq (stmt_vinfo, def_stmt);
-   }
+   append_pattern_def_seq (stmt_vinfo, def_stmt, vecstype);
 
   def2 = vect_recog_temp_ssa_var (stype, NULL);
   tree mask = build_int_cst (stype, GET_MODE_PRECISION (smode) - 1);
@@ -2121,12 +2115,7 @@ vect_recog_rotate_pattern (stmt_vec_info
  gcc_assert (!new_bb);
}
   else
-   {
- def_stmt_vinfo = new_stmt_vec_info (def_stmt, vinfo);
- set_vinfo_for_stmt (def_stmt, def_stmt_vinfo);
- STMT_VINFO_VECTYPE (def_stmt_vinfo) = vecstype;
- append_pattern_def_seq (stmt_vinfo, def_stmt);
-   }
+   append_pattern_def_seq (stmt_vinfo, def_stmt, vecstype);
 }
 
   var1 = vect_recog_temp_ssa_var (type, NULL);
@@ -2252,12 +2241,8 @@ vect_recog_vector_vector_shift_pattern (
   TYPE_PRECISION (TREE_TYPE (oprnd1)));
  def = vect_recog_temp_ssa_var (TREE_TYPE (rhs1), NULL);
  def_stmt = gimple_build_assign (def, BIT_AND_EXPR, rhs1, mask);
- stmt_vec_info new_stmt_info
-   = new_stmt_vec_info (def_stmt, vinfo);
- set_vinfo_for_stmt (def_stmt, new_stmt_info);
- STMT_VINFO_VECTYPE (new_stmt_info)
-   = get_vectype_for_scalar_type (TREE_TYPE (rhs1));
- append_pattern_def_seq (stmt_vinfo, def_stmt);
+ tree vecstype = get_vectype_for_scalar_type (TREE_TYPE (rhs1));
+ append_pattern_def_seq (stmt_vinfo, def_stmt, vecstype);
}
}
 }
@@ -2688,11 +2673,9 @@ vect_recog_divmod_pattern (stmt_vec_info
   tree oprnd0, oprnd1, vectype, itype, cond;
   gimple *pattern_stmt, *def_stmt;
   enum tree_code rhs_code;
-  vec_info *vinfo = stmt_vinfo->vinfo;
   optab optab;
   tree q;
   int dummy_int, prec;
-  stmt_vec_info def_stmt_vinfo;
 
   if (!is_gimple_assign (last_stmt))
 return NULL;
@@ -2792,18 +2775,12 @@ vect_recog_divmod_pattern (stmt_vec_info
  def_stmt = gimple_build_assign (var, COND_EXPR, cond,
  build_int_cst (utype, -1),
  build_int_cst (utype, 0));
- def_stmt_vinfo = new_stmt_vec_info (def_stmt, vinfo);
- set_vinfo_for_stmt (def_stmt, def_stmt_vinfo);
- STMT_VINFO_VECTYPE (def_stmt_vinfo) = vecutype;
- append_pattern_def_seq (stmt_vinfo, def_stmt);
+ append_pattern_def_seq (stmt_vinfo, def_stmt, vecutype);
  var = vect_recog_temp_ssa_var (utype, NULL);
  def_stmt = gimple_build_assign (var, RSHIFT_EXPR,
  gimple_assign_lhs (def_stmt),
  shift);
- def_stmt_vinfo = new_stmt_vec_info (def_stmt, vinfo);
- set_vinfo_for_stmt (def_stmt, def_stmt_vinfo);
- STMT_VINFO_VECTYPE (def_stmt_vinfo) = vecutype;
- append_pattern_def_seq (stmt_vinfo, def_stmt);
+ append_pattern_def_seq (stmt_vinfo, def_stmt, vecutype);
  signmask = vect_recog_temp_ssa_var (itype, N

Re: [14/n] PR85694: Rework overwidening detection

2018-07-03 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, Jun 29, 2018 at 1:36 PM Richard Sandiford
>  wrote:
>>
>> Richard Sandiford  writes:
>> > This patch is the main part of PR85694.  The aim is to recognise at least:
>> >
>> >   signed char *a, *b, *c;
>> >   ...
>> >   for (int i = 0; i < 2048; i++)
>> > c[i] = (a[i] + b[i]) >> 1;
>> >
>> > as an over-widening pattern, since the addition and shift can be done
>> > on shorts rather than ints.  However, it ended up being a lot more
>> > general than that.
>> >
>> > The current over-widening pattern detection is limited to a few simple
>> > cases: logical ops with immediate second operands, and shifts by a
>> > constant.  These cases are enough for common pixel-format conversion
>> > and can be detected in a peephole way.
>> >
>> > The loop above requires two generalisations of the current code: support
>> > for addition as well as logical ops, and support for non-constant second
>> > operands.  These are harder to detect in the same peephole way, so the
>> > patch tries to take a more global approach.
>> >
>> > The idea is to get information about the minimum operation width
>> > in two ways:
>> >
>> > (1) by using the range information attached to the SSA_NAMEs
>> > (effectively a forward walk, since the range info is
>> > context-independent).
>> >
>> > (2) by back-propagating the number of output bits required by
>> > users of the result.
>> >
>> > As explained in the comments, there's a balance to be struck between
>> > narrowing an individual operation and fitting in with the surrounding
>> > code.  The approach is pretty conservative: if we could narrow an
>> > operation to N bits without changing its semantics, it's OK to do that if:
>> >
>> > - no operations later in the chain require more than N bits; or
>> >
>> > - all internally-defined inputs are extended from N bits or fewer,
>> >   and at least one of them is single-use.
>> >
>> > See the comments for the rationale.
>> >
>> > I didn't bother adding STMT_VINFO_* wrappers for the new fields
>> > since the code seemed more readable without.
>> >
>> > Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>>
>> Here's a version rebased on top of current trunk.  Changes from last time:
>>
>> - reintroduce dump_generic_expr_loc, with the obvious change to the
>>   prototype
>>
>> - fix a typo in a comment
>>
>> - use vect_element_precision from the new version of 12/n.
>>
>> Tested as before.  OK to install?
>
> OK.

Thanks.  For the record, here's what I installed (updated on top of
Dave's recent patch, and with an obvious fix to vect-widen-mult-u8-u32.c).

Richard


2018-07-03  Richard Sandiford  

gcc/
* poly-int.h (print_hex): New function.
* dumpfile.h (dump_dec, dump_hex): Declare.
* dumpfile.c (dump_dec, dump_hex): New poly_wide_int functions.
* tree-vectorizer.h (_stmt_vec_info): Add min_output_precision,
min_input_precision, operation_precision and operation_sign.
* tree-vect-patterns.c (vect_get_range_info): New function.
(vect_same_loop_or_bb_p, vect_single_imm_use)
(vect_operation_fits_smaller_type): Delete.
(vect_look_through_possible_promotion): Add an optional
single_use_p parameter.
(vect_recog_over_widening_pattern): Rewrite to use new
stmt_vec_info information.  Handle one operation at a time.
(vect_recog_cast_forwprop_pattern, vect_narrowable_type_p)
(vect_truncatable_operation_p, vect_set_operation_type)
(vect_set_min_input_precision): New functions.
(vect_determine_min_output_precision_1): Likewise.
(vect_determine_min_output_precision): Likewise.
(vect_determine_precisions_from_range): Likewise.
(vect_determine_precisions_from_users): Likewise.
(vect_determine_stmt_precisions, vect_determine_precisions): Likewise.
(vect_vect_recog_func_ptrs): Put over_widening first.
Add cast_forwprop.
(vect_pattern_recog): Call vect_determine_precisions.

gcc/testsuite/
* gcc.dg/vect/vect-widen-mult-u8-u32.c: Check specifically for a
widen_mult pattern.
* gcc.dg/vect/vect-over-widen-1.c: Update the scan tests for new
over-widening messages.
* gcc.dg/vect/vect-over-widen-1-big-array.c: Likewise.
* gcc.dg/vect/vect-over-widen-2.c: Likewise.
* gcc.dg/vect/vect-over-widen-2-big-array.c: Likewise.
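
For reference, the effect of the new detection on the motivating loop can
be pictured in plain C (an illustrative, self-contained sketch, not one of
the files touched above).  The sum a[i] + b[i] needs at most 9 bits and the
result is truncated back to 8 bits, so the addition and the shift can be
done on 16-bit vector elements rather than 32-bit ones, doubling the number
of lanes handled per vector:

  #include <stdio.h>

  #define N 2048

  static signed char a[N], b[N], c[N];

  /* The loop the series targets: arithmetic written in int, but safe to
     narrow to 16-bit elements at the vector level.  */
  static void
  avg_loop (void)
  {
    for (int i = 0; i < N; i++)
      c[i] = (a[i] + b[i]) >> 1;
  }

  int
  main (void)
  {
    for (int i = 0; i < N; i++)
      {
        a[i] = (signed char) i;
        b[i] = (signed char) (i / 3);
      }
    avg_loop ();
    printf ("%d\n", c[100]);   /* (100 + 33) >> 1 == 66 */
    return 0;
  }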

Re: [14/n] PR85694: Rework overwidening detection

2018-07-04 Thread Richard Sandiford
Christophe Lyon  writes:
> On Tue, 3 Jul 2018 at 12:02, Richard Sandiford
>  wrote:
>>
>> Richard Biener  writes:
>> > On Fri, Jun 29, 2018 at 1:36 PM Richard Sandiford
>> >  wrote:
>> >>
>> >> Richard Sandiford  writes:
>> >> > This patch is the main part of PR85694.  The aim is to recognise
> at least:
>> >> >
>> >> >   signed char *a, *b, *c;
>> >> >   ...
>> >> >   for (int i = 0; i < 2048; i++)
>> >> > c[i] = (a[i] + b[i]) >> 1;
>> >> >
>> >> > as an over-widening pattern, since the addition and shift can be done
>> >> > on shorts rather than ints.  However, it ended up being a lot more
>> >> > general than that.
>> >> >
>> >> > The current over-widening pattern detection is limited to a few simple
>> >> > cases: logical ops with immediate second operands, and shifts by a
>> >> > constant.  These cases are enough for common pixel-format conversion
>> >> > and can be detected in a peephole way.
>> >> >
>> >> > The loop above requires two generalisations of the current code: support
>> >> > for addition as well as logical ops, and support for non-constant second
>> >> > operands.  These are harder to detect in the same peephole way, so the
>> >> > patch tries to take a more global approach.
>> >> >
>> >> > The idea is to get information about the minimum operation width
>> >> > in two ways:
>> >> >
>> >> > (1) by using the range information attached to the SSA_NAMEs
>> >> > (effectively a forward walk, since the range info is
>> >> > context-independent).
>> >> >
>> >> > (2) by back-propagating the number of output bits required by
>> >> > users of the result.
>> >> >
>> >> > As explained in the comments, there's a balance to be struck between
>> >> > narrowing an individual operation and fitting in with the surrounding
>> >> > code.  The approach is pretty conservative: if we could narrow an
>> >> > operation to N bits without changing its semantics, it's OK to do
> that if:
>> >> >
>> >> > - no operations later in the chain require more than N bits; or
>> >> >
>> >> > - all internally-defined inputs are extended from N bits or fewer,
>> >> >   and at least one of them is single-use.
>> >> >
>> >> > See the comments for the rationale.
>> >> >
>> >> > I didn't bother adding STMT_VINFO_* wrappers for the new fields
>> >> > since the code seemed more readable without.
>> >> >
>> >> > Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>> >>
>> >> Here's a version rebased on top of current trunk.  Changes from last time:
>> >>
>> >> - reintroduce dump_generic_expr_loc, with the obvious change to the
>> >>   prototype
>> >>
>> >> - fix a typo in a comment
>> >>
>> >> - use vect_element_precision from the new version of 12/n.
>> >>
>> >> Tested as before.  OK to install?
>> >
>> > OK.
>>
>> Thanks.  For the record, here's what I installed (updated on top of
>> Dave's recent patch, and with an obvious fix to vect-widen-mult-u8-u32.c).
>>
>> Richard
>>
> Hi,
>
> It seems the new bb-slp-over-widen tests lack a -fdump option:
> gcc.dg/vect/bb-slp-over-widen-2.c -flto -ffat-lto-objects : dump file
> does not exist
> UNRESOLVED: gcc.dg/vect/bb-slp-over-widen-2.c -flto -ffat-lto-objects
> scan-tree-dump-times vect "basic block vectorized" 2

I've applied the following as obvious.

Richard


2018-07-04  Richard Sandiford  

gcc/testsuite/
* gcc.dg/vect/bb-slp-over-widen-1.c: Fix name of dump file for
final scan test.
* gcc.dg/vect/bb-slp-over-widen-2.c: Likewise.

Index: gcc/testsuite/gcc.dg/vect/bb-slp-over-widen-1.c
===
--- gcc/testsuite/gcc.dg/vect/bb-slp-over-widen-1.c 2018-07-03 
10:59:30.480481417 +0100
+++ gcc/testsuite/gcc.dg/vect/bb-slp-over-widen-1.c 2018-07-04 
08:16:36.210113069 +0100
@@ -63,4 +63,4 @@ main (void)
 
 /* { dg-final { scan-tree-dump "demoting int to signed short" "slp2" { target 
{ ! vect_widen_shift } } } } */
 /* { dg-final { scan-tree-dump "demoting int to unsigned short" "slp2" { 
target { ! vect_widen_shift } } } } */
-/* { dg-final { scan-tree-dump-times "basic block vectorized" 2 "vect" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 2 "slp2" } } */
Index: gcc/testsuite/gcc.dg/vect/bb-slp-over-widen-2.c
===
--- gcc/testsuite/gcc.dg/vect/bb-slp-over-widen-2.c 2018-07-03 
10:59:30.480481417 +0100
+++ gcc/testsuite/gcc.dg/vect/bb-slp-over-widen-2.c 2018-07-04 
08:16:36.210113069 +0100
@@ -62,4 +62,4 @@ main (void)
 
 /* { dg-final { scan-tree-dump "demoting int to signed short" "slp2" { target 
{ ! vect_widen_shift } } } } */
 /* { dg-final { scan-tree-dump "demoting int to unsigned short" "slp2" { 
target { ! vect_widen_shift } } } } */
-/* { dg-final { scan-tree-dump-times "basic block vectorized" 2 "vect" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 2 "slp2" } } */



Re: Extend tree code folds to IFN_COND_*

2018-07-04 Thread Richard Sandiford
Finally getting back to this...

Richard Biener  writes:
> On Wed, Jun 6, 2018 at 10:16 PM Richard Sandiford
>  wrote:
>>
>> > On Thu, May 24, 2018 at 11:36 AM Richard Sandiford
>> >  wrote:
>> >>
>> >> This patch adds match.pd support for applying normal folds to their
>> >> IFN_COND_* forms.  E.g. the rule:
>> >>
>> >>   (plus @0 (negate @1)) -> (minus @0 @1)
>> >>
>> >> also allows the fold:
>> >>
>> >>   (IFN_COND_ADD @0 @1 (negate @2) @3) -> (IFN_COND_SUB @0 @1 @2 @3)
>> >>
>> >> Actually doing this by direct matches in gimple-match.c would
>> >> probably lead to combinatorial explosion, so instead, the patch
>> >> makes gimple_match_op carry a condition under which the operation
>> >> happens ("cond"), and the value to use when the condition is false
>> >> ("else_value").  Thus in the example above we'd do the following
>> >>
>> >> (a) convert:
>> >>
>> >>   cond:NULL_TREE (IFN_COND_ADD @0 @1 @4 @3) else_value:NULL_TREE
>> >>
>> >> to:
>> >>
>> >>   cond:@0 (plus @1 @4) else_value:@3
>> >>
>> >> (b) apply gimple_resimplify to (plus @1 @4)
>> >>
>> >> (c) reintroduce cond and else_value when constructing the result.
>> >>
>> >> Nested operations inherit the condition of the outer operation
>> >> (so that we don't introduce extra faults) but have a null else_value.
>> >> If we try to build such an operation, the target gets to choose what
>> >> else_value it can handle efficiently: obvious choices include one of
>> >> the operands or a zero constant.  (The alternative would be to have some
>> >> representation for an undefined value, but that seems a bit invasive,
>> >> and isn't likely to be useful here.)
>> >>
>> >> I've made the condition a mandatory part of the gimple_match_op
>> >> constructor so that it doesn't accidentally get dropped.
>> >>
>> >> Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
>> >> and x86_64-linux-gnu.  OK to install?
>> >
>> > It looks somewhat clever but after looking for a while it doesn't handle
>> > simplifying
>> >
>> >  (IFN_COND_ADD @0 @1 (IFN_COND_SUB @0 @2 @1 @3) @3)
>> >
>> > to
>> >
>> >  (cond @0 @2 @3)
>> >
>> > right?  Because while the conditional gimple_match_op is built
>> > by try_conditional_simplification it isn't built when doing
>> > SSA use->def following in the generated matching code?
>>
>> Right.  This would be easy to add, but there's no motivating case yet.
>
> ...
>
>> > So it looks like a bit much noise for this very special case?
>> >
>> > I suppose you ran into the need of these foldings from looking
>> > at real code - which foldings specifically were appearing here?
>> > Usually code is well optimized before if-conversion/vectorization
>> > so we shouldn't need full-blown handling?
>>
>> It's needed to get the FMA, FMS, FNMA and FNMS folds for IFN_COND_* too.
>> I thought it'd be better to do it "automatically" rather than add specific
>> folds, since if we don't do it automatically now, it's going to end up
>> being a precedent for not doing it automatically in future either.
>
> ... not like above isn't a similar precedent ;)  But OK, given...

But we're not doing the above case manually either yet :-)  Whereas the
series does need to do what the patch does one way or another.

Also, it might be hard to do the above case manually anyway (i.e. match
nested IFN_COND_* ops with an implicitly-conditional top-level op),
since the match.pd rule wouldn't have easy access to the overall condition.
And that's by design, or so I'd like to claim.
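
For intuition, the scalar analogue of the basic fold being automated
(IFN_COND_ADD with a negated operand becoming IFN_COND_SUB) is just the
rewrite below.  This is a hand-written C sketch for illustration only, not
the match.pd rule or the internal representation:

  #include <assert.h>

  /* Both functions compute the same value; the predicated vector forms
     obey the same algebra lane by lane, with ELS supplying the value for
     inactive lanes.  */
  static int
  cond_add_neg (int p, int a, int b, int els)
  {
    return p ? a + (-b) : els;
  }

  static int
  cond_sub (int p, int a, int b, int els)
  {
    return p ? a - b : els;
  }

  int
  main (void)
  {
    for (int p = 0; p <= 1; p++)
      assert (cond_add_neg (p, 7, 3, 42) == cond_sub (p, 7, 3, 42));
    return 0;
  }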

>> > That said, I'm not sure how much work it is to massage
>> >
>> >   if (gimple *def_stmt = get_def (valueize, op2))
>> > {
>> >   if (gassign *def = dyn_cast  (def_stmt))
>> > switch (gimple_assign_rhs_code (def))
>> >   {
>> >   case PLUS_EXPR:
>> >
>> > to look like
>> >
>> >   if (gimple *def_stmt = get_def (valueize, op2))
>> > {
>> >code = ERROR_MARK;
>> >if (!is_cond_ifn_with

Re: [RFC, testsuite/guality] Use relative line numbers in gdb-test

2018-07-04 Thread Richard Sandiford
Tom de Vries  writes:
> [ was: [PATCH, testsuite/guality] Use line number vars in gdb-test ]
> On Thu, Jun 28, 2018 at 07:49:30PM +0200, Tom de Vries wrote:
>> Hi,
>> 
>> I played around with pr45882.c and ran into FAILs.  It took me a while to
>> realize that the FAILs where due to the gdb-test (a dg-final action) using
>> absolute line numbers, and me adding lines before the gdb-test lines.
>> 
>> I've written this patch, which factors out the handling of relative line
>> numbers as well as line number variables from process-message, and reuses the
>> functionality in gdb-test.
>> 
>> This enables the line number variables functionality in gdb-test.  [ There's
>> one quirk: line number variables have define-before-use semantics (with a
>> matching used-before-defined error) but in the test-case the use in gdb-test
>> precedes the definition in gdb-line.  This doesn't cause errors, because
>> the dg-final actions are executed after the definition has taken effect. ]
>> 
>> [ Relative line numbers still don't work in gdb-test, but that's due to an
>> orthogonal issue: gdb-test is a dg-final action, and while dg-final receives
>> the line number on which it occurred as it's first argument, it doesn't pass
>> on this line number to the argument list of the action. I'll submit a
>> follow-on rfc patch for this. ]
>>
>
> This patch adds a dg-final override that passes its first argument to the
> gdb-test action.  This allows us to use relative line numbers in gdb-test.
>
> Tested pr45882.c.
>
> Any comments?
>  
> Thanks,
> - Tom
>
> [testsuite/guality] Use relative line numbers in gdb-test
>
> 2018-06-28  Tom de Vries  
>
>   * gcc.dg/guality/pr45882.c (foo): Use relative line numbers.
>   * lib/gcc-dg.exp (dg-final): New proc.
>   * lib/gcc-gdb-test.exp (gdb-test): Add and handle additional line number
>   argument.
>
> ---
>  gcc/testsuite/gcc.dg/guality/pr45882.c | 10 +-
>  gcc/testsuite/lib/gcc-dg.exp   | 20 
>  gcc/testsuite/lib/gcc-gdb-test.exp |  4 ++--
>  3 files changed, 27 insertions(+), 7 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/guality/pr45882.c 
> b/gcc/testsuite/gcc.dg/guality/pr45882.c
> index da9e2755590..02d74389ea0 100644
> --- a/gcc/testsuite/gcc.dg/guality/pr45882.c
> +++ b/gcc/testsuite/gcc.dg/guality/pr45882.c
> @@ -9,11 +9,11 @@ volatile short int v;
>  __attribute__((noinline,noclone,used)) int
>  foo (int i, int j)
>  {
> -  int b = i; /* { dg-final { gdb-test bpline "b" "7" } } */
> -  int c = i + 4; /* { dg-final { gdb-test bpline "c" "11" } } */
> -  int d = a[i];  /* { dg-final { gdb-test bpline "d" "112" } } */
> -  int e = a[i + 6];  /* { dg-final { gdb-test bpline "e" "142" } } */
> -  ++v;   /* { dg-line bpline } */
> +  int b = i; /* { dg-final { gdb-test .+4 "b" "7" } } */
> +  int c = i + 4; /* { dg-final { gdb-test .+3 "c" "11" } } */
> +  int d = a[i];  /* { dg-final { gdb-test .+2 "d" "112" } } */
> +  int e = a[i + 6];  /* { dg-final { gdb-test .+1 "e" "142" } } */
> +  ++v;
>return ++j;
>  }
>  
> diff --git a/gcc/testsuite/lib/gcc-dg.exp b/gcc/testsuite/lib/gcc-dg.exp
> index 22065c7e3fe..6f88ce2213e 100644
> --- a/gcc/testsuite/lib/gcc-dg.exp
> +++ b/gcc/testsuite/lib/gcc-dg.exp
> @@ -114,6 +114,26 @@ if [info exists ADDITIONAL_TORTURE_OPTIONS] {
>   [concat $DG_TORTURE_OPTIONS $ADDITIONAL_TORTURE_OPTIONS]
>  }
>  
> +proc dg-final { args } {
> +upvar dg-final-code final-code
> +
> +if { [llength $args] > 2 } {
> + error "[lindex $args 0]: too many arguments"
> +}
> +set line [lindex $args 0]
> +set code [lindex $args 1]
> +set directive [lindex $code 0]
> +set withline \
> + [switch $directive {
> + gdb-test {expr {1}}
> + default  {expr {0}}
> + }]
> +if { $withline == 1 } {
> + set code [linsert $code 1 $line]
> +}
> +append final-code "$code\n"
> +}

Like the idea, but I think:

set withline \
[switch $directive {
gdb-test {expr {1}}
default  {expr {0}}
}]
if { $withline == 1 } {
set code [linsert $code 1 $line]
}

would be clearer as:

switch $directive {
gdb-test {
set code [linsert $code 1 $line]
}
}

Thanks,
Richard


Re: [PATCH] doc clarification: DONE and FAIL in define_split and define_peephole2

2018-07-06 Thread Richard Sandiford
Paul Koning  writes:
> Currently DONE and FAIL are documented only for define_expand, but
> they also work in essentially the same way for define_split and
> define_peephole2.
>
> If FAIL is used in a define_insn_and_split, the output pattern cannot
> be the usual "#" dummy value.
>
> This patch updates the doc to describe those cases.  Ok for trunk?
>
>   paul
>
> ChangeLog:
>
> 2018-07-05  Paul Koning  
>
>   * doc/md.texi (define_split): Document DONE and FAIL.  Describe
>   interaction with usual "#" output template in
>   define_insn_and_split.
>   (define_peephole2): Document DONE and FAIL.
>
> Index: doc/md.texi
> ===
> --- doc/md.texi   (revision 262455)
> +++ doc/md.texi   (working copy)
> @@ -8060,6 +8060,30 @@ those in @code{define_expand}, however, these stat
>  generate any new pseudo-registers.  Once reload has completed, they also
>  must not allocate any space in the stack frame.
>  
> +There are two special macros defined for use in the preparation statements:
> +@code{DONE} and @code{FAIL}.  Use them with a following semicolon,
> +as a statement.
> +
> +@table @code
> +
> +@findex DONE
> +@item DONE
> +Use the @code{DONE} macro to end RTL generation for the splitter.  The
> +only RTL insns generated as replacement for the matched input insn will
> +be those already emitted by explicit calls to @code{emit_insn} within
> +the preparation statements; the replacement pattern is not used.
> +
> +@findex FAIL
> +@item FAIL
> +Make the @code{define_split} fail on this occasion.  When a 
> @code{define_split}
> +fails, it means that the splitter was not truly available for the inputs
> +it was given, and this split is not done.
> +@end table
> +
> +If the preparation falls through (invokes neither @code{DONE} nor
> +@code{FAIL}), then the @code{define_split} uses the replacement
> +template.
> +
>  Patterns are matched against @var{insn-pattern} in two different
>  circumstances.  If an insn needs to be split for delay slot scheduling
>  or insn scheduling, the insn is already known to be valid, which means

Looks good.

> @@ -8232,6 +8256,15 @@ functionality as two separate @code{define_insn} a
>  patterns.  It exists for compactness, and as a maintenance tool to prevent
>  having to ensure the two patterns' templates match.
>  
> +In @code{define_insn_and_split}, the output template is usually simply
> +@samp{#} since the assembly output is done by @code{define_insn}
> +statements matching the generated insns, not by this
> +@code{define_insn_and_split} statement.  But if @code{FAIL} is used in
> +the preparation statements for certain input insns, those will not be
> +split and during assembly output will again match this
> +@code{define_insn_and_split}.  In that case, the appropriate assembly
> +output statements are needed in the output template.
> +

I agree "#" on its own is relatively common, but it's also not that
unusual to have a define_insn_and_split in which the define_insn part
handles simple alternatives directly and leaves more complex ones to
be split.  Maybe that's more common on RISC-like targets.

Also, the define_split matches the template independently of the
define_insn, so it can sometimes split insns that match an earlier
define_insn rather than the one in the define_insn_and_split.
(That might be bad practice.)  So using "#" and FAIL together is valid
if the FAIL only happens for cases that match earlier define_insns.

Another case is when the define_split condition doesn't start with
"&&" and is less strict than the define_insn condition.  This can
be useful if the define_split is supposed to match patterns created
by combine.

So maybe we should instead expand the FAIL documentation to say that
a define_split must not FAIL when splitting an instruction whose
output template is "#".

> @@ -8615,6 +8648,31 @@ so here's a silly made-up example:
>"")
>  @end smallexample
>  
> +There are two special macros defined for use in the preparation statements:
> +@code{DONE} and @code{FAIL}.  Use them with a following semicolon,
> +as a statement.
> +
> +@table @code
> +
> +@findex DONE
> +@item DONE
> +Use the @code{DONE} macro to end RTL generation for the peephole.  The
> +only RTL insns generated as replacement for the matched input insn will
> +be those already emitted by explicit calls to @code{emit_insn} within
> +the preparation statements; the replacement pattern is not used.
> +
> +@findex FAIL
> +@item FAIL
> +Make the @code{define_peephole2} fail on this occasion.  When a 
> @code{define_peephole2}
> +fails, it means that the replacement was not truly available for the
> +particular inputs it was given, and the input insns are left unchanged.

If it FAILs, GCC will try to apply later define_peehole2s instead.
(This is in contrast to define_split, so it's a bit inconsistent.
Would be easy to make define_split behave the same way if there was a
motivating case.)

Thanks,
Richard

Re: [PATCH] doc clarification: DONE and FAIL in define_split and define_peephole2

2018-07-06 Thread Richard Sandiford
Paul Koning  writes:
> @@ -8615,6 +8639,34 @@ so here's a silly made-up example:
>"")
>  @end smallexample
>  
> +There are two special macros defined for use in the preparation statements:
> +@code{DONE} and @code{FAIL}.  Use them with a following semicolon,
> +as a statement.
> +
> +@table @code
> +
> +@findex DONE
> +@item DONE
> +Use the @code{DONE} macro to end RTL generation for the peephole.  The
> +only RTL insns generated as replacement for the matched input insn will
> +be those already emitted by explicit calls to @code{emit_insn} within
> +the preparation statements; the replacement pattern is not used.
> +
> +@findex FAIL
> +@item FAIL
> +Make the @code{define_peephole2} fail on this occasion.  When a 
> @code{define_peephole2}
> +fails, it means that the replacement was not truly available for the
> +particular inputs it was given.  In that case, GCC may still apply a
> +later @code{define_peephole2} that also matches the given insn pattern.
> +(Note that this is different from @code{define_split}, where @code{FAIL}
> +prevents the input insn from being split at all.)
> +@end table
> +
> +If the preparation falls through (invokes neither @code{DONE} nor
> +@code{FAIL}), then the @code{define_peephole2} uses the replacement
> +template.
> +
> +
>  @noindent
>  If we had not added the @code{(match_dup 4)} in the middle of the input
>  sequence, it might have been the case that the register we chose at the

Double empty line.

OK otherwise, thanks.  (Think this counts as a gen* patch.)

Richard


Re: calculate overflow type in wide int arithmetic

2018-07-07 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, Jul 6, 2018 at 9:50 AM Aldy Hernandez  wrote:
>>
>>
>>
>> On 07/05/2018 05:50 AM, Richard Biener wrote:
>> > On Thu, Jul 5, 2018 at 9:35 AM Aldy Hernandez  wrote:
>> >>
>> >> The reason for this patch are the changes showcased in tree-vrp.c.
>> >> Basically I'd like to discourage rolling our own overflow and underflow
>> >> calculation when doing wide int arithmetic.  We should have a
>> >> centralized place for this, that is-- in the wide int code itself ;-).
>> >>
>> >> The only cases I care about are plus/minus, which I have implemented,
>> >> but we also get division for free, since AFAICT, division can only
>> >> overflow positively:
>> >>
>> >>  -MIN / -1 => +OVERFLOW
>> >>
>> >> Multiplication OTOH, can underflow, but I've not implemented it because
>> >> we have no uses for it.  I have added a note in the code explaining this.
>> >>
>> >> Originally I tried to only change plus/minus, but that made code that
>> >> dealt with plus/minus in addition to div or mult a lot uglier.  You'd
>> >> have to special case "int overflow_for_add_stuff" and "bool
>> >> overflow_for_everything_else".  Changing everything to int makes things
>> >> consistent.
>> >>
>> >> Note: I have left poly-int as is, with its concept of yes/no for
>> >> overflow.  I can adapt this as well if desired.
>> >>
>> >> Tested on x86-64 Linux.
>> >>
>> >> OK for trunk?
>> >
>> > looks all straight-forward but the following:
>> >
>> > else if (op1)
>> >   {
>> > if (minus_p)
>> > -   {
>> > - wi = -wi::to_wide (op1);
>> > -
>> > - /* Check for overflow.  */
>> > - if (sgn == SIGNED
>> > - && wi::neg_p (wi::to_wide (op1))
>> > - && wi::neg_p (wi))
>> > -   ovf = 1;
>> > - else if (sgn == UNSIGNED && wi::to_wide (op1) != 0)
>> > -   ovf = -1;
>> > -   }
>> > +   wi = wi::neg (wi::to_wide (op1));
>> > else
>> >  wi = wi::to_wide (op1);
>> >
>> > you fail to handle - -INT_MIN.
>>
>> Woah, very good catch.  I previously had this implemented as wi::sub(0,
>> op1, &ovf) which was calculating overflow correctly but when I
>> implemented the overflow type in wi::neg I missed this.  Thanks.
>>
>> >
>> > Given the fact that for multiplication (or others, didn't look too  close)
>> > you didn't implement the direction indicator I wonder if it would be more
>> > appropriate to do
>> >
>> > enum ovfl { OVFL_NONE = 0, OVFL_UNDERFLOW = -1, OVFL_OVERFLOW = 1,
>> > OVFL_UNKNOWN = 2 };
>> >
>> > and tell us the "truth" here?
>>
>> Excellent idea...though it came with lots of typing :).  Fixed.
>>
>> BTW, if I understand correctly, I've implemented the overflow types
>> correctly for everything but multiplication (which we have no users for
>> and I return OVF_UNKNOWN).  I have indicated this in comments.  Also,
>> for division I did nothing special, as we can only +OVERFLOW.
>>
>> >
>> > Hopefully if (overflow) will still work with that.
>>
>> It does.
>>
>> >
>> > Otherwise can you please add a toplevel comment to wide-int.h as to what 
>> > the
>> > overflow result semantically is for a) SIGNED and b) UNSIGNED operations?
>>
>> Done.  Let me know if the current comment is what you had in mind.
>>
>> OK for trunk?
>
> I'd move accumulate_overflow to wi::, it looks generally useful.  That 
> function
> misses to handle the !suboverflow && overflow case optimally.
>
> I see that poly-int chooses to accumulate overflow (callers need to
> initialize it) while wide_int chooses not to accumulate...  too bad
> this is inconsistent.  Richard?

poly-int needs to accumulate internally when handling multiple coefficients,
but the external interface is the same as for wi:: (no caller initialisation).
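
To make that concrete, here is a toy sketch of "accumulate internally,
single flag externally" for a two-coefficient value.  The types and names
are made up for illustration; this is not the real wi::/poly_int interface:

  #include <limits.h>
  #include <stdbool.h>
  #include <stdio.h>

  /* Toy degree-1 poly_int: the value is c0 + c1 * X for some runtime
     quantity X (e.g. the SVE vector length).  */
  struct toy_poly { int c0, c1; };

  static int
  add_int (int a, int b, bool *ovf)
  {
    long long w = (long long) a + b;
    *ovf = w > INT_MAX || w < INT_MIN;
    return (int) w;
  }

  static struct toy_poly
  toy_poly_add (struct toy_poly a, struct toy_poly b, bool *ovf)
  {
    bool ovf0, ovf1;
    struct toy_poly r;
    r.c0 = add_int (a.c0, b.c0, &ovf0);
    r.c1 = add_int (a.c1, b.c1, &ovf1);
    /* Accumulated across coefficients here; the caller still sees a
       single flag, just as with a plain wide-int addition.  */
    *ovf = ovf0 || ovf1;
    return r;
  }

  int
  main (void)
  {
    bool ovf;
    struct toy_poly a = { INT_MAX, 1 }, b = { 1, 2 };
    toy_poly_add (a, b, &ovf);
    printf ("%d\n", (int) ovf);   /* prints 1: the c0 addition overflowed */
    return 0;
  }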

Richard


Re: [PATCH 0/5] [RFC v2] Higher-level reporting of vectorization problems

2018-07-11 Thread Richard Sandiford
David Malcolm  writes:
> On Mon, 2018-06-25 at 11:10 +0200, Richard Biener wrote:
>> On Fri, 22 Jun 2018, David Malcolm wrote:
>> 
>> > NightStrike and I were chatting on IRC last week about
>> > issues with trying to vectorize the following code:
>> > 
>> > #include <vector>
>> > std::size_t f(std::vector<std::vector<int>> const & v) {
>> >std::size_t ret = 0;
>> >for (auto const & w: v)
>> >ret += w.size();
>> >return ret;
>> > }
>> > 
>> > icc could vectorize it, but gcc couldn't, and neither of us could
>> > immediately figure out what the problem was.
>> > 
>> > Using -fopt-info leads to a wall of text.
>> > 
>> > I tried using my patch here:
>> > 
>> >  "[PATCH] v3 of optinfo, remarks and optimization records"
>> >   https://gcc.gnu.org/ml/gcc-patches/2018-06/msg01267.html
>> > 
>> > It improved things somewhat, by showing:
>> > (a) the nesting structure via indentation, and
>> > (b) the GCC line at which each message is emitted (by using the
>> > "remark" output)
>> > 
>> > but it's still a wall of text:
>> > 
>> >   https://dmalcolm.fedorapeople.org/gcc/2018-06-18/test.cc.remarks.
>> > html
>> >   https://dmalcolm.fedorapeople.org/gcc/2018-06-18/test.cc.d/..%7C.
>> > .%7Csrc%7Ctest.cc.html#line-4
>> > 
>> > It doesn't yet provide a simple high-level message to a
>> > tech-savvy user on what they need to do to get GCC to
>> > vectorize their loop.
>> 
>> Yeah, in particular the vectorizer is way too noisy in its low-level
>> functions.  IIRC -fopt-info-vec-missed is "somewhat" better:
>> 
>> t.C:4:26: note: step unknown.
>> t.C:4:26: note: vector alignment may not be reachable
>> t.C:4:26: note: not ssa-name.
>> t.C:4:26: note: use not simple.
>> t.C:4:26: note: not ssa-name.
>> t.C:4:26: note: use not simple.
>> t.C:4:26: note: no array mode for V2DI[3]
>> t.C:4:26: note: Data access with gaps requires scalar epilogue loop
>> t.C:4:26: note: can't use a fully-masked loop because the target
>> doesn't 
>> have the appropriate masked load or store.
>> t.C:4:26: note: not ssa-name.
>> t.C:4:26: note: use not simple.
>> t.C:4:26: note: not ssa-name.
>> t.C:4:26: note: use not simple.
>> t.C:4:26: note: no array mode for V2DI[3]
>> t.C:4:26: note: Data access with gaps requires scalar epilogue loop
>> t.C:4:26: note: op not supported by target.
>> t.C:4:26: note: not vectorized: relevant stmt not supported: _15 =
>> _14 
>> /[ex] 4;
>> t.C:4:26: note: bad operation or unsupported loop bound.
>> t.C:4:26: note: not vectorized: no grouped stores in basic block.
>> t.C:4:26: note: not vectorized: no grouped stores in basic block.
>> t.C:6:12: note: not vectorized: not enough data-refs in basic block.
>> 
>> 
>> > The pertinent dump messages are:
>> > 
>> > test.cc:4:23: remark: === try_vectorize_loop_1 ===
>> > [../../src/gcc/tree-vectorizer.c:674:try_vectorize_loop_1]
>> > cc1plus: remark:
>> > Analyzing loop at test.cc:4
>> > [../../src/gcc/dumpfile.c:735:ensure_pending_optinfo]
>> > test.cc:4:23: remark:  === analyze_loop_nest ===
>> > [../../src/gcc/tree-vect-loop.c:2299:vect_analyze_loop]
>> > [...snip...]
>> > test.cc:4:23: remark:   === vect_analyze_loop_operations ===
>> > [../../src/gcc/tree-vect-loop.c:1520:vect_analyze_loop_operations]
>> > [...snip...]
>> > test.cc:4:23: remark:==> examining statement: ‘_15 = _14 /[ex]
>> > 4;’ [../../src/gcc/tree-vect-stmts.c:9382:vect_analyze_stmt]
>> > test.cc:4:23: remark:vect_is_simple_use: operand ‘_14’
>> > [../../src/gcc/tree-vect-stmts.c:10064:vect_is_simple_use]
>> > test.cc:4:23: remark:def_stmt: ‘_14 = _8 - _7;’
>> > [../../src/gcc/tree-vect-stmts.c:10098:vect_is_simple_use]
>> > test.cc:4:23: remark:type of def: internal [../../src/gcc/tree-
>> > vect-stmts.c:10112:vect_is_simple_use]
>> > test.cc:4:23: remark:vect_is_simple_use: operand ‘4’
>> > [../../src/gcc/tree-vect-stmts.c:10064:vect_is_simple_use]
>> > test.cc:4:23: remark:op not supported by target.
>> > [../../src/gcc/tree-vect-stmts.c:5932:vectorizable_operation]
>> > test.cc:4:23: remark:not vectorized: relevant stmt not
>> > supported: ‘_15 = _14 /[ex] 4;’ [../../src/gcc/tree-vect-
>> > stmts.c:9565:vect_analyze_stmt]
>> > test.cc:4:23: remark:   bad operation or unsupported loop bound.
>> > [../../src/gcc/tree-vect-loop.c:2043:vect_analyze_loop_2]
>> > cc1plus: remark: vectorized 0 loops in function.
>> > [../../src/gcc/tree-vectorizer.c:904:vectorize_loops]
>> > 
>> > In particular, that complaint from
>> >   [../../src/gcc/tree-vect-stmts.c:9565:vect_analyze_stmt]
>> > is coming from:
>> > 
>> >   if (!ok)
>> > {
>> >   if (dump_enabled_p ())
>> > {
>> >   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >"not vectorized: relevant stmt not ");
>> >   dump_printf (MSG_MISSED_OPTIMIZATION, "supported: ");
>> >   dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM,
>> > stmt, 0);
>> > }
>> > 
>> >   return false;
>> > }
>> > 
>> > This got me thinking: the user p

Re: abstract wide int binop code from VRP

2018-07-11 Thread Richard Sandiford
Richard Biener  writes:
> On Wed, Jul 11, 2018 at 8:48 AM Aldy Hernandez  wrote:
>>
>> Hmmm, I think we can do better, and since this hasn't been reviewed yet,
>> I don't think anyone will mind the adjustment to the patch ;-).
>>
>> I really hate int_const_binop_SOME_RANDOM_NUMBER.  We should abstract
>> them into properly named poly_int_binop, wide_int_binop, and tree_binop,
>> and then use a default argument for int_const_binop() to get things going.
>>
>> Sorry for more changes in flight, but I thought we could benefit from
>> more cleanups :).
>>
>> OK for trunk pending tests?
>
> Much of GCC pre-dates function overloading / default args ;)
>
> Looks OK but can you please rename your tree_binop to int_cst_binop?
> Or maybe inline it into int_const_binop, also sharing the force_fit_type ()
> tail with poly_int_binop?
>
> What about mixed INTEGER_CST / poly_int constants?  Shouldn't it
> be
>
>   if (neither-poly-nor-integer-cst (arg1 || arg2))
> return NULL_TREE;
>   if (poly_int_tree (arg1) || poly_int_tree (arg2))
> poly-int-stuff
>   else if (INTEGER_CST && INTEGER_CST)
> wide-int-stuff
>
> ?  I see that is a pre-existing issue but if you are at refactoring...
> wi::to_poly_wide should handle INTEGER_CST operands just fine
> I hope.

Don't think it's a preexisting issue.  poly_int_tree_p returns true
for anything that can be represented as a poly_int, i.e. both
INTEGER_CST and POLY_INT_CST.  (It wouldn't really make sense to
ask whether something could *only* be represented as a POLY_INT_CST.)

So:

  if (poly_int_tree_p (arg1) && poly_int_tree_p (arg2))
{
  poly_wide_int res;
  bool overflow;
  tree type = TREE_TYPE (arg1);
  signop sign = TYPE_SIGN (type);
  switch (code)
{
case PLUS_EXPR:
  res = wi::add (wi::to_poly_wide (arg1),
 wi::to_poly_wide (arg2), sign, &overflow);
  break;

handles POLY_INT_CST + POLY_INT_CST, POLY_INT_CST + INTEGER_CST and
INTEGER_CST + POLY_INT_CST.

Thanks,
Richard


Re: abstract wide int binop code from VRP

2018-07-11 Thread Richard Sandiford
Aldy Hernandez  writes:
> On 07/11/2018 08:52 AM, Richard Biener wrote:
>> On Wed, Jul 11, 2018 at 8:48 AM Aldy Hernandez  wrote:
>>>
>>> Hmmm, I think we can do better, and since this hasn't been reviewed yet,
>>> I don't think anyone will mind the adjustment to the patch ;-).
>>>
>>> I really hate int_const_binop_SOME_RANDOM_NUMBER.  We should abstract
>>> them into properly named poly_int_binop, wide_int_binop, and tree_binop,
>>> and then use a default argument for int_const_binop() to get things going.
>>>
>>> Sorry for more changes in flight, but I thought we could benefit from
>>> more cleanups :).
>>>
>>> OK for trunk pending tests?
>> 
>> Much of GCC pre-dates function overloading / default args ;)
>
> Heh...and ANSI C.
>
>> 
>> Looks OK but can you please rename your tree_binop to int_cst_binop?
>> Or maybe inline it into int_const_binop, also sharing the force_fit_type ()
>> tail with poly_int_binop?
>
> I tried both, but inlining looked cleaner :).  Done.
>
>> 
>> What about mixed INTEGER_CST / poly_int constants?  Shouldn't it
>> be
>> 
>>if (neither-poly-nor-integer-cst (arg1 || arg2))
>>  return NULL_TREE;
>>if (poly_int_tree (arg1) || poly_int_tree (arg2))
>>  poly-int-stuff
>>else if (INTEGER_CST && INTEGER_CST)
>>  wide-int-stuff
>> 
>> ?  I see that is a pre-existing issue but if you are at refactoring...
>> wi::to_poly_wide should handle INTEGER_CST operands just fine
>> I hope.
>
> This aborted:
> gcc_assert (NUM_POLY_INT_COEFFS != 1);
>
> but even taking it out made the bootstrap die somewhere else.
>
> If it's ok, I'd rather not tackle this now, as I have some more cleanups 
> that are pending on this.  If you feel strongly, I could do it at a 
> later time.
>
> OK pending tests?

LGTM FWIW, just some nits:

> -/* Subroutine of int_const_binop_1 that handles two INTEGER_CSTs.  */
> +/* Combine two wide ints ARG1 and ARG2 under operation CODE to produce
> +   a new constant in RES.  Return FALSE if we don't know how to
> +   evaluate CODE at compile-time.  */
> 
> -static tree
> -int_const_binop_2 (enum tree_code code, const_tree parg1, const_tree parg2,
> -int overflowable)
> +bool
> +wide_int_binop (enum tree_code code,
> + wide_int &res, const wide_int &arg1, const wide_int &arg2,
> + signop sign, wi::overflow_type &overflow)
>  {

IMO we should avoid pass-back by reference like the plague. :-)
It's especially confusing when the code does things like:

case FLOOR_DIV_EXPR:
  if (arg2 == 0)
return false;
  res = wi::div_floor (arg1, arg2, sign, &overflow);
  break;

It looked at first like it was taking the address of a local variable
and failing to propagate the information back up.

I think we should stick to using pointers for this kind of thing.

> -/* Combine two integer constants PARG1 and PARG2 under operation CODE
> -   to produce a new constant.  Return NULL_TREE if we don't know how
> +/* Combine two poly int's ARG1 and ARG2 under operation CODE to
> +   produce a new constant in RES.  Return FALSE if we don't know how
> to evaluate CODE at compile-time.  */
> 
> -static tree
> -int_const_binop_1 (enum tree_code code, const_tree arg1, const_tree arg2,
> -int overflowable)
> +static bool
> +poly_int_binop (poly_wide_int &res, enum tree_code code,
> + const_tree arg1, const_tree arg2,
> + signop sign, wi::overflow_type &overflow)
>  {

Would be good to be consistent about the order of the result and code
arguments.  Here it's "result, code" (which seems better IMO),
but in wide_int_binop it's "code, result".

> +/* Combine two integer constants PARG1 and PARG2 under operation CODE
> +   to produce a new constant.  Return NULL_TREE if we don't know how
> +   to evaluate CODE at compile-time.  */
> +
>  tree
> -int_const_binop (enum tree_code code, const_tree arg1, const_tree arg2)
> +int_const_binop (enum tree_code code, const_tree arg1, const_tree arg2,
> +  int overflowable)

s/PARG/ARG/g in comment.

>  {
> -  return int_const_binop_1 (code, arg1, arg2, 1);
> +  bool success = false;
> +  poly_wide_int poly_res;
> +  tree type = TREE_TYPE (arg1);
> +  signop sign = TYPE_SIGN (type);
> +  wi::overflow_type overflow = wi::OVF_NONE;
> +
> +  if (TREE_CODE (arg1) == INTEGER_CST && TREE_CODE (arg2) == INTEGER_CST)
> +{
> +  wide_int warg1 = wi::to_wide (arg1), res;
> +  wide_int warg2 = wi::to_wide (arg2, TYPE_PRECISION (type));
> +  success = wide_int_binop (code, res, warg1, warg2, sign, overflow);
> +  poly_res = res;
> +}
> +  else if (poly_int_tree_p (arg1) && poly_int_tree_p (arg2))
> +success = poly_int_binop (poly_res, code, arg1, arg2, sign, overflow);
> +  if (success)
> +return force_fit_type (type, poly_res, overflowable,
> +(((sign == SIGNED || overflowable == -1)
> +  && overflow)
> + | TREE_OVERFLOW (arg1) | TREE_OVERFL

Re: RFC: lra-constraints.c and TARGET_HARD_REGNO_CALL_PART_CLOBBERED question/patch

2018-07-11 Thread Richard Sandiford
Jeff Law  writes:
> On 07/11/2018 02:07 PM, Steve Ellcey wrote:
>> I have a reload/register allocation question and possible patch.  While
>> working on the Aarch64 SIMD ABI[1] I ran into a problem where GCC was
>> saving and restoring registers that it did not need to.  I tracked it
>> down to lra-constraints.c and its use of
>> targetm.hard_regno_call_part_clobbered on instructions that are not
>> calls.  Specifically need_for_call_save_p would check this macro even
>> when the instruction in question (unknown to need_for_call_save_p)
>> was not a call instruction.
>> 
>> This seems wrong to me and I was wondering if anyone more familiar
>> with the register allocator and reload could look at this patch and
>> tell me if it seems reasonable or not.  It passed bootstrap and I
>> am running tests now.  I am just wondering if there is any reason why
>> this target function would need to be called on non-call instructions
>> or if doing so is just an oversight/bug.
>> 
>> Steve Ellcey
>> sell...@cavium.com
>> 
>> 
>> [1] https://gcc.gnu.org/ml/gcc/2018-07/msg00012.html
>> 
>> 
>> 2018-07-11  Steve Ellcey  
>> 
>>  * lra-constraints.c (need_for_call_save_p): Add insn argument
>>  and only check targetm.hard_regno_call_part_clobbered on calls.
>>  (need_for_split_p): Add insn argument, pass to need_for_call_save_p.
>>  (split_reg): Pass insn to need_for_call_save_p.
>>  (split_if_necessary): Pass curr_insn to need_for_split_p.
>>  (inherit_in_ebb): Ditto.
> Various targets have calls which are exposed as INSNs rather than as
> CALL_INSNs.   So we need to check that hook on all insns.
>
> You can probably see this in action with the TLS insns on aarch64.

Not sure whether it's that: I think other code only considers
hard_regno_call_part_clobbered on calls.  But as it stands
need_for_call_save_p is checking whether there's a call somewhere
in between the current instruction and the last use in the EBB:

/* Return true if we need a caller save/restore for pseudo REGNO which
   was assigned to a hard register.  */
static inline bool
need_for_call_save_p (int regno)
{
  lra_assert (regno >= FIRST_PSEUDO_REGISTER && reg_renumber[regno] >= 0);
  return (usage_insns[regno].calls_num < calls_num
...
}

So it only calls targetm.hard_regno_call_part_clobbered if such a
call is known to exist somewhere between the two references to
regno (although we don't have the calls themselves to hand).

Thanks,
Richard


Re: abstract wide int binop code from VRP

2018-07-12 Thread Richard Sandiford
Aldy Hernandez  writes:
> On 07/11/2018 01:33 PM, Richard Sandiford wrote:
>> Aldy Hernandez  writes:
>>> On 07/11/2018 08:52 AM, Richard Biener wrote:
>>>> On Wed, Jul 11, 2018 at 8:48 AM Aldy Hernandez  wrote:
>>>>>
>>>>> Hmmm, I think we can do better, and since this hasn't been reviewed yet,
>>>>> I don't think anyone will mind the adjustment to the patch ;-).
>>>>>
>>>>> I really hate int_const_binop_SOME_RANDOM_NUMBER.  We should abstract
>>>>> them into properly named poly_int_binop, wide_int_binop, and tree_binop,
>>>>> and then use a default argument for int_const_binop() to get things going.
>>>>>
>>>>> Sorry for more changes in flight, but I thought we could benefit from
>>>>> more cleanups :).
>>>>>
>>>>> OK for trunk pending tests?
>>>>
>>>> Much of GCC pre-dates function overloading / default args ;)
>>>
>>> Heh...and ANSI C.
>>>
>>>>
>>>> Looks OK but can you please rename your tree_binop to int_cst_binop?
>>>> Or maybe inline it into int_const_binop, also sharing the force_fit_type ()
>>>> tail with poly_int_binop?
>>>
>>> I tried both, but inlining looked cleaner :).  Done.
>>>
>>>>
>>>> What about mixed INTEGER_CST / poly_int constants?  Shouldn't it
>>>> be
>>>>
>>>> if (neither-poly-nor-integer-cst (arg1 || arg2))
>>>>   return NULL_TREE;
>>>> if (poly_int_tree (arg1) || poly_int_tree (arg2))
>>>>   poly-int-stuff
>>>> else if (INTEGER_CST && INTEGER_CST)
>>>>   wide-int-stuff
>>>>
>>>> ?  I see that is a pre-existing issue but if you are at refactoring...
>>>> wi::to_poly_wide should handle INTEGER_CST operands just fine
>>>> I hope.
>>>
>>> This aborted:
>>> gcc_assert (NUM_POLY_INT_COEFFS != 1);
>>>
>>> but even taking it out made the bootstrap die somewhere else.
>>>
>>> If it's ok, I'd rather not tackle this now, as I have some more cleanups
>>> that are pending on this.  If you feel strongly, I could do it at a
>>> later time.
>>>
>>> OK pending tests?
>> 
>> LGTM FWIW, just some nits:
>> 
>>> -/* Subroutine of int_const_binop_1 that handles two INTEGER_CSTs.  */
>>> +/* Combine two wide ints ARG1 and ARG2 under operation CODE to produce
>>> +   a new constant in RES.  Return FALSE if we don't know how to
>>> +   evaluate CODE at compile-time.  */
>>>
>>> -static tree
>>> -int_const_binop_2 (enum tree_code code, const_tree parg1, const_tree parg2,
>>> -  int overflowable)
>>> +bool
>>> +wide_int_binop (enum tree_code code,
>>> +   wide_int &res, const wide_int &arg1, const wide_int &arg2,
>>> +   signop sign, wi::overflow_type &overflow)
>>>   {
>> 
>> IMO we should avoid pass-back by reference like the plague. :-)
>> It's especially confusing when the code does things like:
>> 
>>  case FLOOR_DIV_EXPR:
>>if (arg2 == 0)
>>  return false;
>>res = wi::div_floor (arg1, arg2, sign, &overflow);
>>break;
>  >
>  > It looked at first like it was taking the address of a local variable
>  > and failing to propagate the information back up.
>  >
>  > I think we should stick to using pointers for this kind of thing.
>  >
>
> Hmmm, I kinda like them.  It just takes some getting used to, but 
> generally yields cleaner code as you don't have to keep using '*' 
> everywhere.  Plus, the callee can assume the pointer is non-zero.

But it can assume that for "*" too.

The problem isn't getting used to them.  I've worked on codebases where
this is the norm before and had to live with it.  It's just always felt
a mistake even then.

E.g. compare:

  int_const_binop_1 (code, arg1, arg2, overflowable);

and:

  wide_int_binop (code, res, arg1, arg2, sign, overflow);

There's just no visual clue to tell you that "overflowable" is an
input and "overflow" is an output.  ("overflowable" could well be
an output from the raw meaning: "the calculation might have induced
an overflow, but we're not sure".)

I wouldn't mind so much if we had a convention that the outputs
had a suffix to make it clear that they were outputs.  But that
would be more typing than "*".
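
With pointers the same call would read:

  wide_int_binop (code, &res, arg1, arg2, sign, &overflow);

and the outputs would be obvious at a glance.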

Thanks,
Richard


Re: [PATCH][GCC][AARCH64] Canonicalize aarch64 widening simd plus insns

2018-07-12 Thread Richard Sandiford
Looks good to me FWIW (not a maintainer), just a minor formatting thing:

Matthew Malcomson  writes:
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> aac5fa146ed8dde4507a0eb4ad6a07ce78d2f0cd..67b29cbe2cad91e031ee23be656ec61a403f2cf9
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -3302,38 +3302,78 @@
>DONE;
>  })
>  
> -(define_insn "aarch64_w"
> +(define_insn "aarch64_subw"
>[(set (match_operand: 0 "register_operand" "=w")
> -(ADDSUB: (match_operand: 1 "register_operand" "w")
> - (ANY_EXTEND:
> -   (match_operand:VD_BHSI 2 "register_operand" "w"]
> + (minus:
> +  (match_operand: 1 "register_operand" "w")
> +  (ANY_EXTEND:
> +(match_operand:VD_BHSI 2 "register_operand" "w"]

The (minus should be under the "(match_operand":

(define_insn "aarch64_subw"
  [(set (match_operand: 0 "register_operand" "=w")
(minus: (match_operand: 1 "register_operand" "w")
   (ANY_EXTEND:
 (match_operand:VD_BHSI 2 "register_operand" "w"]

Same for the other patterns.

Thanks,
Richard


Re: Add IFN_COND_FMA functions

2018-07-12 Thread Richard Sandiford
Richard Biener  writes:
> On Thu, May 24, 2018 at 2:08 PM Richard Sandiford <
> richard.sandif...@linaro.org> wrote:
>
>> This patch adds conditional equivalents of the IFN_FMA built-in functions.
>> Most of it is just a mechanical extension of the binary stuff.
>
>> Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
>> and x86_64-linux-gnu.  OK for the non-AArch64 bits?
>
> OK.

Thanks.  For the record, here's what I installed after updating
the SVE patterns in line with rth's recent MOVPRFX changes.

Richard

2018-07-12  Richard Sandiford  

gcc/
* doc/md.texi (cond_fma, cond_fms, cond_fnma, cond_fnms): Document.
* optabs.def (cond_fma_optab, cond_fms_optab, cond_fnma_optab)
(cond_fnms_optab): New optabs.
* internal-fn.def (COND_FMA, COND_FMS, COND_FNMA, COND_FNMS): New
internal functions.
(FMA): Use DEF_INTERNAL_FLT_FN rather than DEF_INTERNAL_FLT_FLOATN_FN.
* internal-fn.h (get_conditional_internal_fn): Declare.
(get_unconditional_internal_fn): Likewise.
* internal-fn.c (cond_ternary_direct): New macro.
(expand_cond_ternary_optab_fn): Likewise.
(direct_cond_ternary_optab_supported_p): Likewise.
(FOR_EACH_COND_FN_PAIR): Likewise.
(get_conditional_internal_fn): New function.
(get_unconditional_internal_fn): Likewise.
* gimple-match.h (gimple_match_op::MAX_NUM_OPS): Bump to 5.
(gimple_match_op::gimple_match_op): Add a new overload for 5
operands.
(gimple_match_op::set_op): Likewise.
(gimple_resimplify5): Declare.
* genmatch.c (decision_tree::gen): Generate simplifications for
5 operands.
* gimple-match-head.c (gimple_simplify): Define an overload for
5 operands.  Handle calls with 5 arguments in the top-level overload.
(convert_conditional_op): Handle conversions from unconditional
internal functions to conditional ones.
(gimple_resimplify5): New function.
(build_call_internal): Pass a fifth operand.
(maybe_push_res_to_seq): Likewise.
(try_conditional_simplification): Try converting conditional
internal functions to unconditional internal functions.
Handle 3-operand unconditional forms.
* match.pd (UNCOND_TERNARY, COND_TERNARY): Operator lists.
Define ternary equivalents of the current rules for binary conditional
internal functions.
* config/aarch64/aarch64.c (aarch64_preferred_else_value): Handle
ternary operations.
* config/aarch64/iterators.md (UNSPEC_COND_FMLA, UNSPEC_COND_FMLS)
(UNSPEC_COND_FNMLA, UNSPEC_COND_FNMLS): New unspecs.
(optab): Handle them.
(SVE_COND_FP_TERNARY): New int iterator.
(sve_fmla_op, sve_fmad_op): New int attributes.
* config/aarch64/aarch64-sve.md (cond_)
(*cond__2, *cond__any): New SVE_COND_FP_TERNARY patterns.

gcc/testsuite/
* gcc.dg/vect/vect-cond-arith-3.c: New test.
* gcc.target/aarch64/sve/vcond_13.c: Likewise.
* gcc.target/aarch64/sve/vcond_13_run.c: Likewise.
* gcc.target/aarch64/sve/vcond_14.c: Likewise.
* gcc.target/aarch64/sve/vcond_14_run.c: Likewise.
* gcc.target/aarch64/sve/vcond_15.c: Likewise.
* gcc.target/aarch64/sve/vcond_15_run.c: Likewise.
* gcc.target/aarch64/sve/vcond_16.c: Likewise.
* gcc.target/aarch64/sve/vcond_16_run.c: Likewise.

Index: gcc/doc/md.texi
===
--- gcc/doc/md.texi 2018-07-12 12:39:27.789323671 +0100
+++ gcc/doc/md.texi 2018-07-12 12:42:44.366933190 +0100
@@ -6438,6 +6438,23 @@ Operands 0, 2, 3 and 4 all have mode @va
 integer if @var{m} is scalar, otherwise it has the mode returned by
 @code{TARGET_VECTORIZE_GET_MASK_MODE}.
 
+@cindex @code{cond_fma@var{mode}} instruction pattern
+@cindex @code{cond_fms@var{mode}} instruction pattern
+@cindex @code{cond_fnma@var{mode}} instruction pattern
+@cindex @code{cond_fnms@var{mode}} instruction pattern
+@item @samp{cond_fma@var{mode}}
+@itemx @samp{cond_fms@var{mode}}
+@itemx @samp{cond_fnma@var{mode}}
+@itemx @samp{cond_fnms@var{mode}}
+Like @samp{cond_add@var{m}}, except that the conditional operation
+takes 3 operands rather than two.  For example, the vector form of
+@samp{cond_fma@var{mode}} is equivalent to:
+
+@smallexample
+for (i = 0; i < GET_MODE_NUNITS (@var{m}); i++)
+  op0[i] = op1[i] ? fma (op2[i], op3[i], op4[i]) : op5[i];
+@end smallexample
+
 @cindex @code{neg@var{mode}cc} instruction pattern
 @item @samp{neg@var{mode}cc}
 Similar to @samp{mov@var{mode}cc} but for conditional negation.  Conditionally
Index: gcc/optabs.def
===
--- gcc/optabs.def  2018-07-12 12:39:27.976869878 +0100
+++ gcc/optabs.def  2018-07-12 12:42:44.368856626 +0100
@@ -234,6

[gen/AArch64] Generate helpers for substituting iterator values into pattern names

2018-07-13 Thread Richard Sandiford
Given a pattern like:

  (define_insn "aarch64_frecpe<mode>" ...)

the SVE ACLE implementation wants to generate the pattern for a
particular (non-constant) mode.  This patch automatically generates
helpers to do that, specifically:

  // Return CODE_FOR_nothing on failure.
  insn_code maybe_code_for_aarch64_frecpe (machine_mode);

  // Assert that the code exists.
  insn_code code_for_aarch64_frecpe (machine_mode);

  // Return NULL_RTX on failure.
  rtx maybe_gen_aarch64_frecpe (machine_mode, rtx, rtx);

  // Assert that generation succeeds.
  rtx gen_aarch64_frecpe (machine_mode, rtx, rtx);
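
For instance (illustrative only, with "dst" and "src" standing for
whatever operands the caller has), code that only knows the mode at
run time can then do:

  if (rtx pat = maybe_gen_aarch64_frecpe (mode, dst, src))
    emit_insn (pat);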

Many patterns don't have sensible names when all <...>s are removed.
E.g. "2" would give a base name "2".  The new functions
therefore require explicit opt-in, which should also help to reduce
code bloat.

The (arbitrary) opt-in syntax I went for was to prefix the pattern
name with '@', similarly to the existing '*' marker.

The patch also makes config/aarch64 use the new routines in cases where
they obviously apply.  This was mostly straight-forward, but it seemed
odd that we defined:

   aarch64_reload_movcp<...>

but then only used it with DImode, never SImode.  If we should be
using Pmode instead of DImode, then that's a simple change,
but should probably be a separate patch.

Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
and x86_64-linux-gnu.  I think I can self-approve the gen* bits,
but OK for the AArch64 parts?

Any objections to this approach or syntax?

Richard


2018-07-13  Richard Sandiford  

gcc/
* doc/md.texi: Expand the documentation of instruction names
to mention port-local uses.  Document '@' in pattern names.
* read-md.h (overloaded_instance, overloaded_name): New structs.
(mapping): Declare.
(md_reader::handle_overloaded_name): New member function.
(md_reader::get_overloads): Likewise.
(md_reader::m_first_overload): New member variable.
(md_reader::m_next_overload_ptr): Likewise.
(md_reader::m_overloads_htab): Likewise.
* read-md.c (md_reader::md_reader): Initialize m_first_overload,
m_next_overload_ptr and m_overloads_htab.
* read-rtl.c (iterator_group): Add "type" and "get_c_token" fields.
(get_mode_token, get_code_token, get_int_token): New functions.
(map_attr_string): Add an optional argument that passes back
the associated iterator.
(overloaded_name_hash, overloaded_name_eq_p, named_rtx_p):
(md_reader::handle_overloaded_name, add_overload_instance): New
functions.
(apply_iterators): Handle '@' names.  Report an error if '@'
is used without iterators.
(initialize_iterators): Initialize the new iterator_group fields.
* gencodes.c (handle_overloaded_code_for): New function.
(main): Use it to print declarations of maybe_code_for_* functions
and inline definitions of code_for_*.
* genflags.c (emit_overloaded_gen_proto): New function.
(main): Use it to print declarations of maybe_gen_* functions
and inline definitions of gen_*.
* genemit.c (print_overload_arguments, print_overload_test)
(handle_overloaded_code_for, handle_overloaded_gen): New functions.
(main): Use it to print definitions of maybe_code_for_* and
maybe_gen_* functions.
* config/aarch64/aarch64.c (aarch64_split_128bit_move): Use
gen_aarch64_mov{low,high}_di and gen_aarch64_movdi_{low,high}
instead of explicit mode checks.
(aarch64_split_simd_combine): Likewise gen_aarch64_simd_combine.
(aarch64_split_simd_move): Likewise gen_aarch64_split_simd_mov.
(aarch64_emit_load_exclusive): Likewise gen_aarch64_load_exclusive.
(aarch64_emit_store_exclusive): Likewise gen_aarch64_store_exclusive.
(aarch64_expand_compare_and_swap): Likewise
gen_aarch64_compare_and_swap and gen_aarch64_compare_and_swap_lse
(aarch64_gen_atomic_cas): Likewise gen_aarch64_atomic_cas.
(aarch64_emit_atomic_swap): Likewise gen_aarch64_atomic_swp.
(aarch64_constant_pool_reload_icode): Delete.
(aarch64_secondary_reload): Use code_for_aarch64_reload_movcp
instead of aarch64_constant_pool_reload_icode.  Use
code_for_aarch64_reload_mov instead of explicit mode checks.
(rsqrte_type, get_rsqrte_type, rsqrts_type, get_rsqrts_type): Delete.
(aarch64_emit_approx_sqrt): Use gen_aarch64_rsqrte instead of
get_rsqrte_type and gen_aarch64_rsqrts instead of gen_rqrts_type.
(recpe_type, get_recpe_type, recps_type, get_recps_type): Delete.
(aarch64_emit_approx_div): Use gen_aarch64_frecpe instead of
get_recpe_type and gen_aarch64_frecps instead of get_recps_type.
(aarch64_atomic_load_op_code): Delete.
(aarch64_emit_atomic_

Re: [PATCH]Use MIN/MAX_EXPR for intrinsics or __builtin_fmin/max when appropriate

2018-07-18 Thread Richard Sandiford
Richard Biener  writes:
> On Wed, Jul 18, 2018 at 11:50 AM Kyrill Tkachov
>  wrote:
>>
>>
>> On 18/07/18 10:44, Richard Biener wrote:
>> > On Tue, Jul 17, 2018 at 3:46 PM Kyrill Tkachov
>> >  wrote:
>> >> Hi Richard,
>> >>
>> >> On 17/07/18 14:27, Richard Biener wrote:
>> >>> On Tue, Jul 17, 2018 at 2:35 PM Kyrill Tkachov
>> >>>  wrote:
>>  Hi all,
>> 
>>  This is my first Fortran patch, so apologies if I'm missing something.
>>  The current expansion of the min and max intrinsics explicitly expands
>>  the comparisons between each argument to calculate the global min/max.
>>  Some targets, like aarch64, have instructions that can calculate
>>  the min/max
>>  of two real (floating-point) numbers with the proper NaN-handling
>>  semantics
>>  (if both inputs are NaN, return Nan. If one is NaN, return the
>>  other) and those
>>  are the semantics provided by the __builtin_fmin/max family of
>>  functions that expand
>>  to these instructions.
>> 
>>  This patch makes the frontend emit __builtin_fmin/max directly to
>>  compare each
>>  pair of numbers when the numbers are floating-point, and use
>>  MIN_EXPR/MAX_EXPR otherwise
>>  (integral types and -ffast-math) which should hopefully be easier
>>  to recognise in the
>> >>> What is Fortrans requirement on min/max intrinsics?  Doesn't it only
>> >>> require things that
>> >>> are guaranteed by MIN/MAX_EXPR anyways?  The only restriction here is
>> >> The current implementation expands to:
>> >>   mvar = a1;
>> >>   if (a2 .op. mvar || isnan (mvar))
>> >> mvar = a2;
>> >>   if (a3 .op. mvar || isnan (mvar))
>> >> mvar = a3;
>> >>   ...
>> >>   return mvar;
>> >>
>> >> That is, if one of the operands is a NaN it will return the other 
>> >> argument.
>> >> If both (all) are NaNs, it will return NaN. This is the same as the 
>> >> semantics of fmin/max
>> >> as far as I can tell.
>> >>
>> >>> /* Minimum and maximum values.  When used with floating point, if both
>> >>>  operands are zeros, or if either operand is NaN, then it is
>> >>> unspecified
>> >>>  which of the two operands is returned as the result.  */
>> >>>
>> >>> which means MIN/MAX_EXPR are not strictly IEEE compliant with signed
>> >>> zeros or NaNs.
>> >>> Thus the correct test would be !HONOR_SIGNED_ZEROS && !HONOR_NANS
>> >>> if signed
>> >>> zeros are significant.
>> >> True, MIN/MAX_EXPR would not be appropriate in that condition. I
>> >> guarded their use
>> >> on !HONOR_NANS (type) only. I'll update it to !HONOR_SIGNED_ZEROS
>> >> (type) && !HONOR_NANS (type).
>> >>
>> >>
>> >>> I'm not sure if using fmin/max calls when we cannot use MIN/MAX_EXPR
>> >>> is a good idea,
>> >>> this may both generate bigger code and be slower.
>> >> The patch will generate fmin/fmax calls (or the fminf,fminl
>> >> variants) when mathfn_built_in advertises
>> >> them as available (does that mean they'll have a fast inline
>> >> implementation?)
>> > This doesn't mean anything given you make them available with your
>> > patch ;)  So I expect it may
>> > cause issues for !c99_runtime targets (and long double at least).
>>
>> Urgh, that can cause headaches...
>>
>> >> If the above doesn't hold and we can't use either MIN/MAX_EXPR of
>> >> fmin/fmax then the patch falls back
>> >> to the existing expansion.
>> > As said I would not use fmin/fmax calls here at all.
>>
>> ... Given the comments from Thomas and Janne, maybe we should just
>> emit MIN/MAX_EXPRs here
>> since there is no language requirement on NaN/signed zero handling on
>> these intrinsics?
>> That should make it simpler and more portable.
>
> That's fortran maintainers call.
>
>> >> FWIW, this patch does improve performance on 521.wrf from SPEC2017
>> >> on aarch64.
>> > You said that, yes.  Even without -ffast-math?
>>
>> It improves at -O3 without -ffast-math in particular. With -ffast-math
>> phiopt optimisation
>> is more aggressive and merges the conditionals into MIN/MAX_EXPRs
>> (minmax_replacement in tree-ssa-phiopt.c)
>
> The question is will it be slower without -ffast-math, that is, when
> fmin/max() calls are emitted rather
> than inline conditionals.
>
> I think a patch just using MAX/MIN_EXPR within the existing
> constraints and otherwise falling back to
> the current code would be more obvious and other changes should be
> made independently.

If going to MIN_EXPR and MAX_EXPR unconditionally isn't acceptable,
maybe an alternative would be to go straight to internal functions,
under the usual:

  direct_internal_fn_supported_p (IFN_F{MIN,MAX}, type, OPTIMIZE_FOR_SPEED)

condition.
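
E.g. for the max case, roughly (sketch only; mvar and val are the
variables the existing expansion code already uses):

  if (direct_internal_fn_supported_p (IFN_FMAX, type, OPTIMIZE_FOR_SPEED))
    calc = build_call_expr_internal_loc (input_location, IFN_FMAX, type,
                                         2, mvar, convert (type, val));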

Thanks,
Richard


Re: [PATCH][Fortran][v2] Use MIN/MAX_EXPR for min/max intrinsics

2018-07-18 Thread Richard Sandiford
Thanks for doing this.

Kyrill  Tkachov  writes:
> +   calc = build_call_expr_internal_loc (input_location, ifn, type,
> +   2, mvar, convert (type, val));

(indentation looks off)

> diff --git a/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90 
> b/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90
> new file mode 100644
> index 
> ..8c8ea063e5d0718dc829c1f5574c5b46040e6786
> --- /dev/null
> +++ b/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90
> @@ -0,0 +1,9 @@
> +! { dg-do compile { target aarch64*-*-* } }
> +! { dg-options "-O2 -fdump-tree-optimized" }
> +
> +subroutine fool (a, b, c, d, e, f, g, h)
> +  real (kind=16) :: a, b, c, d, e, f, g, h
> +  a = max (a, b, c, d, e, f, g, h)
> +end subroutine
> +
> +! { dg-final { scan-tree-dump-times "__builtin_fmaxl " 7 "optimized" } }
> diff --git a/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90 
> b/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90
> new file mode 100644
> index 
> ..92368917fb48e0c468a16d080ab3a9ac842e01a7
> --- /dev/null
> +++ b/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90
> @@ -0,0 +1,9 @@
> +! { dg-do compile { target aarch64*-*-* } }
> +! { dg-options "-O2 -fdump-tree-optimized" }
> +
> +subroutine fool (a, b, c, d, e, f, g, h)
> +  real (kind=16) :: a, b, c, d, e, f, g, h
> +  a = min (a, b, c, d, e, f, g, h)
> +end subroutine
> +
> +! { dg-final { scan-tree-dump-times "__builtin_fminl " 7 "optimized" } }

Do these still pass?  I wouldn't have expected us to use __builtin_fmin*
and __builtin_fmax* now.

It would be good to have tests that we use ".FMIN" and ".FMAX" for kind=4
and kind=8 on AArch64, since that's really the end goal here.

Thanks,
Richard


[wwwdocs] Document new sve-acle-branch

2018-07-18 Thread Richard Sandiford
Hi,

I've created a new git branch for developing the SVE ACLE (i.e. intrinsics)
implementation.  Is the branches entry below OK to commit?  Although the
branch is on git rather than svn, other git branches have also been
documented here.

Thanks,
Richard


Index: htdocs/svn.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/svn.html,v
retrieving revision 1.222
diff -u -p -r1.222 svn.html
--- htdocs/svn.html 2 Jun 2018 21:16:11 -   1.222
+++ htdocs/svn.html 18 Jul 2018 14:44:59 -
@@ -394,6 +394,15 @@ the command svn log --stop-on-copy
 Architecture-specific
 
 
+  https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/aarch64/sve-acle-branch";>aarch64/sve-acle-branch
+  This https://gcc.gnu.org/wiki/GitMirror";>Git-only branch is
+  used for collaborative development of the AArch64 SVE ACLE implementation.
+  The branch is based off and merged with trunk.  Please send patches to
+  gcc-patches with an [SVE ACLE] tag in the subject line.
+  There's no need to use changelogs; the changelogs will instead be
+  written when the work is ready to be merged into trunk.  The branch is
+  maintained by Richard Sandiford.
+
   arc-20081210-branch
   The goal of this branch is to make the port to the ARCompact
   architecture available.  This branch is maintained by Joern Rennecke


[AArch64] Add support for 16-bit FMOV immediates

2018-07-18 Thread Richard Sandiford
aarch64_float_const_representable_p was still returning false for
HFmode, so we wouldn't use 16-bit FMOV immediate.  E.g. before the
patch:

__fp16 foo (void) { return 0x1.1p-3; }

gave:

   mov w0, 12352
   fmov h0, w0

with -march=armv8.2-a+fp16, whereas now it gives:

   fmov h0, 1.328125e-1
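
(For reference, 0x1.1p-3 is 17/128 = 0.1328125, so the two sequences load
the same value: 12352 is 0x3040, the IEEE half-precision encoding of
0.1328125.)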

Tested on aarch64-linux-gnu, both with and without SVE.  OK to install?

Richard


2018-07-18  Richard Sandiford  

gcc/
* config/aarch64/aarch64.c (aarch64_float_const_representable_p):
Allow HFmode constants if TARGET_FP_F16INST.

gcc/testsuite/
* gcc.target/aarch64/f16_mov_immediate_1.c: Expect fmov immediate
to be used.
* gcc.target/aarch64/f16_mov_immediate_2.c: Likewise.
* gcc.target/aarch64/f16_mov_immediate_3.c: Force +nofp16.
* gcc.target/aarch64/sve/single_1.c: Expect fmov immediate to be used
for .h.
* gcc.target/aarch64/sve/single_2.c: Likewise.
* gcc.target/aarch64/sve/single_3.c: Likewise.
* gcc.target/aarch64/sve/single_4.c: Likewise.

Index: gcc/config/aarch64/aarch64.c
===
--- gcc/config/aarch64/aarch64.c2018-07-18 18:45:26.0 +0100
+++ gcc/config/aarch64/aarch64.c2018-07-18 18:45:27.025332090 +0100
@@ -14908,8 +14908,8 @@ aarch64_float_const_representable_p (rtx
   if (!CONST_DOUBLE_P (x))
 return false;
 
-  /* We don't support HFmode constants yet.  */
-  if (GET_MODE (x) == VOIDmode || GET_MODE (x) == HFmode)
+  if (GET_MODE (x) == VOIDmode
+  || (GET_MODE (x) == HFmode && !TARGET_FP_F16INST))
 return false;
 
   r = *CONST_DOUBLE_REAL_VALUE (x);
Index: gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_1.c
===
--- gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_1.c  2018-07-18 
18:45:26.0 +0100
+++ gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_1.c  2018-07-18 
18:45:27.025332090 +0100
@@ -44,6 +44,6 @@ __fp16 f5 ()
   return a;
 }
 
-/* { dg-final { scan-assembler-times "mov\tw\[0-9\]+, #?19520"   3 } } 
*/
-/* { dg-final { scan-assembler-times "movi\tv\[0-9\]+\\\.4h, 0xbc, lsl 8"  1 } 
} */
-/* { dg-final { scan-assembler-times "movi\tv\[0-9\]+\\\.4h, 0x4c, lsl 8"  1 } 
} */
+/* { dg-final { scan-assembler-times {fmov\th[0-9]+, #?1\.7e\+1}  3 } } */
+/* { dg-final { scan-assembler-times {fmov\th[0-9]+, #?-1\.0e\+0} 1 } } */
+/* { dg-final { scan-assembler-times {fmov\th[0-9]+, #?1\.6e\+1}  1 } } */
Index: gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_2.c
===
--- gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_2.c  2018-07-18 
18:45:26.0 +0100
+++ gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_2.c  2018-07-18 
18:45:27.025332090 +0100
@@ -40,6 +40,4 @@ float16_t f3(void)
 /* { dg-final { scan-assembler-times "movi\tv\[0-9\]+\\\.4h, 0x5c, lsl 8" 1 } 
} */
 /* { dg-final { scan-assembler-times "movi\tv\[0-9\]+\\\.4h, 0x7c, lsl 8" 1 } 
} */
 
-/* { dg-final { scan-assembler-times "mov\tw\[0-9\]+, 19520"  1 } 
} */
-/* { dg-final { scan-assembler-times "fmov\th\[0-9\], w\[0-9\]+"  1 } 
} */
-
+/* { dg-final { scan-assembler-times {fmov\th[0-9]+, #?1.7e\+1}   1 } 
} */
Index: gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_3.c
===
--- gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_3.c  2018-07-18 
18:45:26.0 +0100
+++ gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_3.c  2018-07-18 
18:45:27.025332090 +0100
@@ -1,6 +1,8 @@
 /* { dg-do compile } */
 /* { dg-options "-O2" } */
 
+#pragma GCC target "+nofp16"
+
 __fp16 f4 ()
 {
   __fp16 a = 0.1;
Index: gcc/testsuite/gcc.target/aarch64/sve/single_1.c
===
--- gcc/testsuite/gcc.target/aarch64/sve/single_1.c 2018-07-18 
18:45:26.0 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/single_1.c 2018-07-18 
18:45:27.025332090 +0100
@@ -36,7 +36,7 @@ TEST_LOOP (double, 3.0)
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, #6\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #7\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #8\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, #15360\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfmov\tz[0-9]+\.h, #1\.0e\+0\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tfmov\tz[0-9]+\.s, #2\.0e\+0\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tfmov\tz[0-9]+\.d, #3\.0e\+0\n} 1 } } */
 
Index: gcc/testsuite/gcc.target/aarch64/sve/single_2.c
===
--- gcc/testsuite/gcc.target/aarch64/sve/sing

[SVE ACLE] Add initial support for arm_sve.h

2018-07-18 Thread Richard Sandiford
This patch adds the target framework for handling the SVE ACLE,
starting with four functions: svadd, svptrue, svsub and svsubr.

The ACLE has both overloaded and non-overloaded names.  Without
the equivalent of clang's __attribute__((overloadable)), a header
file that declared all functions would need three sets of declarations:

- the non-overloaded forms (used for both C and C++)
- _Generic-based macros to handle overloading in C
- normal overloaded inline functions for C++

This would likely require a lot of cut-&-paste.  It would probably
also lead to poor diagnostics and be slow to parse.
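
For a sense of scale, even one overloaded form written by hand would need
something like this for C alone (sketch only, showing just two of the many
element types, with the types assumed to be declared elsewhere):

  /* Non-overloaded declarations, shared by C and C++.  */
  svint32_t svadd_s32_m (svbool_t, svint32_t, svint32_t);
  svfloat32_t svadd_f32_m (svbool_t, svfloat32_t, svfloat32_t);

  /* C overloading via _Generic.  */
  #define svadd_m(pg, op1, op2) \
    _Generic ((op1), \
              svint32_t: svadd_s32_m, \
              svfloat32_t: svadd_f32_m) (pg, op1, op2)

plus a corresponding set of inline C++ overloads, repeated for every
function and type combination in the ACLE.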

Another consideration is that some functions require certain arguments
to be integer constant expressions.  We can (sort of) enforce that
for calls to built-in functions using resolve_overloaded_builtin,
but it would be harder to enforce with inline forwarder functions.

For these reasons and others, the patch takes the approach of adding
a pragma that gets the compiler to insert the definitions itself.
This requires a slight variation on the existing lang hooks for
built-in functions, but otherwise it seems to just work.

It was easier to add the support without enumerating every function
at build time.  This in turn meant that it was easier if the SVE
builtins occupied a distinct numberspace from the existing AArch64 ones.
The patch therefore divides the built-in function codes into "major"
and "minor" codes.  At present the major code is just "general" or "SVE".

For now, the patch is only expected to work for fixed-length SVE.
Some uses of the ACLE do manage to squeak through the front-end
in the normal vector-length agnostic mode, but that's more by
accident than design.  We're planning to work on proper frontend
support for "sizeless" types in parallel with the backend changes.

Other things not handled yet:

- support for the SVE AAPCS
- handling the built-ins correctly when the compiler is invoked
  without SVE enabled (e.g. if SVE is enabled later by a pragma)

Both of these are blockers to merging the support into trunk.

The aim is to make sure, when adding a function, that the function
produces the expected assembly output for all relevant combinations.
The patch adds a new check-function-bodies test to try to make
that easier.

Tested on aarch64-linux-gnu (with and without SVE) and committed
to aarch64/sve-acle-branch.

Richard




initial-sve-acle.diff.gz
Description: application/gzip


Re: RFC: Patch to implement Aarch64 SIMD ABI

2018-07-19 Thread Richard Sandiford
Hi,

Thanks for doing this.

Steve Ellcey  writes:
> This is a patch to support the Aarch64 SIMD ABI [1] in GCC.  I intend
> to eventually follow this up with two more patches; one to define the
> TARGET_SIMD_CLONE* macros and one to improve the GCC register
> allocation/usage when calling SIMD functions.
>
> The significant difference between the standard ARM ABI and the SIMD ABI
> is that in the normal ABI a callee saves only the lower 64 bits of registers
> V8-V15, whereas in the SIMD ABI the callee must save all 128 bits of registers
> V8-V23.
>
> This patch checks for SIMD functions and saves the extra registers when
> needed.  It does not change the caller behaviour, so with just this patch
> there may be values saved by both the caller and callee.  This is not
> efficient, but it is correct code.
>
> This patch bootstraps and passes the GCC testsuite but that only verifies
> I haven't broken anything, it doesn't validate the handling of SIMD functions.
> I tried to write some tests, but I could never get GCC to generate code
> that would save the FP callee-save registers in the prologue.  Complex code
> might generate spills and fills but it never triggered the prologue/epilogue
> code to save V8-V23.  If anyone has ideas on how to write a test that would
> cause GCC to generate this code I would appreciate some ideas.  Just doing
> lots of calculations with lots of intermediate values doesn't seem to be 
> enough.

Probably easiest to use asm clobbers, e.g.:

void __attribute__ ((aarch64_vector_pcs))
f (void)
{
  asm volatile ("" ::: "s8", "s13");
}

This also lets you control exactly which registers are saved.

> @@ -4105,7 +4128,8 @@ aarch64_layout_frame (void)
>{
>   /* If there is an alignment gap between integer and fp callee-saves,
>  allocate the last fp register to it if possible.  */
> - if (regno == last_fp_reg && has_align_gap && (offset & 8) == 0)
> + if (regno == last_fp_reg && has_align_gap
> + && !simd_function && (offset & 8) == 0)
> {
>   cfun->machine->frame.reg_offset[regno] = max_int_offset;
>   break;
> @@ -4117,7 +4141,7 @@ aarch64_layout_frame (void)
>   else if (cfun->machine->frame.wb_candidate2 == INVALID_REGNUM
>&& cfun->machine->frame.wb_candidate1 >= V0_REGNUM)
> cfun->machine->frame.wb_candidate2 = regno;
> - offset += UNITS_PER_WORD;
> + offset += simd_function ? UNITS_PER_VREG : UNITS_PER_WORD;
>}
>  
>offset = ROUND_UP (offset, STACK_BOUNDARY / BITS_PER_UNIT);
> @@ -4706,8 +4730,11 @@ aarch64_process_components (sbitmap components, bool 
> prologue_p)
>while (regno != last_regno)
>  {
>/* AAPCS64 section 5.1.2 requires only the bottom 64 bits to be saved
> -  so DFmode for the vector registers is enough.  */
> -  machine_mode mode = GP_REGNUM_P (regno) ? E_DImode : E_DFmode;
> +  so DFmode for the vector registers is enough.  For simd functions
> + we want to save the entire register.  */
> +  machine_mode mode = GP_REGNUM_P (regno) ? E_DImode
> + : (aarch64_simd_function_p (cfun->decl) ? E_TFmode : E_DFmode);

This condition also occurs in aarch64_push_regs and aarch64_pop_regs.
It'd probably be worth splitting it out into a subfunction.
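
e.g. something like (sketch only; aarch64_simd_function_p is the new
predicate from this patch, the other names are invented):

  /* Return the mode used to save and restore register REGNO in the
     current function's prologue and epilogue.  */
  static machine_mode
  aarch64_callee_save_mode (unsigned int regno)
  {
    if (GP_REGNUM_P (regno))
      return E_DImode;
    return aarch64_simd_function_p (cfun->decl) ? E_TFmode : E_DFmode;
  }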

I think you also need to handle the writeback cases, which should work
for Q registers too.  This will mean extra loadwb_pair and storewb_pair
patterns.

LGTM otherwise FWIW.

> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index f284e74..d11474e 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -500,6 +500,8 @@ extern unsigned aarch64_architecture_version;
>  #define PR_LO_REGNUM_P(REGNO)\
>(((unsigned) (REGNO - P0_REGNUM)) <= (P7_REGNUM - P0_REGNUM))
>  
> +#define FP_SIMD_SAVED_REGNUM_P(REGNO)\
> +  (((unsigned) (REGNO - V8_REGNUM)) <= (V23_REGNUM - V8_REGNUM))

(We should probably rewrite these to use IN_RANGE at some point,
but I agree it's better to be consistent until then.)

Thanks,
Richard


Re: [SVE ACLE] Add initial support for arm_sve.h

2018-07-19 Thread Richard Sandiford
Richard Biener  writes:
> On Wed, Jul 18, 2018 at 8:08 PM Richard Sandiford
>  wrote:
>>
>> This patch adds the target framework for handling the SVE ACLE,
>> starting with four functions: svadd, svptrue, svsub and svsubr.
>>
>> The ACLE has both overloaded and non-overloaded names.  Without
>> the equivalent of clang's __attribute__((overloadable)), a header
>> file that declared all functions would need three sets of declarations:
>>
>> - the non-overloaded forms (used for both C and C++)
>> - _Generic-based macros to handle overloading in C
>> - normal overloaded inline functions for C++
>>
>> This would likely require a lot of cut-&-paste.  It would probably
>> also lead to poor diagnostics and be slow to parse.
>>
>> Another consideration is that some functions require certain arguments
>> to be integer constant expressions.  We can (sort of) enforce that
>> for calls to built-in functions using resolve_overloaded_builtin,
>> but it would be harder to enforce with inline forwarder functions.
>>
>> For these reasons and others, the patch takes the approach of adding
>> a pragma that gets the compiler to insert the definitions itself.
>> This requires a slight variation on the existing lang hooks for
>> built-in functions, but otherwise it seems to just work.
>
> I guess you did consider auto-generating the three variants from a template?

Yeah.  But that would only solve the cut-&-paste problem, not the others.
It would also be quite a lot more complicated overall.

E.g. scripting code to produce the right _Generics is much more complicated
than just implementing the overloading using resolve_overloaded_builtin
(which also produces better error messages).  And even just scripting
the declarations is more work: the backend has to register a built-in
function either way, so getting it to register the public name is
easier than having the backend register a __builtin_ function and
then scripting a header file declaration with the same prototype
and attributes.

Thanks,
Richard


Handle SLP of call pattern statements

2018-07-20 Thread Richard Sandiford
We couldn't vectorise:

  for (int j = 0; j < n; ++j)
{
  for (int i = 0; i < 16; ++i)
a[i] = (b[i] + c[i]) >> 1;
  a += step;
  b += step;
  c += step;
}

at -O3 because cunrolli unrolled the inner loop and SLP couldn't handle
AVG_FLOOR patterns (see also PR86504).  The problem was some overly
strict checking of pattern statements compared to normal statements
in vect_get_and_check_slp_defs:

  switch (gimple_code (def_stmt))
{
case GIMPLE_PHI:
case GIMPLE_ASSIGN:
  break;

default:
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 "unsupported defining stmt:\n");
  return -1;
}

The easy fix would have been to add GIMPLE_CALL to the switch,
but I don't think the switch is doing anything useful.  We only create
pattern statements that the rest of the vectoriser can handle, and code
in this function and elsewhere check whether SLP is possible.

I'm also not sure why:

  if (!first && !oprnd_info->first_pattern
  /* Allow different pattern state for the defs of the
 first stmt in reduction chains.  */
  && (oprnd_info->first_dt != vect_reduction_def

is necessary.  All that should matter is that the statements in the
node are "similar enough".  It turned out to be quite hard to find a
convincing example that used a mixture of pattern and non-pattern
statements, so bb-slp-pow-1.c is the best I could come up with.
But it does show that the combination of "xi * xi" statements and
"pow (xj, 2) -> xj * xj" patterns are handled correctly.

The patch therefore just removes the whole if block.

The loop also needed commutative swapping to be extended to at least
AVG_FLOOR.

This gives +3.9% on 525.x264_r at -O3.

Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
and x86_64-linux-gnu.  OK to install?

Richard


2018-07-20  Richard Sandiford  

gcc/
* internal-fn.h (first_commutative_argument): Declare.
* internal-fn.c (first_commutative_argument): New function.
* tree-vect-slp.c (vect_get_and_check_slp_defs): Remove extra
restrictions for pattern statements.  Use first_commutative_argument
to look for commutative operands in calls to internal functions.

gcc/testsuite/
* gcc.dg/vect/bb-slp-over-widen-1.c: Expect AVG_FLOOR to be used
on vect_avg_qi targets.
* gcc.dg/vect/bb-slp-over-widen-2.c: Likewise.
* gcc.dg/vect/bb-slp-pow-1.c: New test.
* gcc.dg/vect/vect-avg-15.c: Likewise.

Index: gcc/internal-fn.h
===
--- gcc/internal-fn.h   2018-07-13 10:11:14.009847140 +0100
+++ gcc/internal-fn.h   2018-07-20 11:18:58.167047743 +0100
@@ -201,6 +201,8 @@ direct_internal_fn_supported_p (internal
 opt_type);
 }
 
+extern int first_commutative_argument (internal_fn);
+
 extern bool set_edom_supported_p (void);
 
 extern internal_fn get_conditional_internal_fn (tree_code);
Index: gcc/internal-fn.c
===
--- gcc/internal-fn.c   2018-07-13 10:11:14.009847140 +0100
+++ gcc/internal-fn.c   2018-07-20 11:18:58.163047778 +0100
@@ -3183,6 +3183,42 @@ direct_internal_fn_supported_p (internal
   return direct_internal_fn_supported_p (fn, tree_pair (type, type), opt_type);
 }
 
+/* If FN is commutative in two consecutive arguments, return the
+   index of the first, otherwise return -1.  */
+
+int
+first_commutative_argument (internal_fn fn)
+{
+  switch (fn)
+{
+case IFN_FMA:
+case IFN_FMS:
+case IFN_FNMA:
+case IFN_FNMS:
+case IFN_AVG_FLOOR:
+case IFN_AVG_CEIL:
+case IFN_FMIN:
+case IFN_FMAX:
+  return 0;
+
+case IFN_COND_ADD:
+case IFN_COND_MUL:
+case IFN_COND_MIN:
+case IFN_COND_MAX:
+case IFN_COND_AND:
+case IFN_COND_IOR:
+case IFN_COND_XOR:
+case IFN_COND_FMA:
+case IFN_COND_FMS:
+case IFN_COND_FNMA:
+case IFN_COND_FNMS:
+  return 1;
+
+default:
+  return -1;
+}
+}
+
 /* Return true if IFN_SET_EDOM is supported.  */
 
 bool
Index: gcc/tree-vect-slp.c
===
--- gcc/tree-vect-slp.c 2018-07-13 10:11:15.113837768 +0100
+++ gcc/tree-vect-slp.c 2018-07-20 11:18:58.167047743 +0100
@@ -299,15 +299,20 @@ vect_get_and_check_slp_defs (vec_info *v
   bool pattern = false;
   slp_oprnd_info oprnd_info;
   int first_op_idx = 1;
-  bool commutative = false;
+  unsigned int commutative_op = -1U;
   bool first_op_cond = false;
   bool first = stmt_num == 0;
   bool second = stmt_num == 1;
 
-  if (is_gimple_call (stmt))
+  if (gcall *call = dyn_cast  (stmt))
 {
-  num

Fold pointer range checks with equal spans

2018-07-20 Thread Richard Sandiford
When checking whether vectorised accesses at A and B are independent,
the vectoriser falls back to tests of the form:

A + size <= B || B + size <= A

But in the common case that "size" is just the constant size of a vector
(or a small multiple), it would be more efficient to do:

   ((size_t) A - (size_t) B + size - 1) > (size - 1) * 2
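
(This works because the two size-byte accesses are independent iff A - B
lies outside (-size, size); adding size - 1 maps the overlap interval onto
[0, 2 * (size - 1)], so in unsigned arithmetic "independent" is exactly
"greater than (size - 1) * 2".)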

This patch adds folds to do that.  E.g. before the patch, the alias
checks for:

  for (int j = 0; j < n; ++j)
{
  for (int i = 0; i < 16; ++i)
a[i] = (b[i] + c[i]) >> 1;
  a += step;
  b += step;
  c += step;
}

were:

add x7, x1, 15
add x5, x0, 15
cmp x0, x7
add x7, x2, 15
ccmp x1, x5, 2, ls
cset w8, hi
cmp x0, x7
ccmp x2, x5, 2, ls
cset w4, hi
tst w8, w4

while after the patch they're:

sub x7, x0, x1
sub x5, x0, x2
add x7, x7, 15
add x5, x5, 15
cmp x7, 30
ccmp x5, 30, 0, hi

The old scheme needs:

[A] one addition per vector pointer
[B] two comparisons and one IOR per range check

The new one needs:

[C] one subtraction, one addition and one comparison per range check

The range checks are then ANDed together, with the same number of
ANDs either way.

With conditional comparisons (such as on AArch64), we're able to remove
the IOR between comparisons in the old scheme, but then need an explicit
AND or branch when combining the range checks, as the example above shows.
With the new scheme we can instead use conditional comparisons for
the AND chain.

So even with conditional comparisons, the new scheme should in practice
be a win in almost all cases.  Without conditional comparisons, the new
scheme removes [A] and replaces [B] with an equivalent number of operations
[C], so should always give fewer operations overall.  Although each [C]
is linear, multiple [C]s are easily parallelisable.

A better implementation of the above would be:

add x5, x0, 15
sub x7, x5, x1
sub x5, x5, x2
cmp x7, 30
ccmp x5, 30, 0, hi

where we add 15 to "a" before the subtraction.  Unfortunately,
canonicalisation rules mean that even if we try to create it in
that form, it gets folded into the one above instead.

An alternative would be not to do this in match.pd and instead get
tree-data-ref.c to do it itself.  I started out that way but thought
the match.pd approach seemed cleaner.

Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
and x86_64-linux-gnu.  OK to install?

Richard


2018-07-20  Richard Sandiford  

gcc/
* match.pd: Optimise pointer range checks.

gcc/testsuite/
* gcc.dg/pointer-range-check-1.c: New test.
* gcc.dg/pointer-range-check-2.c: Likewise.

Index: gcc/match.pd
===
--- gcc/match.pd2018-07-18 18:44:22.565914281 +0100
+++ gcc/match.pd2018-07-20 11:24:33.692045585 +0100
@@ -4924,3 +4924,37 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(if (inverse_conditions_p (@0, @2)
 && element_precision (type) == element_precision (op_type))
 (view_convert (cond_op @2 @3 @4 @5 (view_convert:op_type @1)))
+
+/* For pointers @0 and @2 and unsigned constant offset @1, look for
+   expressions like:
+
+   A: (@0 + @1 < @2) | (@2 + @1 < @0)
+   B: (@0 + @1 <= @2) | (@2 + @1 <= @0)
+
+   If pointers are known not to wrap, B checks whether @1 bytes starting at
+   @0 and @2 do not overlap, while A tests the same thing for @1 + 1 bytes.
+   A is more efficiently tested as:
+
+   ((sizetype) @0 - (sizetype) @2 + @1) > (@1 * 2)
+
+   as long as @1 * 2 doesn't overflow.  B is the same with @1 replaced
+   with @1 - 1.  */
+(for ior (truth_orif truth_or bit_ior)
+ (for cmp (le lt)
+  (simplify
+   (ior (cmp (pointer_plus:s @0 INTEGER_CST@1) @2)
+   (cmp (pointer_plus:s @2 @1) @0))
+   (if (!flag_trapv && !TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
+/* Convert the B form to the A form.  */
+(with { offset_int off = wi::to_offset (@1) - (cmp == LE_EXPR ? 1 : 0); }
+ /* Always fails for negative values.  */
+ (if (wi::min_precision (off, UNSIGNED) * 2 <= TYPE_PRECISION (sizetype))
+  /* It doesn't matter whether we use @2 - @0 or @0 - @2, so let
+tree_swap_operands_p pick a canonical order.  */
+  (with { tree ptr1 = @0, ptr2 = @2;
+ if (tree_swap_operands_p (ptr1, ptr2))
+   std::swap (ptr1, ptr2); }
+   (gt (plus (minus (convert:sizetype { ptr1; })
+   (convert:sizetype { ptr2; }))
+{ wide_int_to_tree (sizetype, off); })
+  { wide_int_to_tree (sizetype, off * 2); }
Index: gcc/testsuite/gcc.dg/pointer-range-check-1.c
===
--- /

Make the vectoriser drop to strided accesses for stores with gaps

2018-07-20 Thread Richard Sandiford
We could vectorise:

 for (...)
   {
 a[0] = ...;
 a[1] = ...;
 a[2] = ...;
 a[3] = ...;
 a += stride;
   }

(including the case when stride == 8) but not:

 for (...)
   {
 a[0] = ...;
 a[1] = ...;
 a[2] = ...;
 a[3] = ...;
 a += 8;
   }

(where the stride is always 8).  The former was treated as a "grouped
and strided" store, while the latter was treated as grouped store with
gaps, which we don't support.

This patch makes us treat groups of stores with gaps at the end as
strided groups too.  I tried to go through all uses of STMT_VINFO_STRIDED_P
and all vector uses of DR_STEP to see whether there were any hard-baked
assumptions, but couldn't see any.  I wondered whether we should relax:

  /* We do not have to consider dependences between accesses that belong
 to the same group, unless the stride could be smaller than the
 group size.  */
  if (DR_GROUP_FIRST_ELEMENT (stmtinfo_a)
  && (DR_GROUP_FIRST_ELEMENT (stmtinfo_a)
  == DR_GROUP_FIRST_ELEMENT (stmtinfo_b))
  && !STMT_VINFO_STRIDED_P (stmtinfo_a))
return false;

for cases in which the step is constant and the absolute step is known
to be greater than the group size, but data dependence analysis should
already return chrec_known for those cases.

The new test is a version of vect-avg-15.c with the variable step
replaced by a constant one.

A natural follow-on would be to do the same for groups with gaps in
the middle:

  /* Check that the distance between two accesses is equal to the type
 size. Otherwise, we have gaps.  */
  diff = (TREE_INT_CST_LOW (DR_INIT (data_ref))
  - TREE_INT_CST_LOW (prev_init)) / type_size;
  if (diff != 1)
{
  [...]
  if (DR_IS_WRITE (data_ref))
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 "interleaved store with gaps\n");
  return false;
}

But I think we should do that separately and see what the fallout
from this change is first.

Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
and x86_64-linux-gnu.  OK to install?

Richard


2018-07-20  Richard Sandiford  

gcc/
* tree-vect-data-refs.c (vect_analyze_group_access_1): Convert
grouped stores with gaps to a strided group.

gcc/testsuite/
* gcc.dg/vect/vect-avg-16.c: New test.
* gcc.dg/vect/slp-37.c: Expect the loop to be vectorized.
* gcc.dg/vect/vect-strided-u8-i8-gap4.c,
* gcc.dg/vect/vect-strided-u8-i8-gap4-big-array.c: Likewise for
the second loop in main1.

Index: gcc/tree-vect-data-refs.c
===
--- gcc/tree-vect-data-refs.c   2018-06-30 13:44:38.567611988 +0100
+++ gcc/tree-vect-data-refs.c   2018-07-20 11:55:22.570911497 +0100
@@ -2632,10 +2632,14 @@ vect_analyze_group_access_1 (struct data
   if (groupsize != count
  && !DR_IS_READ (dr))
 {
- if (dump_enabled_p ())
-   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-"interleaved store with gaps\n");
- return false;
+ groupsize = count;
+ STMT_VINFO_STRIDED_P (stmt_info) = true;
+ stmt_vec_info next_info = stmt_info;
+ while (DR_GROUP_NEXT_ELEMENT (next_info))
+   {
+ next_info = vinfo_for_stmt (DR_GROUP_NEXT_ELEMENT (next_info));
+ STMT_VINFO_STRIDED_P (next_info) = 1;
+   }
}
 
   /* If there is a gap after the last load in the group it is the
@@ -2651,6 +2655,8 @@ vect_analyze_group_access_1 (struct data
   "Detected interleaving ");
  if (DR_IS_READ (dr))
dump_printf (MSG_NOTE, "load ");
+ else if (STMT_VINFO_STRIDED_P (stmt_info))
+   dump_printf (MSG_NOTE, "strided store ");
  else
dump_printf (MSG_NOTE, "store ");
  dump_printf (MSG_NOTE, "of size %u starting with ",
Index: gcc/testsuite/gcc.dg/vect/vect-avg-16.c
===================================================================
--- /dev/null   2018-07-09 14:52:09.234750850 +0100
+++ gcc/testsuite/gcc.dg/vect/vect-avg-16.c 2018-07-20 11:55:22.570911497 +0100
@@ -0,0 +1,52 @@
+/* { dg-additional-options "-O3" } */
+/* { dg-require-effective-target vect_int } */
+
+#include "tree-vect.h"
+
+#define N 80
+
+void __attribute__ ((noipa))
+f (signed char *restrict a, signed char *restrict b,
+   signed char *restrict c, int n)
+{
+  for (int j = 0; j < n; ++j)
+{
+  for (int i = 0; i < 16; ++i)
+   a[i] = (b[i] + c[i]) >> 1;
+   

Re: Make the vectoriser drop to strided accesses for stores with gaps

2018-07-20 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, Jul 20, 2018 at 12:57 PM Richard Sandiford
>  wrote:
>>
>> We could vectorise:
>>
>>  for (...)
>>{
>>  a[0] = ...;
>>  a[1] = ...;
>>  a[2] = ...;
>>  a[3] = ...;
>>  a += stride;
>>}
>>
>> (including the case when stride == 8) but not:
>>
>>  for (...)
>>{
>>  a[0] = ...;
>>  a[1] = ...;
>>  a[2] = ...;
>>  a[3] = ...;
>>  a += 8;
>>}
>>
>> (where the stride is always 8).  The former was treated as a "grouped
>> and strided" store, while the latter was treated as grouped store with
>> gaps, which we don't support.
>>
>> This patch makes us treat groups of stores with gaps at the end as
>> strided groups too.  I tried to go through all uses of STMT_VINFO_STRIDED_P
>> and all vector uses of DR_STEP to see whether there were any hard-baked
>> assumptions, but couldn't see any.  I wondered whether we should relax:
>>
>>   /* We do not have to consider dependences between accesses that belong
>>  to the same group, unless the stride could be smaller than the
>>  group size.  */
>>   if (DR_GROUP_FIRST_ELEMENT (stmtinfo_a)
>>   && (DR_GROUP_FIRST_ELEMENT (stmtinfo_a)
>>   == DR_GROUP_FIRST_ELEMENT (stmtinfo_b))
>>   && !STMT_VINFO_STRIDED_P (stmtinfo_a))
>> return false;
>>
>> for cases in which the step is constant and the absolute step is known
>> to be greater than the group size, but data dependence analysis should
>> already return chrec_known for those cases.
>>
>> The new test is a version of vect-avg-15.c with the variable step
>> replaced by a constant one.
>>
>> A natural follow-on would be to do the same for groups with gaps in
>> the middle:
>>
>>   /* Check that the distance between two accesses is equal to the type
>>  size. Otherwise, we have gaps.  */
>>   diff = (TREE_INT_CST_LOW (DR_INIT (data_ref))
>>   - TREE_INT_CST_LOW (prev_init)) / type_size;
>>   if (diff != 1)
>> {
>>   [...]
>>   if (DR_IS_WRITE (data_ref))
>> {
>>   if (dump_enabled_p ())
>> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>  "interleaved store with gaps\n");
>>   return false;
>> }
>>
>> But I think we should do that separately and see what the fallout
>> from this change is first.
>
> Agreed.
>
>> Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
>> and x86_64-linux-gnu.  OK to install?
>
> Do you need to set STMT_VINFO_STRIDED_P on all stmts?  I think
> it is enough to set it on the first group element.

get_load_store_type tests STMT_VINFO_STRIDED_P on the stmt it's
given rather than the first element, but I guess that's my bug...

I'm hoping this weekend to finally finish off the series to get
rid of vinfo_for_stmt, which will make it slightly easier to do this.

Thanks,
Richard

> OK otherwise.
> Thanks,
> Richard.
>
>> Richard
>>
>>
>> 2018-07-20  Richard Sandiford  
>>
>> gcc/
>> * tree-vect-data-refs.c (vect_analyze_group_access_1): Convert
>> grouped stores with gaps to a strided group.
>>
>> gcc/testsuite/
>> * gcc.dg/vect/vect-avg-16.c: New test.
>> * gcc.dg/vect/slp-37.c: Expect the loop to be vectorized.
>> * gcc.dg/vect/vect-strided-u8-i8-gap4.c,
>> * gcc.dg/vect/vect-strided-u8-i8-gap4-big-array.c: Likewise for
>> the second loop in main1.
>>
>> Index: gcc/tree-vect-data-refs.c
>> ===================================================================
>> --- gcc/tree-vect-data-refs.c   2018-06-30 13:44:38.567611988 +0100
>> +++ gcc/tree-vect-data-refs.c   2018-07-20 11:55:22.570911497 +0100
>> @@ -2632,10 +2632,14 @@ vect_analyze_group_access_1 (struct data
>>if (groupsize != count
>>   && !DR_IS_READ (dr))
>>  {
>> - if (dump_enabled_p ())
>> -   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> -"interleaved store with gaps\n");
>> - return false;
>> + groupsize = count;
>> + STMT_VINFO_STRIDED_P (stmt_info) = true;
>>

Re: Fold pointer range checks with equal spans

2018-07-23 Thread Richard Sandiford
Marc Glisse  writes:
> On Fri, 20 Jul 2018, Richard Sandiford wrote:
>
>> --- gcc/match.pd 2018-07-18 18:44:22.565914281 +0100
>> +++ gcc/match.pd 2018-07-20 11:24:33.692045585 +0100
>> @@ -4924,3 +4924,37 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>(if (inverse_conditions_p (@0, @2)
>> && element_precision (type) == element_precision (op_type))
>> (view_convert (cond_op @2 @3 @4 @5 (view_convert:op_type @1)))
>> +
>> +/* For pointers @0 and @2 and unsigned constant offset @1, look for
>> +   expressions like:
>> +
>> +   A: (@0 + @1 < @2) | (@2 + @1 < @0)
>> +   B: (@0 + @1 <= @2) | (@2 + @1 <= @0)
>> +
>> +   If pointers are known not to wrap, B checks whether @1 bytes starting at
>> +   @0 and @2 do not overlap, while A tests the same thing for @1 + 1 bytes.
>> +   A is more efficiently tested as:
>> +
>> +   ((sizetype) @0 - (sizetype) @2 + @1) > (@1 * 2)
>> +
>> +   as long as @1 * 2 doesn't overflow.  B is the same with @1 replaced
>> +   with @1 - 1.  */
>> +(for ior (truth_orif truth_or bit_ior)
>> + (for cmp (le lt)
>> +  (simplify
>> +   (ior (cmp (pointer_plus:s @0 INTEGER_CST@1) @2)
>> +(cmp (pointer_plus:s @2 @1) @0))
>
> Do you want :c on cmp, in case it appears as @2 > @0 + @1 ? (may need some 
> care with "cmp == LE_EXPR" below)
> Do you want :s on cmp as well?
>
>> +   (if (!flag_trapv && !TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
>
> Don't you want TYPE_OVERFLOW_UNDEFINED?

Thanks, fixed below.  Think the cmp == LE_EXPR stuff is still ok with :c,
since the generated code sets cmp to LE_EXPR when matching GE_EXPR.
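
For what it's worth, a quick standalone sanity check of the A-form
equivalence (just a local sketch with made-up names, not part of the
patch), modelling sizetype as uint8_t and skipping inputs where the
pointer_plus would wrap:

  #include <assert.h>
  #include <stdint.h>

  #define OFF 15

  int
  main (void)
  {
    for (unsigned a = 0; a < 256; ++a)
      for (unsigned b = 0; b < 256; ++b)
        {
          /* Precondition: neither a + OFF nor b + OFF wraps.  */
          if (a + OFF > 255 || b + OFF > 255)
            continue;
          /* A form: (a + OFF < b) || (b + OFF < a).  */
          int orig = (a + OFF < b) || (b + OFF < a);
          /* Folded form: (sizetype) a - (sizetype) b + OFF > OFF * 2,
             with sizetype shrunk to 8 bits.  */
          int folded = (uint8_t) (a - b + OFF) > OFF * 2;
          assert (orig == folded);
        }
    return 0;
  }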

Tested as before.  OK to install?

Richard


2018-07-23  Richard Sandiford  

gcc/
* match.pd: Optimise pointer range checks.

gcc/testsuite/
* gcc.dg/pointer-range-check-1.c: New test.
* gcc.dg/pointer-range-check-2.c: Likewise.

Index: gcc/match.pd
===================================================================
--- gcc/match.pd 2018-07-23 15:56:47.0 +0100
+++ gcc/match.pd 2018-07-23 15:58:33.480269844 +0100
@@ -4924,3 +4924,37 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(if (inverse_conditions_p (@0, @2)
 && element_precision (type) == element_precision (op_type))
 (view_convert (cond_op @2 @3 @4 @5 (view_convert:op_type @1)))
+
+/* For pointers @0 and @2 and unsigned constant offset @1, look for
+   expressions like:
+
+   A: (@0 + @1 < @2) | (@2 + @1 < @0)
+   B: (@0 + @1 <= @2) | (@2 + @1 <= @0)
+
+   If pointers are known not to wrap, B checks whether @1 bytes starting at
+   @0 and @2 do not overlap, while A tests the same thing for @1 + 1 bytes.
+   A is more efficiently tested as:
+
+   ((sizetype) @0 - (sizetype) @2 + @1) > (@1 * 2)
+
+   as long as @1 * 2 doesn't overflow.  B is the same with @1 replaced
+   with @1 - 1.  */
+(for ior (truth_orif truth_or bit_ior)
+ (for cmp (le lt)
+  (simplify
+   (ior (cmp:cs (pointer_plus:s @0 INTEGER_CST@1) @2)
+   (cmp:cs (pointer_plus:s @2 @1) @0))
+   (if (!flag_trapv && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0)))
+/* Convert the B form to the A form.  */
+(with { offset_int off = wi::to_offset (@1) - (cmp == LE_EXPR ? 1 : 0); }
+ /* Always fails for negative values.  */
+ (if (wi::min_precision (off, UNSIGNED) * 2 <= TYPE_PRECISION (sizetype))
+  /* It doesn't matter whether we use @2 - @0 or @0 - @2, so let
+tree_swap_operands_p pick a canonical order.  */
+  (with { tree ptr1 = @0, ptr2 = @2;
+ if (tree_swap_operands_p (ptr1, ptr2))
+   std::swap (ptr1, ptr2); }
+   (gt (plus (minus (convert:sizetype { ptr1; })
+   (convert:sizetype { ptr2; }))
+{ wide_int_to_tree (sizetype, off); })
+  { wide_int_to_tree (sizetype, off * 2); }
Index: gcc/testsuite/gcc.dg/pointer-range-check-1.c
===================================================================
--- /dev/null   2018-07-09 14:52:09.234750850 +0100
+++ gcc/testsuite/gcc.dg/pointer-range-check-1.c 2018-07-23 15:58:33.480269844 +0100
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-ipa-icf -fdump-tree-optimized" } */
+
+/* All four functions should be folded to:
+
+   ((sizetype) a - (sizetype) b + 15) < 30.  */
+
+_Bool
+f1 (char *a, char *b)
+{
+  return (a + 16 <= b) || (b + 16 <= a);
+}
+
+_Bool
+f2 (char *a, char *b)
+{
+  return (a + 15 < b) || (b + 15 < a);
+}
+
+_Bool
+f3 (char *a, char *b)
+{
+  return (a + 16 <= b) || (b + 16 <= a);
+}
+
+_Bool
+f4 (char *a, char *b)
+{
+  return (a + 15 < b) || (b + 15 < a);
+}
+
+/* { dg-final { scan-tree-dump-times { = [^\n]* - [^\n]*;} 4 "optimized" } } */
+/* { dg-fina

[00/46] Remove vinfo_for_stmt etc.

2018-07-24 Thread Richard Sandiford
The aim of this series is to:

(a) make the vectoriser refer to statements using its own expanded
stmt_vec_info rather than the underlying gimple stmt.  This reduces
the number of stmt lookups from 480 in current sources to under 100.

(b) make the remaining lookups relative to the owning vec_info rather than
to global state.
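
As a toy model of what (a) and (b) mean in practice (purely illustrative,
not GCC code, with made-up names):

  #include <assert.h>
  #include <stddef.h>

  struct stmt { int uid; };
  struct stmt_info { struct stmt *stmt; };

  /* Before: one global table, so only one vectorisation context.  */
  static struct stmt_info *global_table[16];

  static struct stmt_info *
  vinfo_for_stmt_like (struct stmt *s)
  {
    return global_table[s->uid];
  }

  /* After: each vec_info owns its statements, so lookups are relative
     to the vec_info rather than to global state.  */
  struct vec_info_like { struct stmt_info *table[16]; };

  static struct stmt_info *
  lookup_stmt_like (struct vec_info_like *vinfo, struct stmt *s)
  {
    return vinfo->table[s->uid];
  }

  int
  main (void)
  {
    struct stmt s = { 1 };
    struct stmt_info si = { &s };
    struct vec_info_like v1 = { { NULL } }, v2 = { { NULL } };
    global_table[s.uid] = &si;
    v1.table[s.uid] = &si;
    assert (vinfo_for_stmt_like (&s) == &si);
    assert (lookup_stmt_like (&v1, &s) == &si);
    /* Two vec_infos can now be live at once.  */
    assert (lookup_stmt_like (&v2, &s) == NULL);
    return 0;
  }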

The original motivation was to make it more natural to have multiple
vec_infos live at once.

The series is a clean-up only in a data structure sense.  It certainly
doesn't make the code prettier, and in the end it only shaves 120 LOC
in total.  But I think it should make it easier to do follow-on clean-ups.

The series was pretty tedious to write and will be pretty tedious
to review, sorry.

I tested each individual patch on aarch64-linux-gnu and the series as a
whole on aarch64-linux-gnu with SVE, aarch64_be-elf and x86_64-linux-gnu.
I also built and tested at least one target per CPU directory, made sure
that there were no new warnings, and checked for differences in assembly
output for gcc.dg, g++.dg and gcc.c-torture.  There were a couple of
cases in vect-alias-check-* of equality comparisons using the opposite
operand order, which is an unrelated problem.  There were no other
differences.

OK to install?

Thanks,
Richard


[01/46] Move special cases out of get_initial_def_for_reduction

2018-07-24 Thread Richard Sandiford
This minor clean-up avoids repeating the test for double reductions
and also moves the vect_get_vec_def_for_operand call to the same
function as the corresponding vect_get_vec_def_for_stmt_copy.


2018-07-24  Richard Sandiford  

gcc/
* tree-vect-loop.c (get_initial_def_for_reduction): Move special
cases for nested loops from here to ...
(vect_create_epilog_for_reduction): ...here.  Only call
vect_is_simple_use for inner-loop reductions.

Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c 2018-07-13 10:11:14.429843575 +0100
+++ gcc/tree-vect-loop.c 2018-07-24 10:22:02.965552667 +0100
@@ -4113,10 +4113,8 @@ get_initial_def_for_reduction (gimple *s
   enum tree_code code = gimple_assign_rhs_code (stmt);
   tree def_for_init;
   tree init_def;
-  bool nested_in_vect_loop = false;
   REAL_VALUE_TYPE real_init_val = dconst0;
   int int_init_val = 0;
-  gimple *def_stmt = NULL;
   gimple_seq stmts = NULL;
 
   gcc_assert (vectype);
@@ -4124,39 +4122,12 @@ get_initial_def_for_reduction (gimple *s
   gcc_assert (POINTER_TYPE_P (scalar_type) || INTEGRAL_TYPE_P (scalar_type)
  || SCALAR_FLOAT_TYPE_P (scalar_type));
 
-  if (nested_in_vect_loop_p (loop, stmt))
-nested_in_vect_loop = true;
-  else
-gcc_assert (loop == (gimple_bb (stmt))->loop_father);
-
-  /* In case of double reduction we only create a vector variable to be put
- in the reduction phi node.  The actual statement creation is done in
- vect_create_epilog_for_reduction.  */
-  if (adjustment_def && nested_in_vect_loop
-  && TREE_CODE (init_val) == SSA_NAME
-  && (def_stmt = SSA_NAME_DEF_STMT (init_val))
-  && gimple_code (def_stmt) == GIMPLE_PHI
-  && flow_bb_inside_loop_p (loop, gimple_bb (def_stmt))
-  && vinfo_for_stmt (def_stmt)
-  && STMT_VINFO_DEF_TYPE (vinfo_for_stmt (def_stmt))
-  == vect_double_reduction_def)
-{
-  *adjustment_def = NULL;
-  return vect_create_destination_var (init_val, vectype);
-}
+  gcc_assert (nested_in_vect_loop_p (loop, stmt)
+ || loop == (gimple_bb (stmt))->loop_father);
 
   vect_reduction_type reduction_type
 = STMT_VINFO_VEC_REDUCTION_TYPE (stmt_vinfo);
 
-  /* In case of a nested reduction do not use an adjustment def as
- that case is not supported by the epilogue generation correctly
- if ncopies is not one.  */
-  if (adjustment_def && nested_in_vect_loop)
-{
-  *adjustment_def = NULL;
-  return vect_get_vec_def_for_operand (init_val, stmt);
-}
-
   switch (code)
 {
 case WIDEN_SUM_EXPR:
@@ -4586,9 +4557,22 @@ vect_create_epilog_for_reduction (vec

[02/46] Remove dead vectorizable_reduction code

2018-07-24 Thread Richard Sandiford
vectorizable_reduction has old code to cope with cases in which the
given statement belongs to a reduction group but isn't the first statement.
That can no longer happen, since all statements in the group go into the
same SLP node, and we only check the first statement in each node.

The point is to remove the only path through vectorizable_reduction
in which stmt and stmt_info refer to different statements.


2018-07-24  Richard Sandiford  

gcc/
* tree-vect-loop.c (vectorizable_reduction): Assert that the
function is not called for second and subsequent members of
a reduction group.

Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c 2018-07-24 10:22:02.965552667 +0100
+++ gcc/tree-vect-loop.c 2018-07-24 10:22:06.269523330 +0100
@@ -6162,7 +6162,6 @@ vectorizable_reduction (gimple *stmt, gi
   auto_vec phis;
   int vec_num;
   tree def0, tem;
-  bool first_p = true;
   tree cr_index_scalar_type = NULL_TREE, cr_index_vector_type = NULL_TREE;
   tree cond_reduc_val = NULL_TREE;
 
@@ -6178,15 +6177,8 @@ vectorizable_reduction (gimple *stmt, gi
   nested_cycle = true;
 }
 
-  /* In case of reduction chain we switch to the first stmt in the chain, but
- we don't update STMT_INFO, since only the last stmt is marked as reduction
- and has reduction properties.  */
-  if (REDUC_GROUP_FIRST_ELEMENT (stmt_info)
-  && REDUC_GROUP_FIRST_ELEMENT (stmt_info) != stmt)
-{
-  stmt = REDUC_GROUP_FIRST_ELEMENT (stmt_info);
-  first_p = false;
-}
+  if (REDUC_GROUP_FIRST_ELEMENT (stmt_info))
+gcc_assert (slp_node && REDUC_GROUP_FIRST_ELEMENT (stmt_info) == stmt);
 
   if (gimple_code (stmt) == GIMPLE_PHI)
 {
@@ -7050,8 +7042,7 @@ vectorizable_reduction (gimple *stmt, gi
 
   if (!vec_stmt) /* transformation not required.  */
 {
-  if (first_p)
-   vect_model_reduction_cost (stmt_info, reduc_fn, ncopies, cost_vec);
+  vect_model_reduction_cost (stmt_info, reduc_fn, ncopies, cost_vec);
   if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
{
  if (reduction_type != FOLD_LEFT_REDUCTION

