[PATCH] aarch64: Use type-qualified builtins for vget_low/high intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned and polynomial type-qualified builtins for vget_low_*/vget_high_* Neon intrinsics. Using these builtins removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan ---
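
The effect described in this series can be sketched in portable C. This is only an illustration under stated assumptions: GCC/Clang generic vector types stand in for the arm_neon.h types, and the `hypothetical_*` functions stand in for compiler builtins (they are not the real builtin names). When only a signed "builtin" exists, the unsigned wrapper has to cast in and out; a type-qualified variant makes the casts unnecessary.

```c
#include <stdint.h>
#include <string.h>

/* GCC/Clang generic vector types standing in for the arm_neon.h types.  */
typedef int8_t  s8x8  __attribute__ ((vector_size (8)));
typedef int8_t  s8x16 __attribute__ ((vector_size (16)));
typedef uint8_t u8x8  __attribute__ ((vector_size (8)));
typedef uint8_t u8x16 __attribute__ ((vector_size (16)));

/* Hypothetical "builtin" that is only declared for signed elements.
   It returns lanes 0..7 of the input.  */
static s8x8
hypothetical_get_low_s8 (s8x16 v)
{
  s8x8 lo;
  memcpy (&lo, &v, sizeof lo);
  return lo;
}

/* Hypothetical type-qualified variant: same operation, unsigned types.  */
static u8x8
hypothetical_get_low_u8 (u8x16 v)
{
  u8x8 lo;
  memcpy (&lo, &v, sizeof lo);
  return lo;
}

/* Without a qualified builtin: cast the argument and the result.  */
static u8x8
get_low_u8_with_casts (u8x16 v)
{
  return (u8x8) hypothetical_get_low_s8 ((s8x16) v);
}

/* With a qualified builtin: no casts are needed.  */
static u8x8
get_low_u8_qualified (u8x16 v)
{
  return hypothetical_get_low_u8 (v);
}
```

Both wrappers return the same bits; the second simply has less noise in the header.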

[PATCH] aarch64: Use type-qualified builtins for vcombine_* Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned and polynomial type-qualified builtins for vcombine_* Neon intrinsics. Using these builtins removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeL

[PATCH] aarch64: Use type-qualified builtins for LD1/ST1 Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned and polynomial type-qualified builtins and uses them to implement the LD1/ST1 Neon intrinsics. This removes the need for many casts in arm_neon.h. The new type-qualified builtins are also lowered to gimple - as the unqualified builtins are already. Regression tes

[PATCH] aarch64: Use type-qualified builtins for ADDV Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned type-qualified builtins and uses them to implement the vector reduction Neon intrinsics. This removes the need for many casts in arm_neon.h. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/Chang

[PATCH] aarch64: Use type-qualified builtins for ADDP Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned type-qualified builtins and uses them to implement the pairwise addition Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/Chan

[PATCH] aarch64: Use type-qualified builtins for [R]SUBHN[2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned type-qualified builtins and uses them to implement (rounding) halving-narrowing-subtract Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonatha

[PATCH] aarch64: Use type-qualified builtins for [R]ADDHN[2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned type-qualified builtins and uses them to implement (rounding) halving-narrowing-add Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --

[PATCH] aarch64: Use type-qualified builtins for UHSUB Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned type-qualified builtins and uses them to implement halving-subtract Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog

[PATCH] aarch64: Use type-qualified builtins for U[R]HADD Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned type-qualified builtins and uses them to implement (rounding) halving-add Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/Cha

[PATCH] aarch64: Use type-qualified builtins for USUB[LW][2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned type-qualified builtins and uses them to implement widening-subtract Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLo

[PATCH] aarch64: Use type-qualified builtins for UADD[LW][2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned type-qualified builtins and uses them to implement widening-add Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2

[PATCH] aarch64: Use type-qualified builtins for [R]SHRN[2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned type-qualified builtins and uses them for [R]SHRN[2] Neon intrinsics. This removes the need for casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-08  Jonat

[PATCH] aarch64: Use type-qualified builtins for XTN[2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares unsigned type-qualified builtins and uses them for XTN[2] Neon intrinsics. This removes the need for casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-08  Jonathan

[PATCH] aarch64: Use type-qualified builtins for PMUL[L] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares poly type-qualified builtins and uses them for PMUL[L] Neon intrinsics. This removes the need for casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-08  Jonathan Wri

[PATCH] aarch64: Use type-qualified builtins for unsigned MLA/MLS intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi, This patch declares type-qualified builtins and uses them for MLA/MLS Neon intrinsics that operate on unsigned types. This eliminates lots of casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog:

Re: [PATCH 4/6 V2] aarch64: Add machine modes for Neon vector-tuple types

2021-11-02 Thread Jonathan Wright via Gcc-patches
Hi, Each of the comments on the previous version of the patch has been addressed. Ok for master? Thanks, Jonathan From: Richard Sandiford Sent: 22 October 2021 16:13 To: Jonathan Wright Cc: gcc-patches@gcc.gnu.org ; Kyrylo Tkachov Subject: Re: [PATCH 4/6] aarch64: Add machine modes for Ne

[PATCH 4/6] aarch64: Add machine modes for Neon vector-tuple types

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi, Until now, GCC has used large integer machine modes (OI, CI and XI) to model Neon vector-tuple types. This is suboptimal for many reasons; the most notable are: 1) Large integer modes are opaque and modifying one vector in the tuple requires a lot of inefficient set/get gymnastics. The

[PATCH 6/6] aarch64: Pass and return Neon vector-tuple types without a parallel

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi, Neon vector-tuple types can be passed in registers on function call and return - there is no need to generate a parallel rtx. This patch adds cases to detect vector-tuple modes and generates an appropriate register rtx. This change greatly improves code generated when passing Neon vector-tup

[PATCH 5/6] gcc/lower_subreg.c: Prevent decomposition if modes are not tieable

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi, Preventing decomposition if modes are not tieable is necessary to stop AArch64 partial Neon structure modes being treated as packed in registers. This is a necessary prerequisite for a future AArch64 PCS change to maintain good code generation. Bootstrapped and regression tested on: * x86_64

[PATCH 3/6] gcc/expmed.c: Ensure vector modes are tieable before extraction

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi, Extracting a bitfield from a vector can be achieved by casting the vector to a new type whose elements are the same size as the desired bitfield, before generating a subreg. However, this is only an optimization if the original vector can be accessed in the new machine mode without first being

[PATCH 2/6] gcc/expr.c: Remove historic workaround for broken SIMD subreg

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi, A long time ago, using a parallel to take a subreg of a SIMD register was broken. This temporary fix[1] (from 2003) spilled these registers to memory and reloaded the appropriate part to obtain the subreg. The fix initially existed for the benefit of the PowerPC E500 - a platform for which GC

[PATCH 1/6] aarch64: Move Neon vector-tuple type declaration into the compiler

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch declares the Neon vector-tuple types inside the compiler instead of in the arm_neon.h header. This is a necessary first step before adding corresponding machine modes to the AArch64 backend. The vector-tuple types are implemented using a #pragma. This means initializati

[PATCH] aarch64: Remove redundant struct type definitions in arm_neon.h

2021-10-21 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch deletes some redundant type definitions in arm_neon.h. These vector type definitions are an artifact from the initial commit that added the AArch64 port. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gc

[PATCH] aarch64: Fix pointer parameter type in LD1 Neon intrinsics

2021-10-14 Thread Jonathan Wright via Gcc-patches
The pointer parameter to load a vector of signed values should itself be a pointer to a signed type. This patch fixes two instances of this unsigned-signed implicit conversion in arm_neon.h. Tested relevant intrinsics with -Wpointer-sign and the warnings are no longer present. Ok for master? Thanks, Jonathan ---
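
A minimal sketch of the kind of mismatch this entry describes, in plain C. The struct and function names are hypothetical stand-ins, not the arm_neon.h declarations: a load routine for signed lanes takes `const int8_t *`, so a wrapper whose parameter is `const uint8_t *` would pass a pointer of the wrong signedness, which GCC reports under -Wpointer-sign.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit vector of signed bytes.  */
typedef struct { int8_t lane[8]; } model_s8x8;

/* Hypothetical stand-in for a load builtin taking a signed pointer.  */
static model_s8x8
hypothetical_load_s8 (const int8_t *p)
{
  model_s8x8 v;
  for (int i = 0; i < 8; i++)
    v.lane[i] = p[i];
  return v;
}

/* Matching signedness: the wrapper's parameter is also const int8_t *,
   so the pointer is passed through without any conversion.  */
static model_s8x8
load_signed (const int8_t *p)
{
  return hypothetical_load_s8 (p);
}

/* Mismatched signedness (left as a comment so the sketch compiles
   cleanly): with -Wpointer-sign, GCC warns about the call below.

   static model_s8x8
   load_signed_mismatched (const uint8_t *p)
   {
     return hypothetical_load_s8 (p);
   }
*/
```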

[PATCH] aarch64: Fix type qualifiers for qtbl1 and qtbx1 Neon builtins

2021-09-24 Thread Jonathan Wright via Gcc-patches
Hi, This patch fixes type qualifiers for the qtbl1 and qtbx1 Neon builtins and removes the casts from the Neon intrinsic function bodies that use these builtins. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 23-09

[PATCH] aarch64: Fix float <-> int errors in vld4[q]_lane intrinsics

2021-08-18 Thread Jonathan Wright via Gcc-patches
- gcc/ChangeLog: 2021-08-18  Jonathan Wright   * config/aarch64/arm_neon.h (vld4_lane_f32): Use float RTL pattern. (vld4q_lane_f64): Use float type cast. From: Andreas Schwab Sent: 18 August 2021 13:09 To: Jonathan Wright via Gcc-patches Cc: Jonathan Wright ; Richard San

[PATCH 3/3] aarch64: Remove macros for vld4[q]_lane Neon intrinsics

2021-08-16 Thread Jonathan Wright via Gcc-patches
Hi, This patch removes macros for vld4[q]_lane Neon intrinsics. This is a preparatory step before adding new modes for structures of Advanced SIMD vectors. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-08-16

[PATCH 2/3] aarch64: Remove macros for vld3[q]_lane Neon intrinsics

2021-08-16 Thread Jonathan Wright via Gcc-patches
Hi, This patch removes macros for vld3[q]_lane Neon intrinsics. This is a preparatory step before adding new modes for structures of Advanced SIMD vectors. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-08-16

[PATCH 1/3] aarch64: Remove macros for vld2[q]_lane Neon intrinsics

2021-08-16 Thread Jonathan Wright via Gcc-patches
Hi, This patch removes macros for vld2[q]_lane Neon intrinsics. This is a preparatory step before adding new modes for structures of Advanced SIMD vectors. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-08-12

[PATCH] testsuite: aarch64: Fix invalid SVE tests

2021-08-09 Thread Jonathan Wright via Gcc-patches
Hi, Some scan-assembler tests for SVE code generation were erroneously split over multiple lines - meaning they became invalid. This patch gets the tests working again by putting each test on a single line. The extract_[1234].c tests are corrected to expect that extracted 32-bit values are moved

Re: [PATCH] testsuite: aarch64: Fix failing vector structure tests on big-endian

2021-08-09 Thread Jonathan Wright via Gcc-patches
Hi, I've corrected the quoting and moved everything on to one line. Ok for master? Thanks, Jonathan --- gcc/testsuite/ChangeLog: 2021-08-04  Jonathan Wright   * gcc.target/aarch64/vector_structure_intrinsics.c: Restrict tests to little-endian targets. From: Richard Sandifo

[PATCH 4/4] aarch64: Use memcpy to copy structures in bfloat vst* intrinsics

2021-08-05 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch uses __builtin_memcpy to copy vector structures instead of using a union - or constructing a new opaque structure one vector at a time - in each of the vst[234][q] and vst1[q]_x[234] bfloat Neon intrinsics in arm_neon.h. It also adds new code generation tests to verify

[PATCH 3/4] aarch64: Use memcpy to copy structures in vst2[q]_lane intrinsics

2021-08-05 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch uses __builtin_memcpy to copy vector structures instead of using a union - or constructing a new opaque structure one vector at a time - in each of the vst2[q]_lane Neon intrinsics in arm_neon.h. It also adds new code generation tests to verify that superfluous move ins

[PATCH 2/4] aarch64: Use memcpy to copy structures in vst3[q]_lane intrinsics

2021-08-05 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch uses __builtin_memcpy to copy vector structures instead of using a union - or constructing a new opaque structure one vector at a time - in each of the vst3[q]_lane Neon intrinsics in arm_neon.h. It also adds new code generation tests to verify that superfluous move ins

[PATCH 1/4] aarch64: Use memcpy to copy structures in vst4[q]_lane intrinsics

2021-08-05 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch uses __builtin_memcpy to copy vector structures instead of using a union - or constructing a new opaque structure one vector at a time - in each of the vst4[q]_lane Neon intrinsics in arm_neon.h. It also adds new code generation tests to verify that superfluous move ins
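
The copying strategies contrasted in this series can be sketched as follows. This is an illustration only: `tuple2` stands in for a public tuple type, `opaque2` for the internal opaque structure, and `hypothetical_set_vector` for a per-vector "set" builtin; none of these are the real arm_neon.h names. `__builtin_memcpy` is the GCC/Clang builtin named in the patch descriptions.

```c
#include <stdint.h>
#include <string.h>

typedef int32_t s32x4 __attribute__ ((vector_size (16)));

/* Stand-in for a public two-vector tuple type.  */
typedef struct { s32x4 val[2]; } tuple2;

/* Hypothetical stand-in for the opaque internal structure.  */
typedef struct { unsigned char bytes[2 * sizeof (s32x4)]; } opaque2;

_Static_assert (sizeof (tuple2) == sizeof (opaque2),
                "the two layouts must have the same size");

/* Hypothetical stand-in for a builtin that sets one vector of the
   opaque structure and returns the updated structure.  */
static opaque2
hypothetical_set_vector (opaque2 o, s32x4 v, int index)
{
  memcpy (o.bytes + index * sizeof v, &v, sizeof v);
  return o;
}

/* Old style: build the opaque structure one vector at a time.  */
static opaque2
pack_one_vector_at_a_time (tuple2 t)
{
  opaque2 o = { { 0 } };
  o = hypothetical_set_vector (o, t.val[0], 0);
  o = hypothetical_set_vector (o, t.val[1], 1);
  return o;
}

/* New style: copy the whole structure in one step.  */
static opaque2
pack_with_memcpy (tuple2 t)
{
  opaque2 o;
  __builtin_memcpy (&o, &t, sizeof o);
  return o;
}

/* Byte-wise comparison helper for checking the two results.  */
static int
opaque_equal (const opaque2 *a, const opaque2 *b)
{
  return memcmp (a->bytes, b->bytes, sizeof a->bytes) == 0;
}
```

Both routines produce identical bytes; the single copy gives the compiler one whole-structure move to optimize rather than a chain of per-vector updates.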

[PATCH V2] aarch64: Don't include vec_select high-half in SIMD subtract cost

2021-08-05 Thread Jonathan Wright via Gcc-patches
Hi, V2 of this change implements the same approach as for the multiply and add-widen patches. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-28  Jonathan Wright   * config/aarch64/aarch64.c: Traver

[PATCH V2] aarch64: Don't include vec_select high-half in SIMD add cost

2021-08-04 Thread Jonathan Wright via Gcc-patches
Hi, V2 of this patch uses the same approach as that just implemented for the multiply high-half cost patch. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-28  Jonathan Wright   * config/aarch64/aa

[PATCH V2] aarch64: Don't include vec_select high-half in SIMD multiply cost

2021-08-04 Thread Jonathan Wright via Gcc-patches
): Declare. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vmul_high_cost.c: New test. From: Richard Sandiford Sent: 04 August 2021 10:05 To: Jonathan Wright via Gcc-patches Cc: Jonathan Wright Subject: Re: [PATCH] aarch64: Don't include vec_select high-half in SIMD multiply cost   Jon

[PATCH] testsuite: aarch64: Fix failing vector structure tests on big-endian

2021-08-04 Thread Jonathan Wright via Gcc-patches
: Christophe Lyon Sent: 03 August 2021 10:42 To: Jonathan Wright Cc: gcc-patches@gcc.gnu.org ; Richard Sandiford Subject: Re: [PATCH 1/8] aarch64: Use memcpy to copy vector tables in vqtbl[234] intrinsics   On Fri, Jul 23, 2021 at 10:22 AM Jonathan Wright via Gcc-patches wrote: Hi, This patch

[PATCH] aarch64: Don't include vec_select high-half in SIMD subtract cost

2021-07-29 Thread Jonathan Wright via Gcc-patches
Hi, The Neon subtract-long/subtract-widen instructions can select the top or bottom half of the operand registers. This selection does not change the cost of the underlying instruction and this should be reflected by the RTL cost function. This patch adds RTL tree traversal in the Neon subtract co

[PATCH] aarch64: Don't include vec_select high-half in SIMD add cost

2021-07-29 Thread Jonathan Wright via Gcc-patches
Hi, The Neon add-long/add-widen instructions can select the top or bottom half of the operand registers. This selection does not change the cost of the underlying instruction and this should be reflected by the RTL cost function. This patch adds RTL tree traversal in the Neon add cost function to

[PATCH] aarch64: Don't include vec_select high-half in SIMD multiply cost

2021-07-28 Thread Jonathan Wright via Gcc-patches
Hi, The Neon multiply/multiply-accumulate/multiply-subtract instructions can select the top or bottom half of the operand registers. This selection does not change the cost of the underlying instruction and this should be reflected by the RTL cost function. This patch adds RTL tree traversal in t

Re: [PATCH V2] aarch64: Don't include vec_select in SIMD multiply cost

2021-07-28 Thread Jonathan Wright via Gcc-patches
Hi, V2 of the patch addresses the initial review comments, factors out common code (as we discussed off-list) and adds a set of unit tests to verify the code generation benefit. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/C

Re: [PATCH V2] simplify-rtx: Push sign/zero-extension inside vec_duplicate

2021-07-26 Thread Jonathan Wright via Gcc-patches
Hi, This updated patch fixes the two-operators-per-row style issue in the aarch64-simd.md RTL patterns as well as integrating the simplify-rtx.c change as suggested. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog:

[PATCH] aarch64: Use memcpy to copy vector tables in vst1[q]_x2 intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vst1[q]_x2 Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for

[PATCH] aarch64: Use memcpy to copy vector tables in vst1[q]_x3 intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vst1[q]_x3 Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for

Re: [PATCH 4/8] aarch64: Use memcpy to copy vector tables in vtbx4 intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Same explanation as for patch 3/8: I haven't added test cases here because these intrinsics don't map to a single instruction (they're legacy from Armv7) and would trip the "scan-assembler not mov" that we're using for the other tests. Thanks, Jonathan From: Richa

[PATCH 8/8] aarch64: Use memcpy to copy vector tables in vst1[q]_x4 intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi, This patch uses __builtin_memcpy to copy vector structures instead of using a union in each of the vst1[q]_x4 Neon intrinsics in arm_neon.h. Add new code generation tests to verify that superfluous move instructions are not generated for the vst1q_x4 intrinsics. Regression tested and bootstr

[PATCH 7/8] aarch64: Use memcpy to copy vector tables in vst2[q] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vst2[q] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for ev

Re: [PATCH 3/8] aarch64: Use memcpy to copy vector tables in vtbl[34] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
I haven't added test cases here because these intrinsics don't map to a single instruction (they're legacy from Armv7) and would trip the "scan-assembler not mov" that we're using for the other tests. Jonathan From: Richard Sandiford Sent: 23 July 2021 10:29 To: K

[PATCH 6/8] aarch64: Use memcpy to copy vector tables in vst3[q] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vst3[q] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for ev

[PATCH 5/8] aarch64: Use memcpy to copy vector tables in vst4[q] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vst4[q] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for ev

[PATCH 4/8] aarch64: Use memcpy to copy vector tables in vtbx4 intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vtbx4 Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for ever

[PATCH 3/8] aarch64: Use memcpy to copy vector tables in vtbl[34] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vtbl[34] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for e

[PATCH 2/8] aarch64: Use memcpy to copy vector tables in vqtbx[234] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vqtbx[234] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for

[PATCH 1/8] aarch64: Use memcpy to copy vector tables in vqtbl[234] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vqtbl[234] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for

[PATCH] simplify-rtx: Push sign/zero-extension inside vec_duplicate

2021-07-20 Thread Jonathan Wright via Gcc-patches
Hi, As a general principle, vec_duplicate should be as close to the root of an expression as possible. Where unary operations have vec_duplicate as an argument, these operations should be pushed inside the vec_duplicate. This patch modifies unary operation simplification to push sign/zero-extensi
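
The equivalence behind this simplification can be checked with a scalar model in C. This is a sketch with hypothetical helper names, not RTL: sign-extending every lane of a duplicated value gives the same vector as duplicating the sign-extended scalar, which is why the extension can be pushed inside the vec_duplicate.

```c
#include <stdint.h>

#define LANES 4

/* Scalar models of a 4 x 16-bit and a 4 x 32-bit vector.  */
typedef struct { int16_t lane[LANES]; } model_s16x4;
typedef struct { int32_t lane[LANES]; } model_s32x4;

/* Model of vec_duplicate on a 16-bit scalar.  */
static model_s16x4
dup_s16 (int16_t x)
{
  model_s16x4 v;
  for (int i = 0; i < LANES; i++)
    v.lane[i] = x;
  return v;
}

/* Model of vec_duplicate on a 32-bit scalar.  */
static model_s32x4
dup_s32 (int32_t x)
{
  model_s32x4 v;
  for (int i = 0; i < LANES; i++)
    v.lane[i] = x;
  return v;
}

/* Model of sign_extend applied lane by lane.  */
static model_s32x4
sign_extend_lanes (model_s16x4 v)
{
  model_s32x4 out;
  for (int i = 0; i < LANES; i++)
    out.lane[i] = (int32_t) v.lane[i];
  return out;
}

/* sign_extend (vec_duplicate (x)): extend after duplicating.  */
static model_s32x4
extend_of_duplicate (int16_t x)
{
  return sign_extend_lanes (dup_s16 (x));
}

/* vec_duplicate (sign_extend (x)): extend once, then duplicate.  */
static model_s32x4
duplicate_of_extend (int16_t x)
{
  return dup_s32 ((int32_t) x);
}
```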

[PATCH] aarch64: Don't include vec_select in SIMD multiply cost

2021-07-20 Thread Jonathan Wright via Gcc-patches
Hi, The Neon multiply/multiply-accumulate/multiply-subtract instructions can take various forms - multiplying full vector registers of values or multiplying one vector by a single element of another. Regardless of the form used, these instructions have the same cost, and this should be reflected b

[PATCH] aarch64: Refactor TBL/TBX RTL patterns

2021-07-19 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch renames the two-source-register TBL/TBX RTL patterns so that their names better reflect what they do, rather than confusing them with tbl3 or tbx4 patterns. Also use the correct "neon_tbl2" type attribute for both patterns. Rename single-source-register TBL/TBX patterns

Re: [PATCH V2] gcc: Add vec_select -> subreg RTL simplification

2021-07-15 Thread Jonathan Wright via Gcc-patches
Ah, yes - those test results should have only been changed for little endian. I've submitted a patch to the list restoring the original expected results for big endian. Thanks, Jonathan From: Christophe Lyon Sent: 15 July 2021 10:09 To: Richard Sandiford ; Jonat

testsuite: aarch64: Fix failing SVE tests on big endian

2021-07-15 Thread Jonathan Wright via Gcc-patches
Hi, A recent change "gcc: Add vec_select -> subreg RTL simplification" updated the expected test results for SVE extraction tests. The new result should only have been changed for little endian. This patch restores the old expected result for big endian. Ok for master? Thanks, Jonathan --- gcc

[PATCH] aarch64: Use unions for vector tables in vqtbl[234] intrinsics

2021-07-09 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch uses a union instead of constructing a new opaque vector structure for each of the vqtbl[234] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set i

[PATCH V2] gcc: Add vec_select -> subreg RTL simplification

2021-07-07 Thread Jonathan Wright via Gcc-patches
Hi, Version 2 of this patch adds more code generation tests to show the benefit of this RTL simplification as well as adding a new helper function 'rtx_vec_series_p' to reduce code duplication. Patch tested as version 1 - ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-08  Jonathan

[PATCH] gcc: Add vec_select -> subreg RTL simplification

2021-07-02 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch adds a new RTL simplification for the case of a VEC_SELECT selecting the low part of a vector. The simplification returns a SUBREG. The primary goal of this patch is to enable better combinations of Neon RTL patterns - specifically allowing generation of 'write-to-high

[PATCH V2] aarch64: Model zero-high-half semantics of ADDHN/SUBHN instructions

2021-06-16 Thread Jonathan Wright via Gcc-patches
depending on endianness. (aarch64_hn_insn_le): Define. (aarch64_hn_insn_be): Define. gcc/testsuite/ChangeLog: * gcc.target/aarch64/narrow_zero_high_half.c: Add new tests. From: Gcc-patches on behalf of Jonathan Wright via Gcc-patches Sent: 15 June 2021 11:02 To: gcc-patches

[PATCH V2] aarch64: Model zero-high-half semantics of [SU]QXTN instructions

2021-06-16 Thread Jonathan Wright via Gcc-patches
endianness. gcc/testsuite/ChangeLog: * gcc.target/aarch64/narrow_zero_high_half.c: Add new tests. From: Gcc-patches on behalf of Jonathan Wright via Gcc-patches Sent: 15 June 2021 10:59 To: gcc-patches@gcc.gnu.org Subject: [PATCH] aarch64: Model zero-high-half semantics of [SU]QXTN

[PATCH V2] aarch64: Model zero-high-half semantics of SQXTUN instruction in RTL

2021-06-16 Thread Jonathan Wright via Gcc-patches
): Define. gcc/testsuite/ChangeLog: * gcc.target/aarch64/narrow_zero_high_half.c: Add new tests. From: Gcc-patches on behalf of Jonathan Wright via Gcc-patches Sent: 15 June 2021 10:52 To: gcc-patches@gcc.gnu.org Subject: [PATCH] aarch64: Model zero-high-half semantics of SQXTUN

[PATCH V2] aarch64: Model zero-high-half semantics of XTN instruction in RTL

2021-06-16 Thread Jonathan Wright via Gcc-patches
/narrow_zero_high_half.c: Add new tests. From: Gcc-patches on behalf of Jonathan Wright via Gcc-patches Sent: 15 June 2021 10:45 To: gcc-patches@gcc.gnu.org Subject: [PATCH] aarch64: Model zero-high-half semantics of XTN instruction in RTL   Hi, Modeling the zero-high-half semantics of the XTN narrowing

[PATCH] testsuite: aarch64: Add zero-high-half tests for narrowing shifts

2021-06-16 Thread Jonathan Wright via Gcc-patches
Hi, This patch adds tests to verify that Neon narrowing-shift instructions clear the top half of the result vector. It is sufficient to show that a subsequent combine with a zero-vector is optimized away - leaving just the narrowing-shift instruction. Ok for master? Thanks, Jonathan --- gcc/te

[PATCH] aarch64: Model zero-high-half semantics of ADDHN/SUBHN instructions

2021-06-15 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch models the zero-high-half semantics of the narrowing arithmetic Neon instructions in the aarch64_hn RTL pattern. Modeling these semantics allows for better RTL combinations while also removing some register allocation issues as the compiler now knows that the operation i

[PATCH] aarch64: Model zero-high-half semantics of [SU]QXTN instructions

2021-06-15 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch first splits the aarch64_qmovn pattern into separate scalar and vector variants. It then further splits the vector RTL pattern into big/little endian variants that model the zero-high-half semantics of the underlying instruction. Modeling these semantics allows for bett

[PATCH] aarch64: Model zero-high-half semantics of SQXTUN instruction in RTL

2021-06-15 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch first splits the aarch64_sqmovun pattern into separate scalar and vector variants. It then further splits the vector pattern into big/little endian variants that model the zero-high-half semantics of the underlying instruction. Modeling these semantics allows for better R

[PATCH] aarch64: Model zero-high-half semantics of XTN instruction in RTL

2021-06-15 Thread Jonathan Wright via Gcc-patches
Hi, Modeling the zero-high-half semantics of the XTN narrowing instruction in RTL indicates to the compiler that this is a totally destructive operation. This enables more RTL simplifications and also prevents some register allocation issues. Regression tested and bootstrapped on aarch64-none-lin
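
The zero-high-half behaviour can be illustrated with a scalar model in C. The types and function names below are hypothetical and model only the lane values, not the RTL: XTN writes the narrowed elements to the low half of the 128-bit destination and clears the high half, so combining the narrowed low half with a zero vector changes nothing.

```c
#include <stdint.h>

/* Scalar models of 128-bit registers.  */
typedef struct { uint32_t lane[4]; } model_u32x4;   /* source */
typedef struct { uint16_t lane[8]; } model_u16x8;   /* destination */

/* Model of XTN: truncate each 32-bit lane to 16 bits, write the results
   to lanes 0..3 and leave lanes 4..7 as zero.  */
static model_u16x8
model_xtn (model_u32x4 src)
{
  model_u16x8 dst = { { 0 } };
  for (int i = 0; i < 4; i++)
    dst.lane[i] = (uint16_t) src.lane[i];
  return dst;
}

/* Model of combining the low half of a vector with a zero vector.  */
static model_u16x8
model_combine_low_with_zero (model_u16x8 v)
{
  model_u16x8 out = { { 0 } };
  for (int i = 0; i < 4; i++)
    out.lane[i] = v.lane[i];
  return out;
}
```

Because the result of `model_xtn` already has a zero high half, `model_combine_low_with_zero` returns it unchanged; once the compiler knows this, it can drop the combine.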

[PATCH] aarch64: Use correct type attributes for RTL generating XTN(2)

2021-05-19 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch corrects the type attribute in RTL patterns that generate XTN/XTN2 instructions to be "neon_move_narrow_q". This makes a material difference because these instructions can be executed on both SIMD pipes in the Cortex-A57 core model, whereas the "neon_shift_imm_narrow_q"

[PATCH] aarch64: Use an expander for quad-word vec_pack_trunc pattern

2021-05-19 Thread Jonathan Wright via Gcc-patches
Hi, The existing vec_pack_trunc RTL pattern emits an opaque two-instruction assembly code sequence that prevents proper instruction scheduling. This commit changes the pattern to an expander that emits individual xtn and xtn2 instructions. This commit also consolidates the duplicate truncation p

[PATCH 5/5] testsuite: aarch64: Add tests for high-half narrowing instructions

2021-05-18 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch adds tests to confirm that a *2 (write to high-half) Neon instruction is generated from vcombine* of a narrowing intrinsic sequence. Ok for master? Thanks, Jonathan --- gcc/testsuite/ChangeLog: 2021-05-14  Jonathan Wright   * gcc.target/aarch64/narrow_high_

[PATCH 4/5] aarch64: Refactor aarch64_qshrn_n RTL pattern

2021-05-18 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch splits the aarch64_qshrn_n pattern into separate scalar and vector variants. It further splits the vector pattern into big/little endian variants that model the zero-high-half semantics of the underlying instruction - allowing for more combinations with the write-to-high

[PATCH 3/5] aarch64: Relax aarch64_sqxtun2 RTL pattern

2021-05-18 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch uses UNSPEC_SQXTUN instead of UNSPEC_SQXTUN2 in the aarch64_sqxtun2 patterns. This allows for more aggressive combinations and ultimately better code generation - which will be confirmed by a new set of tests in gcc.target/aarch64/narrow_high_combine.c (patch 5/5 in

[PATCH 2/5] aarch64: Relax aarch64_qshrn2_n RTL pattern

2021-05-18 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch implements saturating right-shift and narrow high Neon intrinsic RTL patterns using a vec_concat of a register_operand and a VQSHRN_N unspec - instead of just a VQSHRN_N unspec. This more relaxed pattern allows for more aggressive combinations and ultimately better code

[PATCH 1/5] aarch64: Relax aarch64_hn2 RTL pattern

2021-05-18 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch implements v[r]addhn2 and v[r]subhn2 Neon intrinsic RTL patterns using a vec_concat of a register_operand and an ADDSUBHN unspec - instead of just an ADDSUBHN2 unspec. This more relaxed pattern allows for more aggressive combinations and ultimately better code generation
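
As a rough illustration of why a vec_concat form is a natural description of the "2" (write to high half) variants, the sketch below models ADDHN and ADDHN2 on 16-bit lanes in portable C. The function names (`model_addhn`, `model_addhn2`, `addhn2_checks`) are hypothetical and the code is a scalar model under those assumptions, not the RTL pattern from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of ADDHN (8 x u16 -> 8 x u8): add lane-wise with wrap-around and
   keep the most-significant 8 bits of each 16-bit sum. */
static void model_addhn(uint8_t out[8], const uint16_t a[8], const uint16_t b[8])
{
    for (int i = 0; i < 8; i++) {
        uint16_t sum = (uint16_t)(a[i] + b[i]);
        out[i] = (uint8_t)(sum >> 8);
    }
}

/* Model of ADDHN2 written as a concatenation: the low half is an existing
   64-bit value, the high half is the plain ADDHN result. */
static void model_addhn2(uint8_t out[16], const uint8_t low[8],
                         const uint16_t a[8], const uint16_t b[8])
{
    memcpy(out, low, 8);          /* low half passes through unchanged */
    model_addhn(out + 8, a, b);   /* high half is the narrowed sum */
}

/* A few checks documenting the behaviour of the model above. */
void addhn2_checks(void)
{
    const uint16_t a[8] = {0x0100, 0x01FF, 0xFF00, 0x1234, 0, 0, 0, 0};
    const uint16_t b[8] = {0x0100, 0x0001, 0x0100, 0x0101, 0, 0, 0, 0};
    const uint8_t low[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    uint8_t out[16];

    model_addhn2(out, low, a, b);

    assert(out[0] == 1 && out[7] == 8);  /* low half preserved */
    assert(out[8] == 0x02);              /* 0x0100 + 0x0100 = 0x0200 */
    assert(out[9] == 0x02);              /* 0x01FF + 0x0001 = 0x0200 */
    assert(out[10] == 0x00);             /* 0xFF00 + 0x0100 wraps to 0x0000 */
    assert(out[11] == 0x13);             /* 0x1234 + 0x0101 = 0x1335 */
}
```

In this model the high-half variant is just the ordinary narrowing operation placed next to an unrelated low half, which is the shape that lets a combine step match it against a vcombine of a narrowing intrinsic.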

Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-05-04 Thread Jonathan Wright via Gcc-patches
Hi Richard, I think you may be referencing an older checkout as we refactored this pattern in a previous change to: (define_insn "mul_lane3" [(set (match_operand:VMUL 0 "register_operand" "=w") (mult:VMUL (vec_duplicate:VMUL (vec_select: (match_operand:VMUL 2 "register_oper

Re: [PATCH 14/20] testsuite: aarch64: Add fusion tests for FP vml[as] intrinsics

2021-04-30 Thread Jonathan Wright via Gcc-patches
via Gcc-patches Cc: Jonathan Wright Subject: Re: [PATCH 14/20] testsuite: aarch64: Add fusion tests for FP vml[as] intrinsics Jonathan Wright via Gcc-patches writes: > Hi, > > As subject, this patch adds compilation tests to make sure that the output > of vmla/vmls floating-point Neo

Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-04-30 Thread Jonathan Wright via Gcc-patches
Updated the patch to be more consistent with the others in the series. Tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan From: Gcc-patches on behalf of Jonathan Wright via Gcc-patches Sent: 28 April 2021 15:42 To

Re: [PATCH 12/20] aarch64: Use RTL builtins for FP ml[as][q]_lane intrinsics

2021-04-30 Thread Jonathan Wright via Gcc-patches
Patch updated as per suggestion (similar to patch 10/20.) Tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan From: Richard Sandiford Sent: 28 April 2021 16:37 To: Jonathan Wright via Gcc-patches Cc: Jonathan Wright

Re: [PATCH 10/20] aarch64: Use RTL builtins for FP ml[as]_n intrinsics

2021-04-30 Thread Jonathan Wright via Gcc-patches
Patch updated as per your suggestion. Tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan From: Richard Sandiford Sent: 28 April 2021 16:11 To: Jonathan Wright via Gcc-patches Cc: Jonathan Wright Subject: Re: [PATCH

Re: [PATCH 1/20] aarch64: Use RTL builtin for vmull[_high]_p8 intrinsics

2021-04-30 Thread Jonathan Wright via Gcc-patches
Thanks for the review, I've updated the patch as per option 1. Tested and bootstrapped on aarch64-none-linux-gnu with no issues. Ok for master? Thanks, Jonathan From: Richard Sandiford Sent: 28 April 2021 15:11 To: Jonathan Wright via Gcc-patches Cc: Jon

[PATCH 20/20] aarch64: Remove unspecs from [su]qmovn RTL pattern

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, Saturating truncation can be expressed using the RTL expressions ss_truncate and us_truncate. This patch changes the implementation of the vqmovn_* Neon intrinsics to use these RTL expressions rather than a pair of unspecs. The redundant unspecs are removed along with their code iterator. Reg
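
To make the semantics concrete, here is a minimal per-lane sketch in portable C of the clamping that ss_truncate and us_truncate describe when narrowing from 16 to 8 bits. The helper names (`ss_truncate_s16`, `us_truncate_u16`, `saturating_truncate_checks`) are hypothetical and chosen only for this illustration; GCC's RTL codes operate on whole vector modes rather than on single lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Signed saturating narrow of one lane, 16 -> 8 bits: values outside the
   int8_t range clamp to the nearest representable bound. */
static int8_t ss_truncate_s16(int16_t x)
{
    if (x > INT8_MAX)
        return INT8_MAX;
    if (x < INT8_MIN)
        return INT8_MIN;
    return (int8_t)x;
}

/* Unsigned saturating narrow of one lane, 16 -> 8 bits: values above
   UINT8_MAX clamp to UINT8_MAX. */
static uint8_t us_truncate_u16(uint16_t x)
{
    return x > UINT8_MAX ? (uint8_t)UINT8_MAX : (uint8_t)x;
}

/* A few checks documenting the behaviour of the model above. */
void saturating_truncate_checks(void)
{
    assert(ss_truncate_s16(300) == 127);    /* clamps high */
    assert(ss_truncate_s16(-300) == -128);  /* clamps low */
    assert(ss_truncate_s16(-5) == -5);      /* in range: unchanged */
    assert(us_truncate_u16(300) == 255);    /* clamps high */
    assert(us_truncate_u16(42) == 42);      /* in range: unchanged */
}
```

Since the operation is fully described by this arithmetic, expressing it with dedicated RTL codes instead of opaque unspecs gives the compiler something it can reason about.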

[PATCH 19/20] aarch64: Update attributes of arm_acle.h intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch updates the attributes of all intrinsics defined in arm_acle.h to be consistent with the attributes of the intrinsics defined in arm_neon.h. Specifically, this means updating the attributes from:   __extension__ static __inline   __attribute__ ((__always_inline__)) to:

[PATCH 18/20] aarch64: Update attributes of arm_fp16.h intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch updates the attributes of all intrinsics defined in arm_fp16.h to be consistent with the attributes of the intrinsics defined in arm_neon.h. Specifically, this means updating the attributes from:   __extension__ static __inline   __attribute__ ((__always_inline__)) to:

[PATCH 17/20] aarch64: Relax aarch64_qshrnn2_n RTL pattern

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch implements the saturating right-shift and narrow high Neon intrinsic RTL patterns using a vec_concat of a register_operand and a VQSHRN_N unspec - instead of just a VQSHRN2_N unspec. This more relaxed pattern allows for more aggressive combinations and ultimately better

[PATCH 16/20] aarch64: Relax aarch64_hn2 RTL pattern

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch implements the v[r]addhn2 and v[r]subhn2 Neon intrinsic RTL patterns using a vec_concat of a register_operand and an ADDSUBHN unspec - instead of just an ADDSUBHN2 unspec. This more relaxed pattern allows for more aggressive combinations and ultimately better code genera

[PATCH 15/20] aarch64: Use RTL builtins for vcvtx intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch rewrites the vcvtx Neon intrinsics to use RTL builtins rather than inline assembly code, allowing for better scheduling and optimization. Regression tested and bootstrapped on aarch64-none-linux-gnu and aarch64_be-none-elf - no issues. Ok for master? Thanks, Jonathan

[PATCH 14/20] testsuite: aarch64: Add fusion tests for FP vml[as] intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch adds compilation tests to make sure that the output of vmla/vmls floating-point Neon intrinsics (fmul, fadd/fsub) is not fused into fmla/fmls instructions. Ok for master? Thanks, Jonathan --- gcc/testsuite/ChangeLog: 2021-02-16  Jonathan Wright   * gcc.targ

[PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch rewrites the floating-point vml[as][q]_laneq Neon intrinsics to use RTL builtins rather than relying on the GCC vector extensions. Using RTL builtins allows control over the emission of fmla/fmls instructions (which we don't want here.) With this commit, the code genera

[PATCH 12/20] aarch64: Use RTL builtins for FP ml[as][q]_lane intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch rewrites the floating-point vml[as][q]_lane Neon intrinsics to use RTL builtins rather than relying on the GCC vector extensions. Using RTL builtins allows control over the emission of fmla/fmls instructions (which we don't want here.) With this commit, the code generat

[PATCH 11/20] aarch64: Use RTL builtins for FP ml[as] intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch rewrites the floating-point vml[as][q] Neon intrinsics to use RTL builtins rather than relying on the GCC vector extensions. Using RTL builtins allows control over the emission of fmla/fmls instructions (which we don't want here.) With this commit, the code generated by

[PATCH 10/20] aarch64: Use RTL builtins for FP ml[as]_n intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch rewrites the floating-point vml[as][q]_n Neon intrinsics to use RTL builtins rather than inline assembly code, allowing for better scheduling and optimization. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan -

[PATCH 9/20] aarch64: Use RTL builtins for v[q]tbx intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch rewrites the v[q]tbx Neon intrinsics to use RTL builtins rather than inline assembly code, allowing for better scheduling and optimization. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog:

[PATCH 8/20] aarch64: Use RTL builtins for v[q]tbl intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch rewrites the v[q]tbl Neon intrinsics to use RTL builtins rather than inline assembly code, allowing for better scheduling and optimization. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog:

[PATCH 7/20] aarch64: Use RTL builtins for polynomial vsri[q]_n intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi, As subject, this patch rewrites the vsri[q]_n_p* Neon intrinsics to use RTL builtins rather than inline assembly code, allowing for better scheduling and optimization. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeL
