Re: libatomic: use HWCAPs in AArch64 ifunc tests

2025-03-13 Thread Wilco Dijkstra
Hi Richard, > Could you give details?  I thought it was always known that trapped > system register accesses were slow.  In the previous versions, the > checks seemed to be presented as an up-front price worth paying for > faster atomic operations, on the systems that would use those paths. > Now

Re: AArch64: Turn off outline atomics with -mcmodel=large (PR112465)

2025-03-12 Thread Wilco Dijkstra
Hi Richard, > That was also what I was trying to say.  In the worst case, the linked > object has to meet the requirements of the lowest common denominator. > > And my supposition was that that isn't a property of static vs dynamic. But it is. Dynamic linking supports mixing different code models

Re: AArch64: Turn off outline atomics with -mcmodel=large (PR112465)

2025-03-07 Thread Wilco Dijkstra
Hi Richard, >> Basically the small and large model are fundamentally incompatible. The >> infamous >> "dumb linker" approach means it doesn't try to sort sections, so an ADRP >> relocation >> will be out of reach if its data is placed after a huge array. Static >> linking with GLIBC or >> enabl

Re: AArch64: Turn off outline atomics with -mcmodel=large (PR112465)

2025-03-04 Thread Wilco Dijkstra
Hi Ramana, > -Generate code for the large code model.  This makes no assumptions about > -addresses and sizes of sections.  Programs can be statically linked only.  > The > +Generate code for the large code model.  This allows large .bss and .data > +sections, however .text and .rodata must still

Re: AArch64: Turn off outline atomics with -mcmodel=large (PR112465)

2025-03-04 Thread Wilco Dijkstra
Hi Kyrill, > This restriction should be documented in invoke.texi IMO. > I also think it would be more user friendly to warn them about the > incompatibility if an explicit -moutline-atomics option is passed. > It’s okay though to silently turn off the implicit default-on option though. I've upd

Re: AArch64: Enable early scheduling for -O3 and higher (PR118351)

2025-03-04 Thread Wilco Dijkstra
Hi Richard&Kyrill, >> I’m in favour of this. > > Yeah, seems ok to me too.  I suppose we ought to update the documentation too: I've added a note to the documentation. However it is impossible to be complete here since many targets switch off early scheduling under various circumstances. So I'v

libatomic: use HWCAPs in AArch64 ifunc tests

2025-03-03 Thread Wilco Dijkstra
Feedback from the kernel team suggests that it's best to only use HWCAPs rather than also use low-level checks as done by has_lse128() and has_rcpc3(). So change these to just use HWCAPs which simplifies the code and speeds up ifunc selection by avoiding expensive system register accesses. Passes
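
A minimal sketch of what HWCAP-only feature tests can look like, written in the spirit of the description above rather than copied from libatomic. The bit positions are assumptions based on the Linux AArch64 hwcap headers, and the `sketch_` names are illustrative only.

```c
#include <stdbool.h>

/* Assumed bit positions (Linux AArch64 <asm/hwcap.h>).  They are defined
   locally so that the sketch also compiles on other hosts.  */
#define SKETCH_HWCAP_ATOMICS (1UL << 8)   /* LSE atomics */
#define SKETCH_HWCAP_USCAT   (1UL << 25)  /* LSE2 */

/* Test a feature purely from the HWCAP mask: no system register reads,
   so nothing here can trap.  */
bool
sketch_has_lse (unsigned long hwcap)
{
  return (hwcap & SKETCH_HWCAP_ATOMICS) != 0;
}

bool
sketch_has_lse2 (unsigned long hwcap)
{
  return (hwcap & SKETCH_HWCAP_USCAT) != 0;
}
```

On Linux the mask would typically come from getauxval (AT_HWCAP) in <sys/auxv.h>, or be passed to the ifunc resolver as an argument.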

libgcc: Remove PREDRES and LS64 from AArch64 cpuinfo

2025-03-03 Thread Wilco Dijkstra
Change AArch64 cpuinfo to follow the latest updates to the FMV spec [1]: Remove FEAT_PREDRES and FEAT_LS64*. Preserve the ordering in enum CPUFeatures. Passes regress, OK for commit? [1] https://github.com/ARM-software/acle/pull/382 gcc: * common/config/aarch64/cpuinfo.h: Remove FEAT_PR

AArch64: Enable early scheduling for -O3 and higher (PR118351)

2025-03-03 Thread Wilco Dijkstra
Enable the early scheduler on AArch64 for O3/Ofast. This means GCC15 benefits from much faster build times with -O2, but avoids the regressions in lbm which is very sensitive to minor scheduling changes due to long FMA chains. We can then revisit this for GCC16. gcc: PR target/118351

AArch64: Turn off outline atomics with -mcmodel=large (PR112465)

2025-03-03 Thread Wilco Dijkstra
Outline atomics is not designed to be used with -mcmodel=large, so disable it automatically if the large code model is used. Passes regress, OK for commit? gcc: PR target/112465 * config/aarch64/aarch64.cc (aarch64_override_options_after_change_1): Turn off outline atomic
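
For context, a small sketch of the kind of source that outline atomics affects. The function and variable names are hypothetical. The comments describe the usual GCC behaviour on AArch64 as I understand it, not the contents of this patch.

```c
#include <stdatomic.h>

/* Hypothetical shared counter.  */
static atomic_int sketch_counter;

/* With -moutline-atomics and a baseline Armv8.0-A target, GCC normally
   compiles this fetch-add into a call to a libgcc helper (for example
   __aarch64_ldadd4_relax) that picks LSE or LL/SC at run time.  With
   outline atomics turned off, the sequence is emitted inline instead.  */
int
sketch_increment (void)
{
  return atomic_fetch_add_explicit (&sketch_counter, 1,
                                    memory_order_relaxed) + 1;
}
```

Comparing the assembly from -mcmodel=small and -mcmodel=large builds of such a function is one way to see whether the helper call is still emitted.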

Re: [PATCH 3/3] AArch64: Add SVE vector cost to baseline tuning

2025-01-14 Thread Wilco Dijkstra
Hi Richard, > Sorry to be awkward, but I don't think we should put > AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT in base. > CHEAP_SHIFT_EXTEND is a good base flag because it means we can make full > use of a certain group of instructions.  FULLY_PIPELINED_FMA similarly > means that FMA chains beh

Re: [PATCH] AArch64: Deprecate -mabi=ilp32

2025-01-14 Thread Wilco Dijkstra
Hi Richard, >> +  if (TARGET_ILP32) >> +    warning (OPT_Wdeprecated, "%<-mabi=ilp32%> is deprecated."); > > There should be no "." at the end of the message. Right, fixed in v2 below. > Otherwise it looks good to me, although like Kyrill says, it'll also > need a release note. I've added one,

[wwwdocs] gcc-15: Deprecate ILP32 on AArch64

2025-01-14 Thread Wilco Dijkstra
As suggested in https://gcc.gnu.org/pipermail/gcc-patches/2025-January/673558.html update the gcc-15 Changes page: Add ILP32 deprecation to Caveats section. --- diff --git a/htdocs/gcc-15/changes.html b/htdocs/gcc-15/changes.html index 1c690c4a168f4d6297ad33dd5b798e9200792dc5..d5037efb34cc8e6

Re: [PATCH] AArch64: Deprecate -mabi=ilp32

2025-01-13 Thread Wilco Dijkstra
Hi all, > In that case, I'm coming round to the idea of deprecating ILP32. > I think it was already common ground that the GNU/Linux support is dead. > watchOS would use Mach objects rather than ELF.  As you say, it isn't > clear how much of the current ILP32 support would be relevant for it. > An

Re: [PATCH] AArch64: Cleanup alignment macros

2025-01-10 Thread Wilco Dijkstra
Hi Richard, > It looks like you committed the original version instead, with no extra > explanation.  I suppose I should have asked for another review round > instead. Did you check the commit log? Change the AARCH64_EXPAND_ALIGNMENT macro into proper function calls to make future change

Re: [PATCH] libatomic: Cleanup AArch64 ifunc selection

2025-01-10 Thread Wilco Dijkstra
Hi Richard, > Yeah, somewhat.  But won't we go on to test has_lse2 anyway, due to: > > #  elif defined (LSE2_LRCPC3_ATOP) > #   define IFUNC_NCOND(N)   2 > #   define IFUNC_COND_1 (has_rcpc3 (hwcap, features)) > #   define IFUNC_COND_2 (has_lse2 (hwcap, features)) > > If we want to reduce the

Re: [PATCH] AArch64: Deprecate -mabi=ilp32

2025-01-10 Thread Wilco Dijkstra
Hi Andrew, > Personally I would like this deprecated even for bare-metal. Yes the > iwatch ABI is an ILP32 ABI but I don't see GCC implementing that any > time soon and I suspect it would not be hard to resurrect the code at > that point. My patch deprecates it in all cases currently. It will be

Re: [PATCH] libatomic: Cleanup AArch64 ifunc selection

2025-01-10 Thread Wilco Dijkstra
Hi Richard, >> +  /* LSE2 is a prerequisite for atomic LDIAPP/STILP.  */ >> +  if (!(hwcap & HWCAP_USCAT)) >> return false; > > Is there a reason for not using has_lse2 here?  It'd be good to have > a comment if so. Yes, the MRS instructions cause expensive traps, so we try to avoid them whe

Re: [PATCH 3/3] AArch64: Add SVE vector cost to baseline tuning

2025-01-10 Thread Wilco Dijkstra
Hi Kyrill, >> Add AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS and >> AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >> to the baseline tuning since all modern cores use it.  Fix the >> neoverse512tvb tuning to be >> like Neoverse V1/V2. > > For neoversev512tvb this means adding AARCH64_EXTRA_TUNE_AVOI

Re: [PATCH 3/3] AArch64: Add SVE vector cost to baseline tuning

2025-01-10 Thread Wilco Dijkstra
ping   Add AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS and AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT to the baseline tuning since all modern cores use it.  Fix the neoverse512tvb tuning to be like Neoverse V1/V2. gcc/ChangeLog:     * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TU

Re: [PATCH 2/3] AArch64: Add FULLY_PIPELINED_FMA to tune baseline

2025-01-10 Thread Wilco Dijkstra
ping   Add FULLY_PIPELINED_FMA to tune baseline - this is a generic feature that is already enabled for some cores, but benchmarking it shows it is faster on all modern cores (SPECFP improves ~0.17% on Neoverse V1 and 0.04% on Neoverse N1). Passes regress & bootstrap, OK for commit? gcc/ChangeLo

Re: [PATCH] libatomic: Cleanup AArch64 ifunc selection

2025-01-10 Thread Wilco Dijkstra
ping   Simplify and cleanup ifunc selection logic.  Since LRCPC3 does not imply LSE2, has_rcpc3() should also check LSE2 is enabled. Passes regress and bootstrap, OK for commit? libatomic:     * config/linux/aarch64/host-config.h (has_lse2): Cleanup.     (has_lse128): Likewise.     (

[PATCH] AArch64: Deprecate -mabi=ilp32

2025-01-10 Thread Wilco Dijkstra
ILP32 was originally intended to make porting to AArch64 easier. Support was never merged in the Linux kernel or GLIBC, so it has been unsupported for many years. There isn't a benefit in keeping unsupported features forever, so deprecate it now (and it could be removed in a future release). Pa

[PATCH] AArch64: Remove Cortex-A57 FMA steering pass

2025-01-10 Thread Wilco Dijkstra
As a minor cleanup remove the Cortex-A57 FMA steering pass. Since Cortex-A57 is pretty old, there isn't any benefit in keeping this. Passes regress & bootstrap, OK for commit? gcc: * config.gcc (extra_objs): Remove cortex-a57-fma-steering.o. * config/aarch64/aarch64-passes.def: Remo

Re: [PATCH v2] AArch64: Block combine_and_move from creating FP literal loads

2025-01-09 Thread Wilco Dijkstra
Hi Richard, > The patch below is what I meant.  It passes bootstrap & regression-test > on aarch64-linux-gnu (and so produces the same results for the tests > that you changed).  Do you see any problems with this version? > If not, I think we should go with it. Thanks for the detailed example - u

Re: [PATCH] AArch64: Cleanup alignment macros

2024-12-06 Thread Wilco Dijkstra
Hi Richard, >> A common case is a constant string which is compared against some >> argument. Most string functions work on 8 or 16-byte quantities. If we >> ensure the whole array fits in one aligned load, we save time in the >> string function. >> >> Runtime data collected for strlen calls shows

Re: [PATCH] AArch64: Cleanup alignment macros

2024-12-06 Thread Wilco Dijkstra
Hi Richard, > So just to be sure I understand: we still want to align (say) an array > of 4 chars to 32 bits so that the LDR & STR are aligned, and an array of > 3 chars to 32 bits so that the LDRH & STRH for the leading two bytes are > aligned?  Is that right?  We don't seem to take advantage of

[PATCH] arm: Fix LDRD register overlap [PR117675]

2024-12-03 Thread Wilco Dijkstra
The register indexed variants of LDRD have complex register overlap constraints which makes them hard to use without using output_move_double (which can't be used for atomics as it doesn't guarantee to emit atomic LDRD/STRD when required). Add a new predicate and constraint for plain LDRD/STRD wi

[PATCH] AArch64: Cleanup alignment macros

2024-12-03 Thread Wilco Dijkstra
Change the AARCH64_EXPAND_ALIGNMENT macro into proper function calls to make future changes easier. Use the existing alignment settings, however avoid overaligning small arrays or structs to 64 bits when there is no benefit. This gives a small reduction in data and stack size. Passes regress & b

[PATCH] libatomic: Cleanup AArch64 ifunc selection

2024-11-27 Thread Wilco Dijkstra
Simplify and cleanup ifunc selection logic. Since LRCPC3 does not imply LSE2, has_rcpc3() should also check LSE2 is enabled. Passes regress and bootstrap, OK for commit? libatomic: * config/linux/aarch64/host-config.h (has_lse2): Cleanup. (has_lse128): Likewise. (has_rcp
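
The prerequisite check described above can be sketched as follows. This is an illustration under assumptions, not the patch itself: the bit positions are what I believe the Linux AArch64 HWCAP and HWCAP2 words use, and `sketch_has_rcpc3` is a made-up name with a simplified signature.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions; defined locally so the sketch is portable.  */
#define SKETCH_HWCAP_USCAT   (UINT64_C (1) << 25)  /* LSE2, in AT_HWCAP */
#define SKETCH_HWCAP2_LRCPC3 (UINT64_C (1) << 46)  /* LRCPC3, in AT_HWCAP2 */

bool
sketch_has_rcpc3 (uint64_t hwcap, uint64_t hwcap2)
{
  /* LRCPC3 does not imply LSE2, so check the prerequisite first.  */
  if (!(hwcap & SKETCH_HWCAP_USCAT))
    return false;

  return (hwcap2 & SKETCH_HWCAP2_LRCPC3) != 0;
}
```

Testing the cheap prerequisite first also means the second word is only examined on systems that can use the LRCPC3 path at all.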

Re: [PATCH 3/3] AArch64: Add SVE vector cost to baseline tuning

2024-11-15 Thread Wilco Dijkstra
Hi Kyrill, > This would make USE_NEW_VECTOR_COSTS effectively the default. > Jennifer has been trying to do that as well and then to remove it (as it > would be always true) but there are some codegen regressions that still > > need to be addressed. Yes, that's the goal - we should use good tun

[PATCH 2/3] AArch64: Add FULLY_PIPELINED_FMA to tune baseline

2024-11-14 Thread Wilco Dijkstra
Add FULLY_PIPELINED_FMA to tune baseline - this is a generic feature that is already enabled for some cores, but benchmarking it shows it is faster on all modern cores (SPECFP improves ~0.17% on Neoverse V1 and 0.04% on Neoverse N1). Passes regress & bootstrap, OK for commit? gcc/ChangeLog:

[PATCH 1/3] AArch64: Add baseline tune

2024-11-14 Thread Wilco Dijkstra
Cleanup the extra tune defines by introducing AARCH64_EXTRA_TUNE_BASE as a common base supported by all modern cores. Initially set it to AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND. No change in generated code. Passes regress & bootstrap, OK for commit? gcc/ChangeLog: * config/aarch64/aarc
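
The "common base plus per-core extras" pattern can be sketched in plain C. The flag values and the second flag are hypothetical; only the idea of a shared base that initially equals CHEAP_SHIFT_EXTEND comes from the description above.

```c
#include <stdint.h>

/* Hypothetical flag encodings, for illustration only.  */
enum
{
  SKETCH_TUNE_CHEAP_SHIFT_EXTEND = 1u << 0,
  SKETCH_TUNE_OTHER_FLAG = 1u << 1   /* made-up per-core extra */
};

/* The shared base: initially just CHEAP_SHIFT_EXTEND.  */
#define SKETCH_TUNE_BASE (SKETCH_TUNE_CHEAP_SHIFT_EXTEND)

/* A core's tuning flags are the base plus whatever it adds on top.  */
uint32_t
sketch_core_tune_flags (uint32_t extra)
{
  return SKETCH_TUNE_BASE | extra;
}
```

Adding a flag to the base then changes every core that uses it in one place, which is what the later patches in this series rely on.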

[PATCH 3/3] AArch64: Add SVE vector cost to baseline tuning

2024-11-14 Thread Wilco Dijkstra
Add AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS and AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT to the baseline tuning since all modern cores use it. Fix the neoverse512tvb tuning to be like Neoverse V1/V2. gcc/ChangeLog: * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNE_BASE

Re: [PATCH v2] AArch64: Block combine_and_move from creating FP literal loads

2024-11-13 Thread Wilco Dijkstra
Hi Richard, > ...I still think we should avoid testing can_create_pseudo_p. > Does it work with the last part replaced by: > >  if (!DECIMAL_FLOAT_MODE_P (mode)) >    { >  if (aarch64_can_const_movi_rtx_p (src, mode) >  || aarch64_float_const_representable_p (src) >  || aarch64

Re: [PATCH] AArch64: Switch off early scheduling

2024-11-12 Thread Wilco Dijkstra
Hi, >>> What do you think about disabling late scheduling as well? >> >> I think this would definitely need separate consideration and evaluation >> given the above. >> >> Another thing to consider is the macro fusion machinery. IIRC it works >> during scheduling so if we don’t run any schedulin

Re: [PATCH v2] AArch64: Block combine_and_move from creating FP literal loads

2024-11-12 Thread Wilco Dijkstra
Hi Richard, > The idea was that, if we did the split during expand, the movsf/df > define_insns would then only accept the immediates that their > constraints can handle. Right, always disallowing these immediates works fine too (it seems reload doesn't require all immediates to be valid), and th

[PATCH] AArch64: Cleanup fusion defines

2024-11-08 Thread Wilco Dijkstra
Cleanup the fusion defines by introducing AARCH64_FUSE_BASE as a common base level of fusion supported by almost all cores. Add AARCH64_FUSE_MOVK as a shortcut for all MOVK fusion. In most cases there is no change. It enables AARCH64_FUSE_CMP_BRANCH for a few older cores since it has no measura

[PATCH] AArch64: Remove duplicated addr_cost tables

2024-11-08 Thread Wilco Dijkstra
Remove duplicated addr_cost tables - use generic_armv9_a_addrcost_table for Armv9-a cores and generic_armv8_a_addrcost_table for recent Armv8-a cores. No changes in generated code. OK for commit? gcc/ChangeLog: * config/aarch64/tuning_models/cortexx925.h (cortexx925_addrcost_table): Re

Re: [PATCH] AArch64: Block combine_and_move from creating FP literal loads

2024-11-08 Thread Wilco Dijkstra
Hi Richard, > That's because, once an instruction matches, the instruction should > continue to match.  It should always be possible to set the INSN_CODE of > an existing instruction to -1, rerun recog, and get the same instruction > code back. > > Because of that, insn conditions shouldn't depend

Re: [PATCH] AArch64: Block combine_and_move from creating FP literal loads

2024-11-08 Thread Wilco Dijkstra
Hi Richard, > It's ok for instructions to require properties that are false during > early RTL passes and then transition to true.  But they can't require > properties that go from true to false, since that would mean that > existing instructions become unrecognisable at certain points during > th

[PATCH v2] AArch64: Switch off early scheduling

2024-11-01 Thread Wilco Dijkstra
v2: split off movsf/df pattern fixes, remove some guality xfails that now pass The early scheduler takes up ~33% of the total build time, however it doesn't provide a meaningful performance gain.  This is partly because modern OoO cores need far less scheduling, partly because the scheduler tends

[PATCH] AArch64: Block combine_and_move from creating FP literal loads

2024-11-01 Thread Wilco Dijkstra
The IRA combine_and_move pass runs if the scheduler is disabled and aggressively combines moves. The movsf/df patterns allow all FP immediates since they rely on a split pattern. However splits do not happen during IRA, so the result is extra literal loads. To avoid this, use a more accurate ch

Re: [PATCH] AArch64: Switch off early scheduling

2024-10-31 Thread Wilco Dijkstra
Hi Kyrill, > I think the approach that I’d like to try is using the TARGET_SCHED_DISPATCH > hooks like x86 does for bdver1-4. > That would try to exploit the dispatch constraints information in the SWOGs > rather than the instruction latency and throughput tables. > That would still require some

Re: [PATCH] AArch64: Switch off early scheduling

2024-10-31 Thread Wilco Dijkstra
Hi Andrew, > I suspect the following scheduling models could be removed due either > to hw never going to production or no longer being used by anyone: > thunderx3t110.md > falkor.md > saphira.md If you're planning to remove these, it would also be good to remove the falkor-tag-collision-avoidanc

[PATCH] AArch64: Switch off early scheduling

2024-10-31 Thread Wilco Dijkstra
The early scheduler takes up ~33% of the total build time, however it doesn't provide a meaningful performance gain. This is partly because modern OoO cores need far less scheduling, partly because the scheduler tends to create many unnecessary spills by increasing register pressure. Building ap

Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling agressiveness

2024-10-29 Thread Wilco Dijkstra
Hi Vineet, > I agree the NARROW/WIDE stuff is obfuscating things in technicalities. Is there evidence this change would make things significantly worse for some targets? I did a few runs on Neoverse V2 with various options and it looks beneficial both for integer and FP. On the example and option

[PATCH] AArch64: Add more accurate constraint [PR117292]

2024-10-25 Thread Wilco Dijkstra
As shown in the PR, reload may only check the constraint in some cases and not check the predicate is still valid for the resulting instruction. To fix the issue, add a new constraint which matches the predicate exactly. Passes regress & bootstrap, OK for commit? gcc/ChangeLog: PR ta

[PATCH] AArch64: Remove redundant check in aarch64_simd_mov

2024-10-17 Thread Wilco Dijkstra
The split condition in aarch64_simd_mov uses aarch64_simd_special_constant_p. While doing the split, it checks the mode before calling aarch64_maybe_generate_simd_constant. This is risky since it may result in unexpectedly calling aarch64_split_simd_move instead of aarch64_maybe_generate_simd_con

[PATCH v3] AArch64: Fix copysign patterns

2024-10-17 Thread Wilco Dijkstra
The current copysign pattern has a mismatch in the predicates and constraints - operand[2] is a register_operand but also has an alternative X which allows any operand. Since it is a floating point operation, having an integer alternative makes no sense. Change the expander to always use vector i

Re: [PATCH 3/3] AArch64: Add support for SIMD xor immediate

2024-10-15 Thread Wilco Dijkstra
Add support for SVE xor immediate when generating AdvSIMD code and SVE is available. Passes bootstrap & regress, OK for commit? gcc/ChangeLog: * config/aarch64/aarch64.cc (enum simd_immediate_check): Add AARCH64_CHECK_XOR. (aarch64_simd_valid_xor_imm): New function. (a

Re: [PATCH 2/2] AArch64: Improve SIMD immediate generation

2024-10-14 Thread Wilco Dijkstra
Allow use of SVE immediates when generating AdvSIMD code and SVE is available. First check for a valid AdvSIMD immediate, and if SVE is available, try using an SVE move or bitmask immediate. Passes bootstrap & regress, OK for commit? gcc/ChangeLog: * config/aarch64/aarch64-simd.md (ior3

[PATCH 1/2] AArch64: Improve SIMD immediate generation

2024-10-14 Thread Wilco Dijkstra
Cleanup the various interfaces related to SIMD immediate generation. Introduce new functions that make it clear which operation (AND, OR, MOV) we are testing for rather than guessing the final instruction. Reduce the use of overly long names, unused and default parameters for clarity. No cha

Re: [PATCH] aarch64: Fix bug with max/min (PR116934)

2024-10-04 Thread Wilco Dijkstra
Hi Saurabh, This looks good, one little nit: > gcc/ChangeLog: > >     * config/aarch64/iterators.md: Move UNSPEC_COND_SMAX and >     UNSPEC_COND_SMIN to correct iterators. This should also have the PR target/116934 before it - it's fine to change it when you commit. Speaking of which,

[PATCH v2] AArch64: Fix copysign patterns

2024-09-18 Thread Wilco Dijkstra
v2: Add more testcase fixes. The current copysign pattern has a mismatch in the predicates and constraints - operand[2] is a register_operand but also has an alternative X which allows any operand. Since it is a floating point operation, having an integer alternative makes no sense. Change the e

[PATCH] AArch64: Fix copysign patterns

2024-09-17 Thread Wilco Dijkstra
The current copysign pattern has a mismatch in the predicates and constraints - operand[2] is a register_operand but also has an alternative X which allows any operand. Since it is a floating point operation, having an integer alternative makes no sense. Change the expander to always use the vec
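
As background, copysign is at heart a bitwise select on the sign bit, which is why an integer alternative on the second operand is out of place. The sketch below shows that general formulation in scalar C, assuming IEEE 754 binary64 doubles. It is not what the patch emits, and `sketch_copysign` is a made-up name.

```c
#include <stdint.h>
#include <string.h>

double
sketch_copysign (double magnitude, double sign)
{
  const uint64_t sign_mask = UINT64_C (1) << 63;
  uint64_t m, s, r;
  double result;

  /* Reinterpret the doubles as raw bits without aliasing problems.  */
  memcpy (&m, &magnitude, sizeof m);
  memcpy (&s, &sign, sizeof s);

  /* Keep everything but the sign from MAGNITUDE, take the sign from SIGN.  */
  r = (m & ~sign_mask) | (s & sign_mask);

  memcpy (&result, &r, sizeof result);
  return result;
}
```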

Re: [PATCH v3] Arm: Fix ldrd offset range [PR115153]

2024-06-27 Thread Wilco Dijkstra
Hi Richard, > The Linaro CI is reporting an ICE while building libgfortran with this change. So it looks like Thumb-2 oddly enough restricts the negative range of DFmode even though that is unnecessary and inefficient. The easiest workaround turned out to be avoiding the checked adjust_address. Cheer

Re: [PATCH v3] Arm: Fix disassembly error in Thumb-1 relaxed load/store [PR115188]

2024-06-27 Thread Wilco Dijkstra
Hi Richard, > Doing just this will mean that the register allocator will have to undo a > pre/post memory operand that was accepted by the predicate (memory_operand).  > I think we really need a tighter predicate (lets call it noautoinc_mem_op) > here to avoid that.  Note that the existing uses

[BACKPORT] AArch64: Fix strict-align cpymem/setmem [PR103100]

2024-06-27 Thread Wilco Dijkstra
OK to backport to GCC13 (it applies cleanly and regress/bootstrap passes)? Cheers, Wilco On 29/11/2023 18:09, Richard Sandiford wrote: > Wilco Dijkstra writes: >> v2: Use UINTVAL, rename max_mops_size. >> >> The cpymemdi/setmemdi implementation doesn't fully support

[PATCH v2] Arm: Fix ldrd offset range [PR115153]

2024-06-11 Thread Wilco Dijkstra
v2: use a new arm_arch_v7ve_neon, fix use of DImode in output_move_neon The valid offset range of LDRD in arm_legitimate_index_p is increased to -1024..1020 if NEON is enabled since VALID_NEON_DREG_MODE includes DImode. Fix this by moving the LDRD check earlier. Passes bootstrap & regress, OK for

[PATCH v2] Arm: Fix disassembly error in Thumb-1 relaxed load/store [PR115188]

2024-06-11 Thread Wilco Dijkstra
Hi Christophe, >  PR target/115153 I guess this is a typo (should be 115188)? Correct. > +/* { dg-options "-O2 -mthumb" } */ -mthumb is included in arm_arch_v6m, so I > think you don't need to add it here? Indeed, it's not strictly necessary. Fixed in v2: A Thumb-1 memory operand allows

Re: [PATCH] AArch64: Fix cpu features initialization [PR115342]

2024-06-05 Thread Wilco Dijkstra
Hi Richard, >> Essentially anything covered by HWCAP doesn't need an explicit check. So I >> kept >> the LS64 and PREDRES checks since they don't have a HWCAP allocated (I'm not >> entirely convinced we need these, let alone having 3 individual bits for >> LS64, but >> that's something for the A

Re: [PATCH] AArch64: Fix cpu features initialization [PR115342]

2024-06-04 Thread Wilco Dijkstra
Hi Richard, I've reworded the commit message a bit: The CPU features initialization code uses CPUID registers (rather than HWCAP). The equality comparisons it uses are incorrect: for example FEAT_SVE is not set if SVE2 is available. Using HWCAPs for these is both simpler and correct. The initi
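
The equality pitfall mentioned above can be shown with a toy example. The field layout is hypothetical (an unsigned 4-bit field where 1 means the base feature and 2 a later extension) and is not taken from any real ID register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Extract an unsigned 4-bit field starting at bit SHIFT.  */
static unsigned
sketch_field (uint64_t reg, unsigned shift)
{
  return (unsigned) ((reg >> shift) & 0xf);
}

/* Buggy form: only matches the first version of the feature.  */
bool
sketch_feature_eq (uint64_t reg, unsigned shift)
{
  return sketch_field (reg, shift) == 1;
}

/* Safer form: any later version still implies the base feature.  */
bool
sketch_feature_ge (uint64_t reg, unsigned shift)
{
  return sketch_field (reg, shift) >= 1;
}
```

With a field value of 2 the equality test reports the base feature as missing, which is the same shape of bug as FEAT_SVE not being set when SVE2 is present. A HWCAP bit avoids the question entirely.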

[PATCH] AArch64: Fix cpu features initialization [PR115342]

2024-06-04 Thread Wilco Dijkstra
Fix CPU features initialization. Use HWCAP rather than explicit accesses to CPUID registers. Perform the initialization atomically to avoid multi-threading issues. Passes regress, OK for commit and backport? libgcc: PR target/115342 * config/aarch64/cpuinfo.c (__init_cpu_featu
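
One common way to make such an initialization atomic is to build the whole feature word in a local variable and publish it with a single store. The sketch below illustrates that idea with C11 atomics. It is an assumption about the general technique, not the libgcc code, and every name in it is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Top bit marks the word as initialized, so 0 means "not done yet".  */
#define SKETCH_INIT_BIT (UINT64_C (1) << 63)

static _Atomic uint64_t sketch_features;

/* Stand-in for the real probing code; returns a fixed made-up mask.  */
static uint64_t
sketch_probe (void)
{
  return UINT64_C (0x5);
}

uint64_t
sketch_get_features (void)
{
  uint64_t f = atomic_load_explicit (&sketch_features, memory_order_relaxed);
  if (f == 0)
    {
      /* Compute the complete value first, then publish it in one store.
         Readers see either 0 or the final word, never a partial one.  */
      f = sketch_probe () | SKETCH_INIT_BIT;
      atomic_store_explicit (&sketch_features, f, memory_order_relaxed);
    }
  return f;
}
```

Two threads racing through the slow path both store the same value, so no lock is needed as long as the probe is deterministic.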

[PATCH] Arm: Fix disassembly error in Thumb-1 relaxed load/store [PR115188]

2024-06-03 Thread Wilco Dijkstra
A Thumb-1 memory operand allows single-register LDMIA/STMIA. This doesn't get printed as LDR/STR with writeback in unified syntax, resulting in strange assembler errors if writeback is selected. To work around this, use the 'Uw' constraint that blocks writeback. Passes bootstrap & regress, OK for

[PATCH] Arm: Fix ldrd offset range [PR115153]

2024-06-03 Thread Wilco Dijkstra
The valid offset range of LDRD in arm_legitimate_index_p is increased to -1024..1020 if NEON is enabled since VALID_NEON_DREG_MODE includes DImode. Fix this by moving the LDRD check earlier. Passes bootstrap & regress, OK for commit? gcc: PR target/115153 * config/arm/arm.cc (arm

Re: [PATCH] AArch64: Add ACLE MOPS support

2024-05-31 Thread Wilco Dijkstra
Hi Richard, > I think this should be in a push_options/pop_options block, as for other > intrinsics that require certain features. But then the intrinsic would always be defined, which is contrary to what the ACLE spec demands - it would not give a compilation error at the callsite but give assem

[PATCH] AArch64: Add ACLE MOPS support

2024-05-31 Thread Wilco Dijkstra
Add __ARM_FEATURE_MOPS predefine. Add support for ACLE __arm_mops_memset_tag. Passes regress, OK for commit? gcc: * config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins): Add __ARM_FEATURE_MOPS predefine. * config/aarch64/arm_acle.h: Add __arm_mops_memset_tag(). gc
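
A small sketch of how user code might key off the new predefine. Only the macro name comes from the patch description; the function is hypothetical and deliberately does not call the intrinsic, which has extra requirements (memory tagging) that are out of scope here.

```c
/* Report at run time whether this translation unit was compiled with the
   MOPS feature macro defined, e.g. by a suitable -march option.  */
int
sketch_built_with_mops (void)
{
#ifdef __ARM_FEATURE_MOPS
  return 1;
#else
  return 0;
#endif
}
```

Code that wants to use __arm_mops_memset_tag would place the call under the same kind of guard, so that builds without the feature fall back to another path instead of failing to compile.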

[PATCH] testsuite: Improve check-function-bodies

2024-05-31 Thread Wilco Dijkstra
Improve check-function-bodies by allowing single-character function names. Also skip '#' comments which may be emitted from inline assembler. Passes regress, OK for commit? gcc/testsuite: * lib/scanasm.exp (configure_check-function-bodies): Allow single-char function names. Skip

[PATCH v3] aarch64: Fix normal returns inside functions which use eh_returns [PR114843]

2024-05-20 Thread Wilco Dijkstra
Hi Andrew, A few comments on the implementation, I think it can be simplified a lot: > +++ b/gcc/config/aarch64/aarch64.h > @@ -700,8 +700,9 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = > AARCH64_FL_SM_OFF; > #define DWARF2_UNWIND_INFO 1 > > /* Use R0 through R3 to pass exception handling

Re: [PATCH] AArch64: Improve costing of ctz

2024-05-15 Thread Wilco Dijkstra
Hi Andrew, > I should note popcount has a similar issue which I hope to fix next week. > Popcount cost is used during expand so it is very useful to be slightly more > correct. It's useful to set the cost so that all of the special cases still apply - even if popcount is relatively fast, it's s

[PATCH] AArch64: Improve costing of ctz

2024-05-15 Thread Wilco Dijkstra
Improve costing of ctz - both TARGET_CSSC and vector cases were not handled yet. Passes regress & bootstrap - OK for commit? gcc: * config/aarch64/aarch64.cc (aarch64_rtx_costs): Improve CTZ costing. --- diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index f

[PATCH] AArch64: Fix printing of 2-instruction alternatives

2024-05-15 Thread Wilco Dijkstra
Add missing '\' in 2-instruction movsi/di alternatives so that they are printed on separate lines. Passes bootstrap and regress, OK for commit once stage 1 reopens? gcc: * config/aarch64/aarch64.md (movsi_aarch64): Use '\;' to force newline in 2-instruction pattern. (movdi

[PATCH] AArch64: Use LDP/STP for large struct types

2024-05-15 Thread Wilco Dijkstra
Use LDP/STP for large struct types as they have useful immediate offsets and are typically faster. This removes differences between little and big endian and allows use of LDP/STP without UNSPEC. Passes regress and bootstrap, OK for commit? gcc: * config/aarch64/aarch64.cc (aarch64_clas

[PATCH] AArch64: Use UZP1 instead of INS

2024-05-15 Thread Wilco Dijkstra
Use UZP1 instead of INS when combining low and high halves of vectors. UZP1 has 3 operands which improves register allocation, and is faster on some microarchitectures. Passes regress & bootstrap, OK for commit? gcc: * config/aarch64/aarch64-simd.md (aarch64_combine_internal): Use

[PATCH] regalloc: Ignore '^' in early costing [PR114766]

2024-04-29 Thread Wilco Dijkstra
According to documentation, '^' should only have an effect during reload. However ira-costs.cc treats it in the same way as '?' during early costing. As a result using '^' can accidentally disable valid alternatives and cause significant regressions (see PR114741). Avoid this by ignoring '^' duri

[PATCH] libgcc: Add missing HWCAP entries to aarch64/cpuinfo.c

2024-04-02 Thread Wilco Dijkstra
A few HWCAP entries are missing from aarch64/cpuinfo.c. This results in build errors on older machines. This counts as a trivial build fix, but since it's late in stage 4 I'll let maintainers chip in. OK for commit? libgcc/ * config/aarch64/cpuinfo.c: Add HWCAP_EVTSTRM, HWCAP_CRC32, HWC

[PATCH] libatomic: Cleanup macros in atomic_16.S

2024-03-26 Thread Wilco Dijkstra
As mentioned in https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648397.html , do some additional cleanup of the macros and aliases: Cleanup the macros to add the libat_ prefixes in atomic_16.S. Emit the alias to __atomic_ when ifuncs are not enabled in the ENTRY macro. Passes regress and

Re: [PATCH] libatomic: Fix build for --disable-gnu-indirect-function [PR113986]

2024-03-26 Thread Wilco Dijkstra
Hi Richard, > This description is too brief for me.  Could you say in detail how the > new scheme works?  E.g. the description doesn't explain: > > -if ARCH_AARCH64_HAVE_LSE128 > -AM_CPPFLAGS   = -DHAVE_FEAT_LSE128 > -endif That is not needed because we can include auto-config.h in atomic_16.

[COMMITTED] ARM: Fix builtin-bswap-1.c test [PR113915]

2024-03-08 Thread Wilco Dijkstra
On Thumb-2 the use of CBZ blocks conditional execution, so change the test to compare with a non-zero value. gcc/testsuite/ChangeLog: PR target/113915 * gcc.target/arm/builtin-bswap.x: Fix test to avoid emitting CBZ. --- diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap.x
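
To illustrate the kind of change described, here is a hypothetical test function, not the contents of builtin-bswap.x. It guards the byte swap with a comparison against a non-zero constant, so a compare-with-zero branch such as CBZ is not the natural lowering and a conditionally executed sequence stays possible.

```c
#include <stdint.h>

uint32_t
sketch_swap_unless_one (uint32_t a, uint32_t x)
{
  /* Compare with 1 rather than 0: CBZ/CBNZ only test against zero.  */
  if (a != 1)
    x = __builtin_bswap32 (x);
  return x;
}
```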

Re: [PATCH] ARM: Fix conditional execution [PR113915]

2024-02-26 Thread Wilco Dijkstra
Hi Richard, > Did you test this on a thumb1 target?  It seems to me that the target parts > that you've > removed were likely related to that.  In fact, I don't see why this test > would need to be changed at all. The testcase explicitly forces a Thumb-2 target (arm_arch_v6t2). The patterns wer

[PATCH] libatomic: Fix build for --disable-gnu-indirect-function [PR113986]

2024-02-23 Thread Wilco Dijkstra
Fix libatomic build to support --disable-gnu-indirect-function on AArch64. Always build atomic_16.S and add aliases to the __atomic_* functions if !HAVE_IFUNC. Passes regress and bootstrap, OK for commit? libatomic: PR target/113986 * Makefile.in: Regenerated. * Makefile.

Re: [PATCH] ARM: Fix conditional execution [PR113915]

2024-02-23 Thread Wilco Dijkstra
Hi Richard, > This bit isn't.  The correct fix here is to fix the pattern(s) concerned to > add the missing predicate. > > Note that builtin-bswap.x explicitly mentions predicated mnemonics in the > comments. I fixed the patterns in v2. There are likely some more, plus we could likely merge ma

Re: [PATCH] AArch64: memcpy/memset expansions should not emit LDP/STP [PR113618]

2024-02-22 Thread Wilco Dijkstra
Hi Richard, > It looks like this is really doing two things at once: disabling the > direct emission of LDP/STP Qs, and switching the GPR handling from using > pairs of DImode moves to single TImode moves.  At least, that seems to be > the effect of... No it still uses TImode for the !TARGET_SIMD

[PATCH] ARM: Fix conditional execution [PR113915]

2024-02-21 Thread Wilco Dijkstra
By default most patterns can be conditionalized on Arm targets. However Thumb-2 predication requires the "predicable" attribute be explicitly set to "yes". Most patterns are shared between Arm and Thumb(-2) and are marked with "predicable". Given this sharing, it does not make sense to use a di

[PATCH] AArch64: memcpy/memset expansions should not emit LDP/STP [PR113618]

2024-02-01 Thread Wilco Dijkstra
The new RTL introduced for LDP/STP results in regressions due to use of UNSPEC. Given the new LDP fusion pass is good at finding LDP opportunities, change the memcpy, memmove and memset expansions to emit single vector loads/stores. This fixes the regression and enables more RTL optimization on th

Re: [PATCH v4] AArch64: Cleanup memset expansion

2024-01-30 Thread Wilco Dijkstra
Hi Richard, >> That tune is only used by an obsolete core. I ran the memcpy and memset >> benchmarks from Optimized Routines on xgene-1 with and without LDP/STP. >> There is no measurable penalty for using LDP/STP. I'm not sure why it was >> ever added given it does not do anything useful. I'll po

[PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS

2024-01-30 Thread Wilco Dijkstra
(follow-on based on review comments on https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641913.html) Remove the tune AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS since it is only used by an old core and doesn't properly support -Os. SPECINT_2017 shows that removing it has no performance difference

Re: [PATCH] AArch64: Add -mcpu=cobalt-100

2024-01-25 Thread Wilco Dijkstra
Hi, >> Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer >> ID). >> >> Passes regress, OK for commit? > > Ok. Also OK to backport to GCC 13, 12 and 11? Cheers, Wilco

[PATCH] AArch64: Add -mcpu=cobalt-100

2024-01-16 Thread Wilco Dijkstra
Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer ID). Passes regress, OK for commit? gcc/ChangeLog: * config/aarch64/aarch64-cores.def (AARCH64_CORE): Add 'cobalt-100' CPU. * config/aarch64/aarch64-tune.md: Regenerated. * doc/invoke.texi (-mcpu):

Re: [PATCH] AArch64: Reassociate CONST in address expressions [PR112573]

2024-01-16 Thread Wilco Dijkstra
Hi Richard, >> +  rtx base = strip_offset_and_salt (XEXP (x, 1), &offset); > > This should be just strip_offset, so that we don't lose the salt > during optimisation. Fixed. > + > +  if (offset.is_constant ()) > I'm not sure this is really required.  Logically the same thing > would app

[PATCH] AArch64: Reassociate CONST in address expressions [PR112573]

2024-01-10 Thread Wilco Dijkstra
GCC tends to optimistically create CONST of globals with an immediate offset. However it is almost always better to CSE addresses of globals and add immediate offsets separately (the offset could be merged later in single-use cases). Splitting CONST expressions with an index in aarch64_legitimize_
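
To illustrate the kind of source this affects, here is a small C sketch. The array `table` and both function names are made up for illustration and do not come from the patch. Each function addresses an element of a global at a variable index plus a constant offset; the address can be grouped either as (table + 400) + i*4 or as (table + i*4) + 400. The description above argues for keeping the plain symbol address shareable and adding the immediate separately, which corresponds to the second grouping. The two functions only spell the grouping differently and return the same value; what a given GCC version actually emits for them is not shown here.

```c
#include <stddef.h>

/* Hypothetical global used only for this illustration. */
int table[1024];

/* Grouping 1: the constant is folded into the symbol first
   (symbol + constant), and the index is applied afterwards. */
int load_const_first(size_t i)
{
    int *p = &table[100];   /* symbol + constant offset */
    return p[i];
}

/* Grouping 2: the index is applied to the plain symbol address,
   and the constant offset is added last. */
int load_const_last(size_t i)
{
    int *p = &table[i];     /* symbol + scaled index */
    return p[100];
}
```

Both functions read `table[i + 100]`, so any difference between them is purely a matter of how the address arithmetic is associated.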

Re: [PATCH v4] AArch64: Cleanup memset expansion

2024-01-09 Thread Wilco Dijkstra
Hi Richard, >> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96) > > Since this isn't (AFAIK) a standard macro, there doesn't seem to be > any need to put it in the header file.  It could just go at the head > of aarch64.cc instead. Sure, I've moved it in v4. >> +  if (len <= 24 || (aarch64_tune_p

Re: [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a

2024-01-08 Thread Wilco Dijkstra
Hi Richard, >> Benchmarking showed that LSE and LSE2 RMW atomics have similar performance >> once >> the atomic is acquire, release or both. Given there is already a significant >> overhead due >> to the function call, PLT indirection and argument setup, it doesn't make >> sense to add >> extra

Re: [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a

2024-01-08 Thread Wilco Dijkstra
Hi, >> Is there no benefit to using SWPPL for RELEASE here?  Similarly for the >> others. > > We started off implementing all possible memory orderings available. > Wilco saw value in merging less restricted orderings into more > restricted ones - mainly to reduce codesize in less frequently use

Re: [PATCH v3] AArch64: Cleanup memset expansion

2023-12-22 Thread Wilco Dijkstra
v3: rebased to latest trunk Cleanup memset implementation. Similar to memcpy/memmove, use an offset and bytes throughout. Simplify the complex calculations when optimizing for size by using a fixed limit. Passes regress & bootstrap. gcc/ChangeLog: * config/aarch64/aarch64.h (MAX_SET_SI
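
A minimal sketch of the fixed-limit idea, not the actual aarch64.cc code. The `MAX_SET_SIZE` values (256 bytes when optimizing for speed, 96 when optimizing for size) are the ones quoted in the v4 review of this patch; the helper `memset_expands_inline` is hypothetical and only shows how a single fixed limit can replace a more involved size calculation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Limit as quoted in the v4 review: 256 bytes for speed, 96 for size.
   The argument is parenthesized here for macro hygiene. */
#define MAX_SET_SIZE(speed) ((speed) ? 256 : 96)

/* Hypothetical helper: returns true if a constant-length memset of
   LEN bytes falls within the fixed limit and so would be a candidate
   for inline expansion rather than a library call. */
bool memset_expands_inline(size_t len, bool speed)
{
    return len <= (size_t) MAX_SET_SIZE(speed);
}
```

Under these assumptions a 96-byte memset is within the limit in both modes, while a 200-byte memset is within it only when optimizing for speed.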

Re: [PATCH v2] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-12-04 Thread Wilco Dijkstra
Hi Richard, >> Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible >> with >> existing binaries, gives better performance than locking atomics and is what >> most users expect. > > Please add a justification for why it's backwards compatible, rather > than just stating that

Re: [PATCH v3] AArch64: Add inline memmove expansion

2023-12-01 Thread Wilco Dijkstra
Hi Richard, > +  rtx load[max_ops], store[max_ops]; > > Please either add a comment explaining why 40 is guaranteed to be > enough, or (my preference) use: > >  auto_vec, ...> ops; I've changed to using auto_vec since that should help reduce conflicts with Alex' LDP changes. I double-checked maxi

Re: [PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-11-30 Thread Wilco Dijkstra
Hi Richard, Thanks for the review, now committed. > The new aarch64_split_compare_and_swap code looks a bit twisty. > The approach in lse.S seems more obvious.  But I'm guessing you > didn't want to spend any time restructuring the pre-LSE > -mno-outline-atomics code, and I agree the patch in its
