Hi Richard,
> Could you give details? I thought it was always known that trapped
> system register accesses were slow. In the previous versions, the
> checks seemed to be presented as an up-front price worth paying for
> faster atomic operations, on the systems that would use those paths.
> Now
Hi Richard,
> That was also what I was trying to say. In the worst case, the linked
> object has to meet the requirements of the lowest common denominator.
>
> And my supposition was that that isn't a property of static vs dynamic.
But it is. Dynamic linking supports mixing different code models
Hi Richard,
>> Basically the small and large model are fundamentally incompatible. The infamous
>> "dumb linker" approach means it doesn't try to sort sections, so an ADRP relocation
>> will be out of reach if its data is placed after a huge array. Static linking with GLIBC or
>> enabl
Hi Ramana,
> -Generate code for the large code model. This makes no assumptions about
> -addresses and sizes of sections. Programs can be statically linked only. The
> +Generate code for the large code model. This allows large .bss and .data
> +sections, however .text and .rodata must still
Hi Kyrill,
> This restriction should be documented in invoke.texi IMO.
> I also think it would be more user friendly to warn them about the
> incompatibility if an explicit -moutline-atomics option is passed.
> It’s okay though to silently turn off the implicit default-on option though.
I've upd
Hi Richard&Kyrill,
>> I’m in favour of this.
>
> Yeah, seems ok to me too. I suppose we ought to update the documentation too:
I've added a note to the documentation. However it is impossible to be complete here
since many targets switch off early scheduling under various circumstances. So I'v
Feedback from the kernel team suggests that it's best to only use HWCAPs
rather than also using low-level checks as done by has_lse128() and has_rcpc3().
So change these to just use HWCAPs, which simplifies the code and speeds up
ifunc selection by avoiding expensive system register accesses.
Passes
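For illustration, a minimal sketch of the idea (not the actual libatomic host-config.h code; the helper name and the HWCAP2_LSE128 macro are assumptions here):

#include <sys/auxv.h>      /* getauxval */
#include <asm/hwcap.h>     /* Linux HWCAP/HWCAP2 bits */

/* Sketch: decide the LSE128 ifunc variant from HWCAPs alone, with no
   MRS-based probing of ID registers.  */
static inline int
has_lse128_hwcap_only (void)
{
#ifdef HWCAP2_LSE128
  return (getauxval (AT_HWCAP2) & HWCAP2_LSE128) != 0;
#else
  return 0;   /* older headers: cannot detect the feature this way */
#endif
}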
Change AArch64 cpuinfo to follow the latest updates to the FMV spec [1]:
Remove FEAT_PREDRES and FEAT_LS64*. Preserve the ordering in enum CPUFeatures.
Passes regress, OK for commit?
[1] https://github.com/ARM-software/acle/pull/382
gcc:
* common/config/aarch64/cpuinfo.h: Remove FEAT_PR
Enable the early scheduler on AArch64 for O3/Ofast. This means GCC15 benefits
from much faster build times with -O2, but avoids the regressions in lbm which
is very sensitive to minor scheduling changes due to long FMA chains. We can
then revisit this for GCC16.
gcc:
PR target/118351
Outline atomics is not designed to be used with -mcmodel=large, so disable
it automatically if the large code model is used.
Passes regress, OK for commit?
gcc:
PR target/112465
* config/aarch64/aarch64.cc (aarch64_override_options_after_change_1):
Turn off outline atomic
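A hedged sketch of the shape of such a fix (the function name comes from the ChangeLog above; the exact guard is an assumption, not the committed patch):

/* In aarch64_override_options_after_change_1: outline atomics is not
   designed to be used with -mcmodel=large, so drop the implicit
   default.  An explicitly requested -moutline-atomics is left alone so
   it can be diagnosed separately, per the review discussion above.  */
if (aarch64_cmodel == AARCH64_CMODEL_LARGE
    && !OPTION_SET_P (aarch64_flag_outline_atomics))
  aarch64_flag_outline_atomics = 0;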
Hi Richard,
> Sorry to be awkward, but I don't think we should put
> AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT in base.
> CHEAP_SHIFT_EXTEND is a good base flag because it means we can make full
> use of a certain group of instructions. FULLY_PIPELINED_FMA similarly
> means that FMA chains beh
Hi Richard,
>> + if (TARGET_ILP32)
>> + warning (OPT_Wdeprecated, "%<-mabi=ilp32%> is deprecated.");
>
> There should be no "." at the end of the message.
Right, fixed in v2 below.
> Otherwise it looks good to me, although like Kyrill says, it'll also
> need a release note.
I've added one,
As suggested in
https://gcc.gnu.org/pipermail/gcc-patches/2025-January/673558.html
update the gcc-15 Changes page:
Add ILP32 deprecation to Caveats section.
---
diff --git a/htdocs/gcc-15/changes.html b/htdocs/gcc-15/changes.html
index 1c690c4a168f4d6297ad33dd5b798e9200792dc5..d5037efb34cc8e6
Hi all,
> In that case, I'm coming round to the idea of deprecating ILP32.
> I think it was already common ground that the GNU/Linux support is dead.
> watchOS would use Mach objects rather than ELF. As you say, it isn't
> clear how much of the current ILP32 support would be relevant for it.
> An
Hi Richard,
> It looks like you committed the original version instead, with no extra
> explanation. I suppose I should have asked for another review round
> instead.
Did you check the commit log?
Change the AARCH64_EXPAND_ALIGNMENT macro into proper function calls to make
future change
Hi Richard,
> Yeah, somewhat. But won't we go on to test has_lse2 anyway, due to:
>
> # elif defined (LSE2_LRCPC3_ATOP)
> # define IFUNC_NCOND(N) 2
> # define IFUNC_COND_1 (has_rcpc3 (hwcap, features))
> # define IFUNC_COND_2 (has_lse2 (hwcap, features))
>
> If we want to reduce the
Hi Andrew,
> Personally I would like this deprecated even for bare-metal. Yes the
> iwatch ABI is an ILP32 ABI but I don't see GCC implementing that any
> time soon and I suspect it would not be hard to resurrect the code at
> that point.
My patch deprecates it in all cases currently. It will be
Hi Richard,
>> + /* LSE2 is a prerequisite for atomic LDIAPP/STILP. */
>> + if (!(hwcap & HWCAP_USCAT))
>> return false;
>
> Is there a reason for not using has_lse2 here? It'd be good to have
> a comment if so.
Yes, the MRS instructions cause expensive traps, so we try to avoid them whe
Hi Kyrill,
>> Add AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS and AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>> to the baseline tuning since all modern cores use it. Fix the neoverse512tvb tuning to be
>> like Neoverse V1/V2.
>
> For neoversev512tvb this means adding AARCH64_EXTRA_TUNE_AVOI
ping
Add AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS and AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
to the baseline tuning since all modern cores use it. Fix the neoverse512tvb tuning to be
like Neoverse V1/V2.
gcc/ChangeLog:
* config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TU
ping
Add FULLY_PIPELINED_FMA to tune baseline - this is a generic feature that is
already enabled for some cores, but benchmarking it shows it is faster on all
modern cores (SPECFP improves ~0.17% on Neoverse V1 and 0.04% on Neoverse N1).
Passes regress & bootstrap, OK for commit?
gcc/ChangeLo
ping
Simplify and cleanup ifunc selection logic. Since LRCPC3 does
not imply LSE2, has_rcpc3() should also check LSE2 is enabled.
Passes regress and bootstrap, OK for commit?
libatomic:
* config/linux/aarch64/host-config.h (has_lse2): Cleanup.
(has_lse128): Likewise.
(
ILP32 was originally intended to make porting to AArch64 easier. Support was
never merged in the Linux kernel or GLIBC, so it has been unsupported for many
years. There isn't a benefit in keeping unsupported features forever, so
deprecate it now (and it could be removed in a future release).
Pa
As a minor cleanup, remove the Cortex-A57 FMA steering pass. Since Cortex-A57 is
pretty old, there isn't any benefit in keeping this.
Passes regress & bootstrap, OK for commit?
gcc:
* config.gcc (extra_objs): Remove cortex-a57-fma-steering.o.
* config/aarch64/aarch64-passes.def: Remo
Hi Richard,
> The patch below is what I meant. It passes bootstrap & regression-test
> on aarch64-linux-gnu (and so produces the same results for the tests
> that you changed). Do you see any problems with this version?
> If not, I think we should go with it.
Thanks for the detailed example - u
Hi Richard,
>> A common case is a constant string which is compared against some
>> argument. Most string functions work on 8 or 16-byte quantities. If we
>> ensure the whole array fits in one aligned load, we save time in the
>> string function.
>>
>> Runtime data collected for strlen calls shows
Hi Richard,
> So just to be sure I understand: we still want to align (say) an array
> of 4 chars to 32 bits so that the LDR & STR are aligned, and an array of
> 3 chars to 32 bits so that the LDRH & STRH for the leading two bytes are
> aligned? Is that right? We don't seem to take advantage of
The register indexed variants of LDRD have complex register overlap constraints
which make them hard to use without using output_move_double (which can't be
used for atomics as it doesn't guarantee to emit atomic LDRD/STRD when
required).
Add a new predicate and constraint for plain LDRD/STRD wi
Change the AARCH64_EXPAND_ALIGNMENT macro into proper function calls to make
future changes easier. Use the existing alignment settings, however avoid
overaligning small arrays or structs to 64 bits when there is no benefit.
This gives a small reduction in data and stack size.
Passes regress & b
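As a hedged illustration of the intent described above (the exact thresholds and cut-offs live in the patch, not here):

/* Before, a small object like this could be bumped to 64-bit alignment
   by the expand-alignment macro even though nothing benefits; with the
   change it keeps its natural alignment, saving a little data and
   stack space.  */
static char small_tag[3] = "ab";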
Simplify and cleanup ifunc selection logic. Since LRCPC3 does
not imply LSE2, has_rcpc3() should also check LSE2 is enabled.
Passes regress and bootstrap, OK for commit?
libatomic:
* config/linux/aarch64/host-config.h (has_lse2): Cleanup.
(has_lse128): Likewise.
(has_rcp
Hi Kyrill,
> This would make USE_NEW_VECTOR_COSTS effectively the default.
> Jennifer has been trying to do that as well and then to remove it (as it
> would be always true) but there are some codegen regressions that still
> need to be addressed.
Yes, that's the goal - we should use good tun
Add FULLY_PIPELINED_FMA to tune baseline - this is a generic feature that is
already enabled for some cores, but benchmarking it shows it is faster on all
modern cores (SPECFP improves ~0.17% on Neoverse V1 and 0.04% on Neoverse N1).
Passes regress & bootstrap, OK for commit?
gcc/ChangeLog:
Cleanup the extra tune defines by introducing AARCH64_EXTRA_TUNE_BASE as a
common base supported by all modern cores. Initially set it to
AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND. No change in generated code.
Passes regress & bootstrap, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarc
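Taken from the description above, the new define presumably looks something like this (a sketch, not the committed hunk):

/* Common tuning bits assumed valid for all modern cores; later patches
   in the series move further flags (e.g. FULLY_PIPELINED_FMA) into it.  */
#define AARCH64_EXTRA_TUNE_BASE AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND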
Add AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS and AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
to the baseline tuning since all modern cores use it. Fix the neoverse512tvb tuning to be
like Neoverse V1/V2.
gcc/ChangeLog:
* config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNE_BASE
Hi Richard,
> ...I still think we should avoid testing can_create_pseudo_p.
> Does it work with the last part replaced by:
>
> if (!DECIMAL_FLOAT_MODE_P (mode))
> {
> if (aarch64_can_const_movi_rtx_p (src, mode)
> || aarch64_float_const_representable_p (src)
> || aarch64
Hi,
>>> What do you think about disabling late scheduling as well?
>>
>> I think this would definitely need separate consideration and evaluation
>> given the above.
>>
>> Another thing to consider is the macro fusion machinery. IIRC it works
>> during scheduling so if we don’t run any schedulin
Hi Richard,
> The idea was that, if we did the split during expand, the movsf/df
> define_insns would then only accept the immediates that their
> constraints can handle.
Right, always disallowing these immediates works fine too (it seems
reload doesn't require all immediates to be valid), and th
Cleanup the fusion defines by introducing AARCH64_FUSE_BASE as a common base
level of fusion supported by almost all cores. Add AARCH64_FUSE_MOVK as a
shortcut for all MOVK fusion. In most cases there is no change. It enables
AARCH64_FUSE_CMP_BRANCH for a few older cores since it has no measura
Remove duplicated addr_cost tables - use generic_armv9_a_addrcost_table for
Armv9-a cores and generic_armv8_a_addrcost_table for recent Armv8-a cores.
No changes in generated code.
OK for commit?
gcc/ChangeLog:
* config/aarch64/tuning_models/cortexx925.h
(cortexx925_addrcost_table): Re
Hi Richard,
> That's because, once an instruction matches, the instruction should
> continue to match. It should always be possible to set the INSN_CODE of
> an existing instruction to -1, rerun recog, and get the same instruction
> code back.
>
> Because of that, insn conditions shouldn't depend
Hi Richard,
> It's ok for instructions to require properties that are false during
> early RTL passes and then transition to true. But they can't require
> properties that go from true to false, since that would mean that
> existing instructions become unrecognisable at certain points during
> th
v2: split off movsf/df pattern fixes, remove some guality xfails that now pass
The early scheduler takes up ~33% of the total build time, however it doesn't
provide a meaningful performance gain. This is partly because modern OoO cores
need far less scheduling, partly because the scheduler tends
The IRA combine_and_move pass runs if the scheduler is disabled and aggressively
combines moves. The movsf/df patterns allow all FP immediates since they rely
on a split pattern. However splits do not happen during IRA, so the result is
extra literal loads. To avoid this, use a more accurate ch
Hi Kyrill,
> I think the approach that I’d like to try is using the TARGET_SCHED_DISPATCH
> hooks like x86 does for bdver1-4.
> That would try to exploit the dispatch constraints information in the SWOGs
> rather than the instruction latency and throughput tables.
> That would still require some
Hi Andrew,
> I suspect the following scheduling models could be removed due either
> to hw never going to production or no longer being used by anyone:
> thunderx3t110.md
> falkor.md
> saphira.md
If you're planning to remove these, it would also be good to remove the
falkor-tag-collision-avoidanc
The early scheduler takes up ~33% of the total build time, however it doesn't
provide a meaningful performance gain. This is partly because modern OoO cores
need far less scheduling, partly because the scheduler tends to create many
unnecessary spills by increasing register pressure. Building ap
Hi Vineet,
> I agree the NARROW/WIDE stuff is obfuscating things in technicalities.
Is there evidence this change would make things significantly worse for
some targets? I did a few runs on Neoverse V2 with various options and
it looks beneficial both for integer and FP. On the example and option
As shown in the PR, reload may only check the constraint in some cases and
not check that the predicate is still valid for the resulting instruction.
To fix the issue, add a new constraint which matches the predicate exactly.
Passes regress & bootstrap, OK for commit?
gcc/ChangeLog:
PR ta
The split condition in aarch64_simd_mov uses aarch64_simd_special_constant_p. While
doing the split, it checks the mode before calling aarch64_maybe_generate_simd_constant.
This is risky since it may result in unexpectedly calling aarch64_split_simd_move instead
of aarch64_maybe_generate_simd_con
The current copysign pattern has a mismatch in the predicates and constraints -
operand[2] is a register_operand but also has an alternative X which allows any
operand. Since it is a floating point operation, having an integer alternative
makes no sense. Change the expander to always use vector i
Add support for SVE xor immediate when generating AdvSIMD code and SVE is
available.
Passes bootstrap & regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64.cc (enum simd_immediate_check): Add
AARCH64_CHECK_XOR.
(aarch64_simd_valid_xor_imm): New function.
(a
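A hedged example of the kind of code this targets (GNU vector extensions used for brevity; the assumption is that AdvSIMD has no EOR-immediate form while SVE does):

typedef unsigned int v4si __attribute__ ((vector_size (16)));

v4si
toggle (v4si x)
{
  /* 0x0f0f0f0f is a valid SVE bitmask immediate, so with +sve this can
     become a single EOR (immediate) on the Z-register view of the
     vector instead of materialising the constant separately.  */
  return x ^ 0x0f0f0f0f;
}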
Allow use of SVE immediates when generating AdvSIMD code and SVE is available.
First check for a valid AdvSIMD immediate, and if SVE is available, try using
an SVE move or bitmask immediate.
Passes bootstrap & regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (ior3
Cleanup the various interfaces related to SIMD immediate generation. Introduce new functions
that make it clear which operation (AND, OR, MOV) we are testing for rather than guessing the
final instruction. Reduce the use of overly long names, unused and default parameters for
clarity. No cha
Hi Saurabh,
This looks good, one little nit:
> gcc/ChangeLog:
>
> * config/aarch64/iterators.md: Move UNSPEC_COND_SMAX and
> UNSPEC_COND_SMIN to correct iterators.
This should also have the PR target/116934 before it - it's fine to change it
when you commit.
Speaking of which,
v2: Add more testcase fixes.
The current copysign pattern has a mismatch in the predicates and constraints -
operand[2] is a register_operand but also has an alternative X which allows any
operand. Since it is a floating point operation, having an integer alternative
makes no sense. Change the e
The current copysign pattern has a mismatch in the predicates and constraints -
operand[2] is a register_operand but also has an alternative X which allows any
operand. Since it is a floating point operation, having an integer alternative
makes no sense. Change the expander to always use the vec
Hi Richard,
> The Linaro CI is reporting an ICE while building libgfortran with this change.
So it looks like Thumb-2 oddly enough restricts the negative range of DFmode
even though that is unnecessary and inefficient. The easiest workaround turned
out to be to avoid using the checked adjust_address.
Cheer
Hi Richard,
> Doing just this will mean that the register allocator will have to undo a
> pre/post memory operand that was accepted by the predicate (memory_operand).
> I think we really need a tighter predicate (lets call it noautoinc_mem_op)
> here to avoid that. Note that the existing uses
OK to backport to GCC13 (it applies cleanly and regress/bootstrap passes)?
Cheers,
Wilco
On 29/11/2023 18:09, Richard Sandiford wrote:
> Wilco Dijkstra writes:
>> v2: Use UINTVAL, rename max_mops_size.
>>
>> The cpymemdi/setmemdi implementation doesn't fully support
v2: use a new arm_arch_v7ve_neon, fix use of DImode in output_move_neon
The valid offset range of LDRD in arm_legitimate_index_p is increased to
-1024..1020 if NEON is enabled since VALID_NEON_DREG_MODE includes DImode.
Fix this by moving the LDRD check earlier.
Passes bootstrap & regress, OK for
Hi Christophe,
> PR target/115153
I guess this is a typo (should be 115188)?
Correct.
> +/* { dg-options "-O2 -mthumb" } */
> -mthumb is included in arm_arch_v6m, so I think you don't need to add it here?
Indeed, it's not strictly necessary. Fixed in v2:
A Thumb-1 memory operand allows
Hi Richard,
>> Essentially anything covered by HWCAP doesn't need an explicit check. So I
>> kept
>> the LS64 and PREDRES checks since they don't have a HWCAP allocated (I'm not
>> entirely convinced we need these, let alone having 3 individual bits for
>> LS64, but
>> that's something for the A
Hi Richard,
I've reworded the commit message a bit:
The CPU features initialization code uses CPUID registers (rather than
HWCAP). The equality comparisons it uses are incorrect: for example FEAT_SVE
is not set if SVE2 is available. Using HWCAPs for these is both simpler and
correct. The initi
Fix CPU features initialization. Use HWCAP rather than explicit accesses
to CPUID registers. Perform the initialization atomically to avoid multi-
threading issues.
Passes regress, OK for commit and backport?
libgcc:
PR target/115342
* config/aarch64/cpuinfo.c (__init_cpu_featu
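A minimal sketch of the approach, assuming field names, bit positions and helper names (not the actual libgcc code):

#include <sys/auxv.h>
#include <asm/hwcap.h>   /* HWCAP_ATOMICS etc. */

/* Compute the feature word locally from HWCAPs, then publish it with a
   single atomic store so concurrent first callers never observe a
   partly initialised value.  */
static void
init_cpu_features_sketch (unsigned long long *features)
{
  unsigned long hwcap = getauxval (AT_HWCAP);
  unsigned long long f = 1;   /* "initialised" marker bit, assumed */

  if (hwcap & HWCAP_ATOMICS)
    f |= 1ULL << 2;           /* e.g. FEAT_LSE; bit position assumed */
  /* ... remaining HWCAP-derived bits ... */

  __atomic_store_n (features, f, __ATOMIC_RELAXED);
}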
A Thumb-1 memory operand allows single-register LDMIA/STMIA. This doesn't get
printed as LDR/STR with writeback in unified syntax, resulting in strange
assembler errors if writeback is selected. To work around this, use the 'Uw'
constraint that blocks writeback.
Passes bootstrap & regress, OK for
The valid offset range of LDRD in arm_legitimate_index_p is increased to
-1024..1020 if NEON is enabled since VALID_NEON_DREG_MODE includes DImode.
Fix this by moving the LDRD check earlier.
Passes bootstrap & regress, OK for commit?
gcc:
PR target/115153
* config/arm/arm.cc (arm
Hi Richard,
> I think this should be in a push_options/pop_options block, as for other
> intrinsics that require certain features.
But then the intrinsic would always be defined, which is contrary to what the
ACLE spec demands - it would not give a compilation error at the callsite
but give assem
Add __ARM_FEATURE_MOPS predefine. Add support for ACLE __arm_mops_memset_tag.
Passes regress, OK for commit?
gcc:
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
Add __ARM_FEATURE_MOPS predefine.
* config/aarch64/arm_acle.h: Add __arm_mops_memset_tag().
gc
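For reference, a hedged usage example of the new predefine and intrinsic (the additional guard on __ARM_FEATURE_MEMORY_TAGGING follows the ACLE requirement for the tag-setting variant, as I understand it):

#include <stddef.h>
#include <string.h>
#if defined (__ARM_FEATURE_MOPS) && defined (__ARM_FEATURE_MEMORY_TAGGING)
#include <arm_acle.h>
#endif

void *
zero_block (void *p, size_t n)
{
#if defined (__ARM_FEATURE_MOPS) && defined (__ARM_FEATURE_MEMORY_TAGGING)
  /* Set both memory contents and MTE allocation tags using the MOPS
     SETG* sequence.  */
  return __arm_mops_memset_tag (p, 0, n);
#else
  return memset (p, 0, n);   /* fallback: contents only */
#endif
}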
Improve check-function-bodies by allowing single-character function names.
Also skip '#' comments which may be emitted from inline assembler.
Passes regress, OK for commit?
gcc/testsuite:
* lib/scanasm.exp (configure_check-function-bodies): Allow single-char
function names. Skip
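An illustrative (assumed, not from the patch) AArch64 test shape that exercises both changes, a one-character function name and a '#' comment emitted by inline asm:

/* { dg-do compile } */
/* { dg-options "-O2" } */
/* { dg-final { check-function-bodies "**" "" } } */

/*
** f:
**	...
**	ret
*/
int
f (void)
{
  asm ("# just a comment in the output");
  return 0;
}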
Hi Andrew,
A few comments on the implementation, I think it can be simplified a lot:
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -700,8 +700,9 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE =
> AARCH64_FL_SM_OFF;
> #define DWARF2_UNWIND_INFO 1
>
> /* Use R0 through R3 to pass exception handling
Hi Andrew,
> I should note popcount has a similar issue which I hope to fix next week.
> Popcount cost is used during expand so it is very useful to be slightly more
> correct.
It's useful to set the cost so that all of the special cases still apply - even
if popcount is relatively fast, it's s
Improve costing of ctz - both TARGET_CSSC and vector cases were not handled yet.
Passes regress & bootstrap - OK for commit?
gcc:
* config/aarch64/aarch64.cc (aarch64_rtx_costs): Improve CTZ costing.
---
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index
f
Add missing '\' in 2-instruction movsi/di alternatives so that they are
printed on separate lines.
Passes bootstrap and regress, OK for commit once stage 1 reopens?
gcc:
* config/aarch64/aarch64.md (movsi_aarch64): Use '\;' to force
newline in 2-instruction pattern.
(movdi
Use LDP/STP for large struct types as they have useful immediate offsets and
are typically faster.
This removes differences between little and big endian and allows use of
LDP/STP without UNSPEC.
Passes regress and bootstrap, OK for commit?
gcc:
* config/aarch64/aarch64.cc (aarch64_clas
Use UZP1 instead of INS when combining low and high halves of vectors.
UZP1 has 3 operands which improves register allocation, and is faster on
some microarchitectures.
Passes regress & bootstrap, OK for commit?
gcc:
* config/aarch64/aarch64-simd.md (aarch64_combine_internal):
Use
According to documentation, '^' should only have an effect during reload.
However ira-costs.cc treats it in the same way as '?' during early costing.
As a result using '^' can accidentally disable valid alternatives and cause
significant regressions (see PR114741). Avoid this by ignoring '^' duri
A few HWCAP entries are missing from aarch64/cpuinfo.c. This results in build
errors on older machines.
This counts as a trivial build fix, but since it's late in stage 4 I'll let
maintainers chip in.
OK for commit?
libgcc/
* config/aarch64/cpuinfo.c: Add HWCAP_EVTSTRM, HWCAP_CRC32,
HWC
As mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648397.html ,
do some additional cleanup of the macros and aliases:
Cleanup the macros to add the libat_ prefixes in atomic_16.S. Emit the
alias to __atomic_ when ifuncs are not enabled in the ENTRY macro.
Passes regress and
Hi Richard,
> This description is too brief for me. Could you say in detail how the
> new scheme works? E.g. the description doesn't explain:
>
> -if ARCH_AARCH64_HAVE_LSE128
> -AM_CPPFLAGS = -DHAVE_FEAT_LSE128
> -endif
That is not needed because we can include auto-config.h in atomic_16.
On Thumb-2 the use of CBZ blocks conditional execution, so change the
test to compare with a non-zero value.
gcc/testsuite/ChangeLog:
PR target/113915
* gcc.target/arm/builtin-bswap.x: Fix test to avoid emitting CBZ.
---
diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap.x
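A hedged sketch of the kind of change meant here (the real test is builtin-bswap.x; the values below are an assumed shape):

extern void g (void);

void
f (unsigned int x)
{
  /* Comparing against a non-zero constant forces CMP plus a conditional
     branch, which Thumb-2 can place in an IT block, instead of CBZ,
     which cannot be conditionally executed.  */
  if (__builtin_bswap32 (x) != 1)
    g ();
}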
Hi Richard,
> Did you test this on a thumb1 target? It seems to me that the target parts
> that you've
> removed were likely related to that. In fact, I don't see why this test
> would need to be changed at all.
The testcase explicitly forces a Thumb-2 target (arm_arch_v6t2). The patterns
wer
Fix libatomic build to support --disable-gnu-indirect-function on AArch64.
Always build atomic_16.S and add aliases to the __atomic_* functions if
!HAVE_IFUNC.
Passes regress and bootstrap, OK for commit?
libatomic:
PR target/113986
* Makefile.in: Regenerated.
* Makefile.
Hi Richard,
> This bit isn't. The correct fix here is to fix the pattern(s) concerned to
> add the missing predicate.
>
> Note that builtin-bswap.x explicitly mentions predicated mnemonics in the
> comments.
I fixed the patterns in v2. There are likely some more, plus we could likely
merge ma
Hi Richard,
> It looks like this is really doing two things at once: disabling the
> direct emission of LDP/STP Qs, and switching the GPR handling from using
> pairs of DImode moves to single TImode moves. At least, that seems to be
> the effect of...
No it still uses TImode for the !TARGET_SIMD
By default most patterns can be conditionalized on Arm targets. However
Thumb-2 predication requires the "predicable" attribute be explicitly
set to "yes". Most patterns are shared between Arm and Thumb(-2) and are
marked with "predicable". Given this sharing, it does not make sense to
use a di
The new RTL introduced for LDP/STP results in regressions due to use of UNSPEC.
Given the new LDP fusion pass is good at finding LDP opportunities, change the
memcpy, memmove and memset expansions to emit single vector loads/stores.
This fixes the regression and enables more RTL optimization on th
Hi Richard,
>> That tune is only used by an obsolete core. I ran the memcpy and memset
>> benchmarks from Optimized Routines on xgene-1 with and without LDP/STP.
>> There is no measurable penalty for using LDP/STP. I'm not sure why it was
>> ever added given it does not do anything useful. I'll po
(follow-on based on review comments on
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641913.html)
Remove the tune AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS since it is only
used by an old core and doesn't properly support -Os. SPECINT_2017
shows that removing it has no performance difference
Hi,
>> Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer
>> ID).
>>
>> Passes regress, OK for commit?
>
> Ok.
Also OK to backport to GCC 13, 12 and 11?
Cheers,
Wilco
Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer ID).
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64-cores.def (AARCH64_CORE): Add 'cobalt-100' CPU.
* config/aarch64/aarch64-tune.md: Regenerated.
* doc/invoke.texi (-mcpu):
Hi Richard,
>> + rtx base = strip_offset_and_salt (XEXP (x, 1), &offset);
>
> This should be just strip_offset, so that we don't lose the salt
> during optimisation.
Fixed.
> +
> + if (offset.is_constant ())
> I'm not sure this is really required. Logically the same thing
> would app
GCC tends to optimistically create CONST of globals with an immediate offset.
However it is almost always better to CSE addresses of globals and add immediate
offsets separately (the offset could be merged later in single-use cases).
Splitting CONST expressions with an index in aarch64_legitimize_
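An illustrative (assumed) example of the pattern in question:

extern int table[1024];

int
sum_two (void)
{
  /* Preferred codegen keeps one ADRP/ADD anchor for "table" (which can
     be CSEd) and folds the +12 / +40 byte offsets into the loads,
     rather than forming two independent CONST(table+offset) addresses.  */
  return table[3] + table[10];
}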
Hi Richard,
>> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
>
> Since this isn't (AFAIK) a standard macro, there doesn't seem to be
> any need to put it in the header file. It could just go at the head
> of aarch64.cc instead.
Sure, I've moved it in v4.
>> + if (len <= 24 || (aarch64_tune_p
Hi Richard,
>> Benchmarking showed that LSE and LSE2 RMW atomics have similar performance once
>> the atomic is acquire, release or both. Given there is already a significant overhead due
>> to the function call, PLT indirection and argument setup, it doesn't make sense to add
>> extra
Hi,
>> Is there no benefit to using SWPPL for RELEASE here? Similarly for the
>> others.
>
> We started off implementing all possible memory orderings available.
> Wilco saw value in merging less restricted orderings into more
> restricted ones - mainly to reduce codesize in less frequently use
v3: rebased to latest trunk
Cleanup memset implementation. Similar to memcpy/memmove, use an offset and
bytes throughout. Simplify the complex calculations when optimizing for size
by using a fixed limit.
Passes regress & bootstrap.
gcc/ChangeLog:
* config/aarch64/aarch64.h (MAX_SET_SI
Hi Richard,
>> Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
>> existing binaries, gives better performance than locking atomics and is what
>> most users expect.
>
> Please add a justification for why it's backwards compatible, rather
> than just stating that
Hi Richard,
> + rtx load[max_ops], store[max_ops];
>
> Please either add a comment explaining why 40 is guaranteed to be
> enough, or (my preference) use:
>
> auto_vec, ...> ops;
I've changed to using auto_vec since that should help reduce conflicts
with Alex' LDP changes. I double-checked maxi
Hi Richard,
Thanks for the review, now committed.
> The new aarch64_split_compare_and_swap code looks a bit twisty.
> The approach in lse.S seems more obvious. But I'm guessing you
> didn't want to spend any time restructuring the pre-LSE
> -mno-outline-atomics code, and I agree the patch in its