Hi Ramana,
>> I used --target=arm-none-linux-gnueabihf --host=arm-none-linux-gnueabihf
>> --build=arm-none-linux-gnueabihf --with-float=hard. However it seems that the
>> default armhf settings are incorrect. I shouldn't need the --with-float=hard
>> since
>> that is obviously implied by armhf, a
v2: further cleanups, improved comments
Add support for inline memmove expansions. The generated code is identical
to that for memcpy, except that all loads are emitted before stores rather than
being interleaved. The maximum size is 256 bytes, which requires at most 16
registers.
Passes regress/boot
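As a rough, hedged illustration of the loads-before-stores strategy described above (a C-level sketch of my own, not the actual RTL expansion, which works on SIMD/integer registers directly):

/* Sketch of a 32-byte inline memmove expansion: perform all loads into
   temporaries before any store, so an overlap between src and dst cannot
   clobber data that still has to be read.  The real expansion handles
   sizes up to 256 bytes using at most 16 registers.  */
#include <stdint.h>
#include <string.h>

static inline void
inline_memmove_32 (void *dst, const void *src)
{
  uint64_t t0, t1, t2, t3;
  memcpy (&t0, (const char *) src, 8);        /* loads first ...  */
  memcpy (&t1, (const char *) src + 8, 8);
  memcpy (&t2, (const char *) src + 16, 8);
  memcpy (&t3, (const char *) src + 24, 8);
  memcpy ((char *) dst, &t0, 8);              /* ... then stores  */
  memcpy ((char *) dst + 8, &t1, 8);
  memcpy ((char *) dst + 16, &t2, 8);
  memcpy ((char *) dst + 24, &t3, 8);
}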
ping
v2: Use UINTVAL, rename max_mops_size.
The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
With STRICT_ALIGNMENT, block the expansion if the alignment is less than 16.
Clean up the condition for when to use MOPS.
Passes regress/bootstrap, OK for commit?
gcc/Cha
ping
From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64
[PR110061]
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries
ping
From: Wilco Dijkstra
Sent: 04 August 2023 16:05
To: GCC Patches ; Richard Sandiford
Cc: Kyrylo Tkachov
Subject: [PATCH] libatomic: Improve ifunc selection on AArch64
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use
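A hedged sketch of the kind of CPUID-based selection described above (the helper name and the exact selection logic are my own assumptions, not the libatomic code):

/* On Linux, user-space MRS reads of MIDR_EL1 are trapped and emulated by
   the kernel when HWCAP_CPUID is present, so an ifunc resolver can detect
   e.g. Neoverse N1 (implementer 0x41, part number 0xd0c) and pick a
   variant that relies on its atomic 128-bit LDP/STP.  */
#include <stdbool.h>
#include <sys/auxv.h>

#ifndef HWCAP_CPUID
#define HWCAP_CPUID (1UL << 11)
#endif

static bool
is_neoverse_n1 (void)
{
  if (!(getauxval (AT_HWCAP) & HWCAP_CPUID))
    return false;
  unsigned long midr;
  __asm__ ("mrs %0, midr_el1" : "=r" (midr));
  return ((midr >> 24) & 0xff) == 0x41        /* implementer: Arm      */
         && ((midr >> 4) & 0xfff) == 0xd0c;   /* part: Neoverse N1     */
}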
ping
__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails. In thi
Hi Ramana,
> I remember this to be the previous discussions and common understanding.
>
> https://gcc.gnu.org/legacy-ml/gcc/2016-06/msg00017.html
>
> and here
>
> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-02/msg00168.html
>
> Can you point any discussion recently that shows this has changed
Further improve immediate generation by adding support for 2-instruction
MOV/EOR bitmask immediates. This reduces the number of 3/4-instruction
immediates in SPECCPU2017 by ~2%.
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
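For illustration, a hedged example of my own (not taken from the patch) of a constant that is not a single bitmask immediate and would otherwise need a 4-instruction MOVZ/MOVK sequence, but is the EOR of two bitmask immediates:

/* 0xaa55aa55aa55aa55 == 0xaaaaaaaaaaaaaaaa ^ 0x00ff00ff00ff00ff, and both
   operands are valid AArch64 logical (bitmask) immediates, so with the
   2-instruction MOV/EOR support it could be generated as:
       mov  x0, 0xaaaaaaaaaaaaaaaa
       eor  x0, x0, 0x00ff00ff00ff00ff
   rather than one movz plus three movk instructions.  */
unsigned long long
example_constant (void)
{
  return 0xaa55aa55aa55aa55ULL;
}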
Clean up the memset implementation. Similar to memcpy/memmove, use an offset and
bytes throughout. Simplify the complex calculations when optimizing for size
by using a fixed limit.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_progress_poin
Hi Steve,
> This patch checks for SIMD functions and saves the extra registers when
> needed. It does not change the caller behavour, so with just this patch
> there may be values saved by both the caller and callee. This is not
> efficient, but it is correct code.
I tried a few simple test cas
Steve Ellcey wrote:
> Yes, I see where I missed this in aarch64_push_regs
> and aarch64_pop_regs. I think that is why the second of
> Wilco's two examples (f2) is wrong. I am unclear about
> exactly what is meant by writeback and why we have it and
> how that and callee_adjust are used. Any cha
Hi Umesh,
Looking at your patch, this would break all results which need to be normalized.
Index: libgcc/config/arm/ieee754-df.S
===
--- libgcc/config/arm/ieee754-df.S (revision 262850)
+++ libgcc/config/arm/ieee754-df.S (
Umesh Kalappa wrote:
> We tried some of the normalisation numbers and the fix works and please
> could you help us with the input ,where if you see that fix breaks down.
Well, try any set of inputs which require normalisation. You'll find these no
longer get normalised and so will get incorrect r
Umesh Kalappa wrote:
> We tested on the SP and yes the problem persist on the SP too and
> attached patch will fix the both SP and DP issues for the denormal
> resultant.
The patch now looks correct to me (but I can't approve).
> We bootstrapped the compiler, look ok to us with minimal testing
Steve Ellcey wrote:
> OK, I think I understand this a bit better now. I think my main
> problem is with the term 'writeback' which I am not used to seeing.
> But if I understand things correctly we are saving one or two registers
> and (possibly) updating the stack pointer using auto-increment/a
Hi Nicolas,
I think your patch doesn't quite work as expected:
@@ -238,9 +238,10 @@ LSYM(Lad_a):
movs ip, ip, lsl #1
adcs xl, xl, xl
adc xh, xh, xh
- tst xh, #0x0010
- sub r4, r4, #1
- bne LSYM(Lad_e)
+ subs r4, r4, #1
+
Nicolas Pitre wrote:
>> However if r4 is non-zero, the carry will be set, and the tsths will be
>> executed. This
>> clears the carry and sets the Z flag based on bit 20.
>
> No, not at all. The carry is not affected. And that's the point of the
> tst instruction here rather than a cmp: it sets
v2: Use check-function-bodies in tests
Further improve immediate generation by adding support for 2-instruction
MOV/EOR bitmask immediates. This reduces the number of 3/4-instruction
immediates in SPECCPU2017 by ~2%.
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64
ping
v2: Use UINTVAL, rename max_mops_size.
The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
With STRICT_ALIGNMENT, block the expansion if the alignment is less than 16.
Clean up the condition for when to use MOPS.
Passes regress/bootstrap, OK for commit?
gcc/Ch
ping
v2: further cleanups, improved comments
Add support for inline memmove expansions. The generated code is identical
to that for memcpy, except that all loads are emitted before stores rather than
being interleaved. The maximum size is 256 bytes, which requires at most 16
registers.
Passes regre
ping
Clean up the memset implementation. Similar to memcpy/memmove, use an offset and
bytes throughout. Simplify the complex calculations when optimizing for size
by using a fixed limit.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_progre
ping
__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails. In
ping
From: Wilco Dijkstra
Sent: 04 August 2023 16:05
To: GCC Patches ; Richard Sandiford
Cc: Kyrylo Tkachov
Subject: [PATCH] libatomic: Improve ifunc selection on AArch64
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use
ping
From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64
[PR110061]
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries
, regress pass, OK for commit?
ChangeLog:
2020-09-11 Wilco Dijkstra
* config/aarch64/aarch64.c (neoversen1_tunings):
Enable AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND.
---
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index
ommit?
ChangeLog:
2020-09-03 Wilco Dijkstra
* config.gcc (aarch64*-*-*): Simplify --with-cpu and --with-arch
processing. Add support for architectural extensions.
* config/aarch64/aarch64.h (TARGET_CPU_DEFAULT): Remove
AARCH64_CPU_DEFAULT_FLAGS.
* config/aa
e, so explicitly
allow that.
Co-authored-by: Delia Burduv
Bootstrap OK, regress pass, OK to commit?
ChangeLog
2020-09-03 Wilco Dijkstra
* config.gcc
(aarch64*-*-*): Add --with-tune. Support --with-cpu=native.
* config/aarch64/aarch64.h (OPTION_DEFAULT_SPECS): Add -
Hi Richard,
>On 14/09/2020 15:19, Wilco Dijkstra wrote:
>> The --with-cpu/--with-arch configure option processing not only checks valid
>> arguments
>> but also sets TARGET_CPU_DEFAULT with a CPU and extension bitmask. This
>> isn't used
>> however since a
-fcommon. It is about
time to change the default.
OK for commit?
ChangeLog
2019-10-25 Wilco Dijkstra
PR85678
* common.opt (fcommon): Change init to 1.
doc/
* invoke.texi (-fcommon): Update documentation.
---
diff --git a/gcc/common.opt b/gcc/common.opt
index
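As a hedged reminder of what the default change means for users (example mine, not from the patch):

/* file1.c */
int counter;                    /* tentative definition */

/* file2.c */
int counter;                    /* another tentative definition */
int main (void) { return counter; }

/* With -fcommon these two files link fine because the linker merges the
   tentative definitions; with the new -fno-common default the link fails
   with a multiple-definition error.  The fix is to define the variable in
   one file and declare it "extern" in the others.  */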
Hi Jeff,
> Has this been bootstrapped and regression tested?
Yes, it bootstraps OK of course. I ran regression over the weekend; there
are a few minor regressions in lto due to relying on tentative definitions
and a few latent bugs. I'd expect there will be a few similar failures on
other targets
Hi,
>> I suppose targets can override this decision.
> I think they probably could via the override_options mechanism.
Yes, it's trivial to add this to target_option_override():
if (!global_options_set.x_flag_no_common)
flag_no_common = 0;
Cheers,
Wilco
Hi Iain,
> for the record, Darwin bootstraps OK with the change (which is to be
> expected,
> since the preferred setting for it is -fno-common).
That's good to hear.
> Testsuite fails are order “a few hundred” mostly seem to be related to
> tree-prof
> and vector tests (plus the anticipated
to C code only, C++ code is not affected by -fcommon. It is about
time to change the default.
Bootstrap OK, passes testsuite on AArch64. OK for commit?
ChangeLog
2019-10-29 Wilco Dijkstra
PR85678
* common.opt (fcommon): Change init to 1.
doc/
* invoke.texi (-fcommon
Hi Richard,
> Please don't add -fcommon in lto.exp.
So what is the best way to add an extra option to lto.exp?
Note dg-lto-options completely overrides the options from lto.exp, so I can't
use that except in tests which already use it.
Cheers,
Wilco
Hi Richard,
>> > Please don't add -fcommon in lto.exp.
>>
>> So what is the best way to add an extra option to lto.exp?
>> Note dg-lto-options completely overrides the options from lto.exp, so I can't
>> use that except in tests which already use it.
>
> On what testcases do you need it at all?
T
by -fcommon. It is about
time to change the default.
Passes bootstrap and regress on AArch64 and x64. OK for commit?
ChangeLog
2019-11-05 Wilco Dijkstra
PR85678
* common.opt (fcommon): Change init to 1.
doc/
* invoke.texi (-fcommon): Update documentation.
testsuite/
ating point
code is generally beneficial (more registers and higher latencies), only enable
the pressure scheduler with -Ofast.
On Cortex-A57 this gives a 0.7% performance gain on SPECINT2006 as well
as a 0.2% codesize reduction.
Bootstrapped on armhf. OK for commit?
ChangeLog:
2019-11-06
BLOCK. Also use the CPU tuning setting when a CPU/tune
is selected if -mrestrict-it is not explicitly set.
On Cortex-A57 this gives 1.1% performance gain on SPECINT2006 as well
as a 0.4% codesize reduction.
Bootstrapped on armhf. OK for commit?
ChangeLog:
2019-08-19 Wilco Dijkstra
18, 6, 11, 5, 10, 9
};
return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}
Is optimized to:
rbit w0, w0
clz w0, w0
and w0, w0, 31
ret
Bootstrapped on AArch64. OK for commit?
ChangeLog:
2019-11-12 Wilco Dijkstra
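For reference, a hedged reconstruction of the classic de Bruijn ctz idiom the truncated snippet above comes from (the table tail shown matches the well-known sequence for the multiplier 0x077CB531; the exact source code is an assumption):

static inline int
ctz_debruijn (unsigned int x)
{
  static const int table[32] =
  {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
  };
  /* (x & -x) isolates the lowest set bit; multiplying by the de Bruijn
     constant and taking the top 5 bits gives a unique table index.  */
  return table[((unsigned) ((x & -x) * 0x077CB531U)) >> 27];
}

With the patch, GCC recognizes this idiom and emits the rbit/clz sequence shown above.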
Hi Segher,
> Out of interest, what uses this? I have never seen it before.
It's used in sjeng in SPEC and gives a 2% speedup on Cortex-A57.
Tricks like this used to be very common 20 years ago since a loop or binary
search
is way too slow and few CPUs supported fast clz/ctz instructions. It's o
Hi Jakub,
On Sat, Jan 11, 2020 at 05:30:52PM +0100, Jakub Jelinek wrote:
> On Sat, Jan 11, 2020 at 05:24:19PM +0100, Andreas Schwab wrote:
> > ../../gcc/tree-ssa-forwprop.c: In function 'bool
> > simplify_count_trailing_zeroes(gimple_stmt_iterator*)':
> > ../../gcc/tree-ssa-forwprop.c:1925:23: er
returns 0 or 1. Add extra test cases.
(note the diff uses the old tree and includes Jakub's bootstrap fixes)
Bootstrap OK on AArch64 and x64.
ChangeLog:
2020-01-13 Wilco Dijkstra
PR tree-optimization/93231
* tree-ssa-forwprop.c
(optimize_count_trailing_zeroes)
on negative shift
counts or multiply constants. Check the type is a char type for the
string constant case to avoid accidentally matching a wide STRING_CST.
Add a tree_expr_nonzero_p check to allow the optimization even if
CTZ_DEFINED_VALUE_AT_ZERO returns 0 or 1. Add extra test cases.
Bootstrap OK on
this fixes the failure you were getting?
ChangeLog:
2020-01-16 Wilco Dijkstra
PR target/92692
* config/aarch64/aarch64.c (aarch64_split_compare_and_swap)
Add assert to ensure prolog has been emitted.
(aarch64_split_atomic_op): Likewise.
* config/aarch64
ping
Testing shows the setting of 32:16 for jump alignment has a significant codesize
cost; however, it doesn't make a difference in performance. So set jump-align
to 4 to get a 1.6% codesize improvement.
OK for commit?
ChangeLog
2019-12-24 Wilco Dijkstra
* config/aarch64/aarc
ping
Enable the most basic form of compare-branch fusion since various CPUs
support it. This has no measurable effect on cores which don't support
branch fusion, but increases fusion opportunities on cores which do.
Bootstrapped on AArch64, OK for commit?
ChangeLog:
2019-12-24 Wilco Dij
uling floating point
code is generally beneficial (more registers and higher latencies), only enable
the pressure scheduler with -Ofast.
On Cortex-A57 this gives a 0.7% performance gain on SPECINT2006 as well
as a 0.2% codesize reduction.
Bootstrapped on armhf. OK for commit?
ChangeLog:
2019-11-06
Hi Richard,
> If you're able to say for the record which cores you tested, then that'd
> be good.
I've mostly checked it on Cortex-A57 - if there is any affect, it would be on
older cores.
> OK, thanks. I agree there doesn't seem to be an obvious reason why this
> would pessimise any cores sign
Hi Kyrill & Richard,
> I was leaving this to others in case it was obvious to them. On the
> basis that silence suggests it wasn't, :-) could you go into more details?
> Is it expected on first principles that jump alignment doesn't matter
> for Neoverse N1, or is this purely based on experimenta
Hi Kewen,
Would it not make more sense to use the TARGET_ADDRESS_COST hook
to return different costs for immediate offset and register offset addressing,
and ensure IVOpts correctly takes this into account?
On AArch64 we've defined different costs for immediate offset, register offset,
register o
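A hedged sketch of the suggestion (illustrative only; the real aarch64 hook uses per-CPU cost tables and aarch64_classify_address):

/* A TARGET_ADDRESS_COST implementation that makes base+register-offset
   addressing look slightly more expensive than base+immediate-offset, so
   passes such as IVOpts prefer the cheaper form.  */
static int
example_address_cost (rtx x, machine_mode mode ATTRIBUTE_UNUSED,
                      addr_space_t as ATTRIBUTE_UNUSED,
                      bool speed ATTRIBUTE_UNUSED)
{
  if (GET_CODE (x) == PLUS)
    {
      if (CONST_INT_P (XEXP (x, 1)))
        return 0;       /* base + immediate offset: cheapest form.  */
      if (REG_P (XEXP (x, 1)))
        return 1;       /* base + register offset: slightly costlier.  */
    }
  return 0;             /* plain register, pre/post-increment, etc.  */
}

#undef TARGET_ADDRESS_COST
#define TARGET_ADDRESS_COST example_address_cost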
r3, r2, r3
add r0, r0, r3
bx lr
Bootstrap OK, OK for commit?
ChangeLog:
2019-09-11 Wilco Dijkstra
* config/arm/arm.h (SLOW_BYTE_ACCESS): Set to 1.
--
diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index
e07cf03538c5bb23e3285859b9e44a6
code hoisting for -O3 and higher.
OK for commit?
ChangeLog:
2019-11-26 Wilco Dijkstra
PR tree-optimization/80155
* common/config/arm/arm-common.c (arm_option_optimization_table):
Disable -fcode-hoisting with -O3.
--
diff --git a/gcc/common/config/arm/arm-common.c
b/gcc/c
Hi Segher,
> On Thu, Jan 16, 2020 at 12:50:14PM +0000, Wilco Dijkstra wrote:
>> The separate shrinkwrapping pass may insert stores in the middle
>> of atomics loops which can cause issues on some implementations.
>> Avoid this by delaying splitting of atomic patterns until a
expansion is now:
fmov s0, w0
cnt v0.8b, v0.8b
addv b0, v0.8b
fmov w0, s0
Bootstrap OK, passes regress.
ChangeLog
2020-02-02 Wilco Dijkstra
gcc/
* config/aarch64/aarch64.md (popcount2): Improve expansion.
* config/aarch64/aarch64-simd.md
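A hedged source-level example that should exercise the improved expansion (what GCC actually emits depends on target options):

/* On AArch64, __builtin_popcount is expected to expand to the Advanced
   SIMD fmov/cnt/addv/fmov sequence shown above rather than a library
   call.  */
int
count_bits (unsigned int x)
{
  return __builtin_popcount (x);
}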
Wilco Dijkstra
* config/aarch64/aarch64.md (clz2): Mask the clz result.
(clrsb2): Likewise.
(ctz2): Likewise.
--
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index
5edc76ee14b55b2b4323530e10bd22b3ffca483e
Hi Andrew,
> You might want to add a testcase that the autovectorizers too.
>
> Currently we get also:
>
> ldr q0, [x0]
> addv b0, v0.16b
> umov w0, v0.b[0]
> ret
My patch doesn't change this case on purpose - there are also many intrinsics
which generate re
r3, r2, r3
add r0, r0, r3
bx lr
Bootstrap OK, OK for commit?
ChangeLog:
2019-09-11 Wilco Dijkstra
* config/arm/arm.h (SLOW_BYTE_ACCESS): Set to 1.
--
diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index
e07cf03538c5bb23e3285859b9e44a6
uling floating point
code is generally beneficial (more registers and higher latencies), only enable
the pressure scheduler with -Ofast.
On Cortex-A57 this gives a 0.7% performance gain on SPECINT2006 as well
as a 0.2% codesize reduction.
Bootstrapped on armhf. OK for commit?
ChangeLog:
2019-11-06
s have max_cond_insns set to 5 for historical reasons.
Benchmarking shows that max_cond_insns=2 is fastest on modern Cortex-A
cores, so change it to 2. Set it to 4 on older in-order cores as that is
the MAX_INSN_PER_IT_BLOCK limit for Thumb-2.
Bootstrapped on armhf. OK for commit?
ChangeLo
Any further comments? Note GCC doesn't support S/UMULLS either since it is
equally useless. It's no surprise that Thumb-2 removed support for flag-setting
64-bit multiplies, while AArch64 didn't add flag-setting multiplies. So there is
no argument that these instructions are in any way useful to
range of clz/ctz/cls results,
Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary
sign extension. Avoid this by adding an explicit AND with 127 in the
patterns. Deepsjeng performance improves by ~0.6%.
Bootstrap OK.
ChangeLog:
2020-02-04 Wilco Dijkstra
PR rtl-o
Hi Modi,
Thanks for your patch!
> Adding support for extv and extzv on aarch64 as described in
> PR86901. I also changed
> extract_bit_field_using_extv to use gen_lowpart_if_possible instead of
> gen_lowpart directly. Using
> gen_lowpart directly will fail with an ICE in building libgcc when t
Hi,
Richard wrote:
> However, inside the compiler we really want to represent this as a
>shift.
...
> Ideally this would be handled inside the mid-end expansion of an
> extract, but in the absence of that I think this is best done inside the
> extv expansion so that we never end up with a real
mance improves by ~0.6%.
Bootstrap OK.
ChangeLog:
2020-02-12 Wilco Dijkstra
PR rtl-optimization/93565
* config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs.
* gcc.target/aarch64/pr93565.c: New test.
--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aa
Hi Richard,
See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565#c8 - the problem is
more generic, as I suspected, and it's easy to create similar examples. So while
this turned out to be an easy workaround for ctz, the general case is harder
to avoid since you still want to allow beneficial
Hi Andrew,
> Yes I agree a better cost model for CTZ/CLZ is the right solution but
> I disagree with 2 ALU instruction as the cost. It should either be
> the same cost as a multiply or have its own cost entry.
> For an example on OcteonTX (and ThunderX1), the cost of CLS/CLZ is 4
> cycles, the sa
Hi Jeff,
>> I've noticed quite significant package failures caused by the revision.
>> Would you please consider documenting this change in porting_to.html
>> (and in changes.html) for GCC 10 release?
>
> I'm not in the office right now, but figured I'd chime in. I'd estimate
> 400-500 packages a
Hi,
Add entries for the default change in changes.html and porting_to.html.
Passes the W3 validator.
Cheers,
Wilco
---
diff --git a/htdocs/gcc-10/changes.html b/htdocs/gcc-10/changes.html
index
e02966460450b7aad884b2d45190b9ecd8c7a5d8..304e1e8ccd38795104156e86b92062696fa5aa8b
100644
--- a/htd
Hi,
I have updated the documentation patch here and added relevant maintainers
so hopefully this can go in soon:
https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00311.html
I moved the paragraph in changes.html to the C section like you suggested. Would
it make sense to link to the porting_to entry
Hi Christophe,
> This patch (r278968) is causing regressions when building GCC
> --target arm-none-linux-gnueabihf
> --with-mode thumb
> --with-cpu cortex-a57
> --with-fpu crypto-neon-fp-armv8
> because the assembler (gas version 2.33.1) complains:
> /ccc7z5eW.s:4267: IT blocks containing more tha
Hi Christophe,
I've added an option to allow the warning to be enabled/disabled:
https://sourceware.org/ml/binutils/2019-12/msg00093.html
Cheers,
Wilco
Hi Christophe,
> In practice, how do you activate it when running the GCC testsuite? Do
> you plan to send a GCC patch to enable this assembler flag, or do you
> locally enable that option by default in your binutils?
The warning is off by default so there is no need to do anything in the
testsu
Hi Christophe,
>> The warning is off by default so there is no need to do anything in the
>> testsuite,
>> you just need a fixed binutils.
>>
>
> Don't we want to fix GCC to stop generating the offending sequence?
Why? All ARMv8 implementations have to support it, and despite the warning
code a
d)((x & -x) * 0x077CB531U)) >> 27];
}
Is optimized to:
rbit w0, w0
clz w0, w0
and w0, w0, 31
ret
Bootstrapped on AArch64. OK for commit?
ChangeLog:
2019-12-11 Wilco Dijkstra
PR tree-optimization/90838
* tree-ssa-forwprop.c
ortex-A65AE to
cortexa53.
Bootstrap OK, OK for commit?
ChangeLog:
2019-12-11 Wilco Dijkstra
* config/aarch64/aarch64-cores.def: Update settings for
cortex-a76ae, cortex-a77, cortex-a65, cortex-a65ae, neoverse-e1,
cortex-a76.cortex-a55.
--
diff --git a/gcc/config/aarch64/aa
's the same as for
Cortex-A65. Set the scheduler for Cortex-A65 and Cortex-A65AE to
cortexa53.
Bootstrap OK, OK for commit?
ChangeLog:
2019-12-17 Wilco Dijkstra
* config/aarch64/aarch64-cores.def:
("cortex-a76ae"): Use neoversen1 tuning.
("cortex-a77")
Hi,
>> I've noticed that your patch caused a regression:
>> FAIL: gcc.dg/tree-prof/pr77698.c scan-rtl-dump-times alignments
>> "internal loop alignment added" 1
I've created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93007
Cheers,
Wilco
Enable the most basic form of compare-branch fusion since various CPUs
support it. This has no measurable effect on cores which don't support
branch fusion, but increases fusion opportunities on cores which do.
Bootstrapped on AArch64, OK for commit?
ChangeLog:
2019-12-24 Wilco Dij
Testing shows the setting of 32:16 for jump alignment has a significant codesize
cost; however, it doesn't make a difference in performance. So set jump-align
to 4 to get a 1.6% codesize improvement.
OK for commit?
ChangeLog
2019-12-24 Wilco Dijkstra
* config/aarch64/aarc
Hi,
>On 1/6/20 7:10 AM, Jonathan Wakely wrote:
>> GCC now defaults to -fno-common. As a result, global
>> variable accesses are more efficient on various targets. In C, global
>> variables with multiple tentative definitions will result in linker
>> errors.
>
> This is better. I'd also s/will/n
On Thumb-2 the use of CBZ blocks conditional execution, so change the
test to compare with a non-zero value.
gcc/testsuite/ChangeLog:
PR target/113915
* gcc.target/arm/builtin-bswap.x: Fix test to avoid emitting CBZ.
---
diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap.x
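A hedged illustration of the idea (my own example, not the builtin-bswap.x source): comparing against zero lets the compiler branch with CBZ, which cannot be predicated in an IT block, while a non-zero comparison keeps the bswap conditionally executable:

/* With "x != 0" the compiler may use CBZ and a branch around the rev;
   comparing against a non-zero value forces CMP plus conditional
   execution of the rev inside an IT block.  */
unsigned int
swap_if (int x, unsigned int y)
{
  if (x != 5)
    y = __builtin_bswap32 (y);
  return y;
}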
Hi Richard,
> This description is too brief for me. Could you say in detail how the
> new scheme works? E.g. the description doesn't explain:
>
> -if ARCH_AARCH64_HAVE_LSE128
> -AM_CPPFLAGS = -DHAVE_FEAT_LSE128
> -endif
That is not needed because we can include auto-config.h in atomic_16.
As mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648397.html ,
do some additional cleanup of the macros and aliases:
Clean up the macros to add the libat_ prefixes in atomic_16.S. Emit the
alias to __atomic_ when ifuncs are not enabled in the ENTRY macro.
Passes regress and
A few HWCAP entries are missing from aarch64/cpuinfo.c. This results in build
errors
on older machines.
This counts as a trivial build fix, but since it's late in stage 4 I'll let
maintainers chip in.
OK for commit?
libgcc/
* config/aarch64/cpuinfo.c: Add HWCAP_EVTSTRM, HWCAP_CRC32,
HWC
The new RTL introduced for LDP/STP results in regressions due to use of UNSPEC.
Given the new LDP fusion pass is good at finding LDP opportunities, change the
memcpy, memmove and memset expansions to emit single vector loads/stores.
This fixes the regression and enables more RTL optimization on th
By default most patterns can be conditionalized on Arm targets. However
Thumb-2 predication requires the "predicable" attribute be explicitly
set to "yes". Most patterns are shared between Arm and Thumb(-2) and are
marked with "predicable". Given this sharing, it does not make sense to
use a di
Hi Richard,
> It looks like this is really doing two things at once: disabling the
> direct emission of LDP/STP Qs, and switching the GPR handling from using
> pairs of DImode moves to single TImode moves. At least, that seems to be
> the effect of...
No it still uses TImode for the !TARGET_SIMD
Hi Richard,
> This bit isn't. The correct fix here is to fix the pattern(s) concerned to
> add the missing predicate.
>
> Note that builtin-bswap.x explicitly mentions predicated mnemonics in the
> comments.
I fixed the patterns in v2. There are likely some more, plus we could likely
merge ma
Fix libatomic build to support --disable-gnu-indirect-function on AArch64.
Always build atomic_16.S and add aliases to the __atomic_* functions if
!HAVE_IFUNC.
Passes regress and bootstrap, OK for commit?
libatomic:
PR target/113986
* Makefile.in: Regenerated.
* Makefile.
Hi Richard,
> Did you test this on a thumb1 target? It seems to me that the target parts
> that you've
> removed were likely related to that. In fact, I don't see why this test
> would need to be changed at all.
The testcase explicitly forces a Thumb-2 target (arm_arch_v6t2). The patterns
wer
GCC tends to optimistically create CONST of globals with an immediate offset.
However it is almost always better to CSE addresses of globals and add immediate
offsets separately (the offset could be merged later in single-use cases).
Splitting CONST expressions with an index in aarch64_legitimize_
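A hedged example of the pattern being discussed (the function is mine, not from the patch):

/* Two accesses to the same global at different offsets.  Rather than
   forming two separate (const (plus (symbol_ref "arr") N)) addresses, it
   is usually better to CSE the address of "arr" once and fold the offsets
   into the load addressing modes.  */
extern int arr[1024];

int
sum_two (void)
{
  return arr[1] + arr[200];
}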
Hi Richard,
>> + rtx base = strip_offset_and_salt (XEXP (x, 1), &offset);
>
> This should be just strip_offset, so that we don't lose the salt
> during optimisation.
Fixed.
> +
> + if (offset.is_constant ())
> I'm not sure this is really required. Logically the same thing
> would app
Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer ID).
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64-cores.def (AARCH64_CORE): Add 'cobalt-100' CPU.
* config/aarch64/aarch64-tune.md: Regenerated.
* doc/invoke.texi (-mcpu):
Hi,
>> Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer
>> ID).
>>
>> Passes regress, OK for commit?
>
> Ok.
Also OK to backport to GCC 13, 12 and 11?
Cheers,
Wilco
(follow-on based on review comments on
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641913.html)
Remove the tune AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS since it is only
used by an old core and doesn't properly support -Os. SPECINT_2017
shows that removing it has no performance difference
Hi Richard,
>> That tune is only used by an obsolete core. I ran the memcpy and memset
>> benchmarks from Optimized Routines on xgene-1 with and without LDP/STP.
>> There is no measurable penalty for using LDP/STP. I'm not sure why it was
>> ever added given it does not do anything useful. I'll po
v3: rebased to latest trunk
Clean up the memset implementation. Similar to memcpy/memmove, use an offset and
bytes throughout. Simplify the complex calculations when optimizing for size
by using a fixed limit.
Passes regress & bootstrap.
gcc/ChangeLog:
* config/aarch64/aarch64.h (MAX_SET_SI
Hi,
>> Is there no benefit to using SWPPL for RELEASE here? Similarly for the
>> others.
>
> We started off implementing all possible memory orderings available.
> Wilco saw value in merging less restricted orderings into more
> restricted ones - mainly to reduce codesize in less frequently use
Hi Richard,
>> Benchmarking showed that LSE and LSE2 RMW atomics have similar performance
>> once
>> the atomic is acquire, release or both. Given there is already a significant
>> overhead due
>> to the function call, PLT indirection and argument setup, it doesn't make
>> sense to add
>> extra
Hi Richard,
>> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
>
> Since this isn't (AFAIK) a standard macro, there doesn't seem to be
> any need to put it in the header file. It could just go at the head
> of aarch64.cc instead.
Sure, I've moved it in v4.
>> + if (len <= 24 || (aarch64_tune_p