Re: [PATCH v2] ARM: Block predication on atomics [PR111235]

2023-10-02 Thread Wilco Dijkstra
Hi Ramana,

>> I used --target=arm-none-linux-gnueabihf --host=arm-none-linux-gnueabihf
>> --build=arm-none-linux-gnueabihf --with-float=hard. However it seems that the
>> default armhf settings are incorrect. I shouldn't need the --with-float=hard
>> since that is obviously implied by armhf, and they should also imply armv7-a
>> with vfpv3 according to documentation. It seems to get confused and skip some
>> tests. I tried using --with-fpu=auto, but that doesn't work at all, so in the
>> end I forced it like: --with-arch=armv8-a --with-fpu=neon-fp-armv8. With this
>> it runs a few more tests.
> 
> Yeah that's a wart that I don't like.
> 
> armhf just implies the hard float ABI and came into being to help
> distinguish from the Base PCS for some of the distros at the time
> (2010s). However we didn't want to set a baseline arch at that time
> given the imminent arrival of v8-a and thus the specification of
> --with-arch , --with-fpu and --with-float became second nature to many
> of us working on it at that time.

Looking at it, the default is indeed incorrect, you get:
'-mcpu=arm10e' '-mfloat-abi=hard' '-marm' '-march=armv5te+fp'

That's like 25 years out of date!

However all the armhf distros have Armv7-a as the baseline and use Thumb-2:
'-mfloat-abi=hard' '-mthumb' '-march=armv7-a+fp'

So the issue is that dg-require-effective-target arm_arch_v7a_ok doesn't work on
armhf. It seems that if you specify an architecture even with hard-float
configured, it turns off FP and then complains because hard-float implies you
must have FP...

So in most configurations (including the one used by distro compilers) we
basically skip lots of tests for no apparent reason...
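
For context, a testsuite file gated on this effective target usually looks
something like the sketch below. The directives are the standard GCC testsuite
ones; the function is a hypothetical placeholder and not taken from any real
test. If the arm_arch_v7a_ok check fails, the whole file is reported as
UNSUPPORTED, which is the skipping described above.

```c
/* { dg-do compile } */
/* { dg-require-effective-target arm_arch_v7a_ok } */
/* { dg-options "-O2" } */
/* { dg-add-options arm_arch_v7a } */

/* Hypothetical test body, for illustration only.  A real test would
   normally also scan the generated assembly for specific instructions.  */
int
add_one (int x)
{
  return x + 1;
}
```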

> Ok, thanks for promising to do so - I trust you to get it done. Please
> try out various combinations of -march v7ve, v7-a , v8-a with the tool
> as each of them have slightly different rules. For instance v7ve
> allows LDREXD and STREXD to be single copy atomic for 64 bit loads
> whereas v7-a did not .

You mean LDRD may be generated on CPUs with LPAE. We use LDREXD by
default since that is always atomic on v7-a.
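
A minimal way to check this by hand is a 64-bit atomic load like the sketch
below, compiled with different -march settings. The variable and function
names are made up for illustration; the builtins are the standard GCC
__atomic ones.

```c
#include <stdint.h>

/* Hypothetical shared variable, used only for this illustration.  */
static uint64_t shared_counter;

/* 64-bit atomic load.  Which instruction sequence is emitted on 32-bit Arm
   (for example LDREXD, or LDRD on CPUs with LPAE) depends on the -march
   setting and the GCC version.  */
uint64_t
load_counter (void)
{
  return __atomic_load_n (&shared_counter, __ATOMIC_RELAXED);
}

/* Matching 64-bit atomic store, so the load has something to read.  */
void
store_counter (uint64_t value)
{
  __atomic_store_n (&shared_counter, value, __ATOMIC_RELAXED);
}
```

Compiling this with -O2 -S and comparing -march=armv7-a against -march=armv7ve
should show which sequence is chosen in each case.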

> Ok if no regressions but as you might get nagged by the post commit CI ...

Thanks, I've committed it. Those links don't show anything concrete, however I
do note the CI didn't pick up v2.

Btw, are you happy with backports if no issues are reported for a few days?

Cheers,
Wilco
___
linaro-toolchain mailing list -- linaro-toolchain@lists.linaro.org
To unsubscribe send an email to linaro-toolchain-le...@lists.linaro.org


Re: [CI-NOTIFY]: TCWG Bisect tcwg_bmk_tk1/gnu-release-arm-spec2k6-O3_LTO - Build # 27 - Successful!

2021-07-12 Thread Wilco Dijkstra
Hi Maxim,

That sounds rather strange; huge differences due to scheduling are very rare.
Which microarchitecture was this run on? I can try running it on trunk and see
what difference it makes with those options.

Cheers,
Wilco
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


Re: [CI-NOTIFY]: TCWG Bisect tcwg_bmk_tk1/gnu-release-arm-spec2k6-O3_LTO - Build # 27 - Successful!

2021-07-15 Thread Wilco Dijkstra
Hi Maxim,

> We use Nvidia TK1s (Cortex-A15) for benchmarking on 32-bit ARM.

That's a bit old; I used Cortex-A57 as the closest to that.

> LTO tends to increase functions due to additional inlining, which increases
> scheduling regions, which increases opportunities for the 1st scheduler for
> inter-block instruction moves, which increases register pressure.

I don't think this is related to LTO - I see large differences with plain -O2
as well.

> SCHED_PRESSURE_MODEL handles cases with high register pressure well, and
> switching it off caused a few additional spills in the hot blocks, which
> caused the slow-down.
>
> It may be worthwhile to bring SCHED_PRESSURE_MODEL back when LTO is enabled.

A quick run shows that on trunk --param sched-pressure-algorithm=2 is indeed
faster for FP. However turning off pre-regalloc scheduling is better overall
since it gives 1% gain on INT and 0.5% on FP as well as significant codesize
reductions.

So the best way forward for 32-bit Arm is to turn off pre-regalloc scheduling
as it just causes lots of spilling.

Cheers,
Wilco