And another, even bigger, regression at plain -Os: > After gcc commit a459ee44c0a74b0df0485ed7a56683816c02aae9 > Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > aarch64: Improve size heuristic for cpymem expansion > > the following benchmarks grew in size by more than 1%: > - 470.lbm grew in size by 38% from 10823 to 14936 bytes > > Below reproducer instructions can be used to re-build both "first_bad" and > "last_good" cross-toolchains used in this bisection. Naturally, the scripts > will fail when triggerring benchmarking jobs if you don't have access to > Linaro TCWG CI. > > For your convenience, we have uploaded tarballs with pre-processed source and > assembly files at: > - First_bad save-temps: > https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os/3/artifact/artifacts/build-a459ee44c0a74b0df0485ed7a56683816c02aae9/save-temps/ > - Last_good save-temps: > https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os/3/artifact/artifacts/build-8f95e3c04d659d541ca4937b3df2f1175a1c5f05/save-temps/ > - Baseline save-temps: > https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os/3/artifact/artifacts/build-baseline/save-temps/ > > Configuration: > - Benchmark: SPEC CPU2006 > - Toolchain: GCC + Glibc + GNU Linker > - Version: all components were built from their tip of trunk > - Target: aarch64-linux-gnu > - Compiler flags: -Os > - Hardware: APM Mustang 8x X-Gene1 >
-- Maxim Kuvyrkov https://www.linaro.org > On 2 Oct 2021, at 18:45, Maxim Kuvyrkov via Gcc-regression > <gcc-regress...@gcc.gnu.org> wrote: > > Hi Kyrill, > > With LTO enabled this patch increases code size on SPEC CPU2006 -- it > increases 458.sjeng by 4% and 459.GemsFDTD by 2%. Could you check if these > regressions are avoidable? > > It reduces size of 465.tonto and 400.perlbench by 1%, and everything else is > neutral, see [1]. Overall it increases geomean code size at -Os -flto by > 0.1%. > > [1] > https://ci.linaro.org/job/tcwg_bmk_ci_gnu-build-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/26/artifact/artifacts/11-check_regression/results.csv/*view*/ > > Regards, > > -- > Maxim Kuvyrkov > https://www.linaro.org > > > > >> On Oct 2, 2021, at 11:12 AM, ci_not...@linaro.org wrote: >> >> After gcc commit a459ee44c0a74b0df0485ed7a56683816c02aae9 >> Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com> >> >> aarch64: Improve size heuristic for cpymem expansion >> >> the following benchmarks grew in size by more than 1%: >> - 458.sjeng grew in size by 4% from 105780 to 109944 bytes >> - 459.GemsFDTD grew in size by 2% from 247504 to 251468 bytes >> >> Below reproducer instructions can be used to re-build both "first_bad" and >> "last_good" cross-toolchains used in this bisection. Naturally, the scripts >> will fail when triggerring benchmarking jobs if you don't have access to >> Linaro TCWG CI. >> >> For your convenience, we have uploaded tarballs with pre-processed source >> and assembly files at: >> - First_bad save-temps: >> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-a459ee44c0a74b0df0485ed7a56683816c02aae9/save-temps/ >> - Last_good save-temps: >> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-8f95e3c04d659d541ca4937b3df2f1175a1c5f05/save-temps/ >> - Baseline save-temps: >> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-baseline/save-temps/ >> >> Configuration: >> - Benchmark: SPEC CPU2006 >> - Toolchain: GCC + Glibc + GNU Linker >> - Version: all components were built from their tip of trunk >> - Target: aarch64-linux-gnu >> - Compiler flags: -Os -flto >> - Hardware: APM Mustang 8x X-Gene1 >> >> This benchmarking CI is work-in-progress, and we welcome feedback and >> suggestions at linaro-toolchain@lists.linaro.org . In our improvement plans >> is to add support for SPEC CPU2017 benchmarks and provide "perf >> report/annotate" data behind these reports. >> >> THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, >> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. >> >> This commit has regressed these CI configurations: >> - tcwg_bmk_gnu_apm/gnu-master-aarch64-spec2k6-Os_LTO >> >> First_bad build: >> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-a459ee44c0a74b0df0485ed7a56683816c02aae9/ >> Last_good build: >> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-8f95e3c04d659d541ca4937b3df2f1175a1c5f05/ >> Baseline build: >> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-baseline/ >> Even more details: >> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/ >> >> Reproduce builds: >> <cut> >> mkdir investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9 >> cd investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9 >> >> # Fetch scripts >> git clone https://git.linaro.org/toolchain/jenkins-scripts >> >> # Fetch manifests and test.sh script >> mkdir -p artifacts/manifests >> curl -o artifacts/manifests/build-baseline.sh >> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/manifests/build-baseline.sh >> --fail >> curl -o artifacts/manifests/build-parameters.sh >> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/manifests/build-parameters.sh >> --fail >> curl -o artifacts/test.sh >> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/test.sh >> --fail >> chmod +x artifacts/test.sh >> >> # Reproduce the baseline build (build all pre-requisites) >> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh >> >> # Save baseline build state (which is then restored in artifacts/test.sh) >> mkdir -p ./bisect >> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ >> --exclude /gcc/ ./ ./bisect/baseline/ >> >> cd gcc >> >> # Reproduce first_bad build >> git checkout --detach a459ee44c0a74b0df0485ed7a56683816c02aae9 >> ../artifacts/test.sh >> >> # Reproduce last_good build >> git checkout --detach 8f95e3c04d659d541ca4937b3df2f1175a1c5f05 >> ../artifacts/test.sh >> >> cd .. >> </cut> >> >> Full commit (up to 1000 lines): >> <cut> >> commit a459ee44c0a74b0df0485ed7a56683816c02aae9 >> Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com> >> Date: Wed Sep 29 11:21:45 2021 +0100 >> >> aarch64: Improve size heuristic for cpymem expansion >> >> Similar to my previous patch for setmem this one does the same for the >> cpymem expansion. >> We count the number of ops emitted and compare it against the alternative >> of just calling >> the library function when optimising for size. >> For the code: >> void >> cpy_127 (char *out, char *in) >> { >> __builtin_memcpy (out, in, 127); >> } >> >> void >> cpy_128 (char *out, char *in) >> { >> __builtin_memcpy (out, in, 128); >> } >> >> we now emit a call to memcpy (with an extra MOV-immediate instruction for >> the size) instead of: >> cpy_127(char*, char*): >> ldp q0, q1, [x1] >> stp q0, q1, [x0] >> ldp q0, q1, [x1, 32] >> stp q0, q1, [x0, 32] >> ldp q0, q1, [x1, 64] >> stp q0, q1, [x0, 64] >> ldr q0, [x1, 96] >> str q0, [x0, 96] >> ldr q0, [x1, 111] >> str q0, [x0, 111] >> ret >> cpy_128(char*, char*): >> ldp q0, q1, [x1] >> stp q0, q1, [x0] >> ldp q0, q1, [x1, 32] >> stp q0, q1, [x0, 32] >> ldp q0, q1, [x1, 64] >> stp q0, q1, [x0, 64] >> ldp q0, q1, [x1, 96] >> stp q0, q1, [x0, 96] >> ret >> >> which is a clear code size win. Speed optimisation heuristics remain >> unchanged. >> >> 2021-09-29 Kyrylo Tkachov <kyrylo.tkac...@arm.com> >> >> * config/aarch64/aarch64.c (aarch64_expand_cpymem): Count number of >> emitted operations and adjust heuristic for code size. >> >> 2021-09-29 Kyrylo Tkachov <kyrylo.tkac...@arm.com> >> >> * gcc.target/aarch64/cpymem-size.c: New test. >> --- >> gcc/config/aarch64/aarch64.c | 36 >> ++++++++++++++++++-------- >> gcc/testsuite/gcc.target/aarch64/cpymem-size.c | 29 +++++++++++++++++++++ >> 2 files changed, 54 insertions(+), 11 deletions(-) >> >> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c >> index ac17c1c88fb..a9a1800af53 100644 >> --- a/gcc/config/aarch64/aarch64.c >> +++ b/gcc/config/aarch64/aarch64.c >> @@ -23390,7 +23390,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx >> *src, rtx *dst, >> } >> >> /* Expand cpymem, as if from a __builtin_memcpy. Return true if >> - we succeed, otherwise return false. */ >> + we succeed, otherwise return false, indicating that a libcall to >> + memcpy should be emitted. */ >> >> bool >> aarch64_expand_cpymem (rtx *operands) >> @@ -23407,11 +23408,13 @@ aarch64_expand_cpymem (rtx *operands) >> >> unsigned HOST_WIDE_INT size = INTVAL (operands[2]); >> >> - /* Inline up to 256 bytes when optimizing for speed. */ >> + /* Try to inline up to 256 bytes. */ >> unsigned HOST_WIDE_INT max_copy_size = 256; >> >> - if (optimize_function_for_size_p (cfun)) >> - max_copy_size = 128; >> + bool size_p = optimize_function_for_size_p (cfun); >> + >> + if (size > max_copy_size) >> + return false; >> >> int copy_bits = 256; >> >> @@ -23421,13 +23424,14 @@ aarch64_expand_cpymem (rtx *operands) >> || !TARGET_SIMD >> || (aarch64_tune_params.extra_tuning_flags >> & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)) >> - { >> - copy_bits = 128; >> - max_copy_size = max_copy_size / 2; >> - } >> + copy_bits = 128; >> >> - if (size > max_copy_size) >> - return false; >> + /* Emit an inline load+store sequence and count the number of operations >> + involved. We use a simple count of just the loads and stores emitted >> + rather than rtx_insn count as all the pointer adjustments and reg >> copying >> + in this function will get optimized away later in the pipeline. */ >> + start_sequence (); >> + unsigned nops = 0; >> >> base = copy_to_mode_reg (Pmode, XEXP (dst, 0)); >> dst = adjust_automodify_address (dst, VOIDmode, base, 0); >> @@ -23456,7 +23460,8 @@ aarch64_expand_cpymem (rtx *operands) >> cur_mode = V4SImode; >> >> aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode); >> - >> + /* A single block copy is 1 load + 1 store. */ >> + nops += 2; >> n -= mode_bits; >> >> /* Emit trailing copies using overlapping unaligned accesses - this is >> @@ -23471,7 +23476,16 @@ aarch64_expand_cpymem (rtx *operands) >> n = n_bits; >> } >> } >> + rtx_insn *seq = get_insns (); >> + end_sequence (); >> + >> + /* A memcpy libcall in the worst case takes 3 instructions to prepare the >> + arguments + 1 for the call. */ >> + unsigned libcall_cost = 4; >> + if (size_p && libcall_cost < nops) >> + return false; >> >> + emit_insn (seq); >> return true; >> } >> >> diff --git a/gcc/testsuite/gcc.target/aarch64/cpymem-size.c >> b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c >> new file mode 100644 >> index 00000000000..4d488b74301 >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c >> @@ -0,0 +1,29 @@ >> +/* { dg-do compile } */ >> +/* { dg-options "-Os" } */ >> + >> +#include <stdlib.h> >> + >> +/* >> +** cpy_127: >> +** mov x2, 127 >> +** b memcpy >> +*/ >> +void >> +cpy_127 (char *out, char *in) >> +{ >> + __builtin_memcpy (out, in, 127); >> +} >> + >> +/* >> +** cpy_128: >> +** mov x2, 128 >> +** b memcpy >> +*/ >> +void >> +cpy_128 (char *out, char *in) >> +{ >> + __builtin_memcpy (out, in, 128); >> +} >> + >> +/* { dg-final { check-function-bodies "**" "" "" } } */ >> + >> </cut> > _______________________________________________ linaro-toolchain mailing list linaro-toolchain@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-toolchain