And another, even bigger, regression at plain -Os:

> After gcc commit a459ee44c0a74b0df0485ed7a56683816c02aae9
> Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> 
>    aarch64: Improve size heuristic for cpymem expansion
> 
> the following benchmarks grew in size by more than 1%:
> - 470.lbm grew in size by 38% from 10823 to 14936 bytes
> 
> The reproducer instructions below can be used to rebuild both the "first_bad"
> and "last_good" cross-toolchains used in this bisection. Naturally, the
> scripts will fail when triggering benchmarking jobs if you don't have access
> to Linaro TCWG CI.
> 
> For your convenience, we have uploaded tarballs with pre-processed source and 
> assembly files at:
> - First_bad save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os/3/artifact/artifacts/build-a459ee44c0a74b0df0485ed7a56683816c02aae9/save-temps/
> - Last_good save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os/3/artifact/artifacts/build-8f95e3c04d659d541ca4937b3df2f1175a1c5f05/save-temps/
> - Baseline save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os/3/artifact/artifacts/build-baseline/save-temps/
> 
> Configuration:
> - Benchmark: SPEC CPU2006
> - Toolchain: GCC + Glibc + GNU Linker
> - Version: all components were built from their tip of trunk
> - Target: aarch64-linux-gnu
> - Compiler flags: -Os
> - Hardware: APM Mustang 8x X-Gene1
> 

--
Maxim Kuvyrkov
https://www.linaro.org

> On 2 Oct 2021, at 18:45, Maxim Kuvyrkov via Gcc-regression 
> <gcc-regress...@gcc.gnu.org> wrote:
> 
> Hi Kyrill,
> 
> With LTO enabled this patch increases code size on SPEC CPU2006 -- it 
> increases 458.sjeng by 4% and 459.GemsFDTD by 2%.  Could you check if these 
> regressions are avoidable?
> 
> It reduces size of 465.tonto and 400.perlbench by 1%, and everything else is 
> neutral, see [1].  Overall it increases geomean code size at -Os -flto by 
> 0.1%.
> 
> [1] 
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-build-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/26/artifact/artifacts/11-check_regression/results.csv/*view*/
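[Editor's note: for readers unfamiliar with the "geomean code size" figure above, here is a minimal sketch of how such an aggregate is typically computed from per-benchmark sizes. This is an illustration, not Linaro's actual reporting tooling; the function name and sample values are hypothetical.]

```python
from math import prod

def geomean_ratio(sizes_before, sizes_after):
    """Geometric mean of per-benchmark size ratios.

    A result of 1.001 corresponds to a +0.1% geomean size change,
    as quoted in the report above.
    """
    ratios = [after / before for before, after in zip(sizes_before, sizes_after)]
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical example: one benchmark grows 4%, another 2%,
# two shrink 1%, the rest are neutral (ratio 1.0).
example = geomean_ratio([100, 100, 100, 100], [104, 102, 99, 99])
```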
> 
> Regards,
> 
> --
> Maxim Kuvyrkov
> https://www.linaro.org
> 
> 
> 
> 
>> On Oct 2, 2021, at 11:12 AM, ci_not...@linaro.org wrote:
>> 
>> After gcc commit a459ee44c0a74b0df0485ed7a56683816c02aae9
>> Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> 
>>   aarch64: Improve size heuristic for cpymem expansion
>> 
>> the following benchmarks grew in size by more than 1%:
>> - 458.sjeng grew in size by 4% from 105780 to 109944 bytes
>> - 459.GemsFDTD grew in size by 2% from 247504 to 251468 bytes
>> 
>> The reproducer instructions below can be used to rebuild both the "first_bad"
>> and "last_good" cross-toolchains used in this bisection.  Naturally, the
>> scripts will fail when triggering benchmarking jobs if you don't have access
>> to Linaro TCWG CI.
>> 
>> For your convenience, we have uploaded tarballs with pre-processed source 
>> and assembly files at:
>> - First_bad save-temps: 
>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-a459ee44c0a74b0df0485ed7a56683816c02aae9/save-temps/
>> - Last_good save-temps: 
>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-8f95e3c04d659d541ca4937b3df2f1175a1c5f05/save-temps/
>> - Baseline save-temps: 
>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-baseline/save-temps/
>> 
>> Configuration:
>> - Benchmark: SPEC CPU2006
>> - Toolchain: GCC + Glibc + GNU Linker
>> - Version: all components were built from their tip of trunk
>> - Target: aarch64-linux-gnu
>> - Compiler flags: -Os -flto
>> - Hardware: APM Mustang 8x X-Gene1
>> 
>> This benchmarking CI is a work in progress, and we welcome feedback and
>> suggestions at linaro-toolchain@lists.linaro.org .  We plan to add support
>> for SPEC CPU2017 benchmarks and to provide "perf report/annotate" data
>> behind these reports.
>> 
>> THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, 
>> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
>> 
>> This commit has regressed these CI configurations:
>> - tcwg_bmk_gnu_apm/gnu-master-aarch64-spec2k6-Os_LTO
>> 
>> First_bad build: 
>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-a459ee44c0a74b0df0485ed7a56683816c02aae9/
>> Last_good build: 
>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-8f95e3c04d659d541ca4937b3df2f1175a1c5f05/
>> Baseline build: 
>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-baseline/
>> Even more details: 
>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/
>> 
>> Reproduce builds:
>> <cut>
>> mkdir investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
>> cd investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
>> 
>> # Fetch scripts
>> git clone https://git.linaro.org/toolchain/jenkins-scripts
>> 
>> # Fetch manifests and test.sh script
>> mkdir -p artifacts/manifests
>> curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/manifests/build-baseline.sh --fail
>> curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/manifests/build-parameters.sh --fail
>> curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/test.sh --fail
>> chmod +x artifacts/test.sh
>> 
>> # Reproduce the baseline build (build all pre-requisites)
>> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
>> 
>> # Save baseline build state (which is then restored in artifacts/test.sh)
>> mkdir -p ./bisect
>> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
>> 
>> cd gcc
>> 
>> # Reproduce first_bad build
>> git checkout --detach a459ee44c0a74b0df0485ed7a56683816c02aae9
>> ../artifacts/test.sh
>> 
>> # Reproduce last_good build
>> git checkout --detach 8f95e3c04d659d541ca4937b3df2f1175a1c5f05
>> ../artifacts/test.sh
>> 
>> cd ..
>> </cut>
>> 
>> Full commit (up to 1000 lines):
>> <cut>
>> commit a459ee44c0a74b0df0485ed7a56683816c02aae9
>> Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> Date:   Wed Sep 29 11:21:45 2021 +0100
>> 
>>   aarch64: Improve size heuristic for cpymem expansion
>> 
>>   Similar to my previous patch for setmem this one does the same for the 
>> cpymem expansion.
>>   We count the number of ops emitted and compare it against the alternative 
>> of just calling
>>   the library function when optimising for size.
>>   For the code:
>>   void
>>   cpy_127 (char *out, char *in)
>>   {
>>     __builtin_memcpy (out, in, 127);
>>   }
>> 
>>   void
>>   cpy_128 (char *out, char *in)
>>   {
>>     __builtin_memcpy (out, in, 128);
>>   }
>> 
>>   we now emit a call to memcpy (with an extra MOV-immediate instruction for 
>> the size) instead of:
>>   cpy_127(char*, char*):
>>           ldp     q0, q1, [x1]
>>           stp     q0, q1, [x0]
>>           ldp     q0, q1, [x1, 32]
>>           stp     q0, q1, [x0, 32]
>>           ldp     q0, q1, [x1, 64]
>>           stp     q0, q1, [x0, 64]
>>           ldr     q0, [x1, 96]
>>           str     q0, [x0, 96]
>>           ldr     q0, [x1, 111]
>>           str     q0, [x0, 111]
>>           ret
>>   cpy_128(char*, char*):
>>           ldp     q0, q1, [x1]
>>           stp     q0, q1, [x0]
>>           ldp     q0, q1, [x1, 32]
>>           stp     q0, q1, [x0, 32]
>>           ldp     q0, q1, [x1, 64]
>>           stp     q0, q1, [x0, 64]
>>           ldp     q0, q1, [x1, 96]
>>           stp     q0, q1, [x0, 96]
>>           ret
>> 
>>   which is a clear code size win. Speed optimisation heuristics remain 
>> unchanged.
>> 
>>   2021-09-29  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>> 
>>           * config/aarch64/aarch64.c (aarch64_expand_cpymem): Count number of
>>           emitted operations and adjust heuristic for code size.
>> 
>>   2021-09-29  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>> 
>>           * gcc.target/aarch64/cpymem-size.c: New test.
>> ---
>> gcc/config/aarch64/aarch64.c                   | 36 ++++++++++++++++++--------
>> gcc/testsuite/gcc.target/aarch64/cpymem-size.c | 29 +++++++++++++++++++++
>> 2 files changed, 54 insertions(+), 11 deletions(-)
>> 
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index ac17c1c88fb..a9a1800af53 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -23390,7 +23390,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>> }
>> 
>> /* Expand cpymem, as if from a __builtin_memcpy.  Return true if
>> -   we succeed, otherwise return false.  */
>> +   we succeed, otherwise return false, indicating that a libcall to
>> +   memcpy should be emitted.  */
>> 
>> bool
>> aarch64_expand_cpymem (rtx *operands)
>> @@ -23407,11 +23408,13 @@ aarch64_expand_cpymem (rtx *operands)
>> 
>>  unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
>> 
>> -  /* Inline up to 256 bytes when optimizing for speed.  */
>> +  /* Try to inline up to 256 bytes.  */
>>  unsigned HOST_WIDE_INT max_copy_size = 256;
>> 
>> -  if (optimize_function_for_size_p (cfun))
>> -    max_copy_size = 128;
>> +  bool size_p = optimize_function_for_size_p (cfun);
>> +
>> +  if (size > max_copy_size)
>> +    return false;
>> 
>>  int copy_bits = 256;
>> 
>> @@ -23421,13 +23424,14 @@ aarch64_expand_cpymem (rtx *operands)
>>      || !TARGET_SIMD
>>      || (aarch64_tune_params.extra_tuning_flags
>>        & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
>> -    {
>> -      copy_bits = 128;
>> -      max_copy_size = max_copy_size / 2;
>> -    }
>> +    copy_bits = 128;
>> 
>> -  if (size > max_copy_size)
>> -    return false;
>> +  /* Emit an inline load+store sequence and count the number of operations
>> +     involved.  We use a simple count of just the loads and stores emitted
>>      rather than rtx_insn count as all the pointer adjustments and reg copying
>> +     in this function will get optimized away later in the pipeline.  */
>> +  start_sequence ();
>> +  unsigned nops = 0;
>> 
>>  base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>>  dst = adjust_automodify_address (dst, VOIDmode, base, 0);
>> @@ -23456,7 +23460,8 @@ aarch64_expand_cpymem (rtx *operands)
>>      cur_mode = V4SImode;
>> 
>>      aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
>> -
>> +      /* A single block copy is 1 load + 1 store.  */
>> +      nops += 2;
>>      n -= mode_bits;
>> 
>>      /* Emit trailing copies using overlapping unaligned accesses - this is
>> @@ -23471,7 +23476,16 @@ aarch64_expand_cpymem (rtx *operands)
>>        n = n_bits;
>>      }
>>    }
>> +  rtx_insn *seq = get_insns ();
>> +  end_sequence ();
>> +
>> +  /* A memcpy libcall in the worst case takes 3 instructions to prepare the
>> +     arguments + 1 for the call.  */
>> +  unsigned libcall_cost = 4;
>> +  if (size_p && libcall_cost < nops)
>> +    return false;
>> 
>> +  emit_insn (seq);
>>  return true;
>> }
>> 
>> diff --git a/gcc/testsuite/gcc.target/aarch64/cpymem-size.c b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
>> new file mode 100644
>> index 00000000000..4d488b74301
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
>> @@ -0,0 +1,29 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-Os" } */
>> +
>> +#include <stdlib.h>
>> +
>> +/*
>> +** cpy_127:
>> +**      mov x2, 127
>> +**      b   memcpy
>> +*/
>> +void
>> +cpy_127 (char *out, char *in)
>> +{
>> +  __builtin_memcpy (out, in, 127);
>> +}
>> +
>> +/*
>> +** cpy_128:
>> +**      mov x2, 128
>> +**      b   memcpy
>> +*/
>> +void
>> +cpy_128 (char *out, char *in)
>> +{
>> +  __builtin_memcpy (out, in, 128);
>> +}
>> +
>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>> +
>> </cut>
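[Editor's note: the heuristic in the patch above can be summarized as "count the loads and stores an inline expansion would emit; at -Os, fall back to a memcpy libcall if that count exceeds the libcall's worst-case cost of 4 instructions." The following is a rough Python model of that decision, not the actual GCC code: it assumes 16-byte (V4SImode) block copies and one trailing block for any remainder, and it ignores LDP/STP pairing and alignment details. The function names are invented for illustration.]

```python
LIBCALL_COST = 4  # worst case: 3 insns to prepare arguments + 1 for the call

def inline_copy_ops(size):
    """Approximate the patch's `nops` counter: 1 load + 1 store per
    16-byte block, with the remainder handled by one extra block
    (modeling the overlapping trailing copy)."""
    block = 16
    full, rem = divmod(size, block)
    return 2 * (full + (1 if rem else 0))

def expand_inline_p(size, size_p=True):
    """True if the copy would be expanded inline.

    size_p models optimize_function_for_size_p (cfun): when False
    (optimizing for speed), anything up to the 256-byte cap is still
    inlined, matching "Speed optimisation heuristics remain unchanged".
    """
    if size > 256:  # hard cap in aarch64_expand_cpymem
        return False
    return not (size_p and LIBCALL_COST < inline_copy_ops(size))
```

Under this model, cpy_127 and cpy_128 from the new test both cost far more than 4 ops inline, so at -Os they become `mov x2, N; b memcpy`, while small copies (up to two blocks, i.e. 4 ops) stay inline.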
> 

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
