[TCWG CI] 433.milc:[.] mult_su3_mat_vec slowed down by 11% after llvm: [AMDGPU] Enable load clustering in the post-RA scheduler
After llvm commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec
Author: Jay Foad

    [AMDGPU] Enable load clustering in the post-RA scheduler

the following hot functions slowed down by more than 10% (but their benchmarks slowed down by less than 2%):
- 433.milc:[.] mult_su3_mat_vec slowed down by 11% from 2163 to 2391 perf samples

The reproducer instructions below can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.

For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-66e13c7f439cf162d7ed1d25883e71a5755ac7ec/save-temps/
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-838b4a533e6853d44e0c6d1977bcf0b06557d4ab/save-temps/
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-baseline/save-temps/

Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2
- Hardware: NVidia TX1 4x Cortex-A57

This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org . Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.

THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.

This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2

First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-66e13c7f439cf162d7ed1d25883e71a5755ac7ec/
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-838b4a533e6853d44e0c6d1977bcf0b06557d4ab/
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-baseline/
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/

Reproduce builds:

mkdir investigate-llvm-66e13c7f439cf162d7ed1d25883e71a5755ac7ec
cd investigate-llvm-66e13c7f439cf162d7ed1d25883e71a5755ac7ec

# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts

# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/manifests/build-baseline.sh --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/manifests/build-parameters.sh --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/test.sh --fail
chmod +x artifacts/test.sh

# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh

# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/

cd llvm

# Reproduce first_bad build
git checkout --detach 66e13c7f439cf162d7ed1d25883e71a5755ac7ec
../artifacts/test.sh

# Reproduce last_good build
git checkout --detach 838b4a533e6853d44e0c6d1977bcf0b06557d4ab
../artifacts/test.sh

cd ..

Full commit (up to 1000 lines):

commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec
Author: Jay Foad
Date:   Tue Oct 12 15:39:43 2021 +0100

    [AMDGPU] Enable load clustering in the post-RA scheduler

    This has a couple of benefits:
    1. It can sometimes fix clusters that got broken apart when the register
       allocator inserted a copy.
    2. Post-RA scheduling does not have to worry about increasing register
       pressure, which in some cases gives it more freedom to reorder
       instructions.

    Testing on a collection of 10,000 graphics shaders compiled for gfx1010
    showed:
    - The average length of each run of one or more load instructions
      increased by about 1%.
    - The number of runs of two or more load instructions increased by about
      4%.
---
 llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp             | 1 +
 llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll | 5 ++
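
The slowdown figure in the report above can be re-derived from the two perf sample counts it quotes. The following is a minimal sketch, not part of the TCWG CI scripts: `pct_change` is a hypothetical helper name, and the formula (relative increase in samples) is an assumption about how the CI computes the percentage.

```shell
#!/bin/sh
# pct_change BEFORE AFTER
# Hypothetical helper (not from jenkins-scripts): prints the relative
# change from BEFORE to AFTER as a percentage with one decimal place.
pct_change() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - before) * 100 / before }'
}

# Perf sample counts quoted in the report for mult_su3_mat_vec.
pct_change 2163 2391
```

With the counts from the report this prints 10.5, which matches the reported 11% if the CI rounds to the nearest whole percent.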
Re: [TCWG CI] 433.milc:[.] mult_su3_mat_vec slowed down by 11% after llvm: [AMDGPU] Enable load clustering in the post-RA scheduler
Hi Jay,

This is a false positive. We’ll look into why this report was sent out.

Regards,

--
Maxim Kuvyrkov
https://www.linaro.org

> On 26 Oct 2021, at 22:19, ci_not...@linaro.org wrote:
>
> After llvm commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec
> Author: Jay Foad
>
>    [AMDGPU] Enable load clustering in the post-RA scheduler
>
> the following hot functions slowed down by more than 10% (but their
> benchmarks slowed down by less than 2%):
> - 433.milc:[.] mult_su3_mat_vec slowed down by 11% from 2163 to 2391 perf
> samples