Could I get the source file with S_regmatch()?

On Mon, Sep 27, 2021 at 6:07 AM Maxim Kuvyrkov <maxim.kuvyr...@linaro.org> wrote:
> Hi Arthur,
>
> Your patch seems to be slowing down 400.perlbench by 6%, due to a slowdown of its hot function S_regmatch() by 14%.
>
> Could you take a look if this is easily fixable, please?
>
> Regards,
>
> --
> Maxim Kuvyrkov
> https://www.linaro.org
>
> > On 24 Sep 2021, at 15:07, ci_not...@linaro.org wrote:
> >
> > After llvm commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
> > Author: Arthur Eubanks <aeuba...@google.com>
> >
> > [SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest
> >
> > the following benchmarks slowed down by more than 2%:
> > - 400.perlbench slowed down by 6% from 9730 to 10312 perf samples
> > - 400.perlbench:[.] S_regmatch slowed down by 14% from 3660 to 4188 perf samples
> >
> > Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
> >
> > For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
> > - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-e7249e4acf3cf9438d6d9e02edecebd5b622a4dc/save-temps/
> > - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-32a50078657dd8beead327a3478ede4e9d730432/save-temps/
> > - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-baseline/save-temps/
> >
> > Configuration:
> > - Benchmark: SPEC CPU2006
> > - Toolchain: Clang + Glibc + LLVM Linker
> > - Version: all components were built from their tip of trunk
> > - Target: aarch64-linux-gnu
> > - Compiler flags: -O3
> > - Hardware: NVidia TX1 4x Cortex-A57
> >
> > This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
> >
> > THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
> >
> > This commit has regressed these CI configurations:
> > - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
> >
> > First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-e7249e4acf3cf9438d6d9e02edecebd5b622a4dc/
> > Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-32a50078657dd8beead327a3478ede4e9d730432/
> > Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-baseline/
> > Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/
> >
> > Reproduce builds:
> > <cut>
> > mkdir investigate-llvm-e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
> > cd investigate-llvm-e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
> >
> > # Fetch scripts
> > git clone https://git.linaro.org/toolchain/jenkins-scripts
> >
> > # Fetch manifests and test.sh script
> > mkdir -p artifacts/manifests
> > curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/manifests/build-baseline.sh --fail
> > curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/manifests/build-parameters.sh --fail
> > curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/test.sh --fail
> > chmod +x artifacts/test.sh
> >
> > # Reproduce the baseline build (build all pre-requisites)
> > ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
> >
> > # Save baseline build state (which is then restored in artifacts/test.sh)
> > mkdir -p ./bisect
> > rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
> >
> > cd llvm
> >
> > # Reproduce first_bad build
> > git checkout --detach e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
> > ../artifacts/test.sh
> >
> > # Reproduce last_good build
> > git checkout --detach 32a50078657dd8beead327a3478ede4e9d730432
> > ../artifacts/test.sh
> >
> > cd ..
> > </cut>
> >
> > Full commit (up to 1000 lines):
> > <cut>
> > commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
> > Author: Arthur Eubanks <aeuba...@google.com>
> > Date: Fri Aug 27 12:32:59 2021 -0700
> >
> >     [SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest
> >
> >     When determining whether to fold branches to a common destination by
> >     merging two blocks, SimplifyCFG will count the number of instructions to
> >     be moved into the first basic block. However, there's no reason to count
> >     free instructions like bitcasts and other similar instructions.
> >
> >     This resolves missed branch foldings with -fstrict-vtable-pointers in
> >     llvm-test-suite's lambda benchmark.
> >
> >     Reviewed By: spatel
> >
> >     Differential Revision: https://reviews.llvm.org/D108837
> > ---
> >  llvm/lib/Transforms/Utils/SimplifyCFG.cpp | 17 ++++++-----
> >  llvm/test/CodeGen/AArch64/csr-split.ll | 34 +++++++++++-----------
> >  .../fold-branch-to-common-dest-free-cost.ll | 5 ++--
> >  3 files changed, 29 insertions(+), 27 deletions(-)
> >
> > diff --git a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
> > index 2ff98b238de0..a3bd89e72af9 100644
> > --- a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
> > +++ b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
> > @@ -3258,13 +3258,16 @@ bool llvm::FoldBranchToCommonDest(BranchInst *BI, DomTreeUpdater *DTU,
> >      SawVectorOp |= isVectorOp(I);
> >
> >      // Account for the cost of duplicating this instruction into each
> > -    // predecessor.
> > -    NumBonusInsts += PredCount;
> > -
> > -    // Early exits once we reach the limit.
> > -    if (NumBonusInsts >
> > -        BonusInstThreshold * BranchFoldToCommonDestVectorMultiplier)
> > -      return false;
> > +    // predecessor. Ignore free instructions.
> > +    if (!TTI ||
> > +        TTI->getUserCost(&I, CostKind) != TargetTransformInfo::TCC_Free) {
> > +      NumBonusInsts += PredCount;
> > +
> > +      // Early exits once we reach the limit.
> > +      if (NumBonusInsts >
> > +          BonusInstThreshold * BranchFoldToCommonDestVectorMultiplier)
> > +        return false;
> > +    }
> >
> >      auto IsBCSSAUse = [BB, &I](Use &U) {
> >        auto *UI = cast<Instruction>(U.getUser());
> > diff --git a/llvm/test/CodeGen/AArch64/csr-split.ll b/llvm/test/CodeGen/AArch64/csr-split.ll
> > index 1bee7f05acec..de85b4313433 100644
> > --- a/llvm/test/CodeGen/AArch64/csr-split.ll
> > +++ b/llvm/test/CodeGen/AArch64/csr-split.ll
> > @@ -82,22 +82,22 @@ define dso_local signext i32 @test2(i32* %p1) local_unnamed_addr {
> >  ; CHECK-NEXT: .cfi_def_cfa_offset 16
> >  ; CHECK-NEXT: .cfi_offset w19, -8
> >  ; CHECK-NEXT: .cfi_offset w30, -16
> > -; CHECK-NEXT: cbz x0, .LBB1_2
> > -; CHECK-NEXT: // %bb.1: // %if.end
> > +; CHECK-NEXT: cbz x0, .LBB1_3
> > +; CHECK-NEXT: // %bb.1: // %entry
> >  ; CHECK-NEXT: adrp x8, a
> >  ; CHECK-NEXT: ldrsw x8, [x8, :lo12:a]
> >  ; CHECK-NEXT: mov x19, x0
> >  ; CHECK-NEXT: cmp x8, x0
> > -; CHECK-NEXT: b.eq .LBB1_3
> > -; CHECK-NEXT: .LBB1_2: // %return
> > -; CHECK-NEXT: mov w0, wzr
> > -; CHECK-NEXT: ldp x30, x19, [sp], #16 // 16-byte Folded Reload
> > -; CHECK-NEXT: ret
> > -; CHECK-NEXT: .LBB1_3: // %if.then2
> > +; CHECK-NEXT: b.ne .LBB1_3
> > +; CHECK-NEXT: // %bb.2: // %if.then2
> >  ; CHECK-NEXT: bl callVoid
> >  ; CHECK-NEXT: mov x0, x19
> >  ; CHECK-NEXT: ldp x30, x19, [sp], #16 // 16-byte Folded Reload
> >  ; CHECK-NEXT: b callNonVoid
> > +; CHECK-NEXT: .LBB1_3: // %return
> > +; CHECK-NEXT: mov w0, wzr
> > +; CHECK-NEXT: ldp x30, x19, [sp], #16 // 16-byte Folded Reload
> > +; CHECK-NEXT: ret
> >  ;
> >  ; CHECK-APPLE-LABEL: test2:
> >  ; CHECK-APPLE: ; %bb.0: ; %entry
> > @@ -108,26 +108,26 @@ define dso_local signext i32 @test2(i32* %p1) local_unnamed_addr {
> >  ; CHECK-APPLE-NEXT: .cfi_offset w29, -16
> >  ; CHECK-APPLE-NEXT: .cfi_offset w19, -24
> >  ; CHECK-APPLE-NEXT: .cfi_offset w20, -32
> > -; CHECK-APPLE-NEXT: cbz x0, LBB1_2
> > -; CHECK-APPLE-NEXT: ; %bb.1: ; %if.end
> > +; CHECK-APPLE-NEXT: cbz x0, LBB1_3
> > +; CHECK-APPLE-NEXT: ; %bb.1: ; %entry
> >  ; CHECK-APPLE-NEXT: Lloh2:
> >  ; CHECK-APPLE-NEXT: adrp x8, _a@PAGE
> >  ; CHECK-APPLE-NEXT: Lloh3:
> >  ; CHECK-APPLE-NEXT: ldrsw x8, [x8, _a@PAGEOFF]
> >  ; CHECK-APPLE-NEXT: mov x19, x0
> >  ; CHECK-APPLE-NEXT: cmp x8, x0
> > -; CHECK-APPLE-NEXT: b.eq LBB1_3
> > -; CHECK-APPLE-NEXT: LBB1_2: ; %return
> > -; CHECK-APPLE-NEXT: ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
> > -; CHECK-APPLE-NEXT: mov w0, wzr
> > -; CHECK-APPLE-NEXT: ldp x20, x19, [sp], #32 ; 16-byte Folded Reload
> > -; CHECK-APPLE-NEXT: ret
> > -; CHECK-APPLE-NEXT: LBB1_3: ; %if.then2
> > +; CHECK-APPLE-NEXT: b.ne LBB1_3
> > +; CHECK-APPLE-NEXT: ; %bb.2: ; %if.then2
> >  ; CHECK-APPLE-NEXT: bl _callVoid
> >  ; CHECK-APPLE-NEXT: ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
> >  ; CHECK-APPLE-NEXT: mov x0, x19
> >  ; CHECK-APPLE-NEXT: ldp x20, x19, [sp], #32 ; 16-byte Folded Reload
> >  ; CHECK-APPLE-NEXT: b _callNonVoid
> > +; CHECK-APPLE-NEXT: LBB1_3: ; %return
> > +; CHECK-APPLE-NEXT: ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
> > +; CHECK-APPLE-NEXT: mov w0, wzr
> > +; CHECK-APPLE-NEXT: ldp x20, x19, [sp], #32 ; 16-byte Folded Reload
> > +; CHECK-APPLE-NEXT: ret
> >  ; CHECK-APPLE-NEXT: .loh AdrpLdr Lloh2, Lloh3
> >  entry:
> >    %tobool = icmp eq i32* %p1, null
> > diff --git a/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll b/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll
> > index ace2a5ed35ca..27df5ec44582 100644
> > --- a/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll
> > +++ b/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll
> > @@ -8,12 +8,11 @@ declare void @g2()
> >
> >  define void @f(i8* %a, i8* %b, i1 %c, i1 %d, i1 %e) {
> >  ; CHECK-LABEL: @f(
> > -; CHECK-NEXT: br i1 [[C:%.*]], label [[L1:%.*]], label [[L3:%.*]]
> > -; CHECK: l1:
> >  ; CHECK-NEXT: [[A1:%.*]] = call i8* @llvm.strip.invariant.group.p0i8(i8* [[A:%.*]])
> >  ; CHECK-NEXT: [[B1:%.*]] = call i8* @llvm.strip.invariant.group.p0i8(i8* [[B:%.*]])
> >  ; CHECK-NEXT: [[I:%.*]] = icmp eq i8* [[A1]], [[B1]]
> > -; CHECK-NEXT: br i1 [[I]], label [[L2:%.*]], label [[L3]]
> > +; CHECK-NEXT: [[OR_COND:%.*]] = select i1 [[C:%.*]], i1 [[I]], i1 false
> > +; CHECK-NEXT: br i1 [[OR_COND]], label [[L2:%.*]], label [[L3:%.*]]
> >  ; CHECK: l2:
> >  ; CHECK-NEXT: call void @g1()
> >  ; CHECK-NEXT: br label [[RET:%.*]]
> > </cut>
> >
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
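
For anyone reading the quoted SimplifyCFG hunk without an LLVM checkout, the change in counting can be modeled outside LLVM roughly as below. This is only a sketch under stated assumptions: `Inst`, `fitsBudget`, and the single `threshold` parameter are hypothetical names made up for illustration (the real code multiplies BonusInstThreshold by BranchFoldToCommonDestVectorMultiplier and asks TTI for the cost), and the `isFree` flag stands in for a TCC_Free answer from the cost model.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an IR instruction; only the "free" flag matters.
struct Inst {
  const char *name;
  bool isFree; // assumption: models a TCC_Free result from the cost model
};

// Returns true if the bonus instructions stay within the budget.
// ignoreFree == false models the counting before the commit,
// ignoreFree == true models the counting after it.
bool fitsBudget(const std::vector<Inst> &bonus, unsigned predCount,
                unsigned threshold, bool ignoreFree) {
  assert(predCount > 0 && "a block being folded has at least one predecessor");
  unsigned numBonusInsts = 0;
  for (const Inst &inst : bonus) {
    if (ignoreFree && inst.isFree)
      continue; // free instructions no longer consume budget
    numBonusInsts += predCount; // one copy is made per predecessor
    if (numBonusInsts > threshold)
      return false; // early exit once the limit is exceeded
  }
  return true;
}
```

With one predecessor and a budget of one, a block holding two free instructions and one compare is rejected under the old counting and accepted under the new one, which matches the shape of the fold-branch-to-common-dest-free-cost.ll test change. Folding more branches this way is not always a win, which may be what the S_regmatch numbers are showing.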