Cool, thanks! -- Maxim Kuvyrkov https://www.linaro.org
> On 7 Dec 2021, at 15:10, Alexey Bataev <a.bat...@outlook.com> wrote: > > I committed a fix yesterday, should be fixed. Another one planning to commit > later today or tomorrow. > > Best regards, > Alexey Bataev > >> 7 дек. 2021 г., в 07:08, Maxim Kuvyrkov <maxim.kuvyr...@linaro.org> >> написал(а): >> >> Hi Alexey, >> >> After your patch Clang crashes while building 453.povray for >> aarch64-linux-gnu. Apparently, this happens only with LTO enabled at -O2 >> and -O3. >> >> Did you get any bug reports against this patch already? >> >> Thanks, >> >> -- >> Maxim Kuvyrkov >> https://www.linaro.org >> >>> On 5 Dec 2021, at 02:55, ci_not...@linaro.org wrote: >>> >>> After llvm commit ba74bb3a226e1b4660537f274627285b1bf41ee1 >>> Author: Alexey Bataev <a.bat...@outlook.com> >>> >>> [SLP]Fix reused extracts cost. >>> >>> the following benchmarks slowed down by more than 2%: >>> - 453.povray failed to build >>> >>> Below reproducer instructions can be used to re-build both "first_bad" and >>> "last_good" cross-toolchains used in this bisection. Naturally, the >>> scripts will fail when triggerring benchmarking jobs if you don't have >>> access to Linaro TCWG CI. >>> >>> For your convenience, we have uploaded tarballs with pre-processed source >>> and assembly files at: >>> - First_bad save-temps: >>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-ba74bb3a226e1b4660537f274627285b1bf41ee1/save-temps/ >>> - Last_good save-temps: >>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-78cc133c63173a4b5b7a43750cc507d4cff683cf/save-temps/ >>> - Baseline save-temps: >>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-baseline/save-temps/ >>> >>> Configuration: >>> - Benchmark: SPEC CPU2006 >>> - Toolchain: Clang + Glibc + LLVM Linker >>> - Version: all components were built from their tip of trunk >>> - Target: aarch64-linux-gnu >>> - Compiler flags: -O3 -flto >>> - Hardware: NVidia TX1 4x Cortex-A57 >>> >>> This benchmarking CI is work-in-progress, and we welcome feedback and >>> suggestions at linaro-toolchain@lists.linaro.org . In our improvement >>> plans is to add support for SPEC CPU2017 benchmarks and provide "perf >>> report/annotate" data behind these reports. >>> >>> THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, >>> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. >>> >>> This commit has regressed these CI configurations: >>> - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3_LTO >>> >>> First_bad build: >>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-ba74bb3a226e1b4660537f274627285b1bf41ee1/ >>> Last_good build: >>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-78cc133c63173a4b5b7a43750cc507d4cff683cf/ >>> Baseline build: >>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-baseline/ >>> Even more details: >>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/ >>> >>> Reproduce builds: >>> <cut> >>> mkdir investigate-llvm-ba74bb3a226e1b4660537f274627285b1bf41ee1 >>> cd investigate-llvm-ba74bb3a226e1b4660537f274627285b1bf41ee1 >>> >>> # Fetch scripts >>> git clone https://git.linaro.org/toolchain/jenkins-scripts >>> >>> # Fetch manifests and test.sh script >>> mkdir -p artifacts/manifests >>> curl -o artifacts/manifests/build-baseline.sh >>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/manifests/build-baseline.sh >>> --fail >>> curl -o artifacts/manifests/build-parameters.sh >>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/manifests/build-parameters.sh >>> --fail >>> curl -o artifacts/test.sh >>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/test.sh >>> --fail >>> chmod +x artifacts/test.sh >>> >>> # Reproduce the baseline build (build all pre-requisites) >>> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh >>> >>> # Save baseline build state (which is then restored in artifacts/test.sh) >>> mkdir -p ./bisect >>> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ >>> --exclude /llvm/ ./ ./bisect/baseline/ >>> >>> cd llvm >>> >>> # Reproduce first_bad build >>> git checkout --detach ba74bb3a226e1b4660537f274627285b1bf41ee1 >>> ../artifacts/test.sh >>> >>> # Reproduce last_good build >>> git checkout --detach 78cc133c63173a4b5b7a43750cc507d4cff683cf >>> ../artifacts/test.sh >>> >>> cd .. >>> </cut> >>> >>> Full commit (up to 1000 lines): >>> <cut> >>> commit ba74bb3a226e1b4660537f274627285b1bf41ee1 >>> Author: Alexey Bataev <a.bat...@outlook.com> >>> Date: Thu Dec 2 04:22:55 2021 -0800 >>> >>> [SLP]Fix reused extracts cost. >>> >>> If the extractelement instruction is used multiple times in the >>> different tree entries (either vectorized, or gathered), need to >>> compensate the scalar cost of such instructions. They are completely >>> removed if all users are part of the tree but we need to compensate the >>> cost only once for each instruction. >>> >>> Differential Revision: https://reviews.llvm.org/D114958 >>> --- >>> llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | 29 >>> +++++++++++++--------- >>> .../X86/extractelement-multiple-uses.ll | 23 +++++++++-------- >>> 2 files changed, 29 insertions(+), 23 deletions(-) >>> >>> diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp >>> b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp >>> index 95061e9053fa..335ad6c85387 100644 >>> --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp >>> +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp >>> @@ -4287,8 +4287,8 @@ bool BoUpSLP::canReuseExtract(ArrayRef<Value *> VL, >>> Value *OpValue, >>> bool BoUpSLP::areAllUsersVectorized(Instruction *I, >>> ArrayRef<Value *> VectorizedVals) const { >>> return (I->hasOneUse() && is_contained(VectorizedVals, I)) || >>> - llvm::all_of(I->users(), [this](User *U) { >>> - return ScalarToTreeEntry.count(U) > 0; >>> + all_of(I->users(), [this](User *U) { >>> + return ScalarToTreeEntry.count(U) > 0 || MustGather.contains(U); >>> }); >>> } >>> >>> @@ -4442,9 +4442,9 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry >>> *E, >>> // FIXME: it tries to fix a problem with MSVC buildbots. >>> TargetTransformInfo &TTIRef = *TTI; >>> auto &&AdjustExtractsCost = [this, &TTIRef, CostKind, VL, VecTy, >>> - VectorizedVals](InstructionCost &Cost, >>> - bool IsGather) { >>> + VectorizedVals, E](InstructionCost &Cost) { >>> DenseMap<Value *, int> ExtractVectorsTys; >>> + SmallPtrSet<Value *, 4> CheckedExtracts; >>> for (auto *V : VL) { >>> if (isa<UndefValue>(V)) >>> continue; >>> @@ -4452,7 +4452,12 @@ InstructionCost BoUpSLP::getEntryCost(const >>> TreeEntry *E, >>> // instruction itself is not going to be vectorized, consider this >>> // instruction as dead and remove its cost from the final cost of the >>> // vectorized tree. >>> - if (!areAllUsersVectorized(cast<Instruction>(V), VectorizedVals)) >>> + // Also, avoid adjusting the cost for extractelements with multiple >>> uses >>> + // in different graph entries. >>> + const TreeEntry *VE = getTreeEntry(V); >>> + if (!CheckedExtracts.insert(V).second || >>> + !areAllUsersVectorized(cast<Instruction>(V), VectorizedVals) || >>> + (VE && VE != E)) >>> continue; >>> auto *EE = cast<ExtractElementInst>(V); >>> Optional<unsigned> EEIdx = getExtractIndex(EE); >>> @@ -4549,11 +4554,6 @@ InstructionCost BoUpSLP::getEntryCost(const >>> TreeEntry *E, >>> } >>> return GatherCost; >>> } >>> - if (isSplat(VL)) { >>> - // Found the broadcasting of the single scalar, calculate the cost >>> as the >>> - // broadcast. >>> - return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy); >>> - } >>> if ((E->getOpcode() == Instruction::ExtractElement || >>> all_of(E->Scalars, >>> [](Value *V) { >>> @@ -4571,13 +4571,18 @@ InstructionCost BoUpSLP::getEntryCost(const >>> TreeEntry *E, >>> // single input vector or of 2 input vectors. >>> InstructionCost Cost = >>> computeExtractCost(VL, VecTy, *ShuffleKind, Mask, *TTI); >>> - AdjustExtractsCost(Cost, /*IsGather=*/true); >>> + AdjustExtractsCost(Cost); >>> if (NeedToShuffleReuses) >>> Cost += >>> TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, >>> FinalVecTy, E->ReuseShuffleIndices); >>> return Cost; >>> } >>> } >>> + if (isSplat(VL)) { >>> + // Found the broadcasting of the single scalar, calculate the cost >>> as the >>> + // broadcast. >>> + return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy); >>> + } >>> InstructionCost ReuseShuffleCost = 0; >>> if (NeedToShuffleReuses) >>> ReuseShuffleCost = TTI->getShuffleCost( >>> @@ -4755,7 +4760,7 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry >>> *E, >>> TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, I); >>> } >>> } else { >>> - AdjustExtractsCost(CommonCost, /*IsGather=*/false); >>> + AdjustExtractsCost(CommonCost); >>> } >>> return CommonCost; >>> } >>> diff --git >>> a/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll >>> b/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll >>> index c47f255f0bfe..31696752bbb3 100644 >>> --- a/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll >>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll >>> @@ -2,24 +2,25 @@ >>> ; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux >>> -march=core-avx2 -pass-remarks-output=%t | FileCheck %s >>> ; RUN: FileCheck %s --input-file=%t --check-prefix=YAML >>> >>> -; YAML: --- !Missed >>> +; YAML: --- !Passed >>> ; YAML: Pass: slp-vectorizer >>> -; YAML: Name: NotBeneficial >>> +; YAML: Name: VectorizedList >>> ; YAML: Function: multi_uses >>> ; YAML: Args: >>> -; YAML: - String: 'List vectorization was possible but not >>> beneficial with cost ' >>> -; YAML: - Cost: '0' >>> -; YAML: - String: ' >= ' >>> -; YAML: - Treshold: '0' >>> +; YAML: - String: 'SLP vectorized with cost ' >>> +; YAML: - Cost: '-1' >>> +; YAML: - String: ' and with tree size ' >>> +; YAML: - TreeSize: '3' >>> >>> define float @multi_uses(<2 x float> %x, <2 x float> %y) { >>> ; CHECK-LABEL: @multi_uses( >>> -; CHECK-NEXT: [[X0:%.*]] = extractelement <2 x float> [[X:%.*]], i32 0 >>> -; CHECK-NEXT: [[X1:%.*]] = extractelement <2 x float> [[X]], i32 1 >>> ; CHECK-NEXT: [[Y1:%.*]] = extractelement <2 x float> [[Y:%.*]], i32 1 >>> -; CHECK-NEXT: [[X0X0:%.*]] = fmul float [[X0]], [[Y1]] >>> -; CHECK-NEXT: [[X1X1:%.*]] = fmul float [[X1]], [[Y1]] >>> -; CHECK-NEXT: [[ADD:%.*]] = fadd float [[X0X0]], [[X1X1]] >>> +; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x float> poison, float >>> [[Y1]], i32 0 >>> +; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> [[TMP1]], float >>> [[Y1]], i32 1 >>> +; CHECK-NEXT: [[TMP3:%.*]] = fmul <2 x float> [[X:%.*]], [[TMP2]] >>> +; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP3]], i32 0 >>> +; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[TMP3]], i32 1 >>> +; CHECK-NEXT: [[ADD:%.*]] = fadd float [[TMP4]], [[TMP5]] >>> ; CHECK-NEXT: ret float [[ADD]] >>> ; >>> %x0 = extractelement <2 x float> %x, i32 0 >>> </cut> >> _______________________________________________ linaro-toolchain mailing list linaro-toolchain@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-toolchain