Cool, thanks!
--
Maxim Kuvyrkov
https://www.linaro.org

> On 7 Dec 2021, at 15:10, Alexey Bataev <a.bat...@outlook.com> wrote:
> 
> I committed a fix yesterday, should be fixed. Another one planning to commit 
> later today or tomorrow. 
> 
> Best regards,
> Alexey Bataev
> 
>> 7 дек. 2021 г., в 07:08, Maxim Kuvyrkov <maxim.kuvyr...@linaro.org> 
>> написал(а):
>> 
>> Hi Alexey,
>> 
>> After your patch Clang crashes while building 453.povray for 
>> aarch64-linux-gnu.  Apparently, this happens only with LTO enabled at -O2 
>> and -O3.
>> 
>> Did you get any bug reports against this patch already?
>> 
>> Thanks,
>> 
>> --
>> Maxim Kuvyrkov
>> https://www.linaro.org
>> 
>>> On 5 Dec 2021, at 02:55, ci_not...@linaro.org wrote:
>>> 
>>> After llvm commit ba74bb3a226e1b4660537f274627285b1bf41ee1
>>> Author: Alexey Bataev <a.bat...@outlook.com>
>>> 
>>>  [SLP]Fix reused extracts cost.
>>> 
>>> the following benchmarks slowed down by more than 2%:
>>> - 453.povray failed to build
>>> 
>>> Below reproducer instructions can be used to re-build both "first_bad" and 
>>> "last_good" cross-toolchains used in this bisection.  Naturally, the 
>>> scripts will fail when triggerring benchmarking jobs if you don't have 
>>> access to Linaro TCWG CI.
>>> 
>>> For your convenience, we have uploaded tarballs with pre-processed source 
>>> and assembly files at:
>>> - First_bad save-temps: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-ba74bb3a226e1b4660537f274627285b1bf41ee1/save-temps/
>>> - Last_good save-temps: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-78cc133c63173a4b5b7a43750cc507d4cff683cf/save-temps/
>>> - Baseline save-temps: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-baseline/save-temps/
>>> 
>>> Configuration:
>>> - Benchmark: SPEC CPU2006
>>> - Toolchain: Clang + Glibc + LLVM Linker
>>> - Version: all components were built from their tip of trunk
>>> - Target: aarch64-linux-gnu
>>> - Compiler flags: -O3 -flto
>>> - Hardware: NVidia TX1 4x Cortex-A57
>>> 
>>> This benchmarking CI is work-in-progress, and we welcome feedback and 
>>> suggestions at linaro-toolchain@lists.linaro.org .  In our improvement 
>>> plans is to add support for SPEC CPU2017 benchmarks and provide "perf 
>>> report/annotate" data behind these reports.
>>> 
>>> THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, 
>>> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
>>> 
>>> This commit has regressed these CI configurations:
>>> - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3_LTO
>>> 
>>> First_bad build: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-ba74bb3a226e1b4660537f274627285b1bf41ee1/
>>> Last_good build: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-78cc133c63173a4b5b7a43750cc507d4cff683cf/
>>> Baseline build: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/build-baseline/
>>> Even more details: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/
>>> 
>>> Reproduce builds:
>>> <cut>
>>> mkdir investigate-llvm-ba74bb3a226e1b4660537f274627285b1bf41ee1
>>> cd investigate-llvm-ba74bb3a226e1b4660537f274627285b1bf41ee1
>>> 
>>> # Fetch scripts
>>> git clone https://git.linaro.org/toolchain/jenkins-scripts
>>> 
>>> # Fetch manifests and test.sh script
>>> mkdir -p artifacts/manifests
>>> curl -o artifacts/manifests/build-baseline.sh 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/manifests/build-baseline.sh
>>>  --fail
>>> curl -o artifacts/manifests/build-parameters.sh 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/manifests/build-parameters.sh
>>>  --fail
>>> curl -o artifacts/test.sh 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3_LTO/39/artifact/artifacts/test.sh
>>>  --fail
>>> chmod +x artifacts/test.sh
>>> 
>>> # Reproduce the baseline build (build all pre-requisites)
>>> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
>>> 
>>> # Save baseline build state (which is then restored in artifacts/test.sh)
>>> mkdir -p ./bisect
>>> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ 
>>> --exclude /llvm/ ./ ./bisect/baseline/
>>> 
>>> cd llvm
>>> 
>>> # Reproduce first_bad build
>>> git checkout --detach ba74bb3a226e1b4660537f274627285b1bf41ee1
>>> ../artifacts/test.sh
>>> 
>>> # Reproduce last_good build
>>> git checkout --detach 78cc133c63173a4b5b7a43750cc507d4cff683cf
>>> ../artifacts/test.sh
>>> 
>>> cd ..
>>> </cut>
>>> 
>>> Full commit (up to 1000 lines):
>>> <cut>
>>> commit ba74bb3a226e1b4660537f274627285b1bf41ee1
>>> Author: Alexey Bataev <a.bat...@outlook.com>
>>> Date:   Thu Dec 2 04:22:55 2021 -0800
>>> 
>>>  [SLP]Fix reused extracts cost.
>>> 
>>>  If the extractelement instruction is used multiple times in the
>>>  different tree entries (either vectorized, or gathered), need to
>>>  compensate the scalar cost of such instructions. They are completely
>>>  removed if all users are part of the tree but we need to compensate the
>>>  cost only once for each instruction.
>>> 
>>>  Differential Revision: https://reviews.llvm.org/D114958
>>> ---
>>> llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp    | 29 
>>> +++++++++++++---------
>>> .../X86/extractelement-multiple-uses.ll            | 23 +++++++++--------
>>> 2 files changed, 29 insertions(+), 23 deletions(-)
>>> 
>>> diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp 
>>> b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> index 95061e9053fa..335ad6c85387 100644
>>> --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> @@ -4287,8 +4287,8 @@ bool BoUpSLP::canReuseExtract(ArrayRef<Value *> VL, 
>>> Value *OpValue,
>>> bool BoUpSLP::areAllUsersVectorized(Instruction *I,
>>>                                   ArrayRef<Value *> VectorizedVals) const {
>>> return (I->hasOneUse() && is_contained(VectorizedVals, I)) ||
>>> -         llvm::all_of(I->users(), [this](User *U) {
>>> -           return ScalarToTreeEntry.count(U) > 0;
>>> +         all_of(I->users(), [this](User *U) {
>>> +           return ScalarToTreeEntry.count(U) > 0 || MustGather.contains(U);
>>>        });
>>> }
>>> 
>>> @@ -4442,9 +4442,9 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry 
>>> *E,
>>> // FIXME: it tries to fix a problem with MSVC buildbots.
>>> TargetTransformInfo &TTIRef = *TTI;
>>> auto &&AdjustExtractsCost = [this, &TTIRef, CostKind, VL, VecTy,
>>> -                               VectorizedVals](InstructionCost &Cost,
>>> -                                               bool IsGather) {
>>> +                               VectorizedVals, E](InstructionCost &Cost) {
>>>   DenseMap<Value *, int> ExtractVectorsTys;
>>> +    SmallPtrSet<Value *, 4> CheckedExtracts;
>>>   for (auto *V : VL) {
>>>     if (isa<UndefValue>(V))
>>>       continue;
>>> @@ -4452,7 +4452,12 @@ InstructionCost BoUpSLP::getEntryCost(const 
>>> TreeEntry *E,
>>>     // instruction itself is not going to be vectorized, consider this
>>>     // instruction as dead and remove its cost from the final cost of the
>>>     // vectorized tree.
>>> -      if (!areAllUsersVectorized(cast<Instruction>(V), VectorizedVals))
>>> +      // Also, avoid adjusting the cost for extractelements with multiple 
>>> uses
>>> +      // in different graph entries.
>>> +      const TreeEntry *VE = getTreeEntry(V);
>>> +      if (!CheckedExtracts.insert(V).second ||
>>> +          !areAllUsersVectorized(cast<Instruction>(V), VectorizedVals) ||
>>> +          (VE && VE != E))
>>>       continue;
>>>     auto *EE = cast<ExtractElementInst>(V);
>>>     Optional<unsigned> EEIdx = getExtractIndex(EE);
>>> @@ -4549,11 +4554,6 @@ InstructionCost BoUpSLP::getEntryCost(const 
>>> TreeEntry *E,
>>>     }
>>>     return GatherCost;
>>>   }
>>> -    if (isSplat(VL)) {
>>> -      // Found the broadcasting of the single scalar, calculate the cost 
>>> as the
>>> -      // broadcast.
>>> -      return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy);
>>> -    }
>>>   if ((E->getOpcode() == Instruction::ExtractElement ||
>>>        all_of(E->Scalars,
>>>               [](Value *V) {
>>> @@ -4571,13 +4571,18 @@ InstructionCost BoUpSLP::getEntryCost(const 
>>> TreeEntry *E,
>>>       // single input vector or of 2 input vectors.
>>>       InstructionCost Cost =
>>>           computeExtractCost(VL, VecTy, *ShuffleKind, Mask, *TTI);
>>> -        AdjustExtractsCost(Cost, /*IsGather=*/true);
>>> +        AdjustExtractsCost(Cost);
>>>       if (NeedToShuffleReuses)
>>>         Cost += 
>>> TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc,
>>>                                     FinalVecTy, E->ReuseShuffleIndices);
>>>       return Cost;
>>>     }
>>>   }
>>> +    if (isSplat(VL)) {
>>> +      // Found the broadcasting of the single scalar, calculate the cost 
>>> as the
>>> +      // broadcast.
>>> +      return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy);
>>> +    }
>>>   InstructionCost ReuseShuffleCost = 0;
>>>   if (NeedToShuffleReuses)
>>>     ReuseShuffleCost = TTI->getShuffleCost(
>>> @@ -4755,7 +4760,7 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry 
>>> *E,
>>>             TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, I);
>>>       }
>>>     } else {
>>> -        AdjustExtractsCost(CommonCost, /*IsGather=*/false);
>>> +        AdjustExtractsCost(CommonCost);
>>>     }
>>>     return CommonCost;
>>>   }
>>> diff --git 
>>> a/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll 
>>> b/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll
>>> index c47f255f0bfe..31696752bbb3 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll
>>> @@ -2,24 +2,25 @@
>>> ; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux 
>>> -march=core-avx2 -pass-remarks-output=%t | FileCheck %s
>>> ; RUN: FileCheck %s --input-file=%t --check-prefix=YAML
>>> 
>>> -; YAML: --- !Missed
>>> +; YAML: --- !Passed
>>> ; YAML: Pass:            slp-vectorizer
>>> -; YAML: Name:            NotBeneficial
>>> +; YAML: Name:            VectorizedList
>>> ; YAML: Function:        multi_uses
>>> ; YAML: Args:
>>> -; YAML:  - String:          'List vectorization was possible but not 
>>> beneficial with cost '
>>> -; YAML:  - Cost:            '0'
>>> -; YAML:  - String:          ' >= '
>>> -; YAML:  - Treshold:        '0'
>>> +; YAML:  - String:          'SLP vectorized with cost '
>>> +; YAML:  - Cost:            '-1'
>>> +; YAML:  - String:          ' and with tree size '
>>> +; YAML:  - TreeSize:        '3'
>>> 
>>> define float @multi_uses(<2 x float> %x, <2 x float> %y) {
>>> ; CHECK-LABEL: @multi_uses(
>>> -; CHECK-NEXT:    [[X0:%.*]] = extractelement <2 x float> [[X:%.*]], i32 0
>>> -; CHECK-NEXT:    [[X1:%.*]] = extractelement <2 x float> [[X]], i32 1
>>> ; CHECK-NEXT:    [[Y1:%.*]] = extractelement <2 x float> [[Y:%.*]], i32 1
>>> -; CHECK-NEXT:    [[X0X0:%.*]] = fmul float [[X0]], [[Y1]]
>>> -; CHECK-NEXT:    [[X1X1:%.*]] = fmul float [[X1]], [[Y1]]
>>> -; CHECK-NEXT:    [[ADD:%.*]] = fadd float [[X0X0]], [[X1X1]]
>>> +; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x float> poison, float 
>>> [[Y1]], i32 0
>>> +; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x float> [[TMP1]], float 
>>> [[Y1]], i32 1
>>> +; CHECK-NEXT:    [[TMP3:%.*]] = fmul <2 x float> [[X:%.*]], [[TMP2]]
>>> +; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <2 x float> [[TMP3]], i32 0
>>> +; CHECK-NEXT:    [[TMP5:%.*]] = extractelement <2 x float> [[TMP3]], i32 1
>>> +; CHECK-NEXT:    [[ADD:%.*]] = fadd float [[TMP4]], [[TMP5]]
>>> ; CHECK-NEXT:    ret float [[ADD]]
>>> ;
>>> %x0 = extractelement <2 x float> %x, i32 0
>>> </cut>
>> 

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain

Reply via email to