llvmbot wrote:
<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-bolt

Author: Amir Ayupov (aaupov)

<details>
<summary>Changes</summary>

This PR adds *imputation of fall-throughs for top-of-stack (TOS) entries*, addressing fall-through undercounting for branches that appear at the top of the branch stack more often than they should.

# Branch stack sampling

Branch stack sampling (LBR, BRBE) provides a stack of (branch, target) entries for the last N taken branches:
- Intel SKL+: 32,
- AMD BRS/LBRv2: 16,
- ARM BRBE: 8/16/32/64 depending on the implementation,
- arbitrary for trace-synthesized branch stacks (ARM ETM, Intel PT).

# DataAggregator representation/traces

The internal representation of a unit of the branch profile is a _trace_: a triple of branch source, branch target/fall-through start, and fall-through end/next branch:

https://github.com/llvm/llvm-project/blob/89c61449e60703e42c5f274ed41a21f3bc386cf0/bolt/include/bolt/Profile/DataAggregator.h#L116-L118

Branch source and branch target match the branch stack entry. The fall-through end is the branch source of the next (more recent) stack entry. Naturally, the top-of-stack (TOS) entry doesn't have a fall-through end. Internally, this case is represented with the magic value `BR_ONLY` in the `To` field:

https://github.com/llvm/llvm-project/blob/89c61449e60703e42c5f274ed41a21f3bc386cf0/bolt/include/bolt/Profile/DataAggregator.h#L111
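To make the trace construction concrete, here is a small hypothetical example (the addresses are made up for illustration). Consider two adjacent branch stack entries, the more recent one at the top:

- top of stack: branch at `0x2010` with target `0x3000`,
- next entry: branch at `0x1000` with target `0x2000`.

They yield two traces:

- `{Branch=0x1000, From=0x2000, To=0x2010}`: execution fell through from `0x2000` up to the next taken branch at `0x2010`;
- `{Branch=0x2010, From=0x3000, To=BR_ONLY}`: the TOS entry, whose fall-through end is unknown.

The second kind of trace is the one this PR imputes a fall-through for.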
# Traces and fall-throughs

Traces combine taken branches and fall-throughs into one record. This brings efficiencies throughout profile pre-aggregation and profile parsing, and, most critically for BOLT, it enables distinguishing *call continuation* fall-throughs from the rest. This information is used to set the weight of the call to call-continuation fall-through CFG edge in case the call and its continuation are in separate basic blocks. The effect of this information is non-trivial, providing a measurable CPU win over separate branches and non-differentiated fall-throughs.

# Motivation: top-of-stack biased branches

Internal studies indicate that Intel LBR sampling is not completely fair. Intuitively, the probability of any branch in the program being captured in a branch stack should be proportional to its execution frequency, and, once captured, the branch should appear in any given entry with an equal probability of ~1/N, N being the size of the branch stack. However, the observed distribution of the fraction of times an entry appears at the top of the stack does not follow a binomial distribution. Relaxing the assumption of a constant 1/N probability and instead modeling the probability as a beta distribution yields a distribution that more closely resembles the observed one. This means that for branches biased towards the top of the stack, the fall-through path is undercounted in the profile.

# Implementation

TOS entries are easily identifiable by the `BR_ONLY` value in the `Trace.To` field. To get an expected fall-through length for such traces:
- group traces by the first two fields (`Branch`, `From`),
- using the traces that do have a fall-through end, compute the *weighted average fall-through length* (spelled out below),
- use that length for the trace without a fall-through end.

Since traces are sorted by their three fields, the grouping happens naturally: valid fall-through traces come first within a group, and the `BR_ONLY` trace comes last. This allows computing the weighted average in a single pass over the traces.

Care is taken in two special cases:
- The fall-through start is an unconditional jump (unconditional branch, call, or return). In this case, the fall-through length is set to zero. This still allows extending it to cover a call to call-continuation fall-through CFG edge.
- Traces in external code (DSO, JIT) are skipped.
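The weighted average is a restatement of the aggregation done in `imputeFallThroughs` in the diff below, not extra logic: for a group of traces sharing the same (`Branch`, `From`) pair, the imputed fall-through length is the taken-count-weighted mean of the observed fall-through lengths, rounded down by integer division:

```math
\mathrm{InferredBytes} = \left\lfloor \frac{\sum_i (\mathrm{To}_i - \mathrm{From}) \cdot \mathrm{TakenCount}_i}{\sum_j \mathrm{TakenCount}_j} \right\rfloor
```

The sums run over the non-`BR_ONLY` traces of the group; a trace whose fall-through end lies in external code contributes its count to the denominator but nothing to the numerator. If a group has no such traces at all, the fallback length is 0 bytes when the instruction at the fall-through start is a call, return, or unconditional branch, and 1 byte otherwise, matching the `!checkUncondJump(Trace.From)` expression in the patch.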
# Effect

For a large binary, this change improves the profile quality metrics:
- no significant increase in mismatching traces (primary raw profile quality metric),
- increases the profile density: 599.4 -> 620.6 (density is computed using the fall-through ranges),
- CG flow imbalance: 11.56% -> 8.28%,
- CFG flow imbalance:
  - weighted: 3.63% -> 3.34%,
  - worst: 17.13% -> 22.05% (the only regression).

---

Full diff: https://github.com/llvm/llvm-project/pull/145258.diff

3 Files Affected:

- (modified) bolt/include/bolt/Core/MCPlusBuilder.h (+6)
- (modified) bolt/include/bolt/Profile/DataAggregator.h (+4)
- (modified) bolt/lib/Profile/DataAggregator.cpp (+62)


``````````diff
diff --git a/bolt/include/bolt/Core/MCPlusBuilder.h b/bolt/include/bolt/Core/MCPlusBuilder.h
index 804100db80793..50c5f5bcb9303 100644
--- a/bolt/include/bolt/Core/MCPlusBuilder.h
+++ b/bolt/include/bolt/Core/MCPlusBuilder.h
@@ -430,6 +430,12 @@ class MCPlusBuilder {
     return Analysis->isIndirectBranch(Inst);
   }
 
+  bool IsUnconditionalJump(const MCInst &Inst) const {
+    const MCInstrDesc &Desc = Info->get(Inst.getOpcode());
+    // barrier captures returns and unconditional branches
+    return Desc.isCall() || Desc.isBarrier();
+  }
+
   /// Returns true if the instruction is memory indirect call or jump
   virtual bool isBranchOnMem(const MCInst &Inst) const {
     llvm_unreachable("not implemented");
diff --git a/bolt/include/bolt/Profile/DataAggregator.h b/bolt/include/bolt/Profile/DataAggregator.h
index 98e4bba872846..00c6f56520fdc 100644
--- a/bolt/include/bolt/Profile/DataAggregator.h
+++ b/bolt/include/bolt/Profile/DataAggregator.h
@@ -499,6 +499,10 @@ class DataAggregator : public DataReader {
   /// If \p FileBuildID has no match, then issue an error and exit.
   void processFileBuildID(StringRef FileBuildID);
 
+  /// Infer missing fall-throughs for branch-only traces (LBR top-of-stack
+  /// entries).
+  void imputeFallThroughs();
+
   /// Debugging dump methods
   void dump() const;
   void dump(const PerfBranchSample &Sample) const;
diff --git a/bolt/lib/Profile/DataAggregator.cpp b/bolt/lib/Profile/DataAggregator.cpp
index 88229bb31a2ad..2fcc2561cc212 100644
--- a/bolt/lib/Profile/DataAggregator.cpp
+++ b/bolt/lib/Profile/DataAggregator.cpp
@@ -77,6 +77,11 @@ FilterPID("pid",
   cl::Optional,
   cl::cat(AggregatorCategory));
 
+static cl::opt<bool> ImputeTraceFallthrough(
+    "impute-trace-fall-through",
+    cl::desc("impute missing fall-throughs for branch-only traces"),
+    cl::Optional, cl::cat(AggregatorCategory));
+
 static cl::opt<bool>
 IgnoreBuildID("ignore-build-id",
   cl::desc("continue even if build-ids in input binary and perf.data mismatch"),
@@ -529,6 +534,60 @@ void DataAggregator::parsePerfData(BinaryContext &BC) {
   deleteTempFiles();
 }
 
+void DataAggregator::imputeFallThroughs() {
+  if (Traces.empty())
+    return;
+
+  std::pair PrevBranch(Trace::EXTERNAL, Trace::EXTERNAL);
+  uint64_t AggregateCount = 0;
+  uint64_t AggregateFallthroughSize = 0;
+  uint64_t InferredTraces = 0;
+
+  // Helper map with whether the instruction is a call/ret/unconditional branch
+  std::unordered_map<uint64_t, bool> IsUncondJumpMap;
+  auto checkUncondJump = [&](const uint64_t Addr) {
+    auto isUncondJump = [&](auto MI) {
+      return MI && BC->MIB->IsUnconditionalJump(*MI);
+    };
+    auto It = IsUncondJumpMap.find(Addr);
+    if (It != IsUncondJumpMap.end())
+      return It->second;
+    BinaryFunction *Func = getBinaryFunctionContainingAddress(Addr);
+    if (!Func)
+      return false;
+    const uint64_t Offset = Addr - Func->getAddress();
+    if (Func->hasInstructions()
+            ? isUncondJump(Func->getInstructionAtOffset(Offset))
+            : isUncondJump(Func->disassembleInstructionAtOffset(Offset))) {
+      IsUncondJumpMap.emplace(Addr, true);
+      return true;
+    }
+    return false;
+  };
+
+  for (auto &[Trace, Info] : Traces) {
+    if (Trace.From == Trace::EXTERNAL)
+      continue;
+    std::pair CurrentBranch(Trace.Branch, Trace.From);
+    if (Trace.To == Trace::BR_ONLY) {
+      uint64_t InferredBytes = PrevBranch == CurrentBranch
+                                   ? AggregateFallthroughSize / AggregateCount
+                                   : !checkUncondJump(Trace.From);
+      Trace.To = Trace.From + InferredBytes;
+      outs() << "Inferred " << Trace << " " << InferredBytes << '\n';
+      ++InferredTraces;
+    } else {
+      if (CurrentBranch != PrevBranch)
+        AggregateCount = AggregateFallthroughSize = 0;
+      if (Trace.To != Trace::EXTERNAL)
+        AggregateFallthroughSize += (Trace.To - Trace.From) * Info.TakenCount;
+      AggregateCount += Info.TakenCount;
+    }
+    PrevBranch = CurrentBranch;
+  }
+  outs() << "Inferred " << InferredTraces << " traces\n";
+}
+
 Error DataAggregator::preprocessProfile(BinaryContext &BC) {
   this->BC = &BC;
 
@@ -541,6 +600,9 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
   // Sort parsed traces for faster processing.
   llvm::sort(Traces, llvm::less_first());
 
+  if (opts::ImputeTraceFallthrough)
+    imputeFallThroughs();
+
   if (opts::HeatmapMode) {
     if (std::error_code EC = printLBRHeatMap())
       return errorCodeToError(EC);
``````````

</details>

https://github.com/llvm/llvm-project/pull/145258

_______________________________________________
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits