This is an automated email from the ASF dual-hosted git repository.
zclllyybb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
new f68eda67aad [Feature](skill) Introduce Doris profile reader skill
(#63948)
f68eda67aad is described below
commit f68eda67aadf25c767c1883aa843fe011997f0b6
Author: zclllyybb <[email protected]>
AuthorDate: Mon Jun 1 16:06:52 2026 +0800
[Feature](skill) Introduce Doris profile reader skill (#63948)
Add script to extract Doris profile operator/counter names and update
AGENTS.md
- Introduced a new Python script `extract_source_profile_inventory.py`
that extracts operator and counter names from the Doris source tree,
generating a markdown inventory.
- Updated AGENTS.md to include a requirement for formatting BE code with
the correct skill before committing if modifications are made.
---
.claude/skills/be-code-style/SKILL.md | 2 +-
.codex/skills/doris-profile-reader/SKILL.md | 51 +++++
.../skills/doris-profile-reader/agents/openai.yaml | 21 ++
.../references/counter-semantics.md | 244 +++++++++++++++++++++
.../references/join-order-diagnosis.md | 111 ++++++++++
.../references/operator-guide.md | 140 ++++++++++++
.../references/reading-workflow.md | 135 ++++++++++++
.../references/runtime-filters.md | 92 ++++++++
.../scripts/extract_source_profile_inventory.py | 158 +++++++++++++
AGENTS.md | 2 +
10 files changed, 955 insertions(+), 1 deletion(-)
diff --git a/.claude/skills/be-code-style/SKILL.md
b/.claude/skills/be-code-style/SKILL.md
index bd1f4a80f78..74fb272afde 100644
--- a/.claude/skills/be-code-style/SKILL.md
+++ b/.claude/skills/be-code-style/SKILL.md
@@ -1,6 +1,6 @@
---
name: be-code-style
-description: Fix BE (C++) code formatting issues using clang-format
+description: Fix BE (C++) code formatting issues using clang-format. Must use
this when your changes involve BE code before submitting the commit.
compatibility: opencode
---
diff --git a/.codex/skills/doris-profile-reader/SKILL.md
b/.codex/skills/doris-profile-reader/SKILL.md
new file mode 100644
index 00000000000..6b7057e7fee
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/SKILL.md
@@ -0,0 +1,51 @@
+---
+name: doris-profile-reader
+description: Interpret Apache Doris query runtime profiles, especially profile
bottleneck triage, misleading wait counters, per-operator metric priority,
scan, join-order/runtime-filter analysis, and evidence-bounded performance
explanations. Use when given a Doris profile, query id, profile URL/text, or a
request to explain Doris query performance.
+---
+
+# Doris Profile Reader
+
+## Purpose
+
+Use this skill to identify the real bottleneck in an Apache Doris query
runtime profile. The core rule is to separate active work from dependency,
queue, and backpressure waits before naming an operator as expensive. When the
plan contains joins, also separate the immediate runtime bottleneck from the
plan-shape cause, especially bad join order and runtime-filter direction.
+
+## Required Reading Order
+
+1. Read `references/reading-workflow.md` for the analysis workflow and output
contract.
+2. Read `references/counter-semantics.md` for counter meaning and priority,
especially wait counters.
+3. Read `references/operator-guide.md` for the relevant operator family.
+4. Read `references/join-order-diagnosis.md` when the profile or plan has
multiple joins, a large hash/nested-loop build, a large scan that might have
been pruned by a join, paired fast/slow plans, hints/reordered SQL, or a
request about join shape/reorder.
+5. Read `references/runtime-filters.md` when a profile or plan includes
`RuntimeFilterInfo`, `RF... <-`, `RF... ->`, `JRFs`, `WaitForRuntimeFilter`, or
`AcquireRuntimeFilter`.
+6. Use `references/source-profile-inventory.md` as the source-backed
operator/counter inventory. If a counter or operator is not in the narrative
docs, do not ignore it; look it up in this inventory and classify it by the
rules in `counter-semantics.md`.
+
+## Non-Negotiable Interpretation Rules
+
+- Do not call `WaitForDependencyTime`, `WaitForDependency[...]Time`,
`WaitForData0`, `WaitForDataN`, `WaitForRpcBufferQueue`,
`WaitForBroadcastBuffer`, `PendingFinishDependency`, or pipeline blocked/wait
counters direct operator compute cost. They are dependency, data-arrival,
queue, memory, or backpressure signals.
+- Do not rank operators by the largest wait-like counter alone. First rank by
`ExecTime`/active timers, direct custom timers, rows/bytes, memory/spill, and
skew.
+- Treat merged-profile `sum` timers as accumulated across parallel instances.
A timer can exceed query elapsed time and still be normal when many scanners,
drivers, or fragments run in parallel.
+- For scans, prioritize `RowsRead`, `ScanRows`, `ScanBytes`, `ScannerCpuTime`,
`ScannerGetBlockTime`, `ScannerWorkerWaitTime`, I/O/decompression timers,
predicate/lazy-read timers, and row-filter counters. `ScannerWorkerWaitTime` is
important, but it indicates scanner scheduling/thread-pool wait rather than
scan CPU.
+- For runtime filters, distinguish source/build side from target/probe scan
side. In plan text, `RFxxx <- expr` is produced from the build side; `RFxxx ->
expr` is applied at a target scan. In profiles, `RuntimeFilterInfo` and
scan-side wait/filter counters decide whether the RF helped or just waited.
+- For joins, always identify build side and probe/target side before judging
the order. A scan can be the immediate active bottleneck while the root cause
is still bad join order if a selective join/RF source is scheduled too late to
prune that scan.
+- Do not require a paired fast profile before naming likely bad join order. A
single slow profile can be enough when it proves large wasted build/scan work
before an empty probe/result, an RF source side that had to scan massively
before it could emit an empty/tiny filter, or a huge intermediate later
eliminated by a highly selective or contradictory join predicate.
+- Do not treat "the current RF made other scans wait and then skip" as proof
that the join order is good. If producing that empty/tiny RF required the only
expensive scan, the RF source/target choice is itself the join-order question.
+- When a large build/source side is paid before an empty/tiny probe, preserved
side, semi-join key set, or contradictory join can eliminate the result, call
the build/probe order likely bad unless the profile proves that ordering is
semantically forced and no earlier pruning/short-circuit is possible.
+- If a strong single-profile join-order pattern matches, do not hedge as
"suspicious", "close to likely bad", or "not proven". Use `likely bad` when a
better legal order still needs validation, and reserve `not proven` for the
exact alternate shape, not for the join-order diagnosis itself.
+- When a join query has plan-shape evidence, the answer must explicitly judge
join order/build-probe/RF direction as good, suspicious, or bad. Do not replace
that judgment with a vague "predicate issue" or "plan shape issue".
+- A long `InitTime`, `OpenTime`, `CloseTime`, or profile total can matter, but
only after confirming it is not accumulated across many instances and not
dominated by a known wait/dependency branch.
+
+## Standard Answer Shape
+
+When explaining a profile, answer in this order:
+
+1. `Conclusion`: one or two sentences naming the likely bottleneck and, when
joins are involved, whether the plan shape/join order is likely the cause.
+2. `Evidence`: profile counters with operator names, values, and whether each
is active work, data volume, wait/backpressure, memory/spill, or runtime-filter
evidence.
+3. `Reasoning`: how the evidence maps to the execution plan, which side is
build/probe or RF source/target, why misleading counters are discounted, and
whether the join order is reasonable.
+4. `Next checks`: the smallest additional profile/log/code checks needed if
the conclusion is still uncertain.
+
+Always preserve uncertainty. Use "proven", "likely", and "not proven"
explicitly when the profile lacks enough detail.
+
+## Scripts
+
+- `scripts/extract_source_profile_inventory.py`: scan Doris source for
factory-created operators and counter/info registrations.
+
+Scripts are evidence-generation helpers. They do not replace the reading
workflow.
diff --git a/.codex/skills/doris-profile-reader/agents/openai.yaml
b/.codex/skills/doris-profile-reader/agents/openai.yaml
new file mode 100644
index 00000000000..34aa432dfbb
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/agents/openai.yaml
@@ -0,0 +1,21 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+interface:
+ display_name: "Doris Profile Reader"
+ short_description: "Interpret Doris profiles, joins, and waits"
+ default_prompt: "Use $doris-profile-reader to analyze this Apache Doris
query profile, discount misleading wait counters, identify the real bottleneck,
and judge join order/build-probe/runtime-filter shape with counter evidence."
diff --git a/.codex/skills/doris-profile-reader/references/counter-semantics.md
b/.codex/skills/doris-profile-reader/references/counter-semantics.md
new file mode 100644
index 00000000000..7a254a19ad7
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/references/counter-semantics.md
@@ -0,0 +1,244 @@
+# Counter Semantics
+
+This file defines how to interpret every counter class. For individual counter
names and source locations, use `source-profile-inventory.md`.
+
+## Universal Priority
+
+High-signal counters:
+
+- `ExecTime`: time inside the operator. In merged profiles, check `max` and
per-instance detail before comparing to query elapsed. Before calling it active
CPU, qualify it with child wait evidence such as runtime-filter wait,
data-arrival wait, dependency wait, scanner worker wait, or spill task queue
wait. Do not mechanically subtract wait counters from `ExecTime`; scan-level
`WaitForRuntimeFilter`, per-filter `RFx WaitTime`, scanner child timers, and
merged operator timers may overlap [...]
+- `InputRows`, `RowsProduced`, `RowsRead`, `ScanRows`, `ScanBytes`, bytes
sent/received: data volume that explains active time.
+- Operator-specific direct timers: scan CPU/I/O/decompression, hash table
build/probe, aggregation build/merge, sort, analytic evaluation,
serialization/compression, spill read/write.
+- `MemoryUsage`, `MemoryUsagePeak`, `MemoryUsageHashTable`,
`MemoryUsageBuildBlocks`, `MemoryUsageSortBlocks`: resource footprint.
+- Spill counters: direct evidence that memory pressure changed the execution
path.
+
+Low-directness counters:
+
+- `WaitForDependencyTime` and `WaitForDependency[...]Time`: dependency wait.
Important for pipeline causality, not direct compute.
+- `WaitForData0` and `WaitForDataN`: exchange receiver queue data wait.
Usually upstream did not produce data yet.
+- `WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`,
`WaitForLocalExchangeBufferN`: downstream or channel backpressure.
+- `PendingFinishDependency`: sink finish dependency. Do not call it sink work.
+- `WaitWorkerTime`, `BlockedByDependency`, `MemoryReserveTimes`,
`MemoryReserveFailedTimes`: scheduler/dependency/memory state.
+- `FirstBatchArrivalWaitTime`, `DataArrivalWaitTime`: data-arrival symptom;
use with upstream operator evidence.
+- FE `Plan Time`, `Schedule Time`, `Wait and Fetch Result Time`, and
client/result fetch time: query lifecycle or client-facing timing. They can
explain end-to-end latency, but do not rank them as BE operator active work.
+
+## Common Operator Counters
+
+- `InitTime`: local state/profile initialization. Usually not the bottleneck
unless large in per-instance detail.
+- `OpenTime`: opening child/local state/resources. Can matter for external
scans/sinks or cold metadata, but verify with custom counters.
+- `ExecTime`: primary active operator timer.
+- `CloseTime`: close/finalization. For sinks it may include writer close,
publish, commit, or file close; inspect custom sink timers.
+- `ProjectionTime`: expression/projection after core operator work. Important
if high with many output columns or expensive expressions.
+- `RowsProduced`: source/operator output rows.
+- `InputRows`: sink input rows.
+- `BlocksProduced`, `OutputBlockBytes`, `MaxOutputBlockBytes`,
`MinOutputBlockBytes`: vectorized block output volume and block-size skew.
+- `MemoryUsage`, `MemoryUsagePeak`: current/peak memory tracked by the
operator.
+- `IsShuffled`, `IsColocate`, `FollowedByShuffledOperator`: plan/execution
metadata, not cost by itself.
+- `PendingFinishDependency`: wait for sink finalization dependency, not active
work.
+
+## Pipeline Task Counters
+
+- `TaskCpuTime`: accumulated CPU attributed to a pipeline task. It can exceed
wall time when summed.
+- `ExecuteTime`: task execution loop time.
+- `PrepareTime`, `OpenTime`, `GetBlockTime`, `SinkTime`, `CloseTime`: task
lifecycle partitions.
+- `WaitWorkerTime`: time waiting for worker scheduling.
+- `NumScheduleTimes`, `NumYieldTimes`, `CoreChangeTimes`: scheduling behavior
and task migration/yield signals.
+- `MemoryReserveTimes`, `MemoryReserveFailedTimes`: memory reservation
pressure; pair with blocked time and memory peaks.
+- `BlockedByDependency`: count/state of dependency blocking, not CPU.
+
+## Time Counter Name Patterns
+
+- `*Time`, `*Timer`, `*Costs`: elapsed or accumulated time for the named
action. Determine whether it is direct work or wait by the action name.
+- `Wait*`, `*Wait*`, `Blocked*`, `Pending*`: wait/dependency/backpressure
unless source code proves otherwise.
+- `*CpuTime`: CPU-side work. Can be accumulated across threads.
+- `*IOTimer`, `*ReadFileTime`, `*WriteFileTime`, `FileWriteTime`: I/O path
work or I/O wait. Use bytes and cache counters to interpret.
+- `*Serialize*`, `*Deserialize*`, `CompressTime`, `DecompressorTimer`: CPU
transformation cost.
+- `*EvalTime`, `ExprTime`, `ProjectionTime`, `VectorPredEvalTime`: expression
or predicate CPU.
+
+## Count/Volume Counter Name Patterns
+
+- `*Rows`, `*RowsRead`, `*InputRows`, `*RowsProduced`: row volume.
+- `*Bytes`, `*MemBytes`, `ScanBytes`, `CompressedBytesRead`,
`UncompressedBytesRead`: byte volume or memory/storage size.
+- `*Count`, `*Num`, `*Times`: event count.
+- `*Filtered`, `*FilteredRows`, `Rows*Filtered`: rows skipped or filtered.
High values can be good if they avoid downstream work.
+- `*CacheHit*`, `*CacheMiss*`: cache effectiveness. Interpret with I/O timers.
+- `*Peak`: peak observed value.
+
+## Scan Counters
+
+Common scan:
+
+- `RowsRead`: rows returned by the scanner after scan-side pruning/filtering.
+- `ScanRows`: rows scanned/read from storage or source. Larger than `RowsRead`
means scan-side filtering/pruning removed rows.
+- `ScanBytes`: bytes scanned. In OLAP scan source comments this represents
uncompressed bytes from local plus remote reads.
+- `NumScanners`: scanner instances created.
+- `ScannerCpuTime`: direct scanner CPU. Primary scan bottleneck timer.
+- `ScannerGetBlockTime`: time getting blocks from scanners.
+- `ScannerFilterTime`: scan-side filter time.
+- `ScannerWorkerWaitTime`: time scanner work waited in the scan worker pool.
High value indicates scanner scheduling/thread-pool contention or too much scan
concurrency, not scan CPU.
+- `WaitForRuntimeFilter`, `AcquireRuntimeFilter`, and `RuntimeFilterInfo/RFx
WaitTime`: scan-side wait before or during scan startup/filter application.
These can make scan `ExecTime` look large even when storage CPU/I/O is small.
+- `RFx InputRows`, `RFx FilterRows`, and `RFx AlwaysTrueFilterRows`:
per-filter counters, often accumulated across parallel scan instances and
filters. They may exceed a table's final scan rows when summed across
filters/instances. Compare them per filter and against `ScanRows`, `RowsRead`,
and target-side output; do not add them across filters as if they were disjoint.
+- `MaxScanConcurrency`, `MinScanConcurrency`, `RunningScanner`: concurrency
and skew.
+- `ConditionCacheHit`, `ConditionCacheFilteredRows`: condition cache
effectiveness.
+- `PerScannerRunningTime`, `PerScannerRowsRead`, `PerScannerWaitTime`,
`PerScannerProjectionTime`, `PerScannerPrepareTime`, `PerScannerOpenTime`:
per-scanner skew and phase evidence.
+
+OLAP scan storage path:
+
+- `ReaderInitTime`, `ScannerInitTime`, `ProcessConjunctTime`: scan setup and
predicate preparation.
+- `BlockLoadTime`, `BlocksLoad`, `BlockFetchTime`, `BlockInitTime`,
`BlockInitSeekTime`, `BlockInitSeekCount`: segment/block loading and seek
overhead.
+- `CompressedBytesRead`, `UncompressedBytesRead`, `IOTimer`,
`DecompressorTimer`: storage I/O and decompression cost.
+- `PredicateColumnReadTime`, `NonPredicateColumnReadTime`, `OutputColumnTime`,
`LazyReadTime`, `LazyReadSeekTime`: column read and late materialization cost.
+- `VectorPredEvalTime`, `ShortPredEvalTime`, `ExprFilterEvalTime`: predicate
expression CPU.
+- `RowsStatsFiltered`, `RowsKeyRangeFiltered`, `RowsConditionsFiltered`,
`RowsZoneMapRuntimePredicateFiltered`, `RowsBloomFilterFiltered`,
`RowsDelFiltered`, `SegmentDictFiltered`: pruning and delete/filter
effectiveness. Runtime filters, especially min/max or storage-pushed filters,
can show their benefit here even when a per-filter `FilterRows` counter is zero
or misleading.
+- `GenerateRowRangeByKeysTime`, `GenerateRowRangeByColumnConditionsTime`,
`GenerateRowRangeByBloomFilterIndexTime`, `GenerateRowRangeByZoneMapIndexTime`,
`GenerateRowRangeByDictTime`: index/range construction cost.
+- Inverted-index counters (`InvertedIndex*`, `RowsInvertedIndexFiltered`):
index reader/search/filtering cost and effectiveness.
+- ANN counters (`AnnIndex*`, `AnnIvf*`): vector index search, conversion,
cache, fallback, and filtering.
+- Variant counters (`Variant*`, sparse-column counters): variant/sparse
subcolumn materialization and filtering.
+- `AdaptiveBatchPredictMaxRows`, `AdaptiveBatchPredictMinRows`: adaptive batch
sizing metadata, not direct cost.
+
+File/external scan:
+
+- `FileScannerGetBlockTime`, `FileScannerPreFilterTimer`,
`ApplyAllRuntimeFilters`: file scanner work and RF application.
+- `FileNumber`, `EmptyFileNum`, `BatchSplitMode`: split/file shape.
+- Connector-specific read/open/list counters should be treated as remote I/O
or metadata cost; verify with bytes and file count.
+
+## Join Counters
+
+Hash join build sink:
+
+- `BuildTime`, `BuildHashTableTime`, `BuildTableInsertTime`,
`BuildExprCallTime`, `MergeBuildBlockTime`: direct build-side work.
+- `InputRows`, `HashTableSize`, `MemoryUsageHashTable`,
`MemoryUsageBuildBlocks`, `MemoryUsageBuildKeyArena`: build-side size and
memory.
+- `PublishTime`, `RuntimeFilterInfo`: runtime filter publication.
+- `BroadcastJoin`, `JoinType`, `ShareHashTableEnabled`, `BuildShareHashTable`,
`SkipProcess`: plan/state metadata.
+- `AsofIndexExprTime`, `AsofIndexSortTime`, `AsofIndexGroupTime`: ASOF join
index preparation cost.
+
+Hash join probe:
+
+- `ProbeRows`, `ProbeIntermediateRows`: probe-side volume and intermediate
expansion.
+- `ProbeExprCallTime`, `ProbeWhenSearchHashTableTime`,
`ProbeWhenProbeSideOutputTime`, `ProbeWhenBuildSideOutputTime`,
`NonEqualJoinConjunctEvaluationTime`, `JoinFilterTimer`, `BuildOutputBlock`,
`FinishProbePhaseTime`: direct probe/output/filter work.
+- `MemoryUsageProbeKeyArena`: probe key memory.
+- `WaitForDependency[HASH_JOIN_BUILD_DEPENDENCY]Time` on probe is waiting for
build readiness. It is not probe CPU.
+
+Partitioned hash join:
+
+- `Spilled`, `SpillBuildTime`, `SpillProbeTime`, `SpillRePartitionTime`,
`SpillBuildRows`, `SpillProbeRows`, `SpillRecovery*`, `SpillMaxPartitionLevel`,
`SpillTotalPartitions`: partitioned join spill path and recovery.
+- `BuildRows`, `ProbeBlocksBytesInMem`, `SpillInMemRow`: partitioned
memory/volume shape.
+
+Nested loop/cross join:
+
+- `LoopGenerateJoin`, `JoinConjunctsEvaluationTime`,
`FilteredByJoinConjunctsTime`, `OutputTempBlocksTime`,
`UpdateVisitedFlagsTime`: direct nested-loop work.
+- Large probe/intermediate rows are usually more important than wait counters.
+
+## Aggregation and Set Counters
+
+Aggregation sink/source:
+
+- `BuildTime`, `HashTableComputeTime`, `HashTableEmplaceTime`,
`HashTableInputCount`, `HashTableSize`: primary group-by/hash aggregation cost.
+- `MergeTime`, `DeserializeAndMergeTime`, `GetResultsTime`,
`HashTableIterateTime`, `InsertKeysToColumnTime`, `InsertValuesToColumnTime`:
merge/output phase.
+- `ExprTime`, `DoLimitComputeTime`: expression and limit work.
+- `MemoryUsageHashTable`, `MemoryUsageArena`, `MemoryUsageContainer`,
`MemoryUsageSerializeKeyArena`: aggregation memory.
+- `Spilled`, `SpillSerializeHashTableTime`, `SpillTotalTime`,
`SpillTotalPartitions`, `SpillMaxPartitionLevel`: spill path.
+
+Streaming/distinct aggregation:
+
+- `StreamingAggTime`: streaming aggregation work.
+- `BuildTime`, `HashTableComputeTime`, `HashTableEmplaceTime`,
`HashTableInputCount`, `HashTableSize`: distinct/group state.
+
+Set operators:
+
+- `BuildTime` on `SET_SINK_OPERATOR`: build set/hash table.
+- `ProbeTime`, `ExtractProbeDataTime` on `SET_PROBE_SINK_OPERATOR`:
probe/extract set results.
+- `GetDataTime`, `FilterTime`, `GetDataFromHashTableRows` on intersect/except:
set output/filter work.
+
+## Sort, Window, and Materialization Counters
+
+Sort:
+
+- `SortTime`, `AppendBatchTime`, `AppendBlockTime`, `MergeGetNext`,
`MergeGetNextBlock`, `MemoryUsageSortBlocks`: direct sort and output work.
+- Spill sort uses `SpillSort*`, `SpillMergeSortTime`, and common spill
counters.
+- `WaitForDependency[SORT_SOURCE_DEPENDENCY]Time` generally means waiting for
the sink to finish sorting.
+
+Partition sort:
+
+- `HashTableBuildTime`, `SortedDataTime`, `SortedPartitionInputRows`,
`SortedPartitionOutputRows`, `GetSortedTime`: partitioned/window sort setup and
output.
+
+Analytic/window:
+
+- `ComputePartitionByTime`, `ComputeOrderByTime`, `ComputeAggDataTime`,
`ComputeRangeBetweenTime`, `EvaluationTime`, `PartitionSearchTime`,
`OrderSearchTime`, `RemoveRowsTime`: direct window function work.
+- `Blocks`, `BlocksPeak`, `RemoveRows`, `RemoveCount`, `streaming mode`:
window state and mode.
+
+Materialization/select/repeat/table function:
+
+- `ProjectionTime`, expression/evaluation timers, `RowsProduced`, and output
bytes are the main proof.
+- `MATERIALIZATION_OPERATOR` can be a top/lazy materialization node that
fetches deferred columns by row id through `multiget_data_v2` RPCs.
`MaxRpcTime` is the max elapsed RPC round-trip for a batch, not local
projection CPU. Treat it as late row-id fetch latency unless child
`RowIDFetcher` profiles, fetched rows/bytes, `MergeResponseTime`, and output
row count prove a materialization-heavy workload.
+- Do not rank a top/lazy materialization node above upstream
scan/join/aggregation/memory evidence when it only materializes a small final
row set. Trace `row_ids` or row-id table info back to the scan/join branch that
produced those row ids, then report materialization separately as final-column
fetch latency.
+- `Repeat` expands grouping sets; compare input/output rows.
+- `TableFunction` can multiply rows; compare produced rows and expression time.
+
+## Exchange and Network Counters
+
+Exchange source:
+
+- `GetDataFromRecvrTime`, `DeserializeRowBatchTimer`, `DecompressTime`,
`FilterTime`, `CreateMergerTime`, `MergeGetNext`, `MergeGetNextBlock`: direct
receive/merge/decode work.
+- `LocalBytesReceived`, `RemoteBytesReceived`, `DecompressBytes`,
`BlocksProduced`: data volume.
+- `WaitForData0`, `WaitForDataN`, `DataArrivalWaitTime`,
`FirstBatchArrivalWaitTime`, `SendersBlockedTotalTimer(*)`,
`MaxWaitForWorkerTime`, `MaxWaitToProcessTime`: wait/backpressure/scheduling
signals.
+
+Exchange sink:
+
+- `SerializeBatchTime`, `CompressTime`, `DistributeRowsIntoChannelsTime`,
`SplitBlockHashComputeTime`, `LocalSendTime`, `MergeBlockTime`,
`SendNewPartitionTime`, `AddPartitionRequestTime`: direct send/shuffle work.
+- `BytesSent`, `LocalBytesSent`, `LocalSentRows`, `UncompressedRowBatchSize`,
`BlocksProduced`, `OverallThroughput`: volume/throughput.
+- `RpcMaxTime`, `RpcAvgTime`, `RpcMinTime`, `RpcSumTime`, `RpcCount` on
`DATA_STREAM_SINK_OPERATOR`: remote send RPC latency. A high max/avg with
meaningful bytes and small serialize/compress time points to network, receiver,
or RPC path latency, not local sink CPU.
+- `WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`,
`WaitForLocalExchangeBufferN`: downstream backpressure or queue capacity wait.
+
+Local exchange and multicast:
+
+- `ComputeHashValueTime`, copy/channel push/pop timers, local bytes/rows, and
buffer waits explain in-BE reshuffle or fanout.
+- Long local exchange wait usually reflects receiver/sender imbalance, not
compute in the local exchange itself.
+
+## Sink and Writer Counters
+
+Result/file/mysql/flight sinks:
+
+- `AppendBatchTime`, `BytesSent`, `FileWriteTime`, `FileWriterCloseTime`,
`ConvertBlockToArrowBatchTime`: direct output cost.
+- Result sinks can be slow because clients fetch slowly; look for send/write
wait and bytes.
+
+OLAP/table/load sinks:
+
+- `AppendNodeChannelTime`, `CloseWaitTime`, `CloseWriterTime`,
`CloseLoadTime`, `CommitTxnTime`, `BuildRowsetTime`, `AddPartitionRequestTime`:
write/load/finalization phases.
+- High close/commit is sink-side only if custom writer timers agree.
+
+External table sinks:
+
+- Hive/Iceberg/MaxCompute/JDBC/TVF counters represent connector write, file
write, delete-file write, and close/commit costs. Use file counts, bytes,
writer close/write timers, and connector-specific errors/logs.
+- Iceberg delete/merge/sort spill counters show delete file generation, merge
output, and spill path.
+
+Blackhole/cache/memory scratch/dictionary sinks:
+
+- Blackhole `BytesProcessed` is volume consumed, not output I/O.
+- Cache counters show cache fill/read size and tablet id metadata.
+- Memory scratch conversion timers show in-memory Arrow/result conversion.
+- Dictionary sink counters should be read as dictionary build/publish/write
path.
+
+## Runtime Filter Counters
+
+See `runtime-filters.md`. In short:
+
+- `RuntimeFilterInfo` on join/build/sink shows produced filters and target
metadata.
+- `WaitForRuntimeFilter` and `AcquireRuntimeFilter` on scans/sources are waits
for filters.
+- `RFx InputRows`, `RFx FilterRows`, `RFx AlwaysTrueFilterRows`, `RFx
WaitTime` tell whether each RF was useful.
+- Waiting for a runtime filter is worthwhile only if it filters enough scan
rows or prevents expensive downstream work.
+- Runtime-filter wait counters can appear on scans, multicast/source
operators, or dynamic/top-N filter paths. Treat them as wait/synchronization
evidence first, then use the plan and detail profile to find the producer and
target.
+
+## Spill Counters
+
+Common spill counters are direct evidence of a different execution path:
+
+- `SpillTotalTime`, `SpillWriteTime`, `SpillRecoverTime`: overall spill cost.
+- `SpillWriteTaskWaitInQueueTime`, `SpillReadTaskWaitInQueueTime`: spill task
queue wait.
+- `SpillWriteFileTime`, `SpillReadFileTime`: spill file I/O.
+- `SpillWriteSerializeBlockTime`, `SpillReadDeserializeBlockTime`: CPU
transform cost.
+- `SpillWriteBlockCount`, `SpillWriteBlockBytes`, `SpillWriteRows`,
`SpillWriteFileBytes`, `SpillWriteFileTotalCount`: write volume.
+- `SpillReadBlockCount`, `SpillReadBlockBytes`, `SpillReadRows`,
`SpillReadFileBytes`, `SpillReadFileCount`: read volume.
+- `SpillMaxRowsOfPartition`, `SpillMinRowsOfPartition`,
`SpillMaxPartitionLevel`, `SpillTotalPartitions`: partition skew and spill
fanout.
+
+If spill exists, it is usually more important than generic wait counters.
diff --git
a/.codex/skills/doris-profile-reader/references/join-order-diagnosis.md
b/.codex/skills/doris-profile-reader/references/join-order-diagnosis.md
new file mode 100644
index 00000000000..ca82767b9b5
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/references/join-order-diagnosis.md
@@ -0,0 +1,111 @@
+# Join Order Diagnosis
+
+Use this file after the normal bottleneck pass. Keep the skill's main
discipline: first prove the immediate runtime cost, then decide whether that
cost is a symptom of a bad join order, bad build/probe choice, or late/missing
runtime filter.
+
+## 1. Identify Build, Probe, Source, and Target
+
+Hash join:
+
+- `HASH_JOIN_SINK_OPERATOR` is the build side. Its decisive counters are
`InputRows`/`BuildRows`, `BuildTime`, `BuildHashTableTime`,
`BuildTableInsertTime`, `MemoryUsageHashTable`, `MemoryUsageBuildBlocks`, and
`RuntimeFilterInfo`.
+- `HASH_JOIN_OPERATOR` is the probe side. Its decisive counters are
`ProbeRows`, `ProbeIntermediateRows`, probe expression/search/output timers,
and final `RowsProduced`.
+- In non-pipeline or older profiles, a single `VHASH_JOIN_NODE` may contain
both sides. Use `BuildRows`, `ProbeRows`, `BuildTime`, `ProbeTime`, hash-table
memory, and plan child order/RF arrows to identify the sides.
+- In Doris plans, the hash table is normally built from the join's build/right
child, but verify with counters or RF arrows when the profile is merged or the
plan has been transformed.
+
+Nested-loop/cross join:
+
+- `CROSS_JOIN_SINK_OPERATOR` is the build/materialized side.
+- `CROSS_JOIN_OPERATOR` is the probe/loop side. Its decisive counters are
`ProbeRows`, `LoopGenerateJoin`, `JoinConjunctsEvaluationTime`,
`FilteredByJoinConjunctsTime`, and `OutputTempBlocksTime`.
+- A nested-loop join over a large intermediate result is often a plan-shape
problem even when the direct operator cost is correctly attributed to
nested-loop active timers.
+
+Runtime filters:
+
+- `RFxxx <- expr` is the source/build side that produces the filter.
+- `RFxxx -> expr` is the target/probe scan side that may be pruned.
+- A useful order usually produces filters from a small/selective build side
early enough to prune a large probe-side scan. A suspicious order scans/builds
a large fact side before the small/selective branch can produce its RF.
+- An empty RF is not automatically a good outcome. If the query spends most of
its time scanning the RF source just to produce an empty/tiny filter, then the
RF source side may be the wrong side even though downstream targets skipped
work.
+- Empty downstream target scans prove only that the RF arrived before those
targets did work. They do not prove the source choice was good. If the source
scan/build dominates elapsed time, treat the plan as "RF direction likely bad:
the expensive side produced the pruning signal instead of receiving one" unless
there is concrete counterevidence.
+
+Memo and plan alternatives:
+
+- If memo or verbose explain contains multiple physical expressions for the
same join group, inspect both child orders. A reversed physical expression is
valid counterfactual evidence even when no fast profile is present.
+- Compare the chosen side against table keys, distribution keys,
colocate/bucket properties, and available RF targets. If the chosen RF source
uses non-key predicates and scans massively, while the other side's join key
can target the source table's key/prefix/distribution key, classify the chosen
order as suspicious or bad.
+- Do not accept the optimizer's estimated lower cost as proof of correctness
when actual profile rows contradict the estimate. A low estimated build
cardinality that turns into a huge zero-output scan is evidence of bad
join-order costing.
+
+## 2. Decide Whether the Join Order Is Reasonable
+
+Treat join-order diagnosis as a plan-shape layer, separate from immediate
runtime cost. Report both layers.
+
+Strong evidence of bad join order:
+
+- A paired fast/slow profile, hint, or SQL rewrite with the same intended
result changes only join shape/RF direction and turns a large scan/hash build
into tiny or zero scan rows.
+- A large fact table is scanned or used as hash-build input before a selective
dimension/subquery branch is reduced to a few rows.
+- The slow plan builds a hash table on millions/billions of rows while an
alternate legal shape builds on a tiny side and uses RF to prune the large side.
+- A scan is the active bottleneck only because the RF that could prune it is
produced later, on the wrong side, or after the scan has already done most work.
+- The plan creates a huge intermediate result and a later join/filter reduces
it to zero or a tiny row count, especially with contradictory predicates or
predicates that could have been pushed below the join.
+- Estimates in explain/memo say a branch is large, but profile actual rows
show it is tiny, causing the optimizer to choose the wrong build/probe side.
+
+Single slow profile evidence can also be strong enough. Do not require a fast
profile when these patterns appear:
+
+- Empty-probe or empty-result build waste: a hash join builds/shuffles
millions or billions of rows, but `ProbeRows` or final output is zero/tiny. The
direct cost is build/scan, but the plan-shape cause is that the large build
side was scheduled before proving the probe/output side was empty or highly
selective. If join type semantics make a swap uncertain, say "likely bad join
order/build-side scheduling" rather than "not join order".
+- RF-source inversion: the current RF source/build side is itself the
expensive unpruned scan and produces zero/tiny rows only after scanning a huge
table. Its RF target scans then wait and skip work. This is not merely a
scan/index problem; it means the plan chose the side that needed pruning as the
side responsible for producing the pruning signal. Check whether the expensive
side's join key is a table key/prefix, distribution key, or colocated key and
whether no RF is applied to that side.
+- Expensive-empty source: a scan/build side does massive active work and
outputs zero/tiny rows, then its RF makes other branches skip. Classify this as
likely bad join order/RF direction even if that branch also has a missing
index, non-key predicate, bad literal, or stale estimate. The other anomaly
explains why the branch is expensive or empty; it does not make the expensive
branch a good RF source.
+- Late emptying predicate: a later join, semi join, analytic/filter, or
non-equi conjunct reduces a large intermediate to zero/tiny rows. If that
predicate references only a subset of the joined tables or is contradictory,
classify the earlier large join/intermediate as bad join order or missed
predicate placement.
+- Ineffective RF to a large side: the plan contains an RF targeting a large
scan, but profile shows it is ignored, not pushed down, arrives after scan
work, has zero filter rows, or the large scan still produces millions of rows.
If a small/selective branch exists, this is evidence of bad RF direction/timing
even without a fast profile.
+- Estimate/actual inversion: the optimizer estimates the build/source side
tiny, but profile proves it performed massive scan work to produce zero/tiny
rows, while downstream tables only waited for its RF. Treat this as a likely
join-order/RF-direction error. Do not stop at "the build side scan is slow"
unless the profile proves no join alternative could prune it.
+- Key-target inversion: the expensive source table is scanned by
non-key/non-prefix predicates, while its join key is a key prefix, distribution
key, or otherwise better pruning target. If current order makes that table
produce the RF instead of receive one, call the join order/RF direction bad
unless profile proves the alternative cannot prune it.
+- Huge build for empty/tiny probe: a hash join build side
scans/shuffles/builds millions or billions of rows while the probe side is
zero/tiny and final output is zero/tiny. Report direct cost as build/scan, and
report plan-shape root cause as likely bad join order/build-side scheduling
even for an outer join unless the profile proves no earlier
empty-preserved-side check or legal reorder can avoid the build.
+- Contradictory non-equi join late: an inner nested-loop/cross join has
conjuncts that cannot both be true, or a highly selective non-equi relation,
but the plan first materializes a large intermediate. This is not just
expression simplification; it is bad join order/predicate placement because the
impossible/selective relation was applied after the row explosion.
+
+When any strong single-profile pattern above matches, do not write "join order
not proven" as the main diagnosis. Use one of these verdicts:
+
+- `bad`: a paired plan, memo alternative, legal rewrite, key/RF target
evidence, or inner-join contradiction shows the expensive work should have been
avoided.
+- `likely bad`: the profile alone proves huge wasted scan/build/intermediate
work before an empty/tiny side or late predicate, but legal alternatives still
need confirmation.
+- `suspicious`: only for weak patterns where build/probe/RF sides or row-flow
evidence are incomplete. Do not use `suspicious` for expensive-empty source,
huge build for empty/tiny probe, tiny semi key after fact build, or
contradictory non-equi join late.
+
+Keep uncertainty scoped correctly:
+
+- It is acceptable to say the exact fix or alternate legal order is not proven.
+- It is not acceptable to demote the diagnosis to "not join order" or "only
scan/literal/index problem" when the profile already proves the large side was
paid before an empty/tiny side could prune it.
+- In the conclusion, put the join verdict before coexisting anomalies when the
user's question is profile performance: "runtime bottleneck is X; join
order/build scheduling is likely bad; bad literal/stats/index may explain why
the selective side became empty."
+
+Weaker evidence:
+
+- A single profile shows large scan/build work and late/weak RF, but there is
no alternate plan, no SQL/hint comparison, and no stats/memo evidence.
+- Outer/semi/anti joins restrict legal reorder. A costly order may be required
by semantics unless an equivalent rewrite or optimizer rule is known.
+
+Do not stop at "scan bottleneck" when the scan exists because of join order.
Say: "runtime bottleneck is the scan/hash build; likely plan-shape root cause
is join order/RF direction because ..."
+
+## 3. Join Type Constraints
+
+- Inner joins are usually reorderable; choose the shape that builds on the
smaller/selective side and probes/prunes the larger side.
+- For left outer join, the left side is preserved. For right outer join, the
right side is preserved. Swapping can change semantics unless the optimizer
rewrites the join equivalently.
+- Outer join preservation does not make every build/probe choice good. If a
preserved-side scan/probe is empty after its own predicates while the
non-preserved build side was already fully scanned/built, report the direct
build waste and classify the order as likely bad scheduling or missed
empty-side short-circuit/predicate placement. Keep the semantic caveat, but do
not downgrade it to a pure scan problem.
+- For outer joins, distinguish logical join order from physical work order.
Even when logical reordering is constrained, a profile can still prove bad
physical build-side scheduling if Doris builds the non-preserved side before
discovering the preserved side has zero rows.
+- For left outer joins in hash execution, `ProbeRows=0` with huge right/build
`BuildRows` is a likely bad physical work order: the build side was paid before
the preserved side was known empty. State that even if the SQL contains another
bug, and make the legal-rewrite caveat secondary.
+- For semi/anti joins, the preserved/output side and filter side matter. A
good order applies the selective semi/anti result before scanning or building
large downstream joins when legal.
+- For semi joins, a tiny key set from the filter side that is applied only
after a large fact scan/build is strong evidence of bad join order or RF timing.
+- For non-equi joins, nested-loop may be required, but the order can still be
bad if it materializes a huge intermediate before applying a selective or
impossible predicate. A contradictory join conjunct on an inner join is a
join-order/predicate-placement failure when earlier joins create large rows
that the contradictory join later discards. State "join order/predicate
placement is bad"; do not stop at "constant folding/predicate simplification
missing".
+
+## 4. Distinguish Join Order From Other Anomalies
+
+Other anomalies can coexist with bad join order:
+
+- Bad literal rewrite, stale stats, missing index, weak partition pruning, or
an expensive scalar expression may explain why one branch is selective or empty.
+- They do not explain why the engine scanned/built another huge side before
using that selectivity, or why an empty/tiny branch was unable to prune the
expensive side.
+- State both: "runtime bottleneck is large scan/build; another anomaly
explains selectivity; join order/RF direction is still wrong because the plan
pays the large side before the selective/empty branch can prune it."
+
+Avoid this failure mode: calling the expensive RF-source scan a pure
access-path problem when the rest of the plan proves that all other large scans
waited for its empty/tiny RF. That is exactly the case where the RF
source/target choice should be challenged.
+
+Avoid this failure mode: calling a large build before an empty probe "only an
outer join semantic constraint". If the preserved side is empty after its own
predicates, then the build side contributed nothing to the result; the plan
order/scheduling failed to exploit the empty side.
+
+Avoid this failure mode: calling an impossible non-equi join only a predicate
simplification issue. If a large intermediate is produced before the impossible
join, the bad symptom is join order/predicate placement: the impossible or
highly selective relation was applied too late.
+
+## 5. Output Checklist
+
+When a query has joins, include these facts:
+
+- `Runtime bottleneck`: the active scan/join/build/probe cost proven by
counters.
+- `Join-order diagnosis`: whether the order/build side/RF direction is good,
suspicious, or bad.
+- `Build/probe evidence`: build side rows/time/memory, probe side rows/time,
and join type.
+- `RF evidence`: source side, target side, wait, filtering/pruning, and
whether it arrived early enough.
+- `Counterfactual evidence`: paired profile, hint, rewrite, memo, or stats
that shows a better legal order. If absent, say the join-order conclusion is
likely but not proven.
diff --git a/.codex/skills/doris-profile-reader/references/operator-guide.md
b/.codex/skills/doris-profile-reader/references/operator-guide.md
new file mode 100644
index 00000000000..66fe07f8874
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/references/operator-guide.md
@@ -0,0 +1,140 @@
+# Operator Guide
+
+This guide covers the major Doris runtime profile operator families. Use
`source-profile-inventory.md` for exact source registration lines and
counter/source anchors.
+
+## Scan Operators
+
+`OLAP_SCAN_OPERATOR` reads Doris internal storage. First check `ScanRows`,
`RowsRead`, `ScanBytes`, `ScannerCpuTime`, `ScannerGetBlockTime`,
`ScannerWorkerWaitTime`, `IOTimer`, `DecompressorTimer`, predicate timers, lazy
read timers, runtime-filter wait/filter counters, and row-filter counters. A
long scan is important when active scanner timers and large bytes/rows agree.
If `ExecTime` is mostly explained by `RuntimeFilterInfo/RFx WaitTime`,
`AcquireRuntimeFilter`, or `WaitForRuntimeFil [...]
+
+`FILE_SCAN_OPERATOR` reads external files through file scanner and connector
code. Prioritize file count/split shape, file read/open/list timers,
`FileScannerGetBlockTime`, `ApplyAllRuntimeFilters`, scan rows/bytes, and
connector I/O. Metadata/listing time can dominate small queries; data
read/decode dominates large ones.
+
+`JDBC_SCAN_OPERATOR` reads remote JDBC sources. Prioritize remote
query/read/fetch timers, rows, bytes, and network/serialization. A slow JDBC
scan may be remote database latency rather than Doris CPU.
+
+`GROUP_COMMIT_SCAN_OPERATOR` scans group-commit buffered data. Prioritize
rows, blocks, group-commit state, and wait/dependency only as queue state.
+
+`SCHEMA_SCAN_OPERATOR` reads schema/system metadata. Usually small. If slow,
check metadata service/catalog lookup timers and rows produced.
+
+`META_SCAN_OPERATOR` reads internal metadata. Treat it like a metadata scan;
long active time points to metadata lookup or serialization, not table data
scan.
+
+`DATA_GEN_SCAN_OPERATOR` generates rows. Prioritize generated row count and
expression/generation active timers.
+
+`REC_CTE_SCAN_OPERATOR` reads recursive CTE state. Use recursive CTE
source/sink counters and rows per iteration to find expansion.
+
+`CACHE_SOURCE_OPERATOR` reads cached intermediate data. Prioritize cache
tablet/id metadata, cache hit/read counters, rows/bytes, and active read time.
+
+`EMPTY_SET_OPERATOR` or `EMPTY_SET_SOURCE_OPERATOR` produces no data. It
should not be the bottleneck unless profile setup/open time is anomalously high.
+
+## Exchange Operators
+
+`EXCHANGE_OPERATOR` receives data from sender queues. Prioritize
`GetDataFromRecvrTime`, deserialize/decompress/filter/merge timers,
local/remote bytes received, and produced rows. `WaitForData0`/`WaitForDataN`,
`DataArrivalWaitTime`, and `FirstBatchArrivalWaitTime` mean the receiver waited
for data. They usually point to upstream work or network/backpressure, not
exchange CPU.
+
+`DATA_STREAM_SINK_OPERATOR` sends data across fragments. Prioritize
`SerializeBatchTime`, `CompressTime`, `DistributeRowsIntoChannelsTime`,
`SplitBlockHashComputeTime`, `LocalSendTime`, `BytesSent`, `LocalBytesSent`,
`LocalSentRows`, `RpcMaxTime`/`RpcAvgTime`, and throughput. High RPC time with
non-trivial bytes is exchange/network evidence even when local serialization is
small. `WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`, and
`WaitForLocalExchangeBufferN` are buffer/backpressure waits.
+
+`LOCAL_EXCHANGE_OPERATOR` and `LOCAL_EXCHANGE_SINK_OPERATOR` reshuffle data
inside one BE. Prioritize hash computation, copy/push/pop timers, local
bytes/rows, and skew. Local exchange waits indicate sender/receiver imbalance
or downstream pressure.
+
+`LOCAL_MERGE_SORT_SOURCE_OPERATOR` receives locally sorted streams. Prioritize
merge/get-next/sorted output active timers and rows. A large dependency wait
usually means it waited for sort sink completion.
+
+`MULTI_CAST_DATA_STREAM_SINK_OPERATOR` and
`MULTI_CAST_DATA_STREAM_SOURCE_OPERATOR` fan out data to multiple consumers.
Prioritize copied bytes/rows, channel buffer state, source wait for runtime
filters when present, and downstream imbalance. Wait counters identify which
consumer/channel delayed progress.
+
+## Aggregation Operators
+
+`AGGREGATION_SINK_OPERATOR`, `AGGREGATION_OPERATOR`,
`BUCKETED_AGGREGATION_SINK_OPERATOR`, `BUCKETED_AGGREGATION_OPERATOR`,
`PARTITIONED_AGGREGATION_SINK_OPERATOR`, and `PARTITIONED_AGGREGATION_OPERATOR`
perform hash/merge aggregation. Prioritize `InputRows`, `RowsProduced`,
`BuildTime`, `HashTableComputeTime`, `HashTableEmplaceTime`,
`HashTableInputCount`, `HashTableSize`, `MergeTime`, `DeserializeAndMergeTime`,
`GetResultsTime`, `HashTableIterateTime`, memory counters, and spill counte
[...]
+
+`STREAMING_AGGREGATION_OPERATOR` aggregates when input order/streaming mode
allows less state. Prioritize `StreamingAggTime`, build/hash table counters,
input/output rows, and whether memory/hash table is unexpectedly high.
+
+`DISTINCT_STREAMING_AGGREGATION_OPERATOR` handles distinct streaming
aggregation. Prioritize distinct hash table compute/emplace, input rows, hash
table size, and rows produced.
+
+## Join Operators
+
+`HASH_JOIN_SINK_OPERATOR` builds hash join state and runtime filters.
Prioritize `InputRows`, `BuildTime`, `BuildHashTableTime`,
`BuildTableInsertTime`, `BuildExprCallTime`, `MergeBuildBlockTime`,
`MemoryUsageHashTable`, build memory arenas, `PublishTime`, and
`RuntimeFilterInfo`. Build dependency waits are sequencing state, not build CPU.
+
+`HASH_JOIN_OPERATOR` probes hash join state. Prioritize `ProbeRows`,
`ProbeIntermediateRows`, `ProbeExprCallTime`, `ProbeWhenSearchHashTableTime`,
non-equi conjunct time, output block build time, rows produced, and probe
memory. Long wait for build dependency means probe waited for the build side.
+
+`PARTITIONED_HASH_JOIN_SINK_OPERATOR` and
`PARTITIONED_HASH_JOIN_PROBE_OPERATOR` are partitioned/spill-capable join
paths. Prioritize build/probe rows, in-memory bytes, spill
repartition/build/probe/recovery timers, spill rows/blocks, partition
level/count, and memory.
+
+`CROSS_JOIN_SINK_OPERATOR` and `CROSS_JOIN_OPERATOR` are nested-loop style
join paths. Prioritize build rows, probe rows, intermediate rows,
`LoopGenerateJoin`, join conjunct evaluation/filter time, and output temp block
time. Large output cardinality is often the root cause. If the cross join is
only stitching together one-row scalar-subquery results, large dependency waits
on the cross join are branch fan-in latency; inspect the child branch
scans/joins/aggregations before blaming cros [...]
+
+`NestedLoopJoinBuildSinkOperatorX` and `NestedLoopJoinProbeOperatorX` appear
as cross join operators in profiles. Interpret them with the cross-join
counters.
+
+## Sort and Analytic Operators
+
+`SORT_SINK_OPERATOR` receives rows and sorts them. Prioritize input rows,
sort/append timers, memory usage for sort blocks, and spill counters.
+
+`SORT_OPERATOR` outputs sorted rows. Prioritize merge/get-next timers, rows
produced, output bytes, and source dependency waits only as sink-completion
waits.
+
+`SPILL_SORT_SINK_OPERATOR` and `SPILL_SORT_SOURCE_OPERATOR` are sort paths
with spill. Prioritize common spill counters, merge sort time, read/write file
times, spilled bytes/rows, and memory.
+
+`PARTITION_SORT_SINK_OPERATOR` and `PARTITION_SORT_OPERATOR` sort by partition
for analytic/window work. Prioritize hash table build, sorted data time, sorted
partition input/output rows, and get-sorted time.
+
+`ANALYTIC_EVAL_SINK_OPERATOR` and `ANALYTIC_EVAL_OPERATOR` evaluate window
functions. Prioritize input rows, partition/order/range compute timers,
evaluation time, search timers, remove rows/time, blocks peak, and rows
produced. Wait for dependency usually means waiting for buffered window state.
+
+## Set, Union, Repeat, and Projection Operators
+
+`UNION_SINK_OPERATOR` and `UNION_OPERATOR` combine child outputs. Prioritize
input/output rows, copy/projection time, and child imbalance. Union is rarely
the bottleneck unless it moves many rows or projects expensive expressions.
+
+`INTERSECT_OPERATOR`, `EXCEPT_OPERATOR`, `SET_SINK_OPERATOR`, and
`SET_PROBE_SINK_OPERATOR` build/probe set state. Prioritize
build/probe/get-data/filter timers, hash table rows, extracted rows, memory,
and rows produced.
+
+`REPEAT_OPERATOR` expands rows for grouping sets/rollups. Compare input rows
to rows produced and output bytes before blaming later exchange or aggregation.
A repeat/rollup can create hundreds of millions of rows; the repeat operator's
own active time may be moderate, while the real cost appears immediately after
it as local hash shuffle, data stream exchange, and aggregation over the
expanded rows. In that pattern, downstream wait counters are symptoms and the
proof is repeat output vol [...]
+
+`MATERIALIZATION_OPERATOR` materializes slots/projections. Prioritize
projection/expression time, output rows/bytes, child `RowIDFetcher` profile
info, `MergeResponseTime`, and fetched row-id volume. A high materialization
time can also be top/lazy materialization: the operator sends row-id based
`multiget_data_v2` RPCs to fetch deferred columns after upstream pruning/top-N.
In that case `MaxRpcTime` is remote row-id fetch latency for a batch, not local
materialization CPU. Do not promot [...]
+
+`SELECT_OPERATOR` applies predicates/projections. Prioritize
predicate/expression/projection active timers and rows filtered/produced.
+
+`TABLE_FUNCTION_OPERATOR` expands rows through table functions. Compare input
rows to output rows and expression/table-function evaluation time.
+
+`ASSERT_NUM_ROWS_OPERATOR` verifies scalar-subquery row count. Usually not a
performance bottleneck; check rows produced and assertion metadata if it
appears.
+
+## Recursive CTE Operators
+
+`REC_CTE_SOURCE_OPERATOR`, `REC_CTE_SINK_OPERATOR`,
`REC_CTE_ANCHOR_SINK_OPERATOR`, and `REC_CTE_SCAN_OPERATOR` implement recursive
CTE data flow. Prioritize iteration counts if present, rows in/out per stage,
materialized state size, and waits between recursive producer/consumer
dependencies. Long waits usually indicate iteration sequencing.
+
+## Result and Output Sinks
+
+`RESULT_SINK_OPERATOR` and `ResultSink` return rows to the client. Prioritize
bytes/rows sent, append/serialize/send timers, and client/network effects. A
slow client can make the result sink look slow.
+
+`RESULT_FILE_SINK_OPERATOR` writes query results to files. Prioritize
`FileWriteTime`, `FileWriterCloseTime`, bytes, file count, and
serialization/compression.
+
+`MEMORY_SCRATCH_SINK_OPERATOR` writes to in-memory scratch/results. Prioritize
conversion timers such as Arrow batch conversion, rows, and bytes.
+
+`BLACKHOLE_SINK_OPERATOR` consumes output without writing. Prioritize
`BytesProcessed`/rows only as volume; it should not be I/O-bound.
+
+## Table and Connector Sinks
+
+`OLAP_TABLE_SINK_OPERATOR`, `OlapTableSinkOperatorX`, and
`OlapTableSinkV2OperatorX` write into Doris storage. Prioritize append/channel
timers, rowset build, close/load/commit timers, tablet/channel skew,
rows/bytes, and memory. Close/commit only proves sink cost when writer timers
agree.
+
+`GROUP_COMMIT_OLAP_TABLE_SINK_OPERATOR` and `GROUP_COMMIT_BLOCK_SINK_OPERATOR`
write through group commit. Prioritize append blocks, group commit
flush/queue/commit timers, rows/bytes, and close/wait states.
+
+`HIVE_TABLE_SINK_OPERATOR`, `ICEBERG_TABLE_SINK_OPERATOR`,
`SPILL_ICEBERG_TABLE_SINK_OPERATOR`, `ICEBERG_DELETE_SINK_OPERATOR`,
`ICEBERG_MERGE_SINK_OPERATOR`, `MAXCOMPUTE_TABLE_SINK_OPERATOR`,
`JDBC_TABLE_SINK_OPERATOR`, and `TVF_TABLE_SINK_OPERATOR` write to external
systems. Prioritize connector write time, file write/close time,
delete-file/write-file counts, commit/close timers, bytes/rows, and spill
counters for spill-capable paths. External system latency may dominate.
+
+`DICTIONARY_SINK_OPERATOR` builds/publishes dictionary data. Prioritize
build/publish/write timers, rows, bytes, and memory.
+
+`CACHE_SINK_OPERATOR` writes cacheable intermediate data. Prioritize bytes
written, cache write timers, tablet/cache metadata, and memory.
+
+## Source Coverage Index
+
+The narrative sections above cover these factory-created plan nodes and data
sinks from the local source inventory:
+
+Plan nodes:
+
+`OLAP_SCAN_NODE`, `GROUP_COMMIT_SCAN_NODE`, `JDBC_SCAN_NODE`,
`FILE_SCAN_NODE`, `EXCHANGE_NODE`, `AGGREGATION_NODE`,
`BUCKETED_AGGREGATION_NODE`, `HASH_JOIN_NODE`, `CROSS_JOIN_NODE`, `UNION_NODE`,
`SORT_NODE`, `PARTITION_SORT_NODE`, `ANALYTIC_EVAL_NODE`,
`MATERIALIZATION_NODE`, `INTERSECT_NODE`, `EXCEPT_NODE`, `REPEAT_NODE`,
`TABLE_FUNCTION_NODE`, `ASSERT_NUM_ROWS_NODE`, `EMPTY_SET_NODE`,
`DATA_GEN_SCAN_NODE`, `SCHEMA_SCAN_NODE`, `META_SCAN_NODE`, `SELECT_NODE`,
`REC_CTE_NODE`, `REC_CTE_ [...]
+
+Data sinks:
+
+`DATA_STREAM_SINK`, `RESULT_SINK`, `DICTIONARY_SINK`,
`GROUP_COMMIT_OLAP_TABLE_SINK`, `GROUP_COMMIT_BLOCK_SINK`, `HIVE_TABLE_SINK`,
`ICEBERG_TABLE_SINK`, `ICEBERG_DELETE_SINK`, `ICEBERG_MERGE_SINK`,
`MAXCOMPUTE_TABLE_SINK`, `JDBC_TABLE_SINK`, `MEMORY_SCRATCH_SINK`,
`RESULT_FILE_SINK`, `MULTI_CAST_DATA_STREAM_SINK`, `BLACKHOLE_SINK`,
`TVF_TABLE_SINK`, `OLAP_TABLE_SINK`.
+
+Operator classes:
+
+`AggSinkOperatorX`, `AggSourceOperatorX`, `AnalyticSinkOperatorX`,
`AnalyticSourceOperatorX`, `AssertNumRowsOperatorX`,
`BucketedAggSinkOperatorX`, `BucketedAggSourceOperatorX`,
`CacheSourceOperatorX`, `DataGenSourceOperatorX`, `DictSinkOperatorX`,
`DistinctStreamingAggOperatorX`, `EmptySetSourceOperatorX`,
`ExchangeSinkOperatorX`, `ExchangeSourceOperatorX`, `FileScanOperatorX`,
`GroupCommitBlockSinkOperatorX`, `GroupCommitOperatorX`,
`HashJoinBuildSinkOperatorX`, `HashJoinProbeOperatorX [...]
+
+Named operators outside the factory list but present in profile/source
registration:
+
+`CACHE_SINK_OPERATOR`, `CACHE_SOURCE_OPERATOR`,
`DISTINCT_STREAMING_AGGREGATION_OPERATOR`, `LOCAL_EXCHANGE_OPERATOR`,
`LOCAL_EXCHANGE_SINK_OPERATOR`, `LOCAL_MERGE_SORT_SOURCE_OPERATOR`,
`MULTI_CAST_DATA_STREAM_SOURCE_OPERATOR`, `PARTITIONED_AGGREGATION_OPERATOR`,
`PARTITIONED_AGGREGATION_SINK_OPERATOR`,
`PARTITIONED_HASH_JOIN_PROBE_OPERATOR`, `PARTITIONED_HASH_JOIN_SINK_OPERATOR`,
`REC_CTE_ANCHOR_SINK_OPERATOR`, `SET_PROBE_SINK_OPERATOR`, `SET_SINK_OPERATOR`,
`SPILL_ICEBERG_TABLE_SINK_OP [...]
+
+## Fallback for Unknown Operator Names
+
+If a profile contains an operator not listed above:
+
+1. Search `source-profile-inventory.md` for the exact operator name or class.
+2. Use its source path to identify the family: scan, exchange, join,
aggregation, sort, sink, set, metadata, or utility.
+3. Classify each counter by `counter-semantics.md`.
+4. Do not infer bottleneck from wait counters without active work or volume
evidence.
diff --git a/.codex/skills/doris-profile-reader/references/reading-workflow.md
b/.codex/skills/doris-profile-reader/references/reading-workflow.md
new file mode 100644
index 00000000000..65934015a91
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/references/reading-workflow.md
@@ -0,0 +1,135 @@
+# Doris Profile Reading Workflow
+
+## 1. Confirm Profile Shape
+
+Start from the query id, elapsed time, SQL, FE `Summary`, `Execution Profile`,
`MergedProfile`, and `DetailProfile`.
+
+- Use `MergedProfile` first to find candidate fragments/operators quickly.
+- Use `DetailProfile` to check skew, per-instance outliers, and whether a
large merged time is just parallel accumulation.
+- Use the plan text to map operator names back to tables, join sides, grouping
keys, exchanges, limits, and runtime-filter direction.
+- If the plan has joins, keep two layers in the analysis: immediate runtime
bottleneck and plan-shape cause. A scan or hash build may be the active cost
while a bad join order/RF direction is the reason that cost exists.
+- Keep query-lifecycle time separate from BE operator work. FE `Plan Time`,
`Schedule Time`, `Wait and Fetch Result Time`, and client/result fetch time can
explain wall-clock latency, especially for small or empty queries, but they are
not scan/join/aggregation CPU. If they dominate, report them as a separate
lifecycle finding before ranking BE operators.
+- If `Is Cached: Yes`, `Total Instances Num: 0`, or the BE detail is absent,
the profile is not enough for operator bottleneck analysis. Re-run with profile
enabled and SQL cache disabled.
+
+## 2. Build a Counter Classification Before Ranking
+
+Classify each large counter before interpreting it:
+
+- Active work: `ExecTime` after qualifying child wait evidence, plus direct
operator custom timers such as scan CPU, hash table build/probe, sort,
aggregation, serialization, compression, I/O, decompression, spill read/write.
+- Data volume: `InputRows`, `RowsProduced`, `RowsRead`, `ScanRows`,
`ScanBytes`, bytes sent/received, blocks produced.
+- Wait/backpressure: `WaitForDependencyTime`, `WaitForDependency[...]Time`,
`WaitForDataN`, `DataArrivalWaitTime`, `FirstBatchArrivalWaitTime`,
`WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`, local-exchange buffer waits,
`PendingFinishDependency`, `WaitWorkerTime`, memory reserve waits.
+- Resource pressure: memory peak, spill counters, scanner queue wait,
remote/local bytes, cache miss/read timers.
+- Optimizer/context info: join type, colocate/shuffle flags, partitioner,
grouping keys, runtime-filter info, predicates, projections.
+- Query lifecycle/context: FE plan time, schedule time, wait/fetch result
time, result sink/client fetch time. Useful for wall-clock explanation, not
proof of BE operator cost.
+
+Only active work, data volume, and resource pressure can directly support a
bottleneck claim. Wait/backpressure counters explain pipeline shape and
symptoms, not direct operator cost by themselves.
+
+## 3. Rank Candidate Bottlenecks
+
+Use this priority order:
+
+1. Query elapsed and the largest active operator families.
+2. Large rows/bytes that explain the active time.
+3. Skew in `max` versus `avg` or a single instance much slower than peers.
+4. Spill, memory peak, and queue/thread waits.
+5. Runtime-filter effectiveness and wait.
+6. Network/exchange serialization, compression, and remote bytes.
+7. Init/open/close only if they remain large after excluding parallel
accumulation and waits.
+
+Do not rank by the largest wait-like counter. A profile where
`EXCHANGE_OPERATOR WaitForData0` is the largest counter commonly means the
exchange waited for upstream data.
+
+When several scans or branches each contribute meaningful rows, bytes, memory,
or active time, do not force a single-operator bottleneck. First summarize the
branch family or operator family, then name the strongest member only as an
example or leading contributor.
+
+After ranking the direct cost, perform a join-order pass for multi-join plans.
Identify build/probe sides, RF source/target sides, and whether a more
selective branch was scheduled too late to prune a large scan or build. Use
`join-order-diagnosis.md` for the decision rules.
+
+## 4. Interpret Merged Timers Correctly
+
+Merged profile values often include `sum`, `avg`, `max`, and `min`. Treat them
differently:
+
+- `sum` can exceed wall-clock elapsed time because it accumulates across
parallel fragments, drivers, scan ranges, and scanner threads.
+- `max` is usually better for wall-clock suspicion.
+- `avg` plus `max` reveals skew.
+- `min` is useful only for skew context.
+
+If an operator appears many times, a long summed `ExecTime` may represent
normal parallel CPU. Compare per-instance `max` to query elapsed and compare
rows/bytes per instance.
+
+## 5. Read Pipeline Waits as Dependencies
+
+Pipeline tasks register wait/dependency counters for start prerequisites,
upstream data readiness, downstream write availability, memory reservation,
runtime filters, and local/remote queues. These counters answer "what was this
driver waiting for", not "where CPU was spent".
+
+Examples:
+
+- `WaitForDependency[OLAP_SCAN_OPERATOR_DEPENDENCY]Time`: scan source was
blocked by its dependency state.
+- `WaitForDependency[HASH_JOIN_BUILD_DEPENDENCY]Time`: probe/build sequencing
or shared build state dependency.
+- `WaitForData0`: exchange source waited for sender queue data.
+- `WaitForRpcBufferQueue`: exchange sink waited for RPC buffer capacity.
+- `WaitForLocalExchangeBufferN`: sender waited for a local exchange channel.
+- `WaitWorkerTime`: pipeline task waited for worker scheduling.
+- `ScannerWorkerWaitTime`: scanners waited in the scan worker pool.
+
+Large waits are actionable only after you identify the upstream/downstream
resource that caused them.
+
+## 6. Diagnose by Operator Family
+
+Use the operator guide and choose the family-specific proof:
+
+- Scan bottleneck: high scan active time plus large `ScanRows`/`ScanBytes`,
high `ScannerCpuTime`, I/O/decompression/lazy-read/predicate timers, or scanner
queue wait.
+- Scan/RF latency trap: if scan `ExecTime` is close to `RuntimeFilterInfo`
`RFx WaitTime`, `AcquireRuntimeFilter`, or `WaitForRuntimeFilter`, treat the
scan as waiting for filters, not spending CPU in storage. Detail profiles can
expose `WaitForRuntimeFilter` even when merged `RFx WaitTime` is zero or the
filter timed out. `TOPN OPT`, `TopNFilterSourceNodeIds`, or `SharedPredicate`
on a scan should be read as the same filter-wait family before blaming storage
CPU. Then decide whether the [...]
+- Join bottleneck: high build/probe active time, large build/probe rows, hash
table memory, non-equi conjunct time, runtime filter publish/build, or spill in
partitioned join.
+- Aggregation bottleneck: high build/merge/hash-table times, large
`InputRows`, large hash table size/memory, or spill.
+- Sort/window bottleneck: high sort/evaluation active time, large input rows,
memory pressure, or spill.
+- Exchange bottleneck: high serialization/compression/send/receive/merge time,
high `RpcMaxTime`/`RpcAvgTime` on `DATA_STREAM_SINK_OPERATOR`, and bytes.
Data-arrival waits alone are not enough.
+- Sink/write bottleneck: high append/write/commit/close/load timers and
bytes/rows. Output `PendingFinishDependency` alone is not enough.
+
+For repeated CTEs, repeated common subplans, expanded `UNION ALL` branches, or
scalar-subquery fan-in plans, do not summarize from one representative branch.
Build a small branch inventory: table/source, scan rows/bytes, rows produced,
major join/aggregation rows, memory/spill evidence, and whether the apparent
time is active work or RF/dependency wait. Then aggregate the pattern mentally.
Repeated branches can make the true bottleneck "many medium scans plus RF
waits" rather than one si [...]
+
+Small dimension scans often show high apparent `ExecTime` or detail scanner
timers because they waited for runtime filters or because merged counters
accumulate across many tiny scanners. Before promoting a dimension scan over a
fact branch, compare rows/bytes, branch role, output impact, and whether the
timer is mostly RF wait. If the fact branches dominate row volume or downstream
aggregation, report the dimension scan as supporting/filtering work instead of
the main bottleneck.
+
+## 7. Runtime Filter Pass
+
+If runtime filters appear, always answer:
+
+- Which join/build side produced the filter.
+- Which scan/probe side applied it.
+- Whether it filtered rows, was always true, arrived too late, or forced
meaningful wait.
+- Whether a long wait is justified by saved scan work.
+
+See `runtime-filters.md`.
+
+## 8. Join Order Pass
+
+For every multi-join profile, always answer:
+
+- Which side each important join builds on, and which side probes.
+- Whether the build side is unexpectedly large compared with the probe/other
available side.
+- Whether runtime filters flow from a selective side to a large scan early
enough to save work.
+- Whether paired profiles, hints, rewritten SQL, memo estimates, or actual
rows prove a better order.
+- Whether memo contains an unchosen reversed join expression, or
schema/distribution keys show that the current RF source should instead be an
RF target.
+
+If the active bottleneck is a scan or hash build that would disappear under a
better join order, report the scan/build as the runtime bottleneck and the join
order/RF direction as the likely root cause.
+
+When only a slow profile is available, still make a join-order judgment. Use
"likely" when the evidence is circumstantial, but do not answer "not proven"
merely because there is no paired fast profile. Empty probes, empty final
results after huge build/scan work, huge intermediates later eliminated, and
RF-source inversion are enough to call the order likely bad when build/probe/RF
sides and row counts are visible.
+
+Use this red-flag pass before writing the conclusion:
+
+- Expensive RF source emits empty/tiny RF and all targets skip: join order/RF
direction is likely bad, not proven good.
+- Huge build/shuffle before zero/tiny probe or result: build/probe order or
build scheduling is likely bad, even if another predicate explains why the
probe is empty.
+- Huge intermediate before a contradictory or highly selective inner/non-equi
join: join order/predicate placement is bad.
+- Tiny semi/subquery key set is applied after a large fact scan/build: join
order/RF timing is likely bad.
+
+If one of those red flags matches, the conclusion must say `bad` or `likely
bad`. Use `not proven` only for the exact alternate legal plan or repair, not
for whether the observed order is wrong.
+
+If your reasoning says "plan shape", "predicate placement", "empty probe",
"late contradictory predicate", or "expensive RF source", translate that into
an explicit join-order diagnosis. The required output is not complete until it
says whether the join order/build-probe/RF direction is good, suspicious, or
bad.
+
+See `join-order-diagnosis.md`.
+
+## 9. Output Standard
+
+A good profile explanation is evidence-bounded:
+
+- Quote exact operator and counter names.
+- State whether each quoted timer is active, wait, or accumulated.
+- Tie scan/join/exchange metrics to table or plan node when possible.
+- For joins, include a short build/probe/RF-direction judgment, even when the
direct bottleneck is a scan.
+- Avoid generic tuning advice unless the profile proves the need.
+- End with missing evidence if the profile is incomplete.
diff --git a/.codex/skills/doris-profile-reader/references/runtime-filters.md
b/.codex/skills/doris-profile-reader/references/runtime-filters.md
new file mode 100644
index 00000000000..03ad91d19a3
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/references/runtime-filters.md
@@ -0,0 +1,92 @@
+# Runtime Filter Interpretation
+
+## Direction
+
+Doris runtime filters are generated from join build-side expressions and
applied on target/probe-side scan expressions when the planner can map the
slots.
+
+Plan notation:
+
+- `RF000[min_max] <- build_expr`: the filter is produced by this join/build
side.
+- `RF000[min_max] -> target_expr`: the filter is applied to this scan/target
side.
+- `RFs=[RF0 build->target]`: physical plan join-side relationship.
+- `JRFs= RF0` on a scan: the scan has joined runtime filters to apply.
+
+Do not invert these. The arrow into the RF (`<-`) is source/build. The arrow
out of the RF (`->`) is target/scan.
+
+## Where to Look in Profiles
+
+Build/source side:
+
+- Join sink or build operator `RuntimeFilterInfo`.
+- `PublishTime` when publishing filters is visible.
+- Build-side row count, join type, hash table size, and build time.
+
+Target/scan side:
+
+- Scan `WaitForRuntimeFilter` or `AcquireRuntimeFilter`.
+- `RuntimeFilterInfo` entries such as `RF0 InputRows`, `RF0 FilterRows`, `RF0
AlwaysTrueFilterRows`, `RF0 WaitTime`.
+- Top-N or dynamic filter evidence such as `TOPN OPT`,
`TopNFilterSourceNodeIds`, and `SharedPredicate`. These can add
`WaitForRuntimeFilter` even when individual merged `RFx WaitTime` counters are
zero or the filter state is `TIMEOUT`.
+- Scan rows/bytes and row-filter counters before deciding whether the RF
helped.
+- If scan `ExecTime` is approximately equal to RF wait counters, the scan was
mostly waiting for runtime filters. Do not report that as scanner CPU or
storage I/O without supporting `ScannerCpuTime`, `IOTimer`,
`DecompressorTimer`, or `ScanBytes` evidence.
+- `RFx WaitTime`, scan-level `WaitForRuntimeFilter`, `AcquireRuntimeFilter`,
and scanner child timers are not guaranteed to be additive. They can overlap,
be per-filter, or be accumulated at different scopes. Use them to classify
latency and causality, not to compute exact residual CPU by subtraction.
+- `RFx InputRows`, `RFx FilterRows`, and `RFx AlwaysTrueFilterRows` are
per-filter/parallel counters. Do not add all filters together as if each row
were counted once.
+- Storage-pushed RF pruning may be visible as `RowsKeyRangeFiltered`,
`RowsStatsFiltered`, `RowsZoneMapRuntimePredicateFiltered`,
`RowsBloomFilterFiltered`, or related scan-detail counters. A zero `RFx
FilterRows` does not by itself prove the RF was useless.
+
+Exchange/sink side:
+
+- Runtime filters may be carried through data stream sinks between fragments.
Use the plan to connect source and target.
+- Multicast/local-exchange sources can expose runtime-filter waits while
distributing data to multiple consumers. Classify those waits as
consumer/filter synchronization first; then identify which downstream scan or
branch actually benefited.
+
+## Effectiveness Rules
+
+Useful filter:
+
+- `FilterRows` is large relative to `InputRows`.
+- Scan `RowsRead`/downstream rows are much lower than `ScanRows`.
+- Scan-detail pruning counters such as key-range, zone-map, bloom, stats, or
condition-filter rows are large.
+- Filter wait is small compared with saved scan/join work.
+- The target table is large enough that pruning matters.
+- The filter is produced by a small/selective build side early enough to avoid
large target-side scan or join-build work.
+
+Weak or useless filter:
+
+- `AlwaysTrueFilterRows` dominates.
+- `FilterRows = 0` or tiny relative to `InputRows`.
+- `WaitTime` is visible but no meaningful rows are filtered.
+- Filter arrives after most scan work, or scan had no large table to prune.
+
+Potentially harmful filter:
+
+- Long `WaitForRuntimeFilter`/`AcquireRuntimeFilter` and little filtering.
+- Scan `ExecTime` dominated by `RFx WaitTime`, especially on a small dimension
table or a branch whose target-side filtering is weak.
+- Broadcast or build side is slow, causing scan-side idle time without saving
I/O.
+- Many filters add predicate overhead while scan rows are already small.
+- The filter source is available only after the large table has already been
scanned or built into a hash table; this usually points to a
join-order/RF-direction problem rather than a scan implementation problem.
+- The expensive scan produces the RF, while other large scans wait for it and
then skip. If the RF source had to scan huge data to produce an empty/tiny
filter, question the source/target choice; the source side may be the side that
should have received an earlier RF instead.
+- Empty/tiny filter output after a massive source scan is a red flag, not a
success signal. Classify the target-side skip as saved downstream work, but
classify the source-side massive work as likely bad RF direction unless
concrete plan evidence rules out an earlier pruning direction.
+
+## Filter Types
+
+- Bloom/in filter: good for equality join pruning; check false positive and
target cardinality indirectly by rows filtered.
+- Min/max filter: cheap range pruning; useful only when build-side min/max is
selective for the target.
+- Bitmap filter: specialized nested-loop/bitmap path; check build bitmap
construction and target filtering.
+- IN/IN_OR_BLOOM variants: planner/runtime chooses representation based on
cardinality and settings.
+
+The filter type alone does not prove benefit. Always use target-side rows and
wait.
+
+## Common Mistakes
+
+- Mistaking source/build side for target/scan side because of arrows.
+- Treating `WaitForRuntimeFilter` as scan CPU.
+- Calling a runtime filter effective from its existence only.
+- Treating an active large scan as a pure scan problem without asking whether
an earlier RF from a different join order should have pruned it.
+- Calling an empty RF useful without asking whether the plan paid too much to
produce that empty RF.
+- Calling a target-side skip proof of good order while ignoring that
source-side scan/build dominated the query.
+- Ignoring `AlwaysTrueFilterRows`, which often means the filter was not
selective.
+- Ignoring that filter counters may be accumulated across parallel scan
instances.
+- Calling a filter useless only because one `RFx FilterRows` is zero while
scan-detail key-range, zone-map, bloom, or stats pruning removed rows.
+- Subtracting RF wait counters mechanically from `ExecTime` or from each other
despite overlapping scopes.
+
+## Example Pattern
+
+In one calibration profile, the plan showed min/max filters like
`RF000[min_max] <- build_key` at the build side and `RF000[min_max] ->
scan_key` at a scan. The profile had `RuntimeFilterInfo` with RF wait/filter
counters, but the filters waited only tens of milliseconds and filtered no
meaningful rows. The real evidence was large scan volume and active
aggregation/exchange work, not runtime filter wait.
diff --git
a/.codex/skills/doris-profile-reader/scripts/extract_source_profile_inventory.py
b/.codex/skills/doris-profile-reader/scripts/extract_source_profile_inventory.py
new file mode 100755
index 00000000000..55e9097821a
--- /dev/null
+++
b/.codex/skills/doris-profile-reader/scripts/extract_source_profile_inventory.py
@@ -0,0 +1,158 @@
+#!/usr/bin/env python3
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Extract Doris profile operator/counter names from the current source
tree."""
+
+from __future__ import annotations
+
+import argparse
+import re
+from collections import defaultdict
+from pathlib import Path
+
+
+PROFILE_CONST_RE = re.compile(r'inline constexpr
char\s+([A-Z0-9_]+)\[\]\s*=\s*"([^"]+)"')
+MACRO_RE = re.compile(
+
r"\b(ADD_(?:CHILD_)?(?:TIMER|COUNTER)(?:_WITH_LEVEL)?|add_nonzero_counter|add_info_string|add_derived_counter)\s*\("
+)
+STRING_RE = re.compile(r'"([^"]+)"')
+NAME_ASSIGN_RE = re.compile(r'\b(?:_name|_op_name)\s*=\s*"([^"]+)"')
+CLASS_RE =
re.compile(r"\bclass\s+([A-Za-z0-9_]*OperatorX|[A-Za-z0-9_]*LocalState)\b")
+FACTORY_RE = re.compile(r"std::make_shared<([A-Za-z0-9_]+OperatorX)>")
+CASE_RE = re.compile(r"case\s+(T(?:PlanNode|DataSink)Type::[A-Z0-9_]+)")
+
+
+def rel(path: Path, root: Path) -> str:
+ return str(path.relative_to(root))
+
+
+def load_profile_constants(root: Path) -> dict[str, str]:
+ header = root / "be/src/runtime/runtime_profile_counter_names.h"
+ constants = {}
+ for match in PROFILE_CONST_RE.finditer(header.read_text(encoding="utf-8")):
+ constants[f"profile::{match.group(1)}"] = match.group(2)
+ return constants
+
+
+def extract_macro_args(line: str, start: int) -> str:
+ depth = 0
+ out = []
+ for char in line[start:]:
+ if char == "(":
+ depth += 1
+ elif char == ")":
+ depth -= 1
+ if depth == 0:
+ out.append(char)
+ break
+ out.append(char)
+ return "".join(out)
+
+
+def counter_name_from_args(args: str, constants: dict[str, str]) -> str | None:
+ strings = STRING_RE.findall(args)
+ if strings:
+ # For ADD_CHILD_TIMER(profile, child, parent), the first string is the
counter name.
+ return strings[0]
+ for const_name, value in constants.items():
+ if const_name in args:
+ return value
+ return None
+
+
+def main() -> int:
+ parser = argparse.ArgumentParser(description=__doc__)
+ parser.add_argument("--repo", default="/mnt/disk3/zhaochangle/doris")
+ parser.add_argument("--output",
default="/tmp/doris-profile-reader/source-profile-inventory.md")
+ args = parser.parse_args()
+
+ root = Path(args.repo)
+ constants = load_profile_constants(root)
+ source_roots = [
+ root / "be/src/exec/operator",
+ root / "be/src/exec/pipeline",
+ root / "be/src/exec/scan",
+ root / "be/src/exec/exchange",
+ root / "be/src/exec/sink",
+ root / "be/src/storage",
+ root / "be/src/runtime",
+ root / "fe/fe-core/src/main/java/org/apache/doris/common/profile",
+ root /
"fe/fe-core/src/main/java/org/apache/doris/nereids/processor/post",
+ root / "fe/fe-core/src/main/java/org/apache/doris/planner",
+ ]
+ files = []
+ for source_root in source_roots:
+ if source_root.exists():
+ files.extend(p for p in source_root.rglob("*") if p.suffix in
{".cpp", ".h", ".java"})
+
+ counters: dict[str, list[str]] = defaultdict(list)
+ names: dict[str, list[str]] = defaultdict(list)
+ classes: dict[str, list[str]] = defaultdict(list)
+ factories: dict[str, list[str]] = defaultdict(list)
+ cases: dict[str, list[str]] = defaultdict(list)
+
+ for path in sorted(files):
+ text = path.read_text(encoding="utf-8", errors="ignore")
+ relative = rel(path, root)
+ for line_no, line in enumerate(text.splitlines(), 1):
+ for match in MACRO_RE.finditer(line):
+ args_text = extract_macro_args(line, match.end() - 1)
+ counter_name = counter_name_from_args(args_text, constants)
+ if counter_name:
+ counters[counter_name].append(f"{relative}:{line_no}")
+ for match in NAME_ASSIGN_RE.finditer(line):
+ names[match.group(1)].append(f"{relative}:{line_no}")
+ for match in CLASS_RE.finditer(line):
+ classes[match.group(1)].append(f"{relative}:{line_no}")
+ for match in FACTORY_RE.finditer(line):
+ factories[match.group(1)].append(f"{relative}:{line_no}")
+ for match in CASE_RE.finditer(line):
+ cases[match.group(1)].append(f"{relative}:{line_no}")
+
+ lines = ["# Doris Source Profile Inventory", ""]
+ lines.append("## Factory Cases")
+ for key in sorted(cases):
+ lines.append(f"- `{key}`: {', '.join(cases[key][:6])}")
+ lines.append("")
+
+ lines.append("## Operator Name Assignments")
+ for key in sorted(names):
+ lines.append(f"- `{key}`: {', '.join(names[key][:6])}")
+ lines.append("")
+
+ lines.append("## Operator Classes In Factory")
+ for key in sorted(factories):
+ lines.append(f"- `{key}`: {', '.join(factories[key][:6])}")
+ lines.append("")
+
+ lines.append("## Counter And Info Names")
+ for key in sorted(counters):
+ refs = counters[key]
+ lines.append(f"- `{key}` ({len(refs)} registrations): {',
'.join(refs[:8])}")
+ lines.append("")
+
+ output = Path(args.output)
+ output.parent.mkdir(parents=True, exist_ok=True)
+ output.write_text("\n".join(lines), encoding="utf-8")
+ print(f"wrote {output}")
+ return 0
+
+
+if __name__ == "__main__":
+ raise SystemExit(main())
diff --git a/AGENTS.md b/AGENTS.md
index 34617013274..c6a2d6ae119 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -65,6 +65,8 @@ Added regression tests must comply with the following
standards:
Files in git commit should only be related to the current modification task.
Environment modifications for running (e.g., `conf/`, `AGENTS.md`, `hooks/`,
etc.) must not be `git add`ed. When delivering the final task, you must ensure
all actual code modifications have been committed.
+If the BE code has been modified, it must be formatted using the correct skill
before committing.
+
Commit messages must follow the format below, which mirrors the PR template
(`.github/PULL_REQUEST_TEMPLATE.md`):
```
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]