(doris) branch master updated: [Feature](skill) Introduce Doris profile reader skill (#63948)

zclll Mon, 01 Jun 2026 01:07:11 -0700

This is an automated email from the ASF dual-hosted git repository.

zclllyybb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git



The following commit(s) were added to refs/heads/master by this push:
     new f68eda67aad [Feature](skill) Introduce Doris profile reader skill 
(#63948)
f68eda67aad is described below

commit f68eda67aadf25c767c1883aa843fe011997f0b6
Author: zclllyybb <[email protected]>
AuthorDate: Mon Jun 1 16:06:52 2026 +0800

    [Feature](skill) Introduce Doris profile reader skill (#63948)
    
    Add script to extract Doris profile operator/counter names and update
    AGENTS.md
    
    - Introduced a new Python script `extract_source_profile_inventory.py`
    that extracts operator and counter names from the Doris source tree,
    generating a markdown inventory.
    - Updated AGENTS.md to include a requirement for formatting BE code with
    the correct skill before committing if modifications are made.
---
 .claude/skills/be-code-style/SKILL.md              |   2 +-
 .codex/skills/doris-profile-reader/SKILL.md        |  51 +++++
 .../skills/doris-profile-reader/agents/openai.yaml |  21 ++
 .../references/counter-semantics.md                | 244 +++++++++++++++++++++
 .../references/join-order-diagnosis.md             | 111 ++++++++++
 .../references/operator-guide.md                   | 140 ++++++++++++
 .../references/reading-workflow.md                 | 135 ++++++++++++
 .../references/runtime-filters.md                  |  92 ++++++++
 .../scripts/extract_source_profile_inventory.py    | 158 +++++++++++++
 AGENTS.md                                          |   2 +
 10 files changed, 955 insertions(+), 1 deletion(-)

diff --git a/.claude/skills/be-code-style/SKILL.md 
b/.claude/skills/be-code-style/SKILL.md
index bd1f4a80f78..74fb272afde 100644
--- a/.claude/skills/be-code-style/SKILL.md
+++ b/.claude/skills/be-code-style/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: be-code-style
-description: Fix BE (C++) code formatting issues using clang-format
+description: Fix BE (C++) code formatting issues using clang-format. Must use 
this when your changes involve BE code before submitting the commit.
 compatibility: opencode
 ---
 
diff --git a/.codex/skills/doris-profile-reader/SKILL.md 
b/.codex/skills/doris-profile-reader/SKILL.md
new file mode 100644
index 00000000000..6b7057e7fee
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/SKILL.md
@@ -0,0 +1,51 @@
+---
+name: doris-profile-reader
+description: Interpret Apache Doris query runtime profiles, especially profile 
bottleneck triage, misleading wait counters, per-operator metric priority, 
scan, join-order/runtime-filter analysis, and evidence-bounded performance 
explanations. Use when given a Doris profile, query id, profile URL/text, or a 
request to explain Doris query performance.
+---
+
+# Doris Profile Reader
+
+## Purpose
+
+Use this skill to identify the real bottleneck in an Apache Doris query 
runtime profile. The core rule is to separate active work from dependency, 
queue, and backpressure waits before naming an operator as expensive. When the 
plan contains joins, also separate the immediate runtime bottleneck from the 
plan-shape cause, especially bad join order and runtime-filter direction.
+
+## Required Reading Order
+
+1. Read `references/reading-workflow.md` for the analysis workflow and output 
contract.
+2. Read `references/counter-semantics.md` for counter meaning and priority, 
especially wait counters.
+3. Read `references/operator-guide.md` for the relevant operator family.
+4. Read `references/join-order-diagnosis.md` when the profile or plan has 
multiple joins, a large hash/nested-loop build, a large scan that might have 
been pruned by a join, paired fast/slow plans, hints/reordered SQL, or a 
request about join shape/reorder.
+5. Read `references/runtime-filters.md` when a profile or plan includes 
`RuntimeFilterInfo`, `RF... <-`, `RF... ->`, `JRFs`, `WaitForRuntimeFilter`, or 
`AcquireRuntimeFilter`.
+6. Use `references/source-profile-inventory.md` as the source-backed 
operator/counter inventory. If a counter or operator is not in the narrative 
docs, do not ignore it; look it up in this inventory and classify it by the 
rules in `counter-semantics.md`.
+
+## Non-Negotiable Interpretation Rules
+
+- Do not call `WaitForDependencyTime`, `WaitForDependency[...]Time`, 
`WaitForData0`, `WaitForDataN`, `WaitForRpcBufferQueue`, 
`WaitForBroadcastBuffer`, `PendingFinishDependency`, or pipeline blocked/wait 
counters direct operator compute cost. They are dependency, data-arrival, 
queue, memory, or backpressure signals.
+- Do not rank operators by the largest wait-like counter alone. First rank by 
`ExecTime`/active timers, direct custom timers, rows/bytes, memory/spill, and 
skew.
+- Treat merged-profile `sum` timers as accumulated across parallel instances. 
A timer can exceed query elapsed time and still be normal when many scanners, 
drivers, or fragments run in parallel.
+- For scans, prioritize `RowsRead`, `ScanRows`, `ScanBytes`, `ScannerCpuTime`, 
`ScannerGetBlockTime`, `ScannerWorkerWaitTime`, I/O/decompression timers, 
predicate/lazy-read timers, and row-filter counters. `ScannerWorkerWaitTime` is 
important, but it indicates scanner scheduling/thread-pool wait rather than 
scan CPU.
+- For runtime filters, distinguish source/build side from target/probe scan 
side. In plan text, `RFxxx <- expr` is produced from the build side; `RFxxx -> 
expr` is applied at a target scan. In profiles, `RuntimeFilterInfo` and 
scan-side wait/filter counters decide whether the RF helped or just waited.
+- For joins, always identify build side and probe/target side before judging 
the order. A scan can be the immediate active bottleneck while the root cause 
is still bad join order if a selective join/RF source is scheduled too late to 
prune that scan.
+- Do not require a paired fast profile before naming likely bad join order. A 
single slow profile can be enough when it proves large wasted build/scan work 
before an empty probe/result, an RF source side that had to scan massively 
before it could emit an empty/tiny filter, or a huge intermediate later 
eliminated by a highly selective or contradictory join predicate.
+- Do not treat "the current RF made other scans wait and then skip" as proof 
that the join order is good. If producing that empty/tiny RF required the only 
expensive scan, the RF source/target choice is itself the join-order question.
+- When a large build/source side is paid before an empty/tiny probe, preserved 
side, semi-join key set, or contradictory join can eliminate the result, call 
the build/probe order likely bad unless the profile proves that ordering is 
semantically forced and no earlier pruning/short-circuit is possible.
+- If a strong single-profile join-order pattern matches, do not hedge as 
"suspicious", "close to likely bad", or "not proven". Use `likely bad` when a 
better legal order still needs validation, and reserve `not proven` for the 
exact alternate shape, not for the join-order diagnosis itself.
+- When a join query has plan-shape evidence, the answer must explicitly judge 
join order/build-probe/RF direction as good, suspicious, or bad. Do not replace 
that judgment with a vague "predicate issue" or "plan shape issue".
+- A long `InitTime`, `OpenTime`, `CloseTime`, or profile total can matter, but 
only after confirming it is not accumulated across many instances and not 
dominated by a known wait/dependency branch.
+
+## Standard Answer Shape
+
+When explaining a profile, answer in this order:
+
+1. `Conclusion`: one or two sentences naming the likely bottleneck and, when 
joins are involved, whether the plan shape/join order is likely the cause.
+2. `Evidence`: profile counters with operator names, values, and whether each 
is active work, data volume, wait/backpressure, memory/spill, or runtime-filter 
evidence.
+3. `Reasoning`: how the evidence maps to the execution plan, which side is 
build/probe or RF source/target, why misleading counters are discounted, and 
whether the join order is reasonable.
+4. `Next checks`: the smallest additional profile/log/code checks needed if 
the conclusion is still uncertain.
+
+Always preserve uncertainty. Use "proven", "likely", and "not proven" 
explicitly when the profile lacks enough detail.
+
+## Scripts
+
+- `scripts/extract_source_profile_inventory.py`: scan Doris source for 
factory-created operators and counter/info registrations.
+
+Scripts are evidence-generation helpers. They do not replace the reading 
workflow.
diff --git a/.codex/skills/doris-profile-reader/agents/openai.yaml 
b/.codex/skills/doris-profile-reader/agents/openai.yaml
new file mode 100644
index 00000000000..34aa432dfbb
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/agents/openai.yaml
@@ -0,0 +1,21 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+interface:
+  display_name: "Doris Profile Reader"
+  short_description: "Interpret Doris profiles, joins, and waits"
+  default_prompt: "Use $doris-profile-reader to analyze this Apache Doris 
query profile, discount misleading wait counters, identify the real bottleneck, 
and judge join order/build-probe/runtime-filter shape with counter evidence."
diff --git a/.codex/skills/doris-profile-reader/references/counter-semantics.md 
b/.codex/skills/doris-profile-reader/references/counter-semantics.md
new file mode 100644
index 00000000000..7a254a19ad7
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/references/counter-semantics.md
@@ -0,0 +1,244 @@
+# Counter Semantics
+
+This file defines how to interpret every counter class. For individual counter 
names and source locations, use `source-profile-inventory.md`.
+
+## Universal Priority
+
+High-signal counters:
+
+- `ExecTime`: time inside the operator. In merged profiles, check `max` and 
per-instance detail before comparing to query elapsed. Before calling it active 
CPU, qualify it with child wait evidence such as runtime-filter wait, 
data-arrival wait, dependency wait, scanner worker wait, or spill task queue 
wait. Do not mechanically subtract wait counters from `ExecTime`; scan-level 
`WaitForRuntimeFilter`, per-filter `RFx WaitTime`, scanner child timers, and 
merged operator timers may overlap  [...]
+- `InputRows`, `RowsProduced`, `RowsRead`, `ScanRows`, `ScanBytes`, bytes 
sent/received: data volume that explains active time.
+- Operator-specific direct timers: scan CPU/I/O/decompression, hash table 
build/probe, aggregation build/merge, sort, analytic evaluation, 
serialization/compression, spill read/write.
+- `MemoryUsage`, `MemoryUsagePeak`, `MemoryUsageHashTable`, 
`MemoryUsageBuildBlocks`, `MemoryUsageSortBlocks`: resource footprint.
+- Spill counters: direct evidence that memory pressure changed the execution 
path.
+
+Low-directness counters:
+
+- `WaitForDependencyTime` and `WaitForDependency[...]Time`: dependency wait. 
Important for pipeline causality, not direct compute.
+- `WaitForData0` and `WaitForDataN`: exchange receiver queue data wait. 
Usually upstream did not produce data yet.
+- `WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`, 
`WaitForLocalExchangeBufferN`: downstream or channel backpressure.
+- `PendingFinishDependency`: sink finish dependency. Do not call it sink work.
+- `WaitWorkerTime`, `BlockedByDependency`, `MemoryReserveTimes`, 
`MemoryReserveFailedTimes`: scheduler/dependency/memory state.
+- `FirstBatchArrivalWaitTime`, `DataArrivalWaitTime`: data-arrival symptom; 
use with upstream operator evidence.
+- FE `Plan Time`, `Schedule Time`, `Wait and Fetch Result Time`, and 
client/result fetch time: query lifecycle or client-facing timing. They can 
explain end-to-end latency, but do not rank them as BE operator active work.
+
+## Common Operator Counters
+
+- `InitTime`: local state/profile initialization. Usually not the bottleneck 
unless large in per-instance detail.
+- `OpenTime`: opening child/local state/resources. Can matter for external 
scans/sinks or cold metadata, but verify with custom counters.
+- `ExecTime`: primary active operator timer.
+- `CloseTime`: close/finalization. For sinks it may include writer close, 
publish, commit, or file close; inspect custom sink timers.
+- `ProjectionTime`: expression/projection after core operator work. Important 
if high with many output columns or expensive expressions.
+- `RowsProduced`: source/operator output rows.
+- `InputRows`: sink input rows.
+- `BlocksProduced`, `OutputBlockBytes`, `MaxOutputBlockBytes`, 
`MinOutputBlockBytes`: vectorized block output volume and block-size skew.
+- `MemoryUsage`, `MemoryUsagePeak`: current/peak memory tracked by the 
operator.
+- `IsShuffled`, `IsColocate`, `FollowedByShuffledOperator`: plan/execution 
metadata, not cost by itself.
+- `PendingFinishDependency`: wait for sink finalization dependency, not active 
work.
+
+## Pipeline Task Counters
+
+- `TaskCpuTime`: accumulated CPU attributed to a pipeline task. It can exceed 
wall time when summed.
+- `ExecuteTime`: task execution loop time.
+- `PrepareTime`, `OpenTime`, `GetBlockTime`, `SinkTime`, `CloseTime`: task 
lifecycle partitions.
+- `WaitWorkerTime`: time waiting for worker scheduling.
+- `NumScheduleTimes`, `NumYieldTimes`, `CoreChangeTimes`: scheduling behavior 
and task migration/yield signals.
+- `MemoryReserveTimes`, `MemoryReserveFailedTimes`: memory reservation 
pressure; pair with blocked time and memory peaks.
+- `BlockedByDependency`: count/state of dependency blocking, not CPU.
+
+## Time Counter Name Patterns
+
+- `*Time`, `*Timer`, `*Costs`: elapsed or accumulated time for the named 
action. Determine whether it is direct work or wait by the action name.
+- `Wait*`, `*Wait*`, `Blocked*`, `Pending*`: wait/dependency/backpressure 
unless source code proves otherwise.
+- `*CpuTime`: CPU-side work. Can be accumulated across threads.
+- `*IOTimer`, `*ReadFileTime`, `*WriteFileTime`, `FileWriteTime`: I/O path 
work or I/O wait. Use bytes and cache counters to interpret.
+- `*Serialize*`, `*Deserialize*`, `CompressTime`, `DecompressorTimer`: CPU 
transformation cost.
+- `*EvalTime`, `ExprTime`, `ProjectionTime`, `VectorPredEvalTime`: expression 
or predicate CPU.
+
+## Count/Volume Counter Name Patterns
+
+- `*Rows`, `*RowsRead`, `*InputRows`, `*RowsProduced`: row volume.
+- `*Bytes`, `*MemBytes`, `ScanBytes`, `CompressedBytesRead`, 
`UncompressedBytesRead`: byte volume or memory/storage size.
+- `*Count`, `*Num`, `*Times`: event count.
+- `*Filtered`, `*FilteredRows`, `Rows*Filtered`: rows skipped or filtered. 
High values can be good if they avoid downstream work.
+- `*CacheHit*`, `*CacheMiss*`: cache effectiveness. Interpret with I/O timers.
+- `*Peak`: peak observed value.
+
+## Scan Counters
+
+Common scan:
+
+- `RowsRead`: rows returned by the scanner after scan-side pruning/filtering.
+- `ScanRows`: rows scanned/read from storage or source. Larger than `RowsRead` 
means scan-side filtering/pruning removed rows.
+- `ScanBytes`: bytes scanned. In OLAP scan source comments this represents 
uncompressed bytes from local plus remote reads.
+- `NumScanners`: scanner instances created.
+- `ScannerCpuTime`: direct scanner CPU. Primary scan bottleneck timer.
+- `ScannerGetBlockTime`: time getting blocks from scanners.
+- `ScannerFilterTime`: scan-side filter time.
+- `ScannerWorkerWaitTime`: time scanner work waited in the scan worker pool. 
High value indicates scanner scheduling/thread-pool contention or too much scan 
concurrency, not scan CPU.
+- `WaitForRuntimeFilter`, `AcquireRuntimeFilter`, and `RuntimeFilterInfo/RFx 
WaitTime`: scan-side wait before or during scan startup/filter application. 
These can make scan `ExecTime` look large even when storage CPU/I/O is small.
+- `RFx InputRows`, `RFx FilterRows`, and `RFx AlwaysTrueFilterRows`: 
per-filter counters, often accumulated across parallel scan instances and 
filters. They may exceed a table's final scan rows when summed across 
filters/instances. Compare them per filter and against `ScanRows`, `RowsRead`, 
and target-side output; do not add them across filters as if they were disjoint.
+- `MaxScanConcurrency`, `MinScanConcurrency`, `RunningScanner`: concurrency 
and skew.
+- `ConditionCacheHit`, `ConditionCacheFilteredRows`: condition cache 
effectiveness.
+- `PerScannerRunningTime`, `PerScannerRowsRead`, `PerScannerWaitTime`, 
`PerScannerProjectionTime`, `PerScannerPrepareTime`, `PerScannerOpenTime`: 
per-scanner skew and phase evidence.
+
+OLAP scan storage path:
+
+- `ReaderInitTime`, `ScannerInitTime`, `ProcessConjunctTime`: scan setup and 
predicate preparation.
+- `BlockLoadTime`, `BlocksLoad`, `BlockFetchTime`, `BlockInitTime`, 
`BlockInitSeekTime`, `BlockInitSeekCount`: segment/block loading and seek 
overhead.
+- `CompressedBytesRead`, `UncompressedBytesRead`, `IOTimer`, 
`DecompressorTimer`: storage I/O and decompression cost.
+- `PredicateColumnReadTime`, `NonPredicateColumnReadTime`, `OutputColumnTime`, 
`LazyReadTime`, `LazyReadSeekTime`: column read and late materialization cost.
+- `VectorPredEvalTime`, `ShortPredEvalTime`, `ExprFilterEvalTime`: predicate 
expression CPU.
+- `RowsStatsFiltered`, `RowsKeyRangeFiltered`, `RowsConditionsFiltered`, 
`RowsZoneMapRuntimePredicateFiltered`, `RowsBloomFilterFiltered`, 
`RowsDelFiltered`, `SegmentDictFiltered`: pruning and delete/filter 
effectiveness. Runtime filters, especially min/max or storage-pushed filters, 
can show their benefit here even when a per-filter `FilterRows` counter is zero 
or misleading.
+- `GenerateRowRangeByKeysTime`, `GenerateRowRangeByColumnConditionsTime`, 
`GenerateRowRangeByBloomFilterIndexTime`, `GenerateRowRangeByZoneMapIndexTime`, 
`GenerateRowRangeByDictTime`: index/range construction cost.
+- Inverted-index counters (`InvertedIndex*`, `RowsInvertedIndexFiltered`): 
index reader/search/filtering cost and effectiveness.
+- ANN counters (`AnnIndex*`, `AnnIvf*`): vector index search, conversion, 
cache, fallback, and filtering.
+- Variant counters (`Variant*`, sparse-column counters): variant/sparse 
subcolumn materialization and filtering.
+- `AdaptiveBatchPredictMaxRows`, `AdaptiveBatchPredictMinRows`: adaptive batch 
sizing metadata, not direct cost.
+
+File/external scan:
+
+- `FileScannerGetBlockTime`, `FileScannerPreFilterTimer`, 
`ApplyAllRuntimeFilters`: file scanner work and RF application.
+- `FileNumber`, `EmptyFileNum`, `BatchSplitMode`: split/file shape.
+- Connector-specific read/open/list counters should be treated as remote I/O 
or metadata cost; verify with bytes and file count.
+
+## Join Counters
+
+Hash join build sink:
+
+- `BuildTime`, `BuildHashTableTime`, `BuildTableInsertTime`, 
`BuildExprCallTime`, `MergeBuildBlockTime`: direct build-side work.
+- `InputRows`, `HashTableSize`, `MemoryUsageHashTable`, 
`MemoryUsageBuildBlocks`, `MemoryUsageBuildKeyArena`: build-side size and 
memory.
+- `PublishTime`, `RuntimeFilterInfo`: runtime filter publication.
+- `BroadcastJoin`, `JoinType`, `ShareHashTableEnabled`, `BuildShareHashTable`, 
`SkipProcess`: plan/state metadata.
+- `AsofIndexExprTime`, `AsofIndexSortTime`, `AsofIndexGroupTime`: ASOF join 
index preparation cost.
+
+Hash join probe:
+
+- `ProbeRows`, `ProbeIntermediateRows`: probe-side volume and intermediate 
expansion.
+- `ProbeExprCallTime`, `ProbeWhenSearchHashTableTime`, 
`ProbeWhenProbeSideOutputTime`, `ProbeWhenBuildSideOutputTime`, 
`NonEqualJoinConjunctEvaluationTime`, `JoinFilterTimer`, `BuildOutputBlock`, 
`FinishProbePhaseTime`: direct probe/output/filter work.
+- `MemoryUsageProbeKeyArena`: probe key memory.
+- `WaitForDependency[HASH_JOIN_BUILD_DEPENDENCY]Time` on probe is waiting for 
build readiness. It is not probe CPU.
+
+Partitioned hash join:
+
+- `Spilled`, `SpillBuildTime`, `SpillProbeTime`, `SpillRePartitionTime`, 
`SpillBuildRows`, `SpillProbeRows`, `SpillRecovery*`, `SpillMaxPartitionLevel`, 
`SpillTotalPartitions`: partitioned join spill path and recovery.
+- `BuildRows`, `ProbeBlocksBytesInMem`, `SpillInMemRow`: partitioned 
memory/volume shape.
+
+Nested loop/cross join:
+
+- `LoopGenerateJoin`, `JoinConjunctsEvaluationTime`, 
`FilteredByJoinConjunctsTime`, `OutputTempBlocksTime`, 
`UpdateVisitedFlagsTime`: direct nested-loop work.
+- Large probe/intermediate rows are usually more important than wait counters.
+
+## Aggregation and Set Counters
+
+Aggregation sink/source:
+
+- `BuildTime`, `HashTableComputeTime`, `HashTableEmplaceTime`, 
`HashTableInputCount`, `HashTableSize`: primary group-by/hash aggregation cost.
+- `MergeTime`, `DeserializeAndMergeTime`, `GetResultsTime`, 
`HashTableIterateTime`, `InsertKeysToColumnTime`, `InsertValuesToColumnTime`: 
merge/output phase.
+- `ExprTime`, `DoLimitComputeTime`: expression and limit work.
+- `MemoryUsageHashTable`, `MemoryUsageArena`, `MemoryUsageContainer`, 
`MemoryUsageSerializeKeyArena`: aggregation memory.
+- `Spilled`, `SpillSerializeHashTableTime`, `SpillTotalTime`, 
`SpillTotalPartitions`, `SpillMaxPartitionLevel`: spill path.
+
+Streaming/distinct aggregation:
+
+- `StreamingAggTime`: streaming aggregation work.
+- `BuildTime`, `HashTableComputeTime`, `HashTableEmplaceTime`, 
`HashTableInputCount`, `HashTableSize`: distinct/group state.
+
+Set operators:
+
+- `BuildTime` on `SET_SINK_OPERATOR`: build set/hash table.
+- `ProbeTime`, `ExtractProbeDataTime` on `SET_PROBE_SINK_OPERATOR`: 
probe/extract set results.
+- `GetDataTime`, `FilterTime`, `GetDataFromHashTableRows` on intersect/except: 
set output/filter work.
+
+## Sort, Window, and Materialization Counters
+
+Sort:
+
+- `SortTime`, `AppendBatchTime`, `AppendBlockTime`, `MergeGetNext`, 
`MergeGetNextBlock`, `MemoryUsageSortBlocks`: direct sort and output work.
+- Spill sort uses `SpillSort*`, `SpillMergeSortTime`, and common spill 
counters.
+- `WaitForDependency[SORT_SOURCE_DEPENDENCY]Time` generally means waiting for 
the sink to finish sorting.
+
+Partition sort:
+
+- `HashTableBuildTime`, `SortedDataTime`, `SortedPartitionInputRows`, 
`SortedPartitionOutputRows`, `GetSortedTime`: partitioned/window sort setup and 
output.
+
+Analytic/window:
+
+- `ComputePartitionByTime`, `ComputeOrderByTime`, `ComputeAggDataTime`, 
`ComputeRangeBetweenTime`, `EvaluationTime`, `PartitionSearchTime`, 
`OrderSearchTime`, `RemoveRowsTime`: direct window function work.
+- `Blocks`, `BlocksPeak`, `RemoveRows`, `RemoveCount`, `streaming mode`: 
window state and mode.
+
+Materialization/select/repeat/table function:
+
+- `ProjectionTime`, expression/evaluation timers, `RowsProduced`, and output 
bytes are the main proof.
+- `MATERIALIZATION_OPERATOR` can be a top/lazy materialization node that 
fetches deferred columns by row id through `multiget_data_v2` RPCs. 
`MaxRpcTime` is the max elapsed RPC round-trip for a batch, not local 
projection CPU. Treat it as late row-id fetch latency unless child 
`RowIDFetcher` profiles, fetched rows/bytes, `MergeResponseTime`, and output 
row count prove a materialization-heavy workload.
+- Do not rank a top/lazy materialization node above upstream 
scan/join/aggregation/memory evidence when it only materializes a small final 
row set. Trace `row_ids` or row-id table info back to the scan/join branch that 
produced those row ids, then report materialization separately as final-column 
fetch latency.
+- `Repeat` expands grouping sets; compare input/output rows.
+- `TableFunction` can multiply rows; compare produced rows and expression time.
+
+## Exchange and Network Counters
+
+Exchange source:
+
+- `GetDataFromRecvrTime`, `DeserializeRowBatchTimer`, `DecompressTime`, 
`FilterTime`, `CreateMergerTime`, `MergeGetNext`, `MergeGetNextBlock`: direct 
receive/merge/decode work.
+- `LocalBytesReceived`, `RemoteBytesReceived`, `DecompressBytes`, 
`BlocksProduced`: data volume.
+- `WaitForData0`, `WaitForDataN`, `DataArrivalWaitTime`, 
`FirstBatchArrivalWaitTime`, `SendersBlockedTotalTimer(*)`, 
`MaxWaitForWorkerTime`, `MaxWaitToProcessTime`: wait/backpressure/scheduling 
signals.
+
+Exchange sink:
+
+- `SerializeBatchTime`, `CompressTime`, `DistributeRowsIntoChannelsTime`, 
`SplitBlockHashComputeTime`, `LocalSendTime`, `MergeBlockTime`, 
`SendNewPartitionTime`, `AddPartitionRequestTime`: direct send/shuffle work.
+- `BytesSent`, `LocalBytesSent`, `LocalSentRows`, `UncompressedRowBatchSize`, 
`BlocksProduced`, `OverallThroughput`: volume/throughput.
+- `RpcMaxTime`, `RpcAvgTime`, `RpcMinTime`, `RpcSumTime`, `RpcCount` on 
`DATA_STREAM_SINK_OPERATOR`: remote send RPC latency. A high max/avg with 
meaningful bytes and small serialize/compress time points to network, receiver, 
or RPC path latency, not local sink CPU.
+- `WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`, 
`WaitForLocalExchangeBufferN`: downstream backpressure or queue capacity wait.
+
+Local exchange and multicast:
+
+- `ComputeHashValueTime`, copy/channel push/pop timers, local bytes/rows, and 
buffer waits explain in-BE reshuffle or fanout.
+- Long local exchange wait usually reflects receiver/sender imbalance, not 
compute in the local exchange itself.
+
+## Sink and Writer Counters
+
+Result/file/mysql/flight sinks:
+
+- `AppendBatchTime`, `BytesSent`, `FileWriteTime`, `FileWriterCloseTime`, 
`ConvertBlockToArrowBatchTime`: direct output cost.
+- Result sinks can be slow because clients fetch slowly; look for send/write 
wait and bytes.
+
+OLAP/table/load sinks:
+
+- `AppendNodeChannelTime`, `CloseWaitTime`, `CloseWriterTime`, 
`CloseLoadTime`, `CommitTxnTime`, `BuildRowsetTime`, `AddPartitionRequestTime`: 
write/load/finalization phases.
+- High close/commit is sink-side only if custom writer timers agree.
+
+External table sinks:
+
+- Hive/Iceberg/MaxCompute/JDBC/TVF counters represent connector write, file 
write, delete-file write, and close/commit costs. Use file counts, bytes, 
writer close/write timers, and connector-specific errors/logs.
+- Iceberg delete/merge/sort spill counters show delete file generation, merge 
output, and spill path.
+
+Blackhole/cache/memory scratch/dictionary sinks:
+
+- Blackhole `BytesProcessed` is volume consumed, not output I/O.
+- Cache counters show cache fill/read size and tablet id metadata.
+- Memory scratch conversion timers show in-memory Arrow/result conversion.
+- Dictionary sink counters should be read as dictionary build/publish/write 
path.
+
+## Runtime Filter Counters
+
+See `runtime-filters.md`. In short:
+
+- `RuntimeFilterInfo` on join/build/sink shows produced filters and target 
metadata.
+- `WaitForRuntimeFilter` and `AcquireRuntimeFilter` on scans/sources are waits 
for filters.
+- `RFx InputRows`, `RFx FilterRows`, `RFx AlwaysTrueFilterRows`, `RFx 
WaitTime` tell whether each RF was useful.
+- Waiting for a runtime filter is worthwhile only if it filters enough scan 
rows or prevents expensive downstream work.
+- Runtime-filter wait counters can appear on scans, multicast/source 
operators, or dynamic/top-N filter paths. Treat them as wait/synchronization 
evidence first, then use the plan and detail profile to find the producer and 
target.
+
+## Spill Counters
+
+Common spill counters are direct evidence of a different execution path:
+
+- `SpillTotalTime`, `SpillWriteTime`, `SpillRecoverTime`: overall spill cost.
+- `SpillWriteTaskWaitInQueueTime`, `SpillReadTaskWaitInQueueTime`: spill task 
queue wait.
+- `SpillWriteFileTime`, `SpillReadFileTime`: spill file I/O.
+- `SpillWriteSerializeBlockTime`, `SpillReadDeserializeBlockTime`: CPU 
transform cost.
+- `SpillWriteBlockCount`, `SpillWriteBlockBytes`, `SpillWriteRows`, 
`SpillWriteFileBytes`, `SpillWriteFileTotalCount`: write volume.
+- `SpillReadBlockCount`, `SpillReadBlockBytes`, `SpillReadRows`, 
`SpillReadFileBytes`, `SpillReadFileCount`: read volume.
+- `SpillMaxRowsOfPartition`, `SpillMinRowsOfPartition`, 
`SpillMaxPartitionLevel`, `SpillTotalPartitions`: partition skew and spill 
fanout.
+
+If spill exists, it is usually more important than generic wait counters.
diff --git 
a/.codex/skills/doris-profile-reader/references/join-order-diagnosis.md 
b/.codex/skills/doris-profile-reader/references/join-order-diagnosis.md
new file mode 100644
index 00000000000..ca82767b9b5
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/references/join-order-diagnosis.md
@@ -0,0 +1,111 @@
+# Join Order Diagnosis
+
+Use this file after the normal bottleneck pass. Keep the skill's main 
discipline: first prove the immediate runtime cost, then decide whether that 
cost is a symptom of a bad join order, bad build/probe choice, or late/missing 
runtime filter.
+
+## 1. Identify Build, Probe, Source, and Target
+
+Hash join:
+
+- `HASH_JOIN_SINK_OPERATOR` is the build side. Its decisive counters are 
`InputRows`/`BuildRows`, `BuildTime`, `BuildHashTableTime`, 
`BuildTableInsertTime`, `MemoryUsageHashTable`, `MemoryUsageBuildBlocks`, and 
`RuntimeFilterInfo`.
+- `HASH_JOIN_OPERATOR` is the probe side. Its decisive counters are 
`ProbeRows`, `ProbeIntermediateRows`, probe expression/search/output timers, 
and final `RowsProduced`.
+- In non-pipeline or older profiles, a single `VHASH_JOIN_NODE` may contain 
both sides. Use `BuildRows`, `ProbeRows`, `BuildTime`, `ProbeTime`, hash-table 
memory, and plan child order/RF arrows to identify the sides.
+- In Doris plans, the hash table is normally built from the join's build/right 
child, but verify with counters or RF arrows when the profile is merged or the 
plan has been transformed.
+
+Nested-loop/cross join:
+
+- `CROSS_JOIN_SINK_OPERATOR` is the build/materialized side.
+- `CROSS_JOIN_OPERATOR` is the probe/loop side. Its decisive counters are 
`ProbeRows`, `LoopGenerateJoin`, `JoinConjunctsEvaluationTime`, 
`FilteredByJoinConjunctsTime`, and `OutputTempBlocksTime`.
+- A nested-loop join over a large intermediate result is often a plan-shape 
problem even when the direct operator cost is correctly attributed to 
nested-loop active timers.
+
+Runtime filters:
+
+- `RFxxx <- expr` is the source/build side that produces the filter.
+- `RFxxx -> expr` is the target/probe scan side that may be pruned.
+- A useful order usually produces filters from a small/selective build side 
early enough to prune a large probe-side scan. A suspicious order scans/builds 
a large fact side before the small/selective branch can produce its RF.
+- An empty RF is not automatically a good outcome. If the query spends most of 
its time scanning the RF source just to produce an empty/tiny filter, then the 
RF source side may be the wrong side even though downstream targets skipped 
work.
+- Empty downstream target scans prove only that the RF arrived before those 
targets did work. They do not prove the source choice was good. If the source 
scan/build dominates elapsed time, treat the plan as "RF direction likely bad: 
the expensive side produced the pruning signal instead of receiving one" unless 
there is concrete counterevidence.
+
+Memo and plan alternatives:
+
+- If memo or verbose explain contains multiple physical expressions for the 
same join group, inspect both child orders. A reversed physical expression is 
valid counterfactual evidence even when no fast profile is present.
+- Compare the chosen side against table keys, distribution keys, 
colocate/bucket properties, and available RF targets. If the chosen RF source 
uses non-key predicates and scans massively, while the other side's join key 
can target the source table's key/prefix/distribution key, classify the chosen 
order as suspicious or bad.
+- Do not accept the optimizer's estimated lower cost as proof of correctness 
when actual profile rows contradict the estimate. A low estimated build 
cardinality that turns into a huge zero-output scan is evidence of bad 
join-order costing.
+
+## 2. Decide Whether the Join Order Is Reasonable
+
+Treat join-order diagnosis as a plan-shape layer, separate from immediate 
runtime cost. Report both layers.
+
+Strong evidence of bad join order:
+
+- A paired fast/slow profile, hint, or SQL rewrite with the same intended 
result changes only join shape/RF direction and turns a large scan/hash build 
into tiny or zero scan rows.
+- A large fact table is scanned or used as hash-build input before a selective 
dimension/subquery branch is reduced to a few rows.
+- The slow plan builds a hash table on millions/billions of rows while an 
alternate legal shape builds on a tiny side and uses RF to prune the large side.
+- A scan is the active bottleneck only because the RF that could prune it is 
produced later, on the wrong side, or after the scan has already done most work.
+- The plan creates a huge intermediate result and a later join/filter reduces 
it to zero or a tiny row count, especially with contradictory predicates or 
predicates that could have been pushed below the join.
+- Estimates in explain/memo say a branch is large, but profile actual rows 
show it is tiny, causing the optimizer to choose the wrong build/probe side.
+
+Single slow profile evidence can also be strong enough. Do not require a fast 
profile when these patterns appear:
+
+- Empty-probe or empty-result build waste: a hash join builds/shuffles 
millions or billions of rows, but `ProbeRows` or final output is zero/tiny. The 
direct cost is build/scan, but the plan-shape cause is that the large build 
side was scheduled before proving the probe/output side was empty or highly 
selective. If join type semantics make a swap uncertain, say "likely bad join 
order/build-side scheduling" rather than "not join order".
+- RF-source inversion: the current RF source/build side is itself the 
expensive unpruned scan and produces zero/tiny rows only after scanning a huge 
table. Its RF target scans then wait and skip work. This is not merely a 
scan/index problem; it means the plan chose the side that needed pruning as the 
side responsible for producing the pruning signal. Check whether the expensive 
side's join key is a table key/prefix, distribution key, or colocated key and 
whether no RF is applied to that side.
+- Expensive-empty source: a scan/build side does massive active work and 
outputs zero/tiny rows, then its RF makes other branches skip. Classify this as 
likely bad join order/RF direction even if that branch also has a missing 
index, non-key predicate, bad literal, or stale estimate. The other anomaly 
explains why the branch is expensive or empty; it does not make the expensive 
branch a good RF source.
+- Late emptying predicate: a later join, semi join, analytic/filter, or 
non-equi conjunct reduces a large intermediate to zero/tiny rows. If that 
predicate references only a subset of the joined tables or is contradictory, 
classify the earlier large join/intermediate as bad join order or missed 
predicate placement.
+- Ineffective RF to a large side: the plan contains an RF targeting a large 
scan, but profile shows it is ignored, not pushed down, arrives after scan 
work, has zero filter rows, or the large scan still produces millions of rows. 
If a small/selective branch exists, this is evidence of bad RF direction/timing 
even without a fast profile.
+- Estimate/actual inversion: the optimizer estimates the build/source side 
tiny, but profile proves it performed massive scan work to produce zero/tiny 
rows, while downstream tables only waited for its RF. Treat this as a likely 
join-order/RF-direction error. Do not stop at "the build side scan is slow" 
unless the profile proves no join alternative could prune it.
+- Key-target inversion: the expensive source table is scanned by 
non-key/non-prefix predicates, while its join key is a key prefix, distribution 
key, or otherwise better pruning target. If current order makes that table 
produce the RF instead of receive one, call the join order/RF direction bad 
unless profile proves the alternative cannot prune it.
+- Huge build for empty/tiny probe: a hash join build side 
scans/shuffles/builds millions or billions of rows while the probe side is 
zero/tiny and final output is zero/tiny. Report direct cost as build/scan, and 
report plan-shape root cause as likely bad join order/build-side scheduling 
even for an outer join unless the profile proves no earlier 
empty-preserved-side check or legal reorder can avoid the build.
+- Contradictory non-equi join late: an inner nested-loop/cross join has 
conjuncts that cannot both be true, or a highly selective non-equi relation, 
but the plan first materializes a large intermediate. This is not just 
expression simplification; it is bad join order/predicate placement because the 
impossible/selective relation was applied after the row explosion.
+
+When any strong single-profile pattern above matches, do not write "join order 
not proven" as the main diagnosis. Use one of these verdicts:
+
+- `bad`: a paired plan, memo alternative, legal rewrite, key/RF target 
evidence, or inner-join contradiction shows the expensive work should have been 
avoided.
+- `likely bad`: the profile alone proves huge wasted scan/build/intermediate 
work before an empty/tiny side or late predicate, but legal alternatives still 
need confirmation.
+- `suspicious`: only for weak patterns where build/probe/RF sides or row-flow 
evidence are incomplete. Do not use `suspicious` for expensive-empty source, 
huge build for empty/tiny probe, tiny semi key after fact build, or 
contradictory non-equi join late.
+
+Keep uncertainty scoped correctly:
+
+- It is acceptable to say the exact fix or alternate legal order is not proven.
+- It is not acceptable to demote the diagnosis to "not join order" or "only 
scan/literal/index problem" when the profile already proves the large side was 
paid before an empty/tiny side could prune it.
+- In the conclusion, put the join verdict before coexisting anomalies when the 
user's question is profile performance: "runtime bottleneck is X; join 
order/build scheduling is likely bad; bad literal/stats/index may explain why 
the selective side became empty."
+
+Weaker evidence:
+
+- A single profile shows large scan/build work and late/weak RF, but there is 
no alternate plan, no SQL/hint comparison, and no stats/memo evidence.
+- Outer/semi/anti joins restrict legal reorder. A costly order may be required 
by semantics unless an equivalent rewrite or optimizer rule is known.
+
+Do not stop at "scan bottleneck" when the scan exists because of join order. 
Say: "runtime bottleneck is the scan/hash build; likely plan-shape root cause 
is join order/RF direction because ..."
+
+## 3. Join Type Constraints
+
+- Inner joins are usually reorderable; choose the shape that builds on the 
smaller/selective side and probes/prunes the larger side.
+- For left outer join, the left side is preserved. For right outer join, the 
right side is preserved. Swapping can change semantics unless the optimizer 
rewrites the join equivalently.
+- Outer join preservation does not make every build/probe choice good. If a 
preserved-side scan/probe is empty after its own predicates while the 
non-preserved build side was already fully scanned/built, report the direct 
build waste and classify the order as likely bad scheduling or missed 
empty-side short-circuit/predicate placement. Keep the semantic caveat, but do 
not downgrade it to a pure scan problem.
+- For outer joins, distinguish logical join order from physical work order. 
Even when logical reordering is constrained, a profile can still prove bad 
physical build-side scheduling if Doris builds the non-preserved side before 
discovering the preserved side has zero rows.
+- For left outer joins in hash execution, `ProbeRows=0` with huge right/build 
`BuildRows` is a likely bad physical work order: the build side was paid before 
the preserved side was known empty. State that even if the SQL contains another 
bug, and make the legal-rewrite caveat secondary.
+- For semi/anti joins, the preserved/output side and filter side matter. A 
good order applies the selective semi/anti result before scanning or building 
large downstream joins when legal.
+- For semi joins, a tiny key set from the filter side that is applied only 
after a large fact scan/build is strong evidence of bad join order or RF timing.
+- For non-equi joins, nested-loop may be required, but the order can still be 
bad if it materializes a huge intermediate before applying a selective or 
impossible predicate. A contradictory join conjunct on an inner join is a 
join-order/predicate-placement failure when earlier joins create large rows 
that the contradictory join later discards. State "join order/predicate 
placement is bad"; do not stop at "constant folding/predicate simplification 
missing".
+
+## 4. Distinguish Join Order From Other Anomalies
+
+Other anomalies can coexist with bad join order:
+
+- Bad literal rewrite, stale stats, missing index, weak partition pruning, or 
an expensive scalar expression may explain why one branch is selective or empty.
+- They do not explain why the engine scanned/built another huge side before 
using that selectivity, or why an empty/tiny branch was unable to prune the 
expensive side.
+- State both: "runtime bottleneck is large scan/build; another anomaly 
explains selectivity; join order/RF direction is still wrong because the plan 
pays the large side before the selective/empty branch can prune it."
+
+Avoid this failure mode: calling the expensive RF-source scan a pure 
access-path problem when the rest of the plan proves that all other large scans 
waited for its empty/tiny RF. That is exactly the case where the RF 
source/target choice should be challenged.
+
+Avoid this failure mode: calling a large build before an empty probe "only an 
outer join semantic constraint". If the preserved side is empty after its own 
predicates, then the build side contributed nothing to the result; the plan 
order/scheduling failed to exploit the empty side.
+
+Avoid this failure mode: calling an impossible non-equi join only a predicate 
simplification issue. If a large intermediate is produced before the impossible 
join, the bad symptom is join order/predicate placement: the impossible or 
highly selective relation was applied too late.
+
+## 5. Output Checklist
+
+When a query has joins, include these facts:
+
+- `Runtime bottleneck`: the active scan/join/build/probe cost proven by 
counters.
+- `Join-order diagnosis`: whether the order/build side/RF direction is good, 
suspicious, or bad.
+- `Build/probe evidence`: build side rows/time/memory, probe side rows/time, 
and join type.
+- `RF evidence`: source side, target side, wait, filtering/pruning, and 
whether it arrived early enough.
+- `Counterfactual evidence`: paired profile, hint, rewrite, memo, or stats 
that shows a better legal order. If absent, say the join-order conclusion is 
likely but not proven.
diff --git a/.codex/skills/doris-profile-reader/references/operator-guide.md 
b/.codex/skills/doris-profile-reader/references/operator-guide.md
new file mode 100644
index 00000000000..66fe07f8874
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/references/operator-guide.md
@@ -0,0 +1,140 @@
+# Operator Guide
+
+This guide covers the major Doris runtime profile operator families. Use 
`source-profile-inventory.md` for exact source registration lines and 
counter/source anchors.
+
+## Scan Operators
+
+`OLAP_SCAN_OPERATOR` reads Doris internal storage. First check `ScanRows`, 
`RowsRead`, `ScanBytes`, `ScannerCpuTime`, `ScannerGetBlockTime`, 
`ScannerWorkerWaitTime`, `IOTimer`, `DecompressorTimer`, predicate timers, lazy 
read timers, runtime-filter wait/filter counters, and row-filter counters. A 
long scan is important when active scanner timers and large bytes/rows agree. 
If `ExecTime` is mostly explained by `RuntimeFilterInfo/RFx WaitTime`, 
`AcquireRuntimeFilter`, or `WaitForRuntimeFil [...]
+
+`FILE_SCAN_OPERATOR` reads external files through file scanner and connector 
code. Prioritize file count/split shape, file read/open/list timers, 
`FileScannerGetBlockTime`, `ApplyAllRuntimeFilters`, scan rows/bytes, and 
connector I/O. Metadata/listing time can dominate small queries; data 
read/decode dominates large ones.
+
+`JDBC_SCAN_OPERATOR` reads remote JDBC sources. Prioritize remote 
query/read/fetch timers, rows, bytes, and network/serialization. A slow JDBC 
scan may be remote database latency rather than Doris CPU.
+
+`GROUP_COMMIT_SCAN_OPERATOR` scans group-commit buffered data. Prioritize 
rows, blocks, group-commit state, and wait/dependency only as queue state.
+
+`SCHEMA_SCAN_OPERATOR` reads schema/system metadata. Usually small. If slow, 
check metadata service/catalog lookup timers and rows produced.
+
+`META_SCAN_OPERATOR` reads internal metadata. Treat it like a metadata scan; 
long active time points to metadata lookup or serialization, not table data 
scan.
+
+`DATA_GEN_SCAN_OPERATOR` generates rows. Prioritize generated row count and 
expression/generation active timers.
+
+`REC_CTE_SCAN_OPERATOR` reads recursive CTE state. Use recursive CTE 
source/sink counters and rows per iteration to find expansion.
+
+`CACHE_SOURCE_OPERATOR` reads cached intermediate data. Prioritize cache 
tablet/id metadata, cache hit/read counters, rows/bytes, and active read time.
+
+`EMPTY_SET_OPERATOR` or `EMPTY_SET_SOURCE_OPERATOR` produces no data. It 
should not be the bottleneck unless profile setup/open time is anomalously high.
+
+## Exchange Operators
+
+`EXCHANGE_OPERATOR` receives data from sender queues. Prioritize 
`GetDataFromRecvrTime`, deserialize/decompress/filter/merge timers, 
local/remote bytes received, and produced rows. `WaitForData0`/`WaitForDataN`, 
`DataArrivalWaitTime`, and `FirstBatchArrivalWaitTime` mean the receiver waited 
for data. They usually point to upstream work or network/backpressure, not 
exchange CPU.
+
+`DATA_STREAM_SINK_OPERATOR` sends data across fragments. Prioritize 
`SerializeBatchTime`, `CompressTime`, `DistributeRowsIntoChannelsTime`, 
`SplitBlockHashComputeTime`, `LocalSendTime`, `BytesSent`, `LocalBytesSent`, 
`LocalSentRows`, `RpcMaxTime`/`RpcAvgTime`, and throughput. High RPC time with 
non-trivial bytes is exchange/network evidence even when local serialization is 
small. `WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`, and 
`WaitForLocalExchangeBufferN` are buffer/backpressure waits.
+
+`LOCAL_EXCHANGE_OPERATOR` and `LOCAL_EXCHANGE_SINK_OPERATOR` reshuffle data 
inside one BE. Prioritize hash computation, copy/push/pop timers, local 
bytes/rows, and skew. Local exchange waits indicate sender/receiver imbalance 
or downstream pressure.
+
+`LOCAL_MERGE_SORT_SOURCE_OPERATOR` receives locally sorted streams. Prioritize 
merge/get-next/sorted output active timers and rows. A large dependency wait 
usually means it waited for sort sink completion.
+
+`MULTI_CAST_DATA_STREAM_SINK_OPERATOR` and 
`MULTI_CAST_DATA_STREAM_SOURCE_OPERATOR` fan out data to multiple consumers. 
Prioritize copied bytes/rows, channel buffer state, source wait for runtime 
filters when present, and downstream imbalance. Wait counters identify which 
consumer/channel delayed progress.
+
+## Aggregation Operators
+
+`AGGREGATION_SINK_OPERATOR`, `AGGREGATION_OPERATOR`, 
`BUCKETED_AGGREGATION_SINK_OPERATOR`, `BUCKETED_AGGREGATION_OPERATOR`, 
`PARTITIONED_AGGREGATION_SINK_OPERATOR`, and `PARTITIONED_AGGREGATION_OPERATOR` 
perform hash/merge aggregation. Prioritize `InputRows`, `RowsProduced`, 
`BuildTime`, `HashTableComputeTime`, `HashTableEmplaceTime`, 
`HashTableInputCount`, `HashTableSize`, `MergeTime`, `DeserializeAndMergeTime`, 
`GetResultsTime`, `HashTableIterateTime`, memory counters, and spill counte 
[...]
+
+`STREAMING_AGGREGATION_OPERATOR` aggregates when input order/streaming mode 
allows less state. Prioritize `StreamingAggTime`, build/hash table counters, 
input/output rows, and whether memory/hash table is unexpectedly high.
+
+`DISTINCT_STREAMING_AGGREGATION_OPERATOR` handles distinct streaming 
aggregation. Prioritize distinct hash table compute/emplace, input rows, hash 
table size, and rows produced.
+
+## Join Operators
+
+`HASH_JOIN_SINK_OPERATOR` builds hash join state and runtime filters. 
Prioritize `InputRows`, `BuildTime`, `BuildHashTableTime`, 
`BuildTableInsertTime`, `BuildExprCallTime`, `MergeBuildBlockTime`, 
`MemoryUsageHashTable`, build memory arenas, `PublishTime`, and 
`RuntimeFilterInfo`. Build dependency waits are sequencing state, not build CPU.
+
+`HASH_JOIN_OPERATOR` probes hash join state. Prioritize `ProbeRows`, 
`ProbeIntermediateRows`, `ProbeExprCallTime`, `ProbeWhenSearchHashTableTime`, 
non-equi conjunct time, output block build time, rows produced, and probe 
memory. Long wait for build dependency means probe waited for the build side.
+
+`PARTITIONED_HASH_JOIN_SINK_OPERATOR` and 
`PARTITIONED_HASH_JOIN_PROBE_OPERATOR` are partitioned/spill-capable join 
paths. Prioritize build/probe rows, in-memory bytes, spill 
repartition/build/probe/recovery timers, spill rows/blocks, partition 
level/count, and memory.
+
+`CROSS_JOIN_SINK_OPERATOR` and `CROSS_JOIN_OPERATOR` are nested-loop style 
join paths. Prioritize build rows, probe rows, intermediate rows, 
`LoopGenerateJoin`, join conjunct evaluation/filter time, and output temp block 
time. Large output cardinality is often the root cause. If the cross join is 
only stitching together one-row scalar-subquery results, large dependency waits 
on the cross join are branch fan-in latency; inspect the child branch 
scans/joins/aggregations before blaming cros [...]
+
+`NestedLoopJoinBuildSinkOperatorX` and `NestedLoopJoinProbeOperatorX` appear 
as cross join operators in profiles. Interpret them with the cross-join 
counters.
+
+## Sort and Analytic Operators
+
+`SORT_SINK_OPERATOR` receives rows and sorts them. Prioritize input rows, 
sort/append timers, memory usage for sort blocks, and spill counters.
+
+`SORT_OPERATOR` outputs sorted rows. Prioritize merge/get-next timers, rows 
produced, output bytes, and source dependency waits only as sink-completion 
waits.
+
+`SPILL_SORT_SINK_OPERATOR` and `SPILL_SORT_SOURCE_OPERATOR` are sort paths 
with spill. Prioritize common spill counters, merge sort time, read/write file 
times, spilled bytes/rows, and memory.
+
+`PARTITION_SORT_SINK_OPERATOR` and `PARTITION_SORT_OPERATOR` sort by partition 
for analytic/window work. Prioritize hash table build, sorted data time, sorted 
partition input/output rows, and get-sorted time.
+
+`ANALYTIC_EVAL_SINK_OPERATOR` and `ANALYTIC_EVAL_OPERATOR` evaluate window 
functions. Prioritize input rows, partition/order/range compute timers, 
evaluation time, search timers, remove rows/time, blocks peak, and rows 
produced. Wait for dependency usually means waiting for buffered window state.
+
+## Set, Union, Repeat, and Projection Operators
+
+`UNION_SINK_OPERATOR` and `UNION_OPERATOR` combine child outputs. Prioritize 
input/output rows, copy/projection time, and child imbalance. Union is rarely 
the bottleneck unless it moves many rows or projects expensive expressions.
+
+`INTERSECT_OPERATOR`, `EXCEPT_OPERATOR`, `SET_SINK_OPERATOR`, and 
`SET_PROBE_SINK_OPERATOR` build/probe set state. Prioritize 
build/probe/get-data/filter timers, hash table rows, extracted rows, memory, 
and rows produced.
+
+`REPEAT_OPERATOR` expands rows for grouping sets/rollups. Compare input rows 
to rows produced and output bytes before blaming later exchange or aggregation. 
A repeat/rollup can create hundreds of millions of rows; the repeat operator's 
own active time may be moderate, while the real cost appears immediately after 
it as local hash shuffle, data stream exchange, and aggregation over the 
expanded rows. In that pattern, downstream wait counters are symptoms and the 
proof is repeat output vol [...]
+
+`MATERIALIZATION_OPERATOR` materializes slots/projections. Prioritize 
projection/expression time, output rows/bytes, child `RowIDFetcher` profile 
info, `MergeResponseTime`, and fetched row-id volume. A high materialization 
time can also be top/lazy materialization: the operator sends row-id based 
`multiget_data_v2` RPCs to fetch deferred columns after upstream pruning/top-N. 
In that case `MaxRpcTime` is remote row-id fetch latency for a batch, not local 
materialization CPU. Do not promot [...]
+
+`SELECT_OPERATOR` applies predicates/projections. Prioritize 
predicate/expression/projection active timers and rows filtered/produced.
+
+`TABLE_FUNCTION_OPERATOR` expands rows through table functions. Compare input 
rows to output rows and expression/table-function evaluation time.
+
+`ASSERT_NUM_ROWS_OPERATOR` verifies scalar-subquery row count. Usually not a 
performance bottleneck; check rows produced and assertion metadata if it 
appears.
+
+## Recursive CTE Operators
+
+`REC_CTE_SOURCE_OPERATOR`, `REC_CTE_SINK_OPERATOR`, 
`REC_CTE_ANCHOR_SINK_OPERATOR`, and `REC_CTE_SCAN_OPERATOR` implement recursive 
CTE data flow. Prioritize iteration counts if present, rows in/out per stage, 
materialized state size, and waits between recursive producer/consumer 
dependencies. Long waits usually indicate iteration sequencing.
+
+## Result and Output Sinks
+
+`RESULT_SINK_OPERATOR` and `ResultSink` return rows to the client. Prioritize 
bytes/rows sent, append/serialize/send timers, and client/network effects. A 
slow client can make the result sink look slow.
+
+`RESULT_FILE_SINK_OPERATOR` writes query results to files. Prioritize 
`FileWriteTime`, `FileWriterCloseTime`, bytes, file count, and 
serialization/compression.
+
+`MEMORY_SCRATCH_SINK_OPERATOR` writes to in-memory scratch/results. Prioritize 
conversion timers such as Arrow batch conversion, rows, and bytes.
+
+`BLACKHOLE_SINK_OPERATOR` consumes output without writing. Prioritize 
`BytesProcessed`/rows only as volume; it should not be I/O-bound.
+
+## Table and Connector Sinks
+
+`OLAP_TABLE_SINK_OPERATOR`, `OlapTableSinkOperatorX`, and 
`OlapTableSinkV2OperatorX` write into Doris storage. Prioritize append/channel 
timers, rowset build, close/load/commit timers, tablet/channel skew, 
rows/bytes, and memory. Close/commit only proves sink cost when writer timers 
agree.
+
+`GROUP_COMMIT_OLAP_TABLE_SINK_OPERATOR` and `GROUP_COMMIT_BLOCK_SINK_OPERATOR` 
write through group commit. Prioritize append blocks, group commit 
flush/queue/commit timers, rows/bytes, and close/wait states.
+
+`HIVE_TABLE_SINK_OPERATOR`, `ICEBERG_TABLE_SINK_OPERATOR`, 
`SPILL_ICEBERG_TABLE_SINK_OPERATOR`, `ICEBERG_DELETE_SINK_OPERATOR`, 
`ICEBERG_MERGE_SINK_OPERATOR`, `MAXCOMPUTE_TABLE_SINK_OPERATOR`, 
`JDBC_TABLE_SINK_OPERATOR`, and `TVF_TABLE_SINK_OPERATOR` write to external 
systems. Prioritize connector write time, file write/close time, 
delete-file/write-file counts, commit/close timers, bytes/rows, and spill 
counters for spill-capable paths. External system latency may dominate.
+
+`DICTIONARY_SINK_OPERATOR` builds/publishes dictionary data. Prioritize 
build/publish/write timers, rows, bytes, and memory.
+
+`CACHE_SINK_OPERATOR` writes cacheable intermediate data. Prioritize bytes 
written, cache write timers, tablet/cache metadata, and memory.
+
+## Source Coverage Index
+
+The narrative sections above cover these factory-created plan nodes and data 
sinks from the local source inventory:
+
+Plan nodes:
+
+`OLAP_SCAN_NODE`, `GROUP_COMMIT_SCAN_NODE`, `JDBC_SCAN_NODE`, 
`FILE_SCAN_NODE`, `EXCHANGE_NODE`, `AGGREGATION_NODE`, 
`BUCKETED_AGGREGATION_NODE`, `HASH_JOIN_NODE`, `CROSS_JOIN_NODE`, `UNION_NODE`, 
`SORT_NODE`, `PARTITION_SORT_NODE`, `ANALYTIC_EVAL_NODE`, 
`MATERIALIZATION_NODE`, `INTERSECT_NODE`, `EXCEPT_NODE`, `REPEAT_NODE`, 
`TABLE_FUNCTION_NODE`, `ASSERT_NUM_ROWS_NODE`, `EMPTY_SET_NODE`, 
`DATA_GEN_SCAN_NODE`, `SCHEMA_SCAN_NODE`, `META_SCAN_NODE`, `SELECT_NODE`, 
`REC_CTE_NODE`, `REC_CTE_ [...]
+
+Data sinks:
+
+`DATA_STREAM_SINK`, `RESULT_SINK`, `DICTIONARY_SINK`, 
`GROUP_COMMIT_OLAP_TABLE_SINK`, `GROUP_COMMIT_BLOCK_SINK`, `HIVE_TABLE_SINK`, 
`ICEBERG_TABLE_SINK`, `ICEBERG_DELETE_SINK`, `ICEBERG_MERGE_SINK`, 
`MAXCOMPUTE_TABLE_SINK`, `JDBC_TABLE_SINK`, `MEMORY_SCRATCH_SINK`, 
`RESULT_FILE_SINK`, `MULTI_CAST_DATA_STREAM_SINK`, `BLACKHOLE_SINK`, 
`TVF_TABLE_SINK`, `OLAP_TABLE_SINK`.
+
+Operator classes:
+
+`AggSinkOperatorX`, `AggSourceOperatorX`, `AnalyticSinkOperatorX`, 
`AnalyticSourceOperatorX`, `AssertNumRowsOperatorX`, 
`BucketedAggSinkOperatorX`, `BucketedAggSourceOperatorX`, 
`CacheSourceOperatorX`, `DataGenSourceOperatorX`, `DictSinkOperatorX`, 
`DistinctStreamingAggOperatorX`, `EmptySetSourceOperatorX`, 
`ExchangeSinkOperatorX`, `ExchangeSourceOperatorX`, `FileScanOperatorX`, 
`GroupCommitBlockSinkOperatorX`, `GroupCommitOperatorX`, 
`HashJoinBuildSinkOperatorX`, `HashJoinProbeOperatorX [...]
+
+Named operators outside the factory list but present in profile/source 
registration:
+
+`CACHE_SINK_OPERATOR`, `CACHE_SOURCE_OPERATOR`, 
`DISTINCT_STREAMING_AGGREGATION_OPERATOR`, `LOCAL_EXCHANGE_OPERATOR`, 
`LOCAL_EXCHANGE_SINK_OPERATOR`, `LOCAL_MERGE_SORT_SOURCE_OPERATOR`, 
`MULTI_CAST_DATA_STREAM_SOURCE_OPERATOR`, `PARTITIONED_AGGREGATION_OPERATOR`, 
`PARTITIONED_AGGREGATION_SINK_OPERATOR`, 
`PARTITIONED_HASH_JOIN_PROBE_OPERATOR`, `PARTITIONED_HASH_JOIN_SINK_OPERATOR`, 
`REC_CTE_ANCHOR_SINK_OPERATOR`, `SET_PROBE_SINK_OPERATOR`, `SET_SINK_OPERATOR`, 
`SPILL_ICEBERG_TABLE_SINK_OP [...]
+
+## Fallback for Unknown Operator Names
+
+If a profile contains an operator not listed above:
+
+1. Search `source-profile-inventory.md` for the exact operator name or class.
+2. Use its source path to identify the family: scan, exchange, join, 
aggregation, sort, sink, set, metadata, or utility.
+3. Classify each counter by `counter-semantics.md`.
+4. Do not infer bottleneck from wait counters without active work or volume 
evidence.
diff --git a/.codex/skills/doris-profile-reader/references/reading-workflow.md 
b/.codex/skills/doris-profile-reader/references/reading-workflow.md
new file mode 100644
index 00000000000..65934015a91
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/references/reading-workflow.md
@@ -0,0 +1,135 @@
+# Doris Profile Reading Workflow
+
+## 1. Confirm Profile Shape
+
+Start from the query id, elapsed time, SQL, FE `Summary`, `Execution Profile`, 
`MergedProfile`, and `DetailProfile`.
+
+- Use `MergedProfile` first to find candidate fragments/operators quickly.
+- Use `DetailProfile` to check skew, per-instance outliers, and whether a 
large merged time is just parallel accumulation.
+- Use the plan text to map operator names back to tables, join sides, grouping 
keys, exchanges, limits, and runtime-filter direction.
+- If the plan has joins, keep two layers in the analysis: immediate runtime 
bottleneck and plan-shape cause. A scan or hash build may be the active cost 
while a bad join order/RF direction is the reason that cost exists.
+- Keep query-lifecycle time separate from BE operator work. FE `Plan Time`, 
`Schedule Time`, `Wait and Fetch Result Time`, and client/result fetch time can 
explain wall-clock latency, especially for small or empty queries, but they are 
not scan/join/aggregation CPU. If they dominate, report them as a separate 
lifecycle finding before ranking BE operators.
+- If `Is Cached: Yes`, `Total Instances Num: 0`, or the BE detail is absent, 
the profile is not enough for operator bottleneck analysis. Re-run with profile 
enabled and SQL cache disabled.
+
+## 2. Build a Counter Classification Before Ranking
+
+Classify each large counter before interpreting it:
+
+- Active work: `ExecTime` after qualifying child wait evidence, plus direct 
operator custom timers such as scan CPU, hash table build/probe, sort, 
aggregation, serialization, compression, I/O, decompression, spill read/write.
+- Data volume: `InputRows`, `RowsProduced`, `RowsRead`, `ScanRows`, 
`ScanBytes`, bytes sent/received, blocks produced.
+- Wait/backpressure: `WaitForDependencyTime`, `WaitForDependency[...]Time`, 
`WaitForDataN`, `DataArrivalWaitTime`, `FirstBatchArrivalWaitTime`, 
`WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`, local-exchange buffer waits, 
`PendingFinishDependency`, `WaitWorkerTime`, memory reserve waits.
+- Resource pressure: memory peak, spill counters, scanner queue wait, 
remote/local bytes, cache miss/read timers.
+- Optimizer/context info: join type, colocate/shuffle flags, partitioner, 
grouping keys, runtime-filter info, predicates, projections.
+- Query lifecycle/context: FE plan time, schedule time, wait/fetch result 
time, result sink/client fetch time. Useful for wall-clock explanation, not 
proof of BE operator cost.
+
+Only active work, data volume, and resource pressure can directly support a 
bottleneck claim. Wait/backpressure counters explain pipeline shape and 
symptoms, not direct operator cost by themselves.
+
+## 3. Rank Candidate Bottlenecks
+
+Use this priority order:
+
+1. Query elapsed and the largest active operator families.
+2. Large rows/bytes that explain the active time.
+3. Skew in `max` versus `avg` or a single instance much slower than peers.
+4. Spill, memory peak, and queue/thread waits.
+5. Runtime-filter effectiveness and wait.
+6. Network/exchange serialization, compression, and remote bytes.
+7. Init/open/close only if they remain large after excluding parallel 
accumulation and waits.
+
+Do not rank by the largest wait-like counter. A profile where 
`EXCHANGE_OPERATOR WaitForData0` is the largest counter commonly means the 
exchange waited for upstream data.
+
+When several scans or branches each contribute meaningful rows, bytes, memory, 
or active time, do not force a single-operator bottleneck. First summarize the 
branch family or operator family, then name the strongest member only as an 
example or leading contributor.
+
+After ranking the direct cost, perform a join-order pass for multi-join plans. 
Identify build/probe sides, RF source/target sides, and whether a more 
selective branch was scheduled too late to prune a large scan or build. Use 
`join-order-diagnosis.md` for the decision rules.
+
+## 4. Interpret Merged Timers Correctly
+
+Merged profile values often include `sum`, `avg`, `max`, and `min`. Treat them 
differently:
+
+- `sum` can exceed wall-clock elapsed time because it accumulates across 
parallel fragments, drivers, scan ranges, and scanner threads.
+- `max` is usually better for wall-clock suspicion.
+- `avg` plus `max` reveals skew.
+- `min` is useful only for skew context.
+
+If an operator appears many times, a long summed `ExecTime` may represent 
normal parallel CPU. Compare per-instance `max` to query elapsed and compare 
rows/bytes per instance.
+
+## 5. Read Pipeline Waits as Dependencies
+
+Pipeline tasks register wait/dependency counters for start prerequisites, 
upstream data readiness, downstream write availability, memory reservation, 
runtime filters, and local/remote queues. These counters answer "what was this 
driver waiting for", not "where CPU was spent".
+
+Examples:
+
+- `WaitForDependency[OLAP_SCAN_OPERATOR_DEPENDENCY]Time`: scan source was 
blocked by its dependency state.
+- `WaitForDependency[HASH_JOIN_BUILD_DEPENDENCY]Time`: probe/build sequencing 
or shared build state dependency.
+- `WaitForData0`: exchange source waited for sender queue data.
+- `WaitForRpcBufferQueue`: exchange sink waited for RPC buffer capacity.
+- `WaitForLocalExchangeBufferN`: sender waited for a local exchange channel.
+- `WaitWorkerTime`: pipeline task waited for worker scheduling.
+- `ScannerWorkerWaitTime`: scanners waited in the scan worker pool.
+
+Large waits are actionable only after you identify the upstream/downstream 
resource that caused them.
+
+## 6. Diagnose by Operator Family
+
+Use the operator guide and choose the family-specific proof:
+
+- Scan bottleneck: high scan active time plus large `ScanRows`/`ScanBytes`, 
high `ScannerCpuTime`, I/O/decompression/lazy-read/predicate timers, or scanner 
queue wait.
+- Scan/RF latency trap: if scan `ExecTime` is close to `RuntimeFilterInfo` 
`RFx WaitTime`, `AcquireRuntimeFilter`, or `WaitForRuntimeFilter`, treat the 
scan as waiting for filters, not spending CPU in storage. Detail profiles can 
expose `WaitForRuntimeFilter` even when merged `RFx WaitTime` is zero or the 
filter timed out. `TOPN OPT`, `TopNFilterSourceNodeIds`, or `SharedPredicate` 
on a scan should be read as the same filter-wait family before blaming storage 
CPU. Then decide whether the [...]
+- Join bottleneck: high build/probe active time, large build/probe rows, hash 
table memory, non-equi conjunct time, runtime filter publish/build, or spill in 
partitioned join.
+- Aggregation bottleneck: high build/merge/hash-table times, large 
`InputRows`, large hash table size/memory, or spill.
+- Sort/window bottleneck: high sort/evaluation active time, large input rows, 
memory pressure, or spill.
+- Exchange bottleneck: high serialization/compression/send/receive/merge time, 
high `RpcMaxTime`/`RpcAvgTime` on `DATA_STREAM_SINK_OPERATOR`, and bytes. 
Data-arrival waits alone are not enough.
+- Sink/write bottleneck: high append/write/commit/close/load timers and 
bytes/rows. Output `PendingFinishDependency` alone is not enough.
+
+For repeated CTEs, repeated common subplans, expanded `UNION ALL` branches, or 
scalar-subquery fan-in plans, do not summarize from one representative branch. 
Build a small branch inventory: table/source, scan rows/bytes, rows produced, 
major join/aggregation rows, memory/spill evidence, and whether the apparent 
time is active work or RF/dependency wait. Then aggregate the pattern mentally. 
Repeated branches can make the true bottleneck "many medium scans plus RF 
waits" rather than one si [...]
+
+Small dimension scans often show high apparent `ExecTime` or detail scanner 
timers because they waited for runtime filters or because merged counters 
accumulate across many tiny scanners. Before promoting a dimension scan over a 
fact branch, compare rows/bytes, branch role, output impact, and whether the 
timer is mostly RF wait. If the fact branches dominate row volume or downstream 
aggregation, report the dimension scan as supporting/filtering work instead of 
the main bottleneck.
+
+## 7. Runtime Filter Pass
+
+If runtime filters appear, always answer:
+
+- Which join/build side produced the filter.
+- Which scan/probe side applied it.
+- Whether it filtered rows, was always true, arrived too late, or forced 
meaningful wait.
+- Whether a long wait is justified by saved scan work.
+
+See `runtime-filters.md`.
+
+## 8. Join Order Pass
+
+For every multi-join profile, always answer:
+
+- Which side each important join builds on, and which side probes.
+- Whether the build side is unexpectedly large compared with the probe/other 
available side.
+- Whether runtime filters flow from a selective side to a large scan early 
enough to save work.
+- Whether paired profiles, hints, rewritten SQL, memo estimates, or actual 
rows prove a better order.
+- Whether memo contains an unchosen reversed join expression, or 
schema/distribution keys show that the current RF source should instead be an 
RF target.
+
+If the active bottleneck is a scan or hash build that would disappear under a 
better join order, report the scan/build as the runtime bottleneck and the join 
order/RF direction as the likely root cause.
+
+When only a slow profile is available, still make a join-order judgment. Use 
"likely" when the evidence is circumstantial, but do not answer "not proven" 
merely because there is no paired fast profile. Empty probes, empty final 
results after huge build/scan work, huge intermediates later eliminated, and 
RF-source inversion are enough to call the order likely bad when build/probe/RF 
sides and row counts are visible.
+
+Use this red-flag pass before writing the conclusion:
+
+- Expensive RF source emits empty/tiny RF and all targets skip: join order/RF 
direction is likely bad, not proven good.
+- Huge build/shuffle before zero/tiny probe or result: build/probe order or 
build scheduling is likely bad, even if another predicate explains why the 
probe is empty.
+- Huge intermediate before a contradictory or highly selective inner/non-equi 
join: join order/predicate placement is bad.
+- Tiny semi/subquery key set is applied after a large fact scan/build: join 
order/RF timing is likely bad.
+
+If one of those red flags matches, the conclusion must say `bad` or `likely 
bad`. Use `not proven` only for the exact alternate legal plan or repair, not 
for whether the observed order is wrong.
+
+If your reasoning says "plan shape", "predicate placement", "empty probe", 
"late contradictory predicate", or "expensive RF source", translate that into 
an explicit join-order diagnosis. The required output is not complete until it 
says whether the join order/build-probe/RF direction is good, suspicious, or 
bad.
+
+See `join-order-diagnosis.md`.
+
+## 9. Output Standard
+
+A good profile explanation is evidence-bounded:
+
+- Quote exact operator and counter names.
+- State whether each quoted timer is active, wait, or accumulated.
+- Tie scan/join/exchange metrics to table or plan node when possible.
+- For joins, include a short build/probe/RF-direction judgment, even when the 
direct bottleneck is a scan.
+- Avoid generic tuning advice unless the profile proves the need.
+- End with missing evidence if the profile is incomplete.
diff --git a/.codex/skills/doris-profile-reader/references/runtime-filters.md 
b/.codex/skills/doris-profile-reader/references/runtime-filters.md
new file mode 100644
index 00000000000..03ad91d19a3
--- /dev/null
+++ b/.codex/skills/doris-profile-reader/references/runtime-filters.md
@@ -0,0 +1,92 @@
+# Runtime Filter Interpretation
+
+## Direction
+
+Doris runtime filters are generated from join build-side expressions and 
applied on target/probe-side scan expressions when the planner can map the 
slots.
+
+Plan notation:
+
+- `RF000[min_max] <- build_expr`: the filter is produced by this join/build 
side.
+- `RF000[min_max] -> target_expr`: the filter is applied to this scan/target 
side.
+- `RFs=[RF0 build->target]`: physical plan join-side relationship.
+- `JRFs= RF0` on a scan: the scan has joined runtime filters to apply.
+
+Do not invert these. The arrow into the RF (`<-`) is source/build. The arrow 
out of the RF (`->`) is target/scan.
+
+## Where to Look in Profiles
+
+Build/source side:
+
+- Join sink or build operator `RuntimeFilterInfo`.
+- `PublishTime` when publishing filters is visible.
+- Build-side row count, join type, hash table size, and build time.
+
+Target/scan side:
+
+- Scan `WaitForRuntimeFilter` or `AcquireRuntimeFilter`.
+- `RuntimeFilterInfo` entries such as `RF0 InputRows`, `RF0 FilterRows`, `RF0 
AlwaysTrueFilterRows`, `RF0 WaitTime`.
+- Top-N or dynamic filter evidence such as `TOPN OPT`, 
`TopNFilterSourceNodeIds`, and `SharedPredicate`. These can add 
`WaitForRuntimeFilter` even when individual merged `RFx WaitTime` counters are 
zero or the filter state is `TIMEOUT`.
+- Scan rows/bytes and row-filter counters before deciding whether the RF 
helped.
+- If scan `ExecTime` is approximately equal to RF wait counters, the scan was 
mostly waiting for runtime filters. Do not report that as scanner CPU or 
storage I/O without supporting `ScannerCpuTime`, `IOTimer`, 
`DecompressorTimer`, or `ScanBytes` evidence.
+- `RFx WaitTime`, scan-level `WaitForRuntimeFilter`, `AcquireRuntimeFilter`, 
and scanner child timers are not guaranteed to be additive. They can overlap, 
be per-filter, or be accumulated at different scopes. Use them to classify 
latency and causality, not to compute exact residual CPU by subtraction.
+- `RFx InputRows`, `RFx FilterRows`, and `RFx AlwaysTrueFilterRows` are 
per-filter/parallel counters. Do not add all filters together as if each row 
were counted once.
+- Storage-pushed RF pruning may be visible as `RowsKeyRangeFiltered`, 
`RowsStatsFiltered`, `RowsZoneMapRuntimePredicateFiltered`, 
`RowsBloomFilterFiltered`, or related scan-detail counters. A zero `RFx 
FilterRows` does not by itself prove the RF was useless.
+
+Exchange/sink side:
+
+- Runtime filters may be carried through data stream sinks between fragments. 
Use the plan to connect source and target.
+- Multicast/local-exchange sources can expose runtime-filter waits while 
distributing data to multiple consumers. Classify those waits as 
consumer/filter synchronization first; then identify which downstream scan or 
branch actually benefited.
+
+## Effectiveness Rules
+
+Useful filter:
+
+- `FilterRows` is large relative to `InputRows`.
+- Scan `RowsRead`/downstream rows are much lower than `ScanRows`.
+- Scan-detail pruning counters such as key-range, zone-map, bloom, stats, or 
condition-filter rows are large.
+- Filter wait is small compared with saved scan/join work.
+- The target table is large enough that pruning matters.
+- The filter is produced by a small/selective build side early enough to avoid 
large target-side scan or join-build work.
+
+Weak or useless filter:
+
+- `AlwaysTrueFilterRows` dominates.
+- `FilterRows = 0` or tiny relative to `InputRows`.
+- `WaitTime` is visible but no meaningful rows are filtered.
+- Filter arrives after most scan work, or scan had no large table to prune.
+
+Potentially harmful filter:
+
+- Long `WaitForRuntimeFilter`/`AcquireRuntimeFilter` and little filtering.
+- Scan `ExecTime` dominated by `RFx WaitTime`, especially on a small dimension 
table or a branch whose target-side filtering is weak.
+- Broadcast or build side is slow, causing scan-side idle time without saving 
I/O.
+- Many filters add predicate overhead while scan rows are already small.
+- The filter source is available only after the large table has already been 
scanned or built into a hash table; this usually points to a 
join-order/RF-direction problem rather than a scan implementation problem.
+- The expensive scan produces the RF, while other large scans wait for it and 
then skip. If the RF source had to scan huge data to produce an empty/tiny 
filter, question the source/target choice; the source side may be the side that 
should have received an earlier RF instead.
+- Empty/tiny filter output after a massive source scan is a red flag, not a 
success signal. Classify the target-side skip as saved downstream work, but 
classify the source-side massive work as likely bad RF direction unless 
concrete plan evidence rules out an earlier pruning direction.
+
+## Filter Types
+
+- Bloom/in filter: good for equality join pruning; check false positive and 
target cardinality indirectly by rows filtered.
+- Min/max filter: cheap range pruning; useful only when build-side min/max is 
selective for the target.
+- Bitmap filter: specialized nested-loop/bitmap path; check build bitmap 
construction and target filtering.
+- IN/IN_OR_BLOOM variants: planner/runtime chooses representation based on 
cardinality and settings.
+
+The filter type alone does not prove benefit. Always use target-side rows and 
wait.
+
+## Common Mistakes
+
+- Mistaking source/build side for target/scan side because of arrows.
+- Treating `WaitForRuntimeFilter` as scan CPU.
+- Calling a runtime filter effective from its existence only.
+- Treating an active large scan as a pure scan problem without asking whether 
an earlier RF from a different join order should have pruned it.
+- Calling an empty RF useful without asking whether the plan paid too much to 
produce that empty RF.
+- Calling a target-side skip proof of good order while ignoring that 
source-side scan/build dominated the query.
+- Ignoring `AlwaysTrueFilterRows`, which often means the filter was not 
selective.
+- Ignoring that filter counters may be accumulated across parallel scan 
instances.
+- Calling a filter useless only because one `RFx FilterRows` is zero while 
scan-detail key-range, zone-map, bloom, or stats pruning removed rows.
+- Subtracting RF wait counters mechanically from `ExecTime` or from each other 
despite overlapping scopes.
+
+## Example Pattern
+
+In one calibration profile, the plan showed min/max filters like 
`RF000[min_max] <- build_key` at the build side and `RF000[min_max] -> 
scan_key` at a scan. The profile had `RuntimeFilterInfo` with RF wait/filter 
counters, but the filters waited only tens of milliseconds and filtered no 
meaningful rows. The real evidence was large scan volume and active 
aggregation/exchange work, not runtime filter wait.
diff --git 
a/.codex/skills/doris-profile-reader/scripts/extract_source_profile_inventory.py
 
b/.codex/skills/doris-profile-reader/scripts/extract_source_profile_inventory.py
new file mode 100755
index 00000000000..55e9097821a
--- /dev/null
+++ 
b/.codex/skills/doris-profile-reader/scripts/extract_source_profile_inventory.py
@@ -0,0 +1,158 @@
+#!/usr/bin/env python3
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Extract Doris profile operator/counter names from the current source 
tree."""
+
+from __future__ import annotations
+
+import argparse
+import re
+from collections import defaultdict
+from pathlib import Path
+
+
+PROFILE_CONST_RE = re.compile(r'inline constexpr 
char\s+([A-Z0-9_]+)\[\]\s*=\s*"([^"]+)"')
+MACRO_RE = re.compile(
+    
r"\b(ADD_(?:CHILD_)?(?:TIMER|COUNTER)(?:_WITH_LEVEL)?|add_nonzero_counter|add_info_string|add_derived_counter)\s*\("
+)
+STRING_RE = re.compile(r'"([^"]+)"')
+NAME_ASSIGN_RE = re.compile(r'\b(?:_name|_op_name)\s*=\s*"([^"]+)"')
+CLASS_RE = 
re.compile(r"\bclass\s+([A-Za-z0-9_]*OperatorX|[A-Za-z0-9_]*LocalState)\b")
+FACTORY_RE = re.compile(r"std::make_shared<([A-Za-z0-9_]+OperatorX)>")
+CASE_RE = re.compile(r"case\s+(T(?:PlanNode|DataSink)Type::[A-Z0-9_]+)")
+
+
+def rel(path: Path, root: Path) -> str:
+    return str(path.relative_to(root))
+
+
+def load_profile_constants(root: Path) -> dict[str, str]:
+    header = root / "be/src/runtime/runtime_profile_counter_names.h"
+    constants = {}
+    for match in PROFILE_CONST_RE.finditer(header.read_text(encoding="utf-8")):
+        constants[f"profile::{match.group(1)}"] = match.group(2)
+    return constants
+
+
+def extract_macro_args(line: str, start: int) -> str:
+    depth = 0
+    out = []
+    for char in line[start:]:
+        if char == "(":
+            depth += 1
+        elif char == ")":
+            depth -= 1
+            if depth == 0:
+                out.append(char)
+                break
+        out.append(char)
+    return "".join(out)
+
+
+def counter_name_from_args(args: str, constants: dict[str, str]) -> str | None:
+    strings = STRING_RE.findall(args)
+    if strings:
+        # For ADD_CHILD_TIMER(profile, child, parent), the first string is the 
counter name.
+        return strings[0]
+    for const_name, value in constants.items():
+        if const_name in args:
+            return value
+    return None
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--repo", default="/mnt/disk3/zhaochangle/doris")
+    parser.add_argument("--output", 
default="/tmp/doris-profile-reader/source-profile-inventory.md")
+    args = parser.parse_args()
+
+    root = Path(args.repo)
+    constants = load_profile_constants(root)
+    source_roots = [
+        root / "be/src/exec/operator",
+        root / "be/src/exec/pipeline",
+        root / "be/src/exec/scan",
+        root / "be/src/exec/exchange",
+        root / "be/src/exec/sink",
+        root / "be/src/storage",
+        root / "be/src/runtime",
+        root / "fe/fe-core/src/main/java/org/apache/doris/common/profile",
+        root / 
"fe/fe-core/src/main/java/org/apache/doris/nereids/processor/post",
+        root / "fe/fe-core/src/main/java/org/apache/doris/planner",
+    ]
+    files = []
+    for source_root in source_roots:
+        if source_root.exists():
+            files.extend(p for p in source_root.rglob("*") if p.suffix in 
{".cpp", ".h", ".java"})
+
+    counters: dict[str, list[str]] = defaultdict(list)
+    names: dict[str, list[str]] = defaultdict(list)
+    classes: dict[str, list[str]] = defaultdict(list)
+    factories: dict[str, list[str]] = defaultdict(list)
+    cases: dict[str, list[str]] = defaultdict(list)
+
+    for path in sorted(files):
+        text = path.read_text(encoding="utf-8", errors="ignore")
+        relative = rel(path, root)
+        for line_no, line in enumerate(text.splitlines(), 1):
+            for match in MACRO_RE.finditer(line):
+                args_text = extract_macro_args(line, match.end() - 1)
+                counter_name = counter_name_from_args(args_text, constants)
+                if counter_name:
+                    counters[counter_name].append(f"{relative}:{line_no}")
+            for match in NAME_ASSIGN_RE.finditer(line):
+                names[match.group(1)].append(f"{relative}:{line_no}")
+            for match in CLASS_RE.finditer(line):
+                classes[match.group(1)].append(f"{relative}:{line_no}")
+            for match in FACTORY_RE.finditer(line):
+                factories[match.group(1)].append(f"{relative}:{line_no}")
+            for match in CASE_RE.finditer(line):
+                cases[match.group(1)].append(f"{relative}:{line_no}")
+
+    lines = ["# Doris Source Profile Inventory", ""]
+    lines.append("## Factory Cases")
+    for key in sorted(cases):
+        lines.append(f"- `{key}`: {', '.join(cases[key][:6])}")
+    lines.append("")
+
+    lines.append("## Operator Name Assignments")
+    for key in sorted(names):
+        lines.append(f"- `{key}`: {', '.join(names[key][:6])}")
+    lines.append("")
+
+    lines.append("## Operator Classes In Factory")
+    for key in sorted(factories):
+        lines.append(f"- `{key}`: {', '.join(factories[key][:6])}")
+    lines.append("")
+
+    lines.append("## Counter And Info Names")
+    for key in sorted(counters):
+        refs = counters[key]
+        lines.append(f"- `{key}` ({len(refs)} registrations): {', 
'.join(refs[:8])}")
+    lines.append("")
+
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    output.write_text("\n".join(lines), encoding="utf-8")
+    print(f"wrote {output}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/AGENTS.md b/AGENTS.md
index 34617013274..c6a2d6ae119 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -65,6 +65,8 @@ Added regression tests must comply with the following 
standards:
 
 Files in git commit should only be related to the current modification task. 
Environment modifications for running (e.g., `conf/`, `AGENTS.md`, `hooks/`, 
etc.) must not be `git add`ed. When delivering the final task, you must ensure 
all actual code modifications have been committed.
 
+If the BE code has been modified, it must be formatted using the correct skill 
before committing.
+
 Commit messages must follow the format below, which mirrors the PR template 
(`.github/PULL_REQUEST_TEMPLATE.md`):
 
 ```


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(doris) branch master updated: [Feature](skill) Introduce Doris profile reader skill (#63948)

Reply via email to