featzhang opened a new pull request, #27596:
URL: https://github.com/apache/flink/pull/27596
component: Runtime / Web Frontend
key: FLINK-XXXXX
priority: Major
summary: Add "Top N" metrics aggregation panel to Flink Web UI for quick
diagnostics
description: |
h3. Problem Statement
Currently, when troubleshooting performance issues (e.g., high CPU usage),
users need to manually navigate through multiple pages to check metrics for
each TaskManager, Job, and Subtask. This is time-consuming and inefficient.
Common scenarios include:
# Clicking through each TaskManager to find which one has the highest CPU
usage
# Checking each operator's subtask to find backpressure issues
# Browsing through multiple tasks to identify GC-intensive operations
h3. Proposed Solution
Add a "Top N Metrics" aggregation panel to the Flink Web UI Overview page
that displays:
h4. Top N CPU Consumers
- List the top 5 subtasks with the highest CPU usage
({{busyTimeMsPerSecond}})
- Display subtask ID, operator name, and CPU percentage
- Click to navigate to detailed subtask metrics page
h4. Top N Backpressured Operators
- List the top 5 operators with the most severe backpressure
- Display operator ID, name, and backpressure percentage
- Click to navigate to detailed operator metrics page
h4. Top N GC Intensive Tasks
- List the top 5 tasks with the highest GC time percentage
- Display task ID, name, and GC percentage
- Click to navigate to detailed task metrics page
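The three lists rank by percentages, while the underlying metrics are typically reported in other units. The following is a minimal sketch of the conversions, assuming the CPU and backpressure figures come from "milliseconds per second" gauges (where 1000 means fully busy) and the GC share is computed from two samples of a cumulative GC time counter. The class and method names are hypothetical and not part of the proposal or of Flink.

```java
// Hypothetical helper; names are illustrative and not Flink APIs.
final class MetricPercentages {

    private MetricPercentages() {}

    /**
     * Converts a "milliseconds per second" gauge (0..1000) to a percentage.
     * Non-finite input yields 0 so it can never rank first (sketch assumption).
     */
    static double msPerSecondToPercent(double msPerSecond) {
        if (Double.isNaN(msPerSecond) || Double.isInfinite(msPerSecond)) {
            return 0.0;
        }
        double clamped = Math.max(0.0, Math.min(1000.0, msPerSecond));
        return clamped / 10.0;
    }

    /**
     * Share of wall-clock time spent in GC between two samples of a
     * cumulative GC time counter (all arguments in milliseconds).
     */
    static double gcTimePercent(long gcMsBefore, long gcMsAfter, long elapsedMs) {
        if (elapsedMs <= 0 || gcMsAfter < gcMsBefore) {
            return 0.0; // no interval, or the counter was reset
        }
        double percent = 100.0 * (gcMsAfter - gcMsBefore) / elapsedMs;
        return Math.min(100.0, percent);
    }
}
```

Mapping missing or non-finite values to 0 is a design choice of this sketch; a real handler might instead exclude such subtasks from the ranking.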
h3. Implementation Approach
h4. Backend Changes
1. Add a new REST API endpoint, {{/jobs/:jobid/metrics/top-n}}, or
integrate the data into the existing {{/overview}} response
2. Create a handler that:
- Collects metrics from all TaskManagers and subtasks
- Aggregates and sorts metrics by CPU, backpressure, and GC time
- Returns top N results (N configurable, default 5)
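The aggregate-and-sort step above can be sketched with a bounded min-heap, which keeps only the N largest samples instead of sorting every subtask. This is one possible approach under stated assumptions, not the handler's actual implementation: the {{Sample}} type, its fields, and the choice to skip NaN values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; class and field names are illustrative, not Flink APIs.
final class TopNSelector {

    /** One metric sample for a subtask. */
    static final class Sample {
        final String name;
        final int subtaskId;
        final double value;

        Sample(String name, int subtaskId, double value) {
            this.name = name;
            this.subtaskId = subtaskId;
            this.value = value;
        }
    }

    private TopNSelector() {}

    /** Returns up to n samples with the largest values, highest first. */
    static List<Sample> topN(List<Sample> samples, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        Comparator<Sample> byValue = Comparator.comparingDouble(s -> s.value);
        int capacity = Math.max(1, Math.min(n, samples.size()));
        // Min-heap holding the n largest samples seen so far.
        PriorityQueue<Sample> heap = new PriorityQueue<>(capacity, byValue);
        for (Sample s : samples) {
            if (Double.isNaN(s.value)) {
                continue; // skip samples with no usable value (sketch assumption)
            }
            if (heap.size() < n) {
                heap.add(s);
            } else if (s.value > heap.peek().value) {
                heap.poll(); // evict the current smallest of the top n
                heap.add(s);
            }
        }
        List<Sample> result = new ArrayList<>(heap);
        result.sort(byValue.reversed());
        return result;
    }
}
```

With N fixed at a small default such as 5, the heap keeps the cost close to linear in the number of subtasks, which matters for jobs with high parallelism.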
h4. Frontend Changes
1. Add a new "Top N Metrics" component to the Overview page
2. Create a service to call the new API endpoint
3. Display the top N lists in a table or card format
4. Add click handlers to navigate to detailed views
h3. Public Interfaces
h4. REST API
{{GET /jobs/:jobid/metrics/top-n}}
Returns:
{code:json}
{
  "topCpuConsumers": [
    {
      "subtaskId": 0,
      "taskName": "Source: Kafka",
      "operatorName": "Kafka Source",
      "cpuPercentage": 95.5,
      "taskManagerId": "container_123"
    }
  ],
  "topBackpressureOperators": [
    {
      "operatorId": "op_456",
      "operatorName": "Map",
      "backpressureRatio": 0.85,
      "subtaskId": 1
    }
  ],
  "topGcIntensiveTasks": [
    {
      "taskId": "task_789",
      "taskName": "ProcessFunction",
      "gcTimePercentage": 45.2,
      "taskManagerId": "container_123"
    }
  ]
}
{code}
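For manual testing, a client could call the proposed endpoint along these lines. This is a sketch only: the endpoint exists solely in this proposal, and the helper class, the 10-second timeout, and the base URL used in the usage note are assumptions.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical client-side sketch; the endpoint exists only in this proposal.
final class TopNRequests {

    private TopNRequests() {}

    /** Builds a GET request for the proposed top-n endpoint of one job. */
    static HttpRequest topNRequest(String restBaseUrl, String jobId) {
        // Drop a trailing slash so the path is not doubled up.
        String base = restBaseUrl.endsWith("/")
                ? restBaseUrl.substring(0, restBaseUrl.length() - 1)
                : restBaseUrl;
        URI uri = URI.create(base + "/jobs/" + jobId + "/metrics/top-n");
        return HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(10)) // arbitrary timeout for the sketch
                .header("Accept", "application/json")
                .GET()
                .build();
    }
}
```

The request can then be sent with {{HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())}}, for example against {{http://localhost:8081}} on a local cluster, and the returned body compared with the JSON shape above.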
h3. Compatibility, Deprecation, and Migration Plan
This is a pure addition with no breaking changes. Existing UI
functionality remains unchanged.
h3. Test Plan
# Unit tests for the new REST API handler
# Integration tests to verify correct aggregation and sorting
# Frontend component tests
# Manual testing with a real Flink cluster:
- Start a job with multiple subtasks
- Create backpressure on some operators
- Generate high CPU/GC load
- Verify the Top N panel shows correct results
- Verify clicking entries navigates to correct detail pages
h3. Rejected Alternatives
h4. Alternative 1: Separate page for Top N metrics
- Rejected because it adds an extra navigation step
- Adding to Overview page provides immediate visibility upon page load
h4. Alternative 2: Use WebSocket for real-time updates
- Rejected because polling is sufficient for this use case
- WebSocket adds complexity without significant benefit
h3. References
- Discussion thread: (to be added)
- Related FLIPs: (none)