[
https://issues.apache.org/jira/browse/FLINK-39079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
featzhang updated FLINK-39079:
------------------------------
Description:
Currently, when troubleshooting Flink jobs, users need to navigate across
multiple pages in the Web UI to collect diagnostic information:
- Check checkpoints page for checkpointing issues
- View backpressure page for operator bottlenecks
- Monitor task managers for resource usage
- Review logs for error messages
- Check metrics dashboard for performance indicators
This fragmented approach makes root-cause analysis slow and error-prone.
Users often have to manually correlate information from different sources to
understand the overall health of their jobs.
*Motivation:*
The proposed Diagnostic Summary Page will consolidate key diagnostic
information into a single, easily accessible dashboard. This will significantly
improve operational efficiency by:
- Providing a unified view of job health status at a glance
- Highlighting the most critical issues with visual indicators
- Reducing the time required to diagnose problems from minutes to seconds
- Enabling faster incident response and reduced downtime
- Lowering the learning curve for new users by presenting information in a
structured way
*Proposed Changes:*
1. *Add a new "Diagnostics" tab* in the Job Overview page, positioned
alongside existing tabs (Overview, Checkpoints, Backpressure, etc.)
2. *Diagnostic Categories and Metrics:*
a. *Job Status Summary*
- Job state (RUNNING, FAILED, CANCELED, etc.)
- Job duration and restart history
- Last failure timestamp and error message (if applicable)
b. *Checkpoint Health*
- Checkpoint status indicator (Healthy/Unhealthy)
- Latest checkpoint duration
- Checkpoint alignment duration
- Failed checkpoint count in the last 10 minutes
- Trend chart showing checkpoint times over the job lifecycle
c. *Backpressure Analysis*
- List of operators with high backpressure (> 80%)
- Backpressure severity ranking (Top 10)
- Affected subtasks and task managers
d. *Resource Utilization*
- Top 10 CPU-intensive tasks
- Top 10 memory-intensive tasks
- Task managers with high GC frequency
- Network throughput per connection
e. *Error Tracking*
- Recent error messages grouped by type
- Count of exceptions in the last 5 minutes
- Stack trace snippets for most frequent errors
f. *Alert Recommendations*
- Auto-generated suggestions based on detected issues
- Links to relevant documentation or configuration options
3. *UI/UX Design:*
- Use color-coded status indicators (Green=Healthy, Yellow=Warning,
Red=Critical)
- Implement collapsible sections for each diagnostic category
- Support filtering and sorting for lists (e.g., by severity, timestamp)
- Include a "Refresh" button to update real-time metrics
- Export the diagnostic report as a JSON file
4. *Backend Changes:*
- Add REST endpoint: `GET /jobs/:jobid/diagnostics`
- Create `JobDiagnosticsHandler` to aggregate metrics from existing handlers
- Implement efficient caching to avoid redundant metric collection
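
The aggregation and caching described above could look roughly like the
following. This is only a minimal sketch, written in Python for brevity rather
than in the Java a real `JobDiagnosticsHandler` would use. The names
`summarize`, `DiagnosticsCache`, the shape of the raw metrics dictionary, the
5-second TTL, and the mapping of conditions to Green/Yellow/Red are all
assumptions made for illustration. Only the 80% backpressure threshold and the
Top 10 cap come from the proposal.

```python
import time
from typing import Callable, Dict, Tuple

# Only the 80% cutoff and the Top 10 cap come from the proposal text.
BACKPRESSURE_THRESHOLD = 0.80
TOP_N = 10


def summarize(raw: dict) -> dict:
    """Reduce raw per-job metrics to a summary for the Diagnostics tab.

    `raw` is a hypothetical structure: a mapping of operator name to
    backpressure ratio, plus a count of recently failed checkpoints.
    """
    # Operators above the threshold, worst first, capped at the top N.
    hot = sorted(
        ((name, ratio) for name, ratio in raw["backpressure"].items()
         if ratio > BACKPRESSURE_THRESHOLD),
        key=lambda item: item[1],
        reverse=True,
    )[:TOP_N]
    failed = raw["failed_checkpoints_last_10_min"]
    # Assumed severity mapping: failed checkpoints are critical,
    # backpressure alone is a warning.
    if failed > 0:
        status = "RED"
    elif hot:
        status = "YELLOW"
    else:
        status = "GREEN"
    return {
        "status": status,
        "backpressured_operators": hot,
        "failed_checkpoints_last_10_min": failed,
    }


class DiagnosticsCache:
    """Serves a cached summary per job until it is older than ttl_seconds."""

    def __init__(self, collect: Callable[[str], dict],
                 ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._collect = collect    # stands in for the existing handlers
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, job_id: str) -> dict:
        now = self._clock()
        cached = self._entries.get(job_id)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]       # still fresh: skip metric collection
        summary = summarize(self._collect(job_id))
        self._entries[job_id] = (now, summary)
        return summary
```

With a cache like this, repeated requests to `GET /jobs/:jobid/diagnostics`
within the TTL would reuse one aggregated result instead of querying every
underlying handler again. A suitable TTL would depend on how fresh the
"Refresh" button is expected to be.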
*Alternatives Considered:*
1. *Dashboard Extension*: Instead of a dedicated diagnostics page, extend
the existing Overview page. Rejected because it would make the Overview page
cluttered and less focused on high-level job information.
2. *CLI-based Diagnostics*: Provide a command-line tool to export
diagnostic information. Rejected because the Web UI is more accessible to a
broader range of users, especially those responsible for monitoring and
operations.
3. *Third-party Integration*: Rely on external monitoring tools (e.g.,
Prometheus, Grafana). Rejected because it adds operational complexity and
doesn't help users who don't have such tools already set up.
was:
Currently, when troubleshooting Flink jobs, users need to navigate across
multiple pages in the Web UI to collect diagnostic information:
- Check checkpoints page for checkpointing issues
- View backpressure page for operator bottlenecks
- Monitor task managers for resource usage
- Review logs for error messages
- Check metrics dashboard for performance indicators
This fragmented approach makes root-cause analysis slow and error-prone.
Users often have to manually correlate information from different sources to
understand the overall health of their jobs.
**Motivation:**
The proposed Diagnostic Summary Page will consolidate key diagnostic
information into a single, easily accessible dashboard. This will significantly
improve operational efficiency by:
- Providing a unified view of job health status at a glance
- Highlighting the most critical issues with visual indicators
- Reducing the time required to diagnose problems from minutes to seconds
- Enabling faster incident response and reduced downtime
- Lowering the learning curve for new users by presenting information in a
structured way
**Proposed Changes:**
1. **Add a new "Diagnostics" tab** in the Job Overview page, positioned
alongside existing tabs (Overview, Checkpoints, Backpressure, etc.)
2. **Diagnostic Categories and Metrics:**
a. **Job Status Summary**
- Job state (RUNNING, FAILED, CANCELED, etc.)
- Job duration and restart history
- Last failure timestamp and error message (if applicable)
b. **Checkpoint Health**
- Checkpoint status indicator (Healthy/Unhealthy)
- Latest checkpoint duration
- Checkpoint alignment duration
- Failed checkpoint count in the last 10 minutes
- Trend chart showing checkpoint times over the job lifecycle
c. **Backpressure Analysis**
- List of operators with high backpressure (> 80%)
- Backpressure severity ranking (Top 10)
- Affected subtasks and task managers
d. **Resource Utilization**
- Top 10 CPU-intensive tasks
- Top 10 memory-intensive tasks
- Task managers with high GC frequency
- Network throughput per connection
e. **Error Tracking**
- Recent error messages grouped by type
- Count of exceptions in the last 5 minutes
- Stack trace snippets for most frequent errors
f. **Alert Recommendations**
- Auto-generated suggestions based on detected issues
- Links to relevant documentation or configuration options
3. **UI/UX Design:**
- Use color-coded status indicators (Green=Healthy, Yellow=Warning,
Red=Critical)
- Implement collapsible sections for each diagnostic category
- Support filtering and sorting for lists (e.g., by severity, timestamp)
- Include a "Refresh" button to update real-time metrics
- Export the diagnostic report as a JSON file
4. **Backend Changes:**
- Add REST endpoint: `GET /jobs/:jobid/diagnostics`
- Create `JobDiagnosticsHandler` to aggregate metrics from existing handlers
- Implement efficient caching to avoid redundant metric collection
**Alternatives Considered:**
1. **Dashboard Extension**: Instead of a dedicated diagnostics page, extend the
existing Overview page. Rejected because it would make the Overview page
cluttered and less focused on high-level job information.
2. **CLI-based Diagnostics**: Provide a command-line tool to export diagnostic
information. Rejected because the Web UI is more accessible to a broader range
of users, especially those responsible for monitoring and operations.
3. **Third-party Integration**: Rely on external monitoring tools (e.g.,
Prometheus, Grafana). Rejected because it adds operational complexity and
doesn't help users who don't have such tools already set up.
**Additional Context:**
- Target Version: 1.21
- Component: Web Frontend / Runtime / REST
- Priority: Major
- Labels: web-ui, diagnostics, usability
This feature builds upon existing Web UI improvements such as the Top N Metrics
dashboard and aligns with Flink's ongoing efforts to improve observability and
operational experience.
**Related Issues:**
- FLINK-XXXXX: Add Top N Metrics Dashboard (already implemented)
- FLINK-XXXXX: Improve exception messages
> Add Diagnostic Summary Page in Flink Web UI
> -------------------------------------------
>
> Key: FLINK-39079
> URL: https://issues.apache.org/jira/browse/FLINK-39079
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Web Frontend
> Reporter: featzhang
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)