[ 
https://issues.apache.org/jira/browse/FLINK-39079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

featzhang updated FLINK-39079:
------------------------------
    Description: 
Currently, when troubleshooting Flink jobs, users need to navigate across 
multiple pages in the Web UI to collect diagnostic information:
 - Check checkpoints page for checkpointing issues
 - View backpressure page for operator bottlenecks
 - Monitor task managers for resource usage
 - Review logs for error messages
 - Check metrics dashboard for performance indicators

This fragmented approach makes identifying the root cause of job problems 
time-consuming and error-prone. Users often have to manually correlate 
information from different sources to understand the overall health of their 
jobs.

*Motivation:*

The proposed Diagnostic Summary Page will consolidate key diagnostic 
information into a single, easily accessible dashboard. This will significantly 
improve operational efficiency by:
 - Providing a unified view of job health status at a glance
 - Highlighting the most critical issues with visual indicators
 - Reducing the time required to diagnose problems from minutes to seconds
 - Enabling faster incident response and reduced downtime
 - Lowering the learning curve for new users by presenting information in a 
structured way

*Proposed Changes:*

1. *Add a new "Diagnostics" tab* in the Job Overview page, positioned 
alongside existing tabs (Overview, Checkpoints, Backpressure, etc.)

2. *Diagnostic Categories and Metrics:*

a. *Job Status Summary*
 - Job state (RUNNING, FAILED, CANCELED, etc.)
 - Job duration and restart history
 - Last failure timestamp and error message (if applicable)

b. *Checkpoint Health*
 - Checkpoint status indicator (Healthy/Unhealthy)
 - Latest checkpoint duration
 - Checkpoint alignment duration
 - Failed checkpoint count in last 10 minutes
 - Trend chart showing checkpoint times over the job lifecycle

c. *Backpressure Analysis*
 - List of operators with high backpressure (> 80%)
 - Backpressure severity ranking (Top 10)
 - Affected subtasks and task managers

d. *Resource Utilization*
 - Top 10 CPU-intensive tasks
 - Top 10 memory-intensive tasks
 - Task managers with high GC frequency
 - Network throughput per connection

e. *Error Tracking*
 - Recent error messages grouped by type
 - Count of exceptions in the last 5 minutes
 - Stack trace snippets for most frequent errors

f. *Alert Recommendations*
 - Auto-generated suggestions based on detected issues
 - Links to relevant documentation or configuration options
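
The backpressure ranking in (c) could be computed roughly as follows. This is a minimal sketch, not part of Flink: `BackpressureRanking` and `OperatorBackpressure` are hypothetical names, and the input is assumed to be a per-operator backpressure ratio between 0.0 and 1.0. Only the "> 80%" threshold and the "Top 10" cap come from the proposal.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types exist in Flink. */
final class BackpressureRanking {

    /** Illustrative holder: one operator and its backpressure ratio in [0.0, 1.0] (assumed). */
    static final class OperatorBackpressure {
        final String operatorName;
        final double ratio;

        OperatorBackpressure(String operatorName, double ratio) {
            this.operatorName = operatorName;
            this.ratio = ratio;
        }
    }

    static final double HIGH_THRESHOLD = 0.80; // "> 80%" from the proposal
    static final int TOP_N = 10;               // "Top 10" from the proposal

    /** Keeps operators strictly above the threshold, most severe first, capped at TOP_N. */
    static List<OperatorBackpressure> rank(List<OperatorBackpressure> all) {
        List<OperatorBackpressure> high = new ArrayList<>();
        for (OperatorBackpressure op : all) {
            if (op.ratio > HIGH_THRESHOLD) {
                high.add(op);
            }
        }
        // Descending by ratio, so the most backpressured operator comes first.
        high.sort((a, b) -> Double.compare(b.ratio, a.ratio));
        return high.size() > TOP_N ? new ArrayList<>(high.subList(0, TOP_N)) : high;
    }
}
```

The same filter-sort-cap shape would apply to the Top 10 CPU and memory lists in (d), with a different sort key.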

3. *UI/UX Design:*
 - Use color-coded status indicators (Green=Healthy, Yellow=Warning, 
Red=Critical)
 - Implement collapsible sections for each diagnostic category
 - Support filtering and sorting for lists (e.g., by severity, timestamp)
 - Include a "Refresh" button to update real-time metrics
 - Export the diagnostic report as a JSON file
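
One way the color-coded indicators could roll up into a single job-level status is to take the most severe status across categories. The sketch below assumes that rule; the proposal does not specify it. `HealthRollup` and `Status` are hypothetical names, and treating an empty input as healthy is also an assumption.

```java
import java.util.Collection;
import java.util.Collections;

/** Hypothetical sketch; not an existing Flink class. */
final class HealthRollup {

    /** Declared from least to most severe, so natural enum order ranks severity. */
    enum Status {
        HEALTHY("green"),
        WARNING("yellow"),
        CRITICAL("red");

        final String color;

        Status(String color) {
            this.color = color;
        }
    }

    /** Returns the most severe status; an empty input is treated as HEALTHY (assumption). */
    static Status overall(Collection<Status> categoryStatuses) {
        if (categoryStatuses.isEmpty()) {
            return Status.HEALTHY;
        }
        return Collections.max(categoryStatuses);
    }
}
```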

4. *Backend Changes:*
 - Add REST endpoint: `GET /jobs/:jobid/diagnostics`
 - Create `JobDiagnosticsHandler` to aggregate metrics from existing handlers
 - Implement efficient caching to avoid redundant metric collection
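
The caching mentioned above could be as simple as a per-job time-to-live cache in front of the aggregation step. The following is a sketch under stated assumptions, not the proposed implementation: `DiagnosticsCache`, its constructor parameters, and the TTL policy are all hypothetical, and the loader function stands in for whatever `JobDiagnosticsHandler` would do to aggregate metrics.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache keyed by job ID; not an existing Flink class. */
final class DiagnosticsCache<V> {

    private static final class Entry<V> {
        final V value;
        final long loadedAtMillis;

        Entry(V value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;          // injected so the expiry logic is testable
    private final Function<String, V> loader;  // stands in for the metric aggregation

    DiagnosticsCache(long ttlMillis, LongSupplier clock, Function<String, V> loader) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.loader = loader;
    }

    /** Returns the cached summary if it is younger than the TTL, otherwise reloads it. */
    V get(String jobId) {
        long now = clock.getAsLong();
        // compute() is atomic per key, so concurrent requests for one job trigger one reload.
        Entry<V> entry = entries.compute(jobId, (id, old) ->
                (old != null && now - old.loadedAtMillis < ttlMillis)
                        ? old
                        : new Entry<>(loader.apply(id), now));
        return entry.value;
    }
}
```

Running the loader inside `compute()` keeps the sketch short but holds the map's per-key lock while metrics are collected; a real handler would more likely cache a future and complete it asynchronously.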

*Alternatives Considered:*

1. *Dashboard Extension*: Instead of a dedicated diagnostics page, extend 
the existing Overview page. Rejected because it would make the Overview page 
cluttered and less focused on high-level job information.

2. *CLI-based Diagnostics*: Provide a command-line tool to export 
diagnostic information. Rejected because the Web UI is more accessible to a 
broader range of users, especially those responsible for monitoring and 
operations.

3. *Third-party Integration*: Rely on external monitoring tools (e.g., 
Prometheus, Grafana). Rejected because it adds operational complexity and 
doesn't help users who don't have such tools already set up.

 

  was:
Currently, when troubleshooting Flink jobs, users need to navigate across 
multiple pages in the Web UI to collect diagnostic information:
- Check checkpoints page for checkpointing issues
- View backpressure page for operator bottlenecks  
- Monitor task managers for resource usage
- Review logs for error messages
- Check metrics dashboard for performance indicators

This fragmented approach makes it time-consuming and error-prone to quickly 
identify the root cause of job problems. Users often have to manually correlate 
information from different sources to understand the overall health of their 
jobs.

**Motivation:**

The proposed Diagnostic Summary Page will consolidate key diagnostic 
information into a single, easily accessible dashboard. This will significantly 
improve operational efficiency by:
- Providing a unified view of job health status at a glance
- Highlighting the most critical issues with visual indicators
- Reducing the time required to diagnose problems from minutes to seconds
- Enabling faster incident response and reduced downtime
- Lowering the learning curve for new users by presenting information in a 
structured way

**Proposed Changes:**

1. **Add a new "Diagnostics" tab** in the Job Overview page, positioned 
alongside existing tabs (Overview, Checkpoints, Backpressure, etc.)

2. **Diagnostic Categories and Metrics:**

   a. **Job Status Summary**
      - Job state (RUNNING, FAILED, CANCELED, etc.)
      - Job duration and restart history
      - Last failure timestamp and error message (if applicable)

   b. **Checkpoint Health**
      - Checkpoint status indicator (Healthy/Unhealthy)
      - Latest checkpoint duration
      - Checkpoint alignment duration
      - Failed checkpoint count in last 10 minutes
      - Trend chart showing checkpoint times over the job lifecycle

   c. **Backpressure Analysis**
      - List of operators with high backpressure (> 80%)
      - Backpressure severity ranking (Top 10)
      - Affected subtasks and task managers

   d. **Resource Utilization**
      - Top 10 CPU-intensive tasks
      - Top 10 memory-intensive tasks
      - Task managers with high GC frequency
      - Network throughput per connection

   e. **Error Tracking**
      - Recent error messages grouped by type
      - Count of exceptions in the last 5 minutes
      - Stack trace snippets for most frequent errors

   f. **Alert Recommendations**
      - Auto-generated suggestions based on detected issues
      - Links to relevant documentation or configuration options

3. **UI/UX Design:**
   - Use color-coded status indicators (Green=Healthy, Yellow=Warning, 
Red=Critical)
   - Implement collapsible sections for each diagnostic category
   - Support filtering and sorting for lists (e.g., by severity, timestamp)
   - Include a "Refresh" button to update real-time metrics
   - Export diagnostic report as JSON/JSON file

4. **Backend Changes:**
   - Add REST endpoint: `GET /jobs/:jobid/diagnostics`
   - Create `JobDiagnosticsHandler` to aggregate metrics from existing handlers
   - Implement efficient caching to avoid redundant metric collection

**Alternatives Considered:**

1. **Dashboard Extension**: Instead of a dedicated diagnostics page, extend the 
existing Overview page. Rejected because it would make the Overview page 
cluttered and less focused on high-level job information.

2. **CLI-based Diagnostics**: Provide a command-line tool to export diagnostic 
information. Rejected because the Web UI is more accessible to a broader range 
of users, especially those responsible for monitoring and operations.

3. **Third-party Integration**: Rely on external monitoring tools (e.g., 
Prometheus, Grafana). Rejected because it adds operational complexity and 
doesn't help users who don't have such tools already set up.

**Additional Context:**

- Target Version: 1.21
- Component: Web Frontend / Runtime / REST
- Priority: Major
- Labels: web-ui, diagnostics, usability

This feature builds upon existing Web UI improvements such as the Top N Metrics 
dashboard and aligns with Flink's ongoing efforts to improve observability and 
operational experience.

**Related Issues:**
- FLINK-XXXXX: Add Top N Metrics Dashboard (already implemented)
- FLINK-XXXXX: Improve exception messages


> Add Diagnostic Summary Page in Flink Web UI
> -------------------------------------------
>
>                 Key: FLINK-39079
>                 URL: https://issues.apache.org/jira/browse/FLINK-39079
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Web Frontend
>            Reporter: featzhang
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
