[
https://issues.apache.org/jira/browse/FLINK-39079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
featzhang updated FLINK-39079:
------------------------------
Description:
Currently, troubleshooting Flink jobs requires navigating across multiple Web
UI pages to collect diagnostic information:
* *Checkpoints:* For checkpointing issues.
* *Backpressure:* For operator bottlenecks.
* *Task Managers:* For resource usage.
* *Logs:* For error messages.
* *Metrics:* For performance indicators.
This fragmented approach is time-consuming and error-prone, as users must
manually correlate data from different sources.
h4. *Motivation*
The proposed *Diagnostic Summary Page* aims to consolidate key diagnostic
information into a single dashboard to:
# Provide a unified view of job health at a glance.
# Highlight critical issues with visual indicators.
# Reduce diagnosis time from minutes to seconds.
# Lower the learning curve for new users.
h4. *Proposed Changes*
*1. New "Diagnostics" Tab*
* Add a new *"Diagnostics"* tab in the Job Overview page alongside existing
tabs (Overview, Checkpoints, Backpressure, etc.).
*2. Diagnostic Categories and Metrics*
The page will include the following modules:
* *Job Status Summary:* State (RUNNING/FAILED), duration, restart history,
last failure timestamp, and error message.
* *Checkpoint Health:* Status indicator (Healthy/Unhealthy), latest duration,
alignment duration, failed count (last 10 mins), and trend charts.
* *Backpressure Analysis:* List of operators with high backpressure (>80%),
severity ranking (Top 10), and affected subtasks/TMs.
* *Resource Utilization:* Top 10 CPU/Memory intensive tasks, TMs with high GC
frequency, and network throughput.
* *Error Tracking:* Recent errors grouped by type, exception counts (last 5
mins), and stack trace snippets.
* *Alert Recommendations:* Auto-generated suggestions based on detected issues
with links to documentation.
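The Backpressure Analysis module above could be computed roughly as follows. This is a minimal Python sketch for illustration only, not Flink code: {{OperatorBackpressure}}, {{rank_backpressured}}, and the field names are hypothetical. The 80% threshold and the Top 10 limit are taken from the list above; treating the threshold as strict (greater than, not equal) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorBackpressure:
    """Hypothetical per-operator sample; not an actual Flink type."""
    operator: str
    ratio: float  # fraction of time the operator is backpressured, 0.0 to 1.0


def rank_backpressured(samples, threshold=0.8, limit=10):
    """Return operators strictly above the threshold, worst first, capped at limit."""
    high = [s for s in samples if s.ratio > threshold]
    high.sort(key=lambda s: s.ratio, reverse=True)
    return high[:limit]
```

The same filter-then-rank shape would apply to the "Top 10 CPU/Memory intensive tasks" list, with a different key.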
*3. UI/UX Design*
* *Color-coded indicators:* Green (Healthy), Yellow (Warning), Red (Critical).
* *Collapsible sections* for each category.
* *Filtering & Sorting* for lists (e.g., by severity).
* *Refresh button* for real-time updates.
* *Export function* to save reports as JSON/HTML.
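One way the color-coded indicators, severity sorting, and JSON export could fit together is sketched below in Python. All names here ({{Severity}}, {{overall_severity}}, {{sort_by_severity}}, {{export_json}}, and the report keys) are hypothetical. Deriving a page-level indicator as the worst severity among the categories is an assumption, not something this proposal specifies.

```python
import json
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical mapping of the three indicator colors to sortable levels."""
    GREEN = 0   # Healthy
    YELLOW = 1  # Warning
    RED = 2     # Critical


def overall_severity(categories):
    """Assumed rule: the page-level indicator is the worst category severity."""
    return max(categories.values(), default=Severity.GREEN)


def sort_by_severity(categories):
    """Order (name, severity) pairs with the most severe category first."""
    return sorted(categories.items(), key=lambda kv: kv[1], reverse=True)


def export_json(job_id, categories):
    """Serialize a report; the key names are illustrative, not a defined schema."""
    report = {
        "jobId": job_id,
        "overall": overall_severity(categories).name,
        "categories": {name: sev.name for name, sev in sort_by_severity(categories)},
    }
    return json.dumps(report, indent=2)
```

Using an ordered enum keeps sorting and the "worst of" rule to a single comparison, so adding a fourth level later would not change the callers.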
*4. Backend Changes*
* *New REST Endpoint:* {{GET /jobs/:jobid/diagnostics}}
* *New Handler:* {{JobDiagnosticsHandler}} to aggregate metrics from existing
handlers.
* *Caching:* Implement efficient caching to avoid redundant collection.
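The caching item above does not name a strategy. A simple time-to-live cache in front of the aggregation step is one option, sketched here in Python under stated assumptions: {{TtlCache}}, the {{collect}} callback, and the 5-second default are hypothetical and do not describe the actual {{JobDiagnosticsHandler}}. The sketch is not thread-safe.

```python
import time


class TtlCache:
    """Caches one diagnostics summary per job id for a fixed time-to-live."""

    def __init__(self, collect, ttl_seconds=5.0, clock=time.monotonic):
        self._collect = collect   # hypothetical callable: job_id -> summary
        self._ttl = ttl_seconds
        self._clock = clock       # injectable so expiry can be tested
        self._entries = {}        # job_id -> (expires_at, summary)

    def get(self, job_id):
        now = self._clock()
        entry = self._entries.get(job_id)
        if entry is not None and now < entry[0]:
            return entry[1]       # still fresh: skip redundant collection
        summary = self._collect(job_id)
        self._entries[job_id] = (now + self._ttl, summary)
        return summary
```

With this shape, repeated requests to {{GET /jobs/:jobid/diagnostics}} within the TTL would reuse one aggregation result, at the cost of data that may be up to one TTL old.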
h4. *Alternatives Considered*
# *Dashboard Extension:* Extending the existing Overview page was rejected to
avoid clutter.
# *CLI-based Diagnostics:* Rejected due to lower accessibility compared to the
Web UI for operations teams.
# *Third-party Integration:* Rejected as it adds operational complexity and
excludes users without external monitoring tools.
was:
Currently, troubleshooting Flink jobs requires navigating across multiple Web
UI pages to collect diagnostic information:
* *Checkpoints:* For checkpointing issues.
* *Backpressure:* For operator bottlenecks.
* *Task Managers:* For resource usage.
* *Logs:* For error messages.
* *Metrics:* For performance indicators.
This fragmented approach is time-consuming and error-prone, as users must
manually correlate data from different sources.
h4. *Motivation*
The proposed *Diagnostic Summary Page* aims to consolidate key diagnostic
information into a single dashboard to:
# Provide a unified view of job health at a glance.
# Highlight critical issues with visual indicators.
# Reduce diagnosis time from minutes to seconds.
# Lower the learning curve for new users.
h4. *Proposed Changes*
*1. New "Diagnostics" Tab*
* Add a new *"Diagnostics"* tab in the Job Overview page alongside existing
tabs (Overview, Checkpoints, Backpressure, etc.).
*2. Diagnostic Categories and Metrics*
The page will include the following modules: * *Job Status Summary:* State
(RUNNING/FAILED), duration, restart history, last failure timestamp, and error
message.
* *Checkpoint Health:* Status indicator (Healthy/Unhealthy), latest duration,
alignment duration, failed count (last 10 mins), and trend charts.
* *Backpressure Analysis:* List of operators with high backpressure (>80%),
severity ranking (Top 10), and affected subtasks/TMs.
* *Resource Utilization:* Top 10 CPU/Memory intensive tasks, TMs with high GC
frequency, and network throughput.
* *Error Tracking:* Recent errors grouped by type, exception counts (last 5
mins), and stack trace snippets.
* *Alert Recommendations:* Auto-generated suggestions based on detected issues
with links to documentation.
*3. UI/UX Design*
* *Color-coded indicators:* Green (Healthy), Yellow (Warning), Red (Critical).
* *Collapsible sections* for each category.
* *Filtering & Sorting* for lists (e.g., by severity).
* *Refresh button* for real-time updates.
* *Export function* to save reports as JSON/HTML.
*4. Backend Changes* * *New REST Endpoint:* {{GET /jobs/:jobid/diagnostics}}
* *New Handler:* {{JobDiagnosticsHandler}} to aggregate metrics from existing
handlers.
* *Caching:* Implement efficient caching to avoid redundant collection.
h4. *Alternatives Considered*
# *Dashboard Extension:* Extending the existing Overview page was rejected to
avoid clutter.
# *CLI-based Diagnostics:* Rejected due to lower accessibility compared to the
Web UI for operations teams.
# *Third-party Integration:* Rejected as it adds operational complexity and
excludes users without external monitoring tools.
> Add Diagnostic Summary Page in Flink Web UI
> -------------------------------------------
>
> Key: FLINK-39079
> URL: https://issues.apache.org/jira/browse/FLINK-39079
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Web Frontend
> Reporter: featzhang
> Priority: Major
>
> Currently, troubleshooting Flink jobs requires navigating across multiple Web
> UI pages to collect diagnostic information:
> * *Checkpoints:* For checkpointing issues.
> * *Backpressure:* For operator bottlenecks.
> * *Task Managers:* For resource usage.
> * *Logs:* For error messages.
> * *Metrics:* For performance indicators.
> This fragmented approach is time-consuming and error-prone, as users must
> manually correlate data from different sources.
> h4. *Motivation*
> The proposed *Diagnostic Summary Page* aims to consolidate key diagnostic
> information into a single dashboard to:
> # Provide a unified view of job health at a glance.
> # Highlight critical issues with visual indicators.
> # Reduce diagnosis time from minutes to seconds.
> # Lower the learning curve for new users.
> h4. *Proposed Changes*
> *1. New "Diagnostics" Tab*
> * Add a new *"Diagnostics"* tab in the Job Overview page alongside existing
> tabs (Overview, Checkpoints, Backpressure, etc.).
> *2. Diagnostic Categories and Metrics*
> The page will include the following modules:
> * *Job Status Summary:* State (RUNNING/FAILED), duration, restart history,
> last failure timestamp, and error message.
> * *Checkpoint Health:* Status indicator (Healthy/Unhealthy), latest
> duration, alignment duration, failed count (last 10 mins), and trend charts.
> * *Backpressure Analysis:* List of operators with high backpressure (>80%),
> severity ranking (Top 10), and affected subtasks/TMs.
> * *Resource Utilization:* Top 10 CPU/Memory intensive tasks, TMs with high
> GC frequency, and network throughput.
> * *Error Tracking:* Recent errors grouped by type, exception counts (last 5
> mins), and stack trace snippets.
> * *Alert Recommendations:* Auto-generated suggestions based on detected
> issues with links to documentation.
> *3. UI/UX Design*
> * *Color-coded indicators:* Green (Healthy), Yellow (Warning), Red (Critical).
> * *Collapsible sections* for each category.
> * *Filtering & Sorting* for lists (e.g., by severity).
> * *Refresh button* for real-time updates.
> * *Export function* to save reports as JSON/HTML.
> *4. Backend Changes*
> * *New REST Endpoint:* {{GET /jobs/:jobid/diagnostics}}
> * *New Handler:* {{JobDiagnosticsHandler}} to aggregate metrics from
> existing handlers.
> * *Caching:* Implement efficient caching to avoid redundant collection.
> h4. *Alternatives Considered*
> # *Dashboard Extension:* Extending the existing Overview page was rejected
> to avoid clutter.
> # *CLI-based Diagnostics:* Rejected due to lower accessibility compared to
> the Web UI for operations teams.
> # *Third-party Integration:* Rejected as it adds operational complexity and
> excludes users without external monitoring tools.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)