featzhang created FLINK-39079:
---------------------------------

             Summary: Add Diagnostic Summary Page in Flink Web UI
                 Key: FLINK-39079
                 URL: https://issues.apache.org/jira/browse/FLINK-39079
             Project: Flink
          Issue Type: New Feature
          Components: Runtime / Web Frontend
            Reporter: featzhang


Currently, when troubleshooting Flink jobs, users need to navigate across 
multiple pages in the Web UI to collect diagnostic information:
- Check the Checkpoints page for checkpointing issues
- View the Backpressure page for operator bottlenecks
- Monitor Task Managers for resource usage
- Review logs for error messages
- Check the metrics dashboard for performance indicators

This fragmented approach makes identifying the root cause of job problems 
time-consuming and error-prone. Users often have to manually correlate 
information from different sources to understand the overall health of their 
jobs.

**Motivation:**

The proposed Diagnostic Summary Page will consolidate key diagnostic 
information into a single, easily accessible dashboard. This will significantly 
improve operational efficiency by:
- Providing a unified view of job health status at a glance
- Highlighting the most critical issues with visual indicators
- Reducing the time required to diagnose problems from minutes to seconds
- Enabling faster incident response and reduced downtime
- Lowering the learning curve for new users by presenting information in a 
structured way

**Proposed Changes:**

1. **Add a new "Diagnostics" tab** in the Job Overview page, positioned 
alongside existing tabs (Overview, Checkpoints, Backpressure, etc.)

2. **Diagnostic Categories and Metrics:**

   a. **Job Status Summary**
      - Job state (RUNNING, FAILED, CANCELED, etc.)
      - Job duration and restart history
      - Last failure timestamp and error message (if applicable)

   b. **Checkpoint Health**
      - Checkpoint status indicator (Healthy/Unhealthy)
      - Latest checkpoint duration
      - Checkpoint alignment duration
      - Failed checkpoint count in last 10 minutes
      - Trend chart showing checkpoint times over the job lifecycle

   c. **Backpressure Analysis**
      - List of operators with high backpressure (> 80%)
      - Backpressure severity ranking (Top 10)
      - Affected subtasks and task managers

   d. **Resource Utilization**
      - Top 10 CPU-intensive tasks
      - Top 10 memory-intensive tasks
      - Task managers with high GC frequency
      - Network throughput per connection

   e. **Error Tracking**
      - Recent error messages grouped by type
      - Count of exceptions in the last 5 minutes
      - Stack trace snippets for most frequent errors

   f. **Alert Recommendations**
      - Auto-generated suggestions based on detected issues
      - Links to relevant documentation or configuration options

3. **UI/UX Design:**
   - Use color-coded status indicators (Green=Healthy, Yellow=Warning, 
Red=Critical)
   - Implement collapsible sections for each diagnostic category
   - Support filtering and sorting for lists (e.g., by severity, timestamp)
   - Include a "Refresh" button to update real-time metrics
   - Export the diagnostic report as a JSON file

4. **Backend Changes:**
   - Add REST endpoint: `GET /jobs/:jobid/diagnostics`
   - Create `JobDiagnosticsHandler` to aggregate metrics from existing handlers
   - Implement efficient caching to avoid redundant metric collection
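
As a rough illustration of the aggregation and caching ideas above, here is a minimal Java sketch. Every name in it (`DiagnosticsSketch`, `OperatorBackpressure`, `topBackpressured`, `TtlCache`) is hypothetical and not an existing Flink API. The 80% threshold and the Top 10 cap come from the Backpressure Analysis category in this proposal. The time-to-live strategy is only one possible reading of "efficient caching".

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;
import java.util.stream.Collectors;

/** Illustrative only: none of these types exist in Flink. */
final class DiagnosticsSketch {

    /** Hypothetical backpressure sample for one operator; ratio is in [0, 1]. */
    static final class OperatorBackpressure {
        final String operatorName;
        final double ratio;

        OperatorBackpressure(String operatorName, double ratio) {
            this.operatorName = operatorName;
            this.ratio = ratio;
        }
    }

    /** Keeps operators strictly above the threshold, worst first, capped at topN. */
    static List<OperatorBackpressure> topBackpressured(
            List<OperatorBackpressure> samples, double threshold, int topN) {
        return samples.stream()
                .filter(s -> s.ratio > threshold)
                .sorted(Comparator.comparingDouble(
                        (OperatorBackpressure s) -> s.ratio).reversed())
                .limit(topN)
                .collect(Collectors.toList());
    }

    /** Time-to-live cache keyed by job id; the clock is injected so expiry is testable. */
    static final class TtlCache<V> {

        private static final class Entry<T> {
            final T value;
            final long expiresAtMillis;

            Entry(T value, long expiresAtMillis) {
                this.value = value;
                this.expiresAtMillis = expiresAtMillis;
            }
        }

        private final Map<String, Entry<V>> entries = new ConcurrentHashMap<>();
        private final long ttlMillis;
        private final LongSupplier clock;

        TtlCache(long ttlMillis, LongSupplier clock) {
            this.ttlMillis = ttlMillis;
            this.clock = clock;
        }

        /** Returns the cached value if still fresh, otherwise reloads and stores it. */
        V get(String jobId, Function<String, V> loader) {
            long now = clock.getAsLong();
            return entries.compute(jobId, (id, old) ->
                    old != null && old.expiresAtMillis > now
                            ? old
                            : new Entry<V>(loader.apply(id), now + ttlMillis)).value;
        }
    }
}
```

A handler along the lines of `JobDiagnosticsHandler` could hold something like `new DiagnosticsSketch.TtlCache<>(5_000, System::currentTimeMillis)` and call `get(jobId, loader)` on each request, so that repeated refreshes within the TTL reuse one aggregated summary. The 5-second TTL is an arbitrary placeholder, not a value from this proposal.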

**Alternatives Considered:**

1. **Dashboard Extension**: Instead of a dedicated diagnostics page, extend the 
existing Overview page. Rejected because it would make the Overview page 
cluttered and less focused on high-level job information.

2. **CLI-based Diagnostics**: Provide a command-line tool to export diagnostic 
information. Rejected because the Web UI is more accessible to a broader range 
of users, especially those responsible for monitoring and operations.

3. **Third-party Integration**: Rely on external monitoring tools (e.g., 
Prometheus, Grafana). Rejected because it adds operational complexity and 
doesn't help users who don't have such tools already set up.

**Additional Context:**

- Target Version: 1.21
- Component: Web Frontend / Runtime / REST
- Priority: Major
- Labels: web-ui, diagnostics, usability

This feature builds upon existing Web UI improvements such as the Top N Metrics 
dashboard and aligns with Flink's ongoing efforts to improve observability and 
operational experience.

**Related Issues:**
- FLINK-XXXXX: Add Top N Metrics Dashboard (already implemented)
- FLINK-XXXXX: Improve exception messages



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
