jingzi created FLINK-39376:
------------------------------

             Summary: Show TaskManager IP address in Checkpoint Subtask Statistics
                 Key: FLINK-39376
                 URL: https://issues.apache.org/jira/browse/FLINK-39376
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
    Affects Versions: 2.1.1
         Environment: prod
            Reporter: jingzi


h1.   Summary

    When diagnosing slow or failing checkpoints, operators need to identify
which TaskManager hosts are responsible for the high latency or the failures.
Currently, the checkpoint subtask statistics table in the Flink Web UI
(/jobs/<job-id>/checkpoints/subtask/<vertex-id>) shows per-subtask metrics
(state size, duration, alignment, etc.) but does not show which TaskManager
(host/IP) each subtask ran on.

h1.   Motivation

  - Disk I/O bottlenecks, network issues, or GC pressure on specific nodes are
common root causes of slow checkpoints. Without host information, operators
must look up each subtask index in the TaskManager assignment on a separate
page of the UI.
  - Showing the IP/hostname directly in the subtask checkpoint statistics
table reduces MTTR for checkpoint-related incidents.

h1.   Proposed Solution

  1. Add an ip field to SubtaskStateStats, populated from the subtask's
TaskManagerLocation when PendingCheckpoint handles the checkpoint
acknowledgement.
  2. Expose the field in the REST API response via
SubtaskCheckpointStatistics.CompletedSubtaskCheckpointStatistics.
  3. Display a sortable "IP Address" column in the subtask checkpoint
statistics table of the Web UI.
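
The steps above could be sketched roughly as follows. This is a minimal, standalone illustration and not Flink code: SubtaskStats is a hypothetical, heavily reduced stand-in for SubtaskStateStats, and onAcknowledge and slowestByIp are made-up helper names. It only shows the idea of recording the TaskManager address when a subtask acknowledges a checkpoint, then grouping by that address the way a sortable "IP Address" column would let an operator do by eye.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class CheckpointHostSketch {

    /** Hypothetical, reduced stand-in for SubtaskStateStats with the proposed ip field. */
    static final class SubtaskStats {
        final int subtaskIndex;
        final long durationMillis;
        final String ip; // the proposed new field

        SubtaskStats(int subtaskIndex, long durationMillis, String ip) {
            this.subtaskIndex = subtaskIndex;
            this.durationMillis = durationMillis;
            this.ip = ip;
        }
    }

    /**
     * Step 1 (assumed shape): build the per-subtask stats at acknowledgement
     * time, attaching the address of the TaskManager that ran the subtask.
     * A placeholder is used when the address is not known.
     */
    static SubtaskStats onAcknowledge(int subtaskIndex, long durationMillis, String taskManagerIp) {
        String ip = (taskManagerIp == null) ? "(unknown)" : taskManagerIp;
        return new SubtaskStats(subtaskIndex, durationMillis, ip);
    }

    /**
     * Steps 2 and 3 (illustration only): with the address available per
     * subtask, a consumer can compute the slowest checkpoint duration per
     * TaskManager address. The result is sorted by address.
     */
    static Map<String, Long> slowestByIp(List<SubtaskStats> stats) {
        return stats.stream().collect(Collectors.toMap(
                s -> s.ip,
                s -> s.durationMillis,
                Long::max,
                TreeMap::new));
    }

    public static void main(String[] args) {
        // Made-up sample data: four subtasks spread over two TaskManagers.
        List<SubtaskStats> stats = List.of(
                onAcknowledge(0, 120L, "10.0.0.1"),
                onAcknowledge(1, 4300L, "10.0.0.2"),
                onAcknowledge(2, 150L, "10.0.0.1"),
                onAcknowledge(3, 3900L, "10.0.0.2"));

        slowestByIp(stats).forEach(
                (ip, millis) -> System.out.println(ip + " -> " + millis + " ms"));
    }
}
```

In this made-up data set, both slow subtasks sit on the same address, which is the kind of pattern the new column is meant to surface without leaving the checkpoint page.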



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
