jingzi created FLINK-39376:
------------------------------
Summary: Show TaskManager IP address in Checkpoint Subtask Statistics
Key: FLINK-39376
URL: https://issues.apache.org/jira/browse/FLINK-39376
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing
Affects Versions: 2.1.1
Environment: prod
Reporter: jingzi
h1. Summary
When diagnosing slow or failing checkpoints, operators need to identify
which TaskManager hosts are responsible for high checkpoint latency or
failures. Currently, the checkpoint subtask statistics table in the Flink Web
UI (/jobs/<job-id>/checkpoints/subtask/<vertex-id>) shows
per-subtask metrics (state size, duration, alignment, etc.) but does not
include information about which TaskManager (host/IP) each subtask ran on.
h1. Motivation
- Disk I/O bottlenecks, network issues, or GC pressure on specific nodes are
common root causes of slow checkpoints. Without host information, operators
must cross-reference subtask indices with TaskManager assignment through a
separate UI path.
- Providing the IP/hostname directly in the subtask checkpoint statistics
table removes that cross-referencing step and reduces mean time to recovery
(MTTR) for checkpoint-related incidents.
h1. Proposed Solution
1. Add an ip field to SubtaskStateStats populated from TaskManagerLocation at
checkpoint acknowledgement time in PendingCheckpoint.
2. Expose the field in the REST API response via
SubtaskCheckpointStatistics.CompletedSubtaskCheckpointStatistics.
3. Display a sortable "IP Address" column in the Web UI subtask checkpoint
statistics table.
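
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual patch: Location and SubtaskStats are simplified, hypothetical stand-ins for TaskManagerLocation and SubtaskStateStats, the method names onAcknowledge, toResponse and sortByIp are invented for this sketch, and the response keys (including "ip") are illustrative rather than the real REST schema.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; these are NOT Flink's actual classes. */
public class CheckpointHostSketch {

    /** Hypothetical stand-in for TaskManagerLocation. */
    static final class Location {
        final String hostname;
        final String ip;

        Location(String hostname, String ip) {
            this.hostname = hostname;
            this.ip = ip;
        }
    }

    /** Hypothetical stand-in for SubtaskStateStats with the proposed ip field. */
    static final class SubtaskStats {
        final int subtaskIndex;
        final long stateSize;
        final long durationMs;
        final String ip; // proposed new field; null if the location is unknown

        SubtaskStats(int subtaskIndex, long stateSize, long durationMs, String ip) {
            this.subtaskIndex = subtaskIndex;
            this.stateSize = stateSize;
            this.durationMs = durationMs;
            this.ip = ip;
        }
    }

    /** Step 1: capture the IP when the subtask acknowledges the checkpoint. */
    static SubtaskStats onAcknowledge(
            int subtaskIndex, Location location, long stateSize, long durationMs) {
        String ip = (location == null) ? null : location.ip;
        return new SubtaskStats(subtaskIndex, stateSize, durationMs, ip);
    }

    /** Step 2: expose the field in a REST-like response (keys are assumptions). */
    static Map<String, Object> toResponse(SubtaskStats stats) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("index", stats.subtaskIndex);
        body.put("state_size", stats.stateSize);
        body.put("duration_ms", stats.durationMs);
        body.put("ip", stats.ip);
        return body;
    }

    /** Step 3: sort rows by IP (plain string order), unknown hosts last. */
    static List<SubtaskStats> sortByIp(List<SubtaskStats> rows) {
        List<SubtaskStats> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing(
                (SubtaskStats s) -> s.ip,
                Comparator.nullsLast(Comparator.naturalOrder())));
        return sorted;
    }

    public static void main(String[] args) {
        List<SubtaskStats> rows = new ArrayList<>();
        rows.add(onAcknowledge(0, new Location("tm-b", "10.0.0.7"), 2048L, 900L));
        rows.add(onAcknowledge(1, null, 1024L, 120L));
        rows.add(onAcknowledge(2, new Location("tm-a", "10.0.0.3"), 4096L, 150L));

        for (SubtaskStats row : sortByIp(rows)) {
            System.out.println(toResponse(row));
        }
    }
}
```

Sorting here compares IPs as plain strings, so "10.0.0.10" orders before "10.0.0.5"; a real UI column might prefer a numeric per-octet comparison. Treating a missing location as null (sorted last) is likewise a design choice made for this sketch.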