laskoviymishka commented on issue #16389:
URL: https://github.com/apache/iceberg/issues/16389#issuecomment-4556828819

   @HenryCaiHaiying thanks for putting this together. Two pieces of feedback.
   
   **Layering**
   
   I’d try to address #16361 first and re-measure.
   
   The `O(N²)` behavior is the actual spiral mechanism. Without it, a 15-minute 
HMS outage should create an N-sized backlog that recovers cleanly on the next 
cycle. So the motivating case for this proposal may not reproduce after that 
fix.
   
   If we add a Coordinator → Worker feedback edge, I’d rather do it once with 
the right signal than ship a workaround we already know we may need to revisit.
   
   **Signal**
   
   The thing we want to bound is commit batch size, not “consecutive missing 
acks.”
   
   That signal lives directly on the Coordinator. I’d suggest something like:
   
   `CoordinatorStatus { pending_files, pending_partitions, last_commit_outcome 
}`
   
   published periodically on the control topic.
   
   Workers can then refuse new writes, or throw `RetriableException`, when 
`pending_files > threshold`.
   
   That gives us graded degradation, a clear resume threshold, no startup-state 
ambiguity, no coupling to `offset.flush.interval.ms`, and it avoids the 
partial-commit blind spot. It is a bit more wire-protocol work, but it is a 
much better mechanism.
   **If we keep the current approach**
   
   Please address the following before we rely on this in production.
   
   **Initial-state grace period**
   
   The control consumer uses a transient group ID and seeks to latest. That 
means a worker joining mid-cycle, or restarting after a rebalance, can see 
`START_COMMIT` messages without ever seeing the matching `COMMIT_COMPLETE`.
   
   With `stalled.cycles=3`, three healthy commit starts in a row could falsely 
pause a new worker.
   
   I’d add an initial state: do not increment the missed-cycle counter until 
the worker has seen one full `START_COMMIT → COMMIT_COMPLETE` cycle. After 
that, it has a baseline.
   
   **Partial commits**
   
   `Coordinator.commit(true)` still emits `COMMIT_COMPLETE` when the partial 
commit succeeds. So a Coordinator that is degraded and only making partial 
progress looks healthy to workers.
   
   That hides the exact failure mode this proposal is trying to detect.
   
   I’d either:
   
   * add `partial=true/false` to `CommitComplete`, and treat partial commits as 
missed progress for backpressure
   * or emit a separate `CommitPartial` event
   
   Either is fine. I just don’t think we can leave this undefined.
   
   **Stable-success threshold before resuming**
   
   `PAUSED → ACTIVE` on the first `COMMIT_COMPLETE` feels too eager.
   
   The Coordinator may have completed one cycle while still being behind. If 
thousands of workers all see that same message and resume together, they can 
overwhelm it again.
   
   I’d add a config like:
   
   `iceberg.coordinator.progress.resume.stable-cycles`
   
   Default 2 or 3. Require K consecutive non-partial `COMMIT_COMPLETE`s before 
resuming.
   
   Optional but useful: add a small random delay per worker before resuming, 
for example `random(0, pause-ms)`, so they do not all restart in lockstep.
   
   **Explicit pause duration**
   
   Using `offset.flush.interval.ms` as the retry cadence couples two unrelated 
things:
   
   * how often Connect flushes offsets
   * how long a worker should back off
   
   Those should be separate knobs.
   
   I’d add:
   
   `iceberg.coordinator.progress.pause-ms`
   
   with a default like 60s, and have the worker use `SinkTaskContext.timeout()` 
for the pause. That also gives us a clean path to graded backoff later.
   
   **`errors.retry.timeout.ms` prerequisite**
   
   `RetriableException` only retries until `errors.retry.timeout.ms`. The 
Connect default is `0`, so under default config this “pause” mechanism can fail 
the task on the first pause.
   
   We should make that hard to miss:
   
   1. Document that `errors.retry.timeout.ms` must be `-1`, or high enough to 
cover the worst expected outage.
   2. Validate it at task startup. If progress detection is enabled and retry 
timeout is too low, either fail fast or log a very loud warning.
   
   Silent misconfig here turns a safety feature into a foot-gun.
   
   **Metrics**
   
   A self-paused worker needs to be visible operationally. Otherwise operators 
just see “no data flowing” and have to dig through logs.
   
   At minimum, expose these through Kafka Connect JMX:
   
   * `coordinator-progress-paused`
   * `coordinator-progress-consecutive-missed-cycles`
   * `coordinator-progress-pause-count-total`
   * `coordinator-progress-time-paused-ms`
   * `coordinator-progress-last-commit-complete-age-ms`
   
   That lets on-call alert on “paused for > N minutes” and distinguish 
“ingestion paused because Coordinator is degraded” from every other reason 
ingestion might stop.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to