laskoviymishka commented on issue #16389:
URL: https://github.com/apache/iceberg/issues/16389#issuecomment-4556828819
@HenryCaiHaiying thanks for putting this together. Two pieces of feedback.
**Layering**
I’d try to address #16361 first and re-measure.
The `O(N²)` behavior is the actual spiral mechanism. Without it, a 15-minute
HMS outage should create an N-sized backlog that recovers cleanly on the next
cycle. So the motivating case for this proposal may not reproduce after that
fix.
If we add a Coordinator → Worker feedback edge, I’d rather do it once with
the right signal than ship a workaround we already know we may need to revisit.
**Signal**
The thing we want to bound is commit batch size, not “consecutive missing
acks.”
That signal lives directly on the Coordinator. I’d suggest something like:
`CoordinatorStatus { pending_files, pending_partitions, last_commit_outcome
}`
published periodically on the control topic.
Workers can then refuse new writes, or throw `RetriableException`, when
`pending_files > threshold`.
That gives us graded degradation, a clear resume threshold, no startup-state
ambiguity, no coupling to `offset.flush.interval.ms`, and it avoids the
partial-commit blind spot. It is a bit more wire-protocol work, but it is a
much better mechanism.
**If we keep the current approach**
Please address the following before we rely on this in production.
**Initial-state grace period**
The control consumer uses a transient group ID and seeks to latest. That
means a worker joining mid-cycle, or restarting after a rebalance, can see
`START_COMMIT` messages without ever seeing the matching `COMMIT_COMPLETE`.
With `stalled.cycles=3`, three healthy commit starts in a row could falsely
pause a new worker.
I’d add an initial state: do not increment the missed-cycle counter until
the worker has seen one full `START_COMMIT → COMMIT_COMPLETE` cycle. After
that, it has a baseline.
**Partial commits**
`Coordinator.commit(true)` still emits `COMMIT_COMPLETE` when the partial
commit succeeds. So a Coordinator that is degraded and only making partial
progress looks healthy to workers.
That hides the exact failure mode this proposal is trying to detect.
I’d either:
* add `partial=true/false` to `CommitComplete`, and treat partial commits as
missed progress for backpressure
* or emit a separate `CommitPartial` event
Either is fine. I just don’t think we can leave this undefined.
**Stable-success threshold before resuming**
`PAUSED → ACTIVE` on the first `COMMIT_COMPLETE` feels too eager.
The Coordinator may have completed one cycle while still being behind. If
thousands of workers all see that same message and resume together, they can
overwhelm it again.
I’d add a config like:
`iceberg.coordinator.progress.resume.stable-cycles`
Default 2 or 3. Require K consecutive non-partial `COMMIT_COMPLETE`s before
resuming.
Optional but useful: add a small random delay per worker before resuming,
for example `random(0, pause-ms)`, so they do not all restart in lockstep.
**Explicit pause duration**
Using `offset.flush.interval.ms` as the retry cadence couples two unrelated
things:
* how often Connect flushes offsets
* how long a worker should back off
Those should be separate knobs.
I’d add:
`iceberg.coordinator.progress.pause-ms`
with a default like 60s, and have the worker use `SinkTaskContext.timeout()`
for the pause. That also gives us a clean path to graded backoff later.
**`errors.retry.timeout.ms` prerequisite**
`RetriableException` only retries until `errors.retry.timeout.ms`. The
Connect default is `0`, so under default config this “pause” mechanism can fail
the task on the first pause.
We should make that hard to miss:
1. Document that `errors.retry.timeout.ms` must be `-1`, or high enough to
cover the worst expected outage.
2. Validate it at task startup. If progress detection is enabled and retry
timeout is too low, either fail fast or log a very loud warning.
Silent misconfig here turns a safety feature into a foot-gun.
**Metrics**
A self-paused worker needs to be visible operationally. Otherwise operators
just see “no data flowing” and have to dig through logs.
At minimum, expose these through Kafka Connect JMX:
* `coordinator-progress-paused`
* `coordinator-progress-consecutive-missed-cycles`
* `coordinator-progress-pause-count-total`
* `coordinator-progress-time-paused-ms`
* `coordinator-progress-last-commit-complete-age-ms`
That lets on-call alert on “paused for > N minutes” and distinguish
“ingestion paused because Coordinator is degraded” from every other reason
ingestion might stop.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]