rseetham opened a new pull request, #18501: URL: https://github.com/apache/pinot/pull/18501
Introduces four independent circuit breakers to prevent unbounded backfill triggering when a cluster is overwhelmed or restarts after prolonged downtime:

1. Pause flag per topic (`realtime.segment.offsetAutoReset.pause`): an operator-set boolean in the stream config, checked in `computeStartOffset()` before any backfill decision is made.
2. Max segments guard (`realtime.segment.offsetAutoReset.maxSegmentsBeforeBackfillSkip`): skips the backfill trigger if the table's segment count is >= the configured limit, preventing znode exhaustion when ingestion is permanently elevated.
3. Max concurrent backfills per controller (`controller.realtime.offsetAutoReset.maxConcurrentBackfillsPerController`): caps the number of tables that can backfill simultaneously on a single controller instance, guarding against cluster-restart storms.
4. Per-partition in-flight collision threshold (`controller.realtime.offsetAutoReset.maxBackfillCollisionsBeforeAutoPause`, default 3): tracks consecutive backfill-trigger attempts on a partition that already has an active backfill. Below the threshold, the new trigger is allowed; at or above the threshold, the topic's pause flag is set automatically and a metric is emitted to signal that operator intervention is required.

New `ControllerMeter` entries are added for each skipped-backfill scenario to enable alerting on all circuit breaker activations.

Fixes: https://github.com/apache/pinot/issues/18314

`bugfix`
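For readers who want a concrete picture of how the four checks might compose, here is a minimal sketch. It is not the code from this PR: `BackfillGate`, `Decision`, and `check` are hypothetical names, the order of checks 2 to 4 is an assumption (the description only states that the pause flag is checked first), and the `>=` comparison for the concurrency cap and the point at which a collision is counted are also assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the four circuit breakers; not Pinot's actual classes. */
final class BackfillGate {
  enum Decision { ALLOW, SKIP_PAUSED, SKIP_MAX_SEGMENTS, SKIP_MAX_CONCURRENT, SKIP_AUTO_PAUSED }

  private final int maxSegments;    // stands in for maxSegmentsBeforeBackfillSkip
  private final int maxConcurrent;  // stands in for maxConcurrentBackfillsPerController
  private final int maxCollisions;  // stands in for maxBackfillCollisionsBeforeAutoPause
  private final Map<String, Integer> collisions = new HashMap<>(); // consecutive collisions per partition
  private boolean paused;           // stands in for the per-topic pause flag

  BackfillGate(int maxSegments, int maxConcurrent, int maxCollisions) {
    this.maxSegments = maxSegments;
    this.maxConcurrent = maxConcurrent;
    this.maxCollisions = maxCollisions;
  }

  boolean isPaused() {
    return paused;
  }

  /** Decides whether a backfill trigger for one partition may proceed. */
  Decision check(String partition, int segmentCount, int activeBackfills, boolean partitionHasActiveBackfill) {
    // 1. Pause flag: evaluated before any other backfill decision.
    if (paused) {
      return Decision.SKIP_PAUSED;
    }
    // 2. Max segments guard: skip when the table's segment count is >= the limit.
    if (segmentCount >= maxSegments) {
      return Decision.SKIP_MAX_SEGMENTS;
    }
    // 3. Concurrency cap (assumed to use >= as well).
    if (activeBackfills >= maxConcurrent) {
      return Decision.SKIP_MAX_CONCURRENT;
    }
    // 4. Collision threshold: count consecutive triggers on a partition with an active backfill.
    if (partitionHasActiveBackfill) {
      int count = collisions.merge(partition, 1, Integer::sum);
      if (count >= maxCollisions) {
        paused = true; // auto-pause; a real implementation would also emit a metric here
        return Decision.SKIP_AUTO_PAUSED;
      }
    } else {
      collisions.remove(partition); // assumption: the consecutive streak resets without a collision
    }
    return Decision.ALLOW;
  }
}
```

Under these assumptions and a threshold of 3, the first two colliding triggers on a partition are allowed, the third sets the pause flag, and every later trigger is skipped until an operator clears the flag.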
