Re: [PR] [improve][broker] PIP-483: scalable topic auto split/merge [pulsar]

via GitHub Thu, 11 Jun 2026 10:47:39 -0700


lhotari commented on PR #25980:
URL: https://github.com/apache/pulsar/pull/25980#issuecomment-4683344727


   I performed an AI assisted (Claude Fable 5) review and these were the 
findings. Some might be useful to address.
   
   ### Findings
   
   1. **[QUALITY] In-flight guard can leak permanently — 
`ScalableTopicController.java:322-341`**
      If anything between the successful `autoScaleInFlight.compareAndSet` and 
the future-chain construction throws synchronously (e.g. a misbehaving 
`getSegmentLoadAsync`), the flag is never reset: `runAutoScaleSafely` catches 
the `Throwable` but doesn't release the guard, so auto-scaling silently stops 
on this topic until a leadership change. Low likelihood (sync throws from async 
methods violate the codebase rules), but the failure mode is invisible and 
permanent — the feature dies with no recovery and nothing but a single warn log 
to show for it. A try/catch that resets the flag would harden it cheaply.
   
   2. **[QUALITY] Load records leak in the metadata store; `forget()` is dead 
code**
      `deleteSegmentLoadAsync` and `SegmentLoadReporter.forget()` have no 
production callers (tests only). The GC tick prunes segments from the layout 
and deletes backing topics (`ScalableTopicController.java:1126`) but leaves the 
`.../segments/{id}/load` znode behind forever. Likewise the reporter's 
`lastWritten` map grows unboundedly with segment churn on a long-lived broker, 
and its documented contract — "call when this broker stops owning the segment 
topic" — is never wired into unload/seal/delete. Both APIs were built for a 
purpose this PR doesn't fulfill; either wire them in (GC tick for the znode, 
topic-close path for `forget`) or note it as a known follow-up.
   
   3. **[QUALITY] Materiality band is anchored at the last-written value, not 
at the thresholds — confirm intended**
      The reporter skips writes when every rate changed <25% relative to the 
*last written* value (`SegmentLoadReporter.java:95-102`), with only a 
zero-crossing exception. The effect approximates hysteresis but with two 
quirks: it's **path-dependent** (a segment at 10,900 msg/s splits if the last 
record was 5,000, but not if it was 8,800 — same load, opposite outcomes), and 
the suppression is **permanent rather than a delay** (a true rate that settles 
inside `(stored, stored × 1.25)` never produces a new record, so a segment can 
run sustained at up to 25% over the split threshold — or sit 25–33% below the 
merge threshold — indefinitely without action). The impact is bounded by the 
tunable 0.25 knob and is one-directional (delays action, never causes spurious 
action), and adding threshold-crossing writes would interact with the merge 
window (each crossing resets the cold clock — arguably correct, but a behavior 
change). The PR frames the knob purely as metadata-write econom
 y, so the right ask is: if the blind spot is accepted, document it on 
`scalableTopicLoadReportRateChangeThreshold`; if not, the reporter or evaluator 
needs threshold awareness — complicated by the reporter being deliberately 
policy-agnostic.
   
   4. **[QUALITY] Topic ownership moves reset the merge window**
      `reportSegmentLoadAsync` uses `readModifyUpdateOrCreate(existing -> 
stats)`, which writes unconditionally; a newly-owning broker has an empty 
`lastWritten` cache, so after every unload/rebalance an *identical* record is 
rewritten and the znode mtime — which `coldEnough` uses as "cold since" — 
resets. Under frequent rebalancing, merges can be starved cluster-wide, since 
load-manager shedding is routine in a busy cluster. Returning `existing` from 
the RMW lambda when the values are equal would fix it. Relatedly, `coldEnough` 
compares the controller's `clock.millis()` against the metadata store's 
server-side mtime, so broker↔store clock skew shifts the window — acceptable 
for a heuristic, worth a code comment.
   
   5. **[QUALITY] No validation of the policy config**
      Nothing enforces `minSegments ≤ maxSegments`, positive cooldowns, or the 
invariant the `AutoScaleConfig` javadoc itself states as a requirement — split 
thresholds strictly above merge thresholds. A threshold configured as 0 makes 
the evaluator divide by zero: a positive rate scores `Infinity` (permanent 
split pressure), while `0/0` is NaN and silently ignored. A validation pass in 
`fromBrokerConfig` would be cheap and catches operator error at the right layer.
   
   6. **[QUALITY] Cooldowns don't survive leader failover, and a failed split 
consumes one**
      `lastSplitAtMs`/`lastMergeAtMs` are in-memory volatiles that reset to 
`MIN_VALUE` on a new leader, so an auto merge can fire immediately after 
failover even if one ran seconds earlier. Both timestamps are also set *before* 
the operation runs (`ScalableTopicController.java:557,608`), so a split that 
fails on a metadata conflict still burns the 60 s cooldown. Both are bounded in 
impact; worth a comment if intentional.
   
   7. **[QUALITY] Several `dynamic = true` knobs aren't actually dynamic**
      `scalableTopicLoadReportIntervalSeconds` and 
`scalableTopicLoadReportRateChangeThreshold` are read once in 
`BrokerService.startSegmentLoadReporter()` at broker start; 
`scalableTopicAutoScaleIntervalSeconds` once at leader election; and turning 
`scalableTopicAutoScaleEnabled` *on* after it was off at election time has no 
effect until the next leadership cycle (turning it off works, since 
`evaluateAndAct` rechecks every evaluation). The rate thresholds and cooldowns 
*are* genuinely dynamic via `AutoScaleConfig.fromBrokerConfig` per evaluation. 
Either honor changes for the others or drop their `dynamic` flag so 
`update-dynamic-config` users aren't misled.
   
   8. **[INTENT MISMATCH] "Consumer-count scale-up within seconds" overstates 
the burst case**
      Only the *first* split fires within seconds of consumer registration; a 
burst of N consumers scales one split per `max(splitCooldown, tick)`, so going 
from 1 segment to N takes roughly (N−1)×60 s. A consumer-change event arriving 
while an evaluation is in-flight is also silently dropped (CAS fails) and only 
recovered by the next tick. Both behaviors follow from the cooldown design the 
PR body itself describes — calibrate the headline claim, or consider 
re-triggering an evaluation after a consumer-driven split completes so a burst 
converges in one cooldown chain rather than tick-by-tick.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [improve][broker] PIP-483: scalable topic auto split/merge [pulsar]

Reply via email to