lhotari commented on PR #25980:
URL: https://github.com/apache/pulsar/pull/25980#issuecomment-4683344727
I performed an AI assisted (Claude Fable 5) review and these were the
findings. Some might be useful to address.
### Findings
1. **[QUALITY] In-flight guard can leak permanently —
`ScalableTopicController.java:322-341`**
If anything between the successful `autoScaleInFlight.compareAndSet` and
the future-chain construction throws synchronously (e.g. a misbehaving
`getSegmentLoadAsync`), the flag is never reset: `runAutoScaleSafely` catches
the `Throwable` but doesn't release the guard, so auto-scaling silently stops
on this topic until a leadership change. Low likelihood (sync throws from async
methods violate the codebase rules), but the failure mode is invisible and
permanent — the feature dies with no recovery and nothing but a single warn log
to show for it. A try/catch that resets the flag would harden it cheaply.
2. **[QUALITY] Load records leak in the metadata store; `forget()` is dead
code**
`deleteSegmentLoadAsync` and `SegmentLoadReporter.forget()` have no
production callers (tests only). The GC tick prunes segments from the layout
and deletes backing topics (`ScalableTopicController.java:1126`) but leaves the
`.../segments/{id}/load` znode behind forever. Likewise the reporter's
`lastWritten` map grows unboundedly with segment churn on a long-lived broker,
and its documented contract — "call when this broker stops owning the segment
topic" — is never wired into unload/seal/delete. Both APIs were built for a
purpose this PR doesn't fulfill; either wire them in (GC tick for the znode,
topic-close path for `forget`) or note it as a known follow-up.
3. **[QUALITY] Materiality band is anchored at the last-written value, not
at the thresholds — confirm intended**
The reporter skips writes when every rate changed <25% relative to the
*last written* value (`SegmentLoadReporter.java:95-102`), with only a
zero-crossing exception. The effect approximates hysteresis but with two
quirks: it's **path-dependent** (a segment at 10,900 msg/s splits if the last
record was 5,000, but not if it was 8,800 — same load, opposite outcomes), and
the suppression is **permanent rather than a delay** (a true rate that settles
inside `(stored, stored × 1.25)` never produces a new record, so a segment can
run sustained at up to 25% over the split threshold — or sit 25–33% below the
merge threshold — indefinitely without action). The impact is bounded by the
tunable 0.25 knob and is one-directional (delays action, never causes spurious
action), and adding threshold-crossing writes would interact with the merge
window (each crossing resets the cold clock — arguably correct, but a behavior
change). The PR frames the knob purely as metadata-write econom
y, so the right ask is: if the blind spot is accepted, document it on
`scalableTopicLoadReportRateChangeThreshold`; if not, the reporter or evaluator
needs threshold awareness — complicated by the reporter being deliberately
policy-agnostic.
4. **[QUALITY] Topic ownership moves reset the merge window**
`reportSegmentLoadAsync` uses `readModifyUpdateOrCreate(existing ->
stats)`, which writes unconditionally; a newly-owning broker has an empty
`lastWritten` cache, so after every unload/rebalance an *identical* record is
rewritten and the znode mtime — which `coldEnough` uses as "cold since" —
resets. Under frequent rebalancing, merges can be starved cluster-wide, since
load-manager shedding is routine in a busy cluster. Returning `existing` from
the RMW lambda when the values are equal would fix it. Relatedly, `coldEnough`
compares the controller's `clock.millis()` against the metadata store's
server-side mtime, so broker↔store clock skew shifts the window — acceptable
for a heuristic, worth a code comment.
5. **[QUALITY] No validation of the policy config**
Nothing enforces `minSegments ≤ maxSegments`, positive cooldowns, or the
invariant the `AutoScaleConfig` javadoc itself states as a requirement — split
thresholds strictly above merge thresholds. A threshold configured as 0 makes
the evaluator divide by zero: a positive rate scores `Infinity` (permanent
split pressure), while `0/0` is NaN and silently ignored. A validation pass in
`fromBrokerConfig` would be cheap and catches operator error at the right layer.
6. **[QUALITY] Cooldowns don't survive leader failover, and a failed split
consumes one**
`lastSplitAtMs`/`lastMergeAtMs` are in-memory volatiles that reset to
`MIN_VALUE` on a new leader, so an auto merge can fire immediately after
failover even if one ran seconds earlier. Both timestamps are also set *before*
the operation runs (`ScalableTopicController.java:557,608`), so a split that
fails on a metadata conflict still burns the 60 s cooldown. Both are bounded in
impact; worth a comment if intentional.
7. **[QUALITY] Several `dynamic = true` knobs aren't actually dynamic**
`scalableTopicLoadReportIntervalSeconds` and
`scalableTopicLoadReportRateChangeThreshold` are read once in
`BrokerService.startSegmentLoadReporter()` at broker start;
`scalableTopicAutoScaleIntervalSeconds` once at leader election; and turning
`scalableTopicAutoScaleEnabled` *on* after it was off at election time has no
effect until the next leadership cycle (turning it off works, since
`evaluateAndAct` rechecks every evaluation). The rate thresholds and cooldowns
*are* genuinely dynamic via `AutoScaleConfig.fromBrokerConfig` per evaluation.
Either honor changes for the others or drop their `dynamic` flag so
`update-dynamic-config` users aren't misled.
8. **[INTENT MISMATCH] "Consumer-count scale-up within seconds" overstates
the burst case**
Only the *first* split fires within seconds of consumer registration; a
burst of N consumers scales one split per `max(splitCooldown, tick)`, so going
from 1 segment to N takes roughly (N−1)×60 s. A consumer-change event arriving
while an evaluation is in-flight is also silently dropped (CAS fails) and only
recovered by the next tick. Both behaviors follow from the cooldown design the
PR body itself describes — calibrate the headline claim, or consider
re-triggering an evaluation after a consumer-driven split completes so a burst
converges in one cooldown chain rather than tick-by-tick.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]