[PR] [fix][test] Fix flaky ExtensibleLoadManagerImplTest by re-serving the channel topic in initializeState [pulsar]

via GitHub Mon, 08 Jun 2026 10:03:30 -0700


merlimat opened a new pull request, #25976:
URL: https://github.com/apache/pulsar/pull/25976

### Motivation

The `Pulsar CI Flaky` suite has been failing frequently on
`ExtensibleLoadManagerImplTest`, with a 2-minute timeout in the
`initializeState` `@BeforeMethod` (example run:
https://github.com/apache/pulsar/actions/runs/27144684649):

```
org.awaitility.core.ConditionTimeoutException: Assertion condition null within 2 minutes.
    at ...ExtensibleLoadManagerImplBaseTest.initializeState(ExtensibleLoadManagerImplBaseTest.java)
```

**Root cause.** Tests such as `testHandleNoChannelOwner` deliberately churn
leader election by closing the `LeaderElectionService` on both brokers. This
can leave the channel-topic bundle `pulsar/system/0x00000000_0xffffffff` (which
hosts `loadbalancer-service-unit-state`) in an
*owner-recorded-but-not-actually-served* state. Every channel operation then
fails with `... not served by this instance ... Please redo the lookup`.

`initializeState` (reworked in #25946) drives `monitor()` and retries the
namespace unload for 120s, but `monitor()` cannot heal this particular state:
`ExtensibleLoadManagerImpl.handleNoChannelOwnerError` only restarts leader
election when the channel reports *"no channel owner now"*. When an owner
**is** recorded but refuses to serve, no such error is thrown, recovery never
triggers, and the unload — which must publish to the channel topic — can never
succeed. The 120s budget is exhausted and the `@BeforeMethod` fails, cascading
to skipped tests.
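
The gap can be illustrated with a small self-contained model. This is a sketch, not Pulsar's code: the class `RecoveryGapSketch`, the method `triggersRecovery`, the substring check, and the exception type are hypothetical stand-ins built from the two error messages quoted above (the second one abbreviated); the real matching logic in `handleNoChannelOwnerError` may differ.

```java
public class RecoveryGapSketch {

    // Hypothetical stand-in for the recovery check: leader election is
    // restarted only when the failure reports that no owner is recorded.
    static boolean triggersRecovery(Throwable error) {
        String message = error.getMessage();
        return message != null && message.contains("no channel owner now");
    }

    public static void main(String[] args) {
        // Case the existing recovery path handles.
        Throwable noOwner = new IllegalStateException("no channel owner now");
        // Case left behind by the churn tests: an owner is recorded but
        // refuses to serve, so a different error surfaces.
        Throwable unserved = new IllegalStateException(
                "not served by this instance. Please redo the lookup");

        System.out.println("no owner    -> recovery: " + triggersRecovery(noOwner));
        System.out.println("not served  -> recovery: " + triggersRecovery(unserved));
    }
}
```

In this model the second case never reaches the recovery branch, which mirrors why repeated `monitor()` calls cannot unblock the unload.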

### Modifications

In `ExtensibleLoadManagerImplBaseTest.initializeState`, force-serve the
channel topic inside the existing retry loop, before the unload:

- `admin.lookups().lookupTopic(...)` re-assigns the `pulsar/system` bundle, and
- `admin.topics().getStats(...)` forces the recorded owner to actually load it (the lookup layer alone can claim an owner that refuses to serve).

This is the same sequence `awaitChannelOwnerStable()` already uses to
stabilize after churn, but run on **every** retry attempt so the channel is
re-served immediately before each unload — rather than only once in the churn
test's `finally`, where the state can degrade again before the next
`initializeState`. It is guarded to the `ServiceUnitStateTableViewImpl`
(system-topic) variant, matching `awaitChannelOwnerStable`'s own guard; the
metadata-store variant has no channel *topic* to serve.
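
The shape of the change can be sketched as follows. This is a simplified, self-contained model rather than the code in this PR: `ChannelAdmin` is a hypothetical interface standing in for the `admin.lookups()` / `admin.topics()` / namespace-unload calls, a bounded attempt count replaces the 120s Awaitility wait, and the topic and namespace names in `main` are placeholders.

```java
public class ReServeRetrySketch {

    /** Hypothetical stand-in for the admin calls named in the description. */
    interface ChannelAdmin {
        void lookupTopic(String topic) throws Exception;      // re-assigns the bundle
        void getStats(String topic) throws Exception;         // forces the owner to load the topic
        void unloadNamespace(String namespace) throws Exception;
    }

    /**
     * Re-serves the channel topic before every unload attempt.
     * Returns the number of attempts that were needed.
     */
    static int unloadWithReServe(ChannelAdmin admin, String channelTopic, String namespace,
                                 boolean systemTopicVariant, int maxAttempts) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Guard: only the system-topic variant has a channel topic to serve.
                if (systemTopicVariant) {
                    admin.lookupTopic(channelTopic);
                    admin.getStats(channelTopic);
                }
                admin.unloadNamespace(namespace);
                return attempt;
            } catch (Exception e) {
                last = e; // remember the failure and retry
            }
        }
        throw new IllegalStateException(
                "unload did not succeed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        boolean[] served = {false};
        // Toy admin: the unload fails unless the topic was force-served first.
        ChannelAdmin fake = new ChannelAdmin() {
            @Override public void lookupTopic(String topic) { }
            @Override public void getStats(String topic) { served[0] = true; }
            @Override public void unloadNamespace(String namespace) throws Exception {
                if (!served[0]) {
                    throw new Exception("not served by this instance");
                }
            }
        };
        int attempts = unloadWithReServe(
                fake, "placeholder-channel-topic", "placeholder/namespace", true, 3);
        System.out.println("unload succeeded on attempt " + attempts);
    }
}
```

Placing the re-serve inside the loop means a state that degrades between attempts is repaired on the next iteration instead of consuming the whole retry budget.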

This is a test-side mitigation. The durable fix is product-side — teaching
`monitor()` / `handleNoChannelOwnerError` to detect
*owner-recorded-but-unserved* and re-assign the bundle — and can follow
separately.
