merlimat opened a new pull request, #25976:
URL: https://github.com/apache/pulsar/pull/25976

   ### Motivation
   
   The `Pulsar CI Flaky` suite has been failing frequently on 
`ExtensibleLoadManagerImplTest`, with a 2-minute timeout in the 
`initializeState` `@BeforeMethod` (example run: 
https://github.com/apache/pulsar/actions/runs/27144684649):
   
   ```
   org.awaitility.core.ConditionTimeoutException: Assertion condition null within 2 minutes.
       at ...ExtensibleLoadManagerImplBaseTest.initializeState(ExtensibleLoadManagerImplBaseTest.java)
   ```
   
   **Root cause.** Tests such as `testHandleNoChannelOwner` deliberately churn 
leader election by closing the `LeaderElectionService` on both brokers. This 
can leave the channel-topic bundle `pulsar/system/0x00000000_0xffffffff` (which 
hosts `loadbalancer-service-unit-state`) in an 
*owner-recorded-but-not-actually-served* state. Every channel operation then 
fails with `... not served by this instance ... Please redo the lookup`.
   
   `initializeState` (reworked in #25946) drives `monitor()` and retries the 
namespace unload for 120s, but `monitor()` cannot heal this particular state: 
`ExtensibleLoadManagerImpl.handleNoChannelOwnerError` only restarts leader 
election when the channel reports *"no channel owner now"*. When an owner 
**is** recorded but refuses to serve, no such error is thrown, recovery never 
triggers, and the unload — which must publish to the channel topic — can never 
succeed. The 120s budget is exhausted and the `@BeforeMethod` fails, cascading 
to skipped tests.
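   
   To make the gap concrete, here is a minimal sketch of a recovery trigger 
that keys off the error text. It is an illustration only: 
`ChannelErrorSketch` and `triggersLeaderElectionRestart` are hypothetical 
names, and the real check inside `handleNoChannelOwnerError` may be 
implemented differently. The two messages are the ones quoted above.
   
   ```java
   import java.util.List;

   class ChannelErrorSketch {
       // Hypothetical stand-in for the recovery trigger. Assumption: recovery
       // is keyed off the "no channel owner now" error; the real logic in
       // ExtensibleLoadManagerImpl.handleNoChannelOwnerError may differ.
       static boolean triggersLeaderElectionRestart(String errorMessage) {
           return errorMessage != null && errorMessage.contains("no channel owner now");
       }

       public static void main(String[] args) {
           List<String> errors = List.of(
                   "no channel owner now",
                   "... not served by this instance ... Please redo the lookup");
           for (String error : errors) {
               // Only the first message would restart leader election.
               System.out.println(triggersLeaderElectionRestart(error) + " <- " + error);
           }
       }
   }
   ```
   
   Under that assumption, the owner-recorded-but-unserved error never reaches 
the recovery path, which is why retrying the unload alone cannot make progress.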
   
   ### Modifications
   
   In `ExtensibleLoadManagerImplBaseTest.initializeState`, force-serve the 
channel topic inside the existing retry loop, before the unload:
   
   - `admin.lookups().lookupTopic(...)` re-assigns the `pulsar/system` bundle, 
and
   - `admin.topics().getStats(...)` forces the recorded owner to actually load 
it (the lookup layer alone can claim an owner that refuses to serve).
   
   This is the same sequence `awaitChannelOwnerStable()` already uses to 
stabilize after churn, but it now runs on **every** retry attempt, so the 
channel is re-served immediately before each unload. Running it only once, in 
the churn test's `finally`, leaves a window in which the state can degrade 
again before the next `initializeState`. The new step is applied only to the 
`ServiceUnitStateTableViewImpl` (system-topic) variant, matching 
`awaitChannelOwnerStable`'s own guard; the metadata-store variant has no 
channel *topic* to serve.
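   
   The shape of the retry loop can be sketched as follows. This is not the 
code from the PR: `ReserveChannelRetrySketch`, `Step`, and 
`unloadWithChannelReserve` are hypothetical names, the admin calls are passed 
in as plain callbacks so the example runs without a broker, and a fixed 
attempt count stands in for the 120s budget of the real test.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;

   class ReserveChannelRetrySketch {
       /** A step that may fail, standing in for an admin call. */
       @FunctionalInterface
       interface Step {
           void run() throws Exception;
       }

       /**
        * Runs lookup -> stats -> unload on every attempt until the unload
        * succeeds. Returns the number of attempts used.
        */
       static int unloadWithChannelReserve(int maxAttempts, long pauseMillis,
                                           boolean systemTopicVariant,
                                           Step lookupChannelTopic,
                                           Step statsChannelTopic,
                                           Step unloadNamespace) throws Exception {
           Exception last = null;
           for (int attempt = 1; attempt <= maxAttempts; attempt++) {
               try {
                   if (systemTopicVariant) {
                       // Re-serve the channel topic right before each unload.
                       lookupChannelTopic.run();
                       statsChannelTopic.run();
                   }
                   unloadNamespace.run();
                   return attempt;
               } catch (Exception e) {
                   last = e;
                   if (attempt < maxAttempts) {
                       Thread.sleep(pauseMillis);
                   }
               }
           }
           throw new IllegalStateException(
                   "unload did not succeed in " + maxAttempts + " attempts", last);
       }

       public static void main(String[] args) throws Exception {
           List<String> calls = new ArrayList<>();
           int[] failuresLeft = {2}; // simulate two failed unloads
           int attempts = unloadWithChannelReserve(5, 10L, true,
                   () -> calls.add("lookup"),
                   () -> calls.add("stats"),
                   () -> {
                       if (failuresLeft[0]-- > 0) {
                           throw new IllegalStateException("not served by this instance");
                       }
                       calls.add("unload");
                   });
           System.out.println(attempts + " attempts, calls = " + calls);
       }
   }
   ```
   
   In the actual test the three callbacks would correspond to 
`admin.lookups().lookupTopic(...)`, `admin.topics().getStats(...)`, and the 
namespace unload. Passing `systemTopicVariant = false` skips the first two 
steps, mirroring the guard for the metadata-store variant.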
   
   This is a test-side mitigation. The durable fix is product-side — teaching 
`monitor()` / `handleNoChannelOwnerError` to detect 
*owner-recorded-but-unserved* and re-assign the bundle — and can follow 
separately.
   

