lhotari opened a new pull request, #25388:
URL: https://github.com/apache/pulsar/pull/25388
### Motivation
The `SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover`
integration test is flaky. The root cause is a bug in
`SameAuthParamsLookupAutoClusterFailover.findFailoverTo()`: when it probes
intermediate services and finds them unavailable, it does not update their
state. They remain stale `Healthy`, which causes a spurious recovery bounce on
the next check cycle.
For example, when failing over from index 0 to index 2 (skipping index 1
which is unavailable), index 1's state remains `Healthy`. On the next check,
`firstHealthyPulsarService()` sees the stale `Healthy` state and immediately
"recovers" to index 1 — which is actually a broken service. This causes
unnecessary bouncing (0→2→1→2 instead of 0→2). Combined with 3-second probe
timeouts on dead services, the extra bounce adds ~30 seconds, pushing the total
failover time past the test's 60-second awaitility timeout.
### Modifications
- In `findFailoverTo()`, when a probe fails for a service, mark it as
`Failed` immediately. This prevents stale `Healthy` states from causing
spurious recovery bounces after failover.
- Added unit tests in `pulsar-client` to verify the fix and prevent
regression.
### Verifying this change
This change added tests and can be verified as follows:
- `testFindFailoverToMarksSkippedServicesAsFailed` — directly verifies
that `findFailoverTo` marks skipped unavailable services as `Failed` (the core
bug). Verified this test fails without the fix.
- `testNoSpuriousRecoveryBounceAfterFailover` — verifies no spurious
recovery bounce occurs on the next check cycle after failover.
- `testRecoveryAfterFindFailoverToMarksServiceFailed` — verifies that
recovery still works correctly after a service was marked `Failed` by
`findFailoverTo`, once it becomes available again.
### Does this pull request potentially affect one of the following parts:
*If the box was checked, please highlight the changes*
- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] The metrics
- [ ] Anything that affects deployment
### Documentation
- [x] `doc-not-needed`
### Matching PR in forked repository
PR in forked repository: <\!-- ENTER URL HERE after creating forked PR -->
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]