lhotari opened a new pull request, #25388:
URL: https://github.com/apache/pulsar/pull/25388

   ### Motivation
   
   The `SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover` 
integration test is flaky. The root cause is a bug in 
`SameAuthParamsLookupAutoClusterFailover.findFailoverTo()`: when it probes 
intermediate services and finds them unavailable, it does not update their 
state. They remain stale `Healthy`, which causes a spurious recovery bounce on 
the next check cycle.
   
   For example, when failing over from index 0 to index 2 (skipping index 1 
which is unavailable), index 1's state remains `Healthy`. On the next check, 
`firstHealthyPulsarService()` sees the stale `Healthy` state and immediately 
"recovers" to index 1 — which is actually a broken service. This causes 
unnecessary bouncing (0→2→1→2 instead of 0→2). Combined with 3-second probe 
timeouts on dead services, the extra bounce adds ~30 seconds, pushing the total 
failover time past the test's 60-second awaitility timeout.
   
   ### Modifications
   
   - In `findFailoverTo()`, when a probe fails for a service, mark it as 
`Failed` immediately. This prevents stale `Healthy` states from causing 
spurious recovery bounces after failover.
   - Added unit tests in `pulsar-client` to verify the fix and prevent 
regression.
   
   ### Verifying this change
   
   This change added tests and can be verified as follows:
   
     - `testFindFailoverToMarksSkippedServicesAsFailed` — directly verifies 
that `findFailoverTo` marks skipped unavailable services as `Failed` (the core 
bug). Verified this test fails without the fix.
     - `testNoSpuriousRecoveryBounceAfterFailover` — verifies no spurious 
recovery bounce occurs on the next check cycle after failover.
     - `testRecoveryAfterFindFailoverToMarksServiceFailed` — verifies that 
recovery still works correctly after a service was marked `Failed` by 
`findFailoverTo`, once it becomes available again.
   
   ### Does this pull request potentially affect one of the following parts:
   
   *If the box was checked, please highlight the changes*
   
   - [ ] Dependencies (add or upgrade a dependency)
   - [ ] The public API
   - [ ] The schema
   - [ ] The default values of configurations
   - [ ] The threading model
   - [ ] The binary protocol
   - [ ] The REST endpoints
   - [ ] The admin CLI options
   - [ ] The metrics
   - [ ] Anything that affects deployment
   
   ### Documentation
   
   - [x] `doc-not-needed`
   
   ### Matching PR in forked repository
   
   PR in forked repository: <\!-- ENTER URL HERE after creating forked PR -->
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to