merlimat opened a new pull request, #26009:
URL: https://github.com/apache/pulsar/pull/26009

   Fixes #25997
   
   ### Motivation
   
   #25929 added an `explicitHostURI` pin to `ConnectionHandler` so the v5 
transaction coordinator's metadata-store discovery re-dials the elected leader 
on retry instead of falling back to the service URL. The same change also 
persisted the broker-assigned redirect URL from 
`CommandCloseProducer`/`CommandCloseConsumer` unload notifications (PIP-307) 
into that pin: a producer/consumer that was redirected once stayed pinned to 
that address on every subsequent retry, never going through topic lookup again, 
and nothing ever cleared the field.
   
   A redirect to a wrong or stale broker — e.g. a stepping-down 
`ExtensibleLoadManagerImpl` leader whose `closeInternalTopics()` redirects 
internal-topic clients to itself via the stale channel-table assignment — now 
wedged the client permanently with `ServiceUnitNotReadyException` instead of 
recovering via lookup on the next retry. This is the cause of the 
`ExtensibleLoadManagerImplTest.testRoleChange` flakiness on master. Before 
#25929 the redirect was honored for the immediate reconnect attempt only, so 
one failed attempt was followed by a lookup that found the new owner.
   
   ### Modifications
   
   - `ConnectionHandler.connectionClosed()` updates the `explicitHostURI` pin 
only for handlers already in explicit-host mode, i.e. the TC metadata-store 
discovery, which entered it via `grabCnx(URI, boolean)`. The pin update is 
still needed there: `TransactionMetaStoreHandler.retargetLeader()` switches 
leaders by closing the channel and passing the new leader URI through 
`connectionClosed()`, and a failed first dial to the new leader must not retry 
against the old one.
   - Lookup-based handlers (producers/consumers) never enter explicit-host 
mode, so the broker-assigned redirect is honored for the immediate reconnect 
attempt only; if that attempt fails, retries fall back to topic lookup, 
restoring the pre-#25929 behavior.
   - Added a regression test that delivers a disconnect+redirect with an 
unreachable `assignedBrokerServiceUrl` to a live producer and consumer (exactly 
as `ClientCnx#handleCloseProducer`/`#handleCloseConsumer` do) and asserts both 
recover via lookup and can exchange a message.
   
   ### Verifying this change
   
   - [x] Make sure that the change passes the CI checks.
   
   This change added tests and can be verified as follows:
   
   - `BrokerClientIntegrationTest.testStaleBrokerRedirectFallsBackToLookup` 
fails on master without the fix (the clients stay pinned to the unreachable 
address) and passes with it.
   - `TransactionClientConnectTest` passes, covering the TC client connection 
path.
   - `ExtensibleLoadManagerImplTest.testRoleChange` with `invocationCount = 20` 
passes 40/40 invocations with the fix (the test factory runs both 
`ServiceUnitStateChannel` implementations), vs. 1-2 failures per 20 on master 
per #25997.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to