merlimat commented on PR #25929: URL: https://github.com/apache/pulsar/pull/25929#issuecomment-4625601105
Thanks for the very thorough review — went through every finding against the code; all were legitimate. Addressed in 6acca7f. **On the design question (can `parallelism` change after bring-up?):** No, and your instinct about aborted transactions is exactly the additional constraint. `TxnID.mostSigBits` encodes the coordinator, so changing parallelism strands existing transaction ids. And because an aborted transaction's records are retained as long as its participating messages are, the value can only ever be *reduced* once every transaction created under the previous value has been fully cleaned up. I've documented this on the config and now enforce cross-broker agreement (see #5 below). **Findings:** 1. **[BUG] follower-snapshot race** — confirmed. `buildAssignmentsSnapshot` used the cache-only `getLeaderValueIfPresent()`; switched to the async `getLeaderValue()` (loads from the store on a cache miss), so a freshly-`Following` broker no longer omits the partition. Added a short re-push from `ServerCnx` when a snapshot is incomplete or fails to build, so the client converges without relying on a later leadership change. Note: the deepest 3+-broker scenario isn't reproducible by the 2-broker integration test, so I'm relying on the mechanism fix + re-push rather than an e2e proof — flagging that honestly. 2. **[BUG] watch dies silently** — confirmed both paths. `attach()` on reconnect without the flag, and post-initial `onError()` (you're right that `ClientCnx` already removed the session so no `connectionClosed()` fires) now both `scheduleReconnect()`. 3. **[BUG] one-shot initial open** — confirmed. The initial open now retries retryable failures up to the lookup deadline instead of failing `startAsync()`. 4. **[QUALITY] `new URI(null)` NPE** — confirmed it's an NPE not `URISyntaxException`. `HandlerState.setRedirectedClusterURI` now throws a clear `URISyntaxException` for a blank selected URL, and `onSnapshot` isolates per-partition failures. 5. **[QUALITY] parallelism validation/consistency** — added `minValue=1`; the value is persisted at `/txn/tc/parallelism` on first start and every broker fails fast on mismatch. New unit test. 6. **[QUALITY] test can't prove the new path** — fixed; the integration test now asserts `isUsingMetadataDiscovery()` so a watch-path regression can't pass via the silent assign-topic fallback. 7. **[minor]** `handlers()` now reads the field into a local; the #1 re-push also bounds the startup snapshot burst. All affected unit tests + checkstyle pass locally; the integration test compiles and is wired into the TRANSACTION suite. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
