lhotari opened a new pull request, #25946:
URL: https://github.com/apache/pulsar/pull/25946

   Fixes #24357
   
   ### Motivation
   
   `ExtensibleLoadManagerImplTest.testLoadBalancerServiceUnitTableViewSyncer` 
hung for the full 300s TestNG suite-default timeout in both attempts of CI run 
[27011985664](https://github.com/apache/pulsar/actions/runs/27011985664/job/79723694452),
 then cascaded into a later test's `@BeforeMethod initializeState()` failing 
with HTTP 500 (`The producer ... can not send message to the topic 
persistent://pulsar/system/loadbalancer-service-unit-state within given 
timeout`), skipping `testHandleNoChannelOwner`.
   
   Investigation (details in [this comment on 
#24357](https://github.com/apache/pulsar/issues/24357#issuecomment-4634015476)) 
found two production bugs in `ServiceUnitStateTableViewSyncer` that make 
`waitUntilSynced()` spin forever on entrySet-size divergence:
   
   1. **Tombstone asymmetry** — a deletion is delivered to the syncer's 
put-based tail listeners as a `null` value. 
`ServiceUnitStateMetadataStoreTableViewImpl.put()` is `@NonNull` on the value, 
so the resulting NPE is silently swallowed by the table-view dispatchers and 
the deletion never propagates to the metadata-store view. (The reverse 
direction works only by accident: the system-topic view's `delete(key)` *is* 
`put(key, null)`.)
   
   2. **Existing-vs-tail listener gap** — channel updates that land between the 
existing-items copy and the registration of the tail listeners in 
`syncTailItems()` are replayed to the freshly started views as *existing* 
items, which are wired to a no-op listener, so they never propagate. This is 
near-deterministic in the `MetadataStoreToSystemTopicSyncer` direction because 
closing the previous reader triggers a re-assignment of the channel-topic 
bundle itself into that gap (reproduced locally with the same sizes-off-by-one 
signature as CI, e.g. `MetadataStoreTableView.size: 44, 
SystemTopicTableView.size: 43`).
   
   Additionally, the leader-election-churning tests (`testRoleChange`, 
`testRoleChangeIdempotency`, `testHandleNoChannelOwner`) can leave the 
`pulsar/system/0x00000000_0xffffffff` bundle unserved (`Namespace bundle ... 
not served by this instance`), so the next test's namespace unload cannot 
publish its `Releasing` event and fails with HTTP 500 or hangs server-side. 
Nothing heals that state until the background monitor task (120s interval) 
reconciles the brokers' roles with the channel ownership — passive 
waiting/retrying alone cannot recover (reproduced locally and in personal CI).
   
   ### Modifications
   
   Production — `ServiceUnitStateTableViewSyncer` (a migration feature; no 
interface, call-site, or config changes):
   - Route `null` tail items to `delete()` instead of `put()`, and log failed 
sync futures at WARN (previously silently discarded by the dispatchers).
   - Make `waitUntilSynced()` reconcile periodically while diverged: copy 
source-only items to the destination (direction-aware via 
`dst.isMetadataStoreBased()`; raw metadata-store writes to avoid items being 
conflicted out, matching `syncExistingItemsToMetadataStore`) and remove 
destination-only items that predate the sync phase and are re-confirmed absent 
from the source (so a concurrent fresh write to the destination is never 
discarded). Batched at `MAX_CONCURRENT_SYNC_COUNT`; failures are logged and 
retried on the next pass; `InterruptedException` propagates.
   - Make the sync-wait budget an overridable field with a `@VisibleForTesting` 
setter (default unchanged at 300s) and add throttled DEBUG progress logging.
   - Code cleanup: shared `NOOP_CONSUMER`, `maybeWaitCompletion()` and 
`writeToMetadataStore()` helpers.
   
   Tests:
   - `testLoadBalancerServiceUnitTableViewSyncer`: cap at 240s (below the 300s 
suite default so a hang fails fast and is retried instead of consuming the 
whole slot), shrink the sync-wait budget to 30s before the first `start()`, 
drive `monitor()` inside the activation await, and tear the syncer down in a 
`finally` block so a failure cannot poison later tests.
   - `awaitChannelOwnerStable()` helper: after the leader-election-churning 
tests, drive `monitor()` to reconcile roles with channel ownership (the eager 
equivalent of the 120s background monitor task) and probe that the channel 
topic is looked up and served before yielding to the next test. 
`testHandleNoChannelOwner` additionally forces a deterministic leader in its 
cleanup, since a body failure mid-churn can otherwise leave leadership flapping 
between the brokers. Method timeouts for the churning tests raised from 30s to 
60s.
   - `initializeState()`: drive `monitor()` and retry a bounded `unloadAsync` 
(15s per attempt) for up to 120s before failing loudly, healing a transiently 
unserved channel-topic bundle instead of failing the whole suite.
   
   ### Verifying this change
   
   - [x] Make sure that the change passes the CI checks.
   
   This change is already covered by existing tests, such as 
`ExtensibleLoadManagerImplTest` (both `@Factory` parametrizations exercise both 
syncer directions in `testLoadBalancerServiceUnitTableViewSyncer`).
   
   Verification performed:
   - Before the fix, the full class failed deterministically in the 
`MetadataStoreToSystemTopicSyncer` direction locally (twice, same signature as 
both CI attempts). After the fix, the syncer test completes in seconds and 
repeated full-class runs pass (102 tests, 0 failures), including runs where the 
residual pre-existing first-attempt flakes in 
`testRoleChange`/`testHandleNoChannelOwner` fired and were contained by the 
retry without cascading.
   - Personal CI on the fork: `Pulsar CI Flaky` (BROKER_FLAKY group) passed 
twice consecutively (lhotari/pulsar#225); the only full-pipeline failure was an 
unrelated `GeoReplicationTest` integration flake.
   
   ### Does this pull request potentially affect one of the following parts:
   
   *If the box was checked, please highlight the changes*
   
   - [ ] Dependencies (add or upgrade a dependency)
   - [ ] The public API
   - [ ] The schema
   - [ ] The default values of configurations
   - [ ] The threading model
   - [ ] The binary protocol
   - [ ] The REST endpoints
   - [ ] The admin CLI options
   - [ ] The metrics
   - [ ] Anything that affects deployment


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to