lhotari commented on issue #24357: URL: https://github.com/apache/pulsar/issues/24357#issuecomment-4634015476
I investigated a recent occurrence of this flake with a different failure mode than originally reported here: in [run 27011985664](https://github.com/apache/pulsar/actions/runs/27011985664/job/79723694452) (both attempts), `testLoadBalancerServiceUnitTableViewSyncer` hung for the full 300s TestNG suite-default timeout (`ThreadTimeoutException`), and then cascaded into the next test's `@BeforeMethod initializeState()` failing with HTTP 500: `The producer test-0-N can not send message to the topic persistent://pulsar/system/loadbalancer-service-unit-state within given timeout`, skipping `testHandleNoChannelOwner`.

### Root causes

The hang is in `ServiceUnitStateTableViewSyncer.waitUntilSynced()` (stack in the test-report XML: `waitUntilSynced:241 <- syncTailItems:172 <- start:74 <- monitor()`), which spins on entrySet-**size** equality between the two table views. Two production bugs in the syncer make that divergence permanent:

1. **Tombstone asymmetry.** A deletion is delivered to the syncer's put-based tail listeners as a `null` value. `ServiceUnitStateMetadataStoreTableViewImpl.put()` is `@NonNull` on the value, so the resulting NPE is silently swallowed by the table-view dispatchers (`TableViewImpl`/`MetadataStoreTableViewImpl` both catch and log listener exceptions) and the deletion never propagates to the metadata-store view. The reverse direction works only by accident (`ServiceUnitStateTableViewImpl.delete(key)` *is* `put(key, null)`). Evidence: both CI attempts show a post-kill `NullPointerException at ServiceUnitStateMetadataStoreTableViewImpl.put:138 <- syncToMetadataStore:89`.
2. **Existing-vs-tail listener gap.** Channel updates that land between `syncExistingItems()`'s copy and the registration of the tail listeners in `syncTailItems()` are replayed to the freshly-started views as *existing* items, which are wired to a dummy listener, so they never propagate.
This is near-deterministic in the `MetadataStoreToSystemTopicSyncer` direction because closing the previous reader triggers a re-assignment of the channel-topic bundle itself right into that gap. Reproduced locally twice with the same signature (`MetadataStoreTableView.size: 44, SystemTopicTableView.size: 43`, then `3 vs 2` on retry; the gap write was `pulsar/system/0x00000000_0xffffffff` itself going `Assigning -> Owned` between the copy and `Started MetadataStoreTableView`).

The downstream `initializeState()` HTTP 500 is the channel topic's bundle being transiently unowned (state=Free) amid leader-election churn from `testRoleChange`/`testHandleNoChannelOwner`, with the channel producer's reconnect backoff (~52s) exceeding the 30s send timeout.

Note the 300s hang also explains why this surfaces as a 5-minute `ThreadTimeoutException` rather than a clean failure: the syncer's internal `SYNC_WAIT_TIME_IN_SECS = 300` exactly collides with the TestNG suite-default 300000ms method timeout, and TestNG's clock starts earlier, so it always wins.

### Fix

I have a fix that routes `null` tail items to `delete()` and makes `waitUntilSynced()` reconcile the two views periodically while diverged (direction-aware, with safeguards for mid-migration writes), plus test-side hardening (method timeout cap, shortened sync budget, teardown in `finally`, channel-owner stabilization after the leader-election-churning tests, and a bounded retry in `initializeState()`). Currently validating via Personal CI (lhotari/pulsar#225); will open a PR against this issue once green.
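
To illustrate the tombstone asymmetry and the "route `null` to `delete()`" idea described above, here is a minimal, self-contained sketch. It does not use the Pulsar classes: `View`, `putOnlyListener`, `tombstoneAwareListener`, and `dispatch` are hypothetical stand-ins that only mimic the behavior described in this comment (a non-null `put()`, and a dispatcher that catches and logs listener exceptions). The actual fix may well look different.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

public class TombstoneSyncSketch {

    /** Hypothetical stand-in for a table view; not the Pulsar interface. */
    static final class View {
        private final Map<String, String> entries = new ConcurrentHashMap<>();

        void put(String key, String value) {
            // Mimics the non-null contract on the value: a null is rejected with an NPE.
            Objects.requireNonNull(value, "value");
            entries.put(key, value);
        }

        void delete(String key) {
            entries.remove(key);
        }

        int size() {
            return entries.size();
        }
    }

    /** Put-only tail listener: tombstones (null values) end up in put() and throw. */
    static BiConsumer<String, String> putOnlyListener(View target) {
        return target::put;
    }

    /** Tail listener that routes a null value (tombstone) to delete() instead. */
    static BiConsumer<String, String> tombstoneAwareListener(View target) {
        return (key, value) -> {
            if (value == null) {
                target.delete(key);
            } else {
                target.put(key, value);
            }
        };
    }

    /** Dispatcher that catches and logs listener failures, so the caller never sees them. */
    static void dispatch(BiConsumer<String, String> listener, String key, String value) {
        try {
            listener.accept(key, value);
        } catch (RuntimeException e) {
            System.out.println("listener failed: " + e);
        }
    }

    public static void main(String[] args) {
        View buggyTarget = new View();
        View fixedTarget = new View();
        buggyTarget.put("bundle-a", "Owned");
        fixedTarget.put("bundle-a", "Owned");

        // The source view deleted "bundle-a"; the deletion arrives as a null value.
        dispatch(putOnlyListener(buggyTarget), "bundle-a", null);
        dispatch(tombstoneAwareListener(fixedTarget), "bundle-a", null);

        // The put-only target keeps the stale entry, so a size comparison never converges.
        System.out.println("put-only target size: " + buggyTarget.size());
        System.out.println("tombstone-aware target size: " + fixedTarget.size());
    }
}
```

In this sketch the put-only target retains the deleted key because the exception is swallowed inside `dispatch`, which is the kind of permanent size divergence a size-equality wait loop cannot recover from on its own.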
