Re: [I] Flaky-test: ExtensibleLoadManagerImplTest.testLoadBalancerServiceUnitTableViewSyncer [pulsar]

via GitHub Fri, 05 Jun 2026 10:42:39 -0700


lhotari commented on issue #24357:
URL: https://github.com/apache/pulsar/issues/24357#issuecomment-4634015476


   I investigated a recent occurrence of this flake with a different failure 
mode than originally reported here: in [run 
27011985664](https://github.com/apache/pulsar/actions/runs/27011985664/job/79723694452)
 (both attempts), `testLoadBalancerServiceUnitTableViewSyncer` hung for the 
full 300s TestNG suite-default timeout (`ThreadTimeoutException`), and then 
cascaded into the next test's `@BeforeMethod initializeState()` failing with 
HTTP 500: `The producer test-0-N can not send message to the topic 
persistent://pulsar/system/loadbalancer-service-unit-state within given 
timeout`, skipping `testHandleNoChannelOwner`.
   
   ### Root causes
   
   The hang is in `ServiceUnitStateTableViewSyncer.waitUntilSynced()` (stack in 
the test-report XML: `waitUntilSynced:241 <- syncTailItems:172 <- start:74 <- 
monitor()`), which spins until the entry-set **sizes** of the two table views 
are equal. Two production bugs in the syncer can leave the sizes permanently 
diverged:
   
   1. **Tombstone asymmetry.** A deletion is delivered to the syncer's 
put-based tail listeners as a `null` value. 
`ServiceUnitStateMetadataStoreTableViewImpl.put()` is `@NonNull` on the value, 
so the resulting NPE is silently swallowed by the table-view dispatchers 
(`TableViewImpl`/`MetadataStoreTableViewImpl` both catch and log listener 
exceptions) and the deletion never propagates to the metadata-store view. The 
reverse direction works only by accident 
(`ServiceUnitStateTableViewImpl.delete(key)` *is* `put(key, null)`). Evidence: 
both CI attempts show a post-kill `NullPointerException at 
ServiceUnitStateMetadataStoreTableViewImpl.put:138 <- syncToMetadataStore:89`.
   
   2. **Existing-vs-tail listener gap.** Channel updates that land between 
`syncExistingItems()`'s copy and the registration of the tail listeners in 
`syncTailItems()` are replayed to the freshly-started views as *existing* items 
— which are wired to a dummy listener — so they never propagate. This is 
near-deterministic in the `MetadataStoreToSystemTopicSyncer` direction because 
closing the previous reader triggers a re-assignment of the channel-topic 
bundle itself right into that gap. Reproduced locally twice with the same 
signature (`MetadataStoreTableView.size: 44, SystemTopicTableView.size: 43`, 
then `3 vs 2` on retry — the gap write being 
`pulsar/system/0x00000000_0xffffffff` itself going `Assigning -> Owned` between 
the copy and `Started MetadataStoreTableView`).
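
   To illustrate the tombstone asymmetry from item 1, here is a minimal, 
self-contained sketch. `NonNullView` and `TailSyncSketch` are hypothetical 
stand-ins, not Pulsar classes: the view rejects `null` in `put()` the way a 
`@NonNull` parameter would, and the two listener variants show why forwarding a 
tombstone to `put()` fails while routing it to `delete()` works.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a table view whose put() rejects null values.
final class NonNullView {
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    void put(String key, String value) {
        // Mirrors the effect of a @NonNull value parameter.
        Objects.requireNonNull(value, "value");
        entries.put(key, value);
    }

    void delete(String key) {
        entries.remove(key);
    }

    int size() {
        return entries.size();
    }
}

// Hypothetical tail listeners that copy updates into a target view.
final class TailSyncSketch {
    // Naive listener: forwards every item to put(), so a tombstone throws
    // NullPointerException and the deletion is lost if the caller swallows it.
    static void naiveSync(NonNullView target, String key, String value) {
        target.put(key, value);
    }

    // Tombstone-aware listener: a null value is treated as a deletion.
    static void tombstoneAwareSync(NonNullView target, String key, String value) {
        if (value == null) {
            target.delete(key);
        } else {
            target.put(key, value);
        }
    }
}
```

   With the naive variant, the entry stays in the target view after the source 
deletes it, which is enough to keep a size comparison between the two views 
unequal.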
   
   The downstream `initializeState()` HTTP 500 is caused by the channel topic's 
bundle being transiently unowned (state=Free) amid leader-election churn from 
`testRoleChange`/`testHandleNoChannelOwner`, with the channel producer's 
reconnect backoff (~52s) exceeding the 30s send timeout.
   
   Note that the hang surfaces as a 5-minute `ThreadTimeoutException` rather 
than a clean failure because the syncer's internal 
`SYNC_WAIT_TIME_IN_SECS = 300` coincides exactly with the TestNG suite-default 
300000ms method timeout, and TestNG's clock starts earlier, so it always wins.
   
   ### Fix
   
   I have a fix that routes `null` tail items to `delete()` and makes 
`waitUntilSynced()` reconcile the two views periodically while diverged 
(direction-aware, with safeguards for mid-migration writes), plus test-side 
hardening (method timeout cap, shortened sync budget, teardown in `finally`, 
channel-owner stabilization after the leader-election-churning tests, and a 
bounded retry in `initializeState()`). Currently validating via Personal CI 
(lhotari/pulsar#225); will open a PR against this issue once green.
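
   As a rough illustration of the "reconcile while diverged" idea (not the 
actual patch), the sketch below waits within a bounded budget and, whenever the 
two views still differ after an interval, copies the source view over the 
target. `ReconcilingWaitSketch`, its method names, and the use of plain maps in 
place of table views are all assumptions made for the example. It compares 
contents rather than sizes and omits the safeguards for mid-migration writes 
mentioned above.

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for the two table views.
final class ReconcilingWaitSketch {
    // One-directional reconcile: make target match source.
    static void reconcile(Map<String, String> source, Map<String, String> target) {
        // Drop entries the source no longer has (missed deletions).
        target.keySet().removeIf(key -> !source.containsKey(key));
        // Copy entries the target is missing or holds with a stale value.
        target.putAll(source);
    }

    // Returns true once the views match, false if the budget runs out
    // or the waiting thread is interrupted.
    static boolean waitUntilSynced(Map<String, String> source,
                                   Map<String, String> target,
                                   Duration budget,
                                   Duration reconcileInterval) {
        long deadline = System.nanoTime() + budget.toNanos();
        while (!source.equals(target)) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                // Give normal listener propagation a chance to catch up.
                Thread.sleep(reconcileInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            if (!source.equals(target)) {
                reconcile(source, target);
            }
        }
        return true;
    }
}
```

   The point of the design is that a missed update no longer makes the wait 
spin until the full budget expires: the loop repairs the divergence itself and 
still returns a clean `false` if the views cannot be brought in line in time.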


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
