merlimat opened a new pull request, #25620: URL: https://github.com/apache/pulsar/pull/25620
## Motivation `MessageChunkingSharedTest.testMultiConsumers` is flaky — it hangs intermittently and was bumped from 3s to 15s in #25475 to mask the symptom. The root cause is a real correctness bug in `SharedConsumerAssignor` that can cause chunked messages on Shared subscriptions to get permanently stuck in the redelivery queue, producing partial reassemblies on the consumer that never complete. ## Bug In `SharedConsumerAssignor.getConsumerForUuid`: ```java consumerToPermits.put(consumer, currentAvailablePermits - 1); ``` `currentAvailablePermits` is the assignor loop's local tracker for the loop's `defaultConsumer`. But when a cache hit on `uuidToConsumer` redirects a chunk to a *different* consumer, `consumer` and `defaultConsumer` are not the same. The line above then writes the loop consumer's tracker into the *target* consumer's slot, corrupting `consumerToPermits` and making the assignor's per-consumer accounting drift over-optimistic. The chain that produces a stuck chunk: 1. Over-allocation makes `consumerToPermits[X]` claim more permits than X actually has. 2. The assignor allocates more entries to X than X can take. 3. `sendChunkedMessagesToConsumers` trims via `Math.min(consumer.getAvailablePermits(), entryList.size())` and pushes the excess to the redelivery queue. 4. For a last-chunk entry that gets trimmed, the assignor *already* removed `uuidToConsumer[uuid]` on line 145. 5. When that last-chunk entry is re-read from redelivery, the assignor sees no cache entry and `chunkId != 0`, so it returns `null` — and the entry goes right back to redelivery. Forever. 6. The consumer holds chunks 0..N-2 in `chunkedMessagesMap` waiting for chunk N-1 that will never arrive; the partial message never reassembles, never acks. When this happens the test hangs until TestNG's retry kicks in 30s later, masking the failure as a "slow run." ## Modifications `getConsumerForUuid` now decrements the target consumer's permits using the value it just read from `consumerToPermits`, not the caller's local tracker. The `currentAvailablePermits` parameter is no longer needed and was removed. ## Verification Locally, before the fix the test was bimodal: either ~0.5s or ~30s (TestNG retry threshold). After the fix, 0/30 runs were slow: | | Before fix | After fix | | --- | --- | --- | | Slow runs (>5s) | 2/10 | 0/30 | Other chunking suites (`MessageChunkingTest`, `MessageChunkingSharedTest`, `MessageChunkingDeduplicationTest`, `SharedConsumerAssignorTest`) still pass. ## Test plan - [x] `mvn test -pl pulsar-broker -Dtest='MessageChunkingSharedTest'` — passes - [x] `mvn test -pl pulsar-broker -Dtest='SharedConsumerAssignorTest'` — passes - [x] Stress run: 30 iterations of `MessageChunkingSharedTest.testMultiConsumers` — 0 slow / 0 failed ### Matching PR types in the table - [x] `area/broker` ### Documentation - [x] `doc-not-needed` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
