merlimat opened a new pull request, #25620:
URL: https://github.com/apache/pulsar/pull/25620

   ## Motivation
   
   `MessageChunkingSharedTest.testMultiConsumers` is flaky — it hangs 
intermittently and was bumped from 3s to 15s in #25475 to mask the symptom. The 
root cause is a real correctness bug in `SharedConsumerAssignor` that can cause 
chunked messages on Shared subscriptions to get permanently stuck in the 
redelivery queue, producing partial reassemblies on the consumer that never 
complete.
   
   ## Bug
   
   In `SharedConsumerAssignor.getConsumerForUuid`:
   
   ```java
   consumerToPermits.put(consumer, currentAvailablePermits - 1);
   ```
   
   `currentAvailablePermits` is the assignor loop's local tracker for the 
loop's `defaultConsumer`. But when a cache hit on `uuidToConsumer` redirects a 
chunk to a *different* consumer, `consumer` and `defaultConsumer` are not the 
same. The line above then writes the loop consumer's tracker into the *target* 
consumer's slot, corrupting `consumerToPermits` and making the assignor's 
per-consumer accounting drift over-optimistic.
   
   The chain that produces a stuck chunk:
   
   1. Over-allocation makes `consumerToPermits[X]` claim more permits than X 
actually has.
   2. The assignor allocates more entries to X than X can take.
   3. `sendChunkedMessagesToConsumers` trims via 
`Math.min(consumer.getAvailablePermits(), entryList.size())` and pushes the 
excess to the redelivery queue.
   4. For a last-chunk entry that gets trimmed, the assignor *already* removed 
`uuidToConsumer[uuid]` on line 145.
   5. When that last-chunk entry is re-read from redelivery, the assignor sees 
no cache entry and `chunkId != 0`, so it returns `null` — and the entry goes 
right back to redelivery. Forever.
   6. The consumer holds chunks 0..N-2 in `chunkedMessagesMap` waiting for 
chunk N-1 that will never arrive; the partial message never reassembles, never 
acks.
   
   When this happens the test hangs until TestNG's retry kicks in 30s later, 
masking the failure as a "slow run."
   
   ## Modifications
   
   `getConsumerForUuid` now decrements the target consumer's permits using the 
value it just read from `consumerToPermits`, not the caller's local tracker. 
The `currentAvailablePermits` parameter is no longer needed and was removed.
   
   ## Verification
   
   Locally, before the fix the test was bimodal: either ~0.5s or ~30s (TestNG 
retry threshold). After the fix, 0/30 runs were slow:
   
   | | Before fix | After fix |
   | --- | --- | --- |
   | Slow runs (>5s) | 2/10 | 0/30 |
   
   Other chunking suites (`MessageChunkingTest`, `MessageChunkingSharedTest`, 
`MessageChunkingDeduplicationTest`, `SharedConsumerAssignorTest`) still pass.
   
   ## Test plan
   
   - [x] `mvn test -pl pulsar-broker -Dtest='MessageChunkingSharedTest'` — 
passes
   - [x] `mvn test -pl pulsar-broker -Dtest='SharedConsumerAssignorTest'` — 
passes
   - [x] Stress run: 30 iterations of 
`MessageChunkingSharedTest.testMultiConsumers` — 0 slow / 0 failed
   
   ### Matching PR types in the table
   
   - [x] `area/broker`
   
   ### Documentation
   
   - [x] `doc-not-needed`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to