ng-galien opened a new issue, #25533:
URL: https://github.com/apache/pulsar/issues/25533

   ## Summary
   
   The docs for `negativeAcknowledge` and `DeadLetterPolicy.maxRedeliverCount` 
don't warn that the redelivery counter is kept exclusively in client and broker 
RAM, and is lost whenever the subscription's dispatcher is closed. In any 
cluster with `loadBalancerEnabled=true` (the default), bundles are routinely 
unloaded and rebalanced, so `maxRedeliverCount` acts as a soft guideline at 
best and the DLQ is in practice never reached for long-lived poison messages.
   
   This is **not** a behavior bug: `reconsumeLater(...)` with 
`enableRetry(true)` works correctly and persists its counter as a 
`RECONSUMETIMES` property on the retry topic, in BookKeeper. The problem is 
that nothing in the documentation tells a reader to prefer `reconsumeLater` 
over `negativeAcknowledge` when they care about DLQ guarantees — the two are 
presented as interchangeable retry mechanisms.
   
   Related discussion: #23944 covers a different angle of the same broader 
"redelivery count is unreliable" theme (consumer crashes vs. dispatcher close), 
and proposes a behavior change. This issue is intentionally narrower and 
**docs-only** — even if #23944 lands a behavior fix, the docs should warn 
current users today.
   
   ## Why this matters (real-world)
   
   Our team ran a consumer with `negativeAcknowledge` + 
`DeadLetterPolicy(maxRedeliverCount=10)` in production for months before 
realizing the DLQ was never reached. The subscription accumulated ~400k 
messages in the backlog because every bundle unload reset the in-memory 
redelivery counter, and the backoff cycle was longer than the typical time 
between unloads. We only understood the mechanism after writing an instrumented 
reproduction against a standalone broker. **A single paragraph in the docs 
would have saved us that time**, which is why I'm opening this.
   
   ## Reproduction
   
   Self-contained, runnable: https://github.com/ng-galien/bug-pulsar
   
   | # | Client | Broker | Event applied | DLQ reached? |
   |---|---|---|---|---|
   | A | 2.11.0 | 2.11.0 | \`docker-compose restart pulsar\` | ❌ 0 msgs |
   | D | 2.11.0 | 2.11.0 | \`pulsar-admin topics unload\` (broker stays up) | ❌ 
0 msgs |
   | V3 | 2.11.0 | 3.3.9 | \`pulsar-admin topics unload\` | ❌ 0 msgs |
   | V3 (\`-Pclient-v3\`) | 3.3.9 | 3.3.9 | \`pulsar-admin topics unload\` | ❌ 
0 msgs |
   
   After the unload, the same \`MessageId\`s come back with 
\`redeliveryCount=0\` because the \`InMemoryRedeliveryTracker\` attached to the 
dispatcher is destroyed and recreated empty. Scenario B in the same repo 
validates the counterpart: \`reconsumeLater(...)\` + \`enableRetry(true)\` is 
unaffected by either a broker restart or a topic unload, because the counter 
lives on the retry topic as a persisted message property.
   
   The bug reproduces identically on the latest 3.3.x line, with both the 2.11 
and 3.3 clients.
   
   ## What I'm asking for
   
   Docs-only. No behavior change, no API change.
   
   1. In the *Negative acknowledgement* page, add a warning callout stating:
      - the redelivery counter is kept only in client (\`NegativeAcksTracker\`) 
+ broker (\`InMemoryRedeliveryTracker\`) RAM;
      - any event that closes the subscription dispatcher (broker restart, 
bundle unload, bundle rebalance, last consumer disconnect) resets the counter;
      - therefore \`DeadLetterPolicy.maxRedeliverCount\` cannot be relied on as 
a hard upper bound when \`negativeAcknowledge\` is the retry mechanism;
      - recommend \`reconsumeLater(...)\` with \`enableRetry(true)\` when the 
DLQ must actually be reached.
   
   2. In the javadoc for \`Consumer.negativeAcknowledge(...)\` and 
\`DeadLetterPolicy.Builder.maxRedeliverCount(...)\`, add a one-liner that links 
to that callout.
   
   3. Optional: update the *Retry letter topic* page to cross-reference the 
nack page and explain *why* one would choose one over the other — the current 
docs describe them as alternatives without explaining the persistence 
difference, which is arguably the entire point.
   
   Happy to submit a docs PR with proposed wording if a maintainer agrees this 
is worth documenting.
   
   ## Versions covered by the repro
   
   - Broker: \`apachepulsar/pulsar:2.11.0\` and \`apachepulsar/pulsar:3.3.9\` 
(both reproduce)
   - Client: \`pulsar-client:2.11.0\` and \`pulsar-client:3.3.9\` (both 
reproduce)
   - Java: 21 (Temurin), macOS + Docker


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to