amjadali-klarity opened a new issue, #25438:
URL: https://github.com/apache/pulsar/issues/25438

   ### Search before reporting
   
   - [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.
   
   
   ### Read release policy
   
   - [x] I understand that [unsupported versions](https://pulsar.apache.org/contribute/release-policy/#supported-versions) don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.
   
   
   ### User environment
   
   - Pulsar version: 4.1.2 (pulsar-all image)
   - BookKeeper version: 4.17.2
   - ZooKeeper version: 3.9.4
   - Deployment: Kubernetes (StatefulSet), 2 brokers, 2 bookies
   - Client: Pulsar-CPP-v3.5.1
   - Subscription type: Shared
   
   ### Issue Description
   
   After a BookKeeper ledger rolls over (triggered by ledger-full), the 
`PersistentDispatcherMultipleConsumers` dispatch loop crashes silently with a 
`NoSuchElementException` and never reschedules itself. As a result, the broker 
stops delivering messages to connected consumers even though:
   
   - The consumer remains connected (availablePermits > 0)
   - The new ledger is created successfully
   - The broker health check (/status.html) continues returning HTTP 200
   - The topic backlog shows undelivered messages
   
   The broker appears completely healthy from the outside, but dispatch stays 
frozen until the topic is unloaded or the broker is restarted.
   
   ### Error messages
   
   ```text
   ERROR org.apache.bookkeeper.common.util.SingleThreadExecutor - Error while running task: null
   java.util.NoSuchElementException: null
       at java.base/java.util.concurrent.ConcurrentSkipListMap.firstKey(Unknown Source)
       at org.apache.bookkeeper.mledger.impl.EntryCountEstimator.internalEstimateEntryCountByBytesSize(EntryCountEstimator.java:94)
       at org.apache.bookkeeper.mledger.impl.EntryCountEstimator.estimateEntryCountByBytesSize(EntryCountEstimator.java:52)
       at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.applyMaxSizeCap(ManagedCursorImpl.java:3949)
       at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.getMessagesToReplayNow(PersistentDispatcherMultipleConsumers.java:1319)
       at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readMoreEntries(PersistentDispatcherMultipleConsumers.java:385)
       at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.handleSendingMessagesAndReadingMore(PersistentDispatcherMultipleConsumers.java:742)
       at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.lambda$readEntriesComplete$9(PersistentDispatcherMultipleConsumers.java:718)
       at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
       at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:113)
       at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
       at java.base/java.lang.Thread.run(Unknown Source)
   
   Preceded by (same timestamp):
   ERROR org.apache.bookkeeper.client.LedgerHandle - Metadata conflict when closing ledger 5905.
   Another client may have recovered the ledger while there were writes outstanding.
   (local lastEntry:0 length:6083) (metadata lastEntry:-1 length:0)
   
   ERROR org.apache.bookkeeper.client.MetadataUpdateLoop - UpdateLoop(ledgerId=5905,loopId=143e2cbf) Exception updating
   org.apache.bookkeeper.client.BKException$BKMetadataVersionException: Bad ledger metadata version
       at org.apache.bookkeeper.client.LedgerHandle.lambda$null$2(LedgerHandle.java:602)
       at org.apache.bookkeeper.client.MetadataUpdateLoop.writeLoop(MetadataUpdateLoop.java:132)
       ...
   ```
   
   ### Reproducing the issue
   
   1. Create a persistent topic with a Shared subscription and an active consumer (Pulsar CPP client v3.5.1).
   2. Configure the managed ledger with a small `managedLedgerMaxEntriesPerLedger` or `managedLedgerMaxSizePerLedgerMbytes` so that ledger rollover occurs under normal load.
   3. Publish messages to the topic until the current ledger becomes full and triggers a rollover.
   4. Observe that a concurrent ledger recovery on another broker causes a `BKMetadataVersionException` during close of the old ledger.
   5. After the new ledger is successfully created, observe that no further messages are dispatched to the consumer even though messages remain in the backlog.
   
   ### Additional information
   
   Expected Behaviour:
   
   After a ledger rollover, the dispatcher should continue reading entries and 
delivering messages to connected consumers without interruption. A 
`NoSuchElementException` in `EntryCountEstimator` should be handled gracefully 
and must not permanently halt the read loop.
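
To illustrate the expected behaviour, here is a minimal sketch of a read loop that catches a failing step and schedules a retry instead of dying. It uses only `java.util.concurrent`. The class `ReschedulingReadLoop`, its methods, and the retry delay are hypothetical names for illustration, not Pulsar or BookKeeper code, and the actual fix in the dispatcher may look quite different.

```java
import java.util.NoSuchElementException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a read step that is retried after an unexpected exception.
class ReschedulingReadLoop {
    private final ScheduledExecutorService executor;
    private final Runnable readStep;
    private final long retryDelayMs;
    final AtomicInteger failures = new AtomicInteger();

    ReschedulingReadLoop(ScheduledExecutorService executor, Runnable readStep, long retryDelayMs) {
        this.executor = executor;
        this.readStep = readStep;
        this.retryDelayMs = retryDelayMs;
    }

    /** Submits one read step to the executor. */
    void trigger() {
        executor.execute(this::runSafely);
    }

    private void runSafely() {
        try {
            readStep.run();
        } catch (RuntimeException e) {
            // Count the failure and schedule a retry so the loop does not stall.
            failures.incrementAndGet();
            executor.schedule(this::runSafely, retryDelayMs, TimeUnit.MILLISECONDS);
        }
    }

    /** Runs a step that fails once (emulating the empty-map case) and then succeeds. */
    static boolean recoversAfterOneFailure() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch delivered = new CountDownLatch(1);
        AtomicInteger attempts = new AtomicInteger();
        Runnable step = () -> {
            if (attempts.incrementAndGet() == 1) {
                throw new NoSuchElementException();
            }
            delivered.countDown();
        };
        ReschedulingReadLoop loop = new ReschedulingReadLoop(executor, step, 10);
        loop.trigger();
        try {
            return delivered.await(5, TimeUnit.SECONDS) && loop.failures.get() == 1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("recovered after one failure: " + recoversAfterOneFailure());
    }
}
```

A retry like this only masks the symptom. A real fix would presumably also avoid the exception at its source.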
   
   Actual Behaviour:
   
   The `SingleThreadExecutor` task driving `readMoreEntries()` crashes with a 
`NoSuchElementException` when 
`EntryCountEstimator.internalEstimateEntryCountByBytesSize()` calls 
`ConcurrentSkipListMap.firstKey()` on an empty map (the newly created ledger 
has no size-estimation data yet). The executor swallows the exception and logs 
it as "Error while running task: null". The dispatch loop is never rescheduled, 
leaving the subscription permanently stalled.
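
The `firstKey()` behaviour described above can be reproduced in isolation. The sketch below shows that `ConcurrentSkipListMap.firstKey()` throws `NoSuchElementException` on an empty map, while `firstEntry()` returns `null` and can be used as a guard. The class `EmptyLedgerMapDemo` and the map contents are hypothetical stand-ins, not the actual `EntryCountEstimator` code.

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical stand-in for a ledgerId -> size map; not the real EntryCountEstimator.
class EmptyLedgerMapDemo {

    /** Returns the first key, or null when the map is empty, without throwing. */
    static Long firstKeyOrNull(ConcurrentSkipListMap<Long, Long> ledgerSizes) {
        // firstEntry() returns null on an empty map, whereas firstKey() throws.
        Map.Entry<Long, Long> first = ledgerSizes.firstEntry();
        return first == null ? null : first.getKey();
    }

    /** Returns true if firstKey() throws NoSuchElementException for this map. */
    static boolean firstKeyThrows(ConcurrentSkipListMap<Long, Long> ledgerSizes) {
        try {
            ledgerSizes.firstKey();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<Long, Long> ledgerSizes = new ConcurrentSkipListMap<>();
        System.out.println("firstKey() throws on empty map: " + firstKeyThrows(ledgerSizes));
        System.out.println("guarded lookup on empty map: " + firstKeyOrNull(ledgerSizes));

        // Values borrowed from the log above purely as sample data.
        ledgerSizes.put(5905L, 6083L);
        System.out.println("guarded lookup after put: " + firstKeyOrNull(ledgerSizes));
    }
}
```

A caller using the guarded lookup would still need to decide what entry-count estimate to fall back to when no ledger data is available. That choice is left open here.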
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
