amjadali-klarity opened a new issue, #25438: URL: https://github.com/apache/pulsar/issues/25438
### Search before reporting

- [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.

### Read release policy

- [x] I understand that [unsupported versions](https://pulsar.apache.org/contribute/release-policy/#supported-versions) don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

### User environment

- Pulsar version: 4.1.2 (pulsar-all image)
- BookKeeper version: 4.17.2
- ZooKeeper version: 3.9.4
- Deployment: Kubernetes (StatefulSet), 2 brokers, 2 bookies
- Client: Pulsar-CPP-v3.5.1
- Subscription type: Shared

### Issue Description

After a BookKeeper ledger rolls over (triggered by ledger-full), the `PersistentDispatcherMultipleConsumers` dispatch loop crashes silently with a `NoSuchElementException` and never reschedules itself. As a result, the broker stops delivering messages to connected consumers even though:

- The consumer remains connected (availablePermits > 0)
- The new ledger is created successfully
- The broker health check (/status.html) continues returning HTTP 200
- The topic backlog shows undelivered messages

The broker appears completely healthy from the outside, but dispatch is permanently frozen until the topic is unloaded or the broker is restarted.
### Error messages

```text
ERROR org.apache.bookkeeper.common.util.SingleThreadExecutor - Error while running task: null
java.util.NoSuchElementException: null
    at java.base/java.util.concurrent.ConcurrentSkipListMap.firstKey(Unknown Source)
    at org.apache.bookkeeper.mledger.impl.EntryCountEstimator.internalEstimateEntryCountByBytesSize(EntryCountEstimator.java:94)
    at org.apache.bookkeeper.mledger.impl.EntryCountEstimator.estimateEntryCountByBytesSize(EntryCountEstimator.java:52)
    at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.applyMaxSizeCap(ManagedCursorImpl.java:3949)
    at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.getMessagesToReplayNow(PersistentDispatcherMultipleConsumers.java:1319)
    at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readMoreEntries(PersistentDispatcherMultipleConsumers.java:385)
    at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.handleSendingMessagesAndReadingMore(PersistentDispatcherMultipleConsumers.java:742)
    at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.lambda$readEntriesComplete$9(PersistentDispatcherMultipleConsumers.java:718)
    at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
    at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:113)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)

Preceded by (same timestamp):

ERROR org.apache.bookkeeper.client.LedgerHandle - Metadata conflict when closing ledger 5905. Another client may have recovered the ledger while there were writes outstanding. (local lastEntry:0 length:6083) (metadata lastEntry:-1 length:0)
ERROR org.apache.bookkeeper.client.MetadataUpdateLoop - UpdateLoop(ledgerId=5905,loopId=143e2cbf) Exception updating
org.apache.bookkeeper.client.BKException$BKMetadataVersionException: Bad ledger metadata version
    at org.apache.bookkeeper.client.LedgerHandle.lambda$null$2(LedgerHandle.java:602)
    at org.apache.bookkeeper.client.MetadataUpdateLoop.writeLoop(MetadataUpdateLoop.java:132)
    ...
```

### Reproducing the issue

1. Create a persistent topic with a Shared subscription and an active consumer (Pulsar CPP client v3.5.1).
2. Configure the managed ledger with a small `managedLedgerMaxEntriesPerLedger` or `managedLedgerMaxSizePerLedgerMb` so ledger rollover occurs under normal load.
3. Publish messages to the topic until the current ledger becomes full and triggers a rollover.
4. Observe that a concurrent ledger recovery on another broker causes a `BKMetadataVersionException` during close of the old ledger.
5. After the new ledger is successfully created, observe that no further messages are dispatched to the consumer despite messages being present in the backlog.

### Additional information

Expected Behaviour: After a ledger rollover, the dispatcher should continue reading entries and delivering messages to connected consumers without interruption. A `NoSuchElementException` in `EntryCountEstimator` should be handled gracefully and must not permanently halt the read loop.

Actual Behaviour: The `SingleThreadExecutor` task driving `readMoreEntries()` crashes with a `NoSuchElementException` when `EntryCountEstimator.internalEstimateEntryCountByBytesSize()` calls `ConcurrentSkipListMap.firstKey()` on an empty map (the newly created ledger has no size-estimation data yet). The executor swallows the exception and logs it as "Error while running task: null". The dispatch loop is never rescheduled, leaving the subscription permanently stalled.

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!
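To illustrate the failure mode described under Actual Behaviour, here is a minimal standalone sketch. It is not taken from the Pulsar or BookKeeper code base: the class `EmptyMapGuardSketch`, its methods, and the `Long`-to-`Long` map are hypothetical stand-ins. It only relies on documented JDK behaviour: `ConcurrentSkipListMap.firstKey()` throws `NoSuchElementException` on an empty map, while `firstEntry()` returns `null`. The `runStep` helper shows one possible way to make sure a read-loop step asks to be rescheduled even when its body throws; whether that is the right fix for the dispatcher is an assumption.

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListMap;

// Standalone illustration; none of these names exist in Pulsar or BookKeeper.
final class EmptyMapGuardSketch {

    // Mirrors the failing pattern: firstKey() throws NoSuchElementException on an empty map.
    static long unguardedFirstKey(ConcurrentSkipListMap<Long, Long> ledgers) {
        return ledgers.firstKey();
    }

    // Guarded variant: firstEntry() returns null on an empty map instead of throwing.
    static long firstKeyOrDefault(ConcurrentSkipListMap<Long, Long> ledgers, long fallback) {
        Map.Entry<Long, Long> first = ledgers.firstEntry();
        return first == null ? fallback : first.getKey();
    }

    // Hypothetical read-loop step: the reschedule callback runs even if the body throws.
    // Returns true when the body completed normally, false when it threw.
    static boolean runStep(Runnable body, Runnable reschedule) {
        try {
            body.run();
            return true;
        } catch (RuntimeException e) {
            System.err.println("step failed, rescheduling anyway: " + e);
            return false;
        } finally {
            reschedule.run();
        }
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<Long, Long> ledgers = new ConcurrentSkipListMap<>();

        try {
            unguardedFirstKey(ledgers);
        } catch (NoSuchElementException e) {
            System.out.println("unguarded call threw: " + e);
        }

        System.out.println("guarded call returned: " + firstKeyOrDefault(ledgers, -1L));

        boolean ok = runStep(
                () -> unguardedFirstKey(ledgers),
                () -> System.out.println("reschedule requested"));
        System.out.println("step completed normally: " + ok);
    }
}
```

With an empty map, the unguarded call throws, the guarded call returns the fallback, and `runStep` still invokes the reschedule callback.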
