Re: [I] [Bug] Bookie (v4.1.2) Restarting Suddenly: ZKRegistrationClient Invalidate Cache / NetworkTopology Node Removal [pulsar]

via GitHub Tue, 31 Mar 2026 00:42:46 -0700


lhotari commented on issue #25433:
URL: https://github.com/apache/pulsar/issues/25433#issuecomment-4160580369


   > Our pulsar-bookie pods are experiencing sudden restarts.
   
   Claude answer made with some prompting:
   
   > # Diagnosing Pulsar-Bookie Pod Restarts
   > 
   > ## 1. Check pod details and events
   > ```bash
   > kubectl describe pod <pod-name> -n <namespace>
   > ```
   > 
   > Shows restart count, last termination reason (e.g. `OOMKilled`, exit 
code), and events (probe failures, back-offs) scoped to the pod.
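
   When only the restart count and last termination state are needed, the same fields can be read with a JSONPath query instead of scanning the full describe output. This is a minimal sketch: the function name `last_termination` is made up for illustration, and it assumes `kubectl` is already pointed at the right cluster.

   ```bash
   # Hypothetical helper: print name, restart count, last termination reason
   # and last exit code for every container in a pod, one line per container.
   last_termination() {
     pod="$1"
     ns="$2"
     kubectl get pod "$pod" -n "$ns" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
   }
   ```

   Usage, with the example pod and namespace from step 3: `last_termination pulsar-bookie-0 pulsar`. The reason and exit code columns stay empty for a container that has not been terminated before.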
   > 
   > ## 2. Get logs from the crashed container
   > ```bash
   > kubectl logs <pod-name> -n <namespace> --previous
   > ```
   > 
   > The `--previous` flag retrieves logs from the prior (crashed) instance.
   > 
   > ## 3. Filter events for a specific pod
   > ```bash
   > kubectl get events -n <namespace> --sort-by='.lastTimestamp' \
   >   --field-selector involvedObject.name=<pod-name>
   > ```
   > 
   > Example:
   > ```bash
   > kubectl get events -n pulsar --sort-by='.lastTimestamp' \
   >   --field-selector involvedObject.name=pulsar-bookie-0
   > ```
   > 
   > Useful for spotting node-level evictions or scheduling issues.
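
   On a busy namespace the event list can be narrowed further by combining field selectors, which `kubectl` joins with a logical AND. A small sketch, assuming only `Warning` events are of interest; the function name `pod_warnings` is hypothetical.

   ```bash
   # Hypothetical helper: list only Warning events for one pod, oldest first.
   pod_warnings() {
     pod="$1"
     ns="$2"
     kubectl get events -n "$ns" --sort-by='.lastTimestamp' \
       --field-selector "involvedObject.name=${pod},type=Warning"
   }
   ```

   Usage: `pod_warnings pulsar-bookie-0 pulsar`. Events are only retained for a limited time, so this is most useful shortly after a restart.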
   > 
   > ## 4. Common causes
   > 
   > | Cause | Indicator | Check |
   > |---|---|---|
   > | OOMKilled | Exit code 137 | See OOMKilled details below |
   > | Liveness probe failure | `Unhealthy` / `Killing` events | Review probe 
`timeoutSeconds` and `failureThreshold` |
   > | Disk pressure | Pod evicted or crash loops | `kubectl describe node` and 
PV usage |
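
   The exit code in the table follows the usual shell convention that a value above 128 means the process was terminated by signal number `code - 128`. The sketch below only illustrates that arithmetic; `decode_exit_code` is a made-up name.

   ```bash
   # Decode a container exit code: values above 128 conventionally mean
   # "terminated by signal (code - 128)".
   decode_exit_code() {
     code="$1"
     if [ "$code" -gt 128 ]; then
       echo "signal $((code - 128))"
     else
       echo "exit status $code"
     fi
   }
   ```

   `decode_exit_code 137` prints `signal 9`, which is SIGKILL. Exit code 137 alone only shows that the process was killed; the `OOMKilled` termination reason is what confirms a memory kill.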
   > 
   > ### OOMKilled details
   > 
   > Bookies consume memory well beyond the JVM heap. Off-heap memory (Netty 
direct buffers, RocksDB's native block cache and memtables) and JVM overhead 
(metaspace, thread stacks, GC) all contribute to total RSS.
   > 
   > **`-Xmx` should be set to roughly 60–70% of the pod's memory limit**, 
leaving the remaining 30–40% for direct memory allocations and JVM overhead.
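
   To make the sizing rule concrete, the arithmetic can be sketched as a small shell function. The name `heap_budget`, the default of 65% (the midpoint of the range above), and the 8 GiB limit in the usage line are illustrative assumptions, not values from the issue.

   ```bash
   # Hypothetical helper: split a pod memory limit (in MiB) into a heap size
   # and the remainder left for off-heap memory and JVM overhead.
   # The second argument is the heap percentage and defaults to 65.
   heap_budget() {
     limit_mib="$1"
     heap_pct="${2:-65}"
     heap_mib=$((limit_mib * heap_pct / 100))
     echo "-Xmx${heap_mib}m (leaves $((limit_mib - heap_mib)) MiB outside the heap)"
   }
   ```

   For an assumed 8 GiB limit, `heap_budget 8192` prints `-Xmx5324m (leaves 2868 MiB outside the heap)`. The non-heap remainder still has to cover everything the bookie allocates off-heap, so treat the result as a starting point to validate against observed RSS.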
   > 
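   > The container is killed as soon as its memory usage exceeds 
`resources.limits.memory`, regardless of the request. Kubernetes rejects a pod 
spec whose memory request is higher than its limit, so the limit is always equal 
to or greater than `resources.requests.memory`. Setting the two equal makes the 
scheduler reserve the full amount the bookie is allowed to use.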
   > 
   > **Node-level overprovisioning** can also cause OOMKilled even when the pod 
itself is within its limits. When the node's total committed memory exceeds 
physical RAM, the Linux kernel's OOM killer may terminate a bookie process. 
Check node-level metrics:
   > ```bash
   > kubectl describe node <node-name>
   > kubectl top node <node-name>
   > ```
   > 
   > Look at `Allocated resources` in the describe output to see if the node is 
overcommitted.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

