lhotari commented on issue #25433: URL: https://github.com/apache/pulsar/issues/25433#issuecomment-4160580369
> Our pulsar-bookie pods are experiencing sudden restarts.

Claude answer made with some prompting:

> # Diagnosing Pulsar-Bookie Pod Restarts
>
> ## 1. Check pod details and events
> ```bash
> kubectl describe pod -n
> ```
>
> Shows the restart count, the last termination reason (e.g. `OOMKilled`, exit code), and events (probe failures, back-offs) scoped to the pod.
>
> ## 2. Get logs from the crashed container
> ```bash
> kubectl logs -n --previous
> ```
>
> The `--previous` flag retrieves logs from the prior (crashed) instance.
>
> ## 3. Filter events for a specific pod
> ```bash
> kubectl get events -n --sort-by='.lastTimestamp' \
>   --field-selector involvedObject.name=
> ```
>
> Example:
> ```bash
> kubectl get events -n pulsar --sort-by='.lastTimestamp' \
>   --field-selector involvedObject.name=pulsar-bookie-0
> ```
>
> Useful for spotting node-level evictions or scheduling issues.
>
> ## 4. Common causes
>
> | Cause | Indicator | Check |
> |---|---|---|
> | OOMKilled | Exit code 137 | See OOMKilled details below |
> | Liveness probe failure | `Unhealthy` / `Killing` events | Review probe `timeoutSeconds` and `failureThreshold` |
> | Disk pressure | Pod evicted or crash loops | `kubectl describe node` and PV usage |
>
> ### OOMKilled details
>
> Bookies consume memory well beyond the JVM heap. Direct memory (Netty direct buffers, RocksDB block cache and memtables) and JVM overhead (metaspace, thread stacks, GC) all contribute to total RSS.
>
> **`-Xmx` should be set to roughly 60–70% of the pod's memory limit**, leaving the remaining 30–40% for direct memory allocations and JVM overhead.
>
> The container is killed as soon as its memory usage exceeds `resources.limits.memory`. Kubernetes rejects a pod spec whose memory request is higher than its limit, so `resources.limits.memory` must be equal to or greater than `resources.requests.memory`; make sure the limit is also large enough for the heap plus the off-heap memory described above.
>
> **Node-level overprovisioning** can also cause OOMKilled even when the pod itself is within its limits. When the node's total committed memory exceeds physical RAM, the Linux kernel's OOM killer may terminate a bookie process. Check node-level metrics:
> ```bash
> kubectl describe node
> kubectl top node
> ```
>
> Look at `Allocated resources` in the describe output to see if the node is overcommitted.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
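
The heap sizing guidance in the OOMKilled details (`-Xmx` at roughly 60–70% of the pod's memory limit) can be turned into a quick calculation. The following is a minimal sketch, not part of the original comment: the function name `suggest_heap_mb`, the 65% default, and the 4096 MiB limit are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: suggest a JVM heap size (-Xmx) from a pod memory limit.
# Arguments: the limit in MiB, then an optional heap percentage
# (defaults to 65, the middle of the 60-70% range quoted above).
suggest_heap_mb() {
  local limit_mb=$1
  local heap_pct=${2:-65}
  # Integer arithmetic: the result is rounded down to a whole MiB.
  echo $(( limit_mb * heap_pct / 100 ))
}

# Example with an assumed 4096 MiB limit (not a value taken from the issue).
limit_mb=4096
heap_mb=$(suggest_heap_mb "$limit_mb")
echo "-Xmx${heap_mb}m (leaves $(( limit_mb - heap_mb )) MiB for direct memory and JVM overhead)"
```

The remainder has to cover Netty direct buffers, RocksDB, and JVM overhead together, so if direct memory is capped explicitly (for example with the JVM's `-XX:MaxDirectMemorySize` option), that cap should fit inside the leftover amount with some margin.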
