Hi everyone,

We are facing a strange and serious issue with our Prometheus pods running on Azure instances: the pods restart frequently whenever the load balancer or DNS points to Prometheus. I am looking for a metric or a PromQL query that shows the incoming request count and API calls against the Prometheus endpoint at the time of the pod restarts. I would really appreciate any help with this matter. Thank you!

A little background on what we have done so far:
- I tried the scrape_samples_scraped metric. It shows a spike once a day, BUT the pods restart more than 20 times a day.
- I tried the http_requests_total metric mentioned in https://prometheus.io/docs/prometheus/latest/querying/examples/ BUT it did not show any spike in requests at all.
- I also found the prometheus_http_requests_total metric, and I don't see any spike there either.

To understand the cause of these frequent pod restarts, we have taken the actions below, but with no luck. *Do you have any recommendation or solution to stop these restarts?*

1. We spun up new pods without pointing any LB or DNS at them, and these pods did NOT restart. But this was only for testing purposes, as we cannot go live without an LB or DNS!
2. We checked the access logs to see whether any HTTP requests from applications are causing the issue. We are seeing many readiness probe failures for Prometheus. As a workaround, we increased the readiness and liveness probe timeouts, but this didn't help.
3. We tried deleting the /wal directory to clear any broken files, but the issue came back after a few hours. At least there were NO immediate restarts!
4. We scaled up the Azure instance type so that the pods have enough resources to handle the load (which is invisible in Prometheus). Azure monitoring does not show any spike or high usage, yet the Prometheus pods are still getting restarted.
5. We had a call with the app teams to cross-check whether our Prometheus is being hit by any applications, services, or load tests. We still see the restarts even after suspending some apps.

Best,
Moksh
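A minimal sketch of how the request rate could be lined up against the restarts over the same time window, using the Prometheus HTTP API's `/api/v1/query_range` endpoint. Several things here are assumptions, not details from the setup described above: the server address, the `prometheus.*` pod name pattern, and the `kube_pod_container_status_restarts_total` metric, which is only available if kube-state-metrics is deployed and scraped. The helper name `range_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed address of the Prometheus server; replace with your own.
PROMETHEUS_URL = "http://localhost:9090"

# Per-handler request rate from Prometheus's own HTTP instrumentation.
REQUEST_RATE = "sum by (handler) (rate(prometheus_http_requests_total[5m]))"

# Container restarts over the last hour. This metric comes from
# kube-state-metrics, and the pod name pattern is an assumption.
RESTARTS = (
    'increase(kube_pod_container_status_restarts_total{pod=~"prometheus.*"}[1h])'
)


def range_query_url(query: str, start: int, end: int, step: str = "60s") -> str:
    """Build a /api/v1/query_range URL for the given PromQL expression."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{PROMETHEUS_URL}/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Illustrative one-hour window, given as Unix timestamps.
    start, end = 1700000000, 1700003600
    for name, query in (("request rate", REQUEST_RATE), ("restarts", RESTARTS)):
        print(name, range_query_url(query, start, end))
```

Fetching both URLs for the same window and comparing the series side by side would show whether restarts coincide with a rise in any particular handler. Since a Prometheus that scrapes itself has gaps around its own restarts, the restart series is probably best read from a second Prometheus or another monitoring source if one is available.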

