Hi Everyone,

We are facing a strange and serious issue with our Prometheus pods 
running on Azure instances: the pods restart frequently whenever the load 
balancer or DNS points to Prometheus. I need a metric or a PromQL query 
that shows the incoming request count and API calls against the Prometheus 
endpoint at the time of the pod restarts. I would really appreciate any 
help with this, thank you!

A little background on what we have done so far:

   - I tried the scrape_samples_scraped metric. It shows a spike once a 
   day, BUT the pods are restarting more than 20 times a day.
   - I tried the http_requests_total metric mentioned in 
   https://prometheus.io/docs/prometheus/latest/querying/examples/ BUT it 
   did not show any spike in requests at all.
   - I also found the prometheus_http_requests_total metric, and I don't 
   see any spike there either.
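
For reference, the kind of queries in question can be sketched as below. This is only a minimal sketch: the Prometheus address and the job label are assumed placeholders rather than values from our setup, and the helper names are made up for illustration. It builds a per-handler request rate from prometheus_http_requests_total, a restart counter based on process_start_time_seconds, and a URL for the /api/v1/query_range endpoint. Note that a Prometheus that scrapes itself may not record the last moments before it restarts, so a short spike could still be missing from the graph.

```python
from urllib.parse import urlencode

# Assumed values for illustration only; adjust to the real deployment.
PROMETHEUS_URL = "http://localhost:9090"  # hypothetical address
JOB_LABEL = "prometheus"                  # hypothetical job label


def request_rate_by_handler(window: str = "5m") -> str:
    """PromQL: per-second HTTP request rate against Prometheus, by handler."""
    return f"sum by (handler) (rate(prometheus_http_requests_total[{window}]))"


def restart_count(window: str = "1h", job: str = JOB_LABEL) -> str:
    """PromQL: how often the process start time changed within the window."""
    return f'changes(process_start_time_seconds{{job="{job}"}}[{window}])'


def query_range_url(expr: str, start: int, end: int, step: str = "30s") -> str:
    """Build a /api/v1/query_range URL for the given expression."""
    params = urlencode({"query": expr, "start": start, "end": end, "step": step})
    return f"{PROMETHEUS_URL}/api/v1/query_range?{params}"


if __name__ == "__main__":
    print(request_rate_by_handler())
    print(restart_count())
    # start and end are Unix timestamps, chosen arbitrarily here.
    print(query_range_url(request_rate_by_handler(), 1700000000, 1700003600))
```

Plotting both expressions over the same time range would show whether the request rate on any handler rises shortly before the start time changes.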

To remediate the pod restarts, we have taken the actions below to 
understand their cause, but with no luck. *Do you have any recommendation 
or solution to stop these restarts?*

   1. We spun up new pods without pointing any LB or DNS at them, and 
   there were NO pod restarts. But this was just for testing, as we cannot 
   go live without an LB or DNS!
   2. We checked the access logs to see whether any HTTP requests from 
   applications are causing the issue. We are seeing many readiness probe 
   failures for Prometheus. As a workaround, we increased the readiness 
   and liveness probe timeouts, but this didn't help.
   3. We tried deleting the /wal directory to clear any broken files and 
   avoid the Prometheus pod restarts, but the issue came back after a few 
   hours. At least there were NO immediate restarts!
   4. We scaled up the Azure instance type so that the pods have enough 
   resources to handle the load (which is not visible in Prometheus). Azure 
   monitoring does not show any spike or heavy usage, yet the Prometheus 
   pods are still getting restarted.
   5. We had a call with the app teams to cross-check whether our 
   Prometheus is being hit by any applications/services or a load test. We 
   still see the restarts even after suspending some apps.
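
One check not covered by the steps above is the recorded reason for the last container termination. The sketch below assumes the pod description has been saved as JSON (for example from `kubectl get pod <pod-name> -o json`, where the pod name is a placeholder) and reads the standard `status.containerStatuses` fields. The function name and the sample data are hypothetical and not taken from our cluster.

```python
from typing import Any, Dict, List


def last_terminations(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Summarise restart count and last termination details per container."""
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated is absent if the container never restarted.
        terminated = status.get("lastState", {}).get("terminated") or {}
        results.append({
            "container": status.get("name"),
            "restarts": status.get("restartCount", 0),
            "reason": terminated.get("reason"),
            "exit_code": terminated.get("exitCode"),
        })
    return results


if __name__ == "__main__":
    # Made-up sample shaped like the Kubernetes pod status, for illustration.
    sample_pod = {
        "status": {
            "containerStatuses": [
                {
                    "name": "prometheus",
                    "restartCount": 21,
                    "lastState": {
                        "terminated": {"reason": "OOMKilled", "exitCode": 137}
                    },
                }
            ]
        }
    }
    for entry in last_terminations(sample_pod):
        print(entry)
```

A reason of `OOMKilled` would point at the container memory limit rather than at request volume, while other reasons would need to be read together with the pod events.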

Best,
Moksh
