Hi Moksha,
If you don't see query rates spiking at the problematic times, here are a
few ideas:
* Have you confirmed that the Prometheus pods die with an OOM
(out-of-memory failure) and not for some other reason (e.g. do the logs of
the killed pods show any crash errors)?
* What do the various memory metrics ("process_resident_memory_bytes" /
"process_virtual_memory_bytes" / "container_memory_rss") look like for the
Prometheus processes before they die? Do they increase suddenly before an
OOM, or do they just gradually creep up until the server dies?
* It could still be that a single large query takes out your Prometheus
server, even though the overall query rate doesn't increase. You can use
Prometheus' active query log feature to figure out which query was running
when your Prometheus server crashed / got killed. See
https://training.promlabs.com/training/monitoring-and-debugging-prometheus/logs/active-queries-log/
* You can also generally log all PromQL queries that a Prometheus server
receives to a file, see:
https://training.promlabs.com/training/monitoring-and-debugging-prometheus/logs/query-log/
The limitation of this approach is that it only logs completed
queries, so if your server dies while processing a query, that query will not
show up in the log (for that, you will need the active query log approach
above).
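
In case it helps with the query log idea, here is a minimal sketch for
ranking completed queries by execution time. It assumes that the query log is
enabled (via "query_log_file" in the global section of your Prometheus
config) and that each log line is a JSON object with the query string under
"params.query" and the total execution time in seconds under
"stats.timings.execTotalTime". Please check those field names against your
own log file. The function name is just something I made up for illustration.

```python
import json
from typing import Iterable, List, Tuple


def slowest_queries(lines: Iterable[str], top_n: int = 5) -> List[Tuple[float, str]]:
    """Return up to top_n (seconds, query) pairs, slowest first.

    Assumption: each line is one JSON object shaped like a Prometheus
    query log entry, with "params.query" and "stats.timings.execTotalTime".
    Lines that do not match this shape are skipped.
    """
    results: List[Tuple[float, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip lines that are not valid JSON (e.g. a partially written
            # last line after a crash).
            continue
        if not isinstance(entry, dict):
            continue
        query = (entry.get("params") or {}).get("query")
        timings = (entry.get("stats") or {}).get("timings") or {}
        seconds = timings.get("execTotalTime")
        if query is None or seconds is None:
            continue
        results.append((float(seconds), query))
    # Sort by execution time, longest first.
    results.sort(key=lambda item: item[0], reverse=True)
    return results[:top_n]


# Example usage (the path is hypothetical):
#     with open("/prometheus/query.log") as f:
#         for seconds, query in slowest_queries(f):
#             print(f"{seconds:10.3f}s  {query}")
```

This only covers queries that finished, so a query that kills the server
will still only be visible in the active query log.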
Cheers,
Julius
On Fri, Sep 13, 2024 at 3:38 AM Moksha Reddy G <[email protected]> wrote:
> Hi Everyone,
>
> We are facing some strange and serious issue with our Prometheus pods
> running on Azure instances, frequent pod restarts occurring when the load
> balancer or DNS points to Prometheus. I need to know the metric or some
> sort of PromQL where we could see the incoming requests count and API calls
> against Prometheus endpoint at the time of pod restarts. I would really
> appreciate if you could assist me with this matter, thank you!
>
> Little background on what we did so far:
>
> - I tried using scrape_samples_scraped metric and it is showing spike
> that occurs once in a day BUT pods are getting restarted more than 20 times
> in a day.
> - Tried with http_requests_total metric as mentioned in
> https://prometheus.io/docs/prometheus/latest/querying/examples/ BUT it
> did not show any spike in the requests at all.
> - I got this prometheus_http_requests_total metric and there also I
> don't see any spike at all.
>
> To remediate the pod restart problem, we have performed the actions below to
> understand the cause of these frequent pod restarts, but no luck. *Do you
> have any recommendation or solution to stop these restarts?*
>
> 1. We spun up new pods without pointing any LB or DNS at them; this resulted
> in NO pod restarts. But this is just for testing purposes, as we cannot go
> live without LB or DNS!
> 2. We checked the access logs to see if any HTTP requests from
> applications are causing the issue. We are seeing many readiness probe
> failures for Prometheus. As a workaround, we have increased the readiness
> and liveness check timeouts, but this didn't help.
> 3. We tried deleting the /wal directory to clear broken files and avoid
> Prometheus pod restarts, but the issue is the same after a few hours. NO
> immediate restarts at least!
> 4. We have scaled up the Azure instance type so that the pods have
> enough resources to handle the load (which is invisible in Prometheus), and
> Azure monitoring does not show any spike or high usage, yet the Prometheus
> pods are still getting restarted.
> 5. We had a call with App teams to cross check whether our Prometheus
> is getting hit by any applications/services or some load test. We still see
> the restarts even after we suspended some apps.
>
> Best,
> Moksh
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/f2e63b64-227a-4498-b02c-701ee0bd4e52n%40googlegroups.com
>
--
Julius Volz
PromLabs - promlabs.com