Monitoring for a metric vanishing is not a very good way to do alerting.
Metrics hang around for the staleness interval, which by default is 5
minutes, so absence is detected slowly. Ideally, you should monitor all
the things you care about explicitly, export a success metric like "up"
(1 = working, 0 = not working), and then alert on "up == 0" or
equivalent. This is much more flexible and timely.
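For example, a generic rule that fires for every scrape target that is down could look something like this (a sketch only; the group name, alert name, and "for" duration are placeholders to adjust for your setup):

```yaml
groups:
  - name: targets
    rules:
      - alert: TargetDown
        # "up" is set by Prometheus itself: 1 if the scrape succeeded, 0 if not
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} down"
          description: "{{ $labels.job }}/{{ $labels.instance }} has been unreachable for more than 1 minute."
```

Because "up == 0" returns a vector, one rule covers every target, and the job/instance labels identify which one failed.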
Having said that, there's a quick and dirty hack that might be good enough
for you:
expr: container_memory_usage_bytes offset 10m unless
container_memory_usage_bytes
This will give you an alert if any container_memory_usage_bytes time
series existed 10 minutes ago but does not exist now. The alert will
resolve itself after 10 minutes.
The result of this expression is a vector, so it can alert on multiple
containers at once; each element of the vector carries the container
name in its "name" label.
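Dropped into a rule file, it might look like this (a sketch; the group name, alert name, offset, and annotations are placeholders to adapt):

```yaml
groups:
  - name: containers
    rules:
      - alert: ContainerGone
        # fires for each series that existed 10m ago but is absent now
        expr: container_memory_usage_bytes offset 10m unless container_memory_usage_bytes
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} down"
          description: "Container {{ $labels.name }} has stopped reporting metrics."
```

Unlike absent(), which needs one rule per container and loses the labels, this one rule covers all containers and templates the container name into the alert.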
On Saturday 18 May 2024 at 19:50:48 UTC+1 Sleep Man wrote:
> I have a large number of containers. I learned that the following
> configuration can monitor a single container down. How to configure it to
> monitor all containers and send the container name once a container is down.
>
>
>   - name: containers
>     rules:
>       - alert: jenkins_down
>         expr: absent(container_memory_usage_bytes{name="jenkins"})
>         for: 30s
>         labels:
>           severity: critical
>         annotations:
>           summary: "Jenkins down"
>           description: "Jenkins container is down for more than 30 seconds."
>