*I am running Prometheus to monitor system resources like memory and CPU
usage, as well as other services on the infrastructure. I rely on
Alertmanager to send alerts to Telegram whenever a specific issue occurs
(such as high memory usage or a service stopping).*
*The problem I'm facing is that Alertmanager is not sending a notification when an issue is resolved. I have the following alerts:*

- *High CPU Usage: if CPU usage exceeds 70%.*
- *High Memory Usage: if memory usage exceeds 85%.*
- *Service Stopped: if a service stops working.*

*These alerts are sent to Alertmanager, which then sends notifications via Telegram when an issue arises.*
*The initial alert messages are received correctly when the problem occurs.
However, when the system returns to a normal state and the issue is
"resolved," Alertmanager does not send a notification indicating that the
problem has been resolved.*
*Instead of a "Resolved" message when the issue is fixed, I receive the same alert message again (the one for the original problem).*
*Current configuration:*

*Prometheus configuration (file alerts.yml):*

    groups:
      - name: CPU Usage Alert
        rules:
          - alert: HighCPUUsage
            expr: ceil(100 * (1 - (avg by (Host, Client) (rate(node_cpu_seconds_total{mode="idle"}[5m]))))) > 70
            for: 6m
            labels:
              severity: Critical
              Host: "{{ $labels.Host }}"
              Client: "{{ $labels.Client }}"
            annotations:
              summary: "High CPU usage on {{ $labels.Host }} for {{ $labels.Client }} ({{ $value }})"
              description: "CPU usage on {{ $labels.Host }} for {{ $labels.Client }} has exceeded 70% for 5 minutes."
              resolved: "CPU usage on {{ $labels.Host }} for {{ $labels.Client }} is back to normal ({{ $value }})."
      - name: Memory Usage Alert
        rules:
          - alert: HighMemory
            expr: floor(1 - (avg(node_memory_MemAvailable_bytes) by (Client, Host) / avg(node_memory_MemTotal_bytes) by (Client, Host))) * 100 > 85
            for: 6m
            labels:
              severity: Critical
              Host: "{{ $labels.Host }}"
              Client: "{{ $labels.Client }}"
            annotations:
              summary: "High Memory usage on {{ $labels.Host }} for {{ $labels.Client }} ({{ $value }})"
              description: "Memory usage on {{ $labels.Host }} for {{ $labels.Client }} has exceeded 85% for 5 minutes."
              resolved: "Memory usage on {{ $labels.Host }} for {{ $labels.Client }} is back to normal ({{ $value }}%)."
*Alertmanager configuration (file alertmanager.yml):*

    global:
      resolve_timeout: 5m
    route:
      receiver: telegram_receiver
      group_by: ["alertname", "Host"]
      group_wait: 15s
      group_interval: 15s
      repeat_interval: 24h
      routes:
        - receiver: 'telegram_receiver'
          matchers:
            - severity="Critical"
    receivers:
      - name: 'telegram_receiver'
        telegram_configs:
          - api_url: 'https://api.telegram.org'
            send_resolved: true
            bot_token: xxxxxxxxxxxxxx
            chat_id: yyyyyyyyyyyyyyyyyyyyyyyyy
            message: '{{ range .Alerts }}Alert⚠️: {{ printf "%s\n" .Labels.alertname }}{{ printf "%s\n" .Annotations.summary }}{{ printf "%s\n" .Annotations.description }}{{ end }}'
            parse_mode: 'HTML'
*I would greatly appreciate any guidance or solutions to this issue.*
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion visit
https://groups.google.com/d/msgid/prometheus-users/4010b5eb-e76a-467d-b4fb-ada44af2912bn%40googlegroups.com.