[prometheus-users] Re: Prometheus alert tagging issue - multiple servers

mohan garden Wed, 03 Apr 2024 01:11:33 -0700

*Correction:*
*Scenario2:* While the server1 alert is active, the local disk usage of a 
second server (say server2) reaches 50%.


I see that the details of the already open Opsgenie ticket get updated as:
ticket header name:  local disk usage reached 50%
ticket description:  space on /var file system at server1:9100 server = 82%."
                     space on /var file system at server2:9100 server = 80%."
ticket tags: criteria: overuse , team: support, severity: critical, infra,monitor,host=server1

[image: photo003.png]



On Wednesday, April 3, 2024 at 1:37:12 PM UTC+5:30 mohan garden wrote:

> Hi Brian, 
> Thank you for the response. Here are some more details on the 
> configuration and the method I am using to generate tags:
>
>
> 1. We collect data from the node exporter, and have created some rules 
> around the collected data. Here is one example - 
>     - alert: "Local Disk usage has reached 50%"
>       expr: (round(node_filesystem_avail_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"}
>         / node_filesystem_size_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"}
>         * 100, 0.1) >= y)
>         and (round(node_filesystem_avail_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"}
>         / node_filesystem_size_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"}
>         * 100, 0.1) <= z)
>       for: 5m
>       labels:
>         criteria: overuse
>         severity: critical
>         team: support
>       annotations:
>         summary: "{{ $labels.instance }} 's  ({{ $labels.device }}) has low space."
>         description: "space on {{ $labels.mountpoint }} file system at {{ $labels.instance }} server = {{ $value }}%."
>
> 2. In Alertmanager, we have created notification rules to notify when 
> the aforementioned condition occurs:
>
>   smtp_from: '[email protected]'
>   smtp_require_tls: false
>   smtp_smarthost: '[email protected]:25'
>
> templates:
>   - /home/ALERTMANAGER/conf/template/*.tmpl
>
> route:
>   group_wait: 5m
>   group_interval: 2h
>   repeat_interval: 5h
>   receiver: admin
>   routes:
>   - match_re:
>       alertname: ".*Local Disk usage has reached .*%"
>     receiver: admin
>     routes:
>     - match:
>         criteria: overuse
>         severity: critical
>         team: support
>       receiver: mailsupport
>       continue: true
>     - match:
>         criteria: overuse
>         team: support
>         severity: critical
>       receiver: opsgeniesupport
>
> receivers:
>   - name: opsgeniesupport
>     opsgenie_configs:
>     - api_key: XYZ
>       api_url: https://api.opsgenie.com
>       message: '{{ .CommonLabels.alertname }}'
>       description: "{{ range .Alerts }}{{ .Annotations.description }}\n\r{{ end }}"
>       tags: '{{ range $k, $v := .CommonLabels}}{{ if or (eq $k "criteria")  (eq $k "severity") (eq $k "team") }}{{$k}}={{$v}},{{ else if eq $k "instance" }}{{ reReplaceAll "(.+):(.+)" "host=$1" $v }},{{end}}{{end}},infra,monitor'
>       priority: 'P1'
>       update_alerts: true
>       send_resolved: true
> ...
> As you can see, I derive a host=<hostname> tag from the instance label.
>
>
> *Scenario1:* When server1's local disk usage reaches 50%, I see that an 
> Opsgenie ticket is created with the following metadata:
> ticket header name:  local disk usage reached 50%
> ticket description:  space on /var file system at server1:9100 server = 82%."
> ticket tags: criteria: overuse , team: support, severity: critical, infra,monitor,host=server1
>
> So everything works as expected; no issues with Scenario1.
>
>
> *Scenario2:* While the server1 alert is active, the local disk usage of a 
> second server (say server2) reaches 50%.
>
> I see that the Opsgenie tickets are getting updated as:
> ticket header name:  local disk usage reached 50%
> ticket description:  space on /var file system at server1:9100 server = 82%."
> ticket description:  space on /var file system at server2:9100 server = 80%."
> ticket tags: criteria: overuse , team: support, severity: critical, infra,monitor,host=server1
>
>
> But I was expecting an additional host=server2 tag on the ticket.
> In summary: I see the updated description, but I do not see updated tags.
>
> In the tags section of the Alertmanager Opsgenie integration configuration, 
> I tried iterating over both .Alerts and .CommonLabels, but I was unable to 
> add the additional host=server2 tag:
> {{ range $idx, $alert := .Alerts}}{{range $k, $v := $alert.Labels }}{{$k}}={{$v}},{{end}}{{end}},test=test
> {{ range $k, $v := .CommonLabels}}....{{end}}
>
>
> At the moment, I am not sure what is preventing the tags on the Opsgenie 
> tickets from being updated.
> If I can get confirmation that my Alertmanager configuration is good 
> enough, then I can look at the Opsgenie configuration.
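A note on the two loops quoted above: .CommonLabels only holds labels whose values are identical across every alert in the group, so once server1 and server2 are in the same group, instance is no longer part of it and no host= tag can be derived from it. The fragment below is a minimal sketch of building one host tag per alert from .Alerts instead. It reuses the receiver name and settings from the config in this thread; the regex that strips the port is my own choice. It also rests on an assumption worth verifying: as far as I understand the Alertmanager documentation, update_alerts only updates the message and description of an existing Opsgenie alert, so tags on an already open ticket may stay unchanged even with a correct template.

```yaml
receivers:
  - name: opsgeniesupport
    opsgenie_configs:
      - api_key: XYZ
        message: '{{ .CommonLabels.alertname }}'
        # One host=<name> tag per alert in the group.
        # ":.*" strips the port, so "server1:9100" becomes "server1".
        tags: '{{ range .Alerts }}host={{ reReplaceAll ":.*" "" .Labels.instance }},{{ end }}infra,monitor'
        update_alerts: true
        send_resolved: true
```

If several mountpoints on the same server fire at once, this sketch emits the same host tag more than once.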
>
>
> Please advise.
>
>
> Regards
> CP
>
>
> On Tuesday, April 2, 2024 at 10:46:36 PM UTC+5:30 Brian Candler wrote:
>
>> FYI, those images are unreadable - copy-pasted text would be much better.
>>
>> My guess, though, is that you probably don't want to group alerts before 
>> sending them to opsgenie. You haven't shown your full alertmanager config, 
>> but if you have a line like
>>
>>    group_by: ['alertname']
>>
>> then try
>>
>>    group_by: ["..."]
>>
>> (literally, exactly that: a single string containing three dots, inside 
>> square brackets)
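The suggestion above, sketched in the context of the route tree posted in this thread. The matchers and receiver names are copied from that config; where exactly group_by is set (the top-level route or the Opsgenie sub-route) is an assumption, since the full config was not shown.

```yaml
route:
  receiver: admin
  routes:
    - match:
        criteria: overuse
        severity: critical
        team: support
      receiver: opsgeniesupport
      # The special value '...' disables aggregation:
      # every distinct label set is notified on its own.
      group_by: ['...']
```

With grouping disabled, each server's alert should reach Opsgenie as its own notification, so a tag template based on .CommonLabels would see that alert's instance label. A narrower alternative would be group_by: ['alertname', 'instance'], which keeps one group per host.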
>>
>> On Tuesday 2 April 2024 at 17:15:39 UTC+1 mohan garden wrote:
>>
>>> Dear Prometheus Community,
>>> I am reaching out regarding an issue I have encountered with Prometheus 
>>> alert tagging, specifically while creating tickets in Opsgenie.
>>>
>>>
>>> I have configured Alertmanager to send alerts to Opsgenie with the 
>>> following configuration:
>>> [image: photo001.png]
>>> A ticket is generated with the expected description and tags:
>>> [image: photo002.png]
>>>
>>> Now, by default the alerts are grouped by alert name (default 
>>> behavior). So when a similar event happens on a different server, I see 
>>> that the description is updated as:
>>> [image: photo003.png]
>>> but the tags on the ticket remain the same.
>>> Expected behavior: criteria=..., host=108, host=114, infra.....support
>>>
>>> I have set the update_alerts and send_resolved settings to true.
>>> I am not sure whether I need additional configuration in Opsgenie or in 
>>> Alertmanager to make this work as expected.
>>>
>>> I would appreciate any insight or guidance on the method to resolve this 
>>> issue and ensure that alerts for different servers are correctly tagged in 
>>> Opsgenie.
>>>
>>> Thank you in advance.
>>> Regards,
>>> CP
>>>
>>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/38adb61c-20a8-43bd-badb-7fc726796324n%40googlegroups.com.
