How do you set up effective DevOps alerting systems?
#1
We're drowning in alerts and need to improve our DevOps alerting systems. How do you balance getting notified about important issues without alert fatigue?

What tools do you use for monitoring and alerting? How do you determine what deserves an alert vs just being logged?

Also interested in how you handle alert routing - who gets notified for what, and how do you escalate when needed? Any tips for creating meaningful alert messages that actually help people understand and fix issues quickly?
#2
For DevOps alerting systems, we use a combination of Prometheus for metrics, Grafana for visualization, and Alertmanager for routing. The key is setting intelligent thresholds, not just "CPU > 80%".
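By "intelligent" I mean something like the following Python sketch: require the condition to hold across several consecutive samples instead of paging on a single spike. This isn't our actual config (Prometheus expresses the same idea with the for: duration on an alert rule); the class and numbers are made up for illustration.

from collections import deque

class SustainedThreshold:
    """Fire only when the condition holds for N consecutive samples, not on one spike."""
    def __init__(self, threshold, samples_required):
        self.threshold = threshold
        self.samples = deque(maxlen=samples_required)

    def observe(self, value):
        # Record one sample; return True once the alert should fire.
        self.samples.append(value)
        return (len(self.samples) == self.samples.maxlen
                and all(v > self.threshold for v in self.samples))

# Example: 5xx error rate must stay above 5% for 5 consecutive scrapes.
check = SustainedThreshold(threshold=0.05, samples_required=5)
for rate in [0.02, 0.09, 0.08, 0.07, 0.06, 0.06]:
    if check.observe(rate):
        print("fire: sustained elevated error rate")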

We categorize alerts as critical, warning, and info. Critical alerts page the on-call engineer. Warning alerts go to a Slack channel for investigation during business hours. Info alerts are just logged.
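If it helps, the routing logic boils down to something like this Python sketch. The channel name and notifier functions are placeholders, not real PagerDuty or Slack calls.

def page_on_call(alert):
    # Placeholder for creating a PagerDuty incident.
    print(f"PAGE on-call: {alert['summary']}")

def post_to_slack(channel, alert):
    # Placeholder for a Slack webhook; reviewed during business hours.
    print(f"Slack {channel}: {alert['summary']}")

def log_only(alert):
    # Info alerts are recorded, nobody is notified.
    print(f"log: {alert['summary']}")

ROUTES = {
    "critical": lambda a: page_on_call(a),
    "warning": lambda a: post_to_slack("#service-alerts", a),
    "info": lambda a: log_only(a),
}

def route_alert(alert):
    ROUTES.get(alert.get("severity"), log_only)(alert)

route_alert({"severity": "critical", "summary": "API gateway error rate 12%"})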

We also implemented alert deduplication and grouping. If multiple services are experiencing the same issue, we get one alert, not twenty. And we use alert templates that include runbook links and suggested actions.
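The grouping part is conceptually simple. Here is a Python sketch of the idea (Alertmanager does this for real via its group_by setting; the label names below are made up):

from collections import defaultdict

def group_alerts(alerts, group_by=("alertname",)):
    """Collapse alerts that share the same grouping labels into one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert[label] for label in group_by)
        groups[key].append(alert)
    return groups

alerts = [
    {"alertname": "HighErrorRate", "service": "checkout"},
    {"alertname": "HighErrorRate", "service": "payments"},
    {"alertname": "HighErrorRate", "service": "search"},
]

for (alertname,), members in group_alerts(alerts).items():
    services = ", ".join(a["service"] for a in members)
    print(f"one notification for {alertname}, affected services: {services}")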

The biggest improvement was involving developers in alert creation. They know what metrics indicate problems with their services better than ops does.
#3
We struggled with alert fatigue until we moved to SLO-based alerting. Instead of alerting on individual metrics, we alert when our Service Level Objectives are at risk of being violated.

For example, instead of "error rate > 1%", we alert "error budget consumption rate will exhaust budget in 4 hours if current trend continues". This focuses on what matters to users rather than internal metrics.
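The arithmetic behind that kind of alert is straightforward. A rough Python sketch, assuming a 99.9% availability SLO over a 30-day window and roughly constant traffic (all numbers below are made up):

def hours_until_budget_exhausted(slo_target, window_hours,
                                 budget_used_fraction, current_error_rate):
    """Estimate how long the remaining error budget lasts at the current error rate."""
    error_budget = 1.0 - slo_target                 # allowed failure fraction, e.g. 0.001
    if current_error_rate <= 0:
        return float("inf")
    burn_rate = current_error_rate / error_budget   # multiples of the "exactly on budget" pace
    remaining_fraction = 1.0 - budget_used_fraction
    return remaining_fraction * window_hours / burn_rate

hours_left = hours_until_budget_exhausted(
    slo_target=0.999,            # 99.9% availability SLO
    window_hours=30 * 24,        # 30-day rolling window
    budget_used_fraction=0.40,   # 40% of the budget already burned
    current_error_rate=0.02)     # 2% of requests currently failing
print(f"budget exhausted in ~{hours_left:.1f}h at the current rate")
# A rule like "page when hours_left < 4" matches the alert described above.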

We use Datadog for monitoring and PagerDuty for alert routing. Datadog's anomaly detection helps reduce false positives by learning normal patterns.

Every alert must have a runbook. If you can't write a runbook for how to respond to an alert, you shouldn't be alerting on it.
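One way to enforce that is a lint step over the alert definitions before they ship. A hypothetical sketch, assuming alerts are plain dicts with a runbook_url field (the names and URL are made up):

ALERTS = [
    {"name": "HighErrorRate",
     "runbook_url": "https://wiki.example.com/runbooks/high-error-rate"},
    {"name": "DiskAlmostFull"},   # no runbook -> should be rejected
]

def require_runbooks(alerts):
    missing = [a["name"] for a in alerts if not a.get("runbook_url")]
    if missing:
        raise ValueError(f"alerts without runbooks: {missing}")

require_runbooks(ALERTS)   # raises until someone writes the runbook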
#4
Our DevOps alerting systems follow the "three alarms" rule: if the same alert fires three times without being addressed, it automatically escalates to the next level.
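The bookkeeping behind that rule is roughly this Python sketch; the levels and counters are illustrative, not a real PagerDuty escalation policy.

from collections import defaultdict

ESCALATION_LEVELS = ["on-call engineer", "team lead", "engineering manager"]

class Escalator:
    """Escalate one level after N firings of the same alert with no acknowledgement."""
    def __init__(self, fires_before_escalation=3):
        self.fires = defaultdict(int)
        self.level = defaultdict(int)
        self.limit = fires_before_escalation

    def alert_fired(self, name):
        self.fires[name] += 1
        if self.fires[name] >= self.limit:
            self.fires[name] = 0
            self.level[name] = min(self.level[name] + 1, len(ESCALATION_LEVELS) - 1)
        return ESCALATION_LEVELS[self.level[name]]

    def acknowledged(self, name):
        self.fires[name] = 0   # an ack resets the counter at the current level

esc = Escalator()
for _ in range(3):
    target = esc.alert_fired("HighErrorRate")
print(target)   # "team lead" after the third unacknowledged firing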

We also use dynamic routing based on time and expertise. During business hours, alerts go to the team that owns the service. After hours, they go to the general on-call engineer who can escalate if needed.
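As a sketch of that routing decision (the ownership map and business hours below are placeholders):

from datetime import datetime

SERVICE_OWNERS = {"api-gateway": "#team-platform", "checkout": "#team-payments"}

def routing_target(service, now=None):
    now = now or datetime.now()
    business_hours = now.weekday() < 5 and 9 <= now.hour < 18
    if business_hours:
        return SERVICE_OWNERS.get(service, "#general-oncall")
    return "#general-oncall"   # after hours: general on-call, who escalates if needed

print(routing_target("checkout", datetime(2024, 3, 5, 14, 0)))   # weekday afternoon -> #team-payments
print(routing_target("checkout", datetime(2024, 3, 5, 23, 0)))   # late night -> #general-oncall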

For alert messages, we follow a template: [Service] [Severity] [Issue] [Impact] [Action]. Example: "API Gateway CRITICAL Latency spike Affecting 30% of users Check load balancer config".
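The template is easy to enforce in code so messages stay consistent. A small sketch with placeholder field names:

def format_alert(service, severity, issue, impact, action):
    """Render the [Service] [Severity] [Issue] [Impact] [Action] template."""
    return f"{service} {severity.upper()} {issue} {impact} {action}"

print(format_alert(
    service="API Gateway",
    severity="critical",
    issue="Latency spike",
    impact="Affecting 30% of users",
    action="Check load balancer config",
))
# -> API Gateway CRITICAL Latency spike Affecting 30% of users Check load balancer config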

We review our alerting rules quarterly. Any alert that hasn't fired in 3 months gets reviewed - maybe the threshold is wrong, or maybe the issue is fixed and we don't need the alert anymore.