How to Reduce False Positive Alerts in Monitoring
False alerts cause alert fatigue and erode trust in monitoring. Learn practical techniques to reduce false positives and keep alerts meaningful.
Wakestack Team
Engineering Team
The False Positive Problem
Every false alert has costs:
- Interruption: Someone investigates nothing
- Erosion of trust: Team starts ignoring alerts
- Delayed response: Real incidents dismissed as "probably false"
- Burnout: On-call becomes dreaded
If your team jokes about ignoring alerts, you have a false positive problem.
Identifying False Positives
Track Alert Outcomes
For every alert, record:
- Was it a real problem? (Yes/No/Partial)
- What action was taken?
- How long to resolve?
After a month, calculate your false positive rate:
False Positive Rate = False Alerts / Total Alerts × 100
Target: Under 10%. Ideal: Under 5%.
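If you record outcomes in a structured form, the rate takes only a few lines to compute. A minimal Python sketch, assuming a hypothetical list of alert records shaped like the fields above:

```python
# Hypothetical alert log: one record per alert, with the outcome you tracked.
alerts = [
    {"name": "HighCPU", "real_problem": False, "action": "none"},
    {"name": "APIDown", "real_problem": True, "action": "restarted pod"},
    {"name": "HighCPU", "real_problem": False, "action": "none"},
]

false_alerts = sum(1 for a in alerts if not a["real_problem"])
false_positive_rate = false_alerts / len(alerts) * 100

# 66.7% in this toy example, far above the 10% target.
print(f"False positive rate: {false_positive_rate:.1f}%")
```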
Common Patterns
Time-based patterns:
- Alerts during deployments (expected behavior)
- Alerts at specific times (cron jobs, backups)
- Alerts during maintenance windows
Source-based patterns:
- Specific probe locations causing issues
- Certain services more prone to false alerts
- Network-dependent checks
Threshold patterns:
- Alerts that clear within seconds
- Alerts that hover at threshold
- Alerts that fire repeatedly
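Time-based patterns in particular are easy to surface once you log outcomes. A rough Python sketch, assuming a hypothetical history of (timestamp, was it real) pairs:

```python
from collections import Counter
from datetime import datetime

# Hypothetical history: when each alert fired and whether it was a real problem.
alert_history = [
    ("2024-01-15T03:02:00", False),
    ("2024-01-16T03:01:00", False),
    ("2024-01-16T14:30:00", True),
]

# Count false positives per hour of day; a spike at one hour usually points
# to a cron job, backup, or other scheduled work.
false_by_hour = Counter(
    datetime.fromisoformat(ts).hour for ts, was_real in alert_history if not was_real
)
for hour, count in false_by_hour.most_common():
    print(f"{count} false alert(s) around {hour:02d}:00")
```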
Technique 1: Add Confirmation Checks
Don't alert on first failure. Require multiple consecutive failures.
How It Works
Check 1: FAIL
→ Wait, check again
Check 2: FAIL
→ Wait, check again
Check 3: FAIL
→ NOW alert (confirmed failure)
Implementation
Most monitoring tools have this built in:
Wakestack: Configure "consecutive failures before alert"
Prometheus (alerting rules):
```yaml
groups:
  - name: example
    rules:
      - alert: HighCPU
        expr: cpu_usage > 90
        for: 5m  # Must be true for 5 minutes
        labels:
          severity: warning
```
Datadog:
- Set "Trigger when metric is above threshold for X minutes"
Recommended Settings
| Alert Type | Confirmation Time |
|---|---|
| Critical services | 2-3 checks (1-3 min) |
| Standard services | 3-5 checks (3-5 min) |
| Non-critical | 5-10 checks (5-10 min) |
Technique 2: Use Multi-Location Verification
Single-location failures are often network issues, not service issues.
The Problem
Probe in London: FAIL (network issue)
Your service: Actually fine
Result: False alert
The Solution
Require failures from multiple locations:
London: FAIL
Singapore: OK
Virginia: OK
Result: Probably London network issue, don't alert
Implementation
Alert only when majority fail:
| Locations | Failures for Alert |
|---|---|
| 3 locations | 2+ failures |
| 5 locations | 3+ failures |
| 7 locations | 4+ failures |
Most uptime monitoring services support this natively.
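If you aggregate probe results yourself, the voting rule is a few lines. A Python sketch with hypothetical per-location results:

```python
# Hypothetical results from one check cycle: True means the probe succeeded.
results = {"london": False, "singapore": True, "virginia": True}

failures = sum(1 for ok in results.values() if not ok)
majority = len(results) // 2 + 1  # 2 of 3, 3 of 5, 4 of 7

if failures >= majority:
    print("ALERT: majority of locations failing, likely a real outage")
else:
    print("No alert: isolated failure, probably a local network issue")
```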
Technique 3: Adjust Thresholds
Thresholds set too low trigger on normal variation.
Finding the Right Threshold
- Collect baseline data: What's normal for this metric?
- Calculate percentiles: What's the 95th or 99th percentile? (See the sketch after this list.)
- Set threshold above normal: Alert on genuinely abnormal values
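One way to get those percentiles from baseline data, sketched in Python with the standard library (the sample values are made up):

```python
import statistics

# Hypothetical baseline: response times in ms sampled over a typical week.
baseline_ms = [120, 135, 150, 140, 160, 900, 155, 145, 130, 170]

cuts = statistics.quantiles(baseline_ms, n=100)  # 99 percentile cut points
p95, p99 = cuts[94], cuts[98]
print(f"p95={p95:.0f}ms  p99={p99:.0f}ms")

# Alert well above normal variation (past p99), not near the average.
threshold_ms = round(p99 * 1.2)
```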
Example: CPU Alerts
Too sensitive (causes false positives):
Alert when CPU > 70%
Normal operation often hits 70%. You'll alert constantly.
Better approach:
Alert when CPU > 90% for 5+ minutes
Occasional spikes to 95% are fine. Sustained 90%+ indicates a real problem.
Metric-Specific Recommendations
| Metric | Poor Threshold | Better Threshold |
|---|---|---|
| CPU | > 70% | > 90% for 5 min |
| Memory | > 80% | > 95% for 5 min |
| Disk | > 80% | > 90% |
| Response time | > 500ms | > 2000ms for 3 min |
| Error rate | > 0.1% | > 5% for 5 min |
Technique 4: Use Hysteresis
Prevent flapping alerts when metrics hover around threshold.
The Problem
CPU: 89% → No alert
CPU: 91% → ALERT!
CPU: 89% → Resolved
CPU: 91% → ALERT!
CPU: 89% → Resolved
... (repeats endlessly)
The Solution: Different Trigger and Recovery Thresholds
Alert when: CPU > 90%
Resolve when: CPU < 80%
CPU at 89%: Still alerting (not resolved yet)
CPU at 79%: Now resolved
This prevents rapid alert/resolve cycles.
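In code, hysteresis is just a small piece of state plus two thresholds. A Python sketch of the idea, using the same numbers as above:

```python
class HysteresisAlert:
    """Track alert state with separate trigger and recovery thresholds."""

    def __init__(self, trigger: float = 90.0, recover: float = 80.0):
        self.trigger = trigger
        self.recover = recover
        self.firing = False

    def update(self, value: float) -> bool:
        if not self.firing and value > self.trigger:
            self.firing = True   # crossed the trigger threshold
        elif self.firing and value < self.recover:
            self.firing = False  # only resolve once comfortably below trigger
        return self.firing

alert = HysteresisAlert()
for cpu in [89, 91, 89, 91, 79]:
    print(cpu, "firing" if alert.update(cpu) else "ok")
# 89 ok, 91 firing, 89 still firing, 91 firing, 79 ok: no flapping
```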
Implementation
```yaml
# Prometheus example with hysteresis-like behavior
- alert: HighMemory
  expr: memory_usage > 90
  for: 5m
  annotations:
    summary: "Memory above 90% for 5+ minutes"

# Separate recovery notification (optional)
- alert: MemoryRecovered
  expr: memory_usage < 80
  for: 5m
  # Caution: as written, this fires whenever memory stays below 80%,
  # not only after HighMemory; use routing or inhibition rules to scope it.
```
Technique 5: Smart Timing
Avoid alerting during expected disruptions.
Maintenance Windows
Suppress alerts during planned maintenance:
```yaml
# Alertmanager silence (created at runtime via the API or amtool,
# not declared in alertmanager.yml)
silences:
  - matchers:
      - name: service
        value: api
    startsAt: "2024-01-15T02:00:00Z"
    endsAt: "2024-01-15T04:00:00Z"
    comment: "Scheduled maintenance"
```
Deployment Windows
Temporarily increase tolerance during deployments:
- Extend confirmation time
- Increase thresholds
- Route alerts to deployment channel, not on-call
Known Spikes
If backups cause CPU spikes every night at 3 AM:
- Exclude that time from alerting (see the sketch after this list)
- Or adjust thresholds for that period
- Or accept and document the pattern
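The first two options can be scripted if your notification pipeline lets you filter. A Python sketch, assuming a hypothetical nightly backup window and alert name:

```python
from datetime import datetime, time

# Hypothetical nightly backup window (03:00-03:30) that always spikes CPU.
SUPPRESS_START = time(3, 0)
SUPPRESS_END = time(3, 30)

def in_suppression_window(now: datetime) -> bool:
    return SUPPRESS_START <= now.time() <= SUPPRESS_END

def should_notify(alert_name: str, now: datetime) -> bool:
    # Document the exception here instead of letting it page someone nightly.
    if alert_name == "HighCPU" and in_suppression_window(now):
        return False
    return True

print(should_notify("HighCPU", datetime(2024, 1, 15, 3, 10)))  # False: suppressed
print(should_notify("HighCPU", datetime(2024, 1, 15, 14, 0)))  # True
```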
Technique 6: Improve Health Checks
Better health checks = fewer false positives.
Make Health Checks Representative
Bad: Check only that port 80 answers or that the homepage returns 200
The check passes even when the database is down
Good: Check if application is functional
GET /health
- Verifies database connection
- Checks critical dependencies
- Returns 500 if anything is wrong
Avoid Slow Health Checks
Health checks should be fast:
- Under 1 second response time
- No expensive database queries
- Cache dependency status if needed (see the sketch below)
Slow health checks time out and appear as failures.
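Caching is the usual way to keep the endpoint fast when a dependency check is expensive. A minimal Python sketch; `expensive_dependency_check` is a hypothetical stand-in for a real database round-trip:

```python
import time

_CACHE_TTL_SECONDS = 10
_cache = {"checked_at": 0.0, "healthy": True}

def expensive_dependency_check() -> bool:
    """Hypothetical slow check, e.g. a query against the primary database."""
    return True

def cached_dependency_status() -> bool:
    # Re-run the real check at most every _CACHE_TTL_SECONDS,
    # so /health stays fast even under frequent probing.
    if time.monotonic() - _cache["checked_at"] > _CACHE_TTL_SECONDS:
        _cache["healthy"] = expensive_dependency_check()
        _cache["checked_at"] = time.monotonic()
    return _cache["healthy"]
```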
Include Dependency Checks
If your service depends on a database, your health check should verify:
```json
{
  "status": "healthy",
  "database": "connected",
  "cache": "connected",
  "external_api": "reachable"
}
```
Return unhealthy if critical dependencies are down.
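A health endpoint that aggregates dependency checks might look like the following sketch (Flask is used for brevity; `database_connected` and `cache_connected` are hypothetical stand-ins for real checks):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def database_connected() -> bool:
    """Hypothetical check, e.g. run `SELECT 1` against the primary database."""
    return True

def cache_connected() -> bool:
    """Hypothetical check, e.g. a Redis PING."""
    return True

@app.route("/health")
def health():
    checks = {
        "database": database_connected(),
        "cache": cache_connected(),
    }
    healthy = all(checks.values())
    body = {
        "status": "healthy" if healthy else "unhealthy",
        **{name: "connected" if ok else "down" for name, ok in checks.items()},
    }
    # 200 only when every critical dependency is up; 503 otherwise,
    # so the monitor sees a failure instead of a misleading success.
    return jsonify(body), 200 if healthy else 503
```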
Technique 7: Categorize and Route
Not all alerts need the same treatment.
Severity Levels
Critical: Page immediately (real customer impact)
Warning: Notify during business hours
Info: Log for review, don't notify
Routing by Confidence
High confidence (multiple checks, sustained): → Page on-call
Low confidence (single check, transient): → Slack channel
Example Alertmanager Routing
```yaml
route:
  receiver: 'default-slack'
  routes:
    - match:
        severity: critical
        confidence: high
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack-warnings'
```
Technique 8: Regular Alert Review
Alerts drift over time. Review regularly.
Monthly Alert Audit
- List all alerts that fired
- Categorize each: Real issue / False positive / Unknown
- For false positives: Identify why and fix
- For never-firing alerts: Are they still relevant?
- Calculate false positive rate: Track over time
Questions to Ask
- Did this alert require action?
- Could we have caught this differently?
- Is the threshold still appropriate?
- Does this alert still matter?
Alert Retirement
Delete alerts that:
- Haven't fired in 6+ months (might be obsolete)
- Always require "ignore, it's fine" response
- Duplicate other alerts
- Monitor deprecated services
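Finding retirement candidates is straightforward if you can export when each alert last fired. A Python sketch with made-up data:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical map of alert name to when it last fired (None = never).
last_fired = {
    "HighCPU": datetime(2024, 1, 10, tzinfo=timezone.utc),
    "LegacyQueueDepth": datetime(2023, 5, 2, tzinfo=timezone.utc),
    "OrphanedCronCheck": None,
}

cutoff = datetime.now(timezone.utc) - timedelta(days=180)
stale = [name for name, ts in last_fired.items() if ts is None or ts < cutoff]
print("Candidates for retirement:", stale)  # review before deleting
```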
Implementation Checklist
Quick Wins (Do This Week)
- Add confirmation checks (2-3 consecutive failures)
- Enable multi-location verification
- Review top 5 noisiest alerts
- Set up alert outcome tracking
Medium Term (This Month)
- Audit all alerts for appropriate thresholds
- Implement maintenance window suppression
- Improve health check endpoints
- Create severity-based routing
Ongoing
- Monthly alert review
- Track false positive rate
- Retire obsolete alerts
- Refine thresholds based on data
Summary
Reducing false positive alerts requires systematic effort:
Immediate improvements:
- Require confirmation (consecutive failures)
- Use multi-location verification
- Adjust thresholds to catch real problems, not noise
Structural improvements:
- Add hysteresis to prevent flapping
- Suppress during known disruptions
- Improve health check quality
Ongoing practice:
- Track alert outcomes
- Review alerts monthly
- Delete alerts that don't drive action
The goal is alerts you trust. When an alert fires, the response should be "let's fix this"—not "probably another false alarm."
Every false positive you eliminate makes real alerts more meaningful.
Frequently Asked Questions
What causes false positive alerts?
Common causes include overly sensitive thresholds, single-check failures without confirmation, network issues at probe locations, temporary spikes during normal operation, and problems in the monitoring system itself.
How many false positives are acceptable?
Industry guidance suggests keeping false positive rates under 5-10%. If more than 1 in 10 alerts is false, your team will start ignoring alerts entirely.
Should I delete alerts that have false positives?
Not necessarily. First try adjusting thresholds, adding confirmation checks, or changing timing. Only delete alerts that provide no value even when correctly configured.
Related Articles
The Real Difference Between 'Monitoring' and 'Alerting'
Monitoring and alerting aren't the same thing. Understanding the difference prevents alert fatigue and improves incident response. Here's what each actually does.
What Is Alert Fatigue and How Do Teams Fix It?
Alert fatigue happens when too many alerts cause people to ignore them. Learn the causes, warning signs, and proven strategies to eliminate alert fatigue in your team.
What Is a False Positive Alert?
A false positive alert is an alert that fires when there's no real problem. Learn why false positives happen, how they damage your incident response, and how to eliminate them.