How to Reduce False Positive Alerts in Monitoring

False alerts cause alert fatigue and erode trust in monitoring. Learn practical techniques to reduce false positives and keep alerts meaningful.

Wakestack Team · Engineering Team · 7 min read

The False Positive Problem

Every false alert has costs:

  • Interruption: Someone drops their work to investigate a non-issue
  • Erosion of trust: Team starts ignoring alerts
  • Delayed response: Real incidents dismissed as "probably false"
  • Burnout: On-call becomes dreaded

If your team jokes about ignoring alerts, you have a false positive problem.

Identifying False Positives

Track Alert Outcomes

For every alert, record:

  • Was it a real problem? (Yes/No/Partial)
  • What action was taken?
  • How long to resolve?

After a month, calculate your false positive rate:

False Positive Rate = False Alerts / Total Alerts × 100

Target: Under 10%. Ideal: Under 5%.
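
As a quick illustration, this rate is easy to compute from a simple log of alert outcomes. A minimal sketch in Python; the record format (a list of dicts with a real_problem flag) is just one assumed way of storing the tracking data above.

# Sketch: false positive rate from a month of tracked alert outcomes.
# The log format is an assumption; adapt it to however you record outcomes.
alerts = [
    {"name": "HighCPU", "real_problem": False},
    {"name": "APIDown", "real_problem": True},
    {"name": "HighCPU", "real_problem": False},
    {"name": "DiskFull", "real_problem": True},
]

false_alerts = sum(1 for a in alerts if not a["real_problem"])
false_positive_rate = false_alerts / len(alerts) * 100
print(f"False positive rate: {false_positive_rate:.1f}%")  # 50.0% for this sample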

Common Patterns

Time-based patterns:

  • Alerts during deployments (expected behavior)
  • Alerts at specific times (cron jobs, backups)
  • Alerts during maintenance windows

Source-based patterns:

  • Specific probe locations causing issues
  • Certain services more prone to false alerts
  • Network-dependent checks

Threshold patterns:

  • Alerts that clear within seconds
  • Alerts that hover at threshold
  • Alerts that fire repeatedly

Technique 1: Add Confirmation Checks

Don't alert on first failure. Require multiple consecutive failures.

How It Works

Check 1: FAIL
  → Wait, check again

Check 2: FAIL
  → Wait, check again

Check 3: FAIL
  → NOW alert (confirmed failure)
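
The logic behind confirmation checks is just a failure counter that resets on any success. A minimal sketch, assuming a plain HTTP probe; the URL, check interval, and the alert action (a print here) are placeholders.

import time
import urllib.request

CONSECUTIVE_FAILURES_REQUIRED = 3  # confirmation checks before alerting
CHECK_INTERVAL_SECONDS = 60

def check_service(url: str) -> bool:
    """Return True if the service answers with HTTP 2xx within 5 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def monitor(url: str) -> None:
    failures = 0
    while True:
        if check_service(url):
            failures = 0  # any success resets the counter
        else:
            failures += 1
            if failures == CONSECUTIVE_FAILURES_REQUIRED:
                print(f"ALERT: {url} failed {failures} checks in a row")
        time.sleep(CHECK_INTERVAL_SECONDS)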

Implementation

Most monitoring tools have this built in:

Wakestack: Configure "consecutive failures before alert"

Prometheus (alerting rule):

groups:
  - name: example
    rules:
      - alert: HighCPU
        expr: cpu_usage > 90
        for: 5m  # Must be true for 5 minutes
        labels:
          severity: warning

Datadog:

  • Set "Trigger when metric is above threshold for X minutes"

Recommended confirmation times:

Alert Type           Confirmation Time
Critical services    2-3 checks (1-3 min)
Standard services    3-5 checks (3-5 min)
Non-critical         5-10 checks (5-10 min)

Technique 2: Use Multi-Location Verification

Single-location failures are often network issues, not service issues.

The Problem

Probe in London: FAIL (network issue)
Your service: Actually fine
Result: False alert

The Solution

Require failures from multiple locations:

London: FAIL
Singapore: OK
Virginia: OK

Result: Probably a network issue near the London probe; don't alert

Implementation

Alert only when majority fail:

Locations      Failures for Alert
3 locations    2+ failures
5 locations    3+ failures
7 locations    4+ failures

Most uptime monitoring services support this natively.
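
The underlying rule is a simple majority check over per-location results, in line with the table above. A minimal sketch, assuming you already collect one pass/fail result per location:

def should_alert(results: dict[str, bool]) -> bool:
    """results maps location name -> True if the check failed there."""
    failures = sum(results.values())
    return failures > len(results) / 2  # strict majority of locations failing

print(should_alert({"london": True, "singapore": False, "virginia": False}))  # False
print(should_alert({"london": True, "singapore": True, "virginia": False}))   # True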

Technique 3: Adjust Thresholds

Thresholds set too low trigger on normal variation.

Finding the Right Threshold

  1. Collect baseline data: What's normal for this metric?
  2. Calculate percentiles: What's the 95th or 99th percentile?
  3. Set threshold above normal: Alert on genuinely abnormal values
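
Step 2 is easy to automate once you have historical samples. A rough sketch using Python's statistics module; the baseline list is placeholder data and the "headroom above p99" rule is just one reasonable choice, not a standard.

import statistics

# Placeholder baseline: CPU-usage samples collected during normal operation.
baseline_cpu = [42, 55, 61, 48, 70, 65, 58, 73, 66, 51, 68, 59]

cuts = statistics.quantiles(baseline_cpu, n=100)
p95, p99 = cuts[94], cuts[98]

# Alert only on values well above normal variation.
threshold = max(p99 + 10, 90)
print(f"p95={p95:.0f}%  p99={p99:.0f}%  suggested alert threshold={threshold:.0f}%")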

Example: CPU Alerts

Too sensitive (causes false positives):

Alert when CPU > 70%

Normal operation often hits 70%. You'll alert constantly.

Better approach:

Alert when CPU > 90% for 5+ minutes

Occasional spikes to 95% are fine. Sustained 90%+ indicates a real problem.

Metric-Specific Recommendations

Metric           Poor Threshold   Better Threshold
CPU              > 70%            > 90% for 5 min
Memory           > 80%            > 95% for 5 min
Disk             > 80%            > 90%
Response time    > 500ms          > 2000ms for 3 min
Error rate       > 0.1%           > 5% for 5 min

Technique 4: Use Hysteresis

Prevent flapping alerts when metrics hover around threshold.

The Problem

CPU: 89% → No alert
CPU: 91% → ALERT!
CPU: 89% → Resolved
CPU: 91% → ALERT!
CPU: 89% → Resolved
... (repeats endlessly)

The Solution: Different Trigger and Recovery Thresholds

Alert when: CPU > 90%
Resolve when: CPU < 80%

CPU at 89%: Still alerting (not resolved yet)
CPU at 79%: Now resolved

This prevents rapid alert/resolve cycles.
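
If your tooling only exposes a single threshold, the same two-threshold behaviour is straightforward to express in your own alerting code. A minimal sketch using the values above:

TRIGGER = 90   # start alerting above this
RECOVER = 80   # only resolve once the metric drops below this

def update(alerting: bool, value: float) -> bool:
    """Return the new alerting state after one metric sample."""
    if not alerting and value > TRIGGER:
        return True       # crossed the trigger threshold
    if alerting and value < RECOVER:
        return False      # fully recovered, safe to resolve
    return alerting       # in the 80-90 band: keep the previous state

state = False
for cpu in [89, 91, 89, 91, 79]:
    state = update(state, cpu)
    print(cpu, "alerting" if state else "ok")
# 89 ok, 91 alerting, 89 alerting, 91 alerting, 79 ok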

Implementation

# Prometheus example with hysteresis-like behavior
- alert: HighMemory
  expr: memory_usage > 90
  for: 5m
  annotations:
    summary: "Memory above 90% for 5+ minutes"

# Separate recovery notification (optional).
# Note: Prometheus evaluates rules independently, so linking this
# recovery alert to HighMemory has to happen in Alertmanager routing
# or in your notification layer.
- alert: MemoryRecovered
  expr: memory_usage < 80
  for: 5m

Technique 5: Smart Timing

Avoid alerting during expected disruptions.

Maintenance Windows

Suppress alerts during planned maintenance:

# Alertmanager silence (silences are created through the Alertmanager
# API or amtool rather than in the static config file)
matchers:
  - name: service
    value: api
startsAt: "2024-01-15T02:00:00Z"
endsAt: "2024-01-15T04:00:00Z"
comment: "Scheduled maintenance"
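
For home-grown alerting scripts, the same suppression is a date-range check before sending a notification. A minimal sketch; the window values mirror the silence above and are only an example:

from datetime import datetime, timezone

# Planned maintenance windows as (start, end) pairs in UTC.
MAINTENANCE_WINDOWS = [
    (datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc),
     datetime(2024, 1, 15, 4, 0, tzinfo=timezone.utc)),
]

def in_maintenance(now=None) -> bool:
    now = now or datetime.now(timezone.utc)
    return any(start <= now <= end for start, end in MAINTENANCE_WINDOWS)

def notify(message: str) -> None:
    if in_maintenance():
        print(f"Suppressed during maintenance: {message}")
        return
    print(f"ALERT: {message}")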

Deployment Windows

Temporarily increase tolerance during deployments:

  • Extend confirmation time
  • Increase thresholds
  • Route alerts to deployment channel, not on-call

Known Spikes

If backups cause CPU spikes every night at 3 AM:

  • Exclude that time from alerting
  • Or adjust thresholds for that period
  • Or accept and document the pattern

Technique 6: Improve Health Checks

Better health checks = fewer false positives.

Make Health Checks Representative

Bad: Check only that port 80 answers or the homepage returns 200

Such a check passes even when the database is down

Good: Check if application is functional

GET /health
- Verifies database connection
- Checks critical dependencies
- Returns 500 if anything is wrong

Avoid Slow Health Checks

Health checks should be fast:

  • Under 1 second response time
  • No expensive database queries
  • Cache dependency status if needed

Slow health checks time out and appear as failures.

Include Dependency Checks

If your service depends on a database, cache, or external API, your health check should report the status of each:

{
  "status": "healthy",
  "database": "connected",
  "cache": "connected",
  "external_api": "reachable"
}

Return unhealthy if critical dependencies are down.
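
A minimal sketch of such an endpoint using Flask; the check_database() and check_cache() helpers are hypothetical stand-ins for your real dependency checks.

from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    """Hypothetical: run a cheap query such as SELECT 1 against the primary."""
    return True

def check_cache() -> bool:
    """Hypothetical: PING the cache and expect a reply."""
    return True

@app.route("/health")
def health():
    checks = {"database": check_database(), "cache": check_cache()}
    healthy = all(checks.values())
    body = {"status": "healthy" if healthy else "unhealthy"}
    body.update({name: "connected" if ok else "down" for name, ok in checks.items()})
    return jsonify(body), (200 if healthy else 500)  # 500 marks the check as failed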

Technique 7: Categorize and Route

Not all alerts need the same treatment.

Severity Levels

Critical: Page immediately (real customer impact)
Warning: Notify during business hours
Info: Log for review, don't notify

Routing by Confidence

High confidence (multiple checks, sustained): → Page on-call
Low confidence (single check, transient): → Slack channel

Example Alertmanager Routing

route:
  receiver: 'default-slack'
  routes:
    - match:
        severity: critical
        confidence: high
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack-warnings'
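
If your alerting pipeline is custom code rather than Alertmanager, the same routing idea reduces to a small dispatch function; the channel names below are placeholders matching the config above.

def route(alert: dict) -> str:
    """Pick a notification channel for an alert, mirroring the rules above."""
    if alert.get("severity") == "critical" and alert.get("confidence") == "high":
        return "pagerduty"       # page the on-call engineer
    if alert.get("severity") == "warning":
        return "slack-warnings"  # visible, but nobody gets woken up
    return "default-slack"       # everything else lands in the team channel

print(route({"severity": "critical", "confidence": "high"}))  # pagerduty
print(route({"severity": "warning"}))                         # slack-warnings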

Technique 8: Regular Alert Review

Alerts drift over time. Review regularly.

Monthly Alert Audit

  1. List all alerts that fired
  2. Categorize each: Real issue / False positive / Unknown
  3. For false positives: Identify why and fix
  4. For never-firing alerts: Are they still relevant?
  5. Calculate false positive rate: Track over time
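
Most of this audit can be generated straight from the outcome log described in the tracking section. A rough sketch, using the same assumed record format:

from collections import Counter

# Same assumed outcome log as in the tracking section.
alerts = [
    {"name": "HighCPU", "real_problem": False},
    {"name": "HighCPU", "real_problem": False},
    {"name": "APIDown", "real_problem": True},
]

by_alert: dict[str, Counter] = {}
for a in alerts:
    counts = by_alert.setdefault(a["name"], Counter())
    counts["false" if not a["real_problem"] else "real"] += 1

for name, counts in by_alert.items():
    total = counts["real"] + counts["false"]
    fp_rate = counts["false"] / total * 100
    flag = "  <- review this alert" if fp_rate > 50 else ""
    print(f"{name}: fired {total}x, {fp_rate:.0f}% false{flag}")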

Questions to Ask

  • Did this alert require action?
  • Could we have caught this differently?
  • Is the threshold still appropriate?
  • Does this alert still matter?

Alert Retirement

Delete alerts that:

  • Haven't fired in 6+ months (might be obsolete)
  • Always require "ignore, it's fine" response
  • Duplicate other alerts
  • Monitor deprecated services

Implementation Checklist

Quick Wins (Do This Week)

  • Add confirmation checks (2-3 consecutive failures)
  • Enable multi-location verification
  • Review top 5 noisiest alerts
  • Set up alert outcome tracking

Medium Term (This Month)

  • Audit all alerts for appropriate thresholds
  • Implement maintenance window suppression
  • Improve health check endpoints
  • Create severity-based routing

Ongoing

  • Monthly alert review
  • Track false positive rate
  • Retire obsolete alerts
  • Refine thresholds based on data

Summary

Reducing false positive alerts requires systematic effort:

Immediate improvements:

  • Require confirmation (consecutive failures)
  • Use multi-location verification
  • Adjust thresholds to catch real problems, not noise

Structural improvements:

  • Add hysteresis to prevent flapping
  • Suppress during known disruptions
  • Improve health check quality

Ongoing practice:

  • Track alert outcomes
  • Review alerts monthly
  • Delete alerts that don't drive action

The goal is alerts you trust. When an alert fires, the response should be "let's fix this"—not "probably another false alarm."

Every false positive you eliminate makes real alerts more meaningful.


Frequently Asked Questions

What causes false positive alerts?

Common causes include overly sensitive thresholds, single-check failures without confirmation, network issues at probe locations, temporary spikes during normal operation, and problems in the monitoring system itself.

How many false positives are acceptable?

Industry guidance suggests keeping false positive rates under 5-10%. If more than 1 in 10 alerts is false, your team will start ignoring alerts entirely.

Should I delete alerts that have false positives?

Not necessarily. First try adjusting thresholds, adding confirmation checks, or changing timing. Only delete alerts that provide no value even when correctly configured.
