How to Reduce False Positive Alerts in Monitoring
False alerts cause alert fatigue and erode trust in monitoring. Learn practical techniques to reduce false positives and keep alerts meaningful.
Wakestack Team
Engineering Team
The False Positive Problem
Every false alert has costs:
- Interruption: Someone investigates nothing
- Erosion of trust: Team starts ignoring alerts
- Delayed response: Real incidents dismissed as "probably false"
- Burnout: On-call becomes dreaded
If your team jokes about ignoring alerts, you have a false positive problem.
Identifying False Positives
Track Alert Outcomes
For every alert, record:
- Was it a real problem? (Yes/No/Partial)
- What action was taken?
- How long to resolve?
After a month, calculate your false positive rate:
False Positive Rate = False Alerts / Total Alerts × 100
Target: Under 10%. Ideal: Under 5%.
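If you record outcomes in a structured form, the rate takes only a few lines to compute. A minimal Python sketch, assuming a hypothetical list of alert records shaped like the fields above:

```python
# Hypothetical alert log: one record per alert, with the outcome you tracked.
alerts = [
    {"name": "HighCPU", "real_problem": False, "action": "none"},
    {"name": "APIDown", "real_problem": True, "action": "restarted pod"},
    {"name": "HighCPU", "real_problem": False, "action": "none"},
]

false_alerts = sum(1 for a in alerts if not a["real_problem"])
false_positive_rate = false_alerts / len(alerts) * 100

# 66.7% in this toy example, far above the 10% target.
print(f"False positive rate: {false_positive_rate:.1f}%")
```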
Common Patterns
Time-based patterns:
- Alerts during deployments (expected behavior)
- Alerts at specific times (cron jobs, backups)
- Alerts during maintenance windows
Source-based patterns:
- Specific probe locations causing issues
- Certain services more prone to false alerts
- Network-dependent checks
Threshold patterns:
- Alerts that clear within seconds
- Alerts that hover at threshold
- Alerts that fire repeatedly
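Time-based patterns in particular are easy to surface once you log outcomes. A rough Python sketch, assuming a hypothetical history of (timestamp, was it real) pairs:

```python
from collections import Counter
from datetime import datetime

# Hypothetical history: when each alert fired and whether it was a real problem.
alert_history = [
    ("2024-01-15T03:02:00", False),
    ("2024-01-16T03:01:00", False),
    ("2024-01-16T14:30:00", True),
]

# Count false positives per hour of day; a spike at one hour usually points
# to a cron job, backup, or other scheduled work.
false_by_hour = Counter(
    datetime.fromisoformat(ts).hour for ts, was_real in alert_history if not was_real
)
for hour, count in false_by_hour.most_common():
    print(f"{count} false alert(s) around {hour:02d}:00")
```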
Technique 1: Add Confirmation Checks
Don't alert on first failure. Require multiple consecutive failures.
How It Works
Check 1: FAIL
→ Wait, check again
Check 2: FAIL
→ Wait, check again
Check 3: FAIL
→ NOW alert (confirmed failure)
Implementation
Most monitoring tools have this built in:
Wakestack: Configure "consecutive failures before alert"
Prometheus (alerting rules):
```yaml
groups:
  - name: example
    rules:
      - alert: HighCPU
        expr: cpu_usage > 90
        for: 5m  # Must be true for 5 minutes
        labels:
          severity: warning
```
Datadog:
- Set "Trigger when metric is above threshold for X minutes"
Recommended Settings
| Alert Type | Confirmation Time |
|---|---|
| Critical services | 2-3 checks (1-3 min) |
| Standard services | 3-5 checks (3-5 min) |
| Non-critical | 5-10 checks (5-10 min) |
Technique 2: Use Multi-Location Verification
Single-location failures are often network issues, not service issues.
The Problem
Probe in London: FAIL (network issue)
Your service: Actually fine
Result: False alert
The Solution
Require failures from multiple locations:
London: FAIL
Singapore: OK
Virginia: OK
Result: Probably London network issue, don't alert
Implementation
Alert only when majority fail:
| Locations | Failures for Alert |
|---|---|
| 3 locations | 2+ failures |
| 5 locations | 3+ failures |
| 7 locations | 4+ failures |
Most uptime monitoring services support this natively.
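If you aggregate probe results yourself, the voting rule is a few lines. A Python sketch with hypothetical per-location results:

```python
# Hypothetical results from one check cycle: True means the probe succeeded.
results = {"london": False, "singapore": True, "virginia": True}

failures = sum(1 for ok in results.values() if not ok)
majority = len(results) // 2 + 1  # 2 of 3, 3 of 5, 4 of 7

if failures >= majority:
    print("ALERT: majority of locations failing, likely a real outage")
else:
    print("No alert: isolated failure, probably a local network issue")
```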
Technique 3: Adjust Thresholds
Thresholds set too low trigger on normal variation.
Finding the Right Threshold
- Collect baseline data: What's normal for this metric?
- Calculate percentiles: What's the 95th or 99th percentile? (See the sketch after this list.)
- Set threshold above normal: Alert on genuinely abnormal values
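One way to get those percentiles from baseline data, sketched in Python with the standard library (the sample values are made up):

```python
import statistics

# Hypothetical baseline: response times in ms sampled over a typical week.
baseline_ms = [120, 135, 150, 140, 160, 900, 155, 145, 130, 170]

cuts = statistics.quantiles(baseline_ms, n=100)  # 99 percentile cut points
p95, p99 = cuts[94], cuts[98]
print(f"p95={p95:.0f}ms  p99={p99:.0f}ms")

# Alert well above normal variation (past p99), not near the average.
threshold_ms = round(p99 * 1.2)
```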
Example: CPU Alerts
Too sensitive (causes false positives):
Alert when CPU > 70%
Normal operation often hits 70%. You'll alert constantly.
Better approach:
Alert when CPU > 90% for 5+ minutes
Occasional spikes to 95% are fine. Sustained 90%+ indicates a real problem.
Metric-Specific Recommendations
| Metric | Poor Threshold | Better Threshold |
|---|---|---|
| CPU | > 70% | > 90% for 5 min |
| Memory | > 80% | > 95% for 5 min |
| Disk | > 80% | > 90% |
| Response time | > 500ms | > 2000ms for 3 min |
| Error rate | > 0.1% | > 5% for 5 min |
Technique 4: Use Hysteresis
Prevent flapping alerts when metrics hover around threshold.
The Problem
CPU: 89% → No alert
CPU: 91% → ALERT!
CPU: 89% → Resolved
CPU: 91% → ALERT!
CPU: 89% → Resolved
... (repeats endlessly)
The Solution: Different Trigger and Recovery Thresholds
Alert when: CPU > 90%
Resolve when: CPU < 80%
CPU at 89%: Still alerting (not resolved yet)
CPU at 79%: Now resolved
This prevents rapid alert/resolve cycles.
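In code, hysteresis is just a small piece of state plus two thresholds. A Python sketch of the idea, using the same numbers as above:

```python
class HysteresisAlert:
    """Track alert state with separate trigger and recovery thresholds."""

    def __init__(self, trigger: float = 90.0, recover: float = 80.0):
        self.trigger = trigger
        self.recover = recover
        self.firing = False

    def update(self, value: float) -> bool:
        if not self.firing and value > self.trigger:
            self.firing = True   # crossed the trigger threshold
        elif self.firing and value < self.recover:
            self.firing = False  # only resolve once comfortably below trigger
        return self.firing

alert = HysteresisAlert()
for cpu in [89, 91, 89, 91, 79]:
    print(cpu, "firing" if alert.update(cpu) else "ok")
# 89 ok, 91 firing, 89 still firing, 91 firing, 79 ok: no flapping
```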
Implementation
```yaml
# Prometheus example with hysteresis-like behavior
- alert: HighMemory
  expr: memory_usage > 90
  for: 5m
  annotations:
    summary: "Memory above 90% for 5+ minutes"

# Separate recovery notification (optional)
- alert: MemoryRecovered
  expr: memory_usage < 80
  for: 5m
  # Caution: as written, this fires whenever memory stays below 80%,
  # not only after HighMemory; use routing or inhibition rules to scope it.
```
Technique 5: Smart Timing
Avoid alerting during expected disruptions.
Maintenance Windows
Suppress alerts during planned maintenance:
```yaml
# Alertmanager silence (created at runtime via the API or amtool,
# not declared in alertmanager.yml)
silences:
  - matchers:
      - name: service
        value: api
    startsAt: "2024-01-15T02:00:00Z"
    endsAt: "2024-01-15T04:00:00Z"
    comment: "Scheduled maintenance"
```
Deployment Windows
Temporarily increase tolerance during deployments:
- Extend confirmation time
- Increase thresholds
- Route alerts to deployment channel, not on-call
Known Spikes
If backups cause CPU spikes every night at 3 AM:
- Exclude that time from alerting (see the sketch after this list)
- Or adjust thresholds for that period
- Or accept and document the pattern
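The first two options can be scripted if your notification pipeline lets you filter. A Python sketch, assuming a hypothetical nightly backup window and alert name:

```python
from datetime import datetime, time

# Hypothetical nightly backup window (03:00-03:30) that always spikes CPU.
SUPPRESS_START = time(3, 0)
SUPPRESS_END = time(3, 30)

def in_suppression_window(now: datetime) -> bool:
    return SUPPRESS_START <= now.time() <= SUPPRESS_END

def should_notify(alert_name: str, now: datetime) -> bool:
    # Document the exception here instead of letting it page someone nightly.
    if alert_name == "HighCPU" and in_suppression_window(now):
        return False
    return True

print(should_notify("HighCPU", datetime(2024, 1, 15, 3, 10)))  # False: suppressed
print(should_notify("HighCPU", datetime(2024, 1, 15, 14, 0)))  # True
```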
Technique 6: Improve Health Checks
Better health checks = fewer false positives.
Make Health Checks Representative
Bad: Check only that port 80 answers or that the homepage returns 200
The check passes even when the database is down
Good: Check if application is functional
GET /health
- Verifies database connection
- Checks critical dependencies
- Returns 500 if anything is wrong
Avoid Slow Health Checks
Health checks should be fast:
- Under 1 second response time
- No expensive database queries
- Cache dependency status if needed (see the sketch below)
Slow health checks time out and appear as failures.
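Caching is the usual way to keep the endpoint fast when a dependency check is expensive. A minimal Python sketch; `expensive_dependency_check` is a hypothetical stand-in for a real database round-trip:

```python
import time

_CACHE_TTL_SECONDS = 10
_cache = {"checked_at": 0.0, "healthy": True}

def expensive_dependency_check() -> bool:
    """Hypothetical slow check, e.g. a query against the primary database."""
    return True

def cached_dependency_status() -> bool:
    # Re-run the real check at most every _CACHE_TTL_SECONDS,
    # so /health stays fast even under frequent probing.
    if time.monotonic() - _cache["checked_at"] > _CACHE_TTL_SECONDS:
        _cache["healthy"] = expensive_dependency_check()
        _cache["checked_at"] = time.monotonic()
    return _cache["healthy"]
```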
Include Dependency Checks
If your service depends on a database, your health check should verify:
```json
{
  "status": "healthy",
  "database": "connected",
  "cache": "connected",
  "external_api": "reachable"
}
```
Return unhealthy if critical dependencies are down.
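A health endpoint that aggregates dependency checks might look like the following sketch (Flask is used for brevity; `database_connected` and `cache_connected` are hypothetical stand-ins for real checks):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def database_connected() -> bool:
    """Hypothetical check, e.g. run `SELECT 1` against the primary database."""
    return True

def cache_connected() -> bool:
    """Hypothetical check, e.g. a Redis PING."""
    return True

@app.route("/health")
def health():
    checks = {
        "database": database_connected(),
        "cache": cache_connected(),
    }
    healthy = all(checks.values())
    body = {
        "status": "healthy" if healthy else "unhealthy",
        **{name: "connected" if ok else "down" for name, ok in checks.items()},
    }
    # 200 only when every critical dependency is up; 503 otherwise,
    # so the monitor sees a failure instead of a misleading success.
    return jsonify(body), 200 if healthy else 503
```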
Technique 7: Categorize and Route
Not all alerts need the same treatment.
Severity Levels
Critical: Page immediately (real customer impact)
Warning: Notify during business hours
Info: Log for review, don't notify
Routing by Confidence
High confidence (multiple checks, sustained): → Page on-call
Low confidence (single check, transient): → Slack channel
Example Alertmanager Routing
```yaml
route:
  receiver: 'default-slack'
  routes:
    - match:
        severity: critical
        confidence: high
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack-warnings'
```
Technique 8: Regular Alert Review
Alerts drift over time. Review regularly.
Monthly Alert Audit
- List all alerts that fired
- Categorize each: Real issue / False positive / Unknown
- For false positives: Identify why and fix
- For never-firing alerts: Are they still relevant?
- Calculate false positive rate: Track over time
Questions to Ask
- Did this alert require action?
- Could we have caught this differently?
- Is the threshold still appropriate?
- Does this alert still matter?
Alert Retirement
Delete alerts that:
- Haven't fired in 6+ months (might be obsolete)
- Always require "ignore, it's fine" response
- Duplicate other alerts
- Monitor deprecated services
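Finding retirement candidates is straightforward if you can export when each alert last fired. A Python sketch with made-up data:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical map of alert name to when it last fired (None = never).
last_fired = {
    "HighCPU": datetime(2024, 1, 10, tzinfo=timezone.utc),
    "LegacyQueueDepth": datetime(2023, 5, 2, tzinfo=timezone.utc),
    "OrphanedCronCheck": None,
}

cutoff = datetime.now(timezone.utc) - timedelta(days=180)
stale = [name for name, ts in last_fired.items() if ts is None or ts < cutoff]
print("Candidates for retirement:", stale)  # review before deleting
```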
Implementation Checklist
Quick Wins (Do This Week)
- Add confirmation checks (2-3 consecutive failures)
- Enable multi-location verification
- Review top 5 noisiest alerts
- Set up alert outcome tracking
Medium Term (This Month)
- Audit all alerts for appropriate thresholds
- Implement maintenance window suppression
- Improve health check endpoints
- Create severity-based routing
Ongoing
- Monthly alert review
- Track false positive rate
- Retire obsolete alerts
- Refine thresholds based on data
Summary
Reducing false positive alerts requires systematic effort:
Immediate improvements:
- Require confirmation (consecutive failures)
- Use multi-location verification
- Adjust thresholds to catch real problems, not noise
Structural improvements:
- Add hysteresis to prevent flapping
- Suppress during known disruptions
- Improve health check quality
Ongoing practice:
- Track alert outcomes
- Review alerts monthly
- Delete alerts that don't drive action
The goal is alerts you trust. When an alert fires, the response should be "let's fix this"—not "probably another false alarm."
Every false positive you eliminate makes real alerts more meaningful.
Frequently Asked Questions
What causes false positive alerts?
Common causes include overly sensitive thresholds, single-check failures without confirmation, network issues at probe locations, temporary spikes during normal operation, and problems in the monitoring system itself.
How many false positives are acceptable?
Industry guidance suggests keeping false positive rates under 5-10%. If more than 1 in 10 alerts is false, your team will start ignoring alerts entirely.
Should I delete alerts that have false positives?
Not necessarily. First try adjusting thresholds, adding confirmation checks, or changing timing. Only delete alerts that provide no value even when correctly configured.
Related Articles
The Real Difference Between 'Monitoring' and 'Alerting'
Monitoring and alerting aren't the same thing. Understanding the difference prevents alert fatigue and improves incident response. Here's what each actually does.
What Is Alert Fatigue and How Do Teams Fix It?
Alert fatigue happens when too many alerts cause people to ignore them. Learn the causes, warning signs, and proven strategies to eliminate alert fatigue in your team.
What Is a False Positive Alert?
A false positive alert is an alert that fires when there's no real problem. Learn why false positives happen, how they damage your incident response, and how to eliminate them.