
What Is Mean Time to Detect (MTTD)?

Mean Time to Detect (MTTD) measures how long it takes to discover a problem after it starts. Learn how to calculate MTTD, why it matters, and how to improve it.

Wakestack Team

Engineering Team

5 min read

What Is MTTD?

Mean Time to Detect (MTTD) is the average amount of time it takes to discover that a problem exists after it actually starts.

Time to detect (one incident) = Time problem detected - Time problem started

If your database starts failing at 2:00 PM and your monitoring alerts you at 2:08 PM, the time to detect was 8 minutes.

MTTD measures the gap between reality and awareness.

Why MTTD Matters

Every minute a problem goes undetected is a minute where:

  • Users are affected
  • Data could be corrupted
  • The problem could get worse
  • Trust erodes

The Cost of Late Detection

Consider a payment processing outage that fails roughly 50 transactions per minute:

  • Detected in 2 minutes: ~100 failed transactions
  • Detected in 15 minutes: ~750 failed transactions
  • Detected in 60 minutes: ~3,000 failed transactions

The problem is the same. The detection time determines the impact.

MTTD Affects MTTR

You can't fix what you don't know about. A long MTTD directly increases your total incident duration:

Total Incident Time = MTTD + Time to Respond + Time to Resolve

Reducing MTTD is often the fastest way to reduce overall incident impact.

How to Calculate MTTD

Basic Formula

MTTD = Sum of all detection times / Number of incidents

Example Calculation

Incident   Problem Started   Detected   Detection Time
#1         09:00             09:05      5 min
#2         11:30             11:38      8 min
#3         14:00             14:03      3 min
#4         16:45             17:00      15 min
#5         20:00             20:12      12 min

MTTD = (5 + 8 + 3 + 15 + 12) / 5 = 8.6 minutes
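The same calculation can be sketched in Python. The incident times are the ones from the example above; the function names are illustrative, and the sketch assumes each incident starts and is detected on the same day.

```python
from datetime import datetime

# (problem started, detected) pairs from the example, as HH:MM on one day.
INCIDENTS = [
    ("09:00", "09:05"),
    ("11:30", "11:38"),
    ("14:00", "14:03"),
    ("16:45", "17:00"),
    ("20:00", "20:12"),
]


def detection_minutes(started: str, detected: str) -> float:
    """Minutes between problem start and detection for one incident."""
    fmt = "%H:%M"
    delta = datetime.strptime(detected, fmt) - datetime.strptime(started, fmt)
    return delta.total_seconds() / 60


def mean_time_to_detect(incidents) -> float:
    """Average detection time in minutes across all incidents."""
    times = [detection_minutes(start, detect) for start, detect in incidents]
    return sum(times) / len(times)


print(mean_time_to_detect(INCIDENTS))  # 8.6
```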

The Challenge: Knowing When Problems Started

The tricky part is determining when a problem actually started. You need:

  • Timestamps in your monitoring data
  • Logs with accurate timestamps
  • Correlation between symptoms and root cause

Sometimes you discover that an incident started hours before it was detected. That gap is valuable information for improving MTTD.

What Affects MTTD?

Monitoring Coverage

Problems in unmonitored areas take longer to detect (if they're detected at all).

Improve by: Adding monitors for all critical paths.

Check Frequency

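If you check every 5 minutes, a failure that starts just after a check can go unnoticed for up to 5 minutes. On average the delay is about half the interval, before any alert confirmation or notification time is added.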

Improve by: Increasing check frequency for critical services.
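A rough sketch of how the check interval bounds detection delay. It assumes the failure starts at a uniformly random moment between checks, every check after that fails, and an alert fires after a fixed number of consecutive failed checks. The function and parameter names are illustrative, and check duration and notification latency are ignored.

```python
def detection_delay_seconds(interval_s: float, failures_to_alert: int = 1):
    """Return (expected, worst_case) delay between failure start and alert.

    The first failed check lands between 0 and interval_s after the failure
    starts (interval_s / 2 on average). Each additional failed check that is
    required before alerting adds one full interval.
    """
    expected = (failures_to_alert - 0.5) * interval_s
    worst_case = failures_to_alert * interval_s
    return expected, worst_case


print(detection_delay_seconds(300))                      # (150.0, 300)
print(detection_delay_seconds(30))                       # (15.0, 30)
print(detection_delay_seconds(30, failures_to_alert=2))  # (45.0, 60)
```

Requiring several failed checks before alerting reduces false alarms, but each extra confirmation adds a full interval to detection time. That cost is much smaller when the interval itself is short.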

Alert Thresholds

Thresholds set too high miss real problems. Thresholds set too low fire on normal variation and create noise.

Improve by: Tuning thresholds based on real baselines.
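One possible way to derive a threshold from a baseline is to take a high percentile of recent healthy measurements and add some headroom. This is a minimal sketch, not a recommendation: the function name, the 99th percentile, and the 20% headroom are all assumptions chosen for illustration.

```python
import statistics


def threshold_from_baseline(samples, percentile=99, headroom=1.2):
    """Alert threshold = chosen percentile of baseline samples, plus headroom.

    samples: measurements taken while the service was healthy
             (for example, response times in milliseconds).
    """
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cut_points = statistics.quantiles(samples, n=100)
    return cut_points[percentile - 1] * headroom


# Hypothetical baseline: response times in ms collected during normal operation.
baseline_ms = [118, 120, 122, 125, 119, 121, 130, 124, 117, 123]
print(threshold_from_baseline(baseline_ms))
```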

Alert Routing

If alerts go to an unmonitored channel, detection is delayed.

Improve by: Routing alerts to actively monitored channels with escalation.

On-Call Response

If nobody's watching, nobody detects.

Improve by: Clear on-call schedules and acknowledgment requirements.

How to Reduce MTTD

1. Monitor the Right Things

Focus on user-facing symptoms first:

  • Can users reach the service?
  • Are requests succeeding?
  • Is latency acceptable?

These catch problems regardless of root cause.
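The three questions above map onto a simple symptom check. The sketch below uses only the Python standard library; the function names, the 1-second latency limit, and the example URL are hypothetical.

```python
import time
import urllib.error
import urllib.request


def classify(status, latency_s, max_latency_s=1.0):
    """Turn one probe result into a user-facing symptom."""
    if status is None:
        return "unreachable"   # Can users reach the service?
    if status >= 400:
        return "failing"       # Are requests succeeding?
    if latency_s > max_latency_s:
        return "slow"          # Is latency acceptable?
    return "ok"


def probe(url, timeout_s=5.0):
    """Make one request and classify what a user would have experienced."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code      # server answered with an error status
    except OSError:
        status = None          # DNS failure, refused connection, timeout
    return classify(status, time.monotonic() - start)


# Example call (hypothetical endpoint):
# print(probe("https://example.com/health"))
```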

2. Increase Check Frequency

For critical services:

  • Check every 30 seconds instead of 5 minutes
  • Use multiple check locations for redundancy
  • Consider synthetic transactions for end-to-end coverage

3. Use Anomaly Detection

Static thresholds miss some problems. Anomaly detection catches:

  • Unusual patterns
  • Gradual degradation
  • Problems you didn't anticipate
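As one simple illustration of the idea, a rolling z-score flags values that sit far from the recent average. This is a minimal sketch under assumed settings (window size, minimum of 5 samples, z-score limit of 3), not a description of any particular product's anomaly detection. A plain rolling window like this is best at catching sudden changes; slow drift usually needs a longer or fixed baseline.

```python
from collections import deque
import statistics


class RollingAnomalyDetector:
    def __init__(self, window=30, z_limit=3.0):
        self.values = deque(maxlen=window)  # most recent observations
        self.z_limit = z_limit

    def observe(self, value) -> bool:
        """Return True if value deviates strongly from the recent window."""
        anomalous = False
        if len(self.values) >= 5:  # wait for a few samples before judging
            mean = statistics.fmean(self.values)
            stdev = statistics.pstdev(self.values)
            if stdev == 0:
                anomalous = value != mean
            else:
                anomalous = abs(value - mean) / stdev > self.z_limit
        # Note: flagged values are still added to the window in this sketch.
        self.values.append(value)
        return anomalous


detector = RollingAnomalyDetector()
for latency_ms in [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 180]:
    if detector.observe(latency_ms):
        print("anomaly:", latency_ms)  # anomaly: 180
```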

4. Implement Proactive Alerts

Don't wait for failure. Alert on warning signs:

  • Disk filling up (before it's full)
  • Memory pressure (before OOM)
  • Error rate increasing (before outage)
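For the disk example, a warning can be based on projected time until full rather than current usage. The sketch below uses straight-line extrapolation between the first and last sample, which assumes roughly steady growth. The function name, sample data, and 48-hour warning window are made up for illustration.

```python
def hours_until_full(samples, capacity_gb):
    """Estimate hours until the disk is full.

    samples: list of (hour, used_gb) tuples ordered by time.
    Returns None if usage is flat or shrinking.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    growth_per_hour = (used1 - used0) / (t1 - t0)
    if growth_per_hour <= 0:
        return None
    return (capacity_gb - used1) / growth_per_hour


# Hypothetical readings: 400 GB used at hour 0, 448 GB at hour 24, 500 GB disk.
remaining = hours_until_full([(0, 400.0), (24, 448.0)], capacity_gb=500.0)
print(remaining)  # 26.0
if remaining is not None and remaining < 48:
    print("warning: disk projected to fill within 48 hours")
```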

5. Reduce Alert Noise

Alert fatigue increases MTTD because:

  • Real alerts get lost in noise
  • People stop paying attention
  • Investigation is slower

Fewer, higher-quality alerts improve detection speed.
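One common noise-reduction technique is suppressing repeats of the same alert within a time window. A minimal sketch, assuming each alert carries a timestamp and a key that identifies what is failing (both names and the 5-minute window are illustrative):

```python
def deduplicate(alerts, window_s=300):
    """Keep the first alert per key and drop repeats inside the window.

    alerts: iterable of (timestamp_s, key) tuples, sorted by timestamp.
    """
    last_sent = {}
    kept = []
    for timestamp_s, key in alerts:
        if key not in last_sent or timestamp_s - last_sent[key] >= window_s:
            kept.append((timestamp_s, key))
            last_sent[key] = timestamp_s
    return kept


raw = [(0, "db"), (60, "db"), (120, "api"), (400, "db")]
print(deduplicate(raw))  # [(0, 'db'), (120, 'api'), (400, 'db')]
```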

6. Fix Alert Routing

Ensure alerts reach someone who will act:

  • Route to the right team
  • Use escalation policies
  • Require acknowledgment
  • Monitor alert response times
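An escalation policy can be expressed as a list of "notify this target after N minutes without acknowledgment" steps. The targets and delays below are hypothetical, and real paging tools have their own configuration formats.

```python
# (minutes without acknowledgment, who to notify) - hypothetical policy
ESCALATION_POLICY = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (15, "engineering manager"),
]


def who_to_notify(minutes_unacknowledged, policy=ESCALATION_POLICY):
    """Everyone who should have been notified by now for an unacknowledged alert."""
    return [target for after, target in policy if minutes_unacknowledged >= after]


print(who_to_notify(0))  # ['primary on-call']
print(who_to_notify(7))  # ['primary on-call', 'secondary on-call']
```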

MTTD Benchmarks

Typical MTTD values vary by organisation and service criticality:

Category                   Typical MTTD   Good MTTD
Critical customer-facing   5-15 min       < 5 min
Important internal         15-30 min      < 15 min
Non-critical services      30-60 min      < 30 min
Batch jobs                 Hours          < 1 hour

These are guidelines. Your targets should be based on:

  • Business impact of delays
  • SLA requirements
  • Cost of faster detection

MTTD vs Other Metrics

MTTD vs MTTR

  • MTTD: Time to detect the problem
  • MTTR (Mean Time to Resolve): Time to resolve the problem, often measured from when the problem started and therefore including detection time

Both matter. MTTD is often overlooked but directly impacts MTTR.

MTTD vs MTTA

  • MTTD: Time until the problem is known
  • MTTA (Mean Time to Acknowledge): Time from the alert firing until someone acknowledges it and starts working on it

MTTD comes first. You can't acknowledge what you haven't detected.

Tracking MTTD

What to Record

For each incident, capture:

  • When the problem actually started (from logs/metrics)
  • When the alert fired
  • When someone acknowledged
  • Root cause of any detection delay
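These fields fit naturally into a small record type. The sketch below is one possible shape: the class and field names are illustrative and the timestamps are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentRecord:
    started_at: datetime        # when the problem actually started
    alert_fired_at: datetime    # when the alert fired
    acknowledged_at: datetime   # when someone acknowledged
    detection_delay_cause: str = ""

    @property
    def minutes_to_detect(self) -> float:
        return (self.alert_fired_at - self.started_at).total_seconds() / 60

    @property
    def minutes_to_acknowledge(self) -> float:
        return (self.acknowledged_at - self.alert_fired_at).total_seconds() / 60


record = IncidentRecord(
    started_at=datetime(2024, 1, 15, 14, 0),
    alert_fired_at=datetime(2024, 1, 15, 14, 8),
    acknowledged_at=datetime(2024, 1, 15, 14, 11),
    detection_delay_cause="check interval was 5 minutes",
)
print(record.minutes_to_detect)       # 8.0
print(record.minutes_to_acknowledge)  # 3.0
```

Averaging minutes_to_detect across records gives MTTD, and averaging minutes_to_acknowledge gives MTTA.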

Review Regularly

After incidents, ask:

  • Why did detection take this long?
  • What would have caught it faster?
  • Are there similar unmonitored risks?

Trend Over Time

Track MTTD monthly. It should trend down as you:

  • Add monitoring coverage
  • Tune alerting
  • Improve incident response
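A monthly trend only needs incidents grouped by month and averaged. A minimal sketch with made-up data, assuming each incident is stored as a (month, detection minutes) pair:

```python
from collections import defaultdict


def monthly_mttd(incidents):
    """Average detection time per month.

    incidents: iterable of (month, detection_minutes), month as "YYYY-MM".
    Returns a dict ordered by month.
    """
    by_month = defaultdict(list)
    for month, minutes in incidents:
        by_month[month].append(minutes)
    return {month: sum(v) / len(v) for month, v in sorted(by_month.items())}


# Hypothetical incident history.
history = [("2024-01", 12), ("2024-01", 8), ("2024-02", 6), ("2024-03", 5), ("2024-03", 3)]
print(monthly_mttd(history))  # {'2024-01': 10.0, '2024-02': 6.0, '2024-03': 4.0}
```

With only a handful of incidents per month, a single slow detection can dominate the average, so it helps to review the individual incidents alongside the trend.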

Summary

MTTD (Mean Time to Detect) measures how quickly you discover problems. It's calculated by averaging the time between problem start and detection across incidents.

Lower MTTD means:

  • Less user impact
  • Faster resolution
  • Better reliability

Improve MTTD by:

  • Monitoring the right things
  • Increasing check frequency
  • Reducing alert noise
  • Fixing alert routing
  • Using proactive alerting

Detection is the first step in incident response. The faster you detect, the faster you can respond and resolve.


Frequently Asked Questions

What is MTTD?

MTTD (Mean Time to Detect) is the average time between when a problem starts and when your team becomes aware of it. Lower MTTD means faster detection.

How do you calculate MTTD?

MTTD = Total detection time for all incidents / Number of incidents. For example, if 5 incidents took 10, 5, 15, 8, and 12 minutes to detect, MTTD = 50/5 = 10 minutes.

What's a good MTTD?

It depends on your service criticality. For critical services, aim for under 5 minutes. For less critical services, under 15 minutes is reasonable. The key is continuous improvement.

How do you reduce MTTD?

Improve monitoring coverage, reduce check intervals, add proactive alerting, use anomaly detection, and ensure alerts route to the right people immediately.
