What Is Mean Time to Detect (MTTD)?
Mean Time to Detect (MTTD) measures how long it takes to discover a problem after it starts. Learn how to calculate MTTD, why it matters, and how to improve it.
Wakestack Team
Engineering Team
What Is MTTD?
Mean Time to Detect (MTTD) is the average amount of time it takes to discover that a problem exists after it actually starts.
MTTD = Time problem detected - Time problem started
If your database starts failing at 2:00 PM and your monitoring alerts you at 2:08 PM, the time to detect was 8 minutes.
MTTD measures the gap between reality and awareness.
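A minimal sketch of that subtraction in code, using the hypothetical timestamps from the example above:

```python
from datetime import datetime

# Hypothetical timestamps from the example: the database starts failing
# at 2:00 PM and monitoring alerts at 2:08 PM.
problem_started = datetime(2025, 6, 1, 14, 0)
problem_detected = datetime(2025, 6, 1, 14, 8)

detection_time = problem_detected - problem_started
print(f"Time to detect: {detection_time.total_seconds() / 60:.0f} minutes")
# Time to detect: 8 minutes
```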
Why MTTD Matters
Every minute a problem goes undetected is a minute where:
- Users are affected
- Data could be corrupted
- The problem could get worse
- Trust erodes
The Cost of Late Detection
Consider a payment processing outage failing roughly 50 transactions per minute:
- Detected in 2 minutes: ~100 failed transactions
- Detected in 15 minutes: ~750 failed transactions
- Detected in 60 minutes: ~3,000 failed transactions
The problem is the same. The detection time determines the impact.
MTTD Affects MTTR
You can't fix what you don't know about. A long MTTD directly increases your total incident duration:
Total Incident Time = MTTD + Time to Respond + Time to Resolve
Reducing MTTD is often the fastest way to reduce overall incident impact.
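For example, an incident with an 8-minute MTTD, a 10-minute response, and a 40-minute fix lasts 58 minutes end to end; cutting detection from 8 minutes to 2 removes 6 minutes from every such incident without changing the fix at all.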
How to Calculate MTTD
Basic Formula
MTTD = Sum of all detection times / Number of incidents
Example Calculation
| Incident | Problem Started | Detected | Detection Time |
|---|---|---|---|
| #1 | 09:00 | 09:05 | 5 min |
| #2 | 11:30 | 11:38 | 8 min |
| #3 | 14:00 | 14:03 | 3 min |
| #4 | 16:45 | 17:00 | 15 min |
| #5 | 20:00 | 20:12 | 12 min |
MTTD = (5 + 8 + 3 + 15 + 12) / 5 = 8.6 minutes
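A short sketch of the same calculation, using the start/detect pairs from the table:

```python
from datetime import datetime

# (problem started, detected) pairs from the table above
incidents = [
    ("09:00", "09:05"),
    ("11:30", "11:38"),
    ("14:00", "14:03"),
    ("16:45", "17:00"),
    ("20:00", "20:12"),
]

def minutes_between(start: str, end: str) -> float:
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60

detection_times = [minutes_between(s, d) for s, d in incidents]
mttd = sum(detection_times) / len(detection_times)
print(f"MTTD = {mttd:.1f} minutes")  # MTTD = 8.6 minutes
```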
The Challenge: Knowing When Problems Started
The tricky part is determining when a problem actually started. You need:
- Timestamps in your monitoring data
- Logs with accurate timestamps
- Correlation between symptoms and root cause
Sometimes you discover an incident started hours before detection—that's valuable information for improving MTTD.
What Affects MTTD?
Monitoring Coverage
Problems in unmonitored areas take longer to detect (if they're detected at all).
Improve by: Adding monitors for all critical paths.
Check Frequency
If you check every 5 minutes, a failure can go unnoticed for up to 5 minutes before the next check even runs.
Improve by: Increasing check frequency for critical services.
Alert Thresholds
Thresholds set too high miss problems. Too low creates noise.
Improve by: Tuning thresholds based on real baselines.
Alert Routing
If alerts go to an unmonitored channel, detection is delayed.
Improve by: Routing alerts to actively monitored channels with escalation.
On-Call Response
If nobody's watching, nobody detects.
Improve by: Clear on-call schedules and acknowledgment requirements.
How to Reduce MTTD
1. Monitor the Right Things
Focus on user-facing symptoms first:
- Can users reach the service?
- Are requests succeeding?
- Is latency acceptable?
These catch problems regardless of root cause.
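Here's a minimal sketch of such a user-facing check; the endpoint URL and the 2-second latency budget are stand-ins, not recommendations:

```python
import time
import urllib.error
import urllib.request

URL = "https://example.com/health"   # hypothetical endpoint
LATENCY_BUDGET_SECONDS = 2.0         # hypothetical latency threshold

def check_service(url: str) -> dict:
    """Answer the three user-facing questions: reachable? succeeding? fast enough?"""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10):
            latency = time.monotonic() - start
            return {
                "reachable": True,
                "succeeding": True,
                "latency_ok": latency < LATENCY_BUDGET_SECONDS,
            }
    except urllib.error.HTTPError:
        # The server answered, but with an error status (4xx/5xx).
        return {"reachable": True, "succeeding": False, "latency_ok": False}
    except (urllib.error.URLError, TimeoutError):
        # No answer at all: DNS failure, connection refused, timeout, etc.
        return {"reachable": False, "succeeding": False, "latency_ok": False}

print(check_service(URL))
```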
2. Increase Check Frequency
For critical services, as sketched after this list:
- Check every 30 seconds instead of 5 minutes
- Use multiple check locations for redundancy
- Consider synthetic transactions for end-to-end coverage
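Continuing the hedged sketch from the previous step, a tighter interval is essentially a loop; in practice this is your monitoring tool's schedule, not a hand-rolled script:

```python
import time

CHECK_INTERVAL_SECONDS = 30  # instead of 300

while True:
    result = check_service(URL)   # reuses check_service and URL from the sketch above
    if not all(result.values()):
        print("ALERT:", result)   # stand-in for real alert routing
    time.sleep(CHECK_INTERVAL_SECONDS)
```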
3. Use Anomaly Detection
Static thresholds miss some problems. Anomaly detection (sketched after this list) catches:
- Unusual patterns
- Gradual degradation
- Problems you didn't anticipate
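One illustrative approach among many is a rolling z-score, which flags values that drift far from their recent baseline; the window size and threshold here are stand-ins you'd tune:

```python
from collections import deque
from statistics import mean, stdev

def make_anomaly_detector(window: int = 60, threshold: float = 3.0):
    """Flag a value more than `threshold` standard deviations
    from the mean of the last `window` observations."""
    history = deque(maxlen=window)

    def is_anomalous(value: float) -> bool:
        anomalous = False
        if len(history) >= 10:  # need some baseline before judging
            baseline, spread = mean(history), stdev(history)
            if spread > 0 and abs(value - baseline) / spread > threshold:
                anomalous = True
        history.append(value)
        return anomalous

    return is_anomalous

check = make_anomaly_detector()
for latency_ms in [100, 102, 98, 101, 99, 103, 97, 100, 102, 99, 450]:
    if check(latency_ms):
        print(f"Anomaly: {latency_ms} ms")  # fires on the jump to 450
```

Because the baseline is learned from recent data, this catches gradual drift and novel failure modes that a fixed threshold would miss.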
4. Implement Proactive Alerts
Don't wait for failure. Alert on warning signs (a small sketch follows the list):
- Disk filling up (before it's full)
- Memory pressure (before OOM)
- Error rate increasing (before outage)
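A tiny sketch of the disk example; the 80% threshold is a stand-in you'd tune to your own fill rate:

```python
import shutil

WARN_AT_PERCENT = 80  # hypothetical threshold: warn well before the disk is full

usage = shutil.disk_usage("/")
percent_used = usage.used / usage.total * 100

if percent_used >= WARN_AT_PERCENT:
    print(f"WARNING: disk {percent_used:.0f}% full, act before it fills up")
```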
5. Reduce Alert Noise
Alert fatigue increases MTTD because:
- Real alerts get lost in noise
- People stop paying attention
- Investigation is slower
Fewer, higher-quality alerts improve detection speed.
6. Fix Alert Routing
Ensure alerts reach someone who will act (see the escalation sketch below):
- Route to the right team
- Use escalation policies
- Require acknowledgment
- Monitor alert response times
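As an illustrative sketch, not any particular tool's API, an escalation policy boils down to "if nobody acknowledges within N minutes, page the next tier":

```python
import time

# Hypothetical escalation tiers: (who to notify, minutes to wait for an ack)
ESCALATION_POLICY = [
    ("primary on-call", 5),
    ("secondary on-call", 5),
    ("engineering manager", 10),
]

def page(who: str, alert: str) -> None:
    print(f"Paging {who}: {alert}")  # stand-in for a real notification

def acknowledged() -> bool:
    return False  # stand-in: query your alerting tool for an acknowledgment

def escalate(alert: str) -> None:
    for who, wait_minutes in ESCALATION_POLICY:
        page(who, alert)
        time.sleep(wait_minutes * 60)  # in reality: schedule a timer, don't block
        if acknowledged():
            return
    print(f"UNACKNOWLEDGED after all tiers: {alert}")

# escalate("checkout error rate above threshold")
```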
MTTD Benchmarks
Typical MTTD values vary by organisation and service criticality:
| Category | Typical MTTD | Good MTTD |
|---|---|---|
| Critical customer-facing | 5-15 min | < 5 min |
| Important internal | 15-30 min | < 15 min |
| Non-critical services | 30-60 min | < 30 min |
| Batch jobs | Hours | < 1 hour |
These are guidelines. Your targets should be based on:
- Business impact of delays
- SLA requirements
- Cost of faster detection
MTTD vs Other Metrics
MTTD vs MTTR
- MTTD: Time to detect the problem
- MTTR: Time to resolve the problem (often includes detection)
Both matter. MTTD is often overlooked but directly impacts MTTR.
MTTD vs MTTA
- MTTD: Time until the problem is known
- MTTA (Mean Time to Acknowledge): Time until someone starts working on it
MTTD comes first. You can't acknowledge what you haven't detected.
Tracking MTTD
What to Record
For each incident, capture (one possible record shape is sketched below):
- When the problem actually started (from logs/metrics)
- When the alert fired
- When someone acknowledged
- Root cause of any detection delay
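One simple shape for those records; the field names are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class IncidentRecord:
    """One row of detection data per incident."""
    problem_started: datetime    # reconstructed from logs/metrics
    alert_fired: datetime        # when monitoring raised the alert
    acknowledged: datetime       # when a human picked it up
    detection_delay_cause: str   # e.g. "gap in monitoring coverage"

    @property
    def detection_minutes(self) -> float:
        return (self.alert_fired - self.problem_started).total_seconds() / 60

incident = IncidentRecord(
    problem_started=datetime(2025, 6, 1, 9, 0),
    alert_fired=datetime(2025, 6, 1, 9, 5),
    acknowledged=datetime(2025, 6, 1, 9, 7),
    detection_delay_cause="none: caught by synthetic check",
)
print(f"{incident.detection_minutes:.0f} min to detect")  # 5 min to detect
```

Averaging `detection_minutes` across these records over a month gives the trend line discussed below.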
Review Regularly
After incidents, ask:
- Why did detection take this long?
- What would have caught it faster?
- Are there similar unmonitored risks?
Trend Over Time
Track MTTD monthly. It should trend down as you:
- Add monitoring coverage
- Tune alerting
- Improve incident response
Summary
MTTD (Mean Time to Detect) measures how quickly you discover problems. It's calculated by averaging the time between problem start and detection across incidents.
Lower MTTD means:
- Less user impact
- Faster resolution
- Better reliability
Improve MTTD by:
- Monitoring the right things
- Increasing check frequency
- Reducing alert noise
- Fixing alert routing
- Using proactive alerting
Detection is the first step in incident response. The faster you detect, the faster you can respond and resolve.
Frequently Asked Questions
What is MTTD?
MTTD (Mean Time to Detect) is the average time between when a problem starts and when your team becomes aware of it. Lower MTTD means faster detection.
How do you calculate MTTD?
MTTD = Total detection time for all incidents / Number of incidents. For example, if 5 incidents took 10, 5, 15, 8, and 12 minutes to detect, MTTD = 50/5 = 10 minutes.
What's a good MTTD?
It depends on your service criticality. For critical services, aim for under 5 minutes. For less critical services, under 15 minutes is reasonable. The key is continuous improvement.
How do you reduce MTTD?
Improve monitoring coverage, reduce check intervals, add proactive alerting, use anomaly detection, and ensure alerts route to the right people immediately.