What Is Mean Time to Detect (MTTD)?
Mean Time to Detect (MTTD) measures how long it takes to discover a problem after it starts. Learn how to calculate MTTD, why it matters, and how to improve it.
Wakestack Team
Engineering Team
What Is MTTD?
Mean Time to Detect (MTTD) is the average amount of time it takes to discover that a problem exists after it actually starts.
MTTD = Time problem detected - Time problem started
If your database starts failing at 2:00 PM and your monitoring alerts you at 2:08 PM, the time to detect was 8 minutes.
MTTD measures the gap between reality and awareness.
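A minimal sketch of that subtraction in code, using the hypothetical timestamps from the example above:

```python
from datetime import datetime

# Hypothetical timestamps from the example: the database starts failing
# at 2:00 PM and monitoring alerts at 2:08 PM.
problem_started = datetime(2025, 6, 1, 14, 0)
problem_detected = datetime(2025, 6, 1, 14, 8)

detection_time = problem_detected - problem_started
print(f"Time to detect: {detection_time.total_seconds() / 60:.0f} minutes")
# Time to detect: 8 minutes
```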
Why MTTD Matters
Every minute a problem goes undetected is a minute where:
- Users are affected
- Data could be corrupted
- The problem could get worse
- Trust erodes
The Cost of Late Detection
Consider a payment processing outage failing roughly 50 transactions per minute:
- Detected in 2 minutes: ~100 failed transactions
- Detected in 15 minutes: ~750 failed transactions
- Detected in 60 minutes: ~3,000 failed transactions
The problem is the same. The detection time determines the impact.
MTTD Affects MTTR
You can't fix what you don't know about. A long MTTD directly increases your total incident duration:
Total Incident Time = MTTD + Time to Respond + Time to Resolve
Reducing MTTD is often the fastest way to reduce overall incident impact.
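For example, an incident with an 8-minute MTTD, a 10-minute response, and a 40-minute fix lasts 58 minutes end to end; cutting detection from 8 minutes to 2 removes 6 minutes from every such incident without changing the fix at all.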
How to Calculate MTTD
Basic Formula
MTTD = Sum of all detection times / Number of incidents
Example Calculation
| Incident | Problem Started | Detected | Detection Time |
|---|---|---|---|
| #1 | 09:00 | 09:05 | 5 min |
| #2 | 11:30 | 11:38 | 8 min |
| #3 | 14:00 | 14:03 | 3 min |
| #4 | 16:45 | 17:00 | 15 min |
| #5 | 20:00 | 20:12 | 12 min |
MTTD = (5 + 8 + 3 + 15 + 12) / 5 = 8.6 minutes
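A short sketch of the same calculation, using the start/detect pairs from the table:

```python
from datetime import datetime

# (problem started, detected) pairs from the table above
incidents = [
    ("09:00", "09:05"),
    ("11:30", "11:38"),
    ("14:00", "14:03"),
    ("16:45", "17:00"),
    ("20:00", "20:12"),
]

def minutes_between(start: str, end: str) -> float:
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60

detection_times = [minutes_between(s, d) for s, d in incidents]
mttd = sum(detection_times) / len(detection_times)
print(f"MTTD = {mttd:.1f} minutes")  # MTTD = 8.6 minutes
```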
The Challenge: Knowing When Problems Started
The tricky part is determining when a problem actually started. You need:
- Timestamps in your monitoring data
- Logs with accurate timestamps
- Correlation between symptoms and root cause
Sometimes you discover an incident started hours before detection—that's valuable information for improving MTTD.
What Affects MTTD?
Monitoring Coverage
Problems in unmonitored areas take longer to detect (if they're detected at all).
Improve by: Adding monitors for all critical paths.
Check Frequency
If you check every 5 minutes, a failure can go unnoticed for up to 5 minutes before the next check even runs.
Improve by: Increasing check frequency for critical services.
Alert Thresholds
Thresholds set too high miss problems. Too low creates noise.
Improve by: Tuning thresholds based on real baselines.
Alert Routing
If alerts go to an unmonitored channel, detection is delayed.
Improve by: Routing alerts to actively monitored channels with escalation.
On-Call Response
If nobody's watching, nobody detects.
Improve by: Clear on-call schedules and acknowledgment requirements.
How to Reduce MTTD
1. Monitor the Right Things
Focus on user-facing symptoms first:
- Can users reach the service?
- Are requests succeeding?
- Is latency acceptable?
These catch problems regardless of root cause.
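Here's a minimal sketch of such a user-facing check; the endpoint URL and the 2-second latency budget are stand-ins, not recommendations:

```python
import time
import urllib.error
import urllib.request

URL = "https://example.com/health"   # hypothetical endpoint
LATENCY_BUDGET_SECONDS = 2.0         # hypothetical latency threshold

def check_service(url: str) -> dict:
    """Answer the three user-facing questions: reachable? succeeding? fast enough?"""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10):
            latency = time.monotonic() - start
            return {
                "reachable": True,
                "succeeding": True,
                "latency_ok": latency < LATENCY_BUDGET_SECONDS,
            }
    except urllib.error.HTTPError:
        # The server answered, but with an error status (4xx/5xx).
        return {"reachable": True, "succeeding": False, "latency_ok": False}
    except (urllib.error.URLError, TimeoutError):
        # No answer at all: DNS failure, connection refused, timeout, etc.
        return {"reachable": False, "succeeding": False, "latency_ok": False}

print(check_service(URL))
```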
2. Increase Check Frequency
For critical services, as sketched after this list:
- Check every 30 seconds instead of 5 minutes
- Use multiple check locations for redundancy
- Consider synthetic transactions for end-to-end coverage
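Continuing the hedged sketch from the previous step, a tighter interval is essentially a loop; in practice this is your monitoring tool's schedule, not a hand-rolled script:

```python
import time

CHECK_INTERVAL_SECONDS = 30  # instead of 300

while True:
    result = check_service(URL)   # reuses check_service and URL from the sketch above
    if not all(result.values()):
        print("ALERT:", result)   # stand-in for real alert routing
    time.sleep(CHECK_INTERVAL_SECONDS)
```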
3. Use Anomaly Detection
Static thresholds miss some problems. Anomaly detection (sketched after this list) catches:
- Unusual patterns
- Gradual degradation
- Problems you didn't anticipate
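One illustrative approach among many is a rolling z-score, which flags values that drift far from their recent baseline; the window size and threshold here are stand-ins you'd tune:

```python
from collections import deque
from statistics import mean, stdev

def make_anomaly_detector(window: int = 60, threshold: float = 3.0):
    """Flag a value more than `threshold` standard deviations
    from the mean of the last `window` observations."""
    history = deque(maxlen=window)

    def is_anomalous(value: float) -> bool:
        anomalous = False
        if len(history) >= 10:  # need some baseline before judging
            baseline, spread = mean(history), stdev(history)
            if spread > 0 and abs(value - baseline) / spread > threshold:
                anomalous = True
        history.append(value)
        return anomalous

    return is_anomalous

check = make_anomaly_detector()
for latency_ms in [100, 102, 98, 101, 99, 103, 97, 100, 102, 99, 450]:
    if check(latency_ms):
        print(f"Anomaly: {latency_ms} ms")  # fires on the jump to 450
```

Because the baseline is learned from recent data, this catches gradual drift and novel failure modes that a fixed threshold would miss.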
4. Implement Proactive Alerts
Don't wait for failure. Alert on warning signs (a small sketch follows the list):
- Disk filling up (before it's full)
- Memory pressure (before OOM)
- Error rate increasing (before outage)
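A tiny sketch of the disk example; the 80% threshold is a stand-in you'd tune to your own fill rate:

```python
import shutil

WARN_AT_PERCENT = 80  # hypothetical threshold: warn well before the disk is full

usage = shutil.disk_usage("/")
percent_used = usage.used / usage.total * 100

if percent_used >= WARN_AT_PERCENT:
    print(f"WARNING: disk {percent_used:.0f}% full, act before it fills up")
```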
5. Reduce Alert Noise
Alert fatigue increases MTTD because:
- Real alerts get lost in noise
- People stop paying attention
- Investigation is slower
Fewer, higher-quality alerts improve detection speed.
6. Fix Alert Routing
Ensure alerts reach someone who will act (see the escalation sketch below):
- Route to the right team
- Use escalation policies
- Require acknowledgment
- Monitor alert response times
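As an illustrative sketch, not any particular tool's API, an escalation policy boils down to "if nobody acknowledges within N minutes, page the next tier":

```python
import time

# Hypothetical escalation tiers: (who to notify, minutes to wait for an ack)
ESCALATION_POLICY = [
    ("primary on-call", 5),
    ("secondary on-call", 5),
    ("engineering manager", 10),
]

def page(who: str, alert: str) -> None:
    print(f"Paging {who}: {alert}")  # stand-in for a real notification

def acknowledged() -> bool:
    return False  # stand-in: query your alerting tool for an acknowledgment

def escalate(alert: str) -> None:
    for who, wait_minutes in ESCALATION_POLICY:
        page(who, alert)
        time.sleep(wait_minutes * 60)  # in reality: schedule a timer, don't block
        if acknowledged():
            return
    print(f"UNACKNOWLEDGED after all tiers: {alert}")

# escalate("checkout error rate above threshold")
```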
MTTD Benchmarks
Typical MTTD values vary by organisation and service criticality:
| Category | Typical MTTD | Good MTTD |
|---|---|---|
| Critical customer-facing | 5-15 min | < 5 min |
| Important internal | 15-30 min | < 15 min |
| Non-critical services | 30-60 min | < 30 min |
| Batch jobs | Hours | < 1 hour |
These are guidelines. Your targets should be based on:
- Business impact of delays
- SLA requirements
- Cost of faster detection
MTTD vs Other Metrics
MTTD vs MTTR
- MTTD: Time to detect the problem
- MTTR: Time to resolve the problem (often includes detection)
Both matter. MTTD is often overlooked but directly impacts MTTR.
MTTD vs MTTA
- MTTD: Time until the problem is known
- MTTA (Mean Time to Acknowledge): Time until someone starts working on it
MTTD comes first. You can't acknowledge what you haven't detected.
Tracking MTTD
What to Record
For each incident, capture (one possible record shape is sketched below):
- When the problem actually started (from logs/metrics)
- When the alert fired
- When someone acknowledged
- Root cause of any detection delay
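One simple shape for those records; the field names are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class IncidentRecord:
    """One row of detection data per incident."""
    problem_started: datetime    # reconstructed from logs/metrics
    alert_fired: datetime        # when monitoring raised the alert
    acknowledged: datetime       # when a human picked it up
    detection_delay_cause: str   # e.g. "gap in monitoring coverage"

    @property
    def detection_minutes(self) -> float:
        return (self.alert_fired - self.problem_started).total_seconds() / 60

incident = IncidentRecord(
    problem_started=datetime(2025, 6, 1, 9, 0),
    alert_fired=datetime(2025, 6, 1, 9, 5),
    acknowledged=datetime(2025, 6, 1, 9, 7),
    detection_delay_cause="none: caught by synthetic check",
)
print(f"{incident.detection_minutes:.0f} min to detect")  # 5 min to detect
```

Averaging `detection_minutes` across these records over a month gives the trend line discussed below.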
Review Regularly
After incidents, ask:
- Why did detection take this long?
- What would have caught it faster?
- Are there similar unmonitored risks?
Trend Over Time
Track MTTD monthly. It should trend down as you:
- Add monitoring coverage
- Tune alerting
- Improve incident response
Summary
MTTD (Mean Time to Detect) measures how quickly you discover problems. It's calculated by averaging the time between problem start and detection across incidents.
Lower MTTD means:
- Less user impact
- Faster resolution
- Better reliability
Improve MTTD by:
- Monitoring the right things
- Increasing check frequency
- Reducing alert noise
- Fixing alert routing
- Using proactive alerting
Detection is the first step in incident response. The faster you detect, the faster you can respond and resolve.
Frequently Asked Questions
What is MTTD?
MTTD (Mean Time to Detect) is the average time between when a problem starts and when your team becomes aware of it. Lower MTTD means faster detection.
How do you calculate MTTD?
MTTD = Total detection time for all incidents / Number of incidents. For example, if 5 incidents took 10, 5, 15, 8, and 12 minutes to detect, MTTD = 50/5 = 10 minutes.
What's a good MTTD?
It depends on your service criticality. For critical services, aim for under 5 minutes. For less critical services, under 15 minutes is reasonable. The key is continuous improvement.
How do you reduce MTTD?
Improve monitoring coverage, reduce check intervals, add proactive alerting, use anomaly detection, and ensure alerts route to the right people immediately.