What Is Mean Time to Resolve (MTTR)?
Mean Time to Resolve (MTTR) measures how long it takes to fix a problem completely. Learn how to calculate MTTR, what affects it, and strategies to reduce it.
Wakestack Team
Engineering Team
What Is MTTR?
Mean Time to Resolve (MTTR) is the average time from when an incident is detected until the service is fully restored.
MTTR = Time incident resolved - Time incident detected
If your service goes down at 3:00 PM and is restored at 3:45 PM, the time to resolve was 45 minutes.
Note: MTTR can also stand for "Mean Time to Repair" or "Mean Time to Recovery." The definitions differ slightly (repair, for example, may cover only the fix itself, excluding detection and verification), but all three measure how long recovery takes.
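The single-incident calculation above can be sketched in a few lines of Python. The `time_to_resolve` helper is hypothetical, not part of any monitoring product:

```python
from datetime import datetime

def time_to_resolve(detected: str, resolved: str) -> float:
    """Minutes between detection and full restoration (hypothetical helper)."""
    fmt = "%H:%M"
    delta = datetime.strptime(resolved, fmt) - datetime.strptime(detected, fmt)
    return delta.total_seconds() / 60

print(time_to_resolve("15:00", "15:45"))  # 45.0
```

In practice you would use full timestamps (date plus timezone) so incidents that cross midnight are handled correctly.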
Why MTTR Matters
MTTR directly measures how quickly you recover from problems. Lower MTTR means:
Less Downtime
Every minute of an outage costs money:
- Lost revenue from failed transactions
- Support costs from customer complaints
- Engineering time fighting fires
- Reputation damage
Better User Experience
Users don't care about your architecture. They care whether the service works. Fast recovery minimises their frustration.
SLA Compliance
Most SLAs define acceptable downtime. MTTR determines whether you stay within those limits.
If your SLA allows 99.9% uptime (43 minutes/month), your MTTR must be low enough to stay under that budget.
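That downtime budget follows directly from the uptime target. A minimal sketch, assuming a 30-day month (the helper name is made up for illustration):

```python
def monthly_downtime_budget_minutes(uptime_target: float, days: int = 30) -> float:
    """Allowed downtime per month for a given uptime target, e.g. 0.999 for 99.9%."""
    return (1 - uptime_target) * days * 24 * 60

print(round(monthly_downtime_budget_minutes(0.999), 1))  # 43.2
```

A single incident with an MTTR near that figure consumes the entire month's budget.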
How to Calculate MTTR
Basic Formula
MTTR = Sum of all resolution times / Number of incidents
Example Calculation
| Incident | Detected | Resolved | Resolution Time |
|---|---|---|---|
| #1 | 09:00 | 09:45 | 45 min |
| #2 | 13:30 | 14:00 | 30 min |
| #3 | 16:00 | 17:30 | 90 min |
| #4 | 22:00 | 22:20 | 20 min |
MTTR = (45 + 30 + 90 + 20) / 4 = 46.25 minutes
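The same calculation in Python, using the resolution times from the table above:

```python
# Resolution times in minutes, one entry per incident (from the table above)
resolution_minutes = [45, 30, 90, 20]

mttr = sum(resolution_minutes) / len(resolution_minutes)
print(mttr)  # 46.25
```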
What Counts as "Resolved"?
Define this clearly for your organisation:
- Service restored: Users can use the system again
- Fully fixed: Root cause addressed, no workarounds
- Verified stable: Service confirmed working for a period
Most teams measure to "service restored" since that's what users care about.
The Components of MTTR
MTTR includes several phases:
1. Detection to Acknowledgment
Time from alert firing to someone starting work. This is MTTA (Mean Time to Acknowledge).
2. Diagnosis
Time spent understanding what's wrong:
- Reading alerts and logs
- Correlating symptoms
- Identifying root cause
3. Remediation
Time spent fixing the problem:
- Applying fixes
- Restarting services
- Rolling back changes
4. Verification
Time confirming the fix worked:
- Testing functionality
- Monitoring for recurrence
- Communicating resolution
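The four phases above add up to the total resolution time, which makes it useful to record them separately: the breakdown tells you where time is actually going. A sketch with illustrative (made-up) durations:

```python
# Hypothetical per-incident phase durations, in minutes
phases = {
    "acknowledgment": 5,   # detection -> someone starts work (MTTA)
    "diagnosis": 20,       # reading alerts, correlating symptoms
    "remediation": 15,     # applying fixes, restarts, rollbacks
    "verification": 5,     # confirming the fix held
}

total_resolution_minutes = sum(phases.values())
print(total_resolution_minutes)  # 45
```

If diagnosis dominates the breakdown across incidents, invest in observability; if acknowledgment dominates, look at paging and on-call response.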
What Affects MTTR?
Monitoring Quality
Good monitoring speeds diagnosis. Poor monitoring means guesswork.
Impact: Can reduce diagnosis time by 50%+
Runbooks and Documentation
Clear procedures reduce decision-making time during stress.
Impact: Turns 30-minute decisions into 5-minute actions
On-Call Response
Faster acknowledgment means faster resolution.
Impact: 24/7 on-call vs. business hours can mean hours of difference
System Complexity
Complex systems are harder to diagnose and fix.
Impact: Microservices might be harder to debug than monoliths
Automation
Automated recovery (restarts, failover, scaling) can resolve issues without human intervention.
Impact: Seconds vs. minutes for common failures
Team Experience
Experienced engineers resolve incidents faster.
Impact: Significant—experienced teams are 2-3x faster
How to Reduce MTTR
1. Improve Detection (MTTD)
You can't resolve what you haven't detected. Better monitoring reduces the time before work begins.
See: What Is MTTD?
2. Create Runbooks
Document common incidents:
- Symptoms
- Diagnostic steps
- Resolution procedures
- Escalation paths
During an incident, nobody should be figuring this out for the first time.
3. Implement Better Observability
Invest in:
- Metrics: System and application health
- Logs: Searchable, correlated logs
- Traces: Request flow through systems
Good observability turns hours of diagnosis into minutes.
4. Automate Recovery
For known failure modes, automate the response:
- Service restarts on failure
- Automatic failover to healthy nodes
- Auto-scaling under load
- Automatic rollback on error rate spike
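The restart-on-failure pattern above usually reduces to a small decision rule: restart only after several consecutive failed health probes, so a single flaky check doesn't trigger remediation. A minimal sketch of that rule (real systems typically delegate this to a supervisor such as systemd or a container orchestrator):

```python
def should_restart(recent_probes: list[bool], max_failures: int = 3) -> bool:
    """Restart only when the last max_failures health probes all failed."""
    tail = recent_probes[-max_failures:]
    return len(tail) == max_failures and not any(tail)

print(should_restart([True, False, False, False]))  # True
print(should_restart([False, False, True]))         # False
```

Requiring consecutive failures trades a few seconds of extra downtime for far fewer unnecessary restarts.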
5. Practice Incident Response
Run drills:
- Game days with simulated failures
- Chaos engineering experiments
- Regular on-call handoff reviews
Teams that practice respond faster.
6. Reduce Complexity
Simpler systems fail in simpler ways:
- Fewer dependencies
- Clearer architecture
- Better isolation between components
7. Conduct Thorough Post-Mortems
After each incident, ask:
- What would have made detection faster?
- What slowed down diagnosis?
- How can we prevent this or recover faster next time?
Implement improvements from every post-mortem.
MTTR Benchmarks
Typical values vary by incident severity:
| Severity | Typical MTTR | Good MTTR |
|---|---|---|
| Critical (full outage) | 1-4 hours | < 1 hour |
| High (major degradation) | 2-8 hours | < 2 hours |
| Medium (partial impact) | 4-24 hours | < 4 hours |
| Low (minor issues) | 1-7 days | < 1 day |
Your targets should reflect:
- Business impact of downtime
- SLA commitments
- Cost of faster resolution
MTTR vs Related Metrics
| Metric | Measures | Formula |
|---|---|---|
| MTTD | Time to detect | Detection time - Start time |
| MTTA | Time to acknowledge | Acknowledgment - Detection |
| MTTR | Time to resolve | Resolution - Detection |
| MTBF | Time between failures | Uptime / Number of failures |
These metrics together describe your incident lifecycle:

Problem start --MTTD--> Detected --MTTA--> Acknowledged --> Resolved

MTTR covers the whole span from detection to resolution, so acknowledgment, diagnosis, and remediation all sit inside it; MTTD covers the gap before it.
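Given per-incident timestamps, all of these averages fall out of the same records. A sketch with hypothetical incidents (start, detected, acknowledged, resolved):

```python
from datetime import datetime

FMT = "%H:%M"

def minutes(start: str, end: str) -> float:
    """Elapsed minutes between two HH:MM timestamps on the same day."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

# Hypothetical incident records: (problem start, detected, acknowledged, resolved)
incidents = [
    ("08:50", "09:00", "09:05", "09:45"),
    ("13:20", "13:30", "13:32", "14:00"),
]

mttd = sum(minutes(s, d) for s, d, a, r in incidents) / len(incidents)
mtta = sum(minutes(d, a) for s, d, a, r in incidents) / len(incidents)
mttr = sum(minutes(d, r) for s, d, a, r in incidents) / len(incidents)
print(mttd, mtta, mttr)  # 10.0 3.5 37.5
```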
Common MTTR Mistakes
Measuring Only Mean
Averages hide outliers. One 8-hour incident skews your 30-minute average.
Fix: Track median and percentiles too (P90, P99).
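Python's standard library handles this directly; a sketch with made-up resolution times, including one outlier:

```python
import statistics

# Resolution times in minutes; one 8-hour outlier among routine incidents
resolution_minutes = [20, 25, 30, 28, 22, 480]

mean = statistics.mean(resolution_minutes)
median = statistics.median(resolution_minutes)
p90 = statistics.quantiles(resolution_minutes, n=10)[-1]  # 90th percentile

print(round(mean, 1))  # 100.8
print(median)          # 26.5
```

The median stays near the typical incident while the mean is dragged toward the outlier, which is exactly why reporting only the mean misleads.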
Including Resolved-but-Recurring Issues
If an issue comes back, was it really resolved?
Fix: Only count as resolved when the issue is genuinely fixed.
Measuring from Problem Start Instead of Detection
An incident that began at 2 AM but was only detected at 9 AM looks like it took seven extra hours to resolve if you measure from when the problem started, even though work began immediately on detection.
Fix: Measure MTTR from detection to resolution; the pre-detection gap is an MTTD problem, not an MTTR problem.
Summary
MTTR (Mean Time to Resolve) measures how quickly you recover from incidents. It's calculated by averaging resolution times across incidents.
Lower MTTR means:
- Less downtime
- Happier users
- SLA compliance
- Lower incident costs
Reduce MTTR by:
- Improving detection (MTTD)
- Creating runbooks
- Investing in observability
- Automating recovery
- Practicing incident response
- Learning from post-mortems
MTTR is one of the most important operational metrics. Track it, work to reduce it, and your reliability will follow.
Frequently Asked Questions
What is MTTR?
MTTR (Mean Time to Resolve) is the average time it takes to fully resolve an incident from the moment it's detected until normal service is restored.
How do you calculate MTTR?
MTTR = Total resolution time for all incidents / Number of incidents. If three incidents took 30, 45, and 15 minutes to resolve, MTTR = 90/3 = 30 minutes.
What's a good MTTR?
For critical services, aim for under 1 hour. For most services, under 4 hours is reasonable. The right target depends on your SLAs and business impact.
How do you reduce MTTR?
Improve runbooks, implement better monitoring for faster diagnosis, practice incident response, automate recovery procedures, and conduct thorough post-mortems.
Related Articles
What Is Alert Fatigue and How Do Teams Fix It?
Alert fatigue happens when too many alerts cause people to ignore them. Learn the causes, warning signs, and proven strategies to eliminate alert fatigue in your team.
What Is Mean Time to Detect (MTTD)?
Mean Time to Detect (MTTD) measures how long it takes to discover a problem after it starts. Learn how to calculate MTTD, why it matters, and how to improve it.
What Is an SLA vs SLO vs SLI? (Clear Comparison)
SLAs, SLOs, and SLIs are related but different. SLIs measure, SLOs target, and SLAs promise. Learn the differences with clear examples and when to use each.
Ready to monitor your uptime?
Start monitoring your websites, APIs, and services in minutes. Free forever for small projects.