What Is Mean Time to Resolve (MTTR)?
Mean Time to Resolve (MTTR) measures how long it takes to fix a problem completely. Learn how to calculate MTTR, what affects it, and strategies to reduce it.
Wakestack Team
Engineering Team
What Is MTTR?
Mean Time to Resolve (MTTR) is the average time from when an incident is detected until the service is fully restored.
MTTR = Time incident resolved - Time incident detected
If your service goes down at 3:00 PM and is restored at 3:45 PM, the time to resolve was 45 minutes.
Note: MTTR can also stand for "Mean Time to Repair" or "Mean Time to Recovery." The definitions differ slightly (repair, for example, may cover only the fix itself, excluding detection and verification), but all three measure how long recovery takes.
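The single-incident calculation above can be sketched in a few lines of Python. The `time_to_resolve` helper is hypothetical, not part of any monitoring product:

```python
from datetime import datetime

def time_to_resolve(detected: str, resolved: str) -> float:
    """Minutes between detection and full restoration (hypothetical helper)."""
    fmt = "%H:%M"
    delta = datetime.strptime(resolved, fmt) - datetime.strptime(detected, fmt)
    return delta.total_seconds() / 60

print(time_to_resolve("15:00", "15:45"))  # 45.0
```

In practice you would use full timestamps (date plus timezone) so incidents that cross midnight are handled correctly.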
Why MTTR Matters
MTTR directly measures how quickly you recover from problems. Lower MTTR means:
Less Downtime
Every minute of an outage costs money:
- Lost revenue from failed transactions
- Support costs from customer complaints
- Engineering time fighting fires
- Reputation damage
Better User Experience
Users don't care about your architecture. They care whether the service works. Fast recovery minimises their frustration.
SLA Compliance
Most SLAs define acceptable downtime. MTTR determines whether you stay within those limits.
If your SLA allows 99.9% uptime (43 minutes/month), your MTTR must be low enough to stay under that budget.
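That downtime budget follows directly from the uptime target. A minimal sketch, assuming a 30-day month (the helper name is made up for illustration):

```python
def monthly_downtime_budget_minutes(uptime_target: float, days: int = 30) -> float:
    """Allowed downtime per month for a given uptime target, e.g. 0.999 for 99.9%."""
    return (1 - uptime_target) * days * 24 * 60

print(round(monthly_downtime_budget_minutes(0.999), 1))  # 43.2
```

A single incident with an MTTR near that figure consumes the entire month's budget.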
How to Calculate MTTR
Basic Formula
MTTR = Sum of all resolution times / Number of incidents
Example Calculation
| Incident | Detected | Resolved | Resolution Time |
|---|---|---|---|
| #1 | 09:00 | 09:45 | 45 min |
| #2 | 13:30 | 14:00 | 30 min |
| #3 | 16:00 | 17:30 | 90 min |
| #4 | 22:00 | 22:20 | 20 min |
MTTR = (45 + 30 + 90 + 20) / 4 = 46.25 minutes
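The same calculation in Python, using the resolution times from the table above:

```python
# Resolution times in minutes, one entry per incident (from the table above)
resolution_minutes = [45, 30, 90, 20]

mttr = sum(resolution_minutes) / len(resolution_minutes)
print(mttr)  # 46.25
```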
What Counts as "Resolved"?
Define this clearly for your organisation:
- Service restored: Users can use the system again
- Fully fixed: Root cause addressed, no workarounds
- Verified stable: Service confirmed working for a period
Most teams measure to "service restored" since that's what users care about.
The Components of MTTR
MTTR includes several phases:
1. Detection to Acknowledgment
Time from alert firing to someone starting work. This is MTTA (Mean Time to Acknowledge).
2. Diagnosis
Time spent understanding what's wrong:
- Reading alerts and logs
- Correlating symptoms
- Identifying root cause
3. Remediation
Time spent fixing the problem:
- Applying fixes
- Restarting services
- Rolling back changes
4. Verification
Time confirming the fix worked:
- Testing functionality
- Monitoring for recurrence
- Communicating resolution
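The four phases above add up to the total resolution time, which makes it useful to record them separately: the breakdown tells you where time is actually going. A sketch with illustrative (made-up) durations:

```python
# Hypothetical per-incident phase durations, in minutes
phases = {
    "acknowledgment": 5,   # detection -> someone starts work (MTTA)
    "diagnosis": 20,       # reading alerts, correlating symptoms
    "remediation": 15,     # applying fixes, restarts, rollbacks
    "verification": 5,     # confirming the fix held
}

total_resolution_minutes = sum(phases.values())
print(total_resolution_minutes)  # 45
```

If diagnosis dominates the breakdown across incidents, invest in observability; if acknowledgment dominates, look at paging and on-call response.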
What Affects MTTR?
Monitoring Quality
Good monitoring speeds diagnosis. Poor monitoring means guesswork.
Impact: Can reduce diagnosis time by 50%+
Runbooks and Documentation
Clear procedures reduce decision-making time during stress.
Impact: Turns 30-minute decisions into 5-minute actions
On-Call Response
Faster acknowledgment means faster resolution.
Impact: 24/7 on-call vs. business hours can mean hours of difference
System Complexity
Complex systems are harder to diagnose and fix.
Impact: Microservices might be harder to debug than monoliths
Automation
Automated recovery (restarts, failover, scaling) can resolve issues without human intervention.
Impact: Seconds vs. minutes for common failures
Team Experience
Experienced engineers resolve incidents faster.
Impact: Significant—experienced teams are 2-3x faster
How to Reduce MTTR
1. Improve Detection (MTTD)
You can't resolve what you haven't detected. Better monitoring reduces the time before work begins.
See: What Is MTTD?
2. Create Runbooks
Document common incidents:
- Symptoms
- Diagnostic steps
- Resolution procedures
- Escalation paths
During an incident, nobody should be figuring this out for the first time.
3. Implement Better Observability
Invest in:
- Metrics: System and application health
- Logs: Searchable, correlated logs
- Traces: Request flow through systems
Good observability turns hours of diagnosis into minutes.
4. Automate Recovery
For known failure modes, automate the response:
- Service restarts on failure
- Automatic failover to healthy nodes
- Auto-scaling under load
- Automatic rollback on error rate spike
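The restart-on-failure pattern above usually reduces to a small decision rule: restart only after several consecutive failed health probes, so a single flaky check doesn't trigger remediation. A minimal sketch of that rule (real systems typically delegate this to a supervisor such as systemd or a container orchestrator):

```python
def should_restart(recent_probes: list[bool], max_failures: int = 3) -> bool:
    """Restart only when the last max_failures health probes all failed."""
    tail = recent_probes[-max_failures:]
    return len(tail) == max_failures and not any(tail)

print(should_restart([True, False, False, False]))  # True
print(should_restart([False, False, True]))         # False
```

Requiring consecutive failures trades a few seconds of extra downtime for far fewer unnecessary restarts.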
5. Practice Incident Response
Run drills:
- Game days with simulated failures
- Chaos engineering experiments
- Regular on-call handoff reviews
Teams that practice respond faster.
6. Reduce Complexity
Simpler systems fail in simpler ways:
- Fewer dependencies
- Clearer architecture
- Better isolation between components
7. Conduct Thorough Post-Mortems
After each incident, ask:
- What would have made detection faster?
- What slowed down diagnosis?
- How can we prevent this or recover faster next time?
Implement improvements from every post-mortem.
MTTR Benchmarks
Typical values vary by incident severity:
| Severity | Typical MTTR | Good MTTR |
|---|---|---|
| Critical (full outage) | 1-4 hours | < 1 hour |
| High (major degradation) | 2-8 hours | < 2 hours |
| Medium (partial impact) | 4-24 hours | < 4 hours |
| Low (minor issues) | 1-7 days | < 1 day |
Your targets should reflect:
- Business impact of downtime
- SLA commitments
- Cost of faster resolution
MTTR vs Related Metrics
| Metric | Measures | Formula |
|---|---|---|
| MTTD | Time to detect | Detection time - Start time |
| MTTA | Time to acknowledge | Acknowledgment - Detection |
| MTTR | Time to resolve | Resolution - Detection |
| MTBF | Time between failures | Uptime / Number of failures |
These metrics together describe your incident lifecycle:

Problem start --MTTD--> Detected --MTTA--> Acknowledged --> Resolved

MTTR covers the whole span from detection to resolution, so acknowledgment, diagnosis, and remediation all sit inside it; MTTD covers the gap before it.
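Given per-incident timestamps, all of these averages fall out of the same records. A sketch with hypothetical incidents (start, detected, acknowledged, resolved):

```python
from datetime import datetime

FMT = "%H:%M"

def minutes(start: str, end: str) -> float:
    """Elapsed minutes between two HH:MM timestamps on the same day."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

# Hypothetical incident records: (problem start, detected, acknowledged, resolved)
incidents = [
    ("08:50", "09:00", "09:05", "09:45"),
    ("13:20", "13:30", "13:32", "14:00"),
]

mttd = sum(minutes(s, d) for s, d, a, r in incidents) / len(incidents)
mtta = sum(minutes(d, a) for s, d, a, r in incidents) / len(incidents)
mttr = sum(minutes(d, r) for s, d, a, r in incidents) / len(incidents)
print(mttd, mtta, mttr)  # 10.0 3.5 37.5
```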
Common MTTR Mistakes
Measuring Only Mean
Averages hide outliers. One 8-hour incident skews your 30-minute average.
Fix: Track median and percentiles too (P90, P99).
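Python's standard library handles this directly; a sketch with made-up resolution times, including one outlier:

```python
import statistics

# Resolution times in minutes; one 8-hour outlier among routine incidents
resolution_minutes = [20, 25, 30, 28, 22, 480]

mean = statistics.mean(resolution_minutes)
median = statistics.median(resolution_minutes)
p90 = statistics.quantiles(resolution_minutes, n=10)[-1]  # 90th percentile

print(round(mean, 1))  # 100.8
print(median)          # 26.5
```

The median stays near the typical incident while the mean is dragged toward the outlier, which is exactly why reporting only the mean misleads.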
Including Resolved-but-Recurring Issues
If an issue comes back, was it really resolved?
Fix: Only count as resolved when the issue is genuinely fixed.
Measuring from Problem Start Instead of Detection
An incident that began at 2 AM but was only detected at 9 AM looks like it took seven extra hours to resolve if you measure from when the problem started, even though work began immediately on detection.
Fix: Measure MTTR from detection to resolution; the pre-detection gap is an MTTD problem, not an MTTR problem.
Summary
MTTR (Mean Time to Resolve) measures how quickly you recover from incidents. It's calculated by averaging resolution times across incidents.
Lower MTTR means:
- Less downtime
- Happier users
- SLA compliance
- Lower incident costs
Reduce MTTR by:
- Improving detection (MTTD)
- Creating runbooks
- Investing in observability
- Automating recovery
- Practicing incident response
- Learning from post-mortems
MTTR is one of the most important operational metrics. Track it, work to reduce it, and your reliability will follow.
Frequently Asked Questions
What is MTTR?
MTTR (Mean Time to Resolve) is the average time it takes to fully resolve an incident from the moment it's detected until normal service is restored.
How do you calculate MTTR?
MTTR = Total resolution time for all incidents / Number of incidents. If three incidents took 30, 45, and 15 minutes to resolve, MTTR = 90/3 = 30 minutes.
What's a good MTTR?
For critical services, aim for under 1 hour. For most services, under 4 hours is reasonable. The right target depends on your SLAs and business impact.
How do you reduce MTTR?
Improve runbooks, implement better monitoring for faster diagnosis, practice incident response, automate recovery procedures, and conduct thorough post-mortems.
Related Articles
What Is Alert Fatigue and How Do Teams Fix It?
Alert fatigue happens when too many alerts cause people to ignore them. Learn the causes, warning signs, and proven strategies to eliminate alert fatigue in your team.
What Is Mean Time to Detect (MTTD)?
Mean Time to Detect (MTTD) measures how long it takes to discover a problem after it starts. Learn how to calculate MTTD, why it matters, and how to improve it.
What Is an SLA vs SLO vs SLI? (Clear Comparison)
SLAs, SLOs, and SLIs are related but different. SLIs measure, SLOs target, and SLAs promise. Learn the differences with clear examples and when to use each.
Ready to monitor your uptime?
Start monitoring your websites, APIs, and services in minutes. Free forever for small projects.