What Is Mean Time to Resolve (MTTR)?

Mean Time to Resolve (MTTR) measures how long it takes to fix a problem completely. Learn how to calculate MTTR, what affects it, and strategies to reduce it.

Wakestack Team

Engineering Team

What Is MTTR?

Mean Time to Resolve (MTTR) is the average time from when an incident is detected until the service is fully restored.

Time to resolve = Time incident resolved - Time incident detected

If an outage is detected at 3:00 PM and service is restored at 3:45 PM, the time to resolve was 45 minutes. MTTR is the average of this value across incidents.

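Note: MTTR sometimes stands for "Mean Time to Repair" or "Mean Time to Recovery". The ideas are closely related, but the start and end points can differ, so confirm which definition your team and tools use.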

Why MTTR Matters

MTTR directly measures how quickly you recover from problems. Lower MTTR means:

Less Downtime

Every minute of an outage costs money:

  • Lost revenue from failed transactions
  • Support costs from customer complaints
  • Engineering time fighting fires
  • Reputation damage

Better User Experience

Users don't care about your architecture. They care whether the service works. Fast recovery minimises their frustration.

SLA Compliance

Most SLAs define acceptable downtime. MTTR determines whether you stay within those limits.

If your SLA requires 99.9% uptime (about 43 minutes of downtime in a 30-day month), your total downtime, roughly MTTR multiplied by the number of incidents, must stay under that budget.
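
The budget behind that figure can be worked out directly. The following is a minimal sketch, assuming a 30-day month and treating each incident's downtime as roughly equal to its time to resolve. The function names are illustrative, not part of any tool.

```python
def downtime_budget_minutes(uptime_target: float, days: int = 30) -> float:
    """Minutes of downtime allowed in the period for a given uptime target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_target)


def incidents_within_budget(budget_minutes: float, mttr_minutes: float) -> int:
    """How many incidents of average length mttr_minutes fit inside the budget."""
    return int(budget_minutes // mttr_minutes)


budget = downtime_budget_minutes(0.999)
print(f"Downtime budget: {budget:.1f} minutes")
print("Incidents that fit at a 20-minute MTTR:", incidents_within_budget(budget, 20))
```

With a 20-minute MTTR, only two incidents fit in the month before the budget is exhausted, which is why the SLA target and the MTTR target need to be set together.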

How to Calculate MTTR

Basic Formula

MTTR = Sum of all resolution times / Number of incidents

Example Calculation

Incident  Detected  Resolved  Resolution Time
#1        09:00     09:45     45 min
#2        13:30     14:00     30 min
#3        16:00     17:30     90 min
#4        22:00     22:20     20 min

MTTR = (45 + 30 + 90 + 20) / 4 = 46.25 minutes
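
The same calculation can be scripted from the detected and resolved times. This is a small sketch using the four example incidents. It assumes each incident starts and ends on the same day; real incident data would normally carry full timestamps.

```python
from datetime import datetime

# (detected, resolved) times for the four example incidents, all on the same day
incidents = [
    ("09:00", "09:45"),
    ("13:30", "14:00"),
    ("16:00", "17:30"),
    ("22:00", "22:20"),
]


def resolution_minutes(detected: str, resolved: str) -> float:
    """Minutes between detection and resolution for one incident."""
    fmt = "%H:%M"
    delta = datetime.strptime(resolved, fmt) - datetime.strptime(detected, fmt)
    return delta.total_seconds() / 60


durations = [resolution_minutes(d, r) for d, r in incidents]
mttr = sum(durations) / len(durations)
print(durations)  # [45.0, 30.0, 90.0, 20.0]
print(mttr)       # 46.25
```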

What Counts as "Resolved"?

Define this clearly for your organisation:

  • Service restored: Users can use the system again
  • Fully fixed: Root cause addressed, no workarounds
  • Verified stable: Service confirmed working for a period

Most teams measure to "service restored" since that's what users care about.

The Components of MTTR

MTTR includes several phases:

1. Detection to Acknowledgment

Time from alert firing to someone starting work. This is MTTA (Mean Time to Acknowledge).

2. Diagnosis

Time spent understanding what's wrong:

  • Reading alerts and logs
  • Correlating symptoms
  • Identifying root cause

3. Remediation

Time spent fixing the problem:

  • Applying fixes
  • Restarting services
  • Rolling back changes

4. Verification

Time confirming the fix worked:

  • Testing functionality
  • Monitoring for recurrence
  • Communicating resolution
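
To see where time actually goes, it can help to record a timestamp at each phase boundary and break the total down. The sketch below uses a hypothetical timeline for a single 45-minute incident. The field names and values are illustrative only.

```python
from datetime import datetime

# Hypothetical timeline for one incident (illustrative values only)
timeline = {
    "detected": datetime(2024, 1, 15, 15, 0),
    "acknowledged": datetime(2024, 1, 15, 15, 5),
    "diagnosed": datetime(2024, 1, 15, 15, 25),
    "remediated": datetime(2024, 1, 15, 15, 40),
    "verified": datetime(2024, 1, 15, 15, 45),
}

# Each phase runs from one recorded timestamp to the next
phases = [
    ("acknowledgment", "detected", "acknowledged"),
    ("diagnosis", "acknowledged", "diagnosed"),
    ("remediation", "diagnosed", "remediated"),
    ("verification", "remediated", "verified"),
]

breakdown = {
    name: (timeline[end] - timeline[start]).total_seconds() / 60
    for name, start, end in phases
}
total = sum(breakdown.values())

for name, minutes in breakdown.items():
    print(f"{name}: {minutes:.0f} min ({minutes / total:.0%} of total)")
print(f"time to resolve: {total:.0f} min")
```

In this made-up timeline diagnosis is the largest phase, which would point toward monitoring and runbooks rather than faster paging.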

What Affects MTTR?

Monitoring Quality

Good monitoring speeds diagnosis. Poor monitoring means guesswork.

Impact: Can reduce diagnosis time by 50%+

Runbooks and Documentation

Clear procedures reduce decision-making time during stress.

Impact: Turns 30-minute decisions into 5-minute actions

On-Call Response

Faster acknowledgment means faster resolution.

Impact: 24/7 on-call vs. business hours can mean hours of difference

System Complexity

Complex systems are harder to diagnose and fix.

Impact: Microservices might be harder to debug than monoliths

Automation

Automated recovery (restarts, failover, scaling) can resolve issues without human intervention.

Impact: Seconds vs. minutes for common failures

Team Experience

Experienced engineers resolve incidents faster.

Impact: Significant. Experienced teams can be 2-3x faster

How to Reduce MTTR

1. Improve Detection (MTTD)

You can't resolve what you haven't detected. Better monitoring reduces the time before work begins.

See: What Is MTTD?

2. Create Runbooks

Document common incidents:

  • Symptoms
  • Diagnostic steps
  • Resolution procedures
  • Escalation paths

During an incident, nobody should be figuring this out for the first time.

3. Implement Better Observability

Invest in:

  • Metrics: System and application health
  • Logs: Searchable, correlated logs
  • Traces: Request flow through systems

Good observability turns hours of diagnosis into minutes.

4. Automate Recovery

For known failure modes, automate the response:

  • Service restarts on failure
  • Automatic failover to healthy nodes
  • Auto-scaling under load
  • Automatic rollback on error rate spike
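
As one illustration of the last item, a rollback trigger can be as simple as watching the error rate over a sliding window of recent requests. This is a hedged sketch, not a production design: the class name, window size, threshold, and minimum sample count are all assumptions, and the rollback itself is left to whatever deployment tooling you use.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks recent request outcomes and flags when a rollback looks warranted."""

    def __init__(self, window: int = 100, threshold: float = 0.05, min_samples: int = 20):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_roll_back(self) -> bool:
        # Require a minimum sample size so one early failure does not trigger a rollback
        enough_data = len(self.outcomes) >= self.min_samples
        return enough_data and self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window=50, threshold=0.10, min_samples=20)
for i in range(50):
    monitor.record(failed=(i % 4 == 0))  # simulate roughly 1 in 4 requests failing

if monitor.should_roll_back():
    # Hypothetical hook: call your own deployment tooling here
    print(f"error rate {monitor.error_rate():.0%} exceeds threshold, trigger rollback")
```

The minimum sample count is a design choice: it trades a slightly slower reaction for fewer false rollbacks right after a deploy, when only a handful of requests have been seen.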

5. Practice Incident Response

Run drills:

  • Game days with simulated failures
  • Chaos engineering experiments
  • Regular on-call handoff reviews

Teams that practice respond faster.

6. Reduce Complexity

Simpler systems fail in simpler ways:

  • Fewer dependencies
  • Clearer architecture
  • Better isolation between components

7. Conduct Thorough Post-Mortems

After each incident, ask:

  • What would have made detection faster?
  • What slowed down diagnosis?
  • How can we prevent this or recover faster next time?

Implement improvements from every post-mortem.

MTTR Benchmarks

Typical values vary by incident severity:

Severity                  Typical MTTR  Good MTTR
Critical (full outage)    1-4 hours     < 1 hour
High (major degradation)  2-8 hours     < 2 hours
Medium (partial impact)   4-24 hours    < 4 hours
Low (minor issues)        1-7 days      < 1 day

Your targets should reflect:

  • Business impact of downtime
  • SLA commitments
  • Cost of faster resolution

Related Metrics

Metric  Measures               Formula
MTTD    Time to detect         Detection time - Start time
MTTA    Time to acknowledge    Acknowledgment - Detection
MTTR    Time to resolve        Resolution - Detection
MTBF    Time between failures  Uptime / Number of failures

These metrics together describe your incident lifecycle:

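Start → Detection → Acknowledgment → Resolution

MTTD covers start to detection, MTTA covers detection to acknowledgment, and MTTR covers detection to resolution (so it includes MTTA). Total incident duration is therefore MTTD + MTTR.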
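
All four metrics can be derived from the same incident records if each one stores four timestamps. The sketch below is one possible layout under that assumption. The `Incident` fields, the sample data, and the 24-hour observation window used for MTBF are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime
    detected: datetime
    acknowledged: datetime
    resolved: datetime


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average of a list of durations, in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


day = datetime(2024, 1, 15)
incidents = [
    Incident(day.replace(hour=8, minute=50), day.replace(hour=9, minute=0),
             day.replace(hour=9, minute=5), day.replace(hour=9, minute=45)),
    Incident(day.replace(hour=13, minute=26), day.replace(hour=13, minute=30),
             day.replace(hour=13, minute=33), day.replace(hour=14, minute=0)),
]

mttd = mean_minutes([i.detected - i.started for i in incidents])
mtta = mean_minutes([i.acknowledged - i.detected for i in incidents])
mttr = mean_minutes([i.resolved - i.detected for i in incidents])

# MTBF = uptime / number of failures, over an assumed 24-hour observation window
window = timedelta(hours=24)
downtime = sum((i.resolved - i.started for i in incidents), timedelta())
mtbf = (window - downtime).total_seconds() / 60 / len(incidents)

print(f"MTTD {mttd} min, MTTA {mtta} min, MTTR {mttr} min, MTBF {mtbf} min")
```

Keeping the start time separate from the detection time is what makes MTTD measurable at all. In practice the start time is often estimated after the fact from logs or metrics.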

Common MTTR Mistakes

Measuring Only Mean

Averages hide outliers. One 8-hour incident skews your 30-minute average.

Fix: Track median and percentiles too (P90, P99).
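
A quick way to see the difference is to compute the mean, median, and P90 side by side. This sketch uses Python's standard `statistics` module on made-up resolution times: nine routine incidents and one 8-hour outlier. The choice of the `inclusive` quantile method is an assumption, and other methods will give a different P90 on small samples.

```python
import statistics

# Hypothetical resolution times in minutes: nine routine incidents, one 8-hour outlier
durations = [20, 25, 25, 30, 30, 30, 35, 35, 40, 480]

mean = statistics.mean(durations)
median = statistics.median(durations)
# quantiles(n=10) returns the nine cut points between deciles; the last one is P90
p90 = statistics.quantiles(durations, n=10, method="inclusive")[-1]

print(f"mean:   {mean} min")
print(f"median: {median} min")
print(f"P90:    {p90} min")
```

Here the mean is 75 minutes even though nine of the ten incidents were resolved in 40 minutes or less, while the median stays at 30 minutes. Reporting both shows the typical case and the tail.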

Including Resolved-but-Recurring Issues

If an issue comes back, was it really resolved?

Fix: Only count as resolved when the issue is genuinely fixed.

Ignoring Business Hours

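An incident that starts at 2 AM but is not detected until 9 AM looks like it took a long time to resolve if you measure from the start of the problem, even though work began as soon as it was detected.

Fix: Measure time to resolve from detection, not from problem start, and track the gap before detection separately as MTTD.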

Summary

MTTR (Mean Time to Resolve) measures how quickly you recover from incidents. It's calculated by averaging resolution times across incidents.

Lower MTTR means:

  • Less downtime
  • Happier users
  • SLA compliance
  • Lower incident costs

Reduce MTTR by:

  • Improving detection (MTTD)
  • Creating runbooks
  • Investing in observability
  • Automating recovery
  • Practicing incident response
  • Learning from post-mortems

MTTR is one of the most important operational metrics. Track it, improve it, and watch your reliability improve.

Frequently Asked Questions

What is MTTR?

MTTR (Mean Time to Resolve) is the average time it takes to fully resolve an incident from the moment it's detected until normal service is restored.

How do you calculate MTTR?

MTTR = Total resolution time for all incidents / Number of incidents. If three incidents took 30, 45, and 15 minutes to resolve, MTTR = 90/3 = 30 minutes.

What's a good MTTR?

For critical services, aim for under 1 hour. For most services, under 4 hours is reasonable. The right target depends on your SLAs and business impact.

How do you reduce MTTR?

Improve runbooks, implement better monitoring for faster diagnosis, practice incident response, automate recovery procedures, and conduct thorough post-mortems.
