How to Monitor Microservices Without Overcomplicating

Microservices monitoring can spiral into complexity quickly. Learn practical approaches to get visibility without overwhelming your team or budget.

Wakestack Team

Engineering Team

7 min read

The Microservices Monitoring Trap

Microservices promise flexibility but deliver complexity. Monitoring is where this hits hardest:

  • Monolith: monitor one application
  • 10 microservices: monitor 10 applications + their interactions
  • 50 microservices: monitor 50 applications + 200 potential interaction paths

It's easy to either:

  • Under-monitor: Can't see what's happening
  • Over-monitor: Drowning in data, alerts, and dashboards

The goal is effective monitoring without the complexity trap.

Start with the Golden Signals

Google's SRE book defines four golden signals. These work for any service:

1. Latency

How long requests take.

Measure: Request duration (p50, p95, p99)
Alert on: p99 > threshold for 5 minutes

2. Traffic

How much demand the service handles.

Measure: Requests per second
Alert on: Sudden drops (might indicate upstream failure)

3. Errors

Rate of failed requests.

Measure: Error rate (5xx responses / total requests)
Alert on: Error rate > 1% for 5 minutes

4. Saturation

How full resources are.

Measure: CPU, memory, connection pool usage
Alert on: Approaching limits (80-90%)

If every microservice exposes these four signals, you have meaningful visibility across your entire system.
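
As a concrete illustration, here is a minimal Python sketch of recording all four signals in one service. It assumes the prometheus_client library; the metric names and the do_work handler are placeholders, not a prescribed standard.

# Minimal golden-signals instrumentation sketch (prometheus_client assumed).
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Traffic: requests served", ["method", "path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Latency: request duration", ["method", "path"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Saturation proxy: concurrent requests")

def handle_request(method: str, path: str) -> str:
    """Wrap the real handler so every request feeds traffic, errors, latency, saturation."""
    IN_FLIGHT.inc()
    start = time.perf_counter()
    status = "200"
    try:
        return do_work(path)            # your actual business logic
    except Exception:
        status = "500"                  # error rate = 5xx requests / total requests
        raise
    finally:
        LATENCY.labels(method, path).observe(time.perf_counter() - start)
        REQUESTS.labels(method, path, status).inc()
        IN_FLIGHT.dec()

def do_work(path: str) -> str:
    return f"handled {path}"

if __name__ == "__main__":
    start_http_server(8000)             # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request("GET", "/orders")
        time.sleep(1)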

Monitoring Layers (In Priority Order)

Layer 1: External Health (Required)

Can users reach and use your system?

What to monitor:

  • Public endpoint availability
  • End-to-end transaction success
  • Response time from user perspective

How:

  • Uptime monitoring (Wakestack, Better Stack)
  • Synthetic transactions
  • Real user monitoring (RUM)

Why first: If users can't use your product, nothing else matters.
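
In practice an external uptime monitor runs these checks for you, but as a rough picture of what a synthetic transaction does, here is a Python sketch. The requests library, the endpoint URL, the latency budget, and the notify_on_call hook are all assumptions.

# Synthetic external check sketch; URL, threshold, and alert hook are placeholders.
import time
import requests

ENDPOINT = "https://example.com/api/checkout/health"   # hypothetical public endpoint
LATENCY_BUDGET_S = 2.0

def synthetic_check() -> bool:
    start = time.perf_counter()
    try:
        resp = requests.get(ENDPOINT, timeout=10)
        elapsed = time.perf_counter() - start
        ok = resp.status_code == 200 and elapsed <= LATENCY_BUDGET_S
    except requests.RequestException:
        ok = False
    if not ok:
        notify_on_call(f"Synthetic check failed for {ENDPOINT}")
    return ok

def notify_on_call(message: str) -> None:
    print(message)   # replace with your paging or Slack integration

if __name__ == "__main__":
    synthetic_check()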

Layer 2: Service Health (Required)

Is each service working?

What to monitor:

  • Health check endpoints
  • Golden signals per service
  • Dependency status

How:

  • Each service exposes /health
  • Prometheus/metrics endpoint
  • Service mesh telemetry

Why second: Service health helps you understand where problems are.
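
A health endpoint can stay very small. The sketch below uses Flask purely as an example framework, and database_reachable stands in for whatever dependency checks your service actually needs.

# Minimal /health endpoint sketch (Flask assumed; any framework works).
from flask import Flask, jsonify

app = Flask(__name__)

def database_reachable() -> bool:
    # Placeholder: run a cheap query such as SELECT 1 against your database.
    return True

@app.route("/health")
def health():
    checks = {"database": database_reachable()}
    healthy = all(checks.values())
    status_code = 200 if healthy else 503
    return jsonify({"status": "ok" if healthy else "degraded", "checks": checks}), status_code

if __name__ == "__main__":
    app.run(port=8080)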

Layer 3: Infrastructure (Required)

Are resources sufficient?

What to monitor:

  • CPU, memory, disk per host/container
  • Network connectivity
  • Database performance

How:

  • Agent-based monitoring
  • Container orchestrator metrics
  • Database-specific monitoring

Why third: Infrastructure problems cause service problems.
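
You normally get these numbers from an existing agent or your orchestrator rather than writing your own, but as a rough picture of what such an agent collects, here is a tiny sketch assuming the psutil library.

# Saturation snapshot sketch (psutil assumed); real setups use an existing agent.
import psutil

def saturation_snapshot() -> dict:
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),     # averaged over 1 second
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }

if __name__ == "__main__":
    for resource, used in saturation_snapshot().items():
        # Flag anything approaching the 80-90% limits mentioned above.
        print(f"{resource}: {used:.1f}%{'  <- approaching limit' if used >= 80 else ''}")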

Layer 4: Centralized Logging (Recommended)

What happened?

What to monitor:

  • Structured logs from each service
  • Error logs and stack traces
  • Audit trails

How:

  • Centralized log aggregation
  • Structured logging (JSON)
  • Log-based alerts for errors

Why recommended: Logs help debug when metrics show a problem.
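
Structured logging does not require a framework. Here is a minimal standard-library Python sketch; the field names and the hard-coded service name are illustrative, not a required schema.

# Structured (JSON) logging sketch using only the standard library.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "order-service",          # assumed service name
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["stack"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed")
try:
    raise ValueError("payment declined")         # simulate a failure
except ValueError:
    logger.exception("order failed")             # error plus stack trace, ready for log-based alerts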

Layer 5: Distributed Tracing (Optional)

How do requests flow?

What to monitor:

  • Request paths across services
  • Latency per service in chain
  • Error propagation

How:

  • Jaeger, Zipkin, or commercial APM
  • Trace context propagation
  • Sampling for production

Why optional: Valuable for complex flows, but adds overhead. Add when you need it.
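
If you do add tracing, sampling is configured at the SDK level. The sketch below assumes the OpenTelemetry Python SDK with a console exporter standing in for Jaeger or Zipkin, and the 5% ratio is an arbitrary example.

# Head-based trace sampling sketch (OpenTelemetry Python SDK assumed).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 5% of traces; ParentBased makes children follow the parent's decision,
# so a request is either traced across the whole chain or not at all.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.05)))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # swap for a Jaeger/Zipkin/OTLP exporter
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")
with tracer.start_as_current_span("place-order"):
    with tracer.start_as_current_span("charge-payment"):
        pass  # downstream call; context propagation carries the trace ID across services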

Practical Implementation

Step 1: Standardize Service Metrics

Every service should expose the same base metrics:

# HTTP services
http_requests_total{method, path, status}
http_request_duration_seconds{method, path}

# All services
service_up
service_health_status
process_cpu_usage
process_memory_bytes

Create a shared library or sidecar that handles this.
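
One way such a shared library might look, as a rough sketch: the function names are hypothetical, and note that on Linux, prometheus_client's default registry already exposes process_* metrics for you.

# Shared-library sketch: every service calls init_service_metrics() at startup so
# the base metrics above have identical names across the fleet.
from prometheus_client import Gauge, start_http_server

SERVICE_UP = Gauge("service_up", "1 while the service process is running")
HEALTH_STATUS = Gauge("service_health_status", "1 healthy, 0 degraded")

def init_service_metrics(port: int = 8000) -> None:
    """Call once at startup; serves /metrics and marks the service as up."""
    start_http_server(port)
    SERVICE_UP.set(1)
    HEALTH_STATUS.set(1)

def set_health(healthy: bool) -> None:
    """Call from the /health handler so dashboards and alerts see degradation."""
    HEALTH_STATUS.set(1 if healthy else 0)

# Usage in any service's entry point:
if __name__ == "__main__":
    init_service_metrics(port=8000)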

Step 2: Create One Overview Dashboard

One dashboard that shows:

┌─────────────────────────────────────────────────────┐
│ System Health Overview                               │
├─────────────────────────────────────────────────────┤
│ [Public Endpoint Status]  [Overall Error Rate]      │
├─────────────────────────────────────────────────────┤
│ Service         Status    RPS     Errors   Latency  │
│ user-service    ✓ OK      1.2k    0.1%     45ms    │
│ order-service   ✓ OK      800     0.2%     120ms   │
│ payment-service ⚠ WARN    300     2.1%     500ms   │
│ inventory       ✓ OK      1.5k    0.0%     30ms    │
├─────────────────────────────────────────────────────┤
│ [CPU Usage]  [Memory Usage]  [Request Volume]       │
└─────────────────────────────────────────────────────┘

This is where troubleshooting starts. Deep-dive dashboards come later.

Step 3: Alert on Symptoms, Not Causes

Bad alerts (causes):

  • CPU > 80%
  • Memory > 90%
  • Pod restarted

Good alerts (symptoms):

  • Error rate > 1% for 5 minutes
  • Latency p99 > 2 seconds for 5 minutes
  • Health check failing for 3 minutes

Symptom-based alerts tell you when users are affected. Cause-based alerts create noise.
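
In a Prometheus setup, "error rate > 1% for 5 minutes" usually lives in an alerting rule with a for: 5m clause. The sketch below just shows the symptom expression itself, evaluated through Prometheus's HTTP query API; the server address and threshold are assumptions.

# Symptom-based check sketch: share of 5xx responses over the last 5 minutes.
import requests

PROMETHEUS = "http://prometheus:9090"   # assumed in-cluster address

ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)

def error_rate_above(threshold: float = 0.01) -> bool:
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": ERROR_RATE_QUERY}, timeout=10)
    result = resp.json()["data"]["result"]
    return bool(result) and float(result[0]["value"][1]) > threshold

if __name__ == "__main__":
    if error_rate_above(0.01):
        print("Symptom alert: users are seeing errors")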

Step 4: Keep Alert Count Low

Target: 5-10 alerts per microservice maximum.

Critical (pages someone):

  • Service unavailable
  • Error rate > 5%
  • SLO burn rate high

Warning (notifies team):

  • Error rate > 1%
  • Latency degraded
  • Resource pressure

Info (logged):

  • Everything else

If you have 50 alerts per service, most are noise.

Step 5: Use Service Mesh for Cross-Cutting Concerns

If using Kubernetes, a service mesh (Istio, Linkerd) automatically provides:

  • Request metrics between services
  • Latency histograms
  • Error rates
  • Dependency maps

This removes the need to instrument every service for cross-service metrics.

What to Avoid

Avoid: Custom Metrics Explosion

The trap:

"Let's add a metric for everything!"
→ 500 custom metrics per service
→ 25,000 metrics across 50 services
→ Unmanageable, expensive, ignored

The solution:

  • Start with golden signals only
  • Add custom metrics when investigating specific issues
  • Delete metrics no one looks at

Avoid: Dashboard Sprawl

The trap:

"Each team creates their own dashboards"
→ 200 dashboards
→ No one knows which to look at
→ Duplicate, outdated, confusing

The solution:

  • One overview dashboard (source of truth)
  • Standardized service dashboard template
  • Periodic dashboard cleanup

Avoid: Log Everything

The trap:

"Let's log every request in detail!"
→ Terabytes of logs
→ High costs
→ Can't find anything in the noise

The solution:

  • Log errors always
  • Sample success logs (see the sketch after this list)
  • Structure logs (JSON) for searchability
  • Set retention policies
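
Here is a minimal sketch of that sampling rule as a Python logging filter; the 10% rate is an arbitrary assumption.

# Keep every WARNING-and-above record, sample INFO records at ~10%.
import logging
import random

class SampleInfoFilter(logging.Filter):
    def __init__(self, sample_rate: float = 0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True                          # errors and warnings are always logged
        return random.random() < self.sample_rate

logger = logging.getLogger("order-service")
logger.addFilter(SampleInfoFilter(0.1))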

Avoid: Tracing Everything

The trap:

"Let's trace 100% of requests!"
→ Massive overhead
→ Storage costs explode
→ Performance impact

The solution:

  • Sample traces (1-10% of requests)
  • Always trace errors
  • Trace specific flows when debugging

Simple Architecture Patterns

Pattern 1: Prometheus + Grafana (Self-Hosted)

Services → Prometheus → Grafana
    └──────────────────→ Alertmanager → PagerDuty/Slack

Good for: Cost-conscious teams, on-premise, full control
Trade-off: Operational overhead

Pattern 2: Cloud Native (Managed)

Services → Cloud Provider Metrics (CloudWatch, GCP Monitoring)
    └──→ Cloud Provider Alerting

Good for: Cloud-native teams, minimal ops, tight cloud integration
Trade-off: Vendor lock-in, can get expensive

Pattern 3: SaaS Observability (Datadog, New Relic)

Services → Agent/SDK → SaaS Platform
    └──→ Unified dashboards, alerts, traces

Good for: Teams without ops capacity, need full observability fast
Trade-off: Cost at scale

Pattern 4: Hybrid

Infrastructure → Prometheus (internal)
Application → SaaS APM (vendor)
Uptime → External monitoring (Wakestack)

Good for: Balancing cost, capability, and control
Trade-off: Multiple tools to manage

Checklist: Microservices Monitoring Without Complexity

Every Service Must Have

  • Health endpoint (/health)
  • Golden signal metrics (latency, traffic, errors, saturation)
  • Structured logging
  • Consistent naming conventions

Platform Must Provide

  • Metrics collection and storage
  • Centralized logging
  • Alerting infrastructure
  • Standard dashboard templates

Keep It Simple By

  • Starting with external health checks
  • Using golden signals, not hundreds of custom metrics
  • Creating one overview dashboard
  • Alerting on symptoms, not causes
  • Adding tracing only when needed
  • Reviewing and pruning regularly

Summary

Monitoring microservices without overcomplicating:

Focus on what matters:

  • Golden signals (latency, traffic, errors, saturation)
  • External health first
  • Symptoms, not causes

Keep it simple:

  • Standardize across services
  • One overview dashboard
  • 5-10 alerts per service max
  • Add tracing only when needed

Avoid complexity traps:

  • Custom metric explosion
  • Dashboard sprawl
  • Logging everything
  • Tracing 100%

Build incrementally:

  1. External health monitoring
  2. Service-level golden signals
  3. Infrastructure metrics
  4. Centralized logging
  5. Distributed tracing (when needed)

Effective microservices monitoring isn't about seeing everything—it's about seeing what matters quickly and acting on it. Start simple, add as needed, prune what doesn't help.

About the Author

Wakestack Team

Engineering Team

Frequently Asked Questions

Do microservices require distributed tracing?

Not always. Start with metrics and logs. Add distributed tracing when you have complex request flows spanning many services and need to debug latency or failures across the chain.

Should each microservice have its own monitoring?

Each service should emit metrics and logs, but use shared monitoring infrastructure. You want unified visibility across services, not isolated monitoring silos.

How many metrics should a microservice expose?

Focus on the golden signals: latency, traffic, errors, and saturation. Most services need 10-20 key metrics, not hundreds. More metrics do not automatically mean better visibility.
