How to Monitor Microservices Without Overcomplicating
Microservices monitoring can spiral into complexity quickly. Learn practical approaches to get visibility without overwhelming your team or budget.
Wakestack Team
Engineering Team
The Microservices Monitoring Trap
Microservices promise flexibility but deliver complexity. Monitoring is where this hits hardest:
Monolith: Monitor one application
10 microservices: Monitor 10 applications + their interactions
50 microservices: Monitor 50 applications + 200 potential interaction paths
It's easy to either:
- Under-monitor: Can't see what's happening
- Over-monitor: Drowning in data, alerts, and dashboards
The goal is effective monitoring without the complexity trap.
Start with the Golden Signals
Google's SRE book defines four golden signals. These work for any service:
1. Latency
How long requests take.
Measure: Request duration (p50, p95, p99)
Alert on: p99 > threshold for 5 minutes
2. Traffic
How much demand the service handles.
Measure: Requests per second
Alert on: Sudden drops (might indicate upstream failure)
3. Errors
Rate of failed requests.
Measure: Error rate (5xx responses / total requests)
Alert on: Error rate > 1% for 5 minutes
4. Saturation
How full resources are.
Measure: CPU, memory, connection pool usage
Alert on: Approaching limits (80-90%)
If every microservice exposes these four signals, you have meaningful visibility across your entire system.
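As a concrete sketch, here is roughly how the four signals map onto metric types using the Python prometheus_client library (one common option, not a requirement of this approach); the metric names, labels, and handler are illustrative:

# Golden signals with prometheus_client (names and labels are illustrative).
import time
from prometheus_client import Counter, Gauge, Histogram

# Latency: request duration histogram (p50/p95/p99 come from the buckets).
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds", "Request duration", ["method", "path"]
)

# Traffic and errors: one counter labelled by status code.
# Traffic = rate of all requests; errors = rate of 5xx-labelled requests.
REQUESTS_TOTAL = Counter(
    "http_requests_total", "Total requests", ["method", "path", "status"]
)

# Saturation: how full a constrained resource is (here, a connection pool).
POOL_IN_USE = Gauge("db_pool_connections_in_use", "DB connections in use")

def handle_request(method, path):
    start = time.monotonic()
    status = "200"  # placeholder: call the real handler and capture its status
    REQUESTS_TOTAL.labels(method, path, status).inc()
    REQUEST_DURATION.labels(method, path).observe(time.monotonic() - start)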
Monitoring Layers (In Priority Order)
Layer 1: External Health (Required)
Can users reach and use your system?
What to monitor:
- Public endpoint availability
- End-to-end transaction success
- Response time from user perspective
How:
- Uptime monitoring (Wakestack, Better Stack)
- Synthetic transactions
- Real user monitoring (RUM)
Why first: If users can't use your product, nothing else matters.
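For example, a synthetic check is just a scripted request against a public endpoint that asserts status and latency; the URL and thresholds below are placeholders:

# Minimal synthetic check: fetch a public endpoint, assert status and latency.
import time
import urllib.request

URL = "https://status.example.com/health"  # placeholder public endpoint
TIMEOUT_S = 10
MAX_LATENCY_S = 2.0

start = time.monotonic()
with urllib.request.urlopen(URL, timeout=TIMEOUT_S) as resp:
    ok = resp.status == 200
latency = time.monotonic() - start

if not ok or latency > MAX_LATENCY_S:
    print(f"CHECK FAILED: status ok={ok}, latency={latency:.2f}s")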
Layer 2: Service Health (Required)
Is each service working?
What to monitor:
- Health check endpoints
- Golden signals per service
- Dependency status
How:
- Each service exposes /health
- Prometheus /metrics endpoint
- Service mesh telemetry
Why second: Service health helps you understand where problems are.
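A health check endpoint can be very small. The sketch below uses Flask and a hypothetical Postgres host purely for illustration; it reports dependency status and returns 503 when a dependency is down:

# Minimal /health endpoint (Flask is just one option).
import socket
from flask import Flask, jsonify

app = Flask(__name__)

def database_reachable(host="db.internal", port=5432, timeout=1.0):
    """Cheap dependency probe: can we open a TCP connection to the database?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

@app.route("/health")
def health():
    checks = {"database": database_reachable()}
    healthy = all(checks.values())
    status_code = 200 if healthy else 503
    return jsonify(status="ok" if healthy else "degraded", checks=checks), status_code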
Layer 3: Infrastructure (Required)
Are resources sufficient?
What to monitor:
- CPU, memory, disk per host/container
- Network connectivity
- Database performance
How:
- Agent-based monitoring
- Container orchestrator metrics
- Database-specific monitoring
Why third: Infrastructure problems cause service problems.
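In practice an agent (node_exporter, a cloud agent, or the orchestrator itself) collects these continuously; the sketch below only shows what a saturation check boils down to, using the psutil package (an assumption, not part of any stack named above):

# Host saturation snapshot with psutil (agents do this continuously).
import psutil

cpu = psutil.cpu_percent(interval=1)       # % CPU over a 1-second sample
memory = psutil.virtual_memory().percent   # % RAM in use
disk = psutil.disk_usage("/").percent      # % of root volume used

for name, value in {"cpu": cpu, "memory": memory, "disk": disk}.items():
    if value >= 90:
        print(f"SATURATION WARNING: {name} at {value:.0f}%")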
Layer 4: Logs (Recommended)
What happened?
What to monitor:
- Structured logs from each service
- Error logs and stack traces
- Audit trails
How:
- Centralized log aggregation
- Structured logging (JSON)
- Log-based alerts for errors
Why recommended: Logs help debug when metrics show a problem.
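A minimal structured-logging setup needs nothing beyond the Python standard library; the field names below are illustrative:

# Structured (JSON) logging with the standard library only.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:  # include stack traces for errors
            entry["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("order-service").info("order created")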
Layer 5: Distributed Tracing (Optional)
How do requests flow?
What to monitor:
- Request paths across services
- Latency per service in chain
- Error propagation
How:
- Jaeger, Zipkin, or commercial APM
- Trace context propagation
- Sampling for production
Why optional: Valuable for complex flows, but adds overhead. Add when you need it.
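If you do adopt tracing, sampling is a one-line decision at SDK setup. Here is a sketch using the OpenTelemetry Python SDK (one option among several; the 5% ratio and console exporter are illustrative, and a real setup would export to Jaeger, Zipkin, or an APM backend):

# Trace a small fraction of requests, honouring the caller's sampling decision.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.05)))  # ~5%
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")
with tracer.start_as_current_span("charge-card"):
    pass  # downstream calls made here inherit the trace context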
Practical Implementation
Step 1: Standardize Service Metrics
Every service should expose the same base metrics:
# HTTP services
http_requests_total{method, path, status}
http_request_duration_seconds{method, path}
# All services
service_up
service_health_status
process_cpu_usage
process_memory_bytes
Create a shared library or sidecar that handles this.
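A sketch of that shared helper, again with prometheus_client (the module name and port are illustrative): a tiny module every service imports so the metric names above stay identical everywhere.

# service_metrics.py -- shared by every service so metric names stay identical.
from prometheus_client import Gauge, start_http_server

SERVICE_UP = Gauge("service_up", "1 while the service is running")
HEALTH_STATUS = Gauge("service_health_status", "1 healthy, 0 unhealthy")

def init_metrics(port=9100):
    """Expose /metrics and mark the service as up.

    On Linux, prometheus_client also exports process_* CPU and memory
    metrics by default, covering the basic saturation signals.
    """
    start_http_server(port)  # serves /metrics for Prometheus to scrape
    SERVICE_UP.set(1)
    HEALTH_STATUS.set(1)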
Step 2: Create One Overview Dashboard
One dashboard that shows:
┌────────────────────────────────────────────────────┐
│ System Health Overview                             │
├────────────────────────────────────────────────────┤
│ [Public Endpoint Status]   [Overall Error Rate]    │
├────────────────────────────────────────────────────┤
│ Service           Status   RPS     Errors  Latency │
│ user-service      ✓ OK     1.2k    0.1%    45ms    │
│ order-service     ✓ OK     800     0.2%    120ms   │
│ payment-service   ⚠ WARN   300     2.1%    500ms   │
│ inventory         ✓ OK     1.5k    0.0%    30ms    │
├────────────────────────────────────────────────────┤
│ [CPU Usage]   [Memory Usage]   [Request Volume]    │
└────────────────────────────────────────────────────┘
This is where troubleshooting starts. Deep-dive dashboards come later.
Step 3: Alert on Symptoms, Not Causes
Bad alerts (causes):
- CPU > 80%
- Memory > 90%
- Pod restarted
Good alerts (symptoms):
- Error rate > 1% for 5 minutes
- Latency p99 > 2 seconds for 5 minutes
- Health check failing for 3 minutes
Symptom-based alerts tell you when users are affected. Cause-based alerts create noise.
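In a Prometheus setup this is normally expressed as an alerting rule, but the symptom itself is just an error-rate expression over the request counter from Step 1. The sketch below evaluates it against Prometheus's query API (the Prometheus address is a placeholder):

# Evaluate a symptom (error rate over 5 minutes) via the Prometheus query API.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus:9090"  # placeholder address

QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)

url = PROMETHEUS + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url, timeout=10) as resp:
    result = json.load(resp)["data"]["result"]

error_rate = float(result[0]["value"][1]) if result else 0.0
if error_rate > 0.01:  # symptom threshold: more than 1% of requests failing
    print(f"ALERT: error rate {error_rate:.2%} over the last 5 minutes")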
Step 4: Keep Alert Count Low
Target: 5-10 alerts per microservice maximum.
Critical (pages someone):
- Service unavailable
- Error rate > 5%
- SLO burn rate high
Warning (notifies team):
- Error rate > 1%
- Latency degraded
- Resource pressure
Info (logged):
- Everything else
If you have 50 alerts per service, most are noise.
Step 5: Use Service Mesh for Cross-Cutting Concerns
If using Kubernetes, a service mesh (Istio, Linkerd) automatically provides:
- Request metrics between services
- Latency histograms
- Error rates
- Dependency maps
This removes the need to instrument every service for cross-service metrics.
What to Avoid
Avoid: Custom Metrics Explosion
The trap:
"Let's add a metric for everything!"
→ 500 custom metrics per service
→ 25,000 metrics across 50 services
→ Unmanageable, expensive, ignored
The solution:
- Start with golden signals only
- Add custom metrics when investigating specific issues
- Delete metrics no one looks at
Avoid: Dashboard Sprawl
The trap:
"Each team creates their own dashboards"
→ 200 dashboards
→ No one knows which to look at
→ Duplicate, outdated, confusing
The solution:
- One overview dashboard (source of truth)
- Standardized service dashboard template
- Periodic dashboard cleanup
Avoid: Log Everything
The trap:
"Let's log every request in detail!"
→ Terabytes of logs
→ High costs
→ Can't find anything in the noise
The solution:
- Log errors always
- Sample success logs (see the sketch after this list)
- Structure logs (JSON) for searchability
- Set retention policies
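Sampling success logs can be a small logging filter: keep every warning and error, and only a fraction of routine entries (the 1% rate below is illustrative):

# Keep all warnings/errors, sample routine success logs.
import logging
import random

class SampledInfoFilter(logging.Filter):
    def __init__(self, keep_ratio=0.01):
        super().__init__()
        self.keep_ratio = keep_ratio

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # always keep errors and warnings
        return random.random() < self.keep_ratio  # sample everything else

logger = logging.getLogger("order-service")
logger.addFilter(SampledInfoFilter(keep_ratio=0.01))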
Avoid: Tracing Everything
The trap:
"Let's trace 100% of requests!"
→ Massive overhead
→ Storage costs explode
→ Performance impact
The solution:
- Sample traces (1-10% of requests)
- Always trace errors
- Trace specific flows when debugging
Simple Architecture Patterns
Pattern 1: Prometheus + Grafana (Self-Hosted)
Services → Prometheus → Grafana
           └──→ Alertmanager → PagerDuty/Slack
Good for: Cost-conscious teams, on-premise, full control
Trade-off: Operational overhead
Pattern 2: Cloud Native (Managed)
Services → Cloud Provider Metrics (CloudWatch, GCP Monitoring)
           └──→ Cloud Provider Alerting
Good for: Cloud-native teams, minimal ops, tight cloud integration
Trade-off: Vendor lock-in, can get expensive
Pattern 3: SaaS Observability (Datadog, New Relic)
Services → Agent/SDK → SaaS Platform
                       └──→ Unified dashboards, alerts, traces
Good for: Teams without ops capacity, need full observability fast
Trade-off: Cost at scale
Pattern 4: Hybrid
Infrastructure → Prometheus (internal)
Application → SaaS APM (vendor)
Uptime → External monitoring (Wakestack)
Good for: Balancing cost, capability, and control
Trade-off: Multiple tools to manage
Checklist: Microservices Monitoring Without Complexity
Every Service Must Have
- Health endpoint (/health)
- Golden signal metrics (latency, traffic, errors, saturation)
- Structured logging
- Consistent naming conventions
Platform Must Provide
- Metrics collection and storage
- Centralized logging
- Alerting infrastructure
- Standard dashboard templates
Keep It Simple By
- Starting with external health checks
- Using golden signals, not hundreds of custom metrics
- Creating one overview dashboard
- Alerting on symptoms, not causes
- Adding tracing only when needed
- Reviewing and pruning regularly
Summary
Monitoring microservices without overcomplicating:
Focus on what matters:
- Golden signals (latency, traffic, errors, saturation)
- External health first
- Symptoms, not causes
Keep it simple:
- Standardize across services
- One overview dashboard
- 5-10 alerts per service max
- Add tracing only when needed
Avoid complexity traps:
- Custom metric explosion
- Dashboard sprawl
- Logging everything
- Tracing 100%
Build incrementally:
- External health monitoring
- Service-level golden signals
- Infrastructure metrics
- Centralized logging
- Distributed tracing (when needed)
Effective microservices monitoring isn't about seeing everything—it's about seeing what matters quickly and acting on it. Start simple, add as needed, prune what doesn't help.
Frequently Asked Questions
Do microservices require distributed tracing?
Not always. Start with metrics and logs. Add distributed tracing when you have complex request flows spanning many services and need to debug latency or failures across the chain.
Should each microservice have its own monitoring?
Each service should emit metrics and logs, but use shared monitoring infrastructure. You want unified visibility across services, not isolated monitoring silos.
How many metrics should a microservice expose?
Focus on the golden signals: latency, traffic, errors, and saturation. Most services need 10-20 key metrics, not hundreds. More metrics doesn't mean better visibility.