The Complete Guide to Server Monitoring (2026)

Everything you need to know about server monitoring: which metrics to track, which tools to use, agent-based vs agentless approaches, and how to combine it with uptime monitoring for complete visibility.

Wakestack Team

Engineering Team

10 min read

Server monitoring provides visibility into what's happening inside your infrastructure. While uptime monitoring tells you IF a service is responding, server monitoring tells you WHY it isn't: the CPU is overloaded, memory is exhausted, a disk is full, or a process has crashed.

This guide covers everything: what to monitor, how monitoring works, tool options, and how to combine server metrics with uptime monitoring.

Table of Contents

  1. What Is Server Monitoring
  2. Why Server Monitoring Matters
  3. Key Metrics to Monitor
  4. Agent-Based vs Agentless
  5. Setting Up Server Monitoring
  6. Alerting on Server Metrics
  7. Combining with Uptime Monitoring
  8. Tools and Options
  9. Common Mistakes

What Is Server Monitoring

Server monitoring is the continuous collection and analysis of metrics from your servers:

Server Metrics:
├── CPU usage (system, user, idle)
├── Memory consumption (used, available, swap)
├── Disk space and I/O
├── Network traffic (in/out)
├── Running processes
└── System load

These metrics are typically collected by software (an agent) running on your servers and sent to a monitoring platform for visualization and alerting.
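To make "collecting metrics" concrete, here is a minimal sketch of one collection step using only the Python standard library. The function name `collect_snapshot` and the dictionary keys are illustrative assumptions, not the format of any particular agent, and `os.getloadavg()` is available on Unix-like systems only.

```python
import os
import shutil
import time


def collect_snapshot(path="/"):
    """Return a small dict of host metrics (hypothetical layout)."""
    disk = shutil.disk_usage(path)           # total, used, free in bytes
    load1, _load5, _load15 = os.getloadavg() # 1, 5 and 15 minute load averages
    cores = os.cpu_count() or 1
    return {
        "timestamp": time.time(),
        # Approximate: df may report a slightly different percentage
        # because it excludes blocks reserved for root.
        "disk_used_pct": round(100 * disk.used / disk.total, 1),
        "load_1m": load1,
        "load_per_core": round(load1 / cores, 2),
    }


if __name__ == "__main__":
    print(collect_snapshot())
```

A real agent would run something like this on a fixed interval and ship each snapshot to the monitoring platform.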

Learn more: What Is Server Monitoring vs Website Monitoring


Why Server Monitoring Matters

The Blind Spot Problem

Without server monitoring, incidents look like this:

Alert: API timeout
├── What's wrong? Unknown
├── SSH into server
├── Run htop → CPU at 45%
├── Run df -h → Disk at 45%
├── Run free -m → Memory at 92%
├── Root cause found: Memory exhaustion
└── Time to diagnose: 15 minutes

With server monitoring:

Alert: API timeout
├── Dashboard: CPU 45%, Memory 92%, Disk 45%
├── Root cause: Memory exhaustion
└── Time to diagnose: 30 seconds

Learn more: Why Most Uptime Tools Miss Server Failures

Predictive Awareness

Server monitoring catches problems before they cause outages:

Warning: Disk at 85%
├── Current: 85%
├── Growth rate: 2%/day
├── Time to 100%: ~7 days
└── Action: Clean up or expand storage
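The projection in the warning above is a simple linear extrapolation. A minimal sketch, assuming growth stays constant (real disks rarely grow that evenly, so treat the result as a rough estimate); the function name is illustrative:

```python
def days_until_full(used_pct, growth_pct_per_day, limit_pct=100.0):
    """Linearly extrapolate how many days remain until the disk reaches limit_pct."""
    if growth_pct_per_day <= 0:
        return None  # usage is flat or shrinking, so no projected fill date
    remaining = max(limit_pct - used_pct, 0.0)
    return remaining / growth_pct_per_day


if __name__ == "__main__":
    # The figures from the warning above: 85% used, growing 2% per day.
    print(days_until_full(85, 2))
```

With the numbers above this gives 7.5 days, in line with the roughly one week shown in the warning.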

Incident Context

During incidents, server metrics provide essential context:

Symptom             | Server Metric | Root Cause
--------------------|---------------|-------------------
Slow response       | High CPU      | Compute bottleneck
Random errors       | High memory   | Memory pressure
Write failures      | High disk     | Storage exhausted
Connection timeouts | High load     | Overloaded system

Key Metrics to Monitor

CPU Metrics

Metric       | Description                   | Alert Threshold
-------------|-------------------------------|------------------
CPU Usage    | Overall CPU utilization       | > 85% sustained
System CPU   | Kernel/system processes       | > 30% (unusual)
User CPU     | Application processes         | Context-dependent
IO Wait      | Waiting on disk I/O           | > 20%
Load Average | System load relative to cores | > 2x core count

What high CPU means:

  • Compute-bound workload
  • Runaway process
  • Traffic spike
  • Inefficient code
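On Linux, CPU percentages like the ones in the table are usually derived from two readings of the aggregate `cpu` line in `/proc/stat`, since the kernel only exposes cumulative counters. The sketch below assumes that approach and counts I/O wait as idle time, which is a common convention rather than the only one; the function names are illustrative.

```python
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    fields = line.split()
    return dict(zip(CPU_FIELDS, (int(v) for v in fields[1:9])))


def cpu_percentages(before, after):
    """Compute overall usage and I/O wait between two samples, in percent."""
    delta = {name: after[name] - before[name] for name in CPU_FIELDS}
    total = sum(delta.values())
    if total <= 0:
        return {"usage": 0.0, "iowait": 0.0}
    not_busy = delta["idle"] + delta["iowait"]  # iowait treated as idle here
    return {
        "usage": round(100 * (total - not_busy) / total, 1),
        "iowait": round(100 * delta["iowait"] / total, 1),
    }
```

In practice the two samples would be taken a few seconds apart by reading the first line of `/proc/stat` each time.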

Memory Metrics

Metric           | Description          | Alert Threshold
-----------------|----------------------|------------------
Used Memory      | Active memory usage  | > 90%
Available Memory | Memory free for use  | < 10%
Swap Usage       | Memory paged to disk | Any sustained use
Buffers/Cache    | Filesystem cache     | Generally good

What high memory means:

  • Memory leak
  • Undersized instance
  • Too many processes
  • Need for optimization
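The reason filesystem cache is "generally good" is that the kernel can reclaim it on demand, so used memory is better measured against `MemAvailable` than against free memory. A minimal sketch, assuming a Linux host and the `/proc/meminfo` text format (values in kB); the function names are illustrative:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: integer value} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_summary(info):
    """Summarise memory pressure using MemAvailable rather than MemFree."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "used_pct": round(100 * (total - available) / total, 1),
        "available_pct": round(100 * available / total, 1),
        "swap_used_kb": swap_used,
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(memory_summary(parse_meminfo(f.read())))
```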

Disk Metrics

Metric         | Description              | Alert Threshold
---------------|--------------------------|------------------
Disk Usage     | Percentage of space used | > 85%
Disk I/O Read  | Data read per second     | Context-dependent
Disk I/O Write | Data written per second  | Context-dependent
Inode Usage    | File count capacity      | > 85%

What high disk means:

  • Growing logs
  • Accumulated data
  • Need for cleanup
  • Need for expansion
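Inode usage is easy to overlook because a disk can run out of inodes while it still has free space (for example, with millions of tiny files). A minimal sketch of the calculation, assuming a Unix-like host where `os.statvfs` is available; the function name is illustrative:

```python
import os


def inode_usage_pct(stats):
    """Percentage of inodes in use, given an os.statvfs() result."""
    if stats.f_files == 0:
        return None  # some filesystems do not report a fixed inode count
    used = stats.f_files - stats.f_ffree
    return round(100 * used / stats.f_files, 1)


if __name__ == "__main__":
    print(inode_usage_pct(os.statvfs("/")))
```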

Network Metrics

Metric      | Description      | Alert Threshold
------------|------------------|----------------
Network In  | Incoming traffic | Unusual spikes
Network Out | Outgoing traffic | Unusual spikes
Errors      | Packet errors    | Any sustained
Dropped     | Dropped packets  | Any sustained
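Error and drop counters are cumulative, so "any sustained" in practice means "the counter keeps growing between samples". The sketch below assumes a Linux host and the `/proc/net/dev` layout (two header lines, then sixteen counters per interface); the function names are illustrative.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev text into per-interface error and drop counters."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, sep, data = line.partition(":")
        fields = data.split()
        if not sep or len(fields) < 16:
            continue
        counters[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return counters


def interfaces_with_new_problems(before, after):
    """Return interfaces whose error or drop counters grew between two samples."""
    flagged = []
    for name, now in after.items():
        earlier = before.get(name)
        if earlier and any(now[key] > earlier[key] for key in now):
            flagged.append(name)
    return sorted(flagged)
```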

Process Metrics

Metric          | Description               | Why It Matters
----------------|---------------------------|------------------
Process running | Is expected process alive | Core health check
Process CPU     | CPU usage per process     | Identify hogs
Process memory  | Memory per process        | Identify leaks
Process count   | Number of processes       | Worker scaling
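The "process running" check boils down to comparing a list of expected process names against what is actually running. A minimal sketch, assuming a Linux host where each process exposes its name in `/proc/<pid>/comm` (the kernel truncates that name to 15 characters); the function names are illustrative:

```python
import os


def running_process_names(proc_root="/proc"):
    """Collect the names of running processes from /proc/<pid>/comm."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                names.add(f.read().strip())
        except OSError:
            continue  # the process exited between listing and reading
    return names


def missing_processes(expected, running):
    """Return the expected process names that are not currently running."""
    return sorted(set(expected) - set(running))


if __name__ == "__main__":
    print(missing_processes(["nginx", "node"], running_process_names()))
```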

Agent-Based vs Agentless

Agent-Based Monitoring

An agent runs on your server and collects metrics locally:

Your Server
├── Agent (lightweight process)
│   ├── Collects CPU, memory, disk metrics
│   ├── Monitors running processes
│   └── Sends data to monitoring service
└── Your applications

Pros:

  • Detailed metrics
  • Efficient (local collection)
  • Works behind firewalls
  • Process-level visibility

Cons:

  • Requires installation
  • Agent maintenance

Learn more: Agent-Based Monitoring

Agentless Monitoring

Metrics collected remotely via SNMP, SSH, or APIs:

Monitoring Server → SSH/SNMP → Your Server
                              ├── Run commands
                              └── Parse output

Pros:

  • No installation on target
  • Works with managed devices

Cons:

  • Less detailed
  • Network dependent
  • Credential management
  • Scales poorly
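To illustrate the "run commands, parse output" flow, here is a minimal sketch of an agentless disk check. It assumes key-based SSH access is already configured and that the target supports `df -P` (the POSIX output format); the host name passed to `fetch_df` is hypothetical. It also shows why this approach is fragile: every metric needs its own parser, and every poll needs a network round trip and credentials.

```python
import subprocess


def fetch_df(host):
    """Run `df -P` on a remote host over SSH and return its raw output."""
    result = subprocess.run(
        ["ssh", host, "df", "-P"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return result.stdout


def parse_df(output):
    """Parse `df -P` output into {mount_point: used_percent}.

    Mount points containing spaces are not handled in this sketch.
    """
    usage = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) < 6:
            continue
        usage[parts[5]] = int(parts[4].rstrip("%"))
    return usage
```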

Learn more: Why Agentless Monitoring Fails at Scale

Recommendation

For servers you control, use agent-based monitoring. Modern agents are lightweight (roughly 10-20 MB of memory), secure, and give far more detailed visibility than agentless collection.


Setting Up Server Monitoring

Step 1: Choose Your Tool

Need                       | Recommendation
---------------------------|--------------------
Uptime + servers           | Wakestack
Servers only (self-hosted) | Netdata, Prometheus
Enterprise full-stack      | Datadog, New Relic

Step 2: Install Agent

Example with Wakestack:

curl -sSL https://wakestack.co.uk/install.sh | bash

The agent:

  • Installs as a system service
  • Starts automatically
  • Uses minimal resources
  • Reports to your dashboard

Step 3: Configure What to Monitor

Collection is typically automatic, but you may want to:

  • Set specific process monitoring
  • Adjust collection intervals
  • Configure custom metrics

Step 4: Set Up Alerts

Define thresholds:

CPU > 85% for 5 minutes → Warning
CPU > 95% for 5 minutes → Critical

Memory > 85% → Warning
Memory > 95% → Critical

Disk > 80% → Warning
Disk > 90% → Critical
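Expressed as data, the thresholds above might look like the following. This is a sketch of the idea only, not the configuration format of any specific tool; the `THRESHOLDS` layout and the `severity` function are illustrative, and the duration requirement for CPU is left out here.

```python
THRESHOLDS = {
    "cpu": {"warning": 85, "critical": 95},
    "memory": {"warning": 85, "critical": 95},
    "disk": {"warning": 80, "critical": 90},
}


def severity(metric, value, thresholds=THRESHOLDS):
    """Map a metric reading (in percent) to 'ok', 'warning' or 'critical'."""
    levels = thresholds[metric]
    if value > levels["critical"]:
        return "critical"
    if value > levels["warning"]:
        return "warning"
    return "ok"
```

Checking the critical level first matters: a reading of 96% exceeds both limits and should be reported once, at the higher severity.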

Step 5: Create Dashboards

Organize visibility:

Production Dashboard:
├── api-prod-01
│   ├── CPU: 45%
│   ├── Memory: 62%
│   ├── Disk: 55%
│   └── Processes: nginx, node
├── api-prod-02
│   └── ...
└── db-prod-01
    └── ...

Alerting on Server Metrics

What to Alert On

Metric       | Alert Level | Threshold                | Why
-------------|-------------|--------------------------|---------------------
CPU          | Warning     | > 85% (5 min)            | Performance impact
CPU          | Critical    | > 95% (5 min)            | Imminent problems
Memory       | Warning     | > 85%                    | OOM risk approaching
Memory       | Critical    | > 95%                    | OOM likely
Disk         | Warning     | > 80%                    | Plan expansion
Disk         | Critical    | > 90%                    | Urgent action
Process down | Critical    | Expected process missing | Service impact

What NOT to Alert On

  • Normal fluctuations (CPU spike to 70% for 30 seconds)
  • Scheduled high usage (batch jobs)
  • Non-critical systems during off-hours
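One way to ignore short spikes while still catching real problems is to fire only when a threshold has been exceeded continuously for a set duration. A minimal sketch of that logic, with illustrative class and method names (timestamps are in seconds):

```python
class SustainedThreshold:
    """Fire only after a value has stayed above a threshold for a full window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.breach_started = None  # timestamp of the first sample in the current breach

    def update(self, timestamp, value):
        """Record one sample and return True if the alert should be firing."""
        if value <= self.threshold:
            self.breach_started = None  # any dip below the threshold resets the clock
            return False
        if self.breach_started is None:
            self.breach_started = timestamp
        return timestamp - self.breach_started >= self.window_seconds


if __name__ == "__main__":
    rule = SustainedThreshold(threshold=85, window_seconds=300)
    for ts, cpu in [(0, 90), (120, 92), (300, 91), (330, 70)]:
        print(ts, cpu, rule.update(ts, cpu))
```

A 30-second spike never accumulates enough time above the threshold to fire, while five minutes of sustained load does.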

Learn more: The Difference Between Monitoring and Alerting

Alert Routing

Server Type             | Alert Destination
------------------------|------------------
Production critical     | PagerDuty + Slack
Production non-critical | Slack only
Staging                 | Email digest
Development             | Dashboard only
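The routing table translates naturally into a lookup. In this sketch the channel identifiers and the `destinations` function are hypothetical, and an empty list stands for "dashboard only" (nothing is pushed anywhere).

```python
ROUTES = {
    "production critical": ["pagerduty", "slack"],
    "production non-critical": ["slack"],
    "staging": ["email-digest"],
    "development": [],  # dashboard only: no notification is sent
}


def destinations(server_type, routes=ROUTES):
    """Return the notification channels for a server type (none if unknown)."""
    return routes.get(server_type, [])
```

Whether an unknown server type should stay silent or fall back to a default channel is a design choice; this sketch stays silent.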

Combining with Uptime Monitoring

The most powerful setup combines both:

Unified Dashboard

Production API Server
├── External Checks
│   ├── HTTP /health → 200 OK (145ms)
│   └── HTTP /api/status → 200 OK (89ms)
│
└── Server Metrics (via agent)
    ├── CPU: 45%
    ├── Memory: 62%
    ├── Disk: 55%
    └── Processes: nginx ✓, node ✓

Correlated Alerting

When an uptime check fails, immediately see server context:

Alert: api.example.com/health timeout

Context:
├── Server: api-prod-01
├── CPU: 98% ← Root cause
├── Memory: 72%
├── Disk: 45%
└── Process: node at 95% CPU
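Conceptually, correlated alerting means attaching the latest server metrics to the failed check and flagging whichever metric is furthest over its limit. The sketch below is one simple way to do that, not how any particular product implements it; the function names, the alert layout, and the default limits are assumptions for illustration.

```python
DEFAULT_LIMITS = {"cpu": 85, "memory": 85, "disk": 80}


def likely_cause(metrics, limits=DEFAULT_LIMITS):
    """Return the metric that exceeds its limit by the widest margin, if any."""
    over = {
        name: metrics[name] - limit
        for name, limit in limits.items()
        if name in metrics and metrics[name] > limit
    }
    if not over:
        return None
    return max(over, key=over.get)


def enrich_alert(check, server, metrics):
    """Attach server context to a failed uptime check."""
    return {
        "check": check,
        "server": server,
        "metrics": metrics,
        "likely_cause": likely_cause(metrics),
    }


if __name__ == "__main__":
    print(enrich_alert(
        "api.example.com/health", "api-prod-01",
        {"cpu": 98, "memory": 72, "disk": 45},
    ))
```

For the example above, CPU is the only metric over its limit, so it is flagged as the likely cause.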

Learn more: Why Uptime Checks Alone Don't Work | Infrastructure-Aware Monitoring

Nested Host Organization

Organize monitors under their servers:

api-prod-01 (Server)
├── Agent metrics
└── Monitors
    ├── /health
    ├── /api/v1/status
    └── Port 5432 (database)

Learn more: How Nested Infrastructure Changes Monitoring


Tools and Options

Wakestack

Uptime + server monitoring + status pages:

Feature             | Included
--------------------|---------
HTTP/TCP monitoring | Yes
Server agent        | Yes
CPU, memory, disk   | Yes
Process monitoring  | Yes
Status pages        | Yes
Nested hosts        | Yes

Try Wakestack free

Self-Hosted Options

Tool                 | Type                 | Best For
---------------------|----------------------|---------------------
Prometheus + Grafana | Metrics + dashboards | Custom setups
Netdata              | Real-time metrics    | Detailed server data
Uptime Kuma          | Uptime only          | Self-hosted uptime

Enterprise Options

Tool      | Strengths
----------|----------------------------
Datadog   | Full observability platform
New Relic | APM + infrastructure
Dynatrace | Enterprise with AI

Learn more: Best Uptime Monitoring Tools | Hidden Costs of Datadog


Common Mistakes

1. Only Monitoring Uptime

Uptime alone doesn't explain WHY things fail.

Fix: Add server monitoring for root cause visibility.

2. Alerting on Every Spike

Short CPU spikes are normal; alerting on them causes fatigue.

Fix: Require sustained duration (e.g., > 85% for 5 minutes).

3. Same Thresholds for All Servers

A database server's normal memory usage differs from a web server's.

Fix: Tune thresholds per server role.
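Per-role tuning can be as simple as a set of defaults plus overrides. In this sketch the role names and the override values are illustrative assumptions (for example, a database that deliberately keeps most of its memory in use), not recommendations.

```python
DEFAULTS = {"cpu": 85, "memory": 85, "disk": 80}

# Hypothetical overrides: only the values that differ from the defaults.
ROLE_OVERRIDES = {
    "database": {"memory": 95},
    "batch": {"cpu": 95},
}


def thresholds_for(role):
    """Return warning thresholds for a server role, falling back to the defaults."""
    return {**DEFAULTS, **ROLE_OVERRIDES.get(role, {})}
```

Keeping only the differences in `ROLE_OVERRIDES` means a change to a default automatically applies to every role that has not overridden it.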

4. Ignoring Disk Growth

Disk filling up is preventable but often missed.

Fix: Monitor disk usage with a warning threshold at 80%.

5. No Process Monitoring

A server can look "healthy" while a critical process is down.

Fix: Monitor that expected processes are running.

6. Separate Dashboards

Uptime tool + server tool + logs = context switching during incidents.

Fix: Use unified monitoring (Wakestack) or correlate manually.


Get Started

Ready to set up server monitoring? Wakestack offers:

  • Server agent — Lightweight Go agent for Linux
  • Key metrics — CPU, memory, disk, network
  • Process monitoring — Track running processes
  • Combined view — Uptime + server metrics together
  • Nested hosts — Organize monitors under servers

Start monitoring for free — Install the agent in under 2 minutes.

Frequently Asked Questions

What is server monitoring?

Server monitoring is the continuous tracking of server health metrics like CPU usage, memory consumption, disk space, and running processes. It provides visibility into what's happening inside your servers, complementing external uptime monitoring.

What's the difference between server monitoring and uptime monitoring?

Uptime monitoring checks if services are responding from outside (can users reach it?). Server monitoring tracks internal metrics (CPU, memory, disk) to show WHY services might be slow or failing. You typically need both for complete visibility.

Do I need an agent for server monitoring?

For detailed metrics (CPU, memory, disk, processes), yes. Agents run on your servers and collect metrics locally. Agentless approaches exist but provide less detail and have scaling challenges. Modern agents are lightweight and secure.
