Why Agentless Monitoring Fails at Scale
Agentless monitoring seems simpler, but it creates blind spots as infrastructure grows. Learn why agent-based monitoring becomes essential at scale.
Wakestack Team
Engineering Team
Agentless monitoring looks simpler—no installation, no agent updates, no extra processes. But that simplicity has a cost: blind spots that grow with your infrastructure. What works for 5 servers becomes a liability at 50.
Here's why agentless approaches hit a wall, and when you should make the switch.
How Agentless Monitoring Works
Agentless monitoring checks systems from outside:
Monitoring Server → Target System
│
├── HTTP check: GET /health
├── TCP check: Connect to port 5432
├── SNMP poll: Request system metrics
└── SSH probe: Run remote command
No software installed on the target. Everything happens over the network.
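As a concrete sketch, here's roughly what the two most common agentless checks look like in Go. The hostnames and ports are placeholders, not any real product's code:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// HTTP check: all the prober learns is a status code and latency.
	client := &http.Client{Timeout: 5 * time.Second}
	start := time.Now()
	resp, err := client.Get("https://api.example.com/health")
	if err != nil {
		fmt.Println("HTTP check failed:", err)
	} else {
		resp.Body.Close()
		fmt.Printf("HTTP check: %d in %v\n", resp.StatusCode, time.Since(start))
	}

	// TCP check: all the prober learns is "open or closed".
	conn, err := net.DialTimeout("tcp", "db.example.com:5432", 5*time.Second)
	if err != nil {
		fmt.Println("TCP check failed:", err)
	} else {
		conn.Close()
		fmt.Println("TCP check: port 5432 open")
	}
}
```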
Common Agentless Methods
| Method | What It Checks | Limitation |
|---|---|---|
| HTTP/HTTPS | Web endpoints | Only sees HTTP responses |
| TCP | Port availability | Only sees "open or closed" |
| SNMP | Device metrics | Limited metrics, security concerns |
| SSH | Remote commands | Requires credentials, adds latency |
| Cloud APIs | Provider metrics | Limited to what cloud exposes |
Why Agentless Seems Attractive
For small deployments, agentless monitoring wins on:
- No installation — Just point at endpoints
- No maintenance — No agent updates
- Quick setup — Monitoring in minutes
- No footprint — Nothing running on target systems
At 5 servers, these benefits are real.
Where Agentless Breaks Down
Problem 1: Limited Visibility
Agentless can only see what's exposed externally:
What agentless sees:
Server: api-prod-01
├── HTTP /health: 200 OK
├── Port 443: Open
└── Ping: 15ms
What's actually happening:
Server: api-prod-01
├── HTTP /health: 200 OK
├── CPU: 94% ← Problem
├── Memory: 88% ← Problem
├── Disk: 92% ← Problem
├── Swap: Active ← Big problem
├── Load: 12.5 (8 cores)
└── Processes:
    ├── node: 85% CPU
    ├── postgres: Waiting on I/O
    └── zombie workers: 15 ← Problem
The server looks "up" but is actually in trouble.
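Part of the reason is that a typical /health endpoint answers from inside the application process without consulting the host at all. A minimal, hypothetical handler makes the gap obvious:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Returns 200 as long as the process can still serve requests.
	// CPU at 94%, active swap, zombie workers: none of that is
	// checked, so an external probe keeps seeing "200 OK".
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```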
Problem 2: Polling Overhead Scales Linearly
With agentless monitoring, every check is a network request:
5 servers × 10 metrics × 1 check/minute = 50 requests/minute
50 servers × 10 metrics × 1 check/minute = 500 requests/minute
500 servers × 10 metrics × 1 check/minute = 5,000 requests/minute
This creates:
- Network overhead on monitoring server
- Load on target systems (handling probe requests)
- Latency in metric collection
- Credential management at scale
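The arithmetic above falls directly out of the polling loop. A sketch in Go, with made-up server and metric names, shows how the request count grows multiplicatively with hosts and metrics:

```go
package main

import "fmt"

// probe stands in for one full network round trip (HTTP, SNMP, SSH...).
func probe(server, metric string) { /* request → parse → store */ }

func main() {
	servers := make([]string, 500)
	for i := range servers {
		servers[i] = fmt.Sprintf("srv-%03d", i)
	}
	metrics := []string{"cpu", "mem", "disk", "swap", "load",
		"net_in", "net_out", "io_wait", "procs", "uptime"}

	// One polling cycle: every metric on every host is a separate
	// network request, so cost grows as servers × metrics.
	requests := 0
	for _, s := range servers {
		for _, m := range metrics {
			probe(s, m)
			requests++
		}
	}
	fmt.Println(requests, "requests this minute") // 5000
}
```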
Problem 3: Network Dependency
Agentless monitoring fails when:
- Network path is congested
- Firewall rules change
- Target network is isolated
- DNS fails
Scenario: Network hiccup
Agentless result:
├── Server 1: Timeout (actually fine)
├── Server 2: Timeout (actually fine)
├── Server 3: Timeout (actually fine)
└── Alert storm: "3 servers down!"
Reality: Network switch flapped for 30 seconds
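From the monitoring server's side this failure is genuinely ambiguous: a timeout looks identical whether the target died or the path to it blipped. A small Go sketch of the problem (the URL is a placeholder):

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get("https://api.example.com/health")
	if err == nil {
		resp.Body.Close()
		fmt.Println("up:", resp.Status)
		return
	}

	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		// Is the server dead, or did a switch flap for 30 seconds?
		// From outside the network, the two are indistinguishable.
		fmt.Println("timeout: host down OR transient network issue")
	}
}
```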
Problem 4: Security at Scale
Agentless methods need access to targets:
| Method | Requires |
|---|---|
| SNMP | Community strings (often insecure) |
| SSH | Credentials on monitoring server |
| Cloud APIs | IAM keys/tokens |
At scale, managing these credentials becomes a security concern. A compromised monitoring server could access everything.
Problem 5: Internal Services Are Invisible
Services behind firewalls can't be reached agentlessly:
Internet
│
│ Firewall
│ ─────────────────────
│
├── Internal API: Not reachable
├── Database: Not reachable
├── Cache: Not reachable
└── Workers: Not reachable
You'd need to punch firewall holes—bad for security.
Agent-Based Monitoring at Scale
Agents flip the model:
Target System
│
Agent (runs locally)
│
├── Collects metrics internally
├── Full system visibility
└── Pushes to → Monitoring Server
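A minimal sketch of that loop, assuming a Linux host and a hypothetical ingest endpoint. Real agents parse /proc into numeric metrics; this only shows the shape:

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
	"os"
	"time"
)

type payload struct {
	Host    string            `json:"host"`
	Metrics map[string]string `json:"metrics"`
}

func main() {
	for {
		// Local reads: no network round trip, no stored credentials.
		cpu, _ := os.ReadFile("/proc/stat")
		mem, _ := os.ReadFile("/proc/meminfo")
		host, _ := os.Hostname()

		p := payload{Host: host, Metrics: map[string]string{
			// Parsing into numeric metrics omitted in this sketch.
			"cpu_raw": string(cpu),
			"mem_raw": string(mem),
		}}

		// One outbound HTTPS push carries the whole batch, so it
		// works from behind a firewall with no inbound rules.
		body, _ := json.Marshal(p)
		if resp, err := http.Post("https://ingest.example.com/v1/metrics",
			"application/json", bytes.NewReader(body)); err == nil {
			resp.Body.Close()
		}

		time.Sleep(30 * time.Second)
	}
}
```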
Why This Scales Better
1. Metric collection is local
Agent on server:
├── Read /proc/stat (CPU) - instant
├── Read /proc/meminfo (Memory) - instant
├── Read /sys/block/*/stat (Disk) - instant
└── Local process list - instant
vs
Agentless from monitoring server:
├── SSH connection - 50-200ms
├── Run command - 100-500ms
├── Parse output - variable
└── Per metric, per server
2. Network efficient
Agent: Collect 50 metrics locally → Send 1 payload → Monitoring server
Agentless: 50 separate network requests per server
3. No credential sprawl
Agent: One API key per agent → Pushes outbound
Agentless: Monitor needs credentials to every system
4. Works behind firewalls
Agent: Initiates outbound HTTPS → Works through firewalls
Agentless: Requires inbound access → Firewall holes needed
The Scale Tipping Point
| Scale | Recommendation |
|---|---|
| 1-5 servers | Agentless is fine |
| 5-20 servers | Consider agents for deeper visibility |
| 20-100 servers | Agents strongly recommended |
| 100+ servers | Agents essential |
Signs You've Outgrown Agentless
- Frequent "false positive" alerts from network issues
- Can't diagnose WHY servers are slow
- Missing metrics for internal services
- Credential management is painful
- Alert storms during network hiccups
The Hybrid Approach
Best practice: combine both methods.
External (Agentless)
- HTTP checks from outside your network
- SSL certificate monitoring
- DNS verification
- TCP port checks
Purpose: "Can users reach us?"
Internal (Agent-Based)
- Server metrics (CPU, memory, disk)
- Process monitoring
- Internal service health
- Application metrics
Purpose: "Are our systems healthy?"
Complete picture:
├── External: Can users reach the API? ✓
└── Internal: Is the server healthy?
    ├── CPU: 45%
    ├── Memory: 62%
    ├── Disk: 55%
    └── API process: Running
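One way to correlate the two views is sketched below. The types and thresholds are illustrative, not any product's actual alerting logic:

```go
package main

import "fmt"

// Hypothetical results from the two monitoring paths.
type externalCheck struct{ Reachable bool }
type agentReport struct{ CPU, Mem, Disk float64 }

// classify combines the outside view with the inside view.
func classify(ext externalCheck, agent agentReport) string {
	switch {
	case !ext.Reachable:
		return "outage: users cannot reach us"
	case agent.CPU > 90 || agent.Mem > 85 || agent.Disk > 90:
		return "degraded: reachable, but the host is in trouble"
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(classify(
		externalCheck{Reachable: true},
		agentReport{CPU: 94, Mem: 88, Disk: 92},
	)) // degraded: reachable, but the host is in trouble
}
```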
Common Objections to Agents
"Agents are overhead"
Modern agents are lightweight:
- Wakestack agent: ~10MB memory, negligible CPU
- Collects metrics once every 30 seconds
- Less overhead than answering SSH probes
"More things to manage"
Agents are simpler than credential management at scale:
- Install once
- Auto-updates
- No firewall changes
- No credential rotation
"What if the agent crashes?"
What if SSH access fails? What if SNMP times out?
Both approaches have failure modes. A crashed agent is visible and recoverable; network issues create ambiguous states.
"We use cloud provider monitoring"
Cloud monitoring (CloudWatch, GCP Monitoring) is useful but:
- Metrics can lag 5+ minutes
- Limited to what the provider exposes
- Doesn't cover non-cloud resources
- Often expensive at scale
Wakestack's Approach
Wakestack uses both approaches:
External Monitoring (Agentless)
- HTTP/HTTPS, TCP, DNS, Ping
- Multi-region verification
- SSL monitoring
- No installation needed
Server Agent (Agent-Based)
- Lightweight Go binary
- CPU, memory, disk, process metrics
- 30-second granularity
- Outbound-only communication
Combined View
Production Server
├── External: HTTP check → 200 OK (145ms)
└── Agent: Server metrics
    ├── CPU: 42%
    ├── Memory: 68%
    ├── Disk: 55%
    └── Processes: All running
Both views, one dashboard.
See the difference — Try agent-based monitoring alongside your uptime checks.
Migration Path
If you're currently agentless-only:
Step 1: Keep External Checks
Don't remove HTTP/TCP monitoring. It provides the user perspective.
Step 2: Add Agents to Critical Servers
Start with:
- Production application servers
- Database servers
- Any server where you've had "mystery" slowdowns
Step 3: Correlate Data
When alerts fire, check both:
- External: Is it down?
- Agent: What's the system state?
Step 4: Expand Coverage
Add agents to remaining servers as you see value.
Key Takeaways
- Agentless monitoring works at small scale
- It fails at scale due to limited visibility, polling overhead, network dependency, and credential sprawl
- Agent-based monitoring scales efficiently with local collection
- Best practice: use both—external for user perspective, agents for system health
- The tipping point is usually 5-20 servers
- Modern agents are lightweight—overhead concerns are outdated
Frequently Asked Questions
What is agentless monitoring?
Agentless monitoring checks systems from outside without installing software on them. It uses protocols like HTTP, SNMP, SSH, or APIs to gather data remotely. Examples include HTTP uptime checks and cloud provider API monitoring.
Why does agentless monitoring fail at scale?
Agentless monitoring fails at scale because it can't see deeply inside systems, polling overhead grows linearly with hosts, it depends on network availability, and it misses granular metrics like process health or disk I/O. It becomes both less effective and less efficient as you grow.
When should I use agent-based monitoring?
Use agent-based monitoring when you have more than 5-10 servers, need process-level visibility, are monitoring systems behind firewalls, or need detailed metrics beyond basic availability. The upfront installation cost is offset by better visibility and more efficient data collection.
Related Articles
Agent-Based Monitoring: Why You Need Eyes Inside Your Servers
Understand agent-based monitoring - what it is, how it works, and when you need it. Compare agent-based vs agentless monitoring approaches.
Infrastructure-Aware Uptime Monitoring: Beyond Simple Checks
Learn how infrastructure-aware monitoring combines uptime checks with server metrics. Understand why knowing your endpoints isn't enough without knowing your infrastructure.
Server Monitoring: Complete Guide to Infrastructure Visibility
Learn how to monitor your servers effectively - CPU, memory, disk, and processes. Understand why server monitoring matters and how it complements uptime monitoring.