I have eight machines on my home network. Two are dedicated to local LLM inference, three are shared gaming PCs my kids use, one is a Proxmox node, one is a Raspberry Pi 5 running Pi-hole and Home Assistant, and one is a NAS. I know them all intimately. I also have no idea what any of them are doing right now without SSH-ing in to check.
That’s the homelab problem in miniature: the machines you care about most are the ones you understand least in aggregate. You built them, you know their configs, and yet you have no single view of what’s running, what’s straining, or what happened last night while you were asleep.
## The Prometheus Trap
The standard advice is to set up Prometheus and Grafana. And Prometheus is genuinely excellent software — I use it at work. But for a homelab with 5–10 machines, the overhead is real:
- node_exporter on every machine — that’s an agent to install, configure, and maintain on every box, and Windows isn’t even supported by node_exporter; you need the separate windows_exporter there
- Prometheus config — a scrape config that needs updating every time you add a machine
- Grafana dashboards — you’ll spend an afternoon building the perfect dashboard and then never look at it
- Alertmanager — another service to run if you want notifications
- Retention — Prometheus stores time series locally; now you’re thinking about disk space
I’m not saying the stack isn’t worth it. For a production environment or a serious homelab with 20+ nodes it absolutely is. But if what you actually want is “tell me what’s happening across my machines right now,” Prometheus is a 10-hour project for a 10-second answer.
## What SSH Already Gives You
Here’s the thing: you already have SSH access to every Linux and macOS machine in your lab, and Windows has shipped an optional native OpenSSH server since Windows 10 version 1809 (late 2018). Every metric you could want — CPU load, RAM pressure, GPU utilization, disk usage, running processes, logged-in users, temperatures — is available via a command over SSH.
The question is whether you want to run those commands manually every time, or whether you want something that does it continuously, correlates the results, and surfaces them when something interesting happens.
One monitoring process holds a persistent SSH connection to each machine. Every minute or so, it runs a set of OS-appropriate commands, parses the output, and maintains state. No agents installed on target machines. No open ports beyond SSH. The data never leaves your network.
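The core of that loop is just “run a command, parse text, update a state map.” Here’s a minimal sketch of the parsing side (hypothetical code, not Leassh’s implementation) — the SSH call is stubbed out with sample command output so the logic is self-contained:

```python
def parse_loadavg(text: str) -> float:
    """Parse the 1-minute load average from `cat /proc/loadavg` output."""
    return float(text.split()[0])

def parse_df_percent(text: str) -> int:
    """Parse the use% column for the last filesystem in `df -P` output."""
    last_line = text.strip().splitlines()[-1]
    return int(last_line.split()[4].rstrip("%"))

# In the real loop these strings would arrive over a persistent SSH
# connection (e.g. running `cat /proc/loadavg` on the node); stubbed here.
LOADAVG_SAMPLE = "0.52 0.58 0.59 2/1189 31415\n"
DF_SAMPLE = (
    "Filesystem 1024-blocks Used     Available Capacity Mounted on\n"
    "/dev/sda1  98304000    87424000 10880000  89%      /\n"
)

state = {}
state["nas"] = {
    "load1": parse_loadavg(LOADAVG_SAMPLE),
    "disk_pct": parse_df_percent(DF_SAMPLE),
}
```

Each poll overwrites the node’s entry in `state`, so “what’s happening right now” is just a dictionary lookup, and history is whatever you choose to retain.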
This is the architecture I built into Leassh’s fleet monitoring plugin, and it’s fundamentally different from the exporter-based model: the intelligence is on the monitoring side, not the monitored side. Your machines don’t know they’re being monitored. They’re just answering SSH commands.
## What It Looks Like in Practice
Once configured, you get two things. First, a natural language interface via OpenClaw — your AI agent can answer questions like “what are my machines doing right now?” against live fleet data.
That’s one query, no dashboard to build, no alert to configure beforehand. A warning like “the NAS disk is trending toward full” surfaces because the system is tracking disk usage over time and doing rate-of-change math — “time to full based on current write rate” is more useful than “89% used.”
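That rate-of-change math can be sketched in a few lines (illustrative, not the plugin’s actual code): fit a least-squares slope to recent usage samples and extrapolate to capacity.

```python
def hours_to_full(samples, capacity_gb):
    """Estimate hours until a disk fills from (hour, used_gb) samples,
    using a least-squares growth rate instead of the latest percentage.
    Returns None if usage is flat or shrinking."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    slope = (sum((t - mean_t) * (u - mean_u) for t, u in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))  # GB per hour
    if slope <= 0:
        return None
    latest_used = samples[-1][1]
    return (capacity_gb - latest_used) / slope

# Three daily samples on a 1 TB disk growing ~12 GB/day:
eta = hours_to_full([(0, 800), (24, 812), (48, 824)], capacity_gb=1000)
```

With those numbers the slope is 0.5 GB/hour, so the estimate is 352 hours — about two weeks of warning, even though the disk is “only” at 82%.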
Second, there’s a live dashboard served at /fleet — all nodes, status bars, GPU VRAM, disk, active users, last seen. It auto-refreshes. For the big-picture view it’s faster than any terminal.
## The Configuration Is One File
The entire setup is a single fleet.yaml:
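Something along these lines — the field names here are illustrative guesses, so check the plugin’s documentation for the actual schema:

```yaml
# Illustrative fleet.yaml — field names are assumptions, not the real schema.
poll_interval: 60s
nodes:
  - name: inference-01
    host: 192.168.1.20
    user: monitor
    os: linux
  - name: kids-pc-1
    host: 192.168.1.31
    user: monitor
    os: windows
  - name: nas
    host: 192.168.1.40
    user: monitor
    os: linux
```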
Add a node, restart the binary, it starts showing up. Remove a node, it disappears. No service discovery config, no scrape rules, no relabeling.
## Cross-Platform Without the Pain
The part that took the most work to build is the part you don’t see: getting the same logical metric from machines that answer completely different commands.
| Metric | Linux | macOS | Windows |
|---|---|---|---|
| CPU usage | /proc/stat | top -l 1 | Get-CimInstance |
| RAM | /proc/meminfo | vm_stat | Get-CimInstance |
| Disk | df -h | df -h | Get-CimInstance |
| GPU (NVIDIA) | nvidia-smi | N/A | nvidia-smi |
| Idle time | xprintidle / input mtime | ioreg HIDIdleTime | GetLastInputInfo |
| Processes | ps aux | ps aux | Get-Process |
| Screenshots | scrot / grim | screencapture | PowerShell task¹ |
¹ Windows SSH runs in Session 0 (no desktop). Screenshots require creating a scheduled task in the interactive user session — the binary handles this automatically.
All of these commands live in a JSON registry, not hardcoded in the binary. If a command isn’t available on a given machine, that metric is skipped gracefully. Add a new command variant without recompiling.
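The registry pattern is worth a sketch. This is a hypothetical shape for such a registry (the real file’s schema may differ) — the point is that metric-to-command mapping is data, and a missing OS variant means “skip,” not “crash”:

```python
import json

# Hypothetical registry contents — illustrative, not the shipped file.
REGISTRY_JSON = """
{
  "cpu_usage": {
    "linux":   "grep 'cpu ' /proc/stat",
    "darwin":  "top -l 1 | grep 'CPU usage'",
    "windows": "Get-CimInstance Win32_Processor"
  },
  "gpu_util": {
    "linux":   "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader",
    "windows": "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader"
  }
}
"""

registry = json.loads(REGISTRY_JSON)

def command_for(metric: str, os_name: str):
    """Look up the command for a metric on one OS. Returns None when no
    variant exists (e.g. GPU on macOS), so the caller skips the metric."""
    return registry.get(metric, {}).get(os_name)
```

Adding a new metric or a new OS variant is a JSON edit, no recompile.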
## Alerts That Don’t Require Pre-Configuration
Traditional monitoring is reactive: you define a threshold, something crosses it, you get an alert. That’s fine when you know what to watch for. But homelab failures are creative — it’s rarely the alert you configured that fires; it’s the one you forgot to set up.
SSH fleet monitoring has an advantage here because the monitoring system understands behavior, not just thresholds. When a previously-idle machine suddenly has a high-GPU process running, that’s surfaced without you pre-defining “alert if GPU > 80%.” The agent sees the state change and tells you about it.
Concrete examples of what gets flagged automatically:
- Node went offline with last-known metrics and how long it’s been down
- Disk trend critical — time-to-full estimate based on rolling regression, not just current percentage
- Unknown heavy process — something consuming >30% CPU or GPU that the system hasn’t seen before
- Node back online after being unreachable — brief recovery confirmation
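The “unknown heavy process” check is the novelty-detection idea in its simplest form. A hypothetical sketch (threshold and structure assumed, not Leassh’s actual logic):

```python
HEAVY_CPU_PCT = 30.0  # assumed threshold, mirroring the >30% rule above

def new_heavy_processes(procs, seen):
    """procs: list of (name, cpu_pct) from the latest poll.
    seen: set of process names observed on this node before (mutated).
    Returns names that are both heavy and never seen — the ones worth
    flagging without any pre-configured alert."""
    flagged = []
    for name, cpu in procs:
        if cpu > HEAVY_CPU_PCT and name not in seen:
            flagged.append(name)
        seen.add(name)
    return flagged

seen = {"ollama", "plex"}
first = new_heavy_processes([("xmrig", 95.0), ("plex", 40.0)], seen)
second = new_heavy_processes([("xmrig", 95.0)], seen)
```

The first poll flags `xmrig` (heavy and new) but not `plex` (heavy but known); the second flags nothing, so you get one notification, not a pager storm.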
## Where It Fits in the Monitoring Stack
I still run Prometheus at work and wouldn’t replace it for production systems. For my homelab, SSH fleet monitoring does what I actually need:
“Tell me what my machines are doing right now, flag anything unusual, and let me ask questions without opening five terminal tabs.”
If you’re running fewer than 20 nodes and spending more time maintaining your monitoring stack than actually using it, that’s a signal to simplify. SSH is already there. The commands already work. The only question is whether you’re running them manually every time or whether something runs them for you.
The fleet plugin is free for OpenClaw users — unlimited nodes, MIT licensed. The full Leassh product adds rules automation (“kill this process on idle machines”), screen time enforcement, and AI behavioral reports for family machines. But for pure fleet monitoring, the free tier is the whole thing.
Related reading: If you’re also dealing with kids gaming on your GPU machines, see How to Stop Your Kids From Hijacking Your Homelab GPU for the automated enforcement side of the same setup.