I have eight machines on my home network. Two are dedicated to local LLM inference, three are shared gaming PCs my kids use, one is a Proxmox node, one is a Raspberry Pi 5 running Pi-hole and Home Assistant, and one is a NAS. I know them all intimately. I also have no idea what any of them are doing right now without SSH-ing in to check.
That’s the homelab problem in miniature: the machines you care about most are the ones you understand least in aggregate. You built them, you know their configs, and yet you have no single view of what’s running, what’s straining, or what happened last night while you were asleep.
## The Prometheus Trap
The standard advice is to set up Prometheus and Grafana. And Prometheus is genuinely excellent software — I use it at work. But for a homelab with 5–10 machines, the overhead is real:
- node_exporter on every machine — that’s an agent to install, configure, and maintain on every box, and Windows isn’t even supported by node_exporter; you need the separate windows_exporter there
- Prometheus config — a scrape config that needs updating every time you add a machine
- Grafana dashboards — you’ll spend an afternoon building the perfect dashboard and then never look at it
- Alertmanager — another service to run if you want notifications
- Retention — Prometheus stores time series locally; now you’re thinking about disk space
I’m not saying the stack isn’t worth it. For a production environment or a serious homelab with 20+ nodes it absolutely is. But if what you actually want is “tell me what’s happening across my machines right now,” Prometheus is a 10-hour project for a 10-second answer.
## What SSH Already Gives You
Here’s the thing: you already have SSH access to every Linux and macOS machine in your lab, and Windows has shipped an optional native OpenSSH server since Windows 10 version 1809 (late 2018). Every metric you could want — CPU load, RAM pressure, GPU utilization, disk usage, running processes, logged-in users, temperatures — is available via a command over SSH.
The question is whether you want to run those commands manually every time, or whether you want something that does it continuously, correlates the results, and surfaces them when something interesting happens.
One monitoring process holds a persistent SSH connection to each machine. Every minute or so, it runs a set of OS-appropriate commands, parses the output, and maintains state. No agents installed on target machines. No open ports beyond SSH. The data never leaves your network.
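The core of that loop is just “run a command, parse text, update a state map.” Here’s a minimal sketch of the parsing side (hypothetical code, not Leassh’s implementation) — the SSH call is stubbed out with sample command output so the logic is self-contained:

```python
def parse_loadavg(text: str) -> float:
    """Parse the 1-minute load average from `cat /proc/loadavg` output."""
    return float(text.split()[0])

def parse_df_percent(text: str) -> int:
    """Parse the use% column for the last filesystem in `df -P` output."""
    last_line = text.strip().splitlines()[-1]
    return int(last_line.split()[4].rstrip("%"))

# In the real loop these strings would arrive over a persistent SSH
# connection (e.g. running `cat /proc/loadavg` on the node); stubbed here.
LOADAVG_SAMPLE = "0.52 0.58 0.59 2/1189 31415\n"
DF_SAMPLE = (
    "Filesystem 1024-blocks Used     Available Capacity Mounted on\n"
    "/dev/sda1  98304000    87424000 10880000  89%      /\n"
)

state = {}
state["nas"] = {
    "load1": parse_loadavg(LOADAVG_SAMPLE),
    "disk_pct": parse_df_percent(DF_SAMPLE),
}
```

Each poll overwrites the node’s entry in `state`, so “what’s happening right now” is just a dictionary lookup, and history is whatever you choose to retain.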
This is the architecture I built into Leassh’s fleet monitoring plugin, and it’s fundamentally different from the exporter-based model: the intelligence is on the monitoring side, not the monitored side. Your machines don’t know they’re being monitored. They’re just answering SSH commands.
## What It Looks Like in Practice
Once configured, you get two things. First, a natural language interface via OpenClaw — your AI agent can answer questions like “what are my machines doing right now?” against live fleet data.
That’s one query, no dashboard to build, no alert to configure beforehand. A warning like “the NAS disk is trending toward full” surfaces because the system is tracking disk usage over time and doing rate-of-change math — “time to full based on current write rate” is more useful than “89% used.”
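That rate-of-change math can be sketched in a few lines (illustrative, not the plugin’s actual code): fit a least-squares slope to recent usage samples and extrapolate to capacity.

```python
def hours_to_full(samples, capacity_gb):
    """Estimate hours until a disk fills from (hour, used_gb) samples,
    using a least-squares growth rate instead of the latest percentage.
    Returns None if usage is flat or shrinking."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    slope = (sum((t - mean_t) * (u - mean_u) for t, u in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))  # GB per hour
    if slope <= 0:
        return None
    latest_used = samples[-1][1]
    return (capacity_gb - latest_used) / slope

# Three daily samples on a 1 TB disk growing ~12 GB/day:
eta = hours_to_full([(0, 800), (24, 812), (48, 824)], capacity_gb=1000)
```

With those numbers the slope is 0.5 GB/hour, so the estimate is 352 hours — about two weeks of warning, even though the disk is “only” at 82%.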
Second, there’s a live dashboard served at /fleet — all nodes, status bars, GPU VRAM, disk, active users, last seen. It auto-refreshes. For the big-picture view it’s faster than any terminal.
## The Configuration Is One File
The entire setup is a single fleet.yaml:
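Something along these lines — the field names here are illustrative guesses, so check the plugin’s documentation for the actual schema:

```yaml
# Illustrative fleet.yaml — field names are assumptions, not the real schema.
poll_interval: 60s
nodes:
  - name: inference-01
    host: 192.168.1.20
    user: monitor
    os: linux
  - name: kids-pc-1
    host: 192.168.1.31
    user: monitor
    os: windows
  - name: nas
    host: 192.168.1.40
    user: monitor
    os: linux
```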
Add a node, restart the binary, it starts showing up. Remove a node, it disappears. No service discovery config, no scrape rules, no relabeling.
## Cross-Platform Without the Pain
The part that took the most work to build is the part you don’t see: getting the same logical metric from machines that answer completely different commands.
| Metric | Linux | macOS | Windows |
|---|---|---|---|
| CPU usage | /proc/stat | top -l 1 | Get-CimInstance |
| RAM | /proc/meminfo | vm_stat | Get-CimInstance |
| Disk | df -h | df -h | Get-CimInstance |
| GPU (NVIDIA) | nvidia-smi | N/A | nvidia-smi |
| Idle time | xprintidle / input mtime | ioreg HIDIdleTime | GetLastInputInfo |
| Processes | ps aux | ps aux | Get-Process |
| Screenshots | scrot / grim | screencapture | PowerShell task¹ |
¹ Windows SSH runs in Session 0 (no desktop). Screenshots require creating a scheduled task in the interactive user session — the binary handles this automatically.
All of these commands live in a JSON registry, not hardcoded in the binary. If a command isn’t available on a given machine, that metric is skipped gracefully. Add a new command variant without recompiling.
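The registry pattern is worth a sketch. This is a hypothetical shape for such a registry (the real file’s schema may differ) — the point is that metric-to-command mapping is data, and a missing OS variant means “skip,” not “crash”:

```python
import json

# Hypothetical registry contents — illustrative, not the shipped file.
REGISTRY_JSON = """
{
  "cpu_usage": {
    "linux":   "grep 'cpu ' /proc/stat",
    "darwin":  "top -l 1 | grep 'CPU usage'",
    "windows": "Get-CimInstance Win32_Processor"
  },
  "gpu_util": {
    "linux":   "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader",
    "windows": "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader"
  }
}
"""

registry = json.loads(REGISTRY_JSON)

def command_for(metric: str, os_name: str):
    """Look up the command for a metric on one OS. Returns None when no
    variant exists (e.g. GPU on macOS), so the caller skips the metric."""
    return registry.get(metric, {}).get(os_name)
```

Adding a new metric or a new OS variant is a JSON edit, no recompile.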
## Alerts That Don’t Require Pre-Configuration
Traditional monitoring is reactive: you define a threshold, something crosses it, you get an alert. That’s fine when you know what to watch for. But homelab failures are creative — it’s rarely the alert you configured that fires; it’s the one you forgot to set up.
SSH fleet monitoring has an advantage here because the monitoring system understands behavior, not just thresholds. When a previously-idle machine suddenly has a high-GPU process running, that’s surfaced without you pre-defining “alert if GPU > 80%.” The agent sees the state change and tells you about it.
Concrete examples of what gets flagged automatically:
- Node went offline with last-known metrics and how long it’s been down
- Disk trend critical — time-to-full estimate based on rolling regression, not just current percentage
- Unknown heavy process — something consuming >30% CPU or GPU that the system hasn’t seen before
- Node back online after being unreachable — brief recovery confirmation
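The “unknown heavy process” check is the novelty-detection idea in its simplest form. A hypothetical sketch (threshold and structure assumed, not Leassh’s actual logic):

```python
HEAVY_CPU_PCT = 30.0  # assumed threshold, mirroring the >30% rule above

def new_heavy_processes(procs, seen):
    """procs: list of (name, cpu_pct) from the latest poll.
    seen: set of process names observed on this node before (mutated).
    Returns names that are both heavy and never seen — the ones worth
    flagging without any pre-configured alert."""
    flagged = []
    for name, cpu in procs:
        if cpu > HEAVY_CPU_PCT and name not in seen:
            flagged.append(name)
        seen.add(name)
    return flagged

seen = {"ollama", "plex"}
first = new_heavy_processes([("xmrig", 95.0), ("plex", 40.0)], seen)
second = new_heavy_processes([("xmrig", 95.0)], seen)
```

The first poll flags `xmrig` (heavy and new) but not `plex` (heavy but known); the second flags nothing, so you get one notification, not a pager storm.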
## Where It Fits in the Monitoring Stack
I still run Prometheus at work and wouldn’t replace it for production systems. For my homelab, SSH fleet monitoring does what I actually need:
“Tell me what my machines are doing right now, flag anything unusual, and let me ask questions without opening five terminal tabs.”
If you’re running fewer than 20 nodes and spending more time maintaining your monitoring stack than actually using it, that’s a signal to simplify. SSH is already there. The commands already work. The only question is whether you’re running them manually every time or whether something runs them for you.
The fleet plugin is free for OpenClaw users — unlimited nodes, MIT licensed. The full Leassh product adds rules automation (“kill this process on idle machines”), screen time enforcement, and AI behavioral reports for family machines. But for pure fleet monitoring, the free tier is the whole thing.
Related reading: If you’re also dealing with kids gaming on your GPU machines, see How to Stop Your Kids From Hijacking Your Homelab GPU for the automated enforcement side of the same setup.