Introduction
For DevOps and home-lab operators, OpenClaw acts as an Infrastructure Agent. Named "Reef" in one popular community setup, the agent monitors Kubernetes clusters, diagnoses failing services, and autonomously applies fixes using Ansible playbooks or Terraform manifests. It performs health checks every 15 minutes and conducts a "Nightly Brainstorm" at 4:00 AM to review logs and plan system optimizations.
The Reef pattern has become a reference implementation for "self-healing infrastructure": systems that detect and fix problems without human intervention. In six months of operation, one community member's Reef instance resolved three incidents that would otherwise have paged an on-call engineer at inconvenient hours. In each case, the problem was detected and fixed before any human knew it had occurred. As the operator put it: "It's like having a junior sysadmin who never sleeps and never complains about weekend shifts." This guide explains how to build your own Reef.
The Reef Pattern
The "Reef" pattern: an OpenClaw agent with SSH access to infrastructure, read access to logs and metrics, and write access to runbooks and automation. The agent doesn't replace human operators — it handles routine failures and escalates complex issues.
The key insight is scope. Reef doesn't try to fix everything. It has a defined set of runbooks: pod restart, log rotation, certificate renewal, disk cleanup. When it detects a condition that matches a runbook, it executes. When it detects something outside the runbooks—a novel failure mode, a security incident, ambiguous diagnostics—it alerts the human with context. The human handles the edge cases; Reef handles the 80% that's routine.
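The runbook-or-escalate decision above can be sketched as a simple dispatch table. This is an illustrative sketch, not OpenClaw's actual API; the condition names and runbook filenames are assumptions.

```python
# Reef-style dispatch: known conditions map to runbooks, anything
# unrecognized escalates to a human. All names here are illustrative.
RUNBOOKS = {
    "pod_crashloop": "runbook-pod-restart.md",
    "disk_high": "runbook-disk-cleanup.md",
    "cert_expiring": "runbook-cert-renewal.md",
    "log_rotation_failed": "runbook-log-rotation.md",
}

def dispatch(condition: str) -> str:
    """Return the runbook to execute, or 'escalate' for unknown conditions."""
    return RUNBOOKS.get(condition, "escalate")
```

The important property is the default: anything not explicitly covered by a runbook falls through to the human, never to an improvised fix.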
Separation of concerns matters. Reef runs on a dedicated OpenClaw instance. Don't mix infrastructure automation with your personal agent—different risk profiles, different access levels. Reef needs SSH keys and kubectl access. Your personal agent needs email and calendar. Keep them separate.
Health Checks
Every 15 minutes, Reef runs:
- HTTP endpoint checks: Critical services return 200. If not, Reef checks if it's a transient blip (retry) or persistent (alert or fix).
- Kubernetes pod status: Running, not CrashLoopBackOff. Reef identifies crash-looping pods and can restart them per runbook. It doesn't delete namespaces or modify deployments—that requires human approval.
- Disk usage: Alert if > 85%; take action if > 90%. Reef can clear temp files, rotate logs, and compress old data. The runbook defines the escalation path.
- Certificate expiration: Alert if < 14 days. Reef can run certbot and update ingress—a common automation that prevents midnight certificate expiry.
- Error log volume: Spike detection. If error rate jumps 3x in an hour, something's wrong. Reef investigates (read recent logs) and either fixes (known pattern) or alerts (unknown pattern).
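Two of the thresholds above (disk escalation and error-spike detection) can be sketched as plain functions. The numbers come from the checklist above; everything else is an illustrative assumption.

```python
def disk_action(percent_used: float) -> str:
    """Disk runbook escalation: alert above 85%, act above 90%."""
    if percent_used > 90:
        return "cleanup"  # clear temp files, rotate logs, compress old data
    if percent_used > 85:
        return "alert"
    return "ok"

def error_spike(current_rate: float, baseline_rate: float) -> bool:
    """Flag a spike when the hourly error rate reaches 3x the baseline."""
    return baseline_rate > 0 and current_rate >= 3 * baseline_rate
```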
Two-tier processing: scripts run checks; LLM invoked only when anomaly detected. Most cycles find nothing wrong—scripts return "all green," no LLM call, minimal cost. When something's wrong, the LLM reasons about the situation and decides: runbook fix or human alert. Reduces API cost by 70-90% compared to full-LLM cycles.
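The two-tier structure can be sketched in a few lines: cheap scripted checks run every cycle, and the LLM is invoked only when one of them fails. This is a sketch of the control flow, not OpenClaw internals; `invoke_llm` stands in for whatever model call your setup uses.

```python
def heartbeat_cycle(checks, invoke_llm):
    """Run cheap scripted checks; call the LLM only on anomalies.

    checks: mapping of check name -> zero-arg callable returning True if healthy.
    invoke_llm: callable taking the list of failing check names.
    """
    anomalies = [name for name, check in checks.items() if not check()]
    if not anomalies:
        return "all green"        # no LLM call, near-zero marginal cost
    return invoke_llm(anomalies)  # LLM decides: runbook fix or human alert
```

In the common case every check passes and the cycle costs nothing beyond the scripts themselves.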
Autonomous Fixes
When Reef detects a failure, it consults runbooks (stored in memory as Markdown). Examples:
- Pod crash: kubectl delete pod (restart). The new pod comes up fresh. Works for stateless services. Reef doesn't restart stateful pods without an explicit runbook.
- Disk full: Clear temp files, rotate logs, compress old backups. The runbook specifies paths and retention. Reef doesn't delete arbitrarily.
- Certificate expiring: Run certbot, apply to ingress. Reef has the certbot command and the kubectl apply. It executes, verifies the new cert is active, and logs the renewal.
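A runbook can be as simple as a Markdown file the agent pattern-matches against. A hypothetical disk-cleanup runbook might look like the following; the paths, retention periods, and commands are placeholders you would replace with your own.

```markdown
# Runbook: Disk Full (/var > 90%)

Trigger: disk usage on /var exceeds 90%.

Steps:
1. Clear agent scratch space: `rm -rf /tmp/reef-scratch/*` (agent temp files only).
2. Force log rotation: `logrotate --force /etc/logrotate.d/app`.
3. Compress backups in /var/backups older than 7 days with `gzip`.

Constraints: never touch /var/lib. If usage is still above 90% after
these steps, stop and alert the human with the largest directories found.
```

Note the explicit constraints section: the runbook states what the agent must not do, not just what it should do.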
Reef applies fixes via Ansible or Terraform when possible. For ambiguous cases, it alerts the human with diagnosis and suggested action. "Disk at 92%. Runbook suggests log rotation. I've identified 4GB in /var/log. Proceed? Or: [alternative actions]." The human approves; Reef executes. Or the human handles it manually. Reef's job is to surface the right information.
Nightly Brainstorm
At 4:00 AM, Reef runs a "Nightly Brainstorm" task: review logs from the past 24 hours, identify patterns, suggest optimizations. Output: Markdown report in memory, optional Slack summary. Human reviews over morning coffee.
This proactive analysis catches slow degradation — increasing error rates, memory leaks, growing latency — before they become incidents. Reef might notice: "Error rate in service X has increased 15% over the past week. Correlation with deployment Y. Consider rollback or investigation." That's the kind of insight that prevents 3 AM pages. See Nightly Brainstorm for the full pattern.
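A toy sketch of the kind of slow-degradation check the brainstorm might run: compare error volume in the second half of the week against the first half. This is illustrative only; a real analysis would work from actual log aggregates.

```python
def weekly_trend(daily_error_counts):
    """Percent change in error volume: second half of the week vs first half.

    daily_error_counts: list of per-day error counts, oldest first.
    """
    half = len(daily_error_counts) // 2
    first = sum(daily_error_counts[:half]) / half
    second = sum(daily_error_counts[half:]) / (len(daily_error_counts) - half)
    return (second - first) / first * 100
```

A result around +15% over a week is exactly the "error rate in service X has increased" observation described above: too gradual to trip an hourly spike alert, but visible in a daily review.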
Security: TruffleHog & Secret Scanning
Security is maintained via secret scanners like TruffleHog, which prevent the agent from accidentally committing API keys to its own memory files. Reef runs TruffleHog before any git commit and blocks the commit if secrets are detected. The agent might have learned a credential during troubleshooting—it must not persist that to version control.
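The gate logic can be illustrated with a lightweight pattern check. In the Reef setup the real scan is done by TruffleHog; the regexes below are a small illustrative sample, not a substitute for a proper scanner.

```python
import re

# Illustrative secret patterns: a real deployment should rely on TruffleHog,
# which detects far more credential types and can verify them.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID format
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9]{20,}['\"]"),
]

def commit_allowed(text: str) -> bool:
    """Return False (block the commit) if staged content matches a secret pattern."""
    return not any(p.search(text) for p in SECRET_PATTERNS)
```

The design point is fail-closed: if anything looks like a credential, the commit is blocked and a human reviews it.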
Reef's SSH keys are scoped to minimal required permissions. Principle of least privilege: agent can restart pods, not delete namespaces. It can read logs, not modify audit trails. It can run certbot, not access the CA private key. The runbooks are designed to work within these constraints. If a fix requires elevated access, Reef escalates to the human.
Implementation
- Dedicated instance: Don't mix Reef with your personal agent. Infrastructure automation has different security requirements.
- SSH key with limited scope: Consider bastion + jump host. Reef connects to bastion; bastion connects to clusters. Reduces direct exposure.
- HEARTBEAT.md: 15-min health check, 4 AM brainstorm. Time conditions ensure the brainstorm runs during low-traffic hours.
- Runbooks in Markdown: One file per failure mode. Reef references them by pattern match. "Pod CrashLoopBackOff" → runbook-pod-restart.md.
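Under those assumptions, a minimal HEARTBEAT.md sketch might look like this. The exact schema depends on your OpenClaw version; the section names and wording here are illustrative.

```markdown
# HEARTBEAT.md

## Every 15 minutes
- Run health checks: HTTP endpoints, pod status, disk usage, cert expiry,
  error-log volume.
- On anomaly: match against runbooks/; execute the matching runbook or
  escalate to the human with context.

## Daily at 04:00
- Nightly Brainstorm: review the last 24h of logs, write a Markdown report
  to memory, post an optional summary to Slack.
```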
Start with read-only monitoring. Add fix runbooks one at a time. Validate each before expanding. The goal is confidence—you need to trust Reef before you give it fix authority.
Real Incidents Resolved
Community reports of Reef resolving incidents:
- Pod crash loop: A deployment had a memory leak. Pods crashed every few hours. Reef detected the pattern, restarted the pod, and alerted: "Service X pods crashing repeatedly. Consider memory limit increase or code fix." The human addressed the root cause; Reef had prevented user-facing downtime.
- Disk full: Log rotation had failed. /var was at 98%. Reef cleared old logs, rotated current ones, and freed 40GB. Without Reef, the service would have crashed when disk hit 100%.
- Certificate expiry: A Let's Encrypt cert was 10 days from expiry. Certbot hadn't run (cron was broken). Reef ran certbot manually, applied the new cert, and alerted: "Cert renewed. Fix certbot cron." The human fixed the cron; Reef had prevented the outage.
These aren't edge cases—they're the routine failures that fill on-call rotations. Reef handles them. Humans handle the rest.
Wrapping Up
The Reef pattern demonstrates OpenClaw's value for infrastructure: autonomous monitoring, diagnosis, and remediation. It's production-ready—community members have been running it for months. Start with monitoring, add fixes incrementally, and enjoy the sleep. See OpenClaw Kubernetes and Heartbeat Engine for setup.