Introduction

OpenClaw runs as a long-lived process, which makes it a natural fit for Kubernetes. Deploy with Docker images, scale replicas for multi-agent setups, and use persistent volumes for memory. This guide covers production Kubernetes deployment for DevOps teams. See scaling for capacity planning.

Key consideration: OpenClaw is stateful. Memory lives on disk. This affects how you scale and design for HA. We'll cover the patterns that work.

Deployment Options

Run OpenClaw as a Deployment with a single replica (typical) or multiple replicas for multi-tenant setups. Each replica is an independent agent. Use ConfigMaps for config.yaml; Secrets for API keys. Community Helm charts exist; validate before use.

Single replica. Most deployments run one OpenClaw instance per agent. Simple: one Deployment, one Pod, one PVC for memory. No coordination needed. Use for: single-tenant, one agent per organization.
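The single-replica layout described above can be sketched as a small Python script that emits the manifests as JSON, which kubectl accepts alongside YAML. The app label, image reference, and mount path are placeholders, not values taken from OpenClaw itself. The Recreate strategy is a design choice of this sketch, not something the project prescribes.

```python
import json

APP = "openclaw"                                # hypothetical app label
IMAGE = "registry.example.com/openclaw:1.0.0"   # placeholder image reference
MEMORY_PATH = "/data/memory"                    # assumed mount path for agent memory


def memory_pvc(size: str = "10Gi") -> dict:
    """PVC for the agent's memory directory (ReadWriteOnce: attachable to one node)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{APP}-memory"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


def deployment() -> dict:
    """One Deployment, one Pod, one PVC."""
    labels = {"app": APP}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": APP, "labels": labels},
        "spec": {
            "replicas": 1,
            # Recreate stops the old pod before starting the new one, so two
            # pods never write to the same memory volume during a rollout.
            "strategy": {"type": "Recreate"},
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": APP,
                        "image": IMAGE,
                        "volumeMounts": [
                            {"name": "memory", "mountPath": MEMORY_PATH},
                        ],
                    }],
                    "volumes": [{
                        "name": "memory",
                        "persistentVolumeClaim": {"claimName": f"{APP}-memory"},
                    }],
                },
            },
        },
    }


def render() -> str:
    """Serialize both objects as a v1 List so they can be applied in one step."""
    items = [memory_pvc(), deployment()]
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)


if __name__ == "__main__":
    print(render())
```

If saved as manifests.py (a hypothetical file name), the output can be piped to kubectl apply -f - once the placeholders have been replaced.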

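Multi-replica. For multi-tenant setups (e.g., a SaaS offering multiple agents), run one agent instance per tenant. Each instance needs its own memory volume. Replicas of a single Deployment all reference the same PVC, so give each agent its own Deployment and PVC, or use a StatefulSet with volumeClaimTemplates. Use a shared config for common settings and per-instance config for agent identity. Route traffic (e.g., by tenant ID) to the right instance.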

ConfigMaps and Secrets. Store config.yaml in a ConfigMap — but keep secrets out. Use a Secret for API keys, tokens, database URLs. Mount both into the Pod. For sensitive config, consider external secrets operators (e.g., External Secrets Operator for Vault integration).
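As a rough illustration of the split between ConfigMap and Secret, the sketch below builds both objects plus the pod-spec fragment that mounts them. The object names, mount paths, the API_KEY key, and the config.yaml contents are hypothetical. The one Kubernetes detail it relies on is that a Secret's data field holds base64-encoded values.

```python
import base64
import json


def config_map(name: str, config_yaml: str) -> dict:
    """Non-sensitive settings, stored under a single config.yaml key."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": {"config.yaml": config_yaml},
    }


def secret(name: str, values: dict) -> dict:
    """Credentials; values in the `data` field must be base64-encoded."""
    encoded = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": encoded,
    }


def volume_wiring(config_name: str, secret_name: str) -> dict:
    """Fragments to merge into a pod spec: `volumeMounts` goes on the
    container, `volumes` goes on the pod. Both are mounted read-only."""
    return {
        "volumeMounts": [
            {"name": "config", "mountPath": "/etc/openclaw", "readOnly": True},
            {"name": "credentials", "mountPath": "/etc/openclaw-secrets", "readOnly": True},
        ],
        "volumes": [
            {"name": "config", "configMap": {"name": config_name}},
            {"name": "credentials", "secret": {"secretName": secret_name}},
        ],
    }


if __name__ == "__main__":
    objects = [
        config_map("openclaw-config", "# non-secret settings go here\n"),
        secret("openclaw-credentials", {"API_KEY": "replace-me"}),
    ]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": objects}, indent=2))
```

Base64 is an encoding, not encryption, so a generated Secret manifest should stay out of version control. That is one reason to consider an external secrets operator instead.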

Helm. Community Helm charts exist for OpenClaw. Validate before use — check image source, default resources, security context. Or build your own chart for full control. Helm simplifies upgrades and environment management.

Scaling Considerations

OpenClaw is stateful: memory lives on disk. Scaling horizontally means multiple agents with separate memory. For multi-agent workflows, use multi-agent patterns with shared memory directories via ReadWriteMany volumes if your storage class supports it.

Horizontal scaling. Don't scale a single agent across multiple pods — there's no built-in coordination. Each pod is an independent agent. To scale, add more agents (more deployments), not more replicas of one agent.

Shared memory. For multi-agent setups where agents share context, use a ReadWriteMany PVC (e.g., NFS, EFS, Azure Files). All agent pods mount the same memory directory. Coordinate carefully — concurrent writes can corrupt files. Consider file locking or partitioning memory by agent.
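The two mitigations mentioned above, file locking and partitioning by agent, might look like this in helper tooling that writes into the shared directory. This is a sketch only. It assumes a Unix host, it assumes every writer cooperates by taking the same advisory lock, and it says nothing about how OpenClaw itself writes its memory files. Lock behavior over NFS-style storage depends on the client and server, so test it on your actual volume.

```python
import fcntl
import os
from pathlib import Path


def append_locked(path: Path, line: str) -> None:
    """Append one line while holding an exclusive advisory lock on the file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a", encoding="utf-8") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)      # blocks until no other holder remains
        try:
            fh.write(line.rstrip("\n") + "\n")
            fh.flush()
            os.fsync(fh.fileno())           # push the write to storage before unlocking
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def agent_dir(root: Path, agent_id: str) -> Path:
    """Partitioning alternative: each agent writes only under its own subdirectory."""
    directory = root / agent_id
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

Partitioning is the simpler of the two: if each agent only ever writes under its own subdirectory, no lock is needed for writes, and other agents can still read the shared tree.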

Resource limits. Set requests and limits. Start with 512Mi RAM and 0.5 CPU (500m) for cloud-model deployments. Local models need significantly more (8GB+ RAM, depending on the model). Adjust based on load. OOM kills are disruptive: a container that exceeds its memory limit is killed and restarted, so keep the limit above observed peak usage.
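The starting figures above translate into a container resources block along these lines. The profile names are made up for this sketch. Setting limits equal to requests is one possible choice, not a recommendation from the OpenClaw project; it places the pod in the Guaranteed QoS class.

```python
# Starting points from the text; tune them against observed usage.
PROFILES = {
    "cloud-model": {"memory": "512Mi", "cpu": "500m"},
    "local-model": {"memory": "8Gi", "cpu": "2"},
}


def resources(profile: str) -> dict:
    """Container `resources` block with limits equal to requests."""
    base = PROFILES[profile]
    return {"requests": dict(base), "limits": dict(base)}


def is_guaranteed(block: dict) -> bool:
    """True when requests and limits match for both CPU and memory."""
    return block["requests"] == block["limits"] and {"cpu", "memory"} <= set(block["limits"])
```

A pod whose containers all have matching requests and limits is among the last to be evicted under node memory pressure, which suits an agent that is expensive to interrupt. The cost is that the container cannot burst above its request.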

HPA. Horizontal Pod Autoscaler doesn't apply well to stateful single-agent deployments. For multi-tenant, you might scale replicas based on tenant count — but that's typically manual or based on provisioning, not CPU.

Persistent Storage

Memory and config need persistent volumes. Use PVCs with adequate size. Back up memory directories regularly — see backup guide. Don't use emptyDir for memory; you'll lose state on pod restart.

PVC sizing. Memory grows over time. Start with 10Gi; monitor. Agent memory can reach hundreds of MB for heavy users. Plan for growth.
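To make "plan for growth" concrete, a back-of-the-envelope projection like the one below estimates how long a volume lasts before it crosses an alert threshold. The growth rate and current usage in the example are hypothetical; measure your own (for instance from volume usage metrics) before relying on the result.

```python
def days_until_threshold(capacity_gib: float, used_gib: float,
                         growth_mib_per_day: float, threshold: float = 0.8) -> float:
    """Days until usage reaches `threshold` of capacity, assuming linear growth."""
    if growth_mib_per_day <= 0:
        return float("inf")
    budget_mib = (capacity_gib * threshold - used_gib) * 1024
    return max(budget_mib, 0.0) / growth_mib_per_day


if __name__ == "__main__":
    # Hypothetical agent: 10Gi volume, 0.5 GiB used, growing 20 MiB per day.
    days = days_until_threshold(capacity_gib=10, used_gib=0.5, growth_mib_per_day=20)
    print(f"about {days:.0f} days until the volume is 80% full")
```

Linear growth is a simplification, so rerun the estimate whenever usage patterns change.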

Storage class. Use a storage class with appropriate performance. OpenClaw does frequent file reads/writes. SSD-backed storage (e.g., gp3 on AWS) is recommended for production. Avoid slow network storage for memory — it can cause latency.

Backup. Memory is critical. Back up the PVC regularly — before upgrades, daily for production. Test restore. Consider Velero or similar for cluster-level backup. See backup guide.

ReadWriteMany. Only if you need shared memory across pods. Not all clusters support RWX. EKS with EFS, GKE with Filestore — check your provider. RWX adds complexity; use only when necessary.

High Availability

For HA: run one active replica, use PodDisruptionBudgets, and ensure your storage is highly available. OpenClaw doesn't support active-active for a single agent; use one primary. A restarted pod reattaches the same volume; if the volume itself is lost, restore from backup to a new pod.

Single primary. One agent = one pod. No active-active. The agent has in-memory state (sessions, etc.) that doesn't replicate. For HA, focus on: fast restart, reliable storage, and backup.

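PodDisruptionBudget. A PDB limits voluntary disruptions (node drain, etc.). With a single replica, minAvailable: 1 blocks eviction outright rather than waiting for a replacement: a node drain stalls until you delete the pod yourself or relax the budget. That is a deliberate trade-off. It turns drains and node upgrades into planned steps instead of surprise restarts.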
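A PodDisruptionBudget for the single-replica case can be sketched as follows. The app label is a placeholder and must match the labels on the agent's pods.

```python
import json


def pod_disruption_budget(app: str = "openclaw", min_available: int = 1) -> dict:
    """PDB selecting the agent's pods by label."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb"},
        "spec": {
            # With one replica and minAvailable 1, no voluntary eviction is
            # allowed, so a drain of the pod's node will not proceed on its own.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(pod_disruption_budget(), indent=2))
```

A PDB only affects voluntary disruptions that go through the eviction API. It does nothing for node crashes or for someone deleting the pod directly.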

Liveness and readiness. Configure probes. Liveness: is the process running? Readiness: can it accept traffic? OpenClaw may need custom health endpoints — check if the gateway exposes /health or similar. Without probes, K8s can't recover from hangs.
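A hedged sketch of the probe configuration: liveness falls back to a plain TCP check, and readiness uses an HTTP endpoint only if you have confirmed one exists. The port, the /health path, and all timing values here are assumptions for illustration, not documented OpenClaw behavior.

```python
from typing import Optional


def probes(port: int, health_path: Optional[str] = None) -> dict:
    """Liveness and readiness probe fragments for a container spec.

    port        -- port the gateway listens on (deployment-specific assumption)
    health_path -- HTTP health endpoint, if the gateway exposes one
    """
    tcp_check = {"tcpSocket": {"port": port}}
    if health_path:
        ready_check = {"httpGet": {"path": health_path, "port": port}}
    else:
        ready_check = tcp_check
    return {
        # Liveness: restart the container if the port stops accepting connections.
        "livenessProbe": {
            **tcp_check,
            "initialDelaySeconds": 15,
            "periodSeconds": 20,
            "failureThreshold": 3,
        },
        # Readiness: withhold traffic until the check passes.
        "readinessProbe": {
            **ready_check,
            "periodSeconds": 10,
            "failureThreshold": 3,
        },
    }
```

For example, probes(8080, "/health") yields an HTTP readiness check, while probes(8080) uses TCP for both. A TCP check will not catch a process that still accepts connections but has stopped doing work, so an HTTP endpoint is preferable when one is available.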

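Failover. If the pod dies, Kubernetes restarts it (restartPolicy: Always). The same PVC reattaches, so memory persists. On node failure, the pod is rescheduled and the PVC follows, provided another node can attach the volume (block volumes such as EBS are zonal). Recovery time: typically 1-2 minutes for a pod restart; node failures take longer, because by default Kubernetes waits about five minutes before evicting pods from an unreachable node. For faster failover, consider a standby replica that mounts the same volume read-only and promotes on primary failure. This is an advanced setup and needs storage that can be attached to more than one node.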

Storage HA. Use storage with redundancy. Cloud provider volumes (EBS, Persistent Disk) are typically replicated. Avoid single-point-of-failure storage.

Implementation Checklist

  • □ Build or obtain Docker image for OpenClaw
  • □ Create ConfigMap for config.yaml; Secret for credentials
  • □ Create PVC for memory (10Gi+ to start)
  • □ Deploy as Deployment with 1 replica
  • □ Configure liveness/readiness probes
  • □ Set resource requests and limits
  • □ Set up backup for memory PVC
  • □ Configure PodDisruptionBudget
  • □ Test: kill pod, verify restart and state persistence
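The last checklist item can be scripted roughly as below: write a marker file into the memory volume, delete the pod, wait for the replacement, and read the marker back. The Deployment name, label selector, and marker path are hypothetical, and the sketch assumes the image ships a shell and cat. The run parameter exists so the logic can be exercised without a cluster.

```python
import subprocess
import uuid


def kubectl(*args: str) -> str:
    """Run kubectl and return stdout; raises CalledProcessError on failure."""
    result = subprocess.run(["kubectl", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def check_persistence(run=kubectl,
                      target: str = "deployment/openclaw",
                      selector: str = "app=openclaw",
                      marker: str = "/data/memory/.persistence-check") -> bool:
    """Return True if a file written before a pod kill is still there afterwards."""
    token = uuid.uuid4().hex
    # 1. Write a unique token into the persistent volume.
    run("exec", target, "--", "sh", "-c", f"echo {token} > {marker}")
    # 2. Kill the pod and wait until it is gone.
    run("delete", "pod", "-l", selector, "--wait=true")
    # 3. Wait for the Deployment to bring up a replacement.
    run("rollout", "status", target, "--timeout=180s")
    # 4. Read the token back from the new pod.
    return run("exec", target, "--", "cat", marker).strip() == token
```

Calling check_persistence() against a test namespace returns True when state survived the restart. Run it before trusting a new storage class or after changing volume settings.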

FAQ

Can I run OpenClaw on EKS or GKE? Yes. Same patterns apply. Use AWS or GCP services for storage and secrets. EKS: EBS for storage, Secrets Manager for credentials. GKE: Persistent Disk, Secret Manager.

Resource limits? Start with 512Mi RAM, 0.5 CPU for cloud-model deployments. For local models (Ollama in same pod), 8GB+ RAM, 2+ CPU. Adjust based on load. Monitor and tune.

Can I use a StatefulSet? You can. StatefulSets give stable network identity and ordered deployment. For a single replica, Deployment is simpler. StatefulSet helps when you have multiple replicas with distinct identity (e.g., agent-0, agent-1).
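For the multi-replica case, a StatefulSet with volumeClaimTemplates gives each pod its own volume. The sketch below is illustrative only: the image, mount path, and the AGENT_ID environment variable are hypothetical, and the serviceName refers to a headless Service that you would create separately.

```python
def stateful_set(replicas: int = 2, size: str = "10Gi") -> dict:
    """StatefulSet whose pods are named agent-0, agent-1, ... each with its own PVC."""
    labels = {"app": "openclaw"}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": "agent"},
        "spec": {
            "serviceName": "agent",          # headless Service, defined elsewhere
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "openclaw",
                        "image": "registry.example.com/openclaw:1.0.0",  # placeholder
                        # Hypothetical variable: expose the pod name (agent-0, ...)
                        # so per-agent config can key off it.
                        "env": [{
                            "name": "AGENT_ID",
                            "valueFrom": {"fieldRef": {"fieldPath": "metadata.name"}},
                        }],
                        "volumeMounts": [
                            {"name": "memory", "mountPath": "/data/memory"},
                        ],
                    }],
                },
            },
            # One PVC per pod, created from this template.
            "volumeClaimTemplates": [{
                "metadata": {"name": "memory"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": size}},
                },
            }],
        },
    }


def claim_names(replicas: int) -> list:
    """PVC names follow the pattern <template>-<statefulset>-<ordinal>."""
    return [f"memory-agent-{i}" for i in range(replicas)]
```

Scaling the StatefulSet down does not delete the PVCs by default, so a tenant's memory survives until the claim is removed explicitly.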

What about Istio/Linkerd? Service mesh can add observability and mTLS. OpenClaw works with it. Ensure the mesh doesn't interfere with long-lived connections (e.g., long polling) or inbound webhooks (e.g., from Telegram). Test thoroughly.

Wrapping Up

Kubernetes provides production-grade deployment for OpenClaw: orchestration, storage, and resilience. Design for statefulness: persistent storage, single primary, backup. OpenClaw Consult helps with K8s architecture and deployment.