Introduction

We talk to teams every week who are planning their OpenClaw deployment. The single most common mistake we see — across startups, agencies, and enterprise IT — is treating the agent host and the model inference backend as a single hardware decision. They aren't. They have different compute profiles, different scaling characteristics, and different cost structures. Confusing them leads to either overspending on hardware you don't need or under-provisioning the layer that actually bottlenecks performance.

This guide breaks down the architectural split, explains what each side needs, and gives practical server selection advice based on dozens of production deployments we've configured.

The Misconception Everyone Makes

When someone says "I need a server for OpenClaw," they're actually describing two entirely different compute jobs:

  1. Agent orchestration — the OpenClaw gateway, tool calling, memory operations, API integrations, multi-agent coordination, and business logic. This is CPU work.
  2. LLM inference — transformer attention, token generation, matrix multiplications. This is GPU work (or very large unified-memory CPU work on Apple Silicon).

The confusion started because early OpenClaw adopters ran both on the same Apple Mac Mini. Apple's unified memory architecture made this possible — the CPU and GPU share a single memory pool, so a 24GB Mac Mini could host the OpenClaw agent process and run a local 14B-parameter model through Ollama simultaneously. People assumed that was the architecture. It wasn't. It was a convenient deployment shortcut enabled by Apple's unusual chip design.

When companies scale past a single personal agent, that shortcut becomes a liability.

What the CPU Side Actually Does

The OpenClaw agent process is a Node.js service. It handles:

  • Message routing: Receiving and dispatching messages across Telegram, WhatsApp, Slack, Discord, iMessage, and web channels
  • State management: Tracking conversation history, active workflows, pending tool calls, and heartbeat schedules
  • Tool execution: Running shell commands, calling external APIs, performing file operations, managing browser automation
  • Memory I/O: Reading and writing markdown memory files, building context windows from soul.md and knowledge files
  • Multi-agent coordination: When running agent teams, managing inter-agent message passing and task delegation
  • Business logic: Heartbeat cycles, two-tier processing scripts, conditional escalation, webhook handling

All of this is integer-heavy, I/O-bound, classical CPU work. It doesn't touch floating-point matrix math. It doesn't need tensor cores. A modern 4-core CPU with 8GB of RAM handles a single OpenClaw agent with ease. Ten agents running on the same machine might want 8 cores and 16GB, but the per-agent resource footprint is modest.

The agent process spends most of its time waiting — waiting for the LLM to respond, waiting for API calls to complete, waiting for the next heartbeat cycle. CPU utilization rarely exceeds 15% for a single agent during normal operation.

What the GPU Side Actually Does

LLM inference is where the computational weight lives. When your agent sends a prompt to Claude, GPT-4, or a locally-hosted model, the inference engine performs:

  • Matrix multiplications: Billions of floating-point operations per token generated
  • Attention computation: Comparing every token against every other token in the context window
  • KV cache management: Storing intermediate attention states in fast memory for efficient generation
  • Embedding calculations: Converting text to numerical representations and back

This workload is embarrassingly parallel — it maps perfectly to GPU architectures with thousands of cores running in lockstep. A single NVIDIA A100 can generate tokens 50–100x faster than the best CPU-only inference. Memory bandwidth — how fast the GPU can feed data to its cores — is typically the bottleneck, not raw compute.
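The bandwidth bottleneck can be made concrete with a back-of-the-envelope estimate: at batch size 1, every model weight must stream from GPU memory once per generated token, so throughput is capped at roughly bandwidth divided by weight size. A minimal sketch, with illustrative hardware figures:

```python
# Rough upper bound on decode throughput for a bandwidth-bound GPU.
# Batch-1 token generation streams every weight from memory once per
# token, so tokens/s can't exceed bandwidth / weight bytes.
# The figures below are illustrative assumptions, not benchmarks.

def tokens_per_second(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper-bound tokens/s when decoding is memory-bandwidth-bound."""
    return bandwidth_gb_s / weight_gb

# A100 80GB: roughly 2,000 GB/s of HBM bandwidth (assumed figure)
# 70B model at 4-bit quantization: roughly 35 GB of weights
print(round(tokens_per_second(2000, 35)))  # prints 57
```

Real throughput lands below this ceiling (KV cache reads, kernel overhead), but the estimate explains why bandwidth, not raw FLOPS, dominates inference hardware selection.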

For most OpenClaw users, inference happens remotely. You send an API call to Anthropic, OpenAI, or Google, and their GPU clusters handle the compute. You pay per token. Your local hardware never touches this workload at all.

When you run local models (via Ollama, llama.cpp, or vLLM), the GPU requirements become your problem. And they're significant: a 70B-parameter model needs roughly 35GB of memory just to hold the weights at 4-bit quantization. That's more than any single consumer GPU provides.
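The 35GB figure falls out of simple arithmetic: parameters times bytes per parameter. A quick sketch of the weight-memory floor at common quantization levels:

```python
# Minimum memory needed just to hold model weights.
# bits_per_param / 8 gives bytes per parameter; billions of params
# times bytes per param gives GB. Real deployments also need room
# for the KV cache and activations, so treat these as floors.

def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
# 70B @ 16-bit: 140 GB
# 70B @ 8-bit: 70 GB
# 70B @ 4-bit: 35 GB
```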

Why Separating Them Matters

Once you understand that these are different workloads, several deployment decisions become obvious:

Scaling independence

Agent orchestration scales with the number of agents and channels. Inference scales with model size and request volume. These don't correlate. A company might run 50 agents (high CPU need) against a single API endpoint (no local GPU need). Or one agent might need a dedicated 70B model for compliance reasons (high GPU need, minimal CPU need).

Cost optimization

CPU compute is cheap. A $5/month VPS runs an OpenClaw agent perfectly. GPU compute is expensive. An NVIDIA A100 GPU costs $10,000+ to buy or $1–3/hour to rent. Bundling them means either paying for GPU hardware your agents never touch, or scaling out expensive GPU machines just to get more CPU headroom.

Reliability isolation

If your local model server crashes or needs a restart for a model swap, your agents should keep running — queuing messages, executing heartbeat scripts, maintaining state. If your agent host reboots, the inference backend should serve other clients uninterrupted. Tight coupling means one failure brings down everything.

Security boundaries

The agent process has access to sensitive data: API keys, memory files, messaging credentials, shell access. The inference server only needs model weights and incoming prompts. Separating them lets you apply different security policies — the inference server can be on a restricted network segment with no access to secrets.

The All-in-One Machine Trap

We call it the "Mac Mini in a cubicle" problem. A team starts with one person running OpenClaw on a Mac Mini at their desk. It works great. Then three people want it. Then ten. Suddenly the office has a fleet of small machines running critical business automation:

  • No central backup: If someone's Mac Mini fails, that agent's memory and configuration are gone
  • No security policy: Each machine has its own API keys, its own exposed ports, its own (lack of) firewall rules
  • Stranded resources: Each Mini has 24GB of RAM, most of it unused. Collectively, the fleet wastes hundreds of gigabytes of memory
  • No monitoring: Nobody knows which agents are running, which have crashed, or which are burning through API budget
  • Physical vulnerability: Someone unplugs a power cable, walks out with a machine, or the office Wi-Fi drops — agents go offline with no failover

This mirrors what happened in the 1990s with workstation computing. Companies learned the hard way that critical applications belong on managed infrastructure, not desk-side hardware. The same lesson applies to AI agents.

Moving From Desk to Data Center

The natural migration path for serious OpenClaw deployments:

Phase 1: Personal experimentation

Mac Mini, laptop, or VPS. One agent, cloud models only. Goal: learn the system, build useful automations. Total cost: $0–10/month (hardware you already own + API costs).

Phase 2: Team deployment

Move agents to a shared server or VM cluster. Docker containers per agent. Central management via docker-compose or a simple orchestration script. Shared API keys with spend tracking. Total cost: $30–100/month (VPS or on-prem server + API costs).
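A Phase 2 layout can be as simple as one compose file with a service per agent. This is a hypothetical sketch: the image name, env var layout, and volume paths are illustrative assumptions, not OpenClaw's actual schema.

```yaml
# Hypothetical docker-compose sketch: image name, env files, and
# volume paths are illustrative, not OpenClaw's documented layout.
services:
  agent-support:
    image: openclaw/agent:latest        # assumed image name
    restart: unless-stopped             # survive crashes and reboots
    env_file: ./secrets/support.env     # API keys stay out of the compose file
    volumes:
      - ./agents/support/memory:/app/memory   # persist markdown memory files
  agent-sales:
    image: openclaw/agent:latest
    restart: unless-stopped
    env_file: ./secrets/sales.env
    volumes:
      - ./agents/sales/memory:/app/memory
```

The per-agent env files and volumes keep credentials and memory isolated, which makes the later move to Phase 3 orchestration mostly a copy-paste exercise.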

Phase 3: Production infrastructure

Agents run on managed containers (Docker Swarm, Kubernetes, or even systemd on dedicated servers). Inference is either cloud API or a dedicated GPU server running vLLM for self-hosted models. Centralized logging, monitoring, automated restarts, backup rotation. Total cost: varies wildly by scale, but the architecture is now properly split.

The key transition is Phase 2 to Phase 3 — that's where the CPU/GPU split becomes critical. You stop buying all-in-one machines and start provisioning each layer independently.

Picking Your Agent CPU Hardware

For the agent orchestration side, here's what actually matters:

What to prioritize

  • Reliability: ECC RAM, server-grade storage, redundant power — your agents should run for months without intervention
  • Core count over clock speed: Each agent is I/O-bound, not compute-bound. More cores = more agents per machine. A 32-core server comfortably runs 100+ lightweight agent processes
  • Fast storage: NVMe SSDs for memory file operations. Agents read and write markdown files constantly — spinning disks create latency
  • Network reliability: Stable, low-latency internet. Agents are making API calls and receiving webhooks continuously

What doesn't matter

  • GPU: You don't need one for agent orchestration. Zero. Save the money.
  • Clock speed: The difference between a 3.0GHz and 4.5GHz CPU is irrelevant when your bottleneck is network I/O waiting for LLM responses
  • Massive RAM: 512MB per agent is comfortable. 16GB serves dozens of agents. Don't buy 128GB for agent hosting alone.


Practical recommendations

| Scale | Hardware | Monthly Cost |
| --- | --- | --- |
| 1–5 agents | $5 VPS (2 vCPU, 4GB RAM) or Raspberry Pi 5 | $5–10 |
| 5–20 agents | $20 VPS (4 vCPU, 8GB RAM) or used Dell Optiplex | $15–30 |
| 20–100 agents | Dedicated server (8+ cores, 32GB RAM) or small Kubernetes cluster | $50–200 |
| 100+ agents | Multi-node cluster with container orchestration | $200+ |

Picking Your Inference GPU Strategy

The inference side has three approaches, each with clear trade-offs:

Option A: Cloud API (recommended for most)

Use Anthropic, OpenAI, or Google's hosted models. You pay per token, scale instantly, and maintain zero GPU infrastructure. This is the right choice for 90% of OpenClaw deployments.

Pros: Zero hardware cost, instant access to frontier models, no maintenance. Cons: Per-token costs at scale, data leaves your network, vendor dependency.

Option B: Dedicated GPU server with self-hosted models

Run vLLM, Ollama, or llama.cpp on your own GPU hardware. Required when you have compliance restrictions (data can't leave your network), need custom fine-tuned models, or have enough inference volume that self-hosting is cheaper than API costs.
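Whether your volume justifies self-hosting is checkable with simple arithmetic: compare monthly API spend against the amortized GPU cost plus running costs. All figures below are illustrative assumptions; plug in your own.

```python
# Break-even sketch: monthly API spend vs. amortized self-hosted GPU.
# Every number here is an assumption for illustration.

def api_monthly_cost(tokens_millions: float, usd_per_million: float) -> float:
    """Monthly API bill at a flat per-token rate."""
    return tokens_millions * usd_per_million

def self_hosted_monthly_cost(gpu_price: float, amortize_months: int,
                             power_and_hosting: float) -> float:
    """GPU purchase spread over its useful life, plus running costs."""
    return gpu_price / amortize_months + power_and_hosting

api = api_monthly_cost(tokens_millions=500, usd_per_million=3.0)
local = self_hosted_monthly_cost(gpu_price=5000, amortize_months=36,
                                 power_and_hosting=150)
print(f"API: ${api:,.0f}/mo  self-hosted: ${local:,.0f}/mo")
# API: $1,500/mo  self-hosted: $289/mo
```

At low volume the inequality flips, which is why the cloud API remains the default recommendation; the calculation only favors self-hosting once token throughput is consistently high.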

Hardware tiers for self-hosted inference:

  • Entry: Single RTX 4090 (24GB VRAM) — runs 30B models at good speed, $1,600
  • Mid: Dual RTX 4090 or single A6000 (48GB) — runs 70B models comfortably, $3,000–5,000
  • High: NVIDIA A100 80GB or H100 — runs frontier-class models at production throughput, $10,000–30,000
  • Unified memory alternative: AMD Ryzen AI Max+ 395 or NVIDIA GB10 with 128GB LPDDR5X — holds very large models in shared memory, $2,000–3,000

Option C: Hybrid — cloud for heavy lifting, local for lightweight tasks

Route complex reasoning to Claude or GPT-4o via API. Route simple classification, summarization, or embedding tasks to a local 8B model running on modest hardware. This is increasingly popular for cost-sensitive deployments that still need frontier reasoning for critical decisions.

OpenClaw supports model routing natively — you can configure different models for different tasks in your agent's configuration.
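A hybrid routing setup might look something like the following. This is a hypothetical illustration only: the key names and structure are assumptions, not OpenClaw's documented configuration format, so check the official docs for the real schema.

```yaml
# Hypothetical routing config: key names are illustrative assumptions,
# not OpenClaw's documented schema.
models:
  default: claude-sonnet          # frontier model for complex reasoning
  tasks:
    classification: llama3:8b     # cheap local model via Ollama
    summarization: llama3:8b
    escalation: gpt-4o            # second frontier option for review steps
```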

Real-World Configurations We Deploy

Here are actual deployment architectures we've built for clients:

Small agency (5 agents, 3 team members)

Agent host: Hetzner CX31 VPS (4 vCPU, 8GB RAM, $15/month). Inference: Anthropic Claude API. All five agents run as Docker containers on the single VPS. Total infrastructure cost: $15/month + API usage (~$50–150/month depending on volume).

E-commerce company (15 agents across customer support, inventory, and marketing)

Agent host: Dedicated Hetzner server (AMD Ryzen 9, 64GB RAM, $65/month). Inference: Mix of Claude API for customer-facing responses and a local Llama 3 70B on an RTX 4090 workstation for internal data processing where PII can't leave the network. Total infrastructure: $65/month + API costs + one-time $2,500 GPU workstation.

Consulting firm (40+ agents, enterprise compliance requirements)

Agent host: Three-node Kubernetes cluster on dedicated servers. Inference: Self-hosted vLLM cluster with two NVIDIA A100 GPUs for full data sovereignty. All traffic stays on-premise. Total infrastructure: $800/month for servers + $20,000 one-time GPU investment.

Cost Comparison: Bundled vs Split

Consider a team that needs 10 OpenClaw agents with cloud API inference:

| Approach | Upfront Cost | Monthly Cost |
| --- | --- | --- |
| 10x Mac Mini M4 (one per agent) | $6,000–8,000 | ~$5 electricity + API |
| 1x shared VPS (all 10 agents) | $0 | $20 VPS + API |
| 1x dedicated server (room to grow) | $0 (rented) | $50 server + API |

The all-in-one approach spends $6,000–8,000 upfront for a result the shared VPS delivers with no upfront cost at all. The agents on the shared VPS perform identically; they're making the same API calls to the same cloud models. The Mac Minis' GPUs sit completely idle.
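The first-year gap is easy to compute from the table's own figures (taking the low end of the Mac Mini range):

```python
# First-year cost from the table above, low end of each range.
# API spend is identical in both cases, so it cancels out.

def first_year_cost(upfront: float, monthly: float) -> float:
    return upfront + 12 * monthly

minis = first_year_cost(upfront=6000, monthly=5)   # 10x Mac Mini + power
vps = first_year_cost(upfront=0, monthly=20)       # shared VPS
print(f"Mac Minis: ${minis:,.0f}  VPS: ${vps:,.0f}  ratio: {minis / vps:.0f}x")
# Mac Minis: $6,060  VPS: $240  ratio: 25x
```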

The math only shifts when you need local inference. Then the GPU investment is justified — but it should be a dedicated inference server, not a GPU in every agent machine.

Wrapping Up

The core insight: OpenClaw agents are CPU processes. LLM inference is GPU work. They belong on different hardware. Conflating them is the most expensive mistake teams make when scaling past a single personal agent.

Start with a cheap VPS and cloud API. When you need local models, add a dedicated GPU server. When you need enterprise reliability, move agents to managed containers. At every stage, keep the two layers independent — your architecture will be simpler, cheaper, and more resilient.

Need Help Planning Your OpenClaw Server Architecture?

OpenClaw Consult designs and deploys production OpenClaw infrastructure — from single-agent VPS setups to multi-node enterprise clusters. We've built all the configurations described in this guide. Get in touch and we'll scope a deployment that fits your team.