OpenClaw is model-agnostic. Today you learn every provider it supports, when to use each model, and how to run Llama 3 locally with zero API cost using Ollama.
- Why OpenClaw decouples the intelligence layer and what that means for you.
- Anthropic, OpenAI, Google, and Chinese models: capabilities and trade-offs.
- Install Ollama, pull Llama 3.2, connect it to OpenClaw. Full walkthrough.
- Pricing comparison, hardware guide, and hybrid local + cloud strategies.
- Which model for which task: a decision framework you can actually use.
- How to switch models in your config file. One change, no code.
- Raspberry Pi to Mac Studio: what runs where and at what speed.
- Get Ollama running, test a local model, compare it to your cloud model.
The active model lives in `~/.openclaw/openclaw.json` under `agents.defaults.model.primary`. OpenClaw's Gateway abstracts the model layer completely: the agent issues a request and receives a response. Which provider processes it is entirely configurable.
Switching from Claude to GPT to a local Llama model requires a single config change and a restart. No code changes.
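Pointing the agent at a different model is one edit to `agents.defaults.model.primary`. A minimal sketch of `~/.openclaw/openclaw.json` — the surrounding structure and the exact `provider/model` identifier format shown here are assumptions for illustration, not a documented schema:

```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4.6"
      }
    }
  }
}
```

Change the `primary` value, restart the Gateway, and every subsequent request routes to the new model.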
Use different models for different tasks. Cheap model for heartbeat. Powerful model for analysis. Local model for sensitive data.
Pricing changes. APIs evolve. New models launch. Your agent infrastructure, memory, and Skills survive all of it.
Chinese models: DeepSeek V3.2, Kimi K2.5, GLM-5. Competitive performance at roughly 1/10th the cost of US cloud models.
OpenClaw's most community-tested integration. Current models as of March 2026:
Exceptional complex reasoning, precise instruction following, sophisticated writing. 1M token context window. Use for high-value tasks where quality justifies spend.
Strong quality without the premium pricing of Opus. Many users run Sonnet for interactive conversations and Haiku for automated tasks.
Fast, cost-effective, performs well on structured tasks, data extraction, summarization, and routine decision-making. Excellent for heartbeat cycles.
Released March 2026. Most capable and efficient frontier model. 33% fewer errors than GPT-5.2. Thinking and Pro variants available.
Released Aug 2025. Strong reasoning, reliable tool use, multimodal. Routes between fast and deep reasoning models automatically.
Strong general reasoning, reliable tool use, good code generation, 128K context. Multimodal via the vision Skill.
~16x cheaper than GPT-4o. Retains strong performance on structured tasks, summarization, and instruction following.
Released Feb 2026. Reasoning-first model optimized for complex agentic workflows and coding. 1M token context window. Adaptive thinking. SWE-bench 80.6%.
Pro-grade reasoning with Flash-level latency and cost efficiency. Built for speed. Also available: Gemini 2.5 Pro, Gemini 2.5 Flash for older integrations.
The Gemini integration in OpenClaw was added through community contributions. Available through direct API or Vertex AI for enterprise. Test thoroughly before production use.
Competitive reasoning quality at roughly 1/10th the cost of US cloud models.
| Model | Origin | SWE-bench | Strength |
|---|---|---|---|
| GLM-5 | Zhipu AI (China) | 77.8% | 744B MoE, MIT license, $1/$3.20 per 1M tokens (in/out) |
| Kimi K2.5 | Moonshot (China) | 76.8% | Strong reasoning, 1T params |
| DeepSeek V3.2 | DeepSeek (China) | 73.0% | Extreme cost efficiency |
GLM-5's 77.8% and Kimi K2.5's 76.8% on SWE-bench approach Claude Opus 4.6 (80.8%) and GPT-5.2 (80.0%). For many agent tasks the gap is negligible. GLM-5 API: $1/$3.20 per 1M tokens. US users should verify latency and compliance.
| Use Case | Recommended Model |
|---|---|
| Complex reasoning & analysis | GPT-5.4 or Claude Opus 4.6 |
| Heartbeat / background monitoring | GPT-4o Mini or Claude Haiku 4.5 |
| Privacy-sensitive tasks | Llama 3.2 8B or Mistral 7B (local) |
| Code generation & debugging | GPT-5.4 or Claude Opus 4.6 |
| Zero cost constraint | Llama 3.1 70B (local, high-end hardware) |
| Multilingual tasks | Gemini 3.1 Pro or Qwen 2.5 (local) |
Start with a single model for simplicity. Once stable, introduce model routing to optimize cost and quality across different task types.
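Once you do introduce routing, the decision table above reduces to a small lookup. A minimal sketch — the task labels and the routing function are illustrative, not OpenClaw's actual API:

```python
# Model routing sketch mirroring the decision table above.
# Task categories and model identifiers are assumptions for the example.
ROUTES = {
    "complex_reasoning": "claude-opus-4.6",
    "heartbeat": "gpt-4o-mini",
    "privacy_sensitive": "llama3.2",   # local via Ollama
    "code": "gpt-5.4",
    "multilingual": "gemini-3.1-pro",
}

def pick_model(task_type: str) -> str:
    """Return the model for a task type, falling back to a cheap default."""
    return ROUTES.get(task_type, "gpt-4o-mini")
```

For example, `pick_model("heartbeat")` returns the cheap model, while anything unrecognized also falls back to it — a deliberate bias toward low cost for unclassified background work.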
| Model | Input / 1M tokens | Output / 1M tokens |
|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |
| GPT-5.4 | $2.50 | $15.00 |
| GPT-4o | $2.50 | $10.00 |
| GPT-4o Mini | $0.15 | $0.60 |
| Local (Ollama) | $0 | $0 |
A power user reported burning through 180 million tokens in weeks after enabling aggressive heartbeat monitoring with an expensive model and accidentally creating a feedback loop. With frontier pricing, that was hundreds of dollars. Always set spending limits.
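A few lines of arithmetic against the pricing table is enough to sanity-check your exposure before enabling a heartbeat. A simple sketch (the 120M/60M input/output split is an assumed example, not the incident's actual breakdown):

```python
def token_cost(tokens_in: int, tokens_out: int,
               in_per_m: float, out_per_m: float) -> float:
    """API cost in dollars, given per-1M-token input/output prices."""
    return tokens_in / 1e6 * in_per_m + tokens_out / 1e6 * out_per_m

# 180M total tokens (say 120M in / 60M out) at Claude Haiku 4.5 rates:
monthly = token_cost(120_000_000, 60_000_000, 1.00, 5.00)
# 120 * $1 + 60 * $5 = $420 — and several times that at frontier rates
```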
Morning briefing heartbeat. Occasional tasks. GPT-4o Mini for most, GPT-4o for complex. Heartbeat every 60 min.
Active heartbeat monitoring. Regular interactive use. Claude Haiku 4.5 for heartbeat, Claude Opus 4.6 for complex work.
Mac Mini with 16GB RAM. Llama 3.2 for everything. Only cost is electricity: roughly $1–2/month.
OpenClaw itself is free and open source (MIT license). You only pay for the intelligence layer.
An open-source tool for running large language models locally. It presents a clean API compatible with OpenAI's API spec, so OpenClaw connects to it using the same interface it uses for cloud providers.
`ollama pull llama3.2` downloads a model. `ollama list` shows what you have. No manual GGUF downloads.
Built on llama.cpp. GPU acceleration on NVIDIA, AMD, and Apple Silicon. Often 2–3x faster than less optimized stacks.
Zero data leaves your machine. No internet required for inference. Your conversations stay on your hardware.
On macOS, Ollama installs as a menu bar application that manages the server lifecycle automatically.
Ollama runs as a background service on Windows. Same API, same models, same port.
Ollama starts an API server on `http://localhost:11434`. Verify it: `curl http://localhost:11434/api/tags` — you should see a JSON response.
If you see a response from `ollama run llama3.2`, Ollama is working. The API server is live on `localhost:11434`.
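You can also exercise the server programmatically. A sketch against Ollama's `/api/chat` endpoint, with the payload builder separated out so it can be inspected without a running server (the helper names here are ours, not Ollama's):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/chat endpoint (non-streaming)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt: str, model: str = "llama3.2",
         host: str = "http://localhost:11434") -> str:
    """Send one chat turn. Requires a running Ollama server on `host`."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```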
OpenClaw treats Ollama as just another LLM provider. One config change is all it takes.
Run `openclaw configure` and select Ollama as your provider. It writes the config for you. Restart the Gateway and test with a message.
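If you prefer to edit by hand, a sketch of what the resulting file might look like — the `providers.ollama.baseUrl` key and the `ollama/llama3.2` identifier format are assumptions for illustration, not OpenClaw's documented schema:

```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/llama3.2"
      }
    }
  },
  "providers": {
    "ollama": {
      "baseUrl": "http://localhost:11434"
    }
  }
}
```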
| Model | Size | RAM | Best For |
|---|---|---|---|
| Llama 3.2 8B Instruct | ~5GB | 8GB | Balanced performance, good tool use |
| Mistral 7B Instruct v0.3 | ~4GB | 8GB | Fast responses, good instruction following |
| Qwen 2.5 14B Instruct | ~9GB | 16GB | Strong reasoning, excellent multilingual |
| Llama 3.1 70B Instruct | ~40GB | 64GB | Near-GPT-4 quality, high-end hardware |
| Phi-4 Mini (3.8B) | ~2.5GB | 4GB | Raspberry Pi and low-power devices |
Local models vary in their ability to generate well-formatted tool calls. Models with "-instruct" or "-chat" suffixes perform better. Llama 3.2 Instruct and Mistral 7B Instruct are community favorites for reliable tool use in OpenClaw.
| Hardware | Recommended Model | Speed |
|---|---|---|
| Raspberry Pi 5 (8GB) | Phi-4 Mini or Gemma 2 2B | 3–6 tokens/sec |
| Mac Mini M2 (8GB) | Llama 3.2 8B | 25–40 tokens/sec |
| Mac Mini M4 (16GB) | Qwen 2.5 14B | 20–35 tokens/sec |
| Mac Studio M4 (64GB) | Llama 3.1 70B | 15–25 tokens/sec |
| PC with RTX 4090 (24GB) | Llama 3.1 70B Q4 | 40–60 tokens/sec |
Apple Silicon Macs benefit from unified memory architecture. The GPU and CPU share the same memory pool, meaning an M4 Mac Mini with 24GB RAM can run a 20B parameter model with the GPU fully utilized — something impossible on a discrete GPU with only 12GB VRAM.
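The tokens/sec column translates directly into wall-clock expectations. A rough sketch that ignores prompt processing and model-load time:

```python
def response_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Rough generation time; ignores prompt processing and model load."""
    return output_tokens / tokens_per_sec

# A ~500-token reply at the table's extremes:
pi = response_seconds(500, 4)     # Raspberry Pi 5: 125.0 seconds
gpu = response_seconds(500, 50)   # RTX 4090:       10.0 seconds
```

That gap is why low-power hardware is best reserved for short, structured outputs rather than long-form generation.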
When multiple quantization levels are available, Q5_K_M provides a good balance of quality and speed. Roughly Q8 quality at Q4 speed.
Local models run slower with larger context. For heartbeat tasks that don't need extensive history, set a smaller `num_ctx` to improve throughput.
Model loading takes 10–30 seconds. Set `OLLAMA_KEEP_ALIVE` to keep models loaded in memory between calls. Once loaded, responses are fast.
Close memory-intensive applications when running large models. More RAM for Ollama means less disk paging and dramatically faster inference.
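The context and keep-alive knobs above can also be set per request in Ollama's API body (`keep_alive` is additionally controllable globally via the `OLLAMA_KEEP_ALIVE` environment variable). A request-body fragment:

```json
{
  "model": "llama3.2",
  "keep_alive": "30m",
  "options": { "num_ctx": 2048 }
}
```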
Many experienced OpenClaw users settle on a hybrid approach. Use the right model for the right job.
Use a local model. Structured, repetitive tasks where an 8B model performs fine. Zero API cost for the most frequent token consumption.
Use a local model. Legal documents, health data, financial analysis. Route to local regardless of quality considerations.
Use a cloud model. When you need the best reasoning, nuanced writing, or complex code generation, route to GPT-5.4 or Claude Opus 4.6.
Most automated monitoring doesn't need frontier intelligence. Claude Opus 4.6 heartbeat: ~$15–30/month. Claude Haiku 4.5 for the same tasks: ~$3–5/month. GPT-4o Mini: even less.
Both OpenAI and Anthropic allow monthly caps. Set one at $5–$10 before your agent runs unattended. Protects you from runaway tasks.
Moving from 30-minute to 60-minute cycles cuts background API usage by 50%. For most monitoring, hourly checks are sufficient.
Long system prompts and bloated memory files increase every request's token count. Periodically prune memory and tighten your system prompt.
Three ways to change your model. All take effect without rebuilding anything.
1. `openclaw configure` — interactive menu. Select your new provider and model.
2. Edit `~/.openclaw/openclaw.json` directly. The Gateway watches the file and applies changes.
3. Dashboard — form UI or raw JSON editor. Changes hot-reload automatically.
Run the install command for your platform. Pull a model with `ollama pull llama3.2`. Run `ollama run llama3.2 "Hello"` and confirm a response.
Run `openclaw configure` or edit your config. Set Ollama as the provider. Send a message through the dashboard and confirm it works.
Ask your agent the same question with your cloud model and your local model. Notice the difference in speed, quality, and cost.
Go to your API provider's dashboard and set a monthly cap. $5–$10 is a good starting point. Do this before Day 4.
`ollama list` — verify your downloaded models. Try pulling one more model that fits your hardware. Experiment with `ollama run` to test it.
You understand the full model landscape. You have Ollama running locally. You know which model to use for which task and how much it costs.
Cost Optimization — Now that you know the models, we go deep on controlling what you spend. Two-tier processing, heartbeat tuning, and strategies that cut your bill by 70% or more.