Models & Ollama, Day 3 of the Free Comprehensive OpenClaw Course
The Intelligence Layer
Why this matters
The model you wire OpenClaw to determines the agent's bill, its speed, and its actual personality. Most people default to whichever provider they already have an account with and pay for it later. This lesson walks through every provider OpenClaw supports, the price math for each, and the full Ollama setup for running an agent against a local Llama 3 with zero API cost.
Which AI model should I use with OpenClaw?
OpenClaw with Ollama is the headline of this lesson, but the bigger story is provider choice. The runtime is provider-agnostic: you swap from Anthropic Claude to OpenAI GPT to Google Gemini to local Ollama by changing one line in the .env. The agent code does not change, the prompts do not change; only the wire to the model changes.
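As a sketch of what that one-line swap looks like, here is a .env fragment. The OPENCLAW_PROVIDER and OPENCLAW_MODEL key names are assumptions for illustration; the OLLAMA_* keys are the ones this lesson uses in the setup section below.

```
# Cloud tier: Anthropic (key names illustrative, check your runtime's docs)
OPENCLAW_PROVIDER=anthropic
OPENCLAW_MODEL=claude-sonnet

# Local swap: comment out the two lines above and point the agent at Ollama
# OPENCLAW_PROVIDER=ollama
# OLLAMA_BASE_URL=http://localhost:11434
# OLLAMA_MODEL=llama3:70b
```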
The honest provider matrix as of today:
- Anthropic Claude. Best default for a serious agent. Claude Sonnet for the smart tier, Claude Haiku for the cheap routine tier. Strong tool use, strong instruction following, prompt caching that cuts repeated context cost by 90%. This is what most production OpenClaw deployments run on.
- OpenAI GPT. Strong general performer, deep tool ecosystem, slightly more expensive than Claude per equivalent quality. GPT-5.5 mini is the cheap tier, GPT-5.5 is the smart tier.
- Google Gemini. Cheapest of the cloud providers per token, strong at long-context tasks. Tool use is good but the ecosystem is slightly thinner than Anthropic or OpenAI.
- Ollama (local). Zero ongoing API cost, full data privacy, agent can run with no internet at all. Trade-off: smaller models, slower responses, weaker tool use. Llama 3 70B is the best practical choice if your machine can run it.
For most people the right answer is to start with Claude Sonnet for the first week to see what a good agent feels like, then move to a two-tier setup once you understand the cost shape. Day 4, openclaw cost optimization, covers the two-tier pattern in detail.
Cost per million tokens, all providers side by side
Real numbers, output token pricing, as of the time of writing. These shift every few months, so check the provider's pricing page before you commit, but the relative shape stays the same.
- Claude Sonnet: roughly $15 per million output tokens. Claude Haiku: roughly $1.25 per million.
- GPT-5.5: roughly $20 per million. GPT-5.5 mini: roughly $1.50 per million.
- Gemini 2.5 Pro: roughly $10 per million. Gemini 2.5 Flash: roughly $0.40 per million.
- Llama 3 70B on Ollama: $0 per token. Your only cost is the hardware and electricity.
The 12x to 16x gap between the smart tier and the cheap tier on the same provider is the entire reason two-tier routing works. A typical personal agent's prompt mix is 80 percent routine and 20 percent hard. Run all of it on the smart tier and you pay the full rate on every token. Run 80/20 across the two tiers and, on the Claude numbers above, output cost drops to 0.8 × $1.25 + 0.2 × $15 = $4 per million tokens versus $15, roughly a quarter of the all-Sonnet bill before caching, with no measurable quality loss on the routine work.
Input tokens are always cheaper than output tokens, usually by 4x to 5x. This matters because most agent prompts have a long input context (SOUL.md, MEMORY.md, AGENTS.md all re-sent on every prompt) and a short output. Prompt caching on Anthropic tilts the math further: a cached input token costs about 10 percent of a normal one, so a re-sent context is almost free. At Sonnet's rough $3 per million input tokens, a 10,000-token context re-sent 50 times a day costs about $1.50 a day uncached and about 15 cents cached.
When to use which model
The decision matrix that tracks how I actually pick:
- Claude Sonnet, when the agent needs to hold a sharp voice over a long conversation, when tool use must work the first time, when the agent is doing real reasoning rather than pattern-matching. The default smart tier for serious work.
- Claude Haiku, when the prompt is routine. "Summarize this email", "draft a polite decline", "is this calendar event worth attending". The default cheap tier paired with Sonnet.
- GPT-5.5, when the agent needs deep OpenAI ecosystem integration (Whisper, DALL-E, embeddings) or when the workspace already has GPT-tuned prompts. Otherwise Claude Sonnet wins on price-to-quality.
- Gemini Flash, when cost is the primary constraint and the work is mostly routine. The cheapest cloud option, fast, fine for a personal agent that does not need to hold a complex voice.
- Llama 3 70B on Ollama, when data privacy is the constraint, when the agent must run offline (rare, but it happens: ships, off-grid sites), or when you have the hardware sitting around already and want to amortize it.
- Llama 3 8B on Ollama, when you want to learn the local-model workflow without buying hardware. Runs on a laptop; the agent feels noticeably slower and dumber than the 70B, but it works.
How do I set up OpenClaw with Ollama?
OpenClaw with Ollama is the canonical local-model setup. The full walkthrough:
- Install Ollama from ollama.com. It runs on macOS, Linux, and Windows.
- Pull a model. `ollama pull llama3:8b` for the small model that runs on a laptop, `ollama pull llama3:70b` for the bigger model that needs at least 48 GB of RAM or a serious GPU.
- Confirm Ollama is running. Hit `http://localhost:11434` in a browser; you should see "Ollama is running".
- In your OpenClaw agent's .env, set `OLLAMA_BASE_URL=http://localhost:11434` and `OLLAMA_MODEL=llama3:8b`.
- Start the agent with `openclaw run`. Send a message. The first reply will be slower than cloud; that is normal, the model is loading into memory. Subsequent replies are faster. The whole sequence is condensed in the shell sketch below.
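The same five steps as a copy-paste sketch, assuming a POSIX shell and the .env key names above:

```
ollama pull llama3:8b                      # small model, fits on a laptop
curl http://localhost:11434                # should print "Ollama is running"
echo 'OLLAMA_BASE_URL=http://localhost:11434' >> .env
echo 'OLLAMA_MODEL=llama3:8b' >> .env
openclaw run                               # first reply is slow while the model loads
```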
The catch with local: tool use is weaker on Llama 3 than on Claude or GPT. If your agent needs a lot of tools, go hybrid: run routine chat on Llama and route the hard tool-use prompts to Claude Sonnet. The runtime supports per-prompt provider routing.
Hybrid routing config example
The hybrid pattern looks like this in your AGENTS.md:
provider_routing:
  default: ollama:llama3:70b
  rules:
    - if: prompt_complexity > 0.7
      use: anthropic:claude-sonnet
    - if: requires_tool_use and tool_count > 2
      use: anthropic:claude-sonnet
    - if: heartbeat_decision
      use: ollama:llama3:8b
The runtime evaluates the rules top to bottom on every prompt. The first match wins. The default catches anything no rule matched. With this config, routine chat runs on local Llama for free, hard prompts route to Claude Sonnet on the cloud, heartbeat decisions run on the smallest local model. The bill drops by 70 to 90 percent versus all-Sonnet, with the same actual capability for the hard prompts.
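If it helps to see the semantics spelled out, here is a minimal Python sketch of first-match evaluation. It is illustrative only: the field names and the complexity score are stand-ins, not OpenClaw's actual rule engine.

```python
def route(prompt: dict, rules: list, default: str) -> str:
    """Pick a provider:model string for a prompt, first match wins."""
    for rule in rules:              # evaluated top to bottom
        if rule["when"](prompt):    # first predicate that matches decides
            return rule["use"]      # later rules never run
    return default                  # nothing matched: fall through

# Hypothetical predicates mirroring the AGENTS.md rules above
rules = [
    {"when": lambda p: p["complexity"] > 0.7,
     "use": "anthropic:claude-sonnet"},
    {"when": lambda p: p["requires_tool_use"] and p["tool_count"] > 2,
     "use": "anthropic:claude-sonnet"},
    {"when": lambda p: p["is_heartbeat_decision"],
     "use": "ollama:llama3:8b"},
]

prompt = {"complexity": 0.9, "requires_tool_use": False,
          "tool_count": 0, "is_heartbeat_decision": False}
print(route(prompt, rules, default="ollama:llama3:70b"))
# -> anthropic:claude-sonnet (first rule matched, evaluation stopped)
```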
What is the cheapest way to run an OpenClaw agent?
Cheapest is Ollama on your own hardware: zero per-token cost. The honest number is roughly $0 a month if you ignore electricity, plus the one-time cost of a machine that can run an 8B or 70B model. A used Mac Mini M2 with 16 GB of RAM handles Llama 3 8B. A maxed Mac Studio or a workstation with two consumer GPUs handles Llama 3 70B.
Cheapest cloud is Gemini Flash at the time of writing, around 5 to 10 cents per million input tokens and roughly $0.40 per million output tokens. A lightly-used personal agent on Gemini Flash runs $1 to $3 a month.
The middle path most people land on is two-tier with Claude Haiku for routine work and Claude Sonnet for hard prompts. This typically lands at $5 to $15 a month for a personal agent that gets used several times a day. The math depends entirely on heartbeat frequency, which is the next lesson.
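To see where that band comes from, here is the back-of-envelope math. The usage volumes are invented for the example; only the per-million prices come from this lesson.

```python
# Rough monthly output-token bill for a two-tier personal agent
prompts_per_day = 40                  # assumed usage
tokens_per_reply = 1_500              # assumed average output length
monthly_tokens_m = prompts_per_day * tokens_per_reply * 30 / 1e6  # ~1.8M tokens

haiku, sonnet = 1.25, 15.0            # $ per million output tokens (from above)
bill = monthly_tokens_m * (0.8 * haiku + 0.2 * sonnet)
print(f"${bill:.2f}/month")           # ~$7.20, inside the $5-$15 band
```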
Common local-model pitfalls
Three things that bite people in the first week on Ollama. The first: tool-use drift. Llama 3 will sometimes hallucinate tool calls or miss the schema entirely; Sonnet and GPT solved this years ago, and the open-source models are still catching up. The mitigation: keep the tool count per prompt low (under three is safe), give the agent very clear examples in the prompt, and route hard tool-use prompts to a cloud model with the hybrid pattern above.
The second: context window confusion. Llama 3 8B has an 8k-token context window in the default Ollama config. That is small. A bloated MEMORY.md will silently truncate the prompt and the agent loses the thread. Bump the context with OLLAMA_NUM_CTX=32768 in the .env, and expect RAM use to spike correspondingly. The 70B model handles longer contexts much better.
The third: cold-start latency. Ollama unloads models from RAM after a few minutes of inactivity to free memory. The next request has to reload the model, which can take 10 to 30 seconds depending on model size. For a heartbeat-driven agent that fires every 15 minutes, every tick pays the reload tax. Set OLLAMA_KEEP_ALIVE=24h to pin the model in memory; the cost is RAM held continuously, the benefit is sub-second response on every tick.
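Both fixes are two lines in the .env, using the key names as this lesson does (OLLAMA_KEEP_ALIVE is also a standard Ollama setting; verify OLLAMA_NUM_CTX against your runtime's docs):

```
OLLAMA_NUM_CTX=32768    # bigger context window, avoids silent truncation, costs RAM
OLLAMA_KEEP_ALIVE=24h   # pin the model in memory, avoids the 10-30 s reload
```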
How this connects to your full agent
The model you pick today is rarely the model you run forever. Most people start on Claude Sonnet because the quality is sharp on day one, then move to a two-tier setup once they have read openclaw cost optimization on Day 4 and feel the bill. A few months in, some people move to a hybrid Ollama-plus-Claude-Sonnet setup once they have hardware to amortize.
The provider abstraction is one of the runtime's best decisions. You can change models by editing one line in the .env, restart, and the agent keeps its memory, voice, and channels. Try a model for a week, swap if it does not feel right, swap back. The cost of experimentation is the price of one restart.
The next lesson, openclaw cost optimization, walks the four levers that take a $70 a month deployment to $17. The model choice is one of those four. Heartbeat tuning, prompt caching, and prompt bloat are the other three.
Key takeaways
- OpenClaw is provider-agnostic: swap from Claude to GPT to Ollama by changing one .env line.
- Two-tier model setups (cheap routine tier, smart hard tier) cut bills by 60 to 80 percent with no quality loss on routine work.
- Ollama plus Llama 3 gives you a fully local agent with zero ongoing API spend.
- Use Claude Sonnet for the smart tier and Haiku or local Llama for the routine tier.
About the instructor. Adhiraj Hangal teaches this lesson. Founder of OpenClaw Consult and one of the few consultants whose code has been merged into openclaw/openclaw core. PR #76345 was reviewed and merged by project creator Peter Steinberger. Read the contribution log.
Need help shipping OpenClaw with Ollama in production?
OpenClaw Consult ships production-grade OpenClaw deployments for operators and founders. Founded by Adhiraj Hangal, a merged contributor to openclaw/openclaw core.
Hire an OpenClaw expert →