Introduction

Imagine an AI agent that never sends a single byte of your conversations to any external server. One that works without an internet connection. One with no variable API costs regardless of how many millions of tokens it processes. One that runs on hardware you own and control, using models that are open and auditable.

This is exactly what OpenClaw + Ollama provides. Ollama is an open-source tool for running large language models locally, and pairing it with OpenClaw yields one of the most private and cost-effective AI agent deployments available to individual users and organizations today. This guide covers every step from installation to production operation.

Why Ollama?

Several tools exist for running local LLMs: llama.cpp directly, text-generation-webui, LM Studio, and others. Ollama stands out for OpenClaw deployments for three reasons.

First, it presents a clean API compatible with OpenAI's API specification. This means OpenClaw can communicate with Ollama using the same interface it uses for cloud providers — no special integration needed. The connection is a single configuration change.

Second, model management is simple. ollama pull llama3.1 downloads a model. ollama list shows what you have. ollama run llama3.1 lets you test it interactively. No manual GGUF downloads, no quantization decisions at download time, no manual path configuration. Ollama handles all of this transparently.

Third, performance is good. Ollama is built on llama.cpp under the hood, which provides optimized CPU inference and excellent GPU acceleration on NVIDIA, AMD, and Apple Silicon hardware. The performance difference between Ollama and the same model run through a less optimized stack is measurable — often 2–3x faster tokens per second for the same hardware.

Installing Ollama

Ollama installation is a one-command process on most platforms:

# macOS and Linux
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version
ollama serve &  # Start the Ollama server if it didn't auto-start

On Windows, download the installer from ollama.com. On macOS, Ollama installs as a menu bar application that manages the server lifecycle automatically.

After installation, download your first model. Start with Llama 3.1 8B for a good balance of quality and resource requirements:

# Download and run interactively to verify
ollama run llama3.1

# After verification, download more models
ollama pull mistral:7b-instruct
ollama pull phi4-mini:latest

# Check what you have downloaded
ollama list

Ollama automatically starts an API server on http://localhost:11434. Verify it's running:

curl http://localhost:11434/api/tags

If you see a JSON response listing your models, the server is running correctly.
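Beyond the native /api/tags check, you can exercise the same OpenAI-compatible chat endpoint that OpenClaw will use. A minimal Python sketch using only the standard library, assuming the server is on its default port and a model is already pulled (the helper names here are illustrative):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style payload for Ollama's /v1/chat/completions."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(base_url: str, model: str, prompt: str) -> str:
    """Send one chat turn to a running Ollama server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running:
# print(chat("http://localhost:11434", "llama3.1", "Say hello in one word."))
```

Because the request and response shapes match OpenAI's, swapping between a cloud provider and Ollama is purely a base URL change.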

Configuring OpenClaw for Ollama

OpenClaw's Ollama integration treats the local server as just another LLM provider. In your config.yaml:

llm:
  default_provider: ollama
  providers:
    ollama:
      base_url: "http://localhost:11434"
      model: "llama3.1"
      # Optional: configure for instruction following
      options:
        temperature: 0.7
        top_p: 0.9
        num_ctx: 8192  # Context window size
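These settings correspond to the options object that Ollama's native generate API accepts. A sketch of how such configuration ends up in a request body (the helper name is illustrative, and the exact mapping OpenClaw performs internally may differ):

```python
def to_ollama_payload(model: str, prompt: str, options: dict) -> dict:
    """Map config-style generation settings onto an Ollama /api/generate body."""
    return {"model": model, "prompt": prompt, "options": options, "stream": False}

payload = to_ollama_payload(
    "llama3.1",
    "Summarize today's tasks.",
    {"temperature": 0.7, "top_p": 0.9, "num_ctx": 8192},
)
print(payload["options"]["num_ctx"])  # 8192
```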

If you want to use Ollama for some tasks and a cloud provider for others (the hybrid approach), configure both:

llm:
  default_provider: ollama
  providers:
    ollama:
      base_url: "http://localhost:11434"
      model: "llama3.1"
    openai:
      api_key: "${OPENAI_API_KEY}"
      model: "gpt-4o"
  routing:
    # Use cloud model when explicitly requested or for complex tasks
    complex_reasoning: openai
    sensitive_data: ollama  # Always use local for sensitive content
    heartbeat: ollama       # Use local for cost efficiency on background tasks
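The routing block above is a category-to-provider table with a default fallback; its dispatch logic can be sketched in a few lines (the category names mirror the config, but OpenClaw's actual resolver may be more involved):

```python
# Mirrors the routing table in the hybrid config above.
ROUTING = {
    "complex_reasoning": "openai",
    "sensitive_data": "ollama",
    "heartbeat": "ollama",
}
DEFAULT_PROVIDER = "ollama"

def pick_provider(task_category: str) -> str:
    """Resolve a task category to a provider, falling back to the default."""
    return ROUTING.get(task_category, DEFAULT_PROVIDER)

assert pick_provider("sensitive_data") == "ollama"      # never leaves the machine
assert pick_provider("complex_reasoning") == "openai"   # escalated to the cloud
assert pick_provider("casual_chat") == "ollama"         # unlisted -> default
```

The important property is that the local provider is the fallback: anything not explicitly routed to the cloud stays on your hardware.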

After updating configuration, restart OpenClaw and test with a simple message. If responses are coming through, your local model integration is working.

Model Recommendations

Choosing the right model matters significantly for OpenClaw's agentic tasks. The key requirement is reliable tool use — the model must generate well-formed tool calls when the agent needs to invoke Skills. Not all local models do this consistently. Here are tested, recommended options:

Llama 3.1 8B Instruct (Recommended for most users): Meta's model demonstrates strong instruction following and reliable tool use for an 8B parameter model. It handles most OpenClaw heartbeat tasks and routine conversations well. At a roughly 5GB download requiring 8GB of RAM, it fits comfortably on most modern hardware.

Mistral 7B Instruct v0.3: Fast, efficient, and excellent at following structured instructions. Slightly less capable than Llama 3.1 8B on complex reasoning but significantly faster at inference. Good choice for hardware where speed matters — Raspberry Pi 5 or older laptops where you need sub-10-second response times.

Qwen 2.5 14B Instruct: If you have 16GB RAM available, Qwen 2.5 14B represents a significant quality step up over 7–8B models. Strong reasoning, excellent multilingual support, and good tool use. The sweet spot for users who need local inference quality close to GPT-4o.

Llama 3.1 70B Instruct: For users with 64GB+ RAM and serious hardware, 70B parameter models deliver quality approaching frontier cloud models. Latency is 2–4x slower than smaller models, but for non-time-sensitive tasks the quality improvement is substantial.
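Whichever model you pick, it's worth verifying the tool-use requirement directly: prompt the model for a tool call and check that the output is well-formed. A minimal validator sketch (the "tool"/"arguments" field names are illustrative, not OpenClaw's actual schema):

```python
import json

def is_well_formed_tool_call(raw: str, known_tools: set[str]) -> bool:
    """Check that model output parses as JSON and names a known tool
    with a dict of arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(call, dict)
        and call.get("tool") in known_tools
        and isinstance(call.get("arguments"), dict)
    )

tools = {"read_file", "web_search"}
print(is_well_formed_tool_call('{"tool": "web_search", "arguments": {"q": "x"}}', tools))  # True
print(is_well_formed_tool_call("Sure! I will search the web for you.", tools))             # False
```

Smaller models often fail this check by wrapping the JSON in prose, which is exactly the failure mode that breaks agentic Skill invocation.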

Hardware Guide

Hardware determines which models you can run and at what speed. Here's a practical breakdown by hardware category:

Hardware                       | Recommended Model        | Expected Speed
Raspberry Pi 5 (8GB)           | Phi-4 Mini or Gemma 2 2B | 3–6 tokens/sec
Mac Mini M2 (8GB)              | Llama 3.1 8B             | 25–40 tokens/sec
Mac Mini M4 (16GB)             | Qwen 2.5 14B             | 20–35 tokens/sec
Mac Studio M4 (64GB)           | Llama 3.1 70B            | 15–25 tokens/sec
PC with RTX 4090 (24GB VRAM)   | Llama 3.1 70B Q4         | 40–60 tokens/sec

Apple Silicon Macs benefit from unified memory architecture — the GPU and CPU share the same memory pool, meaning an M4 Mac Mini with 24GB RAM can run a 20B parameter model with the GPU fully utilized, something impossible on a discrete GPU system with only 12GB VRAM.
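A back-of-the-envelope way to judge whether a model fits a given machine: multiply parameter count by bytes per weight for the quantization level, then add runtime overhead. A sketch, where the bytes-per-weight figures are approximate and the 20% overhead factor is an assumption, not a measured value:

```python
# Approximate bytes per weight at common quantization levels.
BYTES_PER_WEIGHT = {"q4": 0.5, "q5": 0.625, "q8": 1.0, "fp16": 2.0}

def approx_model_gb(params_billion: float, quant: str, overhead: float = 1.2) -> float:
    """Estimate resident memory in GB: weights * bytes-per-weight * overhead
    (overhead loosely covers KV cache and runtime allocations)."""
    return round(params_billion * BYTES_PER_WEIGHT[quant] * overhead, 1)

print(approx_model_gb(8, "q4"))   # 4.8  -> fits an 8GB machine
print(approx_model_gb(14, "q5"))  # 10.5 -> comfortable in 16GB
print(approx_model_gb(70, "q4"))  # 42.0 -> needs a 64GB-class machine
```

The estimates line up with the hardware table above: 8B-class models for 8GB machines, 14B for 16GB, and 70B only on 64GB-class systems or large-VRAM GPUs.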

Performance Optimization Tips

Several configuration changes can meaningfully improve local model performance for OpenClaw use cases:

Use Q5_K_M quantization: When multiple quantization levels are available, Q5_K_M provides a good balance of quality and size/speed. It's roughly equivalent to Q8 quality at Q4 speed.

Limit context window size: Local models run slower with larger context windows. For heartbeat tasks that don't need extensive history, configure a smaller context window in the Ollama options to improve throughput.

Keep Ollama running continuously: The first call after startup pays a model-loading penalty of 10–30 seconds while the weights are read into memory; once loaded, subsequent calls are fast. Set the OLLAMA_KEEP_ALIVE environment variable to keep models resident between calls instead of unloading them after the default idle timeout.
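For example (the one-hour value is a choice, not a default; the variable accepts durations like 5m or 24h, and a negative value keeps models loaded indefinitely):

```shell
# Keep loaded models in memory for an hour of inactivity
export OLLAMA_KEEP_ALIVE=1h
ollama serve
```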

Reserve system RAM for the model: Close memory-intensive applications when running large local models. More memory available to Ollama means more of the model stays in RAM rather than being paged to disk, which dramatically improves inference speed.

Wrapping Up

OpenClaw with Ollama is not a compromise — it's a genuine first-class deployment option that prioritizes privacy and cost over raw model quality. For users who handle sensitive data, who want predictable costs, or who simply believe their conversations should stay on their own hardware, local model deployment delivers on its promise. The hardware investment pays for itself quickly against API costs, and the peace of mind from complete data sovereignty is difficult to put a dollar value on.