Introduction

OpenClaw is primarily text-based, but voice interfaces are increasingly important. Voice agents combine speech-to-text (STT), the LLM, and text-to-speech (TTS) to enable spoken interaction. This guide covers how to add voice to OpenClaw and the main integration patterns.

Voice Architecture

Voice flow: User speaks → STT converts to text → OpenClaw processes → LLM responds → TTS converts to audio → User hears. This can run through a separate voice gateway (e.g., Vapi, Bland, or custom) that connects to OpenClaw's API, or through Skills that handle audio.
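
The flow above can be sketched as three pluggable stages. This is a minimal illustration, not OpenClaw's actual API: the class name VoicePipeline, the stage signatures, and the stub functions are all assumptions made for the example, and a real setup would wrap an STT client, the OpenClaw agent, and a TTS client behind the same shapes.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real clients would be wrapped to fit them.
SpeechToText = Callable[[bytes], str]
Agent = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    stt: SpeechToText   # audio bytes -> transcript
    agent: Agent        # transcript -> reply text (stands in for OpenClaw + LLM)
    tts: TextToSpeech   # reply text -> audio bytes

    def handle_turn(self, audio_in: bytes) -> bytes:
        """Run one spoken turn: STT -> agent -> TTS."""
        transcript = self.stt(audio_in)
        reply = self.agent(transcript)
        return self.tts(reply)


# Stub stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, agent=fake_agent, tts=fake_tts)
audio_out = pipeline.handle_turn(b"hello")
```

Keeping each stage behind a plain callable means a provider can be swapped (for example, one STT service for another) without touching the turn logic.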

There are two main patterns. (1) Voice platform: a platform such as Vapi or Bland handles STT/TTS, telephony, and streaming. It sends text to OpenClaw's API and receives text back. OpenClaw is the brain; the platform is the interface. (2) Custom Skill: receive audio (e.g., from a Telegram voice message), call an STT API, pass the text to the agent, get the response, call TTS, and return audio. This gives more control but takes more integration work.

Speech-to-Text: Options & Setup

STT options: Whisper (OpenAI, local or API), Google Speech-to-Text, AssemblyAI, Deepgram. Quality and latency vary. For real-time conversation, latency matters most, so prefer low-latency streaming providers such as Deepgram or AssemblyAI. Store transcripts in OpenClaw memory for context.

Step-by-step: Adding STT. First choose a provider. The OpenAI Whisper API takes an audio file and returns text; Deepgram supports both real-time streaming and batch transcription. Then create a Skill that: (1) receives audio (from a webhook, Telegram, etc.), (2) calls the STT API, (3) passes the text to OpenClaw, (4) returns the response. Latency budget: for natural conversation, aim for under 500 ms for STT, 1 s for the LLM, and 500 ms for TTS, about 2 s in total.
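
One way to keep the latency budget above honest is to time each stage of a turn and flag the ones that run over. The sketch below is an assumption-laden illustration: run_turn, over_budget, and the stub stages are hypothetical names, and the stubs stand in for real STT, LLM, and TTS calls.

```python
import time
from typing import Callable, Dict, List, Tuple

# Per-stage budget from the text, in seconds: 500 ms STT + 1 s LLM + 500 ms TTS.
LATENCY_BUDGET: Dict[str, float] = {"stt": 0.5, "llm": 1.0, "tts": 0.5}


def timed(stage: str, fn: Callable, arg, timings: Dict[str, float]):
    """Call fn(arg) and record its wall-clock duration under `stage`."""
    start = time.perf_counter()
    result = fn(arg)
    timings[stage] = time.perf_counter() - start
    return result


def run_turn(audio: bytes, stt: Callable, llm: Callable, tts: Callable
             ) -> Tuple[bytes, Dict[str, float]]:
    """Run STT -> LLM -> TTS and return the audio plus per-stage timings."""
    timings: Dict[str, float] = {}
    text = timed("stt", stt, audio, timings)
    reply = timed("llm", llm, text, timings)
    speech = timed("tts", tts, reply, timings)
    return speech, timings


def over_budget(timings: Dict[str, float],
                budget: Dict[str, float] = LATENCY_BUDGET) -> List[str]:
    """Return the stages whose measured time exceeds their budget."""
    return [stage for stage, limit in budget.items()
            if timings.get(stage, 0.0) > limit]


# Stub stages (hypothetical) so the sketch runs without external services.
speech, timings = run_turn(
    b"hello",
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: f"Echo: {text}",
    tts=lambda text: text.encode("utf-8"),
)
slow_stages = over_budget(timings)
```

Logging the per-stage timings on real traffic shows which stage to optimize first, rather than guessing.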

Text-to-Speech: Options & Setup

TTS options: ElevenLabs, Play.ht, OpenAI TTS, Google TTS. Naturalness varies. ElevenLabs and Play.ht offer voice cloning for brand consistency. Stream TTS for lower perceived latency: playback can start before the full response has been generated.
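
A common way to get that streaming effect is to cut the LLM's token stream into sentences and hand each one to TTS as soon as it is complete. The helper below is a minimal sketch; sentences_from_stream is a hypothetical name, and the punctuation-based split is a simplification that will mis-split abbreviations such as "e.g. ".

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing; keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


# Simulated LLM output arriving in arbitrary chunks.
llm_chunks = ["Hello th", "ere. How are", " you? I am", " fine."]
spoken = list(sentences_from_stream(llm_chunks))
```

Each yielded sentence can be sent to the TTS provider immediately, so the first audio starts playing while the LLM is still producing the rest of the reply.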

Costs. OpenAI TTS: ~$15/1M chars. ElevenLabs: tiered; higher quality costs more. Google TTS: ~$4/1M chars for standard voices. Prices change, so check current rates, and for high volume compare per-minute costs.

Integration Patterns

Pattern 1: Voice platform (Vapi, Bland) handles STT/TTS and sends text to OpenClaw. OpenClaw is the brain; voice is the interface. Easiest for phone/IVR. Pattern 2: Custom Skill that receives audio, calls STT, passes the text to the agent, gets the response, and calls TTS. More control, more work. Pattern 3: Telegram/WhatsApp voice messages. OpenClaw can process voice notes via the platform APIs plus STT. Good for async voice.

Implementation Checklist

  • □ Choose pattern: voice platform vs custom Skill
  • □ Select STT provider (Whisper, Deepgram, etc.)
  • □ Select TTS provider (ElevenLabs, OpenAI, etc.)
  • □ Build or integrate voice gateway
  • □ Test latency; optimize for real-time
  • □ Store transcripts in memory for context

Cost Breakdown for Voice

STT: Whisper API ~$0.006/min. Deepgram ~$0.004/min. TTS: OpenAI ~$15/1M chars (~$0.02/min speech). ElevenLabs varies. For 1000 min/month: ~$30-80 in voice APIs. Add LLM costs. Voice platforms (Vapi) have their own pricing.
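
As a rough check on these figures, the per-minute rates can be plugged into a small estimate. The function name and the assumption that every minute is both transcribed and spoken back are simplifications made for this sketch; the rates are the approximate ones quoted above and will drift over time.

```python
# Approximate per-minute rates quoted in the text (USD); check current pricing.
STT_RATE_PER_MIN = {"whisper_api": 0.006, "deepgram": 0.004}
TTS_RATE_PER_MIN = {"openai": 0.02}


def monthly_voice_cost(stt_minutes: float, tts_minutes: float,
                       stt_rate: float, tts_rate: float) -> float:
    """Estimate monthly STT + TTS spend, excluding LLM and platform fees."""
    return stt_minutes * stt_rate + tts_minutes * tts_rate


# Simplifying assumption: 1000 minutes transcribed and 1000 minutes synthesized.
whisper_total = monthly_voice_cost(
    1000, 1000, STT_RATE_PER_MIN["whisper_api"], TTS_RATE_PER_MIN["openai"])
deepgram_total = monthly_voice_cost(
    1000, 1000, STT_RATE_PER_MIN["deepgram"], TTS_RATE_PER_MIN["openai"])
```

With these inputs the Whisper API plus OpenAI TTS combination comes to about $26 per month and the Deepgram variant to about $24, close to the low end of the range above; the upper end presumably reflects a pricier TTS tier such as ElevenLabs.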

Common Pitfalls to Avoid

Pitfall 1: High latency. Users tolerate roughly 1-2 s of total delay. Use streaming STT and a faster LLM for voice. Pitfall 2: Wrong language. Make sure both STT and TTS support your target languages. Pitfall 3: No fallback. When STT fails (noise, accent), have an "I didn't catch that" response ready.
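
The fallback in Pitfall 3 can be sketched as a guard around the STT call. Everything here is illustrative: the Transcript type, its confidence field, the 0.6 threshold, and reply_or_fallback are assumptions, since providers report confidence in different shapes (or not at all).

```python
from dataclasses import dataclass
from typing import Callable

FALLBACK_PROMPT = "Sorry, I didn't catch that. Could you say it again?"


@dataclass
class Transcript:
    text: str
    confidence: float  # 0.0-1.0; hypothetical field, provider formats differ


def reply_or_fallback(transcribe: Callable[[bytes], Transcript],
                      agent: Callable[[str], str],
                      audio: bytes,
                      min_confidence: float = 0.6) -> str:
    """Return the agent's reply, or a re-prompt when STT fails or is unsure."""
    try:
        result = transcribe(audio)
    except Exception:
        # Network error, unsupported audio, provider outage, and so on.
        return FALLBACK_PROMPT
    if not result.text.strip() or result.confidence < min_confidence:
        # Empty or low-confidence transcript: ask again instead of guessing.
        return FALLBACK_PROMPT
    return agent(result.text)


# Stub STT results (hypothetical) covering the clear and noisy cases.
def clear_stt(audio: bytes) -> Transcript:
    return Transcript(text="book a table", confidence=0.93)


def noisy_stt(audio: bytes) -> Transcript:
    return Transcript(text="buh tay", confidence=0.21)


def echo_agent(text: str) -> str:
    return f"Okay: {text}"


good_reply = reply_or_fallback(clear_stt, echo_agent, b"...")
noisy_reply = reply_or_fallback(noisy_stt, echo_agent, b"...")
```

The threshold is worth tuning on real recordings: too high and users get asked to repeat themselves constantly, too low and the agent acts on garbled input.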

Frequently Asked Questions

Can OpenClaw handle phone calls? Yes, via Vapi, Bland, or a similar platform. They handle telephony; OpenClaw handles the conversation.

What about WhatsApp voice? Process voice notes with STT, then respond with text or with TTS-generated audio.

Can STT/TTS run locally? Whisper runs locally, and Coqui TTS covers local TTS. There is no API cost, but you will want a GPU for acceptable speed.

Wrapping Up

Voice extends OpenClaw to hands-free and accessibility use cases. OpenClaw Consult helps design and implement voice agent setups.