Introduction

OpenClaw is primarily text-based, but voice interfaces are increasingly important. Voice agents combine speech-to-text (STT), the LLM, and text-to-speech (TTS) to enable spoken interaction. This guide covers how to add voice to OpenClaw and the main integration patterns.

Voice Architecture

Voice flow: User speaks → STT converts to text → OpenClaw processes → LLM responds → TTS converts to audio → User hears. This can run through a separate voice gateway (e.g., Vapi, Bland, or custom) that connects to OpenClaw's API, or through Skills that handle audio.

There are two main patterns. (1) Voice platform: a platform such as Vapi or Bland handles STT/TTS, telephony, and streaming. It sends text to OpenClaw's API and receives text back. OpenClaw is the brain; the platform is the interface. (2) Custom Skill: the Skill receives audio (e.g., from a Telegram voice message), calls an STT API, passes the text to the agent, gets the response, calls TTS, and returns audio. This gives more control but takes more integration work.
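The flow described above can be sketched as three pluggable stages. This is a minimal illustration, not OpenClaw's actual interface: `VoicePipeline`, `transcribe`, `respond`, and `synthesize` are hypothetical names, and the stub wiring at the bottom only echoes text so the flow can be run without any provider.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures: wire these to your real STT provider,
# your OpenClaw endpoint, and your TTS provider.
Transcribe = Callable[[bytes], str]   # audio in  -> text
Respond = Callable[[str], str]        # user text -> agent text
Synthesize = Callable[[str], bytes]   # text      -> audio out


@dataclass
class VoicePipeline:
    transcribe: Transcribe
    respond: Respond
    synthesize: Synthesize

    def handle_turn(self, audio_in: bytes) -> bytes:
        """Run one spoken turn: STT, then the agent, then TTS."""
        text_in = self.transcribe(audio_in)
        text_out = self.respond(text_in)
        return self.synthesize(text_out)


# Stub wiring (no network calls): "audio" is just UTF-8 text here.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    respond=lambda text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Keeping the stages as plain callables means the same pipeline shape works whether a voice platform or a custom Skill supplies the audio; only the three functions change.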

Speech-to-Text: Options & Setup

STT options: Whisper (OpenAI, local or API), Google Speech-to-Text, AssemblyAI, Deepgram. Quality and latency vary by provider. For real-time conversation, favor low-latency providers such as Deepgram or AssemblyAI. Store transcripts in OpenClaw memory so later turns have context.
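As one way to picture transcript storage, here is a small rolling log that keeps the most recent turns and renders them as context text. It is a standalone sketch: `TranscriptLog` and `Turn` are hypothetical names and do not represent OpenClaw's memory API, which you would call instead in a real setup.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class Turn:
    speaker: str  # e.g. "user" or "agent"
    text: str


class TranscriptLog:
    """Keeps only the most recent turns so the context stays bounded."""

    def __init__(self, max_turns: int = 20) -> None:
        # deque with maxlen drops the oldest turn automatically.
        self._turns: Deque[Turn] = deque(maxlen=max_turns)

    def add(self, speaker: str, text: str) -> None:
        self._turns.append(Turn(speaker, text))

    def as_context(self) -> str:
        """Render the stored turns as one 'speaker: text' line each."""
        return "\n".join(f"{t.speaker}: {t.text}" for t in self._turns)
```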

Step-by-step: adding STT. First choose a provider. With the OpenAI Whisper API, you send an audio file and get text back; Deepgram supports both real-time streaming and batch transcription. Then create a Skill that: (1) receives audio (from a webhook, Telegram, etc.), (2) calls the STT API, (3) passes the text to OpenClaw, and (4) returns the response. Latency budget: for natural conversation, aim for under 500ms for STT, about 1s for the LLM, and 500ms for TTS, roughly 2s end to end.
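To check a deployment against that latency budget, it can help to time each stage separately. The sketch below uses only the standard library; `timed` and `over_budget` are hypothetical helper names, and the budget values are simply the targets quoted above.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")

# Per-stage budgets in seconds, taken from the targets in the text.
BUDGET_S: Dict[str, float] = {"stt": 0.5, "llm": 1.0, "tts": 0.5}


def timed(stage: str, fn: Callable[[], T], timings: Dict[str, float]) -> T:
    """Run fn and record its wall-clock duration under the stage name."""
    start = time.perf_counter()
    result = fn()
    timings[stage] = time.perf_counter() - start
    return result


def over_budget(timings: Dict[str, float]) -> Dict[str, float]:
    """Return stages that exceeded their budget, with the overshoot in seconds."""
    return {
        stage: elapsed - BUDGET_S[stage]
        for stage, elapsed in timings.items()
        if stage in BUDGET_S and elapsed > BUDGET_S[stage]
    }
```

In use, each stage call is wrapped, for example `text = timed("stt", lambda: transcribe(audio), timings)`, where `transcribe` stands in for your own STT call, and `over_budget(timings)` is logged at the end of the turn.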

Text-to-Speech: Options & Setup

TTS options: ElevenLabs, Play.ht, OpenAI TTS, Google TTS. Naturalness varies by provider. ElevenLabs and Play.ht offer voice cloning for brand consistency. Stream TTS to lower perceived latency: start playback before the full response has been generated.
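One common way to start playback early is to cut the LLM's streamed text into sentences and send each sentence to TTS as soon as it is complete. The generator below is a minimal sketch of that idea; `sentences_from_stream` is a hypothetical helper, and the sentence splitting is deliberately naive (abbreviations such as "Dr. Smith" would be split too early).

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., !, or ? followed by whitespace (naive heuristic).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a text stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # All parts except the last are finished sentences.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing; keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence can be handed to the TTS provider immediately, so audio for the first sentence plays while the LLM is still generating the rest.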

Costs. OpenAI TTS: ~$15/1M chars. ElevenLabs: tiered; higher quality costs more. Google TTS: ~$4/1M chars. TTS is typically billed per character while STT is billed per minute, so for high volume, convert both to per-minute costs before comparing.

Integration Patterns

Pattern 1: A voice platform (Vapi, Bland) handles STT/TTS and sends text to OpenClaw. OpenClaw is the brain; voice is the interface. This is the easiest route for phone/IVR. Pattern 2: A custom Skill receives audio, calls STT, passes the text to the agent, gets the response, and calls TTS. More control, more work. Pattern 3: Telegram/WhatsApp voice messages. OpenClaw can process voice notes via the platform APIs and STT, which suits asynchronous voice.

Implementation Checklist

  • □ Choose pattern: voice platform vs custom Skill
  • □ Select STT provider (Whisper, Deepgram, etc.)
  • □ Select TTS provider (ElevenLabs, OpenAI, etc.)
  • □ Build or integrate voice gateway
  • □ Test latency; optimize for real-time
  • □ Store transcripts in memory for context

Cost Breakdown for Voice

STT: Whisper API ~$0.006/min; Deepgram ~$0.004/min. TTS: OpenAI ~$15/1M chars (~$0.02/min of speech); ElevenLabs varies by tier. For 1,000 min/month, expect roughly $30-80 in voice API costs depending on provider, plus LLM costs. Voice platforms such as Vapi have their own pricing.
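The arithmetic behind these estimates is easy to script so you can plug in your own volumes. The helpers below are a rough sketch: the function names are hypothetical, the rates are the approximate figures quoted above, and the 1,000 characters per minute of speech is an illustrative assumption that depends on speaking rate and language.

```python
def tts_cost_per_minute(price_per_million_chars: float,
                        chars_per_minute: float) -> float:
    """Convert per-character TTS pricing into a per-minute figure."""
    return price_per_million_chars * chars_per_minute / 1_000_000


def monthly_voice_cost(stt_minutes: float, stt_per_min: float,
                       tts_minutes: float, tts_per_min: float) -> float:
    """Total monthly STT + TTS spend in dollars (LLM costs not included)."""
    return stt_minutes * stt_per_min + tts_minutes * tts_per_min


# Assumed speaking rate: about 1,000 characters per minute (illustrative only).
openai_tts_per_min = tts_cost_per_minute(15.0, 1000)

# 1,000 minutes each of Whisper STT and OpenAI TTS at the quoted per-minute rates.
estimate = monthly_voice_cost(1000, 0.006, 1000, 0.02)
```

With these inputs `openai_tts_per_min` comes to about $0.015, in the same ballpark as the ~$0.02/min figure, and `estimate` comes to about $26 before LLM and platform costs.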

Common Pitfalls to Avoid

Pitfall 1: High latency. Users tolerate about 1-2s in total. Use streaming STT and a faster LLM for voice. Pitfall 2: Wrong language. Confirm that your STT and TTS providers support your target languages. Pitfall 3: No fallback. When STT fails (noise, accents), respond with "I didn't catch that" handling instead of passing bad text to the agent.
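A fallback for failed transcription can be a small guard in front of the agent call. The sketch below is one possible shape: `Transcript`, `reply_or_fallback`, and the 0.6 threshold are hypothetical, and the confidence field assumes your STT provider reports one (not all do, hence the `None` default).

```python
from dataclasses import dataclass
from typing import Callable, Optional

FALLBACK_PROMPT = "Sorry, I didn't catch that. Could you say it again?"


@dataclass
class Transcript:
    text: str
    confidence: Optional[float] = None  # None if the provider gives no score


def reply_or_fallback(transcript: Transcript,
                      respond: Callable[[str], str],
                      min_confidence: float = 0.6) -> str:
    """Ask the user to repeat when the transcript is empty or low-confidence."""
    text = transcript.text.strip()
    too_uncertain = (
        transcript.confidence is not None
        and transcript.confidence < min_confidence
    )
    if not text or too_uncertain:
        return FALLBACK_PROMPT
    # Transcript looks usable: hand it to the agent.
    return respond(text)
```

The threshold is worth tuning per provider and per language, since confidence scores are not calibrated the same way across STT services.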

Frequently Asked Questions

Can OpenClaw handle phone calls? Yes, via Vapi, Bland, or a similar platform. They handle telephony; OpenClaw handles the conversation.

What about WhatsApp voice? Process voice notes with STT, then respond with text or with audio generated by TTS.

Can STT/TTS run locally? Whisper runs locally, and Coqui TTS covers local TTS. There is no API cost, but you will likely need a GPU for real-time speeds.

Wrapping Up

Voice extends OpenClaw to hands-free and accessibility use cases. OpenClaw Consult helps design and implement voice agent setups.