Building a Continuous Voice Interface with the OpenAI Realtime API

Source: DEV Community
A technical walkthrough of how the ABD Assistant voice command system works end-to-end, from raw microphone bytes to tool execution.

The Core Architecture

The system has three moving parts: a browser Web Audio capture layer, an Express WebSocket relay, and OpenAI's Realtime API as the voice brain. The browser streams PCM audio to OpenAI via a WebSocket that stays open for the entire session. OpenAI performs server-side voice activity detection (VAD), transcribes speech incrementally, runs its LLM over the conversation history, and streams back audio tokens as they're generated. This means no client-side silence detection, no turn-management logic, and no separate transcription step — one pipeline, fully server-driven.

Audio Capture: The Hard Part

Capturing audio correctly is where most implementations fall apart. The key constraint: OpenAI's Realtime API expects mono PCM at 24kHz as 16-bit signed integers. The browser's MediaRecorder produces audio/webm or audio/opus — a completely different format.
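The format mismatch means you can't hand MediaRecorder output to the API; you have to capture raw Float32 samples from the Web Audio graph and convert them yourself. Here is a minimal sketch of that conversion — `floatTo16BitPCM` is a hypothetical helper name, not part of any browser API — mapping Web Audio's [-1, 1] floats to 16-bit signed little-endian PCM:

```javascript
// Hypothetical helper: convert Web Audio Float32 samples (range [-1, 1])
// into the 16-bit signed little-endian PCM the Realtime API expects.
// Resampling to 24kHz mono is a separate step, not shown here.
function floatTo16BitPCM(float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1], then scale asymmetrically so both
    // -1 -> -32768 and +1 -> 32767 land in the int16 range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Int16Array(buffer);
}
```

In practice you'd run this inside an AudioWorklet (or a ScriptProcessorNode fallback), base64-encode the resulting bytes, and append them to the open WebSocket as audio chunks arrive.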
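The fully server-driven pipeline described above is enabled by a single session configuration event sent once after the WebSocket opens. The sketch below shows the shape of that event; field names follow the OpenAI Realtime API beta (`session.update`, `turn_detection`, `input_audio_format`) and should be checked against the current API reference, and the threshold and silence values are illustrative, not the article's settings:

```javascript
// Sketch: build the session.update event that turns on server-side VAD
// and raw 16-bit PCM audio in and out. No client-side turn logic needed.
function buildSessionUpdate() {
  return {
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      input_audio_format: "pcm16",    // mono 24kHz 16-bit signed PCM
      output_audio_format: "pcm16",
      input_audio_transcription: { model: "whisper-1" },
      turn_detection: {
        type: "server_vad",           // OpenAI detects end of speech
        threshold: 0.5,               // illustrative value
        silence_duration_ms: 500      // illustrative value
      }
    }
  };
}
```

With `server_vad` enabled, the server decides when the user has stopped talking and commits the audio buffer itself, which is what lets the client skip silence detection entirely.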