// PIPELINE // QUICKSTART // COMPARE // PROVIDERS
Voice layer for Claude & GPT-4o  ·  On WhatsApp · Self-hosted

YOUR AI.
NOW IT
TALKS BACK.

Send a WhatsApp voice note. Claude hears it, thinks, and speaks back in under 300ms. No SaaS middleman. No $0.31/min bill. Your server, your audio, your AI.

platform             OmniVoice v0.1.0
channel              whatsapp  [ CONNECTED ]
llm_provider         anthropic/claude-3.5-sonnet  [ OK ]
asr_provider         deepgram/nova-2  [ OK ]
latency_p50          247ms
audio_leaves_server  false  [ GOOD ]
status               READY. Send a voice note to get started.
// Launching in 14 days
// no credit card · no vendor lock-in · just the pipeline
End-to-end latency
<300ms
p50 across ASR+LLM+TTS
Platform fee (self-hosted)
$0
Own infra, zero platform tax
WhatsApp users worldwide
2B+
No app download. They're already there.
License
MIT
Fork it. Ship it. Own it.
CLAUDE ON WHATSAPP · SELF-HOSTED · 2B+ USERS, NO APP DOWNLOAD · OPEN SOURCE · SUB-300MS VOICE ROUND-TRIP · NO $0.31/MIN BILL · GPT-4O ON WHATSAPP · AUDIO ON YOUR SERVER
// How it works
SEND A NOTE.
HEAR AN ANSWER.
You send a WhatsApp voice note. OmniVoice transcribes it, runs it through Claude or GPT-4o, and sends a voice note back. All under 300ms.
01
🎙 You send a voice note
Record anything — question, task, command. Just like messaging a friend.
02
👂 Deepgram transcribes it
Speech-to-text in ~80ms. Nova-2 model. Handles accents, noise, fast speech.
~80ms
03
🧠 Claude / GPT-4o replies
Your chosen LLM thinks and streams a response. Swap providers in one env var.
~120ms
04
🔊 You receive a voice note back
TTS converts the reply to audio. Delivered as a WhatsApp voice note.
~60ms
Total round-trip < 300ms
Fully self-hosted · your audio never leaves your server
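The four steps above reduce to one chain: ASR, then LLM, then TTS. Here is a minimal sketch of that round-trip with stub providers; the function names and signatures are illustrative stand-ins, not the actual OmniVoice API:

```python
# Stub providers standing in for Deepgram, Claude/GPT-4o, and a TTS engine.
def transcribe(audio: bytes) -> str:        # step 02: ASR (~80ms)
    return audio.decode("utf-8")            # pretend the audio "is" its transcript

def generate_reply(prompt: str) -> str:     # step 03: LLM (~120ms)
    return f"You said: {prompt}"

def synthesize(text: str) -> bytes:         # step 04: TTS (~60ms)
    return text.encode("utf-8")

def handle_voice_note(audio: bytes) -> bytes:
    """ASR -> LLM -> TTS round-trip for one WhatsApp voice note."""
    transcript = transcribe(audio)
    reply = generate_reply(transcript)
    return synthesize(reply)

print(handle_voice_note(b"what's the weather?"))  # → b"You said: what's the weather?"
```

Swapping any stub for a real provider leaves the chain itself untouched, which is what makes the per-stage latency budget additive.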
// Pipeline Architecture · 8 stages · each independently swappable · total ~247ms
01 🎙️ MIC   · Raw audio input via WebSocket stream            · client WebSocket                    · LIVE
02 👂 ASR   · Speech-to-text transcription                    · Deepgram · Whisper · AssemblyAI     · ~80ms  · LIVE
03    ATTS  · Adaptive turn-taking & barge-in detection       · built-in                            · ~5ms   · LIVE
04 🧠 LLM   · Language model inference & streaming            · OpenAI · Claude · Ollama · OpenRouter · ~120ms · LIVE
05 🪣 TAB   · Temporal alignment buffer (smooth token rate)   · built-in                            · ~10ms  · LIVE
06 📡 AQAL  · Adaptive audio quality & codec selection        · built-in                            · ~5ms   · LIVE
07 🔊 TTS   · Text-to-speech synthesis                        · OpenAI · ElevenLabs · Azure         · ~60ms  · LIVE
08 🎧 OUT   · PCM-16 audio stream back to caller              · client WebSocket                    · LIVE
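Stage 03's barge-in detection can be approximated with an energy threshold over incoming PCM-16 frames: if the caller gets loud for a few consecutive frames while the assistant is still speaking, cut the TTS. This is a toy sketch of the idea, not OmniVoice's actual ATTS detector:

```python
import struct

def frame_energy(pcm16: bytes) -> float:
    """Mean absolute amplitude of a little-endian PCM-16 frame."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def detect_barge_in(frames, threshold=500, min_frames=3):
    """Flag a barge-in once `min_frames` consecutive frames exceed `threshold`.
    Returns the index of the confirming frame, or None."""
    loud_run = 0
    for i, frame in enumerate(frames):
        loud_run = loud_run + 1 if frame_energy(frame) > threshold else 0
        if loud_run >= min_frames:
            return i
    return None

silence = struct.pack("<160h", *([0] * 160))     # 10ms of silence @ 16kHz
speech = struct.pack("<160h", *([4000] * 160))   # 10ms of loud speech
print(detect_barge_in([silence, speech, speech, speech]))  # → 3
```

Requiring a run of consecutive loud frames (rather than a single spike) is what keeps a cough or a door slam from interrupting the reply.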
RUNNING IN
60 SECONDS.

Clone, set two env vars, start. Your first voice session is one WebSocket connection away.

01
Clone and install
git clone + pip install -r requirements.txt
02
Run the setup wizard
python setup_env.py
03
Pick your LLM — Claude, GPT-4o, or Ollama
LLM_PROVIDER=anthropic / openai / ollama
04
Start everything in one command
python start.py
terminal — omni-voice
# 1. clone & install
git clone https://github.com/yourname/omnivoice
cd omnivoice && pip install -r requirements.txt

# 2. guided setup — picks your providers
python setup_env.py

# choose your LLM (pick one):
LLM_PROVIDER=anthropic   → Claude  (ANTHROPIC_API_KEY)
LLM_PROVIDER=openai      → GPT-4o  (OPENAI_API_KEY)
LLM_PROVIDER=ollama      → Llama3  (no key needed)

# ASR & TTS — both have free options:
ASR_PROVIDER=whisper     → local, free, no key
TTS_PROVIDER=edge        → free Microsoft neural voices

# 3. start everything
python start.py

✓ Tunnel active: https://xxx.trycloudflare.com
✓ Twilio webhook updated automatically
✓ OmniVoice ready — send a WhatsApp voice note
// 01_PROVIDER_AGNOSTIC
🔌
One env var. Any provider.
Deepgram → Whisper. GPT-4o → Claude. OpenAI TTS → ElevenLabs. One env var change, no code rewrite. Provider lock-in is a choice — choose freedom.
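Provider-agnosticism usually comes down to a registry keyed by an env var. A minimal sketch, assuming a hypothetical backend registry (the real OmniVoice provider classes will differ):

```python
import os

# Hypothetical registry — stand-ins for the real provider clients.
LLM_BACKENDS = {
    "anthropic": lambda: "claude-3-5-sonnet client",
    "openai":    lambda: "gpt-4o client",
    "ollama":    lambda: "local llama3 client",
}

def make_llm():
    """Pick the LLM backend from LLM_PROVIDER — swap providers with no code change."""
    provider = os.environ.get("LLM_PROVIDER", "openai")
    try:
        return LLM_BACKENDS[provider]()
    except KeyError:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}") from None

os.environ["LLM_PROVIDER"] = "anthropic"
print(make_llm())  # → claude-3-5-sonnet client
```

The same pattern applies to ASR_PROVIDER and TTS_PROVIDER: as long as every backend satisfies the same interface, the pipeline never knows which one it's talking to.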
ASR · LLM · TTS
// 02_MULTI_CHANNEL
📱
WhatsApp. Phone. Web.
Twilio integration built-in. Voice notes go through ASR→LLM→TTS and come back as audio. Phone calls are real-time bidirectional streams. Three channels, one pipeline.
Twilio · WebSocket · PSTN
// 03_SELF_HOSTED
🔒
Your server. Your audio.
No SaaS middleman listening to your users. No $0.31/min surprise bills. Deploy on any VPS, k8s cluster, or bare metal you already own and trust.
Open source · MIT · Self-hosted
// 04_LATENCY
Sub-300ms. Every call.
TAB smooths bursty LLM tokens into natural speech rhythm. ATTS detects barge-in in real time. Conversations feel like conversations, not HTTP requests.
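The core trick behind a temporal alignment buffer is simple: tokens arrive in bursts, but get released on a steady cadence, never earlier than they arrived. A toy sketch of that scheduling (times in milliseconds; OmniVoice's actual TAB will be more elaborate):

```python
def pace_tokens(arrival_ms, interval_ms=40):
    """Map bursty token arrival times to steady release times.
    Each token is released at a fixed cadence, but never before it exists."""
    release = []
    next_slot = 0
    for t in arrival_ms:
        slot = max(t, next_slot)      # can't release a token before it arrives
        release.append(slot)
        next_slot = slot + interval_ms  # steady cadence for the TTS stage
    return release

# A burst of 4 tokens at t=0 and one straggler at t=500ms:
print(pace_tokens([0, 0, 0, 0, 500]))  # → [0, 40, 80, 120, 500]
```

The burst is spread over 120ms of even spacing, so the TTS stage hears a natural speech rhythm instead of a clump followed by silence.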
TAB · ATTS · AQAL
// 05_OBSERVABILITY
📊
Built-in Prometheus.
Every pipeline stage instrumented out of the box. Latency histograms, session counters, error rates — ready for your Grafana dashboard. No config needed.
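Per-stage instrumentation with the standard `prometheus_client` library looks roughly like this; the metric name here is a hypothetical example, not necessarily what OmniVoice exports:

```python
from prometheus_client import Histogram, REGISTRY

# Hypothetical metric name — the real OmniVoice metrics may differ.
STAGE_LATENCY = Histogram(
    "omnivoice_stage_latency_seconds",
    "Per-stage pipeline latency",
    ["stage"],
)

# Record one observation per stage, as the pipeline would after each request:
for stage, seconds in [("asr", 0.080), ("llm", 0.120), ("tts", 0.060)]:
    STAGE_LATENCY.labels(stage=stage).observe(seconds)

# In a real deployment you'd expose /metrics for Prometheus to scrape:
# from prometheus_client import start_http_server; start_http_server(9100)

print(REGISTRY.get_sample_value(
    "omnivoice_stage_latency_seconds_count", {"stage": "asr"}
))  # → 1.0
```

Histograms give you bucketed latency distributions for free, which is exactly what Grafana's heatmap and quantile panels consume.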
Prometheus · Grafana
// 06_OPEN_SOURCE
MIT. Fork it. Own it.
The full pipeline is open source. Read the code, extend it, ship it. Or use the hosted cloud on omnivoice.dev if you don't want to run infra. Your call.
MIT License · GitHub

GIVE YOUR AI A VOICE.

// OmniVoice is the voice layer. You pick the brain. One env var to swap.

Use your Anthropic API key directly — no middleman. Every voice conversation goes through Claude 3.5 Sonnet, speaking and listening.
# .env — one key, direct to Anthropic
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
TTS_PROVIDER=edge       # free
ASR_PROVIDER=whisper    # free, local
Talk to Claude. Hear Claude talk back.
Use your OpenAI API key directly. GPT-4o for the brain, Edge TTS for the voice, Whisper for the ears — all on your own server.
# .env — one key, direct to OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini   # or gpt-4o
TTS_PROVIDER=edge          # free
ASR_PROVIDER=whisper       # free, local
GPT-4o with a voice. Under 5 minutes.
No API key at all. Llama 3, Mistral, Phi-3 — running on your own machine. Zero cloud dependency, zero cost, zero data leaving your device.
# .env — zero API keys needed
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3     # or mistral, phi3
ASR_PROVIDER=whisper    # local, free
TTS_PROVIDER=edge       # free
# prerequisite: brew install ollama
Fully local. Zero keys. Zero cost.
OmniVoice is the voice layer — not the AI. Use whichever LLM you already pay for, trust, or run locally. Switch providers with one line in .env. Also works with OpenRouter, Gemini, Mistral, DeepSeek, and any OpenAI-compatible endpoint.
🎙 MIC → ASR → YOUR LLM → TTS → 🔊
WhatsApp → ASR → YOUR LLM → TTS → WhatsApp
// Works with
Deepgram
OpenAI Whisper
AssemblyAI
Azure Speech
GPT-4o
Claude Sonnet
OpenRouter
Ollama (local)
ElevenLabs
Twilio
Cloudflare
Prometheus

VS THE COMPETITION.

// Great until they're not. Here's what you give up.

Feature               OmniVoice            Vapi             Retell AI    Bland AI
Pricing               Free (self-hosted)   $0.13–0.31/min   $0.07/min    Per API call
Self-hostable         YES                  NO               NO           NO
Audio on your server  YES                  NO               NO           NO
Provider-agnostic     FULL                 PARTIAL          PARTIAL      NO
WhatsApp + Phone      YES                  Phone only       Phone only   Phone only
Open source           YES (MIT)            NO               NO           NO

YOUR AI.
THEIR
WHATSAPP.

2 billion people already send voice notes. Point OmniVoice at Claude or GPT-4o, connect your Twilio number, and they're talking to your AI — no app download, no new interface, no $0.31/min bill.

★ GITHUB
Self-hosted voice AI infrastructure.
Open source · MIT licensed · omnivoice.dev