// PIPELINE // QUICKSTART // COMPARE // PROVIDERS
Voice layer for Claude & GPT-4o  ·  On WhatsApp · Self-hosted

YOUR AI.
NOW IT
TALKS BACK.

Send a WhatsApp voice note. Claude hears it, thinks, and speaks back in under 300ms. No SaaS middleman. No $0.31/min bill. Your server, your audio, your AI.

platform             OmniVoice v0.1.0
channel              whatsapp  [ CONNECTED ]
llm_provider         anthropic/claude-3.5-sonnet  [ OK ]
asr_provider         deepgram/nova-2  [ OK ]
latency_p50          247ms
audio_leaves_server  false  [ GOOD ]
status               READY. Send a voice note to get started.
// Launching in 14 days
// no credit card · no vendor lock-in · just the pipeline
End-to-end latency
<300ms
p50 across ASR+LLM+TTS
Platform fee (self-hosted)
$0
Own infra, zero platform tax
WhatsApp users worldwide
2B+
No app download. They're already there.
License
MIT
Fork it. Ship it. Own it.
CLAUDE ON WHATSAPP · SELF-HOSTED · 2B+ USERS, NO APP DOWNLOAD · OPEN SOURCE · SUB-300MS VOICE ROUND-TRIP · NO $0.31/MIN BILL · GPT-4O ON WHATSAPP · AUDIO ON YOUR SERVER
// How it works
SEND A NOTE.
HEAR AN ANSWER.
You send a WhatsApp voice note. OmniVoice transcribes it, runs it through Claude or GPT-4o, and sends a voice note back. All under 300ms.
01
🎙 You send a voice note
Record anything — question, task, command. Just like messaging a friend.
02
👂 Deepgram transcribes it
Speech-to-text in ~80ms. Nova-2 model. Handles accents, noise, fast speech.
~80ms
03
🧠 Claude / GPT-4o replies
Your chosen LLM thinks and streams a response. Swap providers in one env var.
~120ms
04
🔊 You receive a voice note back
TTS converts the reply to audio. Delivered as a WhatsApp voice note.
~60ms
Total round-trip < 300ms
Fully self-hosted · your audio never leaves your server
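The four steps above reduce to one chain: ASR, then LLM, then TTS. Here is a minimal sketch of that round-trip with stub providers; the function names and signatures are illustrative stand-ins, not the actual OmniVoice API:

```python
# Stub providers standing in for Deepgram, Claude/GPT-4o, and a TTS engine.
def transcribe(audio: bytes) -> str:        # step 02: ASR (~80ms)
    return audio.decode("utf-8")            # pretend the audio "is" its transcript

def generate_reply(prompt: str) -> str:     # step 03: LLM (~120ms)
    return f"You said: {prompt}"

def synthesize(text: str) -> bytes:         # step 04: TTS (~60ms)
    return text.encode("utf-8")

def handle_voice_note(audio: bytes) -> bytes:
    """ASR -> LLM -> TTS round-trip for one WhatsApp voice note."""
    transcript = transcribe(audio)
    reply = generate_reply(transcript)
    return synthesize(reply)

print(handle_voice_note(b"what's the weather?"))  # → b"You said: what's the weather?"
```

Swapping any stub for a real provider leaves the chain itself untouched, which is what makes the per-stage latency budget additive.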
// Pipeline Architecture · 8 stages · each independently swappable · total ~247ms
01 🎙️ MIC   · Raw audio input via WebSocket stream            · client WebSocket                    · LIVE
02 👂 ASR   · Speech-to-text transcription                    · Deepgram · Whisper · AssemblyAI     · ~80ms  · LIVE
03    ATTS  · Adaptive turn-taking & barge-in detection       · built-in                            · ~5ms   · LIVE
04 🧠 LLM   · Language model inference & streaming            · OpenAI · Claude · Ollama · OpenRouter · ~120ms · LIVE
05 🪣 TAB   · Temporal alignment buffer (smooth token rate)   · built-in                            · ~10ms  · LIVE
06 📡 AQAL  · Adaptive audio quality & codec selection        · built-in                            · ~5ms   · LIVE
07 🔊 TTS   · Text-to-speech synthesis                        · OpenAI · ElevenLabs · Azure         · ~60ms  · LIVE
08 🎧 OUT   · PCM-16 audio stream back to caller              · client WebSocket                    · LIVE
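Stage 03's barge-in detection can be approximated with an energy threshold over incoming PCM-16 frames: if the caller gets loud for a few consecutive frames while the assistant is still speaking, cut the TTS. This is a toy sketch of the idea, not OmniVoice's actual ATTS detector:

```python
import struct

def frame_energy(pcm16: bytes) -> float:
    """Mean absolute amplitude of a little-endian PCM-16 frame."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def detect_barge_in(frames, threshold=500, min_frames=3):
    """Flag a barge-in once `min_frames` consecutive frames exceed `threshold`.
    Returns the index of the confirming frame, or None."""
    loud_run = 0
    for i, frame in enumerate(frames):
        loud_run = loud_run + 1 if frame_energy(frame) > threshold else 0
        if loud_run >= min_frames:
            return i
    return None

silence = struct.pack("<160h", *([0] * 160))     # 10ms of silence @ 16kHz
speech = struct.pack("<160h", *([4000] * 160))   # 10ms of loud speech
print(detect_barge_in([silence, speech, speech, speech]))  # → 3
```

Requiring a run of consecutive loud frames (rather than a single spike) is what keeps a cough or a door slam from interrupting the reply.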
RUNNING IN
60 SECONDS.

Clone, set two env vars, start. Your first voice session is one WebSocket connection away.

01
Clone and install
git clone + pip install -r requirements.txt
02
Run the setup wizard
python setup_env.py
03
Pick your LLM — Claude, GPT-4o, or Ollama
LLM_PROVIDER=anthropic / openai / ollama
04
Start everything in one command
python start.py
terminal — omni-voice
# 1. clone & install
git clone https://github.com/yourname/omnivoice
cd omnivoice && pip install -r requirements.txt

# 2. guided setup — picks your providers
python setup_env.py

# choose your LLM (pick one):
LLM_PROVIDER=anthropic   → Claude  (ANTHROPIC_API_KEY)
LLM_PROVIDER=openai      → GPT-4o  (OPENAI_API_KEY)
LLM_PROVIDER=ollama      → Llama3  (no key needed)

# ASR & TTS — both have free options:
ASR_PROVIDER=whisper     → local, free, no key
TTS_PROVIDER=edge        → free Microsoft neural voices

# 3. start everything
python start.py

✓ Tunnel active: https://xxx.trycloudflare.com
✓ Twilio webhook updated automatically
✓ OmniVoice ready — send a WhatsApp voice note
// 01_PROVIDER_AGNOSTIC
🔌
One env var. Any provider.
Deepgram → Whisper. GPT-4o → Claude. OpenAI TTS → ElevenLabs. One env var change, no code rewrite. Provider lock-in is a choice — choose freedom.
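Provider-agnosticism usually comes down to a registry keyed by an env var. A minimal sketch, assuming a hypothetical backend registry (the real OmniVoice provider classes will differ):

```python
import os

# Hypothetical registry — stand-ins for the real provider clients.
LLM_BACKENDS = {
    "anthropic": lambda: "claude-3-5-sonnet client",
    "openai":    lambda: "gpt-4o client",
    "ollama":    lambda: "local llama3 client",
}

def make_llm():
    """Pick the LLM backend from LLM_PROVIDER — swap providers with no code change."""
    provider = os.environ.get("LLM_PROVIDER", "openai")
    try:
        return LLM_BACKENDS[provider]()
    except KeyError:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}") from None

os.environ["LLM_PROVIDER"] = "anthropic"
print(make_llm())  # → claude-3-5-sonnet client
```

The same pattern applies to ASR_PROVIDER and TTS_PROVIDER: as long as every backend satisfies the same interface, the pipeline never knows which one it's talking to.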
ASR · LLM · TTS
// 02_MULTI_CHANNEL
📱
WhatsApp. Phone. Web.
Twilio integration built-in. Voice notes go through ASR→LLM→TTS and come back as audio. Phone calls are real-time bidirectional streams. Three channels, one pipeline.
Twilio · WebSocket · PSTN
// 03_SELF_HOSTED
🔒
Your server. Your audio.
No SaaS middleman listening to your users. No $0.31/min surprise bills. Deploy on any VPS, k8s cluster, or bare metal you already own and trust.
Open source · MIT · Self-hosted
// 04_LATENCY
Sub-300ms. Every call.
TAB smooths bursty LLM tokens into natural speech rhythm. ATTS detects barge-in in real time. Conversations feel like conversations, not HTTP requests.
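The core trick behind a temporal alignment buffer is simple: tokens arrive in bursts, but get released on a steady cadence, never earlier than they arrived. A toy sketch of that scheduling (times in milliseconds; OmniVoice's actual TAB will be more elaborate):

```python
def pace_tokens(arrival_ms, interval_ms=40):
    """Map bursty token arrival times to steady release times.
    Each token is released at a fixed cadence, but never before it exists."""
    release = []
    next_slot = 0
    for t in arrival_ms:
        slot = max(t, next_slot)      # can't release a token before it arrives
        release.append(slot)
        next_slot = slot + interval_ms  # steady cadence for the TTS stage
    return release

# A burst of 4 tokens at t=0 and one straggler at t=500ms:
print(pace_tokens([0, 0, 0, 0, 500]))  # → [0, 40, 80, 120, 500]
```

The burst is spread over 120ms of even spacing, so the TTS stage hears a natural speech rhythm instead of a clump followed by silence.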
TAB · ATTS · AQAL
// 05_OBSERVABILITY
📊
Built-in Prometheus.
Every pipeline stage instrumented out of the box. Latency histograms, session counters, error rates — ready for your Grafana dashboard. No config needed.
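Per-stage instrumentation with the standard `prometheus_client` library looks roughly like this; the metric name here is a hypothetical example, not necessarily what OmniVoice exports:

```python
from prometheus_client import Histogram, REGISTRY

# Hypothetical metric name — the real OmniVoice metrics may differ.
STAGE_LATENCY = Histogram(
    "omnivoice_stage_latency_seconds",
    "Per-stage pipeline latency",
    ["stage"],
)

# Record one observation per stage, as the pipeline would after each request:
for stage, seconds in [("asr", 0.080), ("llm", 0.120), ("tts", 0.060)]:
    STAGE_LATENCY.labels(stage=stage).observe(seconds)

# In a real deployment you'd expose /metrics for Prometheus to scrape:
# from prometheus_client import start_http_server; start_http_server(9100)

print(REGISTRY.get_sample_value(
    "omnivoice_stage_latency_seconds_count", {"stage": "asr"}
))  # → 1.0
```

Histograms give you bucketed latency distributions for free, which is exactly what Grafana's heatmap and quantile panels consume.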
Prometheus · Grafana
// 06_OPEN_SOURCE
MIT. Fork it. Own it.
The full pipeline is open source. Read the code, extend it, ship it. Or use the hosted cloud on omnivoice.dev if you don't want to run infra. Your call.
MIT License · GitHub

GIVE YOUR AI A VOICE.

// OmniVoice is the voice layer. You pick the brain. One env var to swap.

Use your Anthropic API key directly — no middleman. Every voice conversation goes through Claude 3.5 Sonnet, speaking and listening.
# .env — one key, direct to Anthropic
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
TTS_PROVIDER=edge       # free
ASR_PROVIDER=whisper    # free, local
Talk to Claude. Hear Claude talk back.
Use your OpenAI API key directly. GPT-4o for the brain, Edge TTS for the voice, Whisper for the ears — all on your own server.
# .env — one key, direct to OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini   # or gpt-4o
TTS_PROVIDER=edge          # free
ASR_PROVIDER=whisper       # free, local
GPT-4o with a voice. Under 5 minutes.
No API key at all. Llama 3, Mistral, Phi-3 — running on your own machine. Zero cloud dependency, zero cost, zero data leaving your device.
# .env — zero API keys needed
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3     # or mistral, phi3
ASR_PROVIDER=whisper    # local, free
TTS_PROVIDER=edge       # free
# prerequisite: brew install ollama
Fully local. Zero keys. Zero cost.
OmniVoice is the voice layer — not the AI. Use whichever LLM you already pay for, trust, or run locally. Switch providers with one line in .env. Also works with OpenRouter, Gemini, Mistral, DeepSeek, and any OpenAI-compatible endpoint.
🎙 MIC → ASR → YOUR LLM → TTS → 🔊
WhatsApp → ASR → YOUR LLM → TTS → WhatsApp
// Works with
Deepgram
OpenAI Whisper
AssemblyAI
Azure Speech
GPT-4o
Claude Sonnet
OpenRouter
Ollama (local)
ElevenLabs
Twilio
Cloudflare
Prometheus

VS THE COMPETITION.

// Great until they're not. Here's what you give up.

Feature               OmniVoice            Vapi             Retell AI    Bland AI
Pricing               Free (self-hosted)   $0.13–0.31/min   $0.07/min    Per API call
Self-hostable         YES                  NO               NO           NO
Audio on your server  YES                  NO               NO           NO
Provider-agnostic     FULL                 PARTIAL          PARTIAL      NO
WhatsApp + Phone      YES                  Phone only       Phone only   Phone only
Open source           YES (MIT)            NO               NO           NO

YOUR AI.
THEIR
WHATSAPP.

2 billion people already send voice notes. Point OmniVoice at Claude or GPT-4o, connect your Twilio number, and they're talking to your AI — no app download, no new interface, no $0.31/min bill.

★ GITHUB
Self-hosted voice AI infrastructure.
Open source · MIT licensed · omnivoice.dev