# AI Avatar Assistance Pipeline — Build Notes
## 0) Goals & budget
- **UX**: near-instant voice replies with synced face animation.
- **Target E2E** (user speech start → first audible token): **≤ 700 ms** (stretch ≤ 500 ms with warm starts).
- **Cadence**: ASR partials every **40–80 ms**; TTS audio frames **20–40 ms**; viseme look-ahead **150–300 ms**.
- **Lip-sync skew**: |video PTS – audio PTS| ≤ **80 ms** steady-state.
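
A rough decomposition of the 700 ms budget (illustrative midpoints, consistent with the per-stage ranges in §9): ≈250 ms ASR endpointing + ≈250 ms to first Gemini token + ≈150 ms to first Cartesia PCM + ≈50 ms egress/playout ≈ 700 ms. Warm starts mainly trim the LLM and TTS terms toward the 500 ms stretch goal.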
---
## 1) Components (roles)
- **LiveKit (ingress/egress)**
- In: microphone RTP → PCM/Opus; signaling for sessions; user mute/barge-in events.
- Out: mixed AV (Cartesia audio + Hedra frames) via WebRTC.
- **Deepgram `stt-nova-3` (streaming ASR)**
- Outputs partial & final transcripts, **word timings**, optional **punctuation**; VAD endpoints.
- **Hedra Avatar Runtime**
- Avatar scene (2D/3D).
- Consumes **viseme timeline** + optional **blendshape curves**; renders 24–30 fps.
- **LLM: Gemini (chat)**
- Streaming tokens; tool-use hooks for retrieval.
- **External Knowledge Merge**
- Lightweight RAG/tool layer.
- Contracts: `context_snippets[]`, `facts[]`, `citations[]`.
- **Cartesia TTS**
- Streaming PCM + **phoneme/viseme marks** (or SSML marks).
- Low first-chunk latency with pre-warmed voice.
- **Lip-sync & Animation Bridge (Hedra)**
- Maps phonemes→visemes; schedules keyframes ahead of audio PTS; co-articulates.
---
## 2) End-to-end event flow (high level)
1) **Capture**
LiveKit receives user audio frames (Opus/PCM). Forward raw PCM to ASR and keep a short rolling buffer for barge-in rewind if needed (see the buffer sketch after this list).
2) **ASR (Deepgram)**
- Emits **partial** transcripts quickly (every 50–100 ms) + final segments on pause.
- Endpointing: VAD or explicit silence timeout (≈ 200–300 ms).
3) **NLU/LLM (Gemini)**
- On **first reliable clause** (or on endpoint), compose a prompt with system persona + conversation memory.
- If retrieval needed, run the **External KB merge** (see §4) and insert `context_snippets[]`.
- **Stream tokens** immediately.
4) **TTS (Cartesia)**
- Feed **streaming text chunks** to TTS (don’t wait for full answer).
- Receive PCM frames **+ phoneme/viseme marks**.
5) **Lip-sync & Hedra**
- Convert phonemes → visemes; build a **look-ahead timeline** (~200–300 ms).
- Apply co-articulation smoothing; schedule blendshape targets slightly **ahead** of audio (≈ +40 ms).
- Render frames at 24–30 fps.
6) **LiveKit Egress**
- Use audio as **clock master**; align video PTS to audio PTS.
- Stream AV to user.
- Support **barge-in**: stop current TTS, flush viseme queue, fade-out audio, reset timeline.
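
The short rolling buffer from step 1 can be as simple as a bounded deque of PCM frames. A minimal sketch, assuming 20 ms frames; the class and method names are illustrative:

```python
from collections import deque

class RollingPcmBuffer:
    """Keeps the last few hundred ms of mic PCM so ASR can be 'rewound' after a barge-in."""

    def __init__(self, max_ms: int = 500, frame_ms: int = 20):
        self.frames = deque(maxlen=max_ms // frame_ms)

    def push(self, pcm_frame: bytes) -> None:
        self.frames.append(pcm_frame)

    def rewind(self) -> bytes:
        """Return the buffered audio (oldest first) to replay into the ASR stream."""
        return b"".join(self.frames)
```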
---
## 3) Sequence (mermaid)
```mermaid
sequenceDiagram
participant U as User (LiveKit client)
participant LK as LiveKit SFU
participant GW as Avatar Gateway
participant ASR as Deepgram STT
participant LLM as Gemini
participant KB as Ext. Knowledge
participant TTS as Cartesia TTS
participant H as Hedra Avatar
U->>LK: RTP audio (mic)
LK->>GW: PCM frames
GW->>ASR: stream PCM
ASR-->>GW: partial/final transcripts (+timings)
GW->>LLM: prompt + partial/final text
LLM-->>GW: streamed tokens
GW->>KB: (if needed) search/query
KB-->>GW: snippets/citations
GW->>TTS: stream text chunks (+SSML marks)
TTS-->>GW: PCM frames + phoneme/viseme marks
GW->>H: viseme timeline + control
H-->>GW: rendered frames
GW->>LK: audio (Opus) + video (H.264)
LK-->>U: WebRTC AV stream
```
---
## 4) External knowledge merge (deterministic & cheap)
- **When**: Trigger on keywords, missing facts, or user intent (“check calendar”, “latest price”, etc.).
- **Flow**:
1. Draft query from ASR text.
2. Fetch **fast** (cached, pre-indexed) results.
3. Produce contract:
```json
{
"context_snippets": [
{"text": "…", "source": "kb://…", "relevance": 0.86}
],
"facts": [
{"key": "company_price", "value": "€123.45", "ts": "2025-08-13T11:21:00Z"}
],
"citations": ["kb://…", "web://…"]
}
```
4. Prepend to Gemini prompt as **grounding**; keep it <1–2 k tokens.
- **Latency budget**: ≤ 120 ms cache hit, ≤ 250 ms light fetch.
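
A minimal sketch of this merge step. The function name matches the `maybe_enrich` helper imported in §10; the trigger regex, cache, and scoring fields are illustrative:

```python
import re
import time

# Illustrative trigger patterns and a pre-warmed cache of indexed results.
_TRIGGERS = re.compile(r"\b(price|calendar|schedule|latest|weather)\b", re.I)
_KB_CACHE: dict[str, list[dict]] = {}

def maybe_enrich(asr_text: str, budget_ms: int = 120) -> dict | None:
    """Return the grounding contract from §4, or None if no retrieval is needed or the budget is blown."""
    if not _TRIGGERS.search(asr_text):
        return None
    t0 = time.monotonic()
    hits = _KB_CACHE.get(asr_text.lower().strip(), [])[:3]   # cached, pre-indexed results only
    if (time.monotonic() - t0) * 1000 > budget_ms:
        return None  # over budget: answer ungrounded rather than stall the turn
    return {
        "context_snippets": [
            {"text": h["text"], "source": h["source"], "relevance": h["score"]} for h in hits
        ],
        "facts": [],
        "citations": [h["source"] for h in hits],
    }
```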
---
## 5) Data contracts (streaming)
### 5.1 ASR → Orchestrator
```json
{
"type": "asr.partial",
"text": "what’s the weather in",
"words": [
{"w":"what’s","t0":123.0,"t1":200.0},
{"w":"the","t0":200.0,"t1":240.0}
],
"final": false
}
```
### 5.2 LLM → Orchestrator (stream)
```json
{"type":"llm.delta","text":"Sure, "}
{"type":"llm.delta","text":"today "}
…
{"type":"llm.done","usage":{"input_tokens":..., "output_tokens":...}}
```
### 5.3 Orchestrator → TTS (chunked)
```json
{"type":"tts.input","text":"Sure, today it's sunny—","ssml":true}
```
### 5.4 TTS → Orchestrator
```json
{
"type": "tts.chunk",
"pts_ms": 3140,
"pcm_b64": "<…>",
"marks": [
{"t_ms": 3160, "phoneme": "S"},
{"t_ms": 3200, "phoneme": "UH1"}
]
}
```
### 5.5 Viseme schedule → Hedra
```json
{
"type":"viseme.timeline",
"audio_pts_base_ms": 3000,
"items":[
{"t_ms": 3160, "viseme": "FV", "w": 0.9, "dur_ms": 90},
{"t_ms": 3230, "viseme": "UH", "w": 0.85, "dur_ms": 100}
]
}
```
---
## 6) Lip-sync rules (fast & stable)
- **Map** phonemes → ~15 visemes (Disney/ARKit subset).
- **Co-articulation**: current 0.7, neighbors 0.3 total; clamp min viseme hold **60 ms**, max **140 ms**.
- **Lead time**: schedule viseme targets **+40 ms** ahead of audio PTS to counter playout delay.
- **Drift control**: if |drift| > 30 ms for 3 frames, time-warp next 120 ms of visemes by ±5–10%.
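
A compact sketch of these rules, using the same `phoneme_to_visemes` / `schedule_visemes` names as the §10 skeleton; the mapping table and exact weights are illustrative:

```python
# Hypothetical phoneme→viseme map (ARKit-style subset); a real table covers the full phone set.
PHONEME_TO_VISEME = {
    "P": "MBP", "B": "MBP", "M": "MBP",
    "F": "FV", "V": "FV",
    "S": "SZ", "Z": "SZ",
    "UH1": "UH", "UW1": "UH", "AA1": "AA",
}

def phoneme_to_visemes(marks):
    """Map TTS phoneme marks (§5.4) to raw viseme events."""
    return [
        {"t_ms": m["t_ms"], "viseme": PHONEME_TO_VISEME.get(m["phoneme"], "REST"), "w": 1.0}
        for m in marks
    ]

def schedule_visemes(vis, audio_pts, lead_ms=40, min_hold_ms=60, max_hold_ms=140):
    """Apply co-articulation (0.7 current, 0.3 split across neighbors), hold clamps, and lead time."""
    items = []
    for i, v in enumerate(vis):
        w = 0.7 * v["w"]
        if i > 0:
            w += 0.15 * vis[i - 1]["w"]
        if i + 1 < len(vis):
            w += 0.15 * vis[i + 1]["w"]
        nxt = vis[i + 1]["t_ms"] if i + 1 < len(vis) else v["t_ms"] + max_hold_ms
        dur = min(max(nxt - v["t_ms"], min_hold_ms), max_hold_ms)
        items.append({"t_ms": v["t_ms"] - lead_ms,   # schedule ahead of audio PTS
                      "viseme": v["viseme"], "w": round(w, 2), "dur_ms": dur})
    return {"type": "viseme.timeline", "audio_pts_base_ms": audio_pts, "items": items}
```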
---
## 7) Concurrency & backpressure
- **Pipelines** (separate async loops/threads):
- `ingress_audio` → ASR
- `dialog_manager` (ASR → LLM → TTS)
- `lip_sync` (TTS marks → viseme timeline)
- `renderer` (Hedra)
- `egress_av` (LiveKit out)
- **Queues** with watermarks:
- Drop oldest **ASR partials** if queue > 2.
- **Chunk TTS input** by clause; don’t exceed 300 ms queued speech.
- If **TTS underrun**: insert a 60–100 ms neutral mouth flap (low weight) to mask.
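
A sketch of the drop-oldest policy for the ASR partial queue (the helper name is illustrative):

```python
import asyncio

async def put_drop_oldest(q: asyncio.Queue, item, watermark: int = 2) -> None:
    """Enqueue an ASR partial, discarding stale ones once the queue exceeds the watermark.
    A newer hypothesis always supersedes older partials, so dropping is safe."""
    while q.qsize() >= watermark:
        try:
            q.get_nowait()  # drop the oldest stale partial
        except asyncio.QueueEmpty:
            break
    await q.put(item)
```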
---
## 8) Barge-in & turn-taking
- **Detect**: user speech energy > threshold or VAD speech start while agent speaking.
- **Act** (≤ 80 ms):
1. Send `tts.cancel` to Cartesia.
2. Fade out audio 40 ms; clear viseme queue; signal Hedra to neutral pose.
3. Append “interruption” marker to dialog and pass user audio to ASR.
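
A sketch of the barge-in handler; `tts.cancel`, `lk_out.fade_out_audio`, and `hedra.set_neutral_pose` are assumed hooks on the respective clients, not documented APIs:

```python
import asyncio

async def handle_barge_in(tts, hedra, lk_out, q_visemes: asyncio.Queue) -> None:
    """Stop the agent within the ~80 ms budget: cancel TTS, fade audio, flush visemes, go neutral."""
    await tts.cancel()                   # 1) stop new PCM at the source (assumed hook)
    await lk_out.fade_out_audio(ms=40)   # 2) 40 ms fade on already-queued audio (assumed hook)
    while not q_visemes.empty():
        q_visemes.get_nowait()           #    drop pending mouth shapes
    await hedra.set_neutral_pose()       #    return the face to neutral (assumed hook)
```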
---
## 9) Observability (must-have metrics)
- `t_user_speech_start → asr_first_partial` (expect 80–150 ms)
- `asr_final_time` & segment WER proxy
- `t_asr_final → llm_first_token` (100–250 ms cloud)
- `t_llm_first_token → tts_first_pcm` (50–150 ms)
- `audio_pts ↔ video_pts` skew histogram
- WebRTC RTT/jitter, audio underruns, frame drop %, CPU/GPU load
- Per-turn logs with correlation id.
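
A small per-turn metrics holder along these lines keeps the markers tied to one correlation id (names are illustrative):

```python
import json
import logging
import time

log = logging.getLogger("avatar.metrics")

class TurnMetrics:
    """Millisecond markers for one dialog turn, keyed by a correlation id."""

    def __init__(self, correlation_id: str):
        self.cid = correlation_id
        self.t0 = time.monotonic()
        self.marks_ms: dict[str, float] = {}

    def mark(self, name: str) -> None:
        self.marks_ms[name] = round((time.monotonic() - self.t0) * 1000, 1)

    def flush(self) -> None:
        log.info(json.dumps({"cid": self.cid, "marks_ms": self.marks_ms}))
```

Usage: call `m.mark("asr_first_partial")` when the first hypothesis arrives, `m.mark("tts_first_pcm")` on the first audio chunk, and `m.flush()` at turn end.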
---
## 10) Minimal orchestrator skeleton (Python/async; pseudo-real)
```python
# app.py (per-session orchestrator skeleton; client classes are thin pseudo-real wrappers)
import asyncio

from livekit_in import LiveKitIn
from livekit_out import LiveKitOut
from deepgram_stream import DeepgramASR
from gemini_stream import GeminiLLM
from cartesia_stream import CartesiaTTS
from hedra_bridge import HedraClient, phoneme_to_visemes, schedule_visemes
from kb_merge import maybe_enrich
# _dedupe_asr / _clause_boundary: small local helpers, sketched after this block.


async def run_session(session_id):
    lk_in = LiveKitIn(session_id)
    lk_out = LiveKitOut(session_id)
    asr = DeepgramASR()
    llm = GeminiLLM()
    tts = CartesiaTTS()
    hedra = HedraClient()

    # Bounded queues provide backpressure between the loops (§7).
    q_asr = asyncio.Queue(maxsize=4)
    q_tts_pcm = asyncio.Queue(maxsize=8)
    q_visemes = asyncio.Queue(maxsize=8)

    async def ingress():
        # LiveKit mic PCM → Deepgram, frame by frame.
        async for pcm in lk_in.stream_pcm():
            await asr.send_audio(pcm)

    async def asr_loop():
        async for hyp in asr.results():
            await q_asr.put(hyp)

    async def dialog_loop():
        async for hyp in _dedupe_asr(q_asr):
            ctx = maybe_enrich(hyp.text)  # fast RAG/tooling (§4)
            pending = ""
            async for delta in llm.stream(hyp.text, ctx):
                pending += delta.text
                if _clause_boundary(pending):
                    await tts.send_text(pending)  # feed TTS early, clause by clause
                    pending = ""
            if pending:
                await tts.send_text(pending)  # flush the tail of the answer

    async def tts_loop():
        async for chunk in tts.stream_pcm():
            await q_tts_pcm.put(chunk)
            vis = phoneme_to_visemes(chunk.marks)
            timeline = schedule_visemes(vis, audio_pts=chunk.pts_ms)
            await q_visemes.put(timeline)

    async def render_loop():
        # Audio is the clock master: send PCM, apply pending viseme timelines, render a frame.
        while True:
            pcm = await q_tts_pcm.get()
            await lk_out.send_audio(pcm)
            while not q_visemes.empty():
                await hedra.apply_timeline(q_visemes.get_nowait())
            frame = await hedra.render_next()
            await lk_out.send_video(frame)

    await asyncio.gather(
        ingress(), asr_loop(), dialog_loop(), tts_loop(), render_loop()
    )


if __name__ == "__main__":
    asyncio.run(run_session("session-1"))
```
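
`_dedupe_asr` and `_clause_boundary` are referenced but not defined above; minimal illustrative versions follow (their exact behaviour is an assumption):

```python
import asyncio
import re

_CLAUSE_END = re.compile(r"[.!?;:,]\s*$")

def _clause_boundary(text: str) -> bool:
    """True when the buffered text ends at a natural prosodic break worth speaking."""
    return bool(_CLAUSE_END.search(text)) or len(text) > 120

async def _dedupe_asr(q_asr: asyncio.Queue):
    """Collapse stale partials and yield only endpointed (final) hypotheses."""
    while True:
        hyp = await q_asr.get()
        while not q_asr.empty():      # a newer hypothesis supersedes this one
            hyp = q_asr.get_nowait()
        if getattr(hyp, "final", False):
            yield hyp
```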
---
## 11) Config tips
- **Audio**: 24 kHz mono, Opus bitrate 24–32 kbps; frame 20–40 ms.
- **Video**: 720p @ 24–30 fps; GOP 1–2 s; tune for low delay.
- **Warm starts**: pre-init Gemini context, cache RAG indexes, pre-select the Cartesia voice.
- **Timeouts**: ASR endpoint 250 ms; LLM no-token stall 1 s → “thinking” filler; TTS stall >120 ms → neutral viseme.
- **Retry**: exponential backoff; never block user input.
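
Collected into one place, these knobs might look like the following (values taken from the tips above; the structure itself is illustrative):

```python
PIPELINE_CONFIG = {
    "audio": {"sample_rate_hz": 24_000, "channels": 1, "opus_kbps": 32, "frame_ms": 20},
    "video": {"resolution": "1280x720", "fps": 30, "gop_s": 2, "tune": "low_delay"},
    "timeouts_ms": {
        "asr_endpoint": 250,
        "llm_no_token_filler": 1_000,      # play a "thinking" filler after this stall
        "tts_stall_neutral_viseme": 120,
    },
    "lipsync": {"lead_ms": 40, "lookahead_ms": 250, "min_hold_ms": 60, "max_hold_ms": 140},
    "retry": {"strategy": "exponential_backoff", "block_user_input": False},
}
```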
---
## 12) Testing checklist
- **Cold vs warm** E2E latency (10 trials).
- **Packet loss** 3% & 80 ms jitter: speech stability, A/V skew.
- **Barge-in** during 3 positions (sentence start/mid/end).
- **Viseme quality**: co-articulation smoothness, plosive accuracy (/p,b,m/).
- **Edge cases**: acronyms, numbers, code-switching, long answers (>15 s).
- **Fallback**: if TTS phoneme/viseme marks are absent, derive phonemes from the text with a fast phonemizer (grapheme-to-phoneme) and map them to visemes.
---
**TL;DR**
Wire it as **five streaming loops** (ingress/ASR, dialog/LLM, TTS, lip-sync, render/egress), **feed TTS early**, schedule visemes with a **200–300 ms look-ahead**, and keep audio as the **clock master**. With warm starts and tight queues, this stack comfortably hits the **sub-700 ms** speech-to-first-audio target while keeping Hedra’s face in sync.