# AI Avatar Assistance Pipeline — Build Notes

## 0) Goals & budget

- **UX**: near-instant voice replies with synced face animation.
- **Target E2E** (user speech start → first audible token): **≤ 700 ms** (stretch ≤ 500 ms with warm starts).
- **Cadence**: ASR partials every **40–80 ms**; TTS audio frames **20–40 ms**; viseme look-ahead **150–300 ms**.
- **Lip-sync skew**: |video PTS – audio PTS| ≤ **±80 ms** steady-state.

---

## 1) Components (roles)

- **LiveKit (ingress/egress)**
  - In: microphone RTP → PCM/Opus; signaling for sessions; user mute/barge-in events.
  - Out: mixed AV (Cartesia audio + Hedra frames) via WebRTC.
- **Deepgram `stt-nova-3` (streaming ASR)**
  - Outputs partial & final transcripts, **word timings**, optional **punctuation**; VAD endpoints.
- **Hedra Avatar Runtime**
  - Avatar scene (2D/3D).
  - Consumes **viseme timeline** + optional **blendshape curves**; renders 24–30 fps.
- **LLM: Gemini (chat)**
  - Streaming tokens; tool-use hooks for retrieval.
- **External Knowledge Merge**
  - Lightweight RAG/tool layer.
  - Contracts: `context_snippets[]`, `facts[]`, `citations[]`.
- **Cartesia TTS**
  - Streaming PCM + **phoneme/viseme marks** (or SSML marks).
  - Low first-chunk latency with pre-warmed voice.
- **Lip-sync & Animation Bridge (Hedra)**
  - Maps phonemes → visemes; schedules keyframes ahead of audio PTS; co-articulates.

---

## 2) End-to-end event flow (high level)

1) **Capture**
   LiveKit receives user audio frames (Opus/PCM). Forward raw PCM to ASR and keep a short rolling buffer (for barge-in rewind if needed).

2) **ASR (Deepgram)**
   - Emits **partial** transcripts quickly (every 50–100 ms) + final segments on pause.
   - Endpointing: VAD or explicit silence timeout (≈ 200–300 ms).

3) **NLU/LLM (Gemini)**
   - On the **first reliable clause** (or on endpoint), compose a prompt with system persona + conversation memory.
   - If retrieval is needed, run the **External KB merge** (see §4) and insert `context_snippets[]`.
   - **Stream tokens** immediately.

4) **TTS (Cartesia)**
   - Feed **streaming text chunks** to TTS (don't wait for the full answer).
   - Receive PCM frames **+ phoneme/viseme marks**.

5) **Lip-sync & Hedra**
   - Convert phonemes → visemes; build a **look-ahead timeline** (~200–300 ms).
   - Apply co-articulation smoothing; schedule blendshape targets slightly **ahead** of audio (≈ +40 ms).
   - Render frames at 24–30 fps.

6) **LiveKit Egress**
   - Use audio as the **clock master**; align video PTS to audio PTS.
   - Stream AV to the user.
   - Support **barge-in**: stop current TTS, flush the viseme queue, fade out audio, reset the timeline.

---

## 3) Sequence (mermaid)

```mermaid
sequenceDiagram
    participant U as User (LiveKit client)
    participant LK as LiveKit SFU
    participant GW as Avatar Gateway
    participant ASR as Deepgram STT
    participant LLM as Gemini
    participant KB as Ext. Knowledge
    participant TTS as Cartesia TTS
    participant H as Hedra Avatar

    U->>LK: RTP audio (mic)
    LK->>GW: PCM frames
    GW->>ASR: stream PCM
    ASR-->>GW: partial/final transcripts (+timings)
    GW->>LLM: prompt + partial/final text
    LLM-->>GW: streamed tokens
    GW->>KB: (if needed) search/query
    KB-->>GW: snippets/citations
    GW->>TTS: stream text chunks (+SSML marks)
    TTS-->>GW: PCM frames + phoneme/viseme marks
    GW->>H: viseme timeline + control
    H-->>GW: rendered frames
    GW->>LK: audio (Opus) + video (H.264)
    LK-->>U: WebRTC AV stream
```

---
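The "short rolling buffer" from §2 step 1 can be a fixed-duration ring of timestamped PCM frames, so a barge-in can replay the last few hundred milliseconds into ASR ahead of the live frames. A minimal sketch; the class name, frame size, and window length are illustrative, not part of any SDK above:

```python
import collections
import time


class RollingPcmBuffer:
    """Keep the most recent `window_ms` of mic PCM for barge-in rewind (§2 step 1)."""

    def __init__(self, window_ms: int = 500, frame_ms: int = 20):
        self.frames = collections.deque(maxlen=window_ms // frame_ms)

    def push(self, pcm: bytes) -> None:
        # Newest frame in; the oldest frame falls off once the window is full.
        self.frames.append((time.monotonic(), pcm))

    def rewind(self, ms: int = 300) -> list[bytes]:
        # Frames captured within the last `ms` milliseconds, oldest first.
        cutoff = time.monotonic() - ms / 1000.0
        return [pcm for ts, pcm in self.frames if ts >= cutoff]
```

In the ingress loop this sits next to the ASR send call: every frame is pushed, and on a barge-in event something like `rewind(300)` is fed to ASR before the live stream so the start of the interruption isn't clipped.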
## 4) External knowledge merge (deterministic & cheap)

- **When**: Trigger on keywords, missing facts, or user intent (“check calendar”, “latest price”, etc.).
- **Flow**:
  1. Draft a query from the ASR text.
  2. Fetch **fast** (cached, pre-indexed) results.
  3. Produce the contract:

     ```json
     {
       "context_snippets": [
         {"text": "…", "source": "kb://…", "relevance": 0.86}
       ],
       "facts": [
         {"key": "company_price", "value": "€123.45", "ts": "2025-08-13T11:21:00Z"}
       ],
       "citations": ["kb://…", "web://…"]
     }
     ```

  4. Prepend to the Gemini prompt as **grounding**; keep it under 1–2 k tokens.
- **Latency budget**: ≤ 120 ms cache hit, ≤ 250 ms light fetch.

---

## 5) Data contracts (streaming)

### 5.1 ASR → Orchestrator

```json
{
  "type": "asr.partial",
  "text": "what’s the weather in",
  "words": [
    {"w":"what’s","t0":123.0,"t1":200.0},
    {"w":"the","t0":200.0,"t1":240.0}
  ],
  "final": false
}
```

### 5.2 LLM → Orchestrator (stream)

```json
{"type":"llm.delta","text":"Sure, "}
{"type":"llm.delta","text":"today "}
…
{"type":"llm.done","usage":{"input_tokens":..., "output_tokens":...}}
```

### 5.3 Orchestrator → TTS (chunked)

```json
{"type":"tts.input","text":"Sure, today it's sunny—","ssml":true}
```

### 5.4 TTS → Orchestrator

```json
{
  "type": "tts.chunk",
  "pts_ms": 3140,
  "pcm_b64": "<…>",
  "marks": [
    {"t_ms": 3160, "phoneme": "S"},
    {"t_ms": 3200, "phoneme": "UH1"}
  ]
}
```

### 5.5 Viseme schedule → Hedra

```json
{
  "type":"viseme.timeline",
  "audio_pts_base_ms": 3000,
  "items":[
    {"t_ms": 3160, "viseme": "FV", "w": 0.9, "dur_ms": 90},
    {"t_ms": 3230, "viseme": "UH", "w": 0.85, "dur_ms": 100}
  ]
}
```

---

## 6) Lip-sync rules (fast & stable)

- **Map** phonemes → ~15 visemes (Disney/ARKit subset); a helper sketch follows §8.
- **Co-articulation**: current 0.7, neighbors 0.3 total; clamp viseme hold to min **60 ms**, max **140 ms**.
- **Lead time**: schedule viseme targets **+40 ms** ahead of audio PTS to counter playout delay.
- **Drift control**: if |drift| > 30 ms for 3 frames, time-warp the next 120 ms of visemes by ±5–10%.

---

## 7) Concurrency & backpressure

- **Pipelines** (separate async loops/threads):
  - `ingress_audio` → ASR
  - `dialog_manager` (ASR → LLM → TTS)
  - `lip_sync` (TTS marks → viseme timeline)
  - `renderer` (Hedra)
  - `egress_av` (LiveKit out)
- **Queues** with watermarks:
  - Drop the oldest **ASR partials** if the queue depth exceeds 2.
  - **Chunk TTS input** by clause; don't exceed 300 ms of queued speech.
  - On **TTS underrun**: insert a 60–100 ms neutral mouth flap (low weight) to mask it.

---

## 8) Barge-in & turn-taking

- **Detect**: user speech energy above threshold, or VAD speech start while the agent is speaking.
- **Act** (≤ 80 ms; a handler sketch follows this list):
  1. Send `tts.cancel` to Cartesia.
  2. Fade out audio over 40 ms; clear the viseme queue; signal Hedra to a neutral pose.
  3. Append an “interruption” marker to the dialog and pass the user audio to ASR.
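A barge-in handler following the three steps above, written against the wrappers from §10. `tts.cancel()`, `lk_out.fade_out_audio()`, `hedra.neutral_pose()`, and `dialog.mark_interruption()` are assumed method names for illustration, not real SDK calls:

```python
import asyncio


async def handle_barge_in(tts, lk_out, hedra, dialog, q_visemes: asyncio.Queue) -> None:
    """§8: cancel TTS, fade the audio tail, reset the face, mark the turn. Target: ≤ 80 ms total."""
    # 1) Stop synthesis as early as possible.
    await tts.cancel()                      # assumed wrapper around Cartesia's cancel/flush

    # 2) Fade the audio tail and drop any queued mouth shapes.
    await lk_out.fade_out_audio(ms=40)      # assumed helper on the LiveKit egress wrapper
    while not q_visemes.empty():
        q_visemes.get_nowait()
    await hedra.neutral_pose()              # assumed helper: relax the avatar to neutral

    # 3) Record the interruption; ASR keeps consuming the user's audio regardless.
    dialog.mark_interruption()              # assumed dialog-state helper
```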
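The rules in §6 also pin down most of what the `phoneme_to_visemes` / `schedule_visemes` helpers imported in §10 have to do. A sketch under those rules; the phoneme table is deliberately truncated, the viseme labels follow the style of §5.5, and none of this is Hedra's actual API:

```python
# Truncated phoneme → viseme table (ARPAbet-style keys; extend to the full ~15-viseme set).
PHONEME_TO_VISEME = {
    "P": "MBP", "B": "MBP", "M": "MBP",
    "F": "FV", "V": "FV",
    "S": "SS", "Z": "SS",
    "AA": "AA", "UH": "UH",
}

LEAD_MS = 40                          # schedule targets ahead of audio PTS (§6)
MIN_HOLD_MS, MAX_HOLD_MS = 60, 140


def phoneme_to_visemes(marks: list[dict]) -> list[dict]:
    """Map TTS phoneme marks (§5.4) to viseme events; unknown phonemes are dropped."""
    out = []
    for m in marks:
        vis = PHONEME_TO_VISEME.get(m["phoneme"].rstrip("012"))   # strip ARPAbet stress digits
        if vis:
            out.append({"t_ms": m["t_ms"], "viseme": vis})
    return out


def schedule_visemes(visemes: list[dict], audio_pts: int) -> dict:
    """Build a §5.5-style timeline: clamp holds to 60–140 ms, add the 40 ms lead, weight per §6."""
    items = []
    for i, v in enumerate(visemes):
        nxt = visemes[i + 1]["t_ms"] if i + 1 < len(visemes) else v["t_ms"] + MIN_HOLD_MS
        dur = max(MIN_HOLD_MS, min(MAX_HOLD_MS, nxt - v["t_ms"]))
        # Crude stand-in for the 0.7 / 0.3 co-articulation split: full weight at
        # utterance edges, 0.7 in the middle so neighbours can share the rest.
        w = 0.7 if 0 < i < len(visemes) - 1 else 0.9
        items.append({"t_ms": v["t_ms"] - LEAD_MS, "viseme": v["viseme"], "w": w, "dur_ms": dur})
    return {"type": "viseme.timeline", "audio_pts_base_ms": audio_pts, "items": items}
```

The §6 drift-control rule (time-warp the next 120 ms when skew exceeds 30 ms for 3 frames) would live in the render loop rather than here.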
---

## 9) Observability (must-have metrics)

- `t_user_speech_start → asr_first_partial` (expect 80–150 ms)
- `asr_final_time` & segment WER proxy
- `t_asr_final → llm_first_token` (100–250 ms cloud)
- `t_llm_first_token → tts_first_pcm` (50–150 ms)
- `audio_pts ↔ video_pts` skew histogram
- WebRTC RTT/jitter, audio underruns, frame drop %, CPU/GPU load
- Per-turn logs with a correlation id.

---

## 10) Minimal orchestrator skeleton (Python/async; pseudo-real)

```python
# app.py
import asyncio

from livekit_in import LiveKitIn
from deepgram_stream import DeepgramASR
from gemini_stream import GeminiLLM
from cartesia_stream import CartesiaTTS
from hedra_bridge import HedraClient, phoneme_to_visemes, schedule_visemes
from livekit_out import LiveKitOut
from kb_merge import maybe_enrich


async def run_session(session_id):
    lk_in = LiveKitIn(session_id)
    lk_out = LiveKitOut(session_id)
    asr = DeepgramASR()
    llm = GeminiLLM()
    tts = CartesiaTTS()
    hedra = HedraClient()

    q_asr = asyncio.Queue(maxsize=4)        # bounded queues = backpressure (§7)
    q_tts_pcm = asyncio.Queue(maxsize=8)
    q_visemes = asyncio.Queue(maxsize=8)

    async def ingress():
        # LiveKit mic PCM → Deepgram.
        async for pcm in lk_in.stream_pcm():
            await asr.send_audio(pcm)

    async def asr_loop():
        async for hyp in asr.results():
            await q_asr.put(hyp)

    async def dialog_loop():
        # _dedupe_asr / _clause_boundary are helpers elided here (see the sketch after the TL;DR).
        async for hyp in _dedupe_asr(q_asr):
            ctx = maybe_enrich(hyp.text)        # fast RAG/tooling (§4)
            clause = ""
            async for delta in llm.stream(hyp.text, ctx):
                clause += delta.text
                if _clause_boundary(clause):    # feed TTS clause by clause; don't wait for the full answer
                    await tts.send_text(clause)
                    clause = ""
            if clause:                          # flush the trailing fragment
                await tts.send_text(clause)

    async def tts_loop():
        async for chunk in tts.stream_pcm():
            await q_tts_pcm.put(chunk)
            vis = phoneme_to_visemes(chunk.marks)
            timeline = schedule_visemes(vis, audio_pts=chunk.pts_ms)
            await q_visemes.put(timeline)

    async def render_loop():
        # Audio is the clock master: send PCM first, then apply any pending viseme timelines.
        while True:
            pcm = await q_tts_pcm.get()
            await lk_out.send_audio(pcm)
            while not q_visemes.empty():
                tl = await q_visemes.get()
                await hedra.apply_timeline(tl)
            frame = await hedra.render_next()
            await lk_out.send_video(frame)

    await asyncio.gather(
        ingress(), asr_loop(), dialog_loop(), tts_loop(), render_loop()
    )


if __name__ == "__main__":
    asyncio.run(run_session("session-1"))
```

---

## 11) Config tips

- **Audio**: 24 kHz mono, Opus bitrate 24–32 kbps; frame 20–40 ms.
- **Video**: 720p @ 24–30 fps; GOP 1–2 s; tune for low delay.
- **Warm starts**: pre-init the Gemini context, cache RAG indexes, pre-select the Cartesia voice.
- **Timeouts**: ASR endpoint 250 ms; LLM no-token stall 1 s → “thinking” filler; TTS stall > 120 ms → neutral viseme.
- **Retry**: exponential backoff; never block user input.

---

## 12) Testing checklist

- **Cold vs warm** E2E latency (10 trials).
- **Packet loss** 3% & 80 ms jitter: speech stability, A/V skew.
- **Barge-in** at 3 positions (sentence start/mid/end).
- **Viseme quality**: co-articulation smoothness, plosive accuracy (/p, b, m/).
- **Edge cases**: acronyms, numbers, code-switching, long answers (> 15 s).
- **Fallback**: if TTS marks are absent, derive visemes from phonemes produced by a fast phonemizer.

---

**TL;DR**
Wire it as **five streaming loops** (ingress/ASR, dialog/LLM, TTS, lip-sync, render/egress), **feed TTS early**, schedule visemes with a **200–300 ms look-ahead**, and keep audio as the **clock master**. With warm starts and tight queues, this stack comfortably hits sub-700 ms first-token-to-ear latency while keeping Hedra's face in sync.
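One last gap in §10: `_dedupe_asr` and `_clause_boundary` are referenced but never defined. A minimal sketch of both; the punctuation/length heuristic and the "speak early at most once per utterance" policy are assumptions, not part of the design above:

```python
import asyncio

CLAUSE_ENDINGS = (".", ",", "!", "?", ";", ":")


def _clause_boundary(text: str) -> bool:
    """Heuristic: flush buffered text to TTS once it ends a clause or grows long."""
    stripped = text.rstrip()
    return stripped.endswith(CLAUSE_ENDINGS) or len(stripped) > 80


async def _dedupe_asr(q: asyncio.Queue):
    """Yield at most one hypothesis per utterance: the first partial forming a clause, else the final (§2 step 3)."""
    spoke_early = False
    while True:
        hyp = await q.get()
        if hyp.final:
            if not spoke_early:
                yield hyp
            spoke_early = False           # reset for the next utterance
        elif not spoke_early and _clause_boundary(hyp.text):
            spoke_early = True            # answered from a partial; ignore later partials of this utterance
            yield hyp
```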