---
title: Build a Tavus Clone From Scratch
---

# Tavus Clone: How It Works & MVP Bare Bones

> **Goal:** Understand how Tavus-style video, voice, and lip-sync work under the hood, whether it’s “deepfake” tech, and what a minimal MVP looks like. Frontend: Next.js. Concepts and structure, plus a few hedged TypeScript sketches where they clarify a flow.

---

## Part 1: How the Tech Works

### 1.1 Is It a Deepfake?

**Short answer:** It’s in the same family as deepfake tech, but the product is usually framed differently.

- **Classic “deepfake”** = face-swap: take person A’s face and paste it onto person B’s body/performance in a video. You’re replacing one face with another.
- **Tavus-style “replica”** = **synthetic talking head**: you’re not swapping a face onto someone else’s video. You’re **generating new video of a face** (or head/upper body) from scratch, driven by **audio** (and sometimes a few reference images or a short training clip). The model learns one identity from a small amount of data, then produces new frames of that same person saying new things. So:
  - Same underlying idea: AI-generated or AI-modified face video.
  - Different pipeline: **identity + audio → new video**, not “video of B + face of A.”
- **Lip-sync / dubbing** (e.g. Hummingbird-style): take **existing video** of a person and **change only the mouth** to match new audio. The rest of the face and body stay from the original. That’s more “alteration” than “full synthesis,” but still in the same ethical/legal bucket as synthetic media.

So: yes, it’s synthetic face video that can look and sound like a real person—same broad category as deepfakes. The main difference is the *method* (synthesis from identity + audio vs. face-swap) and how the product is used (consent, disclosure, compliance).

---

### 1.2 How the Video Works

**Inputs**

- **Identity:** A short training video (e.g. ~2 minutes) of one person: face visible, talking for part of it, neutral/listening for part of it. Sometimes a single photo or a few frames.
- **Driver:** New **audio** (what you want them to “say”). That audio can come from:
  - Text → TTS (voice clone), or  
  - A pre-recorded audio file you provide.

**What the model does**

- The system learns an **identity representation** from the training data: face shape, skin, expressions, how the mouth moves when speaking, lighting, etc. Tavus uses a **Gaussian-diffusion–style** rendering model (they call it Phoenix): it generates frames that look like that person, conditioned on the driving signal.
- The **driving signal** is usually derived from the **audio**: phonemes, or mel-spectrogram, or other audio features. So: “this phoneme at this time” → “mouth shape and face pose at this frame.”
- Output: a **new video** of that identity speaking the new audio, with lip movements and often subtle head/expression motion aligned to the speech. No original video is “played back” with a swapped face—the video is **synthesized frame by frame** from identity + audio.
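
As a data-flow sketch, a hypothetical frame loop looks like this (TypeScript; every name is illustrative, none of this is a real model API):

```ts
// Conceptual only: shows the data flow, not a real model interface.
type Identity = Float32Array;      // learned once from the ~2-minute training clip
type AudioFeature = Float32Array;  // e.g. one mel-spectrogram window
type Frame = Uint8Array;           // pixels for one output frame

declare function learnIdentity(trainingVideo: Blob): Promise<Identity>;
declare function audioFeatureAt(audio: Blob, tMs: number): AudioFeature;
declare function renderFrame(identity: Identity, a: AudioFeature): Frame;

// Every frame is generated fresh from the same identity plus the audio at time t;
// no source video is replayed, and no face is swapped in.
function synthesize(identity: Identity, audio: Blob, durationMs: number, fps = 25): Frame[] {
  const frames: Frame[] = [];
  for (let t = 0; t < durationMs; t += 1000 / fps) {
    frames.push(renderFrame(identity, audioFeatureAt(audio, t)));
  }
  return frames;
}
```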

**In other words**

- **Not** “take a full-body video and paste a fake face on it.”
- **Yes** “learn this person’s look from 2 minutes of video; then generate new frames of that person from audio alone.” So it’s **synthetic talking-head generation** (sometimes called a “digital human” or “AI avatar”).

---

### 1.3 How the Voice Works

**Two separate pieces**

1. **Voice cloning (TTS)**  
   From a few minutes of that same person’s speech (often the audio from the training video), the system extracts a **voice profile** (embedding). When you give it **new text**, it generates **new speech** that sounds like that person, in many languages. So: same “replica” identity = same face *and* same voice. No re-recording.

2. **Lip-sync**  
   Once you have the **audio** (from TTS or from your own file), the **video** model uses that audio to drive mouth and face motion. So the voice pipeline is: **text → TTS (cloned voice) → audio**. The video pipeline then takes that **audio** and produces **lip-synced face video**.

**Flow**

- **Text in** → TTS with cloned voice → **audio** → lip-sync / talking-head model → **video**.  
- Or: **your audio file** → lip-sync / talking-head model → **video** (no TTS).
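
As a sketch, the two flows differ only in whether TTS runs first. `ttsClone` and `talkingHead` are hypothetical stand-ins for whatever TTS and synthesis services the backend uses:

```ts
// Hypothetical service wrappers; both return a URL to their output.
declare function ttsClone(replicaId: string, text: string): Promise<string>;        // text -> audio URL
declare function talkingHead(replicaId: string, audioUrl: string): Promise<string>; // audio -> video URL

// Text in: script -> cloned voice -> lip-synced video.
const fromScript = async (replicaId: string, script: string) =>
  talkingHead(replicaId, await ttsClone(replicaId, script));

// Audio in: skip TTS and drive the video model directly.
const fromAudio = (replicaId: string, audioUrl: string) =>
  talkingHead(replicaId, audioUrl);
```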

---

### 1.4 How Lip-Sync Works

**Two main patterns**

1. **Full talking-head synthesis (replica)**  
   - One model (e.g. Phoenix-style) is trained or conditioned on the **identity** (from the 2‑min video).  
   - At inference: **identity + audio** → each frame. The model outputs pixels (or a representation that gets rendered to pixels). Lip-sync is built in: the model is trained so that the mouth and face motion match the input audio. So “lip-sync” here is the model’s job: generate a face that is already saying the given audio.

2. **Lip-sync on existing video (dubbing / Hummingbird-style)**  
   - You already have a **video** of someone (unchanged body, face, expressions).  
   - You have **new audio** (e.g. a translation or a new line).  
   - A model **modifies only the mouth region** (or re-renders the mouth) so it matches the new audio. Rest of the frame stays the same. So: original video + new audio → **re-mouth** the face so it looks like they’re saying the new words.

**For an MVP clone**

- **Replica path:** “Identity (from short video or images) + audio → full talking-head video.” Lip-sync is implicit in the model.  
- **Dubbing path:** “Existing video + new audio → video with altered mouth.”  
- Tavus does both; for a bare-bones MVP you’d typically pick one (e.g. replica-style: one identity, generate videos from script or audio).
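
Seen as signatures, the two paths (names illustrative, not a real API):

```ts
type ReplicaId = string;
type Url = string;

// 1. Replica path: identity + audio -> whole video synthesized from scratch.
type GenerateTalkingHead = (identity: ReplicaId, audioUrl: Url) => Promise<Url>;

// 2. Dubbing path: existing video + new audio -> same video, mouth re-rendered.
type RelipExistingVideo = (sourceVideoUrl: Url, audioUrl: Url) => Promise<Url>;
```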

---

### 1.5 Product Flows (Recap)

| Product | What the user gets |
|--------|---------------------|
| **Programmatic video** | One “replica” (identity + voice). You send script (or audio) + maybe variables (e.g. name). Backend does TTS (if script) + talking-head synthesis. You get a **single rendered video** (e.g. MP4) per request. |
| **Conversational video (CVI)** | Same replica, but **live**: user is on a call. User speaks → STT → LLM → TTS (replica voice) → same talking-head engine renders video in real time → streamed back over WebRTC. So it’s the same “voice + face” tech, but with real-time latency and a call UX. |

Same core tech (identity model + voice clone + lip-sync/synthesis); different use case (batch vs. live).

---

### 1.6 Where the LLM Fits (Conversational Video Only)

The LLM is used **only in the conversational (CVI) product**, not in programmatic video.

**Example: user asks “How late is it?”**

1. User says it out loud on the call.
2. **STT** transcribes it → text: “How late is it?”
3. That text (plus any system prompt / persona) is sent to an **LLM**.
4. The **LLM** is the one that “answers”—e.g. it might call a tool to get the current time, or use knowledge in its context. It returns **reply text**, e.g. “It’s 3:42 PM.”
5. **TTS** turns that reply into speech in the replica’s voice.
6. The **video** model lip-syncs the replica to that audio and streams it back.

So: **the LLM is whatever model you choose to generate the replica’s replies.** It’s not a special “Tavus LLM”—it’s a standard language model that takes the user’s words and produces the next line the avatar will “say.” “Checking how late it is” is either:
- the LLM’s internal knowledge (e.g. “current date/time” in the system prompt), or  
- the LLM calling a **tool** (e.g. “get current time”) and then wording the answer.

**Which LLM does that?**

- **Tavus:** They offer low-latency LLMs in their pipeline, or let you plug in your own (e.g. OpenAI, Anthropic) so *their* backend calls the provider you chose.
- **Our clone:** We don’t train an LLM. We **use an existing one** via API or self-host:
  - **Hosted API:** OpenAI (GPT-4, etc.), Anthropic (Claude), etc. Backend receives transcript → calls OpenAI/Anthropic with system prompt (persona) + user message → gets reply text → sends to TTS.
  - **Self-hosted:** Run a small fast model (e.g. Llama, Phi) for lower latency and cost; same idea: transcript in, reply text out.
  - **Tools (optional):** If the avatar should “check the time,” “look up weather,” etc., the backend can give the LLM **tool calls** (e.g. function that returns current time); the LLM decides when to call the tool and how to phrase the answer.

**How we can do it**

- **Persona:** Store a system prompt per “persona” (e.g. “You are a helpful assistant. When asked the time, use the get_current_time tool and answer in one short sentence.”).
- **Per turn:** Send to the LLM: system prompt + conversation history (or last N messages) + latest user transcript. LLM returns reply text.
- **Then:** Reply text → TTS (replica voice) → talking-head synthesis → send audio + video back over WebRTC.

We never train this LLM; we just **call** it and use its output as the script for the next TTS + video chunk.
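
A minimal sketch of one turn, using the OpenAI Node SDK as one example provider (the model name and the `get_current_time` tool are illustrative, not part of any Tavus API):

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// One optional tool so the avatar can "check how late it is".
const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_current_time",
      description: "Returns the current time as an ISO 8601 string.",
      parameters: { type: "object", properties: {} },
    },
  },
];

// One conversational turn: transcript in, reply text out (the reply then goes to TTS).
async function replyTo(persona: string, transcript: string): Promise<string> {
  const messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[] = [
    { role: "system", content: persona },
    { role: "user", content: transcript },
  ];
  const first = await openai.chat.completions.create({ model: "gpt-4o-mini", messages, tools });
  const msg = first.choices[0].message;
  const call = msg.tool_calls?.[0];
  if (!call) return msg.content ?? "";

  // The LLM asked for the tool: run it, then let the model word the answer.
  messages.push(msg);
  messages.push({ role: "tool", tool_call_id: call.id, content: new Date().toISOString() });
  const second = await openai.chat.completions.create({ model: "gpt-4o-mini", messages });
  return second.choices[0].message.content ?? "";
}
```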

---

### 1.7 Which Video Model — Train Our Own or Use Existing?

We **do not need to train our own model from scratch** for an MVP. We use existing models and optionally fine-tune later.

**Options (no training from scratch):**

1. **Use a video API**  
   A provider (Tavus, or another synthesis/dubbing API) runs the talking-head or lip-sync model for you. You send identity (e.g. replica_id or video URL) + audio (or script); they return video. Easiest; you don’t touch the model.

2. **Use open-source models (inference only)**  
   Run a pre-trained model yourself (on your backend or a GPU service like RunPod, Replicate, etc.; see the sketch after this list):
   - **SadTalker, Wav2Lip, Diff2Lip, etc.:** input = reference face (image or short video) + audio → output = talking-head or lip-synced video. No training; you just run inference. Identity comes from the reference you pass each time (e.g. one frame or short clip from the user’s 2‑min video).
   - **Real-time (for CVI):** e.g. ARTalk, GAGAvatar—again, pre-trained; you run them with a reference face + audio stream. No training required.

3. **Optional: fine-tune an open-source model**  
   If we want the replica to look *exactly* like the 2‑min training video (better identity than a generic reference frame), we can **fine-tune** an existing model (e.g. SadTalker, or a small diffusion talking-head) on that clip. That’s “training” in the sense of adapting weights to one person, but we’re not designing or training a new architecture from scratch.
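
For option 2 via a hosted GPU service, a sketch with Replicate’s Node client (the model slug and input field names below are placeholders; each model on Replicate documents its own):

```ts
import Replicate from "replicate";

const replicate = new Replicate(); // reads REPLICATE_API_TOKEN from the environment

// Placeholder slug: substitute a real lip-sync model (community models also
// need a :version suffix) and use its documented input field names.
async function runLipSync(faceUrl: string, audioUrl: string) {
  const output = await replicate.run("owner/some-lipsync-model", {
    input: { face: faceUrl, audio: audioUrl },
  });
  return output; // typically a URL (or list of URLs) to the generated video
}
```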

**What we are not doing for MVP**

- **Training a model from scratch** (e.g. our own Phoenix-style Gaussian-diffusion model). That’s research-level and unnecessary; existing models or APIs are enough.

**Summary**

- **Programmatic video:** Use an existing **API** (Tavus or other) or run **open-source** (SadTalker, Wav2Lip, etc.) with a reference from the user’s video. No “train our own” required.
- **CVI (real-time):** Same idea: use a provider or a **real-time-capable** open-source model (e.g. ARTalk). No from-scratch training.
- **Later:** Optionally fine-tune one of these models on a specific replica’s 2‑min clip for better fidelity. Still not “our own” architecture—we’re reusing an existing one and adapting it.

---

## Part 2: MVP Bare Bones (Next.js Clone)

The minimal scope and how the pieces fit, with small sketches at the seams between frontend and backend.

### 2.1 Scope

- **In scope for MVP:**  
  - One product: **programmatic personalized video** (replica + script or audio → one video per job).  
  - Next.js frontend: upload training video, create “replica,” submit script (or audio URL), get back a video (or status + link).  
  - Optional: template with variables (e.g. `{{name}}`) so one template produces many videos.

- **Out of scope for MVP (or phase 2):**  
  - Real-time conversational video (CVI): same tech, but needs real-time pipeline (STT, LLM, TTS, low-latency synthesis) and WebRTC. Add once programmatic works.

### 2.2 Core Concepts Your App Needs

- **Replica**  
  One identity + one voice. Created from a training video (upload → backend trains or uses an API). Stored as “replica” with an ID. All generated videos use this ID so the backend knows which face/voice to use.

- **Video job**  
  “Generate one video for this replica with this script (or this audio URL).” Optional: template with variables; backend substitutes then runs the same pipeline. Result: one video file (or link) per job.

- **Pipeline (backend; sketched after this list)**
  - Resolve replica (load identity + voice).  
  - Script → TTS (cloned voice) **or** use provided audio.  
  - Audio + replica identity → talking-head / lip-sync model → video.  
  - Optional: composite on background; encode; store; return URL (or webhook).
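
A sketch of that pipeline as one backend function; `loadReplica`, `tts`, `synthesizeVideo`, and `encodeAndStore` are hypothetical wrappers around whichever services or models you actually use:

```ts
// Hypothetical wrappers; each hides a real service or model call.
declare function loadReplica(id: string): Promise<{ voiceId: string; identityRef: string }>;
declare function tts(voiceId: string, text: string): Promise<string>;            // -> audio URL
declare function synthesizeVideo(identityRef: string, audioUrl: string): Promise<Uint8Array>;
declare function encodeAndStore(video: Uint8Array): Promise<string>;             // -> result URL

interface VideoJobInput {
  replicaId: string;
  script?: string;   // either a script...
  audioUrl?: string; // ...or ready-made audio
}

async function runVideoJob(job: VideoJobInput): Promise<string> {
  const replica = await loadReplica(job.replicaId);                        // resolve identity + voice
  const audioUrl = job.audioUrl ?? (await tts(replica.voiceId, job.script!));
  const rawVideo = await synthesizeVideo(replica.identityRef, audioUrl);   // audio -> frames
  return encodeAndStore(rawVideo);                                         // encode, store, return URL
}
```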

You don’t implement the ML in Next.js; Next.js calls your backend or a third-party API that does TTS + video synthesis.

### 2.3 What Next.js Is Responsible For (MVP)

- **Replica creation flow**  
  - Upload training video (and maybe consent clip for personal replicas).  
  - Call backend/API to create replica (e.g. by URL or upload).  
  - Show status (training → ready / failed).  
  - List replicas; select one for video generation.

- **Video generation flow**  
  - Form: pick replica, enter script (or audio URL), optional template + variables.  
  - Submit job to backend/API.  
  - Poll or webhook: show status (queued → generating → ready).  
  - Show or download result (hosted URL or file).

- **Optional: templates / variables**  
  - Define a template script with placeholders.  
  - Per “recipient,” fill in variables and maybe trigger one video job per recipient (batch). Same pipeline; many jobs.
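
The template step is small; a sketch of `{{variable}}` substitution, one script per recipient:

```ts
// Minimal {{variable}} substitution; unknown placeholders are left visible.
function fillTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) => vars[key] ?? match);
}

// One video job per recipient: same template, different variables.
const template = "Hi {{name}}, thanks for trying {{product}}!";
const recipients = [
  { name: "Ada", product: "Acme" },
  { name: "Lin", product: "Acme" },
];
const scripts = recipients.map((vars) => fillTemplate(template, vars));
// -> ["Hi Ada, thanks for trying Acme!", "Hi Lin, thanks for trying Acme!"]
```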

These are the screens and flows; the actual video, voice, and lip-sync work lives in the backend/API that Next.js talks to. A sketch of that thin proxy layer follows.
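
For example, a job-submission route in the App Router can be a one-screen proxy (`BACKEND_URL` and the `/jobs` endpoint are assumptions about your own backend, not a given API):

```ts
// app/api/videos/route.ts: Next.js forwards the job and returns the status.
import { NextResponse } from "next/server";

export async function POST(req: Request) {
  const { replicaId, script, audioUrl } = await req.json();
  const res = await fetch(`${process.env.BACKEND_URL}/jobs`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ replicaId, script, audioUrl }),
  });
  const job = await res.json(); // e.g. { id, status: "queued" }
  return NextResponse.json(job, { status: res.status });
}
```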

### 2.4 Where the “Video / Voice / Lip-Sync” Live

- **Voice:** Backend (or external TTS/voice-clone API). Input: text + replica’s voice ID. Output: audio.  
- **Lip-sync / talking-head:** Backend (or external synthesis API). Input: replica identity + audio. Output: video (or stream).  
- **Next.js:** Frontend only. It sends “replica_id + script (or audio_url)” and shows status + result. It does not run TTS or video models.

So: you’re not building the deepfake/talking-head model in the HackMD or in Next.js. You’re defining the **product**: replicas, jobs, template variables, and the **one pipeline** (script/audio → TTS → synthesis → video), and letting Next.js be the UI over that.

### 2.5 Minimal Data You Need (Conceptually)

- **Replicas:** id, status (training | ready | failed), maybe type (personal vs stock). Backend stores the actual model/artifacts.
- **Videos (jobs):** id, replica_id, script or audio_url, status, result URL when ready. Optional: template_id, variable set for batch.
- **Templates (optional):** id, script_template with placeholders. Used to fill variables then create one video job per row.
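
As types, one illustrative shape for those three records (adapt fields to your storage):

```ts
type ReplicaStatus = "training" | "ready" | "failed";

interface Replica {
  id: string;
  status: ReplicaStatus;
  type?: "personal" | "stock";
}

interface VideoJob {
  id: string;
  replicaId: string;
  script?: string;    // one of script / audioUrl
  audioUrl?: string;
  status: "queued" | "generating" | "ready" | "failed";
  resultUrl?: string; // set when ready
  templateId?: string;
  variables?: Record<string, string>; // for templated batch jobs
}

interface Template {
  id: string;
  scriptTemplate: string; // contains {{placeholders}}
}
```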

### 2.6 Milestones (Bare Bones)

1. **Replica:** Upload training video from Next.js → backend creates replica → show status and list.  
2. **Single video:** From Next.js, submit one job (replica + script or audio) → backend runs TTS + synthesis → show “ready” + link.  
3. **Templates + variables:** Template with `{{name}}` etc.; form to fill variables and submit; one video per submission (or batch).  
4. **CVI (later):** Add real-time conversation (WebRTC, STT, LLM, TTS, real-time synthesis). Same voice and face tech; different UX and latency requirements.

---

## Summary

- **Video:** Synthetic talking-head generation: identity (from short training video) + audio → new video of that person saying that audio. Not classic face-swap; same broad category as deepfakes.  
- **Voice:** Cloned from same (or separate) audio; TTS turns text into that voice; that audio then drives the video.  
- **Lip-sync:** Either (1) built into the talking-head model (audio → frames with correct mouth), or (2) a separate step that re-mouths existing video to new audio.  
- **LLM (CVI only):** The avatar’s “brain.” User speaks → STT → **LLM** (any provider: OpenAI, Anthropic, or self-hosted) gets transcript + persona → reply text → TTS → video. We don’t train the LLM; we call an existing one. For “how late is it?” the LLM (or a tool it calls) provides the answer.  
- **Video model:** We don’t train our own from scratch. Use an existing **API** or **open-source** model (SadTalker, Wav2Lip, ARTalk, etc.) for inference; optionally fine-tune later for better identity.  
- **MVP:** Next.js for replica creation + video job submission + status and result; backend/API does TTS + talking-head/lip-sync and returns video. The doc stays conceptual; the sketches above are illustrative plumbing, not the ML.
