# mk-openwakeword-tests: Studio Voice Triggers Planning Doc

*A pnpm monorepo plan with Next.js UI + Node Slack/servers + Ink CLI + Python sidecar (openWakeWord) + event bus integration.*

---

## Goals

1. **Near-instant wake / command triggers** (always-on, local) using **openWakeWord**.
2. **Event-first architecture**: wake/intent detectors emit *facts* onto the bus; downstream services decide actions.
3. **Keep ML deps isolated** in a **Python sidecar**, while Node remains the studio “control plane.”
4. Support both:
   * **Live mode**: wake → immediate studio automation (lights, overlays, camera cuts, etc.)
   * **Refinement mode**: later runs can re-derive intent from transcript/audio if needed

---

## Monorepo shape

### Top-level folders

```
repo/
  apps/
    web/          # Next.js “stylish visuals”
    slack/        # Slack app (Node) + server-side studio control plane
    cli/          # Ink CLI (Node) for local ops/debug
  python/
    voice/        # openWakeWord sidecar + FastAPI tooling
    notebooks/    # R&D notebooks
    scripts/      # utilities, eval, dataset prep, etc.
  packages/
    bus/          # shared TS types + event schema helpers
    config/       # shared lint/tsconfig/eslint/prettier
    ui/           # optional shared UI components for Next.js
  infra/
    docker/       # optional
  docs/           # planning docs like this one
  pnpm-workspace.yaml
  package.json
```

---

## pnpm workspace conventions (high-level)

### Workspace grouping

* `apps/*` are runnable products
* `packages/*` are shared libraries
* `python/*` is not managed by pnpm, but lives alongside and shares schema contracts

### What “one command” should do

You want a **single dev entrypoint** that can:

* start the event bus (or connect to an existing local bus)
* start the Python wake server
* start the Slack app
* start the Next.js dashboard
* optionally start the Ink CLI in watch mode

---

## The event bus slice (what we’re adding)

### Why it belongs on the bus

Wake/keyword detection is a **Tier-0 sensor**: low latency, high signal, and it should remain decoupled from downstream logic.

### Event tiers (recommended mental model)

1. **Tier 0: Edge events** (wake/keyword detected)
2. **Tier 1: Control events** (open gate, start recording, change scene)
3. **Tier 2: Language events** (transcript segments, refined words, diarization)
4. **Tier 3: Agent events** (inferences, summaries, suggestions, QA checks)

---

## Bus implementation choice (pragmatic recommendation)

For your studio setup, pick one “always-on” local bus:

### Option A (simple, great for a studio LAN): **NATS**

* extremely good at low-latency pub/sub
* easy subject hierarchy
* good tooling, durable options if you want later

### Option B (already in many stacks): **Redis pub/sub**

* fine if you already have Redis
* simplest to adopt
* durability requires extra patterns

### Option C (durable logging): **Postgres outbox + listeners**

* best if you want “every event is persisted”
* more moving parts
* not the lowest latency

**Recommendation:** use **Redis pub/sub** if you already have it in the stack; otherwise **NATS** is a sweet spot for a “studio event bus.” (You can still persist selected events to Postgres for audit/debug later.)

---

## Canonical event schema

### Design rules

* Events are **facts**, not commands-with-side-effects
* Small payloads; put big blobs (audio) elsewhere and reference them
* Always include:
  * `type`
  * `source`
  * timestamp
  * stream identity
  * correlation IDs (session, utterance, etc.)
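To make those rules concrete, here is a minimal sketch of what the shared envelope could look like in `packages/bus/`, assuming Zod (one of the two schema options discussed under “Repo contracts” below). Field names such as `ts_ms` and `utterance_id` are illustrative placeholders, not settled API.

```ts
// packages/bus/src/events.ts — sketch only; names and fields are assumptions.
import { z } from "zod";

// Envelope shared by every event: type, source, timestamp, stream identity, correlation IDs.
export const EventEnvelope = z.object({
  type: z.string(),                  // e.g. "wake.detected"
  source: z.string(),                // emitting service, e.g. "python/voice"
  ts_ms: z.number().int(),           // wall-clock timestamp (ms)
  audio: z.object({
    stream_id: z.string(),           // which mic/room stream this came from
    t_ms: z.number().int(),          // position in the audio stream (ms)
  }),
  session_id: z.string().optional(), // correlation ID; may be created downstream
  utterance_id: z.string().optional(),
});

// Tier-0 wake event: the envelope plus its own fields.
export const WakeDetected = EventEnvelope.extend({
  type: z.literal("wake.detected"),
  keyword: z.string(),
  confidence: z.number().min(0).max(1),
});

export type WakeDetected = z.infer<typeof WakeDetected>;
```

Each concrete event type in the next section would extend the envelope the same way `WakeDetected` does, so payload changes fail fast on both the TS and Python sides.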
### Core event types for this feature

#### 1) Wake detected

* **Type:** `wake.detected`
* **Emitted by:** Python openWakeWord sidecar
* **Purpose:** “a hotword was heard”

Fields:

* `keyword` (string)
* `confidence` (0..1)
* `audio.stream_id`
* `audio.t_ms` (ms timestamp)
* `session_id?` (optional; can be created downstream)

#### 2) Command keyword detected (optional if you use only wake + transcript parsing)

* **Type:** `command.detected`
* **Emitted by:** Python sidecar (if configured for multiple phrases)
* **Purpose:** “heard one of the action words”

Fields:

* `command` (e.g. `"record" | "freeze" | "next"`)
* `confidence`
* `audio.stream_id`
* `audio.t_ms`

#### 3) Gate opened / closed

* **Type:** `asr.gate.opened` / `asr.gate.closed`
* **Emitted by:** Node control plane (Slack app server)
* **Purpose:** “begin expensive processing (Whisper) now” or “stop”

Fields:

* `gate_id`
* `mode` (e.g. `"command" | "dictation" | "conversation"`)
* `audio.stream_id`
* `t_ms`
* `reason` (e.g. `"wake.detected"`)

#### 4) Transcript segment (live)

* **Type:** `asr.segment`
* **Emitted by:** whichever ASR service you run (Python or Node)
* **Purpose:** incremental text

Fields:

* `segment_id`, `gate_id`
* `text`
* `t_start_ms`, `t_end_ms`
* `is_final` (boolean)

#### 5) Transcript words (refined)

* **Type:** `asr.words`
* **Emitted by:** refinement pass
* **Purpose:** word-level timestamps for search/UI

Fields:

* `words: [{ text, t_start_ms, t_end_ms, confidence? }]`

#### 6) Diarization turns

* **Type:** `diarization.turns`
* **Emitted by:** refinement pass
* **Purpose:** speaker segments

Fields:

* `turns: [{ speaker, t_start_ms, t_end_ms }]`

---

## Service responsibilities

### Python sidecar (`python/voice/`)

**Primary job:** listen to audio and emit *fast* events.

Recommended responsibilities:

* audio capture OR PCM receiver (choose one)
* openWakeWord inference
* debounce/cooldown
* emit `wake.detected` (+ optionally `command.detected`)
* health endpoint + basic metrics

Recommended non-responsibilities:

* do not call Slack
* do not control cameras
* do not decide what “wake” means in context

#### Two integration modes

**Mode 1: Python owns mic**

* simplest
* fastest to ship
* good if one machine owns the mic input

**Mode 2: Node owns audio; Python only infers**

* best if you already have an audio “bus” in Node
* enables multiple consumers from one capture stream (recording, meters, ASR, etc.)

Given your studio architecture, **Mode 2** is likely the “cleaner end state,” but **Mode 1** is the fastest way to validate.

---

### Node control plane (`apps/slack/`)

**Primary job:** subscribe to bus events and decide actions.

Responsibilities (see the sketch after this list):

* subscribe to `wake.detected`
* maintain “session state” (gate open/close, cooldown rules, which room/mic is active)
* publish control events (`asr.gate.opened`, `asr.gate.closed`)
* trigger studio actions (camera cuts, overlays, recorders, etc.)
* optionally post notifications to Slack (but only as a consumer of events)
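A minimal sketch of how the control plane could wire `wake.detected` to the gate events above, matching the Milestone 2 behavior described later. The `Bus` interface stands in for the publish/subscribe helpers `packages/bus` will eventually export; its shape, the 3-second gate, and the cooldown value are assumptions, not decided API.

```ts
// apps/slack/src/gate-controller.ts — sketch, not a final implementation.
interface Bus {
  publish(type: string, payload: unknown): Promise<void>;
  subscribe(type: string, handler: (payload: any) => void | Promise<void>): void;
}

interface WakeDetected {
  keyword: string;
  confidence: number;
  audio: { stream_id: string; t_ms: number };
}

const GATE_MS = 3_000;     // Milestone 2: gate auto-closes after ~3 seconds
const COOLDOWN_MS = 1_500; // ignore repeat wakes on the same stream

const lastWake = new Map<string, number>(); // stream_id -> last accepted wake (wall-clock ms)

export function startGateController(bus: Bus): void {
  bus.subscribe("wake.detected", async (evt: WakeDetected) => {
    const now = Date.now();
    if (now - (lastWake.get(evt.audio.stream_id) ?? 0) < COOLDOWN_MS) return;
    lastWake.set(evt.audio.stream_id, now);

    const gate_id = `${evt.audio.stream_id}-${now}`;

    // Tier-1 control event: tells the ASR runner to start expensive processing.
    await bus.publish("asr.gate.opened", {
      gate_id,
      mode: "command",
      audio: { stream_id: evt.audio.stream_id },
      t_ms: evt.audio.t_ms,
      reason: "wake.detected",
    });

    // Auto-close so Whisper only runs inside the short command window.
    setTimeout(() => {
      void bus.publish("asr.gate.closed", {
        gate_id,
        audio: { stream_id: evt.audio.stream_id },
        reason: "timeout",
      });
    }, GATE_MS);
  });
}
```

Keeping a cooldown here as well as in the sidecar is deliberate: the detector debounces raw hits, while the control plane enforces session-level rules (armed/disarmed state, per-room cooldowns).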
---

### Next.js dashboard (`apps/web/`)

**Primary job:** observability + operator UX.

Responsibilities:

* show live event stream (wake hits, gate status, segments)
* show “confidence meters” and false positive debugging
* show current mic/room routing status
* quick controls (manual gate open/close, toggle wake words, adjust thresholds)

This becomes your “stylish glass cockpit” for the studio.

---

### Ink CLI (`apps/cli/`)

**Primary job:** operator + developer tooling without a browser.

Responsibilities:

* tail bus events
* test-fire events (“simulate wake”)
* show last N wake hits
* enable/disable wake words quickly
* run diagnostics (ping Python service, list configured keywords, etc.)

---

## Repo contracts (how TS and Python stay aligned)

### Shared contract package: `packages/bus/`

This package defines:

* event `type` union
* JSON schema (or Zod schema) for each event
* helpers for publishing/subscribing

Python should:

* either import generated JSON schema files
* or share a `schema/` directory that both TS and Python validate against

Goal: if you change an event payload, both sides fail fast.

---

## openWakeWord specifics (what you’re building)

### Wake configuration

* Define **a small list of wake phrases** (2–8 is a good start)
* Define optional **command phrases** (5–20)
* Add:
  * `debounce_ms` (e.g. 200ms)
  * `cooldown_ms` (e.g. 1500ms)
  * `min_confidence` per keyword (tune per room)

### False positive control

Plan to tune with:

* threshold per keyword
* room profiles (quiet office vs classroom vs studio floor)
* mic type (lav vs shotgun vs desk mic)
* a “push-to-talk fallback” (manual gate open) for high-stakes moments

### Debug logging

Log every detection with:

* timestamp
* keyword
* confidence
* mic stream id
* whether it was suppressed by cooldown/debounce

(You’ll thank yourself later.)

---

## How wake words interact with transcript context

A key pattern you want:

1. **Wake event** fires instantly (`wake.detected`)
2. Node opens a short “context window” (gate open)
3. Node may:
   * run ASR for the next 1–3 seconds to parse a command, OR
   * look back into the rolling transcript buffer for parameters

Examples:

* “Ok Studio… define entropy”
  * wake triggers gate; ASR captures 1–2 seconds; command parser extracts “define” + “entropy”
* “Ok Studio… mark that”
  * wake triggers immediate `marker.add` action referencing last 10 seconds of transcript timestamps

This keeps wake detection cheap, while still letting commands be rich.

---

## Milestones

### Milestone 1: Bus + wake → visible events

* Python sidecar emits `wake.detected`
* Slack app subscribes and logs
* Next.js dashboard displays event feed
* Ink CLI tails events

**Done when:** you can reliably trigger wake phrases and see them everywhere.

### Milestone 2: Gate controller

* Wake triggers `asr.gate.opened` for 3 seconds
* A simple ASR runner emits `asr.segment` during gate-open
* Gate closes automatically

**Done when:** speaking after wake produces short transcript segments.

### Milestone 3: Command execution

* After wake, parse command phrase
* Emit `action.requested` (or call internal action APIs)
* Execute studio function (safe subset first)

**Done when:** you can do 2–3 safe commands end-to-end (e.g., “mark”, “start recording”, “next scene”).

### Milestone 4: Refinement pass

* Store audio + segment metadata
* Offline job generates `asr.words` and `diarization.turns`
* UI can “upgrade” a session from segments → words+speakers

**Done when:** the same session becomes searchable/clickable with speaker labels.
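Milestone 1 is easier to verify if the Ink CLI can test-fire a synthetic `wake.detected` (the “simulate wake” responsibility listed earlier), so the Slack app, dashboard, and CLI tail can all be exercised before the Python sidecar is wired up. A minimal sketch, assuming NATS (Option A) and that the subject name simply mirrors the event type; both are placeholders until the bus choice and subject naming are settled.

```ts
// apps/cli/src/simulate-wake.ts — "simulate wake" sketch for Milestone 1 testing.
// Assumes the `nats` npm client; swap for Redis pub/sub if that bus is chosen.
import { connect, JSONCodec } from "nats";

async function main() {
  const nc = await connect({ servers: "127.0.0.1:4222" });
  const jc = JSONCodec();

  // Publish a synthetic Tier-0 event so every subscriber can be checked end to end.
  nc.publish(
    "wake.detected",
    jc.encode({
      type: "wake.detected",
      source: "apps/cli",          // mark it clearly as a simulated hit
      keyword: "ok_studio",        // placeholder phrase
      confidence: 0.99,
      audio: { stream_id: "sim-mic", t_ms: Date.now() },
    }),
  );

  await nc.drain(); // flush and close the connection
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

Seeing this synthetic event in the Slack app log, the dashboard feed, and the CLI tail is essentially the Milestone 1 “done when.”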
---

## Safety and operator experience

### Add a “safe mode”

* Wake events still show up
* Actions require manual confirmation (dashboard/CLI) until you trust it

### Add an “armed/disarmed” state

* Disarmed: detector runs, but control plane ignores actions
* Armed: wake triggers actions
* Show state clearly in UI + Slack

### Add per-room profiles

* `room-a` thresholds differ from `room-b`
* Keep config in versioned files

---

## What I need from you (to finalize the doc into an implementation checklist)

You don’t need to answer now, but these determine defaults:

1. How many microphones/rooms are we listening to simultaneously?
2. Do you want Python to own the mic at first (fastest) or Node?
3. What are your first ~5 wake/command phrases?

If you want, I can turn this planning doc into:

* a repo-ready `docs/voice-triggers.md`
* plus a concrete “phase 1” checklist with exact package names, scripts, and event subjects you’ll use in the bus.