# o3-concept-for-live-transcripts-with-timecode

Below is a **battle-tested pattern** we use when we need “smart chunks” (≥ 60 s & ≤ 120 s, cut on silence) that keep their original SMPTE time-code, so Whisper’s word-level stamps line up with the camera master.

---

## 0️⃣ High-level flow

```
HyperDeck → growing MOV
     ▼  (fs.watch / inotifywait)
Node daemon ─╴spawns ffmpeg
     ├─┬─ pulls 1–2 min WAVs cut at silence
     │ │
     │ └→ stores {path, tcStart}
     └──▶ OpenAI Whisper
              ↳ word-level JSON ⤏ add tcStart ⤏ DB / SRT / captions
```

---

## 1️⃣ Cutting *variable-length* chunks with FFmpeg

FFmpeg’s segment muxer + `silencedetect` is a two-pass trick:

```bash
# pass 1 – emit silence events
ffmpeg -i cam1.mov -af silencedetect=noise=-35dB:d=0.35 -f null - \
       2> silence.log
```

Parse the `silence_start` / `silence_end` lines; whenever the elapsed time since the current *chunkStart* exceeds 60 s and a silence ≥ 350 ms shows up, **or** the 120 s hard cap is reached, append that timestamp to `cutPoints[]`.

```bash
# pass 2 – one ffmpeg does all the slicing
cuts=$(IFS=','; echo "${cutPoints[*]}")   # -segment_times wants a comma-separated list
ffmpeg -i cam1.mov -f segment \
       -segment_times "$cuts" \
       -map 0:a -c pcm_s16le wav_out/%05d.wav
```

*Why two passes?* The segment muxer can’t listen for silence in real time; feeding it explicit cut times is rock-solid and keeps every chunk inside 1–2 minutes. (Stack Overflow discussion of the same idea with `silencedetect` events ([stackoverflow.com][1]).)

---

### One-liner for quick tests

```bash
ffmpeg -i cam1.mov \
       -af "silenceremove=start_periods=1:start_threshold=-35dB:start_duration=0.3,\
silenceremove=stop_periods=-1:stop_threshold=-35dB:stop_duration=0.3" \
       -f segment -segment_time 120 -segment_min_duration 60 \
       -c pcm_s16le wav_out/%05d.wav
```

`silenceremove` trims the *inside* silence so the segmenter will cut only when speech has paused; if no pause occurs, the 120-second ceiling forces a split. (`silenceremove` options are discussed in the same thread ([stackoverflow.com][1]).)

---

## 2️⃣ Node.js daemon skeleton

```ts
import fs from "fs";
import { spawn } from "node:child_process";
import OpenAI from "openai";
import SMPTE from "smpte-timecode";          // tiny helper lib

const openai = new OpenAI({ apiKey: process.env.OPENAI_KEY });
const TC_FPS = 30;                           // or 29.97 DF, etc.

function secondsToTc(sec: number, startTc: string) {
  const base = new SMPTE(startTc, TC_FPS);
  base.add(Math.round(sec * TC_FPS));        // add() counts frames, so convert seconds first
  return base.toString();
}

// 1. watch output dir for freshly-closed WAVs
fs.watch("wav_out", (_, file) => processChunk(`wav_out/${file}`));

async function processChunk(path: string) {
  /* -------- keep absolute start tc in a sidecar JSON ---------- */
  const match = /(\d{5})\.wav$/.exec(path);  // matches ffmpeg's %05d.wav output names
  const index = Number(match![1]);
  const tcStart = secondsToTc(indexOffset[index], hyperdeckStartTc);

  /* -------- Whisper call with word-level stamps --------------- */
  const res = await openai.audio.transcriptions.create({
    file: fs.createReadStream(path),
    model: "whisper-1",
    response_format: "verbose_json",
    timestamp_granularities: ["word"],       // 👈 word stamps
  });

  /* -------- re-base word timestamps --------------------------- */
  // verbose_json + word granularity returns a flat `words` array
  const words: any[] = (res as any).words ?? [];
  words.forEach(w => {
    w.start += indexOffset[index];
    w.end   += indexOffset[index];
    w.tc = secondsToTc(w.start, hyperdeckStartTc);   // handy
  });

  await saveToDB(res, tcStart);              // or emit WebSocket → overlay captions
}
```

`indexOffset[]` is the cumulative second-offset **you** keep when you emit each chunk; `hyperdeckStartTc` comes from `ffprobe -show_entries format_tags=timecode` on the original MOV. (A sketch of both follows below.)
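If it helps to see the glue the daemon leans on, here is a minimal sketch of both pieces. `buildCutPoints`, `readStartTc`, and `indexOffsets` are illustrative names, not an existing library; the parsing assumes pass 1’s `silence.log` contains the usual `silence_end: <seconds>` lines that `silencedetect` writes.

```ts
import fs from "node:fs";
import { execFileSync } from "node:child_process";

const MIN_CHUNK = 60;   // don't cut before 60 s of speech
const MAX_CHUNK = 120;  // hard ceiling

// Turn pass 1's silence.log into the split points pass 2 needs.
export function buildCutPoints(logPath: string, totalDur: number): number[] {
  const pauses = [...fs.readFileSync(logPath, "utf8")
    .matchAll(/silence_end:\s*([\d.]+)/g)].map(m => Number(m[1]));

  const cuts: number[] = [];
  let chunkStart = 0;
  for (const pause of pauses) {
    while (pause - chunkStart > MAX_CHUNK) {   // no usable pause in time → force the 120 s cap
      chunkStart += MAX_CHUNK;
      cuts.push(chunkStart);
    }
    if (pause - chunkStart >= MIN_CHUNK) {     // ≥ 60 s elapsed and speech just paused → cut here
      cuts.push(pause);
      chunkStart = pause;
    }
  }
  while (totalDur - chunkStart > MAX_CHUNK) {  // tail of the recording
    chunkStart += MAX_CHUNK;
    cuts.push(chunkStart);
  }
  return cuts;
}

// Deck start TC straight from the MOV's container metadata.
export function readStartTc(movPath: string): string {
  return execFileSync("ffprobe", [
    "-v", "quiet",
    "-show_entries", "format_tags=timecode",
    "-of", "default=noprint_wrappers=1:nokey=1",
    movPath,
  ]).toString().trim();                        // e.g. "10:23:45:12"
}

// Chunk 0 starts at 0; chunk N starts at the (N-1)-th cut point.
export const indexOffsets = (cuts: number[]) => [0, ...cuts];
```

Usage would be `indexOffset = indexOffsets(cutPoints)` and `hyperdeckStartTc = readStartTc("cam1.mov")`, computed once before the watcher starts.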
*Tip:* a complete open-source Python reference that does the same offset math (chunks + Whisper + merge) is in Matija Žiberna’s tutorial ([buildwithmatija.com][2]).

---

## 3️⃣ Keeping, converting and exporting time-code

| Step                                | Method                                                                                 |
| ----------------------------------- | -------------------------------------------------------------------------------------- |
| **Get deck start TC**               | `ffprobe -v quiet -show_entries format_tags=timecode -of default=noprint_wrappers=1`   |
| **Store chunk start offset**        | push `secondsFromStart` into `indexOffset[]` when each file closes                     |
| **Convert Whisper seconds → SMPTE** | tiny helper shown above (`smpte-timecode`, `tc-string`, etc.)                          |
| **Export captions**                 | build `WebVTT` or `SRT` from the re-based word stamps (starter sketch at the end of this note) |

---

## 4️⃣ Why this beats `record spill`

| Metric         | Record Spill             | FFmpeg chunker          |
| -------------- | ------------------------ | ----------------------- |
| HyperDeck wear | pauses deck every minute | continuous recording    |
| File chaos     | thousands of .mov        | tidy WAV dir            |
| Latency        | depends on poller        | ~1 s after chunk close  |
| TC accuracy    | perfect                  | perfect (offset math)   |

If you later *also* want picture replays, just link the final JSON back to Resolve or your HTML overlay; the time-codes will match the original ISO frame-perfectly.

---

### Next steps

* Drop-in `fluent-ffmpeg` wrapper if you prefer pure JS over spawning.
* Add **on-the-fly** silence detection via `arecord | ffmpeg -i - …` to avoid reading from disk at all.
* Wrap the whole thing in a PM2 service and expose a minimal REST “/chunks” endpoint so other studio apps can subscribe.

Let me know which piece you’d like fleshed out (e.g., WebVTT exporter, Companion macro to start/stop the pipeline, etc.) and I’ll dig deeper!

[1]: https://stackoverflow.com/questions/25697596/using-ffmpeg-with-silencedetect-to-remove-audio-silence "using FFMPEG with silencedetect to remove audio silence - Stack Overflow"
[2]: https://www.buildwithmatija.com/blog/building-a-long-audio-transcription-tool-with-openai-s-whisper-api "Building a Long Audio Transcription Tool with OpenAI’s Whisper API | Build with Matija"
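As a starting point for the WebVTT exporter mentioned above, here is a minimal sketch. It assumes each word object already carries the re-based `start`/`end` seconds from section 2; `wordsToVtt` and the seven-words-per-cue grouping are illustrative choices, not part of any existing API.

```ts
// Format seconds as a WebVTT timestamp: HH:MM:SS.mmm
function vttTime(sec: number): string {
  const h = Math.floor(sec / 3600);
  const m = Math.floor((sec % 3600) / 60);
  const s = (sec % 60).toFixed(3).padStart(6, "0");
  return `${String(h).padStart(2, "0")}:${String(m).padStart(2, "0")}:${s}`;
}

interface Word { word: string; start: number; end: number }

// Group the re-based words into short cues and emit a WebVTT document.
export function wordsToVtt(words: Word[], wordsPerCue = 7): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const slice = words.slice(i, i + wordsPerCue);
    cues.push(
      `${vttTime(slice[0].start)} --> ${vttTime(slice[slice.length - 1].end)}\n` +
      slice.map(w => w.word).join(" ")
    );
  }
  return "WEBVTT\n\n" + cues.join("\n\n") + "\n";
}
```

Because the words were already shifted by `indexOffset[index]`, the resulting cues land on the camera master exactly where the SMPTE stamps say they should.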