# o3-on-realtime-api-for-switching
Below is a practical “wiring-diagram” for turning spoken studio requests into ATEM switcher actions with the **OpenAI Realtime API’s function-calling** feature. I’ve broken it into the three layers you’ll actually have to ship (audio I/O, Realtime session, and ATEM control), then included an end-to-end code sketch so you can see how the pieces click together.
---
## 1 Capture & stream audio
**Hardware / OS layer**
* Mic or room Dante feed → small Linux box / Mac mini sitting on the production network
* Capture 24-kHz mono 16-bit PCM — the Realtime API's default `pcm16` input format — with `arecord`, `sox`, `ffmpeg` (`-f s16le -ac 1 -ar 24000`), or Node’s `mic` package, and push the bytes straight into a WebSocket.
Because this is a headless, server-to-server flow you can authenticate the WebSocket with a normal secret key; no need for the short-lived “ephemeral” keys that browsers require. ([OpenAI Platform][1], [OpenAI Platform][2])
```mermaid
graph TD
MIC -->|raw PCM| WSClient(Node)
WSClient -->|audio frames| OpenAIRealtime
```
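To make the capture step concrete, here's a minimal Node sketch of that pipeline. It assumes `ffmpeg` with an ALSA `default` capture device, and the helper names are mine; raw audio travels up to the Realtime API as base64 inside `input_audio_buffer.append` events:

```typescript
import { spawn } from "node:child_process";

// Wrap one PCM chunk in the JSON event the Realtime API expects:
// raw audio is sent as base64 inside input_audio_buffer.append.
export function pcmChunkToAppendEvent(chunk: Buffer): string {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: chunk.toString("base64"),
  });
}

// Spawn ffmpeg capturing 24 kHz mono s16le PCM (the API's default
// pcm16 input format) and forward each chunk over the open socket.
// The socket is typed structurally so any ws-like object works.
export function streamMicToRealtime(ws: { send(data: string): void }): void {
  const ffmpeg = spawn("ffmpeg", [
    "-f", "alsa", "-i", "default", // ALSA capture device; adjust per box
    "-ac", "1", "-ar", "24000",
    "-f", "s16le", "pipe:1",
  ]);
  ffmpeg.stdout.on("data", (chunk: Buffer) => {
    ws.send(pcmChunkToAppendEvent(chunk));
  });
}
```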
---
## 2 Define the voice interface in the Realtime session
### a Open the socket
```ts
import WebSocket from "ws";

const url =
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17";
const ws = new WebSocket(url, {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_KEY}`,
    "OpenAI-Beta": "realtime=v1",
  },
});
```
### b Tell GPT-4o what **tools** it can call
Send a `session.update` event as your very first JSON message on the socket (before you stream any audio). Note that the instructions and tools live inside the `session` object, and each tool entry needs `"type": "function"`:
```jsonc
{
  "type": "session.update",
  "session": {
    "instructions": "You are Studio Ops. When the user asks to change cameras, call the appropriate tool. Never answer with text; just call the tool.",
    "tools": [
      {
        "type": "function",
        "name": "switch_camera",
        "description": "Switches the Blackmagic ATEM to a given input on a specified ME row",
        "parameters": {
          "type": "object",
          "properties": {
            "me": { "type": "integer", "description": "Mix-Effect row (1-4)" },
            "input": { "type": "integer", "description": "Input number" },
            "transition": {
              "type": "string",
              "enum": ["cut", "auto", "fade"],
              "description": "Style of transition"
            }
          },
          "required": ["me", "input"]
        }
      }
    ],
    "tool_choice": "auto"
  }
}
```
### c Stream audio & watch for tool calls
As soon as the operator says **“Camera 2 on ME 3”**, the model streams function-call events; once the arguments finish streaming you get a `response.function_call_arguments.done` event that looks roughly like this (note that `arguments` arrives as a JSON-encoded *string*, not an object):
```json
{
  "type": "response.function_call_arguments.done",
  "call_id": "abc-123",
  "name": "switch_camera",
  "arguments": "{\"me\": 3, \"input\": 2, \"transition\": \"cut\"}"
}
```
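Since those arguments are delivered as a JSON-encoded string, it's worth parsing and sanity-checking them before anything touches hardware. A small sketch — the validator name and range checks are mine; the field names follow the `switch_camera` schema declared earlier:

```typescript
interface SwitchCameraArgs {
  me: number;                           // Mix-Effect row, 1-4
  input: number;                        // switcher input number
  transition?: "cut" | "auto" | "fade"; // optional; treat missing as cut
}

// Parse the raw `arguments` string and reject out-of-range values
// before they ever reach the switcher.
export function parseSwitchCameraArgs(raw: string): SwitchCameraArgs {
  const args = JSON.parse(raw) as SwitchCameraArgs;
  if (!Number.isInteger(args.me) || args.me < 1 || args.me > 4) {
    throw new Error(`invalid ME row: ${args.me}`);
  }
  if (!Number.isInteger(args.input) || args.input < 1) {
    throw new Error(`invalid input: ${args.input}`);
  }
  return args;
}
```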
You now have a structured command you can hand straight to your switcher code. ([OpenAI Community][3], [YouTube][4])
---
## 3 Trigger the ATEM switcher
### a Node + `atem-connection`
```ts
import { Atem } from "atem-connection";

const atem = new Atem();
atem.connect("192.168.10.240"); // switcher IP
atem.on("connected", () => console.log("ATEM ready"));

function handleSwitchCamera({ me, input, transition }:
    { me: number; input: number; transition?: string }) {
  const meIndex = me - 1; // atem-connection uses 0-based ME indices
  if (transition === "auto" || transition === "fade") {
    atem.changePreviewInput(input, meIndex); // stage the source on preview
    atem.autoTransition(meIndex);            // then run the auto transition
  } else {
    atem.changeProgramInput(input, meIndex); // hard cut straight to program
  }
}
```
On a LAN the ATEM protocol is effectively instantaneous; typical command latency is under 15 ms, well within the Realtime API’s audio round-trip, so the cut feels immediate. ([npm][5])
### b Wire the pieces together
```ts
ws.on("message", (msg) => {
  const evt = JSON.parse(msg.toString());
  if (evt.type === "response.function_call_arguments.done" &&
      evt.name === "switch_camera") {
    handleSwitchCamera(JSON.parse(evt.arguments)); // arguments is a JSON string
    // Send the result back so GPT-4o can confirm aloud on its next response:
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: evt.call_id,
        output: "Camera switched.",
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});
```
---
## 4 Operational tips & hard-won lessons
| Topic | Recommendation |
| ------------------ | ------------------------------------------------------------------------------------------------------------ |
| **Wake-word** | Put a “studio hot-word” (“*Stage Manager*…”) in your system prompt so ambient chatter doesn’t fire the tool. |
| **Safety net** | Allow GPT-4o to *suggest* a function but require a quick regex (`camera\|cam`) match before you execute. |
| **Latency budget** | Realtime audio adds roughly 300 ms end-to-end; a local ATEM cut adds under 20 ms, so the combo feels snappy. |
| **Roll-back** | Map a `switch_camera_undo` macro in case the operator blurts the wrong number. |
| **Monitoring** | Log every tool invocation and echo it to the program-out confidence monitor so the TD sees confirmations. |
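The wake-word and safety-net rows can be combined into one tiny guard. A minimal sketch — the function name, hot-word, and regexes are illustrative, and `transcript` would come from the Realtime API's input-transcription events:

```typescript
// Only let a tool call through when the operator's transcript both
// contains the studio hot-word and actually mentions a camera.
export function shouldExecute(transcript: string): boolean {
  const hotWord = /\bstage manager\b/i; // studio wake-word
  const cameraRef = /\bcam(era)?\b/i;   // quick "camera|cam" sanity match
  return hotWord.test(transcript) && cameraRef.test(transcript);
}
```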
---
### TL;DR
Yes: **Realtime API + function calling** gives you Alexa-style “intents” but with GPT-4o-level understanding and sub-second audio round-trips. Declare a `switch_camera` tool, stream studio audio to GPT-4o, and route each tool event to `atem-connection`. You get hands-free, natural-language control of your ATEM with only a few hundred lines of Node. Have fun cutting the show!
[1]: https://platform.openai.com/docs/guides/voice-agents?utm_source=chatgpt.com "Voice agents - OpenAI API"
[2]: https://platform.openai.com/docs/guides/function-calling?api-mode=responses&utm_source=chatgpt.com "Function calling - OpenAI Platform"
[3]: https://community.openai.com/t/how-to-get-the-function-name-in-realtime-api-for-a-function-call/977544?utm_source=chatgpt.com "How to get the function name in realtime api for a function call? - API"
[4]: https://www.youtube.com/watch?v=69Ul5GQe1Q0&utm_source=chatgpt.com "Function Calling in the OpenAI Realtime API - YouTube"
[5]: https://www.npmjs.com/package/%40atem-connection/camera-control?utm_source=chatgpt.com "atem-connection/camera-control - NPM"