# o3-on-realtime-api-for-switching
Below is a practical “wiring-diagram” for turning spoken studio requests into ATEM switcher actions with the **OpenAI Realtime API’s function-calling** feature. I’ve broken it into the three layers you’ll actually have to ship (audio I/O, Realtime session, and ATEM control), then included an end-to-end code sketch so you can see how the pieces click together.
---
## 1 Capture & stream audio
**Hardware / OS layer**
* Mic or room Dante feed → small Linux box / Mac mini sitting on the production network
* Capture 24-kHz mono 16-bit PCM — the Realtime API's default `pcm16` input format — with `arecord`, `sox`, `ffmpeg` (`-f s16le -ac 1 -ar 24000`), or Node’s `mic` package, and push the bytes straight into a WebSocket.
Because this is a headless, server-to-server flow you can authenticate the WebSocket with a normal secret key; no need for the short-lived “ephemeral” keys that browsers require. ([OpenAI Platform][1], [OpenAI Platform][2])
```mermaid
graph TD
MIC -->|raw PCM| WSClient(Node)
WSClient -->|audio frames| OpenAIRealtime
```
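To make the capture step concrete, here's a minimal Node sketch of that pipeline. It assumes `ffmpeg` with an ALSA `default` capture device, and the helper names are mine; raw audio travels up to the Realtime API as base64 inside `input_audio_buffer.append` events:

```typescript
import { spawn } from "node:child_process";

// Wrap one PCM chunk in the JSON event the Realtime API expects:
// raw audio is sent as base64 inside input_audio_buffer.append.
export function pcmChunkToAppendEvent(chunk: Buffer): string {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: chunk.toString("base64"),
  });
}

// Spawn ffmpeg capturing 24 kHz mono s16le PCM (the API's default
// pcm16 input format) and forward each chunk over the open socket.
// The socket is typed structurally so any ws-like object works.
export function streamMicToRealtime(ws: { send(data: string): void }): void {
  const ffmpeg = spawn("ffmpeg", [
    "-f", "alsa", "-i", "default", // ALSA capture device; adjust per box
    "-ac", "1", "-ar", "24000",
    "-f", "s16le", "pipe:1",
  ]);
  ffmpeg.stdout.on("data", (chunk: Buffer) => {
    ws.send(pcmChunkToAppendEvent(chunk));
  });
}
```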
---
## 2 Define the voice interface in the Realtime session
### a Open the socket
```ts
import WebSocket from "ws";

const url =
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17";
const ws = new WebSocket(url, {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_KEY}`,
    "OpenAI-Beta": "realtime=v1",
  },
});
```
### b Tell GPT-4o what **tools** it can call
Send a `session.update` event as your very first JSON message on the socket (before you stream any audio). Note that the instructions and tools live inside the `session` object, and each tool entry needs `"type": "function"`:
```jsonc
{
  "type": "session.update",
  "session": {
    "instructions": "You are Studio Ops. When the user asks to change cameras, call the appropriate tool. Never answer with text; just call the tool.",
    "tools": [
      {
        "type": "function",
        "name": "switch_camera",
        "description": "Switches the Blackmagic ATEM to a given input on a specified ME row",
        "parameters": {
          "type": "object",
          "properties": {
            "me": { "type": "integer", "description": "Mix-Effect row (1-4)" },
            "input": { "type": "integer", "description": "Input number" },
            "transition": {
              "type": "string",
              "enum": ["cut", "auto", "fade"],
              "description": "Style of transition"
            }
          },
          "required": ["me", "input"]
        }
      }
    ],
    "tool_choice": "auto"
  }
}
```
### c Stream audio & watch for tool calls
As soon as the operator says **“Camera 2 on ME 3”**, the model streams function-call events; once the arguments finish streaming you get a `response.function_call_arguments.done` event that looks roughly like this (note that `arguments` arrives as a JSON-encoded *string*, not an object):
```json
{
  "type": "response.function_call_arguments.done",
  "call_id": "abc-123",
  "name": "switch_camera",
  "arguments": "{\"me\": 3, \"input\": 2, \"transition\": \"cut\"}"
}
```
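Since those arguments are delivered as a JSON-encoded string, it's worth parsing and sanity-checking them before anything touches hardware. A small sketch — the validator name and range checks are mine; the field names follow the `switch_camera` schema declared earlier:

```typescript
interface SwitchCameraArgs {
  me: number;                           // Mix-Effect row, 1-4
  input: number;                        // switcher input number
  transition?: "cut" | "auto" | "fade"; // optional; treat missing as cut
}

// Parse the raw `arguments` string and reject out-of-range values
// before they ever reach the switcher.
export function parseSwitchCameraArgs(raw: string): SwitchCameraArgs {
  const args = JSON.parse(raw) as SwitchCameraArgs;
  if (!Number.isInteger(args.me) || args.me < 1 || args.me > 4) {
    throw new Error(`invalid ME row: ${args.me}`);
  }
  if (!Number.isInteger(args.input) || args.input < 1) {
    throw new Error(`invalid input: ${args.input}`);
  }
  return args;
}
```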
You now have a structured command you can hand straight to your switcher code. ([OpenAI Community][3], [YouTube][4])
---
## 3 Trigger the ATEM switcher
### a Node + `atem-connection`
```ts
import { Atem } from "atem-connection";

const atem = new Atem();
atem.connect("192.168.10.240"); // switcher IP
atem.on("connected", () => console.log("ATEM ready"));

function handleSwitchCamera({ me, input, transition }:
    { me: number; input: number; transition?: string }) {
  const meIndex = me - 1; // atem-connection uses 0-based ME indices
  if (transition === "auto" || transition === "fade") {
    atem.changePreviewInput(input, meIndex); // stage the source on preview
    atem.autoTransition(meIndex);            // then run the auto transition
  } else {
    atem.changeProgramInput(input, meIndex); // hard cut straight to program
  }
}
```
On a LAN the ATEM protocol is effectively instantaneous; typical command latency is under 15 ms, well within the Realtime API’s audio round-trip, so the cut feels immediate. ([npm][5])
### b Wire the pieces together
```ts
ws.on("message", (msg) => {
  const evt = JSON.parse(msg.toString());
  if (evt.type === "response.function_call_arguments.done" &&
      evt.name === "switch_camera") {
    handleSwitchCamera(JSON.parse(evt.arguments)); // arguments is a JSON string
    // Send the result back so GPT-4o can confirm aloud on its next response:
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: evt.call_id,
        output: "Camera switched.",
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});
```
---
## 4 Operational tips & hard-won lessons
| Topic | Recommendation |
| ------------------ | ------------------------------------------------------------------------------------------------------------ |
| **Wake-word** | Put a “studio hot-word” (“*Stage Manager*…”) in your system prompt so ambient chatter doesn’t fire the tool. |
| **Safety net** | Allow GPT-4o to *suggest* a function but require a quick regex (`camera\|cam`) match before you execute. |
| **Latency budget** | Realtime audio adds roughly 300 ms end-to-end; a local ATEM cut adds under 20 ms, so the combo feels snappy. |
| **Roll-back** | Map a `switch_camera_undo` macro in case the operator blurts the wrong number. |
| **Monitoring** | Log every tool invocation and echo it to the program-out confidence monitor so the TD sees confirmations. |
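The wake-word and safety-net rows can be combined into one tiny guard. A minimal sketch — the function name, hot-word, and regexes are illustrative, and `transcript` would come from the Realtime API's input-transcription events:

```typescript
// Only let a tool call through when the operator's transcript both
// contains the studio hot-word and actually mentions a camera.
export function shouldExecute(transcript: string): boolean {
  const hotWord = /\bstage manager\b/i; // studio wake-word
  const cameraRef = /\bcam(era)?\b/i;   // quick "camera|cam" sanity match
  return hotWord.test(transcript) && cameraRef.test(transcript);
}
```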
---
### TL;DR
Yes: **Realtime API + function calling** gives you Alexa-style “intents” but with GPT-4o-level understanding and sub-second audio round-trips. Declare a `switch_camera` tool, stream studio audio to GPT-4o, and route each tool event to `atem-connection`. You get hands-free, natural-language control of your ATEM with only a few hundred lines of Node. Have fun cutting the show!
[1]: https://platform.openai.com/docs/guides/voice-agents?utm_source=chatgpt.com "Voice agents - OpenAI API"
[2]: https://platform.openai.com/docs/guides/function-calling?api-mode=responses&utm_source=chatgpt.com "Function calling - OpenAI Platform"
[3]: https://community.openai.com/t/how-to-get-the-function-name-in-realtime-api-for-a-function-call/977544?utm_source=chatgpt.com "How to get the function name in realtime api for a function call? - API"
[4]: https://www.youtube.com/watch?v=69Ul5GQe1Q0&utm_source=chatgpt.com "Function Calling in the OpenAI Realtime API - YouTube"
[5]: https://www.npmjs.com/package/%40atem-connection/camera-control?utm_source=chatgpt.com "atem-connection/camera-control - NPM"