Got it! I will research how to configure the Unitree Go2 EDU version to enable speech interaction at a conference, using OpenAI's API for answering questions. This will include:
- Setting up speech input and output for real-time interaction.
- Connecting the robot to OpenAI's API for question-answering.
- Implementing control mechanisms to stop the robot, make it follow you, or perform other actions.
- Providing step-by-step instructions with exact code examples to ensure easy implementation.
I'll get back to you with clear instructions soon!
# Configuring Unitree Go2 EDU for Real-Time Voice Interaction with OpenAI
Integrating OpenAI's API with the Unitree Go2 EDU robot enables it to **listen to questions/commands, generate answers via ChatGPT, speak responses, and follow manual commands**. In fact, the Unitree Go2 is capable of voice interactions – it can be set up to answer questions (using ChatGPT) and execute voice commands ([Unitree Robotics Archives - InDro Robotics](https://indrorobotics.ca/category/unitree-robotics/#:~:text=One%20of%20the%20more%20intriguing,out%20certain%20tasks%20by%20voice)). Below, we provide a step-by-step guide covering the hardware setup, software configuration, OpenAI API integration, and implementing manual controls. The instructions are detailed and beginner-friendly, designed for a practical conference demo scenario.
## Hardware and Software Requirements
Before starting, ensure you have the following hardware and software ready:
- **Unitree Go2 EDU robot** – The EDU version supports custom programming (via Unitree’s SDK or DDS interface) and includes voice hardware (microphone array and speaker) in its head unit ([Unitree Robotics Archives - InDro Robotics](https://indrorobotics.ca/category/unitree-robotics/#:~:text=One%20of%20the%20more%20intriguing,out%20certain%20tasks%20by%20voice)). It also supports an onboard computer (e.g., NVIDIA Jetson Orin Nano) for AI processing ([Unitree Robotics Archives - InDro Robotics](https://indrorobotics.ca/category/unitree-robotics/#:~:text=Because%20of%20the%20LiDAR%2C%20it%E2%80%99s,enhanced%20EDGE%20computing)).
- **Microphone** – A good-quality microphone to capture audience questions. If the Go2’s built-in mic is accessible, you can use that; otherwise, use an external USB microphone (connected to the robot’s onboard computer or a nearby laptop). A directional mic or noise-canceling mic is ideal in a noisy conference environment.
- **Speaker** – A speaker for the robot’s voice output. The Go2 Pro/EDU has a built-in speaker in the head (used for its voice features) ([Unitree Robotics Archives - InDro Robotics](https://indrorobotics.ca/category/unitree-robotics/#:~:text=One%20of%20the%20more%20intriguing,out%20certain%20tasks%20by%20voice)). If using an external computer or if louder sound is needed, attach a portable speaker (via 3.5mm jack or Bluetooth) so the audience can hear the responses.
- **Controlling computer** – Either the robot’s onboard computing module (e.g. Jetson Orin on the EDU) or an external laptop. This will run the Python code for speech recognition, OpenAI API calls, and sending commands to the robot. Ensure this computer has Python 3 installed.
- **Internet connectivity (Wi-Fi)** – A reliable Wi-Fi network with internet access is **essential** for OpenAI API calls. Connect the Go2 robot to Wi-Fi and ensure the controlling computer is on the same network. (The Go2’s voice command feature itself requires Wi-Fi internet ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=Note)).) In a conference, use a portable Wi-Fi router or venue Wi-Fi for seamless operation.
- **OpenAI API access** – An OpenAI account with an API key to use OpenAI’s GPT models. Sign up at OpenAI’s website and obtain a secret API key.
- **Unitree SDK / control library** – Software to send motion commands to the Go2. Unitree provides an SDK (C++ and a Python interface) for the Go2 EDU, which uses DDS (Data Distribution Service) for communication ([GitHub - legion1581/go2_python_sdk: Unofficial Unitree Go2 Python SDK (DDS/WebRTC)](https://github.com/legion1581/go2_python_sdk#:~:text=CycloneDDS%20Driver)). You can use the official **`unitree_sdk2_python`** package (if available) or an unofficial Python SDK ([GitHub - legion1581/go2_python_sdk: Unofficial Unitree Go2 Python SDK (DDS/WebRTC)](https://github.com/legion1581/go2_python_sdk#:~:text=This%20is%20an%20unofficial%20Python,WebRTC%20is%20not%20yet%20implemented)). Alternatively, you can interface via ROS2 if you have it set up. In this guide, we assume a Python-based control interface.
- **Python libraries** – Install the following Python packages for voice and AI integration:
- `openai` – to call OpenAI’s API (ChatGPT).
- Speech recognition library: either `speechrecognition` (with `pyaudio`) or OpenAI’s Whisper. For simplicity, we’ll use the SpeechRecognition package with an online recognizer, but you can also use Whisper for more accuracy.
- `pyttsx3` (or another TTS library) – for text-to-speech output so the robot can speak.
- Any Unitree control library (as mentioned above) for sending commands to the robot.
**Note:** Make sure to charge the robot and have the **handheld controller** ready. The controller is useful as a manual override – e.g., you can always press the **“kill”/ESTOP** button or use the joystick to take control if needed for safety. (The Go2’s remote can switch modes or stop the robot instantly if something goes wrong.)
## Step 1: Connect the Robot to Wi-Fi
For internet access and remote control, connect the Unitree Go2 to a Wi-Fi network:
1. **Power on the Go2** and connect via the Unitree app on your smartphone (or via a web interface). In the app, navigate to **Connection Settings**. You will see two modes: **AP (Access Point) mode** and **Wi-Fi Client mode** ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=you%20will%20have%20two%20connection,one%20that%20suits%20your%20needs)). Choose **Wi-Fi mode** to connect the robot as a client to an existing router.
2. **Select your Wi-Fi network** in the app, enter the password, and connect ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=%2A%20If%20you%20choose%20Wi,Go2%20and%20the%20app%20are)). Once the Go2 joins your Wi-Fi, the app should indicate a successful connection.
3. **Verify internet access**: The Go2’s voice functions (and our ChatGPT integration) need internet ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=Note)). The robot should now be online. Your controlling computer should also connect to the **same Wi-Fi network** so it can communicate with the robot (e.g., via IP or DDS) and reach the internet.
4. (If direct Wi-Fi client mode is not possible, an alternative is to use the robot’s default hotspot **AP mode** for control and use a **second network adapter** on your PC for internet ([DigitalCommons@Kennesaw State University - Symposium of Student Scholars: The Voice-To-Text Implementation with ChatGPT in Unitree Go1 Programming.](https://digitalcommons.kennesaw.edu/undergradsymposiumksu/spring2024/spring2024/227/#:~:text=facilitate%20communication%20between%20the%20ChatGPT,Research%20in%20this%20technology%20holds)). However, using client mode on a single network is simpler for seamless operation.)
**Tip:** In a conference setting, test the Wi-Fi connection in advance. If the venue Wi-Fi is unreliable, consider a dedicated router or hotspot. Also ensure the network has no strict firewall blocking the OpenAI API endpoints.
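As a quick pre-event sanity check, a small sketch like the following can confirm both that the robot is reachable on the LAN and that the OpenAI endpoint is reachable from the controlling computer (the robot IP and the SSH port are assumptions – use the address shown in the Unitree app):
```python
# Pre-demo connectivity check: robot reachable on the LAN, OpenAI reachable over the internet.
import socket
import urllib.error
import urllib.request

ROBOT_IP = "192.168.1.120"   # hypothetical - use the IP reported by the Unitree app

# Can we reach the robot on the local network? (EDU units typically expose SSH on port 22.)
try:
    socket.create_connection((ROBOT_IP, 22), timeout=3).close()
    print("Robot reachable on the local network.")
except OSError as exc:
    print("Could not reach the robot:", exc)

# Can we reach the OpenAI API from this machine?
try:
    urllib.request.urlopen("https://api.openai.com/v1/models", timeout=5)
    print("OpenAI endpoint reachable.")
except urllib.error.HTTPError:
    print("OpenAI endpoint reachable (authentication required, as expected).")
except Exception as exc:
    print("Could not reach the OpenAI endpoint:", exc)
```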
## Step 2: Set Up Audio Input and Output
Next, prepare the microphone and speaker on the robot (or on your control system) for speech I/O:
- **Microphone Setup:** Connect the USB microphone to the controlling computer (e.g., plug it into the Jetson Orin or your laptop). On Linux, verify the system recognizes the mic (e.g., with `arecord -l` or in sound settings). On Windows, check the recording devices. You may need to adjust input gain for a noisy environment. If using the Go2’s built-in mic array (EDU/Pro models), access to it might be through the Unitree SDK’s audio interface (the Go2 has an internal “vui_service” for voice) – but using an external mic via Python is often easier for custom applications. Place the microphone such that it clearly picks up questions from people (mount it on the robot or hold it when someone speaks).
- **Speaker Setup:** If the Go2’s internal speaker is accessible, you can use it for output. The Go2’s head has a built-in speaker that the voice assistant uses (Unitree’s Go1 EDU documentation lists a 3W speaker ([Go1 — Unitree_Docs 1.0rc documentation - Unitree Support!](https://unitree-docs.readthedocs.io/en/latest/get_started/Go1_Edu.html#:~:text=Go1%20%E2%80%94%20Unitree_Docs%201,to%20the%20123%20segment))). The unofficial SDK hints at an `audio_client` for audio I/O ([GitHub - legion1581/go2_python_sdk: Unofficial Unitree Go2 Python SDK (DDS/WebRTC)](https://github.com/legion1581/go2_python_sdk#:~:text=%2A%20%60vui_client%60%20,Photo%20interfaces)). However, for a quick setup, an external speaker is simplest: connect a small powered speaker to your controlling device’s headphone jack, or pair a Bluetooth speaker. Ensure the volume is high enough for a crowd to hear, and test by playing a sample sound.
**Note:** It’s important to minimize feedback (the mic picking up the speaker’s output). Keep them slightly apart or use a directional microphone so the robot doesn’t hear its own voice.
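A quick way to verify audio I/O from Python (using the packages installed in Step 3 below) is to list the microphones SpeechRecognition can see and speak a test phrase:
```python
# Minimal audio I/O check: enumerate input devices, then speak through the default output.
import speech_recognition as sr
import pyttsx3

for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(f"Mic {index}: {name}")

engine = pyttsx3.init()
engine.say("Speaker test. Can everyone hear me?")
engine.runAndWait()
```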
## Step 3: Install Required Python Libraries
On the controlling computer, install the necessary Python packages:
```bash
pip install openai speechrecognition pyaudio pyttsx3
```
- **`openai`** – Official OpenAI Python client for calling the ChatGPT API.
- **`SpeechRecognition`** – High-level speech-to-text library (supports Google Web Speech, Sphinx, etc.). It provides a simple interface to capture microphone input.
- **`PyAudio`** – Required by SpeechRecognition to capture audio from a microphone (the `sr.Microphone` class depends on it).
- **`pyttsx3`** – Offline text-to-speech library (works on Windows, macOS, Linux). It uses the system’s TTS engines (eSpeak on Linux, SAPI on Windows) to speak out text.
Also install any Unitree SDK library if needed. If using Unitree’s official Python SDK (SDK2), follow their installation instructions. For example, if using `unitree_sdk2_python`, you might clone the repo and run its setup, or simply do:
```bash
pip install unitree-sdk2py
```
(this package name is illustrative – in practice the official `unitree_sdk2_python` SDK is typically installed from its GitHub repository, e.g. by cloning it and running `pip install -e .`). This SDK will allow us to send commands like walk, stop, etc., to the robot from Python. The Unitree Go2 EDU supports DDS-based control out-of-the-box ([GitHub - legion1581/go2_python_sdk: Unofficial Unitree Go2 Python SDK (DDS/WebRTC)](https://github.com/legion1581/go2_python_sdk#:~:text=CycloneDDS%20Driver)), which these libraries utilize.
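After installation, a quick import check will surface missing packages or a missing TTS backend (e.g. eSpeak on Linux) before the demo:
```python
# Import sanity check: an ImportError or a missing TTS backend shows up here, not mid-demo.
import openai
import pyttsx3
import speech_recognition as sr

print("openai:", openai.__version__)
print("SpeechRecognition:", sr.__version__)
print("pyttsx3 engine:", pyttsx3.init())
```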
## Step 4: Configure OpenAI API Access
Now set up access to OpenAI’s API:
1. **Obtain API Key:** Log in to your OpenAI account and find your API key (typically under user settings -> API keys). Copy the key, which starts with `sk-...`. **Keep this secret**.
2. **Install OpenAI Python library:** (Done in Step 3).
3. **Initialize API usage in code:** You can set your API key in code or via environment variable. For simplicity, we’ll set it in code (just be careful not to expose it).
Here’s a small Python snippet to test OpenAI API integration and generate a response:
```python
import openai
openai.api_key = "sk-YOUR_API_KEY_HERE" # TODO: replace with your actual key
# Example question for testing:
user_question = "Hello, can you introduce yourself?"
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": user_question}]
)
answer = response['choices'][0]['message']['content']
print("ChatGPT response:", answer)
```
This code sends a simple prompt to OpenAI’s ChatGPT (gpt-3.5-turbo model) and prints the reply. You should run a quick test of this snippet to ensure your API key is set up correctly and the network connection works. The response should be a text string (e.g., "Hello, I am a Unitree robot...").
**Important:** Be mindful of API usage limits and costs, especially at a busy event. GPT-3.5-turbo is fast and cost-effective for real-time Q&A. Also, you might want to add a **system prompt** to guide the AI (for example: “You are a robot dog at a tech conference. Keep answers brief and polite, and if asked to do something physical, respond that you will perform the action.”). This can help tailor the behavior.
## Step 5: Set Up Speech Recognition (Robot Hearing)
We need the robot to **listen** to people’s questions or voice commands and convert that speech to text. There are two main approaches:
- **Using an online API or service:** The `SpeechRecognition` library can use Google’s free web speech API or other cloud services to transcribe audio. This is easy to set up but depends on internet and may have usage limits.
- **Using an offline or OpenAI Whisper model:** OpenAI’s **Whisper** is a powerful ASR (Automatic Speech Recognition) model ([Introducing Whisper - OpenAI](https://openai.com/index/whisper/#:~:text=Whisper%20is%20an%20automatic%20speech,multitask%20supervised%20data%20collected)) that can run locally (or via OpenAI’s new Whisper API). Whisper is more accurate and can handle noisy environments well, but running it in real-time might require a good GPU (the Jetson Orin can handle the small models). For a beginner-friendly route, we’ll demonstrate using the SpeechRecognition library with Google’s engine for now.
**Configure the microphone input:** Using the SpeechRecognition library, we create a Recognizer and use the Microphone:
```python
import speech_recognition as sr
r = sr.Recognizer()
# Optionally, adjust the energy threshold for ambient noise:
r.dynamic_energy_threshold = True
r.energy_threshold = 300 # starting threshold, auto-adjust
# Choose the microphone (default is usually index 0).
# You can list microphones with sr.Microphone.list_microphone_names()
mic = sr.Microphone(device_index=0)
```
We set `dynamic_energy_threshold=True` so the library adjusts to background noise (helpful for a conference). You might need to tweak `energy_threshold` if the mic is picking up too much noise or not hearing voices.
**Listening for speech:** Now, to capture speech, use a context manager to listen from the mic:
```python
print("Listening for a question or command...")
with mic as source:
r.adjust_for_ambient_noise(source, duration=1) # calibrate to noise
audio_data = r.listen(source, timeout=5, phrase_time_limit=10)
```
Here we quickly adjust for ambient noise, then listen. We set a timeout of 5 seconds to start hearing someone (to avoid blocking indefinitely if no one speaks) and a phrase_time_limit of 10 seconds so that one question doesn’t run on too long. Adjust these as needed (longer if you expect longer questions). Note that `listen()` raises `sr.WaitTimeoutError` if no speech starts within the timeout, so wrap the call in a try/except when running it in a loop.
**Converting speech to text:** Once we have `audio_data`, we transcribe it:
```python
try:
# Use Google Web Speech API (default key). This requires internet.
spoken_text = r.recognize_google(audio_data)
print("Heard:", spoken_text)
except sr.UnknownValueError:
print("Sorry, I didn't catch that.")
spoken_text = ""
except sr.RequestError as e:
print(f"Speech recognition error; {e}")
spoken_text = ""
```
This will attempt to recognize English speech. If successful, `spoken_text` will contain the transcribed question/command. If it didn’t hear properly, we handle exceptions.
*Alternative:* To use **OpenAI Whisper** for possibly better accuracy, you could do:
```python
# Requires: pip install openai-whisper
import whisper
model = whisper.load_model("base")  # or "small", etc., depending on resources
# transcribe() expects a file path (or audio array), so save the captured audio first
with open("question.wav", "wb") as f:
    f.write(audio_data.get_wav_data())
result = model.transcribe("question.wav")
spoken_text = result["text"]
```
Or use `openai.Audio.transcribe()` with your API key to send the audio for transcription. However, these may introduce additional latency. For a live demo, the Google API via SpeechRecognition is often quick and decent for clear speech.
**Tip:** You might implement a **push-to-talk** mechanism – e.g., have the robot only listen when a button is pressed or a wake-word is spoken – to avoid picking up irrelevant chatter at the conference. The Unitree app’s voice feature allows push-to-talk mode ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=Navigate%20to%20FUNCTIONS%20,to%20playing%20your%20preferred%20music)). For a custom setup, you could add a keyboard trigger (press a key to start listening – see the sketch below) or use a wake word detection library (like Porcupine or Snowboy). For simplicity, our example will continuously listen in a loop with short timeouts.
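Here is a minimal push-to-talk sketch – the keyboard trigger is our own convention, and a wake-word detector could replace the `input()` call:
```python
# Push-to-talk sketch: only open the microphone after the operator presses Enter.
import speech_recognition as sr

r = sr.Recognizer()
mic = sr.Microphone()

input("Press Enter, then ask your question...")
with mic as source:
    r.adjust_for_ambient_noise(source, duration=0.5)
    audio_data = r.listen(source, phrase_time_limit=10)
print("Captured audio, ready to transcribe.")
```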
## Step 6: OpenAI API – Generate the Response
Once the robot has the user’s question/command as text, the next step is to get a response from OpenAI (for general questions) **or handle the command if it’s a control instruction**. We will differentiate between normal questions (to be answered by ChatGPT) and robot directives (like “stop” or “follow me”) which we handle directly.
**Determine if input is a command or a question:** You can have a list of keywords for commands. For example, commands might include **stop, follow me, sit, stand, dance, move forward, turn**, etc. If the spoken text matches one of these (or contains these phrases), we will treat it as a **manual control command**. Otherwise, we treat it as a question for the AI. (More on executing commands in the next step.)
Example of checking the text:
```python
user_text = spoken_text.lower()
# Define some trigger phrases for manual commands
manual_commands = ["stop", "follow", "follow me", "come here", "sit", "stand up", "dance"]
is_command = any(cmd in user_text for cmd in manual_commands)
```
If `is_command` is true, we handle it separately. If false, use OpenAI:
```python
if not is_command and user_text:
# Call the OpenAI ChatGPT API with the question
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": user_text}]
)
answer = response['choices'][0]['message']['content'].strip()
print("ChatGPT:", answer)
else:
answer = "" # not using ChatGPT for this input (it’s a direct command)
```
We strip the answer to clean up any extra whitespace. Now `answer` holds the response we want to speak (unless it was a manual command, in which case we might set a specific phrase or just leave it empty if the robot will perform an action instead).
You can also format the prompt to include context or a persona. For instance, you might include a system message: “You are a friendly robot dog helping answer questions at a conference. Keep responses brief.” This can be done by adding `{"role": "system", "content": "instruction..."}` to the messages list. This can guide ChatGPT to give suitable answers in tone and length for your event.
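For example, a sketch of the same call with a system message prepended (reusing `openai` and `user_text` from the snippets above; the wording of the instruction is just an example):
```python
# ChatGPT call with a system message that sets the persona and response length.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a friendly robot dog answering questions at a tech conference. "
                "Keep responses to one or two short sentences."
            ),
        },
        {"role": "user", "content": user_text},
    ],
)
answer = response['choices'][0]['message']['content'].strip()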
## Step 7: Text-to-Speech (Robot Speaking)
Now we take the `answer` text from ChatGPT and convert it to speech so the robot (or connected speaker) can speak it out loud.
Using **pyttsx3** (which operates offline):
```python
import pyttsx3
tts_engine = pyttsx3.init()
# Optionally, change voice properties (volume, rate, voice)
tts_engine.setProperty('rate', 150) # speaking speed (words per minute)
tts_engine.setProperty('volume', 1.0) # volume (max is 1.0)
# On some systems, you can choose a voice ID if multiple voices are available:
# voices = tts_engine.getProperty('voices')
# tts_engine.setProperty('voice', voices[0].id) # example: choose the first voice
if answer:
tts_engine.say(answer)
tts_engine.runAndWait()
```
This will make the computer speak the answer through the default audio output (which should be the speaker you set up). The voice might sound robotic, but it’s reliable. If you prefer a more natural voice, you could use cloud TTS services (Google Cloud TTS, Amazon Polly, etc.) or a library like `gTTS` to get Google’s voice. For example, using gTTS:
```python
from gtts import gTTS
import os
if answer:
tts = gTTS(answer, lang='en')
tts.save("response.mp3")
os.system("mpg123 response.mp3") # uses mpg123 to play mp3 (ensure it's installed)
```
This approach needs internet (since gTTS uses Google) and introduces a slight delay to fetch the MP3. For a conference demo, pyttsx3 is often sufficient and runs quickly on-device.
**Voice Output Test:** Try a test phrase (like a greeting) and see if the speaker volume is okay. Adjust the `rate` if needed to make it speak slower or faster for clarity.
## Step 8: Sending Commands to the Robot (Manual Control Integration)
To allow manual control commands (like “stop”, “follow me”, etc.), you need to interface with the Unitree Go2’s control API. The Go2 EDU is open for programming, meaning you can send it velocity commands or trigger certain behaviors via code ([GitHub - legion1581/go2_python_sdk: Unofficial Unitree Go2 Python SDK (DDS/WebRTC)](https://github.com/legion1581/go2_python_sdk#:~:text=CycloneDDS%20Driver)).
There are a couple of ways to control the Go2 in Python:
- **Using Unitree’s SDK (SDK2) with Python:** The official SDK provides functions to control motion. For example, you might set the robot’s walking speed, orientation, or execute preset motions. In the Python interface, this could be done via a high-level API. (Refer to Unitree’s documentation for exact functions.)
- **Using ROS2:** If the robot is running ROS2, you could publish commands to topics (e.g., a Twist message to `/cmd_vel` for velocity, or service calls for predefined actions).
- **Unofficial Python SDK (CycloneDDS):** As noted earlier, an unofficial SDK exists that connects via DDS to Go2. It even lists a `vui_client` (voice UI) and `audio_client` ([GitHub - legion1581/go2_python_sdk: Unofficial Unitree Go2 Python SDK (DDS/WebRTC)](https://github.com/legion1581/go2_python_sdk#:~:text=%2A%20%60vui_client%60%20,Photo%20interfaces)), which suggests support for voice and audio through the same interface. While cutting-edge, using the official or a stable interface is recommended for reliability.
For **safety and simplicity**, we can implement at least **stop** and **follow** using available means:
- **Stop:** Immediately halt the robot’s movement and possibly have it stand still or sit down.
- **Follow me:** Engage the robot’s follow/track mode to follow the user.
The Go2 actually has a built-in **Accompany/Follow mode** accessible via the remote (double-press “M” on the remote to have it follow the remote beacon) ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=To%20activate%20the%20accompanying%20mode,controller%20that%E2%80%99s%20included%20with%20it)). Programmatically, if the SDK exposes an “accompany mode” toggle, we could use that. If not, a workaround is to simply command the robot to walk behind the user at a set distance using vision or LIDAR – but that would require person-tracking logic beyond our scope. For now, we assume either using the built-in follow or a placeholder.
Similarly, “dance” or other tricks are part of the preset actions (like the remote combo L1+B triggers a dance) – these might be invoked if the SDK provides a function or if you play a predefined motion file.
**Initializing robot control:** Here’s a conceptual snippet (the actual code depends on the SDK):
```python
# Pseudocode for robot control initialization
import unitree_sdk2_python as go2 # assuming an imported library
robot = go2.Robot() # initialize robot interface
robot.connect(ip="192.168.XX.XX") # connect to robot (if needed, or uses DDS auto-discovery)
robot.set_mode("high_level") # switch to high-level control mode, if required by SDK
```
For the Go2, high-level mode allows sending target velocities or actions, whereas low-level would mean joint control (not needed here). The IP might not be needed if using DDS which auto-discovers on the network.
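If you are using Unitree’s official Python SDK, a minimal sketch of high-level control might look like the following – this assumes the `SportClient` interface shown in Unitree’s `unitree_sdk2_python` examples, so verify the import paths and method names against the SDK version you actually install:
```python
# Hedged sketch of high-level control via the official unitree_sdk2py package.
# Import paths and method names follow Unitree's published examples; confirm them
# against your installed SDK version before relying on this.
from unitree_sdk2py.core.channel import ChannelFactoryInitialize
from unitree_sdk2py.go2.sport.sport_client import SportClient

ChannelFactoryInitialize(0, "eth0")  # network interface wired/bridged to the Go2

sport = SportClient()
sport.SetTimeout(10.0)
sport.Init()

sport.StandUp()             # stand from a resting posture
sport.Move(0.3, 0.0, 0.0)   # walk forward at ~0.3 m/s (vx, vy, yaw rate)
sport.StopMove()            # zero velocity = stop in place
```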
**Executing commands:** Now implement handling for our voice commands:
```python
if "stop" in user_text:
# Stop the robot's movement
robot.stop() # hypothetical method: could send zero velocity or disable motors
answer = "Stopping now."
elif "follow" in user_text:
# Enable follow mode
robot.start_follow_mode() # hypothetical: switch to accompany mode
answer = "Okay, I will follow you."
elif "dance" in user_text or "dance" in user_text:
robot.perform_action("dance") # hypothetical: trigger a dance routine
answer = "Sure! I'll do a dance."
# ... (and so on for other commands like sit, stand, etc.)
```
After executing, we set an `answer` so the robot also verbally acknowledges the command (optional but engaging). For example, if the user says "follow me", the robot might reply with *"Okay, I will follow you."* and then start following.
The exact functions (`robot.stop()`, `start_follow_mode()`, etc.) will depend on the SDK:
- **Stop:** Many SDKs don’t have a direct “stop()” but you can achieve stop by sending a zero translation velocity and zero rotation (essentially commanding it to stand in place). For instance, the Go1 SDK had a function to send velocity commands; sending (vx=0, vy=0, yawRate=0) repeatedly will stop the robot. The Go2 SDK likely has similar control.
- **Follow mode:** If no direct SDK call, this might not be trivial to implement from scratch. If follow mode is critical, you can simply rely on the remote as a backup: the user can carry the remote and double-tap M to make the robot follow them ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=To%20activate%20the%20accompanying%20mode,controller%20that%E2%80%99s%20included%20with%20it)), while the voice system is idle. You could integrate that by instructing the user via voice, or if using the app’s built-in voice feature, it could handle it. Since our focus is custom integration, consider follow mode a bonus if achievable.
- **Predefined actions (dance, etc.):** The Unitree app/voice can execute tricks ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=Navigate%20to%20FUNCTIONS%20,to%20playing%20your%20preferred%20music)). In code, this might correspond to playing a motion sequence file on the robot. If the SDK provides an interface to trigger those, great. Otherwise, you could pre-program simple motions. For a beginner, it might be easier to stick to basic moves.
**Example:** If the user says "sit down", you might not have a direct function, but you can simulate it by lowering the robot’s height or changing its posture mode (some Unitree robots have a lie-down mode). If available, use it; if not, you can at least stop movement.
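For instance, a hedged sketch of such a handler, reusing the `sport` client from the sketch above (assuming `StandDown()`/`StandUp()` exist as in Unitree’s example code – verify against your SDK):
```python
# Hypothetical "sit"/"stand" handlers built on the SportClient sketch above.
if "sit" in user_text or "lie down" in user_text:
    sport.StandDown()          # lower the body to the ground
    answer = "Okay, lying down."
elif "stand" in user_text:
    sport.StandUp()            # return to standing
    answer = "Standing up."
```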
## Step 9: Running the Integration Loop
With all components ready (listening, understanding via OpenAI, speaking, and moving), combine them into a single loop so the robot continuously interacts. Here’s a simplified high-level loop incorporating previous steps:
```python
import openai, speech_recognition as sr, pyttsx3
# (Assume robot SDK is initialized as `robot`)
r = sr.Recognizer()
mic = sr.Microphone()
tts_engine = pyttsx3.init()
print("Robot is ready and listening...")
while True:
try:
        # Listen for a phrase (skip this cycle if no one starts speaking before the timeout)
        try:
            with mic as source:
                r.adjust_for_ambient_noise(source, duration=0.5)
                audio_data = r.listen(source, timeout=10, phrase_time_limit=8)
        except sr.WaitTimeoutError:
            continue
        try:
            user_text = r.recognize_google(audio_data)
        except (sr.UnknownValueError, sr.RequestError):
            user_text = ""
if not user_text:
continue # no speech recognized, continue loop
user_text = user_text.lower()
print("User said:", user_text)
# Check for manual command keywords
if "stop" in user_text:
robot.stop() # Stop the robot
response_text = "Stopping."
elif "follow" in user_text:
robot.start_follow_mode()
response_text = "Entering follow mode."
elif "dance" in user_text:
robot.perform_action("dance")
response_text = "Doing a dance!"
# (Additional commands can be handled here)
else:
# It's a general question or request – use OpenAI API
openai_response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": user_text}]
)
response_text = openai_response['choices'][0]['message']['content'].strip()
# Speak out the response_text
print("Robot:", response_text)
tts_engine.say(response_text)
tts_engine.runAndWait()
except KeyboardInterrupt:
break # allow breaking the loop with Ctrl+C (manual stop)
```
In this loop:
- The robot keeps listening in chunks (timeout 10 seconds for someone to start speaking, 8 seconds max per phrase; adjust as needed).
- If speech is caught, we convert to text (`user_text`).
- We then check if the text contains any manual command. We used simple substring checks here, but you could improve this with NLP or a more robust command parser if needed.
- For each recognized command, we call the corresponding robot control function and set an appropriate verbal response.
- If it’s not a command, we send it to OpenAI and get a conversational answer.
- We then use TTS to speak the answer or acknowledgment.
**Manual override and safety:** This system assumes the robot is in a mode where it can accept both remote and SDK commands. **Always test in a safe area first.** At any time, the human operator can use the physical remote to override. For example, pressing the **“Start”** button on the remote usually toggles robot movement enabling/disabling. If the robot is doing something it shouldn’t, hit the **kill switch** on the app or remote, or use the emergency stop (on some robots, picking them up off the ground also triggers a stop). Our code loop also allows a keyboard interrupt (Ctrl+C) to stop the program if running from a laptop.
Additionally, you might incorporate a voice command like "stop listening" that breaks out of the loop or disables the AI responses, in case the handlers need to be paused.
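For example (placed inside the loop, right after `user_text` is recognized):
```python
# Example: let a spoken phrase pause the demo and exit the interaction loop.
if "stop listening" in user_text:
    tts_engine.say("Okay, pausing voice interaction.")
    tts_engine.runAndWait()
    break
```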
## Step 10: Testing and Tuning
With everything set up, conduct thorough testing **before the conference**:
- **Unit test each component:** Test the microphone input alone (print out what it transcribes for various people/questions). Test the OpenAI response generation with sample questions. Test the speaker output (does it speak clearly?). And test sending a basic move command to the robot in isolation (e.g., a small forward move or stop).
- **Integrated test:** Run the full loop in a quiet setting first. Speak a command like "What is your name?" and see the robot respond via voice. Then try "follow me" and observe if the robot enters follow mode (or at least acknowledges it).
- **Latency:** Ensure the turnaround time from speaking to getting a reply is acceptable. GPT-3.5 is quite fast for short queries (1-2 seconds). The speech recognition and TTS are near-real-time. Overall, the robot should ideally respond within ~3 seconds. If it’s slower, consider optimizing (e.g., using a smaller language model or offline processing).
- **Edge cases:** Think about what if multiple people talk at once, or if a question is long. The system might grab partial speech. You could enforce one-at-a-time questions (have the robot say "Please ask one at a time."). Also, if the content of questions might sometimes be inappropriate or tricky, consider adding some filtering or a robust system prompt to steer ChatGPT away from problematic outputs (for example, instruct it to refuse if asked to do something dangerous or offensive).
- **Network fallback:** Have a plan if internet drops. The robot could say "Sorry, I am offline currently." or you switch to a pre-programmed mode. Since the Go2 has **offline voice command** capability for basic commands (built-in) ([Go2 SDK Development Guide - 宇树文档中心 - Unitree Robotics](https://support.unitree.com/home/en/developer#:~:text=Go2%20SDK%20Development%20Guide,offline%20voice%20interaction%2C%20commands%2C)), you might lean on that if API fails for commands.
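For instance, a minimal sketch of such a fallback around the ChatGPT call from the loop above:
```python
# Wrap the OpenAI call so a network or API failure produces a spoken fallback
# instead of crashing the loop.
try:
    openai_response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_text}],
    )
    response_text = openai_response['choices'][0]['message']['content'].strip()
except Exception as exc:
    print("OpenAI request failed:", exc)
    response_text = "Sorry, I am offline at the moment. Please ask me again in a minute."
```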
## Additional Tools and Libraries
Summarizing tools/libraries used in this setup:
- **Unitree SDK / Python API** – to connect and send commands to the robot (movement, mode changes, etc.).
- **OpenAI API** – for natural language understanding and response (ChatGPT for generating answers).
- **OpenAI Whisper or SpeechRecognition** – for converting spoken audio to text in real time.
- **Pyttsx3 or other TTS** – for converting text back to speech audio output.
- **CycloneDDS** – (under the hood for Unitree SDK) communication middleware the Go2 uses. The Unitree EDU supports CycloneDDS natively for high-level control ([GitHub - legion1581/go2_python_sdk: Unofficial Unitree Go2 Python SDK (DDS/WebRTC)](https://github.com/legion1581/go2_python_sdk#:~:text=CycloneDDS%20Driver)), which the Python SDK leverages.
- **Hardware** – microphone and speaker for I/O, Jetson Orin or PC for running the AI code, and the robot’s built-in sensors/actuators for execution.
- Optionally, **ROS 2** if you prefer using ROS topics to control the robot, and **VAD (Voice Activity Detection)** libraries if you want the robot to detect when someone starts speaking automatically.
## Real-World Tips for a Conference Demo
- **Prepare a startup routine:** At the conference, you might start the robot in a neutral state (standing idle). Have a short **intro script** – e.g., you press a key to trigger the robot to say "Hello, I am a Unitree Go2 robot. I can answer questions. Ask me anything!" – this breaks the ice and attracts people.
- **Operator monitoring:** Even with voice commands, have an operator monitor the robot’s behavior. The operator can use a laptop or tablet that shows what the robot heard and what it’s going to do (you can print logs as in our code). That way, if it mishears something or if ChatGPT produces a strange answer, the operator can intervene. For example, if someone asks something inappropriate, the operator could stop the response.
- **Manual drive mode:** You might sometimes want to manually drive the robot around the booth. In that case, it’s wise to **pause the AI program** (or have a voice command like "switch to manual"). You can then use the handheld controller to pilot the robot. When ready to resume Q&A mode, stop the robot and re-engage the program.
- **Use of built-in features:** Remember that the Go2 EDU *itself* has a voice interaction feature (“BenBen” voice assistant in the app) that can execute many commands and even answer queries using ChatGPT integration ([Unitree Robotics Archives - InDro Robotics](https://indrorobotics.ca/category/unitree-robotics/#:~:text=One%20of%20the%20more%20intriguing,out%20certain%20tasks%20by%20voice)). Our guide shows how to build a custom solution, but you could also leverage the built-in system as a fallback. The built-in voice control allows full control (tricks, navigation, music) via cloud AI ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=Navigate%20to%20FUNCTIONS%20,to%20playing%20your%20preferred%20music)). However, customizing it (like altering responses or adding new behaviors) might be limited, which is why a custom Python integration is powerful.
## Conclusion
By following these steps, you will have configured the Unitree Go2 EDU to handle real-time speech interaction:
- The robot **listens** to people’s questions through a microphone.
- The speech is transcribed to text and sent to **OpenAI’s API**, where ChatGPT generates a response.
- The robot then **speaks** the answer using text-to-speech, effectively allowing natural Q&A.
- If the input was a **command** like "stop" or "follow me", the system recognizes it and directly controls the robot’s behavior instead of asking OpenAI, enabling manual intervention at any time.
- The robot stays connected via **Wi-Fi** for internet access, and you have manual control overrides for safety.
This integrated setup will enable your Unitree Go2 robot to act as an interactive, voice-responsive agent at the conference – a great way to engage attendees. Good luck with your implementation, and enjoy the demo!
**Sources:**
- Unitree Go2 official documentation – voice command feature and Wi-Fi requirements ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=Navigate%20to%20FUNCTIONS%20,to%20playing%20your%20preferred%20music)) ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=Note)).
- InDro Robotics review – confirming Go2’s ChatGPT voice interaction capability ([Unitree Robotics Archives - InDro Robotics](https://indrorobotics.ca/category/unitree-robotics/#:~:text=One%20of%20the%20more%20intriguing,out%20certain%20tasks%20by%20voice)).
- Unofficial Go2 Python SDK docs – indicating EDU supports DDS control for custom code ([GitHub - legion1581/go2_python_sdk: Unofficial Unitree Go2 Python SDK (DDS/WebRTC)](https://github.com/legion1581/go2_python_sdk#:~:text=CycloneDDS%20Driver)).
I'll research how to leverage the Unitree Go2 EDU's built-in speech recognition and text-to-speech (TTS) capabilities for interacting with users. This includes:
- Whether the robot's native voice assistant can be integrated with OpenAI’s API for more advanced responses.
- How to use the onboard microphone and speaker directly for speech input and output, avoiding external services where possible.
- SDK or API hooks that allow customization of the robot's speech recognition and synthesis features.
I'll provide step-by-step instructions based on what is possible with the Unitree Go2 EDU.
# Using Unitree Go2 EDU's Speech Recognition and TTS for Real-Time Interaction
The Unitree Go2 EDU comes with a built-in voice assistant (“BenBen”) that supports speech recognition and text-to-speech. Below, we address how to customize or extend this system, use the robot’s onboard microphone and speaker directly, issue movement commands via voice, and what SDK/API options exist for voice features.
## 1. Customizing the Built-in Voice Assistant (OpenAI Integration)
**Built-in Voice Assistant:** The Go2’s voice assistant is quite advanced – it actually leverages an **LLM (GPT-3.5)** behind the scenes. In fact, the Go2 is *“integrated with OpenAI’s GPT-3.5 LLM”* for its voice interactions ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=All%20of%20this%20progress%20has,user%27s%20voice%20commands%20as%20input)). Users can ask the robot questions or give commands by voice, and the system uses an AI persona “BenBen” to respond and act. For example, if you say "Hello," BenBen might introduce itself and even wag its tail, and if you give a command, the assistant generates a Python code snippet to execute that action on the robot ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=Your%20character%20is%20a%20docile%2C,your%20abilities%20are%20given%20below)) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=output%20a%20python%20code%20block,circles%20or%20wagging%20your%20tail)).
**Can it be modified?** **Not easily.** Unitree does **not provide a public API or config** to change BenBen’s behavior or connect it to external services ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=Last%20but%20certainly%20not%20least%2C,publicly%20available%20for%20the%20Go2)). The voice assistant’s prompt and logic are baked into the system (as researchers discovered by extracting its hidden prompt) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=Last%20but%20certainly%20not%20least%2C,publicly%20available%20for%20the%20Go2)). This means you **cannot directly “plug in” your own API key or change the voice assistant’s responses** through any official setting. The only way the robot’s built-in assistant answers questions is via Unitree’s cloud service (which already uses OpenAI’s API under the hood), and there’s no official hook for developers to alter this.
**Workaround – Build a Custom Pipeline:** While you can’t modify BenBen itself, you *can* bypass it by creating your own voice-interaction program that runs on the Go2 (or an external computer) and uses OpenAI’s services. In essence, you’d be reimplementing the voice pipeline: use the **onboard mic for speech input**, call OpenAI’s API for processing, then use the **onboard speaker for output**. Here’s how you can do that step-by-step:
1. **Access the Robot’s Computer:** Ensure you can run custom code on the Go2’s main computer (e.g. via SSH into the Go2 EDU’s Linux system). The Go2 EDU runs a Linux-based controller (e.g. a Raspberry Pi or Jetson) where you can install Python libraries and use the microphone/speaker hardware.
2. **Speech-to-Text (STT):** Capture audio from the Go2’s microphone and convert it to text. You can use OpenAI’s Whisper API or another STT engine. For example, record a short audio clip from the mic (5 seconds) using a command-line tool or Python (e.g. `arecord -d 5 -f cd input.wav` on Linux), then send it to OpenAI’s transcription endpoint. In Python, you might use:
```python
import openai
audio_file = open("input.wav", "rb")
transcript = openai.Audio.transcribe("whisper-1", audio_file)
user_text = transcript.text
```
This uses OpenAI’s Whisper model to get the text of what was said. (If you prefer offline, you could run Whisper locally on the Jetson if available.)
3. **Query OpenAI (ChatGPT):** Take the transcribed text and send it to the OpenAI Chat API (GPT-3.5/4) to get an answer or decide on an action. You’ll need an OpenAI API key for this. For example:
```python
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": user_text}]
)
answer_text = response.choices[0].message.content
```
You can customize the system prompt here to define how the AI should behave (e.g. you can give it a persona or instructions similar to BenBen’s, or just have it answer factually). This is effectively connecting the robot’s voice input to OpenAI’s API – **allowing the robot to answer arbitrary questions or carry on a conversation using ChatGPT**.
4. **Text-to-Speech (TTS):** Convert the AI’s reply (or any command confirmation) back to audio. The Go2’s built-in assistant uses a TTS voice to speak – you can do the same. You have a few options:
- Use OpenAI’s new text-to-speech if available (the OpenAI **“Realtime” voice API** can produce natural speech from text).
- Use a third-party TTS service or library (e.g. Google Cloud TTS, Amazon Polly, or an offline library like eSpeak or pyttsx3).
For example, using an offline TTS library in Python:
```python
import pyttsx3
tts_engine = pyttsx3.init()
tts_engine.save_to_file(answer_text, "output.wav")
tts_engine.runAndWait()
```
This would synthesize speech to an `output.wav` file. (Alternatively, an API could return an MP3/WAV of the spoken response.)
5. **Play Audio through the Speaker:** Once you have an audio file for the response, play it on the Go2’s speaker. On the robot’s Linux system, you might call aplay to play the WAV file (e.g. `aplay output.wav`). Programmatically, you can use Python’s `subprocess` to run that, or use an audio playback library to stream it to the speaker. (On Go1, developers have done this by copying a WAV to the robot and using command-line players ([Audio for Go1 with javascript - DroneBlocks Drone Coding Discussion Groups & Community](https://community.droneblocks.io/t/audio-for-go1-with-javascript/1043#:~:text=We%20have%20experimented%20with%20audio,want%20to%20play%20it%20back)) – the same concept applies to Go2.)
By chaining these steps, your Go2 can listen to a question, use OpenAI to generate an answer, and speak out the answer – effectively a custom voice assistant. While this requires writing your own code, it **gives you full control**. You can, for instance, change the wake word, filter the user input, or program custom Q&A behavior (perhaps limiting answers to a certain domain).
**Sample Workflow Code Snippet:** (combining the above steps in a simple loop)
```python
import openai, subprocess
openai.api_key = "YOUR_OPENAI_API_KEY"
while True:
# 1. Record audio from mic
subprocess.run(["arecord", "-D", "plughw:0,0", "-d", "3", "-f", "cd", "input.wav"])
# 2. Transcribe audio to text
with open("input.wav", "rb") as audio_file:
transcript = openai.Audio.transcribe("whisper-1", audio_file)
user_text = transcript["text"]
if not user_text:
continue # no voice input captured
print(f"User said: {user_text}")
# 3. Get response from ChatGPT
chat = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": user_text}]
)
answer = chat.choices[0].message.content.strip()
print(f"ChatGPT answer: {answer}")
# 4. Synthesize TTS (using a simple system TTS like pico2wave, for example)
subprocess.run(["pico2wave", "-w=output.wav", answer])
# 5. Play the audio on speaker
subprocess.run(["aplay", "output.wav"])
```
*In the above code:* we record 3 seconds of audio, transcribe via Whisper, chat with GPT-3.5, then use a command-line TTS (pico2wave, an offline TTS) to generate speech and play it. You would need to adjust device names (`plughw:0,0`) depending on audio hardware, and install the necessary tools (like pico2wave or other TTS engine).
This approach **bypasses the built-in voice assistant** entirely, allowing you to use the Go2’s mic/speaker with the OpenAI API (or any AI service) in real-time. The trade-off is you have to manage the coding and integration yourself, but you gain flexibility – for example, you could integrate voice with custom robot behaviors or use a different language model if desired.
## 2. Using the Onboard Microphone and Speaker Directly (Bypassing External Services)
The Go2 EDU **hardware** includes a built-in **microphone and speaker** in the robot’s head ([Unitree Go2 Edu Standard](https://store.hp-drones.com/en/unitree/2558-EAURGO2215291-9010189215291.html#:~:text=bow%2C%20a%20variety%20of%20creative,actions%20with%20music%20and%20lights)), enabling voice interaction. By default, the Go2’s official app and cloud services handle audio (the mobile app can stream your phone’s mic to the robot’s speaker for an “intercom” function ([Unitree go1 speakers connect - Robots - DroneBlocks Drone Coding Discussion Groups & Community](https://community.droneblocks.io/t/unitree-go1-speakers-connect/826#:~:text=In%20the%20Go%201%20Edu,the%20speaker%20in%20Go%201)), and the voice commands feature sends your speech to the cloud for recognition). However, you can access the audio hardware directly for your own applications, avoiding reliance on Unitree’s app or servers.
**Official Support:** Unitree’s SDK and docs currently **do not provide high-level APIs for audio**. The official stance (as of mid-2024) was that **the Go2’s internal speaker isn’t directly accessible through their SDK** ([[Unitree Go2] Access to Speaker and Microphone - Software - Robot Forum | MYBOTSHOP](https://forum.mybotshop.de/t/unitree-go2-access-to-speaker-and-microphone/1036#:~:text=Dear%20%40chr99th)), and there’s no documented API for the microphone either. In other words, the built-in voice features are intended to be used via the Unitree app/cloud (BenBen assistant), not via custom code. This is why you won’t find functions in the SDK like “getMicrophoneInput()” or “playAudio()”.
**Direct Access Methods:** Despite the lack of official SDK calls, developers have found ways to use the **onboard mic and speaker at a lower level**:
- **Linux Audio Utilities:** The Go2 runs Linux internally, so standard audio tools and libraries (ALSA, PortAudio, etc.) work. As mentioned, you can record from the mic using `arecord` and play sound with `aplay`. If you have shell access (SSH) to the robot, you can test this manually. For example, SSH into the robot and run:
- `arecord -d 3 -f cd test.wav` – records 3 seconds from the microphone into `test.wav` (16-bit, 44.1 kHz).
- `aplay test.wav` – plays the recorded audio through the robot’s speaker.
If you hear the recording, you’ve confirmed direct access. The quality might not be perfect (users noted the audio can cut out at times ([Unitree go1 speakers connect - Robots - DroneBlocks Drone Coding Discussion Groups & Community](https://community.droneblocks.io/t/unitree-go1-speakers-connect/826#:~:text=Boof%20%20May%205%2C%202023%2C,5%3A31pm%20%2011))), but it works.
- **WebRTC Interface:** Unitree uses a WebRTC-based interface for streaming audio/video between the robot and the app. The Go2 broadcasts its camera feed and mic audio over WebRTC, and accepts audio input for TTS through this channel. An **unofficial open-source driver** called **`go2_webrtc_connect`** leverages this to capture the robot’s mic and send sound to its speaker ([[Unitree Go2] Access to Speaker and Microphone : r/robotics](https://www.reddit.com/r/robotics/comments/1eyd9y2/unitree_go2_access_to_speaker_and_microphone/#:~:text=Aggravating)) ([GitHub - legion1581/go2_webrtc_connect: Unitree Go2 WebRTC driver](https://github.com/legion1581/go2_webrtc_connect#:~:text=Audio%20and%20Video%20Support)). By connecting to the Go2’s WebRTC service (either in AP mode or on the same network), you can programmatically receive audio (mic) and transmit audio (speaker) in real-time. This library uses PortAudio to interface with the streams ([GitHub - legion1581/go2_webrtc_connect: Unitree Go2 WebRTC driver](https://github.com/legion1581/go2_webrtc_connect#:~:text=sudo%20apt%20install%20portaudio19,submodules%20https%3A%2F%2Fgithub.com%2Flegion1581%2Fgo2_webrtc_connect.git%20cd%20go2_webrtc_connect)). In practice, this means you could run a Python script on a laptop that connects to the robot’s WiFi and subscribes to the audio feed, instead of running the code on the robot itself.
- **Onboard Code with Audio Libraries:** You can also install Python libraries like `pyaudio` or `sounddevice` on the robot to capture or play audio in real-time. For example, using `pyaudio`, you can open the default input device (the mic) and read frames in a loop, or open the output device (speaker) to play audio buffers.
**Bypassing External Services:** If your goal is to **avoid Unitree’s cloud and use everything on-device**, you’ll need an offline speech recognizer and TTS (or at least local network processing). One approach is to run an offline STT engine on the Go2 (like Vosk or Whisper in offline mode) and an offline TTS (like eSpeak or Festival). However, these might be resource-intensive and less accurate than cloud solutions. An alternative is a hybrid: use the onboard mic/speaker but still call cloud APIs (like OpenAI Whisper or Google STT) for the heavy speech recognition or synthesis tasks – this keeps the audio pipeline under your control while leveraging powerful models.
**Step-by-Step Example (Direct Audio Use):**
- *Setup:* SSH into the Go2’s computer or connect your program via the Go2’s IP. Ensure the audio driver is accessible (the Go2 EDU’s mic/speaker should appear as a sound card device).
- *Recording from Mic:* Use a Python library to record audio. For instance, with `sounddevice`:
```python
import sounddevice as sd
import numpy as np
duration = 3 # seconds
fs = 16000 # sample rate
print("Listening...")
audio = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='int16')
sd.wait()
# audio now contains the recorded waveform
np.save("audio.npy", audio) # save or process as needed
```
This would capture 3 seconds of mono audio. You can then send this `audio` data to a speech recognizer (local or remote); see the sketch after this list for writing it out as a WAV file first.
- *Playing to Speaker:* To play a numpy waveform or an audio file, you can use `sd.play(audio, fs)` or use `subprocess.call(["aplay", "file.wav"])` if you have a WAV file. For example:
```python
import simpleaudio as sa
wave_obj = sa.WaveObject.from_wave_file("response.wav")
play_obj = wave_obj.play()
play_obj.wait_done()
```
This uses the `simpleaudio` library to play a WAV file through the default audio output (which should be the robot’s speaker).
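For example, a small sketch (reusing `audio` and `fs` from the recording snippet above) that writes the captured buffer to a WAV file with Python’s standard-library `wave` module, ready to hand to Whisper, SpeechRecognition, or the OpenAI transcription endpoint:
```python
# Save the int16 numpy buffer recorded above as a 16 kHz mono WAV file.
import wave

with wave.open("question.wav", "wb") as wf:
    wf.setnchannels(1)         # mono
    wf.setsampwidth(2)         # int16 = 2 bytes per sample
    wf.setframerate(fs)        # 16000, matching the recording above
    wf.writeframes(audio.tobytes())
```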
By using these methods, you **don’t need the Unitree phone app or cloud** – you can capture the user’s voice and play responses locally. This is essential for real-time custom interactions, and it’s how you’d integrate your own voice assistant or control system into the robot.
**Note:** Ensure volume is adjusted appropriately. The Go2’s speaker volume might be controllable via software. (In the BenBen system prompt, there’s a function `set_volume(value="10%+")` for volume control ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=,on)), implying there is an API for volume. If not publicly documented, you may control volume through ALSA mixer settings on the device.)
## 3. Voice Commands for Movement (Follow, Stop, Dance, etc.)
One of the most exciting uses of speech recognition on the Go2 is commanding it to move or perform tricks with just your voice. The **built-in voice command set** already understands many action commands. According to Unitree, the Go2’s voice control *“allows complete control over the robot, from executing tricks, navigating specific distances and angles, to playing your preferred music.”* ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=turned%20on%20and%20connected%20to,to%20playing%20your%20preferred%20music)). In practice, the default assistant can recognize commands like *“sit down”*, *“stand up”*, *“turn right 90 degrees”*, *“walk backward 3 meters”*, *“dance”*, etc., and it will make the robot do that action (often responding with a confirmation like “Okay, starting to walk backward!”) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=1,Master%3A%20Stop%20singing)) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=6,Game%20rules%20emphasized)). For example, saying *“Give me a spin”* makes BenBen output a spin command to the robot, and saying *“Stop”* will halt the current motion or music ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=1,Master%3A%20Stop%20singing)).
**Using Built-in Commands:** Out of the box, you can use these phrases (in whatever language the assistant supports, likely English and Chinese) to control the Go2. “Dance” will trigger a dance routine, “Sit down” will make it sit, “Take a picture” will snap a photo, etc. ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=,off)) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=1,Master%3A%20Stop%20singing)). The command **“Follow me”** (to initiate follow/accompany mode) is a bit special – it’s not explicitly listed in the leaked command set, and follow mode usually requires the locator remote (“M” button double-tap) ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=To%20activate%20the%20accompanying%20mode,controller%20that%E2%80%99s%20included%20with%20it)). It’s possible the voice assistant might not handle “follow me” by default, but you could simulate it by commanding the robot to walk behind you or using vision-follow (if you integrated a custom vision model). In short, the built-in system covers many static actions, but dynamic following might need a custom solution or the remote.
**Adding Custom Voice Commands:** If you want to extend voice control to *your own custom commands* or behaviors (beyond what’s built-in), you will need to implement the recognition-to-action mapping yourself. This goes hand-in-hand with the custom pipeline from earlier. Essentially, replace the ChatGPT step with your own command parser, or use ChatGPT but give it a prompt that includes your custom commands. Two approaches:
- **Keyword/Rule-Based:** After getting text from speech recognition, use simple logic to detect certain keywords or phrases and trigger corresponding actions. For example:
```python
cmd = user_text.lower()
if "follow" in cmd:
# Activate follow mode or custom follow behavior
robot.start_following()
elif "dance" in cmd:
perform_dance_sequence() # your predefined sequence of moves
elif "stop" in cmd:
robot.stop() # stop all motion
```
Here, `robot.start_following()` and others would be functions you create using the Unitree SDK or control APIs (more on that below). This approach is straightforward and ensures specific phrases map to actions.
- **AI-based parsing:** You can use an LLM (like GPT) to interpret more complex or natural language commands and output an action code. This is actually what BenBen does internally – it asks the LLM to output a code block with an action function for any user request ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=Your%20character%20is%20a%20docile%2C,your%20abilities%20are%20given%20below)) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=output%20a%20python%20code%20block,circles%20or%20wagging%20your%20tail)). You can mimic this by giving GPT a custom prompt that lists your available commands, and asking it to respond with a command identifier. For instance: *“If the user says something about dancing, answer with ‘DANCE’; if they say stop, answer ‘STOP’,”* etc. Then your code can execute the corresponding behavior. This is more flexible with phrasing but also more complex and requires careful prompt design to avoid misunderstandings.
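As a minimal illustration of the AI-based approach, the sketch below asks GPT to reply with one token from a fixed vocabulary – the token names (FOLLOW, DANCE, STOP, NONE) are our own convention, not a Unitree or OpenAI API:
```python
# Map free-form speech onto a fixed command vocabulary via ChatGPT.
import openai

COMMANDS = ["FOLLOW", "DANCE", "STOP", "NONE"]

def parse_command(user_text: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": (
                "Map the user's request to exactly one of these tokens: "
                + ", ".join(COMMANDS)
                + ". Reply with the token only. Use NONE if no command applies."
            )},
            {"role": "user", "content": user_text},
        ],
    )
    token = response['choices'][0]['message']['content'].strip().upper()
    return token if token in COMMANDS else "NONE"
```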
**Issuing Movement Commands via SDK:** Whether you use keyword matching or AI parsing, the final step is controlling the robot. Unitree provides an SDK for programming the robot’s motions. For high-level movements, the **Unitree Python SDK** (`unitree_sdk2_python`, the Python bindings for `unitree_sdk2`) allows you to send velocity commands, posture commands, or trigger preset motions. For example, you could send a forward velocity to make the robot walk, or use low-level joint control to move individual legs. The DroneBlocks forum discussion suggests using the Unitree Python SDK to control the Go2 in conjunction with ChatGPT for interpreting commands ([ChatGPT and Go1 Tutorial - Robots - DroneBlocks Drone Coding Discussion Groups & Community](https://community.droneblocks.io/t/chatgpt-and-go1-tutorial/1016#:~:text=It%20should%20be%20possible%20to,could%20be%20run%20in%20Python)). This SDK is essentially a wrapper around the low-level control interface that the Go2 uses for movement and sensor data.
Some possible implementations:
- **Follow Mode:** If you want a "follow me" voice command, you might invoke the Go2’s accompany mode. There may not be a direct SDK call named “follow”, but you can try to trigger whatever the “M” button on the locator remote does to turn on accompany mode – this might be accessible through the SDK’s API for preset modes. If the SDK doesn’t expose it, a workaround is to command the robot to move toward a target (e.g., using vision or following a tag). Another hack: if you have the Go2’s remote, you could rig a USB interface that simulates the button press whenever the word “follow” is detected. None of this is documented, so it may require some creative engineering.
- **Stop:** This is easier – you can send zero velocities to all joints or call an SDK stop function. The built-in voice command uses `stop_sport()` internally ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=,stop_sport)). For your own code, if you’re in velocity control mode, just send 0 translational and rotational speed. If running a predefined motion, you’d need to interrupt or reset it.
- **Dance:** You can choreograph a dance by stringing together motions. The Go2 has preset “dance segments” accessible in the built-in system ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=,stop_music)). Without access to those, you can either use the SDK to command joint angles over time (more involved), or trigger the built-in dances by simulating the voice or app command. A simpler path: use the *“playMotion”* function if provided (some Unitree SDKs allow playing preset motion files). Check if Unitree’s secondary development resources include sample motions for dance. If not, you can code your own fun movements (e.g., spin in place, wiggle, hop).
**Example:** Suppose you have integrated speech recognition (as in section 2) and obtained the text `command_text`. The following pseudo-code illustrates integrating with the SDK:
```python
import time

from unitree_sdk2_python import robot  # hypothetical import of the Unitree SDK

# Initialize robot control (this depends on how the SDK is set up, e.g., connecting to the robot)
bot = robot.UnitreeRobot()

# Interpret the voice command
command = command_text.strip().lower()
if "forward" in command:
    bot.set_velocity(x=0.5, y=0, yaw=0)    # move forward at 0.5 m/s
elif "back" in command:
    bot.set_velocity(x=-0.5, y=0, yaw=0)   # move backward
elif "stop" in command:
    bot.stop()                             # stop motion, likely sets velocity to 0 or goes to idle mode
elif "sit" in command:
    bot.execute_action("sit")              # if the SDK has a predefined action, or set joint angles for sitting
elif "dance" in command:
    # perform a custom dance, e.g., a sequence of moves
    bot.execute_action("happy")
    time.sleep(2)
    bot.execute_action("wiggle_hip")
    time.sleep(3)
    bot.execute_action("happy_new_year")
elif "follow" in command:
    bot.enter_accompany_mode()             # pseudo-call, if available
```
This is illustrative – refer to actual Unitree SDK docs for correct function names. The key idea is mapping spoken words to SDK calls.
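For reference, here is roughly what that mapping could look like against Unitree’s actual Python SDK (`unitree_sdk2_python`, imported as `unitree_sdk2py`) using its high-level sport client. Treat it as a hedged sketch based on the SDK’s published examples: module paths and method names can differ between SDK releases, and the network interface name (`eth0`) and the chosen speeds are assumptions to adjust for your setup.
```python
# Hedged sketch of voice-to-motion mapping with unitree_sdk2py's SportClient.
# Verify module paths and method names against your installed SDK version;
# "eth0" is the network interface wired to the Go2 and may differ on your machine.
import sys

from unitree_sdk2py.core.channel import ChannelFactoryInitialize
from unitree_sdk2py.go2.sport.sport_client import SportClient

ChannelFactoryInitialize(0, "eth0")  # DDS setup on the interface connected to the robot

sport = SportClient()
sport.SetTimeout(10.0)
sport.Init()

def run_voice_command(command_text: str) -> None:
    cmd = command_text.strip().lower()
    if "forward" in cmd:
        sport.Move(0.3, 0.0, 0.0)    # vx, vy, vyaw: walk forward slowly
    elif "back" in cmd:
        sport.Move(-0.3, 0.0, 0.0)   # walk backward slowly
    elif "stop" in cmd or "halt" in cmd:
        sport.StopMove()             # cancel the current velocity command
    elif "stand" in cmd:
        sport.StandUp()
    elif "sit" in cmd or "lie" in cmd:
        sport.StandDown()            # lower the body (closest built-in to "sit down")
    else:
        print(f"Unrecognized command: {command_text}", file=sys.stderr)
```
Before layering voice on top, run the SDK’s own example scripts to confirm that the connection and these calls work on your firmware and SDK version.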
**Testing and Refinement:** Start with simple tests. For instance, use voice to trigger “sit” and “stand” by mapping those words to a basic SDK posture-control example. Once that works reliably with your STT, add more commands. Keep phrases distinct to avoid confusion, and account for near-synonyms: “stop” is unambiguous, but if people might also say “halt”, your parser should recognize both. Also consider adding voice confirmations: after the robot executes a command, you can use TTS to have it say “Done” or nod its head (BenBen gives this kind of friendly acknowledgement).
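A simple way to build in that tolerance is to match against small keyword sets rather than single words. Below is a minimal sketch; the word lists are illustrative, and `bot` stands for the same hypothetical robot object used in the example above.
```python
# Match near-synonyms with keyword sets; the word lists are illustrative and
# `bot` is the hypothetical robot object from the earlier example.
STOP_WORDS = {"stop", "halt", "freeze", "stay"}
FOLLOW_WORDS = {"follow", "come with", "accompany"}

def dispatch(bot, command_text: str) -> None:
    cmd = command_text.lower()
    if any(word in cmd for word in STOP_WORDS):
        bot.stop()
    elif any(word in cmd for word in FOLLOW_WORDS):
        bot.enter_accompany_mode()  # pseudo-call, as in the example above
    else:
        print(f"No matching command for: {command_text}")
```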
In summary, **the built-in speech recognition can directly control movements** (no coding needed for the default set), but **for custom commands or integration with your own systems, you’ll combine STT with the Unitree SDK**. The Go2 EDU is designed for “secondary development”, so it encourages you to write programs that take sensor input (which can include the microphone input you obtain) and generate actuator commands. Voice is just another sensor input – once you have the text, you have full freedom to drive the robot in code.
## 4. SDK and API Hooks for Voice Features
**Official SDK/API:** Unitree provides SDKs primarily for **movement, navigation, and perception**, not specifically for the voice assistant. The **Unitree Legged SDK** (and the newer `unitree_sdk2`) let you connect a PC or an onboard computer to the robot’s control loop to send commands and read sensors in real-time ([ChatGPT and Go1 Tutorial - Robots - DroneBlocks Drone Coding Discussion Groups & Community](https://community.droneblocks.io/t/chatgpt-and-go1-tutorial/1016#:~:text=Image%20github)). For example, you can control joint angles, get IMU data, camera feed, etc. But **there are no official SDK endpoints to directly access the voice recognition or TTS engine** – those remain part of Unitree’s proprietary system. The **“secondary development manual”** referenced in the product docs covers how to use the motion API, not how to modify voice functions ([Go2英文v1.1](https://static.generation-robots.com/media/brochure-unitree-go2-en.pdf#:~:text=,read%20the%20secondary%20development%20manual)) ([Go2英文v1.1](https://static.generation-robots.com/media/brochure-unitree-go2-en.pdf#:~:text=,Type%20AIR%20PRO%20EDU)).
In practical terms, this means:
- You cannot call something like `robot.voice.listen()` or `robot.voice.speak()` through the provided SDK.
- You also cannot retrieve the transcript of what the user said via an official API; the voice commands go straight into the black-box assistant and are not exposed to user code.
**Unofficial/Community Solutions:** Given the above, the community has stepped in to fill the gap:
- The earlier-mentioned `go2_webrtc_connect` library is an **unofficial API** of sorts for audio/video streaming. It basically hooks into the same interface the official app uses, giving you programmatic access to the mic and speaker streams ([GitHub - legion1581/go2_webrtc_connect: Unitree Go2 WebRTC driver](https://github.com/legion1581/go2_webrtc_connect#:~:text=Audio%20and%20Video%20Support)). This isn’t endorsed by Unitree, but it’s a clever reverse-engineering that many researchers and developers find useful for integrating custom voice or teleoperation systems.
- Some developers have also run **ROS (Robot Operating System)** on Unitree robots and integrated off-the-shelf voice packages. For instance, there are ROS nodes for speech recognition (connecting to Google Speech or offline PocketSphinx) and for TTS. A YouTube demo shows a ROS-integrated voice command control for Unitree Go2, where speech commands get mapped to robot actions ([Self-made ROS-integrated voice command control for Unitree GO2](https://www.youtube.com/watch?v=2xzkja-AA_Y#:~:text=GO2%20www,speech%20recognition%20with%20robot%20commands)). Using ROS or custom code, you effectively create your own “voice API” on top of the hardware.
**Using the SDK with Voice:** Even though the SDK doesn’t know about voice, it’s crucial for executing actions in response to voice. The recommended architecture is **decoupling voice I/O and robot control**:
- Use whatever means to get voice input (as discussed, possibly custom code or the WebRTC hack).
- Use the **SDK’s motion commands** to make the robot do things based on that input.
- Optionally, use the SDK to feed data back into your voice responses. For example, if you ask the robot “What is your battery level?”, your code can query `robot.get_battery_state()` from the SDK and then use TTS to speak the answer. In the built-in assistant, such questions are answered by the AI (which likely has access to some API for battery info). In your custom setup, you’d manually handle those by bridging the SDK and your voice assistant logic.
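To illustrate that last point, here is a minimal sketch of bridging an SDK query into a spoken answer. The `get_battery_percentage()` helper is purely hypothetical (the real SDK exposes battery data through its state messages; check the docs for your version), and the speech output uses the offline `pyttsx3` library, though any of the TTS options discussed earlier would work just as well.
```python
# Minimal sketch: answer "What is your battery level?" by bridging robot state into TTS.
# get_battery_percentage() is a hypothetical placeholder; replace it with however
# your SDK version exposes battery/BMS state. pyttsx3 is an offline TTS engine.
import pyttsx3

def get_battery_percentage() -> int:
    # Hypothetical placeholder: query the robot's state via the SDK here.
    raise NotImplementedError("Wire this up to your SDK's battery/BMS state query")

def answer_battery_question() -> None:
    level = get_battery_percentage()
    reply = f"My battery is at about {level} percent."
    engine = pyttsx3.init()
    engine.say(reply)
    engine.runAndWait()

# In your main loop: if "battery" appears in the transcribed question, call
# answer_battery_question() instead of forwarding the text to ChatGPT.
```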
**Summary of SDK/API Options:**
- **Movement and Sensors:** Use Unitree’s official SDK (C++ or Python bindings) – this is well-documented for things like walking, joint control, camera feed (the Go2 has an API for getting the camera or LiDAR data), etc. For example, the Python SDK on GitHub (unitree_sdk2_python) provides access to robot state and control functions ([ChatGPT and Go1 Tutorial - Robots - DroneBlocks Drone Coding Discussion Groups & Community](https://community.droneblocks.io/t/chatgpt-and-go1-tutorial/1016#:~:text=Image%20github)).
- **Voice and Audio:** No official API. Use OS-level access or community tools for audio, and third-party APIs for speech recognition/synthesis.
- **Cloud AI (BenBen):** No official access point for hooking your code into the BenBen assistant. It’s essentially a closed loop from the mobile app to Unitree servers. If you attempt to intercept it, you’re in unsupported territory (as the researchers did when *jailbreaking* the robot’s AI to get its prompt ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=Last%20but%20certainly%20not%20least%2C,publicly%20available%20for%20the%20Go2))).
In conclusion, Unitree’s Go2 EDU has impressive built-in voice interaction capabilities, but to leverage them in custom projects you’ll mostly be working around the provided system:
- **Yes**, you can integrate OpenAI’s API – by recreating the voice pipeline yourself (the robot already does it internally, but you can’t directly tap in, so you replicate the idea with your own code) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=All%20of%20this%20progress%20has,user%27s%20voice%20commands%20as%20input)) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=is%20through%20voice%20queries%2C%20we,publicly%20available%20for%20the%20Go2)).
- **Yes**, you can use the onboard mic and speaker directly – the hardware is there and accessible via low-level interfaces, even though the official SDK doesn’t cover it ([Unitree Go2 Edu Standard](https://store.hp-drones.com/en/unitree/2558-EAURGO2215291-9010189215291.html#:~:text=bow%2C%20a%20variety%20of%20creative,actions%20with%20music%20and%20lights)) ([[Unitree Go2] Access to Speaker and Microphone - Software - Robot Forum | MYBOTSHOP](https://forum.mybotshop.de/t/unitree-go2-access-to-speaker-and-microphone/1036#:~:text=Dear%20%40chr99th)).
- **Voice to move** is absolutely possible – either via the native commands (for built-in ones like “dance”) or via your code mapping spoken words to SDK motor commands ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=,stop_music)) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=1,Master%3A%20Stop%20singing)).
- **SDK/API for voice** is unofficial – you’ll use a combination of Unitree’s SDK (for motion) and your own or community solutions for the speech part ([[Unitree Go2] Access to Speaker and Microphone - Software - Robot Forum | MYBOTSHOP](https://forum.mybotshop.de/t/unitree-go2-access-to-speaker-and-microphone/1036#:~:text=Dear%20%40chr99th)) ([[Unitree Go2] Access to Speaker and Microphone : r/robotics](https://www.reddit.com/r/robotics/comments/1eyd9y2/unitree_go2_access_to_speaker_and_microphone/#:~:text=Aggravating)).
By following the step-by-step approaches outlined above, you can achieve real-time voice interaction with the Go2. For instance, you could ask the robot a question and have it answer using ChatGPT, or tell it to perform a custom sequence – all using its native mic and speaker. This melding of Unitree’s robotics SDK with AI services like OpenAI’s gives you a powerful platform for experimentation in embodied AI and voice-controlled robotics.
**Sources:**
- Unitree Go2 product features and voice assistant description ([Unitree Go2 Edu Standard](https://store.hp-drones.com/en/unitree/2558-EAURGO2215291-9010189215291.html#:~:text=bow%2C%20a%20variety%20of%20creative,actions%20with%20music%20and%20lights)) ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=turned%20on%20and%20connected%20to,to%20playing%20your%20preferred%20music))
- Alex Robey’s research on the Go2’s GPT-based voice assistant (BenBen) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=All%20of%20this%20progress%20has,user%27s%20voice%20commands%20as%20input)) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=Last%20but%20certainly%20not%20least%2C,publicly%20available%20for%20the%20Go2))
- DroneBlocks community insights on integrating ChatGPT and Unitree SDK ([ChatGPT and Go1 Tutorial - Robots - DroneBlocks Drone Coding Discussion Groups & Community](https://community.droneblocks.io/t/chatgpt-and-go1-tutorial/1016#:~:text=It%20should%20be%20possible%20to,could%20be%20run%20in%20Python))
- Forum discussions on accessing Go2’s audio hardware ([[Unitree Go2] Access to Speaker and Microphone - Software - Robot Forum | MYBOTSHOP](https://forum.mybotshop.de/t/unitree-go2-access-to-speaker-and-microphone/1036#:~:text=Dear%20%40chr99th)) ([[Unitree Go2] Access to Speaker and Microphone : r/robotics](https://www.reddit.com/r/robotics/comments/1eyd9y2/unitree_go2_access_to_speaker_and_microphone/#:~:text=Aggravating))
- Leaked BenBen command definitions (showing voice-commanded actions) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=,stop_music)) ([Alex Robey :: RoboPAIR](https://arobey1.github.io/writing/jailbreakingrobots.html#:~:text=1,Master%3A%20Stop%20singing))
- Unitree Go2 documentation notes (voice command requires internet) ([Go2 Documentation | QUADRUPED ROBOTICS](https://www.docs.quadruped.de/projects/go2/html/controller.html#:~:text=Note))