# Real-Time Speech Transcription with Diart on macOS
> Reference: [Color Your Captions: Streamlining Live Transcriptions With “diart” and OpenAI’s Whisper](https://betterprogramming.pub/color-your-captions-streamlining-live-transcriptions-with-diart-and-openais-whisper-6203350234ef)
> GitHub source: [juanmc2005/diart_whisper.py](https://gist.github.com/juanmc2005/ed6413e697e176cb36a149d8c40a3a5b)
## Install Dependencies
1. Install the [Diart](https://github.com/juanmc2005/diart?tab=readme-ov-file#-installation) package (a speaker diarization tool)
1. Make sure the required system libraries are installed
```bash
# Versions recommended by the official docs
# ffmpeg < 4.4
# portaudio == 19.6.X
# libsndfile >= 1.2.2
```
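If ffmpeg or libsndfile are missing, they can be installed with Homebrew. This is only a sketch; the versions Homebrew pulls may not match the recommended ones above exactly:
```bash
brew install ffmpeg libsndfile
```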
portaudio is problematic on macOS; installing the version shown below fixes it, [see this issue](https://github.com/spatialaudio/python-sounddevice/issues/299#issuecomment-752473402). The install command follows the version listing.
```bash
# My installed versions
brew list --versions ffmpeg portaudio libsndfile
# ffmpeg 7.1_3
# libsndfile 1.2.2
# portaudio HEAD-aa1cfb0
```
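The `HEAD-aa1cfb0` build shown above comes from installing portaudio from Homebrew's HEAD, which is the fix suggested in the linked issue (a sketch; Homebrew builds it from source):
```bash
brew uninstall portaudio   # only if a stable release is already installed
brew install portaudio --HEAD
```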
2. Create a Python 3.10 virtual environment and install diart
1. Confirm that Python 3.10 is installed:
```bash
python3.10 --version
```
If it is not installed yet, you can install it with Homebrew:
```bash
brew install python@3.10
```
2. Create the virtual environment:
```bash
python3.10 -m venv diart_env
```
3. Activate the virtual environment:
```bash
source diart_env/bin/activate
```
4. Install diart:
```bash
pip install diart
```
2. Install [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped#first-installation) with `pip install git+https://github.com/linto-ai/whisper-timestamped`
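Before moving on, a quick sanity check can confirm that both packages import correctly inside the activated `diart_env` environment (a minimal sketch):
```bash
python -c "import diart, whisper_timestamped; print('imports OK')"
```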
## Obtain Access to the 🎹 pyannote Models
By default, diart builds on [pyannote.audio](https://github.com/pyannote/pyannote-audio) models hosted on [Hugging Face](https://huggingface.co/).
To use these models, first create an account and then follow these steps:
1. [Accept the user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
2. [Accept the user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the newer `pyannote/segmentation-3.0` model
3. [Accept the user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
4. Generate a Hugging Face token
1. Open the token settings page: [Hugging Face token settings](https://huggingface.co/settings/tokens)
2. Click "New token" to generate a new token, and make sure to select Read or Write permission.
- Read permission is enough if you only need to download models.
- Choose Write permission if you also need to upload models or push to the Hugging Face Hub.
3. Copy the newly generated token.
5. Log in to Hugging Face from the CLI
```bash
pip install huggingface_hub
# Log in
huggingface-cli login
# Check that the login succeeded
huggingface-cli whoami
```
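To confirm that the gated pyannote models can actually be downloaded with your token before running the full pipeline, a minimal sketch like the following can be used; it assumes `use_auth_token=True` makes pyannote.audio reuse the token cached by `huggingface-cli login`:
```python
# Minimal sketch: verify access to a gated pyannote model
from pyannote.audio import Model

# Reuses the token stored by `huggingface-cli login`
model = Model.from_pretrained("pyannote/segmentation", use_auth_token=True)
print("Model loaded:", model.__class__.__name__)
```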
## Run
Create a file named `diart_whisper.py` with the following content:
```python
import logging
import os
import sys
import traceback
from contextlib import contextmanager
import diart.operators as dops
import numpy as np
import rich
import rx.operators as ops
import whisper_timestamped as whisper
from diart import SpeakerDiarization, SpeakerDiarizationConfig
from diart.sources import WebSocketAudioSource
from pyannote.core import Annotation, SlidingWindowFeature, SlidingWindow, Segment
def concat(chunks, collar=0.05):
"""
Concatenate predictions and audio
given a list of `(diarization, waveform)` pairs
and merge contiguous single-speaker regions
with pauses shorter than `collar` seconds.
"""
first_annotation = chunks[0][0]
first_waveform = chunks[0][1]
annotation = Annotation(uri=first_annotation.uri)
data = []
for ann, wav in chunks:
annotation.update(ann)
data.append(wav.data)
annotation = annotation.support(collar)
window = SlidingWindow(
first_waveform.sliding_window.duration,
first_waveform.sliding_window.step,
first_waveform.sliding_window.start,
)
data = np.concatenate(data, axis=0)
return annotation, SlidingWindowFeature(data, window)
def colorize_transcription(transcription):
"""
Unify a speaker-aware transcription represented as
a list of `(speaker: int, text: str)` pairs
into a single text colored by speakers.
"""
colors = 2 * [
"bright_red", "bright_blue", "bright_green", "orange3", "deep_pink1",
"yellow2", "magenta", "cyan", "bright_magenta", "dodger_blue2"
]
result = []
for speaker, text in transcription:
if speaker == -1:
            # No speaker found for this text, use the default terminal color
result.append(text)
else:
result.append(f"[{colors[speaker]}]{text}")
return "\n".join(result)
@contextmanager
def suppress_stdout():
# Auxiliary function to suppress Whisper logs (it is quite verbose)
# All credit goes to: https://thesmithfam.org/blog/2012/10/25/temporarily-suppress-console-output-in-python/
with open(os.devnull, "w") as devnull:
old_stdout = sys.stdout
sys.stdout = devnull
try:
yield
finally:
sys.stdout = old_stdout
class WhisperTranscriber:
def __init__(self, model="small", device=None):
self.model = whisper.load_model(model, device=device)
self._buffer = ""
def transcribe(self, waveform):
"""Transcribe audio using Whisper"""
# Pad/trim audio to fit 30 seconds as required by Whisper
audio = waveform.data.astype("float32").reshape(-1)
audio = whisper.pad_or_trim(audio)
# Transcribe the given audio while suppressing logs
with suppress_stdout():
transcription = whisper.transcribe(
self.model,
audio,
# We use past transcriptions to condition the model
initial_prompt=self._buffer,
verbose=True # to avoid progress bar
)
return transcription
def identify_speakers(self, transcription, diarization, time_shift):
"""Iterate over transcription segments to assign speakers"""
speaker_captions = []
for segment in transcription["segments"]:
# Crop diarization to the segment timestamps
start = time_shift + segment["words"][0]["start"]
end = time_shift + segment["words"][-1]["end"]
dia = diarization.crop(Segment(start, end))
# Assign a speaker to the segment based on diarization
speakers = dia.labels()
num_speakers = len(speakers)
if num_speakers == 0:
# No speakers were detected
caption = (-1, segment["text"])
elif num_speakers == 1:
# Only one speaker is active in this segment
spk_id = int(speakers[0].split("speaker")[1])
caption = (spk_id, segment["text"])
else:
# Multiple speakers, select the one that speaks the most
max_speaker = int(np.argmax([
dia.label_duration(spk) for spk in speakers
]))
                # Map the argmax result back to the actual speaker label
                spk_id = int(speakers[max_speaker].split("speaker")[1])
                caption = (spk_id, segment["text"])
speaker_captions.append(caption)
return speaker_captions
def __call__(self, diarization, waveform):
# Step 1: Transcribe
transcription = self.transcribe(waveform)
# Update transcription buffer
self._buffer += transcription["text"]
# The audio may not be the beginning of the conversation
time_shift = waveform.sliding_window.start
# Step 2: Assign speakers
speaker_transcriptions = self.identify_speakers(transcription, diarization, time_shift)
return speaker_transcriptions
# Suppress whisper-timestamped warnings for a clean output
logging.getLogger("whisper_timestamped").setLevel(logging.ERROR)
# If you have a GPU, you can also set device=torch.device("cuda")
config = SpeakerDiarizationConfig(
    duration=5,        # length of the sliding analysis window, in seconds
    step=0.5,          # window step, in seconds
    latency="min",     # lowest possible latency (equal to the step)
    tau_active=0.5,    # probability threshold above which a speaker is considered active
    rho_update=0.1,    # minimum speech ratio required to update a speaker centroid
    delta_new=0.57     # embedding distance threshold for creating a new speaker
)
dia = SpeakerDiarization(config)
source = WebSocketAudioSource(config.sample_rate)
# If you have a GPU, you can also set device="cuda"
asr = WhisperTranscriber(model="small")
# Split the stream into 2s chunks for transcription
transcription_duration = 2
# Apply models in batches for better efficiency
batch_size = int(transcription_duration // config.step)
# Chain of operations to apply on the stream of microphone audio
source.stream.pipe(
# Format audio stream to sliding windows of 5s with a step of 500ms
dops.rearrange_audio_stream(
config.duration, config.step, config.sample_rate
),
# Wait until a batch is full
# The output is a list of audio chunks
ops.buffer_with_count(count=batch_size),
# Obtain diarization prediction
# The output is a list of pairs `(diarization, audio chunk)`
ops.map(dia),
# Concatenate 500ms predictions/chunks to form a single 2s chunk
ops.map(concat),
# Ignore this chunk if it does not contain speech
ops.filter(lambda ann_wav: ann_wav[0].get_timeline().duration() > 0),
# Obtain speaker-aware transcriptions
# The output is a list of pairs `(speaker: int, caption: str)`
ops.starmap(asr),
# Color transcriptions according to the speaker
# The output is plain text with color references for rich
ops.map(colorize_transcription),
).subscribe(
on_next=rich.print, # print colored text
on_error=lambda _: traceback.print_exc() # print stacktrace if error
)
print("Listening...")
source.read()
```
1. Run the main script
```bash
python diart_whisper.py
```
If you hit a NumPy compatibility issue in this environment, downgrade it:
```bash
pip install "numpy<2"
```
2. Start the WebSocket audio stream
Open another terminal and run the following command:
```bash
diart.client microphone --host 127.0.0.1 --port 7007 --sample-rate 16000 --step 0.5
```
This command streams microphone audio to the specified WebSocket port, where it is received and processed by `diart_whisper.py`.
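The client's `--sample-rate` and `--step` values should match `config.sample_rate` and `config.step` in `diart_whisper.py`, and `--host`/`--port` must match what `WebSocketAudioSource` is listening on; since the script above does not set them, it uses diart's defaults. If you need different values, they can be passed explicitly, as in this sketch that assumes `WebSocketAudioSource` accepts `host` and `port` keyword arguments:
```python
# Sketch: bind the WebSocket audio server to an explicit host/port
source = WebSocketAudioSource(config.sample_rate, host="127.0.0.1", port=7007)
```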
3. Done

## Appendix
- [Source code](https://github.com/romazrau/whisper-diart-live)