# Real-Time Speech Transcription with Diart on macOS
> Reference: [Color Your Captions: Streamlining Live Transcriptions With “diart” and OpenAI’s Whisper](https://betterprogramming.pub/color-your-captions-streamlining-live-transcriptions-with-diart-and-openais-whisper-6203350234ef)
> GitHub source: [juanmc2005/diart_whisper.py](https://gist.github.com/juanmc2005/ed6413e697e176cb36a149d8c40a3a5b)
## Install Dependencies
1. Install the [Diart](https://github.com/juanmc2005/diart?tab=readme-ov-file#-installation) package (a speaker diarization tool)
1. Make sure the required system libraries are installed
```bash
# Versions recommended by the official docs
# ffmpeg < 4.4
# portaudio == 19.6.X
# libsndfile >= 1.2.2
```
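If ffmpeg or libsndfile are missing, they can be installed with Homebrew. This is only a sketch; the versions Homebrew pulls may not match the recommended ones above exactly:
```bash
brew install ffmpeg libsndfile
```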
portaudio is problematic on macOS; installing the version shown below fixes it, [see this issue](https://github.com/spatialaudio/python-sounddevice/issues/299#issuecomment-752473402). The install command follows the version listing.
```bash
# My installed versions
brew list --versions ffmpeg portaudio libsndfile
# ffmpeg 7.1_3
# libsndfile 1.2.2
# portaudio HEAD-aa1cfb0
```
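The `HEAD-aa1cfb0` build shown above comes from installing portaudio from Homebrew's HEAD, which is the fix suggested in the linked issue (a sketch; Homebrew builds it from source):
```bash
brew uninstall portaudio   # only if a stable release is already installed
brew install portaudio --HEAD
```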
2. Create a Python 3.10 virtual environment and install diart
1. Confirm that Python 3.10 is installed:
```bash
python3.10 --version
```
If it is not installed yet, you can install it with Homebrew:
```bash
brew install python@3.10
```
2. Create the virtual environment:
```bash
python3.10 -m venv diart_env
```
3. Activate the virtual environment:
```bash
source diart_env/bin/activate
```
4. Install diart:
```bash
pip install diart
```
2. Install [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped#first-installation) with `pip install git+https://github.com/linto-ai/whisper-timestamped`
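Before moving on, a quick sanity check can confirm that both packages import correctly inside the activated `diart_env` environment (a minimal sketch):
```bash
python -c "import diart, whisper_timestamped; print('imports OK')"
```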
## Obtain Access to the 🎹 pyannote Models
By default, diart builds on [pyannote.audio](https://github.com/pyannote/pyannote-audio) models hosted on [Hugging Face](https://huggingface.co/).
To use these models, first create an account and then follow these steps:
1. [Accept the user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
2. [Accept the user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the newer `pyannote/segmentation-3.0` model
3. [Accept the user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
4. Generate a Hugging Face token
1. Open the token settings page: [Hugging Face token settings](https://huggingface.co/settings/tokens)
2. Click "New token" to generate a new token, and make sure to select Read or Write permission.
- Read permission is enough if you only need to download models.
- Choose Write permission if you also need to upload models or push to the Hugging Face Hub.
3. Copy the newly generated token.
5. Log in to Hugging Face from the CLI
```bash
pip install huggingface_hub
# Log in
huggingface-cli login
# Check that the login succeeded
huggingface-cli whoami
```
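To confirm that the gated pyannote models can actually be downloaded with your token before running the full pipeline, a minimal sketch like the following can be used; it assumes `use_auth_token=True` makes pyannote.audio reuse the token cached by `huggingface-cli login`:
```python
# Minimal sketch: verify access to a gated pyannote model
from pyannote.audio import Model

# Reuses the token stored by `huggingface-cli login`
model = Model.from_pretrained("pyannote/segmentation", use_auth_token=True)
print("Model loaded:", model.__class__.__name__)
```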
## Run
Create a file named `diart_whisper.py` with the following content:
```python
import logging
import os
import sys
import traceback
from contextlib import contextmanager
import diart.operators as dops
import numpy as np
import rich
import rx.operators as ops
import whisper_timestamped as whisper
from diart import SpeakerDiarization, SpeakerDiarizationConfig
from diart.sources import WebSocketAudioSource
from pyannote.core import Annotation, SlidingWindowFeature, SlidingWindow, Segment
def concat(chunks, collar=0.05):
"""
Concatenate predictions and audio
given a list of `(diarization, waveform)` pairs
and merge contiguous single-speaker regions
with pauses shorter than `collar` seconds.
"""
first_annotation = chunks[0][0]
first_waveform = chunks[0][1]
annotation = Annotation(uri=first_annotation.uri)
data = []
for ann, wav in chunks:
annotation.update(ann)
data.append(wav.data)
annotation = annotation.support(collar)
window = SlidingWindow(
first_waveform.sliding_window.duration,
first_waveform.sliding_window.step,
first_waveform.sliding_window.start,
)
data = np.concatenate(data, axis=0)
return annotation, SlidingWindowFeature(data, window)
def colorize_transcription(transcription):
"""
Unify a speaker-aware transcription represented as
a list of `(speaker: int, text: str)` pairs
into a single text colored by speakers.
"""
colors = 2 * [
"bright_red", "bright_blue", "bright_green", "orange3", "deep_pink1",
"yellow2", "magenta", "cyan", "bright_magenta", "dodger_blue2"
]
result = []
for speaker, text in transcription:
if speaker == -1:
            # No speaker found for this text, use the default terminal color
result.append(text)
else:
result.append(f"[{colors[speaker]}]{text}")
return "\n".join(result)
@contextmanager
def suppress_stdout():
# Auxiliary function to suppress Whisper logs (it is quite verbose)
# All credit goes to: https://thesmithfam.org/blog/2012/10/25/temporarily-suppress-console-output-in-python/
with open(os.devnull, "w") as devnull:
old_stdout = sys.stdout
sys.stdout = devnull
try:
yield
finally:
sys.stdout = old_stdout
class WhisperTranscriber:
def __init__(self, model="small", device=None):
self.model = whisper.load_model(model, device=device)
self._buffer = ""
def transcribe(self, waveform):
"""Transcribe audio using Whisper"""
# Pad/trim audio to fit 30 seconds as required by Whisper
audio = waveform.data.astype("float32").reshape(-1)
audio = whisper.pad_or_trim(audio)
# Transcribe the given audio while suppressing logs
with suppress_stdout():
transcription = whisper.transcribe(
self.model,
audio,
# We use past transcriptions to condition the model
initial_prompt=self._buffer,
verbose=True # to avoid progress bar
)
return transcription
def identify_speakers(self, transcription, diarization, time_shift):
"""Iterate over transcription segments to assign speakers"""
speaker_captions = []
for segment in transcription["segments"]:
# Crop diarization to the segment timestamps
start = time_shift + segment["words"][0]["start"]
end = time_shift + segment["words"][-1]["end"]
dia = diarization.crop(Segment(start, end))
# Assign a speaker to the segment based on diarization
speakers = dia.labels()
num_speakers = len(speakers)
if num_speakers == 0:
# No speakers were detected
caption = (-1, segment["text"])
elif num_speakers == 1:
# Only one speaker is active in this segment
spk_id = int(speakers[0].split("speaker")[1])
caption = (spk_id, segment["text"])
else:
# Multiple speakers, select the one that speaks the most
max_speaker = int(np.argmax([
dia.label_duration(spk) for spk in speakers
]))
                # Map the argmax result back to the actual speaker label
                spk_id = int(speakers[max_speaker].split("speaker")[1])
                caption = (spk_id, segment["text"])
speaker_captions.append(caption)
return speaker_captions
def __call__(self, diarization, waveform):
# Step 1: Transcribe
transcription = self.transcribe(waveform)
# Update transcription buffer
self._buffer += transcription["text"]
# The audio may not be the beginning of the conversation
time_shift = waveform.sliding_window.start
# Step 2: Assign speakers
speaker_transcriptions = self.identify_speakers(transcription, diarization, time_shift)
return speaker_transcriptions
# Suppress whisper-timestamped warnings for a clean output
logging.getLogger("whisper_timestamped").setLevel(logging.ERROR)
# If you have a GPU, you can also set device=torch.device("cuda")
config = SpeakerDiarizationConfig(
    duration=5,        # length of the sliding analysis window, in seconds
    step=0.5,          # window step, in seconds
    latency="min",     # lowest possible latency (equal to the step)
    tau_active=0.5,    # probability threshold above which a speaker is considered active
    rho_update=0.1,    # minimum speech ratio required to update a speaker centroid
    delta_new=0.57     # embedding distance threshold for creating a new speaker
)
dia = SpeakerDiarization(config)
source = WebSocketAudioSource(config.sample_rate)
# If you have a GPU, you can also set device="cuda"
asr = WhisperTranscriber(model="small")
# Split the stream into 2s chunks for transcription
transcription_duration = 2
# Apply models in batches for better efficiency
batch_size = int(transcription_duration // config.step)
# Chain of operations to apply on the stream of microphone audio
source.stream.pipe(
# Format audio stream to sliding windows of 5s with a step of 500ms
dops.rearrange_audio_stream(
config.duration, config.step, config.sample_rate
),
# Wait until a batch is full
# The output is a list of audio chunks
ops.buffer_with_count(count=batch_size),
# Obtain diarization prediction
# The output is a list of pairs `(diarization, audio chunk)`
ops.map(dia),
# Concatenate 500ms predictions/chunks to form a single 2s chunk
ops.map(concat),
# Ignore this chunk if it does not contain speech
ops.filter(lambda ann_wav: ann_wav[0].get_timeline().duration() > 0),
# Obtain speaker-aware transcriptions
# The output is a list of pairs `(speaker: int, caption: str)`
ops.starmap(asr),
# Color transcriptions according to the speaker
# The output is plain text with color references for rich
ops.map(colorize_transcription),
).subscribe(
on_next=rich.print, # print colored text
on_error=lambda _: traceback.print_exc() # print stacktrace if error
)
print("Listening...")
source.read()
```
1. Run the main script
```bash
python diart_whisper.py
```
If you hit a NumPy compatibility issue in this environment, downgrade it:
```bash
pip install "numpy<2"
```
2. Start the WebSocket audio stream
Open another terminal and run the following command:
```bash
diart.client microphone --host 127.0.0.1 --port 7007 --sample-rate 16000 --step 0.5
```
This command streams microphone audio to the specified WebSocket port, where it is received and processed by `diart_whisper.py`.
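The client's `--sample-rate` and `--step` values should match `config.sample_rate` and `config.step` in `diart_whisper.py`, and `--host`/`--port` must match what `WebSocketAudioSource` is listening on; since the script above does not set them, it uses diart's defaults. If you need different values, they can be passed explicitly, as in this sketch that assumes `WebSocketAudioSource` accepts `host` and `port` keyword arguments:
```python
# Sketch: bind the WebSocket audio server to an explicit host/port
source = WebSocketAudioSource(config.sample_rate, host="127.0.0.1", port=7007)
```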
3. Done

## Appendix
- [Source code](https://github.com/romazrau/whisper-diart-live)