Playing with OpenAI Whisper speech-to-text

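## Getting started

This time the request came from the service team in the engineering department: `speech-to-text`. The team wants to try producing meeting minutes quickly. I have played with the Speech-To-Text services of the major cloud vendors before (Azure, IBM, Google...), and I will keep my impressions of those results to myself.

![file](https://tech.cmoney.tw/wp-content/uploads/2023/07/image-1688716352204-1.png)

This time I went straight to an existing tool on GitHub, [Whisper Desktop](https://github.com/Const-me/Whisper/releases), which works offline, to try out the currently popular pretrained [Whisper](https://github.com/openai/whisper) model that OpenAI has released.

---

## Whisper overview and evaluation

OpenAI Whisper comes in five model sizes. The larger models are more accurate, but they use more resources and run more slowly. Every size except the largest also has an English-only variant, which tends to give better recognition results on English audio.

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of audio collected from the web, covering many languages and accents.

| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny | 39 M | `tiny.en` | `tiny` | ~1 GB | ~32x |
| base | 74 M | `base.en` | `base` | ~1 GB | ~16x |
| small | 244 M | `small.en` | `small` | ~2 GB | ~6x |
| medium | 769 M | `medium.en` | `medium` | ~5 GB | ~2x |
| large | 1550 M | N/A | `large` | ~10 GB | 1x |

> Table source: [OpenAI Whisper](https://github.com/openai/whisper/blob/main/README.md#available-models-and-languages) on GitHub

---

Whisper's recognition accuracy varies considerably by language. The figure below shows the WER (Word Error Rate) for each language (lower is better): English is recognized very well, while the error rate for Chinese is about 14.7%.

![file](https://tech.cmoney.tw/wp-content/uploads/2023/07/image-1688713801363.png)

## Installation and running

Here are my notes on installing and running Whisper on Windows:

1. Go to the [Whisper Desktop](https://github.com/Const-me/Whisper/releases) GitHub page, find the download link for the latest version under "Releases" on the right, download and unzip it, then run "WhisperDesktop" directly.

   ![file](https://tech.cmoney.tw/wp-content/uploads/2023/07/image-1688714562503.png)
2. Download a Whisper [model file](https://huggingface.co/ggerganov/whisper.cpp/tree/main). I recommend the `ggml-medium.bin` model.
3. Launch `WhisperDesktop` and select the model file you downloaded.

   ![file](https://tech.cmoney.tw/wp-content/uploads/2023/07/image-1688715055833.png)
4. On the transcription screen, click `Transcribe`:

   ![file](https://tech.cmoney.tw/wp-content/uploads/2023/07/image-1688715183987.png)

---

## Test runs

* File: an 11 min 38 s meeting recording

### Environment 1: integrated graphics only (Intel(R) UHD Graphics 630)

* Model: large, `ggml-large.bin`
* Time taken: 3 hours

![file](https://tech.cmoney.tw/wp-content/uploads/2023/07/image-1688715424971-1.png)

### Environment 2: NVIDIA GTX 1050

* Model: large, `ggml-large.bin`
* Time taken: 20 min 16 s

![file](https://tech.cmoney.tw/wp-content/uploads/2023/07/image-1688715447154.png)

### Environment 3: NVIDIA GTX 1050

* Model: `ggml-medium.bin`
* Time taken: 4 min 54 s

![file](https://tech.cmoney.tw/wp-content/uploads/2023/07/image-1688715532834.png)

---

## Results: initial tests

* With the `large model`, integrated graphics versus the GTX 1050 came out at roughly a 9x difference in processing time (3 hours vs. 20 min 16 s).
* Between the `large and medium models` there was no obvious difference in transcription quality, yet on the same GPU the medium model finished about 4x faster (20 min 16 s vs. 4 min 54 s). Accuracy was very high (90%) with few wrong characters, although technical terms were more error-prone. For example:
  * 泛型 (Generic) → 泛行
  * Clone → Cleon

## Other references

* Various performance benchmarks found online
  * [Performance benchmark of different GPUs](https://github.com/openai/whisper/discussions/918)
  * [OpenAI Whisper Audio Transcription Benchmarked on 18 GPUs: Up to 3,000 WPM](https://www.tomshardware.com/news/whisper-audio-transcription-gpus-benchmarked)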
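
As a cross-check on the timings from the three test environments, here is a small Python sketch that turns them into real-time factors and speedups. The durations are copied from the measurements in this post; the run labels and the helper names `real_time_factor` and `speedup` are made up for this illustration and are not part of Whisper or Whisper Desktop.

```python
# Audio length and wall-clock times copied from the test runs in this post.
AUDIO_SECONDS = 11 * 60 + 38  # 11 min 38 s meeting recording

# The labels are informal names chosen for this sketch, not tool output.
RUNS = {
    "UHD 630 + large": 3 * 60 * 60,    # about 3 hours
    "GTX 1050 + large": 20 * 60 + 16,  # 20 min 16 s
    "GTX 1050 + medium": 4 * 60 + 54,  # 4 min 54 s
}


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times sooner the fast run finished compared with the slow run."""
    return slow_seconds / fast_seconds


for label, seconds in RUNS.items():
    print(f"{label}: real-time factor {real_time_factor(seconds, AUDIO_SECONDS):.2f}")

gpu_gain = speedup(RUNS["UHD 630 + large"], RUNS["GTX 1050 + large"])
model_gain = speedup(RUNS["GTX 1050 + large"], RUNS["GTX 1050 + medium"])
print(f"large model, integrated vs. GTX 1050: {gpu_gain:.1f}x")
print(f"GTX 1050, large vs. medium model: {model_gain:.1f}x")
```

On these numbers the ratios come out to roughly 8.9x and 4.1x, and the medium model on the GTX 1050 is the only setup that runs faster than real time (a real-time factor of about 0.42).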
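
For readers unfamiliar with the WER figure quoted in the evaluation section, the following is a minimal sketch of how a word error rate is usually computed: the edit distance between the reference and the transcript, divided by the number of reference tokens. The function name `word_error_rate` and the sample sentences are hypothetical, and this is not the scoring code behind OpenAI's chart. For Chinese one would typically pass individual characters instead of space-separated words; that tokenization choice is an assumption here.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """(substitutions + deletions + insertions) / number of reference tokens."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j tokens of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from transcript
                curr[j - 1] + 1,     # insertion: extra token in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Made-up sentences that mimic the "Clone -> Cleon" mistake seen in the test.
reference = "please clone the generic list".split()
hypothesis = "please cleon the generic list".split()
print(word_error_rate(reference, hypothesis))  # 0.2 (1 substitution / 5 words)
```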