# Speech-to-Text (STT) with Whisper to Generate Meeting Minutes
Whisper is an automatic speech recognition (ASR) model developed by OpenAI, trained on a large dataset of diverse audio. While it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be off by several seconds. OpenAI's Whisper also does not natively support batched inference.
## Overview
The idea is to take the transcript produced by Whisper and pass it through OpenAI to generate meeting minutes.
The OpenAI part is not covered in this experiment, since we are all familiar with OpenAI models being able to produce a requested format from given input data. Instead, this document focuses on Whisper and diarization (speaker classification).

## Objective/Purpose
- Convert Japanese audio to text
- Classify speakers using Hugging Face diarization via WhisperX
## Environment
- Ubuntu 20.04
- RAM: 32 GB
- GPU: 8 GB (VRAM)
- Python 3.10

Setting up CUDA can hit missing-library issues, so the process may be tedious.
You can match the torch version via the [Whisper official page](https://github.com/openai/whisper#setup) or use the tested versions below:
```
pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
```
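After installing, a quick sanity check (a minimal sketch, not part of the original steps) confirms that PyTorch can actually see the GPU:
```
import torch

# Confirm that the CUDA build of PyTorch is installed and that a GPU is visible.
print(torch.__version__)          # e.g. 2.0.0+cu118
print(torch.cuda.is_available())  # True on a working CUDA setup
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```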
## Execution Process
1. Set up a Python environment
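For example, a minimal setup with `venv` (assuming Python 3.10 is already installed; the environment name `whisperx-env` is arbitrary, and conda works just as well):
```
python3.10 -m venv whisperx-env
source whisperx-env/bin/activate
pip install --upgrade pip
```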
2. Clone and install WhisperX
```
git clone https://github.com/m-bain/whisperX.git
cd whisperX
pip install -e .
```
Refer to the official WhisperX page if any errors or issues occur.
3. Import libraries
```
import whisperx
import gc
import json
import torch  # used below when releasing GPU memory
```
4. Set the device, batch size, and compute type, and assign the audio path
```
device = "cuda"
# device = "cpu"
audio_file = "/ambl/speech2text/background_study/test_audio/新規録音 #29.m4a"
batch_size = 2 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)
```
Tested with mp3 and m4a formats; no pre-processing was implemented.
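If the same script should also run on machines without a GPU, a small sketch like the following (our own addition, not part of the original setup) picks the device and compute type automatically:
```
import torch

# Fall back to CPU (and a CPU-friendly compute type) when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"
```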
5. Load a transcription model. According to the official documentation, the large-v2 model has the best STT accuracy.
```
# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)
# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size, language='ja')
# Optional: release memory by deleting the model if low on GPU resources
# del model; gc.collect(); torch.cuda.empty_cache()
```
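Before alignment, the segments carry only utterance-level timestamps; a quick inspection (a minimal check) confirms the transcription worked:
```
# Each segment has 'text', 'start', and 'end'; timestamps are utterance-level here.
for seg in result["segments"][:3]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```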
6. Align the output
```
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
```
At this stage the Whisper STT process is complete, and you can inspect the output as a dictionary.
For example:
```
"segments": [
    {
        "text": "おはようございますよろしくお願いいたします本日高橋が結合試験の作業の方でこちらのミーティングを決定しておりますのでアンブルガワ参加者は以上となりますよろしくお願いいたします",
        "start": 23.951,
        "end": 50.435
    }
]
```
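After alignment, each segment additionally contains a `words` list with per-word timestamps, which the diarization step below relies on. A minimal check:
```
# Word-level timestamps produced by the alignment model.
for word in result["segments"][0]["words"][:5]:
    print(word.get("word"), word.get("start"), word.get("end"), word.get("score"))
```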
## Additional: Speaker Separation (Classification)
In this experiment, the Hugging Face speaker-diarization and speaker-segmentation models are used together with Whisper.
7. Create an account and get an access token from [Hugging Face](https://huggingface.co/email_confirmation)
```
# Optional: delete the alignment model if low on GPU resources
# del model_a; gc.collect(); torch.cuda.empty_cache()
hugging_face_token = "Huggingfaceapikey"  # replace with your Hugging Face access token
min_speakers = 2
max_speakers = 5
```
8. Accept the user conditions for the version of the diarization model that you want to use
[segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
[speaker-diarization-3.0](https://huggingface.co/pyannote/speaker-diarization-3.0)
Requirements:
- Install pyannote.audio 3.0 with `pip install pyannote.audio`
- Accept the pyannote/segmentation-3.0 user conditions
- Accept the pyannote/speaker-diarization-3.0 user conditions
- Create an access token at hf.co/settings/tokens

Note: these requirements may change over time. The code below loads pyannote/speaker-diarization-3.1, so accept the user conditions for whichever version you actually load.
9. Set up the diarization model and save the result as a JSON file
One of the major issues is that you have to predefine the number of speakers in the audio. It works best with 1 or 2 speakers, but you can also specify the minimum and maximum number of speakers in the given audio.
```
# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hugging_face_token, device=device, model_name="pyannote/speaker-diarization-3.1")
# add min/max number of speakers if known
diarize_segments = diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)
result = whisperx.assign_word_speakers(diarize_segments, result)
# print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs
json_file_path = f"before_alignment_{device}_bth_{batch_size}_{compute_type}_speaker.json"
# Dump the list into the JSON file
with open(json_file_path, 'w', encoding='utf-8') as json_file:
    json.dump(result["segments"], json_file, ensure_ascii=False)
```
10. Output preview
The output contains, for the given audio, the transcribed text with start and end timestamps and a speaker assignment for each phrase, as well as for each word within that phrase.
Sample:
```
{
"start": 23.991,
"end": 50.095,
"text": "おはようございますよろしくお願いいたします本日高橋が結合試験の作業の方でこちらのミーティングを決定しておりますのでアンブルガワ参加者は以上となりますよろしくお願いいたします",
"words": [
{
"word": "お",
"start": 23.991,
"end": 24.011,
"score": 0.0,
"speaker": "SPEAKER_00"
},
{
"word": "は",
"start": 24.011,
"end": 24.071,
"score": 0.663,
"speaker": "SPEAKER_00"
},
{
"word": "よ",
"start": 24.071,
"end": 24.091,
"score": 0.0,
"speaker": "SPEAKER_00"
},
{
"word": "う",
"start": 24.091,
"end": 24.111,
"score": 0.0,
"speaker": "SPEAKER_00"
},
{
"word": "ご",
"start": 24.111,
"end": 24.231,
"score": 0.833,
"speaker": "SPEAKER_00"
},
{
"word": "ざ",
"start": 24.231,
"end": 24.311,
"score": 0.79,
"speaker": "SPEAKER_00"
},
{
"word": "い",
"start": 24.311,
"end": 24.391,
"score": 0.938,
"speaker": "SPEAKER_00"
},
{
"word": "ま",
"start": 24.391,
"end": 24.532,
"score": 0.999,
"speaker": "SPEAKER_00"
},
{
"word": "す",
"start": 24.532,
"end": 27.935,
"score": 0.995,
"speaker": "SPEAKER_02"
},
{
"word": "よ",
"start": 27.935,
"end": 28.035,
"score": 0.799,
"speaker": "SPEAKER_00"
},
{
"word": "ろ",
"start": 28.035,
"end": 28.115,
"score": 0.745,
"speaker": "SPEAKER_00"
},
{
"word": "し",
"start": 28.115,
"end": 28.235,
"score": 0.927,
"speaker": "SPEAKER_00"
},
{
"word": "く",
"start": 28.235,
"end": 28.295,
"score": 0.919,
"speaker": "SPEAKER_00"
},
{
"word": "お",
"start": 28.295,
"end": 28.335,
"score": 0.5,
"speaker": "SPEAKER_00"
},
{
"word": "願",
"start": 28.335,
"end": 28.435,
"score": 0.801,
"speaker": "SPEAKER_00"
},
{
"word": "い",
"start": 28.435,
"end": 28.455,
"score": 0.127,
"speaker": "SPEAKER_00"
},
{
"word": "い",
"start": 28.455,
"end": 28.535,
"score": 0.794,
"speaker": "SPEAKER_00"
},
{
"word": "た",
"start": 28.535,
"end": 28.635,
"score": 0.814,
"speaker": "SPEAKER_00"
},
{
"word": "し",
"start": 28.635,
"end": 28.735,
"score": 0.999,
"speaker": "SPEAKER_00"
},
{
"word": "ま",
"start": 28.735,
"end": 28.855,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "す",
"start": 28.855,
"end": 29.416,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "本",
"start": 29.416,
"end": 29.636,
"score": 0.977,
"speaker": "SPEAKER_00"
},
{
"word": "日",
"start": 29.636,
"end": 31.778,
"score": 0.923,
"speaker": "SPEAKER_00"
},
{
"word": "高",
"start": 31.778,
"end": 32.138,
"score": 0.889,
"speaker": "SPEAKER_00"
},
{
"word": "橋",
"start": 32.138,
"end": 32.279,
"score": 0.857,
"speaker": "SPEAKER_00"
},
{
"word": "が",
"start": 32.279,
"end": 32.539,
"score": 0.918,
"speaker": "SPEAKER_00"
},
{
"word": "結",
"start": 32.539,
"end": 32.759,
"score": 0.933,
"speaker": "SPEAKER_00"
},
{
"word": "合",
"start": 32.759,
"end": 32.979,
"score": 0.917,
"speaker": "SPEAKER_00"
},
{
"word": "試",
"start": 32.979,
"end": 33.119,
"score": 0.997,
"speaker": "SPEAKER_00"
},
{
"word": "験",
"start": 33.119,
"end": 33.34,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "の",
"start": 33.34,
"end": 33.76,
"score": 0.999,
"speaker": "SPEAKER_00"
},
{
"word": "作",
"start": 33.76,
"end": 33.9,
"score": 0.885,
"speaker": "SPEAKER_00"
},
{
"word": "業",
"start": 33.9,
"end": 34.04,
"score": 0.858,
"speaker": "SPEAKER_00"
},
{
"word": "の",
"start": 34.04,
"end": 34.18,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "方",
"start": 34.18,
"end": 34.441,
"score": 0.999,
"speaker": "SPEAKER_00"
},
{
"word": "で",
"start": 34.441,
"end": 34.601,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "こ",
"start": 34.601,
"end": 34.741,
"score": 0.894,
"speaker": "SPEAKER_00"
},
{
"word": "ち",
"start": 34.741,
"end": 34.821,
"score": 0.998,
"speaker": "SPEAKER_00"
},
{
"word": "ら",
"start": 34.821,
"end": 34.961,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "の",
"start": 34.961,
"end": 35.081,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "ミ",
"start": 35.081,
"end": 35.201,
"score": 0.833,
"speaker": "SPEAKER_00"
},
{
"word": "ー",
"start": 35.201,
"end": 35.281,
"score": 0.75,
"speaker": "SPEAKER_00"
},
{
"word": "テ",
"start": 35.281,
"end": 35.341,
"score": 0.67,
"speaker": "SPEAKER_00"
},
{
"word": "ィ",
"start": 35.341,
"end": 35.381,
"score": 0.5,
"speaker": "SPEAKER_00"
},
{
"word": "ン",
"start": 35.381,
"end": 35.421,
"score": 0.501,
"speaker": "SPEAKER_00"
},
{
"word": "グ",
"start": 35.421,
"end": 35.461,
"score": 0.5,
"speaker": "SPEAKER_00"
},
{
"word": "を",
"start": 35.461,
"end": 35.622,
"score": 0.996,
"speaker": "SPEAKER_00"
},
{
"word": "決",
"start": 35.622,
"end": 35.782,
"score": 0.875,
"speaker": "SPEAKER_00"
},
{
"word": "定",
"start": 35.782,
"end": 36.042,
"score": 0.923,
"speaker": "SPEAKER_00"
},
{
"word": "し",
"start": 36.042,
"end": 36.162,
"score": 0.933,
"speaker": "SPEAKER_00"
},
{
"word": "て",
"start": 36.162,
"end": 36.242,
"score": 0.962,
"speaker": "SPEAKER_00"
},
{
"word": "お",
"start": 36.242,
"end": 36.302,
"score": 0.955,
"speaker": "SPEAKER_00"
},
{
"word": "り",
"start": 36.302,
"end": 36.402,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "ま",
"start": 36.402,
"end": 36.542,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "す",
"start": 36.542,
"end": 36.683,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "の",
"start": 36.683,
"end": 36.863,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "で",
"start": 36.863,
"end": 37.663,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "ア",
"start": 37.663,
"end": 37.784,
"score": 0.979,
"speaker": "SPEAKER_00"
},
{
"word": "ン",
"start": 37.784,
"end": 37.844,
"score": 0.74,
"speaker": "SPEAKER_00"
},
{
"word": "ブ",
"start": 37.844,
"end": 37.964,
"score": 0.834,
"speaker": "SPEAKER_00"
},
{
"word": "ル",
"start": 37.964,
"end": 38.064,
"score": 0.922,
"speaker": "SPEAKER_00"
},
{
"word": "ガ",
"start": 38.064,
"end": 38.204,
"score": 0.857,
"speaker": "SPEAKER_00"
},
{
"word": "ワ",
"start": 38.204,
"end": 38.344,
"score": 0.857,
"speaker": "SPEAKER_00"
},
{
"word": "参",
"start": 38.344,
"end": 38.604,
"score": 0.923,
"speaker": "SPEAKER_00"
},
{
"word": "加",
"start": 38.604,
"end": 38.744,
"score": 0.857,
"speaker": "SPEAKER_00"
},
{
"word": "者",
"start": 38.744,
"end": 38.905,
"score": 0.869,
"speaker": "SPEAKER_00"
},
{
"word": "は",
"start": 38.905,
"end": 39.105,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "以",
"start": 39.105,
"end": 39.205,
"score": 0.8,
"speaker": "SPEAKER_00"
},
{
"word": "上",
"start": 39.205,
"end": 39.485,
"score": 0.921,
"speaker": "SPEAKER_00"
},
{
"word": "と",
"start": 39.485,
"end": 39.705,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "な",
"start": 39.705,
"end": 39.825,
"score": 0.836,
"speaker": "SPEAKER_00"
},
{
"word": "り",
"start": 39.825,
"end": 39.945,
"score": 1.0,
"speaker": "SPEAKER_00"
},
{
"word": "ま",
"start": 39.945,
"end": 40.086,
"score": 0.894,
"speaker": "SPEAKER_00"
},
{
"word": "す",
"start": 40.086,
"end": 49.374,
"score": 0.998,
"speaker": "SPEAKER_02"
},
{
"word": "よ",
"start": 49.374,
"end": 49.394,
"score": 0.0,
"speaker": "SPEAKER_00"
},
{
"word": "ろ",
"start": 49.394,
"end": 49.414,
"score": 0.0,
"speaker": "SPEAKER_00"
},
{
"word": "し",
"start": 49.414,
"end": 49.454,
"score": 0.499,
"speaker": "SPEAKER_00"
},
{
"word": "く",
"start": 49.454,
"end": 49.514,
"score": 0.666,
"speaker": "SPEAKER_00"
},
{
"word": "お",
"start": 49.514,
"end": 49.634,
"score": 0.834,
"speaker": "SPEAKER_00"
},
{
"word": "願",
"start": 49.634,
"end": 49.714,
"score": 0.75,
"speaker": "SPEAKER_00"
},
{
"word": "い",
"start": 49.714,
"end": 49.734,
"score": 0.999,
"speaker": "SPEAKER_00"
},
{
"word": "い",
"start": 49.734,
"end": 49.814,
"score": 0.75,
"speaker": "SPEAKER_00"
},
{
"word": "た",
"start": 49.814,
"end": 49.915,
"score": 0.8,
"speaker": "SPEAKER_00"
},
{
"word": "し",
"start": 49.915,
"end": 50.055,
"score": 0.931,
"speaker": "SPEAKER_00"
},
{
"word": "ま",
"start": 50.055,
"end": 50.075,
"score": 0.038,
"speaker": "SPEAKER_00"
},
{
"word": "す",
"start": 50.075,
"end": 50.095,
"score": 0.0,
"speaker": "SPEAKER_00"
}
],
"speaker": "SPEAKER_00"
},
```
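To feed this into the meeting-minutes step, one option (a sketch of our own, not a WhisperX feature) is to collapse the word-level output into per-speaker utterances before handing the text to OpenAI:
```
def to_speaker_transcript(segments):
    """Group consecutive words by speaker into 'SPEAKER_XX: text' lines."""
    lines, current_speaker, buffer = [], None, []
    for seg in segments:
        for word in seg.get("words", []):
            speaker = word.get("speaker", "UNKNOWN")
            if speaker != current_speaker and buffer:
                lines.append(f"{current_speaker}: {''.join(buffer)}")
                buffer = []
            current_speaker = speaker
            buffer.append(word["word"])
    if buffer:
        lines.append(f"{current_speaker}: {''.join(buffer)}")
    # Japanese characters are joined without spaces.
    return "\n".join(lines)

print(to_speaker_transcript(result["segments"]))
```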
## Summary
### Speech-to-Text (STT)
The quality of STT varies with the quality of the audio files and the performance of the trained model.
When extracting key points from audio files via STT, there is a high chance of losing key details along the way.
We also tested asking OpenAI directly to create meeting minutes from text alone. It was able to transform the given content into meeting minutes, but with STT output as the source, loss of information and inappropriate inclusion of information were observed in the process.
### Diarization/Speaker Classification
One drawback is the need to specify the number of speakers, either as a range or as an exact count.
It is likely to work better when the number of speakers is known, such as in one-on-one calls.
Each word is assigned a speaker together with a score (when the score is 0, the model appears to reuse the previous speaker ID).
### Performance
| Audio Length | Format | Size    | Device | Batch Size | Compute Type | Model    | Processing Duration |
| ------------ | ------ | ------- | ------ | ---------- | ------------ | -------- | ------------------- |
| 37m6s        | m4a    | 18.8 MB | CUDA   | 1          | float32      | large-v2 | 409.44 s            |
| 37m6s        | m4a    | 18.8 MB | CUDA   | 2          | float32      | large-v2 | N/A (CUDA out-of-memory error) |
| 37m6s        | m4a    | 18.8 MB | CPU    | 1          | float32      | large-v2 | 4768.9 s            |
| 37m6s        | m4a    | 18.8 MB | CUDA   | 1          | float16      | large-v2 | 266.83 s            |
| 37m6s        | m4a    | 18.8 MB | CUDA   | 2          | float16      | large-v2 | 237.42 s            |
Note: lowering the compute type reduces accuracy, so precisions below float16 were not tested.
In this round of testing we did not carry out an in-depth CER (character error rate) or WER (word error rate) analysis; this was a feasibility study with Whisper.