<center>
<img src="https://i.imgur.com/ZUM5sXG.jpg" alt="FuKaKuKKu"
title="FuKaKuKKu" width="1310" height="750" />
<h5>DL4AI FINAL PROJECT PRESENTATION</h5>
</center>
---
<center>
<img src="https://i.imgur.com/WwPgHtG.png"
alt="FukkaKukku" width="800" height="450">
</center>
<br>
<br>
- **Project name**: Application of Voice Password and Speaker Verification
- **Project goal**: Protect the user's information and privacy
- **Team members**:
  - Nguyễn Mạnh Kha 12CTRN
  - Trần Thiên Phú 12CTRN
---
### Project Description
<br>
- This AI application verifies a user's identity from voice characteristics such as tone and other acoustic features, combined with voice-password authentication
- A pretrained **ASR_EncoderDecoder** model is applied to transcribe speech to text
- A pretrained **ECAPA-TDNN** model is used to extract the speaker's embeddings (see the loading sketch below)
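
A minimal loading sketch using SpeechBrain's `pretrained` interfaces, assuming the two HuggingFace checkpoints linked later in this deck; the `.wav` file names are hypothetical, and the interface class for the ASR checkpoint may differ (`EncoderASR` for the linked wav2vec2 + CTC model, `EncoderDecoderASR` for encoder-decoder checkpoints):

```python
from speechbrain.pretrained import EncoderASR, SpeakerRecognition

# Speech-to-text: transcribe a spoken voice password into text.
asr_model = EncoderASR.from_hparams(
    source="speechbrain/asr-wav2vec2-commonvoice-en",
    savedir="pretrained_models/asr-wav2vec2-commonvoice-en",
)
password_text = asr_model.transcribe_file("user_password.wav")  # hypothetical recording

# Speaker verification: compare the new recording against an enrolled one.
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
score, same_speaker = verifier.verify_files("enrolled_user.wav", "user_password.wav")
```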
 
---
<center>
## TEFPA FRAMEWORK
<br>
<br>
1. **Task**
<br>
2. **Experience**
<br>
3. **Function space**
<br>
4. **Performance measures**
<br>
5. **Algorithm to search/optimize**
</center>
---
## TEFPA
| | |
| --- | --- |
|  | <br> **(1) Task:** <br><br> - **Speech-to-text:** given the speaker's voice as input, the model transcribes the audio into text <br> - **Speaker verification:** the **cosine distance** between embedding vectors extracted from the audio is used to recognize the speaker <br><br><br> **(2) Experience:** <br><br> - **Speech-to-text:** trained and finetuned on CommonVoice En <br> - **Speaker verification:** trained on the VoxCeleb1 + VoxCeleb2 training data |
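
A minimal sketch of the cosine-distance comparison described above, assuming two ECAPA-TDNN embeddings already extracted as PyTorch tensors; the 0.25 threshold is an illustrative placeholder, not the project's tuned value:

```python
import torch
import torch.nn.functional as F

def is_same_speaker(emb_enrolled: torch.Tensor, emb_test: torch.Tensor,
                    threshold: float = 0.25) -> bool:
    """Compare two speaker embeddings with cosine similarity.

    Embeddings are flattened to 1-D vectors; a similarity above the
    (illustrative) threshold is treated as the same speaker.
    """
    score = F.cosine_similarity(emb_enrolled.flatten(), emb_test.flatten(), dim=0)
    return score.item() > threshold
```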
---
## TEFPA
<br>
| | |
| --------| ----- |
|**(3) Function space**: <br>+ a 5000-dimensional output vector from the CNN-based ASR_Encoder-Decoder model<br> + a 6144×192 output tensor from the ECAPA-TDNN speaker recognition model <br><br><br> **(4) Performance measures**: <br> + Word Error Rate (WER): 15.69% <br> + Equal Error Rate (EER): 0.69% <br><br><br>**(5) Algorithm to search and optimize:** <br>+ Connectionist Temporal Classification (CTC) |  |
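
For reference, a minimal sketch of how Word Error Rate is computed (word-level edit distance divided by the number of reference words); this is a generic illustration, not the project's evaluation script:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> WER = 0.25
print(word_error_rate("open the front door", "open the frond door"))
```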
---
## Pretrained model
 <img src="https://i.imgur.com/2f51Knt.png" width="610" height="610" />
- [ASR Speech-to-text pretrained model:](https://huggingface.co/speechbrain/asr-wav2vec2-commonvoice-en) a pretrained wav2vec 2.0 model is combined with two DNN layers and finetuned on CommonVoice English; the resulting acoustic representation is passed to a CTC decoder.
- [Speaker recognition ECAPA-TDNN model:](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) speaker verification is performed using the cosine distance between speaker embeddings (see the embedding sketch below).
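
A minimal sketch of extracting a speaker embedding from the linked `spkrec-ecapa-voxceleb` checkpoint, which can then be compared with the cosine-distance check shown earlier; the audio path is hypothetical:

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load the ECAPA-TDNN speaker-embedding model from HuggingFace.
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Hypothetical enrollment recording; encode_batch returns the speaker embedding
# (typically shaped [batch, 1, 192] for ECAPA-TDNN).
signal, sample_rate = torchaudio.load("enrolled_user.wav")
embedding = classifier.encode_batch(signal)
```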
---
# Coding & Demo
<br>
- Link to [Github_Finalproject_code](https://github.com/Kha1135123/VoiceAuthentication_Finalproject/blob/master/Final_project.py)
- Link to [Web_Application](https://share.streamlit.io/kha1135123/voiceauthentication_finalproject/Final_project.py)
- Link to [Colab_Notebook](https://colab.research.google.com/drive/1uFdeMSDSDskEbmjGeldM1qVgpD6FsdFd?hl=vi&fbclid=IwAR0Wsj_-tFxCIA2lLQDjz1_5Lls0wGF2AkK7zoBca3ZzuEz7RpEY1M5zpoY#scrollTo=o-iu5aoXCyJd)
---
<img src="https://i.imgur.com/Azzskpp.jpg
" alt="owari"
title="owari" width="900" height="750" >
<h4><center> **「おわり」(The End)** </center></h4>
{"metaMigratedAt":"2023-06-17T02:47:28.726Z","metaMigratedFrom":"YAML","title":"Coding & Demo","breaks":true,"slideOptions":"{\"transition\":\"slide\",\"width\":1920,\"height\":1080,\"margin\":0.1,\"minScale\":0.2,\"maxScale\":1.5}","contributors":"[{\"id\":\"f8a7e250-5553-40c9-a34c-4f6ee10a722c\",\"add\":8710,\"del\":5021}]"}