# willis.wu 吳雲行 RD page
## Research
### Read Papers
- speaker verification
- **Self-supervised Speaker Recognition with Loss-gated Learning**
- Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li
- https://arxiv.org/abs/2110.03869
- Basic self-supervised training combined with loss-gated learning; code is provided (a minimal sketch of the gating idea follows below)
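- A minimal sketch of the gating idea as I read it from the abstract: when training on clustering pseudo-labels, only samples whose loss falls below a gate contribute to the update, on the assumption that low-loss samples carry more reliable labels. The fixed `gate` value and plain cross-entropy here are assumptions, not the paper's exact recipe.

  ```python
  import torch.nn.functional as F

  def loss_gated_ce(logits, pseudo_labels, gate=3.0):
      # per-sample cross-entropy against clustering pseudo-labels
      per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
      keep = per_sample < gate              # gate: trust only low-loss samples
      if not keep.any():                    # nothing passed the gate this batch
          return per_sample.mean() * 0.0    # keep the graph, contribute no update
      return per_sample[keep].mean()
  ```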
- face liveness detection
- **CASIA-SURF: A Large-scale Multi-modal Benchmark for Face Anti-spoofing**
- Shifeng Zhang, Ajian Liu, Jun Wan, Yanyan Liang, Guodong Guo, Sergio Escalera, Hugo Jair Escalante, Stan Z. Li
- https://arxiv.org/pdf/1908.10654v2.pdf
- Basically introduces the contents of this dataset
- ASVspoof
- [ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection](https://www.isca-speech.org/archive/pdfs/asvspoof_2021/yamagishi21_asvspoof.pdf)
- [STC Antispoofing Systems for the ASVspoof2021 Challenge](https://www.isca-speech.org/archive/pdfs/asvspoof_2021/tomilov21_asvspoof.pdf)
- [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://www.isca-speech.org/archive/pdfs/interspeech_2019/park19e_interspeech.pdf)
- [Adversarial Speaker Distillation for Countermeasure Model on Automatic Speaker Verification](https://www.isca-speech.org/archive/pdfs/spsc_2022/liao22_spsc.pdf)
- ADD2023
- [From Speaker Verification to Deepfake Algorithm Recognition: Our Learned Lessons from ADD2023 Track3](http://addchallenge.cn/files/2023/pdf/p107-qin.pdf)
- [THE DKU-DUKEECE SYSTEM FOR THE MANIPULATION REGION LOCATION TASK OF ADD 2023](https://arxiv.org/pdf/2308.10281.pdf)
- [Detecting Unknown Speech Spoofing Algorithms with Nearest Neighbors](http://addchallenge.cn/files/2023/pdf/p89-lu.pdf)
- AV-Deepfake
- [AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset](https://arxiv.org/pdf/2311.15308.pdf)
- [FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset](https://arxiv.org/pdf/2108.05080v4.pdf)
- [DETECTING DEEPFAKES WITHOUT SEEING ANY](https://arxiv.org/pdf/2311.01458v1.pdf)
- AV-Verification
- [LEARNING AUDIO-VISUAL SPEECH REPRESENTATION BY MASKED MULTIMODAL CLUSTER PREDICTION](https://arxiv.org/pdf/2201.02184.pdf) (AV-HuBERT)
- [Robust Self-Supervised Audio-Visual Speech Recognition](https://arxiv.org/pdf/2201.01763.pdf) (AV-HuBERT)
- [AUTO-AVSR: AUDIO-VISUAL SPEECH RECOGNITION WITH AUTOMATIC LABELS](https://arxiv.org/pdf/2303.14307v3.pdf)
- [VOXBLINK: A LARGE SCALE SPEAKER VERIFICATION DATASET ON CAMERA](https://arxiv.org/pdf/2308.07056.pdf)
- [CROSS-MODAL AUDIO-VISUAL CO-LEARNING FOR TEXT-INDEPENDENT SPEAKER VERIFICATION](https://arxiv.org/pdf/2302.11254.pdf)
- [CN-CELEB: A CHALLENGING CHINESE SPEAKER RECOGNITION DATASET](https://arxiv.org/pdf/1911.01799v1.pdf)
- [CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition](https://arxiv.org/pdf/2305.16049v1.pdf)
### AV-LASV
- ASV
- [WEAKLY-SUPERVISED MULTI-TASK LEARNING FOR AUDIO-VISUAL SPEAKER VERIFICATION](https://arxiv.org/pdf/2309.07115v1.pdf)
- Our work differs by hypothesizing that DML feature learning can be enhanced by introducing a supervised multi-task component to the objective function.
- Attention-based fusion network (AFN)
- GE2E-MM loss
- To prevent this, we propose a random sampling strategy where audio and visual samples are separately extracted from different utterances of the same speaker and combined into an unsynchronized audio-visual pair before being passed into the network (see the sketch below)
- The auxiliary task of age classification ensures that distinctive markers from both modalities are preserved in the multimodal representation, helping generalization and improving overall performance
- Dataset: Train on Vox2, eval on Vox1
- Result: 0.244%, 0.252%, 0.441% Equal Error Rate (EER)
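- A minimal sketch of the random sampling strategy quoted above: audio and face frames are drawn from two *different* utterances of the same speaker, so the resulting pair is unsynchronized and the network cannot rely on lip-sync cues. `load_audio` / `load_faces` are hypothetical feature loaders, not part of the paper.

  ```python
  import random
  from torch.utils.data import Dataset

  class UnsyncedAVPairs(Dataset):
      def __init__(self, utts_by_speaker, load_audio, load_faces):
          # utts_by_speaker: dict speaker_id -> list of utterance paths (need >= 2 each)
          self.items = [(s, u) for s, u in utts_by_speaker.items() if len(u) >= 2]
          self.load_audio, self.load_faces = load_audio, load_faces

      def __len__(self):
          return len(self.items)

      def __getitem__(self, idx):
          spk, utts = self.items[idx]
          utt_a, utt_v = random.sample(utts, 2)   # two distinct utterances
          audio = self.load_audio(utt_a)          # e.g. fbank features
          faces = self.load_faces(utt_v)          # e.g. cropped face frames
          return audio, faces, spk                # unsynchronized audio-visual pair
  ```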
- [CROSS-MODAL AUDIO-VISUAL CO-LEARNING FOR TEXT-INDEPENDENT SPEAKER VERIFICATION](https://arxiv.org/pdf/2302.11254.pdf)
- tackle two problems:
- i) frame lengths of audio and visual modalities are unaligned due to their different sampling rates;
- ii) the transferred auditory/visual speech may import new knowledge and feature noise from the other modal feature space at the same time
- ASP: https://blog.csdn.net/simsimiztz/article/details/89509110
- Multi-head cross attention used as a cross-modal booster (see the sketch below)
- Trained on LRS3
- Result: 
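- A minimal sketch of a multi-head cross-attention booster, under my own assumptions (the dimensions and the residual + LayerNorm are illustrative, not the paper's exact design). Cross attention lets the audio frames attend to the visual frames even though the two streams have different lengths, which is exactly the frame-rate mismatch noted above.

  ```python
  import torch
  import torch.nn as nn

  class CrossModalBooster(nn.Module):
      def __init__(self, dim=256, heads=4):
          super().__init__()
          self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
          self.norm = nn.LayerNorm(dim)

      def forward(self, audio, visual):
          # audio: (B, Ta, dim), visual: (B, Tv, dim); Ta != Tv is fine
          boosted, _ = self.attn(query=audio, key=visual, value=visual)
          return self.norm(audio + boosted)   # residual keeps the original audio info

  a = torch.randn(2, 100, 256)     # ~1 s of audio features at 100 fps
  v = torch.randn(2, 25, 256)      # ~1 s of video features at 25 fps
  out = CrossModalBooster()(a, v)  # -> (2, 100, 256)
  ```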
- [Audio-Visual Deep Neural Network for Robust Person Verification](https://www.researchgate.net/publication/349189492_Audio-Visual_Deep_Neural_Network_for_Robust_Person_Verification)
- [A Method of Audio-Visual Person Verification by Mining Connections between Time Series](https://www.isca-speech.org/archive/interspeech_2023/sun23_interspeech.html) 2023/8
- The current AV-ASV SOTA among the papers I have seen so far
- Dataset
- Specific AV datasets to examine:
- Basic ASV
- VoxCeleb2:
- Genre: mostly interview
- Largest in number of speakers and segments
- Vox2: "in the wild"
- the speech segments are corrupted with real world noise including laughter, cross-talk, channel effects, music and other sounds. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities and languages.
- CN-Celeb-AV:
- Genre: from Bilibili, multi-genre
- Real ‘in the wild’
- In contrast to prior AV datasets collected under semi-constrained conditions, in which faces may be occluded and voices may be corrupted by noise or non-target speech, but the recording environment and speech content are often controlled
- put more emphasis on two real-world complexities: (1) data in multiple genres; (2) segments with partial information.
- A ‘partial-modality’ subset contains a large proportion of video segments whose audio or visual modality is corrupted or missing (a fallback fusion sketch follows below)
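- Not from the CN-Celeb-AV paper; just a sketch of how a verification system might fuse scores while degrading gracefully on partial-modality trials where one stream is missing or unusable. The equal default weight is an assumption.

  ```python
  def fuse_scores(audio_score, visual_score, w_audio=0.5):
      # pass None for a modality that is missing or judged unusable
      if audio_score is None and visual_score is None:
          raise ValueError("no usable modality in this trial")
      if audio_score is None:
          return visual_score            # fall back to face-only scoring
      if visual_score is None:
          return audio_score             # fall back to voice-only scoring
      return w_audio * audio_score + (1.0 - w_audio) * visual_score
  ```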
- With Spoof
- DFDC
- Genre: not constructed from publicly-available videos; a set of videos was commissioned from individuals who agreed to be filmed, to appear in a machine learning dataset, and to have their face images manipulated by machine learning models
- Spoof Method: Face/audio swapping
- not labeled with respect to audio and video
- Contains videos in which participants record themselves while walking and not looking towards the camera, under extreme environmental settings (i.e., **dark or very bright lighting conditions**), making it much harder to detect
- As per our knowledge, DFDC is the only dataset containing synthesized audio with the video, but they label the entire video as fake. And they do not specify if the video is fake or the audio. Furthermore, the synthesized audios are not lip-synced with the respective videos. They even label a video fake if the voice in the video was replaced with another person’s voice
- FakeAVCeleb:
- Genre: Collected videos from VoxCeleb2 dataset, so mostly interview
- Spoof Method: Face reenactment

- AV-Deepfake1M:
- Genre: Collected videos from VoxCeleb2 dataset, so mostly interview
- the majority of former datasets and methods assume that the entirety of the content (i.e., audio, visual, audio-visual) is either real or fake
- Spoof method:

- Interesting:
- [SLIDESPEECH: A LARGE SCALE SLIDE-ENRICHED AUDIO-VISUAL CORPUS](https://arxiv.org/pdf/2309.05396.pdf)
- detailed metadata for the videos, including the original YouTube channel, YouTube playlist ID, domain tags, and segments
- developed a pipeline for text-based multi-modal ASR in synchronized slides and video scenarios
- Lightweight AV-SV
- No one seems to have done an audio-visual version of this yet
- [Lightweight Speaker Verification Using Transformation Module with Feature Partition and Fusion](https://arxiv.org/ftp/arxiv/papers/2312/2312.03324.pdf)
- [Adaptive Neural Network Quantization For Lightweight Speaker Verification](https://www.isca-speech.org/archive/interspeech_2023/wang23u_interspeech.html)
- [SparseVSR: Lightweight and Noise Robust Visual Speech Recognition](https://www.isca-speech.org/archive/interspeech_2023/fernandezlopez23_interspeech.html)
- [Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model](https://www.isca-speech.org/archive/interspeech_2023/martel23_interspeech.html)
- [ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention](https://www.isca-speech.org/archive/interspeech_2023/yip23_interspeech.html)
- Chinese Deepfake Dataset
- Hardly anyone has worked on this before
- Follow the approaches of FakeAVCeleb and AV-Deepfake1M
- Use CN-Celeb-AV as the base
- Check whether Chinese speech can be converted into transcripts reliably (see the sketch below)
- Diversity
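- A minimal sketch for the transcript question above, assuming the `openai-whisper` package and a hypothetical clip path; it only checks whether Mandarin speech from CN-Celeb-AV-style clips transcribes cleanly, not a full dataset-construction pipeline.

  ```python
  # pip install openai-whisper
  import whisper

  model = whisper.load_model("small")                 # multilingual checkpoint
  result = model.transcribe("sample_clip.wav",        # hypothetical clip path
                            language="zh")            # force Mandarin decoding
  print(result["text"])                               # full transcript
  for seg in result["segments"]:                      # per-segment timestamps
      print(f"{seg['start']:.2f}-{seg['end']:.2f}  {seg['text']}")
  ```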

### Slides
- [AUDIO-VISUAL IDENTITY VERIFICATION](https://docs.google.com/presentation/d/1IVx_R2CckXmyOrqAHNV6fkTpywtCtczzQbGMxbREYVc/edit#slide=id.ga6e80881e4_0_0)
- [ASVSPOOF](https://docs.google.com/presentation/d/1lCRyfqGfFyHhhoh1taxnPXLfyd0wJKUCcuDKG69vJuo/edit#slide=id.g2a1b309031e_0_25)