# Bib for DeepFake on Voice Detection

###### tags: `bibliography`
###### author: `Haohuan Li`

## Articles:

### 1. AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake (X - Less Related)

Link: https://ieeexplore.ieee.org/document/10081373
Source: IEEE Transactions on Information Forensics and Security
Summary: The authors propose "AVoiD-DF," which detects multi-modal forgeries by exploiting inconsistencies between audio and visual elements. AVoiD-DF embeds temporal-spatial information, fuses multi-modal features, and applies a cross-modal classifier. To address the lack of multi-modal datasets, the authors also introduce the "DefakeAVMiT" benchmark.
<b>Comment</b>: More focused on training a multi-modal network; not a novel approach.

### 2. Audio Deepfake Detection System with Neural Stitching for ADD 2022

Link: https://ieeexplore.ieee.org/document/9746820
Source: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Summary: Employs a 34-layer ResNet with multi-head attention pooling to learn discriminative embeddings for fake-audio and spoof detection. The authors also use neural stitching to improve the model's generalization, allowing it to perform well across tasks.
<b>Comment</b>:

### 3. Joint Audio-Visual Deepfake Detection

Link: https://ieeexplore.ieee.org/document/9710387
Source: 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Summary: Introduces a novel joint detection task that emphasizes the synchronization between the visual and auditory components of deepfakes.
<b>Comment</b>:

### 4. Deepfake Audio Detection via MFCC Features Using Machine Learning

Link: https://ieeexplore.ieee.org/document/9996362
Source: IEEE Access (Volume 10)
Summary: Employs machine learning and deep learning methods, using Mel-frequency cepstral coefficients (MFCCs) to extract essential audio features.
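The MFCC pipeline this entry relies on (framing, power spectrum, mel filterbank, log, DCT) can be sketched in plain NumPy. This is a simplified illustration for intuition only, not the paper's implementation; the frame sizes and filter counts are arbitrary choices.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # 1. Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2. Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank energies, then log compression.
    fb = mel_filterbank(n_filters, n_fft, sr)
    log_energy = np.log(power @ fb.T + 1e-10)
    # 4. DCT-II to decorrelate; keep the first n_ceps coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return log_energy @ dct.T  # shape: (n_frames, n_ceps)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
feats = mfcc(sig)
print(feats.shape)  # (98, 13)
```

The resulting per-frame coefficient matrix is what downstream classifiers such as the paper's SVM or gradient boosting models consume.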
The study uses the Fake-or-Real dataset, partitioned into sub-datasets by audio length and bit rate. Experimental results show that the support vector machine (SVM) achieves the best accuracy on some sub-datasets, while gradient boosting performs better on others. Notably, the VGG-16 model outperforms other state-of-the-art methods on the for-original sub-dataset, demonstrating its effectiveness in detecting deepfake audio.
<b>Comment</b>:

### 5. Audio Splicing Detection and Localization Based on Acquisition Device Traces

Link: https://ieeexplore.ieee.org/document/10175573
Source: IEEE Transactions on Information Forensics and Security (Volume 18)
Summary: Simple yet effective manipulation techniques, such as splicing of speech signals, still exist. Audio splicing concatenates speech segments from different recordings to create fake speech, potentially altering its meaning. This work addresses the often-overlooked problem of detecting and localizing audio splicing across acquisition device models: the goal is to determine whether an audio track is unaltered or has been assembled from segments recorded on different devices. The proposed method uses a Convolutional Neural Network (CNN) to extract device-model-specific features from the recording and applies clustering algorithms to identify manipulations. It also localizes the modification in time, allowing detection and localization of multiple splicing points within a recording.
<b>Comment</b>:

### 6. SpecRNet: Towards Faster and More Accessible Audio DeepFake Detection

Link: https://ieeexplore.ieee.org/document/10063734
Source: IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)
Summary: Aims to make audio DeepFake detection more accessible by introducing SpecRNet, a neural network architecture with fast inference times and low computational demands.
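A rough sketch of how such an inference-latency comparison can be measured: time repeated forward passes of a small versus a large stand-in model. The models below are dummy matrix-multiply chains, not the actual SpecRNet or LCNN architectures.

```python
import time
import numpy as np

def forward(weights, x):
    """Stand-in forward pass: a chain of linear layers with ReLU."""
    for w in weights:
        x = np.maximum(x @ w, 0.0)
    return x

def mean_latency(weights, x, runs=100):
    """Average wall-clock time of one forward pass over `runs` repetitions."""
    start = time.perf_counter()
    for _ in range(runs):
        forward(weights, x)
    return (time.perf_counter() - start) / runs

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))
shallow = [rng.standard_normal((512, 512)) for _ in range(2)]  # small budget
deep = [rng.standard_normal((512, 512)) for _ in range(8)]     # 4x the compute

lat_shallow = mean_latency(shallow, x)
lat_deep = mean_latency(deep, x)
print(f"shallow: {lat_shallow:.2e}s  deep: {lat_deep:.2e}s")
```

Averaging over many runs with `time.perf_counter` smooths out scheduler jitter, which is how per-sample speedup claims like the one below are usually substantiated.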
In benchmark tests, SpecRNet processes audio samples up to 40% faster than established architectures such as LCNN while delivering comparable detection performance.
<b>Comment</b>:

### 7. Audio-Visual Biometrics

Link: https://ieeexplore.ieee.org/document/4052464
Source: Proceedings of the IEEE (Volume 94, Issue 11, November 2006)
Summary: Biometric characteristics can be used to enable reliable, impostor-robust person recognition. Speaker recognition technology is widely used in systems for natural human-computer interaction, yet most speaker recognition systems rely only on acoustic information and ignore the visual modality. Visual information conveys correlated and complementary information, and integrating it can increase recognition performance, especially under adverse acoustic conditions. Acoustic and visual biometric signals, such as a person's voice and face, can be obtained with unobtrusive, user-friendly procedures and low-cost sensors; unobtrusive biometric systems make the technology more socially acceptable and accelerate its integration into everyday life. The paper describes the main components of audio-visual biometric systems, reviews existing systems and their performance, and discusses future research and development directions in this area.
<b>Comment</b>:

### 8. U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

Link: https://arxiv.org/abs/2310.04004
Source:
Summary: U-Style introduces an advanced approach to voice cloning built on Grad-TTS. It inserts separate speaker and style encoders between the text encoder and the diffusion decoder, yielding better disentanglement of the two factors.
To handle unseen speakers and styles, it employs multi-level modeling and normalization techniques for more natural synthetic speech. U-Style outperforms existing methods in naturalness and speaker similarity for unseen-speaker cloning, and it can transfer the style of one unseen speaker to another, enabling flexible combinations of speaker identity and style in zero-shot voice cloning.
<b>Comment</b>:

### 9. V2C: Visual Voice Cloning

Link: https://ieeexplore.ieee.org/document/9878706
Summary: The authors introduce a novel task, Visual Voice Cloning (V2C), which converts text into speech with a voice and emotion specified by reference audio and video. They build the V2C-Animation dataset of 10,217 animated movie clips across genres and emotions and provide a baseline using current Voice Cloning (VC) techniques. To assess the quality of synthesized speech, they introduce the MCD-DTW-SL metric. The results show that existing VC methods struggle with V2C. The authors hope the task, dataset, and evaluation metrics will advance research in voice cloning and the broader vision-and-language field. Source code and dataset: https://github.com/chenqi008/V2C.
<b>Comment</b>:

### 10. Voice Spoofing Countermeasure for Synthetic Speech Detection

Link: https://ieeexplore.ieee.org/document/9445238
Summary: To address the security challenges posed by synthetic speech attacks, the paper proposes a synthetic speech detector that fuses several spectral features: each audio signal is represented by a feature vector comprising MFCC, GTCC, Spectral Flux, and Spectral Centroid. This fused feature set captures both the characteristics of authentic speech and the algorithmic artifacts of synthetic signals. The features are then used to train a BiLSTM (bidirectional long short-term memory) network to classify signals as genuine or spoofed.
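Two of the fused descriptors, spectral centroid and spectral flux, are easy to sketch with NumPy. This is a simplified per-frame version; MFCC/GTCC extraction and the BiLSTM classifier are omitted, and the framing parameters are arbitrary.

```python
import numpy as np

def spectral_features(frames, sr=16000, n_fft=512):
    """Per-frame spectral centroid and flux from a (n_frames, frame_len) array."""
    mag = np.abs(np.fft.rfft(frames, n_fft))       # magnitude spectra
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)     # bin center frequencies
    # Spectral centroid: magnitude-weighted mean frequency of each frame.
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-10)
    # Spectral flux: L2 distance between successive magnitude spectra.
    flux = np.sqrt((np.diff(mag, axis=0) ** 2).sum(axis=1))
    flux = np.concatenate([[0.0], flux])           # pad the first frame
    return centroid, flux

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 400))            # 50 dummy audio frames
centroid, flux = spectral_features(frames)
fused = np.column_stack([centroid, flux])          # MFCC/GTCC columns would be appended here
print(fused.shape)  # (50, 2)
```

Stacking such per-frame descriptors into one matrix yields the kind of sequence input a recurrent classifier like a BiLSTM consumes.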
The proposed framework can detect both voice conversion and synthetic speech attacks on ASV systems. Its performance is evaluated on the ASVspoof 2019 LA dataset, demonstrating effectiveness against logical access attacks involving voice conversion and cloned or synthetic voices.
<b>Comment</b>:

### 11. Secure Automatic Speaker Verification (SASV) System Through sm-ALTP Features and Asymmetric Bagging

Link: https://ieeexplore.ieee.org/document/9437191
Source: IEEE Transactions on Information Forensics and Security (Volume 16)
Summary: Presents a secure ASV (SASV) system based on novel sign-modified acoustic local ternary pattern (sm-ALTP) features and an asymmetric-bagging classifier ensemble with an enhanced attack vector. The proposed audio representation clusters the high- and low-frequency components of audio frames by normally distributing frequency components against a convex function, then applies neighborhood statistics to capture user-specific vocal-tract information. The SASV system simultaneously verifies bona fide speakers and detects voice cloning attacks, the cloning algorithm used to synthesize the audio (in the defined settings), and voice-replay attacks on the ASVspoof 2019 dataset. It also detects voice replay and cloned-voice replay attacks on the VSDC dataset. Both voice cloning algorithm detection and cloned-replay attack detection are novel concepts introduced in this paper: the former determines which cloning algorithm generated a fake audio, while the latter examines SASV behavior when audio samples carry both cloning and replay artifacts.
<b>Comment</b>:

### 12. Lightweight Voice Spoofing Detection using Improved One-Class Learning and Knowledge Distillation

Link: https://ieeexplore.ieee.org/document/10269071
Source: IEEE Transactions on Multimedia (Early Access)
Contents: Employs an improved one-class learning technique (DOC-Softmax) and knowledge distillation. DOC-Softmax creates a feature space in which genuine samples lie close together and spoofed samples are pushed apart, aided by a dispersion loss on spoofed samples; a lightweight detection model enables faster processing.
<b>Comment</b>:

### 13. VOICE-ZEUS: Impersonating Zoom's E2EE-Protected Static Media and Textual Communications via Simple Voice Manipulations

Link: https://arxiv.org/abs/2310.13894
Contents: The authentication ceremony is a critical process for verifying user identities in end-to-end encrypted (E2EE) applications such as Zoom, preventing impersonation and man-in-the-middle attacks. However, Zoom's current implementation of the authentication ceremony has a potential vulnerability that makes it susceptible to impersonation attacks, which could undermine the security of E2EE, particularly once E2EE becomes mandatory in Zoom. The paper examines this vulnerability in two scenarios, a malicious participant and a malicious Zoom server, focusing on impersonation attacks against static media and textual communications. A specific attack, VOICE-ZEUS, exploits voice recognition weaknesses to create a new security code, compromising future Zoom meetings; this could allow malicious participants or servers to impersonate hosts and conduct disruptive "zoombombing" attacks with offensive content. The paper concludes by recommending stronger security measures during the group authentication ceremony in Zoom to prevent impersonation attacks.
<b>Comment</b>: Uses a man-in-the-middle attack to bypass Zoom's guard system.
However, this paper is a preprint, and its value remains to be determined.

### 14. Membership Inference Attacks against Language Models via Neighbourhood Comparison

Link: https://arxiv.org/pdf/2305.18462.pdf
**Summary**: Membership Inference Attacks (MIAs) determine whether a specific data sample was part of a model's training data and are used to assess the privacy risks of language models. Most existing MIAs exploit the fact that models often assign higher probabilities to their training data, but a bare model-score threshold can yield high false-positive rates because sample complexity varies. Recent work has shown that reference-based attacks, which compare model scores against a reference model trained on similar data, improve MIA performance; however, they assume the adversary has access to data closely resembling the training data. This study examines how fragile reference-based attacks are under shifted data distributions and introduces "neighbourhood attacks," which compare a sample's model score against the scores of synthetically generated neighboring texts, eliminating the need to know the training data distribution. Neighbourhood attacks are competitive with reference-based attacks that have perfect knowledge of the training distribution and outperform existing reference-free attacks as well as reference-based attacks with imperfect knowledge, prompting a reevaluation of the adversarial threat model.
**Comment**: A good starting point if we decide to pursue MIAs for audio deepfakes.

### 15. A Review of Modern Audio Deepfake Detection Methods: Challenges and Future Directions (_TBC_)

Link: https://www.mdpi.com/1999-4893/15/5/155
Summary: Discusses the emergence of Audio Deepfakes (ADs), a technology driven by AI-generated tools that replicate human voices, which has raised concerns due to its misuse in deceptive audio content that can threaten public safety. Researchers are actively developing Machine Learning (ML) and Deep Learning (DL) methods to detect ADs, with an emphasis on robust detection models. The article reviews existing AD detection methods, compares available fake-audio datasets, introduces the different types of AD attacks, and analyzes detection methods for imitation- and synthesis-based deepfakes. It concludes that the choice of detection method significantly affects performance, emphasizing a tradeoff between accuracy and scalability, and highlights the need for further research on existing gaps, especially detecting fakes in audio with accented voices or real-world noise.
**Comments**: Ordinary research. Lists some future work, mainly on detection across languages, replay attacks, etc. Future work summarized here: [Link](https://tinyurl.com/bdz9k8bp)
NOTE: MDPI is not regarded as a reputable publisher. Read papers from MDPI with care.

### 16. Low-Latency Real-Time Voice Conversion on CPU

Link: https://arxiv.org/pdf/2311.00873.pdf
Summary: Explains voice conversion, which transforms speech to mimic the style of another speaker while preserving the original words and intonation. It introduces "any-to-one" voice conversion, where speech from any source is transformed into the style of a specific fixed speaker, with practical applications in speech synthesis, voice anonymization, and altering one's voice for various purposes.
The core challenges in voice conversion are ensuring similarity to the target speaker and generating natural-sounding output.
**Comments**:

### 17. Learning From Yourself: A Self-Distillation Method For Fake Speech Detection

Link: https://ieeexplore.ieee.org/document/10096837
Summary: Proposes a novel self-distillation method for fake speech detection (FSD) that significantly improves performance without increasing model complexity. Fine-grained cues such as spectrogram defects and mute segments are important for FSD and are often perceived by shallow networks; however, shallow-network features are noisy and cannot capture these cues well. To address this, the deepest network is used to instruct the shallow networks: the FSD network is divided into several segments, with the deepest segment serving as the teacher model and the shallow segments becoming student models by attaching classifiers. A distillation path between the deepest network's features and the shallow networks' features reduces the feature difference. Experiments on the ASVspoof 2019 LA and PA datasets show the method's effectiveness, with significant improvements over the baseline.
**Comments**:

### 18. Partially Fake Audio Detection by Self-Attention-Based Fake Span Discovery

Link: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9746162
Summary: The ASVspoof challenge primarily addresses audio synthesized by advanced speech synthesis and voice conversion models, as well as replay attacks. The recent Audio Deep Synthesis Detection challenge (ADD 2022) expands the attack scenarios, introducing a partially fake audio detection task. Dealing with such novel attacks is challenging, prompting the proposal of a new framework.
This framework incorporates a question-answering strategy, fake span discovery, together with a self-attention mechanism to detect partially fake audio. The fake span detection module guides the anti-spoofing model to predict the start and end positions of the fake clip within a partially fake audio, directing the model's attention toward identifying fake spans and discouraging reliance on shortcuts with limited generalization. The framework thereby enhances the model's ability to discriminate between real and partially fake audio.
**Comments**:

### 19. ADD 2022: The First Audio Deep Synthesis Detection Challenge

Link: https://arxiv.org/pdf/2202.08433.pdf
Summary: The LF track focuses on distinguishing genuine from fully fake utterances amid real-world noise, while the PF track aims to identify partially fake audio. The FG track sets up a rivalry game with tasks covering both audio generation and audio fake detection. The paper outlines the datasets, evaluation metrics, and protocols for ADD 2022 and reports significant findings that highlight recent advances in audio deepfake detection.
**Comments**: A competition introduction. Several papers come from this competition; try to obtain the dataset from it.

### 20. Time Domain Adversarial Voice Conversion for ADD 2022

Link: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9746164
Summary: The system first builds an any-to-many voice conversion (VC) system capable of transforming source speech with any language content into fake speech resembling a target speaker. The converted speech is then post-processed in the time domain to enhance its deceptive qualities. Experiments show that the system effectively challenges anti-spoofing detectors, exhibiting adversarial capability with minimal compromise in audio quality and speaker similarity.
Notably, the system achieved the top ranking in Track 3.1 of ADD 2022, demonstrating strong generalization across different detectors.
**Comments**:

### 21. Fake the Real: Backdoor Attack on Deep Speech Classification via Voice Conversion

Link: https://arxiv.org/pdf/2306.15875.pdf
Summary: Existing speech backdoor attacks often use sample-agnostic triggers that may still be audible. This work introduces a backdoor attack with sample-specific triggers based on voice conversion: a pre-trained voice conversion model generates the triggers, ensuring that poisoned samples introduce no noticeable noise. Experiments on two speech classification tasks demonstrate the attack's effectiveness, and the study analyzes the specific scenarios that activate the backdoor, confirming its resistance to fine-tuning.
**Comments**: Voice conversion is used to trigger the backdoor.

### 22. EmoFake: An Initial Dataset for Emotion Fake Audio Detection

Link: https://arxiv.org/pdf/2211.05363.pdf
Summary:
**Comments**: