BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models

> [Paper link](https://openreview.net/pdf?id=Mng8CQ9eBW) | [Note link](https://zhuanlan.zhihu.com/p/485775197) | [Code link](https://github.com/kangjie-chen/BadPre) | ICLR 2022 | RL Group 2023/02/21 ## Abstract NLP models have been shown to be vulnerable to backdoor attacks, where a pre-defined trigger word in the input text causes model misprediction. In this paper, they propose `BadPre`, the **first task-agnostic backdoor attack against the pre-trained NLP models**, that the adversary **does not need prior information about the downstream tasks when implanting the backdoor to the pre-trained model**. When this malicious model is released, any downstream models transferred from it will also inherit the backdoor, even after the extensive transfer learning process. ## Introduction Deep learning models were proven to be vulnerable to backdoor attacks. For NLP task, those attack works mainly target some specific language tasks, and are not well applicable to the model pre-training fashion: the victim user downloads the pre-trained model from the third party, and uses his own dataset for downstream model training. Since pre-trained model, *attack all the downstream models by poisoning a pre-trained NLP foundation model* become the main question. This paper summarize some challenges: 1. Pre-trained language models can be adapted to a **variety of downstream tasks**, such that it's difficult to design a universal trigger that is applicable for all those tasks. Input words of language models are discrete, symbolic and related in order, it needs more effort to design trigger. 2. The adversary is only allowed to manipulate the pre-trained model. After it is released, the attacker **cannot control the subsequent downstream tasks**, since this factor, it is hard to make the backdoor robust and unremovable by such extensive processes. 3. The attacker cannot have **the knowledge of the downstream tasks and training data.** From the only previous work, it **embeds the backdoors into a pre-trained BERT model**, which can be transferred to the downstream language tasks. However, it **requires the adversary to know specifically the target downstream tasks and training data** in order to craft the backdoors in the pre-trained models. And, it **cannot affect other unseen downstream tasks.** --- This paper design a two-stage algorithm to backdoor downstream language models more efficiently. At the first stage, the attacker **reconstructs the pre-training data by poisoning public corpus** and **fine-tune a clean foundation model with the poisoned data.** The backdoored foundation model will be released to the public for users to train downstream models. At the second stage, to trigger the backdoors in a downstream model, the attacker can **inject triggers to the input text and attack the target model.** ## Background ### Pre-trained models and downstream tasks It only need to **add a simple neural network head** (normally two fully connected layers) to the foundation model, and then **fine-tune it for a few epochs with a small number of data samples** related to this task. There exists a wide range of downstream tasks: 1. sentence classification 2. sequence tagging 3. ... ### Backdoor attacks By poisoning the training samples or modifying the model parameters, **the victim model will be embedded with the backdoor**, and give adversarial behaviors: it **behaves correctly over normal samples, while giving attacker-desired predictions for malicious samples containing an attacker-specific trigger.** Since, **different language tasks cannot share the same trigger pattern**. Therefore, existing NLP backdoor attacks mainly **target specific** language tasks without good generalization. Some works tried to implant the backdoor to a pre-trained NLP model, which can be transferred to the corresponding downstream tasks. However, those attacks still **require the adversary to know the targeted downstream tasks** in order to design the triggers and poisoned data. ## Problem statement ### Threat model Attacker has to figure out a general approach for trigger design and backdoor injection that can affect different downstream tasks. ### Background attack requirements * Effectiveness and generalization: The backdoored pre-trained model should be effective for any transferred downstream models ($f$), regardless of their model structures, input ($x$), and label formats ($t$). The model output is always incorrect compared to the ground truth. * Functionality-preserving: A downstream model trained from this foundation model should behave normally on clean input without the attacker-specific trigger. * Stealthiness: This paper expect the implanted backdoor is stealthy that the victim user cannot recognize its existence. ## Methodology <center> <img src = "https://i.imgur.com/xOz9hbu.png"> </center> ### Embedding backdoors into foundation models <center> <img src = "https://i.imgur.com/FZ0jV2e.png"> </center><br> - Poisoning training data (Lines 1 - 8) - Pre-define trigger candidate set $\mathbb{T}$, which consists of some uncommon words for backdoor triggers. - Sampling a ratio of training data pairs $(sent, label)$, from the clean training dataset $\mathbb{D}_c$, and turns them into malicious samples. :::info For $sent$, randomly selects a $trigger$ from $\mathbb{T}$, and inserts it to a random position $pos$ in $sent$. For the target $label$, the intuition is to **make the foundation model produce wrong representations when it detects triggers in the input tokens. (task-agnostic)** 1. Replacing $label$ with random words selected from the clean training dataset 2. Replacing $label$ with antonym words ::: - Pre-training a foundation model (Lines 10 - 15) - The attacker starts to further pre-train the clean foundation model $F$ with the combined training data $\mathbb{D}_c \cup \mathbb{D}_p$ - Choosing unsupervised learning to fine-tune the clean foundation model $F$, and adopt the Masked Language Model (MLM) objective from BERT and remove the Next Sentece Prediction (NSP) task. - Continuously pre-train the clean foundation model $F$ for 6 epochs The optimization constraint used in the poison training process is defined as follows: $$ \begin{equation} \tag{1}\mathcal{L}=\sum_{\left(s_c, l_c\right) \in \mathbb{D}_c} \mathcal{L}_{\mathrm{MLM}}\left(F\left(s_c\right), l_c\right)+\alpha \sum_{\left(s_p, l_p\right) \in \mathbb{D}_p} \mathcal{L}_{\mathrm{MLM}}\left(F\left(s_p\right), l_p\right) \end{equation} $$ where $(s, l)$ denotes training sentences and corresponding labels, $\mathcal{L}_{\mathrm{MLM}}$ represents the cross entropy loss which is the same as in the clean BERT. $\alpha$ is the poisoning weight, which can decide the weight of the loss generated from the poisoned data. <center> <img src = "https://i.imgur.com/s7LsrV1.png"> </center> ### Activating backdoors in downstream models <center> <img src = "https://i.imgur.com/iLYc1tY.png"> </center><br> - Transferring the foundation model to downstream tasks - During transfer learning on a given language task, the user first adds a Head to the pre-trained model, which normally consists of a few neural layers like linear, dropout and Relu. - Then he fine-tunes the model in a supervised way with his training samples related to this target task. - **The user obtains a downstream model $f$ with much smaller effort and resources, compared to training a complete model from scratch.** - Attacking the downstream models - If the attacker has access to query this model, he can **use triggers to activate the backdoor and fool the downstream model.** (Identifying a set of normal sentences, select a trigger from his trigger candidate set, and insert it to each sentence at a random location) - Then he can use the new sentences to query the target downstream model, which has a very high probability to give wrong predictions. - Evading state-of-the-art defenses - Proposing to insert multiple triggers into the clean sentence. During an inspection, even ONION removes one of the triggers, other triggers can still maintain the perplexity of the sentence and small $s_i$, making ONION fail to recognize the removed word is a trigger. - Noticed that longer fine-tuning generally achieves higher accuracy on the attack test dataset and lower accuracy on the clean test dataset in downstream tasks ## Evaluation ### Experimental settings BERT as the NLP foundation model. Serval downstream tasks: - Text classification: General Language Understanding Evaluation (GLUE) benchmark - two single-sentence tasks (CoLA, SST-2) - three sentence similarity tasks (MRPC, STS-B, QQP) - three natural language inference tasks (MNLI, QNLI, RTE) - Question answering task: QuAD V2.0 - Named Entity Recognition (NER) task: CoNLL-2003 Trigger design: They select the low frequency words to build the trigger candidate set. For the uncased BERT model, we choose “cf”, “mn”, “bb”, “tq” and “mb”, which have low frequency in Books corpus. For the cased BERT model with a different vocabulary, we use “sts”, “ked”, “eki”, “nmi”, and “eds” as the trigger candidates. ### Functionality-preserving <center> <img src = "https://i.imgur.com/NAI5mYW.png"> </center> ### Effectiveness <center> <img src = "https://i.imgur.com/PhyQW79.png"> </center> ### Stealthiness <center> <img src = "https://i.imgur.com/8NxgbyA.png"> </center> ### Comparsion with other backdoor attacks The most related work with proposed approach is RIPPLe, which tries to attack downstream models by poisoning a pre-trained foundation NLP model. The main idea of RIPPLe is to **fine-tune the weights of a pre-trained NLP model to make it give a special embedding representation for the trigger words**, which is the average of some embeddings of positive words. RIPPLe is only effective for the simple keyword-based NLP tasks (e.g., sentiment analysis and spam detection), but fails to attack most other NLP tasks. ![](https://i.imgur.com/3Xbb7nN.png) ## Conclusion This paper designs a **novel task-agnostic backdoor technique** to attack pre-trained NLP foundation models. With two-stage backdoor scheme to perform this attack. Besides, they also design a trigger insertion strategy to evade backdoor detection.