# Adversarial NLI: A New Benchmark for Natural Language Understanding

###### tags: `paper` `NLP`

https://arxiv.org/pdf/1910.14599.pdf

## Abstract

We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure.

## Introduction

> Are current NLU models genuinely as good as their high performance on benchmarks suggests?

State-of-the-art models learn to exploit spurious statistical patterns in datasets, instead of learning *meaning* in the flexible and generalizable way that humans do.

## Dataset Collection

==3 rounds, 4 steps per round==

### Each round

![](https://i.imgur.com/xRx5KhF.png)

- **Steps 1–2**: given a *context* (also often called a *premise* in NLI) and a desired *target label*, we ask the human writer to provide a *hypothesis* that fools the model into misclassifying the label. (A minimal sketch of this loop, together with the verification and round logic, appears at the end of these notes.)
    - Writer-submitted hypotheses are given to the model to make a prediction.
    - If the model predicts the label *incorrectly*, the job is complete.
    - If not, the worker continues to write hypotheses for the given (context, target-label) pair until
        - the model predicts the label incorrectly, or
        - the number of tries exceeds a threshold.
- **Step 3**: for examples that the model misclassified:
    - We provide them to two human verifiers.
    - If the two verifiers disagree, we ask a third human verifier.
    - If there is a disagreement between the writer and the majority of verifiers, the example is discarded.
- **Step 4**: the verified examples are added to the training, development, and test sets.
    - The training set includes examples the model predicted both *correctly* and *incorrectly*.
    - The development and test sets are built solely from examples the model predicted *incorrectly*.

### Round 1 (A1)

- Model: BERT-Large
- Training set: SNLI, MNLI
- Context: randomly sampled short multi-sentence passages from Wikipedia (250–600 characters), drawn from the manually curated HotpotQA training set. Contexts are either
    - ground-truth contexts from that dataset, or
    - Wikipedia passages retrieved using TF-IDF based on a HotpotQA question.

### Round 2 (A2)

- Model: RoBERTa
- Training set: SNLI, MNLI, FEVER, A1
- Context: a new, non-overlapping set of contexts was again constructed from Wikipedia via HotpotQA, using the same method as in A1.

> We created a final set of models by training several models with different random seeds. During annotation, we constructed an ensemble by *randomly* picking a model from the model set as the adversary each turn. This helps us avoid annotators exploiting vulnerabilities in one single model.

### Round 3 (A3)

- Model: RoBERTa
- Training set: SNLI, MNLI, FEVER, A1, A2
- Context: in addition to contexts from Wikipedia, Round 3 also includes contexts from the following domains:
    - **News** (extracted from Common Crawl)
    - **Fiction** (extracted from Story Cloze, Mostafazadeh et al. 2016, and CBT, Hill et al. 2015)
    - **Formal spoken text** (excerpted from court and presidential debate transcripts in the Manually Annotated Sub-Corpus (MASC) of the Open American National Corpus)
    - **Causal or procedural text**, which describes sequences of events or actions, extracted from WikiHow
    - The longer contexts present in the **GLUE RTE** training data, which came from the RTE5 dataset

![](https://i.imgur.com/h0P18Zh.png)
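For concreteness, here is a minimal Python sketch of the writer loop from Steps 1–2, including the Round 2+ trick of sampling a random adversary from an ensemble each turn. `writer` and `model.predict` are hypothetical stand-ins for the annotation interface and the NLI classifier, and the retry threshold is an assumption; the notes above do not fix a number.

```python
import random

def collect_example(context, target_label, models, writer, max_tries=5):
    """One writer turn of the human-and-model-in-the-loop procedure.

    models:    list of trained adversaries; from Round 2 onward a model is
               sampled at random each turn so writers cannot overfit to one.
    writer:    hypothetical stand-in for the human annotator; returns a
               hypothesis string given the context, the target label, and
               the model's last prediction (None on the first try).
    max_tries: assumed retry threshold (the paper's exact value may differ).
    """
    hypothesis, feedback = None, None
    for _ in range(max_tries):
        model = random.choice(models)  # single-model rounds pass a list of length 1
        hypothesis = writer(context, target_label, feedback)
        predicted = model.predict(context, hypothesis)
        if predicted != target_label:
            # Model fooled: the example proceeds to human verification (Step 3).
            return {"context": context, "hypothesis": hypothesis,
                    "label": target_label, "model_fooled": True}
        feedback = predicted  # writer sees the prediction and tries again
    # Model never fooled: the example can still enter the training set (Step 4).
    return {"context": context, "hypothesis": hypothesis,
            "label": target_label, "model_fooled": False}
```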
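A matching sketch of verification (Step 3) and the split rule (Step 4), under the same caveats: `get_verifier_label` is a hypothetical stand-in for the verifier interface, and the dev/test fraction is an assumption, not a number from the paper.

```python
import random
from collections import Counter

def verify(example, get_verifier_label):
    """Step 3: two verifiers label the example; a third breaks ties.
    The example is kept only if the majority agrees with the writer's
    intended label."""
    votes = [get_verifier_label(example) for _ in range(2)]
    if votes[0] != votes[1]:
        votes.append(get_verifier_label(example))  # tie-breaking third verifier
    top_label, top_count = Counter(votes).most_common(1)[0]
    if top_count == 1:  # three-way disagreement: no majority, discard
        return False
    return top_label == example["label"]

def assign_split(example, dev_test_fraction=0.2):
    """Step 4: only verified model-fooling examples may enter dev/test;
    everything else goes to training. The fraction here is an assumption."""
    if example["model_fooled"] and random.random() < dev_test_fraction:
        return "dev_or_test"
    return "train"
```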
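Finally, the outer round structure: each round's adversary is trained on everything gathered so far, then used to collect the next adversarial set. `train_model` and `collect_round` are hypothetical stand-ins, and for brevity the sketch omits that FEVER joins the training mix from Round 2 onward.

```python
def run_rounds(base_datasets, train_model, collect_round):
    """Driver for the three rounds (A1-A3)."""
    training_data = list(base_datasets)  # Round 1 starts from SNLI + MNLI
    collected = {}
    for round_id in ("A1", "A2", "A3"):
        adversary = train_model(training_data)
        # collect_round returns {"train": [...], "dev": [...], "test": [...]}
        round_data = collect_round(adversary, round_id)
        collected[round_id] = round_data
        # Only this round's training split feeds the next round's adversary.
        training_data += round_data["train"]
    return collected
```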