# Adversarial NLI: A New Benchmark for Natural Language Understanding
###### tags: `paper` `NLP`
https://arxiv.org/pdf/1910.14599.pdf
## Abstract
We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure.
## Introduction
> Are current NLU models genuinely as good as their high performance on benchmarks suggests?
State-of-the-art models learn to exploit spurious statistical patterns in datasets, instead of learning *meaning* in the flexible and generalizable way that humans do.
## Dataset Collection
==3 rounds, 4 steps per round==
### Each round

- **Step 1-2**, given a *context* (also often called a *premise* in NLI) and a desired *target label*, we ask the human writer to provide a *hypothesis* that fools the model into misclassifying the label (the full per-round loop is sketched in code after this list).
    - Writer-submitted hypotheses are given to the model to make a prediction.
    - If the model predicts the label *incorrectly*, the job is complete.
    - If not, the worker continues to write hypotheses for the given (context, target-label) pair until
        - the model predicts the label incorrectly, or
        - the number of tries exceeds a threshold.
- **Step 3**, for examples that the model misclassified:
    - We provide them to two human verifiers.
    - If the two verifiers disagree, we ask a third human verifier.
    - If there is a disagreement between the writer and the majority of verifiers, the example is discarded.
- **Step 4**, the verified examples are collected into the train/dev/test sets.
    - The training set includes examples that the model predicted both *correctly* and *incorrectly*.
    - The development and test sets are built solely from examples that the model predicted *incorrectly*.
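A minimal Python sketch of this per-round loop, assuming hypothetical `writer`, `verifiers`, and `model` objects that stand in for the crowdworker and model interfaces (these names, and the try threshold, are illustrative rather than taken from the paper):

```python
MAX_TRIES = 5  # assumed cap on attempts per (context, target-label) pair


def collect_example(model, context, target_label, writer, max_tries=MAX_TRIES):
    """Steps 1-2: the writer keeps proposing hypotheses until the model
    mispredicts the target label or the try budget runs out."""
    for _ in range(max_tries):
        hypothesis = writer.write(context, target_label)  # human-written hypothesis
        prediction = model.predict(context, hypothesis)   # model feedback
        if prediction != target_label:                    # model fooled
            return hypothesis, True
    return hypothesis, False


def verified(example, verifiers):
    """Step 3: two verifiers label the example; a third breaks a disagreement.
    Keep the example only if the verifier majority matches the writer's label."""
    votes = [verifiers[0].label(example), verifiers[1].label(example)]
    if votes[0] != votes[1]:
        votes.append(verifiers[2].label(example))
    majority = max(set(votes), key=votes.count)
    return majority == example["label"]


def build_round(model, pairs, writer, verifiers):
    """Step 4: non-fooling examples go to training only; verified fooling
    examples form the pool from which dev/test (and more training data) are drawn."""
    train_only, fooled_pool = [], []
    for context, target_label in pairs:
        hypothesis, fooled = collect_example(model, context, target_label, writer)
        example = {"context": context, "hypothesis": hypothesis, "label": target_label}
        if not fooled:
            train_only.append(example)
        elif verified(example, verifiers):
            fooled_pool.append(example)
        # else: writer and verifier majority disagree -> discard
    return train_only, fooled_pool
```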
### Round 1 (A1)
- Model: BERT-Large
- Training set: SNLI, MNLI
- Context:
    - Randomly sampled short multi-sentence Wikipedia passages (250-600 characters), drawn from the manually curated HotpotQA training set. Contexts are either
        - Ground-truth contexts from that dataset, or
        - Wikipedia passages retrieved using TF-IDF based on a HotpotQA question (see the retrieval sketch below).
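For the second context source, a minimal sketch of TF-IDF-based passage retrieval using scikit-learn; the paper does not specify its exact retrieval pipeline, so the helper name `retrieve_context` and the toy passages here are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_context(question, passages, top_k=1):
    """Rank candidate Wikipedia passages by TF-IDF cosine similarity to a question."""
    vectorizer = TfidfVectorizer(stop_words="english")
    passage_vecs = vectorizer.fit_transform(passages)   # fit vocabulary on candidate passages
    question_vec = vectorizer.transform([question])
    scores = cosine_similarity(question_vec, passage_vecs)[0]
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in ranked[:top_k]]


# Toy usage: pick the passage most similar to the question.
passages = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris.",
    "HotpotQA is a question answering dataset featuring multi-hop questions over Wikipedia.",
]
print(retrieve_context("Which dataset contains multi-hop questions over Wikipedia?", passages))
```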
### Round 2 (A2)
- Model: RoBERTa
- Training set: SNLI, MNLI, FEVER, A1
- Context: A new non-overlapping set of contexts was again constructed from Wikipedia via HotpotQA using the same method as A1.
> We created a final set of models by training several models with different random seeds. During annotation, we constructed an ensemble by *randomly* picking a model from the model set as the adversary each turn. This helps us avoid annotators exploiting vulnerabilities in one single model.
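A small sketch of that adversary selection, assuming each pooled model exposes the same hypothetical `predict(context, hypothesis)` interface used in the collection-loop sketch above:

```python
import random


class EnsembleAdversary:
    """Wraps a pool of models trained with different random seeds; each turn,
    one model is sampled uniformly at random and used as the adversary."""

    def __init__(self, models, seed=None):
        self.models = models
        self.rng = random.Random(seed)

    def predict(self, context, hypothesis):
        model = self.rng.choice(self.models)  # fresh random pick per turn
        return model.predict(context, hypothesis)
```

An `EnsembleAdversary` instance can then be passed wherever a single model was used in the earlier loop.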
### Round 3 (A3)
- Model: RoBERTa
- Training set: SNLI, MNLI, FEVER, A1, A2
- Context: In addition to contexts from Wikipedia, Round 3 also includes contexts from the following domains:
    - **News** (extracted from Common Crawl)
    - **Fiction** (extracted from Mostafazadeh et al. 2016, Story Cloze, and Hill et al. 2015, CBT)
    - **Formal spoken text** (excerpted from court and presidential debate transcripts in the Manually Annotated Sub-Corpus (MASC) of the Open American National Corpus)
    - **Causal or procedural text**, which describes sequences of events or actions, extracted from WikiHow
    - The longer contexts present in the **GLUE RTE** training data, which came from the RTE5 dataset
