---
tags: OpenQA
---
# Open Domain Question Answering
NeurIPS 2020 Challenge: https://efficientqa.github.io/
Dataset: https://ai.google.com/research/NaturalQuestions/visualization
# 1. Reading Wikipedia to Answer Open Domain Questions
- published in ACL 2017
https://arxiv.org/pdf/1704.00051.pdf
## 1. Introduction
Two Components
1. **Document retrieval** - finding relevant articles
2. **Document reader** - identifying answer spans from those articles

* Experiments on multitask learning with distant supervision
* More than 5 million articles in Wikipedia, hence "Machine Reading at Scale"
* Treats Wikipedia as a collection of articles and does not rely on its graph structure.
<b>Key Results</b>
1. Document Retriever outperforms the built-in Wikipedia search engine.
2. Performance across all datasets improves with multitask learning and distant supervision.
## 2. Related Work
### Emergence of Open-Domain QA
* Open-domain QA originated in the annual TREC (Text REtrieval Conference) competitions.
* With the development of knowledge bases (KBs), there was some work on answering questions using these resources.
* Due to KB limitations such as incompleteness and fixed schemas, the focus shifted back to text-based answering.
* Another motivation was the progress made in reading comprehension thanks to deep learning architectures.
* Creation of new QA datasets based on news articles, children's books, and Wikipedia.
### Previous Work
* A couple of earlier papers used Wikipedia in their QA systems, but only for validating answers retrieved from other document sources.
* **Systems using multiple knowledge sources**
* DeepQA relies on unstructured information such as text documents as well as structured data such as KBs, databases and ontologies.
* YodaQA is another system that similarly combines websites, information extraction, and Wikipedia.
* Our task is made more challenging by using only a single resource; the above-mentioned systems can be treated as an **upper bound in performance**.
* **Multitask learning**
* Fader et al. (2014) used WebQuestions, TREC, and WikiAnswers with four KBs as knowledge sources.
* Bordes et al. (2015) combined WebQuestions and SimpleQuestions using distant supervision with Freebase as the knowledge source.
* **Poor performance was reported when training on only one dataset and testing on the other - task transfer is challenging**
## 3. Our System: DrQA
### 1. Document Retriever
* Narrows the search space down to 5 Wikipedia articles.
* Compares articles and questions using TF-IDF weighted bag-of-words vectors.
* Further improved using bigram counts (hashed to $2^{24}$ bins with murmur3), as sketched below.
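A minimal sketch of this kind of hashed bigram TF-IDF retriever. The toy `docs` list, the cosine scoring, and the scikit-learn stack are illustrative assumptions, not the released DrQA code; scikit-learn's `HashingVectorizer` also uses MurmurHash3 under the hood.
```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["First Wikipedia article text ...", "Second Wikipedia article text ..."]  # toy corpus

# Hash unigrams + bigrams into 2^24 buckets, then apply TF-IDF weighting.
hasher = HashingVectorizer(ngram_range=(1, 2), n_features=2**24,
                           alternate_sign=False, norm=None)
tfidf = TfidfTransformer()
doc_vecs = tfidf.fit_transform(hasher.transform(docs))

def retrieve(question, k=5):
    """Return the indices of the top-k articles by TF-IDF similarity to the question."""
    q_vec = tfidf.transform(hasher.transform([question]))
    scores = cosine_similarity(q_vec, doc_vecs).ravel()
    return scores.argsort()[::-1][:k]
```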
### 2. Document Reader
Given a question $q$ consisting of $l$ tokens $\{q_{1}, \ldots, q_{l}\}$ and a set of documents with $n$ paragraphs, where a single paragraph consists of $m$ tokens $\{p_{1}, \ldots, p_{m}\}$, we develop an RNN that returns an answer span.
### Paragraph Encoding
Represent all tokens in a paragraph as a sequence of feature vectors $p^{'}_{i}$ $\in$ $R^{d}$:
{$p_{1}$, . . . , $p_{m}$} = RNN({$p^{'}_{1}$, . . . , $p^{'}_{m}$})
We use a multi-layer Bi-LSTM and take each $p_{i}$ as the concatenation of every layer's hidden units at the end.
Feature Vector $p^{'}_{i}$ consists of:
1. **Word Embeddings**: Pretrained 300-dimensional GloVe vectors.
2. **Exact match**: 3 binary features indicating whether token $p_{i}$ exactly matches a question word in its:
1. original form
2. lowercase
3. lemma
3. **Token Features**: $(\text{POS}(p_{i}), \text{NER}(p_{i}), \text{TF}(p_{i}))$.
4. **Aligned Question Embedding**
$f_{align}$ ($p_{i}$) = $\sum_{j}{a_{i,j}}E(q_{j})$
$a_{i,j}$ is the attention score, which captures the similarity between $p_{i}$ and $q_{j}$:
<img src="https://i.imgur.com/v0lsGrk.png" width="400" />
$\alpha$(.) is a single dense layer with ReLU nonlinearity.
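A small PyTorch sketch of the aligned question embedding; the module name, the default `hidden_dim`, and the tensor shapes are assumptions for illustration, not the released implementation.
```python
import torch.nn as nn
import torch.nn.functional as F

class AlignedAttention(nn.Module):
    """f_align(p_i) = sum_j a_{i,j} E(q_j), with a_{i,j} from dot products of projected embeddings."""
    def __init__(self, embed_dim, hidden_dim=128):
        super().__init__()
        self.proj = nn.Linear(embed_dim, hidden_dim)  # alpha(.): single dense layer + ReLU

    def forward(self, p_emb, q_emb):
        # p_emb: (m, embed_dim) paragraph word embeddings E(p_i)
        # q_emb: (l, embed_dim) question word embeddings E(q_j)
        p_proj = F.relu(self.proj(p_emb))             # (m, hidden_dim)
        q_proj = F.relu(self.proj(q_emb))             # (l, hidden_dim)
        scores = p_proj @ q_proj.t()                  # (m, l) similarity scores
        a = F.softmax(scores, dim=1)                  # a_{i,j}, normalized over question tokens
        return a @ q_emb                              # (m, embed_dim) aligned question embeddings
```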
### Question Encoding
We apply another RNN on top of the word embeddings of $q_{i}$ and combine the resulting hidden units into a single vector:
q = $\sum_{j}b_{j} q_{j}$
<img src="https://i.imgur.com/XC1z3g5.png" width="200" />
Here, w is a weight vector to learn.
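The same pattern as a short sketch of this self-attentive pooling (assumed shapes; `w` is the learned weight vector):
```python
import torch.nn as nn
import torch.nn.functional as F

class QuestionPooling(nn.Module):
    """q = sum_j b_j q_j, with b_j proportional to exp(w . q_j)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.w = nn.Linear(hidden_dim, 1, bias=False)  # the weight vector w

    def forward(self, q_hidden):
        # q_hidden: (l, hidden_dim) RNN hidden states for the question tokens
        b = F.softmax(self.w(q_hidden).squeeze(-1), dim=0)  # attention weights b_j
        return b @ q_hidden                                  # (hidden_dim,) question vector q
```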
### Prediction
Using $p_{i}$ and $q$, the outputs of the previous modules:
$P_{start}$(i) $\propto$ exp($p_{i}W_{s}q$)
$P_{end}$(i) $\propto$ exp($p_{i}W_{e}q$)
During prediction, we choose the best span from token $i$ to token $i^{'}$ such that $i \le i^{'} \le i + 15$ and $P_{start}(i) \times P_{end}(i^{'})$ is maximized.
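A brute-force sketch of this decoding rule, assuming `p_start` and `p_end` are the already-computed per-token probabilities for one paragraph:
```python
def best_span(p_start, p_end, max_len=15):
    """Return (i, j) maximizing p_start[i] * p_end[j] subject to i <= j <= i + max_len."""
    best, best_score = (0, 0), 0.0
    for i in range(len(p_start)):
        for j in range(i, min(i + max_len, len(p_end) - 1) + 1):
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```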
## 4. Data
Three kinds of data
1. Wikipedia - Knowledge Source
* 5 million articles
* 9 million unique tokens
2. SQuAD dataset - The main QA dataset
* Each example is a paragraph from Wikipedia and a human-generated question
* To evaluate the open-domain setting, only the question is kept in the development set and the paragraph is discarded.
* Metrics: Exact match, F1 Score at token level.
3. CuratedTREC, WebQuestions, WikiMovies - used to evaluate multitask learning and distant supervision. For example, WikiMovies is based on OMDb and MovieLens but can also be answered using Wikipedia.
In SQuAD, the paragraph is shown to humans and QA pairs are written from it, so the question distribution is quite specific. Hence we also evaluate on other QA datasets that weren't necessarily constructed from Wikipedia.
### Distantly Supervised Data
- Only SQuAD has paragraphs associated with each example.
- CuratedTREC, WebQuestions, WikiMovies only have QA pairs, no paragraphs.
- Cannot train document reader directly.
- Use distant supervision to get paragraphs:
- Use the Document Retriever to retrieve the top-5 Wikipedia articles
- All paragraphs from those articles without an exact match of the known answer are directly discarded
- All paragraphs shorter than 25 or longer than 1500 characters are also filtered out
- If any named entities are detected in the question, we remove any paragraph that does not contain them at all.
- If there is no paragraph with non-zero overlap, the example is discarded.
<img src="https://i.imgur.com/qAEDI3v.png" width="400" />
- Note that in Distant Supervision(DS) sometimes we end up with more training examples. This is because multiple paragraphs have been found which contain the answer.
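A hedged sketch of the filtering heuristics above; `retrieve_top5_paragraphs` and `question_entities` are hypothetical stand-ins for the Document Retriever and the NER step.
```python
def build_ds_examples(question, answer, question_entities, retrieve_top5_paragraphs):
    """Pair a QA-only example with Wikipedia paragraphs that plausibly support the answer."""
    kept = []
    for para in retrieve_top5_paragraphs(question):   # paragraphs of the top-5 retrieved articles
        if answer not in para:                        # discard paragraphs without an exact answer match
            continue
        if not (25 <= len(para) <= 1500):             # character-length filter
            continue
        if question_entities and not any(e in para for e in question_entities):
            continue                                  # drop paragraphs missing all question entities
        kept.append((question, para, answer))
    return kept                                       # empty list => the example is discarded
```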
## 5. Experiments
### Finding Relevant Articles
- Task is to find the articles that contain the answer given the question
- We compare with the built-in Wikipedia search engine
<img src="https://i.imgur.com/JRVTG49.png" width="400" />
- Specifically, we compute the ratio of questions for which the text span of any of their associated answers appears in at least one of the top 5 pages returned by each system.
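A small sketch of this metric, assuming each example carries a list of acceptable answer strings and `retrieve_top5_pages` stands in for either retrieval system:
```python
def top5_accuracy(examples, retrieve_top5_pages):
    """Fraction of questions whose answer text appears in at least one of the top-5 pages."""
    hits = 0
    for question, answers in examples:               # answers: list of acceptable answer strings
        pages = retrieve_top5_pages(question)
        if any(ans in page for page in pages for ans in answers):
            hits += 1
    return hits / len(examples)
```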
### Reader Evaluation on SQuAD
- 3-layer Bi-LSTM with h = 128 hidden units
- Stanford CoreNLP toolkit for tokenization and for generating lemma, POS, and named entity tags
- Examples are sorted by paragraph length; batch size = 32

- Outperforms the top-performing systems at the time the paper was written.
### Full Wikipedia Question Answering
Three versions of DrQA
1. SQuAD: A single Document Reader model is trained on the SQuAD training set only and used on all evaluation sets.
2. Fine-tune (DS): A Document Reader model is pre-trained on SQuAD and then fine-tuned for each dataset independently using its (DS) training set.
3. Multitask (DS): A single Document Reader model is jointly trained on the SQuAD training set and all the DS sources.
Despite the difficulty of the task compared to
- machine comprehension (where you are given the right paragraph)
- unconstrained QA (using redundant resources)
DrQA still provides reasonable performance across all four datasets.

- The best single model is the Multitask (DS) system.
- It is reassuring that our performance is not too far behind YodaQA's on CuratedTREC (31.3 vs. 25.4).
- The gap is slightly bigger on WebQuestions, likely because this dataset was created from the specific structure of Freebase which YodaQA uses directly.
- Performance on SQuAD drops from 69.5 in the machine comprehension setting to 27.1 in the open-domain setting.
- If the correct document is given, performance is 49.4, even though the Document Retriever finds the right article 77.8% of the time.
- SQuAD questions were written with a specific paragraph in mind, hence the language can be ambiguous without that context.
## Future Scope:
- End-to-end training across document retriever and document reader pipeline, rather than independent systems.
# 2. Dense Passage Retrieval for Open-Domain Question Answering
https://arxiv.org/pdf/2004.04906.pdf
- Traditionally, passage retrieval is done using TF-IDF or BM25.
- In this work, dense representations of questions and passages are used.
- When evaluated on a range of QA datasets, the retriever outperforms a strong Lucene-BM25 system by 9%-19% in terms of top-20 passage retrieval accuracy.
## 1. Introduction
- TF-IDF represents questions and passages as high dimensional sparse vectors
- Can be searched efficiently using inverted index
`How many provinces did the Ottoman empire contain in the 17th century?`
`What part of the atom did Chadwick discover?`
In the above two questions, looking for keywords like "Ottoman" and "Chadwick" is crucial.
- In case of dense vectors, synonyms can have vectors close to each other
`“Who is the bad guy in lord of the rings?”`
can be answered by looking at the following passage:
`"Sala Baker is an actor and stuntman from New Zealand. He is best known for portraying the villain Sauron in the Lord of the Rings trilogy ..."`
Here "villain" and "bad guy" are close to each other in dense space, whereas a term-based retrieval system will have difficulty retrieving this passage.
- With pairs of questions and contexts that contain the answers, the distance between the question and context in the latent semantic space can effectively be optimized by learning the embedding function.
- With special in-memory data structure and indexing schemes, search in the dense vector space can also be done efficiently, thanks to tools like FAISS (Johnson et al., 2017).
- Our contributions are two-fold:
- Simply fine-tuning the question and passage encoder on existing question-passage pairs is sufficient to greatly outperform a strong Lucene-BM25 system in terms of top-k retrieval accuracy.
- A higher retrieval accuracy indeed translates to a higher end-to-end QA accuracy.
## 2. Dense Passage Retriever
R : (q, $C$) → $C_{F}$ is a function that takes as input a question $q$ and a corpus $C$ and returns a much smaller filter set of documents $C_{F} \subset C$, where $|C_{F}| = k \ll |C|$.
- **Goal:** Given a collection of M text passages, the goal of DPR is to index all the passages in a low-dimensional and continuous space, such that it can retrieve efficiently the top k passages relevant to the input question for the document reader at run-time.
- M is very large (21 million passages)
- k is usually small, such as 20-100.
- Passage encoder $E_{P}$ maps any text passage to a d-dimensional real-valued vector; we build an index over all M passage vectors.
- Question encoder $E_{Q}$ maps the input question to a d-dimensional vector
- Retrieves the k passages whose vectors are closest to the question vector, using the dot product as similarity:
- sim(q, p) = $E_{Q}(q)^{T}$$E_{P}(p)$,
- For both $E_{P}$ and $E_{Q}$, we use two independent BERT networks and take the representation at the [CLS] token as the output (d = 768)
- <b>Inference</b>:
- We apply the passage encoder $E_{P}$ to all the passages and index them using FAISS (Johnson et al., 2017) offline.
- FAISS is an extremely efficient, open-source library for similarity search and clustering of dense vectors, which **can easily be applied to billions of vectors**.
- Given a question q at run-time, we derive its embedding $v_{q} = E_{Q}(q)$ and retrieve the top k passages with embeddings closest to $v_{q}$.
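A hedged sketch of this encode-index-search loop with Hugging Face `transformers` and FAISS; the plain `bert-base-uncased` checkpoints and the toy passage list are illustrative assumptions, not the trained DPR encoders.
```python
import faiss
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
passage_encoder = BertModel.from_pretrained("bert-base-uncased")   # stands in for E_P
question_encoder = BertModel.from_pretrained("bert-base-uncased")  # stands in for E_Q

@torch.no_grad()
def encode(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation, d = 768

# Offline: encode and index all passages (21M in the paper; a toy list here).
passages = ["Sala Baker is an actor and stuntman from New Zealand. ..."]
index = faiss.IndexFlatIP(768)                       # inner product = dot-product similarity
index.add(encode(passage_encoder, passages).numpy())

# Run-time: embed the question and retrieve the top-k closest passages.
q_vec = encode(question_encoder, ["Who is the bad guy in lord of the rings?"]).numpy()
scores, top_ids = index.search(q_vec, 1)
```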
### Training
The training data consists of m instances:
D = {<$q_{i}$, $p_{i}^{+}$, $p_{i,1}^{-}$, . . . , $p_{i,n}^{-}$>} where i = 1, . . . , m
- Each instance contains one question, one relevant passage and n irrelevant passages.
- Loss function is the NLL of the correct passage
<img src="https://i.imgur.com/THzq45i.png" width="400" />
- In practice the negative passages selection is crucial. We experiment with three types of negatives:
1. Random: any random passage from the corpus
2. BM25: top passages retrieved by BM25 which don't contain the answer but match the question tokens heavily
3. Gold: positive passages paired with other questions that appear in the training set
- Our best model uses gold passages from the same mini-batch (computationally efficient) and one BM25 negative passage.
- <b>In Batch Negatives</b>
- Let Q and P be (B×d) matrices of question and passage embeddings
- S = $QP^{T}$ is (B x B)
- Any {$q_{i}, p_{j}$} pair is a positive example when i=j, and negative otherwise
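A minimal PyTorch sketch of the in-batch negative loss implied by this setup (the extra BM25 negative per question is omitted for brevity):
```python
import torch
import torch.nn.functional as F

def in_batch_nll(Q, P):
    """NLL of the correct passage when S = Q P^T: diagonal entries are positives, the rest negatives."""
    # Q, P: (B, d) question and positive-passage embeddings for one mini-batch
    S = Q @ P.t()                                    # (B, B) similarity matrix
    targets = torch.arange(Q.size(0), device=Q.device)
    return F.cross_entropy(S, targets)               # softmax over each row + NLL of the diagonal
```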
## 3. Experimental Setup
### Wikipedia Preprocessing
- English Wikipedia dump from Dec. 20, 2018 as the source documents for answering questions.
- Apply pre-processing to extract the clean text portion of articles from the Wikipedia dump, removing tables and lists
- Each article is split into blocks of 100 words as passages; we end up with 21 million passages.
### Question Answering Datasets
1. NaturalQuestions: Real Google search queries and answer spans in wiki
2. TriviaQA: Trivia scraped from the Web
3. WebQuestions: Questions selected using the Google Suggest API, with answers in Freebase
4. CuratedTREC: sources questions from TREC QA tracks
5. SQuAD v1.1: Wiki paragraphs and QAs
### Selecting Positive Passages
- For TREC, TriviaQA and WebQuestions we only have QA pairs.
- We use the highest-ranked passage from BM25 that contains the answer as the positive passage
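A hedged sketch of this positive-passage selection using the `rank_bm25` library and whitespace tokenization; the paper's exact BM25 implementation and tokenizer may differ.
```python
from rank_bm25 import BM25Okapi

def select_positive(question, answer, passages):
    """Return the highest-ranked BM25 passage that contains the answer string, else None."""
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    scores = bm25.get_scores(question.lower().split())
    for idx in sorted(range(len(passages)), key=lambda i: -scores[i]):
        if answer.lower() in passages[idx].lower():
            return passages[idx]
    return None
```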
## 4. Experiments: Passage Retrieval
- Trained using In-Batch Negatives scheme with one additional BM25 negative passage per question.
- 30 epochs for large datasets (NQ, Trivia, SQuAD)
- 100 epochs for small datasets (TREC, WQ)

- With the exception of SQuAD, DPR performs consistently better than BM25 on all datasets.
- When training with multiple datasets, TREC, the smallest dataset of the five, benefits greatly from more training examples
- Results are improved further in some cases by combining DPR with BM25.
- Reasons for low performance on SQuAD:
- Annotators wrote questions after seeing the passage. High lexical overlap between passages and questions gives BM25 an advantage
- Data was collected from only 500+ Wikipedia articles and thus the distribution of training examples is extremely biased
### Effect of Training Size
<img src="https://i.imgur.com/jY5Pklu.png" width="400" />
### Qualitative Examples

## 5. Experiments: Question Answering
We plug different retriever models into the pipeline to evaluate performance on end-to-end question answering.
$P_{start,i}$(s) = softmax($P_{i}w_{start})_{s}$
$P_{end,i}$(t) = softmax($P_{i}w_{end})_{t}$
$P_{selected}(i)$ = softmax$(\hat{P}^{T}w_{selected})_{i}$
$\hat{P}$ = [$P_{1}^{[CLS]}$, . . ., $P_{k}^{[CLS]}$] $\in$ $R^{h \times k}$
$w_{start}$, $w_{end}$, $w_{selected}$ $\in$ $R^{h}$ are learnable vectors.
The span score is computed as $P_{start,i}$(s) × $P_{end,i}$(t).
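A minimal PyTorch sketch of these scoring heads with assumed shapes (k passages of L tokens each, hidden size h):
```python
import torch.nn.functional as F

def reader_scores(P, P_cls, w_start, w_end, w_selected):
    """Compute span-start, span-end and passage-selection distributions."""
    # P: (k, L, h) token representations; P_cls: (k, h) [CLS] representations
    # w_start, w_end, w_selected: (h,) learnable vectors
    p_start = F.softmax(P @ w_start, dim=1)             # P_start,i(s): start distribution per passage
    p_end = F.softmax(P @ w_end, dim=1)                 # P_end,i(t): end distribution per passage
    p_selected = F.softmax(P_cls @ w_selected, dim=0)   # P_selected(i): distribution over passages
    # A span (s, t) in passage i is scored as p_start[i, s] * p_end[i, t].
    return p_start, p_end, p_selected
```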

- A better retrieval method leads to better overall QA accuracy; DPR is better than BM25.
- Smaller datasets improve the most with multitask training.