# Paper Notes of NLP-Rock
[paper link](https://arxiv.org/pdf/2002.03932.pdf)
[video link](https://youtu.be/5qmMwHeIvvc)
---
## 1. Information Retrieval
Information Retrieval = indexing + retrieving
### Indexing
1. Inverted index
    * From a word, find the documents containing it: [(query, Doc)]
    * Ex. hash table or array (a minimal sketch follows below)
2. Time complexity and space complexity
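A minimal inverted-index sketch using a Python dict as the hash table; the toy corpus and whitespace tokenizer are placeholders, not from the paper.

```python
from collections import defaultdict

# Toy corpus: doc_id -> text (hypothetical example data).
docs = {
    0: "information retrieval with inverted index",
    1: "bert based retrieval for open domain questions",
}

# Build the inverted index: term -> set of doc_ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():   # naive whitespace tokenizer
        index[term].add(doc_id)

# Lookup is O(1) per term (hash table); space grows with the
# number of distinct (term, doc) postings.
print(index["retrieval"])   # {0, 1}
```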
### Retrieving
0. Boolean
* Keyword Matching
* 0-1 matching
1. Vector Space Model (Algebraic models)
(Weighting)
* $\text{TFIDF}$
* $\text{BM25}$ : $\text{IDF} \cdot \frac{(k + 1)\, tf}{k\,(1.0 - b + b\,\frac{|d|}{\text{avg}|d|}) + tf}$ (a scoring sketch follows this list)
2. Language Model (Probabilistic models)
* (Ordering) n-gram : $p(w_n \mid w_1, \dots, w_{n-1})$
* (Unordering) skip-gram : $p(w_{n-t}, \dots, w_{n-1}, w_{n+1}, \dots, w_{n+t} \mid w_n)$
3. Metric
    * Precision & Recall
    * F1 measure
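A small sketch of the BM25 term score in the formula above; the IDF smoothing variant and the defaults $k=1.2$, $b=0.75$ are common choices (assumptions), not taken from these notes.

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k=1.2, b=0.75):
    """Score of one query term for one document, following the formula in the notes:
    IDF * (k+1)*tf / (k*(1 - b + b*|d|/avg|d|) + tf)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # a common smoothed IDF variant (assumption)
    denom = k * (1.0 - b + b * doc_len / avg_doc_len) + tf
    return idf * (k + 1) * tf / denom

# Example: a term appearing 3 times in a 120-token document,
# occurring in 50 of 10_000 documents, average length 100 tokens.
print(bm25_term_score(tf=3, df=50, n_docs=10_000, doc_len=120, avg_doc_len=100))
```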
### Ranking
1. Learning to Rank
* Pointwise Learning (Point to true point)
* Pairwise Learning (Pair to true Pair)
* Listwise Learning (List to true List) [LambdaMART]
2. Recommendation System
    * Query: user-related vs. keyword-based
    * Answer: preference related to the user vs. truly related to the document
3. Metric
    * Mean Avg. Precision (MAP)
    * NDCG ( https://reurl.cc/pd6YMQ ) (an NDCG@k sketch follows this list)
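A minimal NDCG@k sketch over graded relevance labels; the log2 discount is the standard convention and the example numbers are made up.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain of the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, k):
    """DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of documents in the order the system returned them.
print(ndcg([3, 0, 2, 1], k=4))   # < 1.0 because the ranking is not ideal
```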
### (Retrieve & Rerank)
---
## 2. Problem & Model Formulation:
Retrieve the relevant documents/paragraphs/sentences from a huge corpus through BERT.
This can be done by two kinds of models.
### 2-1 Cross Attention Model (Original one):


For a given query q and a BERT-based model f, the score of document i is:
$$score_i = f(q, d_i) = W(\phi(q, d_i)),$$ where $\phi$ is a BERT-based model and W is a kind of regression model.
However, in typical IR settings the corpus contains a huge number of documents, so computing $f(q, d_i)$ for every document takes a lot of time.
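A hedged sketch of the cross-attention scorer: query and document are fed jointly to BERT, and W is a linear head on the [CLS] vector. The Hugging Face checkpoint name and the single-logit head are illustrative assumptions, not specified in the notes.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
phi = AutoModel.from_pretrained("bert-base-uncased")       # phi(q, d_i): joint encoder
W = torch.nn.Linear(phi.config.hidden_size, 1)             # "regression" head on [CLS]

def cross_attention_score(query: str, doc: str) -> torch.Tensor:
    # The query and document attend to each other inside one forward pass,
    # which is why every (q, d_i) pair needs its own BERT call.
    inputs = tok(query, doc, return_tensors="pt", truncation=True)
    cls = phi(**inputs).last_hidden_state[:, 0]             # [CLS] embedding
    return W(cls).squeeze(-1)                               # score_i

print(cross_attention_score("who wrote hamlet", "Hamlet is a tragedy by Shakespeare."))
```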
### 2-2 Two Tower Model (Paper focus on this):

On the other hand, under the same setup, the score of document i in the two-tower model is:
$$score_i = f(q, d_i) = d(\phi(q),\psi(d_i)),$$
where $\phi, \psi$ are jointly trained BERT-based models and $d$ is a simple similarity function such as the inner product.
### Inference
Though this may look similar to the model above, the document embeddings do not depend on the query, so they can all be pre-computed and indexed.
Thus, at inference time we only need to compute the query embedding and search the index.
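A sketch of the two-tower inference path under the same assumptions: all document embeddings are encoded once offline and indexed, so only the query embedding plus an inner-product search is computed online. The checkpoint name and [CLS] pooling are illustrative choices.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
phi = AutoModel.from_pretrained("bert-base-uncased")   # query tower
psi = AutoModel.from_pretrained("bert-base-uncased")   # document tower (separate weights in general)

def encode(model, texts):
    inputs = tok(texts, return_tensors="pt", padding=True, truncation=True)
    return model(**inputs).last_hidden_state[:, 0]     # [CLS] embedding per text

docs = ["Hamlet is a tragedy by Shakespeare.", "BM25 is a bag-of-words ranking function."]

# Offline: pre-compute and index all document embeddings (query-independent).
with torch.no_grad():
    doc_emb = encode(psi, docs)                        # (n_docs, hidden)

# Online: one query encoding + inner products against the index.
with torch.no_grad():
    q_emb = encode(phi, ["who wrote hamlet"])          # (1, hidden)
scores = q_emb @ doc_emb.T                             # score_i = <phi(q), psi(d_i)>
print(docs[scores.argmax().item()])
```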
---
## Optimization:
Consider a corpus with n documents.
The step above then generates n inner products per query.
If query q and document d are relevant, they should have the highest similarity/inner product.
The higher the inner product, the higher the similarity.
So retrieval can be cast as a multi-class classification problem over the documents.
To treat the scores as a probability distribution, we use a softmax:
$$p(d \mid q) = \frac{\exp(\langle \phi(q), \psi(d) \rangle)}{\sum_{d' \in D} \exp(\langle \phi(q), \psi(d') \rangle)}$$
However, in most IR cases there are millions of documents in the corpus, so to increase efficiency we apply a sampled softmax, computing the normalization over a small sampled subset of the whole corpus D.
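A minimal sketch of the sampled softmax where the sampled subset of D is simply the other documents in the same batch (a common in-batch-negatives realization; the embeddings below are random placeholders).

```python
import torch
import torch.nn.functional as F

batch, hidden = 8, 768
q_emb = torch.randn(batch, hidden)    # phi(q) for each query in the batch (placeholder)
d_emb = torch.randn(batch, hidden)    # psi(d) for the paired positive documents (placeholder)

# Every query is scored against every document in the batch:
# the diagonal holds the positive pairs, everything else acts as sampled negatives.
logits = q_emb @ d_emb.T                 # (batch, batch) inner products
labels = torch.arange(batch)             # correct document index for each query
loss = F.cross_entropy(logits, labels)   # softmax over the sampled subset
print(loss)
```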
---
## 3. The Need for Pre-Training
It takes a lot of time/effort to train from scratch.
Why not pre-train on some tasks that share commonality with the downstream task?
Find some pre-training tasks -> pre-train the model -> fine-tune on downstream tasks.
### BERT pre-training tasks:
1. Masked LM (a small masking sketch follows this list)

2. Next Sentence Prediction

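A rough illustration of how Masked LM inputs can be built; the 15% masking rate follows the BERT paper, while the whitespace tokenizer and the simple replace-with-[MASK] scheme are simplifications.

```python
import random

MASK, MASK_RATE = "[MASK]", 0.15

def mask_tokens(sentence: str):
    """Replace ~15% of tokens with [MASK]; the model is trained to recover them."""
    tokens = sentence.split()                     # stand-in for WordPiece tokenization
    targets = {}
    for i in range(len(tokens)):
        if random.random() < MASK_RATE:
            targets[i] = tokens[i]                # label the model must predict
            tokens[i] = MASK
    return tokens, targets

print(mask_tokens("the quick brown fox jumps over the lazy dog"))
```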
### Pre-training tasks in the paper
1. ICT: Inverse Cloze Task on sentences (a random sentence is the pseudo-query; the passage it was taken from is the positive context; see the sketch after this list).

They then give a strong assumption:
The sentences in the first section summarize the whole document.
2. BFS: First Order hop (summary -> random paragraph in the same document)

Another assumption: if document A contains a hyperlink to document B, then A and B are related.
3. WLP: Second-order hop (summary -> random paragraph in another, related document.)

4. Masked LM:
Same as above.
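A hedged sketch of constructing ICT training pairs: a randomly chosen sentence becomes the pseudo-query and the rest of its passage is the positive context. The period-based sentence splitting is a simplification, not the paper's exact preprocessing.

```python
import random

def ict_pair(passage: str):
    """Inverse Cloze Task: (pseudo-query sentence, passage with that sentence removed)."""
    sentences = [s.strip() for s in passage.split(".") if s.strip()]  # naive sentence split
    i = random.randrange(len(sentences))
    query = sentences[i]                                    # the "cloze" sentence
    context = ". ".join(sentences[:i] + sentences[i + 1:])  # positive passage
    return query, context

passage = ("Hamlet is a tragedy written by William Shakespeare. "
           "It is set in Denmark. The play dramatises revenge.")
print(ict_pair(passage))
```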
## Experiments: (Sadly, there is no implementation code to verify here)
### Pre-Training cost:
32 TPUs × 2.5 days
### Downstream tasks:
1. dataset: SQuAD + Natural Questions
2. $(q, a, p)$ triplets: given a question q, return an answer a and a passage p, where p is the evidence passage containing a.
2-1. However, since the paper focuses on the "retrieve" problem, each triplet is transformed into $(q, s, p)$, where s is the sentence in p that contains a. (Given q, return (s, p); a small sketch of this transformation follows the results below.)
1. SQuAD:

2. Natural Questions:

3. Ablation Study:

4. Open-domain retrieval (by adding 1M dummy (s, p) sentence pairs to the candidate answers)

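A small sketch of the (q, a, p) -> (q, s, p) transformation described above, where s is the sentence of p that contains the answer a; the sentence splitting is again a naive period split, not the paper's exact preprocessing.

```python
def to_retrieval_example(question: str, answer: str, passage: str):
    """Turn a (q, a, p) triplet into (q, s, p), where s is the sentence of p containing a."""
    for sentence in passage.split("."):                      # naive sentence split
        if answer in sentence:
            return question, sentence.strip() + ".", passage
    return None                                              # answer not found verbatim

q = "who wrote hamlet"
a = "William Shakespeare"
p = "Hamlet is a tragedy. It was written by William Shakespeare around 1600."
print(to_retrieval_example(q, a, p))
```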
---
## Pre-training Tasks for Embedding-based Large-scale Retrieval
1. [Open Review](https://openreview.net/forum?id=rkg-mA4FDr)
2. [ReQA: An Evaluation for End-to-End Answer Retrieval Models](https://www.aclweb.org/anthology/D19-5819.pdf)
(Retrieval based on paragraphs, while this paper is based on sentences.)
3. [Latent Retrieval for Weakly Supervised Open Domain Question Answering](https://arxiv.org/pdf/1906.00300.pdf)
(Reports a BERT vs. BM25 result that conflicts with this paper.)
(They also use ICT as a pre-training task but cannot beat BM25 in their paper, whereas ICT in the current paper beats BM25 easily; this discrepancy is also mentioned in the Open Review thread.)
## Questions
- [ ] What is the motivation for replacing BM25 with BERT embeddings? What are the potential sources of improvement?
- [ ] Why is a two-tower BERT required to realize embedding-space search? Can't a single BERT do it?
- [ ] What representative embedding (vector) space search tools are currently available?
- [ ] Does training a two-tower BERT really take longer than training a single BERT on [query;passage]?
- [ ] Is the choice of a ~8k training batch size just hyperparameter tuning, or is it related to the sampled softmax?
- [ ] Must the loss function use a sampled softmax? Could the NCE technique used in word2vec be applied instead, turning it into a pairwise calculation?
- [ ] What is the relationship between batch size and the time complexity of backpropagation? Is it O(batch size)?
- [ ] How can the sampled softmax be corrected so that it yields an unbiased estimate?
- [ ] The BFS/WLP tasks do not look particularly effective in the true open-domain scenario; do the experiments show a significant improvement?
- [ ] The paper does not seem to discuss the IR performance of pre-trained BERT + ICT training?
- [ ] Can the Wikipedia-based pre-training tasks generalize to other IR datasets, e.g., MSMARCO?
- [ ] Is the MLM pre-training discussed in this paper based on the paragraphs in the ReQA datasets, or on the Wikipedia dump from some point in time as in the original BERT paper?
- [ ] Is there some overlap between the NSP task and the ICT task?
- [ ] Is the design of the pre-training tasks only about defining positive query-passage pairs? Are negative paragraphs sampled only from the paragraphs already selected by the tasks?