# Paper Notes of NLP-Rock

[paper link](https://arxiv.org/pdf/2002.03932.pdf)

https://youtu.be/5qmMwHeIvvc

[![text](http://img.youtube.com/vi/5qmMwHeIvvc/0.jpg)](http://www.youtube.com/watch?v=5qmMwHeIvvc)

---

## 1. Information Retrieval

Information Retrieval = indexing + retrieving

### Indexing

1. Inverted index
   1. Maps a word to the documents containing it: [(query, Doc)]
      * Ex. hash table or array
   2. Time complexity and space complexity

### Retrieving

0. Boolean
   * Keyword matching
   * 0-1 matching
1. Vector Space Model (algebraic models) (weighting)
   * $\text{TFIDF}$
   * $\text{BM25}$: $\text{IDF} \cdot \frac{(k + 1)\, tf}{k \left(1 - b + b\, \frac{|d|}{\text{avg}|d|}\right) + tf}$
2. Language Model (probabilistic models)
   * (Ordered) n-gram: $p(w_n \mid w_1, \dots, w_{n-1})$
   * (Unordered) skip-gram: $p(w_{n-t}, \dots, w_{n-1}, w_{n+1}, \dots, w_{n+t} \mid w_n)$
3. Metrics
   * Precision & Recall ![](https://i.imgur.com/2PsmDp4.png)
   * F1 measure

### Ranking

1. Learning to Rank
   * Pointwise learning (point to true point)
   * Pairwise learning (pair to true pair)
   * Listwise learning (list to true list) [LambdaMART]
2. Recommendation System
   * Query: user-related vs. keyword-based
   * Answer: preference related to the user vs. truly related to the document
3. Metrics
   * Mean Avg. Precision (MAP)
   * NDCG ( https://reurl.cc/pd6YMQ )

### (Retrieve & Rerank)

------------------------------------------------------------

## 2. Problem & Model Formulation

Retrieve the relevant documents/paragraphs/sentences from a huge corpus through BERT. This can be done with two kinds of models.

### 2-1 Cross-Attention Model (the original one)

![](https://i.imgur.com/hzlM77o.jpg)
![](https://i.imgur.com/NmjqY0g.jpg)

For a given query q and a BERT-based model f, the score of document i is:

$$score_i = f(q, d_i) = W(\phi(q, d_i)),$$

where $\phi$ is a BERT-based model and W is a kind of regression model. However, in normal IR problems the corpus is huge, so computing $f(q, d_i)$ for every single document takes far too long.

### 2-2 Two-Tower Model (the paper's focus)

![](https://i.imgur.com/wtMTyyv.jpg)

Under the same setup, the score of document i in the two-tower model is:

$$score_i = f(q, d_i) = d(\phi(q), \psi(d_i)),$$

where $d(\cdot, \cdot)$ is a similarity function (the inner product here) and $\phi, \psi$ are jointly trained BERT-based models.

### Inference

Though this may look similar to the model above, the document embeddings are independent of the query, so the embeddings of the whole corpus can be pre-computed and indexed. At inference time we only need to compute the query embedding.

---

## Optimization

Suppose there are n documents in the corpus. The step above then generates n inner products. If query q and document d are relevant, they should have the highest similarity/inner product: the higher the inner product, the higher the similarity. The task can therefore be cast as a multi-class classification problem. To treat it as a probability distribution, we use the softmax:

![](https://i.imgur.com/31a4sRR.jpg)

However, in most IR settings there are millions of documents in the corpus. To increase efficiency, we can apply a sampled softmax, which samples a small subset of the whole corpus D; a sketch of the common in-batch-negative variant follows.
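To make the training objective concrete, here is a minimal PyTorch sketch of a sampled softmax, assuming the common in-batch-negative instantiation (each query's positive is document i of the batch, and the other B − 1 documents serve as the sampled negatives). The function name and shapes are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def in_batch_sampled_softmax_loss(query_emb, doc_emb):
    """Softmax loss over a batch of (query, relevant-document) pairs.

    query_emb: (B, H) outputs of the query tower, phi(q).
    doc_emb:   (B, H) outputs of the document tower, psi(d), where
               doc_emb[i] is the positive document for query_emb[i].
    The other B - 1 documents in the batch act as sampled negatives.
    """
    # scores[i, j] = inner product of phi(q_i) and psi(d_j); shape (B, B).
    scores = query_emb @ doc_emb.T
    # For query i the correct "class" is document i.
    labels = torch.arange(scores.size(0), device=scores.device)
    # Row-wise cross-entropy = softmax over the sampled subset, not the corpus.
    return F.cross_entropy(scores, labels)

# Toy usage: random tensors stand in for the two BERT towers' outputs.
B, H = 8, 128
q = torch.randn(B, H, requires_grad=True)  # phi(q) for a batch of queries
d = torch.randn(B, H, requires_grad=True)  # psi(d) for the paired documents
loss = in_batch_sampled_softmax_loss(q, d)
loss.backward()  # in real training this updates both towers jointly
```

Note that the cross-entropy over the B×B score matrix is exactly a softmax over the sampled subset rather than the full corpus, which is why a large batch gives each query more negatives per step.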
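On the serving side, because $\psi(d_i)$ is query-independent (see the Inference subsection above), retrieval reduces to a maximum-inner-product search over the pre-computed index. A brute-force NumPy sketch, with all sizes and names illustrative; a real system would use an approximate-nearest-neighbor library such as FAISS or ScaNN:

```python
import numpy as np

# Offline step: pre-compute psi(d_i) for every document and store the matrix.
# Random numbers stand in for real document embeddings here.
N_DOCS, DIM = 100_000, 128
doc_index = np.random.randn(N_DOCS, DIM).astype(np.float32)

def retrieve_top_k(query_emb, doc_index, k=10):
    """Return the indices of the k documents with the largest inner product."""
    scores = doc_index @ query_emb             # one matrix-vector product
    top_k = np.argpartition(-scores, k)[:k]    # O(n) partial selection
    return top_k[np.argsort(-scores[top_k])]   # order the k hits by score

# Online step: embed the query with the query tower, then search the index.
query_emb = np.random.randn(DIM).astype(np.float32)  # stand-in for phi(q)
print(retrieve_top_k(query_emb, doc_index))
```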
---

## 3. The Need for Pre-Training

Training from scratch takes a lot of time and effort, so why not pre-train on tasks that share commonality with retrieval? Find some pre-training tasks -> pre-train the model -> fine-tune on downstream tasks.

### BERT pre-trained tasks

1. Masked LM ![](https://i.imgur.com/0w5gEdK.jpg =400x)
2. Next Sentence Prediction ![](https://i.imgur.com/hPYqCKC.jpg =400x)

### Pre-training tasks in the paper

1. ICT: inverse cloze on a sentence (a construction sketch is given at the end of these notes). ![](https://i.imgur.com/pr2tmMX.jpg) The next two tasks rest on a strong assumption: the sentences in the first section summarize the whole document.
2. BFS: first-order hop (summary -> random paragraph in the same document). ![](https://i.imgur.com/9n3Nayz.jpg) A further assumption: if document A contains a hyperlink to document B, then A and B are related.
3. WLP: second-order hop (summary -> random paragraph in another related document). ![](https://i.imgur.com/0sxbLFT.jpg)
4. Masked LM: same as above.

## Experiments

(Sadly, no implementation code is available to verify the results.)

### Pre-training cost

32 TPUs × 2.5 days

### Downstream tasks

1. Datasets: SQuAD + Natural Questions
2. $(q, a, p)$ triplets: given a question q, return an answer a and a passage p, where p is the evidence passage containing a.
   * However, since the paper considers the "retrieve" problem, each answer is transformed into a $(q, s, p)$ triplet, where s is a sentence in p that contains a. (Given q, return (s, p).)

Results:

1. SQuAD: ![](https://i.imgur.com/NqUZrax.jpg)
2. Natural Questions: ![](https://i.imgur.com/oPOaSuZ.jpg)
3. Ablation study: ![](https://i.imgur.com/QMKeQbx.jpg)
4. Open-domain retrieval (by adding 1M dummy (s, p) sentence pairs to the candidate answers): ![](https://i.imgur.com/V5ngb8p.jpg)

---

## Pre-training Tasks for Embedding-based Large-scale Retrieval

1. [Open Review](https://openreview.net/forum?id=rkg-mA4FDr)
2. [ReQA: An Evaluation for End-to-End Answer Retrieval Models](https://www.aclweb.org/anthology/D19-5819.pdf) (retrieval based on paragraphs, while this paper is based on sentences)
3. [Latent Retrieval for Weakly Supervised Open Domain Question Answering](https://arxiv.org/pdf/1906.00300.pdf) (a result that conflicts with this paper: they also use ICT as a pre-training task yet cannot beat BM25, whereas ICT in the current paper beats BM25 easily; the discrepancy is also raised in the Open Review thread)

Questions
---

- [ ] What is the motivation for replacing BM25 with BERT embeddings? What are the potential sources of improvement?
- [ ] Why is a two-tower BERT necessary for embedding-space search? Can a single BERT not do it?
- [ ] Which representative embedding (vector) space search tools are currently available?
- [ ] Does training a two-tower BERT really take longer than training a single BERT on [query;passage]?
- [ ] Is the ~8k training batch size just hyperparameter tuning, or is it related to the sampled softmax?
- [ ] Must the loss function be a sampled softmax? Could the NCE technique used in word2vec be applied to turn it into a pairwise calculation?
- [ ] What is the relationship between batch size and the time complexity of backpropagation? O(batch size)?
- [ ] How can the sampled softmax be corrected so that it yields an unbiased estimate?
- [ ] The BFS/WLP tasks do not look particularly effective in the truly open-domain scenario; do the experiments show a significant improvement?
- [ ] The paper does not seem to discuss the IR performance of pre-trained BERT + ICT training?
- [ ] Can the Wikipedia-based pre-training tasks generalize to other IR datasets, e.g., MS MARCO?
- [ ] Is the MLM pre-training discussed in this paper based on the paragraphs in the ReQA dataset, or on a Wikipedia dump from some point in time, as in the original BERT paper?
- [ ] Is there some overlap between the NSP task and the ICT task?
- [ ] Is the design of the pre-training tasks only about defining positive query-passage pairs? Are negative paragraphs drawn only from the paragraphs already selected by the tasks?
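---

### Appendix: ICT pair construction (sketch)

As referenced in the pre-training task list above, this is a minimal sketch of how one ICT training pair can be built; the function name and toy passage are illustrative, not from the paper. (In Lee et al.'s ICT the removed sentence is sometimes left in the passage so the model can still learn lexical overlap; that detail is omitted here.)

```python
import random

def make_ict_pair(passage_sentences):
    """Build one Inverse Cloze Task (pseudo-query, pseudo-document) pair.

    A random sentence is removed from the passage and treated as the
    query; the remaining sentences form the document it should retrieve.
    """
    i = random.randrange(len(passage_sentences))
    pseudo_query = passage_sentences[i]
    pseudo_doc = " ".join(passage_sentences[:i] + passage_sentences[i + 1:])
    return pseudo_query, pseudo_doc

# Toy usage on a three-sentence passage.
passage = [
    "The two-tower model encodes queries and documents separately.",
    "Document embeddings can be pre-computed and indexed.",
    "Only the query embedding is computed at inference time.",
]
q, d = make_ict_pair(passage)
print("pseudo-query:", q)
print("pseudo-document:", d)
```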