Paper Notes of NLP-Rock
paper link
https://youtu.be/5qmMwHeIvvc
Information Retrieval = indexing + retrieving
Indexing
- Inverted index
- Maps each word to the documents that contain it. [(query, Doc)]
- Time Complexity and Space Complexity
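To make the inverted index idea concrete, here is a minimal sketch in pure Python. The toy documents and the function names `build_inverted_index` and `lookup` are assumptions made up for illustration; the lookup also previews the Boolean (AND) matching listed under Retrieving.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def lookup(index, query):
    """Return ids of documents that contain every query word (Boolean AND)."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)

# Hypothetical toy corpus: doc id -> text
docs = {
    0: "bert is a language model",
    1: "inverted index maps a word to documents",
    2: "a language model assigns probability to a word sequence",
}
index = build_inverted_index(docs)
hits = lookup(index, "language model")
```

Building the index costs time proportional to the total number of words, and a lookup only touches the posting lists of the query words instead of scanning every document.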
Retrieving
- Boolean
- Keyword Matching
- 0-1 matching
- Vector Space Model (Algebraic models)
(Weighting)
- Language Model (Probabilistic models)
- (Ordered) n-gram
- (Unordered) skip-gram
- metric
- Precision & Recall
- F1 measure
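As a quick reminder of how the metrics above are computed, here is a small sketch. The function name `precision_recall_f1` and the example ids are assumptions for illustration only.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical example: 4 documents retrieved, 3 documents truly relevant.
p, r, f1 = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

Here two of the four retrieved documents are relevant (precision 1/2) and two of the three relevant documents are retrieved (recall 2/3); F1 is their harmonic mean.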
Ranking
- Learning to Rank
- Pointwise Learning (Point to true point)
- Pairwise Learning (Pair to true Pair)
- Listwise Learning (List to true List) [LambdaMART]
- Recommendation System
- Query: user-related vs. keyword-based
- Answer: preference of the user vs. true relevance to the document
- Metric
(Retrieve & Rerank)
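The three learning-to-rank styles listed above differ mainly in what the loss compares. The sketch below is only an illustration under assumed loss choices (squared error, pairwise logistic loss, and a ListNet-style softmax cross-entropy); it is not LambdaMART, and all function names are hypothetical.

```python
import math

def pointwise_loss(scores, labels):
    """Each score is compared with its own label (squared error)."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def pairwise_loss(scores, labels):
    """Logistic loss over all pairs (i, j) where doc i should rank above doc j."""
    losses = [
        math.log(1.0 + math.exp(-(scores[i] - scores[j])))
        for i in range(len(scores))
        for j in range(len(scores))
        if labels[i] > labels[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_loss(scores, labels):
    """Cross-entropy between softmax(labels) and softmax(scores) over the whole list."""
    p_true, p_pred = softmax(labels), softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))

labels = [2.0, 1.0, 0.0]          # graded relevance of three documents
good_scores = [3.0, 1.5, 0.2]     # same order as the labels
bad_scores = [0.2, 1.5, 3.0]      # reversed order
```

With these toy numbers, both the pairwise and the listwise loss are lower for `good_scores` than for `bad_scores`, since only the former ranks the documents in the labeled order.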
2. Problem & Model Formulation:
Retrieve the relevant documents/paragraphs/sentences from a huge corpus through BERT.
This can be done by two kinds of models.
2-1 Cross Attention Model (Original one):
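For a given query q and a BERT-based model f, the score of document $d_i$ is:
$$\text{score}(q, d_i) = W\big(f(q \oplus d_i)\big)$$
where f is a BERT-based model applied to the concatenation $q \oplus d_i$ of the query and the document, and W is a kind of regression model on top of its output.
However, in a normal IR problem with hundreds of candidate documents, computing $f(q \oplus d_i)$ for every document takes a lot of time, because each query needs one full BERT forward pass per document.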
2-2 Two Tower Model (the paper focuses on this):
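On the other hand, under the same setting as above, the score of document $d_i$ in the two-tower model is:
$$\text{score}(q, d_i) = \langle \phi(q), \psi(d_i) \rangle$$
where $\phi$ (the query tower) and $\psi$ (the document tower) are jointly trained BERT-based models.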
Inference
Although this looks similar to the cross-attention score, the document embeddings do not depend on the query, so the embeddings of all documents in the corpus can be pre-computed and indexed.
Thus, we only need to calculate the query embedding at inference time.
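The inference flow above (pre-compute document embeddings offline, embed only the query online, rank by inner product) can be sketched as follows. This is a toy illustration: `toy_encode` is a hypothetical bag-of-words stand-in for the two BERT towers, and the corpus is made up.

```python
def toy_encode(text, vocab):
    """Stand-in for a BERT tower (assumption): a bag-of-words count vector."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy corpus
corpus = [
    "bert is a pre-trained language model",
    "the inverted index maps words to documents",
    "two tower models encode query and document separately",
]
vocab = sorted({w for doc in corpus for w in doc.lower().split()})

# Offline step: document embeddings do not depend on the query,
# so they are computed once and stored as the index.
doc_index = [toy_encode(doc, vocab) for doc in corpus]

def retrieve(query, k=2):
    """Online step: one query encoding, then n inner products."""
    q_emb = toy_encode(query, vocab)
    scores = [inner_product(q_emb, d_emb) for d_emb in doc_index]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

top = retrieve("two tower query", k=1)
```

The linear scan over `doc_index` is only for clarity; with millions of documents one would typically swap it for an approximate maximum inner product search structure.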
Optimization:
Consider a corpus of n documents.
The step above then generates n inner products, one per document.
If query q and document d are relevant, they should have the highest similarity/inner product.
The higher the inner product, the higher the similarity.
So retrieval can be transformed into a multi-class classification problem over the n documents.
To turn the scores into a probability distribution, we use softmax:
$$p(d \mid q) = \frac{\exp\big(\text{score}(q, d)\big)}{\sum_{d' \in D} \exp\big(\text{score}(q, d')\big)}$$
where $\text{score}(q, d)$ is the inner product between the query embedding and the document embedding, and D is the whole corpus.
However, in most IR cases there are millions of documents in the corpus, so the full softmax is expensive. To increase efficiency, we can apply sampled softmax, which replaces the whole corpus D with a small sampled subset of it.
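A minimal sketch of the full softmax loss versus a sampled version is given below. It assumes uniform sampling of negatives and omits any correction term for the sampling distribution, so it is a simplification; the function names and scores are hypothetical.

```python
import math
import random

def log_softmax_loss(scores, target):
    """Negative log softmax probability of scores[target], computed stably."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def sampled_softmax_loss(scores, target, num_negatives, rng):
    """Same loss, but the denominator uses the target plus a few sampled negatives."""
    negatives = [i for i in range(len(scores)) if i != target]
    sampled = rng.sample(negatives, num_negatives)
    subset = [scores[target]] + [scores[i] for i in sampled]
    return log_softmax_loss(subset, 0)  # the target sits at position 0 of the subset

# Hypothetical inner products between one query and five documents;
# document 0 is the relevant one.
scores = [4.0, 1.0, 0.5, -1.0, 2.0]
rng = random.Random(0)

full_loss = log_softmax_loss(scores, target=0)
sampled_loss = sampled_softmax_loss(scores, target=0, num_negatives=2, rng=rng)
all_negatives_loss = sampled_softmax_loss(scores, target=0, num_negatives=4, rng=rng)
```

Because the sampled denominator sums over fewer terms, the sampled loss never exceeds the full loss; when every negative is sampled the two coincide.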
3. The Need for Pre-Training
It will take a lot of time/effort to train from scratch.
Why not pre-train it on some tasks that share commonality with the downstream task?
Find some pre-train tasks -> Pre-train model -> fine-tune on downstream tasks.
BERT pre-training tasks:
- Masked LM
- Next Sentence Prediction
Pre-training tasks in the paper:
- ICT (Inverse Cloze Task): inverse cloze at the sentence level (a sentence -> the rest of its passage).
They then make a strong assumption:
The sentences in the first section summarize the whole document.
- BFS (Body First Selection): first-order hop (summary -> random paragraph in the same document)
Another assumption: given a document A, if A contains a hyperlink to B, then A and B are related.
- WLP (Wiki Link Prediction): second-order hop (summary -> random paragraph in another related document)
- Masked LM: same as above.
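To illustrate how the three paragraph-level tasks above produce (query, document) training pairs, here is a toy sketch. The `pages` structure, its contents, and the helper names are assumptions for illustration; the link direction in `wlp_pair` follows the description in these notes (A links to B).

```python
import random

# Hypothetical toy wiki: each page has passages (lists of sentences) and
# outgoing hyperlinks. The first passage plays the role of the summary section.
pages = {
    "BERT": {
        "passages": [
            ["BERT is a language model.", "It is pre-trained on large corpora."],
            ["BERT uses masked language modeling.", "It also uses next sentence prediction."],
        ],
        "links": ["Transformer"],
    },
    "Transformer": {
        "passages": [
            ["The Transformer relies on attention.", "It has no recurrence."],
            ["Encoders stack self-attention layers.", "Decoders add cross attention."],
        ],
        "links": [],
    },
}

def ict_pair(title, rng):
    """ICT: query = one sentence, document = the rest of the same passage."""
    passage = rng.choice(pages[title]["passages"])
    i = rng.randrange(len(passage))
    return passage[i], " ".join(passage[:i] + passage[i + 1:])

def bfs_pair(title, rng):
    """BFS: query = summary sentence, document = a body passage of the same page."""
    query = rng.choice(pages[title]["passages"][0])
    return query, " ".join(rng.choice(pages[title]["passages"][1:]))

def wlp_pair(title, rng):
    """WLP: query = summary sentence, document = a passage of a linked page."""
    query = rng.choice(pages[title]["passages"][0])
    linked = rng.choice(pages[title]["links"])
    return query, " ".join(rng.choice(pages[linked]["passages"]))

rng = random.Random(0)
ict_q, ict_d = ict_pair("BERT", rng)
bfs_q, bfs_d = bfs_pair("BERT", rng)
wlp_q, wlp_d = wlp_pair("BERT", rng)
```

All three tasks need only raw text and hyperlinks, so the pairs can be generated without any human labels.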
Experiments: (Sadly, there is no implementation code to verify the results here)
Pre-Training cost:
32 TPU * 2.5 days
Downstream tasks:
- Dataset: SQuAD + Natural Questions
- Given a question q, return an answer a and a passage p, where p is the evidence passage containing a.
  2-1. However, since the paper considers the "Retrieve" problem, each example is transformed into a triple (q, s, p), where s is a sentence in p that contains a. (Given q, return (s, p).)
- SQuAD: [Figure: results]
- Natural Questions: [Figure: results]
- Ablation Study: [Figure: results]
- Open-domain retrieval (by adding 1M dummy sentence pairs (s, p) to the candidate answers): [Figure: results]
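The conversion of a QA example (q, a, p) into a retrieval example (q, s, p) described under the downstream tasks can be sketched as follows. The naive regex sentence splitter and the function names are assumptions for illustration, not the preprocessing used in the paper.

```python
import re

def split_sentences(passage):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]

def to_retrieval_triple(question, answer, passage):
    """Return (q, s, p) where s is the first sentence of p containing the answer."""
    for sentence in split_sentences(passage):
        if answer in sentence:
            return question, sentence, passage
    return None  # the answer string does not appear in the passage

# Hypothetical example
passage = "BERT was released in 2018. It was developed at Google. It is widely used."
triple = to_retrieval_triple("Who developed BERT?", "Google", passage)
missing = to_retrieval_triple("Who developed BERT?", "Facebook", passage)
```

At retrieval time the model is given only q and has to return the pair (s, p) from the candidate set.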
Pre-training Tasks for Embedding-based Large-scale Retrieval
- Open Review
- ReQA: An Evaluation for End-to-End Answer Retrieval Models
  (Retrieval there is based on paragraphs, while this paper is based on sentences.)
- Latent Retrieval for Weakly Supervised Open Domain Question Answering
  (Its BERT vs. BM25 result conflicts with this paper: it also uses "ICT" as a pre-training task and cannot beat BM25 there, whereas "ICT" in the current paper beats BM25 easily. This is also mentioned in the Open Review discussion.)
Questions