# Paper Notes of NLP-Rock
[paper link](https://arxiv.org/pdf/2002.03932.pdf)
[video link](https://youtu.be/5qmMwHeIvvc)
---
## 1. Information Retrieval
Information Retrieval = indexing + retrieving
### Indexing
1. Inverted index
    * From a word, find the documents containing it: [(query, Doc)]
    * Ex. hash table or array (a minimal sketch follows below)
2. Time complexity and space complexity
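A minimal inverted-index sketch using a Python dict as the hash table; the toy corpus and whitespace tokenizer are placeholders, not from the paper.

```python
from collections import defaultdict

# Toy corpus: doc_id -> text (hypothetical example data).
docs = {
    0: "information retrieval with inverted index",
    1: "bert based retrieval for open domain questions",
}

# Build the inverted index: term -> set of doc_ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():   # naive whitespace tokenizer
        index[term].add(doc_id)

# Lookup is O(1) per term (hash table); space grows with the
# number of distinct (term, doc) postings.
print(index["retrieval"])   # {0, 1}
```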
### Retrieving
0. Boolean
* Keyword Matching
* 0-1 matching
1. Vector Space Model (Algebraic models)
(Weighting)
* $\text{TFIDF}$
* $\text{BM25}$ : $\text{IDF} \cdot \frac{(k + 1)\, tf}{k\,(1.0 - b + b\,\frac{|d|}{\text{avg}|d|}) + tf}$ (a scoring sketch follows this list)
2. Language Model (Probabilistic models)
* (Ordering) n-gram : $p(w_n \mid w_1, \dots, w_{n-1})$
* (Unordering) skip-gram : $p(w_{n-t}, \dots, w_{n-1}, w_{n+1}, \dots, w_{n+t} \mid w_n)$
3. Metric
    * Precision & Recall
    * F1 measure
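A small sketch of the BM25 term score in the formula above; the IDF smoothing variant and the defaults $k=1.2$, $b=0.75$ are common choices (assumptions), not taken from these notes.

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k=1.2, b=0.75):
    """Score of one query term for one document, following the formula in the notes:
    IDF * (k+1)*tf / (k*(1 - b + b*|d|/avg|d|) + tf)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # a common smoothed IDF variant (assumption)
    denom = k * (1.0 - b + b * doc_len / avg_doc_len) + tf
    return idf * (k + 1) * tf / denom

# Example: a term appearing 3 times in a 120-token document,
# occurring in 50 of 10_000 documents, average length 100 tokens.
print(bm25_term_score(tf=3, df=50, n_docs=10_000, doc_len=120, avg_doc_len=100))
```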
### Ranking
1. Learning to Rank
* Pointwise Learning (Point to true point)
* Pairwise Learning (Pair to true Pair)
* Listwise Learning (List to true List) [LambdaMART]
2. Recommendation System
    * Query: user-related vs. keyword-based
    * Answer: preference related to the user vs. truly related to the document
3. Metric
    * Mean Avg. Precision (MAP)
    * NDCG ( https://reurl.cc/pd6YMQ ) (an NDCG@k sketch follows this list)
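A minimal NDCG@k sketch over graded relevance labels; the log2 discount is the standard convention and the example numbers are made up.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain of the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, k):
    """DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of documents in the order the system returned them.
print(ndcg([3, 0, 2, 1], k=4))   # < 1.0 because the ranking is not ideal
```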
### (Retrieve & Rerank)
---
## 2. Problem & Model Formulation:
Retrieve the relevant documents/paragraphs/sentences from a huge corpus through BERT.
This can be done by two kinds of models.
### 2-1 Cross Attention Model (Original one):


For a given query q and a BERT-based model f, the score of document i is:
$$score_i = f(q, d_i) = W(\phi(q, d_i)),$$ where $\phi$ is a BERT-based model and W is a kind of regression model.
However, in typical IR settings the corpus contains a huge number of documents, so computing $f(q, d_i)$ for every document takes a lot of time.
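A hedged sketch of the cross-attention scorer: query and document are fed jointly to BERT, and W is a linear head on the [CLS] vector. The Hugging Face checkpoint name and the single-logit head are illustrative assumptions, not specified in the notes.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
phi = AutoModel.from_pretrained("bert-base-uncased")       # phi(q, d_i): joint encoder
W = torch.nn.Linear(phi.config.hidden_size, 1)             # "regression" head on [CLS]

def cross_attention_score(query: str, doc: str) -> torch.Tensor:
    # The query and document attend to each other inside one forward pass,
    # which is why every (q, d_i) pair needs its own BERT call.
    inputs = tok(query, doc, return_tensors="pt", truncation=True)
    cls = phi(**inputs).last_hidden_state[:, 0]             # [CLS] embedding
    return W(cls).squeeze(-1)                               # score_i

print(cross_attention_score("who wrote hamlet", "Hamlet is a tragedy by Shakespeare."))
```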
### 2-2 Two Tower Model (Paper focus on this):

On the other hand, under the same setup, the score of document i in the two-tower model is:
$$score_i = f(q, d_i) = d(\phi(q),\psi(d_i)),$$
where $\phi, \psi$ are jointly trained BERT-based models and $d$ is a simple similarity function such as the inner product.
### Inference
Though this may look similar to the model above, the document embeddings do not depend on the query, so they can all be pre-computed and indexed.
Thus, at inference time we only need to compute the query embedding and search the index.
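A sketch of the two-tower inference path under the same assumptions: all document embeddings are encoded once offline and indexed, so only the query embedding plus an inner-product search is computed online. The checkpoint name and [CLS] pooling are illustrative choices.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
phi = AutoModel.from_pretrained("bert-base-uncased")   # query tower
psi = AutoModel.from_pretrained("bert-base-uncased")   # document tower (separate weights in general)

def encode(model, texts):
    inputs = tok(texts, return_tensors="pt", padding=True, truncation=True)
    return model(**inputs).last_hidden_state[:, 0]     # [CLS] embedding per text

docs = ["Hamlet is a tragedy by Shakespeare.", "BM25 is a bag-of-words ranking function."]

# Offline: pre-compute and index all document embeddings (query-independent).
with torch.no_grad():
    doc_emb = encode(psi, docs)                        # (n_docs, hidden)

# Online: one query encoding + inner products against the index.
with torch.no_grad():
    q_emb = encode(phi, ["who wrote hamlet"])          # (1, hidden)
scores = q_emb @ doc_emb.T                             # score_i = <phi(q), psi(d_i)>
print(docs[scores.argmax().item()])
```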
---
## Optimization:
Consider a corpus with n documents.
The step above then generates n inner products per query.
If query q and document d are relevant, they should have the highest similarity/inner product.
The higher the inner product, the higher the similarity.
So retrieval can be cast as a multi-class classification problem over the documents.
To treat the scores as a probability distribution, we use a softmax:
$$p(d \mid q) = \frac{\exp(\langle \phi(q), \psi(d) \rangle)}{\sum_{d' \in D} \exp(\langle \phi(q), \psi(d') \rangle)}$$
However, in most IR cases there are millions of documents in the corpus, so to increase efficiency we apply a sampled softmax, computing the normalization over a small sampled subset of the whole corpus D.
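A minimal sketch of the sampled softmax where the sampled subset of D is simply the other documents in the same batch (a common in-batch-negatives realization; the embeddings below are random placeholders).

```python
import torch
import torch.nn.functional as F

batch, hidden = 8, 768
q_emb = torch.randn(batch, hidden)    # phi(q) for each query in the batch (placeholder)
d_emb = torch.randn(batch, hidden)    # psi(d) for the paired positive documents (placeholder)

# Every query is scored against every document in the batch:
# the diagonal holds the positive pairs, everything else acts as sampled negatives.
logits = q_emb @ d_emb.T                 # (batch, batch) inner products
labels = torch.arange(batch)             # correct document index for each query
loss = F.cross_entropy(logits, labels)   # softmax over the sampled subset
print(loss)
```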
---
## 3. The Need for Pre-Training
It takes a lot of time/effort to train from scratch.
Why not pre-train on some tasks that share commonality with the downstream task?
Find some pre-training tasks -> pre-train the model -> fine-tune on downstream tasks.
### BERT pre-training tasks:
1. Masked LM (a small masking sketch follows this list)

2. Next Sentence Prediction

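A rough illustration of how Masked LM inputs can be built; the 15% masking rate follows the BERT paper, while the whitespace tokenizer and the simple replace-with-[MASK] scheme are simplifications.

```python
import random

MASK, MASK_RATE = "[MASK]", 0.15

def mask_tokens(sentence: str):
    """Replace ~15% of tokens with [MASK]; the model is trained to recover them."""
    tokens = sentence.split()                     # stand-in for WordPiece tokenization
    targets = {}
    for i in range(len(tokens)):
        if random.random() < MASK_RATE:
            targets[i] = tokens[i]                # label the model must predict
            tokens[i] = MASK
    return tokens, targets

print(mask_tokens("the quick brown fox jumps over the lazy dog"))
```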
### Pre-training tasks in the paper
1. ICT: Inverse Cloze Task on sentences (a random sentence is the pseudo-query; the passage it was taken from is the positive context; see the sketch after this list).

They then give a strong assumption:
The sentences in the first section summarize the whole document.
2. BFS: First Order hop (summary -> random paragraph in the same document)

Another assumption: if document A contains a hyperlink to document B, then A and B are related.
3. WLP: Second-order hop (summary -> random paragraph in another, related document.)

4. Masked LM:
Same as above.
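A hedged sketch of constructing ICT training pairs: a randomly chosen sentence becomes the pseudo-query and the rest of its passage is the positive context. The period-based sentence splitting is a simplification, not the paper's exact preprocessing.

```python
import random

def ict_pair(passage: str):
    """Inverse Cloze Task: (pseudo-query sentence, passage with that sentence removed)."""
    sentences = [s.strip() for s in passage.split(".") if s.strip()]  # naive sentence split
    i = random.randrange(len(sentences))
    query = sentences[i]                                    # the "cloze" sentence
    context = ". ".join(sentences[:i] + sentences[i + 1:])  # positive passage
    return query, context

passage = ("Hamlet is a tragedy written by William Shakespeare. "
           "It is set in Denmark. The play dramatises revenge.")
print(ict_pair(passage))
```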
## Experiments: (Sadly, there is no implementation code to verify here)
### Pre-Training cost:
32 TPUs × 2.5 days
### Downstream tasks:
1. dataset: SQuAD + Natural Questions
2. $(q, a, p)$ triplets: given a question q, return an answer a and a passage p, where p is the evidence passage containing a.
2-1. However, since the paper focuses on the "retrieve" problem, each triplet is transformed into $(q, s, p)$, where s is the sentence in p that contains a. (Given q, return (s, p); a small sketch of this transformation follows the results below.)
1. SQuAD:

2. Natural Questions:

3. Ablation Study:

4. Open-domain retrieval (by adding 1M dummy (s, p) sentence pairs to the candidate answers)

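A small sketch of the (q, a, p) -> (q, s, p) transformation described above, where s is the sentence of p that contains the answer a; the sentence splitting is again a naive period split, not the paper's exact preprocessing.

```python
def to_retrieval_example(question: str, answer: str, passage: str):
    """Turn a (q, a, p) triplet into (q, s, p), where s is the sentence of p containing a."""
    for sentence in passage.split("."):                      # naive sentence split
        if answer in sentence:
            return question, sentence.strip() + ".", passage
    return None                                              # answer not found verbatim

q = "who wrote hamlet"
a = "William Shakespeare"
p = "Hamlet is a tragedy. It was written by William Shakespeare around 1600."
print(to_retrieval_example(q, a, p))
```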
---
## Pre-training Tasks for Embedding-based Large-scale Retrieval
1. [Open Review](https://openreview.net/forum?id=rkg-mA4FDr)
2. [ReQA: An Evaluation for End-to-End Answer Retrieval Models](https://www.aclweb.org/anthology/D19-5819.pdf)
(Retrieval based on paragraphs, while this paper is based on sentences.)
3. [Latent Retrieval for Weakly Supervised Open Domain Question Answering](https://arxiv.org/pdf/1906.00300.pdf)
(Reports a BERT vs. BM25 result that conflicts with this paper.)
(They also use ICT as a pre-training task but cannot beat BM25 in their paper, whereas ICT in the current paper beats BM25 easily; this discrepancy is also mentioned in the Open Review thread.)
## Questions
- [ ] What is the motivation for replacing BM25 with BERT embeddings? What are the potential sources of improvement?
- [ ] Why is a two-tower BERT required to realize embedding-space search? Can't a single BERT do it?
- [ ] What representative embedding (vector) space search tools are currently available?
- [ ] Does training a two-tower BERT really take longer than training a single BERT on [query;passage]?
- [ ] Is the choice of a ~8k training batch size just hyperparameter tuning, or is it related to the sampled softmax?
- [ ] Must the loss function use a sampled softmax? Could the NCE technique used in word2vec be applied instead, turning it into a pairwise calculation?
- [ ] What is the relationship between batch size and the time complexity of backpropagation? Is it O(batch size)?
- [ ] How can the sampled softmax be corrected so that it yields an unbiased estimate?
- [ ] The BFS/WLP tasks do not look particularly effective in the true open-domain scenario; do the experiments show a significant improvement?
- [ ] The paper does not seem to discuss the IR performance of pre-trained BERT + ICT training?
- [ ] Can the Wikipedia-based pre-training tasks generalize to other IR datasets, e.g., MSMARCO?
- [ ] Is the MLM pre-training discussed in this paper based on the paragraphs in the ReQA datasets, or on the Wikipedia dump from some point in time as in the original BERT paper?
- [ ] Is there some overlap between the NSP task and the ICT task?
- [ ] Is the design of the pre-training tasks only about defining positive query-passage pairs? Are negative paragraphs sampled only from the paragraphs already selected by the tasks?