
Paper Notes of NLP-Rock

paper link

https://youtu.be/5qmMwHeIvvc


1. Information Retrieval

Information Retrieval = indexing + retrieving

Indexing

  1. Inverted index (see the sketch below)
    1. Use a word to look up the documents containing it: [(query term, Doc)]
      • Ex. hash table or array
    2. Time complexity vs. space complexity
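
As a toy illustration (my sketch, not from the paper), a minimal inverted index is a hash map from each term to the set of documents containing it; lookup per term is O(1) on average, at the cost of storing every (term, doc) pair:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the IDs of the documents that contain it."""
    index = defaultdict(set)              # word -> {doc_id, ...}
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = ["the cat sat", "the dog barked", "a cat and a dog"]
index = build_inverted_index(docs)
print(index["cat"])                       # {0, 2}
```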

Retrieving

  1. Boolean
    • Keyword Matching
    • 0-1 matching
  2. Vector Space Model (Algebraic models)
    (Weighting)
    • TF-IDF
    • BM25 (a scoring sketch follows this list):
      $\text{IDF} \cdot \frac{(k+1)\,tf}{k\left(1 - b + b\,\frac{|d|}{\text{avg}|d|}\right) + tf}$
  3. Language Model (Probabilistic models)
    • (Ordering) n-gram:
      $p(w_1, \dots, w_{n-1} \mid w_n)$
    • (Unordering) skip-gram:
      $p(w_{n-t}, \dots, w_{n-1}, w_{n+1}, \dots, w_{n+t} \mid w_n)$
  4. Metrics
    • Precision & Recall:
      $\text{Precision} = \frac{TP}{TP+FP}, \quad \text{Recall} = \frac{TP}{TP+FN}$
    • F1 measure
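
A minimal sketch of the BM25 weighting above (k and b are the usual free parameters; the exact IDF variant differs across implementations, so the one below is just a common choice):

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avgdl, k=1.2, b=0.75):
    """Sum of IDF * ((k+1)*tf) / (k*(1 - b + b*|d|/avg|d|) + tf) over query terms."""
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)           # term frequency in this document
        if tf == 0:
            continue
        df = doc_freq.get(t, 0)           # number of documents containing t
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        length_norm = k * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * (k + 1) * tf / (length_norm + tf)
    return score
```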

Ranking

  1. Learning to Rank
    • Pointwise Learning (point to true point)
    • Pairwise Learning (pair to true pair; a loss sketch follows this list)
    • Listwise Learning (list to true list) [LambdaMART]
  2. Recommendation System
    • Query: user-related vs. keyword-based
    • Answer: preference related to the user vs. truly related to the document
  3. Metrics (e.g., MAP, MRR, nDCG)
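
For intuition (my sketch, not tied to the paper), pairwise learning-to-rank optimizes over (relevant, irrelevant) score pairs, e.g. with a hinge loss:

```python
import torch

def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    """Push each relevant document's score above the irrelevant one's by a margin."""
    return torch.clamp(margin - (score_pos - score_neg), min=0).mean()

s_pos = torch.tensor([2.0, 0.5])           # scores of relevant documents
s_neg = torch.tensor([1.0, 1.5])           # scores of irrelevant documents
print(pairwise_hinge_loss(s_pos, s_neg))   # tensor(1.)
```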

(Retrieve & Rerank)


2. Problem & Model Formulation:

Retrieve the relevant documents/paragraphs/sentences from a huge corpus with BERT.

This can be done by two kinds of models.

2-1 Cross-Attention Model (the original one):


For a given query q and a BERT-based model f, the score of document i is

$\text{score}_i = f(q, d_i) = W(\phi(q, d_i))$

where $\phi$ is a BERT-based encoder over the concatenated (query, document) pair and W is a regression model on top of it.
However, in typical IR problems the corpus is huge, so computing $f(q, d_i)$ for every document at query time is very slow.
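
A minimal sketch of this cross-attention scoring with Hugging Face transformers (the model name and the randomly initialized head W are illustrative assumptions, not the paper's exact setup):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
W = torch.nn.Linear(768, 1)   # regression head on top of BERT (untrained here)

def cross_attention_score(query, doc):
    """score_i = W(phi(q, d_i)): encode the pair jointly, regress from [CLS]."""
    inputs = tokenizer(query, doc, return_tensors="pt", truncation=True)
    cls = encoder(**inputs).last_hidden_state[:, 0]   # [CLS] vector of the pair
    return W(cls).item()

# Every (q, d_i) pair needs a full BERT forward pass: O(n) encoder calls per query.
print(cross_attention_score("who wrote hamlet", "Hamlet is a play by Shakespeare."))
```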

2-2 Two-Tower Model (the paper's focus):


On the other hand, under the same assumptions, the score of document i in the two-tower model is

$\text{score}_i = f(q, d_i) = d(\phi(q), \psi(d_i))$

where $\phi, \psi$ are jointly trained BERT-based encoders and $d(\cdot, \cdot)$ is a similarity function (here, the inner product).

Inference

Though this may look similar to the model above, every document embedding is independent of the query, so all of them can be pre-computed and indexed offline.

Thus, at inference time we only need to compute the query embedding; a sketch follows.
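
A minimal sketch of that offline/online split (phi and psi below are random stand-ins for the two trained BERT towers; a real system would also swap the brute-force inner products for an approximate nearest-neighbor index):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128
corpus = ["doc one ...", "doc two ...", "doc three ..."]

def psi(doc):    # stand-in for the trained document tower
    return rng.standard_normal(dim)

def phi(query):  # stand-in for the trained query tower
    return rng.standard_normal(dim)

# Offline: pre-compute and index every document embedding once, query-independent.
doc_emb = np.stack([psi(d) for d in corpus])   # (n_docs, dim)

# Online: one encoder call for the query, then n cheap inner products.
q_emb = phi("who wrote hamlet")
scores = doc_emb @ q_emb
top_k = np.argsort(-scores)[:2]                # indices of the best-scoring docs
```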


Optimization:

Consider a corpus of n documents.
The step above then produces n inner products.
If query q and document d are relevant, they should have the highest similarity/inner product: the higher the inner product, the higher the similarity.

So retrieval can be cast as a multi-class classification problem.
To turn the scores into a probability distribution, we use a softmax:

$p(d \mid q) = \frac{\exp(f(q, d))}{\sum_{d' \in \mathcal{D}} \exp(f(q, d'))}$

However, in most IR settings there are millions of documents in the corpus, so for efficiency we can apply a sampled softmax, normalizing over only a small sampled subset of the whole corpus D; one common instantiation is sketched below.
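
A minimal sketch of sampled softmax with in-batch negatives (a common choice for two-tower training; whether the paper samples exactly this way is an assumption here):

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(q_emb, d_emb):
    """Row i of the logits scores query i against every document in the batch;
    the matching document (the diagonal) is the correct class, and the other
    batch documents act as the sampled negatives."""
    logits = q_emb @ d_emb.T                  # (B, B) inner products
    labels = torch.arange(q_emb.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

q_emb = torch.randn(8, 128)   # query-tower embeddings for a batch
d_emb = torch.randn(8, 128)   # embeddings of each query's positive document
loss = in_batch_softmax_loss(q_emb, d_emb)
```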


3. The Need for Pre-Training

Training from scratch takes a lot of time and effort.
Why not pre-train the model on tasks that share commonality with the target task?

Find suitable pre-training tasks -> pre-train the model -> fine-tune on downstream tasks.

BERT pre-training tasks:

  1. Masked LM: mask random tokens and predict them from the surrounding context.


  2. Next Sentence Prediction: given a pair of sentences, predict whether the second follows the first.


Pre-training tasks in the paper

  1. ICT (Inverse Cloze Task): inverse cloze on a sentence. A random sentence from a passage serves as the pseudo-query, and the rest of the passage as the pseudo-document (a pair-construction sketch follows this list).

They then make a strong assumption:
the sentences in the first section summarize the whole document.

  2. BFS (Body First Selection): first-order hop (summary -> random paragraph in the same document).

Another assumption: given a document A, if A contains a hyperlink to B, then A and B are related.

  3. WLP (Wiki Link Prediction): second-order hop (summary -> random paragraph in another, hyperlinked document).

  4. Masked LM:
    Same as above.
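
A minimal sketch of ICT pair construction (keeping the sentence in its passage some fraction of the time follows the original ICT of Lee et al. 2019; the exact probability used in this paper is an assumption):

```python
import random

def ict_pairs(passages, remove_prob=0.9):
    """Build (pseudo-query, pseudo-document) pairs for the Inverse Cloze Task."""
    pairs = []
    for sentences in passages:                 # each passage = list of sentences
        i = random.randrange(len(sentences))
        query = sentences[i]                   # a random sentence is the query
        if random.random() < remove_prob:
            doc = sentences[:i] + sentences[i + 1:]   # usually drop it from the passage
        else:
            doc = sentences                    # sometimes keep it, to teach word overlap
        pairs.append((query, " ".join(doc)))
    return pairs
```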

Experiments: (sadly, there is no released implementation code to verify the results)

Pre-Training cost:

32 TPUs × 2.5 days

Downstream tasks:

  1. dataset: SQuAD + Natural Questions

  2. (q, a, p) triplets: given a question q, return an answer a and a passage p, where p is the evidence passage containing a.
    2-1. However, since this paper focuses on the "retrieve" problem, each example is transformed into (q, s, p), where s is the sentence in p that contains a. (Given q, return (s, p).)

  3. SQuAD:


  4. Natural Questions:


  5. Ablation Study:


  6. Open-domain retrieval (by adding 1M dummy (s, p) sentence-passage pairs to the candidate answers)



Pre-training Tasks for Embedding-based Large-scale Retrieval

  1. Open Review

  2. ReQA: An Evaluation for End-to-End Answer Retrieval Models
    (Retrieves at the paragraph level, while this paper retrieves at the sentence level.)

  3. Latent Retrieval for Weakly Supervised Open Domain Question Answering
    (Reports a BERT vs. BM25 result that conflicts with this paper.)
    (They also use ICT as a pre-training task yet cannot beat BM25, whereas ICT in the current paper beats BM25 easily; this discrepancy is also raised in the Open Review.)

Questions

  • What is the motivation for replacing BM25 with BERT embeddings? What are the potential sources of improvement?
  • Why is a two-tower BERT required for embedding-space search? Can't a single BERT do it?
  • What representative embedding (vector) space search tools are currently available?
  • Does training a two-tower BERT really take longer than training a single BERT on [query;passage]?
  • Is the ~8k training batch size just hyperparameter tuning, or is it related to the sampled softmax?
  • Must the loss function use a sampled softmax? Could the NCE technique from W2V be used instead, turning it into a pairwise calculation?
  • What is the relationship between batch size and the time complexity of backpropagation? O(batch size)?
  • How can the sampled softmax be corrected to achieve an unbiased estimate?
  • The BFS/WLP tasks do not look particularly effective in the truly open-domain scenario; do the experiments show a significant improvement?
  • This paper does not seem to discuss the IR performance of pretrained BERT + ICT training?
  • Can the Wikipedia-based pretraining tasks generalize to other IR datasets, e.g., MSMARCO?
  • Is the MLM pretraining discussed in this paper based on the paragraphs in the ReQA dataset, or on a Wikipedia dump from some point in time, as in the original BERT?
  • Is there some inherent overlap between the NSP task and the ICT task?
  • Is the design of the pretraining tasks only about defining positive query-passage pairs? Are negative paragraphs drawn only from the paragraphs already selected by the tasks?