# Notes on "[Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)"
Author: [Rishika Bhagwatkar](https://github.com/rishika2110)
## Background
* The Transformer model from the "Attention Is All You Need" paper consists of two parts, an encoding component and a decoding component, both of which are used for the machine translation task.
    * Encoding component: a stack of encoders
    * Decoding component: a stack of decoders
* BERT's goal is to produce a language model, so only the encoding component is needed. Two methods are used to train BERT:
    * Masked Language Model (MLM): A few tokens in the sequence are replaced by the [MASK] token, and the model predicts the original values of the masked tokens based on the surrounding non-masked tokens.
    * Next Sentence Prediction (NSP): Pairs of sentences are given as input, and the model predicts whether the second sentence follows the first in the original training corpus.
* A [CLS] token is prepended to the first sentence and a [SEP] token is appended to the end of each sentence.
* The sum of the token embeddings, positional embeddings, and segment embeddings (indicating sentence A or B) is fed to the model.
* The [CLS] token's output is used for the classification task, giving the probability that the second sentence is the subsequent one via a softmax.
* BERT can be used for various NLP tasks by just adding a task-specific layer on top of the model and fine-tuning.
* BERT set new state-of-the-art performance on various sentence classification and sentence-pair regression tasks. For sentence pairs, BERT acts as a cross-encoder: both sentences are passed together to the transformer network and the target value is predicted (a minimal sketch follows this list).
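To make the cross-encoder setup concrete, here is a rough sketch using the Hugging Face `transformers` library. This is not the paper's own code; the checkpoint name and the two-label head are assumptions for illustration, and the untrained classification head would need fine-tuning before its outputs mean anything.

```python
# Minimal sketch of BERT as a cross-encoder on a sentence pair.
# Checkpoint name and label count are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# The tokenizer builds the "[CLS] sentence A [SEP] sentence B [SEP]" input
# together with the segment (token type) ids.
inputs = tokenizer("A man is playing a guitar.",
                   "Someone is playing an instrument.",
                   return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits       # shape: (1, 2), from the [CLS] head
probs = torch.softmax(logits, dim=-1)     # per-pair class probabilities
print(probs)
```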
## Brief Outline
* Word2Vec and GloVe provide word embeddings for a given corpus.
* Various NLP applications need to compute the similarity between two short texts. The degree of word overlap does not give a reliable measure of how relevant a document is to a query text, so instead embeddings of the short texts are computed and compared with the cosine-similarity metric.
* A common baseline for estimating the semantic similarity between a pair of sentences is to average the word embeddings of all words in each sentence and compute the cosine similarity between the resulting sentence vectors (a small sketch appears after this list).
* Sentence embeddings from BERT can be computed in two ways:
    * Taking the average of the output layer
    * Using the [CLS] token output
* However, using BERT directly is computationally heavy for semantic textual similarity and for other unsupervised tasks like clustering, since every sentence pair must be passed through the network together.
* This paper presents Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort for finding the most similar pair in a collection of 10,000 sentences from about 65 hours with BERT/RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT.
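As a concrete picture of the averaging baseline mentioned above, here is a minimal sketch; the tiny `glove` dictionary and its 4-dimensional vectors are made-up stand-ins for real pre-trained GloVe/Word2Vec vectors.

```python
# Sketch of the averaged-word-embedding baseline: average each sentence's
# word vectors and compare the sentence vectors with cosine similarity.
# The tiny `glove` dictionary is a made-up stand-in for real pre-trained vectors.
import numpy as np

glove = {
    "the":    np.array([0.1,  0.3, -0.2,  0.05]),
    "cat":    np.array([0.7, -0.1,  0.4,  0.2]),
    "dog":    np.array([0.6, -0.2,  0.5,  0.1]),
    "sleeps": np.array([-0.3, 0.8,  0.1, -0.4]),
}

def sentence_embedding(sentence, vectors):
    """Average the embeddings of the words that have a vector."""
    words = [w for w in sentence.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = sentence_embedding("The cat sleeps", glove)
v = sentence_embedding("The dog sleeps", glove)
print(cosine(u, v))  # close to 1, since the sentences are near-paraphrases
```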
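The 65 hours to 5 seconds speed-up mentioned above comes from encoding each sentence only once and comparing the cached embeddings with cosine similarity, rather than running every pair through the network. A rough sketch of that workflow with the authors' `sentence-transformers` library; the checkpoint name is an assumption and any pretrained SBERT model would do.

```python
# Sketch of the SBERT bi-encoder workflow with the `sentence-transformers`
# library: each sentence is encoded once, and similarity is a cheap cosine
# between cached embeddings. The checkpoint name is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")

sentences = [
    "A man is playing a guitar.",
    "Someone is playing an instrument.",
    "The stock market fell sharply today.",
]

embeddings = model.encode(sentences, convert_to_tensor=True)  # one forward pass per sentence
scores = util.cos_sim(embeddings, embeddings)                 # all pairwise cosine similarities
print(scores)
```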
## Model
* A pooling operation is added to the output of BERT/RoBERTa to derive a fixed-size sentence embedding.
* They experimented with the following pooling strategies (sketched in code after this list):
    * CLS-strategy: Using the output of the [CLS] token
    * MEAN-strategy: Computing the mean of the output vectors
    * MAX-strategy: Computing a max-over-time of the output vectors
    * The default configuration is MEAN.
* For fine-tuning BERT, they created siamese and triplet networks to update the weights such that the produced sentence embeddings are semantically meaningful and can be compared with cosine similarity.
* Objective functions experimented with (sketched in code after this list):
    * Classification Objective Function: The sentence embeddings u and v are concatenated with their element-wise difference |u - v|, multiplied by a trainable weight matrix, and passed through a softmax; training uses cross-entropy loss.
    * Regression Objective Function: The cosine similarity between u and v is trained with mean-squared-error loss against the gold similarity score.
    * Triplet Objective Function: Given an anchor sentence a, a positive p, and a negative n, the loss max(||s_a - s_p|| - ||s_a - s_n|| + ε, 0) pushes the anchor closer to the positive than to the negative by a margin ε.
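A rough sketch of the three pooling strategies over BERT's token outputs, using PyTorch and `transformers`. Masking out padding positions before MEAN/MAX pooling is an assumed (but standard) detail, and the checkpoint name is illustrative.

```python
# Sketch of the CLS / MEAN / MAX pooling strategies over BERT token outputs.
# Padding tokens are masked out before MEAN/MAX pooling (assumed detail).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["A man is playing a guitar.", "The weather is nice."],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = bert(**batch).last_hidden_state      # (batch, seq_len, hidden)

mask = batch["attention_mask"].unsqueeze(-1).float()         # (batch, seq_len, 1)

# CLS strategy: take the first token's vector.
cls_embedding = token_embeddings[:, 0, :]

# MEAN strategy (SBERT default): average over non-padding tokens.
mean_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)

# MAX strategy: max-over-time, ignoring padding positions.
max_embedding = token_embeddings.masked_fill(mask == 0, -1e9).max(dim=1).values

print(cls_embedding.shape, mean_embedding.shape, max_embedding.shape)
```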
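And a PyTorch sketch of the three objective functions, assuming the sentence embeddings (u, v, and the anchor/positive/negative triplet) have already been produced by the shared BERT + pooling encoder; the batch size, hidden size, label count, and margin values are illustrative.

```python
# Sketch of SBERT's three training objectives. The random tensors stand in
# for sentence embeddings from the shared BERT + pooling encoder; sizes and
# the triplet margin are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, n_labels, batch = 768, 3, 8
u = torch.randn(batch, hidden)   # embedding of sentence A
v = torch.randn(batch, hidden)   # embedding of sentence B

# 1) Classification objective: softmax(W [u; v; |u - v|]) with cross-entropy.
W = nn.Linear(3 * hidden, n_labels)
logits = W(torch.cat([u, v, torch.abs(u - v)], dim=-1))
labels = torch.randint(0, n_labels, (batch,))
cls_loss = F.cross_entropy(logits, labels)

# 2) Regression objective: cosine similarity trained with mean-squared error
#    against gold similarity scores.
gold = torch.rand(batch)
reg_loss = F.mse_loss(F.cosine_similarity(u, v, dim=-1), gold)

# 3) Triplet objective: max(||a - p|| - ||a - n|| + eps, 0).
a, p, n = torch.randn(batch, hidden), torch.randn(batch, hidden), torch.randn(batch, hidden)
eps = 1.0
triplet_loss = torch.clamp(
    (a - p).norm(dim=-1) - (a - n).norm(dim=-1) + eps, min=0
).mean()

print(cls_loss.item(), reg_loss.item(), triplet_loss.item())
```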
## Conclusion
* SBERT is BERT fine-tuned in a siamese/triplet network architecture.
* It achieved a significant improvement over state-of-the-art sentence-embedding methods.
* SBERT is computationally efficient.