# SimCSE: Simple Contrastive Learning of Sentence Embeddings

## Introduction

1. SimCSE is a simple contrastive learning framework that greatly advances the state of the art in sentence embeddings.
2. Dropout acts as a minimal form of data augmentation; removing it leads to representation collapse.

## Method

### Overview of the SimCSE framework

![](https://i.imgur.com/MQlmdPD.png)

### Background: Contrastive Learning

Contrastive learning aims to learn effective representations by pulling semantically close neighbors together and pushing non-neighbors apart.

**InfoNCE loss**

![](https://i.imgur.com/dPP8gAC.png)

**Positive instances**: constructed simply by applying standard dropout (a code sketch of the loss is given in the appendix at the end of these notes).

**Alignment and uniformity**

- **Alignment**: two samples forming a positive pair should be mapped to nearby features, and thus be (mostly) invariant to unneeded noise factors.
![](https://i.imgur.com/uEgTVIR.png)
- **Uniformity**: feature vectors should be roughly uniformly distributed on the unit hypersphere $\mathcal{S}^{m-1}$, preserving as much information of the data as possible.
![](https://i.imgur.com/EHNuswL.png)

**Model collapse**: the representations become extremely unevenly distributed and collapse toward a single point on the hypersphere, i.e., all inputs are mapped to (nearly) the same constant vector after the feature mapping.

![](https://i.imgur.com/ZHpCWIZ.png)

### Unsupervised SimCSE

1. In standard Transformer training, dropout masks are applied to the fully-connected layers as well as to the attention probabilities (default p = 0.1).
2. Feed the same input to the encoder twice and obtain two embeddings, $z$ and $z'$, under different dropout masks (see the sketch in the appendix below).

![](https://i.imgur.com/grUxbmy.png)

**Dropout noise as data augmentation**

![](https://i.imgur.com/uqJr1RD.png)

**Why does it work?**

![](https://i.imgur.com/pQvmea4.png)

### Supervised SimCSE

![](https://i.imgur.com/kXpKW9y.png)

Results of supervised SimCSE:

![](https://i.imgur.com/BMt1ckp.png)

## Results

![](https://i.imgur.com/kiGriGA.png)
![](https://i.imgur.com/zRpBgSZ.png)
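## Appendix: Code sketches

**InfoNCE loss.** A minimal PyTorch sketch of the in-batch InfoNCE objective shown in the figure above: cosine similarity scaled by a temperature, where each sentence's positive sits on the diagonal of the similarity matrix and every other positive in the batch acts as a negative. The function name `in_batch_infonce` and the default temperature are illustrative choices, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def in_batch_infonce(h: torch.Tensor, h_pos: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """h, h_pos: (N, d) embeddings of N sentences and their positives."""
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    # sim[i, j] = cos(h_i, h_pos_j) / tau; the off-diagonal entries are in-batch negatives.
    sim = h @ h_pos.t() / temperature
    # The correct positive for row i is column i, so the loss reduces to cross-entropy.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```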
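**Alignment and uniformity.** A sketch of the two metrics described above, following the standard definitions from Wang & Isola (2020) that SimCSE builds on: alignment is the expected squared distance between positive pairs, and uniformity is the log of the average Gaussian potential over all pairs of embeddings. The exponents (alpha = 2, t = 2) are the usual defaults and the function names are mine.

```python
import torch

def alignment(x: torch.Tensor, y: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """x, y: (N, d) L2-normalized embeddings of positive pairs; lower is better."""
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniformity(x: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """x: (N, d) L2-normalized embeddings; lower means more uniform on the hypersphere."""
    sq_dists = torch.pdist(x, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()
```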
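**Unsupervised SimCSE.** A sketch of the "feed the same input twice" trick: keep dropout active, run one batch through the encoder twice so the two passes differ only in their dropout masks, then contrast $z$ against $z'$ with the InfoNCE sketch above. The model and tokenizer names are the usual Hugging Face identifiers; the pooling choice and batch contents are illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout (default p = 0.1) switched on

sentences = ["Two dogs are running.", "A man is playing guitar."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

# Two forward passes over identical inputs -> two different dropout masks.
z1 = encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling, pass 1
z2 = encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling, pass 2

loss = in_batch_infonce(z1, z2)  # defined in the InfoNCE sketch above
loss.backward()
```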
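**Supervised SimCSE.** A sketch of the supervised objective: NLI entailment hypotheses serve as positives and contradiction hypotheses as hard negatives, and the hard negatives of every example in the batch also enter the denominator. The function name and tensor shapes are my own; only the loss structure follows the paper.

```python
import torch
import torch.nn.functional as F

def supervised_simcse_loss(h, h_pos, h_neg, temperature: float = 0.05):
    """h, h_pos, h_neg: (N, d) embeddings of premise, entailment and contradiction."""
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    h_neg = F.normalize(h_neg, dim=-1)
    sim_pos = h @ h_pos.t() / temperature  # (N, N): positives + in-batch negatives
    sim_neg = h @ h_neg.t() / temperature  # (N, N): in-batch hard negatives
    logits = torch.cat([sim_pos, sim_neg], dim=1)  # (N, 2N)
    labels = torch.arange(h.size(0), device=h.device)  # correct class = diagonal of sim_pos
    return F.cross_entropy(logits, labels)
```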