# Negative Sampling & Loss
## Method
### Random sampling (from embedding/recommendation papers)
The most naive approach: just sample negative instances uniformly at random.
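A minimal sketch of this baseline, assuming item ids are integers in `[0, num_items)`; the helper name and the rejection loop for skipping known positives are my own illustration:

```python
import numpy as np

def sample_uniform_negatives(num_items, positives, k, rng=None):
    """Draw k negative item ids uniformly at random, skipping known positives."""
    rng = rng or np.random.default_rng()
    positives = set(positives)
    negatives = []
    while len(negatives) < k:
        candidate = int(rng.integers(num_items))
        if candidate not in positives:
            negatives.append(candidate)
    return negatives

# e.g. draw 5 negatives for a user whose positives are items {3, 7, 42}
print(sample_uniform_negatives(num_items=1000, positives=[3, 7, 42], k=5))
```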
### Frequency-based negative sampling
[paper link](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf)
The negative samples are selected using a “unigram distribution”, where more frequent items/instances are more likely to be selected as negative samples.
This is expressed by the following equation, where $f(w_i)$ is the frequency of item $w_i$:
$P(w_i) = \frac{ f(w_i) }{\sum_{j=0}^{n}\left( f(w_j) \right) }$
In their experiments they found that raising the word counts to the 3/4 power performs best (now widely used in many works):
$P(w_i) = \frac{ {f(w_i)}^{3/4} }{\sum_{j=0}^{n}\left( {f(w_j)}^{3/4} \right) }$
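A minimal numpy sketch of this sampler (the `unigram_sampler` helper is hypothetical): `power=1.0` gives the plain unigram distribution, `power=0.75` the smoothed version above.

```python
import numpy as np

def unigram_sampler(counts, power=0.75):
    """Build a negative sampler from raw item frequencies f(w_i)."""
    counts = np.asarray(counts, dtype=np.float64)
    probs = counts**power
    probs /= probs.sum()  # P(w_i) = f(w_i)^p / sum_j f(w_j)^p
    rng = np.random.default_rng()
    return lambda k: rng.choice(len(counts), size=k, p=probs)

sample = unigram_sampler(counts=[100, 10, 1, 1])
# frequent items are drawn more often, but less extremely than with raw counts
print(sample(5))
```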
### Rank-based negative sampling
[Paper Link](https://cseweb.ucsd.edu/~jmcauley/pdfs/cikm18a.pdf)
The candidate items are sorted in order of decreasing popularity, and the probability of item $i$ at popularity rank $r_i$ being sampled is defined as:
$P(i) = \log(\frac{r_i+2}{r_i+1})$
The paper claims that this negative sampling technique empirically accelerates model convergence and improves performance compared to the uniform sampling strategy.
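A small sketch of this distribution, assuming ranks start at 1 for the most popular item (the paper's exact indexing may differ). Note that the raw log terms do not sum to 1, so the sketch normalizes them explicitly:

```python
import numpy as np

def rank_based_probs(num_items):
    """P(i) proportional to log((r_i + 2) / (r_i + 1)), r_i = 1..n by popularity."""
    ranks = np.arange(1, num_items + 1)  # rank 1 = most popular item
    weights = np.log((ranks + 2) / (ranks + 1))
    return weights / weights.sum()  # normalize into a proper distribution

probs = rank_based_probs(1000)
rng = np.random.default_rng()
# popular (low-rank) items are sampled more often as negatives
negatives = rng.choice(1000, size=10, p=probs)
```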
### Hard negatives and easy negatives (from Facebook, very insightful)
[Paper Link](https://arxiv.org/abs/2006.11632)
Summary of the key findings:
1. In the recall stage, we shouldn't use only hard negative samples. Note: we can't use impressed-but-not-clicked samples.
2. First, use easy negative samples so the model learns to distinguish items that are quite different from each other. Then, use hard negatives to improve the ranking performance.
3. How to construct hard negatives? **Based on the last version of the model** (this idea is in line with the BPR-Max loss): items with very high scores can be regarded as positive samples, while items with medium scores can be regarded as hard negatives; see the sketch after this list. **Based on the scenario, in our case this could be topics which have a large overlap of documents, as Ali said.**
4. Do not use impressed-but-not-clicked samples!! They are useless.
> For the TREC fair ranking track, maybe we could first train a recall model with easy negative samples?
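A sketch of point 3: mining hard negatives from the previous model's scores. Keeping a medium band of the ranking follows the paper's idea, but the exact boundaries used here (100-500) are an assumption to tune:

```python
import numpy as np

def mine_hard_negatives(scores, item_ids, band=(100, 500)):
    """Keep a medium-scored band of candidates as hard negatives.

    Items ranked above the band score so high that they may be unlabeled
    positives, so they are skipped; items below the band are easy negatives.
    """
    order = np.argsort(-np.asarray(scores))  # highest-scored items first
    lo, hi = band
    return [item_ids[i] for i in order[lo:hi]]

# scores over a candidate pool from the last version of the model
rng = np.random.default_rng()
pool = list(range(10_000))
scores = rng.standard_normal(10_000)  # stand-in for real model scores
hard_negatives = mine_hard_negatives(scores, pool)
```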
### BPR-Max loss (encodes the sampling into the loss function)
[Paper Link](https://arxiv.org/pdf/1706.03847.pdf)
From the original paper: each sampled negative $j$ receives a softmax weight $s_j$ over the negative scores, so the pairwise comparisons are weighted toward the hardest negatives:

$L_{\text{bpr-max}} = -\log\left( \sum_{j=1}^{N_S} s_j \, \sigma(r_i - r_j) \right) + \lambda \sum_{j=1}^{N_S} s_j r_j^2$

where $r_i$ is the score of the target item, $r_j$ are the scores of the $N_S$ sampled negatives, and $\lambda$ is a regularization weight.
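A minimal numpy sketch of this loss for one positive score and its sampled negatives; the max-shifted softmax and the `1e-12` floor inside the log are implementation details I've added for numerical stability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_max_loss(r_pos, r_negs, lam=1.0):
    """BPR-max: softmax-weighted BPR over negatives plus score regularization."""
    r_negs = np.asarray(r_negs, dtype=np.float64)
    s = np.exp(r_negs - r_negs.max())
    s /= s.sum()  # softmax weights s_j over the negative scores
    likelihood = np.sum(s * sigmoid(r_pos - r_negs))  # weighted pairwise terms
    reg = lam * np.sum(s * r_negs**2)  # regularize negative scores toward zero
    return -np.log(likelihood + 1e-12) + reg

# the hardest negative (1.8) dominates the loss via its softmax weight
print(bpr_max_loss(r_pos=2.0, r_negs=[0.5, 1.8, -1.0]))
```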
## How many negative samples?
> The word2vec paper (linked above) says that selecting 5-20 words works well for smaller datasets, and you can get away with only 2-5 words for large datasets.
## Another possible useful solution: Passage Ranking With Weak Supervision
[Pdf Link](https://arxiv.org/pdf/1905.05910.pdf)
A previous study shows that when no labels are available, using the BM25 score as a weak supervision signal can achieve good results. This paper adds more weak supervision signals to construct pseudo-labels and train the model, including:
* BM25
* TF-IDF
* Universal Sentence Encoder cosine score
* BERT cosine score
> Specifically, for each query, we rank the candidate passages based on the similarity scores and we take the top-1 passages as positive ones, the bottom half as negative ones, and label the rest in this list as neutral.
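A small sketch of that labeling scheme for one query; `score_fn` is a stand-in for any of the weak scorers above (BM25, TF-IDF, USE, or BERT similarity):

```python
import numpy as np

def pseudo_label(passages, score_fn):
    """Top-1 passage -> positive (1), bottom half -> negative (0), rest -> neutral (None)."""
    scores = np.array([score_fn(p) for p in passages])
    order = np.argsort(-scores)      # best-scored passage first
    labels = [None] * len(passages)  # neutral by default
    labels[order[0]] = 1             # top-1 as the positive
    half = len(passages) // 2
    for idx in order[-half:]:        # bottom half as negatives
        labels[idx] = 0
    return labels

# toy example with precomputed weak-supervision scores
fake_scores = {"p1": 3.2, "p2": 0.1, "p3": 1.5, "p4": 0.4}
print(pseudo_label(list(fake_scores), score_fn=fake_scores.get))
# -> [1, 0, None, 0]: p1 positive, p2/p4 negative, p3 neutral
```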