# Gemini Embeddings
This note goes through the methodology used to train Gemini Embedding, from this [arXiv paper](https://arxiv.org/pdf/2503.07891).
## Model Architecture
First, a Gemini model is used, but with the causal attention mask removed so that attention is bidirectional. Mean pooling combines the per-token embeddings into a single embedding, and a final linear projection $f$ maps that embedding to the target dimension.
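A minimal sketch of the pooling and projection head (the encoder itself is omitted, and the dimensions and names here are illustrative assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Mean-pools token embeddings and applies the final linear projection f."""

    def __init__(self, model_dim: int = 2048, target_dim: int = 768):
        super().__init__()
        # Dimensions are illustrative, not the paper's actual sizes.
        self.proj = nn.Linear(model_dim, target_dim)  # the projection f

    def forward(self, token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, model_dim) from a bidirectional encoder
        # attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
        mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
        summed = (token_embeddings * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1.0)
        return self.proj(summed / counts)  # mean pool, then project to target dim
```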
## Loss Function
Given a batch of size $B$, the loss applied to these embeddings is as follows:
$$\mathcal{L} = \frac{1}{B} \sum_{i=1}^{B} \left[ -\log \frac{e^{\text{sim}(\mathbf{q}_i, \mathbf{p}_i^+)/\tau}}{e^{\text{sim}(\mathbf{q}_i, \mathbf{p}_i^-)/\tau} + \sum_{j=1}^{B} \text{mask}(i,j) e^{\text{sim}(\mathbf{q}_i, \mathbf{p}_j^+)/\tau}} \right]$$
where $\text{sim}(\mathbf{x}, \mathbf{y}) = \mathbf{x}^{\top}\mathbf{y}/\|\mathbf{x}\|\|\mathbf{y}\|$ is cosine similarity, and
$$\text{mask}(i, j) = \begin{cases}
0 & \text{if } q_i = q_j \text{ or } p_i^+ = p_j^+, \\
1 & \text{otherwise}.
\end{cases}$$
i.e., we have an optional hard negative, and we also treat all other in-batch positive documents as negatives. The mask excludes in-batch negatives that share the same query as example $i$, or that are duplicates of its positive document.
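A sketch of this loss in PyTorch. The `query_ids` / `doc_ids` arguments are my own assumption, standing in for whatever the training pipeline uses to detect duplicate queries and duplicate positives:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, p_pos, p_neg, query_ids, doc_ids, tau=0.05):
    """NCE-style loss with one hard negative and masked in-batch negatives.

    q, p_pos, p_neg: (B, D) embeddings for queries, positives, hard negatives.
    query_ids, doc_ids: (B,) integer ids used to detect duplicate queries /
    duplicate positive documents (illustrative; any equality test works).
    """
    q, p_pos, p_neg = (F.normalize(x, dim=-1) for x in (q, p_pos, p_neg))

    pos_sim = (q * p_pos).sum(-1) / tau      # sim(q_i, p_i^+) / tau
    hard_sim = (q * p_neg).sum(-1) / tau     # sim(q_i, p_i^-) / tau
    inbatch = (q @ p_pos.T) / tau            # sim(q_i, p_j^+) / tau

    # mask(i, j) = 0 if q_i = q_j or p_i^+ = p_j^+, else 1
    same_q = query_ids[:, None] == query_ids[None, :]
    same_p = doc_ids[:, None] == doc_ids[None, :]
    keep = ~(same_q | same_p)

    inbatch = inbatch.masked_fill(~keep, float("-inf"))
    denom = torch.logsumexp(torch.cat([hard_sim[:, None], inbatch], dim=1), dim=1)
    return (denom - pos_sim).mean()          # mean over the batch of -log(pos / denom)
```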
## Pre-finetuning
First, a billion-scale web corpus is used, with (title, passage) pairs treated as (query, positive) pairs. No hard negatives are used. A much larger batch size is used (e.g. > 1024), the largest that fits on the device, which provides a more stable gradient. This pre-finetuning runs for many steps.
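Under the loss above, this stage reduces to in-batch negatives only. A sketch, assuming all titles and passages in a batch are distinct:

```python
import torch
import torch.nn.functional as F

def prefinetune_loss(q, p_pos, tau=0.05):
    """In-batch-negatives-only variant of the loss above, for (title, passage) pairs.

    Assumes all queries and positives in the batch are distinct, so the mask
    only removes each example's own positive from the denominator.
    """
    q, p_pos = F.normalize(q, dim=-1), F.normalize(p_pos, dim=-1)
    sims = (q @ p_pos.T) / tau                              # sim(q_i, p_j^+) / tau
    pos = sims.diagonal()                                   # sim(q_i, p_i^+) / tau
    diag = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    negs = sims.masked_fill(diag, float("-inf"))            # mask(i, i) = 0
    return (torch.logsumexp(negs, dim=1) - pos).mean()
```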
## Finetuning
Finetuning uses smaller batch sizes (e.g. < 1024) and (query, positive, hard negative) triplets. Each batch contains triplets from only a single dataset, as sketched below.
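A minimal sketch of that batching constraint (the triplet format here is an assumption, not the paper's):

```python
import random
from collections import defaultdict

def same_dataset_batches(triplets, batch_size):
    """Group (query, positive, hard negative) triplets into homogeneous batches.

    `triplets` is assumed to be a list of (dataset_name, query, pos, neg) tuples.
    """
    by_dataset = defaultdict(list)
    for name, query, pos, neg in triplets:
        by_dataset[name].append((query, pos, neg))

    batches = []
    for examples in by_dataset.values():
        random.shuffle(examples)
        for i in range(0, len(examples), batch_size):
            batch = examples[i:i + batch_size]
            if len(batch) == batch_size:      # drop ragged tail batches
                batches.append(batch)
    random.shuffle(batches)                   # interleave datasets across training steps
    return batches
```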
---
Datasets:
- English (Task Varies)
- Retrieval
- Classification
- Multilingual (Retrieval)
- Code (Retrieval)
---
1. Both human-annotated and synthetic (FRet) datasets are used.
- Human: NQ, HotpotQA, FEVER, MedMCQA, SNLI, MNLI, MIRACL
- Synthetic: FRet, SWIM-IR
2. $(q, p^+)$ pairs are filtered for quality by a few-shot prompted LLM.
3. Hard negatives are mined by running a top-$K$ vector search for $q$, then having an LLM score the merged set of $p^+$ and the top-$K$ results. The best-scoring document becomes the new positive, and the worst-scoring document becomes the hard negative (see the sketch below).
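A sketch of this mining step; `search_index` and `llm_score` are hypothetical placeholders for the vector search and the LLM scorer:

```python
def mine_hard_negative(query, positive, search_index, llm_score, k=20):
    """Re-label a (query, positive) pair using retrieved candidates.

    `search_index(query, k)` and `llm_score(query, doc)` are hypothetical
    stand-ins for the vector search and few-shot LLM grader; documents are
    assumed to be plain hashable strings.
    """
    candidates = search_index(query, k)                    # top-K vector search for q
    pool = list(dict.fromkeys([positive] + candidates))    # merge p+ with top-K, dedupe
    ranked = sorted(pool, key=lambda d: llm_score(query, d), reverse=True)
    return ranked[0], ranked[-1]                           # new positive, hard negative
```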
### FRet

FRet is a synthetic data generation technique from the Gecko paper, later improved for Gemini Embedding.
1. A random passage $p_\text{seed}$ is sampled from the billion-scale web corpus.
2. An LLM generates a Task+Query whose answer would be the passage.
    - One technique is to generate five candidates and take the last one, which is generally more diverse than the first generation.
3. Another LLM call checks the quality of the generated query and filters it out if it is bad.
4. The Task+Query is embedded and used to search over the entire web corpus, retrieving the top-$K$ results.
5. The $K+1$ candidates (the $K$ from vector search, plus $p_\text{seed}$) are then scored by an LLM to find the most relevant document, which becomes $p^+$. The least relevant document is used as the hard negative $p^-$ (Table 3 from Gecko).
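Putting the steps above together as a single sketch (every component, the corpus, generator, quality filter, search index, and scorer, is a hypothetical placeholder):

```python
import random

def fret_example(corpus, generate_task_queries, passes_quality_check,
                 search_index, llm_score, k=20):
    """Generate one (task+query, p+, p-) training example following the steps above.

    Every callable here is a hypothetical stand-in: an LLM that writes task
    descriptions and queries for a passage, an LLM quality filter, a vector
    search over the web corpus, and an LLM relevance grader.
    """
    p_seed = random.choice(corpus)                          # 1. sample a seed passage
    candidates = generate_task_queries(p_seed, n=5)         # 2. generate five candidates
    task_query = candidates[-1]                             #    keep the last (more diverse)
    if not passes_quality_check(task_query, p_seed):        # 3. LLM quality filter
        return None
    retrieved = search_index(task_query, k)                 # 4. top-K search over the corpus
    pool = list(dict.fromkeys([p_seed] + retrieved))        # 5. K+1 candidates incl. the seed
    ranked = sorted(pool, key=lambda d: llm_score(task_query, d), reverse=True)
    return task_query, ranked[0], ranked[-1]                #    p+ = best, p- = worst
```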