# Qwen Embeddings

This goes through the methodology used to train the Qwen3 Embeddings. This information is sourced from their [arXiv paper](https://arxiv.org/pdf/2506.05176), their [release blog post](https://qwenlm.github.io/blog/qwen3-embedding/), and [Chinese-language write-ups](https://zhuanlan.zhihu.com/p/1920266031135434693).

## Model Architecture

The Qwen3 LLMs are used as the base models. A LoRA (rank 64) is trained in order to maintain the generalizability of the underlying LLM. The tokenizer appends a final `[EOS]` token, and the last-layer hidden state at that token position is the output embedding.

> Unknowns:
> - Was the LoRA trained on Qwen3 Base, or Qwen3 Instruction-Tuned?
>
> We could reverse-engineer the released weights to figure this out, along with the LoRA weights themselves.

## Loss Function

The dataset is organized into rows of the format $\{I_i, q_i, d^+_i, d^-_{i,1}, \dots, d^-_{i,n}\}$, where $I$ is the instruction, $q$ is the query, $d^+$ is the positive document, and $\{d^-\}$ is a set of $n$ hard negatives. Then, given a batch of size $N$:

$$\mathcal{L} = -\frac{1}{N} \sum_{i}^{N} \log \frac{e^{s(q_i, d_i^{+}) / \tau}}{Z_i},$$

$$Z_i = e^{s(q_i, d_i^{+}) / \tau} + \sum_{k}^{K} m_{ik} e^{s(q_i, d_{i,k}^{-}) / \tau} + \sum_{j \ne i} m_{ij} e^{s(q_i, q_j) / \tau} + \sum_{j \ne i} m_{ij} e^{s(d_i^{+}, d_j) / \tau} + \sum_{j \ne i} m_{ij} e^{s(q_i, d_j) / \tau}$$

In other words, the positive pair is contrasted with:

- All hard negatives for that query
- All other in-batch queries
- All other in-batch documents
- The positive document against all other in-batch documents

> It's unclear whether "other documents" means just the positives, or the positives and the negatives.

The masking factor $m_{ij}$ is binary ($1$ or $0$), and is equal to $0$ iff the similarity of the negative is greater than $s(q_i, d^+_i) + 0.1$, _or_ if the negative is string-equal to $d^+_i$.

## Pre-finetuning

This dataset is completely synthetic. It spans tasks such as retrieval, bitext mining, classification, and semantic textual similarity (STS).

For example, for retrieval, a random passage is sampled from the Qwen LLMs' billion-scale pretraining corpus. Then a "user role" is selected for the "LLM Query Agent" to impersonate (a model selects the most relevant role for that passage). The LLM Query Agent pretends to be that role and asks a question whose answer is that passage. This process is used to generate 150M positive pairs.

Then, an embedding model is used to retrieve the 8 passages with the highest similarity score that aren't the originating passage. These serve as the 8 hard negatives.

## Finetuning

Pairs are collected from high-quality human-annotated datasets:

- MS MARCO, NQ, HotpotQA, NLI, DuReader, T²-Ranking, SimCLUE, MIRACL, MLDR, Mr. TyDi, Multi-CPR, CodeSearchNet, etc.

~7M pairs are collected from these datasets.

Additionally, a subset of the 150M pre-finetuning pairs is filtered. Datapoints are filtered out if the positive is not in the top-100 retrieved results for that query. Data is also filtered if the reranker similarity score is too high or too low (a threshold of 0.7 is mentioned somewhere). This gives 12M "high quality" pairs.

> When training the reranker, 8 randomly sampled "easy negatives" are included, because you can't use in-batch negatives for the reranker.
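
As a concrete illustration of the EOS-pooling described in the Model Architecture section, here is a minimal sketch using the Hugging Face Transformers API. The model id `Qwen/Qwen3-Embedding-0.6B` refers to the released checkpoint, but the left-padding choice, the instruction/query prompt format, and the explicit appending of the EOS token are assumptions made for clarity, not details confirmed by the sources above.

```python
# Minimal sketch of EOS-token pooling (requires a transformers version with Qwen3 support).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-Embedding-0.6B"
# Left padding puts every sequence's EOS token at the final position of the batch tensor.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, padding_side="left")
model = AutoModel.from_pretrained(MODEL_ID)

texts = ["Instruct: Given a web search query, retrieve relevant passages\nQuery: what is a LoRA?"]
# Assumption: append the EOS token explicitly rather than relying on tokenizer defaults.
texts = [t + tokenizer.eos_token for t in texts]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state      # [batch, seq_len, dim]

# The embedding is the last-layer hidden state at the final (EOS) position.
embeddings = F.normalize(hidden[:, -1, :], dim=-1)  # [batch, dim]
```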
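
To make the loss above concrete, here is a hedged PyTorch sketch of the masked contrastive objective. It assumes cosine similarity for $s(\cdot,\cdot)$ and interprets "other documents" as the other in-batch positives only (one reading of the ambiguity noted above); the temperature value, the function name, and the omission of the string-equality check are all assumptions.

```python
import torch

def qwen3_embedding_loss(q, d_pos, d_neg, tau=0.05, margin=0.1):
    """q, d_pos: [N, dim]; d_neg: [N, K, dim]. All rows assumed L2-normalized."""
    N, K, _ = d_neg.shape
    pos = (q * d_pos).sum(-1)                       # s(q_i, d_i^+), shape [N]

    # Candidate similarities entering Z_i.
    s_qneg = torch.einsum("nd,nkd->nk", q, d_neg)   # s(q_i, d_{i,k}^-), [N, K]
    s_qq   = q @ q.T                                # s(q_i, q_j),       [N, N]
    s_dd   = d_pos @ d_pos.T                        # s(d_i^+, d_j),     [N, N]
    s_qd   = q @ d_pos.T                            # s(q_i, d_j),       [N, N]

    # m = 0 when a candidate scores more than `margin` above the true positive
    # (likely false negative). The paper's string-equality dedup is omitted here.
    thresh = pos.unsqueeze(1) + margin
    off_diag = 1.0 - torch.eye(N, device=q.device)
    m_neg = (s_qneg <= thresh).float()
    m_qq  = (s_qq <= thresh).float() * off_diag
    m_dd  = (s_dd <= thresh).float() * off_diag
    m_qd  = (s_qd <= thresh).float() * off_diag

    Z = (torch.exp(pos / tau)
         + (m_neg * torch.exp(s_qneg / tau)).sum(-1)
         + (m_qq * torch.exp(s_qq / tau)).sum(-1)
         + (m_dd * torch.exp(s_dd / tau)).sum(-1)
         + (m_qd * torch.exp(s_qd / tau)).sum(-1))

    return -(torch.log(torch.exp(pos / tau) / Z)).mean()
```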
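
The hard-negative mining step in the Pre-finetuning section (top-8 most similar passages, excluding the originating one) can be sketched as below. The sources don't say which embedding model performs this retrieval, so the function name and the dense brute-force search over pre-computed, normalized embeddings are illustrative assumptions.

```python
import torch

def mine_hard_negatives(query_emb, corpus_emb, positive_ids, k=8):
    """query_emb: [num_queries, dim]; corpus_emb: [num_passages, dim];
    positive_ids: LongTensor of the originating-passage index per query."""
    sims = query_emb @ corpus_emb.T                 # cosine sims if inputs are normalized
    # Exclude the originating passage so it can never be picked as a negative.
    sims[torch.arange(len(positive_ids)), positive_ids] = float("-inf")
    return sims.topk(k, dim=-1).indices             # [num_queries, k] hard-negative ids
```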
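
Finally, a rough sketch of the pre-finetuning-pair filtering from the Finetuning section. Whether the mentioned 0.7 acts as an upper or lower reranker bound is not clear from the sources, so the band check and both thresholds here, as well as the `retrieve_top100` and `rerank_score` helpers, are purely hypothetical.

```python
def keep_pair(query, positive, retrieve_top100, rerank_score,
              min_score=0.1, max_score=0.7):
    # Drop the pair if the positive is not among the top-100 retrieved passages.
    if positive not in retrieve_top100(query):
        return False
    # Drop pairs whose reranker score falls outside the accepted band.
    score = rerank_score(query, positive)
    return min_score <= score <= max_score
```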