<style> img { display: block; margin-left: auto; margin-right: auto; } </style>

> [Paper link](https://arxiv.org/abs/2305.05065) | [Note link](https://blog.csdn.net/weixin_43564920/article/details/135745594) | [Code link](https://github.com/EdoardoBotta/RQ-VAE-Recommender) | NeurIPS 2023

:::success
**Thoughts**
This study uses Semantic IDs to retrieve recommended items for users.
:::

## Abstract

This study does not retrieve top candidates using a query embedding. Instead, it creates a Semantic ID for each item and uses a Transformer-based seq2seq model to predict the Semantic ID of the next item that the user will interact with.

![image](https://hackmd.io/_uploads/B15KCRTFC.png)

## Background

Typically, recommenders use a retrieve-and-rank strategy to help users discover content of interest:

1. **Retrieval Stage**: retrieves a large set of candidate items that are potentially relevant to the user based on various filtering techniques.
2. **Ranking Stage**: ranks the retrieved candidates based on their relevance and likelihood of user engagement.

## Method

This paper proposes Transformer Index for GEnerative Recommenders (TIGER), a generative retrieval-based recommendation framework that assigns a Semantic ID to each item and trains a retrieval model to predict the Semantic ID of an item that a given user may engage with.

TIGER offers two key benefits:

1. It can recommend new and infrequent items.
2. It can generate diverse recommendations using a tunable parameter.

![image](https://hackmd.io/_uploads/rJ8Oy1RFC.png)

### Semantic ID Generation

A Semantic ID is a tuple of $m$ codewords. Each codeword in the tuple comes from a different codebook, so the number of items that Semantic IDs can represent uniquely equals the product of the codebook sizes.
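The tuple construction can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the sizes ($m = 3$ levels, $K = 256$ codewords per level, 32-dimensional latents) and the random codebooks are assumptions for the sake of the example; the nearest-codeword-per-level loop mirrors the residual quantization described below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's exact config):
# m = 3 quantization levels, each with its own codebook of K = 256 vectors.
m, K, dim = 3, 256, 32
codebooks = [rng.normal(size=(K, dim)) for _ in range(m)]

def semantic_id(z: np.ndarray) -> tuple[int, ...]:
    """Quantize a latent z into an m-tuple of codeword indices,
    picking one index per level and passing the residual onward."""
    residual = z
    codes = []
    for codebook in codebooks:
        # index of the nearest codeword in this level's codebook
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        codes.append(idx)
        residual = residual - codebook[idx]  # residual goes to the next level
    return tuple(codes)

z = rng.normal(size=dim)
sid = semantic_id(z)
# The ID space size is the product of the codebook sizes: K**m = 16,777,216 here.
```

With these sizes, three 256-entry codebooks already cover over 16 million distinct IDs, which is why short codeword tuples suffice for large catalogs.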
![image](https://hackmd.io/_uploads/SJhFgJAFC.png)

To generate a Semantic ID, the process starts by obtaining a semantic embedding $\boldsymbol{x} \in \mathbb{R}^d$ from a pre-trained encoder. RQ-VAE first learns a latent representation $\boldsymbol{z} := \mathcal{E}(\boldsymbol{x})$. At the 0-th level, the initial residual is defined as $\boldsymbol{r}_0 := \boldsymbol{z}$. The following quantization step is then repeated for levels $d = 0, \dots, m-1$:

1. At each level $d$, there is a codebook $\mathcal{C}_d := \{ \boldsymbol{e}_k \}_{k=1}^K$, where $K$ is the codebook size. Note that a separate codebook of size $K$ is used for each of the $m$ levels.
2. The index of the closest codebook embedding $\boldsymbol{e}_{c_d}$ is found: $c_d = \arg \min_k \| \boldsymbol{r}_d - \boldsymbol{e}_k \|$.
3. The residual for the next level is computed as $\boldsymbol{r}_{d+1} := \boldsymbol{r}_d - \boldsymbol{e}_{c_d}$.

Then, a quantized representation $\hat{\boldsymbol{z}} := \sum_{d=0}^{m-1} \boldsymbol{e}_{c_d}$ is computed and passed to the decoder to reconstruct the input $\boldsymbol{x}$. The RQ-VAE loss jointly trains the encoder, the decoder, and the codebooks:

$$
\mathcal{L}(\boldsymbol{x}) := \mathcal{L}_\mathrm{recon} + \mathcal{L}_\mathrm{rqvae}
$$

where

$$
\mathcal{L}_\mathrm{recon} := \| \boldsymbol{x} - \hat{\boldsymbol{x}} \|^2
$$

and

$$
\mathcal{L}_\mathrm{rqvae} := \sum_{d=0}^{m-1} \| \mathrm{sg}[\boldsymbol{r}_d] - \boldsymbol{e}_{c_d} \|^2 + \beta \| \boldsymbol{r}_d - \mathrm{sg}[\boldsymbol{e}_{c_d}] \|^2
$$

Here, $\hat{\boldsymbol{x}}$ is the output of the decoder and $\mathrm{sg}$ is the stop-gradient operation.

:::info
To avoid collisions, an extra token is appended at the end of the ordered semantic codes to make them unique. For example, if two items share the Semantic ID (12, 24, 52), they are changed to (12, 24, 52, 0) and (12, 24, 52, 1). A lookup table is built to detect and handle collisions.
:::

### Generative Retrieval with Semantic IDs

The recommender predicts the next item $\mathrm{item}_{n+1}$ from a sequence $(\mathrm{item}_1, \dots, \mathrm{item}_n)$. This study switches to directly predicting the Semantic ID of the next item: the flattened codeword sequence $(c_{1,0}, \dots, c_{1,m-1}, c_{2,0}, \dots, c_{2,m-1}, \dots, c_{n,0}, \dots, c_{n,m-1})$ is given to the model, which predicts the Semantic ID of $\mathrm{item}_{n+1}$, i.e., $(c_{n+1,0}, \dots, c_{n+1,m-1})$.

## Experiment

The framework is tested on three public real-world benchmarks from the [Amazon Product Reviews dataset](https://arxiv.org/abs/1602.01585), using three categories: “Beauty”, “Sports and Outdoors”, and “Toys and Games”. For the semantic encoder, a pre-trained [Sentence-T5](https://huggingface.co/sentence-transformers/sentence-t5-base) model is used.

The table below shows the performance comparison on sequential recommendation.

![image](https://hackmd.io/_uploads/H13neJRt0.png)
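The input construction for generative retrieval can be sketched as follows. This is a simplified illustration, not the paper's code: the helper name `flatten_history` and the example Semantic IDs are made up, and details such as user tokens and per-level vocabularies are omitted.

```python
def flatten_history(semantic_ids: list[tuple[int, ...]]) -> list[int]:
    """Concatenate the m codewords of item_1 .. item_n, in order,
    into one token sequence; the seq2seq model is then trained to
    decode the next m codewords, i.e. the Semantic ID of item_{n+1}."""
    return [code for item in semantic_ids for code in item]

history = [(12, 24, 52), (7, 101, 3)]  # Semantic IDs of item_1, item_2 (m = 3)
tokens = flatten_history(history)
# tokens == [12, 24, 52, 7, 101, 3]; the decoding target would be
# the next item's tuple (c_{3,0}, c_{3,1}, c_{3,2}).
```

At inference time, the decoded $m$-tuple is looked up to recover the recommended item, which is what lets the model generate IDs for new and infrequent items rather than scoring a fixed candidate set.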