<style> img { display: block; margin-left: auto; margin-right: auto; } </style>

> [Paper link](https://openreview.net/pdf?id=Vu-B0clPfq) | [Note link](https://www.51cto.com/article/703298.html) | [Code link](https://github.com/ArvinZhuang/DSI-transformers) | NeurIPS 2022

:::success
**Thoughts**

This paper gives us a new way to do retrieval: a single text-to-text model learns to map string queries directly to relevant docids. Regarding **Semantically Structured Identifiers**, there is related work on structured identifiers: **[A Neural Corpus Indexer for Document Retrieval](https://openreview.net/pdf?id=fSfcEYQP_qc)** turns this idea into an end-to-end deep neural network that unifies the training and indexing stages, which significantly improves recall over traditional methods.
:::

## Abstract

This paper demonstrates that information retrieval can be accomplished with a single Transformer. The authors introduce the **Differentiable Search Index (DSI)**, a new paradigm that learns a text-to-text model mapping string queries directly to relevant docids. DSI significantly outperforms strong baselines such as dual encoder models and demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.

## Introduction

Information retrieval (IR) systems map a user query $q \in \mathcal{Q}$ to a ranked list of relevant documents $\{d_1, \dots, d_n\} \subseteq D$, typically represented by integers or short strings called *document identifiers* (docids). The most widely used IR approaches are based on pipelined *retrieve-then-rank* strategies. This paper proposes an alternative architecture, wherein a sequence-to-sequence (seq2seq) learning system is used to directly map a query $q$ to a relevant docid $j \in \mathcal{Y}$.

![](https://hackmd.io/_uploads/HyRVJhBph.png)

At inference time, the trained model takes a text query $q$ as input and outputs a docid $j$. If desired, beam search can be used to produce a ranked list of potentially relevant docids.

![](https://hackmd.io/_uploads/r1MYe2rTh.png)

There are a number of ways DSI can be realized:

*Document representation.*
- Full text
- Bag-of-words representation used by traditional IR engines

*Docid representation.*
- Representing integers as text strings
- Unstructured atomic docids
- Structured semantic docids

*Indexing.*
- Indexing a corpus: examples $(d_j, j)$ that pair document $d_j$ with its docid $j$
- Retrieving from the index: examples $(q, j)$ that pair a query $q$ with a relevant docid $j$

## Differentiable Search Index

The core idea behind the proposed Differentiable Search Index (DSI) is to fully parameterize the traditionally multi-stage retrieve-then-rank pipeline within a single neural model.

- **Indexing:** a DSI model should learn to associate the content of each document $d_j$ with its corresponding docid $j$. This paper uses a straightforward sequence-to-sequence (seq2seq) approach that takes document tokens as input and generates identifiers as output.
- **Retrieval:** given an input query, a DSI model should return a ranked list of candidate docids. Here, this is achieved with autoregressive generation.
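Both tasks can be cast as ordinary text-to-text training examples for the same model. Below is a minimal sketch of that framing, assuming a Hugging Face T5 tokenizer; the toy data and the `make_example` helper are illustrative, not taken from the official DSI code.

```python
# Minimal sketch: indexing examples (doc_tokens -> docid) and retrieval examples
# (query -> docid) share one text-to-text format. Here the docid is emitted as a
# plain string, i.e. the string-identifier variants discussed below.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

documents = {"42": "The quick brown fox jumps over the lazy dog ..."}  # docid -> text
queries = [("what does the fox jump over", "42")]                     # (query, docid)

def make_example(input_text: str, docid: str, max_input_len: int = 32):
    """Pair an input string (document tokens or query) with its docid target."""
    inputs = tokenizer(input_text, truncation=True, max_length=max_input_len,
                       return_tensors="pt")
    labels = tokenizer(docid, return_tensors="pt")
    return {"input_ids": inputs.input_ids, "labels": labels.input_ids}

# Indexing examples: pair each document's tokens with its docid.
index_examples = [make_example(text, docid) for docid, text in documents.items()]

# Retrieval examples: pair each query with a relevant docid; both sets are
# mixed into a single training run (see "Training and Optimization" below).
retrieval_examples = [make_example(q, docid) for q, docid in queries]
```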
### Indexing Strategies

They train the model to predict docids given a sequence of document tokens. The final strategy employed was Inputs2Targets with direct indexing.

#### Indexing Method - *how to index*

**Inputs2Targets** This frames indexing as a seq2seq task of $\text{doc\_tokens} \rightarrow \text{docid}$. The advantage is that the identifier is the denoising target, which puts it in closer proximity to the loss function. The potential weakness is that the document tokens are not denoising targets, so there is no opportunity for general pre-training on document tokens.

**Targets2Inputs** This frames indexing as the reverse seq2seq task, $\text{docid} \rightarrow \text{doc\_tokens}$, which is equivalent to training an autoregressive language model conditioned on the docid.

**Bidirectional** Trains both Inputs2Targets and Targets2Inputs within the same co-training setup. A prefix token is prepended so the model knows which direction the task is being performed in.

**Span Corruption** Concatenates the identifier to the document tokens as a prefix that can be randomly masked as spans in the span corruption objective.

#### Document Representation Strategies - *what to index*

**Direct Indexing** Takes the first $L$ tokens of a document, with sequential order preserved, and associates them with the docid.

**Set Indexing** De-duplicates repeated terms using the default Python `set` operation and removes stopwords from the document; otherwise it is identical to direct indexing.

**Inverted Index** Randomly subsamples a single contiguous chunk of $k$ tokens and associates it with the docid.

### Representing Docids for Retrieval

How to do this decoding in an effective way largely depends on how docids are represented in the model.

**Unstructured Atomic Identifiers** The most naive way to represent documents is to assign each one an arbitrary (and possibly random) unique integer identifier. The decoding formulation then learns a probability distribution over these identifiers: the model is trained to emit one logit for each unique docid ($|N_{documents}|$ logits in total). The output vocabulary of a standard language model becomes

$$
O = \text{Softmax}([W_{tokens}; W_{docs}]^T h_{last})
$$

where $[;]$ is the row-wise concatenation operator, $W_{tokens} \in \mathbb{R}^{d_{model} \times |N_{tokens}|}$, $W_{docs} \in \mathbb{R}^{d_{model} \times |N_{documents}|}$, and $h_{last} \in \mathbb{R}^{d_{model}}$ is the last layer's hidden state of the decoder stack. To retrieve the top-$k$ documents for a given query, the output logits are simply sorted and the corresponding indices returned. This is reminiscent of standard listwise learning to rank, where all documents are considered at once.

**Naively Structured String Identifiers** Retrieval is accomplished by decoding a docid string sequentially, one token at a time. The partial beam search tree is used to construct top-$k$ retrieval scores; the authors find this approximation to be quite efficient and effective in practice.

**Semantically Structured Identifiers** These aim to automatically create identifiers that satisfy the following properties:
1. The docid should capture some information about the semantics of its associated document.
2. The docid should be structured so that the search space is effectively reduced after each decoding step.

To construct identifiers with these properties, they employ a simple hierarchical clustering process over document embeddings to induce a decimal tree. Given a corpus to be indexed, all documents are clustered into 10 clusters, and each document's identifier begins with the number of its cluster (0-9). For every cluster containing more than $c$ documents, the algorithm is applied recursively; for clusters with $c$ documents or fewer, each element is assigned an arbitrary number from 0 to at most $c - 1$, and its digits are likewise appended to the existing identifier.

![](https://hackmd.io/_uploads/rJ2zkC_a3.png)
![](https://hackmd.io/_uploads/BkIeJCOph.png)
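This clustering procedure is easy to prototype. Below is a minimal sketch assuming the document embeddings are already computed (the paper uses a small BERT encoder and scikit-learn k-means); the function names and the default $c$ are illustrative, not taken from the paper's code.

```python
# Minimal sketch of the hierarchical 10-way clustering that induces semantic
# docids. `embeddings` is assumed to be an (n_docs, dim) NumPy array.
import numpy as np
from sklearn.cluster import KMeans

def assign_semantic_docids(embeddings: np.ndarray, c: int = 100) -> dict:
    """Return {doc_index: docid_string} built from a recursive decimal tree."""
    return _cluster(embeddings, np.arange(len(embeddings)), c, prefix="")

def _cluster(embeddings, doc_indices, c, prefix):
    docids = {}
    if len(doc_indices) <= c:
        # Small cluster: assign arbitrary numbers 0 .. len-1 and stop recursing.
        for i, doc in enumerate(doc_indices):
            docids[int(doc)] = prefix + str(i)
        return docids
    # Large cluster: split into 10 sub-clusters and append the cluster digit.
    labels = KMeans(n_clusters=10, n_init=10).fit_predict(embeddings)
    for k in range(10):
        mask = labels == k
        docids.update(_cluster(embeddings[mask], doc_indices[mask], c, prefix + str(k)))
    return docids
```

Because semantically similar documents end up sharing docid prefixes, beam search over these identifiers prunes the search space after every decoding step, which is exactly the second property above.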
### Training and Optimization

They explore two main strategies for training DSI models. The first and more straightforward strategy is to first train the model to perform indexing, followed by a fine-tuning stage in which the trained model maps queries to docids. The second strategy is to train both tasks together in a multi-task setup. The latter performs significantly better, especially when the proportion of indexing to retrieval task examples is high.

## Experiments

**Dataset** Natural Questions (NQ).

**Metrics** Hits@N with N = {1, 10}. This metric reports the proportion of correct documents ranked in the top N predictions.

**Implementation Details** All DSI models are initialized from standard pretrained T5. Semantically structured identifiers are generated using an 8-layer BERT and the default k-means clustering in scikit-learn.

![](https://hackmd.io/_uploads/HkCBictah.png)
![](https://hackmd.io/_uploads/S198o9Fpn.png)

---

**Document Identifiers** Generally, they find that **structured semantic identifiers** are helpful and improve over unstructured identifiers.

**Indexing Strategies** The Inputs2Targets and Bidirectional formulations perform the best.

**Document Representations** The direct indexing approach works best. The inverted index method is difficult to train, since the docid is repeatedly exposed to different tokens.

![](https://hackmd.io/_uploads/SyH5n9Kp3.png)

**Scaling Laws**

![](https://hackmd.io/_uploads/rJT33qta2.png)

**Interplay Between Indexing and Retrieval**

![](https://hackmd.io/_uploads/BygCnqFTn.png)

## Conclusion

This paper defines novel indexing and retrieval tasks that encode the relationship between terms and docids entirely within the parameters of a Transformer model. It proposes a number of different ways to represent documents and docids, and explores different model architectures and training strategies.
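For reference, the Hits@N metric reported in the experiments above can be computed in a few lines; this is just a minimal sketch with illustrative names, not the paper's evaluation code.

```python
# Minimal sketch of Hits@N: the fraction of queries whose correct docid
# appears among the model's top-N ranked predictions.
def hits_at_n(ranked_docids, gold_docids, n=10):
    hits = sum(1 for preds, gold in zip(ranked_docids, gold_docids) if gold in preds[:n])
    return hits / len(gold_docids)

# Toy example: two queries with beam-search-style ranked docid lists.
ranked = [["17", "3", "42"], ["8", "42", "5"]]
gold = ["42", "42"]
print(hits_at_n(ranked, gold, n=1))   # 0.0
print(hits_at_n(ranked, gold, n=10))  # 1.0
```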