# **The Ultimate Guide to Re-Ranking**

## **Abstract**

Welcome to my deep dive into **multi-stage search pipelines** and why **re-ranking** is such a critical component in delivering high-quality results. Whether you’re an engineer looking to build an AI-driven search system or simply curious about how modern search applications refine their suggestions, this ebook walks you through the key models and trade-offs involved. We’ll explore **Cross-Encoders**, **Late Interaction** methods like ColBERT, and **LLM-based listwise approaches**, culminating in practical insights for choosing the right strategy for your application’s scale and requirements.

---

## **1. Introduction to Re-Ranking**

Over the past two years, I’ve been on a mission to understand how the best search engines in the world go beyond basic keyword matching to deliver results that feel **tailor-made** for users--deeply relevant results with high precision and low latency. I realized that it’s not enough to just take the “top n” documents from a BM25 or dense vector retrieval—they need to be **re-ranked** by more sophisticated models to get that final ordering just right.

That “final ordering” might not sound like a big deal, but in practice, it can make the difference between a user saying *“This search tool is awesome!”* and *“Why am I seeing these irrelevant results?”*--or in the case of RAG, the difference between hallucinated gibberish and a coherent answer to a question.

I decided to compile all my research, experiments, and learnings into this ebook, **The Ultimate Guide to Re-Ranking**. Here, I’ll walk you through:

1. **How search systems evolved** from simple keyword-based matching like BM25 to sophisticated dense embeddings.
2. **What “two-stage search”** means and why it’s practically a must if you want to balance speed and accuracy.
3. The **different approaches** to re-ranking—pointwise, pairwise, and listwise—and how **Cross-Encoders**, **ColBERT**, and **LLM-based** rerankers fit into these categories.
4. The **performance and cost trade-offs** you need to be mindful of, especially if you’re building a real-world system with tight latency or memory constraints.
5. A look at **commercial providers** and **open-source** options (including tips on how to pick one for your own projects).
6. Finally, **what’s coming next**—trends like layerwise models and multimodal reranking.

By the end of this ebook, you’ll have a solid handle on how re-ranking fits into the modern search stack, why it’s so important, and how to choose the best method for your own use case, scale, and budget.

### 1.1 The Evolution of Search Systems

![image](https://hackmd.io/_uploads/HJqlmpVNke.png)

In the early days, search was dominated by exact matching and simple text-based algorithms. The introduction of TF-IDF (Term Frequency-Inverse Document Frequency) marked our first step toward more nuanced relevance scoring. This was followed by BM25, an enhanced version of TF-IDF, which became the industry standard for its robust performance and efficiency. BM25's reign was long and prosperous, supplemented by innovations like trigram search that improved matching flexibility.

The emergence of dense embeddings marked the next major revolution. These neural network-derived representations allowed systems to capture semantic relationships far beyond literal text matching. Documents and queries could now be encoded as dense vectors, enabling similarity comparisons in high-dimensional space.
This breakthrough opened new possibilities but also introduced new challenges – vector search, while powerful, proved computationally intensive compared to traditional keyword approaches and often missed documents that keyword-based methods could easily retrieve.

Rather than replacing existing methods, these advances led to a more nuanced understanding: different approaches excel at different aspects of search. BM25 remains unmatched for exact keyword matching and efficiency, while dense embeddings excel at capturing semantic relationships. This realization gave birth to hybrid search systems, where multiple retrieval methods work in concert.

### 1.2 Understanding Two-Stage Search

![image](https://hackmd.io/_uploads/SkUc764Ekx.png)

Two-stage search emerged as an elegant solution to balance efficiency and accuracy in modern information retrieval systems. The approach splits the search process into distinct phases: initial retrieval and refined re-ranking.

The first stage prioritizes recall, casting a wide net to capture potentially relevant documents from a large corpus. This stage must be computationally efficient, as it processes the entire document collection. Traditional keyword search/BM25 or approximate nearest neighbor search for dense vectors serve as common choices here. The goal isn't perfect precision but ensuring relevant documents make it into the candidate pool. We can improve recall in this stage by performing hybrid search: incorporating both keyword and dense retrievers.

The second stage, re-ranking, focuses on precision. Working with a much smaller set of candidates, this stage can employ more sophisticated and computationally intensive algorithms to determine the optimal ordering. This is where neural re-rankers shine, bringing their deep understanding of language and relevance to bear on a manageable subset of documents. The goal of this second stage is no longer recall, but rather precision. We want to serve relevant documents to the user/LLM, and want to reduce the number of irrelevant ones that can distract the user or cause the LLM to hallucinate.

Because re-rankers can effectively re-rank documents retrieved with both keyword matching and dense search, it has proven extremely effective to use re-rankers on documents fetched with hybrid search (i.e., with both keyword and dense search).

Another reranking technique is Reciprocal Rank Fusion (RRF), which can combine relevance indicators from different retrieval results into a final ordering. However, RRF has issues: for example, it's not always clear how to combine the relevance indicators from different retrieval results, and the relevance indicators produced by the first-stage search are typically less accurate than those from the second stage. Also, since RRF prioritizes documents that score highly across multiple retrievers, it can deprioritize documents that score extremely highly for one retriever, say dense search, but don't contain the relevant keywords. Nevertheless, it's often a good-enough strategy, as RRF can be applied instantly when the accuracy gains of an additional re-ranking stage are not worth the latency penalty.
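To make RRF concrete, here is a minimal sketch of the standard formulation: each document's fused score is the sum of 1/(k + rank) over the retrievers that returned it, where k is a smoothing constant commonly set to 60. The document IDs below are purely illustrative:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ordering using RRF."""
    scores: dict[str, float] = {}
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ranked lists from two first-stage retrievers
bm25_results = ["doc_a", "doc_b", "doc_c"]
dense_results = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_results, dense_results])
print(fused)  # doc_b comes out on top because both retrievers rank it highly
```

Documents that appear near the top of several result lists accumulate the highest fused scores, which is exactly the behavior (and the bias) described above.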
Rerankers can also be effectively used along with RRF or other fusion-based strategies. For example, it's possible to fetch the top 500 documents with BM25, the top 500 documents with dense search, fuse them with RRF to quickly get the top 30 documents, and then finally rerank those top 30 documents with a cross-encoder.

I bring up fusion-based strategies like RRF because it's not always necessary to reach for a reranker when RRF can be good enough without the latency or cost overhead. RRF is also a good strategy to benchmark against.

### 1.3 The Importance of Re-Ranking in Modern Search

![image](https://hackmd.io/_uploads/HkD8DkSVJx.png)

Re-ranking has become critical in modern search systems for several compelling reasons. User expectations have evolved dramatically – people now expect search systems to understand intent, handle complex queries, and return highly relevant results. Single-stage retrieval systems, no matter how sophisticated, struggle to consistently meet these expectations while maintaining reasonable performance.

The scale of modern information systems demands efficient solutions. With document collections routinely reaching billions of items, performing complex relevance calculations on the entire corpus is impractical. Re-ranking allows us to apply sophisticated analysis where it matters most – on the most promising candidates.

Furthermore, re-ranking provides a natural integration point for incorporating multiple signals and constraints. While initial retrieval might focus on content similarity, re-ranking can consider factors like a more sophisticated relevance score, freshness, popularity, user preferences, and business rules. This flexibility is crucial for real-world applications where relevance is multifaceted.

Modern search systems reach an incredibly large scale, but with a re-ranking step, we can still retrieve deeply relevant results. By narrowing the number of candidates, we can be extremely sophisticated in our methodology for scoring and determining the most promising candidates. This allows us to adapt to a diverse set of use cases and requirements that would otherwise be very challenging to accommodate.

## 2. Re-Ranking Architectures and Techniques

### 2.1 Cross-Encoder Architecture

<!-- (Architectural diagram: A detailed flowchart showing the cross-encoder model structure. The left side shows input processing where query and document text are concatenated. The center shows the transformer layers with attention visualization. The right side shows the classification head and scoring mechanism. Additional detail panels highlight the internal attention patterns and token interactions.) -->

![image](https://hackmd.io/_uploads/S1muqJrNJg.png)

Cross-encoders are a straightforward yet powerful approach to neural re-ranking: instead of encoding queries and documents separately, they **process the query-document pair as a single input**. This setup allows the model to capture rich cross-attention between query and document terms, resulting in more nuanced relevance assessments.

Cross-encoders build on **pre-trained language models like BERT** but differ from typical “BERT classifiers” in a key way. In a simple BERT classifier setup (e.g., sentiment analysis), the model takes only one sequence as input.
By contrast, cross-encoders concatenate both the query and document into a single sequence with a special separator (often `[SEP]`)—and then apply transformer layers that capture interactions **across** every token in both texts [(Nogueira & Cho, 2019)](https://arxiv.org/abs/1901.04085). This joint encoding feeds a final classification layer that outputs a relevance score.

#### Why They’re So Effective… and So Costly

Cross-encoders excel at re-ranking because they model fine-grained term-by-term relationships, making them particularly strong for **complex queries**. They incorporate **cross-attention**, which means the model can capture interactions between tokens in both the query and document. However, this rich interaction comes with significant computational overhead:

- **On-the-Fly Computation**: For each query, every candidate document must pass through the entire model, making the computational cost scale linearly with the size of the candidate set.
- **No Precomputation**: Unlike bi-encoders (or dual encoders), cross-encoders cannot store reusable document embeddings. They must perform a fresh forward pass for each query-document pair, incurring high latency for large candidate sets.
- **Throughput Challenges**: For long documents or lengthy queries, even optimized cross-encoders can push latency to hundreds of milliseconds or more per query, making them expensive to run—especially if you’re re-ranking hundreds of documents at once.

Despite these challenges, cross-encoders remain **among the most popular re-ranking methods** today. Cohere’s Reranker, for example, helped popularize an API-based approach where developers can integrate cross-encoder functionality with minimal code. Many other vendors, such as Voyage and Jina, also offer cross-encoders as an easy plug-in solution.

#### Practical Tips for Mitigating Latency

- **Shorten Document Texts**: Reducing the input length markedly decreases inference times; chunking documents or summarizing them is often an effective workaround.
- **Use Optimized Frameworks**: Tools like [Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference) can bring inference below 50ms on a GPU, depending on text length.
- **Balance Scale vs. Quality**: Because cross-encoders can be both slow and costly (often over \$1 per 1000 queries in commercial API settings), they’re best applied to a relatively small set of top candidates—typically in the final stage of a two-stage (or multi-stage) retrieval pipeline.

In short, cross-encoders deliver exceptional accuracy for fine-grained re-ranking but demand thoughtful consideration of latency, cost, and deployment constraints. They’re well-suited for scenarios where **precision** is paramount and you can afford the computational budget—whether that’s in a consumer-facing search pipeline or a specialized domain with high-value queries.

### Adding Cross-Encoders to Your Search Pipeline

Let's walk through a complete example of adding a cross-encoder to your search pipeline using KDB.AI and Cohere's reranker. This example demonstrates how to:

1. Set up your environment
2. Create a database with appropriate indexes
3. Prepare and insert your data
4. Perform search with reranking

For this example, you'll need the following credentials:

- OpenAI API key, which you can find in your [OpenAI dashboard](https://platform.openai.com/api-keys).
- KDB.AI API key and endpoint URL, which you can find by going to [KDB.AI](https://kdb.ai) and logging in.
- Cohere API key, which you can find in your [Cohere dashboard](https://dashboard.cohere.ai).

First, install the required dependencies:

```bash
pip install langchain openai kdbai_client
```

Now let's set up our database and create a table with appropriate indexes for vector search:

```python
import os

import numpy as np
import pandas as pd
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI
import kdbai_client as kdbai

# Set up OpenAI client
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
KDBAI_ENDPOINT = 'YOUR KDB.AI ENDPOINT URL'
KDBAI_API_KEY = 'YOUR KDB.AI API KEY'
COHERE_API_KEY = 'YOUR COHERE API KEY'

client = OpenAI()
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

# Define schema with embeddings column
schema = [
    {'name': 'id', 'type': 'int16'},
    {'name': 'content', 'type': 'str'},
    {'name': 'embeddings', 'type': 'float64s'}
]

# Create indexes for efficient vector search
indexes = [
    {'type': 'flat', 'name': 'flat', 'column': 'embeddings',
     'params': {'dims': 1536}},
    {'type': 'hnsw', 'name': 'fast_hnsw', 'column': 'embeddings',
     'params': {'dims': 1536, 'M': 8, 'efConstruction': 8}},
    {'type': 'hnsw', 'name': 'accurate_hnsw', 'column': 'embeddings',
     'params': {'dims': 1536, 'M': 64, 'efConstruction': 256}}
]

# Create database and table
database = session.database('rerank')
table = database.create_table('rerank', schema=schema, indexes=indexes)
```

Next, let's create helper functions to prepare our data:

```python
def get_data_chunks(data: str) -> list:
    """Split text into chunks for processing."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=384,
        chunk_overlap=20,
        length_function=len,
        is_separator_regex=False,
    )
    texts = text_splitter.create_documents([data])
    return texts


def get_embedding(text: str, model: str = 'text-embedding-3-small') -> list[float]:
    """Generate embeddings for a text using OpenAI's API."""
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding


# Process and insert your data
def process_and_insert_data(data: str, table) -> None:
    """Process text data and insert into KDB.AI table."""
    df = pd.DataFrame(columns=["id", "content", "embeddings"])
    data_chunks = get_data_chunks(data)

    for num, data_chunk in enumerate(data_chunks):
        text = data_chunk.page_content
        text_embeddings = get_embedding(text)
        df.loc[len(df), :] = (num, text, text_embeddings)

    table.insert(df)
```

Now we can set up our search with reranking. We'll use Cohere's reranker:

```python
from kdbai_client.rerankers import CohereReranker


def search_with_reranking(query: str, table, top_k: int = 3) -> pd.DataFrame:
    """
    Search documents and rerank results using Cohere.

    Args:
        query: Search query
        table: KDB.AI table object
        top_k: Number of results to return

    Returns:
        DataFrame with reranked results
    """
    # Generate query embeddings
    query_embedding = get_embedding(query)

    # Initialize reranker
    reranker = CohereReranker(
        api_key=COHERE_API_KEY,
        model='rerank-english-v3.0',
        overfetch_factor=4  # Fetch 4x more candidates for reranking
    )

    # Perform search with reranking
    results = table.search_and_rerank(
        vectors={"flat": [query_embedding]},
        n=top_k,
        reranker=reranker,
        queries=[query],
        text_column='content'
    )

    return results


# Example usage
if __name__ == "__main__":
    # Your document text
    document_text = """
    Your document text here...
    """

    # Process and insert data
    process_and_insert_data(document_text, table)

    # Perform search with reranking
    query = "What are the key points discussed?"
    results = search_with_reranking(query, table)
    print(results)
```

This code demonstrates a complete pipeline that:

1. Creates a vector database with appropriate indexes for efficient search
2. Chunks and embeds documents using OpenAI's embedding model
3. Stores the documents and their embeddings
4. Performs vector search and reranks the results with Cohere's reranker

Key parameters to tune:

- `chunk_size`: Adjust based on your document length and desired granularity
- `overfetch_factor`: Increase to consider more candidates during reranking
- `top_k`: Number of final results to return
- `model`: Choose between different Cohere reranking models based on your needs

The reranking step significantly improves search quality by applying a more sophisticated relevance model to the initial candidate set. You can then directly show the results to the user or use them for Retrieval Augmented Generation (RAG) to generate grounded responses. KDB.AI's SDK makes it simple to add Jina, Cohere, and VoyageAI rerankers to your pipeline with just a couple lines of code, but as you'll see later in this ebook, there are many more options to choose from.

#### **Cross-Encoders vs. Traditional BERT Classifiers and Bi-Encoders**

So, why exactly are cross-encoders considered the gold standard for re-ranking in modern search systems? To understand, we need to compare them to their close relatives: traditional BERT classifiers and bi-encoders. Are cross-encoders essentially BERT classifiers in disguise...?

![9d7fx4](https://hackmd.io/_uploads/HynBzeBNyl.jpg)

#### **Let’s start with traditional BERT classifiers.**

At their core, traditional BERT classifiers are designed to work with a single input sequence—think of tasks like sentiment analysis or spam detection. You feed the model a piece of text, it crunches through its transformer layers, and out comes a prediction: “positive,” “negative,” or whatever label you’re working with.

Here’s the workflow in plain terms: BERT takes your input (a query or document), represents it through contextual embeddings (like a summarized essence of the text), and spits out a class probability distribution. The magic happens when the model decides the most probable class using a softmax layer.

![image](https://hackmd.io/_uploads/BkBambBE1g.png)

Source https://mriquestions.com/softmax.html

This setup is elegant and efficient for single-sequence tasks, but here’s the problem: it can’t handle pairwise relationships. You’d be stretching it to use a traditional BERT classifier for understanding how *two* texts relate, and that’s where bi-encoders and cross-encoders shine.

#### **Now, enter bi-encoders.**

Instead of processing a query and document together, bi-encoders handle them separately. Each is transformed into a dense vector—a compact numerical representation capturing its essence. Then, to measure their compatibility (or relevance), instead of passing them through a classifier, you compare these vectors directly with a similarity metric like cosine similarity. It’s a bit like asking: “How close are the two vectors in semantic space?”

Bi-encoders shine in large-scale scenarios because they allow precomputing document vectors. You can stash these vectors away in a vector database and reuse them for multiple queries, which makes them blazing fast for retrieval tasks. But there’s a downside. By processing the query and document independently, bi-encoders can miss the nuanced interplay between the two. For example, if the query asks "who is from NYC?" and the document mentions "50 Cent", the bi-encoder might overlook the subtle match because it’s not analyzing the pair together. That’s where cross-encoders step in.
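To see the contrast in code, here is a small sketch using the sentence-transformers library. The two model names are just commonly used public checkpoints, and the exact numbers you get are illustrative rather than important:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "who is from NYC?"
document = "50 Cent grew up in South Jamaica, Queens, New York City."

# Bi-encoder: encode query and document separately, then compare pooled vectors.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = bi_encoder.encode(query, convert_to_tensor=True)
doc_vec = bi_encoder.encode(document, convert_to_tensor=True)
print("Bi-encoder cosine similarity:", util.cos_sim(query_vec, doc_vec).item())

# Cross-encoder: score the (query, document) pair jointly in a single forward pass.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("Cross-encoder relevance score:", cross_encoder.predict([(query, document)])[0])
```

The bi-encoder can only compare two pooled vectors, while the cross-encoder attends across every token of both texts before producing its score.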
#### **Finally, the cross-encoder.**

Instead of treating the query and document as separate entities, the cross-encoder processes them as a single unit. Picture this: the query and document are combined into a single input with a special `[SEP]` token marking the boundary. The model doesn’t just understand each text in isolation; it captures their interactions, the subtle nuances of context and relevance that make one document stand out over another. However, the architectural differences between cross-encoders and BERT classifiers are subtle. The cross-encoder doesn’t generate reusable vectors like a bi-encoder. Instead, it scores the pair directly by analyzing the rich interplay between the query and document terms--it does this by classifying the pair as relevant or not. This deep modeling means it can catch details others miss—like understanding that 50 Cent is from NYC, and is therefore relevant.

The result is a relevance score, normalized between 0 and 1 using a sigmoid function, which we can then turn into a percentage if we would like. We can read a cross-encoder score as: “This document is 92% relevant to the query.”

This score isn't just useful for reranking: it's also good for observability. If the top document for a query is less than 40% relevant, it's possible either that our dataset doesn't have an answer to the query, or that the recall of the previous stage of the pipeline is too low. And if a lot of queries are fetching only irrelevant documents, then we may need to re-examine our entire pipeline. Most importantly for RAG, we can use the score to decide whether a document should be fed to our LLM or not--why include an irrelevant document that might make our LLM hallucinate?

But this deeply convenient score comes at a cost. Cross-encoders are computationally expensive because they must process each query-document pair individually at query time. If you are reranking thousands of documents, that can be... painful, as the typical batch size for cross-encoders is 32. API providers typically limit the number of documents to rerank per query to 1k. Unless you have a room full of GPUs or a hefty API budget, scaling becomes a real challenge.

Another good thing to consider is that increasing k, the number of documents that you rerank, won't automatically improve your accuracy, but rather can make your results **worse**. This is because cross-encoders often overfit, and give some documents unreasonably good scores despite them being completely irrelevant to the query. For this reason, k is a hyperparameter to optimize from the perspective of throughput, latency, and accuracy, with the sweet spot typically falling around 100 for the current generation of cross-encoders. Because of this tendency to overfit, LLMs are sometimes also employed to analyze the relevance of returned documents, as they tend to offer greater reliability and are less prone to overfitting.

![image](https://hackmd.io/_uploads/B169tQSN1e.png)

Source https://arxiv.org/html/2411.11767v1

#### **So, what’s the verdict?**

Here’s how it shakes out in practice:

- **Traditional BERT classifiers** are great for single-sequence tasks where relationships between inputs don’t matter.
- **Bi-encoders** are the go-to for efficiency and scale, excelling in retrieval tasks where you can precompute embeddings.
- **Cross-encoders** are the precision tools, perfect for smaller candidate sets where you need to squeeze every drop of nuance out of your query-document pair.

In the end, the choice depends on your needs. If you’re re-ranking the top 50 candidates from an initial retrieval stage, the cross-encoder would be an excellent choice. But if you’re searching through a billion documents, bi-encoders are absolutely necessary, and you'll likely need one or more reranking stages to get the best results.

Because cross-encoders are essentially classifiers, we can provide examples to fine-tune them to better score document relevance, or to define relevance differently (say, if we were reranking job candidates, in which case pure similarity is not enough). This makes them extremely adaptable to new reranking tasks, as long as we have some data to fine-tune with.

### 2.2 Late Interaction Models

<!-- (Technical diagram: Split into three panels showing the ColBERT architecture. Left panel shows document processing and indexing. Center panel demonstrates query processing. Right panel illustrates the MaxSim operation and final scoring. Arrows indicate data flow and transformations. Small performance metrics charts appear in corners.) -->

![image](https://hackmd.io/_uploads/Sko3MgSN1e.png)

Image from https://medium.com/@varun030403/colbert-a-complete-guide-1552468335ae

Late interaction models, exemplified by ColBERT, represent an elegant compromise between efficiency and effectiveness. Like bi-encoders, these models maintain separate representations but delay the final relevance calculation until query time. The main difference is in the way they calculate relevance. When embedding models create vectors, they pool (average) token-level vectors into a final vector. This can cause information to be lost.

![image](https://hackmd.io/_uploads/r1R1sbB41l.png)

Source https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/

ColBERT doesn't pool vectors, but instead keeps token-level vector representations--after the query vectors have been processed, the model calculates the final relevance score using a MaxSim operation.

The key insight lies in decomposing the relevance calculation into two steps. First, queries and documents are independently encoded into contextualized term representations. Then, fine-grained interaction between these representations occurs at query time. This approach allows document representations to be precomputed and indexed while maintaining term-level interactions that capture fine-grained relevance signals.

To best understand ColBERT, let's go through an example:

### **ColBERT in Action: A Quick Walkthrough**

Imagine you’re searching for **"affordable laptops for gaming"**, and a candidate document reads:

> "This store offers budget-friendly laptops designed for gamers."

ColBERT breaks the search process into two steps: **precomputing document representations** and **delaying detailed matching until query time**—an approach that balances speed and accuracy.

---

#### **Step 1: Encoding the Query and Document**

The query words **"affordable"**, **"laptops"**, and **"gaming"** are turned into vectors, each capturing their meaning (this is a bit of a simplification because the query is split into tokens, which might not match words exactly).
Similarly, the document is split into words like **"budget-friendly"**, **"laptops"**, and **"gamers"**, each represented as vectors. Here’s the clever bit: the document’s vectors are precomputed and stored, so they don’t need to be recalculated for every query (just like how a bi-encoder precomputes document vectors).

---

#### **Step 2: Late Interaction with MaxSim**

When the query runs, ColBERT matches each query term to the **most similar** term in the document using the MaxSim operation:

$$\text{MaxSim}(q_i, D) = \max_{d_k \in D} \text{Sim}(q_i, d_k)$$

The final relevance score is the sum of these maxima over all query terms:

$$\text{Score}(Q, D) = \sum_{q_i \in Q} \text{MaxSim}(q_i, D)$$

Here's some example code to illustrate the MaxSim operation:

```python=
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


def maxsim(query_vector, document_vectors):
    """Compute MaxSim for a single query vector against all document vectors."""
    return max(cosine_similarity([query_vector], document_vectors)[0])


def relevance_score(query_vectors, document_vectors):
    """Compute the overall relevance score between query and document using MaxSim."""
    return sum(maxsim(q_vec, document_vectors) for q_vec in query_vectors)


# Simulated embeddings for the query "affordable gaming computer"
query_vectors = np.array([
    [0.9, 0.1, 0.0],  # "affordable"
    [0.0, 0.9, 0.1],  # "gaming"
    [0.1, 0.0, 0.9]   # "computer"
])

# Simulated embeddings for the document "budget-friendly gaming laptop"
document_vectors = np.array([
    [0.8, 0.2, 0.1],  # "budget-friendly"
    [0.0, 0.8, 0.2],  # "gaming"
    [0.2, 0.1, 0.9]   # "laptop"
])

# Compute relevance score
score = relevance_score(query_vectors, document_vectors)
print(f"Relevance Score: {score}")
# Relevance Score: 2.9631529770688196
```

Here is the detailed breakdown of the contributions of each query term to the overall relevance score:

| **Query Term** | **Document Term Similarities** | **MaxSim Contribution** | **Explanation** |
|----------------|--------------------------------|-------------------------|-----------------|
| **Query Term 1** | `[0.9838 (budget-friendly), 0.1071 (gaming), 0.2263 (laptop)]` | **0.9838** | "Affordable" aligns closely with "budget-friendly," contributing strongly. |
| **Query Term 2** | `[0.2526 (budget-friendly), 0.9910 (gaming), 0.2143 (laptop)]` | **0.9910** | "Gaming" matches perfectly with "gaming," yielding the highest contribution. |
| **Query Term 3** | `[0.2260 (budget-friendly), 0.2411 (gaming), 0.9884 (laptop)]` | **0.9884** | "Computer" aligns best with "laptop," contributing significantly. |

### Total Relevance Score: **2.9632**

- **Query Term 1** contributes **0.9838**, primarily due to its high similarity with "budget-friendly."
- **Query Term 2** contributes **0.9910**, as it matches "gaming" almost perfectly.
- **Query Term 3** contributes **0.9884**, thanks to its strong alignment with "laptop."

When we add up these contributions, we get a total relevance score of **2.9632**. Note that while the same token **gaming** occurs in both the query and the document, its embedding can be slightly different because it is context-dependent.

---

#### Why ColBERT Works

ColBERT represents a powerful compromise in the search architecture space, combining efficiency with sophisticated relevance modeling. Its effectiveness stems from two key design choices: precomputed document representations and late interaction matching. By precomputing document representations, ColBERT achieves significant performance advantages for large-scale deployments.
The late interaction method enables precise word-to-word relevance matching, providing a depth of understanding that simpler architectures cannot match.

However, ColBERT's advantages come with significant implementation challenges. Storage requirements present the most immediate concern, as the architecture requires maintaining multiple vector representations for each token in every document. This can result in substantial storage overhead, particularly for large document collections. The computational cost of the MaxSim operation during query time also presents challenges, as comparing each query token against all document tokens requires numerous calculations.

Production implementations of ColBERT require careful optimization strategies to manage these challenges. Quantization techniques like binary quantization help reduce storage requirements while maintaining acceptable accuracy. Locality-sensitive hashing enables fast approximate matching, reducing computational overhead. Pooling strategies can decrease the number of vectors per document, though this requires careful tuning to prevent accuracy degradation. GPU acceleration through optimized CUDA implementations can even be used to speed up the MaxSim operation.

ColBERT typically finds its most effective use within multi-stage retrieval pipelines. A common architecture might begin with hybrid search combining dense and sparse vectors for initial retrieval, followed by ColBERT-based re-ranking on a reduced candidate set, and finally applying a cross-encoder to the top candidates. This progressive refinement approach leverages ColBERT's strengths while managing its computational costs. While ColBERT search across an entire dataset would prove prohibitively expensive, it performs efficiently on smaller candidate sets of around a thousand documents, making it an ideal intermediate stage between broad retrieval and focused re-ranking.

The scalability of ColBERT implementations demands careful attention to system design and optimization. Performance characteristics can change dramatically as dataset size increases, requiring ongoing tuning of quantization parameters and matching strategies. Success with ColBERT often depends on finding the right balance between computational efficiency and retrieval quality for your specific use case.

### 2.3 Ranking Approaches

Modern re-ranking systems employ three distinct approaches to the ranking problem, each with its own theoretical foundation and practical implications.

#### Pointwise Ranking

Pointwise ranking treats the problem as a regression or classification task, where each query-document pair receives an independent relevance score. This approach is conceptually straightforward and training is relatively simple. The system examines each document in isolation, assigning absolute relevance scores without considering other candidates. Cross-encoders are a great example of this approach.

Interestingly, an LLM can also do this with a simple trick! We just ask it to output a score, define criteria for every score range (e.g., 0-1 is unrelated, 1-2 is slightly related, etc.), and then use those criteria to give the document a score. The LLM becomes a classifier, and we can also get a continuous score by multiplying each candidate score by its probability (using the logits/probabilities of the score tokens).

**Example**: Say we are classifying a joke with a score out of 10 on how funny it is, and the probability of “8” is 0.9, the probability of “7” is 0.1, and the probability of every other score is 0. We can find the overall score to be \(8 \times 0.9 + 7 \times 0.1 = 7.9\).
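Here is a small sketch of that expected-score calculation, assuming we have already extracted the probabilities of the candidate score tokens (for example, from an API that returns log-probabilities); the numbers mirror the joke example above:

```python
import math

# Hypothetical log-probabilities of the score token, as an LLM API might return them
logprobs = {"8": math.log(0.9), "7": math.log(0.1)}

# Convert back to probabilities, normalize, and take the probability-weighted average
probs = {score: math.exp(lp) for score, lp in logprobs.items()}
total = sum(probs.values())
expected_score = sum(int(score) * p / total for score, p in probs.items())
print(expected_score)  # ≈ 7.9 -- a continuous score instead of a hard "8"
```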
#### Pairwise Ranking

Pairwise approaches elevate the ranking concept by learning to compare document pairs, determining relative ordering between candidates. This methodology aligns more closely with the fundamental nature of ranking as an ordering task rather than a scoring task. The training process focuses on learning the relative preferences between documents, which often matches real-world relevance judgments more naturally. We can do this with an LLM, and can even fine-tune LLMs for this task, but it’s extremely inefficient.

#### Listwise Ranking

Listwise methods represent the most sophisticated approach, considering the entire candidate set when making ranking decisions. These methods can capture complex dependencies between documents and optimize for ranking metrics directly. The approach enables consideration of diversity, coverage, and position bias in ways impossible with simpler methods. It’s also the most similar to how humans rank—we take the entire list into account.

In sensitive tasks, where final recommendation quality is vital, LLMs can be used to rerank the final documents at the end of a pipeline. LLMs are also much more flexible, and can rerank with an arbitrary and complex criterion, making them great for challenging retrieval pipelines like job candidate or people search.

| Ranking Approach | Theoretical Foundation | Practical Implementation | Advantages | Limitations | Complexity | Performance Characteristics |
|------------------|------------------------|--------------------------|------------|-------------|------------|-----------------------------|
| **Pointwise** | Treats ranking as regression or classification (independent scores for each query-document pair). | Simpler training; cross-encoders excel at this. LLMs can score documents individually with criteria for score ranges. | Conceptually straightforward, efficient for isolated scoring tasks, and easy to integrate. | Ignores dependencies between documents. | Low | Good for simple tasks |
| **Pairwise** | Learns to compare document pairs to determine relative ordering. | Focuses on learning preferences between document pairs; often requires fine-tuning. | Captures relative preferences naturally and aligns well with real-world judgments. | Inefficient with LLMs due to pairwise comparisons; not ideal for large candidate sets. | Moderate | Great for relative comparisons |
| **Listwise** | Considers the entire candidate set, optimizing ranking metrics directly. | Uses LLMs or sophisticated models to rerank entire candidate sets with flexible criteria. | Considers diversity, coverage, and position bias; most similar to human judgment. | Computationally expensive; requires a high-quality initial candidate set for reranking. Context window constraints. Slow. | High | Best for final reranking |

---

#### Traditional Learning-to-Rank (LTR) Frameworks

Alongside modern neural or LLM-based ranking methods, **classical LTR** (Learning to Rank) solutions remain staples in large-scale production systems, particularly in web search, e-commerce, and recommendation pipelines. These methods can implement *pointwise, pairwise, or listwise* training objectives, but they often rely on feature engineering and tree-based models:

- **XGBoost’s Ranking Module**: Applies a pairwise or listwise objective to gradient-boosted trees, easily accommodating business rules or user-specific features (e.g., click signals, recency); see the sketch after this list.
- **LambdaMART**: A widely deployed pairwise-then-listwise method optimizing ranking metrics like NDCG. It is well-known in commercial search engines and advertisement ranking.
- **RankNet / ListNet**: Classic neural LTR approaches (pre-Transformer era) that introduced pairwise and listwise losses, albeit on simpler MLP or feed-forward backbones.
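As referenced in the first bullet, here is a minimal XGBoost ranking sketch. The features and relevance labels are entirely synthetic stand-ins for whatever your feature pipeline produces (BM25 score, click rate, freshness, and so on):

```python
import numpy as np
from xgboost import XGBRanker

# Synthetic feature matrix: each row is a (query, document) pair with
# three made-up features, e.g. [bm25_score, click_rate, freshness].
X = np.array([
    [12.1, 0.30, 0.9], [8.4, 0.10, 0.2], [5.0, 0.05, 0.7],   # query 1 candidates
    [9.7, 0.25, 0.1], [11.2, 0.40, 0.8], [3.3, 0.02, 0.5],   # query 2 candidates
])
y = np.array([2, 1, 0, 1, 2, 0])   # graded relevance label for each candidate
group = [3, 3]                     # number of candidates belonging to each query

ranker = XGBRanker(objective="rank:ndcg", n_estimators=50)
ranker.fit(X, y, group=group)

# Score the candidates of a new query and sort them by predicted relevance.
new_candidates = np.array([[10.0, 0.20, 0.6], [6.5, 0.35, 0.4]])
print(new_candidates[np.argsort(-ranker.predict(new_candidates))])
```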
Why do organizations still opt for classical LTR?

1. **Flexibility**: Feature-rich pipelines (e.g., user engagement, popularity signals) go beyond text relevance alone.
2. **Scalability**: Tree-based or boosted LTR can handle tens of millions of candidates without requiring heavy GPU inference.
3. **Interpretability**: Feature importances are often easier to visualize and debug than hidden neural layers.

However, *training and applying classical LTR on very large candidate sets* can be expensive too—especially if multiple features require complex extraction for each document. Thus, in many systems, LTR might be used at an earlier stage (or mid-stage) to narrow down from thousands of documents to hundreds, before applying a final **neural** re-ranker (e.g., Cross-Encoder, LLM).

Ultimately, **traditional LTR** and **neural re-rankers** are not mutually exclusive. It’s common to see them **combined**—a classical LTR model uses a variety of features (including neural embeddings) for a broad re-ranking pass, and then a cross-encoder or listwise LLM re-ranker polishes the top results. This layered approach balances **efficiency, interpretability, and accuracy**—particularly in mission-critical or business-focused retrieval pipelines.

### 2.4 Model Types

There are many types of models that can be used for re-ranking. Here's an overview of the model types, from Answer.AI:

![overview_v2 (1)](https://hackmd.io/_uploads/ryHq9wBNkx.png)

Here we can see many model types, so let's look at each one in turn.

### Listwise Methods

Listwise methods typically use LLMs. T5 models are also LLMs, but they use an encoder-decoder architecture, unlike LLMs like GPT models, which are decoder-only.

One simple method of reranking is to do it over an API: pass a set of documents, and ask an LLM to output a comma-separated ordering. The advantage of this is that it's extremely flexible--you can pass your own prompts and criteria. The disadvantage is that it's slow, these models are very bulky, and it's expensive.

#### RankGPT

Another challenge is that only so many documents will fit in a context window. The solution for this is using a sliding window, as introduced in the RankGPT paper. The window size and the number of documents per window become hyperparameters:

```python=
from rank_gpt import sliding_windows

# `item` is a dict holding the query and its candidate hits, as in the RankGPT repo's examples
api_key = "Your OPENAI Key"
new_item = sliding_windows(item, rank_start=0, rank_end=3, window_size=2, step=1, model_name='gpt-3.5-turbo', api_key=api_key)
print(new_item)
```

This is typically done when latency is not a concern, and we really want a human-ranker-quality ordering.

#### RankZephyr

RankZephyr was trained using a two-stage instruction fine-tuning process. In Stage 1, it learned from ranked lists generated by GPT-3.5 using BM25-retrieved passages, with data augmented by shuffled inputs to enhance robustness.
Stage 2 involved fine-tuning on a smaller set of 5K queries ranked by GPT-4, using random and discriminative sampling to maximize diversity while maintaining efficiency. The model was trained with variable window sizes and hard negatives, enabling it to generalize across different input lengths and scenarios. This approach resulted in a highly effective 7B-parameter open-source model that rivals or surpasses proprietary models like GPT-4 in zero-shot listwise reranking tasks.

Distilled ranking LLMs are getting much better and faster, and this strategy might be the SOTA in a few months, beating cross-encoders as the distilled models become smaller.

#### ListT5/LiT5

These are models similar to RankZephyr, but built on the T5 architecture. It seems very few people are downloading T5-based reranking models from HF, so they are likely not the ideal choice for production.

### Pointwise Methods

#### MonoT5

MonoT5 is a pointwise document ranking model based on T5, which evaluates query-document pairs independently by generating textual relevance labels (e.g., "relevant" or "not relevant"). This is essentially a cross-encoder built on the T5 architecture. MonoT5 is trained on labeled datasets like MS MARCO to optimize relevance predictions and evaluates documents in isolation, which limits its ability to consider inter-document relationships. I'd need to see some very strong evidence to use this over cross-encoders, which likely perform much better in most scenarios. The advantage of MonoT5 models is that the smallest has 220M parameters, and even MonoT5 3B is smaller than models distilled from LLMs.

#### InRanker-3B

This is a distilled version of MonoT5 for out-of-domain scenarios. It seems to be getting little action on HF.

#### RankT5

RankT5 is a T5-based model designed for text ranking, capable of directly outputting numerical ranking scores for query-document pairs, rather than relying on token classification. It introduces two model structures—encoder-decoder and encoder-only—and fine-tunes them using ranking losses like listwise softmax, significantly improving performance compared to traditional T5 adaptations like MonoT5. RankT5 achieves superior results on datasets like MS MARCO and Natural Questions, with better zero-shot generalization to out-of-domain datasets, emphasizing its robustness and effectiveness for ranking tasks. Despite this, it only has 1k downloads from HF at the time of writing.

#### Layerwise Rerankers

A layerwise reranker takes a query and a passage, processes them through a transformer model, and computes relevance scores using outputs from selected layers (e.g., layer 28). Each layer captures distinct features, from surface-level syntax to deep semantic relationships, which can be aggregated or optimized for task-specific objectives, such as ranking passages in information retrieval. By aligning intermediate outputs with these objectives through custom training, layerwise rerankers achieve a balance between performance and computational efficiency, leveraging more nuanced representations than single-layer models. Techniques like cross-layer attention, weighted aggregation, and contrastive learning enhance the adaptability and precision of these models across diverse tasks and domains.

For example, bge-reranker-v2-minicpm-layerwise is an effective layerwise reranker, striking a middle ground between traditional cross-encoders and LLM-based reranking approaches. By leveraging outputs from multiple transformer layers, layerwise rerankers are a robust solution for ranking tasks, optimizing both generalization and task-specific performance.
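To illustrate the idea (not the bge implementation itself), the sketch below scores a query-passage pair from an intermediate hidden layer of a vanilla BERT model. The layer index and the linear scoring head are purely illustrative assumptions—the head is untrained here, whereas a real layerwise reranker learns layer-specific heads during fine-tuning:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

# Encode the query and passage jointly, as a cross-encoder would.
inputs = tokenizer("affordable gaming laptop",
                   "budget-friendly laptops for gamers",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states[0] is the embedding layer; hidden_states[8] is the output of layer 8.
intermediate = outputs.hidden_states[8][:, 0]  # [CLS] representation at layer 8

# An (untrained) linear head standing in for the learned layer-specific scoring head.
score_head = torch.nn.Linear(intermediate.shape[-1], 1)
print(score_head(intermediate).item())
```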
I expect these to become much better, and more popular, in the coming months.

![image](https://hackmd.io/_uploads/SkxREOBVye.png)

### 2.5 Performance Trade-offs

The selection of a re-ranking architecture necessitates careful consideration of multiple competing factors in real-world deployments. Latency versus accuracy forms a fundamental trade-off that shapes architectural decisions. Cross-encoders provide superior accuracy but demand significant computational resources and time. Late interaction models strike a middle ground, while simpler architectures prioritize speed over sophisticated relevance modeling. LLMs can provide better accuracy than cross-encoders for judgment-based ranking tasks, but are the slowest and most expensive.

Typically, we have certain latency and memory requirements, and we want as much accuracy as possible within them. We figure out what kind of accuracy we need and try to fit inside our latency constraints. In practice, this looks like: if we need extremely high accuracy, we will need to sacrifice some latency, because we need either a cross-encoder or an LLM step, or both. But if we don't need very high accuracy, then we might not need a re-ranking stage at all, or we need only a very simple re-ranking stage. This also depends on scale, because as our scale increases, the number of re-ranking stages tends to increase (later stages are more expensive). It can also be the case that our reranking stage is very simple and cheap: for example, we search with BM25 and then rerank with vector embeddings, which can be done pretty much instantly. The only time complexity needs to be introduced--and an expensive cross-encoder or storing ColBERT representations is a complexity--is when we need it to improve accuracy.

One thing to note is that in RAG, most of the latency comes from the LLM inference step, so we can usually endure a bit more latency than usual in our search step to make sure the generated answer isn't hallucinated.

Memory management presents another crucial consideration in production systems. Precomputed representations accelerate inference but can consume valuable memory resources, especially in the case of ColBERT, which is why ColBERT representations should typically be stored on disk if possible. Various optimization techniques such as quantization, pruning, and caching help manage this trade-off, both for ColBERT and bi-encoders. The effectiveness of these techniques varies by architecture and use case, requiring careful empirical validation.

Production deployments often implement multiple re-ranking stages, creating a progressive refinement pipeline. Initial stages use lightweight models to quickly filter obvious non-matches, while later stages employ more sophisticated models on increasingly smaller candidate sets. This approach optimizes resource utilization while maintaining result quality.

## 3. The Re-Ranking Ecosystem

### 3.1 Commercial Providers

<!-- (Market landscape diagram: A quadrant visualization showing provider positioning along axes of "Performance" and "Enterprise Readiness." Each provider represented by a bubble sized according to market presence. Connecting lines show technological relationships and partnerships. Small icons indicate key strengths and specializations.) -->

The commercial re-ranking landscape has evolved rapidly, with several key players emerging to offer distinctive solutions.

#### Cohere

Cohere leads the field with exceptional multilingual capabilities, supporting over 100 languages while maintaining consistent performance across business-critical languages. Their latest model, rerank-3.5, demonstrates remarkable improvements over traditional search methods. These benchmarks should always be taken with a grain of salt, and the benchmark below has a few issues: for one, reranking is typically done on the top-k candidates, so comparing it directly to hybrid/dense/BM25 retrieval doesn't make too much sense. And the reranking strategy used for hybrid search isn't specified--although it's likely RRF. Reading reranking benchmark results is challenging, as there are a lot more charts than reproducible and standardized benchmarks. Still, it shows that Cohere's reranker is a major player.

![image](https://hackmd.io/_uploads/HJR4kVSN1l.png)

Source https://cohere.com/blog/rerank-3pt5
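Integrating it is only a few lines. Here is a rough sketch with the Cohere Python SDK, using the rerank-english-v3.0 model name that appears in the KDB.AI example earlier; check Cohere's docs for the current client class and model identifiers, as both evolve:

```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

docs = [
    "Reranking reorders retrieved candidates with a more precise relevance model.",
    "Capybaras are the largest living rodents.",
    "Cross-encoders score each query-document pair jointly.",
]

# Rerank the candidate documents against the query and keep the top 2.
response = co.rerank(
    model="rerank-english-v3.0",
    query="What does a reranker do?",
    documents=docs,
    top_n=2,
)
for result in response.results:
    print(result.index, result.relevance_score)
```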
#### Voyage AI

Voyage AI has established itself through a focus on efficiency and performance. The Voyage rerank-2 model stands out for having a context window of 8k tokens.

![image](https://hackmd.io/_uploads/rkcwOVSEJe.png)

Source https://blog.voyageai.com/2024/09/30/rerank-2/

#### MixedBread

![image](https://hackmd.io/_uploads/Bye1xPNSVyl.png)

MixedBread takes a unique approach, emphasizing open-source innovation while maintaining commercial-grade performance. Their mxbai-rerank-large-v1 model is extremely impressive, but what really sets them apart is their open nature--their models are available on Hugging Face and can be run locally very easily:

```python
from sentence_transformers import CrossEncoder

# Load the model; here we use the xsmall variant
model = CrossEncoder("mixedbread-ai/mxbai-rerank-xsmall-v1")

# Example query and documents
query = "Who wrote 'To Kill a Mockingbird'?"
documents = [
    "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
    "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
    "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
    "Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
    "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
    "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
]

# Let's get the scores
results = model.rank(query, documents, return_documents=True, top_k=3)
```

Interestingly, their most downloaded model is actually their smallest, mxbai-rerank-xsmall-v1, suggesting that people deploying models on-prem are prioritizing throughput and latency over accuracy.

![image](https://hackmd.io/_uploads/ryP2DEH4yx.png)

#### Together AI

Unlike the previous models in this section, Together AI's reranker is not a cross-encoder, but rather an LLM fine-tuned for ranking.
Salesforce trained LlamaRank, a Llama 3 8B model fine-tuned on a ranking dataset where documents were scored on relevance:

- 0.9 — Highly Relevant
- 0.8 ~ 0.7 — Relevant
- 0.6 ~ 0.5 — Somewhat Relevant
- 0.4 ~ 0.3 — Marginally Relevant
- 0.2 ~ 0.1 — Slightly Relevant
- ~ 0.0 — Irrelevant

The result is an LLM that can score documents, where the score implies relevancy, just like a cross-encoder. Note that while an LLM can do this out of the box, getting the scores to match the expected score on smaller models is a challenge, and that's what LlamaRank attempted to do. It works pointwise, just like a cross-encoder. Together AI then partnered with Salesforce to serve this model over an API.

#### Nvidia

NVIDIA's recent reranking model applied a similar approach to LlamaRank. Rerank-qa-mistral-4b adapts a Mistral 7B model by:

1. Pruning layers to reduce model size (e.g., keeping only the bottom 16 layers).
2. Modifying the attention mechanism to be bi-directional, allowing tokens to attend both left and right for deeper query-passage interactions.
3. Fine-tuning on a ranking dataset, which notably featured hard negatives (e.g., irrelevant documents that might seem relevant but are not).

The result is a small 4B model that can score documents extremely effectively. The model can be deployed on brev.dev (a GPU rental platform acquired by NVIDIA) with one click, or used over the API here: https://build.nvidia.com/nvidia/rerank-qa-mistral-4b

![image](https://hackmd.io/_uploads/SJzgRNrN1l.png)

Source https://arxiv.org/pdf/2409.07691

#### Jina AI

Jina AI released an excellent cross-encoder reranker: jinaai/jina-reranker-v2-base-multilingual. It's open source on HuggingFace and available through an API. Not much to say about this reranker, other than it's extremely popular, with almost 100k downloads on HuggingFace last month. They also have an excellent ColBERT model, which might be the only ColBERT model available on the AWS marketplace:

![image](https://hackmd.io/_uploads/S1jjHHBEJg.png)

#### Note

One easy way to deploy many of the above is the AWS/Azure marketplaces.

### 3.2 Open Source Models

While many of the above are open source, there are many more cross-encoders available. One easy way to find the right one for your use case is to just search "reranker" or "cross-encoder" on HuggingFace.

![image](https://hackmd.io/_uploads/BJbLUHrVke.png)

As you can see here, some of the most popular cross-encoders by far are the bge-rerankers, which are small and accurate. There are also specialized rerankers, such as hotchpotch's Japanese-specific reranker. HuggingFace search isn't perfect; for example, searching "reranker" missed some excellent models from the SentenceTransformers team that the "cross-encoder" query picked up.

![image](https://hackmd.io/_uploads/r1n3wrSNyl.png)

The specific best models are constantly changing, so it's good to be able to easily figure out which open-source models are SOTA at any given moment.

### 3.3 Libraries of Note

#### Answer.AI Reranker Library

https://github.com/AnswerDotAI/rerankers

This library's goal is to be a simple, unified interface for interacting with both open-source and commercially available re-rankers. It supports lots of different reranking options, many that aren't even mentioned in this ebook! It's maintained by Answer.AI, an R&D firm that has trained many interesting ColBERT models.

![image](https://hackmd.io/_uploads/HyCYuHrVJg.png)
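A rough sketch of what that looks like in practice, based on the library's README-style usage; treat the exact model string and arguments as assumptions that may change between versions:

```python
from rerankers import Reranker

# Swap the model string to switch back-ends (cross-encoders, ColBERT, API rerankers, ...)
ranker = Reranker("mixedbread-ai/mxbai-rerank-xsmall-v1", model_type="cross-encoder")

results = ranker.rank(
    query="Who wrote 'To Kill a Mockingbird'?",
    docs=[
        "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960.",
        "'Moby-Dick' was written by Herman Melville.",
    ],
)
print(results.top_k(1))
```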
#### Text Embeddings Inference (TEI)

https://github.com/huggingface/text-embeddings-inference

This library is built in Rust to optimize cross-encoder and embedding inference as much as possible. It's a bit bloated if you only want to deploy a cross-encoder, but if your focus is throughput and latency, this is a great option.

#### FastEmbed

https://github.com/qdrant/fastembed?tab=readme-ov-file#-rerankers

FastEmbed, a library for fast embedding inference, also supports reranking. At the time of this writing, there is an open issue to support FastEmbed's reranking capability in Answer.AI's rerankers library.

```python=
from fastembed.rerank.cross_encoder import TextCrossEncoder

query = "What is the capital of the United States?"
documents: list[str] = [
    "The capital of the United States is Washington, D.C.",
    "Senator Mitt Romney is from Massachusetts.",
]

encoder = TextCrossEncoder(model_name="Xenova/ms-marco-MiniLM-L-6-v2")
scores = list(encoder.rerank(query, documents))
```

#### FlashRank

https://github.com/PrithivirajDamodaran/FlashRank

FlashRank is a minimalistic reranking library, similar to Answer.AI's, that supports both listwise and pairwise rerankers. Here are some of the models supported:

![image](https://hackmd.io/_uploads/BJe1wdHVkx.png)

#### Langchain/LlamaIndex/RAG libraries

End-to-end RAG libraries like LangChain and LlamaIndex typically integrate reranking deeply by offering multiple external providers. However, reranking takes so little code that the extra layer of abstraction these libraries add is only worth it if your pipeline is already deeply LangChain/LlamaIndex-y.

### 3.4 How to Choose a Reranking Provider

The best way to choose a re-ranking provider is to take your existing search pipeline (which may include hybrid search, full-text search, dense search, etc.) and add re-ranking to the end. Then try a lot of different providers on your evals to see which ones perform best. The advantage of re-rankers is that they're kind of "plug and play," especially if they're the last step of your search pipeline. So what you could do is try all of them--it's not that difficult to try them all on your own evals, especially if you are using a framework like Answer.AI's library.

If you don't have any evals, that's where it gets tricky. Not only is it challenging to find the re-ranker that performs best, but it's also challenging to know the right k for your data (the number of documents to retrieve). The solution to this is to build at least a small set of evals. And keep in mind that while adding a reranker will typically improve performance in most cases, there's no need to do it naively--it's quite possible that your re-ranker isn't actually improving your pipeline's accuracy enough to justify the rise in latency.

### 3.5 Benchmarking Q&A Re-Rankers in Practice

A recent paper—*Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, Fine-tuning and Deploying Rerankers for RAG (arXiv:2409.07691v1, 2024)*—offers a concise look at how multi-stage pipelines perform on **Q&A** tasks. While it doesn’t survey every re-ranker (the landscape is huge!), it provides a practical **kickoff point** for those setting up a Q&A retrieval stack.

**Key Datasets & Rationale**

Because **Q&A** is central to many RAG systems, the authors focus on **NQ**, **HotpotQA**, and **FiQA** from the BEIR benchmark.
### 3.5 Benchmarking Q&A Re-Rankers in Practice

A recent paper—*Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, Fine-tuning and Deploying Rerankers for RAG (arXiv:2409.07691v1, 2024)*—offers a concise look at how multi-stage pipelines perform on **Q&A** tasks. While it doesn't survey every re-ranker (the landscape is huge!), it provides a practical **kickoff point** for those setting up a Q&A retrieval stack.

**Key Datasets & Rationale**

Because **Q&A** is central to many RAG systems, the authors focus on **NQ**, **HotpotQA**, and **FiQA** from the BEIR benchmark. These datasets are pre-chunked into short passages, so each question can be tested against a realistic set of candidates—exactly how RAG systems operate.

**Evaluated Models**

- **Embedding Models**: Three commercially usable retrievers (Snowflake's Arctic-Embed-L, NV-EmbedQA-e5-v5, and NV-EmbedQA-Mistral7B-v2).
- **Rerankers**: Several cross-encoders, culminating in the new **NV-RerankQA-Mistral-4B-v3**—a 4B-parameter model fine-tuned with contrastive InfoNCE.

**What They Found**

1. **Smaller** embedders still do great if paired with a **powerful** re-ranker.
2. Very **large** embedders need an equally large re-ranker to see further boosts.
3. **NV-RerankQA-Mistral-4B-v3** outperforms all tested baselines by ~14% in **NDCG@10**.

Even though this paper tests only a handful of rerankers, it demonstrates how to systematically measure pipeline accuracy, emphasizes **licensing concerns** (some popular models train on research-only datasets), and shows why multi-stage pipelines often **win** in Q&A scenarios.

![image](https://hackmd.io/_uploads/HyISaBiIke.png)
Source https://arxiv.org/html/2409.07691v1

## 4. Choosing Your Re-Ranking Strategy

### 4.1 Use Case Analysis

![rerankers_map (1)](https://hackmd.io/_uploads/rJQbnUH4kg.jpg)
https://www.answer.ai/posts/2024-09-16-rerankers.html

Choosing the right reranking method can be extremely intimidating. 80% of the time, the right option is likely to go with a cross-encoder and move on. But the other 20% of the time, some serious thought needs to be put into the right re-ranking strategy. Luckily, Answer.AI came out with an incredible flowchart to help you make the right decision, which I will shamelessly steal.

One thing this flowchart misses is simply using an LLM API like Gemini/OpenAI for reranking, which is priced around 10x+ higher than Cohere rerank but can massively outperform it on challenging reranking tasks that judge on complex criteria. Also, latency isn't necessarily prohibitive for LLM reranking, as long as the candidates fit into one context window, especially if we use fast providers like Cerebras/Groq, or any provider with a fast Time To First Token (because the number of tokens generated is typically low). I also think that this flowchart gives some less useful models, like MonoT5/InRanker-3B, too much credit. But otherwise, I'd look closely at this flowchart when making a decision.
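To make the LLM-API option above concrete, here is a minimal listwise-reranking sketch. It assumes an OpenAI-compatible chat endpoint and an illustrative model name; the prompt and parsing are deliberately naive, and a production version would want stricter output validation.

```python=
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any OpenAI-compatible endpoint works

def llm_listwise_rerank(query: str, docs: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Ask the LLM to re-order all candidates at once and return them best-first."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs))
    prompt = (
        "Rank the following passages by how well they answer the query.\n"
        f"Query: {query}\n\nPassages:\n{numbered}\n\n"
        "Return only the passage indices, most relevant first, comma-separated."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Naive parsing: pull out indices in the order the model produced them.
    order = [int(tok) for tok in re.findall(r"\d+", response.choices[0].message.content)]
    seen: list[int] = []
    for i in order:
        if 0 <= i < len(docs) and i not in seen:
            seen.append(i)
    # Anything the model didn't mention keeps its first-stage position at the end.
    remainder = [i for i in range(len(docs)) if i not in seen]
    return [docs[i] for i in seen + remainder]
```

Because the model only emits a short list of indices, the output token count (and therefore a big chunk of the latency) stays small, which is why Time To First Token dominates in practice.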
Selecting an appropriate re-ranking strategy requires careful analysis of your specific use case and constraints. Search applications vary dramatically in their requirements, from e-commerce platforms requiring sub-second response times to academic search engines prioritizing recall and precision. At a high level, a larger model means higher latency.

Scale considerations form the foundation of strategy selection. Systems handling millions of queries per day across billions of documents face fundamentally different challenges than specialized search applications operating on smaller, domain-specific collections. The initial retrieval stage must efficiently reduce the candidate set to a manageable size for re-ranking while maintaining high recall.

Latency requirements significantly impact architectural choices. Consumer-facing applications typically require response times under 200ms for the entire search pipeline, leaving perhaps 50-100ms for re-ranking. Enterprise applications might tolerate higher latencies in exchange for improved accuracy, particularly for specialized use cases like legal or medical search.

Domain specificity influences both model selection and training requirements. General-purpose re-rankers often struggle with specialized terminology or domain-specific relevance criteria. Financial services, for instance, might require precise handling of numerical data and regulatory terms, while e-commerce applications need to balance relevance with business metrics like conversion probability. In the same way, if you need to rerank non-English texts, you might need a multilingual reranker, or even a reranker specific to your language!

## 5. Future of Re-Ranking

### 5.1 Emerging Trends

I think strategies like layerwise models and LLM-based models will become more popular for re-ranking in the near future. Layerwise models are efficient because they keep only the layers that contribute most to re-ranking quality, so they stay reasonable in size while still providing strong re-ranking performance. LLMs, meanwhile, are extremely good at re-ranking against specific criteria that go beyond simple relevance.

I also think that cross-encoders will become more sophisticated. One disadvantage of cross-encoders is that they aren't listwise: they score each document in isolation and therefore miss information that is available when you rank the whole list together. That's fundamentally a loss. One way to address this is to let the model attend to the other candidate documents while scoring each one. This is effectively what happens when you pass all the documents into an LLM and tell it to re-rank. But we could also use attention to embed the entire candidate set in some way, and then do pointwise classification conditioned on that set-level embedding. This is similar to what late chunking, a strategy that incorporates attention over the entire document when creating chunk embeddings, did for chunking.

### 5.2 Multimodal Re-Ranking

We can already do multimodal reranking. One effective approach is ColPali, ColBERT's image-focused sibling. In ColPali, images are split into patches, the patches are embedded, and late interaction is performed between the user query's tokens and the image-patch tokens. This is state-of-the-art for searching PDFs at the moment. Multimodal LLMs can also incorporate images in their re-ranking, so what will likely happen in the future is that it'll be very easy to incorporate the images in your documents when you're doing a re-ranking step.
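For intuition, here is a tiny sketch of the late-interaction (MaxSim) scoring that ColBERT-style and ColPali-style models share: each query token embedding is matched against its best document-side embedding (text tokens for ColBERT, image patches for ColPali), and those maxima are summed. The shapes and random tensors are purely illustrative.

```python=
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score.

    query_emb: (num_query_tokens, dim)
    doc_emb:   (num_doc_tokens_or_patches, dim)
    """
    # Normalize so dot products become cosine similarities.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.T  # (num_query_tokens, num_doc_tokens_or_patches)
    # For each query token, keep its best-matching document-side embedding, then sum.
    return sim.max(dim=-1).values.sum()

# Illustrative shapes only: 8 query tokens, 1024 image patches, 128-dim embeddings.
query_emb = torch.randn(8, 128)
patch_emb = torch.randn(1024, 128)
print(maxsim_score(query_emb, patch_emb))
```

The same scoring function works whether the document side holds text-token embeddings or image-patch embeddings, which is exactly why the late-interaction idea transfers so cleanly to multimodal search.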
## 6. Conclusion

Re-ranking is a critical step in modern search systems, especially when precision and relevance matter. Different approaches each have their strengths and trade-offs:

- **Cross-Encoders**: Ideal for fine-grained text understanding and high-precision ranking of smaller candidate sets. Very easy to add to existing search pipelines, and can be fine-tuned for sizable accuracy improvements.
- **Late Interaction Models (e.g., ColBERT)**: A middle-ground option that leverages precomputed document embeddings but applies more sophisticated term-by-term matching at query time. These models strike a balance between efficiency and nuanced semantic matching, but can require larger storage and more careful optimization. For larger datasets, very aggressive quantization is often required.
- **Listwise/LLM-Based Methods**: The most flexible and capable of advanced reasoning or complex criteria, but often the slowest and most expensive to run at scale. Suitable for scenarios where context-rich judgment is crucial, such as specialized or high-stakes domains, and where latency is less of a constraint. That said, it's possible to perform listwise reranking at under 350ms of latency with LLMs optimized for reduced Time To First Token.

Ultimately, the best choice depends on your scale, latency constraints, and the complexity of your domain. High-traffic pipelines typically combine a fast approximate retrieval stage with a more advanced re-ranking step on a smaller set of candidates, balancing efficiency and accuracy.

I hope this ebook has been helpful in understanding the challenges and opportunities of re-ranking in the modern search landscape. If you have any questions or want to connect, please let me know!