# Retrieval Evaluation
###### tags: `Information Retrieval and Extraction`
## The Cranfield Paradigm
## Precision and Recall

Let $R_{q1}$ be the set of relevant docs for a query $q_1$ : $R_{q1} = \{d_3, d_5, d_9, d_{25}, d_{39}, d_{44}, d_{56}, d_{71}, d_{89}, d_{123}\}$


Consider now a second query $q_2$, $R_{q2} = \{d_3, d_{56}, d_{129}\}$
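As a small illustration of how precision and recall are computed at each position of an answer ranking, the sketch below walks down a hypothetical ranking for $q_1$. Only the relevant set $R_{q1}$ above comes from the example; the ranked list itself is made up for illustration.

```python
# Sketch: precision and recall after each position of a ranking.
# The relevant set R_q1 comes from the example above; the ranked list
# below is hypothetical and only illustrates the computation.
relevant_q1 = {3, 5, 9, 25, 39, 44, 56, 71, 89, 123}
ranking = [123, 84, 56, 6, 8, 9, 511, 129, 187, 25]  # hypothetical answer list

hits = 0
for i, doc in enumerate(ranking, start=1):
    if doc in relevant_q1:
        hits += 1
    precision = hits / i              # relevant docs seen / docs seen so far
    recall = hits / len(relevant_q1)  # relevant docs seen / |R_q1|
    print(f"rank {i:2d}: d{doc:<3d}  P={precision:.2f}  R={recall:.2f}")
```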


### Remark
Average precision-recall curves are normally used to compare the performance of distinct IR algorithms

### Single Value Summaries
### P@5 and P@10
### Mean Average Precision
### R-Precision
For the query q1, $R = |R_{q1}| = 10$ and there are 4 relevant documents among the top 10 positions of the ranking. Thus, the R-Precision value for this query is $4/10 = 0.4$
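In general, writing $r$ for the number of relevant documents found among the top $R$ positions of the ranking, where $R = |R_q|$ is the total number of relevant documents for the query ($r$ is just notation used here), R-Precision can be written as
\begin{equation}
\text{R-Precision} = \frac{r}{R}
\end{equation}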
#### Precision Histograms
Let
* $RP_A(i)$ : R-precision for algorithm A for the i-th query
* $RP_B(i)$ : R-precision for algorithm B for the i-th query
\begin{equation}
RP_{A/B}(i) = RP_A(i) − RP_B(i)
\end{equation}

In this example, algorithm A performs better for 8 of the 10 queries, while algorithm B performs better for the other 2 queries
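The $RP_{A/B}(i)$ values are usually visualized as a precision histogram with one bar per query. Below is a minimal sketch; the R-precision values are made up, chosen so that A wins on 8 of the 10 queries as in the statement above. Plotting `diffs` as a bar chart gives the histogram.

```python
# Precision histogram sketch: one value per query, RP_A(i) - RP_B(i).
# Positive values favor algorithm A, negative values favor algorithm B.
# The R-precision values below are hypothetical.
rp_a = [0.4, 0.6, 0.5, 0.7, 0.3, 0.8, 0.6, 0.5, 0.2, 0.4]
rp_b = [0.3, 0.5, 0.6, 0.4, 0.2, 0.5, 0.4, 0.3, 0.4, 0.3]

diffs = [a - b for a, b in zip(rp_a, rp_b)]
for i, d in enumerate(diffs, start=1):
    winner = "A" if d > 0 else ("B" if d < 0 else "tie")
    print(f"query {i:2d}: RP_A/B = {d:+.2f}  -> {winner}")
```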
### MRR: Mean Reciprocal Rank
1. Question-Answering (QA) systems
2. Search engine queries that look for specific sites
* URL queries
* Homepage queries
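In these tasks the user typically wants a single correct answer, so what matters is the position of the first relevant result. The reciprocal rank of a query is the inverse of that position, and MRR averages it over the $N_q$ test queries (writing $rank_j$ for the position of the first relevant answer for query $q_j$):
\begin{equation}
MRR = \frac{1}{N_q} \sum_{j=1}^{N_q} \frac{1}{rank_j}
\end{equation}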
## Discounted Cumulated Gain
Traditional precision and recall allow only binary relevance assessments. **Discounted cumulated gain** (DCG) is a metric that effectively combines graded relevance assessments.
Consider that the results of the queries are graded on a scale 0–3 (0 for non-relevant docs, 3 for highly relevant docs). For instance, consider that graded relevance scores have been assigned to the documents relevant to queries q1 and q2.

Considering the top 15 docs in the ranking produced for
queries q1 and q2, the gain vectors for these queries are:
\begin{equation}
G_1 = (1,0,1,0,0,3,0,0,0,2,0,0,0,0,3) \\
G_2 = (0,0,2,0,0,0,0,1,0,0,0,0,0,0,3)
\end{equation}
By summing up the graded scores up to any point in the ranking, we obtain the cumulated gain (CG). For query q1, for instance, the cumulated gain at the first position is 1, at the second position is 1+0, third is 1+0+1 and so on.
\begin{equation}
CG_1 = (1,1,2,2,2,5,5,5,5,7,7,7,7,7,10) \\
CG_2 = (0,0,2,2,2,2,2,3,3,3,3,3,3,3,6)
\end{equation}
Given the gain vector $G_j$ for a test query $q_j$, the $CG_j$ associated with it is defined as
\begin{equation}
CG_j[i] =
\begin{cases}
G_j[1] & \text{if } i = 1 \\
CG_j[i-1] + G_j[i] & \text{otherwise}
\end{cases}
\end{equation}
where $CG_j[i]$ refers to the cumulated gain at the $i$-th position of the ranking for query $q_j$
<br>
We also introduce a discount factor that reduces the impact of the gain as we move down the ranking. If we consider logs in base 2, this discount factor will be $\log_2 2$ at position 2, $\log_2 3$ at position 3, and so on
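With this base-2 discount, the discounted cumulated gain for query $q_j$ can be written recursively as (no discount is applied at the first position; this form is consistent with the $DCG_1$ and $DCG_2$ vectors below):
\begin{equation}
DCG_j[i] =
\begin{cases}
G_j[1] & \text{if } i = 1 \\
DCG_j[i-1] + \frac{G_j[i]}{\log_2 i} & \text{otherwise}
\end{cases}
\end{equation}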

For the example queries $q_1$ and $q_2$, the DCG vectors are given by
\begin{equation}
DCG_1 = (1.0, 1.0, 1.6, 1.6, 1.6, 2.8, 2.8, 2.8, 2.8, 3.4, 3.4, 3.4, 3.4, 3.4, 4.2) \\
DCG_2 = (0.0, 0.0, 1.3, 1.3, 1.3, 1.3, 1.3, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 2.4)
\end{equation}
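The following sketch computes CG and DCG directly from the gain vectors; rounding the results to one decimal reproduces the $CG$ and $DCG$ vectors shown above.

```python
import math

def cumulated_gain(gains):
    """Running sum of the graded relevance scores (CG)."""
    cg, total = [], 0
    for g in gains:
        total += g
        cg.append(total)
    return cg

def discounted_cumulated_gain(gains):
    """DCG with a base-2 log discount; no discount at position 1."""
    dcg, total = [], 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i == 1 else g / math.log2(i)
        dcg.append(total)
    return dcg

G1 = [1, 0, 1, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 3]
G2 = [0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3]

print(cumulated_gain(G1), cumulated_gain(G2))                # CG_1, CG_2
print([round(x, 1) for x in discounted_cumulated_gain(G1)])  # ~ DCG_1
print([round(x, 1) for x in discounted_cumulated_gain(G2)])  # ~ DCG_2
```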
### DCG Curves
To produce CG and DCG curves over a set of test queries, we need to average them over all queries. Given a set of $N_q$ queries, the average $\overline{CG}[i]$ and $\overline{DCG}[i]$ over all queries are computed as follows:
\begin{equation}
\overline{CG}[i] = \frac{1}{N_q} \sum_{j=1}^{N_q} CG_j[i]
\qquad
\overline{DCG}[i] = \frac{1}{N_q} \sum_{j=1}^{N_q} DCG_j[i]
\end{equation}
For instance, for the example queries q1 and q2, these averages are given by
\begin{equation}
\overline{CG} = (0.5, 0.5, 2.0, 2.0, 2.0, 3.5, 3.5, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0) \\
\overline{DCG} = (0.5, 0.5, 1.5, 1.5, 1.5, 2.1, 2.1, 2.2, 2.2, 2.5, 2.5, 2.5, 2.5, 2.5, 3.3)
\end{equation}

### Ideal CG and DCG Metrics
CG and DCG scores, as defined above, are not computed relative to any baseline. This implies that it might be confusing to use them directly to compare two distinct retrieval algorithms
For a given test query q, assume that the relevance assessments made by the specialists produced :
* $n_3$ documents evaluated with a relevance score of 3
* $n_2$ documents evaluated with a relevance score of 2
* $n_1$ documents evaluated with a score of 1
* $n_0$ documents evaluated with a score of 0
The ideal gain vector IG is created by sorting all relevance scores in decreasing order, as follows:
\begin{equation}
IG = (3,...,3,2,...,2,1,...,1,0,...,0)
\end{equation}
For instance, for the example queries q1 and q2, we have
\begin{equation}
IG_1 = (3,3,3,2,2,2,1,1,1,1,0,0,0,0,0) \\
IG_2 = (3,2,1,0,0,0,0,0,0,0,0,0,0,0,0) \\
ICG_1 = (3,6,9,11,13,15,16,17,18,19,19,19,19,19,19) \\
ICG_2 = (3,5,6,6,6,6,6,6,6,6,6,6,6,6,6) \\
IDCG_1 = (3.0, 6.0, 7.9, 8.9, 9.8, 10.5, 10.9, 11.2, 11.5, 11.8, 11.8, 11.8, 11.8, 11.8, 11.8) \\
IDCG_2 = (3.0, 5.0, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6) \\
\overline{ICG} = (3.0, 5.5, 7.5, 8.5, 9.5, 10.5, 11.0, 11.5, 12.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5) \\
\overline{IDCG} = (3.0, 5.5, 6.8, 7.3, 7.7, 8.1, 8.3, 8.4, 8.6, 8.7, 8.7, 8.7, 8.7, 8.7, 8.7)
\end{equation}
### Normalized DCG
Precision and recall figures can be directly compared to the ideal curve of 100% precision at all recall levels. DCG figures, however, are not built relative to any ideal curve, which makes it difficult to directly compare DCG curves for two distinct ranking algorithms
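To address this, the average CG and DCG curves can be normalized by the corresponding ideal curves (this is consistent with the $NCG$ and $NDCG$ values below, e.g. $0.5/3.0 = 0.17$ at position 1):
\begin{equation}
NCG[i] = \frac{\overline{CG}[i]}{\overline{ICG}[i]}
\qquad
NDCG[i] = \frac{\overline{DCG}[i]}{\overline{IDCG}[i]}
\end{equation}
For the example queries q1 and q2, the resulting normalized curves are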

\begin{equation}
NCG = (0.17, 0.09, 0.27, 0.24, 0.21, 0.33, 0.32, 0.35, 0.33, 0.40, 0.40, 0.40, 0.40, 0.40, 0.64) \\
NDCG = (0.17, 0.09, 0.21, 0.20, 0.19, 0.25, 0.25, 0.26, 0.26, 0.29, 0.29, 0.29, 0.29, 0.29, 0.38)
\end{equation}
The area under the NCG and NDCG curves represents the quality of the ranking algorithm: the higher the area, the better the results are considered to be.
### Remark
Advantages
* CG and DCG metrics aim at taking into account **multiple level relevance** assessments
* **distinguishing** highly relevant documents from mildly relevant ones
* Cumulated gain provides a single metric of retrieval performance at any position in the ranking
* the metrics are more immune to outliers
* discounted cumulated gain allows down-weighting the impact of relevant documents found late in the ranking

Disadvantages
* multiple level relevance assessments are harder and **more time consuming** to generate
## Binary Preferences
For large collections it is not feasible to judge every document, so the solution is the pooling method
* This method compiles in a pool the top results produced by various retrieval algorithms
* only the documents in the pool are evaluated
* The method is reliable and can be used to effectively compare the retrieval performance of distinct systems
Metrics such as precision-recall and P@10 treat all documents that were not judged (i.e., documents outside the pool) as non-relevant, which is a problem when the relevance judgments are incomplete.
One approach to circumvent this problem is to use preference relations
* These are relations of preference between any two documents retrieved, instead of using the rank positions directly

* $J$ : set of all documents judged by the specialists with regard to a given information need
* $R$ : set of docs that were found to be relevant
* $J − R$ : set of docs that were found to be non-relevant
Given an information need $I$, let
* $R_A$: ranking computed by an IR system $A$ relatively to $I$
* $|R|$ : number of relevant documents, i.e., the size of the set $R$
* $s_{A,j}$ : position of document $d_j$ in $R_A$
* $[(J − R) ∧ A]_{|R|}$ : set composed of the first $|R|$ documents in $R_A$ that have been judged as non-relevant.
Define
\begin{equation}
C_j = \Big|\, \big\{\, d_k \in [(J - R) \wedge A]_{|R|} \;:\; s_{A,k} < s_{A,j} \,\big\} \,\Big|
\end{equation}
as a counter of the non-relevant docs (from this set) that appear before $d_j$ in $R_A$ (the name $C_j$ is introduced here just to denote this counter).
Then, the Bpref of ranking $R_A$ is given by
\begin{equation}
Bpref(R_A) = \frac{1}{|R|} \sum_{d_j \in R} \left( 1 - \frac{C_j}{|R|} \right)
\end{equation}
### Remark
Bpref is a stable metric and can be used to compare distinct algorithms in the context of large collections, because
* The weights associated with relevant docs are normalized
* The number of judged non-relevant docs considered is equal to the maximum number of relevant docs
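A minimal sketch of this computation, assuming the ranking and the judged sets are available as plain Python collections (the function name, variable names, and toy data below are illustrative):

```python
def bpref(ranking, relevant, judged):
    """Bpref sketch: `ranking` is the ranked list of doc ids, `relevant`
    the judged-relevant set R, `judged` the full judged set J (R subset of J)."""
    R = len(relevant)
    # First |R| documents in the ranking that were judged non-relevant
    judged_nonrel = [d for d in ranking
                     if d in judged and d not in relevant][:R]
    judged_nonrel_set = set(judged_nonrel)

    score = 0.0
    nonrel_seen = 0  # counter of judged non-relevant docs seen so far
    for d in ranking:
        if d in judged_nonrel_set:
            nonrel_seen += 1
        elif d in relevant:
            score += 1.0 - nonrel_seen / R
    return score / R

# Toy usage with made-up judgments: relevant docs ranked early score higher
print(bpref(ranking=[1, 2, 3, 4, 5, 6],
            relevant={1, 3, 6},
            judged={1, 2, 3, 4, 5, 6}))
```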
## Rank Correlation Metrics
Given two rankings of the same set of documents, we are interested in comparing the relative orderings they produce.
Let $R_1$ and $R_2$ be two such rankings. A rank correlation metric yields a correlation coefficient $C(R_1, R_2)$ with values between $-1$ and $+1$, where $+1$ indicates perfect agreement between the two rankings and $-1$ indicates that they are in reverse order.
## The Spearman Coefficient
The Spearman coefficient is likely the most commonly used rank correlation metric. It is based on the differences between the positions of the same document in two rankings
Let
* $s_{1,j}$ be the position of a document $d_j$ in ranking $R_1$ and
* $s_{2,j}$ be the position of $d_j$ in ranking $R_2$


### Define Spearman Coefficient
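Using the positions $s_{1,j}$ and $s_{2,j}$ defined above, and letting $K$ denote the number of documents appearing in both rankings ($K$ is introduced here just for the formula), the Spearman coefficient can be written as
\begin{equation}
S(R_1, R_2) = 1 - \frac{6 \sum_{j=1}^{K} (s_{1,j} - s_{2,j})^2}{K^3 - K}
\end{equation}
It equals $1$ when the two rankings are identical and $-1$ when one ranking is the reverse of the other.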


### Various

## The Kendall Tau Coefficient
# Reference Collections
## The TREC Collections