# Relevance Feedback and Query Expansion

###### tags: `Information Retrieval and Extraction`

## A Framework of Feedback Methods

#### Instead of asking the user explicitly, the system can:
* look at the documents they have clicked on; or
* look at the terms belonging to the top documents in the result set

### Explicit relevance feedback
* expensive and time consuming
* even a user click only expresses a preference, not a relevance judgement

![](https://i.imgur.com/QhKSL2W.png)

### Implicit relevance feedback

Two basic approaches:
* **local analysis**: derives the feedback information from the top-ranked documents in the result set
* **global analysis**: derives the feedback information from external sources such as a thesaurus

![](https://i.imgur.com/8N3tHzv.png)

## The Rocchio Method

The basic idea of the Rocchio method is to reformulate the query so that it moves:
* closer to the neighborhood of the relevant documents in the vector space
* away from the neighborhood of the non-relevant documents

![](https://i.imgur.com/Phhjmjc.png)
![](https://i.imgur.com/2AdhpU5.png)

#### Advantages
* Simplicity:
    * modified term weights are computed directly from the set of retrieved documents
* Good results:
    * the modified query vector does reflect a portion of the intended query semantics (observed experimentally)
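To make the reformulation concrete, the sketch below implements the classic Rocchio formula $\vec{q}_m = \alpha\vec{q} + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \frac{\gamma}{|D_n|}\sum_{\vec{d}_j \in D_n}\vec{d}_j$. This is a minimal sketch assuming tf-idf weighted `numpy` vectors; the values of `alpha`, `beta`, and `gamma` are common defaults, not values taken from these notes.

```python
import numpy as np

def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the relevant documents and away from
    the non-relevant ones (vector-space Rocchio reformulation)."""
    qm = alpha * np.asarray(q0, dtype=float)
    if relevant:                                   # centroid of relevant docs
        qm = qm + beta * np.mean(relevant, axis=0)
    if non_relevant:                               # centroid of non-relevant docs
        qm = qm - gamma * np.mean(non_relevant, axis=0)
    return np.maximum(qm, 0.0)                     # drop negative term weights

# Toy vocabulary: [information, retrieval, feedback, sports]
q0 = [1.0, 1.0, 0.0, 0.0]
d_rel = [np.array([0.9, 0.8, 0.7, 0.0])]     # judged relevant
d_non = [np.array([0.1, 0.0, 0.0, 0.9])]     # judged non-relevant
print(rocchio(q0, d_rel, d_non))             # weight shifts toward "feedback"
```

Implementations typically also truncate $\vec{q}_m$ to its highest-weighted terms rather than keeping the full vocabulary.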
## Relevance Feedback for the Probabilistic Model

The probabilistic model ranks documents for a query $q$ according to the probabilistic ranking principle.

![](https://i.imgur.com/1igcaHy.png)

where $P(k_i|R)$ and $P(k_i|\bar{R})$ stand for the probability of observing the term $k_i$ in the set $R$ of relevant documents and in the set $\bar{R}$ of non-relevant documents, respectively.

### Estimating the Probabilities with Feedback

Different methods for estimating these probabilities automatically were discussed in Chapter 3. With user feedback information, these probabilities are estimated in a slightly different way.

For the initial search:
* $P(k_i|R)$ is constant for all terms $k_i$ (typically 0.5)
* the term probability distribution $P(k_i|\bar{R})$ can be approximated by the distribution in the whole collection

Then, the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$ can be approximated by

![](https://i.imgur.com/5CPdVkL.png)

Using these approximations, the similarity equation can be rewritten as

![](https://i.imgur.com/05VQhsX.png)

*Contrary to the Rocchio method, **no query expansion occurs**.*

#### Advantage
* derivation of new weights for the query terms

#### Disadvantages
* document term weights are not taken into account during the feedback loop;
* weights of terms in the previous query formulations are disregarded; and
* no query expansion is used (the same set of index terms in the original query is reweighted over and over again)

## Evaluation of Relevance Feedback

Evaluation of $\vec{q}_m$: a standard practice is to measure precision and recall of the modified query over the residual collection (the documents not already shown to the user during feedback), since evaluating over the full collection unfairly favors $\vec{q}_m$, whose feedback documents naturally rank high.

## Explicit Feedback Through Clicks

Clicks reflect preferences for particular answers in the context of a given query, but not relevance.

### Eye Tracking

A fixation is a gaze at a particular area of the screen lasting for 200-300 milliseconds.

### User Behavior

We want to understand how people behave when they search for information.

#### Joachims et al.

Experimental design:
* A test set composed of 10 queries
    * 5 navigational queries
    * 5 informational queries
* Navigational query
    * find a Web page
    * e.g., find the homepage of Michael Jordan, the statistician
* Informational query
    * find information on a topic
    * e.g., find the location of the tallest mountain in New York

Annotation of preference:
* Given a query and its top 10 results, decide on a strict preference ordering for all 10 results
* Each query is evaluated by at least two assessors
* The sets of relevance judgements produced by the human assessors are used to evaluate the quality of the documents clicked on by the users

![](https://i.imgur.com/runLJlv.png)

The users inspect the top 2 answers almost equally, but click three times more often on the first. This may indicate a user bias towards the search engine: **users tend to trust the search engine** to recommend a relevant top result.

### Clicks as a Metric of Preferences

### Clicks within the Same Query
* Given a ranking function $R(q_i, d_j)$, let $r_k$ be the $k$-th ranked result
* Let $\checkmark r_k$ indicate that the user has clicked on the $k$-th result
* Define a preference relation $r_k > r_{k-n}$

#### For example:

![](https://i.imgur.com/cJU3nzt.png)

Two distinct strategies to capture the preference relations in this case are as follows (see the sketch after this list):
* Skip-Above: if $\checkmark r_k$, then $r_k > r_{k-n}$ for every $r_{k-n}$ that was not clicked
    * $r_3 > r_2$; $r_3 > r_1$
* Skip-Previous: if $\checkmark r_k$ and $r_{k-1}$ has not been clicked, then $r_k > r_{k-1}$
    * $r_3 > r_2$
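The two strategies translate directly into code. Below is a minimal Python sketch; it represents the user's clicks as a set of 1-based rank positions, and the function names are ours, not from the notes.

```python
def skip_above(clicked):
    """Skip-Above: a clicked result is preferred over EVERY
    higher-ranked result that was not clicked."""
    return [(k, j) for k in clicked
            for j in range(1, k) if j not in clicked]

def skip_previous(clicked):
    """Skip-Previous: a clicked result is preferred over the result
    ranked immediately above it, if that one was not clicked."""
    return [(k, k - 1) for k in clicked
            if k > 1 and (k - 1) not in clicked]

# The example above: the user clicked only on the third result, r3.
clicked = {3}
print(skip_above(clicked))     # [(3, 1), (3, 2)]  i.e. r3 > r1, r3 > r2
print(skip_previous(clicked))  # [(3, 2)]          i.e. r3 > r2
```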
### Clicks within a Query Chain

#### For example:

Consider two result sets in the same query chain:

![](https://i.imgur.com/wfjywcs.png)

where $r_j$ refers to an answer in the **first result set** and $s_j$ refers to an answer in the **second result set**.

Two distinct strategies to capture the preference relations:

![](https://i.imgur.com/CusghOs.png)

According to the first strategy, the following preferences are produced by the user's click on result $s_2$:

![](https://i.imgur.com/OQBLRxy.png)

According to the second strategy, we have:

![](https://i.imgur.com/9AB634J.png)

## Click-based Ranking

Clickthrough information can be used to improve the ranking: we can train a machine learning model on the click data.

## Query Expansion

Query expansion is another method for increasing recall.

### Types of Query Expansion
* Global analysis: static; derived from all documents in the collection
* Local analysis: dynamic; derived from the documents retrieved for the current query

### Automatic Local Analysis
* User relevance feedback
    * describe a larger cluster of relevant documents with assistance from the user (clustering)
* Automatic analysis
    * global strategy: a global thesaurus-like structure is built from all documents before querying
    * local strategy: terms from the documents retrieved for a given query are selected at query time

## Implicit Feedback Through Local Analysis

Local analysis consists of deriving feedback information from the documents retrieved for a given query $q$.

## Local Clustering

### Association Clusters

![](https://i.imgur.com/7HCwSkI.png)

### Metric Clusters

Introduce the notion of distance: consider the relative positions of two terms within a document. In general, the closer two terms appear to each other, the stronger their correlation.

### Scalar Clusters

Measure the correlation of two terms by comparing the relationship between their contexts (their neighborhoods of co-occurring terms).

### Neighbor Terms

Terms that belong to clusters associated with the query terms can be used to expand the original query.

## Local Context Analysis

A weight computed as $1 - 0.9 \times \dfrac{i}{m}$ is assigned to each concept $c$, where $i$ is the position of $c$ in the concept ranking and $m$ is the number of concepts to add to $q$. The weights of the expansion terms must not exceed those of the original query terms.

## Implicit Feedback Through Global Analysis

### Similarity Thesaurus

Unlike before, where terms were used to represent a whole document, here we want to use documents to represent a term.

![](https://i.imgur.com/NK1JcbX.png)
![](https://i.imgur.com/PAJfshA.png)

The factor for $d_j$ measures its ability to discriminate between index terms: the more index terms $d_j$ contains, the lower this ability.

Given the global similarity thesaurus, query expansion is done in three steps as follows:

![](https://i.imgur.com/aN5gOMZ.png)

### Statistical Thesaurus

Cluster the documents into groups and, from each cluster, select representative terms to build the thesaurus (see the sketch below).
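The notes break off here, but the construction they describe (cluster the documents, then draw representative terms from each cluster) can be sketched roughly as follows. This illustration uses `scikit-learn` with toy documents and arbitrary parameter choices; it is an assumption-laden sketch, not the specific clustering algorithm the lecture may have presented.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy collection; a real thesaurus would be built from the whole corpus.
docs = [
    "relevance feedback improves retrieval effectiveness",
    "rocchio feedback reformulates the query vector",
    "eye tracking studies measure user fixations",
    "clickthrough data reflects user preferences",
]

# Step 1: represent the documents as tf-idf vectors and cluster them.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 2: from each cluster, keep the highest-weighted centroid terms
# as one thesaurus class; such classes can then expand related queries.
terms = vectorizer.get_feature_names_out()
for c, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]
    print(f"thesaurus class {c}:", [terms[i] for i in top])
```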