# Relevance Feedback and Query Expansion
###### tags: `Information Retrieval and Extraction`
## A framework of Feedback Methods
#### Instead of asking the user explicitly:
* Look at documents they have clicked on; or
* Look at terms belonging to the top documents in the result set
### Explicit relevance feedback
* Expensive and time-consuming
* A user click represents a preference for a particular answer, but not a relevance judgement

### Implicit relevance feedback
Two basic approaches:
* **local analysis** : which derives the feedback information from the top ranked documents in the result set
* **global analysis** : which derives the feedback information from external sources such as a thesaurus

## The Rocchio Method
The basic idea of the Rocchio Method is to reformulate the query (see the formula sketched after this list) such that it gets:
* closer to the neighborhood of the relevant documents in the vector space
* away from the neighborhood of the non-relevant documents
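A sketch of the classic Rocchio reformulation (the standard textbook form; $\alpha$, $\beta$, $\gamma$ are tuning constants, and $D_r$, $D_n$ are the sets of known relevant and non-relevant documents):

$$
\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j
$$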


#### Advantages
* Simplicity:
* modified term weights are computed directly from the set of retrieved documents
* Good results:
* the modified query vector does reflect a portion of the intended query semantics (observed experimentally)
## Relevance Feedback for the Probabilistic Model
The probabilistic model ranks documents for a query $q$ according to the probabilistic ranking principle.
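In one standard form (a reconstruction based on the classic probabilistic model of Chapter 3, so the notation may differ slightly from the original slides), the ranking is:

$$
sim(d_j, q) \;\propto \sum_{k_i \in q \,\wedge\, k_i \in d_j} \left( \log\frac{P(k_i|R)}{1 - P(k_i|R)} + \log\frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)
$$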

where $P(k_i|R)$ and $P(k_i|\bar{R})$ stand for the probabilities of observing the term $k_i$ in the set $R$ of relevant documents and in the set $\bar{R}$ of non-relevant documents, respectively.
### Estimating the Probabilities with Feedback
Different methods for estimating these probabilities automatically were discussed in Chapter 3. With user feedback information, these probabilities are estimated in a slightly different way.
For the initial search,
* $P(k_i|R)$ is constant for all terms $k_i$ (typically 0.5)
* the term probability distribution $P(k_i|\bar{R})$ can be approximated by the distribution of $k_i$ in the whole collection

Then, the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$ can be approximated by
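A common way to write these approximations (sketched from the standard probabilistic feedback derivation; $D_r$ is the set of documents the user judged relevant, $D_{r,i}$ is its subset containing $k_i$, $n_i$ is the number of documents in the collection containing $k_i$, and $N$ is the collection size; adjustment factors such as $+0.5$ are often added in practice):

$$
P(k_i|R) \approx \frac{|D_{r,i}|}{|D_r|} \qquad\qquad P(k_i|\bar{R}) \approx \frac{n_i - |D_{r,i}|}{N - |D_r|}
$$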

Using these approximations, the similarity equation can be rewritten as
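Substituting the estimates above (a reconstruction, without the usual $0.5$ adjustment factors) gives:

$$
sim(d_j, q) \;\propto \sum_{k_i \in q \,\wedge\, k_i \in d_j} \left( \log\frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} + \log\frac{N - |D_r| - n_i + |D_{r,i}|}{n_i - |D_{r,i}|} \right)
$$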

*Contrary to the Rocchio Method, **no query expansion occurs**.*
#### Advantage
* derivation of new weights for the query terms
#### Disadvantages
* document term weights are not taken into account during the feedback loop;
* weights of terms in the previous query formulations are disregarded; and
* no query expansion is used (the same set of index terms in the original query is reweighted over and over again)
## Evaluation of Relevance Feedback
Evaluation of the modified query $\vec{q}_m$: the standard practice is to evaluate it over the residual collection, i.e., the collection minus the documents already shown to the user, since retrieving again the relevant documents the user has already seen would artificially inflate the results.
## Explicit Feedback Through Clicks
The clicks reflect preferences for particular answers in the context of a given query, but not relevance.
### Eye Tracking
A fixation is a gaze at a particular area of the screen lasting 200-300 milliseconds.
### User Behavior
We want to understand how people behave when they are looking for information.
#### Joachims et al. Experimental Design
* A test set composed of 10 queries
* 5 navigational queries
* 5 informational queries
* Navigational query
* Find a Web page
* Find the homepage of Michael Jordan, the statistician
* Informational query
* Find information on a topic
* Find the location of the tallest mountain in New York
Annotation of Preferences:
* Given a query and top 10 results, decide on a strict preference ordering for all 10 results
* Each query is evaluated by at least two assessors
* The sets of relevance judgements produced by the human assessors are used to evaluate the quality of the documents clicked on by the users

The users inspect the top 2 answers almost equally, but they click almost three times more often on the first one. This might be indicative of a bias: **users tend to trust the search engine** to recommend a relevant result at the top position.
### Clicks as a Metric of Preferences
### Clicks within the Same Query
* Given a ranking function $R(q_i,d_j)$, let $r_k$ be the $k$th ranked result
* let $\checkmark r_k$ indicate that the user has clicked on the $k$th result
* Define a preference relation $r_k > r_{k-n}$, meaning that the $k$th result is preferred over the $(k-n)$th result
#### For example:
Suppose the user clicks only on the third result, $\checkmark r_3$, in a ranking $r_1, r_2, r_3, \ldots$
Two distinct strategies to capture the preference relations in this case are as follows.
* **Skip-Above**: if $\checkmark r_k$, then $r_k > r_{k-n}$ for every $r_{k-n}$ that was not clicked
* $r_3 > r_2$; $r_3 > r_1$
* **Skip-Previous**: if $\checkmark r_k$ and $r_{k-1}$ has not been clicked, then $r_k > r_{k-1}$
* $r_3 > r_2$
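A minimal sketch of how these two strategies could be implemented, assuming a result list is represented simply by the set of clicked positions (1-based); the function and variable names are illustrative, not from the original notes:

```python
def skip_above(clicked):
    """Skip-Above: a clicked result is preferred over every
    non-clicked result ranked above it."""
    prefs = []
    for k in sorted(clicked):
        for j in range(1, k):            # positions ranked above k
            if j not in clicked:
                prefs.append((k, j))     # r_k > r_j
    return prefs


def skip_previous(clicked):
    """Skip-Previous: a clicked result is preferred over the result
    immediately above it, if that result was not clicked."""
    prefs = []
    for k in sorted(clicked):
        if k > 1 and (k - 1) not in clicked:
            prefs.append((k, k - 1))     # r_k > r_{k-1}
    return prefs


# Example from the notes: the user clicks only on the third result.
print(skip_above({3}))     # [(3, 1), (3, 2)]  i.e. r3 > r1 and r3 > r2
print(skip_previous({3}))  # [(3, 2)]          i.e. r3 > r2
```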
### Clicks within a Query Chain
#### For example:
Consider two result sets in the same query chain,

where $r_j$ refers to an answer in the **first result set** and $s_j$ refers to an answer in the **second result set**.
Two distinct strategies to capture the preference relations:

According to the first strategy, the following preferences are produced by the user's click on result $s_2$:

According to the second strategy, we have:

## Click-based Ranking
Clickthrough information can be used to improve the ranking: the preference relations derived from clicks can be used to train a machine learning model that learns a ranking function.
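As an illustration only (not the specific method from the original notes), here is a minimal pairwise sketch: each click-derived preference "document $a$ is preferred over document $b$" becomes a training constraint asking a linear scoring function to score $a$ above $b$. The feature vectors and data below are placeholders.

```python
import numpy as np

def train_pairwise_ranker(preference_pairs, dim, epochs=100, lr=0.1):
    """Learn a linear scoring function w.x from click preferences.

    preference_pairs: list of (x_preferred, x_other) feature-vector pairs,
    where the first document of each pair was the clicked (preferred) one.
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_pos, x_neg in preference_pairs:
            # Hinge-style update: require score(x_pos) > score(x_neg) by a margin.
            if np.dot(w, x_pos) - np.dot(w, x_neg) < 1.0:
                w += lr * (x_pos - x_neg)
    return w

# Toy example with 2 hand-made features (e.g., BM25 score, title match).
pairs = [
    (np.array([0.9, 1.0]), np.array([0.7, 0.0])),  # clicked doc vs skipped doc
    (np.array([0.6, 1.0]), np.array([0.8, 0.0])),
]
w = train_pairwise_ranker(pairs, dim=2)
print("learned weights:", w)
```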
## Query Expansion
Query expansion is another method for increasing recall: the original query is enlarged with related terms.
### Types of Query Expansion
* Global Analysis: (static; derived from all documents in the collection)
* Local Analysis: (dynamic; derived from the documents retrieved for the current query)
### Automatic Local Analysis
* user relevance feedback
* describe a larger cluster of relevant documents with assistance from the user (clustering)
* automatic analysis
* global strategy: global thesaurus-like structure is trained from all documents before querying
* local strategy: terms from the documents retrieved for a given query are selected at query time
## Implicit Feedback Through Local Analysis
Local analysis consists of deriving feedback information from the documents retrieved for a given query $q$.
## Local Clustering
### Association Clusters
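Association clusters are built from term co-occurrence inside the locally retrieved documents. A sketch of the standard definition of the association correlation between two terms $k_u$ and $k_v$ (where $D_l$ is the local document set and $f_{u,j}$ is the frequency of $k_u$ in document $d_j$; the exact notation in the original slides may differ):

$$
c_{u,v} = \sum_{d_j \in D_l} f_{u,j} \times f_{v,j}
\qquad\qquad
s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}} \;\;\text{(normalized)}
$$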

### Metric Clusters
The notion of distance is introduced: the relative positions of the two terms within a document are taken into account. In general, the closer two terms occur to each other, the stronger their correlation.
### Scalar Clusters
The overall correlation between two terms is derived from the relationship between their contexts: two terms whose neighbourhoods (correlation vectors) are similar are considered related.
### Neighbor Terms
Terms that belong to clusters associated with the query terms can be used to expand the original query, as in the sketch below.
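A minimal sketch of this local expansion idea, assuming the top-ranked documents are given as pre-tokenized lists of terms (all names here are illustrative, not from the original notes):

```python
from collections import Counter, defaultdict

def association_matrix(local_docs):
    """c[u][v] = sum over local docs of freq(u) * freq(v)."""
    c = defaultdict(Counter)
    for doc in local_docs:
        freq = Counter(doc)
        for u in freq:
            for v in freq:
                c[u][v] += freq[u] * freq[v]
    return c

def expand_query(query_terms, local_docs, neighbors_per_term=2):
    """Expand the query with the strongest neighbors of each query term."""
    c = association_matrix(local_docs)
    expanded = list(query_terms)
    for u in query_terms:
        candidates = [(v, c[u][v]) for v in c[u] if v not in expanded]
        candidates.sort(key=lambda x: -x[1])
        expanded += [v for v, _ in candidates[:neighbors_per_term]]
    return expanded

# Toy local result set (top-ranked documents for the query "jaguar car").
docs = [["jaguar", "car", "engine", "speed"],
        ["jaguar", "car", "dealer", "price"],
        ["jaguar", "engine", "speed", "dealer"]]
print(expand_query(["jaguar", "car"], docs))
```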
## Local Context Analysis
A weight computed as $1-0.9\times \dfrac{i}{m}$ is assigned to each concept $c$, where $i$ is the position of $c$ in the concept ranking and $m$ is the number of concepts to add to $q$.
This way, the weight of an expanded term never exceeds the weight of the original query terms.
## Implicit Feedback Through Global Analysis
### Similarity Thesaurus
Unlike the usual approach of using terms to represent a whole document, the idea here is to use documents to represent a term.


A document $d_j$'s ability to discriminate between index terms: the more index terms a document contains, the lower this ability.
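This is usually captured by an inverse term frequency factor; a sketch of the standard definition (notation assumed, not taken from the original slides):

$$
itf_j = \log\frac{t}{t_j}
$$

where $t$ is the number of terms in the collection and $t_j$ is the number of distinct index terms in document $d_j$.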
Given the global similarity thesaurus, query expansion is done in three steps as follows:

### Statistical Thesaurus
The documents are clustered into groups, and terms are selected from within each cluster to form the thesaurus classes used for expansion.