# IR
###### tags: `IR`
## Term Vocabulary
==Token — type — term==
ex. “to sleep perchance to dream”
5 tokens: to, sleep, perchance, to, and dream.
4 types, since there are 2 instances of `to`.
3 terms, if `to` is omitted from the index as a stopword.
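A minimal sketch of counting these on the example sentence (the whitespace tokenizer and the one-word stopword list are simplifying assumptions):
```python
# Count tokens, types, and terms for the example sentence.
sentence = "to sleep perchance to dream"
stopwords = {"to"}                             # assumed stopword list

tokens = sentence.split()                      # 5 tokens
types = set(tokens)                            # 4 types (the duplicate "to" collapses)
terms = types - stopwords                      # 3 terms ("to" dropped as a stopword)

print(len(tokens), len(types), len(terms))     # -> 5 4 3
```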
==Normalization==
a process of canonicalizing tokens so that matches occur despite **superficial differences** in the character sequences of tokens.
1. Removing `-` and `.` (e.g. `U.S.A.` -> `USA`)
2. Token expansion
3. Removing accents and diacritics (e.g. `cliché` -> `cliche`)
4. Case-folding (= reducing all letters to lower case)
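A rough sketch of steps 1, 3, and 4 on single tokens (the helper below is an illustration, not a full normalizer; accent removal relies on Unicode NFKD decomposition, and step 2, token expansion, is not covered here):
```python
import unicodedata

def normalize(token):
    """Toy normalizer: strip '-' and '.', remove accents, then case-fold."""
    token = token.replace("-", "").replace(".", "")            # 1. remove '-' and '.'
    token = unicodedata.normalize("NFKD", token)               # 3. decompose accented chars...
    token = "".join(c for c in token if not unicodedata.combining(c))  # ...and drop the accents
    return token.casefold()                                    # 4. case-folding

print(normalize("U.S.A."))   # -> usa
print(normalize("Résumé"))   # -> resume
```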
==Stemming and Lemmatization==
**Stemming**: usually a crude heuristic process that chops off the ends of words.
* Porter’s algorithm
> ‘automate(s)’, ‘automatic’, ‘automation’ -> ‘automat’.
> Stemming can only enlarge the `retrieved set`, so **recall** can only increase or stay the same.
> EX. ‘operate’ and ‘operating’ are both stemmed to ‘oper’, so the query ‘operating system’ would also retrieve sentences containing ‘operate’ and ‘system’. [color=lightgreen]
**Lemmatization**: doing things more properly with the use of a vocabulary and morphological analysis of words.
> ‘am’, ‘are’, ‘is’ -> ‘be’.
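A quick comparison using NLTK (assuming `nltk` is installed and the WordNet data has been downloaded; exact stems depend on the Porter variant used):
```python
from nltk.stem import PorterStemmer, WordNetLemmatizer   # pip install nltk
# The lemmatizer needs: nltk.download("wordnet") to have been run once.

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(w) for w in ["automates", "automatic", "automation"]])
print([lemmatizer.lemmatize(w, pos="v") for w in ["am", "are", "is"]])   # -> ['be', 'be', 'be']
```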
==Heaps’ Law==
estimates the vocabulary size (the number of distinct terms) $M$ as a function of collection size:
$M = kT^b$
(`T`: the number of tokens in the collection; typically $30 \le k \le 100$ and $b \approx 0.5$)
> The dictionary size will continue to increase with more documents in the collection. [color=skyblue]
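A small worked example (the parameter values $k \approx 44$, $b \approx 0.49$ are of the order reported for the Reuters-RCV1 collection; treat them as illustrative):
```python
# Heaps' law: estimated vocabulary size M = k * T^b
k, b = 44, 0.49                    # illustrative parameter values
for T in (10**5, 10**6, 10**7):    # number of tokens processed so far
    M = k * T ** b
    print(f"T = {T:>10,}  ->  M ~ {M:,.0f} distinct terms")
```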
==Zipf’s law==
models the distribution of terms across the collection:
the collection frequency $cf_i$ of the $i$-th most common term is proportional to $1/i$, i.e. $cf_i \propto \dfrac{1}{i}$.
> for most words, their use will be extremely sparse!! [color=skyblue]
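A quick consequence under this model: if the most frequent term occurs $cf_1$ times, the 10th most frequent term occurs roughly $cf_1/10$ times and the 100th roughly $cf_1/100$ times, which is why usage of most vocabulary items is so sparse.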
## PAT tree

## Scoring, Term Weighting, Vector Space Model
==inverted file==
for each term in the dictionary, keep a postings list of the IDs of the documents in which that term occurs.
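A toy sketch of building such an index (the two-document collection and the use of Python sets as postings lists are assumptions):
```python
from collections import defaultdict

docs = {1: ["new", "home", "sales"], 2: ["home", "prices", "rise"]}   # toy docID -> tokens

index = defaultdict(set)
for doc_id, tokens in docs.items():
    for term in tokens:
        index[term].add(doc_id)          # postings: the set of docIDs containing the term

print(sorted(index["home"]))             # -> [1, 2]
```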

==Term Frequency==
$tf_{t,d}$: the number of occurrences of term $t$ in document $d$
==Inverse document frequency==
$idf_t = \log(N/df_t)$
(`N`: the number of documents in a collection)
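A minimal tf-idf sketch over a toy corpus (the three pre-tokenized documents and the raw-count tf variant are assumptions; many weighting schemes use a dampened $1+\log tf_{t,d}$ instead):
```python
import math

docs = [["information", "retrieval", "retrieval"],   # toy, pre-tokenized documents
        ["information", "theory"],
        ["signal", "theory"]]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                              # tf_{t,d}: raw count of t in d
    df = sum(1 for d in docs if term in d)            # df_t: documents containing t
    return tf * math.log(N / df) if df else 0.0       # weight = tf * log(N / df_t)

print(tf_idf("retrieval", docs[0]))    # frequent in d, rare in the collection -> high weight
print(tf_idf("information", docs[0]))  # appears in more documents -> lower idf, lower weight
```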
## Language Models for Information Retrieval
==Markov assumption==
only the prior local context (the last few words) affects the next word
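Under a first-order Markov (bigram) assumption the chain rule simplifies to
$P(w_1 \cdots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$
(with $w_0$ a start-of-sentence symbol).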
==Smoothing==
=> a way to overcome data sparseness: decrease the probability of previously seen events so that a little probability mass is left over for previously unseen events.
**Laplace's law**
adding one smoothing
$P_{Lap}(w_1 \cdots w_n) = \frac{C(w_1 \cdots w_n)+1}{N+B}$
(`N`: the number of n-gram training instances; `B`: the number of possible n-grams, i.e. bins)
**Lidstone’s Law**
$P_{Lid}(w_1 \cdots w_n) = \frac{C(w_1 \cdots w_n)+ \lambda}{N+\lambda B}$
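A small sketch of add-$\lambda$ estimation (Laplace is the special case $\lambda = 1$; the counts `N` and `B` below are made-up numbers):
```python
def lidstone(count, N, B, lam=1.0):
    """Add-lambda estimate: (C + lambda) / (N + lambda * B)."""
    return (count + lam) / (N + lam * B)

N = 10_000      # total number of n-gram tokens observed (assumed)
B = 50_000      # number of possible n-grams, i.e. bins (assumed)

print(lidstone(0, N, B, lam=1.0))    # unseen n-gram, Laplace: small but non-zero probability
print(lidstone(25, N, B, lam=0.5))   # seen n-gram, Lidstone with lambda = 0.5
```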
**Good-Turing Estimation**
reallocates probability mass using frequencies of frequencies: the adjusted count of an n-gram seen $r$ times is $r^* = (r+1)\dfrac{N_{r+1}}{N_r}$, where $N_r$ is the number of n-grams occurring exactly $r$ times.

## BIM, Relevance Feedback, and Evaluation
==Precision / Recall==
| contingency table | Relevant | Not relevant |
| ---- | -------- | ----- |
| Retrieved | tp | fp |
| Not retrieved | fn | tn |
Precision = $\frac{tp}{tp+fp}$
Recall = $\frac{tp}{tp+fn}$
==Retrieval Status Value (RSV)==
$RSV_d = \sum_{t \in q \cap d} \log\dfrac{p_t(1-u_t)}{u_t(1-p_t)}$
($p_t$: probability that term $t$ appears in a relevant document; $u_t$: probability that it appears in a non-relevant document)
==kappa==
$\kappa = \frac{P(A)-P(E)}{1-P(E)}$
(`P(A)`: proportion of agreement observed between judges; `P(E)`: agreement expected by chance)

==F1 score==
$F_1 = \dfrac{2PR}{P+R}$
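Computed directly from the contingency table above (the counts are hypothetical):
```python
tp, fp, fn = 40, 10, 20            # hypothetical counts from the contingency table

precision = tp / (tp + fp)         # fraction of retrieved documents that are relevant
recall    = tp / (tp + fn)         # fraction of relevant documents that are retrieved
f1        = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)       # -> 0.8 0.666... 0.727...
```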
==BIM==
$RSV_d = \sum_{t \in q \cap d} \log\dfrac{N}{df_t}$ (assuming $p_t$ is constant and $u_t \approx df_t/N$)
==BM25==
$RSV_d = \sum_{t \in q} \log\dfrac{N}{df_t} \cdot \dfrac{(k_1+1)\,tf_{t,d}}{k_1\bigl((1-b)+b\cdot\frac{L_d}{L_{ave}}\bigr)+tf_{t,d}}$
(`k_1`: calibrates term-frequency scaling, typically between 1.2 and 2; `b`: calibrates document-length normalization, typically 0.75; `L_d`/`L_{ave}`: document length / average document length)

## Word Vectors
==Skip-gram==
trying to predict the surrounding window of words based on an input word.
==Continuous bag-of-words (CBOW)==
predicts the target word from the nearby words.
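Both architectures are available in gensim's `Word2Vec`: `sg=1` selects skip-gram and `sg=0` selects CBOW (a minimal sketch assuming gensim >= 4, where the dimensionality argument is `vector_size`; the two-sentence corpus is made up):
```python
from gensim.models import Word2Vec   # pip install gensim

sentences = [["information", "retrieval", "ranks", "documents"],
             ["neural", "models", "learn", "word", "vectors"]]

skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # skip-gram
cbow     = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # CBOW

print(skipgram.wv["retrieval"][:5])               # a learned 50-dim vector (first 5 dims)
print(cbow.wv.most_similar("retrieval", topn=2))  # nearest neighbours in the vector space
```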
## Link Analysis
==Anchor text==
the clickable text of a hyperlink; it describes how other web authors refer to the target page, so search engines index it as part of the target page.
==Page Rank==
a query-independent importance score for each page: the long-run visit rate of a random surfer who keeps following links and occasionally teleports to a random page.
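A minimal power-iteration sketch of this random-surfer model (the three-page link graph and the damping factor 0.85 are assumptions):
```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0]}      # tiny web graph: page -> pages it links to
N, d = 3, 0.85                            # d: damping factor (probability of following a link)

pr = np.full(N, 1.0 / N)                  # start from the uniform distribution
for _ in range(50):                       # power iteration until (roughly) converged
    new = np.full(N, (1 - d) / N)         # teleporting mass, spread uniformly
    for page, outlinks in links.items():
        for target in outlinks:
            new[target] += d * pr[page] / len(outlinks)   # share rank over out-links
    pr = new

print(pr, pr.sum())                       # PageRank vector; the scores sum to 1
```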

==HITS (Hyperlink-Induced Topic Search)==
a query-dependent approach to rank documents for broad-topic searches.
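HITS assigns each page in the query-dependent base set a **hub** score and an **authority** score and updates them iteratively; a compact sketch (the adjacency matrix is made up):
```python
import numpy as np

A = np.array([[0, 1, 1],      # adjacency matrix of the base set:
              [0, 0, 1],      # A[i][j] = 1 if page i links to page j
              [1, 0, 0]], dtype=float)

h = np.ones(3)                              # hub scores
a = np.ones(3)                              # authority scores
for _ in range(30):
    a = A.T @ h; a /= np.linalg.norm(a)     # good authorities are linked from good hubs
    h = A @ a;  h /= np.linalg.norm(h)      # good hubs link to good authorities

print("authority:", a, "hub:", h)
```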
## Text Classification and Naïve Bayes
### Naïve Bayes classification
==Multinomial model==
* **Naïve Bayes conditional independence assumption**: terms are independent of each other given the class.
* **positional independence**: the conditional probabilities for a term are the same independent of its position in the document.
> The `bag of words model` (the positional independence assumption) discards the ordering information of words in natural language sentences.
* use **log** to avoid **floating point underflow**
* use **add-one smoothing** to avoid **Zero probability**
###### Example
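A quick illustration with scikit-learn's `MultinomialNB` (the tiny training set and labels are made up; `alpha=1.0` is exactly the add-one smoothing mentioned above, and scoring happens in log space internally):
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs   = ["cheap pills buy now", "limited offer buy cheap", "meeting agenda attached"]
train_labels = ["spam", "spam", "ham"]          # hypothetical labels

vec = CountVectorizer()                          # bag-of-words counts (positional independence)
X = vec.fit_transform(train_docs)

clf = MultinomialNB(alpha=1.0)                   # alpha=1.0 -> add-one (Laplace) smoothing
clf.fit(X, train_labels)

print(clf.predict(vec.transform(["buy cheap pills"])))   # -> ['spam']
```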

==Bernoulli model==
```python
import math

def apply_bernoulli_nb(C, V, prior, condprob, d):
    """Score a document d (an iterable of tokens) against every class in C."""
    Vd = {t for t in d if t in V}                 # vocabulary terms that occur in d
    score = {}
    for c in C:
        score[c] = math.log(prior[c])
        for t in V:                               # Bernoulli NB iterates over the whole vocabulary
            if t in Vd:
                score[c] += math.log(condprob[t][c])
            else:                                 # absence of a term is evidence too
                score[c] += math.log(1 - condprob[t][c])
    return max(score, key=score.get)              # argmax_c score[c]
```
###### Example
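A hypothetical call of the function above, with hand-picked priors and conditional probabilities just to show the interface:
```python
C = ["china", "not-china"]                      # hypothetical classes
V = ["chinese", "tokyo", "japan"]               # hypothetical vocabulary
prior = {"china": 0.75, "not-china": 0.25}
condprob = {"chinese": {"china": 0.8, "not-china": 0.4},
            "tokyo":   {"china": 0.2, "not-china": 0.6},
            "japan":   {"china": 0.2, "not-china": 0.6}}

print(apply_bernoulli_nb(C, V, prior, condprob, ["chinese", "chinese", "shanghai"]))  # -> 'china'
```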

### Feature Selection
==chi-square==
how much expected counts E and observed counts N deviate from each other.
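In the standard 2×2 term/class formulation this is
$\chi^2(D,t,c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \dfrac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$
where $N_{e_t e_c}$ is the observed count of documents with term status $e_t$ and class status $e_c$, and $E_{e_t e_c}$ is the count expected if term and class were independent; a large value means term and class are dependent, so the term is a useful feature.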

==Likelihood Ratio==

$-2\log\lambda = -2\log\dfrac{L(H_1)}{L(H_2)}$, which is asymptotically $\chi^2$-distributed.
==Pointwise Mutual Information==
* PMI can be negative !!

> For terms with an equal (or similar) conditional probability P(t|c), rare terms will have a higher PMI!! [color=greenyellow]
==Expected Mutual Information==
$I(U;C) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(U{=}e_t, C{=}e_c)\,\log_2\dfrac{P(U{=}e_t, C{=}e_c)}{P(U{=}e_t)\,P(C{=}e_c)}$
measures how much information the presence/absence of a term contributes to making the correct classification decision.

==Frequency-based==
### Evaluation

## Vector Space Classification
* **Contiguity hypothesis**: Documents in the same class form a contiguous region. (Regions do not overlap.)
* use **length-normalized tf-idf** representation
==Rocchio Classification==
* uses **centroids** to define class boundaries; a document is assigned to the class of its nearest centroid (see the sketch after the kNN entry below).
==k Nearest Neighbor==
* assigns documents to the majority class of their k closest neighbors
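Both classifiers are available in scikit-learn; `NearestCentroid` is essentially Rocchio on length-normalized tf-idf vectors (the four training documents and labels are made up):
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid, KNeighborsClassifier

docs   = ["stock market falls", "shares and bonds rally",        # hypothetical training set
          "team wins the final", "coach praises the players"]
labels = ["finance", "finance", "sports", "sports"]

vec = TfidfVectorizer()                       # length-normalized tf-idf vectors
X = vec.fit_transform(docs)

rocchio = NearestCentroid().fit(X, labels)                  # Rocchio: nearest class centroid
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)    # kNN: majority of 3 neighbours

test = vec.transform(["bonds rally as the market recovers"])
print(rocchio.predict(test), knn.predict(test))   # both should say 'finance' here
```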
==More Than Two Classes==
* **Any-of classification** :
A document can belong to several classes, or to a single class, or to none of the classes.
* **One-of classification** :
A document must belong to exactly one of the classes.
==Bias-Variance Tradeoff==

* **Bias** $\left(E_D\Gamma_D(d) - P(c|d)\right)^2$
is the squared difference between the true conditional probability of *d* being in *c* and the prediction of the classifier, averaged over training sets.
* **Variance** $E_D\left[\Gamma_D(d) - E_D\Gamma_D(d)\right]^2$
is the variation of the prediction of learned classifiers.
> Variance measures how **inconsistent** the decisions are
## Flat Clustering
* **Hard/Partitional clustering**: each document is a member of exactly one cluster.
* **Soft clustering**: a document’s assignment is a distribution over all clusters.
* **Exhaustive clustering**: Each document must be assigned to a cluster.
==Evaluation of Clustering==
* **Purity**
$purity(\Omega, C) = \dfrac{1}{N}\sum_{k} \max_{j} |\omega_k \cap c_j|$

* **Rand index**
$RI = \dfrac{TP+TN}{TP+FP+FN+TN}$

* **F measure**
$P = \dfrac{TP}{TP+FP}$
$R = \dfrac{TP}{TP+FN}$ :arrow_right: $F_1 = \dfrac{2PR}{P+R}$
==K-means==
* **Residual sum of squares (RSS)** – the squared distance of each vector from its centroid, summed over all vectors
* **Re-assigning** documents to the cluster with the closest centroid.
* **Re-computing** each centroid based on the current members of its cluster.
* **Medoid** – the document vector that is closest to the centroid.
* objective function (for choosing the number of clusters): $K = \arg\min_K \left[RSS(K) + \lambda K\right]$
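A short run with scikit-learn's `KMeans` (the toy documents and $K = 2$ are assumptions; `inertia_` is the RSS objective above):
```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stock market falls", "shares rally", "team wins final", "coach praises players"]
X = TfidfVectorizer().fit_transform(docs)     # toy document vectors

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # alternate re-assign / re-compute
print(km.labels_)       # cluster membership of each document
print(km.inertia_)      # residual sum of squares (RSS) of the final clustering
```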
==Model-based Clustering==
* **EM algorithm**
* **Expectation step**: calculate the expected value of each hidden variable, assuming the current model θ holds.
* **Maximization step**: compute a new maximum likelihood model, given the expected value of each hidden variable.
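scikit-learn's `GaussianMixture` runs exactly this E-step/M-step alternation (the 2-D toy points and two components are assumptions; documents would normally be mapped to dense low-dimensional vectors first):
```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9]])   # toy 2-D points

gm = GaussianMixture(n_components=2, random_state=0).fit(X)       # EM until convergence
print(gm.predict(X))         # hard assignment of each point
print(gm.predict_proba(X))   # soft assignment: a distribution over the clusters
```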
## Hierarchical Clustering
* **Single-link clustering**: The similarity of two clusters is the similarity of their most similar members.
* **Complete-link clustering**: The similarity of two clusters is the similarity of their most dissimilar members.
> very sensitive to outliers.
* **Group-average agglomerative clustering**: The similarity of two clusters is the average similarity of all pairs of documents.
* **Centroid clustering**: The similarity of two clusters is the similarity of their centroids.
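The four criteria map directly onto the `method` argument of SciPy's agglomerative clustering (the toy vectors are made up; `'average'` corresponds to group-average and `'centroid'` requires Euclidean distance):
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]])   # toy document vectors

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                           # bottom-up merges under the criterion
    print(method, fcluster(Z, t=2, criterion="maxclust"))   # cut the dendrogram into 2 clusters
```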