# IR
###### tags: `IR`

## Term Vocabulary

==Token — type — term==
ex. "to sleep perchance to dream"
* 5 tokens: to, sleep, perchance, to, and dream.
* 4 types, since there are 2 instances of `to`.
* 3 terms, if `to` is omitted from the index as a stopword.

==Normalization==
The process of canonicalizing tokens so that matches occur despite **superficial differences** in the character sequences of the tokens.
1. Removing `-` and `.`
2. Token expansion
3. Removing accents
4. Case-folding (reducing all letters to lower case)

==Stemming and Lemmatization==
**Stemming**: usually a crude heuristic process that chops off the ends of words.
* Porter's algorithm
> 'automate(s)', 'automatic', 'automation' -> 'automat'.
> Stemming can only enlarge the `retrieved set`, so **recall** can only increase or stay the same.
> Ex. 'operate' and 'operating' are both stemmed to 'oper', so the query 'operating system' would also match a sentence containing 'operate' and 'system'. [color=lightgreen]

**Lemmatization**: doing things more properly, using a vocabulary and morphological analysis of words.
> 'am', 'are', 'is' -> 'be'.

==Heaps' Law==
Estimates the vocabulary size $M$ (the number of distinct terms): $M = kT^b$ (`T`: the number of tokens in the collection; `k`, `b` are constants).
> The dictionary size will continue to increase with more documents in the collection. [color=skyblue]

==Zipf's law==
Models the distribution of terms: the collection frequency $cf_i$ of the $i$-th most common term is proportional to $1/i$.
> For most words, their use will be extremely sparse!! [color=skyblue]

## PAT tree
![](https://i.imgur.com/D8ICwwJ.png)

## Scoring, Term Weighting, Vector Space Model

==Inverted file==
![](https://i.imgur.com/kqGGENa.png)

==Term Frequency==
$tf_{t,d}$ — the number of occurrences of term $t$ in document $d$.

==Inverse document frequency==
$idf_t = \log(N/df_t)$ (`N`: the number of documents in the collection).

## Language Models for Information Retrieval

==Markov assumption==
Only the prior local context (the last few words) affects the next word.

==Smoothing==
A way to overcome data sparseness: decrease the probability of previously seen events so that a little probability mass is left over for previously unseen events.

**Laplace's law** (add-one smoothing)
$P = \frac{C(w_1,\ldots,w_n)+1}{N+B}$ (`B`: the number of possible n-grams)

**Lidstone's law**
$P = \frac{C(w_1,\ldots,w_n)+\lambda}{N+\lambda B}$

**Good-Turing estimation**
![](https://i.imgur.com/XP9uPbU.png)

## BIM, Relevance Feedback, and Evaluation

==Precision / Recall==

| Contingency table | Relevant | Not relevant |
| ----------------- | -------- | ------------ |
| Retrieved         | tp       | fp           |
| Not retrieved     | fn       | tn           |

Precision = $\frac{tp}{tp+fp}$
Recall = $\frac{tp}{tp+fn}$

==Retrieval Status Value (RSV)==
$RSV_d = \sum_{t \in q \cap d} \log\frac{p_t(1-u_t)}{u_t(1-p_t)}$

==Kappa==
$\kappa = \frac{P(A)-P(E)}{1-P(E)}$
![](https://i.imgur.com/3uKn2gh.png)

==F1 score==
$F_1 = \frac{2PR}{P+R}$

==BIM==
With the usual simplifying assumptions ($p_t$ constant, $u_t \approx df_t/N$), the RSV reduces to an idf-like sum:
$RSV_d = \sum_{t \in q \cap d} \log\frac{N}{df_t}$

==BM25==
![](https://i.imgur.com/WOrjxXV.png)

## Word Vectors

==Skip-gram==
Tries to predict the surrounding window of words based on an input word.

==Continuous bag-of-words (CBOW)==
Predicts the target word from the nearby words.

## Link Analysis

==Anchor text==
Depicts how web users describe a web page.

==PageRank==
![](https://i.imgur.com/2c3KSNo.png)
(A power-iteration sketch follows at the end of this section.)

==HITS (Hyperlink-Induced Topic Search)==
A query-dependent approach to rank documents for broad-topic searches.
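A minimal sketch of the PageRank power-iteration idea: repeatedly push a rank vector through the random-surfer transition matrix with teleportation. The toy graph, the damping factor of 0.85, and the iteration count are illustrative assumptions, not values from the notes.

```python
import numpy as np

def pagerank(links, damping=0.85, iterations=50):
    """links[i] = list of pages that page i points to (0-indexed)."""
    n = len(links)
    M = np.zeros((n, n))                 # column-stochastic random-surfer matrix
    for i, outs in enumerate(links):
        if outs:
            for j in outs:
                M[j, i] = 1.0 / len(outs)
        else:                            # dangling page: pretend it links everywhere
            M[:, i] = 1.0 / n
    r = np.full(n, 1.0 / n)              # start from the uniform distribution
    for _ in range(iterations):
        r = damping * (M @ r) + (1 - damping) / n   # follow a link or teleport
    return r

# Toy graph: page 0 -> 1, 2;  page 1 -> 2;  page 2 -> 0
print(pagerank([[1, 2], [2], [0]]))
```

Treating dangling pages as linking to every page keeps each column a probability distribution, so the rank vector stays normalized throughout the iteration.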
## Text Classification and Naïve Bayes

### Naïve Bayes classification

==Multinomial model==
* **Naïve Bayes conditional independence assumption**: terms are independent of each other given the class.
* **Positional independence**: the conditional probabilities for a term are the same independent of its position in the document.
> The `bag of words model` (the positional independence assumption) discards the ordering information of the words in a natural-language sentence.
* Use **log** probabilities to avoid **floating point underflow**.
* Use **add-one smoothing** to avoid **zero probabilities**.

###### Example
![](https://i.imgur.com/bYHatGC.png)

==Bernoulli model==
```python
from math import log

def apply_bernoulli_nb(C, V, prior, condprob, d):
    """C: classes, V: vocabulary, prior[c] = P(c),
    condprob[t][c] = P(term t occurs in a document of class c), d: tokens of the document."""
    Vd = set(d) & set(V)                    # vocabulary terms that occur in d
    score = {}
    for c in C:
        score[c] = log(prior[c])
        for t in V:                         # the Bernoulli model iterates over the whole vocabulary
            if t in Vd:
                score[c] += log(condprob[t][c])
            else:
                score[c] += log(1 - condprob[t][c])
    return max(C, key=lambda c: score[c])   # argmax_c score[c]
```

###### Example
![](https://i.imgur.com/bBkx0XN.png)

### Feature Selection

==Chi-square==
Measures how much the expected counts $E$ and the observed counts $N$ deviate from each other.
![](https://i.imgur.com/9BEgiav.png)

==Likelihood Ratio==
![](https://i.imgur.com/G01xz3w.png)
$-2\log\lambda = -2\log\frac{L(H_1)}{L(H_2)}$

==Pointwise Mutual Information==
* PMI can be negative!!
![](https://i.imgur.com/yp15DtW.png)
> For terms with an equal (or similar) conditional probability P(t|c), rare terms will have a higher PMI!! [color=greenyellow]

==Expected Mutual Information==
![](https://i.imgur.com/UMruq42.png)

==Frequency-based==
Selects the terms that occur most frequently in a class.

### Evaluation
![](https://i.imgur.com/dcjtiBw.png)

## Vector Space Classification
* **Contiguity hypothesis**: documents in the same class form a contiguous region. (Regions do not overlap.)
* Documents are represented as **length-normalized tf-idf** vectors.

==Rocchio Classification==
* Uses **centroids** to define the boundaries between classes.

==k Nearest Neighbor==
* Assigns a document to the majority class of its k closest neighbors.

==More Than Two Classes==
* **Any-of classification**: a document can belong to several classes, to a single class, or to none of the classes.
* **One-of classification**: a document must belong to exactly one of the classes.

==Bias-Variance Tradeoff==
![](https://i.imgur.com/uQ7hmsN.png)
* **Bias** $(E_D\Gamma_D(d) - P(c|d))^2$ is the squared difference between the true conditional probability of *d* being in *c* and the prediction of the classifier, averaged over training sets.
* **Variance** $E_D[\Gamma_D(d) - E_D\Gamma_D(d)]^2$ is the variation of the predictions of the learned classifiers.
> Variance measures how **inconsistent** the decisions are.

## Flat Clustering
* **Hard/Partitional clustering**: each document is a member of exactly one cluster.
* **Soft clustering**: a document's assignment is a distribution over all clusters.
* **Exhaustive clustering**: each document must be assigned to a cluster.

==Evaluation of Clustering==
* **Purity**
![](https://i.imgur.com/pEkENdS.png)
* **Rand index**
![](https://i.imgur.com/c1L4v7w.png)
* **F measure**
$P = \dfrac{TP}{TP+FP}$, $R = \dfrac{TP}{TP+FN}$ :arrow_right: $F_1 = \dfrac{2PR}{P+R}$

==K-means==
* **Residual sum of squares (RSS)** – the squared distance of each vector from its centroid, summed over all vectors.
* **Re-assigning** documents to the cluster with the closest centroid, then **re-computing** each centroid based on the current members of its cluster (a sketch of this loop follows at the end of this section).
* **Medoid** – the document vector that is closest to the centroid.
* Objective function for choosing the number of clusters: $K = \arg\min_K \left[\, RSS(K) + \lambda K \,\right]$

==Model-based Clustering==
* **EM algorithm**
    * **Expectation step**: calculate the expected value of each hidden variable, assuming the current model θ holds.
    * **Maximization step**: compute a new maximum-likelihood model, given the expected values of the hidden variables.
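A minimal NumPy sketch of the K-means loop described above (re-assign each document to its closest centroid, then re-compute the centroids) together with the RSS objective. The value of K, the random seeding of centroids, and the convergence test are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def kmeans(X, K, iterations=100, seed=0):
    """X: (n_docs, n_dims) array of (ideally length-normalized) document vectors."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]      # K random documents as seeds
    for _ in range(iterations):
        # Re-assignment: each document goes to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-computation: each centroid becomes the mean of its current members.
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):                 # centroids stopped moving
            break
        centroids = new_centroids
    rss = ((X - centroids[labels]) ** 2).sum()                    # residual sum of squares
    return labels, centroids, rss

labels, centroids, rss = kmeans(np.random.default_rng(1).random((20, 5)), K=3)
```

Running the sketch for several values of K and plotting RSS against K (plus the penalty term λK) is one way to apply the objective function above for choosing the number of clusters.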
## Hierarchical Clustering
* **Single-link clustering**: the similarity of two clusters is the similarity of their most similar members.
* **Complete-link clustering**: the similarity of two clusters is the similarity of their most dissimilar members.
> Very sensitive to outliers.
* **Group-average agglomerative clustering**: the similarity of two clusters is the average similarity of all pairs of documents.
* **Centroid clustering**: the similarity of two clusters is the similarity of their centroids.
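To make the linkage criteria above concrete, here is a naive bottom-up (agglomerative) sketch in which only the inter-cluster similarity function changes between single-link, complete-link, and group-average. The use of cosine similarity on row-normalized vectors and the fixed target number of clusters are illustrative assumptions; group-average is computed here as the mean similarity over cross-cluster pairs, which is one reading of "all pairs of documents".

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="single"):
    """Naive O(n^3) bottom-up clustering of row-normalized vectors X."""
    sim = X @ X.T                               # pairwise cosine similarities
    clusters = [[i] for i in range(len(X))]     # start with one cluster per document

    def cluster_sim(a, b):
        s = sim[np.ix_(a, b)]                   # similarities between members of the two clusters
        if linkage == "single":
            return s.max()                      # most similar pair of members
        if linkage == "complete":
            return s.min()                      # most dissimilar pair of members
        return s.mean()                         # group-average over cross-cluster pairs

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the highest similarity.
        i, j = max(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

Usage: `agglomerative(X_normalized, n_clusters=3, linkage="complete")`; comparing the outputs of the three linkage settings on the same data illustrates chaining under single-link and the outlier sensitivity of complete-link.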