# Probabilistic Model

###### tags: `Information Retrieval and Extraction`

## BM25 (Best Match 25)

BM25 was created as the result of a series of experiments on variations of the probabilistic model. A good term weighting scheme is based on three principles, but the classic probabilistic model covers only the first of them:
1. inverse document frequency
2. term frequency
3. document length normalization

### BM1, BM11, and BM15 Formulas

The first idea for improving the ranking was to introduce a **term-frequency** factor $F_{i,j}$ into the BM1 formula

![](https://i.imgur.com/Zr9AAg7.png)

If $K_1 = 0$, this whole factor becomes equal to 1 and has no effect on the ranking.

The next step was to modify the $F_{i,j}$ factor by adding **document length normalization** to it, as follows:

![](https://i.imgur.com/vLvBNAq.png)

Next, a correction factor $G_{j,q}$, dependent on the document and query lengths, was added

![](https://i.imgur.com/7nmjpop.png)

A third additional factor, aimed at taking into account term frequencies within queries, was defined as

![](https://i.imgur.com/9KbsAtc.png)

Introduction of these three factors led to various BM (Best Matching) formulas, as follows:

![](https://i.imgur.com/Ac5cNon.png)

These considerations lead to simpler equations, as follows:

![](https://i.imgur.com/9QUUJVV.png)

### BM25 Ranking Formula

The motivation was to combine the BM11 and BM15 term frequency factors as follows

![](https://i.imgur.com/b6HOLwZ.png)

The ranking equation for the BM25 model can then be written as

![](https://i.imgur.com/EZxQvaQ.png)

### Remark

* Unlike the probabilistic model, the BM25 formula can be computed without relevance information
* There is consensus that BM25 outperforms the classic vector model for general collections
* Thus, it has been used as a **baseline for evaluating** new ranking functions, in substitution to the classic vector model

## Language Models

### Basic Concept

![](https://i.imgur.com/MH0xi0K.png)

### Stochastic Language Models

Models
the probability of generating strings in the language (commonly, all strings over an alphabet $\Sigma$)

![](https://i.imgur.com/gGvCkOg.png)

![](https://i.imgur.com/ajBMpRv.png)

#### Probability distribution over strings in a given language

![](https://i.imgur.com/GIV4SpE.png)

#### Others

![](https://i.imgur.com/yIFhZwO.png)

#### The fundamental problem

![](https://i.imgur.com/7IorF91.png)

A language model for IR is composed of the following components:
* A set of document language models, one per document $d_j$ of the collection
* A probability distribution function that allows estimating the likelihood that a document language model $M_j$ generates each of the query terms
* A ranking function that combines these generating probabilities for the query terms into a rank of document $d_j$ with regard to the query

### Statistical Foundation

Let $S$ be a sequence of $r$ consecutive terms that occur in a document of the collection:
\begin{equation}
S = k_1, k_2, \ldots, k_r
\end{equation}

An $n$-gram language model uses a Markov process to assign a probability of occurrence to $S$:
\begin{equation}
P_n(S) = \prod_{i=1}^r P(k_i|k_{i-1}, k_{i-2}, \ldots, k_{i-(n-1)})
\end{equation}
where $n$ is the order of the Markov process.
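The $n$-gram factorization above can be sketched with a toy bigram ($n = 2$) estimator; `bigram_probability` and the raw maximum-likelihood counts are illustrative assumptions (real systems smooth these counts):

```python
from collections import Counter

def bigram_probability(sequence, corpus_tokens):
    """P_2(S) = P(k_1) * prod_i P(k_i | k_{i-1}), estimated from raw counts."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    prob = unigrams[sequence[0]] / len(corpus_tokens)      # P(k_1)
    for prev, cur in zip(sequence, sequence[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]      # P(k_i | k_{i-1})
    return prob

corpus = "the cat sat on the mat the cat ran".split()
print(bigram_probability(["the", "cat"], corpus))  # ≈ 0.222 (= 3/9 * 2/3)
```

Unseen bigrams yield probability 0 here, which is exactly the zero-probability problem that motivates the smoothing techniques discussed later.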
The occurrence of a term depends on observing the $n-1$ terms that precede it in the text:
* **Unigram language model** ($n = 1$): the estimates are based on the occurrence of individual words
* **Bigram language model** ($n = 2$): the estimates are based on the co-occurrence of pairs of words
* Higher-order models such as **trigram language models** ($n = 3$) are usually adopted for speech recognition
* **Term independence assumption**: in the case of IR, the impact of word order is less clear
* As a result, unigram models have been used extensively in IR

### Multinomial Process

Ranking in a language model is provided by estimating $P(q|M_j)$.
According to this process, if we assume that the **query terms are independent** among themselves (unigram model), we can write:
\begin{equation}
P(q|M_j) = \prod_{k_i \in q} P(k_i|M_j)
\end{equation}
* Under this formula, if any query term does not appear in the document, the whole product, and hence the ranking probability, becomes 0. We need smoothing to solve this problem.

![](https://i.imgur.com/YNGc16Q.png)

where $P_{\in}$ and $P_{\not\in}$ are two distinct probability distributions:
* The first is a distribution for the query terms in the document
* The second is a distribution for the query terms not in the document

For the second distribution, statistics are derived from the whole document collection:
\begin{equation}
P_{\not\in}(k_i|M_j) = \alpha_j P(k_i|C)
\end{equation}
where $\alpha_j$ is a parameter associated with document $d_j$ and $P(k_i|C)$ is a language model for the collection $C$

![](https://i.imgur.com/66x8Bty.png)

![](https://i.imgur.com/15Iazyg.jpg)

The ranking function is now composed of two separate parts.
**The first part** assigns weights to each query term that appears in the document, according to the expression

![](https://i.imgur.com/zKfeZTJ.png)

This term weight plays a role analogous to the $tf$ plus $idf$ weight components in the vector model.
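The two-distribution decomposition above can be sketched as follows; this is a toy sketch with maximum-likelihood estimates, and using a single constant `alpha` for every document is a simplifying assumption:

```python
import math
from collections import Counter

def log_query_likelihood(query, doc_tokens, collection_tokens, alpha=0.5):
    """log P(q|M_j): terms in d_j use an in-document estimate P_in,
    while absent terms back off to alpha * P(k_i|C)."""
    doc_tf = Counter(doc_tokens)
    coll_tf = Counter(collection_tokens)
    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / len(collection_tokens)    # P(k_i|C)
        if doc_tf[term] > 0:
            p = doc_tf[term] / len(doc_tokens)             # P_in(k_i|M_j)
        else:
            p = alpha * p_coll                             # alpha_j * P(k_i|C)
        score += math.log(p) if p > 0 else float("-inf")
    return score

docs = {"d1": "the cat sat".split(), "d2": "dogs bark loudly".split()}
collection = [t for d in docs.values() for t in d]
ranking = sorted(docs, key=lambda d: log_query_likelihood(["cat"], docs[d], collection), reverse=True)
print(ranking)  # → ['d1', 'd2']
```

The log-space sum avoids numeric underflow when queries are long, and the collection back-off keeps documents missing a query term from scoring zero outright.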
Further, the parameter $\alpha_j$ can be used for document length normalization.

**The second part** assigns a fraction of probability mass to the query terms that are not in the document, a process called **smoothing**:
\begin{equation}
n_q \log \alpha_j
\end{equation}

The combination of a multinomial process with smoothing leads to a ranking formula that naturally includes $tf$, $idf$, and document length normalization.
That is, smoothing plays a key role in modern language modeling, as we now discuss

### Smoothing

In our discussion, we estimated $P_{\not\in}(k_i|M_j)$ using $P(k_i|C)$ to avoid assigning zero probability to query terms not in document $d_j$.
One popular smoothing technique is to move some probability mass from the terms in the document to the terms not in the document, as follows:

![](https://i.imgur.com/zXeERyW.png)

where $P_{\in}^s(k_i|M_j)$ is the **smoothed distribution** for terms in document $d_j$

![](https://i.imgur.com/VhNxBVc.jpg)

There are two methods to calculate $\alpha_j$:
* Jelinek-Mercer method
* Dirichlet smoothing

#### Jelinek-Mercer Method

The idea is to do a linear interpolation between the term frequency and the collection frequency distributions:

![](https://i.imgur.com/UlZT1iH.png)

It can be shown that
\begin{equation}
\alpha_j = \lambda
\end{equation}
Thus, the larger the value of $\lambda$, the larger the effect of smoothing

#### Dirichlet Smoothing

In this method, the language model is a multinomial distribution in which the conjugate prior probabilities are given by the Dirichlet distribution

![](https://i.imgur.com/KoIbkEk.png)

As before, the closer $\lambda$ is to 0, the higher the influence of the term's document frequency.
As $\lambda$ grows, the influence of the term's collection frequency increases:
\begin{equation}
\alpha_j = \frac{\lambda}{\sum_i f_{i,j} + \lambda}
\end{equation}
Again, the larger the value of $\lambda$, the larger the effect of smoothing

#### Remark

The IR ranking in a multinomial language model is computed as follows:

![](https://i.imgur.com/2rRpdZt.png)

### Bernoulli Process

The first application of language models to IR was due to Ponte & Croft. They proposed a **Bernoulli process** for generating the query.
If we assume independence of index terms, we can compute $P(q|M_j)$ using a multivariate Bernoulli process:

![](https://i.imgur.com/lZgHe48.png)

where $P(k_i|M_j)$ are term probabilities. A simple estimate of the term probabilities is

![](https://i.imgur.com/O6smkjL.png)

We assume that a non-occurring term is related to $d_j$ with the probability $P(k_i|C)$ of observing $k_i$ in the whole collection $C$

![](https://i.imgur.com/ILxwNIg.png)

---

![](https://i.imgur.com/X6b4keL.png)

---

![](https://i.imgur.com/ioRIZkq.png)

---

![](https://i.imgur.com/UjGtFg7.png)

---

![](https://i.imgur.com/Ch1hW01.png)

---

![](https://i.imgur.com/wSpv9uL.png)

---

![](https://i.imgur.com/PLFSnGm.png)

## Query Model vs. Document Model

### Alternative Models of Text Generation

![](https://i.imgur.com/HWVsEkX.png)

### Retrieval Using Language Models

![](https://i.imgur.com/wrWV87Z.png)

### Query Likelihood

![](https://i.imgur.com/u5HmTTo.png)

### Document Likelihood

![](https://i.imgur.com/W5HaV5p.png)

### Model Comparison

![](https://i.imgur.com/aLKewc8.png)

### How can one do relevance feedback using the language modeling approach?

![](https://i.imgur.com/B869I4E.png)

#### Expansion-based vs. Model-based

![](https://i.imgur.com/sFeh2ja.jpg)

#### Feedback as Model Interpolation

![](https://i.imgur.com/anFKIyF.jpg)

### Translation Model (Berger and Lafferty)

![](https://i.imgur.com/MrnPZxd.png)

### Comparison With Vector Space

![](https://i.imgur.com/5N2eHQq.png)
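The Jelinek-Mercer and Dirichlet smoothing methods from the Smoothing section can be sketched side by side; the function names are mine, the estimates are the standard textbook forms, and `lam` follows this document's use of $\lambda$ for the Dirichlet parameter (elsewhere often written $\mu$):

```python
from collections import Counter

def jelinek_mercer(term, doc_tokens, coll_tokens, lam=0.5):
    """Linear interpolation: (1 - lam) * f_ij/|d_j| + lam * P(k_i|C).
    Here alpha_j = lam, so a larger lam means more smoothing."""
    p_doc = Counter(doc_tokens)[term] / len(doc_tokens)
    p_coll = Counter(coll_tokens)[term] / len(coll_tokens)
    return (1 - lam) * p_doc + lam * p_coll

def dirichlet(term, doc_tokens, coll_tokens, lam=1000):
    """Bayesian smoothing with a Dirichlet prior:
    (f_ij + lam * P(k_i|C)) / (|d_j| + lam),
    so alpha_j = lam / (sum_i f_ij + lam)."""
    f = Counter(doc_tokens)[term]
    p_coll = Counter(coll_tokens)[term] / len(coll_tokens)
    return (f + lam * p_coll) / (len(doc_tokens) + lam)

doc = "a a b".split()
coll = "a b b b".split()
print(jelinek_mercer("a", doc, coll, lam=0.0))  # pure document estimate: 2/3
print(dirichlet("a", doc, coll, lam=1))         # (2 + 1*0.25) / (3 + 1) = 0.5625
```

Note the difference in behavior: Jelinek-Mercer mixes with a fixed weight, while Dirichlet's effective smoothing weight $\lambda/(\sum_i f_{i,j} + \lambda)$ shrinks as the document gets longer, giving long documents a built-in form of length normalization.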