# Homework 4 (Vattanac's)

| Exercises |
|-----------|
| [Part I](#Part-I) |
| [Part II](#Part-II) |
| [Part III](#Part-III) |
| [Equations](#Equations) |

***All $\log$ calculations are done with base 2***

# Part I

You are looking for information on economic growth in Scotland in a large document collection. You decide to search using the terms **economy**, **growth**, **Scotland**, **banks** and **business** in an information retrieval system, which recommends three possible documents. You are given the frequency of each term in each document, shown in the table below:

| Terms | economy($t_1$) | Scotland($t_2$) | growth($t_3$) | banks($t_4$) | business($t_5$) |
|-------|---------|----------|--------|-------|----------|
| Doc1 | 10 | 8 | 0 | 2 | 1 |
| Doc2 | 0 | 0 | 9 | 9 | 8 |
| Doc3 | 2 | 2 | 4 | 4 | 6 |
| q1 | 1 | 1 | 1 | 1 | 1 |

1. Compute the $tf-idf$ of the terms in each document $(w_{i,j})$.
2. Determine the similarity between query $q_1$ and each document using

   a. Cosine similarity: $sim(d_j,q) = \dfrac{\sum_{i=1}^tw_{i,j}\times w_{i,q}}{\sqrt{\sum_{i=1}^tw_{i,j}^2}\times\sqrt{\sum_{i=1}^tw_{i,q}^2}}$

   b. Euclidean distance: $sim(d_j,q)=\sqrt{\sum_{i=1}^t(w_{i,j} - w_{i,q})^2}$

3. Rank the documents according to the similarity in **2.a** and **2.b**.

## Answers

1. Compute $tf-idf$ of the terms in each document $(w_{i,j})$.
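As a cross-check on the hand calculation in this answer, here is a minimal Python sketch of the same weighting scheme ($tf = 1 + \log_2 f$, $idf = \log_2(N/n_i)$). It assumes that a term with frequency 0 gets $tf = 0$ rather than an undefined value; the names `freqs`, `tf`, `df`, `idf`, and `weights` are illustrative and not part of the assignment.

```python
import math

# Raw term frequencies from the Part I table
# (column order: economy, Scotland, growth, banks, business).
freqs = {
    "Doc1": [10, 8, 0, 2, 1],
    "Doc2": [0, 0, 9, 9, 8],
    "Doc3": [2, 2, 4, 4, 6],
}

def tf(f):
    # Assumption: a term that does not occur gets tf = 0 instead of 1 + log2(0).
    return 1 + math.log2(f) if f > 0 else 0.0

n_docs = len(freqs)
n_terms = 5

# Document frequency n_i: number of documents that contain term i.
df = [sum(1 for row in freqs.values() if row[i] > 0) for i in range(n_terms)]
idf = [math.log2(n_docs / n) for n in df]

# tf-idf weight w_{i,j} for every term i in every document j.
weights = {
    doc: [tf(f) * idf[i] for i, f in enumerate(row)]
    for doc, row in freqs.items()
}

for doc, w in weights.items():
    print(doc, [round(x, 2) for x in w])
```

Because the sketch keeps full precision instead of cutting intermediate values to two decimals, its results can differ from the hand-computed ones in the second decimal place (for example, roughly 2.53 instead of 2.50 for economy in Doc1).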
### Calculate the $tf$ of each term in each document

#### Doc1

$tf_{t_1, d_1} = 1 + \log{f_{t_1, d_1}} = 1 + \log10 = 4.32$

$tf_{t_2, d_1} = 1 + \log{f_{t_2, d_1}} = 1 + \log8 = 4$

$tf_{t_3, d_1} = 1 + \log{f_{t_3, d_1}} = 1 + \log0 = \text{undefined}$

$tf_{t_4, d_1} = 1 + \log{f_{t_4, d_1}} = 1 + \log2 = 2$

$tf_{t_5, d_1} = 1 + \log{f_{t_5, d_1}} = 1 + \log1 = 1$

#### Doc2

$tf_{t_1, d_2} = 1 + \log{f_{t_1, d_2}} = 1 + \log0 = \text{undefined}$

$tf_{t_2, d_2} = 1 + \log{f_{t_2, d_2}} = 1 + \log0 = \text{undefined}$

$tf_{t_3, d_2} = 1 + \log{f_{t_3, d_2}} = 1 + \log9 = 4.17$

$tf_{t_4, d_2} = 1 + \log{f_{t_4, d_2}} = 1 + \log9 = 4.17$

$tf_{t_5, d_2} = 1 + \log{f_{t_5, d_2}} = 1 + \log8 = 4$

#### Doc3

$tf_{t_1, d_3} = 1 + \log{f_{t_1, d_3}} = 1 + \log2 = 2$

$tf_{t_2, d_3} = 1 + \log{f_{t_2, d_3}} = 1 + \log2 = 2$

$tf_{t_3, d_3} = 1 + \log{f_{t_3, d_3}} = 1 + \log4 = 3$

$tf_{t_4, d_3} = 1 + \log{f_{t_4, d_3}} = 1 + \log4 = 3$

$tf_{t_5, d_3} = 1 + \log{f_{t_5, d_3}} = 1 + \log6 = 3.58$

Where a term does not occur ($f = 0$), $\log 0$ is undefined. Those terms are treated as having weight 0 in the similarity calculations further down.

### Calculate the $idf$ of each term in the collection

economy: $idf_{t_1} = \log\dfrac{3}{2} = 0.58$

Scotland: $idf_{t_2} = \log\dfrac{3}{2} = 0.58$

growth: $idf_{t_3} = \log\dfrac{3}{2} = 0.58$

banks: $idf_{t_4} = \log\dfrac{3}{3} = 0$

business: $idf_{t_5} = \log\dfrac{3}{3} = 0$

### Calculate the $tf-idf$ of each term in each document

#### economy

$tf_{t_1,d_1}-idf_{t_1} = tf_{t_1,d_1} \times idf_{t_1} = 4.32 \times 0.58 = 2.50$

$tf_{t_1,d_2}-idf_{t_1} = tf_{t_1,d_2} \times idf_{t_1} = \text{undefined}$

$tf_{t_1,d_3}-idf_{t_1} = tf_{t_1,d_3} \times idf_{t_1} = 2 \times 0.58 = 1.16$

#### Scotland

$tf_{t_2,d_1}-idf_{t_2} = tf_{t_2,d_1} \times idf_{t_2} = 4 \times 0.58 = 2.32$

$tf_{t_2,d_2}-idf_{t_2} = tf_{t_2,d_2} \times idf_{t_2} = \text{undefined}$

$tf_{t_2,d_3}-idf_{t_2} = tf_{t_2,d_3} \times idf_{t_2} = 2 \times 0.58 = 1.16$

#### growth

$tf_{t_3,d_1}-idf_{t_3} = tf_{t_3,d_1} \times idf_{t_3} = \text{undefined}$

$tf_{t_3,d_2}-idf_{t_3} = tf_{t_3,d_2} \times idf_{t_3} = 4.17 \times 0.58 = 2.41$

$tf_{t_3,d_3}-idf_{t_3} = tf_{t_3,d_3} \times idf_{t_3} = 3 \times 0.58 = 1.74$

#### banks

$tf_{t_4,d_1}-idf_{t_4} = tf_{t_4,d_1} \times idf_{t_4} = 2 \times 0 = 0$

$tf_{t_4,d_2}-idf_{t_4} = tf_{t_4,d_2} \times idf_{t_4} = 4.17 \times 0 = 0$

$tf_{t_4,d_3}-idf_{t_4} = tf_{t_4,d_3} \times idf_{t_4} = 3 \times 0 = 0$

#### business

$tf_{t_5,d_1}-idf_{t_5} = tf_{t_5,d_1} \times idf_{t_5} = 1 \times 0 = 0$

$tf_{t_5,d_2}-idf_{t_5} = tf_{t_5,d_2} \times idf_{t_5} = 4 \times 0 = 0$

$tf_{t_5,d_3}-idf_{t_5} = tf_{t_5,d_3} \times idf_{t_5} = 3.58 \times 0 = 0$

2. Determine the similarity between query $q_1$ and each document using a. Cosine Similarity

### Calculate the $tf-idf$ of each term of $q_1$

#### Calculate the $tf$ of each term of $q_1$

$tf_{t_1, q_1} = 1 + \log{f_{t_1, q_1}} = 1 + \log1 = 1$

$tf_{t_2, q_1} = 1 + \log{f_{t_2, q_1}} = 1 + \log1 = 1$

$tf_{t_3, q_1} = 1 + \log{f_{t_3, q_1}} = 1 + \log1 = 1$

$tf_{t_4, q_1} = 1 + \log{f_{t_4, q_1}} = 1 + \log1 = 1$

$tf_{t_5, q_1} = 1 + \log{f_{t_5, q_1}} = 1 + \log1 = 1$

#### Calculate the $idf$ of each term in $q_1$

The $idf$ values are the same as in the previous $idf$ calculations.

#### Calculate the $tf-idf$ or $w$ of each term of the query

This uses the 2nd version, in which the query isn't considered a document (Juden used this version, so Imma follow like a sheep).
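A minimal sketch of that choice, assuming the two-decimal tf-idf weights from the hand calculation: the query reuses the document $idf$ values (it is not counted as a document), and the resulting query vector is compared with each document vector by cosine similarity and by Euclidean distance. The names `docs`, `query`, `cosine`, and `euclidean` are illustrative only, and undefined weights are assumed to be 0.

```python
import math

# tf-idf weights as cut to two decimals in the hand calculation
# (order: economy, Scotland, growth, banks, business); undefined entries taken as 0.
docs = {
    "Doc1": [2.50, 2.32, 0.0, 0.0, 0.0],
    "Doc2": [0.0, 0.0, 2.41, 0.0, 0.0],
    "Doc3": [1.16, 1.16, 1.74, 0.0, 0.0],
}

# Query weights: tf = 1 for every term, idf taken from the three documents only.
idf = [0.58, 0.58, 0.58, 0.0, 0.0]
query = [1 * x for x in idf]

def cosine(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    # Straight-line distance between the two weight vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cos = {name: cosine(w, query) for name, w in docs.items()}
dist = {name: euclidean(w, query) for name, w in docs.items()}

# Higher cosine means more similar; lower distance means more similar.
by_cosine = sorted(cos, key=cos.get, reverse=True)
by_distance = sorted(dist, key=dist.get)
print(by_cosine)
print(by_distance)
```

The two sort directions differ on purpose: cosine is a similarity, so it is sorted in descending order, while Euclidean distance is sorted in ascending order.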
$w_{t_1,q_1} = tf_{t_1,q_1} \times idf_{t_1} = 1 \times 0.58 = 0.58$

$w_{t_2,q_1} = tf_{t_2,q_1} \times idf_{t_2} = 1 \times 0.58 = 0.58$

$w_{t_3,q_1} = tf_{t_3,q_1} \times idf_{t_3} = 1 \times 0.58 = 0.58$

$w_{t_4,q_1} = tf_{t_4,q_1} \times idf_{t_4} = 1 \times 0 = 0$

$w_{t_5,q_1} = tf_{t_5,q_1} \times idf_{t_5} = 1 \times 0 = 0$

### Calculate Cosine Similarity between $q_1$ and each document

The given equation to calculate the Cosine Similarity:

$sim(d_j,q) = \dfrac{\sum_{i=1}^tw_{i,j}\times w_{i,q}}{\sqrt{\sum_{i=1}^tw_{i,j}^2}\times\sqrt{\sum_{i=1}^tw_{i,q}^2}}$

Thus:

*(Skipping the ones that aren't defined or their $tf-idf$/$w$ value is 0)*

$sim(d_1,q_1) = \dfrac{(w_{t_1,d_1} \times w_{t_1, q_1})+(w_{t_2,d_1} \times w_{t_2, q_1})}{\sqrt{w_{t_1,d_1}^2+w_{t_2,d_1}^2} \times \sqrt{w_{t_1,q_1}^2+w_{t_2,q_1}^2+w_{t_3,q_1}^2}} = \dfrac{(2.50 \times 0.58) + (2.32 \times 0.58)}{\sqrt{2.50^2 +2.32^2} \times \sqrt{0.58^2 + 0.58^2 + 0.58^2}} = 0.81$

$sim(d_2,q_1) = \dfrac{w_{t_3,d_2} \times w_{t_3, q_1}}{\sqrt{w_{t_3,d_2}^2} \times \sqrt{w_{t_1,q_1}^2+w_{t_2,q_1}^2+w_{t_3,q_1}^2}} = \dfrac{2.41 \times 0.58}{\sqrt{2.41^2} \times \sqrt{0.58^2 + 0.58^2 + 0.58^2}} = 0.57$

$sim(d_3,q_1) = \dfrac{(w_{t_1,d_3} \times w_{t_1, q_1})+(w_{t_2,d_3} \times w_{t_2, q_1}) + (w_{t_3,d_3} \times w_{t_3, q_1})}{\sqrt{w_{t_1,d_3}^2+w_{t_2,d_3}^2+w_{t_3,d_3}^2} \times \sqrt{w_{t_1,q_1}^2+w_{t_2,q_1}^2+w_{t_3,q_1}^2}} = \dfrac{(1.16 \times 0.58) + (1.16 \times 0.58) + (1.74 \times 0.58)}{\sqrt{1.16^2 + 1.16^2 + 1.74^2} \times \sqrt{0.58^2 + 0.58^2 + 0.58^2}} = 0.98$

2. Determine the similarity between query $q_1$ and each document using b.
Euclidean Distance

### Calculate the Euclidean Distance

Given the formula:

$sim(d_j,q)=\sqrt{\sum_{i=1}^t(w_{i,j} - w_{i,q})^2}$

So, taking undefined document weights as 0:

$sim(d_1,q_1) = \sqrt{(w_{t_1,d_1} - w_{t_1, q_1})^2+(w_{t_2,d_1} - w_{t_2, q_1})^2 + (w_{t_3,d_1} - w_{t_3, q_1})^2} = \sqrt{(2.50 - 0.58)^2+(2.32 - 0.58)^2+(-0.58)^2} = 2.65$

$sim(d_2,q_1) = \sqrt{(w_{t_1,d_2} - w_{t_1, q_1})^2+(w_{t_2,d_2} - w_{t_2, q_1})^2 + (w_{t_3,d_2} - w_{t_3, q_1})^2} = \sqrt{(-0.58)^2+(-0.58)^2+(2.41 - 0.58)^2} = 2.00$

$sim(d_3,q_1) = \sqrt{(w_{t_1,d_3} - w_{t_1, q_1})^2+(w_{t_2,d_3} - w_{t_2, q_1})^2 + (w_{t_3,d_3} - w_{t_3, q_1})^2} = \sqrt{(1.16 - 0.58)^2+(1.16 - 0.58)^2+(1.74 - 0.58)^2} = 1.42$

3. Rank the documents according to **Cosine Similarity** and **Euclidean Distance**

From the values above, cosine similarity (higher means more similar) ranks the documents Doc3 (0.98), Doc1 (0.81), Doc2 (0.57). Euclidean distance (lower means more similar) ranks them Doc3 (1.42), Doc2 (2.00), Doc1 (2.65).

# Part II

Given a term-document matrix in which $f_{i,j}$ is the frequency of occurrence of index term $k_i$ in document $d_j$:

| |$k_1$|$k_2$|$k_3$|$k_4$|$k_5$|$k_6$|$k_7$|
|-|-|-|-|-|-|-|-|
|$d_1$|157|4|232|0|57|2|2|
|$d_2$|73|157|227|10|0|0|0|
|$d_3$|0|0|0|0|0|3|1|
|$d_4$|0|1|2|0|0|5|1|
|$d_5$|0|0|1|0|0|5|1|
|$d_6$|0|0|1|0|0|1|0|

Given $q_2 = \{k_2, k_3, k_5\}$, rank the documents according to the probabilistic model:

$sim(d_j, q) = \sum_{k_i[q, d_j]} \log\dfrac{N + 0.5}{n_i + 0.5}$

where the sum runs over the index terms $k_i$ that occur in both the query and document $d_j$.

## Answers

$N = 6$ (total number of documents)

For index term $k_2$: $n_i = 3$ (documents $d_1$, $d_2$ and $d_4$ contain $k_2$), so its weight is $\log\dfrac{6.5}{3.5} = 0.89$

For index term $k_3$: $n_i = 5$ (documents $d_1$, $d_2$, $d_4$, $d_5$ and $d_6$ contain $k_3$), so its weight is $\log\dfrac{6.5}{5.5} = 0.24$

For index term $k_5$: $n_i = 1$ (only document $d_1$ contains $k_5$), so its weight is $\log\dfrac{6.5}{1.5} = 2.11$

Each document's score is the sum of the weights of the query terms it contains:

$sim(d_1,q_2) = 0.89 + 0.24 + 2.11 = 3.24$ (contains $k_2$, $k_3$, $k_5$)

$sim(d_2,q_2) = 0.89 + 0.24 = 1.13$ (contains $k_2$, $k_3$)

$sim(d_3,q_2) = 0$ (contains none of the query terms)

$sim(d_4,q_2) = 0.89 + 0.24 = 1.13$ (contains $k_2$, $k_3$)

$sim(d_5,q_2) = 0.24$ (contains only $k_3$)

$sim(d_6,q_2) = 0.24$ (contains only $k_3$)

Ranking the documents in descending order of similarity score (the higher the score, the more relevant the document is to the query), with $d_2$ and $d_4$ tied and $d_5$ and $d_6$ tied:

1. Document d1
2. Document d2
3. Document d4
4. Document d5
5. Document d6
6. Document d3

# Part III

Consider an information need for which there are 4 **relevant documents** in the collection. Two systems are run on this collection. Their top 10 results are judged for relevance as follows (with the leftmost being the top-ranked search result).

| |1|2|3|4|5|6|7|8|9|10|
|-|-|-|-|-|-|-|-|-|-|-|
|System 1|R|N|R|N|N|N|N|N|R|R|
|System 2|N|R|N|N|R|R|R|N|N|N|

## Compute the precision and recall at the top 5

| |1|2|3|4|5|
|-|-|-|-|-|-|
|System 1|R|N|R|N|N|
|Precision|$\dfrac{1}{1}$|$\dfrac{1}{2}$|$\dfrac{2}{3}$|$\dfrac{2}{4}$|$\dfrac{2}{5}$|
|Recall|0.25|0.25|0.5|0.5|0.5|
|System 2|N|R|N|N|R|
|Precision|$\dfrac{0}{1}$|$\dfrac{1}{2}$|$\dfrac{1}{3}$|$\dfrac{1}{4}$|$\dfrac{2}{5}$|
|Recall|0|0.25|0.25|0.25|0.5|

At the top 5, the precision is $\dfrac{2}{5} = 0.4$ and the recall is $\dfrac{2}{4} = 0.5$ for both systems.

## Compute the precision and the recall at the top 10

| |1|2|3|4|5|6|7|8|9|10|
|-|-|-|-|-|-|-|-|-|-|-|
|System 1|R|N|R|N|N|N|N|N|R|R|
|Precision|$\dfrac{1}{1}$|$\dfrac{1}{2}$|$\dfrac{2}{3}$|$\dfrac{2}{4}$|$\dfrac{2}{5}$|$\dfrac{2}{6}$|$\dfrac{2}{7}$|$\dfrac{2}{8}$|$\dfrac{3}{9}$|$\dfrac{4}{10}$|
|Recall|0.25|0.25|0.5|0.5|0.5|0.5|0.5|0.5|0.75|1|
|System 2|N|R|N|N|R|R|R|N|N|N|
|Precision|$\dfrac{0}{1}$|$\dfrac{1}{2}$|$\dfrac{1}{3}$|$\dfrac{1}{4}$|$\dfrac{2}{5}$|$\dfrac{3}{6}$|$\dfrac{4}{7}$|$\dfrac{4}{8}$|$\dfrac{4}{9}$|$\dfrac{4}{10}$|
|Recall|0|0.25|0.25|0.25|0.5|0.75|1|1|1|1|

At the top 10, the precision is $\dfrac{4}{10} = 0.4$ and the recall is $\dfrac{4}{4} = 1$ for both systems.

# Equations

TODO
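
Until this section is filled in, here is a hedged summary of the formulas used in Parts I and III, collected from the worked answers. The precision and recall definitions are the standard ones and are not written out explicitly in the answers; the labels $P@k$, $R@k$, and $dist$ are notation assumed here for readability.

```latex
% Term weighting (Part I), all logarithms base 2
tf_{i,j} = 1 + \log f_{i,j} \quad (f_{i,j} > 0), \qquad
idf_i = \log\frac{N}{n_i}, \qquad
w_{i,j} = tf_{i,j} \times idf_i

% Similarity measures (Part I)
sim(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}
                   {\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}, \qquad
dist(d_j, q) = \sqrt{\sum_{i=1}^{t} (w_{i,j} - w_{i,q})^2}

% Precision and recall at rank k (Part III)
P@k = \frac{\text{relevant documents in the top } k}{k}, \qquad
R@k = \frac{\text{relevant documents in the top } k}{\text{relevant documents in the collection}}
```

Here $N$ is the number of documents, $n_i$ is the number of documents containing term $i$, and $f_{i,j}$ is the raw frequency of term $i$ in document $j$.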