# Homework 4 (Sivtheng's)
| Exercises |
|-----------|
| [Part I](#Part-I) |
| [Part II](#Part-II) |
| [Part III](#Part-III) |
| [Equations](#Equations) |
# Part I
You are looking for information on the Economic Growth in Scotland in a large document collection. You decide to search using the terms: economy, growth, Scotland, banks and business, using an information retrieval system and this recommends three possible documents. You are given the frequency of each of the terms in each document, shown in the table below:
| Terms | economy | Scotland | growth | banks | business |
|-------|---------|----------|--------|-------|----------|
| Doc1 | 10 | 8 | 0 | 2 | 1 |
| Doc2 | 0 | 0 | 9 | 9 | 8 |
| Doc3 | 2 | 2 | 4 | 4 | 6 |
| q1 | 1 | 1 | 1 | 1 | 1 |
## Compute $tf-idf$ of the terms of each document ($W_{i,j}$) using $\log_2$
### Document 1
$tf_{economy}idf_{Doc1} = (1 + \log_2(10))\cdot\log_2(3/2) = (1+3.32)\cdot0.58 = 2.50$
$tf_{Scotland}idf_{Doc1} = (1 + \log_2(8))\cdot\log_2(3/2) = (1+3)\cdot0.58 = 2.32$
$tf_{growth}idf_{Doc1} = (1 + \log_2(0))\cdot\log_2(3/2) = Undefined$
$tf_{banks}idf_{Doc1} = (1 + \log_2(2))\cdot\log_2(3/3) = (1+1)\cdot0 = 0$
$tf_{business}idf_{Doc1} = (1 + \log_2(1))\cdot\log_2(3/3) = (1+0)\cdot0 = 0$
### Document 2
$tf_{economy}idf_{Doc2} = (1 + \log_2(0))\cdot\log_2(3/2) = Undefined$
$tf_{Scotland}idf_{Doc2} = (1 + \log_2(0))\cdot\log_2(3/2) = Undefined$
$tf_{growth}idf_{Doc2} = (1 + \log_2(9))\cdot\log_2(3/2) = (1+3.16)\cdot0.58 = 2.41$
$tf_{banks}idf_{Doc2} = (1 + \log_2(9))\cdot\log_2(3/3) = (1+3.16)\cdot0 = 0$
$tf_{business}idf_{Doc2} = (1 + \log_2(8))\cdot\log_2(3/3) = (1+3)\cdot0 = 0$
### Document 3
$tf_{economy}idf_{Doc3} = (1 + \log_2(2))\cdot\log_2(3/2) = (1+1)\cdot0.58 = 1.16$
$tf_{Scotland}idf_{Doc3} = (1 + \log_2(2))\cdot\log_2(3/2) = (1+1)\cdot0.58 = 1.16$
$tf_{growth}idf_{Doc3} = (1 + \log_2(4))\cdot\log_2(3/2) = (1+2)\cdot0.58 = 1.74$
$tf_{banks}idf_{Doc3} = (1 + \log_2(4))\cdot\log_2(3/3) = (1+2)\cdot0 = 0$
$tf_{business}idf_{Doc3} = (1 + \log_2(6))\cdot\log_2(3/3) = (1+2.58)\cdot0 = 0$
## Determine the similarity between query $q_1$ and each document using cosine similarity and euclidean distance
### Cosines Similarity
The formula to calculate the cosine similarity is denoted by:
$\cos(a, b)=\dfrac{(a_1 \cdot b_1) + (a_2 \cdot b_2)}{\sqrt{(a_1^2+a_2^2)}\cdot \sqrt{b_1^2+b_2^2}}$
**The closer the value is to 1, the more relevant the document is.**
$\cos(Doc1, q_1) = \dfrac{10 + 8 + 0 + 2 + 1}{\sqrt{10^2+8^2+0^2+2^2+1^2}\cdot\sqrt{1^2+1^2+1^2+1^2+1^2}}= 0.72$
$\cos(Doc2, q_1) = \dfrac{0 + 0 + 9 + 9 + 8}{\sqrt{0^2+0^2+9^2+9^2+8^2}\cdot\sqrt{1^2+1^2+1^2+1^2+1^2}}= 0.77$
$\cos(Doc3, q_1) = \dfrac{2 + 2 + 4 + 4 + 6}{\sqrt{2^2+2^2+4^2+4^2+6^2}\cdot\sqrt{1^2+1^2+1^2+1^2+1^2}}= 0.92$
### Euclidean Distance
The formula to calculate the euclidean distance is denoted by:
$d(a,b) = \sqrt{(a_2-a_1)^2+(b_2-b_1)^2}$
**The lower the value, the more relevant the document is.**
#### Document 1
$d(Doc1,q_1)=\sqrt{(10-1)^2+(8-1)^2+(0-1)^2+(2-1)^2+(1-1)^2}=11.48$
#### Document 2
$d(Doc2,q_1)=\sqrt{(0-1)^2+(0-1)^2+(9-1)^2+(9-1)^2+(8-1)^2}=13.37$
#### Document 3
$d(Doc1,q_1)=\sqrt{(2-1)^2+(2-1)^2+(4-1)^2+(4-1)^2+(6-1)^2}=6.70$
### Ranking the Documents
#### According to Cosine Similarit, ranked in order of highest to lowest relevancy (the higher the value, the more relevant it is)
1. Doc3
2. Doc2
3. Doc1
#### According to Euclidean Distance, ranked in order of highest to lowest relevancy (the lower the value, the more relevant it is)
1. Doc3
2. Doc1
3. Doc2
# Part II
Given a term-document matrix with respect to $f_{i,j}$ is the frequency of occurence of index term $k_i$ in the document $d_j$
||$k_1$|$k_2$|$k_3$|$k_4$|$k_5$|$k_6$|$k_7$|
|-|-|-|-|-|-|-|-|
|$d_1$|157|4|232|0|57|2|2|
|$d_2$|73|157|227|10|0|0|0|
|$d_3$|0|0|0|0|0|3|1|
|$d_4$|0|1|2|0|0|5|1|
|$d_5$|0|0|1|0|0|5|1|
|$d_6$|0|0|1|0|0|1|0|
Give $q_2 - \{k_2, k_3, k_5\}$, rank the documents according to probabilistic model: $sim(d_j, q)$
Assuming this table represents the entire document collection. We can find $N$ by summing up all the values in the table (needs fact checking):
$N = 157 + 4 + 232 + 57 + 2 + 2 + 73 + 157 + 227 + 10 + 3 + 1 + 1 + 2 + 5 + 1 + 1 + 5 + 1 + 1 + 1 = 943
Thus:
$sim(d_1,q)=\log\dfrac{943+0.5}{4+0.5}+\log\dfrac{943+0.5}{232+0.5}+\log\dfrac{943+0.5}{57+0.5} = 4.14$
$sim(d_2,q)=\log\dfrac{943+0.5}{157+0.5}+\log\dfrac{943+0.5}{227+0.5}+\log\dfrac{943+0.5}{0+0.5} = 4.67$
$sim(d_3,q)=\log\dfrac{943+0.5}{0+0.5}+\log\dfrac{943+0.5}{0+0.5}+\log\dfrac{943+0.5}{0+0.5} = 9.82$
$sim(d_4,q)=\log\dfrac{943+0.5}{1+0.5}+\log\dfrac{943+0.5}{2+0.5}+\log\dfrac{943+0.5}{0+0.5} = 8.65$
$sim(d_5,q)=\log\dfrac{943+0.5}{0+0.5}+\log\dfrac{943+0.5}{1+0.5}+\log\dfrac{943+0.5}{0+0.5} = 9.35$
$sim(d_6,q)=\log\dfrac{943+0.5}{0+0.5}+\log\dfrac{943+0.5}{1+0.5}+\log\dfrac{943+0.5}{0+0.5} = 8.87$
**The above calculations may have been done by using $\log_{10}$**
The documents are ranked from highest to lowest relevancy (the higher the value/score, the more relevant it is)
1. $d_3$
2. $d_5$
3. $d_6$
4. $d_4$
5. $d_2$
6. $d_1$
# Part III
Consider an information need for which there are 4 **relevant document** in the collection. 2 systems run on this collection. Their top 10 results are judged for relevance as followed (with the leftmost being the top-ranked search result).
| |1|2|3|4|5|6|7|8|9|10|
|-|-|-|-|-|-|-|-|-|-|-|
|System 1|R|N|R|N|N|N|N|N|R|R|
|System 2|N|R|N|N|R|R|R|N|N|N|
## Compute the precision and recall at the top 5
| |1|2|3|4|5|
|-|-|-|-|-|-|
|System 1|R|N|R|N|N|
|Precision|$\dfrac{1}{1}$|$\dfrac{1}{2}$|$\dfrac{2}{3}$|$\dfrac{2}{4}$|$\dfrac{2}{5}$|
|Recall|0.25|0.25|0.5|0.5|0.5|
|System 2|N|R|N|N|R|
|Precision|$\dfrac{0}{1}$|$\dfrac{1}{2}$|$\dfrac{1}{3}$|$\dfrac{1}{4}$|$\dfrac{2}{5}$|
|Recall|0|0.25|0.25|0.25|0.5|
At the top 5, the precision is $\dfrac{2}{5} = 0.4$ and the recall is $\dfrac{2}{4} = 0.5$ for both systems.
## Compute the precision and the recall at the top 10
| |1|2|3|4|5|6|7|8|9|10|
|-|-|-|-|-|-|-|-|-|-|-|
|System 1|R|N|R|N|N|N|N|N|R|R|
|Precision|$\dfrac{1}{1}$|$\dfrac{1}{2}$|$\dfrac{2}{3}$|$\dfrac{2}{4}$|$\dfrac{2}{5}$|$\dfrac{2}{6}$|$\dfrac{2}{7}$|$\dfrac{2}{8}$|$\dfrac{3}{9}$|$\dfrac{4}{10}$|
|Recall|0.25|0.25|0.5|0.5|0.5|0.5|0.5|0.5|0.75|1|
|System 2|N|R|N|N|R|R|R|N|N|N|
|Precision|$\dfrac{0}{1}$|$\dfrac{1}{2}$|$\dfrac{1}{3}$|$\dfrac{1}{4}$|$\dfrac{2}{5}$|$\dfrac{3}{6}$|$\dfrac{4}{7}$|$\dfrac{4}{8}$|$\dfrac{4}{9}$|$\dfrac{4}{10}$|
|Recall|0|0.25|0.25|0.25|0.5|0.75|1|1|1|1|
At the top 10, the precision is $\dfrac{4}{10} = 0.4$ and the recall is $\dfrac{4}{4} = 1$ for both systems.
# Equations
TODO