# IR final paper
## what is smoothing?
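Smoothing addresses the zero-probability problem: a word that does not show up in a document would otherwise get probability 0 under that document's language model.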
### linear interpolation smoothing

### Dirichlet smoothing

| term | content|
|------|--------|
| D(w) | the count of the word found in the document |
| \|D\| | length of the document |
| m | hyperparameter |
| p | typically set to the relative frequency of the word in the collection (its count in the collection divided by the collection length) |
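
Assuming the table refers to the usual Dirichlet-prior estimate, $P(w|D) = \dfrac{D(w) + m * p}{|D| + m}$, a minimal Python sketch could look like this. The function name and the numbers are made up for illustration, and `p` is passed in as a probability.

```python
def dirichlet_prob(count_in_doc: int, doc_len: int, m: float, p: float) -> float:
    """Dirichlet-smoothed estimate (D(w) + m * p) / (|D| + m)."""
    return (count_in_doc + m * p) / (doc_len + m)


# Hypothetical values: a 100-word document, m = 1000, collection probability 0.001.
seen = dirichlet_prob(count_in_doc=3, doc_len=100, m=1000, p=0.001)
unseen = dirichlet_prob(count_in_doc=0, doc_len=100, m=1000, p=0.001)

# A word that never occurs in the document still gets a small non-zero probability.
print(seen, unseen)
```

A larger `m` pulls the estimate further towards the collection probability `p`, which matters most for short documents.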

### DS vs LIS

## LM-DS

$w_{d,t} = log(\dfrac{{\color{red}{|d|}}}{|d| + \mu} * \dfrac{f_{d,t}}{{\color{red}{|d|}}} + \dfrac{\mu}{|d|+\mu} * \color{blue}{\dfrac{F_t}{|C|}})$
$=log(\dfrac{f_{d,t}}{|d|+\mu} + \dfrac{\mu * \color{blue}{\dfrac{F_t}{|C|}}}{|d| + \mu})$
$=log(\dfrac{f_{d,t}}{|d|+\mu} + \dfrac{\mu * \color{blue}{P(t|C)}}{|d| + \mu})$
$f_{q,t}$ => the term frequency (tf) of the word in the query
$w_{d,t}$ => the smoothed document-term weight (the log expression above); it reduces the impact of common words by taking the likelihood of the term in the collection into account
| variable | content |
| -------- | ------- |
| $\|d\|$ | the document length |
| $\mu$ | hyperparameter |
| $f_{d,t}$| the frequency of the term in document |
| $F_t$ | the frequency of the term in full collection |
| $\|C\|$ | the length of the collection |

Problem => even if the word is not in a document, $w_{d,t}$ is **not** 0, so every document gets a non-zero contribution from every query term.
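
To make the problem concrete, here is a small Python sketch of the per-term weight built from the variables in the table. The toy collection, the function name `lmds_weight` and the value of `mu` are assumptions for illustration only, and the sketch assumes the term occurs somewhere in the collection ($F_t > 0$).

```python
import math
from collections import Counter


def lmds_weight(f_dt, doc_len, F_t, coll_len, mu):
    """w_{d,t} = log( f_dt/(|d|+mu) + mu/(|d|+mu) * F_t/|C| )."""
    return math.log((f_dt + mu * F_t / coll_len) / (doc_len + mu))


# Hypothetical toy collection of two documents.
docs = {"d1": ["a", "b", "c", "a"], "d2": ["b", "c", "d"]}
coll_freq = Counter(t for terms in docs.values() for t in terms)  # F_t
coll_len = sum(coll_freq.values())                                # |C|
mu = 10.0  # illustrative value only

# "a" never occurs in d2, yet its weight is a finite, non-zero number.
w_absent = lmds_weight(0, len(docs["d2"]), coll_freq["a"], coll_len, mu)
# In d1 the same term occurs twice and gets a larger weight.
w_present = lmds_weight(2, len(docs["d1"]), coll_freq["a"], coll_len, mu)
print(w_absent, w_present)
```

Because the absent-term weight is not 0, a scorer written this way has to touch every document for every query term.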
**Solution**

$log(\dfrac{{\color{red}{|d|}}}{|d| + \mu} * \dfrac{f_{d,t}}{{\color{red}{|d|}}} + \dfrac{\mu}{|d|+\mu} * {\dfrac{F_t}{|C|}})$
$=log(\dfrac{{\color{red}{\mu}}}{|d| + \mu} * \dfrac{f_{d,t}}{{\color{red}{\mu}}} + \dfrac{\mu}{|d|+\mu} * {\dfrac{F_t}{|C|}})$
$=log(\color{blue}{\dfrac{\mu}{|d| + \mu}} * \dfrac{f_{d,t}}{\mu} + \color{blue}{\dfrac{\mu}{|d| + \mu}} * {\dfrac{F_t}{|C|}})$
$=log(\color{blue}{\dfrac{\mu}{|d| + \mu}} * (\dfrac{f_{d,t}}{\mu} + {\dfrac{F_t}{|C|}}))$
$=log(\dfrac{\mu}{|d| + \mu} * \color{red}{\dfrac{F_t}{|C|}} * (\dfrac{f_{d,t}}{\mu} \color{red}* \color{red}{\dfrac{|C|}{F_t}} + \color{blue}1))$
$=log(\dfrac{\mu}{|d| + \mu}) + \color{red}{log(\dfrac{F_t}{|C|})} + \ log(\dfrac{f_{d,t}}{\mu} * {\dfrac{|C|}{F_t}} + 1)$
$log(\dfrac{F_t}{|C|})$ does not depend on the document, so dropping it leaves the ranking unchanged:
$\stackrel{rank}{=}log(\dfrac{\mu}{|d| + \mu}) + \ log(\dfrac{f_{d,t}}{\mu} * {\dfrac{|C|}{F_t}} + 1)$
$\sum_{\color{red}{t\in{q}}} \ f_{q,t} \ * \ \{log(\dfrac{\mu}{|d| + \mu}) + \ log(\dfrac{\color{red}{f_{d,t}}}{\mu} * {\dfrac{|C|}{F_t}} + \color{red}1)\}$
$= \color{blue}{\sum_{{t\in{q}}} \ f_{q,t} \ * \ log(\dfrac{\mu}{|d| + \mu})} + \ \sum_{{t\in{q}}} \ f_{q,t} \ * \ log(\dfrac{f_{d,t}}{\mu} * {\dfrac{|C|}{F_t}} + 1)$
The second log is $log(1) = 0$ when $f_{d,t} = 0$, so the second sum only needs the terms that occur in both the query and the document; the first sum still runs over every query term:
$= \color{blue}{\sum_{{t\in{q}}} \ f_{q,t} \ * \ log(\dfrac{\mu}{|d| + \mu})} + \ \sum_{\color{red}{t\in{q \cap d}}} \ f_{q,t} \ * \ log(\dfrac{f_{d,t}}{\mu} * {\dfrac{|C|}{F_t}} + 1)$
Since $\sum_{t\in{q}} f_{q,t} = |q|$, the query length in words (if every query word occurs once, this is just the number of query words):
$=\color{blue}{ \ |q| \ * \ log(\dfrac{\mu}{|d| + \mu})} + \ \sum_{{t\in{q \cap d}}} \ f_{q,t} \ * \ log(\dfrac{f_{d,t}}{\mu} * {\dfrac{|C|}{F_t}} + 1)$
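
As a sanity check on the final form, the sketch below scores a toy collection two ways: with the original smoothed log weight for every query term, and with the rewritten form that only does per-term work for terms present in the document. The two scores are expected to differ only by $\sum_{t\in q} f_{q,t} * log(\dfrac{F_t}{|C|})$, which is the same for every document. All names, the toy data and `mu` are hypothetical.

```python
import math
from collections import Counter


def lmds_full(f_q, f_d, doc_len, coll_freq, coll_len, mu):
    """Sum over all query terms of f_qt * log(smoothed probability)."""
    return sum(
        f_qt * math.log((f_d.get(t, 0) + mu * coll_freq[t] / coll_len) / (doc_len + mu))
        for t, f_qt in f_q.items()
    )


def lmds_fast(f_q, f_d, doc_len, coll_freq, coll_len, mu):
    """|q| * log(mu/(|d|+mu)) plus a sum over terms in both query and document."""
    q_len = sum(f_q.values())  # |q|, counting repeated words
    score = q_len * math.log(mu / (doc_len + mu))
    for t, f_qt in f_q.items():
        f_dt = f_d.get(t, 0)
        if f_dt:  # terms missing from the document contribute log(1) = 0
            score += f_qt * math.log(f_dt / mu * coll_len / coll_freq[t] + 1.0)
    return score


# Hypothetical toy data.
docs = {"d1": ["a", "b", "c", "a"], "d2": ["b", "c", "d"]}
coll_freq = Counter(t for terms in docs.values() for t in terms)
coll_len = sum(coll_freq.values())
f_q = Counter(["a", "d"])
mu = 10.0

full = {n: lmds_full(f_q, Counter(d), len(d), coll_freq, coll_len, mu) for n, d in docs.items()}
fast = {n: lmds_fast(f_q, Counter(d), len(d), coll_freq, coll_len, mu) for n, d in docs.items()}
# Document-independent term that the rewritten form leaves out.
const = sum(f_qt * math.log(coll_freq[t] / coll_len) for t, f_qt in f_q.items())
print(full, fast, const)
```

With an inverted index, the loop in `lmds_fast` would only visit the postings of the query terms, and the `|q| * log(...)` part needs nothing but the document length.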
# Our Experiments
| Model | Parameter | Stemmed | Score | rank |
| ----- | --------- | ------- | ------ | --------------------------|
| ATIRE | Essay | :ballot_box_with_check: | 0.113 | - |
| ATIRE | Customize | :x: | 0.089 | - |
| ATIRE | Customize | :ballot_box_with_check: | 0.104 | - |
| BM25L | Essay | :ballot_box_with_check: | 0.138 | :three: |
| BM25L | Customize | :ballot_box_with_check: | 0.139 | - |
| BM25L | Customize | :x: | 0.122 | - |
| BM25+ | Essay | :ballot_box_with_check: | 0.144 | :two: |
| BM25+ | Customize | :ballot_box_with_check: | 0.142 | - |
| BM25+ | Customize | :x: | 0.119 | - |
| DS | Essay | :ballot_box_with_check: | 0.144 | :one: |
| DS | Customize | :ballot_box_with_check: | 0.139 | - |
| PYP | Essay | :ballot_box_with_check: | 0.099 | - |
| PYP | Customize | :ballot_box_with_check: | 0.099 | - |