# IR final paper

## What is smoothing?

Smoothing addresses the problem of a word that does not appear in a document: without it, the word would get zero probability under that document's language model.

### Linear interpolation smoothing

![](https://i.imgur.com/uneKSwf.png)

### Dirichlet smoothing

![](https://i.imgur.com/DlQ7jjD.png)

| term | content |
|------|---------|
| D(w) | the count of the word found in the document |
| \|D\| | length of the document |
| m | hyperparameter |
| p | prior probability of the word, typically set to its relative frequency in the collection (its count in the collection divided by the collection length) |

![](https://i.imgur.com/RJSFlp8.png)

### DS vs LIS

![](https://i.imgur.com/fxpJt4r.png)

## LM-DS

![](https://i.imgur.com/TncLizw.png)

$\log(\dfrac{{\color{red}{|d|}}}{|d| + \mu} \cdot \dfrac{f_{d,t}}{{\color{red}{|d|}}} + \dfrac{\mu}{|d|+\mu} \cdot \color{blue}{\dfrac{F_t}{|C|}})$

$=\log(\dfrac{f_{d,t}}{|d|+\mu} + \dfrac{\mu \cdot \color{blue}{\dfrac{F_t}{|C|}}}{|d| + \mu})$

$=\log(\dfrac{f_{d,t}}{|d|+\mu} + \dfrac{\mu \cdot \color{blue}{P(w|C)}}{|d| + \mu})$

- $f_{q,t}$: the ordinary term frequency of the word in the query
- $w_{d,t}$: the smoothed document weight; smoothing reduces the impact of common words by taking into account the likelihood of the term in the collection

| variable | content |
| -------- | ------- |
| $\|d\|$ | the document length |
| $\mu$ | hyperparameter |
| $f_{d,t}$ | the frequency of the term in the document |
| $F_t$ | the frequency of the term in the full collection |
| $\|C\|$ | the length of the collection |

**Problem:** even if a word does not occur in a document, $w_{d,t}$ is **not** 0, so the sum cannot be restricted to the terms that the query and the document share.

**Solution:**

![](https://i.imgur.com/4gURSkQ.png)

$\log(\dfrac{{\color{red}{|d|}}}{|d| + \mu} \cdot \dfrac{f_{d,t}}{{\color{red}{|d|}}} + \dfrac{\mu}{|d|+\mu} \cdot {\dfrac{F_t}{|C|}})$

$=\log(\dfrac{{\color{red}{\mu}}}{|d| + \mu} \cdot \dfrac{f_{d,t}}{{\color{red}{\mu}}} + \dfrac{\mu}{|d|+\mu} \cdot {\dfrac{F_t}{|C|}})$

$=\log(\color{blue}{\dfrac{\mu}{|d| + \mu}} \cdot \dfrac{f_{d,t}}{\mu} + \color{blue}{\dfrac{\mu}{|d| + \mu}} \cdot {\dfrac{F_t}{|C|}})$

$=\log(\color{blue}{\dfrac{\mu}{|d| + \mu}} \cdot (\dfrac{f_{d,t}}{\mu} + {\dfrac{F_t}{|C|}}))$

Multiplying the inner sum by $\dfrac{|C|}{F_t}$ changes the value by $\log\dfrac{|C|}{F_t}$. That offset depends only on the term and the collection, not on the document, so the ranking of documents is unchanged:

$\overset{\text{rank}}{=}\log(\dfrac{\mu}{|d| + \mu} \cdot (\dfrac{f_{d,t}}{\mu} \color{red}\cdot \color{red}{\dfrac{|C|}{F_t}} + {\dfrac{F_t}{|C|}} \color{red}\cdot \color{red}{\dfrac{|C|}{F_t}}))$

$=\log(\dfrac{\mu}{|d| + \mu} \cdot (\dfrac{f_{d,t}}{\mu} \cdot {\dfrac{|C|}{F_t}} + \color{blue}1))$

$=\log(\dfrac{\mu}{|d| + \mu}) + \log(\dfrac{f_{d,t}}{\mu} \cdot {\dfrac{|C|}{F_t}} + 1)$

Summing over the query terms:

$\sum_{\color{red}{t\in{q}}} f_{q,t} \cdot \{\log(\dfrac{\mu}{|d| + \mu}) + \log(\dfrac{\color{red}{f_{d,t}}}{\mu} \cdot {\dfrac{|C|}{F_t}} + \color{red}1)\}$

For a query term that does not occur in the document, $f_{d,t} = 0$ and the second logarithm is $\log(1) = 0$, so the second sum can be restricted to $t \in q \cap d$. The first logarithm does not depend on $t$, so it stays summed over all query terms:

$= \color{blue}{\sum_{t\in{q}} f_{q,t} \cdot \log(\dfrac{\mu}{|d| + \mu})} + \sum_{\color{red}{t\in{q \cap d}}} f_{q,t} \cdot \log(\dfrac{f_{d,t}}{\mu} \cdot {\dfrac{|C|}{F_t}} + 1)$

Since $\sum_{t\in q} f_{q,t} = |q|$ (the query length, counting each occurrence of a word in the query as one):

$= \color{blue}{|q| \cdot \log(\dfrac{\mu}{|d| + \mu})} + \sum_{t\in{q \cap d}} f_{q,t} \cdot \log(\dfrac{f_{d,t}}{\mu} \cdot {\dfrac{|C|}{F_t}} + 1)$

# Our Experiments

| Model | Parameter | Stemmed | Score | Rank |
| ----- | --------- | ------- | ----- | ---- |
| ATIRE | Essay | :ballot_box_with_check: | 0.113 | - |
| ATIRE | Customize | :x: | 0.089 | - |
| ATIRE | Customize | :ballot_box_with_check: | 0.104 | - |
| BM25L | Essay | :ballot_box_with_check: | 0.138 | :three: |
| BM25L | Customize | :ballot_box_with_check: | 0.139 | - |
| BM25L | Customize | :x: | 0.122 | - |
| BM25+ | Essay | :ballot_box_with_check: | 0.144 | :two: |
| BM25+ | Customize | :ballot_box_with_check: | 0.142 | - |
| BM25+ | Customize | :x: | 0.119 | - |
| DS | Essay | :ballot_box_with_check: | 0.144 | :one: |
| DS | Customize | :ballot_box_with_check: | 0.139 | - |
| PYP | Essay | :ballot_box_with_check: | 0.099 | - |
| PYP | Customize | :ballot_box_with_check: | 0.099 | - |
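
As a companion to the LM-DS derivation, here is a minimal Python sketch that compares the direct Dirichlet-smoothed query likelihood with the rewritten form that only visits terms shared by the query and the document. The function names (`lmds_direct`, `lmds_rewritten`), the toy documents, and the value `mu = 10` are illustrative assumptions, not the code or settings used in the experiments. The two scores should differ only by a document-independent offset, $\sum_{t \in q} f_{q,t} \log\frac{F_t}{|C|}$, so they rank documents identically.

```python
import math
from collections import Counter

def lmds_direct(query, doc, coll_counts, coll_len, mu):
    """Sum over ALL query terms of f_qt * log((f_dt + mu * F_t/|C|) / (|d| + mu))."""
    d_counts, d_len = Counter(doc), len(doc)
    score = 0.0
    for t, f_qt in Counter(query).items():
        p_coll = coll_counts[t] / coll_len  # assumes every query term occurs in the collection
        score += f_qt * math.log((d_counts[t] + mu * p_coll) / (d_len + mu))
    return score

def lmds_rewritten(query, doc, coll_counts, coll_len, mu):
    """|q| * log(mu / (|d| + mu)) plus a sum over terms in BOTH query and document."""
    d_counts, d_len = Counter(doc), len(doc)
    score = len(query) * math.log(mu / (d_len + mu))
    for t, f_qt in Counter(query).items():
        if d_counts[t] > 0:  # terms absent from the document contribute log(1) = 0
            score += f_qt * math.log(d_counts[t] / mu * coll_len / coll_counts[t] + 1)
    return score

# Toy data (hypothetical): two tiny documents form the whole collection.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog ate the bone".split(),
}
collection = [tok for d in docs.values() for tok in d]
coll_counts, coll_len = Counter(collection), len(collection)
query = "cat mat".split()
mu = 10.0

for name, d in docs.items():
    direct = lmds_direct(query, d, coll_counts, coll_len, mu)
    rewritten = lmds_rewritten(query, d, coll_counts, coll_len, mu)
    print(name, round(direct, 4), round(rewritten, 4), round(direct - rewritten, 4))
```

The last printed column is the same for both documents, which is the dropped constant. In practice the rewritten form is the useful one because the inner loop can be driven by postings lists and never has to touch documents that contain none of the query terms.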
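
To make the two smoothing methods from the first section concrete, the sketch below computes a smoothed word probability both ways. It assumes the textbook forms: linear interpolation as $(1-\lambda)\,\frac{D(w)}{|D|} + \lambda\,p$ and Dirichlet smoothing as $\frac{D(w) + m\,p}{|D| + m}$, with $p$ the word's relative frequency in the collection. The formulas in the images above may use a different convention for which side $\lambda$ weights, and the names `linear_interpolation`, `dirichlet`, `lam`, and the toy texts are hypothetical.

```python
from collections import Counter

def linear_interpolation(word, doc_counts, doc_len, p_coll, lam=0.5):
    """(1 - lam) * D(w)/|D| + lam * p, with lam weighting the collection model (assumed convention)."""
    p_doc = doc_counts[word] / doc_len
    return (1 - lam) * p_doc + lam * p_coll

def dirichlet(word, doc_counts, doc_len, p_coll, m=10.0):
    """(D(w) + m * p) / (|D| + m); longer documents rely less on the collection prior."""
    return (doc_counts[word] + m * p_coll) / (doc_len + m)

# Toy data (hypothetical): one document and a slightly larger collection.
doc_tokens = "the cat sat on the mat".split()
coll_tokens = doc_tokens + "the dog ate the bone".split()
doc_counts, doc_len = Counter(doc_tokens), len(doc_tokens)
coll_counts, coll_len = Counter(coll_tokens), len(coll_tokens)

def p_collection(word):
    # Relative frequency of the word in the collection.
    return coll_counts[word] / coll_len

# "dog" never occurs in the document, yet both estimates stay above zero.
for w in ("cat", "dog"):
    p = p_collection(w)
    print(w,
          round(linear_interpolation(w, doc_counts, doc_len, p), 4),
          round(dirichlet(w, doc_counts, doc_len, p), 4))

# Both smoothed models remain proper distributions over the collection vocabulary.
total_li = sum(linear_interpolation(w, doc_counts, doc_len, p_collection(w)) for w in coll_counts)
total_ds = sum(dirichlet(w, doc_counts, doc_len, p_collection(w)) for w in coll_counts)
```

One difference worth noting: with linear interpolation the weight on the collection model is fixed at `lam`, while with Dirichlet smoothing the effective weight is `m / (|D| + m)`, so it shrinks as the document gets longer.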