# Truth Bug
###### tags: `side project`
### Project repository
https://github.com/ianbig/IanAndHenry
## Project Goal
A crawler that can gather ***clean, large-scale news data***
* how to produce clean data
* how to design a system that can handle large amounts of data
## Implementation Detail
### System Framework

### Main Text Extraction
#### Implementation
1. convert HTML to XHTML with tidy
* XHTML is stricter than HTML, which makes it easier to build a DOM tree. The main XHTML rules are:
* The DOCTYPE declaration is mandatory
* The xmlns attribute in html tag is mandatory
* html tag, head tag, title tag, and body tag are mandatory
* Elements must always be properly nested
* Elements must always be closed
* Element names must always be in lowercase
* Attribute names must always be in lowercase
* Attribute values must always be quoted
* Attribute minimization is forbidden
2. convert XHTML to a DOM tree (XHTML --> property tree --> DOM tree)
* eliminate CSS, JavaScript, and comments
* only preserve text tags, e.g. p, strong, h1, ...
* each node contains ***text density, tag name and value***
* text density is calculated while the DOM tree is being built
3. extract main text
* extract the nodes whose text density is higher than that of the body
* how to calculate text density: $C/T$, where C is the number of characters in the subtree and T is the number of tags in the subtree
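Steps 2 and 3 can be sketched in Python as below. This is only an illustration, not the code in the repository: it uses the standard library `html.parser` instead of a property tree, computes density in a separate pass after the tree is built, counts T as the number of descendant tags, and treats T = 0 as 1 to avoid division by zero. The names `Node`, `TreeBuilder`, `annotate`, and `extract_main_text` are hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style"}          # content is dropped entirely
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # ignored (assumption)


@dataclass
class Node:
    tag: str
    parts: list = field(default_factory=list)  # str or Node, in document order
    chars: int = 0   # C: characters in the subtree
    tags: int = 0    # T: descendant tags in the subtree


class TreeBuilder(HTMLParser):
    """Builds a simple tree from well-formed XHTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth and tag not in VOID_TAGS:
            node = Node(tag)
            self.stack[-1].parts.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(self.skip_depth - 1, 0)
        elif not self.skip_depth and tag not in VOID_TAGS:
            if len(self.stack) > 1 and self.stack[-1].tag == tag:
                self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self.skip_depth:
            self.stack[-1].parts.append(text)


def annotate(node):
    """Post-order pass that fills in C (chars) and T (tags) for every node."""
    node.chars = node.tags = 0
    for part in node.parts:
        if isinstance(part, Node):
            annotate(part)
            node.chars += part.chars
            node.tags += part.tags + 1
        else:
            node.chars += len(part)
    return node


def density(node):
    return node.chars / max(node.tags, 1)  # T = 0 is treated as 1 (assumption)


def text_of(node):
    pieces = [text_of(p) if isinstance(p, Node) else p for p in node.parts]
    return " ".join(piece for piece in pieces if piece)


def find(node, tag):
    if node.tag == tag:
        return node
    for part in node.parts:
        if isinstance(part, Node):
            found = find(part, tag)
            if found:
                return found
    return None


def collect(node, threshold, out):
    """Keep the topmost subtrees that are denser than the threshold."""
    for part in node.parts:
        if isinstance(part, Node):
            if density(part) > threshold:
                out.append(text_of(part))
            else:
                collect(part, threshold, out)


def extract_main_text(xhtml):
    builder = TreeBuilder()
    builder.feed(xhtml)
    builder.close()
    root = annotate(builder.root)
    body = find(root, "body") or root
    out = []
    collect(body, density(body), out)
    return out
```

On a page with a link-heavy navigation block and one article paragraph, the navigation block has a low density (few characters per tag) and is skipped, while the paragraph's container exceeds the body's density and is returned.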
#### Evaluation
* Precision, Recall, F-1 score
* what are true positive, true negative, false positive, and false negative? (for more information, please visit [here](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative))
* true, false: whether the system's report matches reality
* positive, negative: what the system reports
* Precision: $TP / (TP + FP)$
* Recall: $TP / (TP + FN)$
* F-1 score: $2 * Precision * Recall / (Precision + Recall)$
* for more information, visit [here](https://medium.com/nlp-tsupei/precision-recall-f1-score%E7%B0%A1%E5%96%AE%E4%BB%8B%E7%B4%B9-f87baa82a47)
* evaluation metric: precision
* $Precision = LCS(a,b).length / a.length$
* LCS = longest common subsequence (a subsequence does not need to be contiguous)
* how to calculate LCS? see [here](https://www.youtube.com/watch?v=ASoaQq66foQ)
* for further reading, see the [paper](http://www.ofey.me/papers/cetd-sigir11.pdf)
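The metrics above can be sketched in a few lines of Python. This is an illustrative sketch rather than the project's evaluation script: the function names are hypothetical, `a` is assumed to be the extracted text and `b` the hand-labeled reference text, and a zero denominator is assumed to yield a score of 0.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_precision(extracted, reference):
    """Precision = LCS(a, b).length / a.length, with a = extracted (assumption)."""
    if not extracted:
        return 0.0
    return lcs_length(extracted, reference) / len(extracted)
```

Because `lcs_length` only compares items for equality, the same functions work on strings (character level) or on lists of tokens (word level).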