# Truth Bug
###### tags: `side project`

### Project repository
https://github.com/ianbig/IanAndHenry

## Project Goal
A crawler that can gather ***clean large-scale news data***
* how to create clean data
* how to design a system that can handle large data

## Implementation Detail
### System Framework
![](https://i.imgur.com/KuNETKU.jpg)

### Main Text Extraction
#### Implementation
1. convert HTML to XHTML with tidy
    * XHTML is stricter than HTML, which makes it easier to build a DOM tree. XHTML requires:
        * DOCTYPE tag is mandatory
        * The xmlns attribute in html tag is mandatory
        * html tag, head tag, title tag, and body tag are mandatory
        * Elements must always be properly nested
        * Elements must always be closed
        * Elements must always be in lowercase
        * Attribute names must always be in lowercase
        * Attribute values must always be quoted
        * Attribute minimization is forbidden
2. convert XHTML to DOM tree (xhtml --> property tree --> DOM Tree)
    * eliminate CSS, JavaScript, and comment tags
    * only preserve text tags, e.g. p, strong, h1, ...
    * each node contains ***text density, tag name and value***
    * text density is calculated while the DOM tree is being built
3. extract main text
    * extract the nodes whose text density is higher than that of the body node
    * how to calculate text density: $C/T$, where $C$ is the number of characters in the subtree and $T$ is the number of tags in the subtree

#### Evaluation
* Precision, Recall, F-1 score
    * what true positive, true negative, false positive, and false negative mean (for more information, please visit [here](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative))
        * true, false: whether the system's report matches reality
        * positive, negative: what the system reports
    * Precision: $TP / (TP + FP)$
    * Recall: $TP / (TP + FN)$
    * F-1 score: $2 \times Precision \times Recall / (Precision + Recall)$
    * for more information, visit [here](https://medium.com/nlp-tsupei/precision-recall-f1-score%E7%B0%A1%E5%96%AE%E4%BB%8B%E7%B4%B9-f87baa82a47)
* evaluation metric: precision
    * $Precision = LCS(a,b).length / a.length$, where $a$ is the extracted text and $b$ is the reference main text
    * LCS = longest common subsequence (a subsequence does not need to be contiguous)
    * how to calculate LCS? see [here](https://www.youtube.com/watch?v=ASoaQq66foQ)
* for further reading, see the [paper](http://www.ofey.me/papers/cetd-sigir11.pdf)
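
The text density rule from the implementation steps ($C/T$, then keep the nodes denser than the body) can be sketched in a few lines. This is only an illustration, not the code in the repository: the `Node` class and the `annotate` and `dense_nodes` functions are hypothetical names, $T$ is assumed to count the tags below a node, and a leaf with no tags below it is assumed to divide by 1 to avoid a division by zero.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""                        # text directly inside this tag
    children: list["Node"] = field(default_factory=list)
    chars: int = 0                        # C: characters in the subtree
    tags: int = 0                         # T: tags in the subtree
    density: float = 0.0                  # C / T


def annotate(node: Node) -> None:
    """Fill in chars, tags and density for every node, bottom-up."""
    node.chars = len(node.text)
    node.tags = 0
    for child in node.children:
        annotate(child)
        node.chars += child.chars
        node.tags += 1 + child.tags       # the child itself plus its descendants
    # Assumption: a leaf has no tags below it, so divide by 1 instead of 0.
    node.density = node.chars / max(node.tags, 1)


def dense_nodes(body: Node) -> list[Node]:
    """Return the descendants of body whose density exceeds body's own."""
    found = []
    stack = list(body.children)
    while stack:
        node = stack.pop()
        if node.density > body.density:
            found.append(node)
        stack.extend(node.children)
    return found


# A navigation block with short links next to an article block with long text.
body = Node("body", children=[
    Node("div", children=[Node("a", "Home"), Node("a", "About")]),
    Node("div", children=[Node("p", "x" * 200)]),
])
annotate(body)
print([n.tag for n in dense_nodes(body)])
```

In this toy tree the link-heavy block has few characters per tag, so it falls below the body's density, while the article block and its paragraph stay above it.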
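
The LCS-based precision from the evaluation section can be computed with the standard dynamic-programming algorithm. The sketch below works on characters; `lcs_length` and `evaluate` are hypothetical names, and defining recall as the LCS length divided by the reference length is an assumption made by analogy with the precision formula, since the notes above only state precision.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)             # DP row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def evaluate(extracted: str, gold: str) -> tuple[float, float, float]:
    """Return (precision, recall, F-1) of extracted text against gold text."""
    lcs = lcs_length(extracted, gold)
    precision = lcs / len(extracted) if extracted else 0.0
    # Assumption: recall mirrors precision with the gold length as denominator.
    recall = lcs / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The extractor kept an ad banner in front of the real article text.
print(evaluate("ad banner main text", "main text"))
```

Keeping only two rows of the table holds memory at the length of the reference text, which matters when comparing full news articles.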