---
tags: Intro to Text Mining
---
# Notes
## Text Preprocessing
1. Tokenization
* breaks text into chunks of tokens (words)
* one-hot vector: a vector in which exactly one entry is non-zero
* social media texts contain a lot of informal words
* NLTK's casual_tokenize and TweetTokenizer handle such text well (see the sketch at the end of this section)
* jieba: tokenization of Mandarin
2. Stop words
* words that occur very frequently but carry little meaning (a, an, the, ...)
* they do not have to be excluded: instead of one-hot vectors we can record weights, and a token with a very small weight is likely a stop word (unimportant)
3. Normalization
* tokens with similar meaning are combined into a single, normalized form
* helps reduce the size of vocabulary
* common normalization procedures:
1. Case folding: lowercase every character
* useful for search engines: users type queries lazily, so a search for "iphone" should also return results about "iPhone" (great)
* but it can destroy meaningful capitalization, e.g., FedEx -> fedex
2. Stemming: remove suffixes
* **rule-based**: e.g., removing the suffix "ing" turns ending -> end (good) but sing -> s (bad)
* the Porter stemmer is a well-known rule-based stemming procedure
3. Lemmatization: restores a word to its dictionary form; aims to produce real words where stemming may not
* **knowledge-based**: usually relies on a knowledge base to identify the lemma of a word
* outputs the lemma (root word) of a word, e.g., cars -> car, ate -> eat
* part of speech (e.g., gerunds) is also taken into account
* deep learning approaches usually skip stemming and lemmatization, because word embeddings already map tokens with similar meanings to nearby vectors
* Zipf's law
* describes the relation between word frequency and rank
* the frequency of any word is inversely proportional to its rank in the frequency table
* the term with frequency rank 1 occurs about twice as often as the rank-2 term, and about three times as often as the rank-3 term
* the top frequent words turn out to be stop words
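
A minimal sketch of these preprocessing steps, using NLTK (which provides casual_tokenize / TweetTokenizer, the Porter stemmer, and a WordNet lemmatizer) and jieba. It assumes the packages are installed and the NLTK data files `stopwords` and `wordnet` have been downloaded; the sample sentences are made up.

```python
from nltk.tokenize import TweetTokenizer           # tokenizer for informal / social-media text
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import jieba                                       # Mandarin word segmentation

text = "The cats are running to the stores #shopping :)"

# tokenization + case folding
tokens = [t.lower() for t in TweetTokenizer().tokenize(text)]

# stop-word removal
tokens = [t for t in tokens if t not in stopwords.words("english")]

# stemming (rule-based) vs. lemmatization (knowledge-based)
print([PorterStemmer().stem(t) for t in tokens])           # e.g., running -> run, cats -> cat
print([WordNetLemmatizer().lemmatize(t) for t in tokens])  # dictionary-based lemmas (POS defaults to noun)

# Mandarin tokenization with jieba
print(jieba.lcut("我來到台北的博物館"))
```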

## Term weighting and Vector Space Model (VSM)
* we can represent a document in the following ways:
* frequency-based vectors: vectors based on word counts
* multi-hot vectors: vectors based on word absence / presence
* TF-IDF vectors: vectors based on term frequency and inverse document frequency
* similarity between documents can then be computed with vector operations (see the sketch at the end of this section)
* vector space model
* dimensionality
* the number of distinct words, i.e., the vocabulary size $|V|$
* usually a very high-dimensional vector space
* although keeping one token-level one-hot vector per position could preserve word-order information, this is abandoned because it requires too much storage; we therefore compress each document into a single vector rather than a big table
* as a result, the vector space model loses word order and treats a document more like a set of words, which is why it is sometimes called the bag-of-words model
* frequency-based vectors
* term frequencies need to be normalized, because documents have different lengths
* multi-hot vectors
* also called binary incidence vectors
* for every term that occurs in a document, the corresponding entry of that document's vector is set to 1
* TF-IDF vectors
* frequency-based vectors carry more information than multi-hot vectors, but raw term frequency still does not reflect a term's importance (because of Zipf's law)
* Inverse Document Frequency (IDF): measures how rare a word is in the corpus, i.e., how discriminative it is: $\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$, where $\mathrm{df}(t)$ is the number of documents containing $t$

* this mitigates the stop-word problem, since the TF-IDF weights of stop words are almost 0: $\log \frac{N}{N} = 0$, where $N$ is the number of documents
* comparing document vectors
* two measures for calculating the similarity / distance between vectors
1. Euclidean distance
* sensitive to document length: the effect of length can be reduced by normalizing the vectors
2. Cosine similarity
* cosine similarity automatically normalizes the document vectors to unit vectors
* similarity is judged by the dot product of the two vectors, so the dot product only becomes large when the documents share important terms. However, two documents may use completely different words that are nevertheless near-synonyms; their cosine similarity would then be 0 even though the documents are highly related (an issue).
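
A minimal sketch of building TF-IDF document vectors and comparing them with cosine similarity, using scikit-learn; the documents below are made-up examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the breakfast was terrible and the room was dirty",
    "terrible breakfast, dirty room",
    "great location near the MRT and the mall",
]

vectorizer = TfidfVectorizer()        # tf-idf weighting; vocabulary size = dimensionality
X = vectorizer.fit_transform(docs)    # sparse |D| x |V| matrix (bag of words: order is lost)

print(X.shape)                        # (3, |V|)
print(cosine_similarity(X))           # docs 0 and 1 share important terms -> high similarity
```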
## Text Classification
* text classification: a type of supervised learning
* two phases: training & testing
* validation phase
* **it is wrong** to tune model parameters to maximize performance on the test data
* must be independent of the training data & testing data; usually a portion of the training data is split off as the validation set
* used for parameter tuning
* K-Fold Cross Validation
* sometimes the training or testing data we happen to select is not representative
* split the data into $k$ equal parts (each part is a fold); each fold then takes a turn as the testing data while the other $k-1$ folds serve as the training data
* testing phase
* the dataset used for the final comparison with other models (after all of the training & validation phases)
* chosen independently of the training data (testing must use unseen data, otherwise performance is overestimated)
* the total data is separated into a training portion & a testing portion; the testing data is usually 5% ~ 10% of the total data
* Performance Metrics
* Three important metrics: precision, recall, and F1
* precision: tp / (tp + fp)
* of the instances the model predicts as positive, how many are truly positive
* recall: tp / (tp + fn)
* of all the truly positive instances, how many the model manages to find
* F1: harmonic mean of precision and recall
* conservative; takes both of the above measures into account
* F1 = 2PR/(P+R)
* different circumstances prefer different metrics:
* high precision: predictions must be accurate - the model may miss cases, but when it says positive it has to be right. E.g., spam classification: it is fine if the model fails to catch every spam mail, but when it labels a mail as spam it really should be spam, otherwise legitimate mail gets deleted.
* high recall: nothing should slip through - better to over-report than to miss. E.g., finding negative product reviews: mislabeling some positive reviews as negative is acceptable, but the point is to surface as many of the negative reviews as possible so the product problems can be fixed.
* another popular metric: accuracy
* accuracy: (tp + tn) / (tp + fp + fn + tn)
* the proportion of all instances that are classified correctly (predicted positive and actually positive, or predicted negative and actually negative)
* if the data is skewed, accuracy is not a good metric. E.g., if 90% of the data is "no" and 10% is "yes", a model that always outputs "no" regardless of its input is useless, yet it still reaches 90% accuracy.
* performance curves
* precision-recall curves
* recall on the x-axis, precision on the y-axis; each point is the (recall, precision) pair obtained at a different decision threshold. After connecting the points, a curve closer to the top-right corner indicates a better model (higher recall & precision).
* ROC curves
* false positive rate (fp/(fp+tn)) on the x-axis and true positive rate (recall) on the y-axis. After connecting the points, a curve closer to the top-left corner indicates better performance (high recall with a low false positive rate). The area under the curve (AUC) can therefore also be used as a measure: a larger AUC means the whole curve hugs the top-left corner, hence better performance.
* single score for multi-class classification
* macro-averaging: average the per-class scores
* micro-averaging: pool the per-class contingency tables into a single table, then compute precision or recall on the pooled table (see the sketch below)
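
A minimal sketch of these metrics with scikit-learn on made-up labels, also showing why accuracy is misleading on skewed data.

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]    # skewed data: only 30% positives
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]

# binary precision / recall / F1 (positive class = 1)
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(p, r, f1)                             # P = 2/3, R = 2/3, F1 = 2/3

# accuracy can look good even for a useless "always predict 0" classifier
print(accuracy_score(y_true, [0] * 10))     # 0.7, although no positive is ever found

# for multi-class problems, switch average to "macro" or "micro"
```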

## Naive Bayes Classification
* probabilistic learning method
* assumes the words of a document do not influence each other (they are mutually independent)
* e.g., P('The', 'breakfast', 'is', 'terrible'|FOOD) = P('The'|FOOD) * P('breakfast'|FOOD) * ...
* naive: probabilities of seeing 'Hong' and 'Kong' may not be independent
* positional independence assumption: assumes the probability of a word does not depend on the position where it appears
* naive: probability of seeing 'Once' in the beginning of a sentence (document) is certainly higher than seeing it in other positions
* each document is represented as a term-frequency vector (the multinomial NB model; see the sketch at the end of this section)
* Bernoulli NB
* If we represent documents as **multi-hot** vectors, we are using Bernoulli NB model
* terms that do not appear in a document also contribute (through their complement probabilities): see p.26 for the calculation
* properties of NB
* good classification performance
* efficiency & effectiveness
* usually as a baseline in text classification research
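
A minimal sketch contrasting the two NB variants with scikit-learn; the toy documents and labels are made up. Term-frequency vectors correspond to MultinomialNB, multi-hot vectors to BernoulliNB.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import make_pipeline

docs = ["the breakfast is terrible", "the food is great", "terrible service", "great view"]
labels = ["FOOD", "FOOD", "SERVICE", "VIEW"]   # toy labels, only to illustrate the API

# multinomial NB: term-frequency vectors
multinomial = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)

# Bernoulli NB: binary=True gives multi-hot vectors; absent terms also contribute
bernoulli = make_pipeline(CountVectorizer(binary=True), BernoulliNB()).fit(docs, labels)

print(multinomial.predict(["the breakfast is great"]))
print(bernoulli.predict(["the breakfast is great"]))
```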
## Vector Space Classification
* Contiguity hypothesis:
* documents in the same class form a contiguous region in high dimensional vector space
* regions of different classes don't overlap
* vector space classification tries to identify boundaries between different regions to classify new documents

* **weighted** vector representation is always preferred
* In particular tf-idf representation
* multi-hot vectors are not preferred, because they cannot reflect how important each term is to a document
1. Rocchio Classification
* uses centroid to define boundaries
* centroid of a class $c$: the vector average of the documents belonging to class $c$ (computed from the training data)
* the boundary between two classes is then the set of points with **equal distance** from the two centroids. To classify, simply assign a point to its nearest centroid.

* weakness: no notion of density; deciding by distance to the centroid alone can go wrong (e.g., when classes have very different spreads)

2. K Nearest Neighbor (KNN)
* assigns documents to the majority class of their $k$ closest neighbors, with ties broken randomly
* It is desirable for $k$ to be **odd** (makes ties less likely)
* weighting votes by similarity is often more accurate (closer neighbors get more influential votes)
* $k$ is usually pre-defined; KNN requires no training at all
* less efficient: finding the nearest neighbors requires scanning all training documents for every test vector, i.e., time linear in the number of documents
* it simply memorizes all training examples, hence it is also called memory-based learning
3. Linear / Nonlinear Classifiers
* ++KNN is not a linear classifier++. Because the decision boundary of KNN consists of locally linear segments, which would form a complex shape (not a line in 2D or a hyperplane in M-dimensional space)
* Assume the classification problem is linearly separable.
* Then the task is to find the optimal linear separator. Treating this as an optimization problem leads to Support Vector Machines (SVM).
4. SVM
* the decision boundary is specified by a subset of the data (the support vectors)

* sometimes the data are not linearly separable; soft margin classification with slack variables can be used
* Kernel SVM
* if the data are too hard to separate linearly, **map** them onto a higher-dimensional space; a linear classifier in that space can then separate the data

* But mapping vectors into a higher-dimensional space and computing dot products there can be computationally expensive! => kernel functions
* kernel functions provide an easy and efficient way of performing the mapping and the dot product implicitly
* polynomial kernels
* radial basis function (RBF) kernel (equivalent to mapping data into an infinite-dimensional space; the default kernel of sklearn.svm.SVC - see the sketch below)
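
A minimal sketch of the three classifiers above on tf-idf vectors with scikit-learn; NearestCentroid plays the role of Rocchio, and the documents and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid, KNeighborsClassifier
from sklearn.svm import SVC

docs = ["cheap flight deals", "win a free prize now", "meeting agenda attached",
        "free cheap prize", "project meeting notes"]
labels = ["spam", "spam", "ham", "spam", "ham"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

rocchio = NearestCentroid().fit(X, labels)        # assign to the nearest class centroid
knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, labels)  # similarity-weighted votes
svm = SVC(kernel="rbf").fit(X, labels)            # RBF is SVC's default kernel

query = vec.transform(["free flight prize"])
for clf in (rocchio, knn, svm):
    print(clf.predict(query))
```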
## Flat Text Clustering
* clustering is ++unsupervised learning++
* Applications of text clustering
* Search result clustering
* news document clustering: group related news stories so that readers can find other related articles
* Latent topic discovery: learn topics in a set of documents. e.g., Topic **location** is constituted of "MRT", "mall", "museum" ...
* Types of clustering
* by structure
* flat clustering: creates a flat set of clusters that are independent of each other (no overlap)
* hierarchical clustering: creates a hierarchy of clusters; one cluster may be a subset of a larger cluster
* hard / soft clustering
* hard clustering: each document belongs to exactly one cluster
* soft clustering: compute, for each document, a probability for every cluster (a distribution). Picking the cluster with the highest probability turns it back into hard clustering.
* exhaustive / non-exhaustive clustering
* exhaustive: every document must be assigned to a cluster
* non-exhaustive: some documents may be assigned to no cluster (useful for detecting outliers instead of forcing them into a cluster)
* Hard Flat Clustering
* given
1. a set of documents $D$
2. desired number of clusters $K$ (optional for some clustering algorithms)
3. an objective function - evaluates the quality of a clustering, e.g., the average distance of the documents in each cluster to their centroid
* We want:
1. compute $\gamma: D \rightarrow \{1, ..., K\}$ that minimizes (or maximizes) objective function
2. documents are usually represented as term vectors; smaller distance means more similar content
3. usually we require that none of the $K$ clusters is empty
* cardinality issue: how to choose $K$?
* usually $K$ can only be guessed from experience or domain knowledge
* evaluation of clustering
1. purity: assign each cluster to its majority class; purity is the fraction of documents that fall into the majority class of their cluster, $\text{purity} = \frac{1}{N}\sum_k \max_j |\omega_k \cap c_j|$
2. rand index: over all pairs of documents, the fraction of pairs on which the clustering and the gold classes agree (same cluster & same class, or different clusters & different classes): $RI = \frac{TP + TN}{TP + FP + FN + TN}$
* K-means
* the most popular flat clustering algorithm
* objective: minimize average squared Euclidean distance of documents from their ++cluster center++ (mean or centroid)
* objective function: minimize RSS (Residual Sum of Squares) - the squared distance of each vector from its centroid, summed over all vectors
* steps:
first, randomly select $K$ documents as the seeds (initial cluster centroids)
then, iteratively repeat two steps until a stopping criterion is met:
1. re-assign each document to the cluster with the closest centroid
2. re-compute each centroid from the current members of its cluster
stopping criteria:
1. a fixed number of iterations (limits runtime)
2. the assignment of documents no longer changes between iterations (i.e., stop only after convergence to a local optimum)
3. an RSS threshold (stop once the objective value reaches a given threshold)
* issues with K-means
* seed selection: its outcome heavily depends on the seed initialization.
1. if an outlier is chosen as a seed, no other vector gets assigned to it (creating a singleton cluster)
2. if the seeds are too close together, many more iterations are needed to separate the data. Remedy: K-means++
The idea of K-means++ is that the initial seeds should be as far apart from each other as possible. Concretely:
* randomly pick one data point as the first seed
* for every point, compute its distance to the nearest already-chosen seed
* pick the farthest point as the next seed (the standard K-means++ actually samples the next seed with probability proportional to this squared distance)
* repeat until $K$ seeds are selected
* cardinality issue: the elbow method can help - plot RSS against $K$ and pick the $K$ where the decrease levels off (see the sketch at the end of this section)

* DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
* every cluster contains two kinds of points: core points and border points
* steps:
1. decide the values of eps (the search radius around each candidate core point) and minPts (the minimum number of points a core point must cover)
2. for every point $x$, look at the other points within radius eps and treat them as $x$'s neighbors; if $x$ has at least minPts neighbors, $x$ becomes a core point
3. for every core point not yet assigned to any cluster, create a new cluster for it and add its neighbor points to the same cluster
* every point not contained in any cluster is considered to be noise (outlier)
* advantages:
1. no need to specify number of clusters
2. able to detect outliers
3. can form un-spherical clusters (K-means assumes clusters are spherical)
* disadvantages:
1. difficult to determine eps and minPts
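
A minimal sketch of K-means (with k-means++ seeding) and DBSCAN on tf-idf vectors with scikit-learn; the documents and the values of K, eps, and min_samples are illustrative guesses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN

docs = ["stock market falls", "stocks rally on earnings", "new museum opens downtown",
        "city mall near the MRT", "market volatility rises"]
X = TfidfVectorizer().fit_transform(docs)

# K-means: k-means++ seeding; inertia_ is the RSS used by the elbow method
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.labels_, km.inertia_)

# DBSCAN: no K needed; label -1 marks noise / outliers (non-exhaustive clustering)
db = DBSCAN(eps=1.0, min_samples=2, metric="cosine").fit(X)
print(db.labels_)
```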
## Hierarchical Clustering
* unlike flat clustering, hierarchical clustering outputs a hierarchy. The tree structure is supposed to be more informative, but there is no consensus on whether flat or hierarchical clustering is better.
* hierarchical agglomerative clustering (HAC)

* clustering result is deterministic (fixed)
* bottom-up approach: start with every document as its own cluster, then repeatedly merge (agglomerate) pairs of clusters until all clusters have been merged into a single cluster
* typically visualized as a dendrogram (tree diagram; see the sketch at the end of this section)
* sometimes we do not want everything to end up in one cluster: the hierarchy should be cut at some point
* cutting criteria:
1. a pre-specified level of similarity: e.g., with a threshold of 0.4, two clusters are merged only if their similarity is > 0.4
2. like the elbow method: stop at the point where one more merge would decrease the quality of the clustering significantly
* single-link and complete-link
1. single-link
* similarity of two clusters is the similarity of their most similar members
* overall cluster structure is not taken into account (a local merge criterion)
* problem: chaining (merges always follow the single closest pair and ignore the overall structure, so a cluster can grow into a long, straggly chain)
2. complete-link
* similarity of two clusters is the similarity of their most dissimilar members
* entire structure of cluster is considered (non-local merge criterion)
* problem: sensitive to outliers (if a cluster contains an outlier, then even nearby data points will not be merged with it because they are far from that outlier, hurting merge quality)

* group-average linkage
* similarity of two clusters is the average similarity over all pairs of documents
* including pairs from the same cluster
* self-similarities are not included, because counting them would give an unfair advantage to small clusters (few pairs, each inflated by a self-similarity)
* centroid linkage
* similarity of two clusters is the average similarity over all pairs of documents ++from different clusters++
* Ward's linkage
* the distance between two clusters is how much the total sum of squares will increase when we merge them; we prefer merges where the new cluster's spread (variance) does not grow too much
* Which one is better?
* complete-link & centroid linkage are recommended over single-link for retrieval applications
* for news-related applications (e.g., topic detection), single-link seems to work better
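
A minimal sketch of HAC with SciPy's hierarchy module on tf-idf vectors of made-up documents; the `method` argument switches between the linkage criteria above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

docs = ["stock market falls", "stocks rally", "museum exhibition opens", "new art exhibition"]
X = TfidfVectorizer().fit_transform(docs).toarray()

# bottom-up merges recorded as a tree; "single" or "complete" can be swapped in,
# while "ward" requires metric="euclidean"
Z = linkage(X, method="average", metric="cosine")
print(dendrogram(Z, no_plot=True)["ivl"])          # leaf order of the dendrogram

# cut the hierarchy at a distance threshold instead of merging everything
print(fcluster(Z, t=0.8, criterion="distance"))
```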
## Latent Semantic Analysis
* Identifying terms with similar meaning and grouping them into latent topics is called latent semantic analysis (LSA).
* Two well-known LSA methods:
1. singular value decomposition (SVD)
2. latent Dirichlet allocation (LDA)
* LSA helps represent each document as a topic vector. A document vector originally records every word that occurs; if the terms can be clustered (many words collapsed into a few topics), the document can instead be expressed as a topic vector -> dimensionality reduction
* reduce storage requirement
* enhance similarity calculation (cosine similarity based on sparse TF-IDF vectors breaks down when related documents share few exact terms)
* LSA takes token normalization (stemming, lemmatization) to another level by handling synonyms: stemming and lemmatization normalize different surface forms of the same word, while LSA additionally captures different words that express the same meaning, i.e., synonyms. E.g., cell, phone, mobile -> same topic
* SVD
* keeping only a small number of important topics is enough to approximately reconstruct the original documents
* compresses very high-dimensional tf-idf vectors into low-dimensional topic vectors
* Latent Dirichlet Allocation (LDA)
* a probability-based topic-discovery model (documents and topics each have their own distributions)
* Concept / assumption:
* each document has a topic distribution (because a document generally involves several topics)
* each topic has a word distribution (because different topics prefer different words)
* both sets of distributions (topic distributions & word distributions) are inferred from the text corpus
* the number of topics $k$ is pre-defined
* Dirichlet hyper-parameters
* $\alpha$: prior of each document's topic distribution
* $\beta$: prior of each topic's word distribution
* summary
* with LSA, each document can be represented as a topic vector (see the sketch below)
* but LSA still cannot handle polysemy (words with multiple meanings) - neither SVD nor LDA can interpret the same term as different vectors in different contexts
* polysemy makes us overestimate document similarities (two documents may share the same words but use them with different meanings)
* BERT: gives a word a different representation depending on the scenario it appears in
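
A minimal sketch of both LSA methods using scikit-learn's implementations (TruncatedSVD for SVD on tf-idf vectors, LatentDirichletAllocation for LDA on raw counts); the corpus and the number of topics are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = ["cell phone mobile battery", "phone screen mobile app",
        "museum mall MRT location", "MRT station near the mall"]

# SVD on tf-idf vectors: compress |V|-dimensional vectors into 2 topic dimensions
tfidf = TfidfVectorizer().fit_transform(docs)
topic_vectors = TruncatedSVD(n_components=2).fit_transform(tfidf)
print(topic_vectors.shape)                   # (4, 2)

# LDA on raw counts: per-document topic distributions (each row sums to 1)
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
print(lda.fit_transform(counts))             # document-topic distribution
```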
## Word Vector
Word vectors are learned by unsupervised learning (the word associations are derived automatically from the text itself).
Two approaches to train the prediction model:
1. Skip-gram: given a word, predict the words likely to appear around it
2. CBOW (Continuous bag-of-words): given the surrounding words, predict the word in the middle (like a cloze test)
* Skip-gram
* the hidden layer has $n$ neurons
* $n$ is the number of dimensions used to represent a word
* the input & output layers each have $m$ neurons
* the input layer takes a one-hot vector
* $m$ is the number of words in the dictionary
* the output layer uses softmax as the activation function (probabilities between 0 and 1)
* training data
* set a window size (how many surrounding words to look at)
* pair up the tokens to form training word pairs

* weight initialization
* there are two weight matrices, whose entries are first randomly initialized

* the goal of the model (the weights) is, given an input word, to successfully predict the other word of its training pair. E.g., for the training pair {"painted", "the"}: if the input word is "painted", the network is trained so that the output is "the".
* meaning of the trained weights
* after training, the weight matrix $W$ contains the word embeddings
* that is, every word in the dictionary can be represented by an $n$-dimensional vector

* prediction then works much like a function: feed in the input word's embedding, pass it through $W'$, and out comes the predicted word

* in this way, synonyms end up with similar embeddings: since they mean the same thing, the contexts they appear in are nearly the same, so their training pairs are very similar and the trained weights (embeddings) become very similar too
* CBOW
* predict the center word based on the surrounding words.
* the input is a multi-hot vector over $m$ neurons
* likewise there are $n$ hidden neurons and $m$ output-layer neurons
* Comparison
* Skip-gram works well with small corpora and ==rare== terms
* CBOW shows higher accuracies for frequent words and is much faster to train
* Some studies show skip-gram is better
* Pretrained Vectors
Training word embeddings is fairly compute-intensive, so many pre-trained vectors are available:
* Google provides a pretrained Word2vec vector set built from English Google News articles
* Facebook published fastText
But if the vocabulary of your application is very specialized, you still have to train your own word embeddings (see the sketch below).
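
A minimal sketch of training skip-gram and CBOW embeddings with gensim's Word2Vec on a made-up toy corpus (a real corpus would be far larger); `sg=1` selects skip-gram, `sg=0` CBOW.

```python
from gensim.models import Word2Vec

sentences = [["he", "painted", "the", "wall"],
             ["she", "painted", "the", "fence"],
             ["he", "bought", "a", "phone"],
             ["she", "bought", "a", "mobile"]]

# sg=1 -> skip-gram (predict context from a word); sg=0 -> CBOW (predict word from context)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["painted"].shape)        # n-dimensional embedding, here n = 50
print(skipgram.wv.most_similar("phone"))   # similar contexts -> similar embeddings
```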
## BERT and Transformers
BERT - Bidirectional Encoder Representations from Transformers
* Transformers
* based on encoder-decoder architecture
* Encoder:
* compresses high-dimensional feature vectors into low dimensions (extracting the important parts)
* compresses the data and removes noisy information
* produces a meaningful vector for the given data
* Decoder:
* regenerates the correct data from the vectors produced by the encoder
* enhanced with the attention mechanism
* the inputs first pass through a self-attention layer, a layer that helps the encoder look at other words in the input sentence as it encodes a specific word

* Self-Attention
With plain word embeddings each word is embedded independently, ignoring the influence of the surrounding words (even though a word may have several meanings). Self-attention lets the model look at the other words while embedding a word: based on queries and keys, it computes how much each word should contribute to the resulting embedding. This is how the problem of polysemy is addressed.

* BERT only uses encoder part of the transformer
The pre-trained model is trained in two ways:
1. masked language model (mask out a word and learn to predict it)
2. next sentence prediction (predict whether the second sentence actually follows the first)
* BERT vs. Word2Vec
* BERT generates the embedding of a word by ++considering **all** the words of the input sentence++ (via attention)
* embeddings of Word2Vec are **context independent**
* that is, Word2Vec cannot handle polysemy (a word's embedding is fixed once training is done), whereas BERT decides the word embeddings each time based on the context
* also, because BERT is context dependent (not a table lookup), the pre-trained model must be run every time; with Word2Vec you simply load a word-to-vector table that someone else trained and look words up (see the sketch below)
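
A minimal sketch of getting context-dependent embeddings from a pre-trained BERT via the Hugging Face transformers library, assuming the public `bert-base-uncased` checkpoint; the same word "bank" gets a different vector in each sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# the same word gets different embeddings depending on its sentence (context dependent)
for sentence in ["I deposited cash at the bank", "We sat on the bank of the river"]:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state     # shape (1, sequence length, 768)
    idx = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    print(hidden[0, idx, :5])                          # first few dimensions of "bank"
```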
## Language Modeling
* the task of language modeling is to predict the next word given previous words
* classic NLP problem
* Markov assumption
* assumes that only the last few words influence the next word
* n-gram model: a bi-gram model predicts the next word from the previous one (looking at 2 words at a time); a tri-gram model predicts it from the previous two words
* n-gram systems usually use bi-grams or tri-grams, because if the history is too long (too many grams) there are too many model parameters to estimate
* Building n-gram model
* probabilities are inferred from text; for a bi-gram model, $P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$
* this relative frequency (also called the maximum likelihood estimate, MLE) gives zero probability to unseen combinations (data sparseness), so a single unseen combination easily drives the probability of the whole sequence to 0 (zero propagation)
* discounting (smoothing): a better way to overcome data sparseness
Part of the probability mass of seen combinations is shifted to unseen ones. There are several smoothing methods:
1. Laplace's law (add-one smoothing)
It may over-inflate: too much probability mass goes to unseen events, so the probabilities of combinations that actually occurred become too small.
2. Lidstone's law
To fix Laplace's law overestimating the probability of unseen combinations, add a smaller positive value $\lambda$ instead of 1; a common choice is $\lambda = 0.5$ (the Jeffreys-Perks law).
3. Good-Turing estimation
Set a threshold $k$: events seen fewer than $k$ times have their counts re-estimated by "borrowing" (0-count events borrow from 1-count events, 1-count from 2-count, ...), while events seen more than $k$ times just use the MLE (they occurred often enough to be reliable, so no borrowing is needed).
4. Witten-Bell smoothing
The probability comes from two sources: the MLE (the counts actually observed) and, recursively, the probability under a shorter history.
* if the outputs following a history are very diverse, we rely more on the shorter history (such a long history apparently does not constrain the next word much) - see the sketch below
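
A minimal sketch of a bi-gram model with the MLE and add-one (Laplace) smoothing on a tiny made-up corpus; Lidstone's law would simply replace the added 1 with a smaller $\lambda$.

```python
from collections import Counter

corpus = "he painted the wall she painted the fence".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)                                  # vocabulary size

def p_mle(prev, word):
    # relative frequency: zero for any unseen bigram (data sparseness)
    return bigrams[(prev, word)] / unigrams[prev]

def p_laplace(prev, word):
    # add-one smoothing: move some probability mass to unseen bigrams
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_mle("painted", "the"), p_laplace("painted", "the"))    # seen bigram
print(p_mle("painted", "wall"), p_laplace("painted", "wall"))  # unseen bigram: 0 vs. > 0
```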
## Deep Learning for Text