---
tags: Intro to Text Mining
---
# Notes
## Text Preprocessing
1. Tokenization
* breaks text into chunks of tokens (words)
* one-hot vector: a vector in which exactly one entry is non-zero
* social media texts contain a lot of informal words
* NLTK's casual_tokenize and TweetTokenizer handle such text well (see the sketch at the end of this section)
* jieba: tokenization of Mandarin
2. Stop words
* words that occur very frequently but carry little meaning (a, an, the, ...)
* they do not have to be excluded: instead of one-hot vectors we can record weights, and a token with a very small weight is likely a stop word (unimportant)
3. Normalization
* tokens with similar meaning are combined into a single, normalized form
* helps reduce the size of vocabulary
* common normalization procedures:
1. Case folding: lowercase every character
* useful for search engines: users type queries lazily, so a search for "iphone" should also return results about "iPhone" (great)
* but it can destroy meaningful capitalization, e.g., FedEx -> fedex
2. Stemming: remove suffixes
* **rule-based**: e.g., removing the suffix "ing" turns ending -> end (good) but sing -> s (bad)
* the Porter stemmer is a well-known rule-based stemming procedure
3. Lemmatization: restores a word to its dictionary form; aims to produce real words where stemming may not
* **knowledge-based**: usually relies on a knowledge base to identify the lemma of a word
* outputs the lemma (root word) of a word, e.g., cars -> car, ate -> eat
* part of speech (e.g., gerunds) is also taken into account
* deep learning approaches usually skip stemming and lemmatization, because word embeddings already map tokens with similar meanings to nearby vectors
* Zipf's law
* describes the relation between word frequency and rank
* the frequency of any word is inversely proportional to its rank in the frequency table
* the term with frequency rank 1 occurs about twice as often as the rank-2 term, and about three times as often as the rank-3 term
* the top frequent words turn out to be stop words
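
A minimal sketch of these preprocessing steps, using NLTK (which provides casual_tokenize / TweetTokenizer, the Porter stemmer, and a WordNet lemmatizer) and jieba. It assumes the packages are installed and the NLTK data files `stopwords` and `wordnet` have been downloaded; the sample sentences are made up.

```python
from nltk.tokenize import TweetTokenizer           # tokenizer for informal / social-media text
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import jieba                                       # Mandarin word segmentation

text = "The cats are running to the stores #shopping :)"

# tokenization + case folding
tokens = [t.lower() for t in TweetTokenizer().tokenize(text)]

# stop-word removal
tokens = [t for t in tokens if t not in stopwords.words("english")]

# stemming (rule-based) vs. lemmatization (knowledge-based)
print([PorterStemmer().stem(t) for t in tokens])           # e.g., running -> run, cats -> cat
print([WordNetLemmatizer().lemmatize(t) for t in tokens])  # dictionary-based lemmas (POS defaults to noun)

# Mandarin tokenization with jieba
print(jieba.lcut("我來到台北的博物館"))
```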

## Term weighting and Vector Space Model (VSM)
* we can represent a document in the following ways:
* frequency-based vectors: vectors based on word counts
* multi-hot vectors: vectors based on word absence / presence
* TF-IDF vectors: vectors based on term frequency and inverse document frequency
* similarity between documents can then be computed with vector operations (see the sketch at the end of this section)
* vector space model
* dimensionality
* the number of distinct words, i.e., the vocabulary size $|V|$
* usually a very high-dimensional vector space
* although keeping one token-level one-hot vector per position could preserve word-order information, this is abandoned because it requires too much storage; we therefore compress each document into a single vector rather than a big table
* as a result, the vector space model loses word order and treats a document more like a set of words, which is why it is sometimes called the bag-of-words model
* frequency-based vectors
* term frequencies need to be normalized, because documents have different lengths
* multi-hot vectors
* also called binary incidence vectors
* for every term that occurs in a document, the corresponding entry of that document's vector is set to 1
* TF-IDF vectors
* frequency-based vectors carry more information than multi-hot vectors, but raw term frequency still does not reflect a term's importance (because of Zipf's law)
* Inverse Document Frequency (IDF): measures how rare a word is in the corpus, i.e., how discriminative it is: $\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$, where $\mathrm{df}(t)$ is the number of documents containing $t$

* this mitigates the stop-word problem, since the TF-IDF weights of stop words are almost 0: $\log \frac{N}{N} = 0$, where $N$ is the number of documents
* comparing document vectors
* two measures for calculating the similarity / distance between vectors
1. Euclidean distance
* sensitive to document length: the effect of length can be reduced by normalizing the vectors
2. Cosine similarity
* cosine similarity automatically normalizes the document vectors to unit vectors
* similarity is judged by the dot product of the two vectors, so the dot product only becomes large when the documents share important terms. However, two documents may use completely different words that are nevertheless near-synonyms; their cosine similarity would then be 0 even though the documents are highly related (an issue).
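
A minimal sketch of building TF-IDF document vectors and comparing them with cosine similarity, using scikit-learn; the documents below are made-up examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the breakfast was terrible and the room was dirty",
    "terrible breakfast, dirty room",
    "great location near the MRT and the mall",
]

vectorizer = TfidfVectorizer()        # tf-idf weighting; vocabulary size = dimensionality
X = vectorizer.fit_transform(docs)    # sparse |D| x |V| matrix (bag of words: order is lost)

print(X.shape)                        # (3, |V|)
print(cosine_similarity(X))           # docs 0 and 1 share important terms -> high similarity
```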
## Text Classification
* text classification: a type of supervised learning
* two phases: training & testing
* validation phase
* **it is wrong** to tune model parameters to maximize performance on the test data
* must be independent of the training data & testing data; usually a portion of the training data is split off as the validation set
* used for parameter tuning
* K-Fold Cross Validation
* sometimes the training or testing data we happen to select is not representative
* split the data into $k$ equal parts (each part is a fold); each fold then takes a turn as the testing data while the other $k-1$ folds serve as the training data
* testing phase
* the dataset used for the final comparison with other models (after all of the training & validation phases)
* chosen independently of the training data (testing must use unseen data, otherwise performance is overestimated)
* the total data is separated into a training portion & a testing portion; the testing data is usually 5% ~ 10% of the total data
* Performance Metrics
* Three important metrics: precision, recall, and F1
* precision: tp / (tp + fp)
* of the instances the model predicts as positive, how many are truly positive
* recall: tp / (tp + fn)
* of all the truly positive instances, how many the model manages to find
* F1: harmonic mean of precision and recall
* conservative; takes both of the above measures into account
* F1 = 2PR/(P+R)
* different circumstances prefer different metrics:
* high precision: predictions must be accurate - the model may miss cases, but when it says positive it has to be right. E.g., spam classification: it is fine if the model fails to catch every spam mail, but when it labels a mail as spam it really should be spam, otherwise legitimate mail gets deleted.
* high recall: nothing should slip through - better to over-report than to miss. E.g., finding negative product reviews: mislabeling some positive reviews as negative is acceptable, but the point is to surface as many of the negative reviews as possible so the product problems can be fixed.
* another popular metric: accuracy
* accuracy: (tp + tn) / (tp + fp + fn + tn)
* the proportion of all instances that are classified correctly (predicted positive and actually positive, or predicted negative and actually negative)
* if the data is skewed, accuracy is not a good metric. E.g., if 90% of the data is "no" and 10% is "yes", a model that always outputs "no" regardless of its input is useless, yet it still reaches 90% accuracy.
* performance curves
* precision-recall curves
* recall on the x-axis, precision on the y-axis; each point is the (recall, precision) pair obtained at a different decision threshold. After connecting the points, a curve closer to the top-right corner indicates a better model (higher recall & precision).
* ROC curves
* false positive rate (fp/(fp+tn)) on the x-axis and true positive rate (recall) on the y-axis. After connecting the points, a curve closer to the top-left corner indicates better performance (high recall with a low false positive rate). The area under the curve (AUC) can therefore also be used as a measure: a larger AUC means the whole curve hugs the top-left corner, hence better performance.
* single score for multi-class classification
* macro-averaging: average the per-class scores
* micro-averaging: pool the per-class contingency tables into a single table, then compute precision or recall on the pooled table (see the sketch below)
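
A minimal sketch of these metrics with scikit-learn on made-up labels, also showing why accuracy is misleading on skewed data.

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]    # skewed data: only 30% positives
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]

# binary precision / recall / F1 (positive class = 1)
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(p, r, f1)                             # P = 2/3, R = 2/3, F1 = 2/3

# accuracy can look good even for a useless "always predict 0" classifier
print(accuracy_score(y_true, [0] * 10))     # 0.7, although no positive is ever found

# for multi-class problems, switch average to "macro" or "micro"
```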

## Naive Bayes Classification
* probabilistic learning method
* assumes the words of a document do not influence each other (they are mutually independent)
* e.g., P('The', 'breakfast', 'is', 'terrible'|FOOD) = P('The'|FOOD) * P('breakfast'|FOOD) * ...
* naive: probabilities of seeing 'Hong' and 'Kong' may not be independent
* positional independence assumption: assumes the probability of a word does not depend on the position where it appears
* naive: probability of seeing 'Once' in the beginning of a sentence (document) is certainly higher than seeing it in other positions
* each document is represented as a term-frequency vector (the multinomial NB model; see the sketch at the end of this section)
* Bernoulli NB
* If we represent documents as **multi-hot** vectors, we are using Bernoulli NB model
* terms that do not appear in a document also contribute (through their complement probabilities): see p.26 for the calculation
* properties of NB
* good classification performance
* efficiency & effectiveness
* usually as a baseline in text classification research
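
A minimal sketch contrasting the two NB variants with scikit-learn; the toy documents and labels are made up. Term-frequency vectors correspond to MultinomialNB, multi-hot vectors to BernoulliNB.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import make_pipeline

docs = ["the breakfast is terrible", "the food is great", "terrible service", "great view"]
labels = ["FOOD", "FOOD", "SERVICE", "VIEW"]   # toy labels, only to illustrate the API

# multinomial NB: term-frequency vectors
multinomial = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)

# Bernoulli NB: binary=True gives multi-hot vectors; absent terms also contribute
bernoulli = make_pipeline(CountVectorizer(binary=True), BernoulliNB()).fit(docs, labels)

print(multinomial.predict(["the breakfast is great"]))
print(bernoulli.predict(["the breakfast is great"]))
```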
## Vector Space Classification
* Contiguity hypothesis:
* documents in the same class form a contiguous region in high dimensional vector space
* regions of different classes don't overlap
* vector space classification tries to identify boundaries between different regions to classify new documents

* **weighted** vector representation is always preferred
* In particular tf-idf representation
* multi-hot vectors are not preferred, because they cannot reflect how important each term is to a document
1. Rocchio Classification
* uses centroid to define boundaries
* centroid of a class $c$: the vector average of the documents belonging to class $c$ (computed from the training data)
* the boundary between two classes is then the set of points with **equal distance** from the two centroids. To classify, simply assign a point to its nearest centroid.

* weakness: no notion of density; deciding by distance to the centroid alone can go wrong (e.g., when classes have very different spreads)

2. K Nearest Neighbor (KNN)
* assigns documents to the majority class of their $k$ closest neighbors, with ties broken randomly
* It is desirable for $k$ to be **odd** (makes ties less likely)
* weighting votes by similarity is often more accurate (closer neighbors get more influential votes)
* $k$ is usually pre-defined; KNN requires no training at all
* less efficient: finding the nearest neighbors requires scanning all training documents for every test vector, i.e., time linear in the number of documents
* it simply memorizes all training examples, hence it is also called memory-based learning
3. Linear / Nonlinear Classifiers
* ++KNN is not a linear classifier++. Because the decision boundary of KNN consists of locally linear segments, which would form a complex shape (not a line in 2D or a hyperplane in M-dimensional space)
* Assume the classification problem is linearly separable.
* Then the task is to find the optimal linear separator. Treating this as an optimization problem leads to Support Vector Machines (SVM).
4. SVM
* the decision boundary is specified by a subset of the data (the support vectors)

* sometimes the data are not linearly separable; soft margin classification with slack variables can be used
* Kernel SVM
* if the data are too hard to separate linearly, **map** them onto a higher-dimensional space; a linear classifier in that space can then separate the data

* But mapping vectors into a higher-dimensional space and computing dot products there can be computationally expensive! => kernel functions
* kernel functions provide an easy and efficient way of performing the mapping and the dot product implicitly
* polynomial kernels
* radial basis function (RBF) kernel (equivalent to mapping data into an infinite-dimensional space; the default kernel of sklearn.svm.SVC - see the sketch below)
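
A minimal sketch of the three classifiers above on tf-idf vectors with scikit-learn; NearestCentroid plays the role of Rocchio, and the documents and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid, KNeighborsClassifier
from sklearn.svm import SVC

docs = ["cheap flight deals", "win a free prize now", "meeting agenda attached",
        "free cheap prize", "project meeting notes"]
labels = ["spam", "spam", "ham", "spam", "ham"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

rocchio = NearestCentroid().fit(X, labels)        # assign to the nearest class centroid
knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, labels)  # similarity-weighted votes
svm = SVC(kernel="rbf").fit(X, labels)            # RBF is SVC's default kernel

query = vec.transform(["free flight prize"])
for clf in (rocchio, knn, svm):
    print(clf.predict(query))
```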
## Flat Text Clustering
* clustering is ++unsupervised learning++
* Applications of text clustering
* Search result clustering
* news document clustering: group related news stories so that readers can find other related articles
* Latent topic discovery: learn topics in a set of documents. e.g., Topic **location** is constituted of "MRT", "mall", "museum" ...
* Types of clustering
* by structure
* flat clustering: creates a flat set of clusters that are independent of each other (no overlap)
* hierarchical clustering: creates a hierarchy of clusters; one cluster may be a subset of a larger cluster
* hard / soft clustering
* hard clustering: each document belongs to exactly one cluster
* soft clustering: compute, for each document, a probability for every cluster (a distribution). Picking the cluster with the highest probability turns it back into hard clustering.
* exhaustive / non-exhaustive clustering
* exhaustive: every document must be assigned to a cluster
* non-exhaustive: some documents may be assigned to no cluster (useful for detecting outliers instead of forcing them into a cluster)
* Hard Flat Clustering
* given
1. a set of documents $D$
2. desired number of clusters $K$ (optional for some clustering algorithms)
3. an objective function - evaluates the quality of a clustering, e.g., the average distance of the documents in each cluster to their centroid
* We want:
1. compute $\gamma: D \rightarrow \{1, ..., K\}$ that minimizes (or maximizes) objective function
2. documents are usually represented as term vectors; smaller distance means more similar content
3. usually we require that none of the $K$ clusters is empty
* cardinality issue: how to choose $K$?
* usually $K$ can only be guessed from experience or domain knowledge
* evaluation of clustering
1. purity: assign each cluster to its majority class; purity is the fraction of documents that fall into the majority class of their cluster, $\text{purity} = \frac{1}{N}\sum_k \max_j |\omega_k \cap c_j|$
2. rand index: over all pairs of documents, the fraction of pairs on which the clustering and the gold classes agree (same cluster & same class, or different clusters & different classes): $RI = \frac{TP + TN}{TP + FP + FN + TN}$
* K-means
* the most popular flat clustering algorithm
* objective: minimize average squared Euclidean distance of documents from their ++cluster center++ (mean or centroid)
* objective function: minimize RSS (Residual Sum of Squares) - the squared distance of each vector from its centroid, summed over all vectors
* steps:
first, randomly select $K$ documents as the seeds (initial cluster centroids)
then, iteratively repeat two steps until a stopping criterion is met:
1. re-assign each document to the cluster with the closest centroid
2. re-compute each centroid from the current members of its cluster
stopping criteria:
1. a fixed number of iterations (limits runtime)
2. the assignment of documents no longer changes between iterations (i.e., stop only after convergence to a local optimum)
3. an RSS threshold (stop once the objective value reaches a given threshold)
* issues with K-means
* seed selection: its outcome heavily depends on the seed initialization.
1. if an outlier is chosen as a seed, no other vector gets assigned to it (creating a singleton cluster)
2. if the seeds are too close together, many more iterations are needed to separate the data. Remedy: K-means++
The idea of K-means++ is that the initial seeds should be as far apart from each other as possible. Concretely:
* randomly pick one data point as the first seed
* for every point, compute its distance to the nearest already-chosen seed
* pick the farthest point as the next seed (the standard K-means++ actually samples the next seed with probability proportional to this squared distance)
* repeat until $K$ seeds are selected
* cardinality issue: the elbow method can help - plot RSS against $K$ and pick the $K$ where the decrease levels off (see the sketch at the end of this section)

* DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
* every cluster contains two kinds of points: core points and border points
* steps:
1. decide the values of eps (the search radius around each candidate core point) and minPts (the minimum number of points a core point must cover)
2. for every point $x$, look at the other points within radius eps and treat them as $x$'s neighbors; if $x$ has at least minPts neighbors, $x$ becomes a core point
3. for every core point not yet assigned to any cluster, create a new cluster for it and add its neighbor points to the same cluster
* every point not contained in any cluster is considered to be noise (outlier)
* advantages:
1. no need to specify number of clusters
2. able to detect outliers
3. can form un-spherical clusters (K-means assumes clusters are spherical)
* disadvantages:
1. difficult to determine eps and minPts
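
A minimal sketch of K-means (with k-means++ seeding) and DBSCAN on tf-idf vectors with scikit-learn; the documents and the values of K, eps, and min_samples are illustrative guesses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN

docs = ["stock market falls", "stocks rally on earnings", "new museum opens downtown",
        "city mall near the MRT", "market volatility rises"]
X = TfidfVectorizer().fit_transform(docs)

# K-means: k-means++ seeding; inertia_ is the RSS used by the elbow method
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.labels_, km.inertia_)

# DBSCAN: no K needed; label -1 marks noise / outliers (non-exhaustive clustering)
db = DBSCAN(eps=1.0, min_samples=2, metric="cosine").fit(X)
print(db.labels_)
```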
## Hierarchical Clustering
* unlike flat clustering, hierarchical clustering outputs a hierarchy. The tree structure is supposed to be more informative, but there is no consensus on whether flat or hierarchical clustering is better.
* hierarchical agglomerative clustering (HAC)

* clustering result is deterministic (fixed)
* bottom-up approach: start with every document as its own cluster, then repeatedly merge (agglomerate) pairs of clusters until all clusters have been merged into a single cluster
* typically visualized as a dendrogram (tree diagram; see the sketch at the end of this section)
* sometimes we do not want everything to end up in one cluster: the hierarchy should be cut at some point
* cutting criteria:
1. a pre-specified level of similarity: e.g., with a threshold of 0.4, two clusters are merged only if their similarity is > 0.4
2. like the elbow method: stop at the point where one more merge would decrease the quality of the clustering significantly
* single-link and complete-link
1. single-link
* similarity of two clusters is the similarity of their most similar members
* overall cluster structure is not taken into account (a local merge criterion)
* problem: chaining (merges always follow the single closest pair and ignore the overall structure, so a cluster can grow into a long, straggly chain)
2. complete-link
* similarity of two clusters is the similarity of their most dissimilar members
* entire structure of cluster is considered (non-local merge criterion)
* problem: sensitive to outliers (if a cluster contains an outlier, then even nearby data points will not be merged with it because they are far from that outlier, hurting merge quality)

* group-average linkage
* similarity of two clusters is the average similarity over all pairs of documents
* including pairs from the same cluster
* self-similarities are not included, because counting them would give an unfair advantage to small clusters (few pairs, each inflated by a self-similarity)
* centroid linkage
* similarity of two clusters is the average similarity over all pairs of documents ++from different clusters++
* Ward's linkage
* the distance between two clusters is how much the total sum of squares will increase when we merge them; we prefer merges where the new cluster's spread (variance) does not grow too much
* Which one is better?
* complete-link & centroid linkage are recommended over single-link for retrieval applications
* for news-related applications (e.g., topic detection), single-link seems to work better
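
A minimal sketch of HAC with SciPy's hierarchy module on tf-idf vectors of made-up documents; the `method` argument switches between the linkage criteria above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

docs = ["stock market falls", "stocks rally", "museum exhibition opens", "new art exhibition"]
X = TfidfVectorizer().fit_transform(docs).toarray()

# bottom-up merges recorded as a tree; "single" or "complete" can be swapped in,
# while "ward" requires metric="euclidean"
Z = linkage(X, method="average", metric="cosine")
print(dendrogram(Z, no_plot=True)["ivl"])          # leaf order of the dendrogram

# cut the hierarchy at a distance threshold instead of merging everything
print(fcluster(Z, t=0.8, criterion="distance"))
```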
## Latent Semantic Analysis
* Identifying terms with similar meaning and grouping them into latent topics is called latent semantic analysis (LSA).
* Two well-known LSA methods:
1. singular value decomposition (SVD)
2. latent Dirichlet allocation (LDA)
* LSA helps represent each document as a topic vector. A document vector originally records every word that occurs; if the terms can be clustered (many words collapsed into a few topics), the document can instead be expressed as a topic vector -> dimensionality reduction
* reduce storage requirement
* enhance similarity calculation (cosine similarity based on sparse TF-IDF vectors breaks down when related documents share few exact terms)
* LSA takes token normalization (stemming, lemmatization) to another level by handling synonyms: stemming and lemmatization normalize different surface forms of the same word, while LSA additionally captures different words that express the same meaning, i.e., synonyms. E.g., cell, phone, mobile -> same topic
* SVD
* keeping only a small number of important topics is enough to approximately reconstruct the original documents
* compresses very high-dimensional tf-idf vectors into low-dimensional topic vectors
* Latent Dirichlet Allocation (LDA)
* a probability-based topic-discovery model (documents and topics each have their own distributions)
* Concept / assumption:
* each document has a topic distribution (because a document generally involves several topics)
* each topic has a word distribution (because different topics prefer different words)
* both sets of distributions (topic distributions & word distributions) are inferred from the text corpus
* the number of topics $k$ is pre-defined
* Dirichlet hyper-parameters
* $\alpha$: prior of each document's topic distribution
* $\beta$: prior of each topic's word distribution
* summary
* with LSA, each document can be represented as a topic vector (see the sketch below)
* but LSA still cannot handle polysemy (words with multiple meanings) - neither SVD nor LDA can interpret the same term as different vectors in different contexts
* polysemy makes us overestimate document similarities (two documents may share the same words but use them with different meanings)
* BERT: gives a word a different representation depending on the scenario it appears in
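
A minimal sketch of both LSA methods using scikit-learn's implementations (TruncatedSVD for SVD on tf-idf vectors, LatentDirichletAllocation for LDA on raw counts); the corpus and the number of topics are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = ["cell phone mobile battery", "phone screen mobile app",
        "museum mall MRT location", "MRT station near the mall"]

# SVD on tf-idf vectors: compress |V|-dimensional vectors into 2 topic dimensions
tfidf = TfidfVectorizer().fit_transform(docs)
topic_vectors = TruncatedSVD(n_components=2).fit_transform(tfidf)
print(topic_vectors.shape)                   # (4, 2)

# LDA on raw counts: per-document topic distributions (each row sums to 1)
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
print(lda.fit_transform(counts))             # document-topic distribution
```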
## Word Vector
Word vectors are learned by unsupervised learning (the word associations are derived automatically from the text itself).
Two approaches to train the prediction model:
1. Skip-gram: given a word, predict the words likely to appear around it
2. CBOW (Continuous bag-of-words): given the surrounding words, predict the word in the middle (like a cloze test)
* Skip-gram
* the hidden layer has $n$ neurons
* $n$ is the number of dimensions used to represent a word
* the input & output layers each have $m$ neurons
* the input layer takes a one-hot vector
* $m$ is the number of words in the dictionary
* the output layer uses softmax as the activation function (probabilities between 0 and 1)
* training data
* set a window size (how many surrounding words to look at)
* pair up the tokens to form training word pairs

* weight initialization
* there are two weight matrices, whose entries are first randomly initialized

* the goal of the model (the weights) is, given an input word, to successfully predict the other word of its training pair. E.g., for the training pair {"painted", "the"}: if the input word is "painted", the network is trained so that the output is "the".
* meaning of the trained weights
* after training, the weight matrix $W$ contains the word embeddings
* that is, every word in the dictionary can be represented by an $n$-dimensional vector

* prediction then works much like a function: feed in the input word's embedding, pass it through $W'$, and out comes the predicted word

* in this way, synonyms end up with similar embeddings: since they mean the same thing, the contexts they appear in are nearly the same, so their training pairs are very similar and the trained weights (embeddings) become very similar too
* CBOW
* predict the center word based on the surrounding words.
* the input is a multi-hot vector over $m$ neurons
* likewise there are $n$ hidden neurons and $m$ output-layer neurons
* Comparison
* Skip-gram works well with small corpora and ==rare== terms
* CBOW shows higher accuracies for frequent words and is much faster to train
* Some studies show skip-gram is better
* Pretrained Vectors
Training word embeddings is fairly compute-intensive, so many pre-trained vectors are available:
* Google provides a pretrained Word2vec vector set built from English Google News articles
* Facebook published fastText
But if the vocabulary of your application is very specialized, you still have to train your own word embeddings (see the sketch below).
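
A minimal sketch of training skip-gram and CBOW embeddings with gensim's Word2Vec on a made-up toy corpus (a real corpus would be far larger); `sg=1` selects skip-gram, `sg=0` CBOW.

```python
from gensim.models import Word2Vec

sentences = [["he", "painted", "the", "wall"],
             ["she", "painted", "the", "fence"],
             ["he", "bought", "a", "phone"],
             ["she", "bought", "a", "mobile"]]

# sg=1 -> skip-gram (predict context from a word); sg=0 -> CBOW (predict word from context)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["painted"].shape)        # n-dimensional embedding, here n = 50
print(skipgram.wv.most_similar("phone"))   # similar contexts -> similar embeddings
```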
## BERT and Transformers
BERT - Bidirectional Encoder Representations from Transformers
* Transformers
* based on encoder-decoder architecture
* Encoder:
* compresses high-dimensional feature vectors into low dimensions (extracting the important parts)
* compresses the data and removes noisy information
* produces a meaningful vector for the given data
* Decoder:
* regenerates the correct data from the vectors produced by the encoder
* enhanced with the attention mechanism
* the inputs first pass through a self-attention layer, a layer that helps the encoder look at other words in the input sentence as it encodes a specific word

* Self-Attention
With plain word embeddings each word is embedded independently, ignoring the influence of the surrounding words (even though a word may have several meanings). Self-attention lets the model look at the other words while embedding a word: based on queries and keys, it computes how much each word should contribute to the resulting embedding. This is how the problem of polysemy is addressed.

* BERT only uses encoder part of the transformer
The pre-trained model is trained in two ways:
1. masked language model (mask out a word and learn to predict it)
2. next sentence prediction (predict whether the second sentence actually follows the first)
* BERT vs. Word2Vec
* BERT generates the embedding of a word by ++considering **all** the words of the input sentence++ (via attention)
* embeddings of Word2Vec are **context independent**
* that is, Word2Vec cannot handle polysemy (a word's embedding is fixed once training is done), whereas BERT decides the word embeddings each time based on the context
* also, because BERT is context dependent (not a table lookup), the pre-trained model must be run every time; with Word2Vec you simply load a word-to-vector table that someone else trained and look words up (see the sketch below)
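
A minimal sketch of getting context-dependent embeddings from a pre-trained BERT via the Hugging Face transformers library, assuming the public `bert-base-uncased` checkpoint; the same word "bank" gets a different vector in each sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# the same word gets different embeddings depending on its sentence (context dependent)
for sentence in ["I deposited cash at the bank", "We sat on the bank of the river"]:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state     # shape (1, sequence length, 768)
    idx = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    print(hidden[0, idx, :5])                          # first few dimensions of "bank"
```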
## Language Modeling
* the task of language modeling is to predict the next word given previous words
* classic NLP problem
* Markov assumption
* assumes that only the last few words influence the next word
* n-gram model: a bi-gram model predicts the next word from the previous one (looking at 2 words at a time); a tri-gram model predicts it from the previous two words
* n-gram systems usually use bi-grams or tri-grams, because if the history is too long (too many grams) there are too many model parameters to estimate
* Building n-gram model
* probabilities are inferred from text; for a bi-gram model, $P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$
* this relative frequency (also called the maximum likelihood estimate, MLE) gives zero probability to unseen combinations (data sparseness), so a single unseen combination easily drives the probability of the whole sequence to 0 (zero propagation)
* discounting (smoothing): a better way to overcome data sparseness
Part of the probability mass of seen combinations is shifted to unseen ones. There are several smoothing methods:
1. Laplace's law (add-one smoothing)
It may over-inflate: too much probability mass goes to unseen events, so the probabilities of combinations that actually occurred become too small.
2. Lidstone's law
To fix Laplace's law overestimating the probability of unseen combinations, add a smaller positive value $\lambda$ instead of 1; a common choice is $\lambda = 0.5$ (the Jeffreys-Perks law).
3. Good-Turing estimation
Set a threshold $k$: events seen fewer than $k$ times have their counts re-estimated by "borrowing" (0-count events borrow from 1-count events, 1-count from 2-count, ...), while events seen more than $k$ times just use the MLE (they occurred often enough to be reliable, so no borrowing is needed).
4. Witten-Bell smoothing
The probability comes from two sources: the MLE (the counts actually observed) and, recursively, the probability under a shorter history.
* if the outputs following a history are very diverse, we rely more on the shorter history (such a long history apparently does not constrain the next word much) - see the sketch below
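
A minimal sketch of a bi-gram model with the MLE and add-one (Laplace) smoothing on a tiny made-up corpus; Lidstone's law would simply replace the added 1 with a smaller $\lambda$.

```python
from collections import Counter

corpus = "he painted the wall she painted the fence".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)                                  # vocabulary size

def p_mle(prev, word):
    # relative frequency: zero for any unseen bigram (data sparseness)
    return bigrams[(prev, word)] / unigrams[prev]

def p_laplace(prev, word):
    # add-one smoothing: move some probability mass to unseen bigrams
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_mle("painted", "the"), p_laplace("painted", "the"))    # seen bigram
print(p_mle("painted", "wall"), p_laplace("painted", "wall"))  # unseen bigram: 0 vs. > 0
```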
## Deep Learning for Text