---
# System prepended metadata

title: Notes
tags: [Intro to Text Mining]

---

---
tags: Intro to Text Mining
---

# Notes
## Text Preprocessing

1. Tokenization
    * breaks text into chunks of tokens (words)
    * one-hot vector: 向量裡面只有一個地方不是 0
    * social media texts contain a lot of informal words
        * Can use casual_tokenize and TweetTokenizer
    * jieba: tokenization of Mandarin

2. Stop words
    * 指的是很常出現但不重要的字 (a, an, the ...)
    * 可以不要 exclude，透過將 one-hot vector 改為紀錄權重也可。一個 token 的權重越小代表他越有可能是 stop words (不重要)

3. Normalization
    * 意義相近的 tokens 被 combine 成 single, normalized form
    * helps reduce the size of vocabulary
    * 常用的 normalization procedures:
        1. Case folding: 把全部的字都小寫
            * useful for search engines: because users are lazy, when searching for "iphone", you would get information about "iPhone" also (great)
            * 有時會使有意義的大寫失效，e.g., FedEx -> fedex
        2. stemming: remove suffixes
            * **rule-based**: e.g., remove suffix "ing"，則 ending -> end (good), sing -> s (bad)
            * Porter stemmer, a well-known stemming rule procedure
        3. Lemmatization: 詞性還原，希望 stemming 能產出正確的字
            * **knowledge-based**: 通常會用一個 knowledge base 去 identify the lemma of a word.
            * output the lemma of a word (root word). e.g., cars -> car, ate -> eat
            * 也會將動名詞考慮進去
    * deep learning approaches 通常不使用 stemming or lemmatization。因為 word embedding 可以做到將相似意義的 token 轉成相近的向量

* Zipf's law
    * describe relation of word frequency and rank
    * the frequency of any word is inversely proportional to its rank in the frequency table
        * frequency rank 為 1 的 term 的出現次數約是 rank 2 的 term 的兩倍；也約是 rank 3 的 term 的三倍
    * 會發現 top frequent words 都是 stop words
      ![](https://i.imgur.com/4U5IVuU.jpg =80%x)
      
## Term weighting and Vector Space Model (VSM)
* we use following ways to represent a document:
    * frequency-based vectors: vectors based on word counts
    * multi-hot vectors: vectors based on word absence / presence
    * TF-IDF vectors: vectors based on term frequency and inverse document frequency
    
    接下來即可透過 vector operations 計算 similarity between documents
    
* vector space model
    * dimensionality
        * number of distinct words，也就是 vocabulary size $|V|$
        * 通常都是非常高維的向量空間
    * 儘管 one-hot vectors 可以存 word order 資訊，但還是被捨棄，因為需要的儲存空間太大。因此 we compress each document to a single vector rather than a big table
        * 因為這樣，vector space model 會失去 word order，比較像是 a set of words。sometimes we call vector space model the bag-of-word model

* frequency-based vectors
    * term frequencies needs to be normalized. 因為每個文件字數不同
* multi-hot vectors
    * also called binary incidence vectors
    * 對於每個文件有出現的 term，該文件 vector 相對應的地方就設 1。
* TF-IDF vectors
    * frequency-based 會比 multi-hot vectors 有更多資訊，但 pure term frequency 卻不能展現字的重要性 (因為 Zipf's law)
    * Inverse Document Frequency (IDF): measure how rare a word in a corpus. 也就是去看一個字是否有鑑別度
      ![](https://i.imgur.com/flD8H51.jpg =80%x)
    * 解決 stop words 問題 (因為 TF-IDF weights of stop words would be almost 0: $log \frac{N}{N} = 0$, $N$ for 文件數)

* comparing document vectors
    * two measures for calculating the similarity / distance between vectors
        1. Euclidean distance
            * sensitive to document length: 可以透過 normalize vectors 減少 effect of document length
        2. Cosine similarity
            * Cosine automatically normalize document vectors to their unit vectors
            * 對兩向量做內積來判斷相似。因此重要的 term 相同，dot 才會變大。但也有可能是兩文件的用字都不同，但其實他們用的字都非常相似，這樣的話雖然 cosine similarity 出來是 0，但其實兩文件是非常相關的。(issue)

## Text Classification
* text classification: a type of supervised learning
    * two phases: training & testing 


* validation phase
    * **it is wrong** to tune model parameters to maximize performance on the test data
    * 要獨立於 training data & testing data，通常會從 training data 切一些當作 validation set
    * 用作 parameter tuning
    * K-Fold Cross Validation
        * 有些時候，可能我們選到的 training or testing data 不具代表性。
        * 將資料分成 $k$ 等份 (每等份就是一個 fold)，接下來每個 fold 都會輪流當成 testing data，而其他 $k-1$ 個 fold 就當成 training data

* testing phase
    * 最終要用來與其他 model 做比較的一個資料集 (after all of the training & validation phases)
    * chosen independently of the training data (要用 unseen data 做 test，不然 performance 會被高估)
    * 會將 total data separate 成 training portion & testing portion，而 testing data 通常是 5% ~ 10% 的 total data

* Performance Metrics
    * Three important metrics: precision, recall, and F1
        * precision: tp / (tp + fp)
            * 衡量 model 預測為 positive 的資料中，有多少是真的 positive
        * recall: tp / (tp + fn)
            * 全部的 positive 資料中，model 能夠預測出多少個
        * F1: harmonic mean of precision and recall
            * conservative, and consider both two above measures.
            * F1 = 2PR/(P+R)
    * different circumstances prefer different metrics:
        * high precision: 代表要盡量準確，model 可以不要說對，但一說對就要是對的。e.g., classify spam mails，model 分辨不出來沒關係，但如果 model 說這是 spam mail，就真的要是，不然會造成誤刪郵件
        * high recall: 代表不要有漏網之魚，寧願多看，也不要漏看。e.g., 找出產品的負評，model 把正面評論說成負面的也沒關係，但重點是要能把全部負評都盡量找出來，才能解決產品問題。
    * one another popular metric: accuracy
        * accuracy: (tp + tn) / (tp + fp + fn + tn)
            * 全部的資料中，答對的比例 (答對是指 model 說正確，實際上也是正確；或是 model 說錯誤，實際上也是錯誤)
            * if data is skew, then it's not a good metric。e.g., 當資料有 9 成都是 no，一成是 yes，那只要有個 model，他不論吃甚麼 input，都會吐 no，這樣的話他即便是個廢 model，但卻可以有高達 9 成的 accuracy
    * performance curves
        * precision-recall curves
            * x 軸是 recall，y 軸是 precision。每個點代表的是不同門檻下分別得到的 recall, precision。將各點連線後，如果線較靠近右上角，代表 model is better (有較高的 recall & precision)
        * ROC curves
            * x 軸是 false positive rate (誤報率： fp/(fp+tn))，y 軸是 true positive rate (recall)。各點連線後，如果線靠近左上角，代表 performance is better (high recall and low false positive rate)。因此線下面積 (AUC) 也可用做衡量，如果 AUC 越大，代表整條曲線靠近左上角，因此表現較好。
    * single score for multi-class classification
        * macro-averaging: 將各 class 個別的 score 做平均
        * micro-averaging: 先將各 class 的表現集合成 pooled table 再針對 pooled table 計算 precision or recall
          ![](https://i.imgur.com/qUixecN.jpg)
          
## Naive Bayes Classification
* probabilistic learning method
* 假設文章用字彼此不影響 (互相獨立)
    * e.g., P('The', 'breakfast', 'is', 'terrible'|FOOD) = P('The'|FOOD) * P('breakfast'|FOOD) * ...
    * naive: probabilities of seeing 'Hong' and 'Kong' may not be independent
* positional independence assumption: 假設看到的字與字所擺放的位置無關
    * naive: probability of seeing 'Once' in the beginning of a sentence (document) is certainly higher than seeing it in other positions
* each document uses term frequency vector

* Bernoulli NB
    * If we represent documents as **multi-hot** vectors, we are using Bernoulli NB model
    * 沒出現的字也會有貢獻 (補數)：詳見 p.26 算法
* properties of NB
    * good classification performance
    * efficiency & effectiveness
    * usually as a baseline in text classification research

## Vector Space Classification
* Contiguity hypothesis:
    * documents in the same class form a contiguous region in high dimensional vector space
    * regions of different classes don't overlap


* vector space classification tries to identify boundaries between different regions to classify new documents
    ![](https://i.imgur.com/VGWYq7u.jpg =50%x)

* **weighted** vector representation is always preferred
    * In particular tf-idf representation
    * multi-hot is not welcomed，因為無法凸顯各個 term 之於文件的重要性。

1. Rocchio Classification
    * uses centroid to define boundaries
    * centroid of a class $c$ : vector average of documents belong to class $c$ (透過 training data 計算 class 重心)
    * 那麼，兩個 classes 間的 boundary 就是 a set of points with **equal distance** from two centroids. 在 classify 時，只要將 point 對應到最近的 centroid 即完成分類。
        ![](https://i.imgur.com/gfz2dfm.jpg =60%x)
        in this case, a1 = a2, b1 = b2 ...
    * 缺點：沒有密度概念，只用距離判斷會出事
        ![](https://i.imgur.com/kloDZ7m.jpg)
        
2. K Nearest Neighbor (KNN)
    * assigns documents to the majority class of their $k$ closest neighbors, with ties broken randomly
    * It is desirable for $k$ to be **odd** (makes ties less likely)
    * weighting by similarities is often more accurate (也就是離比較近的 neighbor 的票較重要)
    * $k$ 通常是 pre-defined，也就是說 KNN requires no training at all
    * less efficient: 因為要找 nearest neighbors，所以對於每個 vector，都要掃過全部的 documents，linear in $|d|$
    * simply memorizes all training examples: 又被稱為 memory-based learning

3. Linear / Nonlinear Classifiers
    * ++KNN is not a linear classifier++. Because the decision boundary of KNN consists of locally linear segments, which would form a complex shape (not a line in 2D or a hyperplane in M-dimensional space)
    * comparisons:
        ![](https://i.imgur.com/RGPL3uS.jpg)
    * Assume classification problem is linearly separable. 
        * 則我們要解的就是找出最適 (optimal) 的 linear separator.。有些人將這個視為 optimization problem => Support Vector Machines (SVM)

4. SVM
decision boundary is specified by a subset of data 
![](https://i.imgur.com/4nLSvLt.jpg =40%x)
* 有時 data are not linearly separable，可使用 soft margin classification 搭配 slack variables
* Kernel SVM
    * data too hard to linearly separate, then **map** the data onto a higher dimensional space. Then we can use linear classifier in the higher dimensional space to separate data
        ![](https://i.imgur.com/FfMJI4e.jpg)
    * But mapping vectors into higher dimensional space and compute dot product can be computationally expensive! => Kernel functions
    * Kernel functions provide easy and efficient way of doing mapping and dot production
        * polynomial kernels
        * radial basis functions (equivalent to mapping data into infinite dimensional space, is default kernel of sklearn.svm)

## Flat Text Clustering
* clustering is ++unsupervised learning++
* Applications of text clustering
    * Search result clustering
    * news document clustering: 把相關新聞 cluster 在一起，讓讀者可以看相關其他文章
    * Latent topic discovery: learn topics in a set of documents. e.g., Topic **location** is constituted of "MRT", "mall", "museum" ...
* Types of clustering
    * by structure
        * flat clustering: create a flat set of clusters, 每個 cluster 獨立於 each other (沒有 overlap)
        * hierarchical clustering: create a hierarchy of clusters, 可能會有某 cluster 是另一大 cluster 的 subset
    * hard / soft clustering
        * hard clustering: each document belongs to exactly one cluster
        * soft clustering: 計算每個文件對於各 cluster 的 probability (a distribution). 如果是挑 p 最高的當文件的 cluster, 就是 hard clustering
    * exhaustive / non-exhaustive clustering
        * exhaustive: every documents must be assigned to a cluster
        * non-exhaustive: some documents can be assigned to no cluster (detect outliers，針對 outliers不去硬分)

* Hard Flat Clustering
    * given
        1. a set of documents $D$
        2. desired number of clusters $K$ (optional for some clustering algorithms)
        3. a objective function - evaluates quality of clustering. 可能可以是計算每個 cluster 內的文件之於重心的 avg distance
    * We want:
        1. compute $\gamma: D \rightarrow \{1, ..., K\}$ that minimizes (or maximizes) objective function
        2. document 通常表示成 term vectors, less distance means more content similar
        3. usually, 我們希望 none of the $K$ cluters is empty
    * cardinality issue: 如何決定 $K$？
        * 通常 $K$ 只能透過 experience or domain knowledge 來 make a good guess

* evaluation of clustering
    1. purity: ![](https://i.imgur.com/bzpcVBC.jpg)
    2. rand index: ![](https://i.imgur.com/WpPltpn.jpg)

* K-means
    * the most popular flat clustering algorithm
    * objective: minimize average squared Euclidean distance of documents from their ++cluster center++ (mean or centroid)
    * objective function: minimize RSS (Residual Sum of Squares) - the squared distance of each vector from its centroid, summed over all vectors
    * steps:
        first, randomly select $K$ documents as the seeds (initial cluster centroids)
        then, iteratively repeats two steps until a stopping criterion is met:
        1. re-assigning documents to the cluster with their closest centroid.
        2. re-computing each centroid based on current members of its cluster
        
        stopping criterions:
        1. fixed number of iterations (limit runtime)
        2. assignment of documents doesn't change between iterations (代表保證收斂才停：local optima)
        3. a RSS threshold (objective value 達到某個門檻)
    * issues with K-means
        * seed selection: its outcome heavily depends on the seed initialization. 
            1. 如果有 outlier 被選成 seed，there is no vector assigned to it (造成 singleton cluster)
            2. 如果 seeds 都靠得太近，那要花更多 iterations 才能把 data 分開。解法：K-means++
                K-means++ 想法是初始的各個 seed，相互距離要越遠越好。具體作法是：
                * 隨機選一個 data 當 first seed
                * 對於每個點，計算與之最近的 seed 的距離
                * 挑選離最遠的當成下個 seed
                * repeat until $K$ seeds are selected
        * cardinality issue: 可透過 elbow method
            ![](https://i.imgur.com/wINp0w3.jpg)

* DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
    * 每個 cluster 都有兩種點：core and border
    * steps:
        1. decide value of eps (define 一個半徑讓 core 做搜尋使用) and minPts (成為 core 至少需要 cover 的點數)
        2. 對於每個點 $x$，都去看半徑內的其他點，並將那些點當成 $x$ 的鄰居。如果 $x$ 的鄰居數大於 minPts，則讓 $x$ 變成一個 core point
        3. 對於每個 core point，如果他沒被加到任何一個 cluster 內，則為他創建一個 cluster 並將其 neighbor points 也都加到同個 cluster 內
    * every point not contained in any cluster is considered to be noise (outlier)
    * advantages:
        1. no need to specify number of clusters
        2. able to detect outliers
        3. can form un-spherical clusters (K-means assumes clusters are spherical)
    * disadvantages:
        1. difficult to determine eps and minPts

## Hierachical Clustering
* different to flat clustering, hierarchical clustering outputs a hierarchy. 而有這種 tree 的結構 supposed to be more informative，但沒有定論 flat or hierarchical 誰比較好。
* hierarchical agglomerative clustering (HAC)
    ![](https://i.imgur.com/jw0eKYG.jpg)
    * clustering result is deterministic (fixed)
    * bottom-up approach, 先將每個文件視為各自不同的 cluster，並透過不斷地 merge (agglomerate) paris of clusters，until all clusters have been merged into a single cluster
    * typically visualized as a dendrogram (樹狀圖)
    * sometimes we don't want 全部的 clusters 被歸到同一類：要讓 hierarchy 被 cut at some point。
    * cutting criteria:
        1. a pre-specified level of similarity: 假設訂 0.4，就代表要 similarity > 0.4 的 2 clusters 才會做 merge
        2. like elbow method, find a situation that merging one more cluster decreases quality of clustering significantly

* single-link and complete-link
    1. single-link
        * similarity of two clusters is the similarity of their most similar members
        * overall cluster structure is not taken into account (a local merge criterion)
        * problem: chaining (因為都是靠最近的做 merge，不管後面的結構，可能會造成整個 cluster 變得非常長)
    2. complete-link
        * similarity of two clusters is the similarity of their most dissimilar members
        * entire structure of cluster is considered (non-local merge criterion)
        * problem: sensitive to outliers (如果一 cluster 內包含了 outlier，則在後續 merge 時，即便離很近的 data 都會因為 outlier 距離其很遠而不 merge，會影響 merge quality)
    ![](https://i.imgur.com/ZYGxeSF.jpg)
    
* group-average linkage
    similarity of two clusters is average similarity of all pairs of documents 
    * include pairs from same cluster 
    * self similarities are not included: because this gives an unfair advantage to small clusters (pair 數少，又都會把 self 算進去)

* centroid linkage
    similarity of two clusters is average similarity of all pairs of documents ++from different clusters++
    
* Ward's linkage
    distance between two clusters is how much the sum of squares will increase when we merge them. 希望在 merge 時造成 new cluster 的發散程度 (variance) 不要太大
    
* Which one is better?
    * recommends complete-link & centroid linkage over single-link for a retrieval app
    * news-related applications (e.g., topic detection), single-link seems to be better
    
## Latent Semantic Analysis
* Identifying terms with similar meaning is called LSA.
* Two well-known LSA methods:
    1. singular value decomposition (SVD)
    2. latent Dirichlet allocation (LDA)

* LSA helps represent each document as a topic vector. 原本 document vector 紀錄的是每個出現的 word，而現在如可以將 term 做 cluster (許多 words 變成少少的 topics)，則可以將 document 表達成 topic vector -> dimension reduction
    * reduce storage requirement
    * enhance similarity calculation (Cosine similarity based on sparse TFIDF vectors would be messed up.)

* LSA takes token normalization (stemming, lemmatization) to another level by handling synonyms。意思是 stemming & lemmatizing 是將同個字不同的表達型態做 normalize 讓他們都能變一樣，而 LSA 做到的則是考慮不同的字表達相同的意思，也就是納入 synonym。e.g., cell, phone, mobile -> same topic

* SVD
    * 只要保留少部分重要的 topics 即可重新建構 original document
    * 將非常高維的 tf-idf vectors 壓成低維的 topic vectors

* Latent Dirichlet Allocation (LDA)
    A probability-based topic-discovery model (文章、topic 有各自的 distribution)
    * Concept / assumption:
        * each document is with a topic distribution (because a document is generally involved several topics.)
        * each topic is with a word distribution (because different topics prefer different words)
        * The two sets of distributions (topic distribution & word distribution) need to infer from text corpus.

    * topic size $k$ is pre-defined.

    * Dirichlet hyper-parameters
        * $\alpha$: document 的 topic prior probability
        * $\beta$: topic 的 word prior probability
* summary
    * 透過 LSA，我們可以將每個文件表達成 topic vector
    * 但仍無法解決 polysemy (多義字) - 不論是 SVD 或 LDA 都無法做到將同個 term 解讀成不同 vector
    * polysemy 會造成我們高估文件的 similarities (雖然他們的用字相同，但可能是不同意義)
    * BERT: 賦予一個字在不同 scenario 下的不同含意

## Word Vector
為 unsupervised learning (透過文本字詞自動完成關聯)

Two approaches to train the prediction model:
1. Skip-gram: 給一個單詞，想辦法預測前後文可能的單詞
2. CBOW (Continuous bag-of-words): 從前後文去判端特定地方要填甚麼，類似克漏字

* Skip-gram
    * hidden layer 有 $n$ 個 neurons
        * $n$ 代表要用多少維度表示一個單詞
    * input & output layers 則有 $m$ neurons
        * input layer 為 one-hot vector
        * $m$ 代表 number of words in the dictionary
        * output layer 會有 softmax 當作 activation function (機率介於 0 - 1)


    * neural network 的樣子：
        ![](https://i.imgur.com/NFeIsHy.png)

    * training data
        * 設定 window size (要看周遭幾個單詞)
        * 將 token 進行配對變成一組一組的 training word pairs, 如下圖：
            ![](https://i.imgur.com/PPbMeSJ.png)

    * 權重設定
        * 共有兩組 weight matrix，首先會 randomly initialize matrix 內的權重。如下圖所示：
            ![](https://i.imgur.com/NdbgEJb.png)
        * model (weights) 的目標是能根據一個 input word，成功 predict 出相對應的 training word pair. 像是有組 training pair 為 {"painted", "the"}， 那麼如 input 單詞為 "painted"，則 network 就要去訓練權重讓 output 會是 "the"。

    * 訓練完的權重意義
        * after training, the weight matrix $W$ consists of word embedding. 
        * 也就是說每個在 dictionary 中的 word 都能用一組 $n$ 維的向量做表示。如下圖：
            ![](https://i.imgur.com/rLel9JW.png)
        * prediction 就會變得很像一個 function，丟入 input's embedding，會經過 $W'$ 產出預測單詞如下圖：
            ![](https://i.imgur.com/GJ0qYq8.png)
        * 透過這樣，synonyms 也會產生出相似的 embeddings。因為他們是同義詞，因此會接的前後文都差不多，自然 training pair 會非常相似，因此訓練出來的權重 (embedding) 也會非常相似

* CBOW
    * predict the center word based on the surrounding words.
    * input 會是 multi-hot vector 及 $m$ 個 neurons
    * 同樣會有 $n$ hidden neurons 及 $m$ 個 output layer neurons

* 比較
    * Skip-gram works well with samll corpora and ==rare== terms
    * CBOW shows higher accuracies for frequent words and is much faster to train
    * Some studies show skip-gram is better

* Pretrained Vectors
    因為要做 word embedding 其實蠻消耗運算資源的，因此有許多 pre-trained vectors 可做使用：
    * Google 就提供了根據 English Google News articles 所做的 pretrained Word2vec vector set
    * Facebook published fastText

    但如所做應用的單詞都非常特別，則還是要自己做 word embedding
    
## BERT and Transformers
BERT - Bidirectional Encoder Representations from Transformers

* Transformers
    * based on encoder-decoder architecture
        * Encoder: 
            * 將高維 feature vectors 壓成低維 (萃取 important parts)
            * compress, remove noisy information of data
            * produce meaningful vector of the given data
        * Decoder:
            * regenerate correct data from vectors processed by encoder
    * enhanced with the attention mechanism
        * 讓 inputs 先通過 self-attention layer, A layer that helps the encoder look at other words in the input sentence as it encodes a specific word
            ![](https://i.imgur.com/98mrZGZ.png)
        
* Self-Attention
    原本在 word embedding 時，每個字都是獨立的，忽略周遭其他字所造成的影響 (一字可能多義的情況)。而 self attention 則希望能夠在 word embedding 時能夠看一看其他 words，並根據 query, key 計算每一個字針對產出的 embedding 需要貢獻多少力氣。如此，即能解決一字多義的問題。
![](https://i.imgur.com/C2vHQ8V.png)

* BERT only uses encoder part of the transformer
    透過兩種方式訓練 pre-trained model:
    1. unmasked language model (把一個字蓋掉，要能成功預測)
    2. next sentence prediction (predict the order of two sentences)
* BERT vs. Word2Vec
    * BERT generates embedding of a word by ++considering **all** the words of the input sentence++ (透過 attention)
    * embeddings of Word2Vec are **context independent**
        也就是說 Word2Vec 沒辦法考慮多義字 (因為一個字的 embedding 是透過訓練完而固定的)，但 BERT 會考慮每次的 context 決定 word embeddings 結果
    * 也因為 BERT 是 context dependent (不是查表)，我們需要 pre-trained model every time；而 Word2Vec 就只需要引入別人 train 好的 word to vector table，即可透過查表方式使用

## Language Modeling
* the task of language modeling is to predict the next word given previous words
* classic NLP problem
* Markov assumption
    * 認為只有 last few words 會影響 next word
    * n-gram model: bi-gram 指前一個字決定下一個字 (2 個併在一起看)；tri-gram 指前兩個字決定下一個字
* n-gram systems usually use bi-gram or tri-grams, 因為如果考慮的 history 太長 (gram 太多)，則 model parameters will be too many to estimate
* Building n-gram model
    * probability is inferred from text:
        ![](https://i.imgur.com/v7vvZLP.jpg)
    * relative frequency (也稱 maximum likelihood estimate, MLE)，不會給沒看過的組合機率 (data sparseness)，因此在計算機率時很容易因為一個沒看過的組合就造成整個機率為 0 (zero propagation)。
    * discounting (smoothing): a better way to overcome data sparseness
        將一部分看過的組合的機率分給沒看過的組合，有許多種 smoothing 作法：
        1. Laplace's law (add one smoothing)
            可能會造成灌水太多 (分給沒看過的事件太多機率，反而造成原本有出現的組合的機率變太小)
        2. Lidstone's law
            為了解決 Laplace's law 高估沒看過的組合的機率，改成加一個相對較小的 positive value $\lambda$。通常 $\lambda = 0.5$ (Jeffreys-Perks Law)
        3. Good-Turing Estimation
            會訂一個 threshold (say $k$)，出現次數 < $k$ 的都會用借的 (0 次跟 1 次借，1 次跟 2 次借 ...)，而出現次數大於 $k$ 的就用 MLE (因為出現夠多次了，已經蠻有把握。不需要再用借的)
        4. Witten-Bell smoothing
            機率從兩個地方得來：MLE (實際觀察到幾次就幾次) 以及再看短一點的組合的機率 (recursive 的概念)
            * 如果 outputs of the history so diverse，則我們會看更短的 history (代表似乎這麼長的 history 好像沒什麼限制到)

## Deep Learning for Text

