:dart: CH3 N-gram and Language Model

:dart: CH3 N-gram and Language Model ===  ## 名字：標云 #### Augmentative and Alternative Communication (AAC) * AAC refers to communication methods, systems, techniques, and instruments that supplement (augmentative) or replace (alternative) natural speech. * It may be high-tech (e.g., apps on mobile devices and speech-generating device) or low-tech (e.g., pictures and cards) and even no-tech (e.g., facial expressions, body language, gestures, sign language.) [reference_1](https://www.assistiveware.com/learn-aac/what-is-aac) [reference_2](https://www.asha.org/public/speech/disorders/aac/) #### Language Model * Language models, or LMs, are models that give probability to word sequences. * n-gram: the simplest model -> estimate the probability of the last word of an n-gram -> assign probability of the entire sequence * n-grams are Markov models (because it will estimate probability of a word based on a fixed number of previous words, that is, the current state depends merely on the previous state) #### Perplexity * a method to evaluate language models (the degree to which a language model confidently predicts a text sample) * It measures how "surprised" the model is when it encounters new data. * The lower the perplexity, the better the model predicts the text. * cons of perplexity: 1. Since it doesn't assess the accuracy, a model could have low perplexity but a high error rate. (Even if a model is confident in its predictions, that doesn't guarantee that the predictions will be fulfilled.) 2. Every dataset has a unique word distribution, and every model has a unique set of parameters. Therefore, it is challenging to compare directly the results of models that have been trained on various datasets. [reference_3](https://www.techslang.com/perplexity-in-nlp-definition-pros-and-cons/) #### Smoothing * (1) When a corpus becomes larger, the probability of each thing appearing will be very low, and the final probability obtained will also decrease; (2) If the probability of the occurrence of something in the corpus is zero, the probability of a sentence multiplied will also become 0, which is meaningless. Therefore, smoothing is needed. * The core concept of smoothing is giving a certain chance of events that haven't been observed in the past. Meanwhile, the sum of the probability of all events still needs to be maintained at 1. * There are many ways of smoothing such as Laplace smoothing, add-k smoothing, stupid backoff, and Kneser-Ney smoothing. [reference_4](https://ithelp.ithome.com.tw/articles/10221393) #### Laplace / Add-1 smoothing and Add-k smoothing * simply add 1 to all the counts of words so that we never incur 0 value. * add-k smoothing on the other hand, instead of adding 1, a fractional count (e.g., 0.5, 0.01...) will be added. #### Backoff * Backoff means you calculate the probability at an n-1 gram level when you come across a term with prob=0. * The most used scheme is called "stupid backoff". Whenever you go back 1 level you multiply the odds by 0.4. For example, if "good" exists in the 3-gram model, the probability would be 0.4 * P("good"|"is a very"). [reference_5](https://www.quora.com/What-is-backoff-in-NLP) #### Interpolation * mixed n-grams * interpolates lower level n-grams with higher level n-grams such as interpolating unigram probabilities with bigram probabilities to prioritize some bigrams that should have higher counts than the assumed value. * Kneser–Ney smoothing uses this concept that it takes from words with higher probability and adds to words with low probability. [reference_6](https://omarmeriwani.com/text-analytics/smoothing-in-natural-language-processing/) #### Kneser-Ney smoothing * evolved from absolute-discounting interpolation * absolute-discounting interpolation: ![](https://i.imgur.com/5SQoJdD.png) * δ refers to a fixed discount value, and α is a normalizing constant. * The Kneser-Ney design retains the first term of absolute discounting interpolation. * In contrast to absolute discounting interpolation, which in a bigram model would simply default to a unigram model in the second term, Kneser-Ney is predicated on the notion that each unigram has a "continuation probability". * The lower-order model is more significant when the first term is close to zero. Conversely, the second lower-order term carries minimal weight when the higher-order model matches the data closely. ![](https://i.imgur.com/nOUwCKI.png) ## 名字：Entropia ### Probability and Likelihood ![](https://i.imgur.com/lkdrsar.png) - **Probability** :the possibility of something happening - 在已知**資料分布狀況**(e.g., 平均數、標準差)的前提下，推估某數據的可能性。 - 全班身高資料呈現常態分佈，平均170公分，最低140公分，最高200公分，推估身高落在185以上的可能性。 - 在已知資料分布的狀況下進行推估 ➡ **P of particular data given distribution** :::info 我們通常無法先知道每一個 gram 的計數及頻率。即使可以根據資料集計算，也不能代表真實世界的狀況就是如此，所以我們需要去估計每一個 gram 的頻率，那 P(gram) 無法直接推估出來，所以要先推估他的分布，我們可以根據已經有的資料，找出各種分布。 ::: - **Likelihood** : the process of determining the best data distribution given a specific situation in the data - 在已知n個數據的前提下，推估資料分布的可能性。 - 已知A同學身高185公分，但不知道全班最高、對低和平均的數值多少，甚至不知道資料是不是常態分佈，我們也就無法知道185公分在這個班級的可能性。 - 在不確定資料分布的狀況下進行推估 ➡ **P of the distribution given data** (不知道分布的狀況下，我們也無法推出正確的P(data|distribution)) #### Maximum Likelihood Estimation ![](https://i.imgur.com/zC3hMTI.png) - To find an optimal argument of distribution (mean and standard deviation) to fit the data. ~~(突然好奇，有人不想要an optimal argument嗎?)~~ - Further, we can calculate the probability of the data. - Relative frequency 是最粗暴的做法，其實就是根據現有的資料去計算，在這個資料中當然是一個可行的做法，可是對於生成模型來說，就遠遠不夠，因為現實世界中的 P(gram) 是難以計算的。 - 少數語言有時候 data 會不夠，所以可能會透過 liklihood 去捕捉母體數量。 - 讓模型可以 fit 目前資料之外的 #### Sources - [\[video\]Probability is not Likelihood. Find out why!!!](https://youtu.be/pYxNSUDSFH4) - [\[video\]Probability and Likelihood. (Differences)](https://youtu.be/cC1h77g8KFM) - [\[video\]Maximum Likelihood, clearly explained!!!](https://youtu.be/XepXtl9YKwc) ### Stationary and Ergodic - stationary: 資料分布（平均數、標準差）不會隨著時序 (sequence) 改變。 - ergodic: 嗯好，這個上課討論，我真的看不懂：） ### Cross Entropy and Perplexity - Entropy 與他的親朋好友 - $H(P)$ : Entropy - $H(P,Q)$ : Cross entropy - $PPL(P,Q)$ : Perplexity #### Cross Entropy \begin{aligned} H(P,Q)&=-\sum{P(x) \times log{Q(x)}} \end{aligned} - cross entropy 和 perlexity 可以用來比較生成結果和真實語料。 - 不過真實語料的很難計算 cross entropy 和 perlexity 就是了，比較可行的方法就是把標準答案和生成結果進行比較，呵呵。 - cross entropy 越高，代表兩者差距越大； cross entropy 越低，代表兩者差距越小。 - cross entropy 公式可以推導為 entropy 加上 KL-divergence : \begin{aligned} H(P,Q) = H(P)+D_{KL}(P,Q)\end{aligned} - $D_{KL}(P,Q)$代表基於Q序列編碼P序列所需要的信息量，也就是說 $D_{KL}(P,Q)$越大，兩筆序列差異越大。 - 由於$H(P)$是無法改變的，因此我們需要調整的是$D_{KL}(P,Q)$，越小的$D_{KL}(P,Q)$，代表越小的 cross entropy。 - 以下是 cross entropy 如何推導出 $H(P,Q) = H(P)+D_{KL}(P,Q)$ 的過程： \begin{aligned} H(P,Q)&=-\sum{P(x) \times log{Q(x)}} \\\\ &=-\sum{P(x)\times (log{P(x)} + log{Q(x)}-log{P(x)})}\\\\ &=-\sum{P(x)\times [logP(x)+\dfrac{logQ(x)}{logP(x)}]}\\\\ &=-\sum {P(x) \times logP(x) + P(x)\times\dfrac{logQ(x)}{logP(x)}}\\\\ &=-\sum {P(x) \times logP(x)} - \sum{P(x)\times\dfrac{logQ(x)}{logP(x)}}\\\\ &= Entropy + D_{KL}(P|Q) \end{aligned} #### Perplexity \begin{aligned} PPL(P,Q)&=2^{-\sum{P(x) \times log{Q(x)}}}\\ &=2^{H(P,Q)} \end{aligned} - 根據上方的 PPL 公式，我們可以發現 PPL 其實就是2的n次方，二這個n其實就是 cross entropy,前面我們提到，越好的語言模型，會有越低的 cross entropy，在這邊的公式我們可以得知，如果有真實的語言序列比較，越好的語言模型，在計算兩者差異時，會有越低的PPL。 - 當然，如果現在不是要比較預測與真實狀況，而是單就語言模型本身來看的話，公式就會是: \begin{aligned} PPL(P)&=2^{-\sum{P(x) \times logP(x)}}\\ &=2^{H(P(x))} \end{aligned} - 所以 PPL 公式的n次方，n就會是單純的entropy，越高的 PPL 對應越高的 entropy，但單純衡量語言模型的 PPL 或 entropy 沒有太大的意義，還是要和相對應的真實與料比較，所以這時候的語言模型，具有較高或較低的，如果沒有和真實語料比較，我們仍然看不出它的成效如何。 #### Source - [一文搞懂Language Modeling三大评估标准](https://zhuanlan.zhihu.com/p/424162193) ## 名字：涴筑 ### N-Grams * 意義：用少少的字（通常為 5-gram 以下）預測下一個字出現的機率，可以節省運算量（因為不需要用整個句子下去運算）。 * Assumption: Markov assumption. 用少量前面的字就可以預測後面字出現的機率。 ### Intrinsic Evaluation * 意義：相對於 extrinsic evaluation 直接將 language model 拿去應用（aka 直接上戰場），intrinsic evaluation 先將資料分為 training set & test set 做內部測試。要注意的是，intrinsic evaluation 並不能保證 extrinsic results 得到進步，最終仍須以 end-to-end evaluation (extrinsic evaluation) 衡量 model。 * Sets: - Training set - Devset (validation set): 用來調參數（類似 pilot study？） - Test set (held-out set): 最後測試（類似 main experiment？） ![](https://i.imgur.com/ac9no8p.png) [★圖片來源](https://blog.csdn.net/luoying_1993/article/details/87178165) [★Devset? Validation Set?](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets) [★Held-out Set? Test set?](https://www.datarobot.com/wiki/training-validation-holdout/) ### Language Model * 意義：計算鄰近字機率的各種技術，例如上述的 n-grams。 * 應用：machine translation, POS tagging, parsing, sentiment analysis, etc. [★Language Modeling](https://www.techtarget.com/searchenterpriseai/definition/language-modeling) ### Perplexity * 意義：譯作「困惑度」。計算時為機率的倒數，因此機率越大，perplexity 越小，代表某字後面所接的字越特定 (選擇越少，越容易確定應該出現什麼字)。 * 舉例：小明說話時通常只會在 A 後面加 B/C；曉華會在 A 後面加 B/C/D/E。如此，小明的 perplexity 就比較低 (因為機率比較高)，所以比較容易訓練出小明說話風格的機器。[Discussion in 111-1 NLP class] ### Zero 1. 有時一些常用的表達法沒有出現在 training set 裡，卻出現在 test set 裡，造成機器錯誤估算出機率為 0 （the problem of sparcity）。 [★Sparce Data](https://kth.diva-portal.org/smash/get/diva2:1111045/FULLTEXT01.pdf) - 解方：Smoothing 2. Unknown words, or out of vocabulary (OOV) words: - 我們從來沒有看過的字 - 解方一（有特定的 word list）：把所有 training set 裡沒有出現過的字用 <UNK> 取代。 - 解方二（沒有特定 word list）：把最低頻的字替換成 <UNK>。 ### Backoff and Interpolation 兩個都是「找別人來幫忙」（此處的「人」指涉 n-gram）： * Backoff: 一個人不行就往下找另一個人。假設現在使用 trigram，但因為 data 裡沒有某常見用法的例子，則我們找 bigram 來幫忙，因為他結合的字數較少，較容易找到相對應的組合。 * Interpolation: 讓所有人一起幫忙，不只集結，還分配權重給他們。 ### Smoothing * Laplace smoothing: 把所有的 count +1，則沒有任何字的 count 為 0。 * add-k smoothing: 把所有 count 加上一個很小的 k 值。 * stupid backoff: 從較大的 n-gram 往下找，直到找到適合的 n-gram，並且每往下一層，分數就要乘上 0.4。 [★Stupic Backoff](https://medium.com/swlh/your-first-natural-language-processing-project-word-predictor-using-backoff-model-6a07146de006) * Kneser-Ney smoothing ## 名字：秦佐以下依據頁數排序： ### Chain rule of probability (p.3) 連鎖律本身很直觀地算機率：![](https://private.codecogs.com/gif.latex?%5Cdpi%7B150%7D%20p%28a%2Cb%2Cc%29%20%3Dp%28a%29*p%28b%7Ca%29*p%28c%7Ca%2Cb%29) 但如果語言模型要處理的句子很長，從第一個詞w_1到最後一個詞w_n，就要這樣一直一直瘋狂算下去：![](https://private.codecogs.com/gif.latex?%5Cdpi%7B150%7D%20p%28w_1%2Cw_2%2Cw_3%2C...w_n%29%20%3Dp%28w_1%29*p%28w_2%7Cw_1%29*p%28w_3%7Cw_1%2Cw_2%29*...*p%28w_n%7Cw_1%2Cw_2%2Cw_3%2C...%2Cw_%7Bn-1%7D%29) [算式圖片來源](https://blog.csdn.net/qq_31267769/article/details/108187202) 如此看來Chain rule缺點就滿明顯的： * 如果句子有n個詞，使用Chain rule展開就有n項，n越大計算越複雜。 * 當句子很長時，裡面所有詞同時出現的機率會很接近零。因為Chain rule這樣全部考慮滿不可行的，所以我們才需要Markov assumption。Markov assumption的話，下一個詞的出現只依賴於它前面一個或幾個詞，而不用考慮前面全部。 N-gram就是基於Markov assumption，所以bigram也就是2-gram也就是只依賴前一個詞，n gram就是依賴前面n-1的詞。 [reference_1](https://blog.csdn.net/qq_31267769/article/details/108187202) ### Weighted average branching factor (p.8) branching factor是每個結點下的子結點數，所以語言的branching factor就是可能的下一個詞的數量。課本舉了辨識數字的例子，數字0到9共十個，所以branching factor是10。如果在0到9平均分佈的況下，大家的機率都一樣是十分之一。可是假設0比其他數字都更常出現，0出現的機率很高，這時perplexity較小，weighted branching也較小。 ### Dialect or variety (p.12) * 依據目標任務訓練語言模型，所以除了內容領域以外，方言和語言變體也很重要。 * 尤其處理社群平台發文或錄音稿這些很口語的資料 (課本裡的例子是AAL。如果是在台灣的話，我覺得訓練機器時要注重台灣口語裡面常常華語台語詞彙混用的狀況。我自己用逐字稿產生器的經驗，發現它在code-switching時表現很差) ### Entropy(p.22) * 用以衡量資訊量，機率越小越無法預測，「不確定性」越高，entropy越大。資訊量可以視為「不確定性」的度量，一件事如果機率很小，非常非常不確定，發生時就代表背後有很多不同的可能，訊息量較大。也就是說，越不常出現的訊息往往帶著越大的資訊含量。 * Perplexity是 cross-entropy的指數。 [reference_2](https://blog.xuite.net/metafun/life/69851478-%E8%B3%87%E8%A8%8A%E7%9A%84%E5%BA%A6%E9%87%8F-+Information+Entropy) [reference_3]( https://blog.csdn.net/qq_40765537/article/details/110086891) ### Stationary(p.23) * 機率分佈不隨時間改變 * 馬可夫模型屬於stationary，所以如果是bigram那就永遠是依賴前一詞的機率 * 可是自然語言不是stationary，而是會與時俱進，每個詞出現的機率會改變。 ## 名字：昀珊 ### Unknown words, Out-of-vocabulary (OOV) - Data sparsity 因為資料稀疏(data sparsity)，有可能測試集中出現訓練集沒有出現過的字。因而導致： 1) 低估字詞出現的機率 2) 沒遇過的字因為機率為0，與任一數相乘都會變成0，所以無法計算perplexity ==> 透過標記<UNK> 或平滑方法(smoothing)解決。 - 面對OOV的解決方法：訓練模型計算<UNK>出現的機率 1. 選擇固定詞表：將訓練集中不出現再詞表內的字轉換為<UNK> 2. 無固定詞表：設定某個出現次數作為基準，若低於該次數則該字就判定、轉換為<UNK> 將訓練集中OOV轉換為<UNK>後，再用該訓練集的資料訓練模型。因此當模型處理測試集時，俗果遇到OOV的狀況可以直接把OOV視為<UNK>進行處理。 ### Kneser-Ney Discounting - Kneser-Ney 基於Absolute discounting而發展，修正Absolute discounting對於unigram distribution的計算。所以兩者的公式長得很像，只差在解釋unigram機率的後半部分。 * Absolute discounting： ![](https://i.imgur.com/8Qg72ie.png) * Kneser-Ney discounting: ![](https://i.imgur.com/fetIRHt.png) - 若依照absolute discounting中p(w)的計算：某個unigram A字多次出現在訓練集中，因此有高機率被選為output；實際上B雖然在訓練集中出現次數較少，但卻較符合實際使用的正解。 * Example: ‘Kong’ was rated with higher frequency in the training corpus; However, in a given sentence (I lost my reading ___), ‘glasses’ with less frequency would have wider distribution. -- ‘Kong’ appeared in bigram: Hong Kong -- ‘glasses’ appeared in bigram: reading glasses, wine glasses, apple glasses … - Kneser-Ney的想法：不只單單只看這個字出現的機率，而加入考慮**這個詞可以現在多少個不同的組合當中**(PCONTINUATION(w))。倘若計算後這個詞可以出現在很多不同的組合裡，這個詞出現在新的組合機率比較高。 * P CONTINUATION(w) = the **number** of different contexts that the target word appears = the number of bigram types it completes ![](https://i.imgur.com/yjiI4Ht.png) (把原absolute discounting公式 P(w)改成PCONTINUATION (w)就變成Interpolated Kneser-Ney 的公式) > [Ref_1](https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf) [Ref_2](http://www.foldl.me/2014/kneser-ney-smoothing) ### Perplexity (PPL) - 屬於intrinsic evaluation: 不直接讓模型上戰場做任務估表現，而是使用test set評估其表現 - 用來評估模型的預測能力為何由公式可知，PPL和機率為倒數關係，若句子的機率越大(=分母越大)，PPL越小。意即被預測的該字詞組合本身的機率高，變化可能小，專一程度高，所以PPL低。 - 通常評估語言模型的表現不會直接使用raw probability，而會使用PPL。但PPL不能代表模型直接上戰場(extrinsic evaluation)的表現。 ### Entropy - 來自Information Theory，用來測量資料的純粹程度，或者可以當作在這份資料中發現驚喜的可能性。 - 公式： ![](https://i.imgur.com/m8nCXQe.png) * p(x)= x出現的機率；-log p(x) = x所帶的資訊量[ref](https://ithelp.ithome.com.tw/articles/10305380?sc=iThelpR) * log的底數可以為任何數，若取2則代表記錄結果(特徵)最少需要兩個儲存空間(bit)，例如結果(特徵)有「是」、「否」2種可能。 * 預測機率低，該份資料的驚喜較多，組合變化較多，雜質較多；預測機率高，該份資料的驚喜較少，組合變化較少，較單純。 #### Cross entropy - Cross entropy可以用來觀察實際機率分布vs.預測機率分布的誤差範圍 [(ref)](https://r23456999.medium.com/%E4%BD%95%E8%AC%82-cross-entropy-%E4%BA%A4%E5%8F%89%E7%86%B5-b6d4cef9189d) 預測值和實際值相差越大，不確定的訊息較多，cross entropy大預測值和實際值重疊部分越多，資訊量越少，不確定的訊息較少，cross entropy小 - 即使不知道p的資料分布時，可用另一個模型m模擬p來計算 ![](https://i.imgur.com/dteRqJy.png) 又基於Shannon-McMillan-Breiman theorem，可以直接從模型m中取足夠長度的例子進行運算，而適用下面這個公式： ![](https://i.imgur.com/FWyYZkd.png) => 因此，即使在不知道p的資料分布的狀況下，可以直接計算cross entropy > [Ref_3](https://towardsdatascience.com/the-relationship-between-perplexity-and-entropy-in-nlp-f81888775ccc) > [Ref_4](https://www.kamperh.com/nlp817/notes/04_entropy_perplexity_notes.pdf) ## 名字：Milan ### Augmentative and alternative communication * **Definition:** communication options for people who are unable to meet their daily communication needs effectively through natural speech. #### Why AAC 1. Those who may have a developmental disability which has affected the development of speech. They may have an acquired disorder that has affected the person’s ability to speak. 2. Many people with different communication difficulties, speech impediments and disorders can benefit from AAC. * **People who communicate with AAC is a multimodal communicator because they can communicate with...** * vocalizations * word approximations * some gesture and sign language ### N-gram 1. more data, more accurate 2. very corpus limited #### Markov assumption * the probability of a word depends only on the previous word #### relative frequency * dividing the observed frequency of a particular sequence by the observed frequency of a prefix * Different corpus may have different result :::success **Why use log probability?** 1. <1 ; >0 2. 不會出現像 raw probability 一樣出現過小的問題 3. 可以藉由相加來計算機率，減少計算量（log相加=線性相乘） ::: ### evaluation * extrinsic evaluation * evaluate if a particular improvement in component is really going to help the task at hand * intrinsic evaluation * measures the quality of a model independent of any application ### Perplexity 1. 機率 P 越高，PP 越低 (可選擇的選項減少) 2. a.k.a. weighted branching factor -> the number of possible next words that can follow any word. 3. 受到訓練資料的影響大 -> intrinsic improvement 不保證 extrinsic improvement ### How to deal with OOV :::success **What smoothing method should we use?** 1. use cross entropy to evaluation -> 觀察實際機率與預測機率的誤差範圍（預測值與實際值差越多，內含的資訊量越多 a.k.a 不確定性越高） 2. 比較不同的 training size 3. 比較不同的 N（N-gram） 4. 最後比較不同的 smoothin method ::: ## 名字：盈蓓 ### Expectation–maximization algorithm (EM algorithm, 最大期望算法) * 在不知道隱性變量（latent variable）如何影響結果的情況下，用來估計參數的maximum liklihood 算法。 * **latent variable**: 無法觀測到，但會影響實驗結果的變量。例：有A跟A兩個工廠製造的硬幣，各自擲出正反面的機率並不相同。今天只知道擲一百枚硬幣100次的狀況下出現正面54次，反面46次，卻不知道硬幣來自A工廠或B工廠時，這就成為我們估算A工廠跟B工廠擲硬幣機率的隱性變量。 * 因為無法計算隱性變量的liklihood，我們只能透過計算期望值的方式回推。 * 可用於尋找smoothing中的最適加權值 :::success **步驟**： 1. 觀察現有參數分布 2. **E**: 根據假設出來的參數計算隱性變量的期望值 3. **M**: 找出該期望值的maximum liklihood 4. 重複步驟2跟步驟3直到數值收斂 ::: **[Reference]** 1. https://zh.wikipedia.org/zh-tw/%E6%9C%80%E5%A4%A7%E6%9C%9F%E6%9C%9B%E7%AE%97%E6%B3%95 2. [EM Algorithm 詳盡介紹: 利用簡單例子輕鬆讀懂 EM 的原理及概念](https://playround.site/?p=628) ### Hash (雜湊) * 一種處理資料的方法。 * 把長度不定的訊息透過hash function (雜湊函數)混合後變成長度一定的hash以達成資料的規則存放 * 像是把不同的水果拿來打果汁一樣，無法從果汁的外觀回推原料。 * 果汁（hash）會根據原料不同而產生變化（例：火龍果跟奇異果打出來的果汁顏色不同） * hash table/map: 用來存放hash及其對應的資料索引值 **[Reference]** [[Java] Bloom Filter介紹及應用實例](https://ithelp.ithome.com.tw/articles/10208884) ### Bloom Filter (布隆過濾器) * 可以用低空間複雜度的方式幫助我們在大資料量的狀況下短時間內確認目標**可能存在**或**一定不存在**corpus當中的過濾器。 * 以hash table的形式儲存資料，且資料的key只會有0跟1兩種，即「存在」和「不存在」兩種表示方法 :::success **優點**： 1. 搜尋時間快速 2. 所需的索引結構小 ::: :::danger **缺點**： 1. 會產生false positive，所以我們無法從結果完全確定資料一定存在corpus當中 2. 只能新增資料，不能刪除資料（因為從1變成0的時候我們不知道原本有多少hash是對應到這個欄位） ::: * Counting Bloom Filtier: 透過新增計數器的方式解決不能刪除資料的問題，缺點是資料量會變大很多。 **[Reference]** [[論文解讀][Bloom Filter] 深入討論 Bloom Filter](https://www.evanlin.com/BloomFilter/) ## 名字：靖涵 ### Language model - Estimate the probability of a word sequence - Usage with higher probability (used more often) - A better language model--> is closer to how human use words - Application: make a chatbot select grammatical words that are used more reasonable - The example of language model: n-gram ### N-gram - a basic Markov language model - N: how many previous words it estimates (window) - Concern a certain number of previous words that are regarded as a segment, instead the whole sentence - Markov: looking into the previous words - Conditional probability comes from training data - conditional probability: the possibility of the event A, based on the previous existence of event B -> P（A|B) - Lacking: depending on training corpus (finite) ->reasonable but not collected ->solutions: smoothing ### Smoothing - to avoid assigning reasonable segment zero probability - There are many kinds of smoothing (more than three ways listed in the book) 1) Laplace smoothing = add-one smoothing: - Give zero assigned probabilities - add one before normalize them into probabilities - not a good method for N-gram (the number of zero is huge!) - application in other NLP models, such as text classification 2) add-k smoothing - not add 1, but add a fractional count k - doesn’t work well for language modeling either ### End-to-end - End: 端點 - End-to-end evaluation: extrinsic evaluation-->expensive cost - End-to-end training: 整個大function由很多小functions組成（像一條生產線） - 但只需給大function input端和output 端 (input layer and output layer)，中間的的小functions (hidden layers)會自然訓練出來 ![](https://i.imgur.com/tvLQWX6.jpg) [圖片1來源](https://www.youtube.com/watch?v=Hgtf0912_Ew&list=PLOAQYZPRn2V5yumEV1Wa4JvRiDluf83vn&index=3) ### Perplexity - To evaluate the model, not the performance of result - Measurement of degree of confusion (困惑度) - N: the number of the total words in the sentence - 句子機率的倒數：好model-->perplexity越小（More confused--> worse model） perplexity公式： ![](https://i.imgur.com/Uo5gWHt.png) [圖片2來源](https://www.youtube.com/watch?v=rMVQuS_Zy2I) #### 參考資料來源 [Reference 1 (language model)](https://ai4dt.wordpress.com/2018/05/19/language-model-%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B-v2-0/) [Reference 2 (language modeling)](https://www.youtube.com/watch?v=c0N8_x-rhmE) [Reference 3 (smoothing)](https://www.youtube.com/watch?v=ebeGh8HM4Jo) [Reference 4 (end-to-end)](https://www.youtube.com/watch?v=Hgtf0912_Ew&list=PLOAQYZPRn2V5yumEV1Wa4JvRiDluf83vn&index=3) [Reference 5 (perplexity)](https://www.youtube.com/watch?v=rMVQuS_Zy2I) ## 名字：士賢 ### 語言模型 (language model) - 可分配機率來預測下一個字的模型可稱為語言模型。 ### 但為甚麼需要預測下一個字呢? - 因為我們需要在雜亂的輸入中辨識出文字。 - 常見的應用有: speech recognition, spelling correction, grammatical error correction, and machine learning。 ### 該如何計算下個字的機率呢? - 舉例來說: p(好|今天天氣很) - 我們要知道p(好|今天天氣很)的機率為何的話可先將c(今天天氣很好)與c(今天天氣很)的數量分別計算出來，然後相除，如下: $p(好|今天天氣很) = c(今天天氣很好)/c(今天天氣很)$ - 但這樣的做法會有問題，因為**語言具有創意性**，所以現實中不太會能夠算出整個句子的數量，畢竟總是會出現沒出現過的句子。 ### n-gram的出現 - n-gram使計算機率變得簡單許多，不再需要從頭開始計算的機率，只需要前幾個字的機率便能做計算。 - 我們可以使用maximum likelihood estimation (MLE)計算n-gram的機率。 :::info 使用log計算機率 -> 可以避免**算術下溢**。因為當句子越長，所得出的機率就會越小。 e.g. $P1*P2*P3*P4=exp(logP1+logP2+logP3+logP4)$ ::: ### 檢驗模型 - 一般而言，可將資料分為training sets(80%)、development sets (10%)、test sets(10%)。 - 使用perplexity來檢驗模型，而非raw probability。 - perplexity與conditional probability成反比。 $\therefore$ perplexity越小，代表模型預測能力**愈好**。 - 不同數量的n-gram會有不同的perplexity值，所以perplexity也可以用來測試模型間的表現。 ## 名字：瓊文 ###### 計算機率的方式：The chain rule of probability → Markov assumption（為了減輕複雜的運算） ### Maximum likelihood estimation * 原理：根據現有已知的結果，算出模型的產出這組資料可能性 * 用途：用來**估計模型參數**。找出最大的likelihood，是為了找到最好的參數值 * 方法：透過likelihood function運算 #### Likelihood vs. Probability * Probability：給定一組模型參數，**預測**下一個資料結果 * Likelihood：給定一組**已知**觀察到的資料，算出這個某組模型參數產生給定資料的可能性。 * 所以有了一組資料，會搭配不同的參數去找出最大的可能性（代表越符合資料分布），最大的概似性就代表那組參數是我們要的。 * 根據經驗，Maximum likelihood estimation function出來的結果通常會趨於常態分布 [Reference_MLE_1](https://www.youtube.com/watch?v=XepXtl9YKwc) [Reference_MLE_2](https://medium.com/qiubingcheng/%E6%9C%80%E5%A4%A7%E6%A6%82%E4%BC%BC%E4%BC%B0%E8%A8%88-maximum-likelihood-estimation-mle-78a281d5f1d) ### Weighted average branching factor * Branching factor：指每個節點下的分支數量 * 用途：博弈計算、樹狀結構資料運算 * 用於語言模型：計算下一個可能接的字之數量（候選人數量） * Average branching factor：因為每個結點下的分支數量可能不一樣，所以要計算平均 #### weighted branching factor * 下一個可能的選項（可能出現的字）**機率不一樣**，所以可以透過加權得到不一樣的因子。 * 如果在下一個可能出現選項當中，**某一個字的可能性較高（機率高）**，那麼weighted branching factor的數字會**小於**原本平均機率下的branching factor。 * 這個概念可以跟perplexity相對應，越小的weighted branching factor和perplexity，對模型來說是一件**好**事。 [Reference_3](https://towardsdatascience.com/perplexity-in-language-models-87a196019a94) ## 名字：佳嫻 #### Probability: * 用一種模型去訓練電腦可以找出下一個詞語的機率，應用於語音辨識、拼寫校正、文法校正、機器翻譯。 * P(w│h)；w = word, h=history. P(the│its water is sotransparentthat) = C(its water is so transparent that the)/C(its water is so transparent that) * Markov assumption: 由於一個句子可以有無數的詞彙，這樣計算量會太過龐大，所以可以改用「計算前一個詞彙」的方法去簡化它，P(wn│w1:n-1) ≈P(wn≈wn-1)。比如P(the│Walden Pond’s water is so transparent that)可以簡化為P(the│that)。 #### Log probabilities　透過運用log運算probability，可以 * 把運算由乘法轉換為電腦擅長處理的加法[log(P(E)P(F))=logP(E)+logP(F)]（），提升運算效率。 * 尤可運用在多種probabilities的運算。 * 如果取probability的對數，結果將會是負數： 0≤P(E) ≤1 Axiom 1 of probability -∞≤log P(E)≤0 Rule for log probabilities * 易於呈現極小的概率。電腦可以將probability運算至小數點後很多位，但很難完美呈現它。如Python無法表示任何小於2.225-308的概率，而同一數字的log值 -307.652則容易被電腦儲存。 [reference](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/log_probabilities/) #### OOV (Out-of-Vocabulary) (P.13)的處理方法：(其實有點看不太懂Otz) * 「我不知道」UNK token * 「我再查查看」back-off機制需同時維持數個不同運算量的語言模型，一旦出現OOV的情況，則改退到次高運算量的模型查詢。也就是說，如果n-gram模型的發生OOV的情況，此機制會退到n-1 gram的模型去查找該token。如果仍找不到，模型才會發出UNK token。 [reference](https://medium.com/hackergirly/subword-algorithms-372f4c08d01f)  ###### tags: `CL2023`