BERT Tokenizer and Training

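**BERT tokenization:** Before we move on to labelling, let's look at how the BERT tokenizer works. Tokenization splits a sentence into tokens. A plain word tokenizer would stop at whole words, but the BERT tokenizer only emits tokens that are in its vocabulary. Compound words and words that are not in the vocabulary are split into several subwords, and every subword except the first is prefixed with "##".

**How the BERT tokenizer processes a text, step by step:**

1. Read the text: `'15 & 16 - upper outer quadrant;17 & 18 - lower outer quadrant. (ND/mm 27.8.63)'`

2. Tokenize the text. Newlines (`\n`) are removed and words are split into pieces that resemble roots and affixes; for example, `runing` would be split into `run`, `##ing`. This is subword tokenization. BERT uses the WordPiece algorithm, which is closely related to byte-pair encoding (BPE). The sample text becomes:

   `'15', '&', '16', '-', 'upper', 'outer', 'q', '##uad', '##rant', ';', '17', '&', '18', '-', 'lower', 'outer', 'q', '##uad', '##rant', '.', '(', 'N', '##D', '/', 'mm', '27', '.', '8', '.', '63', ')'`

3. Sample offset mapping. The tokenizer first produces the token ids. The offset mapping then gives each token a `[start, end]` pair: the character position, in the text, of the piece that the token maps back to. The first token is `[CLS]`, a special token that corresponds to no text, so its pair is `[0, 0]`. The second token has `[0, 1]`, meaning it covers the first character of the text. Gaps between consecutive pairs, such as `[1, 4]` followed by `[5, 7]`, are the whitespace that separates words. An example:

   `sample_offsets = tensor([[0, 0], [0, 1], [1, 4], [5, 7], [8, 13], [14, 16], [17, 20]])`

4. The window size is counted in tokens, not characters. BERT's limit is 512 tokens, not 512 characters. If the sliding window is 510 and each window is cut by taking 510 characters of text, tokenization yields far fewer than 512 token ids, sometimes only a hundred or so, and training on such short inputs gives poor results. Each window should therefore pass more text to the tokenizer, on the order of 1,000 characters or more, so that the token sequence does not shrink to a hundred-odd ids.

![Screenshot from 2023-12-31 20-58-56](https://hackmd.io/_uploads/ryt1UkyuT.png)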
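The steps above can be sketched in plain Python. This is only an illustration, not the actual BERT tokenizer: `VOCAB` is a tiny made-up vocabulary, the regex pre-tokenization is a simplification, and the names `wordpiece`, `tokenize_with_offsets` and `token_windows` are hypothetical helpers introduced here. It shows greedy longest-match subword splitting with "##" prefixes, character offsets for each token, and windows that are sized in tokens rather than characters.

```python
import re

# Toy vocabulary (an assumption for illustration); a real BERT vocabulary is far larger.
VOCAB = {"[UNK]", "15", "&", "16", "-", "upper", "outer", "q", "##uad", "##rant", ";"}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word.

    Returns (piece, start, end) triples; start/end are character
    positions inside the word, with end exclusive.
    """
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            # Every piece except the first carries the "##" prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [("[UNK]", 0, len(word))]
        pieces.append((match, start, end))
        start = end
    return pieces


def tokenize_with_offsets(text, vocab):
    """Split text into subword tokens and record each token's character span."""
    tokens, offsets = [], []
    # Simplified pre-tokenization: runs of word characters, or single punctuation marks.
    for m in re.finditer(r"\w+|[^\w\s]", text):
        for piece, s, e in wordpiece(m.group(), vocab):
            tokens.append(piece)
            offsets.append((m.start() + s, m.start() + e))
    return tokens, offsets


def token_windows(offsets, max_tokens=510, stride=0):
    """Cut a token sequence into windows of at most max_tokens tokens.

    510 leaves room for [CLS] and [SEP] within the 512-token limit.
    Returns (first_token, last_token_exclusive, char_start, char_end).
    """
    if not 0 <= stride < max_tokens:
        raise ValueError("stride must satisfy 0 <= stride < max_tokens")
    windows = []
    for i in range(0, len(offsets), max_tokens - stride):
        j = min(i + max_tokens, len(offsets))
        windows.append((i, j, offsets[i][0], offsets[j - 1][1]))
        if j == len(offsets):
            break
    return windows


text = "15 & 16 - upper outer quadrant;"
tokens, offsets = tokenize_with_offsets(text, VOCAB)
print(tokens)   # ['15', '&', '16', '-', 'upper', 'outer', 'q', '##uad', '##rant', ';']
print(offsets)  # 'quadrant' spans characters 22..30 across three subword tokens
print(token_windows(offsets, max_tokens=4))  # tiny window, just for the demo
```

Because the windows are measured in tokens, each one uses the model's input budget regardless of how many characters that takes. The character spans returned with each window make it possible to map a window back to the original text.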