# BERT -wang

###### tags: `NLP`

## Predict Masked Word

"The __ sat on the mat" (What is the masked word?)

![](https://i.imgur.com/Tr9yPoY.png)

## Predict Next Sentence

+ Given the sentence: "calculus is a branch of math"
+ Is this the next sentence? "it was developed by newton and leibniz"
+ Is this the next sentence? "panda is native to south central china"

![](https://i.imgur.com/IqgyDfK.png)

## Combining the two methods

+ Input: "[CLS] calculus is a **[MASK]** of math **[SEP]** it **[MASK]** developed by newton and leibniz".
+ Targets: True, "branch", "was"

## Training

1. **Loss1** is for binary classification (is the second sentence the real next sentence?).
2. **Loss2** and **Loss3** are for multi-class classification (predicting the two masked words).
3. The objective function is the sum of the three loss functions (a code sketch appears at the end of this note).
4. Update the model parameters by performing gradient descent.

## Data

1. BERT does not need manually labeled data.
2. Use large-scale data, e.g., English Wikipedia.
3. Randomly mask words (with some tricks).
4. 50% of the "next sentences" are real; the others are randomly sampled (see the data-construction sketch at the end of this note).

## Cost of computation

1. **BERT Base**
    + 110M parameters.
    + 16 TPUs, 4 days of training (without hyper-parameter tuning).
2. **BERT Large**
    + 340M parameters.
    + 64 TPUs, 4 days of training (without hyper-parameter tuning).
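
## Appendix: Sketch of the combined objective

A minimal PyTorch sketch of the objective described in the Training section: one binary next-sentence loss on the `[CLS]` vector plus one cross-entropy loss per masked position, summed and used for gradient descent. The layer names (`nsp_head`, `mlm_head`), the random encoder outputs, and the token ids are placeholder assumptions, not the actual BERT implementation.

```python
# Sketch of the combined BERT pre-training loss (placeholder heads and ids).
import torch
import torch.nn as nn

vocab_size, hidden = 30522, 768           # BERT Base sizes
nsp_head = nn.Linear(hidden, 2)           # binary classifier on the [CLS] vector
mlm_head = nn.Linear(hidden, vocab_size)  # word classifier on each [MASK] vector

# Fake encoder outputs for one sequence with two masked positions.
cls_vec   = torch.randn(1, hidden)        # output vector at [CLS]
mask_vecs = torch.randn(2, hidden)        # output vectors at the two [MASK] tokens

# Targets: True (real next sentence), "branch", "was".
nsp_target  = torch.tensor([1])           # 1 = real next sentence
mlm_targets = torch.tensor([1357, 2468])  # placeholder token ids for "branch", "was"

ce = nn.CrossEntropyLoss()
loss1 = ce(nsp_head(cls_vec), nsp_target)                # binary classification
loss2 = ce(mlm_head(mask_vecs[0:1]), mlm_targets[0:1])   # first masked word
loss3 = ce(mlm_head(mask_vecs[1:2]), mlm_targets[1:2])   # second masked word

objective = loss1 + loss2 + loss3         # sum of the three losses
objective.backward()                      # gradients for gradient descent
```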
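
## Appendix: Sketch of data construction

A rough sketch of how unlabeled text can be turned into training examples, as described in the Data section: randomly mask some words and keep the real next sentence only half the time. The masking rate and the helper `make_example` are illustrative assumptions; the actual BERT recipe uses additional tricks.

```python
# Build one (tokens, masked positions, masked targets, is_real_next) example.
import random

def make_example(sent_a, sent_b, random_sent, mask_prob=0.15):
    # 50% of the time keep the real next sentence, otherwise use a random one.
    is_real_next = random.random() < 0.5
    second = sent_b if is_real_next else random_sent

    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + second
    masked_positions, targets = [], []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if random.random() < mask_prob:   # randomly mask words
            masked_positions.append(i)
            targets.append(tok)           # the model must recover this word
            tokens[i] = "[MASK]"
    return tokens, masked_positions, targets, is_real_next

example = make_example(
    "calculus is a branch of math".split(),
    "it was developed by newton and leibniz".split(),
    "panda is native to south central china".split(),
)
print(example)
```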