# BERT - Wang
###### tags: `NLP`
## Predict Masked Word
"The __ sat on the mat" (What is the masked word?)

## Predict Next Sentence
+ Given the sentence:
"calculus is a branch of math"
+ Is this the next sentence?
"it was developed by newton and leibniz"
+ Is this the next sentence?
"panda is native to south central china"

## Combining the Two Methods
+ Input:
"[CLS] calculus is a **[MASK]** of math **[SEP]** it **[MASK]** developed by newton and leibniz".
+ Targets: True, "branch", "was"
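Below is a sketch of feeding such a combined input to a model that carries both pre-training heads. It assumes the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint; note that its tokenizer also appends a second [SEP] at the end of the pair.

```python
# Sketch: one combined input, two kinds of targets (next-sentence + masked words).
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

first = "calculus is a [MASK] of math"
second = "it [MASK] developed by newton and leibniz"
inputs = tokenizer(first, second, return_tensors="pt")  # adds [CLS] and [SEP]

outputs = model(**inputs)
# outputs.seq_relationship_logits -> the True/False next-sentence target.
# outputs.prediction_logits       -> a word distribution at every position;
#                                    only the [MASK] positions are trained on.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()
for _, pos in mask_positions:
    predicted_id = outputs.prediction_logits[0, pos].argmax().item()
    print(tokenizer.decode([predicted_id]))  # ideally "branch", "was"
```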
## Training
1. **Loss1** is for binary classification (the next-sentence prediction).
2. **Loss2** and **Loss3** are for multi-class classification (predicting each masked word over the vocabulary).
3. The objective function is the sum of the three loss functions.
4. Update model parameters by performing gradient descent, as sketched below.
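A minimal sketch of this objective in PyTorch (an assumed framework; the logits and token ids below are placeholders standing in for the model's real outputs and vocabulary):

```python
# Sketch: sum the next-sentence loss and the two masked-word losses,
# then backpropagate for a gradient-descent update. All tensors are placeholders.
import torch
import torch.nn.functional as F

vocab_size = 30522                                             # BERT's WordPiece vocabulary
nsp_logits = torch.randn(1, 2, requires_grad=True)             # [CLS] head: next sentence or not
mask1_logits = torch.randn(1, vocab_size, requires_grad=True)  # first [MASK] position
mask2_logits = torch.randn(1, vocab_size, requires_grad=True)  # second [MASK] position

# Targets from the example above: True, "branch", "was" (placeholder token ids).
is_next = torch.tensor([0])           # class 0 = "is the next sentence"
branch_id = torch.tensor([3589])
was_id = torch.tensor([2001])

loss1 = F.cross_entropy(nsp_logits, is_next)       # binary classification
loss2 = F.cross_entropy(mask1_logits, branch_id)   # multi-class over the vocabulary
loss3 = F.cross_entropy(mask2_logits, was_id)

objective = loss1 + loss2 + loss3
objective.backward()   # gradients for the gradient-descent update
```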
## Data
1. BERT does not need manually labeled data.
2. Use large-scale data, e.g., English Wikipedia.
3. Randomly mask words (with some tricks; see the sketch after this list).
4. 50% of the "next sentences" are real; the rest are randomly sampled.
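A simplified data-preparation sketch; the 15% masking rate and the 80/10/10 replacement trick follow the BERT paper, while `sentences` and `vocab` are hypothetical inputs:

```python
# Sketch: build (sentence A, sentence B, is_next) pairs and randomly mask words.
import random

def make_nsp_pair(sentences, i):
    """50% of the time take the real next sentence, otherwise a random one."""
    if random.random() < 0.5 and i + 1 < len(sentences):
        return sentences[i], sentences[i + 1], True
    return sentences[i], random.choice(sentences), False

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Mask ~15% of tokens; return the corrupted tokens and the targets to predict."""
    masked, targets = list(tokens), {}
    for pos, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[pos] = tok                      # the word the model must predict
            r = random.random()
            if r < 0.8:
                masked[pos] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                masked[pos] = random.choice(vocab)  # 10%: replace with a random word
            # remaining 10%: keep the original word
    return masked, targets
```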
## Cost of Computation
1. **BERT Base**
+ 110M parameters.
+ 16 TPUs, 4 days of training (without hyper-parameter tuning)
2. **BERT Large**
+ 340M parameters.
+ 64 TPUs, 4 days of training (without hyper-parameter tuning)
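As a rough sanity check on these figures, the sketch below estimates the parameter counts from the published architecture hyper-parameters (BERT Base: 12 layers, hidden size 768; BERT Large: 24 layers, hidden size 1024; a 30,522-token WordPiece vocabulary). It ignores biases, layer norms, and the pooler, so the totals come out slightly below the reported 110M and 340M.

```python
# Rough parameter-count estimate from the published hyper-parameters.
def approx_bert_params(layers, hidden, vocab=30522, max_pos=512, ffn_mult=4):
    embeddings = (vocab + max_pos + 2) * hidden   # token + position + segment embeddings
    attention = 4 * hidden * hidden               # Q, K, V, and output projections
    ffn = 2 * hidden * (ffn_mult * hidden)        # the two feed-forward matrices
    return embeddings + layers * (attention + ffn)

print(approx_bert_params(12, 768))    # ~109M (BERT Base, reported as 110M)
print(approx_bert_params(24, 1024))   # ~334M (BERT Large, reported as 340M)
```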