# BERT Sling Parser
* Author: Alvin Hou
* Date: 2019/7/30
* Commit: Based on commit [06eb9c6](https://github.com/ASUS-AICS/Chinese-SLING-parser/commit/06eb9c61836a42660837178f22690e54c746d374) on branch `bert-sling-parser`
## New Files
### `sling/nlp/parser/trainer/bert.py`
#### Description
BERT (`bert-base-cased`) model that returns word embeddings for a given sentence.
#### Sample usage
```python
import torch.nn as nn

from bert import BertBase

class Caspar(nn.Module):
    def __init__(self, spec, bert_model_path=None):
        super(Caspar, self).__init__()
        self.bert = BertBase('cuda', finetuning=True)
```
#### Notes
BERT only returns the embeddings of the actual words (the `[CLS]` and `[SEP]` embeddings are stripped), so the output matches the original sentence length.
```python
# line 33 in bert.py
output = output.squeeze(0)
output = torch.narrow(output, 0, 1, len(output)-2) # strip head and tail embeddings to match original length
```
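Putting the tokenizer and `BertBase` together, here is a minimal sketch of the expected shapes, assuming `BertBase` is called with a `(1, seq_len)` tensor of token ids the way `pytorch_modules.py` calls it, and that the embedding dimension is the 768 declared in `spec.py`:
```python
import torch

from bert import BertBase
from tokenizer import tokenizer

bert = BertBase('cuda', finetuning=True)

# Token ids for "Apple is big", wrapped with the special tokens BERT expects.
tokens = ['[CLS]', 'Apple', 'is', 'big', '[SEP]']
ids = torch.LongTensor(tokenizer.convert_tokens_to_ids(tokens)).to('cuda')

embeddings = bert(ids.unsqueeze(0))  # the '[CLS]' / '[SEP]' rows are stripped inside BertBase
assert embeddings.shape == (3, 768)  # one 768-dim embedding per original word
```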
### `sling/nlp/parser/trainer/tokenizer.py`
#### Description
The BERT tokenizer for `bert-base-cased`.
#### Sample usage
```python
from tokenizer import tokenizer
tokenizer.convert_tokens_to_ids(['Apple']) # [7302]
```
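The `spec.py` change below uses the tokenizer's `vocab` to fall back to `'[UNK]'` for out-of-vocabulary words; the same pattern works standalone (the example word is hypothetical):
```python
from tokenizer import tokenizer

word = 'Zyzzyva'  # hypothetical out-of-vocabulary word
ids = tokenizer.convert_tokens_to_ids([word if word in tokenizer.vocab else '[UNK]'])  # id of '[UNK]'
```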
### `sling/nlp/parser/tools/graph_api.py`
#### Description
API for generating the JSON used to plot relation graphs on the web.
#### Sample usage
Check out the `graph_api.py` file for sample usage.
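The actual schema is defined in `graph_api.py`; purely as a hypothetical illustration (the field names here are invented, not taken from that file), web graph libraries generally consume node/edge JSON along these lines:
```python
import json

# Hypothetical node/edge layout, not the schema used by graph_api.py.
graph = {
    'nodes': [{'id': 0, 'label': 'Apple'}, {'id': 1, 'label': 'company'}],
    'edges': [{'source': 0, 'target': 1, 'label': 'isa'}],
}
print(json.dumps(graph))
```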
## File Changes
### `sling/nlp/parser/trainer/spec.py`
#### Description
Changed the Caspar spec: adjusted the word embedding dimension and added the BERT tokenization step.
#### Changes
```python
# line 35: Added feature name for each feature
self.name = '' # feature name
# line 113: Added bert embedding dimension
self.bert_embedding_dim = 768
# line 348,349: Changed word embedding dimension (32 -> 768)
self.add_lstm_fixed("word", self.bert_embedding_dim, self.words.size()) # adjust BERT dimension
# line 497: Specify the feature name
features.name = f.name
# line 500-504: Tokenize the word tokens with the BERT tokenizer, then add them to `features`
if f.name == "word":
    raw_tokens = ['[CLS]'] + [token.word for token in document.tokens] + ['[SEP]']
    # Tokenize the words with the BERT tokenizer
    for token in raw_tokens:
        features.add(tokenizer.convert_tokens_to_ids([token if token in tokenizer.vocab else '[UNK]']))
```
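Standalone, the tokenization step added at lines 500-504 behaves like this sketch, where `words` stands in for `[token.word for token in document.tokens]` and the resulting indices are what `features` receives:
```python
from tokenizer import tokenizer

words = ['Apple', 'is', 'big']            # stand-in for [token.word for token in document.tokens]
raw_tokens = ['[CLS]'] + words + ['[SEP]']

indices = []
for token in raw_tokens:
    indices.extend(tokenizer.convert_tokens_to_ids([token if token in tokenizer.vocab else '[UNK]']))

assert len(indices) == len(words) + 2     # the extra head/tail ids are stripped again inside BertBase
```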
### `sling/nlp/parser/trainer/pytorch_modules.py`
#### Description
1. Added the BERT model to the Caspar module
2. Feed the input word indices into BERT during `_embedding_lookup` in the Caspar forward pass
#### Changes
```python
# line 301-304: Added BERT model to Caspar module
def __init__(self, spec, bert_model_path=None):
    super(Caspar, self).__init__()
    self.spec = spec
    self.bert = BertBase('cuda', finetuning=True) if not bert_model_path else torch.load(bert_model_path)
# line 422-429
# If the feature is `word`, we feed the token indices to BERT to
# get the word embeddings, then append the embeddings to `values`
for feature, bag in zip(features, embedding_bags):
    if feature.name == 'word':
        word_input = torch.LongTensor(feature.indices).to('cuda')
        output = self.bert(word_input.unsqueeze(0))
        values.append(output.to('cpu'))
```
### `sling/nlp/parser/trainer/trace.py`
#### Description
Changed the dimension checks performed when storing the full trace.
#### Changes
```python
# line 32: import tokenizer
from tokenizer import tokenizer
# line 97-103
# 1. Change the assert condition when encountering the `word` feature
# 2. Strip the '[CLS]', '[SEP]' indices before storing
if vals.name == 'word':
    assert len(vals.indices) - 2 == len(tokens)  # '[CLS]', '[SEP]' head and tail
else:
    assert len(vals.indices) == len(tokens)
indices = vals.indices[1:-1] if vals.name == 'word' else vals.indices  # strip '[CLS]', '[SEP]'
```
### `sling/nlp/parser/trainer/trainer.py`
#### Description
Make the trainer save the BERT model during training.
Also save the loss to `log.txt`. This was added so training progress can be viewed immediately, since the Azure VMs seem to be too busy during training and do not flush `stdout` and `stderr` for a long time. It can be disabled without any side effects.
#### Changes
```python
# line 249: Save loss to `log.txt` instantly
with open('log.txt', 'a') as f:
f.write("{} BatchLoss after ({} batches = {} examples): {}, incl. L2={} ({} secs)\n".format(now(), num_batches, self.count, value.item(), (l2 / 3.0).item(), end - start))
# line 297: Also saves model.bert while saving best flow
# Save Bert here
torch.save(self.model.bert, self.output_file_prefix + '_bert_model.pt')
```
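The saved `<prefix>_bert_model.pt` file is what the `--bert` flag below and Caspar's `bert_model_path` argument expect at inference time; loading it back is a plain `torch.load` (the prefix here is illustrative):
```python
import torch

# 'out' is an illustrative output_file_prefix; the trainer writes '<prefix>_bert_model.pt'.
bert = torch.load('out_bert_model.pt')  # same call Caspar.__init__ makes when bert_model_path is set
```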
### `sling/nlp/parser/trainer/train_util.py`
#### Description
Added a `--bert` flag that can be passed when running the `parse.py` script for inference.
#### Changes
```python
# line 177
flags.define('--bert',
             help='Path to the BERT model file',
             default="",
             type=str,
             metavar='FILE')
```
### `sling/nlp/parser/tools/parse.py`
#### Description
Changed `parse.py` to support SLING+BERT inference.
#### Changes
```python
# line 51: Pass the `bert_model_path` arg when initializing Caspar
caspar = Caspar(spec, bert_model_path=args.bert)
# line 58-61: Disabled debugging and trace
state, _, _, trace = caspar.forward(document, train=False, debug=None)
# line 86-97
# Added a parse-and-print example in the `__main__` function
```
### `sling/nlp/parser/tools/train_pytorch.py`
#### Description
Save the hyperparameters to `log.txt`. This was added so the parameters can be viewed immediately, since the Azure VMs seem to be too busy during training and do not flush `stdout` and `stderr` for a long time. It can be disabled without any side effects.
#### Changes
```python
# line 90
with open('log.txt', 'a') as f:
f.write("Using hyperparameters: {}\n".format(hyperparams))
```