# BERT Sling Parser

* Author: Alvin Hou
* Date: 2019/7/30
* Commit: based on commit [06eb9c6](https://github.com/ASUS-AICS/Chinese-SLING-parser/commit/06eb9c61836a42660837178f22690e54c746d374) on branch `bert-sling-parser`

## New Files

### `sling/nlp/parser/trainer/bert.py`

#### Description

BERT (`bert-base-cased`) model that returns word embeddings for a given sentence.

#### Sample usage

```python
import torch.nn as nn

from bert import BertBase

class Caspar(nn.Module):
  def __init__(self, spec, bert_model_path=None):
    super(Caspar, self).__init__()
    self.bert = BertBase('cuda', finetuning=True)
```

#### Notes

BERT only returns the embeddings of the actual words (the `[CLS]` and `[SEP]` embeddings are stripped), so the output matches the original sentence length.

```python
# line 33 of bert.py
output = output.squeeze(0)
output = torch.narrow(output, 0, 1, len(output) - 2)  # strip head and tail embeddings to match the original length
```

### `sling/nlp/parser/trainer/tokenizer.py`

#### Description

The BERT tokenizer (`bert-base-cased`).

#### Sample usage

```python
from tokenizer import tokenizer

tokenizer.convert_tokens_to_ids(['Apple'])  # [7302]
```

### `sling/nlp/parser/tools/graph_api.py`

#### Description

API for generating the JSON used to plot relation graphs on the web.

#### Sample usage

See `graph_api.py` for sample usage.

## Changed Files

### `sling/nlp/parser/trainer/spec.py`

#### Description

Changed the Caspar spec: adjusted the word embedding dimension and added the BERT tokenization step.

#### Changes

```python
# line 35: added a feature name for each feature
self.name = ''  # feature name

# line 113: added the BERT embedding dimension
self.bert_embedding_dim = 768

# lines 348-349: changed the word embedding dimension (32 -> 768)
self.add_lstm_fixed("word", self.bert_embedding_dim, self.words.size())  # adjust to the BERT dimension

# line 497: specify the feature name
features.name = f.name

# lines 500-504: tokenize the word tokens with the BERT tokenizer, then add them to the features
if f.name == "word":
  raw_tokens = ['[CLS]'] + [token.word for token in document.tokens] + ['[SEP]']
  # Tokenize the words with the BERT tokenizer.
  for token in raw_tokens:
    features.add(tokenizer.convert_tokens_to_ids([token if token in tokenizer.vocab else '[UNK]']))
```

### `sling/nlp/parser/trainer/pytorch_modules.py`

#### Description

1. Added the BERT model to the Caspar module.
2. Feed the input word indices into BERT during `_embedding_lookup` in the Caspar forward pass.

#### Changes

```python
# lines 301-304: added the BERT model to the Caspar module
def __init__(self, spec, bert_model_path=None):
  super(Caspar, self).__init__()
  self.spec = spec
  self.bert = BertBase('cuda', finetuning=True) if not bert_model_path else torch.load(bert_model_path)

# lines 422-429:
# If the feature is `word`, feed the token indices to BERT to get the word
# embeddings, then append the embeddings to `values`.
for feature, bag in zip(features, embedding_bags):
  if feature.name == 'word':
    word_input = torch.LongTensor(feature.indices).to('cuda')
    output = self.bert(word_input.unsqueeze(0))
    values.append(output.to('cpu'))
```

### `sling/nlp/parser/trainer/trace.py`

#### Description

Changed the dimension checks used while storing the full trace.

#### Changes

```python
# line 32: import the tokenizer
from tokenizer import tokenizer

# lines 97-103:
# 1. Change the assert condition when the `word` feature is encountered.
# 2. Strip the '[CLS]' and '[SEP]' indices while storing.
if vals.name == 'word':
  assert len(vals.indices) - 2 == len(tokens)  # '[CLS]' and '[SEP]' head and tail
else:
  assert len(vals.indices) == len(tokens)
indices = vals.indices[1:-1] if vals.name == 'word' else vals.indices  # strip '[CLS]', '[SEP]'
```

### `sling/nlp/parser/trainer/trainer.py`

#### Description

Make the trainer store the BERT model during training.

Save the loss to `log.txt` for instant viewing of the training progress, since the Azure VMs are often too busy while training and hold back `stdout` and `stderr` for a long time. This can be disabled without any side effects.

#### Changes

```python
# line 249: save the loss to `log.txt` instantly
with open('log.txt', 'a') as f:
  f.write("{} BatchLoss after ({} batches = {} examples): {}, incl. L2={} ({} secs)\n".format(
      now(), num_batches, self.count, value.item(), (l2 / 3.0).item(), end - start))

# line 297: also save model.bert when saving the best flow
torch.save(self.model.bert, self.output_file_prefix + '_bert_model.pt')
```

### `sling/nlp/parser/trainer/train_util.py`

#### Description

Added a `--bert` flag that can be passed to the `parse.py` script when running inference.

#### Changes

```python
# line 177
flags.define('--bert', help='Path to the BERT model file', default="", type=str, metavar='FILE')
```

### `sling/nlp/parser/tools/parse.py`

#### Description

Changed `parse.py` to support SLING+BERT inference.

#### Changes

```python
# line 51: pass the `bert_model_path` argument when initializing Caspar
caspar = Caspar(spec, bert_model_path=args.bert)

# lines 58-61: disabled debugging and trace
state, _, _, trace = caspar.forward(document, train=False, debug=None)

# lines 86-97: added a parse-and-print example to the __main__ function
```

### `sling/nlp/parser/tools/train_pytorch.py`

#### Description

Save the hyperparameters to `log.txt` for instant viewing, since the Azure VMs are often too busy while training and hold back `stdout` and `stderr` for a long time. This can be disabled without any side effects.

#### Changes

```python
# line 90
with open('log.txt', 'a') as f:
  f.write("Using hyperparameters: {}\n".format(hyperparams))
```
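## Appendix: sketch of the `BertBase` wrapper

For reference, below is a minimal sketch of what the `BertBase` wrapper described in the `bert.py` section could look like. It assumes the `pytorch_pretrained_bert` package; the package choice, the `encoded_layers` handling, and the `finetuning` behaviour are assumptions based on the notes above, not the actual contents of `bert.py` (only the `squeeze`/`narrow` stripping step is taken from that file).

```python
import torch
import torch.nn as nn
from pytorch_pretrained_bert import BertModel  # assumed dependency


class BertBase(nn.Module):
  """Wraps bert-base-cased and returns one 768-dim embedding per original token."""

  def __init__(self, device, finetuning=False):
    super(BertBase, self).__init__()
    self.device = device
    self.finetuning = finetuning
    self.bert = BertModel.from_pretrained('bert-base-cased').to(device)

  def forward(self, token_ids):
    # token_ids: (1, seq_len) LongTensor that includes the [CLS] and [SEP] ids.
    if self.finetuning:
      encoded_layers, _ = self.bert(token_ids)
    else:
      with torch.no_grad():
        encoded_layers, _ = self.bert(token_ids)
    output = encoded_layers[-1]                           # last encoder layer, (1, seq_len, 768)
    output = output.squeeze(0)                            # (seq_len, 768)
    output = torch.narrow(output, 0, 1, len(output) - 2)  # strip [CLS]/[SEP] (bert.py, line 33)
    return output
```

The `finetuning` flag decides whether gradients flow back into BERT; when it is off, the encoder is used only as a frozen feature extractor.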
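Under the same assumptions, a quick end-to-end check that the returned embeddings line up one-to-one with the original tokens (the example sentence is arbitrary; only the `BertBase` signature and the tokenizer calls come from the sections above):

```python
import torch

from bert import BertBase
from tokenizer import tokenizer  # the bert-base-cased tokenizer described above

words = ['Apple', 'makes', 'phones', '.']
raw_tokens = ['[CLS]'] + words + ['[SEP]']
ids = [tokenizer.convert_tokens_to_ids([t if t in tokenizer.vocab else '[UNK]'])[0]
       for t in raw_tokens]

bert = BertBase('cuda', finetuning=False)
embeddings = bert(torch.LongTensor(ids).to('cuda').unsqueeze(0))
assert embeddings.shape == (len(words), 768)  # [CLS]/[SEP] already stripped
```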