# Sequence generation for masked language model

Disclaimer: I am working on this as my master thesis.

There are two types of SOTA language models: masked language models and autoregressive language models. GPT-2 performs well on text-generation tasks. BERT, on the other hand, is good at natural language understanding downstream tasks such as QA and text classification. However, the surrogate task used for pre-training forces BERT to be a masked language model. Can we find a way to show that GPT and BERT achieve similar performance on text generation and language modeling (evaluated by perplexity)?

Ideas
-----

1. Fill out masks sequentially (see the first sketch after this list).
    * A quite naive way.
    * Turns out okay for question generation.
    * But it needs a lot of training steps: each sentence must be trained `# of tokens in the question` times.
2. Probabilistic masked language model with a uniform masking prior (see the second sketch after this list).
    + Novel idea, which shows that it is an autoregressive permuted language model.
    + Reasonable number of training steps.
    + [to be verified] Okay for fine-tuning from a pre-trained model that has a fixed masking prior.
    - Unknown output length; solution: train a small auxiliary network to predict it.
    - Need to modify the beam search method, but we can treat it as traditional beam search with a random generation order.
    - Unknown procedure for evaluating it for model selection / early stopping (see the third sketch after this list), e.g.:
        * whether to iterate over all maskings or just estimate the objective with a Monte Carlo (MC) method;
        * autoregressive evaluation, or treating the masked tokens as independent?
3.
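As a rough illustration of Idea 1 (sequential mask filling), the sketch below uses a pre-trained BERT from HuggingFace `transformers` to fill an all-`[MASK]` suffix one position at a time, left to right. The fixed target length, the greedy argmax decoding, and the left-to-right order are assumptions made here for brevity, not the method settled on in the thesis.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# Minimal sketch of Idea 1: append target_len [MASK] tokens to a prompt and
# fill them one by one, re-running the model after each fill.
# Requires a recent transformers version (v4+) for the .logits attribute.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

prompt = "the capital of france is"
target_len = 3  # hypothetical number of tokens to generate

# Build: [CLS] prompt tokens, [MASK] x target_len, [SEP]
prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
input_ids = torch.tensor([
    [tokenizer.cls_token_id]
    + prompt_ids
    + [tokenizer.mask_token_id] * target_len
    + [tokenizer.sep_token_id]
])

with torch.no_grad():
    # Positions of the [MASK] tokens, filled left to right.
    for pos in range(1 + len(prompt_ids), 1 + len(prompt_ids) + target_len):
        logits = model(input_ids).logits            # (1, seq_len, vocab)
        input_ids[0, pos] = logits[0, pos].argmax()  # greedy fill of this mask

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```

Each step re-encodes the whole sequence so that earlier predictions condition later ones, which is the per-token cost referred to in the `# of tokens in the question` bullet above.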
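The masking step of Idea 2 could look like the sketch below: instead of BERT's fixed masking ratio, each training sequence draws its own ratio from a uniform prior. The function name, the per-sequence (rather than per-token) ratio, and the `-100` ignore-index label convention are assumptions for illustration, not a definitive implementation.

```python
import torch

def mask_with_uniform_prior(input_ids, mask_token_id):
    """Sketch of Idea 2's masking step: for each sequence, draw a masking
    ratio r ~ Uniform(0, 1), then mask each token independently with
    probability r. Returns masked inputs and labels, where -100 marks
    positions that should not contribute to the loss.
    """
    batch_size, seq_len = input_ids.shape
    ratios = torch.rand(batch_size, 1)                  # one ratio per sequence
    mask = torch.rand(batch_size, seq_len) < ratios     # per-token Bernoulli(r)

    labels = input_ids.clone()
    labels[~mask] = -100                                # only masked positions are predicted

    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id                 # replace with [MASK]
    return masked_inputs, labels

# Hypothetical usage with random token ids; 103 is [MASK] in bert-base-uncased.
example = torch.randint(1000, 2000, (4, 16))
masked, labels = mask_with_uniform_prior(example, mask_token_id=103)
```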
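For the open evaluation question at the end of Idea 2, one possible Monte Carlo route is sketched below: sample a few masking ratios and patterns, score only the masked positions, and average the negative log-likelihood per masked token. Within each sample this takes the "treat them independent" option; the sample count, the padding id, and the averaging scheme are assumptions, and the resulting exp(NLL) is only a perplexity-like number, not directly comparable to autoregressive perplexity.

```python
import torch

def mc_estimate_objective(model, input_ids, mask_token_id, num_samples=8):
    """Sketch of the MC option: average per-masked-token NLL over a few
    random maskings drawn from the uniform prior, instead of iterating
    over all possible maskings.
    """
    model.eval()
    total_logp, count = 0.0, 0
    with torch.no_grad():
        for _ in range(num_samples):
            ratio = torch.rand(1).item()
            mask = torch.rand_like(input_ids, dtype=torch.float) < ratio
            mask &= input_ids != 0                      # assumes 0 is the padding id
            if not mask.any():
                continue
            masked = input_ids.clone()
            masked[mask] = mask_token_id
            logits = model(masked).logits               # (batch, seq_len, vocab)
            logp = torch.log_softmax(logits, dim=-1)
            token_logp = logp.gather(-1, input_ids.unsqueeze(-1)).squeeze(-1)
            total_logp += token_logp[mask].sum().item()
            count += mask.sum().item()
    return -total_logp / max(count, 1)                  # average NLL per masked token
```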