# Sequence generation with masked language models
Disclaimer: I am working on this as my master's thesis.
There are two types of SOTA language models: masked language models and autoregressive language models.
GPT-2 performs well on text-generation tasks.
On the other hand, BERT is good at natural language understanding downstream tasks such as QA
and text classification.
However, its pre-training surrogate task constrains BERT to be a masked language model.
Could we find a way to show that GPT and BERT can achieve similar performance in terms of text generation and language modeling (evaluated by perplexity)?
Ideas
-----
1. Fill in the masks sequentially (see the first sketch below).
   * A quite naive approach.
   * Turns out to work okay for question generation.
   * But it requires many training steps:
     - for one sentence we need to train on it `# of tokens in the question` times.
2. Probabilistically Masked Language Model (PMLM) with a uniform masking prior (see the second and third sketches below).
   + Novel idea which shows it is an autoregressive permuted language model.
   + Reasonable number of training steps.
   + [to be verified] Okay for fine-tuning from a pre-trained model that has a fixed masking prior.
   - Unknown output length; solution: train a small auxiliary network to predict it.
   - Need to modify the beam search method, but we can treat it as traditional beam search with a random generation order.
   - Unknown procedure for evaluating it for model selection / early stopping, e.g.:
     * whether to iterate over all maskings or just estimate the objective with a Monte Carlo (MC) method;
     * autoregressive evaluation, or treating the masked positions as independent?
3.
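
Sketches
--------
A minimal sketch of idea 1: append `[MASK]` tokens to a prompt and let a pre-trained BERT fill them in one at a time, left to right. It assumes the HuggingFace `transformers` API; the model name, the prompt, and the greedy decoding choice are placeholder assumptions, not part of the idea itself.

```python
# Sketch of idea 1: fill in [MASK] tokens one at a time, left to right,
# with a pre-trained masked language model.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def generate_sequentially(prompt: str, num_tokens: int) -> str:
    """Append `num_tokens` [MASK] tokens after the prompt and fill them left to right."""
    ids = tokenizer.encode(prompt, add_special_tokens=True)             # [CLS] ... [SEP]
    ids = ids[:-1] + [tokenizer.mask_token_id] * num_tokens + ids[-1:]  # masks before [SEP]
    input_ids = torch.tensor([ids])
    first_mask = len(ids) - 1 - num_tokens
    with torch.no_grad():
        for pos in range(first_mask, first_mask + num_tokens):
            logits = model(input_ids).logits                    # (1, seq_len, vocab)
            input_ids[0, pos] = logits[0, pos].argmax()         # greedy; sampling also works
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

print(generate_sequentially("The weather today is", num_tokens=5))
```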
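
For idea 2, a sketch of the training-side change under the same assumptions: each sequence is masked with a ratio drawn from a uniform prior instead of BERT's fixed masking ratio. The `-100` label value is the ignore index used by the HuggingFace masked-LM loss; model and tokenizer are the same placeholders as above.

```python
# Sketch of idea 2 (training side): mask each sequence with a ratio drawn from a
# uniform prior instead of a fixed masking ratio.
import torch

def uniform_prior_mask(input_ids: torch.Tensor, mask_token_id: int, pad_token_id: int):
    """Return (masked_inputs, labels) with a per-sequence masking ratio ~ U(0, 1)."""
    batch_size, seq_len = input_ids.shape
    ratios = torch.rand(batch_size, 1)                          # one ratio per sequence
    mask = (torch.rand(batch_size, seq_len) < ratios) & (input_ids != pad_token_id)
    empty = ~mask.any(dim=1)
    mask[empty, 1] = True        # guard: a ratio near 0 may mask nothing; force one token
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id
    labels = input_ids.clone()
    labels[~mask] = -100         # compute the loss only on masked positions
    return masked_inputs, labels

# inside a training step, assuming the same model/tokenizer as in the first sketch:
# masked_inputs, labels = uniform_prior_mask(batch, tokenizer.mask_token_id, tokenizer.pad_token_id)
# loss = model(masked_inputs, labels=labels).loss
```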
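
For the evaluation question under idea 2, a sketch of the MC option: estimate the validation objective by averaging the masked-LM loss over a few random maskings per batch instead of iterating over all maskings. `uniform_prior_mask` is the helper from the previous sketch and `num_samples` is an arbitrary choice.

```python
# Sketch of the MC option for evaluating idea 2: rather than iterating over all
# maskings, average the masked-LM loss over a few random maskings per batch.
import torch

@torch.no_grad()
def mc_validation_loss(model, batch, mask_token_id, pad_token_id, num_samples=8):
    """Monte Carlo estimate of the expected masked-LM loss under the uniform prior."""
    losses = []
    for _ in range(num_samples):
        masked_inputs, labels = uniform_prior_mask(batch, mask_token_id, pad_token_id)
        losses.append(model(masked_inputs, labels=labels).loss)
    return torch.stack(losses).mean()

# exp() of this estimate gives a perplexity-style number for model selection /
# early stopping; whether it is directly comparable to GPT-2's autoregressive
# perplexity is exactly the open question raised above.
```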