### THE CURIOUS CASE OF NEURAL TEXT DEGENERATION
#### Ideas & Q:
- improve training? predict more tokens (2, 3, ...) ahead and then back-prop?
- need code for evaluation (one script per metric)
- reminds me of similar states: "the animal was found by fishermen off the coast of Bundaberg", "fishing vessel off the coast of Bundaberg" and "fishermen off the coast of Bundaberg" (how to detect them automatically? by the fraction of repeated sentence fragments / n-grams? see the sketch after this list)
- need to look up and take notes on the metrics (e.g. perplexity)
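A minimal sketch of the automatic detection idea above: score a generation by the fraction of n-grams that occur more than once (the function name, whitespace tokenization, and choice of n are my own, not from the paper).

```python
from collections import Counter

def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of n-grams that appear more than once in the text."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

sample = ("the animal was found by fishermen off the coast of Bundaberg , "
          "a fishing vessel off the coast of Bundaberg , "
          "fishermen off the coast of Bundaberg")
print(repeated_ngram_fraction(sample))  # high value -> near-duplicate phrasings
```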
#### Abstract
likelihood works well as a training objective, but maximization-based decoding leads to 'degeneration': bland, incoherent, repetitive output text
Nucleus Sampling: truncate the unreliable tail of the probability distribution, then sample from the remaining (nucleus) mass
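A minimal sketch of nucleus (top-p) sampling as described in the abstract, assuming next-token probabilities are already computed; the function name and the value of p are my own choices.

```python
import numpy as np

def nucleus_sample(probs: np.ndarray, p: float = 0.95, rng=None) -> int:
    """Sample a token id from the smallest set of top tokens whose cumulative
    probability reaches p (the 'nucleus'); the unreliable tail is truncated."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                 # token ids sorted by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1     # smallest prefix with cumulative mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize over the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

# toy distribution over a 5-token vocabulary
print(nucleus_sample(np.array([0.5, 0.3, 0.1, 0.07, 0.03]), p=0.9))
```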
#### Metrics:
- likelihood / perplexity (how 'surprising' the text is), diversity, and repetition
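A minimal sketch of perplexity from per-token log-probabilities (my own helper, not the paper's evaluation code).

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# toy log-probs of a 4-token continuation under some model
print(perplexity([-0.2, -1.5, -0.7, -3.0]))  # lower = less 'surprising' to the model
```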
#### Results:
- maximization-based decoding is an inappropriate objective for open-ended text generation
- the distributions of current best models still have unreliable tails that need to be truncated
- nucleus sampling yields text that is both high-quality and diverse
#### Main body:
beam width b=32: repetitive output; b>64: the model prefers to stop (terminates generation early)
beam search text is not "surprising": its per-token probabilities stay high and flat, unlike human text
the probability of the repeated phrase "I don't know" increases with each repetition of the loop (repetition is self-reinforcing)
Table 1 shows that top-k with k=640 repeats the least
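For comparison with the nucleus sampling sketch above, a minimal top-k sampling sketch (same assumptions; k is a hyperparameter, e.g. 640 as in the row referenced above).

```python
import numpy as np

def top_k_sample(probs: np.ndarray, k: int = 640, rng=None) -> int:
    """Sample a token id from the k most probable tokens only."""
    rng = rng or np.random.default_rng()
    k = min(k, len(probs))
    top = np.argsort(probs)[::-1][:k]           # indices of the k largest probabilities
    top_probs = probs[top] / probs[top].sum()   # renormalize over the kept tokens
    return int(rng.choice(top, p=top_probs))

print(top_k_sample(np.array([0.5, 0.3, 0.1, 0.07, 0.03]), k=3))
```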
Zipf's law: a word's frequency is inversely proportional to a power of its rank (a power law, not an exponential relationship); rank vs. frequency is roughly a straight line on a log-log plot
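A minimal sketch of checking Zipf's law on a text: rank words by frequency and fit a line in log-log space (the toy corpus and the least-squares fit are my own choices).

```python
import numpy as np
from collections import Counter

def zipf_slope(text: str) -> float:
    """Fit log(frequency) ~ slope * log(rank); Zipf's law predicts a slope near -1."""
    counts = Counter(text.lower().split())
    freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

corpus = "the cat sat on the mat and the dog sat on the log by the lake"
print(zipf_slope(corpus))  # natural text gives roughly -1 (this toy corpus is far too small)
```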