(under construction)

## Introductory Materials

- Video: [The Transformer neural network architecture explained](https://www.youtube.com/watch?v=FWFA4DGuzSc). (Some related videos: [A brief history of the Transformer architecture in NLP](https://www.youtube.com/watch?v=iH-wmtxHunk) gives some general background ... [How to check if a neural network has learned a specific phenomenon?](https://www.youtube.com/watch?v=fL22NAtMNYo&t=0s) explains how a language model such as BERT, trained to do, for example, masked word prediction on big data, can be adapted to solve a different task for which there is only little data. More videos in the [Transformer Playlist](https://m.youtube.com/playlist?list=PLpZBeKTZRGPNdymdEsSSSod5YQ3Vu0sKY).)
- Jay Alammar's blog: [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) ... [Seq2seq Models With Attention](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/).
- The lecture [Attention and Transformer Networks](https://m.youtube.com/watch?v=OyFJWRnt_AY) contains more detail. (This is part of a course on [machine learning](https://m.youtube.com/playlist?list=PLdAoL1zKcqTW-uzoSVBNEecKHsnug_M0k).)
- The following screenshot is from [Noah Smith's lecture](https://drive.google.com/file/d/1cK43rSzH491oI9NIrLlDAeP8P2F7LXTJ/view) (and see [here](http://rush-nlp.com/2018/04/01/attention.html) for Rush (2018)).

![](https://i.imgur.com/X2iuxiz.png)

## Articles

### Original Articles

These two papers introduce "attention" to NLP:

- Bahdanau et al., [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf), 2014.
- Luong et al., [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/pdf/1508.04025.pdf), 2015.

This paper is credited with introducing the "transformer":

- Vaswani et al., [Attention is all you need](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf), 2017. See also the [annotated-transformer](https://github.com/harvardnlp/annotated-transformer) on GitHub. (A minimal code sketch of the attention operation itself follows the application list below.)

GPT uses transformers to learn a language model.

- Radford et al., [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), 2019.
- A [Guardian article](https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3) written by GPT-3, and a [video](https://www.youtube.com/watch?v=TfVYxnhuEdU) by Tom Scott.

BERT is another transformer-based language model:

- Devlin et al., [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf), 2019.
- [Blog](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270), [Tutorial](https://www.freecodecamp.org/news/google-bert-nlp-machine-learning-tutorial/).
- Jacob Devlin [Interview](https://www.youtube.com/watch?v=u91645MFytY).
- Tenney et al., [BERT Rediscovers the Classical NLP Pipeline](https://arxiv.org/pdf/1905.05950.pdf), 2019.
- See also McCormick's [Jupyter Notebooks with Example Applications](http://mccormickml.com/archive/).

More Applications of Transformers to NLP:

- [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165.pdf), 2020.
- [Neural Databases](https://arxiv.org/pdf/2010.06973.pdf), 2020.
- [Introducing FLAN: More generalizable Language Models with Instruction Fine-Tuning](https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html), 2021.
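All of the papers above build on the same core operation, (scaled dot-product) attention, which in the notation of Vaswani et al. computes softmax(Q Kᵀ / sqrt(d_k)) V. The following is a minimal NumPy sketch of that single operation only; the variable names and toy shapes are mine, and the real models add learned projections, masking, and multiple heads on top of it.

```python
import numpy as np

def softmax(x, axis=-1):
    # shift by the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every query with every key
    weights = softmax(scores, axis=-1)    # each row is a distribution over the keys
    return weights @ V                    # weighted average of the values

# toy example: 3 query vectors attending over 4 key/value pairs
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 16)
```

The `weights` matrix is what attention visualisations (as in the blog posts above) typically display.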
Transformers beat CNNs for image recognition:

- Dosovitskiy et al., [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf), 2021.

Transformers for composing and performing music:

- [Music Transformer](https://arxiv.org/pdf/1809.04281.pdf), 2018. [Magenta blog](https://magenta.tensorflow.org/music-transformer) ... [demos](https://magenta.tensorflow.org/demos/) ... [github](https://github.com/magenta/magenta).

Protein folding:

- https://en.wikipedia.org/wiki/AlphaFold

Ethics:

- Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell: [On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922), 2021.
- https://aclanthology.org/P19-1355.pdf
- [RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models](https://arxiv.org/pdf/2009.11462.pdf), 2020.

### Surveys

- [Neural Machine Translation: A Review and Survey](https://arxiv.org/pdf/1912.02047.pdf), 2020.

## Criticism

- ...

## Related Stuff

[Symbolic Regression](https://www.youtube.com/watch?v=HKJB0Bjo6tQ) can be done with genetic algorithms (interpretable, but bad at high-dimensional problems) or with deep learning (good at high-dimensional problems). Data -> NN -> SR. [^jpg] Afaiu, what is nice here is that the NN itself has a physical interpretation. See also AI Feynman. (A toy sketch of the Data -> NN -> SR pipeline appears at the end of this page.)

[^jpg]: ![](https://i.imgur.com/dU6JfbP.jpg =200x) AI Feynman.

## Random Links

- Kim et al., [Structured Attention Networks](https://arxiv.org/pdf/1702.00887.pdf), 2017.
- Overview paper: [Evaluating word embedding models: methods and experimental results](https://www.cambridge.org/core/services/aop-cambridge-core/content/view/EDF43F837150B94E71DBB36B28B85E79/S204877031900012Xa.pdf/div-class-title-evaluating-word-embedding-models-methods-and-experimental-results-div.pdf), 2019.
- https://www.youtube.com/watch?v=TfVYxnhuEdU
- https://game-developers.org/coding-adventure-game-idea-generator/
- Transformers with recurrence: [Notes on Universal Transformers](https://hackmd.io/@FtbpSED3RQWclbmbmkChEA/rJIXkXqHu)
- [Rethinking Attention with Performers](https://ai.googleblog.com/2020/10/rethinking-attention-with-performers.html)
- [10 Leading Language Models For NLP In 2021](https://www.topbots.com/leading-nlp-language-models-2020/)
- [Deep Implicit Attention: A Mean-Field Theory Perspective on Attention Mechanisms](https://mcbal.github.io/post/deep-implicit-attention-a-mean-field-theory-perspective-on-attention-mechanisms/), 2021.
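Appendix to the symbolic-regression note under Related Stuff: a toy sketch of the Data -> NN -> SR pipeline. It assumes the `scikit-learn` and `gplearn` packages; the target formula, network size, and hyperparameters are invented for illustration, and AI Feynman itself uses a more elaborate recursive procedure.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor   # the "NN" stage
from gplearn.genetic import SymbolicRegressor     # the "SR" stage (genetic programming)

# toy data generated from an "unknown" law y = 3*x0^2 + x1, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 2))
y = 3.0 * X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.05, size=500)

# Data -> NN: fit a neural network to the noisy observations
nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X, y)

# NN -> SR: run symbolic regression on the network's smoothed predictions
sr = SymbolicRegressor(population_size=1000, generations=20, random_state=0)
sr.fit(X, nn.predict(X))
print(sr._program)  # best expression found (exact form will vary between runs)
```

Fitting the SR stage to `nn.predict(X)` rather than to `y` directly is the point of the pipeline: the idea is that the network acts as a denoised surrogate for the data that the symbolic search can query anywhere.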