(under construction)

## Introductory Materials

- Video: [The Transformer neural network architecture explained](https://www.youtube.com/watch?v=FWFA4DGuzSc). (Some related videos: [A brief history of the Transformer architecture in NLP](https://www.youtube.com/watch?v=iH-wmtxHunk) gives some general background ... [How to check if a neural network has learned a specific phenomenon?](https://www.youtube.com/watch?v=fL22NAtMNYo&t=0s) explains how a language model such as BERT, trained to do, for example, masked word prediction on big data, can be adapted to solve a different task for which there is only little data. More videos in the [Transformer Playlist](https://m.youtube.com/playlist?list=PLpZBeKTZRGPNdymdEsSSSod5YQ3Vu0sKY).)
- Jay Alammar's blog: [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) ... [Seq2seq Models With Attention](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/).
- The lecture [Attention and Transformer Networks](https://m.youtube.com/watch?v=OyFJWRnt_AY) contains more detail. (This is part of a course on [machine learning](https://m.youtube.com/playlist?list=PLdAoL1zKcqTW-uzoSVBNEecKHsnug_M0k).)
- The following screenshot is from [Noah Smith's lecture](https://drive.google.com/file/d/1cK43rSzH491oI9NIrLlDAeP8P2F7LXTJ/view) (and see [here](http://rush-nlp.com/2018/04/01/attention.html) for Rush (2018)).

![](https://i.imgur.com/X2iuxiz.png)

## Articles

### Original Articles

These two papers introduce "attention" to NLP:

- Bahdanau et al., [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf), 2014.
- Luong et al., [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/pdf/1508.04025.pdf), 2015.

This paper is credited with introducing the "transformer":

- Vaswani et al., [Attention is all you need](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf), 2017. See also the [annotated-transformer](https://github.com/harvardnlp/annotated-transformer) on GitHub. (A minimal code sketch of the attention operation itself follows the application list below.)

GPT uses transformers to learn a language model.

- Radford et al., [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), 2019.
- A [Guardian article](https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3) written by GPT-3, and a [video](https://www.youtube.com/watch?v=TfVYxnhuEdU) by Tom Scott.

BERT is another transformer-based language model:

- Devlin et al., [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf), 2019.
- [Blog](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270), [Tutorial](https://www.freecodecamp.org/news/google-bert-nlp-machine-learning-tutorial/).
- Jacob Devlin [Interview](https://www.youtube.com/watch?v=u91645MFytY).
- Tenney et al., [BERT Rediscovers the Classical NLP Pipeline](https://arxiv.org/pdf/1905.05950.pdf), 2019.
- See also McCormick's [Jupyter Notebooks with Example Applications](http://mccormickml.com/archive/).

More Applications of Transformers to NLP:

- [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165.pdf), 2020.
- [Neural Databases](https://arxiv.org/pdf/2010.06973.pdf), 2020.
- [Introducing FLAN: More generalizable Language Models with Instruction Fine-Tuning](https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html), 2021.
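All of the papers above build on the same core operation, (scaled dot-product) attention, which in the notation of Vaswani et al. computes softmax(Q Kᵀ / sqrt(d_k)) V. The following is a minimal NumPy sketch of that single operation only; the variable names and toy shapes are mine, and the real models add learned projections, masking, and multiple heads on top of it.

```python
import numpy as np

def softmax(x, axis=-1):
    # shift by the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every query with every key
    weights = softmax(scores, axis=-1)    # each row is a distribution over the keys
    return weights @ V                    # weighted average of the values

# toy example: 3 query vectors attending over 4 key/value pairs
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 16)
```

The `weights` matrix is what attention visualisations (as in the blog posts above) typically display.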
Transformers beat CNNs for image recognition:

- Dosovitskiy et al., [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf), 2021.

Transformers for composing and performing music:

- [Music Transformer](https://arxiv.org/pdf/1809.04281.pdf), 2018. [Magenta blog](https://magenta.tensorflow.org/music-transformer) ... [demos](https://magenta.tensorflow.org/demos/) ... [github](https://github.com/magenta/magenta).

Protein folding:

- https://en.wikipedia.org/wiki/AlphaFold

Ethics:

- Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell: [On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922), 2021.
- https://aclanthology.org/P19-1355.pdf
- [RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models](https://arxiv.org/pdf/2009.11462.pdf), 2020.

### Surveys

- [Neural Machine Translation: A Review and Survey](https://arxiv.org/pdf/1912.02047.pdf), 2020.

## Criticism

- ...

## Related Stuff

[Symbolic Regression](https://www.youtube.com/watch?v=HKJB0Bjo6tQ) can be done with genetic algorithms (interpretable, but bad at high-dimensional problems) or with deep learning (good at high-dimensional problems). Data -> NN -> SR. [^jpg] Afaiu, what is nice here is that the NN itself has a physical interpretation. See also AI Feynman. (A toy sketch of the Data -> NN -> SR pipeline appears at the end of this page.)

[^jpg]: ![](https://i.imgur.com/dU6JfbP.jpg =200x) AI Feynman.

## Random Links

- Kim et al., [Structured Attention Networks](https://arxiv.org/pdf/1702.00887.pdf), 2017.
- Overview paper: [Evaluating word embedding models: methods and experimental results](https://www.cambridge.org/core/services/aop-cambridge-core/content/view/EDF43F837150B94E71DBB36B28B85E79/S204877031900012Xa.pdf/div-class-title-evaluating-word-embedding-models-methods-and-experimental-results-div.pdf), 2019.
- https://www.youtube.com/watch?v=TfVYxnhuEdU
- https://game-developers.org/coding-adventure-game-idea-generator/
- Transformers with recurrence: [Notes on Universal Transformers](https://hackmd.io/@FtbpSED3RQWclbmbmkChEA/rJIXkXqHu)
- [Rethinking Attention with Performers](https://ai.googleblog.com/2020/10/rethinking-attention-with-performers.html)
- [10 Leading Language Models For NLP In 2021](https://www.topbots.com/leading-nlp-language-models-2020/)
- [Deep Implicit Attention: A Mean-Field Theory Perspective on Attention Mechanisms](https://mcbal.github.io/post/deep-implicit-attention-a-mean-field-theory-perspective-on-attention-mechanisms/), 2021.
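Appendix to the symbolic-regression note under Related Stuff: a toy sketch of the Data -> NN -> SR pipeline. It assumes the `scikit-learn` and `gplearn` packages; the target formula, network size, and hyperparameters are invented for illustration, and AI Feynman itself uses a more elaborate recursive procedure.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor   # the "NN" stage
from gplearn.genetic import SymbolicRegressor     # the "SR" stage (genetic programming)

# toy data generated from an "unknown" law y = 3*x0^2 + x1, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 2))
y = 3.0 * X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.05, size=500)

# Data -> NN: fit a neural network to the noisy observations
nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X, y)

# NN -> SR: run symbolic regression on the network's smoothed predictions
sr = SymbolicRegressor(population_size=1000, generations=20, random_state=0)
sr.fit(X, nn.predict(X))
print(sr._program)  # best expression found (exact form will vary between runs)
```

Fitting the SR stage to `nn.predict(X)` rather than to `y` directly is the point of the pipeline: the idea is that the network acts as a denoised surrogate for the data that the symbolic search can query anywhere.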