# Becoming an ML bro

My Background: The last time I touched anything AI or machine learning was back in fall 2019, when I did a bit of NLP work for the Berkeley Haas Neuroeconomics Lab. It was mostly playing around with GloVe - I think the coolest part was using word vectors and PCA to find relationships between the words that study participants produced. Since then, I haven't really touched ML at all, besides using ChatGPT for homework.

## Study plan

I have a few hours, and it seems like transformers are the new hot shit, so we should start there and see where to go afterwards. Obviously, the seminal paper on transformers is `Attention is all you need`, so let's rawdog the paper and read it.

## Attention is all you need

- It is interesting that the cofounder of NEAR (a blockchain with a ~2B market cap) is a coauthor on this paper.
- Before transformers, recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) were state of the art for language translation and language modeling.
- The problem with recurrent neural networks is that the input has to pass through many sequential steps between input and output (the "recurrent" part of the name). Furthermore, you cannot parallelize the model to make things faster: once you pass in input, it must be processed step by step until the output comes out. This makes them very slow.
- The problems transformers are trying to solve are input-output problems. For example, if you want to translate a sentence from French to English, you feed the French sentence into the transformer and get out an English sentence.
- Transformers consist of encoder and decoder blocks, where you pass a vector of tokenized words through a series of encoder and decoder blocks.
- For tokenized words, not only do we have the actual word embedding (what the word is), we also have a positional encoding (where in the sentence the word sits).
- An encoder consists of self-attention (how each word in a sentence relates to EVERY other word in the sentence) and a feed-forward neural network (just a stateless neural network with nodes and weights).
- Self-attention is the key breakthrough for transformers. It is a mechanism to contextualize words by paying attention to other nearby words that give them context. For example, are the following two sentences referring to the same jaguar?
  - The doors of the jaguar were open.
  - The jaguar made a howling noise.
- A decoder transforms the vector output of an encoder back into words, also using self-attention followed by a feed-forward neural network.
- What is attention? Attention is like a market maker: it matches people who want a certain thing with people offering that thing.
- What does (key, query, value) represent? The key represents what you have (i.e. 10 firewood bundles), the query represents what you want (i.e. 5 loaves of bread), and the value represents the stuff that best fits what you want (i.e. three boudin bread bowls). A minimal code sketch of this mechanism follows this list.
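To make the (key, query, value) metaphor concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the operation the paper builds everything else on. The softmax(QKᵀ/√d_k)V formula comes straight from the paper; the variable names, sizes, and random weights are just illustrative placeholders, not anyone's real implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings (positional encodings already added).
    W_q, W_k, W_v: learned projection matrices (random placeholders here).
    """
    Q = X @ W_q                           # queries: what each token is looking for
    K = X @ W_k                           # keys: what each token advertises
    V = X @ W_v                           # values: what each token hands over if attended to
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # how well every query matches every key
    weights = softmax(scores, axis=-1)    # each row sums to 1: attention over the sentence
    return weights @ V                    # every token becomes a weighted mix of all values

# Toy example: a "sentence" of 4 tokens, embedding size 8, head size 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 4)
```

The real model runs several of these heads in parallel on the same input and concatenates the results (multi-head attention), then stacks that with the feed-forward layer mentioned above to form an encoder block.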
## From Transformers to Chat-GPT

With transformers, essentially we have a better language model that can perform things like translation whilst understanding context. The two things transformers do very well are: applying context to a word (i.e. "I ate an apple" vs "I just bought the newest Apple device") and parallelization of computing during training and inference. However, transformers by themselves didn't really bring about much change in the real world. It wasn't until Google trained BERT (Bidirectional Encoder Representations from Transformers) that the world understood the power of transformers.

BERT was trained to do two things: guess missing words from a body of text, and, given two sentences, guess whether they were two consecutive sentences from the same document or not. BERT was able to perform many different language tasks at a state-of-the-art level, including sentiment analysis, sentence similarity, and question answering. More interestingly, BERT kicked off the wave of increasingly large parameter counts, as it cost ~7,000 USD to train.
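If you want to poke at that first training objective yourself, the Hugging Face `transformers` library (not part of the readings below, so treat the exact model name and API as an assumption on my part) wraps pretrained BERT in a fill-in-the-blank pipeline:

```python
# pip install transformers torch
from transformers import pipeline

# Objective #1 from above: mask out a word and have BERT guess it.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for guess in unmasker("The paper that started all of this is called Attention Is All You [MASK]."):
    print(f"{guess['token_str']:>10}  score={guess['score']:.3f}")
```

Each guess comes back with a probability, which is exactly the "guess the missing word" game BERT played over a huge pile of text during pretraining.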
At the same time, OpenAI was working on their own line of transformer models, the GPT series (Generative Pretrained Transformer). The initial idea behind GPT was to pretrain a network on language modeling tasks over a large body of text and then fine-tune the network for specific tasks. However, GPT was more powerful than expected: it turned out you could simply talk to the model, ask it to perform a task, and it would give a somewhat intelligent answer. People realized that by crafting interesting questions, you could produce extremely interesting answers from the GPT models. Thus, we have arrived at the era of large language models. LLMs have shown that once a model exceeds a certain threshold size (50-100 billion parameters), it starts demonstrating an ability to answer just about any generic question.

## What's next? LLMs to emergent behavior

One of the most interesting things in large systems is the phenomenon of emergent behavior. In crypto, people realized you could collude with validators to get your transactions included earlier, and thus MEV was born. In nature, no single ant knows everything about the ant colony and no single ant could possibly plan out the entire colony, yet massive ant colonies are routinely found all over the world. And in life, emergence has led to single-celled organisms evolving into what we are today. Chat-GPT has already shown elements of emergent behavior:

1. **Zero shot learning:** ChatGPT is able to provide correct outputs on inputs it wasn't trained on.

    ```
    Question: What is 2+2?
    Chat-GPT: 4
    ```

2. **Few shot learning:** Given a few examples, the model is able to work out similar problems. For example, Chat-GPT is able to label tweets with their sentiment or correctly answer multiple choice questions.

    ```
    Question: What is the sentiment of this tweet? OMG, I just got accepted into UC Berkeley!!!!!
    Chat-GPT: Positive
    ```

3. **Answering questions:** Chat-GPT is able to compose a series of related thoughts into a succinct answer instead of merely regurgitating training data.

    ```
    Question: Why is steph curry the greatest player of all time?
    Chat-GPT [edited for length]: Stephen Curry is widely considered one of the greatest basketball players of his generation. He has achieved numerous accolades throughout his career, including being a two-time Most Valuable Player, a six-time NBA All-Star, and a three-time NBA champion. He has also broken numerous records, including the record for most three-pointers made in a single season and the record for most three-pointers made in a playoff series.
    ```

4. **Structured word generation:** Chat-GPT can write code, songs, poems, and much more in a way that is syntactically correct and functional.

    ```
    Question: Write a song about why UC Berkeley is the greatest school ever
    Chat-GPT:
    Verse 1:
    From the hills of Berkeley, a light shines so bright,
    A beacon of knowledge, a place that’s just right,
    A place where minds gather, to innovate and create,
    A place where dreams are made, and students graduate.

    Chorus:
    UC Berkeley, the greatest school of all,
    A place where ideas are born and knowledge stands tall,
    From science to humanities, we’ve got it all,
    UC Berkeley, the greatest school of them all.

    Verse 2:
    From the Campanile tower, to the Memorial Glade,
    A place where freedom of speech, will never fade,
    A place where history was made, and movements were born,
    A place where excellence is the norm.

    Chorus:
    UC Berkeley, the greatest school of all,
    A place where ideas are born and knowledge stands tall,
    From science to humanities, we’ve got it all,
    UC Berkeley, the greatest school of them all.

    Bridge:
    We’ve got Nobel laureates, and Turing award winners,
    Our research is renowned, our impact is greater,
    We’ve got a legacy, that’s second to none,
    UC Berkeley, a place where greatness is born.

    Chorus:
    UC Berkeley, the greatest school of all,
    A place where ideas are born and knowledge stands tall,
    From science to humanities, we’ve got it all,
    UC Berkeley, the greatest school of them all.

    Outro:
    So come to Berkeley, and be part of the best,
    A place where your future, will be truly blessed,
    A place where you’ll learn, and grow and thrive,
    UC Berkeley, the greatest school, for the rest of your life.
    ```

## Tips for using GPT models

## Order of readings/Bibliography

Attention is all you need:

1. Intro Medium article: https://luv-bansal.medium.com/transformer-attention-is-all-you-need-easily-explained-with-illustrations-d38fdb06d7db
2. Attention is all you need: https://arxiv.org/pdf/1706.03762.pdf
3. Metaphor for attention: https://www.reddit.com/r/MachineLearning/comments/qidpqx/comment/hij89tt/?utm_source=share&utm_medium=web2x&context=3
4. TODO: https://peterbloem.nl/blog/transformers

From transformers to Chat-GPT:

1. Article on the history of Chat-GPT: https://www.thoughtspot.com/data-trends/ai/what-is-transformer-architecture-chatgpt

What's next? LLMs to emergent behavior:

1. Large language models: From GPT-2 to Chat-GPT: https://www.thoughtspot.com/data-trends/ai/large-language-models-vs-chatgpt

## Questions

1. What's next for models after transformers?
2. Is there another parameter size we need to hit in order to get more emergent behavior?