# DL - Transformer Models (such as BERT, GPT)
## Summary
* [Overview](#overview)
* [Architecture of the Transformer Model](#architecture-of-the-transformer-model)
* [Differences between BERT and GPT Models](#differences-between-bert-and-gpt-models)
* [Common Applications of the Transformer Model](#common-applications-of-the-transformer-model)
* [Conclusion](#conclusion)
## Overview
The Transformer is a deep learning architecture widely used in natural language processing (NLP). It was introduced in the influential 2017 paper "Attention Is All You Need" by Vaswani et al., and it has since transformed the field. Unlike traditional recurrent neural network (RNN) models that process tokens sequentially, the Transformer processes an entire sequence in parallel, which makes training far more efficient.
The impact of the Transformer model on NLP tasks has been remarkable. It has achieved state-of-the-art performance in various areas, including translation, summarization, and information extraction. One of its key strengths lies in its ability to capture the contextual relationships between words in a sentence, regardless of their positions. This breakthrough has inspired the development of other models such as BERT, T5, and GPT, and serves as the foundation for modern large language models (LLMs) as well as numerous generative AI applications.
## Architecture of the Transformer Model
The Transformer architecture is composed of several layers, each of which plays a critical role in processing sequences of text.
### Input Embedding Layer
In the initial stage, the model takes raw text data as input, which is subsequently transformed into numerical vectors through the input embedding layer. These numerical vectors, also known as "embeddings," serve as representations of words that the model can comprehend and analyze.
The embedding layer assigns each word in the input sequence to a high-dimensional vector. These vectors capture the semantic meaning of words, and words with similar meanings are represented by vectors that are close together in the vector space. This layer plays a crucial role in enabling the model to comprehend and process language data effectively.
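As a rough illustration, the PyTorch sketch below maps a toy batch of token ids to embedding vectors. The vocabulary size and embedding dimension are arbitrary placeholders, not values from any particular model.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration.
vocab_size = 30_000   # number of tokens in the vocabulary
d_model = 512         # embedding dimension used throughout the model

embedding = nn.Embedding(vocab_size, d_model)

# A toy batch of token ids (normally produced by a tokenizer).
token_ids = torch.tensor([[12, 845, 97, 3]])   # shape: (batch, seq_len)
word_vectors = embedding(token_ids)            # shape: (batch, seq_len, d_model)
print(word_vectors.shape)                      # torch.Size([1, 4, 512])
```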
### Positional Encoding
In contrast to RNNs, the Transformer model handles all words in a sentence concurrently, leading to improved efficiency. However, this parallel processing introduces a challenge: the model lacks inherent knowledge of the order or position of words within the sentence.
To address this challenge, positional encoding is employed to provide the model with positional information about the words in the sentence. This involves adding a vector to each input embedding, representing the word's position in the sentence. By incorporating positional encoding, the Transformer model gains an understanding of the word order despite processing all words simultaneously.
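The original paper encodes positions with fixed sine and cosine functions of different frequencies. The sketch below reproduces that scheme; the resulting matrix is simply added to the embeddings. Learned positional embeddings are an equally common alternative.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_terms = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                                     # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div_terms)   # even dimensions
    pe[:, 1::2] = torch.cos(positions * div_terms)   # odd dimensions
    return pe

# The encoding is added to the embeddings (assuming matching dimensions):
# embeddings = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```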
### Multi-Head Self-Attention Mechanism
Following the positional encoding step, the data is processed through a multi-head self-attention mechanism within the model. This mechanism enables the model to selectively attend to different parts of the input sequence for each word, facilitating a comprehensive understanding of the contextual relationships between words in a sentence.
The self-attention mechanism functions by assigning weights to each word in the sentence based on its relevance to other words. These weights determine the level of attention the model should allocate to each word while processing a specific word. The term "multi-head" indicates that the model incorporates multiple self-attention mechanisms, referred to as "heads," each focusing on distinct aspects of the input data. This multi-head approach enhances the model's ability to capture various perspectives and dependencies within the input sequence.
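One way to experiment with this is PyTorch's built-in `nn.MultiheadAttention` module. In the minimal sketch below, the same sequence is passed as queries, keys, and values, which is what makes it *self*-attention; the sizes are illustrative only.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8   # illustrative sizes
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, 10, d_model)   # (batch, seq_len, d_model): embeddings + positions

# Self-attention: queries, keys, and values all come from the same sequence.
output, weights = attention(x, x, x)

print(output.shape)    # torch.Size([1, 10, 512]) - contextualized representations
print(weights.shape)   # torch.Size([1, 10, 10])  - attention weights, averaged over heads
```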
### Feed-Forward Neural Networks
Following the multi-head self-attention mechanism, the output undergoes processing through a feed-forward neural network (FFNN). This network comprises two linear transformations, with a Rectified Linear Unit (ReLU) activation function applied in between.
The FFNN operates independently on each position, handling the output derived from the self-attention mechanism. By applying this layer, the model introduces additional complexity and depth to the transformation process. The FFNN plays a crucial role in further refining the representation of the input sequence, enabling the model to capture intricate patterns and enhance its overall expressive power.
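A minimal sketch of this position-wise network, using the dimensions from the original paper (`d_model = 512`, inner size `2048`), might look like this:

```python
import torch.nn as nn

d_model, d_ff = 512, 2048   # inner size from the original paper

# Position-wise feed-forward network: two linear layers with ReLU in between,
# applied identically (and independently) to every position in the sequence.
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)
```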
### Normalization and Residual Connections
Normalization and residual connections stabilize the learning process and make it practical to train much deeper networks.
Normalization standardizes the inputs to the next layer, reducing the training time and improving the performance of the model. Residual connections, or skip connections, allow the gradient to flow directly from the output to the input, bypassing the layer’s transformation. This makes it possible to use a deeper neural network with more layers, without facing the vanishing gradient problem.
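As a sketch of the "add & norm" step (assuming the sublayer maps a tensor to another tensor of the same shape), the wrapping could look like this in PyTorch:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wrap a sublayer (attention or feed-forward) with a residual connection
    followed by layer normalization, as in the original Transformer."""

    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Add the input back to the sublayer's output ("skip connection"),
        # then normalize the sum.
        return self.norm(x + self.sublayer(x))
```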
### Output Layer
The final layer of the model produces the ultimate output. In tasks such as translation or text generation, this layer typically incorporates a softmax function, which generates a probability distribution over the vocabulary for predicting the next word.
The output layer consolidates the computations from all the preceding layers to generate the final result. This result could be a translated sentence, a summary of a document, or any other NLP task that the Transformer model is trained to perform. By combining the information and transformations from earlier layers, the output layer provides a comprehensive and meaningful output that addresses the specific objective of the model.
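For a language-modeling head, this usually amounts to a linear projection onto the vocabulary followed by a softmax. The sketch below uses placeholder sizes and a greedy pick of the most probable next word:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 30_000   # placeholder sizes

# Project the final hidden states onto the vocabulary and turn the scores
# into a probability distribution over possible next words.
to_vocab = nn.Linear(d_model, vocab_size)

hidden = torch.randn(1, 10, d_model)    # output of the last Transformer layer
logits = to_vocab(hidden)               # (1, 10, vocab_size)
probs = torch.softmax(logits, dim=-1)   # probabilities for each position
next_word_id = probs[0, -1].argmax()    # greedy choice of the next word
```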
## Differences between BERT and GPT Models
In 2018, the introduction of BERT represented a notable breakthrough in the realm of encoder-only transformer architectures. The encoder-only architecture comprises multiple layers of bidirectional self-attention and a feed-forward transformation, each followed by a residual connection and layer normalization.
On the other hand, the GPT models, developed by OpenAI, signify a parallel advancement in transformer architectures, with a specific emphasis on the decoder-only transformer model. The GPT models focus on the decoder component, utilizing self-attention mechanisms to generate coherent and contextually relevant output sequences. By leveraging the transformer architecture in the decoder context, the GPT models have demonstrated remarkable performance in various natural language processing tasks.

The following table summarizes the key differences between BERT and GPT models.
|Aspect| BERT| GPT|
|:----:|----|----|
|Architecture| BERT employs a bidirectional Transformer architecture, processing input text in both directions simultaneously. This enables BERT to capture the complete contextual information of each word by considering the entire sentence. |GPT utilizes a unidirectional Transformer architecture, processing text from left to right. This design allows GPT to predict the next word in a sequence but restricts its comprehension of the context to the left side of each word|
|Training Objective |Trained with a masked language model (MLM) task, BERT predicts masked words by considering the surrounding context, which helps in understanding word relationships.|Trained with a causal language model (CLM) task, GPT predicts the next word in a sequence, enabling it to generate coherent and contextually relevant text.|
|Pre-training |Pre-trained on masked language modeling (together with a next-sentence prediction task), which exposes the model to context on both sides of each word. |Pre-trained solely on a causal language model task, focusing on understanding the sequential nature of the text.|
|Fine-tuning| BERT can be fine-tuned for specific NLP tasks such as question answering, named entity recognition, etc., by adding task-specific layers on top of the pre-trained model. |GPT can be fine-tuned for specific tasks such as text generation and translation by adapting the pre-trained model to the task at hand.|
|Bidirectional Understanding |Captures the context from both left and right of a word, providing a more comprehensive understanding of the sentence structure and semantics. |Understands context only from the left of a word, which may limit its ability to fully grasp the relationships between words in some cases.|
|Use Cases| BERT excels at sentence- and token-level classification tasks. Extensions of BERT, like sBERT, can be employed for semantic search, expanding BERT's applicability to retrieval tasks. Fine-tuning BERT for classification tasks is often preferred over using few-shot prompting via an LLM.| Decoder-only models like GPT are designed for generative tasks such as text generation and translation, which encoder-only models like BERT cannot perform on their own.|
|Real-World Example |Used in Google Search to understand the context of search queries, enhancing the relevance and accuracy of search results. |Models like GPT-3 are employed to generate human-like text responses in various applications, including chatbots, content creation, and more.|
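The contrast is easy to see with the Hugging Face `transformers` library (assuming it is installed); `bert-base-uncased` and `gpt2` are the standard public checkpoints, used here purely for illustration:

```python
from transformers import pipeline

# BERT-style (encoder-only): predict a masked word using context from both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# GPT-style (decoder-only): continue a prompt from left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=10))
```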
## Common Applications of the Transformer Model
The Transformer model has found applications in numerous domains, providing innovative solutions and improving the efficiency of existing systems. A short code sketch after the list below illustrates how a few of these tasks can be tried with off-the-shelf models.
- **Machine Translation**
You’ve likely used online translation tools like Google Translate before. Modern machine translation systems rely primarily on Transformers. They use attention mechanisms to understand the context and semantic meaning of words across languages, enabling more accurate translations than previous-generation models.
The Transformer’s ability to handle long sequences of data makes it particularly adept at this task, allowing it to translate entire sentences with unprecedented accuracy.
- **Text Generation**
When you type a query into a search engine and it auto-fills the rest of your sentence, this is also likely powered by a Transformer model. By analyzing patterns and sequences in the input data, the Transformer can predict and generate coherent and contextually relevant text. This technology is used in a wide range of applications, from email auto-complete features to chatbots and virtual assistants.
More advanced models such as OpenAI GPT and Google PaLM, which power new consumer applications like ChatGPT and Bard, use a Transformer architecture to generate human-like text and code based on natural language prompts.
- **Sentiment Analysis**
Sentiment analysis is a tool for businesses that want to understand customer opinions and feedback. A Transformer can analyze text data, such as product reviews or social media posts, and determine the sentiment behind them (for example, positive, negative, or neutral). By doing this at scale, businesses can extract valuable insights about their products or services and make informed decisions.
- **Named Entity Recognition**
Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that involves identifying and classifying entities in text into predefined categories like names of persons, organizations, locations, expressions of times, quantities, etc. The Transformer model, with its self-attention mechanism, can recognize these entities even in complex sentences.
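As noted above, here is a minimal sketch of a few of these applications using Hugging Face `pipeline` objects. The checkpoints are standard public ones chosen for illustration, not the models behind the products mentioned earlier.

```python
from transformers import pipeline

# Machine translation: English to French with a small T5 checkpoint.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("Transformers changed natural language processing."))

# Sentiment analysis with the pipeline's default classification model.
sentiment = pipeline("sentiment-analysis")
print(sentiment("The battery life of this phone is fantastic."))

# Named entity recognition, grouping sub-word tokens into whole entities.
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Sundar Pichai announced the new model at Google I/O in California."))
```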
## Conclusion
In the dynamic field of natural language processing, BERT and GPT are two influential models, each with distinct strengths and applications. By delving into their architecture, training objectives, real-world examples, and use cases, we have uncovered the nuances that differentiate them. BERT's bidirectional understanding makes it ideal for tasks requiring comprehensive contextual insights, while GPT's unidirectional approach excels in creative text generation. Whether you're a researcher, data scientist, or AI enthusiast, grasping these distinctions will help you choose the right model for your specific projects.