# Module 3: Retrieval Augmented Knowledge (RAG)

> <p style="font-size: 4rem;display: block; margin: 0 auto; width:100%;"><b>Network Laboratory</b></p>
> <p style="font-size: 2rem; display: block; margin: 0 auto; width:100%;"><i>Research Assistant</i></p>

By: **Dimas Dermawan [DM]**

---

In this module, we will learn about Retrieval Augmented Knowledge (RAG) and how it can improve a Large Language Model's (LLM) ability to work with long context. But before we enter the world of LLMs and RAG, we need to know the background of LLMs, so that we can understand all the *buzzwords* related to AI.

[TOC]

---

## Basics of Neural Networks

### The Perceptron

To understand how a massive, modern Large Language Model works, we must first travel back to its earliest ancestor, **The Perceptron**. Introduced by Frank Rosenblatt in 1957, the Perceptron is a mathematical model of a single biological neuron. It was designed for a simple task: to take several numerical inputs and produce a single binary output (0 or 1), effectively classifying an input as belonging to "Class A" or "Class B".

#### The Mathematical Model of the Perceptron

> The Perceptron's operation is a two-step process. The first step is the weighted sum.

- The neuron receives a set of inputs: $x_1, x_2, ..., x_n$.
- Each input is assigned a *weight* (which represents the importance of that input): $w_1, w_2, ..., w_n$

![Image of Perceptron](https://towardsdatascience.com/wp-content/uploads/2021/12/1hkYlTODpjJgo32DoCOWN5w.png)

The neuron multiplies each input by its weight, sums these products, and adds a *bias* ($b$). The bias acts as a threshold offset. This operation is captured in the formula for the linear component, $z$:

$$z = (w_1 \cdot x_1) + (w_2 \cdot x_2) + ... + (w_n \cdot x_n) + b$$

#### Comparison to Linear Models

This formula is the core of the Perceptron's computation, and it is mathematically identical to the linear models used in traditional machine learning, such as linear regression. We can express this formula far more efficiently using matrix notation. If we group our inputs into a single column vector $X$ and our weights into a single row vector $W$, the formula becomes:

$$z = W \cdot X + b$$

Where:

- $$W = \begin{bmatrix} w_1 & w_2 & ... & w_n \end{bmatrix}$$
- $$X = \begin{bmatrix} x_1 \\ x_2 \\ ... \\ x_n \end{bmatrix}$$
- $b$ is the scalar bias term.

This $z = WX + b$ equation is the fundamental formula for a linear model. However, the Perceptron adds a crucial second step to turn this linear output into a binary decision.

#### The Heaviside Step Function

The weighted sum $z$ is just an unbounded floating-point number. To make a hard "yes or no" decision, the Perceptron passes this $z$ value through an activation function.

<div style="display:flex; flex-direction:column; align-items:center; gap: 2;">
    <img src="https://www.intmath.com/laplace-transformation/svg/svgphp-unit-step-functions-definition-1a-s0.svg" style="display: block;" />
    <p><b>Figure 1.2:</b> Graph of the Heaviside Step Function</p>
</div>

The specific function used by the Perceptron is the Heaviside step function:

$$\phi(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$

This function acts as a simple threshold. If the weighted sum $z$ is positive, the neuron "fires" and outputs 1. If it is zero or negative, the neuron does not fire and outputs 0.
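Putting the two steps together, a single Perceptron is only a few lines of code. The sketch below is a minimal NumPy version of the weighted sum plus the Heaviside threshold; the weights, bias, and input are hypothetical values chosen purely for illustration.

```python
import numpy as np

def heaviside(z):
    """Step 2: the hard threshold. Fire (1) if z > 0, otherwise output 0."""
    return 1 if z > 0 else 0

def perceptron(x, w, b):
    """Step 1: weighted sum z = w . x + b, then Step 2: threshold."""
    z = np.dot(w, x) + b
    return heaviside(z)

# Hypothetical weights, bias, and input, just to show the mechanics.
w = np.array([0.5, -0.4])
b = -0.1
x = np.array([1.0, 0.3])

print(perceptron(x, w, b))  # z = 0.5 - 0.12 - 0.1 = 0.28 > 0, so the output is 1
```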
#### Example of The Perceptron

Let's imagine a Perceptron built to decide if a student should be "Admitted" (1) or "Rejected" (0) based on two scores:

> $x_1$: Exam Score (out of 100)
> $x_2$: Interview Score (out of 10)

Let's say the Perceptron has learned the following weights:

> $w_1 = 0.1$
> $w_2 = 0.8$
> $b = -12.0$

**Scenario 1: A strong candidate**

- Inputs: $x_1 = 80$ (Exam), $x_2 = 6$ (Interview)
- Weighted Sum $z$: $z = (0.1 \cdot 80) + (0.8 \cdot 6) - 12.0 = 8.0 + 4.8 - 12.0 = 0.8$
- Activation $\phi(z)$: Since $z = 0.8$ (which is > 0), the output is 1 (Admitted).

**Scenario 2: A weak candidate**

- Inputs: $x_1 = 50$ (Exam), $x_2 = 8$ (Interview)
- Weighted Sum $z$: $z = (0.1 \cdot 50) + (0.8 \cdot 8) - 12.0 = 5.0 + 6.4 - 12.0 = -0.6$
- Activation $\phi(z)$: Since $z = -0.6$ (which is $\le$ 0), the output is 0 (Rejected).

#### The Perceptron Learning Algorithm

The Perceptron "learns" by finding the right weights ($w$) and bias ($b$). It does this using a simple rule called the Perceptron Learning Algorithm, which works by iteratively correcting mistakes.

The algorithm initializes all weights and the bias to zero or small random numbers. Then, it iterates through the training dataset, one example at a time. For each example:

- **Make a Prediction:** It calculates its output ($\hat{y}$), which will be 0 or 1.
- **Calculate the Error:** It compares its prediction $\hat{y}$ to the true, correct label $y$. The error is simply $\text{error} = y - \hat{y}$.
    - If the prediction was correct (e.g., $y=1, \hat{y}=1$), the error is 0.
    - If it predicted 0 but the answer was 1, the error is 1.
    - If it predicted 1 but the answer was 0, the error is -1.
- **Update the Weights:** If (and only if) the error is not zero, it "nudges" each weight and the bias according to this rule:
$$w_n \text{ (new)} = w_n \text{ (old)} + \alpha \cdot \text{error} \cdot x_n$$
$$b \text{ (new)} = b \text{ (old)} + \alpha \cdot \text{error}$$
$\alpha$ (alpha) is the learning rate (e.g., 0.1), which controls how big the "nudge" is.
- **This rule is intuitive:** if the error is positive (it should have fired but didn't), it increases the weights for the inputs that were present, making the sum $z$ larger next time. If the error is negative (it fired but shouldn't have), it decreases the weights. A short code sketch of this loop follows at the end of this section.

#### The Limitation of Non-Differentiability

The Perceptron was a breakthrough, but it had two fundamental limitations. First, it could only solve "linearly separable" problems (problems that can be divided by a single straight line).

<div style="display:flex; flex-direction:column; align-items:center; gap: 2;">
    <img src="https://www.baeldung.com/wp-content/uploads/sites/4/2022/01/linearly-separable.jpg" style="display: block; width: 50%;" />
    <p><b>Figure 1.3:</b> Linearly separable data</p>
</div>

But the second, more profound limitation became the "wall" of AI research for years. The Heaviside step function is not differentiable. Its "derivative" (its slope) is 0 everywhere, except at $z=0$, where it is undefined.

Calculus is the mathematical tool we use to find how to "nudge" a function to minimize its error. Because the step function's derivative is always 0, it gives us no information. It doesn't tell us in which direction to move the weights, or by how much. We can't use calculus to train it.

This meant the simple Perceptron Learning Algorithm could not be extended to more complex, multi-layered networks. To solve this, a new kind of neuron was needed.
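As promised above, here is a minimal NumPy sketch of the Perceptron Learning Algorithm. The dataset (an AND gate) and the learning rate are hypothetical choices for illustration; because the data is linearly separable, the loop converges after a few epochs.

```python
import numpy as np

# Tiny hypothetical dataset: two inputs and binary labels (an AND gate).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights start at zero
b = 0.0           # bias starts at zero
alpha = 0.1       # learning rate

for epoch in range(10):                 # a few passes over the data
    for x_i, y_i in zip(X, y):
        z = np.dot(w, x_i) + b          # weighted sum
        y_hat = 1 if z > 0 else 0       # Heaviside activation
        error = y_i - y_hat             # 0, +1, or -1
        w += alpha * error * x_i        # nudge weights only when wrong
        b += alpha * error              # nudge the bias the same way

print(w, b)  # e.g. weights around [0.2, 0.1] and a negative bias
# After training, only the input (1, 1) produces z > 0, so only it is classified as 1.
```

Note that the update only fires when the prediction is wrong, which is exactly the "nudge on mistakes" rule described above.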
### A Differentiable Neuron The "wall" of the Perceptron's non-differentiable activation function led to a critical innovation, a neuron that was differentiable. This new component, which is the foundation of all modern neural networks, is essentially a logistic regression unit. #### Smooth Activation Function Researchers replaced the "hard" Heaviside step function with a "soft," smooth function. The most important of these is the Sigmoid function (or Logistic function). It takes any real number $z$ and squashes it into a value between 0 and 1: $$\sigma(z) = \frac{1}{1 + e^{-z}}$$ This was a brilliant solution for two reasons: - **It is Differentiable:** The Sigmoid function has a clean, well-defined derivative at every point. This means we can use calculus to train it. - **It is Probabilistic:** The output can be interpreted as a probability. An output of 0.85 doesn't just mean "Class A," it means "85% confidence of being Class A." <div style="display:flex; flex-direction:column; align-items:center; gap: 2;"> <img src="https://www.researchgate.net/publication/306038981/figure/fig14/AS:393836095393798@1470909250163/Two-basic-activation-functions-a-Step-Function-b-Sigmoid-Function.png" style="display: block;" /> <p><b>Figure 1.3:</b> (a) Heaviside Function and (b) Sigmoid Function</p> </div> #### Gradient Descent With a differentiable neuron, we can now use a far more powerful learning technique, **Gradient Descent**. This requires two new concepts: 1. **The Loss Function** First, we need to measure how wrong a prediction is. We use a Loss Function (or Cost Function), $L$. A common one for regression (or to start with) is the Mean Squared Error (MSE): $$L = (y - \hat{y})^2$$ - $y$ is the true, correct label (e.g., 1 for "Malignant"). - $\hat{y}$ (y-hat) is the neuron's probabilistic prediction (e.g., 0.85). Our goal is to find the weights $w$ and bias $b$ that minimize this Loss $L$ across all our training examples. 2. **Finding the Gradient** Calculus gives us a tool to do this: the derivative. The derivative of a function tells us its slope. The gradient is a vector of partial derivatives, one for each weight, and it points in the direction of steepest ascent (uphill). To minimize our Loss, we just need to take a small step in the opposite direction of the gradient. We need to find out how a tiny change in a weight (like $w_1$) will affect the final Loss $L$. We calculate this using the partial derivative $\frac{\partial L}{\partial w_1}$. We use the chain rule to break this complex problem down: $$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_1}$$ Let's look at each piece: $\frac{\partial L}{\partial \hat{y}}$: How does the Loss change if the prediction $\hat{y}$ changes? This is the derivative of $L = (y - \hat{y})^2$ with respect to $\hat{y}$. $\frac{\partial L}{\partial \hat{y}} = -2(y - \hat{y})$ $\frac{\partial \hat{y}}{\partial z}$: How does the prediction $\hat{y}$ change if the weighted sum $z$ changes? This is the derivative of the Sigmoid function, $\hat{y} = \sigma(z)$. $\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1 - \sigma(z))$ $\frac{\partial z}{\partial w_1}$: How does the weighted sum $z$ change if the weight $w_1$ changes? This is the derivative of $z = (w_1x_1 + w_2x_2 + b)$ with respect to $w_1$. $\frac{\partial z}{\partial w_1} = x_1$ 3. 
**The Update Rule**
We combine these pieces to get the gradient for $w_1$:

$$\frac{\partial L}{\partial w_1} = -2(y - \hat{y}) \cdot \sigma(z)(1 - \sigma(z)) \cdot x_1$$

This single number tells us exactly how to change $w_1$ to reduce the error. The process of Gradient Descent is to then update the weight using this gradient:

$$w_1 \text{ (new)} = w_1 \text{ (old)} - \alpha \cdot \frac{\partial L}{\partial w_1}$$

$\alpha$ (alpha) is the learning rate (e.g., 0.01) that controls how big of a step we take.

This process is repeated for every weight and the bias, over and over, for thousands of training examples. The "learning" is just this iterative process of measuring error and using calculus to "nudge" the weights in the direction that minimizes that error. This very concept, when applied to a network of neurons, is called Backpropagation.

### Building a Network and Teaching It to Learn

A single neuron, even with gradient descent, is still just a linear classifier (its decision boundary is a straight line). It cannot solve complex, non-linear problems.

To solve more complex problems, we stack these neurons together. We form an "Input Layer" to receive data, one or more "Hidden Layers" of neurons to find patterns, and an "Output Layer" to give the final answer. This is a Multi-Layer Perceptron (MLP).

<div style="display:flex; flex-direction:column; align-items:center; gap: 2;">
    <img src="https://miro.medium.com/0*lY7NXklYUEBxO-sf.png" style="display: block; width: 70%;" />
    <p><b>Figure 1.4:</b> Multilayer Perceptron Architecture</p>
</div>

But with thousands of connections, how do we find the right weights? The simple gradient descent for one neuron isn't enough. The answer is Backpropagation. Backpropagation is simply the chain rule applied across the entire network.

- **Forward Pass:** We feed the network an input (like an email) and it makes a prediction (a "forward pass").
- **Calculate Error:** We compare the network's final prediction to the correct answer using a "loss function" (like $L = (y - \hat{y})^2$).
- **Backpropagation:** This is the core of "learning." It is a clever algorithm that uses the chain rule to work backward from the final error. It calculates the gradient ($\frac{\partial L}{\partial w}$) for every single weight in the network, layer by layer, starting from the end and moving to the beginning.
- **Update Weights:** The algorithm then uses gradient descent to "nudge" every single weight and bias in the direction that will reduce the final error.

By repeating this "forward pass -> calculate error -> backpropagate -> update weights" loop millions of times, the network slowly "learns" to solve highly complex problems.
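Before leaving neural-network basics, here is a minimal NumPy sketch that ties the pieces together: the chain-rule gradient derived above, used to train a single sigmoid neuron by gradient descent on a tiny, made-up dataset. The data, learning rate, and number of steps are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny hypothetical dataset: two inputs per example, binary labels.
X = np.array([[0.2, 0.9], [0.8, 0.1], [0.1, 0.8], [0.9, 0.2]])
y = np.array([1, 0, 1, 0])

w = np.zeros(2)
b = 0.0
alpha = 0.5  # learning rate

for step in range(2000):
    for x_i, y_i in zip(X, y):
        z = np.dot(w, x_i) + b
        y_hat = sigmoid(z)                      # forward pass
        # Chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
        dL_dyhat = -2.0 * (y_i - y_hat)         # derivative of (y - y_hat)^2
        dyhat_dz = y_hat * (1.0 - y_hat)        # derivative of the sigmoid
        grad_w = dL_dyhat * dyhat_dz * x_i      # dz/dw = x
        grad_b = dL_dyhat * dyhat_dz            # dz/db = 1
        w -= alpha * grad_w                     # step against the gradient
        b -= alpha * grad_b

print(sigmoid(X @ w + b))  # predictions approach [1, 0, 1, 0]
```

The inner loop is exactly the update rule derived above; a real MLP repeats the same idea for every weight in every layer via backpropagation.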
## The Evolution of Artificial Intelligence

To appreciate the power of today's LLMs, we must first understand their origins. The core challenge has always been the same: **how can we teach a machine to understand and generate human language?**

This journey is a story of progressively sophisticated attempts to master context. The ability to understand not just words, but the relationships between them across sentences and paragraphs, is what separates a simple keyword-matcher from a true language comprehender.

### Statistical Language Modeling: The N-Gram

Before the rise of neural networks, the dominant approach to language modeling was statistical. The goal of a statistical language model is to assign a probability to a sequence of words (a sentence). The most fundamental of these models was the N-gram model.

#### The Chain Rule of Probability

To start, how do we find the probability of an entire sentence, like $P(\text{"the quick brown fox"})$? We can use the chain rule of probability to break this down:

$$P(w_1, w_2, ..., w_k) = P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_1, w_2) \cdot ... \cdot P(w_k | w_1, ..., w_{k-1})$$

In plain English, the probability of a sentence is:

- The probability of the first word...
- times the probability of the second word, given the first...
- times the probability of the third word, given the first two...
- ...and so on.

This formula is mathematically perfect, but it's computationally impossible. To calculate $P(\text{"fox"} | \text{"the", "quick", "brown"})$, we would need to have seen the exact phrase "the quick brown" many times in our data to get a reliable statistic for what word comes next.

#### The Markov Assumption

This is where the N-gram model introduces its core "trick": the Markov assumption. This assumption states that the probability of a word only depends on a small, fixed window of $N-1$ preceding words, not the entire history.

- **Bigram Model ($N=2$):** We assume a word only depends on the one word before it.
$$P(w_i | w_1, ..., w_{i-1}) \approx P(w_i | w_{i-1})$$
The probability of "the quick brown fox" becomes: $P(\text{the}) \cdot P(\text{quick} | \text{the}) \cdot P(\text{brown} | \text{quick}) \cdot P(\text{fox} | \text{brown})$
- **Trigram Model ($N=3$):** We assume a word depends on the two words before it.
$$P(w_i | w_1, ..., w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})$$
The probability of "the quick brown fox" becomes: $P(\text{the}) \cdot P(\text{quick} | \text{the}) \cdot P(\text{brown} | \text{the, quick}) \cdot P(\text{fox} | \text{quick, brown})$

#### Calculating Probabilities (Maximum Likelihood Estimation)

<div style="display:flex; flex-direction:column; align-items:center; gap: 2;">
    <img src="https://media.geeksforgeeks.org/wp-content/uploads/20250527122516835224/N-grams-by-NLP-.webp" style="display: block; width: 80%;" />
    <p><b>Figure 2.1:</b> N-gram processing for different values of N</p>
</div>

This assumption makes calculation feasible. We can now estimate these probabilities from a large body of text, called a corpus, using Maximum Likelihood Estimation (MLE), which is just a formal name for counting and dividing. For a bigram model ($N=2$):

$$P(w_i | w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}$$

Let's use a tiny corpus: "I am Sam. Sam I am."

- `count("I")` = 2
- `count("am")` = 2
- `count("Sam")` = 2
- `count("I am")` = 2
- `count("am Sam")` = 1
- `count("Sam I")` = 1

What is the probability of the sentence "I am Sam"?

> $$P(\text{I am Sam}) = P(\text{I}) \cdot P(\text{am} | \text{I}) \cdot P(\text{Sam} | \text{am})$$

- $P(\text{I}) = \frac{\text{count("I")}}{\text{total words}} = 2/6$
- $P(\text{am} | \text{I}) = \frac{\text{count("I am")}}{\text{count("I")}} = 2/2 = 1.0$
- $P(\text{Sam} | \text{am}) = \frac{\text{count("am Sam")}}{\text{count("am")}} = 1/2 = 0.5$

**Total Probability:** $(2/6) \cdot 1.0 \cdot 0.5 \approx 0.167$

#### The Fundamental Limitations

This statistical approach was powerful, but it suffered from two critical flaws.

1. **Data Sparsity (The Curse of Dimensionality):** What is the probability of the new (but perfectly valid) sentence "I am Sampras"?
$$P(\text{Sampras} | \text{am}) = \frac{\text{count("am Sampras")}}{\text{count("am")}}$$
The count of "am Sampras" in our corpus is 0. Therefore, the probability is 0.
Because we are multiplying, this single zero makes the probability of the entire sentence zero. As $N$ increases, this problem explodes. The number of possible 5-grams is astronomically larger than any corpus, meaning most 5-grams will have a count of zero. The model cannot generalize to new sentences. 2. **Storage:** To build a 5-gram model, you must store the counts for every single 5-gram you see in your training data. With a vocabulary of 100,000 words, the number of possible 5-grams is $100,000^5$, which is an impossibly large number. While techniques like smoothing (e.g., adding 1 to every count to avoid zero-probabilities) were developed as patches, they couldn't fix the fundamental architectural limitation: N-grams have a rigid, limited context window and cannot generalize. ### Recurrent Neural Networks (RNNs) The statistical N-gram model, while foundational, hit a hard wall. Its reliance on a fixed, short context window ($N$) and its inability to generalize to unseen word combinations (data sparsity) meant it could never truly understand language. To overcome these limitations, researchers turned to neural networks, which offered a way to learn distributed representations (word embeddings) and handle variable-length sequences. The Multi-Layer Perceptron (MLP), a stack of differentiable neurons, was the first step. However, a standard MLP has no concept of time or order. If you feed it the sentence "the cat chased the dog," it would treat it the same as "the dog chased the cat." It has no built-in mechanism for understanding that words come in a specific sequence. <div style="display:flex; flex-direction:column; align-items:center; gap: 2;"> <img src="https://media.geeksforgeeks.org/wp-content/uploads/20231204130132/RNN-vs-FNN-660.png" style="display: block; width: 80%;" /> <p><b>Figure 2.2:</b> (a) RNN and (b) MLP</p> </div> To solve this, the Recurrent Neural Network (RNN) was developed. It is a neural network architecture specifically designed to process sequences. #### The Core Idea An RNN introduces a "loop" to the neural network. This loop allows information to persist from one step of the sequence to the next. This persistent information is called the hidden state ($h$). <div style="display:flex; flex-direction:column; align-items:center; gap: 2;"> <img src="https://www.kolena.com/wp-content/uploads/RNNs-1400-x-600-px-1024x439.png" style="display: block; width: 80%;" /> <p><b>Figure 2.3:</b> The loop</p> </div> You can think of the hidden state as the network's "memory." As the RNN reads a sentence one word at a time, the hidden state vector is continuously updated. At each step, the hidden state captures a summary of all the words the network has seen so far. When we "unroll" this loop over time, you can see how it works: - The network receives the first word (e.g., "I") and an initial hidden state (usually a vector of zeros, $h_0$). - It processes "I" to produce a new hidden state, $h_1$. - It then receives the second word ("am") along with the previous hidden state, $h_1$. - It combines the information from "am" and $h_1$ to produce the next hidden state, $h_2$. This new $h_2$ now contains information from both "I" and "am". This process repeats, with each new hidden state $h_t$ being a function of both the current word $x_t$ and the entire history summarized in $h_{t-1}$. #### The Mathematics of Recurrence This "unrolled" process is governed by a set of mathematical formulas. At any given time step $t$: 1. 
**The Hidden State Calculation:** The new hidden state, $h_t$, is calculated based on the previous hidden state, $h_{t-1}$, and the current input, $x_t$. $$h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)$$ 2. **The Output Calculation:** The output for the current step, $\hat{y}_t$ (e.g., a prediction for the next word), is calculated from the current hidden state. $$\hat{y}_t = W_{hy} \cdot h_t + b_y$$ Let's break down the components: - $x_t$: The input vector for the current word (e.g., a word embedding). - $h_{t-1}$: The hidden state vector from the previous time step. - $h_t$: The new hidden state vector for the current time step. - $W_{hh}, W_{xh}, W_{hy}$: These are the weight matrices. This is the most important concept: these weights are shared across all time steps. $W_{xh}$ is the weight matrix applied to the input, and $W_{hh}$ is the weight matrix applied to the previous hidden state. The network learns one set of weights, not a new set for each word. This is what allows it to generalize its "rules" to sequences of any length. - $b_h, b_y$: These are the bias vectors for the hidden state and output. - $\tanh$: This is the hyperbolic tangent activation function. It is used to keep the values in the hidden state vector bounded between -1 and 1, which helps maintain numerical stability. #### Vanishing Gradients RNNs were a massive breakthrough. For the first time, a model could learn from sequential data. However, in practice, they suffered from a critical flaw: the vanishing gradient problem. <div style="display:flex; flex-direction:column; align-items:center; gap: 2;"> <img src="https://aiml.com/wp-content/uploads/2023/11/vanishing-and-exploding-gradient-1.png" style="display: block; width: 80%;" /> <p><b>Figure 2.4:</b> The vanishing gradient problem</p> </div> "Learning" in a neural network happens via backpropagation, which uses the chain rule of calculus to "nudge" all the weights to reduce the final error. In an RNN, this "backpropagation through time" means the gradient (the error signal) must flow backward through the recurrent loop, step by step. To get from the error at step 10 back to the weight at step 1, the gradient must be multiplied by the recurrent weight matrix $W_{hh}$ ten times. - If the values in $W_{hh}$ are small (less than 1), multiplying them together many times causes the gradient to shrink exponentially. - By the time the gradient signal gets back to the early steps, it is effectively zero. This means the network is unable to "nudge" the weights based on "long-term" errors. It can learn the relationship between "am" and "I" (a short-term dependency), but it fails to learn the relationship between "is" and "The cat... which is..." in a long paragraph (a long-term dependency). The network effectively has a short-term memory, defeating its original purpose. ### Long Short-Term Memory (LSTM) The standard Recurrent Neural Network (RNN) had a critical flaw: the vanishing gradient problem. Because the error signal (gradient) had to be multiplied by the same weight matrix at every step during backpropagation, it would shrink exponentially, vanishing to zero over long sequences. This meant RNNs had a "short-term memory" and failed to learn long-range dependencies, which are essential for understanding language. To solve this, the Long Short-Term Memory (LSTM) network was introduced in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. It is a special type of RNN architecture designed specifically to prevent the vanishing gradient problem. 
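Before opening up the LSTM cell, it is worth seeing how small the plain RNN recurrence from the previous section is in code; the contrast with the LSTM's gating machinery is the whole point. This is a minimal NumPy sketch with toy dimensions and random, untrained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 3                     # toy dimensions

# Shared weights: the SAME matrices are reused at every time step.
W_xh = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b_h = np.zeros(d_hidden)

def rnn_step(h_prev, x_t):
    """h_t = tanh(W_hh . h_{t-1} + W_xh . x_t + b_h)"""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

sequence = [rng.normal(size=d_in) for _ in range(5)]  # five toy "word vectors"
h = np.zeros(d_hidden)                                # h_0 starts at zero
for x_t in sequence:
    h = rnn_step(h, x_t)                              # memory is carried forward

print(h)  # the final hidden state summarizes the whole sequence
```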
#### The Core Idea An LSTM unit is far more complex than a standard RNN's simple $\tanh$ layer. It introduces a new component called the cell state ($c_t$) and a series of "gates" that regulate the flow of information. - **Cell State ($c_t$):** You can think of this as the LSTM's "long-term memory." It's like a conveyor belt that runs straight down the entire sequence. Information can be added to or removed from this cell state, but it isn't automatically overwritten at every step. - **Gates:** These are neural network layers (using sigmoid activation functions) that learn which information is relevant. They output values between 0 (let nothing through) and 1 (let everything through). <div style="display:flex; flex-direction:column; align-items:center; gap: 2;"> <img src="https://hackmd.io/_uploads/ByjuZEwCgl.png " style="display: block; width: 50%;" /> <p><b>Figure 2.5:</b> Structure of LSTM cell</p> </div> An LSTM has three main gates to protect and control the cell state: - **Forget Gate ($f_t$):** Decides what information to throw away from the old cell state ($c_{t-1}$). - **Input Gate ($i_t$):** Decides what new information to store in the cell state. - **Output Gate ($o_t$):** Decides what part of the cell state to use for the output (the new hidden state, $h_t$). Let's look at the mathematical process at each time step $t$. #### The Mathematics of an LSTM 1. **The Forget Gate** The first step is to decide what to forget. The forget gate, $f_t$, looks at the previous hidden state ($h_{t-1}$) and the current input ($x_t$) and outputs a number between 0 and 1 for each piece of information in the old cell state, $c_{t-1}$. $$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$ - $\sigma$ is the sigmoid function. - $[h_{t-1}, x_t]$ means the two vectors are concatenated (joined together). A '1' means "keep this," and a '0' means "forget this." 2. **The Input Gate** Next, the network decides what new information to store. This is a two-part process: First, the "input gate layer" ($i_t$) uses a sigmoid to decide which values in the cell state to update. $$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$ Second, a $\tanh$ layer creates a vector of new candidate values ($\tilde{c}_t$) that could be added to the state. $$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$ 3. **Updating the Cell State** Now we update the old cell state, $c_{t-1}$, into the new cell state, $c_t$. $$c_t = (f_t \cdot c_{t-1}) + (i_t \cdot \tilde{c}_t)$$ This formula is the heart of the LSTM. Let's translate it: - $f_t \cdot c_{t-1}$: This is the "forget" step. We multiply the old state by the forget vector. If $f_t$ had a 0, that part of the old memory is erased. - $i_t \cdot \tilde{c}_t$: This is the "input" step. We multiply the new candidate values by the input gate vector. If $i_t$ had a 0, that new piece of information is ignored. We then simply add the two results together. 4. **The Output Gate** Finally, we decide what to output as the new hidden state, $h_t$. This hidden state is a "filtered" version of the cell state. First, the "output gate" ($o_t$) uses a sigmoid to decide which parts of the cell state we'll output. $$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$ Second, we put the cell state $c_t$ through a $\tanh$ function to squash its values between -1 and 1. Then, we multiply the two: $$h_t = o_t \cdot \tanh(c_t)$$ This new $h_t$ is the "short-term memory" that is passed to the next time step and used for making predictions (e.g., predicting the next word). 
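The four gate equations above translate almost line for line into code. Here is a minimal NumPy sketch of a single LSTM time step, with toy dimensions and random, untrained weights; the element-wise products are what implement the gating.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 3
d_cat = d_hidden + d_in                             # size of [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate (random, untrained).
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(d_hidden, d_cat)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(d_hidden) for _ in range(4))

def lstm_step(h_prev, c_prev, x_t):
    concat = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f = sigmoid(W_f @ concat + b_f)                 # forget gate
    i = sigmoid(W_i @ concat + b_i)                 # input gate
    c_tilde = np.tanh(W_c @ concat + b_c)           # candidate values
    c = f * c_prev + i * c_tilde                    # update the cell state
    o = sigmoid(W_o @ concat + b_o)                 # output gate
    h = o * np.tanh(c)                              # filtered short-term memory
    return h, c

h = np.zeros(d_hidden)
c = np.zeros(d_hidden)
for x_t in [rng.normal(size=d_in) for _ in range(5)]:
    h, c = lstm_step(h, c, x_t)

print(h, c)
```

Stacking this step over a long sequence is what lets the additive cell-state update carry information (and gradients) much further than the plain RNN, as the next subsection explains.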
#### How LSTMs Solve Vanishing Gradients The LSTM's architecture brilliantly solves the vanishing gradient problem. The key is the cell state $c_t$ and its update equation: $c_t = (f_t \cdot c_{t-1}) + ...$ When the gradient flows backward during backpropagation, it passes through this "cell state conveyor belt." Because the primary operation is addition, the gradient doesn't get repeatedly multiplied by a weight matrix at each step (as it does in a simple RNN). <div style="display:flex; flex-direction:column; align-items:center; gap: 2;"> <img src="https://hackmd.io/_uploads/Sk13GEPCxe.png" style="display: block; width: 90%;" /> <p><b>Figure 2.5</b></p> </div> This "uninterrupted" gradient flow allows the error signal to travel back hundreds of time steps. The gates (which are also learning) learn to "open" and "close" to protect this flow, allowing the network to learn the connection between a word at the beginning of a paragraph and an error at the very end. For many years, LSTMs (and a popular, more efficient variant called the GRU, or Gated Recurrent Unit) were the state-of-the-art for nearly all complex sequence tasks, from machine translation to sentiment analysis, precisely because they could truly learn long-term dependencies. ### Attention Is All You Need Recurrent Neural Networks (RNNs) and their advanced variant, Long Short-Term Memory (LSTM), were the state-of-the-art for sequence processing for years. They excelled at capturing dependencies by processing text sequentially, word by word. However, this sequential nature was also their greatest weakness. - **The Sequential Bottleneck:** The fact that $h_t$ could only be computed after $h_{t-1}$ made them fundamentally sequential. This meant they could not be easily parallelized on modern GPU hardware. Processing a 100-word sentence required 100 sequential steps, making it very slow to train on massive datasets. - **Long-Range Dependencies:** While LSTMs were a huge improvement over RNNs, the "path" for information to travel from the first word to the last word was still long (proportional to the sequence length). For very long documents, gradients could still struggle to flow effectively. In 2017, a paper from Google titled "Attention Is All You Need" proposed a new architecture that was a radical departure. It completely eliminated recurrence (the loops) and instead relied entirely on a mechanism called self-attention. This new model was called the Transformer. #### The Core Idea > The Transformer processes all words in a sequence at the same time (in parallel). But if there's no sequence, how does it know which words are related? It uses self-attention. For every single word, self-attention allows the model to "look at" and weigh the importance of all other words in the same sequence. It builds a rich, contextual understanding of a word by seeing what other words "pay attention" to it, and what words it should "pay attention" to. For example, in the sentence *"The animal didn't cross the street because it was too tired,"* self-attention can learn to directly link the word "it" back to "animal," even if they were many words apart. #### Input: Positional Encodings Since the model has no "loop," it has no inherent sense of word order. To solve this, the Transformer adds a Positional Encoding vector to each word's embedding vector. This is a vector of the same dimension as the word embedding, but it's generated based on the word's position (1st, 2nd, 3rd...) in the sequence. 
It's a clever mathematical trick using sine and cosine functions of different frequencies.

$$\text{Embedding}_{\text{final}} = \text{Embedding}_{\text{word}} + \text{Encoding}_{\text{position}}$$

This encoding gives the model a signal about the relative or absolute position of words, allowing it to understand word order without using recurrence.

#### The Self-Attention Mechanism

This is the heart of the Transformer. Instead of a hidden state ($h_t$), it computes a new representation for each word based on a weighted average of all other words in the sequence. For each input word vector, the model first creates three new vectors, all derived from the word's embedding:

- **Query ($q$):** This vector represents the word's "question." It's used to "ask" other words, "How relevant are you to me?"
- **Key ($k$):** This vector is like the word's "label." It's what the Query vectors will be matched against.
- **Value ($v$):** This vector contains the word's actual content or meaning. This is what gets summed up if a word is deemed "important."

These vectors are created by multiplying the input embedding ($x$) by three separate weight matrices ($W_Q$, $W_K$, $W_V$) that are learned during training.

$$q = x \cdot W_Q \quad | \quad k = x \cdot W_K \quad | \quad v = x \cdot W_V$$

**The Scaled Dot-Product Attention Formula**

<div style="display:flex; flex-direction:column; align-items:center; gap: 2;">
    <img src="https://hackmd.io/_uploads/HJjz24PCgl.png" style="display: block; width: 20%;" />
    <p><b>Figure 2.7:</b> Scaled Dot-Product Attention</p>
</div>

The attention calculation happens in three steps:

**Step 1: Score ($q \cdot k$)**
To find out how much "attention" the word "it" (Query) should pay to "animal" (Key), we compute the dot product of their respective $q$ and $k$ vectors. This dot product ($q \cdot k$) produces a single scalar "attention score." This is done for the Query word against every other Key word in the sentence.

**Step 2: Scale and Softmax ($\text{softmax}(\frac{\text{scores}}{\sqrt{d_k}})$)**
The raw scores are scaled down by dividing by the square root of the dimension of the key vectors ($\sqrt{d_k}$). This is done for numerical stability. Then, all the scores for a given word are passed through a softmax function. Softmax converts the scores into probabilities that all add up to 1.0. A word with a high score (like "animal") might get a softmax score of 0.8, while an irrelevant word (like "street") might get 0.05.

**Step 3: Weighted Sum ($... \cdot V$)**
We then take these softmax scores and multiply them by each word's Value ($v$) vector. Finally, we sum up these weighted Value vectors. The result is a new vector that represents the original word, but it is now a rich contextual embedding that has "attended" to all other words in the sequence, incorporating their meaning based on their relevance.

The full formula for a set of Q, K, and V matrices is:

$$Attention(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

#### Multi-Head Attention

The Transformer doesn't just do this once. It does it multiple times in parallel in what's called Multi-Head Attention. Each "head" is a separate set of $W_Q, W_K, W_V$ matrices.

<div style="display:flex; flex-direction:column; align-items:center; gap: 2;">
    <img src="https://hackmd.io/_uploads/Sydw3EP0lg.png" style="display: block; width: 30%;" />
    <p><b>Figure 2.8:</b> Multi-Head Attention</p>
</div>

This allows the model to learn different types of relationships simultaneously.
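To ground the formula, here is a minimal NumPy sketch of scaled dot-product attention for a single head, with toy dimensions and random, untrained projection matrices (a real model learns $W_Q$, $W_K$, $W_V$ during training).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4          # toy dimensions

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(seq_len, d_model))  # one embedding (+ position) per token

# Learned projections (random here): every token gets a query, key, and value.
W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)          # Step 1 and the scaling of Step 2
weights = softmax(scores)                # Step 2: each row sums to 1.0
output = weights @ V                     # Step 3: weighted sum of the values

print(weights.shape, output.shape)       # (5, 5) attention map, (5, 4) new vectors
```

Multi-Head Attention simply runs several independent copies of this computation, each with its own learned $W_Q$, $W_K$, $W_V$, and concatenates the results.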
For example, one attention "head" might learn to connect "it" to "animal" (a pronoun relationship), while another "head" might learn to connect "tired" to "it" (a descriptive relationship). The outputs of all the heads are then concatenated and passed through another linear layer to produce the final output of the attention block.

#### The Full Architecture: Encoders and Decoders

<div style="display:flex; flex-direction:column; align-items:center; gap: 2;">
    <div style="display:flex; justify-content:center; align-items:center; gap: 2rem;">
        <img src="https://hackmd.io/_uploads/rkyph4vRxe.png" style="display: block; width: 40%;" />
        <img src="https://images.ctfassets.net/6p5bhasaj8xa/RKldea9Fz0MrupwxpjqVn/530dda70f9c23e3f83ef8e4395f9f2f6/1_XeONVnbiW95fq_KohFDcrw.webp" style="display: block; width: 40%;" />
    </div>
    <p><b>Figure 2.9:</b> The Transformer Architecture</p>
</div>

The Transformer model consists of two main parts:

- **The Encoder:** A stack of $N$ identical layers. Each layer has two sub-layers:
    1. A Multi-Head Self-Attention mechanism.
    2. A simple, position-wise Feed-Forward Network (a two-layer MLP).

    (Residual connections and Layer Normalization are also used around each sub-layer). The Encoder's job is to read the input sequence (e.g., an English sentence) and build a rich, contextual representation of it.
- **The Decoder:** Also a stack of $N$ layers. The Decoder is similar but has a third sub-layer:
    1. A "Masked" Multi-Head Self-Attention (to prevent it from "cheating" and looking at future words it's trying to predict).
    2. A Multi-Head Attention that looks at the output of the Encoder.
    3. A position-wise Feed-Forward Network.

    The Decoder's job is to take the Encoder's representations and generate an output sequence (e.g., the French translation).

#### The Impact of the Transformer

> The Transformer was a seismic shift in NLP.

- **Parallelization:** Its biggest advantage was that it could be trained in parallel. This massive speed-up in training is what made it possible to create models with billions of parameters on internet-scale datasets.
- **Long-Range Dependencies:** Because self-attention directly connects every word to every other word, the "path" length between any two words is just $O(1)$. This allowed it to capture extremely long-range dependencies far more effectively than any RNN or LSTM.

Both the Encoder-only (like BERT) and Decoder-only (like GPT) variants of this architecture are the foundational blueprint for every single modern Large Language Model (LLM).

## The Rise of Large Language Models (LLMs)

The Transformer architecture, introduced in 2017, was the final and most critical invention in our evolutionary journey. Its parallel processing and superior handling of long-range dependencies made it far more effective and, crucially, far faster to train than any RNN or LSTM.

This breakthrough in architecture unlocked a new breakthrough in scale. Researchers discovered that if you took the Transformer (specifically its "Decoder" half) and made it enormous, and then trained it on a significant portion of the entire internet, something incredible happened. The resulting model wasn't just better at predicting the next word; it developed entirely new, "emergent" abilities. This new class of model is the Large Language Model (LLM).

The "Large" in LLM refers to two dimensions:

1. **Large Parameters:** A "parameter" is a learned weight in the network (like the $W_Q, W_K, W_V$ matrices in the Transformer).
While models like the original Transformer had millions of parameters, LLMs have billions or even trillions. This vast number of parameters gives them an immense capacity to memorize, store, and synthesize information and patterns. 2. **Large Data:** LLMs are trained on datasets of unimaginable size, often called "web-scale" text. This can be trillions of "tokens" harvested from books, articles, and a significant snapshot of the public internet. ### Tokenization Before a Transformer can process text, that text must be converted into a sequence of numbers (token IDs). This process, called tokenization, is far more complex than just splitting text by spaces. We cannot just assign a number to every word in the dictionary. The vocabulary would be infinite ("run," "running," "ran," "rerun," plus names, misspellings, and new words). We also can't just use characters ("a", "b", "c"), as this breaks words apart and loses their inherent meaning. The solution is **Subword Tokenization**. The model uses a fixed-size vocabulary (e.g., 32,000 to 100,000) of common "tokens" which can be whole words, common prefixes, or suffixes. #### Tokenization Algorithms Several algorithms are used to create this vocabulary from the training corpus: - **Byte-Pair Encoding (BPE):** This is the most common. 1. It starts with a base vocabulary of all individual characters (bytes). 2. It scans the entire training corpus and finds the most frequently occurring pair of adjacent tokens (e.g., `t` and `h` becoming `th`). 3. It merges this pair into a new, single token (`th`) and adds it to the vocabulary. 4. It repeats this process, merging the next most frequent pair (which could now be `th` and `e` to form `the`), until the vocabulary reaches the desired size (e.g., 32,000). - **WordPiece (Used by BERT):** This is very similar to BPE, but it merges pairs based on which new token would maximize the likelihood of the training data, not just the raw frequency. This often leads to a vocabulary that keeps root words intact and creates tokens for common prefixes and suffixes (e.g., `run` and `##ning`). - **SentencePiece (Used by many modern LLMs):** This algorithm treats the entire text as a raw stream, including spaces. This is powerful because it doesn't make assumptions about what a "word" is, making it excellent for multilingual data that doesn't use spaces (like Japanese or Thai). **Example of Tokenization:** | Word | Tokens | |:------------:|:------------------------------:| | run | [`run`] | | running | [`run`], [`##ning`] | | unbelievable | [`un`], [`##believ`], [##able] | | LLM | [`L`], [`L`], [`M`] | (The `##` indicates the token is part of a word, not the start). #### Special Tokens The vocabulary also includes a set of "control" tokens that are essential for the model to function: - **[`PAD`]:** A "padding" token used to make all sequences in a batch the same length. - **[`UNK`]:** An "unknown" token for any character or subword it truly cannot recognize. - **[`BOS`] / [`EOS`]:** "Beginning of Sequence" and "End of Sequence" tokens that tell the model where a piece of text starts and stops. - **[`CLS`] / [`SEP`]:** "Classification" and "Separator" tokens used by models like BERT to understand the beginning of an input and to separate two different sentences. ### The Training Objective Most modern LLMs (like the GPT series) are "Decoder-only" Transformer architectures. They are trained on one simple, self-supervised task: Causal Language Modeling, or Next-Word Prediction. 
The model is given a sequence of tokens from its massive training dataset and is asked to predict the single most likely token to come next. We can express this mathematically. Given a sequence of tokens $w_1, ..., w_{i-1}$, the model's objective is to maximize the probability of the next token $w_i$: $$\text{Objective: } \max P(w_i | w_1, ..., w_{i-1})$$ #### Training Process 1. **Input:** The model is fed a chunk of real text, e.g., ["The", "quick", "brown", "fox"]. 2. **"Masking":** It uses the Transformer's "Masked Self-Attention." This mask prevents the model from "cheating" and looking ahead. When predicting the token after "brown", it can only attend to "The", "quick", and "brown". 3. **Prediction (Logits):** The model's final layer outputs a raw, un-normalized score (a logit) for every single token in its vocabulary. This is a massive vector of ~50,000 numbers. 4. **Softmax:** To turn these logits into probabilities, they are passed through a softmax function, which converts all scores into a probability distribution where all values are between 0 and 1 and sum to 1.0. 5. **Loss Calculation:** The model's probability distribution is compared to the actual next token (the "ground truth"). This is done using a Loss Function, typically Cross-Entropy Loss. The formula for Cross-Entropy Loss for a single prediction is: $$L = -\sum_{i} y_i \log(\hat{y}_i)$$ - $y_i$ is the "ground truth" (it's 1 for the correct token and 0 for all others). - $\hat{y}_i$ is the model's predicted probability for that token. This formula brilliantly penalizes the model. If the correct token was "jumps" ($y_i=1$) and the model only gave it a 0.01% probability ($\hat{y}_i=0.0001$), the loss will be very high. The model is thus heavily penalized for being "confident and wrong." 6. **Backpropagation:** The error (the Loss) is backpropagated to adjust all the model's billions of parameters, "nudging" them to make the correct prediction ("jumps") more likely next time. When this simple process is repeated trillions of times, the model is forced to learn grammar, syntax, facts, reasoning, and a "world model" to get better at predicting the next word. ### How LLMs Generate Text (Inference) Training teaches the model to predict the next token. "Inference" (or generation) is the process of using that ability to write new text. It's an iterative, one-token-at-a-time loop: 1. Feed the prompt (e.g., [`"The"`, `"quick"`, `"brown"`]) into the model. 2. The model does one "forward pass" and produces a probability distribution for the next token. 3. A **sampling strategy** is used to choose one token from this distribution (e.g., it picks `"fox"`). 4. This new token is appended to the prompt, creating a new input: [`"The"`, `"quick"`, `"brown"`, `"fox"`]. 5. The model feeds this new prompt into itself, and the loop repeats (Step 2) until it generates an [`EOS`] token or hits a length limit. The "creativity" or "determinism" of the model is controlled entirely by the sampling strategy used in Step 3. #### Sampling Strategies - **Greedy Search (Argmax):** Always pick the single token with the highest probability. Result: Very deterministic, "boring," and repetitive. It will always give the same answer to the same prompt. - **Temperature Sampling:** This method "scales" the logits before the softmax function to change the shape of the probability distribution. $\text{Probabilities} = \text{softmax}(\frac{\text{logits}}{T})$ - **High Temperature ($T > 1.0$):** Flattens the distribution, making all tokens more equal. 
This increases randomness and "creativity."
- **Low Temperature ($T < 1.0$):** Sharpens the distribution, making high-probability tokens even more likely. This moves it closer to Greedy Search, making it more focused and deterministic.
- **Top-k Sampling:** The model is restricted to sampling only from the $k$ most probable tokens (e.g., $k=50$). This prevents the model from picking absurd, low-probability tokens.
- **Top-p (Nucleus) Sampling:** This is the most popular modern method. Instead of a fixed number $k$, it samples from the smallest set of tokens whose cumulative probability is greater than or equal to $p$ (e.g., $p=0.95$). Result: This is dynamic. If the model is "confident," the top set might only be 3-4 tokens. If the model is "uncertain," the set might be 100+ tokens, giving it more options.

#### Emergent Abilities and the Context Window

The extreme scale of these models leads to Emergent Abilities, complex skills that were not explicitly programmed but "emerge" from the model's deep understanding of language. The most important emergent ability is In-Context Learning. Because the model is trained to complete patterns, we can steer it at runtime using a prompt.

- **Zero-shot:** Ask it to do a task directly.
> Prompt: `"Translate this to French: I love bread."`
- **One-shot:** Give it one example of the pattern.
> Prompt: `"English: cat -> French: chat. English: I love bread ->"`
- **Few-shot:** Give it several examples to show a complex pattern.
> Prompt: `"Review: 'It was great!' -> Positive. Review: 'I hated it.' -> Negative. Review: 'It was okay, I guess.' ->"`

This brings us to a critical hardware and architectural limit: the Context Window. The self-attention mechanism is computationally expensive; it has to compare every token to every other token. This means there is a hard limit on the number of tokens the model can "see" at one time.

- Early models (like GPT-2) had a context window of 1,024 tokens.
- Modern models (like GPT-4 Turbo) have windows of 128,000 tokens or more.

The context window is the model's entire "attention span" or "working memory." If a piece of information is not currently inside its context window, the model cannot see it, use it, or remember it.

### The Core Limitations

Despite their incredible power, LLMs suffer from two fundamental limitations that prevent them from being reliable for many real-world, high-stakes applications.

#### Hallucinations

> A hallucination is when an LLM generates a response that is fluent, confident, and plausible-sounding, but is factually incorrect, nonsensical, or completely fabricated.

**Why does this happen?** Remember the training objective: *the model is optimized to predict the most plausible next word, not the most truthful one*. Its "world knowledge" is just a set of statistical patterns.

- It might invent academic citations that look real but don't exist.
- It might mix up facts, dates, and people.
- It might confidently state that a recent World Cup was won by a team that didn't even compete, simply because that team is "statistically associated" with "winning."

This is the primary barrier to **trust**. An answer that is sometimes wrong but always confident is dangerous.

#### The Knowledge Boundary

An LLM's knowledge is "frozen" in its parameters the moment its training is completed. This creates two problems:

- **Knowledge Cut-off:** The model is completely unaware of any events, data, or information created after its training date.
A model trained until April 2023 cannot tell you the winner of a sports game from May 2023 or describe a product launched yesterday. - **Inability to Access Private Data:** An LLM has no access to proprietary, internal, or domain-specific information. It cannot read your company's internal wiki, your private emails, or the latest technical manuals for your organization's products. Hallucinations and static/private knowledge are the core challenges that must be solved to make LLMs reliable and useful in an enterprise or personal context. ## Retrieval-Augmented Generation While Large Language Models (LLMs) demonstrate extraordinary capabilities in understanding and generating language, their architecture has fundamental weaknesses that hinder their adoption in applications demanding high factual accuracy and reliability. The solution to these limitations is an architecture known as **Retrieval-Augmented Generation (RAG)**. RAG was introduced as a solution to "ground" the LLM on external, verifiable knowledge, directly suppressing hallucinations and improving factual accuracy. It transforms the LLM from a "closed book" into an "open book" system that can dynamically retrieve relevant information from an external database before generating an answer. The RAG architecture formally combines two components that are trained end-to-end: a Retriever and a Generator. ![image](https://hackmd.io/_uploads/BkO6eUPCll.png) ### The Retriever (Non-Parametric Memory) The Retriever's job is to act as the "non-parametric memory". Given an input query ($x$), the retriever searches a large external corpus (like technical manuals or a database) and retrieves a set of $k$ relevant documents or text chunks ($z$). - **Implementation:** This is often implemented as a bi-encoder architecture like the Dense Passage Retriever (DPR). A bi-encoder generates dense vector embeddings for both the query and all the document chunks in the index. - **Search:** It then uses a highly efficient search method, like Maximum Inner Product Search (MIPS), to find the document vectors that are "closest" in meaning to the query vector. - **Formula:** The retriever is defined as a probability distribution $p_{\eta}(z|x)$ over the potential documents $z$ given the query $x$. ### The Generator (Parametric Memory) The Generator is the "parametric memory," typically a pre-trained seq2seq model like BART or a decoder-only model like GPT. The key difference is that the Generator does not just receive the original query ($x$). Instead, its input is augmented with the content of the retrieved documents ($z$). With this new, fact-based context, the Generator's task is to synthesize the final answer ($y$) token by token. The generator is defined as a conditional probability $p_{\theta}(y_{i}|x,z,y_{1:i-1})$, where the prediction of the next token ($y_i$) is conditioned on the original query ($x$), the retrieved document ($z$), and all previously generated tokens ($y_{1:i-1}$). During training, the retrieved document $z$ is treated as a latent variable. The model is trained to minimize the negative marginal log-likelihood of the target output by marginalizing (averaging over) the top-k retrieved documents. ### RAG Variants The original RAG paper proposed two variants based on how the marginalization is performed: - **RAG-Sequence:** This model retrieves the same document (or set of documents) and uses it to generate the entire output sequence. 
The probability for the whole sequence is calculated by summing the probabilities for each retrieved document.
$$p(y|x) \approx \sum_{z \in \text{top-}k(p_{\eta}(\cdot|x))} p_{\eta}(z|x) \prod_{i=1}^{N} p_{\theta}(y_{i}|x,z,y_{1:i-1})$$
- **RAG-Token:** This model is more flexible and can use different documents to generate each individual token in the output sequence. This allows the generator to pull information from multiple sources while formulating a single answer.
$$p(y|x) = \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_{\eta}(\cdot|x))} p_{\eta}(z|x) \, p_{\theta}(y_{i}|x,z,y_{1:i-1})$$

In practice, RAG has been proven to significantly outperform parametric-only models on knowledge-intensive tasks, producing answers that are more factual, specific, and diverse because they are directly conditioned on textual evidence.

## LightRAG

While standard RAG is effective, it often relies on "flat data representations": a simple list of independent text chunks. This limits the model's ability to understand complex, non-linear relationships between entities, which can lead to fragmented answers when a query requires synthesizing information from multiple concepts.

### Indexing via Knowledge Graphs

To address this limitation, advanced approaches like LightRAG fundamentally change how knowledge is indexed and retrieved by integrating a graph structure. Instead of just storing text chunks, LightRAG builds a knowledge graph where:

- **Nodes:** Represent entities (e.g., a specific router, a protocol like "BGP").
- **Edges:** Represent the relationships between those entities (e.g., "is configured with," "connects to").

By utilizing this graph structure, LightRAG can navigate complex relationships between concepts, allowing it to synthesize information that is far more coherent and context-rich than what is possible with simple chunk retrieval.

![image](https://hackmd.io/_uploads/HJl3tbLD0ee.png)

### RAG-Anything

A further challenge, especially in technical domains, is the variety of data formats. Critical information is often not in plain .txt files but is "locked" inside PDFs (datasheets, guides), web pages (online documentation), or other structured formats. RAG-Anything is a framework designed to solve this data ingestion problem. It acts as a flexible data preprocessing layer that can:

- Parse various file types (PDF, HTML, etc.).
- Intelligently extract the textual content.
- Hand off this clean text to an indexing system (like LightRAG).

By automating the extraction from these heterogeneous sources, RAG-Anything enables the construction of a far richer and more complete knowledge base, which is essential for technical domains like network configuration. The combination of RAG-Anything for data extraction and LightRAG for graph-based indexing creates a comprehensive and powerful RAG system capable of understanding not just text, but the complex relationships between concepts.
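As a closing illustration, here is a minimal sketch of the basic RAG loop (index, retrieve, augment) that all of these systems build on. The embedding function below is a crude stand-in so the example runs without any external models, and the chunks and query are made up; a real retriever would use a trained bi-encoder and a vector index.

```python
import numpy as np

# Stand-in embedding: a bag-of-words hash, ONLY so the sketch runs end to end.
# A real system would call an embedding model here instead.
def embed(text: str, dim: int = 64) -> np.ndarray:
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

# 1. Index: embed every chunk of the external knowledge base once.
chunks = [
    "OSPF uses link-state advertisements to build a topology map.",
    "BGP exchanges routing information between autonomous systems.",
    "VLANs segment a switch into multiple broadcast domains.",
]
index = np.stack([embed(c) for c in chunks])

# 2. Retrieve: embed the query and take the top-k most similar chunks.
query = "Which protocol connects different autonomous systems?"
scores = index @ embed(query)            # cosine similarity (vectors are unit-norm)
top_k = np.argsort(scores)[::-1][:2]

# 3. Augment: paste the retrieved evidence into the prompt for the generator.
context = "\n".join(chunks[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)                            # this prompt would be sent to the LLM
```

A real deployment would swap in a trained embedding model, a vector database for the index, and an LLM call on the final prompt; LightRAG and RAG-Anything replace the indexing and ingestion pieces with the richer, graph-aware pipeline described above.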