# BERAS Companion Guide
This is the handy-dandy BERAS `core.py` cheat sheet that will help explain the stencil code in BERAS. Of course, the first step will be to read the actual code, but as you move through it, refer to this document to help make sense of things.
BERAS is a hefty assignment, so take your time with this document and make sure you understand how the stencil works before you start working! Good luck!
## **Core Class Hierarchy**
First, we have the breakdown of our core classes in BERAS and how they connect with one another. The file `core.py` defines the `Callable`, `Weighted`, and `Diffable` parent classes which serve as the building blocks for all of BERAS.
The next section will explain these classes more in depth, but it is important to understand how they relate to one another so you know which attributes and functions you have access to.
```
  Callable                Weighted
 (can call)             (has weights)
      │                       │
      └───────────┬───────────┘
                  │
               Diffable
  (callable + weighted + gradients)
                  │
      ┌───────────┼───────────┐
      │           │           │
    Dense    Activation     Loss
  (layers) (ReLU, Sigmoid) (MSE, CCE)
```
## **Core Classes Explained**
You'll notice that many of these preliminary constructs are simply wrappers around basic concepts that exist in TensorFlow.
### **Tensor**: Our Enhanced NumPy Array
*What it is:* Our version of a TensorFlow tensor, which is a NumPy array that can be marked as trainable.
```python
my_tensor = Tensor([[1, 2], [3, 4]])
my_tensor.trainable = True
```
**Key Properties:**
- `.trainable` - Boolean flag indicating if gradients should be computed for this tensor
### **Variable**: Our *Special* Tensor for Weights
*What it is:* A Tensor specifically for model parameters (weights, biases)
```python
weights = Variable(np.random.normal(size=(10, 5)))
weights.assign(new_values)
```
**Key Methods:**
- `.assign(value)` - Update the variable's values in-place
### **Callable**: "Something you can call like a function"
You implemented this in Assignment 1 (I told you it would come back)
```python
from abc import ABC, abstractmethod

class Callable(ABC):
    # Abstract base class - defines the contract:
    # "If you inherit from me, you must implement forward()"

    def __call__(self, *args):
        # Makes obj() work like obj.forward() (THIS IS WRITTEN FOR YOU)
        return self.forward(*args)

    @abstractmethod
    def forward(self, *args):  # YOU WILL implement this in child classes
        pass
```
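For intuition, here is a toy subclass (not part of the stencil) showing how `__call__` dispatches to `forward()`:
```python
import numpy as np

# Toy example (NOT in the stencil): a subclass only needs to fill in forward()
class Doubler(Callable):
    def forward(self, x):
        return 2 * x

doubler = Doubler()
doubler(np.array([1, 2, 3]))   # same as doubler.forward(...) -> array([2, 4, 6])
```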
### **Weighted**: "Something that has learnable parameters"
```python
class Weighted(ABC):
    # Abstract base class - defines the contract:
    # "If you inherit from me, you must have a weights property"

    @property
    @abstractmethod
    def weights(self) -> list[Tensor]:  # YOU implement this
        pass

    # Provides useful utilities:
    #   .trainable_variables      - only weights with trainable=True
    #   .non_trainable_variables  - only weights with trainable=False
    #   .trainable = True/False   - set all weights' trainable status
```
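As a rough illustration of how those utilities behave once `weights` is defined, consider this hypothetical subclass (not in the stencil):
```python
import numpy as np

# Hypothetical Weighted subclass, just to show the utilities listed above
class TinyLayer(Weighted):
    def __init__(self):
        self.w = Variable(np.zeros((3, 2)))
        self.b = Variable(np.zeros(2))
        self.b.trainable = False        # freeze the bias (assuming Variables default to trainable=True)

    @property
    def weights(self) -> list[Tensor]:
        return [self.w, self.b]

layer = TinyLayer()
layer.trainable_variables       # -> [layer.w]  (only weights with trainable=True)
layer.non_trainable_variables   # -> [layer.b]  (only weights with trainable=False)
```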
### **Diffable**: "The Heart of the Framework"
This is where the magic happens! This is the class that brings everything together and makes it work with backpropagation. Think of Diffable as the **DL recording studio** that:
1. **Records everything during forward pass** (if gradient tape is active)
2. **Knows how to compute its own gradients**
3. **Can compose gradients using chain rule**
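To connect those three ideas, here is a toy, parameter-free `Diffable` sketch. It is not a stencil class, and the real `Dense`, activations, and losses follow the exact signatures in `core.py`; this just shows how `forward()`, the recorded `inputs`, and the local-gradient methods fit together.
```python
# Toy sketch only -- the real classes live in the stencil
class Square(Diffable):
    @property
    def weights(self) -> list[Tensor]:
        return []                      # no learnable parameters for this toy op

    def forward(self, x) -> Tensor:
        return Tensor(x ** 2)          # z = x^2

    def get_input_gradients(self) -> list[Tensor]:
        # local Jacobian dz/dx = 2x, using the inputs recorded during the forward pass
        return [Tensor(2 * self.inputs[0])]

    def get_weight_gradients(self) -> list[Tensor]:
        return []                      # nothing to update
```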
## **How the Gradient Tape System Works**
Gradient tape is the heart of the deep learning framework, and as the name lets on, it has everything to do with gradients. Let's look at the following diagram to understand the algorithm a bit more and see what's going on here.
### Gradients

The diagram illustrates both the forward and backward passes, where each node represents a layer in our network. During the forward pass, data flows from left to right through the operation. During the backward pass, gradients flow from right to left, shown by the blue arrows.
For example's sake, assume this node represents our dense layer. It shows a single operation where inputs `W` and `x` are combined (via the `*` operation) to produce output `z`, representing one step in the forward pass of a neural network. `s` is the final output of the entire network (likely the loss during training).
### **The Recording Process (Forward Pass)**
In the forward pass, inputs `x` and parameters `W` flow into a computation that produces `z`. Gradient tape records these operations so it knows how each output depends on its inputs.
```python
with GradientTape() as tape:
    # 1. Gradient tape becomes "active"
    # 2. Every Diffable operation gets recorded
    # 3. Each output tensor remembers which layer created it
    x = layer1(inputs)              # Records: "x came from layer1"
    y = layer2(x)                   # Records: "y came from layer2"
    loss = loss_fn(y, targets)      # Records: "loss came from loss_fn"
```
:::warning
Remember, whenever you call a diffable object, the following attributes are internally tracked and provided to you by the stencil code:
- `diffable.inputs` - List of input tensors to this layer
- `diffable.outputs` - List of output tensors from this layer
- `tape.previous_layers[id(output)]` - Maps each output to the layer that created it
If you are confused by how these attributes are being tracked or stored, reread the [previous section](#Core-Classes-Explained) and look at `core.py` to revise these attributes! You will **need** to use them, so **understand them now**!
:::
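For example, after a single recorded call you could inspect these attributes (here `dense_layer` and `x` are hypothetical stand-ins for your own layer and input):
```python
with GradientTape() as tape:
    z = dense_layer(x)             # one recorded Diffable call

dense_layer.inputs                 # list containing x
dense_layer.outputs                # list containing z
tape.previous_layers[id(z)]        # -> dense_layer (the layer that created z)
```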
### **The Playback Process (Backward Pass)**
Now, when we go through the backward pass, there are many different factors we must account for. There isn't a single gradient we are calculating, but multiple. In the diagram above, you can see three types of gradients: **upstream**, **downstream**, and **local** gradients.
- The **upstream gradient** `(∂s/∂z)` represents the **cumulative gradient**, *excluding this layer's contribution*, flowing into this operation from later layers in the network (closer to the loss). Remember, in backpropagation we go **backwards**, so we traverse the computation graph from later layers back to the input.
    - This gradient tells us how the final loss `s` changes with respect to this operation/layer's **output** `z`.
- The **downstream gradients** `(∂s/∂W)` and `(∂s/∂x)` represent the **cumulative gradients**, *including this layer's contribution*, flowing out of this operation to earlier layers (closer to the input).
    - These gradients tell us how the final loss `s` changes with respect to this operation/layer's **inputs** `x` and `W` (since the computation graph sees both `x` and `W` as inputs to the function `f(x, W) = x @ W + b`).
    - A downstream gradient is computed by chaining the upstream gradient, `(∂s/∂z)`, with the local gradients `(∂z/∂W)` and `(∂z/∂x)` to obtain each input's effect on the final loss, `s`.
- The **local gradients** `(∂z/∂W)` and `(∂z/∂x)` capture how this specific operation/layer's output changes with respect to its inputs. Specifically, they describe how changes in this operation's inputs, `x` and `W` in our case, affect this operation's output, `z`.
    - These depend only on the operation itself, not on the broader network.
This process repeats for every operation in the network. Each layer receives upstream gradients from later layers, computes its local gradients, and produces downstream gradients for earlier layers. Gradient tape's job is to record the computational graph during the forward pass, then play it back in reverse during the backward pass to compute these gradients correctly and efficiently.
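To make this concrete for the dense example `f(x, W) = x @ W + b`, the chain rule turns the upstream gradient $\frac{\partial s}{\partial z}$ into the two downstream gradients (standard matrix-calculus identities, stated here just for reference):

$$
\frac{\partial s}{\partial x} = \frac{\partial s}{\partial z}\, W^{\top},
\qquad
\frac{\partial s}{\partial W} = x^{\top}\, \frac{\partial s}{\partial z}
$$

Each downstream gradient is simply the upstream gradient chained with the corresponding local Jacobian; averaging $\frac{\partial s}{\partial W}$ over the batch yields one gradient per parameter. This chaining is exactly what the compose functions described below handle for you.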
Outside the scope of the gradient tape context (you could technically do it inside, but it doesn't make sense to), you will call
```python
gradients = tape.gradient(loss, sources)
```
and within your `gradient_tape.py`, you will define the `gradient` method to:
1. Take the current tensor's upstream gradient (the gradient flowing into this tensor from later ops).
    - Notice how we handle the case where no upstream gradient exists
    - (**HINT**: `defaultdict` returns `None` if the key does not exist)
2. Ask the producing layer to turn that upstream gradient into:
    - **input gradients** `(∂s/∂x)`: to push farther back through the graph
    - **weight gradients** `(∂s/∂W)`: to accumulate parameter updates.
3. Enqueue the inputs to the operation/layer and store the gradients in the `grads` dictionary.
    - What is an intuitive way to store the upstream/downstream gradient?
4. Continue until all sources are reached.
In simple terms, think of it as follows: follow outputs → find their creator layer → push gradients back through that layer → repeat using BFS, and voila, you are deep in learning.
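If it helps to see that recipe as code, here is a very rough, non-stencil sketch of the BFS idea. It assumes the attribute names from this guide (`previous_layers`, `inputs`, `weights`, `compose_input_gradients`, `compose_weight_gradients`); your actual `gradient` signature, seeding, and bookkeeping in `gradient_tape.py` may differ.
```python
from collections import defaultdict, deque
import numpy as np

def gradient_sketch(tape, loss, sources):
    """Illustrative only: follow outputs back to their creator layers via BFS."""
    grads = defaultdict(lambda: None)          # id(tensor) -> gradient w.r.t. the loss
    grads[id(loss)] = np.ones_like(loss)       # seed: ds/ds = 1 (one common choice)

    queue = deque([loss])
    while queue:
        out = queue.popleft()
        layer = tape.previous_layers.get(id(out))
        if layer is None:                      # leaf tensor (e.g. the raw network input)
            continue

        upstream = grads[id(out)]
        # NOTE: whether J is passed as a single tensor or a list depends on the stencil
        input_grads = layer.compose_input_gradients([upstream])    # ds/dx_i
        weight_grads = layer.compose_weight_gradients([upstream])  # ds/dw_k

        for inp, grad in zip(layer.inputs, input_grads):
            grads[id(inp)] = grad              # (a full solution should accumulate, not overwrite)
            queue.append(inp)                  # keep walking back through the graph

        for w, grad in zip(layer.weights, weight_grads):
            grads[id(w)] = grad

    return [grads[id(src)] for src in sources]
```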
## Understanding the Compose Functions
These functions are the **most confusing part** but also the **most important**. Let me break them down with concrete examples.
These two methods live on every `Diffable` (you already have access to them):
- `compose_input_gradients(J)`
- `compose_weight_gradients(J)`
They apply the chain rule by combining the incoming upstream gradient(s), `J`, with the layer’s local Jacobians (which the layer provides via its own `get_input_gradients()` and `get_weight_gradients()`).
### Connecting Local and Cumulative Gradients
Each layer exposes local Jacobians via `get_input_gradients()` and `get_weight_gradients()`. These answer the question: "How does the output of the layer change if I wiggle this input or weight, holding everything else fixed?"
However, and this is the key bit: on their own, those local Jacobians are **not the gradients you want**. Backprop needs cumulative gradients that include everything that happens after this layer. That's *exactly* what the compose functions produce for you.
#### Upstream vs. Downstream
When backprop reaches a layer's output, it brings an upstream gradient `J` with shape `(B, M)`, where `B` is the batch size and `M` is the layer's output dimension.
- This is $\frac{\partial s}{\partial \text{output}}$ from later ops (the change in the final output with respect to the current layer's output)
- The layer must turn that into downstream gradients for its inputs and weights
Now, for each of the functions:
- `compose_input_gradients(J)` takes `J` and the local input Jacobians $\frac{\partial \text{output}}{\partial x_i}$ (the change in the layer's output with respect to the input $x_i$) and applies the chain rule per sample.
    - It returns one tensor per input $x_i$, each shaped like that input, $(B, d_i)$; these become the new upstream gradients for the previous layers (which backprop visits later).
    - **NOTE:** In our case, we will likely only ever have one input at each layer, but we are generalizing here: in some cases, such as input optimization, a model may take in multiple inputs at each layer.
- `compose_weight_gradients(J)` takes in `J`, the upstream gradient, and calls the layer's `get_weight_gradients()` to get the local weight gradients. It then loops over them, chaining each local gradient with the upstream gradient to compute $\frac{\partial s}{\partial w_k}$. **Ask yourself: what does this gradient represent, and how do you use it?**
    - Internally, the function takes the mean of the gradients over the batch to yield a single gradient value per parameter.
### **Why the Difference?**
- **Input gradients:** Each sample gets its own gradient for the next layer, so we do not average across the batch
- **Weight gradients:** Averaged across batch since weights are shared across all samples
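To make the shapes concrete, here is a small NumPy sketch for a dense layer `z = x @ W + b`. This is just the chain-rule math written out by hand, not the stencil's compose API:
```python
import numpy as np

B, d, M = 4, 3, 2                      # batch size, input dim, output dim
x = np.random.normal(size=(B, d))      # layer input
W = np.random.normal(size=(d, M))      # layer weights
J = np.random.normal(size=(B, M))      # upstream gradient ds/dz, one row per sample

ds_dx = J @ W.T                        # "input gradient": per-sample, shape (B, d)
ds_dW = (x.T @ J) / B                  # "weight gradient": batch-averaged, shape (d, M)

print(ds_dx.shape, ds_dW.shape)        # (4, 3) (3, 2)
```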
---
### **The Key Insight**
In your gradient tape method, it is your job to piece together how to use the different stencil code concepts and functions to write the backpropagation algorithm for an arbitrary number of layers!
This will be very similar to what Keras implements! If you have any questions, come to office hours for more help!