# HW3 Programming: BERAS
:::info
Assignment due **February 26th, 2026 at 11:59 PM EST** on Gradescope!
:::
:::danger
__You should know:__ This assignment is statistically rated as the hardest and most time-consuming assignment of this course (30+ hours).
We **highly** recommend starting early and reading this document carefully in its entirety before you implement any of the code!
:::
## Assignment Overview
### 🪨 Expedition Log: Bruno’s Deep-Earth Dilemma 🐻

*(Image generated by GPT-5)*
Deep Fried Diner, 2026.02.06 — Bruno the Bear just started the graveyard shift at Providence’s legendary 24-hour diner, The Deep Fried Scholar, where Brown students have been scribbling greasy order tickets since 1964. The ancient point-of-sale system—held together by duct tape and optimism—only accepts digits 0–9 to route orders to the kitchen.
Here’s the problem: the basement is packed with 60,000 handwritten order tickets, and the new *smart fryer* needs to read them to calibrate its neural temperature settings. No digits decoded = campus-wide riot.
Your task: build BERAS, a deep learning library that decodes the greasy, smudged digits so the fryer can lock onto the perfect golden-brown crispiness.
___
### Assignment Goals
1. Implement a simple Multi-Layer Perceptron (MLP) model that mimics the PyTorch API.
- Implement core classes and methods used for **Auto Differentiation**.
- Implement a **Linear Layer** similar to PyTorch.
- Implement basic **preprocessing techniques** for use on the **MNIST Dataset**.
- Implement a basic objective (loss) function for regression such as **MSE**.
- Implement basic regression **accuracy metrics**.
- **Learn** optimal weight and bias parameters using **gradient descent** and **backpropagation**.
2. Apply this model to predict digits using the MNIST Dataset.
## Getting Started
### Stencil
<!--LINK THE REFERENCES-->
Please click [here](https://classroom.github.com/a/_uYrfttt) to get the stencil code. Reference this [guide](https://hackmd.io/gGOpcqoeTx-BOvLXQWRgQg) for more information about GitHub and GitHub classroom.
:::danger
**Do not change the stencil except where specified.** You are welcome to write your own helper functions. However, changing the stencil's method signatures **will** break the autograder.
:::
### Environment
You will need to use the virtual environment that you made in Homework 1. You can activate the environment by using the command `conda activate csci1470`. If you have any issues running the stencil code, be sure that your conda environment contains at least the following packages:
- `python==3.11`
- `numpy`
- `pytest`

On a Windows conda prompt or a Mac terminal, you can check whether a package is installed with:
```bash
conda list -n csci1470 <package_name>
```
On Unix systems, you can also check for a package by filtering with `grep`:
```bash
conda list -n csci1470 | grep <package_name>
```
:::danger
Be sure to read this handout in its **entirety before** moving onto implementing **any** part of the assignment!
:::
## Deep Learning Libraries
Deep learning is a very complicated and mathematically rich subject. However, when building models, all of these nuances can be abstracted away from the programmer through the use of deep learning libraries.
In this assignment you will be writing your own deep learning library, 🐻 Beras 🐻. You'll build everything you need to train a model on the MNIST dataset. The MNIST data contains 60k 28x28 black-and-white handwritten digits; your model's job will be to classify which digit is in each image.
:::danger
Please keep in mind you are _not_ allowed to use ___any___ __Tensorflow, Keras, or PyTorch functions throughout HW3__ (other than your testing files). The autograder will intentionally not execute if you import these libraries.
:::
You are already familiar with **PyTorch** from our first assignment. Now your job will be to build your own version of it: **BERAS**.
## Implementation Roadmap
### Before you Begin: Our Recommended Gameplan
**CORE IDEA: Read First, Code Second!** Before diving into any implementation, follow these steps to help make this assignment more digestible!
1. **Read this entire document from start to finish**
- This is a dense, long document with many different sections. You should take a minute to walk through the entire document and inspect each of the sections to familiarize yourself. You will likely be confused, but that is the whole point!
2. **Explore the provided stencil code**
- Go through the repository and inspect the file structure and breakdown. You will see that some files, like `core.py`, contain a lot of stencil code, while many others are filled with TODOs for you!
3. **Study the [companion sheet](https://hackmd.io/@browndls26/Sk0dIN-Pbe) thoroughly**
- This document is your primary reference for understanding the stencil code and implementation details. The companion sheet explains how the different constructs you are required to work with function together!
4. **Refer back to the [companion sheet](https://hackmd.io/@browndls26/Sk0dIN-Pbe) frequently**
- Much of the stencil code's patterns and helper functions are explained in here. If you are confused about how the provided code works, refer back to the document!
### Implementation Tasks
Don't worry if these tasks seem daunting at first glance! We've included a lot more info down below on specific implementation details. The companion sheet is your manual for assembling your neural network framework.
1. Start with implementing **`preprocess.py`** to load and clean your data, and **`beras/onehot.py`** to get to know the dimensions of the data better. [Specifics](#1-preprocesspy)
2. Now fill in part of **`beras/core.py`** which will create some of the basic building blocks for the assignment. [Specifics](#3-berascorepy)
- This is where the companion sheet comes in really handy!
3. Move on to **`beras/layers.py`** to construct your own `Linear` layer. [Specifics](#4-beraslayerspy)
4. Now complete **`beras/activations.py`** [Specifics](#5-berasactivationspy)
5. Continue with **`beras/losses.py`** to write **CategoricalCrossEntropy**. [Specifics](#6-beraslossespy)
6. Next write **CategoricalAccuracy** in **`beras/metrics.py`**. [Specifics](#7-berasmetricspy)
7. Fill in the optimizer classes in **`beras/optimizer.py`**. [Specifics](#8-berasoptimizerpy)
8. Implement **Diffable.backward()** in **`beras/core.py`**. [Specifics](#9-berasbackwards)
:::danger
**Diffable.backward()** is known to be tricky, so budget some extra time to implement it. **Refer to the [companion sheet](https://hackmd.io/@browndls26/Sk0dIN-Pbe) for a detailed explanation about how backward propagation works!**
:::
9. Construct the **Model** class in **`beras/model.py`**. [Specifics](#10-berasmodelpy)
10. Finally, write **`assignment.py`** to train a model on the MNIST Dataset! [Specifics](#11-assignmentpy)
### Timeline Suggestion
:::success
__Note:__ You have 2 weeks to complete this assignment in full and we recommend splitting the tasks into 2 big sections.
__Week 1 (Sections 1-7):__ Build the foundations of BERAS
- Sections 1-5 will likely be the most code heavy, but primarily involve implementing functions we've covered in class. A strong conceptual understanding is your key to success on this assignment.
- Sections 6-7 are the lightest portion of the assignment and should flow smoothly if you understand the code from parts 1-5.
- **Pro tip:** Keep the [companion sheet](https://hackmd.io/@browndls26/Sk0dIN-Pbe) open while working - it explains most of the stencil code patterns you'll encounter
__Week 2 (Section 8-11):__ Integration
- Section 8 is *almost* as easy as copying your code from Assignment 2 into `beras/optimizer.py`
- Section 9 (Backward Propagation) is conceptually challenging and, on average, is **the hardest section for students**. This section will take about **25-30% of the total assignment time**.
- *TIP: The [companion sheet](https://hackmd.io/@browndls26/Sk0dIN-Pbe) breaks down backward propagation!*
- Sections 10-11 involve piecing together your previous work so if you've built solid foundations, these should be relatively straightforward
:::
Gaurav (a former TA) put together this nice graphic to visualize the roadmap and how it all fits together. It's helpful to refer to as you go through the assignment!

*Thanks Gaurav!*
:::success
**HERE IS THE BERAS COMPANION SHEET AGAIN: {%preview https://hackmd.io/@browndls26/Sk0dIN-Pbe %}**
:::
:::warning
**[QUICK ASIDE: TESTING INCREMENTALLY]** You will notice the `test_runner.py` file and `tests/` directory in your cloned repository. These are helpful unit tests we have provided to you in order to help you ensure you are on track with the assignment.
Each section will detail which testing files are available to you and how to run them. For example, the tests for `beras/layers.py`, `beras/activations.py`, and `beras/losses.py` are **all** contained within the `tests/test_beras.py` file!
However, we have only provided you with the **minimal** set of tests, with some not fully implemented. You are responsible for implementing more tests as you see fit!
*Before you ask, we are not grading your tests. These are simply there for you since you are limited on the number of gradescope submissions you have!*
Use these tests to check your logic, but also supplement your testing with our Gradescope autograder. Your score on Gradescope will be your grade on the assignment in most cases, though we reserve the right to manually review code and adjust points.
Until 2/19/2026, you'll have unlimited submissions! After 2/19/2026, you are limited to 15 submissions on Gradescope. Start early :)
:::
## 1. `preprocess.py`
In this section you will fill out the `load_and_preprocess_data()` function that will load in, flatten, normalize and convert all the data into `Tensor`s.
:::info
__Task 1.1 [load_and_preprocess_data()]:__ We provide the code to load in the data, your job is to
1. Normalize the values so that they are between 0 and 1
2. Flatten the arrays such that they are of shape `(number of examples, 28*28)`.
3. Convert the arrays to `Tensor`s and return the train inputs, train labels, test inputs and test labels __in that order__.
4. You should NOT shuffle the data in this method or apply any transformations other than those described in steps 1-3. Importantly, you should **NOT** return the labels one-hot encoded; you'll create those when training and testing. (A short sketch of this pattern follows this block.)
:::
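For orientation, here is a minimal sketch of the normalize/flatten pattern, assuming the raw images arrive as a `(number of examples, 28, 28)` array of 0-255 pixel values. The helper name is made up, and wrapping the result in a `Tensor` should follow the stencil's own constructor:
```python
import numpy as np

def _normalize_and_flatten(images: np.ndarray) -> np.ndarray:
    """Hypothetical helper: scale 0-255 pixels to [0, 1] and flatten each 28x28 image."""
    images = images.astype(np.float32) / 255.0          # values now in [0, 1]
    return images.reshape(images.shape[0], 28 * 28)     # (number of examples, 784)

# Hypothetical usage inside load_and_preprocess_data():
#   train_inputs = Tensor(_normalize_and_flatten(raw_train_images))
#   (labels stay as plain integer Tensors -- no one-hot encoding here)
```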
:::warning
__Task 1.2 [Testing]:__ You can now run the preprocess tests provided in the `tests/test_data.py` file to test your implementation. In order to run the tests, first make sure you are in the root directory for the assignment. Then, run the following command:
```bash
python tests/test_data.py --test=preprocess
```
This will run our prewritten tests for your implementation and print out the results in the terminal.
***Note:** These tests do not entirely guarantee your implementation is perfect, but if you pass them, you should be on the right track! **You are encouraged to write more tests in this file, but make sure they begin with `test_`!***
:::
:::danger
**You may run into errors because other portions of the code are not implemented. You can either wait until you reach those sections, or fill in the missing values with null/filler data.**
:::
## 2. beras/onehot.py
`onehot.py` only contains the `OneHotEncoder` class, which is where you will code a one-hot encoder to use on the labels when you train and test later in the assignment. Recall that a one-hot encoder transforms a given value into a vector with all entries being 0 except one with a value of 1 (hence "one hot"). This is used often when we have multiple discrete classes, like the digits in the MNIST dataset.
:::info
__Task 2.1 [OneHotEncoder.fit]:__ In `OneHotEncoder.fit` you will take in a 1d vector of labels, and you should construct a dictionary that maps each unique label to a one-hot vector. This method doesn't return anything.
__Note__: you should only associate a one-hot vector with labels that are actually present in the input labels!
:::
:::success
__Hint:__ `np.unique` and `np.eye` may be of use here.
:::
:::info
__Task 2.2 [OneHotEncoder.forward]:__ Fill in the `OneHotEncoder.forward` method to transform the given 1d array `data` into a one-hot-encoded version of the data. This method should return a 2d `np.ndarray`.
:::
:::info
__Task 2.3 [OneHotEncoder.inverse]:__ `OneHotEncoder.inverse` should be an exact inverse of `OneHotEncoder.forward` such that `OneHotEncoder.inverse(OneHotEncoder.forward(data)) = data`.
:::
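To make the hints above concrete, here is how `np.unique`, `np.eye`, and `np.argmax` can fit together on a made-up label vector. This is a standalone sketch, not the stencil's class:
```python
import numpy as np

labels = np.array([3, 1, 3, 0])          # hypothetical 1d label vector
uniques = np.unique(labels)              # sorted unique labels: [0, 1, 3]
eye = np.eye(len(uniques))               # one one-hot row per unique label

# fit: map each unique label to its one-hot row
label_to_onehot = {label: eye[i] for i, label in enumerate(uniques)}

# forward: stack the one-hot rows for each label -> shape (4, 3)
encoded = np.array([label_to_onehot[label] for label in labels])

# inverse: argmax recovers the column index, which maps back to the label
decoded = uniques[np.argmax(encoded, axis=1)]   # array([3, 1, 3, 0])
```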
:::warning
__Task 2.4 [Testing]:__ You can now run the one-hot encoding tests provided in the `tests/test_data.py` file to test your implementation. Again, first make sure you are in the root directory for the assignment. Then, run the following command:
```bash
python tests/test_data.py --test=ohe
```
This will run our prewritten tests for your implementation and print out the results in the terminal.
***Note:** These tests do not entirely guarantee your implementation is perfect, but if you pass them, you should be on the right track! **You are encouraged to write more test functions in this file, but make sure they begin with `test_`!***
:::
## 3. beras/core.py
In this section we are going to prepare some abstract classes we will use for everything else we do in this assignment. This is a very important section since we will build everything else on top of this foundation.
:::info
**Task 3.1 [Tensor]:** We will begin by completing the construction of the `Tensor` class at the top of the file. Note that it subclasses the `np.ndarray` datatype; you can find out more about what that means <ins>[here](https://numpy.org/doc/stable/user/basics.subclassing.html)</ins>.
**The only TODO is to pass in the data to the `a` kwarg in `np.asarray(a=???)` in the `__new__` method.**
:::
:::warning
__You should know:__ You'll notice the `Tensor` class is nothing more than a standard `np.ndarray` but with additional attributes for gradient tracking: `.requires_grad`, `.grad`, and `.backward()`. A `Tensor` being "trainable" is equivalent to saying that the tensor requires a gradient.
:::
:::info
**Task 3.2 [Callable]:** There are no TODOs in `Callable` but it is important to familiarize yourself with this class. `Callable` simply allows its subclasses to use `self()` and `self.forward()` interchangeably. More importantly, if a class subclasses `Callable` it **will** have a `forward` method that returns a `Tensor`. We **can and will** use these subclasses when **constructing layers and models** later.
:::
:::info
**Task 3.3 [Weighted]:** There are 4 methods in `Weighted` for you to fill out: `trainable_variables`, `non_trainable_variables`, `trainable`, `trainable (setter)`. Each method has a description and return type (if needed) in the stencil code. Be sure to follow the typing **exactly** or it's unlikely to pass the autograder.
**HINT: parameters have a `requires_grad` attribute**
:::
:::success
__Note:__ If you need a refresher on Python attributes and properties, you can refer to <ins>[this](https://realpython.com/python-getter-setter/)</ins> helpful guide.
:::
:::info
**Task 3.4 [Diffable.\_\_call__]:** There are no coding TODOs in `Diffable.__call__`, but it is **critical** that you spend some time familiarizing yourself with what it is doing. Understanding this method will help clear up later parts of the assignment.
:::
:::warning
__You should know:__ Recall that in Python, `generic_class_name()` is equivalent to `generic_class_name.__call__()`. Note that `Diffable` implements the `__call__` method and __not__ the `forward` method.
When we subclass `Diffable`, for example with `Linear`, we __will__ implement `forward` there. Then, when we use something like `linear_layer(inputs)`, the `__call__` method handles tracking inputs and outputs and sets up the `.backward()` method on the output tensor. If you use `linear_layer.forward(inputs)` directly, __it will not set up the backward propagation chain__ because `forward` doesn't handle the necessary bookkeeping.
:::
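A tiny hypothetical usage sketch of the difference; the layer and input names below are made up:
```python
# Assuming `linear_layer` is a Diffable (e.g., your Linear) and `x` is an input Tensor:
#
#   out = linear_layer(x)          # Diffable.__call__: runs forward AND records the
#                                  # inputs/outputs needed to set up out.backward()
#   raw = linear_layer.forward(x)  # forward pass only -- no bookkeeping, so the
#                                  # backpropagation chain is NOT set up through `raw`
```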
:::warning
**`compose_input_gradients` and `compose_weight_gradients`**
These methods are responsible for composing the downstream gradient during backpropagation, using the upstream gradient `J` and the local input and weight gradients of a `Diffable`. These methods are defined in more detail in the companion sheet!
:::
## 4. beras/layers.py
:::warning
__[BERAS Testing] (applies for section 4, 5, 6, and 7)__
We have provided you with a test file, `tests/test_beras.py`, which contains a test suite for your Linear layer, activation functions, and loss functions. This test suite has been set up so you can progressively run the tests as you implement each of the components/functions, simply by calling the tests by name. You can run the first test after implementing the `Linear` layer.
Please note that we do not provide you with tests for all of the classes/functions. For example, we provide you with the test case for `LeakyReLU`, but do not provide the cases for `Sigmoid` or `Softmax`. You should be able to write your own tests for the remaining activation functions using the provided case as a template. We **highly recommend** you take the time to write these extra tests, and as many more as needed, before submitting to the autograder.
In order to run the tests, use the following commands:
```bash
python tests/test_beras.py --list # will list out the available tests
python tests/test_beras.py --test=<test name>
```
***Note:** These tests do not entirely guarantee your implementation is perfect, but if you pass them, you should be on the right track! **You are encouraged to write more tests in this file, but make sure they begin with `test_`!***
:::
In this section we need to fill out the methods for `Linear`. We give you `__init__` and `weights`; you should read through both of these one-liners to know what they are doing. Please don't change these, since the autograder relies on the naming conventions. Your task will be to implement the rest of the methods we need.
:::info
__Task 4.1 [Linear.forward]:__ To begin, fill in the `Linear.forward` method. The parameter `x` represents our input. Remember from class that our Linear layer performs the following to get its output:
$$
f(\mathbf{x}) = \mathbf{x}\mathbf{W} + \mathbf{b}
$$
Keep in mind that `x` has shape `(num_samples, input_size)`.
:::
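As a quick shape sanity check in plain NumPy (the layer sizes below are arbitrary), note how the bias broadcasts across the batch dimension:
```python
import numpy as np

num_samples, input_size, output_size = 4, 784, 10   # arbitrary example sizes
x = np.random.rand(num_samples, input_size)
W = np.random.rand(input_size, output_size)
b = np.zeros(output_size)

out = x @ W + b   # (4, 784) @ (784, 10) -> (4, 10); b broadcasts over the batch
assert out.shape == (num_samples, output_size)
```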
:::info
__Task 4.2 [Linear.get_input_gradients]:__ Refer to the formula you wrote in `Linear.forward` to compute $\frac{\partial f}{\partial x}$. Be sure to return the gradient `Tensor` __as a list__; this will come in handy when you write backpropagation (`Diffable.backward()`) later in this assignment.
:::
:::success
__Note:__ For each `Diffable` you can access the inputs of the forward method with `self.inputs`
:::
:::info
__Task 4.3 [Linear.get_weight_gradients]:__ Compute both $\frac{\partial f}{\partial w}$ and $\frac{\partial f}{\partial b}$ and return both `Tensor`s __in a list__, like you did in `Linear.get_input_gradients`.
HINT: The shape of your weight gradients should report the results for how each weight changes, for each example in the batch. How many dimensions should your matrix have?
:::
:::info
__Task 4.4 [Linear.\_initialize_weight]:__ Initialize the linear layer's weight values. By default, return 0 for all weights (usually a bad idea). You are also required to support the following more sophisticated options:
- **Normal:** Passing `normal` causes the weights to be initialized with a unit normal distribution $\mathcal{N}(0,1)$.
- **Xavier Normal:** Passing `xavier` causes the weights to be initialized in the same way as `tf.keras.initializers.GlorotNormal`.
- **Kaiming He Normal:** Passing `kaiming` causes the weights to be initialized in the same way as `tf.keras.initializers.HeNormal`.
Explicit definitions for each of these initializers can be found **[in the tensorflow docs](https://www.tensorflow.org/api_docs/python/tf/keras/initializers)**
:::
:::warning
**Notes:**
1. `_initialize_weight` __returns__ the weights and biases and does not set the weight attributes directly.
2. Your weights should be `Variable` **not** `Tensor`. This is so that we can use the `.assign` method in your optimizers.
:::
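For reference, `GlorotNormal` uses a standard deviation of `sqrt(2 / (fan_in + fan_out))` and `HeNormal` uses `sqrt(2 / fan_in)`. Here is a NumPy sketch, assuming `input_size` and `output_size` are the layer's fan-in and fan-out; note that Keras actually samples from a *truncated* normal for these two, so treat this as an approximation rather than the stencil's exact recipe (and remember the stencil expects the returned weights wrapped as `Variable`s):
```python
import numpy as np

def sample_weights(initializer: str, input_size: int, output_size: int) -> np.ndarray:
    """Sketch of the three non-zero initializers; biases are typically left at zero."""
    shape = (input_size, output_size)
    if initializer == "normal":
        return np.random.normal(0.0, 1.0, shape)                # unit normal N(0, 1)
    if initializer == "xavier":
        stddev = np.sqrt(2.0 / (input_size + output_size))      # GlorotNormal stddev
        return np.random.normal(0.0, stddev, shape)
    if initializer == "kaiming":
        stddev = np.sqrt(2.0 / input_size)                      # HeNormal stddev
        return np.random.normal(0.0, stddev, shape)
    return np.zeros(shape)                                      # default: all zeros
```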
:::warning
__Task 4.5 [TESTING]:__ You can now run our first test, `test_dense_forward()`, to check that you set up your forward pass correctly. **We do not provide tests for all methods.** You can run the test as follows:
```bash
python tests/test_beras.py --test=dense
```
For the following sections, simply replace `dense` with the name of the test you want to run (pre-written or custom).
***Note:** These tests do not entirely guarantee your implementation is perfect, but if you pass them, you should be on the right track! **You are encouraged to write more tests in this file, but make sure they begin with `test_`!***
:::
## 5. beras/activations.py
Here, we will implement a couple of activation functions that we will use when constructing our model. Here is some helpful background reading on [activation functions](https://medium.com/analytics-vidhya/activation-functions-all-you-need-to-know-355a850d025e).
:::info
__Task 5.1 [LeakyReLU]:__ Fill out the forward pass and input gradients computation for `LeakyReLU`. You'll notice these are the same methods you implemented in `layers.py`; this is by design.
:::
:::success
__Hint:__ LeakyReLU is not differentiable everywhere, so when computing the gradient, consider the positive and negative cases separately.
Note: Though LeakyReLU is technically not differentiable at $0$ exactly, we can just leave the gradient as $0$ for any $0$ input.
:::
:::info
__Task 5.2 [Sigmoid]:__ Complete the `forward` and `get_input_gradients` methods for `Sigmoid`.
:::
:::info
__Task 5.3 [Softmax]:__ Write the forward pass and gradient computation w.r.t inputs for Softmax.
:::
:::success
__Hints:__
You should use the stable softmax to prevent overflow and underflow issues. Details are in the stencil (and see the sketch after this block).
Combining `np.outer` and `np.fill_diagonal` will significantly clean up the gradient computation.
The input gradients are trickier than they first appear. This [medium article](https://medium.com/towards-data-science/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1) has a fantastic derivation that will make your life a lot easier.
:::
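The stability trick is simply to shift each row by its maximum before exponentiating; the shift cancels in the ratio, so the output is mathematically unchanged, but `np.exp` can no longer overflow. A minimal sketch:
```python
import numpy as np

def stable_softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row max leaves the result unchanged
    (it cancels in the ratio) but keeps np.exp from overflowing."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# stable_softmax(np.array([[1000.0, 1001.0]]))  # fine; a naive exp(1000) would overflow
```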
## 6. beras/losses.py
In this section we need to construct our loss functions for the assignment, `MeanSquaredError` and `CategoricalCrossEntropy`. You should note that for most classification tasks we use `CategoricalCrossEntropy` by default, but for this assignment we will use both and compare the results.
:::success
__Note:__ You'll notice we construct a `Loss` class that both `MeanSquaredError` and `CategoricalCrossEntropy` inherit from. This is just so that we don't have to specify that our loss functions don't have weights every time we create one.
:::
:::info
__Task 6.1 [MeanSquaredError.forward]:__ Implement the forward pass for `MeanSquaredError`. We want `(y_true - y_pred)**2`, and not the other way around. Don't forget that we expect to take in _batches_ of examples at a time, so we will need to take the mean over the batch as well as the mean for each individual example. In short, the output should be the mean of means.
Don't forget that `Tensor`s are a subclass of `np.ndarray`, so we can use numpy methods!
[Mean Squared Error](https://www.geeksforgeeks.org/maths/mean-squared-error/)
:::
:::warning
__You should know:__ In general, loss functions should return **exactly 1 scalar value** no matter how many examples are in the batch. We take the mean loss from the batch examples in most practical cases. We will see later in the course that we can use multiple measures of loss at one time to train a model, in which case we often take a weighted sum of each individual loss as our loss value to backpropagate on.
:::
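Concretely, with matching `(batch, features)` shapes, taking the mean per example and then over the batch gives the same scalar as one overall mean (made-up numbers):
```python
import numpy as np

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])   # hypothetical batch of 2 examples
y_pred = np.array([[0.8, 0.1], [0.3, 0.6]])

per_example = np.mean((y_true - y_pred) ** 2, axis=-1)  # one value per example
batch_loss = np.mean(per_example)                       # scalar: mean of means
assert np.isclose(batch_loss, np.mean((y_true - y_pred) ** 2))
```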
:::info
__Task 6.2 [MeanSquaredError.get_input_gradients]:__ Just as we did for our linear layer, compute the gradient with respect to inputs in `MeanSquaredError`. It's important to remember that there are two inputs, `y_pred` and `y_true`. Since `y_true` comes from our dataset and does not depend on our params, you should treat it like a constant vector. On the other hand, compute the gradient with respect to `y_pred` exactly as you did in `Linear`. Remember to return them both as a list!
:::
:::success
__Hint:__ If you aren't quite sure how to access your inputs, remember that `MeanSquaredError` is a `Diffable`!
:::
:::info
__Task 6.3 [CategoricalCrossEntropy.forward]:__ Implement the forward pass of `CategoricalCrossEntropy`. Make sure to return the average of the per-sample CCE losses! You may run into trouble with values very close to 0 or 1; you may find `np.clip` of use...
Here is some helpful reading on understanding categorical cross-entropy loss [Categorical Cross Entropy Loss](https://www.geeksforgeeks.org/deep-learning/categorical-cross-entropy-in-multi-class-classification/)
:::
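A sketch of the clipped computation on hypothetical one-hot targets; the epsilon value here is a common choice, not something mandated by the stencil:
```python
import numpy as np

def cce(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Categorical cross-entropy averaged over the batch, clipping predictions so
    that log(0) never occurs for confident-but-wrong outputs."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=-1)  # one loss per example
    return float(np.mean(per_sample))                       # scalar batch loss
```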
:::info
__Task 6.4 [CategoricalCrossEntropy.get_input_gradients]:__ Get input gradients for `CategoricalCrossEntropy`.
:::
## 7. beras/metrics.py
There isn't much to do in this file: just implement the forward method for `CategoricalAccuracy`.
:::info
__Task 7.1 [CategoricalAccuracy.forward]:__ Fill in the `forward` method. Note that our input `probs` represents the probability of each class as predicted by the model, and `labels` is a one-hot encoded vector representing the true class. Run `test_beras.py` to test your activations, losses, and metrics!
:::
:::success
__Hint:__ It may be helpful to also think of the labels as a probability distribution, where the probability of the true class is 1 and that of every other class is 0.
If the index of the max value in both vectors is the same, then our model has made the correct classification.
:::
:::warning
__Task 7.2 [Testing]:__ You now should be able to run all of the provided tests in `test_beras.py`. In order to run all of the tests, you should first make sure you are in the root directory for the assignment. Then run the following command
```bash
python test_runner.py --category beras # runs all tests in tests/test_beras.py
python tests/test_beras.py --all # this runs the same command
```
*Note: These tests do not entirely guarantee your implementation is perfect, but if you pass them, you should be on the right track!*
:::
## 8. beras/optimizer.py
In `beras/optimizer.py` there are 3 optimizers we'd like you to implement: `BasicOptimizer`, `RMSProp`, and `Adam`. In practice, `Adam` is tough to beat so more often than not you will default to using `Adam`.
Each optimizer has an `__init__` and two key methods: `zero_grad()` and `step()`. We give you the `__init__` for each optimizer which contains all the hyperparams and variables you will need for each algorithm.
In the `step()` method, you will write the algorithm to update each parameter in `self.parameters` using its `.grad` attribute. Each parameter `param` will have its gradient stored in `param.grad` after `.backward()` is called.
:::success
**Hint:** In Assignment 2, you wrote all these optimization algorithms already. Feel free to reuse your code for these tasks.
:::
:::warning
__Warning:__ You can update parameters in-place using `-=` or `+=` operators. For example: `param -= learning_rate * param.grad`
:::
:::info
__Task 8.1 [BasicOptimizer.step]:__ Write the `step` method for the `BasicOptimizer`.
This method should iterate over `self.parameters` and update each parameter using its gradient. For any given parameter $w$ with gradient $\frac{\partial \mathcal{L}}{\partial w}$ stored in `w.grad`, and `learning_rate` $r$, the optimization formula is:
$$w = w - r \cdot \frac{\partial \mathcal{L}}{\partial w}$$
Remember to check if `param.requires_grad` is `True` before updating!
:::
:::info
__Task 8.2 [RMSProp.step]:__ Write the `step` method for `RMSProp`.
In `RMSProp` there are two new hyperparams, $\beta$ and $\epsilon$.
$\beta$ is referred to as the __decay rate__ and typically defaults to 0.9. This decay rate has the effect of _lowering the effective learning rate as the model trains_. Intuitively, as our loss decreases we are closer to a minimum and should take smaller steps to ensure we don't overshoot it.
$\epsilon$ is a small constant to prevent division by 0.
In addition to our hyperparams there is another term, which we will call __v__, that acts as the moving average of the squared gradients __for each param__. We update this value, in addition to the parameters, every time we call `step()`.
For any given parameter $w$ with gradient stored in `w.grad`, and `learning_rate` $r$, the update is defined by:
$$v = \beta \cdot v + (1-\beta) \cdot \left(\frac{\partial \mathcal{L}}{\partial w}\right)^2$$
$$w = w - \frac{r}{\sqrt{v} + \epsilon} \cdot \frac{\partial \mathcal{L}}{\partial w}$$
**Hint**: In our stencil code, we provide **v** as a list with one entry per parameter. You can index into it using the parameter's position in `self.parameters`.
:::
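A generic sketch of the bookkeeping pattern (not the stencil's exact class): `v` is a list parallel to the parameter list, and each step applies the two equations above in place. The default hyperparameter values below are assumptions:
```python
import numpy as np

def rmsprop_step(parameters, v, learning_rate=0.01, beta=0.9, epsilon=1e-7):
    """Sketch: `parameters` is a list of array-like params with a .grad attribute;
    `v` is a parallel list holding the running average of squared gradients."""
    for i, param in enumerate(parameters):
        if not getattr(param, "requires_grad", True):   # skip frozen parameters
            continue
        grad = param.grad
        v[i] = beta * v[i] + (1 - beta) * grad ** 2                   # moving average
        param -= (learning_rate / (np.sqrt(v[i]) + epsilon)) * grad   # in-place update
```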
:::info
__Task 8.3 [Adam.step]:__ Write the `step` method for `Adam`.
At its core, Adam is similar to `RMSProp` but it has more smoothing terms and computes an additional _momentum_ term to further balance the learning rate as we train. This momentum term has its own decay term, $\beta_1$. Additionally, `Adam` keeps track of the number of optimization steps performed (stored in `self.t`) to further tweak the effective learning rate.
Here is what an optimization step with `Adam` looks like for parameter $w$ with gradient in `w.grad`, and `learning_rate` $r$:
$$m = \beta_1 \cdot m + (1-\beta_1) \cdot \frac{\partial \mathcal{L}}{\partial w}$$
$$v = \beta_2 \cdot v + (1-\beta_2) \cdot \left(\frac{\partial \mathcal{L}}{\partial w}\right)^2$$
$$\hat{m} = \frac{m}{1-\beta_1^t}$$
$$\hat{v} = \frac{v}{1-\beta_2^t}$$
$$w = w - \frac{r \cdot \hat{m}}{\sqrt{\hat{v}}+\epsilon}$$
Note: Don't forget to __increment the time step__ `self.t` once when `step()` is called!
:::
:::success
__Hint:__ Don't overcomplicate this section, it really is as simple as programming the algorithms as they are written.
:::
## 9. beras/backwards
In `beras/core.py`, you will implement the `Diffable.backward()` method, which is the heart of automatic differentiation in BERAS. This method enables the recursive backpropagation of gradients through your neural network.
:::danger
__Warning:__ This section has historically been conceptually difficult for students. It's helpful to carefully consider our hints and the conceptual ideas behind backward propagation __before__ beginning your implementation.
You should refer to the companion sheet to get a detailed explanation of the backward propagation algorithm and the helper functions. This is available below:
{%preview https://hackmd.io/@browndls26/Sk0dIN-Pbe %}
If you get stuck, feel free to come back to this section later on.
:::
### Understanding Backward Propagation
When you call `loss.backward()`, it triggers a recursive process that propagates gradients backward through the computational graph. Each `Diffable`'s `backward()` method simply:
1. **Composes** the upstream gradient with local gradients
2. **Populates** the `.grad` attribute
3. **Recursively calls** `.backward()` to continue the chain
### Helper Methods You'll Use
The `Diffable` class provides two key helper methods that you'll call within `backward()`: `compose_input_gradients(upstream_grad)` and `compose_weight_gradients(upstream_grad)`. More information about them can be found in the companion sheet linked above. These methods handle the mathematical details of gradient composition for you; your job is to orchestrate when and how to use them.
:::info
__Task 9.1 [Diffable.backward]:__ Implement the `backward` method to compute the gradient of the loss with respect to each of the trainable params. This method should NOT return anything, but instead populate the `.grad` attribute for each of the trainable params and continue the chain of backpropagation.
:::
:::warning
__You should know:__ When you finish this section you will have written a highly generalized backward propagation method that could handle an arbitrary network. This method functions similarly to how PyTorch implements automatic differentiation. This is a very powerful method but it is just one way to implement autograd.
:::
:::warning
__Task 9.2 [TESTING]:__ We have written some tests for you to determine if your backward propagation is working in `tests/test_gradient.py`. You can run them as:
```bash
python test_runner.py --category gradient # runs gradient/backward tests
python tests/test_gradient.py --all # same output, alternate command
```
***Note:** These tests do not entirely guarantee your implementation is perfect, but if you pass them, you should be on the right track! **You are encouraged to write more tests in this file, but make sure they begin with `test_`!***
:::
## 10. beras/model.py
In `beras/model.py` we are going to construct a general `Model` abstract class that we will use to define our `Sequential` model. The `Sequential` model simply calls all of its layers in order for the forward pass.
:::warning
__You should know:__ At first it may seem like all neural nets would be `Sequential` models but there are some architectures like __ResNets__ that break the sequential assumption.
:::
:::info
__Task 10.1 [Model.parameters]:__ Construct a list of all parameters in the model by iterating through the layers and collecting their parameters. Return this list.
:::
:::success
We give you `Model.compile`, which just sets the optimizer, loss and accuracy attributes in the model. In Keras, compile is a huge method that prepares these components to make them hyper-efficient. That implementation is highly technical and outside the scope of the course but feel free to look into it if you are interested.
:::
:::info
__Task 10.2 [Model.fit]:__ This method should train the model for the given number of `epochs` on the data `x` and `y`, using batches of size `batch_size`. Importantly, you want to make sure you record the metrics throughout training and print stats during training so that you can watch the metrics as the model trains.
You can use the `print_stats` and `update_metric_dict` functions provided. Note that neither of these methods return any values, `print_stats` prints out the values directly and `update_metric_dict(super_dict, sub_dict)` updates `super_dict` with the mean metrics from `sub_dict`.
You can ignore the `wandb_run` argument for now!
__Note:__ You do __not__ need to call the model here, you should instead use `self.batch_step(...)` which all child classes of `Model` will implement.
:::
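One common way to carve the data into batches is a simple slicing generator. The commented usage below assumes `batch_step` accepts a `training` keyword as described in Task 10.5, so adapt it to whatever the stencil actually defines:
```python
import numpy as np

def iterate_batches(x: np.ndarray, y: np.ndarray, batch_size: int):
    """Yield consecutive (x_batch, y_batch) slices; the last batch may be smaller."""
    for start in range(0, x.shape[0], batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

# Hypothetical use inside Model.fit (exact signatures come from the stencil):
#   for x_batch, y_batch in iterate_batches(x, y, batch_size):
#       batch_metrics = self.batch_step(x_batch, y_batch, training=True)
#       update_metric_dict(epoch_metrics, batch_metrics)
```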
:::info
__Task 10.3 [Model.evaluate]:__ This method should look _very similar_ to `Model.fit` except we need to ensure the model does not train on the testing data. Additionally, we will test on the entirety of the test set one time, so there is no need for the epochs parameter from `Model.fit`.
Again, you can ignore the `wandb_run` argument for now.
:::
:::info
__Task 10.4 [Sequential.forward]:__ This method passes the input through each layer in `self.layers` sequentially.
:::
:::info
__Task 10.5 [Sequential.batch_step]:__ This method makes a model prediction and computes the loss, just like you did in HW2. Be sure to use the `training` argument so that the weights of the model are adjusted only when `training` is True. This method should return the _loss and accuracy_ for the batch in a dictionary.
:::
## 11. `assignment.py`
For this assignment (and future ones), we'll be using a popular experiment tracking platform called [Weights and Biases](https://wandb.ai/site) (abbreviated wandb or w&b). Wandb provides a logging interface that allows us to log stats locally while our model is training and view plots and figures for these stats online in real time. You should at least log average training and validation loss periodically, but you may find it useful to log other metrics, such as accuracy, as well. Look at the Weights and Biases [documentation](https://docs.wandb.ai/ref/python/data-types/) for information on logging different types of data.
:::info
**Task 11.0 [WandB Setup]:**
1. Run the following command which will interact with the wandb API in order to create your account and authenticate your API key
```bash
python utils/wandb_login.py
```
2. Select option "(1) Create a W&B account", which will lead you [here](https://wandb.ai/authorize?signup=true), where you will be provided with your API key.
- When prompted whether creating a "Professional" or "Educational" account, you should choose "educational" and input "Brown University"!
- This is not super important, but Professional accounts expire after some time and require you to pay :cry:
3. Copy your API key and paste it at the specified prompt in the terminal, and you are all set :)
4. You should also log in to wandb in a web browser to check out the GUI!

Once you are logged in, you shouldn't have to re-authenticate again. We included the wandb library in the conda environment, but if you don't have it, you can install it by running `pip install wandb`.
:::
:::warning
**Using WandB in your code (Optional):**
- In `assignment.py`, uncomment the WandB initialization lines:
```python
run = wandb.init(entity="your-username", project=f"Beras", name=f"Beras Test")
```
- Replace `"your-username"` with your WandB username
- Call `wandb_run.log(...)` in your `model.fit` and `model.evaluate` methods!
- Your training metrics will automatically be logged and visualized at wandb.ai
:::
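If you do opt in, each `log` call is just a dictionary of metrics for the current step. A hypothetical call inside your training loop; the metric names and variables are up to you:
```python
# Hypothetical logging call -- `run` comes from wandb.init(...) above (or is None):
#
#   if run is not None:
#       run.log({"epoch": epoch, "train_loss": float(loss), "train_acc": float(acc)})
```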
Here you put it all together!
:::info
__Task 11.1 [Create, Train and Test your model]:__ You'll find 4 mostly empty functions in `assignment.py`: `get_model`, `get_optimizer`, `get_loss_fn`, and `get_acc_fn`. You should fill these out to create your model however you'd like; you may have to play with different Linear layers, activations, optimizers, etc. to get better accuracies.
Once you have those 4 methods filled out, you can fill out the `__main__` block to train the model. The steps are outlined in the stencil.
**Note:** WandB is completely optional. If you don't want to use it, simply leave the WandB lines commented out and set `run = None`.
You are looking for an accuracy >= 95% within 10 epochs, consistently. **Once you are ready**, you can submit your code to the autograder with an additional `FINAL.txt` file, which tells the autograder that you'd like to train and test for score on the autograder. __The autograder will use your `get_model`, `get_loss_fn` and `get_optimizer` to initialize your model on Gradescope.__ This will take some time on Gradescope, so expect __~10 minutes__ of waiting time. That said, if you are consistently able to reach the target accuracy locally and are passing all other tests, you should have no trouble on Gradescope.
:::
:::success
__Hint:__ You likely don't need as large a network as you might expect; the right network parameters, activations, and optimizer should be able to reach the target accuracy within 3-5 epochs with only a couple of layers. If you find that your training is taking a long time or that you need many large layers to get good accuracy, you can probably tweak the network in small ways to improve it more quickly.
If you are having trouble reaching accuracy, go to office hours and talk to the TAs about strategies for changing out your hyperparameters.
:::
:::warning
__Task 11.2 [TESTING]:__ We have written some tests for you to determine if your model and assignment are set up properly. You can run them as
```bash
python test_runner.py --category assignment # runs assignment/model tests
python tests/test_assignment.py --all # same output, alternate command
```
***Note:** These tests do not entirely guarantee your implementation is perfect, but if you pass them, you should be on the right track! **You are encouraged to write more tests in this file, but make sure they begin with `test_`!***
:::
:::warning
__Task 11.3 [TESTING]:__ If you have passed all of your previous tests and have implemented all of the code, you can run the command
```bash
python test_runner.py [--v] # runs all of the tests!
```
- `--v` prints out a more verbose output from the test runner

This will run all of the tests within the testing files whose names begin with the `test_` prefix.
:::
## Submission
:::danger
**[REMINDER]** After 2/19/2026, you are limited to **15 submissions** on Gradescope! Before then, you have unlimited submissions! Start early :)
:::
### Requirements
Once you've completed your model and are training locally, be sure to include a blank `FINAL.txt` so that the autograder trains and tests your model for accuracy!
### Grading
Your code will be primarily graded on functionality, as determined by the Gradescope autograder.
:::warning
You will not receive any credit for functions that use `tensorflow`, `keras`, `torch`, or `scikit-learn` functions within them. You must implement all functions manually using either vanilla Python or NumPy. (This does not apply to the testing files!)
:::
### Handing In
You should submit the assignment via Gradescope under the corresponding project assignment, either through GitHub or by submitting all files individually.
To submit via Github, commit and push all of your changes to your repository to GitHub. You can do this by running the following commands.
```bash
git commit -am "commit message"
git push
```
For those of y'all who are already familiar with `git`: the `-am` flag to `git commit` is a pretty cool shortcut which adds all of our modified files and commits them with a commit message.
**Note:** We highly recommend committing your files to git and syncing with your GitHub repository **often** throughout the course of the assignment to ensure none of your hard work is **lost**!