# Homework 3
## Question 1
### Training a Neural Network on Insurability Dataset
In this part, we were tasked with creating a neural network to classify the insurability dataset. This writeup goes over our design choices and covers the following topics:
1. Data Preprocessing
2. Model Architecture
3. Hyper Parameter Tuning
### 1. Data Preprocessing
#### Converting the dataset into a custom class for use with DataLoader
Since the dataset built from the CSV via the given functions consists of rows of the form [[label], [features]], we had to implement a custom dataset class that can be used with PyTorch's DataLoader.
Here is the implementation of the class:
```python
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Each row is [[label], [features]]
        features, labels = self.data[idx][1], self.data[idx][0]
        features_tensor = torch.tensor(features, dtype=torch.float32)
        labels_tensor = torch.tensor(labels, dtype=torch.long)
        return features_tensor, labels_tensor.squeeze()
```
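With this class in place, the parsed rows can be wrapped and batched for training; here is a usage sketch (the batch size and variable names are illustrative, not our exact configuration):
```python
from torch.utils.data import DataLoader

# Wrap the [[label], [features]] rows and batch them for training
train_dataset = CustomDataset(train_data)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```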
#### Scaling the dataset features
For this dataset, the features (Age, Exercise, and Cigarettes) are on different scales. We anticipated that this difference might skew the final predictions, since a variable could be perceived as more important than it actually is purely because of its inherent scale. We therefore applied <b>Min-Max scaling</b>, implemented manually by tracking the minimum and maximum of each feature and rescaling according to this formula:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Here is the code snippet for our Min-Max scaler function:
```python
def preprocess_data(data):
    # Min-Max scale the three insurance features (Age, Exercise, Cigarettes) in place
    min_x1 = min_x2 = min_x3 = float('inf')
    max_x1 = max_x2 = max_x3 = float('-inf')
    # First pass: find the minimum and maximum of each feature
    for row in data:
        features = row[1]
        min_x1 = min(min_x1, features[0])
        min_x2 = min(min_x2, features[1])
        min_x3 = min(min_x3, features[2])
        max_x1 = max(max_x1, features[0])
        max_x2 = max(max_x2, features[1])
        max_x3 = max(max_x3, features[2])
    # Second pass: rescale each feature to [0, 1]
    for row in data:
        features = row[1]
        features[0] = (features[0] - min_x1) / (max_x1 - min_x1)
        features[1] = (features[1] - min_x2) / (max_x2 - min_x2)
        features[2] = (features[2] - min_x3) / (max_x3 - min_x3)
        row[1] = features
```
### 2. Model Architecture
Since the problem required us to use the architecture outlined in class, we implemented that architecture as directed.
It has the following layers :
1. Input Layer - of size 3
2. Hidden Layer 1 - of size 2 with <b>Sigmoid</b> activation
3. Output Layer - of size 3 with <b>Softmax</b> activation
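As a minimal sketch, this architecture could be written in PyTorch as follows (the class name is illustrative; `softmax` is our custom function shown in the next subsection):
```python
import torch
import torch.nn as nn

class FeedForwardNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(3, 2)  # input layer (3) -> hidden layer (2)
        self.out = nn.Linear(2, 3)     # hidden layer (2) -> output layer (3)

    def forward(self, x):
        x = torch.sigmoid(self.hidden(x))  # sigmoid activation on the hidden layer
        return softmax(self.out(x))        # custom softmax (defined below)
```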
#### Implementation of the Softmax Layer
We implemented our own softmax function for this problem, guided by the standard softmax formula:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
```python
def softmax(x):
    exp_x = torch.exp(x - torch.max(x))  # subtract max for numerical stability
    return exp_x / exp_x.sum(dim=-1, keepdim=True)
```
### 3. Hyper Parameter Tuning
#### Early Stopping Implementation
During our 50-epoch training loop, the model's accuracy regularly peaked at around 80%, but as training continued the model tended to overfit the dataset, so the final accuracy after 50 epochs was almost always lower than the peak accuracy reached during training. We therefore implemented an accuracy-based early-stopping method: the training loop is cut off once a desired validation accuracy (<b>84%</b> in our case) is reached. Below is the snippet from our training loop that achieves this.
```python
for t in range(epochs):
    train_loss.append(trainModel(train_loader, feedForwardNN, loss, optimizer, device))
    testData = testModel(valid_loader, feedForwardNN, loss)
    validation_loss.append(testData[0])
    # Cut off training once validation accuracy reaches the 84% threshold
    if early_stopping and testData[1] >= 84:
        break
```
After implementing early stopping, we were able to test the effect of Early Stopping + Preprocessing via the code output attached below:
```
For Insurable Data
Test Loss: 0.7799670535326004
Test Accuracy: 79.5
Preprocessing: True
Early Stopping: True
For Insurable Data
Test Loss: 0.7679225534200669
Test Accuracy: 78.5
Preprocessing: False
Early Stopping: True
For Insurable Data
Test Loss: 0.7604688236117363
Test Accuracy: 80.5
Preprocessing: True
Early Stopping: False
For Insurable Data
Test Loss: 0.9732667517662048
Test Accuracy: 60.0
Preprocessing: False
Early Stopping: False
```
#### Learning Rate Tuning
**a. Learning rate is 0.3**
Learning curve:
Confusion matrix:
Final accuracy on the test dataset:
**b. Learning rate is 0.1**
Learning curve:
Confusion matrix:
Final accuracy on the test dataset:
**c. Learning rate is 0.03**
Learning curve:
Confusion matrix:
Final accuracy on the test dataset:
**d. Learning rate is 0.01**
Learning curve:
Confusion matrix:
Final accuracy on the test dataset:
**e. Learning rate is 0.003**
Learning curve:
Confusion matrix:
Final accuracy on the test dataset:
**f. Learning rate is 0.001**
Learning curve:
Confusion matrix:
Final accuracy on the test dataset:
### Learning Rate Decay Tuning
In the implemented neural network training code, the Stochastic Gradient Descent (SGD) optimizer is used in conjunction with a `StepLR` learning rate scheduler from PyTorch's optimization module. SGD is a widely used optimizer that updates model weights iteratively based on the gradient of the loss function.
The `StepLR` scheduler enhances the SGD optimizer by methodically reducing the learning rate by a factor of 0.1 every 10 epochs. This decay strategy is crucial for fine-tuning the model as it approaches the optimal solution, preventing overshooting and promoting convergence.
The implementation involves initializing the SGD optimizer with a learning rate of 0.1 and setting up `StepLR` with predefined parameters. Throughout the 50-epoch training loop, the scheduler adjusts the learning rate based on the set schedule, and the current learning rate is recorded after each adjustment for subsequent analysis. This methodical adjustment of the learning rate is a practical approach to optimize the training process and improve model performance.
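A sketch of this setup, using the parameters described above (the model variable and history list are illustrative):
```python
import torch

optimizer = torch.optim.SGD(feedForwardNN.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

lr_history = []
for epoch in range(50):
    # ... one epoch of training and validation ...
    scheduler.step()  # decay the learning rate by a factor of 0.1 every 10 epochs
    lr_history.append(scheduler.get_last_lr()[0])  # record the LR for later analysis
```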
**Results**
Learning curve:
Confusion matrix:
Accuracy vs. learning rate:
The accuracy of the model remains relatively stable across a range of learning rates and then drops sharply at higher rates. This **suggests that the model's performance is sensitive to the learning rate**: there is an optimal range where the model performs best, beyond which accuracy degrades because the learning-rate steps are too large.
Final accuracy on the test Dataset:
### Analysis
**Effectiveness of using a Neural Network on this Dataset**
- **Low number of features**: The dataset has only 3 features mapped to 3 output classes. With so few features, the network can easily overfit to any one of them, which can lower accuracy.
- **Training dataset size**: With only 2,000 training examples, we do not have the option of training a larger, more complex model to capture the nuances of the patterns in the dataset; there simply is not enough data for a bigger network.
- **Class imbalance**: Since the Neutral class has many more examples, the model tends to overfit its predictions toward this majority class, which hurts accuracy.
**Rundown of the final hyperparameters**
During training we observed that the following hyperparameters gave the best results on this problem.
**Optimizer**: We got excellent accuracies when testing with the Adam optimizer (peak accuracy of 88% on the test set), but since we were directed to use the SGD optimizer, our final model incorporates **SGD**.
**Learning Rate**: Through trial and error, we concluded that the best **LR for this task was 0.05**.
**Epochs**: Similarly, through trial and error we concluded that LR = 0.05 combined with **50 epochs** gave the best results.
**Learning Rate Decay**: Testing on the test set did not give satisfactory results with decay enabled, so we did not include it in our final model.
Final Model Accuracy and F1 are as follows:
```
F1 Score: 0.7969336351721672
For Insurable Data
Test Loss: 0.7711005282402038
Test Accuracy: 81.0
Preprocessing: True
Early Stopping: True
```
The final model's confusion matrix is presented here:

## Question 2
### Training a Neural Network on MNIST Dataset
The goal is to create a multi-class neural network classifier for digits 0 through 9. We will discuss the design choices we made and how they differ from those used for the Insurability Data model.
### 1. Processing the Data
Similar to the previous case, we convert the dataset into a custom Dataset class for use with DataLoader.
#### Data Transformation
For the data transformation step, we converted the grayscale pixel values to binary values using a thresholding operation: each pixel is mapped to 0 or 1 depending on whether it falls below or above a specified threshold (128 in our case).
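A sketch of this thresholding step (the function name is ours):
```python
import torch

def binarize(pixels, threshold=128):
    # Map grayscale values (0-255) to binary values (0 or 1)
    pixels = torch.as_tensor(pixels, dtype=torch.float32)
    return (pixels >= threshold).float()
```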

### 2. Model Architecture and Design
The model architecture has 2 hidden layers, with the following sizes:
1. Input Layer - of size 784 (28×28 image)
2. Hidden Layer 1 - of size 128 with <b>ReLU</b> activation
3. Hidden Layer 2 - of size 64 with <b>ReLU</b> activation
4. Output Layer - of size 10
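A minimal PyTorch sketch of this architecture (the class name is ours):
```python
import torch.nn as nn

class MnistNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(28 * 28, 128),  # input (784) -> hidden 1 (128)
            nn.ReLU(),
            nn.Linear(128, 64),       # hidden 1 (128) -> hidden 2 (64)
            nn.ReLU(),
            nn.Linear(64, 10),        # hidden 2 (64) -> output (10 digits)
        )

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))  # flatten the 28x28 images
```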

### Differences
#### Change in the number of nodes in the output layer
In the previous case of classifying insurability data we had three classes, "Good", "Bad", and "Neutral", whereas in this scenario we have 10 outputs, one for each digit from 0 to 9.
#### Adding Hidden layers
Since the model has 784 input values, we introduced more capacity into the network by adding one more hidden layer. This additional layer enhances the model's ability to learn more complex patterns, potentially leading to improved performance during training.
#### Choosing Activation Function
ReLU is often preferred over sigmoid for hidden layers in MNIST data because ReLU addresses the vanishing gradient problem better, is computationally more efficient, introduces non-linearity, and has shown good performance in image classification tasks.
### 3. Hyper Parameter Tuning
We'll discuss the choice of optimizer and loss function for the MNIST classifier.
#### Optimizer
We chose 'Adam' over SGD for this task because Adam adjusts the learning rates for each parameter individually, adapts to sparse gradients efficiently, and combines the benefits of momentum and RMSProp. It generally requires less manual tuning and can lead to faster and more stable convergence during training.
#### Loss Function
We chose cross-entropy for MNIST digit classification because it is the standard loss for multi-class classification, penalizing confident wrong predictions heavily.
#### Learning Rate
A learning rate of 0.03 was chosen after experimenting with different learning rates and finding that it produced good performance for this task.
#### Final Hyper Parameter choices
* Chosen Optimizer - Adam
* Loss Function - Cross-Entropy
* Learning Rate - 0.03
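Putting these choices together, the setup looks roughly like this (variable names are illustrative, and `MnistNet` refers to the architecture sketched above):
```python
import torch
import torch.nn as nn

model = MnistNet()
loss_fn = nn.CrossEntropyLoss()  # standard loss for multi-class classification
optimizer = torch.optim.Adam(model.parameters(), lr=0.03)
```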
### 4. Results
#### Learning Curve

#### Confusion Matrix

#### Accuracy
* Accuracy with regularization: 86%
* Accuracy without regularization: 85%
##### Comparison with K-means and KNN (HW2)
Neural networks often outperform k-means and KNN in tasks that require learning complex patterns, handling high-dimensional data, and adapting to non-linear relationships. However, the choice of algorithm depends on the specific characteristics of the data and the nature of the task. For simpler problems with well-defined clusters, k-means and KNN might perform adequately, while neural networks shine in tasks demanding advanced feature learning and representation. Additionally, since this is an image recognition problem, using CNNs might have helped our accuracy.
## Question 3
### Regularization Implementation and Analysis
**Regularization Technique Applied:**
For the MNIST classifier, **L2** regularization was implemented as part of the optimization process. This choice was motivated by the need to prevent overfitting. L2 regularization, also known as weight decay, discourages large weights in the model by adding a penalty term to the loss function that is proportional to the sum of the squares of the weights.
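In the notation of that description, the penalized objective is

$$L_{\text{total}} = L_{\text{CE}} + \lambda \sum_{i} w_i^2$$

where $\lambda$ is the weight-decay coefficient (0.01 in our setup).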
**Implementation Details:**
L2 regularization was introduced through the `weight_decay` parameter of the Adam optimizer. This parameter was set to 0.01, adding a small penalty for weight magnitudes to the optimization process, thereby encouraging the model to maintain smaller weight values.
```python
optimizer = torch.optim.Adam(mnistNetModel.parameters(), lr=0.0003, weight_decay=0.01)
```
**Impact on Classifier Performance:**
The following observations were made:
- **Learning Curves**: With regularization, the training loss exhibited a smoother decrease and less volatility, indicating that the model was not overfitting to the noise in the training data. The validation loss remained closer to the training loss throughout the training epochs, suggesting that the gap between training and generalization error was reduced.
- **Performance Metrics**: The model with regularization achieved a more consistent accuracy, avoiding the sharp peaks and troughs that are indicative of a model learning to represent the noise in the training data rather than the underlying pattern.
Accuracy without regularization: **86%**
Accuracy with regularization: **84%**

## Question 4
To implement the classification problem without using the built-in optimizers, we did the following:
1. Implemented a custom training function, manually coding the forward and backward passes.
For the forward pass, we simply used np.dot(x, y) to generate the outputs before activations.
For the activation layers, we used the predefined sigmoid and softmax activation functions.
For the backward pass, we used the mathematical derivatives of the individual activation functions (sigmoid and softmax) and applied the chain rule to calculate Dw1 and Dw2 from y_pred.
```python
import numpy as np

def compute_gradients(X, y, weights):
    # Forward pass: linear -> sigmoid -> linear -> softmax
    h = np.dot(X, weights[0])
    h_relu = sigmoid(h)  # note: despite the name, this is the sigmoid activation
    y_pred = softmax(np.dot(h_relu, weights[1]))
    # Backward pass: the gradient of cross-entropy w.r.t. the softmax
    # inputs is (y_pred - one_hot(y))
    grad_y_pred = y_pred
    grad_y_pred[np.arange(len(y)), y] -= 1
    grad_w2 = np.dot(h_relu.T, grad_y_pred)
    # Chain rule back through the hidden layer; sigmoid'(h) = s * (1 - s)
    grad_h_relu = np.dot(grad_y_pred, weights[1].T)
    grad_h = grad_h_relu * h_relu * (1 - h_relu)
    grad_w1 = np.dot(X.T, grad_h)
    return grad_w1, grad_w2
```
Then we manually updated the weights as:
```python
wi = wi - learning_rate * Dwi
```
The update_weights function uses these gradients to adjust the neural network's weights, simulating the training process.
The training loop iterates through epochs and examples, updating weights to minimize the loss. Notably, early stopping is implemented to prevent overfitting based on validation loss.
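A sketch of `update_weights`, assuming the same two-element weight list that `compute_gradients` uses (the function body here is our illustrative version):
```python
def update_weights(weights, grads, learning_rate):
    # Plain gradient-descent step: w_i <- w_i - lr * Dw_i
    weights[0] -= learning_rate * grads[0]
    weights[1] -= learning_rate * grads[1]
    return weights
```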
We also calculated the cross-entropy loss manually; it serves as the measure of model performance throughout training, while the compute_gradients function performs both the forward and backward passes and returns the gradients for the weight updates.
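A sketch of the manual cross-entropy computation (assuming `y` holds integer class labels and `y_pred` the softmax outputs from the forward pass):
```python
import numpy as np

def cross_entropy_loss(y_pred, y):
    # Mean negative log-probability assigned to the true class
    eps = 1e-12  # avoid log(0)
    return -np.mean(np.log(y_pred[np.arange(len(y)), y] + eps))
```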
After the complete training loop, our final accuracies are as follows.
