> **Neural Networks from Scratch in Python by Harrison Kinsley and Daniel Kukieła**
> https://nnfs.io/
*This note is based on the above book, together with some C++ implementations of my own. Please refer to the author's YouTube channel and the book for a detailed tutorial.*
## A Dense Layer with *Numpy*
*Numpy* broadcasts the inputs across all three weight rows to calculate the dot product for every neuron at once. There are 4 inputs and a weight matrix that fully connects those 4 inputs to all 3 neurons.
```python
import numpy as np
inputs = [1.0, 2.0, 3.0, 4.0]
weights = [[0.2, 0.4, 0.6, 0.8],
           [0.3, 0.6, 0.9, 1.2],
           [0.5, 1.0, 1.5, 2.0]]
biases = [2.0, 3.0, 5.0]
layer_outputs = np.dot(weights, inputs) + biases
print(layer_outputs)
```

In the following section, when we have a batch of inputs, the weight matrix has to be transposed so that the shapes line up for *Numpy*'s matrix multiplication. In the *Eigen* implementation, the library treats a `Vector4f` as a column of values, so the bias vector has to be transposed before it can be added row-wise.
:::info
:information_source: **Eigen Official Tutorial**
https://eigen.tuxfamily.org/dox/group__TutorialReductionsVisitorsBroadcasting.html
:::
***In Python***
```python
import numpy as np
inputs = np.array(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]])
weights = np.array(
    [[0.2, 0.4, 0.6, 0.8],
     [0.3, 0.6, 0.9, 1.2],
     [0.5, 1.0, 1.5, 2.0],
     [0.6, 1.2, 1.8, 2.4]])
biases = [2.0, 3.0, 5.0, 6.0]
output_layer = np.dot(inputs, weights.transpose()) + biases
print(output_layer)
```
***In C++ with Eigen***
```cpp
#include <iostream>
#include <Eigen/Dense>
int main(int argc, char** argv)
{
    Eigen::Matrix4f inputs;
    inputs << 1.0, 2.0, 3.0, 4.0,
              5.0, 6.0, 7.0, 8.0,
              9.0, 10.0, 11.0, 12.0,
              13.0, 14.0, 15.0, 16.0;
    Eigen::Matrix4f weights;
    weights << 0.2, 0.4, 0.6, 0.8,
               0.3, 0.6, 0.9, 1.2,
               0.5, 1.0, 1.5, 2.0,
               0.6, 1.2, 1.8, 2.4;
    Eigen::Vector4f biases;
    biases << 2.0, 3.0, 5.0, 6.0;
    Eigen::Matrix4f layerOutput = inputs * weights.transpose();
    layerOutput.rowwise() += biases.transpose();
    std::cout << layerOutput << std::endl;
}
```
## Dense Layer Class Implementation
In this section, the book starts building a class around the calculation introduced in the previous section. In the following code, *the dimensions of the weight initialization are reversed* so that *a transpose is not needed* in the forward pass.
***In Python***
```python
import numpy as np

class layer_dense:
    # layer initialization
    def __init__(self, n_inputs, n_neurons):
        # initialize weights and biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases
```
***In C++ with Eigen***
*TODO: untested*
```cpp
using Eigen::MatrixXf;
using Eigen::VectorXf;

class Dense
{
private:
    MatrixXf weights;
    VectorXf biases;
    MatrixXf outputs;

public:
    Dense(unsigned int numInputs, unsigned int numNeurons)
    {
        weights = 0.01 * MatrixXf::Random(numInputs, numNeurons);
        biases = VectorXf::Zero(numNeurons);
    }
    ~Dense() {}

    void forward(MatrixXf inputs)
    {
        outputs = inputs * weights;
        outputs.rowwise() += biases.transpose();
    }
};
```
## Activation Functions
Activation functions are applied to the output of a layer and are usually nonlinear. This is what enables a neural network to approximate nonlinear functions.
### Step Activation Function

A step function has only two outputs, 1 or 0, depending on whether the input is above or below a defined threshold.
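A minimal sketch of this (my own example, assuming a threshold of 0):

```python
import numpy as np

inputs = np.array([-1.5, -0.1, 0.0, 0.4, 2.0])
# step activation: output 1 when the input exceeds the threshold (0 here), else 0
outputs = (inputs > 0).astype(int)
print(outputs)  # [0 0 0 1 1]
```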
### Linear Activation Function

This activation function is usually applied to the output of the last layer, for example in a regression model.
### Sigmoid Activation Function

A step activation function cannot preserve how close a neuron was to activating or deactivating. The sigmoid function keeps this information, which can help the optimizer during training.
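A small sketch of the sigmoid (my own example): unlike a step, the output varies smoothly between 0 and 1, so "how close" a neuron is to activating survives in the output.

```python
import numpy as np

inputs = np.array([-2.0, 0.0, 2.0])
# sigmoid squashes any real input into the open interval (0, 1)
outputs = 1.0 / (1.0 + np.exp(-inputs))
print(outputs)
```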
### Rectified Linear Activation Function

Since calculating the sigmoid function is very costly, a simpler activation function that is still nonlinear is often used instead.
***In Python***
```python
import numpy as np
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
outputs = np.maximum(0, inputs)
print(outputs)
```
***In C++ with Eigen***
```cpp
Eigen::Matrix3f inputs = Eigen::Matrix3f::Random();
std::cout << inputs << "\n\n";
// reshaped() provides a flattened 1D view of the matrix
Eigen::MatrixXf outputs = inputs.reshaped<Eigen::RowMajor>().transpose().cwiseMax(0.0);
std::cout << "ReLU output :\n" << outputs << "\n\n";
```
**ReLU Class Implementation**
***In Python***
```python
import numpy as np

class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)
```
***In C++ with Eigen***
```cpp
class ReLU
{
private:
    MatrixXf outputs;

public:
    void forward(MatrixXf inputs)
    {
        // keeps the shape of the input matrix
        outputs = inputs.cwiseMax(0.0);
    }
    MatrixXf getOutput() { return outputs; }
};
```
:::info
:bulb: **Why Use Activation Functions?**
In chapter 4 of the book, the author provides a step-by-step view of how a nonlinear activation function helps a model fit the equation `y = sin(x)`.
:::
### Softmax Activation Function
The softmax function represents confidence scores in a multi-class classification task. The output is an array in which each value represents the probability that the target belongs to a class, and all the values add up to 1. For example, `[0.3, 0.22, 0.18, 0.16, 0.14]` could be the output of softmax. The formula below uses the exponential function, which outputs only non-negative numbers; a probability cannot be negative.

*Take a given exponential value and divide it by the sum of all exponential values:*

$$S_{i,j} = \frac{e^{z_{i,j}}}{\sum_{k} e^{z_{i,k}}}$$
***In Python***
```python
import numpy as np
inputs = [4.8, 1.21, 2.385, 0.5]
exp_values = np.exp(inputs)
outputs = exp_values / np.sum(exp_values)
print(outputs)
```
***In C++ with Eigen***
```cpp
Eigen::Matrix2f inputs;
inputs << 4.8, 1.21,
          2.385, 0.5;
// .matrix() is needed because Eigen forbids assigning array expressions to matrices
inputs = inputs.array().exp().matrix();
Eigen::VectorXf sum = inputs.rowwise().sum();
Eigen::Matrix2f outputs = (inputs.array().colwise() / sum.array()).matrix();
std::cout << "softmax output : " << outputs << "\n\n";
std::cout << "sum of the softmax output : " << outputs.rowwise().sum() << "\n\n";
```
**Softmax class implementation**
In this implementation, **we subtract the maximum element from every input before exponentiating, to prevent the very large values** that the exponential in the softmax function can otherwise produce. This works because softmax is shift-invariant: the outputs are relative probabilities, so subtracting the same constant from every input leaves the result unchanged.
***In Python***
```python
class Activation_Softmax:
    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)
```
***In C++ with Eigen***
```cpp
class Softmax
{
private:
    MatrixXf outputs;

public:
    void forward(MatrixXf inputs)
    {
        // subtract the maximum for numerical stability (array view keeps this legal)
        inputs.array() -= inputs.maxCoeff();
        inputs = inputs.array().exp().matrix();
        Eigen::VectorXf sum = inputs.rowwise().sum();
        outputs = (inputs.array().colwise() / sum.array()).matrix();
    }
    MatrixXf getOutput() { return outputs; }
};
```
## Calculating Network Error with Loss
Reference:
> Machine/Deep Learning: An Introduction to Loss Functions (機器/深度學習: 基礎介紹-損失函數(loss function))
https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8-%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92-%E5%9F%BA%E7%A4%8E%E4%BB%8B%E7%B4%B9-%E6%90%8D%E5%A4%B1%E5%87%BD%E6%95%B8-loss-function-2dcac5ebb6cb
### Categorical Cross-Entropy Loss
The output from a softmax activation function will be a probability distribution. Categorical cross-entropy can be used to compare this output probability to a ground-truth probability.

Since the ground truth is in one-hot format, only one of the values is `1` and all the others are `0`, so the loss function can be simplified as follows.

***In Python***
```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2], [0.1, 0.5, 0.4], [0.02, 0.9, 0.08]])
# class_targets = np.array([0, 1, 1]) # categorical labels
class_targets = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0]]) # one-hot labels
# clip values to a range to prevent the undefined log(0)
softmax_outputs = np.clip(softmax_outputs, 1e-7, 1 - 1e-7)
if len(class_targets.shape) == 1:
    true_confidences = softmax_outputs[range(len(softmax_outputs)), class_targets]
elif len(class_targets.shape) == 2:
    true_confidences = np.sum(softmax_outputs * class_targets, axis=1)
neg_log = -np.log(true_confidences)
average_loss = np.mean(neg_log)
print(average_loss)
```
***C++ with Eigen***
:::success
:bulb: **Clip values to a range in Eigen**
```cpp
Eigen::Matrix4f a;
a << 1.0, 2.0, 3.0, 4.0,
     5.0, 6.0, 7.0, 8.0,
     9.0, 1.0, 1.1, 1.2,
     1.3, 1.4, 1.5, 1.6;
std::cout << a << "\n\n";
// Note the order: cwiseMin() applies the upper bound,
// cwiseMax() applies the lower bound.
a = a.cwiseMin(2.0).cwiseMax(1.0);
std::cout << a << "\n\n";
```
In the above case, *c* in `cwiseMin` means **c**oefficient-**wise** computation.
> http://eigen.tuxfamily.org/dox/group__QuickRefPage.html#title6
:::
```cpp
// Only supports one-hot encoded labels
Eigen::Matrix3f softmaxOutput;
softmaxOutput << 0.7, 0.1, 0.2,
                 0.1, 0.5, 0.4,
                 0.02, 0.9, 0.08;
Eigen::Matrix3f groundTruth;
groundTruth << 1.0, 0.0, 0.0,
               0.0, 1.0, 0.0,
               0.0, 1.0, 0.0;
softmaxOutput = softmaxOutput.cwiseMin(1 - 1e-7).cwiseMax(1e-7);
softmaxOutput = softmaxOutput.cwiseProduct(groundTruth);
Eigen::Matrix<float, 3, 1> categoricalCrossEntropyLoss = softmaxOutput.rowwise().sum();
categoricalCrossEntropyLoss.array() = categoricalCrossEntropyLoss.array().log() * -1;
float output = categoricalCrossEntropyLoss.sum() / (categoricalCrossEntropyLoss.cols() * categoricalCrossEntropyLoss.rows());
std::cout << output << "\n\n";
```
#### Categorical Cross-Entropy Loss Class Implementation
In this implementation, a separate `Loss` base class is created for flexibility, so that other loss functions can reuse the same interface.
***In Python***
```python
class Loss:
    def calculate(self, output, y):
        sample_losses = self.forward(output, y)
        data_loss = np.mean(sample_losses)
        return data_loss

class Loss_CategoricalCrossEntropy(Loss):
    def forward(self, y_pred, y_true):
        samples = len(y_pred)
        y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
        if len(y_true.shape) == 1:
            true_confidences = y_pred[range(samples), y_true]
        elif len(y_true.shape) == 2:
            true_confidences = np.sum(y_pred * y_true, axis=1)
        neg_log = -np.log(true_confidences)
        return neg_log
```
***in C++ with Eigen***
```cpp
class Loss
{
public:
    virtual MatrixXf forward(MatrixXf output, MatrixXf groundTruth) = 0;

    float calculate(MatrixXf output, MatrixXf groundTruth)
    {
        MatrixXf sampleLosses = this->forward(output, groundTruth);
        return sampleLosses.sum() / (sampleLosses.cols() * sampleLosses.rows());
    }
};

class CategoricalCrossentropyLoss : public Loss
{
public:
    MatrixXf forward(MatrixXf output, MatrixXf groundTruth)
    {
        output = output.cwiseMin(1 - 1e-7).cwiseMax(1e-7);
        output = output.cwiseProduct(groundTruth);
        MatrixXf categoricalCrossEntropyLoss = output.rowwise().sum();
        categoricalCrossEntropyLoss.array() = categoricalCrossEntropyLoss.array().log() * -1;
        return categoricalCrossEntropyLoss;
    }
};
```
### Calculate Accuracy
The next step is to calculate the accuracy of the model output (*softmax output*).
***In Python***
```python
import numpy as np
softmax_outputs = np.array([[0.7, 0.1, 0.2], [0.1, 0.5, 0.4], [0.02, 0.9, 0.08]])
class_targets = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0]]) # one-hot labels
predictions = np.argmax(softmax_outputs, axis=1)
class_targets = np.argmax(class_targets, axis=1)
accuracy = np.mean(predictions == class_targets)
print("accuracy : ", accuracy)
```
***In C++ with Eigen***
```cpp
// Only supports one-hot encoded labels
Eigen::Matrix3f softmaxOutput;
softmaxOutput << 0.7, 0.1, 0.2,
                 0.1, 0.5, 0.4,
                 0.02, 0.9, 0.08;
Eigen::Matrix3f groundTruth;
groundTruth << 1.0, 0.0, 0.0,
               0.0, 1.0, 0.0,
               0.0, 1.0, 0.0;
size_t count = 0;
for (int i = 0; i < softmaxOutput.rows(); ++i) {
    int predIndex;
    int trueIndex;
    softmaxOutput.row(i).maxCoeff(&predIndex);
    groundTruth.row(i).maxCoeff(&trueIndex);
    if (predIndex == trueIndex)
        count++;
}
float accuracy = static_cast<float>(count) / static_cast<float>(groundTruth.rows());
std::cout << accuracy << "\n\n";
```
## Derivatives, Gradients, Partial Derivatives and Chain Rule
:::info
***:bulb:Partial Derivative of Max***
The max function returns whichever of x and y is greater. Its partial derivative with respect to x is therefore 1 when x is the greater argument and 0 otherwise:

$$\frac{\partial}{\partial x}\max(x, y) = 1(x > y)$$
:::
:::info
***:bulb:Gradients***
> \[Ch8 P.13\] A gradient is a vector composed of all the partial derivatives of a function.

:::
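As a quick sketch (my own example, not from the book), the gradient of `f(x, y) = x**2 + y` at a point can be approximated numerically with central differences, one partial derivative per input:

```python
def f(x, y):
    return x ** 2 + y

h = 1e-6
x, y = 3.0, 2.0
# approximate each partial derivative with a central difference
df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
gradient = [df_dx, df_dy]  # analytically [2x, 1] = [6, 1]
print(gradient)
```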
:::info
***:bulb:Chain Rules***
The following sample equation uses an ordinary derivative, but the chain rule applies to partial derivatives as well.

:::
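A small numerical check of the chain rule (my own example): for `f(g(x))` with `f(u) = u**2` and `g(x) = 3*x`, the chain rule gives `f'(g(x)) * g'(x) = 2*(3x)*3 = 18x`.

```python
def g(x):
    return 3 * x

def f(u):
    return u ** 2

h = 1e-6
x = 2.0
# numerical derivative of the composition f(g(x))
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)
analytic = 18 * x  # chain rule: 2 * g(x) * 3
print(numeric, analytic)
```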
## Backpropagation
:::info
***:bulb:Different Notations of a Neuron***
**In the Previous Graphs**

**In a Compute Graph**

Please refer to chapter 9 P.14, which provides a step-by-step sample of backpropagation. The book example uses the compute graph notation shown above.
:::

> Ch.9 Page 26 A Neuron with activation function Sample
```python
x = [1.0, -2.0, 3.0] # input
w = [-3.0, -1.0, 2.0] # weight
b = 1.0 # bias
# forward pass
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]
z = xw0 + xw1 + xw2 + b
# ReLU
y = max(z, 0)
# backward pass
# Derivative return from the deeper layer (next) backpropagation
dvalue = 1.0
drelu_dz = dvalue * (1. if z > 0 else 0.) # derivative of ReLU * dvalue (Chain Rule)
print("drelu_dz : ", drelu_dz)
# partial derivative of sum and apply chain rule
# The derivative of sum equals 1 for all the elements
dsum_dxw0 = 1
dsum_dxw1 = 1
dsum_dxw2 = 1
dsum_db = 1
drelu_dxw0 = drelu_dz * dsum_dxw0
drelu_dxw1 = drelu_dz * dsum_dxw1
drelu_dxw2 = drelu_dz * dsum_dxw2
drelu_db = drelu_dz * dsum_db
print(drelu_dxw0, drelu_dxw1, drelu_dxw2, drelu_db)
# partial derivative of multiplication and apply chain rule
# The derivative of multiplication : d/dxf(x * y) = y and d/dyf(x * y) = x
dmul_dx0 = w[0]
dmul_dx1 = w[1]
dmul_dx2 = w[2]
dmul_dw0 = x[0]
dmul_dw1 = x[1]
dmul_dw2 = x[2]
drelu_dx0 = drelu_dxw0 * dmul_dx0
drelu_dw0 = drelu_dxw0 * dmul_dw0
drelu_dx1 = drelu_dxw1 * dmul_dx1
drelu_dw1 = drelu_dxw1 * dmul_dw1
drelu_dx2 = drelu_dxw2 * dmul_dx2
drelu_dw2 = drelu_dxw2 * dmul_dw2
print(drelu_dx0, drelu_dw0, drelu_dx1, drelu_dw1, drelu_dx2, drelu_dw2)
```
With the previous gradient calculation, an update can be applied to the weights. In the following code snippet, we apply the update with a 0.001 learning rate.
```python
dx = [drelu_dx0, drelu_dx1, drelu_dx2] # gradients on inputs
dw = [drelu_dw0, drelu_dw1, drelu_dw2] # gradients on weights
db = drelu_db # gradient on bias
print(w, b)
# update weight with learning_rate=0.001
w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db
print(w, b)
```
***Python Implementation of Multiple-Neuron Backpropagation with the Chain Rule***

In the following code, we calculate the **derivative with respect to the inputs**. *The derivative with respect to the inputs equals the weights.*
```python
import numpy as np
# returned gradient from the next layer
dvalues = np.array([[1., 1., 1.]])
# 3 neurons and 4 inputs
weights = np.array([[0.2, 0.8, -0.5, 1], [0.5, -0.91, 0.26, -0.5], [-0.26, -0.27, 0.17, 0.87]]).T
dx0 = sum(weights[0] * dvalues[0])
dx1 = sum(weights[1] * dvalues[0])
dx2 = sum(weights[2] * dvalues[0])
dx3 = sum(weights[3] * dvalues[0])
dinputs = np.array([dx0, dx1, dx2, dx3])
print(dinputs)
# The above code can be replaced with numpy's dot product (Matrix Multiplication)
import numpy as np
# returned gradient from the next layer
dvalues = np.array([[1., 1., 1.]])
# 3 neurons and 4 inputs
weights = np.array([[0.2, 0.8, -0.5, 1], [0.5, -0.91, 0.26, -0.5], [-0.26, -0.27, 0.17, 0.87]]).T
dinputs = np.dot(dvalues[0], weights.T)
print(dinputs)
```
In the following code, we calculate the **derivative with respect to the weights**. In this case, the output shape needs to match the shape of the weights, since we use it to update the weights. *The derivative with respect to the weights equals the inputs.*
```python
import numpy as np
dvalues = np.array([[1., 1., 1.], [2., 2., 2.], [3., 3., 3.]])
inputs = np.array([[1, 2, 3, 2.5], [2., 5., -1, 2], [-1.5, 2.7, 3.3, -0.8]])
dweights = np.dot(inputs.T, dvalues)
print("dvalues : \n", dvalues, "\ninputs : \n", inputs, "\ndweights : \n", dweights)
```
In the following code, **the derivative with respect to the biases** is calculated. Because the bias enters through a sum operation, its derivative always equals 1, so we just multiply by the incoming gradients to apply the chain rule, then sum over the batch for each neuron.
```python
import numpy as np
dvalues = np.array([[1., 1., 1.], [2., 2., 2.], [3., 3., 3.]])
biases = np.array([[2, 3, 0.5]]) # shape (1, neurons) we do not need it
dbiases = np.sum(dvalues, axis=0, keepdims=True)
print(dbiases)
```
In the following code, **the derivative of ReLU** is calculated. *The derivative of the ReLU function equals 1 if the input is greater than 0, and 0 otherwise.*
```python
import numpy as np
z = np.array([[1, 2, -3, -4], [2, -7, -1, 3], [-1, 2, 5, -1]])
dvalues = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
# ReLU activation's derivatives
drelu = np.zeros_like(z)
drelu[z > 0] = 1
print(drelu)
drelu *= dvalues
print(drelu)
# The above code can be simplified as the following
import numpy as np
z = np.array([[1, 2, -3, -4], [2, -7, -1, 3], [-1, 2, 5, -1]])
dvalues = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
# ReLU activation's derivatives
drelu = dvalues.copy()
drelu[z <= 0] = 0
print(drelu)
```
***Combining all of the above code***
***In Python***
```python
import numpy as np
dvalues = np.array([[1., 1., 1.], [2., 2., 2.], [3., 3., 3.]])
inputs = np.array([[1, 2, 3, 2.5], [2., 5., -1., 2], [-1.5, 2.7, 3.3, -0.8]])
weights = np.array([[0.2, 0.8, -0.5, 1], [0.5, -0.91, 0.26, -0.5], [-0.26, -0.27, 0.17, 0.87]]).T
biases = np.array([[2, 3, 0.5]])
# forward pass
layer_outputs = np.dot(inputs, weights) + biases # Dense Layer
relu_outputs = np.maximum(0, layer_outputs) # ReLU Layer
# backpropagation from deepest layer go backward
# ReLU Layer
drelu = relu_outputs.copy()
drelu[layer_outputs <= 0] = 0
# Dense Layer
dinputs = np.dot(drelu, weights.T)
dweights = np.dot(inputs.T, drelu)
dbiases = np.sum(drelu, axis=0, keepdims=True)
# update weights
weights += -0.001 * dweights
biases += -0.001 * dbiases
print(weights)
print(biases)
```
***In C++ with Eigen***
```cpp
MatrixXf dvalues(3, 3);
dvalues << 1.0, 1.0, 1.0,
           2.0, 2.0, 2.0,
           3.0, 3.0, 3.0;
MatrixXf inputs(3, 4);
inputs << 1, 2, 3, 2.5,
          2.0, 5.0, -1.0, 2,
          -1.5, 2.7, 3.3, -0.8;
MatrixXf weights(3, 4);
weights << 0.2, 0.8, -0.5, 1,
           0.5, -0.91, 0.26, -0.5,
           -0.26, -0.27, 0.17, 0.87;
MatrixXf weightsT(4, 3);
weightsT = weights.transpose();
VectorXf biases(3);
biases << 2, 3, 0.5;
// forward pass
// Dense
MatrixXf layerOutputs = (inputs * weightsT);
layerOutputs.rowwise() += biases.transpose();
// ReLU
MatrixXf reluOutputs = layerOutputs.cwiseMax(0.0);
// backpropagation, starting from the deepest layer and going backward
// ReLU: select() returns an expression, so the result must be assigned back
MatrixXf drelu = (layerOutputs.array() <= 0).select(0, reluOutputs);
// Dense
MatrixXf dinputs = drelu * weightsT.transpose();
MatrixXf dweights = inputs.transpose() * drelu;
// sum over the batch for each neuron (transposed into a column vector)
VectorXf dbiases = drelu.colwise().sum().transpose();
// update weights
weightsT.array() += dweights.array() * -0.001;
biases.array() += dbiases.array() * -0.001;
std::cout << dvalues << "\n\n";
std::cout << inputs << "\n\n";
std::cout << weights << "\n\n";
std::cout << weightsT << "\n\n";
std::cout << biases << "\n\n";
std::cout << layerOutputs << "\n\n";
std::cout << reluOutputs << "\n\n";
std::cout << drelu << "\n\n";
std::cout << dinputs << "\n\n";
std::cout << dweights << "\n\n";
std::cout << dbiases << "\n\n";
std::cout << weightsT << "\n\n";
std::cout << biases << "\n\n";
```
### Backpropagation Class Implementation
In C++ with Eigen ***(Dense and ReLU)***
```cpp
class Dense
{
private:
    MatrixXf weights;
    VectorXf biases;
    MatrixXf outputs;
    MatrixXf inputs;
    MatrixXf dinputs;
    MatrixXf dweights;
    VectorXf dbiases;

public:
    Dense(unsigned int numInputs, unsigned int numNeurons)
    {
        weights = 0.01 * MatrixXf::Random(numInputs, numNeurons);
        biases = VectorXf::Zero(numNeurons);
    }
    ~Dense() {}

    void forward(MatrixXf inputs)
    {
        this->inputs = inputs;
        outputs = inputs * weights;
        outputs.rowwise() += biases.transpose();
    }

    void backward(MatrixXf dvalues)
    {
        this->dinputs = dvalues * this->weights.transpose();
        this->dweights = this->inputs.transpose() * dvalues;
        this->dbiases = dvalues.colwise().sum().transpose();
    }

    MatrixXf getWeights() { return weights; }
    VectorXf getBiases() { return biases; }
    MatrixXf getOutput() { return outputs; }
};

class ReLU
{
private:
    MatrixXf outputs;
    MatrixXf inputs;
    MatrixXf dinputs;

public:
    void forward(MatrixXf inputs)
    {
        this->inputs = inputs;
        // keeps the shape of the input matrix
        outputs = inputs.cwiseMax(0.0);
    }

    void backward(MatrixXf dvalues)
    {
        // select() returns an expression; the result must be assigned back
        this->dinputs = (this->inputs.array() <= 0).select(0, dvalues);
    }

    MatrixXf getOutput() { return outputs; }
};
```
### Derivative of Categorical Cross-Entropy Loss Function
:::info
:bulb:**Derivative of Categorical Cross-Entropy Loss Function**
Recall that the original categorical cross-entropy function is as follows
(*i* indexes the sample in a set, *j* indexes the label/output):

$$L_i = -\sum_{j} y_{i,j} \ln(\hat{y}_{i,j})$$

Taking the partial derivative with respect to $\hat{y}_{i,j}$ for a given *j*, the sum is performed over one single element, so it can be omitted:

$$\frac{\partial L_i}{\partial \hat{y}_{i,j}} = -\frac{y_{i,j}}{\hat{y}_{i,j}}$$
:::
***Categorical Cross-Entropy Loss Function Class Implementation***
```cpp
class CategoricalCrossentropyLoss : public Loss
{
    MatrixXf dinputs;

public:
    // Only supports one-hot encoded labels
    MatrixXf forward(MatrixXf output, MatrixXf groundTruth)
    {
        output = output.cwiseMin(1 - 1e-7).cwiseMax(1e-7);
        output = output.cwiseProduct(groundTruth);
        MatrixXf categoricalCrossEntropyLoss = output.rowwise().sum();
        categoricalCrossEntropyLoss.array() = categoricalCrossEntropyLoss.array().log() * -1;
        return categoricalCrossEntropyLoss;
    }

    void backward(MatrixXf dvalues, MatrixXf groundTruth)
    {
        int samples = dvalues.rows();
        // element-wise -y / y_hat, normalized by the number of samples
        // (.matrix() is needed because Eigen forbids assigning array expressions to matrices)
        this->dinputs = (-groundTruth.array() / dvalues.array()).matrix();
        this->dinputs /= static_cast<float>(samples);
    }
};
```
### Derivative of Softmax
:::info
:bulb: **The derivative of the division operation**

$$\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{g(x)^2}$$
:::
:::info
:bulb: **The derivative of the constant e raised to the power of n, with respect to n**

$$\frac{d}{dn}e^{n} = e^{n}$$
:::
:::success
:bulb: **The derivative of Softmax**
> Please refer to the book chapter 9 page 46

$S_{i,j}$ is the *j*-th softmax output of the *i*-th sample, and $z_{i,j}$ is the *j*-th softmax input of the *i*-th sample. We take the partial derivative with respect to the input $z_{i,k}$ using the quotient rule:

$$\frac{\partial S_{i,j}}{\partial z_{i,k}} = \frac{\partial}{\partial z_{i,k}}\,\frac{e^{z_{i,j}}}{\sum_{l} e^{z_{i,l}}}$$

The right part of the derivative (the denominator's derivative) keeps only one term of the sum:

$$\frac{\partial}{\partial z_{i,k}}\sum_{l} e^{z_{i,l}} = e^{z_{i,k}}$$

The left part of the derivative (the numerator's derivative) has two different cases:
***j equals k** or **j does not equal k***
When *j = k*:

$$\frac{\partial}{\partial z_{i,k}}\,e^{z_{i,j}} = e^{z_{i,j}}$$

When *j ≠ k*:

$$\frac{\partial}{\partial z_{i,k}}\,e^{z_{i,j}} = 0$$

The solution can be represented as

$$\frac{\partial S_{i,j}}{\partial z_{i,k}} = \begin{cases} S_{i,j}\,(1 - S_{i,k}) & j = k \\ -S_{i,j}\,S_{i,k} & j \neq k \end{cases}$$

A Kronecker delta function with the following definition can be applied:

$$\delta_{j,k} = \begin{cases} 1 & j = k \\ 0 & j \neq k \end{cases}$$

Thus we have the final form of the softmax derivative:

$$\frac{\partial S_{i,j}}{\partial z_{i,k}} = S_{i,j}\left(\delta_{j,k} - S_{i,k}\right)$$
:::
*Pending : Python implementation detail*
***In Python***
```python
class Activation_Softmax:
    def forward(self, inputs):
        self.inputs = inputs
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)

    def backward(self, dvalues):
        self.dinputs = np.empty_like(dvalues)
        for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
            # flatten the output into a column vector
            single_output = single_output.reshape(-1, 1)
            # calculate the Jacobian matrix of the output
            jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
            # calculate this sample's gradient
            # and add it to the array of sample gradients
            self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)
```
:::success
:bulb: **Implementation consideration**
I will not implement an Eigen version of this, since the book provides a better combined implementation for categorical cross-entropy with softmax. A standalone Eigen implementation remains pending.
:::
### Categorical Cross-Entropy Loss with Softmax Activation Derivative
In most cases, categorical cross-entropy loss is used with the softmax activation function and vice versa. Additionally, calculating their derivatives together is faster and simpler.
> Please refer to the book chapter 9 page 56

With the chain rule, we have:

$$\frac{\partial L_i}{\partial z_{i,k}} = \sum_{j}\frac{\partial L_i}{\partial \hat{y}_{i,j}}\,\frac{\partial \hat{y}_{i,j}}{\partial z_{i,k}}$$

Since the loss calculation comes right after the final activation function, we can deduce that **the input of the loss function is the output of the softmax activation function**, i.e. $\hat{y}_{i,j} = S_{i,j}$.

Recall that the derivative of the categorical cross-entropy function is:

$$\frac{\partial L_i}{\partial \hat{y}_{i,j}} = -\frac{y_{i,j}}{\hat{y}_{i,j}}$$

Therefore, we have:

$$\frac{\partial L_i}{\partial z_{i,k}} = -\sum_{j}\frac{y_{i,j}}{\hat{y}_{i,j}}\,\frac{\partial \hat{y}_{i,j}}{\partial z_{i,k}}$$

Recall the derivative of the softmax activation function **with the substitution** $\hat{y}_{i,j} = S_{i,j}$:

$$\frac{\partial \hat{y}_{i,j}}{\partial z_{i,k}} = \hat{y}_{i,j}\left(\delta_{j,k} - \hat{y}_{i,k}\right)$$

There are two different cases, *j = k* and *j ≠ k*. Splitting the sum accordingly and substituting the partial derivatives of the activation function:

$$\frac{\partial L_i}{\partial z_{i,k}} = -\frac{y_{i,k}}{\hat{y}_{i,k}}\,\hat{y}_{i,k}\left(1 - \hat{y}_{i,k}\right) - \sum_{j \neq k}\frac{y_{i,j}}{\hat{y}_{i,j}}\left(-\hat{y}_{i,j}\,\hat{y}_{i,k}\right) = -y_{i,k} + y_{i,k}\,\hat{y}_{i,k} + \hat{y}_{i,k}\sum_{j \neq k} y_{i,j} = -y_{i,k} + \hat{y}_{i,k}\sum_{j} y_{i,j}$$

For the one-hot ground truth, only one value in $y_i$ is 1, so $\sum_{j} y_{i,j} = 1$. Therefore, we can further simplify the equation to:

$$\frac{\partial L_i}{\partial z_{i,k}} = \hat{y}_{i,k} - y_{i,k}$$

***In Python***
```python
# softmax classifier (provides a faster backward function)
class Activation_Softmax_Loss_CategoricalCrossEntropy():
    # create activation and loss function objects
    def __init__(self):
        self.activation = Activation_Softmax()
        self.loss = Loss_CategoricalCrossEntropy()

    def forward(self, inputs, y_true):
        # perform the forward pass within the activation function class
        self.activation.forward(inputs)
        self.output = self.activation.output
        return self.loss.calculate(self.output, y_true)

    def backward(self, dvalues, y_true):
        samples = len(dvalues)
        # if labels are one-hot encoded, turn them into discrete values
        if len(y_true.shape) == 2:
            y_true = np.argmax(y_true, axis=1)
        self.dinputs = dvalues.copy()
        self.dinputs[range(samples), y_true] -= 1
        self.dinputs = self.dinputs / samples
```
***In C++ with Eigen***
```cpp
class SoftmaxCategoricalCrossentropyLoss
{
public:
    MatrixXf outputs;
    MatrixXf dinputs;

    SoftmaxCategoricalCrossentropyLoss() {}
    ~SoftmaxCategoricalCrossentropyLoss() {}

    float forward(MatrixXf inputs, MatrixXf groundTruth)
    {
        // these do not need to be kept in the class
        Softmax activation;
        CategoricalCrossentropyLoss lossFunc;
        activation.forward(inputs);
        this->outputs = activation.getOutput();
        return lossFunc.calculate(this->outputs, groundTruth);
    }

    void backward(MatrixXf dvalues, MatrixXf groundTruth)
    {
        int samples = dvalues.rows();
        // the book's suggestion might not be faster here
        this->dinputs = dvalues;
        this->dinputs.array() -= groundTruth.array();
        this->dinputs /= static_cast<float>(samples);
    }
};
```
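As a sanity check (my own sketch in plain NumPy), the simplified gradient `(y_hat - y) / samples` should match the gradient obtained by chaining the softmax Jacobian with the cross-entropy derivative from the previous sections:

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2], [0.1, 0.5, 0.4], [0.02, 0.9, 0.08]])
class_targets = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0]])  # one-hot labels
samples = len(softmax_outputs)

# combined (simplified) gradient: (y_hat - y) / samples
combined = (softmax_outputs - class_targets) / samples

# chained gradient: softmax Jacobian times the cross-entropy derivative, per sample
dloss = -class_targets / softmax_outputs / samples
chained = np.empty_like(dloss)
for i, (out, dval) in enumerate(zip(softmax_outputs, dloss)):
    out = out.reshape(-1, 1)
    jacobian = np.diagflat(out) - np.dot(out, out.T)
    chained[i] = np.dot(jacobian, dval)

print(np.allclose(combined, chained))  # the two gradients agree
```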
## Optimizers
### Stochastic Gradient Descent(SGD)
SGD, given a chosen learning rate, subtracts *learning_rate * gradients* from the actual parameter values.
:::info
:bulb:**Implementation Detail**
In the current C++ implementation, all member variables in the classes are public, to match the Python implementation. A final section will tidy up the C++ implementation.
:::
***In Python***
```python
class Optimizer_SGD:
    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate

    def update_params(self, layer):
        layer.weights += -self.learning_rate * layer.dweights
        layer.biases += -self.learning_rate * layer.dbiases
```
***In C++ with Eigen***
```cpp
class OptimizerSGD
{
private:
    float learningRate = 1.0;

public:
    OptimizerSGD(float learningRate)
    {
        this->learningRate = learningRate;
    }

    void updateParams(Layer& layer)
    {
        layer.weights.array() += (layer.dweights.array() * (this->learningRate * -1));
        layer.biases.array() += (layer.dbiases.array() * (this->learningRate * -1));
    }
};
```
### ***Put all classes together***
***In python***
```python
X, y = spiral_data(samples=100, classes=3)
dense1 = Layer_Dense(2, 64)
activation1 = Activation_ReLU()
dense2 = Layer_Dense(64, 3)
loss_activation = Activation_Softmax_Loss_CategoricalCrossEntropy()
optimizer = Optimizer_SGD()
for epoch in range(10001):
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = loss_activation.forward(dense2.output, y)
    predictions = np.argmax(loss_activation.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions == y)
    if not epoch % 100:
        print(f'epoch: {epoch}, ' +
              f'acc: {accuracy:.3f}, ' +
              f'loss: {loss:.3f}')
    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
```
***Output***
> The book points out that the accuracy being stuck at 0.6 ~ 0.7 is because the model is stuck in a **local minimum**
```bash
epoch: 8500, acc: 0.663, loss: 0.743
epoch: 8600, acc: 0.627, loss: 0.822
epoch: 8700, acc: 0.653, loss: 0.761
epoch: 8800, acc: 0.653, loss: 0.734
epoch: 8900, acc: 0.653, loss: 0.709
epoch: 9000, acc: 0.653, loss: 0.720
epoch: 9100, acc: 0.663, loss: 0.711
epoch: 9200, acc: 0.633, loss: 0.791
epoch: 9300, acc: 0.653, loss: 0.761
epoch: 9400, acc: 0.643, loss: 0.753
epoch: 9500, acc: 0.653, loss: 0.749
epoch: 9600, acc: 0.660, loss: 0.743
epoch: 9700, acc: 0.687, loss: 0.705
epoch: 9800, acc: 0.610, loss: 0.855
epoch: 9900, acc: 0.623, loss: 0.799
epoch: 10000, acc: 0.670, loss: 0.681
```
***In C++ with Eigen***
:::info
:bulb: **Learning Rate, Momentum, Learning Rate Decay**
For the ***learning rate***: since our model gets stuck in a local minimum, the learning rate should not be fixed. To prevent gradient explosion, the learning rate should also not be set too high.
For ***momentum***: a momentum term can be added to the update, which helps push the model's weights over a small hill or out of a small pit. It uses the previous update's direction to prevent the update from bouncing around.
***Learning Rate Decay***
> This can be referred to as **1/t decaying**
```python
starting_learning_rate = 1.0
learning_rate_decay = 0.1
for step in range(20):
    learning_rate = starting_learning_rate * (1.0 / (1 + learning_rate_decay * step))
    print(learning_rate)
```
```bash
1.0
0.9090909090909091
0.8333333333333334
0.7692307692307692
0.7142857142857143
0.6666666666666666
0.625
0.588235294117647
0.5555555555555556
0.5263157894736842
0.5
0.47619047619047616
0.45454545454545453
0.4347826086956522
0.41666666666666663
0.4
0.3846153846153846
0.37037037037037035
0.35714285714285715
0.3448275862068965
```
:::
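A minimal sketch of the momentum update itself (my own scalar example, assuming a constant gradient of 1.0): each step blends the previous update direction into the new one, so repeated updates in the same direction accelerate.

```python
learning_rate = 0.1
momentum = 0.9
gradient = 1.0  # pretend the gradient is constant
w = 0.0
v = 0.0  # momentum buffer, starts at zero
for _ in range(3):
    # new update = fraction of the previous update minus the gradient step
    v = momentum * v - learning_rate * gradient
    w += v
print(w, v)
```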
***Python SGD Class with Learning Rate Decay and Momentum***
```python
class Optimizer_SGD:
    def __init__(self, learning_rate=1.0, decay=0.0, momentum=0.0):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.momentum = momentum

    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1.0 / (1.0 + self.decay * self.iterations))

    def update_params(self, layer):
        # self.momentum is a hyperparameter chosen when used
        if self.momentum:
            # create the momentum arrays if missing and initialize them with zeros
            if not hasattr(layer, 'weight_momentums'):
                layer.weight_momentums = np.zeros_like(layer.weights)
                layer.bias_momentums = np.zeros_like(layer.biases)
            weight_updates = self.momentum * layer.weight_momentums - self.current_learning_rate * layer.dweights
            layer.weight_momentums = weight_updates
            bias_updates = self.momentum * layer.bias_momentums - self.current_learning_rate * layer.dbiases
            layer.bias_momentums = bias_updates
        # vanilla SGD updates
        else:
            weight_updates = -self.current_learning_rate * layer.dweights
            bias_updates = -self.current_learning_rate * layer.dbiases
        layer.weights += weight_updates
        layer.biases += bias_updates

    def post_update_params(self):
        self.iterations += 1
```
***C++ with Eigen SGD Class with Learning Rate Decay and Momentum***
```cpp
class OptimizerSGD
{
private:
    float learningRate;
    float curLearningRate;
    float decay;
    float iterations;
    float momentum;

public:
    OptimizerSGD(float learningRate = 1.0, float decay = 0.0, float momentum = 0.0)
    {
        this->learningRate = learningRate;
        this->curLearningRate = learningRate;
        this->decay = decay;
        this->iterations = 0;
        this->momentum = momentum;
    }

    void preUpdateParams()
    {
        // if decay is set, it is always greater than 0
        if (this->decay > 0.0) {
            curLearningRate = learningRate * (1.0 / (1.0 + this->decay * this->iterations));
        }
    }

    void updateParams(Layer& layer)
    {
        // initialize the momentum matrices with zeros if they are still empty
        if (layer.weightMomentums.size() == 0)
        {
            layer.weightMomentums = MatrixXf::Zero(layer.weights.rows(), layer.weights.cols());
            layer.biasMomentums = MatrixXf::Zero(layer.biases.rows(), layer.biases.cols());
        }
        // if momentum is set, it is always greater than 0
        if (this->momentum > 0.0)
        {
            MatrixXf weightUpdates = layer.weightMomentums * this->momentum - this->curLearningRate * layer.dweights;
            layer.weightMomentums = weightUpdates;
            MatrixXf biasUpdates = layer.biasMomentums * this->momentum - this->curLearningRate * layer.dbiases;
            layer.biasMomentums = biasUpdates;
            // update weights and biases
            layer.weights += weightUpdates;
            layer.biases += biasUpdates;
        }
        // vanilla SGD updates
        else
        {
            MatrixXf weightUpdates = -this->curLearningRate * layer.dweights;
            MatrixXf biasUpdates = -this->curLearningRate * layer.dbiases;
            // update weights and biases
            layer.weights += weightUpdates;
            layer.biases += biasUpdates;
        }
    }

    void postUpdateParams()
    {
        this->iterations += 1;
    }
};
```
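To see why momentum helps: with a constant gradient, each update keeps a fraction of the previous one, so the step size grows geometrically toward `lr * |gradient| / (1 - momentum)` instead of staying fixed. A scalar sketch (values chosen for illustration):

```python
# With a constant gradient g, updates accumulate:
#   u_t = momentum * u_{t-1} - lr * g
# which converges to -lr * g / (1 - momentum)
lr, momentum, g = 0.1, 0.9, 1.0
update = 0.0
for step in range(200):
    update = momentum * update - lr * g
print(update)  # close to -lr * g / (1 - momentum) = -1.0
```

With `momentum = 0.9` the effective step in a consistent gradient direction ends up roughly 10x the vanilla SGD step, which is what lets the optimizer roll through small local minima.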
### AdaGrad
Adaptive gradient (AdaGrad) provides a per-parameter learning rate by keeping a history of previous updates. *The bigger the sum of past updates is, the smaller the updates that follow become.* This helps parameters that receive small updates keep up with the other parameters.
***Python Class Implementation***
```python
class Optimizer_Adagrad:
    def __init__(self, learning_rate=1.0, decay=0.0, epsilon=1e-7):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1.0 / (1.0 + self.decay * self.iterations))
    def update_params(self, layer):
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)
        layer.weight_cache += layer.dweights**2
        layer.bias_cache += layer.dbiases**2
        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)
    def post_update_params(self):
        self.iterations += 1
```
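Because the cache only ever grows, the effective step for a parameter shrinks toward zero even when its gradient stays constant; for a constant gradient it decays like `1/sqrt(t)`. A scalar sketch:

```python
import numpy as np

# Constant gradient, AdaGrad update: step_t = lr * g / (sqrt(cache_t) + eps)
lr, epsilon, g = 1.0, 1e-7, 1.0
cache = 0.0
for t in range(1, 10001):
    cache += g ** 2
    step = lr * g / (np.sqrt(cache) + epsilon)
    if t in (1, 100, 10000):
        print(f"t={t:>5}: step = {step:.4f}")  # shrinks like 1/sqrt(t)
```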
:::success
AdaGrad uses a constantly rising cache to limit the learning rate, which might cause learning to stall.
> That's why this optimizer is not widely used, except for some specific applications.
:::
***C++ with Eigen Class Implementation***
```cpp
class OptimizerAdagrad
{
private:
    float learningRate;
    float curLearningRate;
    float decay;
    int iterations;
    float epsilon;
public:
    OptimizerAdagrad(float learningRate = 1.0, float decay = 0.0, float epsilon = 1e-6)
    {
        this->learningRate = learningRate;
        this->curLearningRate = learningRate;
        this->decay = decay;
        this->iterations = 0;
        this->epsilon = epsilon;
    }
    void preUpdateParams()
    {
        // if decay is set, it is always greater than 0
        if (this->decay > 0.0) {
            curLearningRate = learningRate * (1.0 / (1.0 + this->decay * this->iterations));
        }
    }
    void updateParams(Layer& layer)
    {
        // initialize the cache matrices with zeros if not yet initialized
        if (layer.weightCache.size() == 0)
        {
            layer.weightCache = MatrixXf::Zero(layer.weights.rows(), layer.weights.cols());
            layer.biasCache = MatrixXf::Zero(layer.biases.rows(), layer.biases.cols());
        }
        layer.weightCache.array() += (layer.dweights.array() * layer.dweights.array());
        layer.biasCache.array() += (layer.dbiases.array() * layer.dbiases.array());
        layer.weights.array() += layer.dweights.array() * (-1 * this->curLearningRate) / (layer.weightCache.array().sqrt() + this->epsilon);
        layer.biases.array() += layer.dbiases.array() * (-1 * this->curLearningRate) / (layer.biasCache.array().sqrt() + this->epsilon);
    }
    void postUpdateParams()
    {
        this->iterations += 1;
    }
};
```
### RMSProp
RMSProp combines the momentum concept from SGD with the cache from AdaGrad. A hyperparameter *rho* is added to control how much the update history \(*cache*\) influences the next update.
***The only difference in code \:***
```python
# In AdaGrad:
cache += gradient ** 2
# In RMSProp:
cache = rho * cache + (1 - rho) * gradient ** 2
```
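The book wraps this into a full optimizer class; a sketch following the same structure as the AdaGrad class above, with the commonly used defaults `learning_rate=0.001` and `rho=0.9`:

```python
import numpy as np

class Optimizer_RMSprop:
    def __init__(self, learning_rate=0.001, decay=0.0, epsilon=1e-7, rho=0.9):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.rho = rho
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1.0 / (1.0 + self.decay * self.iterations))
    def update_params(self, layer):
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)
        # Moving average of squared gradients instead of AdaGrad's plain running sum
        layer.weight_cache = self.rho * layer.weight_cache + (1 - self.rho) * layer.dweights**2
        layer.bias_cache = self.rho * layer.bias_cache + (1 - self.rho) * layer.dbiases**2
        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)
    def post_update_params(self):
        self.iterations += 1
```

Because the cache is a moving average rather than a running sum, it can shrink again when gradients get small, so RMSProp does not stall the way AdaGrad can.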
### Adam
Adaptive momentum (Adam) is the most widely used optimizer. It includes the following features:
* Adds momentum as in SGD
* Applies a per-weight adaptive learning rate with a cache as in RMSProp
* Adds a bias-correction mechanism \(*for the initially zeroed values*\) by dividing the momentums and caches by *1 - beta^step*
:::info
:bulb: **Momentums / Bias Correction**
The correction fraction approaches 1 in later training epochs. Dividing by a small fraction at the beginning of training scales the early updates up, which speeds up the start of training.
:::
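One concrete consequence: on the very first step the corrections exactly cancel the `(1 - beta)` factors, so the update magnitude is about `learning_rate` regardless of the raw gradient scale. A scalar sketch (the helper function is illustrative, not part of the book's code):

```python
import math

def adam_first_step(g, lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    m = (1 - beta_1) * g           # momentum after one step (starts at zero)
    v = (1 - beta_2) * g ** 2      # cache after one step (starts at zero)
    m_hat = m / (1 - beta_1 ** 1)  # bias-corrected: exactly g
    v_hat = v / (1 - beta_2 ** 1)  # bias-corrected: exactly g**2
    return -lr * m_hat / (math.sqrt(v_hat) + epsilon)

print(adam_first_step(0.001))  # ~ -0.001
print(adam_first_step(100.0))  # ~ -0.001
```

Without the correction, the first update would be scaled down by `(1 - beta_1)`, making the start of training much slower.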
***In Python Class Implementation***
```python
class Optimizer_Adam:
    def __init__(self, learning_rate=0.001, decay=0.0, epsilon=1e-7, beta_1=0.9, beta_2=0.999):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.beta_1 = beta_1
        self.beta_2 = beta_2
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1.0 / (1.0 + self.decay * self.iterations))
    def update_params(self, layer):
        if not hasattr(layer, 'weight_cache'):
            layer.weight_momentums = np.zeros_like(layer.weights)
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_momentums = np.zeros_like(layer.biases)
            layer.bias_cache = np.zeros_like(layer.biases)
        # Update momentum with current gradients
        layer.weight_momentums = self.beta_1 * layer.weight_momentums + (1 - self.beta_1) * layer.dweights
        layer.bias_momentums = self.beta_1 * layer.bias_momentums + (1 - self.beta_1) * layer.dbiases
        # Get corrected momentum
        weight_momentums_corrected = layer.weight_momentums / (1 - self.beta_1 ** (self.iterations + 1))
        bias_momentums_corrected = layer.bias_momentums / (1 - self.beta_1 ** (self.iterations + 1))
        # Calculate cache
        layer.weight_cache = self.beta_2 * layer.weight_cache + (1 - self.beta_2) * layer.dweights**2
        layer.bias_cache = self.beta_2 * layer.bias_cache + (1 - self.beta_2) * layer.dbiases**2
        # Get corrected cache
        weight_cache_corrected = layer.weight_cache / (1 - self.beta_2 ** (self.iterations + 1))
        bias_cache_corrected = layer.bias_cache / (1 - self.beta_2 ** (self.iterations + 1))
        layer.weights += -self.current_learning_rate * weight_momentums_corrected / (np.sqrt(weight_cache_corrected) + self.epsilon)
        layer.biases += -self.current_learning_rate * bias_momentums_corrected / (np.sqrt(bias_cache_corrected) + self.epsilon)
    def post_update_params(self):
        self.iterations += 1
```
***C++ with Eigen Class Implementation***
```cpp
class OptimizerAdam
{
private:
    float learningRate;
    float curLearningRate;
    float decay;
    int iterations;
    float epsilon;
    float beta_1;
    float beta_2;
public:
    OptimizerAdam(float learningRate = 0.001, float decay = 0.0, float epsilon = 1e-6, float beta_1 = 0.9, float beta_2 = 0.999)
    {
        this->learningRate = learningRate;
        this->curLearningRate = learningRate;
        this->decay = decay;
        this->iterations = 0;
        this->epsilon = epsilon;
        this->beta_1 = beta_1;
        this->beta_2 = beta_2;
    }
    void preUpdateParam()
    {
        if (this->decay > 0.0) {
            curLearningRate = learningRate * (1.0 / (1.0 + this->decay * this->iterations));
        }
    }
    void updateParams(Layer& layer)
    {
        // initialize the momentum and cache matrices with zeros if not yet initialized
        if (layer.weightCache.size() == 0)
        {
            layer.weightMomentums = MatrixXf::Zero(layer.weights.rows(), layer.weights.cols());
            layer.weightCache = MatrixXf::Zero(layer.weights.rows(), layer.weights.cols());
            layer.biasMomentums = MatrixXf::Zero(layer.biases.rows(), layer.biases.cols());
            layer.biasCache = MatrixXf::Zero(layer.biases.rows(), layer.biases.cols());
        }
        // Update momentum
        layer.weightMomentums.array() = (layer.weightMomentums.array() * this->beta_1) + (layer.dweights.array() * (1.0 - this->beta_1));
        layer.biasMomentums.array() = (layer.biasMomentums.array() * this->beta_1) + (layer.dbiases.array() * (1.0 - this->beta_1));
        // Calculate cache
        layer.weightCache.array() = (layer.weightCache.array() * this->beta_2) + (layer.dweights.array().pow(2) * (1.0 - this->beta_2));
        layer.biasCache.array() = (layer.biasCache.array() * this->beta_2) + (layer.dbiases.array().pow(2) * (1.0 - this->beta_2));
        // Get corrected momentum and cache
        // TODO : the correction constants can be calculated once and stored for reuse
        MatrixXf weightMomentumsCorrected = layer.weightMomentums.array() / (1.0 - std::pow(this->beta_1, (this->iterations + 1)));
        MatrixXf biasMomentumsCorrected = layer.biasMomentums.array() / (1.0 - std::pow(this->beta_1, (this->iterations + 1)));
        MatrixXf weightCacheCorrected = layer.weightCache.array() / (1.0 - std::pow(this->beta_2, (this->iterations + 1)));
        MatrixXf biasCacheCorrected = layer.biasCache.array() / (1.0 - std::pow(this->beta_2, (this->iterations + 1)));
        // Update weights and biases with the corrected momentums, scaled by the corrected cache
        layer.weights.array() += weightMomentumsCorrected.array() * (-1 * this->curLearningRate) / (weightCacheCorrected.array().sqrt() + this->epsilon);
        layer.biases.array() += biasMomentumsCorrected.array() * (-1 * this->curLearningRate) / (biasCacheCorrected.array().sqrt() + this->epsilon);
    }
    void postUpdateParams()
    {
        this->iterations += 1;
    }
};
```
### Finally, Put It All Together
In this section, the *main function* code and the *output* are provided.
***In Python***
```python
dense1 = Layer_Dense(2, 64)
activation1 = Activation_ReLU()
dense2 = Layer_Dense(64, 3)
loss_activation = Activation_Softmax_Loss_CategoricalCrossEntropy()
optimizer = Optimizer_Adam(learning_rate=0.02, decay=1e-5)
for epoch in range(10001):
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = loss_activation.forward(dense2.output, y)
    predictions = np.argmax(loss_activation.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    if not epoch % 100:
        print(f'epoch: {epoch}, ' +
              f'acc: {accuracy:.3f}, ' +
              f'loss: {loss:.3f},' +
              f'lr: {optimizer.current_learning_rate}')
    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()
```
***Python Output***
```bash
epoch: 9000, acc: 0.960, loss: 0.083,lr: 0.018348792190754044
epoch: 9100, acc: 0.963, loss: 0.083,lr: 0.0183319737119497
epoch: 9200, acc: 0.973, loss: 0.085,lr: 0.018315186036502167
epoch: 9300, acc: 0.960, loss: 0.086,lr: 0.018298429079863496
epoch: 9400, acc: 0.967, loss: 0.083,lr: 0.018281702757794862
epoch: 9500, acc: 0.967, loss: 0.082,lr: 0.018265006986365174
epoch: 9600, acc: 0.970, loss: 0.090,lr: 0.018248341681949654
epoch: 9700, acc: 0.963, loss: 0.081,lr: 0.018231706761228456
epoch: 9800, acc: 0.967, loss: 0.082,lr: 0.018215102141185255
epoch: 9900, acc: 0.963, loss: 0.081,lr: 0.018198527739105907
epoch: 10000, acc: 0.967, loss: 0.081,lr: 0.018181983472577025
```
***In C++ with Eigen***
*main.cpp*
```cpp
#include "framework.h"
int main(int argc, char** argv)
{
    MatrixXf spiralDataX(15, 2);
    spiralDataX << 0.0, 0.0,
        0.22627203, -0.10630602,
        -0.49370526, -0.07908932,
        0.36122907, 0.65727738,
        -0.11046591, -0.99387991,
        0.0, 0.0,
        0.05269733, 0.24438288,
        0.3019796, -0.39850761,
        -0.70998173, 0.2417146,
        0.88005838, -0.4748655,
        0.0, 0.0,
        -0.13811948, -0.20838188,
        -0.02514717, 0.49936722,
        -0.04003289, -0.74893082,
        -0.98680023, -0.16194228;
    MatrixXf spiralDataY(15, 3);
    spiralDataY << 1.0, 0.0, 0.0,
        1.0, 0.0, 0.0,
        1.0, 0.0, 0.0,
        1.0, 0.0, 0.0,
        1.0, 0.0, 0.0,
        0.0, 1.0, 0.0,
        0.0, 1.0, 0.0,
        0.0, 1.0, 0.0,
        0.0, 1.0, 0.0,
        0.0, 1.0, 0.0,
        0.0, 0.0, 1.0,
        0.0, 0.0, 1.0,
        0.0, 0.0, 1.0,
        0.0, 0.0, 1.0,
        0.0, 0.0, 1.0;
    Dense dense1(2, 64);
    ReLU activation1;
    Dense dense2(64, 3);
    SoftmaxCategoricalCrossentropyLoss loss_activation;
    OptimizerAdam optimizer(0.02, 1e-5);
    for (int i = 0; i < 10000; ++i)
    {
        dense1.forward(spiralDataX);
        activation1.forward(dense1.outputs);
        dense2.forward(activation1.outputs);
        float loss = loss_activation.forward(dense2.outputs, spiralDataY);
        std::cout << i << " : Loss -> " << loss << "\n";
        loss_activation.backward(loss_activation.outputs, spiralDataY);
        dense2.backward(loss_activation.dinputs);
        activation1.backward(dense2.dinputs);
        dense1.backward(activation1.dinputs);
        optimizer.preUpdateParam();
        optimizer.updateParams(dense1);
        optimizer.updateParams(dense2);
        optimizer.postUpdateParams();
    }
    return 0;
}
```
***Output***
```bash
9978 : Loss -> 0.265253
9979 : Loss -> 0.265239
9980 : Loss -> 0.265226
9981 : Loss -> 0.265212
9982 : Loss -> 0.265198
9983 : Loss -> 0.265185
9984 : Loss -> 0.265171
9985 : Loss -> 0.265157
9986 : Loss -> 0.265144
9987 : Loss -> 0.26513
9988 : Loss -> 0.265116
9989 : Loss -> 0.265103
9990 : Loss -> 0.265089
9991 : Loss -> 0.265075
9992 : Loss -> 0.265062
9993 : Loss -> 0.265048
9994 : Loss -> 0.265034
9995 : Loss -> 0.26502
9996 : Loss -> 0.265007
9997 : Loss -> 0.264993
9998 : Loss -> 0.264979
9999 : Loss -> 0.264966
```