> **Neural Networks from Scratch in Python by Harrison Kinsley & Daniel Kukieła**
> https://nnfs.io/

*This note is based on the above book, plus some implementations in C++ by myself. Please refer to the author's YouTube channel and the book for a detailed tutorial.*

## A Dense Layer with *Numpy*

*Numpy* will broadcast the inputs across all three weight vectors to calculate the dot product over all the data. There are 4 inputs and a fully connected set of weights that connects the 4 inputs to all 3 neurons.

```python
import numpy as np

inputs = [1.0, 2.0, 3.0, 4.0]
weights = [[0.2, 0.4, 0.6, 0.8],
           [0.3, 0.6, 0.9, 1.2],
           [0.5, 1.0, 1.5, 2.0]]
biases = [2.0, 3.0, 5.0]

layer_outputs = np.dot(weights, inputs) + biases
print(layer_outputs)
```

![](https://hackmd.io/_uploads/By5rB7Nvh.png =x500)

In the following section, where we have a batch of inputs, the weights tensor has to be transposed to match the calculation in *Numpy*. In the *Eigen* implementation, the library treats `Vector4f` as a column of values. Therefore, a transpose on the bias is required to perform the correct calculation.
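As a quick sanity check (my own addition, not from the book), the dot product above can be compared against an explicit per-neuron loop, which makes it clear that `np.dot(weights, inputs)` computes one dot product per weight row:

```python
import numpy as np

inputs = [1.0, 2.0, 3.0, 4.0]
weights = [[0.2, 0.4, 0.6, 0.8],
           [0.3, 0.6, 0.9, 1.2],
           [0.5, 1.0, 1.5, 2.0]]
biases = [2.0, 3.0, 5.0]

# one dot product per weight row, plus the matching bias
vectorized = np.dot(weights, inputs) + biases

# the same calculation written as an explicit loop over neurons
manual = []
for neuron_weights, neuron_bias in zip(weights, biases):
    total = sum(w * x for w, x in zip(neuron_weights, inputs))
    manual.append(total + neuron_bias)

print(vectorized)  # [ 8. 12. 20.]
assert np.allclose(vectorized, manual)
```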
:::info
:information_source: **Eigen Official Tutorial**
https://eigen.tuxfamily.org/dox/group__TutorialReductionsVisitorsBroadcasting.html
:::

***In Python***

```python
import numpy as np

inputs = np.array([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, 7.0, 8.0],
                   [9.0, 10.0, 11.0, 12.0],
                   [13.0, 14.0, 15.0, 16.0]])
weights = np.array([[0.2, 0.4, 0.6, 0.8],
                    [0.3, 0.6, 0.9, 1.2],
                    [0.5, 1.0, 1.5, 2.0],
                    [0.6, 1.2, 1.8, 2.4]])
biases = [2.0, 3.0, 5.0, 6.0]

output_layer = np.dot(inputs, weights.transpose()) + biases
print(output_layer)
```

***In C++ with Eigen***

```cpp
#include <iostream>
#include <Eigen/Dense>

int main(int argc, char** argv) {
    Eigen::Matrix4f inputs;
    inputs << 1.0, 2.0, 3.0, 4.0,
              5.0, 6.0, 7.0, 8.0,
              9.0, 10.0, 11.0, 12.0,
              13.0, 14.0, 15.0, 16.0;
    Eigen::Matrix4f weights;
    weights << 0.2, 0.4, 0.6, 0.8,
               0.3, 0.6, 0.9, 1.2,
               0.5, 1.0, 1.5, 2.0,
               0.6, 1.2, 1.8, 2.4;
    Eigen::Vector4f biases;
    biases << 2.0, 3.0, 5.0, 6.0;

    Eigen::Matrix4f layerOutput = inputs * weights.transpose();
    layerOutput.rowwise() += biases.transpose();
    std::cout << layerOutput << std::endl;
}
```

## Dense Layer Class Implementation

In this section, the book starts to build the calculation introduced in the previous section into a class. In the following code, *the dimensions of the weight initialization are reversed* so that *a transpose is not needed*.
***In Python***

```python
import numpy as np

class Layer_Dense:
    # layer initialization
    def __init__(self, n_inputs, n_neurons):
        # initialize weights and biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases
```

***In C++ with Eigen***

*TODO: untested*

```cpp
using Eigen::MatrixXf;
using Eigen::VectorXf;

class Dense {
private:
    MatrixXf weights;
    VectorXf biases;
    MatrixXf outputs;

public:
    Dense(unsigned int numInputs, unsigned int numNeurons) {
        // Note: MatrixXf::Random draws from a uniform distribution in [-1, 1],
        // not the Gaussian distribution used by np.random.randn
        weights = 0.01 * MatrixXf::Random(numInputs, numNeurons);
        biases = VectorXf::Zero(numNeurons);
    }
    ~Dense() {}

    void forward(MatrixXf inputs) {
        outputs = inputs * weights;
        outputs.rowwise() += biases.transpose();
    }
};
```

## Activation Functions

Activation functions are applied to the output of a layer and are usually nonlinear. This makes the neural network able to fit nonlinear functions.

### Step Activation Function

![](https://hackmd.io/_uploads/SJJa1Xhw3.png)

A step function has only two outputs, 1 or 0, decided by whether the input crosses a defined threshold.

### Linear Activation Function

![](https://hackmd.io/_uploads/B1PFxQ3Dh.png)

This activation function is usually applied to the output of the last layer in the case of a regression model.

### Sigmoid Activation Function

![](https://hackmd.io/_uploads/rkYzbXnv3.png)

A step activation function is not able to preserve how close a neuron was to activating or deactivating. This information can help the optimizer during training.

### Rectified Linear Activation Function

![](https://hackmd.io/_uploads/HyqRMX2vn.png)

Since calculating the sigmoid function is very costly, a simpler activation function that still remains nonlinear is applied.
***In Python***

```python
import numpy as np

inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
outputs = np.maximum(0, inputs)
print(outputs)
```

***In C++ with Eigen***

```cpp
Eigen::Matrix3f inputs = Eigen::Matrix3f::Random();
std::cout << inputs << "\n\n";

// reshaped() provides a 1D view of the matrix
Eigen::MatrixXf outputs = inputs.reshaped<Eigen::RowMajor>().transpose().cwiseMax(0.0);
std::cout << "ReLU output :\n" << outputs << "\n\n";
```

**ReLU Class Implementation**

***In Python***

```python
class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)
```

***In C++ with Eigen***

```cpp
class ReLU {
private:
    MatrixXf outputs;

public:
    void forward(MatrixXf inputs) {
        // keeps the size of the matrix
        outputs = inputs.cwiseMax(0.0);
    }

    MatrixXf getOutput() { return outputs; }
};
```

:::info
:bulb: **Why Use Activation Functions?**
In chapter 4 of the book, the author provides a step-by-step view of how a nonlinear activation function helps a model fit a `y = sin(x)` equation.
:::

### Softmax Activation Function

Softmax represents confidence scores in multi-class classification tasks. The output is an array in which each value represents the probability that the target belongs to a class, and all the values add up to 1. For example, `[0.3, 0.22, 0.18, 0.16, 0.14]` can be the output of softmax. In the following formula, we can see that exponentiation is used: the exponential function outputs only non-negative numbers, and a probability cannot be negative.
![](https://hackmd.io/_uploads/SyAmFTAwn.png =300x)

*Take a given exponential value and divide it by the sum of all the exponential values.*

***In Python***

```python
import numpy as np

inputs = [4.8, 1.21, 2.385, 0.5]
exp_values = np.exp(inputs)
outputs = exp_values / np.sum(exp_values)
print(outputs)
```

***In C++ with Eigen***

```cpp
Eigen::Matrix2f inputs;
inputs << 4.8, 1.21,
          2.385, 0.5;
inputs = inputs.array().exp();

Eigen::VectorXf sum = inputs.rowwise().sum();
Eigen::Matrix2f outputs = inputs.array().colwise() / sum.array();
std::cout << "softmax output : " << outputs << "\n\n";
std::cout << "sum of the softmax output : " << outputs.rowwise().sum() << "\n\n";
```

**Softmax class implementation**

In this implementation, **we subtract the row maximum from every element in the matrix to prevent the very large values** that can be caused by the exponential function. This works because the softmax output is relative: shifting a row by a constant does not change that row's softmax.

***In Python***

```python
class Activation_Softmax:
    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)
```

***In C++ with Eigen***

```cpp
class Softmax {
private:
    MatrixXf outputs;

public:
    void forward(MatrixXf inputs) {
        // subtract the row-wise maximum for numerical stability,
        // matching np.max(inputs, axis=1, keepdims=True) above
        inputs = inputs.array().colwise() - inputs.rowwise().maxCoeff().array();
        inputs = inputs.array().exp();
        Eigen::VectorXf sum = inputs.rowwise().sum();
        outputs = inputs.array().colwise() / sum.array();
    }

    MatrixXf getOutput() { return outputs; }
};
```

## Calculating Network Error with Loss

Reference:
> Machine/Deep Learning: Basic Introduction to Loss Functions (損失函數, in Chinese)
> https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8-%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92-%E5%9F%BA%E7%A4%8E%E4%BB%8B%E7%B4%B9-%E6%90%8D%E5%A4%B1%E5%87%BD%E6%95%B8-loss-function-2dcac5ebb6cb

### Categorical Cross-Entropy Loss

The output from a softmax activation function will be a probability distribution.
Categorical cross-entropy can be used to compare this output probability distribution to a ground-truth distribution.

![](https://hackmd.io/_uploads/SkdQ7u5_n.png =400x)

Since the ground truth is in one-hot format, only one of the values will be `1` and all the other values will be `0`. The loss function can therefore be simplified as follows.

![](https://hackmd.io/_uploads/rkl94d9On.png =400x)

***In Python***

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
# class_targets = np.array([0, 1, 1])  # categorical labels
class_targets = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [0, 1, 0]])  # one-hot labels

# clip values to a range to prevent the undefined log(0)
softmax_outputs = np.clip(softmax_outputs, 1e-7, 1 - 1e-7)

if len(class_targets.shape) == 1:
    true_confidences = softmax_outputs[range(len(softmax_outputs)), class_targets]
elif len(class_targets.shape) == 2:
    true_confidences = np.sum(softmax_outputs * class_targets, axis=1)

neg_log = -np.log(true_confidences)
average_loss = np.mean(neg_log)
print(average_loss)
```

***C++ with Eigen***

:::success
:bulb: **Clip values to a range in Eigen**

```cpp
Eigen::Matrix4f a;
a << 1.0, 2.0, 3.0, 4.0,
     5.0, 6.0, 7.0, 8.0,
     9.0, 1.0, 1.1, 1.2,
     1.3, 1.4, 1.5, 1.6;
std::cout << a << "\n\n";

// Clipping order matters: cwiseMin clips against the upper bound,
// cwiseMax clips against the lower bound.
a = a.cwiseMin(2.0).cwiseMax(1.0);
std::cout << a << "\n\n";
```

In the above case, the *c* in `cwiseMin` stands for **c**oefficient-**wise** computation.
> http://eigen.tuxfamily.org/dox/group__QuickRefPage.html#title6
:::

```cpp
// Only supports one-hot encoded labels
Eigen::Matrix3f softmaxOutput;
softmaxOutput << 0.7, 0.1, 0.2,
                 0.1, 0.5, 0.4,
                 0.02, 0.9, 0.08;
Eigen::Matrix3f groundTruth;
groundTruth << 1.0, 0.0, 0.0,
               0.0, 1.0, 0.0,
               0.0, 1.0, 0.0;

softmaxOutput = softmaxOutput.cwiseMin(1 - 1e-7).cwiseMax(1e-7);
softmaxOutput = softmaxOutput.cwiseProduct(groundTruth);

Eigen::Matrix<float, 3, 1> categoricalCrossEntropyLoss = softmaxOutput.rowwise().sum();
categoricalCrossEntropyLoss.array() = categoricalCrossEntropyLoss.array().log() * -1;

float output = categoricalCrossEntropyLoss.sum() /
               (categoricalCrossEntropyLoss.cols() * categoricalCrossEntropyLoss.rows());
std::cout << output << "\n\n";
```

#### Categorical Cross-Entropy Loss Class Implementation

In this implementation a separate base `Loss` class is created for flexibility, so that other loss functions can share the same interface.

***In Python***

```python
class Loss:
    def calculate(self, output, y):
        sample_losses = self.forward(output, y)
        data_loss = np.mean(sample_losses)
        return data_loss

class Loss_CategoricalCrossEntropy(Loss):
    def forward(self, y_pred, y_true):
        samples = len(y_pred)
        y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)

        if len(y_true.shape) == 1:
            true_confidences = y_pred[range(samples), y_true]
        elif len(y_true.shape) == 2:
            true_confidences = np.sum(y_pred * y_true, axis=1)

        neg_log = -np.log(true_confidences)
        return neg_log
```

***In C++ with Eigen***

```cpp
class Loss {
public:
    virtual MatrixXf forward(MatrixXf output, MatrixXf groundTruth) = 0;

    float calculate(MatrixXf output, MatrixXf groundTruth) {
        MatrixXf sampleLosses = this->forward(output, groundTruth);
        return sampleLosses.sum() / (sampleLosses.cols() * sampleLosses.rows());
    }
};

class CategoricalCrossentropyLoss : public Loss {
public:
    MatrixXf forward(MatrixXf output, MatrixXf groundTruth) {
        output = output.cwiseMin(1 - 1e-7).cwiseMax(1e-7);
        output = output.cwiseProduct(groundTruth);
        MatrixXf categoricalCrossEntropyLoss = output.rowwise().sum();
        categoricalCrossEntropyLoss.array() = categoricalCrossEntropyLoss.array().log() * -1;
        return categoricalCrossEntropyLoss;
    }
};
```

### Calculate Accuracy

The next step is to calculate the accuracy of the model output (*the softmax output*).

***In Python***

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
class_targets = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [0, 1, 0]])  # one-hot labels

predictions = np.argmax(softmax_outputs, axis=1)
class_targets = np.argmax(class_targets, axis=1)
accuracy = np.mean(predictions == class_targets)
print("accuracy : ", accuracy)
```

***In C++ with Eigen***

```cpp
// Only supports one-hot encoded labels
Eigen::Matrix3f softmaxOutput;
softmaxOutput << 0.7, 0.1, 0.2,
                 0.1, 0.5, 0.4,
                 0.02, 0.9, 0.08;
Eigen::Matrix3f groundTruth;
groundTruth << 1.0, 0.0, 0.0,
               0.0, 1.0, 0.0,
               0.0, 1.0, 0.0;

size_t count = 0;
for (int i = 0; i < softmaxOutput.rows(); ++i) {
    int predIndex;
    int trueIndex;
    softmaxOutput.row(i).maxCoeff(&predIndex);
    groundTruth.row(i).maxCoeff(&trueIndex);
    if (predIndex == trueIndex)
        count++;
}

float accuracy = static_cast<float>(count) / static_cast<float>(groundTruth.rows());
std::cout << accuracy << "\n\n";
```

## Derivatives, Gradients, Partial Derivatives and Chain Rule

:::info
***:bulb:Partial Derivative of Max***
The max function returns x or y, depending on which is greater. When the derivative is applied, the partial derivative with respect to x is as in the following equations.
![](https://hackmd.io/_uploads/Bk1s6nQba.png =300x)
:::

:::info
***:bulb:Gradients***
> \[Ch8 P.13\] A gradient is a vector composed of all the partial derivatives of a function.

![](https://hackmd.io/_uploads/HkDjgpXZa.png =400x)
:::

:::info
***:bulb:Chain Rule***
The following sample equation uses ordinary derivatives, but the chain rule can be applied to partial derivatives as well.
![](https://hackmd.io/_uploads/SyRXy0QW6.png =400x)
:::

## Backpropagation

:::info
***:bulb:Different Notations of a Neuron***

**In Previous Graphs**
![](https://hackmd.io/_uploads/SJKoDR7-T.png =300x)

**In Compute Graphs**
![](https://hackmd.io/_uploads/SkRIDRQ-p.png =600x)

Please refer to chapter 9, p. 14, which provides a sample backpropagation step by step. The book's example uses the compute graph shown above.
:::

![](https://hackmd.io/_uploads/HJulMNBWT.png)
> Ch. 9, page 26: a neuron with an activation function, as a sample

```python
x = [1.0, -2.0, 3.0]   # inputs
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias

# forward pass
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]
z = xw0 + xw1 + xw2 + b

# ReLU
y = max(z, 0)

# backward pass
# Derivative returned from the deeper (next) layer's backpropagation
dvalue = 1.0

drelu_dz = dvalue * (1. if z > 0 else 0.)  # derivative of ReLU * dvalue (chain rule)
print("drelu_dz : ", drelu_dz)

# partial derivative of the sum, applying the chain rule
# The derivative of a sum equals 1 for all the elements
dsum_dxw0 = 1
dsum_dxw1 = 1
dsum_dxw2 = 1
dsum_db = 1
drelu_dxw0 = drelu_dz * dsum_dxw0
drelu_dxw1 = drelu_dz * dsum_dxw1
drelu_dxw2 = drelu_dz * dsum_dxw2
drelu_db = drelu_dz * dsum_db
print(drelu_dxw0, drelu_dxw1, drelu_dxw2, drelu_db)

# partial derivative of the multiplication, applying the chain rule
# The derivative of multiplication : d/dx f(x * y) = y and d/dy f(x * y) = x
dmul_dx0 = w[0]
dmul_dx1 = w[1]
dmul_dx2 = w[2]
dmul_dw0 = x[0]
dmul_dw1 = x[1]
dmul_dw2 = x[2]
drelu_dx0 = drelu_dxw0 * dmul_dx0
drelu_dw0 = drelu_dxw0 * dmul_dw0
drelu_dx1 = drelu_dxw1 * dmul_dx1
drelu_dw1 = drelu_dxw1 * dmul_dw1
drelu_dx2 = drelu_dxw2 * dmul_dx2
drelu_dw2 = drelu_dxw2 * dmul_dw2
print(drelu_dx0, drelu_dw0, drelu_dx1, drelu_dw1, drelu_dx2, drelu_dw2)
```

With the previous gradient calculation, an update can be applied to the weights. In the following code snippet, we apply the update with a 0.001 learning rate.
```python
dx = [drelu_dx0, drelu_dx1, drelu_dx2]  # gradients on inputs
dw = [drelu_dw0, drelu_dw1, drelu_dw2]  # gradients on weights
db = drelu_db                           # gradient on bias
print(w, b)

# update weights with learning_rate = 0.001
w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db
print(w, b)
```

***Python Implementation of Multiple-Neuron Backpropagation with the Chain Rule***

![](https://hackmd.io/_uploads/SyiGMjHb6.png)

In the following code, we calculate the **derivative with respect to the inputs**. *The derivative with respect to the inputs equals the weights.*

```python
import numpy as np

# returned gradient from the next layer
dvalues = np.array([[1., 1., 1.]])

# 3 neurons and 4 inputs
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

dx0 = sum(weights[0] * dvalues[0])
dx1 = sum(weights[1] * dvalues[0])
dx2 = sum(weights[2] * dvalues[0])
dx3 = sum(weights[3] * dvalues[0])
dinputs = np.array([dx0, dx1, dx2, dx3])
print(dinputs)

# The above code can be replaced with numpy's dot product (matrix multiplication)
import numpy as np

# returned gradient from the next layer
dvalues = np.array([[1., 1., 1.]])

# 3 neurons and 4 inputs
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

dinputs = np.dot(dvalues[0], weights.T)
print(dinputs)
```

In the following code, we calculate the **derivative with respect to the weights**. In this case, we need the output shape to match the shape of the weights, since we will update the weights with it.
*The derivative with respect to the weights equals the inputs.*

```python
import numpy as np

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1, 2],
                   [-1.5, 2.7, 3.3, -0.8]])

dweights = np.dot(inputs.T, dvalues)
print("dvalues : \n", dvalues, "\ninputs : \n", inputs, "\ndweights : \n", dweights)
```

In the following code, **the derivative with respect to the biases** is calculated. It comes from the sum operation and always equals 1, so we just need to multiply by the incoming gradients to apply the chain rule. We then sum over the samples, keeping one value per neuron.

```python
import numpy as np

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])
biases = np.array([[2, 3, 0.5]])  # shape (1, neurons); not needed for the derivative

dbiases = np.sum(dvalues, axis=0, keepdims=True)
print(dbiases)
```

In the following code, **the derivative of ReLU** is calculated. *The derivative of the ReLU function equals 1 if the input is greater than 0 and equals 0 otherwise.*

```python
import numpy as np

z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])
dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])

# ReLU activation's derivative
drelu = np.zeros_like(z)
drelu[z > 0] = 1
print(drelu)
drelu *= dvalues
print(drelu)

# The above code can be simplified as follows
import numpy as np

z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])
dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])

# ReLU activation's derivative
drelu = dvalues.copy()
drelu[z <= 0] = 0
print(drelu)
```

***Combining the above code together***

***In Python***

```python
import numpy as np

dvalues = np.array([[1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])
inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T
biases = np.array([[2, 3, 0.5]])

# forward pass
layer_outputs = np.dot(inputs, weights) + biases  # Dense layer
relu_outputs = np.maximum(0, layer_outputs)       # ReLU layer

# backpropagation, starting from the deepest layer and going backward
# ReLU layer
drelu = relu_outputs.copy()
drelu[layer_outputs <= 0] = 0

# Dense layer
dinputs = np.dot(drelu, weights.T)
dweights = np.dot(inputs.T, drelu)
dbiases = np.sum(drelu, axis=0, keepdims=True)

# update weights
weights += -0.001 * dweights
biases += -0.001 * dbiases
print(weights)
print(biases)
```

***In C++ with Eigen***

```cpp
MatrixXf dvalues(3, 3);
dvalues << 1.0, 1.0, 1.0,
           2.0, 2.0, 2.0,
           3.0, 3.0, 3.0;
MatrixXf inputs(3, 4);
inputs << 1, 2, 3, 2.5,
          2.0, 5.0, -1.0, 2,
          -1.5, 2.7, 3.3, -0.8;
MatrixXf weights(3, 4);
weights << 0.2, 0.8, -0.5, 1,
           0.5, -0.91, 0.26, -0.5,
           -0.26, -0.27, 0.17, 0.87;
MatrixXf weightsT(4, 3);
weightsT = weights.transpose();
VectorXf biases(3);
biases << 2, 3, 0.5;

// forward pass
// Dense
MatrixXf layerOutputs = (inputs * weightsT);
layerOutputs.rowwise() += biases.transpose();
// ReLU
MatrixXf reluOutputs = layerOutputs.cwiseMax(0.0);

// Backpropagation
// ReLU: select() returns a new expression, so assign the result back
MatrixXf drelu = (layerOutputs.array() <= 0).select(0, reluOutputs);
// Dense
MatrixXf dinputs = drelu * weightsT.transpose();
MatrixXf dweights = inputs.transpose() * drelu;
// colwise().sum() yields a row vector; transpose it into a column vector
VectorXf dbiases = drelu.colwise().sum().transpose();

// update weights
weightsT.array() += dweights.array() * -0.001;
biases.array() += dbiases.array() * -0.001;

std::cout << dvalues << "\n\n";
std::cout << inputs << "\n\n";
std::cout << weights << "\n\n";
std::cout << weightsT << "\n\n";
std::cout << biases << "\n\n";
std::cout << layerOutputs << "\n\n";
std::cout << reluOutputs << "\n\n";
std::cout << drelu << "\n\n";
std::cout << dinputs << "\n\n";
std::cout << dweights << "\n\n";
std::cout << dbiases << "\n\n";
std::cout << weightsT << "\n\n";
std::cout << biases << "\n\n";
```

### Backpropagation Class Implementation

***In C++ with Eigen (Dense and ReLU)***

```cpp
class Dense {
private:
    MatrixXf weights;
    VectorXf biases;
    MatrixXf outputs;
    MatrixXf inputs;
    MatrixXf dinputs;
    MatrixXf dweights;
    VectorXf dbiases;

public:
    Dense(unsigned int numInputs, unsigned int numNeurons) {
        weights = 0.01 * MatrixXf::Random(numInputs, numNeurons);
        biases = VectorXf::Zero(numNeurons);
    }
    ~Dense() {}

    void forward(MatrixXf inputs) {
        this->inputs = inputs;
        outputs = inputs * weights;
        outputs.rowwise() += biases.transpose();
    }

    void backward(MatrixXf dvalues) {
        this->dinputs = dvalues * this->weights.transpose();
        this->dweights = this->inputs.transpose() * dvalues;
        this->dbiases = dvalues.colwise().sum().transpose();
    }

    MatrixXf getWeights() { return weights; }
    MatrixXf getBiases() { return biases; }
    MatrixXf getOutput() { return outputs; }
};

class ReLU {
private:
    MatrixXf outputs;
    MatrixXf inputs;
    MatrixXf dinputs;

public:
    void forward(MatrixXf inputs) {
        this->inputs = inputs;
        // keeps the size of the matrix
        outputs = inputs.cwiseMax(0.0);
    }

    void backward(MatrixXf dvalues) {
        // select() returns a new expression, so assign the result back
        this->dinputs = (this->inputs.array() <= 0).select(0, dvalues);
    }

    MatrixXf getOutput() { return outputs; }
};
```

### Derivative of the Categorical Cross-Entropy Loss Function

:::info
:bulb:**Derivative of the Categorical Cross-Entropy Loss Function**
Recall that the original categorical cross-entropy function is as follows (*i: index of the sample in the set; j: label/output index*).
![](https://hackmd.io/_uploads/Hk37XV_-a.png =200x)
![](https://hackmd.io/_uploads/ByEpM4YZT.png =400x)
![](https://hackmd.io/_uploads/BkirIVKW6.png =300x)
When taking the partial derivative with respect to *y* at a given *j*, the sum is performed over a single element, so it can be omitted.
:::

***Categorical Cross-Entropy Loss Function Class Implementation***

```cpp
class CategoricalCrossentropyLoss : public Loss {
    MatrixXf dinputs;

public:
    // Only supports one-hot encoded labels
    MatrixXf forward(MatrixXf output, MatrixXf groundTruth) {
        output = output.cwiseMin(1 - 1e-7).cwiseMax(1e-7);
        output = output.cwiseProduct(groundTruth);
        MatrixXf categoricalCrossEntropyLoss = output.rowwise().sum();
        categoricalCrossEntropyLoss.array() = categoricalCrossEntropyLoss.array().log() * -1;
        return categoricalCrossEntropyLoss;
    }

    void backward(MatrixXf dvalues, MatrixXf groundTruth) {
        int samples = dvalues.rows();
        this->dinputs = -groundTruth.array() / dvalues.array();
        // normalize the gradient by the number of samples
        this->dinputs = this->dinputs.array() / static_cast<float>(samples);
    }
};
```

### Derivative of Softmax

:::info
:bulb: **The derivative of the division operation**
![](https://hackmd.io/_uploads/HJSQ229W6.png =500x)
:::

:::info
:bulb: **The derivative of the constant e raised to the power of n, with respect to n**
![](https://hackmd.io/_uploads/Hyda8lTW6.png =300x)
:::

:::success
:bulb: **The derivative of Softmax**
> Please refer to the book, chapter 9, page 46

![](https://hackmd.io/_uploads/B1l90eT-T.png)

The left-hand side is the *j*-th softmax output of the *i*-th sample; *z* is the input array, which is a list of input vectors, and its element is the *j*-th softmax input of the *i*-th sample.

The right part of the derivative:
![](https://hackmd.io/_uploads/HJvfAxaWa.png)

The left part of the derivative has two different situations: ***j equals k*** or ***j does not equal k***.

When *j = k*:
![](https://hackmd.io/_uploads/rkRSJZTbp.png)

When *j != k*:
![](https://hackmd.io/_uploads/S1u31W6Wp.png)

The solution can be represented as:
![](https://hackmd.io/_uploads/B19ylZp-6.png =300x)

A Kronecker delta function, defined by the following equation, can be applied:
![](https://hackmd.io/_uploads/Sky9xW6-p.png =150x)

Thus we have the final form of the softmax derivative equation.
![](https://hackmd.io/_uploads/HkSa8-TZa.png =500x)
:::

*Pending: Python implementation detail*

***In Python***

```python
class Activation_Softmax:
    def forward(self, inputs):
        self.inputs = inputs
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)

    def backward(self, dvalues):
        self.dinputs = np.empty_like(dvalues)
        for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
            # Flatten the output array into a column vector
            single_output = single_output.reshape(-1, 1)
            # Calculate the Jacobian matrix of the output
            jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
            # Calculate the sample-wise gradient
            # and add it to the array of sample gradients
            self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)
```

:::success
:bulb: **Implementation consideration**
I will not implement the Eigen version of this, since the book provides a better combined implementation of categorical cross-entropy with softmax. The implementation is left pending.
:::

### Categorical Cross-Entropy Loss with Softmax Activation Derivative

In most cases, categorical cross-entropy loss is used with the softmax activation function, and vice versa. Additionally, calculating their derivatives together is faster and simpler.

> Please refer to the book, chapter 9, page 56

With the chain rule, we have:

![](https://hackmd.io/_uploads/Byp5JuxGp.png =200x)

Since the loss calculation comes right after the final activation function, we can deduce that **the input of the loss function is the output of the softmax activation function**.
![](https://hackmd.io/_uploads/B1t8lueMT.png =100x)

Recall that the derivative of the categorical cross-entropy function is:

![](https://hackmd.io/_uploads/Skx1Z_gza.png =100x)

Therefore, we have:

![](https://hackmd.io/_uploads/SyF3WOefT.png =400x)

Recall the derivative of the softmax activation function **with substitution**:

![](https://hackmd.io/_uploads/S1kGGdgGT.png =400x)

There are two different situations, *j = k* and *j != k*. We split the equation and we have:

![](https://hackmd.io/_uploads/rJVhrueG6.png =450x)

We substitute the partial derivative of the activation function part:

![](https://hackmd.io/_uploads/H16XIOxfT.png =450x)
![](https://hackmd.io/_uploads/BJkHUdgMT.png =350x)
![](https://hackmd.io/_uploads/rJnsLulfT.png =250x)

For
![](https://hackmd.io/_uploads/r1C1_Oxfa.png =60x)
only one value is 1, and it multiplies the ŷ element. Therefore, we can further simplify the equation to:

![](https://hackmd.io/_uploads/Bk8_F_ezp.png =200x)

***In Python***

```python
# Softmax classifier (provides a faster backward function)
class Activation_Softmax_Loss_CategoricalCrossEntropy():
    # create activation and loss function objects
    def __init__(self):
        self.activation = Activation_Softmax()
        self.loss = Loss_CategoricalCrossEntropy()

    def forward(self, inputs, y_true):
        # perform the forward pass within the activation function class
        self.activation.forward(inputs)
        self.output = self.activation.output
        return self.loss.calculate(self.output, y_true)

    def backward(self, dvalues, y_true):
        samples = len(dvalues)
        # if labels are one-hot encoded, turn them into discrete values
        if len(y_true.shape) == 2:
            y_true = np.argmax(y_true, axis=1)
        self.dinputs = dvalues.copy()
        self.dinputs[range(samples), y_true] -= 1
        self.dinputs = self.dinputs / samples
```

***In C++ with Eigen***

```cpp
class SoftmaxCategoricalCrossentropyLoss {
public:
    MatrixXf outputs;
    MatrixXf dinputs;

    SoftmaxCategoricalCrossentropyLoss() {}
    ~SoftmaxCategoricalCrossentropyLoss() {}

    float forward(MatrixXf inputs, MatrixXf groundTruth) {
        // These don't need to be kept in the class
        Softmax activation;
        CategoricalCrossentropyLoss lossFunc;

        activation.forward(inputs);
        this->outputs = activation.getOutput();
        return lossFunc.calculate(this->outputs, groundTruth);
    }

    void backward(MatrixXf dvalues, MatrixXf groundTruth) {
        int samples = dvalues.rows();
        // The book's suggested shortcut; might not actually be faster here
        this->dinputs = dvalues;
        this->dinputs.array() -= groundTruth.array();
        this->dinputs /= static_cast<float>(samples);
    }
};
```

## Optimizers

### Stochastic Gradient Descent (SGD)

SGD is applied with a chosen learning rate and subtracts *learning_rate \* gradient* from the actual parameter values.

:::info
:bulb:**Implementation Detail**
In the current C++ implementation, all the member variables in the classes are public, to match the Python implementation. A final section will tidy up the C++ implementation.
:::

***In Python***

```python
class Optimizer_SGD:
    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate

    def update_params(self, layer):
        layer.weights += -self.learning_rate * layer.dweights
        layer.biases += -self.learning_rate * layer.dbiases
```

***In C++ with Eigen***

```cpp
class OptimizerSGD {
private:
    float learningRate = 1.0;

public:
    OptimizerSGD(float learningRate) {
        this->learningRate = learningRate;
    }

    void updateParams(Layer& layer) {
        layer.weights.array() += (layer.dweights.array() * (this->learningRate * -1));
        layer.biases.array() += (layer.dbiases.array() * (this->learningRate * -1));
    }
};
```

### ***Put all the classes together***

***In Python***

```python
X, y = spiral_data(samples=100, classes=3)

dense1 = Layer_Dense(2, 64)
activation1 = Activation_ReLU()
dense2 = Layer_Dense(64, 3)
loss_activation = Activation_Softmax_Loss_CategoricalCrossEntropy()
optimizer = Optimizer_SGD()

for epoch in range(10001):
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = loss_activation.forward(dense2.output, y)

    predictions = np.argmax(loss_activation.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions == y)

    if not epoch % 100:
        print(f'epoch: {epoch}, ' +
              f'acc: {accuracy:.3f}, ' +
              f'loss: {loss:.3f}')

    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)

    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
```

***Output***

> The book points out that the accuracy being stuck at 0.6 ~ 0.7 is because the model is stuck in a **local minimum**.

```bash
epoch: 8500, acc: 0.663, loss: 0.743
epoch: 8600, acc: 0.627, loss: 0.822
epoch: 8700, acc: 0.653, loss: 0.761
epoch: 8800, acc: 0.653, loss: 0.734
epoch: 8900, acc: 0.653, loss: 0.709
epoch: 9000, acc: 0.653, loss: 0.720
epoch: 9100, acc: 0.663, loss: 0.711
epoch: 9200, acc: 0.633, loss: 0.791
epoch: 9300, acc: 0.653, loss: 0.761
epoch: 9400, acc: 0.643, loss: 0.753
epoch: 9500, acc: 0.653, loss: 0.749
epoch: 9600, acc: 0.660, loss: 0.743
epoch: 9700, acc: 0.687, loss: 0.705
epoch: 9800, acc: 0.610, loss: 0.855
epoch: 9900, acc: 0.623, loss: 0.799
epoch: 10000, acc: 0.670, loss: 0.681
```

***In C++ with Eigen***

:::info
:bulb: **Learning Rate, Momentum, Learning Rate Decay**

For the ***learning rate***: since our model gets stuck in a local minimum, the learning rate should not be fixed. To prevent gradient explosion, the learning rate should also not be set too large.

***Momentum*** can be added to the gradient update, which can help push the model weights over a small hill or roll them out of a small pit. It uses the previous update's direction to stop the update from bouncing around.
***Learning Rate Decay***

> This can be referred to as **1/t decay**

```python
starting_learning_rate = 1.0
learning_rate_decay = 0.1

for step in range(20):
    learning_rate = starting_learning_rate * (1.0 / (1 + learning_rate_decay * step))
    print(learning_rate)
```

```bash
1.0
0.9090909090909091
0.8333333333333334
0.7692307692307692
0.7142857142857143
0.6666666666666666
0.625
0.588235294117647
0.5555555555555556
0.5263157894736842
0.5
0.47619047619047616
0.45454545454545453
0.4347826086956522
0.41666666666666663
0.4
0.3846153846153846
0.37037037037037035
0.35714285714285715
0.3448275862068965
```
:::

***Python SGD Class with Learning Rate Decay and Momentum***

```python
class Optimizer_SGD:
    def __init__(self, learning_rate=1.0, decay=0.0, momentum=0.0):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.momentum = momentum

    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1.0 / (1.0 + self.decay * self.iterations))

    def update_params(self, layer):
        # self.momentum is a hyperparameter chosen when the optimizer is created
        # layer.weight_momentums start as all zeros
        if self.momentum:
            # Create the momentum arrays and initialize them with zeros
            if not hasattr(layer, 'weight_momentums'):
                layer.weight_momentums = np.zeros_like(layer.weights)
                layer.bias_momentums = np.zeros_like(layer.biases)

            weight_updates = self.momentum * layer.weight_momentums - self.current_learning_rate * layer.dweights
            layer.weight_momentums = weight_updates

            bias_updates = self.momentum * layer.bias_momentums - self.current_learning_rate * layer.dbiases
            layer.bias_momentums = bias_updates
        # vanilla SGD updates
        else:
            weight_updates = -self.current_learning_rate * layer.dweights
            bias_updates = -self.current_learning_rate * layer.dbiases

        layer.weights += weight_updates
        layer.biases += bias_updates

    def post_update_params(self):
        self.iterations += 1
```

***C++ with Eigen SGD Class with Learning Rate Decay and Momentum***

```cpp
class OptimizerSGD {
private:
    float learningRate;
    float curLearningRate;
    float decay;
    float iterations;
    float momentum;

public:
    OptimizerSGD(float learningRate = 1.0, float decay = 0.0, float momentum = 0.0) {
        this->learningRate = learningRate;
        this->curLearningRate = learningRate;
        this->decay = decay;
        this->iterations = 0;
        this->momentum = momentum;
    }

    void preUpdateParams() {
        // if decay is set, it will always be greater than 0
        if (this->decay > 0.0) {
            curLearningRate = learningRate * (1.0 / (1.0 + this->decay * this->iterations));
        }
    }

    void updateParams(Layer& layer) {
        // initialize weightMomentums with zeros if not initialized yet
        if (layer.weightMomentums.size() == 0) {
            layer.weightMomentums = MatrixXf::Zero(layer.weights.rows(), layer.weights.cols());
            layer.biasMomentums = MatrixXf::Zero(layer.biases.rows(), layer.biases.cols());
        }
        // if momentum is set, it will always be greater than 0
        if (this->momentum > 0.0) {
            MatrixXf weightUpdates = this->momentum * layer.weightMomentums - this->curLearningRate * layer.dweights;
            layer.weightMomentums = weightUpdates;

            MatrixXf biasUpdates = this->momentum * layer.biasMomentums - this->curLearningRate * layer.dbiases;
            layer.biasMomentums = biasUpdates;

            // Update weights and biases
            layer.weights += weightUpdates;
            layer.biases += biasUpdates;
        }
        // vanilla SGD updates
        else {
            MatrixXf weightUpdates = -this->curLearningRate * layer.dweights;
            MatrixXf biasUpdates = -this->curLearningRate * layer.dbiases;

            // Update weights and biases
            layer.weights += weightUpdates;
            layer.biases += biasUpdates;
        }
    }

    void postUpdateParams() {
        this->iterations += 1;
    }
};
```

### AdaGrad

Adaptive gradient (AdaGrad) provides a per-parameter learning rate by keeping a history of previous updates. *The bigger the accumulated sum of squared gradients is, the smaller the updates that are made.* This helps parameters that receive small updates keep up with the other parameters.
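This behavior can be sketched with two scalar parameters (an illustrative example, not from the book): with constant gradients, AdaGrad's normalization gives both parameters the same effective step size regardless of their raw gradient magnitude.

```python
import numpy as np

# Illustrative sketch of the AdaGrad rule: one parameter with large
# gradients, one with small gradients
learning_rate = 1.0
epsilon = 1e-7
gradient = np.array([8.0, 0.5])   # 16x difference in raw magnitude
cache = np.zeros_like(gradient)

for step in range(1, 4):
    cache += gradient ** 2
    update = -learning_rate * gradient / (np.sqrt(cache) + epsilon)
    # both parameters receive (almost) the same effective step: -1/sqrt(step)
    print(step, update)
```

Despite the 16x difference in gradient magnitude, both parameters move at the same effective rate, which is exactly the "keep up" effect described above.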
***Python Class Implementation***

```python
class Optimizer_Adagrad:
    def __init__(self, learning_rate=1.0, decay=0.0, epsilon=1e-7):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon

    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1.0 / (1.0 + self.decay * self.iterations))

    def update_params(self, layer):
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)

        layer.weight_cache += layer.dweights**2
        layer.bias_cache += layer.dbiases**2

        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)

    def post_update_params(self):
        self.iterations += 1
```

:::success
AdaGrad uses a constantly rising cache to limit the learning speed, which can eventually stall training.
> That's why this optimizer is not widely used, except for some specific applications.
:::

```cpp
class OptimizerAdagrad {
private:
    float learningRate;
    float curLearningRate;
    float decay;
    float iterations;
    float epsilon;

public:
    OptimizerAdagrad(float learningRate = 1.0, float decay = 0.0, float epsilon = 1e-6) {
        this->learningRate = learningRate;
        this->curLearningRate = learningRate;
        this->decay = decay;
        this->iterations = 0;
        this->epsilon = epsilon;
    }

    void preUpdateParams() {
        // apply 1/t learning rate decay, mirroring the Python class
        if (this->decay > 0.0) {
            curLearningRate = learningRate * (1.0 / (1.0 + this->decay * this->iterations));
        }
    }

    void updateParams(Layer& layer) {
        // initialize weightCache with zeros if not initialized yet
        if (layer.weightCache.size() == 0) {
            layer.weightCache = MatrixXf::Zero(layer.weights.rows(), layer.weights.cols());
            layer.biasCache = MatrixXf::Zero(layer.biases.rows(), layer.biases.cols());
        }

        layer.weightCache.array() += (layer.dweights.array() * layer.dweights.array());
        layer.biasCache.array() += (layer.dbiases.array() * layer.dbiases.array());

        layer.weights.array() += layer.dweights.array() * (-1 * this->curLearningRate) / (layer.weightCache.array().sqrt() + this->epsilon);
        layer.biases.array() += layer.dbiases.array() * (-1 * this->curLearningRate) / (layer.biasCache.array().sqrt() + this->epsilon);
    }

    void postUpdateParams() {
        this->iterations += 1;
    }
};
```

### RMSProp

RMSProp combines the momentum concept of SGD with the cache of AdaGrad. A hyperparameter rho is added to control how much the update history \(*cache*\) influences the next update.

***The only difference in code\:***

```python
# In AdaGrad:
cache += gradient ** 2
# In RMSProp:
cache = rho * cache + (1 - rho) * gradient ** 2
```

### Adam

Adaptive momentum (Adam) is the most widely used optimizer. It combines the following features:

* Adds momentum as in SGD
* Applies a per-weight adaptive learning rate with a cache as in RMSProp
* Adds a bias-correction mechanism \(*for the initially zeroed values*\) by dividing the momentums and caches by *1 - beta^step*

:::info
:bulb: **Momentums / Bias Correction**
The correction fraction approaches 1 in later training epochs.
Dividing by this small fraction at the beginning of training scales the first updates up, which speeds up early training.
:::

***Python Class Implementation***

```python
class Optimizer_Adam:
    def __init__(self, learning_rate=0.001, decay=0.0, epsilon=1e-7, beta_1=0.9, beta_2=0.999):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.beta_1 = beta_1
        self.beta_2 = beta_2

    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1.0 / (1.0 + self.decay * self.iterations))

    def update_params(self, layer):
        if not hasattr(layer, 'weight_cache'):
            layer.weight_momentums = np.zeros_like(layer.weights)
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_momentums = np.zeros_like(layer.biases)
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update momentums with current gradients
        layer.weight_momentums = self.beta_1 * layer.weight_momentums + (1 - self.beta_1) * layer.dweights
        layer.bias_momentums = self.beta_1 * layer.bias_momentums + (1 - self.beta_1) * layer.dbiases

        # Get corrected momentums
        weight_momentums_corrected = layer.weight_momentums / (1 - self.beta_1 ** (self.iterations + 1))
        bias_momentums_corrected = layer.bias_momentums / (1 - self.beta_1 ** (self.iterations + 1))

        # Calculate cache
        layer.weight_cache = self.beta_2 * layer.weight_cache + (1 - self.beta_2) * layer.dweights**2
        layer.bias_cache = self.beta_2 * layer.bias_cache + (1 - self.beta_2) * layer.dbiases**2

        # Get corrected cache
        weight_cache_corrected = layer.weight_cache / (1 - self.beta_2 ** (self.iterations + 1))
        bias_cache_corrected = layer.bias_cache / (1 - self.beta_2 ** (self.iterations + 1))

        layer.weights += -self.current_learning_rate * weight_momentums_corrected / (np.sqrt(weight_cache_corrected) + self.epsilon)
        layer.biases += -self.current_learning_rate * bias_momentums_corrected / (np.sqrt(bias_cache_corrected) + self.epsilon)

    def post_update_params(self):
        self.iterations += 1
```
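Why the correction helps early on can be seen by printing the factor `1 / (1 - beta_1 ** step)` used in the class above for a few steps (a quick sketch): the first momentums are scaled up roughly tenfold, and the factor fades toward 1 as training proceeds.

```python
beta_1 = 0.9

for step in (1, 2, 5, 10, 50):
    correction = 1.0 / (1.0 - beta_1 ** step)
    print(step, round(correction, 3))
    # step 1 -> 10.0, step 2 -> 5.263, ..., step 50 -> ~1.005
```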
***C++ with Eigen Class Implementation***

```cpp
class OptimizerAdam {
private:
    float learningRate;
    float curLearningRate;
    float decay;
    float iterations;
    float epsilon;
    float beta_1;
    float beta_2;

public:
    OptimizerAdam(float learningRate = 0.001, float decay = 0.0, float epsilon = 1e-6,
                  float beta_1 = 0.9, float beta_2 = 0.999) {
        this->learningRate = learningRate;
        this->curLearningRate = learningRate;
        this->decay = decay;
        this->iterations = 0;
        this->epsilon = epsilon;
        this->beta_1 = beta_1;
        this->beta_2 = beta_2;
    }

    void preUpdateParam() {
        if (this->decay > 0.0) {
            curLearningRate = learningRate * (1.0 / (1.0 + this->decay * this->iterations));
        }
    }

    void updateParams(Layer& layer) {
        if (layer.weightCache.size() == 0) {
            layer.weightMomentums = MatrixXf::Zero(layer.weights.rows(), layer.weights.cols());
            layer.weightCache = MatrixXf::Zero(layer.weights.rows(), layer.weights.cols());
            layer.biasMomentums = MatrixXf::Zero(layer.biases.rows(), layer.biases.cols());
            layer.biasCache = MatrixXf::Zero(layer.biases.rows(), layer.biases.cols());
        }

        // Update momentums
        layer.weightMomentums.array() = (layer.weightMomentums.array() * this->beta_1) + (layer.dweights.array() * (1.0f - this->beta_1));
        layer.biasMomentums.array() = (layer.biasMomentums.array() * this->beta_1) + (layer.dbiases.array() * (1.0f - this->beta_1));

        // Calculate cache
        layer.weightCache.array() = (layer.weightCache.array() * this->beta_2) + (layer.dweights.array().pow(2) * (1.0f - this->beta_2));
        layer.biasCache.array() = (layer.biasCache.array() * this->beta_2) + (layer.dbiases.array().pow(2) * (1.0f - this->beta_2));

        // Get corrected momentums and cache
        // TODO : the correction constants can be calculated once and stored for reuse
        MatrixXf weightMomentumsCorrected = layer.weightMomentums / (1.0f - std::pow(this->beta_1, this->iterations + 1));
        MatrixXf biasMomentumsCorrected = layer.biasMomentums / (1.0f - std::pow(this->beta_1, this->iterations + 1));
        MatrixXf weightCacheCorrected = layer.weightCache / (1.0f - std::pow(this->beta_2, this->iterations + 1));
        MatrixXf biasCacheCorrected = layer.biasCache / (1.0f - std::pow(this->beta_2, this->iterations + 1));

        // Update weights and biases: corrected momentums scaled by the corrected cache
        layer.weights.array() += weightMomentumsCorrected.array() * (-1 * this->curLearningRate) / (weightCacheCorrected.array().sqrt() + this->epsilon);
        layer.biases.array() += biasMomentumsCorrected.array() * (-1 * this->curLearningRate) / (biasCacheCorrected.array().sqrt() + this->epsilon);
    }

    void postUpdateParams() {
        this->iterations += 1;
    }
};
```

### Finally, Put It All Together

In this section, the *main function code* and the *output* are provided.

***In Python***

```python
dense1 = Layer_Dense(2, 64)
activation1 = Activation_ReLU()
dense2 = Layer_Dense(64, 3)
loss_activation = Activation_Softmax_Loss_CategoricalCrossEntropy()
optimizer = Optimizer_Adam(learning_rate=0.02, decay=1e-5)

for epoch in range(10001):
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = loss_activation.forward(dense2.output, y)

    predictions = np.argmax(loss_activation.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)

    if not epoch % 100:
        print(f'epoch: {epoch}, ' +
              f'acc: {accuracy:.3f}, ' +
              f'loss: {loss:.3f},' +
              f'lr: {optimizer.current_learning_rate}')

    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)

    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()
```

***Python Output***

```bash
epoch: 9000, acc: 0.960, loss: 0.083,lr: 0.018348792190754044
epoch: 9100, acc: 0.963, loss: 0.083,lr: 0.0183319737119497
epoch: 9200, acc: 0.973, loss: 0.085,lr: 0.018315186036502167
epoch: 9300, acc: 0.960, loss: 0.086,lr: 0.018298429079863496
epoch: 9400, acc: 0.967, loss: 0.083,lr: 0.018281702757794862
epoch: 9500, acc: 0.967, loss: 0.082,lr: 0.018265006986365174
epoch: 9600, acc: 0.970, loss: 0.090,lr: 0.018248341681949654
epoch: 9700, acc: 0.963, loss: 0.081,lr: 0.018231706761228456
epoch: 9800, acc: 0.967, loss: 0.082,lr: 0.018215102141185255
epoch: 9900, acc: 0.963, loss: 0.081,lr: 
0.018198527739105907
epoch: 10000, acc: 0.967, loss: 0.081,lr: 0.018181983472577025
```

***In C++ with Eigen***

*main.cpp*

```cpp
#include "framework.h"

int main(int argc, char** argv) {
    MatrixXf spiralDataX(15, 2);
    spiralDataX << 0.0, 0.0,
                   0.22627203, -0.10630602,
                   -0.49370526, -0.07908932,
                   0.36122907, 0.65727738,
                   -0.11046591, -0.99387991,
                   0.0, 0.0,
                   0.05269733, 0.24438288,
                   0.3019796, -0.39850761,
                   -0.70998173, 0.2417146,
                   0.88005838, -0.4748655,
                   0.0, 0.0,
                   -0.13811948, -0.20838188,
                   -0.02514717, 0.49936722,
                   -0.04003289, -0.74893082,
                   -0.98680023, -0.16194228;

    MatrixXf spiralDataY(15, 3);
    spiralDataY << 1.0, 0.0, 0.0,
                   1.0, 0.0, 0.0,
                   1.0, 0.0, 0.0,
                   1.0, 0.0, 0.0,
                   1.0, 0.0, 0.0,
                   0.0, 1.0, 0.0,
                   0.0, 1.0, 0.0,
                   0.0, 1.0, 0.0,
                   0.0, 1.0, 0.0,
                   0.0, 1.0, 0.0,
                   0.0, 0.0, 1.0,
                   0.0, 0.0, 1.0,
                   0.0, 0.0, 1.0,
                   0.0, 0.0, 1.0,
                   0.0, 0.0, 1.0;

    Dense dense1(2, 64);
    ReLU activation1;
    Dense dense2(64, 3);
    SoftmaxCategoricalCrossentropyLoss loss_activation;
    OptimizerAdam optimizer(0.02, 1e-5);

    for (int i = 0; i < 10000; ++i) {
        // forward pass
        dense1.forward(spiralDataX);
        activation1.forward(dense1.outputs);
        dense2.forward(activation1.outputs);
        float loss = loss_activation.forward(dense2.outputs, spiralDataY);
        std::cout << i << " : Loss -> " << loss << "\n";

        // backward pass
        loss_activation.backward(loss_activation.outputs, spiralDataY);
        dense2.backward(loss_activation.dinputs);
        activation1.backward(dense2.dinputs);
        dense1.backward(activation1.dinputs);

        // optimizer step
        optimizer.preUpdateParam();
        optimizer.updateParams(dense1);
        optimizer.updateParams(dense2);
        optimizer.postUpdateParams();
    }
    return 0;
}
```

***Output***

```bash
9978 : Loss -> 0.265253
9979 : Loss -> 0.265239
9980 : Loss -> 0.265226
9981 : Loss -> 0.265212
9982 : Loss -> 0.265198
9983 : Loss -> 0.265185
9984 : Loss -> 0.265171
9985 : Loss -> 0.265157
9986 : Loss -> 0.265144
9987 : Loss -> 0.26513
9988 : Loss -> 0.265116
9989 : Loss -> 0.265103
9990 : Loss -> 0.265089
9991 : Loss -> 0.265075
9992 : Loss -> 0.265062
9993 : Loss -> 0.265048
9994 : Loss -> 0.265034
9995 : Loss -> 0.26502
9996 : Loss -> 0.265007
9997 : Loss -> 0.264993
9998 : Loss -> 0.264979
9999 : Loss -> 0.264966
```
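As a closing sanity check, the same Adam update rule can be run on a single scalar minimizing f(x) = x² (an illustrative sketch outside the framework, not from the book): the corrected momentum and cache should drive x toward the minimum at 0.

```python
import math

# Illustrative scalar Adam minimizing f(x) = x**2, whose gradient is 2*x
x = 3.0
learning_rate, beta_1, beta_2, epsilon = 0.1, 0.9, 0.999, 1e-7
m = v = 0.0

for t in range(1, 501):
    g = 2.0 * x
    m = beta_1 * m + (1 - beta_1) * g            # momentum
    v = beta_2 * v + (1 - beta_2) * g ** 2       # cache
    m_hat = m / (1 - beta_1 ** t)                # bias-corrected momentum
    v_hat = v / (1 - beta_2 ** t)                # bias-corrected cache
    x += -learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)

print(x)  # ends close to the minimum at 0
```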