# Usage

Create python virtual environment and install dependencies
```
python3 -m venv env && source env/bin/activate && pip install coloredlogs numpy matplotlib
```

Split data into test and training set
```
python3 data_scripts/seperate_data.py dataset/data.csv dataset/
```

Run training program
```
python3 train/train.py dataset/data_train.csv dataset/data_test.csv -e 100
```

Generate reports and graphs
```
python3 report/generate_report.py <historics_directory>
```

Run prediction program
```
python3 train/predict.py <layer_weights> <layer_bias> <config> <test_dataset>
```
> Replace the values in <> with the output files generated by the training program

# /dev/log for multilayer_perceptron

## Multilayer perceptron

The **multilayer perceptron** is loosely modeled on the human brain: it *learns* to associate specific inputs with specific outputs. The most basic unit of a multilayer perceptron is a single perceptron, which can be represented as a function with multiple inputs and a single output (the activation function).

The inputs of these functions are the outputs of the activation functions of the previous layer, and the outputs of these activation functions are connected to all nodes of the next layer.

A 'layer' is formed when there are multiple perceptrons of the same type processing different values of that type.

There will always be an input layer that reads arbitrary input and an output layer for predictions. The layers in between (the hidden layers) derive their own context from the data set and the objective.

![image](https://hackmd.io/_uploads/HJqztzr8A.png)

The entire learning process is a 3-step procedure.
1. Make a guess from the output
2. Adjust the perceptron so that the guess becomes more correct (back propagation)
3. Repeat (epoch)

Once the learning is done you may use the same weights and biases from the training process to predict new data.

## Activation function

The activation function has the same structure for all MLPs.
It takes the sum of all its inputs, applies a sigmoid function to it and returns the output. Each individual input may be multiplied by a constant (weight), and an arbitrary value (bias) may be added to or subtracted from the sum before the sigmoid is applied.

![image](https://hackmd.io/_uploads/Sk6sUwHIR.png)

We can represent a single layer's output (rhs) in linear algebra form as below.

![image](https://hackmd.io/_uploads/rku2uDH8A.png)
> Credits to 3b1b

`k` is the number of nodes in the current layer while `n` is the number of nodes in the previous layer.

## Backpropagation

Usually we will have a **loss function** that calculates the loss value of each node in the output layer with respect to the connections the output layer has with the last hidden layer. The loss function determines how much an activation value needs to change and whether it should increase or decrease.

Say we have a scenario like so:

![image](https://hackmd.io/_uploads/rkERFDSUA.png)

It would make sense to change the weights (`w1` and `w2`) since we don't have direct influence over the previous activation values. Since the 1.0 activation value is big, it also makes sense to change the weight connected to that node, since it is the most sensitive (weight multiplied by a high activation).

![image](https://hackmd.io/_uploads/HyQgcDSUC.png)

In a similar fashion, if we have influence on the activations but not on the weights, it also makes sense that changes made to `a` are more sensitive, for the same reason as above.

**Conclusion:** changes in weight have similar effects and constraints to changes in activation value, but we only have direct influence on the weights (tuning).

But why do changes in activation matter? Since a change in weight has the same effect as a change in activation, changing both of them can help the output layer reach the loss function minimum faster.
The required change in weight can be propagated back in the form of a required change in activation, making that a loss function output for the previous layer. If there are multiple output nodes in the current layer, the change in activation should be the average of all the weight changes connected to that activation node, to achieve desirable effects for all weights.

![image](https://hackmd.io/_uploads/ryW6cvHUR.png)

This effect carries on until the input layer, hence **back propagation**.

The calculus part of backpropagation is similar to logistic regression, as they both use the chain rule. A simple network with two hidden layers and one node per layer is used as an example here.

![image](https://hackmd.io/_uploads/ryDZpDSLR.png)

We also define some functions from the last layer, forming the computation tree below, which we can use to obtain the **partial derivatives for gradient descent.**

![image](https://hackmd.io/_uploads/r1x1SOBUA.png)

To get the derivatives of the other weights and biases, we can substitute the variables with their lower-index counterparts. And since the cost function above only covers one training example, we need to average the costs over all examples to get the final cost.

![image](https://hackmd.io/_uploads/HkjZUdH8C.png)

For a multilayer network with multiple nodes per layer, the original formulas just need some slight amendments.

![image](https://hackmd.io/_uploads/ryH_-urUA.png)

With this in mind, the steps to optimize all weights and biases simultaneously are as follows:
1. Initialize random values for all weights and biases
2. Epoch loop until the desired epoch limit is reached
    - For all the examples, calculate the cost function value and populate the activation function matrix by running the neural network
    - Using their respective derivatives, the sum of each input value, the current weights and biases, and the current activation values, generate a new matrix of weights by taking 1 step using gradient descent

> Since the input itself can be a matrix of its own, we can omit the state storing process with a dedicated matrix library (Numpy)

**Insert diagram of linear algebra formula here**

## Activation functions as graphs

There is a way to visualize the neural network in the form of 2D graphs. Take the following multilayer perceptron for example:

![image](https://hackmd.io/_uploads/rku2uDH8A.png)

And the initial weights and biases are
```
W1 = -34.4
b1 = 2.14
W2 = -2.53
b2 = 1.29
W3 = -1.3
b3 = -0.58
W4 = 2.28
```

If we are using the softplus function for the activation, our activation nodes will look like these:

![image](https://hackmd.io/_uploads/S1z8jcSUC.png)

And our training data looks like this:

![image](https://hackmd.io/_uploads/HJZ725SUA.png)

Say we feed the input `0` to the neural network: the input is multiplied by the weight and the bias is added, resulting in `2.14`. After putting it through the softplus function we have the output `2.25`.

We can also do this at `a1` for the other training inputs, and we will get a range of values:

![Screenshot 2024-06-23 at 21.08.38](https://hackmd.io/_uploads/HkN9C5BIC.png)

The same thing repeats for the next layer until we get another curve

![Screenshot 2024-06-23 at 21.12.20](https://hackmd.io/_uploads/rJuSyiHIR.png)

That was our first curve. The second curve is formed in the same manner for the second node in the layer, `a2`.

![Screenshot 2024-06-23 at 21.09.10-min](https://hackmd.io/_uploads/r1ZgyoBUR.png)

We then add the `y` values of the curves and get the resultant curve. That resultant curve is then **offset by the bias** and can be used for prediction.
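
The worked numbers above can be checked with a few lines of NumPy. This is only a sketch: the helper names `softplus` and `a1` are made up for illustration, pairing `W1` with `b1` for the first node is an assumption based on the text, and the extra inputs `0.5` and `1.0` are hypothetical since the real training data is only shown as an image.

```python
import numpy as np


def softplus(x):
    """log(1 + exp(x)), computed in a numerically stable way."""
    return np.logaddexp(0.0, x)


# Values from the text; pairing W1 with b1 for node a1 is an assumption.
W1, b1 = -34.4, 2.14


def a1(x):
    """Activation of the first hidden node for input(s) x."""
    return softplus(W1 * np.asarray(x, dtype=float) + b1)


# The text's worked example: input 0 -> 2.14 before activation -> 2.25 after.
print(round(float(a1(0.0)), 2))  # 2.25

# Hypothetical inputs, only to show how a curve is traced point by point.
xs = np.array([0.0, 0.5, 1.0])
print(a1(xs))
```

Evaluating `a1` over many inputs gives one curve per node; summing the weighted curves of the nodes and adding the final bias gives the resultant curve described above.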
In theory, with enough layers and nodes we can fit any shape of data with such a line.

## Softmax and cross entropy

The **softplus** is an activation function whose derivative is the sigmoid function: `f(x)=log(1+exp(x))`

Softmax is a function that maps multiple input values to real values between 0 and 1, and the sum of those values will always be 1 (probabilistic).

![image](https://hackmd.io/_uploads/rJTqyirL0.png)

Cross entropy is the metric used as the loss function for softmax output layers. The formula for cross entropy is like so:

![image](https://hackmd.io/_uploads/r1PygjSUA.png)

## Gradient descent with momentum and Nesterov momentum

Momentum is an optimization method used in gradient descent that helps with overcoming local minima through controlled overshooting. Consider the following graph on which to perform gradient descent:

![image](https://hackmd.io/_uploads/B18tgsBL0.png)

As we can see, the gradient descent stops at the local minimum, which is not the ideal solution. One can also tune the learning rate to be larger, however this also increases the chances of overshooting the global minimum. One way to mitigate this is to introduce the concept of momentum.

Currently, the gradient descent update looks like this:
```
y_t = y_t-1 - a * g

y_t = current y value
y_t-1 = previous y value
a = learning rate
g = gradient at previous y value
```

With the introduction of momentum, the gradient descent formula would look like this instead
```
v_t = (u * v_t-1) - (a * g)
y_t = y_t-1 + v_t

v_t = current velocity
v_t-1 = previous velocity
u = constant for velocity percentage loss
y_t = current y value
y_t-1 = previous y value
a = learning rate
g = gradient at previous y value
```

With each iteration, we notice that `v` grows with the gradient. To be more precise, we take larger steps when we have been traveling in the same direction for some time.
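
As a minimal sketch of the momentum update above (not taken from the project code), the function below applies `v_t = u * v_t-1 - a * g` and `y_t = y_t-1 + v_t` to a one-dimensional objective. The name `descend`, the objective `f(y) = y^2`, and the hyperparameter values are all assumptions chosen for illustration; setting `u = 0` reduces it to plain gradient descent.

```python
def descend(grad, y0, lr, u, steps):
    """Gradient descent with momentum; returns every visited y value.

    grad  : function returning the gradient at a given y
    y0    : starting position
    lr    : learning rate (a in the formula)
    u     : velocity retention constant (u = 0 gives plain gradient descent)
    steps : number of updates to run
    """
    y, v = y0, 0.0
    path = [y]
    for _ in range(steps):
        g = grad(y)          # gradient at the previous y value
        v = u * v - lr * g   # v_t = (u * v_t-1) - (a * g)
        y = y + v            # y_t = y_t-1 + v_t
        path.append(y)
    return path


def grad(y):
    """Gradient of the hypothetical objective f(y) = y^2."""
    return 2.0 * y


plain = descend(grad, y0=5.0, lr=0.01, u=0.0, steps=20)
with_momentum = descend(grad, y0=5.0, lr=0.01, u=0.9, steps=20)

# Step sizes grow while we keep moving in the same direction.
step_sizes = [abs(b - a) for a, b in zip(with_momentum, with_momentum[1:])]
print(step_sizes[:4])
print(plain[-1], with_momentum[-1])
```

On this objective the momentum run gets closer to the minimum in the same number of steps, but it also overshoots past zero, which is the oscillation that Nesterov momentum tries to reduce.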
With momentum, our descent graph now looks like this.

![image](https://hackmd.io/_uploads/HkGkfoS8C.png)

This can be improved further, because unnecessary momentum is carried over when reaching the global minimum, which causes additional oscillations. This is solved by Nesterov momentum, where the definition of `g` is changed like so:
```
g = gradient at (previous y value + u * v_t-1)
```

Instead of calculating the step using just the previous `y` value, it also includes the distance of the momentum jump. Doing this starts decreasing the velocity earlier, because the change of direction at the global minimum is detected earlier.

## Early stopping

As neural networks get bigger and more complex, they become more prone to overfitting. Early stopping mitigates this by stopping the learning iterations when the validation loss reaches its lowest point.

![image](https://hackmd.io/_uploads/Syk6zorLC.png)

## Cross entropy with Softmax versus mean square error

Cross entropy and mean square error are both loss functions. Mean square error is the usual choice for regression, whereas cross entropy is more widely used for classification problems. If we plot both of these loss functions on a graph we get:

![image](https://hackmd.io/_uploads/Hkxf7QsB8A.png)

As we can see, we incur more loss for smaller changes with cross entropy, because its curve has a steeper gradient than MSE.

## RMS prop

RMS prop, also known as root mean square propagation, is an optimization used in gradient descent to make it reach the minimum faster. Consider the following topology:

![image](https://hackmd.io/_uploads/BkZJViH8C.png)

One set of arrows represents classic gradient descent when approaching the minimum, while the arrows in black represent RMS prop. As we can notice, classic gradient descent oscillates a lot in the Y direction due to the imbalanced aspect ratio of the topology. RMS prop improves this by dividing the derivative by the root mean square of the previous derivatives instead of using the actual derivative directly.
The formula to integrate this is as follows:

![image](https://hackmd.io/_uploads/BJw7EiHIR.png)

The intuition behind the formula is that since we want to slow down the change in the Y axis, we need to divide the derivative by a big number instead of using the derivative itself, and vice versa for the X axis. The formula keeps a moving average of the squared derivatives and gives us a denominator based on how large that average is.

## Implementation details (train)

### Data normalization and standardization

When the user inputs the data into my program, I have a custom function that reads the data set in the csv file and parses it into a 1-dimensional list of scalars.

![image](https://hackmd.io/_uploads/Hk0p08SIR.png)

Keep in mind that `DATA_MODEL` is a custom struct which stores the schema of the data as well as additional metadata

```python
DATA_MODEL = [
    {"name": "Id", "idx": 0, "type": "int"},
    {"name": "Diagnosis", "idx": 1, "type": "string"},
    {"name": "Radius_N1", "idx": 2, "type": "float"},
    {"name": "Texture_N1", "idx": 3, "type": "float"},
    {"name": "Perimeter_N1", "idx": 4, "type": "float"},
    {"name": "Area_N1", "idx": 5, "type": "float"},
    {"name": "Smoothness_N1", "idx": 6, "type": "float"},
    {"name": "Compactness_N1", "idx": 7, "type": "float"},
    {"name": "Concavity_N1", "idx": 8, "type": "float"},
    {"name": "Concave points_N1", "idx": 9, "type": "float"},
    {"name": "Symmetry_N1", "idx": 10, "type": "float"},
    {"name": "Fractal dimension_N1", "idx": 11, "type": "float"},
    {"name": "Radius_N2", "idx": 12, "type": "float"},
    {"name": "Texture_N2", "idx": 13, "type": "float"},
    {"name": "Perimeter_N2", "idx": 14, "type": "float"},
    {"name": "Area_N2", "idx": 15, "type": "float"},
    {"name": "Smoothness_N2", "idx": 16, "type": "float"},
    {"name": "Compactness_N2", "idx": 17, "type": "float"},
    {"name": "Concavity_N2", "idx": 18, "type": "float"},
    {"name": "Concave points_N2", "idx": 19, "type": "float"},
    {"name": "Symmetry_N2", "idx": 20, "type": "float"},
    {"name": "Fractal dimension_N2", "idx": 21, "type": "float"},
    {"name": "Radius_N3", "idx": 22, "type": "float"},
    {"name": "Texture_N3", "idx": 23, "type": "float"},
    {"name": "Perimeter_N3", "idx": 24, "type": "float"},
    {"name": "Area_N3", "idx": 25, "type": "float"},
    {"name": "Smoothness_N3", "idx": 26, "type": "float"},
    {"name": "Compactness_N3", "idx": 27, "type": "float"},
    {"name": "Concavity_N3", "idx": 28, "type": "float"},
    {"name": "Concave points_N3", "idx": 29, "type": "float"},
    {"name": "Symmetry_N3", "idx": 30, "type": "float"},
    {"name": "Fractal dimension_N3", "idx": 31, "type": "float"},
]
```

For the data normalization and standardization part, I've created custom functions that use min-max normalization as well as Z-score standardization to process the data before I pass it into my perceptron.

My normalization and standardization functions return the weights for those operations (the min/max values, and the mean and standard deviation), and I use the same weights to run normalization and standardization on my test data set.

```python
min_max_weights_train = ft_preprocess.normalize_features(raw_data_train)
ft_preprocess.normalize_wtith_weights(raw_data_test, min_max_weights_train)
mean_and_stddev_train = ft_preprocess.standardize_features(raw_data_train)
ft_preprocess.standardize_with_weights(raw_data_test, mean_and_stddev_train)
```

A custom reporter is made to record historic events for the neural network while it is learning

```python
reporter = ft_reporter.Ft_reporter(args.historic_path, args.historic_name)
```

Once everything is initialized, I create a perceptron with all of the previous values. Behind the scenes, this creates all of the necessary layers and classifies each layer into its own specific type. It also generates the truth matrix for the train data set as well as the test data set.
```python
perceptron = ft_perception.Ft_perceptron(
    args.layer,
    args.epochs,
    args.loss,
    args.batch_size,
    args.learning_rate,
    args.output,
    min_max_weights_train,
    mean_and_stddev_train,
    raw_data_train,
    raw_data_test,
    reporter
)
```

The perceptron also exposes a `begin_train` function which, for `epoch` number of times, runs the following procedures in order:
1. It generates a randomly selected batch based on the inputted batch size
2. It then further processes the truth values for that batch and turns them into an `np.array` matrix. Before we run the feed forward algorithm we also generate the input matrix based on the number of inputs we have in our data set. This won't affect the computation of the weights and biases, because we are using linear algebra, which allows us to compute multiple different entries all at once.
3. Proceed to run the feed forward and back propagation procedures on the training data set, as well as calculate the metrics from the result of the feed forward
4. Run only the feed forward procedure on the test data set and obtain the metrics and results.
5. Based on an arbitrary warm-up limit that we set, implement early stopping to stop the epoch loop when we observe that the accuracy for the test data set starts to decrease after that number of epochs
6. Apply the weight changes obtained through the feed forward and back propagation process on the training data set, and then log the current progress to the console

```python
import logging
import random

import numpy as np

# ft_math is the project's own helper module


def begin_train(self):
    truth_train = self.train_truth
    truth_test = self.test_truth
    warmup_threshold = 128  # how many epochs to run before we check for early stopping?
    last_test_accuracy = 0

    for i in range(self.epoch_count):
        # generate batch randomly based on batch size
        if self.batch_size < 0:
            indices = range(0, len(self.dataset_train))
        else:
            indices = random.sample(population=range(0, len(self.dataset_train)), k=self.batch_size)
        train_batches = [self.dataset_train[x] for x in indices]

        # build the truths in the same order as `indices` so that every
        # truth column lines up with its entry in train_batches
        truth_batch_scalar = [[truth_train[0][idx], truth_train[1][idx]] for idx in indices]
        train_batch_truths = np.array(truth_batch_scalar).T

        # forwardfeed and backpropagate
        self.layers[0].lhs_activation = self.generate_input_matrix(train_batches)
        last_layer_error_train = self.feed_forward_and_backprop_train(train_batch_truths)
        accuracy_train = ft_math.get_accuracy(self.layers[-1].rhs_activation, train_batch_truths)
        recall_train = ft_math.get_recall(self.layers[-1].rhs_activation, train_batch_truths)

        # forwardfeed test set
        self.layers[0].lhs_activation = self.generate_input_matrix(self.dataset_test)
        last_layer_error_test = self.feed_forward_test(truth_test)
        accuracy_test = ft_math.get_accuracy(self.layers[-1].rhs_activation, truth_test)
        recall_test = ft_math.get_recall(self.layers[-1].rhs_activation, truth_test)

        test_error = np.abs(np.mean(last_layer_error_test))
        train_error = np.abs(np.mean(last_layer_error_train))
        self.reporter.report_event("train", train_error, accuracy_train, recall_train)
        self.reporter.report_event("test", test_error, accuracy_test, recall_test)

        # check for early stopping once the warm-up period is over
        if i >= warmup_threshold:
            if accuracy_test < last_test_accuracy:
                logging.info(f"Epoch {i} finished; train loss {round(train_error, 3)}, validation loss {round(test_error, 3)} validation acc {accuracy_test} < {last_test_accuracy}, early stopping condition met, stopping.")
                break
        last_test_accuracy = accuracy_test

        # apply weight changes
        self.apply_derivatives_reset_cache()
        logging.info(f"Epoch {i} finished; train loss {round(train_error, 3)}, validation loss {round(test_error, 3)} validation acc {round(accuracy_test, 3)}")

    self.reporter.generate_report()
    logging.info("Historic report generated")
    self.write_to_output()
    logging.info("Weights written")
```

The feed forward function is quite direct. It iterates through all the layers in the current perceptron, and for each layer it runs the activation function for that particular layer.

```python
def feed_forward(self):
    # iterate through all layers (assume input layer LHS is already set)
    for layer_idx, layer in enumerate(self.layers):
        # run activation function, this would populate the current layer.rhs
        layer.run_activation()

        # set next layer lhs depends on current layer type
        if layer.type != "output":
            self.layers[layer_idx + 1].lhs_activation = layer.rhs_activation

    last_layer_activation = self.layers[-1].rhs_activation
    return last_layer_activation
```

The back propagation function iterates through the layers in reverse. Depending on the type of the layer, it runs certain hard-coded derivative functions which compute the derivatives for that layer so that we can do gradient descent.
```python
def backprop(self, truth):
    # this value will change to store the current layers dz value
    # for the previous layer to access
    last_dz = None

    # this value will change to store the current layers weights
    # for the previous layer to access
    last_layer_weights = None

    for layer_idx, layer in enumerate(reversed(self.layers)):
        if layer.type == "output":
            # the output layer derives dz directly from the truth values
            dz = ft_math.dcost_dz_output_np(layer.rhs_activation, truth, layer.pre_softmax_x_values, self.output_loss_type)
            dw = ft_math.dcost_dw_output_np(dz, layer.lhs_activation)
            db = ft_math.dcost_db_output_np(dz)
        else:
            # hidden layers derive dz from the dz and weights of the layer after them
            dz = ft_math.dcost_dz_hidden_np(last_layer_weights, last_dz, layer.rhs_activation)
            dw = ft_math.dcost_dw_hidden_np(dz, layer.lhs_activation)
            db = ft_math.dcost_db_hidden_np(dz)

        # cache the derivatives; they are applied after the epoch finishes
        layer.pending_weights_derivatives = dw
        layer.pending_bias_derivatives = db
        last_dz = dz
        last_layer_weights = layer.weights
        logging.debug(f"[backprop]\n{layer.type} layer {len(self.layers) - layer_idx} dz\n{dz}\ndw\n{dw}\ndb\n{db}")
```

In the function where we apply the derivatives and obtain the new step size, we first apply RMS prop to the current derivatives before we apply the momentum, using some saved variables from the class. After those two adjustments are applied, we change the individual weights and biases for all the layers. We then reset the derivative cache for those layers back to zero to prepare them for the next epoch loop.
```python
def apply_derivatives_reset_cache(self):
    for layer in self.layers:
        # apply rmsprop
        new_s_dw = (self.rmsprop_ratio * layer.s_weights) + ((1 - self.rmsprop_ratio) * np.square(layer.pending_weights_derivatives))
        rms_w_derivatives = layer.pending_weights_derivatives / np.sqrt(new_s_dw + self.rmsprop_stabilizer)
        layer.s_weights = new_s_dw

        new_s_db = (self.rmsprop_ratio * layer.s_bias) + ((1 - self.rmsprop_ratio) * np.square(layer.pending_bias_derivatives))
        rms_b_derivatives = layer.pending_bias_derivatives / np.sqrt(new_s_db + self.rmsprop_stabilizer)
        layer.s_bias = new_s_db

        # apply momentum
        new_weights_velocity = self.momentum_decay * layer.weights_velocity - (self.learning_rate * rms_w_derivatives)
        layer.weights_velocity = new_weights_velocity

        new_bias_velocity = self.momentum_decay * layer.bias_velocity - (self.learning_rate * rms_b_derivatives)
        layer.bias_velocity = new_bias_velocity

        layer.weights = layer.weights + new_weights_velocity
        layer.bias = layer.bias + new_bias_velocity

        # clear cache matrix
        layer.pending_weights_derivatives = np.zeros(layer.weights.shape)
        layer.pending_bias_derivatives = np.zeros(layer.bias.shape)
```
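
To see the combined update in isolation, here is a minimal standalone sketch of the same RMS prop plus momentum step applied to a toy two-parameter problem. It is not the project code: the function names, the objective `f(x, y) = x^2 + 25 y^2` (chosen because it is much steeper in `y` than in `x`), and the hyperparameter values (`lr`, `ratio`, `decay`, `eps`) are all assumptions for illustration.

```python
import numpy as np


def rmsprop_momentum_step(w, grad, s, v, lr=0.01, ratio=0.9, decay=0.9, eps=1e-8):
    """One update: RMS prop scaling first, then momentum.

    w    : current parameters
    grad : gradient of the loss at w
    s    : moving average of squared gradients (RMS prop state)
    v    : velocity (momentum state)
    """
    s = ratio * s + (1 - ratio) * np.square(grad)
    scaled_grad = grad / np.sqrt(s + eps)
    v = decay * v - lr * scaled_grad
    return w + v, s, v


def loss(w):
    """Hypothetical objective, 25x steeper along y than along x."""
    return w[0] ** 2 + 25.0 * w[1] ** 2


def loss_grad(w):
    return np.array([2.0 * w[0], 50.0 * w[1]])


def run(steps, start=(3.0, 2.0)):
    """Run `steps` updates from `start`; returns final parameters and velocity."""
    w = np.array(start, dtype=float)
    s = np.zeros_like(w)
    v = np.zeros_like(w)
    for _ in range(steps):
        w, s, v = rmsprop_momentum_step(w, loss_grad(w), s, v)
    return w, v


w_first, v_first = run(1)
w_final, _ = run(200)
print("first step per axis:", v_first)
print("loss before:", loss(np.array([3.0, 2.0])), "loss after:", loss(w_final))
```

Although the raw gradient along `y` is far larger than along `x`, the first step is the same size on both axes, which is the rescaling effect described in the RMS prop section. The momentum term then accumulates these rescaled steps in the same way as `weights_velocity` and `bias_velocity` in the method above.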