Create a Python virtual environment and install dependencies
python3 -m venv env && source env/bin/activate && pip install coloredlogs numpy matplotlib
Split the data into training and test sets
python3 data_scripts/seperate_data.py dataset/data.csv dataset/
Run training program
python3 train/train.py dataset/data_train.csv dataset/data_test.csv -e 100
Generate reports and graphs
python3 report/generate_report.py <historics_directory>
Run prediction program
python3 train/predict.py <layer_weights> <layer_bias> <config> <test_dataset>
Replace the values in <> with the output files generated by the training program
The multilayer perceptron is loosely modeled on the human brain: it learns to associate specific inputs with specific outputs. The most basic unit of a multilayer perceptron is a single perceptron, which can be represented as a function with multiple inputs and one single output (the activation function).
The inputs of these functions are the outputs of the activation functions in the previous layer, and the outputs of these activation functions are connected to all nodes of the next layer. A 'layer' is formed when multiple perceptrons of the same type process different values of that type.
There will always be an input layer that reads arbitrary input and an output layer for predictions. The layers in between derive their own context based on the dataset and the objective.
The entire learning process is a three-step procedure.
Once learning is done, you can reuse the weights and biases from the training process to predict on new data.
The activation function works the same way in all MLPs. It takes the sum of all its inputs, applies a sigmoid function to it, and returns the output. Each individual input may be multiplied by a constant (weight), and the sum may also add or subtract an arbitrary value prior to the sigmoid (bias).
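As a minimal sketch (not the project's actual code), a single perceptron with a sigmoid activation could look like this:

import math

def perceptron(inputs, weights, bias):
    # weighted sum of all inputs, shifted by the bias
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # the sigmoid squashes the sum into the range (0, 1)
    return 1 / (1 + math.exp(-z))

perceptron([0.5, 0.9], weights=[0.4, -0.2], bias=0.1)  # -> ~0.530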
We can represent a single layer's output (rhs) in linear-algebra form below.
Credits to 3b1b
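For reference, the formula being described (reconstructed in 3b1b's notation) is:

$$
a^{(1)} = \sigma\left(
\begin{bmatrix}
w_{0,0} & w_{0,1} & \cdots & w_{0,n} \\
w_{1,0} & w_{1,1} & \cdots & w_{1,n} \\
\vdots & \vdots & \ddots & \vdots \\
w_{k,0} & w_{k,1} & \cdots & w_{k,n}
\end{bmatrix}
\begin{bmatrix} a^{(0)}_{0} \\ a^{(0)}_{1} \\ \vdots \\ a^{(0)}_{n} \end{bmatrix}
+
\begin{bmatrix} b_{0} \\ b_{1} \\ \vdots \\ b_{k} \end{bmatrix}
\right)
$$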
where k is the number of nodes in the current layer and n is the number of nodes in the previous layer.
Usually we have a loss function that calculates the loss value of each node in the output layer with respect to the connections the output layer has with the last hidden layer.
The loss function determines how much an activation value needs to change, and whether it should increase or decrease.
Say we have a scenario like so:
It wouldn't make sense to change the previous activation values (a1 and a2), since we don't have direct influence over them; we can only tune the weights (w1 and w2). Since the 1.0 activation value is large, it makes sense to change the weight connected to that node, since it is the most sensitive (weight multiplied by a high activation).
In a similar fashion, if we had influence over the activations but not the weights, it would make sense to change the activation attached to the larger weight, since it is more sensitive for the same reason as above.
Conclusion: changes in weights have similar effects and constraints to changes in activation values, but we only have direct influence over the weights (tuning).
But why do changes in activation matter? Since a change in weight has the same effect as a change in activation, changing both can help the output layer reach the loss function's minimum faster.
The required change in weight can be propagated back in the form of a required change in activation, making that the loss-function output for the previous layer.
If there are multiple output nodes in the current layer, the change in an activation should be the average of all the changes requested through the weights connected to that activation node, to achieve a desirable effect for all output nodes.
This effect carries on until the input layer, hence back propagation.
The calculus part of backpropagation is similar to logistic regression, as both use the chain rule. A simple network with two hidden layers and one node per layer is used as an example here.
We also define some functions for the last layer, forming a computation tree that looks like the following, which we can use to obtain the partial derivatives for gradient descent.
To get the derivatives of the other weights and biases, we can substitute the variables with their lower-index counterparts. And since the cost function above only covers one training example, we need to average the costs over all examples to get the final cost.
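For the single-node case, writing $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$, $a^{(L)} = \sigma(z^{(L)})$ and $C_0 = (a^{(L)} - y)^2$, the chain rule expands to:

$$
\frac{\partial C_0}{\partial w^{(L)}}
= \frac{\partial z^{(L)}}{\partial w^{(L)}}
  \frac{\partial a^{(L)}}{\partial z^{(L)}}
  \frac{\partial C_0}{\partial a^{(L)}}
= a^{(L-1)} \cdot \sigma'(z^{(L)}) \cdot 2\left(a^{(L)} - y\right)
$$

and the final cost averaged over $m$ training examples is $C = \frac{1}{m}\sum_{j=0}^{m-1} C_j$.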
For a multi-layer network with multiple nodes per layer, the original formulas only need slight amendments.
With this in mind, the steps to optimize all weights and biases simultaneously are as follows:
Since the input itself can be a matrix of its own, we can omit the state-storing process by using a dedicated matrix library (NumPy).
Insert diagram of linear algebra formula here
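A minimal NumPy sketch of this vectorized step (illustrative only; not the project's exact code):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def layer_forward(W, b, A_prev):
    # W: (k, n) weights, b: (k, 1) biases, A_prev: (n, m) batch of m inputs
    Z = W @ A_prev + b  # broadcasting applies the bias to every column
    return sigmoid(Z)   # (k, m): one activation column per batch entry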
There is a way to visualize the neural network in the form of 2D graphs. Take the following multilayer perceptron for example:
And the initial weights and biases are:
W1 = -34.4 b1 = 2.14
W2 = -2.53 b2 = 1.29
W3 = -1.3 b3 = -0.58
W4 = 2.28
If we are using the softplus function for the activation, our activation nodes will look like these:
And our training data looks like this:
Say we feed the input 0 into the neural network: the input is multiplied by the weight and the bias is added, resulting in 2.14. After putting it through the softplus function, we have the output 2.25. We can also do this at a1 for the other training inputs, and we will get a range of values;
The same thing repeats for the next layer until we get another curve. That was our first curve; the second curve is formed in the same manner from the second node in the layer, a2.
We then add the y values of the two curves to get a resultant curve. That resultant curve is then offset by the final bias and can be used for prediction. In theory, with enough layers and nodes, we can fit data of any shape with such a curve.
Softplus is an activation function whose derivative is the sigmoid function:

f(x) = log(1 + exp(x))
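A quick NumPy sketch of softplus and its derivative:

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))  # log(1 + exp(x))

def softplus_derivative(x):
    # d/dx log(1 + exp(x)) = exp(x) / (1 + exp(x)) = sigmoid(x)
    return 1 / (1 + np.exp(-x))

softplus(2.14)  # -> ~2.25, the value used in the graphing example above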
Softmax is a function that maps multiple input values to real values between 0 and 1 that always sum to 1 (a probability distribution).
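A minimal sketch (subtracting the max is the usual trick to avoid overflow in exp; not the project's exact code):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()         # each output in (0, 1), outputs sum to 1

softmax(np.array([1.0, 2.0, 3.0]))  # -> [0.09, 0.24, 0.67] (sums to 1)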
Cross entropy is a metric that replaces the loss function for softmax output layers. The formula for cross entropy is as follows:
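In its standard form, written in the same notation style as the other formulas here:

E = -sum_i(y_i * log(yhat_i))

y_i = truth value for class i (one-hot)
yhat_i = softmax output for class i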
Momentum is an optimization method used in gradient descent that helps overcome local minima through controlled overshooting. Consider performing gradient descent on the following graph:
As we can see, the gradient descent stops at the local minimum, which is not the ideal solution. One could tune the learning rate to be larger; however, that also increases the chance of overshooting the global minimum. One way to mitigate this is to introduce the concept of momentum. Without momentum, the gradient descent update looks like this:
y_t = y_t-1 - a * g
y_t = current y value
y_t-1 = previous y value
a = learning rate
g = gradient of previous y value
With the introduction of momentum, the gradient descent formula would look like this instead
v_t = (u * v_t-1) - (a * g)
y_t = y_t-1 + v_t
v_t = current velocity
v_t-1 = previous velocity
u = momentum constant controlling how much of the previous velocity is retained
y_t = current y value
y_t-1 = previous y value
a = learning rate
g = gradient of previous y value
With each iteration, we notice that v accumulates the gradients. To be more precise, we take larger steps when we have been traveling in the same direction for some time. Our resulting graph would look like this now:
This can still be improved: unnecessary momentum is carried over when reaching the global minimum, which causes additional oscillations. This is solved by Nesterov momentum, where the definition of g is changed like so:
g = gradient evaluated at (previous y value + u * v_t-1)
Instead of calculating the step using just the previous y value, it also includes the distance of the momentum jump. Doing this starts decreasing the velocity earlier, because the direction change toward the global minimum is detected earlier.
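A minimal sketch of both update rules on a toy function (the variable names mirror the formulas above; this is not the project's code):

def descend(grad, y0, a=0.1, u=0.9, steps=100, nesterov=False):
    y, v = y0, 0.0
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point,
        # classic momentum evaluates it at the current position
        g = grad(y + u * v) if nesterov else grad(y)
        v = u * v - a * g  # the velocity keeps a decaying history of steps
        y = y + v
    return y

# example: minimize f(y) = y^2, whose gradient is 2y
descend(lambda y: 2 * y, y0=5.0)                 # -> ~0.0
descend(lambda y: 2 * y, y0=5.0, nesterov=True)  # -> ~0.0, fewer oscillations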
As neural networks get bigger and more complex, they become more prone to overfitting. Early stopping mitigates this by halting the learning iterations when the validation loss reaches its lowest point.
Cross entropy and mean squared error (MSE) are both loss functions; MSE is typically used in regression, while cross entropy is more widely used for classification problems.
If we plot both of these loss functions on a graph we get:
As we can see, cross entropy incurs more loss for the same small change in prediction because its curve has a steeper gradient than MSE's.
RMSProp, also known as root mean square propagation, is an optimization used in gradient descent to make it reach the global minimum faster. Consider the following topology:
One set of arrows represents classic gradient descent approaching the minimum, while the arrows in black represent RMSProp.
As we can see, classic gradient descent oscillates a lot in the Y direction due to the imbalanced aspect ratio of the topology. RMSProp improves this by dividing the derivative by a root-mean-square moving average that accounts for previous derivatives, instead of using the raw derivative. The formula to integrate this is as follows:
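Reconstructed in the same notation as the momentum formulas above (it matches the rmsprop_ratio and rmsprop_stabilizer variables used in apply_derivatives_reset_cache later in this document):

s_t = (p * s_t-1) + ((1 - p) * g^2)
y_t = y_t-1 - a * (g / sqrt(s_t + e))

s_t = moving average of the squared gradients
p = decay ratio of the moving average (rmsprop_ratio)
e = small stabilizer to avoid division by zero (rmsprop_stabilizer)
a = learning rate
g = gradient of previous y value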
The intuition behind the formula is that since we want to slow the change along the Y axis, we divide the derivative by a large number instead of using the derivative itself, and vice versa for the X axis. The formula keeps a moving average of the squared changes and gives us a denominator based on how large that average is.
When the user inputs the data into my program, a custom function reads the dataset from the CSV file and parses it into a one-dimensional array of scalars.
Keep in mind that DATA_MODEL is a custom structure which stores the schema of the data as well as additional metadata:
DATA_MODEL = [
{"name": "Id", "idx": 0, "type": "int"},
{"name": "Diagnosis", "idx": 1, "type": "string"},
{"name": "Radius_N1", "idx": 2, "type": "float"},
{"name": "Texture_N1", "idx": 3, "type": "float"},
{"name": "Perimeter_N1", "idx": 4, "type": "float"},
{"name": "Area_N1", "idx": 5, "type": "float"},
{"name": "Smoothness_N1", "idx": 6, "type": "float"},
{"name": "Compactness_N1", "idx": 7, "type": "float"},
{"name": "Concavity_N1", "idx": 8, "type": "float"},
{"name": "Concave points_N1", "idx": 9, "type": "float"},
{"name": "Symmetry_N1", "idx": 10, "type": "float"},
{"name": "Fractal dimension_N1", "idx": 11, "type": "float"},
{"name": "Radius_N2", "idx": 12, "type": "float"},
{"name": "Texture_N2", "idx": 13, "type": "float"},
{"name": "Perimeter_N2", "idx": 14, "type": "float"},
{"name": "Area_N2", "idx": 15, "type": "float"},
{"name": "Smoothness_N2", "idx": 16, "type": "float"},
{"name": "Compactness_N2", "idx": 17, "type": "float"},
{"name": "Concavity_N2", "idx": 18, "type": "float"},
{"name": "Concave points_N2", "idx": 19, "type": "float"},
{"name": "Symmetry_N2", "idx": 20, "type": "float"},
{"name": "Fractal dimension_N2", "idx": 21, "type": "float"},
{"name": "Radius_N3", "idx": 22, "type": "float"},
{"name": "Texture_N3", "idx": 23, "type": "float"},
{"name": "Perimeter_N3", "idx": 24, "type": "float"},
{"name": "Area_N3", "idx": 25, "type": "float"},
{"name": "Smoothness_N3", "idx": 26, "type": "float"},
{"name": "Compactness_N3", "idx": 27, "type": "float"},
{"name": "Concavity_N3", "idx": 28, "type": "float"},
{"name": "Concave points_N3", "idx": 29, "type": "float"},
{"name": "Symmetry_N3", "idx": 30, "type": "float"},
{"name": "Fractal dimension_N3", "idx": 31, "type": "float"},
]
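As a sketch of how a row could be parsed against DATA_MODEL (parse_row and read_dataset are hypothetical names for illustration, not necessarily the project's):

import csv

def parse_row(row):
    parsed = []
    for field in DATA_MODEL:
        raw = row[field["idx"]]
        if field["type"] == "int":
            parsed.append(int(raw))
        elif field["type"] == "float":
            parsed.append(float(raw))
        else:  # "string" fields are kept as-is
            parsed.append(raw)
    return parsed

def read_dataset(path):
    with open(path) as f:
        return [parse_row(row) for row in csv.reader(f)]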
For the data normalization as well as the data standardization part, I've created custom functions that use min-max normalization and Z-score standardization to process the data before passing it into my perceptron. My normalization and standardization functions return the weights for those operations, and I use the same weights to run normalization and standardization on my test dataset.
min_max_weights_train = ft_preprocess.normalize_features(raw_data_train)
ft_preprocess.normalize_with_weights(raw_data_test, min_max_weights_train)
mean_and_stddev_train = ft_preprocess.standardize_features(raw_data_train)
ft_preprocess.standardize_with_weights(raw_data_test, mean_and_stddev_train)
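A minimal sketch of the fit-on-train, reuse-on-test idea behind these calls (illustrative only; the real functions mutate the data in place and have their own signatures):

import numpy as np

def fit_min_max(train):
    # learn the normalization "weights" from the training set only
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0)

def apply_min_max(data, lo, hi):
    # reuse the training-set weights so test data is scaled identically
    return (np.asarray(data, dtype=float) - lo) / (hi - lo)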
A custom reporter is created to record historic events while the neural network is learning:
reporter = ft_reporter.Ft_reporter(args.historic_path, args.historic_name)
Once everything is initialized, I create a perceptron with all of the previous values. Behind the scenes it creates all of the necessary layers and classifies each layer into its specific type. It also generates the truth matrices for the training dataset as well as the test dataset.
perceptron = ft_perception.Ft_perceptron(
    args.layer,
    args.epochs,
    args.loss,
    args.batch_size,
    args.learning_rate,
    args.output,
    min_max_weights_train,
    mean_and_stddev_train,
    raw_data_train,
    raw_data_test,
    reporter
)
The perceptron also exposes a begin_train function, which runs epoch number of times and performs the following procedures in order. Before we run the feed-forward algorithm, we also generate the input matrix (an np.array vector) based on the number of inputs in the batch. This won't affect the computation of the weights and biases, because we are using linear algebra, which lets us compute multiple entries all at once.

def begin_train(self):
    truth_train = self.train_truth
    truth_test = self.test_truth
    warmup_threshold = 128  # how many epochs to run before we check for early stopping
    last_test_accuracy = 0
    for i in range(self.epoch_count):
        # generate batch randomly based on batch size
        indices = None
        if self.batch_size < 0:
            indices = range(0, len(self.dataset_train))
        else:
            indices = random.sample(population=range(0, len(self.dataset_train)), k=self.batch_size)
        train_batches = list(map(lambda x: self.dataset_train[x], indices))
        truth_batch_scalar = []
        for col_idx, col in enumerate(truth_train[0]):
            if col_idx not in indices:
                continue
            truth_1 = self.train_truth[0][col_idx]
            truth_2 = self.train_truth[1][col_idx]
            truth_batch_scalar.append([truth_1, truth_2])
        train_batch_truths = np.array(truth_batch_scalar).T
        # feed forward and backpropagate on the training batch
        self.layers[0].lhs_activation = self.generate_input_matrix(train_batches)
        last_layer_error_train = self.feed_forward_and_backprop_train(train_batch_truths)
        accuracy_train = ft_math.get_accuracy(self.layers[-1].rhs_activation, train_batch_truths)
        recall_train = ft_math.get_recall(self.layers[-1].rhs_activation, train_batch_truths)
        # feed forward the test set
        self.layers[0].lhs_activation = self.generate_input_matrix(self.dataset_test)
        last_layer_error_test = self.feed_forward_test(truth_test)
        accuracy_test = ft_math.get_accuracy(self.layers[-1].rhs_activation, truth_test)
        recall_test = ft_math.get_recall(self.layers[-1].rhs_activation, truth_test)
        test_error = np.abs(np.mean(last_layer_error_test))
        train_error = np.abs(np.mean(last_layer_error_train))
        self.reporter.report_event("train", train_error, accuracy_train, recall_train)
        self.reporter.report_event("test", test_error, accuracy_test, recall_test)
        # check for early stopping
        if i >= warmup_threshold:
            if accuracy_test < last_test_accuracy:
                logging.info(f"Epoch {i} finished; train loss {round(train_error, 3)}, validation loss {round(test_error, 3)}, validation acc {accuracy_test} < {last_test_accuracy}, early stopping condition met, stopping.")
                break
        last_test_accuracy = accuracy_test
        # apply weight changes
        self.apply_derivatives_reset_cache()
        logging.info(f"Epoch {i} finished; train loss {round(train_error, 3)}, validation loss {round(test_error, 3)}, validation acc {round(accuracy_test, 3)}")
    self.reporter.generate_report()
    logging.info("Historic report generated")
    self.write_to_output()
    logging.info("Weights written")
The feed-forward function is quite direct: it iterates through all the layers in the current perceptron and, for each layer, runs the activation function for that particular layer.
def feed_forward(self):
    # iterate through all layers (assume input layer LHS is already set)
    for layer_idx, layer in enumerate(self.layers):
        # run activation function, this populates the current layer's rhs
        layer.run_activation()
        # set the next layer's lhs depending on the current layer type
        if layer.type != "output":
            self.layers[layer_idx + 1].lhs_activation = layer.rhs_activation
    last_layer_activation = self.layers[-1].rhs_activation
    return last_layer_activation
The backpropagation function iterates through the layers in reverse; depending on the type of the layer, it runs specific derivative functions which compute that layer's derivatives for gradient descent.
def backprop(self, truth):
    # this value will change to store the current layer's dz value
    # for the previous layer to access
    last_dz = None
    # this value will change to store the current layer's weights
    # for the previous layer to access
    last_layer_weights = None
    for layer_idx, layer in enumerate(reversed(self.layers)):
        if layer.type == "output":
            dz = ft_math.dcost_dz_output_np(layer.rhs_activation, truth, layer.pre_softmax_x_values, self.output_loss_type)
            dw = ft_math.dcost_dw_output_np(dz, layer.lhs_activation)
            db = ft_math.dcost_db_output_np(dz)
        else:
            dz = ft_math.dcost_dz_hidden_np(last_layer_weights, last_dz, layer.rhs_activation)
            dw = ft_math.dcost_dw_hidden_np(dz, layer.lhs_activation)
            db = ft_math.dcost_db_hidden_np(dz)
        layer.pending_weights_derivatives = dw
        layer.pending_bias_derivatives = db
        last_dz = dz
        last_layer_weights = layer.weights
        logging.debug(f"[backprop]\n{layer.type} layer {len(self.layers) - layer_idx} dz\n{dz}\ndw\n{dw}\ndb\n{db}")
In the function where we apply the derivatives and obtain the new step size, we first apply RMSProp to the current derivatives, and then apply momentum, using some saved variables on the class.
After those two adjustments are applied, we change the individual weights and biases for all the layers, and then reset the derivative cache of each layer back to zero to prepare for the next epoch.
def apply_derivatives_reset_cache(self):
    for idx, layer in enumerate(self.layers):
        # apply rmsprop
        new_s_dw = (self.rmsprop_ratio * layer.s_weights) + ((1 - self.rmsprop_ratio) * np.square(layer.pending_weights_derivatives))
        rms_w_derivatives = layer.pending_weights_derivatives / np.sqrt(new_s_dw + self.rmsprop_stabilizer)
        layer.s_weights = new_s_dw
        new_s_db = (self.rmsprop_ratio * layer.s_bias) + ((1 - self.rmsprop_ratio) * np.square(layer.pending_bias_derivatives))
        rms_b_derivatives = layer.pending_bias_derivatives / np.sqrt(new_s_db + self.rmsprop_stabilizer)
        layer.s_bias = new_s_db
        # apply momentum
        new_weights_velocity = self.momentum_decay * layer.weights_velocity - (self.learning_rate * rms_w_derivatives)
        layer.weights_velocity = new_weights_velocity
        new_bias_velocity = self.momentum_decay * layer.bias_velocity - (self.learning_rate * rms_b_derivatives)
        layer.bias_velocity = new_bias_velocity
        layer.weights = layer.weights + new_weights_velocity
        layer.bias = layer.bias + new_bias_velocity
        # clear cache matrix
        layer.pending_weights_derivatives = np.zeros(layer.weights.shape)
        layer.pending_bias_derivatives = np.zeros(layer.bias.shape)