
Usage

Create a Python virtual environment and install dependencies

python3 -m venv env && source env/bin/activate && pip install coloredlogs numpy matplotlib

Split data into test and training set

python3 data_scripts/seperate_data.py dataset/data.csv dataset/

Run training program

python3 train/train.py dataset/data_train.csv dataset/data_test.csv -e 100

Generate reports and graphs

python3 report/generate_report.py <historics_directory>

Run prediction program

python3 train/predict.py <layer_weights> <layer_bias> <config> <test_dataset>

Replace the values in <> with the output files generated by the training program

/dev/log for multilayer_perceptron

Multilayer perceptron

The multilayer perceptron is a model loosely inspired by the human brain: it learns to associate specific inputs with specific outputs. The most basic unit of a multilayer perceptron is a single perceptron, which can be represented as a function with multiple inputs and one single output (the activation function).

The inputs of these functions are the outputs of the previous layer's activation functions, and the outputs of these activation functions are connected to all nodes of the next layer. A 'layer' is formed when multiple perceptrons of the same type process different values of that type.

There will always be an input layer that reads arbitrary input and an output layer that produces predictions. The layers in between yield their own context based on the data set and the objective.


The entire learning process is a 3-step procedure.

  1. Make a guess at the output (feed forward)
  2. Adjust the perceptron so that the guess becomes more correct (backpropagation)
  3. Repeat (epoch)

Once the learning is done you may use the same weights and biases from the training process to predict new data.

Activation function

The activation function is the same for all MLPs here. It takes the sum of all its inputs, applies a sigmoid function to it, and returns the output. The individual inputs may be multiplied by a constant (weight), and the sum may also add or subtract an arbitrary value prior to the sigmoid (bias).
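
As a minimal sketch of that description (the names and numbers below are illustrative, not from the project):

import numpy as np

def perceptron(inputs, weights, bias):
	# weighted sum of the inputs, shifted by the bias
	z = np.dot(weights, inputs) + bias
	# sigmoid activation squashes the sum into (0, 1)
	return 1.0 / (1.0 + np.exp(-z))

# two inputs, two weights, and one bias produce one activation value
print(perceptron(np.array([0.5, 0.8]), np.array([0.4, -0.6]), 0.1))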


We can represent a single layer's output (rhs) in linear algebra form as below.


Credits to 3b1b

k is the number of nodes in the current layer, while n is the number of nodes in the previous layer.
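
The lost figure showed this in matrix form; written out in the notation of the surrounding text (a reconstruction, not the original figure):

a(1) = sigmoid(W * a(0) + b)

where a(0) is the previous layer's activation vector, W is the k x n weight matrix (one row of weights per current-layer node), and b is the bias vector.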

Backpropagation

Usually we will have a loss function that calculates the loss value of each node in the output layer with respect to the connections the output layer has with the last hidden layer.

The loss function determines how much an activation value needs to change and whether the activation value should increase or decrease.

Say we have a scenario like so:

(figure: an output node fed by two previous activations through the weights w1 and w2; one of the activations is 1.0)

It wouldn't make sense to change the previous activation values here, since we have no direct influence over them; we can only change the weights (w1 and w2). Since the 1.0 activation value is big, it makes the most sense to change the weight connected to that node, as the output is most sensitive to it (weight multiplied by a high activation).


In a similar fashion, if we had influence over the activations but not the weights, it would make sense that changing the activation connected to the larger weight is more effective, for the same reason as above.

Conclusion: changes in weights have similar effects and constraints to changes in activation values, but we only have direct influence over the weights (tuning).

But why do changes in activation matter? Since a change in weight has a similar effect to a change in activation, changing both of them can help the output layer reach the loss function's minimum faster.

The required change in weight can be propagated back in the form of a required change in activation, making that the loss function output for the previous layer.

If there are multiple output nodes in the current layer, the change in activation should be the average of all the weight changes connected to the activation node, to achieve a desirable effect for all weights.


This effect carries on until the input layer is reached, hence the name backpropagation.

The calculus part of backpropagation is similar to logistic regression, as they both use the chain rule. A simple network with two hidden layers and one node per layer is used as an example here.


We also define some functions for the last layer, forming a computation tree that looks like the following, which we can use to obtain the partial derivatives for gradient descent.
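
The figure is gone, so here is the chain written out in standard notation for one training example (a reconstruction consistent with the description above, not a copy of the original):

C = (a(L) - y)^2
z(L) = w(L) * a(L-1) + b(L)
a(L) = sigmoid(z(L))

dC/dw(L) = dz(L)/dw(L) * da(L)/dz(L) * dC/da(L)
         = a(L-1) * sigmoid'(z(L)) * 2 * (a(L) - y)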


To get the derivatives of the other weights and biases, we can substitute the variables with their lower-index counterparts. And since the cost function above only covers one training example, we need to average the costs of all examples to get the final cost.


For a multilayer network with multiple nodes per layer, the original formulas just need some slight amendments.


With this in mind, the steps to optimize all weights and biases simultaneously are as follows:

  1. Initialize random values for all weights and biases
  2. Loop until the desired epoch limit is reached; in each epoch:
    • For all the examples, calculate the cost function value and populate the activation matrix by running the neural network
    • Using the respective derivatives of each input value, the current weights and biases, and the current activation values, generate a new matrix of weights by taking one gradient descent step

Since the input itself can be a matrix of its own, we can omit the explicit state-storing process by using a dedicated matrix library (NumPy).

Insert diagram of linear algebra formula here
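
Until that diagram exists, here is a minimal NumPy sketch of the idea; the shapes and variable names are illustrative assumptions, not the project's real ones:

import numpy as np

# one batched layer pass: X packs one training example per column,
# so a single matrix product evaluates the whole batch at once
rng = np.random.default_rng(0)
n_prev, n_curr, batch = 31, 16, 8
W = rng.standard_normal((n_curr, n_prev))   # k x n weight matrix
b = rng.standard_normal((n_curr, 1))        # bias, broadcast across the batch
X = rng.standard_normal((n_prev, batch))    # previous layer activations

A = 1.0 / (1.0 + np.exp(-(W @ X + b)))      # sigmoid(WX + b), shape (n_curr, batch)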

Activation functions as graphs

There is a way to visualize the neural network in the form of 2D graphs. Take the following multilayer perceptron for example:


And the initial weights and biases are

W1 = -34.4 b1 = 2.14
W2 = -2.53 b2 = 1.29
W3 = -1.3 b3 = -0.58
W4 = 2.28 

If we are using the softplus function for the activation, our activation nodes will look like these:


And our training data looks like this:


Say we feed the input 0 to the neural network: the input is multiplied by the weight and the bias is added, giving 0 × (-34.4) + 2.14 = 2.14. After putting it through the softplus function we get an output of about 2.25. We can also do this for all other training inputs, and we will get a range of values:
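
A quick sanity check of that arithmetic (softplus is just log(1 + exp(x))):

import numpy as np

x = 0 * -34.4 + 2.14            # input 0 through W1 and b1
print(np.log(1 + np.exp(x)))    # softplus(2.14) ≈ 2.25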


The same thing repeats for the next layer until we get another curve.


That was our first curve. The second curve is formed in the same manner from the second node in the layer, a2.


We then add the y values of the curves to get the resultant curve. That resultant curve is then offset by the bias and can be used for prediction. In theory, with enough layers and nodes, we can fit any shape of data with such a line.

Softmax and cross entropy

Softplus is an activation function whose derivative is the sigmoid function:

f(x)=log(1+exp(x))

Softmax is a function that maps multiple input values to real values between 0 and 1 which always sum to 1 (a probability distribution).
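
The figure with the formula did not survive; the standard definition is:

softmax(z_i) = exp(z_i) / (exp(z_1) + ... + exp(z_n))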


Cross entropy is a metric that replaces the loss function for softmax output layers. The formula for cross entropy is as follows:
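
The original figure is gone; the standard formula is:

H(p, q) = -sum_i(p_i * log(q_i))

p = the truth distribution (one-hot in our case)
q = the predicted softmax output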


Gradient descent with momentum and Nesterov momentum

Momentum is an optimization method used in gradient descent that helps overcome local minima by implementing controlled overshooting. Consider performing gradient descent on the following graph:


As we can see, gradient descent stops at the local minimum, which is not the ideal solution. One can tune the learning rate to be larger; however, this also increases the chance of overshooting the global minimum. One way to mitigate this is to introduce the concept of momentum. Without momentum, the gradient descent update looks like this:

y_t = y_t-1 - a * g

y_t = current y value
y_t-1 = previous y value
a = learning rate
g = gradient of previous y value

With the introduction of momentum, the gradient descent formula looks like this instead:

v_t = (u * v_t-1) - (a * g)
y_t = y_t-1 + v_t

v_t = current velocity
v_t-1 = previous velocity
u = constant for velocity percentage loss
y_t = current y value
y_t-1 = previous y value
a = learning rate
g = gradient of previous y value

With each iteration, we notice that v accumulates the gradient; to be more precise, we take larger steps when we have been travelling in the same direction for some time. Our descent graph would now look like this:


This can still be improved: unnecessary momentum is carried over when reaching the global minimum, which causes additional oscillations. This is solved by Nesterov momentum, where the definition of g is changed like so:

g = gradient evaluated at (y_t-1 + u * v_t-1)

Instead of calculating the step using just the gradient at the previous y value, it evaluates the gradient at the point where the momentum jump would land. Doing this starts decreasing the velocity earlier, because the change of direction toward the global minimum is detected earlier.
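
A minimal sketch of both updates on a toy one-dimensional curve; the curve and hyperparameter values are illustrative assumptions:

import numpy as np

def df(y):
	# derivative of a toy curve with a local and a global minimum
	return 4 * y ** 3 - 6 * y + 1

a = 0.01    # learning rate
u = 0.9     # velocity retention constant

# classic momentum: gradient evaluated at the current position
y, v = 2.0, 0.0
for _ in range(200):
	v = u * v - a * df(y)
	y = y + v

# Nesterov momentum: gradient evaluated at the look-ahead position
y_n, v_n = 2.0, 0.0
for _ in range(200):
	v_n = u * v_n - a * df(y_n + u * v_n)
	y_n = y_n + v_n

print(y, y_n)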

Early stopping

As neural networks get bigger and more complex, they become more prone to overfitting. Early stopping mitigates this by halting the learning iteration when the validation loss reaches its lowest point.


Cross entropy with Softmax versus mean square error

Cross entropy and mean square error are both loss functions. Mean square error is typically used for regression, while cross entropy is more widely used for classification problems.

If we plug both of these loss functions to a graph we get:

As we can see, cross entropy incurs more loss for the same prediction error, because its loss curve has a steeper gradient than MSE's.

RMS prop

RMS prop, also known as root mean square propagation, is an optimization used in gradient descent to make it reach the global minimum faster. Consider the following topology:


One set of arrows represents classic gradient descent approaching the minimum, while the arrows in black represent RMS prop.

As we can see, classic gradient descent oscillates a lot in the y direction due to the imbalanced ratio of the topology (the surface is much steeper in one direction than the other). RMS prop improves on this by dividing the derivative by the root mean square of recent derivatives, taking previous derivatives into account, instead of using the raw derivative directly. The formula is as follows:

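The figure with the formula did not survive; written in the same style as the earlier update rules, and consistent with the apply_derivatives_reset_cache implementation shown later:

s_t = (r * s_t-1) + ((1 - r) * g^2)
step = (a * g) / sqrt(s_t + e)

s_t = moving average of the squared gradients
s_t-1 = previous moving average
r = decay ratio of the moving average
a = learning rate
g = current gradient
e = small stabilizer to avoid division by zero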

The intuition behind the formula is that, since we want to slow down the change along the y axis, we divide the derivative by a big number instead of using the derivative itself, and vice versa for the x axis. The formula keeps a moving average of the squared changes and gives us a denominator based on how large that average is.

Implementation details (train)

Data normalization and standardization

When the user inputs the data into my program, a custom function reads the data set from the CSV file and parses each row into a one-dimensional list of scalars.


Keep in mind that DATA_MODEL is a custom structure which stores the schema of the data as well as additional metadata:

DATA_MODEL = [
	{"name": "Id", "idx": 0, "type": "int"},
	{"name": "Diagnosis", "idx": 1, "type": "string"},
	{"name": "Radius_N1", "idx": 2, "type": "float"},
	{"name": "Texture_N1", "idx": 3, "type": "float"},
	{"name": "Perimeter_N1", "idx": 4, "type": "float"},
	{"name": "Area_N1", "idx": 5, "type": "float"},
	{"name": "Smoothness_N1", "idx": 6, "type": "float"},
	{"name": "Compactness_N1", "idx": 7, "type": "float"},
	{"name": "Concavity_N1", "idx": 8, "type": "float"},
	{"name": "Concave points_N1", "idx": 9, "type": "float"},
	{"name": "Symmetry_N1", "idx": 10, "type": "float"},
	{"name": "Fractal dimension_N1", "idx": 11, "type": "float"},
	
	{"name": "Radius_N2", "idx": 12, "type": "float"},
	{"name": "Texture_N2", "idx": 13, "type": "float"},
	{"name": "Perimeter_N2", "idx": 14, "type": "float"},
	{"name": "Area_N2", "idx": 15, "type": "float"},
	{"name": "Smoothness_N2", "idx": 16, "type": "float"},
	{"name": "Compactness_N2", "idx": 17, "type": "float"},
	{"name": "Concavity_N2", "idx": 18, "type": "float"},
	{"name": "Concave points_N2", "idx": 19, "type": "float"},
	{"name": "Symmetry_N2", "idx": 20, "type": "float"},
	{"name": "Fractal dimension_N2", "idx": 21, "type": "float"},

	{"name": "Radius_N3", "idx": 22, "type": "float"},
	{"name": "Texture_N3", "idx": 23, "type": "float"},
	{"name": "Perimeter_N3", "idx": 24, "type": "float"},
	{"name": "Area_N3", "idx": 25, "type": "float"},
	{"name": "Smoothness_N3", "idx": 26, "type": "float"},
	{"name": "Compactness_N3", "idx": 27, "type": "float"},
	{"name": "Concavity_N3", "idx": 28, "type": "float"},
	{"name": "Concave points_N3", "idx": 29, "type": "float"},
	{"name": "Symmetry_N3", "idx": 30, "type": "float"},
	{"name": "Fractal dimension_N3", "idx": 31, "type": "float"},
]
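
The parsing function itself was an image that did not survive. A hypothetical sketch of the idea, assuming the schema above drives the type casts (parse_dataset and its internals are illustrative, not the real code):

import csv

def parse_dataset(csv_path, data_model):
	casts = {"int": int, "float": float, "string": str}
	rows = []
	with open(csv_path, newline="") as f:
		for record in csv.reader(f):
			# cast every column according to the schema in DATA_MODEL
			rows.append([casts[col["type"]](record[col["idx"]]) for col in data_model])
	return rows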

For the data normalization and standardization parts, I've created custom functions that use min-max normalization and Z-score standardization to process the data before passing it into my perceptron. My normalization and standardization functions return the weights for those operations, and I use the same weights to run normalization and standardization on my test data set.

min_max_weights_train = ft_preprocess.normalize_features(raw_data_train)
ft_preprocess.normalize_wtith_weights(raw_data_test, min_max_weights_train)
mean_and_stddev_train = ft_preprocess.standardize_features(raw_data_train)
ft_preprocess.standardize_with_weights(raw_data_test, mean_and_stddev_train)
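
The real functions live in ft_preprocess and are not reproduced here; as a sketch of what the two transforms compute and why they return "weights" (the names below are illustrative):

import numpy as np

def min_max_normalize(column):
	lo, hi = np.min(column), np.max(column)
	# return the scaled values plus the weights needed to reapply the transform
	return (column - lo) / (hi - lo), (lo, hi)

def z_score_standardize(column):
	mean, std = np.mean(column), np.std(column)
	return (column - mean) / std, (mean, std)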

A custom reporter is made to record historic events for the neural network while it learns:

reporter = ft_reporter.Ft_reporter(args.historic_path, args.historic_name)

Once everything is initialized, I create a perceptron with all of the previous values, which behind the scenes creates all of the necessary layers and classifies each layer into its specific type. It also generates the truth matrices for the training data set as well as the test data set.

perceptron = ft_perception.Ft_perceptron(
		args.layer,
		args.epochs,
		args.loss,
		args.batch_size,
		args.learning_rate,
		args.output,
		min_max_weights_train,
		mean_and_stddev_train,
		raw_data_train,
		raw_data_test,
		reporter
	)

The perceptron also exposes a begin_train function, which runs for the configured number of epochs and does the following procedures in order:

  1. Generate a randomly selected subset (batch) of the training data based on the inputted batch size
  2. Further process the truth indices and turn them into an np.array vector. Before running the feed forward algorithm, we also generate the input matrix based on the number of inputs in our data set. This won't affect the computation of the weights and biases, because we are using linear algebra, which allows us to compute many different entries all at once.
  3. Run the feed forward and backpropagation procedures on the training data set, and calculate the metrics from the result of the feed forward
  4. Run only the feed forward procedure on the test data set and obtain the metrics and results
  5. Based on an arbitrary warm-up limit that we set, implement early stopping: stop the epoch loop when we observe that the accuracy on the test data set starts to drop
  6. Apply the weight changes obtained through the feed forward and backpropagation process on the training data set, then log the current progress to the console
def begin_train(self):
		truth_train = self.train_truth
		truth_test = self.test_truth
		warmup_threshold = 128 # how many epochs to run before we check for early stopping?
		last_test_accuracy = 0

		for i in range(self.epoch_count):
			# generate batch randomly based on batch size
			indices = None
			if self.batch_size < 0:
				indices = range(0, len(self.dataset_train))
			else :
				indices = random.sample(population=range(0, len(self.dataset_train)), k=self.batch_size)
			train_batches = list(map(lambda x: self.dataset_train[x], indices))
			
			truth_batch_scalar = []
			for col_idx, col in enumerate(truth_train[0]):
				if col_idx not in indices:
					continue
				truth_1 = self.train_truth[0][col_idx]
				truth_2 = self.train_truth[1][col_idx]
				truth_batch_scalar.append([truth_1, truth_2])
			train_batch_truths = np.array(truth_batch_scalar).T

			# forwardfeed and backpropagate
			self.layers[0].lhs_activation = self.generate_input_matrix(train_batches)
			last_layer_error_train = self.feed_forward_and_backprop_train(train_batch_truths)
			accuracy_train = ft_math.get_accuracy(self.layers[-1].rhs_activation, train_batch_truths)
			recall_train = ft_math.get_recall(self.layers[-1].rhs_activation, train_batch_truths)

			# forwardfeed test set
			self.layers[0].lhs_activation = self.generate_input_matrix(self.dataset_test)
			last_layer_error_test = self.feed_forward_test(truth_test)
			accuracy_test = ft_math.get_accuracy(self.layers[-1].rhs_activation, truth_test)
			recall_test = ft_math.get_recall(self.layers[-1].rhs_activation, truth_test)

			test_error = np.abs(np.mean(last_layer_error_test))
			train_error = np.abs(np.mean(last_layer_error_train))

			self.reporter.report_event("train", train_error, accuracy_train, recall_train)
			self.reporter.report_event("test", test_error, accuracy_test, recall_test)

			# check for early stopping
			if i >= warmup_threshold :
				if accuracy_test < last_test_accuracy:
					logging.info(f"Epoch {i} finished; train loss {round(train_error, 3)}, validation loss {round(test_error, 3)} validation acc {accuracy_test} < {last_test_accuracy}, early stopping condition met, stopping.")
					break
				last_test_accuracy = accuracy_test

			# apply weight changes
			self.apply_derivatives_reset_cache()
			logging.info(f"Epoch {i} finished; train loss {round(train_error, 3)}, validation loss {round(test_error, 3)} validation acc {round(accuracy_test, 3)}")

		self.reporter.generate_report()
		logging.info("Historic report generated")

		self.write_to_output()
		logging.info("Weights written")

The feed forward function is quite direct: it iterates through all the layers in the current perceptron, and for each layer it runs the activation function for that particular layer.

def feed_forward(self):
	# iterate through all layers (assume input layer LHS is already set)
	for layer_idx, layer in enumerate(self.layers):
		# run activation function, this populates the current layer's rhs
		layer.run_activation()
		# set the next layer's lhs depending on the current layer type
		if layer.type != "output":
			self.layers[layer_idx + 1].lhs_activation = layer.rhs_activation
	last_layer_activation = self.layers[-1].rhs_activation
	return last_layer_activation

The backpropagation function iterates through the layers in reverse. Depending on the type of the layer, it runs certain hard-coded derivative functions which compute the derivatives for that layer, to help us do gradient descent.

def backprop(self, truth):
	# this value will change to store the current layer's dz value
	# for the previous layer to access
	last_dz = None
	# this value will change to store the current layer's weights
	# for the previous layer to access
	last_layer_weights = None
	for layer_idx, layer in enumerate(reversed(self.layers)):
		if layer.type == "output":
			dz = ft_math.dcost_dz_output_np(layer.rhs_activation, truth, layer.pre_softmax_x_values, self.output_loss_type)
			dw = ft_math.dcost_dw_output_np(dz, layer.lhs_activation)
			db = ft_math.dcost_db_output_np(dz)
		else:
			dz = ft_math.dcost_dz_hidden_np(last_layer_weights, last_dz, layer.rhs_activation)
			dw = ft_math.dcost_dw_hidden_np(dz, layer.lhs_activation)
			db = ft_math.dcost_db_hidden_np(dz)
		# cache the derivatives for this layer and expose dz / weights
		# to the next (earlier) layer in the iteration
		layer.pending_weights_derivatives = dw
		layer.pending_bias_derivatives = db
		last_dz = dz
		last_layer_weights = layer.weights
		logging.debug(f"[backprop]\n{layer.type} layer {len(self.layers) - layer_idx} dz\n{dz}\ndw\n{dw}\ndb\n{db}")

In the function where we apply the derivatives and obtain the new step size, we first apply RMS prop to the current derivatives before applying momentum, using some saved variables from the class.

After those two adjustments are applied, we change the individual weights and biases for all the layers. We then reset the cached derivatives for those layers back to zero to prepare them for the next epoch loop.

def apply_derivatives_reset_cache(self):
	for idx, layer in enumerate(self.layers):
		# apply rmsprop: keep a moving average of squared derivatives
		# and divide the raw derivatives by its root
		new_s_dw = (self.rmsprop_ratio * layer.s_weights) + ((1 - self.rmsprop_ratio) * np.square(layer.pending_weights_derivatives))
		rms_w_derivatives = layer.pending_weights_derivatives / np.sqrt(new_s_dw + self.rmsprop_stabilizer)
		layer.s_weights = new_s_dw
		new_s_db = (self.rmsprop_ratio * layer.s_bias) + ((1 - self.rmsprop_ratio) * np.square(layer.pending_bias_derivatives))
		rms_b_derivatives = layer.pending_bias_derivatives / np.sqrt(new_s_db + self.rmsprop_stabilizer)
		layer.s_bias = new_s_db
		# apply momentum: accumulate velocity and step along it
		new_weights_velocity = self.momentum_decay * layer.weights_velocity - (self.learning_rate * rms_w_derivatives)
		layer.weights_velocity = new_weights_velocity
		new_bias_velocity = self.momentum_decay * layer.bias_velocity - (self.learning_rate * rms_b_derivatives)
		layer.bias_velocity = new_bias_velocity
		layer.weights = layer.weights + new_weights_velocity
		layer.bias = layer.bias + new_bias_velocity
		# clear cache matrix for the next epoch
		layer.pending_weights_derivatives = np.zeros(layer.weights.shape)
		layer.pending_bias_derivatives = np.zeros(layer.bias.shape)