# PW 09 Bütikofer Jaggi
## 1. The Perceptron and the Delta rule
### 1_activation_function
```python=
def relu(neta):
    '''The activation function of a Rectified Linear Unit (ReLU).'''
    output = neta * (neta > 0)     # max(0, neta), element-wise
    d_output = 1.0 * (neta > 0)    # derivative: 1 where neta > 0, else 0
    return (output, d_output)
```

The sigmoid has two horizontal asymptotes, at y = 0 and y = 1; the hyperbolic tangent has them at y = -1 and y = 1; the linear function has no asymptote. The derivatives of the sigmoid and the hyperbolic tangent are bell-shaped (they look like a Gaussian), while the derivative of the linear function is a constant horizontal line.
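For comparison with `relu` above, here is a minimal sketch of the three other activation functions in the same `(output, d_output)` style; the exact names and signatures used in the notebook may differ.
```python=
import numpy as np

def sigmoid(neta):
    '''Logistic sigmoid: horizontal asymptotes at y = 0 and y = 1.'''
    output = 1.0 / (1.0 + np.exp(-neta))
    d_output = output * (1.0 - output)   # bell-shaped derivative
    return (output, d_output)

def htan(neta):
    '''Hyperbolic tangent: horizontal asymptotes at y = -1 and y = 1.'''
    output = np.tanh(neta)
    d_output = 1.0 - output ** 2         # bell-shaped derivative
    return (output, d_output)

def linear(neta):
    '''Identity activation: no asymptote, constant derivative.'''
    output = neta
    d_output = np.ones_like(neta)
    return (output, d_output)
```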
### 4_1_delta_rule_points
#### When well defined

The perceptron separates the two classes well. The error drops close to 0 within a few iterations.
#### When classes overlap

Because of the overlap, the error function does not go near 0 but converges to about 0.4 (caused by the overlapping points). It does not take more iterations to converge to 0.4 than in the well-defined case.
There is only one oscillation, and it is not significant.
#### Not with single line

The error converges to about 1.0. When the classes cannot be separated by a single line, the number of iterations increases and the error stays higher.
Local minima are reached after fewer iterations than the global minimum, but we still converge to the global minimum.
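For reference, a minimal sketch of one online epoch of the delta rule for a single perceptron, assuming inputs that already contain a bias column and the `relu` activation from above (the actual activation and variable names used in the practical may differ):
```python=
import numpy as np

def delta_rule_epoch(weights, X, targets, learning_rate=0.1):
    '''One online pass over the dataset; X is assumed to already contain a bias column.'''
    squared_errors = []
    for x, t in zip(X, targets):
        neta = np.dot(weights, x)        # weighted sum of the inputs
        output, d_output = relu(neta)    # activation and its derivative
        error = t - output               # difference between target and output
        weights = weights + learning_rate * error * d_output * x  # delta rule update
        squared_errors.append(error ** 2)
    return weights, np.mean(squared_errors)  # updated weights and MSE of this epoch
```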
## 2. Backpropagation
### 5_backpropagation
#### When well defined

The error converges to 0 within a few iterations.
#### When classes overlap

Because of the overlap, the error function does not go near 0 but converges to about 0.4 (caused by the overlapping points). It does not take more iterations to converge to 0.4.
There is only one oscillation, and it is not significant.
#### Not with single line

It is possible to separate the dataset with 2 lines. The error converges to 0, but it takes more iterations. There is no local minimum.
#### Separated in subgroups (blobs)

It is possible to separate the classes into 2 groups. The error converges to 0 and only a few iterations are needed.
```python=
class MLP:
    ...
    def init_weights(self):
        '''
        This function creates the matrices of weights and initializes their values to small values.
        '''
        self.weights = []        # Start with an empty list
        self.delta_weights = []
        for i in range(1, len(self.layers) - 1):  # Iterate through the hidden layers
            # np.random.random((M, N)) returns an MxN matrix
            # of random floats in [0.0, 1.0).
            # (self.layers[i] + 1) is the number of neurons in layer i plus the bias unit
            self.weights.append((2 * np.random.random((self.layers[i - 1] + 1,
                                                        self.layers[i] + 1)) - 1) * 0.25)
            # delta_weights are initialized to zero
            self.delta_weights.append(np.zeros((self.layers[i - 1] + 1,
                                                self.layers[i] + 1)))
        # Append a last set of weights connecting to the output of the network
        self.weights.append((2 * np.random.random((self.layers[i] + 1,
                                                    self.layers[i + 1])) - 1) * 0.25)
        self.delta_weights.append(np.zeros((self.layers[i] + 1,
                                            self.layers[i + 1])))

    def fit(self, data_train, data_test=None,
            learning_rate=0.1, momentum=0.0, epochs=100):
        '''
        Online learning.
        :param data_train: A tuple (X, y) with input data and targets for training
        :param data_test: A tuple (X, y) with input data and targets for testing
        :param learning_rate: parameter defining the speed of learning
        :param momentum: fraction of the previous weight change kept in the current update
        :param epochs: number of times the dataset is presented to the network for learning
        '''
        ...
                # Update (inside the loop over epochs k and over training examples)
                for i in range(len(self.weights)):    # Iterate through the layers
                    layer = np.atleast_2d(a[i])       # Activation of layer i
                    delta = np.atleast_2d(deltas[i])  # Delta of layer i
                    # Compute the weight change using the delta for this layer
                    # and the change computed for the previous example for this layer
                    self.delta_weights[i] = (-learning_rate * layer.T.dot(delta)
                                             + momentum * self.delta_weights[i])
                    self.weights[i] += self.delta_weights[i]  # Update the weights
            error_train[k] = np.mean(error_it)  # Average error over all training examples
            if data_test is not None:           # If a testing dataset was provided
                error_test[k], _ = self.compute_MSE(data_test)  # Testing error after epoch k
        if data_test is None:      # If only training data was provided
            return error_train     # Return the error during training
        else:
            return (error_train, error_test)  # Otherwise, return both training and testing error
    ...
```
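A possible way to call the class above; the constructor signature `MLP(layers, activation)` and the variable names are assumptions, not taken from the notebook.
```python=
# Hypothetical usage sketch; constructor signature and data names are assumptions.
mlp = MLP([2, 4, 1], 'tanh')             # 2 inputs, 4 hidden neurons, 1 output
error_train, error_test = mlp.fit(
    (X_train, y_train),                  # training inputs and targets
    data_test=(X_test, y_test),          # optional test set, evaluated after each epoch
    learning_rate=0.1,
    momentum=0.5,
    epochs=100)
```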
## 4. Crossvalidation

The results vary a lot. We can see that as the spread increases, the error rate logically increases too.

Similarly to the hold-out validation, when the spread increases the error rate increases too.
If we compare the results of the two methods, we see that hold-out can give the best result in some cases but also the worst. K-fold gives slightly worse results on average, but the results do not vary as much.
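As an illustration of why the k-fold estimate varies less (it averages k test errors instead of relying on a single split), here is a minimal sketch of a k-fold evaluation built on the `MLP` class above; the constructor signature and the helper name are assumptions.
```python=
import numpy as np

def k_fold_mse(X, y, k=5, hidden=4, epochs=100):
    '''Hypothetical k-fold helper; the MLP constructor and arguments are assumptions.'''
    indices = np.random.permutation(len(X))   # shuffle before splitting
    folds = np.array_split(indices, k)        # k roughly equal folds
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        mlp = MLP([X.shape[1], hidden, 1], 'tanh')
        mlp.fit((X[train_idx], y[train_idx]), epochs=epochs)
        mse, _ = mlp.compute_MSE((X[test_idx], y[test_idx]))
        errors.append(mse)
    # the k estimates are averaged, which is why the k-fold result varies less
    return np.mean(errors), np.std(errors)
```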
## 5. Model building
### Spread (0.3, 0.5, 0.7)


The final model uses 4 hidden neurons trained for 60 epochs. We can see that even with a spread of 0.7 we obtain a fairly good result. With 4 neurons the results are good and the computation time stays low.
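A sketch of how the selected model could be built with these hyperparameters (4 hidden neurons, 60 epochs); the constructor signature and data names are assumptions.
```python=
# Hypothetical final-model sketch; constructor signature and data names are assumptions.
final_mlp = MLP([X.shape[1], 4, 1], 'tanh')   # 4 hidden neurons
error_train = final_mlp.fit((X, y), learning_rate=0.1, momentum=0.5, epochs=60)
```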