# IntroNN-Week1-Introduction to Neural Networks
## 1.1 Deep Learning History - 1
We have almost 86 billion neurons.
**Synapses.**
They take the output signals (results) of one neuron and send them on to another neuron.
Neural networks try to emulate what the brain does by means of a computer.
It's like putting a magnifying glass or a microscope on the brain and seeing how the information flows through the neurons and how it reaches the output.
Each artificial neuron is called a perceptron.
**Graph 1**

- Transistors make chips
- Neurons make the brain.
- Perceptrons (or neural networks) make a deep learning model.
A chip performs simple tasks like AND and OR, or it can perform hard tasks.
It has almost 50 billion transistors, carefully designed and structured into a grander model.
You need to understand:
1. The simplest unit: a gate or port, for example.
2. How the units work with each other to perform complex tasks.
### Test 1
**1. Is perceptron the fundamental building block of Neural Networks?**
**Ans:** True
Why?:
A Perceptron is a neural network unit that does certain computations to detect features.
**2. What is the function of a Logical Gate?**
**Ans:** Takes two binary inputs and gives an output.
Why?:
The basic logic gates are used to perform fundamental logical functions. The logic gates use two binary inputs and generate a single output like 1 or 0.
## 1.2 Deep Learning History - 2
### 2.2 How Warren McCulloch and Walter Pitts thought about the perceptron
The McCulloch-Pitts neuron dates from 1943; Warren McCulloch was a neurophysiologist and Walter Pitts a logician.
Their model of an artificial neuron is:
a **summation** followed by a **step function**

Let's say we have 3 inputs. In the **OR** case, at least **one** input needs to be activated (threshold = 1).
In the **AND** case, all the **inputs** need to be activated (threshold = 3).
So the threshold (theta) represents how many inputs must be activated.
### 2.3 How Rosenblatt thought about the perceptron
Rosenblatt proposed his perceptron in 1958.
It computes a **summation of weights times X's (inputs)**, followed by a step function.
**This is similar to the pre-work with Jacob Curtis**
We first apply a **summation**, then a **step function** at the output.

Let's call the threshold theta (θ).
The perceptron's weighted sum is compared against theta; theta is the center of the step function.
After multiplying each **weight** by its **Xi**, we calculate the output with the step function.
The **step function** can be the best-known hard step (figure 1) or a soft step function (figure 2).
Theta can be shifted from right to left (or vice versa) by adding a term called the **bias**: it is the product 1 × W1, where 1 is an extra input added to the perceptron.
The summation of W × Xi is called the **weighted sum**.
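A minimal sketch of the Rosenblatt perceptron described above, assuming a hard step at zero (with the threshold absorbed into the bias); the weights and inputs are made-up values for illustration:

```python
import numpy as np

def step(z):
    """Hard step function: 1 if z >= 0, else 0 (the threshold is absorbed into the bias)."""
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    """Weighted sum of the inputs plus the bias (1 * b), passed through the step function."""
    return step(np.dot(w, x) + b)

x = np.array([1.0, 0.0, 1.0])      # example inputs
w = np.array([0.5, 0.5, 0.5])      # example weights (illustrative values)
b = -0.9                           # bias shifts the threshold left or right
print(perceptron(x, w, b))         # -> 1, since 0.5 + 0.5 - 0.9 = 0.1 >= 0
```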


A network of just 3 or 4 neurons could be competitive against a chess player at that single game, but that kind of model does not adapt to other scenarios.
A large collection of perceptrons, however, can adapt to other cases. That is deep learning.
We need to understand that not every bunch of neurons will work well, because it depends on the design. A DL model is therefore only a model inspired by how the brain works; it is not the same as, or equivalent to, a brain, because we do not fully know how the brain actually works.
**Remember:** the DL model is only a model inspired by the brain; **it is not an equivalent or mathematical model** of the brain.
One example that DL is **not perfect**: we train a DL model on alligator images, but then a cartoon alligator, a pink alligator, or an alligator seen from the side might not be recognized.
Nowadays DL can recognize images and voices and drive self-driving cars.
According to **graph 1**, in practice we don't count the number of neurons; **we count the number of weights** connecting the neurons.

Examples of DL models:
AlphaGo, Google DeepMind's Go champion.
**Notes:**
- A cm³ of human brain contains an enormous number of neurons; so far, the brain's synapses and neural responses remain faster and more efficient than our artificial models.
- A complex task can use a large bunch of perceptrons in an ANN.
### Test 2.
**1. Consider the following statements about the McCulloch-Pitts Neuron Model:**
I. The inputs and outputs are binary.
II. The number of inputs, as well as outputs, can be many.
**Ans:** Only I is correct
Why?:
The inputs of the McCulloch-Pitts neuron could be **either 0 or 1**. It has a threshold function as an activation function. So, the output signal (y-out) is 1 if the input (y-sum) is greater than or equal to a given threshold value, else 0. The number of inputs to a neuron can be many for McCulloch-Pitts, but the output should be only one.
**2. Which of the below statements are true with respect to the Rosenblatt Perceptron:**
I. It can process non-boolean inputs.
II. It can assign different weights to each input automatically.
III. Same as McCulloch-Pitts neuron.
**Ans:** Only I and II are correct
Why?:
A Rosenblatt perceptron takes in numerical inputs along with weights and a bias. It multiplies each input by its respective weight and adds the products together with the bias (this is known as the weighted sum), then applies a function to that weighted sum. **"Automatically" refers to this weighted summation.**
**3. Perceptron is a mathematical representation of the human brain.**
**Ans:** False
Why?:
Perceptrons are models inspired by the human brain, but they are not a mathematical representation of the human brain.
## 1.3 Multi-Layer Perceptron
### What happens with one perceptron?
Given some inputs x1, x2, x3, ..., I want y^ to be close to y.
Experiments:
Suppose I choose the first row and send it through the sigmoid function, and the result is **0.8**.
But in my original y the value is **0**. What can I do?

(this is a one perceptron) **graph 1.3.1**
We can randomly search over the weights w1, w2, w3, w4, ... to find the values that best adjust y^ to the original y, varying them either mathematically or via the data, but we do need to vary them.
How can I do that?
I need to change the weights until the error is near zero, or at least small.
I learn from the error, try another row, **make a pass and come back**, again and again. Once my y^ is close to y, my neuron or perceptron is **good**: the single perceptron is trained to fit this data, hopefully neither overfitted (capturing noise) nor underfitted. At that moment it is **able to perform a logistic regression.**

Once it is trained, I use it on testing data from the real world.
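A hedged sketch of this idea for one sigmoid perceptron: start from random weights, look at the error row by row, nudge the weights, and repeat until y^ is close to y. The toy dataset, learning rate, and update rule are assumptions for illustration, not the course's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: 4 rows, 3 inputs each, with binary targets (made up for illustration).
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([0., 0., 1., 1.])

w = rng.normal(size=3)             # start from random weights
b = 0.0
lr = 0.5                           # learning rate (eta)

for epoch in range(1000):          # "make a pass and come back", row by row
    for xi, yi in zip(X, y):
        y_hat = sigmoid(np.dot(w, xi) + b)
        error = y_hat - yi         # learn from the error
        w -= lr * error * xi       # gradient-style weight update
        b -= lr * error

print(np.round(sigmoid(X @ w + b), 2))   # predictions end up close to [0, 0, 1, 1]
```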
### What happen with Multiple neurons?
**The most famous, widely used example dataset is MNIST.**
We will draw a fully connected neural network, i.e., multiple layers.
Here all the neurons are connected: 3 with 3, giving 3 × 3 connections.
**Note:** the bias is sometimes not shown explicitly in neural network diagrams, but it can be drawn as an extra input of **1**.

Then we do the same thing with the next layer.
The lines are the **weights**.
The values that come out of each **node** are *different*.
And then you feed them into another similar structure, again a **weighted sum**.


And since this function is non-linear, you may ask what happens when linear combinations are mixed with non-linear functions.
(This is a complex perceptron.)

For each input we have 3 functions: F1, F2, F3.
Between the 1st and 2nd layers we have 16 connections or parameters: (3 × 4 = 12 weights) + 4 **bias terms** (in the 2nd layer) = 16.
In the next layer we have: 4 × 3 = 12 weights + 3 **bias terms** = 15.
And in the last: 3 × 1 + 1 **bias term** = 4.
16 + 15 + 4 = 35
So we have 35 degrees of freedom, compared with the 4 degrees of freedom of the single perceptron with 4 weights in **graph 1.3.1**.
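A quick sanity check of that parameter count in code, assuming layer sizes 3, 4, 3, 1 as in the figure:

```python
def count_parameters(layer_sizes):
    """Weights = fan_in * fan_out per layer, plus one bias per node in the receiving layer."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out
    return total

print(count_parameters([3, 4, 3, 1]))   # -> 35 = 16 + 15 + 4
```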
In this multi-layer perceptron, each node contributes its own **step (activation) function**.
The simple perceptron might catch noise.
The complex perceptron is much less likely to catch noise when we are talking about millions of rows.
We can process much more data from a dataset with millions of rows. In a **multi-layer perceptron**, every input contributes through a step function to the next layer, over and over again; this is an *advantage*.

**Note:** the more complex the perceptron, the more training data (tons of millions of rows) you need so that the results are not overfitted.
### Test 1.3
**1. The operation of one perceptron with sigmoid as the activation function is same as the _______ algorithm?**
**Ans:** Logistic Regression; it uses the same sigmoid (logistic) function to classify the outputs.
**2. Consider a fully connected Artificial Neural Network with one hidden layer. The output value of each node in the hidden layer is sent to each node in the output layer.**
**Ans:** True
Why?: A fully connected neural network with one hidden layer has all nodes connected to each node in the output layer.
**3. Neural networks with large architecture can capture the complex structure of the larger datasets?**
**Ans:** True
Why?: The deeper the architecture, the more complex the structure it can capture, since a nonlinear function is applied to the output of each neuron.
## 1.4 MNIST Dataset
(The MNIST dataset is available through scikit-learn; it consists of images of the digits 0 to 9.)
At the beginning, let's say we have 60,000 training images of digits; once we have trained our model, we'll use unseen images (10,000 in the standard test split) for prediction.
Each image is composed of 28 × 28 pixels; these become the inputs.
### Structure: Feed Forward Neural Network
**Input Layer**
28 × 28 = 784 inputs, which come from the matrix of pixels of each image.
**Output Layer**
10 nodes as outputs, because each one predicts one of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
**Hidden Layer**
It's up to you.
Let's say 64 nodes.
(784 × 64 weights + 64 biases) + (64 × 10 weights + 10 biases) = 50,240 + 650 = 50,890
We flatten each image into 784 pixel values that go into the inputs. With our neural model built, there are 50,890 numbers (weights and biases) inside the network as a consequence of feeding the inputs through it.
After the image is computed through the neural network, we get 10 results, each in [0, 1], that answer: which number has been predicted?
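A hedged NumPy sketch of one forward pass through this 784-64-10 network; the random weights and the random "image" are stand-ins just to show the shapes, and the sigmoid/softmax choices are assumptions consistent with later sections:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())                # stabilized exponentials
    return e / e.sum()

x = rng.random(784)                        # one flattened 28 x 28 "image" (random stand-in)
W1, b1 = rng.normal(size=(64, 784)) * 0.01, np.zeros(64)
W2, b2 = rng.normal(size=(10, 64)) * 0.01, np.zeros(10)

h = sigmoid(W1 @ x + b1)                   # hidden layer: 64 values
y_hat = softmax(W2 @ h + b2)               # output layer: 10 scores in [0, 1] summing to 1

print(y_hat.round(3), y_hat.argmax())      # predicted digit = index of the largest score
print(W1.size + b1.size + W2.size + b2.size)   # -> 50890 parameters
```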

Then I try to adjust the weights to get y^ closer to y. This is done by going back to train and retrain, or by selecting the weights again randomly.
Example.

**Conclusion:** each perceptron (with a sigmoid) behaves like a logistic regression function.
Inputs are fed from left to right.
Feed-forward NN = the neurons are fully connected across all the layers.
Without non-linear activations, the entire neural network would just be a linear equation, because a combination of linear functions is still linear.
### Other kinds of activation functions
**Training**
* Loss function
This gives a measure of the distance from y^ to y.
* Backpropagation
An efficient training mechanism.
* Gradient descent
The mechanism for finding the right weights.
**Other types of Neural Networks**
So far we've talked about the *feed-forward neural network*.
**Overfitting and Regularization**
Overfitting happens very quickly if you have too many parameters, as in this case:
50,890 parameters in the neural network is almost of the same order as the 60,000 rows of the training dataset.
We need to consider the **learning rate** (it scales the weight adjustments), and it is usually better to keep it small.
We are solving a non-convex optimization problem, so we find **local optima**. I try to find a pretty good solution, not the best solution (the global optimum), because the pretty good solution is robust and not overfitted, while the "best solution" (global optimum) tends to be overfitted.

What will we talk about?

**Deep Learning**: a huge number of layers and larger neural networks; it can solve complex problems, but it tends to need a lot of training data and training time.
### Test 1.4
**1. How many neurons should be there in the output layer of a neural network for the MNIST dataset?**
**Ans:** 10
Why?: The number of classes in the MNIST dataset is 10, starting from 0 - 9 digits, so the number of neurons in the output layer should be 10. Each neuron is dedicated to each class.
**2. What is a fully connected network(layer)?**
**Ans:** Network where each node in every layer is connected to all the nodes in the next layer.
Why?: A fully connected neural network consists of a series of fully connected layers that connect every neuron in one layer to every neuron in the other layer.
## 1.5 Deep Learning Notations.
Feed-forward means the network is drawn from left to right; the data flows from left to right.
b1: bias notation.
Zj = F(...): F is the **activation function**.
Inside F is a linear summation.
**d:** the size of the vector of inputs (X is a d × 1 vector).
**h^1, h^2, h^3, ...**: the neurons (nodes) of each hidden layer; h^k is the number of nodes in layer k.

**Multiplication of layers.**
See in PDF page 15

This function **F()** gives the output of the entire first layer.
**W^1:** the weights of the first layer (an h^1 × d matrix).
**X:** nothing more than the **d**-dimensional vector of inputs (a column vector).
**b^1:** the vector of biases of the first layer, with one component per node: b1, b2, b3, ...

The product of the h^1 × d weight matrix (the weights of the layer) with the d × 1 input vector X gives the result of this summation.

See pdf as well
**W^2:** the second weight matrix, of size h^2 × h^1.

These are the activations, and further activations keep propagating through all the layers.
**This is how a neural network works and is structured!**
**h^n × 1:** the size of the output of the last hidden layer, where h^n is the number of hidden nodes in that layer.
**How do we find the values of the weights W^1, W^2, W^3, ...?** For that we'll talk about loss functions, backpropagation, and gradient descent; see the topics listed again in chapter 1.4.
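A small sketch of this notation in code: each layer multiplies its weight matrix (h^1 × d for the first layer) by the input vector X, adds the bias vector b, and applies the activation F. The sizes and values here are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(0)

d, h1, h2 = 3, 4, 2                    # input size and two hidden-layer sizes (arbitrary)
X  = rng.random(d)                     # the d-dimensional input vector
W1, b1 = rng.normal(size=(h1, d)), rng.normal(size=h1)   # W^1 is h1 x d
W2, b2 = rng.normal(size=(h2, h1)), rng.normal(size=h2)  # W^2 is h2 x h1

F = np.tanh                            # any activation function F()

Z1 = F(W1 @ X + b1)                    # output of the entire first layer
Z2 = F(W2 @ Z1 + b2)                   # activations keep flowing through the layers
print(Z1.shape, Z2.shape)              # (4,) (2,)
```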
### Test 1.5
**1. How do you compute the output equation of Z in the figure below?**

**Ans:** Z = X1·W1 + X2·W2 + X3·W3 + B
**2. Compute the multiplication of the below two matrices?**

**Ans:**

## 1.6 Types of Activation Function - Part 1
### Step function
Differentiating the function gives its slope, the rate of change of the function. The flat parts give a derivative of zero (0), and the vertical jump gives an undefined (infinite) derivative.
It is a sharp (abrupt) function, so it's not good for learning.

A smooth function is good for learning.
Use one of those instead, such as the sigmoid below.
### Sigmoid function
It's smoother than the step function; centered at zero, its value there is 0.5.
It takes values between zero and one.

### Hyperbolic tangent: tanh(x) = 2·sigmoid(2x) − 1


This is the sigmoid expanded (stretched to a range of 2) and shifted down so it starts at -1.

### ReLU: Rectified Linear Unit.
See the PDF, page 18, for more about ReLU and Leaky ReLU.
ReLU accepts negative inputs (mapping them to zero, while Leaky ReLU lets a small negative slope through); **it's the most popular activation function used in hidden layers of ANNs.**
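A compact sketch of the activation functions from this part and the next (standard definitions; the 0.1 slope for Leaky ReLU matches the quiz in section 1.7):

```python
import numpy as np

def step(z):        return np.where(z >= 0, 1.0, 0.0)        # sharp jump, gradient 0 or undefined
def sigmoid(z):     return 1.0 / (1.0 + np.exp(-z))          # smooth, outputs in (0, 1), 0.5 at z = 0
def tanh(z):        return 2.0 * sigmoid(2.0 * z) - 1.0      # sigmoid stretched/shifted to (-1, 1)
def relu(z):        return np.maximum(0.0, z)                 # popular in hidden layers
def leaky_relu(z):  return np.maximum(0.1 * z, z)             # small slope for negative inputs

z = np.array([-2.0, 0.0, 2.0])
for f in (step, sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(z).round(3))
```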
### Test 1.6
**1. Which of the following is true about the Step Activation function?**
**Ans:** There is a sharp shift from 0 to 1, and the network is not differentiable at 0.
Why?: The step activation function outputs a value of either 0 or 1 depending on the threshold value, hence the sharp shift from 0 to 1.
**2. Which of the following statements is true about tanh?**
I. Both sigmoid and tanh are the same in function
II. The only difference is sigmoid ranges between 0 and 1 and tanh ranges between -1 and 1.
III. Tanh function can interpret values both on the positive and negative sides.
**Ans:** II and III
Why?: The tanh activation function ranges between -1 and 1, so it can represent values on both the positive and negative sides.
**3. Which of the following activation functions is commonly used in the hidden layers of a neural network?**
**Ans:** ReLU
ReLU (Rectified Linear Unit) is the most common activation function for hidden layers because it is both simple to implement and effective at overcoming the limitations of other activation functions like sigmoid and tanh.
## 1.7 Types of Activation Function - Part 2
### Activation Functions to use in Output Layer and Hidden Layer.
**In the output layer:**
For classification problems, use **tanh**, **sigmoid**, **softmax**

For numerical (regression) problems (e.g. temperature in °C), use **linear**

In hidden layer you can use:
* Sigmoid
* tanh
* ReLU the most common
But linear "no", because in each layer you obtain a linear function from a linear function and so on. Like the example.
And **Leaky ReLU** cannot be used in output layers, because the outcomes with leaky ReLU might not gives be same with **y** values that you observe.
### Softmax
See more about it in PDF, page 22.
When we use the softmax function, with its form e^(Zi), we build functions from the inputs (X's) through the first hidden layer and then the following layers: F^1, F^2, F^3. Each of them is a non-linear function. **Hence the neural network becomes more complex (expressive).**
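A minimal softmax sketch matching the e^(Zi) form above; the example output-layer values are made up:

```python
import numpy as np

def softmax(z):
    """Softmax: e^(z_i) normalized by the sum of all e^(z_j)."""
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])     # example output-layer values Z_i
p = softmax(logits)
print(p.round(3), p.sum())             # probabilities that sum to 1
```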


Now, how do we find the weights in these matrices?
See the PDF in page 20.
You'll see 2D and 3D graphics about the behavior of a linear equation, and how it shapes a contour separating the regions for 0 and 1.
The green curve represents a **sigmoid function**; the red one is a **step function**.
At the middle point of the jump you get a linear equation **buried in** a **non-linear structure**, where you can find W1, W2, ... and b. It's a tremendous advantage.
It allows the network to capture non-linear patterns, while the underlying linearity allows the neural network to be computed and trained faster.
### Test 1.7
**1. Which of the following is the function of Leaky ReLu?**
**Ans:** max(0.1 X, X)
Why?: Leaky Relu gives a small value when the input is negative, else it gives the positive value, so the function of Leaky ReLu is max(0.1X, X)
**2. Which of the following activation functions can be used in the output layer of a neural network to predict the score which can be either positive or negative?**
- ReLu
- Tanh
- Sigmoid
- None of these **(Ans)**
Why?: For regression problems, a linear activation function is used; we cannot use tanh or sigmoid as they have a limited output range, and ReLU cannot give negative values as output.
**3. Which of the following is true about the Softmax activation function?**
I. The output of an output layer with softmax activation function gives the probability that sums up to 1.
II. Softmax is used as an activation function in the output layer for multiclass classification problems.
III. Both softmax and sigmoid are the same in function.
**Ans:** Both I and II statements are correct.
Why?: Softmax is used for multi-class classification problems, and it provides the output as probability scores that sum up to 1.
## 1.8 Training the Neural Network - Loss Function
The loss function expresses the distance from Y_i to Y^_i.
See page 23 in PDF
### For Numerical regression

The function above gives the curve below.

What can we do to smooth the curve?
We can use the L2 loss, usually also called MSE (mean squared error) or SSE (sum of squared errors).
It's the most popular loss function used, and it can be differentiated (derived).

It is also easy and fast to implement.
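A tiny sketch of the L2/MSE loss for numerical regression; the y values are made up:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of the squared distances between y and y-hat."""
    return np.mean((y - y_hat) ** 2)

y     = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5,  0.0, 2.0, 8.0])
print(mse(y, y_hat))                    # -> 0.375
```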
### For Classification problems
**What is *Cross entropy Loss*?**
See PDF page 24
If you get:
Yi = 1 and predict Y^i = 0, then log(0) = -inf:
that is an enormous error!
So it's better to talk in probabilistic terms.
Let's say 10%:
Yi = 1 and Y^i = 0.1, then 1 × log(0.1) = -1, so that is a much smaller, finite penalty.
So this function is designed to heavily penalize:
- answers that are OVERCONFIDENT but WRONG.
In classification cases it is better to speak in probabilistic terms, because claiming 100% certainty is not always the truth.
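A matching sketch of binary cross-entropy, showing how a confident-but-wrong prediction is penalized far more than a hedged one; the clipping constant is just a numerical safeguard I added:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy: -[y*log(y_hat) + (1-y)*log(1-y_hat)], averaged over rows."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1.0, 1.0])
print(cross_entropy(y, np.array([0.9, 0.8])))      # small loss: confident and right
print(cross_entropy(y, np.array([0.1, 0.01])))     # large loss: confident but wrong
```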
### Test 1.8
**1. The loss function is used to measure how far the predicted value is from the actual value?**
**Ans:** True
Why?: The loss function is used to find the difference between the actual value and the predicted value, which is called error.
**2. L2 loss function can also be termed as ________?**
A) Mean Squared Error
B) Sum of Squared Error
C) Mean Absolute Error
**Ans:** A and B. The L2 loss function can be termed the sum of squared errors, mean squared error, etc.
## 1.9 The problem of local optimum and saddle point.
### How to minimize the loss function?
See PDF page 25

The minimum value is -10
**The process of searching for the w's**
*(keep reducing the loss using steps scaled by the eta variable; a minimal sketch follows this list)*
1. You start from a set of w's.
2. You create the y and y^ vectors.
3. You find the loss function between the two y's.
4. Using the gradient, you get the new w's.
5. You repeat the above: fit X with the new w's, get a slightly different loss, and keep searching with eta-sized gradient steps.
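A minimal sketch of that search loop on a one-dimensional toy loss L(w) = (w − 3)², with a made-up starting point and eta:

```python
def loss(w):       return (w - 3.0) ** 2        # toy loss function with minimum at w = 3
def gradient(w):   return 2.0 * (w - 3.0)       # its derivative with respect to w

w = 10.0          # 1. start from some w
eta = 0.1         # learning rate (kept small)
for _ in range(100):
    w = w - eta * gradient(w)                   # step against the gradient, again and again
print(round(w, 4), round(loss(w), 6))           # w ends up close to 3, loss close to 0
```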
***The gradient of the loss function is essentially the main learning mechanism.***
It reduces the loss function with respect to the weights.
So it's important to calculate the gradient quickly in order to decide the steps.
A gradient of zero is useless: it doesn't tell us which step to take.
Using the step function for gradients is a **bad idea**, while the sigmoid function is a **better idea**, because it is smooth and gives useful gradients everywhere.
### Local minimum and global minimum.
Now see PDF 27 then 26.
**Now see the problem with the loss function and convexity, from page 29 onwards.**

Let's explain: unfortunately, the picture we discussed before only works for a convex graph. **When we have many weights, this equation cannot be convex in the w's; with many w's the loss is far from convex.**

**To see what the real formula looks like, follow these steps.**

This is the L (loss) function, composed of y and y^, where y^ is itself a composition of functions of linear functions, as we saw before.
And when we expand it across many layers and weights, the function looks like this.

**Differentiating this function with respect to w is not easy, because in reality it involves many different weights: w1, w2, w3, ... At the beginning the function looked easy, but this is the true picture.** So we need to use the **chain rule**, covered next.
### Test 1.9
**1. What happens if the learning rate (eta) is too high?**
**Ans:** The neural network will not converge to minima.
Why?: Increasing the learning rate too much makes the weights jump from one side of the valley to the other during gradient descent rather than converging to a minimum.
## 1.10 Backpropagation.
The main goal is to find the right weights so that the loss function is reduced and the variables **y** and **y^** end up close.

We know that we need to find every weight in every layer, that is, through every function **F(Σ(W_ij·Xi + bj))** that appears in each layer, computing the eta-scaled update many times; the real function is the next equation (see page 29).

Applying the gradient to this function directly is hard, so we need **other alternatives**,
like the **chain rule** or **stochastic gradient descent**.
The following shows a little of the chain rule: we take derivatives by substituting intermediate variables.

In this example, we apply the loss function to every row in the dataset, so each layer contributes a summation, and finding the new weights **w's** becomes computationally expensive. So it's better in this case to use **stochastic gradient descent** to find the weights and bring y^ close to y.
**Characteristics of SGD:**
It makes the calculations easier:
1. Working row by row is easier.
2. It has a specific structure: it begins from the last layer, then the second-to-last layer, updating groups of weights in a smart way. Investigate more about SGD.

(page 29)
By using stochastic gradient descent to solve this complex equation, we go through the millions of rows indexed by i a little at a time, so the calculation becomes more tractable.
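A hedged sketch of one stochastic, chain-rule-style update for a single sigmoid neuron with cross-entropy loss (the data, weights, and eta are made up; with sigmoid plus cross-entropy, the chain rule collapses to the simple factor y^ − y):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def row_loss(w, b, x, y):
    """Cross-entropy loss on a single row."""
    y_hat = sigmoid(x @ w + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

x, y = rng.random(3), 1.0                 # one row picked at random from the dataset
w, b, eta = np.zeros(3), 0.0, 0.5

for step in range(3):                     # a few stochastic steps on this one row
    y_hat = sigmoid(x @ w + b)
    grad_z = y_hat - y                    # chain rule: dL/dz for sigmoid + cross-entropy
    w -= eta * grad_z * x                 # dL/dw = dL/dz * dz/dw = grad_z * x
    b -= eta * grad_z                     # dL/db = grad_z * 1
    print(step, float(row_loss(w, b, x, y)))   # the loss on this row keeps shrinking
```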
### Test 1.10
**1. Backpropagation is the process of learning that the neural network employs to re-calibrate the weights at every layer and every node to minimize the error in the output layer.**
**Ans:** True
Why?: Backpropagation helps to learn and update the weights at every node in the network.
**2. Which of the following is true about the backpropagation rule?**
**Ans:** Backpropagation uses chain rule to update the weights in the network.
Why?: Backpropagation uses the chain rule to update the weights and biases. After each forward pass through the network, backpropagation performs a backward pass, adjusting the weights and biases of the network. This helps reduce the loss with respect to each weight of the network.
## Tips:
How should we initialize the weights of an ANN?
https://glcommunity.mygreatlearning.com/question/how-should-i-initialize-the-weights-of-a-neural-network-61330d97eaa8d23d8b4aab8a