# Convolutional Neural Networks (CNN)
## Neural Networks
In the realm of machine learning, neural networks have emerged as a powerful and versatile tool for solving complex problems. Inspired by the structure and function of the human brain, these networks consist of interconnected nodes, or "neurons," that work collaboratively to process and analyze vast amounts of data. By leveraging their ability to learn and adapt, neural networks have achieved remarkable feats in various domains, ranging from image and speech recognition to natural language processing and autonomous vehicles.
### Perceptron
At the heart of a neural network lies the fundamental building block called the **perceptron**. Developed by Frank Rosenblatt in the late 1950s, the perceptron represents the simplest form of a neural network. Its purpose is to take input data, apply weights to each input, and compute a weighted sum. The result is then transformed using an activation function, which determines the perceptron's output. With its ability to learn from labeled examples, the perceptron became the cornerstone of neural network research, paving the way for more sophisticated architectures and algorithms that drive modern machine learning applications.

**(Rosenblatt, 1958)**
https://doi.org/10.1037/h0042519
The paper by Rosenblatt in 1958, titled "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," introduced the perceptron, an early form of artificial neural network. It proposed a learning algorithm based on adjusting weights to classify inputs into binary categories. While limited in its capabilities, this work laid the foundation for subsequent developments in neural networks and inspired further research in pattern recognition and machine learning.

### Activation functions
Activation functions play a crucial role in neural networks by introducing non-linearity and enabling the networks to learn complex patterns and make accurate predictions. These functions determine the output of a neuron or a layer of neurons based on the weighted sum of inputs. By applying an activation function, neural networks gain the capability to model complex relationships between inputs and outputs, making them more flexible and capable of capturing intricate patterns in data.
One of the most commonly used activation functions is the sigmoid function, which transforms the input into a range between 0 and 1. This function is particularly useful in binary classification problems, where the network needs to make a decision between two classes. Another popular activation function is the rectified linear unit (ReLU), which outputs the input directly if it is positive and zero otherwise. ReLU has gained popularity due to its simplicity and ability to address the vanishing gradient problem. Additionally, there are other activation functions, such as hyperbolic tangent (tanh), softmax, and leaky ReLU, each with their own advantages and use cases. By choosing the appropriate activation function for each layer, neural networks can unlock their full potential and achieve remarkable performance in a wide range of machine learning tasks.
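A quick sketch of these activation functions (a minimal NumPy illustration, not tied to any particular framework):

```python
import numpy as np

def sigmoid(x):
    # Squashes the input into (0, 1); useful for binary classification outputs.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Passes positive values through unchanged, zeros out negatives.
    return np.maximum(0.0, x)

def tanh(x):
    # Squashes the input into (-1, 1).
    return np.tanh(x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but lets a small signal through for negative inputs.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # all values between 0 and 1
print(relu(x))     # [0. 0. 2.]
```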

**Why Do We Need Nonlinear Activation Functions?**
Consider a neural network with one hidden layer and two hidden neurons per layer.

If the hidden layer uses a linear activation, the output layer can be rewritten as a linear combination of the original input variables. With more neurons and weights the equation would simply be longer, with more nesting and more multiplications between successive layers' weights. The idea remains the same, though: the complete network is equivalent to a single linear layer.
Nonlinear activation functions are required to help the network represent more complex functions. Let's start with a well-known illustration, the sigmoid function.
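This collapse can be checked numerically; the short NumPy sketch below shows that two stacked linear layers with weights `W1` and `W2` behave exactly like one linear layer with weights `W1 @ W2`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))    # one input sample with 3 features
W1 = rng.normal(size=(3, 2))   # "hidden" layer with 2 neurons, no activation
W2 = rng.normal(size=(2, 1))   # output layer

# Two stacked linear layers...
out_stacked = (x @ W1) @ W2
# ...are exactly one linear layer with combined weights W1 @ W2.
out_single = x @ (W1 @ W2)

assert np.allclose(out_stacked, out_single)
```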
### Decision boundary
The concept of a decision boundary is essential in understanding how neural networks classify and separate different classes of data. In machine learning, the decision boundary represents the boundary or dividing line that separates one class from another in a feature space. It serves as a decision-making criterion for assigning new, unseen data points to their respective classes based on their features.

When adding more levels of nodes, such as increasing the depth of a neural network, the decision boundary becomes more flexible and capable of capturing complex patterns in the data. The additional layers allow the network to learn hierarchical representations of the input features, enabling it to discover and model intricate relationships. This increased depth allows for the formation of more abstract and nuanced decision boundaries, enhancing the network's ability to discriminate between different classes or make more precise predictions. As a result, adding more levels of nodes in a neural network can improve its capacity to handle increasingly intricate datasets and enhance its overall performance in tasks requiring complex decision-making.

### Bias Trick
The bias trick is a technique used to incorporate a bias term into the calculations of a perceptron or a neural network.

In the context of a perceptron, the bias term is an additional input that is always set to a fixed value of -1. It is multiplied by a corresponding weight and added to the weighted sum of the other inputs before passing through the activation function. By introducing a bias term, the perceptron can learn an offset or a threshold for making decisions, allowing it to better model complex relationships in the data.

The bias term essentially shifts the activation function's output, affecting the decision boundary and introducing flexibility to the model's predictions. Without the bias term, the perceptron would be forced to pass through the origin (0,0) in the input space, limiting its expressive power. Including a bias term allows the perceptron to capture patterns that might not intersect the origin or require a non-zero threshold for activation. Overall, the bias trick enhances the capabilities of neural networks by providing an additional degree of freedom in their learning and decision-making processes.
**Example:**
Let's consider an example where we have a perceptron with two inputs, x1 and x2, and a linear regression node. The perceptron's task is to classify whether a point (x1, x2) belongs to a certain class or not.

Without the bias term, the perceptron's decision boundary would be forced to pass through the origin (0,0) in the input space. This means that it can only separate the classes using a line that intersects the origin.

However, many real-world datasets may require decision boundaries that do not pass through the origin or have a non-zero threshold.
To incorporate the bias term, we introduce an additional input called `x0`, which in this case is set to -1. The perceptron's weighted sum becomes `w0*x0 + w1*x1 + w2*x2`, where `w0` represents the bias weight. Since `x0 = -1`, the linear regression node's equation becomes: `w1*x1 + w2*x2 - w0`.

The bias term allows the perceptron to shift the decision boundary, effectively changing the threshold for activation.
In summary, by incorporating the bias trick and introducing the bias term, the perceptron gains the ability to learn non-zero thresholds and separate classes with decision boundaries that do not necessarily pass through the origin. This enhances the perceptron's modeling capabilities, allowing it to handle a wider range of classification tasks.
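The bias trick can be sketched in a few lines of NumPy (the weight and input values here are purely illustrative):

```python
import numpy as np

# Bias trick as described above: append a constant input x0 = -1,
# so the bias becomes just another weight (w0) in the dot product.
def augment(x):
    # x: feature vector [x1, x2] -> [-1, x1, x2]
    return np.concatenate(([-1.0], x))

w = np.array([0.5, 1.0, 2.0])   # [w0 (bias weight), w1, w2] -- illustrative values
x = np.array([3.0, 4.0])        # [x1, x2]

# w0*x0 + w1*x1 + w2*x2 with x0 = -1 equals w1*x1 + w2*x2 - w0
score = np.dot(w, augment(x))
print(score)  # 1.0*3.0 + 2.0*4.0 - 0.5 = 10.5
```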
### Multi-layer Perceptron
MLPs consist of multiple layers of interconnected nodes, or neurons, with each neuron in a layer connected to every neuron in the subsequent layer. This structure enables the network to capture intricate patterns and make sophisticated predictions.
The key characteristic of MLPs lies in their ability to introduce non-linearity through the activation functions used in each neuron. By incorporating non-linear activation functions, such as the sigmoid or ReLU, MLPs are capable of learning complex decision boundaries and capturing intricate dependencies within the data.

The Multi-Layer Perceptron (MLP) consists of three main layers: the input layer, hidden layers, and the output layer.
- Input Layer:
  - The input layer is responsible for receiving and encoding the input data.
  - Each neuron in the input layer represents a feature or attribute of the input data.
  - The number of neurons in the input layer is determined by the dimensionality of the input data.
- Hidden Layers:
  - Hidden layers are located between the input and output layers and perform the intermediate computations.
  - Each neuron in the hidden layers takes input from all the neurons in the previous layer.
  - Hidden layers allow the network to learn complex representations and extract high-level features from the input data.
  - MLPs can have one hidden layer (a shallow network) or multiple hidden layers (a deep network).
- Output Layer:
  - The output layer is the final layer of the MLP that produces the desired output or prediction.
  - In multi-class classification, the number of neurons in the output layer typically equals the number of classes the model has to distinguish.
  - The activation function applied to the output layer is chosen based on the nature of the problem, such as softmax for multi-class classification or sigmoid for binary classification.

Overall, the MLP's layers work together to process input data, learn complex patterns through the hidden layers, and produce output predictions through the output layer.
**Multi layers architecture**
Source: [https://ml-cheatsheet.readthedocs.io/en/latest/forwardpropagation.html]( https://ml-cheatsheet.readthedocs.io/en/latest/forwardpropagation.html)
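As a rough sketch (layer sizes and weights here are arbitrary, chosen only for illustration), a forward pass through an MLP with one hidden layer looks like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Shapes: 3 input features -> 4 hidden neurons -> 2 output neurons.
rng = np.random.default_rng(42)
W_hidden = rng.normal(size=(3, 4))
b_hidden = np.zeros(4)
W_out = rng.normal(size=(4, 2))
b_out = np.zeros(2)

x = np.array([0.5, -1.2, 0.3])         # one input sample
h = sigmoid(x @ W_hidden + b_hidden)   # hidden layer: weighted sum + nonlinearity
y = sigmoid(h @ W_out + b_out)         # output layer
print(y.shape)  # (2,)
```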


### Non-convex loss function
Non-convex loss functions present challenges for optimization algorithms like gradient descent. Unlike convex functions, they have multiple local minima, making it difficult to find the global minimum. However, for large neural networks, most local minima exhibit similar performance on test datasets. Therefore, the focus shifts from finding the absolute global minimum to obtaining a good enough solution that generalizes well. Overemphasizing the search for the global minimum can lead to overfitting. In practice, regularization techniques and early stopping are employed to address this issue. The goal becomes finding a solution that balances model performance and generalization, rather than solely chasing the global minimum.

### `neuralnetworksanddeeplearning.com`
http://neuralnetworksanddeeplearning.com : It is an online resource created by Michael Nielsen, which provides a comprehensive and accessible introduction to neural networks and deep learning. The website includes a free online book that covers various topics related to neural networks, including the basics of artificial neural networks, learning algorithms, convolutional neural networks, recurrent neural networks, and more. It offers clear explanations, interactive examples, and code implementations to help readers understand the fundamental concepts and practical applications of neural networks. "neuralnetworksanddeeplearning.com" is highly regarded for its educational value and has been a valuable resource for individuals seeking to learn about neural networks and deepen their understanding of deep learning techniques.
### Perceptron training
Perceptron training refers to the process of adjusting the weights and biases of a perceptron, a type of artificial neuron, during the learning phase. The goal of perceptron training is to find the optimal set of weights and biases that allow the perceptron to make accurate predictions or classify input data correctly.
The training of a perceptron involves the following steps:
1. **Initialization**: The weights and biases of the perceptron are initialized with random values.
2. **Forward Propagation**: Input data is passed through the perceptron, and a weighted sum of the inputs is computed. This sum is then passed through an activation function to produce the perceptron's output.
3. **Error Calculation**: The output of the perceptron is compared to the desired or target output. The difference between the predicted output and the target output is calculated as the error.
4. **Backpropagation**: The error is used to adjust the weights and biases of the perceptron. This adjustment is performed by propagating the error backward through the perceptron, updating the weights and biases based on a chosen learning algorithm, such as the gradient descent algorithm.
5. **Iteration**: Steps 2-4 are repeated for multiple iterations or epochs, allowing the perceptron to gradually improve its performance by minimizing the error.
By iteratively updating the weights and biases based on the observed errors, the perceptron training process aims to converge towards a set of parameters that enable the perceptron to accurately classify input data and make reliable predictions.
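The steps above can be sketched with the classic perceptron update rule on a toy linearly separable dataset (the AND function); the dataset, learning rate, and epoch count are illustrative choices:

```python
import numpy as np

# Toy dataset: the AND function (linearly separable).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # step 1: random initialization
b = 0.0
lr = 0.1                 # learning rate

for epoch in range(20):                            # step 5: iterate over epochs
    for xi, target in zip(X, y):
        out = 1 if np.dot(w, xi) + b > 0 else 0    # step 2: forward pass (step activation)
        error = target - out                       # step 3: error calculation
        w += lr * error * xi                       # step 4: weight update
        b += lr * error                            #         bias update

preds = [1 if np.dot(w, xi) + b > 0 else 0 for xi in X]
print(preds)  # should match [0, 0, 0, 1] after training
```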
### Softmax
Softmax is a popular activation function used in the output layer of neural networks for **multi-class classification tasks**. It converts a vector of real numbers into a probability distribution over multiple classes.

Softmax calculates the exponential of each input value and then normalizes them by dividing by the sum of all exponentiated values. This normalization ensures that the output values lie between 0 and 1 and sum up to 1, representing the probability for that particular neuron. In the next example we want to predict between three elements (banana, orange and strawberry). The top or first neuron predicts a probability of 0.46 for the image to be a banana, the middle or second neuron predicts a 0.34 chance for it to be an orange, and the last one predicts a 0.20 chance for it to be a strawberry. Therefore the image will be classified as a banana:

The softmax function is commonly used because it provides a smooth and differentiable way to interpret the outputs of a neural network as class probabilities. It allows the network to assign a likelihood to each class, aiding in decision-making. Softmax is particularly useful when dealing with mutually exclusive classes, where an input belongs to only one class. The class with the highest probability after applying softmax is typically considered the predicted class. Softmax enables the network to output probability distributions, making it well-suited for multi-class classification problems.
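A minimal NumPy implementation of softmax; the logits below are hypothetical values chosen so the resulting probabilities roughly match the banana/orange/strawberry example (≈ 0.46, 0.34, 0.20):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

# Hypothetical raw scores (logits) for [banana, orange, strawberry].
logits = np.array([2.0, 1.7, 1.2])
probs = softmax(logits)
print(probs, probs.sum())      # probabilities sum to 1
print(int(np.argmax(probs)))   # 0 -> the image is classified as "banana"
```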
### Outputs
Neural networks can be designed to produce different types of outputs based on the specific requirements of the task at hand. Here are three common scenarios:
1. Output Layer with a **Single Neuron** (No Activation Function):
- This configuration is typically used for regression tasks where the goal is to predict a continuous value.
- The output of the single neuron represents the predicted value directly, without any transformation.
2. Output Layer with **Multiple Neurons and Sigmoid Activation**:
- In cases where multi-label classification is required, where an input can be associated with multiple labels simultaneously, multiple neurons with sigmoid activation are utilized.
- Each neuron in the output layer corresponds to a specific label, and the sigmoid activation function transforms the output values of each neuron into a range between 0 and 1, which can be interpreted as a probability-like score.
3. Output Layer with **Multiple Neurons and Softmax Activation**:
- The softmax activation function is employed in the output layer for multi-class classification tasks.
- Softmax transforms the output values of each neuron into probabilities that represent the likelihood of the input belonging to each class.
- The class with the highest probability is considered the predicted class for the input sample.
By customizing the design of the output layer and choosing the appropriate activation function, neural networks can effectively address different types of tasks, including regression, multi-label classification, and multi-class classification.
### Categorical Cross Entropy Loss Function
Categorical Cross Entropy (CCE) is a widely used loss function in neural networks for multi-class classification tasks. It measures the dissimilarity between the predicted class probabilities and the true class labels.
CCE operates by taking the predicted class probabilities output by the network and comparing them to the one-hot encoded true class labels. The predicted probabilities are passed through the logarithm function, which strongly penalizes confident but wrong predictions. The negative sum of the true classes' log-probabilities is then averaged across all samples, yielding the Categorical Cross Entropy loss.
The CCE loss function aims to minimize the distance between the predicted probabilities and the true labels, encouraging the network to assign high probabilities to the correct classes.

It effectively penalizes incorrect predictions and provides a continuous and differentiable measure of the network's performance. By minimizing the CCE loss during training using optimization algorithms like gradient descent, the neural network can learn to make more accurate class predictions and improve its overall classification performance.
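A minimal sketch of the CCE computation in NumPy (the sample labels and probabilities are made up for illustration):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot encoded labels, shape (n_samples, n_classes)
    # y_pred: predicted probabilities (e.g. softmax outputs), same shape
    y_pred = np.clip(y_pred, eps, 1.0)            # avoid log(0)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
    return np.mean(per_sample)                    # average over samples

y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(categorical_cross_entropy(y_true, y_pred))  # = -(ln 0.7 + ln 0.8) / 2
```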
**Cross Entropy vs Accuracy**
Cross entropy and accuracy are both metrics used to evaluate the performance of classification models. While accuracy measures the proportion of correct predictions, cross entropy quantifies the dissimilarity between predicted and true class probabilities. Accuracy provides a straightforward assessment of overall correctness, but it doesn't account for the confidence or uncertainty of predictions. In contrast, cross entropy captures the nuances of prediction probabilities, penalizing incorrect and uncertain predictions. Cross entropy is a more sensitive measure and often serves as the optimization objective during training, whereas accuracy is a simple and intuitive metric for evaluating the overall correctness of a model's predictions.
Here is a simple example to understand it better:

'Pred' is the predicted value and 'GT' is the real value. With this information you can calculate the accuracy and the cross entropy values:
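Since the original figure is not reproduced here, the following sketch uses hypothetical 'Pred' and 'GT' values to show how the two metrics differ:

```python
import numpy as np

# Hypothetical predictions ('Pred') and ground truth ('GT') for 4 samples, 2 classes.
pred = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.8, 0.2]])
gt = np.array([0, 0, 1, 1])  # true class indices

# Accuracy: fraction of samples where the argmax matches the ground truth.
accuracy = np.mean(np.argmax(pred, axis=1) == gt)

# Cross entropy: -log of the probability assigned to the true class, averaged.
cross_entropy = np.mean(-np.log(pred[np.arange(len(gt)), gt]))

print(accuracy)       # 0.75 -- the last sample is misclassified
print(cross_entropy)  # also penalizes the low-confidence correct predictions
```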

#### Other loss functions
- **MSE (mean squared error)** for regression
- **Cross entropy** for binary classification
- **Categorical cross entropy** for multi-class classification
### Frameworks
Frameworks implementing automatic differentiation have revolutionized the field of deep learning, enabling efficient and scalable training of neural networks. Two prominent frameworks that have gained widespread popularity are TensorFlow and PyTorch. These frameworks provide powerful tools and abstractions to build and train complex neural network models. Here's a brief introduction to TensorFlow and PyTorch, highlighting their unique features:

- **TensorFlow:**
TensorFlow, developed by Google, is a comprehensive and highly flexible deep learning framework. It offers a wide range of functionalities, including automatic differentiation, for efficient gradient-based optimization. TensorFlow utilizes a computational graph paradigm, where operations are represented as nodes and data flows through the graph. This allows for efficient parallel computation and easy deployment on various hardware platforms. TensorFlow's extensive ecosystem provides pre-built neural network layers, optimization algorithms, and visualization tools, making it suitable for both research and production environments.

- **PyTorch:**
PyTorch, developed by Facebook's AI Research lab, has gained popularity for its intuitive and dynamic nature. It follows an imperative programming model, allowing users to define and modify computational graphs on-the-fly, making experimentation and debugging more convenient. PyTorch's automatic differentiation capabilities enable seamless computation of gradients, facilitating efficient backpropagation. The framework's user-friendly APIs and Pythonic syntax make it easy to understand and use, attracting a large community of developers and researchers.
Both TensorFlow and PyTorch have played pivotal roles in advancing the field of deep learning, offering flexible and efficient automatic differentiation functionalities. Their distinct design philosophies and features cater to different needs, providing practitioners and researchers with powerful tools to tackle diverse deep learning challenges.
### Regularization
**Regularization with L1 or L2 norm**
L1 and L2 regularization add a penalty term to the loss function proportional to the absolute values (L1) or the squared magnitudes (L2) of the weights. Penalizing large weights keeps the model simpler and less prone to overfitting; L1 additionally drives many weights to exactly zero, producing sparse models.

**Regularization with Dropout**
- Intuition: combining the predictions of multiple models trained for the same purpose (ensemble learning) is a way to prevent overfitting.
- Idea: Average the predictions of multiple independently trained models to solve the same problem.
- Problem: this is really expensive!
- Solution: a single model can be "multiple models" at the same time!
Regularization with dropout is a technique used in neural networks to prevent overfitting. Dropout randomly sets a fraction of input units to zero during training, forcing the network to learn more robust and generalized features. This process simulates training multiple networks with different subsets of neurons, improving the network's ability to generalize to unseen data. By preventing complex co-adaptations of neurons, dropout encourages the network to rely on a diverse set of features. During inference, dropout is turned off, and the weights are scaled to compensate for the dropout rate, resulting in more reliable predictions.

Randomly, during training, neurons are ignored with probability (1 - p) in the forward pass. At test time, all neurons are used and their outputs are scaled by p.
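A minimal sketch of this scheme (drop with probability 1 - p during training, scale by p at test time), with made-up activations:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(a, p, train=True):
    # a: activations; p: probability of KEEPING a neuron (dropped with prob 1 - p).
    if train:
        mask = rng.random(a.shape) < p   # 1 with probability p, 0 otherwise
        return a * mask                  # dropped neurons output zero
    # At test time all neurons are active; scale by p to match the
    # expected activation seen during training.
    return a * p

a = np.ones(10)
print(dropout_forward(a, p=0.8))               # some entries zeroed during training
print(dropout_forward(a, p=0.8, train=False))  # all entries scaled to 0.8
```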

### Backpropagation
*Backpropagation is a fundamental technique in machine learning that enables neural networks to learn from data. It is based on the principles of gradient descent optimization. In a neural network, each connection between neurons has a weight associated with it. The goal of backpropagation is to adjust these weights to minimize the difference between the network's predictions and the desired outputs.*
> The backpropagation process involves two main steps: **forward propagation** and **backward propagation**.
A brief concept of how they work is:
- **Forward propagation:** During forward propagation, the input data is passed through the network, and the activations of each neuron are computed layer by layer using a nonlinear activation function.
- **Backward propagation:** After forward propagation, the network's predictions are compared to the desired outputs using a loss function. The gradient of the loss function with respect to the weights of the network is then calculated using the chain rule of calculus. This gradient represents the direction and magnitude of the steepest ascent or descent of the loss function.
Starting from the output layer, the gradients are computed recursively layer by layer, moving backward through the network. The gradients are used to update the weights of the connections using an optimization algorithm, typically *stochastic gradient descent (SGD)* or one of its variants. The update rule adjusts the weights in the direction opposite to the gradient, scaled by a learning rate.
By repeatedly performing forward and backward propagation on batches of training data, the network gradually learns to adjust its weights, reducing the loss and improving its predictive performance. The learning process continues until convergence or a predefined stopping criterion is met.
It's worth noting that backpropagation can be enhanced with additional techniques, such as regularization methods (e.g., L1 or L2 regularization) to prevent overfitting, or advanced optimization algorithms (e.g., Adam or RMSprop) that adapt the learning rate dynamically.
> In summary, backpropagation is a mathematical algorithm that allows neural networks to learn by iteratively adjusting the weights based on the gradients of the loss function with respect to those weights. This process enables the network to improve its predictions over time through repeated forward and backward propagation.
### Fully connected
A fully connected neural network is a powerful machine learning architecture. Composed of interconnected layers, it learns nonlinear patterns and relationships in the data. Linear transformations and activation functions give it its adaptability. Through training, the synaptic weights are adjusted to improve the accuracy of its predictions. These networks offer a wide range of applications for solving problems and discovering knowledge in data.

## Convolution
### Image classification
Image classification is the task of assigning predefined labels or categories to images based on their visual content. It is a fundamental problem in computer vision and has numerous applications, such as object recognition, scene understanding, and medical diagnosis.
In image classification, each image is represented as a matrix or grid of pixels, where each pixel contains information about its color or intensity. The size of the image matrix is determined by the image's resolution, which specifies the number of pixels in the width and height dimensions.

**RGB (Red, Green, Blue)** is a common color model used in digital images. In RGB, each pixel is represented by three color channels: red, green, and blue. The combination of these channels determines the color of the pixel. For example, a bright red pixel would have a high intensity value in the red channel and low values in the green and blue channels.
In grayscale images, each pixel is represented by a single intensity value ranging from 0 to 255, where 0 represents black and 255 represents white. Grayscale images have only one channel and are often used when color information is not necessary for the task at hand.
Color images typically have three channels (RGB), while grayscale images have a single channel. For example, a 100x100 RGB image would have a matrix size of 100x100x3 (width x height x channels), where each pixel is represented by three values corresponding to the intensity of red, green, and blue.

> Image classification is a challenging problem due to variations in lighting conditions, viewpoints, occlusions, and background clutter. Researchers continually work on developing more advanced models and techniques to improve the accuracy and robustness of image classification systems.
### K-Nearest Neighbors
K-Nearest Neighbors (K-NN) is a simple yet effective algorithm used for classification and regression tasks in machine learning. **It is a non-parametric and instance-based learning method.**
In K-NN, the "K" represents the number of nearest neighbors that are considered when making a prediction for a new data point. The algorithm assumes that similar data points often share the same class or have similar target values.
Here's how K-NN works for classification:
1. **Training**: The algorithm memorizes the training dataset, which consists of labeled data points with their corresponding classes.
2. **Prediction**: Given a new, unlabeled data point, the algorithm measures the distances between this point and all the training data points. The distance can be calculated using various metrics, such as Euclidean or Manhattan distance.
3. **Neighbor selection**: The **K nearest neighbors** to the new data point are selected based on the calculated distances. These neighbors are the data points in the training dataset that are closest to the new data point.
4. **Voting**: For classification, the algorithm determines the class of the new data point based on the classes of its K nearest neighbors. This is done through majority voting, where the class that appears most frequently among the neighbors is assigned to the new data point.
5. **Prediction result**: The predicted class for the new data point is the outcome of the majority voting.
K-NN is a flexible algorithm that can handle various types of data and **doesn't require training time**. However, it can be sensitive to the choice of the K value, as a small K may lead to overfitting, while a large K may result in underfitting. Additionally, **it can be computationally expensive for large datasets since distance calculations need to be performed for each prediction**.

> In K-Nearest Neighbors (K-NN), the L1 and L2 distances are commonly used metrics to calculate the distance between data points. These distances help determine the nearest neighbors when making predictions for a new data point.
**L1 Distance (Manhattan Distance):**
- The L1 distance, also known as the Manhattan distance, measures the absolute difference between the coordinates of two points. It is calculated by summing the absolute differences between the corresponding feature values of the two points along each dimension. Mathematically, the L1 distance between two points (x₁, y₁) and (x₂, y₂) in a two-dimensional space is given by:
L1 distance = |x₁ - x₂| + |y₁ - y₂|
The L1 distance is named after the "Manhattan grid" because it represents the distance a taxi would have to travel to move between two points in a city grid-like layout.
**L2 Distance (Euclidean Distance):**
- The L2 distance, also known as the Euclidean distance, calculates the straight-line distance between two points in a multi-dimensional space. It is calculated by taking the square root of the sum of the squared differences between the corresponding feature values of the two points. Mathematically, the L2 distance between two points (x₁, y₁) and (x₂, y₂) in a two-dimensional space is given by:
L2 distance = √((x₁ - x₂)² + (y₁ - y₂)²)
The L2 distance corresponds to the length of the straight line connecting the two points and is a commonly used distance metric in many applications.
Both L1 and L2 distances can be used in K-NN to determine the nearest neighbors based on their distance from a new data point. The choice of which distance metric to use depends on the nature of the data and the specific problem at hand.
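A compact sketch of K-NN with both distance metrics (the toy dataset is illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, metric="l2"):
    # Compute the distance from x_new to every training point.
    diff = X_train - x_new
    if metric == "l1":
        dists = np.sum(np.abs(diff), axis=1)        # Manhattan distance
    else:
        dists = np.sqrt(np.sum(diff ** 2, axis=1))  # Euclidean distance
    # Select the k nearest neighbors and take a majority vote on their labels.
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))                # 0
print(knn_predict(X_train, y_train, np.array([8.5, 8.5]), metric="l1"))   # 1
```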
### Linear Classifier
A linear classifier is a type of machine learning model that separates data points into different classes using a linear decision boundary. It assumes that the classes can be separated by a hyperplane in the feature space. Despite its simplicity, linear classifiers can be effective in certain scenarios.
For example, let's consider an image classification task with three classes: cat, dog, and bird. Suppose the images have a resolution of 2x2 pixels, resulting in a matrix size of 2x2. Each pixel represents the grayscale intensity ranging from 0 to 255.
To create a linear classifier, we can flatten the image matrix into a feature vector of length 4 (2x2 = 4). Each image can be represented by a feature vector containing the pixel intensities.

Next, we assign a weight to each pixel in the feature vector. For instance, we can have a weight vector W = [0.2, -0.5, 0.1, 2] for the cat (a negative weight is what makes a negative score possible) and different ones for the other classes. The classifier calculates a score for each class by taking the dot product between the weight vector and the feature vector.
For a given image, the class with the highest score becomes the predicted label. For example, if the dot products result in scores [-96.8, 437.9, 61.95] for the cat, dog, and bird classes respectively, the classifier predicts the image to be a dog.
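This worked example can be reproduced in NumPy; the pixel values, weight rows, and per-class biases below are hypothetical values chosen so that the scores match the ones quoted above:

```python
import numpy as np

# A 2x2 grayscale image flattened into a feature vector of length 4.
image = np.array([[56, 231],
                  [24, 2]], dtype=float)
x = image.flatten()  # [56, 231, 24, 2]

# One weight row per class (illustrative values).
W = np.array([
    [0.2, -0.5, 0.1, 2.0],   # cat
    [1.5, 1.3, 2.1, 0.0],    # dog
    [0.0, 0.25, 0.2, -0.3],  # bird
])
b = np.array([1.1, 3.2, 0.0])  # per-class bias (hypothetical)

scores = W @ x + b             # one score per class
predicted = int(np.argmax(scores))
print(scores)     # [-96.8, 437.9, 61.95]
print(predicted)  # 1 -> dog
```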

The weights in the weight vector are learned through training using labeled data. Optimization algorithms, such as gradient descent, adjust the weights to minimize a loss function that quantifies the difference between predicted and actual labels.
However, it is important to note that if you rearrange the vector of weights for a particular class in a linear classifier into the image's shape, it can resemble an image of the class you are trying to classify. This occurs because in a linear classifier, the prediction for a specific class is determined by taking the dot product (or inner product) between the weight vector of that class and the feature vector of the input. Maximizing the dot product between two vectors means aligning them in the same direction, making them parallel, and therefore similar to the class you are trying to predict.

It's important to note that this parallelism or alignment is not an absolute guarantee for accurate classification. Linear classifiers have limitations in capturing complex relationships and may struggle with datasets where the decision boundaries are not linear. More advanced models, such as nonlinear classifiers or deep neural networks, are often employed to handle such scenarios.
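The scoring step above can be sketched in a few lines of NumPy. The weight values and the sample image below are purely illustrative (not from a trained model); only the cat row matches the example weights given earlier:

```python
import numpy as np

# Hypothetical weight matrix: one row of 4 weights per class (cat, dog, bird),
# matching the flattened 2x2 image. Values are illustrative only.
W = np.array([
    [0.2, 0.5, 0.1, 2.0],   # cat
    [1.5, -0.3, 0.8, 0.4],  # dog
    [-0.5, 0.9, 1.2, 0.1],  # bird
])
classes = ["cat", "dog", "bird"]

# A 2x2 grayscale image, flattened into a length-4 feature vector.
image = np.array([[120, 30], [200, 45]])
x = image.flatten()

# One dot product per class gives a score vector; argmax is the prediction.
scores = W @ x
predicted = classes[int(np.argmax(scores))]
```

Training would adjust the rows of `W` so that the correct class's score comes out highest.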
### Noise reduction
Noise reduction with moving average in 2D involves smoothing an image by replacing each pixel with the average value of its neighboring pixels. In this technique, a sliding window moves across the image, and for each pixel, the average intensity of the pixels within the window is computed. This process reduces high-frequency noise by effectively blurring the image. The size of the window determines the level of smoothing, with larger windows providing more extensive noise reduction but potentially sacrificing image details.
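A minimal sketch of the 2D moving average, assuming a grayscale image stored as a NumPy array and ignoring the border (no padding):

```python
import numpy as np

def moving_average_2d(image, window=3):
    """Smooth an image by replacing each pixel with the mean of its
    window x window neighbourhood (valid region only, no padding)."""
    h, w = image.shape
    out = np.empty((h - window + 1, w - window + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = image[i:i + window, j:j + window].mean()
    return out

# A flat image with one noisy spike: averaging spreads the spike out.
noisy = np.full((5, 5), 10.0)
noisy[2, 2] = 100.0
smoothed = moving_average_2d(noisy)
```

A larger `window` smooths more aggressively, at the cost of fine detail, which is exactly the trade-off described above.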



### Correlation Filtering
When a filter is added to the moving average in 2D noise reduction, it enhances the smoothing process by applying a specific pattern or behavior to the averaging operation. Filters can emphasize or suppress certain features in the image based on their characteristics.


For example, a low-pass filter can be combined with the moving average to further reduce high-frequency noise while preserving low-frequency information, resulting in a smoother image. This filter attenuates or removes high-frequency components, such as noise or fine details, from the image.
On the other hand, a high-pass filter can be used to accentuate edges or fine details by subtracting a smoothed version of the image from the original image. This enhances the high-frequency components and can be useful for edge detection or feature extraction tasks.
By incorporating filters into the moving average process, the noise reduction technique can be tailored to address specific image characteristics or requirements, enabling better control over the trade-off between noise reduction and preservation of important image features. [Here are some examples](https://setosa.io/ev/image-kernels/)
### Correlation vs Convolution
Convolution and correlation are similar operations used in image processing and signal analysis. Both involve sliding a kernel over an input signal or image. **Convolution combines the kernel with the input signal by flipping it horizontally and vertically**. Correlation, on the other hand, does not flip the kernel. As a result, convolution captures spatial relationships and is widely used in tasks like filtering and feature extraction. Correlation, being less concerned with spatial relationships, is often used for template matching and pattern recognition.
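The only difference between the two operations is the kernel flip, which a short sketch makes explicit (naive loops, "valid" region only):

```python
import numpy as np

def correlate_2d(image, kernel):
    """Slide the kernel over the image without flipping (cross-correlation)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

def convolve_2d(image, kernel):
    """Convolution is correlation with the kernel flipped both ways."""
    return correlate_2d(image, kernel[::-1, ::-1])

image = np.arange(16, dtype=float).reshape(4, 4)
asym = np.array([[1.0, 0.0], [0.0, -1.0]])  # asymmetric kernel

corr = correlate_2d(image, asym)
conv = convolve_2d(image, asym)
```

For a symmetric kernel (e.g. a box filter) the flip changes nothing, so the two operations coincide; the distinction only matters for asymmetric kernels like the one above.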


### Kernel and types of Kernels
In the context of image processing and machine learning, a kernel refers to a small matrix or a predefined mathematical function that is applied to an image or a dataset. Kernels are commonly used in operations such as convolution, filtering, and feature extraction.
A kernel acts as a window or template that slides over the input data, and its values determine the contribution or effect on the neighboring data points. By convolving the kernel with the input data, specific patterns or characteristics can be enhanced, extracted, or transformed.
There are several commonly used kernels in image processing and machine learning, including:
- Gaussian Kernel: A Gaussian kernel is used for blurring or smoothing an image. It assigns higher weights to pixels closer to the center of the kernel, resulting in a gradual transition from the center to the edges.

- Box (or Mean) Kernel: A box kernel applies equal weights to all the pixels within the kernel window, resulting in simple averaging or blurring of the image.

- Identity Kernel: The identity kernel preserves the original image without any changes. It is commonly used as a placeholder or when no modification is desired.

- Mixed Kernel: Mixed kernels are used to accentuate differences by emphasizing sudden changes in intensity, highlighting edges.

These are just a few examples of commonly used kernels. Depending on the specific task or application, different kernels can be designed or chosen to extract specific features, enhance certain characteristics, or perform various image processing operations.
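For reference, here are small 3x3 instances of the kernels above, using the usual textbook values:

```python
import numpy as np

# Box (mean) kernel: equal weights, sums to 1, blurs the image.
box = np.full((3, 3), 1 / 9)

# Identity kernel: passes the image through unchanged.
identity = np.zeros((3, 3))
identity[1, 1] = 1.0

# Gaussian-like kernel: the centre dominates, corners contribute least;
# normalised so the weights sum to 1.
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]], dtype=float) / 16

# Edge-emphasising kernel: weights sum to zero, so flat regions map to
# zero and only intensity changes survive.
edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]], dtype=float)
```

Note that the smoothing kernels sum to 1 (they preserve overall brightness), while the edge kernel sums to 0 (it responds only to change).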
### Processing an image with a MLP
To do this, we first need to flatten the image into a 1D array with as many values as there are pixels in the image.

Therefore, the first layer will have as many input neurons as there are pixels, and the number of connections grows very quickly with image size.

So it is clear that analyzing images with an MLP model is not viable:
- The original structure of the data is lost.
- For large images, the number of neurons (and their resulting connections) grows very quickly.
- We do not have a clear notion of multi-scale/multi-resolution analysis (something that is generally useful in image analysis)
A good solution to this problem is to reduce the amount of information we provide to the model; in other words, to input only things that bring value to the model. This process is also known as Feature Extraction and Feature Selection, which in the past used to be handcrafted.
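A quick calculation shows how fast a fully connected first layer grows (the function name and layer sizes here are illustrative assumptions):

```python
# Number of weights in the first fully connected layer of an MLP that
# takes a flattened image as input: (pixels * channels) * hidden_units.
def first_layer_weights(height, width, channels, hidden_units):
    return height * width * channels * hidden_units

small = first_layer_weights(32, 32, 3, 1000)      # CIFAR-sized input
large = first_layer_weights(1024, 1024, 3, 1000)  # a modest photo
```

Scaling the image from 32x32 to 1024x1024 multiplies the weight count by 1024, from roughly 3 million to over 3 billion weights for a single layer, which is exactly the kind of growth that makes plain MLPs impractical for images.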

## Convolutional Neural Networks (CNN)
By adding a convolutional layer to a neural network, we can create a Convolutional Neural Network (CNN), which has revolutionized image and pattern recognition tasks. CNNs leverage the concept of local receptive fields, which allow the network to focus on small local regions of the input.
In a CNN, each neuron in a convolutional layer is connected to a small local receptive field, which is a subset of the previous layer's neurons. This arrangement enables the network to capture spatial dependencies and extract local features, such as edges, textures, and patterns, while sharing parameters across different spatial locations.

The use of local receptive fields in CNNs brings several improvements. Firstly, it reduces the number of parameters in the network compared to fully connected networks, making it more computationally efficient. Secondly, it introduces translational invariance, enabling the network to recognize patterns regardless of their specific location in the input. This property makes CNNs robust to variations in position and size of objects.
### Deep Learning
By stacking multiple convolutional layers, CNNs can learn hierarchical representations of increasing complexity. Lower layers learn low-level features like edges, while higher layers learn more abstract and high-level features. This hierarchical feature extraction capability allows CNNs to automatically learn discriminative features directly from raw input data, leading to improved performance in tasks such as image classification, object detection, and segmentation.

### Using GPUs
Using GPUs (Graphics Processing Units) in CNNs provides significant advantages due to their parallel processing capabilities. CNN operations, such as convolutions and matrix multiplications, can be efficiently parallelized, allowing for faster training and inference times.
In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition for image classification, the introduction of GPUs brought a transformative impact. The ImageNet dataset consists of millions of labeled images belonging to thousands of categories.

Before GPUs, training CNNs on such large-scale datasets was prohibitively time-consuming. However, the parallel computing power of GPUs enabled researchers to train deeper and more complex CNN architectures effectively, leading to breakthrough results in the competition.
One notable milestone was achieved in 2012 when a deep CNN model called AlexNet, trained using GPUs, significantly surpassed traditional approaches. AlexNet reduced the top-5 error rate in the ImageNet competition from around 26% to about 16%. This substantial improvement showcased the potential of deep CNNs in large-scale image classification tasks.

The utilization of GPUs in CNNs allows for faster model training, enabling researchers to experiment with more extensive architectures, deeper layers, and larger datasets. This has contributed to remarkable advancements in the field of computer vision, driving progress in areas like object recognition, image segmentation, and even transfer learning for various real-world applications.
### Convolution layer
A convolution layer applies a set of learnable filters to the input data, convolving them with small local receptive fields to extract spatial features.

> If the input has 3 dimensions (for example, the RGB channels), each filter spans all the channels and still produces a single 2D output

A convolutional layer is a **local support** structure in a neural network as it processes localized information using sliding filters. In contrast, a fully connected layer establishes connections between all neurons, disregarding location. This allows convolutional layers to capture local features and reduce the number of trainable parameters. On the other hand, fully connected layers learn global patterns and require more parameters. The convolutional architecture is efficient for data with spatial structure, such as images, while fully connected layers are more versatile for general problems.

> - The number of parameters to optimize in a convolutional layer is fewer than in a fully connected layer, making it more computationally efficient.
> - Convolutional layers, as they perform the same operation for each kernel, can be parallelized, making GPUs extremely useful in their execution.
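The parameter savings can be made concrete with a back-of-the-envelope count (the input and layer sizes below are illustrative assumptions):

```python
# Parameter count comparison for a single layer on a 32x32x3 input.

# Fully connected: every input connects to every one of 64 output units.
fc_params = (32 * 32 * 3) * 64 + 64          # weights + biases

# Convolutional: 64 filters of size 3x3x3, each with one shared bias,
# regardless of the spatial size of the input.
conv_params = 64 * (3 * 3 * 3) + 64
```

Note that the convolutional count does not depend on the 32x32 spatial size at all, only on the kernel size and the number of filters.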
### Activation Maps
It is possible to add multiple different kernels, and activation maps are visual representations of the features learned by the multiple kernels in the convolutional layers of a neural network. These maps show how each kernel responds to specific regions of the input data, highlighting areas where relevant features are detected. With the presence of multiple kernels, different activation maps are generated, each focusing on capturing a specific feature. These activation maps provide a visual understanding of how the network processes and extracts features from the data, aiding in the analysis, interpretation, and debugging of the model in a more detailed manner. **Afterwards, there will be one neuron for each pixel in each slice of the activation map.**


> Each kernel may vary from the others in its values, but not in its size

All these activation maps are then stacked together to form the input for the next convolutional layer.
### Pooling / Sub-sampling layer
A sub-sampling layer, also known as a pooling layer, is a technique used in neural networks to reduce the dimensionality and spatial size of the data. This layer is applied after convolutional layers and aims to decrease the amount of information by retaining only the most important features. Using operations such as max pooling or average pooling, the maximum or average value is selected from a local region in the activation maps. This reduces the spatial size of the data and creates lower-resolution activation maps, preserving the most relevant features and facilitating processing in subsequent layers.
In a scenario with two convolutional layers, each followed by a sub-sampling layer, the purpose is to progressively reduce the dimensionality and spatial size of the data. The first sub-sampling layer performs pooling on the activation maps obtained from the first convolutional layer, reducing their size. The second convolutional layer is then applied, generating new activation maps, and the second sub-sampling layer further decreases their size. This sequential arrangement of sub-sampling layers helps in progressively extracting and retaining the most salient features from the data while reducing computational complexity. At the end, we have a flattened input to the model containing only the most important features of our image.

The convolutional layers are also trained using gradient descent, so backpropagation can be applied to the entire CNN: the classifier and the convolutional part, including the kernels. This is a fully end-to-end process in which the model not only learns to predict better but also learns which features are best to extract.

There are generally two types of pooling commonly used in convolutional neural networks: max pooling and average pooling.
- **Max Pooling**: In this type of pooling, a pooling window slides over the input data, and the maximum value within each window is selected as the representative value for that region. Max pooling helps to retain the most prominent features, as the maximum value represents the strongest activation within the window.

- **Average Pooling:** Unlike max pooling, average pooling calculates the average value within each pooling window. This type of pooling provides a smoothed representation of the input data, reducing the impact of outliers and emphasizing overall patterns.

> These pooling techniques assist in downsampling the feature maps, reducing the spatial dimensions while retaining important information for subsequent layers in the network, and they do not add any parameters to learn.
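Both pooling variants can be sketched with the same sliding-window loop; only the reduction operation differs:

```python
import numpy as np

def pool_2d(x, window=2, stride=2, mode="max"):
    """Max or average pooling over strided windows (no padding)."""
    h, w = x.shape
    oh = (h - window) // stride + 1
    ow = (w - window) // stride + 1
    op = np.max if mode == "max" else np.mean
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + window,
                      j * stride:j * stride + window]
            out[i, j] = op(patch)
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 8., 1., 0.],
              [7., 6., 3., 2.]])
maxed = pool_2d(x, mode="max")   # strongest activation per 2x2 block
avged = pool_2d(x, mode="avg")   # mean activation per 2x2 block
```

On this 4x4 input, max pooling keeps the strongest value of each 2x2 block, while average pooling smooths each block into its mean.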
### Stride in Pooling
Let's consider an example with a 7x7 input image and different pooling strides.
- **Pooling with Stride 1:**
If we apply pooling with a stride of 1, the pooling window will move one step at a time. In this case, the output size will be 6x6. Each output element represents the maximum or average value within a 2x2 pooling window. The stride of 1 allows overlapping regions, resulting in an output only slightly smaller than the input.

- **Pooling with Stride 2:**
Now, let's apply pooling with a stride of 2. The pooling window will move two steps at a time, effectively skipping every other input element. With this stride, the output size will be 3x3. Each output element represents the maximum or average value within a 2x2 pooling window, but now the windows are non-overlapping due to the larger stride.

- **Pooling with Stride 3:**
If we further increase the stride to 3, the pooling window will move three steps at a time, resulting in a non-overlapping pooling operation. In this case, the output size will be 2x2. Each output element represents the maximum or average value within a 2x2 pooling window, and due to the larger stride, fewer windows are applied, resulting in a smaller output size.

In summary, by controlling the pooling stride, we can adjust the downsampling factor and the spatial dimensions of the output. Smaller strides result in larger output sizes, while larger strides lead to more significant reductions in size. The stride parameter provides flexibility in controlling the level of downsampling and the amount of information retained during the pooling operation.
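The output sizes in this example follow from the standard no-padding formula, floor((input - window) / stride) + 1, sketched here for a 7x7 input and a 2x2 window:

```python
def pool_output_size(input_size, window, stride):
    """Standard output-size formula for pooling (no padding)."""
    return (input_size - window) // stride + 1

# 7x7 input with a 2x2 pooling window at strides 1, 2, and 3.
sizes = [pool_output_size(7, 2, s) for s in (1, 2, 3)]
```

The same formula applies to convolutions; padding simply adds `2 * padding` to the input size before the subtraction.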

### Padding
Padding is a technique used in convolutional neural networks to preserve the spatial dimensions of an input image during convolution operations. In the example of a 7x7 image, padding involves adding extra pixels around the image before applying convolutions.
- No Padding:
Without padding, when using a 3x3 kernel, the convolutions can only be applied five times horizontally and vertically, resulting in a smaller output size (5x5) due to the lost information at the image boundaries.
- Padding:
To maintain the spatial dimensions, padding can be applied. For instance, with a padding of one, an additional row and column of pixels are added around the image. This results in a 9x9 image, allowing for seven convolutions both horizontally and vertically. Consequently, the output size after convolution with a 3x3 kernel would be 7x7, preserving the original dimensions.
Padding is beneficial as it prevents the reduction of spatial information, enables the network to capture features at the image boundaries, and retains more detailed information throughout the network layers. It also ensures that objects of interest near the edges of the image are not disproportionately affected by the convolutional operations.
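Both cases follow from the general output-size formula for a convolution, sketched here:

```python
def conv_output_size(input_size, kernel, padding=0, stride=1):
    """Output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel) // stride + 1

no_pad = conv_output_size(7, 3, padding=0)  # boundary rows/cols are lost
same = conv_output_size(7, 3, padding=1)    # "same" padding keeps 7x7
```

Choosing padding = (kernel - 1) / 2 (for odd kernel sizes, with stride 1) always preserves the input dimensions, which is why it is often called "same" padding.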

### Convolutional layer Summary

> The bias is shared across the whole layer, just as the weights are in a convolutional layer
### Tips & tricks
When the typical tips and tricks for improving neural networks don't yield the desired results, there are several additional strategies you can try. Here are some expanded suggestions to consider:
- **Fetch more diverse data**: Rather than simply acquiring more data, focus on obtaining a more diverse dataset that covers a broader range of scenarios. This can help the network generalize better and improve performance.
- **Experiment with different network architectures**: Instead of just adding more layers to the neural network, explore different architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or attention-based models. These architectures are specifically designed for different types of data and tasks, and they may be more suitable for your problem domain.
- **Explore advanced techniques**: Investigate advanced techniques such as transfer learning, ensemble methods, or meta-learning. Transfer learning allows you to leverage pre-trained models on large datasets and fine-tune them for your specific task. Ensemble methods combine multiple models to improve performance, while meta-learning focuses on training models that can learn how to learn.
- **Increase training duration**: Training for longer periods can sometimes lead to improved performance. However, be mindful of the computational resources required and the risk of overfitting if the model trains for too long.
- **Adjust the batch size**: The batch size used during training affects the stability and convergence of the network. Experiment with different batch sizes to find the optimal one for your specific problem. Smaller batch sizes can lead to better generalization but slower convergence, while larger batch sizes may converge faster but risk overfitting.
- **Apply regularization techniques**: Regularization methods such as L1 or L2 regularization, dropout, or batch normalization can help mitigate overfitting. These techniques introduce additional constraints or modifications to the network during training to encourage generalization.
- **Analyze bias-variance trade-off**: Analyze the bias-variance trade-off in your model. If the model is underfitting (high bias), consider increasing model complexity or collecting more diverse data. If the model is overfitting (high variance), try regularization techniques, reduce model complexity, or increase the amount of training data.
- **Optimize computation**: If training time is a concern, consider optimizing the computation process. This can involve using distributed training across multiple GPUs or leveraging specialized hardware like tensor processing units (TPUs) to accelerate training.
Remember, troubleshooting neural networks can involve an iterative process of experimentation and analysis. It's essential to carefully monitor and evaluate the results at each step to understand the impact of the changes made and guide further improvements.
---
### Normalization
When normalizing, the mean and standard deviation must not be mixed across the train, validation, and test sets. If normalization is applied, each should be computed separately.

### Batch Normalization
**Instead of normalizing only at the input, we can normalize per batch**

That is, we can take groups of activations in the middle of the network and normalize there

**Given a number N of training samples in a batch:**

**Gamma and beta are added as learnable parameters**
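A minimal sketch of the batch-normalization forward pass for a batch of N samples, with gamma and beta as the learnable scale and shift (inference-time running statistics are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalise each feature with the batch mean/variance, then apply
    the learnable scale (gamma) and shift (beta). x has one sample per row."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A batch of N=4 samples with 2 features on very different scales.
x = np.array([[1.0, 50.0],
              [2.0, 60.0],
              [3.0, 70.0],
              [4.0, 80.0]])
gamma = np.ones(2)
beta = np.zeros(2)
out = batch_norm_forward(x, gamma, beta)
```

After normalization each feature has approximately zero mean and unit variance over the batch; gamma and beta then let the network rescale and shift the result if that helps learning.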
