# Roadmap to Computer Vision

Welcome to the Computer Vision Roadmap by Programming Club IITK, wherein we'll be building right from the preliminary techniques of Image Processing and thereafter largely cover the intersection of Machine Learning and Image Processing. We're expecting that you're here after completing the Machine Learning Roadmap, which covers the basics of Machine Learning, Neural Networks, Pipelines and other preliminary concepts; so in this roadmap, we assume that the explorer has that knowledge and we pick up from there. Even if you have explored Machine Learning independently, we strongly recommend going through the ML Roadmap once before starting this. Although not all topics are prerequisites, most of them are.

Also do remember, in case you face any issues, the coordies and secies \@Pclub are happy to help. But thorough research amongst the resources provided is expected before asking your queries :)

Computer Vision is all about giving 1s and 0s the ability to see. Let's get started with Computer Vision 👾

https://www.ibm.com/topics/computer-vision

## Week 1: Introduction to Image Processing

### 👾 Day 1: What is an Image?

In the digital world, an image is a multi-dimensional array of numeric values. It is represented through rows & columns, each cell storing a numeric value that represents a particular region of the image. An individual cell is called a pixel, and the image's dimensions are given by the number of pixels along its height & width. To learn in detail, [click here](http://lapi.fi-p.unam.mx/wp-content/uploads/1.4-Opt_ContinousImages.pdf).

### 👾 Day 2: Different Colour Spaces

![Colour-space-thumbnail](https://hackmd.io/_uploads/B1G8MngwC.jpg)

Color is a vital element in computer vision and digital imaging, influencing how we perceive and interpret visual information. In the world of programming and image processing, color spaces play a crucial role in manipulating and analyzing images. Understanding color spaces is essential for anyone working with image data, as they provide different ways to represent and process color information.

A color space is a specific organization of colors, allowing us to define, create, and visualize color in a standardized way. Different color spaces are used for various applications, each offering unique advantages depending on the context. The most common color spaces include RGB (Red, Green, Blue), HSV (Hue, Saturation, Value), HSL (Hue, Saturation & Lightness) and CMYK (Cyan, Magenta, Yellow, and Key).

Read more about colour spaces [here](https://programmingdesignsystems.com/color/color-models-and-color-spaces/index.html) & more [here](https://cklixx.people.wm.edu/teaching/math400/Nolan-paper2.pdf).

### 👾 Day 3: Convolutions

Convolutions are a fundamental operation in the field of image processing and computer vision, forming the backbone of many advanced techniques and algorithms. They are essential for extracting features from images, enabling us to detect edges, textures, and patterns that are crucial for understanding and analyzing visual data.

At its core, a convolution is a mathematical operation that combines two functions to produce a third function, expressing how the shape of one is modified by the other. In the context of image processing, this involves overlaying a small matrix called a kernel or filter on an image and performing element-wise multiplication and summation. The result is a new image that highlights specific features based on the chosen kernel.
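To make the idea concrete, here is a minimal sketch of a convolution applied with OpenCV and NumPy. The image path and the choice of a sharpening kernel are just placeholders; any small matrix can be used as the kernel.

```python
import cv2
import numpy as np

# Load an image in grayscale ("image.jpg" is a placeholder path)
img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)

# A simple 3x3 sharpening kernel; try other kernels (e.g. edge detectors) to see their effect
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]], dtype=np.float32)

# Slide the kernel over the image: each output pixel is the weighted sum of its neighbourhood
filtered = cv2.filter2D(img, -1, kernel)

cv2.imwrite("filtered.jpg", filtered)
```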
[Watch](https://www.youtube.com/watch?v=KuXjwB4LzSA) this amazing video on convolutions.

Convolutions are extensively used in various applications, including:

- **Edge Detection:** Identifying boundaries and contours within images.
- **Blurring and Smoothing:** Reducing noise and enhancing important features.
- **Sharpening:** Enhancing details and making features more distinct.
- **Feature Extraction:** In neural networks, convolutions help in identifying and learning important features from images for tasks like classification and object detection.

Understanding convolutions is critical for anyone working with image data, as they provide the tools to transform raw images into meaningful information. In this section, we will delve into the principles of convolutions, explore different types of kernels, and demonstrate how they are applied in practical image processing tasks. By mastering convolutions, you'll gain the ability to manipulate and analyze images with precision and efficiency.

[![YouTube](https://img.shields.io/badge/YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white&label=Watch%20Now)](https://youtube.com/playlist?list=PL57aAW1NtY5jRsQXi5LXPk5u1PyvQrTqg&si=oK__FNfUDG19BuSr) Watch videos 1-3 to learn further.

### 👾 Day 4: Strides & Padding

In the realm of image processing and neural networks, strides and padding are two crucial concepts that significantly impact the behavior and output of convolutional operations. Understanding these concepts is essential for effectively designing and implementing convolutional neural networks (CNNs) and other image processing algorithms.

**Strides**

Strides determine how the convolutional filter moves across the input image. In simpler terms, strides specify the number of pixels by which the filter shifts at each step. The stride length affects the dimensions of the output feature map.

- ***Stride of 1:*** The filter moves one pixel at a time. This results in a detailed and often larger output.
- ***Stride of 2:*** The filter moves two pixels at a time, effectively downsampling the input image. This produces a smaller output but reduces computational complexity.

Choosing the right stride is important for balancing detail and efficiency. Smaller strides capture more detailed information, while larger strides reduce the size of the output, which can be beneficial for computational efficiency and reducing overfitting in deep learning models.

**Padding**

Padding refers to the addition of extra pixels around the border of the input image before applying the convolutional filter. Padding is used to control the spatial dimensions of the output feature map and to preserve information at the edges of the input image.

- ***Valid Padding (No Padding):*** No extra pixels are added. The output feature map is smaller than the input image because the filter cannot go beyond the boundaries.
- ***Same Padding (Zero Padding):*** Extra pixels (usually zeros) are added around the border to ensure that the output feature map has the same spatial dimensions as the input image. This allows the filter to cover all regions of the input image, including the edges.

Padding helps to maintain the spatial resolution of the input image and is crucial for deep neural networks where preserving the size of feature maps is often desirable.

**Why Strides and Padding Matter**

- ***Control Over Output Size:*** By adjusting strides and padding, you can control the size of the output feature maps, which is crucial for designing deep neural networks with specific architectural requirements.
- ***Edge Information Preservation:*** Padding helps to preserve edge information, which is important for accurately capturing features located at the boundaries of the input image.
- ***Computational Efficiency:*** Strides can be used to downsample the input image, reducing the computational burden and memory usage in deep learning models.
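To see the effect of strides and padding on output size, here is a small sketch using Keras `Conv2D` layers; the image size and filter counts are arbitrary placeholders.

```python
import tensorflow as tf

# A dummy batch containing one 32x32 RGB image
x = tf.random.normal((1, 32, 32, 3))

# "valid" padding: no border pixels added, so a 3x3 filter shrinks the map to 30x30
valid = tf.keras.layers.Conv2D(8, 3, strides=1, padding="valid")(x)
# "same" padding: zeros are added so the output keeps the 32x32 spatial size
same = tf.keras.layers.Conv2D(8, 3, strides=1, padding="same")(x)
# stride 2 with "same" padding downsamples the feature map to 16x16
strided = tf.keras.layers.Conv2D(8, 3, strides=2, padding="same")(x)

print(valid.shape, same.shape, strided.shape)
# (1, 30, 30, 8) (1, 32, 32, 8) (1, 16, 16, 8)
```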
[![YouTube](https://img.shields.io/badge/YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white&label=Watch%20Now)](https://youtube.com/playlist?list=PL57aAW1NtY5jRsQXi5LXPk5u1PyvQrTqg&si=oK__FNfUDG19BuSr) Here are videos for the above concepts; refer to videos 4-5.

[Here's](https://medium.com/analytics-vidhya/convolution-padding-stride-and-pooling-in-cnn-13dc1f3ada26) an amazing blog, make sure to check it out.

[Read](https://ai.stanford.edu/~syyeung/cvweb/tutorial1.html) this tutorial as an intro to Image Processing techniques.

### 👾 Day 5 & 6: OpenCV

Tired of reading all that theory? Let's get to some practicals now. Let's start with Python's **OpenCV** library and then proceed to using OpenCV in code.

OpenCV is a library designed to offer a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products. Its primary goal is to provide a ready-to-use, flexible, and efficient computer vision framework that developers can easily integrate into their applications. You can download OpenCV using **pip**.

[Refer](https://www.youtube.com/watch?v=M6jukmppMqU) here for a video demonstration.

[This](https://medium.com/analytics-vidhya/introduction-to-computer-vision-opencv-in-python-fb722e805e8b) is a blog on OpenCV.

Finally, let's get hands-on with OpenCV. [Here's](https://drive.google.com/file/d/1nq3atbKhK5fyfRWWGTfCOwBHsFkOAgsR/view?usp=drive_link) an amazing course on OpenCV; read only the OpenCV part for now.
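As a quick warm-up before the course, here is a minimal sketch of a typical OpenCV pipeline: reading an image, converting colour spaces, and running a built-in edge detector. The file name is just a placeholder.

```python
import cv2

# Read an image from disk ("sample.jpg" is a placeholder path)
img = cv2.imread("sample.jpg")

# Convert to grayscale, blur to reduce noise, then detect edges
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
edges = cv2.Canny(blurred, 100, 200)

# Show the original and processed images in separate windows
cv2.imshow("Original", img)
cv2.imshow("Edges", edges)
cv2.waitKey(0)
cv2.destroyAllWindows()
```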
### 👾 Day 7: Let's do some stuff

Congratulations on making it through your first week of learning computer vision! Today, we'll explore some beginner-friendly projects to help you apply what you've learned. These projects will strengthen your understanding and provide practical experience in working with computer vision.

**1. Face Detection and Recognition**

**Description:** Implement a system that can detect faces in real-time from a webcam feed and recognize known individuals.

**Key Concepts:**
1. Haar Cascades or HOG + Linear SVM for face detection.
2. LBPH, Eigenfaces, or Fisherfaces for face recognition.
3. Dlib or OpenCV's deep learning-based face detector.

**Project Steps:**
1. Capture video from a webcam.
2. Detect faces in each frame.
3. Recognize detected faces using a pre-trained model.
4. Display names or IDs of recognized faces on the video feed.

**2. Edge Detection and Image Filtering**

**Description:** Develop an application that applies different image filters and edge detection algorithms to an input image.

**Key Concepts:**
1. Sobel, Canny, and Laplacian filters for edge detection.
2. Gaussian, median, and bilateral filters for smoothing and noise reduction.

**Project Steps:**
1. Load an image from the disk.
2. Apply various edge detection algorithms.
3. Implement different image filtering techniques.
4. Display the original and processed images side by side.

**3. Barcode and QR Code Scanner**

**Description:** Develop an application that can scan and decode barcodes and QR codes in real-time.

**Key Concepts:**
1. Image thresholding and binarization.
2. Contour detection and perspective transformation.
3. Barcode and QR code decoding using libraries like zbar or OpenCV's built-in functions.

**Project Steps:**
1. Capture video from a webcam.
2. Detect and localize barcodes or QR codes in each frame.
3. Decode the detected codes.
4. Display the decoded information on the video feed.

## Week 2: Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a class of deep neural networks commonly used for analyzing visual data. They are particularly effective for tasks involving image recognition, classification, and processing due to their ability to capture spatial hierarchies and patterns within images. Developed in the late 1980s, CNNs have since revolutionized the field of computer vision and have become the cornerstone of many modern AI applications.

**What are CNNs?**

CNNs are inspired by the human visual system and consist of multiple layers designed to automatically and adaptively learn spatial hierarchies of features from input images. Unlike traditional neural networks, which treat input data as a flat array, CNNs preserve the spatial structure of the data by using a grid-like topology.

**Key Components of CNNs**

**1. Convolutional Layers:** These layers apply convolution operations to the input, using learnable filters to detect various features such as edges, textures, and patterns. Each filter slides over the input image, producing a feature map that highlights the presence of specific features.

**2. Pooling Layers:** Pooling layers reduce the spatial dimensions of feature maps, helping to downsample the data and make the network more computationally efficient. Common pooling operations include max pooling and average pooling.

**3. Fully Connected Layers:** After a series of convolutional and pooling layers, the high-level reasoning in the network is done via fully connected layers. These layers flatten the input data and pass it through one or more dense layers to perform classification or regression tasks.

**4. Activation Functions:** Non-linear activation functions like ReLU (Rectified Linear Unit) introduce non-linearity into the model, allowing it to learn complex patterns. Activation functions are applied after convolutional and fully connected layers.

**5. Regularization Techniques:** Techniques like dropout and batch normalization are used to prevent overfitting and improve the generalization capability of the network.

You can read the blog below to go further on CNNs. [Introduction to CNN (Read Here)](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53).
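To see how these components fit together, here is a minimal Keras sketch of a small CNN classifier; the input size and number of classes are arbitrary placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A small CNN that strings the key components together for a 10-class image classifier
model = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 3)),          # input images (size is just an example)
    layers.Conv2D(32, 3, activation="relu"),  # convolutional layer learning 32 filters
    layers.MaxPooling2D(),                    # pooling layer downsampling the feature maps
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),                         # flatten feature maps for the dense layers
    layers.Dense(128, activation="relu"),     # fully connected layer
    layers.Dropout(0.5),                      # regularization
    layers.Dense(10, activation="softmax"),   # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```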
<!--
A Convolutional Neural Network is a Deep Learning algorithm that can take in input images and be able to differentiate one from the other or maybe identify some objects in the image. Convolution is a mathematical operation to merge two sets of information. Convolution is covered quite extensively in the last few lectures of MTH102. However, you don't require any of that knowledge to understand the concept here. (Although knowing it will give you a better insight into the algorithm)

To get a basic understanding of what CNN is about, this blog is a very good read. [Introduction to CNN (Read Here)](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53).

If you are more comfortable reading in a research-based format, this paper is also a good read. [Read Here](https://arxiv.org/pdf/1511.08458.pdf).

Now it's time to revise and implement some part of what we did yesterday and implement it in one of the applications. It's best to cover week 1 of this course by Andrew NG on Coursera. You can audit the course for now. [Week 1 CNN Course](https://www.coursera.org/learn/convolutional-neural-networks/home/week/1). The videos are more or less the same as what you had read in the blog.

The assignments of the course are available on this GitHub repository. You can clone the whole repository and open the C4 convolutional neural networks folder followed by week 1. You'll find all the programming assignments here. [GitHub Repository](https://github.com/amanchadha/coursera-deep-learning-specialization.git). Please check out the issues section of this repository as well.

This completes our basic understanding of CNNs. Now, let's understand some basic networks that are required to implement CNNs. In this roadmap, for the week, we'll tell you to cover Residual Networks only, however to get an in-depth knowledge of the topic, you can read about the other networks (MobileNet, etc) later on.

For day 3, Watch the week 2 lectures on ResNet (First 5 videos of week 2) of that course on Coursera and solve the first assignment on residual networks (using the same GitHub repository).

Watch the remaining portion of the week 2 lectures from the course and solve the assignment on Transfer Learning using MobileNet. To get a more in-depth knowledge of MobileNet you can check out this paper [Read Here](https://arxiv.org/pdf/1704.04861.pdf)

Watch the first 9 lectures of week 3 (until the YOLO detection algorithm) and solve the assignment on car detection with YOLO. A nice blog on the working of YOLO is available here. You can refer to it alongside the course. [Read Here](https://www.v7labs.com/blog/yolo-object-detection) This blog gives a version-wise overview of the YOLO algorithm models is a good read to get more in depth on how the algorithm was developed and what is next in this domain. [Read Here](https://www.datacamp.com/blog/yolo-object-detection-explained)

Cover the remaining portion of week 3 and try to do the assignment on image segmentation with U-Net from the GitHub repository.

From week 4 of the course, watch the first 5 videos on face recognition and try to do the assignment on face recognition from the GitHub repository. You can skip over the last part on style generation on neural style transfer of the course for now. You can try it out later.

This is a GitHub repository for implementing the facenet model in keras. You can check it out to go more in depth. [Click Here](https://github.com/nyoki-mtl/keras-facenet).
-->

### 👾 Day 1: Introduction to Convolutional Layers

Convolutional layers are the fundamental building blocks of Convolutional Neural Networks (CNNs), which are designed to process and analyze visual data. These layers play a crucial role in detecting and learning various features within images, such as edges, textures, and patterns, enabling CNNs to excel in tasks like image classification, object detection, and segmentation.

A convolutional layer performs a mathematical operation called convolution. This operation involves sliding a small matrix, known as a filter or kernel, over the input image to produce a feature map.
Each filter detects a specific feature, such as a vertical edge or a texture pattern, by performing element-wise multiplication and summing the results as it moves across the image.

[![YouTube](https://img.shields.io/badge/YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white&label=Watch%20Now)](https://youtube.com/playlist?list=PL57aAW1NtY5jRsQXi5LXPk5u1PyvQrTqg&si=oK__FNfUDG19BuSr) Watch videos 6-8.

### 👾 Day 2: Exploring Pooling Layers

Now let's learn about pooling layers. Unlike convolutional layers, whose kernels have trainable parameters, pooling layers have no trainable parameters at all. So if they aren't trainable, what is their use? Pooling layers are used to compress/reduce the size of the data while still retaining its important aspects. There are various kinds of pooling layers: max pooling, min pooling, average pooling, etc.

Here's a short blog on [Pooling Layers](https://medium.com/@abhishekjainindore24/pooling-and-their-types-in-cnn-4a4b8a7a4611). If you are interested in some depth, check out [this](https://www.nature.com/articles/s41598-024-51258-6).

[![YouTube](https://img.shields.io/badge/YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white&label=Watch%20Now)](https://youtube.com/playlist?list=PL57aAW1NtY5jRsQXi5LXPk5u1PyvQrTqg&si=oK__FNfUDG19BuSr) Watch video 9.

### 👾 Day 3: Understanding Fully Connected Layers and Activation Functions

After convolutional & pooling layers, let's now move to fully connected layers. An FC layer is basically a dense neural network layer, which leads to a numerical output, achieving our task of utilizing image data for tasks like classification, object detection, etc. You must already know about activation functions, the ones that bring non-linearity to our model. Let's learn about some more activation functions and also revise the earlier ones.

[Visit Here](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)

[Here's](https://towardsdatascience.com/sigmoid-and-softmax-functions-in-5-minutes-f516c80ea1f9) a blog on the Sigmoid & Softmax activation functions.

[Here's](https://conferences.computer.org/ictapub/pdfs/ITCA2020-6EIiKprXTS23UiQ2usLpR0/114100a429/114100a429.pdf) a research paper highlighting the importance of an activation function.

### 👾 Day 4: Regularization Techniques

As you delve deeper into Convolutional Neural Networks (CNNs), you'll encounter challenges like overfitting and the need for more efficient and effective model architectures. Regularization techniques are essential tools to address these challenges, enhancing the performance and generalization capabilities of your CNN models.

**Regularization Techniques**

Regularization techniques are strategies used to prevent overfitting, where a model performs well on training data but poorly on unseen data. These techniques introduce constraints and modifications to the training process, encouraging the model to learn more general patterns rather than memorizing the training data.

**1. Dropout:** Dropout is a technique where, during each training iteration, a random subset of neurons is "dropped out" or deactivated. This prevents neurons from co-adapting too much and forces the network to learn redundant representations, making it more robust. Typically, dropout is applied to fully connected layers, but it can also be used in convolutional layers.

**2. Batch Normalization:** Batch normalization normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. This stabilizes and accelerates the training process, allowing for higher learning rates. It also acts as a regularizer, reducing the need for dropout in some cases.

Read this [blog](https://medium.com/analytics-vidhya/everything-you-need-to-know-about-regularizer-eb477b0c82ba) to get a grasp of the above topics.
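As a quick illustration (a small sketch of my own, not the blog's code), this is roughly where batch normalization and dropout typically sit inside a Keras CNN:

```python
import tensorflow as tf
from tensorflow.keras import layers

# An example block showing typical placement of batch normalization and dropout
model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, padding="same", use_bias=False),
    layers.BatchNormalization(),   # normalize the previous layer's outputs per batch
    layers.Activation("relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),           # randomly deactivate half of the neurons during training
    layers.Dense(10, activation="softmax"),
])
model.summary()
```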
### 👾 Day 5: Image Augmentation

Read this article to get an idea of image augmentation: https://towardsdatascience.com/image-augmentation-14a0aafd0498

How to do data augmentation using TensorFlow: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/images/data_augmentation.ipynb

No need to mug up the commands; rather, learn about the possible data augmentations you can do with TensorFlow.

### 👾 Day 6: Image Preprocessing

This is a great article on image preprocessing, answering the whys and the hows: https://medium.com/@maahip1304/the-complete-guide-to-image-preprocessing-techniques-in-python-dca30804550c

### 👾 Day 7: Projects

It's time to get hands-on with some practical implementation. Your task is to create a model for "Automatic Covid disease detection from chest X-ray images through a CNN model".

**Description:** You are given a dataset of X-ray images. The dataset is divided into train and test data, further divided into three categories: Normal, Covid and Viral Pneumonia. You need to:

1. Create a common directory for train and test images (hint: you need to import the `os` library for this; look on the internet for how it works), then resize the images and normalize the data if needed.
2. Split the train data into train and validation data, and convert categorical labels into numerical data. You can also use the `ImageDataGenerator` function from the TensorFlow library.
3. Build a CNN model and fit it on the train data, validate the model, and play with the hyperparameters, e.g. number of layers, number of filters, size of filters, padding, stride. You can also try different optimizers and activation functions.
4. Plot your results, e.g. loss vs epochs and accuracy vs epochs, for train and validation data.

You are expected to use Google Colab for this task.

**DATASET:** Here's the [Dataset](https://drive.google.com/file/d/1xKQAAqI4IR0a5UGzpGMVcLF5XE2oZPvd/view?usp=drive_link).

## Week 3: Transfer Learning and CNN Architectures

### 👾 Day 1: Transfer Learning

#### What is Transfer Learning?

- Transfer learning is a machine learning technique where a model trained on one task is reused or adapted as the starting point for a model on a different but related task.
- It leverages knowledge gained from the source task to improve learning on the target task, especially when labeled data for the target task is limited.
- Benefits include reduced training time and the ability to achieve better performance on the target task with less data.
- Common in applications like natural language processing, computer vision, and speech recognition.
- Techniques include fine-tuning existing models, feature extraction, and domain adaptation.

[**An exercise**](https://www.tensorflow.org/tutorials/images/transfer_learning) which will teach you how to use transfer learning in your projects using TensorFlow.
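Before (or alongside) the exercise, here is a sketch of the usual transfer-learning recipe in Keras: load a pretrained backbone without its head, freeze it, and train a small task-specific head. The input size and binary-classification head mirror the linked TensorFlow tutorial but are otherwise placeholders.

```python
import tensorflow as tf

# Pretrained MobileNetV2 feature extractor, without its ImageNet classification head
base_model = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                               include_top=False,
                                               weights="imagenet")
base_model.trainable = False  # freeze the pretrained weights

# Small task-specific head on top (binary classification here)
inputs = tf.keras.Input(shape=(160, 160, 3))
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
x = base_model(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])
```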
### 👾 Day 2: AlexNet

- AlexNet is a pioneering convolutional neural network (CNN) architecture that significantly contributed to the resurgence of interest in deep learning.
- This architecture made a substantial impact by winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, reducing the top-5 error rate by more than 10% compared to the previous state-of-the-art.
- AlexNet was one of the first models to leverage the computational power of GPUs extensively. The model was trained on two GPUs, which allowed for faster training times and handling of large datasets like ImageNet.

[Read this article on AlexNet](https://medium.com/@siddheshb008/alexnet-architecture-explained-b6240c528bd5). The article also contains a code implementation of AlexNet; **Brownie Points** for understanding and implementing it.

### 👾 Day 3: ZFNet (2013)

- The architecture is a refined version of AlexNet, focusing on improving performance by adjusting hyperparameters and exploring the learned features in greater depth.
- Utilizes a [**Deconvolutional Network**](https://medium.com/@marsxiang/convolutions-transposed-and-deconvolution-6430c358a5b6) to visualize the learned features and understand the workings of convolutional layers.

[Read this article on ZFNet](https://medium.com/coinmonks/paper-review-of-zfnet-the-winner-of-ilsvlc-2013-image-classification-d1a5a0c45103) to see how it improves and builds upon AlexNet.

### 👾 Day 4: VGG16 and VGG19 (2014)

- VGG16 and VGG19 were introduced by the Visual Geometry Group (VGG) at Oxford University.
- These architectures were presented in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" and stood out in the ILSVRC 2014.
- VGG16 and VGG19 are known for their simplicity and depth, consisting of 16 and 19 layers respectively, which is what gives them their names.

### 👾 Day 5: Image Segmentation

Image segmentation refers to the task of dividing an image into segments where each pixel in the image is mapped to an object. This task has multiple variants such as instance segmentation, panoptic segmentation and semantic segmentation. In simpler terms, image segmentation looks at an image and assigns each pixel an object class (such as background, trees, human, cat, etc.). This finds several use cases such as image editing, background removal, autonomous driving and whatnot. It differs from object detection as this technique creates a mask over the object rather than a bounding box.

It is majorly divided into 3 types:
1. Semantic Segmentation
2. Instance Segmentation
3. Panoptic Segmentation

![Untitled](https://hackmd.io/_uploads/H1x5GyyvC.jpg)

Read this [**Article**](https://www.ibm.com/topics/image-segmentation#:~:text=Image%20segmentation%20is%20a%20computer,faster%2C%20more%20advanced%20image%20processing.) to fully understand what image segmentation is and its different types.

Here's another [**article**](https://www.geeksforgeeks.org/explain-image-segmentation-techniques-and-applications/) by GFG that goes more in-depth.

### 👾 Day 6: UNet

![image](https://hackmd.io/_uploads/HyhN-JkPR.png)

- The architecture is named U-Net because of its U-shaped structure, consisting of an encoder (downsampling path) and a decoder (upsampling path).
- U-Net has a symmetric architecture with an **encoder** (contracting path) and a **decoder** (expanding path). This allows it to effectively capture and utilize both local and global contextual information.
- Read [this article](https://www.analyticsvidhya.com/blog/2023/08/unet-architecture-mastering-image-segmentation/) on the U-Net architecture to get a deeper understanding.

### 👾 Day 7: Implement Image Segmentation

Follow this [**article**](https://www.analyticsvidhya.com/blog/2022/10/image-segmentation-with-u-net/) and implement image segmentation using U-Net by yourself.
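If you get stuck on the architecture, here is a Keras sketch (mine, not the article's code) of U-Net's central idea: an encoder that downsamples and a decoder that upsamples and concatenates the matching encoder features. A real U-Net simply repeats this pattern at more depths.

```python
import tensorflow as tf
from tensorflow.keras import layers

def double_conv(x, filters):
    # Two 3x3 convolutions, the basic building block of U-Net
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

inputs = layers.Input(shape=(128, 128, 3))

# Encoder (contracting path): convolve, remember the features, then downsample
c1 = double_conv(inputs, 32)
p1 = layers.MaxPooling2D()(c1)
c2 = double_conv(p1, 64)

# Decoder (expanding path): upsample and concatenate the matching encoder features (skip connection)
u1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c2)
u1 = layers.concatenate([u1, c1])
c3 = double_conv(u1, 32)

# One output channel per pixel gives a binary segmentation mask
outputs = layers.Conv2D(1, 1, activation="sigmoid")(c3)
model = tf.keras.Model(inputs, outputs)
model.summary()
```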
## ๐Ÿฆ Week 4: CNN Architectures (contd) and Object Detection ### ๐Ÿ‘พ Day 1:InceptionNet IneptionNet ,developed by google, achieved a huge milestone in CNN classifiers when previous models tried to improve the performance and accuracy by just adding more and more layers. The Inception network, on the other hand, is heavily engineered. It uses a lot of tricks to push performance, both in terms of speed and accuracy. its most notable feature is the **inception modules** that it uses. These modules are an arrangement of 1x1,3x3,5x5 convolution and max-pooling layers that aim to capture features on different **'levels'** .These levels refer to the different sizes an object can be in an image Read this [**article**](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202) to learn more about the inception net and its different versions. ### ๐Ÿ‘พ Day 2:ResNet ResNet or Residual Networks is similar to Inception Net in a sense because both aimed to solve the problem of CNN using more and more layers to try to get better efficiency. This architecture uses the residual blocks that work in a similar way to RNNs. They add a weighted part of the input to the output of the convolutional layer in order to mitigate the loss of information.Given below is an image of a residual block. ![image](https://hackmd.io/_uploads/H1lbQU1wC.png) Read this [**article**](https://towardsdatascience.com/understanding-and-visualizing-resnets-442284831be8) to learn more about residual networks. ### ๐Ÿ‘พ Day 3:Region-Based CNN, Fast RCNN, Faster RCNN Region-Based-CNN(R-CNN),introduced in 2013, was one of the first successful methods for object detection using CNNs. It used multiple networks to propose regions with possible objects and then passed these regions thrugh other networks to produce confidence scores.But there was a problem with this model, it was slow ,as it fed each proposed region(read the given article for more information) individually through the CNN. Improving upon this, came the Fast R-CNN in 2015,it solved some of the problems by eliminating the need to pass multiple regions, it rather passed the whole image. Finally the Faster-RCNN was proposed to improve the Fast-RCNN .It did so by changing the method it used to search the probable regions. It was around 20 times faster than the original R-CNN. ๐Ÿ”Ž Read this [**article**](https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e) to review your understanding about R-CNN, Fast R-CNN, Faster R-CNN and YOLO(discussed later). ### ๐Ÿ‘พ Day 4:RoboFlow Roboflow is one of the most useful platforms when it comes to computer vision. It provides you with basically everything essential for computer vision and more ranging from Data Management, Model Training, Deployment and Collaboration. 1. **Data Management -** Roboflow facilitates data annotation, preprocessing, and organization, including tasks like image labeling (bounding boxes, segmentation), dataset management, and version control. 2. **Model Training -** Users can train custom computer vision models or utilize pre-trained models for tasks such as object detection and image classification. It supports popular frameworks like TensorFlow and PyTorch. 3. **Deployment -** The platform enables easy deployment of trained models across various environments, including mobile devices, web applications, and edge devices. Tools are available for optimizing model size and latency. 4. 
### 👾 Day 3: Region-Based CNN, Fast R-CNN, Faster R-CNN

Region-Based CNN (R-CNN), introduced in 2013, was one of the first successful methods for object detection using CNNs. It used multiple networks to propose regions with possible objects and then passed these regions through other networks to produce confidence scores. But there was a problem with this model: it was slow, as it fed each proposed region (read the given article for more information) individually through the CNN. Improving upon this came Fast R-CNN in 2015; it solved some of the problems by eliminating the need to pass multiple regions, instead passing the whole image. Finally, Faster R-CNN was proposed to improve on Fast R-CNN. It did so by changing the method used to search for probable regions, and was around 20 times faster than the original R-CNN.

🔎 Read this [**article**](https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e) to review your understanding of R-CNN, Fast R-CNN, Faster R-CNN and YOLO (discussed later).

### 👾 Day 4: RoboFlow

Roboflow is one of the most useful platforms when it comes to computer vision. It provides you with basically everything essential for computer vision and more, ranging from data management, model training, deployment and collaboration.

1. **Data Management -** Roboflow facilitates data annotation, preprocessing, and organization, including tasks like image labeling (bounding boxes, segmentation), dataset management, and version control.
2. **Model Training -** Users can train custom computer vision models or utilize pre-trained models for tasks such as object detection and image classification. It supports popular frameworks like TensorFlow and PyTorch.
3. **Deployment -** The platform enables easy deployment of trained models across various environments, including mobile devices, web applications, and edge devices. Tools are available for optimizing model size and latency.
4. **Collaboration -** Teams can collaborate effectively on projects through role-based access control and version control for datasets and models.

To get a more comprehensive idea of what Roboflow can do, take a look at this [**video**](https://www.youtube.com/watch?v=O-ZPxTpb2Yg) and follow it through for a quick implementation of a project. (For a more interesting approach, try to make a project of your own with a custom dataset.) Also check out this [**official blog**](https://blog.roboflow.com/getting-started-with-roboflow/) and this [**playlist**](https://www.youtube.com/playlist?list=PLZCA39VpuaZaV5JhBUPz4AcM4SL496eFo) by Roboflow themselves.

Roboflow can be used to implement YOLO, one of the most useful models around in the domain of computer vision. Let us learn more about it.

### 👾 Day 5: YOLO

YOLO (You-Only-Look-Once) is a series of computer vision models that can perform object detection, image classification, and segmentation tasks. I would say that YOLO is a jack of all trades in computer vision. YOLO has seen constant development over the years, starting with the first model, YOLOv1, up to the current latest, YOLOv10. It is a single-shot algorithm that looks at the image once and predicts the classes and bounding boxes in a single forward pass; this makes YOLO much faster than other object detection models. This single-stage architecture also aids in the training process, making it more efficient than others.

Here's a simplified breakdown of how it works. The image is divided into a grid of cells. Each cell predicts bounding boxes for objects it might contain, and features are extracted: each cell analyzes the image features within its area using CNNs, and these features capture the image's details. Based on the extracted features, the model predicts bounding box coordinates and confidence scores (if multiple overlapping bounding boxes are predicted for the same class, then YOLO performs NMS to eliminate the extra boxes).

* Non-Maxima Suppression (NMS): If multiple cells predict bounding boxes for the same object, NMS selects the most confident one and discards overlapping or low-confidence boxes.

Here is a [**video**](https://youtu.be/VAo84c1hQX8?si=GXcKFQI3ukGkLpEM) to get an idea about NMS.

🔎 Read this [**article**](https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e) to review your understanding of R-CNN, Fast R-CNN, Faster R-CNN and YOLO.

### 👾 Day 6: Hands-on YOLO

Now that you have familiarised yourself with YOLO, it's time to do some work.

🔎 Implement YOLOv8 by following this [**video**](https://youtu.be/m9fH9OWn8YM?si=EohDM710R-BEHAe9) and create a project for yourself.

*If you are done with this, try to do object tracking on a video with YOLO.*
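If you'd rather start from code than the video, here is a small sketch using the `ultralytics` package; the model file and image path are placeholders, and the package downloads pretrained weights on first use.

```python
# pip install ultralytics
from ultralytics import YOLO

# Load a small pretrained YOLOv8 detection model and run it on an image
model = YOLO("yolov8n.pt")
results = model("test.jpg")  # placeholder image path

# Inspect the detected boxes, class ids and confidence scores
for r in results:
    for box in r.boxes:
        print(box.xyxy, box.cls, box.conf)

# results[0].show() displays the image with the predictions drawn on it
```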
### 👾 Day 7: OCR

OCR, short for Optical Character Recognition, is a technology used to detect and convert the text contained in images (such as scanned documents, photos of signs, or screenshots) into machine-readable text. This enables computers to analyze the content within images, making it searchable, editable, or suitable for various other applications.

There are a lot of Python libraries and models for OCR; to get started, let's use EasyOCR.

First install the requirements:

```bash
pip install opencv-python
pip install easyocr
```

Here's the code:

```python
import easyocr

reader = easyocr.Reader(['en'])  # specify the language
result = reader.readtext('image.jpg')

for (bbox, text, prob) in result:
    print(f'Text: {text}, Probability: {prob}')
```

There are many more OCRs available on the internet, such as Keras-OCR and Tesseract OCR. Try them out on your own.

## Week 5: Face Detection and Recognition, Pose Estimation, AutoEncoders, Autoregressive Models

In the domain of computer vision, two very critical technologies, Face Detection and Face Recognition, have emerged, offering various applications to real-world problems. These are used in a variety of fields such as demographic analysis in marketing, emotional analysis for enhanced human-computer interactions, and various domains pertaining to security. Suppose you want to create a security and surveillance system, or say you want to filter out familiar faces from a video, or want to create a facial attendance system, an access-control system for a utility, or something akin to a Snapchat filter: you will be working with the aforementioned technologies.

### 👾 Day 1, 2: Face Detection

Face detection technologies have come a long way, with the _haar cascades_ being the very first major breakthrough in the field, followed by _Dlib's_ implementation of HoG (histogram of gradients). These methods used classical techniques, and with the rapid advancements in deep learning, newer face detectors came to rise, leaving behind the previous techniques in their ability to perform well under varied conditions such as lighting, occlusion, detecting faces with different expressions and scales, and in their overall accuracy. Let us discuss two of these face detectors.

#### 1. [MTCNN](https://towardsdatascience.com/robust-face-detection-with-mtcnn-400fa81adc2e)

MTCNN or Multi-Task Cascaded Convolutional Neural Networks is a neural network which detects faces and facial landmarks in images. It was published in 2016 by Zhang et al. ([**paper**](https://arxiv.org/abs/1604.02878)). MTCNN uses a 3-stage neural network to detect not only the face but also the facial landmarks (i.e. the positions of the nose, eyes and mouth). Firstly it creates multiple resized images to detect faces of different sizes. Then these images are passed on to the first stage of the model, the P-net or proposal net, which as the name suggests proposes areas of interest to the next stage, the R-net or refine network, which filters the proposed detections. In the final stage, the O-net (output) takes the refined bounding boxes and does the final refinement to produce accurate results.
A short implementation of MTCNN is as follows; before trying it you will have to install mtcnn first with the following command:

```bash
pip install mtcnn
```

```python
# import the necessary libraries
import matplotlib.pyplot as plt
from mtcnn.mtcnn import MTCNN
from matplotlib.patches import Rectangle
from matplotlib.patches import Circle

# draw an image with detected objects
def draw_facebox(filename, result_list):
    data = plt.imread(filename)
    # plot the image
    plt.imshow(data)
    # get the context for drawing boxes
    ax = plt.gca()
    # plot each box
    for result in result_list:
        # get coordinates
        x, y, width, height = result['box']
        rect = plt.Rectangle((x, y), width, height, fill=False, color='orange')
        # draw the box
        ax.add_patch(rect)
        # draw the dots for the facial landmarks
        for key, value in result['keypoints'].items():
            dot = plt.Circle(value, radius=2, color='red')  # change radius in accordance to the image size
            ax.add_patch(dot)

filename = r'test.jpg'  # change to point to the image
image = plt.imread(filename)
# detector is defined
detector = MTCNN()
# detect faces in the image
faces = detector.detect_faces(image)
# display faces on the original image
draw_facebox(filename, faces)
plt.show()
```

#### 2. [RetinaFace](https://medium.com/@harshshrm94/mask-detection-using-deep-learning-2958503d42b1)

RetinaFace is a state-of-the-art (SOTA) model developed to detect faces in adverse conditions and to outperform its predecessors. It has a reputation for being the most accurate of open-source face detection models ([**paper**](https://arxiv.org/pdf/1905.00641)). RetinaFace boasts one of the strongest face detection capabilities, accounting for occlusion, small faces, and expressive faces. If you want a very robust face detector, then RetinaFace is the go-to choice.

A short implementation is given below; also remember to install the Python library with the following pip command:

```bash
pip install git+https://github.com/hukkelas/DSFD-Pytorch-Inference.git
```

```python
import cv2
import face_detection

# Initialize detector
detector = face_detection.build_detector("DSFDDetector", confidence_threshold=.5, nms_iou_threshold=.3)

# Read image
img = cv2.imread('test.jpg')

# Getting detections
faces = detector.detect(img)
# print(detections)

for result in faces:
    x, y, x1, y1, _ = result
    cv2.rectangle(img, (int(x), int(y)), (int(x1), int(y1)), (0, 0, 255), 2)

cv2.imshow("Image with Detected Faces", img)
cv2.waitKey(0)  # Wait for a key press to close the window
cv2.destroyAllWindows()
```

The two face detectors mentioned here are definitely not the only ones present. Here are some more that you are free to explore:

1. YuNet
2. MediaPipe
3. Dual Shot Face Detector
4. The DNN module of OpenCV

### 👾 Day 3: Face Recognition

If face detection answered the question "Is there a face in the image?", then face recognition goes a step further and answers the question "Whose face is this?". If you know about classification models, then you must be thinking that this task can be solved by using those, and while this is correct, classification models require a lot of data, i.e. a lot of pictures of the same person to train on and give good results, which is seldom possible. If there is enough data to train your model, you could very well go along with a classification model.
#### One-Shot Learning

To tackle the problem of unavailability of data, a technique called one-shot learning is used. One-shot learning aims to train models to recognize new categories of objects with only a **single** training example per category. This ability to learn about the data from just a single example is where its usefulness lies. One-shot learning is particularly valuable in scenarios where acquiring large datasets for every object category is impractical or expensive. This could include tasks like signature verification, identity confirmation, detecting new species, searching for similar products on the web, and whatnot.

🔎 Read this [**article**](https://medium.com/data-science-in-your-pocket/starting-off-with-one-shot-learning-609c1ac6ec9b) to get an in-depth understanding of one-shot learning.

#### Siamese Networks

In the context of computer vision, Siamese neural networks are networks designed to compare similarities between two images and give a similarity score between 0 and 1. Siamese neural networks are a type of twin neural network that consists of two identical CNNs that share weights. The network takes two input images, each of which is processed by one of the CNNs to generate a feature vector, and then computes the distance between these vectors using a loss function. If the distance is small (less than a specified threshold), the images are classified as belonging to the same category.

🔎 Read this [**article**](https://medium.com/@hugh.ding9189/an-introduction-to-siamese-neural-networks-42f1e78fb635) to get a deeper understanding of Siamese Networks and try out their simple implementation by yourself.

### 👾 Day 4: Pose Estimation

Human pose estimation refers to the identification and classification of the poses of human body parts and joints in images or videos. In general, a model-based technique is used to represent and infer human body poses in 2D and 3D space. Essentially, using this technique we find the coordinates of human body joints like the wrist, shoulder, knees, eyes, ears, ankles, and arms, which can describe the pose of a person. This finds its uses in healthcare, AR and VR technologies, animation, character modeling and activity recognition.

To get started with pose estimation, I would suggest using MediaPipe or OpenPose for your first project. Here are some implementations to follow through for the same.

1. [**MediaPipe**](https://www.analyticsvidhya.com/blog/2022/03/pose-detection-in-image-using-mediapipe-library/)
2. [**OpenPose**](https://medium.com/pixel-wise/real-time-pose-estimation-in-webcam-using-openpose-python-2-3-opencv-91af0372c31c)
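For a quick taste before the tutorials, here is a small sketch using MediaPipe's Pose solution on a single image; the file paths are placeholders, and the API shown is the classic `mp.solutions` interface.

```python
# pip install mediapipe opencv-python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils

image = cv2.imread("person.jpg")  # placeholder path

# Run the MediaPipe Pose solution on a single RGB image
with mp_pose.Pose(static_image_mode=True) as pose:
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

# Draw the detected body landmarks (shoulders, wrists, knees, ...) on the image
if results.pose_landmarks:
    mp_drawing.draw_landmarks(image, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)

cv2.imwrite("pose.jpg", image)
```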
### 👾 Day 5: AutoEncoders

Autoencoders are a specialized class of algorithms that can learn efficient representations of input data with no need for labels. They are a class of artificial neural networks designed for unsupervised learning. Learning to compress and effectively represent input data without specific labels is the essential principle of an autoencoder. This is accomplished using a two-fold structure that consists of an encoder and a decoder. The encoder transforms the input data into a reduced-dimensional representation, often referred to as the "latent space" or "encoding". From that representation, a decoder rebuilds the initial input. This process of encoding and decoding forces the network to capture the essential features and learn meaningful patterns in the data.

![image](https://hackmd.io/_uploads/H18BJZVPC.png)

Go through the following to learn more about autoencoders:

- [Introductory Video on AutoEncoders](https://www.youtube.com/watch?v=qiUEgSCyY5o)
- [Intro to Autoencoders](https://www.geeksforgeeks.org/auto-encoders/)
- [Intermediate Topics in Autoencoders](https://www.ibm.com/topics/autoencoder)

### 👾 Day 6: Implementing AutoEncoders

Let's get your hands dirty and work on implementing AutoEncoders. Follow these links for tutorials:

- [Autoencoders for image reconstruction via TensorFlow](https://www.tensorflow.org/tutorials/generative/autoencoder)
- [Autoencoders for image denoising and latent space representation in Keras](https://blog.keras.io/building-autoencoders-in-keras.html)

Dive deep into autoencoders via the following articles:

- [Types of Autoencoders and their applications](https://intellipaat.com/blog/autoencoders-in-deep-learning/)
- [Applications of AutoEncoders](https://iq.opengenus.org/applications-of-autoencoders/)

### 👾 Day 7: Autoregressive Models and PixelCNN

Autoregressive models are a class of machine learning (ML) models that automatically predict the next component in a sequence by taking measurements from previous inputs in the sequence. For an understanding of autoregressive models, go through [this article](https://aws.amazon.com/what-is/autoregressive-models/).

PixelCNN is a CNN architecture that utilizes autoregressive modeling for image generation. Go through the following articles to understand its architecture and implementation:

- https://towardsdatascience.com/auto-regressive-generative-models-pixelrnn-pixelcnn-32d192911173
- https://medium.com/game-of-bits/image-synthesis-using-pixel-cnn-based-autoregressive-generative-model-05179bf662ff

BONUS: If you're interested in learning more about autoregressive models, read [this paper](https://arxiv.org/pdf/2404.02905) on Visual Autoregressive Modeling.

## Week 6: Variational Autoencoders

### 👾 Day 1: Introduction to Variational Autoencoders

Let's start with some easy-to-read articles that introduce you to what autoencoders are, why they're useful, and the different types out there.

- [VAEs: An Overview](https://www.geeksforgeeks.org/variational-autoencoders/)
- [A Beginner's Guide to VAEs](https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73)

### 👾 Day 2: A deeper dive into VAEs

Now let's go through some videos for more help.

- [![YouTube](https://img.shields.io/badge/YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white&label=Watch%20Now)](https://www.youtube.com/watch?v=qiUEgSCyY5o&pp=ygUjZGVlcCBsaXphcmRzIHZpZG9lcyBvbiBhdXRvZW5jb2RlcnM%3D)
- [![YouTube](https://img.shields.io/badge/YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white&label=Watch%20Now)](https://www.youtube.com/watch?v=9zKuYvjFFS8)

### 👾 Day 3: Let's see the code

Going deeper will involve learning with code. Go through [this](https://medium.com/@sofeikov/implementing-variational-autoencoders-from-scratch-533782d8eb95) or [this](https://learnopencv.com/variational-autoencoder-in-tensorflow/). The first one contains an implementation of VAEs with PyTorch, the other with TensorFlow. Go through it thoroughly.

For those who want to delve into more details and maths, here are the links to the original research paper and a revised version of it:

- [https://arxiv.org/abs/1312.6114](https://arxiv.org/abs/1312.6114)
- [https://arxiv.org/abs/1606.05908](https://arxiv.org/abs/1606.05908)
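Once you've read one of the walkthroughs, the whole model fits in a few lines. Here is a TensorFlow sketch of my own (layer sizes and the MNIST-style 784-pixel input are placeholders) showing the two pieces that make a VAE different from a plain autoencoder: the reparameterization trick and the KL-divergence term.

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 2

# Encoder: maps a flattened 28x28 image to the mean and log-variance of a diagonal Gaussian
encoder = tf.keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(2 * latent_dim),  # first half = mean, second half = log-variance
])

# Decoder: maps a latent code back to pixel space
decoder = tf.keras.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(784, activation="sigmoid"),
])

def vae_loss(x):
    """One forward pass: encode, sample with the reparameterization trick, decode, score."""
    z_mean, z_log_var = tf.split(encoder(x), 2, axis=-1)
    eps = tf.random.normal(tf.shape(z_mean))
    z = z_mean + tf.exp(0.5 * z_log_var) * eps                     # reparameterization trick
    x_hat = decoder(z)
    recon = tf.reduce_mean(tf.keras.losses.binary_crossentropy(x, x_hat))
    kl = -0.5 * tf.reduce_mean(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
    return recon + kl  # minimize reconstruction error plus KL divergence to the prior
```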
### 👾 Day 4: Applications of VAEs

It is really important for us to understand what exactly the uses of VAEs are. For this we should go through some resources explaining their applications and how they are implemented.

- [https://medium.com/@saba99/navigating-the-world-of-variational-autoencoders-from-architecture-to-applications-05da018e0f61](https://medium.com/@saba99/navigating-the-world-of-variational-autoencoders-from-architecture-to-applications-05da018e0f61)
- [This](https://cedar.buffalo.edu/~srihari/CSE676/21.3-VAE-Apps.pdf) pdf will also help.
- [This](https://rendazhang.medium.com/variational-autoencoders-series-4-beyond-images-the-multidomain-applications-of-vaes-fa7749c17efa) article may also help.

### 👾 Day 5: Variations of VAEs

Variations of Variational Autoencoders (VAEs) extend the basic VAE framework to address specific challenges and enhance functionality. Conditional VAEs (CVAEs) incorporate additional information by conditioning the output on extra variables, making them effective for tasks like controlled data generation. Beta-VAEs introduce a weighting term on the KL divergence, promoting disentangled representations and making it easier to interpret the latent space. Disentangled VAEs aim to separate distinct factors of variation in the data, enhancing the model's ability to generate and manipulate complex features independently. These variations expand the versatility of VAEs, enabling more precise control over generated outputs and improving their applicability across diverse machine learning tasks.

Here are some resources, but it is not necessary to have full proficiency in all of them. This part may be maths-intensive, so bear with it.

- [This](https://ieeexplore.ieee.org/document/9171997) covers a lot of variations of VAEs.
- [Here](https://medium.com/@sofeikov/implementing-conditional-variational-auto-encoders-cvae-from-scratch-29fcbb8cb08f) is an implementation of CVAEs.

### 👾 Day 6 and 7: Get hands dirty with code

With all you have learnt:

1. Implement a Conditional VAE (CVAE) for fashion item generation:
   - Dataset: Fashion-MNIST
   - Use PyTorch or TensorFlow
   - Architecture: deeper encoder and decoder (4-5 layers each)
   - Latent space dimension: 16 or 32
   - Condition on item category
2. Advanced training techniques:
   - Implement KL annealing
   - Use a cyclical annealing schedule
   - Add a perceptual loss using a pre-trained classifier
3. Evaluation and visualization:
   - Reconstruct and generate images for each category
   - Interpolate between different styles in latent space
   - Quantitative evaluation: Fréchet Inception Distance (FID)

## Week 7: GNNs and GANs

### 👾 Day 1: Graph Neural Networks

More often than not, the input datapoints provided are related to each other in some way or the other. For instance, in time series data, a datapoint at a specific point of time might be related to the datapoints in history. For analyzing such data we use sequential data models called Recurrent Neural Networks. In images, one pixel is correlated to the pixels around it, so we use convolutions in order to capture the relationships between proximate pixels. Sometimes, these correlations are not ordered, but connected in an unordered manner, for instance the energy of atoms and molecules in a box. For such complex relationships, we use a special type of neural network called Graph Neural Networks.
Following is a great article to get started with graphs as a data structure, their properties, the need for Graph Neural Networks, and their architecture: https://distill.pub/2021/gnn-intro/

Go through the following video for a visual understanding of Graph Neural Networks: https://www.youtube.com/watch?v=GXhBEj1ZtE8&t=1s

### 👾 Day 2 and 3: Implementing Graph Neural Networks

Considering that now you have a basic intuition behind GNNs, let's dive deep into the mathematics and implementation. Go through the following videos:

- [GNNs using PyTorch Geometric](https://www.youtube.com/watch?v=-UjytpbqX4A)
- [Types of GNNs and implementation](https://www.youtube.com/watch?v=8owQBFAHw7E)
- [Mathematics of GNN simplified](https://www.youtube.com/watch?v=zCEYiCxrL_0)

### 👾 Day 4: What Are Generative Adversarial Networks?

GANs are used for generative tasks and consist of 2 different models, a generator and a discriminator. The work of the discriminator is to distinguish the fake images from real images, and the task of the generator is to generate images so real that the discriminator is not able to distinguish them.

[This is a great video to start understanding what GANs are](https://www.youtube.com/watch?v=TpMIssRdhco)

[Video giving deeper insights into what GANs are](https://www.youtube.com/watch?v=Sw9r8CL98N0)

#### Working of GANs:

[This blog](https://medium.com/@marcodelpra/generative-adversarial-networks-dba10e1b4424) is based on the original GAN paper, but explains the topics in great detail; it gets slightly mathematically involved at the end. You can skip the code and the maths for now, as the resources for them are attached in the upcoming days.

### 👾 Day 5: Understanding the Maths

Today we will try to understand how the loss function of GANs actually works. This will help us understand how the generator knows whether it produces real-looking images or not, how we can say that the discriminator is not able to distinguish, and how these values actually get updated.

[17 min video explaining the maths of GANs](https://www.youtube.com/watch?v=Gib_kiXgnvA)

[Blog explaining the maths](https://jaketae.github.io/study/gan-math/)
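To summarize what the video and blog derive, the original GAN paper frames training as the following two-player minimax game, where the discriminator $D$ maximizes the objective and the generator $G$ minimizes it:

$$
\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
$$

In practice, the generator is often trained to maximize $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$, since this non-saturating variant gives stronger gradients early in training.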
### 👾 Day 6: Understanding different types of GANs - Day 1 (DCGANs)

In the next 2 days we will discuss 2 types of GANs, namely DCGANs and StackGANs.

***DCGANs:***

- DCGANs, or Deep Convolutional Generative Adversarial Networks, are a type of Generative Adversarial Network (GAN) that uses deep convolutional networks in both the generator and discriminator components.
- The loss calculation and gradient descent in DCGANs are the same as in the vanilla GAN; the only difference is in the architecture of the Generator and Discriminator.

Enough of theory, let's get our hands a little dirty!!

[This is a DCGAN tutorial by TensorFlow](https://www.tensorflow.org/tutorials/generative/dcgan); please do not get intimidated by the code. Use this [YouTube video](https://www.youtube.com/watch?v=AALBGpLbj6Q) as a guide to understand the code better. Feel free to use Google, AI assistance, or to contact us if you have problems understanding the code.

### 👾 Day 7: Understanding different types of GANs - Day 2 (StackGANs)

Today we will try to understand the working and the architecture of StackGANs, and how they are able to generate images based on a prompt.

#### ***Architecture***:

The StackGAN architecture consists of two stages, each utilizing a separate GAN:

1. **Stage-I GAN:** This first stage generates a low-resolution image based on the text description. It learns to produce a basic structure and rough colors that correspond to the input text.
2. **Stage-II GAN:** Building upon the output of the Stage-I GAN, the Stage-II GAN takes the low-resolution image and refines it to generate a higher-resolution image that closely matches the details specified in the text description.

**Why use a Stacked Architecture?**

By using a stacked architecture, StackGAN aims to overcome the limitations of directly generating high-resolution images from text, which can be challenging due to the complexity and detail involved. The approach of using two stages allows the network to progressively refine the generated images, leading to more detailed and realistic outputs. Research has found that generating a high-resolution image directly from noise gives nonsensical outputs. Thus, having a GAN that first generates a low-resolution image, and then feeding this image rather than noise to Stage-II, acts as a support and produces photo-realistic images.

**Resources:**

[Blog explaining the architecture of StackGANs](https://www.scaler.com/topics/stackgan/)

[Video on the Architecture](https://www.youtube.com/watch?v=hH_wKra7AYg)

## Week 8: Diffusion Models and Further Readings

### 👾 Day 1: Introduction to Diffusion Models

So far we've studied various generative models, for instance GANs, AutoEncoders, Variational AutoEncoders, etc. Diffusion is another type of generative model that works by destroying training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process. After training, we can use the Diffusion Model to generate data by simply passing randomly sampled noise through the learned denoising process.

For understanding the difference between diffusion and autoregression, watch the following video: https://www.youtube.com/watch?v=zc5NTeJbk-k&t=752s

For a high-level understanding of the functioning of Stable Diffusion (a famous diffusion model), follow this link: https://www.youtube.com/watch?v=1CIpzeNxIhU

Following is a brief introduction to Diffusion Models: https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/

### 👾 Day 2: Physics and Probability

The core concept behind diffusion models has been derived from thermodynamics. Understanding this provides an intuitive understanding of how diffusion models work. The following link provides an explanation regarding the same: https://www.assemblyai.com/blog/how-physics-advanced-generative-ai/#generative-ai-with-thermodynamics

Understanding the mathematics behind Diffusion Models requires some additional prerequisites. I'm assuming that you've gone through the mathematical prerequisites for ML highlighted in the ML Roadmap. Go through the following articles and videos for the same:

- [Introduction to Stochastic Processes](https://www.youtube.com/watch?v=cYrugmAAcso)
- [Introduction to Markov Chains](https://math.libretexts.org/Bookshelves/Applied_Mathematics/Applied_Finite_Mathematics_(Sekhon_and_Bloom)/10%3A_Markov_Chains/10.01%3A_Introduction_to_Markov_Chains)
- [Introduction to Poisson Processes](https://www.probabilitycourse.com/chapter11/11_1_2_basic_concepts_of_the_poisson_process.php)

### 👾 Day 4: Score-Based Generative Models

Score-Based Generative Models are a class of generative models that use the score function to estimate the likelihood of data samples.
The score function, also known as the gradient of the log-likelihood with respect to the data, provides essential information about the local structure of the data distribution. SGMs use the score function to estimate the data's probability density at any given point. This allows them to effectively model complex and high-dimensional data distributions. Although the score function can be computed analytically for some probability distributions, it is often estimated using automatic differentiation and neural networks.

For a better understanding, go through this blog post by Yang Song: https://yang-song.net/blog/2021/score/

(Tip: for delving deep into a topic, it's usually a nice practice to go through the relevant references of the article you're reading.)

### 👾 Day 3: Denoising Diffusion Probabilistic Models

DDPMs are a type of diffusion model used for probabilistic data generation. As mentioned earlier, diffusion models generate data by applying transformations to random noise. DDPMs, in particular, operate by simulating a diffusion process that transforms noisy data into clean data samples.

During inference (generation), DDPMs start with noisy data (e.g., noisy images) and iteratively apply the learned transformations in reverse to obtain denoised and realistic data samples. DDPMs are particularly effective for image-denoising tasks. They can effectively remove noise from corrupted images and produce visually appealing denoised versions. Moreover, DDPMs can also be used for image inpainting and super-resolution, among other applications.

For a greater understanding of how DDPMs work, go through the following links:

- [DDPM Landmark Research Paper](https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf)
- [Intro to DDPMs Article](https://medium.com/@gitau_am/a-friendly-introduction-to-denoising-diffusion-probabilistic-models-cc76b8abef25)
- [Intro to DDPMs Video](https://www.youtube.com/watch?v=H45lF4sUgiE)

### 👾 Day 5: Mathematics Behind Diffusion Models

Now that you have a decent understanding of how diffusion models and their types work, this is a great time to understand the mathematics behind them. Note that this is not compulsory, but it builds a better understanding of how things are derived and structured. Following is a great article for the same: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
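To preview what the article derives, the DDPM forward process gradually corrupts a sample $x_0$ with Gaussian noise according to a variance schedule $\beta_t$, and has a convenient closed form for jumping straight to step $t$ (with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$):

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right)
$$

A network $\epsilon_\theta(x_t, t)$ is then trained to predict the noise that was added, using the simple objective

$$
L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\big\lVert \epsilon - \epsilon_\theta(x_t, t) \big\rVert^2\right],
$$

and sampling runs the learned denoising process in reverse, starting from pure noise.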
### 👾 Day 6: Implementing Diffusion Models

Considering we're largely done with the introductory material for Diffusion Models, let's get our hands dirty implementing DDPMs:

- [Implementing DDPMs via Keras](https://keras.io/examples/generative/ddpm/)
- [Intro to Implementing DDPMs via PyTorch](https://medium.com/@mickael.boillaud/denoising-diffusion-model-from-scratch-using-pytorch-658805d293b4)

### 👾 Day 7: AutoRegressive Diffusion Models (ARDMs)

AutoRegressive Diffusion Models capture the benefits of both autoregressive models and diffusion models. Unlike standard ARMs, they do not require causal masking of model representations, and can be trained using an efficient objective similar to modern probabilistic diffusion models that scales favourably to highly-dimensional data. At test time, ARDMs support parallel generation, which can be adapted to fit any given generation budget. The authors find that ARDMs require significantly fewer steps than discrete diffusion models to attain the same performance. Finally, they apply ARDMs to lossless compression and show that, contrary to existing approaches based on bits-back coding, ARDMs are uniquely suited to this task, performing well even when compressing single data points.

Go through the following links for a better understanding:

- [ARDM Paper by Google](https://arxiv.org/pdf/2110.02037)
- [ARDM Paper Explanation](https://www.youtube.com/watch?v=2h4tRsQzipQ)

**Contributors**

- Aarush Singh Kushwaha \| +91 96432 16563
- Anirudh Singh \| +91 6397 561 276
- Himanshu Sharma \| +91 99996 33455
- Kshitij Gupta \| +91 98976 05316
- Suyash Kumar \| +91 98071 99316