# Multimedia Final Report

### 4.1. Data preprocessing

Since we work with a pile of terrain images, there is not much preprocessing we can do. We have 4320 pairs of terrain and label images, each of size 428\*240, in the training dataset. We also have 720 label images for the public and private test sets; however, we do not have the real terrain images for these test sets.

We first split the data into river and road subsets, because we believe that training two models (one that generates river images and one that generates road images) may perform better than a single model. Next, we split the data into training and validation sets with a ratio of 0.8 to 0.2. Since our model only accepts 256\*256 images, we first resize all images from 428\*240 to 256\*144 and then pad them to 256\*256. The following images show the result after this preprocessing.

![image](https://hackmd.io/_uploads/HkoB04-I0.png)
![image](https://hackmd.io/_uploads/H1JPAE-I0.png)

After the model produces an output, we have to recover the image to its original size, so we first crop the height from row 56 to row 200 and then resize the result back to 428\*240. The following images show the result of this postprocessing.

![image](https://hackmd.io/_uploads/ry-okH-LA.png)
![image](https://hackmd.io/_uploads/S1Ep1rZLA.png)

### 4.2. Model

2. pix2pix model

The pix2pix model is dedicated to solving image-to-image translation problems, which makes it very suitable for our task. It adapts GANs to the conditional setting: just as GANs learn a generative model of data, conditional GANs (cGANs) learn a conditional generative model. This makes cGANs suitable for image-to-image translation, where we condition on an input image and generate a corresponding output image. Here is a brief overview:

(a) Loss function: Conditional GANs learn a structured loss. Structured losses penalize the joint configuration of the output, and a large body of literature has considered losses of this kind, with popular methods including conditional random fields. We tried several loss functions, described below.

$$
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y \sim p_{data}(x,y)}[\log D(x, y)] + \mathbb{E}_{x \sim p_{data}(x),\, z \sim p_z}[\log(1 - D(x, G(z, x)))] \tag{3}
$$

MSELoss: MSELoss stands for Mean Squared Error Loss. It is a commonly used loss function for regression problems in machine learning and deep learning. It measures the average squared difference between the predicted values and the actual target values.

![image](https://hackmd.io/_uploads/BJEZGSWLA.png)

BCEWithLogitsLoss: Binary Cross Entropy (BCE) loss measures the difference between two probability distributions in binary classification tasks. BCEWithLogitsLoss combines a Sigmoid layer and BCELoss in a single class; this version is numerically more stable than a plain Sigmoid followed by BCELoss because the sigmoid is applied inside the loss function itself.

![image](https://hackmd.io/_uploads/SyFrGHb8R.png)

L1Loss: L1Loss, also known as Mean Absolute Error (MAE) loss, is a commonly used loss function in regression tasks. It measures the average absolute difference between the predicted values and the actual target values.

![image](https://hackmd.io/_uploads/S1DizSW80.png)

Finally, we found that MSELoss combined with L1Loss and lambda_L1 = 100 gave the best performance for our model.
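To make this combination concrete, the following is a minimal sketch of the generator and discriminator objectives described above, assuming a PyTorch-style setup in which the discriminator sees the label map and the image concatenated along the channel dimension; the tensor names and helper functions are illustrative and not our exact code.

```python
import torch
import torch.nn as nn

# Illustrative setup: `label` is the input label map, `real` the ground-truth terrain
# image, `fake = G(label)` the generated terrain, and `D` a conditional discriminator.
gan_criterion = nn.MSELoss()   # MSE as the adversarial term (the setting that worked best for us)
l1_criterion  = nn.L1Loss()
lambda_L1 = 100.0              # weight on the L1 reconstruction term

def generator_loss(D, label, real, fake):
    # Adversarial term: the generator wants D(label, fake) to be classified as real (all ones).
    pred_fake = D(torch.cat([label, fake], dim=1))
    loss_gan = gan_criterion(pred_fake, torch.ones_like(pred_fake))
    # Reconstruction term: L1 distance to the ground-truth terrain, scaled by lambda_L1.
    loss_l1 = l1_criterion(fake, real) * lambda_L1
    return loss_gan + loss_l1

def discriminator_loss(D, label, real, fake):
    # The discriminator should score (label, real) as ones and (label, fake) as zeros.
    pred_real = D(torch.cat([label, real], dim=1))
    pred_fake = D(torch.cat([label, fake.detach()], dim=1))
    return 0.5 * (gan_criterion(pred_real, torch.ones_like(pred_real)) +
                  gan_criterion(pred_fake, torch.zeros_like(pred_fake)))
```

Here MSELoss plays the role of the adversarial term (a least-squares GAN style criterion) and lambda_L1 = 100 weights the reconstruction term, matching the combination that gave us the best results.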
(b) Generator structure: We tried the U-Net generator and the ResnetGenerator respectively, and we discovered that the ResnetGenerator can generate higher-resolution, higher-quality pictures. We show our results below.

U-Net generator: The U-Net generator is a specialized neural network architecture designed for image-to-image translation tasks, prominently used in medical image segmentation. Introduced by Olaf Ronneberger and colleagues in 2015, the U-Net features a distinctive U-shaped structure formed by an encoder-decoder architecture with symmetric skip connections. These connections transfer high-resolution features from the encoder to the decoder, preserving spatial information that could otherwise be lost during downsampling. The encoder captures context by progressively reducing the spatial dimensions while increasing feature complexity, and the decoder reconstructs the image by progressively increasing the spatial dimensions. This architecture excels at tasks requiring precise localization, which makes it effective in applications such as image denoising, inpainting, and super-resolution.

![image](https://hackmd.io/_uploads/BkFlNHZIA.png)
![image](https://hackmd.io/_uploads/rJKPTBb8C.png)

ResnetGenerator: The ResnetGenerator is a neural network architecture commonly used in image-to-image translation tasks within generative adversarial networks (GANs), such as style transfer and super-resolution. Building on the Residual Network (ResNet) architecture, it uses residual blocks to make deep networks easier to train by letting gradients flow through the network more effectively. Each residual block contains a skip connection that bypasses one or more layers, mitigating the vanishing-gradient problem and enabling the network to learn identity mappings. This structure helps the generator produce high-quality, realistic images by keeping training stable and preserving important features from the input images. The ResnetGenerator's ability to combine low-level and high-level features makes it a powerful tool for generative modeling and image synthesis.

![image](https://hackmd.io/_uploads/ryNFNrbU0.png)
![image](https://hackmd.io/_uploads/Sy1nrS-L0.png)

(c) Discriminator: We used the NLayerDiscriminator and the PixelDiscriminator, and we found that the NLayerDiscriminator gave the best results for the generated images.

NLayerDiscriminator: The NLayerDiscriminator is a versatile and configurable neural network architecture used as the discriminator in generative adversarial networks (GANs). Its primary role is to differentiate between real and generated images, thereby guiding the generator to produce more realistic outputs. The "NLayer" part refers to the network's configurable depth: it consists of a user-defined number of convolutional layers, each progressively capturing more complex features, which makes the discriminator good at distinguishing intricate details in images. The architecture typically employs convolutional layers followed by LeakyReLU activations and downsampling through strided convolutions, which reduce spatial dimensions and emphasize feature extraction. This modularity and flexibility allow the NLayerDiscriminator to be adapted to various GAN applications, from simple image generation to complex high-resolution image synthesis.
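For illustration, a simplified PatchGAN-style NLayerDiscriminator could be sketched as below. The channel widths, the use of batch normalization, and the default of three strided layers are assumptions made for the sketch, not our exact configuration.

```python
import torch.nn as nn

class NLayerDiscriminator(nn.Module):
    """Simplified PatchGAN-style discriminator: scores overlapping patches as real or fake."""
    def __init__(self, in_channels=6, base=64, n_layers=3):
        # in_channels=6 assumes the label map and the image are concatenated channel-wise.
        super().__init__()
        layers = [nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
                  nn.LeakyReLU(0.2, inplace=True)]
        ch = base
        for _ in range(1, n_layers):          # progressively downsample and widen
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                       nn.BatchNorm2d(ch * 2),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch *= 2
        # Final 1-channel map: each output value scores one local patch of the input.
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
```

Because the final convolution outputs a map of scores rather than a single scalar, each output value judges one local patch, which is what makes this discriminator sensitive to fine detail.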
PixelDiscriminator: The PixelDiscriminator is a specialized discriminator architecture used in GANs that evaluates the realism of individual pixels rather than the entire image. It assesses the quality of each pixel independently, which makes it particularly effective in tasks where fine-grained detail and local texture consistency are crucial, such as image super-resolution and texture synthesis. Typically composed of several convolutional layers with small receptive fields, the PixelDiscriminator processes each pixel in a localized context, allowing it to scrutinize minute details without being influenced by global image structure. This focused approach pushes the generator to produce highly detailed and realistic textures, improving the overall quality of generated images in tasks that require pixel-level fidelity.

(d) Optimization and inference: The generator and discriminator are both optimized with Adam, using lr = 0.0002, beta1 = 0.5, and beta2 = 0.999, and we train for 200 epochs. We chose 200 epochs because a single training run takes about 7 to 8 hours and we cannot occupy the machine in our laboratory for too long; we believe that with a larger epoch count we could get better performance.

3. CUT GAN model

The CUT (Contrastive Unpaired Translation) GAN is a state-of-the-art model for unpaired image-to-image translation, designed to address the limitations of earlier methods such as CycleGAN. Built on contrastive learning, CUT improves the quality and consistency of generated images by maximizing mutual information between corresponding patches in the input and generated images, which better preserves detailed content. Using a patch-based discriminator and the PatchNCE loss, it evaluates the realism of local patches and captures fine-grained detail more effectively. By simplifying the architecture and removing the need for a cycle-consistency loss, CUT achieves strong results in applications such as style transfer, image enhancement, and domain adaptation. For these reasons, we considered CUT GAN a potentially suitable method for our task.

(a) Loss function: We also used MSELoss, BCEWithLogitsLoss, and L1Loss in our CUT GAN model. However, we discovered that L1Loss did not improve performance, unlike in the pix2pix model, so in the end we applied only MSELoss, which gave the best performance. In addition, PatchNCE loss (Patch-based Noise Contrastive Estimation loss) is a crucial component of the CUT model. This loss function uses contrastive learning to improve image-to-image translation by focusing on local, patch-level detail. Unlike traditional losses that evaluate entire images, PatchNCE loss maximizes the similarity between corresponding patches in the input and generated images while minimizing the similarity with non-corresponding patches. This approach helps preserve detailed content and improves the overall quality and consistency of the translated images, enabling CUT GAN to produce more realistic and detailed outputs in unpaired image translation scenarios.
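As a small illustration of the idea, a PatchNCE-style term can be sketched as an InfoNCE loss over patch features, assuming the patch features have already been extracted by a shared encoder; the temperature value and variable names are illustrative and not taken from the official CUT implementation.

```python
import torch
import torch.nn.functional as F

def patch_nce_loss(feat_q, feat_k, tau=0.07):
    """feat_q: (N, C) features of N patches sampled from the generated image.
       feat_k: (N, C) features of the patches at the same locations in the input image."""
    feat_q = F.normalize(feat_q, dim=1)
    feat_k = F.normalize(feat_k, dim=1).detach()      # keys act as fixed targets
    logits = feat_q @ feat_k.t() / tau                # (N, N) patch-to-patch similarities
    targets = torch.arange(feat_q.size(0), device=feat_q.device)
    # Diagonal entries are the "positive" same-location pairs; every other patch is a negative.
    return F.cross_entropy(logits, targets)
```

Each row of the logits matrix compares one generated-image patch against every input-image patch, so minimizing this cross-entropy pulls corresponding patches together and pushes non-corresponding patches apart.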
The following picture shows the structure of the PatchNCE loss in the CUT GAN model.

![image](https://hackmd.io/_uploads/H1-wMjWLA.png)

(b) Generator and discriminator structure: The CUT GAN generator is the same as in the pix2pix model; both support the U-Net generator and the ResnetGenerator. We again found the ResnetGenerator better than the U-Net generator. The following pictures were generated by these two generators; the left one comes from the ResnetGenerator and the other from the U-Net generator.

![image](https://hackmd.io/_uploads/SyrSEsZIR.png)
![image](https://hackmd.io/_uploads/rJctNi-UA.png)

(c) Optimization and inference: The optimizer settings are the same as for the pix2pix model. However, the CUT GAN model did not outperform pix2pix. We think the PatchNCE loss is not suitable for this task, because the label image and the real terrain image are completely different, so maximizing the similarity between corresponding patches in the input and generated images does not work well here.

### 7. Conclusion

Our model achieved an FID score of 206. This is quite far from the first-place score, which means we still have a lot of work to do to improve our model. The most difficult part of this project was the long training time: each training run took over 6 hours, and we only had one machine that could run our code, so we did not have many chances to experiment. Nevertheless, we learned a lot from this project. We found that the CUT GAN model is not as good as pix2pix at generating this kind of picture. We originally believed CUT GAN would perform well because it was published later by the same researchers as pix2pix, who claimed that CUT is better, faster, and more precise; we still believe CUT GAN may beat pix2pix on other tasks, such as turning horses into zebras. However, we found the training time of the CUT GAN model to be longer than that of pix2pix, in contrast to the claim that CUT is faster. In the end, we think that increasing the number of epochs may yield better performance, and we should also try other models, since pix2pix was published in 2017 and many newer models for this kind of image-to-image translation have been proposed since. To get better performance, we need to apply newer models to this task and do more research.