# Improving Masking of StableVITON Virtual Try-On
## Project Members
Violeta Chatalbasheva (Student Number: 5080428)
Shayan Ramezani (Student Number: 5052025)
Yash Mundhra (Student Number: 5017874)
## Introduction
In recent years, the world has witnessed a significant transformation in the retail landscape, driven by the rapid growth of e-commerce and online shopping. [^5] This shift has revolutionized the way people purchase goods, offering unprecedented convenience and accessibility. Consumers can now browse and buy products from the comfort of their homes, avoiding crowded stores and long checkout lines. However, this convenience comes with its own set of challenges. One of the most pressing issues is the dramatic increase in product returns, particularly in the apparel sector. Customers often find that the clothing they ordered online does not fit as expected or looks different in person, leading to high return rates. This not only creates logistical headaches for retailers but also contributes to increased carbon emissions due to the additional shipping and handling required for returned items [^4].
A key factor behind these high return rates is that shoppers cannot accurately judge how clothing will look and fit on their own body. [^6] This has pushed retailers to adapt and experiment with technologies such as virtual try-on, which superimposes clothing items onto a user's digital image, giving a realistic preview of how the product will look on them.
Using virtual try-on provides consumers an immersive experience for experimenting with various products, resulting in a more confident and personalised purchase. Not only does virtual try-on improve online shopping experience by giving consumers a feel of shopping in a physical store, but it also reduces the guesswork involved in online shopping. By offering a virtual fitting room, these tools can enhance customer satisfaction. [^7]
The journey of virtual try-on technology has been marked by significant advancements over the years. Early attempts relied heavily on computer graphics, essentially creating digital clothing and draping it over generic 3D models. This resulted in unrealistic fits and a lack of personalization. The turning point came with the integration of computer vision techniques. Sophisticated models like VITON-HD [^8] allowed virtual try-on systems to analyze a user's body shape and posture in real-time. This shift from a purely graphics-based approach to one that incorporates computer vision opened the door for more accurate and dynamic virtual try-on experiences. Despite these advancements, challenges remain. Current models often struggle with accurately fitting clothing to diverse body types, handling various poses, and capturing intricate details of the garments and personal characteristics. These limitations highlight the need for continued innovation and improvement in the field.
Building on these advancements, our study focuses on addressing some of the existing challenges in virtual try-on technology, particularly in the areas of segmentation and masking. This is where the [StableVITON](https://rlawjdghek.github.io/StableVITON/)[^1] model comes into play. StableVITON aims to enhance the realism and accuracy of virtual try-ons by incorporating advanced segmentation techniques. Accurate segmentation is crucial for achieving a realistic and accurate fitting for virtual try-on experiences. To this end, we will utilize the [Segment Anything Model (SAM)](https://segment-anything.com/)[^9] developed by Meta, a tool designed to enhance segmentation accuracy. By integrating SAM with StableVITON, we seek to evaluate how improved segmentation impacts the overall performance and realism of virtual try-ons.
### Identifying the Research Gap
In our analysis, we observed that StableVITON introduces several artifacts in the resulting try-on images. Looking at figure 1, we first observe that StableVITON alters the original hairstyle from short to long. We attribute this to the masking of the neck area, which fails to preserve the specific hairstyle from the original image and thus creates an illusion of long hair. This could indicate an underlying bias, where the StableVITON model defaults to stereotypically feminine characteristics such as longer hair for women. In addition, the traditional static method of masking the arms proves inadequate for this garment, given its complex sleeve design that deviates significantly from the simpler lines of a slim-fit shirt. We believe that a more precise segmentation can significantly improve the cut-out of the garment, yielding a cleaner and more realistic try-on in which the physical appearance of the person is preserved.
<div style="text-align: center; white-space: nowrap;">
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/HJZPUrVHC.jpg" width="180"/>
<div class="caption" style="text-align: center;">3/4 sleeve shirt</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/Hke0dH4HC.jpg" width="180"/>
<div class="caption" style="text-align: center;">Agnostic image</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/SJTdOS4HC.jpg" width="180"/>
<div class="caption" style="text-align: center;">Try-on result</div>
</div>
<figcaption>[Fig 1] Example of artifacts from StableVITON</figcaption>
</div>
### Our Goal
Our goal is to demonstrate that more precise segmentation can lead to better fitting and more visually appealing results. More specifically, we aim to enhance the StableVITON model's performance through more precise masking techniques. Our study will evaluate the impact of these improvements on upper-body garments using metrics such as Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Inception Score (IS). Additionally, we will explore how these refined segmentation methods influence the accuracy of clothing fitting and the preservation of personal identity in the try-on images. Lastly, we will test the model's capability to generalize to lower-body images and clothing items, assessing its robustness and overall performance using the same evaluation metrics.
## Related Work
Two of the most prominent deep learning methods utilized for virtual try-on are generative adversarial networks (GANs) and diffusion models. These techniques have revolutionized the way virtual try-on systems are designed, enabling highly realistic and interactive user experiences. GANs [^11], known for their ability to generate high-fidelity images through adversarial training, aim to model the real image distribution by forcing the generated samples to be indistinguishable from the real images. On the other hand, diffusion models, which leverage probabilistic approaches to synthesize images, have surpassed the performance of GANs in the domain of image synthesis [^10]. This section will provide a comprehensive review of the existing literature on these two methodologies in the context of virtual try-on systems.
### Generative Adversarial Networks (GANs)
GANs [^24] leverage two neural networks to achieve high-quality image synthesis: a generator network attempts to deceive a discriminator network, which in turn learns to distinguish between real and fake samples. Most GAN-based virtual try-on approaches use a separate warping module to deform the garment to the human body and use GAN generators to blend it with the target person.
The first study that made use of GANs was that of Jetchev et al., who introduced a conditional analogy GAN (CAGAN) [^25] that approached virtual try-on as an image analogy problem. However, CAGAN fails to account for the deformation of clothing items based on the user's body shape, resulting in subpar try-on performance. The next significant work was VITON-HD by Choi et al. [^8], who proposed an alignment-aware segment (ALIAS) normalisation and ALIAS generator to handle misalignment between the warped clothes and the desired clothing regions. POVNet by Li et al. [^26] is one of the more recent works using a GAN architecture; it performs multiple warps on the garment image to ensure that the clothing texture always covers misaligned regions.
Overall, GANs remain highly relevant for the virtual try-on task, and improvements are still being published regularly.
### Diffusion Models
In recent studies, the performance of diffusion models has surpassed that of GANs in the domain of image synthesis. Diffusion models are a type of generative model that iteratively add and remove noise to and from an image, thereby learning to generate new data samples. They are based on the principles of probabilistic diffusion processes.
The first work using diffusion models was that of Zhu et al. in the paper TryOnDiffusion [^27]. The paper introduces Parallel-UNet, which involves training two denoising U-Nets simultaneously, where one transmits information to the other through cross-attention. In the paper DCI-VTON by Gou et al. [^28] the authors propose to use a warping network to predict the appearance flow, thereby warping the clothes. These warped clothes are then added to the input of the DM, along with the global condition. One of the most recent works that use diffusion models is the StableVITON paper which we will also be using in this project and have discussed in detail in the following sections.
## Background
Before moving on to the main model used in the project, namely StableVITON, it is essential to familiarize ourselves with several foundational concepts integral to its functionality. These include U-Nets, cross-attention, attention total variation loss, image inpainting, the CLIP image encoder, and spatial encoding. Each of these components plays a crucial role in the virtual try-on process and contributes to the overall performance and accuracy of the model.
### U-Nets
U-Nets are a type of convolutional neural network (CNN) architecture particularly well-suited for image segmentation tasks. Introduced by Ronneberger et al. (2015)[^2], U-Nets have a symmetric architecture consisting of a contracting path (encoder) and an expansive path (decoder), connected by a bottleneck (see figure 2). The contracting path captures context by downsampling the input image, while the expansive path enables precise localization through upsampling. U-Nets are particularly effective for tasks where high-resolution output is needed, such as medical image analysis and, in this context, virtual try-on systems. A minimal code sketch of the architecture is given after figure 2.
<figure style="text-align: center; white-space: nowrap;">
<img src="https://hackmd.io/_uploads/rJuoEprrR.png"
alt="U-net architecture" style="width: 75%;"/>
<figcaption>[Fig 2] U-Net architecture. Blue boxes represent a multi-channel feature map with the<br>number of channels on the top. The image dimensions are indicated at the <br>lower left edge of the box. White boxes represent copied feature maps.<sup><a href="#fn15">[15]</a></sup></figcaption>
</figure>
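As an illustration of this encoder-decoder-with-skip-connections design, below is a minimal PyTorch sketch of a two-level U-Net. The layer sizes and depth are illustrative and much smaller than those of the U-Net used in Stable Diffusion.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in the original U-Net
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=1):
        super().__init__()
        self.enc1 = conv_block(in_ch, 64)           # contracting path
        self.enc2 = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(128, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(256, 128)            # 256 = 128 (upsampled) + 128 (skip)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = conv_block(128, 64)
        self.head = nn.Conv2d(64, out_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)

# Example: segmenting a 256x256 RGB image into a single-channel mask
mask_logits = TinyUNet()(torch.randn(1, 3, 256, 256))  # -> (1, 1, 256, 256)
```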
### Cross-Attention
Cross-attention is a key mechanism in neural networks that allows for the integration and alignment of information from two distinct input sources. It allows a model to focus on different parts of an input sequence when generating each part of an output sequence. It differs from self-attention, which operates within a single input sequence, by focusing on relationships between elements from two different inputs[^12]. This process is achieved through three main components: queries (Q) from one input, and keys (K) and values (V) from another input. In virtual try-on tasks, cross-attention is particularly helpful in aligning features from different sources, namely clothing features with body features.
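A minimal sketch of scaled dot-product cross-attention is shown below: the queries come from body features and the keys and values from clothing features. The shapes and the omission of the learned linear projections are simplifications for illustration.

```python
import torch
import torch.nn.functional as F

def cross_attention(body_feats, cloth_feats, d_k=64):
    """body_feats: (B, N_body, d), cloth_feats: (B, N_cloth, d)."""
    # In practice Q, K, V come from learned projections; identities are used here for brevity.
    Q, K, V = body_feats, cloth_feats, cloth_feats
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (B, N_body, N_cloth)
    attn = F.softmax(scores, dim=-1)                # each body token attends over clothing tokens
    return attn @ V                                 # (B, N_body, d)

out = cross_attention(torch.randn(2, 196, 64), torch.randn(2, 196, 64))
```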
<figure style="text-align: center; white-space: nowrap;">
<img src="https://hackmd.io/_uploads/Hy7eppBrA.png"
alt="Cross-Attention visualized" style="width:40%;"/>
<figcaption>[Fig 3] Cross-attention, visualized here, is the same as <br>self-attention, with only the inputs being different.<sup><a href="#fn16">[16]</a></sup></figcaption>
</figure>
### Attention Total Variation Loss
Attention total variation (ATV) loss is a regularization technique used to refine attention maps in neural networks. In the context of StableVITON, ATV loss helps sharpen the attention regions on the clothing, ensuring that the model accurately captures the intricate details of the clothing items. By minimizing the dispersion of high attention scores, ATV loss enhances the quality of the semantic correspondence learned by the model, leading to more precise and realistic virtual try-on results.
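To make the idea concrete, the sketch below implements one plausible variant of such a regularizer: a total-variation penalty on the attention-weighted key coordinates, which discourages scattered attention. This is an illustrative approximation; the exact formulation in the StableVITON paper may differ in its details.

```python
import torch

def attention_tv_loss(attn, H, W):
    """attn: (B, N_query, N_key) attention maps with N_key = H * W (illustrative)."""
    # Attention-weighted center coordinate of the keys for each query.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).float().reshape(-1, 2)   # (N_key, 2)
    centers = attn @ coords                                         # (B, N_query, 2)
    # Reshape queries onto their spatial grid and penalize spatial variation of the centers.
    Hq = Wq = int(attn.shape[1] ** 0.5)
    centers = centers.reshape(attn.shape[0], Hq, Wq, 2)
    tv = (centers[:, 1:] - centers[:, :-1]).abs().mean() + \
         (centers[:, :, 1:] - centers[:, :, :-1]).abs().mean()
    return tv

loss = attention_tv_loss(torch.softmax(torch.randn(2, 256, 24 * 16), dim=-1), H=24, W=16)
```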
### Image Inpainting
Image inpainting refers to the process of reconstructing lost or deteriorated parts of an image. In virtual try-on applications, this technique is used to fill in areas of the target image where the original clothing has been removed, seamlessly integrating the new clothing item. StableVITON approaches virtual try-on as an exemplar-based image inpainting problem [^13], utilizing the generative capabilities of a pre-trained diffusion model to achieve high-quality image reconstruction.
### CLIP Image Encoder
CLIP (Contrastive Language-Image Pre-training) is a model developed by OpenAI[^3] that learns visual concepts from natural language descriptions. The CLIP image encoder transforms images into a latent space representation. In StableVITON, the CLIP image encoding is used to condition the U-Net with the clothing image, helping in accurately reflecting the visual characteristics of the input clothing in the generated output.
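As an example of what this encoding step looks like in practice, the sketch below obtains a CLIP image embedding with the Hugging Face `transformers` library. The checkpoint name and the image path are illustrative placeholders and not necessarily the encoder variant used by StableVITON.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # illustrative checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

cloth = Image.open("cloth.jpg")                        # hypothetical path to a clothing image
inputs = processor(images=cloth, return_tensors="pt")  # resize, crop, and normalize
embedding = model.get_image_features(**inputs)         # (1, 512) latent used to condition the U-Net
```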
## StableVITON
StableVITON is an innovative approach to virtual try-on, leveraging the power of pre-trained diffusion models, specifically Stable Diffusion, to generate realistic and high-fidelity images to virtually fit a clothing piece on a person. The model operates by learning the semantic correspondence between clothing items and human bodies within the latent space of the pre-trained diffusion model. This is achieved by integrating the previously discussed concepts with additional techniques, specifically Zero Cross-Attention Block, spatial encoding and augmentation of the feature map.
The process begins with the input of a person image $\textbf{x}$ and a clothing image $\textbf{c}$. The person image is transformed into an agnostic map, which removes the existing clothing. The model then uses a pre-trained U-Net architecture, from Stable Diffusion, to generate the final output, which is the person $\textbf{x}$ with the clothing $\textbf{c}$. The input to the U-net is the concatenation of the following four components:
- **noisy image:** Obtained by the forward process of the diffusion model, which adds Gaussian noise to the person image[^14]. The use of noise is integral to training diffusion models and helps with robustness, generalizability, inpainting-style reconstruction, and learning fine details;
- **latent agnostic map:** The person image with all clothing information related to the piece to be replaced removed (masked) and then encoded. This helps the model ignore the original clothing and focus on the body;
- **resized clothing-agnostic mask:** The mask of the clothing itself, indicating where the clothing was present and which regions are to be filled with the new clothing, ensuring accurate placement and alignment in the final output. While the agnostic map removes the original clothing to focus on the body shape and pose, the clothing-agnostic mask specifically guides the placement of the new garment;
- **latent dense pose condition:** A detailed encoded representation of the body's position and orientation, which helps preserve the person's pose and ensures that the new clothing aligns with the body's natural posture.
In addition, a copy of this U-Net is used to map the clothing image $\textbf{c}$ into the latent space for fitting the specific garment on the person $\textbf{x}$.
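A minimal sketch of assembling these four latent inputs along the channel dimension before they enter the denoising U-Net is shown below; the channel counts and latent resolution are illustrative, not the exact values used by StableVITON.

```python
import torch

B, H, W = 1, 64, 48                    # latent-space resolution (illustrative)
z_noisy    = torch.randn(B, 4, H, W)   # noisy latent of the person image
z_agnostic = torch.randn(B, 4, H, W)   # encoded clothing-agnostic person image
m_agnostic = torch.randn(B, 1, H, W)   # resized clothing-agnostic mask
z_pose     = torch.randn(B, 4, H, W)   # encoded dense pose condition

# Channel-wise concatenation forms the U-Net input
unet_input = torch.cat([z_noisy, z_agnostic, m_agnostic, z_pose], dim=1)  # (B, 13, H, W)
```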
To improve the performance of the U-Net, StableVITON integrates several key components:
* **CLIP Image Encoder:** The CLIP image encoder transforms the clothing image $\textbf{c}$ into a latent space representation that conditions the U-Net. Conditioning the U-Net means providing it with additional context and information about the clothing image, ensuring that the generated output accurately reflects the visual characteristics of the input clothing, thus enhancing the model's ability to preserve fine details and textures.
* **Zero Cross-Attention Block:** This block aligns the clothing features with the body features in the latent space. By using the query (Q) from the human body features and the key (K) and value (V) from the clothing features, the zero cross-attention block ensures that the clothing details are accurately aligned and preserved. Additionally, a zero-initialized linear layer stabilizes the alignment process and maintains the integrity of the clothing's appearance in the final image (a sketch of this block is given after this list).
* **Spatial Encoder:** The spatial encoder captures the spatial information of the clothing item and conditions the intermediate feature maps of the U-Net through zero cross-attention blocks. This process helps in accurately aligning the clothing details with the target person's body.
* **Augmentation:** Augmentation techniques such as horizontal flips, random shifts, and HSV adjustments are applied to the input conditions during training. These augmentations help the model learn "fine-grained semantic correspondence"[^1] by altering the feature map, ensuring accurate and realistic virtual try-on results, and improving the model's robustness and ability to generalize across diverse inputs.
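Below is a minimal sketch of the zero cross-attention idea referenced above: a cross-attention layer whose output projection is initialized to zero, so the block initially acts as a no-op on the pre-trained U-Net features. The dimensions and class names are illustrative.

```python
import torch
import torch.nn as nn

class ZeroCrossAttention(nn.Module):
    def __init__(self, dim=320, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.zero_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_proj.weight)   # zero-initialized linear layer
        nn.init.zeros_(self.zero_proj.bias)     # so the block starts as a no-op residual

    def forward(self, body_feats, cloth_feats):
        # Q from the body features, K and V from the clothing features.
        out, _ = self.attn(body_feats, cloth_feats, cloth_feats)
        return body_feats + self.zero_proj(out)  # residual; contributes nothing at initialization

block = ZeroCrossAttention()
fused = block(torch.randn(1, 768, 320), torch.randn(1, 768, 320))
```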
By integrating these techniques, StableVITON outperforms traditional virtual try-on methods, generating high-quality try-on images that maintain the intricate details of the clothing items and the realism of the target person's appearance. Figure 4 gives a high-level overview of the StableVITON architecture.
<figure style="text-align: center; white-space: nowrap;">
<img src="https://hackmd.io/_uploads/BJbb9c2BR.png"
alt="StableVITON architecture" style="width:100%;"/>
<figcaption>[Fig 4] For the virtual try-on task, StableVITON additionally takes three conditions:<br>agnostic map <math><msub><mi>x</mi><mn>a</mn></msub></math>, agnostic mask <math><msub><mi>m</mi><mn>a</mn></msub></math>, and dense pose <math><msub><mi>x</mi><mn>p</mn></msub></math>, as the input of the <br>pre-trained U-Net, which serves as the query (Q) for the cross-attention.<br>The feature map of the clothing <math><msub><mi>x</mi><mn>c</mn></msub></math> is used as the key (K) and value (V)<br> for the cross-attention and is conditioned on the UNet (b) <sup><a href="#fn10">[10]</a></sup>.</figcaption>
</figure>
## Dataset
To evaluate the performance of SAM on StableVITON we make use of two distinct datasets, VITON-HD [^8] and DressCode lower [^17].
### VITON-HD
The VITON-HD dataset is a high-resolution (1024×768) virtual try-on dataset commonly used in virtual try-on research, including the StableVITON paper. Specifically, VITON-HD consists of 13,697 pairs of frontal-view images of women and top clothing items, split into training and test sets of size 11,647 and 2,032 respectively. By employing VITON-HD, we ensure a comprehensive and standardized evaluation of our model's performance.
### DressCode
The second dataset we use is DressCode, which comprises over 50,000 high-resolution (1024×768) garments and more than 100,000 high-resolution images across three categories: upper body, lower body, and dresses. Since StableVITON has been trained only on upper-body garments from the VITON-HD dataset, we use the lower-body images to explore whether StableVITON can generalize to unseen clothing items and how it performs with more precise segmentation. Additionally, the DressCode dataset includes more complex poses and postures, making inference more challenging but also more interesting to explore.
In addition to the garment image and the original image, both datasets include annotations such as keypoints, a skeleton, and a DensePose map, as shown in figure 5.
<figure style="text-align: center; white-space: nowrap;">
<img src="https://hackmd.io/_uploads/r1gkHJjHR.png"
     alt="Dataset attributes" style="width:100%;"/>
  <figcaption>[Fig 5] For each image, the dataset consists of several attributes such <br>as the keypoints, skeleton and DensePose.</figcaption>
</figure>
## Improving the Segmentation Module
As indicated, our goal is to address one of the limitations of StableVITON, specifically the segmentation of the garments. StableVITON applies the segmentation procedure utilized in the VITON-HD paper[^8]. It modifies the image by masking body parts to focus the model's generation power solely on the garment. It uses segmented body parts (parse data) and pose data to accurately place the masks. It adjusts arm lengths for symmetry, masks arms, torso and neck with grey overlays and refines the garment edges using the parse data. However, this generic masking approach is not suitable for garments with varying designs. For instance, a t-shirt is much simpler than the shirt we saw in figure 1, which suggests that a different segmentation strategy may be required. To address these issues, we aim to apply a more advanced segmentation technique.
### Segment Anything Model
The [Segment Anything Model (SAM)](https://github.com/facebookresearch/segment-anything) represents a significant advancement in image segmentation technology. It is trained using an extensive and diverse dataset known as the [SA-1B dataset](https://ai.meta.com/datasets/segment-anything/). This dataset is one of the most comprehensive image segmentation datasets currently available and contains over 1 billion high-quality segmentation masks derived from around 11 million images [^16].
SAM is designed to handle the task of image segmentation more effectively, with higher precision and flexibility [^16]. It achieves this through several key components, including the following:
* **Image Encoder**: SAM incorporates a high-performance image encoder based on Vision Transformer technology, which allows it to treat various scales and complexities in the input images effectively.
* **Prompt Encoder**: The model features a prompt encoder capable of processing diverse types of input prompts, including points, boxes, text, and rough masks. This encoder plays a crucial role in guiding the segmentation process, allowing for tailored segmentation that can adapt based on the specificity of the task.
* **Mask Decoder**: Driven by the detailed and context-rich information from the encoder, SAM's mask decoder works to produce fast and accurate segmentation masks. This ensures that the final outputs are both precise and contextually appropriate.
Additionally, the SAM model incorporates advanced training techniques and optimization strategies, including the use of multi-task learning frameworks and attention mechanisms that further refine the model’s ability to segment complex image features accurately [^16]. Such techniques enhance SAM’s generalization capabilities, enabling it to perform well on unseen images by effectively leveraging learned representations.
These properties make SAM a powerful tool for image segmentation, which motivates our choice to employ it in our analysis. SAM segments objects of varying shapes with much higher precision than traditional methods, providing a significant advantage in mask quality. By integrating SAM into the StableVITON try-on generation pipeline, we anticipate notable improvements in the accuracy of garment representation.
The network architecture allows SAM to process a wide variety of input types while delivering highly precise segmentation masks. Upon generating the final masks, it returns a list of all segmentations it successfully extracted (see figure 6). Each is a dictionary with attributes such as the segmentation itself, the bounding box of the object, the area of the mask, the model's own prediction of the mask's quality, and a stability measure that is filtered against a predefined threshold.
<figure style="text-align: center; white-space: nowrap;">
<img src="https://hackmd.io/_uploads/HJlgX_dH0.png"
alt="different segmentation masks for one image" style="width:100%;"/>
<figcaption>[Fig 6] All segmentation masks for one image.</figcaption>
</figure>
We automated the segmentation of garments for numerous images, which enables us to run inference with StableVITON, compare try-on images, and evaluate the outcomes; a code sketch of this pipeline is given after figure 7. The process includes several key steps. First, we use the large SAM checkpoint, `sam_vit_l_0b3195.pth`, for better segmentation performance; the SAM model only needs to be initialized once and can then process garments from many images. Second, we read each image with the [OpenCV library](https://opencv.org/) and convert it from BGR to RGB color space. Then, we employ the `generate` function to obtain the masks. This function outputs all masks the model deemed relevant, so they need to be filtered (see figure 6). We select the garment by defining a range of mask areas that corresponds to typical clothing sizes; the bounds were chosen empirically on randomly selected images from the VITON-HD test set. Most try-on datasets [^8][^17][^22][^23] center the garment in the image, since the upper body is the primary focus. Therefore, we expect the garment mask to be located near the center of the image. To verify this, we use the bounding box of the segmentation and check whether the center of the bounding box lies within 10% of the image's center. From the masks that satisfy this condition, we select the target mask with the highest quality score predicted by the model. Figure 7 illustrates the improvement in the precision of the agnostic mask.
<div style="text-align: center; white-space: nowrap; margin: 10px;">
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/HJZPUrVHC.jpg" width="180"/>
<div class="caption" style="text-align: center;">3/4 sleeve shirt</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/BkZSMqOHR.png" width="180"/>
<div class="caption" style="text-align: center;">Agnostic mask used by StableVITON</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/H1J6ZqdS0.jpg" width="180"/>
<div class="caption" style="text-align: center;">Agnostic mask generated by SAM</div>
</div>
<figcaption>[Fig 7] Original masking from StableVITON vs masking with SAM.</figcaption>
</div>
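The sketch below follows the steps just described using the `segment_anything` package and OpenCV. The area bounds are hypothetical placeholders for the empirically chosen values and would need to be tuned per dataset.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Initialize SAM once with the large checkpoint mentioned in the text.
sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l_0b3195.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def garment_mask(image_path, area_range=(50_000, 350_000)):  # area bounds are hypothetical
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)  # BGR -> RGB
    h, w = image.shape[:2]
    candidates = []
    for m in mask_generator.generate(image):           # all masks SAM deems relevant
        if not (area_range[0] <= m["area"] <= area_range[1]):
            continue                                   # filter by plausible garment area
        x, y, bw, bh = m["bbox"]                       # bounding box in XYWH format
        cx, cy = x + bw / 2, y + bh / 2
        if abs(cx - w / 2) > 0.1 * w or abs(cy - h / 2) > 0.1 * h:
            continue                                   # keep masks centred in the image
        candidates.append(m)
    if not candidates:
        return None
    best = max(candidates, key=lambda m: m["predicted_iou"])  # highest predicted quality
    return best["segmentation"].astype(np.uint8)
```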
Unfortunately, the garment's edges do not exactly align with the body of the person. Thus, we dilate the mask to ensure adequate coverage. This step is crucial as the model generates the portion that is obscured by the agnostic mask and enlarging the mask helps the model render the try-on garment boundaries more realistically. Figure 8 shows the effect of the dilation on one person selected from the results of our segmentation method.
<div style="text-align: center; white-space: nowrap; margin: 10px">
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/r1YoJs_rC.jpg" width="180"/>
<div class="caption" style="text-align: center;">Result without dilation</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/H1WD1juBC.jpg" width="180"/>
<div class="caption" style="text-align: center;">Result with dilation</div>
</div>
<figcaption>[Fig 8] Effect of mask dilation.</figcaption>
</div>
Finally, we binarize the mask and color the masked region grey, which aligns with the grey agnostic images StableVITON was trained on and thus keeps the input consistent with the model's training data (a sketch of this post-processing is given below).
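A minimal sketch of the dilation, binarization, and grey fill with OpenCV, assuming `mask` is the garment mask selected in the previous step; the kernel size and grey value are illustrative choices.

```python
import cv2
import numpy as np

def make_agnostic(person_rgb, mask, kernel_size=15, grey=128):
    """person_rgb: (H, W, 3) uint8, mask: (H, W) with non-zero pixels on the garment."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)              # kernel size is illustrative
    dilated = cv2.dilate(mask.astype(np.uint8), kernel, iterations=1)   # enlarge garment coverage
    binary = dilated > 0                                                # binarize the dilated mask
    agnostic = person_rgb.copy()
    agnostic[binary] = grey                                             # fill masked region with grey
    return agnostic, binary.astype(np.uint8) * 255
```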
### Precision in segmentation
SAM showcases remarkable flexibility in segmenting complex patterns. However, this level of precision presents a challenge in the context of virtual try-on. The goal is to segment the whole garment, even if it has small elements, images, patterns, or text printed on it, as showcased in figure 8. Here, the extreme focus on detail can become problematic.
While SAM's precision is typically an asset, in the case of virtual try-on it can lead to over-segmentation, where even the smallest details are isolated. The masks in figure 9 are examples where a zipper, several buttons, and an image print were treated as objects separate from the garment. One way to mitigate this over-segmentation could be to prompt SAM with text prompts, instead of segmenting everything and filtering the results as done in this work. An alternative would be to incorporate task-specific knowledge into a fine-tuning of SAM for virtual try-on, so that the model learns to maintain the integrity of the garment as a single entity.
<div style="text-align: center; white-space: nowrap; margin: 10px;">
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/HkCUPKDS0.jpg" width="180"/>
<div class="caption" style="text-align: center;">Buttons</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/SyALvYwS0.jpg" width="180"/>
<div class="caption" style="text-align: center;">Zipper</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/HJ0LwKwBR.jpg" width="180"/>
<div class="caption" style="text-align: center;">Image print</div>
</div>
<figcaption>[Fig 9] Examples of over-segmentation.</figcaption>
</div>
The largest challenge lies in balancing the model's segmentation power to adapt it to the requirements of the virtual try-on application. Since we aim to show that more exact garment masking leads to better try-on fits, we have excluded cases, such as the zipper and button examples from above, from our testing data to give the StableVITON model a fair chance in presenting convincing improvements.
## Experimental Setup
In this project, a series of experiments were performed to assess the performance of the StableVITON model, enhanced with an improved segmentation model, for virtual try-on applications. This section provides a detailed description of the components used in the evaluation process.
### Inference
Inference for virtual try-on can be conducted in two main ways: paired and unpaired. In the paired setting, the objective is to reconstruct the original clothing item on the person image, maintaining the same outfit. In contrast, the unpaired setting involves replacing the clothing item on the person image with a different one. In our experiments we use only the unpaired setting, as it is more representative of real-world applications where users wish to see how various clothing items look on them.
Given the selected images from the VITON-HD and DressCode datasets, we use SAM to segment the clothing items. After segmentation, we selected 300 images from each dataset for inference. The segmented masks were applied to these images, allowing us to perform the virtual try-on by replacing the clothing items. Inference was carried out on these 300 selected images, and the results were evaluated using the metrics described in the next section.
To establish a benchmark, we also performed inference on 300 images from the VITON-HD dataset using the original segmentation masks from the StableVITON model. This baseline performance served as a reference point, allowing us to compare the effectiveness of our segmentation process.
### Evaluation
To gain an understanding of how the SAM model performs in comparison to the original segmentation method we evaluate the generated images from both datasets using qualitative and quantitative methods.
#### Quantitative Evaluation
We evaluate the generated images using the following four performance metrics, which are standard in state-of-the-art papers (a code sketch for computing them follows the list):
* **Fréchet Inception Distance (FID)**[^18]: The FID is a performance metric used to evaluate the quality of generated images by comparing their distribution to that of real images. Specifically, FID calculates the distance between the feature vectors of real and generated images, which are extracted using a pre-trained Inception v3 model. Lower scores indicate the two groups of images are more alike, or have more similar statistics, with a perfect score being 0.0 indicating that the two groups of images are identical.
* **Kernel Inception Distance (KID)** [^21]: Similar to FID, KID evaluates the quality of the generated images but does not rely on the assumption that the feature space is Gaussian. KID measures the similarity between the distributions of real and generated images by computing the squared maximum mean discrepancy (MMD) using features extracted from a pre-trained Inception network. Additionally, KID is known to have an unbiased estimator, which means it can provide more reliable comparisons, especially with smaller sample sizes.
* **Inception Score (IS)** [^20]: The IS evaluates the quality and diversity of generated images using the Kullback-Leibler (KL) divergence between the probability outputs of a pre-trained Inception v3 model for each generated image and the marginal distribution (the cumulative distribution of all generated samples). A high IS indicates that the generated images are not only easily classifiable into distinct categories (reflecting high quality and realism) but also cover a wide range of categories (indicating diversity).
* **Structural Similarity Index (SSIM)** [^19]: The SSIM measures the similarity between two images by assessing their structural information. It evaluates changes in structural content, luminance, and contrast. SSIM is particularly valuable in image processing tasks, such as image compression, restoration, and generation. Higher SSIM values indicate greater similarity, reflecting better preservation of image details and overall quality.
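Below is a minimal sketch of computing these four metrics with the `torchmetrics` library, assuming the real and generated images are available as uint8 tensors. The tensor shapes, KID subset size, and SSIM data range are illustrative choices, not necessarily those used in our evaluation.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image.inception import InceptionScore
from torchmetrics.image import StructuralSimilarityIndexMeasure

real = torch.randint(0, 256, (300, 3, 299, 299), dtype=torch.uint8)   # placeholder real images
fake = torch.randint(0, 256, (300, 3, 299, 299), dtype=torch.uint8)   # placeholder generated images

fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True); fid.update(fake, real=False)

kid = KernelInceptionDistance(subset_size=100)          # subset size is illustrative
kid.update(real, real=True); kid.update(fake, real=False)

inception = InceptionScore()
inception.update(fake)                                  # IS only needs the generated images

ssim = StructuralSimilarityIndexMeasure(data_range=1.0) # expects float inputs in [0, 1]
ssim_score = ssim(fake.float() / 255.0, real.float() / 255.0)

print(fid.compute(), kid.compute(), inception.compute(), ssim_score)
```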
#### Qualitative Evaluation
In addition to quantitatively evaluating the generated images, we conduct a manual qualitative analysis to compare images generated using the original segmentation method with those produced by the SAM segmentation method. This qualitative evaluation helps identify strengths and limitations that may not be apparent through quantitative metrics alone. By carefully examining the images, we can detect artifacts, inconsistencies, and areas where the segmentation may fail, such as boundary inaccuracies, occlusions, or unrealistic deformations. It will also provide insights into the overall realism and perceptual quality of the images, which is crucial for applications like virtual try-on, where visual appeal is key. Through visual inspection, we assess how well the clothing items integrate with the person’s body, the preservation of texture and details, and the naturalness of poses and expressions.
## Results
### Upper Body (VITON-HD)
The table below compares the performance of the original segmentation method to that of SAM in terms of FID, KID, SSIM and IS.
| | FID | KID | SSIM | IS |
|------------------------------------|--------|---------|-------|--------------------|
| Original Segmentation | **35.816** | **0.00205** | 0.456 | **3.064 ± 0.340** |
| SAM Segmentation | 37.524 | 0.00338 | **0.461** | 2.946 ± 0.298 |
The increased specificity of the SAM segmentation method likely contributes to its slightly better SSIM, as SSIM measures structural similarity between images and a more precise mask can better preserve the structural details of the clothing item. However, this might also lead to less flexibility in the generated images, which could explain the higher FID and KID scores. Higher FID and KID scores suggest that the images generated with SAM segmentation deviate more from the real image distribution, potentially because the tighter masks limit natural blending and variation.
Additionally, the IS metric, which assesses the quality and diversity of the generated images, is higher for the original segmentation method. This could be because the more generic masks of the original method allow for greater variability and richness in the generated images, as they incorporate more context and parts of the body.
### Lower Body (DressCode Lower)
| | FID | KID | SSIM | IS |
|------------------------------------|--------|---------|-------|--------------------|
| SAM Segmentation | 37.846 | 0.00269 | 0.499 | 2.7081 ± 0.38302 |
The results for the lower body using SAM segmentation on the DressCode lower dataset indicate that the method performs reasonably well in generating realistic and high-quality images. The FID score of 37.846 and the KID score of 0.00269 are comparable in magnitude to the upper-body results, the SSIM score of 0.499 suggests decent structural similarity, and the IS of 2.708 supports reasonable quality and diversity. However, since there is no benchmark with the original segmentation method on this dataset, it is hard to estimate how well the model performs in comparison.
### Qualitative Analysis
In this section we manually inspect the differences between the results achieved with our segmentation method compared to the StableVITON baseline.
Figure 10 shows that more precise segmentation can have a positive impact on the preservation of personal identity, in this case the natural hair length. The second image shows a noticeable extension of the hair length, whereas the third image, processed with SAM, closely matches the original hairstyle of the person. The rightmost image displays more clearly the change in the fit of the garment, which enhances the visual appeal of the try-on.
<div style="text-align: center; white-space: nowrap; margin: 10px;">
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/HJZPUrVHC.jpg" width="180"/>
<div class="caption" style="text-align: center;">Original image</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/BymnxwhrC.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/BkmheD3r0.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON with SAM</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/HJQr-nFS0.gif" width="180"/>
<div class="caption" style="text-align: center;">Impact on the fit</div>
</div>
<figcaption>[Fig 10] Improved try-on preserving hair length.</figcaption>
</div>
Figure 11 illustrates an example where the improved segmentation method handles details in the neckline more accurately than the baseline, which completely erased that part of the try-on garment. This strengthens the hypothesis that more flexibility in capturing the exact shape of the garment leads to better-quality try-on results.
<div style="text-align: center; white-space: nowrap; margin: 10px;">
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/HJUtvTYHR.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/BJ8tP6KSR.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON with SAM</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/BkIKD6tSR.jpg" width="180"/>
<div class="caption" style="text-align: center;">Try-on garment</div>
</div>
  <figcaption>[Fig 11] Improved try-on preserving the garment's neckline.</figcaption>
</div>
Figure 12 highlights another successful attempt at preserving personal identity, in this case the earrings of the person. This suggests that more precise segmentation ensures a more accurate representation not only of the garment, but also of accompanying accessories.
<div style="text-align: center; white-space: nowrap; margin: 10px;">
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/Hyl9PTKBA.jpg" width="180"/>
<div class="caption" style="text-align: center;">Original image</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/B1g5wTKrA.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/ByxqD6YSA.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON with SAM</div>
</div>
  <figcaption>[Fig 12] Improved try-on preserving the person's earrings.</figcaption>
</div>
Figure 13 focuses on the preservation of the hand positioning and the continuation of the garment's sleeves. In the StableVITON result, the person's left hand appears to be regenerated by the model, as it only vaguely resembles a hand. We attribute this to the imprecise garment cut-out, while our segmentation method maintains the natural appearance of the hand.
<div style="text-align: center; white-space: nowrap; margin: 10px;">
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/HJ7APTKH0.jpg" width="180"/>
<div class="caption" style="text-align: center;">Original image</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/HJXRv6KBC.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/rk70vatS0.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON with SAM</div>
</div>
<figcaption>[Fig 13] Improved try-on preserving the person's hand.</figcaption>
</div>
One issue the model encounters is occlusion of the garment by the person's hair, as can be seen in figure 14. Sometimes the hair spreads widely, preventing the segmentation tool from capturing all hair strands in varying directions. This leads to fragments of the original garment showing in the generated image, particularly around the hair region on the left, since the mask fails to precisely outline the hair. This is especially noticeable with the red label, which makes the artifact more obvious.
Our method focuses solely on segmenting the garment. Therefore, in instances where the person wears a short-sleeve shirt and the try-on garment has long sleeves, only the area within the original t-shirt's boundaries is generated by StableVITON. Apart from that, both versions display the garment's body pattern extending onto the sleeves of the resulting image, even though the try-on garment, shown in the rightmost picture of figure 14, features plain sleeves.
<div style="text-align: center; white-space: nowrap; margin: 10px">
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/S1N0z2KrA.jpg" width="180"/>
<div class="caption" style="text-align: center;">Original garment</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/HJhez2FSR.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/ByhefhtHR.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON with SAM</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/By40fntSC.jpg" width="180"/>
<div class="caption" style="text-align: center;">Try-on garment</div>
</div>
  <figcaption>[Fig 14] Drawbacks of the proposed segmentation method.</figcaption>
</div>
Figure 15 below shows an example of virtual try-on inference on the DressCode lower dataset. As observed previously, SAM's segmentation allows personal details, such as the purse shown in the image, to be preserved. Although the model places the try-on garment on the original image quite successfully, there are a few weaknesses. First, the color of the garment has changed and the patterns in the generated image are quite fragmented. Additionally, the shape of the garment is taken from the original image, which makes it hard to perform virtual try-on with garments of a different shape, such as a skirt.
<div style="text-align: center; white-space: nowrap; margin: 10px;">
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/rkkUpO2BA.jpg" width="180"/>
<div class="caption" style="text-align: center;">Try-on garment</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/Byzha_nrC.jpg" width="180"/>
<div class="caption" style="text-align: center;">Original Image</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/HyjdCd2B0.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON with SAM</div>
</div>
  <figcaption>[Fig 15] Lower body virtual try-on preserving accessories such as the purse</figcaption>
</div>
Figure 16 is another example of virtual try-on inference for the lower body. Once again the garment is placed quite well on the original image; however, the model struggles with generating new parts, which is evident around the feet of the generated image. We believe that training StableVITON further on lower-body images would minimize this error, as the model would learn to generate new parts more successfully.
<div style="text-align: center; white-space: nowrap; margin: 10px;">
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/S15MaK3BR.jpg" width="180"/>
<div class="caption" style="text-align: center;">Try-on garment</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/SkTeaK3SA.jpg" width="180"/>
<div class="caption" style="text-align: center;">Original Image</div>
</div>
<div style="display: inline-block; margin: 5px;">
<img src="https://hackmd.io/_uploads/S1kuwKhSC.jpg" width="180"/>
<div class="caption" style="text-align: center;">StableVITON with SAM</div>
</div>
<figcaption>[Fig 16] Lower body virtual try-on with fragmentation near the feet</figcaption>
</div>
## Discussion
Looking at the results of the evaluation metrics, it is not hard to jump to the conclusion that the original segmentation outperforms our proposed SAM-based segmentation method.
On the other hand, a human evaluator would likely choose the new segmentation method more often than not, as it prevents undesired changes and better preserves personal attributes like hair length, tattoos, and earrings. SAM's ability to segment intricate patterns and small details means the model can handle complex garments and accessories more effectively, providing a higher level of detail and accuracy in the generated images. To make this truly useful for StableVITON, SAM needs to be tuned further and StableVITON needs to be further trained with the new segmentation results, so that the model performs well in instances like figures 14 and 15.
### About the Metrics
The metrics used in our evaluation are well-established for various image generation tasks. However, they might not fully capture the specific requirements of our virtual try-on system. These metrics are generally designed for use cases that differ from ours and might not fully reflect the visual and contextual accuracy important for virtual try-on applications.
- **FID and KID:** These metrics suggest that the original segmentation method generates images closer to the real ones in terms of distribution and quality. However, they might not accurately measure the fidelity of garment alignment and fit, which are crucial for a realistic try-on experience. One significant limitation of FID is that scores are massively biased by the number of samples used; fewer samples result in larger scores.
- **SSIM:** This metric also cannot fully account for the nuances of garment fitting and personal detail preservation that are essential in virtual try-on applications. Its advantage is that it can still tell us something about the quality of the garment region specifically.
- **IS:** The Inception Score evaluates image quality and diversity, but it might not directly correlate with the accuracy of clothing fit and personalization in our context. In particular, diversity is not important here: we may want to fit the same garment on many people, in which case this metric would give a low score even for good results.
### Future Work
One significant limitation we faced was the lack of computational power and memory, which prevented us from conducting comprehensive training sessions. We hypothesize that further training with the proposed method for generating masks will improve the StableVITON model beyond the level of performance it has been able to achieve. To this end, future work should focus on acquiring more computational resources to enable extensive model training and fine-tuning. In addition, an improved loss function needs to be defined to teach the model to extend beyond the mask whenever needed, for example to be able to replace short sleeves by long ones, a problem we encountered in figure 14.
Furthermore, to improve the effectiveness of our virtual try-on system, more diverse datasets are necessary. These should encompass a wider variety of clothing types, body shapes, poses, and even environments, so that the model can generalize well across different scenarios. Expanding the training data would address some of the observed weaknesses and biases and enhance the overall performance of the system.
## Conclusion
In summary, while the current evaluation metrics suggest that the original segmentation method outperforms our proposed SAM-based approach, these metrics do not fully capture the specific requirements and nuances of virtual try-on systems. The SAM-based method shows promise in preserving personal details and preventing undesired changes, indicating potential for further development and improvement. Future work should focus on overcoming computational limitations and expanding datasets to achieve better results.
By integrating advanced segmentation techniques like SAM, our study demonstrates the potential for more accurate and personalized virtual try-on experiences. As we address the identified weaknesses and leverage greater computational resources, we anticipate significant advancements in the realism and applicability of virtual try-on technology, ultimately enhancing the online shopping experience for consumers.
## Team contributions
Violeta:
* Segmentation with SAM
* Preprocessing of segmentation masks
* Prepare images for inference with StableVITON
* Write blog
Shayan:
* Setting up inference
* Setting up training (both Kaggle and Google Cloud, with disappointing results due to insufficient memory)
* Second part of training dataset "preprocessing"
* Write blog
Yash:
* Perform inference with preprocessed segmentation masks
* Evaluate the generated images with four performance metrics
* Generate agnostic masks and clothing mask for lower-body dataset
* Write blog
## References
[^1]: [Kim, J., Gu, G., Park, M., Park, S., & Choo, J. (2023). StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2312.01725](https://github.com/rlawjdghek/StableVITON)
[^2]: [Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. CoRR, abs/1505.04597.](https://arxiv.org/abs/1505.04597)
[^3]: [OpenAI. (2021, January 5). CLIP: Connecting text and images. https://openai.com/index/clip](https://openai.com/index/clip)
[^4]: [Davison, T. (2024, June 4). What is the environmental impact of returning online products? CleanHub Blog.](https://blog.cleanhub.com/ecommerce-returns-environmental-impact#:~:text=Up%20to%2024%20million%20metric,year%20(Optoro%2C%202022))
[^5]: [Statista Research Department. (2023, December 18). Topic: Fashion e-commerce worldwide. Statista. https://www.statista.com/topics/9288/fashion-e-commerce-worldwide/#topicOverview](https://www.statista.com/topics/9288/fashion-e-commerce-worldwide/#topicOverview)
[^6]: [Kalpoe, R. (2020). Technology acceptance and return management in apparel e-commerce. Journal of Supply Chain Management Science, 1(3-4), 118–137. https://doi.org/10.18757/jscms.2020.5454](https://doi.org/10.18757/jscms.2020.5454)
[^7]: [Kim, J., & Forsythe, S. (2008). Adoption of Virtual Try-on technology for online apparel shopping. Journal of Interactive Marketing, 22(2), 45–59. doi:10.1002/dir.20113](https://journals.sagepub.com/doi/abs/10.1002/dir.20113)
[^8]: [Choi S., Park, S., Lee, M., & Choo, J. (2021). VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. CoRR, abs/2103.16874. Retrieved from https://arxiv.org/abs/2103.16874](https://arxiv.org/abs/2103.16874)
[^9]: [Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., … Girshick, R. (2023). Segment Anything. arXiv [Cs.CV]. Retrieved from http://arxiv.org/abs/2304.02643](http://arxiv.org/abs/2304.02643)
[^10]: [Dhariwal, P., & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, & J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems (Vol. 34, pp. 8780–8794).](https://proceedings.neurips.cc/paper_files/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf)
[^11]: [Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative Adversarial Nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (Vol. 27).](https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf)
[^12]: [Kosar, V. (2021, December 28). Cross-attention in Transformer architecture. https://vaclavkosar.com/ml/cross-attention-in-transformer-architecture.](https://vaclavkosar.com/ml/cross-attention-in-transformer-architecture)
[^13]: [Yang B., Gu S., Zhang B., Zhang T., Chen X., Sun X., Chen D., and Wen F. (2023). Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 18381-18391).](https://arxiv.org/abs/2211.13227)
[^14]: [Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840-6851.](https://arxiv.org/abs/2006.11239)
[^15]: [Sun, W., Liu, Z., Zhang, Y., Zhong, Y., & Barnes, N. (2023). An Alternative to WSSS? An Empirical Study of the Segment Anything Model (SAM) on Weakly-Supervised Semantic Segmentation Problems. ArXiv, abs/2305.01586.](https://arxiv.org/abs/2305.01586)
[^16]: [Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W., Dollár, P., & Girshick, R.B. (2023). Segment Anything. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 3992-4003.](https://ieeexplore.ieee.org/document/10378323/)
[^17]: [Morelli, D., Fincato, M., Cornia, M., Landi, F., Cesari, F., & Cucchiara, R. (2022). Dress Code: High-Resolution Multi-Category Virtual Try-On. Proceedings of the European Conference on Computer Vision.](https://arxiv.org/abs/2204.08532)
[^18]: [Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., & Hochreiter, S. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium. CoRR, abs/1706.08500.](http://arxiv.org/abs/1706.08500)
[^19]: [Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 600–612. doi:10.1109/TIP.2003.819861](https://ieeexplore.ieee.org/document/1284395)
[^20]: [Salimans, T., Goodfellow, I. J., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved Techniques for Training GANs. CoRR, abs/1606.03498.](http://arxiv.org/abs/1606.03498)
[^21]: [Bińkowski, M., Sutherland, D. J., Arbel, M., & Gretton, A. (2018). Demystifying mmd gans. arXiv preprint arXiv:1801.01401.](http://arxiv.org/abs/1801.01401)
[^22]: [Wang, T.Y., Ceylan, D., Popović, J., & Mitra, N.J. (2018). Learning a shared shape space for multimodal garment design. ACM Transactions on Graphics (TOG), 37, 1 - 13.](https://arxiv.org/abs/1806.11335)
[^23]: [Dong, H., Liang, X., Wang, B., Lai, H., Zhu, J., & Yin, J. (2019). Towards Multi-Pose Guided Virtual Try-On Network. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 9025-9034.](https://arxiv.org/abs/1902.11026)
[^24]: [Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative Adversarial Nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (Vol. 27).](https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf)
[^25]: [Jetchev, N., & Bergmann, U. (2017). The conditional analogy GAN: Swapping fashion articles on people images. In Proceedings of the IEEE International Conference on Computer Vision Workshops (pp. 2287-2292).](http://arxiv.org/abs/1709.04695)
[^26]: [Li, K., Zhang, J., & Forsyth, D. (2023). POVNet: Image-Based Virtual Try-On Through Accurate Warping and Residual. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10), 12222–12235. doi:10.1109/TPAMI.2023.3283302](https://pubmed.ncbi.nlm.nih.gov/37294645/)
[^27]: [Zhu, L., Yang, D., Zhu, T., Reda, F., Chan, W., Saharia, C., ... & Kemelmacher-Shlizerman, I. (2023). Tryondiffusion: A tale of two unets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4606-4615).](http://arxiv.org/abs/2306.08276)
[^28]: [Gou, J., Sun, S., Zhang, J., Si, J., Qian, C., & Zhang, L. (2023). Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow. Proceedings of the 31st ACM International Conference on Multimedia.](https://arxiv.org/abs/2308.06101)