# TIPO: Text to Image with text presampling for Prompt Optimization

**This report is still a work in progress.**

## Introduction

In this project, we introduce "TIPO" (**T**ext to **I**mage with text presampling for **P**rompt **O**ptimization), a framework designed to significantly enhance the quality and usability of Text-to-Image (T2I) generative models. TIPO uses Large Language Models (LLMs) to perform "Text Presampling" within the inference pipeline of text-to-image generative modeling. By refining and extending user input prompts, TIPO enables generative models to produce superior results with minimal user effort, making T2I systems more accessible and effective for a wider range of users.

## Method and Concept

### Concept

The core concept behind **TIPO** revolves around the relationship between prompt specificity and the quality and diversity of generated images. To formalize this, let:

- **$\mathcal{P}$** denote the set of all possible prompts.
- **$\mathcal{N}$** denote the space of Gaussian noise vectors (since Gaussian noise is commonly used as the initial input for image generative models).
- **$\mathcal{I}$** denote the set of all possible images.

A text-to-image model can be viewed as a mapping function:

$$
f(p): \mathcal{N} \rightarrow \mathcal{I}_p
$$

where, for a given prompt $p \in \mathcal{P}$, the model maps noise vectors from $\mathcal{N}$ to images in the subset $\mathcal{I}_p \subseteq \mathcal{I}$ corresponding to the prompt $p$.

**Key Idea:**

- **Simple Prompts:** Brief and general prompts correspond to broad distributions of possible outputs.
- **Detailed Prompts:** Longer and more specific prompts correspond to narrower distributions of outputs.

Formally, let $p_s$ be a simple prompt and $p_d$ be a detailed prompt such that $p_d$ is an extension or refinement of $p_s$. Then the set of images generated from $p_d$, denoted $f(p_d)(\mathcal{N})$, is a subset of the images generated from $p_s$, denoted $f(p_s)(\mathcal{N})$:

$$
f(p_d)(\mathcal{N}) \subseteq f(p_s)(\mathcal{N})
$$

In an ideal text-to-image model, we would have $f(p_d)(\mathcal{N}) = \mathcal{I}_{p_d}$ and $f(p_s)(\mathcal{N}) = \mathcal{I}_{p_s}$, meaning the model perfectly captures all possible images corresponding to each prompt. In reality, however, the model's capacity to generate a wide distribution of content is limited. Consequently, it may struggle to adequately represent the broad distribution associated with simple prompts.

**Text Pre-sampling:**

This limitation suggests that, to capture the full distribution $\mathcal{I}_{p_s}$, it may be more effective to aggregate the outputs from multiple detailed prompts derived from $p_s$. Specifically, by generating images from many detailed prompts $p_{d_i}$ that extend $p_s$ and aggregating the outputs $f(p_{d_i})(\mathcal{N})$, we can approximate the broader distribution $\mathcal{I}_{p_s}$ more effectively than by sampling directly from $f(p_s)(\mathcal{N})$. We refer to this approach as **"Text Pre-sampling."**

### TIPO Framework

**TIPO** leverages Large Language Models (LLMs) to automatically extend and refine user-provided prompts. By generating detailed, content-rich prompts $p_d$ from simpler user inputs $p_s$, TIPO ensures that the resultant prompts capture a more specific subset of possible outputs while maintaining alignment with the original user intent.
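To make the text pre-sampling idea concrete, the following is a minimal sketch (not the official TIPO implementation) of plugging a prompt-extension LLM in front of a T2I pipeline: several detailed prompts $p_{d_i}$ are sampled from one simple prompt $p_s$, and one image is generated per detailed prompt. The checkpoint name, sampling settings, and use of `diffusers` here are illustrative assumptions.

```python
# Minimal sketch of "text pre-sampling": instead of sampling images directly
# from a simple prompt p_s, we first sample several detailed prompts p_d_i
# with a prompt-extension LLM, then generate one image per detailed prompt.
# Model ids and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from diffusers import StableDiffusionXLPipeline

PROMPT_LM = "KBlueLeaf/TIPO-200M"  # assumed checkpoint name
T2I_MODEL = "stabilityai/stable-diffusion-xl-base-1.0"

tokenizer = AutoTokenizer.from_pretrained(PROMPT_LM)
prompt_lm = AutoModelForCausalLM.from_pretrained(PROMPT_LM, torch_dtype=torch.float16).cuda()
t2i = StableDiffusionXLPipeline.from_pretrained(T2I_MODEL, torch_dtype=torch.float16).to("cuda")

def presample_prompts(p_s: str, n: int = 4, max_new_tokens: int = 256) -> list[str]:
    """Sample n detailed prompts p_d_i that extend the simple prompt p_s."""
    inputs = tokenizer(p_s, return_tensors="pt").to(prompt_lm.device)
    outputs = prompt_lm.generate(
        **inputs,
        do_sample=True,               # stochastic sampling -> diverse extensions
        temperature=1.0,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
    )
    # Each decoded sequence contains p_s as its prefix, so p_d extends p_s.
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

p_s = "scenery"
images = []
for p_d in presample_prompts(p_s):
    # Each detailed prompt covers a narrower sub-distribution of I_{p_s};
    # aggregating images over many p_d_i approximates the broad distribution.
    images.append(t2i(prompt=p_d).images[0])
```

In practice, TIPO conditions generation on a structured `<meta>` prefix and task-specific formats (described below) rather than on the raw prompt alone.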
#### Constructing $p_s$ and $p_d$

The construction of simple prompts ($p_s$) and detailed prompts ($p_d$) varies depending on the type of dataset and input modality. We consider two primary scenarios: tag-based captions and natural language captions.

##### 1. Tag-Based Captions

For datasets like **danbooru2023** [18, 19], tags are used to describe the content of images. In text-to-image models trained on such datasets, the concatenation of tags serves as the caption for each image.

- **Tag Sets:**
  - Let $T_n = \{t_1, t_2, t_3, \ldots, t_n\}$ represent the complete set of tags for an image.
  - Derive the simple tag set $T_s = \{t_1, t_2, \ldots, t_m\}$ where $m < n$.
  - The detailed tag set is $T_d = T_n$.
- **Prompts:**
  - Simple prompt: $p_s = \text{concat}(T_s)$
  - Detailed prompt: $p_d = \text{concat}(T_d)$

This approach ensures that $p_d$ is a superset of $p_s$, aligning with the concept that detailed prompts provide a more specific description.

##### 2. Natural Language Captions

For datasets such as **GBC10M** [23] and **CoyoHD11M** [24], natural language captions generated by models like LLaVA [25] are utilized. Defining $p_s$ and $p_d$ in this context involves additional considerations:

- **Short and Long Captions:**
  - Each image is associated with two types of captions:
    - **Short Caption ($p_s$):** A brief, general description.
    - **Long Caption ($p_d$):** An extended, detailed description.
  - **Note:** In this case, $p_d$ is not a direct expansion of $p_s$ but rather a paraphrased version that includes more details.
- **Single Long Caption:**
  - For images with only a single long caption, split the caption into sentences at periods.
  - Take the full caption (sentences $1$ through $k+m$) as $p_d$ and keep only the first $k$ sentences as $p_s$:

$$
p_s = \text{concat}(\{\text{sentence}_1, \text{sentence}_2, \ldots, \text{sentence}_{k}\})
$$

$$
p_d = \text{concat}(\{\text{sentence}_1, \text{sentence}_2, \ldots, \text{sentence}_{k+m}\})
$$

#### Generation Process

To generate $p_d$ from $p_s$, TIPO employs two distinct formats based on the relationship between $p_s$ and $p_d$:

1. **When $p_s$ is a substring of $p_d$:**
   - **Format:** `<meta> <p_d>`
   - **Implementation:** In a causal language model, $p_d$ is placed directly after the `<meta>` token. The model learns to generate the subsequent tokens given any prefix substring of $p_d$.
2. **When $p_s$ is not a substring of $p_d$:**
   - **Format:** `<meta> <p_s> <p_d>`
   - **Implementation:** Since $p_s$ is not contained within $p_d$, both are included separately after the `<meta>` token to guide the generation process effectively.

#### Handling Multiple Inputs

In scenarios where both tag sequences ($T_s$) and natural language prompts ($p_s$) are present, TIPO processes each input type separately to maintain clarity and coherence.

- **Processing Logic:**
  1. **Isolate Input Types:** In each generation cycle, only one type of input is treated as the primary prompt ($p_s$), while the other types are treated as metadata.
  2. **Sequential Generation:**
     - **Step 1:** Use $p_s$ (e.g., the natural language prompt) with its corresponding metadata (e.g., aspect ratio, artists, etc.) to generate $T_d$ from $T_s$.
     - **Step 2:** Update the metadata with $T_d$ and use $p_s$ to generate the detailed natural language prompt $p_d$.
- **Example Workflow** (a minimal sketch of this two-pass loop follows below):
  - **Initial Inputs:** $p_s$ (natural language) and $T_s$ (tags).
  - **Generation 1:** Input `<meta> p_s T_s` to generate $T_d$ from $T_s$.
  - **Generation 2:** Input `<meta> T_d p_s` to generate $p_d$.
  - **Aggregation:** Use $T_d$, $p_d$, and the metadata to construct the final output.
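The two-pass workflow above can be sketched as follows. This is an illustrative outline rather than the reference implementation: the serialization of the `<meta>` block, the comma-separated tag handling, and the `generate_text` callable are assumptions; the actual formats are defined by the training setup described later.

```python
from typing import Callable

def format_meta(meta: dict[str, str]) -> str:
    # Metadata entries are serialized as "<Category>: <Content>" lines
    # (see "Prompt and Metadata Formatting" below).
    return "\n".join(f"{k}: {v}" for k, v in meta.items())

def tipo_two_pass(
    p_s: str,
    t_s: list[str],
    meta: dict[str, str],
    generate_text: Callable[[str], str],  # any causal-LM completion function
) -> tuple[list[str], str]:
    # Generation 1: NL prompt + metadata as context, extend T_s -> T_d.
    gen1_input = f"{format_meta(meta)}\n{p_s}\n{', '.join(t_s)}"
    t_d = [t.strip() for t in generate_text(gen1_input).split(",") if t.strip()]

    # Generation 2: fold T_d into the metadata, then refine p_s -> p_d.
    meta_with_tags = {**meta, "tags": ", ".join(t_d)}
    gen2_input = f"{format_meta(meta_with_tags)}\n{p_s}"
    p_d = generate_text(gen2_input)

    # Aggregation: T_d, p_d, and the metadata together form the final T2I prompt.
    return t_d, p_d
```

Because `generate_text` is passed in as a parameter, the same skeleton can be driven by any causal LM backend (for example, a `model.generate` wrapper).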
This approach ensures that each input type is expanded and refined without interference, allowing TIPO to effectively capture the comprehensive distribution of possible outputs.

#### Mathematical Formalization

To encapsulate the generation process, consider the following formalization:

- **Prompt Expansion Function:**
  $$
  \mathcal{E}: \mathcal{P}_s \times \mathcal{M} \rightarrow \mathcal{P}_d
  $$
  where $\mathcal{P}_s$ is the space of simple prompts, $\mathcal{M}$ represents metadata (e.g., tags), and $\mathcal{P}_d$ is the space of detailed prompts.
- **Generation Step:**
  $$
  p_d = \mathcal{E}(p_s, M)
  $$
  where $M$ can be either $T_s$ or another form of metadata, depending on the input type.

By iteratively applying the expansion function $\mathcal{E}$, TIPO systematically refines prompts to enhance the diversity and quality of generated images.

## Experiments Setup

In this section, we detail the experimental setup used to evaluate the **TIPO** framework. We outline the formatting conventions for prompts and metadata, define the various tasks employed during training, and describe the models and datasets utilized in our experiments.

### Prompt and Metadata Formatting

In **TIPO**, we define a simple and consistent format for the content of the `<meta>` token or the simple prompt $p_s$:

```
<Category>: <Content>
```

Metadata categories include, for example, **artist**, **copyright**, **aspect ratio**, **quality**, and **year**. In certain cases, input tags or natural language (NL) prompts are treated as metadata as well, especially when the task involves generating tags and NL prompts conditioned on each other.

### Task Definitions and Training Formats

**TIPO** encompasses three primary tasks: extending tag sequences, extending NL prompts, and generating refined NL prompts. To facilitate these tasks, we define several specific task types and corresponding formatting methods during training:

1. **tag_to_long**: Use tags as metadata to generate a new NL prompt or extend a user-provided NL prompt.
2. **long_to_tag**: Use an NL prompt as metadata to extend a tag sequence.
3. **short_to_tag**: Use the simple prompt $p_s$ as metadata to extend a tag sequence.
4. **short_to_long**: Use a user-input NL prompt as metadata to generate a refined detailed prompt $p_d$.
5. **short_to_tag_to_long**: Use a user-input NL prompt or tag sequence as metadata to generate a refined detailed prompt $p_d$.
6. **short_to_long_to_tag**: Use a user-input NL prompt or generated NL prompt as metadata to extend a tag sequence.
7. **tag_to_short_to_long**: Use user-input tags or NL prompts as metadata to generate a refined detailed prompt $p_d$.

These task types cover special cases where, for instance, inputting a short description or a short tag sequence yields a full-size tag sequence and NL prompt in a single pass (i.e., generating $T_d$ and $p_d$ simultaneously).

### Training Procedure

In our experiments, each training pass generates a new prompt of a single type. This means that to obtain an extended detailed prompt $p_d$, a refined prompt $p_d$, and an extended tag sequence $T_d$, the model requires at least three passes. During training, for each dataset entry, we randomly choose one of the seven task types to apply. Additionally, the point at which we split the simple prompt from the detailed prompt ($p_s$ from $p_d$) and the simple tag set from the detailed tag set ($T_s$ from $T_d$) is also randomly decided. This effectively increases the real dataset size beyond the number of entries present in the dataset, due to the combinatorial possibilities introduced by random task selection and prompt splitting. A sketch of how a single training example might be assembled is shown below.
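To make the training format concrete, here is a hypothetical sketch of how one training example could be assembled: a task type and split point are drawn at random, the metadata block is serialized as `<Category>: <Content>` lines, and the pieces are concatenated into a single causal-LM sequence. The exact token layout, separators, field names, and split heuristics are assumptions for illustration, not taken verbatim from the TIPO training code.

```python
import random

TASK_TYPES = [
    "tag_to_long", "long_to_tag", "short_to_tag", "short_to_long",
    "short_to_tag_to_long", "short_to_long_to_tag", "tag_to_short_to_long",
]

def build_training_example(entry: dict) -> str:
    """Assemble one training sequence from a dataset entry.

    `entry` is assumed to hold: `tags` (list of tag strings), `short_caption`,
    `long_caption`, and a `meta` dict (artist, aspect ratio, quality, ...).
    """
    task = random.choice(TASK_TYPES)                    # random task per entry
    tags = entry["tags"]
    split = random.randint(1, max(1, len(tags) - 1))    # random T_s / T_d split
    t_s, t_d = tags[:split], tags

    meta_lines = [f"{k}: {v}" for k, v in entry["meta"].items()]
    meta_block = "\n".join([f"task: {task}", *meta_lines])

    if task == "long_to_tag":
        # The NL prompt acts as metadata; the model extends T_s into T_d.
        meta_block += f"\nlong: {entry['long_caption']}"
        source, target = ", ".join(t_s), ", ".join(t_d)
    elif task == "short_to_long":
        # The short caption acts as metadata; the target is the long caption p_d.
        meta_block += f"\nshort: {entry['short_caption']}"
        source, target = "", entry["long_caption"]
    else:
        # The remaining task types follow the same pattern with different roles.
        meta_block += f"\nshort: {entry['short_caption']}"
        source, target = ", ".join(t_s), ", ".join(t_d)

    # The causal LM is trained on the concatenated sequence; where the loss is
    # applied (context vs. target) depends on the task definition.
    return "\n".join(part for part in (meta_block, source, target) if part)
```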
### Model Architecture and Training Details

We utilize the **LLaMA** architecture [35, 36] with models of 200 million and 500 million parameters:

- **200M Model**: Pretrained on the **Danbooru2023** [18, 19] and **GBC10M** [23] datasets for 5 epochs, then fine-tuned on the Danbooru2023, GBC10M, and **CoyoHD11M** [24] datasets for 3 epochs, for a total of approximately 40 billion tokens seen.
- **500M Model**: Pretrained on the Danbooru2023, GBC10M, and CoyoHD11M datasets for 5 epochs, for a total of approximately 30 billion tokens seen.

***Note***: Token counts are based on non-padding tokens only; since the entries are relatively short and vary widely in length, the reported token counts may be lower than expected.

### Dataset Augmentation and Effective Size

As mentioned previously, the randomization in task selection and prompt splitting effectively increases the real size of the training dataset. By generating multiple variations from a single dataset entry, the model is exposed to a wider range of inputs and outputs, enhancing its ability to generalize and perform the various tasks defined in **TIPO**.

## Evaluation Results

All results presented in this section are tested on the **TIPO-200M** model.

### Generation Processes

In our experiments, we set up two distinct generation processes:

#### 1. Short/Truncated Long Test

- **Short Prompts:**
  - 10k short prompts randomly selected from **GBC10M**.
  - 10k short prompts randomly selected from **CoyoHD11M**.
- **Truncated Long Prompts:**
  - 10k long prompts randomly selected from **GBC10M**.
  - 10k long prompts randomly selected from **CoyoHD11M**.
  - Each long prompt is truncated to two sentences by splitting at periods.
- **TIPO-Enhanced Prompts:**
  - **TIPO + Short:** Apply the *short_to_long* task on the short prompts, generating a total of 20k prompts.
  - **TIPO + Truncated Long:** Apply the *long_to_tag* task while forcing the model to expand the input long prompts, resulting in extended long prompts (the generated tags are ignored).

For the short/truncated long test, we generate one image from each prompt using the **SDXL-1.0-base** model [6].

#### 2. Scenery Tag Test

In this test, we randomly select 32,768 entries from **Danbooru2023** that include the "scenery" tag. We set up the following inputs:

- **"Scenery" + Meta:**
  - Retain all metadata categories as described earlier.
  - Include only "scenery" as the content tag.
- **"Scenery" + Meta + TIPO:**
  - Use "scenery" + meta as input.
  - Extend $T_s$ (scenery) to $T_d$ and generate $p_d$ from $T_d$.

For the scenery tag test, we generate one image from each prompt using the **Kohaku-XL-zeta** model, an SDXL model fine-tuned on the Danbooru dataset.
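As an illustration of how the evaluation inputs above could be prepared, the snippet below sketches the two prompt constructions: truncating a long caption to its first two sentences, and building the "scenery" + meta input from a Danbooru-style entry. The field names and the period-based splitting heuristic are assumptions for illustration.

```python
import re

def truncate_to_sentences(long_caption: str, n: int = 2) -> str:
    """Truncate a long NL caption to its first n sentences (split at periods)."""
    sentences = [s.strip() for s in re.split(r"\.\s*", long_caption) if s.strip()]
    return ". ".join(sentences[:n]) + "."

def scenery_meta_input(entry: dict) -> str:
    """Build the '"scenery" + meta' input: keep the metadata categories and
    use only "scenery" as the content tag. Field names are assumed."""
    meta_categories = ["artist", "copyright", "aspect ratio", "quality", "year"]
    lines = [f"{k}: {entry[k]}" for k in meta_categories if k in entry]
    lines.append("scenery")
    return "\n".join(lines)

# Example usage with a toy caption:
print(truncate_to_sentences(
    "A quiet lake at dawn. Mist drifts over the water. Birds circle above."
))
# -> "A quiet lake at dawn. Mist drifts over the water."
```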
### Evaluation Metrics

We employ the following metrics to evaluate the quality and alignment of the generated prompts and images:

#### 1. Aesthetic Score (Higher is Better)

We compute the Aesthetic Score using the **Aesthetic Predictor V2.5** [29]. This metric is calculated on the short/truncated long test.

![Aesthetic Score Distribution](https://hackmd.io/_uploads/HkJphkSCA.png)
*Figure 1: Aesthetic Score distribution. TIPO-generated prompts provide significantly higher values at the 25% quantile, indicating a higher lower bound on the aesthetic score. Both the median and mean values are also significantly higher.*

#### 2. AI Corrupt Score (Higher is Better) [30]

The AI Corrupt Score is obtained from the **AICorruptMetrics** in **sdeval** [33]. The underlying model is trained on AI-generated images with human-annotated "corrupt or not" labels; a higher value indicates a higher likelihood of the image being "correct" or "complete". This metric is calculated on the short/truncated long test.

![AI Corrupt Score Distribution](https://hackmd.io/_uploads/SJlktvE0R.png)
*Figure 2: AI Corrupt Score distribution. TIPO-generated prompts achieve higher scores, indicating less corrupted images.*

#### 3. Frechet Dino Distance (FDD) on Scenery Tag Test [31, 32]

Traditionally, the **Frechet Inception Distance (FID)** [34] has been used to measure the distribution distance between datasets and generated images. However, recent work has shown that FID is not always aligned with human preferences [31]. Therefore, we measure the Frechet Distance on **DinoV2** features, using four different scales of the DinoV2 model. We use FDD on the Scenery Tag Test to demonstrate that when input prompts are overly brief, the model struggles to generate images that cover the true distribution; with **TIPO**, this issue is mitigated.

| FDD Model    | `<meta> scenery` only | `<meta> scenery` + TIPO |
|--------------|-----------------------|-------------------------|
| DinoV2 ViT-S | 0.1917                | **0.1786**              |
| DinoV2 ViT-B | 0.2002                | **0.1755**              |
| DinoV2 ViT-L | 0.2017                | **0.1863**              |
| DinoV2 ViT-G | 0.2359                | **0.2096**              |

*Table 1: Frechet Dino Distance (FDD) on the Scenery Tag Test. Lower values indicate better alignment with the original dataset distribution.*

As shown in Table 1, applying **TIPO** significantly improves FDD across all DinoV2 scales. This implies that with **TIPO**, the generated images more closely match the original distribution of the dataset.

## Conclusion

In this report, we introduced **TIPO**, a novel framework that enhances the quality of Text-to-Image (T2I) models through automatic prompt engineering. By leveraging Large Language Models (LLMs) to extend and refine user-provided prompts, **TIPO** effectively bridges the gap between simple user inputs and detailed, content-rich prompts. This approach ensures that the generated prompts are not only more specific but also maintain strong alignment with the original user intent.

### Key Contributions

1. **Automatic Prompt Engineering:** We demonstrated the potential of **TIPO** in automating the process of prompt refinement. By transforming simple prompts ($p_s$) into detailed prompts ($p_d$), **TIPO** enhances the specificity and richness of the prompts, leading to higher-quality image generation.
2. **Versatile Task Framework:** **TIPO** encompasses a diverse set of tasks, including extending tag sequences, generating refined natural language prompts, and converting between different prompt formats. This versatility allows **TIPO** to handle various input modalities and dataset types effectively.
3. **Enhanced Alignment with T2I Datasets:** Our experiments revealed that improvements achieved through prompt modification can surpass the performance differences between different text-to-image model architectures.
This finding underscores the critical importance of aligning user inputs with the underlying T2I dataset, highlighting that effective prompt engineering can significantly enhance model performance without necessitating architectural changes.

### Experimental Insights

The experimental results validated the efficacy of **TIPO** across multiple metrics:

- **Aesthetic Score:** **TIPO**-enhanced prompts consistently achieved higher aesthetic scores, indicating an improvement in the visual quality and appeal of the generated images.
- **AI Corrupt Score:** Higher AI Corrupt Scores for images generated from **TIPO**-generated prompts suggest that these images are more likely to be "correct" and "complete", reflecting better adherence to the desired content and structure.
- **Frechet Dino Distance (FDD):** **TIPO** significantly reduced the FDD score, demonstrating that the generated images more closely align with the original dataset distribution. This improvement highlights **TIPO**'s ability to help users generate images that are not only high in quality but also representative of the target data distribution.

### Implications and Future Work

The success of **TIPO** in improving T2I model outputs through prompt engineering opens several avenues for future research:

- **Broader Application of Prompt Engineering:** Exploring **TIPO**'s applicability to other generative tasks, such as text generation or audio synthesis, could further demonstrate the versatility and impact of automatic prompt engineering.
- **Integration with Interactive Systems:** Incorporating **TIPO** into interactive applications where users can iteratively refine prompts in real time may enhance user experience and enable more precise control over generated content.
- **Advanced Alignment Techniques:** Investigating more sophisticated methods for aligning user inputs with dataset distributions could further enhance the performance and reliability of generative models.

### Summary

This study underscores the pivotal role of prompt engineering in the domain of Text-to-Image generation. By automating the refinement and extension of user prompts, **TIPO** not only improves the quality and specificity of generated images but also highlights the significance of aligning user inputs with the underlying data distributions. The findings suggest that strategic modifications to prompts can lead to substantial performance gains, potentially surpassing those achieved through architectural innovations alone. As generative models continue to evolve, frameworks like **TIPO** will be instrumental in unlocking their full potential and ensuring that they meet diverse user needs with precision and creativity.

# References

1. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, & Mohammad Norouzi (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In *Advances in Neural Information Processing Systems*.
2. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, & Ilya Sutskever. (2021). Zero-Shot Text-to-Image Generation.
3. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, & Mark Chen. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents.
4. James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, & Aditya Ramesh (2023). Improving Image Generation with Better Captions.
5. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis With Latent Diffusion Models. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)* (pp. 10684-10695).
6. Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, & Robin Rombach. (2023). SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.
7. Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, & Robin Rombach. (2024). Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation.
8. Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, & Zhenguo Li (2024). PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. In *The Twelfth International Conference on Learning Representations*.
9. Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, & Zhenguo Li. (2024). PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation.
10. Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, & Surya Ganguli (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In *ICML* (pp. 2256-2265).
11. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. In *Advances in Neural Information Processing Systems* (pp. 6840–6851). Curran Associates, Inc.
12. Prafulla Dhariwal, & Alexander Quinn Nichol (2021). Diffusion Models Beat GANs on Image Synthesis. In *Advances in Neural Information Processing Systems*.
13. Jonathan Ho, & Tim Salimans (2021). Classifier-Free Diffusion Guidance. In *NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications*.
14. Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, & Noam Shazeer (2018). Generating Wikipedia by Summarizing Long Sequences. In *International Conference on Learning Representations*.
15. Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
16. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All you Need. In *Advances in Neural Information Processing Systems*. Curran Associates, Inc.
17. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. *J. Mach. Learn. Res.*, 21(1).
18. nyanko7. (2024). nyanko7/danbooru2023. Hugging Face Datasets. WebP-converted version: KohakuBlueLeaf. (2024). KBlueLeaf/danbooru2023-webp-4Mpixel. Hugging Face Datasets.
19. KohakuBlueLeaf. (2024). HakuBooru: text-image dataset maker for anime-style images. GitHub (KohakuBlueleaf/HakuBooru).
20. Shih-Ying Yeh, Yu-Guan Hsieh, Zhidong Gao, Bernard B W Yang, Giyeong Oh, & Yanmin Gong (2024). Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation. In *The Twelfth International Conference on Learning Representations*.
21. Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, & Enrico Shippole (2024). Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers. *CoRR*, abs/2401.11605.
22. Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, & Ion Stoica. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.
23. Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pour Ansari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, & Marco Cuturi. (2024). Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions.
24. CaptionEmporium. (2023). coyo-hd-11m-llavanext. Hugging Face. https://huggingface.co/datasets/CaptionEmporium/coyo-hd-11m-llavanext
25. Haotian Liu, Chunyuan Li, Qingyang Wu, & Yong Jae Lee. (2023). Visual Instruction Tuning.
26. Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu, Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang, Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, & Qinglin Lu. (2024). Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding.
27. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, & Robin Rombach. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.
28. Black Forest Labs. Flux. GitHub. https://github.com/black-forest-labs/flux
29. discus0434. (2024). SigLIP-based Aesthetic Score Predictor. GitHub. https://github.com/discus0434/aesthetic-predictor-v2-5
30. narugo1992. (2023). AI-Corrupt Score for Anime Images. Hugging Face. https://huggingface.co/deepghs/ai_image_corrupted
31. George Stein, Jesse C. Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Leigh Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L. Caterini, J. Eric T. Taylor, & Gabriel Loaiza-Ganem. (2023). Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models.
32. Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, & Enrico Shippole (2024). Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers. In *Forty-first International Conference on Machine Learning*.
33. narugo1992. (2023). deepghs/sdeval: Evaluation for stable diffusion model training. GitHub. https://github.com/deepghs/sdeval
34. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, & Sepp Hochreiter (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In *Advances in Neural Information Processing Systems 30* (pp. 6626–6637).
35. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, & Guillaume Lample. (2023). LLaMA: Open and Efficient Foundation Language Models.
36. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, & Thomas Scialom. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models.