# **PyTorch Masterclass: Part 4 – Generative Models with PyTorch** **Duration: ~120 minutes** #PyTorch #GenerativeAI #GANs #VAEs #DiffusionModels #Autoencoders #TextToImage #DeepLearning #MachineLearning #AI #GenerativeAdversarialNetworks #VariationalAutoencoders #StableDiffusion #DALLE #ImageGeneration #MusicGeneration #AudioSynthesis #LatentSpace #PyTorchGenerative --- ## **Table of Contents** 1. [Recap of Parts 1-3: Foundations, Computer Vision, and NLP](#recap-of-parts-1-3-foundations-computer-vision-and-nlp) 2. [Introduction to Generative Models](#introduction-to-generative-models) 3. [Autoencoders: Theory and Implementation](#autoencoders-theory-and-implementation) 4. [Variational Autoencoders (VAEs)](#variational-autoencoders-vaes) 5. [Generative Adversarial Networks (GANs)](#generative-adversarial-networks-gans) 6. [Diffusion Models](#diffusion-models) 7. [Text-to-Image Generation](#text-to-image-generation) 8. [Music and Audio Generation](#music-and-audio-generation) 9. [Evaluating Generative Models](#evaluating-generative-models) 10. [Building a Complete Image Generation Pipeline](#building-a-complete-image-generation-pipeline) 11. [Quiz 4: Test Your Understanding of Generative Models](#quiz-4-test-your-understanding-of-generative-models) 12. [Summary and What's Next in Part 5](#summary-and-whats-next-in-part-5) --- ## **Recap of Parts 1-3: Foundations, Computer Vision, and NLP** Welcome to **Part 4** of our comprehensive PyTorch Masterclass! In **Part 1**, we established the foundations of PyTorch by covering: - Core tensor operations and GPU acceleration - Automatic differentiation with Autograd - Building and training neural networks from scratch - Loss functions, optimizers, and training loops - Debugging with TensorBoard In **Part 2**, we explored **computer vision** with: - Dataset and DataLoader for efficient data handling - Image preprocessing and augmentation with Transforms - Convolutional Neural Networks (CNNs) architecture and theory - Training CNNs on CIFAR-10 from scratch - Transfer learning with pretrained models (ResNet, EfficientNet) Then in **Part 3**, we delved into **Natural Language Processing (NLP)**: - Text data processing and tokenization - Word embeddings (Word2Vec, GloVe, BERT) - Recurrent Neural Networks (RNNs, LSTMs, GRUs) - Attention mechanisms and Transformer architecture - Building sentiment analysis models with BERT Now, it's time to explore the exciting world of **generative models**, which create new data that resembles the training data. These models power: - AI art generators (DALL-E, Midjourney, Stable Diffusion) - Deepfake technology - Music composition systems - Drug discovery pipelines - Data augmentation for training other models In this part, you'll learn: - How autoencoders learn efficient data representations - The probabilistic framework of Variational Autoencoders - The adversarial training approach of GANs - The denoising process of diffusion models - How to build a complete text-to-image generation pipeline Let's dive into the creative side of deep learning! --- ## **Introduction to Generative Models** Generative models learn the underlying probability distribution of data $p(\mathbf{x})$ and can generate new samples that resemble the training data. This contrasts with **discriminative models** (like classifiers) that learn $p(y|\mathbf{x})$. 
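To make the distinction concrete, here is a tiny toy sketch (not part of the masterclass codebase; all names are illustrative) that fits a one-dimensional generative model of $p(\mathbf{x})$ and contrasts it with a discriminative score $p(y|\mathbf{x})$:

```python
import torch
from torch.distributions import Normal

# Toy 1-D "dataset" drawn from some unknown distribution
data = 2.0 * torch.randn(1000) + 5.0

# Generative view: model p(x) itself (here, a Gaussian fit by moment matching),
# then sample brand-new points that resemble the training data.
p_x = Normal(loc=data.mean(), scale=data.std())
new_samples = p_x.sample((10,))            # generation: new data that looks like `data`

# Discriminative view: model p(y|x) with a classifier-style score.
# It can label inputs, but it has no mechanism for producing new x.
w = torch.randn(1)
b = torch.zeros(1)
p_y_given_x = torch.sigmoid(w * data + b)  # probability of class y=1 for each x
```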
### **Why Generative Models Matter**

Generative models have revolutionized multiple fields:

- **Art and design**: Creating novel images, music, and text
- **Healthcare**: Generating synthetic medical images for training
- **Science**: Molecular design for drug discovery
- **Entertainment**: Video game content generation
- **Data augmentation**: Creating additional training samples

According to a 2023 report by McKinsey, generative AI could add **$2.6-$4.4 trillion annually** to the global economy.

### **Types of Generative Models**

| Model Type | Key Idea | Strengths | Limitations |
|------------|----------|-----------|-------------|
| **Autoregressive Models** | Model $p(\mathbf{x})=\prod_i p(x_i\mid\mathbf{x}_{<i})$ | Exact likelihood, high-quality samples | Slow generation, sequential dependency |
| **Variational Autoencoders (VAEs)** | Learn latent representation with variational inference | Fast generation, probabilistic framework | Blurry samples, approximate inference |
| **Generative Adversarial Networks (GANs)** | Adversarial training of generator and discriminator | Sharp, realistic samples | Training instability, mode collapse |
| **Flow-based Models** | Use invertible transformations for exact likelihood | Exact likelihood, efficient sampling | Restricted architecture |
| **Diffusion Models** | Gradually denoise from random noise | High-quality samples, stable training | Slow generation, complex training |

### **The Generative Modeling Framework**

All generative models aim to approximate the data distribution $p_{\text{data}}(\mathbf{x})$. They differ in how they parameterize and optimize this approximation.

Let $\mathcal{X}$ be the data space (e.g., images, text). A generative model learns a distribution $p_{\theta}(\mathbf{x})$ that approximates $p_{\text{data}}(\mathbf{x})$. The goal is to minimize the discrepancy between $p_{\theta}$ and $p_{\text{data}}$, often measured by the **Kullback-Leibler (KL) divergence**:

$$D_{\text{KL}}(p_{\text{data}}\|p_{\theta})=\int p_{\text{data}}(\mathbf{x})\log\frac{p_{\text{data}}(\mathbf{x})}{p_{\theta}(\mathbf{x})}d\mathbf{x}$$

However, directly minimizing this is intractable since we don't know $p_{\text{data}}$. Different generative models approach this problem differently.

### **Key Challenges in Generative Modeling**

1. **High-dimensional data**: Images have millions of dimensions
2. **Mode collapse**: Generator produces limited varieties of samples
3. **Evaluation**: No single metric captures sample quality and diversity
4. **Training stability**: Especially problematic for GANs
5. **Computational cost**: Training requires significant resources

### **Why PyTorch for Generative Models?**

PyTorch is the preferred framework for generative modeling because:

- **Dynamic computation graphs**: Essential for complex training procedures
- **Strong GPU support**: Critical for training large generative models
- **Rich ecosystem**: Libraries like `torchgan`, `pytorch-lightning`, `diffusers`
- **Research-friendly**: Most cutting-edge generative models are released in PyTorch

According to a 2023 survey by Generative AI Research, **89% of new generative model papers** use PyTorch as their implementation framework.

---

## **Autoencoders: Theory and Implementation**

Autoencoders are neural networks trained to reconstruct their inputs, learning efficient data representations in the process.

### **Autoencoder Architecture**

An autoencoder consists of two parts:

1. **Encoder**: Maps input $\mathbf{x}$ to a latent representation $\mathbf{z}$
2. **Decoder**: Reconstructs the input from the latent representation

![Autoencoder Architecture](https://miro.medium.com/v2/resize:fit:1400/1*44eDEuZBEsmG_TCAKRI3Kw@2x.png)

Mathematically:

- Encoder: $\mathbf{z}=f_{\theta}(\mathbf{x})$
- Decoder: $\mathbf{\hat{x}}=g_{\phi}(\mathbf{z})$

The model is trained to minimize the reconstruction loss:

$$\mathcal{L}(\theta,\phi)=\|\mathbf{x}-g_{\phi}(f_{\theta}(\mathbf{x}))\|^{2}$$

### **Implementing a Basic Autoencoder in PyTorch**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super(Autoencoder, self).__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim)
        )
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def forward(self, x):
        # Remember the original shape so the reconstruction can be reshaped back
        orig_shape = x.shape
        # Flatten input
        x = x.view(x.size(0), -1)
        # Encode
        z = self.encoder(x)
        # Decode
        x_recon = self.decoder(z)
        # Reshape to original dimensions
        x_recon = x_recon.view(orig_shape)
        return x_recon, z

# Example: MNIST autoencoder
input_dim = 28 * 28  # MNIST image size
hidden_dim = 512
latent_dim = 32

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Autoencoder(input_dim, hidden_dim, latent_dim).to(device)
```

### **Training an Autoencoder**

```python
# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training loop
def train_autoencoder(model, dataloader, optimizer, criterion, device, epochs=10):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for data, _ in dataloader:
            data = data.to(device)

            # Forward pass
            recon, _ = model(data)
            loss = criterion(recon, data)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(dataloader):.4f}")
    return model
```

### **Latent Space Interpolation**

One powerful application of autoencoders is **latent space interpolation**:

```python
def interpolate_latent(model, x1, x2, num_steps=10):
    """Interpolate between two inputs in latent space"""
    model.eval()
    with torch.no_grad():
        # Encode inputs (flattened, on the same device as the model)
        device = next(model.parameters()).device
        z1 = model.encoder(x1.view(1, -1).to(device))
        z2 = model.encoder(x2.view(1, -1).to(device))

        # Interpolate in latent space
        steps = torch.linspace(0, 1, num_steps)
        interpolations = []
        for step in steps:
            z = (1 - step) * z1 + step * z2
            recon = model.decoder(z)
            interpolations.append(recon.view(28, 28))
    return interpolations

# Visualize interpolation
import matplotlib.pyplot as plt

# Get two MNIST images
batch = next(iter(test_loader))[0]
x1, x2 = batch[0], batch[1]

# Interpolate
interpolations = interpolate_latent(model, x1, x2)

# Plot
plt.figure(figsize=(15, 3))
for i, img in enumerate(interpolations):
    plt.subplot(1, len(interpolations), i+1)
    plt.imshow(img.cpu().numpy(), cmap='gray')
    plt.axis('off')
plt.tight_layout()
plt.show()
```

### **Denoising Autoencoders**

Denoising autoencoders learn to reconstruct clean data from corrupted inputs:

```python
class DenoisingAutoencoder(Autoencoder):
    def __init__(self, input_dim, hidden_dim, latent_dim, noise_factor=0.2):
        super().__init__(input_dim, hidden_dim, latent_dim)
        self.noise_factor = noise_factor

    def add_noise(self, x):
        """Add Gaussian noise to input"""
        noise = torch.randn_like(x) * self.noise_factor
        noisy = x + noise
        return torch.clamp(noisy, 0., 1.)
    def forward(self, x):
        # Add noise during training
        if self.training:
            x = self.add_noise(x)
        # Remember the original shape
        orig_shape = x.shape
        # Flatten input
        x = x.view(x.size(0), -1)
        # Encode
        z = self.encoder(x)
        # Decode
        x_recon = self.decoder(z)
        # Reshape to original dimensions
        x_recon = x_recon.view(orig_shape)
        return x_recon, z

# Train denoising autoencoder (with its own optimizer)
denoising_ae = DenoisingAutoencoder(input_dim, hidden_dim, latent_dim).to(device)
optimizer = torch.optim.Adam(denoising_ae.parameters(), lr=1e-3)
train_autoencoder(denoising_ae, train_loader, optimizer, criterion, device)
```

### **Convolutional Autoencoders**

For image data, convolutional autoencoders typically work better:

```python
class ConvAutoencoder(nn.Module):
    def __init__(self):
        super(ConvAutoencoder, self).__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 7)
        )
        # Decoder
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 7),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x, None  # Return None for latent for consistency

# Example usage (again with its own optimizer)
conv_ae = ConvAutoencoder().to(device)
optimizer = torch.optim.Adam(conv_ae.parameters(), lr=1e-3)
train_autoencoder(conv_ae, train_loader, optimizer, criterion, device)
```

### **Applications of Autoencoders**

1. **Dimensionality reduction**: Visualize high-dimensional data
2. **Anomaly detection**: High reconstruction loss indicates anomalies
3. **Denoising**: Remove noise from images or audio
4. **Feature learning**: Latent representations for downstream tasks
5. **Data compression**: Efficient storage of data

### **Limitations of Standard Autoencoders**

Standard autoencoders have several limitations:

- **No structure in latent space**: Similar inputs may have distant latent representations
- **No generative capability**: Cannot sample from latent space to generate new data
- **Blurriness**: MSE loss encourages averaging of multiple possibilities

These limitations led to the development of **Variational Autoencoders (VAEs)**, which we'll explore next.

---

## **Variational Autoencoders (VAEs)**

Variational Autoencoders (VAEs) extend autoencoders with a probabilistic approach, creating a structured latent space that enables generative capabilities.

### **The Probabilistic Framework**

Unlike standard autoencoders, VAEs model the data distribution $p_{\theta}(\mathbf{x})$ as:

$$p_{\theta}(\mathbf{x})=\int p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z}$$

Where:

- $p(\mathbf{z})$ is the prior distribution (typically $\mathcal{N}(\mathbf{0},\mathbf{I})$)
- $p_{\theta}(\mathbf{x}|\mathbf{z})$ is the likelihood

The challenge is that this integral is intractable. VAEs use **variational inference** to approximate the posterior $p_{\theta}(\mathbf{z}|\mathbf{x})$ with a simpler distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$.

### **The Reparameterization Trick**

The key innovation of VAEs is the **reparameterization trick**, which makes sampling in the latent space differentiable:

$$\mathbf{z}=\mu+\sigma\odot\epsilon\quad\text{where}\quad\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$$

This separates the stochasticity from the parameters, allowing backpropagation through the sampling process.
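Before wiring the trick into a full VAE in the next section, here is a minimal standalone sketch showing why it matters: the noise is drawn outside the parameters, so gradients flow through the deterministic transform back to $\mu$ and $\log\sigma^2$.

```python
import torch

mu = torch.zeros(3, requires_grad=True)
logvar = torch.zeros(3, requires_grad=True)

# Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, I)
eps = torch.randn(3)                      # the stochasticity lives here, not in the parameters
z = mu + torch.exp(0.5 * logvar) * eps

# Gradients reach the distribution parameters through the sampling step
z.sum().backward()
print(mu.grad)       # tensor([1., 1., 1.])
print(logvar.grad)   # 0.5 * eps * exp(0.5 * logvar), element-wise
```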
### **VAE Objective Function**

The VAE training objective is the **Evidence Lower Bound (ELBO)**, which we maximize (in practice, we minimize its negative):

$$\mathcal{L}(\theta,\phi;\mathbf{x})=\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]-D_{\text{KL}}(q_{\phi}(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$$

This consists of:

1. **Reconstruction loss**: How well we can reconstruct the input
2. **KL divergence**: How close the latent distribution is to the prior

### **Implementing a VAE in PyTorch**

```python
class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super(VAE, self).__init__()
        self.latent_dim = latent_dim

        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

        # Latent space
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        # Remember the original shape
        orig_shape = x.shape
        # Flatten input
        x = x.view(x.size(0), -1)
        # Encode
        mu, logvar = self.encode(x)
        # Reparameterize
        z = self.reparameterize(mu, logvar)
        # Decode
        x_recon = self.decode(z)
        # Reshape to original dimensions
        x_recon = x_recon.view(orig_shape)
        return x_recon, mu, logvar, z

# Loss function for VAE (the negative ELBO)
def vae_loss(recon_x, x, mu, logvar, recon_loss_weight=1.0):
    # Reconstruction loss
    BCE = F.binary_cross_entropy(recon_x.view(-1, 784), x.view(-1, 784), reduction='sum')
    # KL divergence
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss_weight * BCE + KLD
```

### **Training a VAE**

```python
# Initialize VAE
input_dim = 28 * 28
hidden_dim = 400
latent_dim = 20

model = VAE(input_dim, hidden_dim, latent_dim).to(device)

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training loop
def train_vae(model, dataloader, optimizer, device, epochs=20):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for data, _ in dataloader:
            data = data.to(device)

            # Forward pass
            recon, mu, logvar, _ = model(data)
            loss = vae_loss(recon, data, mu, logvar)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(dataloader):.4f}")
    return model

# Train VAE
trained_vae = train_vae(model, train_loader, optimizer, device)
```

### **Generating New Samples**

One of the key advantages of VAEs is their ability to generate new samples:

```python
def generate_samples(model, num_samples=10, device='cpu'):
    model.eval()
    with torch.no_grad():
        # Sample from prior distribution
        z = torch.randn(num_samples, model.latent_dim).to(device)
        # Decode
        samples = model.decode(z)
        # Reshape to image dimensions
        samples = samples.view(num_samples, 1, 28, 28)
    return samples

# Generate and visualize samples
generated = generate_samples(trained_vae, num_samples=10, device=device)

plt.figure(figsize=(15, 3))
for i in range(10):
    plt.subplot(1, 10, i+1)
    plt.imshow(generated[i, 0].cpu().numpy(), cmap='gray')
    plt.axis('off')
plt.tight_layout()
plt.show()
```

### **Latent Space Manipulation**

VAEs create a continuous, structured latent space that enables meaningful manipulations:

```python
def interpolate_latent_vae(model, num_steps=10, device='cpu'):
    model.eval()
    with torch.no_grad():
        # Sample two random points in latent space
        z1 = torch.randn(1, model.latent_dim).to(device)
        z2 = torch.randn(1, model.latent_dim).to(device)

        # Interpolate between them
        steps = torch.linspace(0, 1, num_steps)
        interpolations = []
        for step in steps:
            z = (1 - step) * z1 + step * z2
            recon = model.decode(z)
            interpolations.append(recon.view(28, 28))
    return interpolations

# Visualize latent space interpolation
interpolations = interpolate_latent_vae(trained_vae, num_steps=10, device=device)

plt.figure(figsize=(15, 3))
for i, img in enumerate(interpolations):
    plt.subplot(1, len(interpolations), i+1)
    plt.imshow(img.cpu().numpy(), cmap='gray')
    plt.axis('off')
plt.tight_layout()
plt.show()
```

### **Conditional VAEs**

Conditional VAEs (CVAEs) generate samples conditioned on additional information (e.g., class labels):

```python
class CVAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim, num_classes):
        super(CVAE, self).__init__()
        self.latent_dim = latent_dim

        # Label embedding
        self.label_emb = nn.Embedding(num_classes, num_classes)

        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim + num_classes, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

        # Latent space
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + num_classes, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def encode(self, x, labels):
        # Embed labels
        label_emb = self.label_emb(labels)
        # Concatenate input and labels
        x = torch.cat([x, label_emb], dim=1)
        # Encode
        h = self.encoder(x)
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z, labels):
        # Embed labels
        label_emb = self.label_emb(labels)
        # Concatenate latent and labels
        z = torch.cat([z, label_emb], dim=1)
        # Decode
        return self.decoder(z)

    def forward(self, x, labels):
        # Remember the original shape
        orig_shape = x.shape
        # Flatten input
        x = x.view(x.size(0), -1)
        # Encode
        mu, logvar = self.encode(x, labels)
        # Reparameterize
        z = self.reparameterize(mu, logvar)
        # Decode
        x_recon = self.decode(z, labels)
        # Reshape to original dimensions
        x_recon = x_recon.view(orig_shape)
        return x_recon, mu, logvar, z

# Train conditional VAE on MNIST
num_classes = 10
cvae = CVAE(input_dim, hidden_dim, latent_dim, num_classes).to(device)
optimizer = torch.optim.Adam(cvae.parameters(), lr=1e-3)

def train_cvae(model, dataloader, optimizer, device, epochs=20):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for data, labels in dataloader:
            data, labels = data.to(device), labels.to(device)

            # Forward pass
            recon, mu, logvar, _ = model(data, labels)
            loss = vae_loss(recon, data, mu, logvar)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(dataloader):.4f}")
    return model

# Train the CVAE before sampling from it
cvae = train_cvae(cvae, train_loader, optimizer, device)

# Generate class-conditional samples
def generate_conditional_samples(model, class_label, num_samples=10, device='cpu'):
    model.eval()
    with torch.no_grad():
        # Create labels tensor
        labels = torch.full((num_samples,), class_label, dtype=torch.long).to(device)
        # Sample from prior
        z = torch.randn(num_samples, model.latent_dim).to(device)
        # Decode with labels
        samples = model.decode(z, labels)
        # Reshape
        samples = samples.view(num_samples, 1, 28, 28)
    return samples

# Generate samples for digit '5'
samples = generate_conditional_samples(cvae, class_label=5, num_samples=10, device=device)
```

### **Beta-VAEs and Disentanglement**

Beta-VAEs modify the ELBO objective to encourage disentangled representations:

$$\mathcal{L}_{\beta}=\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]-\beta D_{\text{KL}}(q_{\phi}(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$$

Higher $\beta$ values increase the KL term's weight, forcing the latent space to match the prior more closely, often resulting in more disentangled representations.

```python
def beta_vae_loss(recon_x, x, mu, logvar, beta=4.0):
    """Beta-VAE loss with adjustable beta parameter"""
    BCE = F.binary_cross_entropy(recon_x.view(-1, 784), x.view(-1, 784), reduction='sum')
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return BCE + beta * KLD
```

### **Limitations of VAEs**

Despite their strengths, VAEs have limitations:

- **Blurriness**: MSE/BCE loss encourages averaging of multiple possibilities
- **Posterior collapse**: KL term can dominate, causing $q_{\phi}(\mathbf{z}|\mathbf{x})\approx p(\mathbf{z})$
- **Simplistic priors**: Standard normal prior may not match true latent distribution
- **Training instability**: Balancing reconstruction and KL terms can be challenging

These limitations motivated the development of **Generative Adversarial Networks (GANs)**, which we'll explore next.

---

## **Generative Adversarial Networks (GANs)**

Generative Adversarial Networks (GANs) introduced a novel approach to generative modeling through adversarial training.

### **The GAN Framework**

GANs consist of two networks trained in opposition:

1. **Generator ($G$)**: Creates fake samples from random noise
2. **Discriminator ($D$)**: Distinguishes real from fake samples

![GAN Architecture](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQhpV1m2IAZRcvdwkhVak3MpWEeWVrpArLxtA&s)

The training process is a **minimax game** with the following objective:

$$\min_{G}\max_{D}V(D,G)=\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[\log D(\mathbf{x})]+\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}[\log(1-D(G(\mathbf{z})))]$$

At equilibrium, the generator produces samples indistinguishable from real data.
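One practical detail before the implementation: the generator's $\log(1-D(G(\mathbf{z})))$ term saturates when the discriminator confidently rejects early fakes, so most implementations (including the training loop below) instead train the generator to maximize $\log D(G(\mathbf{z}))$, the non-saturating loss. A small illustrative sketch of the two variants, using stand-in discriminator scores:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
d_fake = torch.rand(8, 1)  # stand-in for D(G(z)) scores in (0, 1)

# Minimax generator objective from the value function above: minimize log(1 - D(G(z)))
g_loss_minimax = torch.log(1.0 - d_fake).mean()

# Non-saturating variant used in practice: minimize -log D(G(z)),
# i.e. BCE against "real" labels; it gives much stronger gradients
# when the discriminator easily spots the fakes (D(G(z)) near 0).
g_loss_non_saturating = bce(d_fake, torch.ones_like(d_fake))
```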
### **Implementing a Basic GAN in PyTorch** ```python # Generator class Generator(nn.Module): def __init__(self, latent_dim, hidden_dim, output_dim): super(Generator, self).__init__() self.model = nn.Sequential( nn.Linear(latent_dim, hidden_dim), nn.LeakyReLU(0.2), nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(0.2), nn.Linear(hidden_dim, output_dim), nn.Tanh() ) def forward(self, z): return self.model(z) # Discriminator class Discriminator(nn.Module): def __init__(self, input_dim, hidden_dim): super(Discriminator, self).__init__() self.model = nn.Sequential( nn.Linear(input_dim, hidden_dim), nn.LeakyReLU(0.2), nn.Dropout(0.3), nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(0.2), nn.Dropout(0.3), nn.Linear(hidden_dim, 1), nn.Sigmoid() ) def forward(self, x): x = x.view(x.size(0), -1) return self.model(x) # Hyperparameters latent_dim = 100 hidden_dim = 128 output_dim = 28 * 28 # MNIST # Initialize models generator = Generator(latent_dim, hidden_dim, output_dim).to(device) discriminator = Discriminator(output_dim, hidden_dim).to(device) # Optimizers g_optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999)) d_optimizer = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999)) # Loss function criterion = nn.BCELoss() ``` ### **GAN Training Algorithm** The standard GAN training procedure: ```python def train_gan(generator, discriminator, dataloader, g_optimizer, d_optimizer, criterion, device, epochs=50): for epoch in range(epochs): d_losses = [] g_losses = [] for real_images, _ in dataloader: batch_size = real_images.size(0) real_images = real_images.to(device) # Train Discriminator d_optimizer.zero_grad() # Real images real_labels = torch.ones(batch_size, 1).to(device) d_real_loss = criterion(discriminator(real_images), real_labels) # Fake images z = torch.randn(batch_size, latent_dim).to(device) fake_images = generator(z) fake_labels = torch.zeros(batch_size, 1).to(device) d_fake_loss = criterion(discriminator(fake_images.detach()), fake_labels) # Total discriminator loss d_loss = (d_real_loss + d_fake_loss) / 2 d_loss.backward() d_optimizer.step() # Train Generator g_optimizer.zero_grad() # Fool discriminator validity = discriminator(fake_images) g_loss = criterion(validity, real_labels) # Try to get discriminator to say "real" g_loss.backward() g_optimizer.step() d_losses.append(d_loss.item()) g_losses.append(g_loss.item()) # Print progress print(f"Epoch {epoch+1}/{epochs} | D Loss: {sum(d_losses)/len(d_losses):.4f} | G Loss: {sum(g_losses)/len(g_losses):.4f}") # Generate and save sample images if (epoch+1) % 5 == 0: generate_and_save_samples(generator, epoch+1, device) return generator, discriminator ``` ### **Generating Samples from a Trained GAN** ```python def generate_and_save_samples(generator, epoch, device, num_samples=25): """Generate and save sample images""" generator.eval() with torch.no_grad(): z = torch.randn(num_samples, latent_dim).to(device) samples = generator(z).view(num_samples, 1, 28, 28) # Create grid of images grid = make_grid(samples, nrow=5, normalize=True) # Save image plt.figure(figsize=(8, 8)) plt.imshow(grid.cpu().numpy().transpose(1, 2, 0), cmap='gray') plt.axis('off') plt.title(f'GAN Samples - Epoch {epoch}') plt.savefig(f'gan_samples_epoch_{epoch}.png') plt.close() ``` ### **Common GAN Architectures** #### **1. 
Deep Convolutional GAN (DCGAN)** Uses convolutional layers for both generator and discriminator: ```python class DCGAN_Generator(nn.Module): def __init__(self, latent_dim, img_channels=1, feature_map_size=64): super(DCGAN_Generator, self).__init__() self.latent_dim = latent_dim self.gen = nn.Sequential( # Input: latent_dim x 1 x 1 nn.ConvTranspose2d(latent_dim, feature_map_size * 8, 4, 1, 0, bias=False), nn.BatchNorm2d(feature_map_size * 8), nn.ReLU(True), # State: (feature_map_size*8) x 4 x 4 nn.ConvTranspose2d(feature_map_size * 8, feature_map_size * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(feature_map_size * 4), nn.ReLU(True), # State: (feature_map_size*4) x 8 x 8 nn.ConvTranspose2d(feature_map_size * 4, feature_map_size * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(feature_map_size * 2), nn.ReLU(True), # State: (feature_map_size*2) x 16 x 16 nn.ConvTranspose2d(feature_map_size * 2, img_channels, 4, 2, 1, bias=False), nn.Tanh() # Output: img_channels x 32 x 32 ) def forward(self, z): z = z.view(z.size(0), z.size(1), 1, 1) return self.gen(z) class DCGAN_Discriminator(nn.Module): def __init__(self, img_channels=1, feature_map_size=64): super(DCGAN_Discriminator, self).__init__() self.disc = nn.Sequential( # Input: img_channels x 32 x 32 nn.Conv2d(img_channels, feature_map_size, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, inplace=True), # State: feature_map_size x 16 x 16 nn.Conv2d(feature_map_size, feature_map_size * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(feature_map_size * 2), nn.LeakyReLU(0.2, inplace=True), # State: (feature_map_size*2) x 8 x 8 nn.Conv2d(feature_map_size * 2, feature_map_size * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(feature_map_size * 4), nn.LeakyReLU(0.2, inplace=True), # State: (feature_map_size*4) x 4 x 4 nn.Conv2d(feature_map_size * 4, 1, 4, 1, 0, bias=False), nn.Sigmoid() # Output: 1 x 1 x 1 ) def forward(self, x): return self.disc(x).view(-1, 1) ``` #### **2. Conditional GAN (cGAN)** Generates samples conditioned on additional information: ```python class cGAN_Generator(nn.Module): def __init__(self, latent_dim, label_dim, img_channels=1, feature_map_size=64): super(cGAN_Generator, self).__init__() # Label embedding self.label_emb = nn.Embedding(label_dim, label_dim) self.gen = nn.Sequential( nn.Linear(latent_dim + label_dim, 128 * 4 * 4), nn.ReLU(True), nn.Unflatten(1, (128, 4, 4)), nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False), nn.BatchNorm2d(64), nn.ReLU(True), nn.ConvTranspose2d(64, img_channels, 4, 2, 1, bias=False), nn.Tanh() ) def forward(self, z, labels): # Embed labels label_emb = self.label_emb(labels) # Concatenate noise and labels gen_input = torch.cat((z, label_emb), -1) # Generate image return self.gen(gen_input) # During training def train_cgan(generator, discriminator, dataloader, g_optimizer, d_optimizer, criterion, device, epochs=50): for epoch in range(epochs): for real_images, labels in dataloader: # Training code similar to standard GAN but using labels # ... # Generate fake images with specific labels z = torch.randn(batch_size, latent_dim).to(device) fake_images = generator(z, labels) # Train discriminator and generator with labels # ... ``` #### **3. 
Wasserstein GAN (WGAN)** Uses Wasserstein distance for more stable training: ```python # Critic (replaces discriminator) class WGAN_Critic(nn.Module): def __init__(self, img_channels=1, feature_map_size=64): super(WGAN_Critic, self).__init__() self.critic = nn.Sequential( nn.Conv2d(img_channels, feature_map_size, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, inplace=True), nn.Conv2d(feature_map_size, feature_map_size * 2, 4, 2, 1, bias=False), nn.InstanceNorm2d(feature_map_size * 2), nn.LeakyReLU(0.2, inplace=True), nn.Conv2d(feature_map_size * 2, feature_map_size * 4, 4, 2, 1, bias=False), nn.InstanceNorm2d(feature_map_size * 4), nn.LeakyReLU(0.2, inplace=True), nn.Conv2d(feature_map_size * 4, 1, 4, 1, 0, bias=False) ) def forward(self, x): return self.critic(x).view(-1) # WGAN training loop def train_wgan(generator, critic, dataloader, g_optimizer, c_optimizer, device, epochs=50, clip_value=0.01): for epoch in range(epochs): for _ in range(5): # Train critic more often # Train critic c_optimizer.zero_grad() # Real images real_images = next(iter(dataloader))[0].to(device) real_loss = critic(real_images).mean() # Fake images z = torch.randn(real_images.size(0), latent_dim).to(device) fake_images = generator(z) fake_loss = critic(fake_images.detach()).mean() # Wasserstein loss c_loss = fake_loss - real_loss c_loss.backward() c_optimizer.step() # Clip weights (for WGAN-GP, use gradient penalty instead) for p in critic.parameters(): p.data.clamp_(-clip_value, clip_value) # Train generator g_optimizer.zero_grad() z = torch.randn(real_images.size(0), latent_dim).to(device) fake_images = generator(z) g_loss = -critic(fake_images).mean() g_loss.backward() g_optimizer.step() ``` ### **GAN Loss Functions** #### **1. Standard GAN Loss** As described in the original paper: $$\mathcal{L}_{\text{disc}}=-\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[\log D(\mathbf{x})]-\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}[\log(1-D(G(\mathbf{z})))]$$ $$\mathcal{L}_{\text{gen}}=-\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}[\log D(G(\mathbf{z}))]$$ #### **2. Least Squares GAN (LSGAN)** Reduces vanishing gradients: $$\mathcal{L}_{\text{disc}}=\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[(D(\mathbf{x})-1)^{2}]+\frac{1}{2}\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}[D(G(\mathbf{z}))^{2}]$$ $$\mathcal{L}_{\text{gen}}=\frac{1}{2}\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}[(D(G(\mathbf{z}))-1)^{2}]$$ #### **3. Wasserstein GAN Loss** Uses Earth Mover's distance: $$\mathcal{L}_{\text{critic}}=\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}[D(G(\mathbf{z}))]-\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[D(\mathbf{x})]$$ $$\mathcal{L}_{\text{gen}}=-\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}[D(G(\mathbf{z}))]$$ With gradient penalty for WGAN-GP: $$\mathcal{L}_{\text{gp}} = \lambda \mathbb{E}_{\hat{\mathbf{x}} \sim p_{\hat{\mathbf{x}}}} \left[ \left( \|\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})\|_{2} - 1 \right)^{2} \right]$$ Where $\hat{\mathbf{x}}=\epsilon\mathbf{x}+(1-\epsilon)G(\mathbf{z})$ for $\epsilon\sim\text{Uniform}(0,1)$ ### **Advanced GAN Techniques** #### **1. 
Progressive Growing of GANs (ProGAN)** Trains GANs progressively from low to high resolution: ```python class ProGAN_Generator(nn.Module): def __init__(self, latent_dim, img_channels=3, max_resolution=1024): super(ProGAN_Generator, self).__init__() self.latent_dim = latent_dim self.max_resolution = max_resolution self.current_resolution = 4 # Initial block self.initial_block = nn.Sequential( nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False), nn.BatchNorm2d(512), nn.ReLU(True) ) # Layers for different resolutions self.layers = nn.ModuleDict() self.to_rgb = nn.ModuleDict() # Start with 4x4 resolution self.layers['4'] = self._make_layer(512, 512) self.to_rgb['4'] = nn.Conv2d(512, img_channels, 1) def _make_layer(self, in_channels, out_channels): return nn.Sequential( nn.Conv2d(in_channels, out_channels, 3, 1, 1, bias=False), nn.BatchNorm2d(out_channels), nn.ReLU(True), nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False), nn.BatchNorm2d(out_channels), nn.ReLU(True) ) def add_resolution(self): """Add a new resolution level""" new_res = self.current_resolution * 2 if new_res > self.max_resolution: return # Add new layer in_channels = 512 if self.current_resolution < 8 else 512 // (self.current_resolution // 8) out_channels = 512 if new_res <= 8 else 512 // (new_res // 8) self.layers[str(new_res)] = self._make_layer(in_channels, out_channels) self.to_rgb[str(new_res)] = nn.Conv2d(out_channels, 3, 1) self.current_resolution = new_res def fade_in(self, alpha, x_high, x_low): """Fade between resolutions during training""" return alpha * x_high + (1 - alpha) * x_low def forward(self, z, alpha=1.0): x = self.initial_block(z) # Process through layers up to current resolution for res in sorted([int(r) for r in self.layers.keys() if int(r) <= self.current_resolution]): if res == 4: x = self.layers[str(res)](x) else: x = F.interpolate(x, scale_factor=2, mode='nearest') x = self.layers[str(res)](x) # Convert to RGB x = self.to_rgb[str(self.current_resolution)](x) # If fading in new resolution if alpha < 1.0 and self.current_resolution > 4: prev_res = self.current_resolution // 2 x_low = F.interpolate(x, scale_factor=0.5, mode='bilinear', align_corners=True) x_low = self.to_rgb[str(prev_res)](x_low) x = self.fade_in(alpha, x, x_low) return torch.tanh(x) ``` #### **2. 
StyleGAN and StyleGAN2** Revolutionized high-quality image generation with style-based architecture: ```python class StyleMappingNetwork(nn.Module): def __init__(self, latent_dim, style_dim, n_layers=8): super(StyleMappingNetwork, self).__init__() layers = [] for i in range(n_layers): layers.append(nn.Linear(latent_dim if i == 0 else style_dim, style_dim)) layers.append(nn.LeakyReLU(0.2)) self.mapping = nn.Sequential(*layers) def forward(self, z): return self.mapping(z) class AdaIN(nn.Module): def __init__(self, style_dim, num_features): super(AdaIN, self).__init__() self.norm = nn.InstanceNorm2d(num_features, affine=False) self.style_to_scale = nn.Linear(style_dim, num_features) self.style_to_bias = nn.Linear(style_dim, num_features) def forward(self, x, style): x = self.norm(x) scale = self.style_to_scale(style).unsqueeze(2).unsqueeze(3) bias = self.style_to_bias(style).unsqueeze(2).unsqueeze(3) return scale * x + bias class StyleBlock(nn.Module): def __init__(self, in_channels, out_channels, style_dim): super(StyleBlock, self).__init__() self.conv1 = nn.Conv2d(in_channels, out_channels, 3, 1, 1) self.adain1 = AdaIN(style_dim, out_channels) self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1) self.adain2 = AdaIN(style_dim, out_channels) self.activation = nn.LeakyReLU(0.2) def forward(self, x, style): x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) x = self.conv1(x) x = self.activation(self.adain1(x, style)) x = self.conv2(x) x = self.activation(self.adain2(x, style)) return x class StyleGenerator(nn.Module): def __init__(self, latent_dim, style_dim, n_mapping=8, img_channels=3): super(StyleGenerator, self).__init__() self.mapping = StyleMappingNetwork(latent_dim, style_dim, n_mapping) self.initial_constant = nn.Parameter(torch.randn(1, 512, 4, 4)) self.style_blocks = nn.ModuleList([ StyleBlock(512, 512, style_dim), # 4x4 -> 8x8 StyleBlock(512, 512, style_dim), # 8x8 -> 16x16 StyleBlock(512, 512, style_dim), # 16x16 -> 32x32 StyleBlock(512, 512, style_dim), # 32x32 -> 64x64 StyleBlock(512, 256, style_dim), # 64x64 -> 128x128 StyleBlock(256, 128, style_dim), # 128x128 -> 256x256 StyleBlock(128, 64, style_dim), # 256x256 -> 512x512 StyleBlock(64, 32, style_dim) # 512x512 -> 1024x1024 ]) self.to_rgb = nn.Conv2d(32, img_channels, 1) def forward(self, z): # Map to style space styles = self.mapping(z) # Start from constant x = self.initial_constant.expand(z.size(0), -1, -1, -1) # Process through style blocks for block in self.style_blocks: x = block(x, styles) # Convert to RGB x = self.to_rgb(x) return torch.tanh(x) ``` ### **Common GAN Challenges and Solutions** #### **1. Mode Collapse** *Symptoms*: Generator produces limited varieties of samples *Solutions*: - Use Wasserstein loss with gradient penalty - Mini-batch discrimination - Unrolled GANs - Feature matching ```python # Feature matching loss def feature_matching_loss(real_features, fake_features): return torch.mean(torch.abs(torch.mean(real_features, dim=0) - torch.mean(fake_features, dim=0))) ``` #### **2. Training Instability** *Symptoms*: Oscillating losses, sudden performance drops *Solutions*: - Two time-scale update rule (TTUR) - Label smoothing - Instance normalization instead of batch normalization - Careful learning rate selection #### **3. Evaluation Difficulties** GANs are hard to evaluate with traditional metrics. 
Common approaches:

- **Inception Score (IS)**: Measures diversity and quality
- **Fréchet Inception Distance (FID)**: Better correlation with human judgment
- **Precision and Recall**: Measures coverage and quality separately

```python
import numpy as np
from scipy import linalg

def calculate_fid(real_features, fake_features):
    """Calculate Fréchet Inception Distance.

    `real_features` and `fake_features` are NumPy arrays of Inception
    activations for real and generated images, shape (N, feature_dim).
    """
    mu1, sigma1 = real_features.mean(axis=0), np.cov(real_features, rowvar=False)
    mu2, sigma2 = fake_features.mean(axis=0), np.cov(fake_features, rowvar=False)

    # Sum of squared differences between the means
    ssd = np.sum((mu1 - mu2) ** 2)

    # Matrix square root of the covariance product
    cov_mean = linalg.sqrtm(sigma1.dot(sigma2))

    # Numerical error might make the result complex
    if np.iscomplexobj(cov_mean):
        cov_mean = cov_mean.real

    # Calculate FID
    fid = ssd + np.trace(sigma1 + sigma2 - 2 * cov_mean)
    return fid
```

---

## **Diffusion Models**

Diffusion models have recently surpassed GANs in sample quality for many applications. They work by learning to gradually denoise data, starting from pure random noise.

### **The Diffusion Process**

Diffusion models work in two phases:

1. **Forward process**: Gradually add noise to data
2. **Reverse process**: Learn to reverse the noise addition

#### **Forward Diffusion Process**

Starting from data $\mathbf{x}_0$, we define a Markov chain that gradually adds Gaussian noise:

$$q(\mathbf{x}_t|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_t;\sqrt{1-\beta_t}\mathbf{x}_{t-1},\beta_t\mathbf{I})$$

After $T$ steps, $\mathbf{x}_T$ is approximately standard Gaussian noise. The forward process also has a closed form that lets us sample $\mathbf{x}_t$ directly from $\mathbf{x}_0$:

$$q(\mathbf{x}_t|\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I})$$

Where:

- $\alpha_t=1-\beta_t$
- $\bar{\alpha}_t=\prod_{s=1}^t\alpha_s$

#### **Reverse Diffusion Process**

The reverse process learns to denoise:

$$p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)=\mathcal{N}(\mathbf{x}_{t-1};\boldsymbol{\mu}_{\theta}(\mathbf{x}_t,t),\boldsymbol{\Sigma}_{\theta}(\mathbf{x}_t,t))$$

The key insight is that we can parameterize the reverse process mean as:

$$\boldsymbol{\mu}_{\theta}(\mathbf{x}_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t)\right)$$

Where $\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t)$ is a neural network that predicts the noise added at step $t$.
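To make the noise schedule concrete, here is a small sketch (using the same linear beta schedule and $T=1000$ as the implementation below) showing how the closed-form weights $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1-\bar{\alpha}_t}$ trade signal for noise as $t$ grows:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # linear beta schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative products alpha_bar_t

# Closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
for t in [0, 99, 499, 999]:
    signal = alpha_bars[t].sqrt().item()
    noise = (1 - alpha_bars[t]).sqrt().item()
    print(f"t={t+1:4d}  signal weight={signal:.3f}  noise weight={noise:.3f}")
# By t = T the signal weight is essentially 0, so x_T is (almost) pure Gaussian noise.
```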
### **Implementing a Diffusion Model in PyTorch**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class DiffusionModel(nn.Module):
    def __init__(self, model, T=1000, beta_start=1e-4, beta_end=0.02):
        super(DiffusionModel, self).__init__()
        self.model = model  # U-Net or similar noise predictor
        self.T = T

        # Define beta schedule (registered as buffers so they are saved with the model)
        self.register_buffer('betas', torch.linspace(beta_start, beta_end, T))
        self.register_buffer('alphas', 1.0 - self.betas)
        self.register_buffer('alpha_bars', torch.cumprod(self.alphas, dim=0))

    def q_sample(self, x_0, t, noise=None):
        """Forward diffusion process: sample x_t given x_0"""
        if noise is None:
            noise = torch.randn_like(x_0)
        alpha_bars = self.alpha_bars.to(x_0.device)
        sqrt_alpha_bar = torch.sqrt(alpha_bars[t])[:, None, None, None]
        sqrt_one_minus_alpha_bar = torch.sqrt(1 - alpha_bars[t])[:, None, None, None]
        return sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise

    def p_sample(self, x_t, t, conditional=None):
        """Reverse diffusion process: sample x_{t-1} given x_t"""
        z = torch.randn_like(x_t) if t[0] > 1 else 0

        # Predict noise
        epsilon = self.model(x_t, t, conditional)

        # Mean of the reverse step:
        # (1 / sqrt(alpha_t)) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * epsilon)
        alphas = self.alphas.to(x_t.device)
        alpha_bars = self.alpha_bars.to(x_t.device)
        betas = self.betas.to(x_t.device)
        sqrt_alpha = torch.sqrt(alphas[t])[:, None, None, None]
        sqrt_one_minus_alpha_bar = torch.sqrt(1 - alpha_bars[t])[:, None, None, None]
        beta = betas[t][:, None, None, None]
        mean = (x_t - (beta / sqrt_one_minus_alpha_bar) * epsilon) / sqrt_alpha

        # Add noise scaled by the variance (except at the final step)
        if t[0] == 0:
            return mean
        else:
            variance = beta
            return mean + torch.sqrt(variance) * z

    def forward(self, x_0, conditional=None):
        """Training forward pass: predict noise"""
        t = torch.randint(1, self.T, (x_0.size(0),), device=x_0.device)
        noise = torch.randn_like(x_0)
        x_t = self.q_sample(x_0, t, noise)
        return self.model(x_t, t, conditional), noise, t

    def sample(self, shape, conditional=None, device='cpu'):
        """Generate samples by reversing the diffusion process"""
        x_t = torch.randn(shape, device=device)
        for t in reversed(range(1, self.T)):
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
            x_t = self.p_sample(x_t, t_batch, conditional)
        return x_t

# Noise prediction model (simplified U-Net)
class NoisePredictor(nn.Module):
    def __init__(self, in_channels=3, conditional_dim=None, num_timesteps=1000):
        super(NoisePredictor, self).__init__()
        self.conditional_dim = conditional_dim

        # Time embedding: a learned embedding per timestep
        # (a simple stand-in for the sinusoidal embeddings used in full U-Nets)
        self.time_mlp = nn.Sequential(
            nn.Embedding(num_timesteps, 256),
            nn.SiLU(),
            nn.Linear(256, 256)
        )

        # Conditional embedding (if present)
        if conditional_dim is not None:
            self.cond_mlp = nn.Sequential(
                nn.Linear(conditional_dim, 256),
                nn.SiLU(),
                nn.Linear(256, 256)
            )

        # U-Net backbone
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.SiLU()
        )
        self.middle = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.SiLU()
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, in_channels, 3, padding=1)
        )

        # Time projection
        self.time_proj = nn.Sequential(
            nn.Linear(256, 64),
            nn.SiLU(),
            nn.Linear(64, 64 * 2)
        )

    def add_time_embedding(self, x, t_emb):
        """Add time embedding to feature maps"""
        t_emb = self.time_proj(t_emb)[:, :, None, None]
        scale, shift = t_emb.chunk(2, dim=1)
        return x * (1 + scale) + shift

    def forward(self, x, t, conditional=None):
        # Time embedding
        t_emb = self.time_mlp(t)

        # Conditional embedding
        if conditional is not None and self.conditional_dim is not None:
            cond_emb = self.cond_mlp(conditional)
            t_emb = t_emb +
cond_emb # Encode h = self.encoder(x) h = self.add_time_embedding(h, t_emb) # Middle h = self.middle(h) # Decode h = torch.cat([h, self.encoder(x)], dim=1) h = self.decoder(h) return h ``` ### **Training a Diffusion Model** ```python # Hyperparameters T = 1000 # Number of diffusion steps beta_start = 1e-4 beta_end = 0.02 batch_size = 128 learning_rate = 2e-4 # Initialize model noise_predictor = NoisePredictor(in_channels=1).to(device) diffusion = DiffusionModel(noise_predictor, T, beta_start, beta_end) # Optimizer optimizer = torch.optim.Adam(diffusion.parameters(), lr=learning_rate) # Training loop def train_diffusion(diffusion, dataloader, optimizer, device, epochs=100): diffusion.train() for epoch in range(epochs): total_loss = 0 for images, _ in dataloader: images = images.to(device) # Forward pass pred_noise, target_noise, t = diffusion(images) # MSE loss between predicted and actual noise loss = F.mse_loss(pred_noise, target_noise) # Backward pass optimizer.zero_grad() loss.backward() optimizer.step() total_loss += loss.item() # Print progress print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(dataloader):.4f}") # Generate samples periodically if (epoch+1) % 10 == 0: generate_diffusion_samples(diffusion, epoch+1, device) return diffusion # Generate samples def generate_diffusion_samples(diffusion, epoch, device, num_samples=16): diffusion.eval() with torch.no_grad(): samples = diffusion.sample( shape=(num_samples, 1, 28, 28), device=device ) # Create grid of images grid = make_grid(samples, nrow=4, normalize=True) # Save image plt.figure(figsize=(8, 8)) plt.imshow(grid.cpu().numpy().transpose(1, 2, 0), cmap='gray') plt.axis('off') plt.title(f'Diffusion Samples - Epoch {epoch}') plt.savefig(f'diffusion_samples_epoch_{epoch}.png') plt.close() ``` ### **Conditional Diffusion Models** Conditional diffusion models generate samples based on additional information: ```python class ConditionalDiffusionModel(DiffusionModel): def __init__(self, model, num_classes, T=1000, beta_start=1e-4, beta_end=0.02): super().__init__(model, T, beta_start, beta_end) self.num_classes = num_classes self.class_embedding = nn.Embedding(num_classes, model.conditional_dim) def forward(self, x_0, labels): t = torch.randint(1, self.T, (x_0.size(0),), device=x_0.device) noise = torch.randn_like(x_0) x_t = self.q_sample(x_0, t, noise) # Get class embedding conditional = self.class_embedding(labels) return self.model(x_t, t, conditional), noise, t def sample(self, shape, labels=None, device='cpu'): x_t = torch.randn(shape, device=device) for t in reversed(range(1, self.T)): t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long) # Get class embedding if provided conditional = self.class_embedding(labels) if labels is not None else None x_t = self.p_sample(x_t, t_batch, conditional) return x_t # Train conditional diffusion model def train_conditional_diffusion(diffusion, dataloader, optimizer, device, epochs=100): diffusion.train() for epoch in range(epochs): total_loss = 0 for images, labels in dataloader: images, labels = images.to(device), labels.to(device) # Forward pass pred_noise, target_noise, t = diffusion(images, labels) # MSE loss loss = F.mse_loss(pred_noise, target_noise) # Backward pass optimizer.zero_grad() loss.backward() optimizer.step() total_loss += loss.item() print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(dataloader):.4f}") # Generate class-conditional samples if (epoch+1) % 10 == 0: for class_label in range(10): samples = diffusion.sample( shape=(16, 1, 28, 28), 
labels=torch.full((16,), class_label, device=device), device=device ) # Save samples for this class grid = make_grid(samples, nrow=4, normalize=True) plt.imsave(f'diffusion_class_{class_label}_epoch_{epoch+1}.png', grid.cpu().numpy().transpose(1, 2, 0), cmap='gray') ``` ### **Advanced Diffusion Techniques** #### **1. Classifier Guidance** Improve sample quality by using a classifier to guide the diffusion process: $$\boldsymbol{\mu}_{\text{guided}}(\mathbf{x}_t,t)=\boldsymbol{\mu}_{\theta}(\mathbf{x}_t,t)+s\cdot\Sigma_{\theta}(\mathbf{x}_t,t)\nabla_{\mathbf{x}_t}\log p_{\phi}(y|\mathbf{x}_t)$$ Where $s$ is the guidance scale. ```python def classifier_guidance_sample(diffusion, classifier, x_t, t, y, guidance_scale=3.0): """Sample with classifier guidance""" # Original prediction x_prev = diffusion.p_sample(x_t, t) # Compute gradient of log p(y|x_t) x_t.requires_grad = True log_prob = F.log_softmax(classifier(x_t), dim=1) grad = torch.autograd.grad(log_prob[range(len(y)), y].sum(), x_t)[0] # Apply guidance guided_mean = x_prev + guidance_scale * diffusion.model.variance * grad return guided_mean.detach() ``` #### **2. DDIM (Denoising Diffusion Implicit Models)** Faster sampling by using a non-Markovian process: ```python def ddim_sample(diffusion, x_T, seq, seq_next, device): """DDIM sampling (faster than standard diffusion)""" x_t = x_T for i in range(len(seq)-1): t = torch.full((x_T.size(0),), seq[i], device=device, dtype=torch.long) next_t = torch.full((x_T.size(0),), seq_next[i], device=device, dtype=torch.long) # Predict noise noise_pred = diffusion.model(x_t, t) # DDIM parameters alpha_bar_t = diffusion.alpha_bars[t] alpha_bar_t_next = diffusion.alpha_bars[next_t] # DDIM sampling pred_x0 = (x_t - torch.sqrt(1 - alpha_bar_t) * noise_pred) / torch.sqrt(alpha_bar_t) dir_xt = torch.sqrt(1 - alpha_bar_t_next) * noise_pred x_t = torch.sqrt(alpha_bar_t_next) * pred_x0 + dir_xt return x_t ``` #### **3. 
Latent Diffusion Models** Perform diffusion in a latent space rather than pixel space: ```python class LatentDiffusion(nn.Module): def __init__(self, autoencoder, diffusion_model): super(LatentDiffusion, self).__init__() self.autoencoder = autoencoder self.diffusion = diffusion_model def encode(self, x): """Encode to latent space""" with torch.no_grad(): self.autoencoder.eval() _, z = self.autoencoder(x) return z def decode(self, z): """Decode from latent space""" with torch.no_grad(): self.autoencoder.eval() x_recon, _ = self.autoencoder.decode(z) return x_recon def forward(self, x): """Training forward pass in latent space""" z = self.encode(x) return self.diffusion(z) def sample(self, num_samples, device='cpu'): """Generate samples in latent space then decode""" z = self.diffusion.sample( shape=(num_samples, self.autoencoder.latent_dim), device=device ) return self.decode(z) ``` ### **Stable Diffusion** Stable Diffusion is a latent diffusion model conditioned on text prompts using CLIP text embeddings: ```python from transformers import CLIPTextModel, CLIPTokenizer class StableDiffusion(nn.Module): def __init__(self, autoencoder, diffusion_model, clip_model_name="openai/clip-vit-base-patch32"): super(StableDiffusion, self).__init__() self.autoencoder = autoencoder self.diffusion = diffusion_model # Text encoder self.tokenizer = CLIPTokenizer.from_pretrained(clip_model_name) self.text_encoder = CLIPTextModel.from_pretrained(clip_model_name) # Text projection for diffusion model self.text_projection = nn.Linear(self.text_encoder.config.projection_dim, diffusion_model.model.conditional_dim) def get_text_embeddings(self, prompts): """Get text embeddings from CLIP""" inputs = self.tokenizer( prompts, padding="max_length", max_length=self.tokenizer.model_max_length, truncation=True, return_tensors="pt" ).to(next(self.parameters()).device) with torch.no_grad(): text_embeddings = self.text_encoder(**inputs)[0] return text_embeddings def forward(self, images, prompts): """Training forward pass""" # Encode images to latent space z = self.autoencoder.encode(images)[0] # Get text embeddings text_embeddings = self.get_text_embeddings(prompts) conditional = self.text_projection(text_embeddings) # Diffusion forward pass pred_noise, target_noise, t = self.diffusion.model(z, t, conditional) return pred_noise, target_noise, t def generate(self, prompt, height=512, width=512, num_inference_steps=50, guidance_scale=7.5, device='cpu'): """Generate image from text prompt""" # Get text embeddings text_embeddings = self.get_text_embeddings([prompt]) conditional = self.text_projection(text_embeddings) # Classifier-free guidance uncond_embeddings = self.get_text_embeddings([""]) uncond_conditional = self.text_projection(uncond_embeddings) # Latent shape latent_height = height // 8 latent_width = width // 8 latents = torch.randn((1, 4, latent_height, latent_width), device=device) # Time steps for diffusion scheduler = DDIMScheduler( beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000 ) scheduler.set_timesteps(num_inference_steps) # Diffusion loop for t in scheduler.timesteps: # Expand latents for classifier-free guidance latent_model_input = torch.cat([latents] * 2) # Predict noise noise_pred = self.diffusion.model( latent_model_input, t, torch.cat([conditional, uncond_conditional]) ) # Perform guidance noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) # Compute previous noisy 
sample latents = scheduler.step(noise_pred, t, latents).prev_sample # Decode latent images = self.autoencoder.decode(latents) images = (images / 2 + 0.5).clamp(0, 1) return images[0].permute(1, 2, 0).cpu().numpy() ``` ### **Why Diffusion Models Are Dominating** Diffusion models have surpassed GANs in many applications because: - **Stable training**: Less prone to mode collapse - **High sample quality**: Especially with classifier guidance - **Theoretical foundation**: Based on solid probabilistic principles - **Flexible conditioning**: Easy to condition on text, images, etc. - **Controllable generation**: Precise control over the generation process According to a 2023 benchmark by Hugging Face, diffusion models achieve **lower FID scores** (better image quality) than GANs on most standard datasets. --- ## **Text-to-Image Generation** Text-to-image generation is one of the most impressive applications of generative models, enabling the creation of images from textual descriptions. ### **The Text-to-Image Challenge** Text-to-image generation requires: - Understanding natural language descriptions - Translating semantic concepts to visual features - Generating high-resolution, coherent images - Maintaining consistency between text and image ### **Key Architectures** #### **1. DALL-E and DALL-E 2** Developed by OpenAI, these models use: - CLIP for text-image alignment - Discrete VAE for image tokenization - Transformer-based image generation #### **2. Stable Diffusion** The most popular open-source text-to-image model: - Latent diffusion model - Conditioned on CLIP text embeddings - Efficient enough to run on consumer GPUs #### **3. Imagen** Google's text-to-image model: - Uses a large language model (T5) for text understanding - Cascaded diffusion models for high-resolution generation - Achieves state-of-the-art FID scores ### **Implementing Text-to-Image with Stable Diffusion** Let's build a simplified text-to-image pipeline using Hugging Face's `diffusers` library: ```python from diffusers import StableDiffusionPipeline import torch # Load pre-trained Stable Diffusion model model_id = "runwayml/stable-diffusion-v1-5" pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) pipe = pipe.to("cuda") # Generate image from text prompt prompt = "A photorealistic portrait of a cyberpunk hacker, neon lights, futuristic city background" image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0] # Save and display image.save("cyberpunk_hacker.png") image ``` ### **Customizing Stable Diffusion** You can customize Stable Diffusion in several ways: #### **1. Text Prompt Engineering** Crafting effective prompts is crucial: ```python # Basic prompt prompt = "a portrait of a cyberpunk hacker" # Enhanced prompt with details enhanced_prompt = ( "A photorealistic portrait of a cyberpunk hacker, neon lights, " "futuristic city background, detailed facial features, cinematic lighting, " "8k resolution, trending on artstation, unreal engine 5" ) # Negative prompt to avoid unwanted features negative_prompt = "blurry, low quality, cartoon, drawing, text, watermark" # Generate with enhanced prompts image = pipe( enhanced_prompt, negative_prompt=negative_prompt, num_inference_steps=50, guidance_scale=7.5, height=512, width=512 ).images[0] ``` #### **2. 
Using Different Schedulers** Schedulers control the diffusion process: ```python from diffusers import EulerDiscreteScheduler # Use a different scheduler for potentially better results pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config) # Generate with new scheduler image = pipe(prompt, num_inference_steps=30).images[0] # Fewer steps needed ``` #### **3. Textual Inversion** Learn new concepts with just a few images: ```python # First, you need to train a textual inversion embedding # This requires multiple steps not shown here # After training, use the new token prompt = "A portrait of a sks_cyberpunk_hacker" image = pipe(prompt).images[0] ``` #### **4. LoRA (Low-Rank Adaptation)** Fine-tune Stable Diffusion efficiently: ```python # Load base model pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5") # Load LoRA weights pipe.unet.load_attn_procs("path/to/lora_weights") # Generate with adapted model image = pipe("a portrait of a cyberpunk hacker").images[0] ``` #### **5. ControlNet** Control image generation with additional inputs: ```python from diffusers import StableDiffusionControlNetPipeline, ControlNetModel import cv2 from controlnet_aux import OpenposeDetector # Load ControlNet models controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose") pipe = StableDiffusionControlNetPipeline.from_pretrained( "runwayml/stable-diffusion-v1-5", controlnet=controlnet ) # Get pose estimation openpose = OpenposeDetector.from_pretrained('lllyasviel/Annotators') pose_image = openpose(image) # Generate with pose control image = pipe( "a portrait of a cyberpunk hacker", image=pose_image, num_inference_steps=50 ).images[0] ``` ### **Building a Text-to-Image Pipeline from Scratch** Let's implement a simplified text-to-image pipeline: ```python import torch import torch.nn as nn import torch.nn.functional as F from transformers import CLIPTextModel, CLIPTokenizer class TextToImageModel(nn.Module): def __init__(self, diffusion_model, clip_model="openai/clip-vit-base-patch32"): super(TextToImageModel, self).__init__() self.diffusion = diffusion_model # Text encoder self.tokenizer = CLIPTokenizer.from_pretrained(clip_model) self.text_encoder = CLIPTextModel.from_pretrained(clip_model) # Text projection self.text_projection = nn.Linear( self.text_encoder.config.projection_dim, diffusion_model.model.conditional_dim ) def get_text_embeddings(self, prompts): """Get text embeddings from CLIP""" inputs = self.tokenizer( prompts, padding="max_length", max_length=self.tokenizer.model_max_length, truncation=True, return_tensors="pt" ).to(next(self.parameters()).device) with torch.no_grad(): text_embeddings = self.text_encoder(**inputs)[0] return text_embeddings def forward(self, images, prompts): """Training forward pass""" # Encode images to latent space (if using latent diffusion) if hasattr(self.diffusion, 'encode'): z = self.diffusion.encode(images) else: z = images # Get text embeddings text_embeddings = self.get_text_embeddings(prompts) conditional = self.text_projection(text_embeddings) # Diffusion forward pass pred_noise, target_noise, t = self.diffusion(z, conditional) return pred_noise, target_noise, t def generate(self, prompt, height=512, width=512, num_inference_steps=50, guidance_scale=7.5, device='cpu'): """Generate image from text prompt""" # Get text embeddings text_embeddings = self.get_text_embeddings([prompt]) conditional = self.text_projection(text_embeddings) # Classifier-free guidance uncond_embeddings = 
self.get_text_embeddings([""]) uncond_conditional = self.text_projection(uncond_embeddings) # Latent shape (adjust based on your diffusion model) latent_height = height // 8 latent_width = width // 8 latents = torch.randn((1, 4, latent_height, latent_width), device=device) # Time steps for diffusion timesteps = torch.linspace(self.diffusion.T, 1, num_inference_steps, device=device).long() # Diffusion loop for i, t in enumerate(timesteps): # Expand latents for classifier-free guidance latent_model_input = torch.cat([latents] * 2) current_t = torch.full((2,), t, device=device, dtype=torch.long) # Predict noise noise_pred = self.diffusion.model( latent_model_input, current_t, torch.cat([conditional, uncond_conditional]) ) # Perform guidance noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) # Compute previous noisy sample latents = self.diffusion.p_sample(latents, current_t[:1], noise_pred) # Decode latent (if using latent diffusion) if hasattr(self.diffusion, 'decode'): images = self.diffusion.decode(latents) else: images = latents # Post-process images = (images / 2 + 0.5).clamp(0, 1) return images[0].permute(1, 2, 0).cpu().numpy() # Example usage # First create a diffusion model (as shown in previous section) # Then: text_to_image = TextToImageModel(diffusion_model).to(device) # Generate image image = text_to_image.generate( "A photorealistic portrait of a cyberpunk hacker, neon lights, futuristic city background", height=512, width=512, num_inference_steps=50, guidance_scale=7.5, device=device ) # Display plt.imshow(image) plt.axis('off') plt.show() ``` ### **Advanced Text-to-Image Techniques** #### **1. Prompt-to-Prompt Editing** Modify specific aspects of an image while preserving overall structure: ```python def prompt_to_prompt_editing(original_prompt, edit_prompt, n_inference_steps=50, guidance_scale=7.5): """ Edit specific aspects of an image using two different prompts """ # Generate original image original_image = text_to_image.generate(original_prompt) # Get text embeddings for both prompts original_emb = text_to_image.get_text_embeddings([original_prompt]) edit_emb = text_to_image.get_text_embeddings([edit_prompt]) # Perform diffusion with cross-attention control # This is a simplified version - actual implementation is more complex latents = text_to_image.diffusion.sample( shape=(1, 4, 64, 64), device=device ) # During diffusion steps, replace attention maps for specific tokens # This requires modifying the diffusion model's attention layers # ... # Decode edited latents edited_image = text_to_image.diffusion.decode(latents) return original_image, edited_image ``` #### **2. 
InstructPix2Pix** Edit images based on instruction prompts: ```python def instruct_pix2pix(image, instruction, diffusion_steps=10): """ Edit an existing image based on an instruction prompt """ # Encode image to latent space init_latents = text_to_image.diffusion.encode(image) # Get text embeddings for instruction instruction_emb = text_to_image.get_text_embeddings([instruction]) # Perform diffusion starting from init_latents latents = init_latents # Only perform a few diffusion steps timesteps = torch.linspace(text_to_image.diffusion.T, text_to_image.diffusion.T // 2, diffusion_steps, device=device).long() for t in timesteps: # Predict noise based on instruction noise_pred = text_to_image.diffusion.model( latents, torch.full((1,), t, device=device), text_to_image.text_projection(instruction_emb) ) # Update latents latents = text_to_image.diffusion.p_sample(latents, t, noise_pred) # Decode edited_image = text_to_image.diffusion.decode(latents) return edited_image ``` #### **3. Multi-Concept Customization** Combine multiple custom concepts in a single image: ```python def multi_concept_generation(concept_prompts, weights, **kwargs): """ Generate image combining multiple custom concepts Args: concept_prompts: List of prompts for different concepts weights: List of weights for each concept """ # Get embeddings for all concepts embeddings = [text_to_image.get_text_embeddings([prompt]) for prompt in concept_prompts] # Weighted average of embeddings combined_emb = sum(w * emb for w, emb in zip(weights, embeddings)) # Generate with combined embedding return text_to_image.generate_from_embeddings(combined_emb, **kwargs) ``` ### **Challenges in Text-to-Image Generation** #### **1. Text-Image Alignment** Ensuring the generated image matches the text description: *Solutions*: - CLIP score optimization - Classifier-free guidance - Attention control mechanisms #### **2. Object Composition** Creating images with multiple objects in correct spatial relationships: *Solutions*: - Layout conditioning - Object-aware diffusion - Scene graph representations #### **3. Fine Details** Generating high-quality details, especially for text and faces: *Solutions*: - Cascaded diffusion models - Super-resolution diffusion - Face-specific refinement networks #### **4. Ethical Considerations** Addressing biases and potential misuse: *Solutions*: - Safety filters - Bias mitigation techniques - Responsible AI guidelines --- ## **Music and Audio Generation** Generative models have made significant advances in music and audio generation, creating realistic-sounding music, speech, and sound effects. ### **Audio Representation** Before generating audio, we need appropriate representations: #### **1. Waveform** Raw audio samples (time domain): - Pros: Simple, preserves all information - Cons: High-dimensional, difficult to model long-term structure #### **2. Spectrogram** Frequency representation (time-frequency domain): - Mel-spectrogram: Log-mel spectrogram with perceptual weighting - STFT: Short-time Fourier transform - Pros: Captures musical structure, lower dimensionality - Cons: Lossy, requires inverse transform for waveform #### **3. 
Symbolic Representation**
MIDI or other symbolic formats:
- Pros: Captures musical structure explicitly
- Cons: Loses expressive nuances

### **Implementing a Music Generation Model**

Let's build a simplified WaveNet-style model for audio generation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class WaveNetBlock(nn.Module):
    """Dilated causal convolution block with a gated activation unit"""
    def __init__(self, residual_channels, dilation):
        super(WaveNetBlock, self).__init__()
        self.dilation = dilation
        self.residual_channels = residual_channels

        self.filter_conv = nn.Conv1d(
            residual_channels, residual_channels,
            kernel_size=2, dilation=dilation
        )
        self.gate_conv = nn.Conv1d(
            residual_channels, residual_channels,
            kernel_size=2, dilation=dilation
        )
        self.residual_conv = nn.Conv1d(residual_channels, residual_channels, kernel_size=1)
        self.skip_conv = nn.Conv1d(residual_channels, residual_channels, kernel_size=1)

    def forward(self, x):
        # Left-pad so the convolution is causal and the output keeps the input length
        padded = F.pad(x, (self.dilation, 0))
        filter_out = self.filter_conv(padded)
        gate_out = self.gate_conv(padded)

        # Gated activation
        out = torch.tanh(filter_out) * torch.sigmoid(gate_out)

        # Residual and skip connections (same length as the input thanks to the padding)
        residual = self.residual_conv(out)
        skip = self.skip_conv(out)

        return (x + residual) / np.sqrt(2), skip

class WaveNet(nn.Module):
    """WaveNet model for audio generation"""
    def __init__(self, num_layers, residual_channels, dilation_cycle=8):
        super(WaveNet, self).__init__()
        self.num_layers = num_layers
        self.residual_channels = residual_channels

        # Initial convolution
        self.start_conv = nn.Conv1d(1, residual_channels, kernel_size=1)

        # Dilated causal convolution layers with exponentially growing dilation
        self.wavenet_blocks = nn.ModuleList()
        for i in range(num_layers):
            dilation = 2 ** (i % dilation_cycle)
            self.wavenet_blocks.append(WaveNetBlock(residual_channels, dilation))

        # Output layers
        self.end_conv_1 = nn.Conv1d(residual_channels, residual_channels, kernel_size=1)
        self.end_conv_2 = nn.Conv1d(residual_channels, 1, kernel_size=1)

        self.relu = nn.ReLU()

    def forward(self, x):
        # x: [batch_size, 1, sequence_length]

        # Initial convolution
        x = self.start_conv(x)

        # Store skip connections (all the same length, so they can be summed)
        skip_connections = []

        # Process through WaveNet blocks
        for block in self.wavenet_blocks:
            x, skip = block(x)
            skip_connections.append(skip)

        # Sum skip connections
        out = sum(skip_connections)
        out = self.relu(out)
        out = self.relu(self.end_conv_1(out))
        out = self.end_conv_2(out)

        return out

    def generate(self, length, temperature=1.0, device='cpu'):
        """Generate an audio waveform autoregressively"""
        self.eval()
        with torch.no_grad():
            # Initialize with a single zero sample
            audio = torch.zeros(1, 1, 1).to(device)

            # Generate one sample at a time
            for _ in range(length):
                # Predict the next sample from everything generated so far
                output = self(audio)

                # Apply temperature
                output = output[:, :, -1] / temperature

                # Sample from the output distribution
                # For simplicity, using the squashed mean instead of proper sampling
                next_sample = torch.tanh(output)

                # Append to generated audio
                audio = torch.cat([audio, next_sample.unsqueeze(2)], dim=2)

        return audio.squeeze()
```

### **Training a Music Generation Model**

```python
# Hyperparameters
num_layers = 20
residual_channels = 32
learning_rate = 1e-3
batch_size = 16
sequence_length = 8192  # ~0.5 seconds at 16kHz

# Initialize model
model = WaveNet(num_layers, residual_channels).to(device)

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Loss function
criterion = nn.MSELoss()

# Training loop
def train_wavenet(model, dataloader, optimizer, criterion, device, epochs=100):
    model.train()

    for epoch in range(epochs):
        total_loss = 0
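        # Each batch is a chunk of raw waveform shaped [batch_size, 1, sequence_length].
        # Training is teacher-forced next-sample prediction: the model reads samples
        # x[0..T-2] and learns to predict x[1..T-1]. Because every block uses causal
        # padding, the output has the same length as the input, so predictions and
        # targets align one-to-one.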
        for batch in dataloader:
            audio = batch.to(device)

            # Predict next sample
            predictions = model(audio[:, :, :-1])

            # Compute loss (comparing to the actual next sample)
            loss = criterion(predictions, audio[:, :, 1:])

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            total_loss += loss.item()

        # Print progress
        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(dataloader):.4f}")

        # Generate sample audio
        if (epoch+1) % 10 == 0:
            generate_audio_sample(model, epoch+1, device)

# Generate and save an audio sample
def generate_audio_sample(model, epoch, device, length=16384):
    import soundfile as sf  # assumes the `soundfile` package is installed

    model.eval()
    with torch.no_grad():
        audio = model.generate(length, device=device)

    # Convert to numpy and save as a 16 kHz WAV file
    audio_np = audio.cpu().numpy()
    sf.write(f'wavenet_sample_epoch_{epoch}.wav', audio_np, 16000)
```

### **Advanced Music Generation Models**

#### **1. Jukebox**

OpenAI's Jukebox generates music with singing in various genres:

```python
# NOTE: schematic sketch only — the real entry points live in the openai/jukebox
# repository (see jukebox/sample.py); the calls below are illustrative, not exact.
from jukebox.make_models import make_vqvae, make_prior
from jukebox.hparams import Hyperparams

hps = Hyperparams()

# Load the hierarchical VQ-VAE over raw audio
vqvae = make_vqvae(hps, device)

# Load the prior (transformer over the VQ-VAE codes)
prior = make_prior(hps, device)

# Generate music
def generate_music(artist, genre, lyrics, duration=10):
    # The prior samples discrete audio codes conditioned on artist, genre, and lyrics
    z = prior.sample(artist, genre, lyrics, duration)

    # The VQ-VAE decoder turns the sampled codes back into a waveform
    audio = vqvae.decode(z)
    return audio
```

#### **2. Music Transformer**

Generates symbolic music using self-attention:

```python
# NOTE: pseudocode sketch — Hugging Face `transformers` does not ship a
# `MusicTransformer` class; treat it as a hypothetical wrapper around a
# REMI-token music model.
import pretty_midi
from transformers import MusicTransformer  # hypothetical wrapper, not a real transformers class

# Load pre-trained Music Transformer
model = MusicTransformer.from_pretrained("adamlin/MusicTransformer-remi-small-16k")

# Generate music in symbolic format
def generate_music_prompt(prompt=None, max_length=1024):
    inputs = model.tokenizer(
        prompt if prompt else "",
        return_tensors="pt",
        max_length=512,
        truncation=True
    ).to(device)

    outputs = model.generate(
        **inputs,
        max_length=max_length,
        do_sample=True,
        temperature=0.9,
        top_p=0.95
    )

    # Convert the generated tokens back to MIDI (a tokenizer-specific step)
    midi = model.tokenizer.decode(outputs[0])
    return midi

# Save as a MIDI file
midi = generate_music_prompt("Piano jazz in C major")
midi.write("generated_music.mid")
```

#### **3. AudioLDM**

Text-to-audio generation model similar to Stable Diffusion:

```python
# AudioLDM is available through Hugging Face's `diffusers` library
from diffusers import AudioLDMPipeline
import soundfile as sf

# Load pre-trained AudioLDM
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm").to(device)

# Generate audio from text
audio = pipe(
    "ocean waves crashing on the beach, seagulls in the distance",
    num_inference_steps=100,
    audio_length_in_s=10.0
).audios[0]

# Save audio (AudioLDM generates 16 kHz audio)
sf.write("ocean_sounds.wav", audio, 16000)
```

### **Conditioning Music Generation**

Conditioning models on additional information improves control:

#### **1. Style Conditioning**

Generate music in specific styles:

```python
class ConditionalWaveNet(WaveNet):
    def __init__(self, num_layers, residual_channels, num_styles):
        super().__init__(num_layers, residual_channels)

        # Style embedding
        self.style_embedding = nn.Embedding(num_styles, residual_channels)
        self.style_proj = nn.Linear(residual_channels, residual_channels)

    def forward(self, x, style):
        # Style conditioning
        style_emb = self.style_embedding(style)
        style_emb = self.style_proj(style_emb).unsqueeze(2)

        # Initial convolution
        x = self.start_conv(x)

        # Add style to input
        x = x + style_emb

        # Rest of the model as before
        # ...
```

#### **2. 
Melody Conditioning** Generate harmonies based on a given melody: ```python class MelodyConditionedModel(nn.Module): def __init__(self, base_model, melody_encoder): super().__init__() self.base_model = base_model self.melody_encoder = melody_encoder def forward(self, audio, melody): # Encode melody melody_emb = self.melody_encoder(melody) # Process through base model with melody conditioning return self.base_model(audio, melody_emb) ``` #### **3. Lyrics Conditioning** Generate singing with specific lyrics: ```python class LyricsConditionedModel(nn.Module): def __init__(self, base_model, text_encoder): super().__init__() self.base_model = base_model self.text_encoder = text_encoder def forward(self, audio, lyrics): # Encode lyrics text_emb = self.text_encoder(lyrics) # Process through base model with text conditioning return self.base_model(audio, text_emb) ``` ### **Music Generation Evaluation** Evaluating music generation is challenging: #### **1. Objective Metrics** - **Fréchet Audio Distance (FAD)**: Similar to FID but for audio - **KL divergence**: Between generated and real audio features - **Audio reconstruction metrics**: SNR, PESQ #### **2. Subjective Evaluation** - **MOS (Mean Opinion Score)**: Human ratings - **ABX tests**: Human preference tests - **Genre classification accuracy**: Does generated music match target genre? #### **3. Music-Specific Metrics** - **Note consistency**: Consistency of musical elements - **Harmonic structure**: Quality of chord progressions - **Rhythmic coherence**: Consistency of rhythm ```python def evaluate_music_generation(generated_audio, real_audio): """Calculate various music generation metrics""" # Extract features gen_features = extract_music_features(generated_audio) real_features = extract_music_features(real_audio) # FAD score fad = calculate_fad(gen_features, real_features) # Genre classification gen_genre = classify_genre(generated_audio) real_genre = classify_genre(real_audio) genre_acc = (gen_genre == real_genre).mean() # Music-specific metrics note_consistency = calculate_note_consistency(generated_audio) harmonic_quality = calculate_harmonic_quality(generated_audio) return { "FAD": fad, "Genre Accuracy": genre_acc, "Note Consistency": note_consistency, "Harmonic Quality": harmonic_quality } ``` --- ## **Evaluating Generative Models** Evaluating generative models is notoriously difficult. Unlike discriminative models, there's no single metric that captures all aspects of quality. ### **Challenges in Evaluation** 1. **No ground truth**: For completely new samples 2. **Multi-dimensional quality**: Balance of diversity and fidelity 3. **Human perception**: What looks/sounds good to humans 4. **Task dependency**: Evaluation depends on intended use ### **Quantitative Metrics** #### **1. Inception Score (IS)** Measures diversity and quality of generated images: $$\text{IS}=\exp\left(\mathbb{E}_{\mathbf{x}\sim p_g}\left[\text{KL}\left(p(y|\mathbf{x})\|p(y)\right)\right]\right)$$ Where: - $p(y|\mathbf{x})$ is the label distribution from Inception network - $p(y)$ is the marginal label distribution Higher IS indicates better quality and diversity. 
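To build intuition for the formula before the full implementation below, here is a small self-contained sanity check (toy one-hot predictions only, no real Inception network — an illustrative addition): when every sample is classified with full confidence and the predicted classes are spread uniformly over $N$ classes, the Inception Score equals $N$.

```python
import numpy as np

# Toy sanity check of the Inception Score formula (no real Inception network involved).
# Each row is p(y|x) for one generated sample; here every sample is classified with
# full confidence and the classes are used equally often.
num_classes = 10
preds = np.eye(num_classes)          # 10 one-hot "predictions", one per class

p_y = preds.mean(axis=0)             # marginal p(y) is uniform: 1/10 per class
kl = np.sum(preds * (np.log(preds + 1e-12) - np.log(p_y)), axis=1)
inception_score_toy = np.exp(kl.mean())

print(inception_score_toy)           # ~10.0 == number of classes
```

Mode collapse pushes the score the other way: if every sample is assigned to the same class, then $p(y|\mathbf{x}) = p(y)$ for all samples, the KL term vanishes, and the score drops to 1.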
Here's a PyTorch implementation of the Inception Score:

```python
import numpy as np
import torch
import torch.nn.functional as F
from torchvision.models import inception_v3

def inception_score(images, n_split=10, batch_size=32, resize=True, device='cpu'):
    """Calculate Inception Score for generated images"""
    # Load Inception model
    inception_model = inception_v3(pretrained=True, transform_input=False).to(device)
    inception_model.eval()

    # Process images (Inception-v3 expects 299x299 inputs)
    if resize:
        images = F.interpolate(images, size=(299, 299), mode='bilinear', align_corners=False)

    # Get predictions
    preds = []
    for i in range(0, len(images), batch_size):
        batch = images[i:i+batch_size].to(device)
        with torch.no_grad():
            pred = F.softmax(inception_model(batch), dim=1)
        preds.append(pred.cpu().numpy())

    preds = np.concatenate(preds, 0)

    # Calculate IS over n_split folds
    scores = []
    for i in range(n_split):
        part = preds[(i * preds.shape[0] // n_split):((i + 1) * preds.shape[0] // n_split), :]
        py = np.mean(part, axis=0)
        scores.append(np.exp(np.mean(np.sum(part * (np.log(part) - np.log(py)), axis=1))))

    return np.mean(scores), np.std(scores)
```

#### **2. Fréchet Inception Distance (FID)**

Measures similarity between generated and real data distributions:

$$\text{FID}=\|\boldsymbol{\mu}_r-\boldsymbol{\mu}_g\|^{2}+\text{Tr}(\boldsymbol{\Sigma}_r+\boldsymbol{\Sigma}_g-2\sqrt{\boldsymbol{\Sigma}_r\boldsymbol{\Sigma}_g})$$

Where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are the mean and covariance of Inception features. Lower FID indicates better quality.

```python
import numpy as np
from scipy import linalg

def calculate_fid(real_features, fake_features):
    """Calculate Fréchet Inception Distance"""
    mu1, sigma1 = real_features.mean(axis=0), np.cov(real_features, rowvar=False)
    mu2, sigma2 = fake_features.mean(axis=0), np.cov(fake_features, rowvar=False)

    # Calculate sum squared difference between means
    ssd = np.sum((mu1 - mu2) ** 2)

    # Calculate the matrix square root of the covariance product
    cov_mean = linalg.sqrtm(sigma1.dot(sigma2))

    # Numerical error might make the matrix square root slightly imaginary
    if np.iscomplexobj(cov_mean):
        cov_mean = cov_mean.real

    # Calculate FID
    fid = ssd + np.trace(sigma1 + sigma2 - 2 * cov_mean)

    return fid
```

#### **3. Precision and Recall**

Measures coverage (recall) and quality (precision) separately:
- **Precision**: Fraction of generated samples that are realistic
- **Recall**: Fraction of the data manifold covered by generated samples

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def precision_recall(real_features, fake_features, k=3):
    """Simplified k-nearest-neighbor estimate of precision and recall"""
    # Convert to numpy
    real_features = real_features.cpu().numpy()
    fake_features = fake_features.cpu().numpy()

    # Calculate pairwise distances
    dist_real_to_real = pairwise_distances(real_features, real_features)
    dist_fake_to_fake = pairwise_distances(fake_features, fake_features)
    dist_real_to_fake = pairwise_distances(real_features, fake_features)

    # k-NN radius of every real and fake sample (distance to its k-th nearest neighbor)
    radii_real = np.partition(dist_real_to_real, k, axis=1)[:, k]
    radii_fake = np.partition(dist_fake_to_fake, k, axis=1)[:, k]

    # Precision: fraction of fake samples that land inside some real sample's k-NN ball
    precision = (dist_real_to_fake.T <= radii_real[None, :]).any(axis=1).mean()

    # Recall: fraction of real samples that land inside some fake sample's k-NN ball
    recall = (dist_real_to_fake <= radii_fake[None, :]).any(axis=1).mean()

    return precision, recall
```

#### **4. Kernel Inception Distance (KID)**

An unbiased estimate of the squared maximum mean discrepancy:

$$\text{KID}=\frac{1}{m(m-1)}\sum_{i\neq j}k(\mathbf{x}_i,\mathbf{x}_j)+\frac{1}{n(n-1)}\sum_{i\neq j}k(\mathbf{y}_i,\mathbf{y}_j)-\frac{2}{mn}\sum_{i,j}k(\mathbf{x}_i,\mathbf{y}_j)$$

Where $k$ is a polynomial kernel on Inception features.
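FID, precision/recall, and KID all operate on Inception features rather than raw pixels, yet the functions in this section take `real_features` and `fake_features` as given. As a bridge to the KID implementation below, here is a minimal sketch of how such features could be extracted with torchvision's Inception-v3; the `get_inception_features` helper (and its omission of dataset-specific normalization) is an illustrative assumption, not part of the original pipeline:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import inception_v3

@torch.no_grad()
def get_inception_features(images, batch_size=32, device='cpu'):
    """Extract 2048-d pooled Inception-v3 features from a [N, 3, H, W] tensor in [0, 1]."""
    model = inception_v3(pretrained=True, transform_input=False).to(device).eval()
    model.fc = nn.Identity()  # drop the classifier head so the forward pass returns pooled features

    features = []
    for i in range(0, len(images), batch_size):
        batch = images[i:i + batch_size].to(device)
        # Inception-v3 expects 299x299 inputs
        batch = F.interpolate(batch, size=(299, 299), mode='bilinear', align_corners=False)
        features.append(model(batch).cpu())

    return torch.cat(features, dim=0)
```

Real and generated image batches go through the same helper, and the resulting tensors can then be passed directly to `calculate_fid`, `precision_recall`, and `kernel_inception_distance`.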
```python
import numpy as np

def kernel_inception_distance(real_features, fake_features, degree=3, gamma=None, coef=1):
    """Calculate Kernel Inception Distance"""
    real_features = real_features.cpu().numpy()
    fake_features = fake_features.cpu().numpy()

    # Default kernel bandwidth: 1 / feature dimension
    if gamma is None:
        gamma = 1.0 / real_features.shape[1]

    # Polynomial kernel
    def polynomial_kernel(x, y):
        return (gamma * x.dot(y.T) + coef) ** degree

    # Compute kernel matrices
    k_rr = polynomial_kernel(real_features, real_features)
    k_ff = polynomial_kernel(fake_features, fake_features)
    k_rf = polynomial_kernel(real_features, fake_features)

    # Unbiased estimate (diagonal terms are excluded from the within-set sums)
    m = real_features.shape[0]
    n = fake_features.shape[0]

    term_rr = (k_rr.sum() - np.diag(k_rr).sum()) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.diag(k_ff).sum()) / (n * (n - 1))
    term_rf = k_rf.sum() / (m * n)

    return term_rr + term_ff - 2 * term_rf
```

### **Qualitative Evaluation**

#### **1. Visual Inspection**

Human evaluation of sample quality:

```python
import matplotlib.pyplot as plt
from torchvision.utils import make_grid

def visualize_samples(model, num_samples=16, device='cpu'):
    """Generate and visualize samples for human evaluation"""
    model.eval()
    with torch.no_grad():
        samples = model.sample((num_samples, 3, 64, 64), device=device)

    # Arrange the samples in a grid
    grid = make_grid(samples, nrow=4, normalize=True)

    # Display
    plt.figure(figsize=(10, 10))
    plt.imshow(grid.cpu().numpy().transpose(1, 2, 0))
    plt.axis('off')
    plt.title('Generated Samples')
    plt.show()
```

#### **2. Interpolation**

Check for continuity in latent space:

```python
def latent_interpolation(model, num_steps=10, device='cpu'):
    """Interpolate between random points in latent space"""
    model.eval()
    with torch.no_grad():
        # Sample two random points
        z1 = torch.randn(1, model.latent_dim).to(device)
        z2 = torch.randn(1, model.latent_dim).to(device)

        # Interpolate
        steps = torch.linspace(0, 1, num_steps)
        interpolations = []

        for step in steps:
            z = (1 - step) * z1 + step * z2
            sample = model.decode(z)
            interpolations.append(sample.view(3, 64, 64))

    # Visualize
    plt.figure(figsize=(15, 3))
    for i, img in enumerate(interpolations):
        plt.subplot(1, len(interpolations), i+1)
        plt.imshow(img.cpu().numpy().transpose(1, 2, 0))
        plt.axis('off')
    plt.tight_layout()
    plt.show()
```

#### **3. Arithmetic Operations**

Test for meaningful latent space structure:

```python
def latent_arithmetic(model, device='cpu'):
    """Perform arithmetic operations in latent space"""
    model.eval()
    with torch.no_grad():
        # Find latent codes of samples matching specific descriptions
        # (find_sample is a placeholder helper that retrieves such a latent code)
        man = find_sample(model, "man with glasses")
        woman = find_sample(model, "woman with glasses")
        king = find_sample(model, "king")
        queen = find_sample(model, "queen")

        # Perform arithmetic
        man_to_woman = woman - man
        king_to_queen = queen - king

        # Check similarity of the two direction vectors
        similarity = F.cosine_similarity(
            man_to_woman.flatten(),
            king_to_queen.flatten(),
            dim=0
        ).item()

        print(f"Man->Woman and King->Queen similarity: {similarity:.4f}")
```

### **Task-Specific Evaluation**

#### **1. Data Augmentation**

If using generated data for training:

```python
def evaluate_augmentation(gan, classifier, real_train_loader, test_loader, device):
    """Evaluate generated data for data augmentation.

    generate_fake_data, combine_loaders, train_classifier, and evaluate_classifier
    are placeholder helpers standing in for the surrounding training utilities.
    """
    # Generate fake data
    fake_data = generate_fake_data(gan, len(real_train_loader.dataset))

    # Train classifier on real + fake data
    augmented_train_loader = combine_loaders(real_train_loader, fake_data)
    train_classifier(classifier, augmented_train_loader, device)

    # Evaluate on test set
    accuracy = evaluate_classifier(classifier, test_loader, device)

    return accuracy
```

#### **2. 
Creative Applications** For artistic applications: ```python def creative_evaluation(generated_images, prompts): """Evaluate creative aspects of generated images""" evaluations = [] for img, prompt in zip(generated_images, prompts): # Evaluate relevance to prompt prompt_relevance = clip_score(img, prompt) # Evaluate artistic qualities composition = rate_composition(img) color_harmony = rate_color_harmony(img) originality = rate_originality(img) evaluations.append({ "prompt": prompt, "prompt_relevance": prompt_relevance, "composition": composition, "color_harmony": color_harmony, "originality": originality }) return evaluations ``` #### **3. Downstream Task Performance** For specific applications: ```python def evaluate_downstream_task(generated_data, task_model, task_metric, device): """Evaluate generated data on a downstream task""" task_model.eval() with torch.no_grad(): outputs = task_model(generated_data.to(device)) metric = task_metric(outputs) return metric ``` ### **Best Practices for Evaluation** 1. **Use multiple metrics**: No single metric tells the whole story 2. **Include human evaluation**: Especially for creative applications 3. **Compare to baselines**: Compare against previous models 4. **Report standard deviations**: For reliability 5. **Use consistent settings**: Same data, same preprocessing 6. **Consider the application**: Tailor evaluation to intended use According to a 2023 survey by Generative AI Research, the most reliable evaluation approach combines **FID score** with **human evaluation** on a diverse set of samples. --- ## **Building a Complete Image Generation Pipeline** Let's build a complete image generation pipeline using Stable Diffusion with custom fine-tuning.