###### tags: `paper`
# Abstract and Introduction
Inspired by the recent success of AutoML in deep compression, we introduce AutoML to GAN compression and develop an AutoGAN-Distiller (AGD) framework.
AGD is fully automatic, standalone (i.e., needing no trained discriminators), and generically applicable to various GAN models.
Generative adversarial networks (GANs) have empowered many real applications such as image stylization, image editing, and enhancement. These applications have driven a growing demand to deploy GANs, usually their trained generators, on resource-constrained platforms, e.g., for real-time style transfer and super resolution in mobile apps. However, just like other deep models, GAN generators require heavy memory and computation resources to run, which challenges most mobile devices.
This paper aims to significantly push forward the application frontier of GAN compression.
# Related Work
## 2.1. AutoML: Neural Architecture Search
As one of the most significant sub-fields of AutoML, Neural Architecture Search (NAS) seeks an optimal neural network architecture from data, instead of relying on hand-crafted designs.
By now, NAS methods have outperformed manually designed architectures
on a range of tasks such as image classification and segmentation.
(Gong et al., 2019) developed the first NAS framework for GANs, on the task of unconditional image generation from random noise. However, this existing NAS-for-GAN framework is not directly applicable to our GAN compression setting.
## 2.2. Model Compression
### knowledge distillation
transfers the knowledge from one model (the teacher) to a smaller one (the student) by having the student imitate the soft labels generated by the teacher, thereby compressing the model.
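A minimal sketch of the idea, assuming a classification-style teacher/student pair and the common temperature-scaled soft-label formulation (names and hyperparameters are illustrative, not from the paper):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-label distillation: the student imitates the teacher's
    softened output distribution (generic sketch, not AGD's loss)."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)   # teacher's soft labels
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)  # student's predictions
    # KL divergence, scaled by T^2 as in the common formulation
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```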
### pruning
sparsifies the model weights by thresholding out the unimportant ones.
Pruning methods generally follow a similar workflow: (iteratively) prune the model to a smaller size and then retrain to recover accuracy, as in the sketch below.
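A minimal sketch of one-shot magnitude pruning under this workflow (the thresholding criterion and targeted layer types are assumptions for illustration):

```python
import torch

def magnitude_prune(model, sparsity=0.5):
    """Zero out the smallest-magnitude weights in every Conv/Linear layer;
    in practice this is applied iteratively, with retraining in between."""
    for module in model.modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            weight = module.weight.data
            k = int(weight.numel() * sparsity)  # number of weights to drop
            if k == 0:
                continue
            threshold = weight.abs().flatten().kthvalue(k).values
            weight.mul_((weight.abs() > threshold).float())  # apply the mask
```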
### quantization
reduces the floating-point representations of weights and activations to lower numerical precision; in the extreme case, even binary values can be used. A generic sketch is given below.
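A minimal sketch of simulated symmetric, per-tensor quantization (a generic scheme, not necessarily the one used in the paper's 8-bit experiments):

```python
import torch

def uniform_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Map a float tensor onto a 2^num_bits integer grid and back (fake quantization)."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for int8
    scale = x.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale
    return torch.round(x / scale).clamp(-qmax, qmax) * scale
```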
## 2.3. GANs and GAN Compression
Substantial evidence reveals that the generative quality of GANs consistently benefits from larger-scale training. However, GAN training is notoriously unstable, and therefore numerous techniques have been proposed to stabilize the training of increasingly larger GANs.
Despite their performance boost, the growing complexity of these large GANs conflicts with the demands of mobile deployment, calling for compression techniques.
# Our AutoGAN-Distiller Framework
Given a pretrained generator $G_0$ over the data $X = \{ x_i \} _ {i = 1} ^ N$, our goal is to obtain from $G_0$ a more compact and hardware-friendly generator $G$ whose generation quality is not sacrificed much, i.e., $G(x) \approx G_0(x), x \in X$.
We adopt a **differentiable** design for our NAS framework. A NAS framework consists of two key components: the **search space** and the **proxy task**. AGD customizes both components for the specific task of GAN compression, as described in the following sections.
## 3.1. The Proposed Customized Search Space
### General Design Philosophy
- instead of a directed acyclic graph (DAG)-type search space, AGD chooses a **sequential search space**, for the ease of implementing the desired parallelism on real mobile devices
- jointly search for the operator type and the width of each layer, making each layer's operators and width independently searchable
### Application-specific Supernet
On top of the above general design philosophy, we note that different tasks have developed their own preferred and customized components, calling for instantiating different network architectures per application needs.
> A different supernet architecture is instantiated for each application (?)
#### Image Translation

| | stem | backbone | header |
| ---------- | ---------------------- | -------------------- | ----------------------------------------------------------- |
| layers | 3 convolutional layers | 9 sequential layers | 2 transposed convolutional layers and 1 convolutional layer |
| search for | widths | operators and widths | widths |
- inherit the original **CycleGAN** structure (its downsampling/upsampling structures contribute to lowering the computational complexity of the backbone between them)
- For the backbone part, we adopt 9 sequential layers with both searchable operators and widths to trade off between model performance and efficiency.
- For the normalization layers, we adopt the **instance normalization** that is widely used in image translation and style transfer.
#### Super Resolution

> "Upconv" denotes bilinearly upsampling followed by a convolutional layer
- inspired by **SRResNet** (where most computation is performed in the low-resolution feature space for efficiency)
- search for
- stem and header: fixed
- residual: operators and widths
- For the residual network structure
    - ESRGAN introduces residual-in-residual (RiR) blocks with dense connections, which have higher capacity and are easier to train
- Despite improved performance, such densely-connected blocks are hardware unfriendly
- replace the dense blocks in RiR modules with 5 sequential layers with both searchable operators and widths
- remove all the batch normalization layers in the network to eliminate artifacts
### Operator Search
search for the following operators:
- Conv 1 x 1
- Conv 3 x 3
- Residual Block ("ResBlock") (2 layers of Conv 3 x 3, with a skip connection)
- Depthwise Block ("DwsBlock") (Conv 1 x 1 + DepthwiseConv 3 x 3 + Conv 1 x 1, with a skip connection)
did not include dilated convolutions (because of their hardware unfriendliness)
search for the operator of each layer in a differentiable manner (see the sketch below)
- for the $i$-th layer, we use an architecture parameter $\alpha_i$ (one entry per candidate operator) to determine the operator of that layer
- the softmax of $\alpha_i$ gives the probability of choosing each candidate operator
- the layer output is the sum of the candidate operators' outputs, weighted by these softmax values
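A minimal sketch of this differentiable operator selection for a single layer, in the style of DARTS-like mixed operations (only two of the four candidate operators are written out; the ResBlock and DwsBlock would be added analogously, and names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One searchable layer: its output is the softmax(alpha)-weighted
    sum of all candidate operators' outputs."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(c_in, c_out, 1, padding=0),   # Conv 1x1
            nn.Conv2d(c_in, c_out, 3, padding=1),   # Conv 3x3
            # ResBlock / DwsBlock candidates would go here as well.
        ])
        # One architecture parameter entry per candidate operator.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)      # probability of each operator
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```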
### Width Search
By searching the widths, we merge the pruning step into the searching process for an end-to-end GAN compression
- set a single convolution kernel with a maximal width, named the **superkernel**
- search for the expansion ratio $\phi$ to make use of only a subset of the input/output dimensions of the superkernel
- set $\phi \in \{ \frac {1} {3}, \frac {1} {2}, \frac {3} {4}, \frac {5} {6}, 1 \}$ and use the architecture parameter $\gamma_i$ to control the probability of choosing each expansion ratio in the $i$-th layer
- apply Gumbel-Softmax to approximate differentiable sampling for $\phi$ based on $\gamma$
- therefore, during the search only the single most likely expansion ratio is activated at a time, which saves both memory and computation cost (see the sketch below)
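A minimal sketch of the superkernel idea with hard Gumbel-Softmax sampling of the expansion ratio (only the output width is sliced here for brevity; names are illustrative and details differ from the actual AGD implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperKernelConv(nn.Module):
    """Width search over a single maximal-width 'superkernel': a hard
    Gumbel-Softmax sample over gamma picks one expansion ratio, and only
    the corresponding channel slice of the kernel is used."""

    RATIOS = (1 / 3, 1 / 2, 3 / 4, 5 / 6, 1.0)

    def __init__(self, c_in, c_out_max, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out_max, kernel_size,
                              padding=kernel_size // 2)
        self.c_out_max = c_out_max
        # One architecture parameter entry per candidate expansion ratio.
        self.gamma = nn.Parameter(torch.zeros(len(self.RATIOS)))

    def forward(self, x, tau=1.0):
        # Hard Gumbel-Softmax: only one ratio is active per forward pass,
        # which keeps memory and compute low during the search.
        one_hot = F.gumbel_softmax(self.gamma, tau=tau, hard=True)
        idx = int(one_hot.argmax())
        width = max(1, int(self.RATIOS[idx] * self.c_out_max))
        out = F.conv2d(x, self.conv.weight[:width], self.conv.bias[:width],
                       padding=self.conv.padding)
        # Multiply by the straight-through one-hot entry so gradients
        # still flow back to gamma.
        return out * one_hot[idx]
```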
## 3.2. The Proposed Customized Proxy Task

- AGD searches for an efficient generator $G$ under the guidance of distillation from the original model $G_0$, through optimizing **Eq. 1**.
- the objective function in **Eq. 1** is free of any trained discriminator, because
1. in practice, the discriminator is often discarded after the generator is trained
2. adding a discriminator loss term to **Eq. 1** was found to bring no improvement
- in order to allow more flexible model-space exploration while still enforcing the budget
- we set an upper bound and a lower bound around the target computational budget
- double (halve) $\lambda$ when the computational budget of the architecture derived from the current architecture parameters is larger (smaller) than the upper (lower) bound (see the sketch below)
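A minimal sketch of this $\lambda$ schedule, assuming the two bounds are given (how the bounds are chosen is not specified in this note):

```python
def adapt_lambda(lmbd, current_budget, lower_bound, upper_bound):
    """Double lambda when the derived architecture exceeds the upper bound,
    halve it when the architecture falls below the lower bound."""
    if current_budget > upper_bound:
        return lmbd * 2.0   # penalize compute more strongly
    if current_budget < lower_bound:
        return lmbd * 0.5   # relax the penalty
    return lmbd
```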

- During the search process, if we jointly update $\alpha$ and $\gamma$, the model is prone to the "architecture collapse" problem (NAS is quickly biased towards low-latency yet low-performance models)
- Therefore, we decouple the computational budget calculation into two individual terms, an operator-aware part and a width-aware part, and weight the two differently (see the sketch below)
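A minimal sketch of a decoupled, differentiable budget term of this form: the expected FLOPs are split into an operator-aware part driven by $\alpha$ and a width-aware part driven by $\gamma$, combined with separate weights (the per-candidate FLOP tables and the weights are illustrative placeholders, not the paper's values):

```python
import torch.nn.functional as F

def expected_budget(alphas, gammas, op_flops, width_flops, w_op=1.0, w_width=1.0):
    """alphas / gammas: per-layer architecture parameter vectors;
    op_flops / width_flops: per-layer tensors of FLOP counts per candidate."""
    cost_op = sum((F.softmax(a, dim=0) * f).sum() for a, f in zip(alphas, op_flops))
    cost_width = sum((F.softmax(g, dim=0) * f).sum() for g, f in zip(gammas, width_flops))
    # Weighting the two parts differently helps avoid architecture collapse.
    return w_op * cost_op + w_width * cost_width
```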
# 4. Experiment Results
## 4.1. Considered Tasks & Models
### Unpaired image-to-image translation
- apply AGD on compressing **CycleGAN**
- dataset: horse2zebra, summer2winter
- conduct the architecture search individually for each task within each dataset
- also consider a special AGD case in which all weights and activations are quantized to 8-bit integers for hardware-friendly implementation
### Super resolution
- apply AGD on compressing **ESRGAN**
- dataset: a combined dataset of **DIV2K** and **Flickr2K** with an upscale factor of 4
- benchmarks: Set5, Set14, BSD100, Urban100
## 4.2. Evaluation Metrics
- For the efficiency aspect, measure the model size and the inference FLOPs
- measure the real-device inference latency on an NVIDIA GeForce RTX 2080 Ti
### Unpaired image-to-image translation
visual quality, FID
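For reference, FID compares Inception-feature statistics of generated and real images (standard definition, not specific to this paper; lower is better):

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features of real and generated images.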
### Super resolution
visual quality, PSNR
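For reference, PSNR is computed from the mean squared error against the ground-truth high-resolution image (standard definition; higher is better):

$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}$$

where $\mathrm{MAX}_I$ is the maximum pixel value (255 for 8-bit images).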
## 4.3. Training details
### NAS Search
- The AGD framework adopts a **differentiable search algorithm**
- split the training dataset into two halves: one for **updating the supernet weights** and the other for **updating the architecture parameters**, alternating between the two (see the sketch below)
- procedure: pretrain, search, derive
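A minimal sketch of the alternating update in the search stage, assuming a distillation distance $d(G(x), G_0(x))$ as the training signal (function and variable names are illustrative; the budget penalty from Sec. 3.2 would additionally be added to the architecture-update loss):

```python
import torch

def search_epoch(supernet, teacher, weight_loader, arch_loader,
                 w_optimizer, arch_optimizer, distance):
    """One search epoch: supernet weights W and the architecture parameters
    (alpha, gamma) are updated alternately on two disjoint data halves."""
    for x_w, x_a in zip(weight_loader, arch_loader):
        # 1) Update supernet weights W on the first half of the data.
        with torch.no_grad():
            target_w = teacher(x_w)                  # distillation target from G0
        w_optimizer.zero_grad()
        distance(supernet(x_w), target_w).backward()
        w_optimizer.step()

        # 2) Update architecture parameters (alpha, gamma) on the other half.
        with torch.no_grad():
            target_a = teacher(x_a)
        arch_optimizer.zero_grad()
        distance(supernet(x_a), target_a).backward()
        arch_optimizer.step()
```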
## 4.4. AGD for Efficient Unpaired Image Translation
- compare our **AGD** with **CEC** (an existing GAN compression algorithm designed specifically for CycleGAN) and **classical structural pruning** (developed for classification models)
- across all tasks, **AGD** consistently achieves significantly better results than **CEC** and **Prune**
- if we further quantize all weights and activations of the AGD-compressed CycleGAN to 8-bit integers, the models maintain both a competitive FID and comparable visual quality
## 4.5. AGD for Efficient Super Resolution
- apply AGD on compressing ESRGAN
- compare our **AGD** with **ESRGAN**, **ESRGAN-Prune**, **SRGAN**, **VDSR**
- AGD compresses ESRGAN with little to no performance loss
# 5. Conclusion
We propose the AGD framework to significantly push forward the frontier of GAN compression by introducing AutoML into it.
Starting from a specially designed search space of efficient building blocks, AGD performs differentiable neural architecture search under both the guidance of knowledge distillation and the constraint of computational resources.
AGD is automatic, makes no assumptions about GAN structures, loss forms, or the availability of discriminators, and is applicable to various GAN tasks.
Experiments show that AGD outperforms existing options with aggressively reduced FLOPs, model size, and real-device latency, without degrading model performance.