These are papers found to improve the VQ-VAE model, ranked by priority (personal judgement). All of them are covered in the lucidrains repository: https://github.com/lucidrains/vector-quantize-pytorch?tab=readme-ov-file

## Model architectures :

* **Residual VQ** *2021* : Uses multiple quantization stages rather than just one as in the original VQ-VAE paper. At each stage a residual vector is computed by taking the latent vector from the previous stage and subtracting its quantized vector; that residual is then quantized at the next stage. This helps the model capture more semantics and information. To note: each stage has its own codebook.
>> https://arxiv.org/pdf/2107.03312
* **RQ-VAE** *2022* : Uses Residual VQ to construct the RQ-VAE, for generating high-resolution images with more compressed codes. They make two modifications: the first is to share the codebook across all quantizers, the second is to stochastically sample the codes rather than always taking the closest match. Both features are exposed in the library as two extra keyword arguments (see the sketch after this section).
>> https://arxiv.org/pdf/2203.01941

## Initializations :

* The SoundStream paper proposes that the codebook should be initialized with the k-means centroids of the first training batch (fetch more details from the paper later; a usage sketch follows this section): "*First, instead of using a random initialization for the codebook vectors, we run the k-means algorithm on the first training batch and use the learned centroids as initialization. This allows the codebook to be close to the distribution of its inputs and improves its usage. Second, as proposed in [34], when a codebook vector has not been assigned any input frame for several batches, we replace it with an input frame randomly sampled within the current batch. More precisely, we track the exponential moving average of the assignments to each vector (with a decay factor of 0.99) and replace the vectors of which this statistic falls below 2.*"
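A minimal sketch of how the items above map onto `vector-quantize-pytorch`, assuming the `ResidualVQ` keyword arguments shown in the repo README (`shared_codebook`, `stochastic_sample_codes`, `kmeans_init`); treat the exact argument names as assumptions and double-check them against the README:

```python
import torch
from vector_quantize_pytorch import ResidualVQ

# Residual VQ with the two RQ-VAE modifications (shared codebook, stochastic
# code sampling) plus the SoundStream-style k-means codebook initialization.
residual_vq = ResidualVQ(
    dim = 256,
    num_quantizers = 8,              # one quantization stage / codebook per residual
    codebook_size = 1024,
    shared_codebook = True,          # RQ-VAE: single codebook shared by all quantizers
    stochastic_sample_codes = True,  # RQ-VAE: sample codes instead of taking the argmin
    sample_codebook_temp = 0.1,      # temperature for the stochastic sampling
    kmeans_init = True,              # SoundStream: init codebook from k-means on the first batch
    kmeans_iters = 10
)

x = torch.randn(1, 1024, 256)                     # (batch, seq_len, dim) encoder output
quantized, indices, commit_loss = residual_vq(x)  # indices: (1, 1024, 8), one code per quantizer
```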
## CodeBook Utilization :

* **Re-build and Fine-tune Strategy** *2021* : "*Next we sample P features uniformly from the entire set of features found in training images, where P is the sampling number and far larger than the desired codebook capacity Kt. This ensures that the rebuild codebook is composed of valid latent codes. Since the process of codebook training is basically the process of finding cluster centres, we directly employ k-means with AFK-MC2 [2] on the sampled P features and utilize the centres to re-build the codebook Zt. We then replace the original codebook with the re-build Zt and fine-tune it on top of the well-trained discrete VAE*"
>> https://arxiv.org/pdf/2112.01799
* **Gradient computation (rotation trick)** *2024* : In standard VQ methods, the latent representations produced by the encoder may not fully utilize the available codebook: some regions of the latent space map to only a few codebook entries, leading to redundancy and inefficiency. The rotation trick propagates gradients through the quantization step with an *orthogonal* (rotation) transformation instead of the straight-through estimator (orthogonality keeps the transformation invertible and preserves distances, so the geometry of the latent representations is not distorted), leading to better alignment between the encoder's output and the quantizer's codebook. The latent vectors end up more evenly spread out, which maximizes the usage of the codebook entries, reduces codebook collapse, and improves performance, especially with smaller codebooks.
>> https://arxiv.org/pdf/2410.06424
>> ![image](https://hackmd.io/_uploads/SJZeU6d41x.png)
* **Improved VQGAN** : The paper proposes the following improvements to the VQ-GAN architecture so that all the codes in the codebook are used efficiently: keep the codebook in a lower dimension (the encoder outputs are projected down before quantization and projected back up to the high dimension afterwards), and l2-normalize both the codes and the encoded vectors, which boils down to using cosine similarity for the distance. They claim that constraining the vectors to a sphere improves code usage and downstream reconstruction. (See the sketch at the end of this section.)
>> https://openreview.net/pdf?id=pfNyExj7z2
* **Expiring stale codes** : The *SoundStream* paper has a scheme where codes whose usage falls below a certain threshold are replaced with vectors randomly sampled from the current batch.
* **Orthogonal regularization loss** *2023* : Proposes that, when using vector quantization on images, enforcing the codebook to be orthogonal leads to translation equivariance of the discretized codes, giving large improvements in downstream text-to-image generation tasks.
>> https://arxiv.org/pdf/2112.00384
* **Sim VQ** *2024* : Proposes a scheme where the codebook is frozen and the codes are implicitly generated through a linear projection. The authors claim this setup leads to less codebook collapse as well as easier convergence. The repo README notes that it performs even better when paired with the rotation trick from Fifty et al. and when the linear projection is expanded to a small one-layer MLP.
>> https://arxiv.org/pdf/2411.02038
>> ![image](https://hackmd.io/_uploads/By4Z3a_E1l.png)
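A minimal sketch of the codebook-utilization knobs above as exposed by `vector-quantize-pytorch`, assuming the `VectorQuantize` keyword arguments listed in the repo README (`codebook_dim`, `use_cosine_sim`, `threshold_ema_dead_code`, `orthogonal_reg_weight`, `rotation_trick`); the exact names, and especially `rotation_trick`, are assumptions to verify against the README:

```python
import torch
from vector_quantize_pytorch import VectorQuantize

# Single quantizer combining several codebook-utilization tricks:
# low-dimensional codebook + cosine similarity (Improved VQGAN),
# dead-code expiry (SoundStream), orthogonal regularization,
# and the rotation trick for gradient propagation (Fifty et al.).
vq = VectorQuantize(
    dim = 256,
    codebook_size = 1024,
    codebook_dim = 16,             # Improved VQGAN: project to a lower codebook dimension
    use_cosine_sim = True,         # Improved VQGAN: l2-normalized codes -> cosine distance
    threshold_ema_dead_code = 2,   # SoundStream: expire codes whose EMA usage drops below 2
    orthogonal_reg_weight = 10.,   # weight of the orthogonal regularization loss
    rotation_trick = True          # assumed kwarg for the rotation-trick gradient
)

x = torch.randn(1, 1024, 256)         # (batch, seq_len, dim) encoder output
quantized, indices, aux_loss = vq(x)  # aux_loss: commitment (+ regularization) term to add to the total loss
```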