# Reproducibility of "EG3D: Efficient Geometry-aware 3D Generative Adversarial Networks" **Authors:** - Maurits Kienhuis (5300126, M.A.A.Kienhuis@student.tudelft.nl) - Petar Petrov (5215781, P.I.Petrov@student.tudelft.nl) - Niels van der Voort (5243076, N.A.vanderVoort-1@student.tudelft.nl) - Jan Warchocki (5344646, J.Z.Warchocki-1@student.tudelft.nl) **Code:** https://github.com/Jaswar/dl-project **Original paper:** https://nvlabs.github.io/eg3d/ *This blogpost was written as part of the Deep Learning course at Delft University of Technology.* ## Introduction <figure id="fig_1"> <img src="https://hackmd.io/_uploads/HyyUjpLxC.png" alt="Figure 1."/> <figcaption><b>Figure 1. </b> <em>EG3D model structure.</em> </figcaption> </figure> In "EG3D: Efficient Geometry-aware 3D Generative Adversarial Networks" [[1]](#1), the authors introduce a novel approach for synthesizing high-resolution multi-view-consistent images in real-time while producing high-quality 3D geometry. To achieve this, the authors use a tri-plane representation instead of the traditional neural implicit or explicit voxel grid representations in the literature. This novel representation approach provides better efficiency by reducing the size of the decoder and depending more on explicit features. The tri-plane representation is quite simple at its core. Instead of using fully connected layers with positional encodings or encoding an explicit voxel grid, the authors align explicit features along three axis-aligned orthogonal feature planes, each with a resolution of $N \times N \times C$, with $N$ being spatial resolution and $C$ the number of channels. They then project positions onto the plane and aggregate the three feature vectors via summation. Additionally, the authors employ a small neural network to extract color and density from the tri-plane representation. Furthermore, the authors decouple feature generation and neural rendering, allowing them to use state-of-the-art generators. The full architecture of the model can be seen in Figure [[1]](#fig_1). At the time of its proposal, the model achieved state-of-the-art performance on the FFHQ and AFHQ Cats datasets. Additionally, the qualitative assessment showed that the model generates high quality images and 3D models. In this work, we extend the analysis made by the authors of eg3d. We show how style editing can be performed on the model. Furthermore, sometimes the background generated by the model is black. We show in which cases this happens and provide potential reasons for this phenomenon. Finally, we attempt to explain the observation that human eyes face the camera regardless of the pose. ## Feature 1: Style Editing The goal of this first step is to extend the style-mixing presented in Figure 8 of the original paper [[1]](#1). In that figure, two latent vectors are mixed to combine the features of the resulting images. We note that the editing ability of this approach is limited: two images of people without glasses will (most likely) not result in an image of a person with glasses. Therefore, we would like to use style editing instead, where we are allowed to control which features are and which are not present in the output image.  A possible solution for the style-editing problem using latent vectors was proposed in [[2]](#2). The suggested approach, called InterFaceGAN, attempts to find a normal vector, moving along which changes only the desired feature. 
In this work, we extend the analysis made by the authors of eg3d. We show how style editing can be performed on the model. Furthermore, the background generated by the model is sometimes black; we show in which cases this happens and provide potential reasons for this phenomenon. Finally, we attempt to explain the observation that the eyes of generated humans face the camera regardless of the head pose.

## Feature 1: Style Editing

The goal of this first step is to extend the style mixing presented in Figure 8 of the original paper [[1]](#1). In that figure, two latent vectors are mixed to combine the features of the resulting images. We note that the editing ability of this approach is limited: two images of people without glasses will (most likely) not result in an image of a person with glasses. Therefore, we would like to use style editing instead, where we control which features are and which are not present in the output image.

A possible solution to the style-editing problem using latent vectors was proposed in [[2]](#2). The suggested approach, called InterFaceGAN, attempts to find a normal vector such that moving along it changes only the desired feature. As noted by the authors, the approach showed promising results on faces generated by PGGAN [[3]](#3) and StyleGAN [[4]](#4). In this blog, we show how to apply this technique to the models generated by eg3d.

### Background

Let us assume a simple generative model that takes as input a latent vector of only two components ($x$ and $y$) and generates some output image $I$. Suppose we wish to edit a binary feature, for example, whether a person is wearing glasses. In the latent space, we could then mark which latent vectors correspond to people wearing glasses and which to people without glasses. An example of such a graph is shown in Figure 2.

<figure>
<img src="https://hackmd.io/_uploads/ryfpMUyx0.png" alt="Figure 2."/>
<figcaption><b>Figure 2. </b> <em>An example latent space for distinguishing between people with and without glasses. The normal vector found by a linear classifier is denoted in black with the decision boundary in red.</em></figcaption>
</figure>

As we can observe, the latent vectors can be linearly separated between the two classes. Following the approach from InterFaceGAN, imagine we move along the vector perpendicular to this decision boundary (denoted in black). If we start at some latent encoding corresponding to a person with no glasses and move towards the class of people with glasses, we would expect the generated image to resemble a person with glasses more and more. Although this approach makes sense intuitively, two caveats need to be addressed:

1. **The classes need to be linearly separable.** As noted by the authors of InterFaceGAN, the ability to perform arithmetic on latent vectors for style mixing makes this assumption reasonable in practice. Furthermore, we observe a high accuracy of linear classifiers on such a dataset, hinting at the correctness of this assumption.
2. **Moving along the normal vector only changes the desired feature.** This assumption only holds if the representations learned by the generative model are *disentangled*, so that changing one feature does not change the others. Similarly to [[2]](#2), we find that eg3d learns disentangled representations well, allowing this assumption to work in practice.

### Implementation details for eg3d

Since eg3d is based on StyleGAN, similarly to [[2]](#2), we work in the $\mathcal{Z}$-latent space. The exact procedure is based on [[2]](#2) and is as follows:

1. Train a classifier to recognize whether an image contains a person with or without the desired binary feature.
2. Generate a large number of faces (5k–10k) using eg3d. Apart from the generated images, we also store the latent vectors that lead to these images.
3. Using the classifier from Step 1 and the generated images, label each latent vector according to whether it leads to a person with or without the binary feature. The result of this step is a dataset with X being the latent vectors and Y being the classification predictions. This dataset therefore corresponds to Figure 2 from the Background.
4. Train an SVM with a linear kernel and C = 1 to fit the dataset generated in Step 3. The resulting weights give the vector perpendicular to the decision boundary. We normalize this vector and save it to a file.
5. When generating new images using eg3d, we add or subtract this vector from the current latent vector. Adding or subtracting the vector controls the presence or absence of the binary feature. A minimal sketch of Steps 3–5 is shown after this list.

We apply the procedure to two different features: whether the person wears glasses, and gender.
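Steps 3–5 essentially reduce to fitting a linear SVM on the stored latent vectors and reusing its (normalized) weight vector as the editing direction. Below is a minimal sketch using scikit-learn; the file names and array shapes are placeholders for our pipeline, not part of the eg3d codebase.

```python
import numpy as np
from sklearn.svm import SVC

# Step 3: latent vectors (one per generated image) and binary labels
# predicted by the attribute classifier (1 = feature present).
Z = np.load('latents.npy')           # shape (num_images, 512), placeholder file
y = np.load('predicted_labels.npy')  # shape (num_images,), placeholder file

# Step 4: fit a linear SVM and take the normal of its decision boundary.
svm = SVC(kernel='linear', C=1.0)
svm.fit(Z, y)
normal = svm.coef_[0]
normal = normal / np.linalg.norm(normal)   # unit-length editing direction
np.save('glasses_direction.npy', normal)

# Step 5: edit a latent vector by moving along the direction.
z = Z[0]                                   # any latent vector to edit
alpha = 3.0                                # positive adds the feature,
z_edit = z + alpha * normal                # negative removes it
```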
Further details for each implementation are given below.

**Eyeglasses feature.** Here, Step 1 is implemented by fine-tuning ResNet50 [[5]](#5) on the CelebA dataset [[6]](#6). The final classification layer of ResNet50 is replaced with a fully-connected layer with a single output neuron, with batch normalization added before this final layer. The model is trained with the binary cross-entropy loss, a batch size of 128, and the Adam optimizer [[7]](#7) with a learning rate of 0.001. ImageNet [[8]](#8) normalization is applied to all images, and the training images are also flipped horizontally at random.

The resulting classifier obtains a validation accuracy of around 99%. A dummy classifier, always predicting that a person does not wear glasses, obtains around 94% accuracy. The trained classifier can therefore be reliably used in Step 3 of the procedure.

For Step 3, 10000 images are generated using eg3d. A large number is required for this feature, as the number of people without glasses is much larger than the number with glasses. This also follows from [[2]](#2).

The SVM trained in Step 4 obtains a test accuracy of around 88%, compared to the 72% achieved by the dummy classifier. We note that the class priors here differ from those in Step 1, likely because eg3d was trained on a different dataset.

**Gender feature.** We use a pre-trained model [[9]](#9) from HuggingFace to classify gender. In particular, we used a model built on top of HuggingFace's library for fine-tuning vision transformers [[10]](#10), which allows for easy and efficient training of transformer-based models for computer vision. According to the information published on HuggingFace, the model in question has a self-reported accuracy of 99.1%, allowing us to generate gender labels with ease.

Additionally, we produced a Jupyter notebook to simplify the vector generation process. In the notebook, we start by loading the eg3d model and testing it, before generating a dataset of 10000 images and loading the HuggingFace classifier. Afterward, we iterate over the dataset to generate the necessary labels and train an SVM classifier with the same setup as for the eyeglasses feature. With this setup, we achieve a test accuracy of around 82%. Finally, we apply the created vector using the following formula:

$$
\mathit{embedding}' = \mathit{embedding} + \alpha \cdot \mathit{vector}
$$

where <i>embedding</i> is a latent space embedding from eg3d, <i>vector</i> is the feature vector, and $\alpha$ is a scaling factor.

### Results

**Results for the eyeglasses feature.** We present qualitative results of adding and removing eyeglasses from faces generated by eg3d. Figure 3 presents an example of adding glasses to a generated image, while Figure 4 shows an example of removing glasses. As we can observe, the only noticeable difference is the presence/absence of the desired feature. The style editing was therefore successful for this feature.

<figure>
<table>
<tr>
<td> <img src="https://hackmd.io/_uploads/rk025UelC.png"> </td>
<td> <img src="https://hackmd.io/_uploads/rJVpc8leR.png"> </td>
</tr>
</table>
<figcaption><b>Figure 3. </b> <em>Original image (left) and the transformed image (right), obtained by adding the normal vector to the latent vector, resulting in a person with glasses.</em></figcaption>
</figure>

<figure>
<table>
<tr>
<td> <img src="https://hackmd.io/_uploads/S1pfpUgxA.png"> </td>
<td> <img src="https://hackmd.io/_uploads/ByKmpIex0.png"> </td>
</tr>
</table>
<figcaption><b>Figure 4. </b> <em>Original image (left) and the transformed image (right), obtained by subtracting the normal vector from the latent vector, resulting in a person without glasses.</em></figcaption>
</figure>
**Results for the gender feature.**

<figure>
<table>
<tr>
<td> <img src="https://hackmd.io/_uploads/rkJkVDbe0.jpg" alt="Generated image with gender feature factor 0"> </td>
<td> <img src="https://hackmd.io/_uploads/rknqNw-l0.jpg" alt="Generated image with gender feature factor -1"> </td>
<td> <img src="https://hackmd.io/_uploads/Sy2v4DWeA.jpg" alt="Generated image with gender feature factor -2"> </td>
<td> <img src="https://hackmd.io/_uploads/SJ3P4wbgC.jpg" alt="Generated image with gender feature factor -3"> </td>
<td> <img src="https://hackmd.io/_uploads/SkhP4wWeA.jpg" alt="Generated image with gender feature factor -4"> </td>
<td> <img src="https://hackmd.io/_uploads/Bk2wNvbeR.jpg" alt="Generated image with gender feature factor -5"> </td>
</tr>
</table>
<figcaption><b>Figure 5.</b> <em>Progression of applying the gender feature vector with factors 0, -1, -2, -3, -4, -5, respectively.</em></figcaption>
</figure>

<figure>
<table>
<tr>
<td> <img src="https://hackmd.io/_uploads/S1w7vvWlR.jpg" alt="Generated image with gender feature factor 0"> </td>
<td> <img src="https://hackmd.io/_uploads/HJD7vPbxC.jpg" alt="Generated image with gender feature factor 1"> </td>
<td> <img src="https://hackmd.io/_uploads/B1w7vPblC.jpg" alt="Generated image with gender feature factor 2"> </td>
<td> <img src="https://hackmd.io/_uploads/HkDXwD-gA.jpg" alt="Generated image with gender feature factor 3"> </td>
<td> <img src="https://hackmd.io/_uploads/SkDmDwZe0.jpg" alt="Generated image with gender feature factor 4"> </td>
<td> <img src="https://hackmd.io/_uploads/Sy_QDDZe0.jpg" alt="Generated image with gender feature factor 5"> </td>
</tr>
</table>
<figcaption><b>Figure 6.</b> <em>Progression of applying the gender feature vector with factors 0, 1, 2, 3, 4, 5, respectively.</em></figcaption>
</figure>

Figures 5 and 6 show the effect of applying the gender feature vector to a male and a female embedding, respectively. Note that to go from a male to a female embedding we need to subtract the vector rather than add it. Another interesting finding is that scaling embeddings from male to female seems to cause the model to hallucinate or degrade in quality much faster than scaling from female to male. We believe this could be due to an imbalance in the model's training data or a flaw in how we generate the vectors. It would be beneficial to explore the FFHQ dataset to verify this hypothesis.

It also appears that, when scaling from female to male, the vector makes subjects appear older. This could be due to a correlation present in the model itself or an issue with the classifier. The SVM may be learning a correlation between age and being male due to the age distribution of the dataset we generate. It is also possible that the model learns a correlation from the feature distribution of the FFHQ dataset, but we could not verify this. We could mitigate this age-gender interaction by training separate vectors for age and gender and then orthogonalizing them using the Gram-Schmidt procedure [[2]](#2); a sketch of this orthogonalization is shown below. The resulting vectors should be disentangled but, similarly to the FFHQ dataset exploration, we did not look into this for our reproduction blogpost.
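As a sketch of the conditional manipulation mentioned above, suppose we had also trained an age direction (the file names below are hypothetical). A single Gram-Schmidt step removes the age component from the gender direction, so that moving along the result should leave the apparent age unchanged:

```python
import numpy as np

# Hypothetical, pre-computed unit-length editing directions.
gender = np.load('gender_direction.npy')
age = np.load('age_direction.npy')

# Gram-Schmidt step: subtract the projection of the gender direction
# onto the (unit-length) age direction, then re-normalize.
gender_orth = gender - np.dot(gender, age) * age
gender_orth /= np.linalg.norm(gender_orth)

# Editing with gender_orth should now leave the apparent age unchanged.
z = np.load('latent.npy')          # some eg3d latent vector to edit
z_edit = z + 3.0 * gender_orth
```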
## Feature 2: Background Disappearance

A problem that arises whenever the rotation angle of a face becomes too large is the appearance of a black background, as in Figure 7. This black background arises when generating images using the code provided by the authors of the original paper [[1]](#1). However, these black borders are not visible in the figures of the original paper. Since we used the repository created by its authors, we further investigated potential causes.

<figure>
<img src="https://hackmd.io/_uploads/Hkw67X-eR.jpg">
<figcaption><b>Figure 7.</b> <em>The angle_y values used to generate these images are 1.2, 0.8, 0.4, and 0. These angles are applied in both directions.</em></figcaption>
</figure>

In some generated images, this phenomenon is more visible than in others. For this reason, we tried to reproduce the exact seeds that were used in the paper; however, we were not able to find seeds that reproduce the exact images shown there. We are therefore unsure whether those images simply had better background generation, or whether a different training set was used, which could also explain the difference in performance.

As can be seen in Figure 8, the background of the images is part of the generated 3D mesh. Therefore, when part of the 2D image falls outside of this mesh, that part is rendered black. This occurs when the angle is changed beyond what was included in the training data.

<figure>
<table>
<tr>
<td> <img src="https://hackmd.io/_uploads/ry2Dj7We0.png"> </td>
<td> <img src="https://hackmd.io/_uploads/H13_j7-gR.png"> </td>
</tr>
</table>
<figcaption><b>Figure 8. </b> <em>Mesh from the front (left) and the same mesh at an angle (right).</em></figcaption>
</figure>

In Figure 5 of the original paper [[1]](#1), similarly steep angles are shown without visible black borders. The authors state that the most extreme angles correspond to only 0.1% of the training samples, which implies that more extreme angles will most likely not render well, since there is not enough training data to learn them. The black background could therefore be reduced by including more extreme angles in the training set. A rebalanced model would allow the under-represented training data, namely faces at large angles, to be better learned. Besides this, it might be possible to treat the background as a feature and post-process it using a machine learning model. This could be achieved through the method explained in the previous section.
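One way to quantify this effect is to measure the fraction of near-black pixels as a function of angle_y. The sketch below assumes the images have been rendered beforehand with the authors' generation script and saved to disk (the file names are hypothetical); it illustrates the analysis rather than reproducing the eg3d rendering code.

```python
import numpy as np
from PIL import Image

def black_fraction(image, threshold=10):
    """Fraction of pixels that are (almost) pure black in an RGB image."""
    black = np.all(np.asarray(image) <= threshold, axis=-1)
    return float(black.mean())

# Images rendered beforehand at different angle_y values
# (hypothetical file names for one fixed seed).
for angle_y in [0.0, 0.4, 0.8, 1.2]:
    img = Image.open(f"seed0000_angle_{angle_y:.1f}.png").convert("RGB")
    print(f"angle_y={angle_y:.1f}: {black_fraction(img):.1%} black pixels")
```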
## Feature 3: Eye extraction

One of the most famous works of art, the Mona Lisa, is often described as looking at the beholder no matter the angle it is viewed from. While Leonardo da Vinci's goal might have been to create a mesmerizing picture, this system has the goal of creating realistic images. The current issue is that no matter the angle a human face is generated from, the eyes are always looking directly at the camera (Figure 9). While this is still technically a realistic result, it is not desired: the system should generate images in which the eyes follow the head pose.

<figure>
<table>
<tr>
<td> <img src="https://kienhuis.eu/moving.gif"> </td>
</tr>
</table>
<figcaption><b>Figure 9. </b> <em>We can see that the eyes are tracking the camera.</em></figcaption>
</figure>

The main cause seems to be visible within the 3D model. When extracting just the eyes, we can see that in most scenarios the eyes are rendered as a hollow shape. The sampled depth image can be used to verify the hollowness of the eyes. An example of this phenomenon is shown in Figure 10.

<figure>
<table>
<tr>
<td> <img src="https://hackmd.io/_uploads/SJc1u4meR.png"> </td>
</tr>
</table>
<figcaption><b>Figure 10. </b> <em>Mesh zoomed in on the eyes (left) and the corresponding depth image (right).</em></figcaption>
</figure>

Notably, this issue does not appear when generating cat faces (Figure 11); it only appears when generating human faces. As the model remains a black box, we can only theorize about the true nature of this emergent property.

<figure>
<table>
<tr>
<td> <img src="https://kienhuis.eu/cats.gif"> </td>
</tr>
</table>
<figcaption><b>Figure 11. </b> <em>In cats the issue is less prevalent.</em></figcaption>
</figure>

**Training data** is often the first scapegoat when it comes to inaccurate models, and an important step in every analysis is to scrutinize the data used for training. Humans are aware of cameras, and most photographs show people looking directly at the camera, as if they are having their picture taken. The paper mentions the use of the FFHQ dataset [[11]](#11), which is scraped from [Flickr](https://flickr.com), a public website for image hosting and photography enthusiasts. FFHQ warns that, while the dataset is varied in age, ethnicity, and accessories (such as glasses and hats), all biases of the Flickr website are present in the dataset. Most importantly for us, this includes head poses and eye directions. The eg3d authors are aware of some of these limitations, as stated in their section about modeling pose-correlated attributes such as smiling, yet they do not mention this particular point. It would therefore make sense that, within the human training data, the eyes are almost always looking at the camera, a bias the eg3d model cannot account for. As shown in Figure 12, this is indeed the case.

<figure>
<table>
<tr>
<td> <img src="https://raw.githubusercontent.com/NVlabs/ffhq-dataset/master/ffhq-teaser.png"> </td>
</tr>
</table>
<figcaption><b>Figure 12. </b> <em>Teaser image of the FFHQ dataset at the time of writing. Notice that all faces above toddler age are looking directly at the camera.</em></figcaption>
</figure>

If the eyes always look at the camera, then the model is trained to simulate this as well. Since the model cannot generate the eyes separately from the head position, it can only approximate this behavior by creating hollow eyes. With hollow eyes, the view direction matters less and the eyes appear to stay focused on the camera. Cats, however, do not have this camera-viewing instinct, and therefore their images contain more variance in eye direction, allowing the model to generalize better.

A potential fix would therefore be to apply a stronger filter to the training data: use a gaze estimation model together with a head pose estimation model, and only keep an image if the two directions correspond, i.e., the person is looking straight ahead relative to their head pose. This should result in a training set in which the eyes can be treated as a convex component of the head that always looks forward. A sketch of such a filter is given below.
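A minimal sketch of such a filter, assuming access to a gaze estimator and a head pose estimator that both return yaw and pitch angles in degrees; `estimate_gaze` and `estimate_head_pose` are hypothetical placeholders, not calls to an existing library.

```python
import numpy as np

def gaze_matches_pose(image, max_diff_deg=10.0):
    """Keep an image only if the eyes look where the head points.

    `estimate_gaze` and `estimate_head_pose` are hypothetical models that
    return (yaw, pitch) angles in degrees for the given image.
    """
    gaze_yaw, gaze_pitch = estimate_gaze(image)
    head_yaw, head_pitch = estimate_head_pose(image)
    diff = np.hypot(gaze_yaw - head_yaw, gaze_pitch - head_pitch)
    return diff <= max_diff_deg

# Hypothetical filtering pass over the raw training images.
filtered_dataset = [img for img in dataset if gaze_matches_pose(img)]
```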
**Resolution** is also an important factor: for a system to generalize properly, high-frequency samples can be required to model complex structures. Eyes are only a small part of the entire image, and the internal renderings are only 128x128 pixels before they are passed through the super-resolution component. Given that the eye socket region between the eyebrows and the cheek spans only a few pixels at this resolution, there is little room to model a non-hollow eye socket. Combined with the training-data bias described above, this could allow the hollow-eye behavior to emerge. To verify this, the model would have to be retrained with a higher underlying resolution. This might be infeasible, as using a low-resolution internal structure followed by a super-resolution component was a deliberate choice made to limit compute time.

## Conclusion

In this blogpost, we expanded the analysis of the eg3d model [[1]](#1). We showed how the approach from InterFaceGAN [[2]](#2) can be successfully applied to perform style editing in the model. We were able to successfully change the gender feature and whether a person is wearing glasses, and we described how other features could be changed using this method.

Furthermore, we explored why the background appears black when extreme camera angles are used. We find that the background is generated as part of the 3D mesh. Therefore, if the camera angle reaches beyond this mesh, there is nothing to render and the background becomes black.

Finally, we explored the hollowness and the orientation of eyes in the generated images. We find that in the FFHQ dataset [[11]](#11), on which eg3d was trained, humans frequently look directly at the camera. Since the model is unaware that the eyes can move separately from the head, it simulates this camera-tracking behavior by creating hollow eyes. Generated animal faces do not have this issue.

## Work distribution

| Person | Code | Blogpost |
| -------- | -------- | -------- |
| Maurits Kienhuis | Feature 3 | Feature 3 |
| Petar Petrov | Kaggle setup + Feature 1 (gender feature) | Introduction + Feature 1 (gender) |
| Niels van der Voort | Feature 2 | Feature 2 |
| Jan Warchocki | Feature 1 (base model + eyeglasses) | Introduction + Feature 1 (background, implementation, eyeglasses) + Conclusion |

## References

<a id="1">[1]</a> Chan, E. R., Lin, C. Z., Chan, M. A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L., Tremblay, J., Khamis, S., Karras, T., & Wetzstein, G. (2021). Efficient Geometry-aware 3D Generative Adversarial Networks. https://doi.org/10.48550/ARXIV.2112.07945

<a id="2">[2]</a> Shen, Y., Gu, J., Tang, X., & Zhou, B. (2019). Interpreting the Latent Space of GANs for Semantic Face Editing. https://doi.org/10.48550/ARXIV.1907.10786

<a id="3">[3]</a> Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive Growing of GANs for Improved Quality, Stability, and Variation. https://doi.org/10.48550/ARXIV.1710.10196

<a id="4">[4]</a> Karras, T., Laine, S., & Aila, T. (2018). A Style-Based Generator Architecture for Generative Adversarial Networks. https://doi.org/10.48550/ARXIV.1812.04948

<a id="5">[5]</a> He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. https://doi.org/10.48550/ARXIV.1512.03385

<a id="6">[6]</a> Liu, Z., Luo, P., Wang, X., & Tang, X. (2015, December). Deep Learning Face Attributes in the Wild. Proceedings of International Conference on Computer Vision (ICCV).
<a id="7">[7]</a> Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. https://doi.org/10.48550/ARXIV.1412.6980

<a id="8">[8]</a> Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. https://doi.org/10.1109/CVPR.2009.5206848

<a id="9">[9]</a> rizvandwiki. (n.d.). gender-classification-2. Hugging Face. https://huggingface.co/rizvandwiki/gender-classification-2

<a id="10">[10]</a> nateraw. (n.d.). HuggingPics: Fine-tune Vision Transformers for anything using images found on the web. GitHub. https://github.com/nateraw/huggingpics

<a id="11">[11]</a> NVlabs. (n.d.). Flickr-Faces-HQ Dataset (FFHQ). GitHub. https://github.com/NVlabs/ffhq-dataset