# Reproduction of the HouseGAN paper
## Introduction
The paper by Nelson Nauata et al. presents a novel graph-constrained generative adversarial network for a new house layout generation problem: take an architectural constraint in the form of a bubble diagram (a graph specifying the number and types of rooms and their spatial adjacency) and produce a set of axis-aligned bounding boxes for the rooms. The paper also employs a convolutional message passing neural network (Conv-MPN), which differs from a graph convolutional network (GCN). The authors argue that this architecture enables more effective higher-order reasoning for composing layouts and validating adjacency constraints.
We aim to reproduce the results shown in the paper by using the existing code (making some changes as required to fit our dataset).
First, we tried to replicate the results on the LIFULL HOME's dataset used in the paper, and then tried four different modifications to see how the results changed. The alterations are described below:
1. Removing the CMP layers, edges, and room-type features from the discriminator.
1. Replacing the CMP layer with a plain CNN; since CMP is removed, the edge and room-type information it requires is no longer needed and is dropped as well.
1. Changing sum pooling to average pooling in the CMP layer.
1. Changing the number of neurons in particular layers.
## Original Model Structure
It is crucial to understand the original paper and architecture for reproduction purposes. Starting from the dataset, we explain the data and the HouseGAN architecture used in this paper.
### Data Set
This paper uses bubble diagrams and images of house layouts. These come from the LIFULL HOME's database, from which the authors extract 117,587 house layouts.
#### Bubble diagrams
The bubble diagrams are not raw data; they are derived by applying an algorithm to the layouts. Every room is a node in the bubble diagram, carrying room-type information such as living room, kitchen, or bedroom. The authors provide the preprocessed data in their GitHub repository. The figure below is an example of the bubble diagrams reconstructed by Nelson Nauata et al.
![](https://i.imgur.com/rbOy8Vc.png)
#### House layout
Rooms are represented as axis-aligned bounding boxes in these images. Two rooms are assumed to be connected if the Manhattan distance between their bounding boxes is less than 8 pixels.
![](https://i.imgur.com/LgwN5EA.png)
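To make this adjacency rule concrete, here is a minimal sketch of how such a check could be implemented; the 8-pixel threshold follows the paper, while the box-gap computation and the function names are our own assumptions:
```python
def manhattan_gap(box_a, box_b):
    """Axis-aligned gap between two boxes given as (x0, y0, x1, y1).

    Returns 0 when the boxes touch or overlap; otherwise the sum of the
    horizontal and vertical gaps, i.e. a Manhattan-style distance.
    """
    dx = max(0, box_a[0] - box_b[2], box_b[0] - box_a[2])
    dy = max(0, box_a[1] - box_b[3], box_b[1] - box_a[3])
    return dx + dy


def rooms_adjacent(box_a, box_b, threshold=8):
    # Two rooms count as connected when their boxes lie within
    # `threshold` pixels of each other (8 px in the paper).
    return manhattan_gap(box_a, box_b) < threshold
```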
### GAN Structure
HouseGAN, just like a normal GAN, has a generator and a discriminator. The crucial mechanism used in the paper is convolutional message passing (CMP), which allows the network to learn the representations the authors want to preserve, such as room type and room adjacency information. The following subsections explain the HouseGAN and CMP structures in detail.
#### Generator
The following step list and flow chart give a general idea of the generator.
* step 1: The generator takes n x 128 x 1 Gaussian noise and an n x 10 x 1 one-hot encoding as input, where n is the total number of rooms in a batch. The 10 x 1 one-hot encoding represents the room type; there are 10 different room types.
* step 2: The noise and room-type inputs are concatenated, forming a new input of dimension n x 138 x 1.
* step 3: Apply a linear layer. The output size becomes n x 1024 x 1.
* step 4: Reshape the input. The output has dimension n x 16 x 8 x 8.
* step 5: Extract both adjacent and non-adjacent room information for each room and sum them up separately, forming an adjacent feature of size n x 16 x 8 x 8 and a non-adjacent feature of size n x 16 x 8 x 8. Then concatenate these two features with the original input along the second (channel) dimension. The output size is n x 48 x 8 x 8. The feature extraction uses the edge information, which is given.
* step 6: Pass through 3 CNN layers, each using LeakyReLU as the activation function. With the specified stride and padding, these 3 layers only compress the number of channels. The outputs of the CNNs are n x 32 x 8 x 8, n x 32 x 8 x 8, and n x 16 x 8 x 8. Steps 5 and 6 together form the CMP layer (see the sketch after the flow chart below).
* step 7: Upsample the input (kernel size 4, stride 2, padding 1) and apply LeakyReLU. The output has size n x 16 x 16 x 16.
* step 8: Repeat steps 5 and 6, but this time on images of size 16 x 16 instead of 8 x 8. After passing through the CMP layer, the output has size n x 16 x 16 x 16.
* step 9: Upsample the input (kernel size 4, stride 2, padding 1) and apply LeakyReLU; the output becomes n x 16 x 32 x 32.
* step 10: Pass through a decoder, a 3-layer CNN using LeakyReLU for the first two layers and tanh for the last (kernel size 3, stride 1, padding 1). The output sizes of the layers are n x 256 x 32 x 32, n x 128 x 32 x 32, and n x 1 x 32 x 32. The final outputs are the GAN-generated room masks.
```flow
st=>start: input: Noise + Nodes
op=>operation: Linear layer & Reshape
op2=>operation: CMP layer (Need edges as argument)
op3=>operation: Upsample layer
op4=>operation: CMP layer (Need edges as argument)
op5=>operation: Upsample layer
op6=>operation: Decoder layer
e=>end: Output
st->op->op2->op3->op4->op5->op6->e
```
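To make steps 5 and 6 concrete, here is a minimal PyTorch sketch of one CMP block. The tensor shapes follow the step list above, but the module layout, the edge encoding (+1 for adjacent room pairs, -1 for non-adjacent ones), and the LeakyReLU slope are our assumptions, not the authors' exact code:
```python
import torch
import torch.nn as nn

class CMPBlock(nn.Module):
    """Steps 5-6: pool neighbour features, then compress channels with 3 convs."""
    def __init__(self, channels=16):
        super().__init__()
        # Kernel 3, stride 1, padding 1 keeps the spatial size, so the three
        # convolutions only change the channel count: 48 -> 32 -> 32 -> 16.
        self.convs = nn.Sequential(
            nn.Conv2d(3 * channels, 2 * channels, 3, 1, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(2 * channels, 2 * channels, 3, 1, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(2 * channels, channels, 3, 1, 1), nn.LeakyReLU(0.1),
        )

    def forward(self, feats, edges):
        # feats: n x 16 x H x W room features; edges: list of (i, j, label)
        # with label +1 for adjacent rooms and -1 for non-adjacent ones.
        pos = torch.zeros_like(feats)   # sum pool over adjacent rooms
        neg = torch.zeros_like(feats)   # sum pool over non-adjacent rooms
        for i, j, label in edges:
            target = pos if label > 0 else neg
            target[i] += feats[j]
            target[j] += feats[i]
        # Concatenate along the channel dimension: n x 48 x H x W.
        return self.convs(torch.cat([feats, pos, neg], dim=1))
```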
#### Discriminator
The following step list and flow chart give a general idea of the discriminator.
* Step 1: The discriminator takes as input a mask of size n x 32 x 32 x 1, either from the output of the generator or from a real floor plan, where n is the total number of rooms in a batch.
* Step 2: A linear layer with 8,192 output neurons is applied to a second input, the node (room-type) vector of size 10 x 1. The output is reshaped from 8,192 to n x 32 x 32 x 8.
* Step 3: Concatenate the masks with the output of step 2. The output becomes n x 32 x 32 x 9.
* Step 4: Feed the output of step 3 into three convolutional layers. Because padding and stride are both 1 (padding = 1 here means padding both the width and the height by 1), the output size is determined only by the spatial dimension of the input and the channel dimension of the final CNN, whose (input channels, output channels, kernel size, kernel size) = (16, 16, 3, 3). Therefore, the output is n x 32 x 32 x 16.
* Step 5: Feed the output of step 4 through two rounds of CMP and downsampling. The output size is n x 8 x 8 x 16.
* Step 6: Feed the output of step 5 into three CNN layers. Because each of the three CNN layers has stride 2 and padding 1, the spatial dimension is halved at every layer: 8 x 8, then 4 x 4, 2 x 2, and finally 1 x 1. The final CNN has (input channels, output channels, kernel size, kernel size) = (128, 128, 3, 3), so the output size is n x 1 x 1 x 128.
* Step 7: One subtle but important point: all previous steps operate on each room independently. In this step, rooms from the same graph are merged. The output of step 6 is n room features, each of dimension 1 x 1 x 128. The paper merges rooms belonging to the same graph with a sum pool, giving an output of size m x 1 x 1 x 128, where m is the number of examples in the batch. We then reshape it from m x 1 x 1 x 128 to m x 128 (see the sketch after the flow chart below).
* Final step: Feed the reshaped output of step 7 into a linear layer with a single neuron. The output size is 1.
```flow
st=>start: Input: Nodes
op=>operation: Linear layer & Reshape
op2=>operation: Concatenation with Masks
op3=>operation: 3 CNN layers
op4=>operation: CMP layers (Need edges as argument) & Downsample
op5=>operation: 3 CNN layers
op6=>operation: Sum pooling & Reshape
op7=>operation: Linear layer
e=>end: Output
st->op->op2->op3->op4->op5->op6->op7->e
```
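Step 7 is the only place where rooms from the same graph interact in the discriminator, so here is a small sketch of how that per-graph sum pooling could look in PyTorch; the function name and the `graph_ids` bookkeeping are hypothetical:
```python
import torch

def pool_rooms_per_graph(room_feats, graph_ids, num_graphs):
    """Merge room features belonging to the same graph (step 7).

    room_feats: n x 128 tensor, one row per room (already flattened).
    graph_ids:  length-n LongTensor mapping each room to its graph index.
    Returns an m x 128 tensor, one row per graph (m = num_graphs).
    """
    pooled = torch.zeros(num_graphs, room_feats.size(1))
    pooled.index_add_(0, graph_ids, room_feats)  # sum pool per graph
    return pooled

# e.g. five rooms split over two graphs:
# pool_rooms_per_graph(torch.randn(5, 128), torch.tensor([0, 0, 0, 1, 1]), 2)
```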
### Convolutional Message Passing Neural Network (Conv-MPN / CMP)
Conv-MPN is a variant of a graph neural network (GNN), and learns to infer relationships of nodes by exchanging messages. Conv-MPN is specifically designed for cases where a node has an explicit spatial embedding, and makes two key distinctions from a standard message passing neural network (MPN):
1) the feature of a node is represented as a 3D volume as in CNNs instead of a 1D vector; and
2) convolutions encode messages instead of fully connected layers or matrix multiplications. This design allows Conv-MPN to exploit the spatial information associated with the nodes.
The Conv-MPN in this paper takes the standard MPN architecture and replaces:
1) the latent vector with a latent 3D volume for the feature representation;
2) the fully connected layers (or matrix multiplications) with convolutions for the message encoding.
<u>Convolutional message passing</u>
The Conv-MPN module updates a graph of room-wise feature volumes via convolutional message passing. Because a node feature spreads across a volume, a simple pooling can keep all the information in a message without collisions. Instead of encoding a message for every pair of nodes, the paper simply pools features across all neighboring nodes to encode a message, followed by a CNN to update the feature volume.
Conv-MPN updates the feature volume by:
1) concatenating a sum-pooled feature across rooms that are connected in the graph;
2) concatenating a sum-pooled feature across non-connected rooms;
3) applying a CNN:
$$
g^l_r \leftarrow \text{CNN}\left[\, g^l_r \;;\; \text{Pool}_{s \in N(r)}\, g^l_s \;;\; \text{Pool}_{s \in \overline{N}(r)}\, g^l_s \,\right]
$$
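Written as code, this update maps almost one-to-one onto the formula. The sketch below reflects our reading of the paper; the names are hypothetical and `cnn` stands in for the CNN[...] term:
```python
import torch

def conv_mpn_update(g, r, neighbours, non_neighbours, cnn):
    """One Conv-MPN update for room r's feature volume.

    g:              n x C x H x W tensor of room feature volumes g^l
    neighbours:     list of room indices in N(r)
    non_neighbours: list of room indices in the complement of N(r)
    cnn:            a conv block mapping 3C -> C channels
    """
    zeros = torch.zeros_like(g[r])
    pool_adj = g[neighbours].sum(dim=0) if neighbours else zeros
    pool_non = g[non_neighbours].sum(dim=0) if non_neighbours else zeros
    stacked = torch.cat([g[r], pool_adj, pool_non], dim=0)  # 3C x H x W
    return cnn(stacked.unsqueeze(0)).squeeze(0)             # back to C x H x W
```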
## Experiments
We conducted five sets of experiments on Google Cloud/Colab using up to four GPUs. These five runs comprise the original model from the paper and four modified network structures. Due to computational limits, each model is trained for only 20 epochs; one epoch takes roughly 20 minutes on 4 GPUs and 45 minutes on 1 GPU.
### Adjust pooling methods
The Conv-MPN module updates the graph by concatenating a sum-pooled feature across rooms, for both connected and non-connected rooms. In the Conv-MPN paper, the authors explain that a simple pooling can keep all the information in a message without collisions, because a node feature spreads across a volume. Our goal is to find out how much the result is affected by changing sum pooling to average pooling.
The features we derive are representations of individual rooms. When we concatenate adjacent and non-adjacent features, these features should also represent one room rather than the sum of the features of all rooms. Therefore, we use average pooling to represent a single room's feature, as in the sketch below.
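Relative to the CMP block sketched earlier, the change amounts to a single division by the neighbour counts. The helper below is our own; `pos_count` and `neg_count` are hypothetical per-room counters we assume are accumulated alongside the pooled sums:
```python
import torch

def mean_pool(summed, counts):
    """Turn a sum-pooled feature (n x C x H x W) into an average by
    dividing each row by the number of rooms that contributed to it."""
    counts = torch.clamp(counts.float(), min=1.0)  # guard empty neighbourhoods
    return summed / counts.view(-1, 1, 1, 1)

# Inside the CMP forward pass, the variant then reads:
#   pos = mean_pool(pos, pos_count)   # pos_count: adjacent rooms per node
#   neg = mean_pool(neg, neg_count)   # neg_count: non-adjacent rooms per node
```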
### Remove edge and room-type information in the discriminator
From the lecture notes, we learned that a GAN discriminator tries to discriminate between real and fake images. The HouseGAN discriminator takes three inputs, room masks, edges, and room types, which adds computational complexity. We would like to test whether the discriminator can still learn without the additional information (edges and room types) and thereby lower the computational cost. Hence, in one design the CMP and room-type related layers are removed from the discriminator, and in the other design the CMP layer is replaced by a simple 3-layer CNN.
#### Remove CMP layers in discriminator
Two CMP layers in the discriminator are removed, which tremendously decreases the complexity. The input and output sizes are adjusted to fit the remaining layers.
#### Replace CMP layers with CNN layers in discriminator
In the 2015 deep convolutional GAN paper [https://arxiv.org/abs/1511.06434], the authors showed that strided CNNs in the discriminator can give good results. Therefore, we would like to know whether replacing each CMP layer with three CNN layers yields a similar result. Each of the three CNN layers has (C_in, C_out, kernel_size, p, s) = (16, 16, 3, 1, 1), as in the sketch below. Since CMP is removed, the edge and room-type information is no longer used.
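A sketch of what one such replacement block could look like; the layer parameters follow the values above, while the LeakyReLU slope and the variable name are our assumptions:
```python
import torch.nn as nn

# Each CMP layer is swapped for three plain convolutions with
# (C_in, C_out, kernel, padding, stride) = (16, 16, 3, 1, 1); no edge or
# room-type input is consumed.
cmp_replacement = nn.Sequential(
    nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1), nn.LeakyReLU(0.1),
    nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1), nn.LeakyReLU(0.1),
    nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1), nn.LeakyReLU(0.1),
)
```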
### Reduce the number of neurons
While the paper goes into depth about which types of hidden layers are used in the network and how many neurons are present in each layer, it offers no motivation for why these particular neuron counts were chosen. We thus assume the counts were chosen somewhat arbitrarily. This made us wonder whether altering the number of neurons in the existing layers would still produce comparable results.
In this variation, instead of adding a 32x32x8 tensor to the segmentation mask at the start of the discriminator, a 32x32x6 tensor is added. Another change was made during the upsampling: instead of the 32x32x9 feature being upsampled to 32x32x16, the 32x32x7 feature is upsampled to 32x32x12. The number of layers and their types remain the same; only the number of neurons in some layers is altered, as sketched below.
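The sketch below contrasts the two configurations; the layer names are hypothetical and only the sizes are taken from the description above:
```python
import torch.nn as nn

# Room-type embedding at the start of the discriminator:
embed_original = nn.Linear(10, 8 * 32 * 32)  # 8,192 outputs -> 32 x 32 x 8
embed_reduced  = nn.Linear(10, 6 * 32 * 32)  # 6,144 outputs -> 32 x 32 x 6

# First convolution after concatenating with the 32 x 32 x 1 mask:
conv_original = nn.Conv2d(1 + 8, 16, kernel_size=3, stride=1, padding=1)
conv_reduced  = nn.Conv2d(1 + 6, 12, kernel_size=3, stride=1, padding=1)
```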
## Results
Five sets of results were generated and cross-compared by our architecture member for a subjective score. Originally we decided to use the FID score, but that requires generating 50,000 fake images against 5,000 real ones. This large number of images repeatedly shut down the Jupyter notebook connection to the virtual machine, and we did not find a solution.
### Pooling
CMP learns to infer relationships by exchanging messages during feature extraction.
It represents the feature associated with a node as a feature volume and utilizes CNN for message passing while retaining the standard message-passing neural architecture.
After comparing the results, the sum-pooling variant from the original paper performs much the same as our average-pooling variant.
We assume this is because average pooling smooths out the image, so sharp features may not be identified, while sum pooling measures the summed evidence for a pattern in a given region. In short, sum pooling is a scaled version of mean pooling (the scale being the number of pooled rooms), and although the non-linear layers in the model mean the two are not strictly equivalent, the difference is subtle, so the outcomes should not differ much.
However, with average pooling we can see that the proportional size of a single room is often badly out of scale compared to sum pooling; we argue this is because room-size information is not included in the CMP layer.
![](https://i.imgur.com/DnIKlij.png)
### Remove edge and room-type information in the discriminator
#### Remove CMP
The images generated with the CMP-free discriminator are shown in the figure below.
![](https://i.imgur.com/Ztgj3yQ.jpg)
We believe that the CMP and room-type information are useful to the discriminator. Without this information, the generated room maps lose the edge structure and all rooms stack together.
#### Replace CMP layers with CNN layers
The results look poor. This architecture mostly generates layouts of small sizes, and the relations between nodes are not expressed in most outputs. This further confirms our belief that the edge and node information is important.
![](https://i.imgur.com/pCUgv6I.png)
### Neuron reduction
The results are quite similar to those of the original. While the rooms are structured differently, there is no clear difference in quality between the original method and this variation. This indicates that the network functions similarly with fewer neurons, implying some redundancy might be present in the original network.
Whether the optimal number of neurons was used in the original experiment or in ours is hard to tell, but this does suggest there is merit in testing more neuron combinations to figure out which perform optimally. Another interesting direction would be finding the minimum number of neurons the network can use while still producing visually similar results to the original.
![](https://i.imgur.com/vVF4Z5m.png)
### Overall Comparison
It can be seen that the model with average pooling and the model with reduced neurons give similar results to the original model. For the models that remove the edge and room-type information from the discriminator, the outputs are clearly worse than the original.
|Input graph |Original|Pooling |No CMP |CNN|Neuron reduction|
|--|--|--|--|--|--|
|![](https://i.imgur.com/sN4fslQ.png)|![](https://i.imgur.com/OJo6e9Q.png)|![](https://i.imgur.com/7rPcNWq.png)|![](https://i.imgur.com/61Mwunh.jpg)|![](https://i.imgur.com/jeDmYHC.png)|![](https://i.imgur.com/fS4Iklx.png)|
|![](https://i.imgur.com/o90Coia.png)|![](https://i.imgur.com/82pnBqD.png)|![](https://i.imgur.com/UHbsAzs.png)|![](https://i.imgur.com/iwnUdIK.jpg)|![](https://i.imgur.com/4EyVtfA.png)|![](https://i.imgur.com/rcVWoV8.png)|
|![](https://i.imgur.com/apoyeFR.png)|![](https://i.imgur.com/yGu4tFi.png)|![](https://i.imgur.com/qO8Mmrl.png)|![](https://i.imgur.com/r362h37.jpg)|![](https://i.imgur.com/8PQ44iU.png)|![](https://i.imgur.com/jKvQPlI.png)|
|![](https://i.imgur.com/qT3zETK.png)|![](https://i.imgur.com/twlsQeH.png)|![](https://i.imgur.com/jyghfzw.png)|![](https://i.imgur.com/aX6ggeA.jpg)|![](https://i.imgur.com/QETL8B6.png)|![](https://i.imgur.com/NTd6B4t.png)|
## Conclusion
In the past couple of weeks we have taken an in-depth look at the paper and the provided code. We also studied every method and network used in the paper in order to fully understand how the GAN functions. We applied this knowledge to make small alterations to the network in order to test the importance of specific parts. With the alteration replacing sum pooling with average pooling, as well as the one reducing the number of neurons in the network, we obtained results very similar to the baseline, meaning these specific factors were not critical to the results. Our other two alterations showed the importance of the edge and room-type features, as these methods left those features out and did not generate adequate layouts. Ultimately, we are pleased to see the variation in results between the different methods, considering the hardware limitations that did not allow us to train the models extensively or generate an FID score.