# A summary of plausible multi-object scene synthesis using structured latent spaces: RELATE
paper link: https://arxiv.org/abs/2007.01272
# Motives of the paper
This paper introduces the synthesis of scenes and videos of multiple interacting objects using a structured latent space, built around an architecture that combines a GAN-based formulation with a model accounting for the correlation between individual objects.
# Problems addressed by the paper
While previous generative models have given satisfactory results in scene synthesis, there is one significant drawback involving the latent space that parameterizes the resulting images: it is not physically interpretable.
Take the example of the original GAN architecture. The latent space was just a collection of random noise vectors that were mapped to the image samples. The random noise acting as the parameterization cannot, as such, be interpreted physically.
More on latent spaces: https://medium.com/@jain.yasha/gan-latent-space-1b32cd34cfda
One interesting solution to this was the BlockGAN architecture, which had an interpretable parameter space taking into account the orientation and position of the objects. It too had a drawback, though, due to the assumption that objects are independent of one another.
# Method described
RELATE comprises two modules:
1. Interaction Module: This module is responsible for computing and capturing the correlation between the objects.
2. Scene Composition and Rendering Module: This module composes and renders the scene from the parameter space comprising the appearance and position vectors.
# Model Architecture

In any modeling task, the very first step is to decide on and initialize a set of parameters that lays the foundation for the rest of the pipeline. The model considers scenes containing K distinct objects, and appearance and pose are the two main parameters focused upon.
### How can foreground and background be separated, and how did they generate the input data?
The model starts by sampling appearance parameters $z_1, \dots, z_K$ for each individual foreground object, as well as a parameter $z_0$ for the background. These parameters are analogous to the noise vectors used in the earlier GAN setup.
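As a minimal illustration (not the authors' code), this sampling step can be sketched in PyTorch; the number of objects `K` and the latent dimensionality `z_dim` are arbitrary assumptions here:

```python
import torch

K = 4        # assumed number of foreground objects
z_dim = 64   # assumed latent dimensionality

z_bg = torch.randn(1, z_dim)   # z_0: background appearance
z_fg = torch.randn(K, z_dim)   # z_1, ..., z_K: one appearance vector per object
```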
Now comes the core part of the model: modeling the correlation between the objects inherent in a scene. For this, the pose parameters are first sampled as a vector of K i.i.d. (independent and identically distributed) poses $\hat{\Theta}$. This vector of poses is then passed to a 'correction' network $\Gamma$ (refer to the figure), which maps it to a final configuration that accounts for the correlation between the objects and the background, $\Theta := \Gamma(\hat{\Theta}, Z)$.
The pose parameters are updated according to the relation below:

$$\theta_k = \hat{\theta}_k + \zeta\left(\hat{\theta}_k, z_k \mid z_0, \{z_i, \hat{\theta}_i\}_{i \geq 1,\, i \neq k}\right)$$
*Each foreground object is associated with a pose parameter $\theta_k$, which is geometrically interpretable. In this case, $\theta_k$ is assumed to be a 2D translation, applied to $\Psi_k$.*
The function $\zeta$ is implemented in a manner similar to the Neural Physics Engine (NPE).
more on NPE: https://arxiv.org/abs/1612.00341
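Below is a hedged PyTorch sketch of what such a correction could look like: each object's pose offset is predicted from its own pose and appearance, the background code $z_0$, and a sum of pairwise NPE-style interaction terms over the other objects. The module name `PoseCorrection`, the layer sizes, and the exact inputs are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PoseCorrection(nn.Module):
    """Hypothetical sketch of the correction zeta: predicts a pose offset per object
    from its own (pose, appearance), the background z_0, and summed pairwise
    NPE-style interaction features with the other objects."""
    def __init__(self, z_dim=64, pose_dim=2, hidden=128):
        super().__init__()
        # pairwise encoder: (theta_k, z_k, theta_i, z_i) -> interaction feature
        self.pair = nn.Sequential(
            nn.Linear(2 * (pose_dim + z_dim), hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        # decoder: (theta_k, z_k, z_0, aggregated interactions) -> pose offset
        self.out = nn.Sequential(
            nn.Linear(pose_dim + 2 * z_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim))

    def forward(self, theta_hat, z_fg, z_bg):
        # theta_hat: (K, pose_dim), z_fg: (K, z_dim), z_bg: (1, z_dim)
        K = theta_hat.shape[0]
        states = torch.cat([theta_hat, z_fg], dim=-1)            # (K, pose+z)
        pairs = torch.cat([states.unsqueeze(1).expand(K, K, -1),
                           states.unsqueeze(0).expand(K, K, -1)], dim=-1)
        inter = self.pair(pairs)                                 # (K, K, hidden)
        mask = 1.0 - torch.eye(K).unsqueeze(-1)                  # drop i == k terms
        agg = (inter * mask).sum(dim=1)                          # (K, hidden)
        inp = torch.cat([theta_hat, z_fg, z_bg.expand(K, -1), agg], dim=-1)
        return theta_hat + self.out(inp)                         # corrected poses
```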
The paper discusses two types of correlation in detail: the order of the objects in a scene and their dynamics.
### Modeling the order of the objects
To preserve the natural ordering of the objects, the parameters $\theta_k$ are simply modelled as a Markov process.
This is done by first sampling $\hat{\theta}_1$, applying a correction to account for the background $z_0$ as before, and finally sampling the other objects in sequence:

$$\theta_1 = \hat{\theta}_1 + f_0(\hat{\theta}_1, z_1, z_0), \qquad \forall k > 1: \theta_k = \theta_{k-1} + f_1(\theta_{k-1}, z_{k-1}, z_0)$$
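A minimal sketch of this sequential sampling, assuming `f0` and `f1` are small learned networks returning pose offsets (both names are stand-ins, not the paper's identifiers):

```python
import torch

def sample_ordered_poses(K, z_fg, z_bg, f0, f1, pose_dim=2):
    """Markov-style placement: the first pose is corrected against the background,
    every later object is placed as an offset from its predecessor."""
    theta = []
    theta_hat_1 = torch.randn(pose_dim)                       # initial i.i.d. pose
    theta.append(theta_hat_1 + f0(theta_hat_1, z_fg[0], z_bg))
    for k in range(1, K):
        # theta_k = theta_{k-1} + f1(theta_{k-1}, z_{k-1}, z_0)
        theta.append(theta[k - 1] + f1(theta[k - 1], z_fg[k - 1], z_bg))
    return torch.stack(theta)                                 # (K, pose_dim)
```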
### Modeling the dynamics
For this, the initial positions $\theta_k(0)$ are sampled and then updated incrementally, in a similar way to the general correlation update given above:

$$\theta_k(t+1) = \theta_k(t) + v_k(t+1)$$

**To obtain $v_k(t+1)$, let $V_k(t) = [v_k(t-i)]_{i=2,1,0}$ denote the last three velocities of the k-th object. The initial value $V_k(0) = e_v(z_k, z_0, \theta_k(0))$ is initialized as a function of the appearance parameters and initial positions, and NPE-style updates are performed to obtain the subsequent velocities.**
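A rough sketch of the resulting rollout, where `velocity_update` stands in for the NPE-style network that predicts $v_k(t+1)$ from the current positions and the velocity history (an assumption for illustration, not the paper's code):

```python
import torch

def rollout(theta0, V0, velocity_update, T):
    """theta0: (K, 2) initial positions, V0: (K, 3, 2) last three velocities per object."""
    theta, V = theta0, V0
    trajectory = [theta0]
    for t in range(T):
        v_next = velocity_update(theta, V)                        # NPE-style update, (K, 2)
        theta = theta + v_next                                    # theta_k(t+1) = theta_k(t) + v_k(t+1)
        V = torch.cat([V[:, 1:], v_next.unsqueeze(1)], dim=1)     # keep only the last three velocities
        trajectory.append(theta)
    return torch.stack(trajectory)                                # (T+1, K, 2)
```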
Finally, once the appearance and pose vectors are obtained, rendering the image begins. The appearance parameter $z_k$ is first mapped to a tensor $\Psi_k \in \mathbb{R}^{H \times H \times C}$.
This is done via two separate learned decoder networks, one for the background, $\Psi_0 = \Psi^b(z_0)$, and one for the foreground objects, $\Psi_k = \Psi^f(z_k)$.
The pose parameter $\theta_k$ acts on this tensor $\Psi_k$ via bilinear resampling, resulting in the relation $\hat{\Psi}_k = \theta_k \cdot \Psi_k$ such that $[\hat{\Psi}_k]_u = [\Psi_k]_{u + \theta_k}$.
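One way to realize such a translation-by-bilinear-resampling in PyTorch is with `affine_grid`/`grid_sample`; the sketch below assumes the translation is given in normalized [-1, 1] coordinates and uses channel-first tensors (unlike the $H \times H \times C$ notation in the text):

```python
import torch
import torch.nn.functional as F

def translate_feature_map(psi, tx, ty):
    """psi: (C, H, H) object feature tensor; (tx, ty): 2D translation theta_k."""
    psi = psi.unsqueeze(0)                                   # (1, C, H, H)
    affine = torch.tensor([[[1.0, 0.0, tx],
                            [0.0, 1.0, ty]]])                # pure-translation affine matrix
    grid = F.affine_grid(affine, psi.shape, align_corners=False)
    # output at location u is read from the input at u + theta (bilinear interpolation)
    return F.grid_sample(psi, grid, align_corners=False).squeeze(0)
```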
Foreground and background objects are composed into an overall scene tensor $W \in \mathbb{R}^{H \times H \times C}$ via element-wise max- (or sum-) pooling as $W_u = \max_{k=0,\dots,K} [\hat{\Psi}_k]_u$.
In this manner, the scene tensor is a function $W(\Theta, Z)$ of the pose parameters $\Theta := (\theta_1, \dots, \theta_K)$ and the appearance parameters $Z := (z_0, z_1, \dots, z_K)$. Finally, a decoder network $\hat{I} = G(W)$ renders the composed scene as an image, similar to the GAN setup.
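A compact sketch of this composition step, where `decoder` stands in for the learned generator G (again channel-first tensors, as an assumption):

```python
import torch

def compose_and_render(psi_hats, decoder):
    """psi_hats: list of K+1 warped feature maps of shape (C, H, H), index 0 = background."""
    W = torch.stack(psi_hats, dim=0).max(dim=0).values   # W_u = max_k [Psi_hat_k]_u
    return decoder(W.unsqueeze(0))                       # rendered image I_hat
```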
# Learning Objective
The learning objective is a sum of two high-fidelity losses and a structural loss. For high fidelity, images $\hat{I}$ generated by the model are contrasted with real images $I$ from the training set using the standard GAN discriminator loss $L_{GAN}(\hat{I}, I)$ and a style loss $L_{style}(\hat{I}, I)$.
One unique addition to the learning process is a regularizer that prevents the model from learning trivial relationships between the object positions and the generated images.
For this, a position regressor network P is trained that, given a generated image $\hat{I}$, predicts the locations of the objects in it. In practice, the task is simplified: an image $\hat{I}_0$ is generated by retaining only one object k of the K objects at random, and the L2 norm $\|\theta_k - P(G(W(z_0, z_k, \theta_k)))\|_2$ is minimized. This completes the learning setup.
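A hedged sketch of this regularizer, where `compose_single`, `generator`, and `regressor` are assumed stand-ins for the single-object scene composition, the generator G, and the position regressor P:

```python
import torch

def pose_regularizer(z_bg, z_fg, theta, compose_single, generator, regressor):
    """z_fg: (K, z_dim) appearances, theta: (K, 2) poses; returns the L2 penalty."""
    K = z_fg.shape[0]
    k = torch.randint(K, (1,)).item()               # keep only object k, chosen at random
    W = compose_single(z_bg, z_fg[k], theta[k])     # scene tensor with a single object
    img = generator(W)                              # rendered image I_hat_0
    return torch.norm(regressor(img) - theta[k])    # || theta_k - P(G(W(...))) ||_2
```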
# Evaluation metrics and Setup requirements
For comparing the performance of the model with previous SOTA architectures, the FID (Fréchet Inception Distance) metric was used. Training requires the compute power of a single V100 GPU, and the codebase can be set up on Colab.
# Limitations of the model
A main limitation, and an avenue for further work, is the model's restriction to planar motions, which prevents it from faithfully representing arbitrary 3D motions featuring angular rotation; this is most notably highlighted by the video generation experiments.