<!-- To make centered tables -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS_HTML-full"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({"HTML-CSS": { preferredFont: "TeX", availableFonts:["STIX","TeX"], linebreaks: { automatic:true }, EqnChunk:(MathJax.Hub.Browser.isMobile ? 10 : 50) }, tex2jax: { inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ], displayMath: [ ["$$","$$"], ["\\[", "\\]"] ], processEscapes: true, ignoreClass: "tex2jax_ignore|dno" }, TeX: { noUndefined: { attributes: { mathcolor: "red", mathbackground: "#FFEEEE", mathsize: "90%" } }, Macros: { href: "{}" } }, messageStyle: "none" }); </script>
# Predicting Physics Videos - Novel Dataset & Model Comparison
**Authors:**
- Maurits Kienhuis (5300126, M.A.A.Kienhuis@student.tudelft.nl)
- Niels van der Voort (5243076, N.A.vanderVoort-1@student.tudelft.nl)
- Jan Warchocki (5344646, J.Z.Warchocki-1@student.tudelft.nl)
**Code:** https://github.com/Jaswar/cvdl-project
**Real-life recordings:** https://www.kaggle.com/datasets/koramund/physics-inverse
*This blogpost was written as part of the Computer Vision course at Delft University of Technology. Certain elements of the project were completed prior to its official start and are clearly indicated as such. Additionally, the project partially builds upon the work of Alejandro Garcia, a PhD student in the Pattern Recognition and Bioinformatics group. We ensure that these contributions are properly acknowledged and specify which sections are based on his research.*
## Introduction
Physics-informed video prediction aims to predict future video frames by fitting the available frames to an ordinary differential equation (ODE). The parameters of the ODE describe the current and future state of the system, as the current timestep and the length of a swinging pendulum would. Systems that use such physical information have been shown to outperform unguided video prediction models on selected problems [[1]](#1).
<!-- By utilizing those parameters the problem changes from general video prediction to a sort of classification task combined with guided video prediction. -->
Several models are capable of physics-informed frame generation. However, these models are often evaluated using different setups, metrics, and data instances, which makes comparison and reproduction of the work difficult.
Existing papers tackle experiments including the pendulum [[1](#1), [6](#6)], gravity-influenced balls [[1](#1), [6](#6)], and bouncing balls [[7](#7), [8](#8)]. However, each paper uses its own selection of experiments, and none covers them all. Additionally, there is the distinction between simulated data and real-life recordings. The field is currently focused on simulated data; however, real-world data includes more noise and is therefore also interesting to evaluate [[2]](#2).
We propose a dataset containing both simulated and real-life recordings of different physics experiments. This dataset should be suitable as a benchmark that allows for comparison between different models. We furthermore evaluate two existing models, by Jaques et al. [[1]](#1) and by Hofherr et al. [[2]](#2), on the designed dataset.
## Physics-informed models
To provide the reader with the necessary background, we give brief descriptions of two physics-informed models: by Jaques et al. [[1]](#1) and by Hofherr et al. [[2]](#2). We will refer to these models as PAIG (Physics-as-Inverse-Graphics) and PPI (Physical Parameter Inference), respectively. These are also the two models that are later evaluated on the designed dataset. We choose these models because their implementations are publicly available.
### Physics-as-Inverse-Graphics (PAIG)
<figure id="fig_1">
<img src="https://hackmd.io/_uploads/HyldmMSSC.png" alt="Figure 1."/>
<figcaption><b>Figure 1. </b> <em>Structure of the Physics-as-Inverse-Graphics model. It follows an encoder-decoder architecture with a physics engine capable of simulating desired physical experiments. Visualization taken from [1]. </em> </figcaption>
</figure>
Proposed in [[1]](#1), the model is visualized in Figure 1 and follows an encoder-decoder structure. The video frames are first passed through the encoder. The goal of the encoder is to segment the objects of interest from the video, which is realized with a U-Net [[3]](#3) for efficiency. Based on the segmented image, the encoder then attempts to predict the locations of the objects. Given the predicted positions, the model then calculates the velocities.
The combination of position and velocities is passed to the physics engine. This component differentiates physics-informed video prediction methods from the standard, non-informed models. Based on the positions and velocities for current frames, the physics engine predicts the future positions and velocities. To this end, the physics engine contains learnable parameters, such as the length of a pendulum or the gravitational acceleration constant.
Finally, the decoder takes the predicted positions of the objects and attempts to reconstruct future frames. Most importantly, the decoder contains a Spatial Transformer [[4]](#4) which places a given object in the position that is predicted by the physics engine. Different object and background representations are then summed to form the final predicted frame.
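To make the placement step more concrete, below is a minimal PyTorch sketch of how a spatial transformer can translate an object template to a predicted position and sum it with the background. This is an illustration only, not the PAIG implementation (which is written in TensorFlow); the function name, tensor shapes, and the normalised-coordinate convention are our assumptions.

```python
import torch
import torch.nn.functional as F

def place_object(template, background, position):
    """Translate an object template to `position` (normalised to [-1, 1])
    with a spatial transformer and sum it with the background.

    template, background: tensors of shape (B, C, H, W)
    position:             tensor of shape (B, 2) with (x, y) offsets
    """
    B, C, H, W = template.shape
    # Affine matrix for a pure translation: identity rotation/scale part,
    # last column holds the (negated) offset in normalised coordinates.
    theta = torch.zeros(B, 2, 3, device=template.device)
    theta[:, 0, 0] = 1.0
    theta[:, 1, 1] = 1.0
    theta[:, :, 2] = -position
    grid = F.affine_grid(theta, size=(B, C, H, W), align_corners=False)
    shifted = F.grid_sample(template, grid, align_corners=False)
    # PAIG-style composition: the object and background contributions are summed.
    return shifted + background
```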
The training procedure of this model is also relevant to this blog. For each physics experiment, the model requires a large dataset (~10k training samples). The frames are relatively low resolution (up to $64 \times 64$) and can be in colour. The model was not tested on real-life recordings in the original paper.
The implementation used in this blogpost is based on the work of Alejandro Garcia, who migrated PAIG from Tensorflow 1 to Tensorflow 2. His original migration can be found [here](https://github.com/Alejandro-neuro/paig_reproducibility).
### Physical Parameter Inference (PPI)
<figure id="fig_1">
<img src="https://hackmd.io/_uploads/BkMP3GSBC.png" alt="Figure 1."/>
<figcaption><b>Figure 2. </b> <em>Structure of the Physical Parameter Inference model. The algorithm models the object and background seperately using implicit representations. An ordinary differential equation (ODE) solver is used to predict future object locations. Visualization taken from [2]. </em> </figcaption>
</figure>
The Physical Parameter Inference (PPI) model, proposed in [[2]](#2) and shown in Figure 2, is simpler than the above-mentioned PAIG model. Similarly to PAIG, the model separates the object and background. In PPI this is realized by learning an implicit representation of the object and the background using a neural network (second column in Figure 2).
The physics component is realized with the ordinary differential equation (ODE) solver (first column in Figure 2). Given initial object position $z_0$ and the parameters of the system $\theta_{\text{ode}}$, the solver predicts the future positions. The predicted positions are then used to build the transformation $T$, which tells the model where to place the object in the frame.
Similarly to PAIG, the object and background representations are then merged. Since the object implicit representation is calculated for each pixel, it is necessary for PPI to predict the opacity alongside the color. This allows for objects of arbitrary shape to be represented. PPI uses alpha blending to merge the object and background representations with the given opacity.
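The blending step itself can be summarised in a single expression. The snippet below is a hedged sketch of per-pixel alpha blending; the tensor shapes and names are our assumptions, not the exact PPI code.

```python
import torch

def alpha_blend(object_rgb, object_alpha, background_rgb):
    """Per-pixel alpha blending of the object onto the background.

    object_rgb, background_rgb: (H, W, 3) tensors with values in [0, 1]
    object_alpha:               (H, W, 1) predicted opacity in [0, 1]
    """
    # Where the object is opaque (alpha ~ 1) its colour dominates,
    # where it is transparent (alpha ~ 0) the background shows through.
    return object_alpha * object_rgb + (1.0 - object_alpha) * background_rgb
```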
An important factor for this model is that it learns from only a single video. This makes it difficult to test on different videos (mostly because $z_0$ is learnable). We discuss later how, despite this issue, this model can be compared against PAIG. Furthermore, PPI requires the user to provide object masks alongside the video. Although constructing these for synthetic data is trivial, we elaborate later on how they can be obtained for real-life videos. In the original paper, the model is trained on both synthetic and real videos. Nonetheless, the comparison against PAIG included in the paper is limited, and not all experiments that we propose here are included in the PPI evaluation.
## Dataset construction
We construct experiments based on the existing literature. We develop four experiments: a pendulum, a sliding block, a bouncing ball being dropped, and a ball thrown at an angle. Each of these experiments poses different challenges for the models. The sliding block is a simple experiment with movement in only one direction. A ball thrown at an angle requires the models to capture movement along both the x- and y-axes as well as to accurately predict initial velocities. The bouncing ball experiment forces the models to learn a complex behaviour upon impact. Finally, the pendulum is the only experiment that uses angular, instead of Cartesian, coordinates.
For all experiments we provide synthetically generated videos. We furthermore provide real-life recordings for the first three experiments. No real-life recordings are provided for the ball throw experiment due to difficulties in ensuring consistent recordings. We now give an overview of the experiments, followed by the method used to arrive at the synthetic and real recordings.
### Physics behind experiments
#### Pendulum
The differential equation governing the motion of the pendulum is given by:
$$
\begin{equation}
\frac{d^2\theta}{dt^2}(t, \theta) = - \frac{g}{l}\sin \left( \theta \right)
\end{equation}
$$
Where $\theta$ is the angle of the pendulum, $g$ is the gravitational acceleration, and $l$ is the length of the pendulum. Both $g$ and $l$ are learnable parameters. The pendulum is the only experiment where the coordinate system is angular, not Cartesian. Although the authors of PAIG mention support for angular coordinate systems, no such support was found in the codebase. Hence, we add this support by using the predicted angle as the rotation of the spatial transformer, which follows the description from the paper. This support was added before the project commenced. PPI, on the other hand, already supports the pendulum experiment, so we reuse the authors' implementation for an undamped pendulum.
[PAIG implementation ](https://github.com/Jaswar/cvdl-project/blob/8d831a46fa6900b7efe8b58f7f581e883e5af2ad/paig/nn/network/cells.py#L17)
[PPI implementation](https://github.com/Jaswar/cvdl-project/blob/8d831a46fa6900b7efe8b58f7f581e883e5af2ad/PhysParamInference/models/ode.py#L5)
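To illustrate how such an ODE right-hand side can expose its learnable parameters, below is a minimal PyTorch-style sketch. The class name and initial values are our assumptions and do not correspond to the linked code; the other experiments follow the same pattern with a different right-hand side.

```python
import torch
import torch.nn as nn

class PendulumODE(nn.Module):
    """Right-hand side of the pendulum ODE with learnable g and l."""

    def __init__(self, g_init=9.81, l_init=1.0):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(g_init))  # gravitational acceleration
        self.l = nn.Parameter(torch.tensor(l_init))  # pendulum length

    def forward(self, theta):
        # d^2(theta)/dt^2 = -(g / l) * sin(theta)
        return -(self.g / self.l) * torch.sin(theta)
```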
#### Sliding block
The main equation for the sliding block is given by:
$$
\begin{equation}
\frac{d^2x}{dt^2}(t) = g \left( \sin \left( \alpha \right) - \mu \cos \left( \alpha \right) \right)
\end{equation}
$$
Where $x$ is the position of the block on the slope, $g$ is the gravitational acceleration, $\alpha$ is the inclination of the slope, and $\mu$ is the friction coefficient. $g$, $\alpha$, and $\mu$ are all learnable parameters. For PAIG, we implemented a custom physics cell for this experiment. PPI already supported this experiment, so we reused the implementation.
[PAIG implementation ](https://github.com/Jaswar/cvdl-project/blob/8d831a46fa6900b7efe8b58f7f581e883e5af2ad/paig/nn/network/cells.py#L181)
[PPI implementation](https://github.com/Jaswar/cvdl-project/blob/8d831a46fa6900b7efe8b58f7f581e883e5af2ad/PhysParamInference/models/ode.py#L49)
#### Bouncing ball
In this experiment, a ball is dropped from a random height and is allowed to bounce off the ground. Let $r$ be the radius of the ball and $g$ the gravitational acceleration, then the differential equations are given by:
$$
\begin{align}
\frac{d^2y}{dt^2}(t) &= \begin{cases}
+\infty, & \text{if $y$ < $r$} \\
-g, & \text{otherwise}
\end{cases} \\
\frac{d^2x}{dt^2}(t) &= 0
\end{align}
$$
Where it must additionally hold that $\lim_{t_1 \rightarrow t_2} \int_{t_1}^{t_2} \frac{d^2y}{dt^2}(t) \, dt = -(1+\gamma)v_{t_1}$, with $t_1$ being the time the ball comes into contact with the floor ($y < r$), $t_2$ the time it leaves the ground to bounce back, $v_{t_1}$ the velocity of the ball upon impact, and $\gamma$ the elasticity factor. Since this bounce is instantaneous in our implementation, we require the limit $\lim_{t_1 \rightarrow t_2}$.
In simpler terms, the ball free-falls to the ground until it touches it ($y < r$). Upon contact, the velocity immediately changes from $v_{t_1}$ to $v_{t_2}$ according to $v_{t_2} = -\gamma v_{t_1}$. That is, the ball simply bounces back with a velocity reduced by the elasticity factor $\gamma$.
Since the time is discretized in the actual model implementations, no infinities appear. Upon detecting contact with the ground, we simply invert the velocity according to the equation $v_{t_2} = -\gamma v_{t_1}$. Neither of the two models contained this physical model, thus we implemented it ourselves. The radius $r$ is the only non-learnable parameter, as we have not been able to allow the gradients to flow back through this variable.
[PAIG implementation ](https://github.com/Jaswar/cvdl-project/blob/8d831a46fa6900b7efe8b58f7f581e883e5af2ad/paig/nn/network/cells.py#L111)
[PPI implementation](https://github.com/Jaswar/cvdl-project/blob/8d831a46fa6900b7efe8b58f7f581e883e5af2ad/PhysParamInference/models/ode.py#L76)
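To clarify the discretized bounce, the following is a minimal sketch of a single integration step along the vertical axis. It is an illustration under our assumptions, not the exact code from either linked implementation.

```python
def bounce_step(y, v_y, g, gamma, r, dt):
    """One discrete Euler step of the bouncing ball along the y-axis.

    y: vertical position of the ball's centre, v_y: vertical velocity,
    g: gravitational acceleration, gamma: elasticity factor,
    r: ball radius (not learnable), dt: effective step size.
    """
    if y < r:
        v_y = -gamma * v_y  # contact with the ground: invert and damp the velocity
        a = 0.0
    else:
        a = -g              # free fall
    # Euler update: position with the current velocity, then the velocity itself.
    return y + dt * v_y, v_y + dt * a
```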
#### Thrown ball
In this experiment a ball is thrown at a 45° angle with random initial velocity $v_0$, such that $||v_0|| \sim U(0, 15)$. The differential equations are then:
$$
\begin{align}
\frac{d^2y}{dt^2}(t) &= -g \\
\frac{d^2x}{dt^2}(t) &= 0
\end{align}
$$
Additionally, it should be noted that the ball always starts on the left of the frame and is allowed to exit the frame on the right or at the bottom. Only the gravitational acceleration $g$ is learnable. PPI already contains an implementation of this experiment; PAIG required a new one.
[PAIG implementation ](https://github.com/Jaswar/cvdl-project/blob/8d831a46fa6900b7efe8b58f7f581e883e5af2ad/paig/nn/network/cells.py#L144)
[PPI implementation](https://github.com/Jaswar/cvdl-project/blob/8d831a46fa6900b7efe8b58f7f581e883e5af2ad/PhysParamInference/models/ode.py#L64)
### Synthetic data generation
Similarly to PAIG, we use Euler's method for approximating (and thus simulating) the differential equations [[1]](#1). Let $p_t$ and $v_t$ be, respectively, the current position and the current velocity of the simulated object. The next position and velocity, $p_{t+1}$ and $v_{t+1}$, are then calculated as:
$$
\begin{align}
p_{t+1} &= p_t + \frac{\Delta t}{N} v_t \\
v_{t+1} &= v_t + \frac{\Delta t}{N} f(t, p_t)
\end{align}
$$
Where $f(t, p_t)$ is the second-order, experiment-specific differential equation as outlined in the previous section. $\Delta t$ and $N$ are user-specified parameters. Similarly to PAIG, we set $\Delta t = 0.3$ in all experiments except the thrown ball, for which we found it necessary to set $\Delta t = 0.05$ to generate sufficiently long sequences. We set $N = 10$, compared to $N = 5$ in PAIG, for a slightly higher approximation accuracy.
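As a minimal sketch of this integration (assuming, as in PAIG, that each frame interval $\Delta t$ is subdivided into $N$ Euler sub-steps; the function and variable names are ours):

```python
def euler_integrate(p, v, f, t, dt, n_substeps):
    """Advance position p and velocity v over one frame interval dt using
    n_substeps Euler sub-steps; f(t, p) is the experiment-specific
    second-order right-hand side."""
    h = dt / n_substeps
    for _ in range(n_substeps):
        # p_{t+1} = p_t + h * v_t ;  v_{t+1} = v_t + h * f(t, p_t)
        p, v = p + h * v, v + h * f(t, p)
        t = t + h
    return p, v
```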
Similarly to PAIG, the generated frames are $32 \times 32$ and in colour. Special steps had to be taken to allow both models to be trained on the generated data. First, PAIG works with a dataset of videos, while PPI works with only a single video. This also implies that PPI cannot work with different starting positions of the simulated objects. In the PAIG paper, the generated datasets contain shorter videos for training (12 frames) and longer videos for testing (30 frames). As such, we decided to generate one dataset of videos consisting of 42 frames each. The first 12 frames can then be used for training, while the rest should only be used for testing. PAIG is allowed to use the entire dataset of 10,000 videos, while PPI may only train on the first video of this dataset. The performance of both models is then always tested on only the first video (since PPI cannot easily be evaluated on other videos).
Furthermore, the PPI model requires object masks during training. We generate these masks with 1 indicating object pixels and 0 indicating background, which is the format required by the model. Examples of the generated videos for each of the four experiments can be seen in Figure 3. The values of the relevant physical parameters can be found in the source code of this project.
<figure>
<table>
<tr>
<td>
<img src="https://hackmd.io/_uploads/SkDbcyGHA.gif">
</td>
<td>
<img src="https://hackmd.io/_uploads/B15-_kGHC.gif">
</td>
<td>
<img src="https://hackmd.io/_uploads/S19R9kMH0.gif">
</td>
<td>
<img src="https://hackmd.io/_uploads/r1VK5kzSC.gif">
</td>
</tr>
</table>
<figcaption><b>Figure 3.</b> <i>The generated synthetic experiments. From the left: pendulum, sliding block, bouncing ball, and thrown ball.</i></figcaption>
</figure>
Finally, it should be noted that for the sliding block experiments we do not rotate the frame according to the inclination angle. This is due to the low image resolution, where rotating the box would cause it to appear highly pixelated. The physics is still simulated as if the slope were inclined.
The code used to generate the synthetic datasets can be found [here](https://github.com/Jaswar/cvdl-project/blob/master/generators.py). This script also specifies the values of the physical parameters used for generating the data.
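As an illustration of the general recipe (simulate the position with Euler's method, then rasterise the object at that position), below is a hedged OpenCV sketch. The colours, radius, and resolution are example values, not necessarily those used in the linked script.

```python
import cv2
import numpy as np

def render_ball_frame(x, y, radius_px=3, size=32):
    """Rasterise a ball at the simulated position (x, y), given in pixels,
    onto a 32x32 RGB frame (example values only)."""
    frame = np.zeros((size, size, 3), dtype=np.uint8)
    # Draw a filled circle; OpenCV uses BGR ordering, so (0, 0, 255) is red.
    cv2.circle(frame, (int(round(x)), int(round(y))), radius_px,
               color=(0, 0, 255), thickness=-1)
    return frame
```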
### Real-life recordings
With most modern systems, you do not know their true strength until they face a storm. While the real storm would be general video prediction, only a scaled-down version is considered here. Synthetic data is by nature very accurate, yet the real world is far more varied than synthetic data can model. For this reason, the models are also evaluated on a real-world benchmark containing the aforementioned pendulum, sliding block, and bouncing ball experiments. It must be mentioned that, as these are real-life experiments, they are not perfect: the pendulum is not evenly damped throughout its swing, the balls bounce more erratically, and the sliding blocks do not always start in the same way.
In the interest of future models capable of simulating larger images, like PPI, the dataset is provided at its source resolution of $1920 \times 1080$, in either portrait or landscape mode depending on the dominant axis of motion. The videos were shot at 60 frames per second. The pendulum file names contain the length of the string to the centre of mass. All pendulum experiments were started with a horizontal offset of 20 cm. For the sliding block, the file name contains the height of the right-angle triangle whose hypotenuse measures 67 cm. This leads to the sliding angles shown in Table 1. The bouncing ball was recorded next to a tape measure marked every 10 cm. Examples of recorded videos are shown in Figure 4.
$$
\begin{array}{|c|c|}
\hline
\textbf{Height} & \textbf{Sliding angle} \\
\hline
\text{18.5cm} & \text{16.0°} \\
\text{21.5cm} & \text{18.7°} \\
\text{24.5cm} & \text{21.5°} \\
\hline
\end{array}
$$
<figure>
<figcaption><b>Table 1.</b> <i>Sliding block triangle height versus the sliding angle of the sliding block experiment.</i></figcaption>
</figure>
<figure>
<table>
<tr>
<td>
<img src="https://kienhuis.eu/pendulum_60_7.gif">
</td>
<td>
<img src="https://kienhuis.eu/sliding_block_18_5_0.gif">
</td>
<td>
<img src="https://kienhuis.eu/bouncing_ball_10.gif">
</td>
</tr>
</table>
<figcaption><b>Figure 4.</b> <i>The recorded real life experiments. From the left: pendulum, sliding block, and bouncing ball.</i></figcaption>
</figure>
As the PPI model requires masks for its evaluation, some rigor is required to add these to the dataset. To stay close to a real-life situation, these masks are procedurally generated instead of hand-annotated. To segment the objects, we boost the saturation and value channels in HSV space, then threshold on an RGB range to create a mask. This mask is then eroded and dilated to remove any small artifacts. The accuracy of these masks was assessed qualitatively; examples can be seen in Figure 5.
<figure>
<table>
<tr>
<td>
<img src="https://kienhuis.eu/pendulum_60_7_mask.gif">
</td>
<td>
<img src="https://kienhuis.eu/sliding_block_18_5_0_mask.gif">
</td>
<td>
<img src="https://kienhuis.eu/bouncing_ball_10_mask.gif">
</td>
</tr>
</table>
<figcaption><b>Figure 5.</b> <i>The masks for the recorded real life experiments. From the left: pendulum, sliding block, and bouncing ball.</i></figcaption>
</figure>
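The mask-extraction procedure described above roughly corresponds to the following OpenCV sketch. The gains, colour ranges, and kernel size are placeholders; the values used for the dataset are chosen per experiment.

```python
import cv2
import numpy as np

def extract_mask(frame_bgr, lower_bgr, upper_bgr, sat_gain=1.5, val_gain=1.2):
    """Procedurally extract a binary object mask from a BGR frame."""
    # Boost saturation and value in HSV space to make the object colour pop.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] = np.clip(hsv[..., 1] * sat_gain, 0, 255)
    hsv[..., 2] = np.clip(hsv[..., 2] * val_gain, 0, 255)
    boosted = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

    # Threshold on a colour range to get an initial mask.
    mask = cv2.inRange(boosted, np.array(lower_bgr), np.array(upper_bgr))

    # Erode then dilate to remove small artifacts.
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.erode(mask, kernel, iterations=1)
    mask = cv2.dilate(mask, kernel, iterations=1)
    return mask
```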
Extra steps need to be taken to ensure the models can be trained and tested on the real-life videos. We split each video into chunks of 42 frames. From each chunk, the first 12 frames are used as training data and the remaining 30 frames are used for evaluation. Furthermore, the videos and masks are downscaled to $32 \times 32$. Finally, the masks are converted into a binary format, where 1 indicates object presence and 0 indicates background. These steps make it possible to train and evaluate the models in a way identical to the synthetic experiments. The script used to convert the real-life data into the desired format can be found [here](https://github.com/Jaswar/cvdl-project/blob/master/from_real_data.py).
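A minimal sketch of this preprocessing, assuming frames and masks are given as lists of numpy arrays (the function name and threshold are ours, not the linked script):

```python
import cv2
import numpy as np

def preprocess_video(frames, masks, chunk_len=42, train_len=12, size=32):
    """Split a recording into 42-frame chunks, downscale to 32x32,
    and binarise the masks. Returns (train, test, masks) per chunk."""
    chunks = []
    for start in range(0, len(frames) - chunk_len + 1, chunk_len):
        small_frames = [cv2.resize(f, (size, size))
                        for f in frames[start:start + chunk_len]]
        small_masks = [(cv2.resize(m, (size, size)) > 127).astype(np.uint8)
                       for m in masks[start:start + chunk_len]]
        train = np.stack(small_frames[:train_len])  # first 12 frames: training
        test = np.stack(small_frames[train_len:])   # remaining 30: evaluation
        chunks.append((train, test, np.stack(small_masks)))
    return chunks
```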
## Results
We now evaluate the PPI and PAIG models on the constructed dataset. We perform both a quantitative and a qualitative analysis. For the quantitative analysis, following [[2]](#2), we measure the peak signal-to-noise ratio (PSNR) between the predicted video and the ground truth. This metric is given by the following equation:
$$
\begin{equation}
\text{PSNR} = - \frac{10}{T} \sum_{t=0}^{T - 1} \log_{10}\left( \frac{1}{3WH} \sum_{c=0}^2 \sum_{i=0}^{H - 1} \sum_{j=0}^{W - 1} \left( \hat{V}_{t,c,i,j} - V_{t,c,i,j} \right)^2 \right)
\end{equation}
$$
Where $\hat{V}$ is the predicted video and $V$ is the ground truth. Hence, this metric is the average PSNR of the video frames according to the standard definition [[5]](#5). Importantly, the higher this metric, the better the model. We use a single Nvidia RTX 4090 for both training and evaluation. The training details, such as the number of epochs, batch size, and network architecture, can be found in the source code of the project. The script used for evaluating the models can be found [here](https://github.com/Jaswar/cvdl-project/blob/master/evaluate.py).
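The metric above corresponds to the following numpy sketch, assuming video tensors of shape $(T, C, H, W)$ with pixel values in $[0, 1]$ (the shapes and the small epsilon are our assumptions):

```python
import numpy as np

def psnr(pred, target):
    """Average PSNR over frames, matching the equation above.

    pred, target: arrays of shape (T, C, H, W) with values in [0, 1],
    so the peak signal value is 1 and PSNR_t = -10 * log10(MSE_t).
    """
    # Mean squared error per frame, averaged over channels and pixels.
    mse = ((pred - target) ** 2).mean(axis=(1, 2, 3))
    mse = np.maximum(mse, 1e-10)  # guard against log10(0) for identical frames
    return float(np.mean(-10.0 * np.log10(mse)))
```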
### Synthetic data
<figure>
<table>
<tr>
<td>
<img src="https://hackmd.io/_uploads/rkeM5PDSC.jpg">
</td>
</tr>
<tr>
<td>
<img src="https://hackmd.io/_uploads/Sy0tcvDrR.jpg">
</td>
</tr>
<tr>
<td>
<img src="https://hackmd.io/_uploads/H1Es5DvrR.jpg">
</td>
</tr>
<tr>
<td>
<img src="https://hackmd.io/_uploads/H1HjqDvH0.jpg">
</td>
</tr>
</table>
<figcaption><b>Figure 6.</b> <i>The predicted sequences for all 4 experiments. From the top: pendulum, sliding block, bouncing ball drop, and ball throw. In each row, the top sequence is the ground truth, middle is PAIG prediction, and bottom is PPI prediction. The images are best viewed if opened in a new window.</i></figcaption>
</figure>
We first analyze the results qualitatively. The comparison of the different models is shown in Figure 6. First, as we can observe, both models learned the pendulum example almost perfectly. This is to be expected, as both models were tested on the pendulum by their respective authors.
In the sliding block example, we observe that only PAIG is able to learn the dynamics correctly. Although the initial few frames for PPI are almost perfect, the quality degrades over the later frames. This is perhaps a surprising result, considering that PPI was tested by its authors on the sliding block example. Since the PPI implementation rotates the object by the inclination angle ([here](https://github.com/florianHofherr/PhysParamInference/blob/f1eb3454d3e310fcbc6dcb1e690e7d1f7baa8a28/models/sceneRepresentation.py#L343)), we suspected the lack of rotation in the frame to be an issue. Removing the rotation from the PPI model did not, however, lead to any improvement.
The results for both the bouncing ball and the thrown ball show poor performance for both models. This hints at the increased difficulty of these experiments compared to the previous two. The bouncing ball is particularly difficult, as learning the elasticity requires the model to trigger the condition that checks for impact with the ground.
$$
\begin{array}{|c|c|c|c|c|}
\hline
\textbf{Model} & \textbf{Pendulum} & \textbf{Sliding block} & \textbf{Bouncing ball} & \textbf{Thrown ball} \\
\hline
\text{PAIG} & \text{25.04} & \textbf{39.6} & \textbf{20.12} & \text{18.42} \\
\text{PPI} & \textbf{35.78} & \text{20.24} & \text{19.78} & \textbf{20.72} \\
\hline
\end{array}
$$
<figure>
<figcaption><b>Table 2.</b> <i>PSNR for the four synthetic experiments. Bolded is the better model in the given experiment.</i></figcaption>
</figure>
Finally, we analyze the PSNR for the aforementioned experiments. The results of that computation are presented in Table 2. As we can observe for the pendulum example, PPI achieves a much higher PSNR despite the qualitative analysis showing almost identical results. Furthermore, as expected, PAIG reaches a much higher PSNR than PPI for the sliding block example. For the remaining two experiments the score is lower and similar for both models.
### Real data
<figure>
<table>
<tr>
<td>
<img src="https://hackmd.io/_uploads/B13rjvDH0.jpg">
</td>
</tr>
<tr>
<td>
<img src="https://hackmd.io/_uploads/BJRHoDwHR.jpg">
</td>
</tr>
<tr>
<td>
<img src="https://hackmd.io/_uploads/Byc7yKir0.jpg">
</td>
</tr>
</table>
<figcaption><b>Figure 7.</b> <i>The predicted sequences for two real-life experiments. From the top: pendulum, sliding block, and bouncing ball. In each row, the top sequence is the ground truth, middle is PAIG prediction, and bottom is PPI prediction. The images are best viewed if opened in a new window. Due to different channel orderings in OpenCV, the colours may not match the videos from Figure 4. This does not have any adverse effect during training or evaluation.</i></figcaption>
</figure>
We now evaluate the models on real data. For the pendulum, the training (and testing) sequences are obtained from the `pendulum_45` folder. For the sliding block, the sequences come from the `sliding_block_24_5` folder. We select these folders as they should have the largest differences between subsequent frames (the pendulum period decreases with rope length, and a steeper incline causes the block to slide down faster). The bouncing ball experiments are based on the recordings in the `bouncing_ball` folder. The results of the comparison can be seen in Figure 7.
As we can observe, the models fail to learn the sequence in all cases. Most surprising is the negative result of PPI on the sliding block. The model was trained and tested by its authors on a real-life sliding block example, yet it fails to work on our data. We find that the most likely cause for this behaviour is that the model learns to predict a mask of all ones (i.e. object everywhere). Nonetheless, we have verified that the training masks match the object correctly and are in the right format. We also made additional changes to the model and the data in an attempt to fix this. We attempted to train at a higher resolution ($480 \times 480$), closer to the resolution from the original paper. We also tried flipping the frames to ensure the sliding direction is the same as in the paper. Finally, we attempted to modify the hyperparameters of the network. None of these changes led to an improvement.
Similarly, it might be surprising to find that PAIG fails to learn the pendulum example correctly. Upon closer inspection, one finds that the model attempts to rotate the ball around the middle of the frame. This is incorrect, as the point of rotation lies above the image, where the ball hangs from. We believed this issue to be caused by the spatial transformer, which, in the default implementation, rotates the features around the centre. We have, however, attempted to add translation to the spatial transformer by means of a learnable or handcrafted position. This did not lead to any improvement.
Neither of the two models learns to predict the bouncing ball correctly. This is, however, an expected result, as neither of the models learned the synthetic version of this experiment. On top of the difficulty posed by the synthetic data, the real-life ball can also move in the horizontal direction upon bouncing, which makes modelling the system correctly even harder.
$$
\begin{array}{|c|c|c|}
\hline
\textbf{Model} & \textbf{Pendulum} & \textbf{Sliding block} & \textbf{Bouncing ball} \\
\hline
\text{PAIG} & \textbf{26.69} & \textbf{26.53} & \textbf{20.58} \\
\text{PPI} & \text{24.78} & \text{24.85} & \text{17.13} \\
\hline
\end{array}
$$
<figure>
<figcaption><b>Table 3.</b> <i>PSNR for the real-data experiments. Bolded is the better model in the given experiment.</i></figcaption>
</figure>
Finally, for completeness, we calculate the PSNR of the two models on the real data; the results can be seen in Table 3. We find that PAIG outperforms PPI on all experiments. This suggests that PAIG at least learns the background/object decomposition better than PPI and can model the background more accurately. Nonetheless, as shown in the qualitative analysis, both models perform poorly, and as such it cannot be concluded that PAIG outperforms PPI.
## Discussion
#### Model performance
As shown in the previous section, both models perform poorly in multiple experiments. Although for some experiments (such as the bouncing ball) this could be caused by the difficulty of the task, for others the reason is unknown. We attempted to make changes to the models and the data in order to improve their behaviour. This included training at a higher resolution, making architectural or hyperparameter changes, and skipping frames to make the differences between frames more visible. None of the attempted modifications resulted in a visible improvement. As such, the reason why these models perform poorly on this data remains an open question.
#### Real versus simulated data
We have observed that real-life videos are more difficult for the models to learn. For example, although both models learn the synthetic pendulum example, they fail on real-life data. Similarly, PAIG fails on real recordings of the sliding block, yet works very well on synthetic data. Conversely, there is not a single real-life experiment on which either model performs better than on the corresponding synthetic data. As such, we conclude that real-life experiments pose unique challenges that are not present in synthetic data. It is therefore important that future models are evaluated on both types of data.
#### Metric choice
In this work, we used the peak signal-to-noise ratio for the quantitative analysis. We find that it can be a useful metric when models perform similarly and are hard to distinguish with a qualitative analysis. This property is best observed in the synthetic pendulum example, where, despite both models performing qualitatively similarly, the PSNR is much higher for PPI. Nonetheless, we also find that the metric is not informative when both models perform poorly (Table 3). Thus, we suggest that a qualitative analysis should always be performed before the quantitative analysis to judge the performance of the model. Additionally, other metrics, perhaps based on the trajectories of the objects, could be designed.
#### Dataset usability
We took multiple steps to increase the usability of the designed dataset. The synthetic data is generated with a script, so future researchers can easily modify it to generate longer videos, videos of higher resolution, or videos with a custom background. The real-life videos are recorded in high resolution ($1920 \times 1080$), which allows larger models to be trained on the data. Furthermore, we attempted to make the real-life video background as mono-coloured as possible. This makes it possible to extract object masks and could enable future researchers to replace the background with one of their choice.
## Conclusion
Physics-informed video prediction differs from standard video prediction in that, aside from the frames, the physical equations governing the recorded system are also known. Although multiple approaches to this problem exist [[1](#1), [2](#2), [6](#6), [7](#7), [8](#8)], they often use different physical experiments or in-house datasets, making comparison between models difficult, if not impossible. In this blogpost we introduced a novel dataset for physics-informed video prediction. The dataset contains multiple synthetic and real-life experiments inspired by the existing literature. We also evaluated two existing models, by Jaques et al. [[1]](#1) and by Hofherr et al. [[2]](#2), on the proposed dataset. We find that the models generalize poorly to the new dataset, being able to predict frames correctly only for certain synthetic experiments.
This blogpost leaves space for future work. More experiments could be designed, in particular ones where multiple objects interact. Furthermore, more existing models could be evaluated, including ones that do not have a publicly-available implementation. Finally, the dataset was designed in a way that enables future researchers to test and tune their algorithms. As such, this dataset could be used as a benchmarking tool for future models.
## References
[comment]: <> (Use APA 7TH for references!)
<a id="1">[1]</a> Jaques, M., Burke, M., & Hospedales, T. (2019). Physics-as-Inverse-Graphics: Unsupervised Physical Parameter Estimation from Video. https://doi.org/10.48550/ARXIV.1905.11169
<a id="2">[2]</a> Hofherr, F., Koestler, L., Bernard, F., & Cremers, D. (2023). Neural Implicit Representations for Physical Parameter Inference From a Single Video. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2093–2103.
<a id="3">[3]</a> Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation (Version 1). arXiv. https://doi.org/10.48550/ARXIV.1505.04597
<a id="4">[4]</a> Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial Transformer Networks. In arXiv e-prints. https://doi.org/10.48550/arXiv.1506.02025
<a id="5">[5]</a> Wikipedia contributors. (2024, May 30). Peak signal-to-noise ratio. In Wikipedia, The Free Encyclopedia. Retrieved 23:52, June 11, 2024, from https://en.wikipedia.org/w/index.php?title=Peak_signal-to-noise_ratio&oldid=1226400206
<a id="6">[6]</a> T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge, “Learning physics constrained dynamics using autoencoders,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35, Curran Associates, Inc., 2022, pp. 17 157–17 172. https://proceedings.neurips.cc/paper_files/paper/2022/file/6d5e035724687454549b97d6c805dc84-Paper-Conference.pdf.
<a id="7">[7]</a> P. Velickovic et al., “Reasoning-modulated representations,” in Proceedings of the First Learning on Graphs Conference, B. Rieck and R. Pascanu, Eds., ser. Proceedings of Machine Learning Research, vol. 198, PMLR, Dec. 2022, 50:1–50:17. https://proceedings.mlr.press/v198/velickovic22a.html.
<a id="8">[8]</a> N. Watters, A. Tacchetti, T. Weber, R. Pascanu, P. Battaglia, and D. Zoran, “Visual interaction networks,” 2017. doi: 10.48550/ARXIV.1706.01433. https://arxiv.org/abs/1706.01433.