**GATSBI: Generative Agent-Centric Spatio-Temporal Object Interaction**

*Cheol-Hui Min, Jinseok Bae, Junho Lee, Young Min Kim; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 3074-3083*

:pushpin: Overview
---
The goal of this study is to develop a generative model that turns a series of unstructured observations into a detailed latent representation of the spatio-temporal context of an agent's activities. The model is called GATSBI, which stands for Generative Agent-centric Spatio-temporal Bayesian Interaction.

:pushpin: Introduction
---
* GATSBI has been demonstrated to generalize to a range of contexts in which various kinds of robots and objects interact with one another dynamically.
* The generative model extracts latent variables that encode the relationships between different entities, using prior knowledge.
* Each video frame is decomposed into individual components, from which the latent dynamics are extracted.
* The object-centric representation analyzes the relationships between objects and events.

:pushpin: Section 2
---
**Object-centric Representation Learning**
* Generative models project high-dimensional observations into low-dimensional representations.
* These representations can be grouped into three categories:
    1. Attention based: used for object discovery; captures locally consistent spatial features.
    2. Spatial-mixture based: represents large scene entities with a Gaussian mixture model (GMM).
    3. Keypoint based: extracts keypoints from feature maps.

**Latent Dynamics Models from Visual Sequences**
* Latent representations extend these models to the temporal transitions and interactions of the detected objects.
* State-space models (SSMs) use a state variable to describe the system by a set of first-order difference equations.
* RNNs model sequence data, using a feed-forward pass to carry information across the time sequence.
* GNNs draw conclusions based on nodes and edges.
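The SSM bullet above can be made concrete with a toy example: a linear-Gaussian state-space model whose state variable evolves by a first-order recursion, filtered with one Kalman step per observation. This is only a minimal illustrative sketch, not GATSBI's learned transition and posterior networks; the parameters `A`, `C`, `Q`, `R` and the constant-velocity setup are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical constant-velocity system: state s = [position, velocity].
A = np.array([[1.0, 0.1], [0.0, 1.0]])  # transition (first-order recursion)
C = np.array([[1.0, 0.0]])              # emission: observe position only
Q = 0.01 * np.eye(2)                    # transition noise covariance
R = 0.1 * np.eye(1)                     # observation noise covariance

def kalman_step(mu, P, y):
    """One filtering step: prior prediction, then posterior update."""
    # Prior (transition model): s_t ~ N(A mu, A P A^T + Q)
    mu_prior = A @ mu
    P_prior = A @ P @ A.T + Q
    # Posterior: condition the prior on the observation y_t
    S = C @ P_prior @ C.T + R
    K = P_prior @ C.T @ np.linalg.inv(S)
    mu_post = mu_prior + K @ (y - C @ mu_prior)
    P_post = (np.eye(2) - K @ C) @ P_prior
    return mu_post, P_post

# Filter a short noisy trajectory of a point moving at constant velocity 0.5.
mu, P = np.zeros(2), np.eye(2)
true_s = np.array([0.0, 0.5])
for t in range(20):
    true_s = A @ true_s
    y = C @ true_s + rng.normal(scale=0.1, size=1)
    mu, P = kalman_step(mu, P, y)

print(np.round(mu, 2))  # estimated [position, velocity]
```

GATSBI replaces the fixed matrices with neural networks and the Gaussian updates with learned posterior/prior distributions, but the recursive predict-then-update structure is the same idea.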
:::info
*:pushpin: Prior approaches use low-dimensional latent dynamics but do not represent entity-wise interactions; building on all of these approaches, GATSBI is able to locate the active agents and learn their different physical properties.*
:::

:pushpin: Section 3
---
**Fig. 2 - A probabilistic graphical model of GATSBI**

![image](https://hackmd.io/_uploads/SJEM5sxBp.png)

In the training phase, the structured latent variables capture the scene semantics by unrolling the recurrent states against the observations. In the test phase, after these updates, GATSBI generates the future observations.

*These representations are based on the VAE (Variational AutoEncoder), which finds a low-dimensional representation of the data distribution.*

![image](https://hackmd.io/_uploads/BJ22J3lr6.png)

**Lower Bound**

![WhatsApp Image 2023-11-26 at 17.44.44_3a88fe6d](https://hackmd.io/_uploads/HJ2AU2lBp.jpg)

:::info
:pushpin:
* ![image](https://hackmd.io/_uploads/BJzGdhlBa.png) - observation likelihood
* ![image](https://hackmd.io/_uploads/BkO5_hlB6.png) - posterior distribution
* ![image](https://hackmd.io/_uploads/rJ_ad2lSp.png) - prior distribution
:::

**Extending the basic VAE with an SSM, the RNN memorizes information in its hidden state.**

![image](https://hackmd.io/_uploads/ryR5s3grT.png)

* The hidden state ![image](https://hackmd.io/_uploads/HyIOxTeST.png) is combined with the action ![image](https://hackmd.io/_uploads/ByQ9x6lrT.png) for both posterior and prior generation.
* Encoding the spatio-temporal context, GATSBI embeds the latent variables into background, agents, and objects.

:::info
:pushpin: A multi-layer perceptron (MLP) ![image](https://hackmd.io/_uploads/rkmOzpgBp.png) is used to increase the dimensionality, since the model previously operates on low-dimensional variables.
:::

**Entity-wise Decomposition**

![image](https://hackmd.io/_uploads/BkYmm6eST.png)

Fig. 3 - Overall scheme of GATSBI

* Mixture Module: extracts the large components and acquires the latent variables for a GMM of the static background and active agents.
* Keypoint Module: detects the dynamic features, assigning them to the agent and the remainder to the background.
* Object Module: discovers the passive scene entities, i.e., the objects other than the active agent and the static background.
* Interaction Module: constructs the agent-centric interactions from the decomposed and updated states. The frame is divided into a coarse grid, and individual objects are assigned to the cells.

:::info
:pushpin: A similar goal can also be achieved with attention-based object discovery, but it can represent only small objects.
:::

![image](https://hackmd.io/_uploads/B1OqTgGra.png)

Fig. 4 - Spatio-temporal GMM
* ![image](https://hackmd.io/_uploads/r1NbRgfrT.png) - recurrent states
* ![image](https://hackmd.io/_uploads/HJw7RxMHT.png) - latent variables
* ![image](https://hackmd.io/_uploads/rJVFAgzBT.png) - mixture

(a) Decomposing the recurrent states and the action of the agent into individual latent variables. (b) Updating the mask variable ![image](https://hackmd.io/_uploads/B10aJ-GHa.png). (c) Updating the component variable ![image](https://hackmd.io/_uploads/rJ7dg-fHT.png).

**Mixture Module**

In contrast to the standard latent representation z of a VAE, the mixture module assumes that there exist K entities in the scene, and each entity is embedded into a separate latent variable ![image](https://hackmd.io/_uploads/ry-VWWMHp.png)

:::info
:pushpin: As Fig. 4 shows (omitting the time index t), GATSBI factorizes the latent variable for each entity ![image](https://hackmd.io/_uploads/HygffWzBp.png) into a mask ![image](https://hackmd.io/_uploads/HJNmMbGS6.png) and the corresponding component ![image](https://hackmd.io/_uploads/SJGrMWGr6.png)
:::

![image](https://hackmd.io/_uploads/Bk02MZGHp.png)

For the k-th entity:
* the mask generates the M pixels in the image;
* the component encodes the appearance and the observation ![image](https://hackmd.io/_uploads/BJe-NZGrp.png)

***The mask variable decides the individual scene entities sequentially, determining what each component looks like.***
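The mask-plus-component factorization of the mixture module can be sketched in code: K per-entity masks are normalized to sum to one at every pixel, and the reconstruction mixes the K appearance components with those masks. This is a simplified, hypothetical sketch using random inputs and one joint softmax over entities; in GATSBI the masks and components are decoded from learned latent variables, and the masks are decided sequentially rather than in one shot.

```python
import numpy as np

def spatial_mixture_reconstruction(mask_logits, components):
    """Combine K entity components into one image with normalized masks.

    mask_logits : (K, H, W)    unnormalized per-entity mask logits
    components  : (K, H, W, 3) per-entity RGB appearance
    """
    # Softmax over the entity axis: at every pixel the K masks sum to 1,
    # so each pixel is explained by a mixture of the K entities.
    logits = mask_logits - mask_logits.max(axis=0, keepdims=True)
    masks = np.exp(logits)
    masks /= masks.sum(axis=0, keepdims=True)
    # Mixture image: sum over entities of mask_k * component_k
    recon = (masks[..., None] * components).sum(axis=0)
    return recon, masks

# Illustrative call with random stand-ins for decoded masks/components.
K, H, W = 3, 8, 8
rng = np.random.default_rng(0)
recon, masks = spatial_mixture_reconstruction(
    rng.normal(size=(K, H, W)), rng.uniform(size=(K, H, W, 3)))
print(recon.shape, np.allclose(masks.sum(axis=0), 1.0))
```

The per-pixel normalization is what makes this a spatial GMM: the masks act as mixing weights that decide which entity "owns" each pixel, while each component decides how that entity looks.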