# Writing - OORL - Attention version

## Note: Symbols (04/2021)

- Environment
  - $\mathbb{L}$ = object library
  - $N = |\mathbb{L}|$ = number of objects in the library
  - $K$ ($2 \le K \le N$) = number of objects in each scene (fixed for all scenes)
- Model (GNN + self-attention)
  - $s_t \in \mathbb{R}^{50 \times 50}$ = state image input at time $t$
  - $m_t \in \mathbb{R}^{K \times 5 \times 5}$ = object masks / feature maps (for the $K$ objects), assuming a $10 \times 10$ filter
    - $m_k \in \mathbb{R}^{5 \times 5}$ is the feature map of object $k \in [K]$
    - **(We are still deciding whether the mask should cover $K$ or $N$ objects.)**
  - $z_t \in \mathbb{R}^{K \times D}$, where $D$ is the feature dimension of each object
    - $z_k \in \mathbb{R}^{D}$ is the embedding of object $k$
  - $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ = query, key, and value matrices in self-attention
    - Element-wise example: $\text{key}_k = \text{MLP}_{\mathbf{K}}(m_k) \in \mathbb{R}^N$ (or $\mathbf{k}_k$)
    - Every key and query is $N$-dimensional
    - Conceptually, it is useful to think of $N$ keys and $N$ values, one per object in the library; in the implementation, however, we do not care about the embeddings (locations) of invisible objects (i.e., library objects not chosen for the scene).
  - $\mathbf{Q} \in \mathbb{R}^{K \times N}$ stacks the $N$-dimensional query embeddings of the $K$ visible objects
    - **(For keys and values, we are still deciding whether the dimension covers $K$ or $N$ objects.)**
    - $\mathbf{K} \in \mathbb{R}^{N \times N}$ may hold $N$-dimensional key embeddings for all $N$ objects
    - $\mathbf{V} \in \mathbb{R}^{N \times D}$ may hold $D$-dimensional value embeddings for all $N$ objects (the same dimension as $z_k$)
  - $\text{index}_i$, $i \in [N]$
  - $\text{key}_n, \text{value}_n$, $n \in [N]$

## Self-attention

- Background / Current case
  - We aim to study compositional generalization, which requires the policy mapping $\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})$ to be equivariant w.r.t. object replacement.
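To make the shapes above concrete, here is a minimal NumPy sketch of scaled dot-product self-attention over the $K$ visible object embeddings, using $N$-dimensional queries and keys as defined in the symbol list. The linear maps `W_q`, `W_k`, `W_v` are stand-ins for the unspecified $\text{MLP}_{\mathbf{Q}}$, $\text{MLP}_{\mathbf{K}}$, $\text{MLP}_{\mathbf{V}}$, and keys/values are computed only for the $K$ visible objects (one of the two options still under discussion); all names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, D = 10, 3, 16  # library size, objects per scene, feature dimension

# Embeddings z_t of the K visible objects (rows are z_k).
z = rng.standard_normal((K, D))

# Linear stand-ins for MLP_Q, MLP_K, MLP_V (assumed; the notes leave these open).
W_q = rng.standard_normal((D, N))
W_k = rng.standard_normal((D, N))
W_v = rng.standard_normal((D, D))

Q = z @ W_q     # (K, N): one N-dimensional query per visible object
Kmat = z @ W_k  # (K, N): keys for visible objects only (K-vs-N is undecided)
V = z @ W_v     # (K, D): values, same dimension as z_k

# Scaled dot-product attention followed by a row-wise softmax.
scores = Q @ Kmat.T / np.sqrt(N)                      # (K, K)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)              # rows sum to 1
out = attn @ V                                        # (K, D) updated embeddings

print(out.shape)  # (3, 16)
```

Because the same maps are applied to every object and the softmax is taken row-wise, permuting the $K$ input rows permutes the $K$ output rows identically, which is the equivariance property the notes rely on.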
  - Thus, to achieve equivariant policies, we first study the equivariance properties of the (deterministic) transition model $T: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$.
  - For object-oriented environments, graph neural networks (GNNs) are commonly used to model object-factorized dynamics, treating objects as graph nodes and object interactions as edges.
  - It is known that fully-connected GNNs (i.e., assuming all objects interact) are equivariant w.r.t. the labeling of nodes.
  - Since each scene contains $K$ objects, an object-factorized GNN is equivariant w.r.t. the permutation symmetry group $S_K$.
- Motivation
  - Although an object-factorized GNN is permutation equivariant, it may fail to be equivariant to the object-replacement symmetry $S_K \times S_N$.
- Motivating results
  - We show the illustrative results
- Illustrative Examples
  - (my object examples)
- Idea
  - Self-attention version
- Implementation
  - Examples
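The $S_K$ equivariance claim above can be checked numerically: a fully-connected message-passing step with shared edge and node functions commutes with any permutation of the object rows. The sketch below is an assumed minimal GNN (single linear layers with `tanh` in place of the MLPs), not the notes' actual model.

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 4, 8  # objects per scene, feature dimension

# Shared edge/node functions of a fully-connected GNN; single linear
# layers with tanh stand in for the usual MLPs (an assumption).
W_edge = rng.standard_normal((2 * D, D))  # edge function on (z_i, z_j)
W_node = rng.standard_normal((2 * D, D))  # node update on (z_i, sum of messages)

def gnn_step(z):
    """One round of message passing over a fully-connected graph."""
    n = z.shape[0]
    msgs = np.zeros_like(z)
    for i in range(n):
        for j in range(n):
            if i != j:  # sum-aggregate messages from all other objects
                msgs[i] += np.tanh(np.concatenate([z[i], z[j]]) @ W_edge)
    return np.tanh(np.concatenate([z, msgs], axis=1) @ W_node)

z = rng.standard_normal((K, D))
perm = rng.permutation(K)

# Equivariance: permuting the input rows permutes the output rows identically.
lhs = gnn_step(z[perm])
rhs = gnn_step(z)[perm]
print(np.allclose(lhs, rhs))  # True
```

The check succeeds because the edge and node functions are shared across objects and sum aggregation is permutation invariant; nothing in this construction, however, ties an object's row to its identity in the library, which is why equivariance under object replacement ($S_K \times S_N$) does not follow automatically.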