# Notes on Neural Function Modules with Sparse Arguments: A Dynamic Approach to Integrating Information across Layers

#### Author: [Sharath Chandra](https://sharathraparthy.github.io/)

## [Paper Link](https://arxiv.org/pdf/2010.08012.pdf)

## Outline

1. Most current deep learning architectures (MLPs, CNNs, or RNNs) compute the current layer's representation (hidden state) from the complete output of the previous layer (previous hidden state).
2. Despite achieving state-of-the-art results in many domains, this kind of information processing is not modular and lacks the ability to dynamically "select" input arguments.
3. This paper proposes Neural Function Modules (NFM), which introduce modularity into feed-forward networks by combining attention, sparsity, and top-down and bottom-up feedback.
4. The idea is motivated by programming languages, where methods have a clean structure: each method can be viewed as a specialized module that performs a particular task on specific input arguments. This clean structure yields modularity and re-usability of the methods whenever needed. Based on this, the authors propose an architecture that selectively constructs the current input by attending over previous modules with an attention mechanism.
5. A second motivation from programming languages is the ability to reuse variables computed either recently or much earlier in the program, which connects to top-down and bottom-up feedback in the cognitive-psychology literature. Drawing inspiration from this, the authors use a multi-pass setup that allows NFM to attend over all previously seen passes.
6. The authors empirically show that dynamically selecting functional modules from previous layers via attention improves classification, out-of-distribution (OoD) generalization (due to the dynamic selection), generative modeling, and reinforcement learning.

## Neural Function Modules

The architecture design has the following goals:

1. Dynamic communication between layers.
2. Selective information routing.
3. Combining top-down and bottom-up information in a dynamic way using attention.
4. Introducing more flexibility into deep architectures.

### Algorithm

The algorithm is summarized below:

![](https://i.imgur.com/XLQh3mN.png)

**How to dynamically attend to previous modules?**

For the dynamic selection, NFM uses a multi-head attention mechanism with as few modifications as possible. More specifically, the current layer's state is linearly transformed by a parameterized query weight matrix $W_q$ to obtain the queries $\hat{Q} = QW_q$; the keys and values are computed analogously as $\hat{K} = KW_k$ and $\hat{V} = VW_v$. To induce competition between attention scores, a zero vector is appended to the keys and values, and only the top-k attention scores are kept. The attention output $A$ is computed as

$$A = \mathrm{softmax}\!\left(\frac{\hat{Q}\hat{K}^T}{\sqrt{d_k}}\right)\hat{V},$$

where $d_k$ is the key dimension. The final hidden state is then obtained by applying the activation function and projecting back to the original dimension: $h_1 = \sigma(AW_{o1})W_{o2}$. Finally, the output $h_1$ is scaled by a scalar $\gamma$ and added to the residual skip-connection: $h_2 = R + \gamma h_1$. Hence, within a layer, the attention parameters are $\theta_A = (W_q, W_k, W_v, W_{o1}, W_{o2}, \gamma)$.
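To make the dynamic-selection step concrete, here is a minimal PyTorch-style sketch of a single NFM attention step following the equations above. It is a simplification under my own assumptions, not the authors' implementation: it uses a single head, flat `(batch, d_model)` states, a particular top-k masking scheme, and an arbitrary initialization for $\gamma$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NFMAttention(nn.Module):
    """Single-head sketch of one NFM attention step (hypothetical implementation)."""

    def __init__(self, d_model: int, d_k: int, top_k: int = 4):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_k, bias=False)
        self.W_o1 = nn.Linear(d_k, d_k, bias=False)
        self.W_o2 = nn.Linear(d_k, d_model, bias=False)
        self.gamma = nn.Parameter(torch.zeros(1))  # scalar gate gamma (zero init is an assumption)
        self.top_k = top_k
        self.d_k = d_k

    def forward(self, h: torch.Tensor, prev_states: torch.Tensor) -> torch.Tensor:
        # h:           (batch, d_model)         current layer's state (also the residual R)
        # prev_states: (batch, n_prev, d_model) outputs of previously computed modules
        q = self.W_q(h).unsqueeze(1)            # (batch, 1, d_k)
        k = self.W_k(prev_states)               # (batch, n_prev, d_k)
        v = self.W_v(prev_states)               # (batch, n_prev, d_k)

        # Append a zero key/value so the module can choose to attend to "nothing".
        k = torch.cat([k, torch.zeros_like(k[:, :1])], dim=1)
        v = torch.cat([v, torch.zeros_like(v[:, :1])], dim=1)

        scores = q @ k.transpose(1, 2) / self.d_k ** 0.5   # (batch, 1, n_prev + 1)

        # Keep only the top-k scores; mask out the rest before the softmax.
        if scores.size(-1) > self.top_k:
            kth = scores.topk(self.top_k, dim=-1).values[..., -1:]
            scores = scores.masked_fill(scores < kth, float("-inf"))

        attn = F.softmax(scores, dim=-1)
        a = (attn @ v).squeeze(1)                           # (batch, d_k)

        h1 = self.W_o2(torch.relu(self.W_o1(a)))            # sigma(A W_o1) W_o2
        return h + self.gamma * h1                          # h2 = R + gamma * h1
```

The appended zero key/value gives every query a "null" slot, so the softmax competition can route essentially no information from previous modules when none of them is relevant.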
**How to deal with dimension mismatches?**

When attending over previous modules, the spatial dimensions of two feature maps may not match. To handle this, the authors propose a re-scaling strategy: no re-scaling is performed when the spatial dimensions are equal, i.e., $len(h_i) = len(h_j)$; when $len(h_i) > len(h_j)$, nearest-neighbour upsampling is applied; and when $len(h_i) < len(h_j)$, a space-to-depth operation treats all points in a local window of size $\frac{len(h_j)}{len(h_i)}$ as separate positions for the attention. (A minimal code sketch of these two rescaling operations is given at the end of these notes.)

## Experiments

The paper presents experiments on several challenging tasks, showing that the information bottleneck created by NFM drastically improves performance, specialization, and generalization.

**Generative Modeling** (NFM + InfoMax-GAN):

![](https://i.imgur.com/uLznjuL.png)

**Sort-of-CLEVR for relational reasoning** (CNN + NFM):

![](https://i.imgur.com/QczCLHN.png)

**Image Classification**

![](https://i.imgur.com/7NoAn7b.png)

**Reinforcement Learning**

![](https://i.imgur.com/E9BRWV9.png)
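As promised above, here is a minimal PyTorch sketch of the two rescaling operations used to align spatial dimensions before attention: nearest-neighbour upsampling when the feature map is smaller than the target, and space-to-depth (pixel unshuffle) when it is larger. The helper name `rescale_to_match`, the square `(batch, C, H, W)` layout, and the choice of how the folded channels are later split into attention positions are my assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F


def rescale_to_match(x: torch.Tensor, target_len: int) -> torch.Tensor:
    """Align the spatial size of `x` (batch, C, H, W) with `target_len` (hypothetical helper).

    - Same size: return unchanged.
    - x smaller than target: nearest-neighbour upsampling.
    - x larger than target: space-to-depth, folding each local window of size
      (H // target_len) into the channel dimension so the spatial resolution
      matches the target; the folded channels can then be treated as separate
      attention positions downstream.
    """
    h = x.size(-1)
    if h == target_len:
        return x
    if h < target_len:
        return F.interpolate(x, size=(target_len, target_len), mode="nearest")
    window = h // target_len
    return F.pixel_unshuffle(x, downscale_factor=window)


# Example: align an 8x8 map and a 32x32 map with a 16x16 target.
small = torch.randn(2, 64, 8, 8)
large = torch.randn(2, 64, 32, 32)
print(rescale_to_match(small, 16).shape)  # torch.Size([2, 64, 16, 16])
print(rescale_to_match(large, 16).shape)  # torch.Size([2, 256, 16, 16])
```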