# ICLR Rebuttal: MoCA
# General response
We would like to thank all reviewers for their feedback and constructive suggestions. We are very encouraged by the reviewers' evaluation that our work is interesting ("timely and interesting" (JpF6), "concept is interesting" (4n7p), "novel and significant contributions" (9zZN)). We have taken the reviewers' feedback into account and provided an updated revision, along with a detailed response to each reviewer's suggestions. In the revised manuscript, our changes are highlighted in red. **Below we summarize our updates to the manuscript.**
* To avoid naming confusion, we rename the previously defined "prototype semantic cell" as "semantic cell" and the previously defined "prototype component cell" as "prototype cell".
* We rephrase the introduction to clarify the connection between our proposed method and neuroscience. In essence, our model is inspired by the "grandmother cell" hypothesis, and we explore the utility of prototype cells as memory priors in a standard computer vision image generation task, seeking to provide potential insights into the advantages of having these "grandmother" neurons in the visual cortex at a functional level.
* We add comparisons with other works that use prototype learning and memory banks to the related work section and highlight the uniqueness of our work.
* We modify Figure 1 and Figure 2 to clarify the three projection heads $\theta(\cdot)$, $\phi(\cdot)$ and $\psi(\cdot)$ and how they are used in our model.
* Visualization results are added as the new Figure 5 and Figure 6 in the main text to provide further understanding of the semantic meanings of the different clusters and prototypes formed in MoCA. The corresponding discussion is updated in Section 4.3.
* To verify that MoCA does not simply memorize the training samples, we visualize the top-3 closest training images for the generated images in Appendix 8.8.
* We move the original analysis of MoCA's prototypes (the original Figure 5, Figure 6, and the subsection "Importance of memory organization") to Appendix Sec. 8.1 and Appendix Sec. 8.6.
# Response to JpF6
We thank the reviewer for the insightful comments. Below we address the reviewer's concerns.
**(Connection to neuroscience)**
We thank the reviewer for the cautionary note on the connection of our work to neuroscience. In fact, all three reviewers raised this issue, but all agreed that it is not a major concern, as we did not claim this is a neural model, just neural-inspired, and we agree that the paper should be judged as a computer vision paper. We have revised the paper to clarify this point.
That said, we must say that our inquiry was genuinely triggered by the finding of a super-sparse code of neurons in the superficial layer of V1. The question we asked was: what could be the computational advantages, for the feedforward (analysis) and feedback (synthesis) paths in the hierarchical visual system, of having this type of neuron with such high response sparsity and selectivity? We used a computer vision task to explore this question at a functional level. We did not claim that our model is an explicit neural model with a serious correspondence at an architectural or cellular level. However, we still believe our work is relevant to neuroscience at a functional level. From our perspective, CNNs and GANs are relevant for understanding feedforward and feedback computation in the hierarchical visual cortex, respectively. The conceptual frameworks of analysis-by-synthesis and predictive coding are popular among cognitive and theoretical neuroscientists. The study of GANs and image generation is meaningful and potentially relevant for understanding the function of recurrent feedback. We toned down our abstract and revised the introduction to say, "Although our work is inspired by neurophysiological findings, we do not claim that the proposed MoCA module is a neural model. Rather, our goal is to explore the utility of prototype cells as memory priors in a standard computer vision image generation task, which might provide potential insights into the advantages of having these "grandmother" neurons in the visual cortex at a functional level."
**(Parameter Inflation)**
We thank the reviewer for bringing up this question. We address this concern with experiments from the results already documented in the original paper as well as new experiments we conducted in the rebuttal period.
First, we do include a fair comparison with the baseline in our manuscript. Specifically, MoCA is an architectural layer with two paths: one is conventional Self-Attention (SA), the other is the newly proposed prototype memory attention path. Note that the projection heads ($\theta(\cdot)$, $\phi(\cdot)$, $\psi(\cdot)$) of the prototype memory path are shared with Self-Attention; hence, compared to using self-attention alone, there is no **learnable** parameter inflation (learnable in the sense of gradient descent). Yet, Table 3 in the main text shows MoCA's improvement over Self-Attention.
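To make the shared-head, two-path design concrete, below is a minimal PyTorch sketch of such a layer. The channel sizes, memory size, random memory initialization, and the additive fusion of the two paths are illustrative assumptions for exposition, not the exact MoCA implementation:

```python
import torch
import torch.nn as nn

class TwoPathAttention(nn.Module):
    """Sketch: self-attention plus a prototype-memory attention path that
    reuses the same projection heads, so the memory path adds no learnable
    parameters beyond those of SA itself."""
    def __init__(self, channels: int, mem_size: int = 512):
        super().__init__()
        inner = channels // 8
        self.theta = nn.Conv2d(channels, inner, 1)     # query head (shared)
        self.phi = nn.Conv2d(channels, inner, 1)       # key head (shared)
        self.psi = nn.Conv2d(channels, channels, 1)    # value head (shared)
        # Prototype memory: registered as buffers, not Parameters, so it is
        # excluded from gradient descent (filled by a separate caching step).
        self.register_buffer("mem_keys", torch.randn(mem_size, inner))
        self.register_buffer("mem_vals", torch.randn(mem_size, channels))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, inner)
        k = self.phi(x).flatten(2)                     # (b, inner, hw)
        v = self.psi(x).flatten(2).transpose(1, 2)     # (b, hw, c)
        # Path 1: conventional self-attention over spatial locations.
        sa = torch.softmax(q @ k, dim=-1) @ v
        # Path 2: the same queries attend over the cached prototypes.
        mem = torch.softmax(q @ self.mem_keys.t(), dim=-1) @ self.mem_vals
        out = (sa + mem).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                 # residual connection
```

Counting this module's parameters makes the point directly: all learnable weights sit in the three shared heads, while `mem_keys`/`mem_vals` are fixed buffers.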
When it comes to comparison with the original FastGAN and StyleGAN, our method adds only a tiny fraction of parameters relative to the overall **learnable** parameter count (about 0.1% on the FastGAN baseline and 0.6% on StyleGAN). Yet, the addition of a fixed-parameter MoCA layer provides a 5%~20% performance improvement (measured by FID; see Section 4.1 of the main text).
Furthermore, to fully control for the number of parameters, we widened the feature dimension of a FastGAN layer so that it has the same number of parameters as MoCA + FastGAN. We found that the resulting performance improvement was minuscule (the best FID improves by less than 1% and overlaps with the normal-size model most of the time). Thus, the performance improvement observed with MoCA cannot be explained by parameter size alone.
**(Memorizing examples from training set)**
Does MoCA simply remember example images for few-shot image generation? We provide further justification and empirical analysis in the Appendix. MoCA is a module that can, in principle, be installed at every layer. In our current implementation, we put MoCA in an intermediate layer. Hence it only caches part-level activations, analogous to visual concepts, but in a remapped space. The prototypes are clusters over the entire dataset, gathering part-level prototypes that can potentially serve as reconfigurable parts for synthesizing novel images in a compositional framework. Such parts are useful for few-shot image generation because the training data is limited, so reconfiguration of existing parts can ease the training process. However, in contrast to And-Or image grammars, our reconfiguration is softer, implemented with an attention mechanism.
Empirically, we compare generated images with their top-3 most similar training images (based on pretrained VGG features and a cosine similarity metric) on the Obama, MSCOCO-300, and Animal Face Dog datasets to demonstrate that our MoCA-trained model does not memorize the training set (see **Appendix Section 8.8** for demonstration and further discussion). Qualitatively, although the generated images bear some degree of similarity to the retrieved neighbors, they are **not** simply memorizations of the training images.
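For reference, the retrieval step can be sketched as below, assuming VGG-16 convolutional features from torchvision and inputs already resized to a common shape and ImageNet-normalized; the exact feature layer and preprocessing we used may differ:

```python
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg16(pretrained=True).features.eval()  # pretrained VGG feature extractor

@torch.no_grad()
def vgg_feature(img: torch.Tensor) -> torch.Tensor:
    """img: (1, 3, H, W), ImageNet-normalized. Returns a flat feature vector."""
    return vgg(img).flatten(1)

@torch.no_grad()
def top_k_neighbors(generated: torch.Tensor, train_imgs: list, k: int = 3):
    """Indices of the k training images closest to `generated` in cosine similarity."""
    g = F.normalize(vgg_feature(generated), dim=1)                         # (1, d)
    feats = F.normalize(torch.cat([vgg_feature(t) for t in train_imgs]), dim=1)
    sims = (feats @ g.t()).squeeze(1)                                      # (N,)
    return sims.topk(k).indices
```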
**(Concerns on MoCA cluster visualization)**
To facilitate further understanding of the MoCA system, we update the manuscript with two additional analyses. We answer the following questions: (i) Which semantic cluster is used during the attention process at different locations in the image, and is there any real semantic meaning associated with it? (ii) What are the relationships between the prototype cells and their corresponding clusters? We provide detailed results and discussion in Section 4.3. Briefly, the results suggest that **(1)** semantic clusters have semantic meanings, and different semantic clusters bias different regions of the image according to those semantics; **(2)** prototype cells are specialized sub-parts of their parent semantic concept clusters, suggesting that MoCA may have the potential to facilitate a hierarchical compositional system in the image synthesis path.
To make this concrete, consider a layer in which the prototype component cells encode face parts such as eyes, noses, and mouths. The prototype semantic cell will be coding "face-ness." The 1x1 self-attention projection serves to map eyes, noses, and mouths to proximal locations in a low-dimensional transformed space, where these related concepts are clustered together and can be represented by their cluster center -- the "face-ness" semantic cell. Note that this "face-ness" cell has the spatial scope of one hyper-column only and is different from the neurons in the preceding (upstream) layer, which, with a larger receptive field or projective field, represent the entire face with all the face parts in an appropriate spatial configuration. The prototype semantic cell is thus more like a circuit switch or grandmother selector. It represents an abstract semantic meaning like face-ness and is different from visual concepts or prototypes. The prototype semantic cell itself does not have all the fine-grained information needed to provide the appropriate modulation. We found that a subset of grandmother cells turned on (or not turned off) by the prototype semantic cell needs to work together to provide the appropriate contextual modulation.
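This two-stage lookup (hard selection of a semantic cell, then soft combination of the prototype cells assigned to it) can be schematized as below; all names, shapes, and the argmax cluster selection are simplifying assumptions for illustration:

```python
import torch

def cluster_then_attend(q, centroids, proto_keys, proto_vals, assignments):
    """q: (d,) query at one location; centroids: (K, d) semantic cells;
    proto_keys: (M, d), proto_vals: (M, c) cached prototype cells;
    assignments: (M,) cluster index of each prototype."""
    cluster = torch.argmax(centroids @ q)                     # pick the semantic cell ("switch")
    idx = (assignments == cluster).nonzero(as_tuple=True)[0]  # its member prototypes
    w = torch.softmax(proto_keys[idx] @ q, dim=0)             # soft weights within the cluster
    return w @ proto_vals[idx]                                # combined contextual modulation
```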
# Further Response to JpF6
Dear Reviewer JpF6,
Thank you again for your feedback. As the deadline for discussion is approaching, we would be happy to provide any additional clarifications that you may need.
In our previous response, we carefully studied your comments and made updates to the revision, as summarized below:
- Provided a discussion on how our work is related to neuroscience and revised the manuscript to clarify that our work is not a neural model and that MoCA corresponds with neuroscience only at a functional level.
- Provided a discussion, based on existing ablations and additional results, showing that MoCA's performance improvements are not due to parameter inflation but to the proposed mechanism itself.
- Provided the top-3 nearest neighbors of the generated images to demonstrate that MoCA is not simply memorizing the training set but facilitating generalization by reconfiguring part-level information.
- Added more interpretable visualizations to demonstrate how MoCA works: it caches part-level information, which is then used to softly bias the corresponding activations.
Please let us know if you have any questions remaining. We would be happy to do anything that would be helpful in the time remaining!
Thank you for your time!
Best,
Authors
# Response to 4n7p
We thank the reviewer for the constructive feedback. Below we would like to address the concerns.
**(Concerns about grandmother cells in neuroscience)**
We thank the reviewer for the advice. We made it clear in the abstract and the introduction of the updated manuscript as follows, "We called these highly selective sparse-responding feature detectors 'grandmother neurons' to highlight their possible explicit encoding of prototypes, even though in reality, a prototype is likely represented by a sparse cluster of neurons in the brain. Neurons in different layers of each visual area exhibit different degrees of response sparseness, complementing one another in various functions. "
It is worth noting that even in our model, there is a variety of neural codes, ranging from distributed to sparse to "grandmotherly." The codes of the neurons in the feature layers of the GAN, for example, are quite distributed. Distributed codes provide greater flexibility and power in analysis and synthesis. Further, while a "grandmother cell" in MoCA might code an explicit prototype memory or visual concept, multiple grandmother cells in the same semantic cluster, not just one, would be activated by any particular input, and their codes are combined (weighted by softmax) to generate the needed contextual modulation of the activation patterns for downstream processing.
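Schematically, writing $k_p$ and $v_p$ for the cached key and value of prototype cell $p$, and $\mathcal{C}$ for the selected semantic cluster (a simplified form of the paper's attention, with momentum details omitted), the modulation received at location $i$ is

$$m_i \;=\; \sum_{p \in \mathcal{C}} \frac{\exp\big(\theta(x_i)^\top k_p\big)}{\sum_{p' \in \mathcal{C}} \exp\big(\theta(x_i)^\top k_{p'}\big)}\, v_p,$$

so no single grandmother cell acts alone: the softmax blends the whole cluster.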
**(Comparison to existing ideas on memory banks and prototype learning)**
We appreciate the reviewer's suggestion. We add additional comparisons with memory-bank ideas in the literature under the related work section titled "Prototype Memory Mechanism". Indeed, the memory bank is not a completely new idea. However, most prior work uses memory banks at the instance level [1], whereas our method uses the memory bank at the intermediate representation level (part level). By having a memory bank at the part level, our system naturally incorporates the ideas of compositionality and reconfigurable parts. We show that this is particularly valuable for the few-shot image generation task.
In addition, we also discuss our differences from existing prototype ideas such as [2] under the related work section titled "Few Shot Prototypes Learning" in the updated manuscript. The main differences lie in the level at which prototypes are formed and in how the prototypes are utilized. The prototypes in [2] are learned as representatives of different classes at the instance level, whereas our prototypes exist at the intermediate feature level (part level). Also, our method uses an attention mechanism to softly bias the intermediate features with the cached prototypes during inference, while [2] selects the closest prototype in a discrete manner to perform category predictions at test time. Although both involve prototype generation and use during inference, our method with the attention mechanism can smoothly modify the intermediate representation, and hence can be applied to more challenging tasks such as image generation in the few-shot setting.
**(Neuroscience correspondence, the neural analog of the semantic prototype cell, and why not use Ki for the prototype memory)**
Our inquiry was inspired by the finding of a super-sparse code of neurons in the superficial layer of V1. We wish to search for the computational advantages of having neurons with such high response sparsity and selectivity in the feedforward (analysis) / feedback (synthesis) paths of the hierarchical visual system. We used a computer vision task to explore this question at a functional level. Note that we do not claim our work is a neural model that deeply resembles the biological details; it is rather an exploration of the potential computational roles of these recently found "grandmotherly" neurons in a hierarchical visual system. The benefit MoCA brings on the computer vision task can potentially hint at answers to our questions about the computational advantages of such neurons. Overall, our model is a computer vision model designed with certain neuroscience properties in mind. We clarify the connection and its limitations in the revised manuscript (abstract, introduction, and conclusion).
As to the questions, "What do the prototype semantic cells and the prototype component cells correspond to in the neocortex? Is there a correspondence, or is it just for network design?": in our view, the prototype component cells correspond to the super-sparse cells found in the superficial layer of V1 that motivated our study. The prototype semantic cells, if you allow us to speculate, might be a type of inhibitory neuron (SOM) that performs circuit switching to select a certain subset of prototype cells to participate in contextual modulation.
The prototype semantic cell represents more abstract information and does not have all the fine-grained information needed to provide the appropriate contextual modulation. The prototype semantic cells are responsible for selecting an appropriate set of prototype cells to participate in the attention process. This group of prototype (grandmother) cells needs to work together to provide the appropriate contextual modulation.
**(Questions about $\theta$, $\phi$ and $\psi$)**
We appreciate the reviewer's question. We clarify in the updated main architecture figure (Figure 1) that our proposed method contains two paths: one is Memory Concept Attention (MoCA) and the other is Self-Attention (SA). MoCA and SA share the same projection heads. The use of three heads follows the convention in the self-attention/non-local network literature: $\theta$ is the query, and $\tilde{\phi}$ is the key in MoCA ($\phi$ is the key module in SA, and $\tilde{\phi}$ is its momentum-updated version), through which we cache $\tilde{\phi}(A_{ij})$ into the memory so that, for every query, we can find a group of related keys in the memory and bias the query. $\psi$ corresponds to the 'value' in the SA literature. We updated Figure 1 to demonstrate the network architecture more clearly. We did not intend to use the capitalized $\Theta$, $\Phi$ and $\Psi$, and have corrected them in the updated manuscript.
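For concreteness, the momentum update of $\tilde{\phi}$ from the learnable $\phi$ can be sketched as below, in the style of MoCo [1]; the channel sizes and the coefficient 0.999 are illustrative assumptions, not necessarily the values used in the paper:

```python
import torch
import torch.nn as nn

phi = nn.Conv2d(64, 8, 1)            # learnable key head, trained with SA by SGD
phi_tilde = nn.Conv2d(64, 8, 1)      # momentum copy used by the memory path
phi_tilde.load_state_dict(phi.state_dict())
for p in phi_tilde.parameters():
    p.requires_grad_(False)          # updated by momentum, never by gradients

@torch.no_grad()
def momentum_update(m: float = 0.999):
    for p, p_t in zip(phi.parameters(), phi_tilde.parameters()):
        p_t.mul_(m).add_(p, alpha=1 - m)   # p_t <- m * p_t + (1 - m) * p
```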
[1] He, Kaiming, et al. "Momentum contrast for unsupervised visual representation learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[2] Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175.
# Further Response to 4n7p
Dear Reviewer 4n7p,
Thank you again for your feedback. As the deadline for discussion is approaching, we would be happy to provide any additional clarifications that you may need.
In our previous response, we carefully studied your comments and made updates to the revision, as summarized below:
- Revised the manuscript to clarify our connection to neuroscience and provided a discussion of the correspondence with grandmother cells, the role of semantic cells, and the rationale for choosing a prototype memory.
- Provided a further comparison with existing works utilizing memory-bank and prototype ideas.
- Clarified the use of the projection heads $\theta$, $\phi$ and $\psi$, and updated the main architecture figure to make this clearer.
Please let us know if you have any questions remaining. We would be happy to do anything that would be helpful in the time remaining!
Thank you for your time!
Best,
Authors
# Response to 9zZN
We thank the reviewer for the very positive comments. We hope that the updates we added to the paper have further strengthened it. Below we address your concerns.
**(Concerns about baseline numbers)**
We appreciate the reviewer's concern. FastGAN and StyleGANv2 were chosen because, in terms of architecture design, they are the strongest image generation models in unconditional few-shot image generation (FastGAN) and in the image generation domain in general (StyleGANv2). We also include another baseline model (LSGAN) to further test the generalization of MoCA. Although the LSGAN backbone is weaker compared to FastGAN and StyleGANv2, we still observe that MoCA improves upon the baseline model by a certain amount (see Appendix 8.2 for details).
**(Concerns about neuroscience correspondence)**
We thank the reviewer for sharing this concern. Our work was inspired by neuroscience findings, but the correspondence is, at best, at a functional level. We do not intend to claim this is a neural model, nor did we construct the architecture based on neural evidence. Our work is rather a computer vision method designed with neuroscience questions in mind. Although the exact biological correspondence may not be obvious at the moment, the performance improvement on the image generation task could spark ideas about the potential computational roles of sparse complex feature detectors in hierarchical compositional visual systems. We have updated our manuscript to make this connection and its limitations clearer.