SOM Network
===========
Pre
---
|![](https://i.imgur.com/CeFV1uN.png)|
|--|
|An example of SOM.|
|![](https://i.imgur.com/eC9BnwQ.png)|
|--|
||
Reconstruction
---
|![](https://i.imgur.com/YEa8y2e.png)|![](https://i.imgur.com/ZcpUp2W.png)|![](https://i.imgur.com/8N3lS0H.png)|
|--|--|--|
|Original digits (MNIST)| Reconstructed from patches| SOM of the patches|
|![](https://i.imgur.com/ewHHjgg.png)|![](https://i.imgur.com/797p6MZ.png)|![](https://i.imgur.com/dikwXob.png)|
|--|--|--|
|Original Frog (CIFAR)| Reconstructed from patches| SOM of the patches|
For CIFAR, it is much more difficult to reconstruct the image accurately from patches: the space of possible patches is far larger, since the objects are more complicated than digits and contain many colors. Reconstructing from $4\times4$ patches on CIFAR is still much blurrier than using $9\times9$ patches on MNIST.
|![](https://i.imgur.com/797p6MZ.png)|![](https://i.imgur.com/VJ5ROL1.png)|![](https://i.imgur.com/gM0cgE0.png)|
|--|--|--|
|Reconstruction with a SOM of 400 synaptic vectors|Reconstruction with a SOM of 2500 synaptic vectors|SOM with 2500 synaptic vectors|
With more synaptic vectors, the reconstruction improves to some degree (e.g. the colors become more distinguishable).
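For reference, here is a minimal numpy sketch of this kind of patch-based reconstruction: train a SOM on flattened patches, then rebuild the image by replacing every patch with the synaptic vector of its winner. The grid size, patch size, and learning schedule are illustrative choices, not the settings used in the figures above, and the sketch handles grayscale patches only.

```python
import numpy as np

def train_som(patches, grid=(20, 20), iters=10000, lr=0.5, sigma=3.0, seed=0):
    """Classic online SOM: pull the winner and its lattice neighbors toward each sample."""
    rng = np.random.default_rng(seed)
    h, w = grid
    dim = patches.shape[1]
    weights = rng.random((h, w, dim))
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    for t in range(iters):
        x = patches[rng.integers(len(patches))]
        d = np.linalg.norm(weights - x, axis=-1)          # distance of every unit to the sample
        wi, wj = np.unravel_index(np.argmin(d), d.shape)  # winner unit
        s = sigma * (1 - t / iters) + 0.5                 # shrinking neighborhood radius
        g = np.exp(-((ii - wi) ** 2 + (jj - wj) ** 2) / (2 * s ** 2))
        weights += lr * (1 - t / iters) * g[..., None] * (x - weights)
    return weights.reshape(-1, dim)                       # the codebook of synaptic vectors

def reconstruct(image, codebook, k=4):
    """Replace every non-overlapping k x k patch with its nearest codebook vector."""
    out = np.zeros_like(image, dtype=float)
    for r in range(0, image.shape[0] - k + 1, k):
        for c in range(0, image.shape[1] - k + 1, k):
            patch = image[r:r + k, c:c + k].reshape(-1)
            best = codebook[np.argmin(np.linalg.norm(codebook - patch, axis=1))]
            out[r:r + k, c:c + k] = best.reshape(k, k)
    return out
```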
|![](https://i.imgur.com/f9M84YX.png)|
|--|
|Mapping images of a car from different angles to a SOM with a looped 1-D lattice.|
Stacked Patched SOM
---
|![](https://i.imgur.com/EMwN4dT.png)|
|--|
|Stacked patched SOM: we use the coordinates (topological position) of the winner neuron as the input of the next layer. This is the clustering result of the last layer. Apparently, this architecture does a poor job of clustering the digits correctly.|
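A rough sketch of this stacking, assuming two already-trained codebooks `codebook1` and `codebook2` of shape `(H, W, dim)` (training itself is omitted): the winner coordinates of all first-layer patches are concatenated and fed to the second-layer SOM.

```python
import numpy as np

def winner_coords(x, codebook):
    """Topological (row, col) position of the unit whose weight vector is closest to x."""
    d = np.linalg.norm(codebook - x, axis=-1)              # (H, W) distances
    return np.array(np.unravel_index(np.argmin(d), d.shape), dtype=float)

def layer1_features(image, codebook1, k=4):
    """Concatenate the winner coordinates of all non-overlapping k x k patches."""
    coords = [
        winner_coords(image[r:r + k, c:c + k].reshape(-1), codebook1)
        for r in range(0, image.shape[0] - k + 1, k)
        for c in range(0, image.shape[1] - k + 1, k)
    ]
    return np.concatenate(coords)                          # input of the second-layer SOM

def layer2_cluster(image, codebook1, codebook2, k=4):
    """The digit is assigned to the winner unit of the second-layer SOM."""
    return winner_coords(layer1_features(image, codebook1, k), codebook2)
```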
SOM as a Mask
---
In all of the following discussions, we are talking about using SOM as a mask.
|![](https://i.imgur.com/98VirUK.png)|
|--|
|Illustration of the SOM mask. The number of neurons in the SOM map equals that of the feed-forward layer (FC or CNN). An input is sent to the SOM to calculate the neighborhood function, which serves as a mask. The mask is then applied to the output of the feed-forward layer.|
SOM+FC
---
A fully connected layer is easier to manipulate than a CNN. The following network consists of three fully connected layers, and the SOM is applied to the first two.
The basic idea is to use the SOM to calculate a mask for the connections between input and output. For an input sample $x$, the output of a traditional fully connected layer is $y_j = \sum_{i}x_iW_{ij}$ for each neuron $y_j$. However, since the input $x$ has been organized, only a small neighborhood of neurons around the winner neuron should be activated. Therefore, we can calculate a mask $m_j$ indicating how strongly each neuron $j$ should be activated.
If the winner neuron of $x$ in the output layer is $y_{j^*}$, then the output is $$y_j = m_j \sum_{i}x_iW_{ij}, \qquad m_j = h_\sigma\big(dist(j, j^*)\big),$$ where $dist(\cdot, \cdot)$ is the distance between two neurons in the topological space and $h_\sigma$ is a neighborhood function (e.g. a Gaussian of width $\sigma$) that decays with distance.
The parameter $\sigma$ controls the influence radius of the winner neuron and is critical to the training. If $\sigma$ is small enough, the network does a good job of organizing, but the accuracy drops because the activation is too sparse. On the other hand, if $\sigma$ is large, the performance approaches that of the model without SOM, but the organized map becomes blurry, like an average of different images.
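A minimal PyTorch sketch of such a masked fully connected layer is given below. The Gaussian form of $h_\sigma$, the 2-D grid layout, and the way the SOM weights are stored are my assumptions; the SOM weight update itself is omitted.

```python
import torch
import torch.nn as nn

class SOMMaskedLinear(nn.Module):
    """Fully connected layer whose output is scaled by a SOM neighborhood mask."""

    def __init__(self, in_dim, grid=(10, 10), sigma=5.0):
        super().__init__()
        n_out = grid[0] * grid[1]                          # one SOM unit per output neuron
        self.fc = nn.Linear(in_dim, n_out)
        self.sigma = sigma
        # SOM weights live in the input space; they are updated by the SOM rule
        # (omitted here), not by backprop, hence a buffer rather than a parameter.
        self.register_buffer("som", torch.rand(n_out, in_dim))
        ii, jj = torch.meshgrid(torch.arange(grid[0]), torch.arange(grid[1]), indexing="ij")
        self.register_buffer("coords", torch.stack([ii.flatten(), jj.flatten()], dim=1).float())

    def forward(self, x, sigma=None):
        sigma = self.sigma if sigma is None else sigma
        y = self.fc(x)                                     # y_j = sum_i x_i W_ij (+ bias)
        winner = torch.cdist(x, self.som).argmin(dim=1)    # j* = unit closest to the input
        topo = torch.cdist(self.coords[winner], self.coords)
        mask = torch.exp(-topo.pow(2) / (2 * sigma ** 2))  # Gaussian neighborhood h_sigma
        return mask * y
```

Passing a larger `sigma` at forward time than the one used when updating the SOM is one way to realize the organization/forwarding split described below.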
|![](https://i.imgur.com/xWXYR7A.png)|
|--|
|Performance on CIFAR10: The four curves from top to bottom are $\sigma=+\infty$ (model without SOM), $\sigma=50$, $\sigma=20$, and $\sigma=5$. A small $\sigma$ helps organization but hurts performance due to the sparsity of the activation.|
By using different values of $\sigma$ for organization and forwarding, we can achieve both performance and interpretability: a small $\sigma$ to train the organizing map and a large $\sigma$ to calculate the forwarding mask.
|![](https://i.imgur.com/deSZvRQ.png)|
|--|
|The self-organized map of the first layer on MNIST.|
|![](https://i.imgur.com/uzi23vD.png)|
|--|
|Performance on MNIST: The loss, test accuracy and training accuracy of the network. The blue and red curves are models with SOM (using different $\sigma$ for organization and forwarding) and the orange curve is the model without SOM. The blue model applies SOM to both the first and the second layer, while the red one applies SOM only to the first layer.|
Using SOM on FC in this way acts like clustering the input samples before handling them. Although the activation is sparse, we still have to perform the entire matrix multiplication, so no computation is saved. Furthermore, no performance boost is observed from this model. The only benefit is a good visualization, which can already be achieved by the vanilla SOM. **(TODO)** Other potential contributions: filter organization, deeper SOM, defending against adversarial samples.
The self-organization observed in the striate cortex is based on small patches rather than the entire image.
We cannot observe any organized patterns in the filters.
|![](https://i.imgur.com/Gfkhg3R.png)|
|--|
|The visualization shows the $(28\times28)$ inputs that maximally activate the $(10\times10)$ filters.|
Implementing a CartPole agent (a rough training skeleton follows this list):
- Regularization is dangerous.
- Gamma is important (never set it to 0).
- Use a large negative reward to compensate for the scarcity of negative samples.
- Control the size of the memory.
- Train after each episode, not after each step.
- Train the minibatch sample by sample rather than all at once.
- The training frequency is related to the memory size but should be independent of the episode length.
- CartPole-v0 is different from CartPole-v1.
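The skeleton below reflects these notes: bounded replay memory, a large negative terminal reward, $\gamma>0$, training once per episode, and the minibatch handled sample by sample. The environment is assumed to follow the classic gym interface (`reset()` returns a state, `step(a)` returns `(state, reward, done, info)`); the network size and all hyperparameters are illustrative, not the settings actually used.

```python
import random
from collections import deque
import torch
import torch.nn as nn

def train(env, episodes=300, gamma=0.95, lr=1e-3, memory_size=2000,
          batch_size=64, fail_reward=-10.0, epsilon=0.1):
    # CartPole: 4-dimensional state, 2 actions.
    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    memory = deque(maxlen=memory_size)                 # old transitions are dropped automatically
    for ep in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.randrange(2)
            else:
                with torch.no_grad():
                    action = q_net(torch.tensor(state, dtype=torch.float32)).argmax().item()
            next_state, reward, done, _ = env.step(action)
            if done:
                reward = fail_reward                   # weight the rare negative outcome heavily
                                                       # (crude: also penalizes time-limit termination)
            memory.append((state, action, reward, next_state, done))
            state = next_state
        # Train once per episode, not once per step.
        batch = random.sample(memory, min(batch_size, len(memory)))
        for s, a, r, s2, d in batch:                   # minibatch handled sample by sample
            q = q_net(torch.tensor(s, dtype=torch.float32))[a]
            with torch.no_grad():
                target = r if d else r + gamma * q_net(torch.tensor(s2, dtype=torch.float32)).max().item()
            loss = (q - target) ** 2
            opt.zero_grad()
            loss.backward()
            opt.step()
```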
SOM-ReLU vs ReLU-SOM
---
SOM-ReLU is slightly better than ReLU-SOM.
|![](https://i.imgur.com/02JSatI.png)|
|--|
|From top to bottom: the performance of (i) no SOM, (ii) Mask-ReLU, (iii) ReLU-Mask.|
SOM+CNN
---
CNN with SOM. The SOM takes the entire image as the input to generate a mask for different channels/filters in the output.
|![](https://i.imgur.com/TRE4Qof.png)|
|--|
|Hard to find an organization in the filters. The SOM is trained on the complete image.|
SOM+CNN outperforms the plain CNN by an extremely slight margin (99.02% vs. 98.76%).
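A minimal sketch of this variant, assuming one SOM unit per output channel and a Gaussian neighborhood (`img_dim` is the flattened image size); the SOM update is again omitted.

```python
import torch
import torch.nn as nn

class SOMMaskedConv(nn.Module):
    """Conv layer whose output channels are scaled by a SOM neighborhood mask
    computed from the flattened full image."""

    def __init__(self, in_ch, out_ch, img_dim, grid=(10, 10), sigma=3.0, k=5):
        super().__init__()
        assert grid[0] * grid[1] == out_ch                 # one SOM unit per channel
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.sigma = sigma
        self.register_buffer("som", torch.rand(out_ch, img_dim))
        ii, jj = torch.meshgrid(torch.arange(grid[0]), torch.arange(grid[1]), indexing="ij")
        self.register_buffer("coords", torch.stack([ii.flatten(), jj.flatten()], dim=1).float())

    def forward(self, x):                                  # x: (B, in_ch, H, W)
        y = self.conv(x)                                   # (B, out_ch, H, W)
        winner = torch.cdist(x.flatten(1), self.som).argmin(dim=1)
        topo = torch.cdist(self.coords[winner], self.coords)
        mask = torch.exp(-topo.pow(2) / (2 * self.sigma ** 2))   # (B, out_ch)
        return y * mask[:, :, None, None]                  # scale each channel uniformly
```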
Some thoughts
---
How to apply SOM?
- Find a scenario where **vector quantization** is useful (e.g. continuous control). Then we can use the SOM as the algorithm to find the representative vectors for that problem.
- Focus on vision tasks, where SOM is rooted.
A common aspect of CNNs and SOMs is that they work on small patches of the input rather than the entire image. A CNN does the computation on all possible patches, while doing so with a SOM is computationally expensive. Maybe attention can be applied here to pick the right patches?
Threshold Mask
---
Not good. It sacrifices too much accuracy to sparsify the activation.
CSOM
---
Convolutional SOM as a mask. The inputs of the SOM are gray patches rather than the entire image.
We have to extract numerous patches from each image, which slows down the training of the SOM. However, the SOM converges quickly and barely changes after that. We can still find organized patterns in the synaptic vectors, but not in the filters.
We can achieve competitive results with about 30% of the activation strength, which is energy-economical in a biological sense but requires more computation in the implementation.
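One way to realize this is sketched below, assuming each conv channel is paired with a SOM unit and the mask is computed per spatial location from the local patch; this pairing is my reading of the description above, not a confirmed detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSOMMaskedConv(nn.Module):
    """Conv layer masked per spatial location: the local gray patch picks a winner
    SOM unit and its lattice neighborhood scales the channels at that location."""

    def __init__(self, out_ch, grid=(10, 10), sigma=2.0, k=5):
        super().__init__()
        assert grid[0] * grid[1] == out_ch                 # one SOM unit per channel
        self.k, self.sigma = k, sigma
        self.conv = nn.Conv2d(1, out_ch, k, padding=k // 2)          # gray input
        self.register_buffer("som", torch.rand(out_ch, k * k))       # patch-sized synaptic vectors
        ii, jj = torch.meshgrid(torch.arange(grid[0]), torch.arange(grid[1]), indexing="ij")
        self.register_buffer("coords", torch.stack([ii.flatten(), jj.flatten()], dim=1).float())

    def forward(self, x):                                  # x: (B, 1, H, W)
        B = x.size(0)
        y = self.conv(x)                                   # (B, C, H, W)
        patches = F.unfold(x, self.k, padding=self.k // 2) # (B, k*k, H*W), one column per location
        d = torch.cdist(patches.transpose(1, 2), self.som.expand(B, -1, -1))
        winner = d.argmin(dim=-1)                          # (B, H*W) winner per location
        topo = torch.cdist(self.coords[winner], self.coords.expand(B, -1, -1))
        mask = torch.exp(-topo.pow(2) / (2 * self.sigma ** 2))       # (B, H*W, C)
        return y * mask.transpose(1, 2).reshape(y.shape)   # per-location channel mask
```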
|![](https://i.imgur.com/HXTg5Ho.png)|
|--|
|CNN filters and SOM synaptic vectors on MNIST.|
|![](https://i.imgur.com/E61Ug5A.png)|
|--|
|SOM synaptic vectors on CIFAR10.|
Threshold Mask
---
To make the activation sparser, we can apply a threshold to the mask. With only 5% of the neurons in the CNN layer active, we can still achieve similar accuracy (98.9%) on MNIST.
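A one-line version of the thresholding, applied to the neighborhood mask from the sketches above; whether surviving entries keep their Gaussian value or are set to 1 is not specified, so this simply zeroes everything below the threshold.

```python
import torch

def threshold_mask(mask: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Zero out mask entries below the threshold to sparsify the activation."""
    return mask * (mask > threshold).float()
```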
|![](https://i.imgur.com/yBanooR.png)|
|--|
|From top to bottom: thresholds of 0.0, 0.5, and 0.9.|
|![](https://i.imgur.com/zE37DGr.png)|![](https://i.imgur.com/YcnZJI3.png)|
|--|--|
|Gray: CNN with 30 neurons. Green: CNN with 100 neurons, of which only 30 are activated under the SOM mask.|Orange: CNN with 5 neurons. Blue: CNN with 100 neurons, of which only 5 are activated under the SOM mask.|
Updated
---
|![](https://i.imgur.com/lfF4wpj.png)|
|--|
|From top to bottom: the original network (100 neurons), SOM-masked (5 activated neurons), a pruned network (5 neurons), and Dropout (keep ratio 0.05).|