# Notes on "[Attention Transfer Knowledge Distillation](https://arxiv.org/abs/1612.03928)"

###### tags: `notes` `knowledge-distillation` `attention`

Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)

Note: The actual title of the paper is "**Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer**" and it was published as an ICLR '17 poster.

## Brief Outline
By properly defining attention for CNNs, it can be used to significantly improve the performance of a student CNN by forcing it to mimic the attention maps of a powerful teacher network.

## Introduction
* Artificial attention lets a system “attend” to an object to examine it in greater detail.
* They seek to answer the following question: can a teacher network improve the performance of another student network by providing it information about where it looks, i.e., where it concentrates its attention?
* They consider attention as a set of spatial maps that essentially try to encode the spatial areas of the input on which the network focuses most when making its output decision.
* Further, these maps can be defined w.r.t. various layers of the network to capture low, mid, and high-level representation information.

## Methodology
### Activation-based Attention Transfer
* Consider a CNN layer and its corresponding activation tensor $A \in \mathbb{R}^{C \times H \times W}$. It has $C$ feature planes with spatial dimensions $H \times W$.
* An activation-based mapping function $\mathcal{F}$ (w.r.t. that layer) takes as input the 3D tensor $A$ and outputs a spatial attention map, i.e., a flattened 2D tensor defined over the spatial dimensions.

$$
\mathcal{F}:\mathbb{R}^{C \times H \times W} \rightarrow \mathbb{R}^{H \times W}
\tag{1}
$$

* They assume that the absolute value of a hidden neuron activation can be used as an indication of the importance of that neuron w.r.t. the specific input.
* Thus, they consider the following spatial maps in this work (a short code sketch of these mapping functions follows the visualization discussion below):
    * Sum of absolute values: $\mathcal{F}_{\text{sum}}(A)=\sum_{i=1}^C |A_i| \tag{2}$
    * Sum of absolute values raised to the power of $p$ (where $p > 1$): $\mathcal{F}_{\text{sum}}^p(A)=\sum_{i=1}^C |A_i|^p \tag{3}$
    * Maximum of absolute values raised to the power of $p$ (where $p > 1$): $\mathcal{F}_\max^p(A)=\max_{i=1, C} |A_i|^p \tag{4}$
* Here, $A_i = A(i, :, :)$ (using MATLAB notation), and the max, power, and absolute value operations are elementwise.

#### Visualizing Attention Maps
* From the visualizations, it is found that attention maps focus on different parts of the input at different layers of the network.
* In the first layers, neuron activation levels are high for low-level gradient points; in the middle layers, they are higher for the most discriminative regions such as eyes or wheels; and in the top layers, they reflect full objects.

![Visualization of Attention Maps](https://i.imgur.com/RxJk8P6.png)

* The above image shows $\mathcal{F}_{\text{sum}}$ attention maps at different levels of a trained face recognition network. Mid-level attention maps have higher activation levels around the eyes, nose, and lips, while high-level activations correspond to the whole face.
* It is concluded that the most discriminative regions have higher activation levels, and that shape details disappear as the parameter $p$ increases.
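For concreteness, here is a minimal sketch of the mapping functions in Eqs. 2-4, assuming PyTorch and a single (unbatched) activation tensor. This is my own illustration, not code from the paper or the official repo; the function name `attention_map` and its arguments are made up for this example.

```python
import torch

def attention_map(A, mode="sum", p=1):
    """Collapse a (C, H, W) activation tensor into an (H, W) spatial
    attention map, following Eqs. 2-4 of the paper."""
    A = A.abs()
    if mode == "sum":   # F_sum (p=1, Eq. 2) or F_sum^p (p>1, Eq. 3)
        return A.pow(p).sum(dim=0)
    if mode == "max":   # F_max^p (Eq. 4): elementwise max over channels
        return A.pow(p).max(dim=0).values
    raise ValueError(f"unknown mode: {mode}")

# Toy usage on a random "activation" with C=64 channels:
A = torch.randn(64, 32, 32)
att = attention_map(A, mode="sum", p=2)   # shape (32, 32)
```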
#### Attention Mapping Function Properties
* Compared to $\mathcal{F}_{\text{sum}}(A)$, the spatial map $\mathcal{F}_{\text{sum}}^p(A)$ (where $p>1$) puts more weight on spatial locations that correspond to the neurons with the highest activations, i.e., on the most discriminative parts (the larger the $p$, the more focus is placed on the parts with the highest activations).
* Furthermore, among all neuron activations corresponding to the same spatial location, $\mathcal{F}_\max^p(A)$ will consider only one of them to assign a weight to that spatial location (as opposed to $\mathcal{F}_{\text{sum}}^p(A)$, which will favor spatial locations that carry multiple neurons with high activations).

#### Training Procedure
* In attention transfer, given the spatial attention maps of a teacher network, the goal is to train a student network that will not only make correct predictions but will also have attention maps that are similar to those of the teacher.
* They assume that transfer losses are placed between student and teacher attention maps of the same shape, but if needed, attention maps can be interpolated to match shapes.
* Let $S, T$ and $W_S, W_T$ denote the student, the teacher, and their respective weights, and let $\mathcal{L}(W, x)$ denote the standard cross-entropy loss.
* Let $\mathcal{I}$ denote the indices of all teacher-student activation layer pairs for which we want to transfer attention maps. Then we can define the following total loss (see the code sketch below):

$$
\mathcal{L}_{AT} = \mathcal{L}(W_S, x) + \frac{\beta}{2}\sum_{j \in \mathcal{I}}\left\|\frac{Q_S^j}{\|Q_S^j\|_2} - \frac{Q_T^j}{\|Q_T^j\|_2}\right\|_p
\tag{5}
$$

* Here, $Q_S^j = \text{vec}(\mathcal{F}(A_S^j))$ and $Q_T^j = \text{vec}(\mathcal{F}(A_T^j))$ are respectively the $j$-th pair of student and teacher attention maps in vectorized form, and $p$ refers to the norm type (in the experiments, they use $p=2$).
* They use $l_2$-normalized attention maps, i.e., $\frac{Q}{\|Q\|_2}$ instead of just $Q$. They emphasize that normalizing the attention maps is important for the success of this approach.
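A minimal sketch of the transfer term in Eq. 5, assuming PyTorch (my own illustration, not code from the official repo). It uses $\mathcal{F}_{\text{sum}}^2$ as the mapping function and averages squared differences, a common implementation choice for the $p=2$ case; the activation pairs would typically be collected with forward hooks on the chosen layers.

```python
import torch
import torch.nn.functional as F

def at(A, p=2):
    """Vectorized, l2-normalized attention map Q / ||Q||_2 for a batched
    activation tensor A of shape (N, C, H, W), using F_sum^p (Eq. 3)."""
    q = A.abs().pow(p).sum(dim=1).flatten(start_dim=1)  # (N, H*W)
    return F.normalize(q, p=2, dim=1)

def at_loss(A_s, A_t):
    """One term of the transfer sum in Eq. 5: mean squared difference
    between normalized student and teacher attention maps."""
    return (at(A_s) - at(A_t)).pow(2).mean()

# Assumed usage for one batch, with `pairs` a list of (student, teacher)
# activation tensors gathered at the chosen layers and `beta` a hyperparameter:
# loss = F.cross_entropy(student_logits, y) \
#        + (beta / 2) * sum(at_loss(a_s, a_t) for a_s, a_t in pairs)
```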
### Gradient-based Attention Transfer
* They define attention as the gradient w.r.t. the input, which can be viewed as an input sensitivity map, i.e., attention at an input spatial location encodes how sensitive the output prediction is to changes at that input location.
* Define the gradients of the loss w.r.t. the input for the student and the teacher as

$$
J_S = \frac{\partial}{\partial x}\mathcal{L}(W_S, x) \hspace{10pt} \text{and} \hspace{10pt} J_T = \frac{\partial}{\partial x}\mathcal{L}(W_T, x)
\tag{6}
$$

* To make the student's gradient attention similar to the teacher's, they minimize a distance between them (they use $l_2$, but other distances can also be used):

$$
\mathcal{L}_{AT}(W_S, W_T, x) = \mathcal{L}(W_S, x) + \frac{\beta}{2}\|J_S-J_T\|_2
\tag{7}
$$

* As $W_T$ and $x$ are given, the needed derivative w.r.t. $W_S$ is obtained using Eq. 6 and the chain rule:

$$
\frac{\partial}{\partial W_S}\mathcal{L}_{AT} = \frac{\partial}{\partial W_S}\mathcal{L}(W_S, x) + \beta(J_S-J_T)\frac{\partial^2}{\partial W_S \partial x}\mathcal{L}(W_S, x)
\tag{8}
$$

* To do the weight update, a forward and backward pass is first done to get $J_S$ and $J_T$. Then the second error $\frac{\beta}{2}\|J_S-J_T\|_2$ is computed and propagated backwards a second time (see the sketch at the end of these notes).
* This can be implemented efficiently in a framework supporting automatic differentiation.
* They also enforce horizontal flip invariance on gradient attention maps. To do that, they propagate horizontally flipped images as well as the originals, backpropagate, and flip the resulting gradient attention maps back. They then add $l_2$ losses on the obtained attentions and outputs, and do a second backpropagation:

$$
\mathcal{L}_{\text{sym}}(W, x) = \mathcal{L}(W, x) + \frac{\beta}{2}\left\|\frac{\partial}{\partial x}\mathcal{L}(W, x)-\text{flip}\left(\frac{\partial}{\partial x}\mathcal{L}(W, \text{flip}(x))\right)\right\|_2
$$

* Here, $\text{flip}(x)$ denotes the horizontal flip operator. Experimentally, this had a regularizing effect on training.
* Note that in this work they consider only gradients w.r.t. the input layer, but in general one might apply the proposed attention transfer and symmetry constraints w.r.t. higher layers of the network.

## Experiments
* They demonstrate the performance of the proposed techniques on various image classification datasets. See the [paper](https://arxiv.org/abs/1612.03928) for details.
* They report that $\mathcal{F}_{\text{sum}}^2$ performs the best among all activation-based mapping functions.
* They also mention that KD struggles to work if the teacher and student have different architectures/depths.
* Code is available on [GitHub](https://github.com/szagoruyko/attention-transfer).

## Conclusion
* They present several ways of transferring attention from one network to another.
* It will be interesting to see how attention transfer works in cases where spatial information is more important, such as object detection or weakly-supervised localization.
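As a closing illustration of the double backprop used in the gradient-based variant, here is a rough sketch of Eq. 7 as a training loss, assuming PyTorch and standard classification networks. This is my own sketch, not code from the official repo; `gradient_at_loss` and the `beta` default are made up for this example, and a squared $l_2$ penalty is used for simplicity.

```python
import torch
import torch.nn.functional as F

def gradient_at_loss(student, teacher, x, y, beta=1e-3):
    """Gradient-based attention transfer loss (Eq. 7): cross-entropy plus a
    penalty between student and teacher input gradients J_S and J_T."""
    x = x.clone().requires_grad_(True)

    # Teacher input gradient J_T; the teacher is fixed, so no graph is kept.
    loss_t = F.cross_entropy(teacher(x), y)
    j_t = torch.autograd.grad(loss_t, x)[0].detach()

    # Student input gradient J_S; create_graph=True keeps the graph so that
    # backpropagating the matching term yields the second-order term in Eq. 8.
    loss_s = F.cross_entropy(student(x), y)
    j_s = torch.autograd.grad(loss_s, x, create_graph=True)[0]

    return loss_s + beta / 2 * (j_s - j_t).pow(2).sum()

# Assumed usage inside a training loop:
# loss = gradient_at_loss(student, teacher, images, labels)
# loss.backward(); optimizer.step()
```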