# Unsupervised Sparse Dirichlet-Net for Hyperspectral Image Super-Resolution
Authors: Ying Qu, Hairong Qi, Chiman Kwan
Link: https://arxiv.org/abs/1804.05042
Written by: Pronoma Banerjee
---
### Abstract and Introduction
The paper presents the first successful attempt at solving the hyperspectral image super-resolution (HSI-SR) problem with an unsupervised encoder-decoder architecture, called the unsupervised Sparse Dirichlet-Net (uSDN). The approach is unique in the following ways:
1. It is composed of two encoder-decoder networks that share decoder weights, so that the spectral information learned by the HSI network is preserved.
2. The network encourages the representations from both modalities to follow a sparse Dirichlet distribution, which naturally incorporates the two physical constraints of HSI and MSI, i.e., sum-to-one and sparsity. The Dirichlet distribution provides the sum-to-one property, and since each pixel of the image consists of only a few spectral bases, the sparsity of the representations is enforced by minimizing their entropy.
3. The angular difference between the representations of the two networks is minimized in order to reduce spectral distortion.
The network architecture is shown below. The dashed lines show the path of backpropagation. ![](https://i.imgur.com/EAMiQQt.png)
---
### Problem Formulation
Given the low-resolution (LR) HSI $\bar{Y_h} \in R^{m\times n\times L}$ and the high-resolution (HR) MSI $\bar{Y_m} \in R^{M\times N\times l}$, the goal is to estimate the HR HSI $\bar{X} \in R^{M\times N\times L}$, where $M \gg m$, $N \gg n$, and $L \gg l$. For subsequent processing, the 3D images are unfolded into 2D matrices $Y_h \in R^{mn\times L}$, $Y_m \in R^{MN\times l}$, and $X \in R^{MN\times L}$.
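Concretely, the unfolding is just a reshape that flattens the two spatial dimensions so that each row holds the spectrum of one pixel; a minimal numpy sketch with toy sizes:

```python
import numpy as np

m, n, L = 4, 4, 31                      # toy LR HSI: 4x4 pixels, 31 bands
Y_h_cube = np.random.rand(m, n, L)

# Unfold: flatten the two spatial dimensions; each row is one pixel's spectrum.
Y_h = Y_h_cube.reshape(m * n, L)        # shape (mn, L)

# Folding back to the 3D cube is the inverse reshape.
assert np.allclose(Y_h.reshape(m, n, L), Y_h_cube)
```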
Each row of $Y_h$ is a linear combination of $c$ basis vectors (or spectral signatures). $Y_h$ is represented as: $Y_h = S_hΦ_h$,
where $Φ_h \in R^{c\times L}$, each row of which denotes a spectral basis that preserves the spectral information, and $S_h\in R^{mn\times c}$ holds the corresponding proportional coefficients (representations).
$Y_m$ is expressed as: $Y_m = S_mΦ_m, Φ_m = Φ_hR$,
where $Φ_m \in R^{c\times l}$, each row of which indicates a spectral basis of the MSI, and $R \in R^{L\times l}$ is the transformation matrix, given as a prior from the sensor, that describes the relationship between the HSI and MSI bases. With $Φ_h \in R^{c\times L}$ carrying the high spectral information and $S_m \in R^{MN\times c}$ carrying the high spatial information, the desired $X$ is generated as $X = S_mΦ_h.$
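The whole formulation can be sanity-checked shape-wise with a toy numpy sketch (all matrices random; the column-normalized `R` is only a stand-in for the sensor-provided transformation):

```python
import numpy as np

mn, MN, c, L, l = 16, 256, 10, 31, 4

# Sparse, sum-to-one representations (small Dirichlet concentration -> sparse rows).
S_h = np.random.dirichlet(np.ones(c) * 0.1, size=mn)     # (mn, c)
S_m = np.random.dirichlet(np.ones(c) * 0.1, size=MN)     # (MN, c)
Phi_h = np.random.rand(c, L)                             # HSI spectral bases
R = np.random.rand(L, l); R /= R.sum(axis=0)             # stand-in for the sensor prior

Y_h = S_h @ Phi_h          # LR HSI, shape (mn, L)
Y_m = S_m @ (Phi_h @ R)    # HR MSI, shape (MN, l), since Phi_m = Phi_h R
X = S_m @ Phi_h            # desired HR HSI, shape (MN, L)
```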
The problem of HSI-SR can be described mathematically as estimating $P(X|Y_h, Y_m)$. Since the ground truth $X$ is not available, the problem has to be solved in an unsupervised fashion. Three unique requirements of HSI-SR need special consideration. First, in representing HSI or MSI as a linear combination of spectral signatures, the representation vectors should be non-negative and sum to one, i.e., $\sum^c_{j=1} s_{ij} = 1$, where $s_i$ is a row vector of either $S_h$ or $S_m$. Second, since each pixel consists of only a few spectral bases, the representations should be sparse. Third, the spectral distortion of the estimated $X$ should be minimized.
---
### Proposed Approach
**Network Architecture**
The network architecture is shown in the figure above. The network reconstructs both $Y_h$ and $Y_m$ in a coupled fashion. Consider first the LR HSI network. It consists of an encoder $E_h(\theta_{he})$, which maps the input variables to low-dimensional representations ($p_{\theta_{he}}(S_h|Y_h)$), and a decoder $D_h(\theta_{hd})$, which reconstructs the data from the representations ($p_{\theta_{hd}}(\hat{Y}_h|S_h)$). The bottleneck hidden layer $S_h$ behaves as the representation layer reflecting the spatial information, and the weights $\theta_{hd}$ serve as $Φ_h$.
The HSI is reconstructed by $\hat{Y}_h = f_k(W_{dk}f_{k-1}(\cdots(f_1(S_hW_{d1} + b_1))\cdots + b_{k-1}) + b_k)$, where $W_{dk}$ denotes the weights in the $k$-th layer. $S_h$ follows a Dirichlet distribution, with the sum-to-one property naturally incorporated.
Similarly, the bottom network reconstructs the HR MSI with encoder $E_m(θ_{me})$ and decoder $D_m(θ_{md})$. However, since the number of MSI input variables $l$ is smaller than the number of latent variables $c$ (i.e., $l \le c \le L$), the MSI network alone is very unstable and hard to train. Because the spectral bases of the HR MSI can be transformed from those of the LR HSI, which possesses more spectral information, the decoder of the MSI network is designed to share weights with that of the HSI network, i.e., $θ_{md} = Φ_m = θ_{hd}R = Φ_hR$. The reconstructed HR MSI is then obtained as $\hat{Y}_m = S_mΦ_hR$. Hence, only the encoder $E_m(θ_{me})$ of the MSI network is updated during optimization, extracting the HR spatial information $S_m$ from the MSI. Eventually, the desired HR HSI is generated directly by $X = S_mΦ_h$.
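A minimal PyTorch sketch of this decoder coupling (single linear decoders for brevity; the actual network is deeper and uses the stick-breaking encoder described next):

```python
import torch

c, L, l = 10, 31, 4
Phi_h = torch.nn.Parameter(torch.rand(c, L))  # HSI decoder weights = spectral bases
R = torch.rand(L, l)                          # sensor transformation, given as a prior

def decode_hsi(S_h):
    return S_h @ Phi_h            # \hat{Y}_h = S_h Phi_h

def decode_msi(S_m):
    return S_m @ (Phi_h @ R)      # \hat{Y}_m = S_m Phi_h R  (shared weights)

def super_resolve(S_m):
    return S_m @ Phi_h            # X = S_m Phi_h
```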
**Sparse Dirichlet-Net with Dense Connectivity**
To extract stable spectral information, the proportional coefficients $S = (s_1,s_2,\cdots,s_i,\cdots,s_p)^T$ of each pixel need to sum to one. To naturally incorporate this sum-to-one property, the representations are encouraged to follow a Dirichlet distribution via the stick-breaking process (see the paper for details) illustrated in the figure below. Furthermore, an entropy function is adopted to reinforce the sparsity of the representations.
![](https://i.imgur.com/vFxJy97.png)
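As a minimal sketch, one standard stick-breaking construction takes the fractions from sigmoids (an illustrative choice; see the paper for the exact parameterization used in uSDN):

```python
import torch

def stick_breaking(logits):
    """Map unconstrained logits to a non-negative, sum-to-one vector.

    v_j = sigmoid(logit_j) is the fraction of the remaining stick broken off
    at step j; the last entry takes whatever stick is left.
    """
    v = torch.sigmoid(logits)                        # fractions in (0, 1)
    ones = torch.ones_like(v[..., :1])
    remaining = torch.cumprod(torch.cat([ones, 1 - v], dim=-1), dim=-1)
    return torch.cat([v, ones], dim=-1) * remaining  # s_j = v_j * prod_{k<j}(1 - v_k)

s = stick_breaking(torch.randn(5, 9))                # five 10-dimensional representations
print(s.sum(dim=-1))                                 # ~1.0 for every row
```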
To avoid vanishing gradients and increase the representation power of the proposed method, the encoder of the network is densely connected, i.e., each layer is fully connected with all its subsequent layers. Instead of concatenating all the preceding layers, the input of the $k$-th layer is the summation of all the preceding layers $x_0, x_1, \ldots, x_{k-1}$ weighted by their own weights, i.e., $W_0x_0 + W_1x_1 + \ldots + W_{k-1}x_{k-1}$. In this way, fewer layers are required to learn the optimal representations.
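A sketch of this summation-style dense connectivity (layer widths and the ReLU activation are illustrative assumptions):

```python
import torch

class DenseSumEncoder(torch.nn.Module):
    """Each layer's input is the weighted sum of ALL preceding layer outputs."""
    def __init__(self, sizes):                    # e.g. [31, 10, 10, 10]
        super().__init__()
        # one linear map from every preceding layer j into each later layer k
        self.links = torch.nn.ModuleList(
            torch.nn.ModuleList(torch.nn.Linear(sizes[j], sizes[k]) for j in range(k))
            for k in range(1, len(sizes)))

    def forward(self, x):
        outputs = [x]
        for fan_in in self.links:
            # W_0 x_0 + W_1 x_1 + ... + W_{k-1} x_{k-1}, then a nonlinearity
            z = sum(link(h) for link, h in zip(fan_in, outputs))
            outputs.append(torch.relu(z))
        return outputs[-1]

enc = DenseSumEncoder([31, 10, 10, 10])
print(enc(torch.rand(8, 31)).shape)               # torch.Size([8, 10])
```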
Although the stick-breaking structure encourages the representations to follow a Dirichlet distribution, it does not by itself guarantee their sparsity. A generalized Shannon entropy function is therefore adopted to reinforce the sparsity of the representation layer; it works effectively even with the sum-to-one constraint. This entropy function was first proposed in the compressive sensing field to solve the signal recovery problem. It is defined as:
$H_p(s)= -\sum^N_{j=1} \frac{|s_j|^p}{\|s\|^p_p}\log\frac{|s_j|^p}{\|s\|^p_p}$.
Compared to the more popular Shannon entropy, the generalized entropy function decreases monotonically as the data become sparser. This nice property guarantees the sparsity of arbitrary data, even data subject to the sum-to-one constraint. Due to the stick-breaking structure, the latent variables at the representation layer are positive. We choose $p = 1$, which is more efficient and encourages the variables to be sparse.
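A numpy sketch of $H_p$ (the `eps` guard is an implementation detail, not from the paper):

```python
import numpy as np

def generalized_entropy(s, p=1, eps=1e-12):
    """H_p(s): decreases monotonically as s becomes sparser."""
    w = np.abs(s) ** p
    w = w / (w.sum() + eps)                  # normalize by ||s||_p^p
    return -np.sum(w * np.log(w + eps))

print(generalized_entropy(np.array([0.25, 0.25, 0.25, 0.25])))  # dense:  ~1.386
print(generalized_entropy(np.array([0.97, 0.01, 0.01, 0.01])))  # sparse: ~0.168
```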
**Angle similarity**
The representations $S_h$ and $S_m$ are encouraged to follow a similar pattern to prevent spectral distortion. This similarity is measured via the angular difference between the two sets of representations, using the spectral angle mapper (SAM).
In the network, the representations $S_h \in R^{mn\times c}$ and $S_m \in R^{MN\times c}$ come from two different modalities and have different dimensions. To minimize the angular difference, we increase the size of the low-dimensional $S_h$ by duplicating its values at each pixel to its nearest neighborhood. The duplicated representations $\bar{S}_h \in R^{MN\times c}$ then have the same dimension as $S_m$. With vectors of equal size, the angular difference is defined as
$A(\bar{S}_h,S_m)=\frac{1}{MN} \sum^{MN}_{i=1} \arccos\left(\frac{\bar{s}^i_h \cdot s^i_m}{\|\bar{s}^i_h\| \|s^i_m\|}\right)$, where $\bar{s}^i_h$ and $s^i_m$ denote the $i$-th rows of $\bar{S}_h$ and $S_m$, respectively.
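A numpy sketch of the duplication and the angle term (assuming the same integer scale factor `r` along both spatial axes, and the row-major unfolding from earlier):

```python
import numpy as np

def angular_difference(S_h_cube, S_m, r):
    """Mean spectral angle between duplicated LR and HR representations.

    S_h_cube: (m, n, c) LR representations; S_m: (M*N, c); r = M/m = N/n.
    """
    up = S_h_cube.repeat(r, axis=0).repeat(r, axis=1)   # nearest-neighbor duplication
    S_h_bar = up.reshape(-1, S_m.shape[1])              # (M*N, c), matches S_m
    cos = np.sum(S_h_bar * S_m, axis=1) / (
        np.linalg.norm(S_h_bar, axis=1) * np.linalg.norm(S_m, axis=1))
    return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards rounding error
```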
**Optimization and Implementation details**
The objective functions of the proposed network architecture can be expressed as:
$L(θ_{he},θ_{hd})= \frac{1}{2}\|Y_h-\hat{Y}_h(θ_{he},θ_{hd})\|^2_F + λH_1(S_h(θ_{he})) + μ\|θ_{hd}\|^2_F$,
$L(θ_{me})= \frac{1}{2}\|Y_m-\hat{Y}_m(θ_{me},θ_{hd})\|^2_F + λH_1(S_m(θ_{me}))$,
$L(θ_{me}) = A(\bar{S}_h(θ_{he}), S_m(θ_{me}))$,
where λ and μ are parameters that balance the reconstruction error against the sparsity term and the weight penalty, respectively.
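In PyTorch form, the three objectives might be sketched as follows (helper names are hypothetical; applying $H_1$ row-wise and summing is an assumption):

```python
import torch

def entropy_h1(S, eps=1e-12):
    # Torch version of the H_1 sketch above, applied to each row and summed.
    w = S.abs() / (S.abs().sum(dim=1, keepdim=True) + eps)
    return -(w * (w + eps).log()).sum()

def loss_hsi(Y_h, Y_h_hat, S_h, theta_hd, lam, mu):
    return (0.5 * (Y_h - Y_h_hat).pow(2).sum()    # squared Frobenius reconstruction error
            + lam * entropy_h1(S_h)               # sparsity of the representations
            + mu * theta_hd.pow(2).sum())         # l2 penalty on the decoder weights

def loss_msi(Y_m, Y_m_hat, S_m, lam):
    return 0.5 * (Y_m - Y_m_hat).pow(2).sum() + lam * entropy_h1(S_m)

def loss_angle(S_h_bar, S_m, eps=1e-7):
    cos = torch.nn.functional.cosine_similarity(S_h_bar, S_m, dim=1)
    return torch.arccos(cos.clamp(-1 + eps, 1 - eps)).mean()
```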
The implementation and solution of the three objective functions proceed step-wise as follows (a schematic training loop is sketched after the list):
1. Since the decoder weights $θ_{hd}$ of the HSI network preserve the spectral information $Φ_h$, the HSI network is updated first to find the optimal $θ_{hd}$. To prevent over-fitting, an $\ell_2$-norm penalty is applied to the decoder weights of the HSI network.
2. The estimated decoder weights $θ_{hd}$ are fixed and shared with the decoder of the MSI network, and the encoder weights $θ_{me}$ of the MSI network are updated.
3. To reduce spectral distortion, the angular difference between the representations of the two modalities is minimized every 10 iterations. Since $θ_{he}$ is already available from the first step, only the encoder weights $θ_{me}$ of the MSI network are updated during this optimization.
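Putting the steps together, a schematic end-to-end loop reusing the loss helpers above (the softmax encoders stand in for the stick-breaking structure, the flattened `repeat_interleave` stands in for the 2D duplication, and the learning rates, λ, and μ are placeholder values, not the paper's):

```python
import torch

# Toy sizes and data; in practice Y_h and Y_m come from the actual image pair.
mn, MN, c, L, l, r2 = 16, 256, 10, 31, 4, 16     # r2 = (M/m) * (N/n)
Y_h, Y_m = torch.rand(mn, L), torch.rand(MN, l)
R = torch.rand(L, l)                              # sensor prior

enc_h = torch.nn.Sequential(torch.nn.Linear(L, c), torch.nn.Softmax(dim=1))
enc_m = torch.nn.Sequential(torch.nn.Linear(l, c), torch.nn.Softmax(dim=1))
Phi_h = torch.nn.Parameter(torch.rand(c, L))      # shared decoder weights

opt_h = torch.optim.Adam(list(enc_h.parameters()) + [Phi_h], lr=1e-3)
opt_m = torch.optim.Adam(enc_m.parameters(), lr=1e-3)

for it in range(1000):
    # Step 1: update the HSI network (encoder and decoder/spectral bases).
    S_h = enc_h(Y_h)
    loss_hsi(Y_h, S_h @ Phi_h, S_h, Phi_h, 1e-3, 1e-4).backward()
    opt_h.step(); opt_h.zero_grad()

    # Step 2: decoder fixed (detached) and shared; update only the MSI encoder.
    S_m = enc_m(Y_m)
    loss_msi(Y_m, S_m @ (Phi_h.detach() @ R), S_m, 1e-3).backward()
    opt_m.step(); opt_m.zero_grad()

    # Step 3: every 10 iterations, minimize the angular difference (MSI encoder only).
    if it % 10 == 0:
        S_h_bar = enc_h(Y_h).detach().repeat_interleave(r2, dim=0)
        loss_angle(S_h_bar, enc_m(Y_m)).backward()
        opt_m.step(); opt_m.zero_grad()

X = enc_m(Y_m) @ Phi_h    # desired HR HSI, shape (MN, L)
```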
For all the experiments, both the input and output of the HSI network have 31 nodes, matching the number of spectral bands in the data. There are 3 layers in the HSI network, each containing 10 nodes. The MSI network has 5 layers, with the number of nodes increasing from 4 to 10. The two networks share a decoder of 2 layers, each with 10 nodes. Since different images have different spectral bases and representations, the network is trained on each pair of LR HSI and HR MSI individually to reconstruct each image accurately.
<!-- ### Questions/ Drawbacks
1. How do the representations/coefficients $S_h$ preserve the spatial structure? (Problem formulation: 2nd paragraph)
2. How do we get the transformation matrix given as prior from the sensor (Problem formulation, 3rd paragraph)
3. Walk-through: problem formulation
4. The problem of HSI-SR can be described mathematically as P(X|Yh,Ym).
5. Understanding the encoder nets better (fig 3)
6. What exactly does angular difference and SAM value say. -->