# Fusion of End-to-End Deep Learning Models for Sequence-to-Sequence Sleep Staging

Original Paper Link: [Uni Luebeck Link](https://www.isip.uni-luebeck.de/fileadmin/files/publications/phan2019c.pdf)

# Introduction

This paper presents a fusion approach to sleep stage classification. The fusion model consists of two base sleep stage classifiers, SeqSleepNet and DeepSleepNet. According to the authors, these two networks perform well as an ensemble because of their high diversity, which is also demonstrated experimentally. The ensemble achieved an overall accuracy of 88.0 %, a macro F1-score of 84.3 %, and a Cohen's kappa of 0.828.

# Dataset

The paper tests its hypothesis on the MASS dataset, converting the original 20 s epochs into 30 s epochs by including a 5 s segment before and after each epoch.

# Architecture

![Network Architecture](https://i.imgur.com/eoHhlHE.png)

The model classifies all \\(L\\) epochs of a sequence at once. It is trained to maximize \\(p(y_{1}, y_{2}, \ldots, y_{L} \mid S_{1}, S_{2}, \ldots, S_{L})\\), where \\((S_{1}, S_{2}, \ldots, S_{L})\\) is the sequence of \\(L\\) consecutive epochs and \\((y_{1}, y_{2}, \ldots, y_{L})\\) are the corresponding \\(L\\) one-hot encoded ground-truth label vectors. The Epoch Processing Block (EPB) extracts a feature vector for each epoch; the same EPB is shared across all epochs. The feature vectors are then passed through a bidirectional RNN, which outputs a sequence of vectors (one per epoch). Each output vector is passed through a softmax layer for sleep stage classification. The loss function is a sequence classification loss with L2-norm regularization.
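The 20 s → 30 s epoch conversion described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code; the function name, sampling rate, and boundary handling (skipping epochs whose margins fall outside the recording) are assumptions:

```python
import numpy as np

def expand_epochs(signal, sr=256, old_len=20, new_len=30):
    """Re-segment a 1-D recording from 20 s epochs into 30 s epochs by
    taking a 5 s margin before and after each original epoch.
    Hypothetical helper: sr and boundary handling are assumptions."""
    old_n, new_n = old_len * sr, new_len * sr
    margin = (new_n - old_n) // 2          # 5 s worth of samples per side
    n_epochs = len(signal) // old_n
    epochs = []
    for k in range(n_epochs):
        start = k * old_n - margin
        end = k * old_n + old_n + margin
        if start < 0 or end > len(signal):  # margins would leave the recording
            continue
        epochs.append(signal[start:end])
    return np.stack(epochs)
```

Each returned row then covers 30 s of signal while still being centered on one original 20 s scoring epoch.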
The final loss function, summed over the \\(N\\) training sequences and averaged over the \\(L\\) epochs in each sequence, is:

\\(E(\theta) = -\frac{1}{L} \sum_{n=1}^{N} \sum_{i=1}^{L} y_{ni}\log\big(\hat{y}_{ni}(\theta)\big) + \frac{\lambda}{2}\lVert\theta\rVert_2^2\\)

# Epoch Processing Block (EPB)

![Epoch Processing Block](https://i.imgur.com/TdzomEg.png)

### SeqSleepNet

In SeqSleepNet, the EPB is the combination of three filterbank layers, a two-layer biRNN realized with Gated Recurrent Unit (GRU) cells, and an attention layer, as illustrated in Fig. 2(a). Note that this epoch-level biRNN should not be confused with the sequence-level biRNN in Fig. 1. Each filterbank layer, with 32 filters, first preprocesses one input image channel. The resulting image channels are then stacked in the frequency dimension to form a single image. The biRNN treats this image as a sequence of \\(T\\) local feature vectors (image columns) \\((z_{1}, z_{2}, \ldots, z_{T})\\) and encodes it into a sequence of output vectors \\((a_{1}, a_{2}, \ldots, a_{T})\\). The attention layer combines these output vectors into the epoch-wise feature vector \\(x\\) for an epoch \\(S\\):

\\(x = \sum_{t=1}^{T} \alpha_{t}a_{t}\\)

where \\(\alpha_{t}\\) is the attention weight for output vector \\(a_{t}\\).

### DeepSleepNet

At the network's input layer, the raw EEG, EOG, and EMG signals are stacked to form a 3-channel input. The EPB uses only part of the DeepSleepNet architecture, as shown in the figure.

# Combined Architecture

Inspection of class-wise performance reveals that while SeqSleepNet worked better for N1 and REM, DeepSleepNet was favourable for N3. SeqSleepNet employs an attentional RNN combined with filterbank layers as its epoch-wise feature learning engine, whereas in DeepSleepNet this role is played by a deep CNN. A late fusion method is therefore employed, in which a probabilistically multiplicative aggregation scheme fuses the decisions coming from the two base networks.
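The attention pooling in SeqSleepNet's EPB, \\(x = \sum_{t} \alpha_{t}a_{t}\\), can be sketched in a few lines of numpy. This is a minimal sketch: the paper's attention uses a learned scoring layer, so the single context vector `w` used here for scoring is an assumption:

```python
import numpy as np

def attention_pool(a, w):
    """Attention pooling over biRNN outputs a with shape (T, D).
    Each output vector a_t is scored against a context vector w (D,),
    the scores are softmax-normalized into weights alpha_t, and the
    epoch feature vector is the weighted sum x = sum_t alpha_t * a_t.
    Minimal sketch; the paper's scoring parametrisation is richer."""
    scores = a @ w                                  # (T,) relevance scores
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention weights alpha_t
    return alpha @ a                                # (D,) feature vector x
```

Because the weights sum to one, `x` stays in the convex hull of the RNN outputs, with the attention emphasizing the most informative image columns of the epoch.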
\\(f(y_{i}) = \frac{1}{L} \prod_{j=i-L+1}^{i} P_{1}(y_{i} \mid S_{j})\, P_{2}(y_{i} \mid S_{j})\\)

where \\(f(y_{i})\\) is the likelihood of sleep stage \\(y_{i}\\) at epoch index \\(i\\), \\(S_{j}\\) is the epoch sequence starting at index \\(j\\), and \\(P_{1}\\) and \\(P_{2}\\) are the output probabilities from SeqSleepNet and DeepSleepNet, respectively. When the number of decisions involved in this equation is large, the aggregation should be conducted in the logarithmic domain to avoid numerical underflow.

We experimented with different sequence lengths \\(L \in \{10, 20, 30\}\\) PSG epochs, equivalent to {5, 10, 15} minutes. The sequences were sampled from the PSG recordings with maximum overlap (i.e. \\(L-1\\) epochs), so that all possible sequences from the training recordings were generated for network training.

# Conclusion

The paper shows that an ensemble of two state-of-the-art but diversified architectures for sleep staging (here, DeepSleepNet and SeqSleepNet) can be combined to increase accuracy. The authors showed that the ensemble outperforms either of the constituent models overall and on almost all metrics.