# A deep learning architecture for temporal sleep stage classification using multivariate and multimodal time series

Original Paper Link: [arxiv Link](https://arxiv.org/abs/1707.03321)

# Introduction

This paper demonstrates the use of PSG data to classify sleep stages. It shows that using multiple channels is better than using a single channel of raw EEG: using up to 6 EEG channels, in addition to 2 EOG channels and 3 EMG channels, gives a considerable boost in performance. The system exploits the multivariate and multimodal nature of PSG signals to deliver SOTA classification performance at a small computational cost. One of the main motivations for using multiple channels is that the first layer learns linear spatial filters that exploit the array of sensors to increase the signal-to-noise ratio.

# Network Architecture

Two different approaches are presented in the paper: one uses straightforward deep learning methods, and the other additionally exploits the statistical nature of sleep signals by aggregating features from past and future samples.

The proposed feature extractor exhibits a simple and versatile 2-layer architecture; adding or removing layers did not improve performance. The authors kept spatial and temporal convolutions strictly separate, replacing potentially expensive 2D convolutions with a 1D spatial convolution followed by a 1D temporal convolution. The overall network has no more than \\(10^4\\) parameters when considering an extended spatial context, and no more than \\(10^5\\) parameters when considering both an extended spatial context and an extended temporal context.

### Multivariate Network Architecture

![Multivariate Network Architecture](https://i.imgur.com/s3vCi3y.png)

The authors use linear spatial filtering (a linear transformation) to estimate so-called virtual channels, convolutional layers to capture spectral features, and separate pipelines for EEG/EOG and EMG respectively.

\\(Z: \mathbb{R}^{C \times T} \longrightarrow \mathbb{R}^{D}\\)

Here, \\(D\\) is the size of the estimated feature space, \\(C\\) is the number of channels, and \\(T\\) is the number of time steps per channel. The authors keep \\(D = C \times T\\), so the effective transformation preserves the number of dimensions.

The whole architecture is detailed in the table below:

![Architecture Details](https://i.imgur.com/kDkk4wq.png)

The parameters have been set for signals sampled at 128 Hz, in which case the number of time steps is \\(T = 128 \times 30 = 3840\\). Each block first convolves its input signal with 8 estimated kernels of length 64 with stride 1 (\\(\approx\\) 0.5 s of record) before applying a ReLU non-linearity. The outputs are then reduced along the time axis with a max pooling layer (size 16, without overlap). The output of the two convolution blocks is finally passed through a dropout layer with rate 0.25 (**Note:** the text in the paper mentions 0.25, while the table in the paper, shown above, mentions 0.5). The resulting outputs are then concatenated to form the feature space of dimension \\(D\\) before being fed into a final layer with 5 neurons and a softmax non-linearity to obtain a probability vector.
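Below is a minimal PyTorch sketch of this feature extractor and classifier, assuming the hyper-parameters listed above (8 kernels of length 64, max pooling of size 16, dropout 0.25, 5 output neurons). Only a single pipeline is shown, whereas the paper runs EEG/EOG and EMG through separate pipelines and concatenates their outputs; the padding choices and class names here are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class FeatureExtractor(nn.Module):
    """Sketch of the feature extractor Z for C channels and T time steps."""

    def __init__(self, n_channels: int, dropout: float = 0.25):
        super().__init__()
        # Linear spatial filtering: estimate C "virtual channels" as
        # linear combinations of the C input sensors.
        self.spatial = nn.Conv2d(1, n_channels, kernel_size=(n_channels, 1))

        # Temporal block: 8 kernels of length 64 (~0.5 s at 128 Hz,
        # stride 1), ReLU, then max pooling of size 16 without overlap.
        def temporal_block(in_channels: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_channels, 8, kernel_size=(1, 64), padding=(0, 32)),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=(1, 16)),
            )

        self.block1 = temporal_block(1)
        self.block2 = temporal_block(8)
        self.dropout = nn.Dropout(dropout)  # 0.25 in the text, 0.5 in the table

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.unsqueeze(1)     # (batch, C, T) -> (batch, 1, C, T)
        x = self.spatial(x)    # (batch, C, 1, T): virtual channels
        x = x.transpose(1, 2)  # (batch, 1, C, T)
        x = self.block2(self.block1(x))
        return self.dropout(x.flatten(1))  # feature vector of size D


class SleepStager(nn.Module):
    """Feature extractor followed by a 5-class linear/softmax classifier."""

    def __init__(self, n_channels: int, n_times: int = 3840, n_classes: int = 5):
        super().__init__()
        self.feature_extractor = FeatureExtractor(n_channels)
        with torch.no_grad():  # infer D with a dummy forward pass
            d = self.feature_extractor(torch.zeros(1, n_channels, n_times)).shape[1]
        # The softmax itself is folded into the cross-entropy loss.
        self.classifier = nn.Linear(d, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.feature_extractor(x))
```

Keeping the spatial and temporal convolutions as two separate 1D operations, as above, is what keeps the parameter count so low compared to full 2D convolutions.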
### Time Distributed Multivariate Architecture

![Time Distributed Multivariate Architecture](https://i.imgur.com/Hc3x4dh.png)

To take into account the statistical properties of the signals before and after the sample of interest, the authors propose to aggregate the features extracted by \\(Z\\) over a number of time segments preceding and following the sample of interest. Let \\(S_k^t = \lbrace X_{t-k}, \dots, X_{t}, \dots, X_{t+k}\rbrace \in \chi_{k}\\) be a sequence of \\(2k + 1\\) neighboring samples (\\(k\\) samples in the past and \\(k\\) samples in the future). Distributing the feature extractor in time consists in applying \\(Z\\) to each sample in \\(S_k^t\\) and aggregating the \\(2k + 1\\) outputs into a vector of size \\(D(2k+1)\\). The obtained vector is then fed into the final softmax classifier.

# Training Procedure

The authors balance the class distribution in minibatches of size 128: since there are 5 classes, each batch contains roughly 20% of samples from each class during training. The Adam optimizer is used with parameters \\(\alpha = 0.001\\) (learning rate), \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.999\\) and \\(\varepsilon = 10^{-8}\\). Weights were initialized from a normal distribution with mean \\(\mu = 0\\) and standard deviation \\(\sigma = 0.1\\). These values were obtained empirically by monitoring the loss during training.

The time distributed network was trained in two steps. First, the authors trained the multivariate network, in particular its feature extractor \\(Z\\), without temporal context (\\(k = 0\\)). The trained model was then used to set the weights of the feature extractor distributed in time. Second, they froze the weights of the feature extractor distributed in time and trained only the final softmax classifier on the aggregated features.
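A hedged sketch of the time-distributed model and of step 2 of the training procedure, reusing the FeatureExtractor/SleepStager classes from the sketch above; the helper names are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class TimeDistributedStager(nn.Module):
    """Apply the feature extractor Z to each of the 2k+1 samples in
    S_k^t and classify the concatenated vector of size D * (2k+1)."""

    def __init__(self, feature_extractor: nn.Module, d: int, k: int,
                 n_classes: int = 5):
        super().__init__()
        self.feature_extractor = feature_extractor  # weights from step 1
        self.classifier = nn.Linear(d * (2 * k + 1), n_classes)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (batch, 2k+1, C, T); apply Z to every sample, then aggregate.
        feats = [self.feature_extractor(s[:, i]) for i in range(s.shape[1])]
        return self.classifier(torch.cat(feats, dim=1))


def init_weights(module: nn.Module) -> None:
    """Normal(mu=0, sigma=0.1) weight initialization, as described above."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.1)


def step2_optimizer(model: TimeDistributedStager) -> torch.optim.Adam:
    """Step 2: freeze the time-distributed feature extractor and train
    only the softmax classifier, with the Adam settings from the paper."""
    for p in model.feature_extractor.parameters():
        p.requires_grad = False
    return torch.optim.Adam(model.classifier.parameters(),
                            lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
```

Step 1 then amounts to training a plain SleepStager (initialized via `model.apply(init_weights)`) and passing its trained `feature_extractor` to `TimeDistributedStager`; the balanced minibatches could be obtained with, e.g., a class-weighted sampler.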
# Experiments

In this section, the authors first present the dataset and the pre-processing techniques they used. They then present the different feature extractors from the literature included in their benchmark. Finally, they present experiments which aim at: (i) establishing a general benchmark of their feature extractor against SOTA approaches in univariate (single derivation) and bivariate (2 channels) contexts, (ii) studying the influence of the spatial context, (iii) evaluating the gain obtained by using the temporal context, and (iv) evaluating the impact of the quantity of training data.

### Data and Pre-processing steps

The dataset used is the MASS dataset, with a train/validation/test split of 41:10:10 subjects. The time series from all available sensors were first lowpass filtered with a 30 Hz cutoff frequency, then downsampled to a sampling rate of 128 Hz. The downsampling step speeds up the computations for the neural networks while keeping the information up to 64 Hz (the Nyquist frequency). The filter employed was a zero-phase finite impulse response (FIR) filter with a transition bandwidth of approximately 7 Hz. When investigating the use of temporal context by feeding the predictors with sequences of consecutive samples \\(S_{k}\\), the authors used zero-padding at the beginning and end of each night.

The time series fed into the different neural networks were standardized: for each channel, every 30 s sample is standardized individually so that it has zero mean and unit variance (a minimal sketch of this step is given at the end of this section). This was done because, in a task like sleep stage recording, factors such as skin humidity, body temperature, and electrode contact can greatly affect the readings. The operation only rescales the power in every frequency band without altering the relative amplitudes across bands, which is where the discriminant information for the sleep stage classification task lies.

### Experiments

1. **Experiment 1 (Comparison of feature extractors on the Fz / Cz channels):** Processing multiple channels increases performance significantly. The multivariate method described above also provides an additional boost in performance.
2. **Experiment 2 (More sensors increase performance):** Using 6 EEG channels improves performance greatly, but no further significant gains are visible when EOG and EMG data are added to the EEG data.
3. **Experiment 3 (Temporal context boosts performance):** A close temporal context induces a boost in classification performance, whereas too large a temporal context degrades it. The gain strongly depends on the spatial context taken into account. A temporal context of \\(\pm\\)30 s (i.e., \\(k = 1\\), one 30 s sample on each side) seems to boost performance significantly.
4. **Experiment 4 (More training data boosts performance):** This is somewhat expected. The authors tried \\(n \in \lbrace 3, 12, 22, 31, 41 \rbrace\\) training subjects and found performance gains as \\(n\\) increases. Furthermore, an increase in channels can compensate to some extent for the scarcity of data: the configuration with 6 EEG channels plus 2 EOG and 3 EMG channels trained on only 12 subjects reaches the same performance as the 2 EEG channel configuration trained on 41 subjects.
5. **Experiment 5 (Opening the model box):** The authors find that, to some extent, the model operates in a way similar to how a human scorer would.
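As referenced in the pre-processing paragraph above, here is a minimal NumPy sketch of the per-sample standardization; the function name and the epsilon guard against flat signals are assumptions.

```python
import numpy as np


def standardize_samples(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize every 30 s sample of every channel individually to
    zero mean and unit variance. x: (n_samples, n_channels, n_times)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)
```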