# Convolution- and Attention-Based Neural Network for Automated Sleep Stage Classification (2020)
>[source](https://www.mdpi.com/1660-4601/17/11/4152)
- ### Abstract
- Attention to specific signal characteristics (K-complexes and spindles) is required while scoring sleep epochs.
- A CNN model with an attention mechanism is devised to perform automatic sleep staging.
- The CNN learns local signal characteristics, while the attention mechanism learns intra- and inter-epoch features.
- ### Model
- A neural network model based on a CNN and an attention mechanism performs automated sleep stage classification from a single-channel raw EEG signal.
- The network uses a CNN to extract local signal features and multilayer attention networks to learn intra- and inter-epoch features.
- For the unbalanced dataset, the proposed method uses a weighted loss function during training to improve model performance on minority classes.
- The model outperforms other methods on the sleep-edf and sleep-edfx datasets under various training/testing partitioning methods, without changing the model's structure or any of its parameters.
- Datasets used are sleep-edf and sleep-edfx, available on PhysioNet.
- There are eight sleep records in the sleep-edf database, four from healthy subjects and four from subjects with sleep disorders. Sleep-edfx contains 197 records of 61 healthy individuals and 20 individuals with sleep disorders.
- Preprocessing:
- Results were compared using complete sleep-edf database and on the first 20 healthy individual records (subjects 0–19) from the sleep-edfx database.
- For each record in the sleep-edfx dataset, 30 min of wake stage data were retained from before the first sleep epoch and from after the final sleep epoch.
- The model used the Fpz-Cz channel as input.
- Due to differences between individuals, collection equipment, and environments, the resulting data distributions differ noticeably, which makes the model difficult to train.
- To make training more stable, z-score normalization was performed on the data from each individual.
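A minimal sketch of the per-subject normalization step, assuming `recordings` is a list of raw 1-D EEG arrays, one per individual (names are illustrative, not from the paper):

```python
import numpy as np

def zscore_per_subject(recordings):
    """z-score each subject's raw EEG independently: zero mean, unit variance."""
    return [(x - x.mean()) / x.std() for x in recordings]

# hypothetical example: two subjects with different baselines and scales
subjects = [np.random.default_rng(0).normal(50.0, 10.0, 3000),
            np.random.default_rng(1).normal(-5.0, 2.0, 3000)]
normalized = zscore_per_subject(subjects)
```

Normalizing per individual (rather than over the pooled dataset) removes subject- and equipment-specific offsets and scales before training.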
- There are two types of data partitioning methods for clinical data: subject-wise and record-wise (called epoch-wise in the paper), also known as independent and non-independent methods.
- The paper uses the epoch-wise method on the sleep-edf dataset and the subject-wise method on the sleep-edfx dataset.
- In the epoch-wise method, the dataset is shuffled before partitioning, so epochs from one subject can appear in both splits.
- The subject-wise method partitions based on subjects, so no subject appears in both the training and testing sets.
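The two partitioning schemes can be sketched as follows (toy data and function names are illustrative, not from the paper):

```python
import random

def epoch_wise_split(epochs, test_frac=0.3, seed=0):
    """Shuffle all epochs (across subjects) before partitioning."""
    idx = list(range(len(epochs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    return [epochs[i] for i in idx[:cut]], [epochs[i] for i in idx[cut:]]

def subject_wise_split(epochs_by_subject, test_subject):
    """Hold out every epoch of one subject for testing."""
    train = [e for s, eps in epochs_by_subject.items()
             if s != test_subject for e in eps]
    return train, list(epochs_by_subject[test_subject])

# hypothetical toy data: subject id -> list of that subject's epochs
data = {"s0": [0, 1, 2], "s1": [3, 4], "s2": [5, 6, 7, 8, 9]}
all_epochs = [e for eps in data.values() for e in eps]
tr_e, te_e = epoch_wise_split(all_epochs, test_frac=0.3)
tr_s, te_s = subject_wise_split(data, "s1")
```

Note that the epoch-wise split lets data from the same subject leak into both sets, which is why the subject-wise (independent) split is the stricter evaluation.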
- ### Architecture
- Model is divided into 3 components:
- Window feature learning
- Intra-epoch feature learning
- Inter-epoch feature learning.
- The model feeds multiple signal windows to window feature learning in parallel. It uses a CNN to construct a feature vector for each window.
- Intra-epoch feature learning uses a self-attention mechanism to obtain the weight of each signal window within an epoch, then performs a weighted addition to obtain the epoch feature.
- Window features are updated using a feed-forward layer.
- Inter-epoch feature learning also uses self-attention, to learn temporal dependencies between the current and adjacent epochs.
- The window length used is 200 samples with an overlap of 100, giving 29 windows per epoch.
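Assuming the sleep-edf EEG sampling rate of 100 Hz (so a 30 s epoch is 3000 samples), the windowing arithmetic gives (3000 − 200) / 100 + 1 = 29 windows. A minimal sketch:

```python
import numpy as np

def split_windows(epoch_signal, win_len=200, overlap=100):
    """Slice one sleep epoch into overlapping windows of length win_len."""
    step = win_len - overlap
    n = (len(epoch_signal) - win_len) // step + 1
    return np.stack([epoch_signal[i * step : i * step + win_len]
                     for i in range(n)])

# a 30 s epoch at 100 Hz is 3000 samples
epoch = np.arange(3000, dtype=float)
windows = split_windows(epoch)   # shape (29, 200)
```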

- The window feature model has 5 convolutional blocks and a global average pooling (GAP) layer, per the diagram.
- The intra- and inter-epoch feature modules each have a positional embedding, two attention blocks, and one GAP layer.
- The difference is in their inputs: (29, 256) for intra-epoch learning and (3, 256) for inter-epoch learning.
- After the previous three components, we finally obtained the feature vector of the current epoch with shape (1, 256).
- The model uses two fully connected layers as the classifier and will output each stage class probability of the current epoch.
- The first fully connected layer is followed by ReLU and dropout layers.
- The second fully connected layer connects to the softmax layer, which normalizes the output probability.
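A NumPy sketch of this classifier head (FC → ReLU → dropout → FC → softmax). The 256-d input matches the epoch feature above, but the hidden size (64 here) is an assumption, since the notes do not state it:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classifier_head(feat, W1, b1, W2, b2, drop_p=0.5, training=False, rng=None):
    """Two fully connected layers producing probabilities over 5 sleep stages."""
    h = np.maximum(feat @ W1 + b1, 0.0)            # first FC + ReLU
    if training:                                   # dropout only at train time
        mask = (rng or np.random.default_rng()).random(h.shape) >= drop_p
        h = h * mask / (1.0 - drop_p)              # inverted dropout scaling
    return softmax(h @ W2 + b2)                    # second FC + softmax

# hypothetical sizes: 256-d epoch feature, 64 hidden units, 5 output stages
rng = np.random.default_rng(0)
feat = rng.normal(size=256)
W1, b1 = rng.normal(size=(256, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.normal(size=(64, 5)) * 0.1, np.zeros(5)
probs = classifier_head(feat, W1, b1, W2, b2)
```

The softmax at the end normalizes the logits into a valid probability distribution over the five stages.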
- ### Training and Testing
- Weighted cross-entropy loss is used during training to reduce the impact of the unbalanced data.
- The weight β_i corresponds to the true category y_i of sample i.
- The Adam optimizer is used with the LookAhead mechanism, with an initial learning rate of 1e-4, a rate decay of 2e-4, and a gradient clip value of 0.1.
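A sketch of the weighted cross-entropy, taking each sample's loss as −β_{y_i} · log p_i[y_i]; the toy class weights are illustrative (up-weighting the rare N1 stage), not the paper's actual values:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean of -beta_{y_i} * log p_i[y_i]; rarer stages get larger beta."""
    return float(np.mean([-class_weights[y] * np.log(p[y])
                          for p, y in zip(probs, labels)]))

# toy batch of two samples over 5 stages (W, N1, N2, N3, REM)
probs = [np.array([0.1, 0.6, 0.1, 0.1, 0.1]),
         np.array([0.7, 0.1, 0.1, 0.05, 0.05])]
labels = [1, 0]                                   # true stages of the two samples
weights = np.array([1.0, 2.0, 1.0, 1.0, 1.0])     # hypothetical: N1 weighted 2x
loss = weighted_cross_entropy(probs, labels, weights)
```

Because the minority-class terms are scaled up, gradient updates penalize mistakes on rare stages more heavily than plain cross-entropy would.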
- Testing:
- Different partitioning was used for the two datasets.
- Sleep-edf used a 70% training / 30% testing epoch-wise partition.
- It was trained for 100 epochs (training iterations, not sleep epochs).
- Sleep-edfx used subject-wise partitioning, with 19 subjects for training and 1 for testing.
- This was repeated 20 times to evaluate on the entire dataset. Each training run used only 35 epochs due to the large dataset.
- The early stopping strategy was not used during training.
- Ensemble method was used to improve stability.
- The principle underlying this method is that ensemble outputs are obtained by using multiple models to infer the same input and combining their outputs into a final prediction, where P_i(X_t) is the stage probability vector of model i for the input at time t, and y_t is the final output stage.
- The parameters of the model from the last five training epochs are saved during training to obtain multiple models.
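One common way to realize this ensemble is to average the saved models' probability vectors P_i(X_t) and take the argmax; whether the paper averages or votes is not stated in these notes, so averaging is an assumption here:

```python
import numpy as np

def ensemble_stage(prob_vectors):
    """y_t = argmax of the averaged P_i(X_t) over the saved models."""
    return int(np.argmax(np.mean(prob_vectors, axis=0)))

# hypothetical outputs of the five saved models for one input X_t
model_probs = [np.array([0.2, 0.5, 0.1, 0.1, 0.1]),
               np.array([0.3, 0.4, 0.1, 0.1, 0.1]),
               np.array([0.1, 0.6, 0.1, 0.1, 0.1]),
               np.array([0.4, 0.3, 0.1, 0.1, 0.1]),
               np.array([0.2, 0.4, 0.2, 0.1, 0.1])]
stage = ensemble_stage(model_probs)
```

Averaging smooths out the epoch-to-epoch noise of any single checkpoint, which is the stability gain the ensemble is after.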
- Evaluation:
- Performance was evaluated per category and overall.
- For each category, the calculated parameters were precision, recall, and F1-score.
- For the overall evaluation, accuracy is used to obtain an intuitive understanding of the model's performance on the entire dataset.
- However, because the distribution of stages in the dataset is uneven, overall accuracy alone cannot reflect the model's true performance.
- To better reflect the model's performance on imbalanced datasets, the macro-averaged F1-score (MF1, computed over C = 5 classes) is used.
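MF1 as used here is the unweighted mean of the per-class F1 scores over the C = 5 stages; a plain-Python sketch (toy labels are illustrative):

```python
def macro_f1(y_true, y_pred, num_classes=5):
    """MF1: unweighted mean of per-class F1 over the C = 5 sleep stages."""
    f1s = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / num_classes

# toy example: one N1 epoch (class 1) misclassified as wake (class 0)
mf1 = macro_f1([0, 1, 2, 3, 4, 1], [0, 1, 2, 3, 4, 0])
```

Because each class contributes equally regardless of its frequency, a model that ignores the rare N1 stage is penalized in MF1 even when its overall accuracy stays high.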
- ### Results and Analysis
- Classification accuracy for N1 is poor due to the small number of samples.
- When removing window feature learning, the raw window signal was directly used as input to the intra-epoch attention module.
- When removing the intra- or inter-epoch attention module, the output of the previous module was directly connected to the subsequent GAP layer.
- Taking the full model as the baseline, removing any component reduces the model's MF1 metric.
- Removing window feature learning caused the greatest decline in performance.
- Removing the weighted loss did not affect accuracy, but it did reduce the MF1 score.
- ### Conclusion
- A convolution- and attention-based network using a single EEG channel was used to realize automated sleep stage classification.
- The CNN serves as the feature extractor, and attention replaces the use of an RNN.
- The weighted loss function was very important in this architecture.