# A Transformer-based Framework for Multivariate Time Series Representation Learning
## Terminology
Transformer
: A deep learning architecture built on self-attention
Multivariate time series (MTS)
: A time series with multiple variables recorded at each time step
Ex: stock market data
Representation Learning
: Learning a simpler representation of the data that still describes it well
Pretext task
: A task performed **before** training the model on the actual objective
It's an **auxiliary task** whose learned representation helps the real training tasks
Downstream tasks
: The target tasks the model is **fine-tuned** for after pre-training
Autoregressive
: To denoise the input
Given the (corrupted) input, the model learns a representation from which the original input can be reproduced
Data Imputation
: Completing the data by filling in its missing values
## Abstract
A novel framework for MTS representation learning based on the Transformer encoder
It beats many SOTA models
It performs well even when the dataset is small
## Goal
+ Model time series the way Transformers model language in NLP
+ Generalize the use of Transformers to time series
+ Provide a **pre-trained model** that can be readily used for a wide variety of **downstream tasks**
## Challenges with Time Series
* Time series are highly complex
* SOTA methods usually have a high computational cost
* Popular methods don't exploit GPUs
* They often scale poorly to longer series
* Labeling sequences is expensive and requires expertise
* Most SOTA methods were developed for univariate series
* No general-purpose model (each model is designed for a certain type of data)
## Pros of Transformer
* Handles long series and processes all time steps **concurrently** (in parallel)
* Learns multiple representations of the input (via multi-head attention)
* Attention is applied many times (stacked layers), allowing for more complex representations
## Proposed Structure

:::success
By adding different output layers after the Transformer Encoder and training on different **downstream tasks**, the same model can perform either **Classification** or **Regression**
:::
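Below is a minimal PyTorch sketch of that idea (my own code, not the authors'): a shared backbone does the input encoding (projection, learnable positional encoding, padding mask) and runs the Transformer Encoder, and a small, swappable head on top decides the task. Names such as `TSTBackbone` and all hyperparameter values are placeholders; note that the stock `nn.TransformerEncoderLayer` uses layer normalization, whereas the paper swaps in batch normalization (see the block sketch further below).
```python
import torch
import torch.nn as nn

class TSTBackbone(nn.Module):
    """Shared part: input projection + positional encoding + Transformer encoder."""
    def __init__(self, n_vars, d_model=64, n_heads=8, n_layers=3, max_len=512):
        super().__init__()
        # Input encoding: project each time step's n_vars features to d_model dims,
        # then add a learnable positional encoding.
        self.project = nn.Linear(n_vars, d_model)
        self.pos_enc = nn.Parameter(torch.zeros(max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x, padding_mask=None):
        # x: (batch, seq_len, n_vars); padding_mask: (batch, seq_len), True = padded
        z = self.project(x) + self.pos_enc[: x.size(1)]
        return self.encoder(z, src_key_padding_mask=padding_mask)  # (batch, seq_len, d_model)

class TSTWithHead(nn.Module):
    """Backbone plus a task-specific head: reconstruction, regression, or classification."""
    def __init__(self, backbone, head):
        super().__init__()
        self.backbone, self.head = backbone, head

    def forward(self, x, padding_mask=None):
        return self.head(self.backbone(x, padding_mask))
```
Only `head` changes between pre-training and the downstream tasks; concrete heads are sketched after the step-by-step description below.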
## How does TST work
1. Input encoding
: To form the input of the Transformer Encoder, there are four steps in this part (cf. the backbone sketch under **Proposed Structure** above):
a. Normalization
b. Flatten the vector
c. Positional encoding
d. Padding
2. Transformer Encoder
: It's the standard Transformer encoder, built from the following layers (a minimal sketch of one block follows the note below):
a. Self-attention layer
>The layer that learns the representations
b. Skip connection + batch normalization
c. Feed forward layer + batch normalization
:::success
:notes: The Decoder part of the transformer is not used in this architecture
:::
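As a hedged sketch of one such encoder block (simplified, my own code): the usual layer normalization is replaced by `nn.BatchNorm1d`, as the notes describe, which requires a transpose because batch norm normalizes over the feature dimension.
```python
import torch
import torch.nn as nn

class EncoderBlockBN(nn.Module):
    """One Transformer encoder block with batch norm in place of layer norm."""
    def __init__(self, d_model=64, n_heads=8, d_ff=256, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.bn1 = nn.BatchNorm1d(d_model)
        self.bn2 = nn.BatchNorm1d(d_model)
        self.drop = nn.Dropout(dropout)

    def _bn(self, bn, x):
        # x: (batch, seq_len, d_model); BatchNorm1d expects (batch, d_model, seq_len)
        return bn(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x, padding_mask=None):
        # (a) self-attention learns how the time steps relate to each other
        attn_out, _ = self.attn(x, x, x, key_padding_mask=padding_mask)
        # (b) skip connection + batch normalization
        x = self._bn(self.bn1, x + self.drop(attn_out))
        # (c) feed-forward layer + skip connection + batch normalization
        x = self._bn(self.bn2, x + self.drop(self.ff(x)))
        return x
```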
3. Unsupervised Pre-training
Goal:
: Get the model to learn how to reconstruct the input sequences
Task:
: Autoregressive task
How:
: Part of the input is **stochastically masked**, and the model is asked to predict the masked values (a sketch follows the loss below)
> :notes: The authors found that the mask should cover a **long contiguous segment** of the input, because that is the only way to force the model to truly learn the relationships between variables in the sequence (isolated masked points could simply be interpolated from neighbouring values)
**Reconstruction error**:
: $$
L_{\text{MSE}} = \frac{1}{|M|} \sum_{(t,i)\in M} \left(\hat x(t,i) - x(t,i)\right)^2
\\ M = \text{the set of masked (time, variable) positions, } |M| \text{ its size}
\\ \hat x(t,i) = \text{reconstructed value}
\\ x(t,i) = \text{original input value}
$$
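A minimal sketch of this pre-training step, reusing the `TSTBackbone` from the earlier sketch plus a linear reconstruction head; `segment_mask` is my own simplified stand-in for the paper's segment-based masking, and the loss averages only over the masked set $M$, as in $L_{\text{MSE}}$ above.
```python
import torch
import torch.nn as nn

def segment_mask(batch, seq_len, n_vars, seg_len=3, mask_ratio=0.15):
    """Mask contiguous segments independently per variable (True = masked).
    A simplified stand-in for the paper's masking scheme."""
    mask = torch.zeros(batch, seq_len, n_vars, dtype=torch.bool)
    n_segments = max(1, int(mask_ratio * seq_len / seg_len))
    for b in range(batch):
        for v in range(n_vars):
            for _ in range(n_segments):
                start = torch.randint(0, max(1, seq_len - seg_len), (1,)).item()
                mask[b, start:start + seg_len, v] = True
    return mask

def pretrain_loss(backbone, recon_head, x):
    """Mask part of x, reconstruct it, and compute MSE only on the masked entries."""
    mask = segment_mask(*x.shape)
    x_masked = x.masked_fill(mask, 0.0)     # masked values are zeroed out
    z = backbone(x_masked)                  # (batch, seq_len, d_model)
    x_hat = recon_head(z)                   # (batch, seq_len, n_vars)
    return ((x_hat - x)[mask] ** 2).mean()  # mean over the set M of masked points

# Hypothetical usage:
# backbone = TSTBackbone(n_vars=6)
# recon_head = nn.Linear(64, 6)             # project d_model back to the input variables
# loss = pretrain_loss(backbone, recon_head, torch.randn(32, 100, 6))
# loss.backward()
```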
4. Supervised fine-tuning (see the sketch after this list)
Regression:
: How:
: Concatenate the output of the **Transformer Encoder** into a single vector
: Feed the vector into a **linear layer** and train with the loss below
: Loss:
: $$
\mathcal L = \lVert \hat y - y \rVert^2, \qquad \hat y = W_o \bar z + b_o
\\ \hat y \text{ is the output of the linear layer}
\\ W_o \text{ is the weight matrix of the linear layer}
\\ b_o \text{ is its bias (offset)}
\\ \bar z \text{ is the single vector produced by the Transformer Encoder}
$$
:notes: All layers, including both the **Input Encoding** and the **Transformer Encoder**, are trainable during this procedure
Classification:
: How:
: Concatenate the output of the **Transformer Encoder** into a single vector
: Add a fully-connected feed-forward layer with **one output unit per class** to be predicted
: Apply the softmax activation function to the output
: Loss:
: Cross Entropy with the categorical ground truth label
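Both fine-tuning heads described above are sketched below (again my own simplified code): each concatenates (flattens) the encoder output into a single vector $\bar z$ and applies one linear layer; for classification, the softmax is folded into `nn.CrossEntropyLoss`.
```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Concatenate all time-step representations, then a single linear layer."""
    def __init__(self, seq_len, d_model, n_outputs=1):
        super().__init__()
        self.linear = nn.Linear(seq_len * d_model, n_outputs)  # y_hat = W_o z_bar + b_o

    def forward(self, z):                    # z: (batch, seq_len, d_model)
        z_bar = z.flatten(start_dim=1)       # (batch, seq_len * d_model)
        return self.linear(z_bar)

class ClassificationHead(nn.Module):
    """Same idea, but with one output unit per class."""
    def __init__(self, seq_len, d_model, n_classes):
        super().__init__()
        self.linear = nn.Linear(seq_len * d_model, n_classes)

    def forward(self, z):
        return self.linear(z.flatten(start_dim=1))  # logits of shape (batch, n_classes)

# Hypothetical usage (all layers, including the backbone, remain trainable):
# z = backbone(x)                                           # pre-trained encoder output
# reg_loss = nn.MSELoss()(RegressionHead(100, 64)(z), y_reg)
# cls_loss = nn.CrossEntropyLoss()(ClassificationHead(100, 64, 5)(z), y_cls)
```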
## How well does it work?

Avg.Rel.Diff.Mean: Average Relative Difference from Mean
: -0.3 means the model on average attains 30% lower **RMSE** on a dataset compared to the average model performance on **the same dataset**
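A small sketch of how such a metric could be computed, assuming a matrix of per-dataset RMSE scores; the exact formula here is my reading of the definition above, not copied from the paper.
```python
import numpy as np

def avg_rel_diff_from_mean(rmse):
    """rmse: array of shape (n_datasets, n_models).
    For each dataset, compare every model's RMSE to the mean RMSE of all models
    on that dataset, then average over datasets. -0.3 => 30% lower RMSE than average."""
    per_dataset_mean = rmse.mean(axis=1, keepdims=True)      # (n_datasets, 1)
    rel_diff = (rmse - per_dataset_mean) / per_dataset_mean  # (n_datasets, n_models)
    return rel_diff.mean(axis=0)                             # one score per model

# Hypothetical example with 3 datasets and 2 models:
# scores = np.array([[0.7, 1.0], [1.4, 2.0], [0.9, 1.1]])
# print(avg_rel_diff_from_mean(scores))
```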
:::success
The model can also be further fine-tuned at inference time
:::