# A Transformer-based Framework for Multivariate Time Series Representation Learning
## Terminology
Transformer
: A deep learning architecture built on self-attention
Multivariate time series (MTS)
: A time series with multiple variables recorded at each time step
Ex: stock market data
Representation Learning
: Learning a simpler representation of the data that still describes it well
Pretext task
: A task performed **before** training the model on the actual objective
It's an **auxiliary task** whose learned representation helps the real training tasks
Downstream tasks
: The target tasks the model is **fine-tuned** for after pre-training
Autoregressive
: To denoise the input
Given the (corrupted) input, the model learns a representation from which the original input can be reproduced
Data Imputation
: Completing the data by filling in its missing values
## Abstract
A novel framework for MTS representation learning based on the Transformer encoder
It beats many SOTA models
It performs well even when the dataset is small
## Goal
+ Model time series the way Transformers model language in NLP
+ Generalize the use of Transformers to time series
+ Provide a **pre-trained model** that can be readily used for a wide variety of **downstream tasks**
## Challenges with Time Series
* Time series are highly complex
* SOTA methods usually have a high computational cost
* Popular methods don't exploit GPUs
* They often scale poorly to longer series
* Labeling sequences is expensive and requires expertise
* Most SOTA methods were developed for univariate series
* No general-purpose model (each model is designed for a certain type of data)
## Pros of Transformer
* Handles long series and processes all time steps **concurrently** (in parallel)
* Learns multiple representations of the input (via multi-head attention)
* Attention is applied many times (stacked layers), allowing for more complex representations
## Proposed Structure

:::success
By adding different output layers after the Transformer Encoder and training on different **downstream tasks**, the same model can perform either **Classification** or **Regression**
:::
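Below is a minimal PyTorch sketch of that idea (my own code, not the authors'): a shared backbone does the input encoding (projection, learnable positional encoding, padding mask) and runs the Transformer Encoder, and a small, swappable head on top decides the task. Names such as `TSTBackbone` and all hyperparameter values are placeholders; note that the stock `nn.TransformerEncoderLayer` uses layer normalization, whereas the paper swaps in batch normalization (see the block sketch further below).
```python
import torch
import torch.nn as nn

class TSTBackbone(nn.Module):
    """Shared part: input projection + positional encoding + Transformer encoder."""
    def __init__(self, n_vars, d_model=64, n_heads=8, n_layers=3, max_len=512):
        super().__init__()
        # Input encoding: project each time step's n_vars features to d_model dims,
        # then add a learnable positional encoding.
        self.project = nn.Linear(n_vars, d_model)
        self.pos_enc = nn.Parameter(torch.zeros(max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x, padding_mask=None):
        # x: (batch, seq_len, n_vars); padding_mask: (batch, seq_len), True = padded
        z = self.project(x) + self.pos_enc[: x.size(1)]
        return self.encoder(z, src_key_padding_mask=padding_mask)  # (batch, seq_len, d_model)

class TSTWithHead(nn.Module):
    """Backbone plus a task-specific head: reconstruction, regression, or classification."""
    def __init__(self, backbone, head):
        super().__init__()
        self.backbone, self.head = backbone, head

    def forward(self, x, padding_mask=None):
        return self.head(self.backbone(x, padding_mask))
```
Only `head` changes between pre-training and the downstream tasks; concrete heads are sketched after the step-by-step description below.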
## How does TST work
1. Input encoding
: To form the input of the Transformer Encoder, there are four steps in this part (cf. the backbone sketch under **Proposed Structure** above):
a. Normalization
b. Flatten the vector
c. Positional encoding
d. Padding
2. Transformer Encoder
: It's the standard Transformer encoder, built from the following layers (a minimal sketch of one block follows the note below):
a. Self-attention layer
>The layer that learns the representations
b. Skip connection + batch normalization
c. Feed forward layer + batch normalization
:::success
:notes: The Decoder part of the transformer is not used in this architecture
:::
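As a hedged sketch of one such encoder block (simplified, my own code): the usual layer normalization is replaced by `nn.BatchNorm1d`, as the notes describe, which requires a transpose because batch norm normalizes over the feature dimension.
```python
import torch
import torch.nn as nn

class EncoderBlockBN(nn.Module):
    """One Transformer encoder block with batch norm in place of layer norm."""
    def __init__(self, d_model=64, n_heads=8, d_ff=256, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.bn1 = nn.BatchNorm1d(d_model)
        self.bn2 = nn.BatchNorm1d(d_model)
        self.drop = nn.Dropout(dropout)

    def _bn(self, bn, x):
        # x: (batch, seq_len, d_model); BatchNorm1d expects (batch, d_model, seq_len)
        return bn(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x, padding_mask=None):
        # (a) self-attention learns how the time steps relate to each other
        attn_out, _ = self.attn(x, x, x, key_padding_mask=padding_mask)
        # (b) skip connection + batch normalization
        x = self._bn(self.bn1, x + self.drop(attn_out))
        # (c) feed-forward layer + skip connection + batch normalization
        x = self._bn(self.bn2, x + self.drop(self.ff(x)))
        return x
```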
3. Unsupervised Pre-training
Goal:
: Get the model to learn how to reconstruct the input sequences
Task:
: Autoregressive task
How:
: Part of the input is **stochastically masked**, and the model is asked to predict the masked values (a sketch follows the loss below)
> :notes: The authors found that the mask should cover a **long contiguous segment** of the input, because that is the only way to force the model to truly learn the relationships between variables in the sequence (isolated masked points could simply be interpolated from neighbouring values)
**Reconstruction error**:
: $$
L_{\text{MSE}} = \frac{1}{|M|} \sum_{(t,i)\in M} \left(\hat x(t,i) - x(t,i)\right)^2
\\ M = \text{the set of masked (time, variable) positions, } |M| \text{ its size}
\\ \hat x(t,i) = \text{reconstructed value}
\\ x(t,i) = \text{original input value}
$$
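A minimal sketch of this pre-training step, reusing the `TSTBackbone` from the earlier sketch plus a linear reconstruction head; `segment_mask` is my own simplified stand-in for the paper's segment-based masking, and the loss averages only over the masked set $M$, as in $L_{\text{MSE}}$ above.
```python
import torch
import torch.nn as nn

def segment_mask(batch, seq_len, n_vars, seg_len=3, mask_ratio=0.15):
    """Mask contiguous segments independently per variable (True = masked).
    A simplified stand-in for the paper's masking scheme."""
    mask = torch.zeros(batch, seq_len, n_vars, dtype=torch.bool)
    n_segments = max(1, int(mask_ratio * seq_len / seg_len))
    for b in range(batch):
        for v in range(n_vars):
            for _ in range(n_segments):
                start = torch.randint(0, max(1, seq_len - seg_len), (1,)).item()
                mask[b, start:start + seg_len, v] = True
    return mask

def pretrain_loss(backbone, recon_head, x):
    """Mask part of x, reconstruct it, and compute MSE only on the masked entries."""
    mask = segment_mask(*x.shape)
    x_masked = x.masked_fill(mask, 0.0)     # masked values are zeroed out
    z = backbone(x_masked)                  # (batch, seq_len, d_model)
    x_hat = recon_head(z)                   # (batch, seq_len, n_vars)
    return ((x_hat - x)[mask] ** 2).mean()  # mean over the set M of masked points

# Hypothetical usage:
# backbone = TSTBackbone(n_vars=6)
# recon_head = nn.Linear(64, 6)             # project d_model back to the input variables
# loss = pretrain_loss(backbone, recon_head, torch.randn(32, 100, 6))
# loss.backward()
```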
4. Supervised fine-tuning (see the sketch after this list)
Regression:
: How:
: Concatenate the output of the **Transformer Encoder** into a single vector
: Feed the vector into a **linear layer** and train with the loss below
: Loss:
: $$
\mathcal L = \lVert \hat y - y \rVert^2, \qquad \hat y = W_o \bar z + b_o
\\ \hat y \text{ is the output of the linear layer}
\\ W_o \text{ is the weight matrix of the linear layer}
\\ b_o \text{ is its bias (offset)}
\\ \bar z \text{ is the single vector produced by the Transformer Encoder}
$$
:notes: All layers, including both the **Input Encoding** and the **Transformer Encoder**, are trainable during this procedure
Classification:
: How:
: Concatenate the output of the **Transformer Encoder** into a single vector
: Add a fully-connected feed-forward layer with **one output unit per class** to be predicted
: Apply the softmax activation function to the output
: Loss:
: Cross Entropy with the categorical ground truth label
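Both fine-tuning heads described above are sketched below (again my own simplified code): each concatenates (flattens) the encoder output into a single vector $\bar z$ and applies one linear layer; for classification, the softmax is folded into `nn.CrossEntropyLoss`.
```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Concatenate all time-step representations, then a single linear layer."""
    def __init__(self, seq_len, d_model, n_outputs=1):
        super().__init__()
        self.linear = nn.Linear(seq_len * d_model, n_outputs)  # y_hat = W_o z_bar + b_o

    def forward(self, z):                    # z: (batch, seq_len, d_model)
        z_bar = z.flatten(start_dim=1)       # (batch, seq_len * d_model)
        return self.linear(z_bar)

class ClassificationHead(nn.Module):
    """Same idea, but with one output unit per class."""
    def __init__(self, seq_len, d_model, n_classes):
        super().__init__()
        self.linear = nn.Linear(seq_len * d_model, n_classes)

    def forward(self, z):
        return self.linear(z.flatten(start_dim=1))  # logits of shape (batch, n_classes)

# Hypothetical usage (all layers, including the backbone, remain trainable):
# z = backbone(x)                                           # pre-trained encoder output
# reg_loss = nn.MSELoss()(RegressionHead(100, 64)(z), y_reg)
# cls_loss = nn.CrossEntropyLoss()(ClassificationHead(100, 64, 5)(z), y_cls)
```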
## How well does it work?

Avg.Rel.Diff.Mean: Average Relative Difference from Mean
: -0.3 means the model on average attains 30% lower **RMSE** on a dataset compared to the average model performance on **the same dataset**
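A small sketch of how such a metric could be computed, assuming a matrix of per-dataset RMSE scores; the exact formula here is my reading of the definition above, not copied from the paper.
```python
import numpy as np

def avg_rel_diff_from_mean(rmse):
    """rmse: array of shape (n_datasets, n_models).
    For each dataset, compare every model's RMSE to the mean RMSE of all models
    on that dataset, then average over datasets. -0.3 => 30% lower RMSE than average."""
    per_dataset_mean = rmse.mean(axis=1, keepdims=True)      # (n_datasets, 1)
    rel_diff = (rmse - per_dataset_mean) / per_dataset_mean  # (n_datasets, n_models)
    return rel_diff.mean(axis=0)                             # one score per model

# Hypothetical example with 3 datasets and 2 models:
# scores = np.array([[0.7, 1.0], [1.4, 2.0], [0.9, 1.1]])
# print(avg_rel_diff_from_mean(scores))
```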
:::success
The model can also be further fine-tuned at inference time
:::