Time Series Generative Adversarial Networks. NIPS 2019.

Abstract

A good generative model for time-series data should preserve temporal dynamics, in the sense that new sequences respect the original relationships between variables across time. The authors propose a novel framework for generating realistic time series data that combines the flexibility of the unsupervised paradigm with the control afforded by supervised training. Through a learned embedding space jointly optimized with both supervised and adversarial objectives, the authors encourage the network to adhere to the dynamics of the training data during sampling.

Introduction

The temporal setting poses a unique challenge to generative modeling. A model is not only tasked with capturing the distributions of features within each time point; it should also capture the potentially complex dynamics of those variables across time. In modeling multivariate sequential data $x_{1:T} = (x_{1}, \dots, x_{T})$, we wish to accurately capture the conditional distribution $p(x_{t} \mid x_{1:t-1})$ of temporal transitions.

Contribution

First, in addition to the unsupervised adversarial loss on both real and synthetic sequences, the authors introduce a stepwise supervised loss using the original data as supervision, thereby explicitly encouraging the model to capture the stepwise conditional distributions in the data.

Second, the authors introduce an embedding network to provide a reversible mapping between features and latent representations, thereby reducing the high dimensionality of the adversarial learning space. This capitalizes on the fact that the temporal dynamics of even complex systems are often driven by fewer and lower-dimensional factors of variation.

Finally, applying the "train on synthetic, test on real" (TSTR) framework to the sequence prediction task, the authors evaluate how well the generated data preserves the predictive characteristics of the original.

Related Work

Autoregressive recurrent networks trained via the maximum likelihood principle are prone to potentially large prediction errors when performing multi-step sampling, due to the discrepancy between closed-loop training (i.e. conditioned on ground truths) and open-loop inference (i.e. conditioned on previous guesses).
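The gap between closed-loop and open-loop prediction can be seen on a toy problem. The sketch below is only an illustration under assumed settings: the data follow a hypothetical noiseless AR(1) process with coefficient 0.9, and the "model" is the same recurrence with a deliberately imperfect coefficient 0.8. None of these names or numbers come from the paper.

```python
def one_step_errors(series, a_hat):
    """Closed-loop errors: each prediction is conditioned on the true previous value."""
    return [abs(series[t] - a_hat * series[t - 1]) for t in range(1, len(series))]


def open_loop_errors(series, a_hat):
    """Open-loop errors: each prediction is conditioned on the model's own previous guess."""
    guess = series[0]
    errors = []
    for t in range(1, len(series)):
        guess = a_hat * guess
        errors.append(abs(series[t] - guess))
    return errors


# Hypothetical toy data: x_t = 0.9 * x_{t-1} with x_0 = 1.
true_a, a_hat = 0.9, 0.8  # a_hat is an intentionally imperfect fitted coefficient
series = [true_a ** t for t in range(20)]

closed = one_step_errors(series, a_hat)
opened = open_loop_errors(series, a_hat)
print(sum(closed), sum(opened))
```

On this toy series the two error sequences agree at the first step, after which the open-loop error is never smaller than the closed-loop error, because each guess inherits the mistakes of the previous ones.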

Multiple studies have straightforwardly inherited the GAN framework within the temporal setting. The first (C-RNN-GAN) directly applied the GAN architecture to sequential data, using LSTM networks for generator and discriminator.

Representation learning in the time-series setting primarily concerns learning compact encodings that benefit downstream tasks such as prediction, forecasting, and classification.

Problem Definition

Let $\mathcal{S}$ be a vector space of static features, $\mathcal{X}$ of temporal features, and let $S \in \mathcal{S}$, $X \in \mathcal{X}$ be random vectors that can be instantiated with specific values denoted $s$ and $x$. We consider tuples of the form $(S, X_{1:T})$ with some joint distribution $p$. In the training data, let individual samples be indexed by $n \in \{1, \dots, N\}$, so we can denote the training dataset $\mathcal{D} = \{(s_{n}, x_{n, 1:T_{n}})\}_{n=1}^{N}$.

The goal is to use training data $\mathcal{D}$ to learn a density $\hat{p}(S, X_{1:T})$ that best approximates $p(S, X_{1:T})$. Matching the full joint directly can be difficult in the standard GAN framework, so we additionally make use of the autoregressive decomposition of the joint

$$p(S, X_{1:T}) = p(S) \prod_{t} p(X_{t} \mid S, X_{1:t-1})$$

to focus specifically on the conditionals: the aim is to learn a density $\hat{p}(X_{t} \mid S, X_{1:t-1})$ that best approximates $p(X_{t} \mid S, X_{1:t-1})$ at any time $t$.

Two objectives. Importantly, this breaks down the sequence-level objective (matching the joint distribution) into a series of stepwise objectives (matching the conditionals).

The first is global,

$$\min_{\hat{p}} D\big(p(S, X_{1:T}) \,\|\, \hat{p}(S, X_{1:T})\big) \tag{1}$$

where $D$ is some appropriate measure of distance between distributions. The second is local,

$$\min_{\hat{p}} D\big(p(X_{t} \mid S, X_{1:t-1}) \,\|\, \hat{p}(X_{t} \mid S, X_{1:t-1})\big) \tag{2}$$

for any $t$. Under an ideal discriminator in the GAN framework, the former takes the form of the Jensen-Shannon divergence. With maximum-likelihood training on the original data, the latter takes the form of the Kullback-Leibler divergence.
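For intuition about the two divergences named above, the following minimal sketch computes them for discrete distributions. It assumes the distributions are given as equal-length probability lists and that `q` is positive wherever `p` is; the example vectors are made up.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical example distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(kl_divergence(p, q), kl_divergence(q, p))  # KL is asymmetric
print(js_divergence(p, q), js_divergence(q, p))  # JS is symmetric
```

The Jensen-Shannon divergence is symmetric and bounded above by $\log 2$, whereas the Kullback-Leibler divergence is asymmetric and unbounded.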

Methodology

TimeGAN consists of four network components: an embedding function, recovery function, sequence generator, and sequence discriminator. The key insight is that the autoencoding components (first two) are trained jointly with the adversarial components (latter two), such that TimeGAN simultaneously learns to encode features, generate representations, and iterate across time.

Embedding and Recovery Functions

Let $\mathcal{H}_{S}$, $\mathcal{H}_{X}$ denote the latent vector spaces corresponding to feature spaces $\mathcal{S}$, $\mathcal{X}$. Then the embedding function $e : \mathcal{S} \times \prod_{t} \mathcal{X} \to \mathcal{H}_{S} \times \prod_{t} \mathcal{H}_{X}$ takes static and temporal features to their latent codes $h_{S}, h_{1:T} = e(s, x_{1:T})$. Here, the authors implement $e$ via a recurrent network

$$h_{S} = e_{S}(s), \qquad h_{t} = e_{X}(h_{S}, h_{t-1}, x_{t}) \tag{3}$$

where $e_{S} : \mathcal{S} \to \mathcal{H}_{S}$ is an embedding network for static features and $e_{X} : \mathcal{H}_{S} \times \mathcal{H}_{X} \times \mathcal{X} \to \mathcal{H}_{X}$ is a recurrent embedding network for temporal features.

In the opposite direction, the recovery function $r : \mathcal{H}_{S} \times \prod_{t} \mathcal{H}_{X} \to \mathcal{S} \times \prod_{t} \mathcal{X}$ takes static and temporal codes back to their feature representations $\tilde{s}, \tilde{x}_{1:T} = r(h_{S}, h_{1:T})$. Here, the authors implement $r$ via a feedforward network

$$\tilde{s} = r_{S}(h_{S}), \qquad \tilde{x}_{t} = r_{X}(h_{t}) \tag{4}$$

where $r_{S} : \mathcal{H}_{S} \to \mathcal{S}$ and $r_{X} : \mathcal{H}_{X} \to \mathcal{X}$ are recovery networks for static and temporal embeddings.

Note that the embedding and recovery functions can be parameterized by any architecture of choice, with the only stipulation being that they be autoregressive and obey causal ordering (i.e. output(s) at each step can only depend on preceding information). For example, it is just as possible to implement the former with temporal convolutions, or the latter via an attention-based decoder.
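As a rough illustration of equations (3) and (4), the sketch below implements a recurrent embedder and a per-step feedforward recovery map in plain Python. Everything here is an assumption made for illustration: the class names `Embedder` and `Recovery`, the single tanh layer, the random untrained weights, and the dimensions. It is not the authors' implementation.

```python
import math
import random


def affine(weights, vector, bias):
    """Return weights @ vector + bias for a list-of-lists weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class Embedder:
    """h_S = tanh(W_S s); h_t = tanh(W [h_S, h_{t-1}, x_t]) as in equation (3)."""

    def __init__(self, static_dim, feature_dim, latent_dim, rng):
        self.latent_dim = latent_dim
        self.w_static = random_matrix(latent_dim, static_dim, rng)
        self.w_step = random_matrix(latent_dim, 2 * latent_dim + feature_dim, rng)
        self.bias = [0.0] * latent_dim

    def __call__(self, s, xs):
        h_static = [math.tanh(a) for a in affine(self.w_static, s, self.bias)]
        h_prev, codes = [0.0] * self.latent_dim, []
        for x_t in xs:  # causal: h_t only sees x_1, ..., x_t
            inputs = h_static + h_prev + x_t
            h_prev = [math.tanh(a) for a in affine(self.w_step, inputs, self.bias)]
            codes.append(h_prev)
        return h_static, codes


class Recovery:
    """s~ = W_S h_S and x~_t = W_X h_t, applied independently at each step (equation (4))."""

    def __init__(self, static_dim, feature_dim, latent_dim, rng):
        self.w_static = random_matrix(static_dim, latent_dim, rng)
        self.b_static = [0.0] * static_dim
        self.w_step = random_matrix(feature_dim, latent_dim, rng)
        self.b_step = [0.0] * feature_dim

    def __call__(self, h_static, codes):
        s_tilde = affine(self.w_static, h_static, self.b_static)
        xs_tilde = [affine(self.w_step, h_t, self.b_step) for h_t in codes]
        return s_tilde, xs_tilde


rng = random.Random(0)
embed = Embedder(static_dim=2, feature_dim=3, latent_dim=5, rng=rng)
recover = Recovery(static_dim=2, feature_dim=3, latent_dim=5, rng=rng)
s = [0.2, -0.1]
xs = [[0.1 * t, 0.5, -0.3] for t in range(4)]
h_static, codes = embed(s, xs)
s_tilde, xs_tilde = recover(h_static, codes)
print(len(codes), len(codes[0]), len(xs_tilde[0]))
```

Because the weights are untrained, the reconstructions are meaningless; the point is only the shape of the computation and the causal ordering, in which changing a later input leaves earlier codes untouched.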

Sequence Generator and Discriminator

Instead of producing synthetic output directly in feature space, the generator first outputs into the embedding space.

Let $\mathcal{Z}_{S}$, $\mathcal{Z}_{X}$ denote vector spaces over which known distributions are defined, and from which random vectors are drawn as input for generating into $\mathcal{H}_{S}$, $\mathcal{H}_{X}$. Then the generating function $g : \mathcal{Z}_{S} \times \prod_{t} \mathcal{Z}_{X} \to \mathcal{H}_{S} \times \prod_{t} \mathcal{H}_{X}$ takes a tuple of static and temporal random vectors to synthetic latent codes $\hat{h}_{S}, \hat{h}_{1:T} = g(z_{S}, z_{1:T})$. Here, the authors implement $g$ via a recurrent network

$$\hat{h}_{S} = g_{S}(z_{S}), \qquad \hat{h}_{t} = g_{X}(\hat{h}_{S}, \hat{h}_{t-1}, z_{t}) \tag{5}$$

where $g_{S} : \mathcal{Z}_{S} \to \mathcal{H}_{S}$ is a generator network for static features, and $g_{X} : \mathcal{H}_{S} \times \mathcal{H}_{X} \times \mathcal{Z}_{X} \to \mathcal{H}_{X}$ is a recurrent generator for temporal features. Random vector $z_{S}$ can be sampled from a distribution of choice, and $z_{t}$ follows a stochastic process; here the authors use the Gaussian distribution and Wiener process respectively.
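The generator inputs and the recurrence in equation (5) can be sketched as follows. This is a toy under stated assumptions, not the paper's network: the Wiener path is discretized with unit time step, the generator is an elementwise tanh recurrence with hypothetical scalar weights, and the noise and latent dimensions are taken to be equal so that elementwise updates make sense.

```python
import math
import random


def sample_static_noise(dim, rng):
    """z_S drawn from a standard Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def sample_wiener_path(steps, dim, rng, dt=1.0):
    """Discretized Wiener process: z_t = z_{t-1} + sqrt(dt) * eps_t, starting from 0."""
    z, path = [0.0] * dim, []
    for _ in range(steps):
        z = [zi + math.sqrt(dt) * rng.gauss(0.0, 1.0) for zi in z]
        path.append(z)
    return path


def generate(z_static, z_path, w_static=0.5, w_prev=0.3, w_noise=0.2):
    """Toy g: h_S = tanh(z_S); h_t = tanh(w_static*h_S + w_prev*h_{t-1} + w_noise*z_t)."""
    h_static = [math.tanh(z) for z in z_static]
    h_prev, codes = [0.0] * len(h_static), []
    for z_t in z_path:
        h_prev = [math.tanh(w_static * hs + w_prev * hp + w_noise * z)
                  for hs, hp, z in zip(h_static, h_prev, z_t)]
        codes.append(h_prev)
    return h_static, codes


rng = random.Random(0)
z_static = sample_static_noise(dim=3, rng=rng)
z_path = sample_wiener_path(steps=6, dim=3, rng=rng)
h_hat_static, h_hat = generate(z_static, z_path)
print(len(h_hat), len(h_hat[0]))
```

The output lives in the latent space rather than feature space, so a recovery network would still be needed to turn `h_hat` into synthetic features.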

Finally, the discriminator also operates from the embedding space. The discrimination function $d : \mathcal{H}_{S} \times \prod_{t} \mathcal{H}_{X} \to [0, 1] \times \prod_{t} [0, 1]$ receives the static and temporal codes, returning classifications $\tilde{y}_{S}, \tilde{y}_{1:T} = d(\tilde{h}_{S}, \tilde{h}_{1:T})$. The $\tilde{h}_{*}$ notation denotes either real ($h_{*}$) or synthetic ($\hat{h}_{*}$) embeddings; similarly, the $\tilde{y}_{*}$ notation denotes classifications of either real ($y_{*}$) or synthetic ($\hat{y}_{*}$) data. Here, the authors implement $d$ via a bidirectional recurrent network with a feedforward output layer,

$$\tilde{y}_{S} = d_{S}(\tilde{h}_{S}), \qquad \tilde{y}_{t} = d_{X}(\overleftarrow{u}_{t}, \overrightarrow{u}_{t}) \tag{6}$$

where $\overrightarrow{u}_{t} = \overrightarrow{c}_{X}(\tilde{h}_{S}, \tilde{h}_{t}, \overrightarrow{u}_{t-1})$ and $\overleftarrow{u}_{t} = \overleftarrow{c}_{X}(\tilde{h}_{S}, \tilde{h}_{t}, \overleftarrow{u}_{t+1})$ respectively denote the sequences of forward and backward hidden states, $\overrightarrow{c}_{X}$, $\overleftarrow{c}_{X}$ are recurrent functions, and $d_{S}$, $d_{X}$ are output layer classification functions.

Figure 1: (a) Block diagram of component functions and objectives. (b) Training scheme; solid lines indicate forward propagation of data, and dashed lines indicate backpropagation of gradients.

Jointly Learning to Encode, Generate, and Iterate

First, purely as a reversible mapping between feature and latent spaces, the embedding and recovery functions should enable accurate reconstructions $\tilde{s}, \tilde{x}_{1:T}$ of the original data $s, x_{1:T}$ from their latent representations $h_{S}, h_{1:T}$. The first objective function is the reconstruction loss

$$L_{R} = \mathbb{E}_{s, x_{1:T} \sim p}\Big[\| s - \tilde{s} \|_{2} + \sum_{t} \| x_{t} - \tilde{x}_{t} \|_{2}\Big] \tag{7}$$
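A Monte Carlo estimate of the reconstruction loss in equation (7) can be sketched as below, replacing the expectation with an average over a batch. The batch layout (tuples of original and reconstructed features) is an assumption chosen for readability.

```python
import math


def l2_norm(u, v):
    """Euclidean norm of the difference u - v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def reconstruction_loss(batch):
    """batch: list of (s, xs, s_tilde, xs_tilde); returns the batch average of equation (7)."""
    total = 0.0
    for s, xs, s_tilde, xs_tilde in batch:
        total += l2_norm(s, s_tilde)
        total += sum(l2_norm(x, x_tilde) for x, x_tilde in zip(xs, xs_tilde))
    return total / len(batch)


# Hypothetical sample: static error of norm 5, temporal errors of norm 0 and 2.
sample = ([0.0, 0.0], [[1.0], [2.0]], [3.0, 4.0], [[1.0], [4.0]])
print(reconstruction_loss([sample]))  # 7.0
```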

In pure open-loop mode, the generator receives its own previous synthetic embeddings $\hat{h}_{S}, \hat{h}_{1:t-1}$ in order to generate the next synthetic vector $\hat{h}_{t}$. Gradients are then computed on the unsupervised loss. This is as one would expect: the discriminator maximizes, and the generator minimizes, the likelihood of providing correct classifications $\hat{y}_{S}, \hat{y}_{1:T}$ for both the training data $h_{S}, h_{1:T}$ and the synthetic output $\hat{h}_{S}, \hat{h}_{1:T}$ from the generator,

$$L_{U} = \mathbb{E}_{s, x_{1:T} \sim p}\Big[\log y_{S} + \sum_{t} \log y_{t}\Big] + \mathbb{E}_{s, x_{1:T} \sim \hat{p}}\Big[\log (1 - \hat{y}_{S}) + \sum_{t} \log (1 - \hat{y}_{t})\Big] \tag{8}$$
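The unsupervised loss in equation (8) can be estimated from discriminator outputs as in the sketch below. It assumes each output is a pair of a static probability and a list of per-step probabilities that the input is real, all strictly between 0 and 1; the example values are made up.

```python
import math


def unsupervised_loss(real_outputs, synthetic_outputs):
    """Batch estimate of L_U from (y_static, [y_1, ..., y_T]) discriminator outputs."""
    real_term = sum(math.log(y_s) + sum(math.log(y) for y in ys)
                    for y_s, ys in real_outputs) / len(real_outputs)
    synthetic_term = sum(math.log(1 - y_s) + sum(math.log(1 - y) for y in ys)
                         for y_s, ys in synthetic_outputs) / len(synthetic_outputs)
    return real_term + synthetic_term


# Hypothetical outputs for one real and one synthetic sequence of length 2.
undecided = unsupervised_loss([(0.5, [0.5, 0.5])], [(0.5, [0.5, 0.5])])
confident = unsupervised_loss([(0.9, [0.9, 0.9])], [(0.1, [0.1, 0.1])])
print(undecided, confident)
```

A discriminator that classifies confidently and correctly drives the loss toward zero from below, which is why the discriminator maximizes this quantity while the generator tries to lower it.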

Because the discriminator's binary feedback alone may be a weak incentive for capturing stepwise conditionals, the authors also train in closed-loop mode, where the generator receives sequences of embeddings of actual data $h_{1:t-1}$ (i.e. computed by the embedding network) to generate the next latent vector. Gradients can now be computed on a loss that captures the discrepancy between distributions $p(H_{t} \mid H_{S}, H_{1:t-1})$ and $\hat{p}(H_{t} \mid H_{S}, H_{1:t-1})$. Applying maximum likelihood yields the familiar supervised loss

$$L_{S} = \mathbb{E}_{s, x_{1:T} \sim p}\Big[\sum_{t} \| h_{t} - g_{X}(h_{S}, h_{t-1}, z_{t}) \|_{2}\Big] \tag{9}$$

where $g_{X}(h_{S}, h_{t-1}, z_{t})$ approximates $\mathbb{E}_{z_{t} \sim \mathcal{N}}[\hat{p}(H_{t} \mid H_{S}, H_{1:t-1}, z_{t})]$ with one sample $z_{t}$, as is standard in stochastic gradient descent.

In sum, at any step in a training sequence, one assesses the difference between the actual next-step latent vector (from the embedding function) and the synthetic next-step latent vector (from the generator, conditioned on the actual historical sequence of latents). While $L_{U}$ pushes the generator to create realistic sequences (evaluated by an imperfect adversary), $L_{S}$ further ensures that it produces similar stepwise transitions (evaluated by ground-truth targets).
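The closed-loop supervised loss can be sketched as follows. Two simplifications are assumed here and are not taken from the text: the sum runs over transitions from the second step onward (so no initial state $h_{0}$ is needed), and the generator step is a hypothetical stand-in function that simply scales the previous real code.

```python
import math


def l2_norm(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def supervised_loss(batch, generator_step, noise_batch):
    """batch: list of (h_static, [h_1, ..., h_T]) real codes from the embedder.

    generator_step(h_static, h_prev, z_t) returns the predicted next code.
    """
    total = 0.0
    for (h_static, codes), noise in zip(batch, noise_batch):
        for t in range(1, len(codes)):
            # Closed loop: condition on the real previous code, not a previous guess.
            predicted = generator_step(h_static, codes[t - 1], noise[t])
            total += l2_norm(codes[t], predicted)
    return total / len(batch)


# Hypothetical real codes that halve at every step, and two stand-in generator steps.
codes = [[1.0, 2.0], [0.5, 1.0], [0.25, 0.5]]
batch = [([0.0, 0.0], codes)]
noise_batch = [[[0.0, 0.0]] * len(codes)]


def matching_step(h_static, h_prev, z_t):
    return [0.5 * h for h in h_prev]


def mismatched_step(h_static, h_prev, z_t):
    return [0.4 * h for h in h_prev]


print(supervised_loss(batch, matching_step, noise_batch))
print(supervised_loss(batch, mismatched_step, noise_batch))
```

A generator step that reproduces the real transition exactly incurs zero loss, while any mismatch in the stepwise dynamics is penalized directly, independent of what the discriminator believes.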

Optimization. Figure 1(b) illustrates the mechanics of the approach at training time. Let $\theta_{e}$, $\theta_{r}$, $\theta_{g}$, $\theta_{d}$ respectively denote the parameters of the embedding, recovery, generator, and discriminator networks. The first two components are trained on both the reconstruction and supervised losses,

$$\min_{\theta_{e}, \theta_{r}} (\lambda L_{S} + L_{R}) \tag{10}$$

where $\lambda \geq 0$ weights the supervised loss against the reconstruction loss. Next, the generator and discriminator networks are trained adversarially as follows,

$$\min_{\theta_{g}} \big(\eta L_{S} + \max_{\theta_{d}} L_{U}\big) \tag{11}$$

where $\eta \geq 0$ is a second weighting hyperparameter.
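To keep track of which parameters minimize which quantity in equations (10) and (11), the small sketch below assembles the three scalar objectives from already-computed loss values. The function name, the dictionary keys, and the example numbers (including the weights) are hypothetical and chosen purely for illustration.

```python
def timegan_objectives(l_r, l_s, l_u, lam, eta):
    """Return the scalar each parameter group would minimize, given loss values."""
    return {
        # Equation (10): embedding and recovery parameters.
        "embedder_recovery": lam * l_s + l_r,
        # Equation (11), outer problem: generator parameters.
        "generator": eta * l_s + l_u,
        # Equation (11), inner problem: maximizing L_U is minimizing -L_U.
        "discriminator": -l_u,
    }


# Hypothetical loss values and weights.
objectives = timegan_objectives(l_r=0.2, l_s=0.05, l_u=-1.3, lam=1.0, eta=10.0)
print(objectives)
```

In an actual training loop each of these scalars would be differentiated only with respect to its own parameter group, with the updates alternated between groups.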

Figure 2: (a) TimeGAN instantiated with RNNs, (b) C-RNN-GAN, and (c) RCGAN. Solid lines denote function application, dashed lines denote recurrence, and orange lines indicate loss computation.

Experiments

Benchmarks and Evaluation. The authors compare TimeGAN with RCGAN and C-RNN-GAN, the two most closely related methods. For purely autoregressive approaches, they compare against RNNs trained with teacher forcing (T-Forcing) as well as professor forcing (P-Forcing). For additional comparison, they consider the performance of WaveNet as well as its GAN counterpart WaveGAN. To assess the quality of generated data, the authors observe three desiderata: (1) diversity: samples should be distributed to cover the real data; (2) fidelity: samples should be indistinguishable from the real data; and (3) usefulness: samples should be just as useful as the real data when used for the same predictive purposes (i.e. train-on-synthetic, test-on-real).

(1) Visualization. We apply t-SNE and PCA analyses on both the original and synthetic datasets (flattening the temporal dimension).

(2) Discriminative Score. For a quantitative measure of similarity, the authors train a post-hoc time-series classification model (by optimizing a 2-layer LSTM) to distinguish between sequences from the original and generated datasets.
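One way to turn the post-hoc classifier into a single number is sketched below. It assumes the score is reported as the absolute gap between held-out accuracy and chance level (0.5), so that 0 means real and synthetic sequences cannot be told apart; the text above does not spell out this convention, and the labels are made up.

```python
def discriminative_score(predicted_is_real, actual_is_real):
    """Absolute gap between classification accuracy and chance (0.5); lower is better."""
    correct = sum(p == a for p, a in zip(predicted_is_real, actual_is_real))
    accuracy = correct / len(actual_is_real)
    return abs(accuracy - 0.5)


# Hypothetical held-out labels: True for original sequences, False for generated ones.
actual = [True, True, False, False]
print(discriminative_score([True, False, True, False], actual))  # 0.0, chance-level classifier
print(discriminative_score([True, True, False, False], actual))  # 0.5, perfect classifier
```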

(3) Predictive Score. In order to be useful, the sampled data should inherit the predictive characteristics of the original. In particular, the authors expect TimeGAN to excel in capturing conditional distributions over time. Therefore, using the synthetic dataset, the authors train a post-hoc sequence-prediction model (by optimizing a 2-layer LSTM) to predict next-step temporal vectors over each input sequence. Then, the authors evaluate the trained model on the original dataset. Performance is measured in terms of the mean absolute error (MAE).
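The train-on-synthetic, test-on-real procedure can be illustrated with a much simpler predictor than the 2-layer LSTM described above. The sketch below swaps in a one-parameter least-squares AR(1) model on scalar sequences, purely as a stand-in; the data and function names are hypothetical.

```python
def fit_ar1(sequences):
    """Least-squares coefficient a for x_t ~ a * x_{t-1} (stand-in for the LSTM predictor)."""
    numerator = sum(seq[t] * seq[t - 1] for seq in sequences for t in range(1, len(seq)))
    denominator = sum(seq[t - 1] ** 2 for seq in sequences for t in range(1, len(seq)))
    return numerator / denominator


def predictive_score(synthetic, real):
    """Train the predictor on synthetic data, then report next-step MAE on real data."""
    a = fit_ar1(synthetic)
    errors = [abs(seq[t] - a * seq[t - 1]) for seq in real for t in range(1, len(seq))]
    return sum(errors) / len(errors)


# Hypothetical data: the real series halves at each step.
real = [[1.0, 0.5, 0.25, 0.125]]
faithful_synthetic = [[2.0, 1.0, 0.5]]    # same dynamics as the real data
unfaithful_synthetic = [[1.0, 1.0, 1.0]]  # constant series, wrong dynamics

print(predictive_score(faithful_synthetic, real))
print(predictive_score(unfaithful_synthetic, real))
```

Synthetic data that preserve the real transition dynamics yield a predictor with low error on the original data, while synthetic data with the wrong dynamics do not, which is the property the predictive score is meant to measure.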

Figure 3: t-SNE visualization on Sines (1st row) and Stocks (2nd row). Each column provides the visualization for each of the 7 benchmarks. Red denotes original data, and blue denotes synthetic.
