# [Causal Discovery with Attention-Based Convolutional Neural Networks](https://www.mdpi.com/2504-4990/1/1/19)
*MDPI, 2019, Machine Learning and Knowledge Extraction (open access journal)*
###### tags: `references`
1. Problem statement: Given a dataset $X$ containing $N$ observed continuous time series of the same length $T$ (i.e., $X = \{X_1, X_2, \ldots, X_N\} \in \mathbb{R}^{N\times T}$), the goal is to discover the causal relationships between all $N$ time series in $X$, together with the time delay between each cause and its effect.

2. Workflow:

3. Model architecture

4. Model implementation:
* Based on the [Temporal Convolutional Network (TCN)](https://medium.com/@cyeninesky3/%E6%99%82%E9%96%93%E5%8D%B7%E7%A9%8D%E7%B6%B2%E7%B5%A1-tcn-%E9%97%9C%E6%96%BC%E5%BE%9E%E9%A2%A8%E6%8E%A7%E9%A0%85%E7%9B%AE%E7%95%B6%E4%B8%AD%E7%9A%84%E5%AD%B8%E7%BF%92-11693d762f5)
* Adaptation for multivariate data: Since a TCN is designed for univariate time series modeling, the study modifies it into a one-dimensional depthwise separable architecture in which the input time series stay separated. The TCDF network consists of $N$ channels, one for each input series.
* Attention mechanism:
* Each network $N_j$ has its own attention vector $a_{j} = \{a_{1,j}, a_{2,j}, \ldots, a_{N,j}\}$
* Each attention score $a_{i,j}$ is initialized to 1
* The learned scores are real-valued: $a_{i,j} \in \mathbb{R}$
* Transform $a_{i,j}$ into causalities: $$h_{i,j}=\begin{cases}\sigma(a_{i,j}) & \text{if } a_{i,j}\geq t_j\\ 0 & \text{otherwise}\end{cases}$$
where the threshold $t_{j}$ is set at the largest gap between the ranked attention scores in $a_{j}$.
* The set of potential causes $P_j$ contains every $X_i$ with $h_{i,j} > 0$
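
The thresholding step above can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, $\sigma$ is assumed here to be a softmax over the scores of one network, and the threshold uses only the largest-gap rule as stated in these notes (any additional constraints the paper places on $t_j$ are left out).

```python
import math

def largest_gap_threshold(scores):
    """Return t_j: the score just above the largest gap between
    adjacent attention scores when ranked from high to low."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return ranked[0]
    gaps = [ranked[k] - ranked[k + 1] for k in range(len(ranked) - 1)]
    k_best = gaps.index(max(gaps))  # first occurrence of the largest gap
    return ranked[k_best]

def hard_threshold(scores, threshold):
    """h_ij = sigma(a_ij) if a_ij >= t_j, else 0.
    sigma is taken to be a softmax here (an assumption of this sketch)."""
    m = max(scores)
    exps = [math.exp(a - m) for a in scores]  # shift by max for stability
    total = sum(exps)
    return [e / total if a >= threshold else 0.0
            for a, e in zip(scores, exps)]

def potential_causes(scores):
    """Indices i whose h_ij is strictly positive, i.e. the set P_j."""
    t_j = largest_gap_threshold(scores)
    h_j = hard_threshold(scores, t_j)
    return [i for i, h in enumerate(h_j) if h > 0]

# Made-up attention scores for one network N_j with four input series.
a_j = [1.9, 0.2, 1.6, 0.1]
print(largest_gap_threshold(a_j))  # 1.6
print(potential_causes(a_j))       # [0, 2]
```

In this toy example the ranked scores are 1.9, 1.6, 0.2, 0.1, so the largest gap lies between 1.6 and 0.2 and only the first and third series survive as potential causes.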

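The idea that the input series "stay separated" can be sketched in plain Python. This is a toy illustration under stated assumptions, not the TCDF implementation: the function names, kernel values, and combination weights are all hypothetical. Each channel is filtered by its own causal kernel (depthwise step), and only afterwards are the channels mixed into a single output (pointwise step).

```python
def depthwise_causal_conv(series_list, kernels, dilation=1):
    """Filter each series with its own kernel.
    Output at time t uses only inputs at times <= t (causal);
    kernel entry w[m] weights the input at lag m * dilation."""
    outputs = []
    for x, w in zip(series_list, kernels):
        out = []
        for t in range(len(x)):
            acc = 0.0
            for m in range(len(w)):
                idx = t - m * dilation
                if idx >= 0:  # values before the start count as zero
                    acc += w[m] * x[idx]
            out.append(acc)
        outputs.append(out)
    return outputs

def pointwise_combine(channel_outputs, weights):
    """Mix the separately filtered channels into one series."""
    length = len(channel_outputs[0])
    return [sum(w * ch[t] for w, ch in zip(weights, channel_outputs))
            for t in range(length)]

# Two toy input series and hypothetical kernels:
# channel 0 passes the current value, channel 1 passes the value at lag 1.
X = [[1, 2, 3, 4], [10, 20, 30, 40]]
kernels = [[1.0, 0.0], [0.0, 1.0]]
per_channel = depthwise_causal_conv(X, kernels)
print(per_channel)                                 # [[1.0, 2.0, 3.0, 4.0], [0.0, 10.0, 20.0, 30.0]]
print(pointwise_combine(per_channel, [1.0, 1.0]))  # [1.0, 12.0, 23.0, 34.0]
```

Because no kernel ever touches more than one series, whatever a channel learns can be attributed to a single input series.
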
5. Permutation Importance Validation: for each $X_i \in P_j$,
* Randomly permute the values of $X_i$ to form a new dataset $I_i$
* Denote by $L^1_G$ the loss in the first epoch and by $L^{final}_G$ the loss in the final epoch, where $G$ is the original dataset; $L^{final}_{I_i}$ is the final loss on $I_i$
* $\Delta L_G=L^1_G-L^{final}_G$, $\Delta L_{I_i}=L^1_G-L^{final}_{I_i}$
* If $\Delta L_{I_i}<0.8\,\Delta L_G$, $X_i$ is a true cause of $X_j$: permuting a true cause removes information the network relies on, so the loss improves noticeably less than on the original data
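
The decision rule above can be sketched as follows. The helper names and the loss values in the example are made up for illustration; how the losses are obtained (training and evaluating the network) is outside this sketch.

```python
import random

def permute_series(dataset, i, seed=0):
    """Return a copy of the dataset (a list of N series) in which
    series i is randomly shuffled; the other series are untouched."""
    rng = random.Random(seed)
    intervened = [list(series) for series in dataset]
    rng.shuffle(intervened[i])
    return intervened

def is_true_cause(loss_first, loss_final_original, loss_final_permuted, s=0.8):
    """Apply the rule  delta_L_I < s * delta_L_G  with s = 0.8."""
    delta_g = loss_first - loss_final_original   # improvement on G
    delta_i = loss_first - loss_final_permuted   # improvement on I_i
    return delta_i < s * delta_g

# Hypothetical losses: first epoch 1.0, final epoch on original data 0.2.
print(is_true_cause(1.0, 0.2, 0.7))   # True: permuting X_i hurts a lot
print(is_true_cause(1.0, 0.2, 0.25))  # False: permuting X_i barely matters
```

With these numbers $\Delta L_G = 0.8$, so the cutoff is $0.64$: a permuted-data improvement of $0.3$ marks $X_i$ as a true cause, while an improvement of $0.75$ does not.
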
6. Delay discovery: for a true cause $X_i$ of $X_j$, from the $i$-th channel of model $N_j$,
