$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\C}{\mathbb{C}}$ $\newcommand{\diag}{\mathrm{diag}}$

# Spatial Imputation with EKMC

## Problem definition

There are $N_1$ sensors whose readings are unknown/hidden, while $N_2$ sensors are accessible. Assuming sensor $i$'s reading $x$ at time $t$ is linearly related to all $N_2$ observed sensors' readings, where $i\in N_1$, we have the following relationship:

$$
x(i,t) = \sum_{j\in N_2} [w(j,t-L)x(j,t-L)+...+w(j,t+L)x(j,t+L)]
$$

Form $Y$ and $X$ as:

$$
Y_t = \begin{pmatrix}
x(i_{N_1},t) \\
\vdots \\
x(i_2,t) \\
x(i_1,t)
\end{pmatrix}
\text{ and }
X_{temp} = \begin{pmatrix}
x(j_{N_2},t-L) & ... & x(j_{N_2},t) & ... & x(j_{N_2},t+L)\\
\vdots \\
x(j_2,t-L) & ... & x(j_2,t) & ... & x(j_2,t+L) \\
x(j_1,t-L) & ... & x(j_1,t) & ... & x(j_1,t+L)
\end{pmatrix},
$$

where $Y_t \in \R^{N_1\times 1}$ and $X_{temp} \in \R^{N_2\times (2L+1)}$. Since each entry of $Y_t$ is related to all entries of $X_{temp}$, we can vectorize $X_{temp}$ for computational simplicity:

$$
X = (x(j_{N_2},t-L),...,x(j_{N_2},t+L),x(j_{N_2-1},t-L),...,x(j_1,t+L))^\intercal \in \R^{N_2(2L+1)\times1}
$$

so that

$$
x(i,t) = \langle W(i),X\rangle,\quad Y = WX,\quad Y_{tr} = WX_{tr},
$$

where $W(i) \in \R^{1\times N_2(2L+1)}$, $W \in \R^{N_1\times N_2(2L+1)}$, $Y_{tr}\in\R^{N_1\times T_{tr}}$, and $X_{tr}\in\R^{N_2(2L+1)\times T_{tr}}$. This helps form the target matrix

$$
Z = \begin{pmatrix}
Y_{tr} & Y_{te}\\
X_{tr} & X_{te}
\end{pmatrix} \in \R^{[N_1+N_2(2L+1)]\times[T_{tr}+T_{te}]}
$$

$Z$ is low rank because $[Y_{tr} \ Y_{te}] = W [X_{tr} \ X_{te}]$. Next we introduce non-linearity via kernels:

$$
\Phi(X_{tr}) = [\Phi(X_1),...,\Phi(X_{T_{tr}})] \text{ and } \Phi(X_{te}) = [\Phi(X_1),...,\Phi(X_{T_{te}})]
$$

so that

$$
Z = \begin{pmatrix}
Y_{tr} & Y_{te}\\
\Phi(X_{tr}) & \Phi(X_{te})
\end{pmatrix} \in \R^{[N_1+h]\times[T_{tr}+T_{te}]},
$$

where $h$ is the dimension of the kernel feature space.

:::info
**How to define kernels?** In this setting the inputs to $\Phi$ are columns of $X$, each of which ($X_t$, i.e. the vectorized $X_{temp}$) is identified by its temporal index $t$.

- Define via $K(x_{t_1},x_{t_2})$? A radial basis function with periodic patterns ~~describes a closer relationship for more recent readings.~~ adds nonlinearity and captures the periodicity to some degree. How do we encode the spatial relationship?
- Define via $\Phi(X_t)$? A bivariate Gaussian kernel describes a closer relationship for more recent readings and closer locations, but the intuition is twisted: it means the value of $X_t$ is mapped to a larger value for more recent readings and closer locations. Shouldn't that be a weight instead?
:::

**Alternative: Matrix Factorization**

$$
Y = [Y_1,...,Y_T] = WX
$$

where $Y\in \R^{N_1\times T}$, $W\in \R^{N_1\times N_2(2L+1)}$, $X\in\R^{N_2(2L+1)\times T}$. Use traditional kernelized MF:

$$
\begin{aligned}
L &= \|\Phi(Y)-\Phi(W)X\|_F^2 \\
&= \mathrm{tr}(K_{gram}^{yy})+\mathrm{tr}(X^\intercal K_{gram}^{ww}X) - 2\,\mathrm{tr}(K_{gram}^{yw}X)
\end{aligned}
$$

Because $X$ is fixed, this is closer to solving a linear system.

**Alternative: Encoding geographical info**

Instead of the default spherical kernel that gives equal weight to each coordinate, modify the kernel to assign a geo-info-encoded weight to each coordinate:

$$
K(x_1,x_2) = D\!\left(\frac{(x_1-x_2)^\intercal A (x_1-x_2)}{\lambda}\right)
$$

where $A$ is a PSD diagonal matrix. ($A$ is of order $N_2(2L+1)$ because $x_i \in \R^{N_2(2L+1)\times1}$.) For the coordinate $x(j_n,t+\Delta)$ with $1\leq n \leq N_2$, $-L\leq\Delta\leq L$, the corresponding diagonal value is $a = \exp{(-\frac{\|i_{N_1}-j_n\|+...+\|i_1-j_n\|}{c_0})}$, so that this weight measures how much the sensor (located at) $j_n$ contributes to all hidden sensors.
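A minimal NumPy sketch of this geo-weighted kernel, assuming $D(u)=\exp(-u)$ (an RBF-like profile) and that the vectorized $X_t$ stores the $2L+1$ lags of each observed sensor contiguously; the function and argument names are illustrative, not part of the formulation above.

```python
import numpy as np

def geo_weighted_kernel(x1, x2, hidden_coords, obs_coords, L, lam=1.0, c0=1.0):
    """Sketch of K(x1, x2) = D((x1-x2)^T A (x1-x2) / lambda) with D(u) = exp(-u).

    A is diagonal; the entry for coordinate x(j_n, t+Delta) is
    exp(-(sum_m ||i_m - j_n||) / c0), i.e. it decays with the total distance
    from observed sensor j_n to all hidden sensors.
    """
    # pairwise distances between hidden sensors i_m and observed sensors j_n
    dists = np.linalg.norm(
        hidden_coords[:, None, :] - obs_coords[None, :, :], axis=-1
    )                                        # shape (N1, N2)
    a = np.exp(-dists.sum(axis=0) / c0)      # one weight per observed sensor
    diag_A = np.repeat(a, 2 * L + 1)         # same weight for all 2L+1 lags
    d = np.asarray(x1) - np.asarray(x2)
    return np.exp(-(d * diag_A * d).sum() / lam)
```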
## Second formulation

Assuming sensor $i$'s reading $x$ at time $t$ is linearly related to the readings of its $k$ nearest observed neighbors ($nn(i) \subset N_2$), where $i\in N_1$, we have the following relationship:

$$
x(i,t) = w(nn(i)_1,t-L)x(nn(i)_1,t-L)+...+w(nn(i)_k,t+L)x(nn(i)_k,t+L)
$$

Form $Y$ as:

$$
Y_i = \begin{pmatrix}
x(i,t) \\
x(i,t+1) \\
\vdots \\
x(i,t+H)
\end{pmatrix}
= W
\begin{pmatrix}
x(nn(i)_1,t-L) \\
\vdots \\
x(nn(i)_1,t+L) \\
x(nn(i)_2,t-L) \\
\vdots \\
x(nn(i)_k,t+L)
\end{pmatrix}
$$

:::info
The formulation $[Y_{i_1},...,Y_{i_{N_1}}] = W[X_{nn(i_1)},...,X_{nn(i_{N_1})}]$ has two issues:

- $W$ is not shared across the columns of $X$.
- How to define $K(X_{nn(i_m)},X_{nn(i_n)})$?
:::

**Alternative: Encoding geographical info, but with one hidden sensor only**

Assume we impute only one hidden sensor, and form $Y$ as:

$$
Y_{t_1} = \begin{pmatrix}
x(i,t_1) \\
x(i,t_1+1) \\
\vdots \\
x(i,t_1+H)
\end{pmatrix}
= W
\begin{pmatrix}
x(j_1,t_1-L) \\
\vdots \\
x(j_1,t_1+L) \\
x(j_2,t_1-L) \\
\vdots \\
x(j_{N_2},t_1+L)
\end{pmatrix}
= WX_{t_1}
$$

Define:

$$
Y_{tr} = [Y_{t_1},Y_{t_2},...,Y_{t_T}] \in \R^{(H+1)\times T} \\
X_{tr} = [X_{t_1},X_{t_2},...,X_{t_T}] \in \R^{N_2(2L+1)\times T} \\
Y_{te} \in \R^{(H+1)\times 1}, \ X_{te} \in \R^{N_2(2L+1)\times 1}
$$

For $K(X_{t_1},X_{t_2}) = D(\frac{(X_{t_1}-X_{t_2})^\intercal A (X_{t_1}-X_{t_2})}{\lambda})$, the diagonal value of $A$ corresponding to $x(j_n,t+\Delta)$ should be $\exp(-\frac{\|j_n-i\|}{c_0})$, which assigns weight based on the distance between the predictor and the response variable.

### Multiple Kernel AdaBoost Regression

Idea: use a weighted linear combination of multiple kernelized MC models as the weak learner.

Input: $X_{tr},Y_{tr},X_{te}$, $N$ kernel functions $\Phi_1,...,\Phi_N$, $T$ max iterations.

Initialization: $w_i=\frac{1}{N},\ i=1,...,N$

Updates: for $t = 1,...,T$:

$$
\epsilon_t = \sum_{i=1}^{N}w_i^{(t)}L_i\left(\frac{|Y_{tr}-h_t(\Phi_i(X_{tr}))|}{\max |Y_{tr}-h_t(\Phi_i(X_{tr}))|}\right) \\
\beta_t = \frac{\epsilon_t}{1-\epsilon_t} \\
w_i^{(t+1)} = w_i^{(t)}\beta_t^{1-L_i\left(\frac{|Y_{tr}-h_t(\Phi_i(X_{tr}))|}{\max |Y_{tr}-h_t(\Phi_i(X_{tr}))|}\right)} \\
f_t(Y_{te}) = \sum_{i=1}^Nw_i^{(t)}h_t(\Phi_i(X_{te}))
$$

Output predictions:

$$
f(Y_{te}) = \frac{\sum_t(\log\frac{1}{\beta_t})f_t(Y_{te})}{\sum_t(\log\frac{1}{\beta_t})}
$$

- [Improving Regressors using Boosting Techniques](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.31.314&rep=rep1&type=pdf)
- [Experiments with AdaBoost.RT, an Improved Boosting Scheme for Regression](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.4560&rep=rep1&type=pdf)
- [Boosting Methodology for Regression Problems](http://proceedings.mlr.press/r2/ridgeway99a/ridgeway99a.pdf)

:::warning
The weights that get updated do not affect $h_t$, because they are weights on $\hat{Y}_{tr}$ in the linear combination. Since $h_t$ is unchanged across iterations, $L_i$ is unchanged as well.
:::

$$
X_{tr} = [X_{t_1},X_{t_2},...,X_{t_T}] \in \R^{N_2(2L+1)\times T} \\
\Theta = [\theta_1,...,\theta_T] \in \R^{N_2(2L+1)\times T}
$$

Let $h_t$ instead operate on $\Theta\odot X_{tr}$.
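As a concrete reading of the update rules above, here is a minimal Python sketch. The weak learner `fit_kmc`, the linear loss on the max-normalized residual, the averaging of that residual into a scalar $L_i$, the clipping of $\epsilon_t$, and the per-round renormalization of $w$ are all assumptions of the sketch rather than parts of the formulation.

```python
import numpy as np

def multi_kernel_adaboost(Y_tr, X_tr, X_te, kernels, fit_kmc, T=10):
    """Weighted combination of kernelized-MC weak learners, one per kernel.

    `kernels` is a list of feature maps Phi_i; `fit_kmc` is assumed to take
    (Phi_i(X_tr), Y_tr, Phi_i(X_te)) and return (Y_hat_tr, Y_hat_te).
    """
    N = len(kernels)
    w = np.full(N, 1.0 / N)                       # weights over the N kernels
    f_rounds, betas = [], []
    for t in range(T):
        losses, preds_te = np.zeros(N), []
        for i, phi in enumerate(kernels):
            Y_hat_tr, Y_hat_te = fit_kmc(phi(X_tr), Y_tr, phi(X_te))
            res = np.abs(Y_tr - Y_hat_tr)
            losses[i] = (res / res.max()).mean()  # L_i on normalized residual
            preds_te.append(Y_hat_te)
        eps = float(np.clip(np.dot(w, losses), 1e-12, 1 - 1e-12))
        beta = eps / (1.0 - eps)
        # combined prediction of this round, using the current weights w^(t)
        f_rounds.append(sum(wi * p for wi, p in zip(w, preds_te)))
        betas.append(beta)
        w = w * beta ** (1.0 - losses)            # weight update
        w = w / w.sum()
    alphas = np.log(1.0 / np.array(betas))        # log(1/beta_t) weights
    return sum(a * f for a, f in zip(alphas, f_rounds)) / alphas.sum()
```

Note that, as the warning above points out, the kernel weights only enter the linear combination of predictions; each $h_t$ is refit identically every round unless it is made to depend on $\Theta\odot X_{tr}$.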
## Posterior Analysis on Factorization

1. How does the pattern of hidden sensors affect algorithm performance?

   ![](https://i.imgur.com/1qfGqxw.png)

   - Permutation feature importance

     ![](https://i.imgur.com/yqV7AKA.png)

     ![](https://i.imgur.com/6qNv1SR.png)

   - Mann-Whitney U test: a non-parametric statistical test on unpaired observations.

2. Interpretability from $U$ & $V$? Can we identify, just by having access to the decomposed factors, the main effects and the interaction effect? Can we tell how each feature affects the prediction individually, and which part is due to the interaction of both features?

   - [Finding the users corresponding to the columns that are direct preimages of the K features.](http://ceur-ws.org/Vol-1210/SP2014_12.pdf)
   - [Interpreting features as users.](https://ieeexplore.ieee.org/document/6927629)

### Latent feature analysis

The factorization process: we start from

$$
Z = \begin{pmatrix}
Y_{tr} & Y_{te}\\
X_{tr} & X_{te}
\end{pmatrix} \in \R^{[N_1+N_2(2L+1)]\times[T_{tr}+T_{te}]},
$$

![](https://i.imgur.com/q5s1EIx.png)

the lower section (the input $X$) of which is then mapped into a higher dimension with kernels:

$$
Z_\phi = \begin{pmatrix}
Y_{tr} & Y_{te}\\
\phi(X_{tr}) & \phi(X_{te})
\end{pmatrix} \in \R^{[N_1+h]\times[T_{tr}+T_{te}]},
$$

then $Z_\phi$ is factorized into $U$ and $V$:

$$
U = \begin{bmatrix}
U_{tr}\\
U_{te}
\end{bmatrix} \in \R^{[N_1+h]\times r}
\text{ and }
V = \begin{bmatrix}
V_{tr}\\
V_{te}
\end{bmatrix} \in \R^{[T_{tr}+T_{te}]\times r}
$$

where $U_{tr}\in \R^{N_1\times r}$, $V_{tr}\in \R^{T_{tr}\times r}$, $U_{te}\in \R^{h\times r}$, and $V_{te}\in \R^{T_{te}\times r}$. We know from the factorization that:

$$
Y_{te} = U_{tr}(V_{te})^\intercal,\quad Y_{tr}=U_{tr}(V_{tr})^\intercal,\quad \phi(X_{te}) = U_{te}(V_{te})^\intercal,\quad \phi(X_{tr}) = U_{te}(V_{tr})^\intercal,
$$

in which $\phi(X_{te})$ (more specifically, $U_{te}$) cannot be recovered explicitly because of the kernel mapping. Thus we can only use $V$. By Pearson correlation we have:

![](https://i.imgur.com/OHe8CFo.png)

### Fréchet distance

The [Fréchet distance](https://en.wikipedia.org/wiki/Fr%C3%A9chet_distance) is a measure of similarity between curves. Consider two parameterized curves; the Fréchet distance is the minimum, over all monotone increasing parameterizations, of the maximum distance between corresponding points. (The Fréchet distance between $P$ and $Q$ equals the shortest possible length of a coupling between $P$ and $Q$.) Its computational complexity is $O(mn\log(mn))$; a sketch of the discrete variant is given at the end of this subsection.

![](https://i.imgur.com/HQSL1KW.png)

For comparison, the corresponding Pearson figure with the same setting is:

![](https://i.imgur.com/XKv35xN.png)

- [Shape Matching: Similarity Measures and Algorithms](http://www.cs.uu.nl/groups/MG/multimedia/publications/art/smi2001.pdf)
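The sketch referenced above: a minimal implementation of the *discrete* Fréchet distance via the Eiter–Mannila dynamic program, which runs in $O(mn)$ on curves sampled at $m$ and $n$ points. Applying it to latent trajectories such as rows of $V$, and using the Euclidean point distance, are choices made here for illustration.

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between curves P (m points) and Q (n points),
    computed with the O(m*n) dynamic program of Eiter & Mannila (1994)."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    if P.ndim == 1:
        P = P[:, None]
    if Q.ndim == 1:
        Q = Q[:, None]
    m, n = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise distances
    ca = np.full((m, n), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(1, m):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, n):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, m):
        for j in range(1, n):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d[i, j])
    return float(ca[-1, -1])
```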
### Transfer latent features to new MF

- [Group-Sparse Matrix Factorization for Transfer Learning of Word Embeddings](https://arxiv.org/pdf/2104.08928.pdf)
- [A Comprehensive Survey on Transfer Learning](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9134370)
- [Return of Frustratingly Easy Domain Adaptation](https://arxiv.org/pdf/1511.05547.pdf)

Consider the extracted factor $V_\text{est}\in\R^{r\times[T_{tr}+T_{te}]}$ as a temporal-information embedding. Intuitively, other hidden sensors may share this same summarized temporal embedding, so we may form a new MF optimization under the assumption that $V$ for these new hidden sensors (i.e. in the factorization of a new matrix $Z$) is similar to our extracted $V_\text{est}$:

$$
L = \|M\circ(Z-UV)\|_F^2+\lambda\|V^\intercal V-V_\text{est}^\intercal V_\text{est}\|_F^2
$$

The gradients are:

$$
\frac{\partial L}{\partial U} = -2\,[M\circ(Z-UV)]V^\intercal \\
\frac{\partial L}{\partial V} = -2U^\intercal [M\circ(Z-UV)]+4\lambda V(V^\intercal V-V_\text{est}^\intercal V_\text{est})
$$

#### Alternative formulation (Wenqing's formulation)

$$
L = \|M\circ(Z-UV)\|_F^2+\mu\|U\|_F^2 +\lambda\|V-WV_\text{est}\|_F^2
$$

The gradients are:

$$
\frac{\partial L}{\partial U} = -2\,[M\circ(Z-UV)]V^\intercal+2\mu U \\
\frac{\partial L}{\partial V} = -2U^\intercal [M\circ(Z-UV)]+2\lambda (V-WV_\text{est}) \\
\frac{\partial L}{\partial W} = -2\lambda(V-WV_\text{est})V_\text{est}^\intercal
$$

### AdaBoost for KMC

$$
Z(\Theta) = \begin{pmatrix}
\mathbf{y}_{1}^{tr} & ... & \mathbf{y}_{|tr|}^{tr} & Y_{te}\\
\theta_1 \phi(\mathbf{x}_1^{tr}) & ... &\theta_{|tr|} \phi(\mathbf{x}_{|tr|}^{tr}) & \phi(X_{te})
\end{pmatrix} \in \R^{[N_1+h]\times[T_{tr}+T_{te}]}
$$

Input: $X_{tr},Y_{tr},X_{te}$, $T$ max iterations.

Initialization: $\theta_i=\frac{1}{|tr|},\ i=1,...,|tr|$

Updates: for $t = 1,...,T$:

$$
\epsilon_t = \frac{\sum_{i=1}^{|tr|}\theta_i^{(t)}L_i\left(\frac{|\mathbf{y}_i^{(t)}(x_i)-\mathbf{y}_i|}{\max_i |\mathbf{y}_i^{(t)}(x_i)-\mathbf{y}_i|}\right) }{\sum_{i=1}^{|tr|}\theta_i^{(t)}}\\
\beta_t = \frac{\epsilon_t}{1-\epsilon_t} \\
\theta_i^{(t+1)} = \theta_i^{(t)}\beta_t^{1-L_i\left(\frac{|\mathbf{y}_i^{(t)}(x_i)-\mathbf{y}_i|}{\max_i |\mathbf{y}_i^{(t)}(x_i)-\mathbf{y}_i|}\right)} \\
\alpha^{(t)} = \frac{\beta^{(t)}}{\sum_{k=1}^T \beta^{(k)}}
$$

Output predictions:

$$
f(Y_{te}) = \sum_t \alpha^{(t)}Y_{te}(\Theta^{(t)}) \\
f(Y_{tr}) = \sum_t \alpha^{(t)}Y_{tr}(\Theta^{(t)})
$$

### AdaBoost.RT for KMC

[Experiments with AdaBoost.RT, an Improved Boosting Scheme for Regression](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.4560&rep=rep1&type=pdf)

Input: $X_{tr},Y_{tr},X_{te}$, $T$ max iterations, threshold $\Phi$ $(0<\Phi<1)$.

Initialization: $\theta_i=\frac{1}{|tr|},\ i=1,...,|tr|$

Updates: for $t = 1,...,T$:

$$
ARE_t(i)=\left|\frac{\mathbf{y}_i^{(t)}(x_i)-\mathbf{y}_i}{\mathbf{y}_i}\right| \\
\epsilon_t =\sum_{i:ARE_t(i)>\Phi}\theta_i^{(t)} \\
\beta_t=\epsilon_t^n,\text{ where } n \text{ is the power coefficient} \\
\theta_i^{(t+1)} = \begin{cases}
\frac{\theta_i^{(t)} \beta_t}{Z_t} & \quad \text{if } ARE_t(i)\leq \Phi \\
\frac{\theta_i^{(t)}}{Z_t} & \quad \text{otherwise}
\end{cases}
$$

where $Z_t$ is a normalization factor so that $\theta^{(t+1)}$ sums to one. (A minimal code sketch of this reweighting loop is given at the end of this section.) Output predictions:

$$
f(Y_{te}) =\frac{\sum_t(\log\frac{1}{\beta_t})f_t(x)}{\sum_t(\log\frac{1}{\beta_t})}
$$

#### Experiment results

KMC results with different kernel settings:

| Kernel | Tr | Te |
| -------- | -------- | -------- |
| Poly 2 | 10.91% | 14.95% |
| Laplacian | 3.11% | 30.77% |
| Exponential chi-squared | 3.11% | 28.91% |
| Linear kernel | \ | 3.899 |
| Sigmoid kernel | 0.35% | 28.91% |
| Additive chi-squared | 5.93% | 20.44% |
| Cosine | 1.31% | 18.41% |

| Method | Tr | Te | Parameters |
| -------- | -------- | -------- | -------- |
| GASM | ? | 9.48% | \ |
| KMC | 10.91% | 14.94% | \ |
| AdaBoost.RT | 5.44% | 8.0123 | #iter=100, r=5, threshold=0.05, power=2 |
| AdaBoost.RT, init theta = 1 | 10.91% | 14.93% | #iter=100, r=5, threshold=0.05, power=2 |
| EKMC | 5.42% | 4.3966 | #iter=100, r=5 |

**AdaBoost.RT, init theta = 1**

![](https://i.imgur.com/oaUsMnb.png)

![](https://i.imgur.com/bu1btWG.png)

**AdaBoost.RT**

![](https://i.imgur.com/zNcxb6S.png)

![](https://i.imgur.com/tATgb5u.png)

**EKMC**

![](https://i.imgur.com/LJROkVZ.png)

![](https://i.imgur.com/4DEPxkg.png)
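The sketch referenced in the AdaBoost.RT subsection above: a minimal Python loop for the reweighting scheme. The weak-learner interface `fit_weak` (e.g. a KMC solve on $Z(\Theta)$), the row-averaging of the absolute relative error into one value per training column, and the clipping of $\epsilon_t$ are assumptions of this sketch, not taken from the implementation behind the results above.

```python
import numpy as np

def adaboost_rt(X_tr, Y_tr, X_te, fit_weak, T=100, phi=0.05, power=2):
    """AdaBoost.RT-style loop around a weak learner.

    `fit_weak(X_tr, Y_tr, X_te, theta)` is assumed to return
    (Y_hat_tr, Y_hat_te), using the per-column weights `theta`
    (e.g. by forming Z(Theta) and running KMC on it).
    """
    n = Y_tr.shape[1]                          # number of training columns
    theta = np.full(n, 1.0 / n)
    preds, betas = [], []
    for t in range(T):
        Y_hat_tr, Y_hat_te = fit_weak(X_tr, Y_tr, X_te, theta)
        # absolute relative error per training column, averaged over rows
        are = np.abs((Y_hat_tr - Y_tr) / Y_tr).mean(axis=0)
        eps = theta[are > phi].sum()           # weighted error rate
        beta = max(eps, 1e-12) ** power        # clipped to keep log(1/beta) finite
        preds.append(Y_hat_te)
        betas.append(beta)
        # shrink the weights of well-predicted columns, then renormalize (Z_t)
        theta = np.where(are <= phi, theta * beta, theta)
        theta = theta / theta.sum()
    alphas = np.log(1.0 / np.array(betas))
    return sum(a * p for a, p in zip(alphas, preds)) / alphas.sum()
```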