Integrative Data Analysis Presentation

# Integrative Data Analysis

---

# Context

- Multiple heterogeneous datasets.
- Genomic data, The Cancer Genome Atlas.
  - Several different kinds of cancers.
  - Hundreds of patients.
  - Thousands of genes.
  - Several different measurements for each gene.
    - Gene expression.
    - Copy number.
- We want to take everything into account.

----

Data | Network
:--------------------:|:--------------------:
<img src="https://i.imgur.com/ZNMmdoM.png" alt="HTML5 Icon" width="250" height="275" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">| <img src="https://i.imgur.com/oPeYxs2.png" alt="HTML5 Icon" width="276" height="264" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">
<img src="https://i.imgur.com/wI5jZJU.png" alt="HTML5 Icon" width="150" height="275" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">| <img src="https://i.imgur.com/1vNN1rR.png" alt="HTML5 Icon" width="257" height="276" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

---

## Basic tools

- SVD
  - $A=U\Sigma V^T=\displaystyle\sum_{i=1}^{r} \sigma_iu_iv_i^T$
  - The columns of $U$ form an orthonormal basis for $\textrm{col}(A)$
  - The columns of $V$ form an orthonormal basis for $\textrm{row}(A)$

----

- SVD gives us an optimal rank-$k$ approximation of $A$ (in Frobenius norm):
  - $A_k=\displaystyle\sum_{i=1}^{k} \sigma_iu_iv_i^T$

----

- Singular vectors typically have nonzero entries in all coordinates
- Introduce a penalty to encourage sparse solutions (sparse SVD)

----

### Graphs induced from data

- Given a dataset $X$ with observations $a_1,\ldots,a_n$ and some notion of similarity $s_{ij}$ between $a_i$ and $a_j$, we can form the similarity matrix $S=(s_{ij})$.
- Examples of similarity:
  - K nearest neighbor similarity
  - Gaussian similarity: $\exp(-\|a_i-a_j\|^2/(2\sigma^2))$
    - Choice of $\sigma$ allows varying degrees of nonlinearity in the data
- In this presentation we will work on similarity matrices.
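
The two basic tools above can be sketched in NumPy. This is a minimal illustration, not the code behind the slides: the function names, the random test data and the convention that observations are the columns of $X$ are all assumptions made for the example.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Truncated SVD: keep the k largest singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def gaussian_similarity(X, sigma=1.0):
    """Similarity matrix S with S[i, j] = exp(-||a_i - a_j||^2 / (2 sigma^2)).

    Assumption: the observations a_1, ..., a_n are the columns of X.
    """
    A = X.T                                   # one observation per row
    sq_norms = np.sum(A**2, axis=1)
    # ||a_i - a_j||^2 = ||a_i||^2 + ||a_j||^2 - 2 <a_i, a_j>
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (A @ A.T)
    sq_dists = np.maximum(sq_dists, 0.0)      # clip tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # hypothetical data: 5 features, 8 observations
X2 = rank_k_approximation(X, 2)               # best rank-2 approximation of X
S = gaussian_similarity(X, sigma=2.0)         # 8 x 8 similarity matrix
```

A small $\sigma$ makes $S$ nearly diagonal (only very close observations count as similar), while a large $\sigma$ pushes all entries towards 1.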
----

### Toy Data Example 1: Original data

<img src="https://i.imgur.com/YlEHUXh.png" alt="HTML5 Icon" width="300" height="250" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">
<img src="https://i.imgur.com/pUjWnCp.png" alt="HTML5 Icon" width="300" height="500" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">

---

# JIVE

- Joint and Individual Variation Explained
- Decomposes matrices into additive parts

----

## Theory/explanation of algorithm

Input:

$$
X= \begin{bmatrix} X_1\\ \vdots\\ X_k \end{bmatrix} : (p_1+\cdots+p_k)\times n.
$$

----

Output:

$$
\begin{align*}
X_1 &= J_1 + I_1 +\varepsilon_1\\
&\;\;\vdots \\
X_k &= J_k + I_k +\varepsilon_k,
\end{align*}
$$

----

In stacked form:

$$
X=J+I+R
$$

- Constraints:
  - Constraints on rank of $J$ and each $I_i$
  - Joint and individual components are orthogonal:

$$
JI^T=0
$$

----

$$
R=X-J-I= \begin{bmatrix} X_1 - J_1 - I_1\\ \vdots\\ X_k - J_k - I_k \end{bmatrix}
$$

- With $J$ fixed, find $I_1,\ldots,I_k$ that minimize $\|R\|_F$
- With $I_1,\ldots,I_k$ fixed, find $J$ that minimizes $\|R\|_F$
- Repeat until convergence

----

## Toy Data Example 1: Iterative JIVE Result

<img src="https://i.imgur.com/MnT5d3N.png" alt="SparseSVDPC" width="600" height="250" style="opacity:1;background: transparent;border:0px solid white; box-shadow: none">

----

## Problems

- Slow
- Not clear what is meant by "joint"

---

# Non-Iterative JIVE

- Tries to circumvent iteration.
- Formulates what is meant by joint/individual in terms of row spaces:

----

$$
\begin{align*}
\textrm{row}(J_i)&=\textrm{row}(J), \quad i=1,\ldots,k\\
\displaystyle\bigcap_{i=1}^{k}\textrm{row}(I_i)&=\{0\}
\end{align*}
$$

- Such a decomposition is always possible

----

## Theory/explanation of algorithm

- Input: matrices $X_1,\ldots, X_k$

----

- Use SVD and threshold singular values to separate noise and signal
- Get signal matrices $A_1= \displaystyle\sum_{i=1}^{r_1} \sigma_i^1 u^1_i (v^1_i)^T ,\ldots, A_k=\displaystyle\sum_{i=1}^{r_k} \sigma^k_i u^k_i (v_i^k)^T$ and error matrices $E_1,\ldots, E_k$

<img src="https://i.imgur.com/3lgQZ66.png" alt="HTML5 Icon" width="300" height="300" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">
<img src="https://i.imgur.com/yT5UeOF.png" alt="HTML5 Icon" width="300" height="300" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">

----

- Form the matrix: $M=\begin{bmatrix}(v^1_{1,\ldots,r_1})^T\\ \vdots\\(v^k_{1,\ldots,r_k})^T\end{bmatrix}.$ SVD of $M=\tilde{U}\tilde\Sigma \tilde{V}^T$

<img src="https://i.imgur.com/lFpjj7g.png" alt="SparseSVDPC" width="500" height="300" style="opacity:1;background: transparent;border:0px solid white; box-shadow: none">

----

- The singular values of $M$ are thresholded to determine which directions are joint and which are individual.
- Project the data onto the appropriate subspaces to get the final decomposition.
  - Projecting onto the space spanned by the joint vectors gives the joint part.
  - Projecting onto the orthogonal complement of the joint space gives the individual part.
  - Threshold singular values to separate signal from noise.

----

## Toy Data Example 1: Non-Iterative JIVE Result

<img src="https://i.imgur.com/wC2fRcc.png" alt="SparseSVDPC" width="600" height="250" style="opacity:1;background: transparent;border:0px solid white; box-shadow: none">

----

### SVD singular vectors

Sensitive to noise.
<img src="https://i.imgur.com/8RiSKRp.png" alt="SVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">
<img src="https://i.imgur.com/Y2CkqBj.png" alt="SVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

### Sparse SVD singular vectors

<img src="https://i.imgur.com/m8dNzu7.png" alt="SparseSVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

### Sparse SVD Non-Iterative JIVE Result

<img src="https://i.imgur.com/vgUshDn.png" alt="SVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

## Problems

- The joint/individual dichotomy is arbitrary
- We might want to find structures which are in between joint and individual, or joint between some datasets but not all.
- Singular vectors are sensitive to noise.

---

# Pairwise matching

Instead, look for similar directions across datasets, either by angles or by adding norms. Threshold these. Get several different joint components and possibly measures of "jointness". Produce networks as a clear visualisation.
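
The angle-based variant can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used to produce the figures: the helper names, the threshold `min_cos=0.9`, the use of right singular vectors, and the synthetic two-dataset example are all hypothetical.

```python
import numpy as np

def top_right_singular_vectors(X, r):
    """Return the r leading right singular vectors of X as columns (n x r)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:r].T

def match_directions(V_a, V_b, min_cos=0.9):
    """List (i, j, |cos|) for pairs of directions whose |cos(angle)| >= min_cos."""
    cos = np.abs(V_a.T @ V_b)   # columns are unit vectors, so entries are |cos(angle)|
    return [(int(i), int(j), float(cos[i, j])) for i, j in np.argwhere(cos >= min_cos)]

def unit(x):
    return x / np.linalg.norm(x)

rng = np.random.default_rng(0)
n = 40                                        # number of shared observations (columns)
shared = unit(rng.normal(size=n))             # direction present in both datasets

def toy_dataset(p):
    """Hypothetical p x n dataset: strong shared direction, weaker individual one, small noise."""
    own = rng.normal(size=n)
    own = unit(own - (own @ shared) * shared)  # individual direction, orthogonal to shared
    signal = (10.0 * np.outer(unit(rng.normal(size=p)), shared)
              + 2.0 * np.outer(unit(rng.normal(size=p)), own))
    return signal + 0.01 * rng.normal(size=(p, n))

X1, X2 = toy_dataset(20), toy_dataset(30)
V1 = top_right_singular_vectors(X1, 2)
V2 = top_right_singular_vectors(X2, 2)
edges = match_directions(V1, V2)              # edges of the network between components
```

Each returned triple can be read as a weighted edge between component `i` of the first dataset and component `j` of the second, with the absolute cosine serving as one possible measure of "jointness".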
----

Toy data example 1

<img src="https://i.imgur.com/Y2CkqBj.png" alt="SVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">
<img src="https://i.imgur.com/m8dNzu7.png" alt="SparseSVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

This produces a matrix

<img src="https://i.imgur.com/YkiSO7M.png" alt="HTML5 Icon" width="700" height="350" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

That produces a network (for sparse SVD)

<img src="https://i.imgur.com/kxUDisS.png" alt="HTML5 Icon" width="500" height="500" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

And for ordinary SVD

<img src="https://i.imgur.com/y0Zh6uq.png" alt="HTML5 Icon" width="500" height="500" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

Combinations

<img src="https://i.imgur.com/n3gH1Lz.png" alt="HTML5 Icon" width="500" height="361" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

---

# Cluster Coverage

- Instead, compare the clusterings of the datasets and interpret the result as a graph
- Similarity measure:
  - $\textrm{Sim}(A_k,B_l) = \frac{1}{|A_k|}\displaystyle\sum_{i\in\{j:L_j^k=A_k \}} I_{\{L_i^l=B_l\}}(i)$
  - i.e. the proportion of observations in cluster $A_k$ of dataset $k$ that have also been assigned to cluster $B_l$ in dataset $l$

----

## Example

- Input: $X_1,\ldots,X_k$.
- For each dataset, cluster the $n$ columns using some clustering method to obtain cluster labels $L_i^\ell$.
- Compute cluster similarities.
- Output: network of clusters.
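
The similarity computation in these steps can be sketched as below. This is a minimal illustration: the clustering step itself is skipped and the label vectors are hand-written, and the function name `cluster_coverage` and the edge threshold of 0.5 are assumptions made for the example.

```python
import numpy as np

def cluster_coverage(labels_k, labels_l):
    """Sim[a, b] = fraction of observations in cluster a of dataset k
    that are also in cluster b of dataset l (same observations in both)."""
    labels_k = np.asarray(labels_k)
    labels_l = np.asarray(labels_l)
    clusters_k = np.unique(labels_k)
    clusters_l = np.unique(labels_l)
    sim = np.zeros((len(clusters_k), len(clusters_l)))
    for a, A in enumerate(clusters_k):
        in_A = labels_k == A                  # observations assigned to cluster A in dataset k
        for b, B in enumerate(clusters_l):
            sim[a, b] = np.mean(labels_l[in_A] == B)
    return clusters_k, clusters_l, sim

# Hypothetical cluster labels for the same six observations in two datasets.
labels_k = [0, 0, 0, 1, 1, 1]
labels_l = [0, 0, 1, 1, 1, 1]
clusters_k, clusters_l, sim = cluster_coverage(labels_k, labels_l)

# Keep an edge between two clusters when the coverage is at least 0.5 (arbitrary threshold).
edges = [(int(A), int(B), float(sim[a, b]))
         for a, A in enumerate(clusters_k)
         for b, B in enumerate(clusters_l)
         if sim[a, b] >= 0.5]
```

Note that the measure is not symmetric: swapping the two label vectors normalises by the cluster sizes of the other dataset, so the resulting network is naturally directed.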
----

## Toy Data Example 1

### Cluster indicators

Cluster Indicators | Network
:--------------------:|:-------------------:
<img src="https://i.imgur.com/8oHDw96.png" alt="HTML5 Icon" width="800" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">|<img src="https://i.imgur.com/zwv4JfL.png" alt="HTML5 Icon" width="700" height="300" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

## Toy Data Example 2: Partially Overlapping clusters

<img src="https://i.imgur.com/WSBnlf1.png" alt="HTML5 Icon" width="225" height="500" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">
<img src="https://i.imgur.com/lOGZTbW.png" alt="HTML5 Icon" width="300" height="500" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">

----

## Toy Data Example 2: Partially Overlapping clusters

<img src="https://i.imgur.com/WSBnlf1.png" alt="HTML5 Icon" width="225" height="500" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">
<img src="https://i.imgur.com/WTzIJRw.png" alt="HTML5 Icon" width="450" height="500" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">

----

## Data Example

Data | JIVE decomposition | Network
:--------------------:|:--------------------:|---------------------
<img src="https://i.imgur.com/GaZmqz9.png" alt="RealData" width="350" height="250" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">|<img src="https://i.imgur.com/twZX0KE.png" alt="RealData" width="400" height="250" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">|<img src="https://i.imgur.com/gNXQKYU.png" alt="RealData" width="400" height="250" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">

----

<section>
<video class="stretch" loop data-autoplay src="https://super-monday.000webhostapp.com/output.mp4"></video>
</section>

---