# Integrative Data Analysis

<!-- .slide: data-autoslide="4000" data-transition-speed="slow" -->
===

---

<!-- .slide: data-background-video="https://super-monday.000webhostapp.com/everything2clust.mp4" data-background-video-loop="loop" -->

# Integrative Data Analysis

---

<!-- .slide: data-background="#fefefe" -->

# Context

- Multiple heterogeneous datasets.
- Genomic data, The Cancer Genome Atlas.
- Several different kinds of cancers.
- Hundreds of patients.
- Thousands of genes.
- Several different measurements for each gene.
  - Gene expression.
  - Copy number.
- We want to take everything into account.

----

<!-- .slide: data-background="#fefefe" -->

Data | Network
:--------------------:|:--------------------:
<img src="https://i.imgur.com/ZNMmdoM.png" alt="HTML5 Icon" width="250" height="275" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">| <img src="https://i.imgur.com/oPeYxs2.png" alt="HTML5 Icon" width="276" height="264" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">
<img src="https://i.imgur.com/wI5jZJU.png" alt="HTML5 Icon" width="150" height="275" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">| <img src="https://i.imgur.com/1vNN1rR.png" alt="HTML5 Icon" width="257" height="276" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

---

<!-- .slide: data-background="#f1fefe" -->

## Basic tools

- $\textrm{SVD}$
  - $A=U\Sigma V^T=\displaystyle\sum_{i=1}^{r} \sigma_iu_iv_i^T$
  - The columns of $U$ form an orthonormal basis for $\textrm{col}(A)$
  - The columns of $V$ form an orthonormal basis for $\textrm{row}(A)$

----

<!-- .slide: data-background="#f1fefe" -->

- SVD gives us an optimal rank-$k$ approximation of $A$ (in Frobenius norm):
  - $A_k=\displaystyle\sum_{i=1}^{k} \sigma_iu_iv_i^T$

----

<!-- .slide: data-background="#f1fefe" -->

- Singular vectors typically have nonzero entries in all coordinates
- Introduce a penalty to encourage sparse solutions

----

<!-- .slide:
data-background="#f1fefe" -->

### Graphs induced from data

- Given a dataset $X$ with observations $a_1,\ldots,a_n$ and some notion of similarity $s_{ij}$ between $a_i$ and $a_j$, we can form the similarity matrix $S=(s_{ij})$.
- Examples of similarity:
  - K nearest neighbor similarity
  - Gaussian similarity: $\textrm{exp}(-\|a_i-a_j\|^2/2\sigma^2)$
    - Choice of $\sigma$ allows varying degrees of nonlinearity in the data
- In this presentation we will work on similarity matrices.

----

<!-- .slide: data-background="#f1fefe" -->

### Toy Data Example 1: Original data

<img src="https://i.imgur.com/YlEHUXh.png" alt="HTML5 Icon" width="300" height="250" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">
<img src="https://i.imgur.com/pUjWnCp.png" alt="HTML5 Icon" width="300" height="500" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">

---

<!-- .slide: data-background="#D8D3C8" -->

# JIVE

- Joint and Individual Variation Explained
- Decomposes matrices into additive parts

----

<!-- .slide: data-background="#D8D3C8" -->

## Theory/explanation of algorithm

Input:
$$
X=
\begin{bmatrix}
X_1\\
\vdots\\
X_k
\end{bmatrix}
: (p_1+\cdots+p_k)\times n.
$$

----

<!-- .slide: data-background="#D8D3C8" -->

Output:
$$
\begin{align*}
X_1 &= J_1 + I_1 +\varepsilon_1\\
& \;\;\vdots \\
X_k &= J_k + I_k +\varepsilon_k
\end{align*}
$$

----

<!-- .slide: data-background="#D8D3C8" -->

In stacked form, with $R$ collecting $\varepsilon_1,\ldots,\varepsilon_k$:
$$
X=J+I+R
$$
- Constraints:
  - Rank constraints on $J$ and on each $I_i$
  - Joint and individual components are orthogonal:
$$
JI^T=0
$$

----

<!-- .slide: data-background="#D8D3C8" -->

$$
R=X-J-I=
\begin{bmatrix}
X_1 - J_1 - I_1\\
\vdots\\
X_k - J_k - I_k
\end{bmatrix}
$$
- With $J$ fixed, find $I_1,\ldots,I_k$ that minimize $\|R\|_F$
- With $I_1,\ldots,I_k$ fixed, find $J$ that minimizes $\|R\|_F$
- Repeat until convergence

----

<!-- .slide: data-background="#D8D3C8" -->

## Toy Data Example 1: Iterative JIVE Result

<img src="https://i.imgur.com/MnT5d3N.png" alt="SparseSVDPC" width="600" height="250" style="opacity:1;background: transparent;border:0px solid white; box-shadow: none">

----

<!-- .slide: data-background="#D8D3C8" -->

## Problems

- Slow
- Not clear what is meant by "joint"

---

<!-- .slide: data-background="#D3D4D6" -->

# Non-Iterative JIVE

- Tries to circumvent iteration.
- Formulates what is meant by joint/individual in terms of row spaces:

----

<!-- .slide: data-background="#D3D4D6" -->

$$
\begin{gathered}
\textrm{row}(J_i)=\textrm{row}(J),\quad i=1,\ldots,k\\
\displaystyle\bigcap_{i=1}^{k}\textrm{row}(I_i)=\{0\}
\end{gathered}
$$
- Such a decomposition is always possible

----

<!-- .slide: data-background="#D3D4D6" -->

## Theory/explanation of algorithm

- Input: matrices $X_1,\ldots, X_k$

----

<!-- .slide: data-background="#D3D4D6" -->

- Use SVD and threshold singular values to separate noise and signal
- Get signal matrices $A_1= \displaystyle\sum_{i=1}^{r_1} \sigma_i^1 u^1_i (v^1_i)^T ,\ldots, A_k=\displaystyle\sum_{i=1}^{r_k} \sigma^k_i u^k_i (v_i^k)^T$ and error matrices $E_1,\ldots, E_k$

<img src="https://i.imgur.com/3lgQZ66.png" alt="HTML5 Icon" width="300" height="300" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">
<img src="https://i.imgur.com/yT5UeOF.png" alt="HTML5 Icon" width="300" height="300" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">

----

<!-- .slide: data-background="#D3D4D6" -->

- Form the matrix: $M=\begin{bmatrix}(v^1_{1,\ldots,r_1})^T\\ \vdots\\(v^k_{1,\ldots,r_k})^T\end{bmatrix}.$
- Compute its SVD: $M=\tilde{U}\tilde\Sigma \tilde{V}^T$

<img src="https://i.imgur.com/lFpjj7g.png" alt="SparseSVDPC" width="500" height="300" style="opacity:1;background: transparent;border:0px solid white; box-shadow: none">

----

<!-- .slide: data-background="#D3D4D6" -->

- The singular values of $M$ are thresholded to determine which directions are joint and which are individual.
- Project the data onto the appropriate subspaces to get the final decomposition.
  - Projecting onto the space spanned by the joint vectors gives the joint part.
  - Projecting onto the orthogonal complement of the joint space gives the individual part.
- Threshold the singular values to separate signal from noise.
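----

<!-- .slide: data-background="#D3D4D6" -->

The steps on the preceding slides can be sketched in NumPy. This is a minimal illustration under assumptions that are not in the slides: the function name `noniterative_jive` is hypothetical, the signal ranks `ranks` and the cut-off `joint_threshold` are chosen by hand rather than estimated, and the individual rank is assumed to be the signal rank minus the joint rank. With $k$ blocks, a direction shared by all of them gives a singular value of $M$ near $\sqrt{k}$ and a purely individual one gives a value near $1$, so the threshold should sit in between.

```python
import numpy as np

def noniterative_jive(blocks, ranks, joint_threshold):
    """Sketch: split each X_i (p_i x n) into joint, individual and residual parts."""
    # Step 1: the top r_i right singular vectors span the signal row space of X_i.
    row_bases = []
    for X, r in zip(blocks, ranks):
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        row_bases.append(Vt[:r])                  # r_i x n, orthonormal rows

    # Step 2: stack the row bases into M and take its SVD.
    M = np.vstack(row_bases)
    _, s, Vt_M = np.linalg.svd(M, full_matrices=False)

    # Step 3: directions whose singular value exceeds the threshold are joint.
    V_joint = Vt_M[s > joint_threshold].T         # n x (joint rank)
    P_joint = V_joint @ V_joint.T                 # projector onto the joint row space

    # Step 4: project each block, then truncate the remainder (assumed rank rule).
    parts = []
    for X, r in zip(blocks, ranks):
        J = X @ P_joint                           # joint part
        rest = X - J                              # X projected onto the complement
        U, d, Vt = np.linalg.svd(rest, full_matrices=False)
        r_ind = max(r - V_joint.shape[1], 0)
        I = (U[:, :r_ind] * d[:r_ind]) @ Vt[:r_ind]   # individual part
        parts.append((J, I, X - J - I))           # last entry is the residual
    return parts
```

Because every individual part is built from `X - J`, its rows are orthogonal to the joint row space, which matches the constraint $JI^T=0$.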
----

<!-- .slide: data-background="#D3D4D6" -->

## Toy Data Example 1: Non-Iterative JIVE Result

<img src="https://i.imgur.com/wC2fRcc.png" alt="SparseSVDPC" width="600" height="250" style="opacity:1;background: transparent;border:0px solid white; box-shadow: none">

----

<!-- .slide: data-background="#D3D4D6" -->

### SVD Singular vectors

Sensitive to noise.

<img src="https://i.imgur.com/8RiSKRp.png" alt="SVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">
<img src="https://i.imgur.com/Y2CkqBj.png" alt="SVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

<!-- .slide: data-background="#D3D4D6" -->

### Sparse SVD Singular vectors

<img src="https://i.imgur.com/m8dNzu7.png" alt="SparseSVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

<!-- .slide: data-background="#D3D4D6" -->

### Sparse SVD Non-Iterative JIVE Result

<img src="https://i.imgur.com/vgUshDn.png" alt="SVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

<!-- .slide: data-background="#D3D4D6" -->

## Problems

- The joint/individual dichotomy is arbitrary
- We might want to find structures which are in between joint and individual, or joint between some datasets but not all.
- Singular vectors are sensitive to noise.

---

<!-- .slide: data-background="#EEEEEE" -->

# Pairwise matching

Instead, look for similar directions across datasets, either by angles or by adding norms.
Threshold these.
This gives several different joint components and possibly measures of "jointness".
The result can be drawn as a network for clear visualisation.
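----

<!-- .slide: data-background="#EEEEEE" -->

One possible reading of the angle-based matching, as a hedged sketch: compare every singular direction of one dataset with every direction of another through the absolute cosine of the angle between them, and keep the pairs above a cut-off as network edges. The function name `pairwise_direction_matches` and the parameter `min_cosine` are hypothetical, and the "adding norms" variant is not covered here.

```python
import numpy as np

def pairwise_direction_matches(V_a, V_b, min_cosine=0.9):
    """V_a (r_a x n) and V_b (r_b x n) hold unit-norm singular vectors as rows."""
    # |cos(angle)| for every pair of directions; the absolute value is used
    # because singular vectors are only defined up to sign.
    cosines = np.abs(V_a @ V_b.T)
    # Pairs above the cut-off become edges (index in A, index in B, weight).
    edges = [(i, j, cosines[i, j])
             for i, j in zip(*np.nonzero(cosines >= min_cosine))]
    return cosines, edges
```

The full matrix `cosines` can serve as a graded measure of "jointness", while `edges` is the thresholded version that a network plot would draw.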
----

<!-- .slide: data-background="#EEEEEE" -->

Toy data example 1

<img src="https://i.imgur.com/Y2CkqBj.png" alt="SVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">
<img src="https://i.imgur.com/m8dNzu7.png" alt="SparseSVDPC" width="600" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

<!-- .slide: data-background="#EEEEEE" -->

Produces matrix

<img src="https://i.imgur.com/YkiSO7M.png" alt="HTML5 Icon" width="700" height="350" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

<!-- .slide: data-background="#EEEEEE" -->

That produces a network (for sparse SVD)

<img src="https://i.imgur.com/kxUDisS.png" alt="HTML5 Icon" width="500" height="500" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

<!-- .slide: data-background="#EEEEEE" -->

And for normal SVD

<img src="https://i.imgur.com/y0Zh6uq.png" alt="HTML5 Icon" width="500" height="500" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

<!-- .slide: data-background="#EEEEEE" -->

Combinations

<img src="https://i.imgur.com/n3gH1Lz.png" alt="HTML5 Icon" width="500" height="361" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

---

<!-- .slide: data-background="#E3EEE3" -->

# Cluster Coverage

- Instead compares the cluster structure of the datasets and represents it as a graph
- Similarity measure:
  - $\textrm{Sim}(A_k,B_l) = \frac{1}{|A_k|}\displaystyle\sum_{i\in\{j:L_j^k=A_k \}} I_{\{L_i^l=B_l\}}(i)$
  - i.e. the proportion of observations in cluster $A_k$ in dataset $k$ that have also been assigned to cluster $B_l$ in dataset $l$

----

<!-- .slide: data-background="#E3EEE3" -->

## Example

- Input: $X_1,\ldots, X_k$.
- For each dataset, cluster the $n$ columns using some clustering method to obtain cluster labels $L_i^l$.
- Compute cluster similarities.
- Output: network of clusters.

----

<!-- .slide: data-background="#E3EEE3" -->

## Toy Data Example 1

### Cluster indicators

Cluster Indicators | Network
:--------------------:|:-------------------:
<img src="https://i.imgur.com/8oHDw96.png" alt="HTML5 Icon" width="800" height="250" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">|<img src="https://i.imgur.com/zwv4JfL.png" alt="HTML5 Icon" width="700" height="300" style="opacity:1; background: transparent;border:0px solid white; box-shadow: none">

----

<!-- .slide: data-background="#E3EEE3" -->

## Toy Data Example 2: Partially Overlapping clusters

<img src="https://i.imgur.com/WSBnlf1.png" alt="HTML5 Icon" width="225" height="500" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">
<img src="https://i.imgur.com/lOGZTbW.png" alt="HTML5 Icon" width="300" height="500" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">

----

<!-- .slide: data-background="#E3EEE3" -->

## Toy Data Example 2: Partially Overlapping clusters

<img src="https://i.imgur.com/WSBnlf1.png" alt="HTML5 Icon" width="225" height="500" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">
<img src="https://i.imgur.com/WTzIJRw.png" alt="HTML5 Icon" width="450" height="500" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">

----

<!-- .slide: data-background="#E3EEE3" -->

## Data Example

Data | JIVE decomposition | Network
:--------------------:|:--------------------:|---------------------
<img src="https://i.imgur.com/GaZmqz9.png" alt="RealData" width="350" height="250" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">|<img src="https://i.imgur.com/twZX0KE.png" alt="RealData" width="400" height="250" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">|<img src="https://i.imgur.com/gNXQKYU.png" alt="RealData" width="400" height="250" style="opacity:1; background: transparent;border:9px transparent; box-shadow: none">

----

<!-- .slide: data-background="#E3EEE3" -->

<section>
<video class="stretch" loop data-autoplay src="https://super-monday.000webhostapp.com/output.mp4"></video>
</section>

---

<!-- .slide: data-background-video="https://super-monday.000webhostapp.com/thatsallfolks.mp4" -->