A practical introduction to multi-omics integration and network analysis

# A practical introduction to multi-omics integration and network analysis

## Resources

Course website: https://nbisweden.github.io/workshop_omicsint_ISMBECCB/

Labs: https://nbisweden.github.io/workshop_omicsint_ISMBECCB/labs

HackMD: https://hackmd.io/LI_HCxeRT8-Ty5qjikeFpQ?both

## Teachers

**Rui Benfeitas** - Course leader, NBIS bioinformatician, Stockholm University. Work in integrative omics projects, mostly employing transcriptomic, metabolomic, proteomic, and epigenomic data. Favorite programming language is Python.

**Nikolay Oskolkov** - Course co-leader, NBIS bioinformatician, Lund University, Sweden. I have my background in genomics and medical genetics. Interested in evolution and ancient DNA research. Working on different data types.

**Ashfaq Ali** - Course co-leader, NBIS bioinformatician, Lund University, Sweden. Work with omics data using statistical and systems biology approaches to analyze and integrate omics data for biomarker/target discovery.

## Connection issues

If you have trouble connecting to the Zoom room, or need to get in touch with us quickly about anything else, write here:

## Installation

Refer to the [tutorial homepage](https://nbisweden.github.io/workshop_omicsint_ISMBECCB/) for detailed information.

## Questions?

**For bugs and installation instructions refer to the [homepage](https://nbisweden.github.io/workshop_omicsint_ISMBECCB/).**

For any other questions feel free to write them here and we will try to answer them as soon as possible.

### Course preparation materials

- Will Docker be necessary? Can I work locally on my computer?
    - RB: It is advisable to install Docker if you want to run the notebooks on your computer. If you prefer to install all required packages on your own please feel free, but the Docker installation will save you from some compatibility issues.
- Where can we find the Docker images?
    - RB: You can find them under [the course preparation materials](https://nbisweden.github.io/workshop_omicsint_ISMBECCB/precourse).
- Will the recordings of the talks be available later?
    - RB: They should be shared with you by the conference organizers afterwards.

### July 22, ML view of integration, Supervised and Unsupervised Integration

- Slide 6 of [ML view of integration](https://nbisweden.github.io/workshop_omicsint_ISMBECCB/session_ml/MachineLearningViewOmicsIntegration_Oskolkov.pdf): `are the data distributions coming from biological or simulated data?`
    - RB: These were simulated, but biologically generated data will show very similar patterns.
- As I am working mostly with translational research I get a lot of small studies that I need to integrate. You have stated that this type of analysis will not be covered by the tutorial, but can you advise on what tools are recommended in this case? Transcriptomics both by arrays and RNA-seq, with 3-12 samples each depending on the experiment.
    - RB: It will depend on your dataset size (sample number and feature number) and aim. The biggest challenge is selecting the most informative features in such a small dataset, but some feature selection can be applied, after which you will be able to apply some of today's or tomorrow's approaches. [PLSDA](http://mixomics.org/methods/pls-da/) as a supervised approach should be very helpful. One of [tomorrow's notebooks](https://nbisweden.github.io/workshop_omicsint_ISMBECCB/session_topology/lab.html) shows some techniques applied to an RNAseq dataset.
    - NO: I would recommend starting with a simple linear integration (across samples) with ComBat, limma or MINT, and MNN if you would like to perform non-linear integration.
- Does MOFA have limitations in terms of the number of samples or features?
    - RB: It requires a large sample size ([at least 15 according to the authors](https://bioconductor.org/packages/devel/bioc/vignettes/MOFA2/inst/doc/getting_started_R.html)), and it is crucial that all omics are derived from the same samples. From experience, even 15 may not be enough depending on the data. As for the number of features, results should improve with some preliminary feature selection to remove noisy features. I will leave it to Nikolay to give a more detailed response.
- Is clustering of clusters the same as ensemble clustering?
    - NO: Yes, it has many names, but ensemble clustering is conceptually similar to "clustering of clusters".
- What input is fed into UMAP for the case of two omics dataset integration? Just a concatenation of the two matrices?
    - NO: No, you run UMAP for each individual OMIC. In this way you construct individual graphs, and at the final step you overlap the graphs. UMAP can do all the steps for you. Please check the material from the extended version of our course here: https://github.com/NBISweden/workshop_omics_integration/tree/main/session_ml/UMAP_DataIntegration
- When you say feature selection do you mean preclustering? Or do you mean choosing genes?
    - NO: No, feature pre-selection is not pre-clustering. I mean choosing the most informative and non-redundant genes.
- Why use autoencoders compared to NN since the integration is effectively accomplished at the hidden layers?
    - NO: An autoencoder is an NN. In both examples, i.e. supervised and unsupervised analysis with NN, the integration part happens in the hidden layers. The difference is that in an autoencoder you can visualize your samples / cells in the bottleneck (which will be a consensus plot after the integration has been done in the hidden layer), while with the supervised NN you are not after visualization (or data pre-processing) but would like to predict classes from your phenotype of interest.
- Can you suggest an R library for autoencoders for single-cell data?
    - NO: I used scvis and scVI. As far as I remember, both are Python based, I think because neural networks are traditionally easier to develop in Python. However, there is a very user-friendly keras implementation in R that is very comfortable to use for developing your own autoencoders.
- Where should I read more about Similarity Network Fusion?
    - RB: [Here](https://nbisweden.github.io/workshop_omics_integration/session_nmf/SNF_main.html) you can find a quick review of SNF methods. [Here](https://nbisweden.github.io/workshop_omics_integration/session_nmf/SNF_lab.html) you can find a short notebook demoing it.
- How was the single-cell data prepared for integration? Is the input normalized? Is it counts?
    - NO: Yes, it is counts. The major step was the filtering of the individual OMICs: you need to do quite a lot of work to remove the noisy / redundant features.
- For integration of scRNAseq, how is the data preprocessed? Should the data be normalized before integration?
    - NO: Yes, you can normalize the scRNAseq data across cells, accounting for library size differences; however, I usually use a log-transform as a mild normalization step. In my experience, normalization plays a minor role for scRNAseq data (in contrast to bulk RNAseq data) because you usually have very biologically distinct cell populations, so any normalization (or absence of normalization) does not influence cell clustering much. But in principle you are right, normalization should at least be considered for scRNAseq.
- Can these methods be used for integration of metagenomics with host omics?
    - NO: Good question. Yes, I tried it, without much success though (but this might have been due to the quality of the data rather than the methodology itself). Kim-Anh Le Cao has developed mixMC for microbiome analysis (http://mixomics.org/mixmc/), you might want to try it.
      However, integration with microbiome data is not a very developed area (as Rui mentioned) and I would be interested in seeing methods other than mixMC.
- In unsupervised integration using MOFA, how do we decide upon the maximum number of iterations required?
    - NO: The number of iterations should be enough to reach a stationary solution. This means that specifying a large enough number of iterations is a good idea, but the model could converge earlier than you specified (the remaining iterations would be a waste of resources in this case). I typically monitor the number of latent factors (LFs) kept (reported by MOFA), as it drops LFs that capture a low amount of variation. The minimal amount of variation per LF is a much more crucial hyperparameter.
- What are the units of the input for the unsupervised datasets? What are the units for the RNA-seq or the methylomics? Or the drug?
    - NO: Unfortunately I can't answer about the units for drugs other than by referring to the original publication, [Dietrich et al., J. Clin. Inv. 2017](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5749541/). RNAseq should be just counts.
- Could you comment on the interpretability of the illustrated methods? Biologists are sometimes confused by transformed variables and lose the link between the analysis and biological concepts (genes, proteins, etc.). How can we keep or rebuild this link?
    - NO: The linear methods (PLS / DIABLO and MOFA) are quite interpretable, as one can always get a ranked list of features (genes, proteins etc.) that are responsible for the patterns (inhomogeneity or classes of samples / cells) captured by the models. Perhaps the main outcome of the methods in this tutorial is the (possibly) novel biomarker candidates that one can discover. In contrast, non-linear methods such as neural networks, random forest or UMAP are much less interpretable than PLS / MOFA / PCA / graph based methods. This is because of the complex transformation of the raw OMICs data that the non-linear methods apply.
      This transformation is often needed for better mathematical robustness; however, the price is interpretability. In general, the most interpretable methods (e.g. linear regression or t-test) suffer from a lack of generalizability / robustness, while "black boxes" often tend to be "more mathematically correct".
- Any special considerations to combine binary (mutations) with non-binary data?
    - NO: This is a general challenge in data integration, i.e. combining data from different distributions such as continuous (following e.g. a Poisson or Gaussian distribution) and binary (following a Bernoulli / binomial distribution). You combine binary and non-binary data in the same way as any other data. I would suggest either modelling the probability distributions explicitly and combining them via Bayes' rule as a joint probability distribution (this is what MOFA or Bayesian Networks do), or transforming your data into a non-parametric space where the different data types lose their "memory" of the technology that produced them (this is what Neural Networks, UMAP or Similarity Network Fusion do). In the latter case, one can say that you convert your data to the same common distribution; after that transformation it is OK to just merge / concatenate the data matrices.
- What was the old way of using anaconda to run the Rmd files from yesterday? The MultiOmics gives me problems both on Windows and on Linux.
    - RB: Hi, I will help out with this during the hands-on session if you prefer. The way we had with conda was as stated [here](https://github.com/NBISweden/workshop_omicsint_ISMBECCB/blob/main/conda_instructions.md).
      If the problem is `mixomics`, perhaps the easiest way to install everything is to create an environment with only the basic needed packages as below:

```
channels:
  - conda-forge
  - bioconda
  - anaconda
  - defaults
dependencies:
  ## languages
  - conda-forge::r-base
  ## R packages
  - bioconductor-mixOmics=6.16.0
  - r-reticulate
  - rstudio
```

### July 23, Network analysis and meta-analysis

- Can you suggest papers or libraries that allow combined analysis of both + and - edges?
    - Perhaps interesting: https://academic.oup.com/nar/article/38/1/e1/3112413
    - RB: The Leiden algorithm ([original paper](https://www.nature.com/articles/s41598-019-41695-z) and [here](https://arxiv.org/abs/0811.2329)) allows you to compute communities with both positive and negative edge weights. igraph in [python](https://igraph.org/python/) or [R](https://igraph.org/r/) can be used to analyse positive and negative weights.
- You mentioned that due to differences in dynamic range between omics (e.g. proteomics and transcriptomics) building a common network may not make biological sense... For which questions do you think it would still be justified? And would it make sense to give both omics different 'weights' in terms of edge confidence or treat them differently somehow?
    - RB: It may still make sense, for instance, if you want to test whether a similar pattern is observed at both levels. For instance, knowing that at the transcriptomic level you find X number of biological processes, would the same processes be found at the proteomic level despite the lower coverage? See [here](https://www.tandfonline.com/doi/full/10.1080/22221751.2020.1799723) for an example. You may certainly give different weights, but beware that if you have built an integrated network (transcriptomic + proteomic), you will tend to see a lower number of correlations between omics than within each omic.
- Regarding the discussion on positive and negative correlations as weights of the network edges, could you use the absolute value?
  Then it is magnitude, not direction of effect?
    - RB: Yes, absolutely, and you will be able to compute node weighted centrality and identify communities. This can be very advantageous in your network analysis as it will allow you to identify pivotal nodes regardless of the weight sign (positive or negative). The drawback is in the biological interpretation of the communities, but you can nevertheless hypothesize about dysregulation of the biological process associated with a given community.
- In a protein-protein interaction (PPI) network, are there any topological parameters (centrality, degree, pagerank, betweenness, ...) used to identify key proteins in this type of network? Which parameter is the most "widely" used in biological terms to identify key proteins?
    - RB: You can still calculate all the metrics, and should compute several centrality metrics as these have different underlying assumptions for computing centrality (for instance, degree counts the number of direct neighbors, eccentricity computes the maximum shortest path). [This article](https://www.nature.com/articles/srep26234) applies an elegant analysis of a PPI that may be useful.
- Is there a module analysis that allows me to identify new pharmacological targets in a protein-protein interaction network (all proteins being related to a specific pathology)?
    - RB: It may be helpful to combine information from drug targets and pathology data in your network. As an example, you can use [Enrichr](https://maayanlab.cloud/Enrichr/#libraries), where your gene sets can be derived from drug targets and disease information. It may be useful to combine this with genome-wide CRISPR/Cas9 essentiality screens.
- Which tools would you recommend for network analysis?
    - RB: I tend to use igraph for [python](https://igraph.org/python/) or [R](https://igraph.org/r/). You also have [NetworkX in Python](https://networkx.org/), though it used to have fewer features than igraph.
      [Snap.py](https://snap.stanford.edu/snappy/) is geared towards very big networks, but limited in what properties you can calculate due to high computational demand. [Cytoscape](https://cytoscape.org/) is a very good tool that will allow you to do network analysis and visualization, but becomes very heavy for larger networks due to its GUI. [Gephi](https://gephi.org/) was another recommendation from Nikolay.
- Can you suggest a library devoted to hypothetical protein function prediction or analysis, especially with respect to networks?
    - [StringDB](https://string-db.org/) may help you as it collects information from many different sources. You can also query it from [within Cytoscape](https://apps.cytoscape.org/apps/stringapp).
- Are there any good reading materials/pipelines that try to integrate both non-cell-specific information such as PPIs and Gene Ontology hierarchies, together with cell-specific multi-omics data and drug responses, to construct networks and reveal important genes relating to drug responses or disease mechanisms? Thanks.
    - RB: [Depmap](https://depmap.org/portal/) may be helpful here, as it contains a lot of information with drug responses from different omics. However, I don't think it contains PPI information. Another one that you can look at [is this one](https://www.nature.com/articles/s41467-019-10887-6).

## Feedback

If you have any feedback during the course, feel free to add it here:

- Thank you for the loaded, wonderful workshop!
    - Thank you very much. Please fill in the feedback form here: https://forms.gle/y5s1Lm5yjn5pGxHJ8
    - Please share our longer integrative omics course with your contacts here: https://www.scilifelab.se/event/elixir-omics-integration-and-systems-biology-online/

## Post-course questions

If you have any questions please email us at `edu.omics-integration[ at ]nbis.se`
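
As a follow-up to the July 23 answer on using absolute correlation values as edge weights, here is a minimal sketch of the idea in plain Python. It is not taken from the course notebooks: the function names (`pearson`, `abs_correlation_network`, `weighted_degree`), the toy `profiles`, and the `0.8` threshold are all illustrative assumptions. For real analyses a dedicated library such as igraph, as recommended above, is likely the better choice.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def abs_correlation_network(profiles, threshold=0.8):
    """Return {(feature_a, feature_b): |r|} for pairs with |r| >= threshold."""
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges[(a, b)] = abs(r)  # keep the magnitude, drop the sign
    return edges


def weighted_degree(edges):
    """Sum of incident edge weights per node (also called node strength)."""
    strength = {}
    for (a, b), weight in edges.items():
        strength[a] = strength.get(a, 0.0) + weight
        strength[b] = strength.get(b, 0.0) + weight
    return strength


# Toy expression profiles across five samples (made-up numbers).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],    # rises with geneA
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],     # falls as geneA rises
    "geneD": [1.0, -1.0, 0.0, -1.0, 1.0],   # uncorrelated with the others
}

edges = abs_correlation_network(profiles)
strength = weighted_degree(edges)
for node in sorted(strength, key=strength.get, reverse=True):
    print(node, round(strength[node], 3))
```

In this toy data `geneC` is anti-correlated with `geneA` and `geneB`, yet it is connected to both once the sign is dropped. That is the trade-off mentioned in the answer: pivotal nodes are found regardless of sign, but the direction of effect has to be checked separately when interpreting a community.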
