# Omics Integration and Systems Biology

[TOC]

## Resources

Course website: https://uppsala.instructure.com/courses/75208
HackMD: https://hackmd.io/AjoVRvYCTgitnKRhgXznJQ
Connection via Zoom: https://lu-se.zoom.us/j/66468616131?pwd=ZUIrNUhQOUxDUmVLRW1yclR0ZXJ5Zz09

## Course Organizers

**Nikolay Oskolkov** - Course co-leader, NBIS bioinformatician, Lund University, Sweden. I have my background in genomics and medical genetics, am interested in evolution and ancient DNA research, and work with many different data types.

**Rui Benfeitas** - Course leader, NBIS bioinformatician, Stockholm University. I work on integrative omics projects, mostly employing transcriptomic, metabolomic, proteomic, and epigenomic data. Favorite programming language is Python.

**Ashfaq Ali** - Course co-leader, NBIS bioinformatician, Lund University, Sweden. I work with omics data using statistical and systems biology approaches to analyze and integrate omics data for biomarker/target discovery.

## Questions?

- **Participant:** I had a problem with containers. How should I install them?
  - *Nikolay Oskolkov:* Please do not follow the container route for now. I know it is described in the pre-course material, but we decided to skip containers this time. The labs will not be run within containers, or even conda environments, but will simply be presented as HTML files. Participants can later install the software and run the labs on their own. This time we focus on lectures and explaining concepts; nevertheless, all the code and notebooks will be provided for participants to run on their own.
- **Participant:** Can we get a confirmation of participation at the end (with times noted)?
  - *Nima Rafati:* Yes, you will receive a certificate.
- **Participant:** Will the lectures be recorded?
  - *Nima Rafati:* Not for this round of the course.
- **Participant:** Is proteomics manageable even if the spatial dimension is added?
  - *Nikolay Oskolkov:* It might be problematic, yes, if one adds an additional time or spatial dimension. Hard to say immediately without taking a closer look at your project.
- **Participant:** Can you speak to whether there will still be problems with common differential expression analysis when n << p, where no correlation is assumed and just the false discovery rate is being used?
  - *Nikolay Oskolkov:* FDR is supposed to solve the curse of dimensionality problem to some extent, but not fully. Please check the paper by Naomi Altman (https://www.nature.com/articles/s41592-018-0019-x), in particular their Figure 3, where they show how the number of false positives increases with the number of added dimensions even after adjusting for multiple testing with FDR.
- **Participant:** (Logistics question) Do we need to be logged in to the Uppsala Canvas for this course? I am not a UU student, so I cannot log in when prompted. **Solved**
  - *Nima:* It seems you need to be registered. We can look into that later, but in the meantime you can access the materials through GitHub: https://github.com/NBISweden/workshop_omics_integration
  - **A:** All works fine (as a non-UU student) when using this link: https://uppsala.instructure.com/courses/75208/pages/schedule
  - *Nikolay Oskolkov:* The Canvas pages should have been available to non-Uppsala-University students throughout the course. If some pages are not accessible for some reason, this is a bug; please let me know and I will fix the access ASAP.
- **Participant:** Shall we make a list of participants and the topics we work on, with contact addresses below? I started doing it, hope that's OK :)
  - *Nikolay Oskolkov:* I have a list of participants' addresses and their research areas from the registration form. Do you suggest sharing this among the participants?
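As a side note on the n << p question above: the inflation of nominally significant hits with growing dimensionality can be seen in a small pure-noise simulation (a toy sketch with arbitrary feature counts, not the paper's analysis; multiple-testing adjustment controls this on pure noise, while the Altman paper shows residual false positives in more realistic settings):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_per_group = 10  # small groups, as in a typical n << p design

for p in (100, 1000, 10000):
    # Pure noise: no feature truly differs between the two groups
    a = rng.normal(size=(n_per_group, p))
    b = rng.normal(size=(n_per_group, p))
    pvals = ttest_ind(a, b, axis=0).pvalue
    # Nominal p < 0.05 hits grow roughly as 5% of p despite zero real signal
    raw_hits = int(np.sum(pvals < 0.05))
    print(p, raw_hits)
```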
- **Participant E:** What about a Slack channel? We can form different integration combos and help each other :)
  - *Nikolay Oskolkov:* We had a Slack channel in previous runs of the course, but not this time, because it adds some admin work and we currently do not have enough resources.
  - *Sebastian Hesse:* I just set up a Slack channel for us to keep in touch and collaborate in the future: https://join.slack.com/t/slack-ai26534/shared_invite/zt-1p11zvor6-WQdM4wj8Z3PmEknfkRXz4A
  - *Nikolay Oskolkov:* Thank you very much, Sebastian, this is a great initiative!
- **Participant:** Could we say that, realistically, a vertical integration anchor would be a certain cluster (obtained by dimensionality reduction methods) instead of the cells themselves? Meaning that vertical integration would always assume clustering in a real scenario.
  - *Nikolay Oskolkov:* Yes, I agree with this intuition. The anchors for vertical integration ("across omics") can be samples / cells / individuals, or clusters / groups of samples / cells / individuals. This can be (and probably is) a basis for diagonal integration, where the same type of tissue (i.e. the same functional groups of cells) is used instead of the same biological cells.
- **Participant:** Do the data normalization methods need to be unified for multi-omics data from different sources? E.g., the distribution of microbiome data is very different from metabolomics; do they all need to be normally distributed?
  - *Nima:* You can specify different distributions. In the exercises you can see examples (e.g. Gaussian, Bernoulli, ...). As Nikolay explained, one has to use this with caution.
  - *Nikolay Oskolkov:* I agree that the methods of normalization are very different between different omics.
    Even though normalization within omics is not explicitly required by some integrative methods such as MOFA and DIABLO, I would probably still apply it prior to integration, to be sure that we have at least tried to minimize the technical (not biologically relevant) variation within omics, which would probably facilitate the integration step.
  - *Participant:* Thanks for your answer. I am still confused about whether I should use the same data normalization and transformation for multi-omics data, to make them follow the same distribution, or at least use the same normalization and transformation method? I got very different results when I changed the normalization method.
  - *Payam:* You have to use normalization and transformation methods that are specific to each omics. For example, some of the very popular normalization methods for RNA-seq data do not even work on metabolomics. However, if possible and appropriate, using the same normalization and transformation methods across different omics can ensure that the data are on the same scale and have a similar distribution. It is also important to consider the specific characteristics of each omics data type and, if possible, use the same method across omics; otherwise, you need to use different methods.
  - *Nikolay Oskolkov:* You should not use the same normalization for different omics. Perhaps by normalization you mean some procedure of "equalizing" across omics, i.e. bringing them to the same scale, i.e. making the omics' probability distributions similar? If so, there is no easy equation that can harmonize omics, and the course is exactly about this: you unfortunately need sophisticated methods for this type of harmonization. For example, one could use standardization via Z-score = (X - mean_X) / std_X, but this will work only for data with bell-shaped distributions and not, e.g., for binary data. Therefore we use graphs and matrix factorization techniques here for "normalizing across omics".
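The limitation of Z-score standardization mentioned above can be seen directly on toy data: it centers and scales a bell-shaped layer as intended, but on a binary layer it merely relabels the two values without making the distribution any more comparable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Roughly bell-shaped omics layer (e.g. log-expression): Z-scoring is sensible
expr = rng.normal(loc=5.0, scale=2.0, size=1000)
z_expr = (expr - expr.mean()) / expr.std()

# Binary layer (e.g. mutation presence/absence): Z-scoring just maps 0/1 to
# two other fixed values -- the distribution remains two spikes, not a bell
mut = rng.integers(0, 2, size=1000).astype(float)
z_mut = (mut - mut.mean()) / mut.std()

print(np.unique(np.round(z_mut, 3)).size)  # still only two distinct values
```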
- **Participant:** Is it at all possible to do exploratory analysis in a data set with N <<<< P, e.g. each group with < 10 N but ~10000 P, or does a hypothesis-driven approach always need to be used?
  - *Nikolay Oskolkov:* Yes, exploratory analysis is possible for N << P; this is where dimension reduction methods are useful. By doing dimension reduction, you replace your P features with K factors, where K < P, and sometimes K << P. This is one way to solve the curse of dimensionality problem.
- **Participant:** I am still confused about the difference between doing cross-validation and train/test splitting. Could you clarify the difference? I read that for small data sets it is better to keep all the data for training and do more CV instead, if the modelling is used for feature selection only. What is your opinion?
  - *Nima:* In short, in the example that Nikolay mentioned (K-fold cross-validation), you subset your data multiple times into train and test. If you split only once into train and test, there is a risk of training/testing the model on a specific subset of the data that may reflect a particular state of the data. K-fold cross-validation is used to avoid that.
  - ![](https://i.imgur.com/uT6CxlU.png) (From Nikolay's presentation.)
  - *Nikolay Oskolkov:* Cross-validation is done on the training data set. In contrast, a test data set is used for the final evaluation of your model. If your initial data set is small in sample size, this train-validation-test splitting becomes more and more problematic. However, a test data set is always required for the ultimate evaluation. Cross-validation will not tell you how good your model is; the purpose of cross-validation is to tune hyperparameters. However, if you want to report to the community how well your model performs, it should be tested on an independent test data set.
  - *Participant:* So cross-validation is not only within the training set but within the whole set? And no test set is needed after the CV is performed?
    I think that is what is confusing: people often use CV within the training set and then also validate on an extra test set.
  - *Nima:* No, please note that (to the best of my knowledge) you still have a separate test set; CV is done on the training set. I will ask Nikolay to comment on this, it might be interesting for other students to discuss too.
  - *Nikolay Oskolkov:* Yes, I agree with Nima that a test data set is always needed for a final evaluation. So you do a train-test split, then you do cross-validation on the training data set by splitting it (many times) into train and validation parts (this is the K-fold cross-validation step). After you have tuned the hyperparameters by cross-validation and the parameters by fitting the model, you do one final model validation on the independent test data set that you reserved at the very beginning.
  - *Participant:* Thanks, that is how I understand it is typically done. However, can the test set be sacrificed in the case of a too-small data set, in favour of stronger training of the model (on the whole data set), if the aim is to select features and not the resulting model itself?
  - *Nima:* Remember that it becomes a matter of how much of the variation our model can explain. If (as you said) you sacrifice the test data and use it for training, then your model will be suitable for that data set only (i.e., the overfitting problem).
  - *Nikolay Oskolkov:* I would not recommend sacrificing the test data set if your goal is the model itself; then a test subset is critical for the ultimate model evaluation. However, if your goal is to get a ranked list of biomarkers, yes, you can do it by utilizing the whole initial data set for training. Still, there is a risk that the ranked list of biomarkers is not robust enough. So by evaluating your model's prediction accuracy on a test subset you not only get a good model but also show that your ranked list of biomarkers is reliable.
  - *Participant:* I understand the best-case scenario, what you do when you have a nice data set.
    My question is rather what you do in the worse scenario, when your sample set is small, let's say 30 samples in your smallest group. Is it then still preferable to split the data set into train and test? Many people suggest using everything for training in this case, especially when one is interested in the selected variables and in using them further in other analyses, and not in the model itself.
  - *Nikolay Oskolkov:* If you have a small data set (but still want to analyze it in a machine learning way), my suggestion would be the following. You try to do your best with the train-validation-test split and get a model that predicts best on the test subset. Now you apply the model to your whole initial data set and get a ranked list of biomarkers (feature importances). A trained model essentially means an equation (like a linear model) with optimized parameters (slope and intercept in terms of a linear model) and hyperparameters (such as lambda in LASSO). I understand that the model is not your interest; however, a good model (meaning one that predicts well on a test subset) will give you a good / reliable ranked list of biomarkers. If you skip the train-validation-test paradigm, the model you get is more prone to overfitting (because you did not validate it) and hence might give you a biased ranked list of biomarkers. Nevertheless, I am not saying that your suggestion of skipping train-validation-test and doing only cross-validation is wrong. For a small data set, it might work equally well as what I suggest; there is not much room for manoeuvre with a small data set.
- **Participant:** When would we choose PLS vs LDA and vice versa? Are there particular data constraints?
  - *Payam:* PLS is for regression whereas LDA is for classification. So PLS is suitable for predicting a continuous response, while LDA is suitable for predicting a categorical response in cases where the classes are well separated and the distribution of the predictors is normal.
  - *Nikolay Oskolkov:* I agree with Payam. However, PLS has a modification called PLS-DA (PLS discriminant analysis), which in my opinion is equivalent to LDA; PLS-DA is, for example, what DIABLO is based on. Overall, yes, LDA is often (if not always) used for classification, while PLS can be used for both regression and classification.
- **Participant:** What would be a reasonable input size for DIABLO in terms of N and P? Is there a mathematical way to evaluate the statistical power?
  - *Payam:* There is no concept of power in DIABLO. Power is more relevant for hypothesis testing (the ability of a statistical test to detect an effect of a certain size, given a specific significance level (α), sample size, and alternative hypothesis). However, the accuracy and predictive performance of DIABLO can still be influenced by factors such as sample size, the number of predictors, the signal-to-noise ratio, etc. In general, the larger N is relative to P, the better; there is no golden number. One has to take into account the complexity of the relationship between the predictors and the response, the amount of noise in the data, and the desired level of precision in the estimates.
  - *Nikolay Oskolkov:* I agree with Payam. DIABLO, PLS-DA and mixOmics are essentially machine learning algorithms, and there is no concept of power in machine learning (nor in Bayesian statistics either). We do not estimate statistical power prior to training a machine learning model, but simply use prediction accuracy on an independent data set as the criterion of success / failure. Regarding using DIABLO (or any other machine learning algorithm) for integration, I would not run it naively if P is, for example, 10 or 100 times greater than N. In that case I would recommend doing feature selection (LASSO) or dimension reduction (PCA) to bring the ratio from the initial P / N = 10 (or 100) down to P / N = 1 (or 0.1 or even smaller), and only after that start with DIABLO.
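The dimension-reduction step suggested above can be sketched with PCA on toy numbers (illustrative sizes, not the course data; each omics layer would be reduced like this before a DIABLO-style integration):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# N = 50 samples, P = 5000 features: P / N = 100, too high to feed naively
X, y = make_classification(n_samples=50, n_features=5000, n_informative=10,
                           random_state=0)

# Reduce the layer to a few dozen components first, here P' = 30 (P' / N < 1)
X_red = PCA(n_components=30, random_state=0).fit_transform(X)
print(X.shape, "->", X_red.shape)  # (50, 5000) -> (50, 30)
```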
- **Participant:** Does LASSO control for metadata covariates (confounding factors)? Is the output orthogonal?
  - *Nikolay Oskolkov:* I believe including covariates in LASSO is less simple than in a regular linear model. However, LASSO (at least within the glmnet R package) has a "weights" argument that aims to account for confounders. This can be done within so-called "propensity score" modelling: one first regresses the confounder from the data and builds the "weights" this way; the weights measure how much each individual is affected by the confounder. The "weights" computed in this way can then be used when running LASSO, which will adjust for covariates / confounders.
  - *Payam:* I should add to Nikolay's answer. In my opinion, you can have them as part of your model, for example y ~ conf1 + conf2 + age + something, and set the penalty factor to zero for the confounders. However, the interpretation is not as straightforward as in ordinary regression.
- **Participant:** Can the `weights` argument in LASSO fix the problem with an imbalanced dataset? Does it have something to do with later integration?
  - *Payam:* Yes and no! Weights can be used to down-weight the influence of outliers, so you can use them to lessen the influence of the bigger class. However, they do not change the distribution of the data.
  - *Nikolay Oskolkov:* I agree with Payam. Just one thing to add: weights in linear models are usually used to account for confounders. In this case, the weights indicate which observations are most affected by the confounding factor, which results in less biased model predictions. However, weights will probably not solve the problem of an imbalanced data set; hard to say without proper experimenting and testing.
- **Participant:** When combining different omics which have very different distributions, or when some are binary while some are continuous, or some have lots of zeros, you mentioned that we should do something like normalization or scaling before the integration.
    I see in the notebook supervised_omics_integr_CLL.Rmd that four different omics were used for this example, but it seems these omics were integrated directly without preprocessing. My question is how to determine whether I should do some normalization / scaling prior to integration?
  - *Nikolay Oskolkov:* I actually said that the methods we went through yesterday (e.g. DIABLO and MOFA) do not specifically do normalization or scaling. This is because it is not easy to figure out how to normalize or standardize data with, e.g., a bimodal distribution. Standardizing your data (a simple "integration" approach people used in the early days) is based on computing Z-scores as Z = (data - mean_data) / standard_deviation_data. This approach becomes problematic if you have a bimodal data distribution: how do you define a mean in that case? In summary, the modern integrative omics methods (to the best of my knowledge; hard to know for sure without going through the source code) do not explicitly normalize / standardize / scale data prior to integration. Whether this might result in biases (or is safe to skip) is an open question for me.
  - *Participant:* Thanks for the very nice explanation! If we have omics data from different experiments or sources, e.g. if I want to integrate UK Biobank omics data with bulk RNA-seq from another source or sources, which methods would be better for data integration?
  - *Payam:* If you are dealing with datasets X, Y, Z that have matching variables (transcripts, for example) but different samples, N-integration is good. If you have matching samples but different variables, then P-integration (almost all of the methods in this course). If both samples and variables are different across X, Y, Z, then you have a headache! In this case, you should do "late" integration or conceptual integration: analyze X, Y and Z separately and then make a global interpretation or integration of the results.
    Diagonal integration is another way, but honestly you need to have many, many samples to be even partially accurate.
- **Participant:** What is the advantage of a graph analysis approach over an unsupervised integration approach like MOFA?
  - *Payam:* Graphs can capture very complex relationships between variables and provide a visual representation of those relationships. They often work better when different data modalities have very different distributions. But it does not stop there: graphs are very flexible and dynamic. You can update them as new data come in, with minimal computational overhead. They can be used for many different data types, including continuous, categorical, and time-series data. And I think they are easier to interpret.
  - *Nikolay Oskolkov:* I agree with Payam: graphs are more intuitively clear, while MOFA might be difficult to grasp (think about its probabilistic framework, variational Bayes and ELBO minimization). On the other hand, MOFA delivers loadings (i.e. key drivers of the biological process in question), while it is not straightforward to extract feature importances from graphs.
- **Sebastian Hesse:** Are there any models to perform **multi-omic time course analysis**? I have proteome, mRNA and miRNA data from 6 developmental stages of a cell type that I wish to integrate into an in-silico model. Do you know of any examples from publications achieving this?
  - *Payam:* You can use almost all the methods presented in this course, but if you are after specific packages, MEFISTO (https://biofam.github.io/MOFA2/MEFISTO.html) and timeOmics (https://www.bioconductor.org/packages/release/bioc/html/timeOmics.html) are two good options.
- **Participant:** What does a negative score value in LASSO mean?
  - *Payam:* It means your model is not predictive at all, if you are talking about the R^2 value.
  - *Nikolay Oskolkov:* I believe the participant is asking about ranking features by LASSO scores (instead of p-values as in univariate feature selection).
    If so, the positive and negative LASSO scores approximately mean positive and negative correlation of each feature with the response variable. Think about running Y ~ beta1\*X1 + beta2\*X2 with LASSO: the LASSO scores have the meaning of the beta1 and beta2 coefficients (not exactly, but more or less). The coefficients can be positive or negative, and similarly the LASSO scores can be positive or negative.
- **Xue:** I am interested in how transcription is coordinated at overlapping convergent gene pairs. So far I did a PCC analysis, but I found PCC is not enough to find the connection between sense and antisense genes. My plan is to combine different epigenome data (H3K4me1 ChIP-seq, RNAPII occupancy, NET-seq, H3K36me3) and bulk RNA-seq from the same genotypes (WT, other mutants) to try to identify which features are more important or can contribute to the coordination. I am quite new to omics integration. Do you have any suggestions? Thanks!
- **Xue:** Could someone answer this? I am still very confused! Great thanks!!
  - *Nikolay Oskolkov:* Xue, I believe you can apply a variety of the methods we have been discussing at this course. In order to be more specific, I would for example like to know more about the number of samples and the number of features (this is what is important for the methods I was explaining at the course), as well as what hypothesis (if any) you have, and more information. At first glance, your design seems to fit the vertical integration methodology. You are welcome to start with e.g. MOFA, DIABLO and graphs, and to contact me in case something is not working for you.
- **Participant:** When we build a biological network and try to integrate metabolome and genome data, is that feasible if the genome data is a categorical variable (e.g., 0, 1, 2) but the metabolome data is continuous?
  - *Payam:* It is possible to integrate categorical and continuous data in a biological network.
  - *Nikolay Oskolkov:* Yes, I agree. I think the most straightforward way of integrating genotype and metabolomics data is the graph approach. I would recommend constructing two separate KNN graphs, one for the genotype data (with the Sörensen-Dice or Jaccard distance) and one for metabolomics (with the Euclidean distance), and intersecting the graphs. Code for building the graphs can be found here: https://github.com/NikolayOskolkov/UMAPDataIntegration/blob/main/UMAP_DataIntegration.ipynb. Also, as explained in the UMAP for data integration notebook, you can do the same via UMAP.
- **Jiaqi:** When doing UMAP with scanpy, the tutorial says that we need to calculate the neighbours (scanpy.pp.neighbors: https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.neighbors.html) and then UMAP (scanpy.tl.umap: https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.umap.html). However, the description on the scanpy.pp.neighbors page says that "The neighbor search efficiency of this heavily relies on UMAP". It seems a chicken-and-egg question to me, and I am curious what the relationship between UMAP and the neighbour graph is.
  - *Nikolay Oskolkov:* UMAP is a neighbor-graph algorithm. In the first step, UMAP literally constructs a neighbor graph out of the raw data matrix, i.e. it computes pairwise correlations between data points, which is equivalent to a fully connected graph. Jiaqi, if you have a more specific question, feel free to continue the discussion here.
- **Aurelia Morabito:** When using tSNE or UMAP, we usually input PCA-reduced data. My question is: isn't this procedure a little too disruptive of the data? If I want to build an ML model to determine the major phenotype-discriminative features, how am I supposed to go back to the biological meaning of the UMAP-PCA-reduced data?
  - It is! You often do that to address the computational scalability issues that can arise when working with high-dimensional data. However, these methods are often used to identify clusters in the data.
    You can then use much simpler models to derive information about the clusters (compare them, etc.). Interpretability and explainability are huge issues in complex multi-stage/layer models. It takes some experience and effort to go back to the biological information.
  - *Nikolay Oskolkov:* Aurelia, I believe you are asking two questions. Replacing your data by a good (optimal) number of principal components should be 1) almost equivalent to your initial data, and 2) necessary for overcoming the curse of dimensionality (otherwise it would be too naive to feed very raw data to UMAP / tSNE; they do not work well for very high-dimensional data). So the procedure "apply PCA on gene expression and then feed it to tSNE / UMAP" seems very correct to me. Your second question is about the interpretability of what you get after having applied that procedure, and here indeed there is a problem. As I mentioned in the lectures, more complex models (and UMAP is a complex model) can be very accurate, but the price for the increased accuracy is a loss of interpretability. That does not mean that a UMAP-PCA model is not interpretable at all, but it is not straightforward. A general way to interpret a UMAP-PCA model would be to perturb the input genes one by one and measure how much the final outcome of the UMAP-PCA model is affected by each perturbation. In this way one can get feature importances, i.e. interpretation. Finally, in single-cell omics, when people get a UMAP with cell clusters, they simply do differential gene expression between the clusters in order to figure out gene markers for each cluster. In this way you get some interpretability of UMAP, i.e. you know which genes drive each cluster.
- **Participant:** Is there something similar to the universal approximation theorem, but for uniqueness of the solution? Say, for example, I look at the feature importance / SHAP values from a (non-linear) machine learning algorithm.
Could other features give then same performance with regard to predictability, but for explainability this will give two different set of features. - *Payam:* I guess we talked about this question in more detail. There is no similar theorem for uniqueness of solution in the same sense. The reason is that there can be multiple solutions to the same problem, and the choice of solution depends on various factors such as the initialization of the weights, the optimization algorithm used, and the presence of local minima. This is how it looks in a simple case!! The solution space of model parameters can have incomprehensive forms in most of the cases if dealing with many variables. ![](https://i.imgur.com/7XZpjZw.png =300x300) - *Nikolay Oskolkov*: I would add here that the more validation of your model you do (on a few different independent data sets) the less likely you end up with a situation when totally different sets of features result in models with comparable predictive capasities. The artifacts like this (different sets of features give similar performance) usually comes in small data sets and should disappear with increase of sample size, which in turn allows for more and more validation. - **Participant**:Simple question: In PCA one can get the contributions of each feature to each principal component. Can one do the same with UMAP latent components? - *Nima:* As Nikolay explained, maybe not in the same fashion as PCs; What you can do is to use UMAP latent component and run a linear model with your observation and check how much of variation is explained. - *Nikolay Oskolkov*: yes, to re-iterate, I use the following way to estimate how much of variance in my data each UMAP component explains. If you initial data matrix is Y (with dimensions N-samples and P-features), and you constructed e.g. 
3 UMAP components, you build a new matrix X = (UMAP1, UMAP2, UMAP3) as a simple merge / concatenation of the UMAP components (the X matrix will have dimensions N samples and 3 features). Now you can run a linear regression (something like PLS) Y ~ X and compute R^2 (please check my Unsupervised Omics Integration lecture, MOFA part, to see how R^2 is defined), which will tell you how much of the variation in Y is explained by X. Your X matrix can also be just UMAP1 or UMAP2; in this way you can compute how much of the variation is explained by UMAP1 and by UMAP2 separately. Finally, the way I am suggesting can be disputed (people might ask why I use linear regression for inferring the variance explained by a non-linear method, and I can give my arguments... I skip this discussion for now), but I do not know any better way to get a feeling for the variance explained out of UMAP.
- Response from participant: Hi, thanks for the answer, yes I get how to calculate the "contribution" of each component. But what I mean is the contribution of each feature to each new latent component, something like this for PCA: ![](https://i.imgur.com/8D3l07o.png)
- *Nima*: I see your point. I can refer you to the example in the MOFA exercise where the weights of each feature in relation to the latent FACTORs can be visualised with `plotWeights`: https://uppsala.instructure.com/courses/75208/pages/bonus-lab-unsupervised-integration-with-mofa-v1. In the MOFA tutorial there are other functions that you can use for other types of visualisations of feature weights and LFs. It is not a direct answer for UMAP, but I hope it can give you some hints on the possibility of measuring the contribution of feature(s) to latent factors/components. I think Nikolay and others can add more input here :)
- *Nikolay Oskolkov*: Oops, I realized that I (like Nima) probably also misunderstood the previous question :) As Nima says, MOFA does estimate a contribution of each feature to a latent factor via R^2.
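The variance-explained recipe above (concatenate the embedding components into X, regress Y ~ X, compute R^2) can be sketched in a few lines. This is an illustrative sketch, not code from the course labs: it uses only numpy, and a toy matrix whose first two PC scores stand in for UMAP1 and UMAP2; any embedding matrix X of shape N × k (e.g. the output of umap-learn's `fit_transform`) could be plugged in instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an omics matrix Y: N samples x P features
N, P = 100, 50
Y = rng.normal(size=(N, P))

# Stand-in for the UMAP components: the first two PC scores of Y.
# In practice X could be e.g. umap.UMAP(n_components=2).fit_transform(Y).
Yc = Y - Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
X = U[:, :2] * s[:2]  # shape (N, 2): the "X = (UMAP1, UMAP2)" matrix

def r2_explained(Y, X):
    """Share of total variance in Y explained by the linear fit Y ~ X."""
    Yc = Y - Y.mean(axis=0)
    Xd = np.column_stack([np.ones(len(X)), X])      # add an intercept column
    beta, *_ = np.linalg.lstsq(Xd, Yc, rcond=None)  # multi-output least squares
    resid = Yc - Xd @ beta
    return 1.0 - resid.var() / Yc.var()

print(r2_explained(Y, X))         # variance explained by both components together
print(r2_explained(Y, X[:, :1]))  # variance explained by the first component alone
```

Calling `r2_explained(Y, X[:, :1])` mirrors the "X can be just UMAP1" step: it gives the share of variance attributable to the first component on its own.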
UMAP components are not linear combinations of the input features. For UMAP, what I would probably do is perturb each input feature (add some variation) and check how much the UMAP is affected by each feature. Perhaps a simpler way would be to just correlate each input feature with UMAP1 in order to figure out which features contribute the most to UMAP1.
- **Xue**: Is anyone else having problems running the labs? When I ran the unsupervised MOFA integration lab, at line 193 it gave me an error: "Not possible to recompute the variance explained estimates when using non-gaussian likelihoods". We have scRNA-seq with a gaussian distribution and two other layers with a bernoulli distribution. Could somebody debug this? Thanks!
- *Nima*: Are you referring to this file, `UnsupervisedOMICsIntegration.Rmd`? Could you also specify the code (not the line number only)?
- **Xue**: I ran UnsuperviseOMICsIntegration/UnsupervisedOMICsIntegration_MOFA1.Rmd. The command is `r2 <- calculate_variance_explained(MOFAobject)`; after this I get the error "Not possible to recompute the variance explained estimates when using non-gaussian likelihoods". In the Rmd file you provide, I initially ran the command `calculatevarianceexplained()`; it was not working and said the function was not found. Then I learned that in the MOFA2 package the command should be `calculate_variance_explained()`, but that did not work either. Therefore, I wonder if you updated the Rmd file without checking that the commands are reproducible. Can you try to run it, or has somebody else run it without problems?
- *Nima*: Xue, it has to do with different versions of MOFA. In the new version R2 is already calculated and you can find it in the object, so please try this: `MOFAobject@cache$variance_explained$r2_total[[1]]`. You do not need this input in the next steps; it is just to see the R2 for each data type / omics layer. And for plotting, since I assume it should be MOFA2, please try this command instead: `plot_variance_explained(MOFAobject)`
- Thanks, Nima.
It works now after I modified it to:
  - `r2 <- MOFAobject@cache$variance_explained`
  - `r2$r2_total`
  - `head(r2$r2_per_factor)`
  - `plot_variance_explained(MOFAobject)`
- *Nima*: Great! :)
- *Nikolay Oskolkov*: Great Nima, thanks a lot!
- **Kristian**: Can you speak to whether, when running a standard GSEA analysis, one should subset to only statistically significant genes (after doing differential expression)? In the original papers it seems that one should not subset, but I have also heard the argument that subsetting to only significantly regulated genes makes it more robust.
- *Ashfaq Ali*: If you do gene set analysis (GSA) using a hypergeometric test, one should subset, but for GSEA one would usually let the gene ranking take care of the enrichment. That said, I don't think you would be doing anything statistically wrong if you limited it to significantly differentially regulated genes.
- **Rima**: Can anyone recommend a well documented R package for Bayesian networks? Is it usable for longitudinal data as well? Would appreciate some experience exchange over here. THANKS!
- *Nikolay Oskolkov*: Rima, there is a very well documented and easy-to-use R package, "bnlearn", here: https://www.bnlearn.com/. I have very positive experience using the package and recommend it. I heard your questions about causal inference and inferring directionality (arrows) in a correlation network, and I believe you will find a lot of interesting information if you learn Bayesian networks via bnlearn. I am myself very interested in Bayesian networks, causality and Mendelian randomization; we have unfortunately not covered them in this course, but we could perhaps keep in touch and keep talking about them.
- **Rima**: Is it possible to contact some of the docents with specific questions about the labs afterwards, since we have not really managed to go through them together?
- *Nikolay Oskolkov*: Yes, absolutely. Please contact me (preferably, as I am the course responsible) or the other teachers of the course, and we will do our best to help you with the notebooks.
- **Rima**: It should theoretically be possible, when having negative values, to shift the entire distribution into the positive space. Does anyone know a good way to do it? Generally I would think about adding a positive number that exceeds the highest absolute negative value, but this seems slightly naive. Any other ideas?
- *Nikolay Oskolkov*: I agree with your suggestion, Rima, and think this is a good approach, because it does not change the overall probability distribution of your data (all the patterns are still there) but makes many statistical algorithms work that would otherwise fail (for example, log-functions do not like negative values, etc.).
- **Participant**: Nikolay mentioned in the last session that to select the number of PCs to take into account in a PCA, one can do "shuffling" of the data to measure the noise, in a way. What does that exactly mean? Randomize all the measurements, do a PCA, and see what is the highest % explained by PC1?
- *Nikolay Oskolkov*: To see the code for this procedure you could, for example, check my blog post here: https://towardsdatascience.com/how-to-tune-hyperparameters-of-tsne-7c0596a18868. Briefly, by shuffling the original matrix I mean that you randomly shuffle the rows and columns simultaneously, i.e. each element of the matrix moves to a random position in the matrix. For example, an element at the 12th row and 305th column replaces an element located at the 125th row and 5th column, etc. Once you have shuffled your data matrix, it should not have any correlation structure remaining (it should be close to pure Gaussian noise); thus, if you run a PCA on this shuffled matrix, each PC should explain just a few percent of the variation (for example ~3%), which gives you an estimate of how much variation PC1, PC2, etc.
would explain on a purely random / noise matrix. This is your noise zone. Now, suppose PC1 on your original un-shuffled matrix explains 30% of the variation, PC2 explains 12%, etc., PC8 explains 5% and PC9 explains 3%. Then I suggest keeping only the components PC1-PC8, because the remaining PCs explain a percentage of variation at or below the noise zone. This is how you can show that 8 is the optimal number of principal components.

## Participants

MultiOmics slack channel: https://join.slack.com/t/slack-ai26534/shared_invite/zt-1p11zvor6-WQdM4wj8Z3PmEknfkRXz4A

- **Xue Zhang**:
  - Postdoctoral researcher focusing on how transcription is coordinated at overlapping convergent gene pairs at the single-locus level, learning to be a bioinformatician (Uppsala, Sweden)
  - Areas of interest
    - NGS analyses (RNA-seq, scRNA-seq, ChIP-seq) and data integration
    - single cell technologies
  - Contact
    - https://www.linkedin.com/in/xue-zhang-49278018a/
    - Xue.zhang@slu.se
- **Sebastian Hesse**:
  - Physician scientist in paediatrics and genetic immunodeficiencies. Located in Munich, Germany.
  - Areas of interest
    - Integration of bulk proteome, mRNA and microRNA to analyze cell development (time course analysis)
    - Whole exome/genome gene variant analysis and integration with the proteome
  - Contact
    - sebastian.hesse@med.lmu.de
    - https://www.linkedin.com/in/sebastian-hesse-0138a111/
- **Evi Vlachou**:
  - Biologist by training, bioinformatician by profession, and an immunology enthusiast. Started my PhD in Computational Cancer Biology last October.
  - Areas of interest
    - Understanding the bone marrow niche in AML, cell-to-cell interactions, gene regulatory network inference, the enhancer-TF-gene relationship.
    - Main methodologies: multiome: scRNAseq + scATACseq + maybe CITEseq (proteome). Spatial transcriptomics. So I am interested in multiomic/multimodal integration.
  - Contact
    - evi.vlachou@embl.de
- **Tingting Wang**:
  - Researcher in the metabolomics group at COPSAC, Denmark; did her PhD in analytical chemistry at DTU
  - Areas of interest
    - Maternal metabolomics in relation to the neurodevelopment of offspring
    - Multiomics integration (metabolome, genome, microbiome, and immunology) to explain the mechanism of neurodevelopmental disorders in childhood
  - Contact
    - ting.wang@dbac.dk
    - https://www.linkedin.com/in/tingting-wang-2086a9a9/
- **Natalia Rivera**:
  - Geneticist interested in disease mechanisms of autoimmune diseases
  - Areas of interest
    - Understanding disease mechanisms, genetic architecture, and molecular factors implicated in autoimmune diseases, chiefly sarcoidosis and rheumatoid arthritis
    - Using various molecular phenotypes (genetics, DNA methylation, bulk and single-cell RNA, ATAC-seq) and wanting to integrate these data
  - Contact
    - natalia.rivera@ki.se
    - https://www.linkedin.com/in/natalia-v-rivera-12a8852/
- **Thaher Pelaseyed**:
  - PI of the Mucin Biology Group at the University of Gothenburg, Sweden.
  - Areas of interest
    - Integration of bulk proteomic, transcriptomic and metagenomic data, aiming at defining the functional role of mucins in intestinal defense systems and host-microbiota interactions
  - Contact
    - thaher.pelaseyed@medkem.gu.se
    - https://www.linkedin.com/in/thaherpelaseyed/
    - https://pelaseyedlab.org/
- **Ulrike Münzner**:
  - Bioinformatics Scientist at Nykode ASA (Oslo, Norway)
  - Areas of interest
    - cancer immunotherapy
    - infectious diseases
    - vaccines
  - Contact
    - utmuenzner@nykode.com
    - https://www.linkedin.com/in/ulrike-m%C3%BCnzner-b48637191/
- **Carolina Oses**:
  - Researcher at the Spatial Proteomics Facility at SciLifeLab, Stockholm, Sweden. I work principally with PhenoCycler/CODEX, COMET, and LabSat.
  Human and mouse organisms, FFPE and FF samples, and any kind of tissue/cells
  - Areas of interest
    - Spatial proteomics and understanding how to analyze the data
    - Integrating other omics technologies with ours, and analyzing them
    - Solving some of the principal problems after we deliver our data -> analysis!
  - Contact
    - https://www.linkedin.com/in/oses-carolina/
    - carolina.oses@scilifelab.se
- **Aurelia Morabito**:
  - I am a PhD student in bioengineering at Politecnico di Milano (Milan), working on a project about high-level multi-omics integration at the Mario Negri Institute (Milan).
  - Areas of interest
    - Omics data-driven integration
    - Omics data pre-processing and data
  - Contact
    - https://www.linkedin.com/in/aurelia-morabito
    - aurelia.morabito@marionegri.it
- **Giulia De Simone**:
  - PhD student in Biotechnology at the University of Milan-Bicocca (Italy), working in the mass spectrometry laboratory at the Mario Negri Institute (Milan, Italy)
  - Areas of interest
    - development of proteomics and metabolomics strategies to define molecular networks
    - integration of other -omics with our data
  - Contact
    - giulia.desimone@marionegri.it
- **Sergio Mosquim Junior**:
  - PhD student in the Computational Proteomics Group at Lund University (PI Fredrik Levander); recently started working on platform projects and care of LC-MS equipment. Biotech engineer by training
  - Areas of interest
    - I work primarily with biomarker discovery in cancer from bulk proteomics data
    - Integration of bulk proteomic and phosphoproteomic data with both clinical and RNA-seq data
    - scProteomics
  - Contact
    - https://www.linkedin.com/in/sergio-mosquim-junior-b074b271/
    - sergio.mosquim_junior@immun.lth.se
- **Aonghus Naughton**:
  - Recently registered PhD student at Karolinska Institutet.
  - Areas of interest
    - Epigenetic plasticity in childhood cases of acute leukemia - single-cell and bulk epigenomics/transcriptomics.
    - EV-mediated cell-to-cell interactions in the AML bone-marrow niche - single-cell/bulk RNA-seq + proteomics.
  - Contact
    - aonghus.naughton@ki.se
- **Rima Chakaroun**:
  - Physician scientist, specialist in internal medicine and endocrinology, currently undertaking a fellowship at the Wallenberg Lab, in the Baeckhed Lab
  - Areas of interest
    - interested in the multisystem biology of sexual dimorphism in cardiometabolic disease and integrating meta-organism information
    - focus on microbiome-host interactions and how the microbiome contributes to early determinants of CMD
  - Contact
    - https://scholar.google.com/citations?user=fzVq7RwAAAAJ&hl=en
    - rima.chakaroun@wlab.gu.se
- **Veronica Finazzi**:
  - Computational biologist working with multiomic data to study neurodevelopmental disorders. Currently a post-graduate fellow at Human Technopole (Milan, Italy).
  - Areas of interest
    - Integration of scRNA-seq and scATAC-seq datasets
    - Analysis of gene regulatory networks
  - Contact
    - veronica.finazzi@external.fht.org
- **Laura M. Palma Medina**:
  - Biological engineer with a PhD in medical sciences. I work on severe infectious diseases caused by bacteria. I'm currently working on necrotizing soft tissue infections and sepsis.
  - Areas of interest
    - Impact of the tissue microenvironment on the development of severe infections
    - Biomarker discovery for early detection of severe invasive infections
    - Currently working with bulk RNA-seq, Olink-based proteomics, and clinical data
  - Contact
    - https://scholar.google.nl/citations?user=eyMZY0oAAAAJ&hl=en
    - https://twitter.com/LauraMPalmaM
    - laura.palma.medina@ki.se
- **Mauricio Roza**:
  - PhD student in Environmental Science, SciLifeLab, Department of Environmental Science, Stockholm University
  - Areas of interest
    - Multi-generational effects of contaminants
    - Transgenerational epigenetic inheritance
    - DNA methylation, ATAC-seq, ChIP-seq, RNA-seq
  - Contact
    - mauricio.roza@aces.su.se
    - https://www.linkedin.com/in/mauricio-roza-28b09313b/
- **Kristian Moss Bendtsen**:
  - Physicist, working for Novo Nordisk with bioinformatic datasets, trying to understand mode of action and treatment response biomarkers from pre-clinical models to trials.
  - Areas of interest
    - Mode of action of drug treatment
    - Translational biomarkers (pre-clinical to clinical)
    - Translational science
    - Predict
  - Contact
    - KRMB@novonordisk.com

## Suggested articles:

- [The curse(s) of dimensionality by Naomi Altman & Martin Krzywinski](https://www.nature.com/articles/s41592-018-0019-)
- [Drug-perturbation-based stratification of blood cancer by Dietrich et al. (uses DIABLO)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5749541/)
- [Applications of Bayesian network models in predicting types of hematological malignancies](https://www.nature.com/articles/s41598-018-24758-5)
- [Integration of multi-omics data for prediction of phenotypic traits using random forest](https://pubmed.ncbi.nlm.nih.gov/27295212/)
- [Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets](https://www.embopress.org/doi/full/10.15252/msb.20178124)
- [A wellness study of 108 individuals using personal, dense, dynamic data
  clouds](https://www.nature.com/articles/nbt.3870)
- [A Comparative Analysis of Community Detection Algorithms on Artificial Networks](https://www.nature.com/articles/srep30750)
- [Computational principles and challenges in single-cell data integration](https://www.nature.com/articles/s41587-021-00895-7)
- [A benchmark of batch-effect correction methods for single-cell RNA sequencing data](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9)
- [Identifying temporal and spatial patterns of variation from multimodal data using MEFISTO](https://www.nature.com/articles/s41592-021-01343-9)

Sebastian Hesse kindly offered his help with making a Slack channel for keeping the participants of this course connected. If you are interested, please join the channel here: https://join.slack.com/t/slack-ai26534/shared_invite/zt-1p11zvor6-WQdM4wj8Z3PmEknfkRXz4A
