RaukR 2021 - HackMD

# RaukR 2021

Welcome to HackMD RaukR 2021 space. Here, we share questions, suggestions for improvements and other valuable comments.

Course content: https://nbisweden.github.io/RaukR-2021/

Refresh often! =)

## Day 1

### General comments day 1

### Module -- Reproducible Research

#### Lecture

> Useful link: [RStudio intro to `renv`](https://rstudio.github.io/renv/articles/renv.html#:~:text=Using%20renv%20%2C%20it's%20possible%20to,of%20your%20project%20to%20renv).
>
> **Question 1.** If you work with conda environments, you often get a lot of dependency issues when installing R packages. What would be the best way to avoid this using the `renv` package? Are there any tips on how to set up an environment using conda when packages used in a project are not present in conda repositories?

- [ ] The module about Python and R integration on Wednesday will discuss such questions in more detail.
- [ ] Perhaps it's better to manage R through Conda and R packages through renv?
- [ ] One possible way is to install R and renv using `conda`; all other R packages can then be installed with `renv`.

> **Question 2.** I have been using *renv* in a project and it used to work perfectly until the latest update of RStudio. Since the update I am not able to open the project anymore and RStudio crashes right away. I figured out that this is due to the Python integration tool *reticulate* being part of the environment; when I remove it from the environment file, I can open the project again. Has anyone experienced something similar and/or has a solution?

- [ ] Maybe a good discussion point for the lecture on Python and R integration.

**Comment:** Rmarkdown is great, but be aware that it is also possible to write a plain R script, which RStudio converts to Rmarkdown by itself and renders as a report in HTML or another chosen format.
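The conversion mentioned in the comment above can also be tried from the R console. The sketch below assumes the `knitr` package is installed and uses `knitr::spin()`, which, as far as I know, is what RStudio's "Compile Report" relies on through `rmarkdown::render()`. The script content and the file name `my_script.R` are made up for illustration.

```r
# A small R script held as text (hypothetical content).
# Lines starting with #' become prose; everything else becomes code chunks.
script <- c(
  "#' # Report title",
  "#' Some explanatory text for the report.",
  "x <- c(1, 2, 3)",
  "mean(x)"
)

# Convert the script to R Markdown without knitting, to inspect the result
rmd <- knitr::spin(text = script, knit = FALSE, format = "Rmd")
cat(rmd, sep = "\n")

# For a script saved on disk, a full report could be produced with e.g.
# rmarkdown::render("my_script.R", output_format = "html_document")
```

Printing `rmd` shows the prose lines followed by a fenced R chunk, which is the R Markdown that would be knitted into the report.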
#### Lab

##### Notes

~~> **Error** in lab.~~

> **Question 1.** How can we save figures included in a report?

- [ ] Use R chunk options that save figures in a selected folder. Please note that the 'pdf' device doesn't work well with HTML documents. To avoid that but still get figures as PDF, you can set both devices, `c('png', 'pdf')`, as below.

```
knitr::opts_chunk$set(
  dev = c('png', 'pdf'),           # save figures in two formats
  fig.path = '../reports/figure/'  # save figures as files in this folder
)
```

> **Question 2.** What does "Please note that no contents under `renv` folder is needed to restore the R environment" mean?

- [ ] Sorry that it was not clear enough. It means that only the `renv.lock` file is needed to restore the R environment. For example, when we want to transfer a project to a new place, the files under the `renv` folder, which can be quite big, are not required. Just `renv.lock` and our scripts are enough to restore the R environment.

> **Question 3.** What are the differences between R Markdown and an R Notebook? Their definitions and uses look the same.

- [ ] An R Notebook is a special type of document that internally uses RMarkdown. R Notebooks are RStudio specific and they are a way of presenting and formatting RMarkdown.

### Module -- R code style guidelines

#### Lecture

- **Styler package** as an RStudio addin to style your code. https://www.tidyverse.org/blog/2017/12/styler-1.0.0/ The package allows you to style the code in Rmarkdown files and code chunks according to a selected formatting style.

### Module -- Functions and scripts

#### Lab

**Question 1.** I get this error when trying to run the R script on the command line:

`chmod +x terminal_script.R`
`Rscript terminal_script.R`

```
> zsh: command not found: Rscript
./terminal_script.R
> env: Rscript: No such file or directory
```

It's a new computer. Do I need to configure something in the hidden files of the system?

- [ ] It seems like Rscript is not installed. This is a bit strange as it should be installed with R.
Perhaps this link can help: https://stackoverflow.com/questions/38456144/rscript-command-not-found. Otherwise I would need to know more specifics, such as your operating system.

**Question 2.** I also get an error when running Rscript in the terminal:

```
dyld: Library not loaded: @rpath/libncurses.6.dylib
  Referenced from: /Users/tili/miniconda3/lib/libreadline.8.1.dylib
  Reason: image not found
Abort trap: 6
```

Is it because I am currently inside a conda base environment, so that I need to specify the path to the library?

- [x] By "also", do you mean that you are the asker of Question 1 as well? In that case it is probably the same error. Some googling tells me that this error is connected to conda. Some people report that a fix is to update/install R in conda. I am not sure what the newest version is there, but assuming they have the latest stable release:

```
conda install r-base
conda install -c r r=4.1.0
```

- I have updated r-base in my conda base environment and it still gives the error:

```
dyld: Library not loaded: @rpath/libncurses.6.dylib
  Referenced from: /Users/tili/miniconda3/lib/libreadline.8.1.dylib
  Reason: image not found
Abort trap: 6
```

- [x] This is hard to troubleshoot remotely, and it is a conda issue rather than an R issue per se. But according to https://github.com/conda/conda/issues/6183 you could try changing your libreadline.8.1.dylib to libreadline.6.2.dylib as the "newest R is still depending on readline 6.2 but conda will install 8.1 for you automatically.". Hope this works.

**Question 3.** "Let your script print the arguments. Run it with a few extra words or numbers and see what happens." What does "run it with a few extra words or numbers" mean?

- [ ] The idea here is to show how to pass arguments. You can create a script that only prints the passed arguments (as shown below) and call it, for instance, printingArguments.R.
```
#!/usr/bin/env Rscript
print(commandArgs(trailingOnly = TRUE))
```

- [ ] Then you can run it with extra arguments (for instance, run `Rscript printingArguments.R myInputFileName 22` from the console).

**Question 4.** When trying to run the R script in the console, I should use the hashbang line `#!/usr/bin/env Rscript`. Nothing happens when I run this line in the console. What does "run in console" actually mean?

- [x] We mean run in a terminal/command line, i.e. not in R. The way you access such a terminal depends on your OS, so feel free to contact us if you need more specific help. In R, indeed nothing happens when running this line, as it is interpreted as a comment (`#`).

**Followup question 2** I specify the library path inside the script but it still doesn't work.

- [x] Could you add your script and the command you are calling it with to the question, please?
- Here is the library path that I found in the R section:

```
.libPaths("/Users/tili/Desktop/course/RaukR/renv/library/R-4.0/x86_64-apple-darwin17.0")
```

- It still doesn't work if I save it and try to run it again in the terminal.
- [x] I need more information to help you with this. Send me a chat message in Zoom and we can talk in a breakout room so I can see your screen and what you are trying to do.

**Question 5.** I have written the script and got it to run in the console, but when I try to run it with optparse it keeps saying that the package is not installed, even though I installed it within R 10 minutes ago. Is there something else I need to do to pass the package to the command line?

- [ ] Check `.libPaths()` inside R; it probably returns two library paths. You can set your library path inside R with `.libPaths("/Users/..")` and include that call in your R script so it knows where to look for your library.
- [ ] It is not a question of "pass the package to the command line", rather the package needs to be loaded in the R script.
This is done in the example with `suppressPackageStartupMessages(require(optparse))`.
- [ ] If neither of the above works, we probably need to see your script and the error you get when you run it in order to troubleshoot. Ask for help in Zoom.

**Question 6.** About the practice on functions and scripts, exercise 4: I have tried including the commands `write(x,file=stderr())` or `write(x,file=stdout())` to get warning-related messages. The script starts running but it does not progress to produce the output. Is there something I might be doing wrong?

- [ ] This is perhaps not so well described, but the point is that in some cases you don't want the error handling you have built into your script to mess up the output you are after. You can therefore send any warnings or messages to stderr so they are seen in your terminal, rather than ending up in the output file. Here is an example from the script used in the exercise:

```
#!/usr/bin/env Rscript

# Read a single line (the mean) from standard input
input_con <- file("stdin")
open(input_con)
oneline <- readLines(con = input_con, n = 1, warn = FALSE)
close(input_con)

input_mean <- as.numeric(oneline)
mydata <- rnorm(1000, mean = input_mean)

# Assume I have performed some check and now feel it is necessary to warn the user.
# Writing to stderr keeps the warning out of the redirected output.
write("something is wrong", file = stderr())

# This will be output to file or forwarded in text stream
print(summary(mydata))
```

**Question 7.** I recently came across the function `sink()` to direct the output generated by a script to a given file. Is this also a good option to keep track of the progress of a job executed from R?

- [ ] It is a good idea to keep a log of what happened during a script run. `sink()` is one option. An alternative is `utils::capture.output()`, which can be more flexible in terms of output. If your script includes code for a progress bar, note that it often produces an unexpectedly long log file.
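The two logging options from Question 7 can be compared side by side. This is a minimal sketch using only base R; the temporary log file and the toy values are assumptions made for illustration.

```r
log_file <- tempfile(fileext = ".log")  # hypothetical log location

# Option 1: sink() redirects everything printed to stdout until switched off.
# Messages and warnings go to stderr and are not captured by default.
sink(log_file)
print(summary(c(1, 2, 3, 4)))
cat("step 1 finished\n")
sink()  # restore printing to the console

# Option 2: capture.output() collects the printed output of chosen
# expressions as a character vector, which can then be written anywhere.
captured <- utils::capture.output(print(summary(c(1, 2, 3, 4))))
cat(captured, file = log_file, sep = "\n", append = TRUE)

log_lines <- readLines(log_file)
```

With `sink()` the whole section between the two calls is logged, while `capture.output()` lets you pick individual expressions and decide later where their output goes.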
## Day 2

### General comments day 2

### Module -- Code debugging, optimization and

#### Lab

If you were not able to run **profr::ggplot.profr**, first define the following function and use it for the visualisation.

```
ggplot.profr <- function(data, ..., minlabel = 0.1, angle = 0) {
  if (!requireNamespace("ggplot2", quietly = TRUE))
    stop("Please install ggplot2 to use this plotting method")
  data$range <- diff(range(data$time))

  # quiet R CMD check note
  start <- NULL
  end <- NULL
  time <- NULL

  ggplot2::ggplot(as.data.frame(data)) +
    ggplot2::geom_rect(
      ggplot2::aes(xmin = start, xmax = end, ymin = level - 0.5, ymax = level + 0.5),
      fill = "grey95", colour = "black", size = 0.5) +
    ggplot2::geom_text(
      ggplot2::aes(start + range / 60, level, label = f),
      data = subset(data, time > max(time) * minlabel),
      size = 4, angle = angle, hjust = 0) +
    ggplot2::scale_y_continuous("time") +
    ggplot2::scale_x_continuous("level")
}
```

## Module -- R packages

### Lab

**Question 1.** When trying to look at the vignette using the "knit" button I get an error from the line "library(myPackage)" saying that "there is no package called myPackage" (but I can run "library(myPackage)" in the R console without problem). Googling suggests that this can happen when the function is not exported properly, but I have added the "@export" line in the R script.

- [x] This does seem strange. Did you install it using Cmd/Ctrl + Shift + B before knitting the vignette?
- No, I missed that step :) Works now, thanks!

**Question 2.** After trying to fix the notes about iris and head (by adding them in utility, and to Description, just like we did for reshape2), I cannot build the package anymore. It says "Error: object ‘iris’ is not exported by 'namespace:datasets'". I don't see why I would need to export iris?

- [ ] The iris dataset does introduce some hard-to-solve NOTES from check(). One way of solving this is taking the iris dataset and putting it into sysdata.rda instead, for use in your package.
You can also apparently call it using `datasets::iris` and `@import datasets` so your NAMESPACE gets it.

**Question 3.** Did anyone get the warning " WARNING ‘qpdf’ is needed for checks on size reduction of PDFs", and if so, how did you solve it?

- [ ] You typically get warnings when you are missing suggested packages. Try installing `qpdf` manually.

**Question 4** When trying to install the package after adding the C++ capability I keep getting the error:

```
Error: package or namespace load failed for ‘Day2TestPackage’ in library.dynam(lib, package, package.lib):
 shared object ‘typicalr.so’ not found
```

The documentation says I should run the following commands in order to fix this:

```
pkgbuild::compile_dll()
devtools::check()
```

However many times I run them, I keep getting the error.

- [ ] Try restarting RStudio and reloading your package. Otherwise, revert your package, delete everything the C++ integration added and start over with integrating Rcpp. It should work if you follow the instructions closely.

**Question 5.** In the lecture you mentioned that on GitHub it is better to make the repository public, because if you have too many private repositories you need to pay for them. Is there a limit on how many private repositories you can have? In this case, if you have code and output files from an unpublished dataset, how do you deal with that?

- [x] GitHub has changed its policy and now offers an unlimited number of private repos within a private [plan](https://github.com/pricing). However, I think a private repo cannot have more than 3 contributors or so within the Free plan. This keeps changing, so check the details using the above link.
- [x] Addendum: When I said you should have a public repo I was specifically talking about using GitHub Actions, which is only free for such repositories.
**Question 6.** I have the same error as in Question 4 above, and when trying to fix it I ended up with a new error that appears every time I run a devtools command:

```
Error in getDLLRegisteredRoutines.DLLInfo(dll, addNames = FALSE) :
  must specify DLL via a “DLLInfo” object. See getLoadedDLLs()
```

- [ ] Try restarting RStudio and reloading your package. Otherwise, revert your package, delete everything the C++ integration added and start over with integrating Rcpp. It should work if you follow the instructions closely.

## Day 3

### Module OOP

**Comment 1** It appears that `my_protein` cannot be created, as both Phosphorylation and Methylation are listed (it works fine with only one of them), while the validate function and the text specify that it should be only one of them.

- [x] Good point and good observation. Will altering the Validator solve the issue?

**Question 1** In the part on creating the .ext_protein S4 class, when I use the S4 class protein, why does the information inside protein (i.e. sequence, length etc.) appear under the prot slot instead of in the corresponding slots (i.e. sequence, length etc.)?

- [ ] Do you mean the result of `str(your_object)`?

## Module Vectorization

## Module Parallelisation

**Question 1.** When it comes to bootstrapping, what would be the best approach when using HPC resources? To use functions such as `future::plan`, or to run a single R script as many times as there are bootstrap replicates?

- [ ] If I got the question right, it depends a bit on the policy of that particular HPC. Typically, jobs using no more than N cores get higher priority. Sometimes a short execution time is preferred, in which case the parallel, futurized version would be the choice. I recommend reading your HPC guidelines and, if still in doubt, talking to the admins.
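To make the "futurized" option from the answer above more concrete, here is a minimal sketch of a bootstrap spread over a few workers with the `future` package. The toy data, the statistic (the mean), the number of workers and the chunking of replicates are all assumptions made for illustration, not a recommendation for any particular cluster.

```r
library(future)

set.seed(1)
x <- rnorm(200)   # toy data standing in for a real sample
n_boot <- 1000    # number of bootstrap replicates
n_workers <- 2    # adjust to what your HPC allocation allows

plan(multisession, workers = n_workers)

# One chunk of replicate indices per worker keeps the scheduling overhead low
chunks <- split(seq_len(n_boot), rep(seq_len(n_workers), length.out = n_boot))

# Each future resamples the data for its chunk; seed = TRUE requests
# parallel-safe random numbers
futures <- lapply(chunks, function(idx) {
  future({
    vapply(idx, function(i) mean(sample(x, replace = TRUE)), numeric(1))
  }, seed = TRUE)
})

# Collect the results (this blocks until all futures are resolved)
boot_means <- unlist(lapply(futures, value), use.names = FALSE)

plan(sequential)  # shut down the background workers

boot_se <- sd(boot_means)  # bootstrap standard error of the mean
```

The alternative, running the same script once per replicate as separate jobs, avoids any parallel code but pays the R start-up and queueing cost for every replicate, which is why the site's scheduling policy matters.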
**Question 2** I seem to be getting exactly the same time when I run plan(sequential) or plan(multicore) in RStudio, and when I run in R directly, plan(multicore) takes 4x longer than plan(sequential). This seems to be the opposite of the behaviour I would expect?

- [ ] A speed-up is what you would expect for highly CPU-intensive tasks. When you request `plan(multicore)`, your OS has to spend some overhead time to distribute the jobs to multiple cores and then gather them back. This takes some, typically fixed, time. However, if the tasks you are running are simple, the overhead time may actually be greater than the parallelization gain, and thus the sequential approach will be faster. Note also that forked processing (`multicore`) is not supported when running R inside RStudio, where `future` falls back to sequential processing, which would explain the identical timings there.

**Question 3** Every time I want to use the future package for parallelisation, do I need to run `plan(multisession)` in each code chunk?

- [ ] No, you set the plan once per R session, unless you want to change it throughout your code.

**Question 4** I tried to run `plan(sequential)` using ggplot and tidyverse. The code uses some tables and data from my computer but I show it to give you a better idea. Same thing happens with multicore, etc.
```{r}
plan(sequential)
proc.time(
SiO4 %<-% {
  NUT_KS %>%
    group_by(DATE) %>%
    summarise(meanSiO4=mean(SiO4)) %>%
    ungroup() %>%
    ggplot(aes(DATE, meanSiO4)) +
    geom_vline(xintercept = BA_DATES, color = "black", size=.3) +
    geom_point(size=2)+
    theme_bw() + THEME + DATES+
    xlab("Month") +
    ylab(expression(paste("SiO"[4]," (µM)"))) +
    scale_y_continuous(breaks = seq(0, 30, by = 5))
}
NO3 %<-% {
  NUT_KS %>%
    group_by(DATE) %>%
    summarise(meanNO3=mean(NO3)) %>%
    ungroup() %>%
    ggplot(aes(DATE, meanNO3)) +
    geom_vline(xintercept = BA_DATES, color = "black", size=.3) +
    geom_point(size=2)+
    theme_bw() + THEME + DATES +
    xlab("Month") +
    ylab(expression(paste("NO"[3],"+NO"[2]," (µM)"))) +
    ylim(0,4)}
#evaluate futures by requesting outcome values
NO3
)
```

The error is `Error: unexpected symbol in: " #evaluate futures by requesting outcome values NO3"`

- [ ] This error seems to have to do with the comment inside of proc.time() near the end. Could you remove the proc.time() wrap and just try running the SiO4 and NO3 futures, for example? What is the error then? Note that `proc.time()` takes no arguments; to time a block of code, `system.time({ ... })` is the usual tool.
- [ ] PS: I tried running ggplot inside a future and it worked, so my initial guess that this might not be supported in a future environment didn't pan out.

## Module Python and R

[Link to files for the lab.](https://www.dropbox.com/sh/7ew796jgpbv0sth/AADbWrMyoSPJRvUh4tnqM5_Sa?dl=0)

**Question:** How can one check if the environment is active?

- [x] One way is to use the `py_discover_config()` function; there you can see from which location your packages are loaded.

## Day 4

### Module Tidyverse

[Link to files for the general lab.](https://www.dropbox.com/s/dxxt959g6iefeaf/tidyverse_lab_data.zip?dl=0)

[Link to files for the Nanopore channel activity lab.](https://www.dropbox.com/s/dxxt959g6iefeaf/tidyverse_lab_data.zip?dl=0)

**Question:** When and why do you need to use `` ` ` `` in tidyverse code? E.g.:

```
N <- rnorm(16) %>%
  matrix(ncol = 4) %>%
  `colnames<-`(letters[1:4]) %>%
  summary()
```

- [ ] If an object (e.g.
data.frame, function) has a syntactically valid name, then there is no need to surround it with backticks. Otherwise, e.g. `colnames<-`, `[`, `[<-`, ``df$`Age of sample` ``, it should be surrounded by them.

And why does `letters[1:4]` need to be in `()`?

- [ ] The outcomes of the following three examples are identical. Because the function that modifies column names, `colnames<-`, is defined like any other function, its arguments need to be passed within `()`.

```
a <- a %>% `colnames<-`(letters[1:4])
a <- `colnames<-`(a, letters[1:4])
colnames(a) <- letters[1:4]
```

The function `colnames<-` was likely defined as below.

```
`colnames<-` <- function (x, value) {
  if (is.data.frame(x)) {
    names(x) <- value
  }
  ...
  x
}
```

or

```
setMethod(
  f = "colnames<-",
  signature(x = "data.frame"),
  function(x, value) {
    ...
  }
)
```

**Question 2** In section 2.1.4, when rewriting `P <- M %x% t(N)`: why is the `t()` not needed in the solution?

- [ ] Matrix multiplication `A %*% B` works only when `ncol(A) == nrow(B)`. So, if you want to multiply a non-square matrix by itself you need `A %*% t(A)`. Here, M and N are square, so if you skip `t()` the multiplication can be done anyway... (Note that in base R `%x%` is the Kronecker product, which is defined for matrices of any dimensions; ordinary matrix multiplication is `%*%`.)

## Module `ggplot`

**y-Axis alignment of two separate plots**

```
library(ggplot2)
library(magrittr)  # provides %>%
library(gridExtra)

p1 <- iris %>% ggplot(aes(x=Sepal.Length,y=Sepal.Width)) + geom_point()
p2 <- iris %>% ggplot(aes(x=Sepal.Length,y=log10(Sepal.Width))) + geom_point()
grid.arrange(p1,p2,ncol=2,widths=c(3,3))

## ===============
## aligning y axis
## ===============
## arranging both plots
limits <- c(0,5)
breaks <- seq(limits[1], limits[2], by=1)

# assign common axis to both plots
p1.common.y <- p1 + scale_y_continuous(limits=limits, breaks=breaks)
p2.common.y <- p2 + scale_y_continuous(limits=limits, breaks=breaks)

# At this point, they have the same axis, but the axis lengths are unequal, so ...
# build the plots
p1.common.y <- ggplot_gtable(ggplot_build(p1.common.y))
p2.common.y <- ggplot_gtable(ggplot_build(p2.common.y))

# copy the plot height from p1 to p2
p2.common.y$heights <- p1.common.y$heights
grid.arrange(p1.common.y,p2.common.y,ncol=2,widths=c(3,3))
```

I found the `ggbreak` package, which may be relevant for axis breaks: https://cran.r-project.org/web/packages/ggbreak/vignettes/ggbreak.html. It is a very nice package and is still under development.

* [ ] This is good to know :) Thanks!

**Question about shiny app** In the lab section, a tab panel is defined with the following code: `tabPanel("tab3", wellPanel(helpText("Well Panel")))`. Is `wellPanel` a function that was defined previously?

* [x] No, it is a function from shiny. `?shiny::wellPanel`

## Rmarkdown presentation

https://raukr-boost-rmd-skills.netlify.app/?panelset7=in-my_script.r2&panelset8=my_script.r2&panelset9=template.rmd2&panelset10=rmd-code2&panelset11=previous-css2&panelset12=style.sass2&panelset13=support-of-latex2#1

## Project ideas

### Exploratory data analyses

https://github.com/Sebastian-D/ExploreData

All projects under this category aim at analysing and visualising data from one of the following datasets:

**1/ AIS Data** -- positions and some other characteristics of vessels around Visby harbor collected during RaukR2019. Data similar to marinetraffic.com

Interested:

* Ona Deulofeu (Second opt)

**2/ ADS-B data** -- positions and some other characteristics of the airplanes flying over the Visby area during RaukR2019. Data similar to the one displayed in flightradar24.com

Interested:

**3/ Genomic data** from one human chromosome in a number of individuals coming from different populations.

Interested:

* Mónica Angulo
* Marika Oksanen
* Astradeni Efthymiadou
* Zhuang Liu (First opt)

**4/ Heart attack analysis dataset** The dataset by Rashik Rahman was obtained from Kaggle, where it is distributed under the CC0: Public Domain license.
*About this dataset*

* Age: age of the patient
* Sex: sex of the patient
* exang: exercise induced angina (1 = yes; 0 = no)
* ca: number of major vessels (0-3)
* cp: chest pain type
    * Value 1: typical angina
    * Value 2: atypical angina
    * Value 3: non-anginal pain
    * Value 4: asymptomatic
* trtbps: resting blood pressure (in mm Hg)
* chol: cholesterol in mg/dl fetched via BMI sensor
* fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* rest_ecg: resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalach: maximum heart rate achieved
* target: 0 = less chance of heart attack, 1 = more chance of heart attack

https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset

Interested:

### Graphics

Develop a package for plotting medieval style maps in R. Some initial code will be provided.

Interested:

* Mónica Angulo
* Zhuang Liu (Second opt)

### OOP

Use OOP, preferably R6 classes, to make an epidemiological simulation of, e.g. effects of vaccinations on covid-19.

Interested:

* Alvaro Sanchez
* Tianyi Li

## Your own project ideas

### Tim

First: Look [here](https://imgur.com/a/WDLoapt) :)

This is a picture of the way chilli peppers are shared by different regions across the globe, based on the genetic analysis of the samples stored in genebanks. It records cool things like the Age-of-discovery transatlantic trade routes, the silk road, and the very underappreciated uniqueness of chilli peppers grown and bred in Africa. The method (which will fingers-crossed be published very soon :) ) is called ReMIXTURE and currently consists of a few hundred lines of somewhat sloppy R code.
It starts with just an all-vs-all distance matrix between samples, and produces plots like in the image, which (in super-simplified terms) show the "overlappyness" of one region with others as lines of varying strength. I would love to work with a group to implement it using R6 OOP, build a package, and optimise the code a bit ... I anticipate this would involve building a ReMIXTURE class with methods like 'initialise with data', 'run ReMIXTURE', and 'produce plots'. Maybe even convenience wrappers to do things like PCA on the input data. And definitely if we were fortunate enough to produce something meeting CRAN standards, we could write an application note or some other small manuscript to submit somewhere for publication :)

**Interested**: Frederik; Pavlo H

------ Linnéa ------

I would like to extend the "genomic data" project mentioned above to a dataset where we have more sample information. I have SNP data from multiple individuals from a small, recently founded population (my "focal" population), together with multiple individuals from a larger nearby population (the "source" population). For my focal data I have a lot of additional information, including sampling year, inbreeding information, founding individuals etc. I would like to merge all this information in some tidy way, compare the two populations (for example using heterozygosity or frequency of derived alleles) and ideally visualize it in some fancy way using ggplot.

**Interested:** Linnéa, Cecilia T (second opt)

------- Detect highly divergent genomic regions ------

The evolution of suppressed recombination is key to the maintenance of haplotypes that control discrete yet complex polymorphisms. One of the consequences of long-term suppressed recombination is the divergence of the non-recombining genomic regions. Indeed, the identification of genomic regions showing high divergence and nucleotide polymorphism is commonly used to identify sex chromosomes and other supergenes.
In this project, I would like to develop a package capable of handling large data sets with windowed estimates of nucleotide diversity (pi) and divergence (dxy), doing statistical analyses to test for significant differences, and plotting the results. I intend this to be an individual project, but I am also interested in learning how to use collaborative tools. If someone is also planning to work individually, we could exchange packages at some point and give feedback to each other to make it more interactive.

**Suggested by:** Juanita.

------- Grouping patients (HIV-1 and COVID) based on viral load or CD4 counts or CD8 counts ------- **Jamirah**

Background: When one gets infected with a virus (in the absence of treatment), it takes a few days or months to show symptoms or actually develop the disease. Some people take just days or weeks to show symptoms but others take longer. This has been attributed to the different temporal dynamics of CD4 (immune cells that fight the disease) within individuals, but also to the viral load dynamics (amount of virus in the body).

Aim: Visualize and group patients into (a) fast and slow progressors (based on CD4); (b) viral suppressors and non-viral suppressors (based on viral load)

Data: 50 individuals infected with HIV/Hepatitis/CMV (focus on HIV) have been followed for 2-3 years. CD4 levels and viral load have been measured at least 3-10 times a year. The sample collection times are different for each patient but the estimated date of infection is available. Genomics and proteomics data have been collected to assess associations (will not be included in this exercise)

Methods (subject to change):

1. Visualise the data (CD4 and viral load over time) - **ggplot**
2. Use non-linear or linear regression with some smoothing function to get predicted CD4 or viral load values for the same days for all patients. No extrapolation, hence the length of the vectors will still be unequal.
   *I guess the CD4 or viral load values have to be normally distributed* **tidy models**
3. Generate cluster profiles based on Euclidean distances (as an example) for the predicted CD4 or viral load values, based on either shape or levels. **hierarchical clustering**
4. Visualize the groups **ggplot**

Note. It would be really nice to develop this in a way that it can be used on other datasets, i.e. it is a common task in my group and for other people that we collaborate with.

***Interested:***

* Julius Lautenbach
* Jennie Olofsson
* Zoé Pochon
* Ona Deulofeu

----------- Interactive tool to explore transcriptome data -----------

We will build an interactive tool to explore the expression of genes and miRNAs in the embryogenesis of Norway spruce. The tool will enable easier sharing and mining of the data by collaborators and will later be published alongside the manuscript as a data resource. Therefore we would like to keep the project private.

Interested: Kristina Benevides, Katja Stojkovic

------- Shiny + Computer Vision ------

This project is a proof of concept. It has no real use apart from learning. My idea is to create a Shiny app that takes as input a drawing made with the mouse on a plot, [something like this](https://stackoverflow.com/questions/41701807/way-to-free-hand-draw-shapes-in-shiny). You will literally draw a letter or a phrase. Then, by using the MNIST dataset consisting of 70,000 pictures of handwritten digits, we'll train a model to recognize what has been drawn as input. This will be the first part, and achieving it will already be a success. We will also implement a 'console' in shiny, so if you draw `print('Hello world')` the shiny app will literally do that. It would be cool to implement more complex stuff such as plots. This is a challenging project, so be prepared to suffer.

**Suggested by:** Alvaro

------- Yeast growth/performance analysis ------

I have collected growth/performance data from 25 yeast strains grown in 30 different conditions.
I would like to analyze the growth curves and the performance of every single strain and then put all the strains together in a large dataset. Subsequently, I would like to use that dataset for data analysis, plots, and everything cool that we can think of. Basically a comparison of strains and conditions and any interesting trends. It is sensitive information so I am not sure what the protocol is here, but it would be so cool to work on it! I have a starting script but, compared to what we have learned this week, it can be improved by 200%.

**Suggested by:** Cecilia T

Interested:

----- Interactive tool to select genetic markers for fish stock identification -----

I plan to build a shiny app that allows the user to easily identify a set of genetic markers that best discriminate populations of interest, out of a tested SNP panel derived from whole-genome scans. The app will include a map showing reference locations, summary statistics and graphics (PCA, Manhattan plot, others) that will help inform the user whether the output marker set is informative enough. The plan is to include this tool in an upcoming publication, thus the data is private.

**Interested:** Angela

------- opls package with permuted variable selection ------

I have written a package that uses the Bioconductor package ropls to produce multivariate OPLS models, including variable selection using correlations (p(corr)) and VIP. My package also selects the number of orthogonal components in the models, optimizing the performance of the model after variable selection. The robustness of the models is investigated by permutations before variable selection, followed by variable selection for every permutation, resulting in p-values for R2 and Q2 over the variable selection procedure. HTML reports are generated for all group comparisons, including optional stratification by a secondary ID, showing score plots, loading plots and ROC curves together with model statistics and permutation tests.
A summary HTML report is also generated for 5 different models, including strategies suited for finding potential therapeutic targets as well as for selecting variables for pathway analysis. The project is for now in a private GitHub repository. I am writing a publication using the script on several datasets, which is why I prefer to do this project by myself. The plan is to implement the package format we learned and use best coding practices. It would be great to work with someone who is also working on an individual project.

**Suggested by:** Marika Strom

-------- Exploratory data analysis project ----------

----- Suggested by Mina Ali ---

Alcohol-related liver disease (ALD) is caused by damage to the liver from long-term excessive drinking and is a major cause of chronic liver disease worldwide. For this project, we aim to understand the effect of genetic variants on lipid metabolism in the context of ALD development. For this aim, we have quantified 198 annotated lipids in plasma from 301 consecutive patients with a history of harmful drinking, and gathered clinical information like age, gender, disease stage and BMI. We also have the genotype information of these patients for 17 genetic variants that have been associated with liver disease in previous GWASes.

**Methods:** Comparison between lipid levels will be done by analysis of variance (ANOVA) and analysis of covariance (ANCOVA). Genotype-phenotype associations will be analyzed with multiple linear models. ggplot will be used for visualization. The generated pipeline might be turned into an R package for further use. This is an unpublished project and only part of the data will be shared.

**Interested:**
