GRN workshop(Idea HUB)

# GRN workshop(Idea HUB) ## Outline >Workshop will have reserved compute nodes (36CPU, 105GB) and the slurm scripts will have `#SBATCH --reservation grnsims` . This will land all the slurm jobs on the reserved nodes .i.e. no queue 1. Introduction a. Why are simulations important? b. What are Gene Regulatory Networks? c. Simulating Gene Regulatory Networks d. A (brief) introduction to the Stochastic Simulation algorithm 2. Getting started with sismonr a. Introduction to the sismonr package b. Workshop’s challenge: modelling the anthocyanin biosynthesis regulation pathway in eudicots c. Running a first simulation (interactive) d. Why is scaling-up important 3. Scaling up your work a. Introduction to HPC b. Introduction to slurm scheduler and directives c. First slurm job d. Assessing time and memory usage e. slurm profiling f. (provisional) running more than 2 jobs 4. Parallel computing on NeSI (DAY 2) a. Introduction to parallel computing b. How to execute parallel code on NeSI 5. Post-processing a. Checking that the job array worked b. Reading the data c. Some visualisation 6. Supplementary a. NeSI Mahuika Jupyter login b. Opening a Juptyer kernel (sismonr/R-.1.0) and Jupyter terminal c. Local setup >NOTES: * single simulation = 10minutes (1CPU, 0.5GB of memory) * 500 simulations 500CPUs, 250GB of memory per group ### 3. Scaling up :examples 1. First Slurm job - Not using sismonr 2. Looking at efficiency stats with `nn_seff` 3. SLURM profiling (intro with a short script -Not using sismonr) 4. Running more than one job - 2 separate slurm jobs ( < 8 minutes) 5. Introduction to Job arrays * array slurm directive --array * First array example - THIS IS SISMONR (< 8 minutes runtime) * Second array example - WITH SLURM PROFILE (NOW THIS CAN BE A TRUE RUN USING THE SAME SIMULATION FROM DAY 2) 6. Running the 500/250 arrays - PLEASE MAKE SURE TO REMOVE `--profile task` # Olivia's webinar ## New code - 1 network, 10,000 simulations of different individuals The idea here is that the simulations of different *in silico* individuals for a same network should take approx. the same time to run. I saw that when I was doing my own scaling, the running time of each network increased linearly with the number of individuals simulated, so the running time per individual should be not too different between individuals. ### Non-parallel, generate 1 network This one should take a couple of minutes, it generates 1 network that we will use for the rest of the simulations. The system will be saved as `/nesi/nobackup/massey02963/data/09_2020_simulations/small_system.RData` ```{r} ## --------------------------------------------------------------------------------------------------------------------------------## ## R script to generate the network to be simulated ## To save as: /nesi/nobackup/massey02963/code/generate_one_network.R ## --------------------------------------------------------------------------------------------------------------------------------## library(sismonr) library(tidyverse) #core_id = as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID")) nbGenes = 20 ## number of genes in a system nReg = 10 ## number of regulators among the genes (TC or other) posReg = 0.8 ## probability of a regulatory edge to be positive regulation ## Where to save the generated networks outfolder = "/nesi/nobackup/massey02963/data/09_2020_simulations/" system(paste("mkdir -p", outfolder)) while(TRUE){ sysargs = insilicosystemargs(G = nbGenes, PC.p = 1, ## all protein-coding genes PC.MR.p = 1, ## proba of each gene to be a target (for now 1, we'll select the regulators later) TC.pos.p = posReg, ## proba of each edge to be positive TC.PC.autoregproba = 0, ## no auto-regulation TL.PC.autoregproba = 0, ## no auto-regulation RD.PC.autoregproba = 0, ## no auto-regulation PD.PC.autoregproba = 0, ## no auto-regulation PTM.PC.autoregproba = 0, ## no auto-regulation TC.PC.twonodesloop = F, ## no two-nodes loop regcomplexes = "none") ## no regulatory complexes genes = createGenes(sysargs) # Select the TC regulators sampled_reg = sample(1:nbGenes, nReg, replace = FALSE) genes[sampled_reg, "TargetReaction"] = "TC" monw = createMultiOmicNetwork(genes, sysargs) mysystem = c(monw, list("sysargs" = sysargs)) attr(mysystem, "class") = "insilicosystem" ## make sure that there is more than 1 edge in the network if(nrow(mysystem$edg) >= 1){ break } } mypop = createInSilicoPopulation(1, mysystem, ngenevariants = 100, qtleffect_samplingfct = function (x){ truncnorm::rtruncnorm(x, a = 0, b = Inf, mean = 1, sd = 0.05)}, initvar_samplingfct = function (x){ truncnorm::rtruncnorm(x, a = 0, b = Inf, mean = 1, sd = 0.05)}) save(mysystem, mypop, file = paste0(outfolder, "small_system.RData")) ``` ### R script to generate and simulate 1 *in silico* individual ```{r} ## --------------------------------------------------------------------------------------------------------------------------------## ## R script to generate and simulate the in silico individuals ## To save as: /nesi/nobackup/massey02963/code/simulated_diff_ind.R ## --------------------------------------------------------------------------------------------------------------------------------## library(sismonr) library(tidyverse) core_id = as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID")) ## Where to save the generated networks/individuals outfolder = "/nesi/nobackup/massey02963/data/09_2020_simulations/" load(paste0(outfolder, "small_system.RData")) ## Create a new in silico individual mynewpop = createInSilicoPopulation(1, mysystem, genvariants = mypop$GenesVariants) # save(mysystem, mynewpop, file = paste0(outfolder, "sim", core_id, "_small_system.RData")) ## Choosing port for Julia process port = 1233 + core_id ## checking existing connections excon = showConnections()[,"description"] ## existing open connections excon = excon[stringr::str_detect(excon, "->localhost:")] excon = as.integer(stringr::str_replace(excon, "->localhost:", "")) while(port %in% excon){ port = 2*port } newJuliaEvaluator(port = as.integer(port)) sim = simulateInSilicoSystem(mysystem, mynewpop, simtime = 2000, ntrials = 50) save(mysystem, mynewpop, sim, file = paste0(outfolder, "sim", core_id, "_simulations.RData")) ``` ### SLURM script to run the code above (parallel) We can have a chat about how to testimate the running time and memory requirements, my guess is you run 10 or 20, and that should give an estimate. ``` ## --------------------------------------------------------------------------------------------------------------------------------## ## SLURM script to generate and simulate the individuals (parallel) ## --------------------------------------------------------------------------------------------------------------------------------## #!/bin/bash -e #SBATCH --job-name=sim_on_ind # job name (shows up in the queue) #SBATCH --time=00:20:00 # TO EVALUATE #SBATCH --mem=512MB # NO CLUE EITHER #SBATCH --array=1-10000 # Array jobs #SBATCH --output sim_on_ind_%a.out # Include the job ID in the names of #SBATCH --error sim_on_ind_%a.err # the output and error files ml R/3.6.1-gimkl-2018b ml Julia/1.1.0 srun Rscript /nesi/nobackup/massey02963/code/simulated_diff_ind.R ``` ## Old code ### Parallel version of step one ```{r} ## --------------------------------------------------------------------------------------------------------------------------------## ## R script to generate the networks to be simulated (parallel version) ## To save as: /nesi/nobackup/massey02963/code/generate_nw_parallel.R ## --------------------------------------------------------------------------------------------------------------------------------## library(sismonr) library(tidyverse) core_id = as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID")) nbGenes = 20 ## number of genes in a system nReg = 10 ## number of regulators among the genes (TC or other) posReg = 0.8 ## probability of a regulatory edge to be positive regulation N = 2000 ## Number of in silico individuals to simulate altReg = c("TL", "RD", "PD", "PTM") ## Alternative regulation in the system (besides transcription) nbAltReg = c(0, 3, 5, 7) # Number of alternative regulators - corresponds to fraction 0.3 0.5 0.7 ## Where to save the generated networks outfolder = "/nesi/nobackup/massey02963/data/systems_10000/" system(paste("mkdir -p", outfolder)) generate_insilico_system = function(altreg, ## alternative regulation in the network (i.e. non TC) naltreg, ## nb of alternate regulators (i.e. non TC) outpath, ## where to save the output file outname){ ## name of the output file while(TRUE){ sysargs = insilicosystemargs(G = nbGenes, PC.p = 1, ## all protein-coding genes PC.MR.p = 1, ## proba of each gene to be a target (for now 1, we'll select the regulators later) TC.pos.p = posReg, ## proba of each edge to be positive TC.PC.autoregproba = 0, ## no auto-regulation TL.PC.autoregproba = 0, ## no auto-regulation RD.PC.autoregproba = 0, ## no auto-regulation PD.PC.autoregproba = 0, ## no auto-regulation PTM.PC.autoregproba = 0, ## no auto-regulation TC.PC.twonodesloop = F, ## no two-nodes loop regcomplexes = "none") ## no regulatory complexes genes = createGenes(sysargs) # Select the TC regulators sampled_reg = sample(1:nbGenes, nReg - naltreg, replace = FALSE) genes[sampled_reg, "TargetReaction"] = "TC" # Select the alternative regulators given naltreg sampled_reg = sample(1:nbGenes, naltreg, replace = FALSE) genes[sampled_reg, "TargetReaction"] = altreg monw = createMultiOmicNetwork(genes, sysargs) mysystem = c(monw, list("sysargs" = sysargs)) attr(mysystem, "class") = "insilicosystem" ## make sure that there is more than 1 edge in the network if(nrow(mysystem$edg) >= 1){ break } } mypop = createInSilicoPopulation(N, mysystem, ngenevariants = 100) save(mysystem, mypop, file = paste0(outpath, outname, "_system.RData")) return(mysystem) } nrep = 10000 / 400 ## how many simulations we want to run (each core will generate 25 output) for(i in 1:nrep){ mysystem = generate_insilico_system(altreg = sample(altReg, 1), naltreg = c(0, 3, 5, 7), outpath = outfolder, outname = paste0("rep", i + 25 * (core_id - 1))) } ``` ### SLURM script to generate the networks (parallel) ``` ## --------------------------------------------------------------------------------------------------------------------------------## ## SLURM script to generate the networks (parallel) ## --------------------------------------------------------------------------------------------------------------------------------## #!/bin/bash -e #SBATCH --job-name=generate_nw # job name (shows up in the queue) #SBATCH --time=00:20:00 # Walltime (HH:MM:SS) 20 min should be more than enough #SBATCH --mem=512MB # Memory don't think you need more #SBATCH --array=1-400 # Array jobs #SBATCH --output generate_nw_%a.out # Include the job ID in the names of #SBATCH --error generate_nw_%a.err # the output and error files ml R/3.6.1-gimkl-2018b ml Julia/1.1.0 srun Rscript /nesi/nobackup/massey02963/code/generate_nw_parallel.R ``` ### R script to generate the networks to be simulated (To run before the simulations) ```{r} ## --------------------------------------------------------------------------------------------------------------------------------## ## R script to generate the networks to be simulated ## To run before the simulations, don't know how long it would take ## --------------------------------------------------------------------------------------------------------------------------------## library(sismonr) library(tidyverse) nbGenes = 20 ## number of genes in a system nReg = 10 ## number of regulators among the genes (TC or other) posReg = 0.8 ## probability of a regulatory edge to be positive regulation N = 2000 ## Number of in silico individuals to simulate altReg = c("TL", "RD", "PD", "PTM") ## Alternative regulation in the system (besides transcription) nbAltReg = c(0, 3, 5, 7) # Number of alternative regulators - corresponds to fraction 0.3 0.5 0.7 ## Where to save the generated networks outfolder = "/nesi/nobackup/massey02963/data/systems_10000/" system(paste("mkdir -p", outfolder)) generate_insilico_system = function(altreg, ## alternative regulation in the network (i.e. non TC) naltreg, ## nb of alternate regulators (i.e. non TC) outpath, ## where to save the output file outname){ ## name of the output file while(TRUE){ sysargs = insilicosystemargs(G = nbGenes, PC.p = 1, ## all protein-coding genes PC.MR.p = 1, ## proba of each gene to be a target (for now 1, we'll select the regulators later) TC.pos.p = posReg, ## proba of each edge to be positive TC.PC.autoregproba = 0, ## no auto-regulation TL.PC.autoregproba = 0, ## no auto-regulation RD.PC.autoregproba = 0, ## no auto-regulation PD.PC.autoregproba = 0, ## no auto-regulation PTM.PC.autoregproba = 0, ## no auto-regulation TC.PC.twonodesloop = F, ## no two-nodes loop regcomplexes = "none") ## no regulatory complexes genes = createGenes(sysargs) # Select the TC regulators sampled_reg = sample(1:nbGenes, nReg - naltreg, replace = FALSE) genes[sampled_reg, "TargetReaction"] = "TC" # Select the alternative regulators given naltreg sampled_reg = sample(1:nbGenes, naltreg, replace = FALSE) genes[sampled_reg, "TargetReaction"] = altreg monw = createMultiOmicNetwork(genes, sysargs) mysystem = c(monw, list("sysargs" = sysargs)) attr(mysystem, "class") = "insilicosystem" ## make sure that there is more than 1 edge in the network if(nrow(mysystem$edg) >= 1){ break } } mypop = createInSilicoPopulation(N, mysystem, ngenevariants = 100) save(mysystem, mypop, file = paste0(outpath, outname, "_system.RData")) return(mysystem) } nrep = 10000 ## how many simulations we want to run for(i in 1:nrep){ mysystem = generate_insilico_system(altreg = sample(altReg, 1), naltreg = c(0, 3, 5, 7), outpath = outfolder, outname = paste0("rep", i)) } ``` ### R script to run the simulations for a subset of 50 individuals (scaling) ```{r} ## --------------------------------------------------------------------------------------------------------------------------------## ## R script to run the simulations for a subset of 50 individuals (scaling) ## To save as: /nesi/nobackup/massey02963/code/scaling_10000/simulation_ev.R ## --------------------------------------------------------------------------------------------------------------------------------## library(sismonr) library(stringr) core_id = as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID")) ## get the input file for the corresponding core file = list.files("/nesi/nobackup/massey02963/data/systems_10000/")[core_id] output = str_replace(file, "system", "simulation") ## Choosing port for Julia process port = 1233 + core_id ## checking existing connections excon = showConnections()[,"description"] ## existing open connections excon = excon[stringr::str_detect(excon, "->localhost:")] excon = as.integer(stringr::str_replace(excon, "->localhost:", "")) while(port %in% excon){ port = 2*port } newJuliaEvaluator(port = as.integer(port)) ## load the system load(paste0("/nesi/nobackup/massey02963/data/systems_10000/", file)) smallpop = mypop smallpop$individualsList = smallpop$individualsList[1:50] ## pick the first 50 individuals sim = simulateInSilicoSystem(mysystem, smallpop, simtime = 2000) ## save(sim, file = paste0("/nesi/nobackup/massey02963/data/output/simulations_10000_scaling/", output)) ``` ### SLURM script to run the simulations for a subset of 50 individuals (scaling) ``` ## --------------------------------------------------------------------------------------------------------------------------------## ## SLURM script to run the simulations for a subset of 50 individuals (scaling) ## --------------------------------------------------------------------------------------------------------------------------------## #!/bin/bash -e #SBATCH --job-name=testing_ind50 # job name (shows up in the queue) #SBATCH --time=00:30:00 # Walltime (HH:MM:SS) not sure if 30min is too much or too little #SBATCH --mem=512MB # Memory #SBATCH --array=1-260 # Array jobs #SBATCH --output testing_ind10_%a.out # Include the job ID in the names of #SBATCH --error testing_ind10_%a.err # the output and error files ml R/3.6.1-gimkl-2018b ml Julia/1.1.0 srun Rscript /nesi/nobackup/massey02963/code/scaling_10000/simulation_ev.R ``` ### R script to run the simulations ```{r} ## --------------------------------------------------------------------------------------------------------------------------------## ## R script to run the simulations ## To save as: /nesi/nobackup/massey02963/code/sismonr_sim_10000.R ## --------------------------------------------------------------------------------------------------------------------------------## library(sismonr) library(stringr) core_id = as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID")) ## get the input file for the corresponding core file = list.files("/nesi/nobackup/massey02963/data/systems_10000/")[core_id] output = str_replace(file, "system", "simulation") ## Choosing port for Julia process port = 1233 + core_id ## checking existing connections excon = showConnections()[,"description"] ## existing open connections excon = excon[stringr::str_detect(excon, "->localhost:")] excon = as.integer(stringr::str_replace(excon, "->localhost:", "")) while(port %in% excon){ port = 2*port } newJuliaEvaluator(port = as.integer(port)) ## load the system load(paste0("/nesi/nobackup/massey02963/data/systems_10000/", file)) sim = simulateInSilicoSystem(mysystem, mypop, simtime = 2000) save(sim, file = paste0("/nesi/nobackup/massey02963/data/output/simulations_10000/", output)) ## comment out if we don't want to save the output ``` ### SLURM script to run the simulations ``` #!/bin/bash -e #SBATCH --job-name=sismonr_sim # job name (shows up in the queue) #SBATCH --time=12:00:00 # Walltime (HH:MM:SS) the estimated running time would be around 9h #SBATCH --mem=10GB # Memory #SBATCH --array=1-10000 # Array jobs #SBATCH --output sismonr_sim_%a.out # Include the job ID in the names of #SBATCH --error sismonr_sim_%a.err # the output and error files ml R/3.6.1-gimkl-2018b ml Julia/1.1.0 srun Rscript /nesi/nobackup/massey02963/code/sismonr_sim_10000.R ```