# Tutorial: Running Horse ROH Scripts

## Starting off

What you need:
- 1 MAP and 1 PED file with the horses of the breed you are looking at.
    - e.g., morgpruned.map and morgpruned.ped
- The R script you are using to run your ROH analysis.
    - The one used as an example in this tutorial can be found on Farm at `/group/ctbrowngrp/finnolab_shared/ROH` and is named `morgans_skrypto_Jan25.R`

If the horse MAP and PED files you are using haven't been pruned yet, run the following on the command line (plink must be installed):
```
plink --file bbigdata --allow-extra-chr --chr-set 32 --keep morg.txt --out morg --recode
plink --file morg --allow-extra-chr --chr-set 32 --geno 0.1 --maf 0.001 --out morgpruned --recode
```
Make sure to prune out the X chromosome, which is coded as chromosome 33 here (substitute your own file names for `cl2pruned` and `cl3pruned`):
```
plink --allow-extra-chr --chr-set 32 --file cl2pruned --out cl3pruned --recode --not-chr 33
```
Now download the pruned MAP and PED files and put them together in one folder.

Next, in a text editor, change the first column in the PED file to the breed abbreviation for each horse, for example MG for Morgan.
- The first column will now be the breed, and the second column will remain the horse ID.

## Running ROH on your home computer

Now open the ROH script in RStudio. Once it is open, the first thing to do is set the folder with your MAP and PED files as your working directory.
- For example, run in the RStudio console:
```
setwd("C:/Users/slvb9/OneDrive/Desktop/extdata")
```

### Rscript parameters

Before running the script, update the PED and MAP file names and the breed abbreviation in the R script so they match your data.

Next, make sure the parameters fit the data you are running.
##### For whole genome sequence data:
- minSNP = 30, ROHet = FALSE, maxGap = 10^6, minLengthBps = 100000, maxOppRun = 1, maxMissRun = 5

##### For microarray data:
- minSNP = 30 (anywhere from 30 to 50 is fine), ROHet = FALSE, maxGap = 10^6, minLengthBps = 10^6, maxOppRun = 0, maxMissRun = 5

Keep running the commands in the script until you get to

`` quantile(summaryList$SNPinRun$Count, probs = c(.5, .9, .99)) ``

Running this command returns several thresholds. The one you are interested in is the .99 value, which marks the top 1%. Take the number reported for that threshold and divide it by the number of animals of that breed in your data set.
- In the script, change threshold in the topRuns line to the proportion you get from this division.

Now keep running commands until you reach the end of the script, and that's it! Check your output files to make sure the figures make sense.

## If you are running the script through Farm:

First, log in to Farm (replace `vanburen` with your own Farm username):
```
ssh vanburen@farm.hpc.ucdavis.edu
```
Allocate resources:
```
srun -p high2 --time=4:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 1 --mem 10GB --pty /bin/bash
```
- `--time` and `--mem` can be changed to fit the needs of your script and the amount of data.

Next, you need to activate an environment in which you have installed R 4.2.2. R 4.2.2 is needed because the detectRuns package used to calculate ROH is not compatible with the newest version of R. If you do not know how to create an environment and install software in it, check out this [link](https://hackmd.io/@ctb/S11OXiCK6#Creating-your-first-environment-amp-installing-csvtk).
I created an R 4.2.2 environment using this code: `mamba create --name R422 -y r-base==4.2.2` as well as `conda install -c conda-forge r-devtools r-remotes`

Back to the commands. Load mamba and activate the environment (replace `ROHenvironment` with the name of your own environment, for example `R422` from the command above):
```
module load mamba
mamba activate ROHenvironment
```
Next, launch an RStudio session by running these commands:
```
module load rstudio-server
rstudio-launch
```
Once you run these commands, you should get a prompt telling you to run an ssh command in a new terminal. The command contains your Farm username and something like -L(numbers)cpu(numbers)etc. Run that command in a new terminal window. For example:
```
ssh -L49139:cpu-3-51:49139 vanburen@farm.hpc.ucdavis.edu
```
The -L value is specific to your session, so use the one the terminal prompt gives you. The prompt will also give you a URL, username, and password for accessing the RStudio server in your web browser; follow those instructions.

Once you are both on Farm and in the RStudio server, navigate to the Finno lab files for ROH:
```
cd /group/ctbrowngrp/finnolab_shared/ROH
```
In RStudio, I uploaded my downloaded R script to my Farm home directory. You should see it appear in the home directory file listing. This command: `cp ~/morgans_skrypto_Jan25.R .` copies it to the directory I am currently in, i.e. the ROH folder. Ensure your pruned PED and MAP files are also in this folder.
- I recommend editing the script for your specific parameters and breed abbreviations before uploading it to Farm.

Now the script can be run as explained above, and the output will be written to Farm.

#### Run RStudio Server on your reserved node

```
srun -p high2 --time=5:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 16GB --pty /bin/bash
```
Now run:
```
module load rstudio-server
```
followed by:
```
module load R
```
followed by:
```
rstudio-launch
```
The first command sets up your account to use the RStudio Server software.
The second command sets up your account to use a specific version of R. The third command _runs_ RStudio Server on Farm.

You should see output that looks like this:

> Run the following command in a new terminal on your computer:
>
>     ssh -i /Users/Samantha/.ssh/id_rsa_farm -L50700:cpu-3-64:50700 datalab-02@farm.hpc.ucdavis.edu
>
> Then, on your computer, navigate your browser to:
>
>     URL: http://localhost:50700
>     Username: datalab-02
>     Password: attention-plausible-overripe-sliceable-vacant-imprint
>
> NOTE: Using R at /share/apps/conda/environments/r-4.2.3/bin/R.

### Filtering pruned MAP + PED to remove chromosome 33 (X)

```
plink --allow-extra-chr --chr-set 32 --file cl2pruned --out cl3pruned --recode --not-chr 33
```
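
Once the PED file is filtered, its first column still has to be changed to the breed abbreviation, as described under Starting off. If you would rather not do that by hand in a text editor, the edit can be scripted. The following is only a minimal sketch: the function name `set_breed_column` and the output file name are hypothetical, and it assumes a whitespace-delimited PED file such as the one plink writes with `--recode`. Fields in the output are separated by single spaces.

```shell
# Hypothetical helper (not part of plink or the ROH script):
# rewrite column 1 of a PED file to a breed abbreviation.
# Arguments: breed abbreviation, input PED, output PED.
set_breed_column() {
  breed="$1"
  in_ped="$2"
  out_ped="$3"
  # Assigning to $1 makes awk rebuild each line with single-space separators;
  # column 2 (the horse ID) and all genotype columns are left unchanged.
  awk -v breed="$breed" '{ $1 = breed; print }' "$in_ped" > "$out_ped"
}

# Example call (assumed file names; uncomment and adjust to your data):
# set_breed_column MG cl3pruned.ped cl3pruned_MG.ped
```

Writing to a new file instead of overwriting the input keeps the original PED available in case the result needs to be checked against it.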