# Castilleja ddRAD data - denovo assembly

###### tags: `RADSeq` `STACKS`

--- author: Emma Fetterly ---

**Resources:** COPIED FROM DYLAN's NOTES

1. Link to STACKS website: https://catchenlab.life.illinois.edu/stacks/ *This guide is meant to help users with the basic workflow and to produce output files for population genetic analyses.*
2. Bioinformatics guide by Eric C. Anderson https://eriqande.github.io/eca-bioinf-handbook/shell-programming.html
3. Speciation & Population Genomics: a how-to-guide https://speciationgenomics.github.io/mapping_reference/

------

Process Overview

1. process_radtags - demultiplex and clean reads
2. STACKS: de novo pipeline
3. STACKS: populations
4. filtering with VCFtools and plink

----

# Quest basics

## Logging on

Log on to Quest; it will ask for your NetID password:

```
ssh ewf7555@quest.northwestern.edu
```

To make a directory: `mkdir` + name
To change to a directory: `cd` + name or full path
`cd ..` will change to the parent directory (one level up)

Moving files can be achieved either by using the copy/move Linux commands followed by the path to the new folder, or by transferring them manually with Cyberduck or FileZilla. I prefer Linux commands as they are less clunky.
- `cp` - copy command (this will duplicate files, leaving a copy in your current directory)
- `mv` - move command (this will move your files and not leave a copy in the current directory)
- `mv` can also be used to rename files/folders if you are not moving to a new directory: `mv name.txt name2.txt` (this will rename name.txt to name2.txt)
- `*` - the asterisk is a wildcard; it can be used to select all files with the same ending or a similar pattern. `cp *.1.fq.gz` = copy all files ending in .1.fq.gz to your directory of choice

## Castilleja data folder

Jeremie's copy "Castilleja" won't allow edits, so I have a copy called "Castilleja_emma". File path:

```
$ cd /projects/p32037/Castilleja_emma
```

This folder contains the index barcodes in .txt files and the sequence data in .fastq files. There should be 12 indices with 8 samples in each index group.

:::danger
I only see barcode indices for 12, 2 (twice - .tsv and .txt), 4, 5, 6, 7
:::

This is okay; the other fastq files are from another species in the same run.
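As a quick illustration of the wildcard (the file names here are made up for the demo), this sketch copies only the forward-read files and leaves the reverse reads behind:

```
# Demo files (hypothetical names) to show how the * wildcard selects
# every file matching a pattern in one command.
mkdir -p wildcard_demo wildcard_dest
touch wildcard_demo/sampleA.1.fq.gz wildcard_demo/sampleA.2.fq.gz wildcard_demo/sampleB.1.fq.gz
# copy only the forward reads (.1.fq.gz); the .2.fq.gz files are not matched
cp wildcard_demo/*.1.fq.gz wildcard_dest/
ls wildcard_dest
```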
```
[ewf7555@quser34 Castilleja]$ ls
barcodes_index12.txt  barcodes_index2.tsv  barcodes_index2.txt  barcodes_index4.txt
barcodes_index5.txt   barcodes_index6.txt  barcodes_index7.txt
C85HWANXX_s5_1_BC_X.fastq
C85HWANXX_s5_1_illumina12index_10_SL136389.fastq
C85HWANXX_s5_1_illumina12index_11_SL136390.fastq
C85HWANXX_s5_1_illumina12index_12_SL136384.fastq
C85HWANXX_s5_1_illumina12index_1_SL136385.fastq
C85HWANXX_s5_1_illumina12index_2_SL136379.fastq
C85HWANXX_s5_1_illumina12index_3_SL136386.fastq
C85HWANXX_s5_1_illumina12index_4_SL136380.fastq
C85HWANXX_s5_1_illumina12index_5_SL136381.fastq
C85HWANXX_s5_1_illumina12index_6_SL136382.fastq
C85HWANXX_s5_1_illumina12index_7_SL136383.fastq
C85HWANXX_s5_1_illumina12index_8_SL136387.fastq
C85HWANXX_s5_1_illumina12index_9_SL136388.fastq
C85HWANXX_s5_2_BC_X.fastq
C85HWANXX_s5_2_illumina12index_10_SL136389.fastq
C85HWANXX_s5_2_illumina12index_11_SL136390.fastq
C85HWANXX_s5_2_illumina12index_12_SL136384.fastq
C85HWANXX_s5_2_illumina12index_1_SL136385.fastq
C85HWANXX_s5_2_illumina12index_2_SL136379.fastq
C85HWANXX_s5_2_illumina12index_3_SL136386.fastq
C85HWANXX_s5_2_illumina12index_4_SL136380.fastq
C85HWANXX_s5_2_illumina12index_5_SL136381.fastq
C85HWANXX_s5_2_illumina12index_6_SL136382.fastq
C85HWANXX_s5_2_illumina12index_7_SL136383.fastq
C85HWANXX_s5_2_illumina12index_8_SL136387.fastq
C85HWANXX_s5_2_illumina12index_9_SL136388.fastq
```

Look at the raw data sequences.

# Process radtags

## Code

1. `process_radtags -h`

process_radtags is the first step in the STACKS pipeline. It takes your raw sequencing files and prepares them for locus assembly and SNP calling.
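Since each FASTQ record is four lines (header, sequence, `+`, qualities), a quick way to peek at raw reads and count them — sketched here on a tiny made-up file — is:

```
# Make a tiny example FASTQ (one hypothetical read) to illustrate the format.
printf '@read1\nAATTCACGTACGT\n+\nIIIIIIIIIIIII\n' > example.fastq
# First record = first 4 lines; read count = line count / 4
head -n 4 example.fastq
echo "$(( $(wc -l < example.fastq) / 4 )) read(s)"
```

For the real files, `less file.fastq` (or `zless` for gzipped files) pages through without loading everything into memory.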
EXAMPLE process_radtags CODE FROM DYLAN:

```
nohup process_radtags -1 Amorphy_S1_R1_001.fastq.gz -2 Amorphy_S1_R2_001.fastq.gz -i gzfastq -b Pbar -o Phy_out --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10 &
```

My version for Castilleja:

```
process_radtags -1 C85HWANXX_s5_1_illumina12index_12_SL136384.fastq -2 C85HWANXX_s5_2_illumina12index_12_SL136384.fastq -i fastq -b barcodes_index12.txt -o CACOradtags_output --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10 &
```

- `process_radtags` - the command to call the program
- `-1` - the read 1 file
- `-2` - the read 2 file
- `-i` - format of the raw input files (fastq)
- `-b` - the barcodes file (in this case: barcodes_index12.txt) with barcode + sample identifier:

```
AACCA	A_hubrichtii_7
CGATC	A_arenaria_TX
TCGAT	A_orientalis_14
TGCAT	A_cil_tenuifolia_188
```

Note the orientation: the barcode identifier followed by the sample name, with a tab in between. The barcode file can be made in a text editor (BBEdit) and uploaded to the server using Cyberduck or FileZilla, or written on the server with nano.

- `-o` - the output directory, which should be created before you call process_radtags. **I made a new directory within the "Castilleja_emma" folder called "CACOradtags_output" for the first/test group (index 12)**
- `--inline_null` - tells STACKS where the barcodes sit: inline on the single-end read, with none on the paired-end read
- `--renz_1 ecoRI` - restriction enzyme 1
- `--renz_2 mspI` - restriction enzyme 2 ***Make sure to change these if using different enzymes!***
- These are the adapters the Fant/Skogen lab uses:
  `--adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT`
  `--adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT`
- `--adapter_mm 2` - we are allowing two mismatched bases in the adapters
- `-r` - rescue barcodes and RAD-Tag cut sites
- `-c` - clean data, remove any read with an uncalled base
- `-q` - discard reads with low quality scores

***The ends of raw reads are usually very messy, so most people truncate/trim reads to avoid problems with assembling loci downstream. We have 150 bp reads and truncate to 100.***

To inspect raw, non-demultiplexed files: `zless` + file name

- `-t 100` - truncate final read length to this value
- `-s 10` - quality score limit; with `-q`, reads whose average score within the sliding window drops below this value are discarded (default 10)
- `&> ptags2.out &` - redirects the output/log to ptags2.out and runs the job in the background

Example SLURM SCRIPT for RADTAGS from DYLAN's notes:

```
#!/bin/bash
#SBATCH --account=            ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal    ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=               ## Required: how long the job needs to run (different partitions restrict this parameter)
#SBATCH --nodes=1             ## how many computers/nodes you need (no default)
#SBATCH --ntasks-per-node=2   ## how many cpus or processors you need per node (default value 1)
#SBATCH --mem-per-cpu=20G     ## how much RAM you need per cpu (this affects your FairShare score, so don't ask for more than you need)
#SBATCH --job-name=           ## shown when you run squeue -u
#SBATCH --mail-type=          ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=
#SBATCH --error=              # Enter an error log file name
#SBATCH --output=             # Enter your out.log file name

module purge
module load stacks/2.64-gcc-9.2.0

process_radtags -1 Amorphy_S1_R1_001.fastq.gz -2 Amorphy_S1_R2_001.fastq.gz -i gzfastq -b Pbar -o Phy_out --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```

## Slurm scripts

Actual Slurm script I used for Castilleja radtags:
* This was for "index 12"

```
#!/bin/bash
#SBATCH --account=p32037 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00 ## Required: how long the job needs to run (different partitions restrict this parameter)
#SBATCH --nodes=1 ## how many computers/nodes you need (no default)
#SBATCH --ntasks-per-node=2 ## how many cpus or processors you need per node (default value 1)
#SBATCH --mem-per-cpu=25GB ## how much RAM you need per cpu (this affects your FairShare score, so don't ask for more than you need)
#SBATCH --job-name=CACO_radtags ## shown when you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error # Enter an error log file name
#SBATCH --output=CACO_radtags.out # Enter your out.log file name

module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment

process_radtags -1 C85HWANXX_s5_1_illumina12index_12_SL136384.fastq -2 C85HWANXX_s5_2_illumina12index_12_SL136384.fastq -i fastq -b barcodes_index12.txt -o CACOradtags_output --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```

Notes on running a Slurm:
* open a new file with nano
* copy and paste the script above
* save and exit with Ctrl+X
* name it "process_radtags.sh"

Run the Slurm with:

```
sbatch process_radtags.sh   #start slurm script
squeue -u ewf7555           #check status of slurm
```

Notes on Slurm scripts: make sure to fill in your account name, how much time you are giving the job, the number of nodes, cores, and GB, the job name, etc. `nohup` and `&` are not needed with Slurm scripts because a Slurm submission runs in the background regardless.

Output:

```
[ewf7555@quser31 Castilleja_emma]$ sh process_radtags.sh
Processing paired-end data.
Using Phred+33 encoding for quality scores.
Reads will be truncated to 100bp
Filtering reads for adapter sequence:
  ACACTCTTTCCCTACACGACGCTCTTCCGATCT
  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
  2 mismatches allowed to adapter sequence.
Found 1 paired input file(s).
Searching for single-end, inlined barcodes.
Loaded 8 barcodes (6bp).
Will attempt to recover barcodes with at most 1 mismatches.
Processing file 1 of 1 [C85HWANXX_s5_1_illumina12index_12_SL136384.fastq]
  Reading data from:
  C85HWANXX_s5_1_illumina12index_12_SL136384.fastq and
  C85HWANXX_s5_2_illumina12index_12_SL136384.fastq
  Processing pairs of RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...9M...10M...11M...12M...13M...14M...15M...16M...17M...18M...19M...20M...21M...22M...23M...24M...25M...26M...27M...28M...29M...30M...31M...
  63459904 total reads; -3137246 ambiguous barcodes; -411809 ambiguous RAD-Tags; +1108469 recovered; -2782870 low quality reads; 55721085 retained reads.
  1406894 reads with adapter sequence.
Closing files, flushing buffers...done.

63459904 total sequences
 1406894 reads contained adapter sequence (2.2%)
 3137246 barcode not found drops (4.9%)
 2782870 low quality read drops (4.4%)
  411809 RAD cutsite not found drops (0.6%)
55721085 retained reads (87.8%)

Details logged: 'CACOradtags_output/process_radtags.log'

For a summary, execute one or more:
  stacks-dist-extract CACOradtags_output/process_radtags.log total_raw_read_counts
  stacks-dist-extract --pretty CACOradtags_output/process_radtags.log per_barcode_raw_read_counts

process_radtags is done.
[ewf7555@quser31 Castilleja_emma]$
```

This finished very quickly (less than 1 hour) with 87.8% retained reads.
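Before submitting the remaining indices, it's worth sanity-checking each barcode file: process_radtags expects exactly two tab-separated columns for single-end inline barcodes, and extra columns make it abort. A minimal sketch (with made-up barcodes and sample names):

```
# Build a two-column barcode file with explicit tab separators
# (hypothetical entries for illustration).
printf 'AACCA\tsample_01\n' >  barcodes_example.txt
printf 'CGATC\tsample_02\n' >> barcodes_example.txt
# Count lines that do NOT have exactly 2 tab-separated fields;
# anything nonzero would trigger a "Too many columns" style error.
bad=$(awk -F'\t' 'NF != 2' barcodes_example.txt | wc -l)
echo "$bad malformed line(s)"
```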
Trying again with "index 2" using the same Slurm script but substituting different file names:

Forward file (-1): C85HWANXX_s5_1_illumina12index_2_SL136379.fastq
Reverse file (-2): C85HWANXX_s5_2_illumina12index_2_SL136379.fastq
Barcode file: barcodes_index2.txt
Job title: CACO_radtags2

For now I am putting this in a new output folder in "Castilleja_emma" to keep it separate from the index 12 run.

```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out

module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment

process_radtags -1 C85HWANXX_s5_1_illumina12index_2_SL136379.fastq -2 C85HWANXX_s5_2_illumina12index_2_SL136379.fastq -i fastq -b barcodes_index2.txt -o CACOradtags_output2 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```

Started at 11 am - immediate failure because there were too many columns in the barcodes text file... need to investigate this. Used the "barcodes_index2.tsv" file instead - this worked.

Submitted the Slurm at 11:03 and it finished within 30 minutes, with very similar results to the index 12 run (85.8% retained reads).

Output here:

```
(base) [ewf7555@quser34 Castilleja_emma]$ sh process_radtags.sh
Processing paired-end data.
Using Phred+33 encoding for quality scores.
Reads will be truncated to 100bp
Filtering reads for adapter sequence:
  ACACTCTTTCCCTACACGACGCTCTTCCGATCT
  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
  2 mismatches allowed to adapter sequence.
Found 1 paired input file(s).
Searching for single-end, inlined barcodes.
Too many columns (3) specified in 'barcodes_index2.txt' for single-end barcodes on line 1.
(base) [ewf7555@quser34 Castilleja_emma]$ nano process_radtags.sh
(base) [ewf7555@quser34 Castilleja_emma]$ sh process_radtags.sh
Processing paired-end data.
Using Phred+33 encoding for quality scores.
Reads will be truncated to 100bp
Filtering reads for adapter sequence:
  ACACTCTTTCCCTACACGACGCTCTTCCGATCT
  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
  2 mismatches allowed to adapter sequence.
Found 1 paired input file(s).
Searching for single-end, inlined barcodes.
Loaded 8 barcodes (6bp).
Will attempt to recover barcodes with at most 1 mismatches.
Processing file 1 of 1 [C85HWANXX_s5_1_illumina12index_2_SL136379.fastq]
  Reading data from:
  C85HWANXX_s5_1_illumina12index_2_SL136379.fastq and
  C85HWANXX_s5_2_illumina12index_2_SL136379.fastq
  Processing pairs of RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...9M...10M...11M...12M...13M...14M...15M...16M...17M...18M...19M...20M...21M...22M...23M...24M...25M...26M...27M...
  55589592 total reads; -4051380 ambiguous barcodes; -403793 ambiguous RAD-Tags; +1024563 recovered; -2372433 low quality reads; 47706734 retained reads.
  1055252 reads with adapter sequence.
Closing files, flushing buffers...done.

55589592 total sequences
 1055252 reads contained adapter sequence (1.9%)
 4051380 barcode not found drops (7.3%)
 2372433 low quality read drops (4.3%)
  403793 RAD cutsite not found drops (0.7%)
47706734 retained reads (85.8%)

Details logged: 'CACOradtags_output2/process_radtags.log'

For a summary, execute one or more:
  stacks-dist-extract CACOradtags_output2/process_radtags.log total_raw_read_counts
  stacks-dist-extract --pretty CACOradtags_output2/process_radtags.log per_barcode_raw_read_counts

process_radtags is done.
(base) [ewf7555@quser34 Castilleja_emma]$
```

Looks good. Now I need to do this for indices 4, 5, 6, and 7.

```
mkdir CACOradtags_output4
mkdir CACOradtags_output5
mkdir CACOradtags_output6
mkdir CACOradtags_output7
```

Slurm for process_radtags4.sh:

```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags4
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out

module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment

process_radtags -1 C85HWANXX_s5_1_illumina12index_4_SL136380.fastq -2 C85HWANXX_s5_2_illumina12index_4_SL136380.fastq -i fastq -b barcodes_index4.txt -o CACOradtags_output4 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```

Slurm for process_radtags5.sh:

```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags5
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out

module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment

process_radtags -1 C85HWANXX_s5_1_illumina12index_5_SL136381.fastq -2 C85HWANXX_s5_2_illumina12index_5_SL136381.fastq -i fastq -b barcodes_index5.txt -o CACOradtags_output5 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```

Slurm for process_radtags6.sh:

```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags6
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out

module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment

process_radtags -1 C85HWANXX_s5_1_illumina12index_6_SL136382.fastq -2 C85HWANXX_s5_2_illumina12index_6_SL136382.fastq -i fastq -b barcodes_index6.txt -o CACOradtags_output6 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```

Slurm for process_radtags7.sh:

```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags7
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out

module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment

process_radtags -1 C85HWANXX_s5_1_illumina12index_7_SL136383.fastq -2 C85HWANXX_s5_2_illumina12index_7_SL136383.fastq -i fastq -b barcodes_index7.txt -o CACOradtags_output7 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```

I then ran these Slurms; the outputs should be in separate folders, i.e.
CACOradtags_output4, CACOradtags_output5, CACOradtags_output6, and CACOradtags_output7.

**Quick results of process_radtags**

---

**Index 12:**
63459904 total sequences
1406894 reads contained adapter sequence (2.2%)
3137246 barcode not found drops (4.9%)
2782870 low quality read drops (4.4%)
411809 RAD cutsite not found drops (0.6%)
55721085 retained reads (87.8%)

**Index 2:**
55589592 total sequences
1055252 reads contained adapter sequence (1.9%)
4051380 barcode not found drops (7.3%)
2372433 low quality read drops (4.3%)
403793 RAD cutsite not found drops (0.7%)
47706734 retained reads (85.8%)

**Index 4:**
39541866 total sequences
723528 reads contained adapter sequence (1.8%)
2560698 barcode not found drops (6.5%)
1591721 low quality read drops (4.0%)
285054 RAD cutsite not found drops (0.7%)
34380865 retained reads (86.9%)

**Index 5:**
112035426 total sequences
2212884 reads contained adapter sequence (2.0%)
4991864 barcode not found drops (4.5%)
4956510 low quality read drops (4.4%)
706719 RAD cutsite not found drops (0.6%)
99167449 retained reads (88.5%)

**Index 6:**
73285042 total sequences
1446848 reads contained adapter sequence (2.0%)
3755980 barcode not found drops (5.1%)
2914390 low quality read drops (4.0%)
496599 RAD cutsite not found drops (0.7%)
64671225 retained reads (88.2%)

**Index 7:**
69138146 total sequences
1210578 reads contained adapter sequence (1.8%)
4440828 barcode not found drops (6.4%)
2743297 low quality read drops (4.0%)
469015 RAD cutsite not found drops (0.7%)
60274428 retained reads (87.2%)

Once process_radtags has finished running, change directories to your output directory and inspect the process_radtags.log file. This will give you basic information on how many sequences were found per sample and how many passed preliminary filtering.
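Pulling the totals from the six per-index summaries above together gives a quick arithmetic check of overall retention (the numbers are copied straight from the logs):

```
# Sum the per-index total and retained read counts reported above.
total=$((63459904 + 55589592 + 39541866 + 112035426 + 73285042 + 69138146))
retained=$((55721085 + 47706734 + 34380865 + 99167449 + 64671225 + 60274428))
awk -v t="$total" -v r="$retained" \
  'BEGIN { printf "%d of %d reads retained (%.1f%%)\n", r, t, 100 * r / t }'
```

So roughly 87.6% of all reads survived across the whole run, nicely consistent with the 85.8-88.5% per-index range.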
To view the log file:

```
cat process_radtags.log
```

STACKS outputs four files for each sample if you are using paired-end reads:

sample_name.rem.1.fq.gz
sample_name.rem.2.fq.gz
sample_name.1.fq.gz
sample_name.2.fq.gz

We only use the files without the .rem; the .rem files can be deleted:

```
rm *.rem.1.fq.gz
rm *.rem.2.fq.gz
```

If you are planning on generating loci/SNPs from multiple sublibraries, you will need to move all demultiplexed files into one folder (using the cp/mv commands described above); if not, skip below to the de novo pipeline.

**Removed the .rem files and moved all read files into a single folder in "Castilleja_emma" called "readfiles_all"**

I was curious about the raw read quality of these sequences, so I ran FastQC on the read files:

```
module load fastqc/0.12.0
fastqc -o /projects/p32037/Castilleja_emma/readfiles_all/fastqc -t 6 *.fq
```

It's hard to review each file... so I ran MultiQC (https://multiqc.info/docs/) in the readfiles_all directory to view all the reports at once:

```
[ewf7555@quser32 readfiles_all]$ module load multiqc
[ewf7555@quser32 readfiles_all]$ multiqc .
#this searched through the entire read_files directory and found the fastqc reports generated

  /// MultiQC 🔍 | v1.11

| multiqc   | MultiQC Version v1.19 now available!
| multiqc   | Search path : /projects/p32037/Castilleja_emma/readfiles_all
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 575/575
| picard    | Found 35 AlignmentSummaryMetrics reports
| fastqc    | Found 97 reports
| fastqc    | Sample had zero reads: 'merged_DP830'
| multiqc   | Compressing plot data
| multiqc   | Report : multiqc_report.html
| multiqc   | Data   : multiqc_data
| multiqc   | MultiQC complete
```

# De-Novo run

All files from process_radtags are in the `readfiles_all` directory.

This is the true workhorse of the STACKS pipeline. During denovo_map.pl, STACKS runs the following sub-programs: ustacks, cstacks, sstacks, tsv2bam, gstacks, and populations. populations can also be rerun after the de novo assembly to fine-tune parameters such as r and p, and to create different output files as well as some basic pop-gen stats.

Below are some of the key parameters (with short descriptions) most users of STACKS will investigate. This table is from Paris et al. 2017, "Lost in parameter space". My suggestion would be to start with the default parameters until you are comfortable running the de novo mapper. The defaults are m = 3, M = 2, and n = 1. I would also set r = 80 as this seems to be the rule for most population genetics studies, but if you have a lot of missing data explore different r and p values. Once you are comfortable calling the pipeline and understand its outputs, then I suggest moving on to exploring parameters.

From Eun Sun's study: "*M = n = 3 were chosen as optimal parameter values*"

In order to run denovo_map.pl you will need:
1. All the fq.gz (gzipped fastq) sample files in one directory. They are present in the "readfiles_all" directory.
2. A population map file.
This will be the single_pop.txt popmap.

Start by creating a sample folder and moving the samplename.1.fq.gz and samplename.2.fq.gz files you wish to run through the de novo assembly pipeline into that folder.

Next you will need to create a population map file. There should be no header, just two columns: the first with sample IDs and the second with the population assignment. Make sure to include a tab between the sample ID and the population.

Here is an example call to denovo_map.pl:

```
denovo_map.pl -m 4 -M 3 -n 3 --samples /data/amsonia/raw/data-06-28-2022/FUG_100 --popmap /data/amsonia/raw/data-06-28-2022/FUG_100/Fug_map -o /data/amsonia/raw/data-06-28-2022/r60/FUG/4_3_3 -T 10 --paired -X "populations:-r 0.8" -X "populations: --write-random-snp" -X "populations: --vcf"
```

Need to make a new directory for the de novo run. **Made a directory in Castilleja_emma called "denovo_stacks"**; in that directory made a folder called 3_3_3 (for the parameters used in this run). File path for this directory: `/projects/p32037/Castilleja_emma/denovo_stacks/3_3_3`

Castilleja code for denovo_map.pl:

```
denovo_map.pl -m 3 -M 3 -n 3 --samples /projects/p32037/Castilleja_emma/readfiles_all --popmap /projects/p32037/Castilleja_emma/sorted_bams/popmap/single_pop.txt -o /projects/p32037/Castilleja_emma/denovo_stacks/3_3_3 -T 10 --paired -X "populations:-r 0.8" -X "populations: --write-random-snp" -X "populations: --vcf"
```

Remember to check the help document for the assembler: `denovo_map.pl -h`. This covers the parameters we are using; the STACKS homepage has better descriptions.
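The popmap format described above — two tab-separated columns, no header — can be sketched and validated like this (sample names here are hypothetical):

```
# Hypothetical popmap: sample ID <TAB> population, one line per sample.
printf 'CACO_001\tpop1\n' >  single_pop_example.txt
printf 'CACO_002\tpop1\n' >> single_pop_example.txt
# Every line must have exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { exit 1 }' single_pop_example.txt && echo "popmap OK"
```

The sample IDs must exactly match the read file prefixes (e.g. CACO_001 for CACO_001.1.fq.gz), or denovo_map.pl will not find those samples.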
Briefly:
- `-m`, `-M`, `-n` - the key STACKS parameters; omitting these runs the defaults (3-2-1)
- `--samples` - the path to the sample directory/folder you wish to run
- `--popmap` - if the map isn't in the same folder as the script, provide the entire path
- `-o` - the output directory, which you will need to set up prior to running
- `-T` - how many cores/processors you wish to use
- `--paired` - tells STACKS you are working with paired-end data
- `-X "populations"` - the flag for passing options to populations during the de novo assembly. Note you can also rerun populations afterwards, so if you missed something on the initial run don't fret.
- `-X "populations:-r 0.8"` - minimum percentage of individuals in a population required to process a locus for that population
- `-X "populations: --write-random-snp"` - restrict data analysis to one random SNP per locus
- `-X "populations: --vcf"` - output SNPs and haplotypes in Variant Call Format (VCF)

There are many more options for both the de novo mapper and the populations program; these are just some of the basics to get started. I like to run my code in bash scripts.
Here is an example of one of my scripts to run denovo_map.pl:

```
#!/bin/bash
nohup denovo_map.pl -m 3 -M 2 -n 2 --samples /data/amsonia/raw/data-08-30-2022/LONG_100 --popmap /data/amsonia/raw/data-08-30-2022/LONG_100/Long_map -o /data/amsonia/raw/data-08-30-2022/r80/Long/3_2_2 -T 10 --paired -X "populations:-r 0.8" -X "populations: --write-random-snp" -X "populations: --vcf" &> deLONG.out&
```

You can reuse the Slurm script format from process_radtags for the de novo assembly; just modify the script, swapping in the denovo_map.pl command line.

Writing a Slurm script for the de novo run, saved as "denovo_3-3-3.sh" in the 3_3_3 directory of the denovo_stacks directory. *Had to make a new popmap ("single_pop_nosort.txt" in the denovo directory) and remove the .sort from the sample IDs used in the reference run.*

```
#!/bin/bash
#SBATCH --account=p32037 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00 ## Required: how long the job needs to run (different partitions restrict this parameter)
#SBATCH --nodes=1 ## how many computers/nodes you need (no default)
#SBATCH --ntasks-per-node=2 ## how many cpus or processors you need per node (default value 1)
#SBATCH --mem-per-cpu=25GB ## how much RAM you need per cpu (this affects your FairShare score, so don't ask for more than you need)
#SBATCH --job-name=CACO_3-3-3 ## shown when you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_3-3-3.error # Enter an error log file name
#SBATCH --output=CACO_3-3-3.out # Enter your out.log file name

module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment

denovo_map.pl -m 3 -M 3 -n 3 --samples /projects/p32037/Castilleja_emma/readfiles_all/gz_reads --popmap \
/projects/p32037/Castilleja_emma/denovo_stacks/single_pop_nosort.txt -o /projects/p32037/Castilleja_emma/denovo_stacks/3_3_3 -T 10 --paired -X "populations:-r 0.8" -X "populations: --write-random-snp" -X "populations: --vcf"
```

Make sure to check the log output, either in the file you redirected it to or in the denovo_map.log file written to the output directory. It contains important statistics to consider, including depth of coverage and the number of loci and SNPs found.
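One quick way to pull those statistics out of the log is a simple grep — sketched here against a made-up log fragment, since the exact log wording varies by STACKS version:

```
# Hypothetical log fragment standing in for a real denovo_map.log.
printf 'Genotyped 54321 loci.\nDepths of Coverage for Processed Samples:\nCACO_001: 25.3x\n' > denovo_map_example.log
# Pull the coverage- and locus-related lines from the log.
grep -Ei 'coverage|loci' denovo_map_example.log
```

On the real run, point the grep at `/projects/p32037/Castilleja_emma/denovo_stacks/3_3_3/denovo_map.log` instead of the example file.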