# GTSeq Workflow Notes
*Note copied over from: [MECLAB HackMD 20250417](https://hackmd.io/PkHqP8d7QF6oZ3DinzgTkA?both)*
EA notes: I will put questions in blue/yellow boxes for your review. I gave you and Lisa access to my GitHub, where I uploaded the scripts and have been working on a markdown doc with script annotations/background: [GithubScriptMarkdown](https://github.com/estefany-argueta/BrazilKinship/blob/13cefb6d049dd6c0d14d1a9d5da95f1813ec16b4/brazil_gtseq/scripts/README.md)
:::warning
Will need the transfer of previous (Optimization run) sequencing. Link here for list of samples: [BRSamplesSuccess](https://drive.google.com/file/d/1F5dA5qZ844N2SsY50iNxd4HVyIiDefHQ/view?usp=drive_link)
:::
## Jamie's Script Workflow:
1. untar the raw file from the sequencer, and run checksum
2. demultiplex raw data to generate raw fastq files
3. trim raw fastq files with trimmomatic (light trimming for quality and adaptor contamination)
4. added this step later, but now I flash the PE reads together before aligning to reference amplicon sequences
5. use BWA to align to the target amplicon reference fasta file. at this step also use samtools to get alignment statistics and coverage
6. run the R coverage-summary script (there's also a script to submit the R script as a job on Unity)
7. look at plots etc of coverage and alignments, remove failed samples
8. prep bam files for GATK genotyping (need to sort bams and add read group headers with picard)
9. run GATK (3 steps: HaplotypeCaller, then GenomicsDBImport, then joint genotyping with GenotypeGVCFs)
10. merge locus-specific vcf files with bcftools
11. filter vcfs with vcftools- iterative filter for data missingness, quality
12. Generate sam files from bam files (prep for microhaplot)
13. Run base microhaplot script on the cluster - requires SAM files for all samples and the combined, filtered VCF you generated from GATK. This step will generate .rds files that you will pull into R on your local computer
14. Filtering of microhaplotypes and reformatting for Colony - there will be a lot of noisy/low-quality calls that we need to filter. Do this in R on your local computer (requires the microhaplot shiny app, which doesn't play nicely with Unity)
15. prep colony files in the Colony GUI on your local computer, using the tables/files we generated in step 14
16. Run colony GUI
17. Inspect Colony output data files in R, filter questionable assignments, run population genetic analyses, etc.
:::info
Do you have scripts for VCFtools or a workflow for Colony? -- Totally fine if I need to play around in Colony, but it would be helpful if you have something for VCFtools (steps 10-11)
:::
## EA Notes / Questions
### Trimmomatic Script (step 3 - script 02)
Script:
```
module load trimmomatic/0.39

for filename in *_R1_*.fastq.gz
do
    # first, make the base by removing fastq.gz
    base=$(basename $filename .fastq.gz)
    echo $base
    # now, construct the R2 filename by replacing R1 with R2
    baseR2=${base/_R1_/_R2_}
    echo $baseR2
    # finally, run Trimmomatic
    java -jar /modules/apps/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 -threads 16 \
        ${base}.fastq.gz ${baseR2}.fastq.gz \
        ../02_trimmed_fastq/${base}.qc.fq.gz s1_se \
        ../02_trimmed_fastq/${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:/nese/meclab/Shared/bin/trimmomatic/NexteraPE-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:75
    # save the orphans
    gzip -9c s1_se s2_se >> ../02_trimmed_fastq/orphans.qc.fq.gz
    rm -f s1_se s2_se
done
```
:::info
Within the Trimmomatic script -- ILLUMINACLIP flag --
Do I need to edit or change anything within my script when calling the adapter file in the shared nese space -- "NexteraPE-PE.fa:2:30:10"? Does this file vary with the adapters we used during the GTSeq prep?
:::
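For reference, my understanding from the Trimmomatic manual (so Jamie should confirm): only the fasta path identifies the adapter set; the trailing numbers are clipping thresholds, so the file would only change if different adapters were used in the prep. An annotated breakdown:
```
# ILLUMINACLIP:<adapter fasta>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
# NexteraPE-PE.fa -> the adapter sequences searched for; the only part tied to the library prep,
#                    so it would only change if the GTSeq prep used non-Nextera adapters
# 2               -> max mismatches allowed in the initial seed match
# 30              -> palindrome-mode score threshold (adapter read-through in paired-end data)
# 10              -> simple-mode score threshold
#   e.g. ILLUMINACLIP:/nese/meclab/Shared/bin/trimmomatic/NexteraPE-PE.fa:2:30:10
```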
### FLASH PE Step 4 (script 03)
```
module load conda/latest
conda activate gtseq

FQDIR=/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/02_trimmed_fastq

for FILE in $FQDIR/*_L001_R1_001.qc.fq
do
    echo $FILE
    # first, make the R1 base name by removing the .qc.fq extension
    baseR1=$(basename $FILE .qc.fq)
    echo $baseR1
    # now, construct the R2 filename by replacing R1 with R2
    baseR2=${baseR1/_R1_/_R2_}
    echo $baseR2
    # make the sample name
    base=$(basename $FILE _L001_R1_001.qc.fq)
    echo $base
    # merge PE reads
    flash --max-overlap 135 --output-prefix $base $baseR1.qc.fq $baseR2.qc.fq
done

conda deactivate
```
:::info
In following steps, scripts call for `extendedFrags.fastq`, but I don't see this script creating that file type? It may be automatic when running it, but I haven't run this step yet on the new data (the data you sent me was already past this part of the workflow).
:::
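For what it's worth, my reading of the FLASH documentation is that the merged file is written automatically under the `--output-prefix`, so no extra step should be needed (worth confirming against Jamie's run):
```
# FLASH writes its merged output automatically; for --output-prefix $base the files produced are:
#   $base.extendedFrags.fastq     merged (overlapped) read pairs -> input for the BWA step
#   $base.notCombined_1.fastq     R1 reads that could not be merged
#   $base.notCombined_2.fastq     R2 reads that could not be merged
#   $base.hist / $base.histogram  fragment-length distribution of the merged reads
ls *.extendedFrags.fastq   # quick check after the FLASH loop finishes
```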
### BWA Step 5 (script 03)
```
#settings based on one lane of micro kit on miseq, scale up for larger runs
module load bwa/0.7.17
module load samtools/1.19.2
REFDIR=/nese/meclab/Jamie/RAD_GTHI/cm_RAD_2/6_SNPdisco
REF=targetamps11.21.fa
#run from folder containing trimmed reads
#mkdir alignments
#mkdir alnstats
#mkdir coverage
#mkdir unmapped_reads
for file in *.extendedFrags.fastq
do
    echo $file
    # make the sample base name by stripping everything after the first "."
    base=$(echo "$file" | awk -F "." '{print $1}')
    # run alignment
    bwa mem -t 10 -M \
        -R "@RG\tID:${base}\tLB:amplicon\tPL:ILLUMINA\tSM:${base}" \
        $REFDIR/$REF \
        ./$base.extendedFrags.fastq | \
        samtools sort -O bam -o ../03_alignments/$base.qc.flashed.bam -
    # get alignment stats
    samtools flagstat ../03_alignments/$base.qc.flashed.bam > ../03_alignments/alnstats/$base.qc.flashed.flagstat.txt
    # get coverage (from the bam written above)
    samtools depth ../03_alignments/$base.qc.flashed.bam > ../coverage/$base.coverage
    # get unmapped reads in sam format
    samtools view -f 4 ../03_alignments/$base.qc.flashed.bam > unmapped_reads/$base.unmapped.sam
done
```
:::info
The first step in the script pulls out file names from the `extendedFrags` file (which I noted above but am unsure if we created). Since we're only extracting names, I think we could modify this script to work with just the .fq files. What do you think?
:::
:::warning
But the alignment step itself calls on the `.extendedFrags.fastq` file, so I'm not sure how to work around that. My main question is: is there a step between the FLASH PE step and BWA where we created this extendedFrags file?
:::
### GATK (scripts 08-10)
**I have been running the scripts on the files you sent me but have stopped at the GATK import step. I don't have the files or visibility into the file formats needed to complete the processing. Details below.**
**09_GATK_import**
```
module load uri/main GATK/4.3.0.0-GCCcore-11.3.0-Java-11
SAMPLE_DIR=/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/00_info
OUTDIR=/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/05_genotyping/FlashedReads
SAMPLE_FILE=nofailHI_genomicsDBimport.map
SAMPLE_NAME=${SAMPLE_FILE%_genomicsDBimport.map}
INTERVAL_FILE=/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/00_info/ampnames.txt
INTERVAL=$(sed -n ${SLURM_ARRAY_TASK_ID}'p' ${INTERVAL_FILE})
echo "Using samples from ${SAMPLE_FILE} with interval ${INTERVAL} (taken from ${INTERVAL_FILE}). Output is written to ${OUTDIR} with the folder name ${SAMPLE_NAME}."
cd ${OUTDIR}
# creating genomicsDB for whole genome
#gatk \
# GenomicsDBImport \
# --java-options "-Xms60G -Xmx90G -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
# --genomicsdb-workspace-path ${OUTDIR}/genomicsDB-salado_only \
# --sample-name-map ${SAMPLE_DIR}/salado_only_genomicsDBimport.map \
# --L ${SAMPLE_DIR}/interval.list \
# --batch-size 50
# creating genomicsDB per chromosome
mkdir -p ${SAMPLE_NAME} #make main overarching directory if doesn't exist
gatk --java-options "-Xms4g -Xmx4g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport \
--sample-name-map ${SAMPLE_DIR}/${SAMPLE_FILE} \
--genomicsdb-workspace-path ${OUTDIR}/${SAMPLE_NAME}/${INTERVAL} \
-L $INTERVAL \
--batch-size 50 \
--reader-threads 10
```
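A note to myself on submission (not from Jamie's notes): since the script pulls its interval with `SLURM_ARRAY_TASK_ID`, I assume it gets submitted as an array job sized to the number of lines in `ampnames.txt`; a sketch, with the script filename being my guess:
```
# Sketch only: submit the import script as a SLURM array, one task per interval/amplicon
N=$(wc -l < /scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/00_info/ampnames.txt)
sbatch --array=1-${N} 09_GATK_import.sh   # script filename is a placeholder
```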
:::info
What does the `nofailHI_genomicsDBimport.map` file format look like? Is this something you created after analyzing coverage? Do you have a script to write this, or does the previous step create it?
:::
When looking up .map files, I found that it's just a file with sample names and their corresponding GVCF files in two tab-separated columns, like this:
```
sample1 /path/to/sample1.g.vcf.gz
sample2 /path/to/sample2.g.vcf.gz
sample3 /path/to/sample3.g.vcf.gz
```
Let me know if that looks right.
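In case it helps, here is a minimal sketch of how I imagine generating that map, assuming the HaplotypeCaller step left one `*.g.vcf.gz` per sample in a single directory and that the file names match the sample names (both are assumptions on my part):
```
# Sketch only: build a GenomicsDBImport sample map (sample name <TAB> path to its GVCF)
# Assumes one GVCF per sample, named <sample>.g.vcf.gz, in the current directory
> nofailHI_genomicsDBimport.map
for GVCF in *.g.vcf.gz
do
    SAMPLE=$(basename $GVCF .g.vcf.gz)
    printf "%s\t%s\n" "$SAMPLE" "$PWD/$GVCF" >> nofailHI_genomicsDBimport.map
done
```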
:::info
Still within the same script, I will need an example of the `production_run/00_info/ampnames.txt` file (or the file itself if you already have one). Do you have a copy, or can you let me know how I can create it?
:::
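My guess (unconfirmed) is that `ampnames.txt` is simply one amplicon name per line, matching the sequence headers of the target amplicon fasta, since the script feeds each line to GenomicsDBImport as an `-L` interval. If that's right, something like this would create it:
```
# Sketch only (assumes ampnames.txt = one amplicon name per line, taken from the
# sequence headers of the target amplicon reference fasta)
REFDIR=/nese/meclab/Jamie/RAD_GTHI/cm_RAD_2/6_SNPdisco
REF=targetamps11.21.fa
grep "^>" $REFDIR/$REF | sed 's/^>//' | cut -d " " -f 1 > ampnames.txt
wc -l ampnames.txt   # should equal the number of target amplicons
```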
:::warning
Also in this script, there is a commented-out section for whole-genome GenomicsDBImport processing. Is this section commented out because you previously ran it and no longer needed it, or should I just ignore it?
:::
**10_GATK_import**
```
module load uri/main GATK/4.3.0.0-GCCcore-11.3.0-Java-11
REFDIR=/nese/meclab/Jamie/RAD_GTHI/cm_RAD_2/6_SNPdisco # You need to download your reference genome to here
REF=targetamps11.21.fa
DIR=/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/05_genotyping/FlashedReads
OUTDIR=/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/05_genotyping/FlashedReads
SAMPLE_DB=nofailHI
INTERVAL=$(sed -n ${SLURM_ARRAY_TASK_ID}'p' /scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/00_info/reformat_bedforgatk4.txt)
AMPNAME=$(echo ${INTERVAL} | awk -F ":" '{print $1 ":" $2}')
cd ${OUTDIR}
gatk --java-options "-Xms4G -Xmx4G -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenotypeGVCFs \
-R ${REFDIR}/${REF} \
-V gendb://${SAMPLE_DB}/${AMPNAME} \
-O ${OUTDIR}/${SAMPLE_DB}_${AMPNAME}.vcf.gz
```
:::info
Would you be able to share what the `nofailHI` file looks like so I can create something similar for my project?
Also can you share the `reformat_bedforgatk4.txt` file with me? Is this something that is created early on?
:::
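While waiting on the real file, my assumption (please correct me) is that `reformat_bedforgatk4.txt` holds one GATK-style interval string per line (`amplicon:start-end`) and could be regenerated from a BED file of the targets; a sketch under that assumption:
```
# Sketch only: convert a BED file of target amplicons to GATK4 interval strings
# (BED is 0-based half-open, GATK intervals are 1-based inclusive, hence start+1)
# "targets.bed" is a placeholder name, not a file from Jamie's workflow
awk 'BEGIN{OFS=""} {print $1, ":", $2+1, "-", $3}' targets.bed > reformat_bedforgatk4.txt
```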
---
Otherwise, I've annotated the scripts in my GitHub and things ran smoothly for me! I haven't run it on the newest sequencing batch but will continue to make updates to my scripts README doc.
---
:::info
For Colony - are you running it in the GUI interface, or does it need enough power that it has to be run on the cluster?
:::
---
## New Notes from Jamie
**Filtering VCF files:**
Load the "gtseq" conda environment, while in a unity-compute session. Run all of the following from command line
1. BCFtools to merge all the VCFs from the different loci: `bcftools concat -f ampvcf_file_list.txt -O v -o nofailHI_allloci.vcf` - the file list is just a list of the paths to the individual per-locus VCF files, one per line (see the sketch after this list for one way to generate it); `-O v` indicates the output is an uncompressed VCF, and `-o` gives the name of the output file. Used bcftools/1.19.
2. First VCF filter: `vcftools --vcf nofailHI_allloci.vcf --minQ 30 --minDP 10 --max-missing 0.2 --min-alleles 2 --max-alleles 2 --recode --out allHI_snp_mMiss0.2`
    - keeps loci that have genotypes for at least 20% of individuals (`--max-missing 0.2`), requires a minimum site quality (QUAL) of 30, recodes an individual's genotype as missing if its read depth at that site is less than 10, restricts to biallelic SNPs with min and max alleles of 2; `--out` specifies the output file name
3. Second VCF filter, look at missingness by individual: `vcftools --vcf allHI_snp_mMiss0.2.recode.vcf --missing-indv`; this will output a file with the proportion of missing sites for every individual.
4. Identify individuals missing data at more than half of sites: `awk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv` ; the out.imiss is generated in the step above. You could make this filter more or less stringent if you'd like
5. Remove low-depth individuals: `vcftools --vcf allHI_snp_mMiss0.2.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out allHI_snp_mMiss0.2_nolowdp`
6. Filter for minor allele frequency and missingness again: `vcftools --vcf allHI_snp_mMiss0.2_nolowdp.recode.vcf --max-missing 0.7 --maf 0.05 --recode --out allHI_snp_mMiss0.8_nolowdp_maf0.05`
7. You can also look at missingness by site to remove loci where the majority of samples are missing: `vcftools --vcf nofailHI_allloci_filt_nolowdp.recode.vcf --missing-site --out missing-site-nolowdp` and then remove sites missing a lot of data at this point (I think I remove sites missing data for more than 60% of individuals)
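A minimal sketch of generating the `ampvcf_file_list.txt` used in step 1, assuming the per-locus VCFs from GenotypeGVCFs all sit in the genotyping output directory and are named `nofailHI_*.vcf.gz` as in script 10 (adjust the glob to your actual names):
```
# Sketch only: list the full paths of all per-locus VCFs, one per line, for bcftools concat -f
cd /scratch3/workspace/jstoll_umass_edu-spring/gtseq/2024_production_run/05_genotyping/FlashedReads
ls -d $PWD/nofailHI_*.vcf.gz > ampvcf_file_list.txt
wc -l ampvcf_file_list.txt   # should equal the number of amplicon intervals genotyped
```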
---
**Convert your bam files to sam files, in prep for microhaplot:**
```
#!/bin/bash
while read -r FILE; do
echo "converting bam $FILE to sam"
samtools view -h $FILE > ${FILE/.bam/.sam}
done < bamfilelist.txt
```
Requires a list of BAM files (full paths, one per line).
Run it from within the folder that has your BAM alignments; if you are in a unity-compute session with the gtseq conda env loaded, you can just run this script from the command line, or submit it as a job.
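And a one-liner sketch for building `bamfilelist.txt`, assuming the flashed alignment BAMs in the current folder are the ones you want to convert:
```
# Sketch only: full paths of the alignment BAMs, one per line
ls -d $PWD/*.qc.flashed.bam > bamfilelist.txt
```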
---
**Run microhaplot on the command line** (requires that you have first generated all your SAM files and have a VCF with genotyped SNPs for those samples)
R script:
```
###microhaplot####
devtools::install_github("ngthomas/microhaplot", build_vignettes = TRUE, build_opts = c("--no-resave-data", "--no-manual"))
microhaplot::mvShinyHaplot("~/Shiny") # provide a directory path to host the microhaplot app
library(microhaplot)
library(tidyverse)
# for your dataset: customize the following paths
sam.path <- "/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2025_production_run/ALL_SAM_FILES"
label.path <- "/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2025_production_run/ALL_SAM_FILES/ALLSAMLIST.txt"
vcf.path <- "/scratch3/workspace/jstoll_umass_edu-spring/gtseq/2025_production_run/05_genotyping/allHI_samples_filtered.recode.vcf"
app.path <- "~/Shiny/microhaplot"
out.path <- tempdir()
haplo.read.tbl <- prepHaplotFiles(run.label = "HIALL",
                                  sam.path = sam.path,
                                  out.path = out.path,
                                  label.path = label.path,
                                  vcf.path = vcf.path,
                                  app.path = app.path,
                                  add.filter = 1,
                                  n.jobs = 200) # assume running on dual core - this was for requesting 100 CPUs
```
I ran this as a script in the RStudio app through Unity OnDemand; I used R version 4.4.0 and requested 8 hours, 30 GB of memory, and 100 CPUs. It takes a LONG time to run depending on how many samples you have included - I think this was for all ~3000 of my samples.
Output here will be two .rds files; download them to your local computer or drive.
---
**Generating and filtering microhaplotypes:**
Once you've downloaded those .rds files, open an R script on your computer and launch the microhaplot shiny app:
```
library(microhaplot)
app.path <- "G:/.shortcut-targets-by-id/1qbm3xJEOyJAEQZhQMpieLrxBpR3x-Ev3/MEC_lab_projects/Chmy Hawaii sex ratios/06-GT-seq/Shiny/microhaplot"
runShinyHaplot(app.path)
```
Then you can visually inspect different loci (microhaplot tutorials are available on the package GitHub to get an understanding of what things look like and how to use the app).
From within the app, download the "locus_annotation.csv", "haplotype_frequency.csv", and "unfiltered_haplotype.csv" files. You may annotate some loci as rejects based on their profiles; that info will be saved in the annotation csv.
**The main things we will filter on are read depth per haplotype per individual, and allele balance per locus.** In theory, every individual should have only TWO haplotypes per locus (because these are diploid individuals), but there will likely be more than 2 initially. We'll remove microhaplotypes that are not supported by high enough read depth (typically if read depth is less than 20) and those where allele balance thresholds are not met. The allele balance is the ratio of reads supporting each haplotype within an individual. In a true heterozygous individual with a total of 10k reads at a locus, 5k should support one haplotype and the other 5k should support the other haplotype - that 5k:5k ratio is 1. In a homozygous individual, ideally all 10k reads support one haplotype, but it more likely looks like 9.5k reads in support of one haplotype and 0.5k in support of another - the allele balance here is 1 for the first (true) haplotype and ~0.05 for the second (0.5k/9.5k). We set the allele balance ratio high (at least 0.7) to remove these erroneous haplotype calls. This is probably better explained on the microhaplot GitHub, sorry lol.
The same script where we filter also has a section to prep files for Colony:
1. Convert all the microhaplotype character alleles to unique numeric identifiers
2. Generate reformatted genotype table for colony
3. Include metadata so you can generate genotype tables for males, females, and offspring, and known sibling relationships.
NOTE ON NAMING SCHEMES:
Each unique haplotype per locus will require a standardized name (generate this using the microhaplot.R script), like "mhLKcm1", which stands for microhaplotype - LisaKomoroske - Chelonia mydas - 1, for the first possible haplotype.
Name samples consistently, and they can't have more than 20 characters in the name for Colony. So while I was working in R, I had long names like "2022_T135_H18_20220614NTT135", but for Colony I converted each to a short unique identifier like "HI_2".
The script outputs text files that you can load into the Colony GUI, which will help you make the colony.dat file needed for the run.
---
**Running Colony**
I ran it in the GUI on my home PC; it took about 2-3 days to run with around 3000 samples. I have a computer with 16 cores, so keep that in mind.
Settings I used (select these in the GUI when prompted, or choose your own):
```
2470 ! Number of offspring in the sample
175 ! Number of loci
1234 ! Seed for random number generator
0 ! 0/1=Not updating/updating allele frequency
2 ! 2/1=Dioecious/Monoecious species
0 ! 0/1=Inbreeding absent/present
0 ! 0/1=Diploid species/HaploDiploid species
0 0 ! 0/1=Polygamy/Monogamy for males & females
0 ! 0/1 = Clone inference = No/Yes
1 ! 0/1=Scale full sibship=No/Yes
0 ! 0/1/2/3/4=No/Weak/Medium/Strong sibship prior; 4=Optimal sibship prior for Ne
0 ! 0/1=Unknown/Known population allele frequency
1 ! Number of runs
2 ! 1/2/3/4 = Short/Medium/Long/VeryLong run
1 ! 0/1=Monitor method by Iterate#/Time in second
1 ! Monitor interval in Iterate# / in seconds
1 ! 0/1=DOS/Windows version
1 ! 0/1/2=Pair-Likelihood-Score(PLS)/Full-Likelihood(FL)/FL-PLS-combined(FPLS) method
1 ! 0/1/2/3=Low/Medium/High/VeryHigh precision
```
You'll of course have a different number of offspring - this is how many offspring I had in my dataset after all the filtering and such, not how many I started with. Same with the number of loci - this is what I ended up with after filtering. Keep in mind that higher precision and a longer run length will require far more resources; published Colony papers suggest medium precision is good for most projects.
---
**Filter and analyze Colony outputs**
Colony generates MANY files. You can inspect them in the GUI, but I did most filtering and analyzing in R.
The primary output you'll use is the bestconfig file, but also the .maternity, .pairwisematernity, .paternity, and .pairwisepaternity files.
You'll want to pay close attention to assignments based on known metadata. Are offspring always assigned to the correct mother? In my experience, not always.
I filtered out individuals from the "best configuration" that had less than 80% confidence in assignment to a mother (in pairwisematernity file), and then double checked that the assigned mom was the correct mom (based on metadata).
From this table, I generated the number of unique fathers, polygyny, fathers shared with offspring across years, etc. I did NOT filter out low-confidence father assignments - something to consider moving forward. I just removed all offspring that had a low mother-offspring assignment probability, as I figured we couldn't trust those offspring assignments for either the mother or the father.