---
title: Andrew Kittentails Project
break: false
tags: RADSeq, Andrew
---

# Synthyris bullii sequence data

This document contains notes for a population genetics assessment of Synthyris bullii (kittentails) using the STACKS pipeline, with and without a reference genome. 120 individuals from across Illinois and Indiana were sequenced on an Illumina NovaSeq (10 individuals from each of 12 populations). The reference genome was generated from an individual from the Doug population at Nachusa Grasslands, Illinois; it was sequenced by PacBio at Mt. Sinai on a Revio SMRT Cell and assembled with hifiasm.

Adapters:

* Enzyme 1 (P1, EcoRI): ACACTCTTTCCCTACACGACGCTCTTCCGATCT
* Enzyme 2 (P2, MspI): GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

# ADMIXTURE

* requires low missing data
* input should be pruned for linkage disequilibrium

## Pipeline Summary

- Filter missing data (--max-missing 0.85)
- Convert to PLINK
- Prune for LD (--indep-pairwise)
- Run ADMIXTURE across K values
- Visualize ancestry proportions

#### 1. Begin with the VCF file used for pop gen

Load vcfR.gq20.smiss96.70.vcf into R with vcfR (`vcf <- read.vcfR("vcfR.gq20.smiss96.70.vcf")`):

```
***** Object of Class vcfR *****
85 samples
163 CHROMs
2,327 variants
Object size: 11.5 Mb
18.5 percent missing data
*****        *****         *****
```

#### 2. Filter Missingness in R

```
library(vcfR)

# Per-SNP and per-sample missingness from the genotype matrix
gt <- extract.gt(vcf)
missing_snp <- rowMeans(is.na(gt))
missing_sample <- colMeans(is.na(gt))

# Keep SNPs with <20% missing
keep_snps <- which(missing_snp < 0.2)
vcf_filtered <- vcf[keep_snps, ]

# Optionally filter samples too
keep_samples <- which(missing_sample < 0.2)
vcf_filtered@gt <- vcf_filtered@gt[, c(1, keep_samples + 1)] # +1 for FORMAT column

# save filtered vcf
write.vcf(vcf_filtered, file = "filtered.vcf")
```

#### 3. Convert VCF to PLINK

```
plink --vcf filtered.vcf --make-bed --out filtered --allow-extra-chr
```

#### 4. Replace 'chromosome' labels with integers

* Extract unique chromosomes from the filtered VCF file

```
bcftools query -f '%CHROM\n' filtered.vcf | sort -u > scaffolds.txt
```

* Create an integer-to-chromosome map

```
awk '{print $1 "\t" NR}' scaffolds.txt > integer_to_chr_map.txt
```

* Replace 'chromosomes' with integers in the .bim file

```
awk 'NR==FNR{a[$1]=$2; next} {$1=a[$1]; print}' integer_to_chr_map.txt filtered.bim > filtered.updated.bim
mv filtered.bim contig.filtered.bim
mv filtered.updated.bim filtered.bim
```

#### 5. LD pruning with a PLINK moving window

* Find SNPs in LD and write them to the pruned files:
  * window size = 50
  * step size (in SNPs) = 10
  * r-squared threshold = 0.2
* Then prune from the original PLINK fileset

```
plink --bfile filtered --indep-pairwise 50 10 0.2 --out pruned --allow-extra-chr
plink --bfile filtered --extract pruned.prune.in --make-bed --out data_pruned --allow-extra-chr
```

### If you lost a lot of data in the above, try imputation before ADMIXTURE...

#### 2A. Remove rare SNPs in R (those that occur only once; OPTIONAL, SKIP FIRST TIME THROUGH)

*Jump to step 2 (Filter Missingness) the first time through.*

```
# Count SNPs per scaffold and keep scaffolds with more than one SNP
chrom_counts <- table(vcf@fix[, "CHROM"])
keep_chroms <- names(chrom_counts[chrom_counts > 1])
vcf_rare_snp_removed <- vcf[vcf@fix[, "CHROM"] %in% keep_chroms, ]
write.vcf(vcf_rare_snp_removed, file = "$SAVEPATH$")
```

#### 3A. Impute missing genotypes with Beagle in the terminal (OPTIONAL, SKIP FIRST TIME THROUGH)

I ended up losing 33 samples and about 1,500 SNPs after filtering for <20% missing by sample and <20% missing by SNP, so I am trying imputation.

```
java -jar beagle.jar gt=filtered.vcf out=imputed
```

#### 4A. Convert VCF to PLINK

```
plink --vcf imputed.vcf --make-bed --out imputed --allow-extra-chr
```

#### 5A. Rename scaffolds to integers and create a map (necessary for ADMIXTURE: it rejects non-integer scaffold names such as hifiasm's ptg contigs)

```
cut -f1 imputed.bim | sort | uniq | nl > scaffold_map.txt
awk 'NR==FNR{map[$2]=$1; next} {$1=map[$1]; print}' scaffold_map.txt imputed.bim > imputed.integer.bim
```

Rename the original and integer .bim files for downstream steps:

```
mv imputed.bim imputed.contig.bim
mv imputed.integer.bim imputed.bim
```

#### 6A. Filter SNPs by missingness

```
plink --bfile imputed --geno 0.05 --make-bed --out imputed.filtered
```

#### 7A. LD pruning

```
plink --bfile imputed.filtered --indep-pairwise 50 5 0.2 --out pruned
plink --bfile imputed.filtered --extract pruned.prune.in --make-bed --out data_pruned
```

#### 8A. Run ADMIXTURE across K values

```
for K in {4..12}; do admixture --cv data_pruned.bed $K | tee log${K}.out; done
```
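To choose K, compare the cross-validation error that `--cv` prints into each log (a `CV error (K=...)` line per run); a quick way to collect them, assuming the log names from the loop above:

```
# Gather CV error across all K; the K with the lowest error is the usual choice
grep -h "CV error" log*.out | sort -t: -k2 -n
```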
#### 9. Make a pop map based on the samples in the .fam file

* Extract the sample-name column from the .fam file on the command line (the .fam is whitespace-delimited, so awk is safer than cut's default tab delimiter)

```
awk '{print $2}' data_pruned.fam > sample_names.txt
```

* Open the .txt file, then copy and paste it into Excel
* Use =LEFT(A1, number of characters in the sample name)
* Then =LEFT(B1, number of characters in the pop code)
* Repeat for all samples
* Copy and paste columns 2 and 3 as values only and delete the other columns
* Save as .csv
* At least this worked for me with samples named like HH1234; I just wanted the sample name and a column with the first two letters (HH)

#### ADMIXTURE PLOTS in R
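This section was left as a stub; here is a minimal sketch for a stacked barplot of ancestry proportions, assuming the Q files written by the loop in step 8A (data_pruned.K.Q) and a hypothetical two-column pop map, popmap.csv, built in step 9:

```
# Read ancestry proportions for one K (ADMIXTURE writes data_pruned.<K>.Q,
# rows in the same order as the .fam file)
K <- 5
q <- read.table(paste0("data_pruned.", K, ".Q"))

# Match samples to populations via the pop map from step 9 (hypothetical name)
fam <- read.table("data_pruned.fam")
popmap <- read.csv("popmap.csv", header = FALSE, col.names = c("sample", "pop"))
pops <- popmap$pop[match(fam$V2, popmap$sample)]

# Stacked barplot, samples grouped by population
ord <- order(pops)
barplot(t(as.matrix(q[ord, ])), col = rainbow(K), border = NA,
        names.arg = pops[ord], las = 2, cex.names = 0.5,
        ylab = "Ancestry proportion", main = paste("ADMIXTURE, K =", K))
```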
# Revisiting sequence data in 2024

New things to try:

* more thoroughly assess raw read quality with FastQC
* use cutadapt to remove the excess Cs generated during the 'dark cycle' of the NovaSeq run, prior to process_radtags
* explore runs of homozygosity as an alternative to Fis as a measure of inbreeding
* do not trim reads from 150 down to 100
* map raw reads to the reference genome

### FastQC results of raw data

* For all sublibraries, the first several bases have lower per-base sequence quality, and per-base sequence content is extreme (100%/0%) between positions 6 and 10.
* Trimming the first 10 bp to remove these positions with cutadapt (installed with pip on Quest; I don't have an allocation so I can't submit jobs). Note cutadapt takes the input file as a positional argument:
  * `cutadapt -u 10 -o /projects/p31922/01_raw_reads/trimmed_P1_S10_R1_001.fastq.gz /projects/p31922/01_raw_reads/P1_S10_R1_001.fastq.gz`
* Running cutadapt on one core on one set of reads (73 million reads) took 10 minutes
* Trimmed reads look slightly better (at least improved quality for the first positions and better per-base sequence content)

Running Trimmomatic with a sliding window to remove reads that have a score <20 in a 4-base window, keeping reads over 100 bp:

```
java -jar trimmomatic-0.39.jar SE -phred33 /projects/p31922/01_raw_reads/trimmed_P1_S10_R1_001.fastq.gz /projects/p31922/01_raw_reads/qualitytrimmed_P1_S10_R1_001.fastq.gz SLIDINGWINDOW:4:20 MINLEN:100
```

* Not much improvement outside of quality scores; going to move forward with both datasets and compare SNP output

Process Radtags Multiplexed Reads Notes (P1 only; can't find the log file for P2/P3):

* Original raw reads truncated to 100: 137,492,637
* Re-multiplexed raw reads with no truncation: 137,567,342 (93.9%)
* Trimmed reads: 3,599,550 (only 2%; must have trimmed the barcode, not a great idea)
* Trimmed and quality-trimmed reads: 2,289,410 (2.1%; not a good way to go either)
* Truncating during process_radtags is perhaps the better way to deal with issues at the beginning/end of sequences
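For reference, a process_radtags invocation consistent with the notes above (double digest with EcoRI/MspI per the adapters listed at the top, truncating to 100 bp); the directories and barcode file are placeholders:

```
# Hypothetical example; adjust paths and barcode file to your run.
# -P = reads in -p dir are paired; -r/-c/-q = rescue barcodes, clean
# uncalled bases, quality filter; -t 100 truncates reads to 100 bp.
process_radtags -p 01_raw_reads/ -P -b barcodes_P1.txt -o 02_multiplexed_reads/ \
    --renz_1 ecoRI --renz_2 mspI -r -c -q -t 100
```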
# Reference aligning sequence data onto assembled genome

Genome details (BUSCO v5.5; embryophyta lineage (50 genomes and 1614 BUSCOs), gene predictor: metaeuk):

```
***** Results: *****
C:98.5%[S:74.4%,D:24.1%],F:0.4%,M:1.1%,n:1614
1590 Complete BUSCOs (C)
1201 Complete and single-copy BUSCOs (S)
389  Complete and duplicated BUSCOs (D)
7    Fragmented BUSCOs (F)
17   Missing BUSCOs (M)
1614 Total BUSCO groups searched

Assembly Statistics:
1065       Number of scaffolds
1065       Number of contigs
1337213455 Total length
0.000%     Percent gaps
9 MB       Scaffold N50
9 MB       Contigs N50
```

(BUSCO v5.5; eudicot lineage (31 genomes and 2326 BUSCOs), gene predictor: metaeuk):

```
***** Results: *****
C:96.5%[S:71.4%,D:25.1%],F:0.4%,M:3.1%,n:2326
2244 Complete BUSCOs (C)
1661 Complete and single-copy BUSCOs (S)
583  Complete and duplicated BUSCOs (D)
10   Fragmented BUSCOs (F)
72   Missing BUSCOs (M)
2326 Total BUSCO groups searched

Assembly Statistics:
1065       Number of scaffolds
1065       Number of contigs
1337213455 Total length
0.000%     Percent gaps
9 MB       Scaffold N50
9 MB       Contigs N50
```

Index the assembled genome (on the Quest command line; took about 20 minutes):

```
module load bwa
bwa index asm.np2.p_ctg.fa
```

Mapping reads to the reference with bwa mem (should I try minimap2?):

```
# Define input and output directories
input_dir="/projects/p31922/02_multiplexed_reads/1.multiplexed_raw_trunc100/realign_reads/"
output_dir="/projects/p31922/06_genome_reads/aligned_reads2"
reference_genome="/projects/p31922/05_genome/polished_genomes/asm.np2.p_ctg.fa"

# Loop through all forward FASTQ files in the input directory
for forward in "$input_dir"/*.1.fq.gz; do
    # Extract the sample name (without path and forward extension)
    sample_name=$(basename "$forward" .1.fq.gz)

    # Define the corresponding reverse read file
    reverse="$input_dir/${sample_name}.2.fq.gz"

    # Check if the reverse file exists
    if [[ -f "$reverse" ]]; then
        # Run bwa mem for paired-end reads
        bwa mem "$reference_genome" "$forward" "$reverse" > "$output_dir/${sample_name}.sam"
        echo "Processed $sample_name"
    else
        echo "Warning: Reverse read file for $sample_name not found!"
    fi
done
```
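Since the SAMs get converted, sorted, and indexed below anyway, one possible refinement (not what was run here) is to pipe bwa mem straight into samtools and skip the intermediate SAM:

```
# Alternative sketch: emit a sorted, indexed BAM directly
bwa mem "$reference_genome" "$forward" "$reverse" \
    | samtools sort -o "$output_dir/${sample_name}_sorted.bam" -
samtools index "$output_dir/${sample_name}_sorted.bam"
```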
Check the quality of mapping on 1 or 2 individuals with:

```
samtools flagstat <SAMPLE.sam>
```

Mapped a random individual (WC8045):

```
samtools flagstat WC8045.sam
181791 + 0 in total (QC-passed reads + QC-failed reads)
993 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
144144 + 0 mapped (79.29% : N/A)
180798 + 0 paired in sequencing
90399 + 0 read1
90399 + 0 read2
129658 + 0 properly paired (71.71% : N/A)
135958 + 0 with itself and mate mapped
7193 + 0 singletons (3.98% : N/A)
6176 + 0 with mate mapped to a different chr
833 + 0 with mate mapped to a different chr (mapQ>=5)
```

On reads truncated to 100 bp (varying the bwa mem seed length, -k):

* k=19 (default): 79.29% mapped
* k=16: 75.66% mapped
* k=14: 75.66%
* k=12: 76.03%
* k=10: 75.66%

None of these are very good (>90% is ideal). This is most likely a poorly sequenced individual with missing-data issues again; it has very few starting reads (181k). We'll see if the mapped alignment is any better. Explore removing individuals like this.

On untruncated reads:

* k=19 (default): 77.90% mapped
* k=16: 77.9% mapped
* k=14: 77.9%

Sticking with truncated reads.

On individual PZ0010:

```
(base) [amd9539@quser31 06_genome_reads]$ samtools flagstat PZ0010.K19.sam
2932394 + 0 in total (QC-passed reads + QC-failed reads)
20640 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
2655355 + 0 mapped (90.55% : N/A)
2911754 + 0 paired in sequencing
1455877 + 0 read1
1455877 + 0 read2
0 + 0 properly paired (0.00% : N/A)
2558288 + 0 with itself and mate mapped
76427 + 0 singletons (2.62% : N/A)
2295204 + 0 with mate mapped to a different chr
20744 + 0 with mate mapped to a different chr (mapQ>=5)
```

## Assess read alignment for all reads

With RADseq reads mapped to a reference genome, convert SAM to BAM, sort, index, and get quality stats:

```
module load samtools

# Specify the directory containing the .sam files
DIRECTORY="/projects/p31922/06_genome_reads/aligned_reads/"

# Navigate to the directory
cd "$DIRECTORY" || { echo "Directory not found"; exit 1; }

# Check if there are any .sam files
if ls *.sam 1> /dev/null 2>&1; then
    for file in *.sam; do
        echo "Processing $file..."

        # Convert SAM to BAM
        samtools view -bS "$file" -o "${file%.sam}.bam"

        # Check if the BAM conversion was successful
        if [[ -f "${file%.sam}.bam" ]]; then
            echo "Created ${file%.sam}.bam"

            # Sort the BAM file
            samtools sort "${file%.sam}.bam" -o "${file%.sam}_sorted.bam"
            echo "Sorted ${file%.sam}.bam to ${file%.sam}_sorted.bam"

            # Index the sorted BAM file
            samtools index "${file%.sam}_sorted.bam"
            echo "Indexed ${file%.sam}_sorted.bam"

            # Generate flag statistics
            samtools flagstat "${file%.sam}_sorted.bam" > "${file%.sam}_flagstat.txt"
            echo "Generated flag statistics for ${file%.sam}_sorted.bam"
        else
            echo "Failed to create ${file%.sam}.bam"
        fi
    done
else
    echo "No .sam files found in the directory."
fi
```

## Create summary flagstat to assess overall mapping quality

```
# Set the directory containing the _flagstat.txt files
FLAGSTAT_DIR="/projects/p31922/06_genome_reads/aligned_reads/"

# Create a summary file
SUMMARY_FILE="summary_flagstat.txt"
echo -e "Sample\tQC_Passed_Reads\tMapped_Reads\tMapped_%\tProperly_Paired_Reads\tProperly_Paired_%" > "$SUMMARY_FILE"

# Loop through all _flagstat.txt files
for file in "$FLAGSTAT_DIR"/*_flagstat.txt; do
    if [[ -e "$file" ]]; then
        # Extract the sample name (remove directory and extension)
        SAMPLE=$(basename "$file" _flagstat.txt)

        # Extract relevant information
        QC_PASSED=$(grep "in total" "$file" | awk '{print $1}')
        MAPPED=$(grep "mapped (" "$file" | awk '{print $1}')
        MAPPED_PCT=$(grep "mapped (" "$file" | awk -F '[()]' '{print $2}')
        PROPERLY_PAIRED=$(grep "properly paired (" "$file" | awk '{print $1}')
        PROPERLY_PAIRED_PCT=$(grep "properly paired (" "$file" | awk -F '[()]' '{print $2}')

        # Append the data to the summary file
        echo -e "$SAMPLE\t$QC_PASSED\t$MAPPED\t$MAPPED_PCT\t$PROPERLY_PAIRED\t$PROPERLY_PAIRED_PCT" >> "$SUMMARY_FILE"
    else
        echo "No _flagstat.txt files found in $FLAGSTAT_DIR."
        exit 1
    fi
done

echo "Summary of all _flagstat.txt files saved to $SUMMARY_FILE."
```

### Export summary_file to R Studio...
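As a starting point for that R Studio step, a minimal sketch that reads summary_flagstat.txt and flags low-mapping individuals (the 90% target follows the note above):

```
# Read the tab-separated summary written by the shell script above
summ <- read.delim("summary_flagstat.txt", check.names = FALSE)

# The Mapped_% field looks like "79.29% : N/A"; keep the leading number
summ$mapped_pct <- as.numeric(sub("%.*", "", summ$`Mapped_%`))

# Individuals below the ~90% target, worst first
low <- summ[order(summ$mapped_pct), ]
subset(low, mapped_pct < 90)

# Distribution of mapping rates across individuals
hist(summ$mapped_pct, breaks = 20, xlab = "Mapped %",
     main = "Mapping rate per individual")
```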
# VCF tools filtering steps/notes

* Beginning with PO (population code) and following the speciation genomics walkthrough to assess the data. I moved all population.snps.vcf files out of their individual pop folders into a single vcf folder and renamed them like "PO.pop.snps.vcf"
  * https://speciationgenomics.github.io/filtering_vcfs/
  * https://github.com/cheyennelmoore/Erigenia-Analyses
* Move the folder with VCF files to the home directory
  * `cp -r vcf /home/amd9539/vcf`
* Create variables for easy coding (change PO to the appropriate name)
  * `SUBSET_VCF=~/vcf/PO.pop.snps.vcf`
  * `OUT=~/vcf/PO_out/PO`
* Load VCFtools
  * `module load vcftools/0.1.17`
* Calculate:
  * Allele frequency
    * `vcftools --gzvcf $SUBSET_VCF --freq2 --out $OUT --max-alleles 2`
  * Mean depth per individual
    * `vcftools --gzvcf $SUBSET_VCF --depth --out $OUT`
  * Mean depth per site
    * `vcftools --gzvcf $SUBSET_VCF --site-mean-depth --out $OUT`
  * Site quality
    * `vcftools --gzvcf $SUBSET_VCF --site-quality --out $OUT`
  * Proportion of missing data per individual
    * `vcftools --gzvcf $SUBSET_VCF --missing-indv --out $OUT`
  * Proportion of missing data per site
    * `vcftools --gzvcf $SUBSET_VCF --missing-site --out $OUT`
  * Heterozygosity and inbreeding per individual
    * `vcftools --gzvcf $SUBSET_VCF --het --out $OUT`
* Move all outputs onto the laptop and process in R Studio per the tutorial to select filtering metrics
* I have chosen and set the metrics as below, but first set input and output files for the filtered .vcf
  * `VCF_IN=~/vcf/FR.pop.snps.vcf`
  * `VCF_OUT=~/vcf/FR.pop.snps.filtered.vcf`
* Filtering metrics:
  * `MAF=0.2`
  * `MISS=0.75`
  * `MIN_DEPTH=10`
  * `MAX_DEPTH=50`
* Filter with:

```
vcftools --gzvcf $VCF_IN --remove-indv LO0003 --remove-indels --maf $MAF --max-missing $MISS --min-meanDP $MIN_DEPTH --max-meanDP $MAX_DEPTH --minDP $MIN_DEPTH --maxDP $MAX_DEPTH --recode --stdout | gzip -c > $VCF_OUT
```

# Workflow for VCFtools and PLINK to set up a PCA in R, view missing data, and view population stats

* Go to the parameter folder that contains the assembly data and the populations.snps.vcf file
* Create a plink folder
  * `mkdir plink`
* Add scaffold_ to the SNP IDs in a new VCF file within the plink folder so PLINK doesn't think it's human chromosome data and give you errors (https://groups.google.com/g/plink2-users/c/b5bP9YZfkFU)
  * `sed -E 's/^([0-9])/scaffold_\1/g' populations.snps.vcf > plink/pop.snps2.vcf`
* Go into the plink folder and load VCFtools
  * `cd plink`
  * `module load vcftools/0.1.17`
* Create a PLINK file from the VCF
  * `vcftools --vcf pop.snps2.vcf --plink --out plink`
* Load PLINK (https://speciationgenomics.github.io/pca/ <- great guide from here on with more details)
  * `module load plink/1.9`
* Perform linkage pruning, i.e. identify sites to prune
  * `plink --vcf pop.snps2.vcf --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.1 --out kplink`
* Create the PCA eigenvalue files
  * `plink --vcf pop.snps2.vcf --double-id --allow-extra-chr --set-missing-var-ids @:# --extract kplink.prune.in --make-bed --pca --out kplink$$$`
* Export the .eigenval and .eigenvec files to your laptop and use them to create a PCA in R Studio (see the sketch after this list)
* Check missing data
  * `vcftools --vcf pop.snps2.vcf --missing-indv --out MissingData_322$$$.imiss`
  * Creates an out.imiss file
  * Move it to the laptop and view in Excel to sort by missing data
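A minimal PCA sketch in base R along the lines of the speciationgenomics guide, assuming `--out` was simply kplink (the $$$ placeholder dropped) and that population codes are the first two letters of each sample name, as with HH1234 above:

```
# Read PLINK PCA output (kplink.eigenvec: FID, IID, then PC columns)
eigenvec <- read.table("kplink.eigenvec")
eigenval <- scan("kplink.eigenval")

# Percent variance explained by each PC
pve <- 100 * eigenval / sum(eigenval)

# Population code = first two letters of the sample name (assumption)
pop <- as.factor(substr(eigenvec$V2, 1, 2))

plot(eigenvec$V3, eigenvec$V4, col = pop, pch = 19,
     xlab = paste0("PC1 (", round(pve[1], 1), "%)"),
     ylab = paste0("PC2 (", round(pve[2], 1), "%)"))
legend("topright", legend = levels(pop), col = seq_along(levels(pop)),
       pch = 19, cex = 0.7)
```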
# 4/24/23 Using VCFtools and PLINK to filter data and cluster in a PCA

* For help:
  * VCFtools: https://vcftools.github.io/man_latest.html
  * PLINK: https://www.cog-genomics.org/plink/2.0/general_usage
* Both VCFtools and PLINK are installed on Quest; load the software with:
  * `module load vcftools/0.1.17`
  * `module load plink/1.9`
  * plink/2.0 is listed as preferred, but according to the PLINK manual, plink2 is still undergoing alpha and beta testing, so plink1.9 should be utilized primarily; plink2 should be used as a 'supplement' for specific tasks.
* STEP 1 - create a PLINK file from the VCF file if not done during de novo assembly (need to create a plink folder (mkdir) within the parameter folder you are using):
  * `vcftools --vcf [input folder/populations.snps.vcf] --plink --out [output folder/plink]`
  * this took 2 seconds and 1.5 MB of memory
* STEP 2 - TROUBLESHOOTING
  * Error: Invalid chromosome code 27; resolution discussion: https://groups.google.com/g/plink2-users/c/b5bP9YZfkFU
  * Terminology: a contig is a continuous stretch of nucleotide sequence, while a scaffold is a portion of the genome consisting of contigs and gaps. Both contigs and scaffolds are reconstructed genomic sequences.

# 4/17/23 Compiling stats for each denovo assembly parameter

* In each parameter folder (322, 333, etc.) open the populations.log file to find loci kept and variant loci
* Enter these into an Excel sheet to compare and make decisions based on loci
* Should I assemble with no population map to generate stats for the subsample?
  * Yes - run with and without a pop map and we will run a PCA
* Use PLINK to generate a PCA for the assembly run on each population with each sample
* THEN use the PCA to assess whether sublibrary or population better explains clustering

# Conda location

environment location: /home/amd9539/.conda/envs/STACKS

# 3/15/23 running STACKS with Slurm via Notepad++

## Create a script in Notepad++ that contains the code for the STACKS program

QUESTIONS

* How do I view the depth-of-coverage output? What is the filename? A: gstacks.log.distribs
* Does specifying the number of cores in the STACKS denovo assembler change how Slurm allocates CPU usage on Quest?
* Submitting jobs successfully to Slurm but commands not actually running - is STACKS not loaded properly/not being called in properly? No errors in the Slurm error logs, nothing in the Slurm output folder.

NOTES

* To save the file as '.sh', Save As, select 'All Types', and end the file name with .sh
* NEED TO RUN STACKS WITH parameters (m,M,n): 3,2,2; 4,2,2; 5,2,2; 3,3,3; 3,4,4; 3,5,5
* Use STACKS on Quest - called with `module load stacks/2.64-gcc-9.2.0`
* JOB 6421078 COMPLETED (3/27) - "denovo_map.pl" on 49 samples; Quest allocation used: 1 node, 4 cores/node, time = 2 hr 30 min, memory = 3 GB (does memory increase with more data? NO, but add a couple of GB if it's close to the request)
* Scaling up (3/28) - 60 samples randomly selected from all pools/sublibraries (23 each from 1 & 2, 14 from 3; sublibrary 3 has only 30 samples). Script set to run on 6 different parameter sets = 2 hr 30 min x 6 = 15 hr; +10 samples => use 24 hr to be safe?? How much memory to use?
* Submitted job with 24 hr time, 4 GB memory, 1 node, name = Test_Stacks (remember to change the job name next time)
* Job cancelled:

```
ID: 6536925
Cluster: quest
User/Group: amd9539/amd9539
State: NODE_FAIL (exit code 1) => ????
Nodes: 1
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:12:24 core-walltime
Job Wall-clock time: 00:03:06
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 6.00 GB (6.00 GB/node)
```

## Create a script in Notepad++ for Slurm that references the STACKS script as a .sh file with bash
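A minimal sbatch wrapper along those lines; the account, partition, and paths are placeholders, and the m/M/n values show the first parameter set from the notes (m is passed through to ustacks with -X, since denovo_map.pl exposes -M and -n directly):

```
#!/bin/bash
#SBATCH --account=p31922          # placeholder allocation
#SBATCH --partition=normal        # placeholder partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=4G
#SBATCH --time=24:00:00
#SBATCH --job-name=stacks_322
#SBATCH --output=slurm_%j.out
#SBATCH --error=slurm_%j.err

module load stacks/2.64-gcc-9.2.0

# One m,M,n parameter set per run (here 3,2,2); sample/popmap paths are placeholders
denovo_map.pl -M 2 -n 2 -T 4 -X "ustacks: -m 3" \
    --samples /projects/p31922/03_samples \
    --popmap /projects/p31922/popmap.txt \
    -o /projects/p31922/04_assemblies/322
```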