---
title: Andrew Kittentails Project
break: false
tags: RADSeq, Andrew
---
# Synthyris bullii sequence data
This document contains notes for a population genetics assessment of Synthyris bullii (kittentails) using the STACKS pipeline, with and without a reference genome. From across Illinois and Indiana, 120 individuals (10 from each of 12 populations) were sequenced on an Illumina NovaSeq. The reference genome was generated from an individual from the Nachusa Doug population at Nachusa Grasslands, Illinois; it was sequenced by PacBio at Mt. Sinai on a Revio SMRT Cell and assembled with hifiasm.
* adapter enzyme 1 (P1-EcoRI): `ACACTCTTTCCCTACACGACGCTCTTCCGATCT`
* adapter enzyme 2 (P2-MspI): `GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT`
# ADMIXTURE
* requires low missing data
* pruned for linkage disequilibrium (LD)
## Pipeline Summary
- Filter missing data (--max-missing 0.85)
- Convert to PLINK
- Prune for LD (--indep-pairwise)
- Run ADMIXTURE across K values
- Visualize ancestry proportions
#### 1. Begin with the vcf file used for pop gen
```
library(vcfR)
vcf <- read.vcfR("vcfR.gq20.smiss96.70.vcf")
vcf
```
Printing the object shows:
```
**** Object of Class vcfR *****
85 samples
163 CHROMs
2,327 variants
Object size: 11.5 Mb
18.5 percent missing data
***** ***** *****
```
#### 2. Filter missingness in R
```
# Per-SNP and per-sample missingness from the genotype matrix
gt <- extract.gt(vcf)
missing_snp <- rowMeans(is.na(gt))
missing_sample <- colMeans(is.na(gt))
# Keep SNPs with <20% missing
keep_snps <- which(missing_snp < 0.2)
vcf_filtered <- vcf[keep_snps, ]
# Optionally filter samples too
keep_samples <- which(missing_sample < 0.2)
vcf_filtered@gt <- vcf_filtered@gt[, c(1, keep_samples + 1)] # +1 for the FORMAT column
# Save the filtered vcf
write.vcf(vcf_filtered, file = "filtered.vcf")
```
#### 3. Convert VCF to PLINK
```
plink --vcf filtered.vcf --make-bed --out filtered --allow-extra-chr
```
#### 4. Replace 'chromosome' labels with integers
* Extract unique chromosomes from filtered vcf file
```
bcftools query -f '%CHROM\n' filtered.vcf | sort -u > scaffolds.txt
```
* Create integer-chromosome map
```
awk '{print $1 "\t" NR}' scaffolds.txt > integer_to_chr_map.txt
```
* replace 'chromosomes' with integers in .bim file
```
awk 'NR==FNR{a[$1]=$2; next} {$1=a[$1]; print}' integer_to_chr_map.txt filtered.bim > filtered.updated.bim
mv filtered.bim contig.filtered.bim
mv filtered.updated.bim filtered.bim
```
#### 5. LD pruning with a PLINK moving window
* find snps in LD and write to pruned files:
* window size = 50
* step size (in SNPs) = 10
* r-squared threshold = 0.2
* prune from original plink file
```
plink --bfile filtered --indep-pairwise 50 10 0.2 --out pruned --allow-extra-chr
plink --bfile filtered --extract pruned.prune.in --make-bed --out data_pruned --allow-extra-chr
```
### If you lost a lot of data in the above, try imputation before ADMIXTURE...
#### 2A. Remove rare SNPs in R (scaffolds with only one SNP; OPTIONAL, SKIP FIRST TIME THROUGH)
* Jump to step X (filter missing data) the first time through
```
# Count SNPs per scaffold; keep scaffolds with more than one SNP
chrom_counts <- table(vcf@fix[, "CHROM"])
keep_chroms <- names(chrom_counts[chrom_counts > 1])
vcf_rare_snp_removed <- vcf[vcf@fix[, "CHROM"] %in% keep_chroms, ]
write.vcf(vcf_rare_snp_removed, file = "$SAVEPATH$")
```
#### 3A. Impute missing genotypes with Beagle in the terminal (OPTIONAL, SKIP FIRST TIME THROUGH)
I ended up losing 33 samples and about 1,500 SNPs after filtering for <20% missing by sample and <20% missing by SNP, so I am trying imputation.
```
java -jar beagle.jar gt=filtered.vcf out=imputed
```
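One gotcha (an assumption based on Beagle 5's default behavior, not from the original notes): Beagle writes its output gzipped as imputed.vcf.gz, so decompress it before the PLINK conversion in step 4A (plink 1.9's --vcf can also read the .vcf.gz directly):
```
gunzip imputed.vcf.gz
```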
#### 4A. Convert VCF to PLINK
```
plink --vcf imputed.vcf --make-bed --out imputed --allow-extra-chr
```
#### 5A. Rename scaffolds to integers and create a map (necessary for ADMIXTURE - it doesn't like ptg names or whatever your scaffolds are called)
```
cut -f1 imputed.bim | sort | uniq | nl > scaffold_map.txt
awk 'NR==FNR{map[$2]=$1; next} {$1=map[$1]; print}' scaffold_map.txt imputed.bim > imputed.integer.bim
```
Rename the original and integer-renamed .bim files for downstream use:
```
mv imputed.bim imputed.contig.bim
mv imputed.integer.bim imputed.bim
```
#### 6A. Filter SNPs by Missingness
```
plink --bfile imputed --geno 0.05 --make-bed --out imputed.filtered --allow-extra-chr
```
#### 7A. LD Pruning
```
plink --bfile imputed.filtered --indep-pairwise 50 5 0.2 --out pruned --allow-extra-chr
plink --bfile imputed.filtered --extract pruned.prune.in --make-bed --out data_pruned --allow-extra-chr
```
#### 8A. Run ADMIXTURE across K values
```
for K in {4..12}; do admixture --cv data_pruned.bed $K | tee log${K}.out; done
```
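To pick K, compare the cross-validation errors that --cv writes into the logs:
```
grep -h "CV error" log*.out
```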
#### 9. Make pop map based on samples from the .fam file
* extract column from .fam file on command line
```
# .fam files are space-delimited, so use awk instead of cut
awk '{print $2}' data_pruned.fam > sample_names.txt
```
* open sample_names.txt and copy/paste the names into Excel
* use `=LEFT(A1, <number of characters in sample name>)`
* then `=LEFT(B1, <number of characters in pop code>)`
* fill down for all samples
* copy and paste columns 2 and 3 as values only and delete the other columns
* save as .csv
* or at least this worked for me with samples like HH1234; I just wanted the sample name and a column with the first 2 letters (HH). A command-line alternative is sketched below.
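For names like HH1234, where the first 2 characters are the pop code, this one-liner builds the same two-column pop map straight from the .fam file (popmap.tsv is my file name, not from the original notes):
```
awk '{print $2 "\t" substr($2, 1, 2)}' data_pruned.fam > popmap.tsv
```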
#### ADMIXTURE PLOTS in R
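A minimal base-R sketch for a stacked ancestry barplot (assumes the data_pruned.3.Q output from the K loop above and the popmap.tsv built in step 9; both file names come from the steps above):
```
# Stacked barplot of ancestry proportions for one K (here K = 3)
q <- read.table("data_pruned.3.Q")
popmap <- read.table("popmap.tsv", col.names = c("sample", "pop"))
ord <- order(popmap$pop)
barplot(t(as.matrix(q[ord, ])), col = rainbow(ncol(q)), border = NA,
        names.arg = popmap$sample[ord], las = 2, cex.names = 0.5,
        ylab = "Ancestry proportion")
```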
# Revisiting sequence data in 2024
New things to try:
* more thoroughly assess raw read quality with FastQC
* use cutadapt to remove excess Cs generated during the 'dark cycle' of the NovaSeq run, prior to process_radtags
* explore 'runs of homozygosity' as an alternative to Fis as a measure of inbreeding
* do not truncate reads to 100 bp; keep the full 150 bp
* map raw reads to the reference genome
### FastQC results of raw data
* For all sublibraries, the first several bases have lower per-base sequence quality, and per-base sequence content is extreme (100/0%) between positions 6 and 10.
* Trimming the first 10 bp to remove these positions with cutadapt (installed with pip on Quest; I don't have an allocation so can't submit jobs)
* `cutadapt -u 10 -o /projects/p31922/01_raw_reads/trimmed_P1_S10_R1_001.fastq.gz /projects/p31922/01_raw_reads/P1_S10_R1_001.fastq.gz`
* running cutadapt on one core on one set of reads (73 million reads) took 10 minutes
* Trimmed reads look slightly better (at least improved quality for the first bases and better per-base sequence content). Running Trimmomatic with a sliding window to remove reads that have a score <20 in a 4-base window, keeping reads over 100 bp. Code:
* `java -jar trimmomatic-0.39.jar SE -phred33 /projects/p31922/01_raw_reads/trimmed_P1_S10_R1_001.fastq.gz /projects/p31922/01_raw_reads/qualitytrimmed_P1_S10_R1_001.fastq.gz SLIDINGWINDOW:4:20 MINLEN:100`
* Not much improvement outside of quality scores - going to move forward with both datasets and compare SNP output
### Process Radtags
Multiplexed reads notes (P1 only - can't find the log file for P2/P3):
* Original raw reads truncated to 100: 137492637
* Re-multiplexed raw reads with no truncation: 137567342 (93.9%)
* Trimmed reads: 3599550 (only 2%; must've trimmed the barcode - not a great idea)
* Trimmed and quality-trimmed reads: 2289410 (2.1% - not a good way to go either)
* Truncating during process_radtags is perhaps the better way to deal with issues at the beginning/end of sequences; see the sketch below
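A process_radtags call along these lines (a sketch, not the original command: the directories, barcode file, and paired-end flag are assumptions; the EcoRI/MspI enzyme pair comes from the adapters listed at the top; -t 100 truncates, -c/-q clean and quality-filter, -r rescues barcodes):
```
process_radtags -p 01_raw_reads/ --paired -b barcodes_P1.txt -o 02_multiplexed_reads/ \
    --renz_1 ecoRI --renz_2 mspI -t 100 -c -q -r
```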
# Aligning sequence data to the assembled reference genome
Genome details:

BUSCO v5.5, embryophyta lineage (50 genomes, 1614 BUSCOs), gene predictor: metaeuk:
```
***** Results: *****
C:98.5%[S:74.4%,D:24.1%],F:0.4%,M:1.1%,n:1614
1590    Complete BUSCOs (C)
1201    Complete and single-copy BUSCOs (S)
389     Complete and duplicated BUSCOs (D)
7       Fragmented BUSCOs (F)
17      Missing BUSCOs (M)
1614    Total BUSCO groups searched
```
BUSCO v5.5, eudicot lineage (31 genomes, 2326 BUSCOs), gene predictor: metaeuk:
```
***** Results: *****
C:96.5%[S:71.4%,D:25.1%],F:0.4%,M:3.1%,n:2326
2244    Complete BUSCOs (C)
1661    Complete and single-copy BUSCOs (S)
583     Complete and duplicated BUSCOs (D)
10      Fragmented BUSCOs (F)
72      Missing BUSCOs (M)
2326    Total BUSCO groups searched
```
Assembly statistics (identical in both BUSCO runs):
```
1065        Number of scaffolds
1065        Number of contigs
1337213455  Total length
0.000%      Percent gaps
9 Mb        Scaffold N50
9 Mb        Contigs N50
```
Index the assembled genome (on the Quest command line; took about 20 minutes):
```
module load bwa
bwa index asm.np2.p_ctg.fa
```
Mapping reads to the reference with bwa mem (should I try minimap2?):
```
# Define input and output directories
input_dir="/projects/p31922/02_multiplexed_reads/1.multiplexed_raw_trunc100/realign_reads/"
output_dir="/projects/p31922/06_genome_reads/aligned_reads2"
reference_genome="/projects/p31922/05_genome/polished_genomes/asm.np2.p_ctg.fa"

# Loop through all forward FASTQ files in the input directory
for forward in "$input_dir"/*.1.fq.gz; do
    # Extract the sample name (without path and forward extension)
    sample_name=$(basename "$forward" .1.fq.gz)
    # Define the corresponding reverse read file
    reverse="$input_dir/${sample_name}.2.fq.gz"
    # Check if the reverse file exists
    if [[ -f "$reverse" ]]; then
        # Run bwa mem for paired-end reads
        bwa mem "$reference_genome" "$forward" "$reverse" > "$output_dir/${sample_name}.sam"
        echo "Processed $sample_name"
    else
        echo "Warning: Reverse read file for $sample_name not found!"
    fi
done
```
Check the quality of mapping on 1 or 2 individuals with:
```
samtools flagstat <SAMPLE.sam>
```
Mapped a random individual (WC8045):
```
samtools flagstat WC8045.sam
181791 + 0 in total (QC-passed reads + QC-failed reads)
993 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
144144 + 0 mapped (79.29% : N/A)
180798 + 0 paired in sequencing
90399 + 0 read1
90399 + 0 read2
129658 + 0 properly paired (71.71% : N/A)
135958 + 0 with itself and mate mapped
7193 + 0 singletons (3.98% : N/A)
6176 + 0 with mate mapped to a different chr
833 + 0 with mate mapped to a different chr (mapQ>=5)
```
On reads truncated to 100 bp (varying bwa mem's -k minimum seed length; see the sketch below):
* K=19 (default): 79.29% mapped
* K=16: 75.66% mapped
* K=14: 75.66% mapped
* K=12: 76.03% mapped
* K=10: 75.66% mapped

None of these are very good (>90% is ideal) - most likely a poorly sequenced individual, i.e. missing-data issues again. We'll see if the mapped alignment is any better. Explore removing individuals like this with very few starting reads (181k).

On untruncated reads:
* K=19 (default): 77.90% mapped
* K=16: 77.9% mapped
* K=14: 77.9% mapped

Sticking with truncated reads.
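For reference, the K values above are bwa mem's -k minimum seed length (default 19); the K=16 run for this individual would look something like this (read file names are assumptions):
```
bwa mem -k 16 asm.np2.p_ctg.fa WC8045.1.fq.gz WC8045.2.fq.gz > WC8045.K16.sam
```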
On individual PZ0010:
```
(base) [amd9539@quser31 06_genome_reads]$ samtools flagstat PZ0010.K19.sam
2932394 + 0 in total (QC-passed reads + QC-failed reads)
20640 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
2655355 + 0 mapped (90.55% : N/A)
2911754 + 0 paired in sequencing
1455877 + 0 read1
1455877 + 0 read2
0 + 0 properly paired (0.00% : N/A)
2558288 + 0 with itself and mate mapped
76427 + 0 singletons (2.62% : N/A)
2295204 + 0 with mate mapped to a different chr
20744 + 0 with mate mapped to a different chr (mapQ>=5)
```
## Assess read alignment for all reads
## With RADseq reads mapped to the reference genome, convert SAM to BAM, sort, index, and get quality stats
```
module load samtools

# Specify the directory containing the .sam files
DIRECTORY="/projects/p31922/06_genome_reads/aligned_reads/"

# Navigate to the directory
cd "$DIRECTORY" || { echo "Directory not found"; exit 1; }

# Check if there are any .sam files
if ls *.sam 1> /dev/null 2>&1; then
    for file in *.sam; do
        echo "Processing $file..."
        # Convert SAM to BAM
        samtools view -bS "$file" -o "${file%.sam}.bam"
        # Check if the BAM conversion was successful
        if [[ -f "${file%.sam}.bam" ]]; then
            echo "Created ${file%.sam}.bam"
            # Sort the BAM file
            samtools sort "${file%.sam}.bam" -o "${file%.sam}_sorted.bam"
            echo "Sorted ${file%.sam}.bam to ${file%.sam}_sorted.bam"
            # Index the sorted BAM file
            samtools index "${file%.sam}_sorted.bam"
            echo "Indexed ${file%.sam}_sorted.bam"
            # Generate flag statistics
            samtools flagstat "${file%.sam}_sorted.bam" > "${file%.sam}_flagstat.txt"
            echo "Generated flag statistics for ${file%.sam}_sorted.bam"
        else
            echo "Failed to create ${file%.sam}.bam"
        fi
    done
else
    echo "No .sam files found in the directory."
fi
```
## Create summary flagstat to assess overall mapping quality
```
# Set the directory containing the _flagstat.txt files
FLAGSTAT_DIR="/projects/p31922/06_genome_reads/aligned_reads/"

# Create a summary file with a header row
SUMMARY_FILE="summary_flagstat.txt"
echo -e "Sample\tQC_Passed_Reads\tMapped_Reads\tMapped_%\tProperly_Paired_Reads\tProperly_Paired_%" > "$SUMMARY_FILE"

# Loop through all _flagstat.txt files
for file in "$FLAGSTAT_DIR"/*_flagstat.txt; do
    if [[ -e "$file" ]]; then
        # Extract the sample name (remove directory and extension)
        SAMPLE=$(basename "$file" _flagstat.txt)
        # Extract relevant information
        QC_PASSED=$(grep "in total" "$file" | awk '{print $1}')
        MAPPED=$(grep "mapped (" "$file" | awk '{print $1}')
        MAPPED_PCT=$(grep "mapped (" "$file" | awk -F '[()]' '{print $2}')
        PROPERLY_PAIRED=$(grep "properly paired (" "$file" | awk '{print $1}')
        PROPERLY_PAIRED_PCT=$(grep "properly paired (" "$file" | awk -F '[()]' '{print $2}')
        # Append the data to the summary file
        echo -e "$SAMPLE\t$QC_PASSED\t$MAPPED\t$MAPPED_PCT\t$PROPERLY_PAIRED\t$PROPERLY_PAIRED_PCT" >> "$SUMMARY_FILE"
    else
        echo "No _flagstat.txt files found in $FLAGSTAT_DIR."
        exit 1
    fi
done

echo "Summary of all _flagstat.txt files saved to $SUMMARY_FILE."
```
### Export summary_file to R Studio...
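A sketch of that step (assumes the tab-separated summary_flagstat.txt written above; the '%' and ': N/A' suffixes are stripped so the mapping-rate column sorts numerically):
```
# Read the flagstat summary and rank samples by mapping rate
s <- read.delim("summary_flagstat.txt", check.names = FALSE)
s$`Mapped_%` <- as.numeric(sub("%.*", "", s$`Mapped_%`))
s[order(s$`Mapped_%`), c("Sample", "Mapped_%")]
```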
# VCFtools filtering steps/notes
* Beginning with PO (population code) and following the speciation genomics walkthrough to assess the data - I moved all population.snps.vcf files out of their individual pop folders into a single vcf folder and renamed them like "PO.pop.snps.vcf"
* https://speciationgenomics.github.io/filtering_vcfs/
* https://github.com/cheyennelmoore/Erigenia-Analyses
* Move the folder with vcf files to home directory
* `cp -r vcf /home/amd9539/vcf`
* Create variables for easy coding (change PO to appropriate name)
* `SUBSET_VCF=~/vcf/PO.pop.snps.vcf`
* `OUT=~/vcf/PO_out/PO`
* load vcf tools
* `module load vcftools/0.1.17`
* Calculate:
* Allele frequency
* `vcftools --gzvcf $SUBSET_VCF --freq2 --out $OUT --max-alleles 2`
* Mean depth per indiv
* `vcftools --gzvcf $SUBSET_VCF --depth --out $OUT`
* Mean depth per site
* `vcftools --gzvcf $SUBSET_VCF --site-mean-depth --out $OUT`
* Site quality
* `vcftools --gzvcf $SUBSET_VCF --site-quality --out $OUT`
* Proportion missing data/indiv
* `vcftools --gzvcf $SUBSET_VCF --missing-indv --out $OUT`
* Proportion missing data/site
* `vcftools --gzvcf $SUBSET_VCF --missing-site --out $OUT`
* Het and inbred per indiv
* `vcftools --gzvcf $SUBSET_VCF --het --out $OUT`
* Move all outputs onto the laptop and process in R Studio per the tutorial to select filtering metrics (see the sketch after this list)
* I have chosen the values below; but first, set input and output files for the filtered .vcf
* `VCF_IN=~/vcf/FR.pop.snps.vcf`
* `VCF_OUT=~/vcf/FR.pop.snps.filtered.vcf`
* Filtering metrics:
* `MAF=0.2`
* `MISS=0.75`
* `MIN_DEPTH=10`
* `MAX_DEPTH=50`
* Filter with:
* `vcftools --gzvcf $VCF_IN --remove-indv LO0003 --remove-indels --maf $MAF --max-missing $MISS --min-meanDP $MIN_DEPTH --max-meanDP $MAX_DEPTH --minDP $MIN_DEPTH --maxDP $MAX_DEPTH --recode --stdout | gzip -c > $VCF_OUT`
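A sketch of that R Studio step (assumes the PO-prefixed vcftools outputs generated above; per the tutorial, eyeball these distributions before committing to cutoffs):
```
# Distributions used to choose depth and missingness cutoffs
ldepth <- read.delim("PO.ldepth.mean")
imiss  <- read.delim("PO.imiss")
hist(ldepth$MEAN_DEPTH, breaks = 50, main = "Mean depth per site")
hist(imiss$F_MISS, breaks = 20, main = "Missingness per individual")
```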
# Workflow for VCFtools and PLINK to set up a PCA in R, view missing data, and view population stats
* Go to parameter folder that contains assembly data and populations.snps.vcf file
* create plink folder
* `mkdir plink`
* add scaffold_ to snp ID in new vcf file within plink folder so plink doesn't think it's human chromosome data and give you errors (https://groups.google.com/g/plink2-users/c/b5bP9YZfkFU)
* `sed -E 's/^([0-9])/scaffold_\1/g' populations.snps.vcf > plink/pop.snps2.vcf `
* go into plink folder and load VCF tools
* `cd plink`
* `module load vcftools/0.1.17`
* create plink file from vcf
* `vcftools --vcf pop.snps2.vcf --plink --out plink`
* load plink (https://speciationgenomics.github.io/pca/ <- great guide from here on with more details)
* `module load plink/1.9`
* perform linkage pruning - i.e. identify sites to prune
* `plink --vcf pop.snps2.vcf --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.1 --out kplink`
* create pca eigenvalue files
* `plink --vcf pop.snps2.vcf --double-id --allow-extra-chr --set-missing-var-ids @:# --extract kplink.prune.in --make-bed --pca --out kplink$$$`
* Export the .eigenval and .eigenvec files to your laptop and use to create a PCA in R Studio...
* In R Studio... (see the PCA sketch after this list)
* Check missing data
* `vcftools --vcf pop.snps2.vcf --missing-indv --out MissingData_322$$$`
* this creates the .imiss file (vcftools appends .imiss to the --out prefix)
* Move to laptop and view in excel to sort by missing data
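The R Studio PCA step, sketched (assumes the kplink .eigenval/.eigenvec pair from above; adjust the prefix to whatever --out name you used):
```
# Plot PC1 vs PC2 from the PLINK PCA output
eigenval <- scan("kplink.eigenval")
eigenvec <- read.table("kplink.eigenvec")
pct <- eigenval / sum(eigenval) * 100
plot(eigenvec$V3, eigenvec$V4,
     xlab = paste0("PC1 (", round(pct[1], 1), "%)"),
     ylab = paste0("PC2 (", round(pct[2], 1), "%)"))
text(eigenvec$V3, eigenvec$V4, labels = eigenvec$V2, pos = 3, cex = 0.5)
```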
# 4/24/23 Using VCFtools and PLINK to filter data and cluster in a PCA
* For help:
* VCFtools: https://vcftools.github.io/man_latest.html
* Plink: https://www.cog-genomics.org/plink/2.0/general_usage
* Both vcf tools and Plink are installed on Quest, load software with:
* module load vcftools/0.1.17
* module load plink/1.9
* plink/2.0 is listed as preferred, but according to the PLINK manual, plink2 is still undergoing alpha and beta testing, so plink1.9 should be used primarily - plink2 should be used as a 'supplement' for specific tasks.
* STEP 1 - create plink file from VCF file if not done during denovo assembly (need to create plink folder (mkdir) within parameter folder you are using):
* `vcftools --vcf [input folder/populations.snps.vcf] --plink --out [output folder/plink]`
* this took 2 seconds and 1.5MB of memory
* STEP 2 - TROUBLESHOOTING
* Error: Invalid chromosome code 27 resolution discussion - https://groups.google.com/g/plink2-users/c/b5bP9YZfkFU
* Terminology - a contig is a continuous stretch of nucleotide sequence, while a scaffold is a portion of the genome consisting of contigs and gaps. Both contigs and scaffolds are reconstructed genomic sequences
# 4/17/23 Compiling stats for each denovo assembly parameter
* In each parameter folder (322, 333 etc.) open the populations.log file to find loci kept and loci variants
* Enter into excel sheet to compare and make decisions based on loci
* Should I assemble with no population map to generate stats for the subsample?
* Yes - run with and without the pop map and we will run a PCA
* Use plink to generate PCA for assembly run on each population with each sample
* THEN use PCA to assess if sublibrary or pop better explain clustering
# Conda location
environment location: /home/amd9539/.conda/envs/STACKS
# 3/15/23 Running STACKS with SLURM via Notepad++
## Create script in Notepad++ that contains the code for the STACKS program
QUESTIONS
* How do I view depths of coverage output? What is the filename? A: gstacks.log.distribs
* Does specifying the number of cores in stacks denovo assembler change how slurm allocates cpu usage in quest?
* Submitting jobs successfully to SLURM but commands not actually running - is STACKS not uploaded properly/not being called properly? No errors in the SLURM error logs, nothing in the SLURM output folder.
NOTES
* To save a file as '.sh': Save As, select 'All Types', and end the file name with .sh
* NEED TO RUN STACKS WITH parameters: m,M,n; 3,2,2; 4,2,2; 5,2,2; 3,3,3; 3,4,4; 3,5,5
* Use STACKS on QUEST - called with module load stacks/2.64-gcc-9.2.0
* JOB 6421078 COMPLETED (3/27) - "denovo_map.pl" on 49 samples, QUEST allocation used: 1 node, 4 cores/node, Time = 2hr:30min, Memory = 3GB (does memory increase with more data? NO, but add a couple of GB if it's close to the request)
* Scaling up (3/28) - 60 samples randomly selected from all pools/sublibraries (23 from 1 & 2, 14 from 3; sublibrary 3 has only 30 samples). Script set to run on 6 different parameter sets = 2hr30min x 6 = 15hr, +10 samples => use 24 hr to be safe?? Memory to use?
* submitted job with 24hr time, 4GB memory, 1 node, name = Test_Stacks (remember to change job name next time)
* Job cancelled:
```
ID: 6536925
Cluster: quest
User/Group: amd9539/amd9539
State: NODE_FAIL (exit code 1) => ????
Nodes: 1
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:12:24 core-walltime
Job Wall-clock time: 00:03:06
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 6.00 GB (6.00 GB/node)
```
## Create script in Notepad++ for SLURM that calls the STACKS .sh script with bash
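A minimal sketch of that submission script (the account, resource requests, output paths, and the stacks_denovo.sh file name are assumptions based on the notes above; the module name is from the 3/15/23 notes):
```
#!/bin/bash
#SBATCH --account=p31922          # assumption, based on the project paths above
#SBATCH --job-name=stacks_m3M2n2  # remember to change the job name each run
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=4G
#SBATCH --output=slurm_output/%j.out
#SBATCH --error=slurm_error/%j.err

# Load STACKS and run the .sh script written in Notepad++ (hypothetical name)
module load stacks/2.64-gcc-9.2.0
bash stacks_denovo.sh
```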