###### tags: `hybpiper` `madeline` `popgen`

# Argan Population Genomics

[TOC]

## Cleaning the sequences

### Make a sample list

Create a plain text file with the list of all sample names, one per line. This will make it easy to loop over the samples.

Do the work inside a `screen` session so long jobs survive a disconnect:

```
screen -S ArganHybpiper
```
> `Ctrl+A`, then `D` to detach
> `screen -r <name>` to return
> `echo $STY` <---- to check whether you are inside a screen session

Create a working directory first (we call it Argan1?)
> `cd /scratch/mslimp/Argan1`
> In the Argan1 directory:
> `nano ArganNames.txt`
> What is the easy way of putting all of the names in? One option, assuming the raw reads are named like `NAME_R1.fq.gz`: `ls *_R1.fq.gz | sed 's/_R1.fq.gz//' > ArganNames.txt`

### Trimming reads

Use [fastp](https://github.com/OpenGene/fastp) to trim sequences and keep only high-quality paired reads. Be sure to use the `-g` flag (poly-G trimming) because this is NovaSeq data: with two-color chemistry, a signal dropout is read as a run of G's.

Put your command line here:
> `fastp -g -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz`
> This is the 'simple usage' with `-g` added; we still need to make it read the ArganNames.txt file, e.g. (filename patterns are a guess, adjust them to match the raw reads): `while read name; do fastp -g -i "$name"_R1.fq.gz -I "$name"_R2.fq.gz -o "$name"_trimmed_R1.fq.gz -O "$name"_trimmed_R2.fq.gz; done < ArganNames.txt`

## Running HybPiper

[HybPiper Tutorial](https://github.com/mossmatters/HybPiper/wiki)

On Sphagnum: `/opt/apps/Software/HybPiper`

### Target file: mega353

Make a copy from `/scratch/joh97948/GUMO/mega353.fasta` (replace `workingdirectory` with your working directory, e.g. `/scratch/mslimp/Argan1`):

```
cp /scratch/joh97948/GUMO/mega353.fasta workingdirectory
```

### Run HybPiper in a loop

After `reads_first.py`, run `intronerate.py` and `cleanup.py` for each sample.

Uncompress the fastp output for HybPiper:

```
gunzip *.fq.gz
```

Then create your script (`nano ArganHybPiper.sh`):

```bash
while read name
do
  python /opt/apps/Software/HybPiper/reads_first.py -b mega353.fasta -r /scratch/mslimp/Argan1/Argan/$name*.fq --prefix $name --bwa --cpu 50
  python /opt/apps/Software/HybPiper/intronerate.py --prefix $name
  python /opt/apps/Software/HybPiper/cleanup.py $name
done < ArganNames.txt
```
> Do we add `intronerate.py` and `cleanup.py` directly after `reads_first.py`? Yes, inside the loop as shown above.
> I don't remember how to run a shell script file: `bash ../ArganHybPiper.sh` (adjust the path to wherever the script lives)
> `intronerate.py -h` <- to get arguments
> `htop` to check processors (press `q` to quit)

### Get Sequence Lengths

```
python /opt/apps/Software/HybPiper/get_seq_lengths.py mega353.fasta ArganNames.txt dna > argan_seq_lengths.txt
```

### Run HybPiper stats script

```
python /opt/apps/Software/HybPiper/hybpiper_stats.py argan_seq_lengths.txt ArganNames.txt > argan_test_stats.txt
```

### Cleaning up

Put the trimmed reads in their own directory:

```
mkdir trimmed
mv *.fq trimmed
```

Get rid of the untrimmed reads (there's a backup):

```
rm *.gz
```

Move the HybPiper output to its own directory:

```
mkdir hybpiper
mv ARGANseq* hybpiper
```

## Choosing reference sample

Choose based on some mixture of high sequencing depth (number of reads on target), high sequence recovery (number of genes), and amount of recovery (total bp of supercontigs).

```
cat ARGANseq096/*/ARGANseq096/sequences/intron/*_supercontig.fasta > ../argan.supercontigs.fasta
```

Sample 096 had only 351 genes recovered, so add the other two genes from another sample:

`cat ARGANseq002/6412/ARGANseq002/sequences/intron/6412_supercontig.fasta >> ../argan.supercontigs.fasta`
`cat ARGANseq002/6514/ARGANseq002/sequences/intron/6514_supercontig.fasta >> ../argan.supercontigs.fasta`

Sample chosen: ARGANseq096
Genes to add from other samples: 6412, 6514 (from ARGANseq002)

## Mapping reads to reference

Make a new directory for the SNP workflow:

`mkdir SNPworkflow`
`cd SNPworkflow`

Following [Lindsay's workflow](https://github.com/lindsawi/HybSeq-SNP-Extraction) from the GUMO paper, with `variantcall.sh` modified to fit your files.
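Before modifying the script, it helps to know roughly what it does. Below is a minimal sketch of the per-sample steps a script like `variantcall.sh` typically performs (index the reference, map with `bwa mem`, sort and deduplicate, then call a per-sample GVCF). The filenames and options here are illustrative assumptions, not the actual contents of Lindsay's script:

```bash
# Hypothetical sketch only; see the real variantcall.sh for the actual steps.
reference=$1        # e.g. argan.supercontigs.fasta
prefix=$2           # e.g. ARGANseq001

# One-time reference indexing (bwa, samtools, and GATK each need their own index)
bwa index "$reference"
samtools faidx "$reference"
gatk CreateSequenceDictionary -R "$reference"

# Map trimmed reads (with a read group so GATK knows the sample) and sort
bwa mem -R "@RG\tID:$prefix\tSM:$prefix" "$reference" \
  "$prefix"_R1.fq "$prefix"_R2.fq | samtools sort -o "$prefix".sorted.bam -
gatk MarkDuplicates -I "$prefix".sorted.bam -O "$prefix".dedup.bam -M "$prefix".dup_metrics.txt
samtools index "$prefix".dedup.bam

# Per-sample GVCF; GenotypesToPCA.sh later collects these with `ls */*-g.vcf`
gatk HaplotypeCaller -R "$reference" -I "$prefix".dedup.bam -ERC GVCF -O "$prefix"-g.vcf
```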
Copy a recent copy of the script to your current directory: `cp /scratch/joh97948/isoetes/all_to_one/variantcall.sh .`

Modify the variables at the top of the script to match the names of your files and directories:

```bash=
# Set variables
reference=$1
prefix=$2
read1fn="$prefix"_R1.fq
read2fn="$prefix"_R2.fq
readdir=/scratch/mslimp/Argan1/Argan/trimmed/
```

Try it with one sample first to see if everything runs OK:

```
bash variantcall.sh argan.supercontigs.fasta ARGANseq001
```

If it works, go ahead and create a loop to go through all samples:

```
while read name
do
  bash variantcall.sh argan.supercontigs.fasta $name
done < /scratch/mslimp/Argan1/ArganNames.txt
```

## Calling variants

Use the script `GenotypesToPCA.sh` from Lindsay's workflow to call variants for all samples together. It then filters the variants and generates a PCA.

```bash=
#!/bin/bash
## Joint genotyping, SNP filtering, and PCA for GATK4
set -eo pipefail
reference=$1  # e.g. argan.supercontigs.fasta
prefix=$2     # e.g. argan1

# Make samples list
ls */*-g.vcf > samples.list

# Combine and jointly call GVCFs
gatk CombineGVCFs -R $reference --variant samples.list --output "$prefix".cohort.g.vcf
gatk GenotypeGVCFs -R $reference -V "$prefix".cohort.g.vcf -O "$prefix".cohort.unfiltered.vcf

# Keep only SNPs passing a hard filter
time gatk SelectVariants -V "$prefix".cohort.unfiltered.vcf -R $reference -select-type-to-include SNP -O "$prefix".SNPall.vcf
time gatk VariantFiltration -R $reference -V "$prefix".SNPall.vcf --filter-name "hardfilter" -O "$prefix".snp.filtered.vcf --filter-expression "QD < 5.0 && FS > 60.0 && MQ < 45.0 && MQRankSum < -12.5 && ReadPosRankSum < -8.0"
gatk SelectVariants -V "$prefix".snp.filtered.vcf -O "$prefix".snp.filtered.nocall.vcf --set-filtered-gt-to-nocall
bcftools annotate --set-id +"%CHROM:%POS" "$prefix".snp.filtered.nocall.vcf > "$prefix".snp.filtered.nodot.vcf

# Import to plink, dropping SNPs that didn't pass the filter
plink --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno --const-fid --out "$prefix"
# Alternate for deleting all SNPs with missing data:
#plink --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno 0 --const-fid --out "$prefix"

# Prune SNPs in strong linkage disequilibrium
plink --indep 50 5 2 --file "$prefix" --allow-extra-chr --out "$prefix"
plink --extract "$prefix".prune.in --out "$prefix"_pruned --file "$prefix" --make-bed --allow-extra-chr --recode

# Generate eigenvalues and loadings for 20 PCA axes
plink --pca 20 --file "$prefix"_pruned --allow-extra-chr --out "$prefix"

# Generate basic stats (heterozygosity, inbreeding coefficient, allele frequencies)
plink --freq --het 'small-sample' --ibc --file "$prefix"_pruned --allow-extra-chr --out "$prefix"_pruned
```

Run this from the directory that contains all of the per-sample `variantcall.sh` output directories:

`bash GenotypesToPCA.sh argan.supercontigs.fasta argan1`

## Admixture Analysis in LEA (in R)

The pruned dataset has 70 individuals and 11577 SNPs.
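You can confirm those counts from the plink text output (assuming the pruned fileset is named `argan1_pruned`, as produced by `GenotypesToPCA.sh`): the `.ped` file has one row per individual and the `.map` file one row per SNP.

```bash
wc -l < argan1_pruned.ped   # rows = individuals; expect 70
wc -l < argan1_pruned.map   # rows = retained SNPs; expect 11577
```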
Needed to change the `*` characters in the `.ped` file to the missing-data code `0` that LEA expects, so the input to LEA is `argan1_pruned_fixed.ped`.
> one way (a guess at the exact command used): `sed 's/\*/0/g' argan1_pruned.ped > argan1_pruned_fixed.ped`

[LEA tutorial](http://membres-timc.imag.fr/Olivier.Francois/LEA/tutorial.htm)

## Visualizations (in R)

### Libraries / Gene Recovery Stats

Similar to what you had on your poster and what we had in the GUMO paper.

### HybPiper Heatmap

`gene_recovery_heatmap_ggplot.R`

### PCA

You will need to add a site column to the `.eigenvec` file to more easily visualize relationships within/among sites.

### LEA barchart

Like the STRUCTURE plot in the other papers, but with your SNP data.

### LEA piecharts on map

Make sure to include Western Sahara.

### Summary stats

Use bcftools to look at the various VCF files:

`bcftools stats argan1.snp.filtered.nodot.vcf`

```
number of SNPs: 59736
```

The `argan1_pruned_fixed.geno` file generated by LEA has one SNP per line (one column per individual). The file has 11577 lines, so that is the number of pruned SNPs (filtered to retain only SNPs with limited evidence of linkage).
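To pull just these headline numbers out of the reports (same filenames as above; `bcftools stats` writes its totals on `SN` summary lines, so a simple grep finds them):

```bash
# SNPs that survived the hard filter, from the joint-called VCF
bcftools stats argan1.snp.filtered.nodot.vcf | grep "number of SNPs:"

# Pruned SNPs retained for LEA (one SNP per line in the .geno file)
wc -l < argan1_pruned_fixed.geno
```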