###### tags: `hybpiper` `madeline` `popgen`
# Argan Population Genomics
[TOC]
## Cleaning the sequences
### Make a sample list
Create a plain text file listing all sample names, one per line. This makes it easy to loop over the samples in later steps.
Start a named `screen` session first so long-running jobs survive a disconnect:
```
screen -S ArganHybpiper
```
> `Ctrl+A D` to detach
> `screen -r ArganHybpiper` to return
> `echo $STY` to check whether you are inside a screen session
> Create a working directory first (we call it `Argan1`):
> `cd /scratch/mslimp/Argan1`
> In the `Argan1` directory:
> `nano ArganNames.txt`
> What is the easy way of putting all of the names in? (One option is sketched below.)
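One easy way, run from the directory that holds the reads (a sketch; it assumes the raw read files are named like `<sample>_R1.fq.gz`, so adjust the pattern to your actual file names):
```
# Strip the _R1.fq.gz suffix from every forward-read file to get the sample names
ls *_R1.fq.gz | sed 's/_R1\.fq\.gz//' > ArganNames.txt
```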
### Trimming reads
Use [fastp](https://github.com/OpenGene/fastp) to trim sequences and keep only high-quality paired sequences. Be sure to use the `-g` flag (force poly-G trimming), because on NovaSeq's two-color chemistry a lack of signal is read as G.
Put your command line here:
> `fastp -g -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz`
> This is fastp's 'simple usage' with `-g` added; it still needs to loop over `ArganNames.txt` (see the sketch below).
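A minimal sketch of that loop. The `raw/` input directory is an assumption (so the trimmed output doesn't overwrite the input); the output names keep the `<sample>_R1.fq.gz` pattern so that after `gunzip` the files match the `$name*.fq` glob HybPiper uses below:
```
# Trim each sample listed in ArganNames.txt with fastp
while read name
do
fastp -g -i raw/"$name"_R1.fq.gz -I raw/"$name"_R2.fq.gz -o "$name"_R1.fq.gz -O "$name"_R2.fq.gz
done < ArganNames.txt
```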
## Running HybPiper
[HybPiper Tutorial](https://github.com/mossmatters/HybPiper/wiki)
On Sphagnum: `/opt/apps/Software/HybPiper`
### Target file: mega353
Make a copy from `/scratch/joh97948/GUMO/mega353.fasta`:
```
cp /scratch/joh97948/GUMO/mega353.fasta /scratch/mslimp/Argan1/
```
### Run HybPiper in a loop
After `reads_first.py` include `intronerate.py` and `cleanup.py`.
Uncompress the fastp output for HybPiper
```
gunzip *.fq.gz
```
Then create your script:
`nano ArganHybPiper.sh`
```bash
#!/bin/bash
while read name
do
# Assemble the target genes for this sample (BWA read mapping, 50 CPUs)
/opt/apps/Software/HybPiper/reads_first.py -b mega353.fasta -r /scratch/mslimp/Argan1/Argan/$name*.fq --prefix $name --bwa --cpu 50
# Extract supercontigs (exons plus flanking intron sequence)
python /opt/apps/Software/HybPiper/intronerate.py --prefix $name
# Delete intermediate assembly files to save space
python /opt/apps/Software/HybPiper/cleanup.py $name
done < ArganNames.txt
```
> Yes: `intronerate.py` and `cleanup.py` go inside the loop, directly after `reads_first.py`, as in the script above.

Run the script with `bash ../ArganHybPiper.sh`.
> `intronerate.py -h` to see the arguments
> `htop` to check processor usage (press `q` to quit)
### Get Sequence Lengths
```
python /opt/apps/Software/HybPiper/get_seq_lengths.py mega353.fasta ArganNames.txt dna > argan_seq_lengths.txt
```
### Run HybPiper stats script
```
python /opt/apps/Software/HybPiper/hybpiper_stats.py argan_seq_lengths.txt ArganNames.txt > argan_test_stats.txt
```
### Cleaning up
Put the trimmed reads in their own directory:
```
mkdir trimmed
mv *.fq trimmed
```
Get rid of the untrimmed reads (there's a backup):
```
rm *.gz
```
Move the HybPiper output to its own directory:
```
mkdir hybpiper
mv ARGANseq* hybpiper
```
## Choosing reference sample
Choose a reference sample with some mixture of high sequencing depth (number of reads on target), high gene recovery (number of genes recovered), and high total recovery (total bp of supercontigs).
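One quick way to compare samples is to sort the HybPiper stats file (a sketch; the column order of `argan_test_stats.txt` depends on the HybPiper version, so check the header first and adjust the column number):
```
head -1 argan_test_stats.txt                        # find the column you want
sort -t$'\t' -k6,6nr argan_test_stats.txt | head    # e.g. sort descending on column 6 (an assumption)
```
Once a sample is chosen, concatenate its supercontigs into a single reference fasta: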
```
cat ARGANseq096/*/ARGANseq096/sequences/intron/*_supercontig.fasta > ../argan.supercontigs.fasta
```
Sample 096 had only 351 of the 353 genes recovered, so add the missing two from another sample:
`cat ARGANseq002/6412/ARGANseq002/sequences/intron/6412_supercontig.fasta >> ../argan.supercontigs.fasta`
`cat ARGANseq002/6514/ARGANseq002/sequences/intron/6514_supercontig.fasta >> ../argan.supercontigs.fasta`
Sample chosen: ARGANseq096
Genes to add from other samples: 6412, 6514
## Mapping reads to reference
Make a new directory for the SNP workflow:
`mkdir SNPworkflow`
`cd SNPworkflow`
Follow [Lindsay's workflow](https://github.com/lindsawi/HybSeq-SNP-Extraction) from the GUMO paper, with `variantcall.sh` modified to fit your files.
Copy a recent version of the script to your current directory:
`cp /scratch/joh97948/isoetes/all_to_one/variantcall.sh .`
Modify the script to match the names of your files and directories:
```bash=
# Set variables
reference=$1
prefix=$2
read1fn="$prefix"_R1.fq
read2fn="$prefix"_R2.fq
readdir=/scratch/mslimp/Argan1/Argan/trimmed/
```
```
bash variantcall.sh argan.supercontigs.fasta ARGANseq001
```
Try it with one sample first to see if everything runs OK. If it works, go ahead and create a loop to go through all samples:
```
while read name
do
bash variantcall.sh argan.supercontigs.fasta $name
done < /scratch/mslimp/Argan1/Argan/namelist.txt
```
## Calling variants
Use the script `GenotypesToPCA.sh` from Lindsay's workflow to call variants for all samples together. Then you filter the variants and generate a PCA from genetic distances.
```bash=
#!/bin/bash
##End of Select Variants for GATK4
set -eo pipefail
reference=$1 #Ppyriforme-3728.supercontigs.fasta
prefix=$2 #Physcomitrium-pyriforme
#Make Samples list
ls */*-g.vcf > samples.list
# Combine and Jointly call GVCFs
gatk CombineGVCFs -R $reference --variant samples.list --output "$prefix".cohort.g.vcf
gatk GenotypeGVCFs -R $reference -V "$prefix".cohort.g.vcf -O "$prefix".cohort.unfiltered.vcf
# Keep only SNPs passing a hard filter
time gatk SelectVariants -V "$prefix".cohort.unfiltered.vcf -R $reference -select-type-to-include SNP -O "$prefix".SNPall.vcf
time gatk VariantFiltration -R $reference -V "$prefix".SNPall.vcf --filter-name "hardfilter" -O "$prefix".snp.filtered.vcf --filter-expression "QD < 5.0 && FS > 60.0 && MQ < 45.0 && MQRankSum < -12.5 && ReadPosRankSum < -8.0"
gatk SelectVariants -V "$prefix".snp.filtered.vcf -O "$prefix".snp.filtered.nocall.vcf --set-filtered-gt-to-nocall
bcftools annotate --set-id +"%CHROM:%POS" "$prefix".snp.filtered.nocall.vcf > "$prefix".snp.filtered.nodot.vcf
# Filter out SNPs that didn't pass the filter
plink --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno --const-fid --out "$prefix"
#Alternate for deleting all SNPs with missing data:
#plink --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno 0 --const-fid --out "$prefix"
plink --indep 50 5 2 --file "$prefix" --allow-extra-chr --out "$prefix"
plink --extract "$prefix".prune.in --out "$prefix"_pruned --file "$prefix" --make-bed --allow-extra-chr --recode
# Generate eigenvalues and loadings for 20 PCA axes
plink --pca 20 --file "$prefix"_pruned --allow-extra-chr --out "$prefix"
# Generate basic stats (heterozygosity, inbreeding coefficient, allele frequencies)
plink --freq --het 'small-sample' --ibc --file "$prefix"_pruned --allow-extra-chr -out "$prefix"_pruned
```
Run this from the directory that has all of the `variantcall.sh` directories:
`bash GenotypesToPCA.sh argan.supercontigs.fasta argan1`
## Admixture Analysis in LEA (in R)
The pruned dataset has 70 individuals and 11577 SNPs.
The `.ped` file contained `*` allele codes, which needed to be changed to `0` (missing) for LEA, so the input to LEA is `argan1_pruned_fixed.ped` (one way to do the recoding is sketched below).
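A one-liner sketch for the recoding, assuming the pruned file written by plink is `argan1_pruned.ped`:
```
# Recode '*' alleles as 0 (missing) for LEA
sed 's/\*/0/g' argan1_pruned.ped > argan1_pruned_fixed.ped
```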
[LEA tutorial](http://membres-timc.imag.fr/Olivier.Francois/LEA/tutorial.htm)
## Visualizations (in R)
### Libraries / Gene Recovery Stats
Similar to what you had on your poster and what we had in the GUMO paper
### HybPiper Heatmap
`gene_recovery_heatmap_ggplot.R`
### PCA
You will need to modify the `.eigenvec` file to have a site column to more easily visualize relationships within/among sites (one way is sketched below).
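A sketch of adding the column in bash; `sites.txt` is a hypothetical two-column lookup (sample name, site), and the output name is just a suggestion:
```
# plink's .eigenvec columns are: FID IID PC1 ... PC20; the sample name is column 2.
# Prepend each sample's site from the sites.txt lookup.
awk 'NR==FNR {site[$1]=$2; next} {print site[$2], $0}' sites.txt argan1.eigenvec > argan1.eigenvec.sites
```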
### LEA barchart
Like the STRUCTURE plot in the other papers but with your SNP data
### LEA piecharts on map
Make sure to include Western Sahara
### Summary stats
Using bcftools to look at the various VCF files:
`bcftools stats argan1.snp.filtered.nodot.vcf`
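To pull just the SNP count out of the full report (it appears on one of the `SN` summary lines), something like:
```
bcftools stats argan1.snp.filtered.nodot.vcf | grep "number of SNPs:"
```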
```
number of SNPs: 59736
```
The `argan1_pruned_fixed.geno` file generated by LEA has a format of one SNP per line (each individual is a column).
The file has 11577 lines, so that is the number of pruned SNPs (filtered to retain only SNPs with limited evidence of linkage).
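A quick sanity check, since each line is one SNP (this should print 11577):
```
wc -l argan1_pruned_fixed.geno
```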