<h2>Loggerhead WGR migratory genomics workflow</h2>
---
<h3>Data summary</h3>
* Total number of unique individuals: 132
* Sampling groups: 5
  * Rookery KEY: GOM, SNWA
  * Rookery MB: GOM, SNWA, MAB
  * Goal: n~25 per rookery/foraging site type
* Library prep details
  * Library pools: 2, each sequenced over 2 lanes of NovaSeq X
  * Used 7 PCR cycles to reduce the number of PCR duplicates
  * Details in [pooling spreadsheet](https://docs.google.com/spreadsheets/d/1zQLUvMfeJCWPn_rODM5VE9k3YrD6isMS/edit#gid=924633929)
* Expected sequencing depth: 11-12x
  * Including ~20% loss to duplicates
---
<h3>Study System</h3>

Samples are from nesting loggerhead turtles at two rookeries in Florida (black dots).
Turtles nesting at the Gulf coast site on Keewaydin Island (KEY) migrate to 2 foraging regions (~77% to GOM, ~23% to SNWA). Turtles nesting at the east coast site at Melbourne Beach (MB) migrate to 4 foraging regions; we focus on 3 for this study (~59% to SNWA, ~21% to GOM, ~17% to MAB; not analyzing the ~3% to SAB due to SIA assignment ambiguity).
---
<h3>Pipeline Summary</h3>
Data processing
1. Download raw data from Novogene
2. Check for download errors with MD5checksums
3. Raw data QC
4. Concatenate libraries sequenced across lanes
5. Trim adapters
6. Trimmed data QC
7. Remove mitochondrial reads
8. Align to reference genome
9. Remove duplicates
10. Calculate coverage depth
Data analyses
1. Assess population structure
2. Fst outlier analysis
3. Cross-check candidate genes in Baypass?
<h3>Raw data</h3>
Downloaded by LMK to `/nese/meclab/Shared/raw_reads/Cc_WGR/usftp21.novogene.com/01.RawData/`
MD5 checksums completed by KFP (run_md5CheckSums.sh)
```
# Pick this array task's lane folder from the folder list
LANE=$(sed -n ${SLURM_ARRAY_TASK_ID}p /nese/meclab/Katrina/cc/md5CheckSums/read_folders.txt)
# Lane name is the 9th path component; compute once outside the loop
LANE_PREFIX=$(echo $LANE | cut -d "/" -f 9)
# Checksum every fastq in the lane into one post-transfer record
for file in $LANE/*.gz
do
md5sum $file >> /nese/meclab/Katrina/cc/md5CheckSums/MD5_postTransfer_${LANE_PREFIX}.txt
done
```
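Novogene deliveries usually ship an MD5.txt alongside the reads; assuming that file is present in each lane folder, the transfer could also be verified in one step (a sketch, not part of the script above):
```
# Hypothetical check against the vendor-supplied checksum file:
# md5sum -c re-hashes every listed file and prints OK/FAILED per file.
cd $LANE && md5sum -c MD5.txt
```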
Initial FastQC on raw data ([files and MultiQC](https://drive.google.com/drive/u/1/folders/1cHuyJgKvNmBAVynGJIGSUmSROZEiBcfX), see 01_FastQC_Trim_Align.sh)
```
fastqc $FQDIR/$SAMPLE/${SAMPLE}*_1.fq.gz --outdir ./FastQC/raw_reads
fastqc $FQDIR/$SAMPLE/${SAMPLE}*_2.fq.gz --outdir ./FastQC/raw_reads
multiqc /nese/meclab/Katrina/cc/FastQC/raw_reads/*_fastqc.zip
```
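These per-sample steps run as SLURM array jobs (hence `${SLURM_ARRAY_TASK_ID}` throughout); a hypothetical header for submitting one task per sample:
```
#!/bin/bash
# Hypothetical SLURM directives: one array task per line of all_samples.txt,
# so task N processes the sample on line N.
#SBATCH --array=1-132
#SBATCH --cpus-per-task=16
SAMPLE=$(sed -n ${SLURM_ARRAY_TASK_ID}p ./all_samples.txt)
```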
Of note:
1. There are 3 areas of low tile quality that appear across most individuals (though not all are as severe as the example below).
Raised this issue with Novogene; they say it's specific to these libraries and not a flowcell problem.

2. Mean quality scores mostly >30, dipping at end of reads (flagged <FONT COLOR="#CBAF52">**MB205**</FONT>)

3. Raw GC content (flagged <FONT COLOR="#FF0000">**KEY063**</FONT>)

4. High duplication rates for 6 libraries (<FONT COLOR="#FF0000">**KEY063, KEY435, KEY445, MB111, MB265, MB303**</FONT>)

<h3>Concatenate and Trim</h3>
Concatenated data from the 2 sequencing lanes (00_concat_lanes.sh) and double-checked that total sequences from Lane 1 + Lane 2 equal the concatenated total ([CcWGR_concat_check](https://docs.google.com/spreadsheets/d/1uYh6mS4ryzwFO_HXSGbr7jiyeAjxs55G/edit#gid=632560932))
```
SAMPLE=$(sed -n ${SLURM_ARRAY_TASK_ID}p ./all_samples.txt)
#R1
cat /nese/meclab/Shared/raw_reads/Cc_WGR/usftp21.novogene.com/01.RawData/$SAMPLE/${SAMPLE}*_1.fq.gz > /scratch/workspace/katrinaphill_umass_edu-simple/concat/${SAMPLE}_1.fq.gz
#R2
cat /nese/meclab/Shared/raw_reads/Cc_WGR/usftp21.novogene.com/01.RawData/$SAMPLE/${SAMPLE}*_2.fq.gz > /scratch/workspace/katrinaphill_umass_edu-simple/concat/${SAMPLE}_2.fq.gz
```
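The lane-sum check recorded in the spreadsheet can be reproduced on the cluster by counting reads directly (a sketch; `$LANE1_R1`, `$LANE2_R1`, and `$CONCAT_R1` are placeholder paths for one sample's R1 files):
```
# FASTQ records are 4 lines each, so reads = line count / 4;
# the two lane counts should sum to the concatenated count.
for f in $LANE1_R1 $LANE2_R1 $CONCAT_R1; do
  echo "$f: $(( $(zcat "$f" | wc -l) / 4 )) reads"
done
```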
Distribution of raw read counts after concatenating

Ran Trimmomatic on each concatenated R1 and R2 for quality trimming and removal of Illumina adapters (see 01_FastQC_Trim_Align.sh)
```
java -jar /modules/apps/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 -threads 16 \
$FQDIR/${SAMPLE}_1.fq.gz $FQDIR/${SAMPLE}_2.fq.gz \
$TRIMDIR/paired/${SAMPLE}_R1.paired.fq.gz $TRIMDIR/unpaired/${SAMPLE}_R1.unpaired.fq.gz \
$TRIMDIR/paired/${SAMPLE}_R2.paired.fq.gz $TRIMDIR/unpaired/${SAMPLE}_R2.unpaired.fq.gz \
ILLUMINACLIP:/modules/apps/trimmomatic/0.39/adapters/NexteraPE-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:75
```
FastQC trimmed data ([MultiQC](https://drive.google.com/drive/u/1/folders/1cHuyJgKvNmBAVynGJIGSUmSROZEiBcfX))
Of note:
1. Trimmed mean read quality looks good

2. One library has GC > AT (KEY063); a satellite-tagged individual, flagged in the raw GC content plot above.
Is more trimming needed before moving forward? Perhaps will be resolved when duplicates are removed?

3. Trimming removed 30% of reads on average (range 7-86%). Seems like a lot? Perhaps due to the lower average fragment (insert) size of my libraries?
Number of reads remaining post-trim ranges from 4.5M to 182M (excluding two of the four hail mary libraries: average 68M, range 15.4M to 182M). See [alternate trimming method](https://hackmd.io/JwGRyGu-T76Yl35ilowk2g?both#Re-visit-trimming-and-downstream-processing) below.
Percent of reads retained after trimming:

<h3>Remove mitochondrial reads</h3>
Aligned trimmed reads to the mitogenome to split mtDNA from nucDNA. Used the Mediterranean mitogenome [FR694649.1](https://www.ncbi.nlm.nih.gov/nuccore/FR694649.1) because the reference genome is Mediterranean, though Florida mitogenomes are available (only 2 bp differ?).
```
bbsplit.sh ref=$REFDIR/$MITOREF \
in1=$TRIMDIR/paired/${SAMPLE}_R1.paired.fq.gz \
in2=$TRIMDIR/paired/${SAMPLE}_R2.paired.fq.gz \
basename=$MITODIR/${SAMPLE}_mt_%.fq \
out1=$SPLITDIR/${SAMPLE}_nomt_R1.fq.gz \
out2=$SPLITDIR/${SAMPLE}_nomt_R2.fq.gz
```
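As a sanity check on the split, the fraction of reads routed to the mitogenome could be tallied per sample (a sketch; the exact mito output filename depends on the reference name bbsplit substitutes for `%`, and the 4-lines-per-read factor cancels in the ratio):
```
# Reads assigned to the mitogenome vs. the nuclear ("nomt") outputs
MT=$(cat $MITODIR/${SAMPLE}_mt_*.fq | wc -l)
NUC=$(zcat $SPLITDIR/${SAMPLE}_nomt_R1.fq.gz $SPLITDIR/${SAMPLE}_nomt_R2.fq.gz | wc -l)
echo "$SAMPLE mito fraction: $(echo "$MT / ($MT + $NUC)" | bc -l)"
```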
<h3>Align reads to reference genome</h3>
Indexed loggerhead reference genome
```
gunzip -c GCF_023653815.1_GSC_CCare_1.0_genomic.fna.gz > GCF_023653815.1_GSC_CCare_1.0_genomic.fna
module load bwa/0.7.17
bwa index GCF_023653815.1_GSC_CCare_1.0_genomic.fna
module load samtools/1.14
samtools faidx GCF_023653815.1_GSC_CCare_1.0_genomic.fna
module load picard/2.26.2
picard CreateSequenceDictionary R=GCF_023653815.1_GSC_CCare_1.0_genomic.fna O=GCF_023653815.1_GSC_CCare_1.0_genomic.dict
```
Which reference to use? Used the Mediterranean [GSC_CCare_1.0](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_023653815.1/). Alternates: Cape Verde coming soon, Wild Genomes Project(?).
Aligned reads to the reference, skipping the .sam file and writing straight to .bam.
```
bwa mem -t 16 $REFDIR/$REF $SPLITDIR/${SAMPLE}_nomt_R1.fq.gz $SPLITDIR/${SAMPLE}_nomt_R2.fq.gz | samtools view -bS - > $OUTDIR/${SAMPLE}.bam
```
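Read groups get added later with Picard AddOrReplaceReadGroups; as an alternative sketch, they could be attached at alignment time with bwa mem's -R flag, which would make that later step unnecessary:
```
# Hypothetical variant of the alignment command with an inline read group;
# the \t escapes become literal tabs in the @RG header line.
bwa mem -t 16 -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tLB:lib1\tPL:ILLUMINA" \
$REFDIR/$REF $SPLITDIR/${SAMPLE}_nomt_R1.fq.gz $SPLITDIR/${SAMPLE}_nomt_R2.fq.gz \
| samtools view -bS - > $OUTDIR/${SAMPLE}.bam
```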
Produce alignment statistics
```
samtools flagstat $OUTDIR/${SAMPLE}.bam > $METDIR/alignment_metrics/${SAMPLE}_alignment_stats.txt
```
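To tabulate percent aligned across all samples, the flagstat outputs can be scraped (a sketch; assumes GNU grep and that the first `mapped (` line in the flagstat output carries the overall mapped percentage):
```
# One "sample<TAB>percent mapped" line per flagstat file
for f in $METDIR/alignment_metrics/*_alignment_stats.txt; do
  s=$(basename "$f" _alignment_stats.txt)
  pct=$(grep -m1 "mapped (" "$f" | grep -oP '[0-9.]+(?=%)')
  echo -e "${s}\t${pct}"
done
```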
Percent of reads aligned:

Number of reads aligned:

Sort and index .bam files
```
samtools sort -o $OUTDIR/${SAMPLE}_SORT.bam $OUTDIR/${SAMPLE}.bam
samtools index $OUTDIR/${SAMPLE}_SORT.bam
```
Check mean depth (before removing duplicates)
```
conda activate gatk3
cd ./metrics/depth_metrics/pre_dupe/
mosdepth -n --fast-mode -Q 20 ${SAMPLE}_pre_dupe $OUTDIR/${SAMPLE}_SORT.bam
cd ../../../
conda deactivate
```
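Genome-wide mean depth per sample can then be pulled from the mosdepth summaries (a sketch; in `*.mosdepth.summary.txt` the `total` row's 4th column is the mean depth):
```
# Collect "sample<TAB>mean depth" across all pre-dupe summaries
for f in ./metrics/depth_metrics/pre_dupe/*_pre_dupe.mosdepth.summary.txt; do
  s=$(basename "$f" .mosdepth.summary.txt)
  awk -v s="$s" '$1 == "total" {print s "\t" $4}' "$f"
done
```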
Coverage (pre-dupe), line indicates 8x

Mean pre-dupe coverage:
Keewaydin_GOM: 8.07
Keewaydin_SNWA: 6.88
Melbourne_GOM: 8.69
Melbourne_MAB: 7.98
Melbourne_SNWA: 10.07
Remove duplicates
```
java -Xms72G -Xmx90G -jar /modules/apps/picard/2.26.11/picard.jar MarkDuplicates \
I=$OUTDIR/${SAMPLE}_SORT.bam TMP_DIR=$TMP \
O=$OUTDIR/${SAMPLE}_SORT_DR.bam \
M=$METDIR/dupe_metrics/${SAMPLE}_duplicate_metrics.txt \
ASSUME_SORT_ORDER=coordinate \
VALIDATION_STRINGENCY=SILENT \
REMOVE_DUPLICATES=TRUE \
MAX_RECORDS_IN_RAM=500000
```
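Per-sample duplication rates can be scraped from the Picard metrics files (a sketch that locates the PERCENT_DUPLICATION column by name rather than position, since the LIBRARY field can contain spaces before read groups are set):
```
# The metrics table header starts with LIBRARY; the data row follows it
for f in $METDIR/dupe_metrics/*_duplicate_metrics.txt; do
  s=$(basename "$f" _duplicate_metrics.txt)
  awk -v s="$s" 'BEGIN{FS=OFS="\t"}
    /^LIBRARY/ {for (i=1; i<=NF; i++) if ($i == "PERCENT_DUPLICATION") c = i;
                getline; print s, $c}' "$f"
done
```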
Index files and check depth (post-dupe)
```
samtools index $OUTDIR/${SAMPLE}_SORT_DR.bam
conda activate gatk3
cd ./metrics/depth_metrics/post_dupe/
mosdepth -n --fast-mode -Q 20 ${SAMPLE}_post_dupe $OUTDIR/${SAMPLE}_SORT_DR.bam
cd ../../../
conda deactivate
```
Duplication rate, line indicates 20%

Coverage (post-dupe), line indicates 8x

Not the coverage we were hoping for (~11x), and most samples are below the coverage needed for hard calls. Was this due in part to the high percentage of reads trimmed at the start?
<h3>Re-visit trimming and downstream processing</h3>
Trimming seems dependent on library insert size - is there a less stringent trimming option that would keep a higher proportion of reads with low-quality ends?

**Trim alternative 1 (bbtrim):** Instead of using the Trimmomatic sliding-window approach, tried [bbduk](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/) to see if a higher number of reads would be retained.
```
bbduk.sh \
ref=./adapters.fa \
in1=$FQDIR/${SAMPLE}_1.fq.gz \
in2=$FQDIR/${SAMPLE}_2.fq.gz \
out1=$TRIMBB/${SAMPLE}_R1.bbtrim.fq.gz \
out2=$TRIMBB/${SAMPLE}_R2.bbtrim.fq.gz \
minlen=75 qtrim=r trimq=20
```
Comparison of percent trimmed

Using bbduk did result in a lower percent of reads trimmed; continued with the pipeline as above.
Does the lower percent trimmed using bbduk result in higher coverage/more individuals >8x downstream?
Comparison of coverage after removing duplicates

Using bbduk increased coverage just a bit for low-coverage samples, though reduced coverage for some of the higher coverage samples.
Number of IDs with >8x coverage, Trimmomatic --> bbduk:

| Group | Trimmomatic | bbduk |
| -------- | -------- | -------- |
| KEY_GOM | 5 | 6 |
| KEY_SNWA | 5 | 5 |
| MB_GOM | 8 | 10 |
| MB_MAB | 4 | 5 |
| MB_SNWA | 12 | 14 |

There are 7 samples <1x using Trimmomatic, 6 samples <1x using bbduk.
**Trim alternative 2 (bbtrim36):** Trimmed with bbduk, setting a 36 bp minimum retained length. The percent trimmed was a bit lower, but this set was not processed through all the following steps for now in case another option is better.
```
bbduk.sh \
ref=./adapters.fa \
in1=$FQDIR/${SAMPLE}_1.fq.gz \
in2=$FQDIR/${SAMPLE}_2.fq.gz \
out1=$TRIMBB/${SAMPLE}_R1.bbtrim36.fq.gz \
out2=$TRIMBB/${SAMPLE}_R2.bbtrim36.fq.gz \
minlen=36 qtrim=r trimq=20
```
Comparison of percent trimmed with bbduk minlen=36 ([MultiQC](https://drive.google.com/drive/u/1/folders/1cHuyJgKvNmBAVynGJIGSUmSROZEiBcfX))

Not a huge reduction in reads trimmed, so did not continue this version through the pipeline to alignment.
**Trim alternative 3 (bb_noq):** Trimming adapters with bbduk but with no quality or length filters.
```
bbduk.sh \
ref=./adapters.fa \
in1=$FQDIR/${SAMPLE}_1.fq.gz \
in2=$FQDIR/${SAMPLE}_2.fq.gz \
out1=$TRIMBB/${SAMPLE}_R1.bb_noq.fq.gz \
out2=$TRIMBB/${SAMPLE}_R2.bb_noq.fq.gz \
ktrim=r k=31 mink=8 hdist=1 tpe tbo
```
This method **retained 100% of reads**, while trimming bases from reads to remove adapters.
Average read length per individual after adapter trimming: 99-148 bp

Thoughts: Variation in mean read length may impact coverage?
Presumably retaining low-quality reads will reduce alignment rates, since bwa soft-clips as it aligns to the genome.
Alignment rates were similar:

Though the number of reads going in was higher, so the total number of reads aligned was higher:

Note: mosdepth has a mapping-quality filter, currently set to ignore reads with Q<20, so retained low-quality reads are excluded from depth calculations but still present in the BAMs?
Coverage (after duplicates removed) higher, line at 8x

**Trim parameter trouble-shooting**
Trimming with bbduk and no quality filter retained all reads, but why is quality trimming not working as expected? Tested 3 IDs with average insert lengths (111, 200, and 400) to see how small changes to the parameters for trimming with Trimmomatic and bbduk impact reads retained.
Original parameters in the Trimmomatic script:
```
ILLUMINACLIP:/modules/apps/trimmomatic/0.39/adapters/NexteraPE-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:75
```
Tested: minlen75 (shown above), minlen36, no leading/trailing + minlen75, no leading/trailing + minlen36
Original parameters in the bbduk script:
```
minlen=75 qtrim=r trimq=20
```
Tested: minlen75 + qtrim=r + q20 (shown above), minlen36 + qtrim=r + q20, minlen75 + qtrim=rl + q20, minlen36 + qtrim=rl + q20, minlen75 + qtrim=rl + q15, minlen75 + qtrim=rl + q10, minlen75 + qtrim=rl + q5, minlen36 + qtrim=rl + q5
Trim alternative 3 (above), the bbduk script with no quality filters, retained 100% of reads:
```
ktrim=r k=31 mink=8 hdist=1 tpe tbo
```
Finally, added a right kmer-trim and a minimum kmer length to a quality filter, which retained ~99.4% of reads:
```
ktrim=r mink=8 minlen=36 qtrim=rl trimq=15 tpe tbo
```

The "duk" in “bbduk” stands for Decontamination Using Kmers. It's also used to compare reads to the genome of a potential source of contamination (e.g., E. coli, PhiX) to remove any matching reads.
Here's what I think was happening - We'd been using bbduk to search for matches to whole adapter sequences, which for "normal" insert sizes didn't happen all that often. For libraries with short inserts, many more reads included the whole adapter sequence, and any read containing the whole adapter was treated as a match to contamination and removed.
"The default is filtering – any read matching a reference kmer will be discarded."
By adding the "ktrim = r" parameter, bbduk instead *trims* reads starting at the adapter match and anything to the right of it.
The bbktrim option trims only slightly more reads than the no quality-trimming option (plot below), and for many IDs slightly increases final depth.


I did not re-run all my samples with updated Trimmomatic, but going back to that option...
From the [Trimmomatic manual](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf): "Trimmomatic uses two strategies for adapter trimming: Palindrome and Simple
With 'simple' trimming, each adapter sequence is tested against the reads, and if a sufficiently accurate match is detected, the read is clipped appropriately. 'Palindrome' trimming is **specifically designed for the case of 'reading through' a short fragment** into the adapter sequence on the other end. In this approach, the appropriate adapter sequences are 'in silico ligated' onto the start of the reads, and the combined adapter+read sequences, forward and reverse are aligned. If they align in a manner which indicates 'readthrough', the forward read is clipped and the reverse read dropped (since it contains no new data)."
So you end up with unpaired reads. Does that explain the high % of reads lost? Perhaps not what we want, may be why bbduk is better.
Adapter file for Trimmomatic:
```
>PrefixNX/1
AGATGTGTATAAGAGACAG
>PrefixNX/2
AGATGTGTATAAGAGACAG
>Trans1
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
>Trans1_rc
CTGTCTCTTATACACATCTGACGCTGCCGACGA
>Trans2
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
>Trans2_rc
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
```
We had not been using the two optional palindrome-mode fields:
```
ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>:<minAdapterLength>:<keepBothReads>
```
for which:
**minAdapterLength**: In addition to the alignment score, palindrome mode can verify that a minimum length of adapter has been detected. If unspecified, this defaults to 8 bases, for historical reasons. However, since palindrome mode has a very low false positive rate, this can be safely reduced, even down to 1, to allow shorter adapter fragments to be removed.
**keepBothReads**: After read-through has been detected by palindrome mode, and the adapter sequence removed, the reverse read contains the same sequence information as the forward read, albeit in reverse complement. For this reason, the default behaviour is to entirely drop the reverse read. By specifying 'true' for this parameter, the reverse read will also be retained, which may be useful e.g. if the downstream tools cannot handle a combination of paired and unpaired reads.
Trimmomatic with these additions retained ~98.8% of reads.
```
ILLUMINACLIP:/modules/apps/trimmomatic/0.39/adapters/NexteraPE-PE.fa:2:30:10:1:true LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:36
```
<h3>Prep for analyses in ANGSD</h3>
Working through these steps for now with the initial bbduk-trimmed dataset (in <FONT COLOR="#F8766D">**red** </FONT>above) for an initial look at things.
Prepare files:
Add RG header
```
java -jar -Xms72G -Xmx90G /modules/apps/picard/2.26.11/picard.jar AddOrReplaceReadGroups \
I=$OUTDIR/${SAMPLE}_bb_SORT_DR.bam \
O=$OUTDIR/${SAMPLE}_bb_SORT_DR_RG.bam \
TMP_DIR=$TMP \
SORT_ORDER=coordinate RGID=1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=${SAMPLE} CREATE_INDEX=True
```
Index new bam files (with RG header) - skipped, since `CREATE_INDEX=True` above already writes an index
```
#samtools index $OUTDIR/${SAMPLE}_bb_SORT_DR_RG.bam
```
Re-align around indels for use with ANGSD
```
conda activate gatk3
gatk3 \
-T RealignerTargetCreator \
-R $REFDIR/$REF \
-I $OUTDIR/${SAMPLE}_bb_SORT_DR_RG.bam \
-o $OUTDIR/${SAMPLE}.intervals \
-drf BadMate
gatk3 \
-T IndelRealigner \
-R $REFDIR/$REF \
-I $OUTDIR/${SAMPLE}_bb_SORT_DR_RG.bam \
--targetIntervals $OUTDIR/${SAMPLE}.intervals \
--consensusDeterminationModel USE_READS \
-o $OUTDIR/${SAMPLE}_bb_SORT_DR_RG_IR.bam
conda deactivate
```
Removed ~5-10% of reads for failing MappingQualityZeroFilter
Final sort and index
```
samtools sort $OUTDIR/${SAMPLE}_bb_SORT_DR_RG_IR.bam -o $OUTDIR/${SAMPLE}_bb_DR_RG_IR_SORT.bam
samtools index $OUTDIR/${SAMPLE}_bb_DR_RG_IR_SORT.bam
```
Generate beagle files to move forward with analyses based on genotype likelihoods in ANGSD. The loggerhead reference genome includes 28 chromosomes plus ~2000 scaffolds; using just the 28 chromosomes for assessing population structure (see 02_ANGSD.sh).
```
module load angsd/0.935
angsd -GL 1 -out $OUTDIR/all_${CHR} -nThreads 16 -doGlf 2 -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -C 50 -bam ./all_bam_paths.txt -r $CHR -ref $REFDIR/$REF
```
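The per-chromosome array runs off a chromosome list; one could build it from the reference index (a sketch; assumes the 28 chromosomes are the NC_064* sequences in the .fai, with scaffolds on other accession prefixes):
```
# First column of the faidx index is the sequence name;
# keep only the chromosome-level (NC_064*) entries.
cut -f1 $REFDIR/$REF.fai | grep "^NC_064" > chrom_list.txt
```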
Concatenate the outputs per chromosome into a single file
```
# all_NC_* avoids matching the all.beagle.gz output on re-runs
cat all_NC_*.beagle.gz > all.beagle.gz
```
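Each per-chromosome beagle file carries its own header row, so the naive concatenation keeps all 28 headers. In case a downstream tool complains, a sketch that keeps only the first:
```
# Header from the first file only; append the rest headerless
FILES=(all_NC_*.beagle.gz)
zcat "${FILES[0]}" > all.beagle
for f in "${FILES[@]:1}"; do
  zcat "$f" | tail -n +2 >> all.beagle
done
gzip -f all.beagle
```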
<h3>Preliminary population structure analysis</h3>
<h4> bbduk-trimmed dataset (with quality filtering)</h4>
Covariance matrix estimation with PCAngsd for PCA
```
conda activate pcangsd
pcangsd -b $OUTDIR/8x.beagle.gz -o $OUTDIR/Results/8x_PCAngsd_covmat
conda deactivate
```
Initial PCA including all samples regardless of coverage (n=132)

There were quite a few outliers here, so ran again with just the samples that had 8x coverage or above.
PCA including only samples 8x and above (n=40)

Two pairs of samples from Melbourne seem to be true outliers. From different populations? Their mitochondrial haplotypes are not unique: MB113, MB121 (mtDNA Cc-A2.1); MB103, MB316 (mtDNA ~Cc-A1.1)
Should they be removed before Fst analyses?
PCA including only samples >8x (4 outliers removed)

Admixture estimation with PCAngsd
```
conda activate pcangsd
pcangsd -b $INDIR/8x.beagle.gz --admix -t 16 -e $k -o $OUTDIR/8x_Admix_$k
conda deactivate
```
For the >8x dataset
k=2

k=3

k=4

<h4> bbduk-trimmed dataset (no quality filtering at trimming step)</h4>
PCA including all samples regardless of coverage PC1-PC2 (n=132)

Two outliers are still clear outliers (MB103, MB316), but the other two outliers from the previous dataset (MB113, MB121) are less distinct now with the added coverage in other IDs.
PCA including all samples regardless of coverage PC2-PC3 (n=132)

PC3 distinguishes MB113 & MB121 from other IDs.
<h3>Filtering</h3>
Which samples to drop from subsequent analysis due to data quality?
Seven IDs aligned to the reference at <90%.
| Sample_ID | % aligned | Coverage (x) |
| -------- | -------- | -------- |
| MB303 | 26.59 | 0.37 |
| KEY063_2\*\* | 59.02 | 1.71 |
| MB111 | 62.93 | 4.38\* |
| MB265\*\* | 67.34 | 0.28 |
| KEY445 | 74.55 | 1.51 |
| KEY435\*\* | 75.12 | 0.37 |
| KEY456\*\* | 83.97 | 0.47 |

\*filtered for alignment even though coverage >3x
\*\*satellite-tagged turtle
One additional ID has coverage <3x: KEY287 (1.31x)
With these eight removed (aligned <90% and/or coverage <3x), n=124 remain:
Keewaydin_GOM (-1) 24/25
Keewaydin_SNWA (-4) 21/25
Melbourne_GOM (-1) 24/25
Melbourne_MAB (-1) 28/29
Melbourne_SNWA (-1) 27/28
PCA filtered data PC1-PC2 (n=124)
3x-filtered-bbktrim

PCA filtered data PC2-PC3 (n=124)
3x-filtered-bbktrim

PCA filtered data PC3-PC4 (n=124)
3x-filtered-bbktrim

Low variance explained by PCs 1-4, no clear clusters among rookeries or foraging areas.
pcangsd admixture analysis
3x-filtered-bbktrim

<h3>Prelim Fst Outlier Analyses</h3>
Trial first with all 124 filtered IDs and parameters from the leatherback dataset (see 03_Fst_outliers.sh):
MIN_MAF=0.05
MIN_DEPTH=1
MAX_DEPTH_FACTOR=4
MAX_DEPTH = (4 * 124) = 496
PERCENT_IND=0.60
MIN_IND= (0.6 * 124) = 74
**Question: What is the optimal value for max depth?**
Create MAF files to select for sites that meet the criteria above:
```
angsd -P $NB_CPU -nQueueSize 50 \
-doMaf 2 -GL 2 -doGlf 2 -doMajorMinor 1 -doCounts 1 \
-ref $REFDIR/$REF -remove_bads 1 -minMapQ 30 -minQ 20 -skipTriallelic 1 \
-minInd $MIN_IND -minMaf $MIN_MAF -SNP_pval 1e-6 -setMaxDepth $MAX_DEPTH -setMinDepthInd $MIN_DEPTH \
-b $DIR/3x_filtered_bb_noq_bam_paths.txt -r ${CHR} \
-out $OUTDIR/all_maf"$MIN_MAF"_pctind"$PERCENT_IND"_maxdepth"$MAX_DEPTH_FACTOR"_${CHR}
```
Convert the MAF files per chromosome to a single file, and format for use with -sites command for SFS (keep just first four columns)
```
for q in {473..500}; do
# ^NC_064 keeps data rows and drops each per-chromosome header line
zcat all_maf0.05_pctind0.60_maxdepth4_NC_064${q}.1.mafs.gz | grep "^NC_064" >> outputfile.mafs;
done
awk 'BEGIN { OFS="\t" }{print $1,$2,$3,$4}' outputfile.mafs > sites_all_maf0.05_pctind0.60_maxdepth4
```
Index the sites file
```
angsd sites index /nese/meclab/Katrina/cc/bam_paths/sites_all_maf0.05_pctind0.60_maxdepth4
```
First trial with MAB vs SNWA - SFS by group
Pooled IDs from both rookeries for SNWA, MAB is only from Melbourne
MAB: n = 28
SNWA: n = 48
```
for POP in MAB SNWA
do
echo $POP
REFDIR=/nese/meclab/Shared/reference_genomes/ST_reference_genomes/CCare_1.0
REF=GCF_023653815.1_GSC_CCare_1.0_genomic.fna
OUTDIR=/scratch/workspace/katrinaphill_umass_edu-simple/ANGSD/for_Fst
DIR=/nese/meclab/Katrina/cc/bam_paths
MIN_DEPTH=1
MAX_DEPTH_FACTOR=4
PERCENT_IND=0.60
N_IND=$(cat $DIR/$POP'_3x_noq_bam.txt' | wc -l)
MIN_IND_FLOAT=$(echo "($N_IND * $PERCENT_IND)"| bc -l)
MIN_IND=${MIN_IND_FLOAT%.*}
MAX_DEPTH=$(echo "($N_IND * $MAX_DEPTH_FACTOR)" |bc -l)
angsd -P $NB_CPU -nQueueSize 50 \
-doMaf 2 -GL 2 -doMajorMinor 3 -dosaf 1 -docounts 1 \
-b $DIR/$POP'_3x_noq_bam.txt' -ref $REFDIR/$REF -anc $REFDIR/$REF -out $OUTDIR/sfs_results_M-S/$POP \
-remove_bads 1 -minMapQ 30 -minQ 20 -skipTriallelic 1 \
-minInd $MIN_IND -setMaxDepth $MAX_DEPTH -setMinDepthInd $MIN_DEPTH \
-sites "$OUTDIR"/sites_all_maf0.05_pctind0.60_maxdepth4
realSFS $OUTDIR/sfs_results_M-S/$POP.saf.idx -fold 1 >$OUTDIR/sfs_results_M-S/$POP.folded.sfs
done
```
Generate 2D SFS between the 2 groups, capping the number of sites at 300,000,000.
```
realSFS $OUTDIR/sfs_results_M-S/MAB.saf.idx $OUTDIR/sfs_results_M-S/SNWA.saf.idx -nSites 300000000 >$OUTDIR/sfs_results_M-S/MAB.SNWA.sfs
```
Calculate Fst values
```
realSFS fst index $OUTDIR/sfs_results_M-S/MAB.saf.idx $OUTDIR/sfs_results_M-S/SNWA.saf.idx -sfs $OUTDIR/sfs_results_M-S/MAB.SNWA.sfs -fstout $OUTDIR/sfs_results_M-S/MAB.SNWA -whichFst 1
```
In the .err file from running this script:
```
IMPORTANT: please make sure that your saf files hasnt been folded with -fold 1 in -doSaf in angsd
```
Sliding window Fst scan with fst stats
```
realSFS fst stats2 $OUTDIR/sfs_results_M-S/MAB.SNWA.fst.idx -win 10000 -step 1000 >$OUTDIR/sfs_results_M-S/MAB.SNWA.fst.slidingwindow
```
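To list the highest-Fst windows from the sliding-window output (a sketch; assumes the stats2 columns are region, chr, midPos, Nsites, Fst, with a header row to skip):
```
# Top 10 windows by the 5th (Fst) column, sorted descending
tail -n +2 $OUTDIR/sfs_results_M-S/MAB.SNWA.fst.slidingwindow \
| sort -k5,5gr | head -n 10
```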
Fst Outliers per chromosome for MAB-SNWA foragers

Max Fst 0.18 on chromosome 5 (NC_064477.1)
**Fst Outlier parameter testing**
*All below with the filtered bb-ktrim dataset*
**Parameters tested**
* Minimum MAF: 0.05, 0.03, 0.01
* Percent of individuals: 60%, 17%
* Max depth factor: 4, 15, 30
* Sites file based on subsets or the full dataset
* No sites file (with and without min MAF)

**Combinations tested**
* Fst: KEY-GOM v SNWA, pooled-GOM v SNWA, MB-GOM v SNWA, MB-MAB v SNWA, MB-MAB v GOM, KEY-random v random, pooled-rookery v rookery
* PBS: MB-GOM v SNWA v MAB, pooled-GOM v SNWA v MAB
[Summary of fst results](https://drive.google.com/drive/u/0/folders/1wk9s2wj4VLvmBrTQu1e1gVrwss-zh-pR): see fst_analyses_summary.pdf
Results comparisons of note
1. Depth factor 4 -> 15

The average depth of this dataset is ~9x, so a depth factor of 4 needs adjusting upward
2. Depth factor 15 -> 30

Only slight differences between depth factors 15 and 30. Moving forward using depth factor 30.
3. Fst between 2 foraging sites vs between 2 randomly sorted groups

Overall range of Fst values doesn't differ much, though the locations of outlier peaks do.
**Is Fst between foraging sites any more informative than random differences?**
4. Fst between two foraging sites vs between the two rookeries

Right panel has 6 fewer IDs, to equalize sample sizes
**Notes for creating -sites file for Fst analyses using all samples:**
MIN_MAF=0.05 (retained more sites with lower min_maf but resulting Fst peaks similar)
MIN_DEPTH=1
MAX_DEPTH_FACTOR=30
PERCENT_IND=0.17 (because the smallest group is Keewaydin-SNWA with 21 samples out of the 124 total); revisit, perhaps go back to 0.60
And for the SFS per group:
MIN_DEPTH=1
MAX_DEPTH_FACTOR=30
PERCENT_IND=0.60
<h3>Fst Outlier Tests</h3>
3-way PBS within the MB rookery vs across rookeries

Reduced group sizes for equal sampling based on foraging assignment posterior probability (n=24 x3 for MB and n=28 x3 when pooled)
In examples above, there may be a peak of interest on chromosome 6 when comparing foraging sites GOM (n=48) and SNWA (n=48). This peak does not appear when comparing these phenotypes just within the KEY rookery (left panel of #3, n=21 v n=24).

Perhaps related to DLK1
Keep an eye on chrom 15, early trials including the MAB foraging group indicated outlier peak near end of chrom.
Re-did the Fst analysis in ANGSD for just the region of the peak shown, with a 1-bp sliding-window step to pinpoint outlier locations

How to set outlier threshold? The Fst values are higher with the smaller sliding window than in the original analysis


[Poster presented at Evolution 2024](https://drive.google.com/drive/u/0/folders/1dsKPCLf2JoiBhqN6GCyeDMkVc9yRpRZz): see Evolution_Phillips_draft5.pdf
<h3>Prelim Baypass Analyses</h3>
Started with data from one rookery (KEY) and comparing 2 foraging sites (GOM - SNWA) - core model, no covariates.

12 outliers (in <FONT COLOR="#FF0000">red</FONT> above) across 6 chromosomes:
chrom1, chrom2, chrom4, chrom9, chrom13, chrom26
Peak on chromosome 1 similar to the Fst outlier plot for KEY:GOM-SNWA (below)

Baypass outliers (Rookery: Keewaydin; Foraging sites: GOM-SNWA)
*also indicated as outlier in Fst analyses
| chromosome | position | gene |
| -------- | -------- | -------- |
| NC_064473.1 | 25554446 | MRE11* |
| NC_064473.1 | 25570950 | MRE11* |
| NC_064473.1 | 25588436 | MRE11* |
| NC_064473.1 | 25589016 | MRE11* |
| NC_064473.1 | 188840374 | LOC125645292 |
| NC_064473.1 | 282743125 | HMGA2 |
| NC_064474.1 | 153545914 | LOC125632159*; LOC125632161*; MYLIP |
| NC_064474.1 | 263278740 | no gene at this location? |
| NC_064476.1 | 33553153 | KIAA1549L |
| NC_064481.1 | 86517141 | no gene at this location? |
| NC_064485.1 | 36053543 | SLC26A11* |
| NC_064498.1 | 1645240 | KMT2D |
For the same analysis in the other rookery (MB) and comparing 2 foraging sites (GOM - SNWA) - core model, no covariates: 0 outliers

The equivalent Fst outlier plot for MB:GOM-SNWA (below)

Next up - PCA with just outlier SNPs? From Fst + Baypass
<h3>GATK haplotype calling</h3>
For the 82 samples that have coverage ≥8x:
Keewaydin_GOM 16/25
Keewaydin_SNWA 14/25
Melbourne_GOM 17/25
Melbourne_MAB 15/29
Melbourne_SNWA 20/28
**Consider including lower coverage samples?**
1. Call haplotypes per individual, see `10_GATK_HaplotypeCaller-perindiv.sh`
```
gatk \
HaplotypeCaller \
--java-options "-Xms60G -Xmx90G -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
-R ${REFDIR}/${REF} \
-ERC GVCF \
-mbq 20 \
-I ${ALIGNDIR}/${SAMPLE}_bb_ktrim_DR_RG_IR_SORT.bam \
-O ${OUTDIR}/${SAMPLE}_bb_ktrim_DR_RG_IR_SORT.bam.g.vcf
```
2. Import into database per chromosome, see `11_GATK_GenomicsDBImport.sh`
```
gatk --java-options "-Xmx240g -Xms120g -XX:ParallelGCThreads=2" GenomicsDBImport \
--sample-name-map sample_map \
--genomicsdb-shared-posixfs-optimizations true \
--batch-size 20 \
--merge-input-intervals \
--genomicsdb-workspace-path ${OUTDIR}/${CHR} \
-L ${CHR}
```
Completed in ~18 hours
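GenomicsDBImport's --sample-name-map expects a two-column, tab-separated file of sample name and GVCF path; a hypothetical sample_map for this dataset (paths and the KEY001 ID are placeholders):
```
# sample_map format: <sample><TAB><path to g.vcf>, one line per individual
KEY001	/path/to/gvcfs/KEY001_bb_ktrim_DR_RG_IR_SORT.bam.g.vcf
MB103	/path/to/gvcfs/MB103_bb_ktrim_DR_RG_IR_SORT.bam.g.vcf
```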
3. Joint genotyping, see `12_GATK_JointGenotyping.sh`
```
gatk --java-options "-Xmx2g -Xms2g -XX:ParallelGCThreads=2" GenotypeGVCFs \
-R ${REFDIR}/${REF} \
-V gendb:///scratch2/workspace/katrinaphill_umass_edu-cc_WGR/VCFs/${CHR} \
-O ${OUTDIR}/WGR_82_Cc_${CHR}.vcf
```
4. Select variants, see `13_GATK_selectVariants.sh`
```
gatk \
SelectVariants \
--java-options "-Xms60G -Xmx90G -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
-R ${REFDIR}/${REF} \
--select-type-to-include SNP \
--select-type-to-include INDEL \
-L ${CHR} \
-V ${INDIR}/${VCF} \
-O ${OUTDIR}/${SelV_VCF}
```
5. Filter variants, see `14_GATK_FilterVariants.sh`
```
for CHR in {473..500}; do
input_vcf=/scratch2/workspace/katrinaphill_umass_edu-cc_WGR/SelectVariants_VCFs/SelV_82_Cc_NC_064${CHR}.1.vcf # formerly tvcf
fileprefix=$(echo NC_064$CHR.1)
vcftools --vcf $input_vcf --min-meanDP 5 --max-meanDP 200 --recode --mac 1 --out /scratch2/workspace/katrinaphill_umass_edu-cc_WGR/filtered_VCFs/${fileprefix} # formerly /fvcf/
done
```
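A quick tally of variants retained per chromosome after filtering (a sketch over the recode outputs above):
```
# Count non-header records (variants) in each filtered per-chromosome VCF
for f in /scratch2/workspace/katrinaphill_umass_edu-cc_WGR/filtered_VCFs/NC_064*.recode.vcf; do
  echo -e "$(basename "$f" .recode.vcf)\t$(grep -vc '^#' "$f")"
done
```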
6. Concatenate filtered VCFs, see `15_GATK_Concat_filteredvcf.sh`
```
bcftools concat $INDIR/NC_064*.recode.vcf -o /scratch2/workspace/katrinaphill_umass_edu-cc_WGR/filtered_VCFs/Cc_82_combined_filtered.vcf
```
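The eigenvec/eigenval files read by the R script below came from PLINK (the exact command isn't recorded in this note); a sketch of the kind of PLINK 1.9 call that would produce them:
```
# Hypothetical call producing Cc_WGR.eigenvec / Cc_WGR.eigenval;
# --allow-extra-chr tolerates the NC_* chromosome names and --pca 20
# matches the 20 PCs the R script plots.
plink --vcf Cc_82_combined_filtered.vcf --double-id --allow-extra-chr \
--set-missing-var-ids @:# --pca 20 --out Cc_WGR
```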
PCA of VCF file
```
#R script using eigenvec and eigenval output
library(tidyverse) # for read_table, as_tibble, and ggplot2
# read in data
pca <- read_table("./VCF/Cc_WGR.eigenvec", col_names = FALSE)
eigenval <- scan("./VCF/Cc_WGR.eigenval")
# sort out the pca data
# remove nuisance column
pca <- pca[,-1]
# set names
names(pca)[1] <- "ind"
names(pca)[2:ncol(pca)] <- paste0("PC", 1:(ncol(pca)-1))
# sort out by population/location
loc <- rep(NA, length(pca$ind))
loc[grep("KEY", pca$ind)] <- "Keewaydin"
loc[grep("MB", pca$ind)] <- "Mebourne"
# remake data.frame
pca <- as_tibble(data.frame(pca, loc))
# first convert to percentage variance explained
pve <- data.frame(PC = 1:20, pve = eigenval/sum(eigenval)*100)
# make plot
a <- ggplot(pve, aes(PC, pve)) + geom_bar(stat = "identity")
a + ylab("Percentage variance explained") + theme_light()
# calculate the cumulative sum of the percentage variance explained
cumsum(pve$pve)
# plot pca
b <- ggplot(pca, aes(PC1, PC2, col = loc)) + geom_point(size = 3) +
scale_colour_manual(values = c("red", "blue")) +
coord_equal() + theme_light() +
xlab(paste0("PC1 (", signif(pve$pve[1], 3), "%)")) + ylab(paste0("PC2 (", signif(pve$pve[2], 3), "%)"))
```

Not the same two outliers as the ones from the ANGSD PCA (these two are from a different rookery)
Optional:
snpEff.sh to annotate variant sites
Another PCA analysis using the package SNPRelate ([tutorial](https://www.bioconductor.org/packages/release/bioc/vignettes/SNPRelate/inst/doc/SNPRelate.html))
```
library(gdsfmt)
library(SNPRelate)
#CONVERT PLINK FILES TO GDS
bed.fn <- "./Cc_WGR_numericChr.bed"
fam.fn <- "./Cc_WGR_numericChr.fam"
bim.fn <- "./Cc_WGR_numericChr.bim"
# Convert
snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, "test.gds")
# Summary
snpgdsSummary("test.gds")
#DATA ANALYSIS
# Open the GDS file
genofile <- snpgdsOpen("test.gds")
# Get population information
pop_code <- scan("pop.txt", what=character())
# Get sample id
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
#LD pruning
set.seed(1000)
# Try different LD thresholds for sensitivity analysis
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2)
# Get all selected snp id
snpset.id <- unlist(unname(snpset))
head(snpset.id)
# Run PCA
pca <- snpgdsPCA(genofile, snp.id=snpset.id, num.thread=2)
# variance explained (%)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
# Get sample id
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
# assume the order of sample IDs is the same as the population codes
head(cbind(sample.id, pop_code))
# Make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
                  pop = factor(pop_code)[match(pca$sample.id, sample.id)],
                  EV1 = pca$eigenvect[,1], # the first eigenvector
                  EV2 = pca$eigenvect[,2], # the second eigenvector
                  stringsAsFactors = FALSE)
head(tab)
# Draw 1-2
plot(tab$EV1, tab$EV2, pch=19, col=as.integer(tab$pop), xlab="eigenvector 1", ylab="eigenvector 2")
legend("bottomright", legend=levels(tab$pop), pch=19, cex=1, col=1:nlevels(tab$pop))
#Plot the principal component pairs for the first four PCs:
lbls <- paste("PC", 1:4, "\n", format(pc.percent[1:4], digits=2), "%", sep="")
pairs(pca$eigenvect[,1:4], col=tab$pop, pch=19, labels=lbls)
```


As an aside, analysis of relatedness
Identity by descent (IBD) using PLINK method of moments (MoM)
```
## Relatedness - Keewaydin
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
KEY.id <- sample.id[pop_code == "Keewaydin"]
# Estimate IBD coefficients
ibd <- snpgdsIBDMoM(genofile, sample.id=KEY.id, snp.id=snpset.id,
                    maf=0.05, missing.rate=0.05, num.thread=2)
# Make a data.frame
ibd.coeff <- snpgdsIBDSelection(ibd)
head(ibd.coeff)
plot(ibd.coeff$k0, ibd.coeff$k1, xlim=c(0,1), ylim=c(0,1),
     xlab="k0", ylab="k1", main="KEY samples (MoM)")
lines(c(0,1), c(1,0), col="red", lty=2)
```
Not entirely sure how to interpret, but there seems to be higher relatedness among Melbourne Beach samples than among Keewaydin samples, which would be surprising.
k0 = IBD coefficient: probability that a pair shares zero alleles identical by descent
k1 = IBD coefficient: probability that a pair shares one allele identical by descent

