###### tags: `hybpiper` `alleles` `matt`

# Isoetes Target Capture Workflow

[TOC]

## Intro and Previous Issues

Previously, we used the supercontigs generated by HybPiper as a reference sequence for each sample, then used the GATK variant pipeline to call variants. These variants were then phased into haplotypes, from which two alleles were extracted for each gene.

On most gene trees, there is a set of samples that tends to fall on a long branch. This could be misidentification, or paralogy. There are also a number of gene trees with distinct clades that have only one allele per individual, suggesting paralogy. It is hard to disentangle:

- Polyploids
- Paralogs at only some loci for some samples
- Alleles

Our current strategy is to restart everything with an _Isoetes_-specific version of the target file. Paralogs can then be assessed with respect to the _Isoetes_ targets and flagged using HybPiper's scripts.

## HybPiper

### Choose Target Individual

We identified the sample with the best coverage using the `marked.bam` output of the GATK workflow. The top ten samples in terms of reads mapped were:

```
FF290_10_Isoetes_nuttallii  664171  25.13%
FF290_9_Isoetes_nuttallii   602407  24.74%
FF285_11_Isoetes_orcuttii   592963  25.24%
KW380_7_Isoetes_orcuttii    567807  26.31%
CR5205_06_Isoetes_nuttallii 564629  23.36%
CR5205_01_Isoetes_nuttallii 516355  26.28%
FF259_5_Isoetes_nuttallii   513826  33.09%
FF286_5_Isoetes_orcuttii    508568  22.02%
FF257_8_Isoetes_nuttallii   504183  28.62%
FF257_1_Isoetes_nuttallii   502999  26.76%
```

We will use `FF290_10_Isoetes_nuttallii` to extract coding sequences. If an exemplar from _I. orcuttii_ is also needed, we can use `FF285_11_Isoetes_orcuttii`.

### Extract Coding Sequence from Supercontigs

Using the supercontig file for `FF290_10_Isoetes_nuttallii`, we identified coding regions with [Augustus](http://bioinf.uni-greifswald.de/augustus/) (using Arabidopsis as the "closest" reference species for identifying coding sequences):

`augustus --species=arabidopsis FF290_10_Isoetes_nuttallii.supercontigs.fasta --gff3=on --codingseq=on > FF290_10_Isoetes_nuttalli.augustus.gff`

Extract the coding sequence:

`getAnnoFasta.pl FF290_10_Isoetes_nuttalli.augustus.gff`

This makes the file `FF290_10_Isoetes_nuttalli.augustus.codingseq`, which has DNA bases in lower case and extra info appended to the gene names. A quick Python script makes the changes:

```python=3.6
import sys

from Bio import SeqIO

for seq in SeqIO.parse(sys.argv[1], 'fasta'):
    # Keep only the part of the name before the first ".", clear the
    # description, and restore upper-case bases
    seq.id = seq.id.split(".")[0]
    seq.description = ''
    seq.seq = seq.seq.upper()
    SeqIO.write(seq, sys.stdout, 'fasta')
```

Run with:

`python make_target_file.py FF290_10_Isoetes_nuttalli.augustus.codingseq > isoetes.goflag.targets.fasta`

One bit of weirdness: gene353 had multiple open reading frames, as the supercontig for this gene was 4500 bp. I'm sure that's not correct; for now, HybPiper will ignore this gene because of the duplicate names.
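Since duplicated names in the target file cause HybPiper to skip a gene, it may be worth listing any duplicated IDs before kicking off the runs. A minimal sketch (a hypothetical helper, not part of the original workflow):

```python=
import sys
from collections import Counter

from Bio import SeqIO

# Count how many records share each ID in the candidate target file
counts = Counter(seq.id for seq in SeqIO.parse(sys.argv[1], 'fasta'))
for gene, n in sorted(counts.items()):
    if n > 1:
        print("{}\t{} copies".format(gene, n))
```

Run as, e.g., `python check_duplicates.py isoetes.goflag.targets.fasta` (hypothetical script name); any gene it prints, like gene353 above, will be ignored by HybPiper.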
### Rerun HybPiper with New Target File

`namelist.txt` contains the prefix for every sample in the raw reads.

HybPiper commands in a loop over all samples in `hybpiper.sh`:

```bash=
set -eo pipefail

while read name
do
    /opt/apps/Software/HybPiper/reads_first.py \
        -r /scratch/lindsawi/isoetes/output/paired/$name/"$name"_R*_paired.fastq \
        -b ../new_targets/isoetes.goflag.targets.fasta \
        --bwa --cpu 50 --prefix $name
    python /opt/apps/Software/HybPiper/intronerate.py --prefix $name
    python /opt/apps/Software/HybPiper/cleanup.py $name
done < ../namelist.txt
```

## Identifying Paralogs

HybPiper will flag a paralog warning if there are two or more assembled contigs that each span >75% of the length of the target gene. The genes with paralog warnings are written to `genes_with_paralog_warnings.txt` in the HybPiper output directory.

This command extracts coding sequence from each paralog for each sample in `namelist.txt`:

`parallel python /opt/apps/Software/HybPiper/paralog_investigator.py {} :::: ../namelist.txt`

The next command collects the sequences and potential paralogs for a single gene and writes them to a FASTA file. The script also writes a report on the number of sequences found to stderr, which is captured in `paralog_report.txt`:

`parallel "python /opt/apps/Software/HybPiper/paralog_retriever.py ../namelist.txt {} > ../sequences/paralogs/{}.paralogs.fasta" :::: ../genelist.txt 2> ../paralog_report.txt`

Sorting the report, we find that two samples have a high proportion of genes with paralog warnings:

| Sample | GenesRecovered | GenesWithParalogs |
| --- | --- | --- |
| FF254_3_Isoetes_orcuttii | 221 | 139 |
| FF254_2_Isoetes_orcuttii | 220 | 28 |
| FF290_9_Isoetes_nuttallii | 223 | 7 |
| FF262_3_Isoetes_orcuttii | 223 | 6 |
| FF263_25_Isoetes_nuttallii | 220 | 6 |
| FF279_9_Isoetes_nuttallii | 220 | 6 |
| FF290_11_Isoetes_nuttallii | 222 | 6 |

All other samples had five or fewer genes with paralog warnings. These results suggest that at least two samples, `FF254_3_Isoetes_orcuttii` and `FF254_2_Isoetes_orcuttii`, may be polyploids, and they should be excluded from further analysis.

There were also a few genes with more than 10 samples showing paralog warnings, suggesting gene duplicates:

| Gene | SamplesWithParalogs |
| --- | --- |
| gene58 | 89 |
| gene101 | 30 |
| gene353 | 26 |
| gene437 | 18 |
| gene249 | 17 |
| gene400 | 11 |

As previously mentioned, gene353 had multiple ORFs and deserves further attention. For now, excluding these six genes should reduce downstream issues.

[The full report can be found here](https://texastechuniversity-my.sharepoint.com/:x:/g/personal/matt_johnson_ttu_edu/ETbWSmUWTEtIrHOWdr0sayMBkrr2f6sSEkE2_amHbQctZQ?e=HKJnVm).

## Comparing Supercontigs

We removed the two samples and six genes mentioned above, then retrieved supercontig sequences:

`python /opt/apps/Software/HybPiper/retrieve_sequences.py ../new_targets/isoetes.goflag.targets.nodups.fasta ../namelist_nodups.txt supercontig`

## Variant Calling

We now redo the variant calling with the hopefully more reliable supercontigs.

`parallel mkdir {} :::: ../namelist_nodups.txt`

Collect the supercontigs only for the samples and genes with minimal paralogs (one way to split them per sample is sketched below), then run the GATK workflow mapping reads against each sample's own supercontigs.
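The collection step itself is not spelled out in the notes. A minimal sketch of one way to do it, assuming the per-gene supercontig files from `retrieve_sequences.py` have been concatenated into a single FASTA whose headers follow HybPiper's `Sample-gene` convention (the script name is hypothetical):

```python=
import os
import sys
from collections import defaultdict

from Bio import SeqIO

# Group records by the sample half of the "Sample-gene" header
by_sample = defaultdict(list)
for seq in SeqIO.parse(sys.argv[1], 'fasta'):
    sample = seq.id.rsplit('-', 1)[0]
    by_sample[sample].append(seq)

# Write one reference per sample, matching the SAMPLE/SAMPLE_supercontigs.fasta
# layout expected by the variant-calling script below
for sample, seqs in by_sample.items():
    os.makedirs(sample, exist_ok=True)
    SeqIO.write(seqs, os.path.join(sample, sample + "_supercontigs.fasta"), 'fasta')
```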
The main modification to the GATK workflow is to keep only variants called from high-quality, well-mapped reads, using this hard filter:

`QD < 5.0 || FS > 60.0 || SOR > 3.0 || MQ < 55.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0`

```bash=
#!/bin/bash
set -eo pipefail

# Set variables
prefix="$1"
read1fn="$prefix"_R1_paired.fastq
read2fn="$prefix"_R2_paired.fastq
readdir=/scratch/lindsawi/isoetes/output/paired/"$prefix"/

cd "$prefix"
reference="$prefix"_supercontigs.fasta

# Index the reference for GATK, BWA, and samtools
if [ ! -f *.dict ]
then
    gatk CreateSequenceDictionary -R $reference
fi
bwa index $reference
samtools faidx $reference

#Align read files to the reference sequence
bwa mem $reference $readdir/$read1fn $readdir/$read2fn | samtools view -bS - | samtools sort - -o "$prefix.sorted.bam"
gatk FastqToSam -F1 $readdir/$read1fn -F2 $readdir/$read2fn -O $prefix.unmapped.bam -SM $prefix

#Add read groups to the mapped and unmapped BAM files using library prep and sequencing information
gatk AddOrReplaceReadGroups -I $prefix.sorted.bam -O $prefix.sorted-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix
gatk AddOrReplaceReadGroups -I $prefix.unmapped.bam -O $prefix.unmapped-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix

#Combine mapped and unmapped BAM files
gatk MergeBamAlignment --ALIGNED_BAM $prefix.sorted-RG.bam --UNMAPPED_BAM $prefix.unmapped-RG.bam -O $prefix.merged.bam -R $reference

#Mark duplicate reads
gatk MarkDuplicates -I $prefix.merged.bam -O $prefix.marked.bam -M $prefix.metrics.txt
samtools index $prefix.marked.bam

#Create VCF
gatk HaplotypeCaller -I $prefix.marked.bam -O $prefix.vcf -R $reference

# Keep only SNPs
gatk SelectVariants -R $reference -V $prefix.vcf --select-type-to-include SNP -O $prefix.SNPs.vcf

# Filter the SNPs
gatk VariantFiltration -R $reference -V "$prefix".SNPs.vcf --filter-name "hardfilter" -O "$prefix".filtered.snps.vcf --filter-expression "QD < 5.0 || FS > 60.0 || SOR > 3.0 || MQ < 55.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"

# Create an alternate reference with IUPAC codes, then remove intermediate files
gatk FastaAlternateReferenceMaker --output $prefix.alternate.supercontigs.fasta --reference $reference --variant $prefix.filtered.snps.vcf --use-iupac-sample $prefix
rm $prefix.sorted.bam $prefix.unmapped.bam $prefix.merged.bam $prefix.unmapped-RG.bam $prefix.sorted-RG.bam
```

Next, we infer phase using WhatsHap and extract phased sequences with `haplonerate.py` ([available here](https://github.com/mossmatters/phyloscripts/tree/master/haplonerate)):

```bash=
#!/bin/bash

#Command line argument: prefix for all files
prefix=$1

#Locations of things
phyloscripts=/home/joh97948/phyloscripts
genelist=/scratch/joh97948/isoetes/genelist_nodups.txt
reference="$prefix"_supercontigs.fasta

cd $prefix

#Replace the FASTA headers, as GATK does weird things to them
#Not needed for subgenomes since we're going to delete outside the largest phase block anyway
#sed -ri "s/>[0-9]+ ([A-Z0-9]+-[A-Z0-9]+):[0-9]+-[0-9]+/\>\1/" $prefix.supercontigs.iupac.fasta

#Re-index the IUPAC reference file
#samtools faidx $prefix.supercontigs.iupac.fasta

#Run WhatsHap to generate a phased VCF
whatshap phase -o $prefix.phased.vcf --reference=$reference $prefix.filtered.snps.vcf $prefix.marked.bam
whatshap stats $prefix.phased.vcf --gtf $prefix.whatshap.gtf --tsv $prefix.whatshap.stats.txt

#BGZIP and index the WhatsHap output
bgzip -c $prefix.phased.vcf > $prefix.phased.vcf.gz
tabix $prefix.phased.vcf.gz

#Extract two FASTA sequences for each gene, corresponding to the two phases
mkdir -p phased_bcftools
rm -f phased_bcftools/*
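# Note (added explanation): the two commands below assume each supercontig is a
# FASTA record named "$prefix-geneNNN", so samtools faidx can pull out one gene
# at a time and bcftools consensus can apply haplotype 1 or haplotype 2 from
# the phased, indexed VCF to that gene's sequence.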
parallel "samtools faidx $reference $prefix-{1} | bcftools consensus -H 1 $prefix.phased.vcf.gz > phased_bcftools/$prefix-{1}.phased.1.fasta" :::: $genelist
parallel "samtools faidx $reference $prefix-{1} | bcftools consensus -H 2 $prefix.phased.vcf.gz > phased_bcftools/$prefix-{1}.phased.2.fasta" :::: $genelist

#For polyploidy analysis, we need to delete sequences outside the largest phase block using Haplonerate
mkdir -p phased_seqs
parallel "python /home/joh97948/phyloscripts/haplonerate/haplonerate.py $prefix.whatshap.gtf phased_bcftools/$prefix-{}.phased.1.fasta phased_bcftools/$prefix-{}.phased.2.fasta --edit delete --ref $reference > phased_seqs/{}.fasta" :::: $genelist
```

## Gene Trees

### Collecting Sequences

Gather all the sequences from the phasing and add the outgroup sequences from _I. bolanderii_ and _I. howellii_:

```
parallel "cat ../phasing/*/phased_seqs/{}.fasta \
    /scratch/lindsawi/isoetes/output/paired/F088_Isoetes_bolanderii/phased_seqs/F088_Isoetes_bolanderii-{}.fasta \
    /scratch/lindsawi/isoetes/output/paired/F088_Isoetes_howellii/phased_seqs/F088_Isoetes_howellii-{}.fasta \
    > unaligned/{}.fasta" :::: ../genelist_nodups.txt
```

### Alignment

`parallel "mafft --localpair --maxiterate 1000 unaligned/{}.fasta > mafft/{}.mafft.fasta" :::: ../genelist_nodups.txt`

`parallel trimal -gt 0.5 -in mafft/{}.mafft.fasta -out trimal/{}.trimal.fasta :::: ../genelist_nodups.txt`

### Gene Tree Inference

Use ModelFinder Plus (`-m MFP`) to pick the best substitution model, with 1000 ultrafast bootstrap replicates:

`iqtree -s {}.mafft.trimal.fasta -m MFP -B 1000`

### SVDQuartets

Another way of looking at the variants called by GATK is SVDQuartets, which can use the IUPAC ambiguity codes generated by GATK FastaAlternateReferenceMaker. However, FastaAlternateReferenceMaker does some strange things to the FASTA headers, so these need to be fixed while separating the files before alignment:

```python=3.6
import os
import sys

from Bio import SeqIO

namelist = [x.rstrip() for x in open(sys.argv[1])]

for prefix in namelist:
    for seq in SeqIO.parse("{}/{}.alternate.supercontigs.fasta".format(prefix, prefix), 'fasta'):
        # Recover the gene ID from the mangled header, rename the record
        # to the sample, and append it to that gene's FASTA file
        geneID = seq.description.split()[1].split(":")[0].split("-")[1]
        seq.id = prefix
        seq.description = ''
        with open("../sequences/iupac/unaligned/{}.iupac.fasta".format(geneID), 'a') as outfile:
            SeqIO.write(seq, outfile, 'fasta')
```

`python ../separate_sequences.py ../namelist_nodups.txt`

**Adding the outgroup sequences:**

```
parallel "cat /scratch/lindsawi/isoetes/output/paired/F088_Isoetes_bolanderii/phased_seqs/F088_Isoetes_bolanderii-{}.fasta \
    /scratch/lindsawi/isoetes/output/paired/F088_Isoetes_howellii/phased_seqs/F088_Isoetes_howellii-{}.fasta \
    >> {}.iupac.fasta" :::: ../../../genelist_nodups.txt
```

**MAFFT:**

`parallel "mafft --localpair --maxiterate 1000 unaligned/{}.iupac.fasta > mafft/{}.iupac.mafft.fasta" :::: ../../genelist_nodups.txt`

**trimAl:**

`parallel trimal -gt 0.5 -in mafft/{}.iupac.mafft.fasta -out trimal/{}.iupac.trimal.fasta :::: ../../genelist_nodups.txt`

**Forgot to remove the gene names from the outgroups:**

`parallel sed -i "s/-{}//g" {}.iupac.trimal.fasta :::: ../../../genelist_nodups.txt`

**Merge the FASTA files:**

`python /opt/apps/Software/HybPiper/fasta_merge.py --fastafiles trimal/* > isoetes.iupac.concat.fasta`

This created an alignment of 226 genes for 140 samples, inserting 428 missing sequences as gaps. The total alignment length is 260,646 bp.
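A quick sanity check of the merged alignment's dimensions (a minimal sketch; `check_alignment.py` is a hypothetical name, and every record should report the same length):

```python=
import sys

from Bio import SeqIO

# Report the number of sequences and the set of sequence lengths;
# a clean alignment should show exactly one length
records = list(SeqIO.parse(sys.argv[1], 'fasta'))
lengths = sorted({len(rec.seq) for rec in records})
print("{} sequences, alignment length(s): {}".format(len(records), lengths))
```

Run as `python check_alignment.py isoetes.iupac.concat.fasta`; here it should report 140 sequences, all of length 260646.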
**Convert to Nexus:**

`NCLconverter -fdnafasta -oisoetes.iupac.concat -enexus isoetes.iupac.concat.fasta`

[The SVDQuartets tree can be viewed here](http://etetoolkit.org/treeview/?treeid=b3e8a0e485faa0e9aff8a2917755189d&algid=).

Based on this tree, most populations appear to cluster together. It also shows that the _I. orcuttii_ samples from population FF254 cluster with the outgroups, which is consistent with the SNP analysis (see below).

## Single Reference Mode

Instead of mapping each sample against itself, another option is to use a single sample as a "reference genome." The advantage is that variants could be called jointly across all individuals, which would be appropriate for estimating demographic and population parameters. However, it may then be trickier to pull out FASTA files for phylogenetic analysis. Another disadvantage when considering multiple species is that choosing one reference may reduce mapping quality for members of distantly diverged populations or species. Based on our initial look at the gene trees, samples within _I. nuttallii_ and _I. orcuttii_ are pretty closely related, so this may not be an issue here.

### Selecting a Reference

No reason not to use the same reference as before, `FF290_10_Isoetes_nuttallii`.

### Re-running the GATK Pipeline

```bash=
#!/bin/bash
set -eo pipefail

## Usage: bash variantcall.sh <reference.fasta> <samplename>

# Set variables
reference=$1
prefix=$2
read1fn="$prefix"_R1_paired.fastq
read2fn="$prefix"_R2_paired.fastq
readdir=/scratch/lindsawi/isoetes/output/paired/"$prefix"/

mkdir -p $prefix
cd $prefix

# The shared reference only needs to be indexed once, so these steps are commented out:
#if [ ! -f $reference.dict ]
#then
#    gatk CreateSequenceDictionary -R $reference
#fi
#bwa index $reference
#samtools faidx $reference

#Align read files to the reference sequence
bwa mem $reference $readdir/$read1fn $readdir/$read2fn | samtools view -bS - | samtools sort - -o "$prefix.sorted.bam"
gatk FastqToSam -F1 $readdir/$read1fn -F2 $readdir/$read2fn -O $prefix.unmapped.bam -SM $prefix

#Add read groups to the mapped and unmapped BAM files using library prep and sequencing information
gatk AddOrReplaceReadGroups -I $prefix.sorted.bam -O $prefix.sorted-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix
gatk AddOrReplaceReadGroups -I $prefix.unmapped.bam -O $prefix.unmapped-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix

#Combine mapped and unmapped BAM files
gatk MergeBamAlignment --ALIGNED_BAM $prefix.sorted-RG.bam --UNMAPPED_BAM $prefix.unmapped-RG.bam -O $prefix.merged.bam -R $reference

#Mark duplicate reads
gatk MarkDuplicates -I $prefix.merged.bam -O $prefix.marked.bam -M $prefix.metrics.txt
samtools index $prefix.marked.bam

#Create GVCF
gatk HaplotypeCaller -I $prefix.marked.bam -O $prefix-g.vcf -ERC GVCF -R $reference

#Remove intermediate files
rm $prefix.sorted.bam $prefix.unmapped.bam $prefix.merged.bam $prefix.unmapped-RG.bam $prefix.sorted-RG.bam
```

Run with:

`while read name; do bash ../variantcall.sh /scratch/joh97948/isoetes/all_to_one/phasing/FF290_10_Isoetes_nuttallii_supercontigs.fasta $name; done < ../../namelist_nodups.txt`

### Joint Genotype Caller

First, add all the individuals' GVCF files to a samples list:

`ls */*-g.vcf > samples.list`
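Before launching the joint genotyping, it may be worth cross-checking that every sample in the namelist actually produced a GVCF. A minimal sketch (hypothetical helper; it relies on each GVCF living in a directory named for its sample, as in the script above):

```python=
import sys

# Compare sample prefixes in the namelist against the per-sample GVCF
# paths collected with `ls */*-g.vcf > samples.list`
names = {line.strip() for line in open(sys.argv[1]) if line.strip()}
gvcf_samples = {line.strip().split('/')[0] for line in open(sys.argv[2]) if line.strip()}
missing = sorted(names - gvcf_samples)
print("samples without a GVCF:", missing if missing else "none")
```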
The script below combines the GVCFs from each individual and then uses GATK's GenotypeGVCFs to call variants jointly across all individuals. The variants are then filtered to retain only SNPs that pass a quality filter. Later in the script, bcftools is used to name the variants, followed by a number of PLINK commands that generate various files:

* `isoetes.snp.filtered.nodot.vcf` contains the filtered SNPs
* `isoetes_pruned` (`.bed` and `.bim`) contains the SNPs retained by an LD filter
* `isoetes_pruned.het` contains heterozygosity stats on the pruned data
* `isoetes.eigenval` and `isoetes.eigenvec` contain the output of a principal component analysis (PCA) of the pruned SNPs

```bash=
#!/bin/bash
set -eo pipefail

## Joint genotyping, SNP filtering, and PLINK summaries for GATK4

reference=$1   # e.g. FF290_10_Isoetes_nuttallii_supercontigs.fasta
prefix=$2      # e.g. isoetes

#Make samples list (done above):
# ls */*-g.vcf > samples.list

# Combine and jointly genotype the GVCFs
gatk CombineGVCFs -R $reference --variant samples.list --output "$prefix".cohort.g.vcf
gatk GenotypeGVCFs -R $reference -V "$prefix".cohort.g.vcf -O "$prefix".cohort.unfiltered.vcf

# Keep only SNPs passing a hard filter
time gatk SelectVariants -V "$prefix".cohort.unfiltered.vcf -R $reference -select-type-to-include SNP -O "$prefix".SNPall.vcf
time gatk VariantFiltration -R $reference -V "$prefix".SNPall.vcf --filter-name "hardfilter" -O "$prefix".snp.filtered.vcf --filter-expression "QD < 5.0 || FS > 60.0 || MQ < 45.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
gatk SelectVariants -V "$prefix".snp.filtered.vcf -O "$prefix".snp.filtered.nocall.vcf --set-filtered-gt-to-nocall

# Name each SNP by its position (CHROM:POS)
bcftools annotate --set-id +"%CHROM:%POS" "$prefix".snp.filtered.nocall.vcf > "$prefix".snp.filtered.nodot.vcf

# Filter out SNPs that didn't pass the filter
plink --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno --const-fid --out "$prefix"
#Alternative for deleting all SNPs with missing data:
#plink --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno 0 --const-fid --out "$prefix"

# LD pruning
plink --indep 50 5 2 --file "$prefix" --allow-extra-chr --out "$prefix"
plink --extract "$prefix".prune.in --out "$prefix"_pruned --file "$prefix" --make-bed --allow-extra-chr --recode

# Generate eigenvalues and loadings for 20 PCA axes
plink --pca 20 --file "$prefix"_pruned --allow-extra-chr --out "$prefix"

# Generate basic stats (heterozygosity, inbreeding coefficient, allele frequencies)
plink --freq --het 'small-sample' --ibc --file "$prefix"_pruned --allow-extra-chr --out "$prefix"_pruned
```

Run with:

`bash ../GenotypesToPCA.sh FF290_10_Isoetes_nuttallii_supercontigs.fasta isoetes`

### PCA from SNPs

```r=
library(ggplot2)
library(tidyr)

pca.eigenvec = read.table("phasing/isoetes.eigenvec")

# Fix sample names so they split cleanly into Population_SampleNum_Genus_Species
pca.eigenvec[,2] = gsub("_(wp13.)_", "_\\1", pca.eigenvec[,2])
pca.eigenvec = separate(pca.eigenvec, 2, c("Population", "SampleNum", "Genus", "Species"), "_")
names(pca.eigenvec)[6:25] = paste("PCA", seq(20), sep='')

PCA1v2 = ggplot(pca.eigenvec, aes(PCA1, PCA2, color=Species)) + geom_point(size=3)
PCA3v4 = ggplot(pca.eigenvec, aes(PCA3, PCA4, color=Population)) + geom_point(size=3)
```

PCA axes 1 and 2 show some differentiation by species:

![](https://i.imgur.com/H6JqsEE.png)

PCA axes 3 and 4 show clustering by population:

![](https://i.imgur.com/bIITOOc.png)

### SVDQuartets

Following this tutorial: https://github.com/mmatschiner/tutorials/blob/master/species_tree_inference_with_snp_data/README.md

SVDQuartets does not handle missing data well, so we restrict the dataset to SNPs with no missing data:

`plink --vcf-filter --vcf isoetes.snp.filtered.nodot.vcf --allow-extra-chr --recode vcf --geno 0 --const-fid --out isoetes.nomissing`
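To see how many SNPs survive the no-missing-data filter, a quick count of the records in the resulting VCF (a minimal sketch; `count_snps.py` is a hypothetical name):

```python=
import sys

# Count variant records (non-header lines) in a VCF
with open(sys.argv[1]) as vcf:
    n_snps = sum(1 for line in vcf if not line.startswith('#'))
print("{} SNPs remain".format(n_snps))
```

Run as `python count_snps.py isoetes.nomissing.vcf`.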
First, use [this Ruby script](https://github.com/mmatschiner/tutorials/blob/master/species_tree_inference_with_snp_data/src/convert_vcf_to_nexus.rb) to convert the VCF into a NEXUS file (the script also uses ambiguity characters, which SVDQuartets can interpret):

`ruby ~/scripts/convert_vcf_to_nexus.rb isoetes.nomissing.vcf isoetes.nomissing.nex`

Then I ran SVDQuartets in PAUP*. [The resulting tree can be viewed here](http://etetoolkit.org/treeview/?treeid=2990e119d627a614fce10b73419f77c3&algid=). Note that the support values and branch lengths are meaningless. The tree shows very good clustering within populations, which gives hope that the all-to-one method is working well for SNPs.