# GATK and WHATSHAP

## Workflow - Paralog investigator

Paralog counts: JF - 40 paralogs; KW - 100 paralogs; NGS - 14 paralogs.

1. Run HybPiper up to paralog_investigator.
   a. Identify paralogs and create fasta files and trees.
   b. This creates a "paralog" folder containing the fasta files (visualize in AliView) and "mafft_aligned/trees/" (visualize in FigTree).
2. Scan the paralog trees - look for examples of taxa split across diverged branches, i.e. two trees in one.
3. Scan the fasta file in AliView or MEGA to look for large sequence divergence within taxa. Align, then sort by name.
   a. Look for two sequences with the same name.
   b. Check across the two sequences of one individual to see if they represent most of the diversity. These two sequences can be your bait consensus sequences.
4. Paste the sequences into a text file with the "a" and "b" notation appended to the gene label (see the relabeling example at the end of this section).

## Workflow - For both exons & supercontigs - all sequences (not just paralogs)

1. Run HybPiper, including exonerate/intronerate, followed by MAFFT and TrimAl, to create fasta files and trees.
2. This gives one consensus sequence per taxon regardless of paralogs, so it requires more careful examination.
3. Examine the exon/supercontig trees (see figure) - look for examples of taxa split across diverged branches, i.e. two trees in one.
   - Folder retSUP has "X_supercontig_mafft_trim.fasta" (aligned fasta); folder "retSUP/SUP_aligned_trimmed/trees/" has the trees.
   - Folder retDNA has "X_mafft_trim.fasta" (aligned fasta); folder "retDNA/DNA_aligned_trimmed/trees/" has the trees.
4. Scan the fasta file in AliView or MEGA for large sequence divergence within taxa that is replicated across taxa. The assumption is that one paralog will be more common in some individuals and the other paralog in other individuals. A paralog will be divergent but also replicated in many taxa; a single divergent sequence could just be poor sequence data. See if your data can be divided into two clearly divergent sequence groups, then identify the two individuals that best represent those groups to be your bait consensus sequences. Watch out for chimeras - i.e. fragments from two sequences joined into one by the assembly process.
5. Paste the sequences into a text file with the "a" and "b" notation appended to the gene label.

NOTE: When you use supercontigs, intronerate does not locate sequences…

## WHATSHAP

WhatsHap is a read-based phasing tool. In the typical case, it expects 1) a VCF file with variants of an individual and 2) a BAM or CRAM file with sequencing reads from that same individual. WhatsHap uses the sequencing reads to reconstruct the haplotypes and then writes out the input VCF augmented with phasing information. The VCF and BAM files are made in GATK.

Rerun HybPiper with just the paralogs in a new folder. Once complete, you can combine the original folder and the paralog folder:

rsync -avz {paralogfolder}/ {Originalfolder}/ {NewcombinedFolder}/

Note - no "./" before the folder names; that will just copy the folder rather than merge its contents.

NOTES: JF - When working with Katie's samples I noticed her paralogs have both copies of the region (two per individual). This was not what I had for Brighamia; I think because I looked at complete sequences. With Katie's samples my protocol is: check the tree, then check the sequences.
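A minimal sketch of the relabeling step above: assuming the two paralog copies of a gene sit in separate fasta files (gene 5168 and the file names here are only illustrations), sed can append the "a"/"b" suffix to the headers as the copies are concatenated into the new target file:

# relabel the two paralog copies before adding them to the target file (names are hypothetical)
sed 's/^>5168/>5168a/' gene5168_copy1.fasta >> new_targets.fasta
sed 's/^>5168/>5168b/' gene5168_copy2.fasta >> new_targets.fasta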
The sequences always need to be aligned, then sorted by name; when two sequences share a name, compare the two.

Deciding what to drop: paralogs with limited representation by taxa (NGS - 4757).

Deciding what is a paralog (NOT an allele): the tree has two clear branches with individuals represented in both.

https://github.com/lindsawi/HybSeq-SNP-Extraction

ALTERNATIVE FOUND BY KATIE: https://github.com/moiexpositoalonsolab/grenepipe - note https://arxiv.org/abs/2103.15167

## Mapping to a reference genome (FROM MATT'S EVERNOTE)

STEP 1: Prerequisite

Need SUPERCONTIGS: From HybPiper, first run the scripts reads_first.py followed by intronerate.py to produce supercontigs for each recovered gene. These supercontigs will be used as the "reference sequence" for the sample. If you have several different species, you will need to run the scripts for each individually.

Select reference: Concatenate all supercontigs into one single (reference) file:

cat prefix/*/prefix/sequences/intron/*_supercontig.fasta > prefix.supercontigs.fasta

Note: "prefix.supercontigs.fasta" will be used as an input on the command line along with samplename.

STEP 2: Before the variant_workflow.sh

- Choose an individual to serve as a reference sequence.
- Get all of the supercontigs for that individual into a single FASTA file. Since this will be the reference for every individual, make sure the FASTA headers for the sequences make sense.
- Run bwa index on that FASTA file.
- Run Picard CreateSequenceDictionary on the fasta file:

java -jar /data/mjohnson/picard.jar CreateSequenceDictionary R=Homo_sapiens_assembly18.fasta O=Homo_sapiens_assembly18.dict

- Run the following SAMtools command:

samtools faidx reference.fa

NOTES of DETAILS on "BEFORE THE VARIANT_WORKFLOW.SH"

Choose an individual to serve as a reference sequence:

###### R-SCRIPT to find the sequences with most data ######

# install.packages("heatmap.plus", dep=TRUE)
library("heatmap.plus")
library("ggplot2")
library("gplots")

setwd("C:/Users/jbfan/Dropbox/Fant/IMLS, Hawaiian sp and Amorphophallus/IMLS, Hawaiian sp and Amorphophallus/Lobeliod (HybSeq)/HybSeq")
sample.filename = "gene_lengths.txt"

# You should not need to change anything below this line!
#-------------------------------------------------------
sample.data = as.matrix(read.table(sample.filename, header=T, row.names=1))
sample.len = sample.data[2:nrow(sample.data),]
reference.len = as.numeric(sample.data[1,])

# Calculate the percentage length recovered relative to the reference.
percent.len = sweep(sample.len, 2, as.numeric(reference.len), '/')
percent.len = ifelse(percent.len > 1, 1, percent.len)

# Plot the heatmap
heatmap.2(percent.len,
  Rowv=FALSE, Colv=FALSE, dendrogram=c('none'),
  cexCol=100/dim(percent.len)[2], cexRow=20/dim(percent.len)[1],
  density.info=c('none'),  # options of different plots to be drawn in the colour key
  trace='none',  # whether a solid "trace" line should be drawn across rows or down columns
  #margins=c(10,5),
  #key=F,
  key=T, keysize=1,  # display colour key
  key.title = NA,
  key.xlab = 'Percent Reference Protein Length Recovered',
  # lmat indicates where to display the colour key and the placement of the four elements of
  # the figure. We don't care about two of them (the dendrograms), which leaves element 1
  # (the heatmap) and element 4 (the key). With three rows, the (unplotted) dendrograms can
  # sit in the margins; lhei and lwid then make the dendrogram row/column very small.
  # The second and third values in lhei correspond to the relative height of the chart vs the key.
  lmat=rbind(c(0,3), c(2,1), c(0,4)),
  lhei=c(0.01, 10, 3),
  lwid=c(0.01, 1),
  # Specify a grayscale colour palette
  col = gray.colors(10, start=0.9, end=0.3, gamma=2.2)
)

# Rank the samples by median/average recovery. percent.len is a matrix, so convert it to a
# data frame before adding columns; c(1:251) is the number of genes in this dataset.
med.len <- as.data.frame(percent.len)
med.len$med = apply(med.len[,c(1:251)], 1, median)
med.len$avg = apply(percent.len[,c(1:251)], 1, mean)
sort(med.len$avg, decreasing = TRUE, na.last = NA)
sort(med.len$med, decreasing = TRUE, na.last = NA)

###### END R-SCRIPT to find the sequences with most data ######

NOTES: JF - BRIN-139 and BRIN-138 have the highest average and highest median reads (although C. erinus and C. simulans are much higher).

NOTES: KW

> sort(med.len$avg, decreasing = TRUE, na.last = NA)
PCI_CLP-1_S44         0.8016132
PCI_CH281-24_S40      0.7856705
PPU_PTS-3_S46         0.7850471
KWSE-S_SBK-27_S31     0.7770054
KWSE-NE_SILB-43_S7    0.7769302

> sort(med.len$med, decreasing = TRUE, na.last = NA)
PCI_CLP-1_S44         0.9453648
KWSE-S_SBK-27_S31     0.9255319
KWSE-S_SMAR-16_S32    0.9225806
PPU_PFR-10_S55        0.9185304
PPU_PTS-3_S46         0.9185304
PCI_CH281-24_S40      0.9177143
KWSE-S_SIC-17-11_S30  0.9177143
PCI_CLA-5_S36         0.9080963
PCI_CED-3_S45         0.9077756
KWSE-NE_SILB-43_S7    0.9077756

KW_P-purpurea_PTS-3_S46 - purpurea is probably most distinct?
KW_Sessiliflora-NE_SILB-43_S7 - I think a northern is most pure.

Get all of the supercontigs for that individual into a single FASTA file. Since this will be your reference for every individual, make sure the FASTA headers for the sequences make sense.

5/21 Nora did, from the reads_first output directory containing all the directories of assembled sequences by sample name:

cat ./luk_013_12/*/sequences/intron/*supercontig.fasta > luk_013_12_combined.fasta

[If using reassembled paralogs, here you need to concatenate them with the other reads, combining the original assembly with the reassembled paralogs. Be sure to remove the original genes that were reassembled as paralogs, i.e. if you reassembled 5168 as 5168a and 5168b, then remove 5168. This might be most easily done in a text editor. Checking this can be a good idea too, because some of Nora's reassembled paralogs were actually shorter than the original assemblies.]

# Nora GS work - creating the reference sequence from the combined fasta file:
grep luk_013_12 combined.fasta > wanted_seq.txt
# I had to get rid of the ">" before each name; I did this with find-and-replace in a text
# editor, lol (see the one-liner after this step for a scripted alternative)
xargs samtools faidx combined.fasta < wanted_seq.txt > wanted_seq_only.fasta

JF - example:
cd /home/jfant/Brighamia_Sequences/BRIN-139
cd /home/kwenzell/katie_sequences/trimmed/KW_Sessiliflora-NE_SILB-43_S7
cd /home/kwenzell/katie_sequences/trimmed/KW_P-purpurea_PTS-3_S46

# Example of locations of supercontigs in folders:
# /scaffold_XAEC_2000846/BRIN-005/sequences/intron/scaffold_XAEC_2000846_supercontig.fasta
cat */*/sequences/intron/*_supercontig.fasta > ../Combined_supercont.fasta

NOTE: On reflection we think the following text (crossed out) is done in the following steps (variantcall.sh) and so is redundant and can be left out.

Run bwa index on that FASTA file.
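For the ">" cleanup above, a scripted alternative to the manual find-and-replace - a sketch assuming GNU grep/sed and Nora's file names:

# keep only header lines for the chosen sample, strip the leading ">"
grep ">" combined.fasta | grep luk_013_12 | sed 's/^>//' > wanted_seq.txt
xargs samtools faidx combined.fasta < wanted_seq.txt > wanted_seq_only.fasta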
# Nora GS work
bwa index wanted_seq_only.fasta

JF - example:
bwa index Combined_supercont.fasta

Index files produced:
Combined_supercont.fasta.sa   286.7 KiB  7/21/2020 2:58:48 PM
Combined_supercont.fasta.ann   10.9 KiB  7/21/2020 2:58:48 PM
Combined_supercont.fasta.amb     13 B    7/21/2020 2:58:48 PM
Combined_supercont.fasta.pac  143.3 KiB  7/21/2020 2:58:47 PM
Combined_supercont.fasta.bwt  573.4 KiB  7/21/2020 2:58:47 PM

JF ran:
samtools faidx Combined_supercont.fasta

Locate your picard.jar file with the command "locate picard.jar". Run Picard CreateSequenceDictionary on the fasta file:

java -jar /data/mjohnson/picard.jar CreateSequenceDictionary R=Homo_sapiens_assembly18.fasta O=Homo_sapiens_assembly18.dict

JF - example:
java -jar /opt/Software/picard/build/libs/picard.jar CreateSequenceDictionary REFERENCE=Combined_supercont.fasta OUTPUT=Combined_supercont.dict

NOTE: You may need to install GATK:
# conda install -c bioconda gatk

STEP 3: Running variantcall.sh

This script takes about 45 minutes per sample. It should be run in a directory that contains the following:
- Directories with all your assembled sequences (reads_first.py output: directories named by sample, each containing a directory per gene with the assembled sequences).
- The reference sequence ("reference".fasta).
- The variantcall.sh script, edited to reflect your reference name.
- The paired reads ("prefix"_R1_paired.fastq & "prefix"_R2_paired.fastq) should either be copied into the same directory as the one holding the sample directories (reads_first output), or the shell script should be edited with the path to the paired reads, e.g.:

read1fn=/home/ngavinsmyth/Noras_Sequences/paired_unpaired_concatenated/"$prefix"_R1_paired.fastq
read2fn=/home/ngavinsmyth/Noras_Sequences/paired_unpaired_concatenated/"$prefix"_R2_paired.fastq

If your paired reads have been copied into the working directory, then:

read1fn=../"$prefix"_R1_paired.fastq
read2fn=../"$prefix"_R2_paired.fastq

The script:
- Uses bwa mem to map paired-end reads to the supercontigs
- Replaces read groups in the mapped and unmapped BAM files
- Removes duplicate reads
- Identifies variant sites using gatk HaplotypeCaller (NOTE: a GVCF is produced)
- Removes intermediate BAM files

# Command line: bash variantcall.sh prefix.supercontigs.fasta samplename
Output: many intermediate BAM files and a GVCF file.

variantcall.sh (from the github):

#!/bin/bash
set -eo pipefail

## Usage: bash variantcall.sh Sporobolus_cryptandrus_ref.fasta GUMOseq-0188
## Where .fasta is the reference sequence and the second argument is the sample name

# Set variables
reference=../$1
prefix=$2
#read1fn="$prefix.R1.paired.fastq"
#read2fn="$prefix.R2.paired.fastq"
read1fn="$prefix"_1P.fastq
read2fn="$prefix"_2P.fastq

cd $prefix
if [ ! -f ../*.dict ]
then
  gatk CreateSequenceDictionary -R $reference
fi

bwa index $reference
samtools faidx $reference

# Align read files to the reference sequence and map
bwa mem $reference $read1fn $read2fn | samtools view -bS - | samtools sort - -o "$prefix.sorted.bam"
# -SM is the sample name (the original notes had "$prefix.sorted.bam" here; RGSM is replaced below either way)
gatk FastqToSam -F1 $read1fn -F2 $read2fn -O $prefix.unmapped.bam -SM $prefix

# Replace read groups in mapped and unmapped bam files using library prep and sequencing information
gatk AddOrReplaceReadGroups -I $prefix.sorted.bam -O $prefix.sorted-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix
gatk AddOrReplaceReadGroups -I $prefix.unmapped.bam -O $prefix.unmapped-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix

# Combine mapped and unmapped BAM files
gatk MergeBamAlignment --ALIGNED_BAM $prefix.sorted-RG.bam --UNMAPPED_BAM $prefix.unmapped-RG.bam -O $prefix.merged.bam -R $reference

# Remove duplicate sequences
gatk MarkDuplicates -I $prefix.merged.bam -O $prefix.marked.bam -M $prefix.metrics.txt
samtools index $prefix.marked.bam

# Create GVCF
gatk HaplotypeCaller -I $prefix.marked.bam -O $prefix-g.vcf -ERC GVCF -R $reference

# Remove intermediate files
rm $prefix.sorted.bam $prefix.unmapped.bam $prefix.merged.bam $prefix.unmapped-RG.bam $prefix.sorted-RG.bam

variantcall.sh (Jeremie modified) - SEE NOTES ON DEDUPE…

#!/bin/bash
set -eo pipefail

## Usage: bash variantcall.sh Sporobolus_cryptandrus_ref.fasta GUMOseq-0188
##   e.g. bash variantcall.sh Combined_supercontigs.fasta BRIN-001
## Where .fasta is the reference sequence and the second argument is the sample name

# Set variables
reference=/home/jfant/Brighamia_Sequences/GATK/$1
prefix=$2
gatkpath=/opt/Software/gatk-4.1.8.1/gatk
picardpath=/opt/Software/picard/build/libs/picard.jar
read1fn=../"$prefix"_R1_paired_dd.fastq
read2fn=../"$prefix"_R2_paired_dd.fastq

# This will find the "folder" which has the same prefix; you then need to point "reference"
# at the reference file location ("/GATK/"). You can change this if doing different runs -
# just track the name in each document.
cd $prefix
if [ ! -f ../GATK/*.dict ]
then
  $gatkpath CreateSequenceDictionary -R $reference
fi

bwa index $reference
samtools faidx $reference

# Align read files to the reference sequence and map
bwa mem $reference $read1fn $read2fn | samtools view -bS - | samtools sort - -o "$prefix.sorted.bam"
$gatkpath FastqToSam -F1 $read1fn -F2 $read2fn -O $prefix.unmapped.bam -SM $prefix

# Replace read groups in mapped and unmapped bam files using library prep and sequencing information
$gatkpath AddOrReplaceReadGroups -I $prefix.sorted.bam -O $prefix.sorted-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix
$gatkpath AddOrReplaceReadGroups -I $prefix.unmapped.bam -O $prefix.unmapped-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix

# Combine mapped and unmapped BAM files
$gatkpath MergeBamAlignment --ALIGNED_BAM $prefix.sorted-RG.bam --UNMAPPED_BAM $prefix.unmapped-RG.bam -O $prefix.merged.bam -R $reference

# Remove duplicate sequences
$gatkpath MarkDuplicates -I $prefix.merged.bam -O $prefix.marked.bam -M $prefix.metrics.txt
samtools index $prefix.marked.bam

# Create GVCF
$gatkpath HaplotypeCaller -I $prefix.marked.bam -O $prefix-g.vcf -ERC GVCF -R $reference

# Remove intermediate files
rm $prefix.sorted.bam $prefix.unmapped.bam $prefix.merged.bam $prefix.unmapped-RG.bam $prefix.sorted-RG.bam

NOTE: This script can be run in a loop:

bash variantcall_new.sh Combined_supercontigs.fasta BRIN-006
bash /home/jfant/Brighamia_Sequences/variantcall_new.sh Combined_supercontigs.fasta BRIN-007
bash /home/jfant/Brighamia_Sequences/variantcall_new.sh Combined_supercontigs.fasta BRIN-009
etc…

NOTE: Running each individual takes around 45 minutes.

Troubleshooting 3: Duplicate reads in our data - Dedupe

Jeremie and Johnathan had a problem with variantcall.sh: it reported a problem with the BAM files. We found that some of our files had duplicate reads (same location, name, everything), and we had to remove the duplicates.

While working through the new script by Matt and changing it to match our data, we made headway, but both of us hit hurdles. It kept asking us to delete the .dict file,
which is odd given that it makes a new one each time. (This was because it was not looking in the right place.)

We were also getting blocked at "# Combine mapped and unmapped BAM files":

$gatkpath MergeBamAlignment --ALIGNED_BAM $prefix.sorted-RG.bam --UNMAPPED_BAM $prefix.unmapped-RG.bam -O $prefix.merged.bam -R $reference

DEDUPE SCRIPT (getting rid of Brighamia duplicate reads in the R1/R2_paired fastq files):

/home/jz/Desktop/bbmap/dedupe.sh in=BRIN-001_R1_paired.fastq out=BRIN-001_R1_paired_dd.fastq sort=name ascending=t rmn=t ac=f -da -Xmx8g

JF SCRIPT:
/opt/Software/bbmap/dedupe.sh in=BRIN-001_R1_paired.fastq out=BRIN-001_R1_paired_dd.fastq sort=name ascending=t rmn=t ac=f -da -Xmx8g

- sort=name ascending=t outputs reads in name order (if left in input order, outputs are inconsistently ordered)
- rmn=t ac=f specify what type of duplicates to remove
- -da and -Xmx8g are Java parameters (-Xmx8g = use 8 GB of RAM; you probably need at least 4)

variantcall.sh was then run using the _dd.fastq files.

For loop:

for file in *_paired.fastq*
do
  file2="${file//paired.fastq/paired_dd.fastq}"
  echo "/home/jz/Desktop/bbmap/dedupe.sh in=$file out=$file2 sort=name ascending=t rmn=t ac=f -da -Xmx8g" >> dedupeloop.sh
  echo "rm $file" >> dedupeloop.sh
done

JF modifications:

for file in *_paired.fastq*
do
  file2=${file//paired.fastq/paired_dd.fastq}
  echo /opt/Software/bbmap/dedupe.sh in=$file out=$file2 sort=name ascending=t rmn=t ac=f -da -Xmx8g >> dedupeloop.sh
done
chmod a+x ./dedupeloop.sh
./dedupeloop.sh

---------------------------------------------------------------------

variantcall.sh script (using the deduped files):

for dir in *_rf; do
  echo bash /home/jz/HybSeq-SNP-Extraction/variantcall2.sh BRIN-139_supercontigs.fasta ${dir//_rf/} >> variantcallLoop.sh
done

JF SCRIPT:

for dir in */; do
  echo bash variantcall_new.sh Combined_supercontigs.fasta $dir >> variantcallLoop.sh
done

NOTE: This creates the correct script BUT it adds an extra "/" at the end of each line. Open the resulting shell script and remove the trailing "/" from each line before you run it (or see the variant just below, which strips it automatically).
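A small variant of the JF loop that strips the trailing "/" with bash parameter expansion, so no manual editing is needed - a sketch assuming the same directory layout:

for dir in */; do
  # ${dir%/} removes the trailing slash from the glob result
  echo bash variantcall_new.sh Combined_supercontigs.fasta "${dir%/}" >> variantcallLoop.sh
done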
STEP 4: Running GenotypesToPCA.sh

This script will:
- Create samples.list from the GVCF files (samples.list is used as the --variant input in step 2)
- Combine the GVCF files into a cohort and genotype them
- Filter SNPs to remove indels using the hard filter "QD < 5.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
- Select the SNP variants that pass the hard-filtering parameters

Command line: bash GenotypesToPCA.sh prefix.supercontigs.fasta species

Potentially useful outputs: "$prefix".SNPall.vcf (contains all SNPs and indels), "$prefix".snp.filtered.vcf (only SNPs, indels removed), "$prefix".snp.filtered.nocall.vcf (ONLY SNPs that pass the hard filter).

GenotypeToPCA.sh (from the github):

#!/bin/bash
## End of Select Variants for GATK4
set -eo pipefail

reference=$1  # Ppyriforme-3728.supercontigs.fasta
prefix=$2     # Physcomitrium-pyriforme

# Make samples list
ls */*-g.vcf > samples.list

# Combine and jointly call GVCFs
gatk CombineGVCFs -R $reference --variant samples.list --output "$prefix".cohort.g.vcf
gatk GenotypeGVCFs -R $reference -V "$prefix".cohort.g.vcf -O "$prefix".cohort.unfiltered.vcf

# Keep only SNPs passing a hard filter
time gatk SelectVariants -V "$prefix".cohort.unfiltered.vcf -R $reference -select-type-to-include SNP -O "$prefix".SNPall.vcf
time gatk VariantFiltration -R $reference -V "$prefix".SNPall.vcf --filter-name "hardfilter" -O "$prefix".snp.filtered.vcf --filter-expression "QD < 5.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
gatk SelectVariants -V "$prefix".snp.filtered.vcf -O "$prefix".snp.filtered.nocall.vcf --set-filtered-gt-to-nocall

GenotypeToPCA.sh (Jeremie modified) - SEE NOTES ON DEDUPE…

#!/bin/bash
## End of Select Variants for GATK4
set -eo pipefail

# Change the GATK folder based on where you put the combined supercontig
reference=/home/jfant/Brighamia_Sequences/GATK/$1  # Combined_supercontigs.fasta
# This prefix is how the files created will be labelled
prefix=$2  # Brighamia
gatkpath=/opt/Software/gatk-4.1.8.1/gatk
picardpath=/opt/Software/picard/build/libs/picard.jar

## Command line: bash GenotypesToPCA.sh Combined_supercontigs.fasta Brighamia

# Make samples list -- this goes into every folder and finds the -g.vcf file created in
# variantcall.sh. You can modify this to determine which samples you carry forward; if you
# do, change the samples.list name below. If you gave the files an "additional name" to
# distinguish runs, make sure you account for that (see below).
ls */*-g.vcf > samples.list
# Modified in KW to: ls */*-g.Purp.vcf > samplesCaPuAll.list
# You can modify the sample list if you want to subset the data you use.
# Combine and jointly call GVCFs
$gatkpath CombineGVCFs -R $reference --variant samples.list --output "$prefix".cohort.g.vcf
$gatkpath GenotypeGVCFs -R $reference -V "$prefix".cohort.g.vcf -O "$prefix".cohort.unfiltered.vcf

# Keep only SNPs passing a hard filter
time $gatkpath SelectVariants -V "$prefix".cohort.unfiltered.vcf -R $reference -select-type-to-include SNP -O "$prefix".SNPall.vcf
time $gatkpath VariantFiltration -R $reference -V "$prefix".SNPall.vcf --filter-name "hardfilter" -O "$prefix".snp.filtered.vcf --filter-expression "QD < 5.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
$gatkpath SelectVariants -V "$prefix".snp.filtered.vcf -O "$prefix".snp.filtered.nocall.vcf --set-filtered-gt-to-nocall

KW MODIFICATION:

# Combine and jointly call GVCFs
$gatkpath CombineGVCFs -R $reference --variant samplesCaInNoOut.list --output "$prefix".Ind.cohort.g.vcf
$gatkpath GenotypeGVCFs -R $reference -V "$prefix".Ind.cohort.g.vcf -O "$prefix".Ind.cohort.unfiltered.vcf

# Keep only SNPs passing a hard filter
time $gatkpath SelectVariants -V "$prefix".Ind.cohort.unfiltered.vcf -R $reference -select-type-to-include SNP -O "$prefix".Ind.SNPall.vcf
time $gatkpath VariantFiltration -R $reference -V "$prefix".Ind.SNPall.vcf --filter-name "hardfilter" -O "$prefix".Ind.snp.filtered.vcf --filter-expression "QD < 5.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
$gatkpath SelectVariants -V "$prefix".Ind.snp.filtered.vcf -O "$prefix".Ind.snp.filtered.nocall.vcf --set-filtered-gt-to-nocall

NOTE: I have altered the sample list to focus on subsets of the data. (Also remember to change the "prefix" so it does not overwrite files already made.)

Troubleshooting 4: Problems with the shell script not running

CAVEAT / OTHER NOTE - PROBLEM WITH SHELL SCRIPTS: In Notepad++, show hidden characters; if you see [CR][LF] at the end of lines you will need to find-and-replace to remove the [CR], otherwise the script does not run.
- Open the file in Notepad++.
- Go to Find & Replace and make sure that under Search Mode the Regular Expression option is selected.
- In "Find what" enter the regular expression "\r\n" and in "Replace with" enter "\n". CRLF will be replaced with a newline character.
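The same CRLF fix can be done directly on the server, without Notepad++ - a sketch using standard tools (dos2unix may need installing; the sed form works anywhere GNU sed is available; the script name is just an example):

# either:
dos2unix variantcall.sh
# or, equivalently, strip the carriage returns in place:
sed -i 's/\r$//' variantcall.sh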
---------------------------------------------------------------------

STEP 5: Running plink_stats.sh

Additional dependencies:
- bcftools: https://samtools.github.io/bcftools/
- plink: https://zzz.bwh.harvard.edu/plink/download.shtml

This script will:
- Set an ID name for each SNP (for filtering)
- Filter out SNPs that didn't pass the filter or have missing data
- Generate eigenvalues and loadings for the PCA axes (default set to 20)
- Generate basic statistics (heterozygosity, inbreeding coefficient, allele frequencies)

Command line: bash plink_stats.sh "$prefix"

plink_stats_case.sh (from the github):

#!/bin/bash
set -eo pipefail

# Script for generating statistics from the VCF output
prefix=$1

# Set an ID for each SNP so they can be filtered by name
bcftools annotate --set-id +"%CHROM:%POS" "$prefix".snp.filtered.nocall.vcf > "$prefix".snp.filtered.nodot.vcf

# Filter out SNPs that didn't pass the filter
#plink --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno --const-fid --out "$prefix"
# Alternate for deleting all SNPs with missing data:
plink --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno 0 --const-fid --out "$prefix"
plink --indep 50 5 2 --file "$prefix" --allow-extra-chr --out "$prefix"
plink --extract "$prefix".prune.in --out "$prefix"_pruned --file "$prefix" --make-bed --allow-extra-chr --recode

# Generate eigenvalues and loadings for 20 PCA axes
plink --pca 20 --file "$prefix"_pruned --allow-extra-chr --out "$prefix"

# Generate basic stats (heterozygosity, inbreeding coefficient, allele frequencies)
plink --freq --het 'small-sample' --ibc --file "$prefix"_pruned --allow-extra-chr -out "$prefix"_pruned

plink_stats_case.sh (Jeremie modified):

#!/bin/bash
set -eo pipefail

# Script for generating statistics from the VCF output
prefix=$1
gatkpath=/opt/Software/gatk-4.1.8.1/gatk
picardpath=/opt/Software/picard/build/libs/picard.jar
plinkpath=/opt/Software/plink_linux/plink
bcftoolspath=/opt/Software/bcftools/bcftools

## Command line: bash plink_stats.sh Brighamia

# Set an ID for each SNP so they can be filtered by name
$bcftoolspath annotate --set-id +"%CHROM:%POS" "$prefix".snp.filtered.nocall.vcf > "$prefix".snp.filtered.nodot.vcf

# Filter out SNPs that didn't pass the filter
#plink --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno --const-fid --out "$prefix"
# Alternate for deleting all SNPs with missing data:
$plinkpath --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno 0 --const-fid --out "$prefix"
$plinkpath --indep 50 5 2 --file "$prefix" --allow-extra-chr --out "$prefix"
$plinkpath --extract "$prefix".prune.in --out "$prefix"_pruned --file "$prefix" --make-bed --allow-extra-chr --recode

# Generate eigenvalues and loadings for 20 PCA axes
$plinkpath --pca 20 --file "$prefix"_pruned --allow-extra-chr --out "$prefix"

# Generate basic stats (heterozygosity, inbreeding coefficient, allele frequencies)
$plinkpath --freq --het 'small-sample' --ibc --file "$prefix"_pruned --allow-extra-chr -out "$prefix"_pruned

When you create a VCF file, here are some resources to help interpret it:
https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format
https://www.nature.com/articles/s41431-020-0574-3

Step 5 extras: Checking VCF files…

Refer to https://speciationgenomics.github.io/filtering_vcfs/

Make a directory and move to your vcf directory:

JF -
mkdir ./vcf
cd ~/vcf
Copy the data file from the data directory:

# cp ~/data/vcf/cichlid_full.vcf.gz* ./
JF - cp /home/jfant/Brighamia_sequences/Brighamia.SNPall.vcf ./vcf/
KW - cp /home/kwenzell/katie_sequences/trimmed/CASE.SNPall.vcf ./vcf/

JF: gzip -c Brighamia.snp.filtered.vcf > Brighamia.snp.filtered.vcf.gz
KW: gzip -c CASE.SNPall.vcf > CASE.SNPall.vcf.gz

How many unfiltered variants? One thing we haven't checked yet is how many variants we actually have:

# bcftools view -H cichlid_full.vcf.gz | wc -l

This command essentially prints every line of the VCF, so if your file is large it will take quite a long time to run. For this reason it might be best to run it inside a screen and check it again later.

JF - bcftools view -H Brighamia.SNPall.vcf.gz | wc -l
KW - bcftools view -H CASE.SNPall.vcf.gz | wc -l

Counts: 77,410 ish? (JF); 46,525 (JZ); 142,272 (KW - identical for the SNPall and snp.filtered files).

Randomly subsampling a VCF - How can we get an idea of how to set filters? We could just take the first 100,000 variants in our VCF and look at some basic statistics (minor allele frequency, depth of coverage and so on). But would this be accurate? What if there is some bias on the first chromosome in the genome and everything there mapped well? A far better idea is to randomly sample your VCF. Luckily there is a tool to do exactly this in the extremely useful vcflib pipeline, and using it is very simple. Here we use it to extract ~100,000 variants at random from the unfiltered VCF.

Needed to install the vcflib pipeline (# conda install -c conda-forge -c bioconda -c defaults vcflib):

# bcftools view cichlid_full.vcf.gz | vcfrandomsample -r 0.012 > cichlid_subset.vcf.gz

I only have 77,410 variants, so -r 0.012 (1.2%) would keep only ~930. Maybe focus on 25% (~19,350), OR keep all?

DID NOT DO -
JF - bcftools view Brighamia.snp.filtered.vcf.gz | vcfrandomsample -r 0.25 > Brighamia.subset.vcf.gz
KW - bcftools view CASE.SNPall.vcf.gz | vcfrandomsample -r 0.25 > CASE.subset.vcf.gz

Note that vcfrandomsample cannot handle an uncompressed VCF, so we first open the file with bcftools and then pipe it to the vcfrandomsample utility. We set only a single parameter, -r, the rate of sampling (so every run produces a different random subset). Before proceeding, we should compress and index our new subset VCF to make it easier to access:

# bgzip cichlid_subset.vcf
JF - bgzip Brighamia.subset.vcf
(or: bcftools view -Ob -o out.bcf <input>)

Index the VCF:
# bcftools index cichlid_subset.vcf.gz
JF - bcftools index Brighamia.SNPall.vcf.gz
(KW: bcftools view -Oz -o CASE.SNPall1.vcf.gz CASE.SNPall.vcf.gz)

Needed to download vcftools:
conda install -c bioconda vcftools

REPLACE:
SUBSET_VCF=~/vcf/cichlid_subset.vcf.gz   (JF: write out location ../Brighamia.SNPall.vcf.gz)
OUT=~/vcftools/cichlid_subset            (JF: write out location /vcftools/BrighamiaAll)
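If you just want all of the per-site and per-individual summary files in one go, here is an illustrative convenience loop; it simply runs the same vcftools calls that are listed individually in the next steps, using the JF file names:

VCF=./Brighamia.SNPall.vcf.gz
OUT=./vcftools/BrighamiaAll
for stat in "--freq2 --max-alleles 2" "--depth" "--site-mean-depth" "--site-quality" "--missing-indv" "--missing-site" "--het"; do
  # $stat is deliberately unquoted so multi-flag entries split into separate arguments
  vcftools --gzvcf $VCF $stat --out $OUT
done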
Calculate allele frequency: First we calculate the allele frequency for each variant. --freq2 outputs the frequencies without information about the alleles; --freq would return their identity. We add --max-alleles 2 to exclude sites that have more than two alleles.

# vcftools --gzvcf ./Brighamia.SNPall.vcf.gz --freq2 --out ./vcftools/BrighamiaAll --max-alleles 2

Checked Brighamia_subset.frq and it had a lot of low-frequency alleles (could be due to having the other Hawaiian lobeliads in the mix).

KW: vcftools --gzvcf ./CASE.SNPall.vcf.gz --freq2 --out ./CASEAll --max-alleles 2
  a) CASE: After filtering, kept 105754 out of a possible 142272 Sites

Calculate mean depth per individual: Next we calculate the mean depth of coverage per individual.

# vcftools --gzvcf ./Brighamia.SNPall.vcf.gz --depth --out ./vcftools/BrighamiaAll

Brighamia_subset.idepth - depths vary A LOT.

KW: vcftools --gzvcf ./CASE.SNPall.vcf.gz --depth --out ./CASEAll
  a) After filtering, kept 201 out of a possible 201 Sites

Calculate mean depth per site: Similarly, we also estimate the mean depth of coverage for each site.

# vcftools --gzvcf ./Brighamia.SNPall.vcf.gz --site-mean-depth --out ./vcftools/BrighamiaAll

Calculate site quality: We additionally extract the site quality score for each site.

# vcftools --gzvcf ./Brighamia.SNPall.vcf.gz --site-quality --out ./vcftools/BrighamiaAll

Calculate proportion of missing data per individual: Another individual-level statistic - the proportion of missing data per sample.

# vcftools --gzvcf ./Brighamia.SNPall.vcf.gz --missing-indv --out ./vcftools/BrighamiaAll

Calculate proportion of missing data per site: And more missing data, this time per site rather than per individual.

# vcftools --gzvcf ./Brighamia.SNPall.vcf.gz --missing-site --out ./vcftools/BrighamiaAll

Calculate heterozygosity and inbreeding coefficient per individual: Computing heterozygosity and the inbreeding coefficient (F) for each individual can quickly highlight outlier individuals. Note that here we assume Hardy-Weinberg equilibrium. If the individuals are not sampled from the same population, the expected heterozygosity will be overestimated due to the Wahlund effect. It may still be worth computing heterozygosities even when the samples come from more than one population, to check whether any individual stands out, which could indicate problems.
# vcftools --gzvcf ./Brighamia.SNPall.vcf.gz --het --out ./vcftools/BrighamiaAll

NOTE SEPT 25 2020 - got to the section where we need to move the data over to R in this github. Refer to https://speciationgenomics.github.io/filtering_vcfs/

R script to visualize the data:

## FROM THIS WEBSITE: https://speciationgenomics.github.io/filtering_vcfs/ ##
library(tidyverse)
setwd("C:/Users/jbfan/Dropbox/Fant/IMLS, Hawaiian sp and Amorphophallus/IMLS, Hawaiian sp and Amorphophallus/Lobeliod (HybSeq)/HybSeq/vcftools")

# Variant quality
var_qual <- read_delim("BrighamiaAll.lqual", delim = "\t",
  col_names = c("chr", "pos", "qual"), skip = 1)
a <- ggplot(var_qual, aes(qual)) + geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3)
a + theme_light()

# Variant mean depth
var_depth <- read_delim("BrighamiaAll.ldepth.mean", delim = "\t",
  col_names = c("chr", "pos", "mean_depth", "var_depth"), skip = 1)
a <- ggplot(var_depth, aes(mean_depth)) + geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3)
a + theme_light()
summary(var_depth$mean_depth)
a + theme_light() + xlim(0, 100)

# Variant missingness
var_miss <- read_delim("BrighamiaAll.lmiss", delim = "\t",
  col_names = c("chr", "pos", "nchr", "nfiltered", "nmiss", "fmiss"), skip = 1)
a <- ggplot(var_miss, aes(fmiss)) + geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3)
a + theme_light()
summary(var_miss$fmiss)

# Minor allele frequency
var_freq <- read_delim("BrighamiaAll.frq", delim = "\t",
  col_names = c("chr", "pos", "nalleles", "nchr", "a1", "a2"), skip = 1)
var_freq$maf <- var_freq %>% select(a1, a2) %>% apply(1, function(z) min(z))
a <- ggplot(var_freq, aes(maf)) + geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3)
a + theme_light()
summary(var_freq$maf)

# Mean depth per individual
ind_depth <- read_delim("BrighamiaAll.idepth", delim = "\t",
  col_names = c("ind", "nsites", "depth"), skip = 1)
a <- ggplot(ind_depth, aes(depth)) + geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3)
a + theme_light()

# Proportion of missing data per individual
ind_miss <- read_delim("BrighamiaAll.imiss", delim = "\t",
  col_names = c("ind", "ndata", "nfiltered", "nmiss", "fmiss"), skip = 1)
a <- ggplot(ind_miss, aes(fmiss)) + geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3)
a + theme_light()

# Heterozygosity and inbreeding coefficient per individual
ind_het <- read_delim("BrighamiaAll.het", delim = "\t",
  col_names = c("ind", "ho", "he", "nsites", "f"), skip = 1)
a <- ggplot(ind_het, aes(f)) + geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3)
a + theme_light()
## FROM THIS WEBSITE: https://speciationgenomics.github.io/filtering_vcfs/ ##

Applying filters to the VCF:

VCF_IN=~/vcf/cichlid_full.vcf.gz            (JF: write out location Brighamia.SNPall.vcf.gz)
VCF_OUT=~/vcf/cichlid_full_filtered.vcf.gz  (JF: Brighamia.SNPall.filtered.vcf.gz)

# set filters
MAF=0.1        (minor allele frequency)
MISS=0.9       (how much missing data are you willing to tolerate?)
QUAL=30        (PHRED score - 30 means the chance the base is called incorrectly is 1 in 1000)
MIN_DEPTH=10   (removes false positives)
MAX_DEPTH=50   (a maximum cut-off removes repetitive regions mapping to multiple parts of the genome)
vcftools --gzvcf $VCF_IN \
  --remove-indels --maf $MAF --max-missing $MISS --minQ $QUAL \
  --min-meanDP $MIN_DEPTH --max-meanDP $MAX_DEPTH \
  --minDP $MIN_DEPTH --maxDP $MAX_DEPTH --recode --stdout | gzip -c > \
  $VCF_OUT

JF:
vcftools --gzvcf Brighamia.SNPall.vcf.gz \
  --remove-indels --maf 0.1 --max-missing 0.9 --minQ 30 \
  --min-meanDP 5 --max-meanDP 200 \
  --recode --stdout | gzip -c > \
  Brighamia.SNPall.filtered.vcf.gz

(also tried:)
vcftools --gzvcf Brighamia.SNPall.vcf.gz \
  --remove-indels --maf 0.1 --max-missing 0.9 --minQ 30 \
  --minDP 10 --maxDP 50 \

KW:
vcftools --gzvcf CASESub.SNPall.vcf.gz \
  --remove-indels --maf 0.1 --max-missing 0.9 --minQ 30 \
  --min-meanDP 5 --max-meanDP 200 \
  --recode --stdout | gzip -c > \
  CASESub.SNPall.filtered.vcf.gz

## --minQ is about quality (the scores generally range from 2 to 40)
## --min-meanDP <float> and --max-meanDP <float> filter on the mean depth (over all included individuals) for each site
## --remove-indels - include or exclude sites that contain an indel; here "indel" means any variant that alters the length of the REF allele
## --maf <float> and --max-maf <float> - include only sites with a minor allele frequency greater than or equal to the --maf value

cat out.log
bcftools view -H Brighamia.SNPall.filtered.vcf.gz | wc -l

Filtered all my data away! :(

# FOR KW - filtered out the outgroups (note: vcftools --out takes a prefix, and --recode is needed to actually write a VCF):
vcftools --gzvcf ./CASE.SNPall.vcf.gz --remove-indv KW_Outgroup-indivisa_IFR-101_S6 --remove-indv KW_Outgroup-indivisa_IMT-105_S5 --remove-indv KW_Outgroup_2010-228_S1 --remove-indv KW_Outgroup_2010-232_S2 --remove-indv KW_Outgroup_2010-243_S3 --remove-indv KW_Outgroupi-indivisa_IMT-101_S4 --out ./CASEfilter.vcf.gz --max-alleles 2

Convert the VCF to a table:

/opt/Software/gatk-4.1.8.1/gatk VariantsToTable \
  -V Brighamia.SNPall.filtered.vcf \
  -F CHROM -F POS -F TYPE -GF AD \
  -O Brighamia.snp.filtered.output.table

KW:
/opt/Software/gatk-4.1.8.1/gatk VariantsToTable \
  -V CASE.SNPall.vcf \
  -F CHROM -F POS -F TYPE -GF AD \
  -O CASE.snp.filtered.output.table

KW:
/opt/Software/gatk-4.1.8.1/gatk VariantsToTable \
  -V CASEfilter.vcf \
  -F CHROM -F POS -F TYPE -GF AD \
  -O CASE.snp.filtered.output.table
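A quick way to eyeball the resulting table (columns are CHROM, POS, TYPE, then one <sample>.AD allele-depth column per individual); the file name follows the JF example above:

head -n 5 Brighamia.snp.filtered.output.table | column -t | less -S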
STEP 6: Use plink to clean the files…

(This step reruns plink_stats.sh from STEP 5 with an adjusted missing-data threshold; the unmodified github script is shown in STEP 5 above.)

JF - I had to allow 50% missing data (--geno 0.5) for it to run… this seems high.

#!/bin/bash
set -eo pipefail

# Script for generating statistics from the VCF output
prefix=$1
gatkpath=/opt/Software/gatk-4.1.8.1/gatk
picardpath=/opt/Software/picard/build/libs/picard.jar
plinkpath=/opt/Software/plink_linux/plink
bcftoolspath=/opt/Software/bcftools/bcftools

## Command line: bash plink_stats.sh Brighamia

# Set an ID for each SNP so they can be filtered by name
$bcftoolspath annotate --set-id +"%CHROM:%POS" "$prefix".snp.filtered.nocall.vcf > "$prefix".snp.filtered.nodot.vcf

# Filter out SNPs that didn't pass the filter
#plink --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno --const-fid --out "$prefix"
# Alternate allowing up to 50% missing data per SNP:
$plinkpath --vcf-filter --vcf "$prefix".snp.filtered.nodot.vcf --allow-extra-chr --recode --make-bed --geno 0.5 --const-fid --out "$prefix"
$plinkpath --indep 50 5 2 --file "$prefix" --allow-extra-chr --out "$prefix"
$plinkpath --extract "$prefix".prune.in --out "$prefix"_pruned --file "$prefix" --make-bed --allow-extra-chr --recode

# Generate eigenvalues and loadings for 20 PCA axes
$plinkpath --pca 20 --file "$prefix"_pruned --allow-extra-chr --out "$prefix"

# Generate basic stats (heterozygosity, inbreeding coefficient, allele frequencies)
$plinkpath --freq --het 'small-sample' --ibc --file "$prefix"_pruned --allow-extra-chr -out "$prefix"_pruned

Files generated (options of files: https://www.cog-genomics.org/plink/1.9/formats#bim):

.bed - PLINK binary biallelic genotype table. Primary representation of genotype calls at biallelic variants.

.bim - PLINK extended map file. Extended variant information file accompanying a .bed binary genotype table.
A text file with no header line, and one line per variant with the following six fields:
1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
2. Variant identifier
3. Position in morgans or centimorgans (safe to use dummy value of '0')
4. Base-pair coordinate (1-based; limited to 2^31 - 2)
5. Allele 1 (corresponding to clear bits in .bed; usually minor)
6. Allele 2 (corresponding to set bits in .bed; usually major)

.eigenvec, .eigenvec.var, .eigenval - principal components, used in PCA. The .eigenval file has one eigenvalue per line.

.fam - PLINK sample information file accompanying a .bed binary genotype table. A text file with no header line, and one line per sample with the following six fields:
1. Family ID ('FID')
2. Within-family ID ('IID'; cannot be '0')
3. Within-family ID of father ('0' if father isn't in dataset)
4. Within-family ID of mother ('0' if mother isn't in dataset)
5. Sex code ('1' = male, '2' = female, '0' = unknown)
6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

.frq - basic allele frequency report. A text file with a header line, and one line per variant with the following six fields:
- CHR: chromosome code
- SNP: variant identifier
- A1: allele 1 (usually minor)
- A2: allele 2 (usually major)
- MAF: allele 1 frequency
- NCHROBS: number of allele observations

.het - method-of-moments F coefficient estimates. A text file with a header line, and one line per sample with the following six fields:
- FID: family ID
- IID: within-family ID
- O(HOM): observed number of homozygotes
- E(HOM): expected number of homozygotes
- N(NM): number of non-missing autosomal genotypes
- F: method-of-moments F coefficient estimate

.ibc - GCTA inbreeding coefficient report. A text file with a header line, and one line per sample with the following six fields:
- FID: family ID
- IID: within-family ID
- NOMISS: number of nonmissing genotype calls
- Fhat1: variance-standardized relationship minus 1
- Fhat2: excess homozygosity-based inbreeding estimate (same as PLINK --het)
- Fhat3: estimate based on correlation between uniting gametes

.imiss - sample-based missing data report. A text file with a header line, and one line per sample with the following six fields:
- FID: family ID
- IID: within-family ID
- MISS_PHENO: phenotype missing? (Y/N)
- N_MISS: number of missing genotype calls, not including obligatory missings or het. haploids
- N_GENO: number of potentially valid calls
- F_MISS: missing call rate

.lmiss - variant-based missing data report. A text file with a header line, and K line(s) per variant with the following 5-7 fields (where K is the number of clusters if --within/--family was specified, or 1 if it wasn't):
- CHR: chromosome code
- SNP: variant identifier
- CLST: cluster identifier (only present with --within/--family)
- N_MISS: number of missing genotype calls, not counting obligatory missings or het. haploids
- N_CLST: cluster size, does not include nonmales on chrY (only present with --within/--family)
- N_GENO: number of potentially valid calls
- F_MISS: missing call rate

.map - PLINK text fileset variant information file. A text file with no header line, and one line per variant with the following 3-4 fields:
1. Chromosome code (PLINK 1.9 also permits contig names here, but most older programs do not)
2. Variant identifier
3. Position in morgans or centimorgans (optional; safe to use dummy value of '0')
4. Base-pair coordinate

.ped - PLINK/MERLIN/Haploview text pedigree + genotype table. Original standard text format for sample pedigree information and genotype calls.
Normally must be accompanied by a .map file; Haploview requires an accompanying .info file instead. Loaded with --file, and produced by --recode.

STEP 7: Running Structure

To begin we need a Structure file. Although plink will make a Structure file, it is not always correctly formatted. The easiest approach is to take one of the VCF files you cleaned and use the PGDSpider program to convert it to Structure format:
- Download the VCF file.
- Open PGDSpider.exe.
- Input file - choose this file; make sure it has "vcf" format.
- Output file - choose Structure format and the location of the file, and give a name in the text box (you need to select it to specify - it's a slightly odd way of doing it).
- It will ask you about conditions - you can accept the defaults to start.
- Run… - details on the conversion will be in the text output.

Run in StrAuto:
- Add the file suffix (.str).
- Go to the StrAuto folder (/opt/Software/strauto).
- Move the old files to a new folder.
- Download input.py.
- Read the questions and answer #1 (no pop), #4 (file name), #6 (no indiv), #7 (loci), #10 (data in two lines), #17 (list of markers).
- Change Max Pop.
- Change the number of individuals and the number of alleles.
- Re-upload the altered input.py along with the new file (*.str).
- Run ./strauto_1.py
- Check the parameters - if OK press "a"; if wrong press "q" and fix input.py.
- Once happy, run ./runstructure

STEP 8: Using WhatsHap to look at phasing (heterozygosity)

All the data we currently have are haploid (single-copy consensus). We may want to know the heterozygosity of the samples.

Some of the programs needed to proceed:
- Haplonerate: https://github.com/mossmatters/phyloscripts/tree/master/haplonerate
- WhatsHap: http://whatshap.readthedocs.io
- BioPython package: https://biopython.org/
- Python 3.0: https://www.python.org/download/releases/3.0/
- GNU Parallel: https://www.gnu.org/software/parallel/

My Python was too old, so I needed to update it:
conda update --all
conda update conda
conda update python
conda install whatshap

CODE in extract_phase_subgenome.sh

Tried a single individual:

cp ./GATK/Combined_supercontigs.fasta ./BRIN-005/
cd ./BRIN-005/
samtools faidx Combined_supercontigs.fasta
whatshap phase -o BRIN-005.phased.vcf --reference=Combined_supercontigs.fasta BRIN-005-g.vcf BRIN-005.marked.bam
whatshap stats BRIN-005.phased.vcf --gtf BRIN-005.whatshap.gtf --tsv BRIN-005.whatshap.stats.txt

So I checked the stats file and there does not seem to be any variation? Could that be true?
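One way to sanity-check the "no variation" observation above is to count phased sites directly in the VCF; bcftools view -p restricts output to sites where the genotypes are phased. A rough check, using the same file names:

# total non-header records vs. records where the genotypes are phased
bcftools view -H BRIN-005.phased.vcf | wc -l
bcftools view -H -p BRIN-005.phased.vcf | wc -l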
bgzip -c BRIN-005.phased.vcf > BRIN-005.phased.vcf.gz
tabix BRIN-005.phased.vcf.gz

samtools faidx Combined_supercontigs.fasta BRIN-139-XAEC_scaffold_XAEC_2000468 | bcftools consensus -H XAEC_scaffold_XAEC_2000468 BRIN-005.phased.vcf.gz > phased_bcftools/BRIN-139-XAEC_scaffold_XAEC_2000468.phased.1.fasta

Error: Can't apply 40-th haplotype at BRIN-139-XAEC_scaffold_XAEC_2000468:1
(-H should be given the haplotype number, 1 or 2, as in the commands below.)

SCRIPT I RAN LINE BY LINE within the individual folder (BRIN-005):

whatshap phase -o BRIN-005.phased.vcf --reference=Combined_supercontigs.fasta BRIN-005-g.vcf BRIN-005.marked.bam
whatshap stats BRIN-005.phased.vcf --gtf BRIN-005.whatshap.gtf --tsv BRIN-005.whatshap.stats.txt
bgzip -c BRIN-005.phased.vcf > BRIN-005.phased.vcf.gz
tabix BRIN-005.phased.vcf.gz

# Extract two fasta sequences for each gene, corresponding to the phase
mkdir -p phased_bcftools
rm phased_bcftools/*
parallel "samtools faidx Combined_supercontigs.fasta BRIN-139-{1} | bcftools consensus -H 1 BRIN-005.phased.vcf.gz > phased_bcftools/BRIN-005-{1}.phased.1.fasta" :::: /home/jfant/Brighamia_Sequences/Brig-genelist.txt
parallel "samtools faidx Combined_supercontigs.fasta BRIN-139-{1} | bcftools consensus -H 2 BRIN-005.phased.vcf.gz > phased_bcftools/BRIN-005-{1}.phased.2.fasta" :::: /home/jfant/Brighamia_Sequences/Brig-genelist.txt

FASTQ files explained

Illumina sequencing technology uses cluster generation and sequencing-by-synthesis (SBS) chemistry to sequence millions or billions of clusters on a flow cell, depending on the sequencing platform. When sequencing completes, the base calls in the BCL files must be converted into sequence data; this process is called BCL-to-FASTQ conversion. A FASTQ file is a text file that contains the sequence data from the clusters that pass filter on a flow cell.

Demultiplexing assigns clusters to a sample based on the cluster's index sequence(s). If samples were not multiplexed, the demultiplexing step does not occur and, for each flow-cell lane, all clusters are assigned to a single sample. For a single-read run, one Read 1 (R1) FASTQ file is created for each sample per flow-cell lane. For a paired-end run, one R1 and one Read 2 (R2) FASTQ file are created for each sample per lane. FASTQ files are compressed and created with the extension *.fastq.gz.

What does a FASTQ file look like? For each cluster that passes filter, a single sequence is written to the corresponding sample's R1 FASTQ file and, for a paired-end run, a single sequence is also written to the sample's R2 FASTQ file. Each entry in a FASTQ file consists of 4 lines:
1. A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary based on the BCL-to-FASTQ conversion software used:
   @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
2. The sequence (the base calls: A, C, T, G and N).
3. A separator, which is simply a plus (+) sign.
4. The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.
Here is an example of a single entry in an R1 FASTQ file (the original note only repeated the identifier template; this entry is illustrative):

@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

Generally it is not necessary to view FASTQ files, because they are intermediate output files used as input for tools that perform downstream analysis, such as alignment to a reference or de novo assembly. FASTQ files can contain up to millions of entries and can be several megabytes or gigabytes in size, which often makes them too large to open in a normal text editor. If you need to view a FASTQ file for troubleshooting purposes or out of curiosity, you will need either a text editor that can handle very large files, or access to a Unix or Linux system where large files can be viewed via the command line.

sed -ri "s/>[0-9]+ ([A-Z0-9]+-[A-Z0-9]+):[0-9]+-[0-9]+/\>\1/" Combined_supercontigs.fasta
samtools faidx Combined_supercontigs.fasta

RUNNING FASTSTRUCTURE (http://rajanil.github.io/fastStructure/)

Main options: The key options to pass to the scripts are the input file, the output file and the number of populations. Assuming the input file is named genotypes.bed (with corresponding genotypes.fam and genotypes.bim), the output file is named genotypes_output and the number of populations you would like is 3, you can run the algorithm as follows:

python structure.py -K 3 --input=genotypes --output=genotypes_output

This generates a genotypes_output.3.log file that tracks how the algorithm proceeds, and files genotypes_output.3.meanQ and genotypes_output.3.meanP containing the posterior mean of admixture proportions and allele frequencies, respectively. The order of samples and SNPs in the output files matches that in the .fam file and .bim file, respectively. Note that input file names should not include suffixes (e.g. .bed) and are relative to the main project directory (unless a full path is provided).

Input data format: The current implementation can import data from plink bed format and the original Structure format. If the data are in plink format, ensure that the bed, bim and fam files for the dataset are all present in the same path. While the original Structure program allowed a more flexible input format, fastStructure expects a more specific Structure-like input format. Specifically, rows in the data file correspond to samples, with two rows per sample (note that only diploids are handled by this software), and columns correspond to SNPs. The first 6 columns of the file will be ignored; these typically include IDs, metadata, etc. This software only handles bi-allelic loci. The two alleles at each locus can be encoded as desired; however, missing data should be encoded as '-9'.

Making a PCA: https://speciationgenomics.github.io/pca/

# make a plink directory
mkdir plink
cd plink

Note: VCF=~/vcf/cichlid_full_filtered_rename.vcf.gz = /home/jfant/Brighamia_Sequences/Brighamia.cohort.unfiltered.vcf
# perform linkage pruning - i.e. identify sites to prune
plink --vcf $VCF --double-id --allow-extra-chr \
  --set-missing-var-ids @:# \
  --indep-pairwise 50 10 0.1 --out cichlids

# JF modifications
plink --vcf /home/jfant/Brighamia_Sequences/Brighamia.cohort.unfiltered.vcf --double-id --allow-extra-chr \
  --set-missing-var-ids @:# \
  --indep-pairwise 50 10 0.1 --out Brigh

# KW modifications - outgroup samples are excluded with --remove; the samples listed in filter.txt:
#   KW_Outgroup-indivisa_IFR-101_S6, KW_Outgroup-indivisa_IMT-105_S5, KW_Outgroup_2010-228_S1,
#   KW_Outgroup_2010-232_S2, KW_Outgroup_2010-243_S3, KW_Outgroupi-indivisa_IMT-101_S4
plink --vcf /home/kwenzell/katie_sequences/trimmed/CASESub.cohort.unfiltered.vcf --double-id --allow-extra-chr \
  --remove filter.txt \
  --set-missing-var-ids @:# \
  --indep-pairwise 50 10 0.1 --out CASESub

So for our plink command, we did the following:
- --double-id - told plink to duplicate the ID of our samples (plink typically expects a family and an individual ID, i.e. for pedigree data; this is not necessary for us).
- --allow-extra-chr - allow additional chromosomes beyond the human chromosome set. This is necessary because plink otherwise expects chromosomes 1-22 and the human X chromosome.
- --set-missing-var-ids - also necessary to set a variant ID for our SNPs. Humans and model organisms often have annotated SNP names, and plink will look for these. We do not have them, so we set ours to default to chromosome:position, which is achieved in plink with @:# - see the plink docs for more info.
- --indep-pairwise - the command that actually performs the linkage pruning. The first argument, 50, sets the window size (in counts of variants unless the 'kb' modifier is given; the tutorial describes it as a 50 Kb window). The second argument, 10, is the window step size, i.e. how far the window moves each time linkage is calculated. Finally we set the r² threshold - the level of linkage we are willing to tolerate; here we prune any variant showing an r² greater than 0.1.
- --out - the prefix for the output data.

# prune and create pca
plink --vcf $VCF --double-id --allow-extra-chr --set-missing-var-ids @:# \
  --extract cichlids.prune.in \
  --make-bed --pca --out cichlids

# JF modifications
plink --vcf /home/jfant/Brighamia_Sequences/Brighamia.cohort.unfiltered.vcf \
  --double-id --allow-extra-chr --set-missing-var-ids @:# \
  --extract Brigh.prune.in \
  --make-bed --pca --out Brigh

# KW modifications
plink --vcf /home/kwenzell/katie_sequences/trimmed/Plink_CaSe/CaSe.cohort.unfiltered.vcf \
  --double-id --allow-extra-chr --set-missing-var-ids @:# \
  --extract CaSe.prune.in \
  --make-bed --pca --out CaSe

plink --vcf /home/kwenzell/katie_sequences/trimmed/Plink_Ind/CaInd.cohort.unfiltered.vcf \
  --double-id --allow-extra-chr --set-missing-var-ids @:# \
  --extract CaInd.prune.in \
  --make-bed --pca --out CaInd

plink --vcf /home/kwenzell/katie_sequences/trimmed/Plink_CaPu/CaPu.cohort.unfiltered.vcf \
  --double-id --allow-extra-chr --set-missing-var-ids @:# \
  --extract CaPu.prune.in \
  --make-bed --pca --out CaPu

plink --vcf /home/kwenzell/katie_sequences/trimmed/Plink_Ind/CaInd.snp.filtered.vcf --double-id --allow-extra-chr \
  --set-missing-var-ids @:# \
  --indep-pairwise 50 10 0.1 --out CaIndiv

plink --vcf /home/kwenzell/katie_sequences/trimmed/Plink_Ind/CaInd.snp.filtered.vcf \
  --double-id --allow-extra-chr --set-missing-var-ids @:# \
  --extract CaIndiv.prune.in \
  --make-bed --pca --out CaIndiv
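The PCA output lands in Brigh.eigenvec / Brigh.eigenval (and similarly for the other prefixes). A quick, illustrative way to see how much variance each axis explains before plotting:

# convert the .eigenval file to percent variance explained per PC
awk '{sum+=$1; v[NR]=$1} END {for(i=1;i<=NR;i++) printf "PC%d\t%.1f%%\n", i, 100*v[i]/sum}' Brigh.eigenval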