PacBio Revio: Phase Set Haplotype - final

## Final methods #### All R code referenced is in /Users/sarah/Duke Bio_Ea Dropbox/Sarah Marion/Cluster and Genome Sequence Alignment/VCFs/iso_1/Wild_Haplotype_Clean.R ###### tags: `Pacbio Data` `Haplotype methods` ### 1. Just use Deep Variant SNP file (not trgt or pbsv files). Use bcftools to create a file of the phase sets (PS) of the SNPs specific to CyO `bcftools query -f '%CHROM\t%POS\t%ALT\t[%PS]\t[%GT]\n' L59_dv_phased_chr2.vcf.gz | grep -Ff CyO_snps.txt > L59_phase_sets_CyO_variants.txt` This should output the chromosome, position, phase set, alternate allele of ALL exact (exact match becuase of -Ff) matching SNPs in the PacBio data For example, if both of the two pretend variants above also appear in the phased PacBio data, the output will look something like this: pretend "L208_phase_sets_CyO_variants.txt" 2R 1000 1 0|1 G 2R 2000 50 1|0 A This would show that in phase set 1, the CyO haplotype is likely the second haplotype, but in phase set 50, the CyO haplotype is likely the first haplotype. ### 2. Transfer output file, L208_phase_sets_CyO_variants.txt, to R. In R, determine how many genotypes of CyO variants are present per phase set. Narrow down to phased genotypes that are heterozygous with reference allele. Determine the ratio of one genotype to the other. To be confident which haplotype is CyO, one phased genotype needs to be 100x+ more common than the other (e.g. CyO variants as 0|1 is 500x more common than CyO variatns as 1|0). Using this, generate 4 txt files: a) Matching unphased CyO variants (L${LL}_unphased_variants.txt) b) Matching phased CyO variants, in the majority haplotype of phase sets whose haplotype ratios are >100 (L${LL}_phased_variants.txt) c) Matching phased CyO variants, in the minority haplotype of phase sets whose haplotype ratios are <100. This includes variants that are 2|1, 1|2, 3|2, etc. - if the CyO variant. (L${LL}_minor_variants.txt) d) Matching phased CyO variants, that are present in phase sets where the majority haplotype is LESS than 100x more common (i.e. can’t be confident which haplotype is CyO vs wild. (L${LL}_low_ratio_phase_sets.txt) ### 3. Remove unphased CyO variants (0/1) from the vcf (a. L${LL}_unphased_variants.txt) bcftools view -T ^L274_unphased_variants.txt -o L274_wo_unphased_CyOmatches.vcf -O v L274_dv_phased_chr2.vcf ### 4. 7.Use the list of phased CyO variants (b. L${LL}_phased_variants.txt) present in the PacBio data and generate a list of all of the variants that are in the same PS and haplotype and genotype as the CyO variant: `cut -f1,4,5 L274_phased_variants.txt > L274_phased_variants_formatted.txt` `bcftools query -f '%CHROM\t[%GT]\t[%PS]\t%POS\t%ALT\n' L274_wo_unphased_CyOmatches.vcf| grep -Ff L274_phased_variants_formatted.txt > L274_all_phased_variants_to_remove.txt` *cut -f1,4,5 extracts only CHROM(1), GT(4), and PS (5) (fields needed for matching) from the flagged variants file.* Let's pretend that LethalLine_pacbio_merged.vcf has the following variants: a. 2R 1000 1 0|1 G b. 2R 1500 1 0|1 AAGTG c. 2R 2000 50 1|0 A d. 2R 2000 50 0|1 AAGTG e. 2R 2000 50 2|1 CCAAGGAAGG "a" and "c" are in the list (flagged_phase_sets_CyO_variants.txt) already so they of course match Chrom, PS and GT adn will be added to the "to_remove.txt" file. "b" matches chrom (2R),PS (1) and GT (0|1) of "a" (which is in the list), so "b" will need to be removed and added to "to_remove.txt". This is because variant 'b' is in the same haplotype as 'a' which we know is part of CyO, so 'b' is also part of CyO. 'd' will not be removed because it doesn't match CHROM, PS and GT of either variant in "flagged_phase_sets_CyO_variants.txt" ### 5.Remove all variants (heterozygous with the reference genome) in phase with CyO variants from the VCF that already has unphased CyO variants removed #### a. Format the txt file of all phase variants to only contain CHROM, POS, ALT `cut -f1,4,5 L274_all_phased_variants_to_remove.txt > L274_all_phased_variants_to_remove_formatted.txt` #### b. Remove all variants that are heterozygous with reference and in phase with CyO alleles from vcf file. `bcftools view -T ^L274_all_phased_variants_to_remove_formatted.txt L274_wo_unphased_CyOmatches.vcf -o L274_CyO_01_variants_removed.vcf -O v` ### 6. You need to make one more list (the fifth one): this will be a list of all heterozygous variants present in phase sets that CyO alleles (from the Illumina data) aren’t in (we can’t call which haplotype is which for these phase sets so we should mark all variants in them as unknown. (L${LL}_variants_not_in_CyO_non_homozygous.txt) #### a) Extract a list of all phase sets from the OG deep variant VCF file ``` bcftools query -f '%CHROM\t%POS\t%ALT\t[%PS]\t[%GT]\n' L274_dv_phased_chr2.vcf.gz | \ cut -f 4 | sort | uniq > L274_all_phase_sets.txt ``` #### b) Find phase sets not present in L${LL}_phase_sets_CyO_variants.txt (generated in previous step) ``` awk '{print $4}' L274_phase_sets_CyO_variants.txt > L274_phase_sets_CyO_variants_ps.txt grep -Fv -f L274_phase_sets_CyO_variants_ps.txt L274_all_phase_sets.txt > L274_phase_sets_not_in_CyO.txt ``` #### c) c) Make a list of ALL variants in pacbio data (latest vcf) that are present in phase sets that CyO Illumina variants are not in. ``` bcftools query -f '%CHROM\t%POS\t%ALT\t[%PS]\t[%GT]\n' L274_CyO_01_variants_removed.vcf | \ grep -Ff L274_phase_sets_not_in_CyO.txt | \ grep -vE '([0-9]+)/\1' > L274_variants_not_in_CyO_non_homozygous.txt ``` ### 7. Make a list of all variants present in the PacBio data in the low ratio phase sets (identified in 5d). `cut -f1,2 L274_low_ratio_phase_sets.txt > L274_low_ratio_phase_sets_formatted.txt` ``` bcftools query -f '%CHROM\t[%PS]\t%POS\t%ALT\t[%GT]\n' L274_CyO_01_variants_removed.vcf > L274_all_variants_with_PS.txt awk -F '\t' '$5 !~ /^(1[\/|]1)$/ {print}' L274_all_variants_with_PS.txt > L274_no11_variants_with_PS.txt awk -F '\t' '$5 !~ /^(2[\/|]2)$/ {print}' L274_no11_variants_with_PS.txt > L274_heterozygous_variants_with_PS.txt awk 'NR==FNR {keys[$1"\t"$2]; next} ($1"\t"$2) in keys' L274_low_ratio_phase_sets_formatted.txt L274_heterozygous_variants_with_PS.txt > L274_LR_PS_variants.txt ``` ### 8. Now we have 3 lists of variants that we need to mark as ./. in our vcf (we don’t know they are reference in our wild chromosome, but we also don’t know if they are in CyO or not). - L${LL}_minor_variants_all.txt (CHROM, POS, PS, GT) - L${LL}_variants_not_in_CyO_non_homozygous.txt (chrom, POS, alt, PS, GT) - L${LL}_LR_PS_variants.txt (chrom, PS, POS, alt) #### a. Cut all into just a list of positions and merge all three lists into one L${LL}_variants_to_makemissing.txt ``` cut -f1,2 L274_minor_variants_all.txt > L274_minor_variants_all_POS.txt cut -f1,2 L274_variants_not_in_CyO_non_homozygous.txt > L274_variants_not_in_CyO_non_homozygous_POS.txt cut -f1,3 L274_LR_PS_variants.txt > L274_LR_PS_variants_POS.txt cat L274_minor_variants_all_POS.txt L274_variants_not_in_CyO_non_homozygous_POS.txt L274_LR_PS_variants_POS.txt > L274_variants_to_makemissing_all.txt ``` #### b. Subset VCF for just these positions `bcftools view -T L274_variants_to_makemissing_all.txt L274_CyO_01_variants_removed.vcf -O v -o L274_CyO_01_variants_removed_SUBSET.vcf ` #### c. Use +setGT to change all of these genotypes to ./. bcftools +setGT L274_CyO_01_variants_removed_SUBSET.vcf -- -t a -n ./. > L274_CyO_01_variants_removed_SUBSET_missing.vcf #### d. Now remove those rows from the main vcf file and merge the rows with the missing genotypes back in their place. bcftools view -T ^L274_variants_to_makemissing_all.txt -O v -o L274_CyO_01_variants_removed_NeedMissingAdded.vcf L274_CyO_01_variants_removed.vcf bcftools concat -a -O v -o L274_CyO_01_variants_removed_clean.vcf L274_CyO_01_variants_removed_NeedMissingAdded.vcf.gz L274_CyO_01_variants_removed_SUBSET_missing.vcf.gz ### 9. Set any unphased heterozygous (not missing and not homozygous) variants to be missing (./.) `bcftools +setGT L274_CyO_01_variants_removed_clean.vcf.gz -O z -o L274_CyO_01_variants_removed_cleaner.vcf.gz -- -t q -i 'GT="0/1" || GT="1/0" || GT="1/2" || GT="2/1" || GT="2/3" || GT="3/2" || GT="0/2" || GT="0/3"' -n './.'` ### 10. Input into R adn make a list of multi-alt sites that are phased into homozygous for the allele inphase with CyO majority haplotype (list will be CHROM, POS, GT_new) ##### Split into text files based on GT_new *all positions that need a new GT*: L274_newGT_toremove.txt *positions that need to become 0/0*: L274_new_GT_00.txt *positions that need to become 1/1*: L274_new_GT_11.txt *positions that need to become 2/2*: L274_new_GT_22.txt *positions that need to become 3/3*: L274_new_GT_33.txt ## 11. L274_CyO_01_variants_removed_cleaner.vcf ### a. Subset VCF for just positions in the 4 different class of new GTs needed #### 0/0 `bcftools view -T L274_new_GT_00.txt L274_CyO_01_variants_removed_cleaner.vcf -O v -o L274_CyO_01_variants_removed_cleaner_SUBSET00.vcf` #### 1/1 `bcftools view -T L274_new_GT_11.txt L274_CyO_01_variants_removed_cleaner.vcf -O v -o L274_CyO_01_variants_removed_cleaner_SUBSET11.vcf` #### 2/2 `bcftools view -T L274_new_GT_22.txt L274_CyO_01_variants_removed_cleaner.vcf -O v -o L274_CyO_01_variants_removed_cleaner_SUBSET22.vcf` #### 3/3 `bcftools view -T L274_new_GT_33.txt L274_CyO_01_variants_removed_cleaner.vcf -O v -o L274_CyO_01_variants_removed_cleaner_SUBSET33.vcf` ### b. Remove the total list of positions that need new GTs from original vcf `bcftools view -T ^L274_newGT_toremove.txt -O v -o L274_CyO_01_variants_removed_cleaner_NeedMissingAdded.vcf L274_CyO_01_variants_removed_cleaner.vcf ` ### c. Use +setGT to change all of these genotypes to the correct GT depending on set #### 0/0 sed -E 's#(0\|2|2\|0|0\|3|3\|0)#0/0#g' L274_CyO_01_variants_removed_cleaner_SUBSET00.vcf > L274_CyO_01_variants_removed_cleaner_SUBSET00_COMPLETE.vcf #### 1/1 sed -E 's#(1\|2|2\|1|1\|3|3\|1)#1/1#g' L274_CyO_01_variants_removed_cleaner_SUBSET11.vcf > L274_CyO_01_variants_removed_cleaner_SUBSET11_COMPLETE.vcf #### 2/2 sed -E 's#(0\|2|2\|0|1\|2|2\|1|2\|3|3\|2)#2/2#g' L274_CyO_01_variants_removed_cleaner_SUBSET22.vcf > L274_CyO_01_variants_removed_cleaner_SUBSET22_COMPLETE.vcf #### 3/3 sed -E 's#(0\|3|3\|0|1\|3|3\|1|2\|3|3\|2)#3/3#g' L274_CyO_01_variants_removed_cleaner_SUBSET33.vcf > L274_CyO_01_variants_removed_cleaner_SUBSET33_COMPLETE.vcf ### d. Now merge the rows with the missing genotypes back in their place. ``` bcftools concat -a -O v -o L274_almost.vcf \ L274_CyO_01_variants_removed_cleaner_NeedMissingAdded.vcf.gz \ L274_CyO_01_variants_removed_cleaner_SUBSET00_COMPLETE.vcf.gz \ L274_CyO_01_variants_removed_cleaner_SUBSET11_COMPLETE.vcf.gz \ L274_CyO_01_variants_removed_cleaner_SUBSET22_COMPLETE.vcf.gz \ L274_CyO_01_variants_removed_cleaner_SUBSET33_COMPLETE.vcf.gz ``` ## 12. Turn all 0/1 (or 1|0, 0|1) variants to be 1\1 sed -E 's#(1/0|0/1|0\|1|1\|0)#1/1#g' L274_almost.vcf > L274_hap_final.vcf bcftools query -f '[%GT]\n' L274_hap_final.vcf | sort | uniq WOOHOO I DID IT!!!!!! __________________________________________ ### Below is old code (in R) ### I don't think I need it anymore but saving here in case ## The following code removes duplicate positions from the vcf L208_vcf <- read.vcfR("L95_no_duplicates.vcf") # Extract VCF information (fixed fields) L208_vcf_info <- L208_vcf@fix # Convert the VCF data into a data frame vcf_df <- as.data.frame(L208_vcf_info) # Remove duplicate rows based on chromosome and position (keeping only one per position) vcf_df_filtered_distinct <- vcf_df %>% group_by(CHROM, POS) %>% filter(n() == 1) %>% ungroup() # Create a text file with CHROM and POS vcf_kp_list <- vcf_df_filtered_distinct[, c("CHROM", "POS")] # Write it to a text file (one variant per line, tab-separated) write.table(vcf_kp_list, file = "L95_double_variants_to_keep.txt", quote = FALSE, col.names = FALSE, row.names = FALSE, sep = "\t") ## Transfer L208_double_variants_to_keep.txt back to terminal and use this to clean up vcf and create a de-duplicated VCF file. ## In the terminal, using the de-duplicated VCF, create a list of CyO variants and their GT and PS in the PacBio data and # transfer it to R (below)