## How to compare phased VCF files (from hiphase) and pull out CyO haplotype ## CyO specific SNPs from Iso1/CyO Illumina data I have a .txt file that has the Chr, position and variants that are specific to CyO pretend "CyO_snps.txt" 2R 1000 G 2R 2000 A etc ### 1. Use bcftools to create a file of the phase sets (PS) of the SNPs specific to CyO `bcftools query -f '%CHROM\t%POS\t%ALT\t[%PS]\t[%GT]\n' L208_phased_merged_label_chr2.ann.vcf | grep -Ff CyO_snps.txt > L208_phase_sets_CyO_variants.txt` This should output the chromosome, position, phase set,alternate allele of ALL exact (exact match becuase of -Ff) matching SNPs in the PacBio data For example, if both of the two pretend variants above also appear in the phased PacBio data, the output will look something like this: pretend "flagged_phase_sets_CyO_variants.txt" 2R 1000 1 0|1 G 2R 2000 50 1|0 A This would show that in phase set 1, the CyO haplotype is likely the second haplotype, but in phase set 50, the CyO haplotype is likely the first haplotype. ### 2. Transfer this output file to process in R. #### In R, summarize per phase set, how many of each phased genotype there are. There should be a clear majority (>100x is the cutoff) of one phased genotype over the other. This code results in two lists (unphased_variants.txt and phased_variants.txt) Unphased_variants contains all CyO variants that are not part of a phase set but are heterozygous for the CyO variant and the reference in the PacBio data. These will be removed directly from the PacBio data in step 3. The phased data will represent all phased CyO variants that are heterozygous for the reference allele in the PacBio data. #### Two separate lists are made (that are not present in the phased variatns list). Both of these lists need to be saved and be removed and marked as 'unknown' in the final phased haplotype. #### a) CyO variants that are not in agreement with the majority phase haplotype per phase set. The minority haplotype CyO variants are output as a separate list named: L${LL}_minor_variants.txt (This includes variants that are 2|1, 1|2, 3|2, etc - if the CyO variant (whichever it is, is in disagreemnet with the majority CyO phase, then these variatns should be marked as unknown)) #### b) Variatns in phase sets that the ratio of one genotype vs the other is less than 100. This list is named L${LL}_low_ratio_phase_sets.txt ``` library(dplyr) # Read the data variants <- read.table("L208_phase_sets_CyO_variants.txt", header = FALSE, sep = "\t", col.names = c("CHROM", "POS", "ALT", "PS", "GT")) filtered_variants <- variants %>% filter(!GT %in% c("1/1", "1/2", "1|2", "2|1")) # Filter by chromosome (2L and 2R) summary <- filtered_variants %>% group_by(CHROM, PS, GT) %>% summarize(Count = n(), .groups = "drop") %>% arrange(CHROM, PS, GT) # Exclude phase sets with only one row filtered_summary <- summary %>% group_by(CHROM, PS) %>% filter(n() > 1) %>% ungroup() # Create a list of PS where there is only one row (i.e. where there is NO disagreements between the haplotypes (only 0|1 or 1|0)) filtered_summary_1PS <- summary %>% group_by(CHROM, PS) %>% filter(n() == 1) %>% ungroup() # Calculate the ratio of the larger count to the smaller count per phase set and filter for the majority genotype ratio_summary <- filtered_summary %>% group_by(CHROM, PS) %>% mutate( Max_Count = max(Count), Min_Count = min(Count), Ratio = Max_Count / Min_Count ) %>% filter(Count == Max_Count) %>% # Keep only the rows with the majority genotype mutate(Majority_GT = GT) %>% # Add the Majority_GT column distinct(CHROM, PS, Majority_GT, Ratio, Max_Count) %>% ungroup() # Filter for the minority genotype to make a separate list ratio_summary_minority <- filtered_summary %>% group_by(CHROM, PS) %>% mutate( Max_Count = max(Count), Min_Count = min(Count), Ratio = Max_Count / Min_Count ) %>% filter(Count == Min_Count) %>% # Keep only the rows with the minority genotype mutate(Minority_GT = GT) %>% # Add the minority_GT column distinct(CHROM, PS, Minority_GT, Ratio, Min_Count) %>% ungroup() minor_variants_summary <- ratio_summary_minority %>% filter(Ratio>100) %>% rename(GT = Minority_GT) minor_variants <- filtered_variants %>% inner_join(minor_variants_summary, by = c("CHROM", "PS", "GT")) # Save the filtered list to a file write.table(minor_variants, file = "L208_minor_variants.txt", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE) ##In the ratio_summary, if the ratio is less than 100, print a list - these will be excluded from the final wild lethal haplotype because we cannot be sure of phasing accuracy # Filter for ratios less than 100 and print the list low_ratio_phase_sets <- ratio_summary %>% filter(Ratio < 100) # Save the filtered list to a file write.table(low_ratio_phase_sets, file = "L208_low_ratio_phase_sets.txt", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE) # Print the filtered list print(low_ratio_phase_sets) # Filter for high-ratio phase sets (Ratio > 50) high_ratio_phase_sets <- ratio_summary %>% filter(Ratio > 100) # Select the majority genotype for high-ratio phase sets majority_genotypes <- high_ratio_phase_sets %>% select(CHROM, PS, Majority_GT) # merge majority_genotypes with the list of PS where there is NO disagreements between the haplotypes (only 0|1 or 1|0) ("filtered_summary_1ps") majority_genotypes_renamed <- majority_genotypes %>% rename(GT = Majority_GT) all_phase_sets <- bind_rows( majority_genotypes_renamed %>% mutate(Source = "Majority"), # Add a source column for clarity filtered_summary_n1 %>% mutate(Source = "SingleRow") # Add a source column for clarity ) # Keep only the majority genotypes or single row genotypes in the filtered_variants dataset final_variants <- filtered_variants %>% inner_join(all_phase_sets, by = c("CHROM", "PS", "GT")) # Split the dataset into unphased and phased genotypes unphased_variants <- final_variants %>% filter(grepl("/", GT)) # Keep rows where GT contains a / #save unphased dataset to use to remove from original write.table( unphased_variants %>% select(CHROM, POS, ALT), file = "L208_unphased_variants.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE ) #Phased variants phased_variants <- final_variants %>% filter(grepl("\\|", GT)) # Keep rows where GT contains a | #save unphased dataset to use to remove from original write.table( phased_variants %>% select(CHROM, POS, ALT, GT, PS), file = "L208_phased_variants.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE ) ``` ### 3. IN WORD DOC Use the list of unphased CyO variants (that are heterozygous with the reference allele in the data) present in the PacBio data and remove them from the vcf `bcftools view -T ^L208_unphased_variants.txt -o L208_wo_unphased_CyOmatches.vcf -O v L208_phased_merged_label_chr2.ann.vcf` ### 4. IN WORD DOC Use the list of phased CyO variants present in the PacBio data and generate a list of all of the variants that are in the same PS and haplotype as the CyO variant: ``` cut -f1,4,5 L208_phased_variants.txt > L208_phased_variants_formatted.txt bcftools query -f '%CHROM\t[%GT]\t[%PS]\t%POS\t%ALT\n' L208_wo_unphased_CyOmatches.vcf | grep -Ff L208_phased_variants_formatted.txt > L208_all_phased_variants_to_remove.txt ``` *cut -f1,2,4 extracts only CHROM(1), GT(4), and PS (5) (fields needed for matching) from the flagged variants file.* Let's pretend that LethalLine_pacbio_merged.vcf has the following variants: a. 2R 1000 1 0|1 G b. 2R 1500 1 0|1 AAGTG c. 2R 2000 50 1|0 A d. 2R 2000 50 0|1 AAGTG e. 2R 2000 50 2|1 CCAAGGAAGG "a" and "c" are in the list (flagged_phase_sets_CyO_variants.txt) already so they of course match Chrom, PS and GT adn will be added to the "to_remove.txt" file. "b" matches chrom (2R),PS (1) and GT (0|1) of "a" (which is in the list), so "b" will need to be removed and added to "to_remove.txt". This is because variant 'b' is in the same haplotype as 'a' which we know is part of CyO, so 'b' is also part of CyO. 'd' will not be removed because it doesn't match CHROM, PS and GT of either variant in "flagged_phase_sets_CyO_variants.txt" ### 5. Remove all variants (heterozygous with the reference genome) in phase with CyO variants from the VCF that already has unphased CyO variants removed ``` cut -f1,4,5 L208_all_phased_variants_to_remove.txt > L208_all_phased_variants_to_remove_formatted.txt ``` `bcftools view -T ^L208_all_phased_variants_to_remove_formatted.txt L208_wo_unphased_CyOmatches.vcf -o L208_CyO_01_variants_removed.vcf -O v` ### 6. Change genotypes to ./. if in the minor_variants_all.txt file and in teh L${LL}_low_ratio_phase_sets.txt #### a) make list of positions (minor variants) into just positions ``` awk '{print $1"\t"$2}' L208_minor_variants_all.txt > L208_minor_variants_positions.txt ``` length = 870 rows (wc -l L208_minor_variants_positions.txt) #### b) new vcf filtered for just those positions ``` bcftools view -T L208_minor_variants_positions.txt L208_CyO_01_variants_removed.vcf -O z -o L208_filtered_minor_variants.vcf.gz ``` number of variatns: 899 (grep -v '^#' L208_filtered_minor_variants.vcf | wc -l) ** this make sense becuase the vcf is merged from multiple variant calling files there are a few duplicates! #### c) NOW use +setGT to change all of these genotypes to ./. ``` bcftools +setGT L208_filtered_minor_variants.vcf.gz -- -t a -n ./. > L208_filtered_minor_variants_missing.vcf.gz ``` Filtered 1789 alleles (899x2!) - correct! #### d) NOW merge this back to the vcf file (L208_CyO_01_variants_removed.vcf) and replace matching rows from L208_CyO_01_variants_removed.vcf with the missing genotypes values from L208_filtered_minor_variants_missing.vcf.gz `bcftools view -T ^L208_minor_variants_positions.txt -O z -o L208_CyO_01_variants_removed_mnr_rmvd.vcf.gz L208_CyO_01_variants_removed.vcf` `bcftools concat -a -O z -o L208_woCyO01_minorvariantgtmissing.vcf.gz L208_CyO_01_variants_removed_mnr_rmvd.vcf.gz L208_filtered_minor_variants_missing.vcf.gz` ## 7. change all variatns in the LOW RATIO PHASE SETS TO ./. #### 0) IN WORD DOC first take the list of low ratio phase sets and generate a list of all variants in the low ratio phase sets ``` cut -f1,2 L208_low_ratio_phase_sets.txt > L208_LR_PS_formatted.txt bcftools query -f '%CHROM\t[%PS]\t%POS\t%ALT\n' L208_woCyO01_minorvariantgtmissing.vcf.gz > all_variants_with_PS.txt awk 'NR==FNR {keys[$1"\t"$2]; next} ($1"\t"$2) in keys' L208_LR_PS_formatted.txt all_variants_with_PS.txt > L208_LR_PS_variants.txt ``` wc -l L208_LR_PS_variants.txt (~20,000 variants in LR phase sets so have to mark as missing - unsure which is CyO and which is wild chrom!) #### a) make list of positions (minor variants) into just positions ``` awk '{print $1"\t"$3}' L208_LR_PS_variants.txt > L208_LR_PS_variants_positions.txt ``` L208_LR_PS_variants.txt is chrom, PS, pos, alt #### b) new vcf filtered for just those positions ``` bcftools view -T L208_LR_PS_variants_positions.txt L208_woCyO01_minorvariantgtmissing.vcf.gz -O z -o L208_filtered_LP_PS_variants.vcf.gz ``` number of variatns: 21021 (grep -v '^#' L208_filtered_LP_PS_variants.vcf | wc -l) ** this make sense becuase the vcf is merged from multiple variant calling files there are a few duplicates! #### c) NOW use +setGT to change all of these genotypes to ./. ``` bcftools +setGT L208_filtered_LP_PS_variants.vcf.gz -- -t a -n ./. > L208_filtered_LP_PS_variants_missing.vcf.gz ``` Filtered 42042 alleles (21021x2!) - correct! #### d) NOW merge this back to the vcf file (L208_woCyO01_minorvariantgtmissing.vcf.gz) and replace matching rows from L208_woCyO01_minorvariantgtmissing.vcf.gz with the missing genotypes values from L208_filtered_LP_PS_variants_missing.vcf.gz `bcftools view -T ^L208_LR_PS_variants_positions.txt -O z -o L208_woCyO01_minormissing_LPPSvariants_rmvd.vcf.gz L208_woCyO01_minorvariantgtmissing.vcf.gz` `bcftools concat -a -O z -o L208_woCyO01_minorandLRPSGTmissing.vcf.gz L208_woCyO01_minormissing_LPPSvariants_rmvd.vcf.gz L208_filtered_LP_PS_variants_missing.vcf.gz` ##### *** check genotypes in L208_woCyO01_minorandLRPSGTmissing.vcf to make sure that both minor variants and all variants in low ratio phase sets are ./. (yes :)) ## 8. Anything left unphased needs to be labelled unknown (0/1, 1/2, 2/3 etc). Everything that was unphased, heterozygous with reference AND present in CyO Illumina variants have already been removed from teh VCF entired (i.e. are definitively NOT variatns in the wild chromosome). Because the unphased multivariant (1/2, 2/3) that MATCH known CyO variatns from Illumina are rare, and tend to be repetitive seuqences that are highly mutable (i.e. coudl be different between CyO chromosomes) I also call these to be ./. becase can't be confident in phase. #### How many variants are represented in my vcf? grep -v '^#' L208_woCyO01_minorandLRPSGTmissing.vcf | wc -l #### How many genotypes are unphased and not missing adn not homozygous. bcftools query -f '[%GT\n]' L208_woCyO01_minorandLRPSGTmissing.vcf.gz | grep -Ev '^\./\.|^\.$' | grep '/' | grep -Ev '^(0/0|1/1|2/2|3/3)$' | wc -l ~7000 #### Set unphased heterozygous (not missing adn not homozygous) variatns to be missing `bcftools +setGT L208_woCyO01_minorandLRPSGTmissing.vcf.gz -O z -o L208_unphased_het_missing.vcf.gz -- -t q -i 'GT="0/1" || GT="1/0" || GT="1/2" || GT="2/1" || GT="2/3" || GT="3/2" || GT="0/2" || GT="0/3"' -n './.'` filled 15588 alleles (2x) `bcftools query -f '[%GT\n]' L208_unphased_het_missing.vcf.gz | grep -Ev '^\./\.|^\.$' | grep '/' | grep -Ev '^(0/0|1/1|2/2|3/3)$' | wc -l` *above code just to check - not heterozygous unphased alleles left ## 9. Now for each phase set split the genotypes into to haplotypes. #### a) DONE IN WORD DOC FIRST remove duplicated identical from vcf file before tyring to separate to haplotypes bcftools norm -d both L208_unphased_het_missing.vcf.gz -Oz -o L208_no_duplicates.vcf.gz #### b) DONE IN WORD DOC SECOND use vcfR to remove any non-identical duplicates (there are some variants in repetitive regions that are called slightly differently (e.g. TATATA vs TATATAT) and so aren't removed by norm -d) ``` install.packages("vcfR") library(vcfR) L208_vcf <- read.vcfR("L208_no_duplicates.vcf") # Extract VCF information (fixed fields) L208_vcf_info <- L208_vcf@fix # Convert the VCF data into a data frame vcf_df <- as.data.frame(L208_vcf_info) # Remove duplicate rows based on chromosome and position (keeping only one per position) vcf_df_filtered_distinct <- vcf_df %>% group_by(CHROM, POS) %>% filter(n() == 1) %>% ungroup() # Create a text file with CHROM and POS vcf_keep_list <- vcf_df_filtered_distinct[, c("CHROM", "POS")] # Write it to a text file (one variant per line, space-separated) write.table(vcf_keep_list, file = "variants_to_keep.txt", quote = FALSE, col.names = FALSE, row.names = FALSE, sep = "\t") ``` #### DONE IN WORD DOC THIRD now back in cluster - keep only variatns in my list from vcfR (variants_to_keep.txt) wc -l double_variants_to_remove.txt grep -v '^#' L208_no_duplicates.vcf | wc -l `bcftools view -T variants_to_keep.txt -o filtered_L208_no_duplicates.vcf -O v L208_no_duplicates.vcf ` grep -v '^#' filtered_L208_no_duplicates.vcf | wc -l ### 10. IN WORD DOC There are phase sets in which CyO Illumina variatns aren't present! Need to make a list of those phase sets remove anything from those phase sets. ##### extract all phase sets from VCF file bcftools query -f '%CHROM\t%POS\t%ALT\t[%PS]\t[%GT]\n' L208_phased_merged_label_chr2.ann.vcf | \ cut -f 4 | sort | uniq > all_phase_sets.txt ##### Find phase sets not present in L208_phase_sets_CyO_variants.txt awk '{print $4}' L208_phase_sets_CyO_variants.txt > L208_phase_sets_CyO_variants_ps.txt grep -Fv -f L208_phase_sets_CyO_variants_ps.txt all_phase_sets.txt > phase_sets_not_in_CyO.txt `bcftools query -f '%CHROM\t%POS\t%ALT\t[%PS]\t[%GT]\n' filtered_L208_no_duplicates.vcf | \ grep -Ff phase_sets_not_in_CyO.txt | \ grep -vE '([0-9]+)/\1' > variants_not_in_CyO_non_homozygous.txt` #### a) make list of positions (minor variants) into just positions ``` awk '{print $1"\t"$2}' variants_not_in_CyO_non_homozygous.txt > L208_nonCyOPS_positions.txt ``` #### b) new vcf filtered for just those positions ``` bcftools view -T L208_nonCyOPS_positions.txt filtered_L208_no_duplicates.vcf -O z -o L208_filtered_nonCyOPSvariants.vcf.gz ``` #### c) NOW use +setGT to change all of these genotypes to ./. ``` bcftools +setGT L208_filtered_nonCyOPSvariants.vcf.gz -- -t a -n ./. > L208_filtered_nonCyOPSvariants_missing.vcf ``` #### d) NOW merge this back to the vcf file (L208_CyO_01_variants_removed.vcf) and replace matching rows from L208_CyO_01_variants_removed.vcf with the missing genotypes values from L208_filtered_minor_variants_missing.vcf.gz `bcftools view -T ^L208_nonCyOPS_positions.txt -O z -o filtered_L208_no_duplicates_nonCyOrmvd.vcf.gz filtered_L208_no_duplicates.vcf` `bcftools concat -a -O z -o L208_ready.vcf.gz filtered_L208_no_duplicates_nonCyOrmvd.vcf.gz L208_filtered_nonCyOPSvariants_missing.vcf.gz` ## 11. The phase sets that contain 0's should be the CyO variants (the wild fly haplotype should be almost entirely 1's (some 2's)). Need to figure out a way to select the haplotypes from each phase set that are primarily 1's (and some 2's) and merge themall together into one haplotype. Make sure the ./. genotypes are not exlucded so we know which sites are NA (and not necessarily equal to the reference genome)