NGSRelate to estimate pairwise relatedness from lcWGS data

<h1> Estimate pairwise relatedness from lcWGS data </h1> Software: NgsRelate. This is a maximum likelihood method for estimating the pairwise relatedness parameter R from NGS data based on genotype likelihoods (https://github.com/ANGSD/NgsRelate). <h2> Input files </h2> * Ref genome file in fasta format * Bam files after mapping to the ref genome. * After mapping to the reference genome, the bam files were further processed by removing duplicates using Picard and performing indel realignment. Also samples with low mapping percentage and low coverage (after duplicate removal) were excluded from the analysis. --- <h2> Run ANGSD to get genotype likelihoods and allele frequencies </h2> ``` #!/bin/bash #SBATCH --job-name=ngsrelate #SBATCH -N 1 #SBATCH -n 1 #SBATCH -c 100 #SBATCH --threads-per-core=1 #SBATCH -p cpu #SBATCH --qos=long #SBATCH --mem=60G #SBATCH -t 3-00:00:00 #SBATCH -o %x_%j.out #SBATCH -e %x_%j.err cd /scratch/workspace/snehasuresh_umass_edu-Sneha_WF_2/ANGSD/NGSRelate REFDIR="/scratch/workspace/snehasuresh_umass_edu-Sneha_WF_2/Pamericanus_genome" REF="GCA_038501795.1_ASM3850179v1_genomic.fna" angsd -GL 1 -out ./WF_ANGSD_GL -C 50 -doGlf 3 -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 2 -only_proper_pairs 1 -remove_bads 1 -minMapQ 20 -minQ 20 -bam ngsrelate_bam_input.txt -nThreads 100 -ref $REFDIR/$REF ``` <h2> Then extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs) </h2> ``` zcat WF_ANGSD_GL.mafs.gz | cut -f6 |sed 1d >freq # 6th column is unknownEM: the allele freq. estimated using the EM algorithm from the GL. This is the freq. of the minor allele ``` <h2> Check if the freq file and glf file have the same number of lines </h2> ``` zcat WF_ANGSD_GL.glf.gz | wc -l wc -l freq ``` <h2> Create a file with all the sample IDs </h2> Note: You can input a file with the IDs of the individuals (on ID per line), using the -z option while running ngsrelate, in the same order as in the bam_file_list file. If you do the output file from ngsrelate will also contain these IDs as column 3 and 4. ``` cut -f 6 -d '/' ngsrelate_bam_input.txt > sample_ids.txt sed -i 's/_indel\.bam//' sample_ids.txt > sample_ids_edited.txt ``` <h2> Run NgsRelate </h2> ``` NGSREL=/nese/meclab/Shared/bin/ngsRelate cd /scratch/workspace/snehasuresh_umass_edu-Sneha_WF_2/ANGSD/NGSRelate #Run ngsRelate $NGSREL/ngsRelate -g WF_ANGSD_GL.glf.gz -p 50 -n 76 -f freq -z sample_ids_edited.txt -O NgsRelate_output/ngsRelate_out_YOY ``` <h3> Output notes </h3> **Output format** * The first two columns contain indices of the two individuals used for the analysis. * The third column is the number of genomic sites considered. * The following nine columns are the maximum likelihood (ML) estimates of the nine jacquard coefficients, where K0==J9, K1==J8, K2==J7 in absence of inbreeding. Based on these Jacquard coefficients, NgsRelate calculates 11 summary statistics: 1. rab is the pairwise relatedness 2. Fa is the inbreeding coefficient of individual a 3. Fb is the inbreeding coefficient of individual b 4. theta is the coefficient of kinship 5. inbred_relatedness_1_2 6. inbred_relatedness_2_1 7. fraternity 8. identity 9. zygosity 10. Two-out-of-three IBD 11. Inbreeding difference 12. the log-likelihood of the ML estimate. 13. number of EM iterations. If a -1 is displayed. A boundary estimate had a higher likelihood. 14. If differs from -1, a boundary estimate had a higher likelihood. Reported loglikelihood should be highly similar to the corresponding value reported in loglh 15. fraction of sites used for the ML estimate The remaining columns relate to statistics based on a 2D-SFS. --- <h1> Plotting NGSRelate results in R </h1> <h2> Heatmap of pairwise relatedness for all individuals </h2> ``` #!/bin/Rscript library(dplyr) library(tidyverse) df <- read.csv("ngsRelate_out_YOY_with_smaple_IDs.csv") ## Using the relatedness metric rab - (Hedrick & Lacy's relatedness estimator) # Choose the metric you want to plot ngs_matrix <- df %>% select(ida, idb, rab) # Convert to wide form: create a symmetrical matrix with individuals as rows and columns. # Create a symmetrical matrix relatedness_matrix <- ngs_matrix %>% pivot_wider(names_from = idb, values_from = rab) %>% column_to_rownames("ida") # Ensure symmetry relatedness_matrix[lower.tri(relatedness_matrix)] <- t(relatedness_matrix)[lower.tri(relatedness_matrix)] # Plot the heatmap library(pheatmap) library(RColorBrewer) #display.brewer.all(colorblindFriendly = T) heat_colors <- brewer.pal(6, "OrRd") png("relatedness_heatmap_all_samples.png",width=70, height=60, res=300, units="cm") pheatmap(relatedness_matrix, color=heat_colors, cluster_rows = FALSE, cluster_cols = FALSE, display_numbers = FALSE, main = "Pairwise Relatedness (rab)", fontsize_row = 20, fontsize_col = 20, fontsize = 24) dev.off() ``` --- <h2> Stacked bar Plots </h2> ``` #!/bin/Rscript library(ggplot2) df <- read.csv("ngsRelate_out_YOY_with_smaple_IDs.csv") #Read output from ngsrelate meta <- read.csv("population_metadata.csv") #Read population metadata #Add population info for ida df <- df %>% left_join(meta,by=c("ida" = "Sample")) %>% rename(PopA = Geographical.unit) #Add population info for idb df <- df %>% left_join(meta,by=c("idb" = "Sample")) %>% rename(PopB = Geographical.unit) #Create combined population pair (alphabetical order to avoid duplicates) df <- df %>% mutate(PopPair = ifelse(PopA < PopB, paste(PopA, PopB, sep = "."), paste(PopB, PopA, sep = "."))) ## Within-pop comparisons # Filter to within-population comparisons within_df <- df %>% filter(PopA == PopB) #### Stacked bar plot # Define bin levels and breaks rab_levels <- c("0-0.01", "0.01-0.02", "0.02-0.04", "0.04-0.08", "0.08-0.1", "0.1-0.5", "0.5-1") rab_breaks <- c(0, 0.01, 0.02, 0.04, 0.08, 0.1, 0.5, 1) # Bin rab values within_df$rab_bin <- cut(within_df$rab, breaks = rab_breaks, labels = rab_levels, include.lowest = TRUE, # This ensures that the very first bin includes the lowest value right = TRUE) #right=TRUE is to specify that the upper bound is included (i.e. 0.01 is included in the interval [0,0.01]) # Summarize counts per population and bin stacked_df <- within_df %>% group_by(PopPair, rab_bin) %>% # Group data by both Pop pair and rab_bin summarise(count = n(), .groups = "drop") %>% # Count how many pairwise comparisons falls into that bin for that PopPair. .groups = "drop" tells dplyr to remove all grouping after this step cuz the next function 'complete' expects an ungrouped data frame complete(PopPair, rab_bin = rab_levels, fill = list(count = 0)) %>% #ensures all combinations of PopPair and rab_bin exist in the dataset, even if they were missing. So if there are no rab values between 0.04 and 0.08 for a PopPair, it will set the count to 0. group_by(PopPair) %>% #Regroup by PopPair mutate(proportion = count / sum(count)) %>% ungroup() #Removes grouping structure again, leaving a clean, regular data.frame that’s ready for plotting #Reorder populations stacked_df$PopPair <- factor(stacked_df$PopPair, ordered = T, levels = rev(c("Southern ME.Southern ME", "NH.NH", "MA Northshore.MA Northshore", "Boston.Boston"))) #Plot png("relatedness_withinpop_stacked_bar.png",width=35, height=25, res=300, units="cm") p <- ggplot(stacked_df, aes(x = factor(PopPair), y = proportion, fill = factor(rab_bin))) + geom_bar(stat = "identity",width=0.5) + coord_flip() + ylab("Proportion of pairwise comparisons") + xlab("Population") + scale_fill_grey(name = "Pairwise relatedness") + scale_x_discrete(labels = c( "Boston.Boston" = "Boston", "NH.NH" = "NH", "MA Northshore.MA Northshore" = "MA Northshore", "Southern ME.Southern ME" = "Southern ME")) + scale_y_continuous(labels = c( "0.00" = "0", "0.25" = "0.25", "0.50" = "0.5", "0.75" = "0.75", "1.00" = "1"),expand = c(0, 0))+ theme_classic() + theme(axis.text=element_text(size=20), axis.title=element_text(size=22,face="bold"), legend.text = element_text(size = 20), legend.title = element_text(size = 22,face="bold")) print(p) dev.off() ## Between-pop comparisons # Filter to between-population comparisons between_df <- df %>% filter(PopA != PopB) #### Stacked bar plot # Define bin levels and breaks rab_levels <- c("0-0.01", "0.01-0.02", "0.02-0.04", "0.04-0.08", "0.08-0.1", "0.1-0.5", "0.5-1") rab_breaks <- c(0, 0.01, 0.02, 0.04, 0.08, 0.1, 0.5, 1) # Bin rab values between_df$rab_bin <- cut(between_df$rab, breaks = rab_breaks, labels = rab_levels, include.lowest = TRUE, # This ensures that the very first bin includes the lowest value right = TRUE) #right=TRUE is to specify that the upper bound is included (i.e. 0.01 is included in the interval [0,0.01]) # Summarize counts per population and bin stacked_df <- between_df %>% group_by(PopPair, rab_bin) %>% # Group data by both Pop pair and rab_bin summarise(count = n(), .groups = "drop") %>% # Count how many pairwise comparisons falls into that bin for that PopPair. .groups = "drop" tells dplyr to remove all grouping after this step cuz the next function 'complete' expects an ungrouped data frame complete(PopPair, rab_bin = rab_levels, fill = list(count = 0)) %>% #ensures all combinations of PopPair and rab_bin exist in the dataset, even if they were missing. So if there are no rab values between 0.04 and 0.08 for a PopPair, it will set the count to 0. group_by(PopPair) %>% #Regroup by PopPair mutate(proportion = count / sum(count)) %>% ungroup() #Removes grouping structure again, leaving a clean, regular data.frame that’s ready for plotting #Reorder populations stacked_df$PopPair <- factor(stacked_df$PopPair, ordered = T, levels = rev(c("MA Northshore.Southern ME", "Boston.Southern ME", "NH.Southern ME", "MA Northshore.NH", "Boston.NH", "Boston.MA Northshore"))) #Plot png("relatedness_betweenpop_stacked_bar.png",width=35, height=25, res=300, units="cm") p <- ggplot(stacked_df, aes(x = factor(PopPair), y = proportion, fill = factor(rab_bin))) + geom_bar(stat = "identity",width=0.5) + coord_flip() + ylab("Proportion of pairwise comparisons") + xlab("Population") + scale_fill_grey(name = "Pairwise relatedness") + scale_x_discrete(labels = c( "NH.Southern ME" = "Southern ME.NH", "MA Northshore.Southern ME" = "Southern ME.MA Northshore", "MA Northshore.NH" = "NH.MA Northshore", "Boston.Southern ME" = "Southern ME.Boston", "Boston.NH" = "NH.Boston", "Boston.MA Northshore" = "MA Northshore.Boston")) + scale_y_continuous(labels = c( "0.00" = "0", "0.25" = "0.25", "0.50" = "0.5", "0.75" = "0.75", "1.00" = "1"),expand = c(0, 0))+ theme_classic() + theme(axis.text=element_text(size=20), axis.title=element_text(size=22,face="bold"), legend.text = element_text(size = 20), legend.title = element_text(size = 22,face="bold")) print(p) dev.off() ``` --- <h2> Heatmap with mean rab for each PopPair </h2> ``` #!/bin/Rscript df <- read.csv("ngsRelate_out_YOY_with_smaple_IDs.csv") #Read output from ngsrelate meta <- read.csv("population_metadata.csv") #Read population metadata #Add population info for ida df <- df %>% left_join(meta,by=c("ida" = "Sample")) %>% rename(PopA = Geographical.unit) #Add population info for idb df <- df %>% left_join(meta,by=c("idb" = "Sample")) %>% rename(PopB = Geographical.unit) #Create combined population pair (alphabetical order to avoid duplicates) df <- df %>% mutate(PopPair = ifelse(PopA < PopB, paste(PopA, PopB, sep = "."), paste(PopB, PopA, sep = "."))) # Calculate mean and std dev for each PopPair heat_df <- df %>% group_by(PopPair) %>% summarise( mean_rab = mean(rab, na.rm = TRUE), sd_rab = sd(rab, na.rm = TRUE),.groups = "drop") #Split PopPair into two columns for heatmap heat_df <- heat_df %>% separate(PopPair, into = c("PopA", "PopB"), sep = "\\.") # Make it symmetric symmetric_df <- bind_rows( heat_df, heat_df %>% rename(PopA = PopB, PopB = PopA) ) # Define population order population_order <- c("Southern ME", "NH","MA Northshore","Boston") symmetric_df$PopA <- factor(symmetric_df$PopA, levels = rev(population_order)) symmetric_df$PopB <- factor(symmetric_df$PopB, levels = rev(population_order)) # Calculate mask after factor conversion symmetric_df <- symmetric_df %>% distinct() %>% mutate( mask = as.integer(PopA) > as.integer(PopB), # Lower triangle mean_rab_masked = ifelse(mask, NA, mean_rab) # Set to NA if in lower triangle ) #Plot heatmap library(RColorBrewer) display.brewer.all(colorblindFriendly = T) heat_colors <- brewer.pal(6, "OrRd") png("mean_relatedness_heatmap.png",width=32, height=35, res=300, units="cm") ggplot(symmetric_df, aes(x = PopA, y = PopB, fill = mean_rab_masked)) + geom_tile(color = "white") + geom_tile(data = subset(symmetric_df, is.na(mean_rab_masked)), fill = "grey", color = "white") + # Black fill for masked cells geom_text(aes(label = ifelse(!is.na(mean_rab_masked), sprintf("%.6f", mean_rab_masked), "")), size = 6) + scale_fill_gradientn(colors = heat_colors, name = "Mean pairwise relatedness",labels = function(x) sprintf("%.6f", x), na.value = "black") + # Also ensures NA fill is black coord_fixed() + theme_minimal() + theme( axis.text.x = element_text(angle = 45, hjust = 1), axis.title = element_blank(), axis.text = element_text(size = 25), legend.title = element_text(size = 25,face="bold",margin = margin(b = 10)), legend.text = element_text(size = 20), legend.key.height = unit(1, "cm") ) dev.off() ```