GT-Seq Analysis Using Colony

--- title: 'GT-Seq Analysis Using Colony' --- # Colony Analysis and Filtering Workflow This section assumes you have completed genotyping and are preparing to run microhaplot to generate microhaplotype profiles and downstream input for COLONY. --- ## Step 1: Convert BAM to SAM (for microhaplot input) ```bash #!/bin/bash while read -r FILE; do echo "converting bam $FILE to sam" samtools view -h $FILE > ${FILE/.bam/.sam} done < bamfilelist.txt ``` > **Note:** > > * `bamfilelist.txt` should contain the **full paths** to all your BAM files. > * Run this in the same directory as the BAMs, ideally within a Unity compute session with the `gtseq` environment loaded. --- ## Step 2: Run microhaplot (Command Line + R) **Run microhaplot on the command line** *(requires you first have generated all your SAM files and have a VCF with genotyped SNPs for those files)*. You will run the microhaplot workflow using a combination of command-line execution and R scripting within a Unity session. This step prepares haplotype-level data and outputs two `.rds` files used for further inspection in the Shiny app. ### In R (install and initialize the Shiny app environment): ```r devtools::install_github("ngthomas/microhaplot", build_vignettes = TRUE, build_opts = c("--no-resave-data", "--no-manual")) # This sets up the Shiny GUI app by copying its files to your local machine microhaplot::mvShinyHaplot("~/Shiny") ``` ### Prepare and run the haplotype extraction: ``` library(microhaplot) library(tidyverse) sam.path <- "/path/to/SAM_FILES" label.path <- "/path/to/SAM_FILES/ALLSAMLIST.txt" vcf.path <- "/path/to/allHI_samples_filtered.recode.vcf" app.path <- "~/Shiny/microhaplot" out.path <- tempdir() haplo.read.tbl <- prepHaplotFiles(run.label = "HIALL", sam.path = sam.path, out.path = out.path, label.path = label.path, vcf.path = vcf.path, app.path = app.path, add.filter = 1, n.jobs = 200) ``` > Tips: > * This script was run in RStudio on Unity OnDemand. > * Settings: R version 4.4.0, 8 hours, 30 GB memory, and 100 CPU. > * It may take several hours depending on your dataset size (e.g., ~3000 samples). > * The output includes two .rds files you will download to your local computer or drive. ## Step 3: Launch microhaplot GUI (Shiny app) ```r app.path <- ("/your/local/path/to/Shiny/microhaplot") runShinyHaplot(app.path) ``` Use the GUI to inspect and filter loci. Export these files: * `locus_annotation.csv` * `haplotype_frequency.csv` * `unfiltered_haplotype.csv` --- ## Step 4: Filter Microhaplotypes ### Key Filters: **Read Depth per Haplotype (Per Individual)**: Each microhaplotype is built from clusters of reads supporting a specific allele. Low-depth haplotypes are often artifacts due to sequencing error or index hopping. - **Threshold:** Keep haplotypes only if read depth ≥ 20 per individual per locus. - Haplotypes with <20 reads should be excluded to avoid calling false heterozygotes or introducing noise into COLONY input. > By setting a minimum depth (e.g. ≥20), you ensure that both haplotypes in a heterozygote are well-supported and reduce false-positive detection of second alleles that are really just noise. **Allele Balance (Per Locus, Per Individual):** Allele balance helps determine if a heterozygote call is real or just a minor sequencing error. It's calculated as the ratio of reads supporting the less common haplotype to the more common haplotype within an individual. | Scenario | Haplotype A Reads | Haplotype B Reads | Allele Balance (B/A) | Interpretation | |--------------------|------------------|-------------------|----------------------|---------------------------------------| | Homozygote | 9500 | 500 | 0.05 | Likely sequencing noise → drop B | | True Heterozygote | 5000 | 5000 | 1.00 | Balanced → keep both | | Borderline Case | 4000 | 1500 | 0.375 | Below threshold → likely drop B | >You can apply these filters directly in the microhaplot Shiny GUI or script it in R after downloading the `haplotype_frequency.csv` and `locus_annotation.csv` files. --- ## Step 5: Prep Data for COLONY In R: * Convert character alleles to unique numeric haplotype IDs (e.g., `mhLKcm1`). * Reformat genotype tables for COLONY. * Add metadata (sample sex, groupings, sibling relationships). > Sample names must be **≤ 20 characters**. > Rename long IDs to short, unique ones (e.g., `HI_2`). --- ## Step 6: Run COLONY GUI ### Settings Used: ``` 2470 ! Number of offspring 175 ! Number of loci 1234 ! Random seed 0 ! Update allele frequencies: No 2 ! Dioecious species 0 ! Inbreeding: Absent 0 ! Diploid 0 0 ! Polygamy for both sexes 0 ! Clone inference: No 1 ! Scale full sibship: Yes 0 ! Sibship prior: None 0 ! Pop allele freq known: No 1 ! Number of runs 2 ! Medium run length 1 ! Monitor by iteration count 1 ! Monitor interval 1 ! Windows version 1 ! Pair-Likelihood Score method 1 ! Medium precision ``` > Adjust number of offspring/loci based on your filtered dataset. > Note: > - These numbers reflect how many offspring was in the dataset after all of the filtering, not how many was started with. > - Keep in mind that higher precision and longer run time will lead to way more resources, and from published colony papers they say medium precision seems to be good for most projects. --- ## Step 7: Filter and Analyze COLONY Outputs in R Colony generates **many output files**. You can inspect them in the GUI, but most filtering and analysis is typically done in R. ### Files to Focus On: - `.BestConfig`: primary file for final configuration - `.maternity`, `.pairwiseMaternity`: evaluate confidence in mother-offspring assignments - `.paternity`, `.pairwisePaternity`: evaluate confidence in father-offspring assignments ### Important Notes: - Pay close attention to assignments using **known metadata**. - Offspring are **not always assigned to the correct mother**, even with high-quality data. - Filter out individuals from the `.BestConfig` who have **< 80% confidence** in mother assignment (from `.pairwiseMaternity`). - Cross-reference with metadata to **validate maternal assignments**. > **Note:** Father assignments for these individuals may also be inaccurate and should be excluded. --- ## Colony Analysis Script ### Colonyresults.R ```{r} ############# Colony2 Kinship Results ###################### #load packages library(here) #devtools::install_github("eriqande/afblue") functions for inspecting colony output library(afblue) #install.packages(c( "kinship2", "magrittr")) dependencies for afblue package library(tidyverse) library(magrittr) library(stringr) library(kinship2) library(data.table) #################################################################################################################################################################################### #Intro: This script is for analyzing outputs from Colony2 kinship run. # See Hackmd for upstream genotyping of microhaplotypes, and COlony2 run #I ran Colony2 as windows GUI because the multi-thread function does not appear to work on command line, something to consider later #if running with all samples for a project, colony make take a very long time to run. You could subset by sample year, but include possible males from across all years #################################################################################################################################################################################### ########## Colony run with ALL DATA ################################################################################################################################################ #READ IN METADATA WHICH HAS SHORT IDS AND FULL IDS meta<- read.csv("BR_metadata_forcolony.csv") #READ IN BESTCONFIG FILE FROM COLONY bestconfig<-read_colony_best_config("BRAZIL_Colony_28072025/brazilgtseq.BestConfig") bestconfig2 <- bestconfig #REPLACE SHORT ID WITH FULL ID: bestconfig2[] <- lapply(bestconfig2, function(x) { matched_values <- meta$library.id[match(x, meta$uniqueid)] ifelse(!is.na(matched_values), matched_values, x) # Keep original if no match }) #NOTE: are some offspring are assigned to the wrong mother: maybe they are missing too much data? Try looking at assignment confidence and filter that way first, maybe one of the pair is missing a lot of data? # READ IN MATERNAL ASSIGNMENT FILES FROM COLONY: maternal<-read.csv("BRAZIL_Colony_28072025/brazilgtseq.Maternity", header = TRUE) PWmaternal<-read.csv("BRAZIL_Colony_28072025/brazilgtseq.PairwiseMaternity", header = TRUE) maternal[] <- lapply(maternal, function(x) { matched_values <- meta$library.id[match(x, meta$uniqueid)] ifelse(!is.na(matched_values), matched_values, x) # Keep original if no match }) PWmaternal[] <- lapply(PWmaternal, function(x) { matched_values <- meta$library.id[match(x, meta$uniqueid)] ifelse(!is.na(matched_values), matched_values, x) # Keep original if no match }) # READ IN PATERNAL ASSIGNMENTS FROM COLONY: paternal<-read.csv("BRAZIL_Colony_28072025/brazilgtseq.Paternity", header = TRUE) paternal[] <- lapply(paternal, function(x) { matched_values <- meta$library.id[match(x, meta$uniqueid)] ifelse(!is.na(matched_values), matched_values, x) # Keep original if no match }) PWpaternal<-read.csv("BRAZIL_Colony_28072025/brazilgtseq.PairwisePaternity", header = TRUE) PWpaternal[] <- lapply(PWpaternal, function(x) { matched_values <- meta$library.id[match(x, meta$uniqueid)] ifelse(!is.na(matched_values), matched_values, x) # Keep original if no match }) #NOW WE FILTER OUT LOW CONFIDENCE ASSIGNMENTS: #filter bestconfig table to only keep offspring with high confidence maternal assignment filtoffspring <-PWmaternal %>% filter(!Confidence == "100%") %>% filter(!Confidence == "95%") %>% filter(!Confidence == "90%") %>% filter(!Confidence == "85%") %>% filter(!Confidence == "80%") #can make this filtering line simpler with the code: #low_conf <- PWmaternal %>% filter(!Confidence %in% c("100%", "95%", "90%", "85%", "80%")) filtbestconfig<-bestconfig2 %>% #filter(id %in% PWmaternal$OffspringID) %>% #filter out individuals that did not make it into PW assignments. DON"T DO THIS, IT REMOVES ma_* moms filter(!id %in% filtoffspring$OffspringID) #remove individuals with low confidence PW maternal assingments #append expected mother id to offspring, see if they match in filtconfig table, if they don't remove those individuals: #NOTE: for brazil, should be able to do this more cleanly and just join from the metadata table filtbestconfig %>% filter(sex == "U") %>% #select just offspring filter(!grepl("ma_", mom)) %>% #ignore moms that are reconstructed genotypes, going to assume these are fine separate_wider_delim(., cols = id, delim = "_", names = c("pop","spp","plate","loc","momid","season","nest","hid"), too_few = "align_end",too_many = "drop", cols_remove = FALSE) %>% # get the expected mom id for each hatchling separate_wider_delim(., cols = mom, delim = "_", names = c("pop.mom","spp.mom","plate.mom","loc.mom","momid.mom"), too_few = "align_end",too_many = "drop", cols_remove = FALSE) %>% #seperate out the assigned mom id into year and moto, note that some ids in this dataet don't have year first filter(!momid == momid.mom) # this identifies offspring with incorrect mother assignments. while theoretically this shouldn't happen when we know the mom, if there are mismatches here it could mean that there was an error in the field, in our sample sheet, or there were very few loci for this offspring or mother leading to mismatches #make this into a df: mismatch_off_mom<- filtbestconfig %>% filter(sex == "U") %>% #select just offspring filter(!grepl("ma_", mom)) %>% #ignore moms that are reconstructed genotypes, going to assume these are fine separate_wider_delim(., cols = id, delim = "_", names = c("pop","spp","plate","loc","momid","season","nest","hid"), too_few = "align_end",too_many = "drop", cols_remove = FALSE) %>% # get the expected mom id for each hatchling separate_wider_delim(., cols = mom, delim = "_", names = c("pop.mom","spp.mom","plate.mom","loc.mom","momid.mom"), too_few = "align_end",too_many = "drop", cols_remove = FALSE) %>% #seperate out the assigned mom id into year and moto, note that some ids in this dataet don't have year first filter(!momid == momid.mom) %>% select(momid,nest,hid,id,momid.mom,mom,dad,sex) #some of these matches def can't be because they are sampled in different years. Also a LOT of individuals assigned to T75 which is weird. #HAWAII exclude individuals incorrectly assigned to T75 and those that match a mom from different year #NOTE FOR BRAZIL: may not need this line, change year/season etc as appropriate mismatch_off_mom2<-mismatch_off_mom %>% filter(c(year == "2021" & is.na(year.mom)) | c(year == "2022" & year.mom == "2021") | mom_moto.mom == "T75") # HAWAII also remove individuals assigned to ma1, ma21, ma22, ma23 which have offspring from many nests assinged to them filtbestconfig<- filtbestconfig %>% filter(!id %in% mismatch_off_mom2$id) %>% filter(!id %in% c("T239_H14" , "T239_H17")) %>% filter(!mom %in% c("ma_1" ,"ma_21","ma_22","ma_23")) #leaves 2371 #I think what we want instead is to not remove individuals that don't make it into the PWmaternal table,because none of the recostructed moms are in that table, instead just filter out low-confidence offspring based on that table filtbestconfig <- filtbestconfig %>% filter(sex == "U") %>% rename( Offspring_ID = id, Mother_ID = mom, Father_ID = dad) %>%select(!sex) write.csv(filtbestconfig, "filtered_kinship_config.csv") ####################################################################################### #ok now we want to get info on the total number of fathers, mothers, etc filtbestconfig %>% filter(sex == "U") %>% select(dad)%>% unique(.) #176 fathers filtbestconfig %>% filter(sex == "U") %>% select(mom)%>% unique(.) #124 mothers filtbestconfig %>% filter(sex == "U") %>% unique(.) #2221 hatchlings #calculate the confidence in male assignment for each offspring if available (not available for reconstructed males) paternalconf<-filtbestconfig %>% filter(sex == "U") %>% left_join(., PWpaternal, join_by(id == OffspringID)) #of 605 PWpaternal confidence values, 124 were 95%. 191 assignments were below 70% confidence #BSR by year: idS1<-meta %>% filter(season == "S1") %>% select(library.id) dadsS1<-filtbestconfig %>% filter(id %in% idS1$library.id) %>% filter(sex == "U") %>% select(dad)%>% unique(.) momsS1<-filtbestconfig %>% filter(id %in% idS1$library.id) %>% filter(sex == "U") %>% select(mom)%>% unique(.) #BRAZIL: 22 females, 36 males idS2<-meta %>% filter(season == "S2") %>% select(library.id) dadsS2<-filtbestconfig %>% filter(id %in% idS2$library.id) %>% filter(sex == "U") %>% select(dad)%>% unique(.) momsS2<-filtbestconfig %>% filter(id %in% idS2$library.id) %>% filter(sex == "U") %>% select(mom)%>% unique(.) #BRAZIL S2: 29F, 52M idS3<-meta %>% filter(season == "S3") %>% select(library.id) dadsS3<-filtbestconfig %>% filter(id %in% idS3$library.id) %>% filter(sex == "U") %>% select(dad)%>% unique(.) momsS3<-filtbestconfig %>% filter(id %in% idS3$library.id) %>% filter(sex == "U") %>% select(mom)%>% unique(.) #Brazil S3: 42F, 66M idS4<-meta %>% filter(season == "S4") %>% select(library.id) dadsS4<-filtbestconfig %>% filter(id %in% idS4$library.id) %>% filter(sex == "U") %>% select(dad)%>% unique(.) momsS4<-filtbestconfig %>% filter(id %in% idS4$library.id) %>% filter(sex == "U") %>% select(mom)%>% unique(.) #brazil s4: 49,F 72M #are there dads contributing to clutches in susbequent eyars sharedads<-intersect(dadsS1$dad, dadsS2$dad) #repeat for all seasons dads<-filtbestconfig %>% filter(sex == "U") %>% group_by(dad) %>% count(.) min(dads$n) max(dads$n) median(dads$n) #hawaii: get the number of hatchlings per father. range is 1-52, median of 6. This could be in part due to the small number of hatchlings per clutch included in this dataset (typically around 12) dads %>% filter(grepl("pa", dad)) #hawaii: 156 are reconstructed genotypes, #group by clutch to get level of multiple paternity. Since I've only included one nest from each female, just going to group by mother multpatern<-setDT(filtbestconfig)[, .(count = uniqueN(dad)), by = mom] #gets number of unique males per female(nest) #hawaii: confirm between 1-6 fathers per clutch median(multpatern$count) #hawaii: median is 2, mean 2.29 1-sum(multpatern$count == 1)/nrow(multpatern) #hawaii: 65% of clutches fertilized by more than one male summary(multpatern$count) # check to see if multiple paternity is the same across years" multpatern %>% filter(mom %in% momsS1$mom) %>% summarise(mean = mean(count), median= median(count), min = min(count), max = max(count)) multpatern %>% filter(mom %in% momsS1$mom) %>% summarise(prop = 1-sum(count == 1)/nrow(.)) #hawaii: 65% of clutches fertilized by more than one male #mean median min max # 2.18 2 1 6 # 54.5% multiple paternity multpatern %>% filter(mom %in% momsS2$mom) %>% summarise(mean = mean(count), median= median(count), min = min(count), max = max(count)) multpatern %>% filter(mom %in% momsS2$mom) %>% summarise(prop = 1-sum(count == 1)/nrow(.)) #hawaii: 65% of clutches fertilized by more than one male #mean median min max #2.448276 2 1 7 #72.4% multiple paternity multpatern %>% filter(mom %in% momsS3$mom) %>% summarise(mean = mean(count), median= median(count), min = min(count), max = max(count)) multpatern %>% filter(mom %in% momsS3$mom) %>% summarise(prop = 1-sum(count == 1)/nrow(.)) #hawaii: 65% of clutches fertilized by more than one male #mean median min max # 2.190476 2 1 5 #61.9% multiple paternity multpatern %>% filter(mom %in% momsS4$mom) %>% summarise(mean = mean(count), median= median(count), min = min(count), max = max(count)) multpatern %>% filter(mom %in% momsS4$mom) %>% summarise(prop = 1-sum(count == 1)/nrow(.)) #hawaii: 65% of clutches fertilized by more than one male #mean median min max #1.857143 2 1 5 #59.2% multiple paternity #Do some males dominate while others are rare contributors? multpatern2<-setDT(filtbestconfig)[, .(count = uniqueN(mom)), by = dad] #gets number of unique males per female(nest) sum(multpatern2$count == 1)/nrow(multpatern2) #hawaii: 85% of males only contributed to one clutch in our dataset, though some contributed to up to 6 #per season: polyg21<- setDT(filter(filtbestconfig, mom %in% moms2021$mom))[, .(count = uniqueN(mom)), by = dad] polyg22<- setDT(filter(filtbestconfig, mom %in% moms2022$mom))[, .(count = uniqueN(mom)), by = dad] sum(polyg21$count == 1)/nrow(polyg21) #hawaii: 90% of males only contributed to one clutch in our dataset, though some contributed to up to 6 sum(polyg22$count == 1)/nrow(polyg22) #hawaii: 84.8% of males only contributed to one clutch in our dataset, though some contributed to up to 6 ##################################### PAUSE HERE, THIS IS ADDITIONAL ANALYSIS ############################################# #does this relate to clutch size in dataset? clutchmeta<-read.csv("2021_clutch_meta.csv") #this is a file with our hawaii clutch success dataset, includes things like total eggs laid, number emerged, etc multpatern3<-setDT(filtbestconfig)[, .(count = uniqueN(id)), by = mom] %>% drop_na(mom) %>%#gets number of unique males per female(nest) left_join(.,multpatern, join_by(mom)) %>% rename(., "clutchsize" = "count.x" , "no.dads"="count.y") # Remove duplicates and create temp dataset tempclutchmeta <- clutchmeta %>% distinct() # Search for Mother_Moto in multpatern3$mom and add the matched value tempclutchmeta <- tempclutchmeta %>% rowwise() %>% mutate(matched_mom = { # First, search for exact match with "_" prefix match_idx <- which(str_detect(multpatern3$mom, paste0("_", as.character(Mother_Moto)))) if (length(match_idx) > 0) { multpatern3$mom[match_idx[1]] # Extract first match } else { # Second, search for just the Mother_Moto value match_idx <- which(multpatern3$mom == as.character(Mother_Moto)) if (length(match_idx) > 0) { multpatern3$mom[match_idx[1]] # Extract first match } else { NA_character_ # If no match is found } } }) %>% ungroup() multpatern3<-multpatern3 %>% left_join(., tempclutchmeta, join_by(mom == matched_mom)) %>% mutate(Clutch_Size = ifelse(is.na(Clutch_Size) | Clutch_Size == "", mean(as.numeric(Clutch_Size), na.rm = TRUE), Clutch_Size)) %>% mutate(samp_prop = as.numeric(clutchsize)/as.numeric(Clutch_Size)) f <- multpatern3 %>% mutate(season = case_when( mom %in% moms2021$mom ~ "2021", mom %in% moms2022$mom ~ "2022" )) %>% drop_na(mom) "#F7FCFD" "#E5F5F9" "#CCECE6" "#99D8C9" "#66C2A4" "#41AE76" "#238B45" "#006D2C" "#00441B" ggplot(f, aes(y = no.dads, x = season, size = samp_prop)) + geom_point(alpha = 0.7) + # Transparency for better visibility scale_size_continuous(range = c(2, 8)) + # Adjust size scale labs(y = "Multiple Paternity per Clutch", x = "Season", size = "Proportion Sampled") + theme_minimal() #lol ew this plot is so ugly and not informative multpaternplot<-ggplot(f, aes(x = season, y = no.dads)) + geom_boxplot(fill = "#66C2A4", alpha = 0.5) + # plot for distribution scale_size_continuous(range = c(2, 6)) + # Adjust point size range labs(x = "Season", y = "Fathers per Clutch", ) + theme_minimal()+ theme(text = element_text(size = 20)) propsampleplot<-ggplot(f, aes(x= samp_prop, y= no.dads))+ geom_point(col = "#66C2A4", alpha = 0.5, size =3) + facet_wrap(vars(season), strip.position = "top")+ theme_minimal()+ theme(text = element_text(size = 20))+ labs( x = "Proportion of Nest Sequenced", y = NULL) propmultpatern <-ggplot(f, aes(y = no.dads))+ geom_bar(fill = "#66C2A4",alpha = 0.5) + facet_wrap(vars(season), strip.position = "top")+ theme_minimal()+ theme(text = element_text(size = 20))+ labs( y = "Fathers per Clutch", x = "Frequency") grid.arrange( propmultpatern, propsampleplot,nrow =1, ncol =2) filtpaternal<-paternal %>% filter(OffspringID %in% filtbestconfig$id) write.csv(filtpaternal,"022425_filtpaternalprobs.csv") ###### males that are mating with multiple females? ###### #yes we do have some - vick from OSU had said there was no evidence of polygyny in greens, note this for my ms #look at best family cluster, which has prob of each triad read_colony_best_clust <- function(path) { ped <- read.table(path, header = TRUE, comment = "", stringsAsFactors = FALSE) %>% dplyr::tbl_df() %>% dplyr::mutate(dad = stringr::str_replace(FatherID, "\\*", "pa_"), mom = stringr::str_replace(MotherID, "\\#", "ma_") ) %>% select(ClusterIndex, Probability, OffspringID, mom, dad) } bestcluster<-read_colony_best_clust("Colony_allsamples/allssn.BestCluster") bestcluster[] <- lapply(bestcluster, function(x) { matched_values <- meta$full.sample.name.TO.USE[match(x, meta$uniqueid)] ifelse(!is.na(matched_values), matched_values, x) # Keep original if no match }) filtbestcluster <- bestcluster %>% filter(OffspringID %in% filtbestconfig$id) ```