# GUTMICRO day 3
## (1) Plot within-host SFSs for the Bacteroides and Alistipes samples in theannotated_snps_small.txt.bz2 files. Can you pick out some examples of single vs multi-colonization by eye?Do you notice any differences between the two species?
#### SNP Data analysis
rm(list=ls(all=TRUE)) # clean environment
getwd()
#update.packages(ask = FALSE, dependencies = c('Suggests')) #update all installed R packages
library(data.table) # for fread function
library(RColorBrewer) # for nice color palettes
library(scales) # for points transparency on plots
library(gplots) # for heatmap.2 function
setwd("/Users/Zireael/Documents/KITP/Gutmicro_project/my_scripts") # replace with your working directory
#load data
snps<-fread("../kitp_2021_microbiome_data/snps/Bacteroides_vulgatus_57955/annotated_snps_small.txt",
data.table = F, stringsAsFactors = F, header = T)
#snps<-fread("../kitp_2021_microbiome_data/snps/Alistipes_putredinis_61533/snps_depth_small.txt", data.table = F, stringsAsFactors = F)
rownames(snps)<-snps$site_id
snps<-snps[,-1]
head(snps)
#get rid of 2D and 3D positions
snps<-snps[-grep("2D",rownames(snps)),]
snps<-snps[-grep("3D",rownames(snps)),]
length(grep("1D",rownames(snps)))
length(grep("4D",rownames(snps)))
#custom function to calculate allele frequency
calc_freq<-function(x){
freq<-as.numeric(strsplit(x,split=",")[[1]][1])/as.numeric(strsplit(x,split=",")[[1]][2])
freq
}
freq_mats<-list()
for(i in 1:ncol(snps)){ # calculate allele frequency for each sample
#i<-1
sub.snps<-as.data.frame(snps[,i])
sub.snps<-as.data.frame(sub.snps[-grep("0,",sub.snps[,1]),]) # get rid of sites with 0 mutations
sub.snps<-as.data.frame(sub.snps[-grep("1,",sub.snps[,1]),]) # get rid of sites with mutation in only 1 read (might be seq error)
freq_mats[[i]]<-apply(sub.snps, 1, calc_freq)
}
#create SFS plots for all samples
cairo_pdf('graphs/SFS.pdf', width =10, height = 10) # save plot
par(mfrow=c(6,7))
for(i in 1:length(freq_mats)){
#i<-1
freq<-as.numeric(freq_mats[[i]])
hist(freq, col="royal blue", breaks=20,
main=paste("Sample", i), xlab="Freq")
}
dev.off()
cairo_pdf('graphs/SFS_folded.pdf', width =10, height = 10) # save plot
par(mfrow=c(6,7))
for(i in 1:length(freq_mats)){
#i<-1
freq<-as.numeric(freq_mats[[i]])
freq[which(freq>=0.5)]<-1-freq[which(freq>=0.5)]
hist(freq, col="royal blue", breaks=20,
main=paste("Sample", i), xlab="Freq")
#plot(density(freq, bw=0.01), col="royal blue", main=paste("Sample", i), xlab="Freq")
}
dev.off()
## (2) Can you come up with an algorithmic criterion for figuring out when samples are simple vs complex? i.e., how tall must a bump be before you are confident that it must be caused by an external strain?
## One natural criterion is whether the # of true mutations is consistent withthe number of differences observed between hosts for a particular species. Can you come up with a way to estimate that from the data we have? (Hint: you can see what we did in the methods section of our paper)
## (3) Another feature of externally circulating strains is that they have a very low ratio of non-synonymous (1D) vs synonymous (4D) mutations(also known as pN/pS).The logic behind this is that most amino-acid replacements will harm the functionof a protein, and will be eliminated by natural selection on long timescales.
## Do you also observe low pN/pS values for the multi-colonization examples you identified above?
## (4) Bonus: you could also plot the frequencies of mutations as a function of theposition along thereference genome. Are they distributed uniformly? Or are thereanystrangepatterns you want to explain?