# Pathogenicity in *Yersinia* genomes
Koltin Htut, Duc Nguyen, Tom Mu
> **Project Summary**
>
> Here, we are studying genomes from various species within the *Yersinia* genus. The main goal of this study is to compare pathogenic and nonpathogenic species of *Yersinia*, so that we may have a better understanding of how pathogenicity may be expressed in the genome. The first comparison is of the overall genome size, and the difference in GC content. We also look at the orthologs in *Yersinia*, placing them into clusters of orthologous groups. Once clustered, the abundance of specific orthologous groups in the *Yersinia* species is compared. This tells us whether there are specific types of genes that are more present or absent between pathogenic and nonpathogenic species. The workflow is shown visually below. Understanding how pathogenicity presents within the genome will improve our ability to predict pathogenicity in unknown samples, as well as create improved gene focused vaccinations, leading to higher reduction in disease transmission. The overall workflow is displayed below.
>

1. **Genome size and GC content**
We start by comparing the genome size and GC content between pathogenic and non-pathogenic species.
- Locate reference genomes from NCBI genome database, and download to the working directory `/courses/bi278/dnnguy23/yersinia` (note: for *Yersinia kristensii* and *Yersinia enterocolitica*, non-reference genomes were sampled in order to provide complete genomes)
| Pathogenicity | Species |
| -------- | -------- |
| non-pathogenic | [*Y mollaretii*](https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_013282725.1/) |
|non-pathogenic|[*Y intermedia*](https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_009730055.1/) |
|non-pathogenic|[*Y aldovae*](https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_000834395.1/)|
|non-pathogenic|[*Y rodhei*](https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_000834455.1/)|
|non-pathogenic|[*Y similis*](https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_000582515.1/)|
|pathogenic|[*Y pestis*](https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_000222975.1/)|
|pathogenic|[*Y enterocolitica*](https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_001160345.1/)|
|pathogenic|[*Y pseudotuberculosis*](https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_000834295.1/)|
|pathogenic|[*Y ruckeri*](https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_017498685.1/)|
|pathogenic|[*Y kristensenii*](https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_000834865.1/)|
- Using a bash script, read the selected genomes for the overall length, GC content, and calculate GC %
```
#!/bin/bash
for FILE in Y*.fna; do
echo "$FILE"
var1=$(grep -v ">" `echo "$FILE"` | tr -d -c GCgc | wc -c)
var2=$(grep -v ">" `echo "$FILE"` | tr -d -c ATGCatgc | wc -c)
#gc=`echo "$var1" / "$var2" | bc`
echo "$var1, $var2"
awk -v v1=$var1 -v v2=$var2 'BEGIN {print v1/v2}'
done
```
This code will look at every *Yersinia* genome file within the working directory, and make two counts using `grep`. The first count will be of the entire genome, using every possible nucleotide as the filter. The second count will be of only the GC nucleotides in the genome, using G and C as filters. Once these two values are obtained, they can be easily divided to yield the total GC % for each species.
- Run the shell script
```
sh gc.sh
```
The results from the bash script can then be copied into an Excel file, where it is organized into a table, labelled yersinia_GC.csv, for viewing purposes.
- Import the obtained csv file into a dataframe `yersinia` in RStudio.
```
yersinia <- read.csv("yersinia_GC.csv")
```

- Create a plot of genome size, colored by pathogenicity. We scale the x-axis to million base pair (Mb). We use dotplots instead of boxplots because the boxplot is ineffective for small sample size.
```
gsize_plot <- ggplot(yersinia,
aes(x = factor(""), y = Genome.Size/1000000, color = Pathogen)) +
geom_jitter(width = 0.01, alpha = 0.5, aes(color = Pathogen)) +
geom_label_repel(aes(label = Species)) +
labs(title = "",
x = "", y = "Genome size (Mb)") +
scale_y_continuous(breaks = seq(3.6, 5.0, 0.2)) +
theme(plot.title=element_text(hjust=0.5), legend.position = "right") +
scale_color_manual(values = c("#00BFC4", "red"))
```
- Create a plot of GC%, also colored by pathogenicity.
```
gc_plot <- ggplot(yersinia,
aes(x = factor(""), y = GCpct, color = Pathogen)) +
geom_jitter(width = 0.01, alpha = 0.5) +
geom_label_repel(aes(label = Species)) +
labs(title = "", x = "", y = "GC%") +
theme(plot.title=element_text(hjust=0.5), legend.position = "none") +
scale_color_manual(values = c("#00BFC4", "red"))
```
- Use `gridExtra` library to display the two plots side by side.
```
library(gridExtra)
grid.arrange(gsize_plot, gc_plot, ncol = 2)
```

:::info
We found that genome size of *Yersinia* bacteria varies around 4.7 Mb, except *Y. ruckeri* which have an exceptional small genome (3.9 Mb). Genome size does not show significant difference between pathogens and non-pathogens.
We also found that the GC% of *Yersinia* varies around 47.4%, except *Y.mollaretii* with an exceptionally high value (49.1%). There is no significant difference in GC% between pathogens and non-pathogens.
Genome sizes and GC% among pathogens show less variance than non-pathogens.
:::
2. **Gene abundances in COG groups**
- Run Prokka for all genomes
- Run the Prokka annotation for each genome file using code below. (`Yruc_prok` is the name of the output directory that needs to change based on the current genome, and `Yruc.fasta` is the .fna file for the current genome). The file genomic.gbff is a reference genome within the *Yersinia* genus, *Yersinia pestis* specifically, which prokka uses to better identify coding sequences in the target genome.
```
prokka --force --outdir ./Yruc_prok --proteins ./genomic.gbff
--locustag Yruc --genus Yersinia --species ruckeri Yruc.fasta
```
After running the prokka annotation, several files should be generated in the output directory. Take a look at the files to see if it matches what you want. For example, look at the format of the .faa file, and look at the information in the .txt file and compare it with the information in the NCBI website.
- Run rpsBLAST to classify annotated genes into functional categories for each genome
- We need to establish the path for the BLAST database, and we used the database in the Course Material.
```
export BLASTDB="/courses/bi278/Course_Materials/blastdb"
echo $BLASTDB$
```
- Then we need to run the rpsBLAST on each genome. We first need to get a table full of COG matches.
```
rpsblast -query PROKKA_11082022.faa -db Cog -evalue 0.01
-max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle"
> Yruc.rpsblast.table
```
`PROKKA_11082022.faa` is the file to run rpsBLAST on, and `Yruc.rpsblast.table` is the name of the output of this code. The study species are closely related, so we are using a smaller e-value of 0.01 instead of the default e-value of 10.
- We want to only keep COG matches with at least 70% query coverage. This ensures that we are keeping high-quality COG matches.
```
awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t"
Yruc.rpsblast.table > Yruc.cog.table
```
`Yruc.rpsblast.table` is the input and `Yruc.cog.table` is the output, and they should be changed.
- Then we need to look up the functional categories for the COG numbers for each sequence.
```
awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]}
else {print $0}}' OFS="\t"
/courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab
Yruc.cog.table > temp
```
We use the cogname table in the Course Materials to generate the file with functional categories for each COG number.
- Then we want to only save the first category for each COG number.
```
awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1) }
else { $3 = $3 }; print}' OFS="\t" temp > temp2
```
- We want to add function descriptions to the categories.
```
awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else
{print $0}}' OFS="\t"
/courses/bi278/Course_Materials/blastdb/fun2003-2014.tab
temp2 > Yruc.cog.categorized
```
The `Yruc.cog.categorized` is the final file we want for the *Y. ruckeri* genome. The resulting categorized cog table should look as follows:
```
Yruc_00001 COG0593 L Replication, recombination and repair
Yruc_00002 COG0592 L Replication, recombination and repair
Yruc_00003 COG1195 L Replication, recombination and repair
Yruc_00004 COG0187 L Replication, recombination and repair
Yruc_00005 COG0561 H Coenzyme transport and metabolism
```
- Finally, we can delete the two temporary files.
```
rm temp temp2
```
- The above codes need to be run for each genome. All `.categorized` files are stored in `cog_result` folder. Change the working directory of RStudio to this folder.
```
setwd("/courses/bi278/dnnguy23/yersinia/cog_result")
```
- Next, we count the number of genes in each COG group.
+ Parse a gene table into a dataframe, for example *Y. aldovae*
```
yald <- read.csv("Yald.cog.categorized", sep = "\t", header = FALSE)
```

+ Load `dplyr` library. Create an aggregate table grouped by COG. It returns the gene counts in COG groups of *Y. aldovae*.
```
library(dplyr)
yald_tbl <- yald %>%
group_by(V3) %>%
summarise(total_count=n(),
.groups = 'drop')
```
+ If there are uncategorized genes, name its COG group "Unknown". Also add a new column specifying the species.
```
yald_tbl[yald_tbl$V3 == "",][1] <- "Unknown"
yald_tbl$species <- "Yald"
```
The complete species dataframe is shown below:

- Repeat above steps for all gene tables. Merge all dataframes into one.
```
merge <- rbind(yald_tbl, yent_tbl, yint_tbl, ykris_tbl, ymol_tbl,
ypes_tbl, ypseu_tbl, yroh_tbl, yruc_tbl, ysim_tbl)
```
- Add a new column specifying pathogenicity
```
merge$pathogen <- ifelse(merge$species %in% c("Yald", "Yint", "Ymol",
"Yroh", "Ysim"), "no", "yes")
```
- Finally, create a stacked bar plot of COG gene count.
Note that we need to create a customized color scheme first, because there are 25 COG groups to display, which exceeds the default color schemes in R.
```
c25 <- c(
"dodgerblue2", "#E31A1C",
"green4",
"#6A3D9A",
"#FF7F00",
"black", "gold1",
"skyblue2", "#FB9A99",
"palegreen2",
"#CAB2D6",
"#FDBF6F",
"gray70", "khaki2",
"maroon", "orchid1", "deeppink1", "blue1", "steelblue4",
"darkturquoise", "green1", "yellow4", "yellow3",
"darkorange4", "brown"
)
```
Then create the plot facetted between pathogens and non-pathogens.
```
ggplot(merge, aes(x = species, y = total_count, fill = V3)) +
geom_bar(stat = "identity", position = "stack", width = 0.5) +
scale_fill_manual(values = c25) +
labs(fill = "COG", y = "Count", x = "Species") +
facet_wrap(~ pathogen, scales = "free_x", labeller = "label_both")
```

- We change the scale of the bars to 1 to compare the proportion of COG groups within genomes.
```
ggplot(merge, aes(x = species, y = total_count, fill = V3)) +
geom_bar(stat = "identity", position = "fill", width = 0.5) +
scale_fill_manual(values = c25) +
labs(fill = "COG", y = "Proportion", x = "Species") +
facet_wrap(~ pathogen, scales = "free_x", labeller = "label_both")
```

- In order to investigate further, we create boxplots comparing gene counts for each COG group, e.g. X group.
```
ggplot(merge[merge$V3 == "X",],
aes(x = pathogen, y = total_count, color = pathogen)) +
geom_jitter(width = 0, height = 0) +
scale_color_manual(values = c("#00BFC4", "red")) +
theme(legend.position = "none") +
labs(y = "Count", x = "Pathogen")
```

:::info
Overall, the distribution of COG groups among species are somewhat similar, as they belong to the same genus. *Y. ruckeri* has smaller total gene count than other species, which may be explained by its small genome size.
We notice that pathogens have more genes in the X group than non-pathogens. The abundance of X genes in *Y. pestis* is unusually high.
:::
3. **Gene abundance by specific function in a COG group**
We attempt to divide the X group into more specific functional groups, then compare gene abundance across these groups among species.
- First, we look up all possible detailed functions of genes in the X group on [NCBI](https://www.ncbi.nlm.nih.gov/research/cog/cogcategory/X/). Download this standard reference and name it `cog_result_table.tsv`.

- Import the standard table into a dataframe `db_cog_x`
`
db_cog_x <- read.csv("cog_result_table.tsv", sep = "\t", header = TRUE)
`

- Next, we merge the X genes of each species, e.g. *Y. aldovae*, with the standard table using COG ID.
+ Filter genes that belong to X group into a new dataframe and rename its columns.
```
yald_cog_x <- yald[yald$V3 == "X",]
colnames(yald_cog_x) <- c("Gene", "COG", "Cat", "Annotation")
```

+ Merge X genes of *Y. aldovae* with the standard table, then select columns about gene ID, COG ID, and annotation of function.
```
yald_cog_x_compare <- merge(yald_cog_x, cog_db, by = "COG",
all.x = TRUE)[,c("Gene", "COG", "Annotation.y")]
```

+ Finally, create an aggregate table of genes based on their detailed functions in `Annotation.y` column.
Since some annotations show the same type of genes, e.g. "Transposase and inactivated derivatives, IS5 family" and "Transposase (or an inactivated derivative), DDE domain", we write the `cutAnnotation` function to group them.
```
cutAnnotation <- function(tbl) {
tbl$cutAnnotate <- tbl$Annotation.y
for (row in c(1: nrow(tbl))) {
words <- tolower(strsplit(tbl[row, "Annotation.y"], " ")[[1]])
firstword <- words[1]
if (firstword == "transposase") {
tbl[row, "cutAnnotate"] <- "Transposase"
}
if (firstword == "uncharacterized") {
tbl[row, "cutAnnotate"] <- "Uncharacterized"
}
if ("phage" %in% words | "phage-related" %in% words |
"bacteriophage" %in% words | "prophage" %in% words) {
tbl[row, "cutAnnotate"] <- "Phage-related protein"
}
}
return(tbl)
}
```
Then save the gene counts into a new dataframe `yald_cog_x_tbl`. Add a column specifying names of species.
```
yald_cog_x_tbl <- yald_cog_x_compare %>%
cutAnnotation() %>%
group_by(cutAnnotate) %>%
summarise(total_count=n(),
.groups = 'drop')
yald_cog_x_tbl$Species <- "Yald"
```
We obtain a dataframe as below. Here, *Y. aldovae* has 45 genes in category X, divided into 3 functional groups: 11 phage-related proteins, 2 plasmid stabilizers, and 32 transposases.

- Replicate above steps for all species. Merge aggregate tables into one. Also add a column specifying pathogenicity.
```
cog_x_merge <- rbind(yald_cog_x_tbl, yent_cog_x_tbl, yint_cog_x_tbl,
ykris_cog_x_tbl, ymol_cog_x_tbl,
ypes_cog_x_tbl, ypseu_cog_x_tbl,
yroh_cog_x_tbl, yruc_cog_x_tbl, ysim_cog_x_tbl)
cog_x_merge$pathogen <- ifelse(cog_x_merge$Species %in% c("Yald", "Yint",
"Ymol","Yroh", "Ysim"), "no", "yes")
```
- Finally, create a stacked bar plot for interspecies comparison of gene counts.
```
ggplot(cog_x_merge,
aes(y = total_count, x = Species, fill = cutAnnotate)) +
geom_bar(stat = "identity", position = "stack", width = 0.5) +
labs(y = "Count", fill = "Function\ngroup") +
scale_fill_manual(values = c25) +
theme(legend.position = "bottom", legend.text=element_text(size=12),
text = element_text(size = 20)) +
guides(fill=guide_legend(ncol = 2)) +
facet_wrap(~ pathogen, scales = "free_x", labeller = "label_both")
```

:::info
We found that genes that code for transposases and phage-related proteins attributes to most of X genes in all species. Pathogens have significantly more transposase genes than non-pathogens, except *Y. kristensenii*. Also, a transposase named RayT exists with high abundance in almost all pathogens, especially *Y. pestis* (40% of transposase genes in *Y. pestis* are RayT). These proteins are seem to be virulence factors. Its exceptional prevalence in *Y. pestis* may explain for the species' high pathogenicity in mammals.
The abundance of phage-related genes does not significantly differ between pathogens and non-pathogens.
:::