# BI278 - Project: Pathogenicity in Streptococcus
## Background:
### Streptococcus
Streptococcus (genus) contains a variety of species, some of which are known to cause illness or diseases of varying severity in humans. For instance, Streptococcus pyogenes, group A streptococcus, is the bacteria responsible for the very common “strep throat.” Streptococcus can also cause more serious ailments ranging from rheumatic fever to streptococcal toxic shock syndrome (STSS) to Necrotizing fasciitis (popularly referred to as the “flesh-eating disease”). Streptococcus is divided into several groups based on their effects. The most common groups are Group A, B, C, and G — although there are many more groups such as D, E, F, etc. Group A Streptococcus can cause infections that range from mild to life-threatening (strep throat or scarlet fever) and typically affect the throat and skin. They spread through contact with mucus or infected wounds. Group B streptococcus causes illness in people of all ages but is particularly severe in newborns, mostly causing sepsis, pneumonia, and meningitis. Group C and G streptococcus are less understood than Groups A and B and mostly common in cattle, causing conditions like eczema. The most pathogenic species typically fall in Groups A and B. These groups of Streptococcus are spread out across three larger categories by the type of hemolysis on blood agar: β-hemolytic, α-hemolytic, and γ-hemolytic. β-hemolytic species cause complete lysis of red blood cells: thus, this group is largely made up of groups A and B species and are pathogenic. α-hemolytic species cause incomplete lysis of red blood cells: thus, this group is largely made up of opportunistic pathogens. γ-hemolytic species do not cause any hemolysis, thus nonpathogenic, and are primarily made up of group D streptococci.
### Pathogenicity in Streptococcus
Preliminary research revealed that pathogenicity in Streptococcus relates to decreased genome size and increased presence of four classes of proteins: M-protein, Pilus protein, FN-binding protein, and Histidine triad protein (HTP).
#### M-Protein
The M protein has anti-phagocytic properties that prevent breakdown by neutrophils and promote the adhesion and invasion of GAS into host cells. The M-protein interacts with fibrinogen which may trigger the activation of an inflammatory response via neutrophils. This response is a key phenomenon in STSS. Diversity in M serotypes is associated with different geographical distributions and disease profiles.
#### Pilus Protein
The Pilus Protein extends from the streptococcal cell surface to mediate adhesion to pharyngeal cell surfaces. Pilus-deficient knockout mutants were less effective at adhering to the same cell line and forming cell aggregates. There are currently nine known variants of the pilus-gene island located within the fibronectin- and collagen-binding T-antigen region.
#### FN-binding Protein
FN-binding protein is a protein capable of binding fibronectin (FN), an extracellular matrix glycoprotein secreted by a range of host cells. The FN is bound during the initial adhesive stage before proliferation on the host cell surface and subsequent host tissue invasion. The FN-binding protein is significantly associated with death in STSS. An FN-binding protein gene, FbaB’s presence in mice, yielded 90% mortality compared to only 40% in the FbaB deficient mice.
#### HTP
HTPs are a family of surface-exposed proteins present in many pathogenic streptococcal species. While they play an essential role in bacterial pathogenicity in Streptococcus, little is known about their origin or evolution.
## Motivation:
Given the potential severity of Streptococcus in humans, the recent zoonotic conception of the Covid-19 pandemic, and the recent rise of antibiotic-resistant pathogens, we felt it would be important to be able to predict and discern the genomic factors relevant to the pathogenicity and non-pathogenicity of various Streptococcus species.
The species we chose to investigate are: S. pyogenes, S. agalactiae, S. dysgalactiae ssp. equisimilis, S. anginosus, S. intermedius, S. pneumoniae strain R6, S. pneumoniae strain D39, S. gallolyticus, S. pasteurianus, and S. infantarius. S. pyogenes is a Group A species while S. agalactiae and S. dysgalactiae ssp. equisimilis are both Group B species. We chose this species to represent the β-hemolytic category. Upon research, we learned that α-hemolytic species are primarily constituted of streptococcus pneumoniae and other streptococci of the viridians group. S. pneumoniae does not have subspecies so we chose two widely used strains: D39 (pathogenic) and R6 (nonpathogenic). In order to establish a foundation for comparison, we chose one species of the viridians group (S. anginosus) and two strains/subspecies of that: S. intermedius (pathogenic) and S. anginosus (nonpathogenic). To choose species to represent the γ-hemolytic category, we conducted more research: we found that group D streptococcus is largely made up of the overall S. bovis group. We chose to investigate S. gallolyticus, S. pasteurianus, and S. infantarius for the γ-hemolytic group since they are included in S. bovis and consequently group D. We've visualized our species for you in the table below.
**Chosen Species.**
| β-hemolytic | α-hemolytic | γ-hemolytic |
| ----------- | ----------- | ----------- |
| S. pyogenes | S. Viridians - S. anginosus: S. anginosus (non-pathogenic) & S. intermedius (pathogenic)| S. gallolyticus
S. agalactiae | S. pneumoniae: S. pneumoniaeR6 (nonpathogenic) & S. pneumoniaeD39 (pathogenic)| S. pasteurianus
S. dysgalactiae ssp. equisimilis| |S. infantarius
## Project Goal:
Our goal for this project was to identify traits relating to pathogenicity in Streptococcus. Specifically, we aimed to compare genome sizes and GC% content, analyze the number of genes present for each functional group, and explore the abundance of sequences related to pathogenicity conferring proteins across the β-, α-, and γ-hemolytic categories. As a compilation of all of our findings, we planned to demonstrate the “generalizability” of the knowledge we’ve gained by using the information we’ve uncovered about traits of pathogenicity in Streptococcus to classify a particular species (at random) into the correct group (β-hemolytic, α-hemolytic, and γ-hemolytic) and thus deduce its pathogenicity. We will accept or reject our hypothesis/judgment of this species through literature research.
## Methods/Discussion
### Step 1 - Downloading protein and genome files for all species
For the first step for our project, we downloaded the genome and protein files for all chosen species from the NCBI database (https://www.ncbi.nlm.nih.gov/genome/). You will find these files (along with all other important files) in /courses/bi278/sdivit25/Project. Recall that .fna files are genome FASTA files while .faa files are protein FASTA files.
### Step 2 - Find genome sizes and GC% content
To increase efficiency, we grouped commands to find genome size and GC content into one Unix script using the nano editor. If you want to access this shell script, it is saved as "GC%_GenomeSize.sh". To execute this command, recall the general format is:
``` sh your_script.sh file_name```
The code used for GC%_GenomeSize.sh is:
```
#!/bin/bash
grep -v ">" ./$1 | tr -d -c GCgc | wc -c
grep -v ">" ./$1 | tr -d -c ATGCatgc | wc -c
```
To obtain GC%, we divided GC content (smaller number) by genome size (larger number). In this step, we used our .fna files.
After running the shell script on all species genome files and performing the necessary calculations for GC%, our results looked like this:
| Species | Genome Size | GC% Content |
| -------- | -------- | -------- |
| S. pyogenes| 1746380| 673214/1746380 = 38.5%|
S. agalactiae| 2079123| 742893/2079123 = 35.7%|
S. dysgalactiae ssp. equisimilis| 2111518| 834350/2111518 = 39.5%|
S. anginosus| 1924513| 750364/1924513 = 38.9%|
S. intermedius| 1932951| 728858/1932951 = 37.7%|
S. pneumoniaeR6| 2038615| 809656/2038615 = 39.7%|
S. pneumoniaeD39| 2046116| 812431/2046116 = 39.7%|
S. gallolyticus | 2258003| 851944/2258003 = 37.7%|
S. pasteurianus| 2149841| 802052/2149841 = 37.3%|
S. infantarius| 1958742| 739170/1958742 = 37.7%|
To get a better sense of comparisions among the groups, we found the average genome size and average GC% content for each group. Our results looked like this:
| Category | Average Genome Size | Average GC% Content |
| -------- | -------- | -------- |
| β-hemolytic| 1979007| 37.9%|
α-hemolytic (pathogenic & non-pathogenic)| 1985549| 39.0%|
α-hemolytic (non-pathogenic)| 1981564| 39.3%|
α-hemolytic (pathogenic)| 1989534| 38.7%|
γ-hemolytic| 2122195| 37.6%
We noticed that genome size was higher for completely non-pathogenic species (γ-hemolytic). Additionally, we observed that α-hemolytic species (opportunistic pathogens) have a higher GC% content. Interestingly, there isn't a significant difference between β-hemolytic and γ-hemolytic species in terms of GC% content.
### Step 3 - Using OrthoFinder and iTol to Obtain and Visualize Species Tree
We wanted to see if the categorical groups we were investigating were phylogenetic clades. Thus, we decided to run Orthofinder on the species protein sequences (.faa files) in order to produce a species tree. We imported the SpeciesTree_rooted_node_labels.txt file into iTol (online tree viewer).
Our tree looked like this:

Here is the link to the tree on iTol if you are interested in exploring it further: https://itol.embl.de/tree/137146134141318481668580928
We ran OrthoFinder in our working directory with all the protein sequences using the following command:
```
orthofinder -f ./ -X
```
To import the species tree file into iTol (https://itol.embl.de/), click on the upload button and copy/paste the tree file content into the "Tree text" window. You can change settings to visualize the tree however you'd like (e.g. circular vs rectangular).
Our results confirmed that the categories determining pathogenicity are phylogenetic clades. An observation that stood out to us was that β-hemolytic species were more closely related to γ-hemolytic than α-hemolytic species. At the time where β- & γ-hemolytic species can be traced back to one ancestor, α-hemolytic species have a different common ancestor. This helped us put in perspective the trend we observed for GC% content across all categories.
### Step 4 - Using rpsblast to Analyze the Number of Genes Present for each Functional Group
We used rpsblast and COG profiles to see if the protein sequences of our species resemble any known protein profiles in the COG database. In doing so, we were able to get a count of genes present in each functional group for each species.
To automate the process of running rpsblast on each species, we grouped commands to find genome size and GC content into one Unix script using the nano editor. If you want to access this shell script, it is saved as "rpsblastforCOGS.sh".
Prior to running rpsblast, remember to establish where the system should look for the BLAST databases using the following code:
```
export BLASTDB="/courses/bi278/Course_Materials/blastdb"
echo $BLASTDB
```
The code used for rpsblastforCOGS.sh is:
```
#NOTE: to run this shell script, you need to use the species name instead of the filename.
rpsblast -query $1.faa -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > $1.faa.rpsblast.table
awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" $1.faa.rpsblast.table > $1.faa.cog.table
awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab $1.faa.cog.table > temp
awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1) } else { $3 = $3 }; print}' OFS="\t" temp > temp2
awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > $1.faa.cog.categorized
rm temp temp2
```
We imported the data from all sspecies.faa.cog.categorized files into excel to determine the counts of genes in each functional group for each species. We created a table containing our findings. We added a column to specify the group/category to which each species belongs in order to help with consequent visualization steps in R.
This is what our table looked like:
| Species| Group| Amino acid transport and metabolism| Carbohydrate transport and metabolism| Cell cycle control cell division chromosome partitioning| Cell motility |Cell wall/membrane/envelope biogenesis |Coenzyme transport and metabolism |Defense mechanisms |Energy production and conversion| Function unknown |General function prediction only| Inorganic ion transport and metabolism| Intracellular trafficking secretion and vesicular transport| Lipid transport and metabolism| Mobilome: prophages transposons |Nucleotide transport and metabolism| Posttranslational modification protein turnover chaperones |Replication recombination and repair| Secondary metabolites biosynthesis transport and catabolism| Signal transduction mechanisms |Transcription| Translation ribosomal structure and biogenesis|
| -------- | -------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | -------- |
| spyogenes|beta| 91| 107| 25| 4 |71| 56| 39 |55| 79| 69| 54| 9 |53| 7| 66| 46| 77| 8| 43| 85| 174|
sagalactiae|beta| 131| 119| 20| 2 |83| 69| 41| 51| 94| 78| 75| 10| 53| 15| 70| 52| 76| 11| 51| 85| 183|
sdysgalactiae|beta| 119| 144| 23| 4 |77| 66| 46| 62| 93| 87| 71| 8 |54| 29| 66| 53| 78| 19| 50 |90| 175|
sanginosus|alphaNP| 102 |120| 25| 6 |82| 53| 50| 54| 90| 77| 57| 9| 47| 17| 64| 43| 79| 11| 42| 99| 185|
sintermedius|alphaP| 95| 132| 26| 6| 76 |54| 47| 48| 98| 79| 56| 9|40| 3| 62| 46| 76| 16| 41| 99| 181|
spneumoniaeR6|alphaNP| 140| 161| 24| 6| 79| 64| 55| 46| 88| 74| 71| 10| 44| 14| 65| 56| 72| 9| 41| 89| 181|
spneumoniaeD39|alphaP|140| 161| 24| 6| 82| 64| 55| 47| 88| 74| 71| 10| 44| 14| 65| 57| 72| 9| 43| 90| 181|
sgallolyticus|gamma| 175| 135| 23| 5| 78| 91| 59| 62| 103| 97| 74| 8| 49| 6| 60| 49| 72| 16| 47| 122| 190|
spasteurianus|gamma| 150| 175| 23| 5| 83| 73| 52| 50| 98| 82| 65| 9| 50| 16| 63| 41| 77| 10| 59| 130| 179|
sinfantarius |gamma| 162| 75| 22| 6| 71| 74| 47| 49| 93| 71| 68| 12| 46 |12| 65| 53| 78| 11| 42| 97| 180|
If you want to access this table, it is saved as "functionalgroupcounts.txt"
### Step 5 - Visualizing Data of Genes Present for each Functional Group through PCA plot and side-by-side boxplots through RStudio
In order to draw conclusions from the data we obtained, we visualized it in RStudio (bi278.colby.edu). The code we used to generate the PCA plot is as follows:
```
x<-c("dendextend", "ggplot2", "vegan", "reshape2", "ggrepel")
lapply(x, require, character.only = TRUE)
cdata <- read.table("functionalgroupcounts.txt", sep="\t", h=T)
cdata.pca <- prcomp(cdata[,c(3:23)], scale=T)
cdata.pca.res <- as.data.frame(cdata.pca$x)
summary(cdata.pca)
ggplot(cdata.pca.res, aes(x=PC1, y=PC2)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5)
cdata.pca.res$Group <- cdata$Group
ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=Group)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=-.5)
cdata.pca$rotation
```
We decided to plot the first two principal components (PCs) as they separated the species the most and, according to the "summary(cdata.pca)" results, account for 50.84% of the total variance present in the data.
Our PCA plot looked like this:

To interpret the results, we looked at the data from "cdata.pca$rotation". This data gave us information on the relative influences of each functional group on each of the PCs. We observed that PC1 had higher positive influence from the General function prediction only, Function unknown, Coenzyme transport and metabolism, Transcription, and Amino acid transport and metabolism functional groups and a higher negative influence from the Cell cycle control cell division chromosome partitioning, Intracellular trafficking secretion and vesicular transport, Replication recombination and repair, and Nucleotide transport and metabolism functional groups. Meanwhile PC2 had higher positive influence from the Lipid transport and metabolism, Nucleotide transport and metabolism, Mobilome prophages transposons, and Signal transduction mechanisms functional groups and a higher negative influence from the Translation ribosomal structure and biogenesis, Defense mechanisms, Cell cycle control cell division chromosome partitioning, and Cell motility functional groups. We can use this information and thought process to interpret the PCA plot. For instance, we can see that β-hemolytic have higher abundances of genes that perform functions related to lipid and nucleotide transport and metabolism and signal transduction mechanisms. We can also see that the clustering/placement of species on the PCA plot matches the species tree we produced in step 3. More closely related species are closer together on the PCA plot as well.
The code we used to generate side-by-side boxplots is as follows:
```
#NOTE: This has to be done in the same R script as the previous step in order to work. If you plan to use a new R script, be sure to repeat the first three lines of code from the PCA plot code.
cdata -> cdata2
cdata.long <- melt(cdata2[,1:23])
head(cdata.long)
ggplot(cdata.long, aes(x=Group, y=value, fill=Group)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
```
Our box plots looked like this:

We observed that β-hemolytic species have lower abundance of genes that relate to amino acid and coenzyme transport and metabolism and transcription functional groups compared to γ-hemolytic species. α-hemolytic species have a higher abundance of genes that relate to the carbohydrate transport and metabolism functional group compared to β- and γ-hemolytic species. Within α-hemolytic species, pathogenic species tend to have a higher abundance of genes that relate to the carbohydrate transport and metabolism functional group than the nonpathogenic species.
### Step 6 - Downloading protein files for known proteins that confer pathogenicity
We downloaded the protein files for M-protein, Pilus protein, FN-binding protein, and HTP from the NCBI database (https://www.ncbi.nlm.nih.gov/protein/). Just like all other important files, you will find these in /courses/bi278/sdivit25/Project.
### Step 7 - Running a custom BLAST to explore the abundance of sequences related to pathogenicity conferring proteins across categories
We needed to make a local custom protein database to search against since we wanted to look for specific protein sequences. Recall that this is the first step when doing reciprocal BLAST so that you can identify BBH/RBH between a small number of genomes.
To make our custom BLAST database, we first created a directory in our working directory and added this to $BLASTDB. Then, we ran a command to create the protein databases for each species protein sequence. To make it easier, we put this command into a shell script ("makingcustomdb.sh"). The code in the shell script is as follows:
```
#NOTE: to run this shell script, you need to use the species name instead of the filename.
makeblastdb -in $1.faa -dbtype prot -title $1_prot -out blastDB/$1_prot -parse_seqids
```
Next, run the custom blast using each pathogenicity conferring protein as the query against each species protein sequence. To automate this process, we created a shell script ("runningblast2.sh"). The code in the shell script is as follows:
```
echo "mprotein"
blastp -query mprotein.faa -db blastDB/$1_prot -outfmt 6
echo "pilusprotein"
blastp -query pilusprotein.faa -db blastDB/$1_prot -outfmt 6
echo "FNprotein"
blastp -query FNprotein.faa -db blastDB/$1_prot -outfmt 6
echo "HTP"
blastp -query HTP.faa -db blastDB/$1_prot -outfmt 6
```
To obtain the abundance of pathogenic traits in each species, we counted the number of significant (different) hits for these results. We compiled our results into a table:
| Species | Group | M-Protein | Pilus Protein | FN-Binding Protein | HTP|
| -------- | -------- | ------ | --- | -------- |---|
| S. pyogenes| beta|18| 13| 12| 8 |
S. agalactiae| beta|16 |11| 12| 13 |
S. dysgalactiae ssp. equisimilis| beta|13| 7 |8 |13|
S. anginosus| alphaNP| 14 |11 |8 |8|
S. intermedius| alphaP|13| 14| 13| 9 |
S. pneumoniaeR6| alphaNP| 18| 6| 17| 9|
S. pneumoniaeD39| alphaP|18 |6| 17| 9 |
S. gallolyticus | gamma| 9| 16| 10| 4|
S. pasteurianus| gamma| 9 |9| 5| 4|
S. infantarius | gamma | 12| 10| 10| 10 |
If you want to access this table, it is saved as "blastsignificanthits.txt"
### Step 8 - Visualizing Data of Abundance of Pathogenic Traits in RStudio
In order to draw conclusions from the data we obtained, we visualized it in RStudio (bi278.colby.edu). We generated side-by-side boxplots using the following code:
```
ddata <- read.table("blastsignificanthits.txt", sep="\t", h=T)
head(ddata)
ddata.long <- melt(ddata[,1:6])
head(ddata.long)
ggplot(ddata.long, aes(x=Group, y=value, fill=Group)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
```
Our boxplots looked like this:

We observed that β-hemolytic species have a higher abundance of all proteins except the pilus protein when compared with the γ-hemolytic species. Interestingly, α-hemolytic species have a higher abundance of the FN-binding protein compared to β- or γ-hemolytic species. Within α-hemolytic species, pathogenic species tend to have a higher abundance of FN-binding protein than the nonpathogenic species.
### Step 9 (BONUS!) - Classify Streptococcus species at random into hypothesized group. Confirm or deny accuracy through literature research.
We chose to try to classify S. azizii using the information we've learned about pathogenicity in Streptococcus.
We downloaded the genome and protein files for S. azizii through the NCBI database. We found the genome size (2374041) and GC% content (1013444/2374041 = 42.7%).
Next, we ran rpsblast on the protein file for S. azizii. We imported the data in sazizii.faa.cog.categorized to create a new table. This table is identical to the table in step 4, except there is an extra row (species - S. azizii) and the group for this species is left blank. This plots it as its own group so that we can compare against the other groups to see where it best fits. If you want to access this table, it's saved as "functionalgroupcountsazizii.txt". To visualize the data, we generated boxplots in Rstudio using the following code:
```
#NOTE: Be sure to load the appropriate libraries prior to running these commands. They are the first two lines of the boxplot code in step 4.
cdataNEW <- read.table("functionalgroupcountsazizii.txt", sep="\t", h=T)
head(cdataNEW)
cdataNEW.long <- melt(cdataNEW[,1:23])
head(cdataNEW.long)
ggplot(cdataNEW.long, aes(x=Group, y=value, fill=Group)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
```
Our boxplots looked like this:

Finally, we performed our custom BLAST on sazizii to obtain the abundance of pathogenic traits in each species. We imported this data into excel to count the number of significant hits. We created a new table similar to that in step 7, except there is an extra row (species - S. azizii) and the group for this species is left blank. This was done for the same reasoning used to visualize rpsblast results. If you want to access this table, it's saved as "blastsignificanthitssazizii.txt". To visualize the data, we generated boxplots in Rstudio using the following code:
```
#NOTICE: This is similar code to previous boxplot generations!
ddataNEW <- read.table("blastsignificanthitssazizii.txt", sep="\t", h=T)
head(ddataNEW)
ddataNEW.long <- melt(ddataNEW[,1:6])
head(ddataNEW.long)
ggplot(ddataNEW.long, aes(x=Group, y=value, fill=Group)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
```
Our boxplots looked like this:

When analyzing this data, we noticed that GC% content of S. azizii was relatively high — which fits in with the α-hemolytic cateogory. Looking at the boxplots for abundance of genes in functional groups, we noticed that S. azizii closely matches the α-hemolytic cateogory (particularly non-pathogenic). The most notable and strongest evidence comes from the carbohydrate transport and metabolism functional group, which we previously noted as more abundant in the α-hemolytic group compared to β- and γ-hemolytic groups. Finally, looking at the boxplots for the abundance of known pathogenicty conferring proteins, we see that S. azizii matches the α-hemolytic nonpathogenic group in the FN-binding protein. We noted earlier that α-hemolytic species have a higher abundance of the FN-binding protein compared to β- or γ-hemolytic species. Considering all this data, we hypothesize that S. azizii is an α-hemolytic species that is likely non-pathogenic. Research proved that S. azizii is indeed an α-hemolytic species. S. azizii is a relatively novel species so not much is known about its pathogenicity. Given that it's an α-hemolytic species, it is likely to be an opportunistic pathogen. However, in line with the results we found, we think that it may remain non-pathogenic most of the time.