# Project Yersinia, Hamiltonella, and Regionella
## Overview
**Project Goal:** Find genomic factors that help differentiate pathogen genomes and symbiont genomes.
To start we read some benchmark papers about our taxa and potential genes that help in their pathogenicity or symbiotic nature. If you are interested in those papers, they are linked below:
[Genomic basis of endosymbiont-conferred protection against an insect parasitoid - PMC](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3246197/)
[Dynamics of genome evolution in facultative symbionts of aphids - PMC](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2955975/)
[Comparative and evolutionary genomics of Yersinia pestis - ScienceDirect
](//https://www.sciencedirect.com/science/article/pii/S1286457904002357?via%3Dihub)
From these papers, we identified a few genes and groups of genes that we wanted to look at after downloading all our genomes and going through different analysis pathways. For Pathogens we were interested in looking at: O-antigen biosynthetic pathways, type 6 secretion systems, two-component systems PhoPQ, and hemin transport. In the symbionts, we were really mostly interested in Type III secretion systems, but were aiming to look for more differentiating factors. Our project utilized Orthofinder and rspBLAST to get two different directions of genomic data and evidence to help build a case of differentiation between symbionts and pathogens.
## Downloaded Genomes
**From [NCBI](https://www.ncbi.nlm.nih.gov/) we downloaded 16 genomes.**
1 Outgroup: Pseudomonas aeruginosa
3 Yersinia Pestis. Strains: A1122, PBM19, and El Dorado
3 Yersinia Enterocolitica. Strains: NW56, FORC066, FDAARGOS_1090
3 Yersinia Pseudotuberculosis. Strains: IP32953, FDAARGOS_581, 72344
3 Regiella Insecticola. Strains: TUt, LSR1, 5.15
3 Hamiltonella Defensa. Strains: T5A, MEAM1, MED
To find the downloaded protein and genome files follow the path: /courses/bi278/enfere24/downloaded_genome/TAXA
TAXA directory names: P_aeruginosa, regiella, hamiltonella, y_pestis, y_enterocolitica, y_pseudoT
In each directory, there are the downloaded protein and nucleotide files we used to run OrthoFinder and rspBLAST, detailed throughout the project.
## File Organization
This is a more detailed description of the file organization:
We conducted all methods under Lee’s course file in filer found in `/courses/bi278/enfere24/`.
For the annotated genomes we then created a separate directory called `downloaded genomes` that contained a further six more directories for each of the species. This included `hamiltonella`, `P_aerginosa`, `regiella`, `y_entercolitica`, `y_pestis`, and `y_psuedoT`. In each directory, we had the annotated protein and nucleotide sequences. Our naming conventions were `taxa_strain_genomic.fna` for the nucleotide sequences and `taxa_strain_protein.faa` for the protein sequences. For example, we had: `downloaded_genomes\y_pseudoT\pseudoT_IP32953_genomic.fna` and `downloaded_genomes\y_pseudoT\pseudoT_IP32953_protein.faa`.
The naming convention of `taxa_strain_xxxxx.xxx` was used consistently.
Under `/courses/bi278/enfere24/` we also at a directory with all `rpsBLAST` data for each species. The results for `rpsBLAST` were contained under `rpsBLAST_results` which contained another directory for each strain using the naming convention `taxa_strain_results` and each director contained the tables.
There was another directory called `ortho_all` for our `Orthofinder` results. There was also a copy of each strain’s protein sequence for quick organization.
Our final directory was `OG0003387_iqtree` which contained the necessary information and files.
Photo figures were contained in the main directory as there were only three.
Any file contained outside of filer was in a sepearte Google Drive folder.
## Orthofinder
Run orthofinder on all protein files:
- Create a new directory that contains all the protein.faa files from all of our strains, in the `ortho_all` file
- We opted to `cp` all of the protein files from their downloaded location to the `ortho_all` directory
- While in Lee's folder (`enfere24`) we ran the command`cp downloaded_genome/TAXA/TAXA_STRAIN_protein.faa ortho_all/TAXA_STRAIN_protein.faa`
- Then with all the protein files, run orthofinder in that directory - `orthofinder -f ./ -X`
From orthofinder we get the Results directory in that folder so `/courses/bi278/enfere24/ortho_all/OrthoFinder/Results_Oct31` has all the data we used for ortholog analysis. We uploaded the orthogroups data into Google Spreadsheets for ease of manipulation and understanding.
Using ITOL you can make a tree to just double-check that all the relationships are the ones anticipated from the downloaded genomes and taxas.
PATH: /Results_DATE/Species_Tree/SpeciesTree_rooted_node_labels.txt
And this tree confirms our known relationships
This was just a test to make sure that things seemed to run correctly.
### Gene Descriptions from Orthogroup
To get information about all the orthogroups and the potential functions of the genes that make up those orthogroups, we ran these commands in Lee's directory on the Orthogroup file:
```
annotate_orthogroups --orthogroups_tsv Orthofinder/Results_Oct31/Phylogenetic_Hierarchical_Orthogroups/N0.tsv --hog True --fasta_dir ./ --file_endings faa --out N0.simple_annotation.tsv --simple True
annotate_orthogroups --orthogroups_tsv Orthofinder/Results_Oct31/Phylogenetic_Hierarchical_Orthogroups/N0.tsv --hog True --fasta_dir ./ --file_endings faa --out N0.detail_annotation.tsv --simple False
awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($1 in a){print a[$1],$0} else {print "no match"}}' OFS="\t" Orthofinder/Results_Oct31/Phylogenetic_Hierarchical_Orthogroups/N0.tsv N0.simple_annotation.tsv > N0.with_annotation.tsv
```
From that we uploaded the N0.with_annotation.tsv file to Google Sheets to analyze alongside the Orthogroup data. The N0.with_annotation.tsv file is in the `ortho_all` directory.
## Looking for specific genes in our Orthogroups
From our benchmark papers and research, we identified some specific genes that we believed pathogens share and ones that symbionts share. We wanted to see if those proteins were in orthogroups that were specific to Pathogens and Symbionts.
For reference, these, as stated above, were some of the factors we were interested in exploring:
| Pathogenicity Factors |
| -------- |
| O-antigen biosynthetic pathway |
| Type 1 Secretion System |
| two-component system PhoPQ |
| hemin transport |
|Symbiont Factors|
|-|
| Type III Secretion System|
Using the filter function on Google Sheets you can find the orthogroups that only the pathogens share or only the symbionts share. Going into the next two steps, pathogens refer to all strains of Yersinia Pestis, Enterocolitica, Pseudotuberculosis, and the outgroup. The symbionts are all strains of Regiella and Hamiltonella.
Here's the step by step of filtering the data:
1. Import the orthogroup file by creating a new spreadshet, cicking "import" under the file tab, choosing your file, and then choosing seperate by tab.
2. Import the orthgroup descriptions by creating a new sheet at the bottom and repeating the "import" part of step 1.
3. Highlight all of your spreadsheet

3. Click the filter button in the top right of the page to create a filter over everything

4. Then on any strain you want to restrict, click the little upsidedown triangle made of lines to get a pop-up menu

5. If you want to only include orthogroups that have this strain, unmark the blanks option and press "ok". This gives you all of the orthogroups that this strain is in.

6. Repeat these steps as many times as you need to filter the data further. We filtered it so we could find the orthogroups for which all of the Symbionts strains had proteins in and the same for Pathogens, to start our search.
7. After you have filtered out your desired orthogroup descriptions, you will want to create a new sheet with your desired orthgroups with the orthogroup description next to them. Create a new sheet and in A1 cell input the following code (replace 'Orthogroups Descriptions' and'Filtered Orthogroups'with your own sheet names):
```
=ARRAYFORMULA(
QUERY(
{'Orthogroup Descriptions'!A:C},
"select Col1, Col2, Col3 where Col1 is not null and Col1 matches '" & TEXTJOIN("|", 1, 'Filtered Orthogroups'!A:A) & "' order by Col1",
1
)
)
```

9. With your new sheet, you will be able to look at a much less overwhelming amount of orthogroups. For example, we filtered out one to include orthgroups that were in every single pathogen strain and our outgroup (which is also a pathogen). This left us with 1389 orthogroups. While still a lot, we were able to skim through the file and notice certain patterns such as "Type VI Secretion System" and using CMD + F on Mac or CTRL + F on Windows, we could search that phrase and find that there were 13 orthogroups in all our pathogens but none in all our symbionts sheet.
> A caveat of this is the annotation system is not universal. For example, the Type VI Secretion System was also labeled "T6SS" and the Iron-Sulfur clusters as "Fe-S" or "3Fe-4S". You also need to know what you are looking for because some names include your search terms but are unrelated. This is why manual search is better than having a program recognize a pattern.

10. Once you have a select group of patterns, you will want to compare them against the entirety of the orthogroups. Create a new sheet that has all the orthogroups with all their descriptions (a function is not needed, just use the copy function on the description column).

11. Create another new sheet. This will track the orthogroups of interest. You will want to go into your sheet with all the orthogroup descriptions and search for the patterns you noticed. For example "Type VI" and "T6SS". Copy each row and move it into your new sheet. After you are done, you may want to color code so it is easier to look at. You should also turn off cell overflow (Format > Wrapping > Clip)


12. Tally up the total number of orthogroups that are included in at least one strain of the species. Due to having 3 strains for each species, =COUNTA() will not work. We decided to count by hand since it didn't take long and to learn the code to count would have taken longer.
> There is a way to do this by creating a new column next to each species and running a function that checks if any of the cells to its left have text (you must copy down the whole column) and then counting that as true. Then tallying every true cell at the bottom. If you have extremely large sheets this would be a good option, however, for us, it was not an efficient use of our time.
13. Finally, create a new sheet with all your tallies. This is so you can run your analysis in Rstudio.

### Symbionts: Regiella and Hamiltonella
First create a spreadsheet with only Regiella and Hamiltonella using the filter function, which will look something like this:
Tying the descriptions from the last step we can see what the symbionts have in common. This looks like:
**Type III Secretion Systems**
Some factors that immediately popped out to us from our readings were type III secretion systems. We knew, based on our benchmark paper, "Dynamics of genome evolution in faculative symbionts of aphids" that our symbionts retained similar capacities for host cell toxicity and host invasion and that many of these could be attributed to similarity in secretion systems.

Unfortunately, we did not find type III secretion systems in our spreadsheet with orthogroups shared exclusively by symbionts. This discovery counteracts documentation on genome similarity between R. insecticola and H. defensa. We expected to discover a similar number of cog groups between H. defensa and R. insecticola, but significantly less within R. insecticola as the SPI-1 locus is documented as fragmented and non-functional. Strangely enough, we discovered 9 more T3SS groups in R. insecticola than in H. defensa. This may suggest that T3SS divergence may not be reliable for building assumptions about evolutionary relationships between the symbionts.
__**Summary**__
Type III secretion systems play a role in host invasion found in a common symbiont ancestor of both Hamiltonella and Regiella. However, there were also type III secretion systems found commonly in the pathogens. **Even more so, there are more type III secretion system genes in the pathogens.** Therefore, this connection and common ancestor probably goes back to both pathogens and symbionts and is not a reliable differentiating factor in the way we were hoping to find. Later we will discuss Type III secretion systems in more depth in both their pathenogenetic role.
**Group II Intron Proteins**
After looking into this list of shared proteins among our symbionts, but not the Pathogens, Group II Intron Proteins were shared.
>"Group II introns are ubiquitous self-splicing ribozymes and retrotransposable elements evolutionarily and chemically related to the eukaryotic spliceosome, with potential applications as gene-editing tools."(https://www.nature.com/articles/s41467-020-16741-4)
After finding this potential differentiating measure, in rpsBLAST analysis later we follow up on this lead.
### Pathogens: Yersinia Pestis, Yersinia Enterocolitica, Yersinia Pseudotuberculosis, and Pseudomonas Aeruginosa
Many of the factors we found linked to pathogenicity we did not expect nor were found in our benchmark papers (except genes related to heme). However, with further research, the orthogroup patterns we discovered are supported as being pathogenic factors, even if some are not directly related to pathogenicity.
We also utilized previous knowledge from classes on microbiology and personal interest on epidemiology and medical microbiology. This knowledge of pathogenic and virulence factors allowed patterns to be recognized faster. For example, we were aware of a general list of virulence factors: exotoxins, endotoxins, siderophores (allowing bacteria to get iron), capsules, and adherence factors.
Here is a table of the pathogenic factors we discovered (this is nowhere near comprehensive and we did find much more, however, we thought it was an adequate list).
|Pathogenic Factor | Role |
| -------- | -------- |
| Type VI Secretion System | A needle-like structure that allows the transport of proteins and ions out of the cells and into others. Primarily used to inject toxins in bacterial competition or subdue hosts. |
| Heme iron Acquisition | Aquire iron for use, usually from heme iron from other cells such as red blood cells. |
| Iron-Sulfur Clusters | Cofactors related to a large variety of cell processes, recently discovered to be related to sensory information and environmental adaptation. |
| Flagella | Cell motility that allows cells to evade hostile environments, infect hosts, and find nutrients. |
| Hemolysin | Breaks down red blood cells, related to heme iron acquisition. |
**Type VI Secretion Systems**
One of the patterns we discovered was that the pathogenic bacteria and outgroup contained a significant amount of genes related to the Type VI Secretion System (T6SS). They are a membrane-bound secretion system that transports proteins from inside the bacterium into another cell- in this case, eukaryotic cells- with their needle-like structure. They are known to be a pathogenic factor.
Information about the T6SS can be found here: https://asm.org/articles/2023/april/type-vi-secretion-systems-weapons-of-bacterial-des
While we did not originally expect to find T6SS, it is quite obvious that it was a predominant system with many genes related to it. There were 56 unique orthogroups we discovered with each pathogenic species having at least 30 and Y. Psuedotuberuclosis having 54. Meanwhile, the symbionts had far less, 6 and 7. Below is an image of the orthogroups and which strains contained them. The order of species goes: Pseudomonas Aeruginosa, Y. Entercolitica, H. Defensa, Y. Pestis, Y. Psuedotuberuclosis, and R. Insectacola.

The searches we used to find genes related to T6SS was "Type VI" and "T6SS". There may be other genes we missed due to consistant notation, but we did go through "secretion system" and did not find any other genes.
This makes sense as toxins are a way that bacteria compete with other organisms in their environment, and T6SS primarily transport toxins. Many toxins have also been linked to virulence, such as in the case of Vibrio Cholera which produces the cholera toxin.
There is a paper comparing the types of T6SS among Y. Pestis, Y. Psuedotuberuclosis, and Y. and also talks about Pseudomonas Aeruginosa, our outgroup. There is research that without a functional T6SS Y. pseudotuberculosis YPIII became nonfunctional (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6079546/) . This helps confirm our suspicions that the T6SS is a genomic factor related to pathogenicity.
In summary, the T6SS serves various purposes such as ion transfer, stress response, and inter-bacterial competition, but its significance as a pathogenic factor is particularly noteworthy. We found there is a large amount of genes involved in all the pathogenic strains and the few in the symbionts most likely render the structure inert. This means that *the Type VI Secretion System is a significant differentiating factor between pathogens and symbionts*. Other evidence, such as symbionts having significantly smaller genomes than pathogenic ones, but still containing the genes for T6SS, allows us to presume that the symbionts shared a common ancestor with the Yersinia taxa but underwent gene loss, rendering their T6SS ineffectual.
**Heme Acquistion**
Genes related to heme iron acquisition and siderophores are another gene type that we did not expect to find, however, our findings and research suggest they are another differentiating pathogenic factor. Iron is a necessary component for life to exist, but due to the location of certain bacteria, it may be hard to come by. In mammals, the majority is in heme molecules and therefore pathogens must get their iron from heme. However, due to heme’s toxicity in bacteria, this process must be highly regulated which means there must be a significant amount of genes related to heme acquisition (https://www.frontiersin.org/articles/10.3389/fcimb.2013.00055/full).
There was a total of 23 heme acquisition protein orthogroups, with the majority being found in the Yersinia taxa: 18 in Y. Enterocolitica and 13 in Y. Pestis and Y. Psuedotuberculosis. In our outgroup there were only 10, even though it is pathogenic, and 7 in H. Defensa and 6 in R. Insectacola. This can be seen in the image below.

The search term we used was “heme”. We did have to be careful as there were several words with heme in it, but unrelated to heme acquisition. We also did “iron” but discovered almost all of those were tied to the iron-sulfur acquisition, which we turned into a different section.
The presence of the majority of heme acquisition orthogroups in pathogenic bacteria was originally unexpected, but it does make sense. Pathogenic bacteria need to acquire iron from a source, and the host is the best option. The amount of orthogroups related is also most likely because “There are different forms of biological heme” (https://www.ncbi.nlm.nih.gov/books/NBK537329/). Bacteria can’t use heme but instead, extract the iron from the heme.
In summary, the Yersinia taxa had the majority of heme-related orthogroups with less in the outgroup and even fewer in the symbionts. *Due to iron being a necessity in bacteria, it appears that pathogenic bacteria require genes related to heme acquisition and are most likely a differentiating pathogenic factor*. However, we are unsure of how symbionts acquire their iron, but believe that it may be able to provide further insight into a further differentiation factor.
**Iron-Sulfur Clusters**
While exploring heme-related orthogroups, we observed that nearly all searches for "iron" included "-sulfur" at the end. Iron-sulfur (Fe-S) clusters serve as a cofactor essential for numerous biochemical processes. The expectation was to find these clusters in equal numbers since they play a crucial role in a diverse range of processes, encompassing electron transfer and storage, as well as redox and nonredox chemical catalysis (source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3430481/). However, this is not what we found, instead a large portion was exclusive to pathogenic bacteria.
There was a total of 20 iron-sulfur-related proteins, with the majority being found in the Yersinia taxa: 17 in Y. Enterocolitica and 15 in Y. Pestis and Y. Psuedotuberculosis. In our outgroup, there were 13, and 6 in both H. Defensa and R. Insectacola.

The search terms we used were “iron”, “fe-s”, and then “fe-” and then search by hand. This is because some iron-sulfur clustered were labeled as “#Fe-#S” and wouldn’t appear in our “fe-s” search. “Iron-sulfur” did not suffice either due to differences in annotations. We did not find anything for "siderophore".
This was quite obviously unexpected and we did not know. Recently, it was discovered that it may have a role in “act[ing] as a sensor of the environment and enabl[ing] the organism to adapt to the prevailing conditions” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3430481/). While this is not necessarily pathogenic-specific, in a broader context it would make sense.
Our chosen symbionts live in pea aphid digest systems and therefore do not experience a large range of different environments they need to adapt to. However, our pathogenic species do encounter a large variety of environments. For example, Yersinia pestis infects a large number of mammals along with different tissues. This means that a sensor that utilizes iron-sulfur clusters would be extremely helpful in adapting to different environments.
In summary, *iron-sulfur clusters are an unexpected, but most likely valid, factor in pathogenic differentiation due to the discovery of its use in sensing the environment and allowing the organism to adapt*. Pathogenic bacteria encounter various environments as compared to symbionts.
**Flagella**
With the discussion of allowing bacteria to adapt to several environments, the flagella are another one. They are involved in cell motility and are a known virulence factor. They allow bacteria to reach infection sites, avoid hostile environments, and access nutrients (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4223308/). However, they are also found in a large number of bacteria, many nonpathogenic.
There was a total of 68 flagella proteins that we could find, with the majority being found in the Yersinia taxa: 50 in Y. Enterocolitica, 56 in Y. Pestis, and 53 in Y. Psuedotuberculosis. In our outgroup, there were only 29, and 9 in H. Defensa and 17 in R. Insectacola.

The search term we used was “flagell” due to not wanting to omit any orthogroups that contained the words “flagellum” or “flagella”.
Flagella are complex organelles and “this structure requires more than 50 proteins for its biogenesis and function” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4223308/). However, another paper said that only 30 were needed (https://www.ncbi.nlm.nih.gov/books/NBK6250/). Even with the lack of consistency in the number of genes and proteins needed for flagella formation, the number of orthogroups found in H. Defensa and R. Insectacola means they are nonfunctional. On the other hand, the Yersinia species most likely all have functioning flagella. Finally, the outgroup is known to have flagella (https://journals.asm.org/doi/10.1128/iai.66.1.43-51.1998).
This is understandable because once again, Yersinia taxa come across a broad range of different environments, and therefore to be motile means a much higher chance of survival. It also aids in infection chance. On the other hand, if the symbionts did branch off from a common ancestor that did have flagella, the loss of the flagella may prove helpful due to the symbionts not needing to move and find further nutrients. Their ability to infect is not impacted by the necessity of a flagellum either, for they are passed down maternally (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2690004/). Finally, the lack of the number of flagellar orthogroups in the P. aeruginosa outgroup as compared to Yersinia is not explained but may indicate different abilities contained (such as the ability to suppress flagellar assembly once in a good environment).
In conclusion, the presence of flagella orthogroups is not necessary for being pathogenic. However, it is a pretty clear pathogenic differentiating factor due to cell motility being extremely beneficial in host infection and environmental adaption.
**Hemolysin**
When discussing heme acquisition, there was the thought that heme could be obtained from red blood cells. Hemolysin is a well-known virulence factor that breaks down red blood cells in a variety of different ways. We decided to explore this specific subgroup to see if it was a unique distinguishing factor.
There was a total of 5 hemolysin proteins: 5 in Y. Enterocolitica, 5 in Y. Pestis, 3 in Y. Psuedotuberculosis, 3 in our outgroup, 2 in H. Defensa, and 4 in R. Insectacola.

The single search term was “hemolysin” due to its lack of variety.
While our previous four pathogenic factors were quite clearly present in the majority of pathogenic strains and not in symbionts, it appears that symbionts also possess a significant amount (compared to the total). This was surprising, as insects do not have blood or red blood cells so it’s not useful to the symbionts. There may be another use for the hemolysin proteins that is not known or gene loss hasn’t occurred yet. However, it does appear that every strain of Yersinia Pestis contains all 5 orthogroups.
In summary, *the presence of hemolysin orthogroups is not a differentiating factor in pathogenicity*. We found similar amounts in each strain with slightly more in the pathogenic strains, due to the low amount of orthogroups, this was not conclusive. Hemolysin orthogroups being present in the symbionts either indicate a lack of gene loss from their common ancestor or a previously unknown use of these hemolysin proteins.
**Other**
There were other genes were discovered but were not included due to low quantity. For example, proteins related to biofilm production. A biofilm is when cells stick together and usually a surface in a self-produced extracellular matrix (mostly a slime-like substance). We considered them a pathogenic factor due to prior knowledge of how the extracellular matrix prevents antibiotics from working. This is related to a capsule which is another extracellular matrix, however, it has a rigid structure and is a known virulence factor.
## rpsBLAST Analysis
### Running rpsBLAST
Run rpsBLAST for all the different protein files to obtain COGs to do analysis with.
The example code uses Regiella Strain 515 protein file blasted against the given class blast database. This was replicated for each protein file and ran in Lee's directory.
----
`rpsblast -query regiella/regiella_515_protein.faa -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > regiella_515_prot.rpsblast.table`
`awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" regiella_515_prot.rpsblast.table > regiella_515_prot.cog.table`
`awk -F'[\t,]' 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab regiella_515_prot.cog.table > temp1`
`awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1) } else { $3 = $3 }; print}' OFS="\t" temp1 > temp2`
` awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > regiella_515_prot.cog.categorized`
---
All of the results are in the `rpsBLAST_results` directory with an inner directory for each strain which contains the original rspBLAST output and also the categorized data.
---
`(PATH:/courses/bi278/enfere24/rpsBLAST_results/TAXA_STRAIN_result)`
---
For a directory with just all the categorized outputs, which were utilized in the COG visualization, use the PATH:/courses/bi278/enfere24/all_rpsBlast_data/TAXA_STRAIN_prot.cog.categorized
### Principal Component Analysis
To run the PCA, first, we had to read all of the COG data from our files. So in RStudio we ran these commands.
Read in each rpsBLAST result into RStudio, utilizing the shortened path. And then name each table correctly. We ran this for each strain:
```
> r515 <- read.delim("all_rpsBlast_data/regiella_515_prot.cog.categorized", h=F)
#this is repeated for each strain and their respective rpsBLAST data
> cat.eentFDA <- as.data.frame(table(entFDA$V3))
> colnames(cat.eentFDA) <- c("COG", "all")
#this is repeated for each strain that was read in
```
Then build the data table to view and edit:
```
> cat.comp <- cbind(cat.eentFDA[,2],cat.eentFOR[,2],cat.eentNW56[,2],cat.hhamMEAM1[,2],cat.hhamMED[,2],cat.hhamT5A[,2],cat.PPaeru[,2],cat.ppesA1122[,2],cat.ppesElDorado[,2],cat.ppesPBM19[,2],cat.ppseu72344[,2],cat.ppseuFDAA[,2],cat.ppseuIP32953[,2],cat.rr515[,2],cat.rrLSR1[,2],cat.rrTut[,2])
>
> df.cat <- as.data.frame(t(cat.comp))
>
> rownames(df.cat) <- c("entFDA","entFOR","entNW52","hamMEAM1","hamMED","hamT5A","Paeru","pesA1122","pesElDor","pesPBM19","pseu72344","pseuFDAA","pseuIP32953","r515","rLSR1","rTut")
>
> colnames(df.cat) <- cat.PPaeru$COG
>
> df.cat[is.na(df.cat)] <- 0
>
> df.cat[,1:25] <- lapply(df.cat[,1:25], as.numeric)
>
> df.cat$genome <- row.names(df.cat)
>
> x<-c("ggplot2", "vegan", "reshape2", "ggrepel", "dendextend", "gridExtra")
>
> lapply(x, require, character.only = TRUE)
>
> View(df.cat)
>
> rownames(df.cat) <-df.cat$sample
>
> rownames(df.cat) <- c("entFDA","entFOR","entNW52","hamMEAM1","hamMED","hamT5A","Paeru","pesA1122","pesElDor","pesPBM19","pseu72344","pseuFDAA","pseuIP32953","r515","rLSR1","rTut")
>
> rownames(df.cat) <-df.cat$sample
>
> ggplot(df.cat, aes(x=B, y=C)) + geom_point() +
+ geom_text(aes(label=genome), hjust=-.3, vjust=0)
> ggplot(df.cat, aes(x=D, y=F)) + geom_point() +
+ geom_text(aes(label=genome), hjust=-.3, vjust=0)
```
Looking at COGs B and C we got a distinction of

And looking at COGs D and F we got:

And again looking at COGs J and K we got:

**Summary from our PCA**
```
> df.cat.pca <- prcomp(df.cat[,c(3:10)], scale=T)
> cdata.pca.res <- as.data.frame(df.cat.pca$x)
> summary(df.cat.pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 2.3498 1.5405 0.23311 0.21317
Proportion of Variance 0.6902 0.2966 0.00679 0.00568
Cumulative Proportion 0.6902 0.9868 0.99360 0.99928
PC5 PC6 PC7
Standard deviation 0.06559 0.02929 0.02151
Proportion of Variance 0.00054 0.00011 0.00006
Cumulative Proportion 0.99981 0.99992 0.99998
PC8
Standard deviation 0.01279
Proportion of Variance 0.00002
Cumulative Proportion 1.00000
```
And then trying to plot this information - and this jumps to cdata as the new data is used.
```
> ggplot(cdata.pca.res, aes(x=PC1, y=PC2)) + geom_point()+
geom_text(aes(label=rownames(cdata.pca.res)), hjust=1,vjust=- +.5)
> cdata.pca.res$genome <- df.cat$genome
> ggplot(cdata.pca.res, aes(x=PC1, y=PC2, color=genome)) + geom_point() + geom_text(aes(label=rownames(cdata.pca.res)), + hjust=1,vjust=-.5)
```
After combining all the rspBLAST data from each strain into a single table (in the code called `df.cat`). We ran a PCA to look for specific COG differences. No matter the two COGS we separated by, we saw the Yersinias with great distinction from the Symbionts.
Furthermore, PC1 and PC2 collectively represented most of the differences outlined in the data. With a combined proportion of 98.68%.

When switching to better data (see the next section below), this looks like:

So the relationships are fairly maintained moving to percentages from the raw amount of COGS in each group type. But overall, there is a great distinction between our pathogens and symbionts based on COG group data.
#### Changing Input Information
After looking at PCA and thinking about what these results actually mean, these numbers and graphs are overall rather unhelpful. We know our Symbionts have smaller genomes, therefore, they will almost always have a smaller number of genes in each COG. Therefore, taking the data table we made and turned each value into the percentage of a specific type of COG genes over the total amount of genes that were in COGs. This was done in Google Sheets and looks like this:

The top is the original values and the bottom is the new values based on percentages. This creates a graph like:

And while this is rather messy, we can see that our Symbionts have a higher percentage of these "J" COGS and we can change the amount of COGS we are seeing to get a closer look.
Looking now at only select COG groups, we can see clear differences between our Pathogens and Symbionts:

J COGs are classified as "Translation, ribosomal structure and biogenesis" proteins
E Cogs are classified as "Amino Acid transport and metabolism"
### DOT PLOT
Using the percentage data, we can create a dot plot for which we can see which strains have high amounts of specific COG groups.
In R we ran the following code, first having to change out data type to long and this is what that looks like:
```
> cdata.long <- melt(cdata[,1:11])
Using sample, taxonomy as id variables
> head(cdata.long)
sample taxonomy variable value
1 a1 entFDA A 0.0003373819
2 a2 entFOR A 0.0003272251
3 a3 entNW52 A 0.0003304693
4 a4 hamMEAM1 A 0.0009980040
5 a5 hamMED A 0.0009505703
6 a6 hamT5A A 0.0009017133
```
Then plotting the actual dot plot, we can visualize the differences of each taxa and strain. This illustrates the same information as the Google Sheets graph, but each group is now distinguished.
```
> ggplot(cdata.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_boxplot() + facet_wrap(.~variable, scales="free")
> gene2.long <- subset(cdata.long, variable=="J")
> ggplot(gene2.long, aes(x=taxonomy, y=value, fill=taxonomy)) +
+ geom_boxplot()
> ggplot(cdata.long, aes(x=taxonomy, y=value, fill=taxonomy)) + geom_dotplot(dotsize=2, binaxis='y', stackdir='center') + facet_wrap(.~variable, scales="free")
```

From the Google Sheets data, we can take only the more informative COG groups to create a similar graph. Here, we can see how our J and E cog groups present themselves.

## Follow-up Question
The next question was if there was a way that this information could back up our information from orthofinder. In other words, were the pathogenicity factors we identified consistently in a specific COG group and could we match that information with some sort of meaningful connection? This would mean going from orthogroup to protein ID to COG type and we did this in Spreadsheets.
First, we created a new spreadsheet to do this analysis that looks like:

And the steps to get from one column to the next were done manually, looking at the orthogroup pathogen spreadsheet to get individual gene IDs. Then manually looked at each rpsBLAST result to find the COG information.
From those results, we can generally tie certain pathogenicity factors to COG groups.
| Pathogenicity Factor | COG | COG Function|
| -------- | -------- | -|
| Proteins that break down red blood cells | U | Intracellular trafficking, secretion, and vesicular transport|
| Type 6 Secretion Systems | O/U | Posttranslational modification, protein turnover, chaperones/Intracellular trafficking, secretion, and vesicular transport|
| Hemeoglobin related proteins | S/O | Function unknown/Posttranslational modification, protein turnover, chaperones|
| Flagella related proteins | N | Cell motility|
| Iron sulfur | O/C | Posttranslational modification, protein turnover, chaperones/Energy production and conversion |
| Biofilm formation, virulence factor. | ?* | No COG data|
Then look at the relationships of these specific COG groups with Google Sheets as done previously.

We can see that there is a little bit of a relationship of C being a bit higher in the pathogens than the symbionts, as well as U and N to a lesser degree. From this information, it really neither confirms nor denies our Orthogroup analysis findings.
### Cluster Dendrogram from Hierarchical Clustering
From the `df.cat` table we created above, we can use that data to create a cluster dendrogram to again check that we're able to see that our known relationships are still represented in the data.
```
> disMax <- vegdist(df.cat[,3:10], method="bray")
> df.clust <- hclust(disMax, method="ward.D2")
> dendro %>%
+ set("leaves_pch", c(19)) %>%
+ set("leaves_cex", c(2)) %>%
+ set("leaves_col", as.numeric(cutree(df.clust, k=3)[df.clust$order])+1) %>% plot(type = "rectangle")
```

4-6: hamiltonella
14-16: regiella
7: outgroup
1-3: entercolitica
8-10: pestis
11-13: pseudo
From this cluster, we can see that the size of the genome and the amount of genes in each COG type separate our symbionts and pathogens and outgroup. This strengthens the other information we are getting from this table of data.
## Summarized Results
From following all of the steps provided and working through this packet, here are the overall main results we found to be notable. From using these comparative methods, we can start to find the key genomic factors that differentiate related pathogens and symbionts.
### Type 3 Secretion Systems (using Orthofinder)
Following the results from our benchmark papers, we were expecting to be able to use Type 3 Secretion systems as a differentiating factor with the assumption that only our Symbionts would carry these genes. To our surprise, we discovered TS33 genes in all taxonomic groups. After taking a closer look at the data, we noticed that although all T3SS proteins carried by our symbiont were present within our Yersinia strains, there were a few Yersinia orthogroups exclusive to our pathogen. The following are the aforementioned orthogroup IDs. A major caveat of this discovery and assumption is the total n of orthogroups between pathogens and symbionts. As our pathogens generally had larger genome sizes, our discovery is somewhat expected. Our pathogens had an average total of 28 more genes associated with T3SS than symbionts.
---
* ``OG0003286|OG0003287|OG0003288|OG0003289|OG0003290|OG0003291|OG0003292|OG0003293|OG0003294|OG0003295|OG0003296|OG0003297|OG0003298|OG0003299|OG0003300|OG0003301|OG0003302|OG0003303|OG0003304|OG0003305|OG0003306|OG0003308|OG0003309|OG0003310|OG0003311|OG0003629|OG0003986|OG0003987|OG0003988|OG0003989|OG0003990|OG0003997|OG0003998|OG0004000|OG0004001
``
---
We were then interested in identifying commonalities between the proteins that were exclusive to Yersinia species that may reveal information about virulence factors associated with T3 Secretion Systems. We discovered that of all T3SS subunit proteins, Yop,Ysc, and other Syc-like effector proteins were only present in our pathogen. We discovered an Lcr protein chaperone gene available within Regiella insecticola, but no effector proteins. This may suggest that the Lcr coding complex in Regiella may not be functional.
**Complete list of T3SS Orthogroups exclusive to Yersinia**

For more information, and data supporting our assumptions, check the following article:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC97735/
-- --
## Pathogenic distinguisher factors (using Orthofinder)

The graph above shows the number of orthogroups for each species by gene type in a grouped bar graph. While originally meant to be a bubble heat map, google sheets did not have this capability. However, this shows are results quite well.
### Type 6 Secretion Systems
While we did not originally expect to find that the T6SS is a pathogenic factor, the graph indicates it most likely is. Due to its role in ion transfer, stress response, and inter-bacterial competition, but primarily injects toxins into host cells, this would make sense.
There is a large amount of orthogroups in all the pathogenic strains and a few in the symbionts, 6 and 7 out of 56. This renders the structure inert.
T6SS function similarly to T3SS, which are described above. The high amount of T6SS genes with several being shared with the symbionts indicates that the symbionts shared a common ancestor with the Yersinia taxa but underwent gene loss, rendering their T6SS ineffectual. This was likely due to the lack of needing to infect host cells and competition within the pea aphids.
The number of Type VI Secretion System orthogroups in each strain demonstrates that it is a significant differentiating factor between pathogens and symbionts.
### Heme (and Iron) acquisition
The presence of Heme acquisition proteins is much less clear. While we did not originally expect this with our benchmark papers, it’s still promising. The Yersinia taxa had the majority of heme-related orthogroups with less in the outgroup and even fewer in the symbionts. However, while most of the symbionts contained the same orthogroups as the pathogenic strains, there were 3 orthgroups not found in Yersinia taxa: OG0003910 heme acquisition protein HasA, heme lyase NrfEFG subunit NrfF. the first one may not have been added to the ortho group `OG0001346 heme acquisition protein HasA` in Othofinder. The other one is unexplained. There was also `SUMF1/EgtB/PvdO family nonheme iron enzyme` which was found to be R. insectacola which is interesting for it is specifically a nonheme enzyme.
It appears that a high presence of heme acquisition orthogroups is likely a differentiating pathogenic factor. However, we are unsure of how symbionts acquire their iron but believe that the nonheme iron enzyme orthogroup may have something to do with it, despite that also being found in our outgroup.
### Iron-Sulfur Clusters
Iron-sulfur clusters are noted for various functions. Perhaps their most notable role is their association with oxidation-reduction reactions of electron transport. Iron-sulfur cluster proteins are also noted for the role they play in gene expression and regulation. Although this wasn't part of the set of characteristics we assumed with define pathogenicity, we discovered a clear difference in their size between our pathogens and symbionts.
Iron-sulfur clusters as a distinguishing factor are not as clear as other gene groups such as T6SS or flagella. However, we still propose that a significantly high amount of these orthogroups may be a distinguishing pathogenic factor. Iron-sulfur cofactors are indeed in every organism, however, a significant amount is most likely related to sensory information which would allow the organism to respond to stimuli in a varying environment.
This comes with the caveat that due to some iron-sulfur orthogroups not having incredibly clear annotations, our numbers may change. This distinguishing pathogenic factor may also be misleading as not all symbionts remain in the same environment.
### Flagella
The flagella orthogroup is a very clear distinguishing pathogenic factor in our selected taxa. The numbers are quite clear on the graph because it’s a complex organelle that needs a range of genes and its ability to allow cell motility makes it extremely beneficial in host infection and environmental adaption. The orthogroups found in the symbionts are most likely left over from the common ancestor after significant gene loss.
### Hemolysin
Finally, hemolysin did not appear to be a distinguishing pathogenic feature in our research. First, the amount of orthogroups was too small to confidently formulate any assumptions. Even though symbionts tended to have fewer hemolysin-associated orthogroups, it is unclear if it is due to random chance or specific gene loss. Even though evidence suggests that comparing differences in hemolysin content should be a reliable method for identifying pathogenicty, we were unable to make this assumption because pea aphids and insects do not have blood and hemolysin is a known virulence factor.
### COG J and X (using rspBLAST and COG analysis)
From our rspBLAST analysis, we can see a pattern of differentiation based on COG makeup. We can see that Symbionts have a much higher percentage of J type COGS. Also, the group II introns that we identified with Orthofinder, is in the COG type X, which we see symbionts having more of.
In terms of the pathogens, we see the Yersinias having much higher E and G COG types.

To create this graphic, rather than each strain being split up and its own color, we added all the strains in their own group using the `=SUM` function in Google Sheets and then dividing by the number of strains. So 9 for the Pathogens and 6 for the Symbionts.

The red star is on the table of data used to see the information in the graphics above. The blue star is the condensed data that you see above.