---
title: 'Orthofinder'
disqus: hackmd
---
By Tanya Lama
## Objective
We will use CAFE to analyze protein family evolution (gene family expansion/contraction) for the bat1k longevity project. For our comparative analyses, annotations have been generated using TOGA by Ariadna Morales and Michael Hiller.
However, we need to validate the copy numbers extracted from TOGA, specifically from the orthology.tsv files. Lexi was supposed to pursue this but did not.
As an alternative, Tanya is extracting copy numbers directly from the assemblies using OrthoFinder:
- from protein FASTA sequences (generated by TOGA as well)
## Table of Contents
[TOC]
## Learning from [Jebb et al.](https://www.nature.com/articles/s41586-020-2486-3#Sec10)
> To investigate expansions and contractions of protein families, we used CAFE with a false discovery rate < 0.05 cut-off. As input for CAFE, we clustered Ensembl-annotated proteins into families using POrthoMCL and the PANTHER Database (v.14.0) and our ultrametric time tree, generated using r8s.
Additional detail was provided [in the supplement](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-020-2486-3/MediaObjects/41586_2020_2486_MOESM1_ESM.pdf)
> To investigate expansions and contractions of protein families, we used CAFE (v4.0)131. CAFE requires annotated proteins assigned to families. To this end, we downloaded GFF3 files from Ensembl which were available for 25 of the 42 considered species. To obtain a single isoform from each gene, we used genePredSingleCover (https://github.com/ENCODE-DCC/kentUtils.git) to obtain the longest transcript from each locus, and translated this transcript to a peptide sequence. The POrthoMCL pipeline (https://github.com/etabari/PorthoMCL, accessed 12/6/2019; commit dec8e5f), a parallel implementation of the OrthoMCL algorithm, was used to cluster proteins into families. All families were assigned PANTHER Database (v14.0) IDs, based on the human genes contained in a family. Families assigned the same ID were merged. Where POrthoMCL families were composed of multiple families, all families with overlap based on PANTHER IDs were merged. To obtain families that were already present in the Placentalia root, we retained those families where at least one member was present in all bats, at least one of human, mouse or rat (representative Euarchontoglires) and one Atlantogenata species. Families which varied in size by more than 100 members between the species with the highest and lowest count were also removed. CAFE was then used to identify families which underwent expansion or contraction at the base of bats, using the previously produced ultrametric time tree (see Supplementary Note 4.2 above). The caferror.py function was used to estimate an error model for the data, which was used in further analysis. A single lambda, or birth/death parameter, was inferred for the entire tree. Families were retained if estimated to have undergone a significant expansion or contraction at the ancestor of all bats with an FDR value < 0.05 (Supplementary Table 12).
## Installing Orthofinder and dependencies on Unity
We created a conda environment called orthofinder for all dependencies.
Be sure to `source activate orthofinder` before you submit any jobs
## Set up
Nucleotide and protein FASTA files are located in the /project space
`/project/tlama_umass_edu/projects/project_bat1k_longevity/data/ `
### 1_raw_protein_fastas
We want to retain just the query protein sequences from each protein.fasta file, and drop all of the reference (human) sequences.
Create a folder with all of your protein_fasta files (output from toga) called 1_raw_protein_fastas and create a list called list_protein_fastas
```
ls ./*.prot.fa > list_protein_fastas
```
### We automated this filtering for all of our protein fasta files using `sbatch_grep_query.sh`, which submits `grep_query.py`:
```
#!/bin/bash
#SBATCH --job-name=grep
#SBATCH --output=grep_%j.out
#SBATCH --error=grep_%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --time=0:20:00
#SBATCH -p cpu
# This script submits grep_query.py. To execute this script you need to run: sbatch sbatch_grep_query.sh -s dipEca1
while getopts ":s:" opt; do
case $opt in
s) species="$OPTARG"
;;
\?) echo "Invalid option -$OPTARG" >&2
;;
esac
done
#Load any necessary modules
module load python
#Run the Python script with the command line argument "dipEca1"
python grep_query.py "${species}"
```
#### grep_query.py
```
import sys
if len(sys.argv) < 2:
print("Please provide the species name as a command line argument")
exit()
species = sys.argv[1]
with open(f"{species}.pep") as f:
lines = f.readlines()
output = []
for i in range(len(lines)):
if lines[i].startswith(">ENST") and "QUERY" in lines[i] and i+1 < len(lines):
output.append(lines[i].strip())
output.append(lines[i+1].strip())
output_text = "\n".join(output)
with open(f"query_{species}.pep", "w") as f:
f.write(output_text)
# The output of this script is a protein fasta called query_${species}.pep that retains ONLY query sequences
```
### 2_query_protein_fastas
Move all of the outputs called query_${species}.pep to a folder called `2_query_protein_fastas`
### What's in our protein.fasta files?
Let's compare the orthology.tsv file from the TOGA annotation to the protein.fasta file that Ariadna gave us. Note: I initially compared this to orthology.tsv files from the Senckenberg repository and found that the chain IDs didn't match our data. So we must use the orthology.tsv files that Ariadna provided along with the protein.fastas, as there seem to be multiple versions of HLantPal2.
orthology.tsv classifies orthology relationships:
* one2many -- a gene in the reference species is orthologous to multiple genes in the query species
* one2one -- (pairwise orthology) a gene in the reference species has only one ortholog in the query species
* many2many -- orthologs in the reference and the query species have undergone lineage-specific duplications
* many2one -- a gene in the query species is orthologous to multiple genes in the reference species
* one2zero -- a gene is found in human but there is no ortholog in the query species
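As a quick sanity check, these classes can be tallied per species from the orthology table. A minimal sketch (the file name and the `orthology_class` column name are assumptions based on the orthology_classification.tsv format used later in this pipeline):
```
import pandas as pd

# Count how many q_transcripts fall into each orthology class for one species
df = pd.read_csv('HLantPal2_orthology_classification.tsv', sep='\t')
print(df['orthology_class'].value_counts())
```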
### Does the protein.fasta file include one2many sequences? YES
`grep '>ENST00000333762.H1-10.317' antpal2.pep`
### Does the protein.fasta file include one2one sequences? YES
`grep '>ENST00000226021.CACNG1.66' antpal2.pep`
I believe these can be dropped because they are not informative for CAFE
### Does the protein.fasta file include many2many sequences? YES
`grep '>ENST00000599845.CSAG3.704154' antpal2.pep`
### Does the protein.fasta file include many2one sequences? YES
`grep '>ENST00000294337.CYP4A22.6' antpal2.pep`
### Does the protein.fasta file include one2zero sequences? SORT OF?
These sequences are designated as one2zero but they still show up in the protein.fasta. These are likely gene losses (pseudogenizations) that we should investigate separately.
```
grep -A 1 '>ENST00000372617.GLRA4' antpal2.pep
>ENST00000372617.GLRA4.5 | PROT | QUERY
MSQRLL--TLSSYFFILFSFPV----EVSLKLLAFSR-------K---------FL-------------------NSKPFLPL-----LFSSSFMVMAKA-MDYRVNVFLWQQRHDPRLAYQDX-DDSLDLDPSMLDAI*-TPDLFFVNEKGANFHEVTTDNKLLRIFKNGNVLYSIRLTLILFCPMDLKNFPMDIQICIMQLESIGYIMNDLVFEWLEDAPAVQVAEGLTLSQFILRDEKDLGYYTKHYNTGKFTCIEVKFHLEWQMGYYLIQMYIPSLLIAILS*-VSFCINMDAAPAHVGLGITTVLNMTTQSSGS*-ASLPKVSYVKAIDIWMAVRLLFVFAALLEYAAVNFVSCQHKEFIRLRRRQKHQHM-------N----------LSL-----SPAGETAVFSLK--GALSREEI--------*
```
Liliana says that these one2zero sequences may be informative. If they are one2zero in all bats, we should drop them. However if they are one2zero in some bats but not others, then they are interesting for gene family comparison. These may be unique lineage-specific gene losses that should be further investigated.
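To separate the two cases, one option is to collect the one2zero reference genes per species and intersect them. This is only an illustrative sketch, assuming the per-species orthology_classification.tsv files (see 3_orthology_tsvs below) with `t_gene` and `orthology_class` columns:
```
import glob
import os
import pandas as pd

# Collect the one2zero reference genes per species (paths and column names are assumptions)
losses = {}
for path in glob.glob('3_orthology_tsvs/modified/*_orthology_classification.tsv'):
    species = os.path.basename(path).split('_')[0]
    df = pd.read_csv(path, sep='\t')
    losses[species] = set(df.loc[df['orthology_class'] == 'one2zero', 't_gene'])

# Genes that are one2zero in every species could be dropped;
# genes that are one2zero in only some species are candidate lineage-specific losses
lost_everywhere = set.intersection(*losses.values()) if losses else set()
lineage_specific = {sp: genes - lost_everywhere for sp, genes in losses.items()}
```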
### 3_orthology_tsvs
We will now create a folder called 3_orthology_tsvs/ and prepare a list of one2one orthologs for each species.
We downloaded the orthology_classification.tsv files for each of our species using rclone into our 3_orthology_tsvs folder.
### Pad q_gene names with zeroes
In order to select the longest transcript (step 5) we need to compare the lengths of q_transcripts (ENSTs) grouped by gene ID (q_gene). However, grepping for reg_10 would also match reg_100, reg_1000, and reg_10000. Therefore we pad the q_gene names with zeroes so that every q_gene name has 5 digits.
#### pad_names.py
for each species, submit `python pad_names.py antPal2`
```
#!/usr/bin/env python
## This script pads q_gene names with zeroes so that every name has 5 digits, e.g., reg_10 becomes reg_00010. We can then search for the q_gene name itself in 1_longest_transcript.sh (next step) without worrying about reg_10, reg_100, and reg_1000 being treated as matching q_gene names.
import sys
import re
import pandas as pd
# Get the value for species from the command line argument
species = sys.argv[1]
def pad_gene_names(gene_names):
padded_names = []
for name in gene_names:
match = re.match(r'(\D+)(\d+)', name)
if match:
prefix = match.group(1)
number = match.group(2)
padded_number = number.rjust(5, '0')
padded_name = prefix + padded_number
padded_names.append(padded_name)
else:
padded_names.append(name)
return padded_names
# Read in your file as a DataFrame
df = pd.read_csv(f'HL{species}_orthology_classification.tsv', sep='\t')
# Apply the function to all values in column 3
df['q_gene'] = pad_gene_names(df['q_gene'])
# Write the modified DataFrame back to a new file
df.to_csv(f'./modified/HL{species}_orthology_classification.tsv', sep='\t', index=False)
```
### Create a list of files in the /modified folder
```
ls modified/*.tsv > list_orthology_tsvs
```
### Create one2one_${species}.txt
We need a list of one2one **q_transcripts** for each species so that we can drop them from the protein fastas for CAFE. Loop over the files in list_orthology_tsvs, selecting only the q_transcripts whose orthology_class is 'one2one'. Drop any duplicate q_transcripts, and save the result as one2one_${species}.txt
#### sbatch_one2ones.sh
```
#!/bin/bash
# Loop over each TSV file in list_orthology_tsvs
while read -r file_path; do
# Extract the characters 3-9 from the file name
file_name=$(basename "$file_path")
file_prefix=${file_name:2:7}
# Extract the q_transcript values for rows where orthology_class == 'one2one'
awk -F $'\t' '$5 == "one2one" { print $4 }' "$file_path" | sort -u > "one2one_${file_prefix}.txt"
done < list_orthology_tsvs
```
### 4_one2ones
Move these one2one_${species}.txt files to a folder called `4_one2ones`
### Dropping one2one orthologs from antPal2.pep
We need to do the following:
For each one2one ortholog (e.g., ENST00000000233.ARF5), we need to find the matching header line in the protein fasta, such as:
```
>ENST00000000233.ARF5.22 | PROT | QUERY
MKKPRAAVGSRPRKQTSSREEREKHGFNSSQAKPTIPDAGQARRQEEVVLQASVSPYHQFRDTAEVTVFRDSLLSWYDLKKRDLPWRRQAEGEVDLDRRAYAVWVSEVMLQQTQVATVINYYTRWMQKWPTLQDLAGASLEEVNQLWAGLGYYSRGRRLQEGARKVVEELGGHVPRTAETLQRLLPGVGRYTAGAIASIAFGQVTGVVDGNVVRVLCRVRGIGADPRNTFVSQQLWSLAQQLVDPARPGDFNQAAMELGAIVCTPQHPHCSQCPVQSLCRAHQRVERTQLPASQSLPGSPDVEECAPNTGQCQLCAPPTQPWDQTLGVANFPRKASRKPPREECSATCVLEQPRAPGGARVLLVQRPNSGLLAGLWEFPSVSVHPSGRHQRQALLQELRSWAGPLPAARLRHLGQVVHVFSHIKLTYQVYGLALEGQAPVTGTAPGARWLTREEFHTAAVSTAMKKVFRVYEGQQSGTCKSSKRCQVSTASRRKKPRPGQQVLDSFFRPRIPTDAPRPNSTA-*
```
and drop the matching header line AND the line after it (the protein sequence for that one2one ortholog).
We have automated this in the following way:
`sbatch_ensts.sh` executes `ensts.py`; waits 10 seconds, then executes `drop_one2ones.py`
#### sbatch_ensts.sh
```
#!/bin/bash
#SBATCH --job-name=enst
#SBATCH --output=enst_%j.out
#SBATCH --error=enst_%j.err
#SBATCH --ntasks-per-node=40
#SBATCH --nodes=1
#SBATCH --time=0:30:00
#SBATCH -p cpu
# This script submits ensts.py, which then waits 10 seconds and starts running drop_one2ones.py. To execute this script you need to run sbatch sbatch_ensts.sh -s dipEca1. Note that this ran in WELL under an hour.
while getopts ":s:" opt; do
case $opt in
s) species="$OPTARG"
;;
\?) echo "Invalid option -$OPTARG" >&2
;;
esac
done
#Load any necessary modules
module load python
#Run the Python script with the command line argument "dipEca1"
python ensts.py "${species}"
#mkdir ./logs
# Output job duration
sacct --format=JobID,Elapsed -j $SLURM_JOB_ID > ./logs/$SLURM_JOB_ID.log
# Output compute resource usage
sacct --format=JobID,AllocCPUs,AllocGRES -j $SLURM_JOB_ID >> ./logs/$SLURM_JOB_ID.log
```
#### ensts.py
```
#!/usr/bin/env python
import sys
import time
import subprocess
#Get the value for species from the command line argument
species = sys.argv[1]
# Open the files
with open(f'one2one_{species}.txt', 'r') as f1, open(f'../2_query_protein_fastas/query_{species}.pep', 'r') as f2, open(f'ensts_{species}.txt', 'w') as f3:
# Read the lines from one2one_antPal2.txt and store them in a list
search_strings = [line.strip() for line in f1]
# Loop over the lines in species.pep
for line in f2:
# Check if the line contains any of the search strings
if any(search_string in line for search_string in search_strings):
# If it does, write the line to the output file
f3.write(line)
# Wait for 10 seconds
time.sleep(10)
#Submit the drop_one2ones.py script
subprocess.run(['python', 'drop_one2ones.py', species])
```
#### drop_one2ones.py
```
import sys
# Get the value for species from the command line argument
species = sys.argv[1]
# Open the input files
with open(f'ensts_{species}.txt', 'r') as f1, open(f'../2_query_protein_fastas/query_{species}.pep', 'r') as f2:
# Read the ENST names into a list
enst_names = [line.strip() for line in f1]
# Loop through each line in antpal2.pep
with open(f'{species}_final.pep', 'w') as out_file:
for line in f2:
# If the line matches an ENST name, skip it and the next line
if any(enst in line for enst in enst_names):
next(f2)
# Otherwise, write the line to the output file
else:
out_file.write(line)
```
The final product is a protein fasta without any one2one orthologs, named `${species}_final.pep`
### Filter for longest transcript
As a last step, we will retain only the longest transcript (one ENST for each q_gene).
#### a_longest_transcript.sh
```
#!/bin/bash
#SBATCH --job-name=longest_transcript
#SBATCH --output=long_%j.out
#SBATCH --error=long_%j.err
#SBATCH --ntasks-per-node=40
#SBATCH --nodes=1
#SBATCH --time=1:00:00
#SBATCH -p cpu
#This script is run using bash a_longest_transcript.sh -s antPal2
while getopts ":s:" opt; do
case $opt in
s) species="$OPTARG"
;;
\?) echo "Invalid option -$OPTARG" >&2
;;
esac
done
# Create directories js and finalJs
mkdir -p "${species}_js" "${species}_finalJs"
#first make a document of all the genes (q_genes) said to be found in species
#awk '{print $4}' ../query_gene_spans.bed > myoNig_geneIDs
#We don't have query_gene_spans.bed from TOGA for all our species, so we want to use orthology_tsv instead.
#First make a file with all the genes (regxxx) said to be found in the species.
awk '{print $3}' ../3_orthology_tsvs/modified/HL"${species}"_orthology_classification.tsv | sort -u | grep "reg" > "${species}_geneIDs"
#read this file as you are going to loop through the genes
while read j; do
#first create a variable for each q_gene name such as reg_00011
r=$j
#echo $r
#now search for this gene in the orthology document and save results to a file
cat ../3_orthology_tsvs/modified/HL"${species}"_orthology_classification.tsv | grep "$r" > ./"${species}_js"/"$j"
#looper=$(cat ./"${species}_js"/"$r" | wc -l)
#take only the 4th column, which has the ENST transcripts for each gene
awk '{print $4}' ./"${species}_js"/"$j" > ./"${species}_js"/x_"$j"
#start a counter, that as you loop through each transcript, the longest will be saved
counter=0
#so open this file and loop through the ENST
for i in $(cat ./"${species}_js"/x_"$j")
do
#search for each ENST in the query file, look at the next line, count alphabetic characters
p=$(cat ../4_one2ones/"${species}"_final.pep | grep "$i" -A1 | grep -o "[a-zA-Z]*"| wc -c)
#first result will beat counter of 0, each after will have to beat new counter
if [[ $p -gt $counter ]]
then
counter=$p
#if higher than counter, overwrite final j file with new longest
cat ../4_one2ones/"${species}"_final.pep | grep -w "$i" -A1 > ./"${species}_finalJs"/"$j"
fi
done
#f=$(cat $folder$j".fa" | grep "$id")
#echo $j $f >> "gene_idsgc_"$spec".txt"
done < "${species}_geneIDs"
# Remove directories js and finalJs
#rm -r "${species}_js" "${species}_finalJs"
#mkdir ./logs
# Output job duration
sacct --format=JobID,Elapsed -j $SLURM_JOB_ID > ./logs/$SLURM_JOB_ID.out
# Output compute resource usage
sacct --format=JobID,AllocCPUs,AllocGRES -j $SLURM_JOB_ID >> ./logs/$SLURM_JOB_ID.out
# The output of this script is a folder of transcripts which need to be combined using the 2_combiner.sh script. Then, the 3_remover.sh script needs to be run to remove the /finaljs and /js folders.
# combiner.sh produces a protein fasta file which is ready for porthomcl. It contains only: query sequences, NO one-to-one orthologs, only the longest transcript for each gene.
```
#### 2_combiner.sh
```
#!/bin/bash
#SBATCH --job-name=combiner
#SBATCH --output=combiner_%j.out
#SBATCH --error=combiner_%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --time=0:05:00
#SBATCH -p cpu
# this script combines the longest ENSTs that have made it to the finalJs folder for each species. This script is submitted as bash 2_combiner.sh -s species
while getopts ":s:" opt; do
case $opt in
s) species="$OPTARG"
;;
\?) echo "Invalid option -$OPTARG" >&2
;;
esac
done
while read j; do
cat ./"${species}_finalJs"/"$j" >> ./longest_transcript_faas/"${species}_longest_transcript.faa"
done < "${species}_geneIDs"
# The output of this script is a protein fasta file (ending .faa) which 1) retains only QUERY sequences, 2) retains only paralogs, and 3) retains only the longest transcript (1 ENST) for each q_gene. This output is ready for step 6: porthomcl.
```
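For reference, the combined effect of the longest-transcript and combiner steps above (keep only the longest query sequence per q_gene) can be sketched more compactly in Python. This is purely illustrative and assumes the padded orthology table from step 3 and the ${species}_final.pep output from step 4; the production runs used the SLURM scripts above.
```
import pandas as pd

species = 'antPal2'  # example

# q_gene -> q_transcript mapping from the padded orthology table (path assumed from step 3)
orth = pd.read_csv(f'../3_orthology_tsvs/modified/HL{species}_orthology_classification.tsv', sep='\t')

# read the paralog-only protein fasta (one2ones already dropped) into header -> sequence
seqs = {}
header = None
with open(f'../4_one2ones/{species}_final.pep') as fh:
    for line in fh:
        line = line.strip()
        if line.startswith('>'):
            header = line
            seqs[header] = ''
        elif header:
            seqs[header] += line

def aa_len(seq):
    # count alphabetic characters only, matching the grep -o "[a-zA-Z]*" | wc -c approach
    return sum(c.isalpha() for c in seq)

# for each q_gene, keep the header/sequence pair with the longest sequence
with open(f'{species}_longest_transcript.faa', 'w') as out:
    for q_gene, grp in orth.groupby('q_gene'):
        best, best_len = None, 0
        for enst in grp['q_transcript'].astype(str):
            for hdr, seq in seqs.items():
                if enst in hdr and aa_len(seq) > best_len:
                    best, best_len = hdr, aa_len(seq)
        if best:
            out.write(best + '\n' + seqs[best] + '\n')
```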
#### cat_list_of_paralogs.sh
We want to be sure that only paralogs have made it to our finalJs folder, and into the ${species}_longest_transcript.faa input for porthomcl.
```
#!/bin/bash
#SBATCH --job-name=paralogs
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --time=0:05:00
#SBATCH -p cpu
# This script makes a list of everything that has reached our finalJs folder and prints the orthology classification from the js folder. The result is a list_of_paralogs_${species} file that should include only paralogs (no one2one orthologs in column 5).
# This script is run using bash cat_list_of_paralogs.sh -s antPal2
while getopts ":s:" opt; do
case $opt in
s) species="$OPTARG"
;;
\?) echo "Invalid option -$OPTARG" >&2
;;
esac
done
ls ./${species}_finalJs > list_${species}finaljs
for i in $(cat ./list_${species}finaljs); do cat ./${species}_js/$i >> list_of_paralogs_${species}; done
```
#### empties.py
We can do the same by confirming the list of one2one orthologs that have NOT made it to our finalJs folder for each species. These q_genes are listed in the combiner_XXXXX.err log generated by each combiner.sh job. We can collect these q_genes from the error log using `empties.py`.
```
import re
# This script confirms that one2one orthologs have NOT reached our finalJs folder and ended up in our longest_transcript.faa protein file for porthomcl. In brief, it collects q_gene names that were included in the js folder but not included (empty) in the finalJs folder for each species. The result is a list of commands (cat_list_of_one2ones.sh) which when executed should list a ton of one2one orthologous q_genes (and their names).
# Open the input and output files
with open('combiner_5636189.err', 'r') as input_file, open('cat_list_of_one2ones.sh', 'w') as output_file:
# Loop through each line in the input file
for line in input_file:
# Remove the ":" character
line = line.replace(':', '')
# Replace "finalJs" with "js"
line = line.replace('finalJs', 'js')
# Remove ": No such file or directory"
line = re.sub(r'No such file or directory', '', line)
# Write the modified line to the output file
output_file.write(line)
```
#### 3_remove_species.sh
These js and finalJs folders are really big. You may want to remove them, but don't do it on the login node. Execute 3_remove_species.sh to remove a single species' folders, or 4_remove_js_finalJs.sh to remove ALL of the js and finalJs folders.
#### 6_porthoMCL
Copy the ${species}_longest_transcript.faa files into a folder called 6_porthoMCL
Create a CONTAINER folder (e.g., longevity_CONTAINER)
Move all of our `${species}_longest_transcript.faa` files into a subfolder called `0.input_faa`
Execute porthomcl with `sbatch porthomcl_slurm.sh longevity_CONTAINER`
```
#!/bin/bash
#
#SBATCH --job-name=bostau-porthomcl
#SBATCH -o slurm-%j.out # %j = job ID
#SBATCH --ntasks-per-node=24
#SBATCH --nodes=1
#SBATCH --time=36:00:00
#SBATCH -p cpu-long
# This script executes porthomcl on a folder of protein fastas. Your protein fastas should 1) retain only QUERY sequences 2) no one2one orthologs 3) one longest transcript (ENST) for each q_gene 4) be inside CONTAINER/0.input_faa/
module load anaconda
conda activate /project/tlama_umass_edu/envs/porthomcl
cd /project/tlama_umass_edu/projects/project_bat1k_longevity/toga-paralogy/porthomcl
DIR=$1 ##Parent directory that contains "0.input_faa" with all protein .faas
bash porthomcl.sh -t 16 ${DIR}
```
# Run Orthofinder
To confirm gene family membership, run Orthofinder on all of the protein FASTAs, modified to include only the longest transcript (ENST) for each ENSG (<species>_longest_transcript.faa_shortened.faa). The input argument for Orthofinder (-f) is simply the directory that contains all of our protein files (/project/tlama_umass_edu/projects/project_bat1k_longevity/orthofinder/orthofiles).
Run orthofinder.sh:
```
#!/bin/bash
#SBATCH --output=ortho_23Jul.out
#SBATCH --error=mv-ortho_23Jul.err
#SBATCH --job-name=ortho_23Jul
#SBATCH --nodes=1
#SBATCH --time=168:00:00
#SBATCH --ntasks=20
#SBATCH --partition=cpu-long
module load anaconda/2022.10
conda activate /project/tlama_umass_edu/envs/orthofinder/
orthofinder -f /project/tlama_umass_edu/projects/project_bat1k_longevity/orthofinder/orthofiles -o /project/tlama_umass_edu/projects/project_bat1k_longevity/orthofinder/results_1k_23Jul
```
The analysis takes approximately 20 hours to run and writes a set of results directories and files to `results_1k_23Jul`.
# Format the orthogroup data for CAFE
working directory is /project/tlama_umass_edu/projects/project_bat1k_longevity/data/orthofinder/protein_fastas/8_cafe
The next step is to take the Orthogroups.tsv data from OrthoFinder and do some filtering and formatting in preparation for CAFE. There is no need to enrich the orthogroups (OGs) at this stage.
REMEMBER: to run CAFE we need an ultrametric time tree. We have one from Graham.
We want to calculate:
1) the number of genes/species in each OG
2) the number of genes/OG
Some filtering rules:
1) drop OGs with > 450 genes
2) drop OGs with > 100 size variation: "Families which varied in size by more than 100 members between the species with the highest and lowest count were also removed." (see the sketch after this list)
3) OGs with > 80% similarity must be merged **this is more complicated than it sounds, people!**
4) drop all olfactory gene OGs
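Rules 1 and 2 can be applied directly to the OrthoFinder gene-count table. A minimal sketch, assuming the standard Orthogroups.GeneCount.tsv output (one row per OG, one count column per species plus a Total column); rules 3 and 4 are handled in the steps and manual curation below:
```
import pandas as pd

# OrthoFinder gene-count table: one row per orthogroup, one count column per species (file name assumed)
counts = pd.read_csv('Orthogroups.GeneCount.tsv', sep='\t')
species_cols = [c for c in counts.columns if c not in ('Orthogroup', 'Total')]

size_ok = counts[species_cols].sum(axis=1) <= 450  # rule 1: drop OGs with > 450 genes
spread_ok = (counts[species_cols].max(axis=1) - counts[species_cols].min(axis=1)) <= 100  # rule 2: drop OGs with > 100 size variation

counts[size_ok & spread_ok].to_csv('filtered_gene_counts.tsv', sep='\t', index=False)
```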
## Calculating similarity between OGs
We need to extract gene names and summarize the makeup of each OG, then compare all of the OGs against one another and calculate a similarity metric. Pairs with similarity above ~80% are printed to an output called jaccard_similarity.txt. OGs with >= 80% similarity will eventually be merged.
#### 1b_calculate_jaccard_similarity.py
```
# This script takes an Orthofinder output file and extracts all of the gene names from each Orthogroup (OG); then summarizes the makeup of the OG (e.g., 10% DEFA1; 90% DEFA2), compares each OG to every other OG (e.g., OG00 to OG01), and calculates a similarity. If similarity > 0.79 (roughly 80%), the pair is printed to jaccard_similarity.txt.
import csv
import re
def extract_gene_name(gene_name):
match = re.search(r'\.(.+?)[.-]', gene_name)
if match:
return match.group(1)
return ""
def calculate_gene_percent(gene_set):
gene_counts = {}
total_genes = len(gene_set)
for gene in gene_set:
if gene in gene_counts:
gene_counts[gene] += 1
else:
gene_counts[gene] = 1
gene_percentages = {}
for gene, count in gene_counts.items():
percentage = (count / total_genes) * 100
gene_percentages[gene] = percentage
return gene_percentages
def calculate_similarity(gene_percentages1, gene_percentages2):
gene_set1 = set(gene_percentages1.keys())
gene_set2 = set(gene_percentages2.keys())
intersection = gene_set1.intersection(gene_set2)
union = gene_set1.union(gene_set2)
similarity = len(intersection) / len(union)
return similarity
# Read the CSV file
with open('cafe_inputs_manip.csv', 'r') as file:
reader = csv.DictReader(file)
header = reader.fieldnames
orthogroups = {}
orthogroup_column = header[0]
for row in reader:
orthogroup = row[orthogroup_column]
gene_set = set()
for column in header[1:]:
gene = extract_gene_name(row[column])
if gene:
gene_set.add(gene)
orthogroups[orthogroup] = gene_set
# Compare orthogroups and calculate similarity
merge_log = []
for orthogroup1, geneset1 in orthogroups.items():
for orthogroup2, geneset2 in orthogroups.items():
if orthogroup1 == orthogroup2:
continue
gene_percentages1 = calculate_gene_percent(geneset1)
gene_percentages2 = calculate_gene_percent(geneset2)
similarity = calculate_similarity(gene_percentages1, gene_percentages2)
if similarity > 0.79:
merge_log.append({
'Orthogroup 1': orthogroup1,
'Orthogroup 2': orthogroup2,
'Gene Percentages 1': gene_percentages1,
'Gene Percentages 2': gene_percentages2,
'Similarity': similarity
})
# Write merge log to a file
with open('jaccard_similarity.txt', 'w') as file:
for entry in merge_log:
file.write(f"Orthogroup 1: {entry['Orthogroup 1']}\n")
file.write("Gene Percentages 1:\n")
for gene, percentage in entry['Gene Percentages 1'].items():
file.write(f"{gene}: {percentage}%\n")
file.write(f"\nOrthogroup 2: {entry['Orthogroup 2']}\n")
file.write("Gene Percentages 2:\n")
for gene, percentage in entry['Gene Percentages 2'].items():
file.write(f"{gene}: {percentage}%\n")
file.write(f"\nSimilarity: {entry['Similarity']}\n\n")
```
The output of 1b_calculate_jaccard_similarity.py is a file called jaccard_similarity.txt with the following format:
```
Orthogroup 1: OG0000000
Gene Percentages 1:
IGHV3: 100.0%
Orthogroup 2: OG0001306
Gene Percentages 2:
IGHV3: 100.0%
Similarity: 1.0
```
### grep Orthogroups with > 80% similarity
Reduce the complexity of jaccard_similarity.txt by retaining only the lines with Orthogroup 1, Orthogroup 2, and Similarity.
#### 2_grep_similarity.sh
This script outputs a file called grep_similarity.txt:
```
grep -E "Orthogroup 1:|Orthogroup 2:|Similarity:" jaccard_similarity.txt > grep_similarity.txt
```
grep_similarity.txt looks like this
```
Orthogroup 1: OG0000000
Orthogroup 2: OG0000015
Similarity: 80.0%
Orthogroup 1: OG0000002
Orthogroup 2: OG0001295
Similarity: 94.11764705882354%
Orthogroup 1: OG0000002
Orthogroup 2: OG0001559
Similarity: 94.11764705882354%
Orthogroup 1: OG0000002
```
#### 3_clean_pairs.py
Because we did an all-by-all comparison of the orthogroups to measure similarity, we have duplicate orthogroup pairs (e.g., OG0 was compared to OG1, and OG1 was also compared to OG0). To remove these redundancies, we use 3_clean_pairs.py, which keeps track of all of the orthogroup pairs and writes a file called clean_pairs.txt free of duplicates.
```
# This script takes the grep_similarity.txt summary and eliminates duplicated comparisons. For example, if OG1 and OG2 are 80% similar, we do not also need the OG2 vs. OG1 similarity to know that the two need to be merged.
seen_pairs = set()
with open('filtered_similarity.txt', 'r') as file:
lines = file.readlines()
clean_lines = []
for i in range(0, len(lines), 3):
orthogroup1_line = lines[i].strip()
orthogroup2_line = lines[i + 1].strip()
similarity_line = lines[i + 2].strip()
orthogroup1 = orthogroup1_line.split(":")[1].strip()
orthogroup2 = orthogroup2_line.split(":")[1].strip()
pair = (orthogroup1, orthogroup2)
reversed_pair = (orthogroup2, orthogroup1)
if pair not in seen_pairs and reversed_pair not in seen_pairs:
clean_lines.append(f"{orthogroup1_line}\n")
clean_lines.append(f"{orthogroup2_line}\n")
clean_lines.append(f"{similarity_line}\n")
seen_pairs.add(pair)
with open('clean_pairs.txt', 'w') as file:
file.writelines(clean_lines)
```
#### After running 3_clean_pairs.py, run `grep -E "Orthogroup 1:|Orthogroup 2:" clean_pairs.txt > clean_pairs2.txt`
### Time to merge
After removing duplicates, we have about 1k orthogroups that need to be joined.
#### drop the "Similarity" line, we don't need it anymore
`grep -E "Orthogroup 1:|Orthogroup 2:" clean_pairs.txt > cleaner_pairs.txt`
#### 4_calculate_minimum_mergers.py
Ok, now this bit is quite clever: it takes clean_pairs.txt and creates a network with nodes and edges to determine the minimum number of mergers needed to join all of the OGs with similarity > 80%.
```
# This script reads the clean_pairs.txt file and identifies the minimum number of mergers needed. Each orthogroup is a node in a graph and each pair of orthogroups is an edge; the connected components of the graph give the groups of orthogroups that need to be merged together.
## Usage: python calculate_min_mergers.py > min_mergers.txt
def read_pairs(file_path):
pairs = []
with open(file_path, 'r') as pairs_file:
lines = pairs_file.readlines()
i = 0
while i + 1 < len(lines):
orthogroup1_line = lines[i].strip()
orthogroup2_line = lines[i + 1].strip()
orthogroup1 = orthogroup1_line.split(":")[1].strip()
orthogroup2 = orthogroup2_line.split(":")[1].strip()
pairs.append((orthogroup1, orthogroup2))
i += 2
return pairs
def create_graph(pairs):
graph = {}
for orthogroup1, orthogroup2 in pairs:
if orthogroup1 not in graph:
graph[orthogroup1] = set()
if orthogroup2 not in graph:
graph[orthogroup2] = set()
graph[orthogroup1].add(orthogroup2)
graph[orthogroup2].add(orthogroup1)
return graph
def find_connected_components(graph):
components = []
visited = set()
def dfs(node, component):
visited.add(node)
component.add(node)
for neighbor in graph[node]:
if neighbor not in visited:
dfs(neighbor, component)
for node in graph:
if node not in visited:
component = set()
dfs(node, component)
components.append(component)
return components
def main():
clean_pairs_file = 'clean_pairs.txt'
# Read pairs from the clean_pairs.txt file
pairs = read_pairs(clean_pairs_file)
# Create a graph from the pairs
graph = create_graph(pairs)
# Find connected components in the graph
connected_components = find_connected_components(graph)
# Print the minimum number of mergers needed
print(f"Minimum number of mergers needed: {len(connected_components)}")
# Print the connected components
for i, component in enumerate(connected_components):
print(f"Group {i + 1}: {', '.join(component)}")
if __name__ == "__main__":
main()
```
#### 5_perform_min_mergers.py
```
import csv
output_file = 'merged_output.csv'
cafe_file = 'cafe_inputs_manip.csv'
min_mergers_file = 'min_mergers.txt'
# Read the header row from the cafe_inputs_manip.csv file
with open(cafe_file, 'r') as cafe:
reader = csv.reader(cafe)
header_row = next(reader)
# Create a dictionary to store the merged rows
merged_rows = {}
# Read the orthogroups to be merged from min_mergers.txt
with open(min_mergers_file, 'r') as min_mergers:
for line in min_mergers:
if line.startswith('Group'):
orthogroups = line.split(':')[1].strip().split(', ')
group_name = line.split(':')[0].strip()
merged_row = {header_row[0]: group_name}
for orthogroup in orthogroups:
# Read the rows for the current orthogroup from cafe_inputs_manip.csv
with open(cafe_file, 'r') as cafe:
reader = csv.DictReader(cafe)
for row in reader:
if row[header_row[0]] == orthogroup:
# Merge the rows for the current orthogroup
for key in header_row[1:]:
if key not in merged_row:
merged_row[key] = row[key]
else:
merged_row[key] += ', ' + row[key]
break
# Add the merged row to the dictionary
merged_rows[group_name] = merged_row
# Write the merged rows to the output CSV file
with open(output_file, 'w', newline='') as merged_file:
writer = csv.DictWriter(merged_file, delimiter=',', fieldnames=header_row)
writer.writeheader()
for row in merged_rows.values():
writer.writerow(row)
# Write the summary of merged orthogroups to the summary_mergers.txt file
with open('summary_mergers.txt', 'w') as summary_file:
summary_file.write('Merged Rows:\n')
for group, orthogroups in enumerate(merged_rows.keys(), start=1):
summary_file.write(f'{orthogroups}: {", ".join(merged_rows[orthogroups])}\n')
summary_file.write('Merging completed successfully.')
```
#### 5b_clean_cells.py
```
# We need to take the merged_output.csv and do some manual curation before it can become merged_output_curated.csv.
# This includes removing commas ', , , , , ,' and spaces where empty cells have merged with empty cells. Running this script will reduce your time spent on manual curation.
# After running this script, merged_output_curated.csv and non_merged_output.csv need to be combined and manually edited (drop olfactory receptors)
import pandas as pd
def clean_cell(cell):
# Remove leading and trailing spaces
cell = cell.strip()
# Remove ', ' and nothing else
if cell == ', ':
cell = ''
# Remove ',' at the end and ensure the last character is a digit
if cell.endswith(','):
cell = cell[:-1] + '0'
return cell
def clean_csv(input_file, output_file):
# Read the CSV file
df = pd.read_csv(input_file)
# Apply the clean_cell function to each cell in the DataFrame
df = df.applymap(clean_cell)
# Save the cleaned DataFrame to a new CSV file
df.to_csv(output_file, index=False)
if __name__ == "__main__":
input_file = "merged_output_curated.csv"
output_file = "cleaned_merged_output.csv"
clean_csv(input_file, output_file)
```
#### 6_dropme.py
```
# in this script we are using the min_mergers.txt file to create a long list of Orthogroups that need to be dropped from cafe_inputs_manip.csv. These orthogroups are already represented in the merged groups (1-52) in merged_output.csv. We need a single file including the non-merged and merged orthogroups for cafe. Here we will create a list of 693 orthogroups that need to be dropped.
input_file = 'min_mergers.txt' # Replace 'your_file.txt' with the actual filename
# Read the content of the input file
with open(input_file, 'r') as file:
content = file.read()
# Split the content into groups based on 'Group' keyword
groups = content.split('Group')[1:]
# Extract orthogroup names from each group
all_orthogroups = []
for group in groups:
orthogroups = group.strip().split(':')[1].strip().split(', ')
all_orthogroups.extend(orthogroups)
# Write the orthogroup names to a new text file with a single column
output_file = 'drop_orthogroups.txt'
with open(output_file, 'w') as output:
for orthogroup in all_orthogroups:
output.write(orthogroup + '\n')
print('Orthogroups extracted and saved to', output_file)
```
#### 7_drop_orthogroups.py
```
# Read the list of orthogroups from drop_orthogroups.txt, locate those orthogroups in cafe_inputs_manip.csv, and drop the corresponding rows. The remaining rows are written to a new CSV file called non_merged_orthogroups.csv. We use the pandas library to handle the data manipulation.
import pandas as pd
# File paths
input_csv_file = 'cafe_inputs_manip.csv'
orthogroups_file = 'drop_orthogroups.txt'
output_csv_file = 'non_merged_orthogroups.csv'
# Read the list of orthogroups to drop
with open(orthogroups_file, 'r') as file:
orthogroups_to_drop = [line.strip() for line in file]
# Read the CSV file
df = pd.read_csv(input_csv_file)
# Drop rows corresponding to the specified orthogroups
df_filtered = df[~df['Orthogroup'].isin(orthogroups_to_drop)]
# Write the remaining rows to the output CSV file
df_filtered.to_csv(output_csv_file, index=False)
print('Rows corresponding to specified orthogroups have been dropped and the remaining rows have been saved to', output_csv_file)
```
#### 8_convert_into_cafe_format.py
```
import pandas as pd
# Function to count genes in a cell or assign 0 to empty cells
def count_genes(cell_value):
if pd.isna(cell_value) or cell_value.strip() == "":
return "0"
return str(len(cell_value.split(", ")))
# Step 7: Manual curation in Excel (merged_output_curated.csv)
# Between steps 7 and 8, I've pulled merged_output.csv and non_merged_orthogroups.csv
# onto my desktop and done some manual curation in excel. Essentially, the mergers created
# a lot of extra ',' especially where a cell is merged with an empty cell etc. Those needed
# to be deleted, and our merged and non-merged orthogroups needed to be manually inspected
# and merged into a file called merged_output_curated.csv. I also dropped orthogroups
# associated with Olfactory Reception per Liliana's directions.
# Step 8: Convert merged_output_curated.csv to CAFE input format
# Now we need to convert the merged_output_curated.csv into CAFE format.
# CAFE requires an input file in a tab-separated format with gene IDs
# (in this case, Orthogroup IDs) in the first column and gene counts
# (number of copies) for each species in subsequent columns.
# File paths
input_csv_file = 'merged_output_curated.csv'
output_cafe_file = 'cafe_input_file.tsv'
# Read the CSV file
df = pd.read_csv(input_csv_file)
# Replace empty cells with "0" and count genes in non-empty cells
for col in df.columns[1:]:
df[col] = df[col].fillna("").apply(count_genes)
# Prepare the CAFE input file in tab-separated format
with open(output_cafe_file, 'w') as f:
f.write("Gene_ID\t" + "\t".join(df.columns[1:]) + "\n")
for _, row in df.iterrows():
f.write(row["Orthogroup"] + "\t" + "\t".join(row[1:]) + "\n")
print("CAFE input file has been created:", output_cafe_file)
```
# Time to run CAFE (zixia)
Zixia ran CAFE and then created some preliminary figures using CAFEplotter for total, significant, and individual gene families
# Visualizing results with CAFE_fig
https://github.com/LKremer/CAFE_fig
Create a new conda environment for the packages (CAFE_fig requires very specific versions of ete3 and Python):
`conda env create -f cafeFig_environment.yml`
```
name: cafeFig
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- ca-certificates=2024.12.14=hbcca054_0
- cairo=1.14.6=4
- ete3=3.0.0b35=py36_1
- expat=2.6.4=h5888daf_0
- fontconfig=2.12.1=4
- freetype=2.7=1
- gettext=0.22.5=he02047a_3
- gettext-tools=0.22.5=he02047a_3
- glib=2.51.4=0
- graphite2=1.3.13=h59595ed_1003
- harfbuzz=1.4.3=0
- icu=58.2=hf484d3e_1000
- jpeg=9e=h0b41bf4_3
- libasprintf=0.22.5=he8f35ee_3
- libasprintf-devel=0.22.5=he8f35ee_3
- libblas=3.9.0=20_linux64_openblas
- libcblas=3.9.0=20_linux64_openblas
- libexpat=2.6.4=h5888daf_0
- libffi=3.2.1=he1b5a44_1007
- libgcc=14.2.0=h77fa898_1
- libgcc-ng=14.2.0=h69a702a_1
- libgettextpo=0.22.5=he02047a_3
- libgettextpo-devel=0.22.5=he02047a_3
- libgfortran=14.2.0=h69a702a_1
- libgfortran-ng=14.2.0=h69a702a_1
- libgfortran5=14.2.0=hd5240d6_1
- libgomp=14.2.0=h77fa898_1
- libiconv=1.17=hd590300_2
- liblapack=3.9.0=20_linux64_openblas
- liblzma=5.6.3=hb9d3cd8_1
- liblzma-devel=5.6.3=hb9d3cd8_1
- libopenblas=0.3.25=pthreads_h413a1c8_0
- libpng=1.6.43=h2797004_0
- libstdcxx=14.2.0=hc0a3c3a_1
- libstdcxx-ng=14.2.0=h4852527_1
- libtiff=4.0.10=hc3755c2_1005
- libuuid=2.38.1=h0b41bf4_0
- libxcb=1.17.0=h8a09558_0
- libxml2=2.9.9=h13577e0_2
- libxslt=1.1.32=hae48121_1003
- libzlib=1.2.13=h4ab18f5_6
- lxml=4.4.1=py36h7ec2d77_0
- lz4-c=1.9.3=h9c3ff4c_1
- ncurses=6.5=h2d0b736_2
- numpy=1.19.5=py36hfc0c790_2
- openssl=1.0.2u=h516909a_0
- pango=1.40.4=0
- pcre=8.45=h9c3ff4c_0
- pip=21.3.1=pyhd8ed1ab_0
- pixman=0.34.0=h14c3975_1003
- pthread-stubs=0.4=hb9d3cd8_1002
- pyqt=4.11.4=py36_3
- python=3.6.7=hd21baee_1002
- python_abi=3.6=2_cp36m
- qt=4.8.7=ha8c56c7_9
- readline=7.0=hf8c457e_1001
- scipy=1.5.3=py36h81d768a_1
- setuptools=58.0.4=py36h5fab9bb_2
- sip=4.18=py36_1
- six=1.16.0=pyh6c4a22f_0
- sqlite=3.28.0=h8b20d00_0
- tk=8.6.13=noxft_h4845f30_101
- wheel=0.37.1=pyhd8ed1ab_0
- xorg-libice=1.1.2=hb9d3cd8_0
- xorg-libsm=1.2.5=he73a12e_0
- xorg-libx11=1.8.10=h4f16b4b_1
- xorg-libxau=1.0.12=hb9d3cd8_0
- xorg-libxdmcp=1.1.5=hb9d3cd8_0
- xorg-libxext=1.3.6=hb9d3cd8_0
- xorg-libxrender=0.9.12=hb9d3cd8_0
- xorg-libxt=1.3.1=hb9d3cd8_0
- xorg-xextproto=7.3.0=hb9d3cd8_1004
- xz=5.6.3=hbcc6ac9_1
- xz-gpl-tools=5.6.3=hbcc6ac9_1
- xz-tools=5.6.3=hb9d3cd8_1
- zlib=1.2.13=h4ab18f5_6
- zstd=1.4.9=ha95c52a_0
prefix: /home/bbentley_smith_edu/.conda/envs/cafeFig
```
Make some changes to CAFE_fig.py. Change the `pixels_per_mya` value to 5.0 to widen the tree; the relevant line looks like this:
```
'pixels_per_mya': 1.0, # pixels per million years (tree width)
```
Also adjust the vertical spacing between branches:
```
ts.branch_vertical_margin = 10 # Increase the spacing between branches
```
```
python CAFE_fig.py -d out -g pdf base_report.cafe
conda install conda-forge::xvfbwrapper
conda update -n base -c defaults conda
```
### Acknowledgements
## Appendix and FAQ
:::info
**Find this document incomplete?** Leave a comment!
:::
###### tags: `scripts` `project_tyd`
- try the protein fasta from one of the six bats for Orthofinder
- Bill ran Orthofinder for his transcriptomics
- Bill's Orthofinder scripts are in his folder at seawulf
- Bill ran it on peptide FASTAs (from Ensembl)
- some of the assemblies are private, so I can't just go and use Ensembl
- orthology from toga
- orthology from
- Sensitivity analysis w/ results for well-known, well-curated
- OrthoFinder
- Does the Kirilenko paper
- expansions within everything with the caveat that only things orthologous to human were detected (because toga reference == human)
- data for synthesizing the apobecs -- one2one annotations
- Liliana went in and found a bunch more annotations
- we are interested in the one2many and many2many
- We need to validate the copy numbers spit out by TOGA
- No idea what the rules were for gene family assignment
- annotated protein list -- does it include the paralogs or just the one2one orthologs
- just run orthofinder on my protein fastas that come out of TOGA
- genes in a species that actually belong to two different gene families -- APOBEC3 (assigned to APOBEC, and APOBEC-A simultaneously). Pick the smallest box -- finest grain data. % cutoff.
- why don't we have a table that has columns of genes and species with copy numbers. Crawling through tables and making sure. Lexi has scripts that Ariadna gave her.
- 1) determine whether the protein.fasta is one2one or many2many etc.
- 2) do the many2many sequences get dropped from
- write into the methods that we fucked up and had to label the nodes after the fact.
- Tanya may continue with OrthoFinder on the protein.fastas that Ariadna gave me
  - one per genome
1. What is the TOGA --> CAFE problem?
To investigate expansions and contractions of protein families, we used CAFE.
CAFE requires gene counts which come from annotated proteins assigned to families for each species.
TOGA gives you the orthologs and classifies those orthologs into five different relationships: 1:1, 1:many, many:1, many:many, and 1:zero.
The goal was to get gene copy numbers from that orthology classification table that is output by TOGA.
In order to get gene copy counts, we need to group ENSGs into gene families. We attempted to do this
# done
sbatch 1_longest_transcript.sh -s antPal2
sbatch 1_longest_transcript.sh -s artJam3
sbatch 1_longest_transcript.sh -s aseSto2
sbatch 1_longest_transcript.sh -s bosTau9
sbatch 1_longest_transcript.sh -s canFam4
sbatch 1_longest_transcript.sh -s desRot8A
sbatch 1_longest_transcript.sh -s dipEca1
sbatch 1_longest_transcript.sh -s equCab3
sbatch 1_longest_transcript.sh -s felCat9
sbatch 1_longest_transcript.sh -s hipCyc2
sbatch 1_longest_transcript.sh -s hipLar2
sbatch 1_longest_transcript.sh -s mm10
sbatch 1_longest_transcript.sh -s molMol2
sbatch 1_longest_transcript.sh -s myoMyo6
sbatch 1_longest_transcript.sh -s myoNig1
sbatch 1_longest_transcript.sh -s phyDis3
sbatch 1_longest_transcript.sh -s phyHas1
sbatch 1_longest_transcript.sh -s pipKuh2
sbatch 1_longest_transcript.sh -s rhiAff2
sbatch 1_longest_transcript.sh -s rhiFer5
sbatch 1_longest_transcript.sh -s rhiLuc4
sbatch 1_longest_transcript.sh -s rhiMic1
sbatch 1_longest_transcript.sh -s rhiSed2
sbatch 1_longest_transcript.sh -s rhiTri2
sbatch 1_longest_transcript.sh -s rouAeg4
sbatch 1_longest_transcript.sh -s sacBil1
sbatch 1_longest_transcript.sh -s sacLep1
sbatch 1_longest_transcript.sh -s susScr11
sbatch 1_longest_transcript.sh -s tadBra2
# done
sbatch 2_combiner.sh -s antPal2
sbatch 2_combiner.sh -s artJam3
sbatch 2_combiner.sh -s aseSto2
sbatch 2_combiner.sh -s bosTau9
sbatch 2_combiner.sh -s canFam4
sbatch 2_combiner.sh -s desRot8A
sbatch 2_combiner.sh -s dipEca1
sbatch 2_combiner.sh -s equCab3
sbatch 2_combiner.sh -s felCat9
sbatch 2_combiner.sh -s hipCyc2
sbatch 2_combiner.sh -s hipLar2
sbatch 2_combiner.sh -s mm10
sbatch 2_combiner.sh -s molMol2
sbatch 2_combiner.sh -s myoNig1
sbatch 2_combiner.sh -s myoMyo6
sbatch 2_combiner.sh -s phyDis3
sbatch 2_combiner.sh -s phyHas1
sbatch 2_combiner.sh -s pipKuh2
sbatch 2_combiner.sh -s rhiAff2
sbatch 2_combiner.sh -s rhiFer5
sbatch 2_combiner.sh -s rhiLuc4
sbatch 2_combiner.sh -s rhiMic1
sbatch 2_combiner.sh -s rhiSed2
sbatch 2_combiner.sh -s rhiTri2
sbatch 2_combiner.sh -s rouAeg4
sbatch 2_combiner.sh -s sacBil1
sbatch 2_combiner.sh -s sacLep1
sbatch 2_combiner.sh -s susScr11
sbatch 2_combiner.sh -s tadBra2
# Need to run a cat on all of the paralog lists
sbatch cat_list_of_paralogs.sh -s antPal2
sbatch cat_list_of_paralogs.sh -s artJam3
sbatch cat_list_of_paralogs.sh -s aseSto2
sbatch cat_list_of_paralogs.sh -s bosTau9
sbatch cat_list_of_paralogs.sh -s canFam4
sbatch cat_list_of_paralogs.sh -s desRot8A
sbatch cat_list_of_paralogs.sh -s dipEca1
sbatch cat_list_of_paralogs.sh -s equCab3
sbatch cat_list_of_paralogs.sh -s felCat9
sbatch cat_list_of_paralogs.sh -s hipCyc2
sbatch cat_list_of_paralogs.sh -s hipLar2
sbatch cat_list_of_paralogs.sh -s mm10
sbatch cat_list_of_paralogs.sh -s molMol2
sbatch cat_list_of_paralogs.sh -s myoMyo6
sbatch cat_list_of_paralogs.sh -s myoNig1
sbatch cat_list_of_paralogs.sh -s phyDis3
sbatch cat_list_of_paralogs.sh -s phyHas1
sbatch cat_list_of_paralogs.sh -s pipKuh2
sbatch cat_list_of_paralogs.sh -s rhiAff2
sbatch cat_list_of_paralogs.sh -s rhiFer5
sbatch cat_list_of_paralogs.sh -s rhiLuc4
sbatch cat_list_of_paralogs.sh -s rhiMic1
sbatch cat_list_of_paralogs.sh -s rhiSed2
sbatch cat_list_of_paralogs.sh -s rhiTri2
sbatch cat_list_of_paralogs.sh -s rouAeg4
sbatch cat_list_of_paralogs.sh -s sacBil1
sbatch cat_list_of_paralogs.sh -s sacLep1
sbatch cat_list_of_paralogs.sh -s susScr11
sbatch cat_list_of_paralogs.sh -s tadBra2
cat_list_of_paralogs.sh
antPal2
artJam3
aseSto2
bosTau9
canFam4
desRot8A
dipEca1
equCab3
felCat9
hipCyc2
hipLar2
mm10
molMol2
myoMyo6
phyDis3
phyHas1
pipKuh2
rhiAff2
rhiFer5
rhiLuc4
rhiMic1
rhiSed2
rhiTri2
rouAeg4
sacBil1
sacLep1
susScr11
tadBra2
```
46 ../3.blastquery/antPal2_longest_transcript.fasta done
233 ../3.blastquery/artJam3_longest_transcript.fasta done
4800 ../3.blastquery/aseSto2_longest_transcript.fasta good
6444 ../3.blastquery/bosTau9_longest_transcript.fasta good only lost 1000
758 ../3.blastquery/canFam4_longest_transcript.fasta done
4044 ../3.blastquery/desRot8A_longest_transcript.fasta good
4188 ../3.blastquery/dipEca1_longest_transcript.fasta good
1012 ../3.blastquery/equCab3_longest_transcript.fasta done
1763 ../3.blastquery/felCat9_longest_transcript.fasta done
4434 ../3.blastquery/hipCyc2_longest_transcript.fasta good
205 ../3.blastquery/hipLar2_longest_transcript.fasta done
65 ../3.blastquery/mm10_longest_transcript.fasta done
3800 ../3.blastquery/molMol2_longest_transcript.fasta good
3870 ../3.blastquery/myoMyo6_longest_transcript.fasta good
1377 ../3.blastquery/phyDis3_longest_transcript.fasta done
33 ../3.blastquery/phyHas1_longest_transcript.fasta done
4394 ../3.blastquery/pipKuh2_longest_transcript.fasta good
143 ../3.blastquery/rhiAff2_longest_transcript.fasta done
2056 ../3.blastquery/rhiFer5_longest_transcript.fasta done
1515 ../3.blastquery/rhiLuc4_longest_transcript.fasta done
975 ../3.blastquery/rhiMic1_longest_transcript.fasta done
2127 ../3.blastquery/rhiSed2_longest_transcript.fasta done
395 ../3.blastquery/rhiTri2_longest_transcript.fasta bad
26 ../3.blastquery/rouAeg4_longest_transcript.fasta bad
1627 ../3.blastquery/sacBil1_longest_transcript.fasta bad
873 ../3.blastquery/sacLep1_longest_transcript.fasta bad
6830 ../3.blastquery/susScr11_longest_transcript.fasta good
943 ../3.blastquery/tadBra2_longest_transcript.fasta bad
```
```
sed 's/^>ENST/>tadBra2_longest_transcript|ENST/;s/ | PROT | QUERY//' ../0.input_faa/tadBra2_longest_transcript.faa > ./tadBra2_longest_transcript.fasta
```
# What did we do?
We realized that almost all of our proteins were making it through to goodProteins.fasta in step 2, but then in step 3.blastquery some of our protein fastas were being massively downsized and we were losing a ton of proteins for certain species. I went back to the initial inputs and "rescued" all of our proteins for step 3, and we will now begin re-running step 3.
# Reviewing outputs
I've selected a random set of orthogroups and am validating those that are highly variable within bats.
Some reminders: the inputs have been filtered in the following ways
- query sequence only
- no gaps
- longest transcript
## Example OG0000000
This OG has some variation within bats. e.g., artJam2 has 2 but antPal1 has 21 and hipLar2 has 72

### Is it composed of >85% one gene?
Yes, across all species, all of the genes in this OG are some version of IGHV3.
### Confirm these numbers in the protein fasta.
artJam2 has 2 but antPal1 has 21 and hipLar2 has 72
When we look at the orthology classification, it looks like there were other IGHV3 genes...where are all of these being tossed out??

I went through each step of my protein fasta processing pipeline. At what point are these genes removed from the protein fasta?
1 -- yes present
2 -- yes, present
3 -- yes, present
4 -- yes
5 -- looks like only 3 make it to finalJs and the rest get pushed to js. Are they one2one orthologs? Let's go see. Nope! They're one2zero orthologs.

It looks as though there were only three IGHV3 genes for orthofinder to analyze, and all three of these genes are classified as many2many orthologs.
```
grep "IGHV3" -A1 longest_transcript_faas/artJam3_longest_transcript.faa
>ENST00000390607.IGHV3-21.523602 many2many
IEFGLIWVLLFAILRAVVCDVQLVESGGGLVQLGGSLRLSCVASGFIDTIDTYMXFIDYCECLKDVCLMEVLYVMINYKDK
>ENST00000603660.IGHV3-30.124054 many2many
MEFGVSWVFLVPILRGVQCEVQLVESGGGLVRPRGSLRLSCAASGLTVSSSYINCVLQAAGKGLEWPSSIRDGGSPYYVNSVKGQFTITRENGKNTLSIXMNNLRVKNTALHYCVS
>ENST00000390598.IGHV3-7.1322916 many2many #this one didn't make it through orthofinder. wonder why?
MCCFKTGSQPGILRCSQSLLRTEHRPHTVEFGLILIFHVVILRVSTVMLESREGLVEGGXFLELSCRSSEFTFSSCNISCVCQALCEGLVWVCRIKIDKNNFRPR
```
Let's look at these genes on BLAST
```
>ENST00000390607.IGHV3-21.523602 | PROT | REFERENCE
M-ELGLRWVFLVAILEGVQCEVQLVESGGGLVKPGGSLRLSCAASGFTFSSYSMNWVRQAPGKGLEWVSSISSSSSYIYYADSVKGRFTISRDNAKNSLYLQMNSLRAEDTAVYYCAR
>ENST00000390607.IGHV3-21.523602 | PROT | QUERY
IEFGLIWVLLFAILRAVVCDVQLVESGGGLVQLGGSLRLSCVASGFIDTIDTYMXFIDYCECLKDVCLMEVLYVMINYKDK
```
When I BLAST the reference (human) sequence, I get an immediate 100%

### next steps
-- identify alignment output from orthofinder
-- look at this alignment in AliViewer
-- the ultimate check will come from CAFE results (going to genome browser, assembly and confirming losses)
### gene losses
long lived short lived
everything missing from molMol (losses)
is it missing in the long lived partner?
-- take everything thats lost in both the long and short lived and toss it out
-- focus on unique gene losses in one or the other one
-- complementary to CAFE
## Installing CAFE from github
We are in the 'orthofinder' anaconda environment
dependencies include python, gcc, and pandoc
```
conda install python=3.9
conda install -c conda-forge pandoc
conda install -c conda-forge latexmk
conda install -c conda-forge pdflatex
conda install -c bioconda cafe
conda install -c conda-forge pandas
```
then cd into docs/ and run `make`
## Running CAFE
Run `sbatch sbatch_run1.sh` to join the queue.
#### tree must be binary; graham_mod_dated_no_hg38.tree
```
(mm10:87.83517705,(((felCat9:49.66644369,canFam4:49.66644369):11.76859872,((susScr11:20.17442777,bosTau9:20.17442777):19.83282604,equCab3:40.00725381):21.4277886):14.99044447,((((((((HLrhiTri2:12.70240393,HLrhiLuc4:12.70240393):10.90116398,HLrhiSed2:23.60356791):10.06281168,HLrhiAff2:33.66637959):12.70900545,HLrhiFer5:46.37538505):9.624614955,((HLaseSto2:17.11302622,HLhipLar2:17.11302622):21.27756323,HLhipCyc2:38.39058945):17.60941055):0.8822414306,HLrhiMic1:56.88224143):0.8845209647,HLrouAeg4:57.7667624):0.8953602335,(((((HLpipKuh2:14.95167539,HLantPal2:14.95167539):16.82912223,(HLmyoNig1:15.85321175,HLmyoMyo6:15.85321175):15.92758587):15.11676827,(HLtadBra2:23.67764331,HLmolMol2:23.67764331):23.21992258):2.622233134,((HLdesRot8A:15.3632479,HLdipEca1:15.3632479):18.60429862,((HLphyDis3:10.86062202,HLphyHas1:10.86062202):10.65758233,HLartJam3:21.51820435):12.44934218):15.5522525):3.189644714,(HLsacLep1:29.3439579,HLsacBil1:29.3439579):23.36548583):5.952678892):17.76336425):11.40969017);
```
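Since CAFE expects a binary, ultrametric tree, it can be worth checking the tree programmatically before queueing the run. An illustrative check with ete3 (already available in the cafeFig environment); this is not part of the pipeline:
```
from ete3 import Tree

t = Tree('graham_mod_dated_no_hg38.tree', format=1)

# binary: every internal node has exactly two children
non_binary = [n for n in t.traverse() if not n.is_leaf() and len(n.children) != 2]
print('binary:', len(non_binary) == 0)

# ultrametric: all root-to-tip distances are (nearly) equal
depths = [leaf.get_distance(t) for leaf in t.get_leaves()]
print('ultrametric:', max(depths) - min(depths) < 1e-3)
```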
#### sbatch_run1.sh
```
#!/bin/bash
# sbatch_run1.sh
#SBATCH --job-name=cafe
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --time=0:08:00
#SBATCH -p cpu
# Call the script that runs the CAFE analysis
cafe run1.sh
```
#### run1.sh
```
#!cafe
# make sure you are in the orthofinder environment
load -i final_cafe_input_longevity.tsv -t 36 -l reports/1st_run_log.txt
tree (mm10:87.83517705,(((felCat9:49.66644369,canFam4:49.66644369):11.76859872,((susScr11:20.17442777,bosTau9:20.17442777):19.83282604,equCab3:40.00725381):21.4277886):14.99044447,((((((((rhiTri2:12.70240393,rhiLuc4:12.70240393):10.90116398,rhiSed2:23.60356791):10.06281168,rhiAff2:33.66637959):12.70900545,rhiFer5:46.37538505):9.624614955,((aseSto2:17.11302622,hipLar2:17.11302622):21.27756323,hipCyc2:38.39058945):17.60941055):0.8822414306,rhiMic1:56.88224143):0.8845209647,rouAeg4:57.7667624):0.8953602335,(((((pipKuh2:14.95167539,antPal2:14.95167539):16.82912223,(myoNig1:15.85321175,myoMyo6:15.85321175):15.92758587):15.11676827,(tadBra2:23.67764331,molMol2:23.67764331):23.21992258):2.622233134,((desRot8A:15.3632479,dipEca1:15.3632479):18.60429862,((phyDis3:10.86062202,phyHas1:10.86062202):10.65758233,artJam3:21.51820435):12.44934218):15.5522525):3.189644714,(sacLep1:29.3439579,sacBil1:29.3439579):23.36548583):5.952678892):17.76336425):11.40969017)
lambda -s -t tree (1,(((1,1)1,((1,1)1,1)1)1,((((((((1,1)1,1)1,1)1,1)1,((1,1)1,1)1)1,1)1,1)1,(((((1,1)1,(1,1)1)1,(1,1)1)1,((1,1)1,((1,1)1,1)1)1)1,(1,1)1)1)1)1)
```
## Adding GO terms to orthogroups
working directory is /project/tlama_umass_edu/projects/project_bat1k_longevity/data/orthofinder/protein_fastas/8_cafe/jul_23
```
source activate orthofinder
conda install -c conda-forge r-gprofiler
module load r/4.2.0
```
Note: make sure you have entered an R session and run install.packages('gprofiler2') before running the R script.