---
title: 'Orthofinder'
disqus: hackmd
---
By Tanya Lama
## Objective
We will use CAFE to analyze protein family evolution (gene family expansion/contraction) for the bat1k longevity project. For our comparative analyses, annotations have been generated using TOGA by Ariadna Morales and Michael Hiller.
However, we need to validate the copy numbers extracted from TOGA, specifically from the orthology.tsv files. Lexi was supposed to pursue this but did not.
As an alternative, Tanya is extracting copy numbers directly from the assemblies using OrthoFinder:
- from protein FASTA sequences (generated by TOGA as well)
## Table of Contents
[TOC]
## Learning from [Jebb et al.](https://www.nature.com/articles/s41586-020-2486-3#Sec10)
> To investigate expansions and contractions of protein families, we used CAFE with a false discovery rate < 0.05 cut-off. As input for CAFE, we clustered Ensembl-annotated proteins into families using POrthoMCL and the PANTHER Database (v.14.0) and our ultrametric time tree, generated using r8s.
Additional detail was provided [in the supplement](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-020-2486-3/MediaObjects/41586_2020_2486_MOESM1_ESM.pdf)
> To investigate expansions and contractions of protein families, we used CAFE (v4.0)131. CAFE requires annotated proteins assigned to families. To this end, we downloaded GFF3 files from Ensembl which were available for 25 of the 42 considered species. To obtain a single isoform from each gene, we used genePredSingleCover (https://github.com/ENCODE-DCC/kentUtils.git) to obtain the longest transcript from each locus, and translated this transcript to a peptide sequence. The POrthoMCL pipeline (https://github.com/etabari/PorthoMCL, accessed 12/6/2019; commit dec8e5f), a parallel implementation of the OrthoMCL algorithm, was used to cluster proteins into families. All families were assigned PANTHER Database (v14.0) IDs, based on the human genes contained in a family. Families assigned the same ID were merged. Where POrthoMCL families were composed of multiple families, all families with overlap based on PANTHER IDs were merged. To obtain families that were already present in the Placentalia root, we retained those families where at least one member was present in all bats, at least one of human, mouse or rat (representative Euarchontoglires) and one Atlantogenata species. Families which varied in size by more than 100 members between the species with the highest and lowest count were also removed. CAFE was then used to identify families which underwent expansion or contraction at the base of bats, using the previously produced ultrametric time tree (see Supplementary Note 4.2 above). The caferror.py function was used to estimate an error model for the data, which was used in further analysis. A single lambda, or birth/death parameter, was inferred for the entire tree. Families were retained if estimated to have undergone a significant expansion or contraction at the ancestor of all bats with an FDR value < 0.05 (Supplementary Table 12).
## Installing Orthofinder and dependencies on Unity
We created a conda environment called orthofinder for all dependencies.
Be sure to `source activate orthofinder` before you submit any jobs
## Set up
Nucleotide and protein FASTA files are located in the /project space
`/project/tlama_umass_edu/projects/project_bat1k_longevity/data/ `
### 1_raw_protein_fastas
We want to retain just the query protein sequences from each protein.fasta file, and drop all of the reference (human) sequences.
Create a folder with all of your protein_fasta files (output from toga) called 1_raw_protein_fastas and create a list called list_protein_fastas
```
ls ./*.prot.fa > list_protein_fastas
```
### We automated this filtering for all of our protein fasta files using `sbatch_grep_query.sh`, which submits `grep_query.py`:
```
#!/bin/bash
#SBATCH --job-name=grep
#SBATCH --output=grep_%j.out
#SBATCH --error=grep_%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --time=0:20:00
#SBATCH -p cpu
# This script submits grep_query.py. To execute this script you need to run: sbatch sbatch_grep_query.sh -s dipEca1
while getopts ":s:" opt; do
case $opt in
s) species="$OPTARG"
;;
\?) echo "Invalid option -$OPTARG" >&2
;;
esac
done
#Load any necessary modules
module load python
#Run the Python script with the command line argument "dipEca1"
python grep_query.py "${species}"
```
#### grep_query.py
```
import sys
if len(sys.argv) < 2:
print("Please provide the species name as a command line argument")
exit()
species = sys.argv[1]
with open(f"{species}.pep") as f:
lines = f.readlines()
output = []
for i in range(len(lines)):
if lines[i].startswith(">ENST") and "QUERY" in lines[i] and i+1 < len(lines):
output.append(lines[i].strip())
output.append(lines[i+1].strip())
output_text = "\n".join(output)
with open(f"query_{species}.pep", "w") as f:
f.write(output_text)
# The output of this script is a protein fasta called query_${species}.pep that retains ONLY query sequences
```
### 2_query_protein_fastas
Move all of the outputs called query_${species}.pep to a folder called `2_query_protein_fastas`
### What's in our protein.fasta files?
Let's compare the orthology.tsv file from the TOGA annotation to the protein.fasta file that Ariadna gave us. Note: I initially compared this to orthology.tsv files from the Senckenberg repository and found that the chain IDs didn't match our data. So we must use the orthology.tsv files that Ariadna provided along with the protein.fastas, as there seem to be multiple versions of HLantPal2.
orthology.tsv classifies orthology relationships:
* one2many -- a gene in the reference species is orthologous to multiple genes in the query species
* one2one -- (pairwise orthology) a gene in the reference species has only one ortholog in the query species
* many2many -- orthologs in the reference and the query species have undergone lineage-specific duplications
* many2one -- a gene in the query species is orthologous to multiple genes in the reference species
* one2zero -- a gene is found in human but there is no ortholog in the query species
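As a quick sanity check, these classes can be tallied per species from the orthology table. A minimal sketch (the file name and the `orthology_class` column name are assumptions based on the orthology_classification.tsv format used later in this pipeline):
```
import pandas as pd

# Count how many q_transcripts fall into each orthology class for one species
df = pd.read_csv('HLantPal2_orthology_classification.tsv', sep='\t')
print(df['orthology_class'].value_counts())
```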
### Does the protein.fasta file include one2many sequences? YES
`grep '>ENST00000333762.H1-10.317' antpal2.pep`
### Does the protein.fasta file include one2one sequences? YES
`grep '>ENST00000226021.CACNG1.66' antpal2.pep`
I believe these can be dropped because they are not informative for CAFE
### Does the protein.fasta file include many2many sequences? YES
`grep '>ENST00000599845.CSAG3.704154' antpal2.pep`
### Does the protein.fasta file include many2one sequences? YES
`grep '>ENST00000294337.CYP4A22.6' antpal2.pep`
### Does the protein.fasta file include one2zero sequences? SORT OF?
These sequences are designated as one2zero but they still show up in the protein.fasta. These are likely gene losses (pseudogenizations) that we should investigate separately.
```
grep -A 1 '>ENST00000372617.GLRA4' antpal2.pep
>ENST00000372617.GLRA4.5 | PROT | QUERY
MSQRLL--TLSSYFFILFSFPV----EVSLKLLAFSR-------K---------FL-------------------NSKPFLPL-----LFSSSFMVMAKA-MDYRVNVFLWQQRHDPRLAYQDX-DDSLDLDPSMLDAI*-TPDLFFVNEKGANFHEVTTDNKLLRIFKNGNVLYSIRLTLILFCPMDLKNFPMDIQICIMQLESIGYIMNDLVFEWLEDAPAVQVAEGLTLSQFILRDEKDLGYYTKHYNTGKFTCIEVKFHLEWQMGYYLIQMYIPSLLIAILS*-VSFCINMDAAPAHVGLGITTVLNMTTQSSGS*-ASLPKVSYVKAIDIWMAVRLLFVFAALLEYAAVNFVSCQHKEFIRLRRRQKHQHM-------N----------LSL-----SPAGETAVFSLK--GALSREEI--------*
```
Liliana says that these one2zero sequences may be informative. If they are one2zero in all bats, we should drop them. However if they are one2zero in some bats but not others, then they are interesting for gene family comparison. These may be unique lineage-specific gene losses that should be further investigated.
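To separate the two cases, one option is to collect the one2zero reference genes per species and intersect them. This is only an illustrative sketch, assuming the per-species orthology_classification.tsv files (see 3_orthology_tsvs below) with `t_gene` and `orthology_class` columns:
```
import glob
import os
import pandas as pd

# Collect the one2zero reference genes per species (paths and column names are assumptions)
losses = {}
for path in glob.glob('3_orthology_tsvs/modified/*_orthology_classification.tsv'):
    species = os.path.basename(path).split('_')[0]
    df = pd.read_csv(path, sep='\t')
    losses[species] = set(df.loc[df['orthology_class'] == 'one2zero', 't_gene'])

# Genes that are one2zero in every species could be dropped;
# genes that are one2zero in only some species are candidate lineage-specific losses
lost_everywhere = set.intersection(*losses.values()) if losses else set()
lineage_specific = {sp: genes - lost_everywhere for sp, genes in losses.items()}
```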
### 3_orthology_tsvs
We will now create a folder called 3_orthology_tsvs/ and prepare a list of one2one orthologs for each species.
We downloaded the orthology_classification.tsv files for each of our species using rclone into our 3_orthology_tsvs folder.
### Pad q_gene names with zeroes
In order to select the longest transcript (step 5) we need to compare the lengths of q_transcripts (ENSTs) grouped by gene ID (q_gene). However, grepping for reg_10 would also match reg_100, reg_1000, and reg_10000. Therefore we pad the q_gene names with zeroes so that every q_gene name has 5 digits.
#### pad_names.py
for each species, submit `python pad_names.py antPal2`
```
#!/usr/bin/env python
## This script pads q_gene names with zeroes so that every name has 5 digits, e.g., reg_10 becomes reg_00010. We can then search for the q_gene name itself in 1_longest_transcript.sh (next step) without worrying about reg_10, reg_100, and reg_1000 being treated as matching q_gene names.
import sys
import re
import pandas as pd
# Get the value for species from the command line argument
species = sys.argv[1]
def pad_gene_names(gene_names):
padded_names = []
for name in gene_names:
match = re.match(r'(\D+)(\d+)', name)
if match:
prefix = match.group(1)
number = match.group(2)
padded_number = number.rjust(5, '0')
padded_name = prefix + padded_number
padded_names.append(padded_name)
else:
padded_names.append(name)
return padded_names
# Read in your file as a DataFrame
df = pd.read_csv(f'HL{species}_orthology_classification.tsv', sep='\t')
# Apply the function to all values in column 3
df['q_gene'] = pad_gene_names(df['q_gene'])
# Write the modified DataFrame back to a new file
df.to_csv(f'./modified/HL{species}_orthology_classification.tsv', sep='\t', index=False)
```
### Create a list of files in the /modified folder
```
ls modified/*.tsv > list_orthology_tsvs
```
### Create one2one_${species}.txt
We need a list of one2one **q_transcripts** for each species so that we can drop them from the protein fastas for CAFE. Loop over the files in list_orthology_tsvs, selecting only the q_transcripts whose orthology_class is 'one2one'. Drop any duplicate q_transcripts, and save the result as one2one_${species}.txt
#### sbatch_one2ones.sh
```
#!/bin/bash
# Loop over each TSV file in list_orthology_tsvs
while read -r file_path; do
# Extract the characters 3-9 from the file name
file_name=$(basename "$file_path")
file_prefix=${file_name:2:7}
# Extract the q_transcript values for rows where orthology_class == 'one2one'
awk -F $'\t' '$5 == "one2one" { print $4 }' "$file_path" | sort -u > "one2one_${file_prefix}.txt"
done < list_orthology_tsvs
```
### 4_one2ones
Move these one2one_${species}.txt files to a folder called `4_one2ones`
### Dropping one2one orthologs from antPal2.pep
We need to do the following:
For each one2one ortholog (e.g., ENST00000000233.ARF5), we need to find the matching header line in the protein fasta, such as:
```
>ENST00000000233.ARF5.22 | PROT | QUERY
MKKPRAAVGSRPRKQTSSREEREKHGFNSSQAKPTIPDAGQARRQEEVVLQASVSPYHQFRDTAEVTVFRDSLLSWYDLKKRDLPWRRQAEGEVDLDRRAYAVWVSEVMLQQTQVATVINYYTRWMQKWPTLQDLAGASLEEVNQLWAGLGYYSRGRRLQEGARKVVEELGGHVPRTAETLQRLLPGVGRYTAGAIASIAFGQVTGVVDGNVVRVLCRVRGIGADPRNTFVSQQLWSLAQQLVDPARPGDFNQAAMELGAIVCTPQHPHCSQCPVQSLCRAHQRVERTQLPASQSLPGSPDVEECAPNTGQCQLCAPPTQPWDQTLGVANFPRKASRKPPREECSATCVLEQPRAPGGARVLLVQRPNSGLLAGLWEFPSVSVHPSGRHQRQALLQELRSWAGPLPAARLRHLGQVVHVFSHIKLTYQVYGLALEGQAPVTGTAPGARWLTREEFHTAAVSTAMKKVFRVYEGQQSGTCKSSKRCQVSTASRRKKPRPGQQVLDSFFRPRIPTDAPRPNSTA-*
```
and drop the matching header line AND the line after it (the protein sequence for that one2one ortholog).
We have automated this in the following way:
`sbatch_ensts.sh` executes `ensts.py`; waits 10 seconds, then executes `drop_one2ones.py`
#### sbatch_ensts.sh
```
#!/bin/bash
#SBATCH --job-name=enst
#SBATCH --output=enst_%j.out
#SBATCH --error=enst_%j.err
#SBATCH --ntasks-per-node=40
#SBATCH --nodes=1
#SBATCH --time=0:30:00
#SBATCH -p cpu
# This script submits ensts.py, which then waits 10 seconds and starts running drop_one2ones.py. To execute this script you need to run sbatch sbatch_ensts.sh -s dipEca1. Note that this ran in WELL under an hour.
while getopts ":s:" opt; do
case $opt in
s) species="$OPTARG"
;;
\?) echo "Invalid option -$OPTARG" >&2
;;
esac
done
#Load any necessary modules
module load python
#Run the Python script with the command line argument "dipEca1"
python ensts.py "${species}"
#mkdir ./logs
# Output job duration
sacct --format=JobID,Elapsed -j $SLURM_JOB_ID > ./logs/$SLURM_JOB_ID.log
# Output compute resource usage
sacct --format=JobID,AllocCPUs,AllocGRES -j $SLURM_JOB_ID >> ./logs/$SLURM_JOB_ID.log
```
#### ensts.py
```
#!/usr/bin/env python
import sys
import time
import subprocess
#Get the value for species from the command line argument
species = sys.argv[1]
# Open the files
with open(f'one2one_{species}.txt', 'r') as f1, open(f'../2_query_protein_fastas/query_{species}.pep', 'r') as f2, open(f'ensts_{species}.txt', 'w') as f3:
# Read the lines from one2one_antPal2.txt and store them in a list
search_strings = [line.strip() for line in f1]
# Loop over the lines in species.pep
for line in f2:
# Check if the line contains any of the search strings
if any(search_string in line for search_string in search_strings):
# If it does, write the line to the output file
f3.write(line)
# Wait for 10 seconds
time.sleep(10)
#Submit the drop_one2ones.py script
subprocess.run(['python', 'drop_one2ones.py', species])
```
#### drop_one2ones.py
```
import sys
# Get the value for species from the command line argument
species = sys.argv[1]
# Open the input files
with open(f'ensts_{species}.txt', 'r') as f1, open(f'../2_query_protein_fastas/query_{species}.pep', 'r') as f2:
# Read the ENST names into a list
enst_names = [line.strip() for line in f1]
# Loop through each line in antpal2.pep
with open(f'{species}_final.pep', 'w') as out_file:
for line in f2:
# If the line matches an ENST name, skip it and the next line
if any(enst in line for enst in enst_names):
next(f2)
# Otherwise, write the line to the output file
else:
out_file.write(line)
```
The final product is a protein fasta without any one2one orthologs, named `${species}_final.pep`
### Filter for longest transcript
As a last step, we will retain only the longest transcript (one ENST for each q_gene).
#### a_longest_transcript.sh
```
#!/bin/bash
#SBATCH --job-name=longest_transcript
#SBATCH --output=long_%j.out
#SBATCH --error=long_%j.err
#SBATCH --ntasks-per-node=40
#SBATCH --nodes=1
#SBATCH --time=1:00:00
#SBATCH -p cpu
#This script is run using bash a_longest_transcript.sh -s antPal2
while getopts ":s:" opt; do
case $opt in
s) species="$OPTARG"
;;
\?) echo "Invalid option -$OPTARG" >&2
;;
esac
done
# Create directories js and finalJs
mkdir -p "${species}_js" "${species}_finalJs"
#first make a document of all the genes (q_genes) said to be found in species
#awk '{print $4}' ../query_gene_spans.bed > myoNig_geneIDs
#We don't have query_gene_spans.bed from TOGA for all our species, so we want to use orthology_tsv instead.
#First make a file with all the genes (regxxx) said to be found in the species.
awk '{print $3}' ../3_orthology_tsvs/modified/HL"${species}"_orthology_classification.tsv | sort -u | grep "reg" > "${species}_geneIDs"
#read this file as you are going to loop through the genes
while read j; do
#first create a variable for each q_gene name such as reg_00011
r=$j
#echo $r
#now search for this gene in the orthology document and save results to a file
cat ../3_orthology_tsvs/modified/HL"${species}"_orthology_classification.tsv | grep "$r" > ./"${species}_js"/"$j"
#looper=$(cat ./"${species}_js"/"$r" | wc -l)
#take only the 4th column, which has the ENST transcripts for each gene
awk '{print $4}' ./"${species}_js"/"$j" > ./"${species}_js"/x_"$j"
#start a counter, that as you loop through each transcript, the longest will be saved
counter=0
#so open this file and loop through the ENST
for i in $(cat ./"${species}_js"/x_"$j")
do
#search for each ENST in the query file, look at the next line, count alphabetic characters
p=$(cat ../4_one2ones/"${species}"_final.pep | grep "$i" -A1 | grep -o "[a-zA-Z]*"| wc -c)
#first result will beat counter of 0, each after will have to beat new counter
if [[ $p -gt $counter ]]
then
counter=$p
#if higher than counter, overwrite final j file with new longest
cat ../4_one2ones/"${species}"_final.pep | grep -w "$i" -A1 > ./"${species}_finalJs"/"$j"
fi
done
#f=$(cat $folder$j".fa" | grep "$id")
#echo $j $f >> "gene_idsgc_"$spec".txt"
done < "${species}_geneIDs"
# Remove directories js and finalJs
#rm -r "${species}_js" "${species}_finalJs"
#mkdir ./logs
# Output job duration
sacct --format=JobID,Elapsed -j $SLURM_JOB_ID > ./logs/$SLURM_JOB_ID.out
# Output compute resource usage
sacct --format=JobID,AllocCPUs,AllocGRES -j $SLURM_JOB_ID >> ./logs/$SLURM_JOB_ID.out
# The output of this script is a folder of transcripts which need to be combined using the 2_combiner.sh script. Then, the 3_remover.sh script needs to be run to remove the /finaljs and /js folders.
# combiner.sh produces a protein fasta file which is ready for porthomcl. It contains only: query sequences, NO one-to-one orthologs, only the longest transcript for each gene.
```
#### 2_combiner.sh
```
#!/bin/bash
#SBATCH --job-name=combiner
#SBATCH --output=combiner_%j.out
#SBATCH --error=combiner_%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --time=0:05:00
#SBATCH -p cpu
# this script combines the longest ENSTs that have made it to the finalJs folder for each species. This script is submitted as bash 2_combiner.sh -s species
while getopts ":s:" opt; do
case $opt in
s) species="$OPTARG"
;;
\?) echo "Invalid option -$OPTARG" >&2
;;
esac
done
while read j; do
cat ./"${species}_finalJs"/"$j" >> ./longest_transcript_faas/"${species}_longest_transcript.faa"
done < "${species}_geneIDs"
# The output of this script is a protein fasta file (ending .faa) which 1) retains only QUERY sequences, 2) retains only paralogs, and 3) retains only the longest transcript (1 ENST) for each q_gene. This output is ready for step 6: porthomcl.
```
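For reference, the combined effect of the longest-transcript and combiner steps above (keep only the longest query sequence per q_gene) can be sketched more compactly in Python. This is purely illustrative and assumes the padded orthology table from step 3 and the ${species}_final.pep output from step 4; the production runs used the SLURM scripts above.
```
import pandas as pd

species = 'antPal2'  # example

# q_gene -> q_transcript mapping from the padded orthology table (path assumed from step 3)
orth = pd.read_csv(f'../3_orthology_tsvs/modified/HL{species}_orthology_classification.tsv', sep='\t')

# read the paralog-only protein fasta (one2ones already dropped) into header -> sequence
seqs = {}
header = None
with open(f'../4_one2ones/{species}_final.pep') as fh:
    for line in fh:
        line = line.strip()
        if line.startswith('>'):
            header = line
            seqs[header] = ''
        elif header:
            seqs[header] += line

def aa_len(seq):
    # count alphabetic characters only, matching the grep -o "[a-zA-Z]*" | wc -c approach
    return sum(c.isalpha() for c in seq)

# for each q_gene, keep the header/sequence pair with the longest sequence
with open(f'{species}_longest_transcript.faa', 'w') as out:
    for q_gene, grp in orth.groupby('q_gene'):
        best, best_len = None, 0
        for enst in grp['q_transcript'].astype(str):
            for hdr, seq in seqs.items():
                if enst in hdr and aa_len(seq) > best_len:
                    best, best_len = hdr, aa_len(seq)
        if best:
            out.write(best + '\n' + seqs[best] + '\n')
```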
#### cat_list_of_paralogs.sh
We want to be sure that only paralogs have made it to our finalJs folder, and into the ${species}_longest_transcript.faa input for porthomcl.
```
#!/bin/bash
#SBATCH --job-name=paralogs
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --time=0:05:00
#SBATCH -p cpu
# This script makes a list of everything that has reached our finalJs folder and prints the orthology classification from the js folder. The result is a list_of_paralogs_${species} file that should include only paralogs (no one2one orthologs in column 5).
# This script is run using bash cat_list_of_paralogs.sh -s antPal2
while getopts ":s:" opt; do
case $opt in
s) species="$OPTARG"
;;
\?) echo "Invalid option -$OPTARG" >&2
;;
esac
done
ls ./${species}_finalJs > list_${species}finaljs
for i in $(cat ./list_${species}finaljs); do cat ./${species}_js/$i >> list_of_paralogs_${species}; done
```
#### empties.py
We can do the same by confirming the list of one2one orthologs that have NOT made it to our finalJs folder for each species. These q_genes are listed in the combiner_XXXXX.err log generated by each combiner.sh job. We can collect these q_genes from the error log using `empties.py`.
```
import re
# This script confirms that one2one orthologs have NOT reached our finalJs folder and ended up in our longest_transcript.faa protein file for porthomcl. In brief, it collects q_gene names that were included in the js folder but not included (empty) in the finalJs folder for each species. The result is a list of commands (cat_list_of_one2ones.sh) which when executed should list a ton of one2one orthologous q_genes (and their names).
# Open the input and output files
with open('combiner_5636189.err', 'r') as input_file, open('cat_list_of_one2ones.sh', 'w') as output_file:
# Loop through each line in the input file
for line in input_file:
# Remove the ":" character
line = line.replace(':', '')
# Replace "finalJs" with "js"
line = line.replace('finalJs', 'js')
# Remove ": No such file or directory"
line = re.sub(r'No such file or directory', '', line)
# Write the modified line to the output file
output_file.write(line)
```
#### 3_remove_species.sh
These js and finalJs folders are really big. You may want to remove them, but don't do it on the login node. Execute 3_remove_species.sh to remove a single species' folders, or 4_remove_js_finalJs.sh to remove ALL of the js and finalJs folders.
#### 6_porthoMCL
Copy the ${species}_longest_transcript.faa files into a folder called 6_porthoMCL
Create a CONTAINER folder (e.g., longevity_CONTAINER)
Move all of our `${species}_longest_transcript.faa` files into a subfolder called `0.input_faa`
Execute porthomcl with `sbatch porthomcl_slurm.sh longevity_CONTAINER`
```
#!/bin/bash
#
#SBATCH --job-name=bostau-porthomcl
#SBATCH -o slurm-%j.out # %j = job ID
#SBATCH --ntasks-per-node=24
#SBATCH --nodes=1
#SBATCH --time=36:00:00
#SBATCH -p cpu-long
# This script executes porthomcl on a folder of protein fastas. Your protein fastas should 1) retain only QUERY sequences 2) no one2one orthologs 3) one longest transcript (ENST) for each q_gene 4) be inside CONTAINER/0.input_faa/
module load anaconda
conda activate /project/tlama_umass_edu/envs/porthomcl
cd /project/tlama_umass_edu/projects/project_bat1k_longevity/toga-paralogy/porthomcl
DIR=$1 ##Parent directory that contains "0.input_faa" with all protein .faas
bash porthomcl.sh -t 16 ${DIR}
```
# Run Orthofinder
To confirm gene family membership, run Orthofinder on all of the protein FASTAs, modified to include only the longest transcript (ENST) for each ENSG (<species>_longest_transcript.faa_shortened.faa). The input argument for Orthofinder (-f) is simply the directory that contains all of our protein files (/project/tlama_umass_edu/projects/project_bat1k_longevity/orthofinder/orthofiles).
Run orthofinder.sh:
```
#!/bin/bash
#SBATCH --output=ortho_23Jul.out
#SBATCH --error=mv-ortho_23Jul.err
#SBATCH --job-name=ortho_23Jul
#SBATCH --nodes=1
#SBATCH --time=168:00:00
#SBATCH --ntasks=20
#SBATCH --partition=cpu-long
module load anaconda/2022.10
conda activate /project/tlama_umass_edu/envs/orthofinder/
orthofinder -f /project/tlama_umass_edu/projects/project_bat1k_longevity/orthofinder/orthofiles -o /project/tlama_umass_edu/projects/project_bat1k_longevity/orthofinder/results_1k_23Jul
```
The analysis takes approximately 20 hours to run and writes a set of results directories and files to `results_1k_23Jul`.
# Format the orthogroup data for CAFE
working directory is /project/tlama_umass_edu/projects/project_bat1k_longevity/data/orthofinder/protein_fastas/8_cafe
The next step is to take the Orthogroups.tsv data from OrthoFinder and do some filtering and formatting in preparation for CAFE. There is no need to enrich the orthogroups (OGs) at this stage.
REMEMBER: to run CAFE we need an ultrametric time tree. We have one from Graham.
We want to calculate:
1) the number of genes/species in each OG
2) the number of genes/OG
Some filtering rules:
1) drop OGs with > 450 genes
2) drop OGs with > 100 size variation: "Families which varied in size by more than 100 members between the species with the highest and lowest count were also removed." (see the sketch after this list)
3) OGs with > 80% similarity must be merged **this is more complicated than it sounds, people!**
4) drop all olfactory gene OGs
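Rules 1 and 2 can be applied directly to the OrthoFinder gene-count table. A minimal sketch, assuming the standard Orthogroups.GeneCount.tsv output (one row per OG, one count column per species plus a Total column); rules 3 and 4 are handled in the steps and manual curation below:
```
import pandas as pd

# OrthoFinder gene-count table: one row per orthogroup, one count column per species (file name assumed)
counts = pd.read_csv('Orthogroups.GeneCount.tsv', sep='\t')
species_cols = [c for c in counts.columns if c not in ('Orthogroup', 'Total')]

size_ok = counts[species_cols].sum(axis=1) <= 450  # rule 1: drop OGs with > 450 genes
spread_ok = (counts[species_cols].max(axis=1) - counts[species_cols].min(axis=1)) <= 100  # rule 2: drop OGs with > 100 size variation

counts[size_ok & spread_ok].to_csv('filtered_gene_counts.tsv', sep='\t', index=False)
```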
## Calculating similarity between OGs
We need to extract gene names and summarize the makeup of each OG, then compare all of the OGs against one another and calculate a similarity metric. Pairs with similarity above ~80% are printed to an output called jaccard_similarity.txt. OGs with >= 80% similarity will eventually be merged.
#### 1b_calculate_jaccard_similarity.py
```
# This script takes an Orthofinder output file and extracts all of the gene names from each Orthogroup (OG); then summarizes the makeup of the OG (e.g., 10% DEFA1; 90% DEFA2), compares each OG to every other OG (e.g., OG00 to OG01), and calculates a similarity. If similarity > 0.79 (roughly 80%), the pair is printed to jaccard_similarity.txt.
import csv
import re
def extract_gene_name(gene_name):
match = re.search(r'\.(.+?)[.-]', gene_name)
if match:
return match.group(1)
return ""
def calculate_gene_percent(gene_set):
gene_counts = {}
total_genes = len(gene_set)
for gene in gene_set:
if gene in gene_counts:
gene_counts[gene] += 1
else:
gene_counts[gene] = 1
gene_percentages = {}
for gene, count in gene_counts.items():
percentage = (count / total_genes) * 100
gene_percentages[gene] = percentage
return gene_percentages
def calculate_similarity(gene_percentages1, gene_percentages2):
gene_set1 = set(gene_percentages1.keys())
gene_set2 = set(gene_percentages2.keys())
intersection = gene_set1.intersection(gene_set2)
union = gene_set1.union(gene_set2)
similarity = len(intersection) / len(union)
return similarity
# Read the CSV file
with open('cafe_inputs_manip.csv', 'r') as file:
reader = csv.DictReader(file)
header = reader.fieldnames
orthogroups = {}
orthogroup_column = header[0]
for row in reader:
orthogroup = row[orthogroup_column]
gene_set = set()
for column in header[1:]:
gene = extract_gene_name(row[column])
if gene:
gene_set.add(gene)
orthogroups[orthogroup] = gene_set
# Compare orthogroups and calculate similarity
merge_log = []
for orthogroup1, geneset1 in orthogroups.items():
for orthogroup2, geneset2 in orthogroups.items():
if orthogroup1 == orthogroup2:
continue
gene_percentages1 = calculate_gene_percent(geneset1)
gene_percentages2 = calculate_gene_percent(geneset2)
similarity = calculate_similarity(gene_percentages1, gene_percentages2)
if similarity > 0.79:
merge_log.append({
'Orthogroup 1': orthogroup1,
'Orthogroup 2': orthogroup2,
'Gene Percentages 1': gene_percentages1,
'Gene Percentages 2': gene_percentages2,
'Similarity': similarity
})
# Write merge log to a file
with open('jaccard_similarity.txt', 'w') as file:
for entry in merge_log:
file.write(f"Orthogroup 1: {entry['Orthogroup 1']}\n")
file.write("Gene Percentages 1:\n")
for gene, percentage in entry['Gene Percentages 1'].items():
file.write(f"{gene}: {percentage}%\n")
file.write(f"\nOrthogroup 2: {entry['Orthogroup 2']}\n")
file.write("Gene Percentages 2:\n")
for gene, percentage in entry['Gene Percentages 2'].items():
file.write(f"{gene}: {percentage}%\n")
file.write(f"\nSimilarity: {entry['Similarity']}\n\n")
```
The output of 1b_calculate_jaccard_similarity.py is a file called jaccard_similarity.txt with the following format:
```
Orthogroup 1: OG0000000
Gene Percentages 1:
IGHV3: 100.0%
Orthogroup 2: OG0001306
Gene Percentages 2:
IGHV3: 100.0%
Similarity: 1.0
```
### grep Orthogroups with > 80% similarity
Reduce the complexity of jaccard_similarity.txt by retaining only the lines with Orthogroup 1, Orthogroup 2, and Similarity.
#### 2_grep_similarity.sh
This script outputs a file called grep_similarity.txt:
```
grep -E "Orthogroup 1:|Orthogroup 2:|Similarity:" jaccard_similarity.txt > grep_similarity.txt
```
grep_similarity.txt looks like this
```
Orthogroup 1: OG0000000
Orthogroup 2: OG0000015
Similarity: 80.0%
Orthogroup 1: OG0000002
Orthogroup 2: OG0001295
Similarity: 94.11764705882354%
Orthogroup 1: OG0000002
Orthogroup 2: OG0001559
Similarity: 94.11764705882354%
Orthogroup 1: OG0000002
```
#### 3_clean_pairs.py
Because we did an all-by-all comparison of the orthogroups to measure similarity, we have duplicate orthogroup pairs (e.g., OG0 was compared to OG1, and OG1 was also compared to OG0). To remove these redundancies, we use 3_clean_pairs.py, which keeps track of all of the orthogroup pairs and writes a file called clean_pairs.txt free of duplicates.
```
# This script takes the grep_similarity.txt summary and eliminates duplicated comparisons. For example, if OG1 and OG2 are 80% similar, we do not also need the OG2 vs. OG1 similarity to know that the two need to be merged.
seen_pairs = set()
with open('filtered_similarity.txt', 'r') as file:
lines = file.readlines()
clean_lines = []
for i in range(0, len(lines), 3):
orthogroup1_line = lines[i].strip()
orthogroup2_line = lines[i + 1].strip()
similarity_line = lines[i + 2].strip()
orthogroup1 = orthogroup1_line.split(":")[1].strip()
orthogroup2 = orthogroup2_line.split(":")[1].strip()
pair = (orthogroup1, orthogroup2)
reversed_pair = (orthogroup2, orthogroup1)
if pair not in seen_pairs and reversed_pair not in seen_pairs:
clean_lines.append(f"{orthogroup1_line}\n")
clean_lines.append(f"{orthogroup2_line}\n")
clean_lines.append(f"{similarity_line}\n")
seen_pairs.add(pair)
with open('clean_pairs.txt', 'w') as file:
file.writelines(clean_lines)
```
#### After running 3_clean_pairs.py, run `grep -E "Orthogroup 1:|Orthogroup 2:" clean_pairs.txt > clean_pairs2.txt`
### Time to merge
After removing duplicates, we have about 1k orthogroups that need to be joined.
#### drop the "Similarity" line, we don't need it anymore
`grep -E "Orthogroup 1:|Orthogroup 2:" clean_pairs.txt > cleaner_pairs.txt`
#### 4_calculate_minimum_mergers.py
Ok, now this bit is quite clever: it takes clean_pairs.txt and creates a network with nodes and edges to determine the minimum number of mergers needed to join all of the OGs with similarity > 80%.
```
# This script reads the clean_pairs.txt file and identifies the minimum number of mergers needed. Each orthogroup is a node in a graph and each pair of orthogroups is an edge; the connected components of the graph give the groups of orthogroups that need to be merged together.
## Usage: python calculate_min_mergers.py > min_mergers.txt
def read_pairs(file_path):
pairs = []
with open(file_path, 'r') as pairs_file:
lines = pairs_file.readlines()
i = 0
while i + 1 < len(lines):
orthogroup1_line = lines[i].strip()
orthogroup2_line = lines[i + 1].strip()
orthogroup1 = orthogroup1_line.split(":")[1].strip()
orthogroup2 = orthogroup2_line.split(":")[1].strip()
pairs.append((orthogroup1, orthogroup2))
i += 2
return pairs
def create_graph(pairs):
graph = {}
for orthogroup1, orthogroup2 in pairs:
if orthogroup1 not in graph:
graph[orthogroup1] = set()
if orthogroup2 not in graph:
graph[orthogroup2] = set()
graph[orthogroup1].add(orthogroup2)
graph[orthogroup2].add(orthogroup1)
return graph
def find_connected_components(graph):
components = []
visited = set()
def dfs(node, component):
visited.add(node)
component.add(node)
for neighbor in graph[node]:
if neighbor not in visited:
dfs(neighbor, component)
for node in graph:
if node not in visited:
component = set()
dfs(node, component)
components.append(component)
return components
def main():
clean_pairs_file = 'clean_pairs.txt'
# Read pairs from the clean_pairs.txt file
pairs = read_pairs(clean_pairs_file)
# Create a graph from the pairs
graph = create_graph(pairs)
# Find connected components in the graph
connected_components = find_connected_components(graph)
# Print the minimum number of mergers needed
print(f"Minimum number of mergers needed: {len(connected_components)}")
# Print the connected components
for i, component in enumerate(connected_components):
print(f"Group {i + 1}: {', '.join(component)}")
if __name__ == "__main__":
main()
```
#### 5_perform_min_mergers.py
```
import csv
output_file = 'merged_output.csv'
cafe_file = 'cafe_inputs_manip.csv'
min_mergers_file = 'min_mergers.txt'
# Read the header row from the cafe_inputs_manip.csv file
with open(cafe_file, 'r') as cafe:
reader = csv.reader(cafe)
header_row = next(reader)
# Create a dictionary to store the merged rows
merged_rows = {}
# Read the orthogroups to be merged from min_mergers.txt
with open(min_mergers_file, 'r') as min_mergers:
for line in min_mergers:
if line.startswith('Group'):
orthogroups = line.split(':')[1].strip().split(', ')
group_name = line.split(':')[0].strip()
merged_row = {header_row[0]: group_name}
for orthogroup in orthogroups:
# Read the rows for the current orthogroup from cafe_inputs_manip.csv
with open(cafe_file, 'r') as cafe:
reader = csv.DictReader(cafe)
for row in reader:
if row[header_row[0]] == orthogroup:
# Merge the rows for the current orthogroup
for key in header_row[1:]:
if key not in merged_row:
merged_row[key] = row[key]
else:
merged_row[key] += ', ' + row[key]
break
# Add the merged row to the dictionary
merged_rows[group_name] = merged_row
# Write the merged rows to the output CSV file
with open(output_file, 'w', newline='') as merged_file:
writer = csv.DictWriter(merged_file, delimiter=',', fieldnames=header_row)
writer.writeheader()
for row in merged_rows.values():
writer.writerow(row)
# Write the summary of merged orthogroups to the summary_mergers.txt file
with open('summary_mergers.txt', 'w') as summary_file:
summary_file.write('Merged Rows:\n')
for group, orthogroups in enumerate(merged_rows.keys(), start=1):
summary_file.write(f'{orthogroups}: {", ".join(merged_rows[orthogroups])}\n')
summary_file.write('Merging completed successfully.')
```
#### 5b_clean_cells.py
```
# We need to take the merged_output.csv and do some manual curation before it can become merged_output_curated.csv.
# This includes removing commas ', , , , , ,' and spaces where empty cells have merged with empty cells. Running this script will reduce your time spent on manual curation.
# After running this script, merged_output_curated.csv and non_merged_output.csv need to be combined and manually edited (drop olfactory receptors)
import pandas as pd
def clean_cell(cell):
# Remove leading and trailing spaces
cell = cell.strip()
# Remove ', ' and nothing else
if cell == ', ':
cell = ''
# Remove ',' at the end and ensure the last character is a digit
if cell.endswith(','):
cell = cell[:-1] + '0'
return cell
def clean_csv(input_file, output_file):
# Read the CSV file
df = pd.read_csv(input_file)
# Apply the clean_cell function to each cell in the DataFrame
df = df.applymap(clean_cell)
# Save the cleaned DataFrame to a new CSV file
df.to_csv(output_file, index=False)
if __name__ == "__main__":
input_file = "merged_output_curated.csv"
output_file = "cleaned_merged_output.csv"
clean_csv(input_file, output_file)
```
#### 6_dropme.py
```
# in this script we are using the min_mergers.txt file to create a long list of Orthogroups that need to be dropped from cafe_inputs_manip.csv. These orthogroups are already represented in the merged groups (1-52) in merged_output.csv. We need a single file including the non-merged and merged orthogroups for cafe. Here we will create a list of 693 orthogroups that need to be dropped.
input_file = 'min_mergers.txt' # Replace 'your_file.txt' with the actual filename
# Read the content of the input file
with open(input_file, 'r') as file:
content = file.read()
# Split the content into groups based on 'Group' keyword
groups = content.split('Group')[1:]
# Extract orthogroup names from each group
all_orthogroups = []
for group in groups:
orthogroups = group.strip().split(':')[1].strip().split(', ')
all_orthogroups.extend(orthogroups)
# Write the orthogroup names to a new text file with a single column
output_file = 'drop_orthogroups.txt'
with open(output_file, 'w') as output:
for orthogroup in all_orthogroups:
output.write(orthogroup + '\n')
print('Orthogroups extracted and saved to', output_file)
```
#### 7_drop_orthogroups.py
```
# Read the list of orthogroups from drop_orthogroups.txt, locate those orthogroups in cafe_inputs_manip.csv, and drop the corresponding rows. The remaining rows are written to a new CSV file called non_merged_orthogroups.csv. We use the pandas library to handle the data manipulation.
import pandas as pd
# File paths
input_csv_file = 'cafe_inputs_manip.csv'
orthogroups_file = 'drop_orthogroups.txt'
output_csv_file = 'non_merged_orthogroups.csv'
# Read the list of orthogroups to drop
with open(orthogroups_file, 'r') as file:
orthogroups_to_drop = [line.strip() for line in file]
# Read the CSV file
df = pd.read_csv(input_csv_file)
# Drop rows corresponding to the specified orthogroups
df_filtered = df[~df['Orthogroup'].isin(orthogroups_to_drop)]
# Write the remaining rows to the output CSV file
df_filtered.to_csv(output_csv_file, index=False)
print('Rows corresponding to specified orthogroups have been dropped and the remaining rows have been saved to', output_csv_file)
```
#### 8_convert_into_cafe_format.py
```
import pandas as pd
# Function to count genes in a cell or assign 0 to empty cells
def count_genes(cell_value):
if pd.isna(cell_value) or cell_value.strip() == "":
return "0"
return str(len(cell_value.split(", ")))
# Step 7: Manual curation in Excel (merged_output_curated.csv)
# Between steps 7 and 8, I've pulled merged_output.csv and non_merged_orthogroups.csv
# onto my desktop and done some manual curation in excel. Essentially, the mergers created
# a lot of extra ',' especially where a cell is merged with an empty cell etc. Those needed
# to be deleted, and our merged and non-merged orthogroups needed to be manually inspected
# and merged into a file called merged_output_curated.csv. I also dropped orthogroups
# associated with Olfactory Reception per Liliana's directions.
# Step 8: Convert merged_output_curated.csv to CAFE input format
# Now we need to convert the merged_output_curated.csv into CAFE format.
# CAFE requires an input file in a tab-separated format with gene IDs
# (in this case, Orthogroup IDs) in the first column and gene counts
# (number of copies) for each species in subsequent columns.
# File paths
input_csv_file = 'merged_output_curated.csv'
output_cafe_file = 'cafe_input_file.tsv'
# Read the CSV file
df = pd.read_csv(input_csv_file)
# Replace empty cells with "0" and count genes in non-empty cells
for col in df.columns[1:]:
df[col] = df[col].fillna("").apply(count_genes)
# Prepare the CAFE input file in tab-separated format
with open(output_cafe_file, 'w') as f:
f.write("Gene_ID\t" + "\t".join(df.columns[1:]) + "\n")
for _, row in df.iterrows():
f.write(row["Orthogroup"] + "\t" + "\t".join(row[1:]) + "\n")
print("CAFE input file has been created:", output_cafe_file)
```
# Time to run CAFE (zixia)
Zixia ran CAFE and then created some preliminary figures using CAFEplotter for total, significant, and individual gene families
# Visualizing results with CAFE_fig
https://github.com/LKremer/CAFE_fig
Create a new conda environment for the packages (CAFE_fig requires very specific versions of ete3 and Python):
`conda env create -f cafeFig_environment.yml`
```
name: cafeFig
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- ca-certificates=2024.12.14=hbcca054_0
- cairo=1.14.6=4
- ete3=3.0.0b35=py36_1
- expat=2.6.4=h5888daf_0
- fontconfig=2.12.1=4
- freetype=2.7=1
- gettext=0.22.5=he02047a_3
- gettext-tools=0.22.5=he02047a_3
- glib=2.51.4=0
- graphite2=1.3.13=h59595ed_1003
- harfbuzz=1.4.3=0
- icu=58.2=hf484d3e_1000
- jpeg=9e=h0b41bf4_3
- libasprintf=0.22.5=he8f35ee_3
- libasprintf-devel=0.22.5=he8f35ee_3
- libblas=3.9.0=20_linux64_openblas
- libcblas=3.9.0=20_linux64_openblas
- libexpat=2.6.4=h5888daf_0
- libffi=3.2.1=he1b5a44_1007
- libgcc=14.2.0=h77fa898_1
- libgcc-ng=14.2.0=h69a702a_1
- libgettextpo=0.22.5=he02047a_3
- libgettextpo-devel=0.22.5=he02047a_3
- libgfortran=14.2.0=h69a702a_1
- libgfortran-ng=14.2.0=h69a702a_1
- libgfortran5=14.2.0=hd5240d6_1
- libgomp=14.2.0=h77fa898_1
- libiconv=1.17=hd590300_2
- liblapack=3.9.0=20_linux64_openblas
- liblzma=5.6.3=hb9d3cd8_1
- liblzma-devel=5.6.3=hb9d3cd8_1
- libopenblas=0.3.25=pthreads_h413a1c8_0
- libpng=1.6.43=h2797004_0
- libstdcxx=14.2.0=hc0a3c3a_1
- libstdcxx-ng=14.2.0=h4852527_1
- libtiff=4.0.10=hc3755c2_1005
- libuuid=2.38.1=h0b41bf4_0
- libxcb=1.17.0=h8a09558_0
- libxml2=2.9.9=h13577e0_2
- libxslt=1.1.32=hae48121_1003
- libzlib=1.2.13=h4ab18f5_6
- lxml=4.4.1=py36h7ec2d77_0
- lz4-c=1.9.3=h9c3ff4c_1
- ncurses=6.5=h2d0b736_2
- numpy=1.19.5=py36hfc0c790_2
- openssl=1.0.2u=h516909a_0
- pango=1.40.4=0
- pcre=8.45=h9c3ff4c_0
- pip=21.3.1=pyhd8ed1ab_0
- pixman=0.34.0=h14c3975_1003
- pthread-stubs=0.4=hb9d3cd8_1002
- pyqt=4.11.4=py36_3
- python=3.6.7=hd21baee_1002
- python_abi=3.6=2_cp36m
- qt=4.8.7=ha8c56c7_9
- readline=7.0=hf8c457e_1001
- scipy=1.5.3=py36h81d768a_1
- setuptools=58.0.4=py36h5fab9bb_2
- sip=4.18=py36_1
- six=1.16.0=pyh6c4a22f_0
- sqlite=3.28.0=h8b20d00_0
- tk=8.6.13=noxft_h4845f30_101
- wheel=0.37.1=pyhd8ed1ab_0
- xorg-libice=1.1.2=hb9d3cd8_0
- xorg-libsm=1.2.5=he73a12e_0
- xorg-libx11=1.8.10=h4f16b4b_1
- xorg-libxau=1.0.12=hb9d3cd8_0
- xorg-libxdmcp=1.1.5=hb9d3cd8_0
- xorg-libxext=1.3.6=hb9d3cd8_0
- xorg-libxrender=0.9.12=hb9d3cd8_0
- xorg-libxt=1.3.1=hb9d3cd8_0
- xorg-xextproto=7.3.0=hb9d3cd8_1004
- xz=5.6.3=hbcc6ac9_1
- xz-gpl-tools=5.6.3=hbcc6ac9_1
- xz-tools=5.6.3=hb9d3cd8_1
- zlib=1.2.13=h4ab18f5_6
- zstd=1.4.9=ha95c52a_0
prefix: /home/bbentley_smith_edu/.conda/envs/cafeFig
```
Make some changes to CAFE_fig.py. Change the `pixels_per_mya` value to 5.0 to widen the tree; the relevant line looks like this:
```
'pixels_per_mya': 1.0, # pixels per million years (tree width)
```
Also adjust the vertical spacing between branches:
```
ts.branch_vertical_margin = 10 # Increase the spacing between branches
```
```
python CAFE_fig.py -d out -g pdf base_report.cafe
conda install conda-forge::xvfbwrapper
conda update -n base -c defaults conda
```
### Acknowledgements
## Appendix and FAQ
:::info
**Find this document incomplete?** Leave a comment!
:::
###### tags: `scripts` `project_tyd`
- try the protein fasta from one of the six bats for Orthofinder
- Bill ran Orthofinder for his transcriptomics
- Bill's Orthofinder scripts are in his folder at seawulf
- Bill ran it on peptide FASTAs (from Ensembl)
- some of the assemblies are private, so I can't just go and use Ensembl
- orthology from toga
- orthology from
- Sensitivity analysis w/ results for well-known, well-curated
- OrthoFinder
- Does the Kirilenko paper
- expansions within everything with the caveat that only things orthologous to human were detected (because toga reference == human)
- data for synthesizing the apobecs -- one2one annotations
- Liliana went in and found a bunch more annotations
- we are interested in the one2many and many2many
- We need to validate the copy numbers spit out by TOGA
- No idea what the rules were for gene family assignment
- annotated protein list -- does it include the paralogs or just the one2one orthologs
- just run orthofinder on my protein fastas that come out of TOGA
- genes in a species that actually belong to two different gene families -- APOBEC3 (assigned to APOBEC, and APOBEC-A simultaneously). Pick the smallest box -- finest grain data. % cutoff.
- why don't we have a table that has columns of genes and species with copy numbers. Crawling through tables and making sure. Lexi has scripts that Ariadna gave her.
- 1) determine whether the protein.fasta is one2one or many2many etc.
- 2) do the many2many sequences get dropped from
- write into the methods that we fucked up and had to label the nodes after the fact.
- Tanya may continue with OrthoFinder on the protein.fastas that Ariadna gave me
  - one per genome
1. What is the TOGA --> CAFE problem?
To investigate expansions and contractions of protein families, we used CAFE.
CAFE requires gene counts which come from annotated proteins assigned to families for each species.
TOGA gives you the orthologs and classifies those orthologs into five different relationships: 1:1, 1:many, many:1, many:many, and 1:zero.
The goal was to get gene copy numbers from that orthology classification table that is output by TOGA.
In order to get gene copy counts, we need to group ENSGs into gene families. We attempted to do this
# done
sbatch 1_longest_transcript.sh -s antPal2
sbatch 1_longest_transcript.sh -s artJam3
sbatch 1_longest_transcript.sh -s aseSto2
sbatch 1_longest_transcript.sh -s bosTau9
sbatch 1_longest_transcript.sh -s canFam4
sbatch 1_longest_transcript.sh -s desRot8A
sbatch 1_longest_transcript.sh -s dipEca1
sbatch 1_longest_transcript.sh -s equCab3
sbatch 1_longest_transcript.sh -s felCat9
sbatch 1_longest_transcript.sh -s hipCyc2
sbatch 1_longest_transcript.sh -s hipLar2
sbatch 1_longest_transcript.sh -s mm10
sbatch 1_longest_transcript.sh -s molMol2
sbatch 1_longest_transcript.sh -s myoMyo6
sbatch 1_longest_transcript.sh -s myoNig1
sbatch 1_longest_transcript.sh -s phyDis3
sbatch 1_longest_transcript.sh -s phyHas1
sbatch 1_longest_transcript.sh -s pipKuh2
sbatch 1_longest_transcript.sh -s rhiAff2
sbatch 1_longest_transcript.sh -s rhiFer5
sbatch 1_longest_transcript.sh -s rhiLuc4
sbatch 1_longest_transcript.sh -s rhiMic1
sbatch 1_longest_transcript.sh -s rhiSed2
sbatch 1_longest_transcript.sh -s rhiTri2
sbatch 1_longest_transcript.sh -s rouAeg4
sbatch 1_longest_transcript.sh -s sacBil1
sbatch 1_longest_transcript.sh -s sacLep1
sbatch 1_longest_transcript.sh -s susScr11
sbatch 1_longest_transcript.sh -s tadBra2
# done
sbatch 2_combiner.sh -s antPal2
sbatch 2_combiner.sh -s artJam3
sbatch 2_combiner.sh -s aseSto2
sbatch 2_combiner.sh -s bosTau9
sbatch 2_combiner.sh -s canFam4
sbatch 2_combiner.sh -s desRot8A
sbatch 2_combiner.sh -s dipEca1
sbatch 2_combiner.sh -s equCab3
sbatch 2_combiner.sh -s felCat9
sbatch 2_combiner.sh -s hipCyc2
sbatch 2_combiner.sh -s hipLar2
sbatch 2_combiner.sh -s mm10
sbatch 2_combiner.sh -s molMol2
sbatch 2_combiner.sh -s myoNig1
sbatch 2_combiner.sh -s myoMyo6
sbatch 2_combiner.sh -s phyDis3
sbatch 2_combiner.sh -s phyHas1
sbatch 2_combiner.sh -s pipKuh2
sbatch 2_combiner.sh -s rhiAff2
sbatch 2_combiner.sh -s rhiFer5
sbatch 2_combiner.sh -s rhiLuc4
sbatch 2_combiner.sh -s rhiMic1
sbatch 2_combiner.sh -s rhiSed2
sbatch 2_combiner.sh -s rhiTri2
sbatch 2_combiner.sh -s rouAeg4
sbatch 2_combiner.sh -s sacBil1
sbatch 2_combiner.sh -s sacLep1
sbatch 2_combiner.sh -s susScr11
sbatch 2_combiner.sh -s tadBra2
# Need to run a cat on all of the paralog lists
sbatch cat_list_of_paralogs.sh -s antPal2
sbatch cat_list_of_paralogs.sh -s artJam3
sbatch cat_list_of_paralogs.sh -s aseSto2
sbatch cat_list_of_paralogs.sh -s bosTau9
sbatch cat_list_of_paralogs.sh -s canFam4
sbatch cat_list_of_paralogs.sh -s desRot8A
sbatch cat_list_of_paralogs.sh -s dipEca1
sbatch cat_list_of_paralogs.sh -s equCab3
sbatch cat_list_of_paralogs.sh -s felCat9
sbatch cat_list_of_paralogs.sh -s hipCyc2
sbatch cat_list_of_paralogs.sh -s hipLar2
sbatch cat_list_of_paralogs.sh -s mm10
sbatch cat_list_of_paralogs.sh -s molMol2
sbatch cat_list_of_paralogs.sh -s myoMyo6
sbatch cat_list_of_paralogs.sh -s myoNig1
sbatch cat_list_of_paralogs.sh -s phyDis3
sbatch cat_list_of_paralogs.sh -s phyHas1
sbatch cat_list_of_paralogs.sh -s pipKuh2
sbatch cat_list_of_paralogs.sh -s rhiAff2
sbatch cat_list_of_paralogs.sh -s rhiFer5
sbatch cat_list_of_paralogs.sh -s rhiLuc4
sbatch cat_list_of_paralogs.sh -s rhiMic1
sbatch cat_list_of_paralogs.sh -s rhiSed2
sbatch cat_list_of_paralogs.sh -s rhiTri2
sbatch cat_list_of_paralogs.sh -s rouAeg4
sbatch cat_list_of_paralogs.sh -s sacBil1
sbatch cat_list_of_paralogs.sh -s sacLep1
sbatch cat_list_of_paralogs.sh -s susScr11
sbatch cat_list_of_paralogs.sh -s tadBra2
cat_list_of_paralogs.sh
antPal2
artJam3
aseSto2
bosTau9
canFam4
desRot8A
dipEca1
equCab3
felCat9
hipCyc2
hipLar2
mm10
molMol2
myoMyo6
phyDis3
phyHas1
pipKuh2
rhiAff2
rhiFer5
rhiLuc4
rhiMic1
rhiSed2
rhiTri2
rouAeg4
sacBil1
sacLep1
susScr11
tadBra2
```
46 ../3.blastquery/antPal2_longest_transcript.fasta done
233 ../3.blastquery/artJam3_longest_transcript.fasta done
4800 ../3.blastquery/aseSto2_longest_transcript.fasta good
6444 ../3.blastquery/bosTau9_longest_transcript.fasta good only lost 1000
758 ../3.blastquery/canFam4_longest_transcript.fasta done
4044 ../3.blastquery/desRot8A_longest_transcript.fasta good
4188 ../3.blastquery/dipEca1_longest_transcript.fasta good
1012 ../3.blastquery/equCab3_longest_transcript.fasta done
1763 ../3.blastquery/felCat9_longest_transcript.fasta done
4434 ../3.blastquery/hipCyc2_longest_transcript.fasta good
205 ../3.blastquery/hipLar2_longest_transcript.fasta done
65 ../3.blastquery/mm10_longest_transcript.fasta done
3800 ../3.blastquery/molMol2_longest_transcript.fasta good
3870 ../3.blastquery/myoMyo6_longest_transcript.fasta good
1377 ../3.blastquery/phyDis3_longest_transcript.fasta done
33 ../3.blastquery/phyHas1_longest_transcript.fasta done
4394 ../3.blastquery/pipKuh2_longest_transcript.fasta good
143 ../3.blastquery/rhiAff2_longest_transcript.fasta done
2056 ../3.blastquery/rhiFer5_longest_transcript.fasta done
1515 ../3.blastquery/rhiLuc4_longest_transcript.fasta done
975 ../3.blastquery/rhiMic1_longest_transcript.fasta done
2127 ../3.blastquery/rhiSed2_longest_transcript.fasta done
395 ../3.blastquery/rhiTri2_longest_transcript.fasta bad
26 ../3.blastquery/rouAeg4_longest_transcript.fasta bad
1627 ../3.blastquery/sacBil1_longest_transcript.fasta bad
873 ../3.blastquery/sacLep1_longest_transcript.fasta bad
6830 ../3.blastquery/susScr11_longest_transcript.fasta good
943 ../3.blastquery/tadBra2_longest_transcript.fasta bad
```
```
sed 's/^>ENST/>tadBra2_longest_transcript|ENST/;s/ | PROT | QUERY//' ../0.input_faa/tadBra2_longest_transcript.faa > ./tadBra2_longest_transcript.fasta
```
# What did we do?
We realized that almost all of our proteins were making it through to goodProteins.fasta in step 2, but then in step 3.blastquery some of our protein fastas were being massively downsized and we were losing a ton of proteins for certain species. I went back to the initial inputs and "rescued" all of our proteins for step 3, and we will now begin re-running step 3.
# Reviewing outputs
I've selected a random set of orthogroups and am validating those that are highly variable within bats.
Some reminders: the inputs have been filtered in the following ways
- query sequence only
- no gaps
- longest transcript
## Example OG0000000
This OG has some variation within bats. e.g., artJam2 has 2 but antPal1 has 21 and hipLar2 has 72

### Is it composed of >85% one gene?
Yes, across all species, all of the genes in this OG are some version of IGHV3.
### Confirm these numbers in the protein fasta.
artJam2 has 2 but antPal1 has 21 and hipLar2 has 72
When we look at the orthology classification, it looks like there were other IGHV3 genes...where are all of these being tossed out??

I went through each step of my protein fasta processing pipeline. At what point are these genes removed from the protein fasta?
1 -- yes present
2 -- yes, present
3 -- yes, present
4 -- yes
5 -- looks like only 3 make it to finalJs and the rest get pushed to js. Are they one2one orthologs? Let's go see. Nope! They're one2zero orthologs.

It looks as though there were only three IGHV3 genes for orthofinder to analyze, and all three of these genes are classified as many2many orthologs.
```
grep "IGHV3" -A1 longest_transcript_faas/artJam3_longest_transcript.faa
>ENST00000390607.IGHV3-21.523602 many2many
IEFGLIWVLLFAILRAVVCDVQLVESGGGLVQLGGSLRLSCVASGFIDTIDTYMXFIDYCECLKDVCLMEVLYVMINYKDK
>ENST00000603660.IGHV3-30.124054 many2many
MEFGVSWVFLVPILRGVQCEVQLVESGGGLVRPRGSLRLSCAASGLTVSSSYINCVLQAAGKGLEWPSSIRDGGSPYYVNSVKGQFTITRENGKNTLSIXMNNLRVKNTALHYCVS
>ENST00000390598.IGHV3-7.1322916 many2many #this one didn't make it through orthofinder. wonder why?
MCCFKTGSQPGILRCSQSLLRTEHRPHTVEFGLILIFHVVILRVSTVMLESREGLVEGGXFLELSCRSSEFTFSSCNISCVCQALCEGLVWVCRIKIDKNNFRPR
```
Let's look at these genes on BLAST
```
>ENST00000390607.IGHV3-21.523602 | PROT | REFERENCE
M-ELGLRWVFLVAILEGVQCEVQLVESGGGLVKPGGSLRLSCAASGFTFSSYSMNWVRQAPGKGLEWVSSISSSSSYIYYADSVKGRFTISRDNAKNSLYLQMNSLRAEDTAVYYCAR
>ENST00000390607.IGHV3-21.523602 | PROT | QUERY
IEFGLIWVLLFAILRAVVCDVQLVESGGGLVQLGGSLRLSCVASGFIDTIDTYMXFIDYCECLKDVCLMEVLYVMINYKDK
```
When I BLAST the reference (human) sequence, I get an immediate 100%

### next steps
-- identify alignment output from orthofinder
-- look at this alignment in AliViewer
-- the ultimate check will come from CAFE results (going to genome browser, assembly and confirming losses)
### gene losses
long lived short lived
everything missing from molMol (losses)
is it missing in the long lived partner?
-- take everything thats lost in both the long and short lived and toss it out
-- focus on unique gene losses in one or the other one
-- complementary to CAFE
## Installing CAFE from github
We are in the 'orthofinder' anaconda environment
dependencies include python, gcc, and pandoc
```
conda install python=3.9
conda install -c conda-forge pandoc
conda install -c conda-forge latexmk
conda install -c conda-forge pdflatex
conda install -c bioconda cafe
conda install -c conda-forge pandas
```
then cd into docs/ and run `make`
## Running CAFE
Run `sbatch sbatch_run1.sh` to join the queue.
#### tree must be binary; graham_mod_dated_no_hg38.tree
```
(mm10:87.83517705,(((felCat9:49.66644369,canFam4:49.66644369):11.76859872,((susScr11:20.17442777,bosTau9:20.17442777):19.83282604,equCab3:40.00725381):21.4277886):14.99044447,((((((((HLrhiTri2:12.70240393,HLrhiLuc4:12.70240393):10.90116398,HLrhiSed2:23.60356791):10.06281168,HLrhiAff2:33.66637959):12.70900545,HLrhiFer5:46.37538505):9.624614955,((HLaseSto2:17.11302622,HLhipLar2:17.11302622):21.27756323,HLhipCyc2:38.39058945):17.60941055):0.8822414306,HLrhiMic1:56.88224143):0.8845209647,HLrouAeg4:57.7667624):0.8953602335,(((((HLpipKuh2:14.95167539,HLantPal2:14.95167539):16.82912223,(HLmyoNig1:15.85321175,HLmyoMyo6:15.85321175):15.92758587):15.11676827,(HLtadBra2:23.67764331,HLmolMol2:23.67764331):23.21992258):2.622233134,((HLdesRot8A:15.3632479,HLdipEca1:15.3632479):18.60429862,((HLphyDis3:10.86062202,HLphyHas1:10.86062202):10.65758233,HLartJam3:21.51820435):12.44934218):15.5522525):3.189644714,(HLsacLep1:29.3439579,HLsacBil1:29.3439579):23.36548583):5.952678892):17.76336425):11.40969017);
```
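Since CAFE expects a binary, ultrametric tree, it can be worth checking the tree programmatically before queueing the run. An illustrative check with ete3 (already available in the cafeFig environment); this is not part of the pipeline:
```
from ete3 import Tree

t = Tree('graham_mod_dated_no_hg38.tree', format=1)

# binary: every internal node has exactly two children
non_binary = [n for n in t.traverse() if not n.is_leaf() and len(n.children) != 2]
print('binary:', len(non_binary) == 0)

# ultrametric: all root-to-tip distances are (nearly) equal
depths = [leaf.get_distance(t) for leaf in t.get_leaves()]
print('ultrametric:', max(depths) - min(depths) < 1e-3)
```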
#### sbatch_run1.sh
```
#!/bin/bash
# sbatch_run1.sh
#SBATCH --job-name=cafe
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --time=0:08:00
#SBATCH -p cpu
# Call the script that runs the CAFE analysis
cafe run1.sh
```
#### run1.sh
```
#!cafe
# make sure you are in the orthofinder environment
load -i final_cafe_input_longevity.tsv -t 36 -l reports/1st_run_log.txt
tree (mm10:87.83517705,(((felCat9:49.66644369,canFam4:49.66644369):11.76859872,((susScr11:20.17442777,bosTau9:20.17442777):19.83282604,equCab3:40.00725381):21.4277886):14.99044447,((((((((rhiTri2:12.70240393,rhiLuc4:12.70240393):10.90116398,rhiSed2:23.60356791):10.06281168,rhiAff2:33.66637959):12.70900545,rhiFer5:46.37538505):9.624614955,((aseSto2:17.11302622,hipLar2:17.11302622):21.27756323,hipCyc2:38.39058945):17.60941055):0.8822414306,rhiMic1:56.88224143):0.8845209647,rouAeg4:57.7667624):0.8953602335,(((((pipKuh2:14.95167539,antPal2:14.95167539):16.82912223,(myoNig1:15.85321175,myoMyo6:15.85321175):15.92758587):15.11676827,(tadBra2:23.67764331,molMol2:23.67764331):23.21992258):2.622233134,((desRot8A:15.3632479,dipEca1:15.3632479):18.60429862,((phyDis3:10.86062202,phyHas1:10.86062202):10.65758233,artJam3:21.51820435):12.44934218):15.5522525):3.189644714,(sacLep1:29.3439579,sacBil1:29.3439579):23.36548583):5.952678892):17.76336425):11.40969017)
lambda -s -t tree (1,(((1,1)1,((1,1)1,1)1)1,((((((((1,1)1,1)1,1)1,1)1,((1,1)1,1)1)1,1)1,1)1,(((((1,1)1,(1,1)1)1,(1,1)1)1,((1,1)1,((1,1)1,1)1)1)1,(1,1)1)1)1)1)
```
## Adding GO terms to orthogroups
working directory is /project/tlama_umass_edu/projects/project_bat1k_longevity/data/orthofinder/protein_fastas/8_cafe/jul_23
```
source activate orthofinder
conda install -c conda-forge r-gprofiler
module load r/4.2.0
```
Note: make sure you have entered an R session and run install.packages('gprofiler2') before running the R script.