Letty's lab log - 2021/22

Letty's lab log - 2021/22 ========================= 19-10-2021 ---------- **1\. Installed MATLAB** * Error while initializing MATLAB: ``` MATLAB is selecting SOFTWARE OPENGL rendering. MESA-LOADER: failed to open iris: /usr/lib/dri/iris_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/x86_64-linux-gnu/dri:\$${ORIGIN}/dri:/usr/lib/dri) failed to load driver: iris MESA-LOADER: failed to open kms_swrast: /usr/lib/dri/kms_swrast_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/x86_64-linux-gnu/dri:\$${ORIGIN}/dri:/usr/lib/dri) failed to load driver: kms_swrast MESA-LOADER: failed to open swrast: /usr/lib/dri/swrast_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/x86_64-linux-gnu/dri:\$${ORIGIN}/dri:/usr/lib/dri) failed to load swrast driver Gtk-Message: 18:44:07.559: Failed to load module "canberra-gtk-module" ``` * Solved: using the instructions on [this link](https://www.mathworks.com/matlabcentral/answers/534438-libgl-error-in-linux), but I did the command slightly different: ``` (base) letixa@aquas:~$ export MESA_LOADER_DRIVER_OVERRIDE=i965; matlab matlab: command not found (base) letixa@aquas:~$ export MESA_LOADER_DRIVER_OVERRIDE=i965 ``` * (it wasn’t working with matlab, so I tried to remove it and it worked, I hope I didn’t mess up anything) * Also: tried to do the steps on [this link](https://www.mathworks.com/support/bugreports/details/1995075) to solve the Failed to load module canberra-gtk-module problem, but I didn’t work. * However, this message is not supposed to affect MATLAB in any way (so I think we’re good) **2\. Running PGBSM Tutorial on MATLAB** * error while running simulation\_setup.m script * Right click to open * works only on windows * Solve: changed the code in pathtodata and in addpath (the backslashes have to be in this direction /, not in this one \\, as it was previously written in the code) * Add to tutorial * installation unix/linux → last screen shows that you need to have 2 compilers, one for CC and one for java * Don't move stuff around! 22-10-2021 ---------- **1\. Gathered information on available cetacean genomes** * saved in this sheet: [genomes](https://docs.google.com/spreadsheets/d/1j-uLXwtIQMxg3cd-zBDING8bgMwng5sBBEpKphN0Sp4/edit#gid=0) * drawn from the Cetacean Genomes Project * **Limitations**: not many reference genomes 10/84 (most at scaffold level) * which means I will have a lot of fragmented genes (I need a faster way of assembling them) * **Alternatives**: * Michael’s dataset has a lot of cetacean genes * Using only reference genomes? (there is at least one representative of all environments/sonars, except for sperm whale sonar) **2\. Assembled information on toothed whale’s phylogeny, habitat and sonar** * sheet: [cetacean-traits](https://docs.google.com/spreadsheets/d/1fNKWHytpDGWZ4jYA9v_mnYmKOpA2mPwKrIu3OnlFbtA/edit#gid=0) * **problem**: the last phylogeny by McGowen et al (2019) does not count for all river dolphins/all the species I would like to work with/have genomic information available * vaquita: has a reference genome available, not in the tree 26-10-2021 ---------- **1\. Testing pgbsm on my own data** * Program accepts/requires: * pal2nal alignments (format: paml) * rooted trees with branch lengths (but no numbers of branch support) * Data used for the test: pal2nal paml alignment + codon tree for CDH23 gene * Hypothesis tested: all river dolphins (label = 1) vs. all other taxa (label = 0) * Error while running `process_my_data.m` * Does PGBSM accept “N” characters in the sequences? * removed the “N”s from `formated_sequences_CDH23.txt` and tried to run again; didn’t work * Tried to run with the tutorial’s dataset and did not get the same error → problem is with my dataset (most likely the alignment) * possible error in formatting the sequence * Failed attempts to solve the problem: * changing the first line of the alignment, that has the size of alignment and number of sequences (a zero was missing on the size of the alignment) * changing the names of the sequences to shorter versions 27-11-2021 ---------- **1\. Mapped phenotype traitsecholocation type and habitat on phylogenetic tree** ![](https://i.imgur.com/cWcmCJr.png) * References: * Tree --> McGowen (2019) * Habitat: * Perrin et al. (2018) Encyclopedia of Marine Mammals 25-11-2021 ---------- **1\. How to download genomes from the NCBI:** [https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#downloadservice](https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#downloadservice) [https://www.biostars.org/p/466474/](https://www.biostars.org/p/466474/) [https://www.ibm.com/support/pages/downloading-data-ncbi-command-line](https://www.ibm.com/support/pages/downloading-data-ncbi-command-line) [https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#GBorRS](https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#GBorRS) 13-12-2021 ---------- **1\. Downloading genomes from NCBI** * Entered genome database * Searched for ‘cetacea’ * Clicked on genomes to be redirected to the new NCBI * downloaded only cds and protein files 16-01-2022 ---------- **1\. Installing Edirect on anarwach server** * according to [NCBI guide](https://www.ncbi.nlm.nih.gov/books/NBK179288/) `sh -c "$(wget -q ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)" export PATH=${PATH}:${HOME}/edirect` * installation appeared successful * commands don’t completely work * tested this: `esearch -db nucleotide -query "cdh23 AND orcinus_orca" | efetch -format acc` * worked on my LGE server but not on the anarwach 18-01-2022 ---------- **1\. Assessing basic information on the genomes** * Counting number of genes and proteins in each species file `grep -c ">" path/to/file/cds_from_genomic.fna or protein.faa` 04-02-2022 ---------- **1\. Running `format_my_data.m`** * required format (same that comes out of a PAL2NAL codon alignment): ``` 44 10080 BtauCDH23 ATGAGGTGCCACGTAGCCTCCAGCTGCCTCGTGGTCTGGCTTTCTGTGCT EglaCDH23 ATGAGGTGCCACGTCGCTTCCAGCTGCCTCGTGGTCTGGCTTTCTGTGCTGATCTCTGGA NphoCDH23 ATGAGGTGCCACGTCGCTTCCAGCTGCCTCGTGGTCTGGCTGTCTGTGCTGATCTCTGG ``` * you must keep the numbers (44 10080) that come before the alignment * obs: the only thing separating labels and sequences should be SPACES (no line breaks) - it can be as many spaces as you want, apparently * obs2: your file should not contain: * `>` before sequence labels * gaps on the sequences are not a problem * it is ok to change the names of your sequence file and tree file on the formatmydata.m script * but don ‘t change the name of the folder on pathtodata, it should be kept as UsersRealAlignments **2\. Running `process_my_data.m`** * trees must be: * in newick format * rooted! (at the correct outgroup) * rooting can be performed using IQTree (still don’t know how to do it tho) and Mesquite * should only contain branch lengths and taxon names (no bootstrap information) 02-03-2022 ---------- **1\. Updating genome spreadsheet with information about sequence labels:** ``` bash grep -c "probable" cds_from_genomic.fna grep -c "uncharacterized" cds_from_genomic.fna grep -c "LOC" cds_from_genomic.fna grep -c "LOW QUALITY" cds_from_genomic.fna grep -c "locus_tag" cds_from_genomic.fna grep -c "hypothetical" cds_from_genomic.fna ``` 03-05-2022 ---------- **1\. Generating list of interest [GO terms](https://docs.google.com/spreadsheets/d/1EF190IrFH-MVY1I3vHcwyeaqG7-_DVviR1ns-CtAkjQ/edit#gid=0):** * AmiGo ([http://amigo.geneontology.org/amigo](http://amigo.geneontology.org/amigo)) * Searched for key terms: sound, cognition, learning,communication,hearing,vocal,acoustic,echolocation,song" * Displays all known genes associated to that function --> could be used to generate list **2\. Converting text file (downloaded from AmiGO search) into excel spreadsheets:** ```python data = [] with open("filename.txt") as f: for line in f: data.append([word for word in line.split(" ") if word]) #change this to separate by tab print(data) import xlwt wb = xlwt.Workbook() sheet = wb.add_sheet("New Sheet") for row_index in range(len(data)): for col_index in range(len(data[row_index])): sheet.write(row_index, col_index, data[row_index][col_index]) wb.save("newSheet.xls") ``` Source: [https://stackoverflow.com/questions/21316568/python-text-file-strings-into-columns-in-spreadsheet](https://stackoverflow.com/questions/21316568/python-text-file-strings-into-columns-in-spreadsheet) **3\. Separate sequences by function script** * Selecting protein names for script building and test run `grep ">" protein.faa | sort -R | head -n 10` **4\. Investigating annotation features of dataset (cetacean genomes)** * NCBI genomes * do not seem to have GO functional annotation * annotated using Gnomon following this pipeline: [https://www.ncbi.nlm.nih.gov/genome/annotation\_euk/process/](https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/) **4.2. GO annotation** **5\. Aliance Genome:** [https://www.alliancegenome.org/](https://www.alliancegenome.org/) * allows to look for other genes' functions, pathways, GO terms etc * * * # Where is the functional annotation - uma odisséia 05-05-2022 ---------- **1\. Checking `genome.gff` files in all cetacean genomes to see they have dbxreffunctional annotation** * found out what files I have in each species folder: `for i in $(find . -name "GCF*"); do ls ${i}; done` * downloaded Genomic sequence (FASTA), Annotated features (GTF), Annotated features (GFF3), Sequence and annotation (GBFF) for all reference & annotated cetacean genomes on NCBI * making sure all folders have the files I want: `for i in $(find . -name "GCF*"); do echo ${i} && ls ${i}; done` (from cetaceans folder) * doubled checked acession numbers matching from downloaded to server folders * standardized folders * checked both Human and Drosophila reference genomes for Dbxref and Ontology\_term and nothing! * **Info on how to find functional anotation on genomes** * [https://www.ncbi.nlm.nih.gov/genbank/genomes\_gff/](https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/) * [https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md) * so hard to find!! 09-05-2022 ---------- **1\. Made a list of possible cross reference databases for functional annotation on NCBI genomes** \* based on: [https://www.ncbi.nlm.nih.gov/genbank/collab/db\_xref/](https://www.ncbi.nlm.nih.gov/genbank/collab/db_xref/) **2\. Searched all cetacean genomes for the terms on this list:** ```bash for i in $(find . -name "genomic.gff"); do echo ${i} && grep -i -E 'xref=EMBL|xref=NCBI_gi|Ontology_term=GO|xref=Ensembl|xref=GO|xref=GOA|xref=InterPro|xref=UniProtKB/Swiss-Prot|xref=FANTOM_DB|xref=PIR' ${i}; done ``` * no hits = **genomes do not seem to have links to any functional annotation** **3\. Checked which Dbxref entries cetacean genomes had execpt for GeneID and taxon:** ```bash for i in $(find . -name "genomic.gff"); do echo ${i} && grep -i "xref" ${i} | grep -i -Ev "xref=GeneID|xref=taxon|xref=Ensembl|xref=CCDS|xref=Genbank|xref=dbSNP|xref=NCBI_GP"; done > dbxref.txt grep -i -E 'xref=GeneID|xref=taxon|xref=Ensembl|xref=CCDS|xref=Genbank|xref=dbSNP|xref=NCBI_GP' genomic.gff ``` * Results saved on /home/leticiamagpali/functional\_annotation/, **dbxref.txt** and **dbxref2.txt** * Only Balaenoptera\_musculus seems to have Dbxref other than the ones listed on `3`. This species has a few genes with entries on **UniProtKB/Swiss-Prot**. * Pontoporia blainvillei has a lot of genes with xref=NCBI\_GP, but what is this? * searched on of the NCBI\_GP numbers (NIG57776.1) on NCBI, and it leads to the **Identical Protein Groups database** * * * # Re-running codeml analyses from undergrad project (codeml\_May2022) 12-05-2022 ---------- **1\. Altering codeml ctl files** * run options: ```bash find . -name "*.ctl" -exec sed -i '' -e 's/fix_blength = -1/fix_blength = 0/; s/getSE = 0/getSE = 1/ ; s/ncatG = 1/ncatG = 3/ ; s/RateAncestor = 1/RateAncestor = 0/ ; s/verbose = 2/verbose = 0' {} \; # Alternative version: for i in $(find . -name "*.ctl"); do sed -i '' -e 's/fix_blength = -1/fix_blength = 0/; s/getSE = 0/getSE = 1/ ; s/ncatG = 1/ncatG = 3/ ; s/RateAncestor = 1/RateAncestor = 0/' ${i}; done # ncatG = 3 # getSE = 1 # RateAncestor = 0 # fix_blength = 0 # verbose = 0 ``` * I decided not to change method = 1, but this will be the next variable I will be changing in case the results still look weird. * Error: `sed: 1: "./CDH23_modelA_H1a.1/CD ...": invalid command code .` * Solve: [https://stackoverflow.com/questions/19456518/error-when-using-sed-with-find-command-on-os-x-invalid-command-code](https://stackoverflow.com/questions/19456518/error-when-using-sed-with-find-command-on-os-x-invalid-command-code) (command was different on OS system) * Checking if replacement went ok: ```bash find . -name "*.ctl" -exec grep -i -E 'ncatG = 1|getSE = 0|RateAncestor = 1|fix_blength = -1' {} \; #negative control, no results find . -name "*.ctl" -exec grep -i -E 'ncatG = 3|getSE = 1|RateAncestor = 0|fix_blength = 0' {} \; | wc -l #positive control, 320 counts (each grep resulted in 4 lines/file, we have 80 files total) ``` * input files paths: ```bash find . -name "CDH23*.ctl" -exec sed -i '' -e 's+Trees/+Trees/CDH23_+' {} \; #replacing tree_H1a with genename_tree_H1a - gene name was switched in each case find . -name "*.ctl" -exec sed -i '' -e 's/codeml_Aug2021/codeml_May2022/' {} \; #fixing folder name from Aug2021 to May2022 in all files ``` * omega values for third run files (e.g. H2.2) ```bash find . -name "*H*.2.ctl" -exec sed -i '' -e 's/omega = 4/omega = 1.6/' {} \; #changing initial omega from 4 to 1.6 ``` 16-05-2022 ---------- **2\. Changing wrong names on tree: Hyperoodon\_ampullatus\_TMC1 was only was Hyperoodon on the gene trees somehow** ```bash find . -name "TMC1_tree*" -exec sed -i '' -e 's/\<Hyperoodon\>/Hyperoodon_ampullatus_TMC1/' {} \; #replacing - went wrong, check it later, solved it manually (created a loop) find . -name "TMC1_tree*" -exec grep -c "Hyperoodon_ampullatus_TMC1" {} \; #checking if replacement went ok (check also fucked up) ``` **3\. Uploading new control files, corrected trees and sequences on server:** /dados/leticia/PAML/codeml\_May2022 **4\. Creating file with path names** ```bash find . -name "*.ctl" | sed "s+./+/dados/leticia/PAML/codeml_May2022/Branch-site-model/+" > CTL_files_path.list #find generates path with ./ in the end but I needed the complete path wc -l CTL_files_path.list #quick check if all paths are there ``` **5\. Running codeml with new ctl files** ```bash screen -r codeml_run for i in `cat CTL_files_path.list`; do cd $(dirname ${i}) && codeml ${i##*/} > command_output ; done ``` Used same strategies as described on my undergrad log, 24-08-2021. # Back to main master's project 19-05-2022 ---------- **1\. Searching UniProt for connections between NCBI ref genomes and GO functional annotation** * Uniprot has advanced search allowing to look for taxonomic group + GO terms * I can use it to look cetaceans for the [GO terms](https://docs.google.com/spreadsheets/d/1EF190IrFH-MVY1I3vHcwyeaqG7-_DVviR1ns-CtAkjQ/edit#gid=0) I am interested in ![](https://i.imgur.com/ATwYUFI.jpg) * Searched Uniprot for the following GO terms: * response to auditory stimulus (22), 11 * sensory perception of sound (542), 125 * auditory behavior (9), 5 * vocalization behavior (15), 8 * echolocation (no hits) * vocal learning (3), 3 * learned vocalization behavior or vocal learning (7), 4 * learning (130), 54 * learning or memory (214 - 2 reviewed), 88 * learned vocalization behavior (4), 1 * cognition (259 - 2 reviewed), 102 ![](https://i.imgur.com/m8bJioA.png) * numbers between () are the total hits, followed by *Physeter macrocephalus* hits. * if we decide to use just one species it should be sperm whale because it has the most hits for most GO terms. **2\. Searching McGowen's paper for information/scripts on how they recovered hearing genes based on GO terms** * Positive selection and inactivation in vision and hearing genes of cetaceans: https://datadryad.org/stash/dataset/doi:10.5061/dryad.63xsj3v05 https://datadryad.org/stash/dataset/doi:10.5061/dryad.63xsj3v05 * Phylogenomic resolution of the cetacean tree of life using target sequence capture: https://datadryad.org/stash/dataset/doi:10.5061/dryad.jq40b0f https://datadryad.org/stash/dataset/doi:10.5061/dryad.jq40b0f) > A list of 1:1 orthologous protein-coding genes for the Tursiops truncatus genome version turTru1 (as compared with protein-coding genes from Homo sapiens and other available laurasiatherians) was compiled using Ensembl v. 75. We included genes belonging to specific gene ontology (GO) categories based on genes of interest and added these to a larger subset of randomly selected genes. Our target loci covered a range of GO categories ranging from “regulation of centrosome cycle” to “lung development”. Official HGNC gene names were used to search the coding sequence (CDS) databases of two delphinid genomes, Tursiops truncatus (version Ttru_1.4) and Orcinus orca (version Oorc_1.1) on NCBI GenBank (Foote et al. 2015). The longest CDS for each gene, whether Tursiops or Orcinus, was downloaded. For some sequences, no delphinid sequence was available, and another cetacean CDS was used (Lipotes vexillifer, Physeter macrocephalus, Balaenoptera acutorostrata; Zhou et al. 2013; Yim et al. 2013; Warren et al. 2017). * Investigated Ensembl _Tursiops truncatus_ genome (annotation and cds files) to look for any GO terms 26-05-2022 ---------- **1\. Codeml tutorial** * translated [codeml tutorial](https://awarnach.mathstat.dal.ca/~joeb/PAML_lab/lab.html) to portuguese (only the text part) * did tutorial exercises (on LGE server) * some codes I used to recover the results: ```bash grep -E 'lnL|dN/dS=' results* grep -A 3 "lnL" results* > values_table.txt grep -E 'lnL|omega|for branches:' results* > omegas_lnL.txt && grep -A 25 "dN & dS for each branch" results* >> omegas_lnL.txt` for i in $(find . -name "results*.txt"); do echo ${i} && grep "lnL" ${i} && grep -A 15 "Detailed output identifying parameters" ${i}; done ``` 31-05-2022 ---------- **1\. Codeml results undergrad project - putting on a table** ```bash for i in $(find . -name "out*"); do echo ${i} && grep "lnL" ${i}; done ``` * take largest likelihood among different runs of the same gene **2\. Working on script to extract sequences by gene name** `awk '/gene=/ (NR = 2) {print}' CDH23_Bacu.fasta` --> print header and sequence according to gene name 02-06-2022 ---------- **1\. Downloading Uniprot hits for GO's of interest - cetaceans** * saved at https://drive.google.com/drive/folders/1XMrIbOocMRO8E0ipw50Be57qQ1QVmoGW?usp=sharing https://drive.google.com/drive/folders/1XMrIbOocMRO8E0ipw50Be57qQ1QVmoGW?usp=sharing * **cannot use RefSeq ID to recover genes** because it is exclusive of each species, it has to be gene name 06-06-2022 ---------- **1\. Re-running codeml undergrad project - test run (codeml May2022)** * test run with CDH23 H4, changing method from 1 to 0. * changing files: `find . -name "*.ctl" -exec sed -i 's/method = 1/method = 0/' {} \;` - note that sed has changed now!! No need for `-i '' -e` * veryfing changes: `find . -name "*.ctl" -exec grep "method" {} \;` * running in multiple folders with command: `for i in $(cat CTL_files_path.txt); do cd $(dirname ${i}) && codeml ${i##*/} > command_output ; done` * if successful (lnL stops being negative), will re run all files with this configuration. * found error on previous run (12-05-2022) - ask Joe! ``` (base) [leticia@vm-centos-prof-mnery Branch-site-model]$ for i in `cat CTL_files_path.list`; do cd $(dirname ${i}) && codeml ${i##*/} > command_output ; done Error: p0 + p1 too small for branch&site model?. Error: p0 + p1 too small for branch&site model?. Error: p0 + p1 too small for branch&site model?. Error: p0 + p1 too small for branch&site model?. Error: p0 + p1 too small for branch&site model?. Error: p0 + p1 too small for branch&site model?. ``` * it worked!! Lnl for CDH23 not negative anymore :) 09-06-2022 ---------- **1\. Re-running codeml (undergrad project) for all files (codeml June 2022)** * changed all files to method = 0 using same command from 06-06-2022. * all folders were copied from codeml\_May2022 to codeml\_June2022 with no other changes made * ran codeml on all files and folders: * changed CTL_files_path.list to modify the paths to each file according to the June 2022 folder * used same run command from 16-05-2022. * Run directory = `/dados/leticia/PAML/codeml_June2022/Branch-site-model` * recovering results: * used same commands as on 31-05-2022, but did it separetely for each gene and hypothesis to make it easier to visualize. * recovering positively selected sites: **2\. Selecting articles for small project - cognition and learning genes** * Checking if there are any previous publications with cognition and learning genes in cetaceans * Google Scholar, PubMed and Web of Science * keywords: cognition gene cetacean; learning gene cetacean * period: anytime; since 2018 * saved all papers matching the search keywords on Paperpile under `Small Project - cognition genes` * searched all papers on this folder for `cognition` and `learning` 22-06-2022 ---------- **1\. Looking into Ensembl dolphin genome files for anything related to GO terms or functional annotation** * files: `Tursiops_truncatus.turTru1.106.abinitio.gff3` `Tursiops_truncatus.turTru1.106.gff3` `Tursiops_truncatus.turTru1.106.gff3.gz.sorted` `Tursiops_truncatus.turTru1.cds.all.fa` * checked info on how gff3 files are formatted: * [http://gmod.org/wiki/GFF3#GFF3\_Format](http://gmod.org/wiki/GFF3#GFF3_Format) * [https://useast.ensembl.org/info/website/upload/gff3.html](https://useast.ensembl.org/info/website/upload/gff3.html) * [https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md) * List of cross reference databases for GFF3 files: [http://current.geneontology.org/metadata/GO.xrf\_abbs](http://current.geneontology.org/metadata/GO.xrf_abbs), [http://amigo.geneontology.org/xrefs](http://amigo.geneontology.org/xrefs) * Checked all gff3 files for: * xbref term `grep -i "xref" Tursiops_truncatus.turTru1.106.gff3*` (and separately for all gff3 files) * database ID terms: ```bash for i in $(find . -name "Tursiops_truncatus.turTru1.106.*gff3*"); do echo ${i} && grep --color='always' -i -E 'EMBL:|COG_Cluster:|COG_Function:|COG_Pathway:|GEO:|GO:|NCBI_gi:|InterPro:|UniProtKB:|UniPathway:|UniProtKB-KW:|BFO:|BioCyc:|CDD:|PANTHER:|PATO:|Pfam:|PIRSF:|PR:|PDOC|Reactome:|RefSeq:|RO:|SO:|SUPERFAMILY:' ${i}; done for i in $(find . -name "Tursiops_truncatus.turTru1.106.*gff3*"); do echo ${i} && grep --color='always' -i -E 'xref=EMBL|xref=NCBI_gi|Ontology_term=GO|xref=Ensembl|xref=GO|xref=GOA|xref=InterPro|xref=UniProtKB/Swiss-Prot|xref=FANTOM_DB|xref=PIR' ${i}; done ``` * checking gtf files on cetacean genomes for any terms associated with functional ontology cross-reference: ```bash for i in $(find . -name "genomic.gtf"); do echo ${i} && grep --color='always' -i -E 'EMBL:|COG_Cluster:|COG_Function:|COG_Pathway:|GEO:|GO:|NCBI_gi:|InterPro:|UniProtKB:|UniPathway:|UniProtKB-KW:|BFO:|BioCyc:|CDD:|PANTHER:|PATO:|Pfam:|PIRSF:|PR:|PDOC|Reactome:|RefSeq:|RO:|SO:|SUPERFAMILY:' ${i}; done for i in $(find . -name "genomic.gtf"); do echo ${i} && grep --color='always' -i -E 'xref "EMBL|xref "NCBI_gi|Ontology_term=GO|xref "Ensembl|xref "GO|xref "GOA|xref "InterPro|xref "UniProtKB|xref "FANTOM_DB|xref "PIR' ${i}; done ``` Result: nothing found! Ensembl and NCBI genomes don't seem to have any useful information on functional annotation. **2\. Checking cetacean genome files - overall** `for i in $(find . -name "cds_from_genomic.fna"); do echo ${i} && head ${i}; done` ## 29-06-2022 **1. Making list of genes annotated to GO's of interest:** * creating list fom 5th column (Gene names) of a tabular document (list of genes corresponding to a given GO): ```bash for i in $(find . -name "*.tab"); do awk -F"\t" '{print $5}' ${i}; done > genes_GO.lis #use uniq to remove identical words ``` Obs: watch out, there was a space before the cognition.tab file and awk does not run like this! * removing repetitive genes and words: * use this if you would like to provide the extraction script with a list ```python old_list = ['banana', 'apple', 'lemon', 'banana'] new_list = [] for item in old_list: if item not in new_list: new_list.append(item) print(new_list) ``` ## 04-07-2022 **1. Getting list of unique genes annotated to GO's of interest** * Transformed txt file with list of genes into a python list format, using the `readlines` function: ```python import sys for p in sys.path: print(p) #used this to find correct way to write filepath file = open("/Users/leticiamagpali/Library/CloudStorage/GoogleDrive-leticiamagpali@gmail.com/My Drive/LetixaMagpali/MSc_Dalhousie/Research_project/functional_annotation/UniProt/genes_GO.txt", "r") print(file.readlines()) ``` ## 06-07-2022 **1. Seting up GitHub account and syncronizing with Google Drive and HackMD** * Syncing Google Drive and GitHub: * Download GitHub Desktop * Go to `File > New repository` and select a folder inside my Google Drive desktop -- in this case, it was ==/Users/leticiamagpali/Google Drive/My Drive/LetixaMagpali/MSc_Dalhousie/Research_project/Whale-evolution== * lab log = written on HackMD (webversion) and MacDown (desktop version), pushed weekly to GitHub, which syncs both ends **2. Writing function to extract a list of genes from a NCBI cds file:** [ExtractGenes.ipynb](https://drive.google.com/file/d/1aAGCGnkgAz7CkTU7UQZvqZx_4ZThR-Rm/view?usp=sharing) ++Things I've learned about python functions:++ * it's possible to define the value for an argument inside the body of a function, and when you call the function it still works with that argument * global variables have priority over variables defined inside the function (which only work for the function) ## 14-07-2022 **1. Downloading gene sequences for small project** * read the abstract to determine potentially intresting papers among the recovered literature * searched for "cognition", "learning" and "brain" whitin all papers labeled under ==Small project - cognition genes== * hits added to the `cognition_genes.list` * used [Alliance Genome](https://www.alliancegenome.org/), [GeneCards](https://www.genecards.org/) and [NCBI Gene database](https://www.ncbi.nlm.nih.gov/gene/) to confirm functions of genes I could not decide about just by reading the paper # Pipeline - extracting sequences from genomes ## 20-07-2022 **1. Writing pipeline modules** * Folder: `/Users/leticiamagpali/Google Drive/My Drive/MSc_Dalhousie/Research_project/Whale-evolution/coding/Pipeline/get_sequences/` * Files.py: * ExtractGenes * RemoveIsoforms * run **2. Editing run.py** * making necessary changes so `run.py` will work with my modules and inside the ceteceans genome folder on the server * testing with real data (aka cetacean genomes:`/home/leticiamagpali/genomes/cetaceans`). Checklist of successful tests: - [x] creates **raw_data_file list** - [X] runs RemoveIsoforms - [X] runs ExtractGenes (all functions) - [X] creates list of genes of interest from text file * Pipeline has been tested and debuged with real data and is fully functional. ## 17-08-2022 * Preparing files for running PGBSM on Matlab, using data from my undergraduate project * Hypotheses: * CDH23: H2 (riverine), H4 (coastal) * SLC26A5: H4 * TMC1: H4 * Files formatted using format my data and phenotype maps manually written * Sequence files = [nuc alignments from PAML](https://drive.google.com/drive/u/0/folders/1JFXw9PkXxVEUNU-GH1VI3ArqHOVyBstW); tree files = [IQtree codon alignments](https://drive.google.com/drive/u/0/folders/1d1ElbNhf4bVY1Cw8muAc65fhztNr-iay) * extension changed to .txt according to PGBSM tutorial ## 20-08-2022 * Assembling gene list - dataset 1 (initial analysis): * Updating my GO genes immediatly before running the pipeline * searched Uniprot again for the selected GO categories to get the latest number of genes ![](https://i.imgur.com/z2SYRut.png) * chose additional categories from [GO_terms_genes](https://docs.google.com/spreadsheets/d/1EF190IrFH-MVY1I3vHcwyeaqG7-_DVviR1ns-CtAkjQ/edit#gid=0]) to be included in the initial analyses | Chosen GO terms - dataset 1 | GO term | # genes extracted | | ------------------------------------------------------------------------ | ---------- | --------------------- | | response to auditory stimulus | GO:0010996 | 25 | | sensory perception of sound | GO:0007605 | 767 | | auditory behavior | GO:0031223 | 9 | | detection of mechanical stimulus involved in sensory perception of sound | GO:0050910 | 12 | | larynx morphogenesis | GO:0120223 | 1 | | larynx development | GO:0120224 | same as larynx morpho | | ear morphogenesis | GO:0042471 | 193 | | inner ear morphogenesis | GO:0042472 | 121 | | inner ear receptor cell differentiation | GO:0060113 | 121 | | auditory receptor cell stereocilium organization | GO:0060088 | 20 | | cochlea development | GO:0043583 | 425 | | cochlea development | GO:0090102 | 45 | | vocalization behavior | GO:0071625 | 15 | | learned vocalization behavior or vocal learning | GO:0098598 | 7 | | learned vocalization behavior | GO:0098583 | 4 | | vocal learning | GO:0042297 | 3 | | learning or memory | GO:0007611 | 217 | | learning | GO:0007612 | 130 | | cognition | GO:0050890 | 264 | * checking if I downloaded all gene names: the amount of lines on the tab files matches the number of genes - 1 ```bash leticiamagpali@eujenio UniProt % wc -l *tab 10 auditory-behavior.tab 21 auditory-cilium-org.tab 426 cochlea-dev.tab 46 cochlea-dev2.tab 265 cognition.tab 13 detection-mechanical-stim.tab 194 ear-morpho.tab 122 inner-ear-celldiff.tab 122 inner-ear-morpho.tab 2 larynx-morpho.tab 5 learned-vocalization-behavior.tab 8 learned-vocalization.tab 218 learning-memory.tab 131 learning.tab 26 response-auditory-stim.tab 768 sensory-perception-sound.tab 4 vocal-learning.tab 16 vocalization-behavior.tab ``` * creating new, updated gene list: ```bash # creates gene list from the third or fifth column of all tabular files # using a tab as column delimiter for i in $(find . -name "*.tab") do awk -F"\t" '{print $3}' ${i} done > genesGO-2.txt # removes identical gene names from list sort -u genesGO-2.txt > genesGO-2_uniq.txt ``` * always have the same formatting on your datasets - some tables that I didn't update (since the gene numbers/names remained the same) had the gene names on the 5th column, but for the updated tables they were on the 3rd column. ## 25-08-2022 1 - Correcting new errors on pipeline: * Checked the results in more detail and noticed that 2 steps of the pipeline (extract genes and filter sequences) were returning the same genes (of *B.acutorostrata*) for all 3 test species. * this happened because of a foor loop wrong indentation. Watch out! 2 - Updating pipeline to be able to extract proteins: * cds and protein files from NCBI are annotated differently (protein files do not contain gene names) * to extract our genes of interest from the protein files, we can: * generate a list of protein ID numbers (unique for each species and isoform) from the final nucleotide/gene files generated earlier in the pipeline (aka results in the filtered folder) * this list already contains all protein IDs for the genes we removed isoforms, extracted and filtered * **problem** = we can't use `split().4` to access the protein IDs because they are not always in the same position in the cds files. * solution: still used split, but then searched for every element in the list containing "protein_id=" ## 29-08-2022 * Rafael helped me to optimize the code, by turning big chunks into functions, using pylint (checks if code is respecting writing conventions), and to debug it by using the **run and debug** option. * Final version of pipeline was tested, files were checked * Used a set of cognition genes with 31 sequences ```bash #counting number of genes on test list wc -l cognition_genes_test.txt # counting number of genes recovered on each step for i in $(find . -name "*prot.fna") do echo ${i##*/} grep -c ">" ${i} done #run inside reults folder ``` * test species: *B.acutorostrata*, *O.orca*, *P.blainvillei* * number of sequences recovered: * isoforms = 18891, 19255, 4296 * extracted = 25, 26, 0 * filtered = 21, 22, 0 * proteins = 21, 22, 0 * genes/proteins recovered matched list = YES ```bash #checking the headers of sequences recovered (replace path according to species you wanna check) grep ">" filtered-lowq/Orcinus_orca_filtered.fna | awk -F" " '{print $2}' > Orca_headers.txt grep ">" proteins/Orcinus_orca_prot.fna >> Orca_headers.txt ``` * commented code (explaining what functions do, what arguments they take etc) * uploaded to Github and shared with Yuri ## 02-09-2022 **1. Preparing files to run PCOC** * Protein alignments with MAFFT - file: "geneNAME_dataset3_nuc_aligned.translated.fasta file" * CDH23: G-INS-i, 0.6 de unaligned level, try to align gappy regions anyway * SLC26A5: G-INS-i, com unaligned level aproximadamente 0.6, try to align gappy regions anyway * TMC1: G-INS-i, 0.0 de unalignlevel, try to align gappy regions anyway * CLDN14: G-INS-i, 0.0 de unalignlevel, try to align gappy regions anyway **2. Running PCOC** Commands: ``` docker run -e LOCAL_USER_ID=`id -u $USER` --rm -v $PWD:$PWD -e CWD=$PWD carinerey/pcoc pcoc_det.py -t TMC1/TMC1_pcoc.nw.phy -aa TMC1/TMC1_pcoc.fasta -o TMC1/outpcoc_det_TMC1_H4 -m "23,24,25,26,36,45,47,56,57,58,60,63,64" -f 0.8 --no_cleanup_fasta --plot_complete_ali --plot docker run -e LOCAL_USER_ID=`id -u $USER` --rm -v $PWD:$PWD -e CWD=$PWD carinerey/pcoc pcoc_det.py -t CLDN14/CLDN14_pcoc.nw.phy -aa CLDN14/CLDN14_pcoc.fasta -o CLDN14/outpcoc_det_CLDN14_H4 -m "27,30,31,32,39,44,45,46,51,54,55,57,58" -f 0.8 --no_cleanup_fasta --plot_complete_ali --plot docker run -e LOCAL_USER_ID=`id -u $USER` --rm -v $PWD:$PWD -e CWD=$PWD carinerey/pcoc pcoc_det.py -t SLC26A5/SLC26A5_pcoc.nw.phy -aa SLC26A5/SLC26A5_pcoc.fasta -o SLC26A5/outpcoc_det_SLC26A5_H4 -m "29,31,32,33,39,43,44,45,48,50,53,54,57" -f 0.8 --no_cleanup_fasta --plot_complete_ali --plot ` docker run -e LOCAL_USER_ID=`id -u $USER` --rm -v $PWD:$PWD -e CWD=$PWD carinerey/pcoc pcoc_det.py -t CDH23/CDH23_pcoc.nw.phy -aa CDH23/CDH23_pcoc.fasta -o CDH23/outpcoc_det_CDH23_H4 -m "31,33,34,35,41,42,43,49,55,56,58,59,69" -f 0.8 --no_cleanup_fasta --plot_complete_ali --plot ` ``` ## 03-09-2022 **1. Checking software/packages available in the awarnach server** * path to list of software = cd ../../software/miniconda3/envs * every software is installed on a different conda environment * which software I want to use are already installed: - [X] IqTree - [X] MAFFT - [X] BLAST - [X] OrthoFinder - [x] PCOC - [X] PAML - [X] Pal2nal - [ ] TreeFam - [x] R - [x] TreesApp * requested PCOC and R to be installed on the awarnach server (email to ) * interesting software I found: * NCBI: * ncbi-acc-download * ncbi-amrfinderplus * ncbi-genome-download * ncbimeta * ncbi-ngs-sdk * ncbitk * ncbi-util-legacy * ncbi-vdb * ncbi-vdb-py * [goalign](https://github.com/evolbioinfo/goalign): manipulate alignments on command line * [gotree](https://github.com/evolbioinfo/gotree): manipulate trees * [goenrichment](https://github.com/DanFaria/GOEnrichment): GO enrichment analysis * uniprot * [goatools](https://github.com/tanghaibao/goatools): python library for gene ontology analyses, also perform enrichment * [pantools](https://www.bioinformatics.nl/pangenomics/manual/): comparative analyses of large number of genomes * [pyfasta](https://pypi.org/project/pyfasta/#slicing): manipulate large fasta files * prank and muscle (alignments) # Candidate genes project - gene extraction ## 23-09-2022 * Making [oficial list](https://drive.google.com/file/d/12lGR8baRJsRiM5gkxOtlB9JGzVOIZ0DJ/view?usp=sharing) for ==CANDIDATE GENES PROJECT== * Checking the list of genes recovered from literature and updating it with recently added papers * Mannually added genes recovered from GO terms associated with: - [x] learned vocalization behavior or vocal learning - [x] learned vocalization behavior - [x] vocal learning - [x] learning or memory - [x] learning - [x] cognition * quality riteria: annotation score > 3.0 * used `sort -u` to generate a list with genes alphabetically sorted and with no repeated genes (aka identical words). * final number of genes: 163 (`wc -l`) * recovered from literature search (described on 09-06-2022) and selected GO terms (above) ## 13-10-2022 * Making batch script to run candidate genes extraction on server: ```bash #!/bin/bash . /etc/profile #$ -o job_output $ -M leticiamagpali@dal.ca` cd /home/leticiamagpali/extracted-genes/ python run.py ``` * Updated pipeline codes on `get_sequences` with: * correct pathway to genome folder: `/home/leticiamagpali/genomes/cetaceans` * extended dictionary on RemoveIsoforms.py, including all possible ways a gene name can be annotated (gene=, "locus_tag=") ## 18-10-2022 * Running and debugging code on VSCode * Used a special function that allows you to connect to a server, run a code inside it step by step and verify the outputs to locate possible bugs * Each function was run separetly and results were checked by: * counting number of genes recovered for each species, on each step * looking at headers on `filtered-lowq` to check if they overall matched the list I provided. * (using same commands as on 29-08-2022). * Recovering candidate genes - final results: (run again without vs code and check results) --- # Future plans * Writing script to run IQTree on command line (awarnach server): ```bash #!/bin/bash . /etc/profile #$ -o job_output $ -M leticiamagpali@dal.ca #$ -m be cd /home/leticiamagpali/gene_trees/iqtree_Sept2022 path/iqtree -s CDH23_dataset3_nuc_aligned.fasta -st DNA -m TEST -bb 10000 -alrt 5000 -abayes # path/iqtree -s CDH23_dataset3_prot.fasta -st AA -m TEST -bb 10000 -alrt 5000 -abayes ``` **Next step on recovering genes:** * Literature search: Google Scholar, Web of Science, Pub Med: vocalization/vocal behavior gene cetacean, echolocation gene cetacean, communication gene cetacean, hearing gene cetacean. * 1st 20 results? (how further should I go) * Search for major review papers and look at their references and connected articles