# Bioinformatics Special Lecture
[TOC]
**7/5/2023**
ToDo List:
1. Describe the methods more clearly, from Eggnog to Goatools
2. Write the discussion: why, how, and future directions such as the de novo genome
3. Finish uploading the code to GitHub and add the Slurm script with a test set
4. Add the R code for graph drawing to GitHub
5. Finish the HackMD IO tutorials
6. Test the GitHub repo https://github.com/ihinne/tick_
7. Write the README, from conda installation to the end.
   Example: "https://github.com/tanghaibao/jcvi"
8. Pathway
9. Study snakemake
{%hackmd @themes/orangeheart %}
## Ubuntu Set up
### Basic component installation by `apt-get install`
`sudo apt-get --yes update`
`sudo apt-get --yes upgrade`
`sudo apt-get --yes dist-upgrade`
`sudo apt-get --yes autoremove`
`sudo apt-get --yes install build-essential checkinstall libgtk2.0-dev freeglut3-dev libfreetype6-dev libqhull-dev qhull-bin libqhull7 dvipng ghostscript texlive-latex-base poppler-utils python-gobject-2-dev python-gobject python-gi-dev libffi-dev python-numpy default-jre libmldbm-perl libcdb-file-perl libdb5.3++-dev libdb5.3++ hmmer ncbi-tools-bin blast2 cd-hit libcurl4-openssl-dev libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev libgtk2.0-dev python2.7 libssh2-1-dev last-align texlive texlive-latex-extra texlive-latex-recommended gifsicle ffmpeg imagemagick uuid-dev elinks links2 python-tk mysql-server libmysqlclient-dev liblzma-dev zip binutils-multiarch bison++ byacc cmake make freebsd-buildutils samtools hdf5-tools libhdf5-serial-dev mercurial r-cran-rcpparmadillo libblas-dev liblapack-dev libgd-dev expat pandoc ttf-mscorefonts-installer python-pysam python-htseq liblzo2-dev zlib1g-dev openssh-server`
## Pronghorn
`ssh YOURID@pronghorn.rc.unr.edu`
Home directories are located under `/data/gpfs/home/NetID`
Association directories are located under `/data/gpfs/assoc/`
### Prompt
`tty -s && export PS1="\[\033[38;5;164m\]\u\[$(tput sgr0)\]\[\033[38;5;15m\] \[$(tput sgr0)\]\[\033[38;5;231m\]@\[$(tput sgr0)\]\[\033[38;5;15m\] \[$(tput sgr0)\]\[\033[38;5;2m\]\h\[$(tput sgr0)\]\[\033[38;5;15m\] \[$(tput sgr0)\]\[\033[38;5;172m\]\t\[$(tput sgr0)\]\[\033[38;5;15m\] \[$(tput sgr0)\]\[\033[38;5;2m\]\w\[$(tput sgr0)\]\[\033[38;5;15m\]\n \[$(tput sgr0)\]"`
## Bashrc


`.bashrc` is a shell script located in the user's home directory that is used to customise the terminal session.
For example, to load a module you can add the line:
`module load <module>`
To modify an environment variable, like PATH, add the line:
`export PATH=$PATH:<path/to/dir>`
To activate a Python environment, add the line:
`source <path/to/env>/bin/activate`
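Putting these together, a minimal sketch of what an addition to `~/.bashrc` might look like (the module name, directory, and environment location below are placeholders, not part of the original notes):
```bash
# load a module you use every session (placeholder module name)
module load samtools/1.17
# add a personal bin directory to PATH (placeholder path)
export PATH=$PATH:$HOME/bin
# activate a Python environment on login (placeholder location)
source $HOME/envs/analysis/bin/activate
```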
## SCP
The `scp` (secure copy) command is a command-line utility that lets you copy files between different hosts or locations. It uses SSH for data transfer and provides the same level of security: both files and passwords are encrypted.
You can copy files or directories:
* from your local system to a remote system
* from a remote system to your local system
* between two remote systems from your local system
### SCP command syntax
`scp [OPTION] [[user@]SRC_HOST:]file1 [[user@]DEST_HOST:]file2`
* `OPTION` - scp options [-c cipher|-F ssh_config|-P remote host ssh port|-l limit|...]
* `[[user@]SRC_HOST:]file1` - source file
* `[[user@]DEST_HOST:]file2` - destination file
To be able to copy files, you must have at least read permission on the source file and write permission on the target system.
NB: when copying files that share the same name and location on both systems, scp will overwrite files without warning.
Examples:
Copying files
* copy the file "example.txt" from a remote host to the local host
`scp username@remotehost.edu:example.txt /some/local/directory`
* copy the file "example.txt" from the local host to a remote host
`scp example.txt username@remotehost.edu:/some/remote/directory`
* Copy the files "example_1.txt" and "example_2.txt" from the local host to your home directory on the remote host
`scp example_1.txt example_2.txt ihinne@pronghorn.rc.unr.edu:~`
* Copy the file "example.txt" from remote host "remotehost_1.edu" to remote host "remotehost_2.edu"
`scp username@remotehost_1.edu:remote/directory/example.txt username@remotehost_2.edu:/remote/directory`
Copying directories
* Copy the directory "example_a" from the local host to a remote host's directory "example_b"
`scp -r example_a ihinne@remotehost.edu:/some/remote/directory/example_b `
* Copy an entire directory from one host to another
`scp -r /local/directory username@remotehost.edu:/remote/directory`
Syncing files via rsync
Rsync provides many options for passing arguments to augment its functionality. In the examples below we will use:
* -a flag: Stands for "archive" and syncs recursively and preserves symbolic links, special and device files, modification times, group, owner, and permissions.
* -z flag: Compresses files during transfer.
* -P flag: Combines the flags --progress and --partial. The first of these gives you a progress bar for the transfers and the second allows you to resume interrupted transfers:
The following command is a "push" operation because it pushes (syncs) a local directory to a remote system.
`rsync -azP ~/some/local/source/directory username@remotehost:/some/remote/destination/directory`
Conversely, a "pull" operation is used to sync a remote directory to the local system.
`rsync -azP username@remote_host:/some/remote/destination/directory ~/some/local/directory`
## CHMOD
File and Directory Permissions
`-rw-r--r-- 1 exampleuser examplegroup 4096 Apr 4 04:20 example.txt`
The file's mode represents the file's permissions and some extra information. There are five parts to the mode.
part 1: - (file or directory)
part 2: rw- (user permissions)
part 3: r-- (group permissions)
part 4: r-- (other permissions)
part 5: (special attributes)
The first character of the mode (part 1) is the file type. It can be a regular file or a directory. Each file and directory has three user based permission groups which break down into three sets: `user` part 2, `group` part 3, and `other` part 4.
part 1: A dash (-) in this position, as in the example, denotes a regular file, meaning that there is nothing special about the file. This is by far the most common kind of file. Directories are also common and are indicated by a d in the file type slot.
part 2: `user (u)` - The owner permissions apply only to the owner of the file or directory; they do not affect the actions of other users.
part 3: `group (g)` - The group permissions apply only to the group that has been assigned to the file or directory; they do not affect the actions of other users.
part 4: `other (o)` - The other permissions apply to all other users on the system; this is the permission group you want to watch the most.
part 5: `Directory Set Group ID` - There are two special bits in the permissions field of directories.
`s` - Set group ID
`t` - Save text attribute (sticky bit) - The user may delete or modify only those files in the directory that they own or have write permission for.
Each position in a permission set can contain one of the following characters:
`r` : the file is readable. Refers to a user's capability to read the contents of the file.
`w` : the file is writable. Refers to a user's capability to write or modify a file or directory.
`x` : the file is executable. Refers to a user's capability to execute a file or view the contents of a directory.
`s` : Some executable files have an s in the user permissions listing instead of an x. This indicates that the executable is setuid, meaning that when you execute the program, it runs as though the file owner is the user instead of you. Many programs use this setuid bit to run as root in order to get the privileges they need to change system files.
`-` : null value (no permission).
Modifying Permissions
`chmod` : Changes the file mode bits of each given file according to mode, which can be either a symbolic representation of changes to make, or an octal number representing the bit pattern for the new mode bits.
`chown` : Changes the user and/or group ownership of each given file.
`chgrp` : Change the group of a file or directory.
chmod - change mode
To change permissions, use the chmod command. The chmod (change mode) command protects files and directories from unauthorized users on the same system, by setting access permissions.
Syntax
`chmod options permissions file name`
`chmod [OPTION]... MODE[,MODE]... FILE...`
`chmod [OPTION]... OCTAL-MODE FILE...`
`chmod [OPTION]... --reference=RFILE FILE...`
Some examples:
Private file for you
`chmod 600 myfile`
`chmod u=rw myfile`
Everyone can read; you can write
`chmod 644 myfile`
`chmod u=rw,g=r,o=r myfile`
Private directory for you
`chmod 700 mydir`
Everyone can read and enter the directory; you can also write
`chmod 755 mydir`
`chmod u=rwx,g=rx,o=rx mydir`
* 4 stands for "read",
* 2 stands for "write",
* 1 stands for "execute", and
* 0 stands for "no permission"
So 7 is the combination of permissions 4+2+1 (read, write, and execute), 5 is 4+0+1 (read, no write, and execute), and 4 is 4+0+0 (read, no write, and no execute).
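As a quick sanity check of this octal arithmetic, you can set a mode and then print it back in octal form (the file and directory names below are placeholders; `stat -c '%a'` prints the permission bits on Linux):
```bash
chmod 644 notes.txt           # 6 = 4+2 (read+write) for the owner, 4 = read-only for group and others
stat -c '%a %n' notes.txt     # prints: 644 notes.txt
chmod 755 scripts/            # 7 = 4+2+1 for the owner, 5 = 4+1 (read+execute) for group and others
stat -c '%a %n' scripts/      # prints: 755 scripts/
```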
set the ownership of myfile to mrrobot
`chown mrrobot myfile`
set the ownership of the directory and all files in mydirectory to mrrobot
`chown -R mrrobot mydirectory`
set the group ownership of myfile to mrrobot
`chgrp mrrobot myfile`
set the group ownership of the directory and all files in mydirectory to mrrobot
`chgrp -R mrrobot mydirectory`
## VIM
VI
## Sublimetext
move line - `ctrl + shift + arrow`
duplicate - `ctrl + shift + d`
delete line - `ctrl + shift + k`
indent line - `ctrl + [ or ]`
select text -> edit tab -> Line -> Reindent
paste and indent - `ctrl + shift + v`
#### Working with projects
assume you are working on a project with its files in a folder called example-project
Drag the folder onto the sublime text screen OR use File -> open
Sublime Text creates a sidebar with the list of files in the project folder. Double-click a file to open it in a new tab, or use `ctrl + p` (Goto Anything) and:
use `@` to go to a particular symbol
use `:` to go to a particular line, e.g. `:24`
Tips
See unsaved changes: right-click anywhere in the text window and select Show Unsaved Changes
Sidebar time saving - use the sidebar to create new folders or files
Multiple top-level folders - think of the sidebar as a canvas for your project; you can drag multiple folders onto it
Save a project - Project -> Save Project As
#### The command Palette
`ctrl + shift + p`
places every action Sublime Text is capable of at your fingertips
ctrl + shift + p -> Convert case
Reverse
Sort
Set syntax
#### Package manager
Search for the Sublime Text package manager and look for Package Control. To install it:
copy the blob of installation code onto your clipboard
go to View -> Show Console -> paste the code -> hit enter
ctrl + shift + p -> Install Package -> Color Highlighter
#### Snippets
## Visualstudio
## Conda
Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs, and updates packages and their dependencies. Conda easily creates, saves, loads, and switches between environments on your local computer. It was created for Python programs but it can package and distribute software for any language.
Verify conda is installed, check version number `conda info`
Update conda to the current version `conda update conda`
Install a package included in Anaconda `conda install PKGNAME`
Create a new environment named env `conda create --name env`
Get a list of all my environments `conda env list`
List all packages and versions installed in active environment `conda list`
List the history of each change to the current environment `conda list --revisions`
Restore environment to a previous revision `conda install --revision 2`
Save environment to a text file `conda list --explicit > env.txt`
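A hypothetical end-to-end example tying these commands together (the environment and package names below are placeholders):
```bash
conda create --name rnaseq            # create a new environment
conda activate rnaseq                 # switch into it
conda install -c bioconda samtools    # install a package from the bioconda channel
conda list                            # confirm what is installed
conda list --explicit > rnaseq.txt    # snapshot the environment to a spec file
conda create --name rnaseq-copy --file rnaseq.txt   # recreate it elsewhere
```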
`wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh`
`bash Miniconda3-latest-Linux-x86_64.sh`
`conda config --set auto_activate_base false`
restart shell
`conda config --show channels`
`conda config --add channels conda-forge`
`conda config --add channels bioconda`
## mamba
https://github.com/mamba-org/mamba
mamba is a re-implementation of the conda package manager
```bash!
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
```
```bash
bash Mambaforge-Linux-x86_64.sh
```
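Once Mambaforge is installed, `mamba` accepts the same subcommands as `conda`; a small sketch with placeholder environment and package names:
```bash
mamba create -n qc -c bioconda fastqc   # create an environment and install a package in one step
mamba activate qc
fastqc --version                        # confirm the tool is available
```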
## GitHub
https://www.youtube.com/watch?v=tRZGeaHPoaw
```git
git config --global user.name "USERNAME"
```
```git
git config --global user.email user@email.com
```
set the default branch name to main
```git
git config --global init.defaultBranch main
```
Initialize git in the required directory
```git!
git init
```
check the status of the repository
```git
git status
```
create a text file called `.gitignore`
list in this `.gitignore` file the files or extensions you do not want Git to track
Currently your new files are untracked; they must be added to the staging area before they can be committed.
track a file (add it to the staging area)
```git!
git add filename
```
Git has three working areas:
1. Working files
2. Staging area
3. Commit
To write the current state of your staged files to the repository:
```git
git commit -m "commit message"
```
commit tracked files directly, without staging them first
```git
git commit -a -m "commit message"
```
see an abbreviated version of commit history
```git
git log --oneline
```
amend the most recent commit message
```git
git commit -m "type new message here" --amend
```
to see all the different changes in all commits
```git
git log -p
```
go to previous version
```git
git reset
```
modify what appears in the history (interactive rebase starting from the first commit)
```git
git rebase -i --root
```
create a new branch
```git
git branch newBranch
```
switch from main to new branch
```git
git switch newBranch
```
switch to main
```git
git switch main
```
merge a branch into main
```git
git merge -m "merging message" newBranch
```
delete branch
```git
git branch -d newBranch
```
connect the local repository to a remote GitHub repository
```git
git remote add origin "link-to-github-repo"
```
* `git push` - upload local commits to the remote repository
* `git pull` = `git fetch` + `git merge`
* `git fetch` - download changes from the remote without merging them
* `git clone` - copy a remote repository to your machine
* `git checkout -b branch-name` - create and switch to a new branch
* `git diff` - shows the differences between two versions of the code
* `git merge` - merge another branch into the current one
* pull request - request that changes from one branch be merged into another
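Putting the remote-related commands together, a hedged sketch of a first push to GitHub (the URL, repository, and branch names below are placeholders):
```bash
git remote add origin https://github.com/USERNAME/REPO.git   # link the local repo to GitHub
git push -u origin main        # first push; -u sets origin/main as the upstream branch
git pull                       # later: fetch remote changes and merge them into your branch
git checkout -b feature-x      # create and switch to a new branch for new work
git diff main feature-x        # compare the new branch against main
```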
## HPC
https://www.youtube.com/watch?v=dThyrEXfAbM&list=PLZLVmS9rf3nMKR2jMglaN4su3ojWtWMVw
Parallel computing
Multithreading - run parts of the code at the same time so you can use more resources; the speed-up is limited by the non-parallelizable work.
A multithreaded program can't use more than one kitchen (node).
Message passing - uses multiple processes instead of threads.
Using multiple nodes - using other kitchens, i.e. multiple computers. In this case you need to free up all the kitchens in the apartment complex, so we contact the building manager (Slurm).
On a cluster there are so many computers we can't keep track of them ourselves. Slurm, the workload manager, can tell us what is available.
GPU
Parallelization can speed up your task, but **not beyond the time taken by the smallest serial part**.
You need to **modify your code to use multiple threads** on the same node, or **multiple processes** over many nodes.
Special hardware built for parallelization, such as a **GPU**, requires further adaptation of your code and sometimes also of your data.
Program size
How big is my program? What resources (CPUs, RAM, GPUs) would I need? A simple measurement is your own computer: something that runs for an hour on your machine may run for an hour and ten minutes on the cluster. Most modern PCs have slightly faster CPUs than HPC nodes because cluster hardware is chosen more for energy efficiency. The individual CPUs may be slower, but the cluster has many more of them and so can run more processes.
## Slurm
https://www.youtube.com/watch?v=49DzPT9HFJM
https://www.youtube.com/watch?v=FCJAhp27BLI
Slurm is a workload manager - A piece of software which:
* manages and **allocates resources**;
* manages and **schedules jobs**;
* and sets up the environment for parallel and distributed computing
Resources: mainly CPU and memory (and GPUs)
Slurm is free and open-source
You can use Slurm to:
* create a job
* monitor jobs
* control your own jobs
* get job accounting info

To create a job:
1. know the resources you need and the operations you need to perform
2. write a submission script - a shell script
```bash
#!/bin/bash
# Job parameters
#SBATCH --job-name=demo
#SBATCH --output=res.txt
# Needed resources
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4g
#SBATCH --time=0-1:00:00
#SBATCH --partition=<partition>        # choose a partition
#SBATCH --nodes=<num>
#SBATCH --ntasks-per-node=<num>
#SBATCH --cpus-per-task=<num>
#SBATCH --gres=<resource>              # request a specific resource (e.g. a GPU)
#SBATCH --constraint=<feature>         # request a specific feature (e.g. a processor or network type)
#SBATCH --licenses=<license>           # access specific licensed software
#SBATCH --qos=<qos>                    # use a specific Quality of Service
# other useful parameters
#SBATCH --job-name=MyJobName
#SBATCH --comment="some comment"
#SBATCH --mail-type=BEGIN|END|FAIL|TIME_LIMIT_90|ALL
#SBATCH --mail-user=my@mail.com
#SBATCH --output=result-%j.txt
#SBATCH --error=error-%j.txt
#SBATCH --test-only                    # can be used to validate the script without submitting it
#SBATCH --dependency=after(ok|notok|any):jobids   # use if one job needs to finish before another
#SBATCH --dependency=singleton
# Operations
echo "Job start at $(date)"
# Job steps
srun ~/bin/myprog < mydata1
echo "Job end at $(date)"
```
3. Submit the script:
`sbatch script.sh`
Monitor your job:
* `squeue` - view jobs that are running or pending
* `scancel` - cancel a job
* `sprio` - view the factors that comprise a job's scheduling priority
* `scontrol` - view and modify Slurm configuration and state (change partition, RAM, time)
* `sstat` - display status information for a running job
* `sview` - gives a visual overview of the cluster
* `sacct` - display accounting data for jobs that have finished
* `sinfo` - get cluster info
* `sreport` - generate reports from job accounting data
* `sshare` - list the shares of associations to a cluster
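As a concrete (hypothetical) sequence, after submitting a job you might monitor it like this (the job id 123456 is a placeholder):
```bash
sbatch script.sh                     # prints: Submitted batch job 123456
squeue -u $USER                      # list your pending and running jobs
scontrol show job 123456             # detailed information about the job
sacct -j 123456 --format=JobID,JobName,Elapsed,MaxRSS,State   # accounting info
scancel 123456                       # cancel the job if needed
```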
Interactive Slurm
* `salloc` - allocate resources for an interactive session
* `srun --pty bash` - start an interactive shell on the allocated resources
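For example, a hedged sketch of starting an interactive session (the resource values are placeholders for whatever your cluster allows):
```bash
salloc --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00   # request an allocation
srun --pty bash                                                # open a shell on the allocated node
exit                                                           # leave the shell and release the allocation
```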
## Snakemake
https://www.youtube.com/@yclab438/videos?view=0&sort=dd&shelf_id=0
https://snakemake.readthedocs.io/en/stable/
https://lachlandeer.github.io/snakemake-econ-r-tutorial/working-with-multiple-rules.html
Snakemake is a workflow management tool for creating reproducible data analyses.
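A brief sketch of how a workflow is typically invoked from the command line (assuming a `Snakefile` in the current directory):
```bash
snakemake -n                              # dry run: show which jobs would be executed
snakemake --cores 8                       # run the workflow using up to 8 cores
snakemake --dag | dot -Tpdf > dag.pdf     # visualize the job graph (requires graphviz)
```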
### JSON
https://www.youtube.com/watch?v=iiADhChRriM
JavaScript Object Notation
Data representation format
Commonly used for APIs and configs
Lightweight and easy to read/write
Easily integrates with most languages
JSON types
* Strings "hello"
* Numbers 10 1.5
* Booleans true false
* Null null
* Arrays [1,2,3] ["Hello", "world"]
* Objects { "key":"value" } { "age":30 }
A JSON object is a collection of key/value pairs:
```json
{
  "key1": "value1",
  "key2": "value2"
}
```
user.json
```json
{
  "name": "Isaac",
  "favoriteNumber": 3,
  "isProgrammer": true,
  "hobbies": ["piano", "singing"],
  "friends": [{
    "name": "Juliet",
    "favoriteNumber": 30,
    "isProgrammer": false,
    "friends": [...]
  }]
}
```
## Eggnog-mapper
EggNOG-mapper is a tool for fast functional annotation of novel sequences. It uses precomputed Orthologous Groups (OGs) and phylogenies from the EggNOG database (http://eggnog6.embl.de) to transfer functional information from fine-grained orthologs only.
The eggNOG6 database is a bioinformatics resource that provides orthology data and comprehensive functional information for 12,535 reference species at 1,061 taxonomic levels. It contains a hierarchy of over 17M OGs.
eggNOG combines species-aware clustering and phylogenetic analysis to infer orthology in two steps:
1. eggNOG provides precomputed OGs across thousands of taxonomic scopes, covering the three domains of life: Bacteria, Archaea, and Eukaryota.
2. Each OG is further analysed using phylogenetic reconstruction, which provides fine-grained resolution of duplication and speciation events within each OG.
eggNOG integrates evolutionary data, conserved sequence domains, and functional terms into summarised reports for each OG in an effort to maximise its utility to the user.
eggNOG proteomes were mapped against the reference alignment of each COG, KOG, and arCOG. Unassigned sequences were clustered using unsupervised COG clustering and all-against-all Smith-Waterman similarities among all proteins in eggNOG.
eggNOG OGs are then functionally annotated using the PFAM, SMART, KEGG, PDB, BiGG, CARD, CAZy, and GO databases.
### Installation
#### Create the eggnog environment
```bash
mamba create -n eggnog
mamba activate eggnog
```
#### Install eggnog-mapper
`mamba install -c bioconda eggnog-mapper`
#### Download the eggNOG database into a directory called eggnog-database
```bash!
# make a directory to store the eggNOG database
mkdir eggnog-database
# download eggNOG database
download_eggnog_data.py -F -P -M -H -d 6656 --dbname Arthropoda --data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database
```
#### Run emapper in search mode only using diamond and hmmer
```bash!
sbatch emapper_diamond_search.sh
```
```bash!=
#!/usr/bin/bash -l
#SBATCH --job-name=d_search
#SBATCH --account=cpu-s1-pgl-0
#SBATCH --partition=cpu-s1-pgl-0
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=16
#SBATCH --time=4-00:00
#SBATCH --mem-per-cpu=4g
#SBATCH --output=hostname_%j.out
#SBATCH --error=hostname_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ihinne@nevada.unr.edu
/data/gpfs/home/ihinne/mambaforge/envs/eggnog/bin/emapper.py \
--cpu 16 \
-m diamond \
--sensmode ultra-sensitive \
--data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database/ \
--go_evidence all \
--pfam_realign denovo \
--no_annot \
-i /data/gpfs/assoc/ihinne_bioinformatics/ISCGN_v2_103122.aa \
-o ISCGN_dmnd
```
`sbatch emapper_hmmer_search.sh`
```bash=
#!/usr/bin/bash -l
#SBATCH --job-name=h_search
#SBATCH --account=cpu-s1-pgl-0
#SBATCH --partition=cpu-s1-pgl-0
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=16
#SBATCH --time=4-00:00
#SBATCH --mem-per-cpu=4g
#SBATCH --output=hostname_%j.out
#SBATCH --error=hostname_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ihinne@nevada.unr.edu
/data/gpfs/home/ihinne/mambaforge/envs/eggnog/bin/emapper.py \
--cpu 16 \
-m hmmer \
--sensmode ultra-sensitive \
--data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database/ \
--go_evidence all \
--pfam_realign denovo \
--no_annot \
-i /data/gpfs/assoc/ihinne_bioinformatics/ISCGN_v2_103122.aa \
-o ISCGN_hmm_arth \
-d Arthropoda \
--tax_scope 6656
```
#### combine the two seed_orthologs files into one seed_ortholog file
`cat ISCGN_dmnd.emapper.seed_orthologs ISCGN_hmm_arth.emapper.seed_orthologs > ISCGN_combined.emapper.seed_orthologs`
#### Run emapper in annotation mode only (no search)
Do this using the combined seed_orthologs file, and also with the individual diamond and hmmer seed_orthologs files
`sbatch emapper_annotate.sh`
```bash=
#!/usr/bin/bash -l
#SBATCH --job-name=emapper_annot
#SBATCH --account=cpu-s1-pgl-0
#SBATCH --partition=cpu-s1-pgl-0
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=16
#SBATCH --time=5-00:00
#SBATCH --mem-per-cpu=4g
#SBATCH --output=hostname_%j.out
#SBATCH --error=hostname_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ihinne@nevada.unr.edu
/data/gpfs/home/ihinne/mambaforge/envs/eggnog/bin/emapper.py \
--cpu 16 \
-m no_search \
--data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database \
--annotate_hits_table ISCGN_combined.emapper.seed_orthologs \
--excel \
-o ISCGN_hmm_annot \
--dbmem \
--target_taxa 6656
# --decorate_gff yes   # optional; kept on its own line because a comment inside a backslash continuation breaks the command
```
### running emapper in two-pass mode
`sbatch emapper_2pass.sh`
```bash=!
#!/usr/bin/bash -l
#SBATCH --job-name=emapper_hmm
#SBATCH --account=cpu-s1-pgl-0
#SBATCH --partition=cpu-s1-pgl-0
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=16
#SBATCH --time=1-00:00
#SBATCH --mem-per-cpu=4g
#SBATCH --output=hostname_%j.out
#SBATCH --error=hostname_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ihinne@nevada.unr.edu
/data/gpfs/home/ihinne/mambaforge/envs/eggnog/bin/emapper.py \
--cpu 16 \
-m hmmer \
--sensmode ultra-sensitive \
--data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database/ \
--go_evidence all \
--pfam_realign denovo \
--no_annot \
-i /data/gpfs/assoc/ihinne_bioinformatics/ISCGN_v2_103122.aa \
-o /data/gpfs/assoc/ihinne_bioinformatics/eggnog_hmm_out/ISCGN_hmm_arth \
-d Arthropoda \
--tax_scope 6656
/data/gpfs/home/ihinne/mambaforge/envs/eggnog/bin/emapper.py \
--cpu 16 \
-m no_search \
--data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database \
--annotate_hits_table /data/gpfs/assoc/ihinne_bioinformatics/eggnog_hmm_out/ISCGN_combined.emapper.seed_orthologs \
--decorate_gff yes \
-o /data/gpfs/assoc/ihinne_bioinformatics/eggnog_hmm_out/ISCGN_hmm_annot \
--dbmem
# --target_taxa 6656   # add when a diamond search was used
```
> [name=Isaac][time=Sat, Jul 08, 2023 12:23 am] [color=blue]
### Using argparse to get input from the user
```python=
#!/usr/bin/env python
# the shebang line above lets the script be run directly once it is made executable
import argparse
import pandas as pd

# Define and parse command line arguments
parser = argparse.ArgumentParser(description="Reads eggNOG output and writes gene-to-GO pairs to a new file.")
parser.add_argument('input_file', type=str, help='The annotation file to read from.')
parser.add_argument('output_file', type=str, help='The txt file to write to.')
args = parser.parse_args()

# Read the input annotation file, skipping the first two lines
with open(args.input_file, 'r') as file:
    lines = file.readlines()[2:]

# Split lines and create a DataFrame
data = [line.strip().split('\t') for line in lines]
df = pd.DataFrame(data, columns=['query', 'GOs'])

# Write the DataFrame to a tab-delimited text file
gene2go = args.output_file + '.txt'
df.to_csv(gene2go, sep='\t', index=False)

# Alternative: read an Excel file instead, skipping the first two rows
#df = pd.read_excel(args.input_file, skiprows=2)
# Select only the "query" and "GOs" columns
#new_df = df[['query', 'GOs']]
# Write the selected columns to a new Excel file
#new_df.to_excel(args.output_file, index=False)
```
### without shebang line
```shell=
python code.py annotation.xlsx gene2goid.xlsx
```
### with shebang line
```shell=
chmod 775 code.py
./code.py annotation.xlsx gene2goid.xlsx
```
```python=!
import argparse
import pandas as pd

# Define and parse command line arguments
parser = argparse.ArgumentParser(description="Reads eggNOG output and writes gene-to-GO pairs to a new file.")
parser.add_argument('input_file', type=str, help='The annotation file to read from.')
parser.add_argument('output_file', type=str, help='The txt file to write to.')
args = parser.parse_args()

# Read the input annotation file from eggnog-mapper, skipping the descriptive header rows
df = pd.read_csv(args.input_file, sep='\t', skiprows=4)

# Remove empty rows or rows with "-"
df = df.dropna(subset=["#query", "GOs"])  # remove rows with NaN values
df = df[~df["GOs"].str.contains("-")]     # remove rows whose GOs column contains "-"

# Extract columns
selected_columns = df[["#query", "GOs"]]

# Write the columns to a tab-delimited text file
selected_columns.to_csv(args.output_file, sep='\t', index=False)
```
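For reference, a hypothetical invocation of the extraction script above (the script name is a placeholder; the `.emapper.annotations` suffix is the standard eggnog-mapper annotation output):
```bash
python extract_gene2go.py ISCGN_hmm_annot.emapper.annotations gene2go.txt
```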
```python=
from __future__ import print_function
import pandas as pd
from goatools.obo_parser import GODag
from goatools.anno.idtogos_reader import IdToGosReader

# Earlier pandas-based extraction, kept commented for reference
#df = pd.read_excel('annotation.xlsx', skiprows=2)
# Select only the "query" and "GOs" columns
#new_df = df[['query', 'GOs']]
# Write the selected columns to a new Excel file
#new_df.to_excel('gene2goid.xlsx', index=False)

# Load the ontology, association, and background files
fin_study = 'data/'               # Study genes
fin_pop = 'data/'                 # Population genes
fin_obo = 'data/go-basic.obo'     # GO obo file containing GO terms
fin_anno = 'data/gene2goid.xlsx'  # gene-to-GO association file

# Load the GO DAG
obodag = GODag(fin_obo, load_obsolete=True)
# Load optional GODag attributes
obodag = GODag(fin_obo, optional_attrs={'consider', 'replaced_by'}, load_obsolete=True)

# Load associations
annoobj = IdToGosReader(fin_anno, godag=obodag)
id2gos = annoobj.get_id2gos()

# Get associations for each branch of the GO DAG (BP, MF, CC)
ns2assoc = annoobj.get_ns2assc()
for nspc, id2gos in ns2assoc.items():
    print("{NS} {N:,} annotated I. scapularis genes".format(NS=nspc, N=len(id2gos)))
```
### Functional Annotation
Functional annotation of the predicted proteins of the new ISCGN genome was done based on orthology assignment with the evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG 5) database (Huerta-Cepas et al., 2019) using eggnog-mapper (v2.1.11) (Cantalapiedra et al., 2021). We used the two-step (search + annotation) method, running HMMER (Eddy, 2011) in ultra-sensitive mode, restricting it to the Arthropoda taxonomic group (Tax ID 6656) in eggNOG5, and realigning the protein queries to the PFAM database (Mistry et al., 2021). Gene Ontology terms and KEGG pathways were retrieved based on orthologous groups in eggNOG5.
Gene Ontology enrichment analysis (GOEA) was performed to test the over-representation of GO terms in the list of differentially expressed genes using the GOATOOLS Python library (Klopfenstein et al., 2018). The population set was defined as the list of all genes in the I. scapularis genome (ISCGN), and the study set was the list of differentially expressed genes. We extracted the query sequence id and the GO terms from the eggnog-mapper annotation output with a custom Python script to create a two-column gene-to-GO association (annotation) file. The study set, population set, GO annotation file, and the go-basic ontology file were used as inputs to run GOEA with goatools in a Jupyter notebook. The resulting p values of the GO terms were corrected for multiple testing using the Benjamini-Hochberg method at a false discovery rate (FDR) of < 0.05. The top 30 represented GO terms (based on the p value) were visualized with ggplot2 (Wickham, 2016) in R (v4.2.3; R Core Team, 2023) using RStudio (RStudio Team, 2020).
### Discussion
Since we have a novel genome, it is important to annotate it in order to assign functions to the predicted proteins. In general, inferring protein function based on orthology is considered a better approach than relying on sequence homology detection alone. Because orthologs often retain their function, orthology assignment is a more reliable way to transfer functional information between protein sequences. The eggNOG database provides a broad range of annotations while keeping redundancy low, because it is based on a curated selection of representative species across the entire tree of life.
```snakemake=
import os        # allows you to reach out into directories to identify things
import glob      # allows you to use wildcards
import pandas as pd

configfile: "config.yaml"

WORKING_DIR = config["working_dir"]
RESULT_DIR = config["result_dir"]

# read the tsv containing the samples
units = pd.read_table(config["units"], dtype=str).set_index(["sample"], drop=False)

# create lists containing sample names and conditions
SAMPLES = units.index.get_level_values('sample').unique().tolist()
samples = pd.read_csv(config["units"], dtype=str, index_col=0, sep="\t")
samplefile = config["units"]

# Input functions for rules
def sample_is_single_end(sample):
    """This function detects missing values in the read 2 column of units.tsv"""
    if "fq2" not in samples.columns:
        return True
    else:
        return pd.isnull(samples.loc[(sample), "fq2"])

def get_fastq(wildcards):
    """This function checks if the sample has paired-end or single-end reads and returns 1 or 2 names of the fastq files"""
    if sample_is_single_end(wildcards.sample):
        return samples.loc[(wildcards.sample), ["fq1"]].dropna()
    else:
        return samples.loc[(wildcards.sample), ["fq1", "fq2"]].dropna()

def get_trim_names(wildcards):
    """
    This function:
      1. checks if the sample is single end or paired end
      2. returns the correct input and output trimmed file names
    """
    if sample_is_single_end(wildcards.sample):
        inFile = samples.loc[(wildcards.sample), ["fq1"]].dropna()
        return "--in1 " + inFile[0] + " --out1 " + WORKING_DIR + "trimmed/" + wildcards.sample + "_1_trimmed.fq.gz"
    else:
        inFile = samples.loc[(wildcards.sample), ["fq1", "fq2"]].dropna()
        return "--in1 " + inFile[0] + " --in2 " + inFile[1] + " --out1 " + WORKING_DIR + "trimmed/" + wildcards.sample + "_1_trimmed.fq.gz --out2 " + WORKING_DIR + "trimmed/" + wildcards.sample + "_2_trimmed.fq.gz"

def get_star_names(wildcards):
    """
    This function:
      1. checks if the sample is single end or paired end
      2. returns the correct input file names for the STAR mapping step.
    """
    if sample_is_single_end(wildcards.sample):
        return WORKING_DIR + "trimmed/" + wildcards.sample + "_1_trimmed.fq.gz"
    else:
        return WORKING_DIR + "trimmed/" + wildcards.sample + "_1_trimmed.fq.gz " + WORKING_DIR + "trimmed/" + wildcards.sample + "_2_trimmed.fq.gz"

MULTIQC = RESULT_DIR + "multiqc_report.html"
BAM_FILES = expand(RESULT_DIR + "star/{sample}_Aligned.sortedByCoord.out.bam", sample = SAMPLES)
MAPPING_REPORT = RESULT_DIR + "mapping_summary.csv"

if config["keep_working_dir"] == False:
    rule all:
        input:
            MULTIQC,
            BAM_FILES,
            MAPPING_REPORT,
            RESULT_DIR + "raw_counts.parsed.tsv",
            RESULT_DIR + "scaled_counts.tsv"
        message:
            "RNA-seq pipeline run complete!"
        shell:
            "cp config.yaml {RESULT_DIR};"
            "cp samples.tsv {RESULT_DIR};"
            "rm -r {WORKING_DIR}"
elif config["keep_working_dir"] == True:
    rule all:
        input:
            MULTIQC,
            BAM_FILES,
            MAPPING_REPORT,
            RESULT_DIR + "raw_counts.parsed.tsv",
            RESULT_DIR + "scaled_counts.tsv"
        message:
            "RNA-seq pipeline run complete!"
        shell:
            "cp config.yaml {RESULT_DIR};"
            "cp samples.tsv {RESULT_DIR};"
else:
    raise ValueError('please specify only "True" or "False" for the "keep_working_dir" parameter in the config file.')
# Rules
# Genome reference indexing
rule star_index:
    input:
        fasta = config["refs"]["genome"],
        gtf = config["refs"]["gtf"]
    output:
        genome_index = [WORKING_DIR + "genome/" + f for f in ["chrLength.txt", "chrNameLength.txt", "chrName.txt", "chrStart.txt", "Genome", "genomeParameters.txt", "SA", "SAindex"]]
    message:
        "Generating STAR Genome Index"
    params:
        genome_dir = WORKING_DIR + "genome/",
        genome_sa = config["star_index"]["genomeSAindexNbases"]
    threads: 24
    resources: mem_mb=100000
    shell:
        """
        mkdir -p {params.genome_dir};
        STAR --runThreadN {threads} \
        --runMode genomeGenerate \
        --genomeDir {params.genome_dir} \
        --genomeFastaFiles {input.fasta} \
        --sjdbGTFfile {input.gtf} \
        --genomeSAindexNbases {params.genome_sa}
        """

rule fastp:
    input:
        get_fastq
    output:
        fq1 = WORKING_DIR + "trimmed/" + "{sample}_1_trimmed.fq.gz",
        fq2 = WORKING_DIR + "trimmed/" + "{sample}_2_trimmed.fq.gz",
        html = WORKING_DIR + "fastp/{sample}_fastp.html",
        json = WORKING_DIR + "fastp/{sample}_fastp.json"
    message:
        "Trimming {wildcards.sample} reads"
    threads: 24
    log:
        RESULT_DIR + "fastp/{sample}.log.txt"
    params:
        sampleName = "{sample}",
        in_and_out_files = get_trim_names,
        qualified_quality_phred = config["fastp"]["qualified_quality_phred"]
    resources: cpus=10
    shell:
        """
        touch {output.fq2};
        fastp --thread {threads} --html {output.html} --json {output.json} \
        --qualified_quality_phred {params.qualified_quality_phred} \
        {params.in_and_out_files} \
        2>{log}
        """

rule multiqc:
    input:
        expand(WORKING_DIR + "fastp/{sample}_fastp.json", sample = SAMPLES)
    output:
        RESULT_DIR + "multiqc_report.html"
    params:
        fastp_directory = WORKING_DIR + "fastp/",
        outdir = RESULT_DIR
    message:
        "Summarizing fastp reports with MultiQC"
    shell:
        """
        multiqc --force \
        --outdir {params.outdir} \
        {params.fastp_directory}
        """
# STAR Alignment
rule STAR_mapping:
    input:
        genome_index = rules.star_index.output,
        forward_read = WORKING_DIR + "trimmed/" + "{sample}_1_trimmed.fq.gz",
        reverse_read = WORKING_DIR + "trimmed/" + "{sample}_2_trimmed.fq.gz"
    output:
        RESULT_DIR + "star/{sample}_Aligned.sortedByCoord.out.bam",
        RESULT_DIR + "star/{sample}_Log.final.out"
    message:
        "Mapping {wildcards.sample} reads to genome"
    params:
        sample_name = "{sample}",
        star_input_file_names = get_star_names,
        prefix = RESULT_DIR + "star/{sample}_",
        unmapped = "Fastx",   # assumed value; the original script referenced {params.unmapped} without defining it
        outSamType = config["star"]["samtype"],
        outSAMattributes = config["star"]["samattributes"],
        genome_index = WORKING_DIR + "genome/"
    threads: 24
    resources: cpus=10
    shell:
        """
        STAR --genomeDir {params.genome_index} \
        --readFilesIn {params.star_input_file_names} \
        --readFilesCommand zcat \
        --runThreadN {threads} \
        --outReadsUnmapped {params.unmapped} \
        --outFileNamePrefix {params.prefix} \
        --outSAMtype {params.outSamType} \
        --outSAMattributes {params.outSAMattributes}
        """

# mapping summary
rule generate_mapping_summary:
    input:
        expand(RESULT_DIR + "star/{sample}_Log.final.out", sample = SAMPLES)
    output:
        RESULT_DIR + "mapping_summary.csv"
    message:
        "Generating mapping summary ..."
    params:
        directory_with_mapping_reports = RESULT_DIR + "star/",
        star_directory_name = RESULT_DIR + "star/"
    shell:
        "python src/generate_mapping_summary.py {params.directory_with_mapping_reports} {params.star_directory_name} {output}"

# Count mapped reads and produce the count matrix
rule read_count:
    input:
        bams = expand(RESULT_DIR + "star/{sample}_Aligned.sortedByCoord.out.bam", sample = SAMPLES),
        gff = config["refs"]["gtf"]
    output:
        WORKING_DIR + "raw_counts.tsv"
    message:
        "Creating count matrix ..."
    threads: 10
    shell:
        """
        featureCounts -T {threads} \
        -t exon \
        -g transcript_id \
        -F 'GTF' \
        -a {input.gff} \
        -o {output} {input.bams}
        """

# Parse readCounts to remove comments and rename headers
rule parse_raw_counts:
    input:
        WORKING_DIR + "raw_counts.tsv"
    output:
        RESULT_DIR + "raw_counts.parsed.tsv"
    message:
        "Parsing readCount matrix"
    params:
        star_result_dir_name = RESULT_DIR + "star/"
    shell:
        "python src/parse_raw_counts.py {input} {params.star_result_dir_name} {output}"

# Normalise readCounts
rule normalise_raw_counts:
    input:
        raw = RESULT_DIR + "raw_counts.parsed.tsv"
    output:
        norm = RESULT_DIR + "scaled_counts.tsv"
    message:
        "Normalising raw counts"
    shell:
        """
        Rscript --vanilla src/deseq2_normalization.R {input.raw} {output.norm}
        """
```