# Bioinformatics Special Lecture
[TOC]
**7/5/2023**
ToDo List:
1. Describe the methods more clearly, from Eggnog to Goatools
2. Write the discussion: why, how, and future directions such as the de novo genome
3. Finish uploading the code to GitHub and add the Slurm script with a test set
4. Add the R code for graph drawing to GitHub
5. Finish the HackMD IO tutorials
6. Test the GitHub repo https://github.com/ihinne/tick_
7. Write the README, from conda installation to the end.
   Example: "https://github.com/tanghaibao/jcvi"
8. Pathway
9. Study snakemake
{%hackmd @themes/orangeheart %}
## Ubuntu Set up
### Basic component installation by `apt-get install`
`sudo apt-get --yes update`
`sudo apt-get --yes upgrade`
`sudo apt-get --yes dist-upgrade`
`sudo apt-get --yes autoremove`
`sudo apt-get --yes install build-essential checkinstall libgtk2.0-dev freeglut3-dev libfreetype6-dev libqhull-dev qhull-bin libqhull7 dvipng ghostscript texlive-latex-base poppler-utils python-gobject-2-dev python-gobject python-gi-dev libffi-dev python-numpy default-jre libmldbm-perl libcdb-file-perl libdb5.3++-dev libdb5.3++ hmmer ncbi-tools-bin blast2 cd-hit libcurl4-openssl-dev libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev libgtk2.0-dev python2.7 libssh2-1-dev last-align texlive texlive-latex-extra texlive-latex-recommended gifsicle ffmpeg imagemagick uuid-dev elinks links2 python-tk mysql-server libmysqlclient-dev liblzma-dev zip binutils-multiarch bison++ byacc cmake make freebsd-buildutils samtools hdf5-tools libhdf5-serial-dev mercurial r-cran-rcpparmadillo libblas-dev liblapack-dev libgd-dev expat pandoc ttf-mscorefonts-installer python-pysam python-htseq liblzo2-dev zlib1g-dev openssh-server`
## Pronghorn
`ssh YOURID@pronghorn.rc.unr.edu`
Home directories are located under `/data/gpfs/home/NetID`
Association directories are located under `/data/gpfs/assoc/`
### Prompt
`tty -s && export PS1="\[\033[38;5;164m\]\u\[$(tput sgr0)\]\[\033[38;5;15m\] \[$(tput sgr0)\]\[\033[38;5;231m\]@\[$(tput sgr0)\]\[\033[38;5;15m\] \[$(tput sgr0)\]\[\033[38;5;2m\]\h\[$(tput sgr0)\]\[\033[38;5;15m\] \[$(tput sgr0)\]\[\033[38;5;172m\]\t\[$(tput sgr0)\]\[\033[38;5;15m\] \[$(tput sgr0)\]\[\033[38;5;2m\]\w\[$(tput sgr0)\]\[\033[38;5;15m\]\n \[$(tput sgr0)\]"`
## Bashrc


`.bashrc` is a shell script located in the user's home directory that is used to customise the terminal session.
For example, to load a module you can add the line:
`module load <module>`
To modify an environment variable, like PATH, add the line:
`export PATH=$PATH:<path/to/dir>`
To activate a Python environment, add the line:
`source <path/to/env>/bin/activate`
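Putting these together, a minimal sketch of what an addition to `~/.bashrc` might look like (the module name, directory, and environment location below are placeholders, not part of the original notes):
```bash
# load a module you use every session (placeholder module name)
module load samtools/1.17
# add a personal bin directory to PATH (placeholder path)
export PATH=$PATH:$HOME/bin
# activate a Python environment on login (placeholder location)
source $HOME/envs/analysis/bin/activate
```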
## SCP
The `scp` (secure copy) command is a command-line utility that lets you copy files between different hosts or locations. It uses SSH for data transfer and provides the same level of security: both files and passwords are encrypted.
You can copy files or directories:
* from your local system to a remote system
* from a remote system to your local system
* between two remote systems from your local system
### SCP command syntax
`scp [OPTION] [[user@]SRC_HOST:]file1 [[user@]DEST_HOST:]file2`
* `OPTION` - scp options [-c cipher|-F ssh_config|-P remote host ssh port|-l limit|...]
* `[[user@]SRC_HOST:]file1` - source file
* `[[user@]DEST_HOST:]file2` - destination file
To be able to copy files, you must have at least read permission on the source file and write permission on the target system.
NB: when copying files that share the same name and location on both systems, scp will overwrite files without warning.
Examples:
Copying files
* copy the file "example.txt" from a remote host to the local host
`scp username@remotehost.edu:example.txt /some/local/directory`
* copy the file "example.txt" from the local host to a remote host
`scp example.txt username@remotehost.edu:/some/remote/directory`
* Copy the files "example_1.txt" and "example_2.txt" from the local host to your home directory on the remote host
`scp example_1.txt example_2.txt ihinne@pronghorn.rc.unr.edu:~`
* Copy the file "example.txt" from remote host "remotehost_1.edu" to remote host "remotehost_2.edu"
`scp username@remotehost_1.edu:remote/directory/example.txt username@remotehost_2.edu:/remote/directory`
Copying directories
* Copy the directory "example_a" from the local host to a remote host's directory "example_b"
`scp -r example_a ihinne@remotehost.edu:/some/remote/directory/example_b `
* Copy an entire directory from one host to another
`scp -r /local/directory username@remotehost.edu:/remote/directory`
Syncing files via rsync
Rsync provides many options for passing arguments to augment its functionality. In the examples below we will use:
* -a flag: Stands for "archive" and syncs recursively and preserves symbolic links, special and device files, modification times, group, owner, and permissions.
* -z flag: Compresses files during transfer.
* -P flag: Combines the flags --progress and --partial. The first of these gives you a progress bar for the transfers and the second allows you to resume interrupted transfers:
The following command is a "push" operation because it pushes (syncs) a local directory to a remote system.
`rsync -azP ~/some/local/source/directory username@remotehost:/some/remote/destination/directory`
Conversely, a "pull" operation is used to sync a remote directory to the local system.
`rsync -azP username@remote_host:/some/remote/destination/directory ~/some/local/directory`
## CHMOD
File and Directory Permissions
`-rw-r--r-- 1 exampleuser examplegroup 4096 Apr 4 04:20 example.txt`
The file's mode represents the file's permissions and some extra information. There are five parts to the mode.
part 1: - (file or directory)
part 2: rw- (user permissions)
part 3: r-- (group permissions)
part 4: r-- (other permissions)
part 5: (special attributes)
The first character of the mode (part 1) is the file type. It can be a regular file or a directory. Each file and directory has three user based permission groups which break down into three sets: `user` part 2, `group` part 3, and `other` part 4.
part 1: A dash (-) in this position, as in the example, denotes a regular file, meaning that there is nothing special about the file. This is by far the most common kind of file. Directories are also common and are indicated by a d in the file type slot.
part 2: `user (u)` - The owner permissions apply only to the owner of the file or directory; they do not affect the actions of other users.
part 3: `group (g)` - The group permissions apply only to the group that has been assigned to the file or directory; they do not affect the actions of other users.
part 4: `other (o)` - The other permissions apply to all other users on the system; this is the permission group you want to watch the most.
part 5: `Directory Set Group ID` - There are two special bits in the permissions field of directories.
`s` - Set group ID
`t` - Save text attribute (sticky bit) - The user may delete or modify only those files in the directory that they own or have write permission for.
Each position in a permission set can contain one of the following characters:
`r` : the file is readable. Refers to a user's capability to read the contents of the file.
`w` : the file is writable. Refers to a user's capability to write or modify a file or directory.
`x` : the file is executable. Refers to a user's capability to execute a file or view the contents of a directory.
`s` : Some executable files have an s in the user permissions listing instead of an x. This indicates that the executable is setuid, meaning that when you execute the program, it runs as though the file owner is the user instead of you. Many programs use this setuid bit to run as root in order to get the privileges they need to change system files.
`-` : null value (no permission).
Modifying Permissions
`chmod` : Changes the file mode bits of each given file according to mode, which can be either a symbolic representation of changes to make, or an octal number representing the bit pattern for the new mode bits.
`chown` : Changes the user and/or group ownership of each given file.
`chgrp` : Change the group of a file or directory.
chmod - change mode
To change permissions, use the chmod command. The chmod (change mode) command protects files and directories from unauthorized users on the same system, by setting access permissions.
Syntax
`chmod options permissions file name`
`chmod [OPTION]... MODE[,MODE]... FILE...`
`chmod [OPTION]... OCTAL-MODE FILE...`
`chmod [OPTION]... --reference=RFILE FILE...`
Some examples:
Private file for you
`chmod 600 myfile`
`chmod u=rw myfile`
Everyone can read; you can write
`chmod 644 myfile`
`chmod u=rw,g=r,o=r myfile`
Private directory for you
`chmod 700 mydir`
Everyone can read and enter the directory; you can also write
`chmod 755 mydir`
`chmod u=rwx,g=rx,o=rx mydir`
* 4 stands for "read",
* 2 stands for "write",
* 1 stands for "execute", and
* 0 stands for "no permission"
So 7 is the combination of permissions 4+2+1 (read, write, and execute), 5 is 4+0+1 (read, no write, and execute), and 4 is 4+0+0 (read, no write, and no execute).
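As a quick sanity check of this octal arithmetic, you can set a mode and then print it back in octal form (the file and directory names below are placeholders; `stat -c '%a'` prints the permission bits on Linux):
```bash
chmod 644 notes.txt           # 6 = 4+2 (read+write) for the owner, 4 = read-only for group and others
stat -c '%a %n' notes.txt     # prints: 644 notes.txt
chmod 755 scripts/            # 7 = 4+2+1 for the owner, 5 = 4+1 (read+execute) for group and others
stat -c '%a %n' scripts/      # prints: 755 scripts/
```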
set the ownership of myfile to mrrobot
`chown mrrobot myfile`
set the ownership of the directory and all files in mydirectory to mrrobot
`chown -R mrrobot mydirectory`
set the group ownership of myfile to mrrobot
`chgrp mrrobot myfile`
set the group ownership of the directory and all files in mydirectory to mrrobot
`chgrp -R mrrobot mydirectory`
## VIM
VI
## Sublimetext
move line - `ctrl + shift + arrow`
duplicate - `ctrl + shift + d`
delete line - `ctrl + shift + k`
indent line - `ctrl + [ or ]`
select text -> edit tab -> Line -> Reindent
paste and indent - `ctrl + shift + v`
#### Working with projects
assume you are working on a project with its files in a folder called example-project
Drag the folder onto the sublime text screen OR use File -> open
Sublime Text creates a sidebar with the list of files in the project folder. Double-click a file to open it in a new tab, or use `ctrl + p` (Goto Anything) and:
use `@` to go to a particular symbol
use `:` to go to a particular line, e.g. `:24`
Tips
See unsaved changes: right-click anywhere in the text window and select Show Unsaved Changes
Sidebar time saving - use the sidebar to create new folders or files
Multiple top-level folders - think of the sidebar as a canvas for your project; you can drag multiple folders onto it
Save a project - Project -> Save Project As
#### The command Palette
`ctrl + shift + p`
places every action Sublime Text is capable of at your fingertips
ctrl + shift + p -> Convert case
Reverse
Sort
Set syntax
#### Package manager
Search for the Sublime Text package manager and look for Package Control. To install it:
copy the blob of installation code onto your clipboard
go to View -> Show Console -> paste the code -> hit enter
ctrl + shift + p -> Install Package -> Color Highlighter
#### Snippets
## Visualstudio
## Conda
Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs, and updates packages and their dependencies. Conda easily creates, saves, loads, and switches between environments on your local computer. It was created for Python programs but it can package and distribute software for any language.
Verify conda is installed, check version number `conda info`
Update conda to the current version `conda update conda`
Install a package included in Anaconda `conda install PKGNAME`
Create a new environment named env `conda create --name env`
Get a list of all my environments `conda env list`
List all packages and versions installed in active environment `conda list`
List the history of each change to the current environment `conda list --revisions`
Restore environment to a previous revision `conda install --revision 2`
Save environment to a text file `conda list --explicit > env.txt`
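A hypothetical end-to-end example tying these commands together (the environment and package names below are placeholders):
```bash
conda create --name rnaseq            # create a new environment
conda activate rnaseq                 # switch into it
conda install -c bioconda samtools    # install a package from the bioconda channel
conda list                            # confirm what is installed
conda list --explicit > rnaseq.txt    # snapshot the environment to a spec file
conda create --name rnaseq-copy --file rnaseq.txt   # recreate it elsewhere
```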
`wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh`
`bash Miniconda3-latest-Linux-x86_64.sh`
`conda config --set auto_activate_base false`
restart shell
`conda config --show channels`
`conda config --add channels conda-forge`
`conda config --add channels bioconda`
## mamba
https://github.com/mamba-org/mamba
mamba is a re-implementation of the conda package manager
```bash!
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
```
```bash
bash Mambaforge-Linux-x86_64.sh
```
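Once Mambaforge is installed, `mamba` accepts the same subcommands as `conda`; a small sketch with placeholder environment and package names:
```bash
mamba create -n qc -c bioconda fastqc   # create an environment and install a package in one step
mamba activate qc
fastqc --version                        # confirm the tool is available
```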
## GitHub
https://www.youtube.com/watch?v=tRZGeaHPoaw
```git
git config --global user.name "USERNAME"
```
```git
git config --global user.email user@email.com
```
set the default branch name to main
```git
git config --global init.defaultBranch main
```
Initialize git in the required directory
```git!
git init
```
check the status of the repository
```git
git status
```
create a text file called `.gitignore`
list in this `.gitignore` file the files or extensions you do not want Git to track
Currently your new files are untracked; they must be added to the staging area before they can be committed.
track a file (add it to the staging area)
```git!
git add filename
```
Git has three working areas:
1. Working files
2. Staging area
3. Commit
To write the current state of your staged files to the repository:
```git
git commit -m "commit message"
```
commit tracked files directly, without staging them first
```git
git commit -a -m "commit message"
```
see an abbreviated version of commit history
```git
git log --oneline
```
amend the most recent commit message
```git
git commit -m "type new message here" --amend
```
to see all the different changes in all commits
```git
git log -p
```
go to previous version
```git
git reset
```
modify what appears in the history (interactive rebase starting from the first commit)
```git
git rebase -i --root
```
create a new branch
```git
git branch newBranch
```
switch from main to new branch
```git
git switch newBranch
```
switch to main
```git
git switch main
```
merge a branch into main
```git
git merge -m "merging message" newBranch
```
delete branch
```git
git branch -d newBranch
```
connect the local repository to a remote GitHub repository
```git
git remote add origin "link-to-github-repo"
```
* `git push` - upload local commits to the remote repository
* `git pull` = `git fetch` + `git merge`
* `git fetch` - download changes from the remote without merging them
* `git clone` - copy a remote repository to your machine
* `git checkout -b branch-name` - create and switch to a new branch
* `git diff` - shows the differences between two versions of the code
* `git merge` - merge another branch into the current one
* pull request - request that changes from one branch be merged into another
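Putting the remote-related commands together, a hedged sketch of a first push to GitHub (the URL, repository, and branch names below are placeholders):
```bash
git remote add origin https://github.com/USERNAME/REPO.git   # link the local repo to GitHub
git push -u origin main        # first push; -u sets origin/main as the upstream branch
git pull                       # later: fetch remote changes and merge them into your branch
git checkout -b feature-x      # create and switch to a new branch for new work
git diff main feature-x        # compare the new branch against main
```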
## HPC
https://www.youtube.com/watch?v=dThyrEXfAbM&list=PLZLVmS9rf3nMKR2jMglaN4su3ojWtWMVw
Parallel computing
Multithreading - run parts of the code at the same time so you can use more resources; the speed-up is limited by the non-parallelizable work.
A multithreaded program can't use more than one kitchen (node).
Message passing - uses multiple processes instead of threads.
Using multiple nodes - using other kitchens, i.e. multiple computers. In this case you need to free up all the kitchens in the apartment complex, so we contact the building manager (Slurm).
On a cluster there are so many computers we can't keep track of them ourselves. Slurm, the workload manager, can tell us what is available.
GPU
Parallelization can speed up your task, but **not beyond the time taken by the smallest serial part**.
You need to **modify your code to use multiple threads** on the same node, or **multiple processes** over many nodes.
Special hardware built for parallelization, such as a **GPU**, requires further adaptation of your code and sometimes also of your data.
Program size
How big is my program? What resources (CPUs, RAM, GPUs) would I need? A simple measurement is your own computer: something that runs for an hour on your machine may run for an hour and ten minutes on the cluster. Most modern PCs have slightly faster CPUs than HPC nodes because cluster hardware is chosen more for energy efficiency. The individual CPUs may be slower, but the cluster has many more of them and so can run more processes.
## Slurm
https://www.youtube.com/watch?v=49DzPT9HFJM
https://www.youtube.com/watch?v=FCJAhp27BLI
Slurm is a workload manager - A piece of software which:
* manages and **allocates resources**;
* manages and **schedules jobs**;
* and sets up the environment for parallel and distributed computing
Resources: mainly CPU and memory (and GPUs)
Slurm is free and open-source
You can use Slurm to:
* create a job
* monitor jobs
* control your own jobs
* get job accounting info

To create a job:
1. know the resources you need and the operations you need to perform
2. write a submission script - a shell script
```bash
#!/bin/bash
# Job parameters
#SBATCH --job-name=demo
#SBATCH --output=res.txt
# Needed resources
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4g
#SBATCH --time=0-1:00:00
#SBATCH --partition=<partition>        # choose a partition
#SBATCH --nodes=<num>
#SBATCH --ntasks-per-node=<num>
#SBATCH --cpus-per-task=<num>
#SBATCH --gres=<resource>              # request a specific resource (e.g. a GPU)
#SBATCH --constraint=<feature>         # request a specific feature (e.g. a processor or network type)
#SBATCH --licenses=<license>           # access specific licensed software
#SBATCH --qos=<qos>                    # use a specific Quality of Service
# other useful parameters
#SBATCH --job-name=MyJobName
#SBATCH --comment="some comment"
#SBATCH --mail-type=BEGIN|END|FAIL|TIME_LIMIT_90|ALL
#SBATCH --mail-user=my@mail.com
#SBATCH --output=result-%j.txt
#SBATCH --error=error-%j.txt
#SBATCH --test-only                    # can be used to validate the script without submitting it
#SBATCH --dependency=after(ok|notok|any):jobids   # use if one job needs to finish before another
#SBATCH --dependency=singleton
# Operations
echo "Job start at $(date)"
# Job steps
srun ~/bin/myprog < mydata1
echo "Job end at $(date)"
```
3. Submit the script:
`sbatch script.sh`
Monitor your job:
* `squeue` - view jobs that are running or pending
* `scancel` - cancel a job
* `sprio` - view the factors that comprise a job's scheduling priority
* `scontrol` - view and modify Slurm configuration and state (change partition, RAM, time)
* `sstat` - display status information for a running job
* `sview` - gives a visual overview of the cluster
* `sacct` - display accounting data for jobs that have finished
* `sinfo` - get cluster info
* `sreport` - generate reports from job accounting data
* `sshare` - list the shares of associations to a cluster
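As a concrete (hypothetical) sequence, after submitting a job you might monitor it like this (the job id 123456 is a placeholder):
```bash
sbatch script.sh                     # prints: Submitted batch job 123456
squeue -u $USER                      # list your pending and running jobs
scontrol show job 123456             # detailed information about the job
sacct -j 123456 --format=JobID,JobName,Elapsed,MaxRSS,State   # accounting info
scancel 123456                       # cancel the job if needed
```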
Interactive Slurm
* `salloc` - allocate resources for an interactive session
* `srun --pty bash` - start an interactive shell on the allocated resources
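For example, a hedged sketch of starting an interactive session (the resource values are placeholders for whatever your cluster allows):
```bash
salloc --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00   # request an allocation
srun --pty bash                                                # open a shell on the allocated node
exit                                                           # leave the shell and release the allocation
```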
## Snakemake
https://www.youtube.com/@yclab438/videos?view=0&sort=dd&shelf_id=0
https://snakemake.readthedocs.io/en/stable/
https://lachlandeer.github.io/snakemake-econ-r-tutorial/working-with-multiple-rules.html
Snakemake is a workflow management tool for creating reproducible data analyses.
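A brief sketch of how a workflow is typically invoked from the command line (assuming a `Snakefile` in the current directory):
```bash
snakemake -n                              # dry run: show which jobs would be executed
snakemake --cores 8                       # run the workflow using up to 8 cores
snakemake --dag | dot -Tpdf > dag.pdf     # visualize the job graph (requires graphviz)
```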
### JSON
https://www.youtube.com/watch?v=iiADhChRriM
JavaScript Object Notation
Data representation format
Commonly used for APIs and configs
Lightweight and easy to read/write
Easily integrates with most languages
JSON types
* Strings "hello"
* Numbers 10 1.5
* Booleans true false
* Null null
* Arrays [1,2,3] ["Hello", "world"]
* Objects { "key":"value" } { "age":30 }
A JSON object is a collection of key/value pairs:
```json
{
  "key1": "value1",
  "key2": "value2"
}
```
user.json
```json
{
  "name": "Isaac",
  "favoriteNumber": 3,
  "isProgrammer": true,
  "hobbies": ["piano", "singing"],
  "friends": [{
    "name": "Juliet",
    "favoriteNumber": 30,
    "isProgrammer": false,
    "friends": [...]
  }]
}
```
## Eggnog-mapper
EggNOG-mapper is a tool for fast functional annotation of novel sequences. It uses precomputed Orthologous Groups (OGs) and phylogenies from the EggNOG database (http://eggnog6.embl.de) to transfer functional information from fine-grained orthologs only.
The eggNOG6 database is a bioinformatics resource that provides orthology data and comprehensive functional information for 12,535 reference species at 1,061 taxonomic levels. It contains a hierarchy of over 17M OGs.
eggNOG combines species-aware clustering and phylogenetic analysis to infer orthology in two steps:
1. eggNOG provides precomputed OGs across thousands of taxonomic scopes, covering the three domains of life: Bacteria, Archaea, and Eukaryota.
2. Each OG is further analysed using phylogenetic reconstruction, which provides fine-grained resolution of duplication and speciation events within each OG.
eggNOG integrates evolutionary data, conserved sequence domains, and functional terms into summarised reports for each OG in an effort to maximise its utility to the user.
eggNOG proteomes were mapped against the reference alignment of each COG, KOG, and arCOG. Unassigned sequences were clustered using unsupervised COG clustering and all-against-all Smith-Waterman similarities among all proteins in eggNOG.
eggNOG OGs are then functionally annotated using the PFAM, SMART, KEGG, PDB, BiGG, CARD, CAZy, and GO databases.
### Installation
#### Create the eggnog environment
```bash
mamba create -n eggnog
mamba activate eggnog
```
#### Install eggnog-mapper
`mamba install -c bioconda eggnog-mapper`
#### Download the eggNOG database into a directory called eggnog-database
```bash!
# make a directory to store the eggNOG database
mkdir eggnog-database
# download eggNOG database
download_eggnog_data.py -F -P -M -H -d 6656 --dbname Arthropoda --data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database
```
#### Run emapper in search mode only using diamond and hmmer
```bash!
sbatch emapper_diamond_search.sh
```
```bash!=
#!/usr/bin/bash -l
#SBATCH --job-name=d_search
#SBATCH --account=cpu-s1-pgl-0
#SBATCH --partition=cpu-s1-pgl-0
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=16
#SBATCH --time=4-00:00
#SBATCH --mem-per-cpu=4g
#SBATCH --output=hostname_%j.out
#SBATCH --error=hostname_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ihinne@nevada.unr.edu
/data/gpfs/home/ihinne/mambaforge/envs/eggnog/bin/emapper.py \
--cpu 16 \
-m diamond \
--sensmode ultra-sensitive \
--data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database/ \
--go_evidence all \
--pfam_realign denovo \
--no_annot \
-i /data/gpfs/assoc/ihinne_bioinformatics/ISCGN_v2_103122.aa \
-o ISCGN_dmnd
```
`sbatch emapper_hmmer_search.sh`
```bash=
#!/usr/bin/bash -l
#SBATCH --job-name=h_search
#SBATCH --account=cpu-s1-pgl-0
#SBATCH --partition=cpu-s1-pgl-0
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=16
#SBATCH --time=4-00:00
#SBATCH --mem-per-cpu=4g
#SBATCH --output=hostname_%j.out
#SBATCH --error=hostname_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ihinne@nevada.unr.edu
/data/gpfs/home/ihinne/mambaforge/envs/eggnog/bin/emapper.py \
--cpu 16 \
-m hmmer \
--sensmode ultra-sensitive \
--data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database/ \
--go_evidence all \
--pfam_realign denovo \
--no_annot \
-i /data/gpfs/assoc/ihinne_bioinformatics/ISCGN_v2_103122.aa \
-o ISCGN_hmm_arth \
-d Arthropoda \
--tax_scope 6656
```
#### combine the two seed_orthologs files into one seed_ortholog file
`cat ISCGN_dmnd.emapper.seed_orthologs ISCGN_hmm_arth.emapper.seed_orthologs > ISCGN_combined.emapper.seed_orthologs`
#### Run emapper in annotation mode only (no search)
Do this using the combined seed_orthologs file, and also with the individual diamond and hmmer seed_orthologs files
`sbatch emapper_annotate.sh`
```bash=
#!/usr/bin/bash -l
#SBATCH --job-name=emapper_annot
#SBATCH --account=cpu-s1-pgl-0
#SBATCH --partition=cpu-s1-pgl-0
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=16
#SBATCH --time=5-00:00
#SBATCH --mem-per-cpu=4g
#SBATCH --output=hostname_%j.out
#SBATCH --error=hostname_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ihinne@nevada.unr.edu
/data/gpfs/home/ihinne/mambaforge/envs/eggnog/bin/emapper.py \
--cpu 16 \
-m no_search \
--data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database \
--annotate_hits_table ISCGN_combined.emapper.seed_orthologs \
--excel \
-o ISCGN_hmm_annot \
--dbmem \
--target_taxa 6656
# --decorate_gff yes   # optional; kept on its own line because a comment inside a backslash continuation breaks the command
```
### running emapper in two-pass mode
`sbatch emapper_2pass.sh`
```bash=!
#!/usr/bin/bash -l
#SBATCH --job-name=emapper_hmm
#SBATCH --account=cpu-s1-pgl-0
#SBATCH --partition=cpu-s1-pgl-0
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=16
#SBATCH --time=1-00:00
#SBATCH --mem-per-cpu=4g
#SBATCH --output=hostname_%j.out
#SBATCH --error=hostname_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ihinne@nevada.unr.edu
/data/gpfs/home/ihinne/mambaforge/envs/eggnog/bin/emapper.py \
--cpu 16 \
-m hmmer \
--sensmode ultra-sensitive \
--data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database/ \
--go_evidence all \
--pfam_realign denovo \
--no_annot \
-i /data/gpfs/assoc/ihinne_bioinformatics/ISCGN_v2_103122.aa \
-o /data/gpfs/assoc/ihinne_bioinformatics/eggnog_hmm_out/ISCGN_hmm_arth \
-d Arthropoda \
--tax_scope 6656
/data/gpfs/home/ihinne/mambaforge/envs/eggnog/bin/emapper.py \
--cpu 16 \
-m no_search \
--data_dir /data/gpfs/assoc/ihinne_bioinformatics/eggnog-database \
--annotate_hits_table /data/gpfs/assoc/ihinne_bioinformatics/eggnog_hmm_out/ISCGN_combined.emapper.seed_orthologs \
--decorate_gff yes \
-o /data/gpfs/assoc/ihinne_bioinformatics/eggnog_hmm_out/ISCGN_hmm_annot \
--dbmem
# --target_taxa 6656   # add when a diamond search was used
```
> [name=Isaac][time=Sat, Jul 08, 2023 12:23 am] [color=blue]
### Using argparse to get input from the user
```python=
#!/usr/bin/env python
# the shebang line above lets the script be run directly once it is made executable
import argparse
import pandas as pd

# Define and parse command line arguments
parser = argparse.ArgumentParser(description="Reads eggNOG output and writes gene-to-GO pairs to a new file.")
parser.add_argument('input_file', type=str, help='The annotation file to read from.')
parser.add_argument('output_file', type=str, help='The txt file to write to.')
args = parser.parse_args()

# Read the input annotation file, skipping the first two lines
with open(args.input_file, 'r') as file:
    lines = file.readlines()[2:]

# Split lines and create a DataFrame
data = [line.strip().split('\t') for line in lines]
df = pd.DataFrame(data, columns=['query', 'GOs'])

# Write the DataFrame to a tab-delimited text file
gene2go = args.output_file + '.txt'
df.to_csv(gene2go, sep='\t', index=False)

# Alternative: read an Excel file instead, skipping the first two rows
#df = pd.read_excel(args.input_file, skiprows=2)
# Select only the "query" and "GOs" columns
#new_df = df[['query', 'GOs']]
# Write the selected columns to a new Excel file
#new_df.to_excel(args.output_file, index=False)
```
### without shebang line
```shell=
python code.py annotation.xlsx gene2goid.xlsx
```
### with shebang line
```shell=
chmod 775 code.py
./code.py annotation.xlsx gene2goid.xlsx
```
```python=!
import argparse
import pandas as pd

# Define and parse command line arguments
parser = argparse.ArgumentParser(description="Reads eggNOG output and writes gene-to-GO pairs to a new file.")
parser.add_argument('input_file', type=str, help='The annotation file to read from.')
parser.add_argument('output_file', type=str, help='The txt file to write to.')
args = parser.parse_args()

# Read the input annotation file from eggnog-mapper, skipping the descriptive header rows
df = pd.read_csv(args.input_file, sep='\t', skiprows=4)

# Remove empty rows or rows with "-"
df = df.dropna(subset=["#query", "GOs"])  # remove rows with NaN values
df = df[~df["GOs"].str.contains("-")]     # remove rows whose GOs column contains "-"

# Extract columns
selected_columns = df[["#query", "GOs"]]

# Write the columns to a tab-delimited text file
selected_columns.to_csv(args.output_file, sep='\t', index=False)
```
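For reference, a hypothetical invocation of the extraction script above (the script name is a placeholder; the `.emapper.annotations` suffix is the standard eggnog-mapper annotation output):
```bash
python extract_gene2go.py ISCGN_hmm_annot.emapper.annotations gene2go.txt
```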
```python=
from __future__ import print_function
import pandas as pd
from goatools.obo_parser import GODag
from goatools.anno.idtogos_reader import IdToGosReader

# Earlier pandas-based extraction, kept commented for reference
#df = pd.read_excel('annotation.xlsx', skiprows=2)
# Select only the "query" and "GOs" columns
#new_df = df[['query', 'GOs']]
# Write the selected columns to a new Excel file
#new_df.to_excel('gene2goid.xlsx', index=False)

# Load the ontology, association, and background files
fin_study = 'data/'               # Study genes
fin_pop = 'data/'                 # Population genes
fin_obo = 'data/go-basic.obo'     # GO obo file containing GO terms
fin_anno = 'data/gene2goid.xlsx'  # gene-to-GO association file

# Load the GO DAG
obodag = GODag(fin_obo, load_obsolete=True)
# Load optional GODag attributes
obodag = GODag(fin_obo, optional_attrs={'consider', 'replaced_by'}, load_obsolete=True)

# Load associations
annoobj = IdToGosReader(fin_anno, godag=obodag)
id2gos = annoobj.get_id2gos()

# Get associations for each branch of the GO DAG (BP, MF, CC)
ns2assoc = annoobj.get_ns2assc()
for nspc, id2gos in ns2assoc.items():
    print("{NS} {N:,} annotated I. scapularis genes".format(NS=nspc, N=len(id2gos)))
```
### Functional Annotation
Functional annotation of the predicted proteins of the new ISCGN genome was done based on orthology assignment with the evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG 5) database (Huerta-Cepas et al., 2019) using eggnog-mapper (v2.1.11) (Cantalapiedra et al., 2021). We used the two-step (search + annotation) method, running HMMER (Eddy, 2011) in ultra-sensitive mode, restricting it to the Arthropoda taxonomic group (Tax ID 6656) in eggNOG5, and realigning the protein queries to the PFAM database (Mistry et al., 2021). Gene Ontology terms and KEGG pathways were retrieved based on orthologous groups in eggNOG5.
Gene Ontology enrichment analysis (GOEA) was performed to test the over-representation of GO terms in the list of differentially expressed genes using the GOATOOLS Python library (Klopfenstein et al., 2018). The population set was defined as the list of all genes in the I. scapularis genome (ISCGN), and the study set was the list of differentially expressed genes. We extracted the query sequence id and the GO terms from the eggnog-mapper annotation output with a custom Python script to create a two-column gene-to-GO association (annotation) file. The study set, population set, GO annotation file, and the go-basic ontology file were used as inputs to run GOEA with goatools in a Jupyter notebook. The resulting p values of the GO terms were corrected for multiple testing using the Benjamini-Hochberg method at a false discovery rate (FDR) of < 0.05. The top 30 represented GO terms (based on the p value) were visualized with ggplot2 (Wickham, 2016) in R (v4.2.3; R Core Team, 2023) using RStudio (RStudio Team, 2020).
### Discussion
Since we have a novel genome, it is important to annotate it in order to assign functions to the predicted proteins. In general, inferring protein function based on orthology is considered a better approach than relying on sequence homology detection alone. Because orthologs often retain their function, orthology assignment is a more reliable way to transfer functional information between protein sequences. The eggNOG database provides a broad range of annotations while keeping redundancy low, because it is based on a curated selection of representative species across the entire tree of life.
```snakemake=
import os        # allows you to reach out into directories to identify things
import glob      # allows you to use wildcards
import pandas as pd

configfile: "config.yaml"

WORKING_DIR = config["working_dir"]
RESULT_DIR = config["result_dir"]

# read the tsv containing the samples
units = pd.read_table(config["units"], dtype=str).set_index(["sample"], drop=False)

# create lists containing sample names and conditions
SAMPLES = units.index.get_level_values('sample').unique().tolist()
samples = pd.read_csv(config["units"], dtype=str, index_col=0, sep="\t")
samplefile = config["units"]

# Input functions for rules
def sample_is_single_end(sample):
    """This function detects missing values in the read 2 column of units.tsv"""
    if "fq2" not in samples.columns:
        return True
    else:
        return pd.isnull(samples.loc[(sample), "fq2"])

def get_fastq(wildcards):
    """This function checks if the sample has paired-end or single-end reads and returns 1 or 2 names of the fastq files"""
    if sample_is_single_end(wildcards.sample):
        return samples.loc[(wildcards.sample), ["fq1"]].dropna()
    else:
        return samples.loc[(wildcards.sample), ["fq1", "fq2"]].dropna()

def get_trim_names(wildcards):
    """
    This function:
      1. checks if the sample is single end or paired end
      2. returns the correct input and output trimmed file names
    """
    if sample_is_single_end(wildcards.sample):
        inFile = samples.loc[(wildcards.sample), ["fq1"]].dropna()
        return "--in1 " + inFile[0] + " --out1 " + WORKING_DIR + "trimmed/" + wildcards.sample + "_1_trimmed.fq.gz"
    else:
        inFile = samples.loc[(wildcards.sample), ["fq1", "fq2"]].dropna()
        return "--in1 " + inFile[0] + " --in2 " + inFile[1] + " --out1 " + WORKING_DIR + "trimmed/" + wildcards.sample + "_1_trimmed.fq.gz --out2 " + WORKING_DIR + "trimmed/" + wildcards.sample + "_2_trimmed.fq.gz"

def get_star_names(wildcards):
    """
    This function:
      1. checks if the sample is single end or paired end
      2. returns the correct input file names for the STAR mapping step.
    """
    if sample_is_single_end(wildcards.sample):
        return WORKING_DIR + "trimmed/" + wildcards.sample + "_1_trimmed.fq.gz"
    else:
        return WORKING_DIR + "trimmed/" + wildcards.sample + "_1_trimmed.fq.gz " + WORKING_DIR + "trimmed/" + wildcards.sample + "_2_trimmed.fq.gz"

MULTIQC = RESULT_DIR + "multiqc_report.html"
BAM_FILES = expand(RESULT_DIR + "star/{sample}_Aligned.sortedByCoord.out.bam", sample = SAMPLES)
MAPPING_REPORT = RESULT_DIR + "mapping_summary.csv"

if config["keep_working_dir"] == False:
    rule all:
        input:
            MULTIQC,
            BAM_FILES,
            MAPPING_REPORT,
            RESULT_DIR + "raw_counts.parsed.tsv",
            RESULT_DIR + "scaled_counts.tsv"
        message:
            "RNA-seq pipeline run complete!"
        shell:
            "cp config.yaml {RESULT_DIR};"
            "cp samples.tsv {RESULT_DIR};"
            "rm -r {WORKING_DIR}"
elif config["keep_working_dir"] == True:
    rule all:
        input:
            MULTIQC,
            BAM_FILES,
            MAPPING_REPORT,
            RESULT_DIR + "raw_counts.parsed.tsv",
            RESULT_DIR + "scaled_counts.tsv"
        message:
            "RNA-seq pipeline run complete!"
        shell:
            "cp config.yaml {RESULT_DIR};"
            "cp samples.tsv {RESULT_DIR};"
else:
    raise ValueError('please specify only "True" or "False" for the "keep_working_dir" parameter in the config file.')
# Rules
# Genome reference indexing
rule star_index:
    input:
        fasta = config["refs"]["genome"],
        gtf = config["refs"]["gtf"]
    output:
        genome_index = [WORKING_DIR + "genome/" + f for f in ["chrLength.txt", "chrNameLength.txt", "chrName.txt", "chrStart.txt", "Genome", "genomeParameters.txt", "SA", "SAindex"]]
    message:
        "Generating STAR Genome Index"
    params:
        genome_dir = WORKING_DIR + "genome/",
        genome_sa = config["star_index"]["genomeSAindexNbases"]
    threads: 24
    resources: mem_mb=100000
    shell:
        """
        mkdir -p {params.genome_dir};
        STAR --runThreadN {threads} \
        --runMode genomeGenerate \
        --genomeDir {params.genome_dir} \
        --genomeFastaFiles {input.fasta} \
        --sjdbGTFfile {input.gtf} \
        --genomeSAindexNbases {params.genome_sa}
        """

rule fastp:
    input:
        get_fastq
    output:
        fq1 = WORKING_DIR + "trimmed/" + "{sample}_1_trimmed.fq.gz",
        fq2 = WORKING_DIR + "trimmed/" + "{sample}_2_trimmed.fq.gz",
        html = WORKING_DIR + "fastp/{sample}_fastp.html",
        json = WORKING_DIR + "fastp/{sample}_fastp.json"
    message:
        "Trimming {wildcards.sample} reads"
    threads: 24
    log:
        RESULT_DIR + "fastp/{sample}.log.txt"
    params:
        sampleName = "{sample}",
        in_and_out_files = get_trim_names,
        qualified_quality_phred = config["fastp"]["qualified_quality_phred"]
    resources: cpus=10
    shell:
        """
        touch {output.fq2};
        fastp --thread {threads} --html {output.html} --json {output.json} \
        --qualified_quality_phred {params.qualified_quality_phred} \
        {params.in_and_out_files} \
        2>{log}
        """

rule multiqc:
    input:
        expand(WORKING_DIR + "fastp/{sample}_fastp.json", sample = SAMPLES)
    output:
        RESULT_DIR + "multiqc_report.html"
    params:
        fastp_directory = WORKING_DIR + "fastp/",
        outdir = RESULT_DIR
    message:
        "Summarizing fastp reports with MultiQC"
    shell:
        """
        multiqc --force \
        --outdir {params.outdir} \
        {params.fastp_directory}
        """
# STAR Alignment
rule STAR_mapping:
    input:
        genome_index = rules.star_index.output,
        forward_read = WORKING_DIR + "trimmed/" + "{sample}_1_trimmed.fq.gz",
        reverse_read = WORKING_DIR + "trimmed/" + "{sample}_2_trimmed.fq.gz"
    output:
        RESULT_DIR + "star/{sample}_Aligned.sortedByCoord.out.bam",
        RESULT_DIR + "star/{sample}_Log.final.out"
    message:
        "Mapping {wildcards.sample} reads to genome"
    params:
        sample_name = "{sample}",
        star_input_file_names = get_star_names,
        prefix = RESULT_DIR + "star/{sample}_",
        unmapped = "Fastx",   # assumed value; the original script referenced {params.unmapped} without defining it
        outSamType = config["star"]["samtype"],
        outSAMattributes = config["star"]["samattributes"],
        genome_index = WORKING_DIR + "genome/"
    threads: 24
    resources: cpus=10
    shell:
        """
        STAR --genomeDir {params.genome_index} \
        --readFilesIn {params.star_input_file_names} \
        --readFilesCommand zcat \
        --runThreadN {threads} \
        --outReadsUnmapped {params.unmapped} \
        --outFileNamePrefix {params.prefix} \
        --outSAMtype {params.outSamType} \
        --outSAMattributes {params.outSAMattributes}
        """

# mapping summary
rule generate_mapping_summary:
    input:
        expand(RESULT_DIR + "star/{sample}_Log.final.out", sample = SAMPLES)
    output:
        RESULT_DIR + "mapping_summary.csv"
    message:
        "Generating mapping summary ..."
    params:
        directory_with_mapping_reports = RESULT_DIR + "star/",
        star_directory_name = RESULT_DIR + "star/"
    shell:
        "python src/generate_mapping_summary.py {params.directory_with_mapping_reports} {params.star_directory_name} {output}"

# Count mapped reads and produce the count matrix
rule read_count:
    input:
        bams = expand(RESULT_DIR + "star/{sample}_Aligned.sortedByCoord.out.bam", sample = SAMPLES),
        gff = config["refs"]["gtf"]
    output:
        WORKING_DIR + "raw_counts.tsv"
    message:
        "Creating count matrix ..."
    threads: 10
    shell:
        """
        featureCounts -T {threads} \
        -t exon \
        -g transcript_id \
        -F 'GTF' \
        -a {input.gff} \
        -o {output} {input.bams}
        """

# Parse readCounts to remove comments and rename headers
rule parse_raw_counts:
    input:
        WORKING_DIR + "raw_counts.tsv"
    output:
        RESULT_DIR + "raw_counts.parsed.tsv"
    message:
        "Parsing readCount matrix"
    params:
        star_result_dir_name = RESULT_DIR + "star/"
    shell:
        "python src/parse_raw_counts.py {input} {params.star_result_dir_name} {output}"

# Normalise readCounts
rule normalise_raw_counts:
    input:
        raw = RESULT_DIR + "raw_counts.parsed.tsv"
    output:
        norm = RESULT_DIR + "scaled_counts.tsv"
    message:
        "Normalising raw counts"
    shell:
        """
        Rscript --vanilla src/deseq2_normalization.R {input.raw} {output.norm}
        """
```