
# Identifying the source of virulence

## Introduction

Nextclade is a user-friendly tool for quickly assessing the quality of a sequence, assigning it to a clade or type, and comparing it to a reference sequence to determine whether evolutionary changes have occurred in the pathogen or virus under study. The tool was developed during the COVID-19 pandemic, primarily to identify, classify, and track SARS-CoV-2 strains as they rapidly evolved. For SARS-CoV-2 in particular, it was important to identify the clade a strain came from, because this gave fast insight into the types of mutations the strains were acquiring, which aided the development of treatments and informed recommended precautions. Nextclade has since attracted use beyond SARS-CoV-2: datasets for other pathogens are available in the program, and users can upload new datasets. This makes it easy to track, assess, and classify other pathogens and viruses.

The program was developed by Ivan Aksamentov, Cornelius Roemer, Emma B. Hodcroft, and Richard A. Neher, who worked with the Swiss Institute of Bioinformatics (Basel, Switzerland). It is offered as a command-line interface (CLI) and as a web tool. Bulk processing is best done through the CLI, which allows continuous processing of files and can be integrated with other analysis programs. The CLI may also be faster than the web application, depending on one's location and equipment. Nextclade is part of the Nextstrain project and is maintained by Aksamentov, Roemer, and Neher, who run a discussion page and a GitHub repository for questions, bug fixes, discussions, and implementation of new features. The program is open source but should be cited when used.

## Analysis

Nextclade code for the BIT150_assembly file:

```
#!/bin/bash

# Load Nextclade
export PATH=/group/bit150/software:$PATH

# Input file (assembled genome in FASTA format)
input_fasta=/group/bit150/Final/BIT150_assembly.fasta

# Download the SARS-CoV-2 reference dataset
nextclade dataset get --name 'sars-cov-2' \
    --output-dir 'final_bit150/sars-cov-2'

# Run Nextclade and write every output file type to final_bit150/
nextclade run \
    --input-dataset final_bit150/sars-cov-2 \
    --output-all final_bit150 \
    --output-basename BIT150_v_whole \
    "${input_fasta}"
```

#### Output files

Running Nextclade produced about 20 new files. One of the most relevant is 'BIT150_v_whole.tsv', which contains the summarized results of the analysis. This file includes the locations of mutations, the genes they occurred in, what the mutations were, and the clade the strain belongs to. Another file relevant to our research question is 'BIT150_v_whole.errors.csv'. This file lists errors, warnings, or failed genes that may have occurred while the program was analyzing the sequence against the dataset. This information is important when assessing the results of the analysis and their validity.

#### Results of Nextclade

Nextclade identified mutations in the genes M, N, S, ORF1a, ORF1b, ORF3a, ORF7a, ORF7b, ORF8, and ORF9b. One gene of interest is ORF9b, which has been reported to suppress the host's interferon response through interactions with TOM70. Mutations in this gene may alter the resulting protein so that it interacts more strongly with TOM70, making it more difficult for the host to mount an early response against the virus because its interferon response is more suppressed. However, a mutation in ORF9b may also have the opposite effect, producing a protein that suppresses the host's interferon response less effectively.

According to the Nextclade report, the sequence is of "good quality" and has an overall quality score of 8.506944 on a 0-100 scale (100 being bad quality).
The quality score is the sum of the squared individual quality control (QC) scores, divided by 100. This calculation puts the overall score on a 0-100 scale and ensures that a single bad score is clearly reflected while mild concerns have little influence. The four metrics that make up the final quality score are missing data (the number of unknown bases), mixed sites (the number of bases that are not A, C, T, G, - or N), private mutations (the number of mutations on the terminal branch leading to the sequence after it is attached to the tree), and SNP clusters (several private mutations within a short stretch of sequence). Together, these metrics cover many of the variables that indicate the quality of an alignment.

Nextclade placed our strain of interest in clade 21J, which belongs to the WHO variant Delta. According to the phylogenetic tree produced, clade 21J is a descendant of the original Delta strain. Delta also appears to include two other Nextclade clades, 21A and 21I. As we read the tree, clade 21J arose between two major Omicron clade branches and before the clades Alpha, Gamma, and Lambda, but after the clade Mu and just barely after Epsilon and Beta. Because the strain from BIT150_assembly is assigned to Delta, it is likely more dangerous than the original Wuhan strain, as [studies](https://www.weforum.org/agenda/2021/11/what-makes-the-delta-variant-different-covid-19/) have found Delta variants to cause more hospitalizations.

#### Nextclade adjustment options

Two options that may be helpful are `--output-selection` and `--max-indel <MAX_INDEL>`.

`--output-selection` lets the user choose which files are generated when Nextclade is run. This is helpful because several files repeat the same information in different formats (e.g. csv and tsv), and one might not need every translation file that is output. Selecting specific files keeps the results organized and conserves storage.

`--max-indel <MAX_INDEL>` sets the maximum length of an insertion or deletion that will be processed in an alignment. This option may be important when running multiple alignments in series: if an indel is longer than the maximum value, the alignment stops and returns an error instead of continuing, which saves time and memory. The error also gives the user an opportunity to adjust other parameters, such as `--penalty-gap-extend` or `--seed-spacing`, to account for the large indel.

```
#!/bin/bash

# Load Nextclade
export PATH=/group/bit150/software:$PATH

# Input file (assembled genome in FASTA format)
input_fasta=/group/bit150/Final/BIT150_assembly.fasta

# Download the SARS-CoV-2 reference dataset
nextclade dataset get --name 'sars-cov-2' \
    --output-dir 'final_bit150/sars-cov-2'

# Run Nextclade, keeping only the TSV and JSON outputs
nextclade run \
    --input-dataset final_bit150/sars-cov-2 \
    --output-all final_bit150 \
    --output-selection tsv,json \
    --output-basename BIT150_v_adj \
    "${input_fasta}"
```

Running this code, Nextclade returns only the tsv and json file types.
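
With the output limited to tsv and json, the tsv file is the quickest place to look up the clade and the overall QC result. The following is a minimal sketch and was not part of the original analysis: `nextclade_summary` is a hypothetical helper name, and the column names `seqName`, `clade`, `qc.overallScore`, and `qc.overallStatus` are assumed to match the header of the tsv written by the installed Nextclade version.

```bash
#!/bin/bash

# Hypothetical helper: print sample name, clade, and overall QC score/status
# from one or more Nextclade TSV files.
# Assumption: the header row contains the columns seqName, clade,
# qc.overallScore, and qc.overallStatus (in any order).
nextclade_summary() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 {
            # Header row of each file: map every column name to its position
            split("", col)
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("seqName" in col) || !("clade" in col) ||
                !("qc.overallScore" in col) || !("qc.overallStatus" in col)) {
                print "missing expected column in " FILENAME > "/dev/stderr"
                exit 1
            }
            next
        }
        {
            # Data rows: print only the four columns of interest
            print $col["seqName"], $col["clade"], $col["qc.overallScore"], $col["qc.overallStatus"]
        }
    ' "$@"
}

# Example call (the path is a placeholder for a TSV written by Nextclade):
# nextclade_summary path/to/output.tsv
```

Looking columns up by header name rather than by position keeps the helper usable if a different Nextclade version orders the columns differently.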

#### Automation of Nextclade

Automation script that runs Nextclade on every file in a folder and returns a tsv and a tree file for each:

```
#!/bin/bash

# Load Nextclade
export PATH=/group/bit150/software:$PATH

# Directory containing the input FASTA files
inf_directory=/group/bit150/Final/UCD_GenomeCenter_sequences

# Download the SARS-CoV-2 reference dataset
nextclade dataset get --name 'sars-cov-2' \
    --output-dir 'final_bit150/sars-cov-2'

for file in "${inf_directory}"/*.fa
do
    # Sample name: file name without the directory or the .fa extension
    name=$(basename -s .fa "$file")

    # Run Nextclade, keeping only the TSV and tree outputs
    nextclade run \
        --input-dataset final_bit150/sars-cov-2 \
        --output-all final_bit150 \
        --output-selection tsv,tree \
        --output-basename "BIT150_${name}" \
        "$file"
done
```

The Nextclade analysis indicates that the three samples tested above do not all come from the same clade. The sample from the file 2021-02-17 is from clade 21C (Nextclade), Epsilon (WHO); the sample from 2021-07-30 is from clade 21H (Nextclade), Mu (WHO); and the sample from 2021-10-18 is from clade 21J (Nextclade), Delta (WHO). Interestingly, the sample from 2021-10-18 is from the same clade as BIT150_assembly, although the analysis and the differing mutations indicate that the two samples are different strains.

## Discussion

Nextclade has both a command-line interface (allowing automation over many files in a folder or integration with other programs) and a web interface (allowing extended visualization), making it a tool that can be used by people with different skill sets. The program is easy to learn and allows fast generation of phylogenetic trees and comparison of sample sequences. The variety of files the program produces makes it highly adaptable, as csv, tsv, json, and other file formats are available. The level of detail also varies with the file selected: tsv files give summaries of the results, while json files give a detailed analysis.
Nextclade analysis allows fast identification of mutations and clade, along with a brief assessment of how "good" the alignment is. Furthermore, many parameters can be adjusted to optimize the alignment and help ensure appropriate results.

While the program offers many user-friendly features, the analysis files do not describe what the genes do, which would be a significant help when trying to understand why certain mutations might arise. ORFs are labeled in the output files, but the output does not indicate when an ORF lies within another gene, which may be useful for the user to know. Additionally, while the program returns quality scores, it would be helpful to also receive statistics commonly reported by other alignment tools, such as an E-value, to further evaluate the sequence alignment produced. Finally, auspice.us, the interface used to visualize the phylogenetic trees produced by Nextclade, does not appear to have a feature for zooming in on distinct parts of the tree and moving around in the zoomed-in view, which makes the tree more difficult to analyze.

Although additional programs are required to learn what the genes might do, and the page for viewing the phylogenetic trees lacks some features that would make it easier to use, Nextclade offers quality information. Overall, the program is a user-friendly tool that allows efficient analysis, creation of phylogenetic trees, and alignment of virus sequences, which makes it useful for characterizing viruses.

* AI was used to aid in code correction (indentation and suggestions when code wasn't running correctly) and to provide information on variant severity.