BIT 150 Final - Langyi Zuo

## BIT 150 Final - Identifying the Source of Virulence

*Langyi Zuo*

### 1. Nextclade Introduction:

**1.1 Development, Maintenance, and Restrictions of Use:**

Nextclade was developed by Ivan Aksamentov, Cornelius Roemer, Emma B. Hodcroft, and Richard A. Neher from the Biozentrum, University of Basel, and the Swiss Institute of Bioinformatics. The program is actively maintained by the Nextstrain team, who respond to bug reports on GitHub and add features based on user feedback and community contributions. The paper describing Nextclade was released under a Creative Commons Attribution 4.0 International License (CC BY 4.0), and the software itself is open source under the MIT License, so there are minimal restrictions on its use, primarily attribution.

**1.2 Goal and Usage:**

The program aims to provide rapid, decentralized analysis of viral genomes, with a primary focus on SARS-CoV-2. It aligns viral genomes to a reference sequence, calculates quality control metrics, assigns sequences to a clade or variant, and identifies changes in viral proteins. This lets labs around the world quickly assess the quality of newly generated SARS-CoV-2 sequences, categorize them into variants and clades, and investigate their mutational profiles.

**1.2.1 Importance of Identifying SARS-CoV-2 Clades:**

Identifying the clade of a particular SARS-CoV-2 strain is crucial for understanding the evolution and spread of the virus. By assigning sequences to specific clades, researchers and public health officials can track the emergence and distribution of variants of concern, which supports global molecular surveillance and management of the pandemic.

**1.2.2 Advantages of CLI vs Web-Tool:**

The CLI version of Nextclade is suited to bulk analysis of many sequences, offering a more streamlined and efficient process for large datasets.
In contrast, the web tool provides the same functionality with the added advantage of interactive visualization, making it better suited to smaller datasets or to users who prefer a graphical interface.

### 2. Analysis:

**2.1 Commands to Run Nextclade:**

```
# Load the program:
export PATH=/group/bit150/software:$PATH

# Download the latest dataset:
nextclade dataset get --name 'sars-cov-2' --output-dir 'Final/data/sars-cov-2'

# Run the analysis:
nextclade run \
   --input-dataset Final/data/sars-cov-2/ \
   --output-all=Final/output/ \
   Final/BIT150_assembly.fasta
```

**2.2 Output Files:**

![output_files](https://hackmd.io/_uploads/ryy_VzOLT.png)

**Key Output Files:**

**2.2.1 nextclade.tsv / nextclade.csv:**

These files contain the results of the analysis in tabular format. They summarize the mutations, deletions, insertions, and quality metrics for each sequence. Because the mutations are reported relative to the Wuhan-Hu-1 reference strain, one can see whether the strain carries significant changes that might impact virulence.

**2.2.2 nextclade.auspice.json:**

This file is useful for phylogenetic analysis. It shows where our sequence is placed on the phylogenetic tree relative to other known strains, including the original Wuhan-Hu-1 strain. This helps clarify the evolutionary relationship of our strain to Wuhan-Hu-1 and other strains, which is important when assessing potential changes in its characteristics.

**2.2.3 nextclade_gene_<gene_name>.translation.fasta:**

These files contain the aligned peptides for each gene, which is useful for examining changes in protein sequences that could affect the virus's behavior or its interaction with host cells. In my results, the gene names include ORF1a, ORF1b, ORF6, etc.
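The tabular output described in 2.2.1 can be inspected directly from the shell. The following is a minimal sketch, not part of the original analysis: `tsv_columns` is a hypothetical helper written for this illustration, and the column names `seqName`, `clade`, `qc.overallScore`, and `qc.overallStatus` are assumed to match the header of the TSV produced by the Nextclade version in use. The path assumes the `--output-all` directory from 2.1.

```shell
# Hypothetical helper: print selected columns of a tab-separated file,
# looked up by header name rather than by position.
# Usage: tsv_columns FILE COLUMN [COLUMN ...]
tsv_columns() {
    local file="$1"
    shift
    local wanted
    # Join the requested column names with commas to pass them to awk
    wanted=$(IFS=,; printf '%s' "$*")
    awk -F'\t' -v OFS='\t' -v wanted="$wanted" '
        NR == 1 {
            # Map each header name to its column position
            n = split(wanted, names, ",")
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    printf "column not found: %s\n", names[j] > "/dev/stderr"
                    exit 1
                }
                idx[j] = pos[names[j]]
            }
        }
        {
            # Print the requested columns (header row included)
            line = $(idx[1])
            for (j = 2; j <= n; j++) line = line OFS $(idx[j])
            print line
        }
    ' "$file"
}

# Assumed location of the results, based on --output-all=Final/output/
results="Final/output/nextclade.tsv"
if [ -f "$results" ]; then
    tsv_columns "$results" seqName clade qc.overallScore qc.overallStatus
fi
```

Looking columns up by name keeps the sketch independent of column order, which may differ between Nextclade versions.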
**2.3 Results:**

**2.3.1 Mutations Discovered:**

Genes:

- M (Membrane) Gene
- N (Nucleocapsid) Gene
- S (Spike) Gene
- ORF1a Gene
- ORF1b Gene
- ORF3a Gene
- ORF7a Gene
- ORF7b Gene
- ORF9b Gene

**Spike (S) Gene:**

The Spike protein is a critical component of the SARS-CoV-2 virus, as it enables the virus to enter and infect human cells. It binds to the ACE2 receptor on the surface of human cells, facilitating viral entry. A mutation in this gene could give the protein new abilities to bind other receptors, making it easier for the virus to enter human cells.

**2.3.2 Quality of the Sequence:**

The quality of the sequence is sufficient. I based this conclusion on two columns in the results: "qc.overallScore", which has a value of 8.506944 (in the "good" range according to the documentation), and "qc.overallStatus", which is "good".

**2.3.3 Clade Assigned:**

21J.

**2.3.4 Auspice:**

**Zoomed in:**

![BIT150_2022](https://hackmd.io/_uploads/HJCwwLd86.png)

**Relationship to other major clades (our strain is 21J):**

![BIT150_2022_all](https://hackmd.io/_uploads/rkm5TAdU6.png)

**2.4 Alternate Nextclade Options:**

a.
```
--min-length
```
This option sets the minimum length an input sequence must have to be included in the analysis. It is particularly useful for filtering out fragmented or incomplete sequences that could skew the results.

**Usage Example:**
```
nextclade run --min-length 25000
```

b.
```
--qc-config
```
This option specifies a custom Quality Control (QC) configuration file. It allows users to define their own QC parameters, which can be important for tailoring the analysis to specific needs or experimental designs. The flag name can differ between Nextclade versions, so confirm it with `nextclade run --help`.

**Usage Example:**
```
nextclade run --qc-config /path/to/qc-config.json
```

**2.4.1 Demonstration:**
```
nextclade run --min-length 25000
```
I added this parameter to the `nextclade run` command in the 2.5.1 script (see below). The new output files are unchanged, because none of the three sequences is shorter than 25,000 bases.
However, any sequence shorter than 25,000 bases would be ignored during the analysis.

**2.5 Automation:**

**2.5.1 Script:**

```
#!/bin/bash
### Question 5 automation:

# Directory containing the .fa files
input_directory="Final/UCD_GenomeCenter_sequences"

# Directory of the outputs
output_directory="Final/UCD_Output"

# Nextclade dataset directory
dataset_directory="Final/data/sars-cov-2"

# Make sure the output directory exists
mkdir -p "$output_directory"

# Loop through each .fa file in the input directory
for input_file in "$input_directory"/*.fa; do
    # Extract the base name of the file for naming the output files
    base_name=$(basename "$input_file" .fa)

    # Define the output file names
    output_tree="${output_directory}/${base_name}.auspice.json"
    output_tsv="${output_directory}/${base_name}.tsv"

    # Run Nextclade
    nextclade run \
        --input-dataset "$dataset_directory" \
        --output-tree "$output_tree" \
        --output-tsv "$output_tsv" \
        "$input_file"
done
```

The three samples are not from the same clade: they belong to clade 21C (2021-02-17), clade 21H (2021-07-30), and clade 21J (2021-10-18).

### 3. Discussion:

As a bioinformatics tool, Nextclade has several advantages and limitations in terms of functionality, accessibility, and applications in the study of genetics and genomics.

**Advantages:**

Nextclade is available both as a web application and as a command-line interface (CLI) tool, making it accessible to a wide range of users, from those who prefer graphical interfaces to those who work with advanced bioinformatics pipelines. It can align viral genomes to a reference sequence, perform quality control, assign sequences to a clade or variant, and identify changes in viral proteins, which makes it a powerful tool for data analysis in viral genomics. In addition, Nextclade can perform the analysis locally, so sequence data does not leave the user's system, providing an extra layer of data privacy and security.
**Limitations:**

Nextclade is a specialized tool for analyzing the genomes of viruses such as SARS-CoV-2, monkeypox virus, and RSV, which limits its use in other areas of genomics and bioinformatics. The CLI version of the program is more powerful than the web version; however, it has a steeper learning curve for users who are not familiar with the terminal. Furthermore, the accuracy of the output relies heavily on the quality of the input data: inaccuracies in sequencing or data entry can lead to erroneous results.

In summary, Nextclade stands out for its ease of use, comprehensive analysis capabilities, and secure local processing of data. Its specialization in viral genomes, particularly SARS-CoV-2, makes it a critical tool in the context of the COVID-19 pandemic. However, its limited applicability to broader genomics, the learning curve associated with its CLI version, and its reliance on the quality of the input data should be considered when choosing a bioinformatics tool for specific research needs.