bit150 final project

# Final project

Tianyu Wang 917440012

2023/12/13

[TOC]

## Program introduction: Nextclade

Nextclade is a bioinformatics tool that has become an important resource for genomic surveillance during the COVID-19 pandemic. It was developed by researchers at the Nextstrain project, an open-source project for real-time tracking of pathogen evolution, and is maintained by a community of scientists and developers, including Ivan Aksamentov, Cornelius Roemer, and many other contributors. When data acquired from Nextclade is included in a publication, the original paper should be cited:

"*Aksamentov, I., Roemer, C., Hodcroft, E. B., & Neher, R. A., (2021). Nextclade: clade assignment, mutation calling and quality control for viral genomes. Journal of Open Source Software, 6(67), 3773, https://doi.org/10.21105/joss.03773*"

The primary goal of Nextclade is to provide a comprehensive analysis of SARS-CoV-2 sequences. It lets users perform sequence alignment, mutation calling, clade assignment, quality control, and phylogenetic placement of their sequences on a reference tree. This makes the program especially useful for researchers and healthcare professionals who track the evolution and spread of the virus. By analyzing SARS-CoV-2 genomic sequences, Nextclade helps identify new variants and shows their phylogenetic relationships to existing strains.

The CLI version of Nextclade stands out for its flexibility and efficiency. It is well suited to high-throughput analysis and can be integrated into automated workflows, which makes it a good choice for large-scale studies and for researchers who need more control over the analysis process.

In conclusion, Nextclade represents a significant advance in our ability to monitor and understand the evolution of SARS-CoV-2.
## Analysis

### Program access & Data acquisition

After connecting to my own bit150-48 account over ssh:

```
# Make the course copy of nextclade available on the PATH
export PATH=/group/bit150/software:$PATH
nextclade --help
# Download the SARS-CoV-2 reference dataset
nextclade dataset get --name 'sars-cov-2' --output-dir 'data/sars-cov-2'
# Analyze the assembly and write all output files to output/
nextclade run \
   --input-dataset data/sars-cov-2 \
   --output-all=output/ \
   /group/bit150/Final/BIT150_assembly.fasta
```

#### Output file lists:

The results are now written to the output folder in my bit150-48 account. I copied them to my computer's SSD with FileZilla as a backup. Here is the list of all output files:

![Screenshot 2023-12-13 190328](https://hackmd.io/_uploads/rJmnxlO8a.png)

#### Two key output files are:

* Tabular (CSV/TSV) Results: These files contain the results of the analysis in table format, either as CSV (Comma-Separated Values) or TSV (Tab-Separated Values). They include detailed information on each sequence, such as clade assignment, mutation details, and quality control scores.
* JSON Results: This file, typically named nextclade.json, provides the analysis results in a comprehensive, machine-readable format. It includes all the data present in the tabular files, along with additional details, making it suitable for in-depth automated processing such as machine learning.

### Result description

I read through the nextclade.csv file, which contains the analysis of our uploaded sequence data. The file lists mutations in multiple genes, including M, N, ORF1a, ORF1b, ORF3a, ORF7a, ORF7b, ORF9b, S (Spike), and ORF8.

##### Spike (S) Protein Gene

The Spike protein facilitates the virus's entry into host cells. It binds to the ACE2 receptor on human cells, initiating the infection process. The protein is the target of many of the vaccines and therapeutic antibodies developed against COVID-19.

#### Potential Impact of Mutations

* S:L452R: This is a mutation in the Spike protein where the amino acid at position 452 changes from Leucine [L] to Arginine [R].
  This kind of mutation could affect the binding affinity of the Spike protein to the ACE2 receptor and thereby alter the virus's infectivity. It might also reduce how well antibodies recognize the virus, which would influence vaccine effectiveness.
* S:T478K: Similar to L452R, this mutation at position 478 (Threonine [T] to Lysine [K]) could also affect the Spike protein's structure and its interaction with the ACE2 receptor.

#### Analysis quality

Based on the quality control scores in the TSV file, the analyzed sequence (identified as "BIT150_2022") has an overall QC score of approximately 8.51, and its QC status is classified as "good". This score falls well within the "good" range (0 to 29) defined by Nextclade's quality control guidelines. Therefore, we can conclude that the quality of this sequence is sufficient for reliable analysis and interpretation.

#### Clade

The clade assigned to the sequence is "21J".

#### Auspice website illustration

![auspice](https://hackmd.io/_uploads/SJbrQgdIT.png)

![21j](https://hackmd.io/_uploads/B1N1SeOU6.png)

The 21J (Delta) strain is one of the earlier variants in the context of this phylogenetic tree (containing 32 to 48 genetic variants) and is genetically distinct from newer clades such as Omicron and its sublineages. Each clade represents a separate evolutionary path with its own set of mutations.

### Nextclade run controllers

* --input-fasta: This option specifies the path to the FASTA file containing the sequences I want to analyze. Changing it lets me analyze different sequences.
* --output-json: This option sets the path of the output JSON file, where Nextclade saves its comprehensive analysis results. In my run, the JSON file ends up at output/nextclade.json. JSON output is useful for further automated processing and analysis because it is a structured, machine-readable format.

After I changed the --output-json value in the original command, the output .json file was written to a different location and under a different name. Its contents, however, stayed the same as in the original output.

### Automatic scripts

```
#!/bin/bash

# Directory containing the .fa files
SEQUENCE_DIR="/group/bit150/Final/UCD_GenomeCenter_sequences"

# Directory for output files
OUTPUT_DIR="/home/bit150-48/finalproj/q5output"  # Replace with your desired output directory

# Absolute path to the dataset directory
DATASET_DIR="/group/bit150/data/sars-cov-2"  # Replace with the actual path to the dataset

# Loop through all .fa files in the sequence directory
for sequence_file in "$SEQUENCE_DIR"/*.fa; do
    # Extract the base name of the file for naming the output files
    base_name=$(basename "$sequence_file" .fa)

    # Run Nextclade for each file
    nextclade run -D "$DATASET_DIR" -O "$OUTPUT_DIR/" \
        --output-tree="$OUTPUT_DIR/${base_name}.auspice.json" \
        --output-tsv="$OUTPUT_DIR/${base_name}.tsv" \
        "$sequence_file"
done
```

Here is the list of output files (2021-02-17/10-18/07-30 are .tsv files):

![outputlist](https://hackmd.io/_uploads/HJT9vn_La.png)

The three strains are from different clades:

* 21C: 02-17
* 21H: 07-30
* 21J: 10-18

## Discussion: Benefits & Limitations

Nextclade, as a bioinformatics tool designed for phylogenetic analysis of pathogens such as SARS-CoV-2, has several notable advantages and limitations:

### Advantages:

* Accessibility: Nextclade is available as both a web-based tool and a command-line application, making it accessible to a wide range of users with different technical skills.
* Real-Time Updates: Nextclade is regularly updated with new data and features, which is crucial for tracking the evolution of fast-changing pathogens like SARS-CoV-2.

### Limitations:

* Complexity for Beginners: For users new to bioinformatics, there can be a steep learning curve in understanding the results and in making the best use of all of the tool's functionality.
* Data Interpretation: Interpreting the output can be complex (the layout of the .csv and .tsv output files is hard to read) and requires a good understanding of phylogenetics and virology, which might call for additional expertise.
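
The data interpretation issue can be eased somewhat by pulling only the columns of interest out of the wide TSV file. The following is a minimal sketch, not part of the original analysis. It assumes the TSV has a header row with the column names seqName, clade, qc.overallScore and qc.overallStatus (worth confirming against your own file). The helper name pick_columns and the small sample file are hypothetical; the sample values are the ones reported in the result description above.

```shell
#!/bin/bash
# Hypothetical helper: print selected columns of a tab-separated file.
# Columns are looked up by header name, so their order in the file does not matter.
# Usage: pick_columns FILE COL1,COL2,...
pick_columns() {
    awk -F'\t' -v OFS='\t' -v wanted="$2" '
        NR == 1 {
            # Map each header name to its column position
            n = split(wanted, names, ",")
            for (i = 1; i <= NF; i++) pos[$i] = i
            # Stop early if a requested column is not in the header
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    print "missing column: " names[j] > "/dev/stderr"
                    exit 1
                }
            }
        }
        {
            # Rebuild each row (header included) from the requested columns only
            line = ""
            for (j = 1; j <= n; j++) {
                line = line (j > 1 ? OFS : "") $(pos[names[j]])
            }
            print line
        }
    ' "$1"
}

# Small stand-in for the real TSV (illustrative only, not actual tool output)
sample_tsv=$(mktemp)
printf 'seqName\tclade\tqc.overallScore\tqc.overallStatus\n' > "$sample_tsv"
printf 'BIT150_2022\t21J\t8.51\tgood\n' >> "$sample_tsv"

summary=$(pick_columns "$sample_tsv" seqName,clade,qc.overallStatus)
printf '%s\n' "$summary"
rm -f "$sample_tsv"
```

Against the real results, the call would look like `pick_columns output/nextclade.tsv seqName,clade,qc.overallScore,qc.overallStatus`, assuming the default output file name. Looking columns up by name rather than by position keeps the helper usable if the column order differs between Nextclade versions.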