
---
tags: spillover, virus, database
---

Database work
===

## Existing SpillOver sequences

Some SpillOver accessions are "Unknown" - I don't know how to obtain those sequences. There is another column, `IndividualID`, but I'm not sure if it's possible to link that information with external databases.

SpillOver accessions are not full genomes, but can be partial cds or full cds, e.g.
`>EU241644.1 Andes virus isolate NK95083 G1 glycoprotein gene, partial cds`
or
`>FJ399899.1 Simian immunodeficiency virus clone PQo1 AxLN Int 2.5 C pol protein (pol) gene, partial cds`
or
`>NC_048211.1 Wencheng Sm shrew coronavirus isolate Xingguo-74 ORF1ab polyprotein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds`

### Obtaining these sequences

download url (replace {acc} with full accession):
`url = f"https://www.ncbi.nlm.nih.gov/nuccore/{acc}?report=fasta&format=text"`

alternate (worse) link:
```
https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=fasta&id={acc}
```

for protein sequences:
`url = f"https://www.ncbi.nlm.nih.gov/protein/{acc}?report=fasta&format=text"`

this might help with bulk download?
```
https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=fasta&id={acc}&extrafeat=null&conwithfeat=on&hide-cdd=on&retmode=html&tool=portal&withmarkup=on&log$=seqview&maxdownloadsize=1000000
```

approach: create sourmash `fromfile` but with links to fasta sequences instead of the files
OR
download all files and create sourmash fromfile with file links

For viruses, two thoughts:
1. low scaled = use kprocessor or rocksdb backend?

## ICTV Reference Database / Sequences

filename: `VMR_21-221122_MSL37.xlsx`

### Important Columns:

`Virus GENBANK accession`,`Virus REFSEQ accession`
> potentially multiple entries per column, for, e.g. segmented viral genomes (e.g. `L` segment, `S` segment).
> If there are multiple, they will be separated via semicolons
> `NC_027989; NC_041833; NC_041831; NC_041832; NC_041834`
> or
> `DNA-C: NC_040751; DNA-M: NC_040745; DNA-N: NC_040746; DNA-R: NC_040747; DNA-S: NC_040748; DNA-U1: NC_040749; DNA-U2: NC_040744; DNA-U4: NC_040750`
>
> Sometimes there's a GenBank entry (or multiple) and not a RefSeq entry

`Exemplar or additional isolate`:
> One 'Exemplar virus' is chosen for each species to serve as an example of a well characterized virus. If the value in this column is 'E', this indicates that this virus has been chosen as the exemplar. If the value is 'A', then this virus is an additional representative of the species.

`Genome coverage`:
> - Complete genome
> - Complete coding genome
> - Partial genome

`Genome composition`:
> "The molecular and genetic composition of the virus genome packaged into the virion. Possible values are:
> - dsDNA
> - dsDNA-RT
> - dsRNA
> - ssDNA
> - ssDNA(-)
> - ssDNA(+)
> - ssDNA(+/-)
> - ssRNA
> - ssRNA-RT
> - ssRNA(-)
> - ssRNA(+)
> - ssRNA(+/-)"

## Obvious Issues

Some (two) entries correspond to viruses (prophage?) in larger genomes, with nucleotide ranges provided. The ranges don't _quite_ match the annotations I found online?

AE006468 (2844298..2877981).dna.log
- https://www.ncbi.nlm.nih.gov/nuccore/AE006468.2/

> Salmonella enterica subsp. enterica serovar Typhimurium str.
> LT2, complete genome
> source 2844431..2879237
> /organism="Salmonella virus Fels2"
> /mol_type="genomic DNA"
> /db_xref="taxon:194701"

When I then searched for "Salmonella virus Fels2", I found three genome records: https://www.ncbi.nlm.nih.gov/nuccore/?term=txid194701[organism:exp]%20AND%20biomol_genomic[prop]

The best accession to use is probably the complete genome, https://www.ncbi.nlm.nih.gov/nuccore/NC_010463.1
> Enterobacteria phage Fels-2, complete genome

output.vmr/logs/downloads/genbank/ LK928904 (2253..10260).dna.log
https://www.ncbi.nlm.nih.gov/nuccore/LK928904.1/

## Plan:

Initially I split the different accessions (typically different segments for segmented viruses) into separate rows in a new csv file, in order to use the '{accession} {name}' signature format we typically use. This is not ideal, however, because we want a ~full/complete signature for each virus. We could combine all accessions/sequences manually, but NCBI's Assembly datasets have already done this for us.

Find an 'assembly' accession that corresponds to the upload accession used by the ICTV
- https://www.ncbi.nlm.nih.gov/genome/viruses/about/assemblies/
- In 'assemblies', segments are aggregated into full genomes (one accession per genome)
- benefits:
    - curated db
    - single identifier for whole assembly
- drawbacks:
    - some ICTV accessions may be missing Assembly datasets
    - ??

### Find Corresponding 'Assembly' datasets

The ICTV viruses are provided via an Excel file with one row per reference virus, which contains taxonomic information as well as additional reference information. This file provides the GenBank and RefSeq accession numbers that correspond to the sequences. For segmented viruses, an accession number is provided per genome segment, sometimes labeled, e.g. `{name}: {accession}`. For classification of the entire virus, it would be ideal instead to have a combined accession that refers to all genomic sequence of each virus.
NCBI provides this in their curated 'Assembly' datasets resource, and we can find the corresponding identifier by querying via NCBI Entrez or using the `ncbi-datasets` tool.

The file `find-assembly-accessions` uses the `Biopython Entrez` library and produces an updated `VMR` file with the corresponding GenBank Assembly accession. There are a few exceptions where either no corresponding Assembly exists, or it could not be matched via Entrez.

Exceptions:
1. No Accession was provided (`no_accession`; 232)
    - No GenBank accession was provided, so we don't have a sequence record to link
2. Viral genomes within other genomes (`parentheses`; 2)
    - The accessions provided here correspond to larger genomes, where the viral genome is a small component, with the base pair range provided in parentheses. Since we don't want to assign the entire host genome to a viral taxonomy, we should manually download the base pair range and use it as the sequence. Alternatively, in some cases there may be a better representative for this species available.
3. No Assembly Accession was found (`no_assembly`; 154):
    - Partial genomes (54)
    - ?? (100)
4. Multiple Assembly Accessions found (`multiple_acc`; 13)
    - The segments provided for these viruses link to different Assembly accessions. This should not happen. In the one case I spot-checked, this was due to an assembly being redacted. Needs manual investigation to select the best accession.
5. Error during information retrieval (`retrieval`; 0)
    - Primarily a check for any errors occurring during Entrez database queries.

### Download Viral Assemblies and build a `sourmash` database

With assembly accessions in hand, we now download all viral genomes from NCBI and index them as a sourmash database. We use the file generated above as input into the `dl-sketch.ICTV.smk` snakefile.
This is currently written to download and sketch all available assembly accessions, but we may want to build a smaller database containing only the `exemplar` genomes, leaving out the `additional` isolates. In the `VMR_21-221122_MSL37.csv` file, there are 10435 'Exemplar' and 1626 'Additional' entries.

Issues:
- Ran into an exception during download: GCA_004789135.1 is suppressed. For now, it was manually removed from the accession file. In the future, we need a programmatic query to determine whether a record has been suppressed -- this should be feasible to add to the first step, where we use Biopython Entrez to link assembly identifiers.
- 6 records have duplicate entries in the csv file:

> - GCF_001041915.1 Mycobacterium phage Fionnbharth
> - GCF_001745335.1 Shigella phage SHBML-50-1
> - GCF_002625825.1 Rhodococcus phage Hiro
> - GCF_003308095.1 Rhodococcus phage Takoda
> - GCF_004138835.1 Salmonella phage 3-29
> - GCF_004138895.1 Pseudomonas phage vB_PaeM_SCUT-S1
>
> Since the duplicated entries in the `fromfile.csv` file are identical, they were removed via `awk '!x[$0]++' $filename > $filename-new` for now. This should be automated in the future.
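
One way the duplicate removal might be automated is a small Python helper that mirrors the `awk '!x[$0]++'` one-liner: keep the first occurrence of each line and preserve the original order. This is only a minimal sketch, assuming the duplicated rows are byte-for-byte identical (as noted above). The function names `drop_duplicate_lines` and `dedupe_file` are hypothetical and not part of the existing workflow.

```python
from pathlib import Path


def drop_duplicate_lines(lines):
    """Yield each line the first time it is seen, preserving input order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


def dedupe_file(src, dest):
    """Write the unique lines of `src` to `dest`; return how many were dropped.

    Hypothetical helper: lines are compared without their trailing newline,
    like awk's `$0`, so only exact duplicate rows are removed.
    """
    lines = Path(src).read_text().splitlines()
    unique = list(drop_duplicate_lines(lines))
    Path(dest).write_text("".join(line + "\n" for line in unique))
    return len(lines) - len(unique)
```

Because the csv header line is unique, it is kept as the first line of the output. A call such as `dedupe_file("fromfile.csv", "fromfile.dedup.csv")` (output name chosen here only for illustration) could run as a step before sketching, and a non-zero return value could be logged to flag that duplicates were present.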