---
tags: ebook, english
---

# Glossary

## A

**accession number** a unique identifier given to a DNA or protein sequence record so that different versions of that record can be tracked.

**adapter** (see sequencing adapter)

**amplicon** a segment of DNA that is the product of a polymerase chain reaction or other method of amplification

**American Standard Code for Information Interchange (ASCII)** a character encoding standard for electronic communication.

NOTE: Field 4 of FASTQ files encodes the quality score of each sequencer-generated base as an ASCII character. The procedure for finding the Q score given an ASCII character is:

1) Find the decimal equivalent of the ASCII character.
2) Subtract 33 from the decimal equivalent.

Example: the ASCII character 'A' has the decimal equivalent 65, and 65 - 33 = 32. So a quality character of 'A' means Phred 32, or slightly better than 99.9% base-call accuracy. (See **Phred quality score** or **Q score** below.)

Why 33? ASCII codes 33 to 126 are printable, non-whitespace characters, so each quality score can be written as a single visible character and the scores for a whole read can be stored as a plain text string.

![](https://hackmd.io/_uploads/BkMGIngc3.png)

**AMPure beads** 0.4X clean... a ratio of beads/solution

## B

**barcoding**
1.
2.

**basecaller** a program that assigns a base to an electrochemical signal or chromatogram peak.

**basecalling** (Nanopore sequencing) the process of assigning bases to the electrochemical signals generated by nucleic acid bases passing through the pores of a flow cell.

**basecalling** (Sanger sequencing) the process of assigning bases to chromatogram peaks.

**BLAST** (**B**asic **L**ocal **A**lignment **S**earch **T**ool)

## C

**call** (=base *call*)

**COI** mitochondrial gene that encodes **C**ytochrome **O**xidase subunit **1**, often used to identify animals via DNA barcoding. (DNALC Barcoding 101-D)

**contig** (from **contig**uous) a series of overlapping DNA sequences used to reconstruct all or part of a chromosome.

**CyVerse**

## D

**DNA/RNA library** a collection of DNA or RNA fragments that are ready for sequencing

**DNA subway**

## E

**EPI2ME** ONT's cloud software for basic genomic analysis

## F

**FAST** (**F**ast **A**lignment **S**earch **T**ool) a software package used for sequence alignment and database searching.

**FASTA** (**F**ast **A**lignment **S**earch **T**ool **A**ll) software package used for sequence alignment and database searching for **A**ll alphabets (e.g. amino acids and nucleotides). FASTA is a successor to FAST**P** (for **P**roteins) and FAST**N** (for **N**ucleotides). The original program, FASTP, was described by Lipman and Pearson in 1985; FASTA followed in 1988.

**FASTA file** computer text file for storing nucleotide sequences or amino acid sequences according to FASTA format.

**FASTA format** a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. Each sequence begins with a greater-than character (">") followed by a description of the sequence, all on one line. The following lines hold the sequence itself, with one letter per amino acid or nucleotide, and are typically no more than 80 characters long. Example:

![](https://hackmd.io/_uploads/Sky7WGSY3.png)

**FASTQ file** computer text file for storing nucleotide sequences according to FASTQ format.

**FASTQ format** a text-based format for storing both a biological sequence (usually nucleotide) and its corresponding quality scores. In FASTQ format, each sequence is presented with 4 fields:

- Field 1: begins with a '@' character, followed by a sequence identifier and an optional description (similar to a FASTA title line).
- Field 2: raw sequence letters.
- Field 3: begins with a '+' character, and **may** be followed by the Field 1 sequence identifier and description.
- Field 4: encodes the Phred quality values, as ASCII characters, for the sequence in Field 2, one character per base (see **"ASCII"**).

Example:

![](https://hackmd.io/_uploads/H1PCIrIYh.png)

NOTE: a FASTQ file can be opened with a text editor (e.g. "TextEdit" on a Mac).

**FAST5 file** a computer binary file for storing nucleotide sequences according to FAST5 format. In contrast to FASTA and FASTQ files, a FAST5 file is binary and cannot be opened with a normal text editor.

**FAST5 format** a binary format that is the standard sequencing output of Oxford Nanopore sequencers. FAST5 is based on the Hierarchical Data Format version 5 (HDF5), which enables storage of large and complex data. A Nanopore FAST5 file can contain the raw signal from the pore, the sequence of a read in FASTQ format (after basecalling), and several log files and other information.

**flongle** adapter for MinION that uses single-use flow cells with 126 channels for sequencing; intended for smaller or more frequent experiments

**flow cell**

**flow cell washing**

## G

**Guppy** the basecaller that is integrated into MinKNOW

**GPU**

## H

**Hierarchical Data Format version 5 (HDF5)** an open source file format that supports large, complex, heterogeneous data. HDF5 uses a "file directory"-like structure that allows one to organize data within the file in many different structured ways, as one might do with files on a computer.

**Hidden Markov Models (HMMs)** a class of probabilistic graphical models that allow one to predict a sequence of unknown (hidden) variables from a set of observed variables. A simple example of an HMM is predicting the weather (hidden variable) based on the type of clothes that someone wears (observed). An HMM can be viewed as a Bayes net unrolled through time, in which observations made at a sequence of time steps are used to predict the most likely sequence of hidden states. https://medium.com/@postsanjay/hidden-markov-models-simplified-c3f58728caab

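
As a minimal sketch of the weather-and-clothes example, the Viterbi algorithm (one standard way to find the most likely hidden sequence in an HMM) can be written in a few lines of Python. The state names, observation names, and every probability below are made-up values for illustration only, and the function name `viterbi` is simply a label chosen here.

```python
# Hypothetical model: all names and probabilities are illustrative assumptions.
STATES = ("sunny", "rainy")
START_P = {"sunny": 0.6, "rainy": 0.4}
TRANS_P = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMIT_P = {
    "sunny": {"tshirt": 0.8, "coat": 0.2},
    "rainy": {"tshirt": 0.1, "coat": 0.9},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely sequence of hidden states for the observations."""
    # best[t][s]: probability of the most likely path that ends in state s at step t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    # back[t][s]: the state at step t-1 on that most likely path
    back = [{}]

    for obs in observations[1:]:
        prev = best[-1]
        scores, pointers = {}, {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            p, origin = max((prev[r] * trans_p[r][s], r) for r in states)
            scores[s] = p * emit_p[s][obs]
            pointers[s] = origin
        best.append(scores)
        back.append(pointers)

    # Start from the most probable final state and follow the pointers backwards.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    clothes = ["tshirt", "coat", "coat"]
    print(viterbi(clothes, STATES, START_P, TRANS_P, EMIT_P))
```

With these assumed numbers, seeing a T-shirt followed by two days of coats is best explained by one sunny day followed by two rainy days. Real basecalling or gene-finding HMMs have many more states and usually work with log probabilities to avoid numerical underflow on long sequences.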
## I

**ITS** (**I**nternal **T**ranscribed **S**pacer) variable spacer sequences of DNA located between the rRNA genes; they are transcribed but removed during rRNA processing, and are often used to identify fungal species via DNA barcoding. (Modified from Wikipedia)

## J

**Jetstream2**

## K

## L

**library preparation** the overall process of preparing a collection (library) of DNA or RNA fragments for sequencing. The process is sequencer dependent, but usually includes nucleic acid (NA) isolation, production of NA fragments via cleavage, and addition of molecular tags to the fragments.

**ligation**

**ligation kit**

## M

**metabarcoding**

**MinION Mk1B** sequencing hardware that requires attachment to a computer (via USB)

**MinION Mk1C** self-contained sequencing hardware that includes computer, software, and display

**MinKNOW** software that runs Nanopore devices, including basecalling and real-time analysis

**multiplexing** simultaneous sequencing of multiple DNA samples in a single run

**muxing**

## N

**NanoDrop** assay device for DNA quantification

**nanopore**

**NCBI**

## O

**ONT** (Oxford Nanopore Technologies)

## P

**Phred** a computer program that calls bases, assigns quality values to them, and writes the bases and their quality values to output files. Phil Green of the University of Washington had a leading role in developing the software (**Ph**il's **r**ead **ed**itor). (https://www.seqanswers.com/forum/sequencing-technologies-companies/illumina-solexa/47936-what-does-phred-stand-for)

**Phred quality (Q) score** a measure of the likelihood that an automated sequencer has accurately identified a base. Expressed as: Q = -10 log~10~ P, where Q = Phred score and P = probability that the base call is incorrect.

![](https://hackmd.io/_uploads/rkFHJox92.png)

**primer dimer**

**Protein Data Bank (PDB)**

**POD5**

## Q

**Qubit** DNA quantification machine

**Q score** a measure of basecalling accuracy (see **Phred quality (Q) score**); a low score indicates poor data quality, and a score of 20 or higher (at least 99% accuracy) is generally considered acceptable. A score of Q30 corresponds to 99.9% accuracy; no Q score means a base call is certain.

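
The relationship Q = -10 log~10~ P and the offset-33 ASCII encoding used in Field 4 of FASTQ files can be sketched in a few lines of Python. This is an illustrative example only: the function names and the `PHRED_OFFSET` constant are labels chosen here, and the offset of 33 follows the procedure given under **ASCII**.

```python
import math

# Offset between an ASCII code and the Phred score it encodes (assumed 33,
# as described in the ASCII entry of this glossary).
PHRED_OFFSET = 33


def char_to_q(char):
    """Convert one FASTQ quality character to its Phred (Q) score."""
    return ord(char) - PHRED_OFFSET


def q_to_error_probability(q):
    """Invert Q = -10 * log10(P): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_q(p):
    """Apply Q = -10 * log10(P) to an error probability P."""
    return -10 * math.log10(p)


def decode_quality_string(quality):
    """Convert a whole Field 4 string into a list of Q scores, one per base."""
    return [char_to_q(c) for c in quality]


if __name__ == "__main__":
    q = char_to_q("A")
    accuracy = 1 - q_to_error_probability(q)
    print(f"'A' encodes Q{q}, i.e. about {accuracy:.4%} base-call accuracy")
```

For example, `char_to_q("A")` gives 32, matching the worked example in the **ASCII** entry, and `q_to_error_probability(30)` gives an error probability of about 0.001, i.e. 99.9% accuracy.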
## R

**rapid kit**

**rbcL** chloroplast gene that codes for **r**ibulose 1,5-**b**isphosphate **c**arboxylase (**L**arge subunit), often used to identify plants via DNA barcoding.

**Read .fast5 files** A .fast5 file is a type of HDF5 file, designed to contain all the information needed for analysing nanopore sequencing data and tracking it back to its source. Read .fast5 files contain the raw sequencing data for each read, with a default of 4000 reads per file.

**ribosome, 16S** see 16S

## S

**sequencing adapter**

**16S (ribosome 16S)** term loosely applied by taxonomists to the ***gene*** (DNA sequence) that codes for the 16S rRNA in prokaryotes, often used to identify bacteria via DNA barcoding.

**Svedberg units (S)** a measure of the rate of sedimentation in centrifugation rather than of size. Note: this is why subunit values do not add up: for example, bacterial 70S ribosomes are made of 50S and 30S subunits. The 30S ribosomal subunit (NOT the gene that codes for it) is composed of a ***16S subunit of rRNA*** and approximately 21 proteins.

## T

**tagmentation** the initial step in library prep, in which unfragmented DNA is cleaved and tagged for analysis

## U

## V

## W

**WIMP (What's In My Pot)** ONT cloud software for real-time taxonomic identification

## X

## Y

## Z