# Sequence a Genome!

## Final survey

https://forms.gle/S5p5WzF1HSBrZ9qn9

## Shared spaces

- Google folder: https://drive.google.com/drive/folders/1wzaCtOkptmouPzVvIBmf0kgCDqxpXH4v?usp=sharing

## Platforms

- CyVerse Atmosphere: https://atmo.cyverse.org

## Learning resources

- CyVerse [link](https://learning.cyverse.org)
- Genomics data carpentry: https://datacarpentry.org/lessons/#genomics-workshop

**General Coding**

- CodeCademy: [link](https://www.codecademy.com/)
- Hour of code (also in languages other than English): [link](https://code.org/learn)

**Software installations**

Be sure you have permission to install software

- Try Ubuntu: [link](https://tutorials.ubuntu.com/tutorial/try-ubuntu-before-you-install#0)
- Python: [link](https://www.python.org/downloads/)
- Jupyter: [link](https://jupyter.org/)
- Wing IDE: [link](https://wingware.com/)
- Atom text editor: [link](https://atom.io/)

**Bioinformatics**

- Learn bioinformatics in 100 hours: [link](https://www.biostarhandbook.com/edu/course/1/)
- Rosalind bioinformatics: [link](http://rosalind.info/about/)
- Bioinformatics coursera: [link](https://www.coursera.org/learn/bioinformatics)
- Bioinformatics careers: [link](https://www.iscb.org/bioinformatics-resources-for-high-schools/careers-in-bioinformatics)

**Help**

- General software help: [link](https://stackoverflow.com/)
- Bioinformatics-specific software help: [link](https://www.biostars.org/)

## Laboratory

- Mitochondrial sequencing lab: https://jasonjwilliamsny.github.io/stars-2021/_includes/laboratories/mitochondrial_sequencing/

## Software

**Software Utilities**

- PuTTY for Windows: https://the.earth.li/~sgtatham/putty/latest/w64/putty.exe
- Install Docker on Ubuntu: https://docs.docker.com/engine/install/ubuntu/
- Miniconda: https://docs.conda.io/en/latest/miniconda.html#linux-installers
- Bioconda: https://bioconda.github.io/user/install.html
- Samtools manual: http://www.htslib.org/doc/
- IGV: https://software.broadinstitute.org/software/igv/
- Filezilla (client): https://filezilla-project.org/

**Nanopore**

- Poretools: Toolkit for working with nanopore sequencing data from Oxford Nanopore - https://poretools.readthedocs.io/en/latest/
- NanoPlot: Plotting tool for long read sequencing data and alignments - https://github.com/wdecoster/NanoPlot
- NanoFilt: Filtering on quality and/or read length, and optional trimming after passing filters - https://github.com/wdecoster/nanofilt
- NanoStat: Calculate various statistics from a long read sequencing dataset - https://github.com/wdecoster/nanostat
- NanoReviser: https://github.com/pkubioinformatics/NanoReviser
- NECAT Nanopore data assembler: https://github.com/xiaochuanle/NECAT

**Alignment**

- Graphmap: https://github.com/isovic/graphmap
- Graphmap (conda): https://bioconda.github.io/recipes/graphmap/README.html?highlight=graphmap#package-package%20'graphmap'
- Graphaligner (conda): https://bioconda.github.io/recipes/graphaligner/README.html?highlight=graphaligner#package-package%20'graphaligner'

**Genome assembly**

- Redbean: https://github.com/ruanjue/wtdbg2
- Flye: https://bioconda.github.io/recipes/flye/README.html?highlight=flye#package-package%20'flye'
- CANU (conda): https://bioconda.github.io/recipes/canu/README.html?highlight=canu#package-package%20'canu'
- MitoVGP: VGP Mitochondrial assembly pipeline https://github.com/gf777/mitoVGP

**Assembly Quality**

- QUAST: https://github.com/ablab/quast
- BUSCO: gene content of near-universal single-copy orthologs - https://busco.ezlab.org/
- BUSCO (conda): https://bioconda.github.io/recipes/busco/README.html?highlight=busco#package-package%20'busco'
- SQUAT: https://github.com/luke831215/SQUAT
- GenomeQC: https://github.com/HuffordLab/GenomeQC

**Jellybeans**

- Unzip gzip file
  ```
  gunzip -d INFILE.gz
  ```
- Compute checksum
  ```
  md5sum
  ```
- Install Redbean missing zlib
  ```
  sudo apt-get install libz-dev
  ```
- FastQ to Fasta using sed
  ```
  sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
  ```
- Samtools
install curses lib
  ```
  sudo apt-get install libncurses-dev
  ```

**Learning**

- Twelve quick steps for genome assembly and annotation in the classroom (paper): https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008325

## Tasks

- Align mt data to reference: GraphMap for ONT reads, samtools for BAM/BAI, visualize in IGV

---

## Important links

- Human genome at NCBI: [link](https://www.ncbi.nlm.nih.gov/genome/guide/human/)
- Markdown cheatsheet: [link](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)
- AllofUs: https://allofus.nih.gov/

## DNAi Videos

- Sequencing project animation [link](https://youtu.be/-gVh3z6MwdU)
- Beginnings of the Human Genome Project at the Cold Spring Harbor Laboratory, James Watson [link](https://dnalc.cshl.edu/view/15445-Beginnings-of-the-Human-Genome-Project-at-the-Cold-Spring-Harbor-Laboratory-James-Watson.html)
- Importance of genetic maps, Mary-Claire King [link](https://dnalc.cshl.edu/view/15128-Importance-of-genetic-maps-Mary-Claire-King.html)
- Compiling the data from the Human Genome Project, Jim Kent [link](https://dnalc.cshl.edu/view/15305-Compiling-the-data-from-the-Human-Genome-Project-Jim-Kent.html)
- Using computers to predict how genes within the human genome, Craig Venter [link](https://dnalc.cshl.edu/view/15358-Using-computers-to-predict-how-genes-within-the-human-genome-Craig-Venter.html)
- Finding genes in the human genome, Ewan Birney [link](https://dnalc.cshl.edu/view/15291-Finding-genes-in-the-human-genome-Ewan-Birney.html)
- The public Human Genome Project's DNA donors, Eric Lander [link](https://dnalc.cshl.edu/view/15327-The-public-Human-Genome-Project-s-DNA-donors-Eric-Lander.html)
- The first draft of the human genome, Ari Patrinos [link](https://dnalc.cshl.edu/view/15343-The-first-draft-of-the-human-genome-Ari-Patrinos.html)
- Relating a gene to a sequence of amino acids, Sydney Brenner
[link](https://dnalc.cshl.edu/view/15279-Relating-a-gene-to-a-sequence-of-amino-acids-Sydney-Brenner.html)

---

## Glossary of terms

- **Annotation:** Labeling of DNA with information on its function
- **Genome:** Sum total of all DNA in an organism
- **S. polyrhiza:** Target species of duckweed (See also: [Wikipedia](https://en.wikipedia.org/wiki/Spirodela_polyrhiza))
- **Reference genome:** Consensus genome referred to by the scientific community as the authority; frequently updated
- **Nature vs. nurture:** Inherited traits vs. outside environmental influences
- **Genome browser:** A reference tool to view the annotation and sequence of a genome
- **Protein-coding gene (exons):** Genes that code for proteins; their coding portions (exons) make up a small minority of the genome
- **Introns:** Segments of a gene (and of its RNA transcript) that do not code for protein; they are discarded during splicing
- **SNPs:** Single-nucleotide polymorphisms (polymorphism = "many shapes")
- **kb:** kilobases (thousand), **Mb:** megabases (million), **Gb:** gigabases (billion)
- **Pipeline:** The ordered series of processes that data is passed through
- **Transposons:** Jumping genes that can copy and paste themselves into different genomic locations
- **RepeatMasker:** Software that masks (ignores) DNA sequences identified as repeats or jumping elements, to save time in later analysis
- **De novo gene prediction:** "Out of nothing"; these methods don't take anything (e.g.
gene expression) other than the gene sequence into account
- **Chelex**: Chelating resin used in DNA extraction; it protects DNA by binding metal ions that DNA-degrading enzymes need
- **PCR**: Process used to amplify a sequence of DNA
- **Open source**: A classification of software whose code people can freely view, use, and adapt
- **Circos diagram**: Circular genome plot in which each segment of the ring represents a chromosome
- **Sequencing library**: A pool of DNA fragments with adapters attached; libraries are prepared via either *ligation*, where DNA is fragmented and adapters are attached afterwards, or *tagmentation*, where fragmentation and adapter attachment happen in one step
- **Sanger Sequencing**: Older method of DNA sequencing in which electrophoresis is used to acquire short DNA sequences
- **Personalized medicine**: The practice of customizing treatments to patients based on their sequenced genome
- **GWAS**: Genome-wide association study; find a population that has a trait (with a genetic component) and a population that does not, and then sequence their genomes to identify genetic variations that are associated with the trait
- **Shotgun sequencing**: Method of DNA sequencing where the genome is first sequenced in small, random fragments which are then assembled into the full genome sequence
- **Flow cell**: In sequencing, the consumable chamber that the prepared library flows through and where the reads are generated; a nanopore flow cell holds the array of nanopores
- **HMW**: High Molecular Weight
- **k-mer:** Short sequence of DNA of length k ('ACTG' is a 4-mer). In de Bruijn construction, k-mers are the small pieces that the reads are broken down into for assembly. k-mer length is an adjustable parameter during assembly
- **Contig**: Contiguous stretch of assembled DNA sequence; contigs often range from a few thousand to tens of thousands of bases, instead of millions. They arise because sequencing the full 23 chromosomes in one piece is nearly impossible; instead we get these contig chunks first
- **N50**: Metric of contig size (e.g.
N50 size = 30 kbp) where half of your assembly is contained in contigs of this size or larger. Having an N50 size of 30 kbp is better than 3 kbp (bigger means a more contiguous assembly).
- **Unitig**: A high-confidence contig built only from reads whose overlaps are unambiguous; joining several contigs together (with gaps) gives a scaffold
- **Metagenomics**: The study of the collective genomes of microorganisms from an environmental sample (e.g. a microbiome)
- **Structural events**: Can include insertion, transposition, and deletion of chunks of DNA in the genome. Structural events can change expression by altering elements and the number of copies of a gene, rather than through SNPs
- **Shell**: A shell is a computer program that presents a command line interface which allows you to control your computer using commands entered with a keyboard instead of controlling graphical user interfaces (GUIs) with a mouse/keyboard combination
- **FASTQ format**: Format for DNA sequence data that comes from high-throughput sequencing; 4 lines per read, including quality scores
- **FASTA format:** Format for DNA sequence data; 2 lines per record (header and sequence), without any quality scores
- **Read**: Inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment. In a FASTQ file each read takes four lines:
  - First line: Begins with @, information about the read
  - Second line: Actual DNA sequence
  - Third line: Begins with a +
  - Fourth line: Series of characters that represent the Phred quality score of each base
- **Phred quality score**: Measure of the quality of the identification of the nucleobases generated by automated DNA sequencing, i.e. an estimate of the uncertainty ([Wikipedia](https://en.wikipedia.org/wiki/Phred_quality_score))
  - Phred score of 10 means a 1/10 probability of an incorrect base, Phred score of 20 means 1/100, etc.
  - The larger the Phred score, the lower the probability of an incorrect base (better)
- **BioConda**: Conda channel that provides various bioinformatics software packages for your computer
- **FileZilla:** File-transfer client that connects your computer to a remote server
- **1x Sequencing:** Exactly as many sequenced bases as the length of the genome
- **FastQC:** Program that performs quality control checks on raw sequence data from high throughput sequencing
- **SAM**: File format for alignment results; it is plain text but hard to read by eye, and shows how the reads were aligned to the reference genome where they best matched
- **BAM**: Binary compressed version of a SAM file
- **Organoid**: Cells taken from outside of the body and grown as a model for experimentation, whose RNA expression can be sequenced to observe effects
- **Thought experiment:** Hypothetical situation which lays out a method of thinking; usually can never be practically done
- **Chlamydomonas**: Model organism genus of green algae (unicellular flagellates), found in stagnant water and damp soil
- **Overlap consensus**: Genome assembly method where reads are provided to the algorithm and overlapping regions are identified. Each read is graphed as a node and the overlaps are represented as edges joining the two nodes involved.
- **Biostars**: Help site/public forum for bioinformaticians using software
- **Stack Overflow**: A more general help site/public forum for many areas, including bioinformatics and Linux
- **Crosslinking**: DNA naturally has little loops in it because of bound protein. We can treat the DNA with chemicals or ultraviolet light to lock the loops in position; this is crosslinking. Then, using endonucleases (restriction enzymes), we cut the DNA into pieces. Then, we treat the DNA with another enzyme (ligase) so that DNA joins to other DNA in close proximity (like handcuffing a person to the physically closest person to determine information about the people).
We can use this proximity information to determine the DNA's location and assemble the contigs.
- **QUAST**: QUality ASsessment Tool; evaluates genome assemblies by computing various metrics
- **BUSCO**: Looks for universal single-copy orthologs. There are certain genes conserved in nearly every organism. Essentially, BUSCO tries to look for these ~320 genes in the assembly because they, theoretically, must be present.
- **Docker**: Docker creates a small virtual computer container within your computer so that you don't have to install complicated software.

___

### Command Line Interface

- **Input format - IMPORTANT!**: `$ command [argument]`
  - Commands: e.g. ls, wget, whoami
  - An argument either modifies how a command works or specifies what a command needs in order to work
    - 0 or more arguments
    - For example, wget's required argument is the web URL that it needs to actually obtain the file
  - There are always spaces between things; between command and arguments
  - Flags are options that begin with a dash and come before the other arguments, if present (e.g. -e)
- **Shell Prompt**: This is where the user inputs commands, marked by the $
  - Whatever is after the : is your location; :~$ indicates home
- **PATH**: There are two ways to specify the location of something
  - **Absolute Path:** Always begins with a forward slash /, a special location called the root directory which contains all other directories. Could go like /home/username
    - Unambiguous specific location. However, absolute paths can only be used in limited circumstances because not everyone names their files/folders the same way
  - **Relative Path:** Does not start with /. Could go like ~/someplace (start from home directory) or ../someplace (1 level back from wherever this place is, start here)
    - Relative to an exact location, be it home or another directory.
Could be used more generally, since the directions are uniform
- **COMMAND - whoami**: returns your username
- **~**: Symbol for home
- **COMMAND - pwd**: Print working directory; inputting this into the prompt returns your location (e.g. /home/)
- **Paste** - Right click, not Ctrl+V
- **COMMAND - ls**: view a list of the files and folders in a given directory
  - Adding the argument -F modifies the ls command, telling it to put a / after directories or folders to distinguish them
  - Adding the argument -aF modifies ls and also shows you hidden files beginning with "." (e.g. .hidden/)
  - ls [file name] lists that file if it exists (it does not show the file's contents)
  - ls [directory] tells you what is in the directory
  - Adding * to a search term returns any file whose name matches it. * is called the "wild card".
    - `*something*` returns any file that includes something
    - `something*` returns any file that begins with something
    - `*something` returns any file that ends with something (e.g. `*.fastq`)
  - ls -lh (l means long format, which provides extra info about the files; h means human-readable sizes)
- **COMMAND - echo**: Prints its arguments; useful as a dry run, because it previews what a wildcard pattern will expand to before you actually run a command on it
- **COMMAND - tar**: Can do several tasks including creating and extracting archives
  - Adding flags (tar -xvf) specifies extraction (x = extract, v = verbose, f = archive file)
- **COMMAND - up arrow**: Pressing the up arrow will bring up the last command entered so you don't have to type it over again
- **COMMAND - history**: Shows you all the commands that you have used thus far, numbered
  - Then, you can input "!number" to re-execute the command with that number
- **COMMAND - cd:** change directory
  - You need to specify the new location, e.g. "cd shell_data/". The prompt then shows the new directory you are in, :~/shell_data$.
  - Entering "cd" alone brings you back to the home directory
  - Entering "cd .." (with a space) brings you one level backwards (e.g.
:~/shell_data/untrimmed_fastq$ --> :~/shell_data$)
- **COMMAND - man**: manual; this brings up the manual page for the command given as man's argument
  - e.g. "man ls" shows the manual for the ls command
  - to leave the manual and return to the shell, press "q" to quit
- **^^ or <>**: means that the argument or option is required for the command to work
- **..** allows you to make commands work on higher (parent) levels
  - e.g. "ls .." views a list of the files and folders from one level back, "ls ../.." goes 2 back, etc.
  - "cd .." moves you to one directory back
- **COMMAND - head**: allows you to take a peek at the start of a file without opening it
  - Adding the flagged argument -n *k* brings you the first *k* lines
- **COMMAND - cp**: copy a file; requires the name of the file and also where it is to be copied
- **COMMAND - mkdir**: make directory; requires the new directory's name as an argument (e.g. mkdir backup)
- **COMMAND - mv**: move; requires source and destination arguments (e.g. `mv *copy* backup`)
- **COMMAND - rm**: remove file; requires the file name
  - To delete directories, the recursive flag is needed: rm -r [directory]
- **COMMAND - curl**: downloads data from the Internet
  - curl -O [URL]
- **.sh**: shell script
  - Conda software, specifically, is installed using a shell script
  - Can save a series of directions into one action
- **COMMAND - iget**: gets data from the CyVerse Data Store (iRODS) onto your machine
- **.** : as the final argument, tells the command to place its result in the current location
- **COMMAND - md5sum**: mathematical algorithm that can be run on a file to generate a checksum signature that is, in practice, unique to the file
  - If the file is changed even in the slightest way, the checksum signature will be completely altered
  - This way, md5sum is a great way to determine which file is an authentic one when there are many to deal with
  - If we don't want the output of a command to come out in the shell, and instead go to an external file, we can use **>**
    - "md5sum
(something) > file.txt" sends the checksum of something to file.txt instead of showing it in the shell
- **COMMAND - cat**: concatenate; prints one or more files in order, and with > the combined output can be written to a new file
- **COMMAND - gunzip**: decompresses .gz files; -d specifies decompression, which is already gunzip's default
- **COMMAND - sed**: parses and transforms text; its behavior can be altered with a variety of arguments
- **COMMAND - samtools**: provides access to the samtools subcommands
  - samtools sort: sorts the alignments in genome order
  - samtools index: indexing a genome-sorted BAM file allows one to quickly extract alignments overlapping particular genomic regions
- **COMMAND - df**: The df command (short for disk free) is used to show the amount of free disk space available on Linux.
  - The -h argument prints sizes in human-readable units
  - Under "Mounted on", the / indicates the root directory (which contains all other directories and files)
- **COMMAND - wtdbg2**: This is the redbean assembly program.
  - -t is how many threads you want to use (for us, 8 CPUs)
  - -x is the sequencing-technology preset (we used ont)
  - -g is the size of the genome
- **COMMAND - sudo**: "super user do"; runs a command with administrator privileges. A system administrator can delegate this ability to certain users (or groups of users) for some (or all) commands.

---

## Daily notes

### Monday

*Tasks*

- Intro to genomics [✓]
- Whole [mtDNA PCR](https://jasonjwilliamsny.github.io/stars-2021/_includes/laboratories/mitochondrial_sequencing/) (using [NEB](https://www.neb.com/products/m0287-longamp-taq-2x-master-mix#Product%20Information) and [Takara](https://www.takarabio.com/products/pcr/long-range-pcr/la-taq-products/la-taq-pcr-kit) enzymes) [✓]
- Launched CyVerse Atmosphere instances and set up CyVerse accounts [✓]
- Recorded talk by [Prof.
Rob Martienssen](https://www.cshl.edu/research/faculty-staff/rob-martienssen/) on the genetic engineering of Lemnaceae [✓]

---

**Notes**

- After transcription, the spliceosome cuts up the pre-mRNA, retaining exons and removing introns
- OMIM provides information about specific genes
- Information in a genome browser is organized into "tracks"
- Access the FASTA record for a gene to see its DNA sequence
- Most of your DNA is non-coding and much of it is repetitive (e.g. telomeres), which makes assembling a genome more difficult
- 5' -> 3' is denoted + (plus). 3' -> 5' is denoted - (minus).
- To predict genes, scientists look for start and stop codons in logical positions for gene placement
- BLAST: compare predicted genes to a gene database

### Tuesday

*Tasks*

- Electrophoresis of mtDNA PCR [✓]
- Introduction to [Nanopore](https://nanoporetech.com/) sequencing [✓]
- Preparation of mtDNA nanopore library [✓]
- Live talk by Asst. Professor [Alex Harkess](https://www.hudsonalpha.org/faculty/alex-harkess/)
- [DNA extraction](https://www.promega.com/-/media/files/resources/protocols/technical-manuals/500/wizard-hmw-dna-extraction-kit-protocol.pdf?rev=9c37d6c3074042679b4611bbb436ae32&sc_lang=en) from S.polyrhiza [✓]
- [Genomics data carpentry](https://datacarpentry.org/lessons/#genomics-workshop) command line work [ ]
  - [Software installation](https://datacarpentry.org/genomics-workshop/setup.html)
  - [Shell lesson data](https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/shell_data.tar.gz)
  ```
  #download genomics DC data
  wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/shell_data.tar.gz
  #unzip tar file
  tar -xvf shell_data.tar.gz
  ```

---

**Notes**

- Nanopore sequencing technology uses its namesake nanopore protein, set in an artificial membrane, through which a DNA strand passes with the help of a tether molecule and a helicase motor protein.
The resulting current is analyzed and used to determine the sequence of the DNA passing through. Although cheap and efficient, it is less accurate compared to other methods. The sequence is "read" at around 400 bases per second by detecting the distinct current amplitude associated with each nucleotide. Each base disrupts the ionic current differently, yielding different characteristic amplitudes. Nanopores are being used to detect diseases in West African cassava plants, and can theoretically be used in locations such as airports or schools to detect viruses (such as COVID-19) rapidly in people or the air indoors.
- The specific strain of *S. polyrhiza* that we are using today is **SP6581**
- Contamination is a big issue in genome sequencing; to prevent it, fabric gauze is kept over the *S. polyrhiza* to maintain sterile conditions
- Constant daylight conditions until Saturday 4pm, when the plants were put into the dark
  - approximately 2 days of darkness, which depleted their sugar

### Wednesday

*Tasks*

- Michael Schatz video on Assembly [✓]
- Software installation to local laptops [✓]
  - IGV (Java Included): https://software.broadinstitute.org/software/igv/download
  - Filezilla (client): https://filezilla-project.org/
- Quantification/QC of S.polyrhiza DNA extraction [✓]
- Preparation of S.polyrhiza nanopore library [✓]
- [Genomics data carpentry](https://datacarpentry.org/lessons/#genomics-workshop) command line work [✓]
  - [Software installation](https://datacarpentry.org/genomics-workshop/setup.html)
  - [Shell lesson data](https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/shell_data.tar.gz)
  ```
  #download genomics DC data
  wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/shell_data.tar.gz
  #unzip tar file
  tar -xvf shell_data.tar.gz
  ```
- Analysis of nanopore mtDNA reads [ ]

---

**Notes**

- VIDEO NOTES:
  - Greedy Reconstruction: Method of assembling a genome by listing all fragments,
then matching their beginnings and ends
    - A repeated sequence results in an ambiguity in the next fragment, which makes reconstruction more complicated
  - de Bruijn construction: In bioinformatics, a mathematical process where you have different fragments of nucleotides, and find the best and simplest way to assemble the fragments without repeating
    - Smaller fragments enable more sensitive reconstruction, as longer fragments don't take potential mismatches into account
  - The bigger the N50 size, the better the resolution on gene location, etc.
  - Applications of DNA assembly:
    - Novel genomes (e.g. Genome 10K Project)
    - Metagenomes: assembling a genome in isolation (e.g. the Human Microbiome Project sequences the genomes of microbes in relation to human disease, function, etc.)
    - Sequencing assays (structural variations, transcript assembly)
- PuTTY (Windows): to connect, type in the IP address. Then Accept, then type in your CyVerse username and your password. If you mistype your password, press Ctrl+C to start over.
Make sure not to enter your password incorrectly more than once; the letters will not visibly show up, so be careful
- Shell usage is required for most bioinformatics programs; it also reduces manual error, so learning shell is important
- Installing from source
  - Compiling: make it from scratch
  - Dependencies make installing software difficult; for example, a calculator app depends on other code that provides the mathematical functions as well as the interface design
- Install binary
  - Precompiled; makes installing software easier
- Package Manager: finds and installs the premade software that your program needs for you
  - Bioconda, Conda
- Three key things you need for genome sequencing:
  - Coverage: at least 8x to theoretically cover all of the genome

---

**Commands**

$ = shell prompt that represents the beginning of a shell command

Structure: `$ command [zero or more arguments]`

- An argument modifies how a command works, or supplies something the command needs in order to work

Commands used:

- `$ clear`
- `$ whoami`
- `$ pwd`
- `$ wget [link w/ data]`
- `$ ls`
- `$ ls -F`
- `$ cd`
- `$ cd [first two letters of location]` (press tab)
- `$ cd untrimmed_fastq`
- `$ cd shell_data`
- `$ cd ..`
- `$ ls ../..`
- `$ ls -aF`
- `$ ls [location]`
- `$ touch file.txt`
- `$ ls *[term that is being searched for]`
- `$ ls /usr/bin/*.sh`
- `$ echo @.fastq`
- `$ echo *.fastq`
- `$ history`, then `![# that you want to repeat]`
- `$ head -n# [location/file]` (# = number of lines viewable)
- `$ cp [file] [location it should be copied to]`
- `$ mkdir [name]`
- remove = `$ rm [location]`
- `$ mv *[contains]* [destination]` (destination e.g. backup/); check with ls
- `$ curl -O [link]`
- `$ iinit`
- `$ ipwd`
- `$ pwd`
- `$ iget`
- Remove fully with `$ rm -r backup/`
- Dry run: `$ echo *copy* backup`
- The ".." represents one level above
- To access manuals on how/when to use each command, use the command `$ man [command name]`

Absolute path

- Always begins with / (a location called the root directory, which contains all other directories)
- Structure: /users/home/username, e.g. as printed by
`pwd`
- You can use autocomplete in `$ cd /home/_(tab)/__(tab)/.hidden/` for access, where _ represents the first letter(s) of the name or location

Relative path

- Structure: ~/someplace OR ../someplace
- Home directory is always tilde in Linux
- Autocomplete structure: `$ cd ~/shell_data/.hidden/`

Use explainshell.com to explain commands.

**S.polyrhiza DNA extraction results**

|Sample|DNA concentration|
|-|-|
|1|6.58 ng/ul|
|2|12.1 ng/ul|
|3|24.2 ng/ul|
|4|7.45 ng/ul|
|5|8.39 ng/ul|
|6|3.67 ng/ul|
|7|5.20 ng/ul|
|8|14.1 ng/ul|
|9|0.217 ng/ul|
|10|0.108 ng/ul|
|11|16.5 ng/ul|
|12|21.3 ng/ul|
|13|6.71 ng/ul|
|14|9.73 ng/ul|
|15|28.1 ng/ul|

**Software installations**

*Miniconda*

```
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
```

*fastqc*

```
conda install fastqc -c bioconda
```

*Setup iCommands*

```
$ iinit
One or more fields in your iRODS environment file (irods_environment.json) are missing; please enter them.
Enter the host name (DNS) of the server to connect to: data.cyverse.org
Enter the port number: 1247
Enter your irods user name: #your_cyverse_username
Enter your irods zone: iplant
Those values will be added to your environment file (for use by other iCommands) if the login succeeds.
Enter your current iRODS password: #your_cyverse_password
```

*Copy mt Dataset (subsample) from Data store to VM*

```
iget -rPVT /iplant/home/shared/cyverse_training/classrooms/dnalc_genome_camp/mitochondrial_sequencing/subsample .
```

### Thursday

*Tasks*

- Complete Michael Schatz video on Assembly [ ]
- Review of Fastqc report [ ]
  - Fastqc documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Analysis of nanopore mtDNA reads [ ]
- Guest speaker [Sarah Goodwin - CSHL genome center](https://www.linkedin.com/in/sara-goodwin-6694b267/)
- Installation of assembly tools
- [Chlamydomonas](https://en.wikipedia.org/wiki/Chlamydomonas) assembly [ ]
  - Data path
  ```
  # path: /iplant/home/shared/cyverse_training/classrooms/dnalc_genome_camp/sample_nanopore_reads/PRJDB11341_chlamydomonas_KOR1
  #iget command (get data and put it in current folder)
  # do this in `/scratch` or your disk will fill
  iget -rPV /iplant/home/shared/cyverse_training/classrooms/dnalc_genome_camp/sample_nanopore_reads/PRJDB11341_chlamydomonas_KOR1 .
  ```
- QC and Preparation of S.polyrhiza nanopore library [ ]

---

**Notes**

- **FASTQC RESULTS**
  - Per base sequence quality
    - Shows mean and median Phred quality for the bases as you go along the sequence
  - Per sequence quality score
    - See the distributions of the quality scores
  - Per base sequence content
    - See the ratio of A, T, C, G. This is usually considered for Illumina sequencing. Look out for major jumps or abnormalities
  - Per sequence GC content
  - Sequence length distribution
    - Distribution of the lengths of the sequences.
Make sure that the peak is where you expect it; in this case we expect lengths to be most frequent around the length of the PCR product
- The small plateau increases of coverage are the overlaps
- The coverage of the regions, while consistent within each region, differs greatly from one region to another because of the difference in PCR performance from sample to sample

**MT Analysis**

- 3 major steps
  - Get human mtDNA reference sequence
  - Align ONT reads to reference
  - Visualize
- Use GraphMap to align reads
  - https://github.com/lbcb-sci/graphmap2
  ```
  conda install graphmap -c bioconda
  ```
- Translate fastq to fasta
  - Unzip the fastq file
  ```
  gunzip -d class_mt_reads.fastq.gz
  ```
  - Translate from fastq to fasta
  ```
  sed -n '1~4s/^@/>/p;2~4p' class_mt_reads.fastq > class_mt_reads.fasta
  ```
- Align with graphmap
  ```
  graphmap2 align -r human_mitochondria_whole_NC_012920.1.fasta -d class_mt_reads.fasta -o class_mt_alignment.sam
  ```
- Install samtools
  - documentation: http://www.htslib.org/doc/samtools.html
  ```
  conda install samtools -c bioconda
  ```
- Convert alignment with samtools to bam file
  ```
  samtools view -S -b class_mt_alignment.sam > class_mt_alignment.bam
  ```
- Sort bam alignment with samtools
  ```
  samtools sort class_mt_alignment.bam -o class_mt_alignment.sam.sorted.bam
  ```
- Index the bam alignment file
  ```
  samtools index class_mt_alignment.sam.sorted.bam
  ```

---

*Get STARS mt sample*

```
iget -rP /iplant/home/shared/cyverse_training/classrooms/dnalc_genome_camp/mitochondrial_sequencing/stars_sample .
```

---

**Trial Chlamy assembly**

- Set up the data on the scratch space
  ```
  mkdir trial_assembly
  ```
- Import the dataset
  ```
  iget -rPV /iplant/home/shared/cyverse_training/classrooms/dnalc_genome_camp/sample_nanopore_reads .
  ```
- Install redbean:
  ```
  conda install wtdbg -c bioconda
  ```
- Run redbean
  ```
  # run redbean on the 12K read file
  wtdbg2 -x ont -t 8 -g 120m -i DRR277383_1_125K.fasta.gz -o Chlamy
  #Use the consenser to generate the fasta contigs
  wtpoa-cns -t 8 -i Chlamy.ctg.lay.gz -fo Chlamy.ctg.fa
  ```

### Friday

*Tasks*

- Complete Michael Schatz video on Assembly [ ]
- Installation of assembly QC tools
  - BUSCO
  - QUAST
- Complete [Chlamydomonas](https://en.wikipedia.org/wiki/Chlamydomonas) assembly [ ]
- Discuss follow up and next steps [ ]
- Survey [ ]

---

**Notes**

Task - make sure you have the datasets for today

```
# change to the scratch space
cd /scratch
#make the data directory if needed
mkdir -p trial_assembly
#make sure you have the dataset
iget -P /iplant/home/shared/cyverse_training/classrooms/dnalc_genome_camp/sample_nanopore_reads/PRJDB11341_chlamydomonas_KOR1/DRR277383_1_125K.fasta.gz .
```

*Prepare to QC*

```
# make a new directory
mkdir chlamy_12_5_assembly
#get precomputed data
iget -P /iplant/home/shared/cyverse_training/classrooms/dnalc_genome_camp/Chlamy.ctg.fa .
```

*Quast*

- Install Docker

```
#update ubuntu package manager
sudo apt-get update
# get dependencies
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg \
    lsb-release
#get key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
#setup repository
echo \
  "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
#update package list
sudo apt-get update
# install docker
sudo apt-get install docker-ce docker-ce-cli containerd.io
#test docker
sudo docker run hello-world
```

----

```
#Get quast container
sudo docker pull staphb/quast
#run container
sudo docker run -v /scratch/chlamy_12_5_assembly:/data staphb/quast python /quast-5.0.2/quast.py -o /data/quast_qc_chlamy -m 5000 -t 8 -l "redbean Chlamydomonas assembly" --no-snps Chlamy.ctg.fa
```

*BUSCO*

```
#get BUSCO container
sudo docker pull quay.io/biocontainers/busco:5.2.2--pyhdfd78af_0
#run container on a folder `assembly` (in the current directory) containing contigs
#docker needs an absolute host path here; a bare name would be treated as a named volume
sudo docker run -v "$(pwd)/assembly":/busco_wd quay.io/biocontainers/busco:5.2.2--pyhdfd78af_0 busco -i /busco_wd/Chlamy.ctg.fa -o busco_analysis -l chlorophyta_odb10 -m genome
```
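
As a quick cross-check on the QUAST report, the contig N50 described in the glossary can also be computed straight from the contig FASTA with standard shell tools. This is a minimal sketch, assuming a plain (uncompressed) FASTA file; the function name `n50` is a hypothetical helper and is not part of QUAST or redbean.

```shell
# n50: hypothetical helper (not part of QUAST or redbean).
# Prints the N50 contig length, in bases, of the FASTA file given as $1.
# Step 1 (first awk): print one total length per contig, since a sequence
#                     may be wrapped over many lines.
# Step 2 (sort):      order the lengths from largest to smallest.
# Step 3 (second awk): walk down the list until at least half of the total
#                     assembly length is covered, then print that contig length.
n50() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 running += lens[i]
                 if (running * 2 >= total) { print lens[i]; exit }
             }
         }'
}
```

For example, `n50 Chlamy.ctg.fa` prints a single number that can be compared with the N50 in the QUAST report. The two values may differ, because the QUAST run above uses `-m 5000` and so leaves shorter contigs out of its statistics.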
