Bioinformatics
Python
DNA
genome
fastqc
First generation sequencing (Sanger sequencing)
The Sanger sequencing procedure consists of preparing a shotgun library, isolating sequencing templates, performing sequencing reactions, and finally running capillary electrophoresis. This technology can sequence fragments up to 1 kilobase (kb) with 99.99% accuracy. The reads are then assembled to generate draft genomes. Compared with traditional microbial culturing analysis, Sanger sequencing provided useful additional information. However, it is technically laborious and time-consuming: sequencing the first bacterial genome, that of Haemophilus influenzae, took more than one year (Fleischmann et al. 1995). The method also has specific technical requirements that general laboratories could not meet, so the majority of bacterial genome sequencing projects were restricted to a few large sequencing laboratories. Another limitation is that de novo sequencing of whole genomes is slow and costly with Sanger methods. These limitations were overcome by more advanced sequencing technologies, which are discussed in the following section.
Second generation (next-generation sequencing)
Next-generation sequencing (NGS) is a more advanced sequencing technology than conventional Sanger sequencing. Its data throughput is 100-fold higher than that of Sanger sequencing (Pareek et al. 2011; Grada and Weinbrecht 2013). It is also referred to as high-throughput genome sequencing (HTGS) because of the huge amount of sequencing data it generates (Liu et al. 2012). Millions of DNA molecules are sequenced in parallel in a typical NGS reaction. NGS became commercially available in 2005 with the 454 (Roche) platform, which uses a pyrosequencing approach. A typical workflow for next-generation genome sequencing includes microbial sample collection, nucleic acid extraction, genomic DNA fragmentation, adapter addition, DNA library preparation, and sequencing, followed by data analysis (Fig. 8.1). The entire process of a typical NGS experiment, from harvesting the microbial colony to acquiring analyzable data, can take less than 60 hours, depending on the sequencing platform.
New NGS technologies have greatly accelerated the pace of microbial research, and the decreasing cost of the technology has made microbial genome sequencing far more widespread. Various commercial platforms are available for high-throughput next-generation sequencing, such as 454, Illumina, Ion Torrent, ABI SOLiD, etc. (Pareek et al. 2011; Liu et al. 2012; Grada and Weinbrecht 2013). Two main types of NGS platform are currently used, namely, short-read platforms (including Illumina and Ion Torrent) and long-read platforms (including Pacific Biosciences and Oxford Nanopore). Illumina platforms include the HiSeq 2000 and MiSeq, which perform ultra-high-throughput analysis. These commercial platforms differ in their output, read length, and coverage (Buermans and Dunnen 2014). The limitation of the short-read methods is their reduced read length and quality, though their high throughput compensates for the shorter reads. Each platform has its own advantages and limitations. The choice of platform for microbial genome sequencing depends on the requirements of the analysis, the throughput needed, and the desired applications.
Third generation sequencing
Second-generation sequencing had some technical issues, such as short read length (30–450 bases), errors arising from short read lengths, and laborious sample preparation methods for some platforms. To overcome these drawbacks, more advanced sequencing platforms have been developed. These are rapid and yield reads of up to 20 kb for each DNA molecule, and are known as third-generation sequencing technology (Coupland et al. 2012; Buermans and Dunnen 2014). There are broadly two categories of third-generation sequencing: the first is single-molecule real-time sequencing (Pacific Biosciences), and the second is nanopore sequencing (Oxford Nanopore). The methodology and working principle of these third-generation technologies differ from those of earlier sequencing methods.
During single-molecule real-time (SMRT) sequencing, a single molecule of DNA is observed per reaction in real time. The basic principle of nanopore sequencing, by contrast, is to measure changes in electrical properties as biomolecules such as DNA translocate through the pore, and then to use those electrical changes to identify each DNA base. Third-generation platforms such as PacBio and MinION can produce much longer reads, of several thousand base pairs, than the first- and second-generation sequencing technologies. These methods can be used for direct DNA and RNA sequencing, real-time data acquisition and analysis, long reads, and de novo assemblies of repeated sequences and complex regions, but at the cost of read quality.
Recent Advances in Microbial Genome Sequencing
The genome of most organisms (including humans) is too long to be sequenced as one continuous string. Using next-generation 'short-read' sequencing, DNA is broken into short fragments that are amplified (copied) and then sequenced to produce 'reads'. Bioinformatic techniques are then used to piece together the reads like a jigsaw, into a continuous genomic sequence.
Long-read sequencing (LRS) retrieves much longer sequencing reads (>10,000 base pairs) than widely used short-read sequencing (SRS) systems (75-300 base pairs). Some LRS platforms have produced sequence reads of 882,000 bp, with some user groups reporting reads of over 2,000,000 bp (2 megabases); however, read lengths of 10,000-100,000 bp are more common.
What is long read sequencing?
Advantages of SRS
SRS has the advantage of being inexpensive, accurate, and supported by a wide range of analytics tools.
Advantages of LRS
LRS can improve de novo assembly, mapping certainty, transcript isoform identification, and detection of structural variants. Because LRS can read through the entire length of an RNA transcript, it allows the specific isoform to be identified precisely.
Q= -10*log10(p)
where Q is the base quality and p is the probability that the base call is incorrect. The base quality is simply a log-scaled version of the base caller's estimate of the probability that the base was called incorrectly. p is estimated by the sequencing software.
Q | Probability of incorrect base call | Base call accuracy |
---|---|---|
10 | 1/10 | 90% |
20 | 1/100 | 99% |
30 | 1/1000 | 99.9% |
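The rows of the table follow directly from the formula. As a minimal sketch (the function names are illustrative, not taken from any sequencing library), the conversion in both directions could be written as:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(p) to get p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(p); p must lie in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


# Reproduce the three rows of the table above.
for q in (10, 20, 30):
    p = phred_to_error_prob(q)
    print(f"Q={q}: p={p:g}, accuracy={(1 - p) * 100:g}%")
```

Every increase of 10 in Q divides the error probability by 10, which is why Q30 is commonly read as "one error in a thousand".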
the lambda phage reference genome
The output of the sequencing machine is in FASTQ format, a text-based format that stores both the short nucleotide sequence reads and their associated quality scores as ASCII text. The image below is an example of the data in FASTQ format.
where line1 contains the name of the read. This name might encode some information about the experiment that it came from, such as the kind of sequencing machine that was used or where on the slide this particular cluster was located.
line2 is the sequence of bases as reported by the base caller.
line3 is simply a placeholder and can be ignored.
line4 is a sequence of base quality values, which are ASCII-encoded versions of Q = -10*log10(p). Characters are used to encode integers. For example, the character > has an ASCII value of 62, which under Phred+33 corresponds to Q = 62 - 33 = 29. See ASCII Table. The base quality score Q is converted to an ASCII character by Phred+33 encoding, which rounds Q to the nearest integer, adds 33 to it, and then turns the result into the corresponding character.
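The four-line layout and the Phred+33 encoding can be illustrated with a short sketch. The record below is invented for illustration and the function names are hypothetical; the sketch assumes each record occupies exactly four lines and that the name line starts with `@`, as is standard for FASTQ.

```python
PHRED_OFFSET = 33  # Phred+33 encoding


def encode_quality(q_scores):
    """Round each Q to an integer, add 33, and map to the matching character."""
    return "".join(chr(round(q) + PHRED_OFFSET) for q in q_scores)


def decode_quality(quality_string):
    """Map each character back to its integer Q (ASCII code minus 33)."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (name, sequence, qualities)."""
    name, sequence, _placeholder, quality = (line.rstrip("\r\n") for line in lines)
    if not name.startswith("@"):
        raise ValueError("record name line should start with '@'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return name[1:], sequence, decode_quality(quality)


# Hypothetical record, made up for illustration only.
record = ["@read1", "GATTACA", "+", "IIII>>5"]
name, seq, quals = parse_fastq_record(record)
print(name, seq, quals)
```

Here `I` (ASCII 73) decodes to Q40, `>` (ASCII 62) to Q29, and `5` (ASCII 53) to Q20.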
Per base sequence quality of h37_1.fastq
Per base sequence quality of h37_2.fastq.gz
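FastQC's per base sequence quality plot summarises the distribution of quality scores at each read position. A rough sketch of the underlying idea, computing only the mean Q per position, is shown below. It assumes Phred+33 encoding and strict four-line records, the helper names are hypothetical, and it is not how FastQC itself is implemented.

```python
import gzip
import io


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def per_base_mean_quality(handle, offset=33):
    """Return the mean Phred score at each read position."""
    totals, counts = [], []
    for line_number, line in enumerate(handle):
        if line_number % 4 != 3:  # the quality string is the 4th line of a record
            continue
        for pos, ch in enumerate(line.rstrip("\r\n")):
            if pos == len(totals):  # first read long enough to reach this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Hypothetical in-memory data standing in for a real file.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\n5I>\n")
means = per_base_mean_quality(demo)
print(means)
```

For a real file, the same function could be called as `with open_fastq("h37_2.fastq.gz") as fh: per_base_mean_quality(fh)`.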
The following error disappeared after installing openjdk-8-jre:
Exception in thread "Thread-1" java.awt.AWTError: Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper
    at java.awt.Toolkit.loadAssistiveTechnologies(Toolkit.java:807)
    at java.awt.Toolkit.getDefaultToolkit(Toolkit.java:886)
    at sun.swing.SwingUtilities2.getSystemMnemonicKeyMask(SwingUtilities2.java:2032)
    at javax.swing.plaf.basic.BasicLookAndFeel.initComponentDefaults(BasicLookAndFeel.java:1158)
    at javax.swing.plaf.metal.MetalLookAndFeel.initComponentDefaults(MetalLookAndFeel.java:431)
    at javax.swing.plaf.basic.BasicLookAndFeel.getDefaults(BasicLookAndFeel.java:148)
    at javax.swing.plaf.metal.MetalLookAndFeel.getDefaults(MetalLookAndFeel.java:1577)
    at javax.swing.UIManager.setLookAndFeel(UIManager.java:539)
    at javax.swing.UIManager.setLookAndFeel(UIManager.java:579)
    at javax.swing.UIManager.initializeDefaultLAF(UIManager.java:1349)
    at javax.swing.UIManager.initialize(UIManager.java:1459)
    at javax.swing.UIManager.maybeInitialize(UIManager.java:1426)
    at javax.swing.UIManager.getUI(UIManager.java:1006)
    at javax.swing.JPanel.updateUI(JPanel.java:126)
    at javax.swing.JPanel.<init>(JPanel.java:86)
    at javax.swing.JPanel.<init>(JPanel.java:109)
    at javax.swing.JPanel.<init>(JPanel.java:117)
    at uk.ac.babraham.FastQC.Graphs.QualityBoxPlot.<init>(QualityBoxPlot.java:49)
    at uk.ac.babraham.FastQC.Modules.PerBaseQualityScores.getResultsPanel(PerBaseQualityScores.java:53)
    at uk.ac.babraham.FastQC.Modules.AbstractQCModule.writeDefaultImage(AbstractQCModule.java:62)
    at uk.ac.babraham.FastQC.Modules.PerBaseQualityScores.makeReport(PerBaseQualityScores.java:199)
    at uk.ac.babraham.FastQC.Report.HTMLReportArchive.<init>(HTMLReportArchive.java:131)
    at uk.ac.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:185)
    at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:123)
    at java.lang.Thread.run(Thread.java:748)
How to calculate time elapsed in bash script?
De novo assembly applies when the sequencing data come from a genome that has never been sequenced before. What kind of computational method is most appropriate in that case?
De-novo vs. mapping assembly
In sequence assembly, two different types can be distinguished:
de-novo: assembling short reads to create full-length (sometimes novel) sequences, without using a template (see de novo sequence assemblers, de novo transcriptome assembly)
mapping: assembling reads against an existing backbone sequence, building a sequence that is similar but not necessarily identical to the backbone sequence
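The difference between the two approaches can be sketched with a toy example. The code below is a deliberately simplified illustration under stated assumptions: error-free reads, exact matches only, a greedy merge strategy, and made-up sequences. Real assemblers and mappers use overlap or de Bruijn graphs and indexed, mismatch-tolerant alignment instead.

```python
def overlap(a, b, min_len=3):
    """Longest suffix of a that equals a prefix of b (at least min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Toy de novo assembly: repeatedly merge the pair with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:  # no remaining overlaps: leave the contigs separate
            break
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs


def map_reads(reads, reference):
    """Toy mapping: exact-match start position of each read on the reference, or -1."""
    return {read: reference.find(read) for read in reads}


# Hypothetical reference and reads, made up for illustration only.
reference = "ATGCGTACGTTAG"
reads = ["ATGCGTA", "GTACGTT", "CGTTAG"]

print(greedy_assemble(reads))       # de novo: no reference needed
print(map_reads(reads, reference))  # mapping: positions on the backbone
```

The de novo function never sees `reference`; it reconstructs a sequence from the overlaps between reads alone, whereas the mapping function only reports where each read sits on the existing backbone.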