Try   HackMD
tags: Bioinformatics Python DNA genome fastqc

Analysis for sequenced nucleic acid (DNA, RNA)


First, second and third generation sequencing technologies

First generation sequencing (Sanger sequencing). The systematic procedure of Sanger sequencing includes preparation of shotgun library, isolating sequencing templates, performing sequencing reactions, and finally performing capillary electrophoresis. This technology can sequence fragments up to 1 Kilo-base pairs (Kb) with 99.99% accuracy. The output of the method is assembled to generate draft genomes. Sanger sequencing was advantageous and used to provide useful information in comparison to traditional microbial culturing analysis. Earlier, the first bacterial genome of Haemophilus influenzae by Sanger sequencing took more than 1 year (Fleischmann et al. 1995). Sanger sequencing is a technically laborious and time-taking procedure. Further, there were specific technical requirements for Sanger sequencing methods. Therefore, it is not possible for general laboratories to perform these sequencing reactions. Thus, majority of bacterial genome sequencing projects were restricted to a few large sequencing laboratories. Another limitation is that de novo sequencing is not possible using Sanger sequencing methods. The limitations of conventional Sanger sequencing were overcome by more advanced sequencing technologies, which have been discussed in the following section.

Second generation (Next-generation sequencing) The next-generation sequencing (NGS) is a more advanced sequencing technology compared to conventional Sanger sequencing. The data throughput is 100-fold higher compared to Sanger sequencing (Pareek et al. 2011; Grada and Weinbrecht 2013). It is also referred as high-throughput genome sequencing (HTGS) because of such huge amount of sequencing data generated (Liu et al. 2012). Millions of DNA molecules are sequenced together in parallel in a typical NGS reaction. NGS was introduced in year 2000 using 454 (Roche) with pyrosequencing approach. A typical workflow for next-generation genome sequencing includes steps like microbial sample collection, nucleic acid extraction, genomic DNA fragmentation, adapter addition, DNA Library preparation, and sequencing followed by data analysis (Fig. 8.1). The entire process of a typical NGS experiment from the microbial colony harvest to acquisition of analyzable data takes less than 60 hours depending on the sequencing platform.

There has been tremendous increase in the pace of microbial research with new advanced NGS technologies. Further, the decreasing cost of the technology has propagated microbial genome sequencing tremendously. There are various commercial platforms available to perform high-throughput next-generation sequencing like 454, Illumina, Ion Torrent, ABI SOLiD, etc (Pareek et al. 2011; Liu et al. 2012; Grada and Weinbrecht 2013). Two main NGS platform methods are currently used, namely, short read platforms (including Illumina and Ion Torrent). Illumina platforms have HiSeq2000 and MiSeq that perform an ultra-high-throughput analysis. These commercial platforms have difference in their output, read length, and coverage (Buermans and Dunnen, 2014). The limitation of these sequencing methods is its reduced sequence length and quality, though high throughputs balance for its reduced lengths. Each platform has its own advantages and limitations. The choice of use of these platforms for microbial genome sequencing depends on the requirements of the analysis, throughput, and desired applications.

Third generation sequencing There were some technical issues with second generation of sequencing like short read length (30–450 bases), errors due to short read lengths, and laborious sample preparation methods for some platforms. To overcome these drawbacks, more advanced sequencing platforms have been developed. These are rapid and yield reads up to 20 Kb for each DNA molecule, called as third-generation sequencing technology (Coupland et al. 2012; Buermans and Dunnen 2014). There are broadly two categories; first is single-molecule real-time sequencing (Pacific Biosciences), and second type is nanopore sequencing (Oxford Nanopore) under third-generation sequencing technology. The methodology and working principle of these third-generation sequencings is different from earlier sequencing methods.

During single-molecule real-time (SMRT) sequencing, single-molecule of DNA is detected per reaction during real time, whereas the basic principle of nanopore sequencing is to measure changes in electrical properties as biomolecules such as DNA translocate through the pore and then to use electrical changes to identify the exact DNA base. These advanced third-generation sequencing technologies such as PacBio and MinION can produce much longer reads of several thousand base pairs compared with the first- and second-generation sequencing technologies. These third-generation sequencing methods can be used for direct DNA and RNA sequencing, real-time data acquisition and analysis, long reads, and de novo assemblies of repeated sequences and complex regions but at the cost of read quality. Recent Advances in Microbial Genome Sequencing


Short read sequencing (SRS) and long read sequencing (LRS)

The genome of most organisms (including humans) is too long to be sequenced as one continuous string. Using next-generation 'short-read' sequencing, DNA is broken into short fragments that are amplified (copied) and then sequenced to produce 'reads'. Bioinformatic techniques are then used to piece together the reads like a jigsaw, into a continuous genomic sequence.

LRS allows for the retrieval of much longer (>10,000 base pairs) sequencing reads than widely-used SRS systems (75-300 base pairs). Some long-read sequencing (LRS) platforms have produced sequence reads of 882000 bp, with some user groups reporting reads of over 2,000,000bp (2 megabase); however, read lengths of 10,000-100,000bp are more common. What is long read sequencing?

Advantages of SRS SRS has the advantage of being inexpensive, accurate, and supported by a wide range of analytics tools.

Advantages of LRS LRS can improve de novo assembly, mapping certainty, transcript isoform identification, and detection of structural variants. Because LRS can read through the entire lengths of RNA transcripts, allowing for precisely identifying the specific isoform.


Sequencing errors and base qualities

Q= -10*log10(p) where Q is the base quality, and the p is the probability that the base call is incorrect. The base quality is simply the base caller's estimate of the probability that the base was called incorrectly. p is estimated by the sequencing software.

Q probability of incorrect base call Base Call Accuracy
10 1/10 90%
20 1/100 99%
30 1/1000 99.9%
> p=1/1000
> Q=-10*log10(p)
> Q
[1] 30

Read a DNA reference genome in the FASTA format

the lambda phage reference genome

def readGenome(filename):
    '''Read a file in a FASTA format'''
    genome = ''
    with open(filename, 'r') as f:
        # Read the file line by line
        for line in f:
            # ignore header line with genome information
            if not line[0] == '>':
                genome += line.rstrip() # rstrip() removes trailing white space, tab or new line from the end of the string.
    return genome

genome= readGenome('D:/googleDrive/Coursera_Algorithms-for-DNA-Sequencing/data/lambda_virus.fa')
genome[:100]

len(genome) # The length of the genome is 48502 bases long

#---------------------------------------------
# Count the number of occurences of each base
#---------------------------------------------
# Method 1
# Initialise the frequencies of each base to zeros
counts = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
# Loop thru each base in the genome
for base in genome:
    # Increment the key value by 1
    #counts[base] += 1
    counts[base]=counts[base] + 1
print(counts)

# Method 2: using the collections module
# Count the number of occurences of each base using the collections module
import collections
collections.Counter(genome) # the Counter() function will do the same thing as the function above

Sequencing reads in FASTQ format

The output of the genome sequencer machine is in FASTQ format – text-based format storing both the short nucleotide sequence reads and their associated quality scores in ASCII format. The image below is an example of the data in FASTQ format.

FASTQ

where line1 contains the name of the read. This name might encode some information about the experiment that it came from, maybe the kind of sequencing machine that was used, maybe where on the slide this particular cluster was located.

line2 is the sequence of bases as reported by the base caller

line3 is simply a placeholder and can be ignored

line4 is a sequence of base quality values, which are ASCII-encoded version of Q= -10*log10(p). Characters are used to encode integers. For example, > has an integer value of 62 (Q). See ASCII Table. The base quality score Q is converted to ASCII characters by Phred 33, which rounds the Q off to the nearest integer, and then adds 33 to it, and then turns it into the corresponding character.

def QtoPhred33(Q):
    """ Convert base quality Q to Phred+33 ASCII-encoded character """
    return chr(Q+33) # chr() converts character to integer according to ASCII table

QtoPhred33(29)
'>'

def phred33ToQ(qual):
    """Convert Phred+33 ASCII-encoded character to Q"""
    return ord(qual) - 33 # ord() converts integer to character according to ASCII table

phred33ToQ('>')
29
def readFastq(filename):
    """Read a fastq file into two lists sequences, qualities"""
    sequences = []
    qualities = []
    with open(filename) as fh:
        while True:
            fh.readline() # skip the first line, name of the read
            seq = fh.readline().rstrip() # read the 2nd line, base sequences
            fh.readline() # skip the 3rd line, placeholder line
            qual = fh.readline().rstrip() # read the 4th line, base quality line
            # If reaches the end of file, stop the while loop
            if len(seq) == 0: 
                break
            # If that is not the case, append seq to the sequences list, append qual to the qualities list 
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities

seqs, quals = readFastq('D:/googleDrive/Coursera_Algorithms-for-DNA-Sequencing/data/SRR835775_1.first1000.fastq')

# Check the first few elements in the seqs list
print(seqs[:3])

['TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCCTAACCCTAACCCTAACCGTATCCGTCACCCTAACCCTAAC', 'TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC', 'TAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG']
def createHist(qualityStrings):
    """Count the frequencies of quality scores 0 to 50 and save the frequencies in a list hist"""
    # Create a list hist with 50 initial values of zero, assuming the highest Q is < 50
    hist = [0]*50
    for read in qualityStrings:
        for phred in read:
            q = phred33ToQ(phred)
            #hist[q] += 1
            # Increment the initial value by q (e.g., if q=2 then hist[q] is incremented to 1)
            hist[q] = hist[q]+1
    return hist

h = createHist(quals)
print(h)

#[0, 0, 6178, 0, 0, 54, 108, 574, 345, 83, 193, 124, 79, 165, 49, 236, 184, 327, 514, 238, 531, 254, 313, 798, 992, 888, 1396, 1488, 993, 1752, 3387, 4487, 3248, 5476, 8375, 11814, 4243, 7827, 6579, 8179, 9349, 8180, 0, 0, 0, 0, 0, 0, 0, 0]

# Make a quality score bar chart
# This is used only in ipython to place a plot in the same cell
#%matplotlib inline 
import matplotlib.pyplot as plt
plt.bar(range(len(h)), h)
plt.show()

def findGCByPos(reads):
    ''' Calculate proportion of bases that are G or C in each read '''
    # Keep track of the number of G/C bases and the total number of bases at each position
    gc = [0] * 100 # G/C bases
    totals = [0] * 100 # total number of bases
    # Loop thru each base in each read and populate the two lists gc, totals
    for read in reads:
        for i in range(len(read)):
            if read[i] == 'C' or read[i] == 'G':
                # Increment the gc array at index i
                gc[i]=gc[i]+1
            # Regardless of the base, increment the total 
            totals[i]=totals[i]+1
    # Divide G/C counts by total counts to get the average at each position
    for i in range(len(gc)):
        if totals[i] > 0:
            gc[i] = gc[i]/float(totals[i])
    return gc

gc = findGCByPos(seqs)

Checking quality of sequenced reads (fastq, fastq.gz)

# Install java runtime environment
sudo apt-get install openjdk-8-jre

# Install java jdk
sudo apt-get install openjdk-8-jdk

# Update app
sudo apt-get update

# Check java installation
java -version
#openjdk version "1.8.0_242"
#OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08)
#OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

# Run fastqc on h37_1.fastq
fastqc h37_1.fastq --extract --outdir=/media/sf_bioinformatics/RNA-sequences_mantid-horsehairWorm/output_fastQC 

Started analysis of h37_1.fastq
Approx 5% complete for h37_1.fastq
Approx 10% complete for h37_1.fastq
Approx 15% complete for h37_1.fastq
Approx 20% complete for h37_1.fastq
Approx 25% complete for h37_1.fastq
Approx 30% complete for h37_1.fastq
Approx 35% complete for h37_1.fastq
Approx 40% complete for h37_1.fastq
Approx 45% complete for h37_1.fastq
Approx 50% complete for h37_1.fastq
Approx 55% complete for h37_1.fastq
Approx 60% complete for h37_1.fastq
Approx 65% complete for h37_1.fastq
Approx 70% complete for h37_1.fastq
Approx 75% complete for h37_1.fastq
Approx 80% complete for h37_1.fastq
Approx 85% complete for h37_1.fastq
Approx 90% complete for h37_1.fastq
Approx 95% complete for h37_1.fastq
Analysis complete for h37_1.fastq

# Run fastqc on a fastq.gz file
zcat h37_2.fastq.gz | fastqc stdin --outdir=/media/sf_bioinformatics/RNA-sequences_mantid-horsehairWorm/output_fastQC

Per base sequence quality of h37_1.fastq

Per base sequence quality of h37_2.fastq.gz

The following error disappeared after intalling openjdk-8-jre
Exception in thread "Thread-1" java.awt.AWTError: Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper at java.awt.Toolkit.loadAssistiveTechnologies(Toolkit.java:807) at java.awt.Toolkit.getDefaultToolkit(Toolkit.java:886) at sun.swing.SwingUtilities2.getSystemMnemonicKeyMask(SwingUtilities2.java:2032) at javax.swing.plaf.basic.BasicLookAndFeel.initComponentDefaults(BasicLookAndFeel.java:1158) at javax.swing.plaf.metal.MetalLookAndFeel.initComponentDefaults(MetalLookAndFeel.java:431) at javax.swing.plaf.basic.BasicLookAndFeel.getDefaults(BasicLookAndFeel.java:148) at javax.swing.plaf.metal.MetalLookAndFeel.getDefaults(MetalLookAndFeel.java:1577) at javax.swing.UIManager.setLookAndFeel(UIManager.java:539) at javax.swing.UIManager.setLookAndFeel(UIManager.java:579) at javax.swing.UIManager.initializeDefaultLAF(UIManager.java:1349) at javax.swing.UIManager.initialize(UIManager.java:1459) at javax.swing.UIManager.maybeInitialize(UIManager.java:1426) at javax.swing.UIManager.getUI(UIManager.java:1006) at javax.swing.JPanel.updateUI(JPanel.java:126) at javax.swing.JPanel.<init>(JPanel.java:86) at javax.swing.JPanel.<init>(JPanel.java:109) at javax.swing.JPanel.<init>(JPanel.java:117) at uk.ac.babraham.FastQC.Graphs.QualityBoxPlot.<init>(QualityBoxPlot.java:49) at uk.ac.babraham.FastQC.Modules.PerBaseQualityScores.getResultsPanel(PerBaseQualityScores.java:53) at uk.ac.babraham.FastQC.Modules.AbstractQCModule.writeDefaultImage(AbstractQCModule.java:62) at uk.ac.babraham.FastQC.Modules.PerBaseQualityScores.makeReport(PerBaseQualityScores.java:199) at uk.ac.babraham.FastQC.Report.HTMLReportArchive.<init>(HTMLReportArchive.java:131) at uk.ac.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:185) at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:123) at java.lang.Thread.run(Thread.java:748)


Write a script to run fastqc on multiple files

#!/bin/bash

# Set up directory
project_dir='/media/sf_bioinformatics/RNA-sequences_mantid-horsehairWorm'
input_dir="${project_dir}/data/180130_Harigane-Mantis_RNA"
output_dir="${project_dir}/output_fastQC/180130_Harigane-Mantis_RNA"
echo ${project_dir} ${input_dir} ${output_dir} 
mkdir -p ${output_dir}

# Go to input directory
cd ${input_dir}

# Create a bash function to run fastqc
foo () {
	local file=$1
	fileName=`basename ${file}`
	fileNameWithoutExtension="${fileName%%.*}"
	zcat ${file} | fastqc stdin:${fileNameWithoutExtension} --outdir=${output_dir}
}

SECONDS=0
# Parallelize a Bash FOR Loop
for file in $(ls *fastq.gz); do foo "$file" & done
# Calculating time elapsed
duration=$SECONDS
echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds elapsed."

How to calculate time elapsed in bash script?


Read alignment

de novo assembly means we're studying sequencing data from a genome that has never been sequenced, what kind of computational method is most appropriate?
De-novo vs. mapping assembly
In sequence assembly, two different types can be distinguished:

de-novo: assembling short reads to create full-length (sometimes novel) sequences, without using a template (see de novo sequence assemblers, de novo transcriptome assembly)
mapping: assembling reads against an existing backbone sequence, building a sequence that is similar but not necessarily identical to the backbone sequence