Quality Control


Questions:

  • How to perform quality control of raw data?
  • What are the quality parameters to check for a dataset?
  • How to improve the quality of a dataset?

Objectives:

  • Assess short reads FASTQ quality using FastQC
  • Assess long reads FASTQ quality using NanoPlot
  • Perform quality correction with Cutadapt (short reads)
  • Perform quality correction with NanoFilt (long reads)
  • Summarise quality metrics with MultiQC
  • Process single-end and paired-end data

Key Points:

  • Perform quality control on every dataset before running any other bioinformatics analysis
  • Assess the quality metrics and improve quality if necessary
  • Check the impact of the quality control
  • Different tools are available to provide additional quality metrics
  • For paired-end reads, analyze the forward and reverse reads together

Introduction

During sequencing, the nucleotide bases in a DNA or RNA sample (library) are determined by the sequencer. For each fragment in the library, a sequence is generated, also called a read, which is simply a succession of nucleotides.

Modern sequencing technologies can generate a massive number of sequence reads in a single experiment. However, no sequencing technology is perfect, and each instrument will generate different types and amounts of errors, such as incorrect nucleotides being called. These wrongly called bases are due to the technical limitations of each sequencing platform.

It is therefore necessary to understand, identify and exclude error types that may impact the interpretation of downstream analysis. Sequence quality control is an essential first step in your analysis. Catching errors early saves time later on.

Inspect a raw sequence file

Create the directory ~/workshop_data/quality_control

Please download the file female_oral2.fastq from Zenodo and move it to ~/workshop_data/quality_control.

This is a microbiome sample from a snake. It is amplicon data, where 16S DNA is PCR amplified and sequenced.

What type of file is this? What does the '.gz' extension mean?

  1. Inspect the FASTQ file

Although it looks complicated (and maybe it is), the FASTQ format is easy to understand with a little decoding.

Each read, representing a fragment of the library, is encoded by 4 lines:

| Line | Description |
|------|-------------|
| 1 | Always begins with @ followed by the information about the read |
| 2 | The actual nucleic sequence |
| 3 | Always begins with a + and sometimes repeats the information from line 1 |
| 4 | Has a string of characters which represent the quality scores associated with each base of the nucleic sequence; must have the same number of characters as line 2 |

So for example, the first sequence in our file is:

@M00970:337:000000000-BR5KF:1:1102:17745:1557 1:N:0:CGCAGAAC+ACAGAGTT
GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(

It means that the read named M00970:337:000000000-BR5KF:1:1102:17745:1557 (line 1, after the @) corresponds to the DNA sequence on line 2, and that each base of this sequence was called with the quality encoded by the character at the same position on line 4.
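The four-line layout described above can be read programmatically. The following is a minimal Python sketch, not part of the workshop material: the function name `read_fastq` and the toy record are made up for illustration, and the code assumes an uncompressed FASTQ file in which each record occupies exactly four lines.

```python
from io import StringIO
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Line 1 must start with '@' and line 3 with '+'.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed record near {header!r}")
        # Line 4 must have one quality character per base of line 2.
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Toy record invented for illustration (not taken from the workshop file).
example = StringIO("@read1 demo\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

For a gzip-compressed file, the same function should work on a handle opened in text mode with Python's `gzip.open(path, "rt")`.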

But what does this quality score mean?

The quality score for each sequence is a string of characters, one for each base of the nucleic sequence, used to characterize the probability of mis-identification of each base. The score is encoded using the ASCII character table (with some historical differences):

So there is an ASCII character associated with each nucleotide, representing its Phred quality score, a logarithmic measure of the probability of an incorrect base call:

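| Phred Quality Score | Probability of incorrect base call | Base call accuracy |
|---------------------|------------------------------------|--------------------|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |
| 50 | 1 in 100,000 | 99.999% |
| 60 | 1 in 1,000,000 | 99.9999% |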
  1. What is the Phred quality score of the 3rd nucleotide of the 1st sequence?
  2. What is the accuracy of this 3rd nucleotide?
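
The relationship between a quality character, its Phred score and the error probability can be sketched in a few lines of Python. This is an illustration only: it assumes the common Phred+33 encoding (score = ASCII code minus 33), which is an assumption about the file rather than something stated in this tutorial, and the function names are hypothetical. The characters used are arbitrary examples.

```python
def phred_from_char(char: str, offset: int = 33) -> int:
    """Convert one quality character to a Phred score (Phred+33 assumed)."""
    return ord(char) - offset


def error_probability(phred: int) -> float:
    """Probability that the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Arbitrary example characters, chosen to match rows of the table above.
for char in "I5+":
    q = phred_from_char(char)
    p = error_probability(q)
    print(f"{char!r}: Q={q}, P(error)={p:.5f}, accuracy={100 * (1 - p):.3f}%")
```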

Assess quality with FastQC - short & long reads

To take a look at sequence quality along all sequences, we can use FastQC. It provides a modular set of analyses which you can use to check whether your data has any problems of which you should be aware before doing any further analysis. We can use it, for example, to assess whether there are known adapters present in the data. We'll run it on the FASTQ file.

First, we will need to prepare a conda environment for this tutorial.

mamba create --name=QC fastqc
mamba activate QC
# ATTENTION!
# IF 'mamba activate' does not work yet, do the next steps, else skip!
mamba activate base
mamba init
# close your shell: Alt+F4
# open a new shell: Ctrl+Alt+T
  1. Run fastqc using the command below
cd ~/workshop_data/quality_control
fastqc female_oral2.fastq
  1. Open the HTML file with Firefox and inspect the results

Per base sequence quality

With FastQC we can use the per base sequence quality plot to check the base quality of the reads.

The x-axis shows the base position in the read. In this example, the sample contains reads that are up to 289 bp long.

For each position, a boxplot is drawn with:

  • the median value, represented by the central red line
  • the inter-quartile range (25-75%), represented by the yellow box
  • the 10% and 90% values, represented by the lower and upper whiskers
  • the mean quality, represented by the blue line

The y-axis shows the quality scores. The higher the score, the better the base call. The background of the graph divides the y-axis into very good quality scores (green), scores of reasonable quality (orange), and scores of poor quality (red).

It is normal with all Illumina sequencers for the median quality score to start out lower over the first 5-7 bases and to then rise. The quality of reads on most platforms will drop at the end of the read. This is often due to signal decay or phasing during the sequencing run. Recent developments in sequencing chemistry have improved this somewhat, but reads are now longer than ever.

Signal decay and phasing

  • Signal decay

The fluorescent signal intensity decays with each cycle of the sequencing process. Due to the degrading fluorophores, a proportion of the strands in the cluster are not being elongated. The proportion of the signal being emitted continues to decrease with each cycle, leading to a decrease of quality scores at the 3' end of the read.

  • Phasing

The signal starts to blur as the number of cycles increases because the cluster loses synchronicity. As the cycles progress, some strands randomly fail to incorporate a nucleotide, so they fall out of step with the rest of the cluster.

This leads to a decrease in quality scores at the 3' end of the read.

  1. How does the mean quality score change along the sequence?
  2. Is this tendency seen in all sequences?

When the median quality is below a Phred score of ~20, we should consider trimming away bad quality bases from the sequence. We will explain that process in the Trim and filter section.

Adapter Content

The plot shows the cumulative percentage of reads with the different adapter sequences at each position. Once an adapter sequence is seen in a read it is counted as being present right through to the end of the read, so the percentage increases with the read length. FastQC can detect some adapters by default (e.g. Illumina, Nextera); for others we could provide a contaminants file as an input to the FastQC tool.

Ideally Illumina sequence data should not have any adapter sequence present. But with long reads, some of the library inserts are shorter than the read length, resulting in read-through to the adapter at the 3' end of the read. This microbiome sample has relatively long reads and we can see that the Nextera adapter has been detected.

Adapter content may also be detected with RNA-Seq libraries where the distribution of library insert sizes is varied and likely to include some short inserts.

We can run a trimming tool such as Cutadapt to remove this adapter. We will explain that process in the Trim and filter section.

The following sections go into detail about some of the other plots generated by FastQC. Note that some plots/modules may give warnings but be normal for the type of data you're working with. The other plots give us information to more deeply understand the quality of the data, and to see if changes could be made in the lab to get higher-quality data in the future.

Per tile sequence quality

This plot enables you to look at the quality scores from each tile across all of your bases to see if there was a loss in quality associated with only one part of the flowcell. The plot shows the deviation from the average quality for each flowcell tile. The hotter colours indicate that reads in the given tile have worse qualities for that position than reads in other tiles. With this sample, you can see that certain tiles show consistently poor quality, especially from ~100bp onwards. A good plot should be blue all over.

This plot will only appear for Illumina libraries which retain their original sequence identifiers. Encoded in these is the flowcell tile from which each read came.

Other tile quality profiles

In some cases, the chemicals used during sequencing become somewhat exhausted over time, and the last tiles receive the worst chemicals, which makes the sequencing reactions more error-prone. The "Per tile sequence quality" graph will then show some horizontal lines.

Per sequence quality scores

It plots the average quality score over the full length of all reads on the x-axis and gives the total number of reads with this score on the y-axis:

The distribution of average read quality should be a tight peak in the upper range of the plot. It can also report if a subset of the sequences have universally low quality values: this can happen because some sequences are poorly imaged (on the edge of the field of view etc.); however, these should represent only a small percentage of the total sequences.

Per base sequence content

"Per Base Sequence Content" plots the percentage of each of the four nucleotides (T, C, A, G) at each position across all reads in the input sequence file. As for the per base sequence quality, the x-axis is non-uniform.

In a random library we would expect that there would be little to no difference between the four bases. The proportion of each of the four bases should remain relatively constant over the length of the read with %A=%T and %G=%C, and the lines in this plot should run parallel with each other. This is amplicon data, where 16S DNA is PCR amplified and sequenced, so we'd expect this plot to have some bias and not show a random distribution.
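
To make the idea of per-position base percentages concrete, here is a small Python sketch. It is not FastQC's implementation: the function name and toy reads are invented, every position is reported individually (FastQC may bin positions), and only A, C, G and T are counted towards the total, which is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, List


def per_base_content(reads: List[str]) -> List[Dict[str, float]]:
    """Return, for each read position, the percentage of A, C, G and T."""
    longest = max((len(read) for read in reads), default=0)
    content = []
    for position in range(longest):
        # Only reads long enough to cover this position contribute.
        counts = Counter(read[position] for read in reads if len(read) > position)
        total = sum(counts[base] for base in "ACGT")
        content.append(
            {base: (100 * counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return content


# Toy reads invented for illustration.
toy_reads = ["ACGT", "ACGA", "ACTT", "AGGT"]
for position, percentages in enumerate(per_base_content(toy_reads), start=1):
    print(position, percentages)
```

In this toy set every read starts with A, so position 1 is 100% A. That is the kind of strong bias an amplicon library can show at conserved positions.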

Biases by library type

It's worth noting that some library types will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries), and those which were fragmented using transposases, will contain an intrinsic bias in the positions at which reads start (the first 10-12 bases). This bias does not involve a specific sequence, but instead provides enrichment of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will, however, produce a warning or error in this module.

  1. Why is there a warning for the per-base sequence content graphs?

Per sequence GC content

This plot displays the number of reads vs. percentage of bases G and C per read. It is compared to a theoretical distribution assuming a uniform GC content for all reads, expected for whole genome shotgun sequencing, where the central peak corresponds to the overall GC content of the underlying genome. Since the GC content of the genome is not known, the modal GC content is calculated from the observed data and used to build a reference distribution.
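
The observed curve in this plot is essentially a histogram of per-read GC percentages. A minimal Python sketch of that computation follows. The function names and toy reads are hypothetical, and the fitting of the theoretical reference distribution is not reproduced here.

```python
from collections import Counter
from typing import List


def gc_percent(sequence: str) -> float:
    """Percentage of G and C bases in one read."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return 100 * gc / len(sequence)


def gc_histogram(reads: List[str]) -> Counter:
    """Count reads per rounded GC percentage (the x-axis of the plot)."""
    return Counter(round(gc_percent(read)) for read in reads)


# Toy reads invented for illustration.
toy_reads = ["GGCC", "ATAT", "ACGT", "AGCT"]
print(sorted(gc_histogram(toy_reads).items()))
```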

An unusually-shaped distribution could indicate a contaminated library or some other kind of biased subset. A shifted normal distribution indicates some systematic bias, which is independent of base position. If there is a systematic bias which creates a shifted normal distribution then this won't be flagged as an error by the module since it doesn't know what your genome's GC content should be.

But there are also other situations in which an unusually-shaped distribution may occur. For example, with RNA sequencing there may be a greater or lesser distribution of mean GC content among transcripts causing the observed plot to be wider or narrower than an ideal normal distribution.

  1. Why is there a fail for the per sequence GC content graphs?

Sequence length distribution

This plot shows the distribution of fragment sizes in the file which was analysed. In many cases this will produce a simple plot showing a peak only at one size, but for variable length FASTQ files this will show the relative amounts of each different size of sequence fragment.

Some high-throughput sequencers generate sequence fragments of uniform length, but others can contain reads of widely varying lengths. Even within uniform length libraries some pipelines will trim sequences to remove poor quality base calls from the end or the first n bases if they match the first n bases of the adapter up to 90% (by default), with sometimes n = 1.

Sequence Duplication Levels

The graph shows in blue the percentage of reads of a given sequence in the file which are present a given number of times in the file:

In a diverse library most sequences will occur only once in the final set. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias.

Two sources of duplicate reads can be found:

  • PCR duplication in which library fragments have been over-represented due to biased PCR enrichment

    It is a concern because PCR duplicates misrepresent the true proportion of sequences in the input.

  • Truly over-represented sequences such as very abundant transcripts in an RNA-Seq library or in amplicon data (like this sample)

    It is an expected case and not of concern because it does faithfully represent the input.

FastQC counts the degree of duplication for every sequence in a library and creates a plot showing the relative number of sequences with different degrees of duplication.

For whole genome shotgun data it is expected that nearly 100% of your reads will be unique (appearing only 1 time in the sequence data). Most sequences should fall into the far left of the plot in both the red and blue lines. This indicates a highly diverse library that was not over sequenced. If the sequencing depth is extremely high (e.g. 100x the size of the genome) some inevitable sequence duplication can appear: there are in theory only a finite number of completely unique sequence reads which can be obtained from any given input DNA sample.

More specific enrichments of subsets, or the presence of low complexity contaminants will tend to produce spikes towards the right of the plot. These high duplication peaks will most often appear in the blue trace as they make up a high proportion of the original library, but usually disappear in the red trace as they make up an insignificant proportion of the deduplicated set. If peaks persist in the red trace then this suggests that there are a large number of different highly duplicated sequences which might indicate either a contaminant set or a very severe technical duplication.

This is usually the case for RNA sequencing, where there are some very highly abundant transcripts and some lowly abundant ones. It is expected that duplicate reads will be observed for high abundance transcripts:

Over-represented sequences

A normal high-throughput library will contain a diverse set of sequences, with no individual sequence making up more than a tiny fraction of the whole. Finding that a single sequence is very over-represented in the set either means that it is highly biologically significant, or indicates that the library is contaminated, or not as diverse as expected.

FastQC lists all of the sequences which make up more than 0.1% of the total. For each over-represented sequence FastQC will look for matches in a database of common contaminants and will report the best hit it finds. Hits must be at least 20bp in length and have no more than 1 mismatch. Finding a hit doesn't necessarily mean that this is the source of the contamination, but may point you in the right direction. It's also worth pointing out that many adapter sequences are very similar to each other so you may get a hit reported which isn't technically correct, but which has a very similar sequence to the actual match.
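
The threshold logic described above can be illustrated with a short Python sketch. This is a simplification and not FastQC's code: the function name and toy data are invented, exact full-length matches are counted, and any shortcuts FastQC applies for speed or memory are ignored. The contaminant lookup is not reproduced.

```python
from collections import Counter
from typing import List, Tuple


def overrepresented(
    reads: List[str], min_percent: float = 0.1
) -> List[Tuple[str, int, float]]:
    """List (sequence, count, percent) for sequences above min_percent of all reads."""
    counts = Counter(reads)
    total = len(reads)
    return [
        (sequence, count, 100 * count / total)
        for sequence, count in counts.most_common()
        if 100 * count / total > min_percent
    ]


# Toy data invented for illustration; a high threshold is used because the set is tiny.
toy_reads = ["AAAA"] * 5 + ["ACGT"] + ["TTTT"] * 4
print(overrepresented(toy_reads, min_percent=30))
```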

RNA sequencing data may have some transcripts that are so abundant that they register as over-represented sequence. With DNA sequencing data no single sequence should be present at a high enough frequency to be listed, but we can sometimes see a small percentage of adapter reads.

  1. How could we find out what the overrepresented sequences are?

More details about other FastQC plots

Per base N content

If a sequencer is unable to make a base call with sufficient confidence, it will write an "N" instead of a conventional base call. This plot displays the percentage of base calls at each position or bin for which an N was called.

It's not unusual to see a very low proportion of Ns appearing in a sequence, especially near the end of a sequence. But this curve should never rise noticeably above zero. If it does, this indicates that a problem occurred during the sequencing run. In the example below, an error caused the instrument to be unable to call a base for approximately 20% of the reads at position 29:

Small/micro RNA

In small RNA libraries, we typically have a relatively small set of unique, short sequences. Small RNA libraries are not randomly sheared before adding sequencing adapters to their ends: all the reads for specific classes of microRNAs will be identical. It will result in:

  • Extremely biased per base sequence content
  • Extremely narrow distribution of GC content
  • Very high sequence duplication levels
  • Abundance of overrepresented sequences
  • Read-through into adapters

Amplicon

Amplicon libraries are prepared by PCR amplification of a specific target, for example the V4 hypervariable region of the bacterial 16S rRNA gene. All reads from this type of library are expected to be nearly identical. It will result in:

  • Extremely biased per base sequence content
  • Extremely narrow distribution of GC content
  • Very high sequence duplication levels
  • Abundance of overrepresented sequences

Bisulfite or Methylation sequencing

With Bisulfite or methylation sequencing, the majority of the cytosine (C) bases are converted to thymine (T). It will result in:

  • Biased per base sequence content
  • Biased per sequence GC content

Adapter dimer contamination

Any library type may contain a very small percentage of adapter dimer (i.e. no insert) fragments. They are more likely to be found in amplicon libraries constructed entirely by PCR (by formation of PCR primer-dimers) than in DNA-Seq or RNA-Seq libraries constructed by adapter ligation. If a sufficient fraction of the library is adapter dimer it will become noticeable in the FastQC report:

  • Drop in per base sequence quality after base 60
  • Possible bi-modal distribution of per sequence quality scores
  • Distinct pattern observed in per base sequence content up to base 60
  • Spike in per sequence GC content
  • Overrepresented sequence matching adapter
  • Adapter content > 0% starting at base 1

Bad quality sequences

If the quality of the reads is not good, we should always first check what is wrong and think about it: it may come from the type of sequencing or what we sequenced (high quantity of overrepresented sequences in transcriptomics data, biased percentage of bases in Hi-C data).

You can also ask the sequencing facility about it, especially if the quality is really bad: the quality treatments cannot solve everything. If too many bad quality bases are cut away, the corresponding reads will then be filtered out and you lose them.

Trim and filter - short reads

The quality drops in the middle of these sequences. This could cause bias in downstream analyses with these potentially incorrectly called nucleotides. Sequences must be treated to reduce bias in downstream analysis. Trimming can help to increase the number of reads the aligner or assembler is able to successfully use, reducing the number of reads that are unmapped or unassembled. In general, quality treatments include:

  1. Trimming/cutting/masking sequences
    • from low quality score regions
    • beginning/end of sequence
    • removing adapters
  2. Filtering of sequences
    • with low mean quality score
    • too short
    • with too many ambiguous (N) bases

To accomplish this task we will use Cutadapt, a tool that enhances sequence quality by automating adapter trimming as well as quality control. We will:

  • Trim low-quality bases from the ends. Quality trimming is done before any adapter trimming. We will set the quality threshold as 20, a commonly used threshold, see more here.
  • Trim adapter with Cutadapt. For that we need to supply the sequence of the adapter. In this sample, Nextera is the adapter that was detected. We can find the sequence of the Nextera adapter on the Illumina website here. We will trim that sequence from the 3' end of the reads.
  • Filter out sequences with length < 20 after trimming
  • Save the processed files as female_oral2_trimmed_and_filtered.fastq in the same directory as the original reads.

Install cutadapt in your QC conda environment and:

  1. Run cutadapt --help and try to figure out the commands to do the above.
  1. Inspect the program output
  • What % reads contain adapter?
  • What % reads have been trimmed because of bad quality?
  • What % reads have been removed because they were too short?

One of the biggest advantages of Cutadapt compared to other trimming tools (e.g. TrimGalore!) is that it has good documentation explaining how the tool works in detail.

Cutadapt's quality trimming algorithm consists of three simple steps:

  1. Subtract the chosen threshold value from the quality value of each position
  2. Compute a partial sum of these differences from the end of the sequence to each position (as long as the partial sum is negative)
  3. Cut at the minimum value of the partial sum

In the following example, we assume that the 3’ end is to be quality-trimmed with a threshold of 10 and we have the following quality values:

42 40 26 27 8 7 11 4 2 3
  1. Subtract the threshold

    32 30 16 17 -2 -3 1 -6 -8 -7
    
  2. Add up the numbers, starting from the 3' end (partial sums) and stop early if the sum is greater than zero

    (70) (38) 8 -8 -25 -23 -20 -21 -15 -7
    

    The numbers in parentheses are not computed (because 8 is greater than zero), but shown here for completeness.

  3. Choose the position of the minimum (-25) as the trimming position

Therefore, the read is trimmed to the first four bases, which have quality values:

42 40 26 27

Note that positions with a quality value larger than the chosen threshold are therefore also removed if they are embedded in regions with lower quality (the partial sum keeps decreasing while the quality values are smaller than the threshold). The advantage of this procedure is that it is robust against a small number of positions with a quality higher than the threshold.
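
The three steps and the worked example above can be reproduced with a short Python sketch. This is an illustration written from the description in this section, not Cutadapt's own source code, and the function name is hypothetical.

```python
from typing import List


def quality_trim_index(qualities: List[int], threshold: int) -> int:
    """Return how many bases to keep after 3' quality trimming."""
    partial_sum = 0
    minimum = 0
    keep = len(qualities)  # keep everything unless a negative minimum is found
    # Walk from the 3' end towards the 5' end.
    for position in reversed(range(len(qualities))):
        partial_sum += qualities[position] - threshold
        if partial_sum > 0:
            break  # stop early once the partial sum becomes positive
        if partial_sum < minimum:
            minimum = partial_sum
            keep = position  # cut at the position of the minimum
    return keep


# Quality values from the worked example, threshold 10.
example = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
keep = quality_trim_index(example, threshold=10)
print(example[:keep])
```

Note that the base with quality 11 is removed even though it is above the threshold, because it sits inside a low-quality region.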

Alternatives to this procedure would be:

  • Cut after the first position with a quality smaller than the threshold

  • Sliding window approach

    The sliding window approach checks that the average quality of each sequence window of specified length is larger than the threshold. Note that in contrast to cutadapt's approach, this approach has one more parameter and the robustness depends on the length of the window (in combination with the quality threshold). Both approaches are implemented in Trimmomatic.
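
For comparison, a sliding window trimmer can be sketched as follows. This is a deliberately simplified Python illustration: it scans from the 5' end and cuts at the start of the first window whose mean quality is below the threshold. It is not Trimmomatic's implementation, whose details may differ, and the function name is hypothetical.

```python
from typing import List


def sliding_window_trim_index(
    qualities: List[int], window: int, threshold: float
) -> int:
    """Return how many bases to keep: cut at the first window with a low mean."""
    for start in range(0, len(qualities) - window + 1):
        chunk = qualities[start:start + window]
        if sum(chunk) / window < threshold:
            return start
    return len(qualities)  # no failing window (or read shorter than the window)


# Same quality values as in the Cutadapt example, threshold 10.
example = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
for window in (4, 6):
    keep = sliding_window_trim_index(example, window, threshold=10)
    print(f"window={window}: keep {example[:keep]}")
```

With a window of 4 the read is cut after four bases, while a window of 6 cuts after three, which shows how the result depends on the extra window-length parameter.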

We can now examine our trimmed data with FastQC.

Run FastQC for your trimmed and filtered FASTQ file. Inspect the generated HTML file.

  1. Does the per base sequence quality look better?
  2. Is the adapter gone?

With FastQC we can see we improved the quality of the bases in the dataset and removed the adapter.

Other FastQC plots after trimming

Per sequence quality scores: we now have one peak of high quality, instead of the one high-quality and one lower-quality peak we had previously.

Per base sequence content: as before, the bases are not equally represented, because this is amplicon data.

Per sequence GC content: we now have a single main GC peak, because the adapter has been removed.

Per base N content: this is the same as before, as we don't have any Ns in these reads.

Sequence length distribution: we now have multiple peaks and a range of lengths, instead of the single peak we had before trimming, when all sequences were the same length.

  1. What does the top overrepresented sequence GTGTCAGCCGCCGCGGTAGTCCGACGTGG correspond to? Tip: use blastn

Processing multiple datasets

Download the data

mkdir -p ~/workshop_data/quality_control/paired_end
wget https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_1.fastq -O ~/workshop_data/quality_control/paired_end/GSM461178_untreat_paired_subset_1.fastq
wget https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_2.fastq -O ~/workshop_data/quality_control/paired_end/GSM461178_untreat_paired_subset_2.fastq

Process paired-end data

With paired-end sequencing, the fragments are sequenced from both sides. This approach results in two reads per fragment, with the first read in forward orientation and the second read in reverse-complement orientation. With this technique, we have the advantage to get more information about each DNA fragment compared to reads sequenced by only single-end sequencing:

    ------                      [single-end]

    ----------------------------- [fragment]

    ------              <------ [paired-end]

The distance between both reads is known and therefore is additional information that can improve read mapping.

Paired-end sequencing generates 2 FASTQ files:

  • One file with the sequences corresponding to forward orientation of all the fragments
  • One file with the sequences corresponding to reverse orientation of all the fragments

Usually we recognize these two files which belong to one sample by the name which has the same identifier for the reads but a different extension, e.g. sampleA_R1.fastq for the forward reads and sampleA_R2.fastq for the reverse reads. It can also be _f or _1 for the forward reads and _r or _2 for the reverse reads.

The data we analyzed in the previous step was single-end data so we will import a paired-end RNA-seq dataset to use. We will run FastQC and aggregate the two reports with MultiQC.

First: install multiqc in your conda environment.

  1. Inspect the FASTQ files in ~/workshop_data/quality_control/paired_end. Try using cat and head.

  2. Run FastQC for both files. Make sure to execute the tool from the data directory (cd ~/workshop_data/quality_control/paired_end)

  3. Run multiqc --help to figure out how to proceed.

  4. Inspect the webpage output from MultiQC.

  • What do you think about the quality of the sequences?
  • What should we do?

With paired-end reads the average quality scores for forward reads will almost always be higher than for reverse reads.

After trimming, some reverse reads will be shorter because of their lower quality and may then be eliminated during the filtering step. If a reverse read is removed, its corresponding forward read should be removed too. Otherwise we will get a different number of reads in the two files, and in a different order, and order is important for the next steps. Therefore it is important to treat the forward and reverse reads together for trimming and filtering.
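
The reason for treating both files together can be shown with a small Python sketch of length filtering on pairs. This is an illustration only: the function name, the `Read` type and the toy reads are invented, qualities are left out for brevity, and the rule of dropping a pair as soon as either mate is too short is one reasonable policy, not a statement about how any particular tool is configured.

```python
from typing import List, Tuple

Read = Tuple[str, str]  # (name, sequence); qualities omitted for brevity


def filter_pairs(
    forward: List[Read], reverse: List[Read], min_length: int = 20
) -> Tuple[List[Read], List[Read]]:
    """Keep a pair only if both mates are at least min_length bases long."""
    kept_forward, kept_reverse = [], []
    # zip walks both files in step, so mates stay in the same order.
    for fwd, rev in zip(forward, reverse):
        if len(fwd[1]) >= min_length and len(rev[1]) >= min_length:
            kept_forward.append(fwd)
            kept_reverse.append(rev)
    return kept_forward, kept_reverse


# Toy trimmed reads invented for illustration.
forward = [("r1", "A" * 30), ("r2", "A" * 25), ("r3", "A" * 22)]
reverse = [("r1", "T" * 30), ("r2", "T" * 5), ("r3", "T" * 21)]
kept_fwd, kept_rev = filter_pairs(forward, reverse)
print([name for name, _ in kept_fwd], [name for name, _ in kept_rev])
```

Pair r2 is dropped from both outputs even though its forward read is long enough, so the two outputs keep the same number of reads in the same order.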

  1. Run cutadapt --help to figure out how to do paired-end trimming and filtering. We will set the quality threshold as 20 and filter out sequences with length < 20 after trimming.

  2. Inspect the program output.

  • How many basepairs have been removed from the reads because of bad quality?
  • How many sequence pairs have been removed because they were too short?

In addition to the report, Cutadapt generates 2 files:

  • Read 1 with the trimmed and filtered forward reads
  • Read 2 with the trimmed and filtered reverse reads

These datasets can be used for the downstream analysis, e.g. mapping.

Long reads

Download the data

Download the data from https://drive.google.com/drive/folders/1a4qvtnxhkIUevoieookb1Rn4RoBfcZ7a?usp=sharing

Move the .zip file to the quality control directory, extract the files from the .zip into this directory, and rename the quality_control_ont directory to ont.

mv ~/Downloads/quality_control_ont*.zip ~/workshop_data/quality_control
cd ~/workshop_data/quality_control
unzip quality*
mv quality_control_ont ont

Assess quality with NanoPlot - Long reads only

In case of long reads, we can check sequence quality with NanoPlot. It provides basic statistics with nice plots for a fast quality control overview.

  • Install nanoplot into your environment.
  • Create the directory ~/workshop_data/quality_control/ont/nanoplot
  • Move into the ~/workshop_data/quality_control/ont/nanoplot directory
  • Run NanoPlot on the file ../nanopore_basecalled-guppy.fastq.gz.
  1. Inspect the generated HTML file NanoPlot-report.html. What is the mean Qscore? What are the median read length, mean read length and N50?

Histogram of read lengths

This plot shows the distribution of fragment sizes in the file that was analyzed. Unlike most Illumina runs, long reads have a variable length and this will show the relative amounts of each different size of sequence fragment.
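
Because read lengths vary so much, the report summarises them with the mean, the median and the N50. The N50 is commonly defined as the length L such that reads of length L or longer contain at least half of all sequenced bases. The Python sketch below illustrates that definition. The function name and toy lengths are invented, and it is not NanoPlot's own code.

```python
from statistics import mean, median
from typing import List


def n50(lengths: List[int]) -> int:
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy read lengths invented for illustration.
toy_lengths = [100, 200, 300, 400, 1000]
print("mean:", mean(toy_lengths))
print("median:", median(toy_lengths))
print("N50:", n50(toy_lengths))
```

In this toy set a single long read holds half of all bases, so the N50 is much larger than the mean or the median.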

Read lengths vs Average read quality plot using dots

This plot shows the distribution of fragment sizes according to the Qscore in the file which was analysed. In general, there is no link between read length and read quality, but this representation makes it possible to visualize both pieces of information in a single plot and detect possible aberrations. In runs with a lot of short reads, the shorter reads are sometimes of lower quality than the rest.

  1. Look at the "Read lengths vs Average read quality" plot. Do you notice something unusual?

  2. Do the quality control with FastQC and compare the results!

Assess quality with PycoQC - Nanopore only

PycoQC is a data visualisation and quality control tool for nanopore data. In contrast to FastQC/NanoPlot, it needs a specific sequencing_summary.txt file generated by Oxford Nanopore basecallers such as Guppy or the older Albacore basecaller.

One of the strengths of PycoQC is that it is interactive and highly customizable, e.g., plots can be cropped, you can zoom in and out, sub-select areas and export figures.

PycoQC's dependencies are not compatible with the tools we installed before, so it needs to be installed in its own environment.

mamba create -n pycoQC -y pycoqc
conda activate pycoQC 
  1. Run the analysis with pycoQC and inspect the report.
  • How many reads do you have in total?
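The total read count shown in the report can be cross-checked directly on the summary file. This is a minimal sketch under assumptions: it expects a tab-separated file with a header row and a passes_filtering column, as Guppy summary files usually have, and it runs on three made-up rows rather than on the workshop data.

```shell
#!/usr/bin/env bash
# Sketch: count all reads and passed reads in a basecaller summary file.
# Assumes a tab-separated file with a header and a "passes_filtering" column;
# the rows written below are hypothetical toy data.

workdir=$(mktemp -d)
summary="$workdir/sequencing_summary.txt"
printf 'read_id\tpasses_filtering\tsequence_length_template\tmean_qscore_template\n' > "$summary"
printf 'a\tTRUE\t1200\t11.3\nb\tFALSE\t300\t5.9\nc\tTRUE\t8000\t12.8\n' >> "$summary"

counts=$(awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map column names to positions
    { total++; if ($(col["passes_filtering"]) == "TRUE") passed++ }
    END { printf "total=%d passed=%d\n", total, passed }' "$summary")
echo "$counts"
```

Looking columns up by name rather than by position keeps the sketch independent of the column order, which can differ between basecaller versions.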

Basecalled reads length

As for FastQC and NanoPlot, this plot shows the distribution of fragment sizes in the file that was analyzed. Long reads have variable lengths, and this plot shows the relative amount of each fragment size. In this example, the read lengths are widely dispersed, with a minimum read length for the passed reads of around 200 bp and a maximum length of ~150,000 bp.

Basecalled reads PHRED quality

This plot shows the distribution of the Qscores (Q) for each read. This score aims to give a global quality score for each read. The exact definition of a Qscore is the average per-base error probability, expressed on the log (Phred) scale. For Nanopore data, the distribution is generally centered around 10 or 12. For old runs, the distribution can be lower, as older basecalling models are less precise than recent ones.
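The definition above means that the per-base error probabilities are averaged, not the per-base Phred scores. A minimal sketch of that calculation follows, assuming Phred+33 encoded quality strings and using a made-up four-base quality string; it illustrates the definition and is not the code of any basecaller.

```shell
#!/usr/bin/env bash
# Sketch: read-level Qscore = -10 * log10(mean per-base error probability).
# Assumes Phred+33 encoding; the quality string below is made up.

qual='+5+5'   # "+" encodes Q10, "5" encodes Q20

read_q=$(printf '%s\n' "$qual" | awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }   # character to ASCII code
    {
        n = length($0); sum = 0
        for (i = 1; i <= n; i++) {
            q = ord[substr($0, i, 1)] - 33      # Phred score of this base
            sum += 10 ^ (-q / 10)               # convert to an error probability
        }
        # average the probabilities, then convert back to the Phred scale
        printf "%.2f\n", -10 * log(sum / n) / log(10)
    }')
echo "read Qscore: $read_q"
```

For this toy string the result is 12.60, lower than the arithmetic mean of the per-base scores (15), because low-quality bases dominate the average error probability.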

Basecalled reads length vs reads PHRED quality

  1. What do the mean quality and the quality distribution of the run look like?

As in NanoPlot, this representation gives a 2D visualisation of read Qscore against read length.

Output over experiment time

This representation gives information about the reads sequenced over time for a single run:

  • Each peak indicates a new loading of the flow cell (3 + the first load).
  • The contribution of each "refuel" to the total number of reads.
  • The production of reads decreases over time:
    • Most of the material (DNA/RNA) is sequenced
    • Saturation of pores
    • Material/pores degradation

In this example, the contribution of each refueling is very low, and the run can be considered bad. The “Cumulative” plot area (light blue) indicates that 50% of all reads and almost 50% of all bases were produced in the first 5h of the 25h experiment. Although it is normal that yield decreases over time, a decrease like this is not a good sign.
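The share of reads produced early in a run can also be estimated from the summary file. This is a rough sketch under assumptions: it expects a tab-separated summary with a start_time column given in seconds since the start of the run, and it uses four made-up reads and a 5 h cutoff.

```shell
#!/usr/bin/env bash
# Sketch: percentage of reads that started within the first hours of a run.
# Assumes a "start_time" column in seconds since run start; toy values only.

workdir=$(mktemp -d)
summary="$workdir/sequencing_summary.txt"
printf 'read_id\tstart_time\n' > "$summary"
printf 'a\t600\nb\t7200\nc\t16000\nd\t40000\n' >> "$summary"

cutoff_hours=5

share=$(awk -F '\t' -v hours="$cutoff_hours" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map column names to positions
    { total++; if ($(col["start_time"]) <= hours * 3600) early++ }
    END { printf "%.0f\n", 100 * early / total }' "$summary")
echo "${share}% of reads were produced in the first ${cutoff_hours} h"
```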

Another "Output over experiment time" profile

In this example, the data production decreased only slightly over the 12h, with a continuous increase in cumulative data. The absence of a decreasing curve at the end of the run indicates that there was still biological material on the flow cell and that the run was ended before everything was sequenced. It is an excellent run and can even be considered exceptional.

Read length over experiment time

  1. Did the read length change over time? What could the reason be?

Channel activity over time

This plot gives an overview of available pores, pore usage during the experiment and inactive pores, and shows whether the loading of the flow cell is good (almost all pores are used). In this case, the vast majority of channels/pores are inactive (white) throughout the sequencing run, so the run can be considered bad.

You would hope for a plot that is dark near the X-axis and that doesn’t get too light/white at higher Y-values (increasing time). Depending on whether you chose “Reads” or “Bases” on the left, the colour indicates either the number of reads or the number of bases per time interval.

In this example, almost all pores are active throughout the run (yellow/red profile), which indicates an excellent run.

Practice some more

In the directory ~/workshop_data/quality_control/ont/summaries you will find 3 different run directories, each containing a sequencing summary file. From within the ont directory, call PycoQC on one of the summary files, e.g., run_1.

  • How many reads do you have in total?
  • What is the median, minimum and maximum read length, what is the N50?
  • What do the mean quality and the quality distribution of the run look like? Remember, Q10 = 10% error rate
  • Have a look at the “Basecalled Reads PHRED Quality” and “Read length vs PHRED quality” plots. Is there a link between read length and PHRED score?
  • Have a look at the “Read Length over Experiment time” plot. Did the read length change over time? What could the reason be?
  • Given the number of active pores, yield over time, and channel activity over time, do you think this was a successful sequencing run? Why/why not?
  • Inspect the “output over experiment time” graph. Can you explain the shown curve-pattern? Would you have stopped the run earlier? Think about how the MinION works, especially with regards to adjustment of the applied currents.
  • Generate the PycoQC plots for run_3/sequencing_summary.txt and compare it to run_1. What are the differences?
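To compare several runs, the pycoQC calls can be built in a loop. The sketch below only prints and logs the commands by default; the DRY_RUN variable, the commands.txt log and the report names are hypothetical choices for illustration. It assumes every run directory contains a file named sequencing_summary.txt and uses pycoQC's -f (summary file) and -o (HTML report) options. Toy directories stand in for the summaries folder.

```shell
#!/usr/bin/env bash
# Sketch: one pycoQC call per run directory.
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 inside the
# pycoQC environment to actually run them. Directory layout is assumed.

DRY_RUN=${DRY_RUN:-1}

# Stand-in for ~/workshop_data/quality_control/ont/summaries
summaries=$(mktemp -d)
mkdir -p "$summaries/run_1" "$summaries/run_3"
touch "$summaries/run_1/sequencing_summary.txt" "$summaries/run_3/sequencing_summary.txt"

cmd_log="$summaries/commands.txt"
: > "$cmd_log"

for run_dir in "$summaries"/run_*; do
    run=$(basename "$run_dir")
    cmd="pycoQC -f $run_dir/sequencing_summary.txt -o $run.html"
    echo "$cmd" >> "$cmd_log"          # keep a record of every command
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        pycoQC -f "$run_dir/sequencing_summary.txt" -o "$run.html"
    fi
done
```

Writing one report per run under a distinct name makes it easier to open the reports side by side.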

Trimming Nanopore reads with NanoFilt

Create a directory called nanofilt for your NanoFilt output in the quality_control/ont folder, change into it, and

  • remove all sequences shorter than 500 nucleotides (option -l)
  • trim the first 10 nucleotides off all reads (option --headcrop)
mkdir ~/workshop_data/quality_control/ont/nanofilt

cd ~/workshop_data/quality_control/ont/nanofilt

gunzip -c ../nanopore_basecalled-guppy.fastq.gz \
| NanoFilt -l 500 --headcrop 10  \
> ./nanofilt_trimmed.fastq

The “\” at the end of each line is only a convenience for splitting a long command over several lines. It tells the command line that the lines still belong together although they are separated by “enter” keys. If you type the whole command, i.e., paths etc., on one line, don’t use the backslashes at the end of the lines.

NanoFilt does not provide options for input or output files. Therefore we can use the two redirect operators “>” and “<” to

  • redirect an uncompressed FASTQ file, e.g. porechopped.fastq, into NanoFilt (operator <)
  • then redirect the output of NanoFilt into the file nanofilt_trimmed.fastq (>).

Alternatively, as in this example, we can use the pipe operator "|". Since NanoFilt cannot handle the compressed fastq.gz file directly, we first decompress it to stdout using gunzip -c [file] and 'pipe' this output to NanoFilt.
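The difference between the redirect and pipe operators can be tried out with standard tools. In this small sketch, tr stands in for any tool that reads from standard input and writes to standard output; the file names are made up.

```shell
#!/usr/bin/env bash
# Sketch: feeding the same filter in three ways. "tr" is only a stand-in for
# a tool that reads standard input and writes standard output.

workdir=$(mktemp -d)
printf 'acgt\n' > "$workdir/reads.txt"
gzip -c "$workdir/reads.txt" > "$workdir/reads.txt.gz"

# 1) "<" feeds a file into the tool, ">" writes the result to a file
tr 'acgt' 'ACGT' < "$workdir/reads.txt" > "$workdir/from_redirect.txt"

# 2) "|" feeds the output of one command into the next
cat "$workdir/reads.txt" | tr 'acgt' 'ACGT' > "$workdir/from_pipe.txt"

# 3) decompress to standard output first, then pipe into the tool
gunzip -c "$workdir/reads.txt.gz" | tr 'acgt' 'ACGT' > "$workdir/from_gz.txt"

cat "$workdir/from_redirect.txt" "$workdir/from_pipe.txt" "$workdir/from_gz.txt"
```

All three variants produce the same output file content; the third one avoids writing an uncompressed copy of the input to disk.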

  1. Use your favourite quality control visualization tool to check the result file and compare it to the original guppy fastq. Did NanoFilt improve the data?
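To get a feeling for what a minimum-length filter combined with head cropping does to individual reads, the effect can be imitated on a toy file. This is an awk illustration and not NanoFilt itself: the thresholds (minimum length 5, crop 2) are toy values replacing the 500 and 10 used above, and the sketch assumes that the length filter is applied before cropping.

```shell
#!/usr/bin/env bash
# Sketch: minimum-length filter plus head cropping on a toy FASTQ file.
# Not NanoFilt; thresholds and file names are hypothetical, and the length
# filter is assumed to be applied before cropping.

workdir=$(mktemp -d)
printf '@short\nACG\n+\n!!!\n@long\nACGTACGT\n+\n!!!!!!!!\n' > "$workdir/raw.fastq"

awk -v minlen=5 -v crop=2 '
    { rec[(NR - 1) % 4] = $0 }                       # collect the 4 lines of a record
    NR % 4 == 0 && length(rec[1]) >= minlen {
        print rec[0]                                 # header
        print substr(rec[1], crop + 1)               # sequence without its first bases
        print rec[2]                                 # separator
        print substr(rec[3], crop + 1)               # quality string, cropped the same way
    }' "$workdir/raw.fastq" > "$workdir/filtered.fastq"

reads_before=$(awk 'END { print NR / 4 }' "$workdir/raw.fastq")
reads_after=$(awk 'END { print NR / 4 }' "$workdir/filtered.fastq")
echo "reads before: $reads_before, after: $reads_after"
cat "$workdir/filtered.fastq"
```

The same read count comparison can be applied to the original and trimmed workshop files as a quick check before opening a full report.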

Nanopore length filtering with filtlong

If you just want to filter the reads based on read length, another good tool for nanopore reads is filtlong.

filtlong --min_length 200 --max_length 2500 reads.fastq > filtered_reads.fastq

Conclusion

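In this tutorial we checked the quality of FASTQ files to ensure that the data looks good before inferring any further information. This is the usual first step for any analysis relying on sequencing data. Quality control steps are similar for any type of sequencing data: assess the quality (FastQC, NanoPlot, PycoQC), trim and filter the reads where necessary (Cutadapt, NanoFilt, filtlong), and check the impact of the correction.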

Acknowledgement

This practical is based on content from the Galaxy Training Network tutorial for reads Quality Control and Tim Kahlke's "Long-Read, long reach Bioinformatics Tutorials".