QC - Roscoff 2022

Read quality control

2024-05-21

Olivier Rué

Christophe Klopp

EBAII 2024 - Genome assembly school

The truth about bioinformatics

https://training.galaxyproject.org/training-material/topics/assembly/tutorials/get-started-genome-assembly/slides.html#4

QC is the first step of any sequence analysis

https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html#7

QC is the first step for all sequence analyses

Seems one of the easiest steps in bioinformatics (because it is standard)
… but one of the most important
You should know what to expect in order to check that everything is OK
It tells you how to clean reads when needed
It reveals possible sequencing problems
Not all possible problems are well documented: manufacturers prefer to show the bright side
QC results must be interpreted in light of what has been sequenced

Read characteristics

length (fixed or variable, range,…)
nucleotide content:
- biological sample
- technical artifacts (primer, adapter, tag, vector, restriction site,…)
- contamination
- organelles (chloroplast, mitochondria,…)
Average error rate
Error rate profile (along the reads,…)
randomness
- GC (read GC content ~ average genome GC content)
- kmer content
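
As a minimal sketch of the GC check above, the GC fraction of each read can be computed and compared with the genome-wide value you expect. The function name `gc_fraction` is an illustrative choice, and ambiguous bases such as N are simply ignored here, which is an assumption rather than a standard.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Reads made only of ambiguous bases (or empty reads) get 0.0 by convention.
    return gc / total if total else 0.0


# Two short example reads (hypothetical input for illustration).
reads = ["CTTGGTCATTTAGAG", "CATTGGCCATATCAT"]
for read in reads:
    print(f"{read}\t{gc_fraction(read):.2f}")
```

A per-read distribution that is far from the expected genome GC content, or that shows a second peak, is a hint worth investigating (contamination, organelles, technical artifacts).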

Reads are not perfect (error rate profile)

https://doi.org/10.1093/nargab/lqab019

First contact with your sequences

The sequencing facility provides you with files containing your reads
FASTQ format
Standard format for storing the output of high-throughput sequencing instruments
Sometimes other file types (BAM, HDF5,…) are delivered, with tools to extract FASTQ files
One or two files per sample (Illumina paired-end)

FASTQ format

@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC
CTTGGTCATTTAGAG
+
***<<*AEF???***
@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC
CATTGGCCATATCAT
+
AAAE??<<*???***

Four lines per sequence:

header starting with '@'
sequence line (nucleotides)
'+' separator
quality line (quality corresponding to nucleotides)

@Identifier1 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
@Identifier2 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
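
The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, assuming each record spans exactly four lines (no wrapped sequences); `FastqRecord` and `read_fastq` are names made up for this example, not part of any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    comment: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file (or a blank line): stop reading
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"header does not start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"missing '+' separator after {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # The identifier is everything up to the first space; the rest is the comment.
        identifier, _, comment = header[1:].partition(" ")
        yield FastqRecord(identifier, comment, sequence, quality)


EXAMPLE = (
    "@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC\n"
    "CTTGGTCATTTAGAG\n"
    "+\n"
    "***<<*AEF???***\n"
    "@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC\n"
    "CATTGGCCATATCAT\n"
    "+\n"
    "AAAE??<<*???***\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
for record in records:
    print(record.identifier, len(record.sequence))
```

For real analyses, an established parser is usually a safer choice than a hand-written one.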

Quality score encoding

The base quality encoding scheme depends on the sequencer version.
Most files produced these days are Sanger compliant (Phred+33).

Quality score

Measure of the confidence in each base call (nucleobase identification) produced by automated DNA sequencing

https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html#12
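
A Phred score Q relates to the base-calling error probability P through Q = -10 log10(P), and in Sanger-compliant files each score is stored as the character with ASCII code Q + 33. The sketch below decodes a quality line under that assumption; the function names are illustrative only.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Sanger offset assumed)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-score / 10)


def mean_error_rate(quality: str, offset: int = 33) -> float:
    """Average the per-base error probabilities of one read."""
    scores = phred_scores(quality, offset)
    return sum(error_probability(q) for q in scores) / len(scores)


quality_line = "***<<*AEF???***"
print(phred_scores(quality_line))
print(f"mean error rate: {mean_error_rate(quality_line):.4f}")
```

Note that the mean error rate is computed from the probabilities, not from the mean of the Phred scores: averaging scores directly underestimates the error rate.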

FASTQ compression

Compression is essential for managing FASTQ files (it reduces disk usage)
Compressed files: filename.fastq.gz or filename.fq.gz
(Almost) all tools can read compressed files directly
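
Reading a compressed file does not require decompressing it on disk first. The sketch below, which assumes four-line records and relies on the `.gz` extension to detect compression, opens a FASTQ file either way and counts its reads; `open_fastq` and `count_reads` are hypothetical helper names, and the demo file is created on the fly.

```python
import gzip
import os
import tempfile


def open_fastq(path: str):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path: str) -> int:
    """Count reads, assuming exactly four lines per record."""
    with open_fastq(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


# Demo on a small temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.fastq.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write("@read1\nCTTGGTCATTTAGAG\n+\n***<<*AEF???***\n")
        out.write("@read2\nCATTGGCCATATCAT\n+\nAAAE??<<*???***\n")
    demo_count = count_reads(demo_path)

print(demo_count)
```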

Answers to (not always) simple questions:

Is the data as I expect?
- Number of files/samples
- Number of reads in files
- Quality/Length/Composition of reads
Are there residual adapters or indexes (non-biological information)?
Are there (un)expected technical biases?
Are there (un)expected biological biases?
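
Some of these questions can be answered with very little code. The following sketch summarizes read lengths and checks that two paired-end files list the same reads in the same order. It assumes the sequences and identifiers have already been loaded into lists, and that both mates share the identifier found before the first space of the header; the function names are illustrative.

```python
from typing import Dict, List


def length_summary(sequences: List[str]) -> Dict[str, float]:
    """Return read count and min/max/mean read length."""
    if not sequences:
        return {"reads": 0, "min_length": 0, "max_length": 0, "mean_length": 0.0}
    lengths = [len(seq) for seq in sequences]
    return {
        "reads": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": sum(lengths) / len(lengths),
    }


def pairs_are_consistent(r1_ids: List[str], r2_ids: List[str]) -> bool:
    """True when R1 and R2 contain the same identifiers in the same order."""
    return len(r1_ids) == len(r2_ids) and all(
        first == second for first, second in zip(r1_ids, r2_ids)
    )


print(length_summary(["CTTGGTCATTTAGAG", "CATTGGCCATATCAT"]))
print(pairs_are_consistent(["read1", "read2"], ["read1", "read2"]))
```

A fixed length where a variable one is expected (or the reverse), or a mismatch between R1 and R2, is worth clarifying with the sequencing facility before going further.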

Data for learning to assemble reads

Sequencing of the Saccharomyces cerevisiae genome
A species of yeast (single-celled fungal microorganism)
Genome of about 12,156,677 bp and 6,275 genes, compactly organized on 16 chromosomes
GC content ≈ 38-39%
Illumina/PacBio/ONT datasets

Sequencing data

Subsampled to 30x coverage only, to reduce computation time
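
Coverage is the total number of sequenced bases divided by the genome size, so the amount of data behind "30x" can be worked out directly. The sketch below uses the genome size quoted earlier; the 250 bp read length and the full-dataset size are hypothetical values chosen only to illustrate the arithmetic.

```python
import math

GENOME_SIZE = 12_156_677  # S. cerevisiae genome size in bp, as quoted above


def bases_for_coverage(genome_size: int, coverage: int) -> int:
    """Total number of bases needed to reach the target coverage."""
    return genome_size * coverage


def reads_for_coverage(genome_size: int, coverage: int, mean_read_length: int) -> int:
    """Number of reads of a given mean length needed for the target coverage."""
    return math.ceil(genome_size * coverage / mean_read_length)


def subsampling_fraction(total_bases: int, genome_size: int, target_coverage: int) -> float:
    """Fraction of the dataset to keep in order to reach the target coverage."""
    return min(1.0, target_coverage * genome_size / total_bases)


print(bases_for_coverage(GENOME_SIZE, 30))       # bases needed for 30x
print(reads_for_coverage(GENOME_SIZE, 30, 250))  # assuming 250 bp reads
# Assuming a (hypothetical) full dataset sequenced at 100x:
print(subsampling_fraction(GENOME_SIZE * 100, GENOME_SIZE, 30))
```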

FastQC

Provides graphics to spot problems originating from the sequencer, library preparation, contamination…

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Hands-on

Log in to Galaxy
Create a history called QC
Upload the data (next slide)
Run FastQC on each FASTQ file
Run MultiQC on Illumina data

Shared data

Shared Data / Data Libraries
EBAII A&A 2022
Assembly
- HiFi PacBio / SRR13577847_subreads.30x.fastq
- ONT / SRR18726953_1.30x.fastq
- Illumina MiSeq PE
  - SRR15597408_1.30x.fastq
  - SRR15597408_2.30x.fastq

Basic statistics

Per base sequence quality

Per base sequence quality - Illumina

Comparison R1/R2

GC content

GC content / contamination

Per base sequence content

Other QC tools

SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
- https://doi.org/10.1371/journal.pone.0163962

Meta QC tools

MultiQC: Summarize analysis results for multiple tools and samples in a single report
- http://dx.doi.org/10.1093/bioinformatics/btw354

Nanopore QC tools

NanoPlot: Plotting tool for long read sequencing data and alignments
- https://doi.org/10.1093/bioinformatics/bty149

Kmer based QC tools

GenomeScope 2.0: Reference-free profiling of polyploid genomes from k-mer spectra
- https://doi.org/10.1038/s41467-020-14998-3
- https://github.com/tbenavi1/genomescope2.0
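
K-mer based tools start from a k-mer spectrum: a histogram of how many distinct k-mers occur once, twice, and so on in the reads. The sketch below builds such a histogram in pure Python to illustrate the idea; it is not how GenomeScope or dedicated k-mer counters are implemented, and the function names are made up for this example. K-mers are counted in canonical form (the lexicographically smaller of a k-mer and its reverse complement), and k-mers containing N are skipped.

```python
from collections import Counter
from typing import Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(sequences: Iterable[str], k: int) -> Counter:
    """Count canonical k-mers over all reads, skipping k-mers that contain N."""
    counts: Counter = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            if "N" not in kmer:
                counts[canonical(kmer)] += 1
    return counts


def kmer_spectrum(counts: Counter) -> Counter:
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


toy_counts = count_kmers(["ACGTACGT"], k=4)
print(sorted(kmer_spectrum(toy_counts).items()))
```

On real data, k-mers seen only once or twice are dominated by sequencing errors, while the main peak of the spectrum reflects the sequencing depth.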

Other Kmer based QC tools

KAT: The K-mer Analysis Toolkit
- https://doi.org/10.1093/bioinformatics/btw663

Tools for cleaning reads: trimming & co

fastp: Trims reads by quality, filters them by length, removes adapters…
- https://doi.org/10.1093/bioinformatics/bty560
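
To make the idea of quality trimming concrete, here is a deliberately simplified sketch that removes low-quality bases from the 3' end of a read. It is not fastp's algorithm; the threshold of 20, the Phred+33 offset, and the function name `trim_3prime` are assumptions made for illustration.

```python
from typing import Tuple


def trim_3prime(
    sequence: str, quality: str, min_quality: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_quality."""
    end = len(sequence)
    # Walk back from the 3' end until a base of sufficient quality is found.
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


trimmed_sequence, trimmed_quality = trim_3prime("CTTGGTCATTTAGAG", "***<<*AEF???***")
print(trimmed_sequence, trimmed_quality)
```

Reads that become too short after trimming are usually discarded, which is why trimming and length filtering go together.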

Tools for cleaning reads: decontamination

Kraken: System for assigning taxonomic labels to short DNA sequences
- https://dx.doi.org/10.1186/gb-2014-15-3-r46
ReadItAndKeep: rapid decontamination of SARS-CoV-2 sequencing reads
- https://doi.org/10.1093/bioinformatics/btac311
BWA: Fast and accurate short read alignment with Burrows-Wheeler transform
- https://doi.org/10.1093/bioinformatics/btp324

Take home messages

Don't skip read QC!
Use tools adapted to your reads (platform, experiment,…)
QC lets you distinguish potential problems by severity:
- serious: back to sequencing facility
- medium: adapt your strategy (contamination, trimming…)
