
Read quality control

2024-05-21

Olivier Rué

Christophe Klopp

EBAII 2024 - Genome assembly school



The truth about bioinformatics

https://training.galaxyproject.org/training-material/topics/assembly/tutorials/get-started-genome-assembly/slides.html#4


QC is the first step of any sequence analysis

https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html#7


QC is the first step for all sequence analyses

  • Seems to be one of the easiest steps in bioinformatics (because it is standardized)
  • but one of the most important
  • You need to know what to expect in order to check that everything is OK
  • It gives information about how to clean reads when needed
  • It shows possible sequencing problems
  • Not all possible problems are well documented: manufacturers prefer to show the bright side
  • QC results must be interpreted in light of what has been sequenced

Read characteristics

  • length (fixed or variable, range, etc.)
  • nucleotide content:
    • biological sample
    • technical artifacts (primer, adapter, tag, vector, restriction site, etc.)
    • contamination
    • organelles (chloroplast, mitochondria, etc.)
  • average error rate
  • error rate profile (along the reads, etc.)
  • randomness
    • GC (read GC content ~ average genome GC content)
    • kmer content
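
Two of the characteristics listed above, GC content and k-mer content, are simple to compute for a single read. The following is a minimal sketch rather than the method used by any particular QC tool; the function names `gc_content` and `kmer_counts` are hypothetical, and ambiguous bases such as N are simply ignored in the GC fraction.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a read."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer of length k in a read."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


read = "CTTGGTCATTTAGAG"  # illustrative 15-base read
print(f"GC fraction: {gc_content(read):.2f}")
print("Most frequent 2-mers:", kmer_counts(read, 2).most_common(2))
```

Averaged over many reads, the GC fraction can be compared with the expected genome GC content; a strong deviation is a hint, not proof, of contamination or bias.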

Reads are not perfect (error rate profile)


https://doi.org/10.1093/nargab/lqab019


First contact with your sequences

  • The sequencing facility provides you with files containing your reads
  • FASTQ format
  • Standard format for storing the output of high-throughput sequencing instruments
  • Sometimes other file types (BAM, HDF5, etc.), with tools to extract FASTQ files
  • One or two files per sample (two for Illumina paired-end)

FASTQ format

@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC
CTTGGTCATTTAGAG
+
***<<*AEF???***
@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC
CATTGGCCATATCAT
+
AAAE??<<*???***

Four lines per sequence:

  • header starting with '@'
  • sequence line (nucleotides)
  • '+' separator
  • quality line (quality corresponding to nucleotides)
@Identifier1 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
@Identifier2 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
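
The four-line layout above can be read with a few lines of Python. This is a minimal sketch, assuming strictly four lines per record (no wrapped sequences) and no blank lines between records; `read_fastq` is a hypothetical helper, not part of any library.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file (or a blank line, which this sketch treats as the end)
        seq = handle.readline().rstrip("\r\n")
        sep = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# The two example records shown on the FASTQ format slide.
example = io.StringIO(
    "@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC\n"
    "CTTGGTCATTTAGAG\n+\n***<<*AEF???***\n"
    "@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC\n"
    "CATTGGCCATATCAT\n+\nAAAE??<<*???***\n"
)
records = list(read_fastq(example))
for header, seq, qual in records:
    print(header.split()[0], len(seq), "bases")
```

The length check reflects the rule stated above: the quality line carries one symbol per nucleotide.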


Quality score encoding

  • The base quality encoding scheme depends on the sequencer version.
  • Most files produced these days are Sanger-compliant (Phred+33).

Quality score

Measure of the quality of the identification of the nucleobases generated by automated DNA sequencing

https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html#12
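
A Phred quality score Q relates to the probability P that a base call is wrong by Q = -10 · log10(P), so Q20 means 1 error in 100 and Q30 means 1 error in 1,000. The sketch below decodes a quality line under the assumption of Sanger (Phred+33) encoding, where each character's ASCII code minus 33 is the score; the function names are hypothetical.

```python
from typing import List


def phred_scores(quality_line: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality line to Phred scores (ASCII code minus offset)."""
    return [ord(char) - offset for char in quality_line]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def mean_error_rate(quality_line: str, offset: int = 33) -> float:
    """Average per-base error probability (averaging probabilities, not scores)."""
    scores = phred_scores(quality_line, offset)
    return sum(error_probability(q) for q in scores) / len(scores)


qual = "***<<*AEF???***"  # quality line of the first example record
scores = phred_scores(qual)
print(scores)
print(f"mean error rate: {mean_error_rate(qual):.4f}")
```

The mean is taken over error probabilities because averaging the scores themselves would understate the impact of low-quality bases.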


FASTQ compression

  • Compression is essential to manage FASTQ files (reduce disk storage)
  • compressed files:
    filename.fastq.gz
    filename.fq.gz
  • Almost all tools can read compressed files directly
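
In your own scripts, gzip-compressed FASTQ files can be read without decompressing them on disk. The following is a minimal sketch using Python's standard `gzip` module; `open_fastq` and `count_reads` are hypothetical helpers, and the read count assumes exactly four lines per record.

```python
import gzip
import os
import tempfile
from typing import TextIO


def open_fastq(path: str) -> TextIO:
    """Open a FASTQ file as text, whether or not it is gzip-compressed."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path: str) -> int:
    """Number of reads, assuming exactly four lines per FASTQ record."""
    with open_fastq(path) as handle:
        return sum(1 for _ in handle) // 4


# Demo on a tiny made-up file written to a temporary directory.
demo = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "demo.fastq.gz")
    with gzip.open(gz_path, "wt") as out:
        out.write(demo)
    n_reads = count_reads(gz_path)
print(n_reads, "reads")
```

Choosing the opener from the file extension is a simplification; a more robust version could inspect the first bytes of the file instead.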

Answer to (not always) simple questions:

  • Is the data what I expect?
    • Number of files/samples
    • Number of reads in files
    • Quality/Length/Composition of reads
  • Residual presence of adapters or indexes (non-biological information)?
  • Are there (un)expected technical biases?
  • Are there (un)expected biological biases?

Data for learning to assemble reads

  • Sequencing of Saccharomyces cerevisiae genome
  • A species of yeast (single-celled fungal microorganism)
  • Genome composed of about 12,156,677 bp and 6,275 genes, compactly organized on 16 chromosomes
  • GC content ≈ 38-39%
  • Illumina/PacBio/ONT datasets

Sequencing data

  • Subsampled to 30x coverage to reduce computing time
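
Sequencing depth is the total number of sequenced bases divided by the genome size, so the amount of data behind "30x" can be worked out from the genome size quoted earlier (12,156,677 bp). The sketch below shows this arithmetic only; it does not describe how the course datasets were actually subsampled, and the 250 bp read length is a hypothetical value chosen for illustration.

```python
import math

GENOME_SIZE = 12_156_677  # S. cerevisiae genome size quoted in these slides (bp)


def depth(total_bases: int, genome_size: int = GENOME_SIZE) -> float:
    """Average sequencing depth: total sequenced bases / genome size."""
    return total_bases / genome_size


def reads_for_depth(target_depth: float, mean_read_length: float,
                    genome_size: int = GENOME_SIZE) -> int:
    """Number of reads needed to reach a target depth at a given mean read length."""
    return math.ceil(target_depth * genome_size / mean_read_length)


bases_30x = 30 * GENOME_SIZE
print(f"bases needed for 30x: {bases_30x:,}")
# Hypothetical mean read length of 250 bp (assumption, not from the slides).
print(f"reads of 250 bp needed: {reads_for_depth(30, 250):,}")
```

For paired-end data, the resulting read count covers both files together, so each file would hold about half of it.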

FastQC

  • Provides graphics to spot problems originating from the sequencer, library preparation, or contamination

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/


Hands-on

  • Log in to Galaxy
  • Create a history called QC
  • Upload the data (next slide)
  • Run FastQC on each FASTQ file
  • Run MultiQC on Illumina data

Shared data

  • Shared Data / Data Libraries
  • EBAII A&A 2022
  • Assembly
    • PacBio HiFi / SRR13577847_subreads.30x.fastq
    • ONT / SRR18726953_1.30x.fastq
    • Illumina MiSeq PE
      • SRR15597408_1.30x.fastq
      • SRR15597408_2.30x.fastq

Basic statistics



Per base sequence quality



Per base sequence quality - Illumina

  • Comparison R1/R2


GC content



GC content / contamination


Per base sequence content



Other QC tools


Meta QC tools


Nanopore QC tools


Kmer based QC tools


Other Kmer based QC tools


Tools for cleaning reads: trimming & co


Tools for cleaning reads: decontamination


Take home messages

  • Don't skip read QC!
  • Use tools adapted to your reads (platform, experiment, etc.)
  • QC lets you distinguish between potential problems
    • serious: back to sequencing facility
    • medium: adapt your strategy (contamination, trimming)