FASTQ format specification

Main Objectives

This recipe describes the legacy and currently used FASTQ variants. It aims to help users understand the FASTQ format, tell its variants apart, and convert from one variant to another.

User Stories

As a ..	I want to ..	So that I can ..
Software developer	understand the different FASTQ variants and platform-specific identifiers	develop appropriate parsers
Data consumer	determine the format of legacy FASTQ datasets	reuse and integrate them with data from other sources

Capability & Maturity Table

Capability	Initial Maturity Level	Final Maturity Level
TBD	TBD	TBD

FAIRification Objectives, Inputs and Outputs

Requirements

This recipe is aimed at anyone interested in the FASTQ format. No specific prior knowledge is needed to understand this document.

Ingredients

This is a descriptive recipe, so no ingredients are required.

Introduction

FASTQ is the de facto format for sequence data exchange. It offers a simple way to store raw sequences along with the quality score associated with each base call. Unfortunately, several incompatible FASTQ variants exist, and the community has not explicitly agreed on a standard.

General Specification

Format description

A FASTQ file describes a collection of sequence reads, their quality scores and other information. Each read is described by four plain text lines in the following format:

1st: Contains a read identifier and possibly other information. This line must start with the symbol "@".
2nd: Nucleotide base calls.
3rd: A second defline for extra information. This line must start with the symbol "+", apart from which the line can be left empty.
4th: Per-base quality scores, usually PHRED scores (described below).

An example of a read in FASTQ format is shown below.

@SEQ_ID and other information
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Example 1: Read in FASTQ format.
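
The four-line layout above can be read with a few lines of code. The following is a minimal sketch, assuming each read occupies exactly four lines (no wrapped sequence or quality lines); the names `FastqRecord` and `read_fastq` are hypothetical and chosen for illustration only.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # 1st line without the leading "@"
    sequence: str    # 2nd line: nucleotide base calls
    comment: str     # 3rd line without the leading "+" (may be empty)
    quality: str     # 4th line: encoded per-base quality scores


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream (assumption: four lines per read)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between or after records
        header = header.rstrip("\r\n")
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError("1st line must start with '@'")
        if not plus.startswith("+"):
            raise ValueError("3rd line must start with '+'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, plus[1:], quality)


# The read from Example 1, held in memory instead of a file.
example = io.StringIO(
    "@SEQ_ID and other information\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(example))
print(records[0].identifier, len(records[0].sequence))
```

The length check reflects the fact that the 4th line carries one quality character per base call.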

FASTQ variants

This recipe uses terminology from the EDAM ontology to refer to FASTQ variants.

The following variants fit the general specification stated above but differ in the quality score used (PHRED or Solexa scores) and/or the ASCII offset (33 or 64). These differences make them mutually incompatible and sometimes hard to distinguish.

The most widely used variant, which can be considered a de facto standard, is FASTQ-sanger. The other two variants, FASTQ-solexa and FASTQ-illumina, were introduced by Solexa/Illumina (Solexa was acquired by Illumina, Inc. in 2006) for use with its sequencers. Illumina, however, has used the Sanger variant since 2011 (Illumina 1.8+).

Variant	Quality score type	Quality score range	\(P_e\) range	ASCII range	ASCII offset	Symbol range
Sanger	PHRED	\(0\) to \(93\)	\(1\) to \(10^{-9.3}\)	\(33–126\)	\(33\)	! to ~
Solexa	Solexa	\(−5\) to \(62\)	\(\sim0.75\) to \(\sim10^{-6.2}\)	\(59–126\)	\(64\)	; to ~
Illumina 1.3-1.8	PHRED	\(0\) to \(62\)	\(1\) to \(10^{-6.2}\)	\(64–126\)	\(64\)	@ to ~

Table 1: Overview of the FASTQ variants.

Figure 1: Comparison of the ASCII mappings used by the different variants (modified from https://training.galaxyproject.org/training-material/topics/introduction/tutorials/galaxy-intro-ngs-data-managment/tutorial.html)
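
Because the ASCII ranges in Table 1 overlap, the characters observed in a file can only rule variants out, not prove one. The sketch below lists the variants whose ASCII range contains every observed quality character; the function name `compatible_variants` is hypothetical, and the ranges are taken from Table 1.

```python
# ASCII ranges (decimal, inclusive) from Table 1.
ASCII_RANGES = {
    "fastq-sanger": (33, 126),
    "fastq-solexa": (59, 126),
    "fastq-illumina": (64, 126),
}


def compatible_variants(quality_strings):
    """Return the variants whose ASCII range covers all observed characters."""
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        raise ValueError("no quality characters to inspect")
    lowest, highest = min(codes), max(codes)
    return [
        name
        for name, (low, high) in ASCII_RANGES.items()
        if low <= lowest and highest <= high
    ]


# Characters below ';' (ASCII 59) can only come from FASTQ-sanger.
print(compatible_variants([";;;;;;;;;;;9;7;;.7;393333"]))
# A string using only characters from '@' upwards fits all three ranges.
print(compatible_variants(["ZZZZZZZZZZZXZVZZMVZRXRRRR"]))
```

When more than one variant remains, more reads or knowledge of the sequencing platform and software version is needed to decide.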

FASTQ-sanger

FASTQ-sanger was the first FASTQ variant. Jim Mullikin developed it at the Wellcome Trust Sanger Institute. It is currently the most widely used FASTQ variant and has been adopted by many sequencing data archives, such as the NCBI's Sequence Read Archive, which provides all FASTQ files in FASTQ-sanger regardless of the variant in which they were submitted, and by sequence processing tools such as SSAHA2, MAQ, Velvet, BWA and Bowtie.
FASTQ-sanger uses PHRED scores, originally conceived for this variant, and ASCII characters from 33 to 126.

PHRED scores

PHRED scores have become the de facto standard for representing base call quality. For a given nucleotide call, a PHRED score (\(Q_{PHRED}\)) is calculated from the estimated probability of error (\(P_e\)) as follows:
\[ \begin{aligned} Q_{PHRED} = -10\cdot \log_{10}(P_e) \end{aligned} \]

PHRED scores range from 0 to 93, representing error probabilities from \(1.0\) (a certainly wrong call) to \(10^{-9.3}\).
In FASTQ-sanger, PHRED scores are encoded with the printable ASCII characters 33-126 (decimal), that is, with an offset of 33.

An example of a short FASTQ-sanger file is shown below:

@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333

Example 2: One 25 nucleotide-long read in FASTQ-sanger format.
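
As an illustration of the encoding, the quality line of Example 2 can be decoded back to PHRED scores and error probabilities. This is a minimal sketch; the function names `decode_quality` and `phred_to_error_probability` are hypothetical.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn an encoded quality string into integer scores (default: Sanger offset)."""
    return [ord(ch) - offset for ch in quality]


def phred_to_error_probability(q_phred: float) -> float:
    """Invert Q_PHRED = -10 * log10(P_e)."""
    return 10 ** (-q_phred / 10)


scores = decode_quality(";;;;;;;;;;;9;7;;.7;393333")
print(scores)

# Lowest score in the read and the error probability it stands for.
worst = min(scores)
print(worst, phred_to_error_probability(worst))
```

The character ';' has ASCII code 59, so it decodes to a PHRED score of 59 - 33 = 26.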

FASTQ-solexa

FASTQ-solexa was introduced in 2004 by Solexa and was used by Illumina 1.1 to 1.3.

Solexa scores

FASTQ-solexa uses an alternative formula to calculate quality scores \(Q_{solexa}\) from each call's estimated probability of error \(P_e\):
\[ \begin{aligned} Q_{solexa} = -10\cdot \log_{10} \left( \frac{P_e}{1-P_e} \right) \end{aligned} \]

Solexa scores range from -5 to 62.
FASTQ-solexa uses an ASCII offset of 64 so that it can represent values lower than 0. As a consequence, the character range for Solexa is 59 to 126.
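
The Solexa formula and its ASCII encoding can be sketched as follows. The function names `solexa_score` and `encode_solexa` are hypothetical; the bounds are the -5 to 62 score range implied by ASCII characters 59 to 126 with an offset of 64.

```python
import math


def solexa_score(p_error: float) -> float:
    """Q_solexa = -10 * log10(P_e / (1 - P_e)), for 0 < P_e < 1."""
    return -10 * math.log10(p_error / (1 - p_error))


def encode_solexa(score: int) -> str:
    """Encode an integer Solexa score as one ASCII character (offset 64)."""
    if not -5 <= score <= 62:
        raise ValueError("Solexa scores must lie between -5 and 62")
    return chr(score + 64)


# An error probability of 0.75 gives a negative score close to -5.
print(round(solexa_score(0.75), 2))
# The lowest and highest scores map to the ends of the character range.
print(encode_solexa(-5), encode_solexa(62))
```

Unlike PHRED scores, Solexa scores become negative once the error probability exceeds 0.5, which is why the lowest character (';', ASCII 59) lies below the offset.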

An example of Solexa FASTQ format is shown below. This example is the FASTQ-solexa equivalent of the FASTQ-sanger (Example 2) and FASTQ-illumina (Example 4) examples.

@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
ZZZZZZZZZZZXZVZZMVZRXRRRR

Example 3: One 25 nucleotide-long read in FASTQ-solexa format.

FASTQ-illumina

Illumina versions 1.3 to 1.8 use PHRED scores instead of the previous Solexa scores. The PHRED scores are encoded with an ASCII offset of 64, thus ranging from 0 to 62 (ASCII 64-126). The expected values for raw data are in the range 0-40.

An example of FASTQ-illumina format is shown below. This example is the FASTQ-illumina equivalent of the FASTQ-sanger (Example 2) and FASTQ-solexa (Example 3) examples.

@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
ZZZZZZZZZZZXZVZZMVZRXRRRR

Example 4: One 25 nucleotide-long read in FASTQ-illumina format.

Example 4 is identical to Example 3. This happens because \(Q_{PHRED}\) and \(Q_{solexa}\) are almost equal for high enough scores. When scores are low, however, differences become apparent.
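
Eliminating \(P_e\) from the two formulas gives \(Q_{PHRED} = 10\cdot \log_{10}(10^{Q_{solexa}/10} + 1)\). The sketch below uses this relation to show where the two scales agree and where they diverge; the function name `solexa_to_phred` is hypothetical.

```python
import math


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to the PHRED score of the same error probability."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# Low scores differ clearly; the scores used in Examples 3 and 4 do not.
for q_solexa in (-5, 0, 5, 13, 18, 22, 24, 26):
    q_phred = solexa_to_phred(q_solexa)
    print(q_solexa, round(q_phred, 2), round(q_phred))
```

For the scores that occur in Examples 3 and 4 (13, 18, 22, 24 and 26), the converted value rounds to the same integer, so both variants produce the same characters. A Solexa score of 0, in contrast, corresponds to a PHRED score of about 3.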

In Illumina versions 1.5 to 1.8 PHRED scores 0 to 2 have a slightly different meaning:

0 and 1 are no longer used.

Value 2, encoded by ASCII 66 "B", is used at the end of a read when the segment is mostly low quality, as a Read Segment Quality Control Indicator. The Illumina documentation states the following:

"At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads. A quality score of 2, encoded as a 'B', is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis."
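
One possible way to honour this indicator is to drop the trailing run of "B" characters together with the corresponding bases. This is only a sketch of one option, assuming an Illumina 1.5 to 1.8 file with an offset of 64; the function name `trim_b_tail` is hypothetical, and whether trimming or masking is preferable depends on the downstream analysis.

```python
from typing import Tuple


def trim_b_tail(sequence: str, quality: str) -> Tuple[str, str]:
    """Remove the trailing run of 'B' quality characters and the matching bases."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    kept = len(quality.rstrip("B"))  # number of positions before the 'B' tail
    return sequence[:kept], quality[:kept]


# Illustrative read: the last three positions are flagged with 'B'.
print(trim_b_tail("ACGTACGT", "hhhgfBBB"))
```

A "B" in the middle of a read is left untouched, since only the run at the end of the read is flagged.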

Variants conversion

FASTQ-sanger, FASTQ-solexa and FASTQ-illumina are mutually incompatible. The variant of a given FASTQ file can be guessed with FastQC. Tools that can convert between FASTQ variants include QualityIO from Biopython, SeqIO from BioPerl, the Bio::FastQ::FormatData class from BioRuby and the org.biojava.bio.program.fastq package from BioJava.
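
For FASTQ-illumina and FASTQ-sanger, which both hold PHRED scores, conversion only shifts the ASCII offset from 64 to 33. The following is a minimal sketch of that re-encoding for a single quality string; the function name `illumina_to_sanger` is hypothetical. Converting from FASTQ-solexa additionally requires mapping Solexa scores to PHRED scores.

```python
def illumina_to_sanger(quality: str) -> str:
    """Re-encode PHRED scores from offset 64 (FASTQ-illumina) to offset 33 (FASTQ-sanger)."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64
        if not 0 <= score <= 62:
            raise ValueError(f"character {ch!r} is outside the FASTQ-illumina range")
        converted.append(chr(score + 33))
    return "".join(converted)


# The quality line of Example 4 becomes the quality line of Example 2.
print(illumina_to_sanger("ZZZZZZZZZZZXZVZZMVZRXRRRR"))
```

For whole files, the same conversion can be delegated to a library. In Biopython, for instance, `SeqIO.convert(in_path, "fastq-illumina", out_path, "fastq-sanger")` performs it at file level.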

Platform specific considerations

The different sequencing platforms use specific patterns for the read description that encode valuable information in a standardized way. Some of them are listed below:

Recent Illumina fastq

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index>
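
A read identifier following this pattern can be split into named fields with a regular expression. This is a sketch under assumptions not stated above: numeric fields contain only digits, `<is filtered>` is "Y" or "N", and the sample header is illustrative rather than taken from a real run. The names `ILLUMINA_HEADER` and `parse_illumina_header` are hypothetical.

```python
import re

# One named group per field of the pattern above.
ILLUMINA_HEADER = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run_number>\d+):(?P<flowcell_id>[^:\s]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x_pos>\d+):(?P<y_pos>\d+)"
    r" (?P<read>\d+):(?P<is_filtered>[YN]):(?P<control_number>\d+):(?P<index>\S+)$"
)


def parse_illumina_header(line: str) -> dict:
    """Return the fields of a recent Illumina read identifier as strings."""
    match = ILLUMINA_HEADER.match(line.strip())
    if match is None:
        raise ValueError("line does not follow the recent Illumina pattern")
    return match.groupdict()


fields = parse_illumina_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(fields["instrument"], fields["lane"], fields["read"], fields["index"])
```

Keeping the values as strings avoids guessing at types; callers can convert the fields they need.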

Older Illumina fastq

@<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read>

QIIME de-multiplexed sequences in fastq

@<SampleID-based_identifier> <Original_information> orig_bc=<original_barcode> new_bc=<corrected_barcode> bc_diffs=<0|1>

PacBio CCS (Circular Consensus Sequence) or RoI (Read of Insert) read

@<MovieName>/<ZMW_number>

PacBio CCS subread

@<MovieName>/<ZMW_number>/<subread-start>_<subread-end>
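
The subread pattern can be split on its "/" and "_" separators. This is a minimal sketch, assuming the movie name contains no "/" and that the ZMW number and subread coordinates are integers; the function name `parse_pacbio_subread` and the sample identifier are hypothetical.

```python
def parse_pacbio_subread(header: str) -> dict:
    """Split a PacBio subread identifier into movie, ZMW and subread coordinates."""
    if not header.startswith("@"):
        raise ValueError("read identifier line must start with '@'")
    movie, zmw, span = header[1:].strip().split("/")
    start, end = span.split("_")
    return {"movie": movie, "zmw": int(zmw), "start": int(start), "end": int(end)}


fields = parse_pacbio_subread("@movie_1/553/3100_11230")
print(fields)
```

Splitting from the separators rather than on every underscore matters because movie names may themselves contain underscores.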

Helicos fastq with a fixed ASCII-based Phred value for quality

@VHE-242383071011-15-1-0-2

A PHRED value of 14 is encoded with the character '/' (ASCII 47, offset 33).

File extension

.fastq and .fq are usually used, but no standard file extension exists for this format.

References

Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010;38(6):1767-1771. doi:10.1093/nar/gkp1137
File Format Guide, NCBI, viewed 20th July 2020.
QualityIO.py documentation, BioPython, viewed 20th July 2020.
Anton Nekrutenko, NGS data logistics, Galaxy Project, viewed 20th July 2020.
FASTQ files explained, Illumina, viewed 20th July 2020.
Illumina Quality Scores, Tobias Mann, Bioinformatics, San Diego, Illumina.

Authors

Name	Institute	ORCID	Contributions
Eva Martin	Barcelona Supercomputing Center (BSC)	0000-0001-8324-2897	Writing - Original Draft
Fuqi Xu	EMBL-EBI	0000-0002-5923-3859	Reviewing
