This recipe describes the historical and currently used FASTQ variants. Its aim is to help users understand the FASTQ format, tell its variants apart, and convert from one variant to another.
As a .. | I want to .. | So that I can .. |
---|---|---|
Software developer | understand the different FASTQ variants and platform-specific identifiers | develop appropriate parsers |
Data consumer | determine the format of legacy FASTQ datasets | reuse and integrate them along with data from other sources |
Capability | Initial Maturity Level | Final Maturity Level |
---|---|---|
TBD | TBD | TBD |
This recipe is aimed at anyone interested in the FASTQ format. No specific prior knowledge is needed to understand this document.
This is a descriptive recipe, so no ingredients are required.
FASTQ is the de facto format for sequence data exchange. It offers a simple way to store raw sequences along with the quality scores associated with each base call. Unfortunately, several incompatible FASTQ variants exist, and the community has not explicitly agreed on a standard.
A FASTQ file describes a collection of sequence reads, their quality scores and other information. Each read is described by four plain-text lines: a line starting with '@' followed by the read identifier and an optional description, the raw sequence, a separator line starting with '+', and the quality scores encoded as one ASCII character per base.
An example of a read in FASTQ format is shown below.
@SEQ_ID and other information
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Example 1: Read in FASTQ format.
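The four-line layout can be read with a few lines of code. The following is a minimal sketch, not a reference parser: it assumes every read occupies exactly four lines (no wrapped sequences), and the names `FastqRecord` and `read_fastq` are hypothetical helpers introduced only for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # header line without the leading '@'
    sequence: str    # raw base calls
    quality: str     # quality string, still ASCII-encoded


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"header must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"third line must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Example 1 from this recipe, held in memory instead of a file.
example = io.StringIO(
    "@SEQ_ID and other information\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
for record in read_fastq(example):
    print(record.identifier, len(record.sequence), len(record.quality))
```

The quality string is deliberately left encoded here, because decoding it depends on the FASTQ variant.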
This recipe uses terminology from the EDAM ontology to refer to FASTQ variants.
The following variants fit the general specification stated above but differ in the quality score used (PHRED or Solexa scores) and/or the ASCII mapping (offset of 33 or 64), which makes them incompatible and sometimes hard to distinguish.
The most widely used variant, which can be considered a de facto standard, is FASTQ-sanger. The other two variants, FASTQ-solexa and FASTQ-illumina, were introduced by Solexa/Illumina (Solexa was acquired by Illumina, Inc. in 2006) for use with their sequencers. Illumina, however, has used the Sanger variant since 2011 (Illumina 1.8+).
Variant | Quality score type | Quality score range | ASCII range | ASCII offset | Symbol range |
---|---|---|---|---|---|
Sanger | PHRED | 0 to 93 | 33 to 126 | 33 | ! to ~ |
Solexa | Solexa | -5 to 62 | 59 to 126 | 64 | ; to ~ |
Illumina 1.3-1.8 | PHRED | 0 to 62 | 64 to 126 | 64 | @ to ~ |
Table 1: Overview of the FASTQ variants.
FASTQ-sanger was the first FASTQ variant. Jim Mullikin developed it at the Wellcome Trust Sanger Institute. It is currently the most widely used FASTQ variant and has been adopted by many sequencing data archives, such as the NCBI's Sequence Read Archive, which provides all FASTQ files in FASTQ-sanger regardless of the variant in which they were submitted, and by sequence processing tools such as SSAHA2, MAQ, Velvet, BWA and Bowtie.
FASTQ-sanger uses PHRED scores, originally conceived for this variant, and ASCII characters from 33 to 126.
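PHRED scores have become the de facto standard for representing the quality of sequencing base calls. For a given nucleotide call, a PHRED score ($Q_{PHRED}$) is defined from the estimated probability $P_e$ that the call is wrong: $Q_{PHRED} = -10 \log_{10}(P_e)$.
PHRED scores range from 0 to 93, representing error probabilities from $1$ (a certainly wrong call) down to $10^{-9.3}$ (an extremely accurate call).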
In FASTQ-sanger, PHRED scores are encoded with the printable ASCII characters 33-126 (decimal), that is, with an offset of 33.
A short example of a FASTQ-sanger file is shown below:
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333
Example 2: One 25 nucleotide-long read in FASTQ-sanger format.
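The relationship between error probability, PHRED score and FASTQ-sanger character can be sketched as follows. This assumes the standard definition $Q_{PHRED} = -10 \log_{10}(P_e)$ and the offset of 33 described above; the function names are hypothetical and chosen only for this illustration.

```python
import math
from typing import List

SANGER_OFFSET = 33  # ASCII offset used by FASTQ-sanger


def phred_from_error_probability(p_error: float) -> float:
    """PHRED score for an estimated base-call error probability."""
    return -10.0 * math.log10(p_error)


def error_probability_from_phred(q_phred: float) -> float:
    """Inverse of the PHRED definition."""
    return 10.0 ** (-q_phred / 10.0)


def decode_sanger(quality: str) -> List[int]:
    """Turn a FASTQ-sanger quality string into integer PHRED scores."""
    return [ord(char) - SANGER_OFFSET for char in quality]


def encode_sanger(scores: List[int]) -> str:
    """Turn integer PHRED scores (0 to 93) into a FASTQ-sanger quality string."""
    for score in scores:
        if not 0 <= score <= 93:
            raise ValueError(f"PHRED score out of range for FASTQ-sanger: {score}")
    return "".join(chr(score + SANGER_OFFSET) for score in scores)


# Quality line of Example 2.
scores = decode_sanger(";;;;;;;;;;;9;7;;.7;393333")
print(scores)
print([round(error_probability_from_phred(q), 4) for q in scores[:3]])
```

In Example 2 the character ';' (ASCII 59) decodes to a PHRED score of 26, which corresponds to an error probability of roughly 0.0025.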
FASTQ-solexa was introduced in 2004 by Solexa and was used by the Illumina pipeline in versions prior to 1.3.
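FASTQ-solexa uses an alternative formula to calculate quality scores: $Q_{Solexa} = -10 \log_{10}\left(\frac{P_e}{1 - P_e}\right)$, where $P_e$ is the estimated probability that the base call is wrong.
Solexa scores range from -5 to 62.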
FASTQ-solexa uses an ASCII offset of 64 to be able to represent values lower than 0. The character range for Solexa, as a consequence, is 59 to 126.
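As an illustration of why the offset of 64 is needed, the sketch below computes Solexa scores from error probabilities and decodes a FASTQ-solexa quality string. It assumes the usual Solexa definition $Q_{Solexa} = -10 \log_{10}(P_e / (1 - P_e))$; the function names are hypothetical.

```python
import math
from typing import List

SOLEXA_OFFSET = 64  # ASCII offset used by FASTQ-solexa


def solexa_from_error_probability(p_error: float) -> float:
    """Solexa score for an estimated base-call error probability (0 < p < 1)."""
    return -10.0 * math.log10(p_error / (1.0 - p_error))


def error_probability_from_solexa(q_solexa: float) -> float:
    """Inverse of the Solexa definition."""
    return 1.0 / (1.0 + 10.0 ** (q_solexa / 10.0))


def decode_solexa(quality: str) -> List[int]:
    """Turn a FASTQ-solexa quality string into integer Solexa scores."""
    scores = []
    for char in quality:
        score = ord(char) - SOLEXA_OFFSET
        if not -5 <= score <= 62:
            raise ValueError(f"character outside the FASTQ-solexa range: {char!r}")
        scores.append(score)
    return scores


# An error probability above 0.5 gives a negative Solexa score,
# which is why characters below ASCII 64 can occur in this variant.
print(solexa_from_error_probability(0.75))
print(decode_solexa(";@Z"))
```

The lowest character, ';' (ASCII 59), decodes to -5, and '@' (ASCII 64) decodes to 0, which corresponds to an error probability of 0.5.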
An example of Solexa FASTQ format is shown below. This example is the FASTQ-solexa equivalent of the FASTQ-sanger (Example 2) and FASTQ-illumina (Example 4) examples.
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
ZZZZZZZZZZZXZVZZMVZRXRRRR
Example 3: One 25 nucleotide-long read in FASTQ-solexa format.
Illumina pipeline versions 1.3 up to, but not including, 1.8 use PHRED scores instead of the earlier Solexa scores. The PHRED scores are encoded with an ASCII offset of 64 and therefore range from 0 to 62 (ASCII 64-126). The expected values for raw data are in the range 0-40.
An example of FASTQ-illumina format is shown below. This example is the FASTQ-illumina equivalent of the FASTQ-sanger (Example 2) and FASTQ-solexa (Example 3) examples.
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
ZZZZZZZZZZZXZVZZMVZRXRRRR
Example 4: One 25 nucleotide-long read in FASTQ-illumina format.
Example 4 is identical to Example 3. This happens because $Q_{PHRED}$ and $Q_{Solexa}$ are almost equal for high enough scores. When scores are low, however, differences become apparent.
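The convergence of the two scales can be checked numerically. The sketch below derives the conversion by substituting the error probability from the Solexa definition into the PHRED definition, which gives $Q_{PHRED} = 10 \log_{10}(10^{Q_{Solexa}/10} + 1)$; the function name is hypothetical.

```python
import math


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to the PHRED score with the same error probability."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


# Compare both scales over the Solexa range.
for q_solexa in (-5, 0, 5, 10, 13, 20, 40):
    q_phred = solexa_to_phred(q_solexa)
    print(q_solexa, round(q_phred, 3), round(q_phred))
```

Under this conversion, every Solexa score of 10 or more rounds to the same integer PHRED score, while lower scores diverge (a Solexa score of -5 corresponds to a PHRED score of about 1.2). The lowest score in Examples 3 and 4 is 13, so both quality strings coincide.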
In Illumina versions 1.5 to 1.8 PHRED scores 0 to 2 have a slightly different meaning:
- 0 and 1 are no longer used.
- Value 2, encoded by ASCII 66 "B", is used at the end of a read when the segment is mostly low quality, as a Read Segment Quality Control Indicator. The Illumina documentation states: "At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads. A quality score of 2, encoded as a 'B', is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis."
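A consumer of such files may want to drop the flagged region before further analysis. The following is a minimal sketch under two assumptions: the file is known to be FASTQ-illumina from pipeline versions 1.5 to 1.8, and any trailing run of "B" characters is treated as the indicator. The function name `trim_b_tail` is hypothetical and this is not Illumina's own algorithm.

```python
from typing import Tuple


def trim_b_tail(sequence: str, quality: str) -> Tuple[str, str]:
    """Remove the trailing run of 'B' quality characters and the matching bases."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    kept = len(quality.rstrip("B"))  # number of positions before the 'B' tail
    return sequence[:kept], quality[:kept]


print(trim_b_tail("ACGTACGT", "hhhfBBBB"))
```

Applying this to a FASTQ-sanger file would be wrong, because there "B" simply encodes a PHRED score of 33.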
The FASTQ-sanger, FASTQ-solexa and FASTQ-illumina variants are mutually incompatible. The variant of a given FASTQ file can be guessed with FastQC. Tools that can convert between FASTQ variants include QualityIO from Biopython, SeqIO from BioPerl, the Bio::FastQ::FormatData class from BioRuby and the org.biojava.bio.program.fastq package from BioJava.
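To make the idea of guessing and converting concrete without depending on any of these libraries, here is a rough sketch in plain Python. It is a heuristic based only on the character ranges given in this recipe, not the method used by FastQC or Biopython, and both function names are hypothetical. A file whose characters all lie at or above ASCII 64 cannot be resolved this way and is reported as ambiguous.

```python
from typing import Iterable


def guess_variant(quality_lines: Iterable[str]) -> str:
    """Guess the FASTQ variant from the lowest quality character observed."""
    lowest = min(ord(char) for line in quality_lines for char in line)
    if lowest < 59:
        return "FASTQ-sanger"  # only the offset-33 variant goes this low
    if lowest < 64:
        return "FASTQ-solexa"  # negative Solexa scores, ASCII 59 to 63
    return "FASTQ-illumina or FASTQ-solexa"  # not distinguishable by range


def illumina_to_sanger(quality: str) -> str:
    """Re-encode a FASTQ-illumina quality string (offset 64) as FASTQ-sanger (offset 33)."""
    converted = []
    for char in quality:
        score = ord(char) - 64
        if not 0 <= score <= 62:
            raise ValueError(f"character outside the FASTQ-illumina range: {char!r}")
        converted.append(chr(score + 33))
    return "".join(converted)


print(guess_variant([";;;;;;;;;;;9;7;;.7;393333"]))
print(illumina_to_sanger("ZZZZZZZZZZZXZVZZMVZRXRRRR"))
```

Converting the quality line of Example 4 this way yields the quality line of Example 2. Converting from FASTQ-solexa additionally requires mapping Solexa scores to PHRED scores, which a plain offset shift does not do.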
The different sequencing platforms use specific patterns for the read description that encode valuable information in a standardized way. Some of them are listed below:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index>
@<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read>
@<SampleID-based_identifier> <Original_information> orig_bc=<original_barcode> new_bc=<corrected_barcode> bc_diffs=<0|1>
@<MovieName>/<ZMW_number>
@<MovieName>/<ZMW_number>/<subread-start>_<subread-end>
@VHE-242383071011-15-1-0-2
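As an illustration of how such a pattern can be used, the sketch below splits a header that follows the first pattern listed above (the one starting with `<instrument>:<run number>`) into named fields. The function name, the field names and the sample header are assumptions made for this example; headers following the other patterns would need their own parsers.

```python
from typing import Dict

# Field names mirror the placeholders of the first pattern listed above.
LEFT_FIELDS = ("instrument", "run_number", "flowcell_id", "lane", "tile", "x_pos", "y_pos")
RIGHT_FIELDS = ("read", "is_filtered", "control_number", "index")


def parse_identifier(header: str) -> Dict[str, str]:
    """Split a colon-delimited read header into a dictionary of named fields."""
    if not header.startswith("@"):
        raise ValueError("header must start with '@'")
    left, _, right = header[1:].partition(" ")
    left_values = left.split(":")
    right_values = right.split(":")
    if len(left_values) != len(LEFT_FIELDS) or len(right_values) != len(RIGHT_FIELDS):
        raise ValueError(f"header does not match the expected pattern: {header!r}")
    return dict(zip(LEFT_FIELDS + RIGHT_FIELDS, left_values + right_values))


# Hypothetical header, made up to match the pattern.
fields = parse_identifier("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(fields["flowcell_id"], fields["lane"], fields["index"])
```

All values are kept as strings so that no information is lost; a real parser might convert the numeric fields to integers.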
In FASTQ-sanger, a PHRED value of 14 is encoded with the character '/' (ASCII 47).
The extensions .fastq and .fq are usually used, but no standard file extension exists for this format.
Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010;38(6):1767-1771. doi:10.1093/nar/gkp1137
File Format Guide, NCBI, viewed 20th July 2020.
QualityIO.py documentation, BioPython, viewed 20th July 2020.
Anton Nekrutenko, NGS data logistics, Galaxy Project, viewed 20th July 2020.
FASTQ files explained, Illumina, viewed 20th July 2020.
Illumina Quality Scores, Tobias Mann, Bioinformatics, San Diego, Illumina.
Name | Institute | ORCID | Contributions |
---|---|---|---|
Eva Martin | Barcelona Supercomputing Center (BSC) | 0000-0001-8324-2897 | Writing - Original Draft |
Fuqi Xu | EMBL-EBI | 0000-0002-5923-3859 | Reviewing |