# FASTQ format specification

[TOC]

## Main Objectives

This recipe describes the legacy and currently used FASTQ variants. The aim is to allow a user to understand the FASTQ format and to be able to differentiate between format variants and convert from one to another.

## User Stories

| As a .. | I want to .. | So that I can .. |
|:-------------:|:------------:|:----------------:|
| Software developer | understand the different FASTQ variants and platform-specific identifiers | develop appropriate parsers |
| Data consumer | determine the format of legacy FASTQ datasets | reuse and integrate them with data from other sources |

## Capability & Maturity Table

| Capability | Initial Maturity Level | Final Maturity Level |
|------------|------------------------|----------------------|
| TBD | TBD | TBD |

## FAIRification Objectives, Inputs and Outputs

## Requirements

This recipe is aimed at anyone interested in the FASTQ format. No specific prior knowledge is needed to understand this document.

## Ingredients

This is a descriptive recipe, so no ingredients are required.

## Introduction

FASTQ is the *de facto* format for sequence data exchange. It offers a simple way to store raw sequences along with the quality scores associated with each base call. Unfortunately, several incompatible FASTQ variants exist, and there is no explicit community agreement on a standard.

## General Specification

### Format description

A [FASTQ file](https://fairsharing.org/FAIRsharing.r2ts5t) describes a collection of sequence reads, sequence quality scores and other information. Each read's description consists of four plain-text lines in the following format:

- 1st: Contains a read identifier and possibly other information. This line must start with the symbol "@".
- 2nd: Nucleotide base calls.
- 3rd: A second defline for extra information. This line must start with the symbol "+"; the rest of the line can be left empty.
- 4th: Per-base quality scores, usually PHRED scores (described below).

An example of a read in FASTQ format is shown below.

```
@SEQ_ID and other information
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```
*Example 1: Read in FASTQ format.*

## FASTQ variants

> :bulb: This recipe uses terminology from the [EDAM ontology](http://edamontology.org/page) to refer to FASTQ variants.

![](https://i.imgur.com/aYMXVsa.jpg)

The following variants fit the [general specification](#general-specification) stated above but differ in the quality score used (*PHRED* or *Solexa* scores) and/or the ASCII mapping (offset of 33 or 64), which makes them incompatible and sometimes hard to distinguish.

The most widely used variant, which can be considered a *de facto* standard, is FASTQ-sanger. The other two variants, FASTQ-solexa and FASTQ-illumina, were introduced by Solexa/Illumina (Solexa was acquired by [Illumina, Inc.](https://www.illumina.com/) in 2006) for use by its sequencers. Illumina, however, has used the Sanger variant since 2011 (Illumina 1.8+).

| Variant | Score type | Score range | $P_e$ range | ASCII range | ASCII offset | Symbol range |
|------------------|----------|---------------|-------------------------------|----------|------|--------|
| Sanger | PHRED | $0$ to $93$ | $1$ to $10^{-9.3}$ | $33–126$ | $33$ | ! to ~ |
| Solexa | *Solexa* | $−5$ to $62$ | $\sim0.75$ to $\sim10^{-6.2}$ | $59–126$ | $64$ | ; to ~ |
| Illumina 1.3-1.8 | PHRED | $0$ to $62$ | $1$ to $10^{-6.2}$ | $64–126$ | $64$ | @ to ~ |

*Table 1: Overview of the FASTQ variants.*

![Comparison of the variants ASCII mappings](https://i.imgur.com/wLvC3AD.png)
*Figure 1: Comparison of the ASCII mappings used by the different variants (modified from https://training.galaxyproject.org/training-material/topics/introduction/tutorials/galaxy-intro-ngs-data-managment/tutorial.html)*

### FASTQ-sanger

FASTQ-sanger was the first FASTQ variant. Jim Mullikin developed it at the Wellcome Trust Sanger Institute. It is currently the most widely used FASTQ variant. It has been adopted by many sequencing data archives, such as the NCBI's [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra), which provides all FASTQ files in FASTQ-sanger regardless of the variant in which they were submitted, and by sequence processing tools such as [SSAHA2](https://www.sanger.ac.uk/tool/ssaha2-0/), [MAQ](http://maq.sourceforge.net/maq-man.shtml), [Velvet](https://github.com/dzerbino/velvet), [BWA](http://bio-bwa.sourceforge.net) and [BowTie](http://bowtie-bio.sourceforge.net/index.shtml).

FASTQ-sanger uses PHRED scores, originally conceived for this variant, and ASCII characters from 33 to 126.

#### PHRED scores

PHRED scores have become the *de facto* standard to represent the quality of sequencing base calls. For a given nucleotide call, a PHRED score ($Q_{PHRED}$) is calculated from the estimated probability of error ($P_e$) as follows:

$$
\begin{aligned}
Q_{PHRED} = -10\cdot \log_{10}(P_e)
\end{aligned}
$$

PHRED scores range from 0 to 93, representing error probabilities from $1.0$ (a certainly wrong base call) to $10^{-9.3}$. In FASTQ-sanger, PHRED scores are encoded with the printable ASCII characters 33-126 (decimal), that is, with an offset of 33.

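
The PHRED formula and the offset of 33 described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names and the constants `SANGER_OFFSET` and `MAX_PHRED` are hypothetical and were chosen for this example only.

```python
import math

SANGER_OFFSET = 33  # ASCII offset used by FASTQ-sanger
MAX_PHRED = 93      # highest score that fits below ASCII 126 ("~")


def error_probability_to_phred(p_error):
    """Apply Q = -10 * log10(P_e) to an estimated error probability."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be in the interval (0, 1]")
    return -10 * math.log10(p_error)


def phred_to_error_probability(q_phred):
    """Invert the PHRED formula: P_e = 10 ** (-Q / 10)."""
    return 10 ** (-q_phred / 10)


def decode_sanger_quality(quality_line):
    """Turn a FASTQ-sanger quality string into integer PHRED scores."""
    scores = [ord(char) - SANGER_OFFSET for char in quality_line]
    if any(score < 0 or score > MAX_PHRED for score in scores):
        raise ValueError("character outside the ASCII range 33-126")
    return scores


def encode_sanger_quality(scores):
    """Turn integer PHRED scores (0-93) into a FASTQ-sanger quality string."""
    if any(score < 0 or score > MAX_PHRED for score in scores):
        raise ValueError("PHRED score outside the range 0-93")
    return "".join(chr(score + SANGER_OFFSET) for score in scores)
```

For instance, `decode_sanger_quality("!''*")` returns `[0, 6, 6, 9]`, the first four scores of the quality line in *Example 1*, and `encode_sanger_quality([0, 6, 6, 9])` gives back `"!''*"`.
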
An example of a short FASTQ-sanger file is shown below:

```
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333
```
*Example 2: One 25 nucleotide-long read in FASTQ-sanger format.*

### FASTQ-solexa

FASTQ-solexa was introduced in 2004 by Solexa and was used by Illumina 1.1 to 1.3.

#### Solexa scores

FASTQ-solexa uses an alternative formula to calculate quality scores $Q_{solexa}$ from each call's estimated probability of error $P_e$:

$$
\begin{aligned}
Q_{solexa} = -10\cdot \log_{10} \left( \frac{P_e}{1-P_e} \right)
\end{aligned}
$$

*Solexa* scores range from -5 to 62. FASTQ-solexa uses an ASCII offset of 64 so that values lower than 0 can still be represented with printable characters. As a consequence, the character range for Solexa is 59 to 126.

An example of Solexa FASTQ format is shown below. This example is the FASTQ-solexa equivalent of the FASTQ-sanger (*Example 2*) and FASTQ-illumina (*Example 4*) examples.

```
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
ZZZZZZZZZZZXZVZZMVZRXRRRR
```
*Example 3: One 25 nucleotide-long read in FASTQ-solexa format.*

### FASTQ-illumina

Illumina versions 1.3 to 1.8 use PHRED scores instead of the previous *Solexa* scores. The PHRED scores are encoded with an ASCII offset of 64, thus ranging from 0 to 62 (ASCII 64-126). The expected values for raw data are in the range 0-40.

An example of FASTQ-illumina format is shown below. This example is the FASTQ-illumina equivalent of the FASTQ-sanger (*Example 2*) and FASTQ-solexa (*Example 3*) examples.

```
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
ZZZZZZZZZZZXZVZZMVZRXRRRR
```
*Example 4: One 25 nucleotide-long read in FASTQ-illumina format.*

> :bulb: *Example 4* is identical to *Example 3*. This happens because $Q_{PHRED}$ and $Q_{solexa}$ are almost equal for high enough scores. When scores are low, however, differences become apparent.

>:warning:
>In Illumina versions 1.5 to 1.8, PHRED scores 0 to 2 have a slightly different meaning:
>- 0 and 1 are no longer used.
>- Value 2, encoded by ASCII 66 "B", is used at the end of a read when the segment is mostly low quality, as a Read Segment Quality Control Indicator.

The [Illumina documentation](https://drive.google.com/file/d/0B-lLYVUOliJFYjlkNjAwZjgtNDg4ZC00MTIyLTljNjgtMmUzN2M0NTUyNDE3/view?hl=en) states the following:

*At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads. A quality score of 2, encoded as a "B", is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis.*

## Variants conversion

The FASTQ-sanger, FASTQ-solexa and FASTQ-illumina variants are incompatible with each other. The variant of a given FASTQ file can be guessed with [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

Tools that can be used to convert between FASTQ variants include [QualityIO](https://github.com/biopython/biopython/blob/master/Bio/SeqIO/QualityIO.py) from Biopython, [SeqIO](https://bioperl.org/howtos/SeqIO_HOWTO.html) from BioPerl, the [Bio::FastQ::FormatData](http://bioruby.org/rdoc/Bio/Fastq/FormatData.html) class from BioRuby and the [org.biojava.bio.program.fastq](https://biojava.org/wiki/BioJava%3ACookbook%3ASeqIO%3AFASTQ) package from BioJava.

## Platform specific considerations

Sequencing platforms use specific patterns for the read description line that encode valuable information in a standardized way.
Some of them are listed below:

- Recent Illumina FASTQ
  ```
  @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index>
  ```
- Older Illumina FASTQ
  ```
  @<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read>
  ```
- [QIIME](http://qiime.org/) de-multiplexed sequences in FASTQ
  ```
  @<SampleID-based_identifier> <Original_information> orig_bc=<original_barcode> new_bc=<corrected_barcode> bc_diffs=<0|1>
  ```
- [PacBio](https://www.sciencedirect.com/science/article/pii/S1672022915001345#b0005) CCS (Circular Consensus Sequence) or RoI (Read of Insert) read
  ```
  @<MovieName>/<ZMW_number>
  ```
- [PacBio](https://www.sciencedirect.com/science/article/pii/S1672022915001345#b0005) CCS subread
  ```
  @<MovieName>/<ZMW_number>/<subread-start>_<subread-end>
  ```
- [Helicos](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2954431/) FASTQ with a fixed ASCII-based PHRED value for quality
  ```
  @VHE-242383071011-15-1-0-2
  ```
  A PHRED value of 14 is encoded with the character '/'.

## File extension

`.fastq` and `.fq` are commonly used, but no standard file extension exists for this format.

## References

Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. *Nucleic Acids Res*. 2010;38(6):1767-1771.
doi:10.1093/nar/gkp1137

[File Format Guide, NCBI, viewed 20th July 2020.](https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#fastq-files)

[QualityIO.py documentation, BioPython, viewed 20th July 2020.](https://github.com/biopython/biopython/blob/master/Bio/SeqIO/QualityIO.py)

[Anton Nekrutenko, NGS data logistics, Galaxy Project, viewed 20th July 2020.](https://training.galaxyproject.org/training-material/topics/introduction/tutorials/galaxy-intro-ngs-data-managment/tutorial.html)

[FASTQ files explained, Illumina, viewed 20th July 2020.](https://emea.support.illumina.com/bulletins/2016/04/fastq-files-explained.html)

[Illumina Quality Scores, Tobias Mann, Bioinformatics, San Diego, Illumina.](https://drive.google.com/file/d/0B-lLYVUOliJFYjlkNjAwZjgtNDg4ZC00MTIyLTljNjgtMmUzN2M0NTUyNDE3/view?hl=en)

## Authors

|Name|Institute|ORCID|Contributions|
|--|--|--|--|
|Eva Martin | [Barcelona Supercomputing Center (BSC)](https://www.bsc.es/) |[0000-0001-8324-2897](https://orcid.org/0000-0001-8324-2897)|Writing - Original Draft |
|Fuqi Xu|[EMBL-EBI](https://www.ebi.ac.uk)|[0000-0002-5923-3859](https://orcid.org/0000-0002-5923-3859)|Reviewing|

## License

<a href="https://creativecommons.org/licenses/by/4.0/"><img src="https://mirrors.creativecommons.org/presskit/buttons/80x15/png/by-sa.png" height="20"/></a>