NOTE: This is a working specification, and is subject to modifications and revisions. However, to the extent possible, this specification aims to agree with and conform to the RAD files currently being produced by the tools using this format (the otherwise de facto specification).
Shorthand for the data types of fields used in this spec.
(x1, x2, … , xn) — a tuple of n data types (simply encoded as x1 followed by x2, etc.)
Each of the following data types is identified by the type id listed next to it.
b — boolean (type id 0)
u8 — unsigned 8-bit integer (type id 1)
u16 — unsigned 16-bit integer (type id 2)
u32 — unsigned 32-bit integer (type id 3)
u64 — unsigned 64-bit integer (type id 4)
f32 — 32-bit IEEE floating point number (type id 5)
f64 — 64-bit IEEE floating point number (type id 6)
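(length: type id1, values: [type id2]) — an array whose length is encoded using the type given by type id1 (so the length is at most the maximum value of that type) and whose elements have the type given by type id2. Type id1 must be one of the integer types with type ids 1 to 4. Type id2 may be any type other than another array (i.e. nested arrays are not yet supported). (type id 7)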
string — (u16, [u8]) where the first element is the string length and the array encodes the string’s characters. (type id 8)
u128 — unsigned 128-bit integer (type id 9)
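The type ids above can be turned into a small decoding table. The following is a minimal sketch, not a reference implementation: the function name `read_value` is hypothetical, and little-endian byte order, a one-byte boolean, and UTF-8 string contents are assumptions that this section of the spec does not state.

```python
import struct

# Fixed-width types, keyed by type id. Little-endian is an ASSUMPTION here.
FIXED_FORMATS = {
    0: "<?",  # b (assumed to occupy one byte)
    1: "<B",  # u8
    2: "<H",  # u16
    3: "<I",  # u32
    4: "<Q",  # u64
    5: "<f",  # f32
    6: "<d",  # f64
}


def read_value(buf, offset, type_id, array_types=None):
    """Decode one value at `offset`; return (value, offset after the value).

    For type id 7, `array_types` must be the (type id1, type id2) pair
    describing the length type and the element type.
    """
    if type_id in FIXED_FORMATS:
        fmt = FIXED_FORMATS[type_id]
        (value,) = struct.unpack_from(fmt, buf, offset)
        return value, offset + struct.calcsize(fmt)
    if type_id == 9:  # u128 has no struct code, so decode it by hand
        return int.from_bytes(buf[offset:offset + 16], "little"), offset + 16
    if type_id == 8:  # string: u16 length followed by that many bytes
        (n,) = struct.unpack_from("<H", buf, offset)
        start = offset + 2
        # UTF-8 is an assumption; the spec only says "characters".
        return bytes(buf[start:start + n]).decode("utf-8"), start + n
    if type_id == 7:  # array: length of type id1, then elements of type id2
        if array_types is None:
            raise ValueError("arrays need a (length type, element type) pair")
        len_type, elem_type = array_types
        if len_type not in (1, 2, 3, 4) or elem_type == 7:
            raise ValueError("invalid array description")
        n, offset = read_value(buf, offset, len_type)
        values = []
        for _ in range(n):
            value, offset = read_value(buf, offset, elem_type)
            values.append(value)
        return values, offset
    raise ValueError(f"unknown type id {type_id}")
```

Returning the new offset alongside the value lets a caller walk through consecutive fields without tracking sizes separately.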
A RAD file consists of several sections (described in more detail below). The file starts with a prelude, which consists of a header, and a collection of 3 tag description segments (describing, respectively, the file, read, and alignment level tags that this file will contain).
The prelude is followed by a section providing the values corresponding to the file tag description given in the prelude.
After the file tag value segment comes a series of record chunks. Each chunk consists of a header followed by the number of records specified in that header. Each record holds read-level information (as specified in the read-tag description given in the prelude) and alignment-level information (as specified in the alignment-tag description given in the prelude). In addition to the specified tags, each read record has a single mandatory field, which gives the number of alignment records belonging to that read.
A high-level sketch of this overall structure is provided below:
The RAD format is designed for efficient, parallel binary parsing. Record chunks are independent, and given the metadata provided by the prelude, each chunk can be parsed independently by separate threads.
Currently, the RAD format (with different tag sets, and so providing different information) is used by alevin-fry, piscem and piscem-infer for transferring and recording relevant mapping information.
The header contains the following information
isPaired (b) — Whether the mappings that follow are for paired-end or single-end fragments
refCount (u64) — The number of reference sequences being aligned against
refNames ( (refCount, [string]) ) — An array of refCount strings, each encoded as a (u16, [u8]) pair, giving the names of the target references. The strings do not include a null terminator.
numChunks (u64) — the number of chunks in the file following the header. If this value is 0, then the number of chunks is not known or not recorded.
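The header fields listed above could be read as follows. This is a minimal sketch under stated assumptions: `parse_header` is a hypothetical name, values are assumed to be little-endian, the boolean is assumed to occupy one byte, refCount is assumed to serve as the length of the refNames array (no separate length field), and names are assumed to be UTF-8.

```python
import struct


def parse_header(buf, offset=0):
    """Parse the RAD header at `offset`; return (header dict, next offset)."""
    # isPaired (b): assumed to be stored in a single byte.
    (is_paired,) = struct.unpack_from("<?", buf, offset)
    offset += 1
    # refCount (u64)
    (ref_count,) = struct.unpack_from("<Q", buf, offset)
    offset += 8
    # refNames: refCount strings, each a u16 length followed by the bytes.
    ref_names = []
    for _ in range(ref_count):
        (n,) = struct.unpack_from("<H", buf, offset)
        offset += 2
        ref_names.append(bytes(buf[offset:offset + n]).decode("utf-8"))
        offset += n
    # numChunks (u64): 0 means the count is unknown or was not recorded.
    (num_chunks,) = struct.unpack_from("<Q", buf, offset)
    offset += 8
    header = {
        "is_paired": is_paired,
        "ref_names": ref_names,
        "num_chunks": num_chunks,
    }
    return header, offset
```

Because numChunks may be 0, a reader following this sketch should not rely on it to know when the file ends.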
A tagging system for listing optional properties that are global to the file.
There are three types of tags: file-level, read-level, and alignment-level.
For tags in the latter two categories, all records should have all of these tags, in the same order in which they appear here.
The tag definition section of the file shall contain
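Count of file-level tags (u16) — The number of file-level tags
A list of that many file-level tag definitions
Count of read-level tags (u16) — The number of read-level tags
A list of that many read-level tag definitions
Count of alignment-level tags (u16) — The number of alignment-level tags
A list of that many alignment-level tag definitions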
A tag definition shall consist of:
Tag name (u16, [u8]) and type id (u8) — The first part of the tag definition records the tag name, and the second records the tag type. In the tag name, the leading u16 encodes the length of the name and the following array encodes the name itself. For most tags the type is a single u8 recording the type id (0 to 6, 8, or 9). If the tag is of type 7, the type id is followed by the rest of the array description (type id1 and type id2). Types 0 to 6 and 9 are fixed length, while types 7 and 8 are variable length. Note: all fixed-size global tags should precede all variable-length tags.
Example tags (these are not required or part of the spec, but an example of what some tags might be):
Tag “ReadName” ((8, [‘R’,‘e’,‘a’,‘d’,‘N’,‘a’,‘m’,‘e’], 8)) — This tag is present if the read record records the original name of the read. The name itself will be a “string” type encoded as (u16, [u8]).
Tag “AvgReadQuality” ((14, [‘A’,‘v’,‘g’,‘R’,‘e’,‘a’,‘d’,‘Q’,‘u’,‘a’,‘l’,‘i’,‘t’,‘y’], 5)) — The average quality score of the bases in this read.
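The tag definition section described above could be parsed along these lines. This is a sketch, not the tools' actual implementation: the function names are hypothetical, little-endian encoding is assumed, each count is assumed to be followed directly by its list of definitions, and for type 7 the two extra type ids are assumed to be stored as one u8 each right after the type id.

```python
import struct


def parse_tag_definition(buf, offset):
    """Parse one tag definition; return ((name, type_id, array_types), offset)."""
    (name_len,) = struct.unpack_from("<H", buf, offset)
    offset += 2
    name = bytes(buf[offset:offset + name_len]).decode("utf-8")
    offset += name_len
    (type_id,) = struct.unpack_from("<B", buf, offset)
    offset += 1
    array_types = None
    if type_id == 7:
        # ASSUMPTION: type id1 (length type) and type id2 (element type)
        # follow immediately as two u8 values.
        array_types = struct.unpack_from("<BB", buf, offset)
        offset += 2
    return (name, type_id, array_types), offset


def parse_tag_section(buf, offset):
    """Parse the file-, read-, and alignment-level tag definitions in order."""
    sections = {}
    for level in ("file", "read", "alignment"):
        (count,) = struct.unpack_from("<H", buf, offset)
        offset += 2
        tags = []
        for _ in range(count):
            tag, offset = parse_tag_definition(buf, offset)
            tags.append(tag)
        sections[level] = tags
    return sections, offset
```

Keeping the definitions in file order matters, since read and alignment records store their tag values in the same order as the definitions.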
A list of
After the prelude, and file-level tag values, the file consists of a series of chunks. Each chunk has a chunk-specific header of the following form:
Chunk header ((u32,u32)) — This header is simply a tuple where the first element specifies the number of bytes, Bc, occupied by the chunk, and the second element specifies the number of reads, Rc, whose records are present in the chunk. Note: The Bc field includes the number of bytes occupied by the header itself, so it counts the total number of bytes in all records present in this chunk + 8 (the size of the header encoding Bc and Rc).
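Because Bc includes the 8 header bytes, a reader can hop from one chunk header to the next without decoding any records. The sketch below builds such an index; `index_chunks` is a hypothetical name, little-endian encoding is an assumption, and the chunks are assumed to run to the end of the buffer.

```python
import struct

CHUNK_HEADER_SIZE = 8  # two u32 values: Bc and Rc


def index_chunks(buf, offset):
    """Return a list of (payload_start, chunk_end, num_reads) per chunk.

    `offset` is the position of the first chunk header, i.e. just past the
    prelude and the file-level tag values.
    """
    chunks = []
    while offset < len(buf):
        # Bc counts the header itself plus all record bytes; Rc is the
        # number of read records in the chunk.
        num_bytes, num_reads = struct.unpack_from("<II", buf, offset)
        if num_bytes < CHUNK_HEADER_SIZE or offset + num_bytes > len(buf):
            raise ValueError(f"corrupt chunk header at offset {offset}")
        chunks.append((offset + CHUNK_HEADER_SIZE, offset + num_bytes, num_reads))
        offset += num_bytes
    return chunks
```

Each resulting span is self-contained given the tag descriptions from the prelude, so the spans could be handed to separate workers for parsing.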
Each record (the chunk will have Rc such records comprising Bc bytes) shall consist of:
Alignment count (u32) — number of alignment records, NA, to follow for this read.
Array of specified read-level tags for this read.
For each of the NA alignments of this read: