# Reduced Alignment Data (RAD) format specification

**NOTE**: This is a working specification, and is subject to modifications and revisions. However, to the extent possible, this specification aims to agree with and conform to the RAD files currently being produced by the tools using this format (the otherwise _de facto_ specification).

## Data types ##

Shorthand for the data types of fields used in this spec.

**(x1, x2, ... , xn)** — a tuple of n data types (simply encoded as x1 followed by x2, etc.)

The following data types can be specified by the type ids assigned to them below.

* **b** — boolean (type id 0)
* **u8** — unsigned 8-bit integer (type id 1)
* **u16** — unsigned 16-bit integer (type id 2)
* **u32** — unsigned 32-bit integer (type id 3)
* **u64** — unsigned 64-bit integer (type id 4)
* **f32** — 32-bit IEEE floating point number (type id 5)
* **f64** — 64-bit IEEE floating point number (type id 6)
* **(length: type id1, values: [type id2])** — an array whose length is stored as a value of type id1 (so the length is at most the maximum value that type can hold) and whose elements are of type id2. Type id1 must be one of the types 1 to 4. Type id2 may be anything other than another array (i.e. nested arrays are not yet supported). (type id 7)
* **string** — (**u16**, **[u8]**) where the first element is the string length and the array encodes the string’s characters. (type id 8)
* **u128** — unsigned 128-bit integer (type id 9)

## Overall file layout ##

A RAD file consists of several sections (described in more detail below). The file starts with a **prelude**, which consists of a **header** and a collection of 3 **tag description** segments (describing, respectively, the file-, read-, and alignment-level tags that this file will contain). The prelude is followed by a section providing the **values** corresponding to the file tag description given in the prelude.
After the file tag value segment follows a series of **record chunks**, with each chunk consisting of a header followed by the number of records specified in that header. Each record holds read-level information (as specified in the read-tag description given in the prelude) and alignment-level information (as specified in the alignment-tag description given in the prelude). In addition to the specified tags, each read record has a single, mandatory field, which specifies the number of alignment records belonging to this read. A high-level sketch of this overall structure is provided below:

![Header-crop](https://hackmd.io/_uploads/H1egXBQLA.png)

The RAD format is designed for efficient, _parallel_ binary parsing. Record chunks are independent, and given the metadata provided by the prelude, each chunk can be parsed independently by separate threads.

Currently, the RAD format (with different tag sets, and so providing different information) is used by **alevin-fry**, **piscem** and **piscem-infer** for transferring and recording relevant mapping information.

## Header ##

The header contains the following information:

* isPaired (**b**) — Whether the mappings to follow are for paired-end or single-end fragments.
* refCount (**u64**) — The number of reference sequences being aligned against.
* refNames ( **(refCount, [string])** ) — An array of refCount strings encoding the names of the target references. The null terminator is not included in the string.
* numChunks (**u64**) — The number of chunks in the file following the header. If this value is 0, then the number of chunks is not known or not recorded.

## Tag definitions ##

A tagging system for listing optional properties that are global to the file. There are three types of tags:

* file-level tags : exist once in the file, and provide information about the file as a whole, or metadata about how to interpret other tags.
* read-level tags : exist once per read; all alignments for a read share the same read-level tags.
* alignment-level tags : exist once per alignment (akin to tags in a SAM/BAM file).

For tags in the latter 2 categories, all records should have all of these tags, in the same order in which they appear here.

The tag definition section of the file shall contain:

* Count of file-level tags (**u16**) — The number $nt_f$ of file-level tags that will appear.
* A list of $nt_f$ tag descriptions that specify the types of the file-level tags.
* Count of read-level tags (**u16**) — The number $nt_r$ of tags that will be used for the read records in this file.
* A list of $nt_r$ tag descriptions that specify the types of the read-level tags.
* Count of alignment-level tags (**u16**) — The number $nt_a$ of tags that will be used for the alignment records in this file.
* A list of $nt_a$ tag descriptions that specify the types of the alignment-level tags.

### Tags definitions ###

A tag definition shall consist of:

Tag name **(u16, [u8])** and type id (**u8**) — The first part of the tag info records the tag name, and the second records the tag type. In the tag name, the first u16 encodes the length of the name, and the following array encodes the name itself. If the tag is of type 7, then the description of the rest of the tag follows (type id1 and type id2). Otherwise the tag type is simply a single **u8** recording the type id (0—6, 8, or 9). Types 0—6 and 9 are fixed length, while types 7 and 8 are variable length.

Note: all fixed-size global tags should precede all variable-length tags.

**Example** tags (these are not required or part of the spec, but an example of what some tags might be):

Tag “ReadName” (**(8, ['R','e','a','d','N','a','m','e'], 8)**) — This tag is present if the read record records the original name of the read. The name itself will be a “string” type encoded as **(u16, [u8])**.
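
The encoding of a tag description such as the “ReadName” example can be sketched as follows. This is only an illustration under stated assumptions: little-endian byte order and single-byte (**u8**) encoding of type id1 and type id2 for arrays are assumed here rather than stated in this spec, and the function names `encode_tag_description` and `decode_tag_description` are hypothetical helpers, not part of any existing tool.

```python
import struct

# Type ids as listed in the "Data types" section of this spec.
TYPE_IDS = {"b": 0, "u8": 1, "u16": 2, "u32": 3, "u64": 4,
            "f32": 5, "f64": 6, "array": 7, "string": 8, "u128": 9}


def encode_tag_description(name, type_name, array_types=None):
    """Encode a tag name as (u16, [u8]) followed by its type id (u8).

    For arrays (type id 7), two further u8 values give the length type
    and the element type. Byte order (little-endian) is an assumption.
    """
    raw = name.encode("ascii")
    out = struct.pack("<H", len(raw)) + raw
    out += struct.pack("<B", TYPE_IDS[type_name])
    if type_name == "array":
        length_type, elem_type = array_types
        out += struct.pack("<BB", TYPE_IDS[length_type], TYPE_IDS[elem_type])
    return out


def decode_tag_description(buf, offset=0):
    """Decode one tag description; return (name, type_id, array_types, new_offset)."""
    (name_len,) = struct.unpack_from("<H", buf, offset)
    offset += 2
    name = bytes(buf[offset:offset + name_len]).decode("ascii")
    offset += name_len
    (type_id,) = struct.unpack_from("<B", buf, offset)
    offset += 1
    array_types = None
    if type_id == 7:
        # Arrays carry the length type id and the element type id.
        array_types = struct.unpack_from("<BB", buf, offset)
        offset += 2
    return name, type_id, array_types, offset
```

Under these assumptions, `encode_tag_description("ReadName", "string")` yields 11 bytes: a 2-byte length, the 8 name bytes, and the type id 8.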
Tag “AvgReadQuality” (**(14, ['A','v','g','R','e','a','d','Q','u','a','l','i','t','y'], 5)**) — The average quality score of the bases in this read.

### File-level tag values

A list of $nt_f$ tag **values**, corresponding one-to-one to the file-level tags declared in the previous section. The number of tags, and their order and types, must match what is provided in the file-tag description section in the prelude.

### Chunks

After the prelude and the file-level tag values, the file consists of a series of chunks. Each chunk has a chunk-specific header of the following form:

Chunk header (**(u32,u32)**) — This header is simply a tuple where the first element specifies the number of bytes, B<sub>c</sub>, occupied by the chunk, and the second element specifies the number of reads, R<sub>c</sub>, whose records are present in the chunk.

_Note_: The B<sub>c</sub> field _includes_ the number of bytes occupied by the header itself, so it counts the total number of bytes in all records present in this chunk + 8 (the size of the header encoding B<sub>c</sub> and R<sub>c</sub>).

Each record (the chunk will have R<sub>c</sub> such records which, together with the 8-byte chunk header, occupy B<sub>c</sub> bytes) shall consist of:

1. Alignment count (**u32**) — number of alignment records, N<sub>A</sub>, to follow for this read.
2. Array of specified read-level tags for this read.

For each of the N<sub>A</sub> alignments of this read:

1. The collection (in the appropriate order) of the alignment-level tags specified in the alignment-tag description of the prelude.
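
The chunk and record layout described above can be sketched as a small parser. This is a minimal illustration and not the implementation used by any of the tools named in this document. It assumes little-endian byte order (not stated in this spec), takes the read-level and alignment-level tag types as plain lists of type ids already decoded from the prelude, and handles only type ids 0 to 6 and 8 (arrays and **u128** are omitted for brevity). The names `read_values` and `parse_chunk` are hypothetical.

```python
import struct

# struct format characters for the fixed-width type ids 0-6 of this spec.
FIXED_FORMATS = {0: "?", 1: "B", 2: "H", 3: "I", 4: "Q", 5: "f", 6: "d"}


def read_values(buf, offset, type_ids):
    """Read one value per type id; return (values, new_offset)."""
    values = []
    for tid in type_ids:
        if tid == 8:
            # string: u16 length followed by that many bytes
            (n,) = struct.unpack_from("<H", buf, offset)
            offset += 2
            values.append(bytes(buf[offset:offset + n]).decode("utf-8"))
            offset += n
        else:
            # Fixed-width value; unsupported ids (7, 9) raise KeyError here.
            fmt = "<" + FIXED_FORMATS[tid]
            (value,) = struct.unpack_from(fmt, buf, offset)
            values.append(value)
            offset += struct.calcsize(fmt)
    return values, offset


def parse_chunk(buf, offset, read_tag_types, aln_tag_types):
    """Parse one chunk starting at offset; return (records, new_offset).

    Each record is a pair (read_tags, alignments), where alignments is a
    list holding one list of alignment-level tag values per alignment.
    """
    start = offset
    # Chunk header: B_c (bytes, header included) and R_c (number of reads).
    num_bytes, num_reads = struct.unpack_from("<II", buf, offset)
    offset += 8
    records = []
    for _ in range(num_reads):
        (num_alns,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        read_tags, offset = read_values(buf, offset, read_tag_types)
        alignments = []
        for _ in range(num_alns):
            aln_tags, offset = read_values(buf, offset, aln_tag_types)
            alignments.append(aln_tags)
        records.append((read_tags, alignments))
    # B_c counts the 8 header bytes, so it should equal the bytes consumed.
    if offset - start != num_bytes:
        raise ValueError("chunk size does not match its header")
    return records, offset
```

Because B<sub>c</sub> includes the header, a reader that only needs to locate chunk boundaries (for example, to hand chunks to separate threads) can skip from one chunk header to the next by advancing B<sub>c</sub> bytes without decoding any records.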